Cluster — Data Clustering Analysis Tool

Cluster is a command-line data clustering analysis tool that supports k-means and hierarchical clustering algorithms. It reads numerical data from CSV/JSONL sources, performs clustering, evaluates cluster quality, and exports results.

Data is stored in ~/.cluster/data.jsonl as JSONL records. Each record represents a clustering run with its parameters, assignments, centroids, and evaluation metrics.

Prerequisites

- Python 3.8+ with standard library (no external packages required for basic operations)
- bash shell

Commands

`run`

Run a clustering algorithm on input data.

Environment Variables:

- INPUT (required) — Path to input CSV/JSONL file with numerical data
- K — Number of clusters (default: 3)
- ALGORITHM — Algorithm to use: kmeans or hierarchical (default: kmeans)
- MAX_ITER — Maximum iterations for k-means (default: 100)
- SEED — Random seed for reproducibility

Example:
```bash
INPUT=/path/to/data.csv K=5 ALGORITHM=kmeans bash scripts/script.sh run
```
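To make the roles of K, MAX_ITER, and SEED concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is an illustration only and not the tool's actual implementation: the function name `kmeans`, its signature, and the choice of Euclidean distance are assumptions.

```python
import math
import random


def kmeans(points, k=3, max_iter=100, seed=None):
    """Illustrative Lloyd's algorithm (hypothetical helper, not the tool's code)."""
    rng = random.Random(seed)            # a fixed seed makes the run reproducible
    centroids = rng.sample(points, k)    # start from k distinct input rows
    assignments = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break                        # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                  # an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```

Calling `kmeans(points, k=2, seed=1)` twice on the same data returns the same centroids and assignments, which is the behavior SEED is meant to provide.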

`assign`

Assign new data points to existing clusters from a previous run.

Environment Variables:

- RUN_ID (required) — ID of the clustering run to use
- INPUT (required) — Path to new data points (CSV/JSONL)

Example:
```bash
RUN_ID=abc123 INPUT=/path/to/newdata.csv bash scripts/script.sh assign
```
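Conceptually, assigning new points means finding the nearest stored centroid for each one. The sketch below shows that idea in Python. The helper name `assign_points` is hypothetical, and Euclidean distance is an assumption, since this document does not state which metric the tool uses.

```python
import math


def assign_points(centroids, new_points):
    """Return, for each new point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in new_points
    ]
```

For example, `assign_points([[0.0, 0.5], [10.0, 10.5]], [[1.0, 1.0], [9.0, 9.0]])` returns `[0, 1]`.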

`centroids`

Display or export centroid coordinates for a clustering run.

Environment Variables:

- RUN_ID (required) — ID of the clustering run
- FORMAT — Output format: table, json, csv (default: table)
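As a rough illustration of the three FORMAT values, the following Python sketch renders a list of centroids as text. The function name `format_centroids` and the exact table layout are assumptions made for this example; the tool's real output may look different.

```python
import csv
import io
import json


def format_centroids(centroids, fmt="table"):
    """Render centroid coordinates as 'table', 'json', or 'csv' text."""
    if fmt == "json":
        return json.dumps(centroids)
    if fmt == "csv":
        buf = io.StringIO()
        # One row per centroid, one column per dimension.
        csv.writer(buf, lineterminator="\n").writerows(centroids)
        return buf.getvalue()
    if fmt == "table":
        lines = []
        for i, coords in enumerate(centroids):
            cells = "  ".join(f"{v:>10.4f}" for v in coords)
            lines.append(f"{i:>7}  {cells}")   # cluster index, then coordinates
        return "\n".join(lines)
    raise ValueError(f"unknown format: {fmt}")
```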

`evaluate`

Evaluate clustering quality with silhouette score, inertia, and Davies-Bouldin index.

Environment Variables:

- RUN_ID (required) — ID of the clustering run to evaluate
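The three metrics have standard textbook definitions, sketched below in pure Python for reference. This is not the tool's own code: the function names are hypothetical, Euclidean distance is assumed, and cluster IDs are assumed to be list indices into `centroids`.

```python
import math


def inertia(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


def silhouette(points, assignments):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, label in zip(points, assignments):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, label in zip(points, assignments):
        own = clusters[label]
        if len(own) == 1:
            continue
        # a_i: mean distance to the other members of the point's own cluster
        # (the distance to itself is 0, so divide by len - 1).
        a_i = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a_i, b_i)
        if denom > 0:                    # identical points would give 0 / 0
            total += (b_i - a_i) / denom
    return total / len(points)


def davies_bouldin(points, assignments, centroids):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation_ij."""
    labels = sorted(set(assignments))
    scatter = {}
    for c in labels:
        members = [p for p, a in zip(points, assignments) if a == c]
        scatter[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)
    return sum(
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in labels if j != i
        )
        for i in labels
    ) / len(labels)
```

Lower inertia and a lower Davies-Bouldin index indicate tighter, better-separated clusters, while a silhouette score closer to 1 is better.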

`visualize`

Generate a text-based or ASCII visualization of cluster assignments.

Environment Variables:

- RUN_ID (required) — ID of the clustering run
- DIMS — Dimensions to plot, comma-separated (default: first two)
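One simple way to produce such a plot is to scale two chosen dimensions onto a character grid and print each point's cluster ID. The Python sketch below shows the idea. The function name, the grid size, and the use of a single digit per cluster are assumptions for illustration, not a description of the tool's actual rendering.

```python
def ascii_scatter(points, assignments, dims=(0, 1), width=40, height=12):
    """Render a character grid where each plotted cell shows a cluster ID digit."""
    xs = [p[dims[0]] for p in points]
    ys = [p[dims[1]] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0    # avoid division by zero for flat data
    y_span = (max(ys) - y_min) or 1.0
    grid = [[" "] * width for _ in range(height)]
    for x, y, label in zip(xs, ys, assignments):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        # Flip the row so larger y values appear higher on screen.
        grid[height - 1 - row][col] = str(label % 10)
    return "\n".join("".join(line) for line in grid)
```

Points that land in the same cell overwrite each other, so a small grid only gives a coarse picture of the clusters.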

`export`

Export clustering results to a file.

Environment Variables:

- RUN_ID (required) — ID of the run to export
- OUTPUT — Output file path (default: stdout)
- FORMAT — Export format: json, csv, jsonl (default: json)

`import`

Import a previously exported clustering run.

Environment Variables:

- INPUT (required) — Path to the file to import

`config`

View or update configuration settings.

Environment Variables:

- KEY — Configuration key to set
- VALUE — Configuration value

`list`

List all stored clustering runs with summary info.

Environment Variables:

- LIMIT — Maximum runs to display (default: 20)
- SORT — Sort field: date, k, score (default: date)
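The effect of LIMIT and SORT can be pictured as a sort followed by a slice over the stored run records. The Python sketch below rests on several assumptions that this document does not confirm: results are shown in descending order, `score` refers to the silhouette value inside `metrics`, and `date` maps to the `timestamp` field. The function name `list_runs` is hypothetical.

```python
def list_runs(runs, limit=20, sort="date"):
    """Return up to `limit` run records ordered by the chosen field."""
    keys = {
        # ISO 8601 strings in one consistent format sort chronologically as text.
        "date": lambda r: r["timestamp"],
        "k": lambda r: r["k"],
        # Assumption: "score" means the silhouette metric.
        "score": lambda r: r["metrics"]["silhouette"],
    }
    if sort not in keys:
        raise ValueError(f"unknown sort field: {sort}")
    return sorted(runs, key=keys[sort], reverse=True)[:limit]
```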

`stats`

Show aggregate statistics across all clustering runs.

`help`

Display usage information and available commands.

`version`

Display the current version of the cluster tool.

Data Storage

All clustering runs are stored in ~/.cluster/data.jsonl. Each line is a JSON object with fields:

- id — Unique run identifier
- timestamp — ISO 8601 creation time
- algorithm — Algorithm used
- k — Number of clusters
- centroids — List of centroid coordinates
- assignments — Mapping of data point indices to cluster IDs
- metrics — Evaluation metrics (silhouette, inertia, etc.)
- input_file — Source data file path
- num_points — Number of data points clustered
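Because the store is plain JSONL, it can be read with the Python standard library alone. The following sketch loads every record and looks one up by `id`. The helper names `load_runs` and `find_run` are hypothetical, and the code assumes each non-empty line holds exactly one JSON object.

```python
import json
from pathlib import Path

DATA_FILE = Path.home() / ".cluster" / "data.jsonl"


def load_runs(path=DATA_FILE):
    """Parse one JSON object per non-empty line; a missing file yields []."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def find_run(runs, run_id):
    """Return the record whose id equals run_id, or None if there is none."""
    return next((r for r in runs if r.get("id") == run_id), None)
```

A call such as `find_run(load_runs(), "abc123")` would return the matching record, or `None` when no run has that ID.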

Configuration

Config is stored in ~/.cluster/config.json. Available keys:

- default_k — Default number of clusters (default: 3)
- default_algorithm — Default algorithm (default: kmeans)
- max_iterations — Default max iterations (default: 100)
- random_seed — Default random seed (default: 42)
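A common way to handle a file like this is to overlay whatever is stored on disk onto the built-in defaults. The Python sketch below shows that pattern. The function name `load_config` is hypothetical, the key spellings `default_algorithm`, `max_iterations`, and `random_seed` are assumed, and the file is assumed to hold a single flat JSON object.

```python
import json
from pathlib import Path

# Defaults as documented; the underscore spellings of the keys are assumed.
DEFAULTS = {
    "default_k": 3,
    "default_algorithm": "kmeans",
    "max_iterations": 100,
    "random_seed": 42,
}


def load_config(path=Path.home() / ".cluster" / "config.json"):
    """Overlay values stored on disk onto the documented defaults."""
    config = dict(DEFAULTS)
    path = Path(path)
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config
```

With this approach a config file that sets only `default_k` still yields a complete configuration, because the remaining keys fall back to their defaults.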

Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
