Cluster — Data Clustering Analysis Tool
Cluster is a command-line data clustering analysis tool that supports k-means and hierarchical clustering algorithms. It reads numerical data from CSV/JSONL sources, performs clustering, evaluates cluster quality, and exports results.
Data is stored in ~/.cluster/data.jsonl as JSONL records. Each record represents a clustering run with its parameters, assignments, centroids, and evaluation metrics.
Prerequisites
- Python 3.8+ with the standard library (no external packages required for basic operations)
- bash shell
Commands
run
Run a clustering algorithm on input data.
Environment Variables:
- INPUT (required): Path to the input CSV/JSONL file with numerical data
- K: Number of clusters (default: 3)
- ALGORITHM: Algorithm to use, kmeans or hierarchical (default: kmeans)
- MAX_ITER: Maximum iterations for k-means (default: 100)
- SEED: Random seed for reproducibility
Example:
```bash
INPUT=/path/to/data.csv K=5 ALGORITHM=kmeans bash scripts/script.sh run
```
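To make the roles of K, MAX_ITER, and SEED concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is an illustration only, not the tool's actual implementation: the function name `kmeans`, the random-sample initialisation, and the handling of empty clusters are all assumptions.

```python
import random


def kmeans(points, k=3, max_iter=100, seed=None):
    """Cluster equal-length numeric tuples with Lloyd's algorithm (illustrative)."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    # Assumed initialisation: pick k distinct input points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the run has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments
```

Calling `kmeans(points, k=5, max_iter=100, seed=42)` twice on the same data returns the same centroids and assignments, which is the reproducibility that SEED is meant to provide.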
assign
Assign new data points to existing clusters from a previous run.
Environment Variables:
- RUN_ID (required): ID of the clustering run to use
- INPUT (required): Path to new data points (CSV/JSONL)
Example:
```bash
RUN_ID=abc123 INPUT=/path/to/newdata.csv bash scripts/script.sh assign
```
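Assigning new points to existing clusters usually means finding the nearest stored centroid for each point. The sketch below shows that idea with squared Euclidean distance; the function name `assign_points` and the distance metric are assumptions, since the document does not specify how the tool measures distance.

```python
def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid (illustrative)."""

    def sq_dist(p, q):
        # Squared Euclidean distance; the square root is not needed for ranking.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


# Hypothetical centroids from an earlier run and two new data points.
centroids = [[0.0, 0.0], [10.0, 10.0]]
new_points = [[1.0, 2.0], [9.0, 8.5]]
labels = assign_points(new_points, centroids)
```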
centroids
Display or export centroid coordinates for a clustering run.
Environment Variables:
- RUN_ID (required): ID of the clustering run
- FORMAT: Output format, one of table, json, csv (default: table)
evaluate
Evaluate clustering quality with silhouette score, inertia, and Davies-Bouldin index.
Environment Variables:
- RUN_ID (required): ID of the clustering run to evaluate
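The three metrics named above have standard textbook definitions, sketched here in pure Python. This is not the tool's code: the function names are hypothetical, Euclidean distance is assumed, and the functions expect at least two clusters with distinct centroids.

```python
import math


def _dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def inertia(points, assignments, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(_dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))


def silhouette(points, assignments):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, a in zip(points, assignments):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, a in zip(points, assignments):
        own = clusters[a]
        if len(own) == 1:
            continue  # contributes 0 by convention
        # a_i: mean distance to the other members of the point's own cluster.
        a_i = sum(_dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(
            sum(_dist(p, q) for q in other) / len(other)
            for label, other in clusters.items()
            if label != a
        )
        denom = max(a_i, b_i)
        total += (b_i - a_i) / denom if denom else 0.0
    return total / len(points)


def davies_bouldin(points, assignments, centroids):
    """Mean over clusters of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    labels = sorted(set(assignments))
    scatter = {}  # mean distance from a cluster's members to its centroid
    for c in labels:
        members = [p for p, a in zip(points, assignments) if a == c]
        scatter[c] = sum(_dist(p, centroids[c]) for p in members) / len(members)
    return sum(
        max(
            (scatter[i] + scatter[j]) / _dist(centroids[i], centroids[j])
            for j in labels
            if j != i
        )
        for i in labels
    ) / len(labels)
```

Higher silhouette values (closer to 1) and lower inertia and Davies-Bouldin values indicate tighter, better-separated clusters.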
visualize
Generate a text-based (ASCII) visualization of cluster assignments.
Environment Variables:
- RUN_ID (required): ID of the clustering run
- DIMS: Dimensions to plot, comma-separated (default: first two)
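One plausible way to draw such a plot is to scale two chosen dimensions onto a character grid and mark each point with a symbol for its cluster. The sketch below does that; the function name `ascii_scatter`, the grid size, and the symbol set are assumptions, and the tool's real output may look different.

```python
def ascii_scatter(points, assignments, dims=(0, 1), width=40, height=12):
    """Render two dimensions as a character grid, one symbol per cluster."""
    symbols = "0123456789abcdefghijklmnopqrstuvwxyz"
    xs = [p[dims[0]] for p in points]
    ys = [p[dims[1]] for p in points]
    # Fall back to a span of 1.0 when all values on an axis are identical.
    x_min, x_span = min(xs), (max(xs) - min(xs)) or 1.0
    y_min, y_span = min(ys), (max(ys) - min(ys)) or 1.0
    grid = [[" "] * width for _ in range(height)]
    for x, y, cluster in zip(xs, ys, assignments):
        col = int((x - x_min) / x_span * (width - 1))
        # Row 0 is the top line, so flip the y axis to make it grow upward.
        row = height - 1 - int((y - y_min) / y_span * (height - 1))
        grid[row][col] = symbols[cluster % len(symbols)]
    return "\n".join("".join(line) for line in grid)
```

Passing `dims=(0, 2)` would plot the first and third columns, mirroring a setting such as DIMS=0,2 (assuming zero-based indices, which the document does not state).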
export
Export clustering results to a file.
Environment Variables:
- RUN_ID (required): ID of the run to export
- OUTPUT: Output file path (default: stdout)
- FORMAT: Export format, one of json, csv, jsonl (default: json)
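The document does not describe the layout of each export format, so the following is only a guess at how one run record could be serialised three ways. The function name `export_run`, the CSV column names, and the choice to export assignments (rather than centroids) as CSV are all assumptions.

```python
import csv
import io
import json


def export_run(run, fmt="json"):
    """Serialise one run record to a string in the requested format (illustrative)."""
    if fmt == "json":
        return json.dumps(run, indent=2)
    if fmt == "jsonl":
        # One compact JSON object per line, matching the storage format.
        return json.dumps(run) + "\n"
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["point_index", "cluster"])  # hypothetical header
        # JSON object keys are strings, so sort them numerically.
        for idx, cluster in sorted(
            run["assignments"].items(), key=lambda kv: int(kv[0])
        ):
            writer.writerow([idx, cluster])
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")
```

Writing the returned string to a file path or to standard output would correspond to setting or omitting OUTPUT.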
import
Import a previously exported clustering run.
Environment Variables:
- INPUT (required): Path to the file to import
config
View or update configuration settings.
Environment Variables:
- KEY: Configuration key to set
- VALUE: Configuration value
list
List all stored clustering runs with summary info.
Environment Variables:
- LIMIT: Maximum runs to display (default: 20)
- SORT: Sort field, one of date, k, score (default: date)
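As a rough illustration of how LIMIT and SORT could interact, the sketch below sorts run records and keeps the first few. Several details are assumptions: the function name `list_runs`, descending order, and the reading of "score" as a `silhouette` entry inside the run's metrics.

```python
def list_runs(runs, limit=20, sort="date"):
    """Return up to `limit` run records, highest sort value first (illustrative)."""
    sort_keys = {
        # ISO 8601 timestamps in one consistent format sort correctly as strings.
        "date": lambda r: r["timestamp"],
        "k": lambda r: r["k"],
        # Runs without a score sort last.
        "score": lambda r: r.get("metrics", {}).get("silhouette", float("-inf")),
    }
    if sort not in sort_keys:
        raise ValueError(f"unknown sort field: {sort}")
    return sorted(runs, key=sort_keys[sort], reverse=True)[:limit]
```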
stats
Show aggregate statistics across all clustering runs.
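The document does not say which aggregates are reported. As one possible reading, the sketch below counts runs per algorithm and averages k and the silhouette score; the function name `summarize_runs`, the output keys, and the `silhouette` metric key are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize_runs(runs):
    """Aggregate a list of run records into a few headline numbers (illustrative)."""
    if not runs:
        return {"total_runs": 0}
    # Only runs that have been evaluated carry a silhouette score.
    scores = [
        r["metrics"]["silhouette"]
        for r in runs
        if "silhouette" in r.get("metrics", {})
    ]
    return {
        "total_runs": len(runs),
        "runs_per_algorithm": dict(Counter(r["algorithm"] for r in runs)),
        "mean_k": mean(r["k"] for r in runs),
        "mean_silhouette": mean(scores) if scores else None,
    }
```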
help
Display usage information and available commands.
version
Display the current version of the cluster tool.
Data Storage
All clustering runs are stored in ~/.cluster/data.jsonl. Each line is a JSON object with fields:
- id: Unique run identifier
- timestamp: ISO 8601 creation time
- algorithm: Algorithm used
- k: Number of clusters
- centroids: List of centroid coordinates
- assignments: Mapping of data point indices to cluster IDs
- metrics: Evaluation metrics (silhouette, inertia, etc.)
- input_file: Source data file path
- num_points: Number of data points clustered
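Because the store is plain JSONL, it can be inspected with a few lines of standard-library Python. The helper names `parse_jsonl` and `load_runs` below are hypothetical, and skipping blank lines is an assumption about how tolerant a reader should be.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of text lines into a list of run records."""
    runs = []
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            runs.append(json.loads(line))
    return runs


def load_runs(path=None):
    """Load all stored runs; return an empty list if the file does not exist."""
    if path is None:
        path = Path.home() / ".cluster" / "data.jsonl"
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return parse_jsonl(handle)
```

For example, `[r["id"] for r in load_runs() if r["algorithm"] == "kmeans"]` would list the IDs of all stored k-means runs.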
Configuration
Config is stored in ~/.cluster/config.json. Available keys:
- default_k: Default number of clusters (default: 3)
- default_algorithm: Default algorithm (default: kmeans)
- max_iterations: Default max iterations (default: 100)
- random_seed: Default random seed (default: 42)
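A common pattern for this kind of config file is to overlay the stored values on top of the documented defaults. The sketch below assumes that behaviour; the function name `load_config` is hypothetical, and the key spellings follow the list above.

```python
import json
from pathlib import Path

# Defaults as documented above.
DEFAULTS = {
    "default_k": 3,
    "default_algorithm": "kmeans",
    "max_iterations": 100,
    "random_seed": 42,
}


def load_config(path=None):
    """Return the defaults overlaid with any values stored in the config file."""
    if path is None:
        path = Path.home() / ".cluster" / "config.json"
    path = Path(path)
    config = dict(DEFAULTS)  # copy so the module-level defaults stay untouched
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config
```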
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com