ClickHouse 🏠
Real-time analytics on billions of rows. Sub-second queries. No indexes needed.
Setup
On first use, read setup.md for connection configuration.
When to Use
User needs OLAP analytics, log analysis, time-series data, or real-time dashboards. Agent handles schema design, query optimization, data ingestion, and cluster administration.
Architecture
Memory lives in ~/clickhouse/. See memory-template.md for structure.
CODEBLOCK0
Quick Reference
| Topic | File |
|---|
| Setup & connection | INLINECODE3 |
| Memory template |
memory-template.md |
| Query patterns |
queries.md |
| Performance tuning |
performance.md |
| Data ingestion |
ingestion.md |
Core Rules
1. Always Specify Engine
Every table needs an explicit engine. Default to MergeTree family:
CODEBLOCK1
2. ORDER BY is Your Index
ClickHouse has no traditional indexes. The
ORDER BY clause determines data layout:
- - Put high-cardinality filter columns first
- Put range columns (dates, timestamps) early
- Match your most common WHERE patterns
CODEBLOCK2
3. Use Appropriate Data Types
| Use Case | Type | Why |
|---|
| Timestamps | INLINECODE9 or INLINECODE10 | Native time functions |
| Low-cardinality strings |
LowCardinality(String) | 10x compression |
| Enums with few values |
Enum8 or
Enum16 | Smallest footprint |
| Nullable only if needed |
Nullable(T) | Adds overhead |
| IPs |
IPv4 or
IPv6 | 4 bytes vs 16+ |
4. Batch Inserts
Never insert row-by-row. ClickHouse is optimized for batch writes:
CODEBLOCK3
Minimum batch: 1,000 rows. Optimal: 10,000-100,000 rows.
5. Prewarm Queries with FINAL
Queries on ReplacingMergeTree/CollapsingMergeTree need
FINAL for accuracy:
CODEBLOCK4
INLINECODE18 has performance cost. For dashboards, consider materialized views.
6. Materialized Views for Speed
Pre-aggregate expensive computations:
CODEBLOCK5
7. Check System Tables First
Before debugging, check system tables:
CODEBLOCK6
Common Traps
- - String instead of LowCardinality → 10x larger storage for status/type columns
- Wrong ORDER BY → Full table scans instead of index lookups
- Row-by-row inserts → Massive part fragmentation, slow writes
- Missing TTL → Unbounded table growth, disk full
- SELECT * → Reads all columns, kills columnar advantage
- Nullable everywhere → Overhead + NULL handling complexity
- Forgetting FINAL → Stale/duplicate data in merge tables
Performance Checklist
Before running expensive queries:
- 1. Check EXPLAIN:
EXPLAIN SELECT ... shows execution plan - Sample first:
SELECT ... FROM table SAMPLE 0.01 for 1% sample - Limit columns: Only SELECT what you need
- Use PREWHERE: Filters before reading all columns
- Check parts: INLINECODE21
CODEBLOCK7
Cluster Administration
Adding TTL for Data Retention
CODEBLOCK8
Monitoring Disk Usage
CODEBLOCK9
External Endpoints
| Endpoint | Data Sent | Purpose |
|---|
| localhost:8123 | SQL queries | HTTP interface |
| localhost:9000 |
SQL queries | Native TCP interface |
No external services contacted. All queries run against user-specified ClickHouse instances.
Security & Privacy
Data saved locally (with user consent):
- - Connection profiles (host, port, database) in ~/clickhouse/memory.md
- Query patterns and schema documentation
- Authentication method preferences (password vs certificate)
Important: If you provide database passwords, they are stored in plain text in ~/clickhouse/. Consider using environment variables or connection profiles managed by clickhouse-client instead.
This skill does NOT:
- - Connect to any ClickHouse without explicit user configuration
- Send data to external services
- Automatically collect or store credentials without asking
Related Skills
Install with
clawhub install <slug> if user confirms:
- -
sql — SQL query patterns - INLINECODE24 — data analysis workflows
- INLINECODE25 — structured data exploration
Feedback
- - If useful: INLINECODE26
- Stay updated: INLINECODE27
ClickHouse 🏠
对数十亿行数据进行实时分析。亚秒级查询。无需索引。
设置
首次使用时,请阅读 setup.md 了解连接配置。
使用场景
用户需要OLAP分析、日志分析、时序数据或实时仪表盘。智能体负责处理模式设计、查询优化、数据摄入和集群管理。
架构
内存文件位于 ~/clickhouse/。结构请参考 memory-template.md。
~/clickhouse/
├── memory.md # 连接配置 + 查询模式
├── schemas/ # 各数据库的表定义
└── queries/ # 保存的分析查询
快速参考
memory-template.md |
| 查询模式 | queries.md |
| 性能调优 | performance.md |
| 数据摄入 | ingestion.md |
核心规则
1. 始终指定引擎
每个表都需要显式指定引擎。默认使用MergeTree系列:
sql
-- 时序数据/日志
CREATE TABLE events (
timestamp DateTime,
event_type String,
data String
) ENGINE = MergeTree()
ORDER BY (timestamp, event_type);
-- 聚合指标
CREATE TABLE daily_stats (
date Date,
metric String,
value AggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY (date, metric);
2. ORDER BY 就是你的索引
ClickHouse没有传统索引。ORDER BY 子句决定了数据布局:
- - 将高基数的过滤列放在前面
- 将范围列(日期、时间戳)放在前面
- 匹配最常见的WHERE模式
sql
-- 好:先按user_id过滤,再按日期范围
ORDER BY (userid, date, eventtype)
-- 差:按user_id过滤时却把date放在前面
ORDER BY (date, userid, eventtype)
3. 使用合适的数据类型
| 使用场景 | 类型 | 原因 |
|---|
| 时间戳 | DateTime 或 DateTime64 | 原生时间函数 |
| 低基数字符串 |
LowCardinality(String) | 10倍压缩 |
| 少量枚举值 | Enum8 或 Enum16 | 最小存储空间 |
| 仅在需要时使用Nullable | Nullable(T) | 增加开销 |
| IP地址 | IPv4 或 IPv6 | 4字节 vs 16+字节 |
4. 批量插入
永远不要逐行插入。ClickHouse针对批量写入进行了优化:
bash
好:批量插入
clickhouse-client --query=INSERT INTO events FORMAT JSONEachRow < batch.json
差:循环逐行插入
for row in data:
INSERT INTO events VALUES (...)
最小批次:1,000行。最佳批次:10,000-100,000行。
5. 使用FINAL预热查询
对ReplacingMergeTree/CollapsingMergeTree的查询需要使用 FINAL 确保准确性:
sql
-- 可能返回重复/旧版本
SELECT * FROM users WHERE id = 123;
-- 保证返回最新版本
SELECT * FROM users FINAL WHERE id = 123;
FINAL 有性能开销。对于仪表盘,考虑使用物化视图。
6. 使用物化视图提升速度
预聚合昂贵的计算:
sql
CREATE MATERIALIZED VIEW hourly_events
ENGINE = SummingMergeTree()
ORDER BY (hour, event_type)
AS SELECT
toStartOfHour(timestamp) AS hour,
event_type,
count() AS events
FROM events
GROUP BY hour, event_type;
7. 优先检查系统表
在调试之前,先检查系统表:
sql
-- 正在运行的查询
SELECT * FROM system.processes;
-- 最近的查询性能
SELECT query, elapsed, readrows, memoryusage
FROM system.query_log
WHERE type = QueryFinish
ORDER BY event_time DESC
LIMIT 10;
-- 表大小
SELECT database, table, formatReadableSize(total_bytes) as size
FROM system.tables
ORDER BY total_bytes DESC;
常见陷阱
- - 使用String而非LowCardinality → 状态/类型列存储空间增大10倍
- 错误的ORDER BY → 全表扫描而非索引查找
- 逐行插入 → 大量分区碎片,写入缓慢
- 缺少TTL → 表无限增长,磁盘满
- SELECT * → 读取所有列,丧失列式优势
- 到处使用Nullable → 开销 + NULL处理复杂性
- 忘记FINAL → 合并表中出现过期/重复数据
性能检查清单
在执行昂贵查询之前:
- 1. 检查EXPLAIN:EXPLAIN SELECT ... 显示执行计划
- 先采样:SELECT ... FROM table SAMPLE 0.01 获取1%样本
- 限制列数:只SELECT需要的列
- 使用PREWHERE:在读取所有列之前进行过滤
- 检查分区:SELECT count() FROM system.parts WHERE table=X
sql
-- PREWHERE优化
SELECT userid, eventtype, data
FROM events
PREWHERE date = today()
WHERE event_type = click;
集群管理
添加TTL进行数据保留
sql
-- 删除旧数据
ALTER TABLE events
MODIFY TTL timestamp + INTERVAL 90 DAY;
-- 移动到冷存储
ALTER TABLE events
MODIFY TTL timestamp + INTERVAL 30 DAY TO VOLUME cold;
监控磁盘使用
sql
SELECT
database,
table,
formatReadableSize(sum(bytesondisk)) as disk_size,
sum(rows) as total_rows,
count() as parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytesondisk) DESC;
外部端点
| 端点 | 发送数据 | 用途 |
|---|
| localhost:8123 | SQL查询 | HTTP接口 |
| localhost:9000 |
SQL查询 | 原生TCP接口 |
不联系任何外部服务。所有查询都在用户指定的ClickHouse实例上运行。
安全与隐私
本地保存的数据(经用户同意):
- - 连接配置(主机、端口、数据库)保存在 ~/clickhouse/memory.md
- 查询模式和模式文档
- 认证方式偏好(密码 vs 证书)
重要提示: 如果您提供数据库密码,它们将以明文形式存储在 ~/clickhouse/ 中。建议使用环境变量或由clickhouse-client管理的连接配置。
此技能不会:
- - 在没有用户明确配置的情况下连接到任何ClickHouse
- 将数据发送到外部服务
- 未经询问自动收集或存储凭据
相关技能
如果用户确认,使用 clawhub install
安装:
- - sql — SQL查询模式
- analytics — 数据分析工作流
- data-analysis — 结构化数据探索
反馈
- - 如果觉得有用:clawhub star clickhouse
- 保持更新:clawhub sync