Data Analyst Skill 📊
Turn your AI agent into a data analysis powerhouse.
Query databases, analyze spreadsheets, create visualizations, and generate insights that drive decisions.
What This Skill Does
✅ SQL Queries — Write and execute queries against databases
✅ Spreadsheet Analysis — Process CSV, Excel, Google Sheets data
✅ Data Visualization — Create charts, graphs, and dashboards
✅ Report Generation — Automated reports with insights
✅ Data Cleaning — Handle missing data, outliers, formatting
✅ Statistical Analysis — Descriptive stats, trends, correlations
Quick Start
- 1. Configure your data sources in
TOOLS.md:
CODEBLOCK0
- 2. Set up your workspace:
CODEBLOCK1
- 3. Start analyzing!
SQL Query Patterns
Common Query Templates
Basic Data Exploration
CODEBLOCK2
Time-Based Analysis
CODEBLOCK3
Cohort Analysis
CODEBLOCK4
Funnel Analysis
-- Conversion funnel
WITH funnel AS (
SELECT
COUNT(DISTINCT CASE WHEN event = 'page_view' THEN user_id END) as views,
COUNT(DISTINCT CASE WHEN event = 'signup' THEN user_id END) as signups,
COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) as purchases
FROM events
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
views,
signups,
ROUND(signups * 100.0 / NULLIF(views, 0), 2) as signup_rate,
purchases,
ROUND(purchases * 100.0 / NULLIF(signups, 0), 2) as purchase_rate
FROM funnel;
Data Cleaning
Common Data Quality Issues
| Issue | Detection | Solution |
|---|
| Missing values | INLINECODE1 or empty string | Impute, drop, or flag |
| Duplicates |
GROUP BY with
HAVING COUNT(*) > 1 | Deduplicate with rules |
|
Outliers | Z-score > 3 or IQR method | Investigate, cap, or exclude |
|
Inconsistent formats | Sample and pattern match | Standardize with transforms |
|
Invalid values | Range checks, referential integrity | Validate and correct |
Data Cleaning SQL Patterns
CODEBLOCK6
Data Cleaning Checklist
CODEBLOCK7
Spreadsheet Analysis
CSV/Excel Processing with Python
CODEBLOCK8
Common Pandas Operations
CODEBLOCK9
Data Visualization
Chart Selection Guide
| Data Type | Best Chart | Use When |
|---|
| Trend over time | Line chart | Showing patterns/changes over time |
| Category comparison |
Bar chart | Comparing discrete categories |
| Part of whole | Pie/Donut | Showing proportions (≤5 categories) |
| Distribution | Histogram | Understanding data spread |
| Correlation | Scatter plot | Relationship between two variables |
| Many categories | Horizontal bar | Ranking or comparing many items |
| Geographic | Map | Location-based data |
Python Visualization with Matplotlib/Seaborn
CODEBLOCK10
ASCII Charts (Quick Terminal Visualization)
When you can't generate images, use ASCII:
CODEBLOCK11
Report Generation
Standard Report Template
CODEBLOCK12
Automated Report Script
CODEBLOCK13
Statistical Analysis
Descriptive Statistics
| Statistic | What It Tells You | Use Case |
|---|
| Mean | Average value | Central tendency |
| Median |
Middle value | Robust to outliers |
|
Mode | Most common | Categorical data |
|
Std Dev | Spread around mean | Variability |
|
Min/Max | Range | Data boundaries |
|
Percentiles | Distribution shape | Benchmarking |
Quick Stats with Python
CODEBLOCK14
Statistical Tests Quick Reference
| Test | Use Case | Python |
|---|
| T-test | Compare two means | INLINECODE4 |
| Chi-square |
Categorical independence |
scipy.stats.chi2_contingency(table) |
| ANOVA | Compare 3+ means |
scipy.stats.f_oneway(a, b, c) |
| Pearson | Linear correlation |
scipy.stats.pearsonr(x, y) |
Analysis Workflow
Standard Analysis Process
- 1. Define the Question
- What are we trying to answer?
- What decisions will this inform?
- 2. Understand the Data
- What data is available?
- What's the structure and quality?
- 3. Clean and Prepare
- Handle missing values
- Fix data types
- Remove duplicates
- 4. Explore
- Descriptive statistics
- Initial visualizations
- Identify patterns
- 5. Analyze
- Deep dive into findings
- Statistical tests if needed
- Validate hypotheses
- 6. Communicate
- Clear visualizations
- Actionable insights
- Recommendations
Analysis Request Template
CODEBLOCK15
Scripts
data-init.sh
Initialize your data analysis workspace.
query.sh
Quick SQL query execution.
CODEBLOCK16
analyze.py
Python analysis toolkit.
CODEBLOCK17
Integration Tips
With Other Skills
| Skill | Integration |
|---|
| Marketing | Analyze campaign performance, content metrics |
| Sales |
Pipeline analytics, conversion analysis |
|
Business Dev | Market research data, competitor analysis |
Common Data Sources
- - Databases: PostgreSQL, MySQL, SQLite
- Warehouses: BigQuery, Snowflake, Redshift
- Spreadsheets: Google Sheets, Excel, CSV
- APIs: REST endpoints, GraphQL
- Files: JSON, Parquet, XML
Best Practices
- 1. Start with the question — Know what you're trying to answer
- Validate your data — Garbage in = garbage out
- Document everything — Queries, assumptions, decisions
- Visualize appropriately — Right chart for right data
- Show your work — Methodology matters
- Lead with insights — Not just data dumps
- Make it actionable — "So what?" → "Now what?"
- Version your queries — Track changes over time
Common Mistakes
❌ Confirmation bias — Looking for data to support a conclusion
❌ Correlation ≠ causation — Be careful with claims
❌ Cherry-picking — Using only favorable data
❌ Ignoring outliers — Investigate before removing
❌ Over-complicating — Simple analysis often wins
❌ No context — Numbers without comparison are meaningless
License
License: MIT — use freely, modify, distribute.
"The goal is to turn data into information, and information into insight." — Carly Fiorina
数据分析师技能 📊
将您的AI智能体转变为数据分析利器。
查询数据库、分析电子表格、创建可视化图表、生成驱动决策的洞察。
该技能的功能
✅ SQL查询 — 编写并执行数据库查询
✅ 电子表格分析 — 处理CSV、Excel、Google Sheets数据
✅ 数据可视化 — 创建图表、图形和仪表盘
✅ 报告生成 — 自动生成包含洞察的报告
✅ 数据清洗 — 处理缺失数据、异常值、格式问题
✅ 统计分析 — 描述性统计、趋势分析、相关性分析
快速入门
- 1. 在 TOOLS.md 中配置您的数据源:
markdown
数据源
- - 主数据库:[连接字符串或描述]
- 电子表格:[Google Sheets URL / 本地路径]
- 数据仓库:[BigQuery/Snowflake/等]
- 2. 设置您的工作空间:
bash
./scripts/data-init.sh
- 3. 开始分析!
SQL查询模式
常用查询模板
基础数据探索
sql
-- 行数统计
SELECT COUNT(*) FROM table_name;
-- 样本数据
SELECT * FROM table_name LIMIT 10;
-- 列统计
SELECT
column_name,
COUNT(*) as count,
COUNT(DISTINCT columnname) as uniquevalues,
MIN(columnname) as minval,
MAX(columnname) as maxval
FROM table_name
GROUP BY column_name;
基于时间的分析
sql
-- 每日聚合
SELECT
DATE(created_at) as date,
COUNT(*) as daily_count,
SUM(amount) as daily_total
FROM transactions
GROUP BY DATE(created_at)
ORDER BY date DESC;
-- 月度环比比较
SELECT
DATETRUNC(month, createdat) as month,
COUNT(*) as count,
LAG(COUNT(*)) OVER (ORDER BY DATETRUNC(month, createdat)) as prev_month,
(COUNT() - LAG(COUNT()) OVER (ORDER BY DATETRUNC(month, createdat))) /
NULLIF(LAG(COUNT()) OVER (ORDER BY DATETRUNC(month, createdat)), 0) 100 as growth_pct
FROM transactions
GROUP BY DATETRUNC(month, createdat)
ORDER BY month;
同期群分析
sql
-- 按注册月份划分的用户同期群
SELECT
DATETRUNC(month, u.createdat) as cohort_month,
DATETRUNC(month, o.createdat) as activity_month,
COUNT(DISTINCT u.id) as users
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY cohortmonth, activitymonth
ORDER BY cohortmonth, activitymonth;
漏斗分析
sql
-- 转化漏斗
WITH funnel AS (
SELECT
COUNT(DISTINCT CASE WHEN event = pageview THEN userid END) as views,
COUNT(DISTINCT CASE WHEN event = signup THEN user_id END) as signups,
COUNT(DISTINCT CASE WHEN event = purchase THEN user_id END) as purchases
FROM events
WHERE date >= CURRENT_DATE - INTERVAL 30 days
)
SELECT
views,
signups,
ROUND(signups * 100.0 / NULLIF(views, 0), 2) as signup_rate,
purchases,
ROUND(purchases * 100.0 / NULLIF(signups, 0), 2) as purchase_rate
FROM funnel;
数据清洗
常见数据质量问题
| 问题 | 检测方法 | 解决方案 |
|---|
| 缺失值 | IS NULL 或空字符串 | 填充、删除或标记 |
| 重复值 |
GROUP BY 配合 HAVING COUNT(*) > 1 | 按规则去重 |
|
异常值 | Z-score > 3 或 IQR 方法 | 调查、截断或排除 |
|
格式不一致 | 抽样和模式匹配 | 通过转换标准化 |
|
无效值 | 范围检查、参照完整性 | 验证并修正 |
数据清洗SQL模式
sql
-- 查找重复项
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- 查找空值
SELECT
COUNT(*) as total,
SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) as null_emails,
SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) as null_names
FROM users;
-- 标准化文本
UPDATE products
SET category = LOWER(TRIM(category));
-- 移除异常值(IQR方法)
WITH stats AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as q3
FROM data
)
SELECT * FROM data, stats
WHERE value BETWEEN q1 - 1.5(q3-q1) AND q3 + 1.5(q3-q1);
数据清洗检查清单
markdown
数据质量审计:[数据集]
行级检查
- - [ ] 总行数:[X]
- [ ] 重复行:[X]
- [ ] 包含空值的行:[X]
列级检查
| 列名 | 类型 | 空值数 | 唯一值数 | 最小值 | 最大值 | 问题 |
|---|
| [col] | [type] | [n] | [n] | [v] | [v] | [notes] |
数据血缘
- - 来源:[数据来源]
- 最后更新:[日期]
- 已知问题:[列表]
已执行的清洗操作
- 1. [操作及原因]
- [操作及原因]
电子表格分析
使用Python处理CSV/Excel
python
import pandas as pd
加载数据
df = pd.read
csv(data.csv) # 或 pd.readexcel(data.xlsx)
基础探索
print(df.shape) # (行数, 列数)
print(df.info()) # 列类型和空值
print(df.describe()) # 数值统计
数据清洗
df = df.drop_duplicates()
df[date] = pd.to_datetime(df[date])
df[amount] = df[amount].fillna(0)
分析
summary = df.groupby(category).agg({
amount: [sum, mean, count],
quantity: sum
}).round(2)
导出
summary.to
csv(analysisoutput.csv)
常用Pandas操作
python
筛选
filtered = df[df[status] == active]
filtered = df[df[amount] > 1000]
filtered = df[df[date].between(2024-01-01, 2024-12-31)]
聚合
by_category = df.groupby(category)[amount].sum()
pivot = df.pivot_table(values=amount, index=month, columns=category, aggfunc=sum)
窗口函数
df[running_total] = df[amount].cumsum()
df[pct
change] = df[amount].pctchange()
df[rolling_avg] = df[amount].rolling(window=7).mean()
合并
merged = pd.merge(df1, df2, on=id, how=left)
数据可视化
图表选择指南
| 数据类型 | 最佳图表 | 使用场景 |
|---|
| 时间趋势 | 折线图 | 展示随时间变化的模式/趋势 |
| 类别比较 |
柱状图 | 比较离散类别 |
| 整体占比 | 饼图/环形图 | 展示比例(≤5个类别) |
| 分布 | 直方图 | 理解数据分布 |
| 相关性 | 散点图 | 两个变量之间的关系 |
| 多类别 | 水平条形图 | 对多个项目进行排名或比较 |
| 地理数据 | 地图 | 基于位置的数据 |
使用Matplotlib/Seaborn进行Python可视化
python
import matplotlib.pyplot as plt
import seaborn as sns
设置样式
plt.style.use(seaborn-v0_8-whitegrid)
sns.set_palette(husl)
折线图(趋势)
plt.figure(figsize=(10, 6))
plt.plot(df[date], df[value], marker=o)
plt.title(随时间变化的趋势)
plt.xlabel(日期)
plt.ylabel(数值)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(trend.png, dpi=150)
柱状图(比较)
plt.figure(figsize=(10, 6))
sns