Setup
On first use, create ~/pandas/ and read setup.md for initialization. User preferences are stored in ~/pandas/memory.md — users can view or edit this file anytime.
When to Use
User needs to work with tabular data in Python. Agent handles DataFrame operations, data cleaning, aggregations, merges, pivots, and exports.
Architecture
Memory lives in ~/pandas/. See memory-template.md for structure.
~/pandas/
├── memory.md      # User preferences and common patterns
└── snippets/      # Saved code snippets (optional)
Quick Reference
| Topic | File |
|---|---|
| Setup process | setup.md |
| Memory template | memory-template.md |
Core Rules
1. Use Vectorized Operations
- NEVER iterate with for loops over DataFrame rows
- Use .apply() only when vectorized alternatives don't exist
- Prefer df['col'].str.method() over apply(lambda x: x.method())
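A minimal sketch of the rule above, using a small made-up DataFrame (the column names and values are illustrative assumptions, not part of the skill). It contrasts column-wise expressions and the .str accessor with an element-by-element apply():

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "name": ["  alice ", "BOB", "Carol  "],
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 2, 3],
})

# Vectorized arithmetic: one column-wise expression instead of a row loop.
df["total"] = df["price"] * df["qty"]

# Vectorized string handling through the .str accessor.
df["name_clean"] = df["name"].str.strip().str.title()

# Same result, but with one Python-level call per element.
name_via_apply = df["name"].apply(lambda s: s.strip().title())
```

Both string approaches give the same values here; the .str version usually reads more clearly and avoids per-element lambda calls.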
2. Chain Methods for Readability
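    # Good: method chaining
    result = (
        df.query("age > 30")
        .groupby("city")
        .agg({"salary": "mean"})
        .reset_index()
    )

    # Avoid: many intermediate variables
    filtered = df[df["age"] > 30]
    grouped = filtered.groupby("city")
    result = grouped.agg({"salary": "mean"}).reset_index()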
3. Handle Missing Data Explicitly
- Always check df.isna().sum() before analysis
- Choose strategy: dropna(), fillna(), or interpolation
- Document WHY missing values exist before removing them
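The three strategies can be compared side by side. The sketch below assumes a tiny hypothetical frame with one missing city and one missing temperature; which strategy fits depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap per column.
df = pd.DataFrame({
    "city": ["NYC", "LA", None, "NYC"],
    "temp": [21.0, np.nan, 19.0, 23.0],
})

# Step 1: count missing values per column before doing anything else.
missing_counts = df.isna().sum()

# Strategy A: drop rows where a required field is missing.
dropped = df.dropna(subset=["city"])

# Strategy B: fill with a summary statistic (here, the column mean).
filled = df.assign(temp=df["temp"].fillna(df["temp"].mean()))

# Strategy C: linear interpolation between neighboring rows.
interpolated = df.assign(temp=df["temp"].interpolate())
```

Each strategy returns a new frame, so the original df stays available for comparison.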
4. Use Categorical for Repeated Strings
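    # Save memory on columns with few unique values
    df["status"] = df["status"].astype("category")
    df["country"] = df["country"].astype("category")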
5. Merge with Validation
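    # Always specify the join type and validate
    result = pd.merge(
        df1, df2,
        on="id",
        how="left",
        validate="m:1",  # Many-to-one: catches unexpected duplicates
    )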
6. Prefer query() for Complex Filters
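    # Readable
    df.query("age > 30 and city == 'NYC' and salary < 100000")

    # Hard to read
    df[(df["age"] > 30) & (df["city"] == "NYC") & (df["salary"] < 100000)]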
7. Set Index When Appropriate
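    # Faster lookups, cleaner merges
    df = df.set_index("user_id")
    user_data = df.loc[12345]  # O(1) average lookup on a unique index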
Common Traps
- SettingWithCopyWarning → Use .loc[] for assignment: df.loc[mask, 'col'] = value
- Slow loops → Replace iterrows() with vectorized ops or apply()
- Memory explosion → Use dtype in read_csv(): pd.read_csv(f, dtype={'id': 'int32'})
- Silent data loss → Check shape before/after merge: print(f"Before: {len(df1)}, After: {len(result)}")
- Index confusion → Use reset_index() after groupby() to get clean DataFrame
- Chained indexing → Assigning through df['a']['b'] = value can silently fail to update df; use a single call: df.loc['b', 'a'] = value
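Two of these traps are easy to show on a small frame. The sketch below uses hypothetical data: it performs a masked assignment through a single .loc call, then flattens a groupby() result with reset_index().

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC"],
    "salary": [100, 80, 120],
})

# One .loc call with row mask and column label together,
# so the write targets df itself rather than a temporary copy.
mask = df["city"] == "NYC"
df.loc[mask, "salary"] = df.loc[mask, "salary"] + 10

# groupby() moves the key into the index; reset_index() turns it back into a column.
summary = df.groupby("city")["salary"].mean().reset_index()
```

After this, summary is a plain two-column DataFrame (city, salary) that can be merged or exported without index handling.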
Security & Privacy
Data storage:
- User preferences stored in ~/pandas/memory.md
- All DataFrame operations run locally
- No data is sent externally
This skill does NOT:
- Upload data to any service
- Access files outside ~/pandas/ and the working directory
- Modify source data files without explicit instruction
User control:
- View stored preferences: cat ~/pandas/memory.md
- Clear all data: rm -rf ~/pandas/
Related Skills
Install with clawhub install <slug> if user confirms:
- data-analysis: general data analysis patterns
- csv: CSV file handling
- sql: database queries
- excel-xlsx: Excel file operations
Feedback
- If useful: clawhub star pandas
- Stay updated: clawhub sync
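Setup
On first use, create the ~/pandas/ directory and read setup.md for initialization. User preferences are stored in ~/pandas/memory.md; users can view or edit this file at any time.
When to Use
User needs to work with tabular data in Python. Agent handles DataFrame operations, data cleaning, aggregations, merges, pivots, and exports.
Architecture
Memory lives in the ~/pandas/ directory. See memory-template.md for structure.
~/pandas/
├── memory.md      # User preferences and common patterns
└── snippets/      # Saved code snippets (optional)
Quick Reference
| Topic | File |
|---|---|
| Setup process | setup.md |
| Memory template | memory-template.md |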
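Core Rules
1. Use Vectorized Operations
- NEVER iterate with for loops over DataFrame rows
- Use .apply() only when vectorized alternatives don't exist
- Prefer df['col'].str.method() over apply(lambda x: x.method())
2. Chain Methods for Readability
    # Good: method chaining
    result = (
        df.query("age > 30")
        .groupby("city")
        .agg({"salary": "mean"})
        .reset_index()
    )

    # Avoid: many intermediate variables
    filtered = df[df["age"] > 30]
    grouped = filtered.groupby("city")
    result = grouped.agg({"salary": "mean"}).reset_index()
3. Handle Missing Data Explicitly
- Always check df.isna().sum() before analysis
- Choose strategy: dropna(), fillna(), or interpolation
- Document WHY missing values exist before removing them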
4. Use Categorical for Repeated Strings
    # Save memory on columns with few unique values
    df["status"] = df["status"].astype("category")
    df["country"] = df["country"].astype("category")
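To see the effect, memory use can be measured before and after the conversion. This is a rough sketch on synthetic data (three status values repeated 10,000 times); actual savings depend on the column's cardinality and the pandas version.

```python
import pandas as pd

# Synthetic column: 30,000 rows but only 3 distinct strings.
df = pd.DataFrame({"status": ["active", "inactive", "pending"] * 10_000})

# deep=True counts the string payloads, not just the pointers.
before = df["status"].memory_usage(deep=True)

df["status"] = df["status"].astype("category")
after = df["status"].memory_usage(deep=True)

print(f"before: {before} bytes, after: {after} bytes")
```

A categorical column stores each distinct string once plus a small integer code per row, which is why low-cardinality columns benefit most.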
5. Merge with Validation
    # Always specify the join type and validate
    result = pd.merge(
        df1, df2,
        on="id",
        how="left",
        validate="m:1",  # Many-to-one: catches unexpected duplicates
    )
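The sketch below shows what validate="m:1" does in practice, using two hypothetical tables (orders and users are illustrative names). A clean many-to-one merge succeeds, while a duplicated key on the right side raises pandas.errors.MergeError instead of silently multiplying rows.

```python
import pandas as pd

# Hypothetical tables: many orders can point at one user.
orders = pd.DataFrame({"id": [1, 2, 2], "amount": [10, 20, 30]})
users = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Ben"]})

# users.id is unique, so the many-to-one check passes.
result = pd.merge(orders, users, on="id", how="left", validate="m:1")

# Duplicate one user row: the right side is no longer unique on id.
users_dup = pd.concat([users, users.iloc[[0]]], ignore_index=True)
try:
    pd.merge(orders, users_dup, on="id", how="left", validate="m:1")
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
```

With how="left" and a unique right key, the result has exactly as many rows as the left table, which is a cheap invariant to check after the merge.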
6. Prefer query() for Complex Filters
    # Readable
    df.query("age > 30 and city == 'NYC' and salary < 100000")

    # Hard to read
    df[(df["age"] > 30) & (df["city"] == "NYC") & (df["salary"] < 100000)]
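As a small runnable illustration on hypothetical data, the two filter styles select the same rows. The sketch also uses the @ prefix, which lets query() reference a local Python variable (min_age is an illustrative name).

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "age": [25, 35, 45],
    "city": ["NYC", "NYC", "LA"],
    "salary": [50_000, 90_000, 120_000],
})

min_age = 30  # local variable, referenced inside query() as @min_age

via_query = df.query("age > @min_age and city == 'NYC' and salary < 100000")

# Equivalent boolean-mask form; each condition needs its own parentheses.
via_mask = df[
    (df["age"] > min_age) & (df["city"] == "NYC") & (df["salary"] < 100000)
]
```

Only the second row (age 35, NYC, 90,000) satisfies all three conditions in either form.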
7. Set Index When Appropriate
    # Faster lookups, cleaner merges
    df = df.set_index("user_id")
    user_data = df.loc[12345]  # O(1) average lookup on a unique index
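A self-contained sketch of the same idea, on a hypothetical two-row table. Passing verify_integrity=True makes set_index() raise if the key column contains duplicates, which protects the assumption that a label lookup returns a single row.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
raw = pd.DataFrame({
    "user_id": [12345, 67890],
    "name": ["Ann", "Ben"],
})

# verify_integrity=True raises ValueError on duplicate keys.
df = raw.set_index("user_id", verify_integrity=True)

# Label-based lookup returns the row as a Series.
user_data = df.loc[12345]
```

If the index were not unique, df.loc[12345] would return a DataFrame with several rows rather than one Series, so checking uniqueness up front keeps downstream code simpler.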
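Common Traps
- SettingWithCopyWarning → Use .loc[] for assignment: df.loc[mask, 'col'] = value
- Slow loops → Replace iterrows() with vectorized ops or apply()
- Memory explosion → Use dtype in read_csv(): pd.read_csv(f, dtype={'id': 'int32'})
- Silent data loss → Check shape before/after merge: print(f"Before: {len(df1)}, After: {len(result)}")
- Index confusion → Use reset_index() after groupby() to get clean DataFrame
- Chained indexing → Assigning through df['a']['b'] = value can silently fail to update df; use a single call: df.loc['b', 'a'] = value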
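For the memory trap, dtypes can be declared at load time instead of converted afterwards. The sketch below reads a tiny in-memory CSV through io.StringIO so that it runs without a file on disk; the column names and dtypes are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical CSV content; in real use this would be a file path.
csv_text = "id,status,score\n1,ok,0.5\n2,fail,0.7\n3,ok,0.9\n"

# Declare compact dtypes up front so pandas never allocates the wider defaults.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int32", "status": "category", "score": "float32"},
)

print(df.dtypes)
```

Narrower numeric types are only safe when the values fit their range, so it is worth checking the expected minimum and maximum before choosing them.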
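Security & Privacy
Data storage:
- User preferences stored in ~/pandas/memory.md
- All DataFrame operations run locally
- No data is sent externally
This skill does NOT:
- Upload data to any service
- Access files outside ~/pandas/ and the working directory
- Modify source data files without explicit instruction
User control:
- View stored preferences: cat ~/pandas/memory.md
- Clear all data: rm -rf ~/pandas/
Related Skills
Install with clawhub install <slug> if user confirms:
- data-analysis: general data analysis patterns
- csv: CSV file handling
- sql: database queries
- excel-xlsx: Excel file operations
Feedback
- If useful: clawhub star pandas
- Stay updated: clawhub sync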