Sheet Data Enrichment & Summarization
Workflow
Phase 1: Reconnaissance
- 1. Read the full sheet (or specify range) to understand the schema
- Identify the source column (e.g., URLs/links) and target column (data to fill)
- Count empty rows in the target column — these are the work items
- Confirm the plan with the user before proceeding
Phase 2: Extraction
Process rows in batches. For each row with an empty target column:
- 1. Classify the source before fetching:
- Regular article URL →
web_fetch
- SPA / JS-rendered page → browser automation
- WeChat article (mp.weixin.qq.com) → browser (captcha may block
web_fetch)
- Social media (Weibo, Douyin, etc.) → browser or note as "no extractable data"
- Dead link / paywall → flag for user
- 2. Extract the target data from the fetched content:
- Search for the data in multiple locations (top, bottom, metadata)
- Use targeted patterns (see
references/extraction-patterns.md)
- If ambiguous, flag for manual review rather than guessing
- 3. Verify before writing (每篇先核实):
- Confirm the extracted value is correct before writing to the sheet
- Cross-reference with known mappings if available
- When batch-processing, verify a sample first, then apply patterns
Phase 3: Write-back
- 1. Write confirmed values to the target column
- Always verify row alignment — read the row before writing to confirm the target cell matches the expected entry
- Use single-cell writes (
G5:G5) not bare references (G5) - For Feishu Sheets: range format must be INLINECODE4
Phase 4: Summarization
When the user requests aggregation:
- 1. Re-read all enriched sheets to capture any manual corrections
- Group by the requested dimensions (e.g., assignee → contributor → sum)
- Sort groups by total descending for readability
- Output to a new sheet/spreadsheet with headers + subtotals + grand total
Key Lessons (Extraction)
Where to find data on a web page
Data placement varies by source. Always check multiple locations:
| Position | Example |
|---|
| Below title | INLINECODE5 / INLINECODE6 |
| End of article |
(记者 李四) /
Reporter: John |
| Before timestamp |
王五 2026-03-18 14:00 |
| Metadata line |
文|赵六 /
Author: Sarah |
| Combined format |
采写:记者 孙七 编辑:周八 |
Never conclude "no data" after checking only the top of the page. Always check the end too.
When to use browser vs web_fetch
| Signal | Tool |
|---|
| Static HTML, server-rendered | INLINECODE13 (fast, cheap) |
| Returns empty/minimal content |
Switch to browser |
| URL contains
mp.weixin.qq.com | Browser (WeChat captcha) |
| SPA framework (React/Vue/Angular) | Browser |
| Baidu mini-program (
smartapps.cn) | Browser or find alternate URL |
| Social media embeds (Weibo, Douyin) | Browser, but often no structured data |
Reusable mappings
When the same source consistently maps to the same value across rows/sheets:
- 1. Build a mapping table after confirming 2+ instances
- Apply mappings to future rows without re-fetching
- Always note the mapping source for audit
Common "no data" patterns
These formats typically have no individual attribution:
- - Flash news / wire alerts (e.g., "财联社电", "每经AI快讯")
- Press releases / corporate announcements
- Social media reposts without original attribution
- Aggregated roundup articles
Mark these clearly (leave blank or use a placeholder like "/" per user preference).
Row Alignment Safety
Critical: Off-by-one errors are the #1 failure mode when writing to spreadsheets.
Before every write:
- 1. Read the target row to verify the adjacent cells match expectations
- If processing multiple rows, re-read periodically to catch drift
- Never assume row N in your data maps to row N in the sheet — always verify
Feishu Sheets Specifics
- - Range format:
sheetId!A1:B2 (not sheet name) - Single-cell write:
sheetId!G5:G5 (not sheetId!G5) - Get sheet IDs via
feishu_sheet action INLINECODE20 - Write returns
revision number — useful for tracking changes - Cannot create new sheet tabs in existing spreadsheet via API; create a new spreadsheet for summaries
Output Format
Summary sheets should include:
- - Headers: Group dimension, detail dimension, count, total
- Subtotals: Per group
- Grand total: Final row
- Sorting: By total descending within each group, groups ordered by subtotal descending
表格数据丰富与汇总
工作流程
第一阶段:侦察
- 1. 读取整个表格(或指定范围)以理解数据结构
- 识别源列(例如URL/链接)和目标列(待填充的数据)
- 统计目标列中的空行——这些就是待处理项
- 在继续操作前与用户确认计划
第二阶段:提取
按批次处理行。对于每个目标列为空的行:
- 1. 在获取前对源进行分类:
- 常规文章URL → web_fetch
- SPA/JS渲染页面 → 浏览器自动化
- 微信公众号文章(mp.weixin.qq.com)→ 浏览器(验证码可能阻止web_fetch)
- 社交媒体(微博、抖音等)→ 浏览器或标注为无可提取数据
- 死链/付费墙 → 标记给用户
- 2. 从获取的内容中提取目标数据:
- 在多个位置搜索数据(顶部、底部、元数据)
- 使用针对性模式(参见
references/extraction-patterns.md)
- 如有歧义,标记为人工审核而非猜测
- 3. 写入前核实(每篇先核实):
- 确认提取的值正确后再写入表格
- 如有已知映射则交叉引用
- 批量处理时,先验证样本,再应用模式
第三阶段:回写
- 1. 将确认的值写入目标列
- 始终验证行对齐——写入前读取该行以确认目标单元格与预期条目匹配
- 使用单单元格写入(G5:G5)而非裸引用(G5)
- 飞书表格:范围格式必须为sheetId!G5:G5
第四阶段:汇总
当用户请求聚合时:
- 1. 重新读取所有已丰富数据的表格以捕获任何手动修正
- 按请求的维度分组(例如,负责人→贡献者→合计)
- 按总计降序排列分组以便阅读
- 输出到新表格/电子表格,包含表头+小计+总计
关键经验(提取)
在网页上查找数据的位置
数据位置因来源而异。始终检查多个位置:
| 位置 | 示例 |
|---|
| 标题下方 | 作者:张三 / By Jane Doe |
| 文章末尾 |
(记者 李四) / Reporter: John |
| 时间戳前 | 王五 2026-03-18 14:00 |
| 元数据行 | 文|赵六 / Author: Sarah |
| 组合格式 | 采写:记者 孙七 编辑:周八 |
仅检查页面顶部后切勿得出无数据的结论。 始终也要检查页面底部。
何时使用浏览器 vs web_fetch
| 信号 | 工具 |
|---|
| 静态HTML,服务端渲染 | web_fetch(快速、低成本) |
| 返回空/最小内容 |
切换到浏览器 |
| URL包含mp.weixin.qq.com | 浏览器(微信验证码) |
| SPA框架(React/Vue/Angular) | 浏览器 |
| 百度小程序(smartapps.cn) | 浏览器或寻找替代URL |
| 社交媒体嵌入(微博、抖音) | 浏览器,但通常无结构化数据 |
可复用映射
当同一来源在跨行/跨表中始终映射到相同值时:
- 1. 确认2个以上实例后建立映射表
- 将映射应用于后续行,无需重新获取
- 始终记录映射来源以备审计
常见的无数据模式
以下格式通常没有个人署名:
- - 快讯/通讯社消息(如财联社电、每经AI快讯)
- 新闻稿/公司公告
- 无原始署名的社交媒体转发
- 聚合综述文章
清晰标记这些内容(留空或根据用户偏好使用占位符如/)。
行对齐安全
关键:偏移一位错误是写入电子表格时的头号失败模式。
每次写入前:
- 1. 读取目标行以验证相邻单元格与预期匹配
- 如处理多行,定期重新读取以捕获偏移
- 切勿假设数据中的第N行对应表格中的第N行——始终验证
飞书表格特定说明
- - 范围格式:sheetId!A1:B2(非工作表名称)
- 单单元格写入:sheetId!G5:G5(非sheetId!G5)
- 通过feishu_sheet操作的info获取工作表ID
- 写入返回revision号——可用于跟踪变更
- 无法通过API在现有电子表格中创建新工作表标签;为汇总结果创建新电子表格
输出格式
汇总表应包含:
- - 表头:分组维度、明细维度、计数、总计
- 小计:每组
- 总计:最后一行
- 排序:每组内按总计降序排列,组按小计降序排列