Sheet Data Enrichment & Summarization

Workflow

Phase 1: Reconnaissance

1. Read the full sheet (or specify range) to understand the schema
Identify the source column (e.g., URLs/links) and target column (data to fill)
Count empty rows in the target column — these are the work items
Confirm the plan with the user before proceeding

Phase 2: Extraction

Process rows in batches. For each row with an empty target column:

1. Classify the source before fetching:

- Regular article URL → web_fetch - SPA / JS-rendered page → browser automation - WeChat article (mp.weixin.qq.com) → browser (captcha may block web_fetch) - Social media (Weibo, Douyin, etc.) → browser or note as "no extractable data" - Dead link / paywall → flag for user

2. Extract the target data from the fetched content:

- Search for the data in multiple locations (top, bottom, metadata) - Use targeted patterns (see references/extraction-patterns.md) - If ambiguous, flag for manual review rather than guessing

3. Verify before writing (每篇先核实):

- Confirm the extracted value is correct before writing to the sheet - Cross-reference with known mappings if available - When batch-processing, verify a sample first, then apply patterns

Phase 3: Write-back

1. Write confirmed values to the target column
Always verify row alignment — read the row before writing to confirm the target cell matches the expected entry
Use single-cell writes (G5:G5) not bare references (G5)
For Feishu Sheets: range format must be INLINECODE4

Phase 4: Summarization

When the user requests aggregation:

1. Re-read all enriched sheets to capture any manual corrections
Group by the requested dimensions (e.g., assignee → contributor → sum)
Sort groups by total descending for readability
Output to a new sheet/spreadsheet with headers + subtotals + grand total

Key Lessons (Extraction)

Where to find data on a web page

Data placement varies by source. Always check multiple locations:

Position	Example
Below title	INLINECODE5 / INLINECODE6
End of article

Never conclude "no data" after checking only the top of the page. Always check the end too.

When to use browser vs web_fetch

Signal	Tool
Static HTML, server-rendered	INLINECODE13 (fast, cheap)
Returns empty/minimal content

Reusable mappings

When the same source consistently maps to the same value across rows/sheets:

1. Build a mapping table after confirming 2+ instances
Apply mappings to future rows without re-fetching
Always note the mapping source for audit

Common "no data" patterns

These formats typically have no individual attribution:

- Flash news / wire alerts (e.g., "财联社电", "每经AI快讯")
Press releases / corporate announcements
Social media reposts without original attribution
Aggregated roundup articles

Mark these clearly (leave blank or use a placeholder like "/" per user preference).

Row Alignment Safety

Critical: Off-by-one errors are the #1 failure mode when writing to spreadsheets.

Before every write:

1. Read the target row to verify the adjacent cells match expectations
If processing multiple rows, re-read periodically to catch drift
Never assume row N in your data maps to row N in the sheet — always verify

Feishu Sheets Specifics

- Range format: sheetId!A1:B2 (not sheet name)
Single-cell write: sheetId!G5:G5 (not sheetId!G5)
Get sheet IDs via feishu_sheet action INLINECODE20
Write returns revision number — useful for tracking changes
Cannot create new sheet tabs in existing spreadsheet via API; create a new spreadsheet for summaries

Output Format

Summary sheets should include:

- Headers: Group dimension, detail dimension, count, total
Subtotals: Per group
Grand total: Final row
Sorting: By total descending within each group, groups ordered by subtotal descending

表格数据丰富与汇总

工作流程

第一阶段：侦察

1. 读取整个表格（或指定范围）以理解数据结构
识别源列（例如URL/链接）和目标列（待填充的数据）
统计目标列中的空行——这些就是待处理项
在继续操作前与用户确认计划

第二阶段：提取

按批次处理行。对于每个目标列为空的行：

1. 在获取前对源进行分类：

- 常规文章URL → web_fetch - SPA/JS渲染页面 → 浏览器自动化 - 微信公众号文章（mp.weixin.qq.com）→ 浏览器（验证码可能阻止web_fetch） - 社交媒体（微博、抖音等）→ 浏览器或标注为无可提取数据 - 死链/付费墙 → 标记给用户

2. 从获取的内容中提取目标数据：

- 在多个位置搜索数据（顶部、底部、元数据） - 使用针对性模式（参见references/extraction-patterns.md） - 如有歧义，标记为人工审核而非猜测

3. 写入前核实（每篇先核实）：

- 确认提取的值正确后再写入表格 - 如有已知映射则交叉引用 - 批量处理时，先验证样本，再应用模式

第三阶段：回写

1. 将确认的值写入目标列
始终验证行对齐——写入前读取该行以确认目标单元格与预期条目匹配
使用单单元格写入（G5:G5）而非裸引用（G5）
飞书表格：范围格式必须为sheetId!G5:G5

第四阶段：汇总

当用户请求聚合时：

1. 重新读取所有已丰富数据的表格以捕获任何手动修正
按请求的维度分组（例如，负责人→贡献者→合计）
按总计降序排列分组以便阅读
输出到新表格/电子表格，包含表头+小计+总计

关键经验（提取）

在网页上查找数据的位置

数据位置因来源而异。始终检查多个位置：

位置	示例
标题下方	作者：张三 / By Jane Doe
文章末尾

仅检查页面顶部后切勿得出无数据的结论。 始终也要检查页面底部。

何时使用浏览器 vs web_fetch

信号	工具
静态HTML，服务端渲染	web_fetch（快速、低成本）
返回空/最小内容

可复用映射

当同一来源在跨行/跨表中始终映射到相同值时：

1. 确认2个以上实例后建立映射表
将映射应用于后续行，无需重新获取
始终记录映射来源以备审计

常见的无数据模式

以下格式通常没有个人署名：

- 快讯/通讯社消息（如财联社电、每经AI快讯）
新闻稿/公司公告
无原始署名的社交媒体转发
聚合综述文章

清晰标记这些内容（留空或根据用户偏好使用占位符如/）。

行对齐安全

关键：偏移一位错误是写入电子表格时的头号失败模式。

每次写入前：

1. 读取目标行以验证相邻单元格与预期匹配
如处理多行，定期重新读取以捕获偏移
切勿假设数据中的第N行对应表格中的第N行——始终验证

飞书表格特定说明

- 范围格式：sheetId!A1:B2（非工作表名称）
单单元格写入：sheetId!G5:G5（非sheetId!G5）
通过feishu_sheet操作的info获取工作表ID
写入返回revision号——可用于跟踪变更
无法通过API在现有电子表格中创建新工作表标签；为汇总结果创建新电子表格

输出格式

汇总表应包含：

- 表头：分组维度、明细维度、计数、总计
小计：每组
总计：最后一行
排序：每组内按总计降序排列，组按小计降序排列

sheet-data-enrichment表格数据增强