Related Works Report Pipeline
Orchestrate a full related-works reporting workflow for: $ARGUMENTS
Overview
This workflow turns a small set of paper markdown files into a reproducible report:
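```text
paper md files
  -> step 1 extract Related Works + cited papers
  -> step 2 deduplicate cited papers
  -> sequential Tavily abstract lookup + local arXiv fetch
  -> citation-normalized companion text
  -> final_related_works_report.md
```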
The output is not just the final report. Every intermediate artifact is written into the user-selected work folder so the process can be resumed, audited, or partially rerun.
Required Inputs
Before running, collect or infer:
- `PAPER_MD_PATHS`: one or more paper markdown paths
- `WORKDIR`: a user-specified working folder
If WORKDIR is missing, ask the user for it before doing substantial work.
All process outputs and the final report must be written under WORKDIR. Do not write process artifacts elsewhere.
Constants
- `TAVILY_ONLY = true`: Tavily is the only allowed search mechanism for title-to-arXiv matching
- `TAVILY_BATCH_MODE = sequential`: finish batch N before batch N+1
- `NO_SEARCH_FALLBACK = true`: if Tavily fails, record the failure; do not switch to arXiv search, arXiv API search, or another provider
- `LOW_CONTEXT_ABSTRACT_MODE = true`: write one JSON line per title, then render markdown from JSONL
- `CITATION_NORMALIZATION_REQUIRED = true`: the final workflow must consider a normalized-citation companion section so Part 1 can be aligned with dedup ids like `P001`
- `FINAL_REPORT_NAME = final_related_works_report.md`
Work Folder Layout
Create and maintain these artifacts under WORKDIR:
| Artifact | Purpose |
|---|---|
| `step1_extracted_related_works_and_citations.md` | Per-source Related Works verbatim + citation tables |
| `step1_normalized_related_works.md` | Citation-normalized companion text using dedup ids like `P001` when replacement is unambiguous |
| `step2_deduplicated_paper_list.md` | Deduplicated paper list with source occurrences |
| `title_batches/batch_XX.md` | Abstract lookup batches |
| `abstract_batches/batch_XX_fetches.jsonl` | One JSON line per title after Tavily + local fetch |
| `abstract_batches/batch_XX_results.md` | Rendered markdown per abstract batch |
| `final_related_works_report.md` | Final assembled report |
Execution Rule
Run phases in order. Do not stop after a checkpoint unless:
- the user explicitly says to stop, or
- an input is missing and must be confirmed, or
- Tavily failures are severe enough that the user should decide whether to continue with partial results
Parallelism rules:
- Phase 1 extraction may use one clean-context sub agent per source paper.
- Phase 3 abstract lookup must run batches sequentially, not in parallel.
Phase 0: Initialize the Work Folder
1. Validate `PAPER_MD_PATHS`.
2. Create `WORKDIR`, `WORKDIR/title_batches`, and `WORKDIR/abstract_batches`.
3. Record the chosen source files in the first process artifact.
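The three initialization steps can be sketched as below. This is a minimal illustration, not the skill's own code: the function name `init_workdir` and the manifest file name `step0_sources.md` are hypothetical, since the skill only says the sources go into "the first process artifact".

```python
from pathlib import Path


def init_workdir(workdir, paper_md_paths):
    """Validate the source papers and create the work folder layout."""
    sources = [Path(p) for p in paper_md_paths]
    if not sources:
        raise ValueError("PAPER_MD_PATHS must contain at least one path")
    missing = [str(p) for p in sources if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing paper markdown files: {missing}")

    root = Path(workdir)
    for folder in (root, root / "title_batches", root / "abstract_batches"):
        folder.mkdir(parents=True, exist_ok=True)

    # Hypothetical manifest name; the skill does not specify one.
    manifest = root / "step0_sources.md"
    lines = ["Source papers", ""] + [f"- {p}" for p in sources]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest
```

Using `exist_ok=True` keeps the step safe to rerun when a workflow is resumed in an existing `WORKDIR`.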
Phase 1: Extract Related Works and cited papers
Use one clean-context sub agent per source markdown file.
Each sub agent must return:
- the verbatim Related Works section
- the papers cited inside that section
- enough citation metadata to later support normalization and deduplication:
  - citation token in the text (`[12]`, `Guo et al., 2017`, etc.)
  - year
  - title
  - authors
  - raw reference text
Merge all outputs into WORKDIR/step1_extracted_related_works_and_citations.md.
Phase 2: Deduplicate cited papers
Build WORKDIR/step2_deduplicated_paper_list.md.
Rules:
- deduplicate conservatively by normalized title
- merge only when the works are clearly the same paper
- preserve source occurrences so every original citation can be traced back
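One conservative way to apply these rules is to merge only on an exact match of the normalized title and to keep every source occurrence. The sketch below assumes a citation record shaped as a dict with `title`, `source`, and `token` keys; that shape, the normalization steps, and the function names are illustrative assumptions, not the skill's actual implementation.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def deduplicate(citations):
    """Merge citations with identical normalized titles, in first-seen order."""
    entries = {}
    for index, cite in enumerate(citations):
        key = normalize_title(cite["title"])
        if not key:
            # Never merge entries whose title is empty or unusable.
            key = ("untitled", index)
        if key not in entries:
            entries[key] = {
                "dedup_id": f"P{len(entries) + 1:03d}",
                "title": cite["title"],
                "occurrences": [],
            }
        # Keep every (source file, citation token) pair for traceability.
        entries[key]["occurrences"].append((cite["source"], cite["token"]))
    return list(entries.values())
```

Exact matching after normalization errs on the side of leaving near-duplicates unmerged, which fits the "merge only when clearly the same paper" rule; fuzzy matching would need a manual review step.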
Phase 2B: Citation normalization companion text
Before final assembly, produce WORKDIR/step1_normalized_related_works.md.
Goal:
- keep the original Related Works text untouched in `step1_extracted_related_works_and_citations.md`
- create a companion version where in-text citations are rewritten to dedup ids like `P001`
Preferred format:
- numeric citations: `[12]` -> `[P052]`
- author-year citations: `(Guo et al., 2017)` -> `[P095]`
- grouped citations: rewrite each cited work individually when the mapping is unambiguous
Rules:
- only replace citations when the mapping from source citation token to dedup id is unambiguous
- if a citation is ambiguous, keep the original token and add a short note below that section
- do not overwrite the verbatim source text
This step exists because the final report should be easy to align with the deduplicated bibliography.
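For numeric citations, the rewrite can be sketched as a regex substitution driven by a per-source map from citation token to dedup id. The function name, the map shape, and the choice to leave a whole group untouched when any member is unresolved are assumptions made for illustration; author-year citations would need a separate pattern.

```python
import re

NUMERIC_CITATION = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def normalize_numeric_citations(text, token_to_id):
    """Rewrite [n] and [n, m] citations to dedup ids.

    Returns the rewritten text and the citation tokens left unchanged,
    which can feed the short note required for ambiguous citations.
    """
    unresolved = []

    def rewrite(match):
        numbers = [n.strip() for n in match.group(1).split(",")]
        ids = [token_to_id.get(f"[{n}]") for n in numbers]
        if any(i is None for i in ids):
            # Keep the original token when any member has no clear mapping.
            unresolved.append(match.group(0))
            return match.group(0)
        return ", ".join(f"[{i}]" for i in ids)

    return NUMERIC_CITATION.sub(rewrite, text), unresolved
```

The function returns a new string and never touches the input, so the verbatim text in the step 1 artifact stays intact.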
Phase 3: arXiv abstracts via Tavily + local helper
Use the helper scripts stored inside this skill:
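- `.cursor/skills/related-works-report-from-paper-mds/scripts/fetch_arxiv_abs.py`
- `.cursor/skills/related-works-report-from-paper-mds/scripts/jsonl_to_abstract_batch_md.py`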
Search rules
- Tavily MCP only
- query shape: `<paper title> arXiv`
- preferred matches: `abs`, then `html`, then `pdf`
- convert `html`/`pdf` to canonical `https://arxiv.org/abs/`
- no fallback search provider
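The match preference and URL conversion can be illustrated with a small helper. This is a sketch under assumptions: the regex covers common new-style and old-style arXiv identifiers, the version suffix is dropped from the canonical URL, and the function names are hypothetical rather than part of the skill's scripts.

```python
import re

ARXIV_URL = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(abs|html|pdf)/"
    r"(\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?(?:\.pdf)?/?$"
)
PREFERENCE = {"abs": 0, "html": 1, "pdf": 2}


def canonical_abs_url(url):
    """Map an arXiv abs/html/pdf URL to its abs form, or None if not arXiv."""
    match = ARXIV_URL.match(url.strip())
    if match is None:
        return None
    return f"https://arxiv.org/abs/{match.group(2)}"


def pick_best(urls):
    """Pick the most preferred arXiv hit (abs, then html, then pdf)."""
    ranked = []
    for position, url in enumerate(urls):
        match = ARXIV_URL.match(url.strip())
        if match:
            # Ties within one kind are broken by search-result order.
            ranked.append((PREFERENCE[match.group(1)], position, match.group(2)))
    if not ranked:
        return None
    return f"https://arxiv.org/abs/{min(ranked)[2]}"
```

Returning `None` when no result is an arXiv URL leaves the decision to record an empty match with the caller, so nothing is fabricated.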
Batch rules
- process batches sequentially
- inside a batch, process titles one by one
- if Tavily rate limits, wait and retry Tavily only
- if Tavily still fails, record the error in the JSONL and leave `arXiv URL` and `Abstract` empty
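The retry rule can be sketched as a wrapper that retries the same provider with growing delays and then reports the failure instead of switching providers. The `search` argument is a hypothetical callable standing in for the Tavily MCP call; the attempt count and delays are illustrative values, not settings from the skill.

```python
import time


def search_with_retry(search, query, max_attempts=3, base_delay=2.0,
                      sleep=time.sleep):
    """Call search(query), retrying the same provider on failure.

    Returns (results, error). On success error is None; after the last
    failed attempt results is an empty list and error holds the message
    to record in the JSONL line.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return search(query), None
        except Exception as exc:  # real error types depend on the client used
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < max_attempts - 1:
                # Exponential backoff: 2 s, 4 s, ... with the defaults.
                sleep(base_delay * (2 ** attempt))
    return [], last_error
```

Because there is no second provider in the loop, a persistent failure surfaces as an error string for the batch record rather than a silent fallback.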
Low-context pattern
For each processed title, immediately append one JSON line to:
- `WORKDIR/abstract_batches/batch_XX_fetches.jsonl`
Each line should include at least:
- `dedup_id`
- `input_title`
- `tavily_status`
- `tavily_error`
- `arxiv_url`
- `fetch`
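Appending one line per title keeps each result on disk as soon as it exists, so an interrupted batch can resume without holding earlier abstracts in context. A minimal sketch follows; the field names `dedup_id`, `input_title`, `tavily_status`, `tavily_error`, `arxiv_url`, and `fetch`, the value shapes, and the function name are assumptions made for illustration.

```python
import json
from pathlib import Path


def append_fetch_record(jsonl_path, dedup_id, input_title, tavily_status,
                        tavily_error=None, arxiv_url="", fetch=None):
    """Append one JSON line describing the lookup result for one title."""
    record = {
        "dedup_id": dedup_id,
        "input_title": input_title,
        "tavily_status": tavily_status,
        "tavily_error": tavily_error,
        "arxiv_url": arxiv_url,  # stays empty when no match was found
        "fetch": fetch,          # assumed: output of the local arXiv fetch helper
    }
    path = Path(jsonl_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode: earlier lines in the batch file are never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

A failed lookup is still written, with the error text and an empty URL, which keeps the batch file a complete log of what was attempted.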
After a batch completes, render markdown with:
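```bash
python3 .cursor/skills/related-works-report-from-paper-mds/scripts/jsonl_to_abstract_batch_md.py \
  WORKDIR/abstract_batches/batch_XX_fetches.jsonl \
  WORKDIR/abstract_batches/batch_XX_results.md
```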
Phase 4: Final assembly
Use the final builder script stored inside this skill:
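```bash
python3 .cursor/skills/related-works-report-from-paper-mds/scripts/build_final_related_works_report.py \
  WORKDIR/step1_extracted_related_works_and_citations.md \
  WORKDIR/step2_deduplicated_paper_list.md \
  WORKDIR/abstract_batches \
  WORKDIR/final_related_works_report.md \
  WORKDIR/step1_normalized_related_works.md
```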
The final report should contain:
- Summary
- Part 1: Related Works original text
- Part 1B: citation-normalized companion text
- Part 2: BibTeX-style entries with retrieved abstracts when available
Key Rules
- Never fabricate a paper match or abstract.
- Never use non-Tavily search when resolving titles to arXiv.
- Keep all process artifacts inside `WORKDIR`.
- Prefer scripts inside this skill over ad hoc in-message code.
- Preserve source-paper order in Part 1.
- Preserve dedup order from `step2` in Part 2.
Utility Scripts
- `scripts/fetch_arxiv_abs.py`: compact metadata + abstract extraction from a known arXiv URL
- `scripts/jsonl_to_abstract_batch_md.py`: render one batch markdown from JSONL
- `scripts/build_final_related_works_report.py`: assemble the final report from workdir artifacts
Additional Resources
- For copy-paste invocations and expected `WORKDIR` contents, see `examples.md`
Example Invocation
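```text
/related-works-report-from-paper-mds \
  0refs/papermds/2025ConfidenceVLA.md 0refs/papermds/2025SAFE.md 0refs/papermds/2025FAILDetect.md --workdir 0docs/relatedworksreportrun_02
```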