ArXiv Summarizer Orchestrator
Run the full pipeline by composing three sub-skills.
Sub-skill Order
1. arxiv-search-collector
2. arxiv-paper-processor
3. arxiv-batch-reporter
Workflow Parameters
- language: language parameter, set manually and used by all stages. Defaults to English when omitted.
- paper_processing_mode: subagent_parallel or serial.
- max_parallel_papers: defaults to 5 when paper_processing_mode=subagent_parallel.
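The three parameters and their defaults can be pictured as a small validated record. This is a minimal sketch only: the WorkflowParams class and the effective_parallelism property are hypothetical names introduced for illustration and are not part of any sub-skill.

```python
from dataclasses import dataclass

VALID_MODES = ("subagent_parallel", "serial")


@dataclass(frozen=True)
class WorkflowParams:
    """Hypothetical container for the workflow parameters described above."""

    language: str = "English"
    paper_processing_mode: str = "subagent_parallel"
    max_parallel_papers: int = 5

    def __post_init__(self) -> None:
        # Reject values the workflow does not define.
        if self.paper_processing_mode not in VALID_MODES:
            raise ValueError(f"unknown paper_processing_mode: {self.paper_processing_mode!r}")
        if self.max_parallel_papers < 1:
            raise ValueError("max_parallel_papers must be at least 1")

    @property
    def effective_parallelism(self) -> int:
        # Serial mode always processes one paper at a time.
        return 1 if self.paper_processing_mode == "serial" else self.max_parallel_papers
```

With no arguments, WorkflowParams() yields English output and five papers per parallel batch; switching to serial mode reduces the effective parallelism to one.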
Workflow
Stage A: Collection Setup + Query Retrieval
1. Initialize one run with arxiv-search-collector/scripts/init_collection_run.py.
2. The model generates multiple focused queries from the original topic and writes a minimal query_plan.json (label + query only).
3. Run arxiv-search-collector/scripts/fetch_queries_batch.py with the plan file (recommended).
4. (Optional fallback) Call arxiv-search-collector/scripts/fetch_query_metadata.py manually to fetch queries one by one.
5. The model reads each indexed query list and decides which indexes to keep.
6. Merge the selected items with arxiv-search-collector/scripts/merge_selected_papers.py.
7. If relevance or coverage is still insufficient, iterate Stage A:
   - generate another query plan with new labels,
   - fetch again,
   - re-merge with --incremental and an updated selection-json,
   - set weak labels to an empty keep list ([]) to explicitly drop them.
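The plan and selection steps above can be sketched as plain data manipulation. The exact file schemas are not given here, so the layouts below are assumptions: the plan is taken to be a list of objects with only label and query, and the selection is taken to be a mapping from label to kept indexes. The function names build_query_plan and drop_weak_labels are hypothetical.

```python
import json


def build_query_plan(queries: dict[str, str]) -> list[dict[str, str]]:
    """Build a minimal plan: one {label, query} object per focused query (assumed layout)."""
    return [{"label": label, "query": query} for label, query in queries.items()]


def drop_weak_labels(selection: dict[str, list[int]], weak: set[str]) -> dict[str, list[int]]:
    """Return a copy of the selection in which weak labels keep nothing ([])."""
    return {label: ([] if label in weak else list(keep)) for label, keep in selection.items()}


def to_json(data: object) -> str:
    # ensure_ascii=False keeps non-English labels and queries readable in the file.
    return json.dumps(data, indent=2, ensure_ascii=False)
```

For example, drop_weak_labels({"core": [0, 2], "noisy": [1]}, {"noisy"}) keeps the core picks and leaves the noisy label with an empty keep list, which is the explicit-drop convention described in the last step.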
Pass --language <LANG> to collector scripts so all generated markdown files in Stage A follow the selected language.
Use serial query fetch in Stage A with conservative controls (for example --min-interval-sec 5, --retry-max 4).
Default collector settings already include retries/backoff and run-local throttle state (<run_dir>/.runtime/arxiv_api_state.json), so manual tuning is usually unnecessary.
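To make the meaning of these controls concrete, here is a minimal sketch of serial fetching with a fixed gap between queries and bounded retries. It is not the collector's implementation: the fetch_serially function, the interpretation of retry_max as retries after the first attempt, and the doubling backoff are all assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_serially(
    fetchers: list[Callable[[], T]],
    min_interval_sec: float = 5.0,
    retry_max: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> list[T]:
    """Run fetchers one at a time, spacing them out and retrying failures."""
    results: list[T] = []
    for index, fetch in enumerate(fetchers):
        if index > 0:
            sleep(min_interval_sec)  # gap between consecutive queries
        for attempt in range(retry_max + 1):
            try:
                results.append(fetch())
                break
            except Exception:
                if attempt == retry_max:
                    raise  # retries exhausted, surface the error
                sleep(min_interval_sec * 2 ** attempt)  # assumed doubling backoff
    return results
```

The sleep parameter is injectable so the pacing can be checked without actually waiting.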
Prefer cache reuse (no --force) unless query parameters changed or data refresh is required.
Output: one run directory with per-paper metadata subdirectories.
Stage B: Per-paper Artifact Download + Manual Summary
For each paper directory, invoke sub-skill arxiv-paper-processor once and let that skill produce <paper_dir>/summary.md.
Recommended pre-step for many papers:
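1. Run one batch artifact download before per-paper reading:

       python3 arxiv-paper-processor/scripts/download_papers_batch.py \
         --run-dir /path/to/run \
         --artifact source_then_pdf \
         --max-workers 3 \
         --min-interval-sec 5 \
         --language <LANG>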
Per-paper execution steps (inside arxiv-paper-processor):
1. If <paper_dir>/summary.md already exists and is complete, skip this paper.
2. If usable source (source/source_extract/*.tex) or PDF (source/paper.pdf) already exists, skip the download.
3. If artifacts are missing, download the source with arxiv-paper-processor/scripts/download_arxiv_source.py.
4. If the source is unusable, download the PDF with arxiv-paper-processor/scripts/download_arxiv_pdf.py.
5. The model reads the content and manually writes <paper_dir>/summary.md in the reference format, in the selected language.
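The skip and download decisions in these steps can be sketched as a single check on the paper directory. This is an illustration under stated assumptions: next_action is a hypothetical helper, the returned labels are invented for the example, and since "complete" is not defined here the caller passes that judgment in as a flag.

```python
from pathlib import Path


def next_action(paper_dir: Path, summary_complete: bool = True) -> str:
    """Decide what to do for one paper directory, following the steps above."""
    summary = paper_dir / "summary.md"
    if summary.is_file() and summary_complete:
        return "skip_paper"
    source_dir = paper_dir / "source"
    has_tex = any((source_dir / "source_extract").glob("*.tex"))
    has_pdf = (source_dir / "paper.pdf").is_file()
    if has_tex or has_pdf:
        # Artifacts already present: no download, go straight to reading.
        return "read_and_summarize"
    # Nothing usable yet: try the source first, with the PDF as fallback.
    return "download_source"
```

The helper only inspects files; writing summary.md remains a manual step for the model.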
Parallel strategy for many papers:
- Default: paper_processing_mode=subagent_parallel with max_parallel_papers=5.
- Optional: paper_processing_mode=serial to process one paper at a time.
- In parallel mode, run multiple arxiv-paper-processor instances in batches; the number of concurrent papers must not exceed max_parallel_papers.
- Wait for one batch to finish before starting the next batch.
- In serial mode, run exactly one arxiv-paper-processor instance at a time.
- Each subagent worker should own exactly one paper directory to avoid file conflicts.
- Do not use scripts to auto-compose summary text; scripts are download-only tools.
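The batching rule can be sketched as follows. Threads stand in for subagents purely for illustration, and batches, process_all, and the process_paper callback are hypothetical names; how subagents are actually launched is outside what this document specifies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def batches(paper_dirs: Sequence[str], max_parallel_papers: int = 5) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most max_parallel_papers directories."""
    for start in range(0, len(paper_dirs), max_parallel_papers):
        yield paper_dirs[start:start + max_parallel_papers]


def process_all(
    paper_dirs: Sequence[str],
    process_paper: Callable[[str], str],
    mode: str = "subagent_parallel",
    max_parallel_papers: int = 5,
) -> list[str]:
    """Process every paper directory, one batch at a time."""
    width = 1 if mode == "serial" else max_parallel_papers
    results: list[str] = []
    for batch in batches(paper_dirs, width):
        # One worker per paper directory, so no two workers share a directory.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # Consuming the map blocks until the whole batch has finished.
            results.extend(pool.map(process_paper, batch))
    return results
```

Because each batch is fully drained before the loop advances, concurrency never exceeds the batch width, and serial mode falls out as the special case of width one.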
Output: all paper directories contain summary.md.
Stage C: Bundle + Final Hierarchical Report
1. Run arxiv-batch-reporter/scripts/collect_summaries_bundle.py --language <LANG>.
2. The model reads summaries_bundle.md and writes collection_report_template.md in the base directory.
3. In the template, each paper leaf entry must include one standalone placeholder line: {{ARXIV_BRIEF:<arxiv_id>}}.
4. Run arxiv-batch-reporter/scripts/render_collection_report.py to generate the final collection_report.md.
5. Do not manually paraphrase per-paper conclusion lines in the final report; they must come from section 10 of each paper's summary.md via script injection.
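To show why the placeholder must sit on its own line, here is a minimal sketch of placeholder injection. It is not the code of render_collection_report.py: render_report is a hypothetical function, and the briefs mapping is assumed to have been extracted from each summary.md beforehand by some step not shown.

```python
import re

# Matches a line that consists of the placeholder and nothing else.
PLACEHOLDER = re.compile(r"^\{\{ARXIV_BRIEF:([^}]+)\}\}$")


def render_report(template: str, briefs: dict[str, str]) -> str:
    """Replace each standalone placeholder line with that paper's brief."""
    rendered = []
    for line in template.splitlines():
        match = PLACEHOLDER.match(line.strip())
        if match is None:
            rendered.append(line)  # ordinary template text passes through
            continue
        arxiv_id = match.group(1)
        if arxiv_id not in briefs:
            raise KeyError(f"no brief available for {arxiv_id}")
        rendered.append(briefs[arxiv_id])
    return "\n".join(rendered)
```

In this sketch a placeholder embedded in the middle of a sentence is left untouched, which mirrors the requirement that each leaf entry carry one standalone placeholder line.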
If language is non-English (for example Chinese), all intermediate markdown files and final reports should follow that language.
Periodic Scheduling
This orchestrator is suitable for cron/scheduled execution in OpenClaw:
- Frequency examples: daily, weekly, monthly.
- For rolling windows, use lookback (1d, 7d, 30d) when initializing runs.
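A lookback value such as 7d can be read as a rolling date window ending at the time of the run. The sketch below assumes the value is always a whole number of days followed by d, as in the examples above; lookback_window is a hypothetical helper, and how the collector itself interprets the value is not shown here.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional


def lookback_window(lookback: str, now: Optional[datetime] = None) -> tuple[datetime, datetime]:
    """Turn a value like '7d' into a (start, end) window ending at now."""
    match = re.fullmatch(r"(\d+)d", lookback)
    if match is None:
        raise ValueError(f"unsupported lookback value: {lookback!r}")
    end = now if now is not None else datetime.now(timezone.utc)
    return end - timedelta(days=int(match.group(1))), end
```

A daily schedule would pair naturally with 1d, a weekly one with 7d, and a monthly one with 30d, so that consecutive runs cover adjacent windows.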
Output Layout
Each run directory contains:
- task_meta.json, task_meta.md
- query_results/, query_selection/
- <arxiv_id>/metadata.md + downloaded source/pdf + summary.md
- summaries_bundle.md
- collection_report_template.md
- final rendered collection report (e.g. collection_report.md)
Use references/workflow-checklist.md as the execution checklist.
Related Skills
This is the top-level orchestration skill.
Before using it, install and enable these three sub-skills:
- arxiv-search-collector
- arxiv-paper-processor
- arxiv-batch-reporter
Execution order inside this orchestrator:
1. arxiv-search-collector (Stage A)
2. arxiv-paper-processor (Stage B)
3. arxiv-batch-reporter (Stage C)