Agent Survey Corpus (arXiv PDFs → text extracts)
Goal: create a small, local reference library so you can learn from real agent surveys when refining:
- C2 outline structure (paper-like sectioning)
- C4 tables/claims organization
- C5 writing style and density
This is intentionally not part of the pipeline; it is an optional, repo-level toolkit.
Inputs
- ref/agent-surveys/arxiv_ids.txt
Outputs
- ref/agent-surveys/pdfs/
- ref/agent-surveys/text/
- ref/agent-surveys/STYLE_REPORT.md (tracked; auto-generated summary)
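As a rough illustration of how the ids file relates to the output folders, the sketch below parses one id per line and derives a PDF path and a text path for each. The file naming (<id>.pdf, <id>.txt), the skipping of blank lines and # comments, and the helper names read_ids and plan_outputs are assumptions made for this example; the actual script may name and filter things differently.

```python
from pathlib import Path


def read_ids(text: str) -> list[str]:
    """Parse one arXiv id per line; skip blanks, '#' comments, and repeats (assumed convention)."""
    ids = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line and line not in ids:
            ids.append(line)
    return ids


def plan_outputs(ids: list[str], workspace: str = ".") -> list[dict]:
    """Map each id to hypothetical PDF/text paths under ref/agent-surveys/."""
    base = Path(workspace) / "ref" / "agent-surveys"
    plan = []
    for arxiv_id in ids:
        # Old-style arXiv ids contain '/', which is not safe inside a file name.
        stem = arxiv_id.replace("/", "_")
        plan.append({
            "id": arxiv_id,
            "pdf": base / "pdfs" / f"{stem}.pdf",
            "text": base / "text" / f"{stem}.txt",
        })
    return plan
```

For example, plan_outputs(read_ids(Path("ref/agent-surveys/arxiv_ids.txt").read_text()), ".") would list the paths expected under the repo root.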
Workflow
1) Edit ref/agent-surveys/arxiv_ids.txt (one arXiv id per line).
2) Run the downloader to fetch PDFs and extract the first N pages to text.
3) Skim the extracted text under ref/agent-surveys/text/:
   - Look at section counts (H2), subsection granularity (H3), and how each survey transitions between sections.
   - Identify recurring rhetorical patterns you want the pipeline writer to imitate.
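Step 3 can be partly automated. Extracted PDF text carries no Markdown headings, so the sketch below approximates H2/H3 by counting numbered headings such as "3 Methods" (depth 1) and "3.1 Planning" (depth 2). The regular expression and the function name heading_levels are assumptions for illustration; the heuristic will miss unnumbered headings and can misread numbered list items as headings.

```python
import re
from collections import Counter

# Heuristic: a heading is a short line such as "3 Methods" or "3.1 Planning".
# Section numbers are limited to two digits so that years like "2023" do not match.
HEADING = re.compile(r"^(\d{1,2}(?:\.\d{1,2})*)\.?\s+([A-Z][^\n]{1,80})$")


def heading_levels(text: str) -> Counter:
    """Count numbered headings by depth: 1 ~ section (H2), 2 ~ subsection (H3)."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            depth = match.group(1).count(".") + 1
            counts[depth] += 1
    return counts
```

Running heading_levels on each file under ref/agent-surveys/text/ gives a quick per-survey comparison of section and subsection counts before reading in detail.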
Script
Quick Start
- python scripts/run.py --help
- python scripts/run.py --workspace . --max-pages 20
All Options
- --workspace <dir> (use . to write into repo root)
- --inputs <semicolon-separated> (default: ref/agent-surveys/arxiv_ids.txt)
- --max-pages (default: 20)
- --sleep <seconds> (default: 1.0)
- --overwrite (re-download + re-extract)
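The options above could be declared with argparse roughly as follows. This is a sketch of one plausible definition, not the contents of scripts/run.py: the value types, the choice to make --workspace required, the help strings, and the split_inputs helper are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; types and 'required' are assumptions."""
    parser = argparse.ArgumentParser(description="Download arXiv surveys and extract text.")
    parser.add_argument("--workspace", required=True,
                        help="workspace root; use . to write into the repo root")
    parser.add_argument("--inputs", default="ref/agent-surveys/arxiv_ids.txt",
                        help="semicolon-separated list of id files")
    parser.add_argument("--max-pages", type=int, default=20,
                        help="number of leading pages to extract per PDF")
    parser.add_argument("--sleep", type=float, default=1.0,
                        help="seconds to wait between downloads")
    parser.add_argument("--overwrite", action="store_true",
                        help="re-download and re-extract existing files")
    return parser


def split_inputs(value: str) -> list[str]:
    """Split a semicolon-separated --inputs value, dropping empty entries."""
    return [part.strip() for part in value.split(";") if part.strip()]
```

Keeping --sleep as a float allows sub-second or multi-second pauses with the same flag, which matches the documented default of 1.0.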
Examples
- Download/extract into repo root ref/:
  - python scripts/run.py --workspace . --max-pages 20
- Download/extract into a specific folder (treated as workspace root):
  - python scripts/run.py --workspace /tmp/surveys --max-pages 30
Troubleshooting
- Download fails / timeout: rerun with a larger --sleep, or try fewer ids.
- Text extract is empty: the PDF may be scanned; try another survey or increase --max-pages.
- Files showing up in git status: PDFs/text are ignored via .gitignore (ref/**/pdfs/, ref/**/text/).
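To spot empty extracts without opening each file, a small check like the one below can help. It is a sketch under stated assumptions: extracts are assumed to be .txt files directly inside the text folder, and the 200-character threshold and the name find_empty_extracts are arbitrary choices for this example.

```python
from pathlib import Path


def find_empty_extracts(text_dir: str, min_chars: int = 200) -> list[str]:
    """Return names of .txt files with fewer than min_chars non-whitespace characters."""
    flagged = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        content = path.read_text(encoding="utf-8", errors="replace")
        # Ignore whitespace so a file of blank lines still counts as empty.
        if len("".join(content.split())) < min_chars:
            flagged.append(path.name)
    return flagged
```

Calling find_empty_extracts("ref/agent-surveys/text") lists the files that likely came from scanned PDFs and are candidates for replacement with another survey.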