Agent Survey Corpus (arXiv PDFs → text extracts)

Goal: create a small, local reference library so you can learn from real agent surveys when refining:

- C2 outline structure (paper-like sectioning)
- C4 tables/claims organization
- C5 writing style and density

This is intentionally not part of the pipeline; it is an optional, repo-level toolkit.

Inputs

- ref/agent-surveys/arxiv_ids.txt

Outputs

- ref/agent-surveys/pdfs/
- ref/agent-surveys/text/
- ref/agent-surveys/STYLE_REPORT.md (tracked; auto-generated summary)

Workflow

1) Edit ref/agent-surveys/arxiv_ids.txt (one arXiv id per line).
2) Run the downloader to fetch PDFs and extract the first N pages to text.
3) Skim the extracted text under ref/agent-surveys/text/:
   - Look at section counts (H2), subsection granularity (H3), and how the surveys transition between sections.
   - Identify repeated rhetorical patterns you want the pipeline writer to imitate.
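
Step 3 can be partly automated. The sketch below is not part of scripts/run.py; it only counts lines that look like numbered headings in an extracted text. It assumes the extracts are plain-text .txt files and that headings survive extraction as short lines such as "2 Background" (top level, roughly H2) or "2.1 Taxonomy" (roughly H3). The name count_headings, the regular expressions, and the 80-character cutoff are illustrative choices, and real extracts may need looser patterns.

```python
import re
from pathlib import Path

# Assumed heading shapes (hypothetical; adjust to what the extracts contain):
#   "2 Background"  -> top-level section (roughly H2)
#   "2.1 Taxonomy"  -> subsection (roughly H3)
H2_RE = re.compile(r"^\d{1,2}\.?\s+[A-Z].*$")
H3_RE = re.compile(r"^\d{1,2}\.\d{1,2}\.?\s+[A-Z].*$")


def count_headings(text: str, max_len: int = 80) -> dict:
    """Count lines that look like numbered section/subsection headings."""
    counts = {"h2": 0, "h3": 0}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or len(line) > max_len:
            continue  # blank lines and long lines are treated as body text
        if H3_RE.match(line):
            counts["h3"] += 1
        elif H2_RE.match(line):
            counts["h2"] += 1
    return counts


if __name__ == "__main__":
    root = Path("ref/agent-surveys/text")
    if root.is_dir():
        for path in sorted(root.glob("*.txt")):  # ".txt" suffix is an assumption
            text = path.read_text(encoding="utf-8", errors="replace")
            print(path.name, count_headings(text))
```

The counts are a heuristic: numbered list items or table rows that start with a digit and a capitalized word will be miscounted as headings, so treat the output as a prompt for skimming rather than a measurement.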

Script

Quick Start

- python scripts/run.py --help
- python scripts/run.py --workspace . --max-pages 20

All Options

- --workspace <dir> (use . to write into repo root)
- --inputs <semicolon-separated> (default: ref/agent-surveys/arxiv_ids.txt)
- --max-pages (default: 20)
- --sleep <seconds> (default: 1.0)
- --overwrite (re-download + re-extract)
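
For readers who want to see how the options fit together, here is a hypothetical reconstruction of the flag set using argparse. It is a sketch under stated assumptions, not the actual contents of scripts/run.py: the function names build_parser and split_inputs are invented for illustration, and making --workspace required is a guess because no default is documented above. Only the flag names and the defaults for --inputs, --max-pages, and --sleep come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; the real script may differ in details."""
    parser = argparse.ArgumentParser(
        description="Fetch arXiv survey PDFs and extract text (illustrative sketch)."
    )
    # Assumption: no default is documented for --workspace, so it is required here.
    parser.add_argument("--workspace", required=True,
                        help="workspace root; use . to write into the repo root")
    parser.add_argument("--inputs", default="ref/agent-surveys/arxiv_ids.txt",
                        help="semicolon-separated list of id files")
    parser.add_argument("--max-pages", type=int, default=20,
                        help="number of leading pages to extract per PDF")
    parser.add_argument("--sleep", type=float, default=1.0,
                        help="seconds to wait between downloads")
    parser.add_argument("--overwrite", action="store_true",
                        help="re-download and re-extract existing files")
    return parser


def split_inputs(value: str) -> list:
    """Split a semicolon-separated --inputs value, dropping empty entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.workspace, split_inputs(args.inputs), args.max_pages, args.sleep, args.overwrite)
```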

Examples

- Download/extract into repo root ref/:

- python scripts/run.py --workspace . --max-pages 20

- Download/extract into a specific folder (treated as workspace root):

- python scripts/run.py --workspace /tmp/surveys --max-pages 30

Troubleshooting

- Download fails / timeout: rerun with a larger --sleep, or try fewer ids.
- Text extract is empty: the PDF may be scanned; try another survey or increase --max-pages.
- Files showing up in git status: PDFs/text are ignored via .gitignore (ref/**/pdfs/, ref/**/text/).
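
To spot empty or near-empty extracts without opening each file, a small check like the following may help. It is a minimal sketch that assumes the extracts are .txt files directly under the text directory; the function name find_sparse_extracts and the 200-character threshold are arbitrary illustrative choices, not something the toolkit defines.

```python
from pathlib import Path


def find_sparse_extracts(text_dir, min_chars: int = 200) -> list:
    """Return (file name, character count) for extracts shorter than min_chars."""
    sparse = []
    for path in sorted(Path(text_dir).glob("*.txt")):  # ".txt" suffix is an assumption
        content = path.read_text(encoding="utf-8", errors="replace").strip()
        if len(content) < min_chars:
            sparse.append((path.name, len(content)))
    return sparse


if __name__ == "__main__":
    for name, size in find_sparse_extracts("ref/agent-surveys/text"):
        print(f"{name}: only {size} characters extracted (possibly a scanned PDF)")
```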
