arXiv Search (metadata-first)
Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.
When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.
Load Order
Always read:
- references/domain_pack_overview.md: how domain packs drive topic-specific behavior
Domain packs (loaded by topic match):
- assets/domain_packs/llm_agents.json: pinned IDs and query rewrite rules for LLM agent topics
Script Boundary
Use scripts/run.py only for:
- arXiv API retrieval and XML parsing
- offline export conversion (CSV/JSON/JSONL normalization)
- metadata enrichment via id_list backfill
Do not treat run.py as the place for:
- hardcoded topic detection or query rewriting (use domain packs)
- domain-specific pinned paper lists (externalize to assets/domain_packs/)
Input
- queries.md (keywords, excludes, time window)
Outputs
- papers/papers_raw.jsonl (JSONL; one paper per line)
  - Each record includes at least: title, authors, year, url, abstract
  - In online mode (arXiv API), records also include helpful metadata: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment
- Convenience index (optional but generated by the script):
  - papers/papers_raw.csv
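As a rough illustration of the output format, the sketch below serializes one record as a single JSONL line. The field names come from the lists above; every value is a made-up placeholder, and the helper name to_jsonl_line is hypothetical rather than part of scripts/run.py.

```python
import json

# Placeholder values for illustration only; this is not a real paper.
example_record = {
    "title": "An Example Paper Title",
    "authors": ["First Author", "Second Author"],
    "year": 2024,
    "url": "https://arxiv.org/abs/0000.00000",
    "abstract": "Placeholder abstract.\nSecond line of the abstract.",
    "arxiv_id": "0000.00000",
    "primary_category": "cs.AI",
}


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single line (hypothetical helper).

    json.dumps escapes newlines inside strings, so the result never
    spans more than one physical line.
    """
    return json.dumps(record, ensure_ascii=False)


line = to_jsonl_line(example_record)
print(line)
```

Keeping authors as an array (rather than a joined string) matches the normalization requirement stated in the workflow.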
Decision: online vs offline
- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline but have network access later, you can enrich missing fields (abstract/authors/categories) via arXiv id_list using --enrich-metadata or enrich_metadata: true in queries.md.
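The hybrid path can be pictured with the sketch below, which builds an id_list query against the public arXiv API endpoint, parses an Atom response, and fills only missing fields. It is a minimal sketch under assumptions: the function names (build_id_list_url, parse_entries, backfill) are hypothetical, the sample feed is a hand-written placeholder, and nothing here claims to mirror how scripts/run.py does it.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Atom namespace used by arXiv API responses.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_id_list_url(arxiv_ids: list[str]) -> str:
    """Build an arXiv API query URL for a batch of IDs."""
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    return "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)


def parse_entries(atom_xml: str) -> list[dict]:
    """Extract a few metadata fields from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            # Collapse the line breaks arXiv inserts into long titles.
            "title": " ".join(entry.findtext("atom:title", "", ATOM).split()),
            "abstract": " ".join(entry.findtext("atom:summary", "", ATOM).split()),
            "authors": [a.findtext("atom:name", "", ATOM)
                        for a in entry.findall("atom:author", ATOM)],
            "categories": [c.get("term")
                           for c in entry.findall("atom:category", ATOM)],
        })
    return entries


def backfill(record: dict, fetched: dict) -> dict:
    """Fill fields that are missing or empty; never overwrite existing values."""
    merged = dict(record)
    for key, value in fetched.items():
        if not merged.get(key) and value:
            merged[key] = value
    return merged


# Hand-written placeholder feed, not a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Placeholder
      Title</title>
    <summary>Placeholder abstract.</summary>
    <author><name>A. Author</name></author>
    <category term="cs.AI"/>
  </entry>
</feed>"""

fetched = parse_entries(SAMPLE_FEED)[0]
enriched = backfill({"title": "Title From Export", "abstract": ""}, fetched)
print(build_id_list_url(["2509.02547"]))
print(enriched)
```

Backfilling without overwriting keeps whatever the user's export already provided, which fits the "best effort" framing of enrichment.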
Workflow (heuristic)
1. Read queries.md and expand into concrete query strings.
2. Retrieve results (online) or import an export (offline).
3. Normalize every record to include at least: title, authors (array), year, url, abstract.
4. Keep the set broad at this stage; dedupe/ranking comes next.
5. Apply time window and max_results if specified.
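The normalization step might look roughly like the sketch below. It assumes that author strings are separated by semicolons (or commas when no semicolon is present) and that a missing year can be taken from the first four characters of published; both rules, and the name normalize_record, are illustrative assumptions rather than the script's documented behavior.

```python
def normalize_record(raw: dict) -> dict:
    """Return a copy of raw with the five required fields in a consistent shape."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Assumption: ';' separates authors; fall back to ',' otherwise.
        separator = ";" if ";" in authors else ","
        authors = [name.strip() for name in authors.split(separator) if name.strip()]

    year = raw.get("year")
    if not year and raw.get("published"):
        # Assumption: published starts with a four-digit year (ISO 8601 style).
        year = str(raw["published"])[:4]
    try:
        year = int(year)
    except (TypeError, ValueError):
        year = None  # leave unknown years empty rather than guessing

    normalized = dict(raw)  # keep any extra metadata untouched
    normalized.update({
        "title": " ".join(str(raw.get("title") or "").split()),
        "authors": list(authors),
        "year": year,
        "url": str(raw.get("url") or "").strip(),
        "abstract": " ".join(str(raw.get("abstract") or "").split()),
    })
    return normalized


example = normalize_record({
    "title": "  A   Placeholder Title ",
    "authors": "First Author; Second Author",
    "published": "2023-01-02T00:00:00Z",
    "url": " https://arxiv.org/abs/0000.00000 ",
    "abstract": "Placeholder\nabstract.",
})
print(example)
```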
Quality checklist
- [ ] papers/papers_raw.jsonl exists.
- [ ] Each line is valid JSON and contains title, authors, year, url.
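One way to run this checklist mechanically is sketched below. The function names check_lines and check_file are hypothetical, and treating blank lines as harmless is an assumption; the four checked fields are the ones named in the checklist.

```python
import json
from pathlib import Path
from typing import Iterable

# Fields named in the quality checklist.
CHECKED_FIELDS = ("title", "authors", "year", "url")


def check_lines(lines: Iterable[str]) -> list[str]:
    """Return a list of human-readable problems; empty means all lines pass."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        missing = [name for name in CHECKED_FIELDS if name not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
    return problems


def check_file(path: Path) -> list[str]:
    """Check that the file exists, then validate it line by line."""
    if not path.exists():
        return [f"{path} does not exist"]
    with path.open(encoding="utf-8") as handle:
        return check_lines(handle)


sample = [
    '{"title": "T", "authors": ["A"], "year": 2024, "url": "u"}',
    '{"title": "T"}',
    "not json",
]
print(check_lines(sample))
```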
Side effects
- Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.
- Not allowed: write prose sections in output/ before writing is approved.
Script
Quick Start
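- python scripts/run.py --help
- Online: python scripts/run.py --workspace <ws> --query <query> --max-results 200
- Offline import: python scripts/run.py --workspace <ws> --input <path to CSV/JSON/JSONL export>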
All Options
- --query <q>: repeatable; multiple queries are unioned
- --exclude: repeatable; excludes applied after retrieval
- --max-results: cap total retrieved
- --input: offline mode (CSV/JSON/JSONL)
- --enrich-metadata: best-effort enrich via arXiv id_list (needs network)
- queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
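To make "repeatable" concrete, the sketch below declares the documented flags with argparse, using action="append" so that each repeated --query or --exclude adds to a list. The flag names are the ones listed above; the types, defaults, and the function name build_parser are assumptions for illustration and may differ from what scripts/run.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; types and defaults are assumptions."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--workspace")
    # action="append" collects every occurrence of a repeatable flag.
    parser.add_argument("--query", action="append", default=[])
    parser.add_argument("--exclude", action="append", default=[])
    parser.add_argument("--max-results", type=int, default=None)
    parser.add_argument("--input", default=None)
    parser.add_argument("--enrich-metadata", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args([
    "--workspace", "ws",
    "--query", "LLM agent",
    "--query", "tool use",
    "--exclude", "survey",
    "--max-results", "300",
])
print(args.query, args.exclude, args.max_results, args.enrich_metadata)
```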
Examples
- Online (multi-query + excludes):
  - python scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300
- Fetch a single paper by arXiv ID (direct id_list fetch):
  - python scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1
- Offline auto-detect (no flags):
  - Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python scripts/run.py --workspace <ws>
- Offline import + time window (via queries.md):
  - Set "- time window: { from: 2022, to: 2025 }" in queries.md, then run the offline import normally
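The effect of a time window and a result cap can be sketched as below. Two details are assumptions, not documented behavior: both bounds are treated as inclusive, and records without a usable integer year are kept so the set stays broad. The function name apply_window_and_cap is hypothetical.

```python
from typing import Optional


def apply_window_and_cap(
    records: list[dict],
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    max_results: Optional[int] = None,
) -> list[dict]:
    """Filter by an inclusive year range, then cap the number of records."""
    kept = []
    for record in records:
        year = record.get("year")
        if isinstance(year, int):
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        # Assumption: records with an unknown year are kept, not dropped.
        kept.append(record)
    return kept if max_results is None else kept[:max_results]


papers = [{"year": 2021}, {"year": 2022}, {"year": 2025}, {"year": 2026}, {"year": None}]
windowed = apply_window_and_cap(papers, year_from=2022, year_to=2025)
print(windowed)
```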
Troubleshooting
Common Issues
Issue: papers/papers_raw.jsonl is empty
Symptom:
- Script exits with “No results returned …” or output file is empty.
Causes:
- Network is blocked (online mode).
- Queries are too narrow or queries.md is empty.
Solutions:
- Use offline import: place papers/import.csv|json|jsonl in the workspace or pass --input.
- Broaden keywords and reduce excludes in queries.md.
- Run with explicit --query to sanity-check the parser.
Issue: Offline import records are missing fields
Symptom:
- Downstream steps fail because records are missing authors/year/abstract/url.
Causes:
- Export columns don’t match expected fields; upstream export is incomplete.
Solutions:
- Ensure the export contains at least title, authors, year, url, abstract.
- If you later have network, use --enrich-metadata to backfill missing fields (best effort).
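When export columns do not match the expected field names, a small alias table can bridge the gap before import. The sketch below is illustrative only: the alias lists in COLUMN_ALIASES and the function names are hypothetical, and real exports may need different aliases.

```python
import csv
import io

# Hypothetical alias table: expected field -> column headers that may appear in exports.
COLUMN_ALIASES = {
    "title": ("title", "paper title"),
    "authors": ("authors", "author"),
    "year": ("year", "publication year"),
    "url": ("url", "link"),
    "abstract": ("abstract", "summary"),
}


def map_columns(row: dict) -> dict:
    """Rename known column aliases to the expected field names."""
    lowered = {key.strip().lower(): value for key, value in row.items() if key}
    mapped = {}
    for field, aliases in COLUMN_ALIASES.items():
        for alias in aliases:
            if lowered.get(alias):
                mapped[field] = lowered[alias].strip()
                break
    return mapped


def read_export(csv_text: str) -> list[dict]:
    """Parse CSV text and map each row onto the expected fields."""
    return [map_columns(row) for row in csv.DictReader(io.StringIO(csv_text))]


csv_text = "Paper Title,Author,Year,Link,Summary\nT,A;B,2024,http://x,S\n"
rows = read_export(csv_text)
print(rows)
```

Fields that have no matching column are simply absent from the mapped row, which makes them easy to spot before deciding whether to backfill.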
Recovery Checklist
- [ ] Confirm queries.md has non-empty keywords (or pass --query).
- [ ] If offline: confirm workspace has papers/import.* and rerun.
- [ ] Spot-check 3–5 JSONL lines: valid JSON + required fields.