arXiv Search (metadata-first)

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.

Load Order

Always read:

- references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

- assets/domain_packs/llm_agents.json — pinned IDs, query rewrite rules for LLM agent topics

Script Boundary

Use scripts/run.py only for:

- arXiv API retrieval and XML parsing
- offline export conversion (CSV/JSON/JSONL normalization)
- metadata enrichment via id_list backfill

Do not treat run.py as the place for:

- hardcoded topic detection or query rewriting (use domain packs)
- domain-specific pinned paper lists (externalize to assets/domain_packs/)
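
To illustrate the first responsibility above (arXiv API retrieval and XML parsing), here is a minimal sketch of parsing an arXiv Atom feed with the standard library. It is not the code in scripts/run.py: the helper names `parse_arxiv_feed` and `_text` are hypothetical, and the sample feed is a made-up placeholder entry shaped like an arXiv API response.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by the arXiv API's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _text(node, path):
    """Return whitespace-collapsed text of a child element, or '' if absent."""
    child = node.find(path, NS)
    if child is None or child.text is None:
        return ""
    return " ".join(child.text.split())


def parse_arxiv_feed(xml_text):
    """Parse an Atom feed string into a list of paper dicts (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    papers = []
    for entry in root.findall("atom:entry", NS):
        url = _text(entry, "atom:id")
        published = _text(entry, "atom:published")
        primary = entry.find("arxiv:primary_category", NS)
        pdf_url = ""
        for link in entry.findall("atom:link", NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href", "")
        papers.append({
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "year": int(published[:4]) if published[:4].isdigit() else None,
            "url": url,
            "abstract": _text(entry, "atom:summary"),
            # The ID keeps its version suffix (e.g. "v1") in this sketch.
            "arxiv_id": url.rsplit("/abs/", 1)[-1],
            "pdf_url": pdf_url,
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "primary_category": primary.get("term") if primary is not None else "",
            "published": published,
            "updated": _text(entry, "atom:updated"),
        })
    return papers


# Placeholder feed for illustration only; not a real paper.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <updated>2024-01-02T00:00:00Z</updated>
    <published>2024-01-01T00:00:00Z</published>
    <title>A Placeholder
      Title</title>
    <summary>Placeholder abstract.</summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>
"""

papers = parse_arxiv_feed(SAMPLE_FEED)
```

Collapsing whitespace in titles and abstracts matters here because Atom feeds often wrap long text across lines.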

Input

- queries.md (keywords, excludes, time window)

Outputs

- papers/papers_raw.jsonl (JSONL; one paper per line)
  - Each record includes at least: title, authors, year, url, abstract.
  - In online mode (arXiv API), records also include: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment.
- Convenience index (optional, but generated by the script):
  - papers/papers_raw.csv
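
As a rough sketch of how these two outputs could be written, the snippet below emits one JSON object per line and a small CSV index. The function name `write_outputs`, the CSV file name papers_raw.csv, and the choice of index columns are assumptions for illustration; the real script may lay the index out differently.

```python
import csv
import json
from pathlib import Path


def write_outputs(records, workspace):
    """Write papers/papers_raw.jsonl and a CSV index under the workspace."""
    out_dir = Path(workspace) / "papers"
    out_dir.mkdir(parents=True, exist_ok=True)

    # JSONL: exactly one JSON object per line.
    jsonl_path = out_dir / "papers_raw.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # CSV index: assumed columns; fields not listed here are ignored.
    csv_path = out_dir / "papers_raw.csv"
    columns = ["title", "authors", "year", "url"]
    with csv_path.open("w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for rec in records:
            row = dict(rec)
            # Flatten the authors array into a single cell.
            row["authors"] = "; ".join(rec.get("authors", []))
            writer.writerow(row)
    return jsonl_path, csv_path
```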

Decision: online vs offline

- If you have network access: run arXiv API retrieval.
- If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
- Hybrid: if you import offline and gain network access later, enrich missing fields (abstract/authors/categories) via arXiv id_list, using --enrich-metadata or enrich_metadata: true in queries.md.
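
The hybrid path relies on the arXiv API's id_list parameter. A minimal sketch of how enrichment requests could be prepared is shown below; `needs_enrichment`, `build_id_list_urls`, and the batch size of 50 are assumptions, not the script's actual behavior. No network call is made here, only URL construction.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def needs_enrichment(record):
    """True if a record has an arXiv ID but lacks abstract, authors, or categories."""
    if not record.get("arxiv_id"):
        return False
    return not (record.get("abstract") and record.get("authors") and record.get("categories"))


def build_id_list_urls(arxiv_ids, batch_size=50):
    """Build one query URL per batch of IDs (batch size is an assumed value)."""
    urls = []
    for start in range(0, len(arxiv_ids), batch_size):
        batch = arxiv_ids[start:start + batch_size]
        params = {"id_list": ",".join(batch), "max_results": len(batch)}
        # Keep commas literal so the id_list stays readable in the URL.
        urls.append(f"{ARXIV_API}?{urlencode(params, safe=',')}")
    return urls


# "0000.00000" is a placeholder ID used only for illustration.
records = [
    {"arxiv_id": "2509.02547", "title": "t", "authors": [], "abstract": ""},
    {"arxiv_id": "0000.00000", "title": "t", "authors": ["A"], "abstract": "a", "categories": ["cs.AI"]},
    {"title": "no id", "authors": [], "abstract": ""},
]
to_fetch = [r["arxiv_id"] for r in records if needs_enrichment(r)]
urls = build_id_list_urls(to_fetch)
```

Records without an arXiv ID cannot be backfilled this way, which is one reason the enrichment is best effort.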

Workflow (heuristic)

1. Read queries.md and expand into concrete query strings.
2. Retrieve results (online) or import an export (offline).
3. Normalize every record to include at least:
   - title, authors (array), year, url, abstract
4. Keep the set broad at this stage; dedupe/ranking comes next.
5. Apply time window and max_results if specified.
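
The normalization and limiting steps above can be sketched as follows. This is an illustrative outline under stated assumptions: `normalize_record` and `apply_limits` are hypothetical names, the author-splitting rule is a simple heuristic, and keeping records with an unknown year is a design choice made here to keep the set broad.

```python
def normalize_record(raw):
    """Coerce a raw record into the minimum schema (title, authors, year, url, abstract)."""
    authors = raw.get("authors") or []
    if isinstance(authors, str):
        # Heuristic: prefer ';' as the separator, otherwise fall back to ','.
        sep = ";" if ";" in authors else ","
        authors = [a.strip() for a in authors.split(sep) if a.strip()]
    year_text = str(raw.get("year") or "")[:4]
    return {
        "title": " ".join(str(raw.get("title") or "").split()),
        "authors": list(authors),
        "year": int(year_text) if year_text.isdigit() else None,
        "url": str(raw.get("url") or "").strip(),
        "abstract": " ".join(str(raw.get("abstract") or "").split()),
    }


def apply_limits(records, year_from=None, year_to=None, max_results=None):
    """Filter by an inclusive year window, then cap the number of records."""
    kept = []
    for rec in records:
        year = rec.get("year")
        if year is not None:
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        # Records with an unknown year are kept in this sketch.
        kept.append(rec)
    return kept[:max_results] if max_results is not None else kept


raw_rows = [
    {"title": " Paper  A ", "authors": "A. Author; B. Author", "year": "2023-05-01", "url": "u1"},
    {"title": "Paper B", "authors": ["C. Author"], "year": 2019, "url": "u2"},
    {"title": "Paper C", "authors": "D. Author, E. Author", "year": "", "url": "u3"},
]
normalized = [normalize_record(r) for r in raw_rows]
windowed = apply_limits(normalized, year_from=2022, year_to=2025, max_results=10)
```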

Quality checklist

- [ ] papers/papers_raw.jsonl exists.
- [ ] Each line is valid JSON and contains title, authors, year, url.
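
The checklist can be automated with a short validator. The sketch below is one possible way to do it; `check_jsonl` is a hypothetical helper and is not part of scripts/run.py.

```python
import json

REQUIRED_FIELDS = ("title", "authors", "year", "url")


def check_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means the file passes."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                problems.append((lineno, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = [key for key in REQUIRED_FIELDS if key not in record]
            if missing:
                problems.append((lineno, "missing: " + ", ".join(missing)))
    return problems
```

Calling `check_jsonl("papers/papers_raw.jsonl")` from the workspace root raises FileNotFoundError if the file does not exist, which covers the first checklist item as well.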

Side effects

- Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.
- Not allowed: write prose sections in output/ before writing is approved.

Script

Quick Start

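- python scripts/run.py --help
- Online: python scripts/run.py --workspace <ws> --query <q> --max-results 200
- Offline import: python scripts/run.py --workspace <ws> --input <path>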

All Options

- --query <q>: repeatable; multiple queries are unioned
- --exclude <term>: repeatable; excludes are applied after retrieval
- --max-results <n>: cap on total results retrieved
- --input <path>: offline mode (CSV/JSON/JSONL)
- --enrich-metadata: best-effort enrichment via arXiv id_list (needs network)
- queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
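
For readers who want to see how such a flag set maps onto argparse, here is a hedged sketch. It mirrors the flags named in this document but is not the actual parser in scripts/run.py; in particular, treating --workspace as required and the default values are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run.py", description="arXiv search (metadata-first)")
    parser.add_argument("--workspace", required=True, help="workspace directory (assumed required)")
    # Repeatable flags accumulate into lists.
    parser.add_argument("--query", action="append", default=[], help="query string; repeatable")
    parser.add_argument("--exclude", action="append", default=[], help="exclude term; repeatable")
    parser.add_argument("--max-results", type=int, default=None, help="cap on total results")
    parser.add_argument("--input", default=None, help="offline export (CSV/JSON/JSONL)")
    parser.add_argument("--enrich-metadata", action="store_true", help="backfill via arXiv id_list")
    return parser


args = build_parser().parse_args(
    ["--workspace", "ws", "--query", "LLM agent", "--query", "tool use",
     "--exclude", "survey", "--max-results", "300"]
)
```

Using `action="append"` is what makes a flag repeatable, so two --query flags yield a two-element list that can then be unioned.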

Examples

- Online (multi-query + excludes):

- python scripts/run.py --workspace <ws> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300

- Fetch a single paper by arXiv ID (direct id_list fetch):

- python scripts/run.py --workspace <ws> --query 2509.02547 --max-results 1

- Offline auto-detect (no flags):

- Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python scripts/run.py --workspace <ws>

- Offline import + time window (via queries.md):

- In queries.md, set `- time window: { from: 2022, to: 2025 }`, then run the offline import normally
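
A small sketch of how the time window line could be read from queries.md is shown below. The regular expression and the function name `parse_time_window` are assumptions based only on the example line above; the script's real parser may accept other forms.

```python
import re

# Matches a line such as: - time window: { from: 2022, to: 2025 }
TIME_WINDOW_RE = re.compile(
    r"time window:\s*\{\s*from:\s*(\d{4})\s*,\s*to:\s*(\d{4})\s*\}",
    re.IGNORECASE,
)


def parse_time_window(queries_md_text):
    """Return (year_from, year_to) if a time window line is present, else None."""
    match = TIME_WINDOW_RE.search(queries_md_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


window = parse_time_window("- time window: { from: 2022, to: 2025 }\n")
```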

Troubleshooting

Common Issues

Issue: `papers/papers_raw.jsonl` is empty

Symptom:

- Script exits with “No results returned …” or output file is empty.

Causes:

- Network is blocked (online mode).
- Queries are too narrow or queries.md is empty.

Solutions:

- Use offline import: place papers/import.csv|json|jsonl in the workspace or pass --input.
- Broaden keywords and reduce excludes in queries.md.
- Run with explicit --query to sanity-check the parser.

Issue: Offline import records are missing fields

Symptom:

- Downstream steps fail because records are missing authors/year/abstract/url.

Causes:

- Export columns don’t match expected fields; upstream export is incomplete.

Solutions:

- Ensure the export contains at least title, authors, year, url, abstract.
- If you gain network access later, use --enrich-metadata to backfill missing fields (best effort).
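
Column mismatches can be detected before import with a simple alias map. The sketch below is an assumption-laden illustration: the alias table `COLUMN_ALIASES`, the helper names, and the sample header are all hypothetical and do not reflect the mappings scripts/run.py actually supports.

```python
import csv
import io

# Hypothetical aliases; extend to match the export you actually receive.
COLUMN_ALIASES = {
    "title": ("title", "paper title"),
    "authors": ("authors", "author"),
    "year": ("year", "publication year"),
    "url": ("url", "link"),
    "abstract": ("abstract", "summary"),
}


def map_columns(header):
    """Map each required field to the matching export column, case-insensitively."""
    lowered = {name.strip().lower(): name for name in header}
    mapping = {}
    for field, aliases in COLUMN_ALIASES.items():
        for alias in aliases:
            if alias in lowered:
                mapping[field] = lowered[alias]
                break
    return mapping


def read_export(csv_text):
    """Return (rows, missing_fields) for a CSV export given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = map_columns(reader.fieldnames or [])
    missing = [field for field in COLUMN_ALIASES if field not in mapping]
    rows = [{field: row[col] for field, col in mapping.items()} for row in reader]
    return rows, missing


sample = "Paper Title,Author,Year,Link\nExample,A. Author; B. Author,2024,https://example.org/p\n"
rows, missing = read_export(sample)
```

Reporting the missing fields up front makes it clear which ones would need backfilling via --enrich-metadata.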

Recovery Checklist

- [ ] Confirm queries.md has non-empty keywords (or pass --query).
- [ ] If offline: confirm workspace has papers/import.* and rerun.
- [ ] Spot-check 3–5 JSONL lines: valid JSON + required fields.
