ArXiv Search Collector

Use this skill when you want model-led query planning and model-led relevance filtering.

Core Principle

Scripts are tools. The model performs the reasoning and decisions:

1. Expand the original topic into multiple focused queries.
2. Run one fetch command per query.
3. Read each query result list and decide which indexes to keep.
4. Merge kept items and dedupe with one script.

Step 1: Initialize Run

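```bash
python3 scripts/init_collection_run.py \
  --output-root /path/to/data \
  --topic "LLM applications in Lean 4 formalization" \
  --keywords "Lean 4,LLM,formalization" \
  --categories cs.AI,cs.LO \
  --target-range 5-10 \
  --lookback 30d \
  --language English
```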

This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.

Language Parameter

- --language must be set manually for each collection run.
- Use the same language value across all collector scripts for consistency.
- If --language is non-English (for example Chinese), generated markdown files are written in that language:
  - task_meta.md
  - query_results/<label>.md
  - <arxiv_id>/metadata.md
  - papers_index.md

Query Writing Requirements

Follow these rules before running per-query fetch:

1. Determine query count from final target range.

- Prefer 3 queries for small/medium targets (2-5, 5-10).
- Prefer 4 queries for larger targets (10-50 or above).
- Avoid writing too many low-quality queries.

2. Allocate target budget to each query, then oversample.

- Let target_max be the upper bound of the target range.
- Compute target_per_query = ceil(target_max / query_count).
- Fetch each query with max_results = target_per_query * 2 (or * 3 when recall is more important).
- Example: target 5-10, query count 3 -> target_per_query = 4 -> each query fetches 8-12.

3. Keep one original-theme query, then add normalized/synonym expansions.

- Query 1 keeps the original topic wording.
- Remaining queries use normalized terms and close synonyms.
- Prefer concise noun phrases that match arXiv indexing behavior.

4. Use OR inside the same semantic group (synonyms), and AND across groups.

- Same-group synonyms should be connected with OR to increase recall.
  - Example group A (model terms): LLM OR "large language model" OR AI.
  - Example group B (Lean terms): "Lean 4" OR Lean OR "formal language".
- Different semantic groups should be connected with AND to keep relevance.
  - Example: (LLM-group) AND (Lean-group).
- Recommended pattern:
  - (<domain terms joined with OR>) AND (<method/model terms joined with OR>) [AND <optional constraint terms>]
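The OR-within-group, AND-across-groups rule can be sketched as a small helper. This is only an illustration, not part of the collector scripts: the names format_term and build_query are hypothetical, and the choice to quote multi-word terms and prefix every term with all: is an assumption based on the examples in this document.

```python
def format_term(term: str, field: str = "all") -> str:
    """Prefix a term with a field and quote it if it contains spaces."""
    term = term.strip()
    quoted = f'"{term}"' if " " in term else term
    return f"{field}:{quoted}"


def build_query(groups: list[list[str]], field: str = "all") -> str:
    """Join synonyms with OR inside each group and join groups with AND."""
    clauses = []
    for group in groups:
        terms = [format_term(t, field) for t in group if t.strip()]
        if not terms:
            continue  # skip empty groups instead of emitting "()"
        joined = " OR ".join(terms)
        # Parentheses are only needed when a group has several terms.
        clauses.append(f"({joined})" if len(terms) > 1 else joined)
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_query([
        ["Lean 4", "Lean", "formal language"],       # group B (Lean terms)
        ["LLM", "large language model", "AI"],       # group A (model terms)
    ]))
```

Keeping each semantic group as its own list makes it easy to broaden recall later by appending a synonym to one group without touching the AND constraints.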

Query Examples (arXiv API-ready)

Theme A: LLM applications in Lean 4 formalization

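- all:LLM applications in Lean 4 formalization
- (all:"Lean 4" OR all:Lean OR all:"formal language") AND (all:LLM OR all:"large language model" OR all:AI)
- (all:Lean OR all:formalization) AND (all:LLM OR all:"large language model") AND all:"theorem proving"
- (all:Lean OR all:"proof assistant") AND (all:AI OR all:LLM)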

Theme B: agentic tool use for code generation

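- all:agentic tool use code generation
- (all:agentic OR all:"autonomous agent") AND (all:LLM OR all:"large language model")
- (all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")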

Theme C: multimodal reasoning with retrieval

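- all:multimodal reasoning retrieval
- (all:multimodal OR all:"vision language") AND (all:retrieval OR all:RAG)
- (all:"multimodal model" OR all:"vision language model") AND (all:reasoning OR all:"tool use")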

Step 2: Fetch One Query at a Time

Model defines queries manually, for example:

- all:"Lean 4"
- all:LLM formalization
- all:AI formal verification

Recommended batch mode (safe defaults, serial execution):

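```bash
python3 scripts/fetch_queries_batch.py \
  --run-dir /path/to/run-dir \
  --plan-json /path/to/query_plan.json
```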

In batch mode, the script auto-applies:

- serial API calls
- --min-interval-sec 5
- --retry-max 4
- --retry-base-sec 5
- --retry-max-sec 120
- --retry-jitter-sec 1
- per-run rate-state file (<run_dir>/.runtime/arxiv_api_state.json) for throttling
- auto max_results from target_range and query count (default oversample x2, cap 60)
- default language/categories from task_meta.json
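The automatic max_results value can be approximated with the budget rule from the query writing requirements. The sketch below is an assumption about how such a calculation might look, not the batch script's actual code. The function name auto_max_results is hypothetical, and the target range is assumed to be a "min-max" string such as 5-10.

```python
import math


def auto_max_results(target_range: str, query_count: int,
                     oversample: int = 2, cap: int = 60) -> int:
    """Derive a per-query max_results from a 'min-max' target range."""
    if query_count < 1 or oversample < 1:
        raise ValueError("query_count and oversample must be >= 1")
    # The upper bound of the range is the budget to split across queries.
    target_max = int(target_range.split("-")[-1])
    target_per_query = math.ceil(target_max / query_count)
    # Oversample so the model has room to filter, but never exceed the cap.
    return min(cap, target_per_query * oversample)


if __name__ == "__main__":
    print(auto_max_results("5-10", 3))                 # 8
    print(auto_max_results("5-10", 3, oversample=3))   # 12
```

With a target of 5-10 and three queries this reproduces the worked example: 4 papers per query, fetched as 8 (x2) or 12 (x3).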

Minimal query_plan.json only needs label and query.
See references/query-plan-format.md.
You normally do not need to set fetch-control args manually.

If you need one-by-one manual fetch, run each query:

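```bash
python3 scripts/fetch_query_metadata.py \
  --run-dir /path/to/run-dir \
  --label lean4 \
  --query 'all:"Lean 4"' \
  --max-results 30 \
  --min-interval-sec 5 \
  --retry-max 4 \
  --language English
```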

Output files:

- query_results/<label>.json (indexed full metadata list)
- query_results/<label>.md (human-readable preview)

Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...].
No second local date-filter pass is performed.
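As an illustration of the date constraint, the sketch below appends a submittedDate range to a query string. It assumes the arXiv API timestamp format YYYYMMDDHHMM in GMT and a lookback expressed in days. The helper name with_submitted_date is hypothetical, and the real script may build the clause differently.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def with_submitted_date(query: str, lookback_days: int,
                        now: Optional[datetime] = None) -> str:
    """Wrap a query and AND it with a submittedDate:[start TO end] clause."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=lookback_days)
    fmt = "%Y%m%d%H%M"  # assumed arXiv format: date plus 24-hour time
    date_clause = f"submittedDate:[{start.strftime(fmt)} TO {end.strftime(fmt)}]"
    return f"({query}) AND {date_clause}"


if __name__ == "__main__":
    fixed_now = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)
    print(with_submitted_date('all:"Lean 4"', 30, now=fixed_now))
```

Because the range is part of search_query, the API already returns only papers inside the window, which is why no local date filter is needed afterwards.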

Rate-limit controls in fetch_query_metadata.py:

- --min-interval-sec (default 5.0)
- --retry-max (default 4)
- --retry-base-sec (default 5.0)
- --retry-max-sec (default 120.0)
- --retry-jitter-sec (default 1.0)
- --rate-state-path (optional override; default is <run_dir>/.runtime/arxiv_api_state.json)
- --force to bypass cache and re-fetch
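To show how these retry flags could interact, here is a sketch of a capped backoff schedule with jitter. The doubling rule is an assumption: the document names the flags and their defaults but does not state the exact formula used by fetch_query_metadata.py. The function name retry_delays is hypothetical.

```python
import random
from typing import List, Optional


def retry_delays(retry_max: int = 4, retry_base_sec: float = 5.0,
                 retry_max_sec: float = 120.0, retry_jitter_sec: float = 1.0,
                 rng: Optional[random.Random] = None) -> List[float]:
    """Return the wait time before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(retry_max):
        # Assumed policy: double the base delay per attempt, up to the cap.
        backoff = min(retry_max_sec, retry_base_sec * (2 ** attempt))
        # Jitter spreads retries out so repeated runs do not align.
        delays.append(backoff + rng.uniform(0.0, retry_jitter_sec))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(retry_delays(), start=1):
        print(f"retry {i}: wait about {delay:.1f}s")
```

Under this assumed policy the defaults give waits of roughly 5, 10, 20, and 40 seconds, each with up to one extra second of jitter.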

Step 3: Model Filters Relevance

For each query list, the model reads indexed results and decides what to keep.

Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.

Step 4: Merge and Dedupe

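```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --keep lean4:0,2,4 \
  --keep llm-formalization:1,3 \
  --language English
```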

or with selection-json:

```json
{
  "lean4-round1": [0, 2, 4],
  "lean4-round2": [],
  "formalization-round2": [1, 3, 5]
}
```

An empty list means this query label is intentionally dropped (keep 0).

This writes final outputs:

- <arxiv_id>/metadata.json
- <arxiv_id>/metadata.md
- papers_index.json
- papers_index.md
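The merge-and-dedupe step can be pictured as follows. This is a minimal sketch, not the merge script itself: the function name merge_selected is hypothetical, the arxiv_id field name is assumed (the real schema is in references/io-contract.md), and the sample IDs are made up.

```python
from typing import Dict, List


def merge_selected(results_by_label: Dict[str, List[dict]],
                   keep_by_label: Dict[str, List[int]]) -> List[dict]:
    """Collect kept items per label and drop repeated arXiv IDs."""
    merged: Dict[str, dict] = {}
    for label, keep_indexes in keep_by_label.items():
        items = results_by_label.get(label, [])
        for index in keep_indexes:
            if not 0 <= index < len(items):
                raise IndexError(f"{label}: index {index} is out of range")
            paper = items[index]
            # The first occurrence wins; later duplicates are ignored.
            merged.setdefault(paper["arxiv_id"], paper)
    return list(merged.values())


if __name__ == "__main__":
    results = {
        "lean4": [{"arxiv_id": "id-a"}, {"arxiv_id": "id-b"}],
        "llm-formalization": [{"arxiv_id": "id-b"}, {"arxiv_id": "id-c"}],
    }
    keep = {"lean4": [0, 1], "llm-formalization": [0, 1]}
    print([p["arxiv_id"] for p in merge_selected(results, keep)])
```

A label mapped to an empty list contributes nothing, which matches the keep-0 behavior described for dropped queries.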

Step 5: Iterative Retry Loop (Incremental)

If relevance is weak or final count is insufficient after Step 4, iterate:

1. Review papers_index.md and per-paper metadata quality.
2. Adjust the query plan (usually broaden with additional synonym OR terms while keeping cross-group AND constraints).
3. Fetch additional query results with new labels.
4. Re-run merge in incremental mode:

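```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --incremental \
  --selection-json /path/to/updated_selection.json \
  --language English
```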

Incremental behavior:

- Previous label selections are loaded from query_selection/selected_by_query.json.
- Labels provided in the new selection-json override previous selections for those labels.
- New labels can be added.
- Old labels can be dropped by setting [].
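The override rules above behave like a dictionary update. The sketch below illustrates that reading. The function names merge_selections and active_labels are hypothetical and the label names are examples only. It is not the merge script's implementation.

```python
from typing import Dict, List


def merge_selections(previous: Dict[str, List[int]],
                     updated: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Apply new per-label selections on top of previously saved ones."""
    combined = {label: list(indexes) for label, indexes in previous.items()}
    for label, indexes in updated.items():
        # A label in the update replaces the old list; [] means keep 0.
        combined[label] = list(indexes)
    return combined


def active_labels(selection: Dict[str, List[int]]) -> List[str]:
    """Labels that still contribute at least one paper."""
    return [label for label, indexes in selection.items() if indexes]


if __name__ == "__main__":
    previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1]}
    updated = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}
    combined = merge_selections(previous, updated)
    print(combined)
    print(active_labels(combined))
```

Labels absent from the update keep their earlier selection, so each retry only needs to list the labels that changed.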

Stop retrying when:

- relevance is acceptable, or
- additional broadened queries mainly add low-relevance papers.

If relevant papers are genuinely scarce, it is valid to finish below the original minimum target range.

Notes

- Keep API concurrency conservative by controlling query count and --max-results.
- Keep per-query fetch serial (no parallel API calls in Stage A).
- Reuse cache by default for identical query/date/request settings; only use --force when necessary.
- Prefer the default run-local rate-state file so all steps in the same run share one cooldown/throttling state.
- If the arXiv API returns 429 Too Many Requests, retry later and/or increase --min-interval-sec.
- Prefer explicit, narrow queries and let the model filter aggressively.
- Use references/io-contract.md for exact files and schema.

Related Skills

This skill is a sub-skill of arxiv-summarizer-orchestrator.

Pipeline position:

1. Step 1 (collection): arxiv-search-collector (this skill)
2. Step 2 (per-paper processing)
3. Step 3 (batch reporting)

This skill produces the initial paper-set structure and metadata that Stage B and Stage C depend on.

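ArXiv Search Collector

Use this skill when you want model-led query planning and model-led relevance filtering.

Core Principle

Scripts are tools. The model performs the reasoning and decisions:

1. Expand the original topic into multiple focused queries.
2. Run one fetch command per query.
3. Read each query result list and decide which indexes to keep.
4. Merge kept items and dedupe with one script.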

Step 1: Initialize Run

```bash
python3 scripts/init_collection_run.py \
  --output-root /path/to/data \
  --topic "LLM applications in Lean 4 formalization" \
  --keywords "Lean 4,LLM,formalization" \
  --categories cs.AI,cs.LO \
  --target-range 5-10 \
  --lookback 30d \
  --language English
```

This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.

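Language Parameter

- --language must be set manually for each collection run.
- Use the same language value across all collector scripts for consistency.
- If --language is non-English (for example Chinese), generated markdown files are written in that language:
  - task_meta.md
  - query_results/<label>.md
  - <arxiv_id>/metadata.md
  - papers_index.md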

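Query Writing Requirements

Follow these rules before running per-query fetch:

1. Determine query count from final target range.

- Prefer 3 queries for small/medium targets (2-5, 5-10).
- Prefer 4 queries for larger targets (10-50 or above).
- Avoid writing too many low-quality queries.

2. Allocate target budget to each query, then oversample.

- Let target_max be the upper bound of the target range.
- Compute target_per_query = ceil(target_max / query_count).
- Fetch each query with max_results = target_per_query * 2 (or * 3 when recall is more important).
- Example: target 5-10, query count 3 -> target_per_query = 4 -> each query fetches 8-12 results.

3. Keep one original-theme query, then add normalized/synonym expansions.

- Query 1 keeps the original topic wording.
- Remaining queries use normalized terms and close synonyms.
- Prefer concise noun phrases that match arXiv indexing behavior.

4. Use OR inside the same semantic group (synonyms), and AND across groups.

- Same-group synonyms should be connected with OR to increase recall.
  - Example group A (model terms): LLM OR "large language model" OR AI.
  - Example group B (Lean terms): "Lean 4" OR Lean OR "formal language".
- Different semantic groups should be connected with AND to keep relevance.
  - Example: (LLM-group) AND (Lean-group).
- Recommended pattern:
  - (<domain terms joined with OR>) AND (<method/model terms joined with OR>) [AND <optional constraint terms>]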

Query Examples (arXiv API-ready)

Theme A: LLM applications in Lean 4 formalization

- all:LLM applications in Lean 4 formalization
- (all:"Lean 4" OR all:Lean OR all:"formal language") AND (all:LLM OR all:"large language model" OR all:AI)
- (all:Lean OR all:formalization) AND (all:LLM OR all:"large language model") AND all:"theorem proving"
- (all:Lean OR all:"proof assistant") AND (all:AI OR all:LLM)

Theme B: agentic tool use for code generation

- all:agentic tool use code generation
- (all:agentic OR all:"autonomous agent") AND (all:LLM OR all:"large language model")
- (all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")

Theme C: multimodal reasoning with retrieval

- all:multimodal reasoning retrieval
- (all:multimodal OR all:"vision language") AND (all:retrieval OR all:RAG)
- (all:"multimodal model" OR all:"vision language model") AND (all:reasoning OR all:"tool use")

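Step 2: Fetch One Query at a Time

Model defines queries manually, for example:

- all:"Lean 4"
- all:LLM formalization
- all:AI formal verification

Recommended batch mode (safe defaults, serial execution):

```bash
python3 scripts/fetch_queries_batch.py \
  --run-dir /path/to/run-dir \
  --plan-json /path/to/query_plan.json
```

In batch mode, the script auto-applies:

- serial API calls
- --min-interval-sec 5
- --retry-max 4
- --retry-base-sec 5
- --retry-max-sec 120
- --retry-jitter-sec 1
- per-run rate-state file (<run_dir>/.runtime/arxiv_api_state.json) for throttling
- auto max_results from target_range and query count (default oversample x2, cap 60)
- default language/categories from task_meta.json

Minimal query_plan.json only needs label and query.
See references/query-plan-format.md.
You normally do not need to set fetch-control args manually.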

If you need one-by-one manual fetch, run each query:

```bash
python3 scripts/fetch_query_metadata.py \
  --run-dir /path/to/run-dir \
  --label lean4 \
  --query 'all:"Lean 4"' \
  --max-results 30 \
  --min-interval-sec 5 \
  --retry-max 4 \
  --language English
```

Output files:

- query_results/<label>.json (indexed full metadata list)
- query_results/<label>.md (human-readable preview)

Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...].
No second local date-filter pass is performed.

Rate-limit controls in fetch_query_metadata.py:

- --min-interval-sec (default 5.0)
- --retry-max (default 4)
- --retry-base-sec (default 5.0)
- --retry-max-sec (default 120.0)
- --retry-jitter-sec (default 1.0)
- --rate-state-path (optional override; default is <run_dir>/.runtime/arxiv_api_state.json)
- --force to bypass cache and re-fetch

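Step 3: Model Filters Relevance

For each query list, the model reads indexed results and decides what to keep.

Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.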

Step 4: Merge and Dedupe

```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --keep lean4:0,2,4 \
  --keep llm-formalization:1,3 \
  --language English
```

or with selection-json:

```json
{
  "lean4-round1": [0, 2, 4],
  "lean4-round2": [],
  "formalization-round2": [1, 3, 5]
}
```

An empty list means this query label is intentionally dropped (keep 0).

This writes final outputs:

- <arxiv_id>/metadata.json
- <arxiv_id>/metadata.md
- papers_index.json
- papers_index.md

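Step 5: Iterative Retry Loop (Incremental)

If relevance is weak or final count is insufficient after Step 4, iterate:

1. Review papers_index.md and per-paper metadata quality.
2. Adjust the query plan (usually broaden with additional synonym OR terms while keeping cross-group AND constraints).
3. Fetch additional query results with new labels.
4. Re-run merge in incremental mode:

```bash
python3 scripts/merge_selected_papers.py \
  --run-dir /path/to/run-dir \
  --incremental \
  --selection-json /path/to/updated_selection.json \
  --language English
```

Incremental behavior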

arxiv-search-collector: a model-led arXiv search tool