Model-driven arXiv retrieval workflow for building a paper set with a manual language parameter: initialize a run, fetch metadata for each model-designed query, let the model filter irrelevant items per query by keep indexes, then merge and dedupe into per-paper metadata directories. Use when query planning and relevance filtering should be done by the model, not rule-based heuristics.
3. Keep one original-theme query, then add normalized/synonym expansions.
- Query 1 keeps original topic wording.
- Remaining queries use normalized terms and close synonyms.
- Prefer concise noun phrases that match arXiv indexing behavior.
4. Use OR inside the same semantic group (synonyms), and AND across groups.
- Same-group synonyms should be connected with OR to increase recall.
- Example group A (model terms): LLM OR "large language model" OR AI.
- Example group B (Lean terms): "Lean 4" OR Lean OR "formal language".
- Different semantic groups should be connected with AND to keep relevance.
- Example: (LLM-group) AND (Lean-group).
- Recommended pattern:
- INLINECODE31
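
The OR-within-group, AND-across-groups rule above can be sketched as a small helper. This is only an illustration: `build_query` and `quote_term` are hypothetical names, and the use of the arXiv `all:` field prefix is an assumption rather than the pattern this workflow necessarily recommends.

```python
def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    term = term.strip()
    if " " in term and not term.startswith('"'):
        return f'"{term}"'
    return term


def build_query(groups: list[list[str]], field: str = "all") -> str:
    """OR the synonyms inside each group, then AND the groups together.

    `field` is an arXiv field prefix such as all, ti, or abs (assumed default: all).
    """
    clauses = []
    for group in groups:
        terms = [f"{field}:{quote_term(t)}" for t in group if t.strip()]
        if terms:  # skip empty groups instead of emitting "()"
            clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


# Groups taken from the examples above.
model_terms = ["LLM", "large language model", "AI"]
lean_terms = ["Lean 4", "Lean", "formal language"]
print(build_query([model_terms, lean_terms]))
```

Adding a synonym to a group widens recall without touching the other group, which is why broadening is usually done inside a group rather than by removing an AND.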
Query Examples (arXiv API-ready)
Theme A: LLM applications in Lean 4 formalization
- INLINECODE33
- INLINECODE34
- INLINECODE35
- INLINECODE36
Theme B: agentic tool use for code generation
- INLINECODE38
- INLINECODE39
- INLINECODE40
Theme C: multimodal reasoning with retrieval
- INLINECODE42
- INLINECODE43
- INLINECODE44
Step 2: Fetch One Query at a Time
The model defines queries manually, for example:
- INLINECODE45
- INLINECODE46
- INLINECODE47
Recommended batch mode (safe defaults, serial execution):
CODEBLOCK1
In batch mode, the script auto-applies:
- serial API calls
- INLINECODE48
- INLINECODE49
- INLINECODE50
- INLINECODE51
- INLINECODE52
- per-run rate-state file (<run_dir>/.runtime/arxiv_api_state.json) for throttling
- auto max_results from target_range and query count (default oversample x2, cap 60)
- default language/categories from INLINECODE58
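
The exact formula behind the auto max_results setting is not spelled out here. As a rough sketch of one plausible reading (each query gets its share of the target's upper bound, oversampled x2 and capped at 60), a hypothetical helper `auto_max_results` might look like this; the rounding and the use of the upper bound are assumptions:

```python
import math


def auto_max_results(target_upper: int, query_count: int,
                     oversample: float = 2.0, cap: int = 60) -> int:
    """Assumed formula: ceil(target_upper * oversample / query_count), clamped to [1, cap]."""
    if query_count < 1:
        raise ValueError("query_count must be at least 1")
    per_query = math.ceil(target_upper * oversample / query_count)
    return max(1, min(cap, per_query))


print(auto_max_results(target_upper=50, query_count=5))   # 50 * 2 / 5 = 20
print(auto_max_results(target_upper=200, query_count=2))  # 200 would exceed the cap, so 60
```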
Minimal query_plan.json only needs label and query.
See references/query-plan-format.md.
You normally do not need to set fetch-control args manually.
If you need one-by-one manual fetch, run each query:
CODEBLOCK2
Output files:
- query_results/<label>.json (indexed full metadata list)
- INLINECODE64 (human-readable preview)
Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...].
No second local date-filter pass is performed.
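
To illustrate how a date range can live inside search_query itself, here is a minimal sketch. It assumes the arXiv API's `submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]` form (GMT) and the public `export.arxiv.org/api/query` endpoint; `with_date_range` and `build_url` are hypothetical helpers, not functions from fetch_query_metadata.py.

```python
from datetime import date
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def with_date_range(query: str, start: date, end: date) -> str:
    """Append a submittedDate clause covering start 00:00 through end 23:59."""
    lo = start.strftime("%Y%m%d") + "0000"
    hi = end.strftime("%Y%m%d") + "2359"
    return f"({query}) AND submittedDate:[{lo} TO {hi}]"


def build_url(search_query: str, max_results: int = 20) -> str:
    """Build the request URL; urlencode handles spaces, quotes, and brackets."""
    params = {"search_query": search_query, "start": 0, "max_results": max_results}
    return ARXIV_API + "?" + urlencode(params)


q = with_date_range('all:LLM AND all:"Lean 4"', date(2024, 1, 1), date(2024, 12, 31))
print(q)
print(build_url(q, max_results=20))
```

Because the server applies the range, every returned entry already falls inside it, which is why no local date filter is needed afterwards.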
Rate-limit controls in fetch_query_metadata.py:
- --min-interval-sec (default 5.0)
- INLINECODE70 (default 4)
- INLINECODE72 (default 5.0)
- INLINECODE74 (default 120.0)
- INLINECODE76 (default 1.0)
- INLINECODE78 (optional override; default is <run_dir>/.runtime/arxiv_api_state.json)
- INLINECODE80 to bypass cache and re-fetch
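
For intuition about how a minimum interval and a rate-state file can work together, the following is a minimal sketch and not the script's actual implementation. The state-file schema (a single `last_request_ts` key) and the helper name `wait_for_slot` are assumptions; the clock and sleep functions are injectable so the logic can be exercised without real delays.

```python
import json
import time
from pathlib import Path


def wait_for_slot(state_path: Path, min_interval_sec: float = 5.0,
                  now=time.time, sleep=time.sleep) -> float:
    """Sleep until min_interval_sec has passed since the last recorded request.

    Returns the number of seconds waited. The JSON layout is hypothetical.
    """
    last = 0.0
    if state_path.exists():
        try:
            last = float(json.loads(state_path.read_text()).get("last_request_ts", 0.0))
        except (ValueError, TypeError, AttributeError):
            last = 0.0  # unreadable state: behave as if no request was made yet
    wait = max(0.0, last + min_interval_sec - now())
    if wait > 0:
        sleep(wait)
    # Record the time of the request that is about to be sent.
    state_path.parent.mkdir(parents=True, exist_ok=True)
    state_path.write_text(json.dumps({"last_request_ts": now()}))
    return wait
```

Keeping the timestamp on disk rather than in memory lets separate invocations within one run share the same throttle.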
Step 3: Model Filters Relevance
For each query list, the model reads indexed results and decides what to keep.
Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.
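
As a sketch of how keep specs might be applied to one query's indexed results: integers are treated as list indexes and strings as arXiv IDs. The 0-based indexing, the `arxiv_id` field name, and the helper `apply_keep` are all assumptions made for illustration; check the actual result files for the real indexing convention.

```python
def apply_keep(results: list[dict], keep: list) -> list[dict]:
    """Return the results selected by index (int) and/or arXiv ID (str).

    An empty keep list selects nothing, which drops the query entirely.
    """
    keep_indexes = {k for k in keep if isinstance(k, int)}
    keep_ids = {k for k in keep if isinstance(k, str)}
    return [
        item
        for i, item in enumerate(results)
        if i in keep_indexes or item.get("arxiv_id") in keep_ids
    ]


results = [
    {"arxiv_id": "2401.00001", "title": "Paper A"},
    {"arxiv_id": "2401.00002", "title": "Paper B"},
    {"arxiv_id": "2401.00003", "title": "Paper C"},
]
print(apply_keep(results, [0, "2401.00003"]))  # keeps Paper A and Paper C
print(apply_keep(results, []))                 # keeps nothing
```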
Step 4: Merge and Dedupe
CODEBLOCK3
or with selection-json:
CODEBLOCK4
An empty list means this query label is intentionally dropped (keep 0).
This writes final outputs:
- INLINECODE84
- INLINECODE85
- INLINECODE86
- INLINECODE87
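
How the merge script identifies duplicates is not described here. One common approach, sketched below under that caveat, is to key each paper by its arXiv ID with the version suffix removed and to remember which query labels found it. `base_id`, `merge_dedupe`, and the `arxiv_id` / `source_labels` field names are hypothetical.

```python
import re


def base_id(arxiv_id: str) -> str:
    """Strip a trailing version suffix, e.g. '2401.01234v2' -> '2401.01234'."""
    return re.sub(r"v\d+$", "", arxiv_id.strip())


def merge_dedupe(selected_by_query: dict[str, list[dict]]) -> list[dict]:
    """Merge per-query selections into one list with a single entry per paper."""
    merged: dict[str, dict] = {}
    for label, papers in selected_by_query.items():
        for paper in papers:
            key = base_id(paper["arxiv_id"])
            # First occurrence wins; later ones only contribute their label.
            entry = merged.setdefault(key, {**paper, "arxiv_id": key, "source_labels": []})
            if label not in entry["source_labels"]:
                entry["source_labels"].append(label)
    return list(merged.values())


selected = {
    "q1": [{"arxiv_id": "2401.01234v1", "title": "Paper A"}],
    "q2": [{"arxiv_id": "2401.01234v2", "title": "Paper A"},
           {"arxiv_id": "2402.05678", "title": "Paper B"}],
}
for paper in merge_dedupe(selected):
    print(paper["arxiv_id"], paper["source_labels"])
```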
Step 5: Iterative Retry Loop (Incremental)
If relevance is weak or the final count is insufficient after Step 4, iterate:
1. Review papers_index.md and per-paper metadata quality.
2. Adjust the query plan (usually broaden with additional synonym OR terms while keeping cross-group AND constraints).
3. Fetch additional query results with new labels.
4. Re-run merge in incremental mode:
CODEBLOCK5
Incremental behavior:
- Previous label selections are loaded from query_selection/selected_by_query.json.
- Labels provided in the new selection-json override previous selections for those labels.
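
The override rule can be pictured as a label-keyed dictionary update. The sketch below assumes both files map query labels to keep lists; `merge_selections` is a hypothetical helper, not the merge script's own code.

```python
def merge_selections(previous: dict[str, list], new: dict[str, list]) -> dict[str, list]:
    """Labels in `new` replace the same labels in `previous`; all others carry over."""
    merged = dict(previous)
    merged.update(new)
    return merged


previous = {"q1_original": [0, 2, 5], "q2_synonyms": [1, 3]}
new = {"q2_synonyms": [], "q3_broader": [0, 4]}  # empty list drops q2_synonyms
print(merge_selections(previous, new))
```

Since an empty list is itself a selection, passing one for an old label drops that query rather than leaving its earlier picks in place.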