Tavily arXiv Paper Fetch Pipeline
Orchestrate a title-to-arXiv metadata workflow for: $ARGUMENTS
Overview
This skill turns a list of paper titles into a reproducible arXiv lookup result:
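```text
paper titles
  -> Tavily-only search
  -> canonical arXiv abs URL
  -> local arXiv metadata fetch
  -> JSONL work log
  -> markdown report
```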
The goal is not just to answer once. The goal is to leave behind a reusable work folder that can be resumed or consumed by another workflow.
Inputs
Extract or ask for:
- `TITLE_LIST`: one or more paper titles
- `WORKDIR`: optional working folder
If `WORKDIR` is omitted, default to `0_docs/tavily_arxiv_paper_fetech`.
Accepted title formats:
- a single title
- multiple titles separated by newlines
- markdown bullet lists
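The accepted formats can all be reduced to one clean list before anything is written to disk. The following is a minimal sketch of that normalization; the function name `normalize_titles` and the exact cleanup rules (stripping `-`, `*`, or `+` bullets and collapsing whitespace) are assumptions, not part of the skill itself.

```python
import re

# Assumed cleanup rule: a leading markdown bullet marker followed by whitespace.
_BULLET = re.compile(r"^\s*[-*+]\s+")


def normalize_titles(raw: str) -> list:
    """Turn a single title, newline-separated titles, or a bullet list into a clean list."""
    titles = []
    for line in raw.splitlines():
        title = _BULLET.sub("", line)          # drop a leading bullet marker, if any
        title = re.sub(r"\s+", " ", title).strip()  # collapse runs of whitespace
        if title:                               # skip blank lines
            titles.append(title)
    return titles
```

A title such as `RT-2: ...` is left intact because the bullet pattern requires whitespace after the marker.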
Constants
- `TAVILY_ONLY = true`: Tavily is the only allowed search mechanism
- `SEQUENTIAL_MODE = true`: process titles one by one to avoid hitting rate limits
- `NO_FALLBACK_SEARCH = true`: never switch to arXiv API search, guessed arXiv URLs, or another provider
- `LOW_CONTEXT_MODE = true`: append one JSON line per title, then render markdown from the JSONL
Work Folder Layout
Write all outputs under WORKDIR:
| Artifact | Purpose |
|---|---|
| `input_titles.md` | normalized input title list |
| `paper_fetches.jsonl` | one JSON line per processed title |
| `paper_fetch_report.md` | rendered report from the JSONL |
Execution Rule
Process titles in order. Do not parallelize Tavily calls.
If Tavily rate limits:
- wait briefly
- retry Tavily only
- if it still fails, record the error and continue
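The rate-limit rule amounts to a bounded retry around a single search call, with the failure recorded rather than raised. Here is a rough sketch under stated assumptions: `search` stands in for whatever Tavily-backed callable is available, `RateLimited` is a hypothetical exception that wrapper would raise, and the retry count and wait time are illustrative defaults.

```python
import time
from typing import Callable, Optional, Tuple


class RateLimited(Exception):
    """Hypothetical error raised by the search wrapper when Tavily rate limits."""


def search_with_retry(
    search: Callable[[str], list],
    query: str,
    retries: int = 2,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[list], Optional[str]]:
    """Return (results, error). Only the same Tavily callable is retried; no other provider is tried."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return search(query), None
        except RateLimited as exc:
            last_error = str(exc) or "rate limited"
            if attempt < retries:
                sleep(wait_seconds)  # wait briefly before the next Tavily attempt
    # Still failing: hand the error back so the caller can log it and move on.
    return None, last_error
```

Calling this once per title inside a plain `for` loop keeps the processing sequential.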
Phase 0: Initialize
1. Normalize the input title list.
2. Create `WORKDIR`.
3. Write the normalized title list to `WORKDIR/input_titles.md`.
Phase 1: Tavily resolution
For each title:
1. Search Tavily with:
   ```text
   <paper title> arXiv
   ```
2. Prefer results in this order:
   - `https://arxiv.org/abs/...`
   - `https://arxiv.org/html/...`
   - `https://arxiv.org/pdf/...`
3. Accept a result only when the match is reliable:
   - exact title match after minor normalization
   - same title with only punctuation, Unicode, or math-formatting differences
   - arXiv page clearly shows the same paper title
4. If uncertain, record `no_match` instead of guessing.
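The preference order and the matching rule above can be sketched as a small filter over search results. This is an illustration only: it assumes each result is a dict with `title` and `url` keys, and the normalization choices (Unicode NFKC folding, case folding, treating punctuation as spaces, dropping a leading `[id]` prefix such as the one arXiv page titles carry) are one reasonable reading of "minor normalization", not a rule stated by the skill.

```python
import re
import unicodedata
from typing import Optional

# Preference order: abs first, then html, then pdf.
_URL_RANK = ("arxiv.org/abs/", "arxiv.org/html/", "arxiv.org/pdf/")
# A leading "[2301.01234]" style prefix, as seen in arXiv page titles.
_ID_PREFIX = re.compile(r"^\s*\[\d{4}\.\d{4,5}(?:v\d+)?\]\s*")


def normalize_title(title: str) -> str:
    """Fold Unicode forms, case, punctuation, and whitespace so cosmetic differences vanish."""
    text = _ID_PREFIX.sub("", title)
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation and math delimiters become spaces
    return " ".join(text.split())


def url_rank(url: str) -> Optional[int]:
    """Return 0 for abs, 1 for html, 2 for pdf, or None for non-arXiv URLs."""
    for rank, marker in enumerate(_URL_RANK):
        if marker in url:
            return rank
    return None


def pick_match(input_title: str, results: list) -> Optional[str]:
    """Return the most preferred arXiv URL whose title matches, or None (record no_match)."""
    wanted = normalize_title(input_title)
    candidates = []
    for result in results:
        rank = url_rank(result.get("url", ""))
        if rank is not None and normalize_title(result.get("title", "")) == wanted:
            candidates.append((rank, result["url"]))
    return min(candidates)[1] if candidates else None
```

Returning `None` rather than the closest candidate keeps the "never guess" rule explicit in the control flow.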
Phase 2: Local arXiv fetch
Once a reliable arXiv URL is known, run:
```bash
python3 .cursor/skills/tavily-arxiv-paper-fetech/scripts/fetch_arxiv_abs.py
```
This returns compact JSON with:
- canonical abs URL
- arXiv id
- title
- authors
- abstract
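To illustrate what a canonical abs URL and arXiv id look like, the sketch below derives both from any of the three accepted URL shapes. It is not the helper script's implementation; the function name `canonical_abs_url` and the choice to drop the version suffix are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Matches abs/html/pdf URLs with new-style ids (2301.01234) or old-style ids (hep-th/9901001).
_ARXIV_URL = re.compile(
    r"arxiv\.org/(?:abs|html|pdf)/"
    r"(?P<id>\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})"
    r"(?:v\d+)?(?:\.pdf)?"
)


def canonical_abs_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (arxiv_id, abs_url) for an arXiv URL found by search, or None if it is not one."""
    match = _ARXIV_URL.search(url)
    if match is None:
        return None
    arxiv_id = match.group("id")  # version suffix such as "v2" is dropped (assumption)
    return arxiv_id, f"https://arxiv.org/abs/{arxiv_id}"
```

The function only rewrites a URL that search already returned; it never constructs one from a title.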
Phase 3: Low-context JSONL logging
For each title, immediately append one JSON line to `WORKDIR/paper_fetches.jsonl`.
Each line should include at least:
- `index`
- `input_title`
- `tavily_status`
- `tavily_error`
- `arxiv_url`
- `fetch`
Status values:
- INLINECODE19
- INLINECODE20
- INLINECODE21
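Appending one line per title is what makes the work folder resumable: a later run can read the log and skip what is already done. A minimal sketch follows, using the field names `index`, `input_title`, `tavily_status`, `tavily_error`, `arxiv_url`, and `fetch`; the helper names `append_record` and `processed_indexes` are hypothetical.

```python
import json
from pathlib import Path

LOG_NAME = "paper_fetches.jsonl"


def append_record(workdir: Path, record: dict) -> None:
    """Append exactly one JSON line for a processed title."""
    with (workdir / LOG_NAME).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def processed_indexes(workdir: Path) -> set:
    """Return the indexes already logged, so a resumed run can skip them."""
    path = workdir / LOG_NAME
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as handle:
        return {json.loads(line)["index"] for line in handle if line.strip()}
```

Writing each record as soon as its title is resolved means an interrupted run loses at most the title in progress.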
Phase 4: Render markdown report
After all titles are processed, run:
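```bash
python3 .cursor/skills/tavily-arxiv-paper-fetech/scripts/jsonl_to_paper_fetch_md.py \
  WORKDIR/paper_fetches.jsonl \
  WORKDIR/paper_fetch_report.md
```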
Output Format
The rendered report should look like:
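```markdown
Tavily arXiv Paper Fetch Report
Results
1. Original Title
- Status: ok
- arXiv URL: https://arxiv.org/abs/xxxx.xxxxx
- arXiv ID: xxxx.xxxxx
- Resolved title: ...
- Authors: ...
- Abstract: ...
```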
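One way to turn the JSONL log into a per-title report is sketched below. It is not the bundled rendering script. It assumes that `fetch` is either null or an object with `arxiv_id`, `title`, `authors` (a list of names), and `abstract` keys; those key names and the `Error` line are illustrative, since the log schema only names the top-level fields.

```python
import json
from pathlib import Path


def render_report(jsonl_path: Path) -> str:
    """Build the report text with one numbered entry per logged title."""
    with jsonl_path.open(encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    records.sort(key=lambda record: record["index"])

    lines = ["Tavily arXiv Paper Fetch Report", "", "Results", ""]
    for record in records:
        # Assumption: "fetch" is null or an object holding the fetched metadata.
        fetch = record.get("fetch") or {}
        lines.append(f"{record['index']}. {record['input_title']}")
        lines.append(f"- Status: {record['tavily_status']}")
        if record.get("tavily_error"):
            lines.append(f"- Error: {record['tavily_error']}")
        if record.get("arxiv_url"):
            lines.append(f"- arXiv URL: {record['arxiv_url']}")
        if fetch.get("arxiv_id"):
            lines.append(f"- arXiv ID: {fetch['arxiv_id']}")
        if fetch.get("title"):
            lines.append(f"- Resolved title: {fetch['title']}")
        if fetch.get("authors"):
            lines.append("- Authors: " + ", ".join(fetch["authors"]))
        if fetch.get("abstract"):
            lines.append(f"- Abstract: {fetch['abstract']}")
        lines.append("")  # blank line between entries
    return "\n".join(lines)
```

Because the report is rebuilt from the log each time, it can be regenerated after a resumed run without reprocessing any title.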
Key Rules
- Never fabricate a paper match.
- Never use non-Tavily search for title resolution.
- Keep all outputs inside `WORKDIR`.
- Prefer the local helper script over bringing full arXiv page content into context.
Utility Scripts
- `scripts/fetch_arxiv_abs.py`: fetch compact metadata from a known arXiv URL
- `scripts/jsonl_to_paper_fetch_md.py`: render JSONL to markdown
Additional Resources
- For copy-paste invocations and expected outputs, see examples.md
Example Invocation
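```text
/tavily-arxiv-paper-fetech RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control\nOpenVLA: An Open-Source Vision-Language-Action Model --workdir 0_docs/tavily_arxiv_lookup_run_01
```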