Semantic Scholar Search Workflow
Search academic papers via the Semantic Scholar API using a structured 4-phase workflow.
Critical rule: NEVER make multiple sequential Bash calls for API requests. Always write ONE Python script that runs all searches, then execute it once. All rate limiting is handled inside s2.py automatically.
Phase 1: Understand & Plan
Parse the user's intent and choose a search strategy:
Decision Tree
| User wants... | Strategy | Function |
|---|---|---|
| Broad topic exploration | Relevance search | `search_relevance()` |
| Precise technical terms, exact phrases | Bulk search with boolean operators | `search_bulk()` with `build_bool_query()` |
| Specific passages or methods | Snippet search | `search_snippets()` |
| Known paper by title | Title match | `match_title()` |
| Known paper by DOI/PMID/ArXiv | Direct lookup | `get_paper()` |
| Papers citing a known work | Citation traversal | `get_citations()` |
| Related to one paper | Single-seed recommendations | `find_similar()` |
| Related to multiple papers | Multi-seed recommendations | `recommend()` |
| Find a researcher | Author search | `search_authors()` |
| Researcher's profile | Author details | `get_author()` |
| Researcher's publications | Author papers | `get_author_papers()` |
Query Construction Rules
- Ambiguous terms (e.g., "stem cells" could mean mesenchymal or stem-like T cells): Use `build_bool_query()` with exact phrases and exclusions
  - Example: `build_bool_query(phrases=["stem-like T cells"], required=["CD4", "TCF7"], excluded=["mesenchymal", "hematopoietic stem cell"])`
- Multi-context queries (e.g., "topic X in cancer AND autoimmunity"): Plan separate searches, then merge and deduplicate with `deduplicate()`
- Broad topics: Use `search_relevance()` with filters (`year`, `venue`, `fields_of_study`, `min_citations`)
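The multi-context rule above can be sketched as follows. This is a minimal illustration, assuming each paper is a dict with a `paperId` key; `merge_unique` is a hypothetical stand-in, not the `deduplicate()` helper from `s2.py`, whose implementation is not shown in this document.

```python
def merge_unique(*result_lists):
    """Merge several result lists, keeping the first paper seen per paperId."""
    seen = set()
    merged = []
    for results in result_lists:
        for paper in results:
            pid = paper.get("paperId")
            if pid is None or pid in seen:
                continue  # skip entries without an ID and repeats
            seen.add(pid)
            merged.append(paper)
    return merged


# Hypothetical results from two separate searches on the same topic
cancer = [{"paperId": "a1", "title": "X in cancer"},
          {"paperId": "b2", "title": "X overview"}]
autoimmunity = [{"paperId": "b2", "title": "X overview"},
                {"paperId": "c3", "title": "X in autoimmunity"}]

combined = merge_unique(cancer, autoimmunity)
print([p["paperId"] for p in combined])
```

Running both searches and the merge inside one script keeps the workflow within the single-execution rule.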
Plan Filters
| Filter | Use when |
|---|---|
| `year="2020-"` | Recent work only |
| `publication_date="2024-01-01:2024-06-30"` | Precise date range (YYYY-MM-DD) |
| `fields_of_study="Medicine"` | Restrict to domain |
| `min_citations=10` | Only established papers |
| `pub_types="Review"` | Find reviews/meta-analyses |
| `pub_types="ClinicalTrial"` | Clinical trials only |
| `open_access=True` | Only open access papers |
Checkpoint: Before proceeding, verify: (1) search strategy matches user intent, (2) filters are appropriate, (3) query is specific enough to avoid irrelevant results.
Phase 2: Execute Search
Write ONE Python script. Example:
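```python
import sys, os

# Locate the installed skill directory; fall back to the current directory
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

# Build a precise query
q = build_bool_query(
    phrases=["stem-like T cells"],
    required=["CD4", "IBD"],
    excluded=["mesenchymal"]
)
papers = search_bulk(q, max_results=30, year="2018-", fields_of_study="Medicine")
papers = deduplicate(papers)
print(format_results(papers, "Stem-like CD4 T cells in IBD"))
```
Execute with: `python3 /tmp/s2_search.py`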
Rules:
- Import everything from s2: `from s2 import *`
- Write script to `/tmp/s2_search.py` (or similar temp path)
- One Bash call to execute. Never chain multiple API calls via separate Bash invocations.
- Rate limiting, retries, and backoff are automatic inside s2.py
Checkpoint: Verify the script ran successfully (no exceptions) and returned results. If 0 results, broaden the query or relax filters before presenting.
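One way to handle the zero-result case without a second Bash call is to put the fallback inside the script itself. The sketch below assumes a search function that accepts a query plus keyword filters and returns a list; `search_with_fallback` and `fake_search` are hypothetical names used only for illustration.

```python
def search_with_fallback(search_fn, query, filter_sets):
    """Try each filter set in order and return the first non-empty result list."""
    for filters in filter_sets:
        papers = search_fn(query, **filters)
        if papers:
            return papers, filters
    return [], None


def fake_search(query, **filters):
    # Stand-in for a real search call: matches only when no citation floor is set.
    if filters.get("min_citations"):
        return []
    return [{"paperId": "p1", "title": query}]


filter_sets = [
    {"year": "2020-", "min_citations": 10},  # strictest first
    {"year": "2020-"},                       # drop the citation floor
    {},                                      # no filters at all
]
papers, used = search_with_fallback(fake_search, "stem-like T cells", filter_sets)
print(len(papers), used)
```

Reporting which filter set finally produced results makes it easy to tell the user how the query was relaxed.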
Worked Examples
Example 1: Author workflow — "Find papers by Yann LeCun on self-supervised learning"
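```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

authors = search_authors("Yann LeCun", max_results=5)
print(format_authors(authors))

# Use the first match's ID to fetch their papers
author_id = authors[0]["authorId"]
papers = get_author_papers(author_id, max_results=50)

# Filter by topic locally
ssl_papers = [p for p in papers if "self-supervised" in (p.get("title") or "").lower()]
print(format_results(ssl_papers, "Yann LeCun - self-supervised learning"))
```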
Example 2: Citation chain — "Who cited the Transformer paper and what did they build on?"
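```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

paper = get_paper("DOI:10.48550/arXiv.1706.03762")
print(f"Title: {paper['title']}, Citations: {paper['citationCount']}")

# Get the papers citing this one, most cited first
citing = get_citations(paper["paperId"], max_results=50)
citing_papers = [c["citingPaper"] for c in citing if c.get("citingPaper")]
citing_papers.sort(key=lambda p: p.get("citationCount", 0), reverse=True)
print(format_results(citing_papers, "Highly cited papers citing 'Attention Is All You Need'"))
```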
Example 3: Multi-seed recommendations with BibTeX export — "Find papers like these two but not about NLP"
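```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

recs = recommend(
    positive_ids=["DOI:10.1038/nature14539", "ARXIV:2010.11929"],
    negative_ids=["ARXIV:1706.03762"],
    limit=20
)
print(format_results(recs, "Vision papers similar to deep learning and ViT, excluding NLP"))

# Export BibTeX for the top 10 results
bib_data = batch_papers([r["paperId"] for r in recs[:10]], fields="title,citationStyles")
print(export_bibtex(bib_data))
```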
Phase 3: Summarize & Present
- Use `format_results()` for consistent output (summary table + top-10 details)
- If user's language is Chinese, present summaries in Chinese
- Always note total results count and search strategy used
- Highlight most relevant papers based on the user's specific question
Phase 4: User Interaction Loop
After presenting results, always offer these options:
1. Translate: titles/summaries to Chinese (or other language)
2. Details: full abstract for specific paper numbers
3. Refine: narrow or expand search with different terms/filters
4. Similar: find papers similar to a specific result (`find_similar()`)
5. Citations: who cited a specific paper (`get_citations()`)
6. Export: save results via `export_bibtex()`, `export_markdown()`, or `export_json()`
7. Done: end search session
Loop until user says done. Each follow-up uses the same single-script pattern.
API Quick Reference
Helper Module (s2.py)
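```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *
```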
Paper Search Functions
| Function | Purpose | Max Results |
|---|---|---|
| `search_relevance()` | Simple broad search | 1,000 |
| `search_bulk()` | Boolean precise search | 10,000,000 |
| `search_snippets(query, **filters)` | Full-text passage search | 1,000 |
| `match_title(title)` | Exact title match | 1 |
| `get_paper(paper_id)` | Single paper details | n/a |
| `get_citations(paper_id, max_results)` | Who cited this | 10,000 |
| `get_references(paper_id, max_results)` | What this cites | 10,000 |
| `find_similar(paper_id, limit, pool)` | Single-seed recommendations | 500 |
| `recommend(positive_ids, negative_ids, limit)` | Multi-seed recommendations | 500 |
| `batch_papers(ids, fields)` | Batch lookup (≤500) | n/a |
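Because batch lookups are capped at 500 IDs, longer ID lists need to be split before they are passed on. A minimal sketch of that splitting step follows; `chunked` is a hypothetical helper, and whether `batch_papers()` already splits internally is not stated in this document.

```python
def chunked(ids, size=500):
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical list of 1,200 IDs in the CorpusId format
ids = [f"CorpusId:{n}" for n in range(1200)]
batches = list(chunked(ids))
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch could then be passed to a batch lookup in a loop inside the same script.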
Author Functions
| Function | Purpose | Max Results |
|---|---|---|
| `search_authors()` | Find researchers by name | 1,000 |
| `get_author()` | Author profile (affiliations, h-index) | n/a |
| `get_author_papers(author_id, max_results)` | Author's publications | 10,000 |
| `get_paper_authors(paper_id, max_results)` | Paper's author list | 1,000 |
| `batch_authors(ids, fields)` | Batch author lookup (≤1000) | n/a |
Filter Parameters (kwargs)
`year`, `publication_date`, `venue`, `fields_of_study`, `min_citations`, `pub_types`, `open_access`
- `year`: `"2020-"`, `"-2019"`, INLINECODE59
- `publication_date`: `"2024-01-01:2024-06-30"` (YYYY-MM-DD range, open-ended OK)
- `pub_types`: `Review`, `JournalArticle`, `Conference`, `ClinicalTrial`, `MetaAnalysis`, `Dataset`, `Book`, `CaseReport`, `Editorial`, `LettersAndComments`, `News`, `Study`, `BookSection`
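To make the "open-ended OK" note concrete, here is a minimal sketch of how a `publication_date` value could be interpreted, assuming only full YYYY-MM-DD dates. `parse_date_range` is a hypothetical helper for illustration and is not part of `s2.py`.

```python
from datetime import date


def parse_date_range(value):
    """Split 'START:END' into two date objects; an empty side becomes None."""
    start_text, sep, end_text = value.partition(":")
    if not sep:
        raise ValueError("expected START:END, e.g. 2024-01-01:2024-06-30")
    start = date.fromisoformat(start_text) if start_text else None
    end = date.fromisoformat(end_text) if end_text else None
    if start and end and start > end:
        raise ValueError("start date is after end date")
    return start, end


print(parse_date_range("2024-01-01:2024-06-30"))  # closed range
print(parse_date_range("2024-01-01:"))            # open-ended: no upper bound
print(parse_date_range(":2024-06-30"))            # open-ended: no lower bound
```

Validating the range before the search runs avoids spending a rate-limited request on a malformed filter.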
Boolean Query Syntax (bulk search only)
| Syntax | Example | Meaning |
|---|---|---|
| `"..."` | INLINECODE77 | Exact phrase |
| `+` | `+transformer` | Must include |
| `-` | `-survey` | Exclude |
| `\|` | `CNN \| RNN` | OR |
| `*` | `neuro*` | Prefix wildcard |
| `()` | `(CNN \| RNN) +attention` | Grouping |

Use `build_bool_query(phrases, required, excluded, or_terms)` to construct safely.
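The operators in the table can be combined mechanically. The sketch below shows one plausible way to assemble such a string; `sketch_bool_query` is a hypothetical function written for illustration, not the `build_bool_query()` implementation in `s2.py`. Quoting multi-word required, excluded, and OR terms is an assumption made here so that the operator applies to the whole term.

```python
def _term(text):
    """Quote a term containing a space so an operator applies to all of it."""
    return f'"{text}"' if " " in text else text


def sketch_bool_query(phrases=(), required=(), excluded=(), or_terms=()):
    """Assemble a query string from the operators in the syntax table."""
    parts = [f'"{p}"' for p in phrases]            # exact phrases
    parts += [f"+{_term(t)}" for t in required]    # must include
    parts += [f"-{_term(t)}" for t in excluded]    # exclude
    if or_terms:
        # group the alternatives so the OR does not absorb neighboring terms
        parts.append("(" + " | ".join(_term(t) for t in or_terms) + ")")
    return " ".join(parts)


query = sketch_bool_query(
    phrases=["stem-like T cells"],
    required=["CD4", "TCF7"],
    excluded=["mesenchymal", "hematopoietic stem cell"],
)
print(query)
```

Building the string from lists, rather than by hand, avoids unbalanced quotes and stray operators.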
Output Functions
| Function | Purpose |
|---|---|
| INLINECODE89 | Markdown summary table |
| INLINECODE90 | Detailed entries with TLDR/abstract |
| `format_results(papers, query_desc)` | Combined: summary + table + details |
| `format_authors(authors, max_rows=20)` | Author table (name, affiliations, h-index) |
| `export_bibtex(papers)` | BibTeX entries (requires `citationStyles` field) |
| `export_markdown(papers, query_desc)` | Full markdown report saved to file |
| `export_json(papers, path)` | JSON export saved to file |
| `deduplicate(papers)` | Remove duplicates by paperId |
Supported ID Formats
INLINECODE98 , ARXIV:2106.15928, PMID:19872477, PMCID:PMC2323569, CorpusId:215416146, ACL:2020.acl-main.447, DBLP:conf/acl/..., MAG:3015453090, INLINECODE106
Paper Fields
Default: INLINECODE107
Additional: abstract, references, citations, openAccessPdf, publicationDate, publicationVenue, fieldsOfStudy, s2FieldsOfStudy, journal, isOpenAccess, referenceCount, influentialCitationCount, citationStyles, embedding, INLINECODE122
Author fields: name, affiliations, paperCount, citationCount, hIndex, homepage, externalIds, INLINECODE130
Rate Limiting
Handled automatically by s2.py: 1.1s gap between requests, exponential backoff (2s→4s→8s→16s→32s, max 60s) on 429/504 errors, up to 5 retries.
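The schedule described above can be reproduced in a few lines. This is a sketch of the stated policy only (doubling from 2s, capped at 60s, up to 5 retries), not the code inside `s2.py`; `RetryableError`, `backoff_delays`, and `call_with_retries` are hypothetical names, and the 1.1s gap between ordinary requests is left out.

```python
import time


class RetryableError(Exception):
    """Hypothetical marker for a 429 or 504 response."""


def backoff_delays(retries=5, base=2.0, cap=60.0):
    """Delays in seconds before each retry: doubling from `base`, capped at `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]


def call_with_retries(fn, sleep=time.sleep):
    """Call fn(); on RetryableError wait per the schedule, then try again."""
    for delay in backoff_delays():
        try:
            return fn()
        except RetryableError:
            sleep(delay)
    return fn()  # last attempt after the fifth retry delay; errors propagate


print(backoff_delays())  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

With these defaults the worst case adds 62 seconds of waiting before the final attempt, which is worth keeping in mind when sizing `max_results`.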
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| INLINECODE132 | Missing or invalid API key | Verify `S2_API_KEY` is set: INLINECODE134 |
| INLINECODE135 after 5 retries | Sustained rate limit exceeded | Wait 60s, reduce `max_results`, or split into smaller batches |
| `ModuleNotFoundError: s2` | Skill directory not on path | Verify skill is installed at `~/.claude/skills/` or `~/.openclaw/skills/` |
| `ModuleNotFoundError: requests` | `requests` not installed | `pip install requests` or `uv pip install requests` |
| 0 results returned | Query too specific or filters too narrow | Broaden query, remove filters, try `search_relevance()` instead of `search_bulk()` |
| `KeyError: 'data'` | Endpoint returned error object | Check `r.get("message")` for API error details |
| `tldr` field is empty | Not all papers have TLDR | Fall back to `abstract` field; bulk search never returns `tldr` |
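The last two rows of the table can be handled defensively in the search script. The sketch below assumes the response is a dict and that a populated `tldr` is an object with a `text` key; `extract_data` and `summary_text` are hypothetical helpers, not functions from `s2.py`.

```python
def extract_data(response):
    """Return response['data'], or raise a readable error for an error object."""
    if "data" not in response:
        raise RuntimeError(response.get("message", "unexpected response shape"))
    return response["data"]


def summary_text(paper):
    """Prefer the TLDR text, fall back to the abstract, else an empty string."""
    tldr = paper.get("tldr") or {}  # tldr may be missing or null
    return tldr.get("text") or paper.get("abstract") or ""


# Hypothetical records illustrating the three cases
with_tldr = {"tldr": {"text": "Short summary."}, "abstract": "Long abstract."}
no_tldr = {"tldr": None, "abstract": "Long abstract."}
bare = {"title": "Only a title"}

print(summary_text(with_tldr))
print(summary_text(no_tldr))
print(repr(summary_text(bare)))
```

Raising with the API's own message keeps the real cause visible instead of a bare `KeyError`.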
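Semantic Scholar Search Workflow
Search academic papers via the Semantic Scholar API using a structured 4-phase workflow.
Critical rule: NEVER make multiple sequential Bash calls for API requests. Always write ONE Python script that runs all searches, then execute it once. All rate limiting is handled inside s2.py automatically.
Phase 1: Understand & Plan
Parse the user's intent and choose a search strategy:
Decision Tree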
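| User wants... | Strategy | Function |
|---|---|---|
| Broad topic exploration | Relevance search | `search_relevance()` |
| Precise technical terms, exact phrases | Bulk search with boolean operators | `search_bulk()` with `build_bool_query()` |
| Specific passages or methods | Snippet search | `search_snippets()` |
| Known paper by title | Title match | `match_title()` |
| Known paper by DOI/PMID/ArXiv | Direct lookup | `get_paper()` |
| Papers citing a known work | Citation traversal | `get_citations()` |
| Related to one paper | Single-seed recommendations | `find_similar()` |
| Related to multiple papers | Multi-seed recommendations | `recommend()` |
| Find a researcher | Author search | `search_authors()` |
| Researcher's profile | Author details | `get_author()` |
| Researcher's publications | Author papers | `get_author_papers()` |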
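Query Construction Rules
- Ambiguous terms (e.g., "stem cells" could mean mesenchymal or stem-like T cells): Use `build_bool_query()` with exact phrases and exclusions
  - Example: `build_bool_query(phrases=["stem-like T cells"], required=["CD4", "TCF7"], excluded=["mesenchymal", "hematopoietic stem cell"])`
- Multi-context queries (e.g., "topic X in cancer AND autoimmunity"): Plan separate searches, deduplicate with `deduplicate()`
- Broad topics: Use `search_relevance()` with filters (year, venue, field of study, minimum citation count)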
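Plan Filters
| Filter | Use when |
|---|---|
| `year="2020-"` | Recent work only |
| `publication_date="2024-01-01:2024-06-30"` | Precise date range (YYYY-MM-DD) |
| `fields_of_study="Medicine"` | Restrict to domain |
| `min_citations=10` | Only established papers |
| `pub_types="Review"` | Find reviews/meta-analyses |
| `pub_types="ClinicalTrial"` | Clinical trials only |
| `open_access=True` | Only open access papers |

Checkpoint: Before proceeding, verify: (1) search strategy matches user intent, (2) filters are appropriate, (3) query is specific enough to avoid irrelevant results.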
Phase 2: Execute Search
Write ONE Python script. Example:
```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

# Build a precise query
q = build_bool_query(
    phrases=["stem-like T cells"],
    required=["CD4", "IBD"],
    excluded=["mesenchymal"]
)
papers = search_bulk(q, max_results=30, year="2018-", fields_of_study="Medicine")
papers = deduplicate(papers)
print(format_results(papers, "Stem-like CD4 T cells in IBD"))
```
Execute with: `python3 /tmp/s2_search.py`
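Rules:
- Import everything from s2: `from s2 import *`
- Write script to `/tmp/s2_search.py` (or similar temp path)
- One Bash call to execute. Never chain multiple API calls via separate Bash invocations.
- Rate limiting, retries, and backoff are automatic inside s2.py

Checkpoint: Verify the script ran successfully (no exceptions) and returned results. If 0 results, broaden the query or relax filters before presenting.
Worked Examples
Example 1: Author workflow ("Find papers by Yann LeCun on self-supervised learning")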
```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

authors = search_authors("Yann LeCun", max_results=5)
print(format_authors(authors))

# Use the first match's ID to fetch their papers
author_id = authors[0]["authorId"]
papers = get_author_papers(author_id, max_results=50)

# Filter by topic locally
ssl_papers = [p for p in papers if "self-supervised" in (p.get("title") or "").lower()]
print(format_results(ssl_papers, "Yann LeCun - self-supervised learning"))
```
Example 2: Citation chain ("Who cited the Transformer paper and what did they build on?")
```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

paper = get_paper("DOI:10.48550/arXiv.1706.03762")
print(f"Title: {paper['title']}, Citations: {paper['citationCount']}")

# Get the papers citing this one, most cited first
citing = get_citations(paper["paperId"], max_results=50)
citing_papers = [c["citingPaper"] for c in citing if c.get("citingPaper")]
citing_papers.sort(key=lambda p: p.get("citationCount", 0), reverse=True)
print(format_results(citing_papers, "Highly cited papers citing 'Attention Is All You Need'"))
```
Example 3: Multi-seed recommendations with BibTeX export ("Find papers like these two but not about NLP")
```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *

recs = recommend(
    positive_ids=["DOI:10.1038/nature14539", "ARXIV:2010.11929"],
    negative_ids=["ARXIV:1706.03762"],
    limit=20
)
print(format_results(recs, "Vision papers similar to deep learning and ViT, excluding NLP"))

# Export BibTeX for the top 10 results
bib_data = batch_papers([r["paperId"] for r in recs[:10]], fields="title,citationStyles")
print(export_bibtex(bib_data))
```
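Phase 3: Summarize & Present
- Use `format_results()` for consistent output (summary table + top-10 details)
- If user's language is Chinese, present summaries in Chinese
- Always note total results count and search strategy used
- Highlight most relevant papers based on the user's specific question

Phase 4: User Interaction Loop
After presenting results, always offer these options:
1. Translate: titles/summaries to Chinese (or other language)
2. Details: full abstract for specific paper numbers
3. Refine: narrow or expand search with different terms/filters
4. Similar: find papers similar to a specific result (`find_similar()`)
5. Citations: who cited a specific paper (`get_citations()`)
6. Export: save results via `export_bibtex()`, `export_markdown()`, or `export_json()`
7. Done: end search session

Loop until user says done. Each follow-up uses the same single-script pattern.
API Quick Reference
Helper Module (s2.py)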
```python
import sys, os
SKILL_DIR = next((p for p in [
    os.path.expanduser("~/.claude/skills/semanticscholar-skill"),
    os.path.expanduser("~/.openclaw/skills/semanticscholar-skill"),
] if os.path.isdir(p)), ".")
sys.path.insert(0, SKILL_DIR)
from s2 import *
```
Paper Search Functions
|