Literature Manager
Manage academic literature collections: search → download → convert → organize → verify.
Dependencies
- pdftotext (poppler-utils): PDF text extraction
- curl: downloading
- python3: JSON processing in audit
- file (shipped in the file package on most Linux distributions, not coreutils): PDF validation
- uvx markitdown[pdf] (optional): fallback PDF→MD converter (note: plain uvx markitdown does NOT work for PDFs; you must use uvx markitdown[pdf])
Quick Start
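bash
# Download a single paper by DOI
bash scripts/download.sh 10.1038/s41592-024-02200-1 output_dir/
# Convert a PDF to markdown
bash scripts/convert.sh paper.pdf output.md
# Verify a single PDF+MD pair
bash scripts/verify.sh paper.pdf paper.md
# Run a full audit of the references/ folder
bash scripts/audit.sh /path/to/references/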
Workflow
1. Search
Use web_fetch on Google Scholar:
https://scholar.google.com/scholar?q=QUERY&as_ylo=YEAR
Extract: title, authors, year, journal, DOI, PDF links.
For each result, identify the best open-access PDF source (see the source order in step 2, Download).
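As a small illustration of the search URL pattern above, the following sketch builds the query string with proper escaping. The function name scholar_url is hypothetical, and it assumes as_ylo means "from this year onward"; the fetching itself (web_fetch) is not shown.

```python
from urllib.parse import urlencode


def scholar_url(query: str, year_from: int) -> str:
    """Build a Google Scholar search URL (hypothetical helper, not part of the scripts)."""
    # urlencode escapes spaces and special characters in the query.
    params = {"q": query, "as_ylo": year_from}
    return "https://scholar.google.com/scholar?" + urlencode(params)


print(scholar_url("spike sorting", 2020))
# → https://scholar.google.com/scholar?q=spike+sorting&as_ylo=2020
```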
2. Download
Run scripts/download.sh <DOI_or_URL> <output_dir/> per paper. The script tries sources in order:
1. Direct publisher PDF (Nature, eLife, Frontiers, PNAS, bioRxiv, arXiv)
2. EuropePMC (PMC_ID → PDF)
3. bioRxiv/arXiv preprint
4. Sci-Hub: https://sci-hub.box/<DOI> (use when publisher is paywalled)
bash
# Sci-Hub download example:
curl -L https://sci-hub.box/10.1038/nature12345 -o paper.pdf
⚠️ Legal note: Sci-Hub may violate publisher terms of service or copyright law in some jurisdictions. Use only if you understand and accept the legal implications in your context.
If all sources fail (including Sci-Hub), flag as permanent paywall. Provide the user with the DOI and ask for manual download.
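The "try sources in order, then flag as paywalled" logic can be sketched as below. This is not the contents of scripts/download.sh, which is not shown in this document; the names first_working_source and looks_like_pdf are hypothetical, and the fetchers are stand-in callables so the example runs without network access.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a DOI and returns the downloaded bytes, or None on failure.
Fetcher = Callable[[str], Optional[bytes]]


def looks_like_pdf(data: Optional[bytes]) -> bool:
    """A real PDF starts with the magic bytes %PDF-; paywall pages are usually HTML."""
    return bool(data) and data.startswith(b"%PDF-")


def first_working_source(
    doi: str, sources: Iterable[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, bytes]]:
    """Try each source in order; return (source_name, pdf_bytes) or None if all fail."""
    for name, fetch in sources:
        try:
            data = fetch(doi)
        except Exception:
            continue  # a failing source should not abort the whole chain
        if looks_like_pdf(data):
            return name, data
    return None  # caller flags the paper as a permanent paywall


# Stand-in fetchers for illustration only.
sources = [
    ("publisher", lambda doi: b"<html>Sign in to read</html>"),
    ("europepmc", lambda doi: None),
    ("preprint", lambda doi: b"%PDF-1.7 fake body"),
]
result = first_working_source("10.1038/s41592-024-02200-1", sources)
print(result[0] if result else "permanent paywall")
```

Checking the magic bytes matters because a paywalled request often succeeds at the HTTP level and returns an HTML page saved under a .pdf name.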
3. Convert
Run scripts/convert.sh <input.pdf> <output.md>. Uses pdftotext (reliable) with uvx markitdown[pdf] as fallback.
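bash
# Correct markitdown command for PDFs:
uvx markitdown[pdf] input.pdf > output.md
# ⚠️ The following does NOT work for PDFs (missing the [pdf] extra):
uvx markitdown input.pdf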
Prefer uvx markitdown[pdf] over pdftotext when full fidelity (tables, figure captions) matters.
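The pdftotext-first, markitdown-fallback order can be sketched in Python as follows. This is an illustration under assumptions, not the actual scripts/convert.sh: the function name convert is hypothetical, the -layout flag is an optional choice, and the uvx invocation simply mirrors the command form given in this document.

```python
import shutil
import subprocess
from pathlib import Path


def convert(pdf: Path, md: Path) -> str:
    """Convert pdf to md; return the name of the tool that succeeded."""
    # First choice: pdftotext from poppler-utils.
    if shutil.which("pdftotext"):
        result = subprocess.run(["pdftotext", "-layout", str(pdf), str(md)])
        if result.returncode == 0 and md.exists() and md.stat().st_size > 0:
            return "pdftotext"
    # Fallback: markitdown with the [pdf] extra, writing stdout to the output file.
    if shutil.which("uvx"):
        with md.open("w", encoding="utf-8") as out:
            result = subprocess.run(["uvx", "markitdown[pdf]", str(pdf)], stdout=out)
        if result.returncode == 0 and md.stat().st_size > 0:
            return "markitdown"
    raise RuntimeError(f"no converter produced output for {pdf}")
```

A call such as convert(Path("paper.pdf"), Path("output.md")) returns which tool was used, which is handy for logging when an audit later finds a poor-quality conversion.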
4. Organize
Standard folder structure:
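references/
├── README.md          # Human-readable index (summary by category)
├── index.json         # Machine index (structured metadata)
├── RESOURCES.md       # Code repositories + datasets
├── resources.json     # Structured version
├── <category-1>/
│   ├── papers/        # PDF files
│   └── markdown/      # Converted text
└── <category-2>/
    ├── papers/
    └── markdown/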
Categories are user-defined. Number-prefix for sort order (e.g., 01-theoretical-frameworks/).
index.json schema per paper
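json
{
  "id": "short_id",
  "title": "Full title",
  "authors": ["Author1", "Author2"],
  "year": 2024,
  "journal": "Journal Name",
  "doi": "10.xxxx/...",
  "category": "category_name",
  "subcategory": "optional",
  "pdf_path": "category/papers/filename.pdf",
  "markdown_path": "category/markdown/filename.md",
  "tags": ["tag1", "tag2"],
  "one_line_summary": "English one-liner",
  "key_concepts": ["concept1"],
  "relevance_to_project": "English description"
}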
README.md pattern
Per category section, per paper: title, authors, year, journal, DOI, short summary in user's language.
4b. DOI-Based Filenames & Path Mapping
Downloaded files are often named using DOI format rather than AuthorYear:
10-1038_ncomms3018.md            # DOI: 10.1038/ncomms3018
10-1016_j-neuron-2015-03-034.md
When markdown_path entries in index.json become stale (e.g., after folder reorganization), maintain a separate mapping file:
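json
// temp/paper_md_mapping.json
{
  "author2024keyword": "references/new-downloads/10-1038_s41592-024-02200-1.md",
  ...
}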
To build this mapping: cross-reference each paper's DOI in index.json against actual files on disk. Use find + Python to automate.
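One way to automate the cross-reference is sketched below. It assumes index.json is a JSON array of entries with id and doi fields, and that DOI-based filenames follow the pattern seen in this document's examples (10.1038/ncomms3018 becomes 10-1038_ncomms3018.md: dots become hyphens, the slash becomes an underscore). The function names are hypothetical.

```python
import json
from pathlib import Path


def doi_to_stem(doi: str) -> str:
    """Turn a DOI into the assumed filename stem: '/' -> '_' and '.' -> '-'."""
    return doi.strip().replace("/", "_").replace(".", "-")


def build_mapping(index_path: Path, search_root: Path) -> dict:
    """Map each paper id to the markdown file on disk whose name matches its DOI."""
    entries = json.loads(index_path.read_text(encoding="utf-8"))
    # Index every markdown file under search_root by lowercase stem.
    on_disk = {p.stem.lower(): p for p in search_root.rglob("*.md")}
    mapping = {}
    for entry in entries:
        doi, paper_id = entry.get("doi"), entry.get("id")
        if not doi or not paper_id:
            continue  # cannot map entries without a DOI or an id
        match = on_disk.get(doi_to_stem(doi).lower())
        if match is not None:
            mapping[paper_id] = match.as_posix()
    return mapping


print(doi_to_stem("10.1016/j.neuron.2015.03.034"))
# → 10-1016_j-neuron-2015-03-034
```

Entries that end up without a mapping are the ones to inspect by hand, since either the file is missing or the DOI in the index is wrong.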
index.json Known Pitfalls
- id: null corruption: If many entries have id=null and share the same pdf_path, the index was likely corrupted during a batch write. Rebuild from actual files on disk.
- DOI errors: Verify that DOIs resolve correctly; typos in DOI fields are common (e.g., wrong suffix digits). Always cross-check with the publisher page.
- Dead markdown_path: After restructuring folders, markdown_path in index.json often points to old locations. Use the mapping file above as the source of truth.
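Two of these pitfalls (null ids with shared pdf_path, and dead markdown_path entries) can be detected mechanically. The sketch below assumes the index has already been loaded as a list of dicts and that paths are relative to the references root; index_problems is a hypothetical name. DOI typos are not covered, since checking them requires resolving each DOI.

```python
from collections import Counter
from pathlib import Path


def index_problems(entries: list, root: Path) -> dict:
    """Report null ids, pdf_path values shared by several entries, and dead markdown paths."""
    # Positions of entries whose id is missing or null.
    null_ids = [i for i, e in enumerate(entries) if e.get("id") is None]
    # A pdf_path used by more than one entry hints at a corrupted batch write.
    pdf_counts = Counter(e.get("pdf_path") for e in entries if e.get("pdf_path"))
    shared_pdf = sorted(p for p, n in pdf_counts.items() if n > 1)
    # markdown_path values that no longer exist on disk.
    dead_md = sorted(
        e["markdown_path"]
        for e in entries
        if e.get("markdown_path") and not (root / e["markdown_path"]).exists()
    )
    return {
        "null_ids": null_ids,
        "shared_pdf_paths": shared_pdf,
        "dead_markdown_paths": dead_md,
    }
```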
5. Verify
Run scripts/audit.sh <references_dir/> for full verification:
- Every PDF is valid (file -b reports a PDF document)
- Every PDF title matches filename (pdftotext | head)
- Every PDF has matching markdown (and vice versa)
- index.json is valid, complete, paths exist, no duplicate IDs
- README.md stats match actual counts
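Two of the checks above (PDF validity and PDF/markdown pairing) can be sketched for a single category folder as follows. This is not scripts/audit.sh: instead of calling file -b it checks the %PDF- magic bytes directly, and it assumes a category contains papers/ and markdown/ subfolders whose files share the same stem. The function names are hypothetical.

```python
from pathlib import Path


def is_pdf(path: Path) -> bool:
    """Check the %PDF- magic bytes (a stand-in for `file -b`)."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def audit_category(category_dir: Path) -> dict:
    """Report invalid PDFs and unpaired files in one category folder."""
    pdfs = {p.stem: p for p in (category_dir / "papers").glob("*.pdf")}
    mds = {p.stem for p in (category_dir / "markdown").glob("*.md")}
    return {
        "invalid_pdfs": sorted(stem for stem, p in pdfs.items() if not is_pdf(p)),
        "pdf_without_markdown": sorted(set(pdfs) - mds),
        "markdown_without_pdf": sorted(mds - set(pdfs)),
    }
```

An empty list under every key means the category passes these two checks; the title, index.json, and README checks would still need to run separately.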
6. Collect Resources
For tool/method papers, find GitHub repos and public datasets. Store in RESOURCES.md + resources.json.
Sub-agent Strategy
For large batches, parallelize:
- Download: 1 sub-agent per batch of ~5-8 papers
- Organize: 1 sub-agent to build indexes
- Verify: 1 independent sub-agent (never the same as organizer)
Always use a separate sub-agent for verification (QC should not self-grade).
⚠️ Sub-agent Rules (Learned from Practice)
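1. One batch at a time: do not spawn multiple note-writing batches simultaneously; LLM rate limits will cause silent failures
2. Set a cron monitor whenever spawning long-running agents: agents can fail silently without triggering auto-announce; cron catches this
3. Cron monitor pattern:
   1. Spawn the agent
   2. Immediately set a cron job (every 10-15 minutes, isolated agentTurn)
      → Check whether the expected output files exist
      → Respawn failed agents
      → When everything is done: notify + delete the cron job
   3. After the task completes, confirm the cron job has been removed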
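The check a monitor performs on each run, whether the expected output files exist, can be sketched as a small function. The name missing_outputs and the choice to treat empty files as missing are assumptions; scheduling and respawning agents depend on the agent platform and are not shown.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(expected: Iterable[str], output_dir: Path) -> List[str]:
    """Return the expected file names that are absent or empty in output_dir."""
    missing = []
    for name in expected:
        path = output_dir / name
        # An empty file usually means the agent died mid-write, so count it as missing.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

When the returned list is empty the batch is complete and the monitor can notify and remove itself; otherwise the listed files identify which agents to respawn.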
Adding Papers Incrementally
To add papers to an existing collection:
1. Download + convert new papers into the correct category folder
2. Append entries to index.json
3. Update README.md stats
4. Run audit to verify consistency
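The "append entries to index.json" step can be sketched as below, assuming index.json is a JSON array of entries. The function name append_entry is hypothetical. It rejects null or duplicate ids and writes through a temporary file, which is one way to reduce the risk of the batch-write corruption described under Known Pitfalls.

```python
import json
from pathlib import Path


def append_entry(index_path: Path, entry: dict) -> int:
    """Append one paper entry to index.json; return the new entry count."""
    entries = (
        json.loads(index_path.read_text(encoding="utf-8"))
        if index_path.exists()
        else []
    )
    if not entry.get("id"):
        raise ValueError("entry needs a non-null id")
    if any(e.get("id") == entry["id"] for e in entries):
        raise ValueError(f"duplicate id: {entry['id']}")
    entries.append(entry)
    # Write to a temporary file first, then swap it in, so a crash
    # mid-write does not leave a truncated index.json behind.
    tmp = index_path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")
    tmp.replace(index_path)
    return len(entries)
```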
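Literature Manager
Manage academic literature collections: search → download → convert → organize → verify.
Dependencies
- pdftotext (poppler-utils): PDF text extraction
- curl: downloading
- python3: JSON processing in audit
- file (shipped in the file package on most Linux distributions, not coreutils): PDF validation
- uvx markitdown[pdf] (optional): fallback PDF→MD converter (note: plain uvx markitdown does NOT work for PDFs; you must use uvx markitdown[pdf])
Quick Start
bash
# Download a single paper by DOI
bash scripts/download.sh 10.1038/s41592-024-02200-1 output_dir/
# Convert a PDF to markdown
bash scripts/convert.sh paper.pdf output.md
# Verify a single PDF+MD pair
bash scripts/verify.sh paper.pdf paper.md
# Run a full audit of the references/ folder
bash scripts/audit.sh /path/to/references/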
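Workflow
1. Search
Use web_fetch on Google Scholar:
https://scholar.google.com/scholar?q=QUERY&as_ylo=YEAR
Extract: title, authors, year, journal, DOI, PDF links.
For each result, identify the best open-access PDF source (see the source order in step 2, Download).
2. Download
Run scripts/download.sh <DOI_or_URL> <output_dir/> per paper. The script tries sources in order:
1. Direct publisher PDF (Nature, eLife, Frontiers, PNAS, bioRxiv, arXiv)
2. EuropePMC (PMC_ID → PDF)
3. bioRxiv/arXiv preprint
4. Sci-Hub: https://sci-hub.box/<DOI> (use when publisher is paywalled)
bash
# Sci-Hub download example:
curl -L https://sci-hub.box/10.1038/nature12345 -o paper.pdf
⚠️ Legal note: Sci-Hub may violate publisher terms of service or copyright law in some jurisdictions. Use only if you understand and accept the legal implications in your context.
If all sources fail (including Sci-Hub), flag as permanent paywall. Provide the user with the DOI and ask for manual download.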
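3. Convert
Run scripts/convert.sh <input.pdf> <output.md>. Uses pdftotext (reliable) with uvx markitdown[pdf] as fallback.
bash
# Correct markitdown command for PDFs:
uvx markitdown[pdf] input.pdf > output.md
# ⚠️ The following does NOT work for PDFs (missing the [pdf] extra):
uvx markitdown input.pdf
Prefer uvx markitdown[pdf] over pdftotext when full fidelity (tables, figure captions) matters.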
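4. Organize
Standard folder structure:
references/
├── README.md          # Human-readable index (summary by category)
├── index.json         # Machine index (structured metadata)
├── RESOURCES.md       # Code repositories + datasets
├── resources.json     # Structured version
├── <category-1>/
│   ├── papers/        # PDF files
│   └── markdown/      # Converted text
└── <category-2>/
    ├── papers/
    └── markdown/
Categories are user-defined. Number-prefix for sort order (e.g., 01-theoretical-frameworks/).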
index.json schema per paper
json
{
  "id": "short_id",
  "title": "Full title",
  "authors": ["Author1", "Author2"],
  "year": 2024,
  "journal": "Journal Name",
  "doi": "10.xxxx/...",
  "category": "category_name",
  "subcategory": "optional",
  "pdf_path": "category/papers/filename.pdf",
  "markdown_path": "category/markdown/filename.md",
  "tags": ["tag1", "tag2"],
  "one_line_summary": "English one-liner",
  "key_concepts": ["concept1"],
  "relevance_to_project": "English description"
}
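README.md pattern
Per category section, per paper: title, authors, year, journal, DOI, short summary in user's language.
4b. DOI-Based Filenames & Path Mapping
Downloaded files are often named using DOI format rather than AuthorYear:
10-1038_ncomms3018.md            # DOI: 10.1038/ncomms3018
10-1016_j-neuron-2015-03-034.md
When markdown_path entries in index.json become stale (e.g., after folder reorganization), maintain a separate mapping file:
json
// temp/paper_md_mapping.json
{
  "author2024keyword": "references/new-downloads/10-1038_s41592-024-02200-1.md",
  ...
}
To build this mapping: cross-reference each paper's DOI in index.json against actual files on disk. Use find + Python to automate.
index.json Known Pitfalls
- id: null corruption: If many entries have id=null and share the same pdf_path, the index was likely corrupted during a batch write. Rebuild from actual files on disk.
- DOI errors: Verify that DOIs resolve correctly; typos in DOI fields are common (e.g., wrong suffix digits). Always cross-check with the publisher page.
- Dead markdown_path: After restructuring folders, markdown_path in index.json often points to old locations. Use the mapping file above as the source of truth.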
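5. Verify
Run scripts/audit.sh <references_dir/> for full verification:
- Every PDF is valid (file -b reports a PDF document)
- Every PDF title matches filename (pdftotext | head)
- Every PDF has matching markdown (and vice versa)
- index.json is valid, complete, paths exist, no duplicate IDs
- README.md stats match actual counts
6. Collect Resources
For tool/method papers, find GitHub repos and public datasets. Store in RESOURCES.md + resources.json.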
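Sub-agent Strategy
For large batches, parallelize:
- Download: 1 sub-agent per batch of ~5-8 papers
- Organize: 1 sub-agent to build indexes
- Verify: 1 independent sub-agent (never the same as organizer)
Always use a separate sub-agent for verification (QC should not self-grade).
⚠️ Sub-agent Rules (Learned from Practice)
1. One batch at a time: do not spawn multiple note-writing batches simultaneously; LLM rate limits will cause silent failures
2. Set a cron monitor whenever spawning long-running agents: agents can fail silently without triggering auto-announce; cron catches this
3. Cron monitor pattern:
   1. Spawn the agent
   2. Immediately set a cron job (every 10-15 minutes, isolated agentTurn)
      → Check whether the expected output files exist
      → Respawn failed agents
      → When everything is done: notify + delete the cron job
   3. After the task completes, confirm the cron job has been removed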
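Adding Papers Incrementally
To add papers to an existing collection:
1. Download + convert new papers into the correct category folder
2. Append entries to index.json
3. Update README.md stats
4. Run audit to verify consistency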