Literature Manager

Manage academic literature collections: search → download → convert → organize → verify.

Dependencies

- pdftotext (poppler-utils): PDF text extraction
- curl: downloading
- python3: JSON processing in audit
- file (the file package, not coreutils): PDF validation
- uvx markitdown[pdf] (optional): fallback PDF→MD converter (note: plain uvx markitdown does NOT work for PDFs; you must use uvx markitdown[pdf])

Quick Start

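    # Download a single paper by DOI
    bash scripts/download.sh 10.1038/s41592-024-02200-1 output_dir/

    # Convert a PDF to markdown
    bash scripts/convert.sh paper.pdf output.md

    # Verify a single PDF+MD pair
    bash scripts/verify.sh paper.pdf paper.md

    # Run a full audit of a references/ folder
    bash scripts/audit.sh /path/to/references/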

Workflow

1. Search

Use web_fetch on Google Scholar:

https://scholar.google.com/scholar?q=QUERY&as_ylo=YEAR

Extract: title, authors, year, journal, DOI, PDF links.

For each result, identify the best open-access PDF source (see the source order under 2. Download).
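The search URL can be assembled programmatically so that spaces and special characters in the query are escaped. A minimal sketch: the helper name scholar_url is hypothetical, and only the q and as_ylo parameters come from the URL pattern shown in this section.

```python
from urllib.parse import urlencode

SCHOLAR_BASE = "https://scholar.google.com/scholar"


def scholar_url(query: str, since_year: int) -> str:
    """Build a Google Scholar search URL (hypothetical helper name)."""
    # urlencode escapes the query (spaces become '+') and joins with '&'
    params = urlencode({"q": query, "as_ylo": since_year})
    return f"{SCHOLAR_BASE}?{params}"


print(scholar_url("spike sorting benchmark", 2020))
```

The resulting string is what gets passed to web_fetch.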

2. Download

Run scripts/download.sh <DOI_or_URL> <output_dir/> per paper. The script tries sources in order:

1. Direct publisher PDF (Nature, eLife, Frontiers, PNAS, bioRxiv, arXiv)
2. EuropePMC (PMC_ID → PDF)
3. bioRxiv/arXiv preprint
4. Sci-Hub: https://sci-hub.box/<DOI> (use when the publisher is paywalled)

    # Sci-Hub download example:
    curl -L https://sci-hub.box/10.1038/nature12345 -o paper.pdf

⚠️ Legal note: Sci-Hub may violate publisher terms of service or copyright law in some jurisdictions. Use only if you understand and accept the legal implications in your context.

If all sources fail (including Sci-Hub), flag as permanent paywall. Provide the user with the DOI and ask for manual download.
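The ordered fallback and the permanent-paywall outcome can be sketched as a loop over candidate URLs. This illustrates the idea and is not the contents of scripts/download.sh: the function names, the (label, url) candidate list, and the injected fetch callable are all assumptions, and the only check applied is that the response starts with the %PDF- signature.

```python
from typing import Callable, Iterable, Optional, Tuple


def looks_like_pdf(data: bytes) -> bool:
    # A well-formed PDF starts with the ASCII signature "%PDF-"
    return data.startswith(b"%PDF-")


def first_pdf(
    candidates: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes],
) -> Optional[Tuple[str, bytes]]:
    """Try (label, url) pairs in order; return the first real PDF, else None.

    `fetch` is injected (hypothetical) so the loop can be exercised offline;
    in practice it could wrap curl or urllib.
    """
    for label, url in candidates:
        try:
            data = fetch(url)
        except OSError:
            continue  # network error: move on to the next source
        if looks_like_pdf(data):
            return label, data
    return None  # caller flags the paper as a permanent paywall


# Offline demonstration with a fake fetcher
def fake_fetch(url: str) -> bytes:
    if "publisher" in url:
        return b"<html>paywall</html>"  # landing page, not a PDF
    return b"%PDF-1.7 fake body"


result = first_pdf(
    [
        ("publisher", "https://publisher.example/paper.pdf"),
        ("preprint", "https://preprint.example/paper.pdf"),
    ],
    fake_fetch,
)
print(result[0] if result else "permanent paywall")
```

Checking the signature matters because paywalled publishers often answer with an HTML page and a successful status code.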

3. Convert

Run scripts/convert.sh <input.pdf> <output.md>. Uses pdftotext (reliable) with uvx markitdown[pdf] as fallback.

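    # Correct markitdown command for PDFs (quoted so shells such as zsh do not glob the brackets):
    uvx 'markitdown[pdf]' input.pdf > output.md

    # ⚠️ This does NOT work for PDFs (missing the [pdf] extra):
    uvx markitdown input.pdf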

Prefer uvx markitdown[pdf] over pdftotext when full fidelity (tables, figure captions) matters.
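The primary/fallback order can be sketched as follows. This is not the contents of scripts/convert.sh: the function name and the injectable run parameter are assumptions, and the markitdown invocation simply reuses the uvx markitdown[pdf] form given in this document, with markdown captured from stdout.

```python
import subprocess
from pathlib import Path


def convert_pdf(pdf: str, out_md: str, run=subprocess.run) -> str:
    """Convert with pdftotext first, then fall back to markitdown[pdf].

    Returns the name of the tool that produced the output. `run` is
    injectable (an assumption of this sketch) so the logic can be tested
    without the real tools installed.
    """
    out = Path(out_md)
    try:
        # pdftotext writes the extracted text to the given output file
        done = run(["pdftotext", pdf, str(out)], capture_output=True)
        if done.returncode == 0 and out.exists() and out.stat().st_size > 0:
            return "pdftotext"
    except FileNotFoundError:
        pass  # pdftotext is not installed

    # Fallback: markitdown prints markdown on stdout
    done = run(["uvx", "markitdown[pdf]", pdf], capture_output=True, check=True)
    out.write_bytes(done.stdout)
    return "markitdown[pdf]"
```

Because the command is passed as a list, no shell is involved and the brackets need no quoting.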

4. Organize

Standard folder structure:
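
    references/
    ├── README.md          # Human index (summary by category)
    ├── index.json         # Machine index (structured metadata)
    ├── RESOURCES.md       # Code repositories + datasets
    ├── resources.json     # Structured version
    ├── <category>/
    │   ├── papers/        # PDF files
    │   └── markdown/      # Converted text
    └── <category>/
        ├── papers/
        └── markdown/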

Categories are user-defined. Number-prefix for sort order (e.g., 01-theoretical-frameworks/).

index.json schema per paper

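    {
      "id": "short_id",
      "title": "Full title",
      "authors": ["Author1", "Author2"],
      "year": 2024,
      "journal": "Journal Name",
      "doi": "10.xxxx/...",
      "category": "category_name",
      "subcategory": "optional",
      "pdf_path": "category/papers/filename.pdf",
      "markdown_path": "category/markdown/filename.md",
      "tags": ["tag1", "tag2"],
      "one_line_summary": "English one-liner",
      "key_concepts": ["concept1"],
      "relevance_to_project": "English description"
    }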
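One way to catch incomplete entries early is to validate each record before it is written. A sketch under stated assumptions: the function name is hypothetical, and the required subset below (id, title, authors, year, doi, category, pdf_path, markdown_path) was chosen for illustration rather than taken from the audit script.

```python
REQUIRED = {
    "id": str,
    "title": str,
    "authors": list,
    "year": int,
    "doi": str,
    "category": str,
    "pdf_path": str,
    "markdown_path": str,
}


def entry_problems(entry: dict) -> list[str]:
    """Return human-readable problems for one index.json entry."""
    problems = []
    for field, expected in REQUIRED.items():
        value = entry.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif not isinstance(value, expected):
            problems.append(f"{field} should be {expected.__name__}")
    doi = entry.get("doi")
    # All DOIs start with the directory indicator "10."
    if isinstance(doi, str) and not doi.startswith("10."):
        problems.append("doi does not start with '10.'")
    return problems


sample = {
    "id": None,
    "title": "Full title",
    "authors": "Author1",  # wrong type: should be a list
    "year": 2024,
    "doi": "10.1038/ncomms3018",
    "category": "category_name",
    "pdf_path": "category/papers/filename.pdf",
    "markdown_path": "category/markdown/filename.md",
}
print(entry_problems(sample))  # ['missing id', 'authors should be list']
```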

README.md pattern

Per category section, per paper: title, authors, year, journal, DOI, short summary in user's language.

4b. DOI-Based Filenames & Path Mapping

Downloaded files are often named using DOI format rather than AuthorYear:

    10-1038_ncomms3018.md              # DOI: 10.1038/ncomms3018
    10-1016_j-neuron-2015-03-034.md

When markdown_path entries in index.json become stale (e.g., after folder reorganization), maintain a separate mapping file:

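    // temp/paper_md_mapping.json
    {
      "author2024keyword": "references/new-downloads/10-1038_s41592-024-02200-1.md",
      ...
    }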

To build this mapping: cross-reference each paper's DOI in index.json against actual files on disk. Use find + Python to automate.
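The cross-referencing step can be sketched in Python. The naming rule used here (dots become hyphens, the slash becomes an underscore) is inferred from examples such as 10.1038/ncomms3018 stored as 10-1038_ncomms3018.md, not stated as a rule by this document. The function names are hypothetical, and index entries are assumed to be dicts with id and doi keys.

```python
from pathlib import Path


def doi_to_stem(doi: str) -> str:
    # Inferred convention: "." -> "-" and "/" -> "_"
    return doi.replace(".", "-").replace("/", "_")


def build_mapping(entries: list[dict], root: str) -> dict[str, str]:
    """Map each entry id to the markdown file whose name matches its DOI."""
    # Index every .md file under root by its filename stem
    on_disk = {p.stem: p for p in Path(root).rglob("*.md")}
    mapping = {}
    for entry in entries:
        if not entry.get("id") or not entry.get("doi"):
            continue  # cannot map entries that lack an id or a DOI
        path = on_disk.get(doi_to_stem(entry["doi"]))
        if path is not None:
            mapping[entry["id"]] = path.as_posix()
    return mapping


print(doi_to_stem("10.1038/ncomms3018"))  # 10-1038_ncomms3018
```

Entries that do not appear in the returned mapping are the ones whose files still need to be located by hand.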

index.json Known Pitfalls

- id: null corruption: If many entries have id=null and share the same pdf_path, the index was likely corrupted during a batch write. Rebuild from actual files on disk.
- DOI errors: Verify DOIs resolve correctly. Typos in DOI fields are common (e.g., wrong suffix digits). Always cross-check with the publisher page.
- Dead markdown_path: After restructuring folders, markdown_path in index.json often points to old locations. Use the mapping file above as the source of truth.
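The id: null symptom can be checked mechanically before deciding to rebuild. A sketch, assuming the index has been loaded as a list of dicts; the function name and the min_count threshold are hypothetical.

```python
from collections import Counter


def shared_null_id_paths(entries: list[dict], min_count: int = 2) -> dict[str, int]:
    """Count pdf_path values shared by entries whose id is null (None)."""
    counts = Counter(e.get("pdf_path") for e in entries if e.get("id") is None)
    return {
        path: n
        for path, n in counts.items()
        if path is not None and n >= min_count
    }


demo = [
    {"id": None, "pdf_path": "cat/papers/a.pdf"},
    {"id": None, "pdf_path": "cat/papers/a.pdf"},
    {"id": "ok2024", "pdf_path": "cat/papers/b.pdf"},
]
print(shared_null_id_paths(demo))  # {'cat/papers/a.pdf': 2}
```

A non-empty result is the signal to rebuild the index from the files on disk.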

5. Verify

Run scripts/audit.sh <references_dir/> for full verification:

- Every PDF is valid (file -b reports PDF)
- Every PDF title matches its filename (pdftotext | head)
- Every PDF has a matching markdown file (and vice versa)
- index.json is valid and complete, its paths exist, and it has no duplicate IDs
- README.md stats match actual counts
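Two of these checks, PDF validity and PDF/markdown pairing, can be sketched as follows. This illustrates the idea and is not scripts/audit.sh itself: it assumes the papers/ and markdown/ subfolders from the Organize step, assumes paired files share a filename stem, and reads the %PDF- signature directly instead of calling file -b. The function name is hypothetical.

```python
from pathlib import Path


def audit_category(category_dir: str) -> dict[str, list[str]]:
    """Report invalid PDFs and unpaired files for one category folder."""
    cat = Path(category_dir)
    pdfs = {p.stem: p for p in (cat / "papers").glob("*.pdf")}
    pdf_stems = set(pdfs)
    md_stems = {p.stem for p in (cat / "markdown").glob("*.md")}

    invalid = []
    for stem, path in pdfs.items():
        with path.open("rb") as fh:
            if fh.read(5) != b"%PDF-":  # signature check
                invalid.append(stem)

    return {
        "invalid_pdf": sorted(invalid),
        "pdf_without_markdown": sorted(pdf_stems - md_stems),
        "markdown_without_pdf": sorted(md_stems - pdf_stems),
    }
```

Three empty lists mean the category passes these two checks; title matching, index.json validation, and README counts would still need their own passes.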

6. Collect Resources

For tool/method papers, find GitHub repos and public datasets. Store in RESOURCES.md + resources.json.

Sub-agent Strategy

For large batches, parallelize:

- Download: 1 sub-agent per batch of ~5-8 papers
- Organize: 1 sub-agent to build indexes
- Verify: 1 independent sub-agent (never the same as the organizer)

Always use a separate sub-agent for verification (QC should not self-grade).

⚠️ Sub-agent Rules (Learned from Practice)

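1. One batch at a time: do not spawn multiple note-writing batches simultaneously; LLM rate limits will cause silent failures
2. Set a cron monitor whenever spawning long-running agents: agents can fail silently without triggering auto-announce; cron catches this

Cron monitor pattern:

    1. Spawn agent
    2. Immediately set a cron job (every 10-15 min, isolated agentTurn)
       → Check whether the expected output files exist
       → Respawn failed agents
       → When all done: notify + delete the cron
    3. After the task completes, confirm the cron has been removed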
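The check that a monitor run performs on each tick can be sketched like this. Cron scheduling and agent spawning are left out because they depend on the host platform; the function name and the idea of one expected output file per agent are assumptions.

```python
from pathlib import Path


def agents_to_respawn(expected: dict[str, str]) -> list[str]:
    """Return agent names whose expected output file is missing or empty.

    `expected` maps an agent name to its output path (hypothetical layout).
    """
    pending = []
    for name, path in expected.items():
        p = Path(path)
        if not p.is_file() or p.stat().st_size == 0:
            pending.append(name)
    return pending
```

An empty list corresponds to the "all done" branch: notify, then delete the cron.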

Adding Papers Incrementally

To add papers to an existing collection:

1. Download + convert new papers into the correct category folder
2. Append entries to index.json
3. Update README.md stats
4. Run audit to verify consistency
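The append step is where duplicate IDs tend to creep in, so a helper can refuse them before anything is written. A sketch, assuming index.json holds a top-level JSON array (this document does not specify the container); the function name is hypothetical.

```python
import json
from pathlib import Path


def append_entries(index_path: str, new_entries: list[dict]) -> int:
    """Append entries to index.json, rejecting duplicate or missing ids.

    Returns the total number of entries after the append.
    """
    path = Path(index_path)
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    seen = {e.get("id") for e in entries}
    for entry in new_entries:
        if not entry.get("id") or entry["id"] in seen:
            # Raised before any write, so the file on disk stays untouched
            raise ValueError(f"duplicate or missing id: {entry.get('id')!r}")
        seen.add(entry["id"])
        entries.append(entry)
    # ensure_ascii=False keeps non-ASCII titles and summaries readable
    path.write_text(
        json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(entries)
```

Running the audit afterwards still matters, since this only guards the id field.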

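Literature Manager

Manage academic literature collections: search → download → convert → organize → verify.

Dependencies

- pdftotext (poppler-utils): PDF text extraction
- curl: downloading
- python3: JSON processing in audit
- file (the file package, not coreutils): PDF validation
- uvx markitdown[pdf] (optional): fallback PDF→MD converter (note: plain uvx markitdown does NOT work for PDFs; you must use uvx markitdown[pdf])

Quick Start

    # Download a single paper by DOI
    bash scripts/download.sh 10.1038/s41592-024-02200-1 output_dir/

    # Convert a PDF to markdown
    bash scripts/convert.sh paper.pdf output.md

    # Verify a single PDF+MD pair
    bash scripts/verify.sh paper.pdf paper.md

    # Run a full audit of a references/ folder
    bash scripts/audit.sh /path/to/references/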

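Workflow

1. Search

Use web_fetch on Google Scholar:

https://scholar.google.com/scholar?q=QUERY&as_ylo=YEAR

Extract: title, authors, year, journal, DOI, PDF links.

For each result, identify the best open-access PDF source (see the source order under 2. Download).

2. Download

Run scripts/download.sh <DOI_or_URL> <output_dir/> per paper. The script tries sources in order:

1. Direct publisher PDF (Nature, eLife, Frontiers, PNAS, bioRxiv, arXiv)
2. EuropePMC (PMC_ID → PDF)
3. bioRxiv/arXiv preprint
4. Sci-Hub: https://sci-hub.box/<DOI> (use when the publisher is paywalled)

    # Sci-Hub download example:
    curl -L https://sci-hub.box/10.1038/nature12345 -o paper.pdf

⚠️ Legal note: Sci-Hub may violate publisher terms of service or copyright law in some jurisdictions. Use only if you understand and accept the legal implications in your context.

If all sources fail (including Sci-Hub), flag as permanent paywall. Provide the user with the DOI and ask for manual download.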

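3. Convert

Run scripts/convert.sh <input.pdf> <output.md>. Uses pdftotext (reliable) with uvx markitdown[pdf] as fallback.

    # Correct markitdown command for PDFs (quoted so shells such as zsh do not glob the brackets):
    uvx 'markitdown[pdf]' input.pdf > output.md

    # ⚠️ This does NOT work for PDFs (missing the [pdf] extra):
    uvx markitdown input.pdf

Prefer uvx markitdown[pdf] over pdftotext when full fidelity (tables, figure captions) matters.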

4. Organize

Standard folder structure:

    references/
    ├── README.md          # Human index (summary by category)
    ├── index.json         # Machine index (structured metadata)
    ├── RESOURCES.md       # Code repositories + datasets
    ├── resources.json     # Structured version
    ├── <category>/
    │   ├── papers/        # PDF files
    │   └── markdown/      # Converted text
    └── <category>/
        ├── papers/
        └── markdown/

Categories are user-defined. Number-prefix for sort order (e.g., 01-theoretical-frameworks/).

index.json schema per paper

    {
      "id": "short_id",
      "title": "Full title",
      "authors": ["Author1", "Author2"],
      "year": 2024,
      "journal": "Journal Name",
      "doi": "10.xxxx/...",
      "category": "category_name",
      "subcategory": "optional",
      "pdf_path": "category/papers/filename.pdf",
      "markdown_path": "category/markdown/filename.md",
      "tags": ["tag1", "tag2"],
      "one_line_summary": "English one-liner",
      "key_concepts": ["concept1"],
      "relevance_to_project": "English description"
    }

README.md pattern

Per category section, per paper: title, authors, year, journal, DOI, short summary in user's language.

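4b. DOI-Based Filenames & Path Mapping

Downloaded files are often named using DOI format rather than AuthorYear:

    10-1038_ncomms3018.md              # DOI: 10.1038/ncomms3018
    10-1016_j-neuron-2015-03-034.md

When markdown_path entries in index.json become stale (e.g., after folder reorganization), maintain a separate mapping file:

    // temp/paper_md_mapping.json
    {
      "author2024keyword": "references/new-downloads/10-1038_s41592-024-02200-1.md",
      ...
    }

To build this mapping: cross-reference each paper's DOI in index.json against actual files on disk. Use find + Python to automate.

index.json Known Pitfalls

- id: null corruption: If many entries have id=null and share the same pdf_path, the index was likely corrupted during a batch write. Rebuild from actual files on disk.
- DOI errors: Verify DOIs resolve correctly. Typos in DOI fields are common (e.g., wrong suffix digits). Always cross-check with the publisher page.
- Dead markdown_path: After restructuring folders, markdown_path in index.json often points to old locations. Use the mapping file above as the source of truth.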

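5. Verify

Run scripts/audit.sh <references_dir/> for full verification:

- Every PDF is valid (file -b reports PDF)
- Every PDF title matches its filename (pdftotext | head)
- Every PDF has a matching markdown file (and vice versa)
- index.json is valid and complete, its paths exist, and it has no duplicate IDs
- README.md stats match actual counts

6. Collect Resources

For tool/method papers, find GitHub repos and public datasets. Store in RESOURCES.md + resources.json.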

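Sub-agent Strategy

For large batches, parallelize:

- Download: 1 sub-agent per batch of ~5-8 papers
- Organize: 1 sub-agent to build indexes
- Verify: 1 independent sub-agent (never the same as the organizer)

Always use a separate sub-agent for verification (QC should not self-grade).

⚠️ Sub-agent Rules (Learned from Practice)

1. One batch at a time: do not spawn multiple note-writing batches simultaneously; LLM rate limits will cause silent failures
2. Set a cron monitor whenever spawning long-running agents: agents can fail silently without triggering auto-announce; cron catches this

Cron monitor pattern:

    1. Spawn agent
    2. Immediately set a cron job (every 10-15 min, isolated agentTurn)
       → Check whether the expected output files exist
       → Respawn failed agents
       → When all done: notify + delete the cron
    3. After the task completes, confirm the cron has been removed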

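Adding Papers Incrementally

To add papers to an existing collection:

1. Download + convert new papers into the correct category folder
2. Append entries to index.json
3. Update README.md stats
4. Run audit to verify consistency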
