PDF Rename — Academic Paper Organizer
Rename academic PDFs to: `[Year] [Venue] Title.pdf`
Three-stage pipeline (strict order):
Extract → Verify → Rename
Anti-error principle: Never re-parse PDF content during the Rename stage. The manifest is the single source of truth.
Quick Start
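```bash
# Stage 1: extract metadata → generate the manifest
python scripts/extract.py <folder_path>

# Stage 2: verify (manually or via web search), then inject the verified data
# → edit the scripts/VERIFIED_DATA dict with web-verified values
python scripts/apply_verified.py <folder_path>

# Stage 3: preview the rename plan
python scripts/execute.py <folder_path> --preview

# Execute the rename (with backup)
python scripts/execute.py <folder_path> --execute
```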
Workflow Details
Stage 1: Extract
`scripts/extract.py` reads every PDF in the folder and generates `manifest.json`.
For each PDF it extracts:
- Title: from the PDF's first-page text (heuristic: first non-metadata line)
- Year: from the filename prefix (most reliable) or the PDF text (conference-year pattern)
- Venue: inferred from the PDF text (NeurIPS, ICML, arXiv, etc.)
- Status: `needs_verification` (title/year come from auto-extraction)
Manifest schema: see `references/manifest_spec.md`
⚠️ PDF text extraction is unreliable for titles. Expect the filename to give a more accurate title than the PDF text. Always verify with a web search before executing the rename.
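The heuristics above could look roughly like the following. This is a minimal sketch, not the actual `scripts/extract.py`: the function names, the `KNOWN_VENUES` list, the metadata-line pattern, and the manifest field names (`original`, `title`, `year`, `venue`) are assumptions made for illustration. Only the `needs_verification` status comes from the text, and PDF text extraction itself is left out.

```python
import re

# Assumed venue list for illustration; conferences come before arXiv so a paper
# that still carries an arXiv stamp resolves to the conference when both appear.
KNOWN_VENUES = ("NeurIPS", "ICML", "ICLR", "CVPR", "arXiv")
_VENUE_ALT = "|".join(re.escape(v) for v in KNOWN_VENUES)

# Assumed pattern for first-page lines that are metadata rather than a title.
_METADATA_LINE = re.compile(r"^(arXiv:|Published|Proceedings|Preprint)", re.IGNORECASE)


def guess_title(first_page_text):
    """Return the first non-empty line that does not look like metadata."""
    for line in first_page_text.splitlines():
        line = line.strip()
        if line and not _METADATA_LINE.match(line):
            return line
    return None


def guess_year(filename, first_page_text=""):
    """Prefer a year prefix in the filename; fall back to 'Venue YYYY' in the text."""
    prefix = re.match(r"\W*((?:19|20)\d{2})(?!\d)", filename)
    if prefix:
        return int(prefix.group(1))
    in_text = re.search(rf"\b(?:{_VENUE_ALT})\s+((?:19|20)\d{{2}})\b", first_page_text)
    return int(in_text.group(1)) if in_text else None


def guess_venue(first_page_text):
    """Return the first known venue mentioned in the text, if any."""
    for venue in KNOWN_VENUES:
        if re.search(rf"\b{re.escape(venue)}\b", first_page_text, re.IGNORECASE):
            return venue
    return None


def make_entry(filename, first_page_text):
    """Build one manifest entry; every auto-extracted entry starts unverified."""
    return {
        "original": filename,
        "title": guess_title(first_page_text),
        "year": guess_year(filename, first_page_text),
        "venue": guess_venue(first_page_text),
        "status": "needs_verification",
    }
```

Note that a filename prefix can disagree with the year printed in the paper, which is one reason every entry starts as `needs_verification`.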
Stage 2: Verify
Before running the rename, verify the following manually or via web search:
- Title is correct (the filename is often sufficient, but multi-word titles may differ)
- Year is correct (arXiv submission year ≠ conference year)
- Venue is correct
Inject verified data via `scripts/apply_verified.py`:
- Key = original filename (exact match)
- Value = `{title, year, venue, confirmed: True}`
Set `confirmed: False` or omit the entry for files to skip.
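As an illustration of the key/value layout described above, here is a hedged sketch of a `VERIFIED_DATA` dictionary and a merge step. The sample filenames and titles, the `apply_verified` function, the manifest being a list of dicts with an `original` field, and the switch to a `ready` status on confirmation are all assumptions; the real `scripts/apply_verified.py` may work differently.

```python
# Hypothetical entries: original filename -> verified fields.
VERIFIED_DATA = {
    "2023_attention.pdf": {
        "title": "Attention Is Enough",
        "year": 2022,
        "venue": "NeurIPS",
        "confirmed": True,
    },
    # confirmed: False means "skip this file for now".
    "draft_v2.pdf": {
        "title": "Unfinished Draft",
        "year": 2024,
        "venue": "arXiv",
        "confirmed": False,
    },
}


def apply_verified(manifest, verified):
    """Return a new manifest in which confirmed entries carry verified fields.

    Entries that are missing from `verified`, or not confirmed, are passed
    through unchanged, so they keep their original status.
    """
    updated = []
    for entry in manifest:
        data = verified.get(entry["original"])  # exact filename match
        if data and data.get("confirmed"):
            entry = {
                **entry,
                "title": data["title"],
                "year": data["year"],
                "venue": data["venue"],
                "status": "ready",  # assumed status transition
            }
        updated.append(entry)
    return updated
```

Building new dicts instead of mutating the input keeps the auto-extracted manifest available for comparison until the result is written back.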
Stage 3: Rename
`scripts/execute.py` reads the manifest and renames files:
- Status must be `ready` to execute
- Duplicate titles → append `(1)`, `(2)`, etc.
- Files with status `needs_verification` or `manual_review` are skipped
- A backup is created automatically at `<folder>/backupYYYYMMDD_HHMMSS/`
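The rules above can be sketched as a planning step that works from manifest entries alone and never opens a PDF. This is an assumption-laden illustration rather than the actual `scripts/execute.py`: the function names, the manifest field names, the replacement of characters that are invalid in filenames, and the placement of the `(1)`, `(2)` suffix before `.pdf` are all hypothetical.

```python
import re
from collections import Counter


def target_name(entry):
    """Format '[Year] [Venue] Title.pdf' from a manifest entry."""
    # Assumed sanitising step: drop characters many filesystems reject.
    title = re.sub(r'[\\/:*?"<>|]', " ", entry["title"])
    title = re.sub(r"\s+", " ", title).strip()
    return f"[{entry['year']}] [{entry['venue']}] {title}.pdf"


def plan_renames(manifest):
    """Return (old_name, new_name) pairs for entries whose status is 'ready'."""
    ready = [e for e in manifest if e["status"] == "ready"]
    # Count how often each target name occurs so that every member of a
    # duplicate group receives a numbered suffix, starting at (1).
    totals = Counter(target_name(e).lower() for e in ready)
    seen = Counter()
    plan = []
    for entry in ready:
        name = target_name(entry)
        key = name.lower()
        if totals[key] > 1:
            seen[key] += 1
            name = f"{name[:-len('.pdf')]} ({seen[key]}).pdf"
        plan.append((entry["original"], name))
    return plan
```

Returning a plan instead of renaming immediately makes a preview mode trivial: print the pairs for a preview, or apply them for an execute run.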
Key Design Decisions
| Problem | Solution |
|---|---|
| PDF title extraction garbled/incomplete | Use filename as primary title source; PDF text only for venue/year hints |
| Wrong year from arXiv ID vs conference year | Verify with web search; inject corrected year in `VERIFIED_DATA` |
| Duplicate papers (same content, different filenames) | Detect via title similarity; rename both with `(1)`, `(2)` suffixes |
| Accidental data loss | Always create timestamped backup before renaming |
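For the last row, a timestamped backup might be created along these lines. This is only a sketch: `backup_pdfs` is a hypothetical helper, the directory name follows the `backupYYYYMMDD_HHMMSS` pattern as it is written in this document, and copying only `*.pdf` files is an assumption about what needs protecting.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_pdfs(folder, now=None):
    """Copy every PDF in `folder` into a new timestamped subfolder.

    Returns the path of the backup directory. `now` can be injected so the
    directory name is predictable in tests.
    """
    folder = Path(folder)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    backup_dir = folder / f"backup{stamp}"
    backup_dir.mkdir()  # raises FileExistsError rather than overwriting a backup
    # sorted() materialises the listing before any file is copied.
    for pdf in sorted(folder.glob("*.pdf")):
        shutil.copy2(pdf, backup_dir / pdf.name)  # copy2 also keeps timestamps
    return backup_dir
```

Copying before renaming, rather than moving, means the originals stay untouched if the backup step fails partway through.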
Scripts
| Script | Purpose |
|---|---|
| `scripts/extract.py` | Stage 1: extract PDF metadata → manifest.json |
| `scripts/apply_verified.py` | Stage 2: inject verified data into manifest |
| `scripts/execute.py` | Stage 3: rename files from manifest (preview or execute) |
| `scripts/find_duplicates.py` | Utility: detect near-duplicate titles in manifest |
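Near-duplicate title detection, as done by the utility in the last row, can be approximated with the standard library's `difflib`. The sketch below is not the actual `scripts/find_duplicates.py`: the normalisation step, the function names, and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title):
    """Lowercase the title and collapse punctuation and whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return " ".join(cleaned.split())


def find_near_duplicates(titles, threshold=0.9):
    """Return (title_a, title_b, ratio) for every pair at or above `threshold`."""
    pairs = []
    for a, b in combinations(titles, 2):
        ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs
```

The pairwise comparison is quadratic in the number of titles, which is usually acceptable for a single folder of papers.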
References
- `references/manifest_spec.md`: Full manifest JSON schema
- `references/venue_abbrev.md`: Standard venue abbreviation map
- `references/anti_patterns.md`: Common mistakes and how to avoid them
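PDF Rename: Academic Paper Organizer
Rename academic PDFs to: `[Year] [Venue] Title.pdf`
Three-stage pipeline (strict order):
Extract → Verify → Rename
Anti-error principle: Never re-parse PDF content during the Rename stage. The manifest is the single source of truth.
Quick Start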
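```bash
# Stage 1: extract metadata → generate the manifest
python scripts/extract.py <folder_path>

# Stage 2: verify (manually or via web search), then inject the verified data
# → edit the scripts/VERIFIED_DATA dict with web-verified values
python scripts/apply_verified.py <folder_path>

# Stage 3: preview the rename plan
python scripts/execute.py <folder_path> --preview

# Execute the rename (with backup)
python scripts/execute.py <folder_path> --execute
```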
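Workflow Details
Stage 1: Extract
`scripts/extract.py` reads every PDF in the folder and generates `manifest.json`.
For each PDF it extracts:
- Title: from the PDF's first-page text (heuristic: first non-metadata line)
- Year: from the filename prefix (most reliable) or the PDF text (conference-year pattern)
- Venue: inferred from the PDF text (NeurIPS, ICML, arXiv, etc.)
- Status: `needs_verification` (title/year come from auto-extraction)
Manifest schema: see `references/manifest_spec.md`
⚠️ PDF text extraction is unreliable for titles. Expect the filename to give a more accurate title than the PDF text. Always verify with a web search before executing the rename.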
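Stage 2: Verify
Before running the rename, verify the following manually or via web search:
- Title is correct (the filename is often sufficient, but multi-word titles may differ)
- Year is correct (arXiv submission year ≠ conference year)
- Venue is correct
Inject verified data via `scripts/apply_verified.py`:
- Key = original filename (exact match)
- Value = `{title, year, venue, confirmed: True}`
Set `confirmed: False` or omit the entry for files to skip.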
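Stage 3: Rename
`scripts/execute.py` reads the manifest and renames files:
- Status must be `ready` to execute
- Duplicate titles → append `(1)`, `(2)`, etc.
- Files with status `needs_verification` or `manual_review` are skipped
- A backup is created automatically at `<folder>/backupYYYYMMDD_HHMMSS/`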
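Key Design Decisions
| Problem | Solution |
|---|---|
| PDF title extraction garbled/incomplete | Use filename as primary title source; PDF text only for venue/year hints |
| arXiv ID disagrees with conference year | Verify with web search; inject corrected year in `VERIFIED_DATA` |
| Duplicate papers (same content, different filenames) | Detect via title similarity; rename both with `(1)`, `(2)` suffixes |
| Accidental data loss | Always create timestamped backup before renaming |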
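Scripts
| Script | Purpose |
|---|---|
| `scripts/extract.py` | Stage 1: extract PDF metadata → manifest.json |
| `scripts/apply_verified.py` | Stage 2: inject verified data into manifest |
| `scripts/execute.py` | Stage 3: rename files from manifest (preview or execute) |
| `scripts/find_duplicates.py` | Utility: detect near-duplicate titles in manifest |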
References
- `references/manifest_spec.md`: Full manifest JSON schema
- `references/venue_abbrev.md`: Standard venue abbreviation map
- `references/anti_patterns.md`: Common mistakes and how to avoid them