PDF Rename — Academic Paper Organizer

Rename academic PDFs to: `[Year] [Venue] Title.pdf`

Three-stage pipeline (strict order):

```
Extract → Verify → Rename
```

Anti-error principle: never re-parse PDF content during the Rename stage. The manifest is the single source of truth.

Quick Start

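```bash
# Stage 1: extract metadata → generate the manifest
python scripts/extract.py <folder_path>

# Stage 2: verify (manually or via web search), then inject the verified data
# → edit the scripts/VERIFIED_DATA dict with web-verified values
python scripts/apply_verified.py <folder_path>

# Stage 3: preview the rename plan
python scripts/execute.py <folder_path> --preview

# Execute the rename (with backup)
python scripts/execute.py <folder_path> --execute
```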

Workflow Details

Stage 1: Extract

`scripts/extract.py` reads every PDF in the folder and generates manifest.json.

For each PDF it extracts:

- Title: from the PDF's first-page text (heuristic: first non-metadata line)
- Year: from the filename prefix (most reliable) or the PDF text (conference-year pattern)
- Venue: inferred from the PDF text (NeurIPS, ICML, arXiv, etc.)
- Status: needs_verification (title and year come from auto-extraction)
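
The year and venue hints could be derived roughly as follows. This is a minimal sketch, not the code in scripts/extract.py: the helper names, the venue list, the regular expressions, and the entry fields other than the needs_verification status are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: a small illustrative venue list, checked in priority order,
# so a conference name wins over an arXiv stamp on the same page.
KNOWN_VENUES = ["NeurIPS", "ICML", "ICLR", "CVPR", "ACL", "arXiv"]


def year_from_filename(filename: str) -> Optional[int]:
    """Read a 4-digit year at the start of the filename, e.g. '2017_paper.pdf'."""
    match = re.match(r"\[?((?:19|20)\d{2})(?!\d)", filename)
    return int(match.group(1)) if match else None


def venue_from_text(text: str) -> Optional[str]:
    """Return the first known venue name found in the text (case-insensitive)."""
    for venue in KNOWN_VENUES:
        if re.search(rf"\b{re.escape(venue)}\b", text, flags=re.IGNORECASE):
            return venue
    return None


def make_entry(filename: str, first_page_text: str) -> dict:
    """Build one hypothetical manifest entry; nothing here is trusted yet."""
    return {
        "filename": filename,
        "year": year_from_filename(filename),
        "venue": venue_from_text(first_page_text),
        "status": "needs_verification",
    }
```

Every entry starts as needs_verification, so nothing produced by these heuristics can reach the rename stage without a manual check.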

Manifest schema: see `references/manifest_spec.md`

⚠️ PDF text extraction is unreliable for titles: the filename usually yields a better title than the PDF text. Always verify with a web search before executing the rename.

Stage 2: Verify

Before running the rename, verify manually or via web search that:

1. Title is correct (the filename is often sufficient, but multi-word titles may differ)
2. Year is correct (arXiv submission year ≠ conference year)
3. Venue is correct

Inject verified data via scripts/apply_verified.py:

- Key = original filename (exact match)
- Value = `{title, year, venue, confirmed: True}`

Set confirmed: False, or omit the entry entirely, for any file that should be skipped.
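
A sketch of how the injection step might work. The VERIFIED_DATA name, the four value fields, and the status names come from this document; the manifest layout (a dict keyed by original filename), the `apply_verified` function, and the rule that confirmed entries are promoted to ready are assumptions. The actual schema is in references/manifest_spec.md.

```python
VERIFIED_DATA = {
    "2017_attention.pdf": {
        "title": "Attention Is All You Need",
        "year": 2017,
        "venue": "NeurIPS",
        "confirmed": True,
    },
    # confirmed: False keeps this file out of the rename for now.
    "draft_v3.pdf": {
        "title": "Unknown",
        "year": 2023,
        "venue": "arXiv",
        "confirmed": False,
    },
}


def apply_verified(manifest: dict, verified: dict) -> dict:
    """Copy confirmed values into the manifest and mark those entries ready."""
    for filename, entry in manifest.items():
        data = verified.get(filename)  # exact filename match only
        if not data or not data.get("confirmed"):
            continue  # omitted or unconfirmed entries keep their old status
        entry.update(title=data["title"], year=data["year"], venue=data["venue"])
        entry["status"] = "ready"  # assumption: confirmation promotes to ready
    return manifest


manifest = {
    "2017_attention.pdf": {"title": "attention", "year": 2017, "venue": None,
                           "status": "needs_verification"},
    "draft_v3.pdf": {"title": "draft v3", "year": None, "venue": None,
                     "status": "needs_verification"},
    "other.pdf": {"title": "other", "year": None, "venue": None,
                  "status": "needs_verification"},
}
apply_verified(manifest, VERIFIED_DATA)
```

After the call, only 2017_attention.pdf is ready; the unconfirmed and the omitted file both stay at needs_verification and will be skipped by Stage 3.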

Stage 3: Rename

`scripts/execute.py` reads the manifest and renames files:

- Status must be ready for a file to be renamed
- Duplicate titles → append (1), (2), etc.
- Files with status needs_verification or manual_review are skipped
- A backup is created automatically in a timestamped directory (name ending in YYYYMMDD_HHMMSS) inside the folder
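
The rules above can be sketched as a pure planning step followed by a small execution step. This is an illustration under stated assumptions, not the logic of scripts/execute.py: the manifest layout, the function names, the literal brackets in the target name, and the backup directory name are all hypothetical.

```python
import shutil
from collections import Counter
from datetime import datetime
from pathlib import Path


def plan_renames(manifest: dict) -> dict:
    """Map original filename -> new filename, using only 'ready' entries."""
    bases = {
        name: f"[{e['year']}] [{e['venue']}] {e['title']}"
        for name, e in manifest.items()
        if e["status"] == "ready"  # everything else is skipped
    }
    totals = Counter(bases.values())
    used = Counter()
    plan = {}
    for name, base in bases.items():
        if totals[base] > 1:
            # Duplicate target: every copy gets a numbered suffix.
            used[base] += 1
            plan[name] = f"{base} ({used[base]}).pdf"
        else:
            plan[name] = f"{base}.pdf"
    return plan


def execute(folder: Path, plan: dict) -> Path:
    """Copy each planned file into a backup directory, then rename it."""
    # Assumption: this directory name is illustrative only.
    backup = folder / f"backup_{datetime.now():%Y%m%d_%H%M%S}"
    backup.mkdir()
    for original, new_name in plan.items():
        shutil.copy2(folder / original, backup / original)
        (folder / original).rename(folder / new_name)
    return backup
```

Because `plan_renames` only reads the manifest and never opens a PDF, printing its result is enough for a preview, which keeps the rename stage consistent with the anti-error principle. The sketch does not sanitize characters that are invalid in filenames.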

Key Design Decisions

| Problem | Solution |
|---|---|
| PDF title extraction garbled/incomplete | Use filename as primary title source; PDF text only for venue/year hints |
| Wrong year from arXiv ID vs conference year | Verify with web search; inject corrected year in VERIFIED_DATA |
| Duplicate papers (same content, different filenames) | Detect via title similarity; rename both with (1), (2) suffixes |
| Accidental data loss | Always create timestamped backup before renaming |
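
The document does not say how title similarity is measured. One plausible approach, sketched here with the standard library's `difflib.SequenceMatcher`, compares normalized titles pairwise; the function names and the 0.9 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(title: str) -> str:
    """Lowercase the title and keep only letters, digits and single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(kept.split())


def find_duplicates(titles: dict, threshold: float = 0.9) -> list:
    """Return pairs of filenames whose normalized titles are near-identical."""
    pairs = []
    for (name_a, title_a), (name_b, title_b) in combinations(titles.items(), 2):
        ratio = SequenceMatcher(None, normalize(title_a), normalize(title_b)).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b))
    return pairs


titles = {
    "a.pdf": "Attention Is All You Need",
    "b.pdf": "Attention is all you need.",
    "c.pdf": "Deep Residual Learning for Image Recognition",
}
duplicates = find_duplicates(titles)
```

Here a.pdf and b.pdf normalize to the same string and are reported as a pair, while c.pdf is left alone. Pairwise comparison is quadratic in the number of files, which is usually acceptable for a personal paper folder.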

Scripts

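| Script | Purpose |
|---|---|
| `scripts/extract.py` | Stage 1: extract PDF metadata → manifest.json |
| `scripts/apply_verified.py` | Stage 2: inject verified data into the manifest |
| `scripts/execute.py` | Stage 3: preview or execute the rename (with backup) |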

References

- `references/manifest_spec.md`: Full manifest JSON schema
- `references/venue_abbrev.md`: Standard venue abbreviation map
- `references/anti_patterns.md`: Common mistakes and how to avoid them
