When to Use
Use this skill when an agent needs to extract text from PDFs. It relies on PyMuPDF (imported as `fitz`) for fast local extraction and covers text-based documents, scanned pages (via OCR), forms, and complex layouts.
Quick Reference
| Topic | File |
|---|---|
| Code examples | INLINECODE0 |
| OCR setup | `ocr.md` |
| Troubleshooting | `troubleshooting.md` |
Core Rules
1. Install PyMuPDF First
```bash
pip install PyMuPDF
```
Import as fitz (historical name):
```python
import fitz  # PyMuPDF
```
2. Basic Text Extraction
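```python
import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
```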
3. Pick the Right Method
| PDF Type | Method |
|---|---|
| Text-based | `page.get_text()` (fast, accurate) |
| Scanned | OCR with pytesseract (slower) |
| Mixed | Check each page, use OCR when needed |
4. Check for Text Before OCR
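```python
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # very little text: probably a scanned page
```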
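For mixed PDFs, the same check can be applied page by page to decide which pages need OCR. The sketch below is one way to do that. `plan_extraction` and `min_chars` are illustrative names, not part of PyMuPDF, and the 50-character threshold is a heuristic rather than a library default. It accepts any iterable of objects with a `get_text()` method, such as an open `fitz.Document`.

```python
def plan_extraction(pages, min_chars=50):
    """Return one (page_number, method) tuple per page.

    method is "text" when the page has an extractable text layer and
    "ocr" when it has so little text that it is probably a scan.
    """
    plan = []
    for number, page in enumerate(pages, start=1):
        text = page.get_text().strip()
        method = "text" if len(text) >= min_chars else "ocr"
        plan.append((number, method))
    return plan
```

Only the pages marked `"ocr"` then need to go through the slower OCR path.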
5. Handle Errors Gracefully
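```python
import fitz

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; unlock them before reading pages.
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
```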
Extraction Traps
| Trap | What Happens | Fix |
|---|---|---|
| OCR on text PDF | Slow + worse accuracy | Check `get_text()` first |
| Forget to close doc | Memory leak | Use `with` or `doc.close()` |
| Assume page order | Wrong reading flow | Use `sort=True` in `get_text()` |
| Ignore encoding | Garbled characters | PyMuPDF handles UTF-8 |
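As a sketch of the `with` and `sort=True` fixes from the table: a PyMuPDF document can be used as a context manager, so it is closed even if extraction raises. `read_pdf_text` and its `opener` parameter are hypothetical names introduced here. `opener` exists only so the function can be exercised without a real PDF, and it defaults to `fitz.open`.

```python
def read_pdf_text(path, opener=None):
    """Return the text of every page, closing the document afterwards."""
    if opener is None:
        import fitz  # PyMuPDF, imported lazily so a custom opener needs no install
        opener = fitz.open
    # The with-block closes the document even if get_text() raises.
    with opener(path) as doc:
        # sort=True orders text by vertical, then horizontal position.
        return "\n".join(page.get_text(sort=True) for page in doc)
```

A typical call would be `read_pdf_text("document.pdf")`, with the default opener doing the work.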
Scope
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues
This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs
Security & Privacy
All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system
Output Formats
Plain Text
```python
text = page.get_text()
```
Structured (dict)
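```python
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
```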
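The span-level font sizes in the dict output can drive simple layout heuristics. The sketch below collects the text of the largest spans on a page as heading candidates. `largest_spans` is a hypothetical helper, and treating the largest font as a heading is an assumption that will not hold for every document. It takes the dictionary returned by `page.get_text("dict")`.

```python
def largest_spans(page_dict):
    """Return the text of all spans set in the page's largest font size."""
    spans = [
        span
        for block in page_dict["blocks"]
        if block["type"] == 0        # 0 = text block; image blocks have no "lines"
        for line in block["lines"]
        for span in line["spans"]
        if span["text"].strip()      # skip whitespace-only spans
    ]
    if not spans:
        return []
    top = max(span["size"] for span in spans)
    return [span["text"] for span in spans if span["size"] == top]
```

With PyMuPDF this would be called as `largest_spans(page.get_text("dict"))`.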
JSON
```python
import json

data = page.get_text("json")
parsed = json.loads(data)
```
Full Example
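```python
import fitz

def extract_pdf(path):
    """Extract text from a PDF, flagging scanned pages for OCR as a fallback."""
    doc = fitz.open(path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        # Very little text usually means a scanned page
        if len(text.strip()) < 50:
            # Run OCR here (see ocr.md)
            method = "needs_ocr"
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
```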
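The dictionary returned by a function like `extract_pdf` holds only strings, numbers, lists, and dicts, so it can be written out with the standard `json` module. A minimal sketch, assuming the `pages` / `content` / `word_count` layout used in this skill; `save_result` is a hypothetical helper name.

```python
import json
from pathlib import Path

def save_result(result, out_path):
    """Write an extraction result to a UTF-8 JSON file and return its path."""
    out = Path(out_path)
    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
    out.write_text(
        json.dumps(result, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out
```

Writing with an explicit `encoding="utf-8"` avoids depending on the platform's default encoding.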
Feedback
- Useful? `clawhub star extract-pdf-text`
- Stay updated: `clawhub sync`
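When to Use
Use this skill when an agent needs to extract text from PDFs. It relies on PyMuPDF (imported as `fitz`) for fast local extraction and covers text-based documents, scanned pages (via OCR), forms, and complex layouts.
Quick Reference
| Topic | File |
|---|---|
| OCR setup | `ocr.md` |
| Troubleshooting | `troubleshooting.md` |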
Core Rules
1. Install PyMuPDF First
```bash
pip install PyMuPDF
```
Import as fitz (historical name):
```python
import fitz  # PyMuPDF
```
2. Basic Text Extraction
```python
import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
```
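3. Pick the Right Method
| PDF Type | Method |
|---|---|
| Text-based | `page.get_text()` (fast, accurate) |
| Scanned | OCR with pytesseract (slower) |
| Mixed | Check each page, use OCR when needed |
4. Check for Text Before OCR
```python
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # very little text: probably a scanned page
```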
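5. Handle Errors Gracefully
```python
import fitz

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; unlock them before reading pages.
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
```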
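Extraction Traps
| Trap | What Happens | Fix |
|---|---|---|
| OCR on text PDF | Slow + worse accuracy | Check `get_text()` first |
| Forget to close doc | Memory leak | Use `with` or `doc.close()` |
| Assume page order | Wrong reading flow | Use `sort=True` in `get_text()` |
| Ignore encoding | Garbled characters | PyMuPDF handles UTF-8 |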
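Scope
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues
This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs
Security & Privacy
All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system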
Output Formats
Plain Text
```python
text = page.get_text()
```
Structured (dict)
```python
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
```
JSON
```python
import json

data = page.get_text("json")
parsed = json.loads(data)
```
Full Example
```python
import fitz

def extract_pdf(path):
    """Extract text from a PDF, flagging scanned pages for OCR as a fallback."""
    doc = fitz.open(path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        # Very little text usually means a scanned page
        if len(text.strip()) < 50:
            # Run OCR here (see ocr.md)
            method = "needs_ocr"
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
```
Feedback
- Useful? `clawhub star extract-pdf-text`
- Stay updated: `clawhub sync`