Tool List
1. pdf_to_markdown
Convert PDF documents to Markdown format, preserving document structure, formulas, tables, and images.
Description: Use MinerU to parse PDF documents and output in Markdown format, supporting OCR, formula recognition, table extraction, and other features.
Parameters:
- file_path (string, required): Absolute path to the PDF file
- output_dir (string, required): Absolute path to the output directory
- backend (string, optional): Parsing backend; one of hybrid-auto-engine (default), pipeline, vlm-auto-engine
- language (string, optional): OCR language code, such as en (English), ch (Chinese), ja (Japanese); defaults to auto-detection
- enable_formula (boolean, optional): Whether to enable formula recognition; defaults to true
- enable_table (boolean, optional): Whether to enable table extraction; defaults to true
- start_page (integer, optional): Start page number (0-based); defaults to 0
- end_page (integer, optional): End page number (0-based); defaults to -1, meaning parse all pages
Return Value:
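{
  "success": true,
  "output_path": "/path/to/output",
  "markdown_content": "Converted Markdown content...",
  "images": ["list of image paths"],
  "tables": ["list of table information"],
  "formula_count": 10
}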
Examples:
python .claude/skills/pdf-process/script/pdf_parser.py \
'{"name": "pdf_to_markdown", "arguments": {"file_path": "/path/to/document.pdf", "output_dir": "/path/to/output"}}'
# Use specific backend
python .claude/skills/pdf-process/script/pdf_parser.py \
'{"name": "pdf_to_markdown", "arguments": {"file_path": "/path/to/document.pdf", "output_dir": "/path/to/output", "backend": "pipeline"}}'
# Parse specific pages
python .claude/skills/pdf-process/script/pdf_parser.py \
'{"name": "pdf_to_markdown", "arguments": {"file_path": "/path/to/document.pdf", "output_dir": "/path/to/output", "start_page": 0, "end_page": 5}}'
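The shell examples above pass one JSON object with a name and an arguments field. A minimal Python sketch of building the same call follows. The helper name build_command is hypothetical, and the sketch assumes the script accepts the JSON request as its single command-line argument, as the examples suggest.

```python
import json
import sys

# Script path taken from the examples above.
SCRIPT = ".claude/skills/pdf-process/script/pdf_parser.py"


def build_command(tool_name, arguments):
    """Return the argv list for one tool call: interpreter, script, JSON request."""
    request = {"name": tool_name, "arguments": arguments}
    return [sys.executable, SCRIPT, json.dumps(request)]


command = build_command(
    "pdf_to_markdown",
    {
        "file_path": "/path/to/document.pdf",
        "output_dir": "/path/to/output",
        "start_page": 0,
        "end_page": 5,
    },
)
# To execute it: subprocess.run(command, capture_output=True, text=True, check=True)
```

Passing the request as a list element rather than a shell string avoids quoting problems when paths contain spaces.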
2. pdf_to_json
Convert PDF documents to JSON format, including detailed layout and structural information.
Description: Use MinerU to parse PDF documents and output in JSON format, containing structured information such as text blocks, images, tables, formulas, etc.
Parameters:
- file_path (string, required): Absolute path to the PDF file
- output_dir (string, required): Absolute path to the output directory
- backend (string, optional): Parsing backend; one of hybrid-auto-engine (default), pipeline, vlm-auto-engine
- language (string, optional): OCR language code, such as en (English), ch (Chinese), ja (Japanese); defaults to auto-detection
- enable_formula (boolean, optional): Whether to enable formula recognition; defaults to true
- enable_table (boolean, optional): Whether to enable table extraction; defaults to true
- start_page (integer, optional): Start page number (0-based); defaults to 0
- end_page (integer, optional): End page number (0-based); defaults to -1, meaning parse all pages
Return Value:
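{
  "success": true,
  "output_path": "/path/to/output.json",
  "pages": [
    {
      "page_no": 0,
      "page_size": [595, 842],
      "blocks": [
        {
          "type": "text",
          "text": "Text content",
          "bbox": [x, y, x, y]
        }
      ],
      "images": [],
      "tables": [],
      "formulas": []
    }
  ],
  "metadata": {
    "total_pages": 10,
    "author": "Author",
    "title": "Title"
  }
}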
Examples:
python .claude/skills/pdf-process/script/pdf_parser.py \
'{"name": "pdf_to_json", "arguments": {"file_path": "/path/to/document.pdf", "output_dir": "/path/to/output"}}'
# Use specific backend and language
python .claude/skills/pdf-process/script/pdf_parser.py \
'{"name": "pdf_to_json", "arguments": {"file_path": "/path/to/document.pdf", "output_dir": "/path/to/output", "backend": "hybrid-auto-engine", "language": "ch"}}'
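Once a pdf_to_json result has been loaded into a Python dict, its text can be gathered page by page. The sketch below assumes the result carries a success flag and a pages list whose entries hold blocks with type and text fields. The function name collect_text and the sample data are hypothetical.

```python
def collect_text(result):
    """Join the text of every block whose type is "text", in page order."""
    if not result.get("success"):
        raise ValueError("conversion did not succeed")
    lines = []
    for page in result.get("pages", []):
        for block in page.get("blocks", []):
            if block.get("type") == "text":
                lines.append(block.get("text", ""))
    return "\n".join(lines)


# Hypothetical sample shaped like the assumed result structure.
sample = {
    "success": True,
    "pages": [
        {"page_no": 0, "blocks": [{"type": "text", "text": "Hello", "bbox": [0, 0, 10, 10]}]},
        {"page_no": 1, "blocks": [{"type": "text", "text": "World", "bbox": [0, 0, 10, 10]}]},
    ],
}
print(collect_text(sample))
```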
Installation Instructions
1. Install MinerU
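# Update pip and install uv
pip install --upgrade pip
pip install uv
# Install MinerU (with all features)
uv pip install -U "mineru[all]"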
2. Verify Installation
# Check that MinerU installed successfully
mineru --version
# Test basic functionality
mineru --help
3. System Requirements
- Python Version: 3.10-3.13
- Operating System: Linux / Windows / macOS 14.0+
- Memory:
  - pipeline backend: minimum 16GB, recommended 32GB+
  - hybrid/vlm backend: minimum 16GB, recommended 32GB+
- Disk Space: minimum 20GB (SSD recommended)
- GPU (optional):
  - pipeline backend: supports CPU-only operation
  - hybrid/vlm backend: requires an NVIDIA GPU (Volta architecture or later) or Apple Silicon
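The Python version requirement can be checked before installing. This is a minimal sketch; the function name python_supported is hypothetical, and the Windows upper bound of 3.12 comes from the Troubleshooting section of this document (ray does not support 3.13 on Windows).

```python
import sys


def python_supported(version=None, platform=None):
    """Check a (major, minor) pair against the supported Python range."""
    major, minor = (version or sys.version_info)[:2]
    platform = platform or sys.platform
    # Windows tops out at 3.12; other platforms accept up to 3.13.
    newest_minor = 12 if platform.startswith("win") else 13
    return major == 3 and 10 <= minor <= newest_minor


if not python_supported():
    print("This Python version is outside the supported range for MinerU.")
```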
Use Cases
1. Academic Paper Parsing: Extract structured content such as formulas, tables, and images
2. Technical Document Conversion: Convert PDF documents to Markdown for version control and online publishing
3. OCR Processing: Process scanned PDFs and PDFs with garbled text
4. Multilingual Documents: Supports OCR for 109 languages
5. Batch Processing: Convert multiple PDF documents in one run
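For batch processing, one request can be generated per PDF in a directory. The sketch below only builds the request objects. The function name batch_requests and the choice of one output subdirectory per document are assumptions, not part of the tool.

```python
from pathlib import Path


def batch_requests(pdf_dir, output_root):
    """Build one pdf_to_markdown request per PDF, each with its own output directory."""
    pdf_dir = Path(pdf_dir).resolve()
    output_root = Path(output_root).resolve()
    requests = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        requests.append({
            "name": "pdf_to_markdown",
            "arguments": {
                # resolve() above keeps both paths absolute, as the tool requires.
                "file_path": str(pdf),
                "output_dir": str(output_root / pdf.stem),
            },
        })
    return requests
```

Each request can then be serialized with json.dumps and passed to the script in the same way as the single-file examples.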
Backend Selection Recommendations
- hybrid-auto-engine (default): Balanced accuracy and speed, suitable for most scenarios
- pipeline: Suitable for CPU-only environments, best compatibility
- vlm-auto-engine: Highest accuracy, requires GPU acceleration
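These recommendations can be expressed as a small selection function. This is a sketch of the guidance above rather than logic taken from the tool; the function name choose_backend and its two parameters are hypothetical.

```python
def choose_backend(has_gpu, prefer_accuracy=False):
    """Pick a backend name following the recommendations above."""
    if not has_gpu:
        # pipeline is the only backend listed as CPU-only.
        return "pipeline"
    return "vlm-auto-engine" if prefer_accuracy else "hybrid-auto-engine"
```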
Notes
1. File Paths: All paths must be absolute
2. Output Directory: Directories that do not exist are created automatically
3. Performance: Using a GPU can significantly improve parsing speed
4. Page Numbers: Page numbers are counted from 0
5. Memory: Processing large documents may consume more memory
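Several of these notes can be checked before the script is called. The following sketch validates an arguments object against the rules stated in this document: absolute paths, known backend names, and 0-based page numbers with -1 meaning all pages. The function name validate_arguments and the wording of the messages are hypothetical.

```python
import os

BACKENDS = {"hybrid-auto-engine", "pipeline", "vlm-auto-engine"}


def validate_arguments(arguments):
    """Return a list of problems; an empty list means the arguments look usable."""
    problems = []
    for key in ("file_path", "output_dir"):
        value = arguments.get(key)
        if not value:
            problems.append(f"{key} is required")
        elif not os.path.isabs(value):
            problems.append(f"{key} must be an absolute path")
    backend = arguments.get("backend", "hybrid-auto-engine")
    if backend not in BACKENDS:
        problems.append(f"unknown backend: {backend}")
    start = arguments.get("start_page", 0)
    end = arguments.get("end_page", -1)
    if start < 0:
        problems.append("start_page must be 0 or greater")
    if end != -1 and end < start:
        problems.append("end_page must be -1 or not less than start_page")
    return problems
```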
Troubleshooting
Common Issues
1. Installation Failure:
  - Ensure you are using Python 3.10-3.13
  - Windows only supports Python 3.10-3.12 (ray does not support 3.13)
  - Using uv pip install can resolve most dependency conflicts
2. Insufficient Memory:
  - Use the pipeline backend
  - Limit the pages parsed with start_page and end_page
  - Reduce virtual memory allocation
3. Slow Parsing Speed:
  - Enable GPU acceleration
  - Use the hybrid-auto-engine backend
  - Disable unnecessary features (formulas, tables)
4. Low OCR Accuracy:
  - Specify the correct document language
  - Ensure the backend supports OCR (use pipeline or hybrid-*)
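When memory is the constraint, a large document can be parsed in several smaller page ranges. The sketch below splits a page count into (start_page, end_page) pairs. It assumes end_page is inclusive, which this document does not state, so verify that against the script before relying on it. The function name page_chunks is hypothetical.

```python
def page_chunks(total_pages, chunk_size):
    """Yield (start_page, end_page) pairs covering pages 0..total_pages-1."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        # Assumption: end_page is inclusive, so the last page of a chunk is start + size - 1.
        yield start, min(start + chunk_size, total_pages) - 1


print(list(page_chunks(10, 4)))  # [(0, 3), (4, 7), (8, 9)]
```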
Related Resources
- MinerU Official Documentation: https://opendatalab.github.io/MinerU/
- MinerU GitHub: https://github.com/opendatalab/MinerU
- Online Demo: https://mineru.net/