MinerU PDF Parser
Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.
Installation
Option 1: Install MinerU MCP (for Claude Code)
CODEBLOCK0
This installs and configures MinerU for all Claude projects. Models are downloaded on first use.
Option 2: Use Direct Tool (preserves files)
The skill includes a direct parsing tool that saves output to a persistent directory:
CODEBLOCK1
Advantages:
- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU
Quick Start
Method 1: Using the Direct Tool (Recommended)
CODEBLOCK2
Method 2: Using MinerU MCP (Temporary Files)
Parse a PDF document
CODEBLOCK3
Check system capabilities
CODEBLOCK4
Parameters
parse_pdf
Required:
- file_path - Absolute path to the PDF file
Optional:
- backend - Processing backend (default: pipeline)
  - pipeline - Fast, general-purpose (recommended)
  - vlm-mlx-engine - Fastest on Apple Silicon (M1/M2/M3/M4)
  - vlm-transformers - Slowest but most accurate
- formula_enable - Enable formula recognition (default: true)
- table_enable - Enable table recognition (default: true)
- start_page - Starting page (0-indexed, default: 0)
- end_page - Ending page (default: -1 for all pages)
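The parameter rules above can be checked before a call is made. The following is a minimal sketch, not part of MinerU: `build_parse_args` is a hypothetical helper, and the check that `end_page` is not smaller than `start_page` is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath

# Backend names and defaults are copied from the parameter list above.
BACKENDS = {"pipeline", "vlm-mlx-engine", "vlm-transformers"}


def build_parse_args(file_path, backend="pipeline", formula_enable=True,
                     table_enable=True, start_page=0, end_page=-1):
    """Hypothetical helper: validate and assemble parse_pdf arguments."""
    if not PurePosixPath(file_path).is_absolute():
        raise ValueError("file_path must be an absolute path")
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and cannot be negative")
    # Assumption: a range that ends before it starts is a caller mistake.
    if end_page != -1 and end_page < start_page:
        raise ValueError("end_page must be -1 (all pages) or >= start_page")
    return {
        "file_path": file_path,
        "backend": backend,
        "formula_enable": formula_enable,
        "table_enable": table_enable,
        "start_page": start_page,
        "end_page": end_page,
    }
```

The returned dictionary has the same keys as the `arguments` mapping passed to `parse_pdf`, so it could be handed to the call directly.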
list_backends
No parameters required. Returns system information and backend recommendations.
Usage Examples
Extract tables from a specific page range
CODEBLOCK5
Parse with formula recognition only (faster)
CODEBLOCK6
Parse single page (fastest for testing)
CODEBLOCK7
Performance
On Apple Silicon M4 (16GB RAM):
- pipeline: ~32s/page, CPU-only, good quality
- vlm-mlx-engine: ~38s/page, Apple Silicon optimized, excellent quality
- vlm-transformers: ~148s/page, highest quality, slowest
Note: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs.
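The per-page figures above give a rough way to budget time for a document. This sketch simply multiplies them out; `estimate_seconds` is a hypothetical helper, and the numbers apply only to the M4 (16GB) machine quoted above and exclude the first-run model download.

```python
# Approximate seconds per page, taken from the measurements listed above.
SECONDS_PER_PAGE = {
    "pipeline": 32,
    "vlm-mlx-engine": 38,
    "vlm-transformers": 148,
}


def estimate_seconds(pages, backend="pipeline"):
    """Rough processing time for a page count, ignoring model download."""
    return pages * SECONDS_PER_PAGE[backend]


for backend in SECONDS_PER_PAGE:
    minutes = estimate_seconds(20, backend) / 60
    print(f"{backend}: about {minutes:.1f} min for 20 pages")
```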
Output Format
Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX
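Because tables arrive as Markdown, they can be pulled back out of the returned text with plain string handling. The sketch below assumes pipe-delimited Markdown tables with a `|---|` separator row; `extract_tables` is a hypothetical helper, and the exact layout MinerU emits may differ.

```python
def extract_tables(markdown):
    """Return each Markdown table as a list of rows (lists of cell strings)."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            # Skip the header separator row, e.g. |---|:---:|
            if all(c and set(c) <= set("-: ") for c in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```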
Supported Formats
- PDF documents (.pdf)
- JPEG images (.jpg, .jpeg)
- PNG images (.png)
- Other image formats (WebP, GIF, etc.)
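A quick extension check can catch unsupported inputs before a slow parse starts. This is a minimal sketch: `is_listed_format` is a hypothetical helper that covers only the extensions named explicitly above, since the "other image formats" are not listed in full.

```python
from pathlib import Path

# Only the extensions spelled out in the list above. Other image formats
# (WebP, GIF, etc.) are mentioned without a complete list, so they are
# deliberately left out of this sketch.
KNOWN_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def is_listed_format(path):
    """True if the file extension is one of the explicitly listed formats."""
    return Path(path).suffix.lower() in KNOWN_EXTENSIONS
```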
Troubleshooting
Module not found error
If you get "No module named 'mcp_mineru'", make sure you installed it:
CODEBLOCK8
Slow processing on first run
This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.
Timeout errors
Increase the timeout for large documents, or parse a smaller page range while testing.
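One way to keep each call short is to split a long document into page ranges and parse them one at a time. The sketch below assumes `end_page` is inclusive, which matches the single-page example that uses `start_page` 0 and `end_page` 0; `page_chunks` is a hypothetical helper.

```python
def page_chunks(total_pages, chunk_size):
    """Yield (start_page, end_page) pairs, 0-indexed with an inclusive end."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages) - 1


# Example: ranges for a 25-page document, 10 pages per call.
for start_page, end_page in page_chunks(25, 10):
    print(start_page, end_page)
```

Each pair can be passed as `start_page` and `end_page` in a separate `parse_pdf` call.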
Notes
- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend
File Persistence
Why Files Get Deleted (MCP Method)
The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.
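The cleanup behavior is easy to reproduce with the standard library alone. This sketch does not use MinerU at all; it only shows that a file written inside `tempfile.TemporaryDirectory()` is gone once the `with` block ends.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp) / "result.md"
    output.write_text("# parsed content\n", encoding="utf-8")
    existed_inside = output.exists()  # the file exists while the context is open

# Leaving the context removes the directory and everything in it.
existed_after = output.exists()
print(existed_inside, existed_after)
```

Anything that must survive therefore has to be copied or written elsewhere before the context exits.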
How to Preserve Files
Method A: Use the Direct Tool (Recommended)
The skill provides parse.py which saves files to a persistent directory:
CODEBLOCK9
Advantages:
- ✅ Files are never auto-deleted
- ✅ Full control over output location
- ✅ Can be used in batch processing
- ✅ No MCP connection needed
Generated Structure:
CODEBLOCK10
Method B: Redirect MCP Output
If using the MCP method, capture the output and save it:
CODEBLOCK11
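If the returned Markdown is already available as a string, writing it to a directory of your choice takes only a few lines. This is a sketch under that assumption; `save_markdown` and the default file name are hypothetical and not part of the skill.

```python
from pathlib import Path


def save_markdown(text, output_dir, name="document.md"):
    """Write Markdown text to output_dir/name and return the file path."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create missing folders
    target = out_dir / name
    target.write_text(text, encoding="utf-8")
    return target
```

Note that this keeps only the text: images or other files the server wrote to its temporary directory are still removed.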
Comparison
| Feature | Direct Tool | MCP Method |
|---|---|---|
| Files persisted | ✅ Yes | ❌ No (auto-deleted) |
| Custom output dir | ✅ Yes | ❌ No (temp only) |
| Claude Code integration | ⚠️ Manual | ✅ Native |
| Speed | ✅ Fast | ⚠️ MCP overhead |
| Offline use | ✅ Yes | ⚠️ Needs Claude Code |
Recommendation
- Use Direct Tool when you need to keep the files for later use
- Use MCP Method when working within Claude Code and only need the text content
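MinerU PDF Parser
Parse PDF documents using MinerU MCP to extract structured content, including text, tables, and formulas, with MLX acceleration on Apple Silicon.
Installation
Option 1: Install MinerU MCP (for Claude Code)
```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```
This installs and configures MinerU for all Claude projects. Models are downloaded on first use.
Option 2: Use the Direct Tool (preserves files)
The skill includes a direct parsing tool that saves output to a persistent directory:
```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py /path/to/input.pdf /path/to/output_dir
```
Advantages:
- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU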
Quick Start
Method 1: Using the Direct Tool (Recommended)
```bash
# Parse an entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output

# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --start-page 0 --end-page 2

# Use the Apple Silicon optimized backend
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --backend vlm-mlx-engine

# Extract text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --no-table --no-formula
```
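For batch processing, the same command lines can be assembled in Python and run once per file. This is a sketch, not part of the skill: `build_command` and `parse_folder` are hypothetical helpers, and only the flags shown in the commands above are used.

```python
import subprocess
from pathlib import Path

# Script location as given in the commands above.
PARSE_SCRIPT = "/Users/lwj04/clawd/skills/mineru-pdf/parse.py"


def build_command(pdf_path, output_dir, backend=None, start_page=None,
                  end_page=None, table=True, formula=True):
    """Assemble the parse.py argument list for one document."""
    cmd = ["python", PARSE_SCRIPT, str(pdf_path), str(output_dir)]
    if backend is not None:
        cmd += ["--backend", backend]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if end_page is not None:
        cmd += ["--end-page", str(end_page)]
    if not table:
        cmd.append("--no-table")
    if not formula:
        cmd.append("--no-formula")
    return cmd


def parse_folder(input_dir, output_root):
    """Run parse.py on every PDF in input_dir, one output folder per file."""
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        command = build_command(pdf, Path(output_root) / pdf.stem)
        subprocess.run(command, check=True)  # stop at the first failure
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.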
Method 2: Using MinerU MCP (Temporary Files)
Parse a PDF document
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': True,
            'table_enable': True,
            'start_page': 0,
            'end_page': -1,  # -1 means all pages
        },
    )
    # Print the first text item in the result, if there is one
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
Check system capabilities
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def list_backends():
    result = await call_tool(
        name='list_backends',
        arguments={},
    )
    # Print the first text item in the result, if there is one
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(list_backends())
"
```
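Parameters
parse_pdf
Required:
- file_path - Absolute path to the PDF file
Optional:
- backend - Processing backend (default: pipeline)
  - pipeline - Fast, general-purpose (recommended)
  - vlm-mlx-engine - Fastest on Apple Silicon (M1/M2/M3/M4)
  - vlm-transformers - Slowest but most accurate
- formula_enable - Enable formula recognition (default: true)
- table_enable - Enable table recognition (default: true)
- start_page - Starting page (0-indexed, default: 0)
- end_page - Ending page (default: -1 for all pages)
list_backends
No parameters required. Returns system information and backend recommendations.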
Usage Examples
Extract tables from a specific page range
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'table_enable': True,
            'start_page': 5,
            'end_page': 10,
        },
    )
    # Print the first text item in the result, if there is one
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
Parse with formula recognition only (faster)
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'vlm-mlx-engine',
            'formula_enable': True,
            'table_enable': False,  # disable tables for speed
        },
    )
    # Print the first text item in the result, if there is one
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
Parse a single page (fastest for testing)
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': False,
            'table_enable': False,
            'start_page': 0,
            'end_page': 0,
        },
    )
    # Print the first text item in the result, if there is one
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
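Performance
On Apple Silicon M4 (16GB RAM):
- pipeline: ~32s/page, CPU-only, good quality
- vlm-mlx-engine: ~38s/page, Apple Silicon optimized, excellent quality
- vlm-transformers: ~148s/page, highest quality, slowest
Note: The first run downloads models (this can take 5-10 minutes). Models are cached in ~/.cache/uv/ so that subsequent runs are faster.
Output Format
Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX
Supported Formats
- PDF documents (.pdf)
- JPEG images (.jpg, .jpeg)
- PNG images (.png)
- Other image formats (WebP, GIF, etc.)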
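Troubleshooting
Module not found error
If you get "No module named 'mcp_mineru'", make sure it is installed:
```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```
Slow processing on first run
This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.
Timeout errors
Increase the timeout for large documents, or parse a smaller page range while testing.
Notes
- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with the MLX backend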
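File Persistence
Why Files Get Deleted (MCP Method)
The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes its files when the context exits. This is by design, to keep temporary files from accumulating.
How to Preserve Files
Method A: Use the Direct Tool (Recommended)
The skill provides parse.py, which saves files to a persistent directory:
```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/input.pdf \
  /path/to/output_dir
```
Advantages: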