MinerU PDF Parser

Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.

Installation

Option 1: Install MinerU MCP (for Claude Code)

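```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```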

This installs and configures MinerU for all Claude projects. Models are downloaded on first use.

Option 2: Use Direct Tool (preserves files)

The skill includes a direct parsing tool that saves output to a persistent directory:

```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]
```

Advantages:

- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU

Quick Start

Method 1: Using the Direct Tool (Recommended)

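```bash
# Parse the entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output

# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --start-page 0 --end-page 2

# Use the Apple Silicon optimized backend
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --backend vlm-mlx-engine

# Extract text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --no-table --no-formula
```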

Method 2: Using MinerU MCP (Temporary Files)

Parse a PDF document

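```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': True,
            'table_enable': True,
            'start_page': 0,
            'end_page': -1,  # -1 means all pages
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```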

Check system capabilities

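```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def list_backends():
    result = await call_tool(
        name='list_backends',
        arguments={},
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(list_backends())
"
```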

Parameters

parse_pdf

Required:

- file_path - Absolute path to the PDF file

Optional:

- backend - Processing backend (default: pipeline)
  - pipeline - Fast, general-purpose (recommended)
  - vlm-mlx-engine - Fastest VLM backend on Apple Silicon (M1/M2/M3/M4)
  - vlm-transformers - Slowest but most accurate
- formula_enable - Enable formula recognition (default: true)
- table_enable - Enable table recognition (default: true)
- start_page - Starting page (0-indexed, default: 0)
- end_page - Ending page (0-indexed, inclusive; default: -1 for all pages)
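
The parameters above can be assembled and sanity-checked before the tool is called. The helper below is a hypothetical sketch: `build_parse_args` is not part of mcp-mineru, it only reuses the parameter names and defaults listed above, and it assumes `end_page` is inclusive.

```python
import os

# Backend names as listed in the parameter reference above.
BACKENDS = {"pipeline", "vlm-mlx-engine", "vlm-transformers"}


def build_parse_args(file_path, backend="pipeline", formula_enable=True,
                     table_enable=True, start_page=0, end_page=-1):
    """Return an arguments dict for parse_pdf (hypothetical helper).

    Raises ValueError when a value contradicts the documented constraints.
    """
    if not os.path.isabs(file_path):
        raise ValueError("file_path must be an absolute path")
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and must be >= 0")
    if end_page != -1 and end_page < start_page:
        raise ValueError("end_page must be -1 (all pages) or >= start_page")
    return {
        "file_path": file_path,
        "backend": backend,
        "formula_enable": formula_enable,
        "table_enable": table_enable,
        "start_page": start_page,
        "end_page": end_page,
    }
```

The resulting dict can be passed as the `arguments` value of a `parse_pdf` call.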

list_backends

No parameters required. Returns system information and backend recommendations.

Usage Examples

Extract tables from a specific page range

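```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'table_enable': True,
            'start_page': 5,
            'end_page': 10,
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```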

Parse with formula recognition only (faster)

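```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'vlm-mlx-engine',
            'formula_enable': True,
            'table_enable': False,  # disable tables for speed
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```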

Parse single page (fastest for testing)

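```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': False,
            'table_enable': False,
            'start_page': 0,
            'end_page': 0,
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```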

Performance

On Apple Silicon M4 (16GB RAM):

- pipeline: ~32 s/page, CPU-only, good quality
- vlm-mlx-engine: ~38 s/page, Apple Silicon optimized, excellent quality
- vlm-transformers: ~148 s/page, highest quality, slowest

Note: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs.
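
As a rough worked example, the per-page figures above can be turned into a total-time estimate. This sketch assumes processing time scales linearly with page count and excludes the first-run model download; `estimate_seconds` is an illustrative name, not part of MinerU.

```python
# Approximate seconds per page on an Apple Silicon M4 (16GB RAM), from the list above.
SECONDS_PER_PAGE = {
    "pipeline": 32,
    "vlm-mlx-engine": 38,
    "vlm-transformers": 148,
}


def estimate_seconds(pages, backend="pipeline"):
    """Estimate total processing time, assuming linear scaling with page count."""
    return pages * SECONDS_PER_PAGE[backend]


for backend in SECONDS_PER_PAGE:
    minutes = estimate_seconds(20, backend) / 60
    print(f"{backend}: ~{minutes:.1f} min for 20 pages")
```

Under these assumptions a 20-page document takes roughly 11 minutes with pipeline, 13 minutes with vlm-mlx-engine, and 49 minutes with vlm-transformers.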

Output Format

Returns structured Markdown with:

- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX

Supported Formats

- PDF documents (.pdf)
- JPEG images (.jpg, .jpeg)
- PNG images (.png)
- Other image formats (WebP, GIF, etc.)

Troubleshooting

Module not found error

If you get "No module named 'mcp_mineru'", make sure you installed it:

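```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```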

Slow processing on first run

This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.

Timeout errors

Increase timeout for large documents or use smaller page ranges for testing.
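
One way to work in smaller page ranges is to split a document into consecutive chunks and parse each chunk separately. The generator below is a minimal sketch; `page_ranges` is a hypothetical helper, and it assumes `start_page` and `end_page` are 0-indexed and inclusive.

```python
def page_ranges(total_pages, chunk_size):
    """Yield (start_page, end_page) pairs covering total_pages in chunks.

    Both values are 0-indexed and end_page is inclusive.
    """
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages) - 1


# A 25-page document in chunks of 10 pages.
print(list(page_ranges(25, 10)))  # [(0, 9), (10, 19), (20, 24)]
```

Each pair can then be used as the page range of one parse call, so that a timeout affects only a single chunk.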

Notes

- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend

File Persistence

Why Files Get Deleted (MCP Method)

The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.
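
The deletion behavior can be reproduced with the standard library alone. This small demonstration is independent of MinerU and only shows what `tempfile.TemporaryDirectory()` does when its context exits; the file name is made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "result.md"  # illustrative file name
    out.write_text("# parsed output\n", encoding="utf-8")
    existed_inside = out.exists()  # the file is present while the context is open

# Leaving the with-block removes the directory and everything in it.
existed_after = out.exists()

print(existed_inside, existed_after)  # True False
```

Anything that should survive has to be copied out, or written somewhere else, before the context exits.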

How to Preserve Files

Method A: Use the Direct Tool (Recommended)

The skill provides parse.py which saves files to a persistent directory:

```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/input.pdf \
  /path/to/output_dir
```

Advantages:

- ✅ Files are never auto-deleted
- ✅ Full control over output location
- ✅ Can be used in batch processing
- ✅ No MCP connection needed
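
Batch processing with the direct tool could look like the sketch below. It assumes parse.py accepts the positional arguments and the `--backend`, `--no-table`, and `--no-formula` flags used in this document's Quick Start commands; `build_command` and `parse_folder` are hypothetical helpers, not part of the skill.

```python
import subprocess
from pathlib import Path

# Script path as given in this document; adjust it to your installation.
PARSE_SCRIPT = "/Users/lwj04/clawd/skills/mineru-pdf/parse.py"


def build_command(pdf_path, output_dir, backend=None, tables=True, formulas=True):
    """Build the parse.py command line for one PDF (hypothetical helper)."""
    cmd = ["python", PARSE_SCRIPT, str(pdf_path), str(output_dir)]
    if backend:
        cmd += ["--backend", backend]
    if not tables:
        cmd.append("--no-table")
    if not formulas:
        cmd.append("--no-formula")
    return cmd


def parse_folder(pdf_dir, output_root, **options):
    """Parse every PDF in pdf_dir into its own subdirectory of output_root."""
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out_dir = Path(output_root) / pdf.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # check=True stops the batch at the first failing document.
        subprocess.run(build_command(pdf, out_dir, **options), check=True)
```

For example, `parse_folder("/path/to/pdfs", "/path/to/output", backend="vlm-mlx-engine")` would write each document's results to a separate, persistent directory.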

Generated Structure:
CODEBLOCK10

Method B: Redirect MCP Output

If using the MCP method, capture the output and save it:

CODEBLOCK11
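
From Python, the same idea can be sketched as a small helper that writes the returned Markdown text to a directory of your choice. `save_markdown` is a hypothetical name, and the sketch assumes the text has already been extracted from the tool result.

```python
from pathlib import Path


def save_markdown(markdown_text, pdf_path, output_dir):
    """Write Markdown output next to other results, named after the source PDF."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (Path(pdf_path).stem + ".md")
    target.write_text(markdown_text, encoding="utf-8")
    return target
```

Because the file is written outside the server's temporary directory, it remains available after the MCP call finishes.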

Comparison

Feature	Direct Tool	MCP Method
Files persisted	✅ Yes	❌ No (auto-deleted)
Custom output dir

Recommendation

- Use Direct Tool when you need to keep the files for later use
- Use MCP Method when working within Claude Code and only need the text content

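MinerU PDF Parser

Parse PDF documents using MinerU MCP to extract structured content, including text, tables, and formulas, with MLX acceleration on Apple Silicon.

Installation

Option 1: Install MinerU MCP (for Claude Code)

```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```

This installs and configures MinerU for all Claude projects. Models are downloaded on first use.

Option 2: Use Direct Tool (preserves files)

The skill includes a direct parsing tool that saves output to a persistent directory:

```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]
```

Advantages:

- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU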

Quick Start

Method 1: Using the Direct Tool (Recommended)

```bash
# Parse the entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output

# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --start-page 0 --end-page 2

# Use the Apple Silicon optimized backend
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --backend vlm-mlx-engine

# Extract text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/document.pdf \
  /path/to/output \
  --no-table --no-formula
```

Method 2: Using MinerU MCP (Temporary Files)

Parse a PDF document

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': True,
            'table_enable': True,
            'start_page': 0,
            'end_page': -1,  # -1 means all pages
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

Check system capabilities

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def list_backends():
    result = await call_tool(
        name='list_backends',
        arguments={},
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(list_backends())
"
```

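Parameters

parse_pdf

Required:

- file_path - Absolute path to the PDF file

Optional:

- backend - Processing backend (default: pipeline)
  - pipeline - Fast, general-purpose (recommended)
  - vlm-mlx-engine - Fastest VLM backend on Apple Silicon (M1/M2/M3/M4)
  - vlm-transformers - Slowest but most accurate
- formula_enable - Enable formula recognition (default: true)
- table_enable - Enable table recognition (default: true)
- start_page - Starting page (0-indexed, default: 0)
- end_page - Ending page (default: -1 for all pages)

list_backends

No parameters required. Returns system information and backend recommendations.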

Usage Examples

Extract tables from a specific page range

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'table_enable': True,
            'start_page': 5,
            'end_page': 10,
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

Parse with formula recognition only (faster)

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'vlm-mlx-engine',
            'formula_enable': True,
            'table_enable': False,  # disable tables for speed
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

Parse single page (fastest for testing)

```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': False,
            'table_enable': False,
            'start_page': 0,
            'end_page': 0,
        },
    )
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```

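Performance

On Apple Silicon M4 (16GB RAM):

- pipeline: ~32 s/page, CPU-only, good quality
- vlm-mlx-engine: ~38 s/page, Apple Silicon optimized, excellent quality
- vlm-transformers: ~148 s/page, highest quality, slowest

Note: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs.

Output Format

Returns structured Markdown with:

- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX

Supported Formats

- PDF documents (.pdf)
- JPEG images (.jpg, .jpeg)
- PNG images (.png)
- Other image formats (WebP, GIF, etc.)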

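Troubleshooting

Module not found error

If you get "No module named 'mcp_mineru'", make sure you installed it:

```bash
claude mcp add --transport stdio --scope user mineru -- \
  uvx --from mcp-mineru python -m mcp_mineru.server
```

Slow processing on first run

This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.

Timeout errors

Increase timeout for large documents or use smaller page ranges for testing.

Notes

- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend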

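File Persistence

Why Files Get Deleted (MCP Method)

The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.

How to Preserve Files

Method A: Use the Direct Tool (Recommended)

The skill provides parse.py which saves files to a persistent directory:

```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
  /path/to/input.pdf \
  /path/to/output_dir
```

Advantages:

- ✅ Files are never auto-deleted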
