MinerU PDF Extractor
Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.
Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.
📁 Skill Structure
CODEBLOCK0
🔧 Requirements
Required Environment Variables
The scripts automatically read the MinerU token from environment variables (set either one):
CODEBLOCK1
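The lookup order can be sketched as below. This is a minimal illustration that assumes `MINERU_TOKEN` takes priority and `MINERU_API_KEY` is the fallback, as stated in the Notes section. The function name `resolve_token` is hypothetical and the real scripts may do this differently.

```bash
# Hypothetical helper (not part of the skill): print the token from
# MINERU_TOKEN, falling back to MINERU_API_KEY.
# Returns non-zero when neither variable is set.
resolve_token() {
    if [ -n "${MINERU_TOKEN:-}" ]; then
        printf '%s\n' "$MINERU_TOKEN"
    elif [ -n "${MINERU_API_KEY:-}" ]; then
        printf '%s\n' "$MINERU_API_KEY"
    else
        echo "Error: set MINERU_TOKEN or MINERU_API_KEY" >&2
        return 1
    fi
}
```

A caller could then write `token=$(resolve_token) || exit 1` before making any request.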
Required Command-Line Tools
- curl - For HTTP requests (usually pre-installed)
- unzip - For extracting results (usually pre-installed)
Optional Tools
- jq - For enhanced JSON parsing and security (recommended but not required)
  - If not installed, scripts will use fallback methods
  - Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)
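As an illustration of what such a fallback might look like, the sketch below reads one string field from a JSON document, using jq when it is installed and a simple sed pattern otherwise. The helper name `json_string_field` and the sample JSON are made up for this example. The fallback only handles plain, unescaped string values, which is one reason jq is the better choice.

```bash
# Hypothetical helper: read a top-level string field from JSON on stdin.
# Uses jq when available; otherwise falls back to a sed pattern that only
# copes with simple string values containing no escaped quotes.
json_string_field() {
    if command -v jq >/dev/null 2>&1; then
        jq -r --arg f "$1" '.[$f] // empty'
    else
        sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
    fi
}

# Made-up sample document, for illustration only.
printf '%s' '{"code": 0, "batch_id": "abc-123"}' | json_string_field batch_id
# prints: abc-123
```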
Optional Configuration
CODEBLOCK2
💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key
📄 Feature 1: Parse Local PDF Documents
For locally stored PDF files. Requires 4 steps.
Quick Start
CODEBLOCK3
Script Descriptions
local_file_step1_apply_upload_url.sh
Apply for upload URL and batch_id.
Usage:
CODEBLOCK4
Parameters:
- language: ch (Chinese), en (English), auto (auto-detect); default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
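One way a script could apply these defaults and reject unsupported values is sketched below. It assumes the defaults documented for this script (`ch` and `doclayout_yolo`). The function name `parse_options` and the error messages are hypothetical.

```bash
# Hypothetical sketch: default and validate the two optional arguments.
# Allowed values and defaults are the ones documented for this script.
parse_options() {
    language=${1:-ch}
    layout_model=${2:-doclayout_yolo}

    case $language in
        ch|en|auto) ;;
        *) echo "Error: unsupported language '$language'" >&2; return 1 ;;
    esac
    case $layout_model in
        doclayout_yolo|layoutlmv3) ;;
        *) echo "Error: unsupported layout model '$layout_model'" >&2; return 1 ;;
    esac
    printf 'language=%s layout_model=%s\n' "$language" "$layout_model"
}

parse_options                 # prints: language=ch layout_model=doclayout_yolo
parse_options en layoutlmv3   # prints: language=en layout_model=layoutlmv3
```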
Output:
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
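Because the output is a set of KEY=VALUE lines, a caller can capture the values into shell variables. The sketch below uses a hypothetical helper, `get_value`, and made-up sample values. It splits only at the first `=`, since a presigned URL usually carries further `=` characters in its query string and `cut -d= -f2` would truncate it.

```bash
# Hypothetical helper: print the value of KEY from KEY=VALUE lines on stdin.
# Only the leading "KEY=" is removed, so '=' inside the value is preserved.
get_value() {
    sed -n "s/^$1=//p" | head -n 1
}

# Made-up sample in the output format shown above.
sample='BATCH_ID=1234-abcd
UPLOAD_URL=https://example.com/upload?Expires=1700000000&Signature=abc%3D'

BATCH_ID=$(printf '%s\n' "$sample" | get_value BATCH_ID)
UPLOAD_URL=$(printf '%s\n' "$sample" | get_value UPLOAD_URL)
echo "$BATCH_ID"
echo "$UPLOAD_URL"
```

In real use, `$sample` would be replaced by the captured output of the step 1 script.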
local_file_step2_upload_file.sh
Upload PDF file to the presigned URL.
Usage:
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
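Before uploading, it can help to confirm that the file exists, has a .pdf extension, and is within the 200MB limit mentioned in this document. The check below is a sketch, not part of the skill: the name `check_pdf` is hypothetical, and treating 200MB as 200 × 1024 × 1024 bytes is an assumption.

```bash
# Hypothetical pre-upload check. The 200MB figure comes from this document;
# interpreting it as 200 * 1024 * 1024 bytes is an assumption.
MAX_BYTES=$((200 * 1024 * 1024))

check_pdf() {
    if [ ! -f "$1" ]; then
        echo "Error: file not found: $1" >&2
        return 1
    fi
    case $1 in
        *.pdf|*.PDF) ;;
        *) echo "Error: not a .pdf file: $1" >&2; return 1 ;;
    esac
    # Arithmetic expansion strips the padding some wc implementations add.
    size=$(($(wc -c < "$1")))
    if [ "$size" -gt "$MAX_BYTES" ]; then
        echo "Error: $1 exceeds 200MB" >&2
        return 1
    fi
}
```

A wrapper could run `check_pdf "$pdf" || exit 1` before calling the upload script.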
local_file_step3_poll_result.sh
Poll extraction results until completion or failure.
Usage:
CODEBLOCK7
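The general shape of a polling loop with a retry limit and a fixed interval is sketched below. It does not call the MinerU API: `check_cmd` stands for any command that prints the current state. The state names (`done`, `failed`), the default of 60 retries, and the 5-second interval are all assumptions for illustration.

```bash
# Hypothetical polling loop. check_cmd is any command that prints a state:
# "done", "failed", or anything else for "still running".
# State names and default limits are assumptions, not the skill's values.
poll_until_done() {
    check_cmd=$1
    max_retries=${2:-60}
    interval=${3:-5}
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        state=$("$check_cmd")
        if [ "$state" = "done" ]; then
            echo "done after $attempt attempt(s)"
            return 0
        elif [ "$state" = "failed" ]; then
            echo "Error: extraction failed" >&2
            return 1
        fi
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    echo "Error: timed out after $max_retries attempts" >&2
    return 2
}
```

Separate exit codes for failure (1) and timeout (2) let a calling script decide whether a retry is worthwhile.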
Output:
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
local_file_step4_download.sh
Download result ZIP and extract.
Usage:
CODEBLOCK9
Output Structure:
CODEBLOCK10
Detailed Documentation
📚 Complete Guide: See docs/Local_File_Parsing_Guide.md
🌐 Feature 2: Parse Online PDF Documents (URL Method)
For PDF files already available online (e.g., arXiv, websites). Requires only 2 steps, since no upload is needed.
Quick Start
CODEBLOCK11
Script Descriptions
online_file_step1_submit_task.sh
Submit parsing task for online PDF.
Usage:
CODEBLOCK12
Parameters:
- pdf_url: Complete URL of the online PDF (required)
- language: ch (Chinese), en (English), auto (auto-detect); default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
Output:
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
online_file_step2_poll_result.sh
Poll extraction results, automatically download and extract when complete.
Usage:
CODEBLOCK14
Output Structure:
CODEBLOCK15
Detailed Documentation
📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md
📊 Comparison of Two Parsing Methods
| Feature | Local PDF Parsing | Online PDF Parsing |
|---|---|---|
| Steps | 4 steps | 2 steps |
| Upload Required | ✅ Yes | ❌ No |
| Average Time | 30-60 seconds | 10-20 seconds |
| Use Case | Local files | Files already online (arXiv, websites, etc.) |
| File Size Limit | 200MB | Limited by source server |
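A wrapper script could pick the workflow from the input itself, as in the sketch below: http(s) URLs go to the online method and existing files to the local method. The function `choose_workflow` is hypothetical and not part of the skill.

```bash
# Hypothetical dispatcher: decide which workflow fits a given input.
# Prints "online" for http(s) URLs and "local" for existing files.
choose_workflow() {
    case $1 in
        http://*|https://*)
            echo online
            ;;
        *)
            if [ -f "$1" ]; then
                echo local
            else
                echo "Error: not a URL or an existing file: $1" >&2
                return 1
            fi
            ;;
    esac
}

choose_workflow https://arxiv.org/pdf/2410.17247.pdf   # prints: online
```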
⚙️ Advanced Usage
Batch Process Local Files
CODEBLOCK16
Batch Process Online Files
CODEBLOCK17
⚠️ Notes
- Token Configuration: Scripts read MINERU_TOKEN first and fall back to MINERU_API_KEY if it is not set
- Token Security: Do not hard-code tokens in scripts; use environment variables
- URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
- File Limits: Keep each file under 200MB (recommended); the maximum is 600 pages
- Network Stability: Use a stable network connection when uploading large files
- Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
- Optional jq: Installing jq provides enhanced JSON parsing and additional security checks
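The two kinds of check mentioned in the Security note can be illustrated as below. These helpers are hypothetical sketches, not the skill's actual validation code: `is_safe_path` rejects paths containing a `..` component, and `json_escape` escapes backslashes and double quotes so a value can be placed inside a JSON string. Control characters are not handled here.

```bash
# Hypothetical checks; the real scripts may validate input differently.

# Reject paths containing a ".." component (directory traversal).
is_safe_path() {
    case "/$1/" in
        */../*) return 1 ;;
        *) return 0 ;;
    esac
}

# Escape backslashes and double quotes for use inside a JSON string.
# Control characters (newlines, tabs) are not handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

is_safe_path "extracted/out" && echo "safe"
is_safe_path "../../etc" || echo "rejected"
json_escape 'a"b'     # prints: a\"b
```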
📚 Reference Documentation
| Document | Description |
|---|---|
| docs/Local_File_Parsing_Guide.md | Detailed curl commands and parameters for local PDF parsing |
| docs/Online_URL_Parsing_Guide.md | Detailed curl commands and parameters for online PDF parsing |
External Resources:
- 🏠 MinerU Official: https://mineru.net/
- 📖 API Documentation: https://mineru.net/apiManage/docs
- 💻 GitHub Repository: https://github.com/opendatalab/MinerU
Skill Version: 1.0.0
Release Date: 2026-02-18
Community skill, not affiliated with the official MinerU project
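MinerU PDF Extractor
Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.
Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.
📁 Skill Structure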
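```
mineru-pdf-extractor/
├── SKILL.md  # English documentation
├── SKILL_zh.md  # Chinese documentation
├── docs/  # Documentation
│   ├── Local_File_Parsing_Guide.md  # Detailed guide to local PDF parsing (English)
│   ├── Online_URL_Parsing_Guide.md  # Detailed guide to online PDF parsing (English)
│   ├── MinerU_本地文档解析完整流程.md  # Complete guide to local parsing (Chinese)
│   └── MinerU_在线文档解析完整流程.md  # Complete guide to online parsing (Chinese)
└── scripts/  # Executable scripts
    ├── local_file_step1_apply_upload_url.sh  # Local parsing, step 1
    ├── local_file_step2_upload_file.sh  # Local parsing, step 2
    ├── local_file_step3_poll_result.sh  # Local parsing, step 3
    ├── local_file_step4_download.sh  # Local parsing, step 4
    ├── online_file_step1_submit_task.sh  # Online parsing, step 1
    └── online_file_step2_poll_result.sh  # Online parsing, step 2
```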
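🔧 Requirements
Required Environment Variables
The scripts automatically read the MinerU token from environment variables (set either one):
```bash
# Option 1: set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```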
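Required Command-Line Tools
- curl - For HTTP requests (usually pre-installed)
- unzip - For extracting results (usually pre-installed)
Optional Tools
- jq - For enhanced JSON parsing and security (recommended but not required)
  - If not installed, scripts will use fallback methods
  - Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)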
Optional Configuration
```bash
# Set the API base URL (preconfigured by default)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```
💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API key
📄 Feature 1: Parse Local PDF Documents
For locally stored PDF files. Requires 4 steps.
Quick Start
```bash
cd scripts/

# Step 1: Apply for an upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload the file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for the result
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download the result
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```
Script Descriptions
local_file_step1_apply_upload_url.sh
Apply for an upload URL and batch_id.
Usage:
```bash
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
```
Parameters:
- language: ch (Chinese), en (English), auto (auto-detect); default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
Output:
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
local_file_step2_upload_file.sh
Upload the PDF file to the presigned URL.
Usage:
```bash
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
```
local_file_step3_poll_result.sh
Poll extraction results until completion or failure.
Usage:
```bash
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
```
Output:
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
local_file_step4_download.sh
Download the result ZIP and extract it.
Usage:
```bash
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
```
Output Structure:
```
extracted/
├── full.md  # 📄 Markdown document (main result)
├── images/  # 🖼️ Extracted images
├── content_list.json  # Structured content
└── layout.json  # Layout analysis data
```
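After extraction, a script can confirm that the expected files are present. The sketch below assumes the file names are exactly those listed in the output structure (full.md, content_list.json, layout.json, images/). The function `verify_output` is hypothetical, and treating a missing images/ directory as a note instead of an error is an assumption, since a PDF without figures may produce none.

```bash
# Hypothetical post-extraction check, assuming the file names listed above.
verify_output() {
    dir=$1
    missing=0
    for name in full.md content_list.json layout.json; do
        if [ ! -f "$dir/$name" ]; then
            echo "Missing: $dir/$name" >&2
            missing=1
        fi
    done
    # Assumption: images/ may legitimately be absent, so only report it.
    if [ ! -d "$dir/images" ]; then
        echo "Note: no images/ directory in $dir" >&2
    fi
    return "$missing"
}
```

For example, `verify_output extracted/ && echo "extraction looks complete"`.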
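Detailed Documentation
📚 Complete Guide: See docs/Local_File_Parsing_Guide.md
🌐 Feature 2: Parse Online PDF Documents (URL Method)
For PDF files already available online (e.g., arXiv, websites). Requires only 2 steps, making it simpler and more efficient.
Quick Start
```bash
cd scripts/

# Step 1: Submit the parsing task (provide the URL directly)
./online_file_step1_submit_task.sh https://arxiv.org/pdf/2410.17247.pdf
# Output: TASK_ID=xxx

# Step 2: Poll for the result, then download and extract automatically
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```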
Script Descriptions
online_file_step1_submit_task.sh
Submit a parsing task for an online PDF.
Usage:
```bash
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
```
Parameters:
- pdf_url: Complete URL of the online PDF (required)
- language: ch (Chinese), en (English), auto (auto-detect); default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate); default doclayout_yolo
Output:
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
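online_file_step2_poll_result.sh
Poll extraction results, then download and extract them automatically once complete.
Usage:
```bash
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
```
Output Structure:
```
extracted/
├── full.md  # 📄 Markdown document (main result)
├── images/  # 🖼️ Extracted images
├── content_list.json  # Structured content
└── layout.json  # Layout analysis data
```
Detailed Documentation
📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md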
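📊 Comparison of Two Parsing Methods
| Feature | Local PDF Parsing | Online PDF Parsing |
|---|---|---|
| Steps | 4 steps | 2 steps |
| Upload Required | ✅ Yes | ❌ No |
| Average Time | 30-60 seconds | 10-20 seconds |
| Use Case | Local files | Files already online (arXiv, websites, etc.) |
| File Size Limit | 200MB | Limited by source server |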
⚙️ Advanced Usage
Batch Process Local Files
bash
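for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"

    # Step 1: apply for an upload URL (quote $result to keep one KEY=VALUE per line)
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
    # -f2- keeps any '=' characters inside the presigned URL
    upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2-)

    # Step 2: upload the file
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"

    # Step 3: poll for the result
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2-)

    # Step 4: download the result
    filename=$(basename "$pdf" .pdf)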
./localfilestep4_d