MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.

📁 Skill Structure

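mineru-pdf-extractor/
├── SKILL.md # English documentation
├── SKILL_zh.md # Chinese documentation
├── docs/ # Documentation
│   ├── Local_File_Parsing_Guide.md # Detailed guide to local PDF parsing (English)
│   ├── Online_URL_Parsing_Guide.md # Detailed guide to online PDF parsing (English)
│   ├── MinerU_本地文档解析完整流程.md # Complete local parsing guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md # Complete online parsing guide (Chinese)
└── scripts/ # Executable scripts
    ├── local_file_step1_apply_upload_url.sh # Local parsing, step 1
    ├── local_file_step2_upload_file.sh # Local parsing, step 2
    ├── local_file_step3_poll_result.sh # Local parsing, step 3
    ├── local_file_step4_download.sh # Local parsing, step 4
    ├── online_file_step1_submit_task.sh # Online parsing, step 1
    └── online_file_step2_poll_result.sh # Online parsing, step 2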

🔧 Requirements

Required Environment Variables

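The scripts read the MinerU token from the environment. Set either of the following variables:

# Option 1: set MINERU_TOKEN
export MINERU_TOKEN=your_api_token_here

# Option 2: set MINERU_API_KEY
export MINERU_API_KEY=your_api_token_here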
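The lookup order described in this document (MINERU_TOKEN first, then MINERU_API_KEY) could be sketched as follows. This is only an illustration: resolve_mineru_token is a hypothetical helper name, and the bundled scripts may implement the lookup differently.

```shell
# Hypothetical helper: prints the token using the documented lookup order.
resolve_mineru_token() {
  if [ -n "${MINERU_TOKEN:-}" ]; then
    # Preferred variable
    printf '%s\n' "$MINERU_TOKEN"
  elif [ -n "${MINERU_API_KEY:-}" ]; then
    # Fallback variable
    printf '%s\n' "$MINERU_API_KEY"
  else
    echo "error: set MINERU_TOKEN or MINERU_API_KEY" >&2
    return 1
  fi
}
```

A caller would typically capture the result with token=$(resolve_mineru_token) and stop if the function returns a non-zero status.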

Required Command-Line Tools

- curl - For HTTP requests (usually pre-installed)
- unzip - For extracting results (usually pre-installed)

Optional Tools

- jq - For enhanced JSON parsing and security (recommended but not required)
  - If jq is not installed, the scripts use fallback methods
  - Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)
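One way a script can prefer jq and still work without it is sketched below. The helper name json_string_field is hypothetical, the fallback only handles simple string-valued fields, and the actual fallback used by the bundled scripts is not shown in this document.

```shell
# Hypothetical helper: print the first string value stored under a key.
# $1 = JSON text, $2 = key name
json_string_field() {
  if command -v jq >/dev/null 2>&1; then
    # jq path: search every nested object for the key.
    printf '%s' "$1" | jq -r --arg k "$2" 'first(.. | objects | select(has($k)) | .[$k])'
  else
    # Fallback path: pattern match; only reliable for plain string values.
    printf '%s' "$1" \
      | grep -o "\"$2\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" \
      | head -n 1 \
      | sed 's/.*:[[:space:]]*"\(.*\)"$/\1/'
  fi
}
```

The jq branch is the more robust one, since it parses the JSON instead of matching text, which is presumably why the document recommends installing jq.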

Optional Configuration

# Set the API base URL (pre-configured by default)
export MINERU_BASE_URL=https://mineru.net/api/v4

💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key

📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

Quick Start

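cd scripts/

# Step 1: Apply for an upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload the file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for the result
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download the result
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/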

Script Descriptions

local_file_step1_apply_upload_url.sh

Requests an upload URL and a batch_id.

Usage:
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]

Parameters:

- language: ch (Chinese), en (English), auto (auto-detect), default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo
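The optional parameters and their defaults can be checked before any request is sent. The following is a minimal sketch under the assumption that only the values listed above are accepted; normalize_options is a hypothetical helper, not part of the bundled scripts.

```shell
# Hypothetical helper: apply the documented defaults and reject unknown values.
# $1 = language (optional), $2 = layout model (optional)
normalize_options() {
  language=${1:-ch}
  layout_model=${2:-doclayout_yolo}

  case "$language" in
    ch|en|auto) ;;
    *) echo "invalid language: $language" >&2; return 1 ;;
  esac

  case "$layout_model" in
    doclayout_yolo|layoutlmv3) ;;
    *) echo "invalid layout model: $layout_model" >&2; return 1 ;;
  esac

  printf '%s %s\n' "$language" "$layout_model"
}
```

Rejecting unexpected values early also keeps arbitrary user input out of the JSON request body.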

Output:

BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
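Presigned URLs usually carry a query string that contains = characters, so splitting the output on every = would truncate UPLOAD_URL. A minimal sketch of a safer extraction, assuming the script prints one KEY=VALUE pair per line as shown above (kv_value is a hypothetical helper and the sample values are made up):

```shell
# Hypothetical helper: print the value for a key from KEY=VALUE lines.
# $1 = captured script output, $2 = key name
kv_value() {
  # sed removes only the leading "KEY=", so any '=' inside the value survives.
  printf '%s\n' "$1" | sed -n "s/^$2=//p" | head -n 1
}

# Made-up sample in the documented output format.
sample='BATCH_ID=1234abcd
UPLOAD_URL=https://example.com/upload/file.pdf?Expires=1700000000&Signature=abc%3D'

BATCH_ID=$(kv_value "$sample" BATCH_ID)
UPLOAD_URL=$(kv_value "$sample" UPLOAD_URL)
echo "$BATCH_ID"
echo "$UPLOAD_URL"
```

In real use, the sample variable would be replaced by the captured output of the step 1 script.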

local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

Usage:

./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>

local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

Usage:
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]

Output:

FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
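Polling with a retry limit and a fixed interval can be sketched as a generic loop. This is an illustration only: poll_until_done is a hypothetical function, and the state strings done and failed are placeholders rather than the values the MinerU API actually returns.

```shell
# Hypothetical polling loop.
# $1 = max retries, $2 = interval in seconds, remaining args = a command
# that prints the current task state on stdout.
poll_until_done() {
  max=$1
  interval=$2
  shift 2

  attempt=1
  while [ "$attempt" -le "$max" ]; do
    state=$("$@")
    case "$state" in
      done)   echo "done after $attempt attempt(s)"; return 0 ;;
      failed) echo "task failed" >&2; return 1 ;;
    esac
    # Any other state is treated as "still running".
    sleep "$interval"
    attempt=$((attempt + 1))
  done

  echo "timed out after $max attempts" >&2
  return 2
}
```

The separate exit codes (1 for failure, 2 for timeout) let a caller decide whether to retry later or give up.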

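local_file_step4_download.sh

Download result ZIP and extract.

Usage:
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]

Output Structure:
extracted/
├── full.md # 📄 Markdown document (main result)
├── images/ # 🖼️ Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data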

Detailed Documentation

📚 Complete Guide: See docs/Local_File_Parsing_Guide.md

🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). This method needs only 2 steps because no upload is required.

Quick Start

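cd scripts/

# Step 1: Submit the parsing task (provide the URL directly)
./online_file_step1_submit_task.sh https://arxiv.org/pdf/2410.17247.pdf
# Output: TASK_ID=xxx

# Step 2: Poll for the result and automatically download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/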

Script Descriptions

online_file_step1_submit_task.sh

Submit parsing task for online PDF.

Usage:
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]

Parameters:

- pdf_url: Complete URL of the online PDF (required)
- language: ch (Chinese), en (English), auto (auto-detect), default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

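online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

Usage:
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]

Output Structure:
extracted/
├── full.md # 📄 Markdown document (main result)
├── images/ # 🖼️ Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data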

Detailed Documentation

📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md

📊 Comparison of Two Parsing Methods

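Feature	Local PDF Parsing	Online PDF Parsing
Steps	4 steps	2 steps
Upload Required	✅ Yes	❌ No
Average Time	30-60 seconds	10-20 seconds
Use Case	Local files	Files already online (arXiv, websites, etc.)
File Size Limit	200MB	Limited by source server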

⚙️ Advanced Usage

Batch Process Local Files

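for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"

    # Step 1 (quote "$result" so each KEY=VALUE stays on its own line)
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep '^BATCH_ID=' | cut -d= -f2-)
    upload_url=$(echo "$result" | grep '^UPLOAD_URL=' | cut -d= -f2-)

    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"

    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep '^FULL_ZIP_URL=' | cut -d= -f2-)

    # Step 4 (the source is cut off at this call; the output names are assumed)
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted/"
done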

Batch Process Online Files

CODEBLOCK17
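A batch loop for online files could follow the two-step flow described earlier. The sketch below is an assumption-laden illustration: the script file names follow the underscore pattern of local_file_step2_upload_file.sh, urls.txt is a hypothetical file with one PDF URL per line, and the extracted_<name>/ output directories are an arbitrary naming choice.

```shell
# Hypothetical batch driver for the online (URL) method.
# $1 = file with one PDF URL per line, $2 = scripts directory (default: .)
process_online_pdfs() {
  list=$1
  scripts=${2:-.}

  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines

    # Step 1: submit the task; skip this URL if submission fails.
    out=$("$scripts/online_file_step1_submit_task.sh" "$url" < /dev/null) || {
      echo "submit failed: $url" >&2
      continue
    }
    task_id=$(printf '%s\n' "$out" | sed -n 's/^TASK_ID=//p' | head -n 1)

    # Step 2: poll, download, and extract into a per-file directory.
    name=$(basename "$url" .pdf)
    "$scripts/online_file_step2_poll_result.sh" "$task_id" "extracted_$name/" < /dev/null
  done < "$list"
}
```

Called as process_online_pdfs urls.txt scripts/, it processes the URLs one after another. Redirecting each script's stdin from /dev/null keeps the scripts from consuming the URL list that the loop is reading.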

⚠️ Notes

1. Token Configuration: Scripts prioritize MINERU_TOKEN and fall back to MINERU_API_KEY if it is not set
2. Token Security: Do not hard-code tokens in scripts; use environment variables
3. URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
4. File Limits: A single file should preferably not exceed 200MB; the maximum is 600 pages
5. Network Stability: Ensure a stable network connection when uploading large files
6. Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
7. Optional jq: Installing jq provides enhanced JSON parsing and additional security checks
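As an illustration of the kind of check that guards against directory traversal, the sketch below rejects output paths that are empty, absolute, or contain a .. component. is_safe_relative_path is a hypothetical helper; the validation actually performed by the bundled scripts is not shown in this document and may differ.

```shell
# Hypothetical check: succeed only for relative paths without ".." components.
is_safe_relative_path() {
  case "$1" in
    ''|/*|..|../*|*/..|*/../*) return 1 ;;   # empty, absolute, or traversal
    *) return 0 ;;
  esac
}

# Example: refuse to extract outside the working directory.
target="extracted/"
if is_safe_relative_path "$target"; then
  echo "ok: $target"
else
  echo "refusing unsafe path: $target" >&2
fi
```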

📚 Reference Documentation

Document	Description
docs/Local_File_Parsing_Guide.md	Detailed curl commands and parameters for local PDF parsing
docs/Online_URL_Parsing_Guide.md	Detailed curl commands and parameters for online PDF parsing

External Resources:

- 🏠 MinerU Official: https://mineru.net/
- 📖 API Documentation: https://mineru.net/apiManage/docs
- 💻 GitHub Repository: https://github.com/opendatalab/MinerU

Skill Version: 1.0.0
Release Date: 2026-02-18
Community Skill - Not officially affiliated with MinerU

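MinerU PDF Extractor

Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.

Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.

📁 Skill Structure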

mineru-pdf-extractor/
├── SKILL.md # English documentation
├── SKILL_zh.md # Chinese documentation
├── docs/ # Documentation
│   ├── Local_File_Parsing_Guide.md # Detailed guide to local PDF parsing (English)
│   ├── Online_URL_Parsing_Guide.md # Detailed guide to online PDF parsing (English)
│   ├── MinerU_本地文档解析完整流程.md # Complete local parsing guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md # Complete online parsing guide (Chinese)
└── scripts/ # Executable scripts
    ├── local_file_step1_apply_upload_url.sh # Local parsing, step 1
    ├── local_file_step2_upload_file.sh # Local parsing, step 2
    ├── local_file_step3_poll_result.sh # Local parsing, step 3
    ├── local_file_step4_download.sh # Local parsing, step 4
    ├── online_file_step1_submit_task.sh # Online parsing, step 1
    └── online_file_step2_poll_result.sh # Online parsing, step 2

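🔧 Requirements

Required Environment Variables

The scripts read the MinerU token from the environment. Set either of the following variables:

# Option 1: set MINERU_TOKEN
export MINERU_TOKEN=your_api_token_here

# Option 2: set MINERU_API_KEY
export MINERU_API_KEY=your_api_token_here

Required Command-Line Tools

- curl - For HTTP requests (usually pre-installed)
- unzip - For extracting results (usually pre-installed)

Optional Tools

- jq - For enhanced JSON parsing and security (recommended but not required)
  - If jq is not installed, the scripts use fallback methods
  - Install: apt-get install jq (Debian/Ubuntu) or brew install jq (macOS)

Optional Configuration

# Set the API base URL (pre-configured by default)
export MINERU_BASE_URL=https://mineru.net/api/v4

💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key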

📄 Feature 1: Parse Local PDF Documents

For locally stored PDF files. Requires 4 steps.

Quick Start

cd scripts/

# Step 1: Apply for an upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload the file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for the result
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download the result
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/

Script Descriptions

local_file_step1_apply_upload_url.sh

Requests an upload URL and a batch_id.

Usage:
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]

Parameters:

- language: ch (Chinese), en (English), auto (auto-detect), default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...

local_file_step2_upload_file.sh

Upload PDF file to the presigned URL.

Usage:
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>

local_file_step3_poll_result.sh

Poll extraction results until completion or failure.

Usage:
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]

Output:

FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip

local_file_step4_download.sh

Download result ZIP and extract.

Usage:
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]

Output Structure:

extracted/
├── full.md # 📄 Markdown document (main result)
├── images/ # 🖼️ Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data

Detailed Documentation

📚 Complete Guide: See docs/Local_File_Parsing_Guide.md

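🌐 Feature 2: Parse Online PDF Documents (URL Method)

For PDF files already available online (e.g., arXiv, websites). This method needs only 2 steps because no upload is required.

Quick Start

cd scripts/

# Step 1: Submit the parsing task (provide the URL directly)
./online_file_step1_submit_task.sh https://arxiv.org/pdf/2410.17247.pdf
# Output: TASK_ID=xxx

# Step 2: Poll for the result and automatically download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/

Script Descriptions

online_file_step1_submit_task.sh

Submit parsing task for online PDF.

Usage:
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]

Parameters:

- pdf_url: Complete URL of the online PDF (required)
- language: ch (Chinese), en (English), auto (auto-detect), default ch
- layout_model: doclayout_yolo (fast), layoutlmv3 (accurate), default doclayout_yolo

Output:

TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

online_file_step2_poll_result.sh

Poll extraction results, automatically download and extract when complete.

Usage:
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]

Output Structure:

extracted/
├── full.md # 📄 Markdown document (main result)
├── images/ # 🖼️ Extracted images
├── content_list.json # Structured content
└── layout.json # Layout analysis data

Detailed Documentation

📚 Complete Guide: See docs/Online_URL_Parsing_Guide.md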

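📊 Comparison of Two Parsing Methods

Feature	Local PDF Parsing	Online PDF Parsing
Steps	4 steps	2 steps
Upload Required	✅ Yes	❌ No
Average Time	30-60 seconds	10-20 seconds
Use Case	Local files	Files already online (arXiv, websites, etc.)
File Size Limit	200MB	Limited by source server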

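⚙️ Advanced Usage

Batch Process Local Files

for pdf in /path/to/pdfs/*.pdf; do
    echo "Processing: $pdf"

    # Step 1 (quote "$result" so each KEY=VALUE stays on its own line)
    result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
    batch_id=$(echo "$result" | grep '^BATCH_ID=' | cut -d= -f2-)
    upload_url=$(echo "$result" | grep '^UPLOAD_URL=' | cut -d= -f2-)

    # Step 2
    ./local_file_step2_upload_file.sh "$upload_url" "$pdf"

    # Step 3
    zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep '^FULL_ZIP_URL=' | cut -d= -f2-)

    # Step 4 (the source is cut off at this call; the output names are assumed)
    filename=$(basename "$pdf" .pdf)
    ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted/"
done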
