PaddleOCR Document Parsing Skill
When to Use This Skill
Use Document Parsing for:
- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding
Use Text Recognition instead for:
- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text
Installation
Install the Python dependencies before using this skill. From the skill directory (skills/paddleocr-doc-parsing), run:
pip install -r scripts/requirements.txt
Optional, for document optimization and split_pdf.py (page extraction):
pip install -r scripts/requirements-optimize.txt
How to Use This Skill
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
1. ONLY use the PaddleOCR Document Parsing API - Execute the script python scripts/vl_caller.py
2. NEVER parse documents directly - Do NOT parse documents yourself
3. NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
4. IF the API fails - Display the error message and STOP immediately
5. NO fallback methods - Do NOT attempt document parsing any other way
If the script execution fails (API not configured, network error, etc.):
- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for user to fix the configuration
Basic Workflow
1. Execute document parsing:
python scripts/vl_caller.py --file-url "URL provided by user" --pretty
Or for local files:
python scripts/vl_caller.py --file-path "file path" --pretty
Optional: explicitly set the file type:
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty
- --file-type 0: PDF
- --file-type 1: image
- If omitted, the service can infer the file type from the input.
Default behavior: save raw JSON to a temp file:
- If --output is omitted, the script saves automatically under the system temp directory
- Default path pattern: <system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json
- If --output is provided, it overrides the default temp-file destination
- If --stdout is provided, JSON is printed to stdout and no file is saved
- In save mode, the script prints the absolute saved path on stderr: Result saved to: /absolute/path/...
- In default/custom save mode, read and parse the saved JSON file before responding
- In save mode, always tell the user the saved file path and that full raw JSON is available there
- Use --stdout only when you explicitly want to skip file persistence
2. The output JSON contains COMPLETE content with all document data:
- Headers, footers, page numbers
- Main text content
- Tables with structure
- Formulas (with LaTeX)
- Figures and charts
- Footnotes and references
- Seals and stamps
- Layout and reading order
Input type note:
- Supported file types depend on the model and endpoint configuration.
- Always follow the file type constraints documented by your endpoint API.
3. Extract what the user needs from the output JSON using these fields:
- Top-level text
- result[n].markdown
- result[n].prunedResult
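The save-mode steps above (find the path printed on stderr, then read and parse the saved JSON before responding) can be sketched roughly as follows. This is a minimal illustration, not part of the skill's scripts: the helper names saved_path_from_stderr and load_result are hypothetical, and the only detail taken from the text is the "Result saved to: <path>" line format.

```python
import json
import re
from pathlib import Path

# Matches the line the script is described as printing on stderr in save mode.
SAVED_LINE = re.compile(r"^Result saved to:[ \t]*(.+)$", re.MULTILINE)


def saved_path_from_stderr(stderr_text):
    """Return the announced result path, or None if no file was saved.

    Hypothetical helper: assumes the path is everything after the prefix.
    """
    match = SAVED_LINE.search(stderr_text)
    return Path(match.group(1).strip()) if match else None


def load_result(path):
    """Read and parse the saved JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

When --stdout is used no such line is printed, so the first helper returns None and the JSON has to be read from standard output instead.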
IMPORTANT: Complete Content Display
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- The output JSON contains ALL document content in a structured format
- In save mode, the raw provider result can be inspected in the saved JSON file
- Display the full content requested by the user; do NOT truncate or summarize
- If the user asks for "all text", show the entire text field
- If the user asks for "tables", show ALL tables in the document
- If the user asks for "main content", filter out headers/footers but show ALL body text
What this means:
- DO: Display complete text, all tables, all formulas as requested
- DO: Present content using these fields: top-level text, result[n].markdown, and result[n].prunedResult
- DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- DON'T: Summarize or provide excerpts when the user asks for full content
- DON'T: Say "Here's a preview" when the user expects complete output
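One way to read the truncation rule above is as a simple length check: content up to 10,000 characters is shown in full, and only longer content may be cut with a visible marker. The sketch below assumes that reading. The function name render_for_user and the wording of the marker are illustrative choices, not something the skill prescribes.

```python
MAX_CHARS = 10_000  # threshold quoted in the rule above


def render_for_user(text, limit=MAX_CHARS):
    """Return text unchanged unless it exceeds the limit.

    Hypothetical helper: longer text is cut at the limit and ends with
    an explicit note so the user knows content is missing.
    """
    if len(text) <= limit:
        return text
    note = f"\n... [truncated: showing {limit} of {len(text)} characters]"
    return text[:limit] + note
```

In save mode the note could also point at the saved JSON file, since the full raw result is available there.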
Example - Correct:
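User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here is all the extracted text:
[Display the entire text field, or the concatenated regions in reading order]
Document statistics:
Quality: Excellent (confidence: 0.92)
Example - Incorrect:
User: "Extract all the text"
Agent: I found a document with multiple sections. Here is the beginning:
Introduction... (content truncated for brevity)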
Understanding the JSON Response
The output JSON uses an envelope wrapping the raw API result:
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... },  // raw provider response
  "error": null
}
Key fields:
- text - extracted markdown text from all pages (use this for quick text display)
- result - raw provider response object
- result[n].prunedResult - structured parsing output for each page (layout/content/confidence and related metadata)
- result[n].markdown - full rendered page output in markdown/HTML
Raw result location (default): the temp-file path printed by the script on stderr
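As a rough illustration of how the envelope might be consumed, the sketch below checks ok before touching text. It relies only on the four top-level keys described above. The shape of error (string or object) and the internal layout of result are not specified here, so the code treats error as opaque and leaves result alone. The function name full_text is hypothetical.

```python
def full_text(envelope):
    """Return the top-level text of a parsed envelope.

    Raises RuntimeError when ok is false, so the caller can show the
    error to the user and stop rather than fall back to another method.
    """
    if not envelope.get("ok"):
        # The error value is passed through as-is; its format is an unknown.
        raise RuntimeError(f"document parsing failed: {envelope.get('error')}")
    return envelope.get("text") or ""
```

Raising instead of returning an empty string keeps a failed call from being mistaken for an empty document.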
Usage Examples
Example 1: Extract Full Document Text
python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty
Then use:
- Top-level text for quick full-text output
- result[n].markdown when page-level output is needed
Example 2: Extract Structured Page Data
python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty
Then use:
- result[n].prunedResult for structured parsing data (layout/content/confidence)
- result[n].markdown for rendered page content
Example 3: Print JSON Without Saving
python scripts/vl_caller.py \
  --file-url "URL" \
  --stdout \
  --pretty
Then return:
- Full text when the user asks for full document content
- result[n].prunedResult and result[n].markdown when the user needs complete structured page data
First-Time Configuration
When API is not configured:
The error will show:
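CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Set it to your Triton endpoint, e.g.: http://10.0.0.1:8020/v2/models/layout-parsing/infer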
Configuration workflow:
1. Show the exact error message to the user.
2. Guide the user to configure:
- Set PADDLEOCR_DOC_PARSING_API_URL to the full Triton inference endpoint URL.
  Format: http://<host>:<port>/v2/models/layout-parsing/infer
  Example: http://10.0.133.33:8020/v2/models/layout-parsing/infer
- If the service is behind an nginx proxy with Basic Auth, also set:
  - PADDLEOCR_BASIC_AUTH_USER - nginx username (e.g. ocr_admin)
  - PADDLEOCR_BASIC_AUTH_PASSWORD - nginx password
- PADDLEOCR_ACCESS_TOKEN is not required for local deployments. Leave it empty or omit it.
- Optionally set PADDLEOCR_DOC_PARSING_TIMEOUT (default: 600 seconds).
- In OpenClaw, set environment variables in ~/.openclaw/openclaw.json:
CODEBLOCK12
3. Ask the user to confirm the environment is configured.
4. Retry only after confirmation: once the user confirms the environment variables are set, retry the original parsing task.
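Before retrying, the environment can be sanity-checked along these lines. This is a sketch under assumptions: the variable names and the 600-second default come from the list above, while the function name load_config, the returned dictionary layout, and the URL check are illustrative. How scripts/vl_caller.py itself reads these variables is not shown here.

```python
import os
from urllib.parse import urlparse


def load_config(env=os.environ):
    """Collect the documented settings from a mapping of environment variables."""
    url = (env.get("PADDLEOCR_DOC_PARSING_API_URL") or "").strip()
    if not url:
        raise ValueError("PADDLEOCR_DOC_PARSING_API_URL is not set")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a full http(s) endpoint URL: {url!r}")

    # Basic Auth is optional; use it only when both values are present.
    user = env.get("PADDLEOCR_BASIC_AUTH_USER") or None
    password = env.get("PADDLEOCR_BASIC_AUTH_PASSWORD") or None
    auth = (user, password) if user and password else None

    # Documented default timeout: 600 seconds.
    timeout = float(env.get("PADDLEOCR_DOC_PARSING_TIMEOUT") or 600)
    return {"url": url, "auth": auth, "timeout": timeout}
```

Passing a plain dict instead of os.environ makes the check easy to try without touching the real environment.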
Handling Large Files
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:
Use URL for Large Local Files (Recommended)
For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:
CODEBLOCK13
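To put a number on the base64 overhead mentioned above: standard padded base64 emits 4 output characters for every 3 input bytes, so an encoded payload is roughly a third larger than the file. The helper name base64_length is illustrative. How the script packages the encoded file into the request is not described here, so this only estimates the encoded size.

```python
import base64


def base64_length(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


# Cross-check the formula against the standard library on a small input.
assert base64_length(1000) == len(base64.b64encode(b"x" * 1000))
```

By this formula a 30,000,000-byte file becomes 40,000,000 characters of base64 before any JSON framing is added.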
Process Specific Pages (PDF Only)
If you only need certain pages from a large PDF, extract them first:
CODEBLOCK14
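Given the 100-pages-per-request limit stated at the start of this section, a longer PDF has to be handled in batches. The sketch below only computes the page ranges. The function name page_batches is hypothetical, and the ranges would still need to be extracted with a tool such as split_pdf.py, whose options are not covered here.

```python
MAX_PAGES_PER_REQUEST = 100  # limit stated for PDFs in this section


def page_batches(total_pages, batch_size=MAX_PAGES_PER_REQUEST):
    """Return inclusive, 1-based (first, last) page ranges covering the document."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


print(page_batches(250))  # [(1, 100), (101, 200), (201, 250)]
```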
Error Handling
Service unreachable:
error: API request failed: ...
→ Check that the Triton service is running and that PADDLEOCR_DOC_PARSING_API_URL is correct
Request timeout:
error: API request timed out after 600s
→ Increase PADDLEOCR_DOC_PARSING_TIMEOUT or check server load
Unsupported format:
error: Unsupported file format
→ The file format is not supported; convert the file to PDF, PNG, or JPG
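The three cases above amount to a small lookup from error text to suggested fix. The sketch below assumes that the messages can be recognised by the substrings quoted in this section, which may not hold for every version of the script. The names HINTS and hint_for are illustrative. The hint is meant to accompany the error message shown to the user, not to replace it.

```python
# Substrings are taken from the error messages quoted above. The timeout case
# is listed first so it is never shadowed by a more general match.
HINTS = [
    ("API request timed out",
     "Increase PADDLEOCR_DOC_PARSING_TIMEOUT or check server load."),
    ("API request failed",
     "Check that the Triton service is running and that "
     "PADDLEOCR_DOC_PARSING_API_URL is correct."),
    ("Unsupported file format",
     "Convert the file to PDF, PNG, or JPG."),
]


def hint_for(error_line):
    """Return the suggested fix for a known error line, or None."""
    for needle, hint in HINTS:
        if needle in error_line:
            return hint
    return None
```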
Important Notes
- The script NEVER filters content - It always returns complete data
- The AI agent decides what to present - Based on user's specific request
- All data is always available - Can be re-interpreted for different needs
- No information is lost - Complete document structure preserved
Reference Documentation
- references/output_schema.md - Output format specification
Note: Model version and capabilities are determined by your Triton deployment (PADDLEOCR_DOC_PARSING_API_URL).
Load these reference documents into context when:
- Debugging complex parsing issues
- Understanding the output format
- Working with provider API details
Testing the Skill
To verify the skill is working properly:
CODEBLOCK18
This tests configuration and optionally API connectivity.
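PaddleOCR Document Parsing Skill
When to Use This Skill
Use Document Parsing for:
- Documents containing tables (invoices, financial reports, spreadsheets)
- Documents containing mathematical formulas (academic papers, scientific documents)
- Documents containing charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures that require layout analysis
- Any document that requires structured understanding
Use Text Recognition instead for:
- Simple text-only extraction
- Quick OCR tasks where speed matters
- Screenshots or simple images with clear text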
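Installation
Install the Python dependencies before using this skill. From the skill directory (skills/paddleocr-doc-parsing), run:
pip install -r scripts/requirements.txt
Optional, for document optimization and split_pdf.py (page extraction):
pip install -r scripts/requirements-optimize.txt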
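How to Use This Skill
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
1. ONLY use the PaddleOCR Document Parsing API - Execute the script python scripts/vl_caller.py
2. NEVER parse documents directly - Do NOT parse documents yourself
3. NEVER offer alternatives - Do NOT say "I can try to analyze it" or similar
4. IF the API fails - Display the error message and STOP immediately
5. NO fallback methods - Do NOT attempt document parsing any other way
If the script execution fails (API not configured, network error, etc.):
- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for the user to fix the configuration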
Basic Workflow
1. Execute document parsing:
python scripts/vl_caller.py --file-url "URL provided by user" --pretty
Or for local files:
python scripts/vl_caller.py --file-path "file path" --pretty
Optional: explicitly set the file type:
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty
- --file-type 0: PDF
- --file-type 1: image
- If omitted, the service can infer the file type from the input.
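Default behavior: save raw JSON to a temp file:
- If --output is omitted, the script saves automatically under the system temp directory
- Default path pattern: <system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json
- If --output is provided, it overrides the default temp-file destination
- If --stdout is provided, JSON is printed to stdout and no file is saved
- In save mode, the script prints the absolute saved path on stderr: Result saved to: /absolute/path/...
- In default/custom save mode, read and parse the saved JSON file before responding
- In save mode, always tell the user the saved file path and that the full raw JSON is available there
- Use --stdout only when you explicitly want to skip file persistence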
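2. The output JSON contains COMPLETE content with all document data:
- Headers, footers, page numbers
- Main text content
- Tables with structure
- Formulas (with LaTeX)
- Figures and charts
- Footnotes and references
- Seals and stamps
- Layout and reading order
Input type note:
- Supported file types depend on the model and endpoint configuration.
- Always follow the file type constraints documented by your endpoint API.
3. Extract what the user needs from the output JSON using these fields:
- Top-level text
- result[n].markdown
- result[n].prunedResult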
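IMPORTANT: Complete Content Display
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- The output JSON contains ALL document content in a structured format
- In save mode, the raw provider result can be inspected in the saved JSON file
- Display the full content requested by the user; do NOT truncate or summarize
- If the user asks for "all text", show the entire text field
- If the user asks for "tables", show ALL tables in the document
- If the user asks for "main content", filter out headers/footers but show ALL body text
What this means:
- DO: Display complete text, all tables, all formulas as requested
- DO: Present content using these fields: top-level text, result[n].markdown, and result[n].prunedResult
- DON'T: Truncate with "..." unless content is excessively long (more than 10,000 characters)
- DON'T: Summarize or provide excerpts when the user asks for full content
- DON'T: Say "Here's a preview" when the user expects complete output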
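Example - Correct:
User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here is all the extracted text:
[Display the entire text field, or the concatenated regions in reading order]
Document statistics:
Quality: Excellent (confidence: 0.92)
Example - Incorrect:
User: "Extract all the text"
Agent: I found a document with multiple sections. Here is the beginning:
Introduction... (content truncated for brevity)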
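Understanding the JSON Response
The output JSON uses an envelope wrapping the raw API result:
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... },  // raw provider response
  "error": null
}
Key fields:
- text - extracted markdown text from all pages (use this for quick text display)
- result - raw provider response object
- result[n].prunedResult - structured parsing output for each page (layout/content/confidence and related metadata)
- result[n].markdown - full rendered output for each page (markdown/HTML)
Raw result location (default): the temp-file path printed by the script on stderr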
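Usage Examples
Example 1: Extract Full Document Text
python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty
Then use:
- Top-level text for quick full-text output
- result[n].markdown when page-level output is needed
Example 2: Extract Structured Page Data
python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty
Then use:
- result[n].prunedResult for structured parsing data (layout/content/confidence)
- result[n].markdown for rendered page content
Example 3: Print JSON Without Saving
python scripts/vl_caller.py \
  --file-url "URL" \
  --stdout \
  --pretty
Then return:
- Full text when the user asks for full document content
- result[n].prunedResult and result[n].markdown when the user needs complete structured page data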
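First-Time Configuration
When the API is not configured:
The error will show:
CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Set it to your Triton endpoint, e.g.: http://10.0.0.1:8020/v2/models/layout-parsing/infer
Configuration workflow:
1. Show the exact error message to the user.
2. Guide the user to configure:
- Set PADDLEOCR_DOC_PARSING_API_URL to the full Triton inference endpoint URL.
  Format: http://<host>:<port>/v2/models/layout-parsing/infer
  Example: http://10.0.133.33:8020/v2/models/layout-parsing/infer
- If the service is behind an nginx proxy with Basic Auth, also set:
  - PADDLEOCR_BASIC_AUTH_USER - nginx username (e.g. ocr_admin)
  - PADDLEOCR_BASIC_AUTH_PASSWORD - nginx password
- PADDLEOCR_ACCESS_TOKEN is not required for local deployments. Leave it empty or omit it.
- Optionally set PADDLEOCR_DOC_PARSING_TIMEOUT (default: 600 seconds).
- In OpenClaw, set environment variables in ~/.openclaw/openclaw.json: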
{
  "skills": {
    "entries": {
      "paddleocr-doc-parsing": {
        "enabled": true,
        "env": {
          "PADDLEOCR_DOC_PARSING_API_URL": "http://10.0.133.33:8020/v2/models/layout