PDF Vision Extraction Skill (Enhanced)
Overview
This skill handles image-based or scanned PDFs that contain no selectable text. It supports multiple vision APIs with automatic fallback:
Primary Models
- - Xflow:
qwen3-vl-plus (your primary vision model) - ZhipuAI:
glm-4.6v-flash (free vision model with fallback support) - Fallback:
glm-5 (text-only, but may work with some image prompts)
Unlike traditional PDF text extraction tools (pdftotext, pdfplumber) which only work on text-based PDFs, this skill can process:
- - Scanned documents
- Image-only PDFs
- Photographed documents
- Handwritten notes (with limitations)
- Complex layouts with tables and formatting
Supported Models
Vision-Capable Models
| Provider | Model | Type | Context | Free |
|---|
| Xflow | INLINECODE5 | Vision + Text | 131K | ❌ |
| ZhipuAI |
glm-4.6v-flash | Vision + Text | 32K | ✅ |
| ZhipuAI |
glm-5 | Text-only* | 128K | ❌ |
Additional Text Models (for fallback)
| Provider | Model | Context | Free |
|---|
| ZhipuAI | INLINECODE8 | 128K | ✅ |
| ZhipuAI |
cogview-3-flash | 32K | ✅ |
*Note: glm-5 is primarily text-only but may handle image prompts in some cases.
Prerequisites
1. API Configuration
Your OpenClaw must be configured with both providers:
Xflow Configuration (already set up):
- -
models.providers.openai.baseUrl: INLINECODE12 - INLINECODE13 : Your Xflow API key
ZhipuAI Configuration (update token):
- -
models.providers.zhipuai.baseUrl: INLINECODE15 - INLINECODE16 : Your ZhipuAI API token
2. Required System Tools
- -
pypdfium2 Python library (for PDF to image conversion) - INLINECODE18 (for API calls)
- INLINECODE19 (for image encoding)
3. Python Libraries (already installed)
CODEBLOCK0
Usage
Automatic Fallback Mode (Default)
Uses Xflow first, falls back to ZhipuAI if needed:
CODEBLOCK1
Specific Model Selection
Force a specific model for cost or performance reasons:
CODEBLOCK2
Structured Data Extraction
CODEBLOCK3
Multi-page PDF Handling
CODEBLOCK4
Configuration
Environment Variables
The skill reads configuration from your OpenClaw config file (
~/.openclaw/openclaw.json):
- -
models.providers.openai.baseUrl & INLINECODE22 - INLINECODE23 & INLINECODE24
Output Format
Returns extracted text content as a string. For structured data requests, the AI model will format output according to your prompt instructions.
Examples
Cost-Optimized Extraction (Free Model)
Command: --model glm-4.6v-flash
Use case: When you want to use free vision capabilities
Result: Good quality extraction at no cost
High-Quality Extraction (Premium Model)
Command: --model qwen3-vl-plus
Use case: When you need maximum accuracy and complex layout understanding
Result: Best possible extraction quality
Automatic Fallback (Recommended)
Command: No
--model flag
Use case: Production environments where reliability is key
Result: Uses best available model, falls back gracefully
Model Comparison
GLM-4.6V-Flash (Free)
- - ✅ Completely free
- ✅ Good Chinese text recognition
- ✅ Decent table structure preservation
- ⚠️ Lower context window (32K vs 131K)
- ⚠️ May struggle with very complex layouts
Qwen3-VL-Plus (Premium)
- - ✅ Superior image understanding
- ✅ Excellent table and structure recognition
- ✅ Larger context window (131K)
- ✅ Better handling of mixed languages
- ❌ Requires paid API access
Limitations
- - Single page processing: Currently processes one page at a time
- Image quality: Better results with higher resolution scans
- Complex layouts: May struggle with very dense or overlapping text
- Handwriting: Limited accuracy with handwritten content
- File size: Large PDFs may exceed API token limits
Technical Implementation
The skill follows this workflow:
- 1. PDF to Image: Converts specified PDF page to PNG using INLINECODE28
- Model Selection: Chooses model based on user preference or fallback logic
- API Call: Sends image + prompt to selected vision API endpoint
- Response Parsing: Extracts and returns the AI-generated text content
- Fallback: If primary model fails, tries alternative models
For debugging, temporary files are created in /tmp/:
- -
/tmp/pdf_vision_page.png - converted image - INLINECODE31 - API request payload
- INLINECODE32 - API response
Integration Notes
This skill complements the standard pdf skill:
- - Use
pdf skill for text-based PDFs (faster, no API cost) - Use
pdf-vision skill for image-based/scanned PDFs (requires vision API)
Both skills can be used together in a fallback pattern:
- 1. Try
pdf skill first - If no text extracted, fall back to
pdf-vision skill
Cost Optimization Tips
- 1. Use GLM-4.6V-Flash for routine tasks - it's free and quite capable
- Reserve Qwen3-VL-Plus for complex documents - when you need maximum accuracy
- Test both models on your document types - choose based on your quality requirements
- Monitor API usage - track which models you're using most
Update Your GLM API Token
Replace the placeholder token in your config:
CODEBLOCK5
PDF视觉提取技能(增强版)
概述
本技能处理基于图像或扫描的PDF文件,这些文件不包含可选文本。支持多种视觉API,并具备自动回退功能:
主要模型
- - Xflow:qwen3-vl-plus(您的主要视觉模型)
- 智谱AI:glm-4.6v-flash(免费视觉模型,支持回退)
- 回退模型:glm-5(纯文本模型,但在某些情况下可处理图像提示)
与仅适用于文本型PDF的传统PDF文本提取工具(pdftotext、pdfplumber)不同,本技能可处理:
- - 扫描文档
- 纯图像PDF
- 拍照文档
- 手写笔记(有限制)
- 包含表格和格式的复杂布局
支持的模型
视觉能力模型
| 提供商 | 模型 | 类型 | 上下文 | 免费 |
|---|
| Xflow | qwen3-vl-plus | 视觉+文本 | 131K | ❌ |
| 智谱AI |
glm-4.6v-flash | 视觉+文本 | 32K | ✅ |
| 智谱AI | glm-5 | 纯文本* | 128K | ❌ |
额外文本模型(用于回退)
| 提供商 | 模型 | 上下文 | 免费 |
|---|
| 智谱AI | glm-4-flash-250414 | 128K | ✅ |
| 智谱AI |
cogview-3-flash | 32K | ✅ |
*注意:glm-5主要是纯文本模型,但在某些情况下可处理图像提示。
前置条件
1. API配置
您的OpenClaw必须配置以下两个提供商:
Xflow配置(已设置):
- - models.providers.openai.baseUrl:https://apis.iflow.cn/v1
- models.providers.openai.apiKey:您的Xflow API密钥
智谱AI配置(更新令牌):
- - models.providers.zhipuai.baseUrl:https://open.bigmodel.cn/api/paas/v4
- models.providers.zhipuai.apiKey:您的智谱AI API令牌
2. 必需的系统工具
- - pypdfium2 Python库(用于PDF转图像)
- curl(用于API调用)
- base64(用于图像编码)
3. Python库(已安装)
bash
pypdfium2
使用方法
自动回退模式(默认)
优先使用Xflow,必要时回退到智谱AI:
bash
./scripts/pdf_vision.py --pdf-path /path/to/document.pdf
指定模型选择
出于成本或性能原因强制使用特定模型:
bash
使用免费的GLM-4.6V-Flash模型
./scripts/pdf_vision.py --pdf-path document.pdf --model zhipuai/glm-4.6v-flash
使用特定的Xflow模型
./scripts/pdf_vision.py --pdf-path document.pdf --model openai/qwen3-vl-plus
简短形式(自动检测提供商)
./scripts/pdf_vision.py --pdf-path document.pdf --model glm-4.6v-flash
结构化数据提取
bash
./scripts/pdf_vision.py --pdf-path invoice.pdf --prompt 提取为JSON:供应商、日期、总计 --model glm-4.6v-flash
多页PDF处理
bash
专门处理第3页
./scripts/pdf_vision.py --pdf-path book.pdf --page 3 --output page3.txt
配置
环境变量
本技能从您的OpenClaw配置文件(~/.openclaw/openclaw.json)读取配置:
- - models.providers.openai.baseUrl 和 apiKey
- models.providers.zhipuai.baseUrl 和 apiKey
输出格式
返回提取的文本内容作为字符串。对于结构化数据请求,AI模型将根据您的提示指令格式化输出。
示例
成本优化提取(免费模型)
命令: --model glm-4.6v-flash
使用场景: 当您想使用免费视觉能力时
结果: 零成本的优质提取
高质量提取(高级模型)
命令: --model qwen3-vl-plus
使用场景: 当您需要最大准确度和复杂布局理解时
结果: 最佳提取质量
自动回退(推荐)
命令: 无--model标志
使用场景: 可靠性至关重要的生产环境
结果: 使用最佳可用模型,优雅回退
模型对比
GLM-4.6V-Flash(免费)
- - ✅ 完全免费
- ✅ 良好的中文文本识别
- ✅ 不错的表格结构保留
- ⚠️ 较低的上下文窗口(32K vs 131K)
- ⚠️ 可能难以处理非常复杂的布局
Qwen3-VL-Plus(高级)
- - ✅ 卓越的图像理解能力
- ✅ 出色的表格和结构识别
- ✅ 更大的上下文窗口(131K)
- ✅ 更好的混合语言处理
- ❌ 需要付费API访问
限制
- - 单页处理:目前一次处理一页
- 图像质量:更高分辨率的扫描件效果更好
- 复杂布局:可能难以处理非常密集或重叠的文本
- 手写内容:手写内容的准确度有限
- 文件大小:大型PDF可能超过API令牌限制
技术实现
本技能遵循以下工作流程:
- 1. PDF转图像:使用pypdfium2将指定PDF页面转换为PNG
- 模型选择:根据用户偏好或回退逻辑选择模型
- API调用:将图像+提示发送到选定的视觉API端点
- 响应解析:提取并返回AI生成的文本内容
- 回退:如果主要模型失败,尝试替代模型
调试时,临时文件创建在/tmp/目录下:
- - /tmp/pdfvisionpage.png - 转换后的图像
- /tmp/pdfvisionpayload.json - API请求负载
- /tmp/pdfvisionresponse.json - API响应
集成说明
本技能补充了标准的pdf技能:
- - 对文本型PDF使用pdf技能(更快,无API成本)
- 对基于图像/扫描的PDF使用pdf-vision技能(需要视觉API)
两种技能可以在回退模式中一起使用:
- 1. 先尝试pdf技能
- 如果未提取到文本,回退到pdf-vision技能
成本优化技巧
- 1. 日常任务使用GLM-4.6V-Flash - 免费且相当有能力
- 复杂文档保留Qwen3-VL-Plus - 当您需要最大准确度时
- 在您的文档类型上测试两个模型 - 根据您的质量要求选择
- 监控API使用情况 - 跟踪您最常使用的模型
更新您的GLM API令牌
替换配置中的占位符令牌:
bash
将YOURACTUALGLM_TOKEN替换为您的真实令牌
sed -i s/YOUR
GLMAPI
TOKENHERE/YOUR
ACTUALGLM_TOKEN/g ~/.openclaw/openclaw.json