PDF Vision Extraction Skill (Enhanced)

Overview

This skill handles image-based or scanned PDFs that contain no selectable text. It supports multiple vision APIs with automatic fallback:

Primary Models

- Xflow: qwen3-vl-plus (your primary vision model)
ZhipuAI: glm-4.6v-flash (free vision model with fallback support)
Fallback: glm-5 (text-only, but may work with some image prompts)

Unlike traditional PDF text extraction tools (pdftotext, pdfplumber) which only work on text-based PDFs, this skill can process:

- Scanned documents
Image-only PDFs
Photographed documents
Handwritten notes (with limitations)
Complex layouts with tables and formatting

Supported Models

Vision-Capable Models
Provider Model Type Context Free
Xflow INLINECODE5 Vision + Text 131K ❌
ZhipuAI
`glm-4.6v-flash` | Vision + Text | 32K | ✅ |

Provider	Model	Type	Context	Free
Xflow	INLINECODE5	Vision + Text	131K	❌
ZhipuAI

| ZhipuAI | glm-5 | Text-only* | 128K | ❌ |

Additional Text Models (for fallback)
Provider Model Context Free
ZhipuAI INLINECODE8 128K ✅
ZhipuAI
`cogview-3-flash` | 32K | ✅ |

Provider	Model	Context	Free
ZhipuAI	INLINECODE8	128K	✅
ZhipuAI

*Note: glm-5 is primarily text-only but may handle image prompts in some cases.

Prerequisites

1. API Configuration

Your OpenClaw must be configured with both providers:

Xflow Configuration (already set up):

- models.providers.openai.baseUrl: INLINECODE12
INLINECODE13: Your Xflow API key

ZhipuAI Configuration (update token):

- models.providers.zhipuai.baseUrl: INLINECODE15
INLINECODE16: Your ZhipuAI API token

2. Required System Tools

- pypdfium2 Python library (for PDF to image conversion)
INLINECODE18 (for API calls)
INLINECODE19 (for image encoding)

3. Python Libraries (already installed)

CODEBLOCK0

Usage

Automatic Fallback Mode (Default)

Uses Xflow first, falls back to ZhipuAI if needed: CODEBLOCK1

Specific Model Selection

Force a specific model for cost or performance reasons: CODEBLOCK2

Structured Data Extraction

CODEBLOCK3

Multi-page PDF Handling

CODEBLOCK4

Configuration

Environment Variables

The skill reads configuration from your OpenClaw config file (~/.openclaw/openclaw.json):

- models.providers.openai.baseUrl & INLINECODE22
INLINECODE23 & INLINECODE24

Output Format

Returns extracted text content as a string. For structured data requests, the AI model will format output according to your prompt instructions.

Examples

Cost-Optimized Extraction (Free Model)

Command: --model glm-4.6v-flash Use case: When you want to use free vision capabilities Result: Good quality extraction at no cost

High-Quality Extraction (Premium Model)

Command: --model qwen3-vl-plus Use case: When you need maximum accuracy and complex layout understanding Result: Best possible extraction quality

Automatic Fallback (Recommended)

Command: No --model flag Use case: Production environments where reliability is key Result: Uses best available model, falls back gracefully

Model Comparison

GLM-4.6V-Flash (Free)

- ✅ Completely free
✅ Good Chinese text recognition
✅ Decent table structure preservation
⚠️ Lower context window (32K vs 131K)
⚠️ May struggle with very complex layouts

Qwen3-VL-Plus (Premium)

- ✅ Superior image understanding
✅ Excellent table and structure recognition
✅ Larger context window (131K)
✅ Better handling of mixed languages
❌ Requires paid API access

Limitations

- Single page processing: Currently processes one page at a time
Image quality: Better results with higher resolution scans
Complex layouts: May struggle with very dense or overlapping text
Handwriting: Limited accuracy with handwritten content
File size: Large PDFs may exceed API token limits

Technical Implementation

The skill follows this workflow:

1. PDF to Image: Converts specified PDF page to PNG using INLINECODE28
Model Selection: Chooses model based on user preference or fallback logic
API Call: Sends image + prompt to selected vision API endpoint
Response Parsing: Extracts and returns the AI-generated text content
Fallback: If primary model fails, tries alternative models

For debugging, temporary files are created in /tmp/:

- /tmp/pdf_vision_page.png - converted image
INLINECODE31 - API request payload
INLINECODE32 - API response

Integration Notes

This skill complements the standard pdf skill:

- Use pdf skill for text-based PDFs (faster, no API cost)
Use pdf-vision skill for image-based/scanned PDFs (requires vision API)

Both skills can be used together in a fallback pattern:

1. Try pdf skill first
If no text extracted, fall back to pdf-vision skill

Cost Optimization Tips

1. Use GLM-4.6V-Flash for routine tasks - it's free and quite capable
Reserve Qwen3-VL-Plus for complex documents - when you need maximum accuracy
Test both models on your document types - choose based on your quality requirements
Monitor API usage - track which models you're using most

Update Your GLM API Token

Replace the placeholder token in your config:
CODEBLOCK5

PDF视觉提取技能（增强版）

概述

本技能处理基于图像或扫描的PDF文件，这些文件不包含可选文本。支持多种视觉API，并具备自动回退功能：

主要模型

- Xflow：qwen3-vl-plus（您的主要视觉模型）
智谱AI：glm-4.6v-flash（免费视觉模型，支持回退）
回退模型：glm-5（纯文本模型，但在某些情况下可处理图像提示）

与仅适用于文本型PDF的传统PDF文本提取工具（pdftotext、pdfplumber）不同，本技能可处理：

- 扫描文档
纯图像PDF
拍照文档
手写笔记（有限制）
包含表格和格式的复杂布局

支持的模型

视觉能力模型
提供商模型类型上下文免费
Xflow qwen3-vl-plus 视觉+文本 131K ❌
智谱AI
glm-4.6v-flash | 视觉+文本 | 32K | ✅ |

提供商	模型	类型	上下文	免费
Xflow	qwen3-vl-plus	视觉+文本	131K	❌
智谱AI

| 智谱AI | glm-5 | 纯文本* | 128K | ❌ |

额外文本模型（用于回退）
提供商模型上下文免费
智谱AI glm-4-flash-250414 128K ✅
智谱AI
cogview-3-flash | 32K | ✅ |

提供商	模型	上下文	免费
智谱AI	glm-4-flash-250414	128K	✅
智谱AI

*注意：glm-5主要是纯文本模型，但在某些情况下可处理图像提示。

前置条件

1. API配置

您的OpenClaw必须配置以下两个提供商：

Xflow配置（已设置）：

- models.providers.openai.baseUrl：https://apis.iflow.cn/v1
models.providers.openai.apiKey：您的Xflow API密钥

智谱AI配置（更新令牌）：

- models.providers.zhipuai.baseUrl：https://open.bigmodel.cn/api/paas/v4
models.providers.zhipuai.apiKey：您的智谱AI API令牌

2. 必需的系统工具

- pypdfium2 Python库（用于PDF转图像）
curl（用于API调用）
base64（用于图像编码）

3. Python库（已安装）

bash pypdfium2

使用方法

自动回退模式（默认）

优先使用Xflow，必要时回退到智谱AI： bash ./scripts/pdf_vision.py --pdf-path /path/to/document.pdf

指定模型选择

出于成本或性能原因强制使用特定模型： bash

使用免费的GLM-4.6V-Flash模型

./scripts/pdf_vision.py --pdf-path document.pdf --model zhipuai/glm-4.6v-flash

使用特定的Xflow模型

./scripts/pdf_vision.py --pdf-path document.pdf --model openai/qwen3-vl-plus

简短形式（自动检测提供商）

./scripts/pdf_vision.py --pdf-path document.pdf --model glm-4.6v-flash

结构化数据提取

bash ./scripts/pdf_vision.py --pdf-path invoice.pdf --prompt 提取为JSON：供应商、日期、总计 --model glm-4.6v-flash

多页PDF处理

bash

专门处理第3页

./scripts/pdf_vision.py --pdf-path book.pdf --page 3 --output page3.txt

配置

环境变量

本技能从您的OpenClaw配置文件（~/.openclaw/openclaw.json）读取配置：

- models.providers.openai.baseUrl 和 apiKey
models.providers.zhipuai.baseUrl 和 apiKey

输出格式

返回提取的文本内容作为字符串。对于结构化数据请求，AI模型将根据您的提示指令格式化输出。

示例

成本优化提取（免费模型）

命令： --model glm-4.6v-flash 使用场景： 当您想使用免费视觉能力时 结果： 零成本的优质提取

高质量提取（高级模型）

命令： --model qwen3-vl-plus 使用场景： 当您需要最大准确度和复杂布局理解时 结果： 最佳提取质量

自动回退（推荐）

命令： 无--model标志 使用场景： 可靠性至关重要的生产环境 结果： 使用最佳可用模型，优雅回退

模型对比

GLM-4.6V-Flash（免费）

- ✅ 完全免费
✅ 良好的中文文本识别
✅ 不错的表格结构保留
⚠️ 较低的上下文窗口（32K vs 131K）
⚠️ 可能难以处理非常复杂的布局

Qwen3-VL-Plus（高级）

- ✅ 卓越的图像理解能力
✅ 出色的表格和结构识别
✅ 更大的上下文窗口（131K）
✅ 更好的混合语言处理
❌ 需要付费API访问

限制

- 单页处理：目前一次处理一页
图像质量：更高分辨率的扫描件效果更好
复杂布局：可能难以处理非常密集或重叠的文本
手写内容：手写内容的准确度有限
文件大小：大型PDF可能超过API令牌限制

技术实现

本技能遵循以下工作流程：

1. PDF转图像：使用pypdfium2将指定PDF页面转换为PNG
模型选择：根据用户偏好或回退逻辑选择模型
API调用：将图像+提示发送到选定的视觉API端点
响应解析：提取并返回AI生成的文本内容
回退：如果主要模型失败，尝试替代模型

调试时，临时文件创建在/tmp/目录下：

- /tmp/pdfvisionpage.png - 转换后的图像
/tmp/pdfvisionpayload.json - API请求负载
/tmp/pdfvisionresponse.json - API响应

集成说明

本技能补充了标准的pdf技能：

- 对文本型PDF使用pdf技能（更快，无API成本）
对基于图像/扫描的PDF使用pdf-vision技能（需要视觉API）

两种技能可以在回退模式中一起使用：

1. 先尝试pdf技能
如果未提取到文本，回退到pdf-vision技能

成本优化技巧

1. 日常任务使用GLM-4.6V-Flash - 免费且相当有能力
复杂文档保留Qwen3-VL-Plus - 当您需要最大准确度时
在您的文档类型上测试两个模型 - 根据您的质量要求选择
监控API使用情况 - 跟踪您最常使用的模型

更新您的GLM API令牌

替换配置中的占位符令牌：
bash

将YOURACTUALGLM_TOKEN替换为您的真实令牌

sed -i s/YOURGLMAPITOKENHERE/YOURACTUALGLM_TOKEN/g ~/.openclaw/openclaw.json

pdf-visionPDF视觉提取

pdf-vision

PDF Vision Extraction Skill (Enhanced)

Overview

Primary Models

Supported Models

Vision-Capable ModelsProviderModelTypeContextFreeXflowINLINECODE5Vision + Text131K❌ZhipuAI glm-4.6v-flash | Vision + Text | 32K | ✅ |

Additional Text Models (for fallback)ProviderModelContextFreeZhipuAIINLINECODE8128K✅ZhipuAI cogview-3-flash | 32K | ✅ |

Prerequisites

1. API Configuration

2. Required System Tools

3. Python Libraries (already installed)

Usage

Automatic Fallback Mode (Default)

Specific Model Selection

Structured Data Extraction

Multi-page PDF Handling

Configuration

Environment Variables

Output Format

Examples

Cost-Optimized Extraction (Free Model)

High-Quality Extraction (Premium Model)

Automatic Fallback (Recommended)

Model Comparison

GLM-4.6V-Flash (Free)

Qwen3-VL-Plus (Premium)

Limitations

Technical Implementation

Integration Notes

Cost Optimization Tips

Update Your GLM API Token

PDF视觉提取技能（增强版）

概述

主要模型

支持的模型

视觉能力模型提供商模型类型上下文免费Xflowqwen3-vl-plus视觉+文本131K❌智谱AI glm-4.6v-flash | 视觉+文本 | 32K | ✅ |

额外文本模型（用于回退）提供商模型上下文免费智谱AIglm-4-flash-250414128K✅智谱AI cogview-3-flash | 32K | ✅ |

前置条件

1. API配置

2. 必需的系统工具

3. Python库（已安装）

使用方法

自动回退模式（默认）

指定模型选择

使用免费的GLM-4.6V-Flash模型

使用特定的Xflow模型

简短形式（自动检测提供商）

结构化数据提取

多页PDF处理

专门处理第3页

配置

环境变量

输出格式

示例

成本优化提取（免费模型）

高质量提取（高级模型）

自动回退（推荐）

模型对比

GLM-4.6V-Flash（免费）

Qwen3-VL-Plus（高级）

限制

技术实现

集成说明

成本优化技巧

更新您的GLM API令牌

将YOURACTUALGLM_TOKEN替换为您的真实令牌

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement

Vision-Capable Models
Provider Model Type Context Free
Xflow INLINECODE5 Vision + Text 131K ❌
ZhipuAI
`glm-4.6v-flash` | Vision + Text | 32K | ✅ |

Additional Text Models (for fallback)
Provider Model Context Free
ZhipuAI INLINECODE8 128K ✅
ZhipuAI
`cogview-3-flash` | 32K | ✅ |

视觉能力模型
提供商模型类型上下文免费
Xflow qwen3-vl-plus 视觉+文本 131K ❌
智谱AI
glm-4.6v-flash | 视觉+文本 | 32K | ✅ |

额外文本模型（用于回退）
提供商模型上下文免费
智谱AI glm-4-flash-250414 128K ✅
智谱AI
cogview-3-flash | 32K | ✅ |