# ifly-pdf&image-ocr

AI-powered OCR service for images and PDF documents using iFlytek's advanced recognition APIs.

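## Quick Start

### Image OCR (LLM OCR)

```bash
# Recognize an image and extract its text
python3 scripts/image_ocr.py /path/to/image.jpg

# Save the result to a file
python3 scripts/image_ocr.py /path/to/image.jpg -o output.txt

# Specify the output format
python3 scripts/image_ocr.py /path/to/image.jpg --format json
python3 scripts/image_ocr.py /path/to/image.jpg --format markdown
```

### PDF OCR

```bash
# Convert a PDF to Word (default)
python3 scripts/pdf_ocr.py document.pdf

# Convert a PDF to Markdown
python3 scripts/pdf_ocr.py document.pdf --format markdown

# Convert a PDF to JSON
python3 scripts/pdf_ocr.py document.pdf --format json

# Convert from a public URL
python3 scripts/pdf_ocr.py --pdf-url https://example.com/doc.pdf --format word
```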

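## Setup

### API Credentials

Get credentials from the iFlytek Open Platform.

For Image OCR:

- **APPID**: Application ID
- **APIKEY**: API key for authentication
- **API_SECRET**: API secret for signing requests

For PDF OCR:

- **APPID**: Application ID
- **APISECRET**: Application secret (for signature generation)

### Environment Variables

```bash
# Required for both Image OCR and PDF OCR
export IFLYAPPID=yourappid

# Required for Image OCR
export IFLYAPIKEY=yourapikey

# Required for PDF OCR
export IFLYAPISECRET=yourapisecret
```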

## Features

### Image OCR (LLM OCR)

- **AI-powered**: Advanced LLM-based OCR for high accuracy
- **Multi-format output**: JSON, Markdown, or both
- **Layout understanding**: Preserves document structure
- **Multi-language**: Supports text extraction in multiple languages
- **Image preprocessing**: Automatic rotation correction, noise removal

### PDF OCR

- **AI-powered OCR**: Advanced AI model for accurate text extraction
- **Multiple output formats**:
  - Word (.docx): Editable Word document
  - Markdown: Plain text with formatting
  - JSON: Structured data
- **Large PDF support**: Up to 100 pages per document
- **Page-by-page results**: Access individual page results
- **Download URLs**: Direct links to processed files

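## API Parameters

### Image OCR Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `image_path` | string | Yes | Path to image file |
| `--format` | string | No | Output format: `json`, `markdown`, `json,markdown` (default: `json,markdown`) |
| `--output` | string | No | Save the result to a file |

### PDF OCR Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `pdf_path` | string | Yes* | Path to PDF file |
| `--pdf-url` | string | No* | Public URL of the PDF file |
| `--format` | string | No | Output format: `word`, `markdown`, `json` (default: `word`) |
| `--no-poll` | flag | No | Return the task ID without polling |
| `--poll-interval` | integer | No | Polling interval in seconds (minimum 5, default: 5) |
| `--max-wait` | integer | No | Maximum wait time in seconds (default: 300) |

*Either `pdf_path` or `--pdf-url` must be provided.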

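## Authentication

### Image OCR (HMAC-SHA256)

Uses HMAC-SHA256 signature authentication:

1. Generate an RFC 1123 format date: `EEE, dd MMM yyyy HH:mm:ss GMT`
2. Create the signature origin: `host: {host}\ndate: {date}\nPOST {path} HTTP/1.1`
3. Calculate the signature: `HMAC-SHA256(signature_origin, apiSecret)`
4. Build the authorization string: `hmac username={apiKey}, algorithm=hmac-sha256, headers=host date request-line, signature={signature}`
5. Encode the authorization string in Base64
6. Send as query parameters: `?authorization={auth}&host={host}&date={date}`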
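The signing steps above can be sketched in Python as follows. This is a minimal illustration, not the code in `scripts/image_ocr.py`: the function name `build_auth_query` is hypothetical, and two details are assumptions because the step list does not spell them out, namely that the HMAC digest is Base64-encoded before it is placed in the authorization string, and that the values in that string are wrapped in double quotes.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.parse import urlencode


def build_auth_query(host, path, api_key, api_secret, now=None):
    """Return the URL-encoded query string carrying the HMAC-SHA256 auth."""
    # Step 1: RFC 1123 date, e.g. "Mon, 01 Jan 2024 00:00:00 GMT".
    now = now or datetime.now(timezone.utc)
    date = format_datetime(now, usegmt=True)

    # Step 2: signature origin (host, date, request line).
    signature_origin = f"host: {host}\ndate: {date}\nPOST {path} HTTP/1.1"

    # Step 3: HMAC-SHA256 keyed with the API secret.
    # Assumption: the raw digest is Base64-encoded.
    digest = hmac.new(
        api_secret.encode("utf-8"),
        signature_origin.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")

    # Step 4: authorization string. Assumption: values are double-quoted.
    authorization_origin = (
        f'hmac username="{api_key}", algorithm="hmac-sha256", '
        f'headers="host date request-line", signature="{signature}"'
    )

    # Step 5: Base64-encode the whole authorization string.
    authorization = base64.b64encode(
        authorization_origin.encode("utf-8")
    ).decode("ascii")

    # Step 6: the three query parameters, percent-encoded.
    return urlencode({"authorization": authorization, "host": host, "date": date})
```

A caller would append the returned string to the request URL after a `?`. The host and path depend on the endpoint in use and are not given here.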

### PDF OCR (MD5 + HMAC-SHA1)

Uses MD5 + HMAC-SHA1 signature authentication:

1. Generate a timestamp (Unix epoch, in seconds)
2. Calculate `auth = MD5(appId + timestamp)`
3. Calculate `signature = Base64(HMAC-SHA1(auth, apiSecret))`
4. Send the headers:
   - `appId`: Application ID
   - `timestamp`: Timestamp in seconds
   - `signature`: Generated signature

**Important**: The timestamp must be within 5 minutes of server time.
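As a rough sketch of the PDF OCR signature, the snippet below builds the three headers from an application ID and secret. The function name `build_pdf_headers` is hypothetical. It assumes that the MD5 value is used as a lowercase hex string and that the API secret is the HMAC key, which is the usual reading of `HMAC-SHA1(auth, apiSecret)` but is not stated explicitly here.

```python
import base64
import hashlib
import hmac
import time


def build_pdf_headers(app_id, api_secret, timestamp=None):
    """Return the appId / timestamp / signature headers for PDF OCR."""
    # Step 1: Unix timestamp in seconds, sent as a string.
    ts = str(int(time.time()) if timestamp is None else int(timestamp))

    # Step 2: auth = MD5(appId + timestamp).
    # Assumption: the lowercase hex digest is what gets signed.
    auth = hashlib.md5((app_id + ts).encode("utf-8")).hexdigest()

    # Step 3: signature = Base64(HMAC-SHA1(auth, apiSecret)).
    # Assumption: apiSecret is the key and auth is the message.
    digest = hmac.new(
        api_secret.encode("utf-8"), auth.encode("utf-8"), hashlib.sha1
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")

    # Step 4: headers to attach to the request.
    return {"appId": app_id, "timestamp": ts, "signature": signature}
```

Because the timestamp must be within 5 minutes of server time, the headers are best generated immediately before each request rather than cached.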

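## Response Format

### Image OCR Response

```json
{
  "header": {
    "code": 0,
    "message": "success"
  },
  "payload": {
    "result": {
      "text": "Base64-encoded OCR text..."
    }
  }
}
```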
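Since the Image OCR result arrives as Base64 text under `payload.result.text`, a caller has to decode it before use. The helper below is a small sketch of that step. The name `extract_ocr_text` is hypothetical, and decoding as UTF-8 is an assumption.

```python
import base64


def extract_ocr_text(response):
    """Decode the OCR text from a parsed Image OCR response dict."""
    header = response.get("header", {})
    # A non-zero header code signals an error; see the error code tables.
    if header.get("code") != 0:
        raise RuntimeError(
            f"OCR failed: code={header.get('code')} message={header.get('message')}"
        )
    encoded = response["payload"]["result"]["text"]
    # Assumption: the decoded bytes are UTF-8 text.
    return base64.b64decode(encoded).decode("utf-8")
```

With `--format json,markdown`, the decoded string may itself contain structured output that needs further parsing. That layout is not described here, so the sketch stops at decoding.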

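### PDF OCR Start Response

```json
{
  "flag": true,
  "code": 0,
  "desc": "成功",
  "data": {
    "taskNo": 25082744936879,
    "status": "CREATE",
    "tip": "任务创建成功"
  }
}
```

### PDF OCR Status Response

```json
{
  "flag": true,
  "code": 0,
  "desc": "成功",
  "data": {
    "taskNo": 25082759289333,
    "exportFormat": "word",
    "status": "FINISH",
    "downUrl": "http://bjcdn.openstorage.cn/...",
    "tip": "已完成",
    "pageList": [...]
  }
}
```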
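A PDF task is asynchronous, so a client polls the status endpoint until `status` becomes `FINISH`. The loop below is a sketch of that pattern, using the 5-second minimum interval and the 300-second default wait mentioned for `--poll-interval` and `--max-wait`. The function `wait_for_pdf_task` and its `fetch_status` callback are hypothetical: the callback stands in for whatever code performs the signed status request and returns the parsed JSON.

```python
import time


def wait_for_pdf_task(fetch_status, poll_interval=5, max_wait=300,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the task reports FINISH, then return its `data` object."""
    # Status queries are limited to one per 5 seconds, so clamp the interval.
    interval = max(5, poll_interval)
    deadline = clock() + max_wait

    while True:
        reply = fetch_status()
        # Treat a false flag or a non-zero code as a failed request.
        if not reply.get("flag") or reply.get("code") != 0:
            raise RuntimeError(
                f"Status query failed: code={reply.get('code')} desc={reply.get('desc')}"
            )
        data = reply["data"]
        if data.get("status") == "FINISH":
            return data  # includes downUrl and pageList when present
        # Any other status (for example CREATE or WAITING) means keep waiting.
        if clock() + interval > deadline:
            raise TimeoutError(f"Task not finished after {max_wait} seconds")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays. In normal use they can be left at their defaults.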

## Task Status (PDF OCR)

| Status | Description |
|--------|-------------|
| `CREATE` | Task created successfully |
| `WAITING` | |

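## Error Codes

(｡･ω･｡) Hi~ Ran into an error code? Let's see how to fix it~ ✧⁺⸜(●˙▾˙●)⸝⁺✧

### Platform Common Error Codes

| Code | Description | Hint | Solution |
|------|-------------|------|----------|
| 10009 | input invalid data | (◎_◎;) Oops~ the data format isn't quite right | Check that the input data meets the requirements |
| 10010 | | | |
| | | (。-\`ω´-) The session timed out~ | Check whether all data was sent but the connection was left open |
| 10043 | Syscall AudioCodingDecode error | (◎_◎;) Audio decoding failed... | Check the aue parameter; if it is speex, make sure the audio is speex audio, compressed in segments, and consistent with the frame size |
| 10114 | session timeout | (。-\`ω´-) Reading data timed out~ | Check whether no data was sent for a cumulative 10 s while the connection was left open |
| 10222 | context deadline exceeded | (╯°□°)╯︵ ┻━┻ Something went wrong! | 1. Check whether the uploaded data exceeds the API limit; 2. If the SSL certificate is invalid, submit a ticket |
| 10223 | RemoteLB: can't find valued addr | (◎_◎;) Can't find a service node | Submit a ticket to contact technical staff |
| 10313 | invalid appid | (◎_◎;) The appid and apikey don't match | Check whether the appid is valid |
| 10317 | invalid version | (◎_◎;) Something is wrong with the version number | Submit a ticket in the console to contact technical staff |
| 10700 | not authority | (╯°□°)╯︵ ┻━┻ Insufficient permissions! | Check the error reason against the developer documentation; if it is still unresolved, submit a ticket with the sid and the error message |
| 11200 | auth no license | (╯°□°)╯︵ ┻━┻ Feature not authorized! | Check that the appid is correct, that the relevant service has been added, and whether the call quota is exceeded or the authorization has expired |
| 11201 | auth no enough license | (╯°□°)╯︵ ┻━┻ Daily interaction limit exceeded! | Submit the application for review to raise the quota, or contact sales to purchase the enterprise-grade API |
| 11503 | server error: atmos return error | (。-\`ω´-) Something is wrong with the server configuration | Submit a ticket |
| 100001~100010 | WrapperInitErr | (◎_◎;) The engine call failed! | Look up the errno in the message in the engine error code reference |

### Additional Resources

- (｡･ω･｡) Purchase the service: [General Text Recognition (OCR LLM edition)](https://console.xfyun.cn/services/se75ocrbm)
- (｡･ω･｡) Business inquiries: [Purchase service quota](https://console.xfyun.cn/sale/buy?wareId=9166&packageId=9166001&serviceName=%E9%80%9A%E7%94%A8%E6%96%87%E6%A1%A3%E8%AF%86%E5%88%AB%EF%BC%88OCR%E5%A4%A7%E6%A8%A1%E5%9E%8B%EF%BC%89&businessId=se75ocrbm)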

---

### Original API Error Codes

| Code | Description | Solution |
|------|-------------|----------|
| 10000 | System error | Check auth info, request method, parameters |
| 10001 | Signature authentication failed | Check credentials |
| 10002 | Business processing error | Check error message |
| 10003 | Quota/insufficient balance | Check account balance |

## Limitations

### Image OCR
- **Format**: Common image formats (JPG, PNG, etc.)
- **Size**: Reasonable file sizes for web upload
- **Rate limiting**: Follow API rate limits

### PDF OCR
- **Max pages**: 100 pages per PDF
- **Protected PDFs**: Not supported (password/encrypted)
- **Rate limiting**: Status query limited to once per 5 seconds
- **Time limit**: Timestamp must be within ±5 minutes of server time

## Tips

### Image OCR
1. **High-quality images**: Use clear, high-resolution images for best results
2. **Multiple formats**: Use `json,markdown` to get both structured and formatted output
3. **Save results**: Use the `-o` flag to save OCR results to a file

### PDF OCR
1. **Math formulas**: Use the markdown format for PDFs with mathematical formulas
2. **Large PDFs**: Split into sections if > 100 pages
3. **Polling interval**: Minimum 5 seconds between status queries
4. **Network URLs**: Ensure PDF URLs are publicly accessible
5. **Download URLs**: Download files promptly, as URLs may expire
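For the large-PDF tip, the page ranges for each section can be computed before splitting. The helper below is only a sketch of that arithmetic. The name `page_chunks` is hypothetical, and the actual splitting would need a PDF library of your choice, which is not shown.

```python
def page_chunks(total_pages, max_pages=100):
    """Return 1-based inclusive (first, last) page ranges of at most max_pages."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 250-page document splits into three submissions.
print(page_chunks(250))  # [(1, 100), (101, 200), (201, 250)]
```

Each range can then be written to its own file and submitted as a separate PDF OCR task.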

# ifly-pdf&image-ocr

An AI-driven OCR service built on iFlytek's advanced recognition APIs, supporting image and PDF document recognition.

## Quick Start

### Image OCR (LLM OCR)

```bash
# Recognize an image and extract its text
python3 scripts/image_ocr.py /path/to/image.jpg

# Save the result to a file
python3 scripts/image_ocr.py /path/to/image.jpg -o output.txt

# Specify the output format
python3 scripts/image_ocr.py /path/to/image.jpg --format json
python3 scripts/image_ocr.py /path/to/image.jpg --format markdown
```

### PDF OCR

```bash
# Convert a PDF to Word (default)
python3 scripts/pdf_ocr.py document.pdf

# Convert a PDF to Markdown
python3 scripts/pdf_ocr.py document.pdf --format markdown

# Convert a PDF to JSON
python3 scripts/pdf_ocr.py document.pdf --format json

# Convert from a public URL
python3 scripts/pdf_ocr.py --pdf-url https://example.com/doc.pdf --format word
```

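## Setup

### API Credentials

Get credentials from the iFlytek Open Platform.

For Image OCR:

- **APPID**: Application ID
- **APIKEY**: API key for authentication
- **API_SECRET**: API secret for signing requests

For PDF OCR:

- **APPID**: Application ID
- **APISECRET**: Application secret (for signature generation)

### Environment Variables

```bash
# Required for both Image OCR and PDF OCR
export IFLYAPPID=yourappid

# Required for Image OCR
export IFLYAPIKEY=yourapikey

# Required for PDF OCR
export IFLYAPISECRET=yourapisecret
```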

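## Features

### Image OCR (LLM OCR)

- **AI-powered**: High-accuracy OCR based on an advanced LLM
- **Multi-format output**: JSON, Markdown, or both at once
- **Layout understanding**: Preserves document structure
- **Multi-language**: Supports text extraction in multiple languages
- **Image preprocessing**: Automatic rotation correction and noise removal

### PDF OCR

- **AI-powered OCR**: Advanced AI model for accurate text extraction
- **Multiple output formats**:
  - Word (.docx): Editable Word document
  - Markdown: Plain text with formatting
  - JSON: Structured data
- **Large PDF support**: Up to 100 pages per document
- **Page-by-page results**: Access the individual result for each page
- **Download URLs**: Direct links to the processed files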

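## API Parameters

### Image OCR Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `image_path` | string | Yes | Path to image file |
| `--format` | string | No | Output format: `json`, `markdown`, `json,markdown` (default: `json,markdown`) |
| `--output` | string | No | Save the result to a file |

### PDF OCR Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `pdf_path` | string | Yes | Path to PDF file |
| `--pdf-url` | string | No | Public URL of the PDF file |
| `--format` | string | No | Output format: `word`, `markdown`, `json` (default: `word`) |
| `--no-poll` | flag | No | Return the task ID without polling |
| `--poll-interval` | integer | No | Polling interval in seconds (minimum 5, default: 5) |
| `--max-wait` | integer | No | Maximum wait time in seconds (default: 300) |

*Either `pdf_path` or `--pdf-url` must be provided.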

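## Authentication

### Image OCR (HMAC-SHA256)

Uses HMAC-SHA256 signature authentication:

1. Generate an RFC 1123 format date: `EEE, dd MMM yyyy HH:mm:ss GMT`
2. Create the signature origin: `host: {host}\ndate: {date}\nPOST {path} HTTP/1.1`
3. Calculate the signature: `HMAC-SHA256(signature_origin, apiSecret)`
4. Build the authorization string: `hmac username={apiKey}, algorithm=hmac-sha256, headers=host date request-line, signature={signature}`
5. Encode the authorization string in Base64
6. Send as query parameters: `?authorization={auth}&host={host}&date={date}`

### PDF OCR (MD5 + HMAC-SHA1)

Uses MD5 + HMAC-SHA1 signature authentication:

1. Generate a timestamp (Unix epoch, in seconds)
2. Calculate `auth = MD5(appId + timestamp)`
3. Calculate `signature = Base64(HMAC-SHA1(auth, apiSecret))`
4. Send the headers:
   - `appId`: Application ID
   - `timestamp`: Timestamp in seconds
   - `signature`: Generated signature

**Important**: The timestamp must be within 5 minutes of server time.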

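## Response Format

### Image OCR Response

```json
{
  "header": {
    "code": 0,
    "message": "success"
  },
  "payload": {
    "result": {
      "text": "Base64-encoded OCR text..."
    }
  }
}
```

### PDF OCR Start Response

```json
{
  "flag": true,
  "code": 0,
  "desc": "成功",
  "data": {
    "taskNo": 25082744936879,
    "status": "CREATE",
    "tip": "任务创建成功"
  }
}
```

### PDF OCR Status Response

```json
{
  "flag": true,
  "code": 0,
  "desc": "成功",
  "data": {
    "taskNo": 25082759289333,
    "exportFormat": "word",
    "status": "FINISH",
    "downUrl": "http://bjcdn.openstorage.cn/...",
    "tip": "已完成",
    "pageList": [...]
  }
}
```

## Task Status (PDF OCR)

| Status | Description |
|--------|-------------|
| `CREATE` | Task created successfully |
| `WAITING` | |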

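## Error Codes

(｡･ω･｡) Hi~ Ran into an error code? Let's see how to fix it~ ✧⁺⸜(●˙▾˙●)⸝⁺✧

### Platform Common Error Codes

| Code | Description | Hint | Solution |
|------|-------------|------|----------|
| 10009 | Invalid input data | (◎_◎;) Oops~ the data format isn't quite right | Check that the input data meets the requirements |
| 10010 | | | |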
