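PDF OCR Extractor

Use this skill to extract text from scanned or image-based PDFs that lack a native text layer. The tool is completely free, calls no third-party APIs, and has no usage limits. It renders each PDF page to an image and runs optical character recognition (OCR) on the result.

Dependencies

This skill requires:

1. System binary: tesseract (plus the language data packs you need, such as chi_sim or eng).
2. Python packages: pypdfium2, pytesseract, and Pillow.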

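Note: do not run pip install commands automatically at runtime. Rely on the user or the environment to have pre-installed the dependencies defined in the metadata block.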
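
Since nothing may be installed at runtime, it can help to check up front which dependencies are missing and report them to the user. The following is a minimal sketch of such a check using only the standard library; the helper name find_missing and the message wording are illustrative assumptions, not part of the skill itself.

```python
import importlib.util
import shutil


def find_missing(binaries, modules):
    """Return labels for binaries not on PATH and modules that cannot be located."""
    missing = []
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(f"binary: {name}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        if importlib.util.find_spec(name) is None:
            missing.append(f"module: {name}")
    return missing


if __name__ == "__main__":
    # Pillow is imported under the module name PIL
    problems = find_missing(["tesseract"], ["pypdfium2", "pytesseract", "PIL"])
    if problems:
        print("Missing dependencies, please install them first:")
        for item in problems:
            print("  " + item)
    else:
        print("All dependencies found.")
```

The check only reports what is absent; installing anything is left to the user, in line with the note above.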

Quick Start

Create a Python script (for example, extract.py) in a temporary directory to handle the extraction safely:

python
import os
import sys

import pypdfium2 as pdfium
import pytesseract
from PIL import Image


def extract(pdf_path):
    doc = pdfium.PdfDocument(pdf_path)
    full_text = []
    for i, page in enumerate(doc):
        # Render the page to a high-resolution image
        bitmap = page.render(scale=2)
        tmp_img = f"/tmp/page_{i}.png"
        bitmap.to_pil().save(tmp_img)

        try:
            # Run OCR (assumes the English and Simplified Chinese language packs are installed)
            text = pytesseract.image_to_string(Image.open(tmp_img), lang="chi_sim+eng")
            full_text.append(text)
        finally:
            # Clean up the temporary file even if OCR fails
            os.remove(tmp_img)

    return "\n".join(full_text)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(extract(sys.argv[1]))

Then run the script:
bash
python3 extract.py /path/to/document.pdf
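
The script above writes fixed file names under /tmp/, so two runs at the same time could overwrite each other's page images. One possible variation, sketched here with hypothetical helper names (scratch_dir, page_path), gives each run its own private directory under /tmp/ that is deleted automatically when the work is done.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(base_dir="/tmp"):
    """Yield a uniquely named directory under base_dir and delete it afterwards."""
    with tempfile.TemporaryDirectory(prefix="pdf_ocr_", dir=base_dir) as path:
        yield path


def page_path(directory, index):
    """Build the image path for one page inside the scratch directory."""
    return os.path.join(directory, f"page_{index}.png")


if __name__ == "__main__":
    with scratch_dir() as work:
        target = page_path(work, 0)
        # Stand-in for bitmap.to_pil().save(target) in the extraction script
        with open(target, "wb") as fh:
            fh.write(b"placeholder")
        print("exists during run:", os.path.exists(target))
    print("exists after run:", os.path.exists(work))
```

In the extraction script, the page loop would sit inside the with block, and the per-file os.remove call would no longer be needed because the whole directory is removed on exit.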

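Security & Sandbox Constraints

- Write temporary images only to the /tmp/ directory, and clean them up immediately after extraction.
- Do not attempt to download or install language packs dynamically through shell commands; if a specific language is missing, notify the user.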
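
The second constraint can be sketched as a small check that compares the requested languages with the installed ones and reports the difference instead of fetching anything. The helper names below are hypothetical; the sketch assumes that pytesseract.get_languages is available in the installed pytesseract version and returns the language codes known to the local Tesseract.

```python
def missing_languages(requested, available):
    """Return the codes from a 'chi_sim+eng' style string that are not installed."""
    installed = set(available)
    wanted = [code for code in requested.split("+") if code]
    return [code for code in wanted if code not in installed]


def installed_languages():
    """Ask Tesseract which language data it has (assumes pytesseract is installed)."""
    # Imported lazily so missing_languages can be used without pytesseract
    import pytesseract

    return pytesseract.get_languages(config="")


def missing_language_message(requested, available):
    """Build a user-facing notice, or return None when nothing is missing."""
    missing = missing_languages(requested, available)
    if not missing:
        return None
    return (
        "Missing Tesseract language data: "
        + ", ".join(missing)
        + ". Please install it manually; it will not be downloaded automatically."
    )
```

A caller might run missing_language_message("chi_sim+eng", installed_languages()) before starting OCR and show the message to the user if one is returned.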
