PDF to Word Converter

A skill that extracts text from scanned PDF documents and converts it into reusable Word (.docx) files using the free, local docr OCR engine.

Prerequisites

1. Initialize the OCR engine by downloading the binaries:

   bash scripts/install.sh

2. Install the required Python dependencies:

   pip install -r scripts/requirements.txt

Usage

Run the Python script, passing the input PDF file and the desired output .docx file path. You can also append any additional standard docr arguments (such as an engine preference).

   python scripts/pdf2word.py <input.pdf> <output.docx> [docr arguments...]

Examples

Convert a single file with the default local engine:
   python scripts/pdf2word.py sample.pdf sample_output.docx
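
To convert several files, the single-file invocation can be wrapped in a small loop. The following is a minimal sketch, assuming it is run from the repository root so that scripts/pdf2word.py resolves. The helper names build_commands and convert_all and the directory arguments are hypothetical and not part of the skill.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(input_dir, output_dir, extra_args=()):
    """Return one pdf2word.py command (argv list) per PDF found in input_dir."""
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Output keeps the input's base name, with a .docx extension.
        out = Path(output_dir) / (pdf.stem + ".docx")
        commands.append(
            [sys.executable, "scripts/pdf2word.py", str(pdf), str(out), *extra_args]
        )
    return commands


def convert_all(input_dir, output_dir, extra_args=()):
    """Run the converter once per PDF, stopping at the first failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(input_dir, output_dir, extra_args):
        subprocess.run(cmd, check=True)
```

Any extra docr arguments, for example ("-engine", "gemini"), are appended unchanged to each command, mirroring how the script forwards them.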

Using Other API Engines

By default, the script uses the local RapidOCR engine. The underlying docr tool also supports other engines like the Google Gemini API for potentially higher recognition accuracy on complex layouts.

To use Gemini, first configure your API key:
   mkdir -p ~/.ocr
   echo geminiapikey=YOUR_GEMINI_KEY > ~/.ocr/config

Then pass the -engine gemini argument to the script:
   python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini

If your document has tables, you can force Gemini to output them in Markdown format so the script can parse them into native Word tables:
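   python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini -prompt "Extract all text, and preserve tables in Markdown format using the | symbol."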
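
How the script turns Markdown tables into Word tables is not shown here. As an illustration only, the sketch below parses pipe-delimited rows with the standard library and writes them with python-docx's add_table. The names parse_markdown_table and add_word_table are hypothetical, and the real script may handle tables differently.

```python
def parse_markdown_table(lines):
    """Turn Markdown pipe-table lines into a list of rows (lists of cell strings)."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # not part of a pipe table
        # Drop exactly one leading pipe and, if present, one trailing pipe.
        inner = stripped[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all("-" in cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def add_word_table(doc, rows):
    """Append rows to a python-docx Document (docx.Document()) as a native table."""
    if not rows:
        return None
    n_cols = max(len(row) for row in rows)
    table = doc.add_table(rows=len(rows), cols=n_cols)
    for r, row in enumerate(rows):
        for c, text in enumerate(row):
            table.cell(r, c).text = text
    return table
```

Keeping the parsing separate from the python-docx call makes the parser easy to check on its own, without a Word document.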

How it Works

1. The script calls docr, which uses the specified OCR model (RapidOCR by default) to read text from the scanned PDF.
2. The extracted text is temporarily stored.
3. The python-docx library is used to read the temporary text and construct a formatted Word document.
4. Temporary files are cleaned up automatically.
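
The steps above can be sketched roughly as follows. This is not the actual pdf2word.py. It assumes the OCR command prints the recognized text to standard output, and it takes that command as a ready-made argument list because docr's exact flags are not specified here. The names text_to_paragraphs and convert are hypothetical.

```python
import os
import subprocess
import tempfile


def text_to_paragraphs(text):
    """Split OCR text into paragraphs on blank lines, joining wrapped lines."""
    paragraphs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            current.append(stripped)
        elif current:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def convert(ocr_argv, output_docx):
    """Run the OCR command, stage its text in a temp file, and build a .docx.

    ocr_argv is a caller-supplied argv list (assumption: it writes text to stdout).
    """
    from docx import Document  # python-docx, imported lazily

    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # Steps 1 and 2: run OCR and store the extracted text temporarily.
        result = subprocess.run(ocr_argv, capture_output=True, text=True, check=True)
        with open(tmp_path, "w", encoding="utf-8") as fh:
            fh.write(result.stdout)
        # Step 3: read the temporary text and build the Word document.
        with open(tmp_path, encoding="utf-8") as fh:
            text = fh.read()
        doc = Document()
        for paragraph in text_to_paragraphs(text):
            doc.add_paragraph(paragraph)
        doc.save(output_docx)
    finally:
        # Step 4: remove the temporary file even if a step fails.
        os.remove(tmp_path)
```

The try/finally block is one way to get the automatic cleanup the last step describes; the real script may use a different mechanism.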

PDF to Word Converter

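A skill that extracts text from scanned PDF documents and converts it into reusable Word (.docx) files using the free, local docr OCR engine.

Prerequisites

1. Initialize the OCR engine by downloading the binaries:

   bash scripts/install.sh

2. Install the required Python dependencies:

   pip install -r scripts/requirements.txt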

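Usage

Run the Python script, passing the input PDF file and the desired output .docx file path. You can also append any additional standard docr arguments (such as an engine preference).

   python scripts/pdf2word.py <input.pdf> <output.docx> [docr arguments...]

Examples

Convert a single file with the default local engine:
   python scripts/pdf2word.py sample.pdf sample_output.docx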

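Using Other API Engines

By default, the script uses the local RapidOCR engine. The underlying docr tool also supports other engines, such as the Google Gemini API, which may give higher recognition accuracy on complex layouts.

To use Gemini, first configure your API key:
   mkdir -p ~/.ocr
   echo geminiapikey=YOUR_GEMINI_KEY > ~/.ocr/config

Then pass the -engine gemini argument to the script:
   python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini

If your document contains tables, you can force Gemini to output them in Markdown format so the script can parse them into native Word tables:
   python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini -prompt "Extract all text, and preserve tables in Markdown format using the | symbol."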

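How it Works

1. The script calls docr, which uses the specified OCR model (RapidOCR by default) to read text from the scanned PDF.
2. The extracted text is temporarily stored.
3. The python-docx library is used to read the temporary text and construct a formatted Word document.
4. Temporary files are cleaned up automatically.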
