PDF to Word Converter
🇨🇳 简体中文 / Simplified Chinese
A skill to extract text from scanned PDF documents and convert them into reusable Word (.docx) files using the free, local docr OCR engine.
Prerequisites
- 1. Initialize the OCR engine by downloading the binaries:
bash scripts/install.sh
- 2. Install the required Python dependencies:
pip install -r scripts/requirements.txt
Usage
Run the Python script, passing the input PDF path and the desired output .docx path. You can also append any standard docr arguments (such as the engine to use).
python scripts/pdf2word.py <input.pdf> <output.docx> [docr arguments...]
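The argument layout described above (two positional paths followed by pass-through docr arguments) could be handled along the following lines. This is a minimal sketch, not the actual contents of scripts/pdf2word.py; the function name parse_cli is hypothetical, and it assumes the extra arguments are forwarded to docr unchanged.

```python
import argparse


def parse_cli(argv):
    """Split a command line into input path, output path, and extra docr arguments.

    Hypothetical helper: the real script may parse its arguments differently.
    """
    parser = argparse.ArgumentParser(
        prog="pdf2word.py",
        description="Convert a scanned PDF into a .docx file.",
    )
    parser.add_argument("input_pdf", help="path to the scanned PDF")
    parser.add_argument("output_docx", help="path of the .docx file to create")
    # parse_known_args returns unrecognized tokens instead of raising an error,
    # so they can be handed to docr as-is.
    args, docr_args = parser.parse_known_args(argv)
    return args.input_pdf, args.output_docx, docr_args


if __name__ == "__main__":
    import sys

    print(parse_cli(sys.argv[1:]))
```

Using parse_known_args rather than parse_args means the script does not need to know every docr option in advance.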
Examples
Convert a single file with the default local engine:
python scripts/pdf2word.py sample.pdf sample_output.docx
Using Other API Engines
By default, the script uses the local RapidOCR engine. The underlying docr tool also supports other engines like the Google Gemini API for potentially higher recognition accuracy on complex layouts.
To use Gemini, first configure your API key:
mkdir -p ~/.ocr
echo geminiapikey=YOUR_GEMINI_KEY > ~/.ocr/config
Then pass the -engine gemini argument to the script:
python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini
If your document has tables, you can force Gemini to output them in Markdown format so the script can parse them into native Word tables:
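python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini -prompt "Extract all text and preserve tables in Markdown format using | symbols."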
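To illustrate how Markdown-style tables in OCR output might become native Word tables, here is a minimal sketch. It is not the script's actual parser: the function names split_markdown_blocks and write_blocks_to_docx are hypothetical, and the sketch assumes every table row starts and ends with a | character. The python-docx calls used (Document, add_paragraph, add_table, cell, save) are part of that library.

```python
def split_markdown_blocks(text):
    """Split OCR text into ("paragraph", str) and ("table", rows) blocks.

    Hypothetical helper; assumes table rows look like "| a | b |".
    """
    blocks = []
    table_rows = []

    def flush_table():
        # Close the table that is currently being collected, if any.
        if table_rows:
            blocks.append(("table", list(table_rows)))
            table_rows.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip the header separator row, e.g. "| --- | :---: |".
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            table_rows.append(cells)
        else:
            flush_table()
            if stripped:
                blocks.append(("paragraph", stripped))
    flush_table()
    return blocks


def write_blocks_to_docx(blocks, output_path):
    """Write the parsed blocks to a .docx file with python-docx."""
    from docx import Document  # third-party package: python-docx

    doc = Document()
    for kind, content in blocks:
        if kind == "paragraph":
            doc.add_paragraph(content)
        else:
            n_cols = max(len(row) for row in content)
            table = doc.add_table(rows=len(content), cols=n_cols)
            for r, row in enumerate(content):
                for c, cell_text in enumerate(row):
                    table.cell(r, c).text = cell_text
    doc.save(output_path)
```

A call such as write_blocks_to_docx(split_markdown_blocks(ocr_text), "sample_output.docx") would then produce paragraphs and tables in document order. One limitation of this sketch: a data row made up only of - and : characters is mistaken for a separator row and dropped.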
How it Works
- 1. The script calls docr, which uses the specified OCR model (RapidOCR by default) to read text from the scanned PDF.
- 2. The extracted text is temporarily stored.
- 3. The python-docx library is used to read the temporary text and construct a formatted Word document.
- 4. Temporary files are cleaned up automatically.
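The steps above can be sketched as follows. This is an assumption-laden outline rather than the script's real code: the docr command line for writing text output is not shown in this document, so the OCR step is passed in as a hypothetical run_ocr callable, and the names convert and write_plain_docx are likewise illustrative.

```python
import tempfile
from pathlib import Path


def convert(pdf_path, docx_path, run_ocr, write_docx):
    """Run OCR into a temporary text file, build the .docx, then clean up.

    run_ocr(pdf_path, text_path) and write_docx(text, docx_path) are
    hypothetical callables standing in for the docr call and the
    python-docx step.
    """
    # TemporaryDirectory deletes the directory and its contents on exit,
    # even if one of the steps raises an exception.
    with tempfile.TemporaryDirectory() as tmp_dir:
        text_path = Path(tmp_dir) / "ocr_output.txt"
        run_ocr(pdf_path, text_path)  # step 1: OCR the scanned PDF
        text = text_path.read_text(encoding="utf-8")  # step 2: temporary text
        write_docx(text, docx_path)  # step 3: build the Word document
    # step 4: the temporary file no longer exists at this point


def write_plain_docx(text, docx_path):
    """One possible write_docx: each non-empty line becomes a paragraph."""
    from docx import Document  # third-party package: python-docx

    doc = Document()
    for line in text.splitlines():
        if line.strip():
            doc.add_paragraph(line.strip())
    doc.save(docx_path)
```

Passing the OCR step in as a callable keeps the sketch independent of docr's exact options and makes the cleanup behaviour easy to check with a stand-in function.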
PDF to Word Converter
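A skill to extract text from scanned PDF documents and convert them into reusable Word (.docx) files using the free, local docr OCR engine.
Prerequisites
- 1. Initialize the OCR engine by downloading the binaries: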
bash
bash scripts/install.sh
- 2. Install the required Python dependencies:
bash
pip install -r scripts/requirements.txt
Usage
Run the Python script, passing the input PDF path and the desired output .docx path. You can also append any standard docr arguments (such as the engine to use).
bash
python scripts/pdf2word.py <input.pdf> <output.docx> [docr arguments...]
Examples
Convert a single file with the default local engine:
bash
python scripts/pdf2word.py sample.pdf sample_output.docx
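Using Other API Engines
By default, the script uses the local RapidOCR engine. The underlying docr tool also supports other engines, such as the Google Gemini API, for potentially higher recognition accuracy on complex layouts.
To use Gemini, first configure your API key: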
bash
mkdir -p ~/.ocr
echo geminiapikey=YOUR_GEMINI_KEY > ~/.ocr/config
Then pass the -engine gemini argument to the script:
bash
python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini
If your document has tables, you can force Gemini to output them in Markdown format so the script can parse them into native Word tables:
bash
python scripts/pdf2word.py sample.pdf sample_output.docx -engine gemini -prompt "Extract all text and preserve tables in Markdown format using | symbols."
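How it Works
- 1. The script calls docr, which uses the specified OCR model (RapidOCR by default) to read text from the scanned PDF.
- 2. The extracted text is temporarily stored.
- 3. The python-docx library is used to read the temporary text and construct a formatted Word document.
- 4. Temporary files are cleaned up automatically.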