PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Perfect for document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Features

✅ Text Extraction

- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based PDFs)

✅ OCR Support

- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fall back to text extraction when possible

✅ Batch Processing

- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic

✅ Output Options

- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)

✅ Utility Features

- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)

Installation

```bash
clawhub install pdf-text-extractor
```

Quick Start

Extract Text from PDF

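```javascript
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Word count: ${result.wordCount}`);
```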

Batch Extract Multiple PDFs

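```javascript
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

// extractBatch returns an object; per-file results are in batch.results
console.log(`Extracted ${batch.results.length} PDFs`);
```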

Extract with OCR

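```javascript
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)
```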

Tool Functions

`extractText`

Extract text content from a single PDF file.

Parameters:

- `pdfPath` (string, required): Path to the PDF file
- `options` (object, optional): Extraction options
  - `outputFormat` (string): `'text'` | `'json'` | `'markdown'` | `'html'`
  - `ocr` (boolean): Enable OCR for scanned documents
  - `language` (string): OCR language code (`'eng'`, `'spa'`, `'fra'`, `'deu'`)
  - `preserveFormatting` (boolean): Keep headings and structure
  - `minConfidence` (number): Minimum OCR confidence score (0-100)
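To illustrate what a `minConfidence` threshold could do with OCR output, here is a minimal sketch. It assumes OCR words arrive as `{ text, confidence }` objects on a 0-100 scale; that input shape and the helper name `applyMinConfidence` are hypothetical and not part of the documented skill.

```javascript
// Hypothetical helper: keeps only OCR words at or above minConfidence.
// Assumed input shape: [{ text: string, confidence: number (0-100) }, ...].
function applyMinConfidence(words, minConfidence = 0) {
  const kept = words.filter((word) => word.confidence >= minConfidence);
  const meanConfidence = kept.length
    ? kept.reduce((sum, word) => sum + word.confidence, 0) / kept.length
    : 0;
  return {
    text: kept.map((word) => word.text).join(' '),
    kept: kept.length,
    dropped: words.length - kept.length,
    meanConfidence,
  };
}

const ocrWords = [
  { text: 'Invoice', confidence: 96 },
  { text: '#l2E45', confidence: 41 }, // a misread token with low confidence
  { text: 'Total', confidence: 90 },
];
console.log(applyMinConfidence(ocrWords, 60));
```

A higher threshold trades completeness for cleaner text, so it may be worth logging the `dropped` count when tuning it.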

Returns:

- `text` (string): Extracted text content
- `pages` (number): Number of pages processed
- `wordCount` (number): Total word count
- `charCount` (number): Total character count
- `language` (string): Detected language
- `metadata` (object): PDF metadata (title, author, creation date)
- `method` (string): `'text'` or `'ocr'` (extraction method used)
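As a rough picture of how one result object might map onto the four output formats, the sketch below renders a sample result. The result fields follow the list above; the helper name `formatResult` and the rendering rules are assumptions for illustration, not the skill's actual output.

```javascript
// Hypothetical renderer for the documented output formats.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function formatResult(result, outputFormat = 'text') {
  const title = (result.metadata && result.metadata.title) || 'Untitled';
  switch (outputFormat) {
    case 'text':
      return result.text;
    case 'json':
      return JSON.stringify(result, null, 2);
    case 'markdown':
      return `# ${title}\n\n${result.text}\n`;
    case 'html':
      return `<h1>${escapeHtml(title)}</h1>\n<pre>${escapeHtml(result.text)}</pre>`;
    default:
      throw new Error(`Unsupported outputFormat: ${outputFormat}`);
  }
}

const sample = {
  text: 'Total due: 42 < 50',
  pages: 1,
  wordCount: 5,
  charCount: 18,
  language: 'eng',
  metadata: { title: 'Invoice' },
  method: 'text',
};
console.log(formatResult(sample, 'markdown'));
```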

`extractBatch`

Extract text from multiple PDF files at once.

Parameters:

- `pdfFiles` (array, required): Array of PDF file paths
- `options` (object, optional): Same as for `extractText`

Returns:

- `results` (array): Array of extraction results
- `totalPages` (number): Total pages across all PDFs
- `successCount` (number): Number of successful extractions
- `failureCount` (number): Number of failed extractions
- `errors` (array): Error details for failures
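The batch return shape can be pictured as a fold over per-file outcomes. The sketch below assumes each outcome is either `{ file, result }` or `{ file, error }`; that input shape and the name `summarizeBatch` are hypothetical, chosen only to show how the documented fields relate to each other.

```javascript
// Hypothetical aggregation into the documented batch shape.
// Assumed input: [{ file, result: { pages, ... } } | { file, error }, ...].
function summarizeBatch(outcomes) {
  const summary = {
    results: [],
    totalPages: 0,
    successCount: 0,
    failureCount: 0,
    errors: [],
  };
  for (const outcome of outcomes) {
    if (outcome.error) {
      // A failed file is recorded and skipped; the batch keeps going.
      summary.failureCount += 1;
      summary.errors.push({
        file: outcome.file,
        message: String(outcome.error.message || outcome.error),
      });
    } else {
      summary.successCount += 1;
      summary.totalPages += outcome.result.pages;
      summary.results.push(outcome.result);
    }
  }
  return summary;
}

const summary = summarizeBatch([
  { file: './doc1.pdf', result: { text: 'alpha', pages: 2 } },
  { file: './doc2.pdf', error: new Error('Invalid PDF') },
  { file: './doc3.pdf', result: { text: 'beta', pages: 5 } },
]);
console.log(`${summary.successCount} ok, ${summary.failureCount} failed, ${summary.totalPages} pages`);
```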

`countWords`

Count words in extracted text.

Parameters:

- `text` (string, required): Text to count
- `options` (object, optional):
  - `minWordLength` (number): Minimum characters per word (default: 3)
  - `excludeNumbers` (boolean): Don't count numbers as words
  - `countByPage` (boolean): Return word count per page

Returns:

- `wordCount` (number): Total word count
- `charCount` (number): Total character count
- `pageCounts` (array): Word count per page
- `averageWordsPerPage` (number): Average words per page
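To make the interaction of the three options concrete, here is an illustrative re-implementation. It assumes words are whitespace-separated, that punctuation at word edges is ignored, and that pages are separated by form feed characters (`\f`); none of these rules are confirmed for the real `countWords`, and `countWordsSketch` is a hypothetical name.

```javascript
// Illustrative sketch of the documented countWords options (not the real tool).
function countWordsSketch(text, options = {}) {
  const { minWordLength = 3, excludeNumbers = false, countByPage = false } = options;

  const countIn = (chunk) =>
    chunk
      .split(/\s+/)
      // Strip leading/trailing punctuation so "today." counts as "today".
      .map((token) => token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ''))
      .filter((word) => word.length > 0 && word.length >= minWordLength)
      .filter((word) => !(excludeNumbers && /^[\d.,]+$/.test(word)))
      .length;

  const pages = text.split('\f'); // assumed page separator
  const pageCounts = pages.map(countIn);
  const wordCount = pageCounts.reduce((sum, n) => sum + n, 0);

  const result = { wordCount, charCount: text.length };
  if (countByPage) {
    result.pageCounts = pageCounts;
    result.averageWordsPerPage = wordCount / pages.length;
  }
  return result;
}

const counted = countWordsSketch('Invoice 12345 is due today.\fPay 100 now.', {
  excludeNumbers: true,
  countByPage: true,
});
console.log(counted);
```

With the default `minWordLength` of 3, short words such as "is" are not counted, which is worth keeping in mind when comparing totals against other tools.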

`detectLanguage`

Detect the language of extracted text.

Parameters:

- `text` (string, required): Text to analyze
- `minConfidence` (number): Minimum confidence for detection

Returns:

- `language` (string): Detected language code
- `languageName` (string): Full language name
- `confidence` (number): Confidence score (0-100)
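The detection method itself is not described here. Purely to show how a language code, a name, and a 0-100 confidence could fit together, the sketch below uses a naive stopword-overlap score. The word lists, the scoring rule, the `'und'` fallback code, and the name `detectLanguageSketch` are all assumptions.

```javascript
// Naive stopword-overlap detector, for illustration only.
const STOPWORDS = {
  eng: { name: 'English', words: ['the', 'and', 'of', 'to', 'is', 'in'] },
  spa: { name: 'Spanish', words: ['el', 'la', 'de', 'que', 'y', 'en'] },
  fra: { name: 'French', words: ['le', 'la', 'et', 'les', 'des', 'est'] },
  deu: { name: 'German', words: ['der', 'die', 'und', 'das', 'ist', 'nicht'] },
};

function detectLanguageSketch(text, minConfidence = 0) {
  const undetermined = { language: 'und', languageName: 'Undetermined', confidence: 0 };
  const tokens = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);

  // Count stopword hits per candidate language.
  const scores = Object.entries(STOPWORDS).map(([code, { name, words }]) => ({
    code,
    name,
    hits: tokens.filter((token) => words.includes(token)).length,
  }));
  const totalHits = scores.reduce((sum, score) => sum + score.hits, 0);
  if (totalHits === 0) return undetermined;

  // Confidence is the winner's share of all stopword hits.
  const best = scores.reduce((a, b) => (b.hits > a.hits ? b : a));
  const confidence = Math.round((best.hits / totalHits) * 100);
  if (confidence < minConfidence) return undetermined;
  return { language: best.code, languageName: best.name, confidence };
}

console.log(detectLanguageSketch('The cat is in the garden and the dog is too'));
```

Short or mixed-language text gives this kind of scorer very little to work with, which is one reason a `minConfidence` cutoff is useful.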

Use Cases

Document Digitization

- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents

Content Analysis

- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports

Data Extraction

- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows

Text Processing

- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content

Performance

Text-Based PDFs

- Speed: ~100 ms for a 10-page PDF
- Accuracy: 100% (exact text)
- Memory: ~10 MB for a typical document

OCR Processing

- Speed: ~1-3 s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100 MB peak during OCR
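The figures above translate into a quick back-of-the-envelope estimate: about 10 ms per page for text-based PDFs (100 ms over 10 pages) and 1,000 to 3,000 ms per page for OCR. The helper below, with the hypothetical name `estimateMs`, only multiplies those approximate numbers out; it is not a measured benchmark.

```javascript
// Rough time estimate from the approximate figures in this section.
function estimateMs(pages, method) {
  if (method === 'text') {
    const perPageMs = 10; // ~100 ms / 10 pages
    return { minMs: pages * perPageMs, maxMs: pages * perPageMs };
  }
  if (method === 'ocr') {
    return { minMs: pages * 1000, maxMs: pages * 3000 }; // ~1-3 s per page
  }
  throw new Error(`Unknown method: ${method}`);
}

console.log(estimateMs(40, 'text')); // { minMs: 400, maxMs: 400 }
console.log(estimateMs(40, 'ocr')); // { minMs: 40000, maxMs: 120000 }
```

On these numbers a 40-page scan could take between 40 seconds and 2 minutes, versus well under a second for the same document with an embedded text layer.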

Technical Details

PDF Parsing

- Uses the native PDF.js library
- Extracts the text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs

OCR Engine

- Tesseract.js under the hood
- Supports 100+ languages
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy

Dependencies

- Zero external dependencies to install
- Beyond the bundled libraries, uses Node.js built-in modules only
- PDF.js is included in the skill
- Tesseract.js is bundled

Error Handling

Invalid PDF

- Clear error message
- Suggest a fix (check file format)
- Skip to the next file in a batch

OCR Failure

- Report the confidence score
- Suggest rescanning at higher quality
- Fall back to basic extraction

Memory Issues

- Stream processing for large files
- Progress reporting
- Graceful degradation

Configuration

Edit `config.json`:

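```json
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
```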
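One way to reason about partial edits to `config.json` is to merge them over defaults and check a few invariants. The defaults below mirror the `config.json` values given in this document; the merge and validation rules, and the name `resolveConfig`, are a hypothetical sketch rather than the skill's actual loader.

```javascript
// Defaults copied from this document's config.json example.
const DEFAULT_CONFIG = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium', languages: ['eng', 'spa', 'fra', 'deu'] },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 },
};

// Hypothetical loader step: shallow-merge each section, then validate.
function resolveConfig(overrides = {}) {
  const config = {
    ocr: { ...DEFAULT_CONFIG.ocr, ...overrides.ocr },
    output: { ...DEFAULT_CONFIG.output, ...overrides.output },
    batch: { ...DEFAULT_CONFIG.batch, ...overrides.batch },
  };
  if (!config.ocr.languages.includes(config.ocr.defaultLanguage)) {
    throw new Error(`defaultLanguage "${config.ocr.defaultLanguage}" is not in ocr.languages`);
  }
  if (!['low', 'medium', 'high'].includes(config.ocr.quality)) {
    throw new Error(`Unknown ocr.quality: ${config.ocr.quality}`);
  }
  if (!Number.isInteger(config.batch.maxConcurrent) || config.batch.maxConcurrent < 1) {
    throw new Error('batch.maxConcurrent must be a positive integer');
  }
  return config;
}

const resolved = resolveConfig({ ocr: { quality: 'high' }, batch: { maxConcurrent: 2 } });
console.log(resolved.ocr.quality, resolved.batch.maxConcurrent, resolved.batch.timeoutSeconds);
```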

Examples

Extract from Invoice

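```javascript
const invoice = await extractText('./invoice.pdf');
console.log(invoice.text);
// Invoice #12345 Date: 2026-02-04...
```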

Extract from Scanned Contract

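```javascript
const contract = await extractText('./scanned-contract.pdf', {
  ocr: true,
  language: 'eng',
  ocrQuality: 'high'
});
console.log(contract.text);
// AGREEMENT The parties to this contract...
```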

Batch Process Documents

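```javascript
const docs = await extractBatch([
  './doc1.pdf',
  './doc2.pdf',
  './doc3.pdf',
  './doc4.pdf'
]);
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
```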

Troubleshooting

OCR Not Working

- Check whether the PDF is truly scanned (not text-based)
- Try different quality settings (low/medium/high)
- Ensure the language setting matches the document
- Check the image quality of the scan

Extraction Returns Empty

- The PDF may be image-only
- OCR may have failed with low confidence
- Try a different language setting
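An empty result from an image-only PDF can be caught with a simple density check before retrying with `ocr: true`. The sketch below is a hypothetical heuristic: the name `shouldRetryWithOcr` and the default threshold of 20 characters per page are arbitrary assumptions, and the input is the result shape documented for `extractText`.

```javascript
// Hypothetical heuristic: a near-empty text layer suggests an image-only PDF.
function shouldRetryWithOcr(result, minCharsPerPage = 20) {
  if (result.method === 'ocr') return false; // OCR already ran; retrying won't help
  if (result.pages <= 0) return false;
  const chars = result.text.trim().length;
  return chars / result.pages < minCharsPerPage;
}

console.log(shouldRetryWithOcr({ text: '', pages: 3, method: 'text' })); // true
console.log(shouldRetryWithOcr({ text: 'x'.repeat(500), pages: 2, method: 'text' })); // false
```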

Slow Processing

- Large PDFs take longer
- Reduce quality for speed
- Process in smaller batches
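Splitting a long file list into smaller batches takes only a few lines. The `chunk` helper below is a generic sketch and not part of the skill; each group could then be passed to `extractBatch` in turn.

```javascript
// Generic helper: split an array into groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

const files = ['./a.pdf', './b.pdf', './c.pdf', './d.pdf', './e.pdf'];
console.log(chunk(files, 2));
// [ [ './a.pdf', './b.pdf' ], [ './c.pdf', './d.pdf' ], [ './e.pdf' ] ]
```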

Tips

Best Results

- Use text-based PDFs when possible (faster, 100% accurate)
- Use high-quality scans for OCR (300 DPI+)
- Clean the background before scanning
- Use the correct language setting

Performance Optimization

- Use batch processing for multiple files
- Disable OCR for text-based PDFs
- Lower OCR quality for speed when acceptable

Roadmap

- [ ] PDF/A support
- [ ] Advanced OCR pre-processing
- [ ] Table extraction from OCR
- [ ] Handwriting OCR
- [ ] PDF form field extraction
- [ ] Batch language detection
- [ ] Confidence scoring visualization

License

MIT

Extract text from PDFs. Fast, accurate, zero dependencies. 🔮

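PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Suited to document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

Features

✅ Text Extraction

- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based PDFs)

✅ OCR Support

- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fall back to text extraction when possible

✅ Batch Processing

- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic

✅ Output Options

- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)

✅ Utility Features

- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)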

Installation

```bash
clawhub install pdf-text-extractor
```

Quick Start

Extract Text from PDF

```javascript
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Word count: ${result.wordCount}`);
```

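Batch Extract Multiple PDFs

```javascript
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

// extractBatch returns an object; per-file results are in batch.results
console.log(`Extracted ${batch.results.length} PDFs`);
```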

Extract with OCR

```javascript
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)
```

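Tool Functions

`extractText`

Extract text content from a single PDF file.

Parameters:

- `pdfPath` (string, required): Path to the PDF file
- `options` (object, optional): Extraction options
  - `outputFormat` (string): `'text'` | `'json'` | `'markdown'` | `'html'`
  - `ocr` (boolean): Enable OCR for scanned documents
  - `language` (string): OCR language code (`'eng'`, `'spa'`, `'fra'`, `'deu'`)
  - `preserveFormatting` (boolean): Keep headings and structure
  - `minConfidence` (number): Minimum OCR confidence score (0-100)

Returns:

- `text` (string): Extracted text content
- `pages` (number): Number of pages processed
- `wordCount` (number): Total word count
- `charCount` (number): Total character count
- `language` (string): Detected language
- `metadata` (object): PDF metadata (title, author, creation date)
- `method` (string): `'text'` or `'ocr'` (extraction method used)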

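`extractBatch`

Extract text from multiple PDF files at once.

Parameters:

- `pdfFiles` (array, required): Array of PDF file paths
- `options` (object, optional): Same as for `extractText`

Returns:

- `results` (array): Array of extraction results
- `totalPages` (number): Total pages across all PDFs
- `successCount` (number): Number of successful extractions
- `failureCount` (number): Number of failed extractions
- `errors` (array): Error details for failures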

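`countWords`

Count words in extracted text.

Parameters:

- `text` (string, required): Text to count
- `options` (object, optional):
  - `minWordLength` (number): Minimum characters per word (default: 3)
  - `excludeNumbers` (boolean): Don't count numbers as words
  - `countByPage` (boolean): Return word count per page

Returns:

- `wordCount` (number): Total word count
- `charCount` (number): Total character count
- `pageCounts` (array): Word count per page
- `averageWordsPerPage` (number): Average words per page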

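`detectLanguage`

Detect the language of extracted text.

Parameters:

- `text` (string, required): Text to analyze
- `minConfidence` (number): Minimum confidence for detection

Returns:

- `language` (string): Detected language code
- `languageName` (string): Full language name
- `confidence` (number): Confidence score (0-100)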

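Use Cases

Document Digitization

- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents

Content Analysis

- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports

Data Extraction

- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows

Text Processing

- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content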

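Performance

Text-Based PDFs

- Speed: ~100 ms for a 10-page PDF
- Accuracy: 100% (exact text)
- Memory: ~10 MB for a typical document

OCR Processing

- Speed: ~1-3 s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100 MB peak during OCR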

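Technical Details

PDF Parsing

- Uses the native PDF.js library
- Extracts the text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs

OCR Engine

- Tesseract.js under the hood
- Supports 100+ languages
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy

Dependencies

- Zero external dependencies to install
- Beyond the bundled libraries, uses Node.js built-in modules only
- PDF.js is included in the skill
- Tesseract.js is bundled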

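Error Handling

Invalid PDF

- Clear error message
- Suggest a fix (check file format)
- Skip to the next file in a batch

OCR Failure

- Report the confidence score
- Suggest rescanning at higher quality
- Fall back to basic extraction

Memory Issues

- Stream processing for large files
- Progress reporting
- Graceful degradation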

Configuration

Edit `config.json`:

```json
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
```

Examples

Extract from Invoice

```javascript
const invoice = await extractText('./invoice.pdf');
console.log(invoice.text);
// Invoice #12345 Date: 2026-02-04...
```

Extract from Scanned Contract

```javascript
const contract = await extractText('./scanned-contract.pdf', {
  ocr: true,
  language: 'eng',
  ocrQuality: 'high'
});
console.log(contract.text);
// AGREEMENT The parties to this contract...
```

Batch Process Documents

```javascript
const docs = await extractBatch([
  './doc1.pdf',
  './doc2.pdf',
  './doc3.pdf',
  './doc4.pdf'
]);
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
```
