PDF-Text-Extractor - Extract Text from PDFs
Vernox Utility Skill - Perfect for document digitization.
Overview
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
Features
✅ Text Extraction
- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based)
✅ OCR Support
- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fallback to text extraction when possible
✅ Batch Processing
- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic
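The retry behavior is not specified further in this document. As a minimal sketch, assuming a hypothetical `withRetry` helper (not part of the skill's documented API) and exponential backoff as the retry policy, a failing extraction could be retried like this:

```javascript
// Hypothetical helper, not part of the documented API: retries an async
// task with exponential backoff and rethrows the last error on failure.
async function withRetry(task, { retries = 2, baseDelayMs = 50 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Delay doubles each time: baseDelayMs, 2x, 4x, ...
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Stand-in for a flaky extraction call: fails twice, then succeeds.
let calls = 0;
const flakyExtract = async () => {
  calls += 1;
  if (calls < 3) throw new Error('temporary failure');
  return { text: 'ok', pages: 1 };
};

const retried = withRetry(flakyExtract, { retries: 2, baseDelayMs: 10 });
retried.then((result) => console.log(result.text, 'after', calls, 'calls'));
```

In real use, `flakyExtract` would be replaced by a function that calls `extractText` for one file.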
✅ Output Options
- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)
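To illustrate how one extraction result might map to the four output formats, here is a rough sketch. The `formatResult` and `escapeHtml` helpers are hypothetical, and the Markdown and HTML layouts are assumptions; the skill's actual output layout is not documented here.

```javascript
// Hypothetical formatter; the layouts below are illustrative assumptions.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function formatResult(result, outputFormat = 'text') {
  switch (outputFormat) {
    case 'text':
      return result.text;
    case 'json':
      // JSON output carries the metadata alongside the text.
      return JSON.stringify(result, null, 2);
    case 'markdown':
      return `# ${result.metadata?.title ?? 'Untitled'}\n\n${result.text}`;
    case 'html':
      return `<pre>${escapeHtml(result.text)}</pre>`;
    default:
      throw new RangeError(`Unsupported outputFormat: ${outputFormat}`);
  }
}

const demo = { text: 'a < b', pages: 1, metadata: { title: 'Report' } };
console.log(formatResult(demo, 'markdown'));
console.log(formatResult(demo, 'html'));
```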
✅ Utility Features
- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)
Installation
CODEBLOCK0
Quick Start
Extract Text from PDF
CODEBLOCK1
Batch Extract Multiple PDFs
CODEBLOCK2
Extract with OCR
CODEBLOCK3
Tool Functions
extractText
Extract text content from a single PDF file.
Parameters:
- pdfPath (string, required): Path to PDF file
- options (object, optional): Extraction options
  - outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  - ocr (boolean): Enable OCR for scanned docs
  - language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - preserveFormatting (boolean): Keep headings/structure
  - minConfidence (number): Minimum OCR confidence score (0-100)
Returns:
- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): 'text' or 'ocr' (extraction method)
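The extractText options (outputFormat, ocr, language, preserveFormatting, minConfidence) have fixed value ranges, so it can help to validate them before starting an extraction. The sketch below assumes a hypothetical `normalizeOptions` helper; the default values in it are guesses for illustration, not the skill's documented defaults.

```javascript
// Assumed defaults for illustration only; the skill's real defaults may differ.
const DEFAULTS = {
  outputFormat: 'text',
  ocr: false,
  language: 'eng',
  preserveFormatting: true,
  minConfidence: 60,
};
const FORMATS = ['text', 'json', 'markdown', 'html'];
const LANGUAGES = ['eng', 'spa', 'fra', 'deu'];

// Hypothetical helper: merges caller options over the defaults and rejects
// values outside the ranges given in the parameter list.
function normalizeOptions(options = {}) {
  const merged = { ...DEFAULTS, ...options };
  if (!FORMATS.includes(merged.outputFormat)) {
    throw new RangeError(`Unsupported outputFormat: ${merged.outputFormat}`);
  }
  if (!LANGUAGES.includes(merged.language)) {
    throw new RangeError(`Unsupported language: ${merged.language}`);
  }
  const c = merged.minConfidence;
  if (typeof c !== 'number' || c < 0 || c > 100) {
    throw new RangeError('minConfidence must be a number from 0 to 100');
  }
  return merged;
}

console.log(normalizeOptions({ ocr: true, language: 'deu' }));
```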
extractBatch
Extract text from multiple PDF files at once.
Parameters:
- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText
Returns:
- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Successfully extracted
- failureCount (number): Failed extractions
- errors (array): Error details for failures
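One way to build a batch summary with the fields results, totalPages, successCount, failureCount and errors is to aggregate the output of `Promise.allSettled`. This is a sketch, not the skill's implementation: `summarizeBatch` is a hypothetical helper, and the `{ file, message }` shape of each error entry is an assumption.

```javascript
// Hypothetical helper: turns Promise.allSettled() output into a batch
// summary. `files` and `settled` must be in the same order.
function summarizeBatch(files, settled) {
  const summary = {
    results: [],
    totalPages: 0,
    successCount: 0,
    failureCount: 0,
    errors: [],
  };
  settled.forEach((outcome, i) => {
    if (outcome.status === 'fulfilled') {
      summary.results.push(outcome.value);
      summary.totalPages += outcome.value.pages;
      summary.successCount += 1;
    } else {
      // A failed file is recorded and skipped; the batch continues.
      summary.failureCount += 1;
      const reason = outcome.reason;
      summary.errors.push({ file: files[i], message: String(reason?.message ?? reason) });
    }
  });
  return summary;
}

// Hand-written outcomes in the shape Promise.allSettled produces.
const summary = summarizeBatch(
  ['a.pdf', 'b.pdf', 'c.pdf'],
  [
    { status: 'fulfilled', value: { text: 'A', pages: 3 } },
    { status: 'rejected', reason: new Error('Invalid PDF') },
    { status: 'fulfilled', value: { text: 'C', pages: 5 } },
  ]
);
console.log(summary.successCount, summary.failureCount, summary.totalPages);
// prints: 2 1 8
```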
countWords
Count words in extracted text.
Parameters:
- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page
Returns:
- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page
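To make the counting options concrete, the sketch below shows one plausible reading of minWordLength, excludeNumbers and countByPage. It is not the skill's implementation: `countWordsSketch` is a hypothetical name, and treating a form feed (`\f`) as the page separator and counting whitespace in charCount are both assumptions.

```javascript
// Sketch only: splits on whitespace and treats form feed (\f) as a page
// break, which is an assumption about how pages are delimited.
function countWordsSketch(text, options = {}) {
  const { minWordLength = 3, excludeNumbers = false, countByPage = false } = options;
  const isNumber = (w) => /^\d+([.,]\d+)*$/.test(w);
  const countIn = (chunk) =>
    chunk
      .split(/\s+/)
      .filter((w) => w.length >= minWordLength)
      .filter((w) => !(excludeNumbers && isNumber(w))).length;

  const pages = text.split('\f');
  const pageCounts = pages.map(countIn);
  const wordCount = pageCounts.reduce((sum, n) => sum + n, 0);
  const result = { wordCount, charCount: text.length };
  if (countByPage) {
    result.pageCounts = pageCounts;
    result.averageWordsPerPage = wordCount / pages.length;
  }
  return result;
}

const sample = 'Invoice 12345 is due today\fTotal amount 99.50 euros';
const counted = countWordsSketch(sample, { excludeNumbers: true, countByPage: true });
console.log(counted.wordCount, counted.pageCounts); // 6 [ 3, 3 ]
```

Note that with the default minWordLength of 3, short words such as "is" are not counted.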
detectLanguage
Detect the language of extracted text.
Parameters:
- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection
Returns:
- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)
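This document does not say how detection works. As a toy illustration of the return shape (language, languageName, confidence), the sketch below swaps in a simple stopword-counting heuristic over the four supported languages. The function name, word lists and confidence formula are all assumptions, and a real detector would be considerably more robust.

```javascript
// Tiny stopword lists, chosen for illustration only.
const STOPWORDS = {
  eng: ['the', 'and', 'of', 'to', 'is', 'in'],
  spa: ['el', 'la', 'de', 'que', 'y', 'los'],
  fra: ['le', 'la', 'et', 'les', 'des', 'est'],
  deu: ['der', 'die', 'und', 'das', 'ist', 'nicht'],
};
const NAMES = { eng: 'English', spa: 'Spanish', fra: 'French', deu: 'German' };

// Hypothetical heuristic: the language whose stopwords occur most often wins;
// confidence is that language's share of all stopword hits, scaled to 0-100.
function detectLanguageSketch(text, minConfidence = 0) {
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  const hits = {};
  let totalHits = 0;
  for (const [code, list] of Object.entries(STOPWORDS)) {
    const set = new Set(list);
    hits[code] = words.filter((w) => set.has(w)).length;
    totalHits += hits[code];
  }
  if (totalHits === 0) {
    return { language: null, languageName: null, confidence: 0 };
  }
  const [best, count] = Object.entries(hits).sort((a, b) => b[1] - a[1])[0];
  const confidence = Math.round((count / totalHits) * 100);
  if (confidence < minConfidence) {
    return { language: null, languageName: null, confidence };
  }
  return { language: best, languageName: NAMES[best], confidence };
}

console.log(detectLanguageSketch('Der Hund ist nicht die Katze und das ist gut'));
```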
Use Cases
Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents
Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports
Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows
Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content
Performance
Text-Based PDFs
- Speed: ~100ms for 10-page PDF
- Accuracy: 100% (exact text)
- Memory: ~10MB for typical document
OCR Processing
- Speed: ~1-3s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100MB peak during OCR
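These figures can be turned into a rough planning estimate. The sketch below assumes processing time scales linearly with page count, which the document does not claim; `estimateSeconds` is a hypothetical helper and real timings will vary with hardware and scan quality.

```javascript
// Rough planning numbers taken from the figures above; real timings vary.
const TEXT_MS_PER_PAGE = 100 / 10; // ~100ms for a 10-page PDF
const OCR_SECONDS_PER_PAGE = [1, 3]; // ~1-3s per page at high quality

// Assumes time grows linearly with the number of pages.
function estimateSeconds(pages, useOcr) {
  if (useOcr) {
    const [low, high] = OCR_SECONDS_PER_PAGE;
    return { min: pages * low, max: pages * high };
  }
  const seconds = (pages * TEXT_MS_PER_PAGE) / 1000;
  return { min: seconds, max: seconds };
}

console.log(estimateSeconds(200, true)); // { min: 200, max: 600 }
console.log(estimateSeconds(200, false)); // { min: 2, max: 2 }
```

Under these assumptions, a 200-page scanned document would take roughly 3 to 10 minutes with OCR, versus about 2 seconds if it has a text layer.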
Technical Details
PDF Parsing
- Uses native PDF.js library
- Extracts text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs
OCR Engine
- Tesseract.js under the hood
- Supports 100+ languages
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy
Dependencies
- ZERO external dependencies
- Uses Node.js built-in modules only
- PDF.js included in skill
- Tesseract.js bundled
Error Handling
Invalid PDF
- Clear error message
- Suggest fix (check file format)
- Skip to next file in batch
OCR Failure
- Report confidence score
- Suggest rescan at higher quality
- Fallback to basic extraction
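The OCR failure handling described above (report the confidence, suggest a rescan, fall back to basic extraction) could look roughly like the following. `resolveOcrOutcome` is a hypothetical helper, the `{ text, confidence }` shape of the OCR result is an assumption, and the threshold of 60 is only an illustrative default.

```javascript
// Hypothetical decision helper. `ocrResult` is assumed to look like
// { text, confidence } with confidence on the 0-100 scale.
function resolveOcrOutcome(ocrResult, embeddedText, minConfidence = 60) {
  if (ocrResult.confidence >= minConfidence) {
    return { method: 'ocr', text: ocrResult.text, confidence: ocrResult.confidence };
  }
  // Below the threshold: fall back to the embedded text layer (which may be
  // empty for image-only PDFs) and report why.
  return {
    method: 'text',
    text: embeddedText,
    confidence: ocrResult.confidence,
    warning:
      `OCR confidence ${ocrResult.confidence} is below ${minConfidence}; ` +
      'consider rescanning at higher quality.',
  };
}

const weak = resolveOcrOutcome({ text: 'lnv0ice', confidence: 42 }, '', 60);
console.log(weak.method, weak.warning);
```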
Memory Issues
- Stream processing for large files
- Progress reporting
- Graceful degradation
Configuration
Edit config.json:
CODEBLOCK4
Examples
Extract from Invoice
CODEBLOCK5
Extract from Scanned Contract
CODEBLOCK6
Batch Process Documents
CODEBLOCK7
Troubleshooting
OCR Not Working
- Check if PDF is truly scanned (not text-based)
- Try different quality settings (low/medium/high)
- Ensure language matches document
- Check image quality of scan
Extraction Returns Empty
- PDF may be image-only
- OCR failed with low confidence
- Try different language setting
Slow Processing
- Large PDFs take longer to process
- Reduce quality for speed
- Process in smaller batches
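Processing in smaller batches can be as simple as splitting the file list into fixed-size groups and handing each group to `extractBatch` in turn. The `chunk` helper below is a hypothetical utility, not part of the skill.

```javascript
// Hypothetical helper: splits a list of paths into groups of `size`.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

const files = ['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf', 'f.pdf', 'g.pdf'];
console.log(chunk(files, 3));
// [ [ 'a.pdf', 'b.pdf', 'c.pdf' ], [ 'd.pdf', 'e.pdf', 'f.pdf' ], [ 'g.pdf' ] ]
```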
Tips
Best Results
- Use text-based PDFs when possible (faster, 100% accurate)
- Use high-quality scans for OCR (300 DPI or higher)
- Clean the background before scanning
- Use the correct language setting
Performance Optimization
- Batch processing for multiple files
- Disable OCR for text-based PDFs
- Lower OCR quality for speed when acceptable
Roadmap
- [ ] PDF/A support
- [ ] Advanced OCR pre-processing
- [ ] Table extraction from OCR
- [ ] Handwriting OCR
- [ ] PDF form field extraction
- [ ] Batch language detection
- [ ] Confidence scoring visualization
License
MIT
Extract text from PDFs. Fast, accurate, zero dependencies. 🔮
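PDF-Text-Extractor - Extract Text from PDFs
Vernox Utility Skill - Perfect for document digitization.
Overview
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).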
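Features
✅ Text Extraction
- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based PDFs)
✅ OCR Support
- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fallback to text extraction when possible
✅ Batch Processing
- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic
✅ Output Options
- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)
✅ Utility Features
- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)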
Installation
bash
clawhub install pdf-text-extractor
Quick Start
Extract Text from PDF
javascript
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});
console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
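Batch Extract Multiple PDFs
javascript
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});
// extractBatch returns a summary object; per-file results are in `results`.
console.log(`Extracted ${batch.results.length} PDFs`);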
Extract with OCR
javascript
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
// OCR will be used (scanned document detected)
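Tool Functions
extractText
Extract text content from a single PDF file.
Parameters:
- pdfPath (string, required): Path to PDF file
- options (object, optional): Extraction options
  - outputFormat (string): text | json | markdown | html
  - ocr (boolean): Enable OCR for scanned documents
  - language (string): OCR language code (eng, spa, fra, deu)
  - preserveFormatting (boolean): Keep headings/structure
  - minConfidence (number): Minimum OCR confidence score (0-100)
Returns:
- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): text or ocr (extraction method)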
extractBatch
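Extract text from multiple PDF files at once.
Parameters:
- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText
Returns:
- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Number of successful extractions
- failureCount (number): Number of failed extractions
- errors (array): Error details for failures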
countWords
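Count words in extracted text.
Parameters:
- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page
Returns:
- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page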
detectLanguage
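Detect the language of extracted text.
Parameters:
- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection
Returns:
- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)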
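Use Cases
Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents
Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports
Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows
Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content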
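Performance
Text-Based PDFs
- Speed: ~100ms for 10-page PDF
- Accuracy: 100% (exact text)
- Memory: ~10MB for typical document
OCR Processing
- Speed: ~1-3s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100MB peak during OCR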
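Technical Details
PDF Parsing
- Uses native PDF.js library
- Extracts text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs
OCR Engine
- Tesseract.js under the hood
- Supports 100+ languages
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy
Dependencies
- Zero external dependencies
- Uses Node.js built-in modules only
- PDF.js included in skill
- Tesseract.js bundled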
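Error Handling
Invalid PDF
- Clear error message
- Suggest fix (check file format)
- Skip to next file in batch
OCR Failure
- Report confidence score
- Suggest rescan at higher quality
- Fallback to basic extraction
Memory Issues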
Configuration
Edit config.json:
json
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
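A user's config.json will often set only a few of these keys. As a sketch of how partial settings could be overlaid on the values shown above, assuming a hypothetical `mergeConfig` step (the skill's actual loading logic is not documented here):

```javascript
// Defaults copied from the config.json shown above.
const DEFAULT_CONFIG = {
  ocr: {
    enabled: true,
    defaultLanguage: 'eng',
    quality: 'medium',
    languages: ['eng', 'spa', 'fra', 'deu'],
  },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 },
};

// Hypothetical loader step: overlays user settings section by section so
// that a partial config file keeps the remaining defaults.
function mergeConfig(userConfig = {}) {
  const merged = {};
  for (const section of Object.keys(DEFAULT_CONFIG)) {
    merged[section] = { ...DEFAULT_CONFIG[section], ...(userConfig[section] || {}) };
  }
  return merged;
}

const config = mergeConfig({ ocr: { quality: 'high' }, batch: { maxConcurrent: 1 } });
console.log(config.ocr.quality, config.ocr.defaultLanguage, config.batch.timeoutSeconds);
// prints: high eng 30
```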
Examples
Extract from Invoice
javascript
const invoice = await extractText('./invoice.pdf');
console.log(invoice.text);
// Invoice #12345 Date: 2026-02-04...
Extract from Scanned Contract
javascript
const contract = await extractText('./scanned-contract.pdf', {
  ocr: true,
  language: 'eng',
  ocrQuality: 'high'
});
console.log(contract.text);
// Agreement The parties to this contract...
Batch Process Documents
javascript
const docs = await extractBatch([
  './doc1.pdf',
  './doc2.pdf',
  './doc3.pdf',
  './doc4.pdf'
]);
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
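Troubleshooting
OCR Not Working
- Check if PDF is truly scanned (not text-based)
- Try different quality settings (low/medium/high)
- Ensure language matches document
- Check image quality of scan
Extraction Returns Empty
- PDF may be image-only
- OCR failed with low confidence
- Try different language setting
Slow Processing
- Large PDFs take longer to process
- Reduce quality for speed
- Process in smaller batches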
Tips
Best Results
- Use text-based PDFs when possible (faster, 100% accurate)
- Use high-quality scans for OCR (300 DPI or higher)
- 扫描