Vector Text Fixer
Fixes garbled text in PDF/SVG vector graphics to make them editable in AI tools.
Features
- Garbled Text Detection: Automatically identifies garbled text in PDF/SVG files
- Smart Repair: Infers original text content based on context
- Batch Processing: Supports batch processing of multiple files in a folder
- Format Preservation: Repaired files maintain original vector format and layout
- AI-assisted Editing: Outputs intermediate format that can be imported into AI editors
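As an illustration of the batch-processing feature, the sketch below pairs every PDF or SVG file in an input folder with an output path of the same name. It is not taken from the tool's source: the helper `plan_batch` and the constant `SUPPORTED_SUFFIXES` are hypothetical names, and the real tool may choose output names differently.

```python
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".svg"}  # assumption: only these two formats are picked up


def plan_batch(input_dir, output_dir):
    """Pair each PDF/SVG file in input_dir with a same-named path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [
        (src, output_dir / src.name)
        for src in sorted(input_dir.iterdir())
        if src.is_file() and src.suffix.lower() in SUPPORTED_SUFFIXES
    ]


# Demo on a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.SVG", "notes.txt"):
        (Path(tmp) / name).touch()
    plan = plan_batch(tmp, "out")
    print([dst.name for _, dst in plan])  # ['a.pdf', 'b.SVG']
```

Matching on the lower-cased suffix keeps files such as `b.SVG` in the batch while unrelated files are skipped.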
Supported Scenarios
1. PDF Garbled Text Repair
- Box/question mark issues caused by font embedding problems
- Garbled text caused by encoding conversion errors
- Abnormal characters generated by missing font substitution
- Multi-language mixed encoding issues
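Garbled text from an encoding conversion error often appears when UTF-8 bytes were decoded with a single-byte codec such as Windows-1252 or Latin-1. The sketch below shows one common way to reverse that mistake. It is an illustration only, not the tool's actual repair logic; the function name `undo_mojibake` and the list of candidate codecs are assumptions.

```python
def undo_mojibake(text, wrong_codecs=("cp1252", "latin-1")):
    """Try to reverse UTF-8 text that was mistakenly decoded with a single-byte codec.

    Returns the repaired string, or the input unchanged if no candidate codec applies.
    """
    for codec in wrong_codecs:
        try:
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the garbling
        if repaired != text:
            return repaired
    return text


garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)                 # cafÃ©
print(undo_mojibake(garbled))  # café
```

Text that is already correct fails the round trip or comes back unchanged, so it is returned as-is.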
2. SVG Garbled Text Repair
- Text entity encoding errors
- Special character escaping issues
- Display abnormalities caused by invalid font references
- XML encoding declaration errors
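One frequent form of entity encoding error is double escaping, where `&eacute;` ends up stored as `&amp;eacute;` and is rendered literally. The sketch below decodes such leftovers in `<text>` and `<tspan>` elements using only the standard library. The function name `unescape_svg_text` is hypothetical, and the approach is only safe when the text is known to be doubly escaped, since it would also decode entity-like strings that were meant literally.

```python
import html
import xml.etree.ElementTree as ET

SVG_URI = "http://www.w3.org/2000/svg"
SVG_NS = "{" + SVG_URI + "}"
ET.register_namespace("", SVG_URI)  # serialize without an "ns0:" prefix


def unescape_svg_text(svg_source):
    """Return SVG markup with doubly escaped entities in <text>/<tspan> content decoded."""
    root = ET.fromstring(svg_source)
    for elem in root.iter():
        if elem.tag in (SVG_NS + "text", SVG_NS + "tspan") and elem.text:
            # After XML parsing, "&amp;eacute;" is left as the literal string "&eacute;".
            elem.text = html.unescape(elem.text)
    return ET.tostring(root, encoding="unicode")


svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text x="10" y="20">Caf&amp;eacute; &amp;amp; bar</text>'
    "</svg>"
)
fixed = unescape_svg_text(svg)
print(fixed)
```

The serializer re-escapes the bare ampersand, so the output remains well-formed XML.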
Usage
Command Line
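```bash
# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf
# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg
# Batch-process a folder
python scripts/main.py --batch ./input_folder --output ./output_folder
# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive
# Export to an editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json
```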
Python API
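```python
from scripts.main import VectorTextFixer

# Create a fixer instance
fixer = VectorTextFixer()
# Fix a PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")
# Fix an SVG
result = fixer.fix_svg("input.svg", "output.svg")
# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")
# Get the text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")
```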
Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--input` | str | Yes* | Input file path (PDF or SVG) |
| `--batch` | str | No | Batch processing input folder |
| `--output` | str | Yes* | Output file/folder path |
| `--interactive` | bool | No | Enable interactive repair mode |
| `--export-json` | str | No | Export editable JSON format |
| `--encoding` | str | No | Specify source file encoding (default: auto-detect) |
| `--font-substitution` | dict | No | Font replacement mapping |
| `--repair-level` | str | No | Repair level: minimal, standard, aggressive (default: standard) |
*At least one of --input and --batch is required
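The parameter table and its footnote can be expressed with `argparse` roughly as follows. This is a sketch under stated assumptions rather than the actual contents of `scripts/main.py`: the helper names `build_parser` and `parse_args` are hypothetical, and passing `--font-substitution` as a JSON object string is a guess, since the document only gives its type as `dict`.

```python
import argparse
import json


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", help="Input file path (PDF or SVG)")
    parser.add_argument("--batch", help="Batch processing input folder")
    parser.add_argument("--output", help="Output file/folder path")
    parser.add_argument("--interactive", action="store_true",
                        help="Enable interactive repair mode")
    parser.add_argument("--export-json", help="Export editable JSON format")
    parser.add_argument("--encoding", default=None,
                        help="Source file encoding (None means auto-detect)")
    # Assumption: the font mapping is supplied as a JSON object string.
    parser.add_argument("--font-substitution", type=json.loads, default=None,
                        help="Font replacement mapping")
    parser.add_argument("--repair-level", default="standard",
                        choices=["minimal", "standard", "aggressive"])
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # The footnote rule: at least one of --input and --batch must be given.
    if not (args.input or args.batch):
        parser.error("at least one of --input and --batch is required")
    return args


args = parse_args(["--input", "doc.pdf", "--export-json", "editable.json"])
print(args.repair_level)  # standard
print(args.export_json)   # editable.json
```

`parser.error` prints the usage message and exits with status 2, which is the conventional `argparse` behavior for invalid invocations.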
Output Format
Repaired PDF/SVG
- Maintains original vector format
- Garbled text replaced with readable content
- Fonts and layout remain unchanged
JSON Export Format
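```json
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "�����",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "示例文本"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
```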
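A downstream script or AI editor might consume the PDF export along the following lines. The field names (`pages`, `text_blocks`, `confidence`, `suggested_fix`) come from this document's export example; the helper `blocks_for_review`, the 0.5 threshold, and the reading of a low `confidence` value as "needs human review" are assumptions made for illustration.

```python
import json

# tb_001 mirrors the documented export example; tb_002 is made-up illustration data.
SAMPLE_EXPORT = """
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {"id": "tb_001", "bbox": [100, 200, 300, 220], "original_text": "\\ufffd\\ufffd",
         "detected_encoding": "UTF-8", "confidence": 0.3, "suggested_fix": "示例文本"},
        {"id": "tb_002", "bbox": [100, 240, 300, 260], "original_text": "Figure 1",
         "detected_encoding": "UTF-8", "confidence": 0.9, "suggested_fix": "Figure 1"}
      ]
    }
  ]
}
"""


def blocks_for_review(export, threshold=0.5):
    """Return (page_num, block id, suggested_fix) for blocks below the threshold."""
    return [
        (page["page_num"], block["id"], block["suggested_fix"])
        for page in export["pages"]
        for block in page["text_blocks"]
        if block["confidence"] < threshold
    ]


export = json.loads(SAMPLE_EXPORT)
for page_num, block_id, suggestion in blocks_for_review(export):
    print(page_num, block_id, suggestion)  # 1 tb_001 示例文本
```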
Garbled Text Detection Rules
The tool uses the following rules to detect garbled text:
1. Replacement Character Detection: Identifies U+FFFD (�) and box characters
2. Control Character Filtering: Excludes non-printing control characters
3. Encoding Consistency: Detects anomalies caused by mixed encodings
4. Font Fallback Detection: Identifies substitution characters generated due to missing fonts
5. Probability Model: Estimates the probability that text is garbled from character frequencies
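The first two rules, combined with a very crude frequency score, could be sketched as below. This is not the tool's detector: the names `garbled_ratio` and `looks_garbled`, the chosen box characters, and the 0.3 threshold are all assumptions for illustration.

```python
import unicodedata

# U+FFFD plus two common "box" glyphs (white square, white vertical rectangle).
REPLACEMENT_CHARS = {"\ufffd", "\u25a1", "\u25af"}
ALLOWED_CONTROLS = {"\t", "\n", "\r"}  # ordinary whitespace controls are not suspicious


def garbled_ratio(text):
    """Fraction of characters that are replacement/box glyphs or stray control characters."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch in REPLACEMENT_CHARS:
            suspicious += 1
        elif unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            suspicious += 1
    return suspicious / len(text)


def looks_garbled(text, threshold=0.3):
    """Flag a text block when the suspicious-character ratio reaches the threshold."""
    return garbled_ratio(text) >= threshold


print(garbled_ratio("\ufffd" * 5))  # 1.0
print(looks_garbled("Hello\n"))     # False
```

A production detector would also need the encoding-consistency and font-fallback checks, which depend on file-level information that a plain string does not carry.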
Repair Strategies
Minimal
- Only repairs obvious errors (replacement characters, null bytes)
- Maintains maximum integrity of original text
- Suitable for minor garbled text issues
Standard
- Repairs common encoding issues
- Smart font replacement
- Balances repair rate and accuracy
Aggressive
- Comprehensive text re-encoding
- Uses OCR-assisted recognition
- Suitable for severely garbled documents
Examples
Fix Single Page PDF
Input:
```bash
python scripts/main.py --input report.pdf --output fixed_report.pdf
```
Output:
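```
✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Automatically repaired 4 blocks
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json
```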
Export Editable JSON
Input:
```bash
python scripts/main.py --input diagram.svg --export-json editable.json
```
Output JSON Structure:
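```json
{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "�����",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
```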
Dependencies
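```
pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font handling
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing
```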
Limitations
- Encrypted PDFs require password unlock before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs require OCR recognition first
Version Information
- Version: 1.0.0
- Last Updated: 2026-02-06
- Status: Ready for use
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
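The two path-related checklist items (no `../` traversal, output restricted to the workspace) can be enforced with a small guard such as the one below. It is a sketch, not code from the tool; the function name `resolve_inside` and the idea of a single workspace root are assumptions.

```python
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    # Accept the root itself or any path that has the root among its parents.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


print(resolve_inside("workspace", "out/fixed.pdf").name)  # fixed.pdf

try:
    resolve_inside("workspace", "../secret.txt")
except ValueError as exc:
    print("rejected:", exc)
```

Resolving before comparing means that `..` segments, absolute paths, and symlinks are all judged by where they actually point rather than by how the string looks.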
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
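Vector Text Fixer
Fixes garbled text in PDF/SVG vector graphics to make them editable in AI tools.
Features
- Garbled Text Detection: Automatically identifies garbled text in PDF/SVG files
- Smart Repair: Infers original text content based on context
- Batch Processing: Supports batch processing of multiple files in a folder
- Format Preservation: Repaired files maintain original vector format and layout
- AI-assisted Editing: Outputs intermediate format that can be imported into AI editors
Supported Scenarios
1. PDF Garbled Text Repair
- Box/question mark issues caused by font embedding problems
- Garbled text caused by encoding conversion errors
- Abnormal characters generated by missing font substitution
- Multi-language mixed encoding issues
2. SVG Garbled Text Repair
- Text entity encoding errors
- Special character escaping issues
- Display abnormalities caused by invalid font references
- XML encoding declaration errors
Usage
Command Line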
```bash
# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf
# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg
# Batch-process a folder
python scripts/main.py --batch ./input_folder --output ./output_folder
# Interactive repair (manually specify replacement content)
python scripts/main.py --input doc.pdf --interactive
# Export to an editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json
```
Python API
```python
from scripts.main import VectorTextFixer

# Create a fixer instance
fixer = VectorTextFixer()
# Fix a PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")
# Fix an SVG
result = fixer.fix_svg("input.svg", "output.svg")
# Batch processing
results = fixer.batch_fix("./input_folder", "./output_folder")
# Get the text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")
```
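Input Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `--input` | str | Yes* | Input file path (PDF or SVG) |
| `--batch` | str | No | Batch processing input folder |
| `--output` | str | Yes* | Output file/folder path |
| `--interactive` | bool | No | Enable interactive repair mode |
| `--export-json` | str | No | Export editable JSON format |
| `--encoding` | str | No | Specify source file encoding (default: auto-detect) |
| `--font-substitution` | dict | No | Font replacement mapping |
| `--repair-level` | str | No | Repair level: minimal, standard, aggressive (default: standard) |
*At least one of --input and --batch is required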
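Output Format
Repaired PDF/SVG
- Maintains original vector format
- Garbled text replaced with readable content
- Fonts and layout remain unchanged
JSON Export Format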
```json
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "�����",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "示例文本"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
```
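Garbled Text Detection Rules
The tool uses the following rules to detect garbled text:
1. Replacement Character Detection: Identifies U+FFFD (�) and box characters
2. Control Character Filtering: Excludes non-printing control characters
3. Encoding Consistency: Detects anomalies caused by mixed encodings
4. Font Fallback Detection: Identifies substitution characters generated due to missing fonts
5. Probability Model: Estimates the probability that text is garbled from character frequencies
Repair Strategies
Minimal
- Only repairs obvious errors (replacement characters, null bytes)
- Maintains maximum integrity of original text
- Suitable for minor garbled text issues
Standard
- Repairs common encoding issues
- Smart font replacement
- Balances repair rate and accuracy
Aggressive
- Comprehensive text re-encoding
- Uses OCR-assisted recognition
- Suitable for severely garbled documents
Examples
Fix Single Page PDF
Input: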
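```bash
python scripts/main.py --input report.pdf --output fixed_report.pdf
```
Output:
```
✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Automatically repaired 4 blocks
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json
```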
Export Editable JSON
Input:
```bash
python scripts/main.py --input diagram.svg --export-json editable.json
```
Output JSON Structure:
```json
{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "�����",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
```
Dependencies
```
pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font handling
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing
```
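Limitations
- Encrypted PDFs require password unlock before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs require OCR recognition first
Version Information
- Version: 1.0.0
- Last Updated: 2026-02-06
- Status: Ready for use
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
Prerequisites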
```bash
# Python dependencies
pip install -r requirements.txt
```
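Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support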