Vector Text Fixer

Fixes garbled text in PDF/SVG vector graphics to make them editable in AI tools.

Features

- Garbled Text Detection: Automatically identifies garbled text in PDF/SVG files
- Smart Repair: Infers the original text content from context
- Batch Processing: Processes multiple files in a folder
- Format Preservation: Repaired files keep the original vector format and layout
- AI-assisted Editing: Outputs an intermediate format that can be imported into AI editors

Supported Scenarios

1. PDF Garbled Text Repair

- Box/question mark characters caused by font embedding problems
- Garbled text caused by encoding conversion errors
- Abnormal characters produced when a missing font is substituted
- Mixed encodings in multi-language text
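
Encoding conversion errors of the kind listed above often arise when UTF-8 bytes are decoded with a single-byte codec. The following is a minimal sketch of reversing that, assuming the wrong codec was Windows-1252 or Latin-1. The helper name `undo_mojibake` is hypothetical and is not part of this tool's documented API.

```python
def undo_mojibake(text, wrong_codecs=("cp1252", "latin-1")):
    """Try to reverse UTF-8 text that was mis-decoded with a single-byte codec.

    Assumption: the damage came from one of `wrong_codecs`. If no candidate
    produces a different, valid UTF-8 string, the input is returned unchanged.
    """
    for codec in wrong_codecs:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either the text was not produced by this codec or the bytes
            # are not valid UTF-8, so this candidate does not apply.
            continue
        if repaired != text:
            return repaired
    return text


if __name__ == "__main__":
    garbled = "é".encode("utf-8").decode("cp1252")
    print(repr(garbled), "->", repr(undo_mojibake(garbled)))
```

Text that is already correct (plain ASCII, or properly decoded accented or CJK text) fails one of the two conversion steps or round-trips unchanged, so it is returned as-is.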

2. SVG Garbled Text Repair

- Text entity encoding errors
- Special character escaping issues
- Display problems caused by invalid font references
- XML encoding declaration errors
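
One common form of entity error is text that was escaped more than once, such as `&amp;eacute;` where `é` was intended. The sketch below shows one way to undo that with the standard library. The function name `unescape_fully` and the round limit are assumptions for illustration, not this tool's actual implementation.

```python
import html


def unescape_fully(text, max_rounds=3):
    """Decode character entities repeatedly until the text stops changing.

    `max_rounds` bounds the loop so that unusual input cannot cause
    unbounded work; three rounds is an arbitrary, assumed limit.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


if __name__ == "__main__":
    print(unescape_fully("caf&amp;eacute;"))
    print(unescape_fully("&#x793A;&#x4F8B;"))
```

The result is plain text, so it should be escaped again (for example with `xml.sax.saxutils.escape`) before being written back into an SVG file.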

Usage

Command Line

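```bash
# Fix a single PDF file
python scripts/main.py --input document.pdf --output fixed.pdf

# Fix a single SVG file
python scripts/main.py --input diagram.svg --output fixed.svg

# Batch process a folder
python scripts/main.py --batch ./input_folder --output ./output_folder

# Interactive repair (specify replacement text manually)
python scripts/main.py --input doc.pdf --interactive

# Export to an editable format (JSON)
python scripts/main.py --input doc.pdf --export-json editable.json
```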

Python API

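```python
from scripts.main import VectorTextFixer

# Create a fixer instance
fixer = VectorTextFixer()

# Fix a PDF
result = fixer.fix_pdf("input.pdf", "output.pdf")

# Fix an SVG
result = fixer.fix_svg("input.svg", "output.svg")

# Batch process a folder
results = fixer.batch_fix("./input_folder", "./output_folder")

# Get the text map (for AI editing)
text_map = fixer.extract_text_map("input.pdf")
```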

Input Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `--input` | str | Yes* | Input file path (PDF or SVG) |
| `--batch` | str | No | Batch processing input folder |
| `--output` | str | Yes* | Output file/folder path |
| `--interactive` | bool | No | Enable interactive repair mode |
| `--export-json` | str | No | Export editable JSON format |
| `--encoding` | str | No | Specify source file encoding (default: auto-detect) |
| `--font-substitution` | dict | No | Font replacement mapping |
| `--repair-level` | str | No | Repair level: minimal, standard, aggressive (default: standard) |

*At least one of --input and --batch is required
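
The parameters above could be wired up with `argparse` roughly as follows. This is a sketch rather than the actual contents of `scripts/main.py`. In particular, passing `--font-substitution` as a JSON object is an assumption, since the source only gives its type as `dict`, and the helper names `build_parser` and `parse_args` are hypothetical.

```python
import argparse
import json


def build_parser():
    parser = argparse.ArgumentParser(
        prog="main.py", description="Fix garbled text in PDF/SVG files"
    )
    parser.add_argument("--input", help="Input file path (PDF or SVG)")
    parser.add_argument("--batch", help="Batch processing input folder")
    parser.add_argument("--output", help="Output file/folder path")
    parser.add_argument("--interactive", action="store_true",
                        help="Enable interactive repair mode")
    parser.add_argument("--export-json", help="Export editable JSON format")
    parser.add_argument("--encoding", default=None,
                        help="Source file encoding (auto-detect when omitted)")
    # Assumption: the mapping is given as a JSON object on the command line.
    parser.add_argument("--font-substitution", type=json.loads, default=None,
                        help='Font replacement mapping, e.g. \'{"Old": "New"}\'')
    parser.add_argument("--repair-level", default="standard",
                        choices=["minimal", "standard", "aggressive"])
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce the footnote: at least one of --input and --batch.
    if not (args.input or args.batch):
        parser.error("at least one of --input and --batch is required")
    return args


if __name__ == "__main__":
    print(parse_args())
```

Checking the "at least one of" rule after parsing keeps both flags optional for `argparse` while still rejecting a command line that names neither.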

Output Format

Repaired PDF/SVG

- Keeps the original vector format
- Garbled text is replaced with readable content
- Fonts and layout remain unchanged

JSON Export Format

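```json
{
  "file_type": "pdf",
  "pages": [
    {
      "page_num": 1,
      "text_blocks": [
        {
          "id": "tb_001",
          "bbox": [100, 200, 300, 220],
          "original_text": "��",
          "detected_encoding": "UTF-8",
          "confidence": 0.3,
          "suggested_fix": "Sample text"
        }
      ]
    }
  ],
  "fonts_used": ["Arial", "SimSun"],
  "repair_summary": {
    "total_blocks": 15,
    "fixed_blocks": 12,
    "skipped_blocks": 3
  }
}
```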
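
A downstream script might read this export and list the blocks that still need attention. The sketch below uses the field names from the PDF export (`pages`, `text_blocks`, `id`, `confidence`, `suggested_fix`). Treating a low `confidence` value as "needs review" and the 0.5 threshold are both assumptions, because the source does not define what the score measures. The function names are hypothetical.

```python
import json


def load_export(path):
    """Read an exported JSON file (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def blocks_needing_review(export, threshold=0.5):
    """Return (page_num, block_id, suggested_fix) for low-confidence blocks.

    Assumption: a confidence below `threshold` means the block should be
    checked by a person before the suggested fix is accepted.
    """
    flagged = []
    for page in export.get("pages", []):
        for block in page.get("text_blocks", []):
            if block.get("confidence", 0.0) < threshold:
                flagged.append(
                    (page["page_num"], block["id"], block.get("suggested_fix", ""))
                )
    return flagged


if __name__ == "__main__":
    sample = {
        "file_type": "pdf",
        "pages": [{"page_num": 1, "text_blocks": [
            {"id": "tb_001", "confidence": 0.3, "suggested_fix": "Sample text"},
        ]}],
    }
    for page_num, block_id, fix in blocks_needing_review(sample):
        print(f"page {page_num}, block {block_id}: suggested {fix!r}")
```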

Garbled Text Detection Rules

The tool uses the following rules to detect garbled text:

1. Replacement Character Detection: Identifies U+FFFD (�) and box characters
2. Control Character Filtering: Excludes non-printing control characters
3. Encoding Consistency: Detects anomalies caused by mixed encodings
4. Font Fallback Detection: Identifies substitute characters produced by missing fonts
5. Probability Model: Estimates the probability that text is garbled from character frequencies
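
Rules 1 and 2 can be sketched with the standard library alone, with a simple ratio standing in for the probability model of rule 5. The set of box characters, the treatment of private-use and unassigned code points as suspicious, and the 0.2 threshold are all assumptions made for illustration. Encoding consistency and font fallback detection are not covered here.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"
# Assumed set of "box" glyphs: white square, white vertical rectangle, APL quad.
BOX_CHARS = {"\u25a1", "\u25af", "\u2395"}
# Whitespace controls that are normal in extracted text.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def garbled_score(text):
    """Return the fraction of characters that look like garbling artifacts."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch == REPLACEMENT_CHAR or ch in BOX_CHARS:
            suspicious += 1
        elif ch not in ALLOWED_CONTROLS and unicodedata.category(ch) in ("Cc", "Cn", "Co"):
            # Control, unassigned, or private-use code point.
            suspicious += 1
    return suspicious / len(text)


def is_garbled(text, threshold=0.2):
    """Flag text whose suspicious-character ratio reaches the threshold."""
    return garbled_score(text) >= threshold


if __name__ == "__main__":
    for sample in ["Hello world", "\ufffd\ufffd", "ab\x00d"]:
        print(repr(sample), garbled_score(sample), is_garbled(sample))
```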

Repair Strategies

Minimal

- Repairs only obvious errors (replacement characters, null bytes)
- Preserves as much of the original text as possible
- Suitable for minor garbled text issues

Standard

- Repairs common encoding issues
- Smart font replacement
- Balances repair rate and accuracy

Aggressive

- Comprehensive text re-encoding
- Uses OCR-assisted recognition
- Suitable for severely garbled documents

Examples

Fix Single Page PDF

Input:
```bash
python scripts/main.py --input report.pdf --output fixed_report.pdf
```

Output:
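```
✓ Processing: report.pdf
✓ Detected 5 garbled text blocks
✓ Automatically repaired 4 blocks
⚠ 1 block requires manual review
✓ Output saved: fixed_report.pdf
✓ Report saved: fixed_report_repair_log.json
```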

Export Editable JSON

Input:
```bash
python scripts/main.py --input diagram.svg --export-json editable.json
```

Output JSON Structure:
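```json
{
  "file_type": "svg",
  "svg_info": {
    "width": 800,
    "height": 600,
    "viewBox": "0 0 800 600"
  },
  "text_elements": [
    {
      "id": "text_1",
      "x": 100,
      "y": 200,
      "font_family": "Arial",
      "font_size": 14,
      "original": "��",
      "user_editable": "",
      "confidence": 0.25
    }
  ]
}
```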

Dependencies

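```
pdfplumber>=0.10.0      # PDF parsing
PyMuPDF>=1.23.0         # PDF processing (fitz)
cairosvg>=2.7.0         # SVG conversion
beautifulsoup4>=4.12.0  # SVG parsing
fonttools>=4.40.0       # Font handling
chardet>=5.0.0          # Encoding detection
Pillow>=10.0.0          # Image processing
```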
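
Several of these packages are imported under a name that differs from the name used to install them (PyMuPDF as `fitz`, beautifulsoup4 as `bs4`, fonttools as `fontTools`, Pillow as `PIL`). A small pre-flight check, sketched below, can report what is missing before a run. The function name `missing_packages` is hypothetical, and the check only tests that a module can be found, not that its version meets the minimum.

```python
import importlib.util

# Installable package name -> module name used in import statements.
IMPORT_NAMES = {
    "pdfplumber": "pdfplumber",
    "PyMuPDF": "fitz",
    "cairosvg": "cairosvg",
    "beautifulsoup4": "bs4",
    "fonttools": "fontTools",
    "chardet": "chardet",
    "Pillow": "PIL",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the package names whose module cannot be located."""
    return [
        package
        for package, module in import_names.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```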

Limitations

- Encrypted PDFs must be unlocked with their password before processing
- Severely damaged vector files may not be fully repairable
- Some rare fonts may not map correctly
- Scanned PDFs require OCR first

Version Information

- Version: 1.0.0
- Last Updated: 2026-02-06
- Status: Ready for use

Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |

Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
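
The two path-related items (no `../` traversal, output restricted to the workspace) can be checked by resolving each path and confirming it stays under the workspace root. This is a minimal sketch with a hypothetical helper name, `resolve_inside`. It is not taken from the tool's source.

```python
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve `user_path` against `workspace` and reject anything outside it.

    Raises ValueError when the resolved path escapes the workspace, whether
    through `..` components, an absolute path, or a symlink.
    """
    root = Path(workspace).resolve()
    # Joining with an absolute `user_path` discards `root`, so absolute
    # paths are also caught by the containment check below.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


if __name__ == "__main__":
    print(resolve_inside(".", "output/fixed.pdf"))
    try:
        resolve_inside(".", "../outside.pdf")
    except ValueError as error:
        print("rejected:", error)
```

Comparing resolved paths rather than searching the string for `..` also catches absolute paths and symlinks that point outside the workspace.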

Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
