Image Duplication Detector
ID: 195
Description
Uses computer vision (CV) algorithms to scan every image in a paper manuscript and flag potential duplication or local tampering (Photoshop traces).
Usage
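```bash
# Scan a single PDF file
python scripts/main.py --input paper.pdf --output report.json

# Scan a folder of images
python scripts/main.py --input ./images/ --output report.json

# Set the similarity threshold (default 0.85)
python scripts/main.py --input paper.pdf --threshold 0.90 --output report.json

# Enable tampering detection
python scripts/main.py --input paper.pdf --detect-tampering --output report.json

# Generate a visualization report
python scripts/main.py --input paper.pdf --visualize --output report.json
```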
Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input` | string | - | Yes | Input PDF file or image folder path |
| `--output` | string | report.json | No | Output report path |
| `--threshold` | float | 0.85 | No | Similarity threshold (0-1); higher is stricter |
| `--detect-tampering` | flag | false | No | Enable tampering/Photoshop trace detection |
| `--visualize` | flag | false | No | Generate visualization comparison images |
| `--temp-dir` | string | ./temp | No | Temporary file directory |
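
The parameter table maps naturally onto an `argparse` parser. The following is a minimal sketch of how such a command line could be declared. The parsing code inside `scripts/main.py` is not shown in this document, so the `build_parser` and `unit_interval` helpers, and the range check on `--threshold`, are assumptions made for illustration.

```python
import argparse


def unit_interval(value: str) -> float:
    """Parse a float and reject values outside the 0-1 range."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is not between 0 and 1")
    return number


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper that mirrors the parameter table, not the real script.
    parser = argparse.ArgumentParser(description="Image Duplication Detector")
    parser.add_argument("--input", required=True,
                        help="Input PDF file or image folder path")
    parser.add_argument("--output", default="report.json",
                        help="Output report path")
    parser.add_argument("--threshold", type=unit_interval, default=0.85,
                        help="Similarity threshold (0-1), higher is stricter")
    parser.add_argument("--detect-tampering", action="store_true",
                        help="Enable tampering detection")
    parser.add_argument("--visualize", action="store_true",
                        help="Generate visualization comparison images")
    parser.add_argument("--temp-dir", default="./temp",
                        help="Temporary file directory")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--input", "paper.pdf", "--threshold", "0.90"])
print(args.input, args.threshold, args.output, args.detect_tampering)
```

Validating the threshold at parse time means an out-of-range value fails before any image is loaded.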
Output Format
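```json
{
  "summary": {
    "total_images": 12,
    "duplicates_found": 2,
    "tampering_detected": 1,
    "processing_time": "3.5s"
  },
  "duplicates": [
    {
      "group_id": 1,
      "similarity": 0.98,
      "images": [
        {"page": 2, "index": 1, "path": "..."},
        {"page": 5, "index": 3, "path": "..."}
      ]
    }
  ],
  "tampering": [
    {
      "image": "page3img_2.png",
      "suspicious_regions": [
        {"x": 120, "y": 80, "width": 50, "height": 50, "confidence": 0.92}
      ]
    }
  ]
}
```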
Requirements
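```
opencv-python>=4.8.0
numpy>=1.24.0
Pillow>=10.0.0
PyPDF2>=3.0.0
pdf2image>=1.16.0
imagehash>=4.3.0
scikit-image>=0.21.0
matplotlib>=3.7.0
```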
Algorithm Details
Duplication Detection
- Perceptual Hashing: Combines pHash, dHash, and aHash to detect visually similar images
- Feature Matching: Uses ORB feature point matching to verify similarity
- SSIM: Uses the structural similarity index as auxiliary verification
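
To make the perceptual-hashing step concrete, here is a minimal pure-Python sketch of a difference hash (dHash) and a similarity score derived from the Hamming distance. The tool itself presumably relies on the `imagehash` package listed in its requirements. The `shrink`, `dhash`, and `hash_similarity` helpers below are hypothetical, and the way the real script converts hash distance into a 0-1 similarity is not documented, so that formula is an assumption.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def shrink(pixels: Gray, width: int, height: int) -> List[List[float]]:
    """Downscale a grayscale image by averaging the pixels in each cell."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for row in range(height):
        y0 = row * src_h // height
        y1 = max((row + 1) * src_h // height, y0 + 1)
        line = []
        for col in range(width):
            x0 = col * src_w // width
            x1 = max((col + 1) * src_w // width, x0 + 1)
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            line.append(sum(cell) / len(cell))
        out.append(line)
    return out


def dhash(pixels: Gray, size: int = 8) -> int:
    """Difference hash: one bit per pair of horizontally adjacent cells."""
    small = shrink(pixels, size + 1, size)
    bits = 0
    for line in small:
        for left, right in zip(line, line[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def hash_similarity(a: int, b: int, size: int = 8) -> float:
    """Assumed mapping: 1.0 for identical hashes, 0.0 when every bit differs."""
    return 1.0 - bin(a ^ b).count("1") / (size * size)


# A horizontal gradient, a brightened copy, and a mirrored copy.
image = [[x + y for x in range(64)] for y in range(48)]
brighter = [[value + 20 for value in line] for line in image]
mirrored = [line[::-1] for line in image]

print(hash_similarity(dhash(image), dhash(brighter)))  # 1.0
print(hash_similarity(dhash(image), dhash(mirrored)))  # 0.0
```

Because dHash only records whether brightness rises or falls between neighbouring cells, a uniform brightness change leaves the hash untouched, while mirroring the gradient flips every bit.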
Tampering Detection
- ELA (Error Level Analysis): Detects inconsistent JPEG compression levels within an image
- Noise Analysis: Flags anomalies in the noise pattern
- Copy-Move Detection: Finds regions that were copied and pasted within the same image
- Lighting Inconsistency: Checks whether lighting is consistent across the image
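
As an illustration of the copy-move idea, the sketch below slides a small window over a grayscale image and reports pairs of identical blocks that lie far enough apart. This is a deliberate simplification: practical detectors match robust block features so that recompressed or slightly altered copies are still found. The method used by this tool is not described here, and `find_copied_blocks` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Gray = Sequence[Sequence[int]]
Point = Tuple[int, int]  # (x, y) of a block's top-left corner


def find_copied_blocks(pixels: Gray, block: int = 4,
                       min_shift: int = 4) -> List[Tuple[Point, Point]]:
    """Return pairs of identical, non-uniform blocks at least min_shift apart."""
    height, width = len(pixels), len(pixels[0])
    seen: Dict[tuple, List[Point]] = defaultdict(list)
    matches = []
    for y in range(height - block + 1):
        for x in range(width - block + 1):
            key = tuple(tuple(pixels[y + dy][x:x + block]) for dy in range(block))
            # Uniform blocks (blank background) match everywhere, so skip them.
            if len({value for line in key for value in line}) == 1:
                continue
            for px, py in seen[key]:
                if abs(px - x) >= min_shift or abs(py - y) >= min_shift:
                    matches.append(((px, py), (x, y)))
            seen[key].append((x, y))
    return matches


# Every pixel in the clean image is unique, so it contains no repeated blocks.
clean = [[x + 16 * y for x in range(16)] for y in range(16)]

# Forge a copy: paste the 4x4 patch at (1, 1) onto position (10, 9).
forged = [line[:] for line in clean]
for dy in range(4):
    for dx in range(4):
        forged[9 + dy][10 + dx] = clean[1 + dy][1 + dx]

print(find_copied_blocks(clean))   # []
print(find_copied_blocks(forged))  # [((1, 1), (10, 9))]
```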
Example
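```python
from scripts.main import ImageDuplicationDetector

detector = ImageDuplicationDetector(
    threshold=0.85,
    detect_tampering=True
)

results = detector.scan("paper.pdf")
detector.save_report(results, "report.json")
```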
Notes
- Supports PDF, PNG, JPG, and TIFF formats
- Batch processing is recommended for large files
- Tampering detection may produce false positives; manual review is recommended
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
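
The path-related items in this checklist (no `../` traversal, output restricted to the workspace) can be enforced with a small check. The sketch below is one possible approach, not the tool's actual implementation; `resolve_inside` and the `workspace` directory name are hypothetical.

```python
from pathlib import Path


def resolve_inside(workspace: Path, user_path: str) -> Path:
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = workspace.resolve()
    candidate = (root / user_path).resolve()
    # An absolute user_path replaces root entirely, so it is rejected here too.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


workspace = Path("workspace")
print(resolve_inside(workspace, "out/report.json"))

try:
    resolve_inside(workspace, "../secrets.txt")
except ValueError as error:
    print(error)  # path escapes workspace: ../secrets.txt
```

Resolving before comparing matters: a plain string check for `..` would miss absolute paths and symlinks that point outside the workspace.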
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
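Image Duplication Detector
ID: 195
Description
Uses computer vision (CV) algorithms to scan every image in a paper manuscript and flag potential duplication or local tampering (Photoshop traces).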
Usage
```bash
# Scan a single PDF file
python scripts/main.py --input paper.pdf --output report.json

# Scan a folder of images
python scripts/main.py --input ./images/ --output report.json

# Set the similarity threshold (default 0.85)
python scripts/main.py --input paper.pdf --threshold 0.90 --output report.json

# Enable tampering detection
python scripts/main.py --input paper.pdf --detect-tampering --output report.json

# Generate a visualization report
python scripts/main.py --input paper.pdf --visualize --output report.json
```
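Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input` | string | - | Yes | Input PDF file or image folder path |
| `--output` | string | report.json | No | Output report path |
| `--threshold` | float | 0.85 | No | Similarity threshold (0-1); higher is stricter |
| `--detect-tampering` | flag | false | No | Enable tampering/Photoshop trace detection |
| `--visualize` | flag | false | No | Generate visualization comparison images |
| `--temp-dir` | string | ./temp | No | Temporary file directory |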
Output Format
```json
{
  "summary": {
    "total_images": 12,
    "duplicates_found": 2,
    "tampering_detected": 1,
    "processing_time": "3.5s"
  },
  "duplicates": [
    {
      "group_id": 1,
      "similarity": 0.98,
      "images": [
        {"page": 2, "index": 1, "path": "..."},
        {"page": 5, "index": 3, "path": "..."}
      ]
    }
  ],
  "tampering": [
    {
      "image": "page3img_2.png",
      "suspicious_regions": [
        {"x": 120, "y": 80, "width": 50, "height": 50, "confidence": 0.92}
      ]
    }
  ]
}
```
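
A report in this shape is straightforward to post-process. The sketch below turns one into a line per finding, assuming the keys are ordinary quoted JSON strings as in the sample. The `summarize` helper and its `min_confidence` cutoff are hypothetical and not part of the tool.

```python
import json

# Sample report in the documented shape (the paths are elided in the source too).
SAMPLE = """
{
  "summary": {"total_images": 12, "duplicates_found": 2,
              "tampering_detected": 1, "processing_time": "3.5s"},
  "duplicates": [
    {"group_id": 1, "similarity": 0.98,
     "images": [{"page": 2, "index": 1, "path": "..."},
                {"page": 5, "index": 3, "path": "..."}]}
  ],
  "tampering": [
    {"image": "page3img_2.png",
     "suspicious_regions": [
       {"x": 120, "y": 80, "width": 50, "height": 50, "confidence": 0.92}]}
  ]
}
"""


def summarize(report: dict, min_confidence: float = 0.5) -> list:
    """Turn a report dict into one human-readable line per finding."""
    lines = []
    for group in report.get("duplicates", []):
        where = ", ".join(f"page {img['page']} image {img['index']}"
                          for img in group["images"])
        lines.append(f"group {group['group_id']} "
                     f"(similarity {group['similarity']:.2f}): {where}")
    for item in report.get("tampering", []):
        for region in item["suspicious_regions"]:
            # Skip low-confidence regions; the cutoff is an illustrative choice.
            if region["confidence"] >= min_confidence:
                lines.append(f"{item['image']}: "
                             f"{region['width']}x{region['height']} "
                             f"region at ({region['x']}, {region['y']}), "
                             f"confidence {region['confidence']:.2f}")
    return lines


report = json.loads(SAMPLE)
for line in summarize(report):
    print(line)
```

In practice the dict would come from `json.load` on the file passed to `--output` rather than from an inline string.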
Requirements
```
opencv-python>=4.8.0
numpy>=1.24.0
Pillow>=10.0.0
PyPDF2>=3.0.0
pdf2image>=1.16.0
imagehash>=4.3.0
scikit-image>=0.21.0
matplotlib>=3.7.0
```
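Algorithm Details
Duplication Detection
- Perceptual Hashing: Combines pHash, dHash, and aHash to detect visually similar images
- Feature Matching: Uses ORB feature point matching to verify similarity
- SSIM: Uses the structural similarity index as auxiliary verification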
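
For the SSIM check, the sketch below evaluates the standard SSIM formula once over the whole image. Library implementations such as scikit-image's `structural_similarity` compute it over local windows and average the results, so this single-window version only illustrates the formula. `global_ssim` is a hypothetical helper, and how the tool combines SSIM with the hash and ORB scores is not documented.

```python
from typing import Sequence

Gray = Sequence[Sequence[float]]


def global_ssim(a: Gray, b: Gray, dynamic_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized grayscale images."""
    xs = [float(v) for line in a for v in line]
    ys = [float(v) for line in b for v in line]
    if len(xs) != len(ys):
        raise ValueError("images must have the same number of pixels")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) * (x - mean_x) for x in xs) / n
    var_y = sum((y - mean_y) * (y - mean_y) for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Standard stabilizing constants: C1 = (0.01 L)^2 and C2 = (0.03 L)^2.
    c1 = (0.01 * dynamic_range) * (0.01 * dynamic_range)
    c2 = (0.03 * dynamic_range) * (0.03 * dynamic_range)
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


image = [[x + 16 * y for x in range(16)] for y in range(16)]
inverted = [[255 - value for value in line] for line in image]

print(global_ssim(image, image))     # identical images score 1.0
print(global_ssim(image, inverted))  # inverted structure scores below zero
```

The covariance term is what distinguishes SSIM from a plain brightness comparison: the inverted image has the same mean and variance as the original, yet scores negative because its structure is anti-correlated.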
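Tampering Detection
- ELA (Error Level Analysis): Detects inconsistent JPEG compression levels within an image
- Noise Analysis: Flags anomalies in the noise pattern
- Copy-Move Detection: Finds regions that were copied and pasted within the same image
- Lighting Inconsistency: Checks whether lighting is consistent across the image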
Example
```python
from scripts.main import ImageDuplicationDetector

detector = ImageDuplicationDetector(
    threshold=0.85,
    detect_tampering=True
)

results = detector.scan("paper.pdf")
detector.save_report(results, "report.json")
```
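Notes
- Supports PDF, PNG, JPG, and TIFF formats
- Batch processing is recommended for large files
- Tampering detection may produce false positives; manual review is recommended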
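Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |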
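Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited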
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
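Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time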
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support