Image Duplication Detector

ID: 195

Description

Uses computer vision (CV) algorithms to scan every image in a paper manuscript and flag potential duplication or local tampering (traces of Photoshop-style editing).

Usage

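```bash
# Scan a single PDF file
python scripts/main.py --input paper.pdf --output report.json

# Scan a folder of images
python scripts/main.py --input ./images/ --output report.json

# Set the similarity threshold (default 0.85)
python scripts/main.py --input paper.pdf --threshold 0.90 --output report.json

# Enable tampering detection
python scripts/main.py --input paper.pdf --detect-tampering --output report.json

# Generate a visual report
python scripts/main.py --input paper.pdf --visualize --output report.json
```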

Parameters

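| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input` | string | - | Yes | Input PDF file or image folder path |
| `--output` | string | report.json | No | Output report path |
| `--threshold` | float | 0.85 | No | Similarity threshold (0-1); higher is stricter |
| `--detect-tampering` | flag | false | No | Enable tampering/PS trace detection |
| `--visualize` | flag | false | No | Generate visual comparison images |
| `--temp-dir` | string | ./temp | No | Temporary file directory |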

Output Format

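```json
{
  "summary": {
    "total_images": 12,
    "duplicates_found": 2,
    "tampering_detected": 1,
    "processing_time": "3.5s"
  },
  "duplicates": [
    {
      "group_id": 1,
      "similarity": 0.98,
      "images": [
        {"page": 2, "index": 1, "path": "..."},
        {"page": 5, "index": 3, "path": "..."}
      ]
    }
  ],
  "tampering": [
    {
      "image": "page3img_2.png",
      "suspicious_regions": [
        {"x": 120, "y": 80, "width": 50, "height": 50, "confidence": 0.92}
      ]
    }
  ]
}
```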

Requirements

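```
opencv-python>=4.8.0
numpy>=1.24.0
Pillow>=10.0.0
PyPDF2>=3.0.0
pdf2image>=1.16.0
imagehash>=4.3.0
scikit-image>=0.21.0
matplotlib>=3.7.0
```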

Algorithm Details

Duplication Detection

- Perceptual Hashing: Uses a combination of pHash, dHash, and aHash to detect visually similar images
- Feature Matching: ORB feature point matching to verify similarity
- SSIM: Structural similarity index as auxiliary verification
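
The perceptual-hash step can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/main.py`: it assumes each image has already been reduced to a small grayscale grid (real implementations typically downscale to roughly 8x8 first), and the function names and the equal-weight averaging of the two hashes are hypothetical choices.

```python
def dhash_bits(gray):
    """Difference hash: 1 where a pixel is brighter than its right-hand neighbour."""
    return [int(left > right) for row in gray for left, right in zip(row, row[1:])]


def ahash_bits(gray):
    """Average hash: 1 where a pixel is brighter than the mean of the grid."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [int(p > mean) for p in pixels]


def hash_similarity(bits_a, bits_b):
    """Fraction of matching bits, i.e. 1 minus the normalised Hamming distance."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    differing = sum(a != b for a, b in zip(bits_a, bits_b))
    return 1.0 - differing / len(bits_a)


def combined_similarity(gray_a, gray_b):
    """Assumed combination rule: plain average of the dHash and aHash scores."""
    d_score = hash_similarity(dhash_bits(gray_a), dhash_bits(gray_b))
    a_score = hash_similarity(ahash_bits(gray_a), ahash_bits(gray_b))
    return (d_score + a_score) / 2


# Tiny hand-made grayscale grids standing in for downscaled images.
original = [[10, 20, 30, 40], [40, 30, 20, 10], [5, 50, 5, 50]]
brighter = [[p + 15 for p in row] for row in original]  # uniform brightness shift
mirrored = [row[::-1] for row in original]              # horizontal flip

print(combined_similarity(original, brighter))
print(combined_similarity(original, mirrored))
```

In this toy data the brightness-shifted grid scores 1.0 because both hashes depend only on relative intensities, while the mirrored grid scores 0.0. A score like this would be compared against the `--threshold` value; mirrored or rotated copies are one reason a hash-only check is usually backed by feature matching.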

Tampering Detection

- ELA (Error Level Analysis): Detects inconsistencies in JPEG compression level
- Noise Analysis: Detects anomalies in noise patterns
- Copy-Move Detection: Detects regions copied and pasted within the same image
- Lighting Inconsistency: Checks whether lighting is consistent across the image
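
Copy-move detection, one of the techniques listed above, can be illustrated without any image library. The sketch below is a deliberately simplified assumption of how such a check might work: it slides a fixed-size window over a grayscale grid and groups windows whose pixel content is identical. The function name and block size are hypothetical, and practical detectors match blocks approximately (for example on quantized DCT coefficients or keypoint descriptors) so that matches survive recompression.

```python
from collections import defaultdict


def find_copied_blocks(gray, block=2):
    """Group top-left (x, y) positions of windows whose pixels match exactly."""
    height, width = len(gray), len(gray[0])
    positions = defaultdict(list)
    for y in range(height - block + 1):
        for x in range(width - block + 1):
            # Use the window's pixel values as a hashable fingerprint.
            key = tuple(tuple(gray[y + dy][x:x + block]) for dy in range(block))
            positions[key].append((x, y))
    # Keep only fingerprints that occur at more than one position.
    return [coords for coords in positions.values() if len(coords) > 1]


# 4x6 grid of unique values, then paste the top-left 2x2 patch at (x=4, y=2).
grid = [[y * 10 + x for x in range(6)] for y in range(4)]
for dy in range(2):
    for dx in range(2):
        grid[2 + dy][4 + dx] = grid[dy][dx]

print(find_copied_blocks(grid))  # [[(0, 0), (4, 2)]]
```

Exact matching also fires on flat regions such as a white background, which is one source of the false positives mentioned in the notes.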

Example

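```python
from scripts.main import ImageDuplicationDetector

detector = ImageDuplicationDetector(
    threshold=0.85,
    detect_tampering=True
)

results = detector.scan("paper.pdf")
detector.save_report(results, "report.json")
```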

Notes

- Supports PDF, PNG, JPG, and TIFF formats
- Batch processing is recommended for large files
- Tampering detection may produce false positives; manual review is recommended

Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |

Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
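
The two path-related items in the checklist can be enforced with a small guard. The following is a sketch under assumptions, not code taken from `scripts/main.py`: `resolve_inside` is a hypothetical helper, and the temporary directory merely stands in for whatever workspace the tool is confined to.

```python
import tempfile
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve user_path against workspace and reject anything that escapes it."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()  # collapses ".." and follows symlinks
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"path escapes workspace: {user_path!r}") from None
    return candidate


workspace = tempfile.mkdtemp()  # stand-in for the real workspace directory
print(resolve_inside(workspace, "images/fig1.png"))

try:
    resolve_inside(workspace, "../outside.pdf")
except ValueError as exc:
    print("rejected:", exc)
```

Resolving before comparing matters: a purely textual check for `../` would miss absolute paths and symlinks that point outside the workspace.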

Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support

Image Duplication Detector

ID: 195

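Description

Uses computer vision (CV) algorithms to scan every image in a paper manuscript and flag potential duplication or local tampering (traces of Photoshop-style editing).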

Usage

```bash
# Scan a single PDF file
python scripts/main.py --input paper.pdf --output report.json

# Scan a folder of images
python scripts/main.py --input ./images/ --output report.json

# Set the similarity threshold (default 0.85)
python scripts/main.py --input paper.pdf --threshold 0.90 --output report.json

# Enable tampering detection
python scripts/main.py --input paper.pdf --detect-tampering --output report.json

# Generate a visual report
python scripts/main.py --input paper.pdf --visualize --output report.json
```

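Parameters

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input` | string | - | Yes | Input PDF file or image folder path |
| `--output` | string | report.json | No | Output report path |
| `--threshold` | float | 0.85 | No | Similarity threshold (0-1); higher is stricter |
| `--detect-tampering` | flag | false | No | Enable tampering/PS trace detection |
| `--visualize` | flag | false | No | Generate visual comparison images |
| `--temp-dir` | string | ./temp | No | Temporary file directory |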
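
The flags in the table map naturally onto `argparse`. The parser inside `scripts/main.py` is not shown in this document, so the sketch below is only an assumption about how those flags could be declared; `unit_interval` and `build_parser` are hypothetical helper names.

```python
import argparse


def unit_interval(text):
    """Parse a float and require it to lie in the range 0-1."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not in the range 0-1")
    return value


def build_parser():
    """Declare the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Image duplication detector")
    parser.add_argument("--input", required=True,
                        help="input PDF file or image folder path")
    parser.add_argument("--output", default="report.json",
                        help="output report path")
    parser.add_argument("--threshold", type=unit_interval, default=0.85,
                        help="similarity threshold (0-1); higher is stricter")
    parser.add_argument("--detect-tampering", action="store_true",
                        help="enable tampering/PS trace detection")
    parser.add_argument("--visualize", action="store_true",
                        help="generate visual comparison images")
    parser.add_argument("--temp-dir", default="./temp",
                        help="temporary file directory")
    return parser


args = build_parser().parse_args(["--input", "paper.pdf", "--threshold", "0.90"])
print(args.threshold, args.output, args.detect_tampering, args.temp_dir)
```

Validating the threshold in the `type` callback means an out-of-range value such as `1.5` is rejected with a usage error before any image is processed.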

Output Format

```json
{
  "summary": {
    "total_images": 12,
    "duplicates_found": 2,
    "tampering_detected": 1,
    "processing_time": "3.5s"
  },
  "duplicates": [
    {
      "group_id": 1,
      "similarity": 0.98,
      "images": [
        {"page": 2, "index": 1, "path": "..."},
        {"page": 5, "index": 3, "path": "..."}
      ]
    }
  ],
  "tampering": [
    {
      "image": "page3img_2.png",
      "suspicious_regions": [
        {"x": 120, "y": 80, "width": 50, "height": 50, "confidence": 0.92}
      ]
    }
  ]
}
```
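
A report in this format can be post-processed with the standard `json` module. The sketch below is illustrative only: `strong_duplicates` and its `min_similarity` cut-off are hypothetical, and the sample string reuses the field names from the output format with quoted keys so that it parses as JSON.

```python
import json


def strong_duplicates(report, min_similarity=0.95):
    """Return (group_id, similarity, pages) for groups at or above min_similarity."""
    return [
        (group["group_id"], group["similarity"],
         [image["page"] for image in group["images"]])
        for group in report.get("duplicates", [])
        if group["similarity"] >= min_similarity
    ]


# Trimmed sample mirroring the documented report structure.
sample = json.loads("""
{
  "summary": {"total_images": 12, "duplicates_found": 2},
  "duplicates": [
    {"group_id": 1, "similarity": 0.98,
     "images": [{"page": 2, "index": 1, "path": "..."},
                {"page": 5, "index": 3, "path": "..."}]}
  ],
  "tampering": []
}
""")

print(strong_duplicates(sample))  # [(1, 0.98, [2, 5])]
```

For a real run, the same function could be applied to `json.load(open("report.json"))`.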

Requirements

```
opencv-python>=4.8.0
numpy>=1.24.0
Pillow>=10.0.0
PyPDF2>=3.0.0
pdf2image>=1.16.0
imagehash>=4.3.0
scikit-image>=0.21.0
matplotlib>=3.7.0
```

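Algorithm Details

Duplication Detection

- Perceptual Hashing: Uses a combination of pHash, dHash, and aHash to detect visually similar images
- Feature Matching: ORB feature point matching to verify similarity
- SSIM: Structural similarity index as auxiliary verification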
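
To make the SSIM item concrete, the sketch below evaluates the SSIM formula once over two whole grayscale grids. This is a simplification chosen for illustration: library implementations such as `skimage.metrics.structural_similarity` average the index over sliding local windows, and the function name `global_ssim` is hypothetical. The stabilising constants use the conventional values K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(gray_a, gray_b, dynamic_range=255):
    """Single-window SSIM over two equally sized grayscale grids."""
    xs = [p for row in gray_a for p in row]
    ys = [p for row in gray_b for p in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same number of pixels")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilises the contrast/structure term
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


image = [[10, 20, 30, 40], [40, 30, 20, 10], [5, 50, 5, 50]]
inverted = [[255 - p for p in row] for row in image]  # photographic negative

print(global_ssim(image, image))
print(global_ssim(image, inverted))
```

An image compared with itself scores 1, while the negative scores below zero because its structure is anti-correlated with the original.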

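Tampering Detection

- ELA (Error Level Analysis): Detects inconsistencies in JPEG compression level
- Noise Analysis: Detects anomalies in noise patterns
- Copy-Move Detection: Detects regions copied and pasted within the same image
- Lighting Inconsistency: Checks whether lighting is consistent across the image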

Example

```python
from scripts.main import ImageDuplicationDetector

detector = ImageDuplicationDetector(
    threshold=0.85,
    detect_tampering=True
)

results = detector.scan("paper.pdf")
detector.save_report(results, "report.json")
```

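Notes

- Supports PDF, PNG, JPG, and TIFF formats
- Batch processing is recommended for large files
- Tampering detection may produce false positives; manual review is recommended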

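Risk Assessment

| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |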

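Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited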

Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

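Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time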

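Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support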
