File-Deduplicator - Find and Remove Duplicates
Vernox Utility Skill - Clean up your digital hoard.
Overview
File-Deduplicator is an intelligent duplicate file finder and remover. It uses content hashing to identify identical files across directories, then offers options for removing the duplicates safely.
Features
✅ Duplicate Detection
- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)
✅ Removal Options
- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)
✅ Analysis Tools
- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation
✅ Safety Features
- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (don't remove huge files by mistake)
- Whitelist important directories
- Undo functionality (log for recovery)
Installation
CODEBLOCK0
Quick Start
Find Duplicates in Directory
CODEBLOCK1
Remove Duplicates Automatically
CODEBLOCK2
Dry-Run Preview
CODEBLOCK3
Tool Functions
findDuplicates
Find duplicate files across directories.
Parameters:
- directories (array|string, required): Directory paths to scan
- options (object, optional):
  - method (string): 'content' | 'size' | 'name' - comparison method
  - includeSubdirs (boolean): Scan recursively (default: true)
  - minSize (number): Minimum size in bytes (default: 0)
  - maxSize (number): Maximum size in bytes (default: 0)
  - excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
  - whitelist (array): Directories to never scan (default: [])
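How includeSubdirs and excludePatterns might interact during a scan can be sketched as a recursive directory walk. This is an illustrative sketch only, not the skill's actual implementation: the helper name `listFiles` is hypothetical, and it assumes the patterns are plain entry names (like the defaults) rather than full globs.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: recursively list files under `dir`, skipping any
// entry whose name appears in `exclude` (names only, no glob matching).
function listFiles(dir, exclude = ['.git', 'node_modules']) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (exclude.includes(entry.name)) continue;
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) found.push(...listFiles(fullPath, exclude));
    else if (entry.isFile()) found.push(fullPath);
  }
  return found;
}

// Demo on a throwaway directory tree.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'dedup-scan-'));
fs.mkdirSync(path.join(root, 'docs'));
fs.mkdirSync(path.join(root, 'node_modules'));
fs.writeFileSync(path.join(root, 'docs', 'a.txt'), 'hello');
fs.writeFileSync(path.join(root, 'node_modules', 'b.js'), 'ignored');

const scanned = listFiles(root).map((p) => path.relative(root, p));
console.log(scanned); // only the file under docs/ is listed
```

Skipping excluded directories before descending into them keeps large trees such as node_modules from being read at all.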
Returns:
- duplicates (array): Array of duplicate groups
- duplicateCount (number): Number of duplicate groups found
- totalFiles (number): Total files scanned
- scanDuration (number): Time taken to scan (ms)
- spaceWasted (number): Total bytes wasted by duplicates
- spaceSaved (number): Potential savings if duplicates removed
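One plausible way to arrive at a figure like spaceWasted is to count every copy beyond the first in each group. The sketch below assumes a hypothetical group shape (`size` per file plus a `files` array); the real shape of the entries in duplicates may differ.

```javascript
// Hypothetical group shape: { size: <bytes per file>, files: [<paths>] }.
// Every copy beyond the first one in a group counts as wasted space.
function spaceWasted(groups) {
  return groups.reduce(
    (sum, group) => sum + group.size * (group.files.length - 1),
    0
  );
}

const sampleGroups = [
  { size: 1000, files: ['a.txt', 'b.txt', 'c.txt'] }, // 2 redundant copies
  { size: 250, files: ['x.png', 'y.png'] },           // 1 redundant copy
];

const wasted = spaceWasted(sampleGroups);
console.log(wasted); // 2250
```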
removeDuplicates
Remove duplicate files based on findings.
Parameters:
- directories (array|string, required): Same as findDuplicates
- options (object, optional):
  - keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
  - action (string): 'delete' | 'move' | 'archive'
  - archivePath (string): Where to move files when action='move'
  - dryRun (boolean): Preview without actual action
  - autoConfirm (boolean): Auto-confirm deletions
  - sizeThreshold (number): Don't remove files larger than this
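The keep option can be read as a sort order over each duplicate group: the first file after sorting is kept and the rest become removal candidates. The following is a minimal sketch under that assumption; `splitGroup` and the `{ path, mtimeMs, size }` file shape are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: decide which file in a duplicate group survives.
// Each file is assumed to look like { path, mtimeMs, size }.
function splitGroup(files, keep = 'newest') {
  const comparators = {
    newest: (a, b) => b.mtimeMs - a.mtimeMs,
    oldest: (a, b) => a.mtimeMs - b.mtimeMs,
    smallest: (a, b) => a.size - b.size,
    largest: (a, b) => b.size - a.size,
  };
  const compare = comparators[keep];
  if (!compare) throw new Error(`Unknown keep strategy: ${keep}`);
  // Sort a copy so the caller's array is left untouched.
  const [kept, ...remove] = [...files].sort(compare);
  return { kept, remove };
}

const group = [
  { path: 'a', mtimeMs: 100, size: 5 },
  { path: 'b', mtimeMs: 300, size: 5 },
  { path: 'c', mtimeMs: 200, size: 5 },
];

const decision = splitGroup(group, 'newest');
console.log(decision.kept.path);                 // b
console.log(decision.remove.map((f) => f.path)); // [ 'c', 'a' ]
```

For content-based groups all files have the same size, so 'smallest' and 'largest' only make a difference with size-tolerant or name-based detection.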
Returns:
- filesRemoved (number): Number of files removed/moved
- spaceSaved (number): Bytes saved
- groupsProcessed (number): Number of duplicate groups handled
- logPath (string): Path to action log
- errors (array): Any errors encountered
analyzeDirectory
Analyze a single directory for duplicates.
Parameters:
- directory (string, required): Path to directory
- options (object, optional): Same as findDuplicates options
Returns:
- fileCount (number): Total files in directory
- totalSize (number): Total bytes in directory
- duplicateSize (number): Bytes in duplicate files
- duplicateRatio (number): Percentage of files that are duplicates
Use Cases
Digital Hoarder Cleanup
- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders
Document Management
- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat
Project Cleanup
- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD
Backup Optimization
- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives
Configuration
Edit config.json:
CODEBLOCK4
Methods
Content-Based (Recommended)
- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives
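Content-based detection can be sketched as "hash every file, then group paths by hash". The sketch below uses Node's built-in crypto module; the helper names `md5OfFile` and `groupByContent` are hypothetical, and reading whole files into memory is a simplification (a real tool would likely stream large files).

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hash a file's bytes with MD5 (whole file read at once, for brevity).
function md5OfFile(filePath) {
  return crypto.createHash('md5').update(fs.readFileSync(filePath)).digest('hex');
}

// Group paths by content hash and keep only groups with more than one file.
function groupByContent(filePaths) {
  const byHash = new Map();
  for (const filePath of filePaths) {
    const hash = md5OfFile(filePath);
    if (!byHash.has(hash)) byHash.set(hash, []);
    byHash.get(hash).push(filePath);
  }
  return [...byHash.values()].filter((paths) => paths.length > 1);
}

// Demo: two files with identical bytes but different names, plus one unique file.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'dedup-hash-'));
const original = path.join(dir, 'report.txt');
const renamed = path.join(dir, 'copy-of-report.txt');
const other = path.join(dir, 'other.txt');
fs.writeFileSync(original, 'same bytes');
fs.writeFileSync(renamed, 'same bytes');
fs.writeFileSync(other, 'different bytes');

const contentGroups = groupByContent([original, renamed, other]);
console.log(contentGroups.length);    // 1
console.log(contentGroups[0].length); // 2
```

Because only file contents are hashed, the renamed copy lands in the same group as the original.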
Size-Based
- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Finds near-duplicates (similar but not exact)
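Size comparison is cheap because it needs only file metadata, so it also works as a pre-filter: files with a unique size cannot have an exact duplicate. A minimal sketch, assuming a hypothetical `{ path, size }` input shape and exact size matching (no tolerance):

```javascript
// Bucket files by exact size; only buckets with 2+ files can contain duplicates.
// Equal size does NOT prove equal content, so these are candidates only.
function candidatesBySize(files) {
  const bySize = new Map();
  for (const file of files) {
    const bucket = bySize.get(file.size) ?? [];
    bucket.push(file.path);
    bySize.set(file.size, bucket);
  }
  return [...bySize.values()].filter((bucket) => bucket.length > 1);
}

const sizeCandidates = candidatesBySize([
  { path: 'a.mp4', size: 700 },
  { path: 'b.mp4', size: 512 },
  { path: 'c.mp4', size: 700 },
]);
console.log(sizeCandidates); // [ [ 'a.mp4', 'c.mp4' ] ]
```

Hashing only the files inside these buckets is one way to combine the speed of size checks with the accuracy of content hashing.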
Name-Based
- Compares filenames
- Detects similar named files
- Good for finding version duplicates (filev1, filev2)
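Name-based matching could, for example, normalize filenames by stripping common version or copy suffixes and then group on the result. The rules below (a trailing `v<number>`, `(<number>)`, or `copy`) are assumptions made for this sketch, not the skill's documented matching logic, and they can over-match names that merely end in such a suffix.

```javascript
const path = require('node:path');

// Hypothetical normalization: lowercase the name and strip one trailing
// version/copy marker such as "v2", "_v3", " (1)" or "-copy" from the stem.
function normalizeName(fileName) {
  const ext = path.extname(fileName);
  const stem = path
    .basename(fileName, ext)
    .toLowerCase()
    .replace(/[\s_-]*(\(\d+\)|v\d+|copy)$/, '');
  return stem + ext.toLowerCase();
}

// Group filenames that normalize to the same key.
function groupByName(fileNames) {
  const byKey = new Map();
  for (const name of fileNames) {
    const key = normalizeName(name);
    if (!byKey.has(key)) byKey.set(key, []);
    byKey.get(key).push(name);
  }
  return [...byKey.values()].filter((names) => names.length > 1);
}

const nameGroups = groupByName(['filev1.txt', 'filev2.txt', 'notes.md']);
console.log(nameGroups); // [ [ 'filev1.txt', 'filev2.txt' ] ]
```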
Examples
Find Duplicates in Documents
CODEBLOCK5
Remove Duplicates, Keep Newest
CODEBLOCK6
Move to Archive Instead of Delete
CODEBLOCK7
Dry-Run Preview Changes
CODEBLOCK8
Performance
Scanning Speed
- Small directories (<1000 files): <1s
- Medium directories (1000-10000 files): 1-5s
- Large directories (10000+ files): 5-20s
Detection Accuracy
- Content-based: 100% (exact duplicates)
- Size-based: Fast, but files of equal size are not necessarily identical, so false positives are possible
- Name-based: Detects naming patterns only
Memory Usage
- Hash cache: ~1MB per 100,000 files
- Batch processing: Processes 1000 files at a time
- Peak memory: ~200MB for 1M files
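Processing files in fixed-size batches, as described above, can be sketched with a generator that slices the file list. The name `inBatches` is hypothetical; the batch size of 1000 comes from the figure in this section.

```javascript
// Yield successive slices of `items`, each at most `batchSize` long.
function* inBatches(items, batchSize = 1000) {
  for (let i = 0; i < items.length; i += batchSize) {
    yield items.slice(i, i + batchSize);
  }
}

// Demo: 2500 placeholder paths split into batches.
const allPaths = Array.from({ length: 2500 }, (_, i) => `file-${i}`);
const batchSizes = [...inBatches(allPaths)].map((batch) => batch.length);
console.log(batchSizes); // [ 1000, 1000, 500 ]
```

Working through one batch at a time bounds how many file handles and buffers are live at once, at the cost of keeping the per-file hashes around between batches.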
Safety Features
Size Thresholding
Files larger than a configurable threshold (default: 10MB) are never removed. This prevents accidental deletion of important large files.
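The threshold check amounts to partitioning removal candidates by size. A minimal sketch, assuming a hypothetical `{ path, size }` candidate shape; the 10MB default is taken to mean 10485760 bytes.

```javascript
const DEFAULT_SIZE_THRESHOLD = 10 * 1024 * 1024; // 10485760 bytes

// Split candidates into those safe to remove and those skipped for being too large.
function partitionByThreshold(candidates, sizeThreshold = DEFAULT_SIZE_THRESHOLD) {
  const removable = [];
  const skipped = [];
  for (const file of candidates) {
    (file.size > sizeThreshold ? skipped : removable).push(file);
  }
  return { removable, skipped };
}

const partition = partitionByThreshold([
  { path: 'notes.txt', size: 1024 },
  { path: 'holiday.mov', size: 20 * 1024 * 1024 },
]);
console.log(partition.removable.map((f) => f.path)); // [ 'notes.txt' ]
console.log(partition.skipped.map((f) => f.path));   // [ 'holiday.mov' ]
```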
Archive Mode
Move files to archive directory instead of deleting. No data loss, full recoverability.
Action Logging
All deletions/moves are logged to file for recovery and audit.
Undo Functionality
Log file can be used to restore accidentally deleted files (limited undo window).
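Archive mode, action logging, and undo fit together naturally: each move is appended to a log, and undo replays the log in reverse. The sketch below is one possible design, not the skill's actual log format; the function names and the JSON-lines layout are assumptions, and it ignores name collisions in the archive and moves across filesystems.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Move a file into the archive directory and record the move as one JSON line.
function archiveWithLog(filePath, archiveDir, logPath) {
  fs.mkdirSync(archiveDir, { recursive: true });
  const target = path.join(archiveDir, path.basename(filePath));
  fs.renameSync(filePath, target); // same-filesystem move only
  const entry = { action: 'move', from: filePath, to: target, at: new Date().toISOString() };
  fs.appendFileSync(logPath, JSON.stringify(entry) + '\n');
  return target;
}

// Undo: replay logged moves in reverse order, restoring files that still exist.
function undoFromLog(logPath) {
  const entries = fs
    .readFileSync(logPath, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));
  for (const entry of entries.reverse()) {
    if (entry.action === 'move' && fs.existsSync(entry.to)) {
      fs.renameSync(entry.to, entry.from);
    }
  }
}

// Demo in a throwaway directory.
const workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'dedup-undo-'));
const duplicate = path.join(workDir, 'dup.txt');
const logFile = path.join(workDir, 'actions.log');
fs.writeFileSync(duplicate, 'duplicate content');

const archived = archiveWithLog(duplicate, path.join(workDir, 'archive'), logFile);
const goneAfterArchive = !fs.existsSync(duplicate);
undoFromLog(logFile);
const restoredAfterUndo = fs.existsSync(duplicate);
console.log(goneAfterArchive, restoredAfterUndo); // true true
```

Note that this kind of undo only works for moved files; a plain delete cannot be reversed from a log alone.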
Error Handling
Permission Errors
- Clear error message
- Suggest running with sudo
- Skip files that can't be accessed
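The skip-and-report behaviour can be sketched as a wrapper that catches filesystem errors, records them, and lets the scan continue. `statOrSkip` and the error record shape are hypothetical; `EACCES` and `EPERM` are the error codes Node.js reports for permission problems.

```javascript
const fs = require('node:fs');

// Try to stat a file; on failure, record the error and return null so the
// caller can skip the file instead of aborting the whole scan.
function statOrSkip(filePath, errors) {
  try {
    return fs.statSync(filePath);
  } catch (err) {
    const isPermission = err.code === 'EACCES' || err.code === 'EPERM';
    errors.push({
      path: filePath,
      code: err.code,
      hint: isPermission ? 'permission denied; consider elevated privileges' : 'could not be read',
    });
    return null;
  }
}

// Demo with a path that does not exist (reported as ENOENT, then skipped).
const scanErrors = [];
const missingStat = statOrSkip('/definitely/not/a/real/path-12345', scanErrors);
console.log(missingStat, scanErrors[0].code); // null ENOENT
```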
File Lock Errors
- Detect locked files
- Skip and report
- Suggest closing applications using files
Space Errors
- Check available disk space before deletion
- Warn if space is critically low
- Prevent disk-full scenarios
Troubleshooting
Not Finding Expected Duplicates
- Check detection method (content vs size vs name)
- Verify exclude patterns aren't too broad
- Check if files are in whitelisted directories
- Make sure includeSubdirs is true so that files in nested directories are scanned
Deletion Not Working
- Check write permissions on directories
- Verify action isn't 'delete' with autoConfirm: true
- Check size threshold isn't blocking all deletions
- Check file locks (is another program using files?)
Slow Scanning
- Reduce includeSubdirs scope
- Use size-based detection (faster)
- Exclude large directories (node_modules, .git)
- Process directories individually instead of batch
Tips
Best Results
- Use content-based detection for documents (100% accurate)
- Run dry-run first to preview changes
- Archive instead of delete for important files
- Check the logs if anything unexpected was deleted
Performance Optimization
- Process frequently used directories first
- Use size threshold to skip large media files
- Exclude hidden directories from scan
- Process directories in parallel when possible
Space Management
- Regular duplicate cleanup prevents storage bloat
- Delete temp directories regularly
- Clear download folders of installers
- Empty trash before large scans
Roadmap
- [ ] Duplicate detection by image similarity
- [ ] Near-duplicate detection (similar but not exact)
- [ ] Duplicate detection across network drives
- [ ] Cloud storage integration (S3, Google Drive)
- [ ] Automatic scheduling of scans
- [ ] Heuristic duplicate detection (ML-based)
- [ ] Recover deleted files from backup
- [ ] Duplicate detection by file content similarity (not just hash)
License
MIT
Find duplicates. Save space. Keep your system clean. 🔮
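File-Deduplicator - Find and Remove Duplicate Files
Vernox Utility Skill - Clean up your digital hoard.
Overview
File-Deduplicator is an intelligent tool for finding and removing duplicate files. It uses content hashing to identify identical files across directories and offers options for removing the duplicates safely.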
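Features
✅ Duplicate Detection
- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)
✅ Removal Options
- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)
✅ Analysis Tools
- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation
✅ Safety Features
- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (prevents removing large files by mistake)
- Whitelist of important directories
- Undo functionality (recovery log)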
Installation
bash
clawhub install file-deduplicator
Quick Start
Find Duplicates in Directory
javascript
const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content', // content-based comparison
    includeSubdirs: true
  }
});
console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential savings: ${result.spaceSaved}`);
Remove Duplicates Automatically
javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',     // keep the newest copy, remove the older ones
    action: 'delete',   // or 'move' to send files to the archive
    autoConfirm: false  // ask for confirmation on each action
  }
});
console.log(`Removed ${result.filesRemoved} duplicate files`);
console.log(`Space saved: ${result.spaceSaved}`);
Dry-Run Preview
javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true // preview without actually deleting
  }
});
console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i + 1}. ${dup.file}`);
});
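Tool Functions
findDuplicates
Find duplicate files across directories.
Parameters:
- directories (array|string, required): Directory paths to scan
- options (object, optional):
  - method (string): 'content' | 'size' | 'name' - comparison method
  - includeSubdirs (boolean): Scan recursively (default: true)
  - minSize (number): Minimum size in bytes (default: 0)
  - maxSize (number): Maximum size in bytes (default: 0)
  - excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
  - whitelist (array): Directories to never scan (default: [])
Returns:
- duplicateCount (number): Number of duplicate groups found
- totalFiles (number): Total files scanned
- scanDuration (number): Time taken to scan (ms)
- spaceWasted (number): Total bytes wasted by duplicates
- spaceSaved (number): Potential savings if duplicates are removed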
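removeDuplicates
Remove duplicate files based on the detection results.
Parameters:
- directories (array|string, required): Same as findDuplicates
- options (object, optional):
  - keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
  - action (string): 'delete' | 'move' | 'archive'
  - archivePath (string): Where to move files when action='move'
  - dryRun (boolean): Preview without performing the action
  - autoConfirm (boolean): Auto-confirm deletions
  - sizeThreshold (number): Don't remove files larger than this
Returns:
- filesRemoved (number): Number of files removed/moved
- spaceSaved (number): Bytes saved
- groupsProcessed (number): Number of duplicate groups handled
- logPath (string): Path to the action log
- errors (array): Any errors encountered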
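analyzeDirectory
Analyze a single directory for duplicates.
Parameters:
- directory (string, required): Path to directory
- options (object, optional): Same as findDuplicates options
Returns:
- fileCount (number): Total files in directory
- totalSize (number): Total bytes in directory
- duplicateSize (number): Bytes in duplicate files
- duplicateRatio (number): Percentage of files that are duplicates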
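Use Cases
Digital Hoarder Cleanup
- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders
Document Management
- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat
Project Cleanup
- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD
Backup Optimization
- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives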
Configuration
Edit config.json:
json
{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0,  // exact matches only
    "nameSimilarity": 0.7,      // 0-1; the lower the value, the more similar
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760,  // 10MB threshold
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}
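Methods
Content-Based (Recommended)
- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives
Size-Based
- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Finds near-duplicates (similar but not exact)
Name-Based
- Compares filenames
- Detects similarly named files
- Good for finding version duplicates (filev1, filev2)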
Examples
Find Duplicates in Documents
javascript
const result = await findDuplicates({
  directories: '~/Documents',
  options: {
    method: 'content',
    includeSubdirs: true
  }
});
console.log(`Found ${result.duplicateCount} duplicate groups`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Group ${i + 1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});
Remove Duplicates, Keep Newest
javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    keep: 'newest',
    action: 'delete'
  }
});
console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);
Move to Archive Instead of Delete
javascript
const result = await removeDuplicates({
  directories: '~/Downloads',
  options: {
    keep: 'newest',
    action: 'move',
    archivePath: '~/Documents/Archive'
  }
});
console.log(`Archived ${result.filesRemoved} files`);
console.log('Safe location: ~/Documents/Archive');
Dry-Run Preview Changes
javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true // only show what would