File-Deduplicator - Find and Remove Duplicates

Vernox Utility Skill - Clean up your digital hoard.

Overview

File-Deduplicator finds and removes duplicate files. It uses content hashing to identify identical files across directories, then provides options for removing the duplicates safely.

Features

✅ Duplicate Detection

- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)

✅ Removal Options

- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)

✅ Analysis Tools

- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation

✅ Safety Features

- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (don't remove huge files by mistake)
- Whitelist important directories
- Undo functionality (log for recovery)

Installation

CODEBLOCK0

Quick Start

Find Duplicates in Directory

CODEBLOCK1

Remove Duplicates Automatically

CODEBLOCK2

Dry-Run Preview

CODEBLOCK3

Tool Functions

`findDuplicates`

Find duplicate files across directories.

Parameters:

- directories (array|string, required): Directory paths to scan
- options (object, optional):
  - method (string): 'content' | 'size' | 'name' - comparison method
  - includeSubdirs (boolean): Scan recursively (default: true)
  - minSize (number): Minimum size in bytes (default: 0)
  - maxSize (number): Maximum size in bytes (default: 0)
  - excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
  - whitelist (array): Directories to never scan (default: [])
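
To illustrate how includeSubdirs and excludePatterns might interact during a scan, here is a minimal sketch. The helper name `listFiles` is hypothetical and not part of the skill's API, and it matches exclude entries by exact name rather than as real glob patterns, which is a simplification.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper (not the skill's API): collect regular files under root.
function listFiles(root, { includeSubdirs = true, excludePatterns = ['.git', 'node_modules'] } = {}) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    // Simplification: exact name match instead of glob matching.
    if (excludePatterns.includes(entry.name)) continue;
    const fullPath = path.join(root, entry.name);
    if (entry.isDirectory()) {
      // Descend only when recursive scanning is enabled.
      if (includeSubdirs) found.push(...listFiles(fullPath, { includeSubdirs, excludePatterns }));
    } else if (entry.isFile()) {
      found.push(fullPath);
    }
  }
  return found;
}
```

Excluded directories are skipped before descending, so a large node_modules tree is never read at all.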

Returns:

- duplicates (array): Array of duplicate groups
- duplicateCount (number): Number of duplicate groups found
- totalFiles (number): Total files scanned
- scanDuration (number): Time taken to scan (ms)
- spaceWasted (number): Total bytes wasted by duplicates
- spaceSaved (number): Potential savings if duplicates removed
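
As a rough illustration of how figures such as duplicateCount and spaceWasted could be derived from duplicate groups, consider the sketch below. The group shape (`files` plus a per-file `fileSize`) and the function name `summarize` are assumptions made for this example, not the documented return format.

```javascript
// Hypothetical group shape: { files: string[], fileSize: number }, where all
// files in a group have identical content and therefore identical size.
function summarize(groups) {
  let spaceWasted = 0;
  for (const group of groups) {
    // One copy is kept; the remaining (n - 1) copies are redundant.
    spaceWasted += (group.files.length - 1) * group.fileSize;
  }
  return { duplicateCount: groups.length, spaceWasted };
}
```

For example, a group of three identical 100-byte files contributes 200 wasted bytes, because one copy always stays.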

`removeDuplicates`

Remove duplicate files based on findings.

Parameters:

- directories (array|string, required): Same as findDuplicates
- options (object, optional):
  - keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
  - action (string): 'delete' | 'move' | 'archive'
  - archivePath (string): Where to move files when action='move'
  - dryRun (boolean): Preview without actual action
  - autoConfirm (boolean): Auto-confirm deletions
  - sizeThreshold (number): Don't remove files larger than this

Returns:

- filesRemoved (number): Number of files removed/moved
- spaceSaved (number): Bytes saved
- groupsProcessed (number): Number of duplicate groups handled
- logPath (string): Path to action log
- errors (array): Any errors encountered
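
The keep option decides which file in each duplicate group survives. One plausible way to implement that choice is sketched here; the file record shape (`path`, `size`, `mtimeMs`) and the name `planRemoval` are hypothetical, and the skill may select files differently.

```javascript
// Hypothetical file record: { path: string, size: number, mtimeMs: number }.
// Each comparator sorts the file to keep to the front of the array.
const comparators = {
  newest: (a, b) => b.mtimeMs - a.mtimeMs,
  oldest: (a, b) => a.mtimeMs - b.mtimeMs,
  smallest: (a, b) => a.size - b.size,
  largest: (a, b) => b.size - a.size,
};

function planRemoval(files, keep = 'newest') {
  const compare = comparators[keep];
  if (!compare) throw new Error(`Unknown keep strategy: ${keep}`);
  // Sort a copy so the caller's array is left untouched.
  const [kept, ...removed] = [...files].sort(compare);
  return { kept, removed };
}
```

Returning a plan instead of deleting immediately also fits a dry-run mode: the same plan can be printed or executed.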

`analyzeDirectory`

Analyze a single directory for duplicates.

Parameters:

- directory (string, required): Path to directory
- options (object, optional): Same as findDuplicates options

Returns:

- fileCount (number): Total files in directory
- totalSize (number): Total bytes in directory
- duplicateSize (number): Bytes in duplicate files
- duplicateRatio (number): Percentage of files that are duplicates

Use Cases

Digital Hoarder Cleanup

- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders

Document Management

- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat

Project Cleanup

- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD

Backup Optimization

- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives

Configuration

Edit `config.json`:

CODEBLOCK4

Methods

Content-Based (Recommended)

- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives
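
A minimal sketch of content-based grouping with Node's built-in `crypto` module follows. The helper names `hashFile` and `groupByContent` are hypothetical, and reading each file fully into memory is a simplification; a production scanner would more likely stream large files.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// Hypothetical helper: MD5 digest of a file's bytes as a hex string.
function hashFile(filePath) {
  return crypto.createHash('md5').update(fs.readFileSync(filePath)).digest('hex');
}

// Hypothetical helper: group paths by digest and keep groups with 2+ files.
function groupByContent(filePaths) {
  const groups = new Map();
  for (const filePath of filePaths) {
    const digest = hashFile(filePath);
    if (!groups.has(digest)) groups.set(digest, []);
    groups.get(digest).push(filePath);
  }
  return [...groups.values()].filter((files) => files.length > 1);
}
```

Because the digest depends only on file bytes, two copies with different names land in the same group.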

Size-Based

- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Finds near-duplicates (similar but not exact); equal size alone does not prove identical content
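
Size comparison is cheap because it needs only file metadata. The sketch below buckets files by byte size; `groupBySize` is a hypothetical name, and the resulting buckets are candidates only, since two different files can share a size.

```javascript
const fs = require('node:fs');

// Hypothetical helper: bucket paths by byte size. Only buckets with two or
// more files can contain duplicates, so singletons are dropped.
function groupBySize(filePaths) {
  const bySize = new Map();
  for (const filePath of filePaths) {
    const { size } = fs.statSync(filePath);
    if (!bySize.has(size)) bySize.set(size, []);
    bySize.get(size).push(filePath);
  }
  return [...bySize.values()].filter((files) => files.length > 1);
}
```

A common design is to run this as a prefilter and hash only the files inside each bucket, which avoids reading files whose size is unique.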

Name-Based

- Compares filenames
- Detects similarly named files
- Good for finding version duplicates (filev1, filev2)
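
One way to catch version duplicates such as filev1 and filev2 is to normalize names before comparing them. This is only a sketch: `normalizeName` and `groupByName` are hypothetical, the set of stripped suffixes is an assumption, and the heuristic can misfire on names that legitimately end in something like "v1".

```javascript
const path = require('node:path');

// Hypothetical normalizer: drop the extension, lowercase, and strip trailing
// version or copy markers such as "v2", "_v10", " - copy", or " (1)".
function normalizeName(filePath) {
  const { name } = path.parse(filePath);
  return name
    .toLowerCase()
    .replace(/(\s*\(\d+\)|[\s_-]*copy|[\s_-]*v\d+)+$/g, '');
}

// Group paths whose normalized names match; keep groups with 2+ files.
function groupByName(filePaths) {
  const groups = new Map();
  for (const filePath of filePaths) {
    const key = normalizeName(filePath);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(filePath);
  }
  return [...groups.values()].filter((files) => files.length > 1);
}
```

Name matches say nothing about content, so results from this method are best reviewed before anything is removed.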

Examples

Find Duplicates in Documents

CODEBLOCK5

Remove Duplicates, Keep Newest

CODEBLOCK6

Move to Archive Instead of Delete

CODEBLOCK7

Dry-Run Preview Changes

CODEBLOCK8

Performance

Scanning Speed

- Small directories (<1000 files): <1s
- Medium directories (1000-10000 files): 1-5s
- Large directories (10000+ files): 5-20s

Detection Accuracy

- Content-based: 100% (exact duplicates)
- Size-based: Fast, but equal size does not guarantee identical content
- Name-based: Detects naming patterns only

Memory Usage

- Hash cache: ~1MB per 100,000 files
- Batch processing: Processes 1000 files at a time
- Peak memory: ~200MB for 1M files

Safety Features

Size Thresholding

Files larger than a configurable threshold (default: 10MB) are never removed. This prevents accidental deletion of important large files.
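
The threshold check can be pictured as a simple partition of removal candidates. The names below (`partitionBySize`, the `size` field) are hypothetical and only illustrate the rule described above.

```javascript
// 10 MB, matching the documented default threshold.
const DEFAULT_SIZE_THRESHOLD = 10 * 1024 * 1024;

// Hypothetical helper: split candidates into files that may be removed and
// files that are skipped because they exceed the threshold.
function partitionBySize(files, sizeThreshold = DEFAULT_SIZE_THRESHOLD) {
  const removable = [];
  const skipped = [];
  for (const file of files) {
    (file.size > sizeThreshold ? skipped : removable).push(file);
  }
  return { removable, skipped };
}
```

Keeping the skipped files in the result makes it possible to report them instead of silently ignoring them.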

Archive Mode

Moves files to an archive directory instead of deleting them, so no data is lost and every file can be restored.

Action Logging

All deletions and moves are logged to a file for recovery and audit.

Undo Functionality

The log file can be used to restore accidentally deleted files (limited undo window).
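
Archive mode, action logging, and undo fit together naturally: each move is recorded, and the record is enough to reverse it. The sketch below assumes a JSON-lines log and the hypothetical helpers `archiveFile` and `undoFromLog`; the skill's real log format is not documented here. It also assumes the archive lives on the same filesystem and that basenames in the archive do not collide.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: move a file into archiveDir and append one JSON line
// describing the move to the action log.
function archiveFile(filePath, archiveDir, logPath) {
  fs.mkdirSync(archiveDir, { recursive: true });
  const target = path.join(archiveDir, path.basename(filePath));
  // renameSync only works within one filesystem (it fails with EXDEV otherwise).
  fs.renameSync(filePath, target);
  const entry = { action: 'move', from: filePath, to: target, at: new Date().toISOString() };
  fs.appendFileSync(logPath, JSON.stringify(entry) + '\n');
  return target;
}

// Hypothetical helper: replay the log in reverse and move archived files back.
function undoFromLog(logPath) {
  const entries = fs
    .readFileSync(logPath, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => JSON.parse(line));
  let restored = 0;
  for (const entry of entries.reverse()) {
    if (entry.action === 'move' && fs.existsSync(entry.to)) {
      fs.renameSync(entry.to, entry.from);
      restored += 1;
    }
  }
  return restored;
}
```

Undo of this kind works only for moved files. A plain deletion leaves nothing to restore from, which is one reason to prefer archiving for important data.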

Error Handling

Permission Errors

- Clear error message
- Suggest running with sudo
- Skip files that can't be accessed

File Lock Errors

- Detect locked files
- Skip and report
- Suggest closing applications using files
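
The skip-and-report behavior for permission and lock errors could look roughly like this. The function `readAccessible` and the exact set of error codes treated as skippable are assumptions for illustration; the `read` parameter exists only so the behavior can be exercised without real locked files.

```javascript
const fs = require('node:fs');

// Error codes treated as "skip and report" in this sketch (an assumption).
const SKIPPABLE = new Set(['EACCES', 'EPERM', 'EBUSY']);

// Hypothetical helper: read every file it can, collect errors for the rest.
function readAccessible(filePaths, read = fs.readFileSync) {
  const contents = new Map();
  const errors = [];
  for (const filePath of filePaths) {
    try {
      contents.set(filePath, read(filePath));
    } catch (err) {
      // Unexpected failures still propagate; only access problems are skipped.
      if (!SKIPPABLE.has(err.code)) throw err;
      errors.push({ path: filePath, code: err.code, message: err.message });
    }
  }
  return { contents, errors };
}
```

Collecting the failures in an array mirrors the errors field returned by removeDuplicates, so a caller can list every skipped file at the end of a run.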

Space Errors

- Check available disk space before deletion
- Warn if space is critically low
- Prevent disk-full scenarios

Troubleshooting

Not Finding Expected Duplicates

- Check detection method (content vs size vs name)
- Verify exclude patterns aren't too broad
- Check if files are in whitelisted directories
- Check that includeSubdirs is not set to false, so nested folders are scanned

Deletion Not Working

- Check write permissions on directories
- Verify action isn't 'delete' with autoConfirm: true
- Check size threshold isn't blocking all deletions
- Check file locks (is another program using files?)

Slow Scanning

- Reduce includeSubdirs scope
- Use size-based detection (faster)
- Exclude large directories (node_modules, .git)
- Process directories individually instead of batch

Tips

Best Results

- Use content-based detection for documents (100% accurate)
- Run dry-run first to preview changes
- Archive instead of delete for important files
- Check the logs if anything unexpected was deleted

Performance Optimization

- Process frequently used directories first
- Use size threshold to skip large media files
- Exclude hidden directories from scan
- Process directories in parallel when possible

Space Management

- Regular duplicate cleanup prevents storage bloat
- Delete temp directories regularly
- Clear download folders of installers
- Empty trash before large scans

Roadmap

- [ ] Duplicate detection by image similarity
- [ ] Near-duplicate detection (similar but not exact)
- [ ] Duplicate detection across network drives
- [ ] Cloud storage integration (S3, Google Drive)
- [ ] Automatic scheduling of scans
- [ ] Heuristic duplicate detection (ML-based)
- [ ] Recover deleted files from backup
- [ ] Duplicate detection by file content similarity (not just hash)

License

MIT

Find duplicates. Save space. Keep your system clean. 🔮

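File-Deduplicator - Find and Remove Duplicates

Vernox Utility Skill - Clean up your digital hoard.

Overview

File-Deduplicator is an intelligent tool for finding and removing duplicate files. It uses content hashing to identify identical files across directories and provides options for removing duplicates safely.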

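Features

✅ Duplicate Detection

- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)

✅ Removal Options

- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)

✅ Analysis Tools

- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation

✅ Safety Features

- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (prevents deleting large files by mistake)
- Whitelist of important directories
- Undo functionality (recovery log)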

Installation

```bash
clawhub install file-deduplicator
```

Quick Start

Find Duplicates in Directory

```javascript
const result = await findDuplicates({
  directories: ['./documents', './downloads', './projects'],
  options: {
    method: 'content', // content-based comparison
    includeSubdirs: true
  }
});

console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);
```

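Remove Duplicates Automatically

```javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest', // keep the newest copy, delete the older ones
    action: 'delete', // or 'move' to move files to the archive
    autoConfirm: false // prompt for confirmation on each action
  }
});

console.log(`Removed ${result.filesRemoved} duplicate files`);
console.log(`Space saved: ${result.spaceSaved}`);
```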

Dry-Run Preview

```javascript
const result = await removeDuplicates({
  directories: ['./documents', './downloads'],
  options: {
    method: 'content',
    keep: 'newest',
    action: 'delete',
    dryRun: true // preview without actually deleting
  }
});

console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
  console.log(`${i + 1}. ${dup.file}`);
});
```

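Tool Functions

`findDuplicates`

Find duplicate files across directories.

Parameters:

- directories (array|string, required): Directory paths to scan
- options (object, optional):
  - method (string): 'content' | 'size' | 'name' - comparison method
  - includeSubdirs (boolean): Scan recursively (default: true)
  - minSize (number): Minimum size in bytes (default: 0)
  - maxSize (number): Maximum size in bytes (default: 0)
  - excludePatterns (array): Glob patterns to exclude (default: ['.git', 'node_modules'])
  - whitelist (array): Directories to never scan (default: [])

Returns:

- duplicates (array): Array of duplicate groups
- duplicateCount (number): Number of duplicate groups found
- totalFiles (number): Total files scanned
- scanDuration (number): Time taken to scan (ms)
- spaceWasted (number): Total bytes wasted by duplicates
- spaceSaved (number): Potential savings if duplicates removed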

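`removeDuplicates`

Remove duplicate files based on findings.

Parameters:

- directories (array|string, required): Same as findDuplicates
- options (object, optional):
  - keep (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep
  - action (string): 'delete' | 'move' | 'archive'
  - archivePath (string): Where to move files when action='move'
  - dryRun (boolean): Preview without actual action
  - autoConfirm (boolean): Auto-confirm deletions
  - sizeThreshold (number): Don't remove files larger than this

Returns:

- filesRemoved (number): Number of files removed/moved
- spaceSaved (number): Bytes saved
- groupsProcessed (number): Number of duplicate groups handled
- logPath (string): Path to action log
- errors (array): Any errors encountered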

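`analyzeDirectory`

Analyze a single directory for duplicates.

Parameters:

- directory (string, required): Path to directory
- options (object, optional): Same as findDuplicates options

Returns:

- fileCount (number): Total files in directory
- totalSize (number): Total bytes in directory
- duplicateSize (number): Bytes in duplicate files
- duplicateRatio (number): Percentage of files that are duplicates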

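Use Cases

Digital Hoarder Cleanup

- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders

Document Management

- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat

Project Cleanup

- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD

Backup Optimization

- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives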

Configuration

Edit `config.json`:

```json
{
  "detection": {
    "defaultMethod": "content",
    "sizeTolerancePercent": 0, // exact match only
    "nameSimilarity": 0.7, // 0-1, lower value = more similar
    "includeSubdirs": true
  },
  "removal": {
    "defaultAction": "delete",
    "defaultKeep": "newest",
    "archivePath": "./archive",
    "sizeThreshold": 10485760, // 10MB threshold
    "autoConfirm": false,
    "dryRunDefault": false
  },
  "exclude": {
    "patterns": [".git", "node_modules", ".vscode", ".idea"],
    "whitelist": ["important", "work", "projects"]
  }
}
```

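Methods

Content-Based (Recommended)

- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives

Size-Based

- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Finds near-duplicates (similar but not exact)

Name-Based

- Compares filenames
- Detects similarly named files
- Good for finding version duplicates (filev1, filev2)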

Examples

Find Duplicates in Documents

```javascript
const result = await findDuplicates({
  directories: '~/Documents',
  options: { method: 'content', includeSubdirs: true }
});

console.log(`Found ${result.duplicateCount} duplicate groups`);
result.duplicates.slice(0, 5).forEach((set, i) => {
  console.log(`Group ${i + 1}: ${set.files.length} files`);
  console.log(`  Total size: ${set.totalSize} bytes`);
});
```

Remove Duplicates, Keep Newest

```javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: { keep: 'newest', action: 'delete' }
});

console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);
```

Move to Archive Instead of Delete

```javascript
const result = await removeDuplicates({
  directories: '~/Downloads',
  options: { keep: 'newest', action: 'move', archivePath: '~/Documents/Archive' }
});

console.log(`Archived ${result.filesRemoved} files`);
console.log('Safe location: ~/Documents/Archive');
```

Dry-Run Preview Changes

```javascript
const result = await removeDuplicates({
  directories: '~/Documents',
  options: {
    dryRun: true // only show what would be
```
