# disk-checker

Fast Ubuntu-friendly CLI for scanning folders, checking file sizes, hashing the first chunk of same-size files, and reporting possible duplicates plus symlinks, hard links, special files, and scan errors.

## Install Rust on Ubuntu

```bash
sudo apt update
sudo apt install -y build-essential curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```

## Build

```bash
cargo build --release
```

The binary will be at:

```bash
target/release/disk-checker
```

## Usage

Scan the current directory:

```bash
disk-checker
```

Scan one or more paths:

```bash
disk-checker ~/Downloads /mnt/shared
```

Use JSON for scripts:

```bash
disk-checker ~/Downloads --json
```

Hash a larger first chunk before grouping possible duplicates:

```bash
disk-checker ~/Downloads --hash-bytes 8MiB
```

Follow symlinks while still reporting them separately:

```bash
disk-checker ~/Downloads --follow-links
```

Verify possible duplicates with a full-file hash pass:

```bash
disk-checker ~/Downloads --verify-full
```

Review duplicate groups one by one and choose which path to keep:

```bash
disk-checker ~/Downloads --interactive
```

Interactive mode automatically full-verifies only the duplicate candidate groups before prompting. It is non-destructive: it writes a reviewed shell deletion plan instead of deleting files immediately.

```bash
disk-checker ~/Downloads --interactive --delete-plan review-delete.sh
```

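As an illustration of what a non-destructive deletion plan could look like, here is a minimal Rust sketch that renders `rm` lines with POSIX single-quote escaping. The plan layout and the names `shell_quote` and `render_plan` are assumptions made for this example; the file that `--delete-plan` actually writes may differ.

```rust
/// Quote a path for a POSIX shell: wrap it in single quotes and rewrite each
/// embedded single quote as '\'' so the shell reads the result as one literal word.
fn shell_quote(path: &str) -> String {
    let mut quoted = String::with_capacity(path.len() + 2);
    quoted.push('\'');
    for ch in path.chars() {
        if ch == '\'' {
            quoted.push_str("'\\''");
        } else {
            quoted.push(ch);
        }
    }
    quoted.push('\'');
    quoted
}

/// Render a hypothetical plan for one duplicate group (format is an assumption):
/// the kept path appears only in a comment, every other path gets an `rm --` line.
fn render_plan(keep: &str, remove: &[&str]) -> String {
    let mut plan = String::from("#!/bin/sh\nset -eu\n");
    // Debug formatting escapes newlines, so the comment always stays on one line.
    plan.push_str(&format!("# keep: {:?}\n", keep));
    for path in remove {
        plan.push_str(&format!("rm -- {}\n", shell_quote(path)));
    }
    plan
}
```

Nothing is deleted until someone reads the generated script and runs it, which is the point of writing a plan instead of removing files directly.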
Use the fastest triage mode for huge datasets by grouping same-size files without hashing:

```bash
disk-checker /mnt/storage --size-only --min-size 100MiB --threads 32
```

Limit traversal depth:

```bash
disk-checker /mnt/storage --max-depth 3
```

Limit scanning and hashing workers:

```bash
disk-checker ~/Downloads --threads 4
```

Disable progress output:

```bash
disk-checker ~/Downloads --no-progress
```

## Notes

- By default, duplicate results are **possible duplicates**: same file size plus same first `1MiB` BLAKE3 hash.
- This is intentionally fast because it avoids reading whole files unless you pass `--verify-full`.
- `--size-only` is even faster for triage, but it only means files have the same size; use it to narrow the search, not as proof.
- Symlinks are not followed by default to avoid surprises and cycles.
- Hard link groups are reported separately because they are multiple paths to the same inode, not extra disk copies.
- Hidden files and gitignored files are included; this is a disk scanner, not a source-code search tool.
- Fast mode does **not** read 30TB of file content. It reads metadata plus up to the hash window for same-size candidate files: for example, 30,000 candidate files at `1MiB` is about 30GiB of content reads.
- Fully verifying all 30TB in 10 minutes would require roughly 50GB/s sustained reads. `--verify-full` only fully reads candidate groups, but storage throughput is still the hard limit for exact verification.
- Progress output is written to stderr and reflects actual work: traversal shows live discovered counts because the total amount of traversal work is not known in advance, while hashing shows determinate byte progress based on actual reads. Progress is disabled automatically for `--json` and can also be disabled with `--no-progress`.
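
The default "possible duplicate" rule described in the notes above (same size, then same hash of the first window) can be sketched in Rust as follows. This is a minimal illustration, not the tool's implementation: the function names are made up for the example, and the standard library's `DefaultHasher` stands in for BLAKE3 so that the sketch needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Hash at most `window` bytes from the start of `path`.
/// Stand-in hasher (assumption): std's DefaultHasher, NOT BLAKE3.
fn prefix_hash(path: &Path, window: u64) -> io::Result<u64> {
    let mut prefix = Vec::new();
    File::open(path)?.take(window).read_to_end(&mut prefix)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&prefix);
    Ok(hasher.finish())
}

/// Group paths that share a size and a prefix hash.
/// The result is "possible duplicates" only: bytes past the window are never read.
fn possible_duplicates(paths: &[PathBuf], window: u64) -> io::Result<Vec<Vec<PathBuf>>> {
    // Pass 1: metadata only. Symlinks and special files are skipped, not followed.
    let mut by_size: HashMap<u64, Vec<&PathBuf>> = HashMap::new();
    for path in paths {
        let meta = std::fs::symlink_metadata(path)?;
        if !meta.is_file() {
            continue;
        }
        by_size.entry(meta.len()).or_default().push(path);
    }

    // Pass 2: read file content only where at least two files share a size.
    let mut groups = Vec::new();
    for same_size in by_size.into_values() {
        if same_size.len() < 2 {
            continue;
        }
        let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for path in same_size {
            by_hash
                .entry(prefix_hash(path, window)?)
                .or_default()
                .push(path.clone());
        }
        groups.extend(by_hash.into_values().filter(|group| group.len() > 1));
    }
    Ok(groups)
}
```

Two same-size files that differ only after the window land in the same group, which is why the results are called possible duplicates until a full-file pass confirms them.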
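
The read-budget figures in the notes above follow from simple arithmetic, reproduced here as a small Rust sketch. The function names are illustrative, and the fast-mode figure is a worst case that assumes every candidate file is at least as large as the hash window.

```rust
const MIB: u64 = 1024 * 1024;
const GIB: u64 = 1024 * MIB;

/// Worst-case bytes read in fast mode: one full hash window per candidate file.
fn fast_mode_read_bytes(candidates: u64, window: u64) -> u64 {
    candidates * window
}

/// Sustained throughput in bytes per second needed to read `total_bytes` in `seconds`.
fn required_throughput(total_bytes: u64, seconds: u64) -> u64 {
    total_bytes / seconds
}

/// Convert a byte count to GiB for display.
fn as_gib(bytes: u64) -> f64 {
    bytes as f64 / GIB as f64
}
```

With 30,000 candidates and a `1MiB` window, `as_gib(fast_mode_read_bytes(30_000, MIB))` comes to roughly 29.3, matching the "about 30GiB" figure. Reading 30 TB (30 × 10^12 bytes) in 600 seconds gives `required_throughput(30_000_000_000_000, 600)` = 50,000,000,000 bytes per second, i.e. 50GB/s.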