# disk-checker

Fast Ubuntu-friendly CLI for scanning folders, checking file sizes, hashing the first chunk of same-size files, and reporting possible duplicates plus symlinks, hard links, special files, and scan errors.

## Install Rust on Ubuntu

```bash
sudo apt update
sudo apt install -y build-essential curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
```

## Build

```bash
cargo build --release
```

The binary will be at:

```bash
target/release/disk-checker
```

## Usage

Scan the current directory:

```bash
disk-checker
```

Scan one or more paths:

```bash
disk-checker ~/Downloads /mnt/shared
```

Use JSON for scripts:

```bash
disk-checker ~/Downloads --json
```

Hash a larger first chunk before grouping possible duplicates:

```bash
disk-checker ~/Downloads --hash-bytes 8MiB
```

Follow symlinks while still reporting them separately:

```bash
disk-checker ~/Downloads --follow-links
```

Verify possible duplicates with a full-file hash pass:

```bash
disk-checker ~/Downloads --verify-full
```

Review duplicate groups one by one and choose which path to keep:

```bash
disk-checker ~/Downloads --verify-full --interactive
```

Interactive mode requires `--verify-full` and is non-destructive: it writes a reviewed shell deletion plan instead of deleting files immediately.

```bash
disk-checker ~/Downloads --verify-full --interactive --delete-plan review-delete.sh
```

Use the fastest triage mode for huge datasets by grouping same-size files without hashing:

```bash
disk-checker /mnt/storage --size-only --min-size 100MiB --threads 32
```

Limit traversal depth:

```bash
disk-checker /mnt/storage --max-depth 3
```

Limit scanning and hashing workers:

```bash
disk-checker ~/Downloads --threads 4
```

Disable progress output:

```bash
disk-checker ~/Downloads --no-progress
```

## Notes

- By default, duplicate results are **possible duplicates**: same file size plus same first `1MiB` BLAKE3 hash.
- This is intentionally fast because it avoids reading whole files unless you pass `--verify-full`.
- `--size-only` is even faster for triage, but it only means files have the same size; use it to narrow the search, not as proof.
- Symlinks are not followed by default to avoid surprises and cycles.
- Hard link groups are reported separately because they are multiple paths to the same inode, not extra disk copies.
- Hidden files and gitignored files are included; this is a disk scanner, not a source-code search tool.
- Fast mode does **not** read 30TB of file content. It reads metadata plus up to the hash window for same-size candidate files: for example, 30,000 candidate files at `1MiB` is about 30GiB of content reads.
- Fully verifying all 30TB in 10 minutes would require roughly 50GB/s sustained reads. `--verify-full` only fully reads candidate groups, but storage throughput is still the hard limit for exact verification.
- Progress output is real and writes to stderr: traversal shows live discovered counts because total traversal work is unknown, while hashing shows determinate byte progress from actual reads. Progress is disabled automatically for `--json` and can be disabled with `--no-progress`.
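
The read-volume and throughput figures in the notes above can be reproduced with a few lines of arithmetic. The following is a rough sketch, not code from disk-checker itself: the function names `fast_mode_read_bytes` and `required_throughput` are hypothetical, the 100 MiB candidate size is made up for illustration, and it assumes that a file smaller than the hash window contributes only its own size to the total.

```rust
/// Upper bound on content bytes read in fast mode (hypothetical helper):
/// each same-size candidate contributes at most `hash_window` bytes.
/// Assumption: a file smaller than the window is read in full.
fn fast_mode_read_bytes(candidate_sizes: &[u64], hash_window: u64) -> u64 {
    candidate_sizes
        .iter()
        .map(|&size| size.min(hash_window))
        .sum()
}

/// Sustained throughput in bytes per second needed to read
/// `total_bytes` within `seconds` (hypothetical helper).
fn required_throughput(total_bytes: u64, seconds: u64) -> f64 {
    total_bytes as f64 / seconds as f64
}

fn main() {
    const MIB: u64 = 1 << 20;
    const GIB: u64 = 1 << 30;
    const TB: u64 = 1_000_000_000_000; // decimal terabyte

    // 30,000 illustrative candidates, each larger than the 1 MiB window.
    let candidates = vec![100 * MIB; 30_000];
    let read = fast_mode_read_bytes(&candidates, MIB);
    // prints "fast mode reads about 29.3 GiB"
    println!("fast mode reads about {:.1} GiB", read as f64 / GIB as f64);

    // Reading all 30 TB within 10 minutes (600 seconds).
    let rate = required_throughput(30 * TB, 10 * 60);
    // prints "full read of 30 TB in 10 min needs 50 GB/s"
    println!("full read of 30 TB in 10 min needs {:.0} GB/s", rate / 1e9);
}
```

The first figure scales with the number of candidates and the `--hash-bytes` window rather than with total data size, which is why fast mode stays cheap on large volumes. The second depends only on total bytes and the time budget, so no amount of hashing speed removes the storage throughput limit.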