Lab 09 — File Deduplication

Goal

Find all groups of duplicate files in a directory tree using a three-stage filter: size → quick hash (first/last K bytes) → full content hash. After this lab you should be able to design and implement an efficient file deduper that minimizes I/O on huge directories in under 25 minutes.

Background Concepts

The naive approach — read every file and group by full hash — is correct but wastes I/O on files that are clearly not duplicates (different sizes). The classical optimization is a three-stage cascade:

  1. Group by size. Two files of different sizes cannot be duplicates. This is a stat call (cheap, no read).
  2. Group by quick hash. For each size-group with ≥2 files, hash the first and last K bytes (e.g., 4 KB each). Files with different head+tail hashes are not duplicates.
  3. Group by full hash. For each surviving group with ≥2 files, hash the full content. This is the only stage that does full reads.

This works because in real datasets, most files of the same size differ in their first/last few KB (think .docx files with embedded timestamps, video files with different headers, log files with different first lines). The cascade reduces total I/O by ~10–100×.

Interview Context

This is a popular practical problem at infrastructure / file-storage companies (Dropbox, Box, Google Drive, AWS S3 dedup). It tests I/O awareness, hashing fluency, and ability to design a multi-stage filter pipeline. The interviewer wants to hear “I’d minimize I/O by going size → quick-hash → full-hash” before any code.

Problem Statement

Design find_duplicates(root) -> list[list[Path]]:

  • Walk the directory tree under root.
  • Return groups (lists) of paths that have identical content. Each group has ≥ 2 paths.
  • Use the three-stage cascade.

Constraints

  • Up to 10^6 files
  • Up to 1 TB total bytes
  • Memory: ≤ 1 GB
  • Read budget: minimize bytes read (the goal of optimization)

Clarifying Questions

  1. Symlinks — follow or skip? (Skip by default to avoid loops; configurable.)
  2. Hidden files? (Include by default; let caller filter.)
  3. Empty files (size 0) — duplicates of each other? (Yes — include them as a group; or filter; ask.)
  4. Hash function? (hashlib.blake2b is fast; sha256 is the cryptographic default; md5 is fine for non-adversarial dedup. Pick non-cryptographic-fast for performance, cryptographic if the result is durable.)
  5. Concurrency? (Not strictly required; CPU is hashing, I/O is reads — both parallelize well. State as a follow-up.)

Examples

root/
├── a.txt    "hello"
├── b.txt    "hello"
├── c.txt    "world"
├── d.txt    "world"
└── e.txt    "different"

find_duplicates("root")
-> [[Path("root/a.txt"), Path("root/b.txt")],
    [Path("root/c.txt"), Path("root/d.txt")]]

Initial Brute Force

For every pair of files, compare byte-for-byte. O(N² · L) reads. Catastrophic at 10^6 files.

Brute Force Complexity

O(N²) pairwise comparisons, each O(L). At N=10^6 and L=1 MB, this is 10^18 byte comparisons — never completes.

Optimization Path

Three-stage cascade:

  1. Group by size: O(N) stat calls. Memory O(N · path).
  2. Quick-hash within each size group: O(K · |group|) per group; only run on size-groups with ≥2 files.
  3. Full-hash within each quick-hash group: O(L · |group|).

Total reads: most files are excluded at stage 1 (different sizes); of the rest, most are excluded at stage 2 (quick hash differs). Only true (or very near) duplicates get a full read.

Final Expected Approach

Walk the tree once, building dict[size, list[Path]]. For groups with len ≥ 2, build dict[quick_hash, list[Path]]. For surviving groups, build dict[full_hash, list[Path]]. Output groups with len ≥ 2.

Data Structures Used

StructureStage
dict[int, list[Path]]size grouping
dict[bytes, list[Path]]quick-hash grouping
dict[bytes, list[Path]]full-hash grouping

Correctness Argument

Soundness: every group output has identical full content (verified by full-hash equality, modulo collision probability of 2^-256 for SHA-256, negligible).

Completeness: two files A, B with identical content have:

  • equal size (so they survive stage 1)
  • equal quick-hash (so they survive stage 2)
  • equal full-hash (so they end up in the same output group)

Therefore A and B appear in the same output group. The cascade does not miss any duplicate.

Complexity

  • Time: O(N) for size grouping; O(K · N_size_dups) for quick-hash; O(L · N_full_dups) for full-hash. K ≪ L, N_full_dups ≪ N.
  • Space: O(N) for path bookkeeping.

Implementation Requirements

import os, hashlib
from pathlib import Path
from collections import defaultdict
from typing import Iterable

QUICK_BYTES = 4096   # 4 KB head + 4 KB tail

def _quick_hash(path: Path) -> bytes:
    h = hashlib.blake2b(digest_size=16)
    size = path.stat().st_size
    with open(path, "rb") as f:
        head = f.read(QUICK_BYTES)
        h.update(head)
        if size > 2 * QUICK_BYTES:
            f.seek(-QUICK_BYTES, os.SEEK_END)
            tail = f.read(QUICK_BYTES)
            h.update(tail)
    return h.digest()

def _full_hash(path: Path, chunk: int = 1 << 20) -> bytes:
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf: break
            h.update(buf)
    return h.digest()

def find_duplicates(root: str | Path,
                    follow_symlinks: bool = False,
                    include_empty: bool = False) -> list[list[Path]]:
    root = Path(root)
    by_size: dict[int, list[Path]] = defaultdict(list)
    for dirpath, _, files in os.walk(root, followlinks=follow_symlinks):
        for name in files:
            p = Path(dirpath) / name
            try:
                st = p.stat() if follow_symlinks else p.lstat()
                if not include_empty and st.st_size == 0: continue
                if not p.is_file(): continue
                by_size[st.st_size].append(p)
            except (OSError, PermissionError):
                continue

    candidates_after_size: list[list[Path]] = [g for g in by_size.values() if len(g) >= 2]

    by_quick: list[list[Path]] = []
    for group in candidates_after_size:
        sub: dict[bytes, list[Path]] = defaultdict(list)
        for p in group:
            try: sub[_quick_hash(p)].append(p)
            except OSError: continue
        for g in sub.values():
            if len(g) >= 2: by_quick.append(g)

    out: list[list[Path]] = []
    for group in by_quick:
        sub: dict[bytes, list[Path]] = defaultdict(list)
        for p in group:
            try: sub[_full_hash(p)].append(p)
            except OSError: continue
        for g in sub.values():
            if len(g) >= 2: out.append(g)

    return out

Tests

import unittest, tempfile, os
from pathlib import Path

class TestDedup(unittest.TestCase):
    def setUp(self):
        self.tmp = tempfile.TemporaryDirectory()
        self.root = Path(self.tmp.name)

    def tearDown(self):
        self.tmp.cleanup()

    def _w(self, name: str, content: bytes):
        p = self.root / name
        p.parent.mkdir(parents=True, exist_ok=True)
        p.write_bytes(content)
        return p

    def test_basic(self):
        a = self._w("a", b"hello")
        b = self._w("sub/b", b"hello")
        c = self._w("c", b"world")
        d = self._w("sub/d", b"world")
        e = self._w("e", b"unique")
        groups = find_duplicates(self.root)
        flat = sorted([sorted(map(str, g)) for g in groups])
        self.assertEqual(len(flat), 2)
        self.assertIn(sorted([str(a), str(b)]), flat)
        self.assertIn(sorted([str(c), str(d)]), flat)

    def test_size_excludes_non_duplicates(self):
        self._w("a", b"x" * 100)
        self._w("b", b"x" * 200)             # different size
        self.assertEqual(find_duplicates(self.root), [])

    def test_quick_hash_disambiguates(self):
        # Same size, different head: stage 2 separates them
        self._w("a", b"head1" + b"x" * 8000)
        self._w("b", b"head2" + b"x" * 8000)
        self.assertEqual(find_duplicates(self.root), [])

    def test_large_groups(self):
        for i in range(5):
            self._w(f"copy{i}", b"same content")
        groups = find_duplicates(self.root)
        self.assertEqual(len(groups), 1)
        self.assertEqual(len(groups[0]), 5)

Follow-up Questions

(3) Scale to N nodes? Distributed dedup over a fleet: size grouping is local on each node; for cross-node dedup, broadcast (size, quick_hash, node, path) tuples to a coordinator, group by (size, quick_hash), then have nodes that share a group exchange full hashes. The full read happens locally; only hashes (32 bytes) move over the network.

(4) Observe / monitor? Files scanned (counter), bytes read at each stage (gauge — quantifies the savings of the cascade), groups found (gauge), errors per stage (counter). The “bytes read at full-hash stage / total bytes” ratio is the key savings metric.

(8) Partial failure? A file deleted/replaced mid-scan: the second-stage hash may differ from the first-stage size or hash. Solutions: (a) treat the file as missing on OSError and skip, (b) snapshot the FS (LVM snapshot, ZFS, Btrfs) before scanning. (a) is the practical answer.

(13) Poison-pill input? A multi-TB file (or a sparse file with a 1 PB apparent size) blows up the full-read stage. Mitigation: cap files by size (skip if > max_file_size), or use Merkle-tree chunked hashing where each chunk is independent and partial reads can be cached/resumed.

(11) Configuration knobs? quick_bytes (default 4 KB), chunk_size for full read, max_file_size, follow_symlinks, include_empty. Knobs not to expose: hash algorithm (pick BLAKE2b for speed unless a contract requires SHA-256).

Product Extension

fdupes, rmlint, jdupes are the canonical Linux tools and use exactly this cascade. Dropbox’s chunked dedup operates at the 4 MB block level (each file is split into chunks, each chunk hashed and deduplicated separately) — letting two files that share a 1 MB prefix store the prefix once. ZFS dedup operates at the block level too. The “full-file dedup” you wrote here is the simplest version; production systems often go to chunk-level dedup for higher savings.

Language/Runtime Follow-ups

  • Python: os.walk is the standard; pathlib.Path.rglob('*') is more idiomatic but slower. hashlib.blake2b is in stdlib and ~3x faster than SHA-256.
  • Java: Files.walk(Path) is the equivalent; MessageDigest for hashing. BLAKE2 requires BouncyCastle.
  • Go: filepath.Walk (deprecated for filepath.WalkDir). crypto/sha256 is fast; golang.org/x/crypto/blake2b for BLAKE2.
  • C++: std::filesystem::recursive_directory_iterator. OpenSSL’s EVP_DigestUpdate for hashing.
  • JS/TS: fs.promises.readdir(dir, {recursive: true}) (Node 20+). crypto.createHash('blake2b512') for hashing.

Common Bugs

  1. Treating symlinks as files — infinite loops or duplicate phantom matches. Use lstat and explicit is_file() check.
  2. Forgetting to filter size-1 groups before stage 2 (running quick-hash on isolated files wastes I/O).
  3. Hashing with md5 and getting bitten by collision in adversarial datasets. Use SHA-256 or BLAKE2b.
  4. Reading the entire file into memory at full-hash stage instead of streaming. OOM on large files.
  5. Not handling permission errors — first OSError halts the entire scan instead of skipping the file.

Debugging Strategy

For “missed duplicates”: verify with cmp (Unix) or byte-by-byte. Print the size/quick-hash/full-hash of both files. Most missed-dup bugs are quick-hash logic (e.g., not seeking to end correctly for files smaller than 2 * QUICK_BYTES).

Mastery Criteria

  • Articulated the three-stage cascade in <60 seconds before coding.
  • Implemented in <25 minutes; all four tests pass.
  • Stated bytes-read savings is the main optimization signal, not wall-clock.
  • Answered follow-ups #3, #4, #8, #13 crisply.
  • Compared full-file vs chunk-level dedup correctly.
  • Identified BLAKE2b as the right hash for performance.