Lab 09 — File Deduplication
Goal
Find all groups of duplicate files in a directory tree using a three-stage filter: size → quick hash (first/last K bytes) → full content hash. After this lab you should be able to design and implement an efficient file deduper that minimizes I/O on huge directories in under 25 minutes.
Background Concepts
The naive approach — read every file and group by full hash — is correct but wastes I/O on files that are clearly not duplicates (different sizes). The classical optimization is a three-stage cascade:
- Group by size. Two files of different sizes cannot be duplicates. This is a
statcall (cheap, no read). - Group by quick hash. For each size-group with ≥2 files, hash the first and last K bytes (e.g., 4 KB each). Files with different head+tail hashes are not duplicates.
- Group by full hash. For each surviving group with ≥2 files, hash the full content. This is the only stage that does full reads.
This works because in real datasets, most files of the same size differ in their first/last few KB (think .docx files with embedded timestamps, video files with different headers, log files with different first lines). The cascade reduces total I/O by ~10–100×.
Interview Context
This is a popular practical problem at infrastructure / file-storage companies (Dropbox, Box, Google Drive, AWS S3 dedup). It tests I/O awareness, hashing fluency, and ability to design a multi-stage filter pipeline. The interviewer wants to hear “I’d minimize I/O by going size → quick-hash → full-hash” before any code.
Problem Statement
Design find_duplicates(root) -> list[list[Path]]:
- Walk the directory tree under
root. - Return groups (lists) of paths that have identical content. Each group has ≥ 2 paths.
- Use the three-stage cascade.
Constraints
- Up to 10^6 files
- Up to 1 TB total bytes
- Memory: ≤ 1 GB
- Read budget: minimize bytes read (the goal of optimization)
Clarifying Questions
- Symlinks — follow or skip? (Skip by default to avoid loops; configurable.)
- Hidden files? (Include by default; let caller filter.)
- Empty files (size 0) — duplicates of each other? (Yes — include them as a group; or filter; ask.)
- Hash function? (
hashlib.blake2bis fast;sha256is the cryptographic default;md5is fine for non-adversarial dedup. Pick non-cryptographic-fast for performance, cryptographic if the result is durable.) - Concurrency? (Not strictly required; CPU is hashing, I/O is reads — both parallelize well. State as a follow-up.)
Examples
root/
├── a.txt "hello"
├── b.txt "hello"
├── c.txt "world"
├── d.txt "world"
└── e.txt "different"
find_duplicates("root")
-> [[Path("root/a.txt"), Path("root/b.txt")],
[Path("root/c.txt"), Path("root/d.txt")]]
Initial Brute Force
For every pair of files, compare byte-for-byte. O(N² · L) reads. Catastrophic at 10^6 files.
Brute Force Complexity
O(N²) pairwise comparisons, each O(L). At N=10^6 and L=1 MB, this is 10^18 byte comparisons — never completes.
Optimization Path
Three-stage cascade:
- Group by size: O(N)
statcalls. Memory O(N · path). - Quick-hash within each size group: O(K · |group|) per group; only run on size-groups with ≥2 files.
- Full-hash within each quick-hash group: O(L · |group|).
Total reads: most files are excluded at stage 1 (different sizes); of the rest, most are excluded at stage 2 (quick hash differs). Only true (or very near) duplicates get a full read.
Final Expected Approach
Walk the tree once, building dict[size, list[Path]]. For groups with len ≥ 2, build dict[quick_hash, list[Path]]. For surviving groups, build dict[full_hash, list[Path]]. Output groups with len ≥ 2.
Data Structures Used
| Structure | Stage |
|---|---|
dict[int, list[Path]] | size grouping |
dict[bytes, list[Path]] | quick-hash grouping |
dict[bytes, list[Path]] | full-hash grouping |
Correctness Argument
Soundness: every group output has identical full content (verified by full-hash equality, modulo collision probability of 2^-256 for SHA-256, negligible).
Completeness: two files A, B with identical content have:
- equal size (so they survive stage 1)
- equal quick-hash (so they survive stage 2)
- equal full-hash (so they end up in the same output group)
Therefore A and B appear in the same output group. The cascade does not miss any duplicate.
Complexity
- Time: O(N) for size grouping; O(K · N_size_dups) for quick-hash; O(L · N_full_dups) for full-hash. K ≪ L, N_full_dups ≪ N.
- Space: O(N) for path bookkeeping.
Implementation Requirements
import os, hashlib
from pathlib import Path
from collections import defaultdict
from typing import Iterable
QUICK_BYTES = 4096 # 4 KB head + 4 KB tail
def _quick_hash(path: Path) -> bytes:
h = hashlib.blake2b(digest_size=16)
size = path.stat().st_size
with open(path, "rb") as f:
head = f.read(QUICK_BYTES)
h.update(head)
if size > 2 * QUICK_BYTES:
f.seek(-QUICK_BYTES, os.SEEK_END)
tail = f.read(QUICK_BYTES)
h.update(tail)
return h.digest()
def _full_hash(path: Path, chunk: int = 1 << 20) -> bytes:
h = hashlib.blake2b(digest_size=32)
with open(path, "rb") as f:
while True:
buf = f.read(chunk)
if not buf: break
h.update(buf)
return h.digest()
def find_duplicates(root: str | Path,
follow_symlinks: bool = False,
include_empty: bool = False) -> list[list[Path]]:
root = Path(root)
by_size: dict[int, list[Path]] = defaultdict(list)
for dirpath, _, files in os.walk(root, followlinks=follow_symlinks):
for name in files:
p = Path(dirpath) / name
try:
st = p.stat() if follow_symlinks else p.lstat()
if not include_empty and st.st_size == 0: continue
if not p.is_file(): continue
by_size[st.st_size].append(p)
except (OSError, PermissionError):
continue
candidates_after_size: list[list[Path]] = [g for g in by_size.values() if len(g) >= 2]
by_quick: list[list[Path]] = []
for group in candidates_after_size:
sub: dict[bytes, list[Path]] = defaultdict(list)
for p in group:
try: sub[_quick_hash(p)].append(p)
except OSError: continue
for g in sub.values():
if len(g) >= 2: by_quick.append(g)
out: list[list[Path]] = []
for group in by_quick:
sub: dict[bytes, list[Path]] = defaultdict(list)
for p in group:
try: sub[_full_hash(p)].append(p)
except OSError: continue
for g in sub.values():
if len(g) >= 2: out.append(g)
return out
Tests
import unittest, tempfile, os
from pathlib import Path
class TestDedup(unittest.TestCase):
def setUp(self):
self.tmp = tempfile.TemporaryDirectory()
self.root = Path(self.tmp.name)
def tearDown(self):
self.tmp.cleanup()
def _w(self, name: str, content: bytes):
p = self.root / name
p.parent.mkdir(parents=True, exist_ok=True)
p.write_bytes(content)
return p
def test_basic(self):
a = self._w("a", b"hello")
b = self._w("sub/b", b"hello")
c = self._w("c", b"world")
d = self._w("sub/d", b"world")
e = self._w("e", b"unique")
groups = find_duplicates(self.root)
flat = sorted([sorted(map(str, g)) for g in groups])
self.assertEqual(len(flat), 2)
self.assertIn(sorted([str(a), str(b)]), flat)
self.assertIn(sorted([str(c), str(d)]), flat)
def test_size_excludes_non_duplicates(self):
self._w("a", b"x" * 100)
self._w("b", b"x" * 200) # different size
self.assertEqual(find_duplicates(self.root), [])
def test_quick_hash_disambiguates(self):
# Same size, different head: stage 2 separates them
self._w("a", b"head1" + b"x" * 8000)
self._w("b", b"head2" + b"x" * 8000)
self.assertEqual(find_duplicates(self.root), [])
def test_large_groups(self):
for i in range(5):
self._w(f"copy{i}", b"same content")
groups = find_duplicates(self.root)
self.assertEqual(len(groups), 1)
self.assertEqual(len(groups[0]), 5)
Follow-up Questions
(3) Scale to N nodes? Distributed dedup over a fleet: size grouping is local on each node; for cross-node dedup, broadcast (size, quick_hash, node, path) tuples to a coordinator, group by (size, quick_hash), then have nodes that share a group exchange full hashes. The full read happens locally; only hashes (32 bytes) move over the network.
(4) Observe / monitor? Files scanned (counter), bytes read at each stage (gauge — quantifies the savings of the cascade), groups found (gauge), errors per stage (counter). The “bytes read at full-hash stage / total bytes” ratio is the key savings metric.
(8) Partial failure? A file deleted/replaced mid-scan: the second-stage hash may differ from the first-stage size or hash. Solutions: (a) treat the file as missing on OSError and skip, (b) snapshot the FS (LVM snapshot, ZFS, Btrfs) before scanning. (a) is the practical answer.
(13) Poison-pill input? A multi-TB file (or a sparse file with a 1 PB apparent size) blows up the full-read stage. Mitigation: cap files by size (skip if > max_file_size), or use Merkle-tree chunked hashing where each chunk is independent and partial reads can be cached/resumed.
(11) Configuration knobs? quick_bytes (default 4 KB), chunk_size for full read, max_file_size, follow_symlinks, include_empty. Knobs not to expose: hash algorithm (pick BLAKE2b for speed unless a contract requires SHA-256).
Product Extension
fdupes, rmlint, jdupes are the canonical Linux tools and use exactly this cascade. Dropbox’s chunked dedup operates at the 4 MB block level (each file is split into chunks, each chunk hashed and deduplicated separately) — letting two files that share a 1 MB prefix store the prefix once. ZFS dedup operates at the block level too. The “full-file dedup” you wrote here is the simplest version; production systems often go to chunk-level dedup for higher savings.
Language/Runtime Follow-ups
- Python:
os.walkis the standard;pathlib.Path.rglob('*')is more idiomatic but slower.hashlib.blake2bis in stdlib and ~3x faster than SHA-256. - Java:
Files.walk(Path)is the equivalent;MessageDigestfor hashing.BLAKE2requires BouncyCastle. - Go:
filepath.Walk(deprecated forfilepath.WalkDir).crypto/sha256is fast;golang.org/x/crypto/blake2bfor BLAKE2. - C++:
std::filesystem::recursive_directory_iterator. OpenSSL’sEVP_DigestUpdatefor hashing. - JS/TS:
fs.promises.readdir(dir, {recursive: true})(Node 20+).crypto.createHash('blake2b512')for hashing.
Common Bugs
- Treating symlinks as files — infinite loops or duplicate phantom matches. Use
lstatand explicitis_file()check. - Forgetting to filter size-1 groups before stage 2 (running quick-hash on isolated files wastes I/O).
- Hashing with
md5and getting bitten by collision in adversarial datasets. Use SHA-256 or BLAKE2b. - Reading the entire file into memory at full-hash stage instead of streaming. OOM on large files.
- Not handling permission errors — first
OSErrorhalts the entire scan instead of skipping the file.
Debugging Strategy
For “missed duplicates”: verify with cmp (Unix) or byte-by-byte. Print the size/quick-hash/full-hash of both files. Most missed-dup bugs are quick-hash logic (e.g., not seeking to end correctly for files smaller than 2 * QUICK_BYTES).
Mastery Criteria
- Articulated the three-stage cascade in <60 seconds before coding.
- Implemented in <25 minutes; all four tests pass.
- Stated bytes-read savings is the main optimization signal, not wall-clock.
- Answered follow-ups #3, #4, #8, #13 crisply.
- Compared full-file vs chunk-level dedup correctly.
- Identified BLAKE2b as the right hash for performance.