p94 — Repeated DNA Sequences

Source: LeetCode 187 · Medium · Topics: Hash Set, Rolling Hash, Bit Manipulation Companies (2024–2025): Google (medium), Amazon (medium), Bytedance (high) Loop position: The Rabin-Karp warm-up. Tests whether you can implement a rolling polynomial hash without copy-pasting it.

1. Quick Context

Find all 10-letter substrings of a DNA string (alphabet {A,C,G,T}) that appear more than once.

Three solutions:

Set of substrings: O(n·k) time & memory; trivial.
Bit-encoded fixed-width hash: since k=10 and alphabet is 4 chars, encode each substring in 20 bits (10 × 2 bits). Maintain rolling 20-bit integer. O(n) time, O(n) memory.
Polynomial rolling hash (Rabin-Karp): general technique for arbitrary alphabet and k. O(n) expected with collision handling.

2. LeetCode Link + Attempt Gate

🔗 https://leetcode.com/problems/repeated-dna-sequences/

STOP. 25-min timer. Implement bit-encoded version then rolling-hash version.

3. Prerequisite Concepts

Hash sets.
Sliding window.
Modular arithmetic (for rolling hash).
Bit manipulation.

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Find all duplicate 10-length windows.

3. Examples: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" → ["AAAAACCCCC", "CCCCCAAAAA"].

5. Brute: Set of all substrings, return duplicates. O(n·k).

6. Target: O(n) or O(n + #matches).

7. Pattern: Rolling hash.

8. Develop: see solution.py.

6. Deeper Insight

6.1 Bit-encoded fixed-width hash (the DNA-specific trick)

Map A→00, C→01, G→10, T→11. Each 10-mer fits in 20 bits. Sliding window:

mask = (1 << 20) - 1
h = (h << 2 | code[c]) & mask

After first 10 chars, h uniquely identifies the current 10-mer. Use a dict {hash → count} or set + seen.

6.2 Polynomial rolling hash (Rabin-Karp, general)

For arbitrary alphabet:

h = (h * BASE - s[i - k] * BASE^k + s[i]) % MOD

where BASE > alphabet size and MOD is a large prime (e.g., (1<<61) - 1).

Collision handling: hash matches → still verify substrings character-by-character. Use two hashes with different MODs for ultra-low collision probability (1/MOD²).

6.3 Why this is the canonical “rolling hash” lesson

Rolling hash underpins:

Rabin-Karp (substring search).
Karp-Rabin minimum-rotation.
Longest common substring via hashing + binary search on length.
Polynomial-hash-based suffix array construction.
Zobrist hashing (game-tree DP).

6.4 Mersenne primes for fast modulo

(1<<61) - 1 is Mersenne. x mod ((1<<61)-1) = (x & ((1<<61)-1)) + (x >> 61) then conditional subtract. Avoids slow %. Useful in performance-critical C++/Rust.

6.5 Hash families and adversarial inputs

Constant BASE and MOD can be exploited by adversarial inputs to force collisions. In competitive programming use randomized BASE chosen at runtime; in security-sensitive code use SipHash or a keyed cryptographic hash.

6.6 Family: rolling hash / substring repeats

LC 187 Repeated DNA (this).
LC 1044 Longest Duplicate Substring (rolling hash + binary search on length).
LC 28 strStr / KMP (Rabin-Karp alternative).
LC 1316 Distinct Echo Substrings.
LC 718 Maximum Length of Repeated Subarray (rolling hash variant).

7. Anti-Pattern Analysis

Use Python’s set of substrings without rolling hash — works on LC but signals you didn’t engage the problem.
Build hash from scratch every window — O(n·k); same as brute.
Forget collision verification — Rabin-Karp without verify is incorrect.
Use BASE = 4 with simple mod — works for k=10 here but breaks for general alphabets.
Use hash(str) builtin — Python’s hash is randomized per run; can’t be used as portable rolling hash.

8. Skills & Takeaways

Bit-encoded fixed-alphabet rolling.
Polynomial rolling hash with two-hash collision handling.
Recognizing when rolling hash applies.

9. When to Move On

Implement both bit-encoded and polynomial versions cold.
Cite LC 1044 (binary search + rolling hash).
Explain Mersenne-prime mod trick.

10. Company Context

Google: L5.
ByteDance: loves rolling hash.
Amazon: L5/L6.

Strong: “Bit-encoded since alphabet has 4 chars; 20-bit window suffices. For general case, polynomial rolling hash with double-hash collision verification.”

11. Interviewer’s Lens

Signal	Junior	Mid	Senior	Staff+
Set approach	Cold	Cold	Cold	Cold
Bit-encoded	None	After hint	Cold	+ explains uniqueness
Polynomial	None	None	Sketch	+ collision handling
Family	None	None	LC 1044	+ Mersenne + adversarial

12. Level Delta

Mid (L4): Set of substrings.
Senior (L5): Bit-encoded rolling.
Staff (L6): + polynomial rolling for general alphabet + double hash.
Principal (L7+): + Mersenne-prime mod + randomized BASE for adversarial safety + connection to fingerprinting / Bloom-filter false-positive analysis.

13. Follow-up Q&A

Q1: Generalize to arbitrary k and alphabet. A1: Polynomial rolling hash; collision verify.

Q2: Longest repeated substring (LC 1044). A2: Binary search on length L; for each L, rolling hash all length-L windows; if any repeats, L is achievable.

Q3: Probability of collision with single hash MOD ~ 10^9. A3: ~ n²/MOD by birthday. For n = 10^5, ~10 collisions; need verify or double hash.

Q4: Stream of DNA — find new repeats online. A4: Maintain dict {hash → first_index}; on each new window check membership.

Q5: Memory-bound k = 10^6, n = 10^9? A5: Bloom filter or Count-Min sketch for probabilistic dedup.

14. Solution Walkthrough

See solution.py:

find_repeated_set — naive set-based.
find_repeated_bit — bit-encoded rolling.
find_repeated_polynomial — polynomial rolling hash + verify.

15. Beyond the Problem

Principal code review: “Repeated DNA Sequences is the gateway to rolling hash, one of the most leveraged techniques in real systems. The Staff+ recognition: this is the same algorithm as rsync’s rolling checksum (Adler-32), content-defined chunking in deduplication systems (Restic, borg, Plan9’s Venti), Git’s pack-file delta compression, and the Karp-Rabin substring matcher used inside grep-like tools when patterns are long. The bit-encoded approach here is the special case where alphabet is small enough to encode in fixed bits — same trick used by bioinformatics tools (BLAST, minimap2) for k-mer indexing of genomes. Once you internalize ‘rolling hash for substring problems’, a whole class of problems becomes O(n) instead of O(n²) — and once you understand collision verification, you understand why production systems use double-hashing or cryptographic-grade rolling functions (SipHash for HashDoS resistance). The deeper lesson: hashing is a probabilistic data structure, and engineering with it requires reasoning about collision probability and adversarial inputs.”

LeetCode Interview — Extreme Coding