p94 — Repeated DNA Sequences
Source: LeetCode 187 · Medium · Topics: Hash Set, Rolling Hash, Bit Manipulation Companies (2024–2025): Google (medium), Amazon (medium), Bytedance (high) Loop position: The Rabin-Karp warm-up. Tests whether you can implement a rolling polynomial hash without copy-pasting it.
1. Quick Context
Find all 10-letter substrings of a DNA string (alphabet {A,C,G,T}) that appear more than once.
Three solutions:
- Set of substrings: O(n·k) time & memory; trivial.
- Bit-encoded fixed-width hash: since k=10 and alphabet is 4 chars, encode each substring in 20 bits (10 × 2 bits). Maintain rolling 20-bit integer. O(n) time, O(n) memory.
- Polynomial rolling hash (Rabin-Karp): general technique for arbitrary alphabet and k. O(n) expected with collision handling.
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/repeated-dna-sequences/
STOP. 25-min timer. Implement bit-encoded version then rolling-hash version.
3. Prerequisite Concepts
- Hash sets.
- Sliding window.
- Modular arithmetic (for rolling hash).
- Bit manipulation.
4. How to Approach (FRAMEWORK 1–9)
1. Restate: Find all duplicate 10-length windows.
3. Examples: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" → ["AAAAACCCCC", "CCCCCAAAAA"].
5. Brute: Set of all substrings, return duplicates. O(n·k).
6. Target: O(n) or O(n + #matches).
7. Pattern: Rolling hash.
8. Develop: see solution.py.
5. Progressive Hints
See hints.md.
6. Deeper Insight
6.1 Bit-encoded fixed-width hash (the DNA-specific trick)
Map A→00, C→01, G→10, T→11. Each 10-mer fits in 20 bits. Sliding window:
mask = (1 << 20) - 1
h = (h << 2 | code[c]) & mask
After first 10 chars, h uniquely identifies the current 10-mer. Use a dict {hash → count} or set + seen.
6.2 Polynomial rolling hash (Rabin-Karp, general)
For arbitrary alphabet:
h = (h * BASE - s[i - k] * BASE^k + s[i]) % MOD
where BASE > alphabet size and MOD is a large prime (e.g., (1<<61) - 1).
Collision handling: hash matches → still verify substrings character-by-character. Use two hashes with different MODs for ultra-low collision probability (1/MOD²).
6.3 Why this is the canonical “rolling hash” lesson
Rolling hash underpins:
- Rabin-Karp (substring search).
- Karp-Rabin minimum-rotation.
- Longest common substring via hashing + binary search on length.
- Polynomial-hash-based suffix array construction.
- Zobrist hashing (game-tree DP).
6.4 Mersenne primes for fast modulo
(1<<61) - 1 is Mersenne. x mod ((1<<61)-1) = (x & ((1<<61)-1)) + (x >> 61) then conditional subtract. Avoids slow %. Useful in performance-critical C++/Rust.
6.5 Hash families and adversarial inputs
Constant BASE and MOD can be exploited by adversarial inputs to force collisions. In competitive programming use randomized BASE chosen at runtime; in security-sensitive code use SipHash or a keyed cryptographic hash.
6.6 Family: rolling hash / substring repeats
- LC 187 Repeated DNA (this).
- LC 1044 Longest Duplicate Substring (rolling hash + binary search on length).
- LC 28 strStr / KMP (Rabin-Karp alternative).
- LC 1316 Distinct Echo Substrings.
- LC 718 Maximum Length of Repeated Subarray (rolling hash variant).
7. Anti-Pattern Analysis
- Use Python’s set of substrings without rolling hash — works on LC but signals you didn’t engage the problem.
- Build hash from scratch every window — O(n·k); same as brute.
- Forget collision verification — Rabin-Karp without verify is incorrect.
- Use BASE = 4 with simple mod — works for k=10 here but breaks for general alphabets.
- Use
hash(str)builtin — Python’s hash is randomized per run; can’t be used as portable rolling hash.
8. Skills & Takeaways
- Bit-encoded fixed-alphabet rolling.
- Polynomial rolling hash with two-hash collision handling.
- Recognizing when rolling hash applies.
9. When to Move On
- Implement both bit-encoded and polynomial versions cold.
- Cite LC 1044 (binary search + rolling hash).
- Explain Mersenne-prime mod trick.
10. Company Context
- Google: L5.
- ByteDance: loves rolling hash.
- Amazon: L5/L6.
Strong: “Bit-encoded since alphabet has 4 chars; 20-bit window suffices. For general case, polynomial rolling hash with double-hash collision verification.”
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff+ |
|---|---|---|---|---|
| Set approach | Cold | Cold | Cold | Cold |
| Bit-encoded | None | After hint | Cold | + explains uniqueness |
| Polynomial | None | None | Sketch | + collision handling |
| Family | None | None | LC 1044 | + Mersenne + adversarial |
12. Level Delta
- Mid (L4): Set of substrings.
- Senior (L5): Bit-encoded rolling.
- Staff (L6): + polynomial rolling for general alphabet + double hash.
- Principal (L7+): + Mersenne-prime mod + randomized BASE for adversarial safety + connection to fingerprinting / Bloom-filter false-positive analysis.
13. Follow-up Q&A
Q1: Generalize to arbitrary k and alphabet. A1: Polynomial rolling hash; collision verify.
Q2: Longest repeated substring (LC 1044). A2: Binary search on length L; for each L, rolling hash all length-L windows; if any repeats, L is achievable.
Q3: Probability of collision with single hash MOD ~ 10^9. A3: ~ n²/MOD by birthday. For n = 10^5, ~10 collisions; need verify or double hash.
Q4: Stream of DNA — find new repeats online. A4: Maintain dict {hash → first_index}; on each new window check membership.
Q5: Memory-bound k = 10^6, n = 10^9? A5: Bloom filter or Count-Min sketch for probabilistic dedup.
14. Solution Walkthrough
See solution.py:
find_repeated_set— naive set-based.find_repeated_bit— bit-encoded rolling.find_repeated_polynomial— polynomial rolling hash + verify.
15. Beyond the Problem
Principal code review: “Repeated DNA Sequences is the gateway to rolling hash, one of the most leveraged techniques in real systems. The Staff+ recognition: this is the same algorithm as rsync’s rolling checksum (Adler-32), content-defined chunking in deduplication systems (Restic, borg, Plan9’s Venti), Git’s pack-file delta compression, and the Karp-Rabin substring matcher used inside grep-like tools when patterns are long. The bit-encoded approach here is the special case where alphabet is small enough to encode in fixed bits — same trick used by bioinformatics tools (BLAST, minimap2) for k-mer indexing of genomes. Once you internalize ‘rolling hash for substring problems’, a whole class of problems becomes O(n) instead of O(n²) — and once you understand collision verification, you understand why production systems use double-hashing or cryptographic-grade rolling functions (SipHash for HashDoS resistance). The deeper lesson: hashing is a probabilistic data structure, and engineering with it requires reasoning about collision probability and adversarial inputs.”