Lab 06 — Rolling Hash (Rabin–Karp + Double Hashing)
Goal
Implement a polynomial rolling hash with two independent (base, mod) pairs. Use it to (a) find repeated substrings of a fixed length (LC 187 — Repeated DNA Sequences) and (b) find the longest duplicate substring via binary search on length (LC 1044 — Longest Duplicate Substring). Internalize the modular arithmetic well enough to avoid collisions on adversarial inputs.
Background Concepts
A polynomial rolling hash treats a string as a base-b number mod p:
H(S) = (S[0] · b^(L-1) + S[1] · b^(L-2) + ... + S[L-1]) mod p
The “rolling” part: when you slide the window from S[i..i+L-1] to S[i+1..i+L], you update in O(1):
H_new = ((H_old - S[i] · b^(L-1)) · b + S[i+L]) mod p
For substring equality from prefix hashes, precompute pref[i] = (S[0] · b^(i-1) + ... + S[i-1]) mod p. Then H(S[l..r]) = (pref[r+1] - pref[l] · b^(r - l + 1)) mod p.
Single hash collision risk. A single mod-p hash has birthday-paradox collision probability ~k²/(2p) for k strings. At p ~ 10⁹ and k = 10⁵, that’s ~5% chance of collision. Adversarial inputs constructed to collide on a fixed (b, p) make single hashing unsafe.
Double hashing uses two independent (b₁, p₁) and (b₂, p₂); two strings collide on both with probability ~k²/(p₁ · p₂) ≈ 10⁻⁹ for ~10⁵ strings. Effectively safe for interviews. Anti-hash-resistant code uses random bases per run.
Interview Context
Rolling hash shows up whenever the brute-force algorithm involves “compare every substring of length L to every other” — a quadratic-in-N algorithm. The reduction is “convert string equality to integer equality, get O(1) per comparison”. Asked at: Google (frequent — duplicate detection, plagiarism), search-infra teams, biotech companies (DNA sequence problems), Stripe.
The signal: “many substring equality checks”, “longest repeated”, “find duplicates of length L”.
Problem Statement A (LC 187)
Find all 10-letter sequences that occur more than once in a DNA string s over {A, C, G, T}. Return them in any order.
Problem Statement B (LC 1044)
Given a string s, find the longest duplicated substring (any substring of length ≥ 1 that appears at least twice — overlaps allowed). Return any longest. If no duplicates, return "".
Constraints
- LC 187: 1 ≤ |s| ≤ 10⁵; alphabet
ACGT. - LC 1044: 2 ≤ |s| ≤ 3 × 10⁴; lowercase ASCII.
Clarifying Questions
- Overlaps allowed? (Yes — both problems.)
- Single longest, or all? (LC 1044: any one longest.)
- Case sensitivity / encoding? (As given by problem.)
- May we use suffix array / suffix automaton? (Yes, but the lab assignment is rolling hash.)
Examples
LC 187: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" → ["AAAAACCCCC", "CCCCCAAAAA"].
LC 1044: s = "banana" → "ana".
Initial Brute Force
LC 187: hashtable of every length-10 substring. O(N · L) time, O(N · L) memory — already passes.
LC 1044: for each L from N-1 down to 1, scan all length-L substrings and check duplicates. O(N² · L) at worst — 3 × 10¹² at the limits. TLE.
Brute Force Complexity
LC 187: O(N · L) acceptable. LC 1044: O(N²) at minimum, O(N³) naively. TLE.
Optimization Path
LC 187 with rolling hash: O(N) amortized per scan, O(N) total. Useful when L is large; for L = 10 the brute force is fine, but rolling hash demonstrates the technique.
LC 1044 with rolling hash + binary search on L: outer loop binary-searches L in [1, N-1]; inner loop hashes every length-L substring and looks up duplicates in a dict. O(N log N) total, with double-hashing or hash+verify.
Final Expected Approach
def has_duplicate_of_length(s, L):
# Returns the starting index of a duplicate of length L, or -1.
h = 0; pow_L = 1
seen = {}
for i in range(L):
h = (h * BASE + ord(s[i])) % MOD
if i: pow_L = (pow_L * BASE) % MOD
seen[h] = 0
for i in range(1, len(s) - L + 1):
h = ((h - ord(s[i-1]) * pow_L) * BASE + ord(s[i + L - 1])) % MOD
if h in seen:
# verify (or use double-hash) to avoid false-positive
if s[seen[h]:seen[h]+L] == s[i:i+L]:
return i
else:
seen[h] = i
return -1
# Binary search L in [1, N-1].
For LC 187, just collect a multiset of length-10 hashes; output any whose count > 1 (and verify).
Data Structures Used
- Two integer mods (~10⁹ primes), two random bases.
- Dict from hash (or hash-pair) → starting index.
- Precomputed power array
pow_b[i].
Correctness Argument
If two strings have the same content, their polynomial hash is equal — exact, no probability. The other direction (same hash → same content) is not guaranteed; this is why we verify on collision (or use double-hashing for ~10⁻¹⁸ collision probability).
For LC 1044, monotonicity: if a duplicate of length L exists, every L’ ≤ L has a duplicate (a prefix of one of the duplicate occurrences). So the answer set {L : duplicate exists} is a prefix [1, L*], and binary search finds L* in O(log N) calls of has_duplicate_of_length.
Complexity
| Operation | Time | Space |
|---|---|---|
| Single-length scan | O(N) amortized | O(N) |
| LC 187 | O(N · L) | O(N · L) |
| LC 1044 (binary search) | O(N log N) | O(N) |
Verifying on collision adds an O(L) hit per match; with double-hashing it’s negligible.
Implementation Requirements
import random
class RollingHash:
def __init__(self, s):
self.n = len(s)
# Use two (base, mod) pairs.
self.MOD1, self.MOD2 = (1 << 31) - 1, (1 << 61) - 1
self.B1 = random.randint(27, self.MOD1 - 1)
self.B2 = random.randint(27, self.MOD2 - 1)
self.h1 = [0] * (self.n + 1)
self.h2 = [0] * (self.n + 1)
self.p1 = [1] * (self.n + 1)
self.p2 = [1] * (self.n + 1)
for i, c in enumerate(s):
v = ord(c)
self.h1[i+1] = (self.h1[i] * self.B1 + v) % self.MOD1
self.h2[i+1] = (self.h2[i] * self.B2 + v) % self.MOD2
self.p1[i+1] = (self.p1[i] * self.B1) % self.MOD1
self.p2[i+1] = (self.p2[i] * self.B2) % self.MOD2
def hash_pair(self, l, r): # [l, r)
a = (self.h1[r] - self.h1[l] * self.p1[r - l]) % self.MOD1
b = (self.h2[r] - self.h2[l] * self.p2[r - l]) % self.MOD2
return (a, b)
def longestDupSubstring(s):
n = len(s)
rh = RollingHash(s)
def find(L):
seen = {}
for i in range(n - L + 1):
h = rh.hash_pair(i, i + L)
if h in seen: return i
seen[h] = i
return -1
lo, hi, best_start, best_len = 1, n - 1, 0, 0
while lo <= hi:
mid = (lo + hi) // 2
i = find(mid)
if i != -1:
best_start, best_len = i, mid
lo = mid + 1
else:
hi = mid - 1
return s[best_start:best_start + best_len]
Tests
- LC 187 sample →
["AAAAACCCCC", "CCCCCAAAAA"]. - LC 1044
"banana" → "ana". - LC 1044
"abcd"→"". - LC 1044
"aaaaaa"→"aaaaa"(overlapping duplicates). - LC 1044
"abab"→"ab". - Single-hash adversary: paste a Codeforces anti-hash test for
(b=31, p=10^9+7); verify your double-hash survives. - Stress: 1000 random strings of length 100, compare LC 1044 result to a brute-force suffix-array approach.
Follow-up Questions
- “What if I demand zero false positives?” → either verify on every match (O(L) extra) or use suffix array / suffix automaton (deterministic O(N log N) or O(N)).
- “Multiple texts share patterns.” → hash all, group by hash-pair, verify within group.
- “Stream the text.” → keep a rolling hash; emit matches online; constant memory beyond the dictionary.
- “Distinguish substrings cyclically equal.” → hash min-rotation (Booth’s algorithm) or all rotations.
- “Avoid Python big-int slowdown.” → use
(1 << 61) - 1(Mersenne prime) and bitwise reduction; or numpy.
Product Extension
Plagiarism detection over a corpus: shingle each document into k-grams, hash each shingle, store (doc_id, hash) tuples; near-duplicate documents share many shingles. MinHash + LSH (locality-sensitive hashing) is the production technology, built on the same rolling-hash foundation. Snapchat / Imgur-style “hash this image” reuses the same logic with perceptual hashes.
Language/Runtime Follow-ups
- Python: integers are bignums; prefer mods that fit in 63 bits to keep ops fast. Avoid
pow(b, k, m)in hot loops — precompute pow tables. - Java: use
longfor mod ~10⁹ to keepb · h + cfrom overflowingint. For mod 2⁶¹-1 you needMath.floorModand 128-bit (Math.multiplyHighin JDK 9+). - Go:
uint64with mod 2⁶¹-1 lets you(x * y) >> 61reduce. Idiomatic CP technique. - C++:
__int128for mod 2⁶¹-1 multiplications. Otherwiseunsigned long long. - JS/TS:
Numberonly safe for integers ≤ 2⁵³. UseBigInt(slow) or pick mod ≤ 2²⁶ with two-pair hashing to fit safely.
Common Bugs
- Single-hash collision — passes random tests, fails LC 1044 hidden cases. Double-hash always.
- Negative mod in languages with truncated division (Java, C, JS):
(a - b) % MODcan be negative; addMODand re-mod. - Power table off-by-one:
p[i] = b^i.hash[l..r)usesp[r - l], notp[r - l - 1]. - Reusing the same
(b, p)across runs on adversarial inputs — randomize per run. - Forgetting verify on single-hash → wrong answer.
- Boundary in rolling update: drop the leftmost char first (multiply by
pow_L), then add the rightmost.
Debugging Strategy
Construct two strings of the same length with known equality and known inequality; assert your hash_pair(0, L) gives equal pairs iff the strings are equal. For LC 1044, when the binary search returns wrong-length, instrument find(L) to dump (i, hash, prior_index) on each match candidate. If a hash collision passes verify, your hash is wrong; if verify catches it, single-hash worked but barely — switch to double-hash.
Mastery Criteria
- Implemented double-hash rolling hash from scratch in <15 minutes.
- Stated the collision probability for single vs double hash.
- Solved LC 187 in <10 minutes; LC 1044 in <30.
- Recognized binary-search-on-length as the standard “longest duplicate” reduction.
- Used random base per run.
- Stated the alternative (suffix array) and when to prefer it.