Lab 06 — Rolling Hash (Rabin–Karp + Double Hashing)

Goal

Implement a polynomial rolling hash with two independent (base, mod) pairs. Use it to (a) find repeated substrings of a fixed length (LC 187 — Repeated DNA Sequences) and (b) find the longest duplicate substring via binary search on length (LC 1044 — Longest Duplicate Substring). Internalize the modular arithmetic well enough to avoid collisions on adversarial inputs.

Background Concepts

A polynomial rolling hash treats a string as a base-b number mod p:

H(S) = (S[0] · b^(L-1) + S[1] · b^(L-2) + ... + S[L-1]) mod p

The “rolling” part: when you slide the window from S[i..i+L-1] to S[i+1..i+L], you update in O(1):

H_new = ((H_old - S[i] · b^(L-1)) · b + S[i+L]) mod p

For substring equality from prefix hashes, precompute pref[i] = (S[0] · b^(i-1) + ... + S[i-1]) mod p. Then H(S[l..r]) = (pref[r+1] - pref[l] · b^(r - l + 1)) mod p.

Single hash collision risk. A single mod-p hash has birthday-paradox collision probability ~k²/(2p) for k strings. At p ~ 10⁹ and k = 10⁵, that’s ~5% chance of collision. Adversarial inputs constructed to collide on a fixed (b, p) make single hashing unsafe.

Double hashing uses two independent (b₁, p₁) and (b₂, p₂); two strings collide on both with probability ~k²/(p₁ · p₂) ≈ 10⁻⁹ for ~10⁵ strings. Effectively safe for interviews. Anti-hash-resistant code uses random bases per run.

Interview Context

Rolling hash shows up whenever the brute-force algorithm involves “compare every substring of length L to every other” — a quadratic-in-N algorithm. The reduction is “convert string equality to integer equality, get O(1) per comparison”. Asked at: Google (frequent — duplicate detection, plagiarism), search-infra teams, biotech companies (DNA sequence problems), Stripe.

The signal: “many substring equality checks”, “longest repeated”, “find duplicates of length L”.

Problem Statement A (LC 187)

Find all 10-letter sequences that occur more than once in a DNA string s over {A, C, G, T}. Return them in any order.

Problem Statement B (LC 1044)

Given a string s, find the longest duplicated substring (any substring of length ≥ 1 that appears at least twice — overlaps allowed). Return any longest. If no duplicates, return "".

Constraints

  • LC 187: 1 ≤ |s| ≤ 10⁵; alphabet ACGT.
  • LC 1044: 2 ≤ |s| ≤ 3 × 10⁴; lowercase ASCII.

Clarifying Questions

  1. Overlaps allowed? (Yes — both problems.)
  2. Single longest, or all? (LC 1044: any one longest.)
  3. Case sensitivity / encoding? (As given by problem.)
  4. May we use suffix array / suffix automaton? (Yes, but the lab assignment is rolling hash.)

Examples

LC 187: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"["AAAAACCCCC", "CCCCCAAAAA"].

LC 1044: s = "banana""ana".

Initial Brute Force

LC 187: hashtable of every length-10 substring. O(N · L) time, O(N · L) memory — already passes.

LC 1044: for each L from N-1 down to 1, scan all length-L substrings and check duplicates. O(N² · L) at worst — 3 × 10¹² at the limits. TLE.

Brute Force Complexity

LC 187: O(N · L) acceptable. LC 1044: O(N²) at minimum, O(N³) naively. TLE.

Optimization Path

LC 187 with rolling hash: O(N) amortized per scan, O(N) total. Useful when L is large; for L = 10 the brute force is fine, but rolling hash demonstrates the technique.

LC 1044 with rolling hash + binary search on L: outer loop binary-searches L in [1, N-1]; inner loop hashes every length-L substring and looks up duplicates in a dict. O(N log N) total, with double-hashing or hash+verify.

Final Expected Approach

def has_duplicate_of_length(s, L):
    # Returns the starting index of a duplicate of length L, or -1.
    h = 0; pow_L = 1
    seen = {}
    for i in range(L):
        h = (h * BASE + ord(s[i])) % MOD
        if i: pow_L = (pow_L * BASE) % MOD
    seen[h] = 0
    for i in range(1, len(s) - L + 1):
        h = ((h - ord(s[i-1]) * pow_L) * BASE + ord(s[i + L - 1])) % MOD
        if h in seen:
            # verify (or use double-hash) to avoid false-positive
            if s[seen[h]:seen[h]+L] == s[i:i+L]:
                return i
        else:
            seen[h] = i
    return -1

# Binary search L in [1, N-1].

For LC 187, just collect a multiset of length-10 hashes; output any whose count > 1 (and verify).

Data Structures Used

  • Two integer mods (~10⁹ primes), two random bases.
  • Dict from hash (or hash-pair) → starting index.
  • Precomputed power array pow_b[i].

Correctness Argument

If two strings have the same content, their polynomial hash is equal — exact, no probability. The other direction (same hash → same content) is not guaranteed; this is why we verify on collision (or use double-hashing for ~10⁻¹⁸ collision probability).

For LC 1044, monotonicity: if a duplicate of length L exists, every L’ ≤ L has a duplicate (a prefix of one of the duplicate occurrences). So the answer set {L : duplicate exists} is a prefix [1, L*], and binary search finds L* in O(log N) calls of has_duplicate_of_length.

Complexity

OperationTimeSpace
Single-length scanO(N) amortizedO(N)
LC 187O(N · L)O(N · L)
LC 1044 (binary search)O(N log N)O(N)

Verifying on collision adds an O(L) hit per match; with double-hashing it’s negligible.

Implementation Requirements

import random

class RollingHash:
    def __init__(self, s):
        self.n = len(s)
        # Use two (base, mod) pairs.
        self.MOD1, self.MOD2 = (1 << 31) - 1, (1 << 61) - 1
        self.B1 = random.randint(27, self.MOD1 - 1)
        self.B2 = random.randint(27, self.MOD2 - 1)
        self.h1 = [0] * (self.n + 1)
        self.h2 = [0] * (self.n + 1)
        self.p1 = [1] * (self.n + 1)
        self.p2 = [1] * (self.n + 1)
        for i, c in enumerate(s):
            v = ord(c)
            self.h1[i+1] = (self.h1[i] * self.B1 + v) % self.MOD1
            self.h2[i+1] = (self.h2[i] * self.B2 + v) % self.MOD2
            self.p1[i+1] = (self.p1[i] * self.B1) % self.MOD1
            self.p2[i+1] = (self.p2[i] * self.B2) % self.MOD2

    def hash_pair(self, l, r):  # [l, r)
        a = (self.h1[r] - self.h1[l] * self.p1[r - l]) % self.MOD1
        b = (self.h2[r] - self.h2[l] * self.p2[r - l]) % self.MOD2
        return (a, b)


def longestDupSubstring(s):
    n = len(s)
    rh = RollingHash(s)
    def find(L):
        seen = {}
        for i in range(n - L + 1):
            h = rh.hash_pair(i, i + L)
            if h in seen: return i
            seen[h] = i
        return -1
    lo, hi, best_start, best_len = 1, n - 1, 0, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        i = find(mid)
        if i != -1:
            best_start, best_len = i, mid
            lo = mid + 1
        else:
            hi = mid - 1
    return s[best_start:best_start + best_len]

Tests

  • LC 187 sample → ["AAAAACCCCC", "CCCCCAAAAA"].
  • LC 1044 "banana" → "ana".
  • LC 1044 "abcd""".
  • LC 1044 "aaaaaa""aaaaa" (overlapping duplicates).
  • LC 1044 "abab""ab".
  • Single-hash adversary: paste a Codeforces anti-hash test for (b=31, p=10^9+7); verify your double-hash survives.
  • Stress: 1000 random strings of length 100, compare LC 1044 result to a brute-force suffix-array approach.

Follow-up Questions

  1. “What if I demand zero false positives?” → either verify on every match (O(L) extra) or use suffix array / suffix automaton (deterministic O(N log N) or O(N)).
  2. “Multiple texts share patterns.” → hash all, group by hash-pair, verify within group.
  3. “Stream the text.” → keep a rolling hash; emit matches online; constant memory beyond the dictionary.
  4. “Distinguish substrings cyclically equal.” → hash min-rotation (Booth’s algorithm) or all rotations.
  5. “Avoid Python big-int slowdown.” → use (1 << 61) - 1 (Mersenne prime) and bitwise reduction; or numpy.

Product Extension

Plagiarism detection over a corpus: shingle each document into k-grams, hash each shingle, store (doc_id, hash) tuples; near-duplicate documents share many shingles. MinHash + LSH (locality-sensitive hashing) is the production technology, built on the same rolling-hash foundation. Snapchat / Imgur-style “hash this image” reuses the same logic with perceptual hashes.

Language/Runtime Follow-ups

  • Python: integers are bignums; prefer mods that fit in 63 bits to keep ops fast. Avoid pow(b, k, m) in hot loops — precompute pow tables.
  • Java: use long for mod ~10⁹ to keep b · h + c from overflowing int. For mod 2⁶¹-1 you need Math.floorMod and 128-bit (Math.multiplyHigh in JDK 9+).
  • Go: uint64 with mod 2⁶¹-1 lets you (x * y) >> 61 reduce. Idiomatic CP technique.
  • C++: __int128 for mod 2⁶¹-1 multiplications. Otherwise unsigned long long.
  • JS/TS: Number only safe for integers ≤ 2⁵³. Use BigInt (slow) or pick mod ≤ 2²⁶ with two-pair hashing to fit safely.

Common Bugs

  1. Single-hash collision — passes random tests, fails LC 1044 hidden cases. Double-hash always.
  2. Negative mod in languages with truncated division (Java, C, JS): (a - b) % MOD can be negative; add MOD and re-mod.
  3. Power table off-by-one: p[i] = b^i. hash[l..r) uses p[r - l], not p[r - l - 1].
  4. Reusing the same (b, p) across runs on adversarial inputs — randomize per run.
  5. Forgetting verify on single-hash → wrong answer.
  6. Boundary in rolling update: drop the leftmost char first (multiply by pow_L), then add the rightmost.

Debugging Strategy

Construct two strings of the same length with known equality and known inequality; assert your hash_pair(0, L) gives equal pairs iff the strings are equal. For LC 1044, when the binary search returns wrong-length, instrument find(L) to dump (i, hash, prior_index) on each match candidate. If a hash collision passes verify, your hash is wrong; if verify catches it, single-hash worked but barely — switch to double-hash.

Mastery Criteria

  • Implemented double-hash rolling hash from scratch in <15 minutes.
  • Stated the collision probability for single vs double hash.
  • Solved LC 187 in <10 minutes; LC 1044 in <30.
  • Recognized binary-search-on-length as the standard “longest duplicate” reduction.
  • Used random base per run.
  • Stated the alternative (suffix array) and when to prefer it.