Lab 05 — KMP String Matching

Goal

Implement Knuth–Morris–Pratt (KMP): build the failure function (longest proper prefix-suffix) of a pattern in O(M), then match the pattern against a text in O(N + M). Apply it to LeetCode 28 (strStr) and LeetCode 459 (Repeated Substring Pattern). After this lab, you should be able to derive fail[] on a 10-character string by hand and write the matcher in <12 minutes.

Background Concepts

The naive substring search compares the pattern against every position in the text: O(N · M) worst case. KMP exploits the fact that when a mismatch occurs at pattern position j, the prefix P[0..j-1] did match. So we already know the last j characters of the text. From that we compute “what is the longest proper prefix of P that is also a suffix of P[0..j-1]?” — call that length fail[j-1]. Then we resume matching at pattern position fail[j-1] without backtracking the text pointer.

The failure function (also called “longest proper prefix-suffix” or LPS):

  • fail[i] = length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i].
  • “Proper” = strictly shorter than i + 1.
  • fail[0] = 0 always.

Build in O(M) using a two-pointer recurrence: j = fail[i-1]; if P[j] == P[i], fail[i] = j + 1; else fall back j = fail[j-1] and retry, until j = 0.

The matcher: walk text pointer i forward; pattern pointer j advances on match, falls back to fail[j-1] on mismatch (without resetting i).

Interview Context

KMP is the bedrock single-pattern string algorithm. Asked at every FAANG, every quant shop, every search-infra team. The give-away signal: “find pattern in text” with N, M up to 10⁵ — naive is 10¹⁰ ops. Most candidates know strStr exists; few can derive fail[] correctly under pressure. Doing it cleanly is a strong signal at L4+.

Problem Statement (LC 28)

Given two strings haystack and needle, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Constraints

  • 1 ≤ haystack.length, needle.length ≤ 10⁴ (LC 28); generalize to 10⁵.
  • All printable ASCII.

Clarifying Questions

  1. First occurrence (leftmost), or any? (Leftmost.)
  2. Return 0 for empty needle? (Yes — convention.)
  3. Case-sensitive? (Yes by default.)

Examples

strStr("sadbutsad", "sad") → 0
strStr("leetcode", "leeto") → -1
strStr("hello", "ll") → 2
strStr("aabaaabaaac", "aabaaac") → 4

Initial Brute Force

Two nested loops: for each starting index i ∈ [0, N - M], compare text[i..i+M-1] against pattern; return i on full match.

Brute Force Complexity

O((N - M + 1) · M) ≈ O(N · M). Worst case at text = "aaa..a", pattern = "aaa..b": each starting position fails on the last character. At N = M = 10⁵: 10¹⁰ ops. TLE.

Optimization Path

KMP O(N + M). Z-algorithm O(N + M) — equally good, different invariants. Rabin–Karp O(N) expected with hashing — needs verify on collision. Suffix automaton O(N + M) for the text-side, O(M) match — overkill for one pattern.

KMP is the canonical answer because it (a) is exact (no probabilistic concerns), (b) generalizes to “longest border” / “shortest period” follow-ups, (c) is preferred by interviewers as a known-quantity algorithm.

Final Expected Approach

  1. Build fail[] of length M for the pattern.
  2. Walk a single pointer i over the text and j over the pattern.
    • If text[i] == pat[j]: advance both; if j == M, report match at i - M.
    • If mismatch and j > 0: j = fail[j - 1] (don’t advance i).
    • If mismatch and j == 0: i += 1.

Data Structures Used

  • fail[] — int array of length M.
  • Two pointers, i and j.

Correctness Argument

Failure function invariant: at the end of the build loop’s iteration on i, fail[i] equals the length of the longest proper prefix of P[0..i] matching its suffix.

Proof sketch: assume fail[0..i-1] is correct. Set j = fail[i-1] — the longest proper border of P[0..i-1]. Try to extend: if P[j] == P[i], the border extends by one to j + 1, and there is no longer border (any longer would give a longer border at i-1). If not, fall back to the next-shorter border via j = fail[j-1]; repeat.

Match invariant: when the matcher is at text position i and pattern position j, the last j characters of text up to i-1 equal P[0..j-1]. On mismatch, falling back to fail[j-1] finds the longest proper prefix of P that is a suffix of P[0..j-1], which is also a suffix of the text so far — preserving the invariant without rescanning text.

Linearity: in the matcher, the variable i + (i - j) strictly increases on every iteration. Since i ≤ N and j ≥ 0, the loop runs ≤ 2N times. Same trick for the build: i + (i - j) ≤ 2M.

Complexity

OperationTimeSpace
Failure function buildO(M)O(M)
MatchO(N + M)O(M)

Implementation Requirements

def build_failure(pat):
    m = len(pat)
    fail = [0] * m
    j = 0
    for i in range(1, m):
        while j > 0 and pat[j] != pat[i]:
            j = fail[j - 1]
        if pat[j] == pat[i]:
            j += 1
        fail[i] = j
    return fail

def strStr(text, pat):
    if not pat: return 0
    n, m = len(text), len(pat)
    if m > n: return -1
    fail = build_failure(pat)
    j = 0
    for i in range(n):
        while j > 0 and text[i] != pat[j]:
            j = fail[j - 1]
        if text[i] == pat[j]:
            j += 1
            if j == m:
                return i - m + 1
    return -1

Tests

  • Empty pattern → 0.
  • Pattern not in text → −1.
  • Pattern at the start: strStr("abcd", "ab") == 0.
  • Pattern at the end: strStr("abcd", "cd") == 2.
  • Pattern equals text: strStr("hello", "hello") == 0.
  • Repeated chars: strStr("aaaa", "aa") == 0.
  • Worst-case backtrack: strStr("aaaaab", "aaab") == 2.
  • Verify fail for "aabaaab"[0, 1, 0, 1, 2, 2, 3].
  • Stress: random texts/patterns, compared against text.find(pat).

Follow-up Questions

  1. “Find all occurrences.” → on match, instead of returning, record i - m + 1 and continue with j = fail[j - 1].
  2. “Repeated substring pattern (LC 459).”s is composed of k ≥ 2 repetitions of a substring iff n % (n - fail[n-1]) == 0 and fail[n-1] > 0.
  3. “Shortest palindrome (LC 214).” → run KMP on s + '#' + reverse(s); the answer prefix length is fail[-1].
  4. “Multi-pattern matching.” → Aho–Corasick (Phase 3 #17) generalizes KMP to a trie of patterns.
  5. “Strict period” (longest period of S) = n - fail[n-1]; longest border = fail[n-1]. They are dual.
  6. “Z algorithm — implement it instead.” → different invariant, same asymptotics; pick whichever is fluent.

Product Extension

Search-engine snippet generation: for each query term, find the first match in each candidate document. Multi-pattern at scale uses Aho–Corasick; single-pattern intra-doc still uses KMP because of its predictable cache behavior. Anti-virus signature scanning of binaries is the same problem, multi-pattern, with patterns numbered in the millions — Aho–Corasick territory but KMP per-pattern is the building block.

Language/Runtime Follow-ups

  • Python: built-in str.find is C-implemented Two-Way / Crochemore — usually faster than Python-level KMP. KMP wins when you want all occurrences or the failure function for other purposes.
  • Java: String.indexOf is a naive scan. KMP wins for adversarial inputs. Use char[] over String.charAt for the inner loop.
  • Go: strings.Index uses Rabin–Karp with a fallback. KMP useful when you want explicit fail[].
  • C++: string::find is naive in libstdc++. KMP from scratch is the canonical CP move. std::vector<int> for fail.
  • JS/TS: String.prototype.indexOf is engine-dependent; V8 uses Boyer–Moore–Horspool. KMP needed when you implement custom matchers.

Common Bugs

  1. Setting fail[0] = 1 instead of 0. The “proper” prefix excludes the full string.
  2. In the build, forgetting while j > 0 and pat[j] != pat[i]: j = fail[j-1] is a while, not an if. Treating it as if gives wrong answers on patterns like "aabaaab".
  3. Resetting j = 0 between independent matches but forgetting to reset i — leftover from a “match all” loop.
  4. Using pat[i] != pat[j] vs pat[i] != pat[j-1] — pick a consistent indexing for the LPS (length, not index) and don’t mix.
  5. Off-by-one when reporting the match index: i - m + 1 (0-indexed start) vs i - m.
  6. For LC 459, forgetting the fail[n-1] > 0 guard — without it, a non-repeating string passes the divisibility check trivially.

Debugging Strategy

Compute fail[] for a 7-char pattern by hand and compare. For "aabaaab": fail = [0, 1, 0, 1, 2, 2, 3]. Walk the recurrence: at i=4, j=fail[3]=1, pat[1]==‘a’==pat[4]=‘a’ → fail[4]=2. At i=5, j=fail[4]=2, pat[2]==‘b’!=pat[5]=‘a’ → j=fail[1]=1, pat[1]==‘a’==pat[5]=‘a’ → fail[5]=2. If your code disagrees, instrument with prints.

For the matcher, trace (i, j) per iteration on text="aabaaabaaac", pat="aabaaac". After an early mismatch at (i=6, j=6) (text=‘b’ vs pat=‘c’), j=fail[5]=2, so we resume at text 6 vs pat 2.

Mastery Criteria

  • Computed fail[] for an unfamiliar 8-character pattern by hand in <2 minutes.
  • Wrote build_failure and strStr in <12 minutes total, no off-by-ones.
  • Solved LC 28, LC 459, and LC 1392 from cold start.
  • Stated longest-border vs shortest-period duality.
  • Identified KMP as the single-pattern engine; identified Aho–Corasick as the multi-pattern generalization.
  • Stated the linear-time argument (potential function i + (i - j)).