Lab 05 — KMP String Matching
Goal
Implement Knuth–Morris–Pratt (KMP): build the failure function (longest proper prefix-suffix) of a pattern in O(M), then match the pattern against a text in O(N + M). Apply it to LeetCode 28 (strStr) and LeetCode 459 (Repeated Substring Pattern). After this lab, you should be able to derive fail[] on a 10-character string by hand and write the matcher in <12 minutes.
Background Concepts
The naive substring search compares the pattern against every position in the text: O(N · M) worst case. KMP exploits the fact that when a mismatch occurs at pattern position j, the prefix P[0..j-1] did match. So we already know the last j characters of the text. From that we compute “what is the longest proper prefix of P that is also a suffix of P[0..j-1]?” — call that length fail[j-1]. Then we resume matching at pattern position fail[j-1] without backtracking the text pointer.
The failure function (also called “longest proper prefix-suffix” or LPS):
fail[i]= length of the longest proper prefix ofP[0..i]that is also a suffix ofP[0..i].- “Proper” = strictly shorter than
i + 1. fail[0] = 0always.
Build in O(M) using a two-pointer recurrence: j = fail[i-1]; if P[j] == P[i], fail[i] = j + 1; else fall back j = fail[j-1] and retry, until j = 0.
The matcher: walk text pointer i forward; pattern pointer j advances on match, falls back to fail[j-1] on mismatch (without resetting i).
Interview Context
KMP is the bedrock single-pattern string algorithm. Asked at every FAANG, every quant shop, every search-infra team. The give-away signal: “find pattern in text” with N, M up to 10⁵ — naive is 10¹⁰ ops. Most candidates know strStr exists; few can derive fail[] correctly under pressure. Doing it cleanly is a strong signal at L4+.
Problem Statement (LC 28)
Given two strings haystack and needle, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Constraints
- 1 ≤ haystack.length, needle.length ≤ 10⁴ (LC 28); generalize to 10⁵.
- All printable ASCII.
Clarifying Questions
- First occurrence (leftmost), or any? (Leftmost.)
- Return
0for emptyneedle? (Yes — convention.) - Case-sensitive? (Yes by default.)
Examples
strStr("sadbutsad", "sad") → 0
strStr("leetcode", "leeto") → -1
strStr("hello", "ll") → 2
strStr("aabaaabaaac", "aabaaac") → 4
Initial Brute Force
Two nested loops: for each starting index i ∈ [0, N - M], compare text[i..i+M-1] against pattern; return i on full match.
Brute Force Complexity
O((N - M + 1) · M) ≈ O(N · M). Worst case at text = "aaa..a", pattern = "aaa..b": each starting position fails on the last character. At N = M = 10⁵: 10¹⁰ ops. TLE.
Optimization Path
KMP O(N + M). Z-algorithm O(N + M) — equally good, different invariants. Rabin–Karp O(N) expected with hashing — needs verify on collision. Suffix automaton O(N + M) for the text-side, O(M) match — overkill for one pattern.
KMP is the canonical answer because it (a) is exact (no probabilistic concerns), (b) generalizes to “longest border” / “shortest period” follow-ups, (c) is preferred by interviewers as a known-quantity algorithm.
Final Expected Approach
- Build
fail[]of length M for the pattern. - Walk a single pointer
iover the text andjover the pattern.- If
text[i] == pat[j]: advance both; ifj == M, report match ati - M. - If mismatch and
j > 0:j = fail[j - 1](don’t advancei). - If mismatch and
j == 0:i += 1.
- If
Data Structures Used
fail[]— int array of length M.- Two pointers,
iandj.
Correctness Argument
Failure function invariant: at the end of the build loop’s iteration on i, fail[i] equals the length of the longest proper prefix of P[0..i] matching its suffix.
Proof sketch: assume fail[0..i-1] is correct. Set j = fail[i-1] — the longest proper border of P[0..i-1]. Try to extend: if P[j] == P[i], the border extends by one to j + 1, and there is no longer border (any longer would give a longer border at i-1). If not, fall back to the next-shorter border via j = fail[j-1]; repeat.
Match invariant: when the matcher is at text position i and pattern position j, the last j characters of text up to i-1 equal P[0..j-1]. On mismatch, falling back to fail[j-1] finds the longest proper prefix of P that is a suffix of P[0..j-1], which is also a suffix of the text so far — preserving the invariant without rescanning text.
Linearity: in the matcher, the variable i + (i - j) strictly increases on every iteration. Since i ≤ N and j ≥ 0, the loop runs ≤ 2N times. Same trick for the build: i + (i - j) ≤ 2M.
Complexity
| Operation | Time | Space |
|---|---|---|
| Failure function build | O(M) | O(M) |
| Match | O(N + M) | O(M) |
Implementation Requirements
def build_failure(pat):
m = len(pat)
fail = [0] * m
j = 0
for i in range(1, m):
while j > 0 and pat[j] != pat[i]:
j = fail[j - 1]
if pat[j] == pat[i]:
j += 1
fail[i] = j
return fail
def strStr(text, pat):
if not pat: return 0
n, m = len(text), len(pat)
if m > n: return -1
fail = build_failure(pat)
j = 0
for i in range(n):
while j > 0 and text[i] != pat[j]:
j = fail[j - 1]
if text[i] == pat[j]:
j += 1
if j == m:
return i - m + 1
return -1
Tests
- Empty pattern → 0.
- Pattern not in text → −1.
- Pattern at the start:
strStr("abcd", "ab") == 0. - Pattern at the end:
strStr("abcd", "cd") == 2. - Pattern equals text:
strStr("hello", "hello") == 0. - Repeated chars:
strStr("aaaa", "aa") == 0. - Worst-case backtrack:
strStr("aaaaab", "aaab") == 2. - Verify
failfor"aabaaab"→[0, 1, 0, 1, 2, 2, 3]. - Stress: random texts/patterns, compared against
text.find(pat).
Follow-up Questions
- “Find all occurrences.” → on match, instead of returning, record
i - m + 1and continue withj = fail[j - 1]. - “Repeated substring pattern (LC 459).” →
sis composed ofk ≥ 2repetitions of a substring iffn % (n - fail[n-1]) == 0andfail[n-1] > 0. - “Shortest palindrome (LC 214).” → run KMP on
s + '#' + reverse(s); the answer prefix length isfail[-1]. - “Multi-pattern matching.” → Aho–Corasick (Phase 3 #17) generalizes KMP to a trie of patterns.
- “Strict period” (longest period of S) =
n - fail[n-1]; longest border =fail[n-1]. They are dual. - “Z algorithm — implement it instead.” → different invariant, same asymptotics; pick whichever is fluent.
Product Extension
Search-engine snippet generation: for each query term, find the first match in each candidate document. Multi-pattern at scale uses Aho–Corasick; single-pattern intra-doc still uses KMP because of its predictable cache behavior. Anti-virus signature scanning of binaries is the same problem, multi-pattern, with patterns numbered in the millions — Aho–Corasick territory but KMP per-pattern is the building block.
Language/Runtime Follow-ups
- Python: built-in
str.findis C-implemented Two-Way / Crochemore — usually faster than Python-level KMP. KMP wins when you want all occurrences or the failure function for other purposes. - Java:
String.indexOfis a naive scan. KMP wins for adversarial inputs. Usechar[]overString.charAtfor the inner loop. - Go:
strings.Indexuses Rabin–Karp with a fallback. KMP useful when you want explicitfail[]. - C++:
string::findis naive in libstdc++. KMP from scratch is the canonical CP move.std::vector<int>forfail. - JS/TS:
String.prototype.indexOfis engine-dependent; V8 uses Boyer–Moore–Horspool. KMP needed when you implement custom matchers.
Common Bugs
- Setting
fail[0] = 1instead of 0. The “proper” prefix excludes the full string. - In the build, forgetting
while j > 0 and pat[j] != pat[i]: j = fail[j-1]is awhile, not anif. Treating it asifgives wrong answers on patterns like"aabaaab". - Resetting
j = 0between independent matches but forgetting to reseti— leftover from a “match all” loop. - Using
pat[i] != pat[j]vspat[i] != pat[j-1]— pick a consistent indexing for the LPS (length, not index) and don’t mix. - Off-by-one when reporting the match index:
i - m + 1(0-indexed start) vsi - m. - For LC 459, forgetting the
fail[n-1] > 0guard — without it, a non-repeating string passes the divisibility check trivially.
Debugging Strategy
Compute fail[] for a 7-char pattern by hand and compare. For "aabaaab": fail = [0, 1, 0, 1, 2, 2, 3]. Walk the recurrence: at i=4, j=fail[3]=1, pat[1]==‘a’==pat[4]=‘a’ → fail[4]=2. At i=5, j=fail[4]=2, pat[2]==‘b’!=pat[5]=‘a’ → j=fail[1]=1, pat[1]==‘a’==pat[5]=‘a’ → fail[5]=2. If your code disagrees, instrument with prints.
For the matcher, trace (i, j) per iteration on text="aabaaabaaac", pat="aabaaac". After an early mismatch at (i=6, j=6) (text=‘b’ vs pat=‘c’), j=fail[5]=2, so we resume at text 6 vs pat 2.
Mastery Criteria
-
Computed
fail[]for an unfamiliar 8-character pattern by hand in <2 minutes. -
Wrote
build_failureandstrStrin <12 minutes total, no off-by-ones. - Solved LC 28, LC 459, and LC 1392 from cold start.
- Stated longest-border vs shortest-period duality.
- Identified KMP as the single-pattern engine; identified Aho–Corasick as the multi-pattern generalization.
-
Stated the linear-time argument (potential function
i + (i - j)).