p93 — Implement strStr() / KMP
Source: LeetCode 28 · Easy (deceptively) · Topics: String, KMP, Z-algorithm Companies (2024–2025): Google (high), Amazon (medium), Meta (medium) Loop position: The one “Easy” problem that separates levels. Naive O(nm) gets you the job at Mid. KMP O(n+m) signals Senior. Articulating the failure function invariant is Staff.
1. Quick Context
Return the index of the first occurrence of needle in haystack, or -1.
Naive: O(n·m). Slide window of length m; compare char-by-char.
KMP: O(n+m). Precompute failure function lps[i] = length of the longest proper prefix of needle[0..i] that is also a suffix. On mismatch, skip via j = lps[j-1] instead of restarting.
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/
STOP. 30-min timer. Write KMP from memory — including the failure function loop.
3. Prerequisite Concepts
- Strings as char arrays.
- Proper prefix vs suffix.
- Amortized analysis (KMP runs in O(n+m) despite nested loop).
4. How to Approach (FRAMEWORK 1–9)
1. Restate: Substring search.
2. Clarify: Empty needle? Return 0. Needle longer than haystack? Return -1.
3. Examples: haystack="sadbutsad", needle="sad" → 0. needle="abc" → -1.
5. Brute: O(n·m).
6. Target: O(n+m).
7. Pattern: KMP failure function.
8. Develop: see solution.py.
9. Test: match at start; match at end; no match; needle = haystack; periodic needle (aaab in aaaaab).
5. Progressive Hints
See hints.md.
6. Deeper Insight
6.1 The failure function — what it stores
lps[i] = max k < i+1 such that needle[0..k-1] == needle[i-k+1..i].
It’s the longest prefix of needle ending at position i that’s also a prefix of needle. On mismatch, we know the last lps[j-1] chars matched a prefix → resume comparing from position lps[j-1] instead of 0.
6.2 Building lps in O(m) — the trick
lps[0] = 0
length = 0
for i in 1..m-1:
while length > 0 and needle[i] != needle[length]:
length = lps[length - 1]
if needle[i] == needle[length]:
length += 1
lps[i] = length
Amortized O(m): length can increase by at most 1 per iteration (total m increases). Each decrement strictly reduces length. So total work O(m).
6.3 KMP match loop
i = 0 # haystack
j = 0 # needle
while i < n:
if haystack[i] == needle[j]:
i += 1; j += 1
if j == m: return i - m # full match
else:
if j > 0: j = lps[j - 1] # fall back
else: i += 1 # nothing to fall back on
Critical property: i never decreases. That’s what gives O(n+m).
6.4 Why amortized linear
Total i increments: O(n). Total j increments: ≤ i ⇒ O(n). Total j decrements: ≤ total j-increments ⇒ O(n). So overall O(n+m).
6.5 Alternatives
- Z-algorithm: computes
z[i]= length of longest substring starting at i that matches a prefix. O(n+m) by concatenatingneedle + sentinel + haystack. - Rabin-Karp: polynomial rolling hash. O(n+m) expected, O(nm) worst.
- Boyer-Moore: sublinear average via bad-character + good-suffix heuristics.
6.6 Family: substring / pattern matching
- LC 28 strStr (this).
- LC 459 Repeated Substring Pattern (KMP failure function reveals period).
- LC 686 Repeated String Match.
- LC 214 Shortest Palindrome (KMP on
s + '#' + reverse(s)). - LC 1392 Longest Happy Prefix (lps[m-1] directly).
7. Anti-Pattern Analysis
- Use Python’s
str.find— works but signals you avoided the question. - O(n·m) naive — passes LC but interviewer flags as Junior signal.
- Off-by-one in lps building — common;
lps[length - 1]notlps[length]on mismatch. - Forget the
if j > 0branch — infinite loop when j=0 mismatches. - Decrement
ion mismatch — defeats the point of KMP. - Confuse
jwith i when explaining loop invariants — interviewers test this.
8. Skills & Takeaways
- Failure function construction.
- Match loop never decrementing
i. - Amortized linear-time argument.
9. When to Move On
- Implement KMP cold in 20 min including lps build.
- Explain amortized O(n+m).
- Cite Z-algorithm as alternative.
10. Company Context
- Google: L5; expects KMP cold + amortized argument.
- Amazon: L4/L5.
- Meta: E5.
Strong: “Naive is O(nm); KMP via failure function in O(n+m). lps[i] = longest proper prefix of needle[0..i] that’s also a suffix. On mismatch, fall back to lps[j-1] without decrementing i; amortized linear.”
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff+ |
|---|---|---|---|---|
| Naive | Works | Cold | Cold | Cold |
| KMP | None | After hint | Cold | + amortized proof |
| Failure fn | None | None | Builds | + explains semantics |
| Family | None | None | LC 459 | + Z-algorithm |
12. Level Delta
- Mid (L4): Naive O(nm); mentions KMP.
- Senior (L5): Implements KMP cold.
- Staff (L6): + amortized proof + LC 459 period detection.
- Principal (L7+): + framing as DFA over needle alphabet (Aho-Corasick generalization for multi-pattern) + suffix automaton connection.
13. Follow-up Q&A
Q1: Find ALL occurrences, not just first.
A1: On full match, instead of return, append i - m to results and set j = lps[m - 1] to continue.
Q2: Multi-pattern search (k patterns). A2: Aho-Corasick — KMP generalized to a trie with failure links. O(n + Σ|p_i| + #matches).
Q3: Rotated string matching.
A3: KMP on s + s for needle = original; or KMP on doubled haystack.
Q4: Approximate matching with k errors. A4: Bit-parallel (Shift-Or with k). Or DP.
Q5: What does lps[m-1] tell you about the string?
A5: Length of the longest proper prefix that’s also a suffix → if m - lps[m-1] divides m, the string is periodic with that period (LC 459).
14. Solution Walkthrough
See solution.py:
strstr_kmp— KMP match._build_lps— failure function.strstr_brute— naive O(n·m).
15. Beyond the Problem
Principal code review: “KMP (1977) is one of the deepest ‘easy’ algorithms in computer science. The Staff+ recognition is that the failure function is a DFA — specifically, a partial deterministic automaton that recognizes any string ending in
needle. Aho-Corasick (1975) generalizes this to multi-pattern matching by building a trie of all needles and adding ‘failure links’ (the multi-needle version of lps). The lesson scales: any ‘detect this pattern in a stream’ problem at production scale runs Aho-Corasick or KMP — virus signature scanning (ClamAV), intrusion detection (Snort), DNS-blocklist matching, content filtering, regex engines (RE2 uses NFA but falls back to KMP-like for literal substrings). The deeper lesson is the amortized linear-time technique: KMP’s correctness rests on the fact thatinever decreases, which is the same argument used in two-pointer problems, monotonic-stack problems, and the Boyer-Moore voting algorithm. Once you see this pattern — ‘each character is examined O(1) amortized’ — you recognize it everywhere.”