p93 — Implement strStr() / KMP

Source: LeetCode 28 · Easy (deceptively) · Topics: String, KMP, Z-algorithm Companies (2024–2025): Google (high), Amazon (medium), Meta (medium) Loop position: The one “Easy” problem that separates levels. Naive O(nm) gets you the job at Mid. KMP O(n+m) signals Senior. Articulating the failure function invariant is Staff.

1. Quick Context

Return the index of the first occurrence of needle in haystack, or -1.

Naive: O(n·m). Slide window of length m; compare char-by-char.

KMP: O(n+m). Precompute failure function lps[i] = length of the longest proper prefix of needle[0..i] that is also a suffix. On mismatch, skip via j = lps[j-1] instead of restarting.

🔗 https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/

STOP. 30-min timer. Write KMP from memory — including the failure function loop.

3. Prerequisite Concepts

  • Strings as char arrays.
  • Proper prefix vs suffix.
  • Amortized analysis (KMP runs in O(n+m) despite nested loop).

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Substring search.

2. Clarify: Empty needle? Return 0. Needle longer than haystack? Return -1.

3. Examples: haystack="sadbutsad", needle="sad" → 0. needle="abc" → -1.

5. Brute: O(n·m).

6. Target: O(n+m).

7. Pattern: KMP failure function.

8. Develop: see solution.py.

9. Test: match at start; match at end; no match; needle = haystack; periodic needle (aaab in aaaaab).

5. Progressive Hints

See hints.md.

6. Deeper Insight

6.1 The failure function — what it stores

lps[i] = max k < i+1 such that needle[0..k-1] == needle[i-k+1..i].

It’s the longest prefix of needle ending at position i that’s also a prefix of needle. On mismatch, we know the last lps[j-1] chars matched a prefix → resume comparing from position lps[j-1] instead of 0.

6.2 Building lps in O(m) — the trick

lps[0] = 0
length = 0
for i in 1..m-1:
    while length > 0 and needle[i] != needle[length]:
        length = lps[length - 1]
    if needle[i] == needle[length]:
        length += 1
    lps[i] = length

Amortized O(m): length can increase by at most 1 per iteration (total m increases). Each decrement strictly reduces length. So total work O(m).

6.3 KMP match loop

i = 0  # haystack
j = 0  # needle
while i < n:
    if haystack[i] == needle[j]:
        i += 1; j += 1
        if j == m: return i - m  # full match
    else:
        if j > 0: j = lps[j - 1]  # fall back
        else: i += 1               # nothing to fall back on

Critical property: i never decreases. That’s what gives O(n+m).

6.4 Why amortized linear

Total i increments: O(n). Total j increments: ≤ i ⇒ O(n). Total j decrements: ≤ total j-increments ⇒ O(n). So overall O(n+m).

6.5 Alternatives

  • Z-algorithm: computes z[i] = length of longest substring starting at i that matches a prefix. O(n+m) by concatenating needle + sentinel + haystack.
  • Rabin-Karp: polynomial rolling hash. O(n+m) expected, O(nm) worst.
  • Boyer-Moore: sublinear average via bad-character + good-suffix heuristics.

6.6 Family: substring / pattern matching

  • LC 28 strStr (this).
  • LC 459 Repeated Substring Pattern (KMP failure function reveals period).
  • LC 686 Repeated String Match.
  • LC 214 Shortest Palindrome (KMP on s + '#' + reverse(s)).
  • LC 1392 Longest Happy Prefix (lps[m-1] directly).

7. Anti-Pattern Analysis

  1. Use Python’s str.find — works but signals you avoided the question.
  2. O(n·m) naive — passes LC but interviewer flags as Junior signal.
  3. Off-by-one in lps building — common; lps[length - 1] not lps[length] on mismatch.
  4. Forget the if j > 0 branch — infinite loop when j=0 mismatches.
  5. Decrement i on mismatch — defeats the point of KMP.
  6. Confuse j with i when explaining loop invariants — interviewers test this.

8. Skills & Takeaways

  • Failure function construction.
  • Match loop never decrementing i.
  • Amortized linear-time argument.

9. When to Move On

  • Implement KMP cold in 20 min including lps build.
  • Explain amortized O(n+m).
  • Cite Z-algorithm as alternative.

10. Company Context

  • Google: L5; expects KMP cold + amortized argument.
  • Amazon: L4/L5.
  • Meta: E5.

Strong: “Naive is O(nm); KMP via failure function in O(n+m). lps[i] = longest proper prefix of needle[0..i] that’s also a suffix. On mismatch, fall back to lps[j-1] without decrementing i; amortized linear.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff+
NaiveWorksColdColdCold
KMPNoneAfter hintCold+ amortized proof
Failure fnNoneNoneBuilds+ explains semantics
FamilyNoneNoneLC 459+ Z-algorithm

12. Level Delta

  • Mid (L4): Naive O(nm); mentions KMP.
  • Senior (L5): Implements KMP cold.
  • Staff (L6): + amortized proof + LC 459 period detection.
  • Principal (L7+): + framing as DFA over needle alphabet (Aho-Corasick generalization for multi-pattern) + suffix automaton connection.

13. Follow-up Q&A

Q1: Find ALL occurrences, not just first. A1: On full match, instead of return, append i - m to results and set j = lps[m - 1] to continue.

Q2: Multi-pattern search (k patterns). A2: Aho-Corasick — KMP generalized to a trie with failure links. O(n + Σ|p_i| + #matches).

Q3: Rotated string matching. A3: KMP on s + s for needle = original; or KMP on doubled haystack.

Q4: Approximate matching with k errors. A4: Bit-parallel (Shift-Or with k). Or DP.

Q5: What does lps[m-1] tell you about the string? A5: Length of the longest proper prefix that’s also a suffix → if m - lps[m-1] divides m, the string is periodic with that period (LC 459).

14. Solution Walkthrough

See solution.py:

  • strstr_kmp — KMP match.
  • _build_lps — failure function.
  • strstr_brute — naive O(n·m).

15. Beyond the Problem

Principal code review: “KMP (1977) is one of the deepest ‘easy’ algorithms in computer science. The Staff+ recognition is that the failure function is a DFA — specifically, a partial deterministic automaton that recognizes any string ending in needle. Aho-Corasick (1975) generalizes this to multi-pattern matching by building a trie of all needles and adding ‘failure links’ (the multi-needle version of lps). The lesson scales: any ‘detect this pattern in a stream’ problem at production scale runs Aho-Corasick or KMP — virus signature scanning (ClamAV), intrusion detection (Snort), DNS-blocklist matching, content filtering, regex engines (RE2 uses NFA but falls back to KMP-like for literal substrings). The deeper lesson is the amortized linear-time technique: KMP’s correctness rests on the fact that i never decreases, which is the same argument used in two-pointer problems, monotonic-stack problems, and the Boyer-Moore voting algorithm. Once you see this pattern — ‘each character is examined O(1) amortized’ — you recognize it everywhere.”