p72 — Regular Expression Matching

Source: LeetCode 10 · Hard · Topics: DP, Recursion, String Companies (2024–2025): Google (high), Meta (high), Amazon (medium-high), Microsoft (medium) Loop position: The DP problem where the recurrence has a branching * case. Tests careful case analysis: * can match ZERO or MORE preceding chars, leading to a “skip the (char,*) pair” vs “consume one char” decision at every step.

1. Quick Context

Implement regex matching for . (matches any single char) and * (matches zero or more of the PRECEDING element). Match the ENTIRE string (not substring).

Strategy: 2D DP. dp[i][j] = True iff s[:i] matches p[:j]. Three cases for the pattern’s last character:

Plain char or .: dp[i][j] = dp[i-1][j-1] AND (s[i-1] matches p[j-1]).
* (forms pair with p[j-2]): two choices —
- Match ZERO of p[j-2]: dp[i][j-2].
- Match ONE MORE of p[j-2]: dp[i-1][j] AND s[i-1] matches p[j-2].

O(m·n).

2. LeetCode Link + Attempt Gate

🔗 https://leetcode.com/problems/regular-expression-matching/

STOP. 40-min timer. The case-analysis is the hardest part.

3. Prerequisite Concepts

2D string DP (alignment style).
Why “match zero” via * is the non-obvious case.
Empty string can be matched by a*, a*b*c*, etc.
The * always pairs with the preceding character — index j-2.

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Boolean: does full string match full pattern with . and * semantics?

2. Clarify: * always follows a real char or . (LC promises valid input). Match is FULL not substring.

3. Examples: s="aa", p="a*" → True. s="ab", p=".*" → True. s="aab", p="c*a*b" → True.

5. Brute force: Recursive char-by-char matching with branching on *. Exponential without memoization.

6. Target: O(m·n) DP.

7. Pattern: 2D DP, alignment-style with a branching * rule.

8. Develop: see solution.py.

9. Test: empty s vs empty p; a* matches empty; .* matches anything; nested stars like a*b*c*.

6. Deeper Insight

6.1 The `dp[i][j]` definition matters

Use dp[i][j] = “s[0..i-1] matches p[0..j-1]” (1-indexed end-exclusive). So dp[0][0] = True (empty matches empty).

Initialization row dp[0][j]: empty string matches pattern iff pattern is zero-or-more star pairs. dp[0][j] = dp[0][j-2] whenever p[j-1] == '*'.

6.2 The `*` case is the entire problem

if p[j-1] == '*':
    # Option A: match zero of p[j-2]
    dp[i][j] = dp[i][j-2]
    # Option B: match one MORE of p[j-2]
    if p[j-2] == '.' or p[j-2] == s[i-1]:
        dp[i][j] = dp[i][j] or dp[i-1][j]

“Match one more” means: we already used the (char,*) pair to consume zero or more chars up through s[i-2]; now we consume one more (s[i-1] if it matches p[j-2]). This stays on the SAME j because the * is still active.

6.3 Why `dp[i-1][j]` not `dp[i-1][j-2]` in Option B

The * is “greedy” in the sense that it remains in scope as we consume more chars. dp[i-1][j] says: “up to s[i-2] matched the SAME pattern p[:j] (including the active *)” — we just extend by one more matching char.

6.4 NFA/DFA framing

a*b*c* is a classic regular language. The DP is essentially simulating an NFA where each state is a position in the pattern, and * transitions are ε-moves. Production regex engines use either Thompson’s NFA construction (linear time) or backtracking (exponential worst case — see the StackOverflow July 2016 outage). This DP is conceptually the NFA simulation, written iteratively.

6.5 Backtracking RegEx engines and ReDoS

Most production regex engines (Perl, Python’s re, JavaScript) use backtracking, which can hit exponential worst cases on patterns like (a+)+$ against "aaaa...!". RE2 (Google) uses NFA construction to guarantee linear time. Knowing this distinction matters at Staff+.

6.6 LC 44 Wildcard Matching

LC 44 uses * differently — it matches ANY sequence of any characters (no preceding char binding). Different DP recurrence: dp[i][j] = dp[i-1][j] OR dp[i][j-1] for *. Easier than LC 10.

6.7 Memoized recursion variant

Often the cleanest for interview pressure: recurse on (i, j) = “does s[i:] match p[j:]?” with memoization. Two cases on next pattern token (regular vs c*). Many candidates find this easier than iterative.

7. Anti-Pattern Analysis

Treat * as “matches any single char” — that’s LC 44 wildcard. LC 10’s * modifies the PRECEDING char.
Forget the empty-prefix row dp[0][j] — wrong answer for inputs like s="", p="a*b*".
Use dp[i-1][j-1] instead of dp[i-1][j] for “match one more” — silently wrong on s="aaa", p="a*".
Use 0-indexed dp[i][j] = match s[:i+1] vs p[:j+1]` — boundary handling becomes painful.
Greedy “match longest first” — fails on s="aa", p="a*a" and others.
Catastrophic backtracking in a recursive solution — TLE on adversarial inputs without memoization.

8. Skills & Takeaways

2D string DP with branching recurrence.
* modifies preceding char; pair with p[j-2].
Three cases at each step (plain, ., *).
Initialization row for empty-string matching.
Connection to NFA/DFA theory.

9. When to Move On

Write iterative or memoized version cold in 35 min.
State the recurrence in English: “If pattern ends in *: either skip the (char,*) pair or consume one more matching char.”
Solve LC 44 (Wildcard Matching) by analogy.
Articulate when backtracking regex engines blow up.

10. Company Context

Google: L5/L6; “implement a small regex” is a classic.
Meta: E5/E6; tests case analysis carefully.
Amazon: L6; rare but discriminating.
Microsoft: L65; engine-implementation flavor.

Scorecard for strong: “Defined dp[i][j] cleanly with empty-prefix row, articulated *’s two cases, used dp[i-1][j] for ‘match one more’ correctly, mentioned NFA framing.”

11. Interviewer’s Lens

Signal	Junior	Mid	Senior	Staff+
Case analysis	Confused	After hints	Cold 3-case split	+ cites NFA
`*` recurrence	Off-by-one	Correct after debugging	Cold	+ explains intuition
Initialization	Misses	After failing test	Cold	+ works backward from empty match
Generalization	None	LC 44 by name	LC 44 solved by analogy	+ RE2 vs backtracking ReDoS discussion

12. Level Delta

Mid (L4): Memoized recursion after hints; passes most tests.
Senior (L5): Iterative DP cold; correct empty-prefix row; correct * recurrence; cites LC 44.
Staff (L6): + NFA/Thompson construction reference + ReDoS pitfall + RE2 mention.
Principal (L7+): + full regex-to-NFA conversion + DFA construction tradeoffs + production grep engine architecture.

13. Follow-up Q&A

Q1: Add + (one or more) and ? (zero or one). A1: + reduces to a copy: a+ ≡ aa*. ? is “match 0 or 1”: dp[i][j] = dp[i][j-2] or (matching dp[i-1][j-2]).

Q2: What’s the time complexity of brute recursion? A2: Exponential. Each * doubles the search tree. Memoization reduces to O(m·n).

Q3: Implement using NFA construction. A3: Build Thompson NFA, then simulate with set-of-states. O((m+n)·s) where s is input length.

Q4: What about anchors (^, $)? A4: LC version assumes full match (implicit anchors). Adding partial-match support requires extending the recurrence.

Q5: Production case where backtracking regex bit you? A5: ReDoS attacks — (a+)+$ against "aaaaaa!". Use RE2 or rewrite the regex to avoid catastrophic backtracking.

14. Solution Walkthrough

See solution.py:

is_match_dp — iterative 2D DP, canonical.
is_match_memo — top-down with @lru_cache.
is_match_brute — naive recursion (oracle for small inputs).

15. Beyond the Problem

Principal code review: “Regex matching is the gateway to NFA/DFA theory. The DP you write here is essentially a tabular NFA simulation: each dp[i][j] cell is ‘can the NFA, after consuming s[:i], be in pattern-state j?’ The branching * recurrence corresponds to ε-transitions in the NFA. Production regex engines split into two camps: backtracking (Perl, Python re, PCRE) — concise to implement, exponential worst case (ReDoS), and NFA simulation (Go’s regexp/RE2, Rust’s regex) — slightly more complex code, linear time guaranteed. The 2016 StackOverflow outage and several CVE-rated ReDoS attacks demonstrate that this isn’t an academic distinction. When you write this DP in an interview, mention RE2; it shows you understand that the elegant n×m table is the same algorithm Google’s grep uses for trillions of lines of logs daily.”

LeetCode Interview — Extreme Coding