p59 — Longest Common Subsequence

Source: LeetCode 1143 · Medium · Topics: 2D DP, String DP Companies (2024–2025): Google (very high), Amazon (high), Meta (high), Microsoft (high), Bloomberg (medium-high) Loop position: The anchor 2D string DP. Tests grid-DP fluency; sets the template for Edit Distance, Interleaving String, Distinct Subsequences, Regular Expression Matching.

1. Quick Context

Given two strings text1, text2, return the length of their longest common subsequence (not necessarily contiguous).

dp[i][j] = length of LCS of text1[:i] and text2[:j]
if text1[i-1] == text2[j-1]: dp[i][j] = dp[i-1][j-1] + 1
else: dp[i][j] = max(dp[i-1][j], dp[i][j-1])
dp[0][j] = dp[i][0] = 0
answer = dp[m][n]

O(m·n) time, O(m·n) space; compressible to O(min(m,n)) space with a rolling 1D array.

🔗 https://leetcode.com/problems/longest-common-subsequence/

STOP. 20-min timer.

3. Prerequisite Concepts

  • 2D DP grid; indexing convention (dp[i][j] corresponds to prefixes text1[:i], text2[:j]).
  • Subsequence vs substring distinction.
  • Inclusion-exclusion reasoning (“if last chars match, contribute 1; else drop one side”).

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Pick matching characters from both strings, preserving order, maximize count.

2. Clarify: Strings only lowercase? (Yes typically.) Case sensitive? (Yes.) Max length 1000 on LC.

3. Examples: ("abcde", "ace") → 3; ("abc", "abc") → 3; ("abc", "def") → 0.

5. Brute force: Recursion on prefix lengths; exponential without memo.

6. Target: O(m·n) time, O(min(m,n)) space.

7. Pattern: 2D string DP, “match or drop one side.”

8. Develop: see solution.py.

9. Test: empty inputs; identical; disjoint; one a subseq of other; LC examples.

5. Progressive Hints

See hints.md.

6. Deeper Insight

6.1 The recurrence

Define dp[i][j] = LCS of the prefixes text1[:i] and text2[:j].

Case A: characters match. text1[i-1] == text2[j-1]. The LCS must include this matched character (or there exists an equal-length LCS that does), so dp[i][j] = dp[i-1][j-1] + 1.

Case B: characters differ. The LCS uses at most one of these last characters. Try dropping each: dp[i][j] = max(dp[i-1][j], dp[i][j-1]).

Base: dp[0][*] = dp[*][0] = 0 (empty prefix has zero LCS).

6.2 Why +1 in the match case is correct

If text1[i-1] == text2[j-1], suppose the optimal LCS does not use this matching character. Then it’s an LCS of text1[:i-1] and text2[:j-1] with same length. So we can equivalently extend it by appending this matched character — without loss of optimality. Hence dp[i][j] = dp[i-1][j-1] + 1.

6.3 Space compression to O(min(m, n))

Notice dp[i][j] depends only on dp[i-1][*] (previous row) and dp[i][j-1] (current row, previous column). Use ONE 1D array, rolling left-to-right.

Subtle: when computing dp[i][j], you need the old dp[i-1][j-1] value — which gets overwritten when you move to column j. Save it in a prev_diag temporary:

prev_diag = dp[0]
for j in 1..n:
    tmp = dp[j]
    if text1[i-1] == text2[j-1]:
        dp[j] = prev_diag + 1
    else:
        dp[j] = max(dp[j], dp[j-1])
    prev_diag = tmp

Use the shorter string as the columns to minimize space.

6.4 Reconstructing the actual LCS

Walk back from dp[m][n]:

i, j = m, n
result = []
while i > 0 and j > 0:
    if text1[i-1] == text2[j-1]:
        result.append(text1[i-1]); i -= 1; j -= 1
    elif dp[i-1][j] >= dp[i][j-1]:
        i -= 1
    else:
        j -= 1
result.reverse()

Requires full 2D table. Tie-breaking choice yields one valid LCS; multiple LCSes may exist.

6.5 Connection to Edit Distance

LCS and Edit Distance (p60) share the same recurrence skeleton — both dp[i][j] over two-string prefixes with three or four moves (diag, up, left). Edit Distance adds a “substitute” move; LCS has only “match-or-skip-one-side.” Same skeleton, different transitions.

In fact: edit_distance(s, t) >= m + n - 2 * lcs(s, t), with equality when only insertions/deletions are allowed (no substitution).

6.6 Bit-parallel LCS (Hunt-Szymanski)

For very long strings with small alphabets, the Allison-Dix bit-parallel algorithm computes LCS in O(m·n / w) where w is machine word size. Out of scope for LC but a real-world technique used in diff and bioinformatics aligners.

6.7 Distinguishing from “Longest Common Substring”

Substring = contiguous. Different recurrence: dp[i][j] = dp[i-1][j-1] + 1 if match, else 0 (chain breaks). Answer = max(dp).

Don’t confuse these two.

7. Anti-Pattern Analysis

  1. Confuse subsequence with substring — wrong recurrence; very common.
  2. Off-by-one in indicestext1[i] instead of text1[i-1] when dp[i][j] is defined on prefixes of length i, j. Inverts the whole grid.
  3. Forget the max in the mismatch casedp[i][j] = dp[i-1][j] only, misses the other branch.
  4. dp[0][0] = 1 — wrong base case; should be 0 (empty prefix has no LCS).
  5. 1D compression without prev_diag — corrupts the match case by reading the already-overwritten value.
  6. Reconstruct LCS without storing the full 2D table — impossible with O(min) space; tradeoff.

8. Skills & Takeaways

  • 2D string-DP grid template.
  • “Match or drop one side” decomposition — generalizes to many sequence-alignment problems.
  • O(min(m,n)) space with prev_diag trick.
  • Distinct-from-substring framing.
  • Foundation for Edit Distance (p60), Distinct Subsequences (LC 115), Interleaving String, Wildcard Matching.

9. When to Move On

  • Implement O(m·n) cold in 15 min.
  • Compress to 1D space with prev_diag correctly.
  • Reconstruct an actual LCS via backtrack.
  • Solve LC 583 (Delete Operation for Two Strings) in 5 min as a corollary.
  • Articulate the recurrence in 60 seconds verbally.

10. Company Context

  • Google: L5+. Often paired with “now reconstruct the LCS.”
  • Amazon: L6 onsite; productized as “diff-tool minimal common ancestor.”
  • Meta: E5; LC 583 (delete ops) follow-up.
  • Microsoft: Compiler / DevDiv teams; diff algorithms.
  • Bloomberg: Trade-message reconciliation.

Scorecard for strong: “Derived recurrence, compressed to 1D with prev_diag, reconstructed LCS, related to Edit Distance.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff+
Distinction subseq/substringConfusesAwareArticulates difference+ writes both recurrences
Recurrence derivationMemorizedDerives slowlyDerives in 30s with justification+ connects to Edit Distance + sequence alignment
Space compressionO(m·n)+ prompted1D with prev_diag+ tradeoff with reconstruction stated
ReconstructionDoesn’tSketchesImplements backtrack+ ties broken consistently + alt LCSes acknowledged

12. Level Delta

  • Mid (L4): Correct O(m·n) DP, returns length.
  • Senior (L5): + 1D space + reconstruction + LC 583 instant solve.
  • Staff (L6): + connects to Edit Distance recurrence + mentions Hunt-Szymanski for real-world diff + LCS-of-K-strings is NP-hard.
  • Principal (L7+): + bit-parallel algorithms + offline sequence alignment with affine gap costs (Gotoh) + biological sequence alignment context (Needleman-Wunsch).

13. Follow-up Q&A

Q1: Print one valid LCS. A1: Backtrack from dp[m][n] as in §6.4. Requires the full 2D table.

Q2: LC 583 — minimum number of delete operations to make two strings equal. A2: m + n - 2 * lcs(text1, text2). (We keep the LCS in both, delete everything else.)

Q3: LC 1092 — shortest common supersequence. A3: Length = m + n - lcs. To reconstruct: backtrack as in §6.4 but emit BOTH characters at mismatch steps and the matched character at match steps.

Q4: LCS of K strings. A4: NP-hard in general. K-dimensional DP works for fixed K with O(n^K) time.

Q5: Memory-bounded large-string LCS — what if m, n = 10^6? A5: O(m·n) is 10^12 — infeasible. Real-world tools (Myers’s diff, used in git diff) solve in O((m+n)·D) where D is the edit distance, suited to “small diff” cases. For large D, fall back to chunked / heuristic alignment.

14. Solution Walkthrough

See solution.py:

  • lcs_2d — O(m·n) time and space, full table (for reconstruction).
  • lcs_1d — O(min(m,n)) space via prev_diag.
  • lcs_reconstruct — returns one valid LCS string.
  • lcs_brute — exponential, all subsequences (stress only for small inputs).

15. Beyond the Problem

Principal code review: “LCS is the grandparent of every sequence-alignment algorithm in production — git diff, rsync, bioinformatics BLAST, voice recognition (DTW). The recurrence you wrote on the whiteboard, with variants for affine gap penalties and scoring matrices, runs in genomics pipelines processing terabytes daily. When you reach for the 2D grid, remember you’re not solving a toy problem — you’re discovering the same algorithm Smith and Waterman published in 1981 that won them a place in CS history. The interview asks for the length; production asks for the alignment, the score, the gap structure, and the parallelization strategy. Knowing only the length is the floor.”