Lab 06 — LCS / Edit Distance
Goal
Solve Edit Distance (LC 72 — Levenshtein) with the full four-stage progression. Internalize the canonical two-string DP dp[i][j] indexed by prefix lengths, and the three-way min over insert / delete / replace. After this lab you can write any LCS-family DP from a blank screen in <12 minutes and apply the rolling-row collapse to O(M) space.
Background Concepts
Edit distance — sometimes called Levenshtein distance — is the minimum number of single-character edits (insert, delete, replace) needed to transform string A into string B. The state dp[i][j] indexes prefix-i of A and prefix-j of B; the recurrence has one branch per edit operation plus a free pass on character match.
LCS (longest common subsequence) and edit distance are the foundational two-string DPs. They share the index convention (dp[i][j] = answer for prefix-i of A, prefix-j of B), the boundary handling (dp[i][0] = i, dp[0][j] = j), and the rolling-row space optimization (O(N · M) → O(M)). Mastering edit distance gives you LCS for free.
Interview Context
Edit Distance is a top-15 Hard-tagged DP problem at Google, Microsoft, and Amazon. It shows up in coding rounds at staff level routinely, often paired with a follow-up “now reconstruct the alignment”. LCS (LC 1143) is the gentler Medium variant and tests the same skill. Candidates who can derive both recurrences from scratch and articulate the four edit operations precisely demonstrate fluency that translates to nearly every two-string DP problem in the corpus (regex match, distinct subsequences, interleaving strings, longest common substring).
Problem Statement
Given two strings word1 and word2, return the minimum number of operations required to convert word1 into word2. Allowed operations: insert a character, delete a character, replace a character (each cost 1).
Constraints
- 0 ≤
word1.length,word2.length≤ 500 word1andword2consist of lowercase English letters.
Clarifying Questions
- Are insert/delete/replace each cost 1? (Yes — Levenshtein.)
- Are there any other operations (transpose, e.g.)? (No — Damerau-Levenshtein adds transpose; not in scope.)
- Are characters lowercase only? (Yes — given.)
- Are empty strings valid inputs? (Yes; answer is
len(word1) + len(word2)’s difference, specificallymax(len(word1), len(word2))when one is empty.) - Return the count or the alignment? (Count — alignment is a follow-up.)
Examples
word1="horse", word2="ros" → 3 (horse→rorse→rose→ros)
word1="intention", word2="execution" → 5
word1="", word2="abc" → 3 (insert 3)
word1="abc", word2="" → 3 (delete 3)
word1="abc", word2="abc" → 0
Initial Brute Force
def edit_brute(w1, w2):
def f(i, j):
if i == 0: return j # insert remaining w2
if j == 0: return i # delete remaining w1
if w1[i-1] == w2[j-1]:
return f(i-1, j-1) # match: no edit
return 1 + min(
f(i-1, j), # delete w1[i-1]
f(i, j-1), # insert w2[j-1]
f(i-1, j-1), # replace
)
return f(len(w1), len(w2))
Brute Force Complexity
Each non-base call branches into 3; recursion depth N + M. Worst case O(3^(N+M)). At N=M=500, completely infeasible.
Optimization Path
There are (N+1)(M+1) distinct (i, j) pairs — memoization gives O(N · M) time. Tabulation replaces recursion with a row-major loop. Since dp[i][j] depends only on dp[i-1][j-1], dp[i-1][j], dp[i][j-1], the previous row plus the in-progress row are enough — collapse to two 1D arrays of size M+1. With careful use of a saved diagonal, you can collapse to a single 1D array.
Final Expected Approach
dp[i][j] = edit distance between word1[:i] and word2[:j].
dp[0][j] = j (insert j chars to get word2[:j] from empty word1[:0])
dp[i][0] = i (delete i chars from word1[:i] to get empty word2[:0])
dp[i][j] = dp[i-1][j-1] if word1[i-1] == word2[j-1]
= 1 + min(dp[i-1][j-1], dp[i-1][j], dp[i][j-1]) otherwise
The three operations correspond to:
dp[i-1][j-1] + 1— replaceword1[i-1]withword2[j-1].dp[i-1][j] + 1— deleteword1[i-1].dp[i][j-1] + 1— insertword2[j-1].
Data Structures Used
- 2D
dp[(N+1) x (M+1)]array (tabulated). - Two 1D
prev,currarrays of sizeM+1(rolled).
Correctness Argument
By induction on (i, j) in row-major order. Base cases: dp[0][j] = j (clearly j inserts), dp[i][0] = i (clearly i deletes). Inductive step: an optimal alignment of word1[:i] with word2[:j] ends with one of: (a) match — word1[i-1] == word2[j-1] aligned; cost is dp[i-1][j-1]. (b) replace — pair word1[i-1] with word2[j-1]; cost dp[i-1][j-1] + 1. (c) delete — word1[i-1] deleted, word1[:i-1] aligned with word2[:j]; cost dp[i-1][j] + 1. (d) insert — word2[j-1] inserted, word1[:i] aligned with word2[:j-1]; cost dp[i][j-1] + 1. The min of these is the optimum. (a) and (b) are mutually exclusive based on character equality.
Complexity
| Stage | Time | Space |
|---|---|---|
| Brute force | O(3^(N+M)) | O(N+M) |
| Memoized | O(N · M) | O(N · M) |
| Tabulated | O(N · M) | O(N · M) |
| Space-optimized | O(N · M) | O(min(N, M)) |
Implementation Requirements
All four stages.
# ---- Stage 1: Brute force ----
def edit_brute(w1, w2):
def f(i, j):
if i == 0: return j
if j == 0: return i
if w1[i-1] == w2[j-1]: return f(i-1, j-1)
return 1 + min(f(i-1, j), f(i, j-1), f(i-1, j-1))
return f(len(w1), len(w2))
# ---- Stage 2: Memoized ----
from functools import lru_cache
def edit_memo(w1, w2):
@lru_cache(None)
def f(i, j):
if i == 0: return j
if j == 0: return i
if w1[i-1] == w2[j-1]: return f(i-1, j-1)
return 1 + min(f(i-1, j), f(i, j-1), f(i-1, j-1))
return f(len(w1), len(w2))
# ---- Stage 3: Tabulated 2D ----
def edit_tab(w1, w2):
n, m = len(w1), len(w2)
dp = [[0] * (m + 1) for _ in range(n + 1)]
for j in range(m + 1): dp[0][j] = j
for i in range(n + 1): dp[i][0] = i
for i in range(1, n + 1):
for j in range(1, m + 1):
if w1[i-1] == w2[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
return dp[n][m]
# ---- Stage 4: Space-optimized O(M) ----
def minDistance(w1, w2):
n, m = len(w1), len(w2)
if n < m:
w1, w2, n, m = w2, w1, m, n # ensure m is the smaller dim
prev = list(range(m + 1))
for i in range(1, n + 1):
curr = [i] + [0] * m
for j in range(1, m + 1):
if w1[i-1] == w2[j-1]:
curr[j] = prev[j-1]
else:
curr[j] = 1 + min(prev[j], curr[j-1], prev[j-1])
prev = curr
return prev[m]
Tests
("horse", "ros")→ 3.("intention", "execution")→ 5.("", "abc")→ 3.("abc", "")→ 3.("abc", "abc")→ 0.("a", "b")→ 1.("a"*500, "b"*500)→ 500 — performance test.- All four implementations equivalent on random N≤8 inputs.
Follow-up Questions
- “Reconstruct the alignment (sequence of operations).” → Backtrack from
dp[n][m]: at each(i, j), look at which of the three (or four) predecessors matches the current value; emit the corresponding operation. - “Custom costs for insert / delete / replace.” → Replace
+1with the appropriate cost in each branch; works without other change. - “Levenshtein with transpositions (Damerau-Levenshtein).” → Add a fourth branch
dp[i-2][j-2] + 1ifword1[i-1]==word2[j-2] and word1[i-2]==word2[j-1]. - “Longest common subsequence.” (LC 1143) → Same shape; recurrence:
dp[i][j] = dp[i-1][j-1] + 1if match, elsemax(dp[i-1][j], dp[i][j-1]). - “Minimum ASCII delete sum.” (LC 712) → Variant where deletes cost ASCII value of the deleted char.
Product Extension
Edit distance is the engine behind diff/patch tools, spell correctors, fuzzy search (“did you mean”), DNA-sequence alignment (Needleman-Wunsch is a generalization with custom scoring matrices), and code-review side-by-side comparison. Real systems use Hirschberg’s algorithm to reconstruct the alignment in O(M) space.
Language/Runtime Follow-ups
- Python:
lru_cacheworks on(i, j)since both are ints. The space-optimized version benefits from swapping to ensurem ≤ n. - Java:
int[][] dpwith explicit boundary fill. For O(M) space, twoint[m+1]arrays. - Go: 2D slice;
make([][]int, n+1)then per-rowmake([]int, m+1). - C++:
vector<vector<int>>2D; or twovector<int>(m+1)for O(M). - JS/TS: 2D array via
Array.from({length: n+1}, () => new Array(m+1).fill(0)).
Common Bugs
- Off-by-one between string index and DP index —
word1[i-1]notword1[i]. The convention “prefix-i” means indexiis exclusive in the prefix, so the latest char isword1[i-1]. - Forgetting the boundary
dp[0][j] = j,dp[i][0] = i— defaults to 0 and produces nonsense answers. - Using
maxinstead ofminin the recurrence — wrong direction. - Including
dp[i-1][j-1] + 1in the match branch — match has no edit cost; should bedp[i-1][j-1]exactly. - Space-optimized version: forgetting
curr[0] = i— the new row’s column 0 isi(deletingichars to match empty prefix), not 0. - Mistakenly thinking insert and delete are symmetric in cost across both strings — they aren’t; insert into
word1is the same as delete fromword2. Levenshtein conflates them in our recurrence which is fine because all costs are 1.
Debugging Strategy
For ("horse", "ros"), the full table is:
"" r o s
"" 0 1 2 3
h 1 1 2 3
o 2 2 1 2
r 3 2 2 2
s 4 3 3 2
e 5 4 4 3
Print and check. If the boundary row/column is wrong, the entire table is. For the rolled version, print prev after each row.
Mastery Criteria
- Recognized “min operations to transform” as edit distance within 60 seconds.
- Wrote the brute recursion with all four cases (match / replace / delete / insert) in <3 minutes.
- Wrote the 2D tabulated version in <5 minutes.
- Performed the rolling-row collapse to two 1D arrays in <3 minutes.
- Stated O(N · M) time and O(M) space.
- Articulated which DP cell corresponds to which edit operation.
- Solved LC 72 unaided in <20 minutes (full progression).
- Solved LC 1143 (LCS) in <8 minutes by changing the recurrence.
- Solved LC 583 (Delete Operation for Two Strings) in <8 minutes (LCS + arithmetic).