p29 — Longest Substring Without Repeating Characters
Source: LeetCode 3 · Medium · Topics: Hash Table, String, Sliding Window Companies (2024–2025 frequency): Amazon (very high), Meta (high), Google (high), Microsoft (high), Bloomberg (medium), Apple (medium) Loop position: warmup or first algorithm problem. The canonical variable-window template.
1. Quick Context
Given a string s, find the length of the longest substring (contiguous) without repeating characters.
This is THE canonical variable-length sliding window problem. The template you learn here — expand right, shrink left when invariant breaks — generalizes to ALL window problems: Minimum Window Substring (p30), Longest Substring With At Most K Distinct Characters, Subarray Sum Equals K with positive constraints, Permutation in a String, Find All Anagrams.
Two valid mental models:
- Last-seen map (
char → last_index): when you see a repeat inside the window, jumplefttolast_seen[c] + 1. O(N), one pass. - Set + shrinking pointer: maintain set of chars in window; on repeat, advance
leftone char at a time (removing from set) until the duplicate is gone. O(N) amortized.
Model 1 is cleaner; model 2 generalizes more naturally to “at most K distinct” variants.
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/longest-substring-without-repeating-characters/
STOP. 20-min timer. If you can write the O(N) sliding window in under 15 min without bugs, you own this problem.
3. Prerequisite Concepts
- Hash maps / dictionaries.
- Sliding window template (variable size).
- Substring vs subsequence (substring = contiguous!).
4. How to Approach (FRAMEWORK Steps 1–9)
Step 1 — Restate: “Given a string s, return the length of the longest contiguous substring such that all characters in it are unique.”
Step 2 — Clarify:
- “Substring or subsequence?” (Substring = contiguous. Critical.)
- “Case-sensitive?
'A'vs'a'distinct?” (Yes, distinct.) - “Character set? ASCII? Unicode?” (LC: printable ASCII / digits / symbols. ~95 chars.)
- “Return length or the substring itself?” (Length only.)
- “Empty string?” (Return 0.)
- “Could be
len(s) = 0orlen(s) = 5*10^4?” (Yes, both.)
Step 3 — Constraints: N ≤ 5×10⁴. O(N²·σ) brute too slow at upper bound; O(N) sliding window required.
Step 4 — Examples:
"abcabcbb"→ 3 ("abc")."bbbbb"→ 1."pwwkew"→ 3 ("wke"; note"pwke"is a subsequence, not substring).""→ 0." "→ 1."dvdf"→ 3 ("vdf"; the tricky one whereleftmust NOT jump backward).
Step 5 — Brute Force: Try every substring s[i..j]; check uniqueness via set in O(j-i). Total O(N³) or O(N²·σ) with incremental set. State it.
Step 6 — Complexity: Optimal O(N) time, O(min(N, σ)) space where σ is alphabet size.
Step 7 — Pattern Recognition: “Longest contiguous interval satisfying a property” + “property maintainable via dict” → variable sliding window with hash map.
Step 8 — Develop: see solution.py.
Step 9 — Test: Empty, all-same, all-distinct, single-char, the tricky "dvdf" (where last_seen['d'] = 0 is OUTSIDE the current window when we encounter the second d).
5. Progressive Hints
If stuck for 5 min, open hints.md.
6. Deeper Insight — The “don’t jump backward” subtlety
The map-based approach:
left = 0
best = 0
last_seen = {} # char → last index it was seen
for right, c in enumerate(s):
if c in last_seen and last_seen[c] >= left:
left = last_seen[c] + 1
last_seen[c] = right
best = max(best, right - left + 1)
return best
The trap: writing left = last_seen[c] + 1 UNCONDITIONALLY (without the >= left guard, or equivalently without max(left, last_seen[c]+1)). On input "dvdf":
r=0, c='d': last_seen={d:0}, left=0, best=1.r=1, c='v': last_seen={d:0, v:1}, left=0, best=2.r=2, c='d': duplicate!last_seen['d'] = 0. Naive:left = 0 + 1 = 1. Correct:left = max(0, 0+1) = 1. (Same here.)r=3, c='f': last_seen={d:2, v:1, f:3}, left=1. best = max(2, 3-1+1=3) = 3. ✓
But on "abba":
r=0, c='a': left=0, best=1, last={a:0}.r=1, c='b': left=0, best=2, last={a:0,b:1}.r=2, c='b': duplicate!last_seen['b']=1. Naive:left = 1+1 = 2. Correct:left = max(0, 2) = 2. last={a:0,b:2}, best = max(2, 1) = 2.r=3, c='a': duplicate!last_seen['a']=0. Naive:left = 0+1 = 1— but we already moved left to 2 because of the earlier'b'! Going BACK to 1 would re-include the'b'at index 2, breaking uniqueness. Correct:left = max(2, 0+1) = 2. last={a:3,b:2}, best = max(2, 3-2+1=2) = 2. ✓
Without the max, you’d report 3 (wrong — "bba" contains repeated b).
Take-away: in sliding window with last-seen indices, ALWAYS left = max(left, last_seen[c] + 1). Never assign directly. This is the single most common bug.
Alternative: set + shrinking pointer
left = 0
seen = set()
best = 0
for right, c in enumerate(s):
while c in seen:
seen.discard(s[left])
left += 1
seen.add(c)
best = max(best, right - left + 1)
return best
This is more “obviously correct” — left only advances forward. Trade-off: when there’s a duplicate, you do a small inner loop, but each char is added/removed at most once → O(N) amortized total.
Both are O(N); pick the one you can write bug-free under pressure. The set approach generalizes more cleanly to “at most K distinct” variants because you can replace the while c in seen condition with while len(distinct) > K.
7. Anti-Pattern Analysis
- Confusing substring with subsequence. “pwwkew” → some students answer 4 (
"pwke"). That’s a subsequence, not substring. left = last_seen[c] + 1withoutmax. Classic bug — see Section 6.- O(N²) substring enumeration. Works but slow at N=5×10⁴.
- Using
str.find()orinfor each window. Both are O(N) inside an O(N) loop → O(N²). - Not handling empty string. Return 0, not error.
- Updating
last_seen[c]BEFORE checking it. You’d see the just-updated value and do nothing. Check first, then update. - Maintaining the longest substring (not just length) and using
s[left:right+1]. O(N) slice inside the loop — kills the linear complexity. If you must return the substring, track(best_left, best_right)indices and slice once at the end. - Forgetting
right - left + 1is the window length. Off-by-one.
8. Skills & Takeaways
- Variable sliding window template — burn this into your hands.
- Last-seen-index map — useful pattern beyond strings (e.g., LRU cache lite, deduplication).
max(left, jump_target)— the “monotonic left pointer” invariant.- Amortized analysis — set + shrinking pointer is O(N) because each char enters/leaves at most once.
- Substring vs subsequence — vocabulary discipline; misreading this is rookie.
9. When to Move On
Move on when:
- You can write the map-based solution in under 8 minutes, no bugs, with
max(left, ...). - You can write the set-based solution in under 10 minutes.
- You can explain why
"abba"breaks the naiveleft = last_seen[c] + 1. - Your stress test passes 300 random small strings vs brute.
- You can pivot to LC 340 (Longest Substring with At Most K Distinct) by changing the invariant from
len(seen) > 0 duplicatestolen(seen) > K.
10. Company Context
Amazon:
- The single most-asked Medium string problem at Amazon in 2024-25 (per reported loops).
- Bar Raisers will probe: “can you also return the substring itself, not just length?” and “what if the alphabet is Unicode?”
- Scorecard phrase: “Algorithm — sliding window template, correct
max(left, ...)jump.”
Meta:
- Often a warmup before a Hard. Usually expected in 10-12 min clean.
- Scorecard phrase: “Fluent — recognized as variable-window, no false starts.”
Google:
- May be extended to “longest substring with at most K distinct characters” (LC 340) in the same session.
- Scorecard phrase: “Algorithm + generalization — template transferred to K-distinct variant.”
Microsoft:
- Sometimes asked alongside or as warmup for LC 76 (Minimum Window Substring).
- Scorecard phrase: “Strong — sliding window family fluency.”
Bloomberg:
- Variant: “given a stream of trades, longest run without a repeated ticker.” Same algorithm.
- Scorecard phrase: “Pragmatic — recognized streaming variable-window.”
Apple:
- May ask for Unicode-correct version. Discussion of grapheme clusters earns Senior signal.
- Scorecard phrase: “Correctness + i18n awareness.”
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff/Principal |
|---|---|---|---|---|
| First instinct | Brute O(N²) enumerate | Map-based sliding window | Map-based with max(left, ...) upfront | Both variants, mentions K-distinct generalization |
"abba" test | Misses it | Catches after bug | Includes in test from start | Explains why max is required |
| Substring return | Slices each window O(N²) | Tracks (best_l, best_r), slices once | Same as Senior, explains why | Same + unicode discussion |
| Complexity | “Linear-ish” | O(N), space O(N) | O(N), space O(min(N, σ)) | + amortized analysis of set variant |
12. Level Delta
Mid (L4):
- Map-based sliding window solution.
- O(N) time, O(min(N, σ)) space stated.
- Edge cases including
""and single char.
Senior (L5):
- Includes
max(left, last_seen[c] + 1)from the start with verbal justification. - Tests
"abba"and"dvdf"explicitly. - Knows the set + shrinking-pointer alternative; chooses one with rationale.
Staff (L6):
- Implements both approaches; explains which one to pick depending on follow-up direction (K-distinct → set; pure dedup → map).
- Discusses Unicode: characters vs code units vs grapheme clusters.
- Discusses streaming variant: can maintain answer online with bounded memory if we cap window length.
Principal (L7+):
- Discusses suffix automaton / suffix array approach for offline batch processing of many such queries on the same string.
- Discusses distributed variant: split the string into chunks; each chunk computes its own answer plus boundary “overhang” info; merge with a lambda-architecture style combine. Tricky but tractable.
- Discusses production: web-log session deduplication, token-stream rate-limiting, DNA sequence analysis (longest distinct k-mer run).
13. Follow-up Q&A
Q1: Return the substring itself, not the length.
A: Track best_left, best_right indices during the loop; at the end, return s[best_left:best_right+1]. Don’t slice inside the loop.
Q2: At most K distinct characters (LC 340).
A: Replace the invariant. Use a Counter count[c]; expand right (count[c] += 1); while len(count) > K, shrink left (count[s[left]] -= 1; if count[s[left]] == 0: del count[s[left]]; left += 1). Record best = max(best, right - left + 1) each step.
Q3: Unicode strings.
A: Python’s str iterates code points, so basic Unicode works as-is. Grapheme clusters (e.g., emoji with skin-tone modifiers) require regex package with \X to iterate proper graphemes. Discuss with interviewer whether the spec is “code point” or “user-perceived character.”
Q4: Streaming input — bounded memory. A: Without bounding, worst case requires O(N) memory (all distinct chars). If we cap window length to W, sliding window is fine with O(W) memory. If we want exact answer with sublinear memory, no known approach — fundamental.
Q5: Longest substring where each character appears at most K times. A: Variant of the same template. Use Counter; expand right; while any count > K, shrink left. Same O(N) time.
14. Solution Walkthrough
See solution.py:
length_of_longest_substring_map(s)— canonical. Map char → last index. O(N).length_of_longest_substring_set(s)— set + shrinking pointer. O(N) amortized.length_of_longest_substring_brute(s)— O(N²·σ) all substrings via set. Oracle.stress_test()— 300 random strings, alphabet {a,b,c,d}, len ≤ 20.edge_cases()— empty, single, all-same, no repeats, LC canonical,"abba","dvdf","tmmzuxt".
Run: python3 solution.py.
15. Beyond the Problem
Principal-engineer code review comment:
“The
max(left, last_seen[c] + 1)line needs a comment. Future readers won’t derive WHY we need themax— they’ll see it as defensive coding and ‘simplify’ it in a ‘cleanup’ PR, breaking"abba". Either add# don't let left jump backward when char was last seen before current windowor use the set+shrinking-pointer variant which is self-documenting. Also: if this is called millions of times per second (token stream dedup), profile the dict overhead — for ASCII-only inputs, a fixed-sizeint[128]array beats a Python dict by 3-5x. Choose the implementation that matches your hot path, not the textbook one.”
Connections forward:
- p30 — Minimum Window Substring: same template, harder invariant (match counter).
- LC 340 — Longest Substring with At Most K Distinct: trivial generalization.
- LC 159 — Longest Substring with At Most 2 Distinct: K=2 special case.
- LC 904 — Fruit Into Baskets: K=2 in disguise.
- LC 1004 — Max Consecutive Ones III: “at most K zeros.”
- Phase 02 Lab 02 (Sliding Window): template lab.
- Production: web-log session deduplication, token rate-limiting, log compaction, DNA k-mer analysis.