p29 — Longest Substring Without Repeating Characters

Source: LeetCode 3 · Medium · Topics: Hash Table, String, Sliding Window Companies (2024–2025 frequency): Amazon (very high), Meta (high), Google (high), Microsoft (high), Bloomberg (medium), Apple (medium) Loop position: warmup or first algorithm problem. The canonical variable-window template.

1. Quick Context

Given a string s, find the length of the longest substring (contiguous) without repeating characters.

This is THE canonical variable-length sliding window problem. The template you learn here — expand right, shrink left when invariant breaks — generalizes to ALL window problems: Minimum Window Substring (p30), Longest Substring With At Most K Distinct Characters, Subarray Sum Equals K with positive constraints, Permutation in a String, Find All Anagrams.

Two valid mental models:

  1. Last-seen map (char → last_index): when you see a repeat inside the window, jump left to last_seen[c] + 1. O(N), one pass.
  2. Set + shrinking pointer: maintain set of chars in window; on repeat, advance left one char at a time (removing from set) until the duplicate is gone. O(N) amortized.

Model 1 is cleaner; model 2 generalizes more naturally to “at most K distinct” variants.


🔗 https://leetcode.com/problems/longest-substring-without-repeating-characters/

STOP. 20-min timer. If you can write the O(N) sliding window in under 15 min without bugs, you own this problem.


3. Prerequisite Concepts

  • Hash maps / dictionaries.
  • Sliding window template (variable size).
  • Substring vs subsequence (substring = contiguous!).

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Given a string s, return the length of the longest contiguous substring such that all characters in it are unique.”

Step 2 — Clarify:

  • “Substring or subsequence?” (Substring = contiguous. Critical.)
  • “Case-sensitive? 'A' vs 'a' distinct?” (Yes, distinct.)
  • “Character set? ASCII? Unicode?” (LC: printable ASCII / digits / symbols. ~95 chars.)
  • “Return length or the substring itself?” (Length only.)
  • “Empty string?” (Return 0.)
  • “Could be len(s) = 0 or len(s) = 5*10^4?” (Yes, both.)

Step 3 — Constraints: N ≤ 5×10⁴. O(N²·σ) brute too slow at upper bound; O(N) sliding window required.

Step 4 — Examples:

  • "abcabcbb" → 3 ("abc").
  • "bbbbb" → 1.
  • "pwwkew" → 3 ("wke"; note "pwke" is a subsequence, not substring).
  • "" → 0.
  • " " → 1.
  • "dvdf" → 3 ("vdf"; the tricky one where left must NOT jump backward).

Step 5 — Brute Force: Try every substring s[i..j]; check uniqueness via set in O(j-i). Total O(N³) or O(N²·σ) with incremental set. State it.

Step 6 — Complexity: Optimal O(N) time, O(min(N, σ)) space where σ is alphabet size.

Step 7 — Pattern Recognition: “Longest contiguous interval satisfying a property” + “property maintainable via dict” → variable sliding window with hash map.

Step 8 — Develop: see solution.py.

Step 9 — Test: Empty, all-same, all-distinct, single-char, the tricky "dvdf" (where last_seen['d'] = 0 is OUTSIDE the current window when we encounter the second d).


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — The “don’t jump backward” subtlety

The map-based approach:

left = 0
best = 0
last_seen = {}  # char → last index it was seen
for right, c in enumerate(s):
    if c in last_seen and last_seen[c] >= left:
        left = last_seen[c] + 1
    last_seen[c] = right
    best = max(best, right - left + 1)
return best

The trap: writing left = last_seen[c] + 1 UNCONDITIONALLY (without the >= left guard, or equivalently without max(left, last_seen[c]+1)). On input "dvdf":

  • r=0, c='d': last_seen={d:0}, left=0, best=1.
  • r=1, c='v': last_seen={d:0, v:1}, left=0, best=2.
  • r=2, c='d': duplicate! last_seen['d'] = 0. Naive: left = 0 + 1 = 1. Correct: left = max(0, 0+1) = 1. (Same here.)
  • r=3, c='f': last_seen={d:2, v:1, f:3}, left=1. best = max(2, 3-1+1=3) = 3. ✓

But on "abba":

  • r=0, c='a': left=0, best=1, last={a:0}.
  • r=1, c='b': left=0, best=2, last={a:0,b:1}.
  • r=2, c='b': duplicate! last_seen['b']=1. Naive: left = 1+1 = 2. Correct: left = max(0, 2) = 2. last={a:0,b:2}, best = max(2, 1) = 2.
  • r=3, c='a': duplicate! last_seen['a']=0. Naive: left = 0+1 = 1 — but we already moved left to 2 because of the earlier 'b'! Going BACK to 1 would re-include the 'b' at index 2, breaking uniqueness. Correct: left = max(2, 0+1) = 2. last={a:3,b:2}, best = max(2, 3-2+1=2) = 2. ✓

Without the max, you’d report 3 (wrong — "bba" contains repeated b).

Take-away: in sliding window with last-seen indices, ALWAYS left = max(left, last_seen[c] + 1). Never assign directly. This is the single most common bug.

Alternative: set + shrinking pointer

left = 0
seen = set()
best = 0
for right, c in enumerate(s):
    while c in seen:
        seen.discard(s[left])
        left += 1
    seen.add(c)
    best = max(best, right - left + 1)
return best

This is more “obviously correct” — left only advances forward. Trade-off: when there’s a duplicate, you do a small inner loop, but each char is added/removed at most once → O(N) amortized total.

Both are O(N); pick the one you can write bug-free under pressure. The set approach generalizes more cleanly to “at most K distinct” variants because you can replace the while c in seen condition with while len(distinct) > K.


7. Anti-Pattern Analysis

  1. Confusing substring with subsequence. “pwwkew” → some students answer 4 ("pwke"). That’s a subsequence, not substring.
  2. left = last_seen[c] + 1 without max. Classic bug — see Section 6.
  3. O(N²) substring enumeration. Works but slow at N=5×10⁴.
  4. Using str.find() or in for each window. Both are O(N) inside an O(N) loop → O(N²).
  5. Not handling empty string. Return 0, not error.
  6. Updating last_seen[c] BEFORE checking it. You’d see the just-updated value and do nothing. Check first, then update.
  7. Maintaining the longest substring (not just length) and using s[left:right+1]. O(N) slice inside the loop — kills the linear complexity. If you must return the substring, track (best_left, best_right) indices and slice once at the end.
  8. Forgetting right - left + 1 is the window length. Off-by-one.

8. Skills & Takeaways

  • Variable sliding window template — burn this into your hands.
  • Last-seen-index map — useful pattern beyond strings (e.g., LRU cache lite, deduplication).
  • max(left, jump_target) — the “monotonic left pointer” invariant.
  • Amortized analysis — set + shrinking pointer is O(N) because each char enters/leaves at most once.
  • Substring vs subsequence — vocabulary discipline; misreading this is rookie.

9. When to Move On

Move on when:

  • You can write the map-based solution in under 8 minutes, no bugs, with max(left, ...).
  • You can write the set-based solution in under 10 minutes.
  • You can explain why "abba" breaks the naive left = last_seen[c] + 1.
  • Your stress test passes 300 random small strings vs brute.
  • You can pivot to LC 340 (Longest Substring with At Most K Distinct) by changing the invariant from len(seen) > 0 duplicates to len(seen) > K.

10. Company Context

Amazon:

  • The single most-asked Medium string problem at Amazon in 2024-25 (per reported loops).
  • Bar Raisers will probe: “can you also return the substring itself, not just length?” and “what if the alphabet is Unicode?”
  • Scorecard phrase: “Algorithm — sliding window template, correct max(left, ...) jump.”

Meta:

  • Often a warmup before a Hard. Usually expected in 10-12 min clean.
  • Scorecard phrase: “Fluent — recognized as variable-window, no false starts.”

Google:

  • May be extended to “longest substring with at most K distinct characters” (LC 340) in the same session.
  • Scorecard phrase: “Algorithm + generalization — template transferred to K-distinct variant.”

Microsoft:

  • Sometimes asked alongside or as warmup for LC 76 (Minimum Window Substring).
  • Scorecard phrase: “Strong — sliding window family fluency.”

Bloomberg:

  • Variant: “given a stream of trades, longest run without a repeated ticker.” Same algorithm.
  • Scorecard phrase: “Pragmatic — recognized streaming variable-window.”

Apple:

  • May ask for Unicode-correct version. Discussion of grapheme clusters earns Senior signal.
  • Scorecard phrase: “Correctness + i18n awareness.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
First instinctBrute O(N²) enumerateMap-based sliding windowMap-based with max(left, ...) upfrontBoth variants, mentions K-distinct generalization
"abba" testMisses itCatches after bugIncludes in test from startExplains why max is required
Substring returnSlices each window O(N²)Tracks (best_l, best_r), slices onceSame as Senior, explains whySame + unicode discussion
Complexity“Linear-ish”O(N), space O(N)O(N), space O(min(N, σ))+ amortized analysis of set variant

12. Level Delta

Mid (L4):

  • Map-based sliding window solution.
  • O(N) time, O(min(N, σ)) space stated.
  • Edge cases including "" and single char.

Senior (L5):

  • Includes max(left, last_seen[c] + 1) from the start with verbal justification.
  • Tests "abba" and "dvdf" explicitly.
  • Knows the set + shrinking-pointer alternative; chooses one with rationale.

Staff (L6):

  • Implements both approaches; explains which one to pick depending on follow-up direction (K-distinct → set; pure dedup → map).
  • Discusses Unicode: characters vs code units vs grapheme clusters.
  • Discusses streaming variant: can maintain answer online with bounded memory if we cap window length.

Principal (L7+):

  • Discusses suffix automaton / suffix array approach for offline batch processing of many such queries on the same string.
  • Discusses distributed variant: split the string into chunks; each chunk computes its own answer plus boundary “overhang” info; merge with a lambda-architecture style combine. Tricky but tractable.
  • Discusses production: web-log session deduplication, token-stream rate-limiting, DNA sequence analysis (longest distinct k-mer run).

13. Follow-up Q&A

Q1: Return the substring itself, not the length. A: Track best_left, best_right indices during the loop; at the end, return s[best_left:best_right+1]. Don’t slice inside the loop.

Q2: At most K distinct characters (LC 340). A: Replace the invariant. Use a Counter count[c]; expand right (count[c] += 1); while len(count) > K, shrink left (count[s[left]] -= 1; if count[s[left]] == 0: del count[s[left]]; left += 1). Record best = max(best, right - left + 1) each step.

Q3: Unicode strings. A: Python’s str iterates code points, so basic Unicode works as-is. Grapheme clusters (e.g., emoji with skin-tone modifiers) require regex package with \X to iterate proper graphemes. Discuss with interviewer whether the spec is “code point” or “user-perceived character.”

Q4: Streaming input — bounded memory. A: Without bounding, worst case requires O(N) memory (all distinct chars). If we cap window length to W, sliding window is fine with O(W) memory. If we want exact answer with sublinear memory, no known approach — fundamental.

Q5: Longest substring where each character appears at most K times. A: Variant of the same template. Use Counter; expand right; while any count > K, shrink left. Same O(N) time.


14. Solution Walkthrough

See solution.py:

  • length_of_longest_substring_map(s) — canonical. Map char → last index. O(N).
  • length_of_longest_substring_set(s) — set + shrinking pointer. O(N) amortized.
  • length_of_longest_substring_brute(s) — O(N²·σ) all substrings via set. Oracle.
  • stress_test() — 300 random strings, alphabet {a,b,c,d}, len ≤ 20.
  • edge_cases() — empty, single, all-same, no repeats, LC canonical, "abba", "dvdf", "tmmzuxt".

Run: python3 solution.py.


15. Beyond the Problem

Principal-engineer code review comment:

“The max(left, last_seen[c] + 1) line needs a comment. Future readers won’t derive WHY we need the max — they’ll see it as defensive coding and ‘simplify’ it in a ‘cleanup’ PR, breaking "abba". Either add # don't let left jump backward when char was last seen before current window or use the set+shrinking-pointer variant which is self-documenting. Also: if this is called millions of times per second (token stream dedup), profile the dict overhead — for ASCII-only inputs, a fixed-size int[128] array beats a Python dict by 3-5x. Choose the implementation that matches your hot path, not the textbook one.”

Connections forward:

  • p30 — Minimum Window Substring: same template, harder invariant (match counter).
  • LC 340 — Longest Substring with At Most K Distinct: trivial generalization.
  • LC 159 — Longest Substring with At Most 2 Distinct: K=2 special case.
  • LC 904 — Fruit Into Baskets: K=2 in disguise.
  • LC 1004 — Max Consecutive Ones III: “at most K zeros.”
  • Phase 02 Lab 02 (Sliding Window): template lab.
  • Production: web-log session deduplication, token rate-limiting, log compaction, DNA k-mer analysis.