Lab 02 — Sliding Window: Longest Substring With At Most K Distinct Characters
Goal
Master the variable-size sliding window with a frequency map and a “shrink while violation” invariant. Deliverable solves the problem in O(N) time, O(K) extra space, with the loop invariant articulated explicitly: at the end of every iteration, s[l..r] contains at most K distinct characters.
Background Concepts
The shrink-while-violation skeleton; using a count of distinct items vs a full hashmap traversal; the difference between “at most K” and “exactly K” (the atMost(K) - atMost(K-1) trick); when fixed-size and variable-size windows apply. Review pattern Sliding Window.
Interview Context
This problem (LeetCode 340) is a Google / Meta favorite, and a near-direct ancestor of LeetCode 76 (Minimum Window Substring), 159 (At Most Two Distinct), 992 (Subarrays With K Different Integers), and 424 (Longest Repeating Character Replacement). The interview signal is whether you maintain the invariant cleanly: a single while violation: loop that shrinks until valid, then unconditional answer-update. Weak candidates write nested if-statements with off-by-one errors. Strong candidates write the canonical 5-line shrink-and-update.
Problem Statement
Given a string s and an integer K, return the length of the longest substring of s that contains at most K distinct characters.
Constraints
1 ≤ |s| ≤ 5 × 10^40 ≤ K ≤ |s|sconsists of arbitrary characters (Unicode in some variants; ASCII in the standard variant).
Clarifying Questions
- Is the alphabet ASCII or Unicode? (If Unicode, hashmap; if ASCII-lowercase, an
int[26].) - What if
K == 0? (Empty substring is valid; answer is 0.) - What if
K >= number of distinct chars in s? (Answer islen(s).) - Is the answer the length or the substring itself? (LC asks length; mention you can record
(start, length)for the substring.) - Can
sbe empty? (Per constraints,|s| ≥ 1, but the function should return 0 on empty input.)
Examples
| Input | Output | Note |
|---|---|---|
s="eceba", K=2 | 3 | “ece” |
s="aa", K=1 | 2 | “aa” |
s="abcabc", K=2 | 4 | “bcbc” or “cabc” — wait, “cabc” has 3 distinct; correct: “bcbc” or “abca” — both length 4 |
s="a", K=0 | 0 | empty |
s="abcdef", K=10 | 6 | full string |
Initial Brute Force
Enumerate all substrings, check distinct-count, track max.
def longest_brute(s, K):
best = 0
for i in range(len(s)):
seen = set()
for j in range(i, len(s)):
seen.add(s[j])
if len(seen) <= K:
best = max(best, j - i + 1)
else:
break
return best
The inner loop can break on first violation, so this is effectively O(N · K) average and O(N²) worst case (when K is large enough that no break happens).
Brute Force Complexity
Time O(N² · |alphabet|) at worst (set ops). At N=5×10^4, that’s ~2.5×10^9 — too slow in any language.
Optimization Path
Observation: as r advances, the set of distinct characters in s[l..r] is monotone non-decreasing; as l advances (keeping r fixed), it is monotone non-increasing. So we can use a two-pointer / sliding window: advance r, and while the window has more than K distinct chars, advance l. Each character enters and leaves the window at most once, total O(N).
Use a frequency map freq[c] keyed by character; distinct is the count of characters with freq[c] > 0. Increment distinct when a key first reaches 1; decrement when a key drops to 0.
Final Expected Approach
def longest_at_most_k_distinct(s, K):
if K == 0: return 0
freq = {}
l = 0
best = 0
for r, c in enumerate(s):
freq[c] = freq.get(c, 0) + 1
while len(freq) > K:
freq[s[l]] -= 1
if freq[s[l]] == 0:
del freq[s[l]]
l += 1
best = max(best, r - l + 1)
return best
We use len(freq) directly as the distinct count, since we delete keys that hit zero.
Data Structures Used
- A hashmap
freq: char → intof size at most K+1 during the violation, ≤ K otherwise. - Two integer pointers
l,r. - A running maximum
best.
Correctness Argument
Invariant: at the start of each iteration of the outer loop and after the inner shrink, freq contains exactly the characters of s[l..r] with their counts, and len(freq) ≤ K.
Base: before iteration 0, l = 0, freq = {}, len(freq) = 0 ≤ K. ✓
Step: we add s[r] to freq. If this brings len(freq) > K, we shrink: decrement freq[s[l]], delete on zero, advance l. The shrink loop terminates because l ≤ r always (proved: each shrink-step removes one character, and at most r - l + 1 characters can be in the window, so after at most r - l + 1 shrinks we have len(freq) ≤ 1 ≤ K).
Optimality (max): for each r, the smallest l such that the window has ≤ K distinct is recorded; this gives the longest valid window ending at r. Taking the max over all r gives the global longest.
Why while, not if: in this problem each new character can add at most one to distinct, so if would also work. But the canonical sliding-window template uses while because in cousin problems (LC 76) a single r-advance can violate by more than 1 (when adding required chars). Always default to while; pay the (zero) cost of generality.
Complexity
Time O(N): each character is added once, removed at most once. Each hash op is O(1) average. Space O(min(N, K+1)) for the frequency map.
Implementation Requirements
- Use
len(freq)(or maintain adistinctcounter) — do not iterate the map to count. - Delete keys that hit zero, or your
len(freq)will be wrong. - Update
bestafter the shrink, not inside it. (Inside the shrink, the window is invalid.) - For ASCII-lowercase, an
int[26]plus a separatedistinctcounter is faster than a hashmap (no hashing constant). - Guard
K == 0(orK < 0) at the top.
Tests
- Smoke:
("eceba", 2) → 3,("aa", 1) → 2. - Unit:
K = 0→ 0;K ≥ |distinct(s)|→|s|; single-character string. - Edge:
s = ""→ 0;K = 1, s = "abcdef"→ 1;K = |s|→|s|. - Large:
|s| = 5 × 10^4, alphabet 26,K = 3; should run in well under 100ms. - Random: generate 100 random strings of length ≤ 200, alphabet of varying size, varying K; cross-check against the brute force.
- Adversarial: all-same-character (
"aaaa...", K=1 → N); strictly increasing alphabet ("abcdef...", K → answer is K). - Unicode follow-up: make sure your Java/JS code iterates by codepoint, not by char, if the spec extends to Unicode.
Follow-up Questions
- “Now solve exactly K distinct.” →
atMost(K) - atMost(K-1)for the count variant; for the longest-with-exactly-K, run the same sliding window but only recordbestwhenlen(freq) == K(not≤ K). - “Now solve LC 76 (Minimum Window Substring).” → same skeleton, but the violation is “window does not yet contain all required chars”; we grow until satisfied, then shrink while still satisfied, recording the minimum each time we’re satisfied.
- “What if the string is streamed and only
radvances?” → two-pointer doesn’t apply directly; you’d need an order-preserving structure. (Out of scope here.) - “What if K can change with each query?” → preprocess differently; this problem is not amenable to a single offline sliding window for many K values.
- “What if the input is very large, alphabet huge, but
sonly contains a few distinct chars?” → no change; the hashmap is bounded bymin(K+1, |distinct(s)|).
Product Extension
A real-time content-moderation system tracks the longest run of messages in a chat where at most K distinct emojis are used (an indicator of spam-bot activity, which tends to use a small bag of emojis on repeat). The sliding window updates per message in O(1) amortized. The same skeleton applies to “longest session window with at most K distinct user-agents” for fraud detection, and “longest range of cells with at most K distinct values” for spreadsheet anomaly detection.
Language/Runtime Follow-ups
- Python:
dictoverhead is real; for ASCII-lowercase,[0]*26plus adistinctint is ~3× faster. Uses = list(s)only if you need indexing speed (string indexing is already O(1)). - Java:
HashMap<Character, Integer>boxes keys and values. For ASCII, useint[128]and trackdistinctmanually.chars()method exists buts.charAt(i)is fine. - Go:
map[byte]intfor ASCII;map[rune]intfor Unicode.range son a string yields(byte_index, rune)— this is a top-3 Go string trap:for i, c := range sdoes not give youias character index for multi-byte chars. - C++:
std::unordered_map<char, int>works; for ASCII,std::array<int, 128>is faster. - JS/TS:
Map<string, number>works. Iteratingfor (const c of s)yields codepoints, not UTF-16 code units — important if Unicode is in scope. Otherwise,s[i]works for ASCII. - Unicode caveat: “distinct character” might mean codepoint, grapheme cluster, or UTF-16 code unit — clarify with the interviewer.
Common Bugs
- Updating
bestinside the shrink loop — records invalid windows, returns wrong answers when the only valid window length is small. - Forgetting to delete zero-count keys —
len(freq)becomes stale, breaks the violation check. Equivalent bug: maintaining a separatedistinctcounter and forgetting to decrement when a count hits zero. - Using
ifinstead ofwhile— works for this problem but breaks for LC 76 and friends. Build the habit ofwhile. - Off-by-one in
r - l + 1—ris inclusive,lis inclusive; window length isr - l + 1. - Java boxing in
HashMap<Character, Integer>— autoboxing tax is ~3× over the primitiveint[]approach for ASCII. - Go string range bug — iterating with
for i := 0; i < len(s); i++and indexing ass[i]gives bytes, not runes; for UTF-8 data this misclassifies multi-byte chars as multiple distinct ones. - JS UTF-16 surrogate pair bug —
s.lengthfor"😀abc"is 5 (surrogate + 3 ASCII), ands[0],s[1]are the surrogate halves, not the emoji.
Debugging Strategy
- Trace
("eceba", 2)by hand. Window evolution:e | ec | ece | (shrink to ce, then) cebᴬ— whenbenters, window hasc, e, b(3 distinct), shrink: removec(was at position 1), window ise bthen add … etc. If your trace doesn’t match, your shrink logic is wrong. - Diff against the brute force on 50 random inputs.
- Print
(l, r, len(freq), best)per iteration;len(freq)should never exceed K after the shrink completes.
Mastery Criteria
- Recognized “longest substring with property” as sliding window in <30 seconds.
- Wrote the canonical shrink-while-violation template with the answer-update outside the shrink, in <5 minutes.
- Articulated the loop invariant and the termination of the shrink loop.
- Generalized to “exactly K distinct” via the
atMost(K) - atMost(K-1)trick. - Identified the language-specific Unicode / boxing trap.
- Solved LC 76, LC 424, LC 992 within a week, observing the same skeleton.