Lab 02 — Sliding Window: Longest Substring With At Most K Distinct Characters

Goal

Master the variable-size sliding window with a frequency map and a “shrink while violation” invariant. Deliverable solves the problem in O(N) time, O(K) extra space, with the loop invariant articulated explicitly: at the end of every iteration, s[l..r] contains at most K distinct characters.

Background Concepts

The shrink-while-violation skeleton; using a count of distinct items vs a full hashmap traversal; the difference between “at most K” and “exactly K” (the atMost(K) - atMost(K-1) trick); when fixed-size and variable-size windows apply. Review pattern Sliding Window.

Interview Context

This problem (LeetCode 340) is a Google / Meta favorite, and a near-direct ancestor of LeetCode 76 (Minimum Window Substring), 159 (At Most Two Distinct), 992 (Subarrays With K Different Integers), and 424 (Longest Repeating Character Replacement). The interview signal is whether you maintain the invariant cleanly: a single while violation: loop that shrinks until valid, then unconditional answer-update. Weak candidates write nested if-statements with off-by-one errors. Strong candidates write the canonical 5-line shrink-and-update.

Problem Statement

Given a string s and an integer K, return the length of the longest substring of s that contains at most K distinct characters.

Constraints

1 ≤ |s| ≤ 5 × 10^4
0 ≤ K ≤ |s|
s consists of arbitrary characters (Unicode in some variants; ASCII in the standard variant).

Clarifying Questions

Is the alphabet ASCII or Unicode? (If Unicode, hashmap; if ASCII-lowercase, an int[26].)
What if K == 0? (Empty substring is valid; answer is 0.)
What if K >= number of distinct chars in s? (Answer is len(s).)
Is the answer the length or the substring itself? (LC asks length; mention you can record (start, length) for the substring.)
Can s be empty? (Per constraints, |s| ≥ 1, but the function should return 0 on empty input.)

Examples

Input	Output	Note
`s="eceba", K=2`	3	“ece”
`s="aa", K=1`	2	“aa”
`s="abcabc", K=2`	4	“bcbc” or “cabc” — wait, “cabc” has 3 distinct; correct: “bcbc” or “abca” — both length 4
`s="a", K=0`	0	empty
`s="abcdef", K=10`	6	full string

Initial Brute Force

Enumerate all substrings, check distinct-count, track max.

def longest_brute(s, K):
    best = 0
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            if len(seen) <= K:
                best = max(best, j - i + 1)
            else:
                break
    return best

The inner loop can break on first violation, so this is effectively O(N · K) average and O(N²) worst case (when K is large enough that no break happens).

Brute Force Complexity

Time O(N² · |alphabet|) at worst (set ops). At N=5×10^4, that’s ~2.5×10^9 — too slow in any language.

Optimization Path

Observation: as r advances, the set of distinct characters in s[l..r] is monotone non-decreasing; as l advances (keeping r fixed), it is monotone non-increasing. So we can use a two-pointer / sliding window: advance r, and while the window has more than K distinct chars, advance l. Each character enters and leaves the window at most once, total O(N).

Use a frequency map freq[c] keyed by character; distinct is the count of characters with freq[c] > 0. Increment distinct when a key first reaches 1; decrement when a key drops to 0.

Final Expected Approach

def longest_at_most_k_distinct(s, K):
    if K == 0: return 0
    freq = {}
    l = 0
    best = 0
    for r, c in enumerate(s):
        freq[c] = freq.get(c, 0) + 1
        while len(freq) > K:
            freq[s[l]] -= 1
            if freq[s[l]] == 0:
                del freq[s[l]]
            l += 1
        best = max(best, r - l + 1)
    return best

We use len(freq) directly as the distinct count, since we delete keys that hit zero.

Data Structures Used

A hashmap freq: char → int of size at most K+1 during the violation, ≤ K otherwise.
Two integer pointers l, r.
A running maximum best.

Correctness Argument

Invariant: at the start of each iteration of the outer loop and after the inner shrink, freq contains exactly the characters of s[l..r] with their counts, and len(freq) ≤ K.

Base: before iteration 0, l = 0, freq = {}, len(freq) = 0 ≤ K. ✓

Step: we add s[r] to freq. If this brings len(freq) > K, we shrink: decrement freq[s[l]], delete on zero, advance l. The shrink loop terminates because l ≤ r always (proved: each shrink-step removes one character, and at most r - l + 1 characters can be in the window, so after at most r - l + 1 shrinks we have len(freq) ≤ 1 ≤ K).

Optimality (max): for each r, the smallest l such that the window has ≤ K distinct is recorded; this gives the longest valid window ending at r. Taking the max over all r gives the global longest.

Why while, not if: in this problem each new character can add at most one to distinct, so if would also work. But the canonical sliding-window template uses while because in cousin problems (LC 76) a single r-advance can violate by more than 1 (when adding required chars). Always default to while; pay the (zero) cost of generality.

Complexity

Time O(N): each character is added once, removed at most once. Each hash op is O(1) average. Space O(min(N, K+1)) for the frequency map.

Implementation Requirements

Use len(freq) (or maintain a distinct counter) — do not iterate the map to count.
Delete keys that hit zero, or your len(freq) will be wrong.
Update best after the shrink, not inside it. (Inside the shrink, the window is invalid.)
For ASCII-lowercase, an int[26] plus a separate distinct counter is faster than a hashmap (no hashing constant).
Guard K == 0 (or K < 0) at the top.

Tests

Smoke: ("eceba", 2) → 3, ("aa", 1) → 2.
Unit: K = 0 → 0; K ≥ |distinct(s)| → |s|; single-character string.
Edge: s = "" → 0; K = 1, s = "abcdef" → 1; K = |s| → |s|.
Large: |s| = 5 × 10^4, alphabet 26, K = 3; should run in well under 100ms.
Random: generate 100 random strings of length ≤ 200, alphabet of varying size, varying K; cross-check against the brute force.
Adversarial: all-same-character ("aaaa...", K=1 → N); strictly increasing alphabet ("abcdef...", K → answer is K).
Unicode follow-up: make sure your Java/JS code iterates by codepoint, not by char, if the spec extends to Unicode.

Follow-up Questions

“Now solve exactly K distinct.” → atMost(K) - atMost(K-1) for the count variant; for the longest-with-exactly-K, run the same sliding window but only record best when len(freq) == K (not ≤ K).
“Now solve LC 76 (Minimum Window Substring).” → same skeleton, but the violation is “window does not yet contain all required chars”; we grow until satisfied, then shrink while still satisfied, recording the minimum each time we’re satisfied.
“What if the string is streamed and only r advances?” → two-pointer doesn’t apply directly; you’d need an order-preserving structure. (Out of scope here.)
“What if K can change with each query?” → preprocess differently; this problem is not amenable to a single offline sliding window for many K values.
“What if the input is very large, alphabet huge, but s only contains a few distinct chars?” → no change; the hashmap is bounded by min(K+1, |distinct(s)|).

Product Extension

A real-time content-moderation system tracks the longest run of messages in a chat where at most K distinct emojis are used (an indicator of spam-bot activity, which tends to use a small bag of emojis on repeat). The sliding window updates per message in O(1) amortized. The same skeleton applies to “longest session window with at most K distinct user-agents” for fraud detection, and “longest range of cells with at most K distinct values” for spreadsheet anomaly detection.

Language/Runtime Follow-ups

Python: dict overhead is real; for ASCII-lowercase, [0]*26 plus a distinct int is ~3× faster. Use s = list(s) only if you need indexing speed (string indexing is already O(1)).
Java: HashMap<Character, Integer> boxes keys and values. For ASCII, use int[128] and track distinct manually. chars() method exists but s.charAt(i) is fine.
Go: map[byte]int for ASCII; map[rune]int for Unicode. range s on a string yields (byte_index, rune) — this is a top-3 Go string trap: for i, c := range s does not give you i as character index for multi-byte chars.
C++: std::unordered_map<char, int> works; for ASCII, std::array<int, 128> is faster.
JS/TS: Map<string, number> works. Iterating for (const c of s) yields codepoints, not UTF-16 code units — important if Unicode is in scope. Otherwise, s[i] works for ASCII.
Unicode caveat: “distinct character” might mean codepoint, grapheme cluster, or UTF-16 code unit — clarify with the interviewer.

Common Bugs

Updating best inside the shrink loop — records invalid windows, returns wrong answers when the only valid window length is small.
Forgetting to delete zero-count keys — len(freq) becomes stale, breaks the violation check. Equivalent bug: maintaining a separate distinct counter and forgetting to decrement when a count hits zero.
Using if instead of while — works for this problem but breaks for LC 76 and friends. Build the habit of while.
Off-by-one in r - l + 1 — r is inclusive, l is inclusive; window length is r - l + 1.
Java boxing in HashMap<Character, Integer> — autoboxing tax is ~3× over the primitive int[] approach for ASCII.
Go string range bug — iterating with for i := 0; i < len(s); i++ and indexing as s[i] gives bytes, not runes; for UTF-8 data this misclassifies multi-byte chars as multiple distinct ones.
JS UTF-16 surrogate pair bug — s.length for "😀abc" is 5 (surrogate + 3 ASCII), and s[0], s[1] are the surrogate halves, not the emoji.

Debugging Strategy

Trace ("eceba", 2) by hand. Window evolution: e | ec | ece | (shrink to ce, then) cebᴬ — when b enters, window has c, e, b (3 distinct), shrink: remove c (was at position 1), window is e b then add … etc. If your trace doesn’t match, your shrink logic is wrong.
Diff against the brute force on 50 random inputs.
Print (l, r, len(freq), best) per iteration; len(freq) should never exceed K after the shrink completes.

Mastery Criteria

Recognized “longest substring with property” as sliding window in <30 seconds.
Wrote the canonical shrink-while-violation template with the answer-update outside the shrink, in <5 minutes.
Articulated the loop invariant and the termination of the shrink loop.
Generalized to “exactly K distinct” via the atMost(K) - atMost(K-1) trick.
Identified the language-specific Unicode / boxing trap.
Solved LC 76, LC 424, LC 992 within a week, observing the same skeleton.

LeetCode Interview — Extreme Coding