Lab 02 — Sliding Window: Longest Substring With At Most K Distinct Characters

Goal

Master the variable-size sliding window with a frequency map and a “shrink while violation” invariant. Deliverable solves the problem in O(N) time, O(K) extra space, with the loop invariant articulated explicitly: at the end of every iteration, s[l..r] contains at most K distinct characters.

Background Concepts

The shrink-while-violation skeleton; using a count of distinct items vs a full hashmap traversal; the difference between “at most K” and “exactly K” (the atMost(K) - atMost(K-1) trick); when fixed-size and variable-size windows apply. Review pattern Sliding Window.

Interview Context

This problem (LeetCode 340) is a Google / Meta favorite, and a near-direct ancestor of LeetCode 76 (Minimum Window Substring), 159 (At Most Two Distinct), 992 (Subarrays With K Different Integers), and 424 (Longest Repeating Character Replacement). The interview signal is whether you maintain the invariant cleanly: a single while violation: loop that shrinks until valid, then unconditional answer-update. Weak candidates write nested if-statements with off-by-one errors. Strong candidates write the canonical 5-line shrink-and-update.

Problem Statement

Given a string s and an integer K, return the length of the longest substring of s that contains at most K distinct characters.

Constraints

  • 1 ≤ |s| ≤ 5 × 10^4
  • 0 ≤ K ≤ |s|
  • s consists of arbitrary characters (Unicode in some variants; ASCII in the standard variant).

Clarifying Questions

  1. Is the alphabet ASCII or Unicode? (If Unicode, hashmap; if ASCII-lowercase, an int[26].)
  2. What if K == 0? (Empty substring is valid; answer is 0.)
  3. What if K >= number of distinct chars in s? (Answer is len(s).)
  4. Is the answer the length or the substring itself? (LC asks length; mention you can record (start, length) for the substring.)
  5. Can s be empty? (Per constraints, |s| ≥ 1, but the function should return 0 on empty input.)

Examples

InputOutputNote
s="eceba", K=23“ece”
s="aa", K=12“aa”
s="abcabc", K=24“bcbc” or “cabc” — wait, “cabc” has 3 distinct; correct: “bcbc” or “abca” — both length 4
s="a", K=00empty
s="abcdef", K=106full string

Initial Brute Force

Enumerate all substrings, check distinct-count, track max.

def longest_brute(s, K):
    best = 0
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            if len(seen) <= K:
                best = max(best, j - i + 1)
            else:
                break
    return best

The inner loop can break on first violation, so this is effectively O(N · K) average and O(N²) worst case (when K is large enough that no break happens).

Brute Force Complexity

Time O(N² · |alphabet|) at worst (set ops). At N=5×10^4, that’s ~2.5×10^9 — too slow in any language.

Optimization Path

Observation: as r advances, the set of distinct characters in s[l..r] is monotone non-decreasing; as l advances (keeping r fixed), it is monotone non-increasing. So we can use a two-pointer / sliding window: advance r, and while the window has more than K distinct chars, advance l. Each character enters and leaves the window at most once, total O(N).

Use a frequency map freq[c] keyed by character; distinct is the count of characters with freq[c] > 0. Increment distinct when a key first reaches 1; decrement when a key drops to 0.

Final Expected Approach

def longest_at_most_k_distinct(s, K):
    if K == 0: return 0
    freq = {}
    l = 0
    best = 0
    for r, c in enumerate(s):
        freq[c] = freq.get(c, 0) + 1
        while len(freq) > K:
            freq[s[l]] -= 1
            if freq[s[l]] == 0:
                del freq[s[l]]
            l += 1
        best = max(best, r - l + 1)
    return best

We use len(freq) directly as the distinct count, since we delete keys that hit zero.

Data Structures Used

  • A hashmap freq: char → int of size at most K+1 during the violation, ≤ K otherwise.
  • Two integer pointers l, r.
  • A running maximum best.

Correctness Argument

Invariant: at the start of each iteration of the outer loop and after the inner shrink, freq contains exactly the characters of s[l..r] with their counts, and len(freq) ≤ K.

Base: before iteration 0, l = 0, freq = {}, len(freq) = 0 ≤ K. ✓

Step: we add s[r] to freq. If this brings len(freq) > K, we shrink: decrement freq[s[l]], delete on zero, advance l. The shrink loop terminates because l ≤ r always (proved: each shrink-step removes one character, and at most r - l + 1 characters can be in the window, so after at most r - l + 1 shrinks we have len(freq) ≤ 1 ≤ K).

Optimality (max): for each r, the smallest l such that the window has ≤ K distinct is recorded; this gives the longest valid window ending at r. Taking the max over all r gives the global longest.

Why while, not if: in this problem each new character can add at most one to distinct, so if would also work. But the canonical sliding-window template uses while because in cousin problems (LC 76) a single r-advance can violate by more than 1 (when adding required chars). Always default to while; pay the (zero) cost of generality.

Complexity

Time O(N): each character is added once, removed at most once. Each hash op is O(1) average. Space O(min(N, K+1)) for the frequency map.

Implementation Requirements

  • Use len(freq) (or maintain a distinct counter) — do not iterate the map to count.
  • Delete keys that hit zero, or your len(freq) will be wrong.
  • Update best after the shrink, not inside it. (Inside the shrink, the window is invalid.)
  • For ASCII-lowercase, an int[26] plus a separate distinct counter is faster than a hashmap (no hashing constant).
  • Guard K == 0 (or K < 0) at the top.

Tests

  • Smoke: ("eceba", 2) → 3, ("aa", 1) → 2.
  • Unit: K = 0 → 0; K ≥ |distinct(s)||s|; single-character string.
  • Edge: s = "" → 0; K = 1, s = "abcdef" → 1; K = |s||s|.
  • Large: |s| = 5 × 10^4, alphabet 26, K = 3; should run in well under 100ms.
  • Random: generate 100 random strings of length ≤ 200, alphabet of varying size, varying K; cross-check against the brute force.
  • Adversarial: all-same-character ("aaaa...", K=1 → N); strictly increasing alphabet ("abcdef...", K → answer is K).
  • Unicode follow-up: make sure your Java/JS code iterates by codepoint, not by char, if the spec extends to Unicode.

Follow-up Questions

  • “Now solve exactly K distinct.” → atMost(K) - atMost(K-1) for the count variant; for the longest-with-exactly-K, run the same sliding window but only record best when len(freq) == K (not ≤ K).
  • “Now solve LC 76 (Minimum Window Substring).” → same skeleton, but the violation is “window does not yet contain all required chars”; we grow until satisfied, then shrink while still satisfied, recording the minimum each time we’re satisfied.
  • “What if the string is streamed and only r advances?” → two-pointer doesn’t apply directly; you’d need an order-preserving structure. (Out of scope here.)
  • “What if K can change with each query?” → preprocess differently; this problem is not amenable to a single offline sliding window for many K values.
  • “What if the input is very large, alphabet huge, but s only contains a few distinct chars?” → no change; the hashmap is bounded by min(K+1, |distinct(s)|).

Product Extension

A real-time content-moderation system tracks the longest run of messages in a chat where at most K distinct emojis are used (an indicator of spam-bot activity, which tends to use a small bag of emojis on repeat). The sliding window updates per message in O(1) amortized. The same skeleton applies to “longest session window with at most K distinct user-agents” for fraud detection, and “longest range of cells with at most K distinct values” for spreadsheet anomaly detection.

Language/Runtime Follow-ups

  • Python: dict overhead is real; for ASCII-lowercase, [0]*26 plus a distinct int is ~3× faster. Use s = list(s) only if you need indexing speed (string indexing is already O(1)).
  • Java: HashMap<Character, Integer> boxes keys and values. For ASCII, use int[128] and track distinct manually. chars() method exists but s.charAt(i) is fine.
  • Go: map[byte]int for ASCII; map[rune]int for Unicode. range s on a string yields (byte_index, rune) — this is a top-3 Go string trap: for i, c := range s does not give you i as character index for multi-byte chars.
  • C++: std::unordered_map<char, int> works; for ASCII, std::array<int, 128> is faster.
  • JS/TS: Map<string, number> works. Iterating for (const c of s) yields codepoints, not UTF-16 code units — important if Unicode is in scope. Otherwise, s[i] works for ASCII.
  • Unicode caveat: “distinct character” might mean codepoint, grapheme cluster, or UTF-16 code unit — clarify with the interviewer.

Common Bugs

  1. Updating best inside the shrink loop — records invalid windows, returns wrong answers when the only valid window length is small.
  2. Forgetting to delete zero-count keyslen(freq) becomes stale, breaks the violation check. Equivalent bug: maintaining a separate distinct counter and forgetting to decrement when a count hits zero.
  3. Using if instead of while — works for this problem but breaks for LC 76 and friends. Build the habit of while.
  4. Off-by-one in r - l + 1r is inclusive, l is inclusive; window length is r - l + 1.
  5. Java boxing in HashMap<Character, Integer> — autoboxing tax is ~3× over the primitive int[] approach for ASCII.
  6. Go string range bug — iterating with for i := 0; i < len(s); i++ and indexing as s[i] gives bytes, not runes; for UTF-8 data this misclassifies multi-byte chars as multiple distinct ones.
  7. JS UTF-16 surrogate pair bugs.length for "😀abc" is 5 (surrogate + 3 ASCII), and s[0], s[1] are the surrogate halves, not the emoji.

Debugging Strategy

  • Trace ("eceba", 2) by hand. Window evolution: e | ec | ece | (shrink to ce, then) cebᴬ — when b enters, window has c, e, b (3 distinct), shrink: remove c (was at position 1), window is e b then add … etc. If your trace doesn’t match, your shrink logic is wrong.
  • Diff against the brute force on 50 random inputs.
  • Print (l, r, len(freq), best) per iteration; len(freq) should never exceed K after the shrink completes.

Mastery Criteria

  • Recognized “longest substring with property” as sliding window in <30 seconds.
  • Wrote the canonical shrink-while-violation template with the answer-update outside the shrink, in <5 minutes.
  • Articulated the loop invariant and the termination of the shrink loop.
  • Generalized to “exactly K distinct” via the atMost(K) - atMost(K-1) trick.
  • Identified the language-specific Unicode / boxing trap.
  • Solved LC 76, LC 424, LC 992 within a week, observing the same skeleton.