p104 — Design Autocomplete System

Source: LeetCode 642 · Hard · Topics: Design, Trie, Heap, Stream Companies (2024–2025): Google (very high — namesake), Amazon (high), Apple (medium), Yelp (high), Airbnb (medium) Loop position: The trie-plus-heap hard. Every search bar with suggestions runs this exact algorithm. Maps to the question that gets you the L5+ at Google.

1. Quick Context

Stream of characters arrives via input(c). Build up the current query until '#' arrives (signals end of sentence — commit it to history, reset state). After every non-'#' char, return the top 3 sentences that have the current input as a prefix, ranked by (frequency desc, sentence asc lexicographically).

State:

Trie of sentences, each terminal node weighted by frequency count.
Current input buffer (self.buf) + pointer to the current trie node (self.cur).
On '#': increment the buffer’s count in the trie, reset buf + cur to root.
Otherwise: append char, descend (or mark “off-trie” if no such child).

Two engineering choices for top-3:

Maintain top-3 at every node on insert — O(L) per query (just read the node), O(L × log K) per insert where L = sentence length, K = top-K size.
DFS subtree on each query — O(N_subtree log N_subtree) per query, O(L) per insert.

For LC, both pass. For production scale, (1) wins reads, (2) wins writes. Real systems use per-node top-K heap maintained on insert.

2. LeetCode Link + Attempt Gate

🔗 https://leetcode.com/problems/design-search-autocomplete-system/

STOP. 30-min timer. This is the longest of the week; budget accordingly.

3. Prerequisite Concepts

Trie (insert + descend by prefix).
Top-K with a fixed-size min-heap.
Stream-state machine (input is a stream; reset on '#').

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Stream char-by-char; after each, return top-3 sentences with current prefix.

2. Clarify: Casing? Case-sensitive. Spaces? Allowed as ordinary chars. Empty query? (after '#') Returns []. Off-trie? Empty list and stay off-trie until '#'.

3. Examples: History [(“i love you”, 5), (“island”, 3), (“ironman”, 2), (“i love leetcode”, 2)]. Stream i, , a, #. After i → [“i love you”, “island”, “i love leetcode”]. After → [“i love you”, “i love leetcode”]. After a → [] (no sentence “i a…”). After # → [] and commit “i a” with freq +1.

5. Brute: Linear scan of all (sentence, freq) on each char.

6. Target: O(L) per input (read-optimized).

7. Pattern: Trie + per-node maintained top-3 list.

8. Develop: see solution.py.

9. Test: off-trie behavior; commit increments count, not duplicates; top-3 tie-break by sentence lex.

5. Progressive Hints

See hints.md.

6. Deeper Insight

6.1 The state-machine you’ll forget

input(c) is stateful. After c arrives:

If c == '#': commit self.buf to trie (increment count), then self.buf = "", self.cur = root.
Else: append c to self.buf. If self.cur is not None, descend (self.cur = self.cur.children.get(c)). If we go off-trie, self.cur = None — and stay off-trie for the rest of this query.

Half of LC submissions implement input as if it were a one-shot query (return top3(query)) and fail the state machine. The clue: input returns suggestions AFTER each char, which only makes sense if you remember the prefix so far.

6.2 Per-node top-K vs DFS-on-query

Choice	Read cost	Write cost	When to use
Per-node top-K maintained on insert	O(L)	O(L · K log K)	Production search bars (read-heavy)
DFS subtree on each query	O(subtree · log subtree)	O(L)	Write-heavy / mostly-cold dictionary
Per-node sorted list (no heap)	O(L)	O(L · K)	Tiny K (3 here); simplest code

Google’s real autocomplete uses (1) per-node top-K AND tier-loads from disk (top suggestions in RAM, long tail on SSD); the prefix node points to a precomputed list.

6.3 Tie-break: (frequency DESC, sentence ASC)

The comparator is not natural. A (freq, sentence) tuple sorted descending sorts sentences DESCENDING too — wrong. You need: sort by (-freq, sentence). In a min-heap of size 3 (keeping top 3), the heap key is (freq, neg_sentence) so the worst element pops correctly. Or use a custom class with __lt__. This is the bug most candidates hit and don’t notice until the LC harness tells them.

6.4 Stream-friendliness vs query-friendliness

input returning suggestions on every char is a UX choice (Google does it). But it implies the state machine. Compare to a hypothetical suggest(prefix) — stateless, no buf, no cur — which would be trivial. The interview question rewards understanding why the API is shaped this way.

6.5 Family: prefix data structures

LC 642 Autocomplete (this).
LC 208 Implement Trie.
LC 211 Add and Search Word — Data Structure (wildcard).
LC 720 Longest Word in Dictionary.
LC 745 Prefix and Suffix Search.

7. Anti-Pattern Analysis

Stateless implementation — input(c) treats each char as a one-shot query, ignoring self.buf / self.cur. Off-by-everything.
Wrong tie-break — sorted(candidates, reverse=True) sorts sentences descending; needs key=(-freq, sentence).
Reset state on every call — defeats the streaming abstraction.
Off-trie recovery — descending into a missing child should not crash. Mark self.cur = None and return [] for the rest of this query (until '#').
Re-descend the trie from root on every input(c) — O(L) per char, total O(L²) per sentence. Use the self.cur pointer.
Forget '#' commit semantics — must increment count of existing sentence or insert with count 1; resets state.

8. Skills & Takeaways

Trie with per-node augmentation (frequency, top-K).
Custom tie-break: (-frequency, sentence).
Stream state-machine pattern.

9. When to Move On

Implement cold in 30 min with correct state machine + tie-break.
Articulate the per-node top-K vs DFS-on-query tradeoff.
Extend to typo tolerance (Levenshtein-1 prefix match).

10. Company Context

Google: Search team and Cloud Search; “you wrote autocomplete; now we want you to be able to grade an autocomplete implementation in code review.”
Amazon: Product search (amazon.com suggestions); A9 team.
Apple: Spotlight, Maps search.
Yelp / Airbnb: Place / location autocomplete.

Strong: “Trie with per-node frequency. State machine: input(c) descends one node; ‘#’ commits + resets. Top-3 via per-node maintained list on insert OR DFS-then-sort on query. Tie-break by (-freq, sentence). Off-trie stays off-trie until #.”

11. Interviewer’s Lens

Signal	Junior	Mid	Senior	Staff+
State machine	Misses	Bugs	Cold	Cold
Tie-break	Wrong	Fixes	Cold	Cold
Off-trie handling	Crashes	After hint	Cold	Cold
Per-node top-K	—	—	Mentions	Implements + benchmarks
Typo tolerance	—	—	Mentions	Edit-distance + BK-tree

12. Level Delta

Mid (L4): State machine works; uses DFS-then-sort. Passes LC. Doesn’t pass scale probe.
Senior (L5): State machine clean; per-node maintained top-3. Articulates tie-break. Survives scale probe.
Staff (L6): + names BK-tree / Levenshtein automaton for typo tolerance + describes how Google tiers suggestions (RAM hot prefix → SSD cold). + handles personalization (per-user history weighted in).
Principal (L7+): + discusses query log mining as the source of suggestion frequencies (Google’s suggestions come from anonymized query logs, not a curated dictionary) + privacy (differential privacy on aggregation) + multilingual (per-locale trie, language detection) + freshness (last-24h trending injected via separate index).

13. Follow-up Q&A

Q1: Top-K instead of fixed 3. A1: Parameterize. Per-node list of size K — bounded. Heap operations are still O(K log K) per insert.

Q2: Typo tolerance — “leetcoke” → “leetcode”. A2: Levenshtein automaton (descend trie + DP state of edit distance, prune branches with distance > threshold). Or BK-tree (precomputed metric tree on edit distance).

Q3: Per-user personalization. A3: Per-user trie blended with global trie by weight. Or per-user “boost” map applied on top-K read.

Q4: 1B sentences — won’t fit in RAM. A4: Tier trie: hot prefixes (~top 100k) in RAM, cold suffixes on SSD. Or shard trie by first 2 chars across machines (consistent hashing on prefix). LSM-tree storage of (prefix → top-K) is what production search uses.

Q5: Live updates (new trending query in last 5 min). A5: Separate hot index updated every minute, merged into top-K at read time. (Like Twitter trending; same shape as the count-min-sketch + heavy-hitters pattern.)

14. Solution Walkthrough

See solution.py:

AutocompleteSystem — trie with per-node frequency, state machine, DFS-on-query top-3.
AutocompleteSystemBrute — linear scan oracle.
Stress test: random sentence histories + random char streams interspersed with '#'.

15. Beyond the Problem

Principal code review: “What you’ve built is the single-machine, single-language, single-user autocomplete. Google’s production autocomplete is the same trie shape but distributed across three orthogonal scaling axes: (1) corpus — query logs reach hundreds of TB per locale; the trie shards by first 2-3 chars over thousands of machines, with consistent hashing for live re-balancing; (2) freshness — a new trending query needs to surface in minutes, not at the next index build; there is a hot in-memory delta index merged into top-K at read time (same architecture as Elasticsearch’s near_real_time segments or LSM trees); (3) personalization + privacy — your search history weights suggestions but must satisfy differential privacy on aggregation. The Staff/Principal signal is naming all three axes AND articulating that the query-log mining that produces the suggestions is itself a non-trivial pipeline (de-duplication, spell normalization, language detection, profanity filtering, geo-binning). Bonus signal: knowing that Bayesian inference on click-through-rate dominates raw frequency for suggestion ranking — a high-frequency-but-low-CTR suggestion is demoted. The simple (freq, lex) tie-break in this problem is the first step of a 30-feature ranking model.”

LeetCode Interview — Extreme Coding