p104 — Design Autocomplete System

Source: LeetCode 642 · Hard · Topics: Design, Trie, Heap, Stream Companies (2024–2025): Google (very high — namesake), Amazon (high), Apple (medium), Yelp (high), Airbnb (medium) Loop position: The trie-plus-heap hard. Every search bar with suggestions runs this exact algorithm. Maps to the question that gets you the L5+ at Google.

1. Quick Context

Stream of characters arrives via input(c). Build up the current query until '#' arrives (signals end of sentence — commit it to history, reset state). After every non-'#' char, return the top 3 sentences that have the current input as a prefix, ranked by (frequency desc, sentence asc lexicographically).

State:

  • Trie of sentences, each terminal node weighted by frequency count.
  • Current input buffer (self.buf) + pointer to the current trie node (self.cur).
  • On '#': increment the buffer’s count in the trie, reset buf + cur to root.
  • Otherwise: append char, descend (or mark “off-trie” if no such child).

Two engineering choices for top-3:

  1. Maintain top-3 at every node on insertO(L) per query (just read the node), O(L × log K) per insert where L = sentence length, K = top-K size.
  2. DFS subtree on each queryO(N_subtree log N_subtree) per query, O(L) per insert.

For LC, both pass. For production scale, (1) wins reads, (2) wins writes. Real systems use per-node top-K heap maintained on insert.

🔗 https://leetcode.com/problems/design-search-autocomplete-system/

STOP. 30-min timer. This is the longest of the week; budget accordingly.

3. Prerequisite Concepts

  • Trie (insert + descend by prefix).
  • Top-K with a fixed-size min-heap.
  • Stream-state machine (input is a stream; reset on '#').

4. How to Approach (FRAMEWORK 1–9)

1. Restate: Stream char-by-char; after each, return top-3 sentences with current prefix.

2. Clarify: Casing? Case-sensitive. Spaces? Allowed as ordinary chars. Empty query? (after '#') Returns []. Off-trie? Empty list and stay off-trie until '#'.

3. Examples: History [(“i love you”, 5), (“island”, 3), (“ironman”, 2), (“i love leetcode”, 2)]. Stream i, , a, #. After i → [“i love you”, “island”, “i love leetcode”]. After → [“i love you”, “i love leetcode”]. After a → [] (no sentence “i a…”). After # → [] and commit “i a” with freq +1.

5. Brute: Linear scan of all (sentence, freq) on each char.

6. Target: O(L) per input (read-optimized).

7. Pattern: Trie + per-node maintained top-3 list.

8. Develop: see solution.py.

9. Test: off-trie behavior; commit increments count, not duplicates; top-3 tie-break by sentence lex.

5. Progressive Hints

See hints.md.

6. Deeper Insight

6.1 The state-machine you’ll forget

input(c) is stateful. After c arrives:

  • If c == '#': commit self.buf to trie (increment count), then self.buf = "", self.cur = root.
  • Else: append c to self.buf. If self.cur is not None, descend (self.cur = self.cur.children.get(c)). If we go off-trie, self.cur = None — and stay off-trie for the rest of this query.

Half of LC submissions implement input as if it were a one-shot query (return top3(query)) and fail the state machine. The clue: input returns suggestions AFTER each char, which only makes sense if you remember the prefix so far.

6.2 Per-node top-K vs DFS-on-query

ChoiceRead costWrite costWhen to use
Per-node top-K maintained on insertO(L)O(L · K log K)Production search bars (read-heavy)
DFS subtree on each queryO(subtree · log subtree)O(L)Write-heavy / mostly-cold dictionary
Per-node sorted list (no heap)O(L)O(L · K)Tiny K (3 here); simplest code

Google’s real autocomplete uses (1) per-node top-K AND tier-loads from disk (top suggestions in RAM, long tail on SSD); the prefix node points to a precomputed list.

6.3 Tie-break: (frequency DESC, sentence ASC)

The comparator is not natural. A (freq, sentence) tuple sorted descending sorts sentences DESCENDING too — wrong. You need: sort by (-freq, sentence). In a min-heap of size 3 (keeping top 3), the heap key is (freq, neg_sentence) so the worst element pops correctly. Or use a custom class with __lt__. This is the bug most candidates hit and don’t notice until the LC harness tells them.

6.4 Stream-friendliness vs query-friendliness

input returning suggestions on every char is a UX choice (Google does it). But it implies the state machine. Compare to a hypothetical suggest(prefix) — stateless, no buf, no cur — which would be trivial. The interview question rewards understanding why the API is shaped this way.

6.5 Family: prefix data structures

  • LC 642 Autocomplete (this).
  • LC 208 Implement Trie.
  • LC 211 Add and Search Word — Data Structure (wildcard).
  • LC 720 Longest Word in Dictionary.
  • LC 745 Prefix and Suffix Search.

7. Anti-Pattern Analysis

  1. Stateless implementationinput(c) treats each char as a one-shot query, ignoring self.buf / self.cur. Off-by-everything.
  2. Wrong tie-breaksorted(candidates, reverse=True) sorts sentences descending; needs key=(-freq, sentence).
  3. Reset state on every call — defeats the streaming abstraction.
  4. Off-trie recovery — descending into a missing child should not crash. Mark self.cur = None and return [] for the rest of this query (until '#').
  5. Re-descend the trie from root on every input(c) — O(L) per char, total O(L²) per sentence. Use the self.cur pointer.
  6. Forget '#' commit semantics — must increment count of existing sentence or insert with count 1; resets state.

8. Skills & Takeaways

  • Trie with per-node augmentation (frequency, top-K).
  • Custom tie-break: (-frequency, sentence).
  • Stream state-machine pattern.

9. When to Move On

  • Implement cold in 30 min with correct state machine + tie-break.
  • Articulate the per-node top-K vs DFS-on-query tradeoff.
  • Extend to typo tolerance (Levenshtein-1 prefix match).

10. Company Context

  • Google: Search team and Cloud Search; “you wrote autocomplete; now we want you to be able to grade an autocomplete implementation in code review.”
  • Amazon: Product search (amazon.com suggestions); A9 team.
  • Apple: Spotlight, Maps search.
  • Yelp / Airbnb: Place / location autocomplete.

Strong: “Trie with per-node frequency. State machine: input(c) descends one node; ‘#’ commits + resets. Top-3 via per-node maintained list on insert OR DFS-then-sort on query. Tie-break by (-freq, sentence). Off-trie stays off-trie until #.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff+
State machineMissesBugsColdCold
Tie-breakWrongFixesColdCold
Off-trie handlingCrashesAfter hintColdCold
Per-node top-KMentionsImplements + benchmarks
Typo toleranceMentionsEdit-distance + BK-tree

12. Level Delta

  • Mid (L4): State machine works; uses DFS-then-sort. Passes LC. Doesn’t pass scale probe.
  • Senior (L5): State machine clean; per-node maintained top-3. Articulates tie-break. Survives scale probe.
  • Staff (L6): + names BK-tree / Levenshtein automaton for typo tolerance + describes how Google tiers suggestions (RAM hot prefix → SSD cold). + handles personalization (per-user history weighted in).
  • Principal (L7+): + discusses query log mining as the source of suggestion frequencies (Google’s suggestions come from anonymized query logs, not a curated dictionary) + privacy (differential privacy on aggregation) + multilingual (per-locale trie, language detection) + freshness (last-24h trending injected via separate index).

13. Follow-up Q&A

Q1: Top-K instead of fixed 3. A1: Parameterize. Per-node list of size K — bounded. Heap operations are still O(K log K) per insert.

Q2: Typo tolerance — “leetcoke” → “leetcode”. A2: Levenshtein automaton (descend trie + DP state of edit distance, prune branches with distance > threshold). Or BK-tree (precomputed metric tree on edit distance).

Q3: Per-user personalization. A3: Per-user trie blended with global trie by weight. Or per-user “boost” map applied on top-K read.

Q4: 1B sentences — won’t fit in RAM. A4: Tier trie: hot prefixes (~top 100k) in RAM, cold suffixes on SSD. Or shard trie by first 2 chars across machines (consistent hashing on prefix). LSM-tree storage of (prefix → top-K) is what production search uses.

Q5: Live updates (new trending query in last 5 min). A5: Separate hot index updated every minute, merged into top-K at read time. (Like Twitter trending; same shape as the count-min-sketch + heavy-hitters pattern.)

14. Solution Walkthrough

See solution.py:

  • AutocompleteSystem — trie with per-node frequency, state machine, DFS-on-query top-3.
  • AutocompleteSystemBrute — linear scan oracle.
  • Stress test: random sentence histories + random char streams interspersed with '#'.

15. Beyond the Problem

Principal code review: “What you’ve built is the single-machine, single-language, single-user autocomplete. Google’s production autocomplete is the same trie shape but distributed across three orthogonal scaling axes: (1) corpus — query logs reach hundreds of TB per locale; the trie shards by first 2-3 chars over thousands of machines, with consistent hashing for live re-balancing; (2) freshness — a new trending query needs to surface in minutes, not at the next index build; there is a hot in-memory delta index merged into top-K at read time (same architecture as Elasticsearch’s near_real_time segments or LSM trees); (3) personalization + privacy — your search history weights suggestions but must satisfy differential privacy on aggregation. The Staff/Principal signal is naming all three axes AND articulating that the query-log mining that produces the suggestions is itself a non-trivial pipeline (de-duplication, spell normalization, language detection, profanity filtering, geo-binning). Bonus signal: knowing that Bayesian inference on click-through-rate dominates raw frequency for suggestion ranking — a high-frequency-but-low-CTR suggestion is demoted. The simple (freq, lex) tie-break in this problem is the first step of a 30-feature ranking model.”