Lab 07 — Autocomplete
Goal
Implement an autocomplete service that returns the top-K suggestions for any prefix in sub-millisecond time. Use a trie augmented with per-node top-K caches, and support weighted suggestions (popularity-ranked). After this lab you should be able to design and implement the data structure in under 30 minutes and answer follow-ups about scale, freshness, and personalization.
Background Concepts
A trie (prefix tree) maps prefixes to a set of completions in O(prefix length) time. The naive design — at query time, walk to the prefix node and DFS to gather all descendants, then sort by weight — is correct but slow if the prefix matches many descendants. The production trick is to precompute and cache the top-K at each node during insert; query then becomes O(prefix-length + K).
The cache update is the subtle part. When add(word, weight) is called, walk the trie down the word’s path; at each node, merge the new word into the node’s top_k (a sorted list or small heap) and discard anything past K. The data flow is “bottom-up via the path you just walked, but for top-K at every level”.
Interview Context
Autocomplete is a top-10 question at search-heavy companies (Google search, Amazon product search, LinkedIn, Yelp). The bar is: trie + per-node top-K cache + ability to answer follow-ups about distributing the index, refreshing weights, and personalizing.
Problem Statement
Design Autocomplete(K):
add(word, weight)— insert with weight; subsequent adds of the same word update its weight (additive or replace, your choice — pick one).suggest(prefix) -> list[str]— return top-K words by weight starting withprefix. Sub-millisecond average.
Constraints
- Up to 10^6 words
- 10^5 queries / second
- Average word length 10–30
- K ≤ 10
Clarifying Questions
- Weight semantics: replace or additive? (Pick one; “additive” matches “popularity”.)
- Case sensitivity? (Default case-insensitive; lowercase on insert.)
- Tie-breaking on equal weight? (Lexicographic.)
- Real-time updates required, or build-once? (Both supported; weights mutable.)
- K provided per-query or fixed? (Fixed at construction simplifies caching; per-query is a follow-up.)
Examples
ac = Autocomplete(K=3)
ac.add("apple", 5); ac.add("app", 10); ac.add("apply", 3); ac.add("apricot", 1)
ac.suggest("ap") -> ["app", "apple", "apply"]
ac.suggest("app") -> ["app", "apple", "apply"]
ac.suggest("apr") -> ["apricot"]
ac.suggest("z") -> []
Initial Brute Force
Store words in a dict[word, weight]. On suggest, scan all words, filter by startswith(prefix), sort by weight, return top-K. O(N · prefix-length) per query.
Brute Force Complexity
Per suggest: O(N · L) where N = number of words, L = average length. At N=10^6 and 10^5 QPS, this is 10^11 operations / second — far too slow.
Optimization Path
A trie reduces the descend-to-prefix step to O(L). The remaining work — gathering the top-K descendants — is what the per-node top_k cache eliminates. Insert becomes O(L · K log K) (we update the cache at L nodes, each O(K log K)); query becomes O(L + K).
Final Expected Approach
Trie node has children: dict[char, Node] and top_k: list[(weight, word)]. add walks down the path; at each node, runs a small merge to maintain top-K. suggest walks to the prefix node and returns its top_k.
Data Structures Used
| Structure | Purpose |
|---|---|
| Trie nodes | prefix indexing |
Per-node top_k: list[(weight, word)] | precomputed answers |
dict[word, weight] | dedup + weight tracking |
Correctness Argument
Invariant: at any node N, top_k(N) is the top-K (by weight, ties broken lexicographically) of { (weight, word) : word starts with prefix(N) }. After every add(word, weight), we visit exactly the nodes on word’s path. At each visited node, we either (a) update the (word, weight) entry if word is already in top_k, or (b) insert and trim to K. No node off the path’s top_k set could change because word doesn’t extend any non-path prefix.
Edge case: weight updates that decrease a word’s standing — if word was in the top-K and its new weight kicks it out, we need to recompute the node’s top-K from a wider candidate set. The clean approach: store at each node a count_per_word dict and full candidate set restricted to top-K-wide buffer (e.g., top-10K when K=10) — heavy but correct. The simpler approach: on weight decrease, do a DFS to rebuild top_k. Document the choice.
Complexity
add: O(L · K log K)suggest: O(L + K)- Space: O(N · L · K) worst case; in practice much less because most nodes don’t have K distinct descendants
Implementation Requirements
import heapq
from typing import Optional
class _Node:
__slots__ = ("children", "top_k")
def __init__(self):
self.children: dict[str, _Node] = {}
self.top_k: list[tuple[int, str]] = [] # min-heap of (weight, word) ... ish
class Autocomplete:
def __init__(self, k: int = 5):
self._root = _Node()
self._weights: dict[str, int] = {}
self._k = k
def add(self, word: str, weight_delta: int = 1) -> None:
word = word.lower()
new_weight = self._weights.get(word, 0) + weight_delta
self._weights[word] = new_weight
node = self._root
nodes_on_path: list[_Node] = [node]
for ch in word:
node = node.children.setdefault(ch, _Node())
nodes_on_path.append(node)
for n in nodes_on_path:
self._upsert_top_k(n, word, new_weight)
def suggest(self, prefix: str) -> list[str]:
prefix = prefix.lower()
node = self._root
for ch in prefix:
node = node.children.get(ch)
if node is None:
return []
# top_k stored as (weight_neg, word) so sorted asc gives top weights desc
ranked = sorted(node.top_k, key=lambda p: (-p[0], p[1]))
return [w for _, w in ranked[:self._k]]
def _upsert_top_k(self, node: _Node, word: str, weight: int) -> None:
for i, (w, ww) in enumerate(node.top_k):
if ww == word:
node.top_k[i] = (weight, word)
node.top_k.sort(key=lambda p: (-p[0], p[1]))
return
node.top_k.append((weight, word))
node.top_k.sort(key=lambda p: (-p[0], p[1]))
if len(node.top_k) > self._k:
del node.top_k[self._k:]
Tests
import unittest
class TestAutocomplete(unittest.TestCase):
def test_basic(self):
ac = Autocomplete(k=3)
for w, n in [("app", 10), ("apple", 5), ("apply", 3), ("apricot", 1)]:
ac.add(w, n)
self.assertEqual(ac.suggest("ap"), ["app", "apple", "apply"])
self.assertEqual(ac.suggest("app"), ["app", "apple", "apply"])
self.assertEqual(ac.suggest("apr"), ["apricot"])
self.assertEqual(ac.suggest("z"), [])
def test_weight_update(self):
ac = Autocomplete(k=3)
ac.add("a", 1); ac.add("b", 2); ac.add("c", 3)
ac.add("a", 10) # a now weight 11
# Suggest off the empty prefix
self.assertEqual(ac.suggest(""), ["a", "c", "b"])
def test_top_k_truncation(self):
ac = Autocomplete(k=2)
for c, w in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
ac.add(c, w)
self.assertEqual(ac.suggest(""), ["d", "c"])
def test_lex_tie(self):
ac = Autocomplete(k=3)
ac.add("banana", 5); ac.add("apple", 5); ac.add("cherry", 5)
self.assertEqual(ac.suggest(""), ["apple", "banana", "cherry"])
Follow-up Questions
(3) Scale to N nodes? Shard by first character (A–Z) → 26 shards, each with its own trie. For more even distribution, shard by hash(prefix[:2]). Each suggestion query routes to one shard. For the empty-prefix query, broadcast and merge — the price you pay for sharding by prefix.
(4) Observe / monitor? Per-prefix-length latency histogram (short prefixes = many candidates, slow); cache hit-rate (proportion of queries hitting precomputed top-K vs needing DFS); query volume per prefix (top hot prefixes). Alert on p99 latency.
(9) Eviction / cleanup? Words may go cold (a celebrity who stopped trending). Strategy: timestamped weight, decay on a schedule (multiply all weights by 0.95 daily), delete when below a threshold. Or use a separate “deletion” path: remove(word) walks the trie, removes from each node’s top_k, and rebuilds top_k from a wider candidate set if the removed word was in it.
(11) Configuration knobs? K, case_sensitivity, weight_decay_factor. Knobs not to expose: trie node layout, sort algorithm.
(8) Partial failure? A query that hits a partially-rebuilt index during a add could see stale top-K. Solutions: (a) atomic per-node update (under a per-node lock), (b) versioned snapshot (queries read a stable version while writes go to a shadow), (c) accept stale results for ~1 second. For autocomplete, eventual consistency is fine.
Product Extension
Google’s autocomplete is far more than a trie: it’s a personalized, context-aware, learning-ranked system. The trie + top-K is the index layer; on top sits a ranker that combines popularity, personalization (your history), context (location, time), and freshness (trending). Production systems also add typo tolerance (Levenshtein-edit-distance fuzzy match within edit-distance ≤ 2) — a much harder problem solved with FSTs or n-gram inverted indexes.
Language/Runtime Follow-ups
- Python: dict-based trie is the simplest; for memory, switch to arrays of 26/128 children once you optimize. The implementation above is fine for 10^6 words; beyond, consider a DAWG (DAG of suffix-shared subtrees).
- Java:
HashMap<Character, Node>per node, or arrays for ASCII. Apache Commons Collections hasTrieMap(PATRICIA trie). - Go:
map[rune]*Nodeper node. Excellent for this workload because of GC’s tolerance for many small allocations. - C++: same pattern. For best performance, use
std::array<Node*, 26>for ASCII. - JS/TS:
Map<string, Node>per node; no concurrency concerns in single-threaded Node.
Common Bugs
- Inserting and forgetting to update top-K at every node on the path (only updating the leaf). Subsequent prefix queries return empty.
- Sorting top-K by weight only, forgetting lex tie-break. Tests with equal weights become flaky.
- On weight decrease, leaving a stale entry in top-K. Solution: full rebuild of top-K on decrease.
- Case mismatch: insert lowercase, query as-is. Lowercase both.
- Memory: storing the full word in every node’s top-K — at 10^6 words and depth 30, this is 30M strings. Store an integer ID and look up the word in a side dict for memory savings.
Debugging Strategy
For “wrong suggestions” bugs: print node.top_k at the prefix node and verify it matches the expected top-K. For “missing word” bugs: walk down the trie from root, printing node.children at each step, confirm the path exists. For weight bugs: dump self._weights[word] and compare to expected. For performance, profile: most hot paths are dict.setdefault and the in-place sort.
Mastery Criteria
- Implemented trie + per-node top-K in <30 minutes.
- All four tests pass first run.
- Stated trie + top-K caching tradeoff (insert is K-times slower, query is L+K instead of L+all-descendants).
- Answered follow-ups #3 (sharding), #4 (observability), #9 (decay), #8 (eventual consistency) crisply.
- Compared trie vs DAWG vs FST for memory.
- Articulated typo-tolerance design (BK-tree / fuzzy n-grams) at a high level.