Lab 07 — Autocomplete

Goal

Implement an autocomplete service that returns the top-K suggestions for any prefix in sub-millisecond time. Use a trie augmented with per-node top-K caches, and support weighted suggestions (popularity-ranked). After this lab you should be able to design and implement the data structure in under 30 minutes and answer follow-ups about scale, freshness, and personalization.

Background Concepts

A trie (prefix tree) maps prefixes to a set of completions in O(prefix length) time. The naive design — at query time, walk to the prefix node and DFS to gather all descendants, then sort by weight — is correct but slow if the prefix matches many descendants. The production trick is to precompute and cache the top-K at each node during insert; query then becomes O(prefix-length + K).

The cache update is the subtle part. When add(word, weight) is called, walk the trie down the word’s path; at each node, merge the new word into the node’s top_k (a sorted list or small heap) and discard anything past K. The data flow is “bottom-up via the path you just walked, but for top-K at every level”.

Interview Context

Autocomplete is a top-10 question at search-heavy companies (Google search, Amazon product search, LinkedIn, Yelp). The bar is: trie + per-node top-K cache + ability to answer follow-ups about distributing the index, refreshing weights, and personalizing.

Problem Statement

Design Autocomplete(K):

add(word, weight) — insert with weight; subsequent adds of the same word update its weight (additive or replace, your choice — pick one).
suggest(prefix) -> list[str] — return top-K words by weight starting with prefix. Sub-millisecond average.

Constraints

Up to 10^6 words
10^5 queries / second
Average word length 10–30
K ≤ 10

Clarifying Questions

Weight semantics: replace or additive? (Pick one; “additive” matches “popularity”.)
Case sensitivity? (Default case-insensitive; lowercase on insert.)
Tie-breaking on equal weight? (Lexicographic.)
Real-time updates required, or build-once? (Both supported; weights mutable.)
K provided per-query or fixed? (Fixed at construction simplifies caching; per-query is a follow-up.)

Examples

ac = Autocomplete(K=3)
ac.add("apple", 5); ac.add("app", 10); ac.add("apply", 3); ac.add("apricot", 1)
ac.suggest("ap")    -> ["app", "apple", "apply"]
ac.suggest("app")   -> ["app", "apple", "apply"]
ac.suggest("apr")   -> ["apricot"]
ac.suggest("z")     -> []

Initial Brute Force

Store words in a dict[word, weight]. On suggest, scan all words, filter by startswith(prefix), sort by weight, return top-K. O(N · prefix-length) per query.

Brute Force Complexity

Per suggest: O(N · L) where N = number of words, L = average length. At N=10^6 and 10^5 QPS, this is 10^11 operations / second — far too slow.

Optimization Path

A trie reduces the descend-to-prefix step to O(L). The remaining work — gathering the top-K descendants — is what the per-node top_k cache eliminates. Insert becomes O(L · K log K) (we update the cache at L nodes, each O(K log K)); query becomes O(L + K).

Final Expected Approach

Trie node has children: dict[char, Node] and top_k: list[(weight, word)]. add walks down the path; at each node, runs a small merge to maintain top-K. suggest walks to the prefix node and returns its top_k.

Data Structures Used

Structure	Purpose
Trie nodes	prefix indexing
Per-node `top_k: list[(weight, word)]`	precomputed answers
`dict[word, weight]`	dedup + weight tracking

Correctness Argument

Invariant: at any node N, top_k(N) is the top-K (by weight, ties broken lexicographically) of { (weight, word) : word starts with prefix(N) }. After every add(word, weight), we visit exactly the nodes on word’s path. At each visited node, we either (a) update the (word, weight) entry if word is already in top_k, or (b) insert and trim to K. No node off the path’s top_k set could change because word doesn’t extend any non-path prefix.

Edge case: weight updates that decrease a word’s standing — if word was in the top-K and its new weight kicks it out, we need to recompute the node’s top-K from a wider candidate set. The clean approach: store at each node a count_per_word dict and full candidate set restricted to top-K-wide buffer (e.g., top-10K when K=10) — heavy but correct. The simpler approach: on weight decrease, do a DFS to rebuild top_k. Document the choice.

Complexity

add: O(L · K log K)
suggest: O(L + K)
Space: O(N · L · K) worst case; in practice much less because most nodes don’t have K distinct descendants

Implementation Requirements

import heapq
from typing import Optional

class _Node:
    __slots__ = ("children", "top_k")
    def __init__(self):
        self.children: dict[str, _Node] = {}
        self.top_k: list[tuple[int, str]] = []  # min-heap of (weight, word) ... ish

class Autocomplete:
    def __init__(self, k: int = 5):
        self._root = _Node()
        self._weights: dict[str, int] = {}
        self._k = k

    def add(self, word: str, weight_delta: int = 1) -> None:
        word = word.lower()
        new_weight = self._weights.get(word, 0) + weight_delta
        self._weights[word] = new_weight
        node = self._root
        nodes_on_path: list[_Node] = [node]
        for ch in word:
            node = node.children.setdefault(ch, _Node())
            nodes_on_path.append(node)
        for n in nodes_on_path:
            self._upsert_top_k(n, word, new_weight)

    def suggest(self, prefix: str) -> list[str]:
        prefix = prefix.lower()
        node = self._root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # top_k stored as (weight_neg, word) so sorted asc gives top weights desc
        ranked = sorted(node.top_k, key=lambda p: (-p[0], p[1]))
        return [w for _, w in ranked[:self._k]]

    def _upsert_top_k(self, node: _Node, word: str, weight: int) -> None:
        for i, (w, ww) in enumerate(node.top_k):
            if ww == word:
                node.top_k[i] = (weight, word)
                node.top_k.sort(key=lambda p: (-p[0], p[1]))
                return
        node.top_k.append((weight, word))
        node.top_k.sort(key=lambda p: (-p[0], p[1]))
        if len(node.top_k) > self._k:
            del node.top_k[self._k:]

Tests

import unittest

class TestAutocomplete(unittest.TestCase):
    def test_basic(self):
        ac = Autocomplete(k=3)
        for w, n in [("app", 10), ("apple", 5), ("apply", 3), ("apricot", 1)]:
            ac.add(w, n)
        self.assertEqual(ac.suggest("ap"), ["app", "apple", "apply"])
        self.assertEqual(ac.suggest("app"), ["app", "apple", "apply"])
        self.assertEqual(ac.suggest("apr"), ["apricot"])
        self.assertEqual(ac.suggest("z"), [])

    def test_weight_update(self):
        ac = Autocomplete(k=3)
        ac.add("a", 1); ac.add("b", 2); ac.add("c", 3)
        ac.add("a", 10)                        # a now weight 11
        # Suggest off the empty prefix
        self.assertEqual(ac.suggest(""), ["a", "c", "b"])

    def test_top_k_truncation(self):
        ac = Autocomplete(k=2)
        for c, w in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
            ac.add(c, w)
        self.assertEqual(ac.suggest(""), ["d", "c"])

    def test_lex_tie(self):
        ac = Autocomplete(k=3)
        ac.add("banana", 5); ac.add("apple", 5); ac.add("cherry", 5)
        self.assertEqual(ac.suggest(""), ["apple", "banana", "cherry"])

Follow-up Questions

(3) Scale to N nodes? Shard by first character (A–Z) → 26 shards, each with its own trie. For more even distribution, shard by hash(prefix[:2]). Each suggestion query routes to one shard. For the empty-prefix query, broadcast and merge — the price you pay for sharding by prefix.

(4) Observe / monitor? Per-prefix-length latency histogram (short prefixes = many candidates, slow); cache hit-rate (proportion of queries hitting precomputed top-K vs needing DFS); query volume per prefix (top hot prefixes). Alert on p99 latency.

(9) Eviction / cleanup? Words may go cold (a celebrity who stopped trending). Strategy: timestamped weight, decay on a schedule (multiply all weights by 0.95 daily), delete when below a threshold. Or use a separate “deletion” path: remove(word) walks the trie, removes from each node’s top_k, and rebuilds top_k from a wider candidate set if the removed word was in it.

(11) Configuration knobs? K, case_sensitivity, weight_decay_factor. Knobs not to expose: trie node layout, sort algorithm.

(8) Partial failure? A query that hits a partially-rebuilt index during a add could see stale top-K. Solutions: (a) atomic per-node update (under a per-node lock), (b) versioned snapshot (queries read a stable version while writes go to a shadow), (c) accept stale results for ~1 second. For autocomplete, eventual consistency is fine.

Product Extension

Google’s autocomplete is far more than a trie: it’s a personalized, context-aware, learning-ranked system. The trie + top-K is the index layer; on top sits a ranker that combines popularity, personalization (your history), context (location, time), and freshness (trending). Production systems also add typo tolerance (Levenshtein-edit-distance fuzzy match within edit-distance ≤ 2) — a much harder problem solved with FSTs or n-gram inverted indexes.

Language/Runtime Follow-ups

Python: dict-based trie is the simplest; for memory, switch to arrays of 26/128 children once you optimize. The implementation above is fine for 10^6 words; beyond, consider a DAWG (DAG of suffix-shared subtrees).
Java: HashMap<Character, Node> per node, or arrays for ASCII. Apache Commons Collections has TrieMap (PATRICIA trie).
Go: map[rune]*Node per node. Excellent for this workload because of GC’s tolerance for many small allocations.
C++: same pattern. For best performance, use std::array<Node*, 26> for ASCII.
JS/TS: Map<string, Node> per node; no concurrency concerns in single-threaded Node.

Common Bugs

Inserting and forgetting to update top-K at every node on the path (only updating the leaf). Subsequent prefix queries return empty.
Sorting top-K by weight only, forgetting lex tie-break. Tests with equal weights become flaky.
On weight decrease, leaving a stale entry in top-K. Solution: full rebuild of top-K on decrease.
Case mismatch: insert lowercase, query as-is. Lowercase both.
Memory: storing the full word in every node’s top-K — at 10^6 words and depth 30, this is 30M strings. Store an integer ID and look up the word in a side dict for memory savings.

Debugging Strategy

For “wrong suggestions” bugs: print node.top_k at the prefix node and verify it matches the expected top-K. For “missing word” bugs: walk down the trie from root, printing node.children at each step, confirm the path exists. For weight bugs: dump self._weights[word] and compare to expected. For performance, profile: most hot paths are dict.setdefault and the in-place sort.

Mastery Criteria

Implemented trie + per-node top-K in <30 minutes.
All four tests pass first run.
Stated trie + top-K caching tradeoff (insert is K-times slower, query is L+K instead of L+all-descendants).
Answered follow-ups #3 (sharding), #4 (observability), #9 (decay), #8 (eventual consistency) crisply.
Compared trie vs DAWG vs FST for memory.
Articulated typo-tolerance design (BK-tree / fuzzy n-grams) at a high level.

LeetCode Interview — Extreme Coding