Phase 3 — Advanced Data Structures

Target level: Medium → Hard Expected duration: 3 weeks (12-week track) / 3 weeks (6-month track) / 4 weeks (12-month track) Weekly cadence: ~8 advanced structures per week + 30–60 problems applying them under the framework


Why Advanced Data Structures Unlock Hards

Phase 2 gave you 28 patterns that solve the vast majority of Mediums. The patterns work because each one carries an O(N) or O(N log N) algorithm in its template — you recognize the signal, instantiate the template, and the runtime falls out for free.

Hards are different. The signal still fires — you still recognize “this is sliding window with a tricky max”, “this is DP with a state transition”, “this is shortest path with a constraint” — but the vanilla template’s complexity is one factor too high. A sliding-window max over a stream of N updates becomes O(N²) with a sorted list. A DP with N=20 and “set of visited” in the state explodes to O(2^N · N²) without bitmask compression. A range-sum problem with both updates and queries blows past prefix sums. A string match against a pattern of length M in a text of length N is O(N·M) with naive comparison; that’s 10^10 ops at N=10^5, M=10^5.

The advanced data structures in this phase are the augmented engines that bring Hard problems back into reach. Each one is a 1–2 log-factor improvement over a naive structure. They are not “tricks”. They are well-defined, well-proven structures with known invariants, well-understood failure modes, and known operating ranges. The skill is not to invent them — it is to recognize when the vanilla template is one log factor short, identify which augmented structure plugs the hole, and instantiate it correctly under interview pressure.

There are roughly three families:

  1. Range query / range update structures — segment tree, Fenwick tree, sparse table, sqrt decomposition. These turn O(N) per range query into O(log N) or O(√N), and (with augmentation) handle range updates the same way. They show up whenever the problem has a sequence and you need both updates and aggregates over arbitrary subranges in the same workload.
  2. String-matching / hashing / suffix structures — KMP, Z, Manacher, rolling hash, suffix array, suffix automaton, Aho-Corasick, tries. These bring per-character work down from O(M) (full pattern recompare) to O(1) amortized, enabling O(N+M) or O(N log N) algorithms over strings. They show up whenever the problem mentions “substring”, “match”, “occurrence”, “palindrome”, or “common”.
  3. State-compression and amortization — bitmask DP, meet-in-the-middle, DSU with α(N), bit manipulation idioms, Bloom/skip/LRU-LFU. These exploit problem-specific structural facts (small N, splittable input, near-constant amortized work, probabilistic acceptance) to clear constraints that naive DP/search cannot.

You will not memorize 24 implementations cold. You will understand the invariants well enough to derive each implementation in 5–15 minutes under pressure, and instantly recognize which one is needed from the problem signal.

After this phase you can solve unmistakably-Hard LeetCode problems on first attempt: range queries with updates, palindromic counts in linear time, multi-pattern matching, exact-cover by bitmask DP, subset-sum at N=40 by meet-in-the-middle, equation-solving by weighted DSU, dynamic LRU caches. You also become visibly stronger in mock interviews because you no longer flinch at “what if the input is updated?”, “what if N is 40?”, or “what if there are 10^5 patterns to match?”.


What You Will Be Able To Do After This Phase

  • For any range-query Hard, identify within 60 seconds whether vanilla prefix sums suffice, or whether Fenwick / segment tree / sparse table / sqrt decomposition is required, and why.
  • Implement a segment tree (point update, range query) from memory in <12 minutes.
  • Add lazy propagation when the problem demands range updates, and articulate the lazy-tag-push invariant.
  • Recognize a string-match Hard and pick the right tool: KMP (single pattern), Z (border + offsets), Manacher (palindromes), rolling hash (probabilistic equality, multi-substring), Aho-Corasick (multi-pattern), suffix array/automaton (overview-level for “longest common substring”, “distinct substrings”).
  • Build a trie augmented with counts / deletion / prefix-sum cache for word-search and autocomplete-class problems.
  • Recognize bitmask DP from N ≤ 20 constraint, formulate the state, and implement the transition without bugs.
  • Recognize meet-in-the-middle from N ≤ 40 (split into 20+20) and code the two-half merge.
  • Implement DSU with path compression + union by rank, and prove the α(N) amortized bound.
  • Use bit manipulation idioms (popcount, lowbit, isolate trailing one, parity tricks) without thinking.

How To Read This Phase

Read the inline reference below in two passes. Pass 1: linear, end to end, to assemble a mental map of which structure plugs which hole. Pass 2: as you work through the labs, refer back to the structure entries to clarify invariants and pitfalls. Each entry has a fixed shape:

  1. When to use — the problem signal that should fire this structure within 2 minutes of reading.
  2. Complexity — build, query, update, space.
  3. Implementation pitfalls — the bugs that consume the most interview minutes.
  4. Classic problems — 3–6 representative problems where the structure is the intended solution.

Where labs cover the structure hands-on, the entry references the lab. Where the structure is overview-only (rare in interviews but expected of strong candidates), the entry says so explicitly.


Inline Advanced Data Structure Reference


1. Segment Tree (point update + range query)

When To Use

  • Sequence of N elements, with both point updates and range aggregates (sum / min / max / gcd / xor) on the same workload.
  • The aggregate is associative — i.e., it can be combined from two disjoint segments.
  • Q queries + Q updates, with N · Q too large for naive O(N) per query and prefix sums (which are O(1) per query but O(N) per update) too rigid for the update mix.

Complexity

Build O(N). Point update O(log N). Range query O(log N). Space O(4N) — the tree array is conventionally sized 4N to fit any nearly-balanced binary tree on N leaves.

Key Implementation Pitfalls

  • Off-by-one on the recursive boundariesquery(node, nl, nr, ql, qr): total miss is qr < nl or ql > nr (closed intervals); total cover is ql <= nl and nr <= qr. Mixing open and closed intervals is the #1 cause of broken segment trees.
  • Sizing the tree array4N is safe; 2 * next_power_of_two(N) is tight. If you pick 2N you’ll segfault for non-power-of-two N.
  • Combining two subtree results — for sum, just add; for min/max, take the extreme; for gcd, recurse. Define a single combine(a, b) so you can swap aggregates without rewriting the body.
  • Recursive vs iterative — iterative segment tree with n rounded to power of 2, leaves at [n, 2n), is shorter and faster. Pick one style and stick to it.

Classic Problems

  • LeetCode 307 — Range Sum Query Mutable
  • LeetCode 308 — Range Sum Query 2D Mutable (with row segment trees)
  • LeetCode 1157 — Online Majority Element In Subarray (segment tree of candidates + frequency check)
  • LeetCode 715 — Range Module (segment tree of intervals, or coordinate-compressed)

Hands-on: see Lab 01.


2. Segment Tree With Lazy Propagation (range update + range query)

When To Use

  • Same as #1 but updates affect ranges, not single points: “add v to all elements in [l, r]”, “set all elements in [l, r] to v”, “flip all elements in [l, r]”.
  • The update operation has a clean composition rule: applying update u₁ then u₂ to a node is equivalent to a single combined update.

Complexity

All ops O(log N). Space O(4N) for the values + O(4N) for the lazy tags.

Key Implementation Pitfalls

  • Pushdown orderbefore recursing into children, push the parent’s lazy tag down. After recursing, recompute the parent’s value from the children. Forgetting either side breaks the structure silently — queries return stale values.
  • Composing lazy tags — for “set” and “add” mixed, “set” must override the pending “add”. Define a clear apply(child_lazy, parent_lazy) rule and write it down before coding.
  • Tag identity / no-op — every lazy slot needs an “identity” value (e.g., 0 for add, sentinel for set) that means “no pending op”. Don’t conflate “identity” with a legal value.
  • Range-set with empty intersection — apply only on full-cover nodes; recurse only on partial overlap.

Classic Problems

  • LeetCode 732 — My Calendar III (count of overlapping intervals — coord-compressed lazy seg tree)
  • LeetCode 2569 — Handling Sum Queries After Update (range flip + range sum, lazy + bitmask)
  • Codeforces “EDU: Segment Tree” sections

Hands-on: see Lab 02.


3. Fenwick Tree / Binary Indexed Tree (prefix sums with updates)

When To Use

  • Workload is point updates + prefix queries (or range queries via two prefix queries).
  • The aggregate is invertible (sum, xor) — Fenwick can’t express min/max naturally because they’re not invertible.
  • You want a smaller, faster, simpler structure than a segment tree, with smaller constants and easier code.

Complexity

Build O(N) (or O(N log N) trivially). Point update O(log N). Prefix query O(log N). Range query (for sum/xor) = query(r) - query(l - 1). Space O(N).

Key Implementation Pitfalls

  • 1-indexed — Fenwick trees are conventionally 1-indexed. Calling update(0, v) infinite-loops because 0 & -0 == 0. Use update(i + 1, v) if your data is 0-indexed.
  • i += i & -i (update) vs i -= i & -i (query) — the directions are not symmetric. Memorize: update goes up, query goes down.
  • Range-update + point-query is a different Fenwick variant — subtract on r+1, add on l. Not the same code path.
  • Range-update + range-query needs two Fenwick trees (the BIT² trick). Out of interview scope unless the problem explicitly demands.

Classic Problems

  • LeetCode 307 — Range Sum Query Mutable (canonical)
  • LeetCode 315 — Count of Smaller Numbers After Self (canonical Fenwick on coord-compressed values)
  • LeetCode 327 — Count of Range Sum
  • LeetCode 493 — Reverse Pairs

Hands-on: see Lab 03.


4. 2D Fenwick Tree (matrix prefix sums with updates)

When To Use

  • Matrix problems with both point updates and rectangle-sum queries.
  • Static prefix-sum + occasional updates — but updates make recomputing the prefix-sum O(NM), so 2D Fenwick.
  • Coordinate-compressed if the matrix is sparse (most cells empty).

Complexity

Update O(log N · log M). Rectangle query O(log N · log M). Space O(N · M).

Key Implementation Pitfalls

  • Nested i & -i — outer loop walks the row index, inner loop walks the column index. Independent.
  • Inclusion-exclusion for rectangle sum: Q(r2, c2) - Q(r1-1, c2) - Q(r2, c1-1) + Q(r1-1, c1-1). Forgetting the + corner is the canonical bug.
  • Sparse matrices — if N · M = 10^10 you cannot allocate the array. Use a dict of dict or coordinate compression along each axis.

Classic Problems

  • LeetCode 308 — Range Sum Query 2D Mutable (canonical)
  • LeetCode 327 — Count of Range Sum (often reduced to 2D via coords)
  • LeetCode 1505 — Min Number of Swaps to Make Strings K-Equal (variants use 2D BIT)

5. Sparse Table (immutable RMQ)

When To Use

  • The array is static (no updates) and you need idempotent range queries (min, max, gcd, “is there a 1 in this range”).
  • O(1) query is required (segment tree’s O(log N) is too slow).
  • Builds in O(N log N) — fine for one-time use.

Complexity

Build O(N log N). Query O(1) for idempotent ops, O(log N) for non-idempotent ops (sum). Space O(N log N).

Key Implementation Pitfalls

  • Idempotent op required for O(1) — the trick is query(l, r) = combine(table[k][l], table[k][r - 2^k + 1]) where k = floor(log2(r - l + 1)). The two halves overlap; this only works if combining the same element twice equals once. Sum is not idempotent (counts twice). Min, max, gcd, “or”, “and” are.
  • floor(log2(len)) precomputation — building a log_floor[] table of size N is required for true O(1). Computing log2 per query is too slow.
  • Memory — at N = 10^6, N log N ≈ 2 × 10^7 ints = 80 MB. Plan for it.

Classic Problems

  • LeetCode 1851 — Minimum Interval to Include Each Query (one approach uses sparse table + offline sort)
  • Codeforces RMQ classics

Hands-on: see Lab 04.


6. Sqrt Decomposition (block-based queries)

When To Use

  • Workload mixes point/range updates and range queries, but the operation is hard to fit into segment tree (e.g., “sum of distinct values in a range”, “k-th smallest in a range with offline queries”).
  • You want a much simpler implementation than a segment tree, accepting an O(√N) factor.
  • Mo’s algorithm (offline range queries, total cost O((N+Q) · √N)) is a sqrt-decomp specialization.

Complexity

Build O(N). Query/update O(√N) per op. Space O(N).

Key Implementation Pitfalls

  • Block size choice — block size B = ⌈√N⌉ minimizes total cost Q · (N/B + B). Slightly larger B (e.g., 1.5√N) sometimes wins for cache reasons.
  • Edge of block — left partial block (l to end of block) and right partial block (start of block to r) handled as scalars; full middle blocks summed via block totals.
  • Mo’s algorithm — sort queries by (block of l, r) (with even/odd block hack to halve constants), then move pointers. Don’t confuse “block of l” sorting with sorting by l.

Classic Problems

  • “Range Distinct Count” (offline) — Mo’s classic.
  • LeetCode 850 — Rectangle Area II (sweep + sqrt-block coord compression)
  • Codeforces “EDU: Sqrt Decomposition”

7. Persistent Segment Tree (overview)

When To Use

  • You need to query prior versions of an array — “sum over [l, r] as of version v”, “k-th smallest in [l, r] (offline, treat as 2D)”.
  • Classic application: “find k-th smallest in subarray” via merging persistent trees indexed by position.
  • Rare in standard interviews; expected for Grandmaster / CP-style.

Complexity

Update O(log N) creating a new version (path-copy). Query on any version O(log N). Space O(N + Q log N) over Q versions.

Key Implementation Pitfalls

  • Reference each version’s root — store an array roots[v]; never mutate an old node.
  • Memory blow-up — N + Q log N at N = 10^5, Q = 10^5 ≈ 1.7M nodes × ~24 bytes = 40 MB. Pre-allocate node pool.
  • Garbage collection — in GC’d languages, hold the root references to keep nodes alive; in C++, use a node pool + manual indices.

Classic Problems

  • “K-th smallest in range” (offline-via-persistent-seg-tree).
  • “Count distinct in range” (with persistent seg tree of last-occurrence positions).

This is overview-only in this phase; you should know it exists, what it solves, and the rough cost. Implementing it is a Phase 7 exercise.


8. Treap / Implicit Treap (overview)

When To Use

  • Balanced BST with expected O(log N) operations via randomized priorities — simpler than red-black or AVL.
  • Implicit treap keys by position in an array, supporting O(log N) array splice / split / merge / range reverse / range sum — classic for “rope” data structures.
  • Order-statistics tree (find k-th, count less than) when sorted-container library doesn’t expose it.

Complexity

Insert / delete / split / merge / range-op all O(log N) expected.

Key Implementation Pitfalls

  • Heap property on random priorities — re-bubble after insertion / deletion. Without correct rotations, you lose the log bound.
  • Lazy reverse / lazy add on implicit treap mirrors lazy segment tree — push tag before recursing.
  • Expected vs worst case — adversarial input can’t degrade because priorities are random; this is the entire point.

Classic Problems

  • “Array splice” with O(log N) per op — implicit treap canonical.
  • Order-statistics queries on an indexed multiset.

Overview only.


9. Splay Tree (overview, when used)

When To Use

  • Self-adjusting BST — recently accessed nodes move toward the root.
  • Useful when the access pattern has temporal locality (LRU-like): O(log N) amortized but O(1) for hot keys.
  • Library-link: the data structure underlying many compiler symbol tables.

Complexity

All ops O(log N) amortized; individual ops up to O(N) worst case. The amortization argument uses a potential function.

Key Implementation Pitfalls

  • Splay step — zig (one rotation), zig-zig (same-side double), zig-zag (opposite-side double). The choice depends on grandparent direction; getting it wrong destroys amortization.
  • Splay on every access — including failed search. Forgetting this means the amortization breaks.

Classic Problems

  • Rare in competitive interviews; common in systems-engineering follow-ups about LRU implementations.

Overview only.


10. KMP (Knuth–Morris–Pratt failure function)

When To Use

  • Single pattern P matched against a text T.
  • You need either all occurrences of P in T, or just first, in O(N + M).
  • Or you need the longest border (longest proper prefix = suffix) of a string — that’s the failure function itself.

Complexity

Failure-function build O(M). Match O(N + M). Space O(M).

Key Implementation Pitfalls

  • Failure function recursionfail[i] is the length of the longest proper prefix of P[0..i] that’s also a suffix. The recurrence walks j = fail[i - 1] backward via j = fail[j - 1] until matched.
  • 0 vs 1 indexed — pick one and stick with it. Most resources are 0-indexed; many CP templates are 1-indexed.
  • Forgetting to reset j to 0 between independent matches — the matcher state is per-text, not global.

Classic Problems

  • LeetCode 28 — Find the Index of the First Occurrence in a String (canonical strstr)
  • LeetCode 459 — Repeated Substring Pattern (one-shot via failure function)
  • LeetCode 214 — Shortest Palindrome (failure function on s + '#' + reverse(s))
  • LeetCode 1392 — Longest Happy Prefix

Hands-on: see Lab 05.


11. Z Algorithm

When To Use

  • Compute, for each position i in a string S, the longest prefix of S that starts at i (z[i]).
  • Substring matching — concatenate P + '#' + T and look for z[i] = M in the T half.
  • Pattern problems where “longest prefix matching at offset” is the natural query.

Complexity

Build O(N) using a sliding [l, r] window of the rightmost reaching match. Match O(N + M).

Key Implementation Pitfalls

  • Maintaining the [l, r] Z-box — when i ≤ r, copy from z[i - l] capped at r - i + 1, then extend; otherwise start fresh from i. Off-by-one on the cap is the canonical bug.
  • Sentinel character — must not appear in either P or T. Use \0 or a fresh symbol; in Python, a tuple of (0, ord(c)) for the sentinel and (1, ord(c)) for real chars works.
  • Z and KMP overlap — they solve the same problems with different invariants. Picking one and being fluent is better than knowing both shallowly.

Classic Problems

  • Same as KMP — LC 28, 214, 459, 1392.
  • “Number of occurrences of P in T overlapping” — count z[i] >= M in the T region.

12. Manacher’s Algorithm (longest palindrome in O(N))

When To Use

  • “Longest palindromic substring” or “count of palindromic substrings”.
  • Naive expand-around-center is O(N²); Manacher’s is O(N).
  • The trick is mirroring around the rightmost-reaching palindrome center.

Complexity

O(N). Space O(N).

Key Implementation Pitfalls

  • Even-length palindromes — Manacher’s classic trick is to insert # between every pair of chars (and at ends): "abba""#a#b#b#a#". Now every palindrome (odd or even original) is odd-length in the transformed string.
  • P[i] vs original-string radiusP[i] after the transform is the radius in the transformed string. The original palindrome has length P[i].
  • Maintaining the rightmost-reaching center C and right-boundary RP[i] = min(R - i, P[2C - i]) if i < R, else expand from scratch. Off-by-one on R - i (vs R - i + 1) is the canonical bug.

Classic Problems

  • LeetCode 5 — Longest Palindromic Substring (canonical)
  • LeetCode 647 — Palindromic Substrings (count)
  • LeetCode 1960 — Maximum Product of Two Palindromic Substrings

13. Rolling Hash (Rabin–Karp + double hashing)

When To Use

  • Compare a sliding-window substring against a pattern in O(1) per shift (after O(M) preprocessing).
  • Find duplicate substrings of length L in O(N) instead of O(N²) — the canonical “longest duplicate substring” via binary search on L + hashing.
  • Compare two substrings of S equal in O(1) — precompute prefix hashes.

Complexity

Preprocess O(N). Per-comparison O(1) hash plus O(M) verify (in adversarial settings; often skipped). For “longest duplicate substring”: O(N log N).

Key Implementation Pitfalls

  • Hash collisions — single hash with mod ~10^9 has ~50% collision probability over 10^5 strings (birthday paradox). Always double-hash in interview answers, or single-hash + explicit verify on match.
  • Modular arithmetic — base ~30 (alphabet) to ~10^9 prime; mod a large prime ~10^9. In Python, use pow(base, k, mod) for the negative-power inverse trick. In Java/C++, use long to avoid overflow on base * value.
  • Anti-hash adversarial inputs — Codeforces has problems specifically constructed to break common base/mod choices. Use random base from [26, mod-1] per run.

Classic Problems

  • LeetCode 187 — Repeated DNA Sequences (canonical small-window hashing)
  • LeetCode 1044 — Longest Duplicate Substring (binary search on length + rolling hash)
  • LeetCode 28 — Find the Index of the First Occurrence (Rabin–Karp variant)
  • LeetCode 1392 — Longest Happy Prefix

Hands-on: see Lab 06.


14. Suffix Array (overview, applications)

When To Use

  • All suffixes of S sorted lexicographically — enables binary search for any pattern in O(M log N).
  • “Longest common substring of two strings” via combined suffix array + LCP.
  • “Number of distinct substrings of S” = N(N+1)/2 − Σ LCP[i].

Complexity

Build O(N log² N) (radix-sort + double-the-rank trick) or O(N log N) (DC3 / SA-IS, harder). LCP via Kasai’s algorithm O(N). Per-pattern search O(M log N).

Key Implementation Pitfalls

  • Doubling-the-rank — sort suffixes by their first 1, 2, 4, …, N characters using the previous round’s ranks. Each round is a radix sort.
  • LCP array — Kasai’s algorithm: walk in original-index order, decrement an h counter that tracks current LCP. Subtle but linear.
  • Sentinel — append a unique smaller-than-all-others character to avoid prefix-of-another-suffix issues.

Classic Problems

  • “Longest common substring of two strings” (SA + LCP + range minimum).
  • “Number of distinct substrings”.
  • “K-th lexicographically smallest substring”.

Overview only — labs don’t drill suffix arrays directly; rolling hash + Z/KMP cover most interview cases.


15. Suffix Automaton (overview, applications)

When To Use

  • Smallest DFA accepting all substrings of S, built in O(N) — strictly stronger than a suffix tree for many queries.
  • “Number of distinct substrings”, “longest common substring”, “count occurrences of a pattern” — all O(M) per query after O(N) build.

Complexity

Build O(N · σ) where σ is alphabet size. Space O(N · σ).

Key Implementation Pitfalls

  • link (suffix-link) pointers — the equivalence-class tree of states. Subtle to derive; templates exist.
  • cnt augmentation — counting occurrences requires DP on the suffix-link tree.
  • Online construction — add char at a time, maintain the latest state.

Classic Problems

  • “Distinct substrings count”.
  • “Longest common substring of K strings”.
  • Codeforces / SPOJ string problems.

Overview only. Rare in interviews; expected at Grandmaster.


16. Trie Variants (compressed, with counts, with deletion)

When To Use

  • Prefix queries: “does any inserted word start with prefix P?”, “all words with prefix P”.
  • Multi-word search (LC 212 — Word Search II): compile dictionary into a trie, DFS the grid against the trie for O((R · C) · 4^L) instead of per-word DFS.
  • Autocomplete with frequency ranking — trie augmented with word counts and best-K-at-each-node.

Complexity

Insert O(L). Search prefix O(L). Space O(Σ L · σ) for plain array-based trie; O(Σ L · branching) for hash-based.

Key Implementation Pitfalls

  • Array-of-26 vs hash-of-char — array-of-26 is faster (no hashing); hash is more memory-efficient on sparse tries. Pick one based on problem constraints.
  • End-of-word marker — use a separate is_end boolean, not a sentinel char that could collide with input.
  • Compressed (radix) trie — chains of single-child nodes are merged into a single edge labeled with the substring. Saves memory at the cost of more complex insertion (split an edge mid-substring).
  • Deletion — typically just clear is_end and prune empty subtrees. Don’t free shared nodes.
  • Counts at each node — increment on insert, decrement on delete; useful for “count words with prefix P” in O(L).

Classic Problems

  • LeetCode 208 — Implement Trie (canonical)
  • LeetCode 211 — Design Add and Search Words (with . wildcard — DFS)
  • LeetCode 212 — Word Search II (canonical trie-on-grid)
  • LeetCode 421 — Maximum XOR of Two Numbers in an Array (binary trie)
  • LeetCode 642 — Design Search Autocomplete System

Hands-on: see Lab 07.


17. Aho–Corasick (multi-pattern matching)

When To Use

  • Match a set of K patterns simultaneously against a text T — “find all dictionary words that occur in T”.
  • Naive: O(N · ΣM) — too slow for K = 10^4 patterns.
  • Aho–Corasick: O(N + ΣM + #matches) — linear in everything.

Complexity

Build O(ΣM · σ). Match O(N + #matches). Space O(ΣM · σ).

Key Implementation Pitfalls

  • Failure links — analog of KMP’s failure function on a trie. Built via BFS over the trie.
  • Output (dict-suffix) links — for each node, follow failure links to collect every matching pattern that ends at this position. Without dict-suffix links, you’d miss patterns that are suffixes of other patterns.
  • Node pool size — total nodes ≤ ΣM + 1. Pre-allocate.

Classic Problems

  • LeetCode 1032 — Stream of Characters (canonical reverse-trie + Aho-Corasick)
  • “Find all dictionary words occurring in document”.

18. Bloom Filter (probabilistic membership)

When To Use

  • “Is X in the set?” with a tolerated false-positive rate, zero false negatives.
  • Memory tight (you can’t store the full set), or you want a fast pre-filter before a slow exact check (e.g., disk lookup).
  • Streaming dedup with a fixed false-positive budget.

Complexity

Insert / query O(K) where K is the number of hash functions. Space O(M) bits where M is chosen for the target false-positive rate (1 - e^(-KN/M))^K.

Key Implementation Pitfalls

  • K (#hash functions) and M (#bits) — given target FPR p and capacity n, optimum is M = -n ln(p) / ln(2)², K = (M/n) ln 2.
  • No deletion — standard Bloom can’t delete (can’t tell which other element shares the bit). Counting Bloom does, with K counters.
  • False positives compound with set size — a 1% Bloom filter at capacity is 1% per query, not 1% over a workload. Rebuild when growing.

Classic Problems

  • System-design follow-ups (“how do you check whether a URL has been crawled in the last 30 days?”).
  • LeetCode-adjacent: not common as a graded problem, but expected in design interviews.

19. Skip List (overview)

When To Use

  • Randomized alternative to balanced BST — O(log N) expected, simpler to implement than AVL/RB.
  • Used in practice in Redis sorted sets, LevelDB MemTable.
  • Order-statistics, range queries on a sorted set, with concurrent-modification flexibility.

Complexity

Insert / delete / search O(log N) expected. Space O(N) expected (geometric level distribution).

Key Implementation Pitfalls

  • Level distribution — each new node’s level is geometric with p = 1/2 (or 1/4 in production). Cap at log N.
  • Update array on insert/delete — track the predecessor at each level; splice carefully.
  • Concurrent skip list — much simpler than concurrent BST; standard library impls in Java (ConcurrentSkipListMap).

Classic Problems

  • LeetCode 1206 — Design Skiplist (canonical implementation problem)
  • System-design discussions of Redis ZSET / LevelDB.

Overview only; the implementation problem (LC 1206) is good practice but rare.


20. LRU / LFU Implementation Deep Dive

When To Use

  • Cache eviction problems with O(1) get/put requirement.
  • LRU: hash map + doubly-linked list. Touched node moves to head; evict tail.
  • LFU: hash map of (key → node) + hash map of (freq → doubly-linked list). On hit, move node to next-freq list; on evict, drop tail of min-freq list.

Complexity

LRU: O(1) get and put. Space O(capacity). LFU: O(1) get and put. Space O(capacity). Maintaining min_freq is the subtle bookkeeping bit.

Key Implementation Pitfalls

  • LRU: doubly-linked list with sentinel head/tail — eliminates null checks. Always add at head, evict from tail.
  • LRU: hash map points to nodes, not keys — so you can splice the node in O(1) without searching.
  • LFU: min_freq invariant — increment when freq-list at min_freq becomes empty only if the touched node was the cause.
  • LFU: list-per-frequency — implement as a doubly-linked list of nodes; ordering within a freq is LRU.

Classic Problems

  • LeetCode 146 — LRU Cache (canonical)
  • LeetCode 460 — LFU Cache
  • LeetCode 432 — All O(1) Data Structure (frequency buckets)

Both are bread-and-butter for systems-engineering interviews.


21. Disjoint Set Union (DSU) with Path Compression and Union by Rank — Proof of α(N) Amortization

When To Use

  • Online connectivity (#1 trigger).
  • Kruskal’s MST.
  • Equation problems (weighted DSU — LC 399 Evaluate Division).
  • Offline divide-and-conquer queries with rollback (advanced).

Complexity

Each op amortized inverse-Ackermann α(N) — for all practical N (up to 2^65536), α(N) ≤ 4. Effectively constant.

Proof Sketch (Tarjan)

  • Without compression or rank: worst-case chain → O(N) per op.
  • Path compression alone: each find shortens the path. Amortized O(log N) per op.
  • Union by rank (or size) alone: depth bounded by O(log N). Per-op O(log N) worst case.
  • Both together: per-op amortized O(α(N)). Tarjan’s potential function counts “blocks” of nodes by rank and shows the total cost over M ops is O(M · α(N)). The proof uses Ackermann’s hierarchy A(k, n) and α(N) is its inverse.
  • For an interview: state “with both heuristics, amortized O(α(N)), where α grows so slowly it’s ≤ 4 for any N you’ll see in practice; you treat it as O(1)”. Cite Tarjan 1975.

Key Implementation Pitfalls

  • Recursive find blows the stack at N = 10^5 in Python. Use iterative two-pass: walk up to root, then walk again compressing.
  • Path-halving variant (parent[x] = parent[parent[x]] per step) — simpler, asymptotically equivalent, often faster than full compression in practice.
  • Union by rank vs union by size — both work. Rank is the height upper bound of the tree (compression doesn’t decrease rank); size is the count of nodes in the tree. Pick one.
  • Forgetting to update rank when ranks are equal — break the tie and increment the survivor’s rank.

Classic Problems

  • See pattern 18 in Phase 2 README.
  • LeetCode 200 — Number of Islands (DSU alternative)
  • LeetCode 305 — Number of Islands II (online DSU canonical)
  • LeetCode 547 — Number of Provinces
  • LeetCode 684 — Redundant Connection
  • LeetCode 721 — Accounts Merge
  • LeetCode 952 — Largest Component Size by Common Factor
  • LeetCode 399 — Evaluate Division (weighted DSU)

Phase 2 covered DSU mechanically. Phase 3’s contribution is the proof of α(N) and the weighted / rollback variants.


22. Bit Manipulation Idioms (popcount, lowbit, isolate trailing one, parity)

When To Use

  • Bitmask DP (pattern 23) requires fluency in these primitives.
  • Subset enumeration, parity tricks, fast set operations.
  • Hot-loop optimization where each int represents a tiny set (≤ 64 elements).

Idioms

  • Popcount: __builtin_popcount(x) (C/C++/Java via Integer.bitCount), bin(x).count('1') (Python — slow), x.bit_count() (Python 3.10+, fast).
  • Lowbit / lowest set bit: x & -x gives the value of the lowest set bit. x & (x - 1) clears it.
  • Isolate trailing ones: x & ~(x + 1). Set trailing zero: x | (x + 1).
  • Iterate subsets of mask: s = mask; while s: ... ; s = (s - 1) & mask enumerates each subset of mask exactly once.
  • Iterate set bits: while x: lb = x & -x; ... ; x ^= lb. Each step does O(1) work, total O(popcount).
  • Parity: bin(x).count('1') & 1 or — faster — XOR-fold: x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1; return x & 1.
  • Power of two test: x > 0 and (x & (x - 1)) == 0.
  • Swap without temp: a ^= b; b ^= a; a ^= b — academic; never use in production.

Classic Problems

  • LeetCode 191 — Number of 1 Bits (popcount).
  • LeetCode 338 — Counting Bits (DP using dp[x] = dp[x >> 1] + (x & 1)).
  • LeetCode 461 — Hamming Distance (popcount of XOR).
  • LeetCode 78 — Subsets via bitmask iteration.

Mastery here is a prerequisite for Pattern 23.


23. Bitmask DP Foundation

When To Use

  • N ≤ ~20 and the problem asks for an optimum over subsets or assignments of N items.
  • Examples: traveling salesman (TSP), assignment problem, “shortest path visiting all nodes”, “minimum cost to cover all groups”.
  • The state is a bitmask of “which items are used / visited / completed”.

Canonical Forms

  • Permutation DP: dp[mask][i] = min over j in mask\{i}: dp[mask \ {i}][j] + cost(j, i). Result: min over i: dp[full_mask][i] (or back-to-start for TSP cycle).
  • Subset cover DP: dp[mask] = min over partition of mask into subset s and rest: cost(s) + dp[mask \ s].
  • Assignment DP: dp[mask] = min cost to assign people 0..popcount(mask)-1 to the jobs in mask.

Complexity

Permutation DP: O(2^N · N²) time, O(2^N · N) space. N=20 → 4 × 10^8 — borderline. Subset DP: O(3^N) time (enumerating subset of subset). N=15 → 14M — comfortable.

Key Implementation Pitfalls

  • Subset-of-subset enumeration uses s = mask; while s: ... ; s = (s - 1) & mask. The mask invariant is critical.
  • Initial mask — for permutation DP, initialize dp[1 << i][i] for the first city; iterate masks in increasing order so dependencies are resolved.
  • Reconstruction — to recover the order, store predecessor (mask, i) → (prev_mask, prev_i) and walk back.
  • N too large — N > 20 is not bitmask DP territory. Reach for meet-in-the-middle (#24) or heuristics.

Classic Problems

  • LeetCode 847 — Shortest Path Visiting All Nodes (canonical bitmask + BFS)
  • LeetCode 1125 — Smallest Sufficient Team (subset cover DP)
  • LeetCode 943 — Find the Shortest Superstring (TSP-like, DP over permutations)
  • LeetCode 698 — Partition to K Equal Sum Subsets (subset assignment)
  • LeetCode 1494 — Parallel Courses II
  • LeetCode 1879 — Minimum XOR Sum of Two Arrays (assignment DP)

Hands-on: see Lab 08.


24. Meet-in-the-Middle (split, sort, two-pointer)

When To Use

  • N ≤ 40 (or up to 50), the problem asks for “subset with property X”, and 2^N is too large but 2^(N/2) is fine.
  • Examples: “subset sum closest to T at N=40”, “count subsets with XOR equal to K”, “split items into two groups minimizing difference”.

Canonical Template

left, right = a[:n // 2], a[n // 2:]
sums_left  = sorted(sum(combo) for combo in subsets(left))   # 2^(N/2)
sums_right = sorted(sum(combo) for combo in subsets(right))  # 2^(N/2)
# for each L in sums_left, binary-search the closest R such that L + R ≈ T.

Complexity

Time O(2^(N/2) · N/2) for enumeration + O(2^(N/2) · log(2^(N/2))) = O(N · 2^(N/2)) for sort, then O(2^(N/2) · log) for the merge. At N=40 → ~10^6 ops. Space O(2^(N/2)) for the two subset-sum lists.

Key Implementation Pitfalls

  • Enumerate subsets correctlyfor mask in range(1 << k): sum = sum of bits set in mask via popcount-iteration. Or recursive include/exclude.
  • Two-pointer or binary search — once both halves are sorted, sweep with two pointers (one from each end) to minimize / count target.
  • Memory — at N=40, half-mask space is 2^20 = 1M entries × 8 bytes = 8 MB. Comfortable, but watch out at N=44.
  • Counting (not just existence) — careful binary-search for lo, hi bounds; use bisect_left and bisect_right.

Classic Problems

  • LeetCode 1755 — Closest Subsequence Sum (canonical N=40)
  • LeetCode 956 — Tallest Billboard (subset DP alternative; meet-in-the-middle viable)
  • LeetCode 805 — Split Array With Same Average (meet-in-the-middle)
  • “Subset sum at N=40” — competitive-programming staple.

Hands-on: see Lab 09.


Recognition Cheat Sheet

Problem SignalStructure
Range query + point update, sum/min/maxSegment tree (#1) or Fenwick (#3, if invertible)
Range query + range updateLazy segment tree (#2)
Static range min/max with O(1) queriesSparse table (#5)
Range distinct count, hard-to-segment-tree aggregateSqrt decomposition / Mo’s (#6)
Single pattern in textKMP (#10) or Z (#11)
Longest palindrome / count palindromesManacher (#12)
Many-substring equality / longest duplicateRolling hash (#13)
Multi-pattern dictionary in textAho–Corasick (#17)
Prefix queries, autocomplete, word-on-gridTrie variants (#16)
Probabilistic membershipBloom filter (#18)
Cache with O(1) get/putLRU / LFU (#20)
Connectivity / equation graphsDSU (#21)
N ≤ 20, subset / assignment optimumBitmask DP (#23)
N ≤ 40, subset existence / closest sumMeet-in-the-middle (#24)
Bit-level state mechanicsBit idioms (#22)

Mastery Checklist

You have completed Phase 3 when you can, on demand and from memory:

  • Implement a segment tree (point update, range sum/min/max) in <12 minutes, with no off-by-ones, on the first attempt.
  • Add lazy propagation for range-add + range-sum in <20 minutes, articulating the push-down invariant.
  • Implement a Fenwick tree (1-indexed, prefix-sum + point update) in <8 minutes.
  • State why Fenwick can’t do range-min naturally and which segment-tree augmentation handles it.
  • Build a sparse table for static RMQ in <10 minutes, including the log_floor[] precompute.
  • Choose between segment tree, Fenwick, sparse table, and sqrt decomposition based on the workload (read-only vs mixed; aggregate type) in <30 seconds.
  • Compute KMP’s failure function on a string of length 20 by hand, no errors.
  • Implement KMP match in <12 minutes.
  • Implement Manacher’s longest palindrome in <20 minutes (this one is hard; that’s expected).
  • Implement double-hashing rolling hash in <15 minutes; explain why single hash is insufficient.
  • Implement a trie (insert, search, startsWith) in <8 minutes.
  • Implement Aho–Corasick at the conceptual level (failure + dict-suffix links) and state its complexity.
  • State the Bloom filter formula: target FPR p, capacity n → M = -n ln(p) / ln(2)², K = (M/n) ln 2.
  • Implement LRU cache (146) in <10 minutes; LFU (460) in <25 minutes.
  • Implement DSU with path compression + union by rank in <8 minutes; state the α(N) bound and cite Tarjan.
  • Use x & -x, x & (x - 1), subset-of-mask enumeration without thinking.
  • Recognize bitmask-DP from N ≤ 20 and write the transition in <10 minutes for an unfamiliar problem.
  • Recognize meet-in-the-middle from N = 40 and write both halves + merge in <20 minutes.

If any of these takes >2× the budget, drill it again — that structure is your weakest link. Hards rarely fail because all your structures are weak; they fail because one of them is, and that’s the one this Hard happened to need.


Exit Criteria

You may proceed to Phase 4 — Graph Mastery only when:

  1. All 9 labs are complete, with the deliverable code written, tested, and reviewed via the REVIEW_TEMPLATE.
  2. Mastery checklist is fully ticked.
  3. 30+ Hard problems solved across the structures above (10 segment-tree-class, 5 string-algo, 5 trie/AC, 5 DSU/bitmask, 5 free choice).
  4. Mock interview at Phase 3 level: you receive a Hard segment-tree problem, a Hard string problem, and a Medium-Hard bitmask DP problem in a 90-minute window. Solve at least 2 of the 3 cleanly.
  5. No structure is “the one I always get wrong” — drill it until it isn’t.

If any of these fails, do not proceed. Phase 4 builds on the assumption that DSU, segment trees, and bitmask are reflexes. If they are not, Phase 4’s harder graph problems will compound the gap.


Labs

#LabStructureCanonical Problem
01Segment tree (range query)Point update + range sum/min/maxLC 307
02Segment tree with lazy propagationRange update + range queryRange-add + range-sum
03Fenwick tree (BIT)Coord-compressed FenwickLC 315
04Sparse table for RMQStatic O(1) RMQRange-min array
05KMP string matchingFailure function + matchLC 28 / 459
06Rolling hashDouble hashingLC 187 / 1044
07Trie applicationsTrie with is_end + DFS-on-trieLC 208 / 212
08Bitmask DPPermutation DP over subsetsLC 847
09Meet-in-the-middleSplit-sort-mergeLC 1755

Common Failures At This Phase

These are the failure modes that consume the most candidate time at Phase-3 level. Tag them when they occur using FAILURE_ANALYSIS.md.

  • Segment tree off-by-ones — closed-vs-open intervals mixed mid-recursion. Fix: always closed [l, r], never mix with [l, r).
  • Fenwick tree 0-index trapupdate(0, …) infinite-loops. Fix: shift to 1-index at the boundary.
  • KMP failure function off-by-onej = fail[j - 1] vs j = fail[j]. Fix: derive on a 5-char example.
  • Rolling hash single-mod collisions — pass random unit tests, fail adversarial. Fix: double-hash always.
  • Bitmask DP transition directiondp[mask] from dp[mask & ~bit] (forward) vs dp[mask | bit] from dp[mask] (backward). Both work; mixing them mid-implementation breaks. Fix: pick one before coding.
  • DSU recursive find stack overflow at N=10^5 in Python. Fix: iterative two-pass.
  • Lazy segment tree forgetting to push before recursing into a child. Fix: write push_down(node) as the first line of any non-leaf recursion.

Cross-References

  • FRAMEWORK.md — apply on every Hard.
  • CODE_QUALITY.md — Hards do not get graded leniency; clean code still required.
  • COMMUNICATION.md — out loud at the recognition step, the structure name and complexity must be explicit. “I’ll use a segment tree with lazy propagation; build O(N), query and update O(log N), space O(4N).”
  • SPACED_REPETITION.md — segment tree and KMP should be on a 7-day cycle for the first month after this phase. Bitmask and meet-in-the-middle on 14-day.
  • Phase 4 — Graphs — DSU shows up immediately; review #21 the day before starting Phase 4.
  • Phase 5 — DP — bitmask DP is the bridge. Without #23 fluency, Phase 5’s “DP on graphs / DAG / interval” labs will hurt.
  • Phase 7 — Competitive — persistent seg tree, suffix array/automaton, splay/treap deepen here.