p42 — Top K Frequent Elements

Source: LeetCode 347 · Medium · Topics: Array, Hash Table, Heap, Bucket Sort, Quickselect Companies (2024–2025 frequency): Meta (very high), Amazon (very high), Google (high), Microsoft (high), Apple (medium), Uber (high), LinkedIn (high) Loop position: Phone screen and first round. The “bucket sort is faster” curveball — most candidates default to heap, the senior signal is recognizing bucket sort.

1. Quick Context

Given nums, return the k most frequent elements.

Three acceptable solutions:

ApproachTimeSpaceWhen to pick
Counter + sort by countO(n log n)O(n)Baseline; never primary
Counter + min-heap of size KO(n log K)O(n + K)Production default
Counter + bucket sortO(n)O(n)Optimal — beats heap when count ≤ n

The discriminator is recognizing bucket sort applies: counts are bounded by n (an element can appear at most n times), so a bucket array of size n+1 indexed by frequency lets you sweep top-K in linear time. Most candidates miss this.


🔗 https://leetcode.com/problems/top-k-frequent-elements/

STOP. 20-min timer. Aim to derive bucket sort.


3. Prerequisite Concepts

  • collections.Counter for O(n) frequency counting.
  • heapq and min-heap-of-size-K (p41).
  • The bucket-sort insight: when values are bounded by n, sort by indexing.

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Count each element’s frequency; return K elements with highest counts. Order among the K is unspecified.”

Step 2 — Clarify:

  • “Ties — which to break in favor of?” (LC: any; if asked, specify lex / first-seen / etc.)
  • “k ≤ #distinct?” (LC: guaranteed.)
  • “n bound?” (LC: 10^5; both heap and bucket comfortable.)
  • “Are elements hashable?” (Yes — ints.)
  • “Return order matters?” (LC: no.)

Step 3 — Constraints: n ≤ 10^5 → counts ≤ 10^5. Any of the three approaches comfortably passes.

Step 4 — Examples:

  • [1,1,1,2,2,3], k=2[1, 2]
  • [1], k=1[1]
  • [4,4,4,4,4], k=1[4]

Step 5 — Brute Force: Counter → list of (count, val) → sort by count → take top K. O(n log n).

Step 6 — Complexity: Heap O(n log K); Bucket O(n).

Step 7 — Pattern: “Top K by frequency” → bucket sort wins when count is bounded by n; heap is general.

Step 8 — Develop: see solution.py.

Step 9 — Test: k=1, k=#distinct (all elements), all same, all distinct (every count = 1).


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — Bucket sort is the surprise

6.1 The bucket-sort insight

Element counts are bounded: 1 ≤ count ≤ n. So you can create a buckets array of size n + 1 where buckets[c] is the list of elements with frequency c. Sweep c from n down to 1, collecting elements until you have K.

counter = Counter(nums)
buckets = [[] for _ in range(len(nums) + 1)]
for val, cnt in counter.items():
    buckets[cnt].append(val)

result = []
for c in range(len(buckets) - 1, 0, -1):
    for val in buckets[c]:
        result.append(val)
        if len(result) == k: return result

Time: O(n) for Counter + O(n) for bucket fill + O(n) for sweep = O(n) total.

This is the optimal answer. It beats heap O(n log K) asymptotically. Why isn’t it always used? Memory: bucket array is size n + 1 regardless of K, while heap is size K. For huge n and tiny K, the heap can be more cache-friendly.

6.2 Heap solution — when bucket sort doesn’t apply

For STREAMING input where you don’t know n in advance, bucket sort fails — you can’t pre-size the bucket array. Heap-of-size-K is the streaming-friendly answer.

counter = Counter(nums)
heap = []
for val, cnt in counter.items():
    if len(heap) < k:
        heapq.heappush(heap, (cnt, val))
    elif cnt > heap[0][0]:
        heapq.heappushpop(heap, (cnt, val))
return [val for cnt, val in heap]

Tuple (cnt, val) — min-heap orders by count (first tuple element). On ties, val breaks the tie (deterministic).

6.3 Quickselect on (val, count) pairs — O(n) expected

Use quickselect to find the K largest by count. Same algorithm as p41 but the comparison key is the count, not the value itself.

items = list(counter.items())   # [(val, count), ...]
# Quickselect on items, comparing by count, targeting position len(items) - k
quickselect(items, len(items) - k, key=lambda x: x[1])
return [val for val, _ in items[len(items) - k:]]

O(n) expected; same trade-offs as p41.

6.4 Why ties don’t matter (usually)

LC accepts any K elements with the top-K counts. If two elements tie for K-th place, either is acceptable. BUT: the production version often requires deterministic tie-breaking (lex order, first-seen, etc.) — confirm with the interviewer.

For lex tie-break: heap key becomes (cnt, val) directly works (min-heap → smallest cnt-then-val at root). For “first-seen” tie-break: include an insertion counter in the key.

6.5 LC 692 — Top K Frequent Words (tie-break variant)

Same problem with deterministic tie-break: “if two words have same frequency, the lexicographically smaller comes first.” This makes heap key tricky: we want (LARGER count, SMALLER word) to be at the heap root, but heapq is min-heap.

Trick: invert just the count: heapq.heappush(heap, (-cnt, word)). Then a min-heap pops smallest (-cnt, word) = largest cnt, lex-smallest word on ties. After collecting top K (with size-K heap), pop everything and reverse.

This is the kind of detail that distinguishes a senior from a mid-level candidate.

6.6 Counter is a hash table (and that matters)

collections.Counter is a subclass of dict. Frequency counting is O(n) expected but O(n²) worst case under adversarial hash collisions (theoretical only — Python’s hash is randomized per-process by default). For arbitrary objects (not just ints/strings), ensure __hash__ is well-defined.


7. Anti-Pattern Analysis

  1. Sort + slice as your only answer — O(n log n), misses the bucket-sort O(n) signal.
  2. Heap of size n (push all, pop K) — O(n log n). Wrong reflex: heap-of-size-K is O(n log K).
  3. Push counts, not (count, val) tuples — you lose track of which value the count refers to.
  4. Bucket sort with defaultdict(list) instead of array — slower in practice for dense bucket fill; preallocated list is the right choice.
  5. Bucket sort with size max(counter.values()) + 1 — works but requires a pre-scan; len(nums) + 1 is upper bound and avoids the scan. Tiny win for clarity.
  6. Forgetting LC 692 tie-break — push (-cnt, word) not (cnt, word) for size-K heap with lex tie-break.
  7. Mutating counter.items() during iteration — fortunately we don’t, but a common bug elsewhere.
  8. Returning a heap, not its values — extract with list comprehension.
  9. Not using Counter — manual defaultdict(int) loop works but Counter is idiomatic.

8. Skills & Takeaways

  • Bucket sort O(n) when count is bounded by n.
  • Min-heap of size K with (count, val) tuples for streaming-friendly top-K.
  • Quickselect on items by count key for O(n) expected.
  • LC 692 trick: (-cnt, word) for deterministic lex tie-break in size-K min-heap.
  • Counter is the idiomatic frequency-counter; subclass of dict.
  • Three valid answers, deliberate pick: bucket (optimal), heap (streaming), quickselect (interview flair).

9. When to Move On

Move on when:

  • You can write bucket-sort version in ≤ 7 min.
  • You can write heap version in ≤ 5 min.
  • You can articulate why bucket sort is O(n).
  • You handle LC 692 tie-break correctly.
  • Stress test on 500 random arrays agrees with sort baseline.

10. Company Context

Meta:

  • Phone-screen staple. The bucket-sort recognition is the senior signal.
  • Bar Raiser asks: “what’s the time complexity? Can you do better than O(n log K)?” — expects bucket sort.
  • Scorecard phrase: “Algorithm — derived O(n) bucket sort; named heap and quickselect alternatives.”

Amazon:

  • L5/L6 staple. Often combined with “top K customers by purchase count” framing.
  • Bar Raiser tests “top K over a 1B-row table” → discusses partitioning + per-shard top-K.

Google:

  • L4/L5. Often follows up with LC 692 (words + lex tie-break) for senior signal.

Microsoft:

  • L63/L64. Common in search/ranking team interviews.

Uber/LinkedIn:

  • “Top K trending hashtags” / “top K skills” framing. Streaming variant common — heap-of-size-K + sliding window.

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
First answerSort by countHeap of size KBucket sort O(n)+ names quickselect + streaming caveat
CounterManual dict loopCounter(nums)+ explains it’s a dict subclass+ adversarial hashing caveat
Tie-break“Doesn’t matter”“Use tuple key”“(-cnt, val) for lex order”+ deterministic-output discussion
Streaming“Re-sort each time”“Heap-of-size-K”+ “bucket sort needs known n”+ sliding-window Misra-Gries / Space-Saving
LC 692Brute lex sortTuple key (basic)(-cnt, word) for size-K min-heap, reverse at end+ count-min sketch for approximate large-stream variant

12. Level Delta

Mid (L4):

  • Heap solution clean with (cnt, val) tuples.
  • States O(n log K).
  • May miss bucket sort.

Senior (L5):

  • Bucket sort first; names heap as streaming alternative.
  • Handles LC 692 tie-break correctly.
  • Mentions Counter internals.

Staff (L6):

  • Discusses production: hot-key detection, trending topics, real-time leaderboard.
  • Discusses approximate streaming algorithms (Misra-Gries, Space-Saving).
  • Discusses sharded top-K for distributed systems.

Principal (L7+):

  • Discusses count-min sketch + heap for bounded-memory streaming top-K with provable error.
  • Discusses sliding-window top-K with lazy deletion.
  • Discusses adversarial input: hash collisions, count overflow, memory exhaustion.
  • Discusses A/B testing: how top-K rankings drift over time and the need for stable tie-break.

13. Follow-up Q&A

Q1: LC 692 — Top K Frequent Words with lex tie-break. A: Use heapq with key (-cnt, word). Min-heap pops smallest (-cnt, word) = largest cnt, lex-smallest on ties. After collecting K, pop all and reverse to get descending order.

Q2: Streaming over an unbounded input? A: Heap-of-size-K is feasible only if #distinct values fits in memory. For unbounded distinct values, use Space-Saving algorithm (bounded-memory approximate top-K with provable error bounds).

Q3: Top K over a sliding window of W? A: Maintain count map per window position + heap with lazy deletion. Or: bucket of “active counts” per element.

Q4: K = n (return ALL elements by frequency)? A: Sort by count: O(n log n). No win from heap or bucket — must touch everything anyway.

Q5: Top K by SUM-of-values, not count? A: Same template; replace Counter with grouped sum, then bucket / heap / quickselect on sums.

Q6: Top K across multiple shards? A: Per-shard top-K + merge with K · S-heap (S shards). Total O((K · S) log (K · S)).


14. Solution Walkthrough

See solution.py:

  • top_k_frequent_bucket(nums, k) — O(n) bucket sort.
  • top_k_frequent_heap(nums, k) — heap of size K with (cnt, val) tuples.
  • top_k_frequent_sort(nums, k) — baseline.
  • top_k_frequent_words(words, k) — LC 692 with lex tie-break.
  • edge_cases() + stress_test() — 500 random arrays; compare as SETS (order unspecified).

Run: python3 solution.py.


15. Beyond the Problem

Principal-engineer code review comment:

“Three things. (1) The bucket sort is asymptotically optimal but memory is O(n) regardless of K — for a service processing top-1 over n=10^9, this matters; use the heap. State the trade-off in the docstring. (2) LC 692’s (-cnt, word) trick is the kind of code that breaks two years later when someone ‘simplifies’ it; add a comment with WHY the negation is there and a test that fails if you remove it. (3) For real top-K systems (trending hashtags, recommendation candidates), you’re almost never doing exact top-K over batched data — you’re doing approximate top-K over a stream with bounded memory. Mention Space-Saving / Misra-Gries / count-min sketch in the design doc; the interview question is the toy version of a much harder real problem.”

Connections forward:

  • p41 — Kth Largest Element (sibling pattern).
  • p44 — Merge K Sorted Lists (heap-over-K).
  • LC 692 — Top K Frequent Words (lex tie-break).
  • LC 973 — K Closest Points (heap with distance key).
  • LC 451 — Sort Characters By Frequency (bucket sort revisited).
  • Phase 02 Lab 09 — Heap Top-K (lab generalization).
  • Production: trending hashtags, hot-key detection, recommendation candidates, search-suggestion ranking.