Lab 09 — Heap For Top-K: Top K Frequent Elements (Heap Vs Bucket Sort)

Goal

Master the two canonical “top K” algorithms — min-heap of size K for streaming/general cases and bucket sort by frequency for bounded-frequency cases. Deliverable solves LC 347 with both, articulates the time-space trade-off, and recognizes which language-specific gotcha (Python heapq is min-only; Java PriorityQueue boxes; C++ defaults to max-heap) applies.

Background Concepts

Min-heap of size K as the canonical “running top K” structure: pop when size exceeds K, ensuring O(N log K). Bucket sort by frequency (not value) when frequencies are bounded by N. The duality “top K frequent” / “K largest” / “K smallest” via heap-direction inversion. QuickSelect as the O(N) average alternative. Review pattern Heap Top K and Heap Foundations.

Interview Context

Top-K problems are interview gold — they appear at every Big Tech, often as the warmup or the second problem. The interview signal is whether you can match the right structure to the input shape: heap when N is huge or streaming, bucket sort when frequencies are bounded (which they always are in this problem since the max frequency is N). Strong candidates write the heap solution. Elite candidates write the heap solution, then also mention the O(N) bucket-sort alternative and articulate when each is preferred.

Problem Statement

Given an integer array nums and an integer K, return the K most frequent elements. The answer can be returned in any order.

Constraints

1 ≤ |nums| ≤ 10^5
-10^4 ≤ nums[i] ≤ 10^4
K is in the range [1, |distinct values in nums|].
The answer is unique (no ties at the K-th position that would create ambiguity).

Clarifying Questions

Tie-breaking — what if two values share the K-th frequency rank? (Per constraints, the answer is unique. But if it weren’t, ask: arbitrary, or some specified rule like smallest value first?)
Output order — sorted by frequency, by value, or any? (LC: any.)
Are floats / strings possible, or strictly ints? (Per constraints, ints — but the algorithm extends trivially to any hashable type.)
Streaming or offline? (Offline, but extending to streaming — windowed top-K — is a follow-up.)
Memory-constrained? (No special constraint, but the algorithm should be O(N) space because the frequency map is unavoidable.)

Examples

Input	Output
`nums=[1,1,1,2,2,3], K=2`	`[1,2]`
`nums=[1], K=1`	`[1]`
`nums=[4,1,-1,2,-1,2,3], K=2`	`[-1,2]`

Initial Brute Force

Count frequencies; sort by frequency; take top K.

from collections import Counter

def top_k_brute(nums, K):
    cnt = Counter(nums)
    return [x for x, _ in sorted(cnt.items(), key=lambda kv: -kv[1])[:K]]

Brute Force Complexity

Time O(N + D log D) where D is the number of distinct values. Space O(D). At N=10⁵, D≤2×10⁴+1, this is fast — but it sorts more than needed (full O(D log D) when we only want top K).

Optimization Path A — Min-Heap of Size K

Build a frequency map. Walk the (value, freq) pairs maintaining a min-heap of size K keyed by freq. For each pair, push; if heap size > K, pop. The K survivors are the top K.

Why min-heap? We want to drop the smallest frequency when the heap overflows; min-heap.peek() gives the smallest in O(1).

Time: O(N) frequency count + O(D log K) heap. For D >> K, this is much faster than O(D log D).

Optimization Path B — Bucket Sort by Frequency

Frequencies are integers in [1, N]. Create buckets bucket[f] = list of values with freq == f. Walk f from N down to 1, collecting values into the result until we have K.

Time: O(N). Space: O(N). The cleanest O(N) solution.

Final Expected Approach (Heap)

import heapq
from collections import Counter

def top_k_frequent_heap(nums, K):
    cnt = Counter(nums)
    heap = []   # (freq, value), min-heap
    for v, f in cnt.items():
        heapq.heappush(heap, (f, v))
        if len(heap) > K:
            heapq.heappop(heap)
    return [v for _, v in heap]

Final Expected Approach (Bucket Sort)

from collections import Counter

def top_k_frequent_bucket(nums, K):
    cnt = Counter(nums)
    buckets = [[] for _ in range(len(nums) + 1)]
    for v, f in cnt.items():
        buckets[f].append(v)
    out = []
    for f in range(len(nums), 0, -1):
        for v in buckets[f]:
            out.append(v)
            if len(out) == K: return out
    return out

Data Structures Used

Heap approach: Counter (hashmap) + min-heap of size K.
Bucket approach: Counter + a list-of-lists buckets indexed by frequency.

Correctness Argument (Heap)

Invariant: after processing the first i distinct values, the heap contains the top-min(K, i) most frequent among them. Adding the next value either grows the heap (if size < K) or replaces the min if the new freq exceeds it (push then pop-if-over-K does both correctly: push always grows; pop removes the smallest, which is the new value if it’s the smallest, leaving the heap unchanged).

Final step: after processing all D values, heap = top K. ✓

Correctness Argument (Bucket)

Frequencies range in [1, N], so buckets indexed [0..N] capture all. Walking f from high to low and collecting until K values found gives exactly the K most-frequent (with arbitrary tie-breaking within a bucket, acceptable per constraints).

Complexity

Approach	Time	Space
Brute (full sort)	O(N + D log D)	O(D)
Heap	O(N + D log K)	O(N + K)
Bucket	O(N)	O(N)
QuickSelect	O(N) average, O(N²) worst	O(D)

For D ≈ N and K ≪ D, heap and bucket are both linear-class; bucket’s hidden constant is smaller. For streaming, heap is the natural choice.

Implementation Requirements

Use a min-heap, even though we want the largest K — popping the min when overflowing keeps the largest K in the heap.
For Python’s heapq, push tuples (freq, value) — the comparison is lexicographic, so freq dominates.
Java’s PriorityQueue is a min-heap by default. No comparator inversion needed for “top-K largest” via the min-heap-of-size-K trick. Push int[]{freq, value} with Comparator.comparingInt(a -> a[0]).
C++’s priority_queue is a max-heap by default. For min-heap: priority_queue<pair<int,int>, vector<pair<int,int>>, greater<>>.
For bucket sort, size buckets N + 1 (frequencies 1..N, plus index 0 unused).

Tests

Smoke: ([1,1,1,2,2,3], 2) → [1,2] (in any order).
Unit: K = 1 → most frequent only; K = D → all distinct values.
All distinct values: [1,2,3,4,5], K=3 → 3 of these (any 3, since frequencies tie at 1 — but per problem statement, the answer is unique, so this case wouldn’t be a test… unless K = 5, returning all).
All same: [7]*100, K=1 → [7].
Negative values: [-1,-1,1,1,2], K=2 → [-1,1] (tie at freq 2 — problem assumes unique answer; for this test, K = 1 → [-1] or [1]).
Large: N = 10⁵, K = 10, random values; both approaches sub-50ms.
Adversarial: all-distinct → bucket sort touches all of bucket[1]; heap pushes D items.

Follow-up Questions

“Streaming top-K — elements arrive one at a time, query top K at any moment.” → maintain a min-heap of size K plus a hashmap; per element, look up old freq, decrement-and-rebuild. Or use Misra-Gries / Space-Saving sketches for approximate.
“Top K from multiple sorted streams (LC 23 generalized)?” → merge-K-sorted via min-heap.
“K closest points to origin (LC 973)?” → max-heap of size K keyed by distance, push and pop-if-over-K.
“Kth largest element (LC 215)?” → min-heap of size K → root is K-th largest. Or QuickSelect for O(N) average.
“Sliding window top-K?” → monotonic deque (LC 239 for K=1) or balanced BST (Phase 3).
“Memory-constrained: top K of a billion items?” → heap-of-K only — O(K) memory, O(N log K) time.

Product Extension

Recommendation systems, trending topics, top-N queries on dashboards, “most frequent error code in last 5 minutes” log monitors — all are top-K problems. Real-world systems use approximate algorithms (Count-Min Sketch + heap) for cardinality at internet scale, but the exact algorithm is what an interview is testing. The Misra-Gries summary (heavy hitters) generalizes the heap to a streaming, memory-bounded, approximate version.

Language/Runtime Follow-ups

Python: heapq is min-only. To use it as max-heap, push -x (or (-freq, value) and decode at the end). Counter(nums).most_common(K) is a one-liner shortcut — mention it in the interview as the “Pythonic” answer, but be ready to write the algorithm by hand.
Java: PriorityQueue<int[]> with Comparator.comparingInt(a -> a[0]) is the canonical pattern. Boxing tax if you use PriorityQueue<Integer>. Sorting via Arrays.sort(arr, comparator) works on Integer[] but not int[].
Go: container/heap requires implementing 5 methods. Verbose. For Top-K problems, a simple sort sort.Slice(items, less) and slicing top K is often clearer.
C++: priority_queue<int> is a max-heap by default. For min-heap: priority_queue<int, vector<int>, greater<int>>. The top() is O(1), pop() is O(log N) and returns void — you must call top() first.
JS/TS: no built-in heap. Implement (~30 LOC) or use a library. For interview, Array.sort with a comparator and slicing is often acceptable for offline cases; mention you’d need a heap for streaming.
Heap tuple comparison: Python compares tuples lexicographically; if freq ties, comparison falls back to value, which can fail for non-comparable types. Wrap in a class with __lt__ defined on freq only, or pre-encode as (freq, hash(value)).

Common Bugs

Using a max-heap of size D instead of a min-heap of size K — wastes time and space; O(D log D) instead of O(D log K).
Java boxing in PriorityQueue<Integer> — silent ~2-3× slowdown.
Python heapq confusion: forgetting it’s min-only; writing heappush(heap, -x) for max-heap and forgetting to negate when reading. Use a tuple with negated key cleanly: heappush(heap, (-freq, value)).
C++ default direction wrong: priority_queue<int> is max-heap; using it for “min-heap of size K” gives wrong results.
Heap size check after push: if len(heap) > K: pop is correct. If you swap to if len(heap) >= K: pop you’ll never have K items.
Bucket-sort allocation cost: [[] for _ in range(N + 1)] is O(N), but in C++ you can vector<vector<int>> buckets(N + 1); — same idea, just be aware of the per-bucket overhead.
Unstable comparison on ties in heap tuples — for (freq, value), tied freqs compare values. If values are non-comparable (custom objects), this errors. Pre-encode or wrap.

Debugging Strategy

Trace ([1,1,1,2,2,3], 2). Counter: {1:3, 2:2, 3:1}. Heap evolution: push (3,1) → heap [(3,1)], size 1 ≤ 2 ✓; push (2,2) → [(2,2),(3,1)], size 2 ≤ 2 ✓; push (1,3) → [(1,3),(3,1),(2,2)], size 3 > 2 → pop (1,3); heap = [(2,2),(3,1)]. Result: values [2, 1]. ✓
Bucket trace: counter same; buckets [[], [3], [2], [1], [], [], [], ...]. Walk f=6,5,4,3 → take 1; f=2 → take 2; size=2, return [1, 2]. ✓
Cross-check the two approaches on 50 random inputs (compare as sets).

Mastery Criteria

Recognized “Top K” pattern within 30 seconds.
Wrote the min-heap-of-size-K solution within 6 minutes.
Mentioned bucket sort as the O(N) alternative when frequencies are bounded.
Identified the language-specific heap-direction trap (Python min-only, Java min-default, C++ max-default).
Solved LC 215 (Kth Largest), LC 973 (K Closest), LC 23 (Merge K Sorted Lists) within a week — same pattern, different keys.
Discussed QuickSelect as an O(N)-average alternative when asked.

LeetCode Interview — Extreme Coding