Lab 06 — Heap Priority: Kth Largest In A Stream
Goal
Master the binary heap as the canonical streaming top-K device, the min-heap-of-size-K trick, and the cost model for push/pop. The deliverable: an online data structure that, after O(K) initialization, returns the Kth largest element in O(log K) per add.
Background Concepts
Binary heap as an array; sift-up / sift-down; min-heap vs max-heap; heapify is O(N); push/pop are O(log N). Review the Heaps section of the Phase 1 README and runtime concept 1 (stack vs heap memory — note the distinction between the call stack and the heap data structure).
Interview Context
Asked at Amazon, Apple, Bloomberg, and any role touching streaming systems. The signal: do you reach for a heap immediately when “online K-th largest” is mentioned? Do you choose a min-heap of size K (not a sorted list, not a max-heap)? Do you state the O(log K) per add?
Problem Statement
Design a KthLargest(k, nums) class. The constructor receives the integer k and an initial array nums. The method add(val) inserts val into the stream and returns the Kth largest element among all elements seen so far.
Constraints
1 ≤ k ≤ 10^40 ≤ |nums| ≤ 10^4-10^4 ≤ val, nums[i] ≤ 10^4- At most
10^4calls toadd. - Guaranteed: at the time of any
addreturn, there are at leastkelements seen.
Clarifying Questions
- Is
kfixed for the lifetime of the object? (Yes — set once.) - Are duplicates allowed? (Yes —
add(5)twice keeps both.) - What if fewer than
kelements have been seen? (Per constraints, won’t happen at return time. Confirm.) - Is “Kth largest” 1-indexed? (Yes —
K=1is the maximum.) - Streaming: do we ever remove elements? (No — additions only.)
Examples
KthLargest(3, [4, 5, 8, 2])
add(3) → 4 // sorted desc: 8, 5, 4, 3, 2 → 3rd is 4
add(5) → 5 // 8, 5, 5, 4, 3, 2 → 3rd is 5
add(10) → 5 // 10, 8, 5, 5, 4, 3, 2 → 3rd is 5
add(9) → 8 // 10, 9, 8, 5, 5, 4, 3, 2 → 3rd is 8
add(4) → 8 // 10, 9, 8, 5, 5, 4, 4, 3, 2 → 3rd is 8
Initial Brute Force
Maintain a sorted list. On add, insert in sorted order (O(N)) and read index N-k.
class KthLargestBrute:
def __init__(self, k, nums):
self.k = k
self.arr = sorted(nums)
def add(self, val):
# binary search insertion
import bisect
bisect.insort(self.arr, val)
return self.arr[-self.k]
Brute Force Complexity
bisect.insort is O(log N) for the search but O(N) for the actual insertion (array shift). Total O(N) per add. For 10⁴ adds and 10⁴ initial size: 10⁸ ops. Borderline.
Optimization Path
We don’t need to track all elements. Only the top K. A min-heap of size K keeps the K largest seen, with the smallest of them at the top — that’s the Kth largest.
- On
add: push, then if size > K, pop (the smallest, which is no longer in the top K). - Return
heap[0].
Final Expected Approach
import heapq
class KthLargest:
def __init__(self, k, nums):
self.k = k
self.heap = []
for x in nums:
self.add(x)
def add(self, val):
heapq.heappush(self.heap, val)
if len(self.heap) > self.k:
heapq.heappop(self.heap)
return self.heap[0]
For the constructor, you can do better: take the first k elements, heapify them (O(k)), then for each remaining element, push if it beats the top, else skip. But the simple version above is acceptable and amortizes the same.
Data Structures Used
A binary min-heap. Underlying storage: a dynamic array. Capacity: K.
Correctness Argument
Invariant: self.heap contains the K largest values seen so far (when ≥ K have been seen), and self.heap[0] is the minimum of those — i.e., the Kth largest overall.
After heappush(val): heap may have K+1 elements; the smallest is at the top. Popping removes it. The remaining K elements are still the K largest (we only removed the smallest of K+1, which by definition is excluded from the top K of K+1). Hence self.heap[0] is the Kth largest of K+1 = the Kth largest overall.
Complexity
- Constructor: O(N log K) using the per-element approach; O(N) using bottom-up heapify on first K then sift the rest.
add: O(log K).- Space: O(K).
Implementation Requirements
- Use the language’s built-in min-heap; don’t roll your own unless asked.
- Bound the heap size to K explicitly; if you don’t, you’ve built a sorted set, not the optimization.
- For max-heap-only languages (e.g., Java’s
PriorityQueueis min-heap by default — fine here), use the natural orientation. If you need a max-heap, negate or pass a comparator. - Don’t allocate fresh on every add.
Tests
- Smoke: the canonical example above.
- Unit: K=1 (always returns max); K==N (returns min after each add).
- Edge: empty
numsand a stream that brings size up to K; duplicate values; negative values. - Large: 10⁴ adds of random ints with K=100; assert per-call O(log K) by timing.
- Random: maintain a brute-force sorted-list reference; assert equality of returned value on each call.
- Invalid:
addbefore reaching K elements (per constraints not happening; if defensive, raise or buffer).
Follow-up Questions
- “What if the stream supports
remove(val)?” → switch to a balanced BST or two heaps with lazy deletion. - “Maintain the K smallest.” → max-heap of size K (mirror).
- “K-th most frequent element in a stream.” → counter + heap with re-inserts on count change.
- “Top K trending hashtags over a sliding 1-hour window.” → heap + circular buffer + lazy deletion of stale entries.
- “Implement the min-heap from scratch.” → array-backed, sift-up on push, sift-down on pop, parent at
(i-1)//2, children at2i+1,2i+2. - “Why O(N) heapify rather than N pushes?” → bottom-up sift-down sums to O(N); pushes sum to O(N log N).
Product Extension
A leaderboard service that streams game scores and surfaces the top 100. Memory budget per shard is tight; the min-heap-of-size-K pattern is the standard approach. Combine with sharding (each shard maintains its own top-100; the aggregator maintains a heap of heap-tops). The same pattern powers “top-N alerts,” “p99 latency tracking,” and “trending content” feeds.
Language/Runtime Follow-ups
- Python:
heapqis min-heap only. For max-heap behavior, push-xand negate on pop.heapq.heapify(list)is O(N) in place. - Java:
PriorityQueuedefaults to min-heap.PriorityQueue<Integer> pq = new PriorityQueue<>();. Reversed:new PriorityQueue<>(Comparator.reverseOrder()).pq.poll()andpq.peek()are O(log N) and O(1). - Go: must implement the
heap.Interface(Len,Less,Swap,Push,Pop). Verbose; stand-alone helpers in thecontainer/heappackage. - C++:
std::priority_queue<int>is a max-heap by default. Usestd::priority_queue<int, std::vector<int>, std::greater<int>>for a min-heap. - JS/TS: no built-in heap. Must implement or pull a library. This is a not-uncommon interview surprise.
- Memory model: the data-structure heap lives in the process heap (not the call stack). Sizes up to 10⁴ are trivial.
Common Bugs
- Maintaining a max-heap — works for finding max, but you’d need to extract K elements per call. Wrong tool.
- Forgetting to bound size to K — heap grows to N; per-add cost becomes O(log N) instead of O(log K) (small impact for small N, but conceptually wrong and uses more memory).
- Returning
heap[-1]— Python’sheap[-1]is not the largest; onlyheap[0]is the min. Other indices are unordered. - Off-by-one on K — K=1 should track the maximum; if you accidentally maintain K-1 elements, you’re answering the wrong query.
- Java
PriorityQueuereversed Comparator typo — using(a, b) -> b - aoverflows for large negative ints. UseInteger.compare(b, a).
Debugging Strategy
- After each
add, print the heap. Should be ≤ K elements with the K-th-largest at index 0. - Cross-check against
sorted(all_seen)[-K]. - For perf: time 10⁴ adds; should be milliseconds.
Mastery Criteria
- Selected the min-heap-of-size-K pattern within 30 seconds of hearing “Kth largest streaming.”
- Stated the loop invariant aloud.
- Wrote the implementation in under 5 minutes.
- Identified the K=1 and K=N degenerate cases.
- Knew the language idiom:
heapqPython,PriorityQueueJava,priority_queue<…, greater<…>>C++. - Mentioned the
(a, b) -> b - aoverflow trap in Java. - Sketched the
O(N)bottom-up heapify alternative for the constructor.