Lab 06 — Heap Priority: Kth Largest In A Stream

Goal

Master the binary heap as the canonical streaming top-K device, the min-heap-of-size-K trick, and the cost model for push/pop. The deliverable: an online data structure that, after O(K) initialization, returns the Kth largest element in O(log K) per add.

Background Concepts

Binary heap as an array; sift-up / sift-down; min-heap vs max-heap; heapify is O(N); push/pop are O(log N). Review the Heaps section of the Phase 1 README and runtime concept 1 (stack vs heap memory — note the distinction between the call stack and the heap data structure).

Interview Context

Asked at Amazon, Apple, Bloomberg, and any role touching streaming systems. The signal: do you reach for a heap immediately when “online K-th largest” is mentioned? Do you choose a min-heap of size K (not a sorted list, not a max-heap)? Do you state the O(log K) per add?

Problem Statement

Design a KthLargest(k, nums) class. The constructor receives the integer k and an initial array nums. The method add(val) inserts val into the stream and returns the Kth largest element among all elements seen so far.

Constraints

  • 1 ≤ k ≤ 10^4
  • 0 ≤ |nums| ≤ 10^4
  • -10^4 ≤ val, nums[i] ≤ 10^4
  • At most 10^4 calls to add.
  • Guaranteed: at the time of any add return, there are at least k elements seen.

Clarifying Questions

  1. Is k fixed for the lifetime of the object? (Yes — set once.)
  2. Are duplicates allowed? (Yes — add(5) twice keeps both.)
  3. What if fewer than k elements have been seen? (Per constraints, won’t happen at return time. Confirm.)
  4. Is “Kth largest” 1-indexed? (Yes — K=1 is the maximum.)
  5. Streaming: do we ever remove elements? (No — additions only.)

Examples

KthLargest(3, [4, 5, 8, 2])
add(3) → 4    // sorted desc: 8, 5, 4, 3, 2 → 3rd is 4
add(5) → 5    // 8, 5, 5, 4, 3, 2 → 3rd is 5
add(10) → 5   // 10, 8, 5, 5, 4, 3, 2 → 3rd is 5
add(9) → 8    // 10, 9, 8, 5, 5, 4, 3, 2 → 3rd is 8
add(4) → 8    // 10, 9, 8, 5, 5, 4, 4, 3, 2 → 3rd is 8

Initial Brute Force

Maintain a sorted list. On add, insert in sorted order (O(N)) and read index N-k.

class KthLargestBrute:
    def __init__(self, k, nums):
        self.k = k
        self.arr = sorted(nums)
    def add(self, val):
        # binary search insertion
        import bisect
        bisect.insort(self.arr, val)
        return self.arr[-self.k]

Brute Force Complexity

bisect.insort is O(log N) for the search but O(N) for the actual insertion (array shift). Total O(N) per add. For 10⁴ adds and 10⁴ initial size: 10⁸ ops. Borderline.

Optimization Path

We don’t need to track all elements. Only the top K. A min-heap of size K keeps the K largest seen, with the smallest of them at the top — that’s the Kth largest.

  • On add: push, then if size > K, pop (the smallest, which is no longer in the top K).
  • Return heap[0].

Final Expected Approach

import heapq

class KthLargest:
    def __init__(self, k, nums):
        self.k = k
        self.heap = []
        for x in nums:
            self.add(x)

    def add(self, val):
        heapq.heappush(self.heap, val)
        if len(self.heap) > self.k:
            heapq.heappop(self.heap)
        return self.heap[0]

For the constructor, you can do better: take the first k elements, heapify them (O(k)), then for each remaining element, push if it beats the top, else skip. But the simple version above is acceptable and amortizes the same.

Data Structures Used

A binary min-heap. Underlying storage: a dynamic array. Capacity: K.

Correctness Argument

Invariant: self.heap contains the K largest values seen so far (when ≥ K have been seen), and self.heap[0] is the minimum of those — i.e., the Kth largest overall.

After heappush(val): heap may have K+1 elements; the smallest is at the top. Popping removes it. The remaining K elements are still the K largest (we only removed the smallest of K+1, which by definition is excluded from the top K of K+1). Hence self.heap[0] is the Kth largest of K+1 = the Kth largest overall.

Complexity

  • Constructor: O(N log K) using the per-element approach; O(N) using bottom-up heapify on first K then sift the rest.
  • add: O(log K).
  • Space: O(K).

Implementation Requirements

  • Use the language’s built-in min-heap; don’t roll your own unless asked.
  • Bound the heap size to K explicitly; if you don’t, you’ve built a sorted set, not the optimization.
  • For max-heap-only languages (e.g., Java’s PriorityQueue is min-heap by default — fine here), use the natural orientation. If you need a max-heap, negate or pass a comparator.
  • Don’t allocate fresh on every add.

Tests

  • Smoke: the canonical example above.
  • Unit: K=1 (always returns max); K==N (returns min after each add).
  • Edge: empty nums and a stream that brings size up to K; duplicate values; negative values.
  • Large: 10⁴ adds of random ints with K=100; assert per-call O(log K) by timing.
  • Random: maintain a brute-force sorted-list reference; assert equality of returned value on each call.
  • Invalid: add before reaching K elements (per constraints not happening; if defensive, raise or buffer).

Follow-up Questions

  • “What if the stream supports remove(val)?” → switch to a balanced BST or two heaps with lazy deletion.
  • “Maintain the K smallest.” → max-heap of size K (mirror).
  • “K-th most frequent element in a stream.” → counter + heap with re-inserts on count change.
  • “Top K trending hashtags over a sliding 1-hour window.” → heap + circular buffer + lazy deletion of stale entries.
  • “Implement the min-heap from scratch.” → array-backed, sift-up on push, sift-down on pop, parent at (i-1)//2, children at 2i+1, 2i+2.
  • “Why O(N) heapify rather than N pushes?” → bottom-up sift-down sums to O(N); pushes sum to O(N log N).

Product Extension

A leaderboard service that streams game scores and surfaces the top 100. Memory budget per shard is tight; the min-heap-of-size-K pattern is the standard approach. Combine with sharding (each shard maintains its own top-100; the aggregator maintains a heap of heap-tops). The same pattern powers “top-N alerts,” “p99 latency tracking,” and “trending content” feeds.

Language/Runtime Follow-ups

  • Python: heapq is min-heap only. For max-heap behavior, push -x and negate on pop. heapq.heapify(list) is O(N) in place.
  • Java: PriorityQueue defaults to min-heap. PriorityQueue<Integer> pq = new PriorityQueue<>();. Reversed: new PriorityQueue<>(Comparator.reverseOrder()). pq.poll() and pq.peek() are O(log N) and O(1).
  • Go: must implement the heap.Interface (Len, Less, Swap, Push, Pop). Verbose; stand-alone helpers in the container/heap package.
  • C++: std::priority_queue<int> is a max-heap by default. Use std::priority_queue<int, std::vector<int>, std::greater<int>> for a min-heap.
  • JS/TS: no built-in heap. Must implement or pull a library. This is a not-uncommon interview surprise.
  • Memory model: the data-structure heap lives in the process heap (not the call stack). Sizes up to 10⁴ are trivial.

Common Bugs

  1. Maintaining a max-heap — works for finding max, but you’d need to extract K elements per call. Wrong tool.
  2. Forgetting to bound size to K — heap grows to N; per-add cost becomes O(log N) instead of O(log K) (small impact for small N, but conceptually wrong and uses more memory).
  3. Returning heap[-1] — Python’s heap[-1] is not the largest; only heap[0] is the min. Other indices are unordered.
  4. Off-by-one on K — K=1 should track the maximum; if you accidentally maintain K-1 elements, you’re answering the wrong query.
  5. Java PriorityQueue reversed Comparator typo — using (a, b) -> b - a overflows for large negative ints. Use Integer.compare(b, a).

Debugging Strategy

  • After each add, print the heap. Should be ≤ K elements with the K-th-largest at index 0.
  • Cross-check against sorted(all_seen)[-K].
  • For perf: time 10⁴ adds; should be milliseconds.

Mastery Criteria

  • Selected the min-heap-of-size-K pattern within 30 seconds of hearing “Kth largest streaming.”
  • Stated the loop invariant aloud.
  • Wrote the implementation in under 5 minutes.
  • Identified the K=1 and K=N degenerate cases.
  • Knew the language idiom: heapq Python, PriorityQueue Java, priority_queue<…, greater<…>> C++.
  • Mentioned the (a, b) -> b - a overflow trap in Java.
  • Sketched the O(N) bottom-up heapify alternative for the constructor.