p43 — Find Median from Data Stream

Source: LeetCode 295 · Hard · Topics: Heap, Design, Two Pointers Companies (2024–2025 frequency): Google (very high), Meta (high), Amazon (high), Microsoft (high), Bloomberg (very high), Apple (medium-high) Loop position: Hard algorithm round; common in senior+ loops. The two-heap dance is the canonical answer; competence here is a strong senior signal.

1. Quick Context

Design a class supporting:

  • addNum(num) — add an integer.
  • findMedian() — return the median of all numbers added so far.

The two-heap pattern:

  • lo — max-heap of the smaller half (in Python: store negated values).
  • hi — min-heap of the larger half.

Invariants (maintained after every addNum):

  1. |lo| is either equal to |hi| or one more than |hi| (we let lo hold the “extra” when total is odd).
  2. Every element in lo is ≤ every element in hi.

Median:

  • If |lo| > |hi|: median is top of lo (single middle element).
  • Else (equal sizes): median is (top(lo) + top(hi)) / 2.

Time: O(log n) per addNum, O(1) per findMedian. Space: O(n).

The senior signal: stating the invariants BEFORE coding. The mid signal: getting the code right. The junior signal: insertion-sort (O(n) per add → TLE).


🔗 https://leetcode.com/problems/find-median-from-data-stream/

STOP. 25-min timer. Write invariants down first.


3. Prerequisite Concepts

  • heapq (min-heap; negate for max-heap).
  • The invariant-based reasoning style: state precondition, postcondition, prove preservation.

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Maintain a multiset of integers; support O(log n) add and O(1) median query.”

Step 2 — Clarify:

  • “Even count median?” (Average of two middles per LC.)
  • “Duplicates?” (Allowed.)
  • “Integer or float output?” (LC: float; (a + b) / 2 for averages.)
  • “Range of numbers?” (LC: any int.)
  • “How many adds vs queries?” (LC: up to 5 × 10^4 of each.)
  • “Streaming or batch?” (Streaming — never see future inputs.)

Step 3 — Constraints: N ≤ 5·10^4 operations → O(log n) per add is comfortable; O(n) per add (insertion sort) is borderline TLE.

Step 4 — Examples:

  • add(1) → median = 1; add(2) → median = 1.5; add(3) → median = 2.

Step 5 — Brute Force: Maintain a sorted list, bisect.insort (O(n) per add), index by n // 2. Works for small N but unsuitable for true streaming.

Step 6 — Complexity: Two heaps: O(log n) add, O(1) find. Sorted list with bisect: O(n) add (shift cost), O(1) find. SkipList / BBST: O(log n) both.

Step 7 — Pattern: Streaming median → two heaps. Generalization: streaming k-th element → two heaps with sizes k-1 and rest.

Step 8 — Develop: see solution.py.

Step 9 — Test: add then median; alternating add/median; even and odd counts; all duplicates; negatives.


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — The two-heap dance

6.1 The add algorithm

def addNum(num):
    # Step 1: always push onto lo first (max-heap; negate)
    heappush(lo, -num)
    # Step 2: move top of lo to hi (ensures lo's max ≤ hi's min)
    heappush(hi, -heappop(lo))
    # Step 3: rebalance if hi grew larger than lo
    if len(hi) > len(lo):
        heappush(lo, -heappop(hi))

Why this works:

  • Step 1 unconditionally adds to lo. We don’t yet know which half num belongs to.
  • Step 2 moves the LARGEST of lo to hi. This guarantees the cross-heap ordering invariant (every lo ≤ every hi), because after Step 2, lo’s new max is ≤ hi’s new min (which is the value we just moved).
  • Step 3 restores the size invariant: lo should be equal to or one larger than hi.

Three operations, all O(log n). The “always push to lo, then move to hi, then maybe move back” template is the canonical version because it’s bug-resistant.

6.2 Alternative add: branch-on-comparison

def addNum(num):
    if not lo or num <= -lo[0]:
        heappush(lo, -num)
    else:
        heappush(hi, num)
    # rebalance
    if len(lo) > len(hi) + 1:
        heappush(hi, -heappop(lo))
    elif len(hi) > len(lo):
        heappush(lo, -heappop(hi))

Same big-O. Slightly faster constant (one push + at most one rebalance vs three pushes), but more branching. Either is acceptable.

6.3 The median query

def findMedian():
    if len(lo) > len(hi):
        return float(-lo[0])
    return (-lo[0] + hi[0]) / 2.0

When |lo| == |hi| + 1: single middle is top of lo (since lo is the larger half). When |lo| == |hi|: two middles, average them.

Edge case: empty stream — return… what? Behavior undefined per LC; in practice, raise or return 0. ASK the interviewer.

6.4 Removal — the unsupported operation

If the problem asks for remove(num) (sliding-window median, LC 480), the two-heap approach needs lazy deletion:

  • On remove(num), increment a deleted_count[num].
  • After every operation, “clean” the top of both heaps by popping any value whose deleted_count > 0.

This is non-trivial. The cleaner alternative is sortedcontainers.SortedList with O(log n) add/remove/index — but sortedcontainers isn’t stdlib in interviews. Mention both.

6.5 Why this beats “maintain a sorted list”

Naïve sorted list with bisect.insort has O(log n) for the binary search but O(n) for the insertion (shift). For 10^5 operations: 10^5 · 10^5 = 10^10 — TLE.

Two heaps: 10^5 · log(10^5) ≈ 10^5 · 17 = 1.7M operations — instant.

6.6 The percentile generalization

For the k-th order statistic in a stream, use two heaps with sizes k - 1 (max-heap) and n - k + 1 (min-heap). Same dance; constant rebalance to maintain those sizes. Median is k = (n+1)/2; for arbitrary k, the size invariant becomes the variable.

Production: P50/P95/P99 percentile dashboards use this pattern (or approximate variants like t-digest / HDR Histogram for bounded memory).


7. Anti-Pattern Analysis

  1. Insertion-sort list (bisect.insort) — O(n) per add. TLE.
  2. Sort on every query — O(n log n) per query. TLE.
  3. Single heap — can’t extract the median without popping half the heap.
  4. Forget to negate lo for max-heap. Median is wrong every time.
  5. Pushing num directly without the lo-then-move dance — easy to violate cross-heap invariant.
  6. Returning int instead of float — LC wants float; (a + b) / 2 gives float in Python 3 anyway, but (-lo[0]) for odd count is int; return float(...) to be safe.
  7. Rebalance check len(lo) - len(hi) > 1 — works for the lo-then-move dance, but with branch-on-comparison you also need the reverse check len(hi) > len(lo).
  8. Calling heap[0] on empty heap — IndexError. Guard empty stream.
  9. Storing (num, idx) tuples to support removal — premature optimization unless asked.
  10. Confusing “median” with “mean” — different problems.

8. Skills & Takeaways

  • Two-heap dance with explicit size and ordering invariants.
  • Max-heap via negation in Python.
  • Invariant-driven design — state preconditions/postconditions before coding.
  • Lazy deletion for sliding-window variant (LC 480).
  • Percentile generalization to arbitrary order statistic.
  • sortedcontainers.SortedList as a non-stdlib alternative (BBST).

9. When to Move On

Move on when:

  • You can write addNum + findMedian in ≤ 10 min, bug-free.
  • You can state the two invariants without notes.
  • You can sketch the sliding-window variant (LC 480) with lazy deletion.
  • You can explain why insertion-sort is O(n) and unacceptable.
  • Stress test on 5000 random operations agrees with a sorted-list oracle.

10. Company Context

Google:

  • One of the most-asked Hard problems. L4/L5 expected to solve cleanly; L6+ asked to extend to percentiles or sliding window.
  • Bar Raiser scorecard: “Design — articulated invariants before coding; clean implementation; discussed scaling to streaming systems.”

Meta:

  • L5+ algorithm round. Two-heap dance must be on-the-fly correct; can’t fumble.

Amazon:

  • L5/L6 hard staple. Often paired with sliding-window follow-up.

Microsoft:

  • L63/L64. Often phrased as “find running median of stock prices.”

Bloomberg:

  • Loves percentile-tracking problems. Often pairs with “now do P95” for senior signal (heap-size generalization).

Apple:

  • Performance monitoring teams ask running-percentile variants.

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
First instinct“Sort each query”“Maintain sorted list”“Two heaps”+ states invariants first
Heap orientationConfused about max-heapNegates correctly+ explains “lo holds the extra” convention+ percentile generalization
RebalanceBug-proneWorks“Push to lo, move max to hi, rebalance back”+ branch-on-comparison alternative discussed
Sliding window“Re-build heaps”“Stuck”Lazy deletion sketch+ SortedList alternative + production t-digest
Empty streamCrashesGuards“Undefined per spec; ASK”+ production: median of empty stream returns NaN, log warning

12. Level Delta

Mid (L4):

  • Two-heap implementation with explicit negation.
  • Handles odd/even cases.
  • Acknowledges insertion-sort is too slow.

Senior (L5):

  • States invariants before coding.
  • Articulates lo-then-move-to-hi-then-rebalance reasoning.
  • Mentions sliding-window with lazy deletion.

Staff (L6):

  • Generalizes to arbitrary percentile.
  • Discusses SortedList / BBST alternative.
  • Discusses production: P95/P99 dashboards, monitoring infra.

Principal (L7+):

  • Discusses approximate streaming percentiles: t-digest, HDR Histogram, count-min sketch + quantile estimation.
  • Discusses bounded memory with O(1)-space approximations (GK Sketch — Greenwald-Khanna).
  • Discusses distributed percentile aggregation: per-shard sketches, merge mathematically.
  • Discusses trade-offs: exact (this problem) vs approximate (production); when to switch.

13. Follow-up Q&A

Q1: All numbers are in [0, 100]. Can we do O(1)? A: Yes — count-array of size 101. addNum: O(1) increment. findMedian: scan to find the median position. O(101) = O(1).

Q2: 99% of numbers are in [0, 100]. Hybrid? A: Count-array for the dense range, two-heaps for outliers. Maintain low_count, high_count, mid_array[101]. findMedian first scans mid_array adjusted by the heaps. Complex; mention only when prompted.

Q3: Sliding window (LC 480 — median of every window of size W)? A: Two heaps + lazy deletion: deleted_count[num]; clean tops after each operation. Or: SortedList from sortedcontainers (cleanest).

Q4: P95 / arbitrary percentile in a stream? A: Two heaps with size invariant |lo| = ⌈p · n⌉, |hi| = n - |lo|. Recompute target sizes after each add; rebalance accordingly. Or: t-digest / HDR Histogram for approximate.

Q5: Median over multiple shards (distributed)? A: Cannot trivially merge sorted halves — median is a holistic statistic. Approximate: each shard maintains a t-digest; merge t-digests is mathematically defined; exact median is intractable without full data shuffle.

Q6: Memory grows unbounded; how to cap? A: Approximate algorithms — t-digest with bounded number of centroids gives O(K) memory with provable error bounds (typically ~1% for K=100 centroids).


14. Solution Walkthrough

See solution.py:

  • MedianFinder — two-heap implementation; the canonical answer.
  • MedianFinderBranching — alternative addNum with branch-on-comparison.
  • MedianFinderSorted — sorted-list baseline (oracle).
  • edge_cases() + stress_test() — 5000 random add/findMedian sequences; all three implementations agree.

Run: python3 solution.py.


15. Beyond the Problem

Principal-engineer code review comment:

“Three things. (1) The two invariants (size and cross-heap ordering) should be in the docstring as preconditions/postconditions for every public method — this is the kind of code that gets ‘optimized’ into bugs without an explicit invariant statement. (2) For production percentile tracking, this exact algorithm is the WRONG choice — memory grows linearly forever. Use t-digest or HDR Histogram; cite the references. The interview problem hides the scaling cliff. (3) The ‘find median’ operation is O(1), which is great for a UI dashboard that polls every second; but if the median is recomputed millions of times per second (alerting), even O(1) heap-top access has cache effects — consider caching the most recent median and invalidating on addNum. Premature optimization, but worth a comment for the curious reader.”

Connections forward:

  • p41 — Kth Largest (heap-of-size-K).
  • p44 — Merge K Sorted Lists (heap of heads).
  • LC 480 — Sliding Window Median (lazy-deletion variant).
  • LC 4 — Median of Two Sorted Arrays (different technique: partition).
  • LC 703 — Kth Largest in a Stream (heap streaming sibling).
  • Phase 02 Lab 09 — Heap Top-K.
  • Production: percentile dashboards (P50/P95/P99), monitoring infrastructure, t-digest, HDR Histogram, GK Sketch.