p43 — Find Median from Data Stream
Source: LeetCode 295 · Hard · Topics: Heap, Design, Two Pointers Companies (2024–2025 frequency): Google (very high), Meta (high), Amazon (high), Microsoft (high), Bloomberg (very high), Apple (medium-high) Loop position: Hard algorithm round; common in senior+ loops. The two-heap dance is the canonical answer; competence here is a strong senior signal.
1. Quick Context
Design a class supporting:
addNum(num)— add an integer.findMedian()— return the median of all numbers added so far.
The two-heap pattern:
lo— max-heap of the smaller half (in Python: store negated values).hi— min-heap of the larger half.
Invariants (maintained after every addNum):
|lo|is either equal to|hi|or one more than|hi|(we letlohold the “extra” when total is odd).- Every element in
lois ≤ every element inhi.
Median:
- If
|lo| > |hi|: median is top oflo(single middle element). - Else (equal sizes): median is
(top(lo) + top(hi)) / 2.
Time: O(log n) per addNum, O(1) per findMedian. Space: O(n).
The senior signal: stating the invariants BEFORE coding. The mid signal: getting the code right. The junior signal: insertion-sort (O(n) per add → TLE).
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/find-median-from-data-stream/
STOP. 25-min timer. Write invariants down first.
3. Prerequisite Concepts
heapq(min-heap; negate for max-heap).- The invariant-based reasoning style: state precondition, postcondition, prove preservation.
4. How to Approach (FRAMEWORK Steps 1–9)
Step 1 — Restate: “Maintain a multiset of integers; support O(log n) add and O(1) median query.”
Step 2 — Clarify:
- “Even count median?” (Average of two middles per LC.)
- “Duplicates?” (Allowed.)
- “Integer or float output?” (LC: float;
(a + b) / 2for averages.) - “Range of numbers?” (LC: any int.)
- “How many adds vs queries?” (LC: up to 5 × 10^4 of each.)
- “Streaming or batch?” (Streaming — never see future inputs.)
Step 3 — Constraints: N ≤ 5·10^4 operations → O(log n) per add is comfortable; O(n) per add (insertion sort) is borderline TLE.
Step 4 — Examples:
add(1) → median = 1; add(2) → median = 1.5; add(3) → median = 2.
Step 5 — Brute Force: Maintain a sorted list, bisect.insort (O(n) per add), index by n // 2. Works for small N but unsuitable for true streaming.
Step 6 — Complexity: Two heaps: O(log n) add, O(1) find. Sorted list with bisect: O(n) add (shift cost), O(1) find. SkipList / BBST: O(log n) both.
Step 7 — Pattern: Streaming median → two heaps. Generalization: streaming k-th element → two heaps with sizes k-1 and rest.
Step 8 — Develop: see solution.py.
Step 9 — Test: add then median; alternating add/median; even and odd counts; all duplicates; negatives.
5. Progressive Hints
If stuck for 5 min, open hints.md.
6. Deeper Insight — The two-heap dance
6.1 The add algorithm
def addNum(num):
# Step 1: always push onto lo first (max-heap; negate)
heappush(lo, -num)
# Step 2: move top of lo to hi (ensures lo's max ≤ hi's min)
heappush(hi, -heappop(lo))
# Step 3: rebalance if hi grew larger than lo
if len(hi) > len(lo):
heappush(lo, -heappop(hi))
Why this works:
- Step 1 unconditionally adds to
lo. We don’t yet know which halfnumbelongs to. - Step 2 moves the LARGEST of
lotohi. This guarantees the cross-heap ordering invariant (everylo≤ everyhi), because after Step 2,lo’s new max is ≤hi’s new min (which is the value we just moved). - Step 3 restores the size invariant:
loshould be equal to or one larger thanhi.
Three operations, all O(log n). The “always push to lo, then move to hi, then maybe move back” template is the canonical version because it’s bug-resistant.
6.2 Alternative add: branch-on-comparison
def addNum(num):
if not lo or num <= -lo[0]:
heappush(lo, -num)
else:
heappush(hi, num)
# rebalance
if len(lo) > len(hi) + 1:
heappush(hi, -heappop(lo))
elif len(hi) > len(lo):
heappush(lo, -heappop(hi))
Same big-O. Slightly faster constant (one push + at most one rebalance vs three pushes), but more branching. Either is acceptable.
6.3 The median query
def findMedian():
if len(lo) > len(hi):
return float(-lo[0])
return (-lo[0] + hi[0]) / 2.0
When |lo| == |hi| + 1: single middle is top of lo (since lo is the larger half).
When |lo| == |hi|: two middles, average them.
Edge case: empty stream — return… what? Behavior undefined per LC; in practice, raise or return 0. ASK the interviewer.
6.4 Removal — the unsupported operation
If the problem asks for remove(num) (sliding-window median, LC 480), the two-heap approach needs lazy deletion:
- On
remove(num), increment adeleted_count[num]. - After every operation, “clean” the top of both heaps by popping any value whose
deleted_count > 0.
This is non-trivial. The cleaner alternative is sortedcontainers.SortedList with O(log n) add/remove/index — but sortedcontainers isn’t stdlib in interviews. Mention both.
6.5 Why this beats “maintain a sorted list”
Naïve sorted list with bisect.insort has O(log n) for the binary search but O(n) for the insertion (shift). For 10^5 operations: 10^5 · 10^5 = 10^10 — TLE.
Two heaps: 10^5 · log(10^5) ≈ 10^5 · 17 = 1.7M operations — instant.
6.6 The percentile generalization
For the k-th order statistic in a stream, use two heaps with sizes k - 1 (max-heap) and n - k + 1 (min-heap). Same dance; constant rebalance to maintain those sizes. Median is k = (n+1)/2; for arbitrary k, the size invariant becomes the variable.
Production: P50/P95/P99 percentile dashboards use this pattern (or approximate variants like t-digest / HDR Histogram for bounded memory).
7. Anti-Pattern Analysis
- Insertion-sort list (
bisect.insort) — O(n) per add. TLE. - Sort on every query — O(n log n) per query. TLE.
- Single heap — can’t extract the median without popping half the heap.
- Forget to negate
lofor max-heap. Median is wrong every time. - Pushing
numdirectly without the lo-then-move dance — easy to violate cross-heap invariant. - Returning
intinstead offloat— LC wants float;(a + b) / 2gives float in Python 3 anyway, but(-lo[0])for odd count isint; returnfloat(...)to be safe. - Rebalance check
len(lo) - len(hi) > 1— works for the lo-then-move dance, but with branch-on-comparison you also need the reverse checklen(hi) > len(lo). - Calling
heap[0]on empty heap — IndexError. Guard empty stream. - Storing
(num, idx)tuples to support removal — premature optimization unless asked. - Confusing “median” with “mean” — different problems.
8. Skills & Takeaways
- Two-heap dance with explicit size and ordering invariants.
- Max-heap via negation in Python.
- Invariant-driven design — state preconditions/postconditions before coding.
- Lazy deletion for sliding-window variant (LC 480).
- Percentile generalization to arbitrary order statistic.
sortedcontainers.SortedListas a non-stdlib alternative (BBST).
9. When to Move On
Move on when:
- You can write
addNum+findMedianin ≤ 10 min, bug-free. - You can state the two invariants without notes.
- You can sketch the sliding-window variant (LC 480) with lazy deletion.
- You can explain why insertion-sort is O(n) and unacceptable.
- Stress test on 5000 random operations agrees with a sorted-list oracle.
10. Company Context
Google:
- One of the most-asked Hard problems. L4/L5 expected to solve cleanly; L6+ asked to extend to percentiles or sliding window.
- Bar Raiser scorecard: “Design — articulated invariants before coding; clean implementation; discussed scaling to streaming systems.”
Meta:
- L5+ algorithm round. Two-heap dance must be on-the-fly correct; can’t fumble.
Amazon:
- L5/L6 hard staple. Often paired with sliding-window follow-up.
Microsoft:
- L63/L64. Often phrased as “find running median of stock prices.”
Bloomberg:
- Loves percentile-tracking problems. Often pairs with “now do P95” for senior signal (heap-size generalization).
Apple:
- Performance monitoring teams ask running-percentile variants.
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff/Principal |
|---|---|---|---|---|
| First instinct | “Sort each query” | “Maintain sorted list” | “Two heaps” | + states invariants first |
| Heap orientation | Confused about max-heap | Negates correctly | + explains “lo holds the extra” convention | + percentile generalization |
| Rebalance | Bug-prone | Works | “Push to lo, move max to hi, rebalance back” | + branch-on-comparison alternative discussed |
| Sliding window | “Re-build heaps” | “Stuck” | Lazy deletion sketch | + SortedList alternative + production t-digest |
| Empty stream | Crashes | Guards | “Undefined per spec; ASK” | + production: median of empty stream returns NaN, log warning |
12. Level Delta
Mid (L4):
- Two-heap implementation with explicit negation.
- Handles odd/even cases.
- Acknowledges insertion-sort is too slow.
Senior (L5):
- States invariants before coding.
- Articulates lo-then-move-to-hi-then-rebalance reasoning.
- Mentions sliding-window with lazy deletion.
Staff (L6):
- Generalizes to arbitrary percentile.
- Discusses
SortedList/ BBST alternative. - Discusses production: P95/P99 dashboards, monitoring infra.
Principal (L7+):
- Discusses approximate streaming percentiles: t-digest, HDR Histogram, count-min sketch + quantile estimation.
- Discusses bounded memory with O(1)-space approximations (GK Sketch — Greenwald-Khanna).
- Discusses distributed percentile aggregation: per-shard sketches, merge mathematically.
- Discusses trade-offs: exact (this problem) vs approximate (production); when to switch.
13. Follow-up Q&A
Q1: All numbers are in [0, 100]. Can we do O(1)?
A: Yes — count-array of size 101. addNum: O(1) increment. findMedian: scan to find the median position. O(101) = O(1).
Q2: 99% of numbers are in [0, 100]. Hybrid?
A: Count-array for the dense range, two-heaps for outliers. Maintain low_count, high_count, mid_array[101]. findMedian first scans mid_array adjusted by the heaps. Complex; mention only when prompted.
Q3: Sliding window (LC 480 — median of every window of size W)?
A: Two heaps + lazy deletion: deleted_count[num]; clean tops after each operation. Or: SortedList from sortedcontainers (cleanest).
Q4: P95 / arbitrary percentile in a stream?
A: Two heaps with size invariant |lo| = ⌈p · n⌉, |hi| = n - |lo|. Recompute target sizes after each add; rebalance accordingly. Or: t-digest / HDR Histogram for approximate.
Q5: Median over multiple shards (distributed)? A: Cannot trivially merge sorted halves — median is a holistic statistic. Approximate: each shard maintains a t-digest; merge t-digests is mathematically defined; exact median is intractable without full data shuffle.
Q6: Memory grows unbounded; how to cap? A: Approximate algorithms — t-digest with bounded number of centroids gives O(K) memory with provable error bounds (typically ~1% for K=100 centroids).
14. Solution Walkthrough
See solution.py:
MedianFinder— two-heap implementation; the canonical answer.MedianFinderBranching— alternativeaddNumwith branch-on-comparison.MedianFinderSorted— sorted-list baseline (oracle).edge_cases()+stress_test()— 5000 random add/findMedian sequences; all three implementations agree.
Run: python3 solution.py.
15. Beyond the Problem
Principal-engineer code review comment:
“Three things. (1) The two invariants (size and cross-heap ordering) should be in the docstring as preconditions/postconditions for every public method — this is the kind of code that gets ‘optimized’ into bugs without an explicit invariant statement. (2) For production percentile tracking, this exact algorithm is the WRONG choice — memory grows linearly forever. Use t-digest or HDR Histogram; cite the references. The interview problem hides the scaling cliff. (3) The ‘find median’ operation is O(1), which is great for a UI dashboard that polls every second; but if the median is recomputed millions of times per second (alerting), even O(1) heap-top access has cache effects — consider caching the most recent median and invalidating on
addNum. Premature optimization, but worth a comment for the curious reader.”
Connections forward:
- p41 — Kth Largest (heap-of-size-K).
- p44 — Merge K Sorted Lists (heap of heads).
- LC 480 — Sliding Window Median (lazy-deletion variant).
- LC 4 — Median of Two Sorted Arrays (different technique: partition).
- LC 703 — Kth Largest in a Stream (heap streaming sibling).
- Phase 02 Lab 09 — Heap Top-K.
- Production: percentile dashboards (P50/P95/P99), monitoring infrastructure, t-digest, HDR Histogram, GK Sketch.