p35 — Median of Two Sorted Arrays

Source: LeetCode 4 · Hard · Topics: Array, Binary Search, Divide & Conquer Companies (2024–2025 frequency): Google (very high), Meta (high), Apple (high), Amazon (medium), Microsoft (medium) Loop position: algorithm round, hardest slot. This is the canonical “binary-search-on-partition” Hard. A successful O(log(min(m,n))) solution is a strong staff+ signal.

1. Quick Context

Given two sorted arrays nums1 (length m) and nums2 (length n), return the median of the combined m + n elements. The required complexity is O(log(min(m, n))).

The straightforward O(m + n) merge gives the answer easily — but the problem demands log time, which means: binary search on the partition of the smaller array.

The setup:

  • Total length T = m + n. The median splits the combined sorted array into two equal halves (or, when T is odd, the left half has one more element).
  • Choose a partition i of nums1 (0..m) and the matching partition j = (T+1)//2 - i of nums2 (so the combined left half has (T+1)//2 elements).
  • The partition is correct iff maxLeft(A) <= minRight(B) AND maxLeft(B) <= minRight(A).
  • Binary search on i over [0, m]. Each step adjusts based on which of the two cross-inequalities fails.

The discriminator: writing a correct binary-search-on-partition for the first time in an interview is extremely hard if you haven’t done it before. Most candidates get O(m+n) merge and a Bar Raiser response of “good — now can you do log?” If you fold under that pressure, you’re capped at mid-level. If you produce the O(log) solution with correct partition handling, you’re signaling staff+.


🔗 https://leetcode.com/problems/median-of-two-sorted-arrays/

STOP. 35-min timer for the merge approach AND the partition approach. This problem deserves the full time.


3. Prerequisite Concepts

  • Convergent-bounds binary search (see p34).
  • Two-pointer merge (LC 88 — Merge Sorted Array).
  • The concept of a “partition” of a sorted array — index i says “the first i elements go to the left side.”
  • Sentinel values (-inf, +inf) for edge partitions.

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Two sorted arrays. Find the median of their union. O(log(min(m, n))).”

Step 2 — Clarify:

  • “Are the arrays each sorted ascending?” (Yes.)
  • “Can one or both be empty?” (At least one is non-empty. Per LC.)
  • “Definition of median for even total: average of two middle elements?” (Yes.)
  • “Return float?” (Yes.)
  • “Can values be negative?” (Yes.)
  • “Required complexity?” (O(log(min(m, n))).)

Step 3 — Constraints: m, n ≤ 1000. O(m + n) is fast enough, but the interview requires O(log). The discriminator is whether you produce log.

Step 4 — Examples:

  • [1,3], [2] → 2.0 (combined [1,2,3], median 2).
  • [1,2], [3,4] → 2.5 (combined [1,2,3,4], median (2+3)/2).
  • [], [1] → 1.0.
  • [2], [] → 2.0.
  • [1,2,3,4,5], [6,7,8,9,10] → 5.5 (no overlap).
  • [1,1,1], [1,1,1] → 1.0.

Step 5 — Brute Force:

  1. O((m+n) log(m+n)): concatenate and sort. Trivial.
  2. O(m + n): merge with two pointers, stop at the median index. State and pivot — but also show that you CAN do brute first to set up the contrast.

Step 6 — Complexity: O(log(min(m, n))) time, O(1) space. We binary-search the partition of the SMALLER array — log of its length.

Step 7 — Pattern Recognition:

  • “Sorted data + need sub-linear” → binary search.
  • “Combine two sorted things and answer a ‘split’ query” → binary search on partition.
  • The partition invariant maxLeft(A) <= minRight(B) AND maxLeft(B) <= minRight(A) is the heart of the technique.

Step 8 — Develop: see solution.py. Implement BOTH the merge oracle and the partition binary search.

Step 9 — Test: all the edge cases above, plus interleaved cases, plus stress test against the merge oracle.


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — Binary search on partition

6.1 The partition framing

Let A = nums1 (length m), B = nums2 (length n), with m <= n (swap if not). The total length is T = m + n.

A “partition” is a pair (i, j) with 0 <= i <= m and 0 <= j <= n such that:

  • The left side A[0..i-1] + B[0..j-1] has exactly (T + 1) // 2 elements.
  • The right side A[i..] + B[j..] has the other T // 2 elements.

So j = (T + 1) // 2 - i — once i is fixed, j is determined.

6.2 The correctness invariant

A partition (i, j) is “correct” (median-revealing) iff: $$\text{maxLeft}(A) \leq \text{minRight}(B) \quad \text{AND} \quad \text{maxLeft}(B) \leq \text{minRight}(A)$$ where maxLeft(A) = A[i-1] (or -inf if i == 0), minRight(A) = A[i] (or +inf if i == m), and similarly for B.

If both inequalities hold, then EVERYTHING on the left side is <= EVERYTHING on the right side. The median is determined by the boundary:

  • If T is odd, median = max(maxLeft(A), maxLeft(B)).
  • If T is even, median = (max(maxLeft(A), maxLeft(B)) + min(minRight(A), minRight(B))) / 2.

6.3 Why binary search on i converges

If maxLeft(A) > minRight(B) — A’s left is too big → too many A elements on the left → decrease i (search left half of [0, m]).

Otherwise (maxLeft(B) > minRight(A)) — B’s left is too big (we have too FEW A elements on the left) → increase i (search right half of [0, m]).

The predicate “is i too large?” is monotonic — that’s why binary search works. Convergent template, same as p34.

6.4 Why we binary search the SMALLER array

So the search range is [0, m] with m = min(m, n). This gives O(log(min(m, n))). If we accidentally searched the larger one, we’d get O(log(max(m, n))) — still log, but the spec is stricter.

There’s another reason: j = (T + 1) // 2 - i. If i is in [0, m] and m <= n, then j is automatically in [0, n] (provable from (T+1)//2 >= m). If we swapped, we’d have to clamp j.

6.5 Sentinels

The edge cases (i == 0, i == m, j == 0, j == n) make maxLeft or minRight undefined. The clean fix: use float('-inf') and float('inf') as sentinels. This eliminates a tangle of conditional branches.

6.6 Why this is staff-level

The bar isn’t “do you know binary search?” — it’s “do you recognize when to binary-search a derived parameter (the partition index) instead of the data itself?” This is the same intellectual leap as binary-search-on-answer (LC 410, LC 875). Once you see it, you see it everywhere: it’s the spine of LC 4, LC 162, LC 410, LC 875, LC 1011, LC 1482, and many more.


7. Anti-Pattern Analysis

  1. Concatenate-and-sort. O((m+n) log(m+n)). Works but signals you didn’t see the structure.
  2. O(m+n) merge as final answer. Works but capped at mid-level. The interviewer will push for log.
  3. Binary-searching the LARGER array. Gets the wrong complexity bound; j can go out of range.
  4. Forgetting sentinels. A pile of if i == 0/else branches that nobody (including you) can debug live.
  5. Wrong partition formula. Common bug: j = T // 2 - i (drops one element when T is odd). The correct formula is j = (T + 1) // 2 - i, which puts the extra element on the left for odd T (so the median is max(maxLeft(A), maxLeft(B))).
  6. (lo + hi) // 2 integer overflow in C++/Java. (Python is fine.) Use lo + (hi - lo) // 2.
  7. Returning int for even-total medians. The answer is a float (e.g., 2.5). Use float division /, not //.
  8. Reaching for the “k-th smallest of two sorted arrays” approach. Also valid (and a great problem), but more code and trickier edge cases than the partition approach.
  9. Manual case-by-case for tiny m (≤ 2). Looks clever but adds complexity. The general partition handles tiny m correctly.

8. Skills & Takeaways

  • Binary search on derived parameter (partition index) — a pattern, not a trick.
  • Partition invariant: maxLeft(A) <= minRight(B) AND maxLeft(B) <= minRight(A). Memorize.
  • Sentinels (±inf) as edge-case eliminator.
  • Always search the smaller array.
  • Parity-aware median formula: odd → max of left; even → average of max-left and min-right.
  • The conceptual link to LC 410, LC 875 — binary search on answer.

9. When to Move On

Move on when:

  • You can write the merge oracle in under 5 min.
  • You can write the partition binary search in under 25 min (this Hard is genuinely Hard).
  • You can articulate the partition invariant from memory.
  • Your stress test passes 200 random pairs vs the merge oracle.
  • You can derive the “k-th smallest of two sorted arrays” extension on demand.

10. Company Context

Google:

  • One of THE most-asked Hard problems at Google for L5+ in 2024-25. Bar Raisers expect either the partition solution or a strong attempt with correct framing.
  • The “did the candidate see the binary-search-on-partition pattern” is the central decision point.
  • Scorecard phrase: “Algorithm — produced O(log) median via binary search on partition. Strong signal.”

Meta:

  • Less common than at Google, but appears in senior loops. Even reaching the O(m+n) merge cleanly is good; the O(log) is the staff+ signal.
  • Scorecard phrase: “Strong — O(m+n) merge confidently, attempted O(log) partition.”

Apple:

  • Asked in algo loops, particularly for media/systems teams. Often paired with a “what if the arrays are streams” follow-up.
  • Scorecard phrase: “Algorithm — partition binary search, articulated invariant.”

Amazon:

  • Less common in standard L5 loops but appears in L6+ or Bar Raiser slots.
  • Scorecard phrase: “Strong — Hard problem solved with correct complexity.”

Microsoft:

  • Standard “Hard slot” for L63+. Bar Raiser may ask for both the merge and the partition approach.
  • Scorecard phrase: “Strong — both approaches with clear reasoning.”

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
ApproachConcatenate-sortO(m+n) mergeAttempts partition, may not finishPartition with sentinels, clean
SentinelsNoneMaybe a few if branches±inf sentinels from the start+ explains why they eliminate edge cases
Which array to searchEitherWhichever is convenientAlways smaller+ proves why (j-range)
Median parityBug on even or oddHandles after one failCorrect first try+ states the (T+1)//2 choice and why
Follow-up“Don’t know”“Use k-th smallest?”“k-th smallest in two sorted arrays”+ streaming extension, k-way generalization

12. Level Delta

Mid (L4):

  • O(m+n) merge solution.
  • Handles odd and even totals.
  • Handles empty array.

Senior (L5):

  • Attempts the partition binary search.
  • Articulates the partition invariant correctly even if implementation has bugs.
  • Uses sentinels.
  • Tests odd, even, empty, all-same, no-overlap.

Staff (L6):

  • Clean O(log(min(m, n))) partition solution with sentinels.
  • Connects to “binary search on derived parameter” pattern (LC 410, LC 875).
  • Discusses k-th smallest extension (LC 4 generalized).
  • Discusses what changes for K sorted arrays — heap-based merge, log(K) median tracking with two heaps (LC 295 family).

Principal (L7+):

  • Discusses distributed/streaming variants: maintaining a running median over two streams = LC 295 (two heaps). Maintaining a median across federated sorted shards = log-ranked search via approximate quantile sketches (t-digest, KLL).
  • Discusses GPU-friendly variant: parallel partition via “split-merge” on a balanced binary tree of subarrays.
  • Discusses production: financial median-price-over-N-sources, A/B test result median across cohorts, distributed monitoring p50 latency.

13. Follow-up Q&A

Q1: K-th smallest element across two sorted arrays. A: Generalize the partition. Pick a partition (i, j) with i + j = K. The k-th smallest is max(A[i-1], B[j-1]) if the partition is valid. Binary search over i in [max(0, K-n), min(K, m)]. O(log(min(m, n, K))).

Q2: Median of K sorted arrays. A: Two approaches: (1) Heap merge — extract the (T/2)-th element. O(T log K). (2) Generalized partition: binary-search a “pivot” and count how many elements in each array are below the pivot. Binary search on the pivot value, O(K log V log K) where V is the value range. Harder, often overkill.

Q3: Running median of two infinite streams. A: Two heaps (max-heap for lower half, min-heap for upper half). Insert: O(log T). Median: O(1). This is LC 295.

Q4: What if the arrays are NOT sorted but we have access to “sort hints”? A: Sort first (O(T log T)), then apply partition. There’s no shortcut without structure.

Q5: Approximate median for huge unsortable streams. A: Quantile sketches: t-digest, GK, KLL. Sub-linear memory, ε-approximate median. Production-ready (used in Datadog, Prometheus).


14. Solution Walkthrough

See solution.py:

  • find_median_sorted_arrays(nums1, nums2) — partition binary search. O(log(min(m,n))).
  • find_median_brute(nums1, nums2) — merge with two pointers (O(m+n)). Oracle.
  • stress_test() — 200 random pairs (lengths 0..15 each, at least one non-empty), values in [-30, 30].
  • edge_cases() — empty + non-empty, single elements, no overlap, full overlap, all equal, interleaved, odd and even totals, negatives.

Run: python3 solution.py.


15. Beyond the Problem

Principal-engineer code review comment:

“This solution lives or dies by the partition invariant. Write a clear comment block at the top of the function with the formula, the j = (T+1)//2 - i derivation, and a worked example — your future self (and reviewers) will thank you. Second: the float('inf') sentinels are clever but ALSO subtle — if you ever extend this to integer-only environments (some embedded systems), you’ll need INT_MAX/MIN and have to think about overflow. Document the sentinel choice. Third: if this function is going into a library, expose BOTH the O(m+n) merge variant AND the O(log) partition variant — the merge is more cache-friendly for small arrays, and for typical inputs (say m+n < 64), it’s likely FASTER in wall-clock time. Profile, then pick the right default. Fourth: write at least one test that exercises every cross-check failure case — maxLeft(A) > minRight(B) AND maxLeft(B) > minRight(A) — to make sure the binary search direction is right both ways.”

Connections forward:

  • p33, p34 — rotated binary search siblings.
  • LC 295 — Find Median from Data Stream: dynamic-streaming variant via two heaps.
  • LC 295’s heap pattern → general top-K and quantile structures.
  • LC 215 — K-th Largest Element: quickselect or heap.
  • LC 410 — Split Array Largest Sum: binary search on answer (Phase 02 Lab 04).
  • LC 875 — Koko Eating Bananas: binary search on answer.
  • Phase 02 Lab 04 — Binary Search on Answer: full template lab.
  • Phase 02 Lab 09 — Heap Top-K: complementary technique.
  • Production: distributed median (quantile sketches: t-digest, GK, KLL), monitoring p50 latency, A/B test median computation.