p41 — Kth Largest Element in an Array
Source: LeetCode 215 · Medium · Topics: Array, Heap, Divide & Conquer, Quickselect, Sorting Companies (2024–2025 frequency): Meta (very high), Amazon (very high), Google (high), Microsoft (high), Apple (medium-high), Bloomberg (high) Loop position: Phone screen or first algo round. The “two acceptable answers” problem — interviewers expect you to NAME both and pick deliberately.
1. Quick Context
Return the K-th largest element in nums. Note: K-th largest by rank, not K-th distinct value.
Three acceptable solutions:
| Approach | Time | Space | When to pick |
|---|---|---|---|
| Sort + index | O(n log n) | O(1) | Tiny n, lazy answer; mention as baseline |
| Min-heap of size K | O(n log K) | O(K) | Production — streaming-friendly, predictable |
| Quickselect | O(n) expected, O(n²) worst | O(1) avg | “Can you do better than O(n log n)?” → yes, expected linear |
The discriminator is the named trade-off: candidates who give only one answer fail the signal; candidates who name all three and explain when each wins clear the bar.
Streaming variant: for “K-th largest in a STREAM” (LC 703), the heap-of-size-K is the only viable approach — quickselect needs the full array.
2. LeetCode Link + Attempt Gate
🔗 https://leetcode.com/problems/kth-largest-element-in-an-array/
STOP. 20-min timer. Plan both heap AND quickselect before coding.
3. Prerequisite Concepts
heapqmodule:heappush,heappop,heapify,heappushpop,nlargest/nsmallest.- Partition (Lomuto or Hoare) for quickselect.
- The min-heap of size K trick: keep the K LARGEST seen so far in a MIN-heap so the smallest of the K is at the root.
4. How to Approach (FRAMEWORK Steps 1–9)
Step 1 — Restate: “Given nums and k, return the element that would be at index n - k if nums were sorted ascending. Equivalently, K-th largest by rank (duplicates count separately).”
Step 2 — Clarify:
- “K-th distinct, or K-th by rank?” (By rank —
[3,3,3], k=2returns 3.) - “1 ≤ K ≤ n always?” (Yes per LC; defensive check otherwise.)
- “Negatives allowed?” (Yes.)
- “Can I sort in place?” (Mutates input — ASK; quickselect mutates, heap does not.)
- “n bound?” (LC: n ≤ 10^5; both heap and quickselect comfortable.)
- “Stream or batch?” (Batch per LC; if stream → heap only.)
Step 3 — Constraints: n ≤ 10^5, k ≤ n. Sort O(n log n) ≈ 10^5 · 17 ≈ 2M ops — passes but suboptimal.
Step 4 — Examples:
[3,2,1,5,6,4], k=2→ 5[3,2,3,1,2,4,5,5,6], k=4→ 4[1], k=1→ 1[7,7,7], k=2→ 7
Step 5 — Brute Force: Sort, return nums[-k]. O(n log n).
Step 6 — Complexity: Heap O(n log K); Quickselect O(n) expected, O(n²) worst.
Step 7 — Pattern: “Top K / K-th element” → heap-of-size-K is the production default; quickselect is the interview “linear-time” payoff.
Step 8 — Develop: see solution.py.
Step 9 — Test: k=1 (max), k=n (min), all duplicates, single element, negatives.
5. Progressive Hints
If stuck for 5 min, open hints.md.
6. Deeper Insight — Heap, Quickselect, and the streaming case
6.1 Min-heap of size K — the most underrated trick
To find the K largest, maintain a MIN-heap of the K largest seen so far. The root is the smallest-of-the-K — i.e., the current candidate for K-th largest. For each new element:
- If heap has fewer than K: push.
- Else if element > heap[0]:
heappushpop(heap, element)(atomic push-then-pop).
At end, heap[0] is the K-th largest.
import heapq
heap = []
for x in nums:
if len(heap) < k:
heapq.heappush(heap, x)
elif x > heap[0]:
heapq.heappushpop(heap, x)
return heap[0]
Why this is brilliant: never holds more than K elements → O(K) memory regardless of n. Stream-friendly.
Counter-intuitive part: to find the K LARGEST, use a MIN-heap. (The min-of-the-largest-K is the K-th largest.)
6.2 Quickselect — expected O(n)
Variant of quicksort that recurses into only ONE side.
def quickselect(arr, k): # 0-indexed: returns the element that would be at sorted index k
if len(arr) == 1: return arr[0]
pivot = random.choice(arr)
lows = [x for x in arr if x < pivot]
highs = [x for x in arr if x > pivot]
pivots = [x for x in arr if x == pivot]
if k < len(lows): return quickselect(lows, k)
elif k < len(lows) + len(pivots): return pivot
else: return quickselect(highs, k - len(lows) - len(pivots))
Expected runtime: each level discards a constant fraction → T(n) = T(n/2) + O(n) = O(n).
Worst case O(n²) if pivot is always extremal — defeated by random pivot. NEVER use first/last element as pivot in interview code without acknowledging the adversarial input risk.
In-place version (Lomuto partition) avoids the list-comprehension allocation; better cache locality, but harder to write under pressure. Above 3-way split (lows/equals/highs) handles duplicates cleanly — important for [3,3,3,3,3] where Lomuto degrades.
6.3 Why heap is the production default
| Concern | Heap | Quickselect |
|---|---|---|
| Worst case | O(n log K) deterministic | O(n²) — DoS risk on adversarial input |
| Predictable latency | Yes | No |
| Streaming | Yes — process one at a time | No — needs full array |
| Mutates input | No | Yes (or copy) |
| Code complexity | 5 lines | 15+ lines |
In a real system serving requests, the O(n²) tail risk of quickselect is unacceptable. Heap wins production. Quickselect wins the “do better than O(n log n)” interview question only.
6.4 LC 703 — Kth Largest in a Stream
The streaming variant: a class that supports add(val) and returns the K-th largest so far. Heap-of-size-K is the canonical answer:
class KthLargest:
def __init__(self, k, nums):
self.k = k
self.heap = []
for x in nums: self.add(x)
def add(self, val):
if len(self.heap) < self.k: heapq.heappush(self.heap, val)
elif val > self.heap[0]: heapq.heappushpop(self.heap, val)
return self.heap[0]
Quickselect cannot do this. This is why heap-of-size-K is the “production” answer.
6.5 Median of medians (BFPRT) — deterministic O(n)
Worst-case O(n) selection algorithm. Beautiful theory; slow in practice due to large constants. Mention only if asked “deterministic linear-time selection?” — answer: BFPRT. Don’t implement under pressure.
7. Anti-Pattern Analysis
- Sort + index as your only answer — fails the “two acceptable answers” signal. Always mention heap and quickselect.
- MAX-heap of size n, pop k times — O(n + k log n). For k near n this is fine, but for k=5, n=10^6 it’s worse than min-heap-of-size-K. Wrong reflex.
- Quickselect with
nums[0]ornums[-1]pivot — adversarial inputs degrade to O(n²). ALWAYS randomize. - Quickselect with Lomuto partition on
[3,3,3,…,3]— degrades to O(n²). Use 3-way (Dutch flag) split OR random pivot. nlargest(k, nums)[-1]— works (under the hood uses heap-of-size-K), but in an interview you must show you understand it, not just call the library.sorted(nums)[-k]— fine as baseline; bad as primary answer.- Mutating the input without warning — quickselect mutates; if the caller cares, copy first.
- Forgetting
heappushpopis atomic —heappushthenheappopis two O(log K) ops instead of one. Marginal but a code-quality signal. - Returning the heap, not
heap[0]— off-by-one bug under pressure. - K-th
distinct— different problem. Verify with interviewer.
8. Skills & Takeaways
- Min-heap of size K for top-K (counter-intuitive: MIN for LARGEST).
heapq.heappushpop— atomic push-then-pop, single O(log K).- Quickselect with random pivot + 3-way split.
- Heap = production, Quickselect = “do better than O(n log n)”.
- Streaming variant (LC 703) — heap only.
- Median of medians (BFPRT) as worst-case linear deterministic backup answer.
- Negate values for max-heap in Python (
heapqis min-heap).
9. When to Move On
Move on when:
- You can write heap solution in ≤ 5 min.
- You can write quickselect with random pivot in ≤ 10 min.
- You can articulate the production-vs-interview trade-off without prompting.
- You handle
[3,3,3], k=2→ 3 correctly (K-th by rank, not distinct). - Stress test on 1000 random arrays agrees with
sorted(nums)[-k].
10. Company Context
Meta:
- One of the top-5 most-asked problems at Meta. Bar Raiser expects BOTH solutions named, heap implemented, quickselect sketched.
- Scorecard phrase: “Algorithm — named two valid approaches, chose deliberately, articulated worst-case trade-offs.”
- Common follow-up: “now make it streaming” → LC 703.
Amazon:
- L5+ staple. Bar Raiser tests the “K-th in a 1TB file you can’t fit in memory” external-sort variant.
- Expects discussion of partitioning across workers.
Google:
- L4/L5. Often paired with “implement quickselect” specifically. Random pivot is a hard requirement.
Microsoft:
- L63/L64. Often phrased as “find the K closest stars to Earth from a list” (variant — same template).
Apple:
- Common in iOS performance loops (e.g., “K largest battery drainers”). Streaming variant frequent.
Bloomberg:
- Loves this problem; usually paired with “now do it over a sliding window of size W” (heap + lazy deletion).
11. Interviewer’s Lens
| Signal | Junior | Mid | Senior | Staff/Principal |
|---|---|---|---|---|
| First answer | sorted(nums)[-k] | Heap | Heap + names quickselect | + names BFPRT |
| Heap orientation | Min for smallest | Min-heap of size K for K-th largest | + explains why MIN heap for LARGEST | + streaming variant |
| Quickselect pivot | First element | Last element | Random | + 3-way for duplicates |
| Streaming follow-up | “Re-sort each time” | “Maintain sorted list” | “Heap of size K” | + lazy deletion for sliding window |
| Big-O on duplicates | “O(n log n)” | “O(n log K)” | + “quickselect degrades on duplicates without 3-way” | + median-of-medians worst case |
12. Level Delta
Mid (L4):
- Heap solution clean.
- Sketches quickselect; may have pivot bug.
- States O(n log K).
Senior (L5):
- Names heap + quickselect + sort; picks heap for production with justification.
- Quickselect with random pivot, 3-way split.
- Handles streaming follow-up (LC 703).
Staff (L6):
- Discusses median-of-medians for deterministic O(n).
- Discusses sliding-window variant with lazy-deletion heap.
- Discusses external sort for arrays > memory.
Principal (L7+):
- Discusses production: percentile dashboards (P95/P99 latency), recommendation top-K, search ranking.
- Discusses approximate top-K: Space-Saving (Misra-Gries) algorithm, Count-Min Sketch for unbounded streams with bounded memory.
- Discusses parallelization: per-shard top-K, merge with K · S-heap (S shards).
- Discusses adversarial defense: random pivot prevents algorithmic DoS via crafted input.
13. Follow-up Q&A
Q1: K-th SMALLEST?
A: Mirror everything. Max-heap of size K (heapq with negation), or quickselect targeting k-1 (0-indexed).
Q2: Streaming (LC 703)? A: Heap-of-size-K only; quickselect needs full array. See class skeleton in §6.4.
Q3: K-th distinct? A: Different problem. Dedupe first (set), then K-th over distinct values. Verify with interviewer.
Q4: K-th over sliding window of size W?
A: Heap of size W with lazy deletion: on slide-out, push tombstone (value, "removed", index); on access, pop tombstones from root until valid. Or: use a sorted container (e.g., SortedList from sortedcontainers) with O(log W) add/remove.
Q5: Arrays too large for memory (e.g., 1TB)? A: External sort + skip to position. OR: per-chunk top-K (heap of size K per chunk), merge with K·M-heap (M = #chunks). Total O(N log K) work, O(M·K) merge memory.
Q6: Find ALL elements ranked between K1 and K2? A: Two quickselects (for K1 and K2 boundaries), then linear pass for elements in range. Or: nth_element twice (C++ STL).
14. Solution Walkthrough
See solution.py:
find_kth_largest_heap(nums, k)— min-heap of size K.find_kth_largest_quickselect(nums, k)— 3-way split with random pivot.find_kth_largest_sort(nums, k)— baseline.find_kth_largest_nlargest(nums, k)—heapq.nlargest(library shortcut).KthLargest— LC 703 streaming class.edge_cases()+stress_test()— 1000 random arrays, all four agree.
Run: python3 solution.py.
15. Beyond the Problem
Principal-engineer code review comment:
“Three things. (1) The heap version is the right default for production — but make the size-K invariant explicit in a comment because future-readers WILL try to optimize by removing the
len(heap) < kbranch and break it. (2) For quickselect, the random pivot is non-negotiable; add a property-test that runs against sorted, reverse-sorted, all-duplicates, and a known adversarial input — those are the cases where pivot choice matters. (3) For the streaming variant, consider whether you need APPROXIMATE top-K (e.g., dashboard top-K queries over billions of events): the Space-Saving algorithm gives bounded-memory approximate top-K with provable error bounds; exact heap-of-size-K doesn’t scale beyond what fits in memory. The interview question hides this scaling cliff — call it out in the design doc.”
Connections forward:
- p42 — Top K Frequent (heap + count map).
- p43 — Running median (two heaps).
- p44 — Merge K sorted lists (heap of heads).
- LC 703 — Kth Largest in a Stream.
- LC 973 — K Closest Points to Origin (same template).
- LC 692 — Top K Frequent Words (heap with tie-break).
- Phase 02 Lab 09 — Heap Top-K (lab generalization).
- Production: percentile dashboards (P50/P95/P99), recommendation top-K, search ranking, hot-key detection.