p41 — Kth Largest Element in an Array

Source: LeetCode 215 · Medium · Topics: Array, Heap, Divide & Conquer, Quickselect, Sorting Companies (2024–2025 frequency): Meta (very high), Amazon (very high), Google (high), Microsoft (high), Apple (medium-high), Bloomberg (high) Loop position: Phone screen or first algo round. The “two acceptable answers” problem — interviewers expect you to NAME both and pick deliberately.

1. Quick Context

Return the K-th largest element in nums. Note: K-th largest by rank, not K-th distinct value.

Three acceptable solutions:

Approach	Time	Space	When to pick
Sort + index	O(n log n)	O(1)	Tiny n, lazy answer; mention as baseline
Min-heap of size K	O(n log K)	O(K)	Production — streaming-friendly, predictable
Quickselect	O(n) expected, O(n²) worst	O(1) avg	“Can you do better than O(n log n)?” → yes, expected linear

The discriminator is the named trade-off: candidates who give only one answer fail the signal; candidates who name all three and explain when each wins clear the bar.

Streaming variant: for “K-th largest in a STREAM” (LC 703), the heap-of-size-K is the only viable approach — quickselect needs the full array.

2. LeetCode Link + Attempt Gate

🔗 https://leetcode.com/problems/kth-largest-element-in-an-array/

STOP. 20-min timer. Plan both heap AND quickselect before coding.

3. Prerequisite Concepts

heapq module: heappush, heappop, heapify, heappushpop, nlargest/nsmallest.
Partition (Lomuto or Hoare) for quickselect.
The min-heap of size K trick: keep the K LARGEST seen so far in a MIN-heap so the smallest of the K is at the root.

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Given nums and k, return the element that would be at index n - k if nums were sorted ascending. Equivalently, K-th largest by rank (duplicates count separately).”

Step 2 — Clarify:

“K-th distinct, or K-th by rank?” (By rank — [3,3,3], k=2 returns 3.)
“1 ≤ K ≤ n always?” (Yes per LC; defensive check otherwise.)
“Negatives allowed?” (Yes.)
“Can I sort in place?” (Mutates input — ASK; quickselect mutates, heap does not.)
“n bound?” (LC: n ≤ 10^5; both heap and quickselect comfortable.)
“Stream or batch?” (Batch per LC; if stream → heap only.)

Step 3 — Constraints: n ≤ 10^5, k ≤ n. Sort O(n log n) ≈ 10^5 · 17 ≈ 2M ops — passes but suboptimal.

Step 4 — Examples:

[3,2,1,5,6,4], k=2 → 5
[3,2,3,1,2,4,5,5,6], k=4 → 4
[1], k=1 → 1
[7,7,7], k=2 → 7

Step 5 — Brute Force: Sort, return nums[-k]. O(n log n).

Step 6 — Complexity: Heap O(n log K); Quickselect O(n) expected, O(n²) worst.

Step 7 — Pattern: “Top K / K-th element” → heap-of-size-K is the production default; quickselect is the interview “linear-time” payoff.

Step 8 — Develop: see solution.py.

Step 9 — Test: k=1 (max), k=n (min), all duplicates, single element, negatives.

6. Deeper Insight — Heap, Quickselect, and the streaming case

6.1 Min-heap of size K — the most underrated trick

To find the K largest, maintain a MIN-heap of the K largest seen so far. The root is the smallest-of-the-K — i.e., the current candidate for K-th largest. For each new element:

If heap has fewer than K: push.
Else if element > heap[0]: heappushpop(heap, element) (atomic push-then-pop).

At end, heap[0] is the K-th largest.

import heapq
heap = []
for x in nums:
    if len(heap) < k:
        heapq.heappush(heap, x)
    elif x > heap[0]:
        heapq.heappushpop(heap, x)
return heap[0]

Why this is brilliant: never holds more than K elements → O(K) memory regardless of n. Stream-friendly.

Counter-intuitive part: to find the K LARGEST, use a MIN-heap. (The min-of-the-largest-K is the K-th largest.)

6.2 Quickselect — expected O(n)

Variant of quicksort that recurses into only ONE side.

def quickselect(arr, k):  # 0-indexed: returns the element that would be at sorted index k
    if len(arr) == 1: return arr[0]
    pivot = random.choice(arr)
    lows  = [x for x in arr if x < pivot]
    highs = [x for x in arr if x > pivot]
    pivots = [x for x in arr if x == pivot]
    if k < len(lows): return quickselect(lows, k)
    elif k < len(lows) + len(pivots): return pivot
    else: return quickselect(highs, k - len(lows) - len(pivots))

Expected runtime: each level discards a constant fraction → T(n) = T(n/2) + O(n) = O(n).

Worst case O(n²) if pivot is always extremal — defeated by random pivot. NEVER use first/last element as pivot in interview code without acknowledging the adversarial input risk.

In-place version (Lomuto partition) avoids the list-comprehension allocation; better cache locality, but harder to write under pressure. Above 3-way split (lows/equals/highs) handles duplicates cleanly — important for [3,3,3,3,3] where Lomuto degrades.

6.3 Why heap is the production default

Concern	Heap	Quickselect
Worst case	O(n log K) deterministic	O(n²) — DoS risk on adversarial input
Predictable latency	Yes	No
Streaming	Yes — process one at a time	No — needs full array
Mutates input	No	Yes (or copy)
Code complexity	5 lines	15+ lines

In a real system serving requests, the O(n²) tail risk of quickselect is unacceptable. Heap wins production. Quickselect wins the “do better than O(n log n)” interview question only.

6.4 LC 703 — Kth Largest in a Stream

The streaming variant: a class that supports add(val) and returns the K-th largest so far. Heap-of-size-K is the canonical answer:

class KthLargest:
    def __init__(self, k, nums):
        self.k = k
        self.heap = []
        for x in nums: self.add(x)
    def add(self, val):
        if len(self.heap) < self.k: heapq.heappush(self.heap, val)
        elif val > self.heap[0]: heapq.heappushpop(self.heap, val)
        return self.heap[0]

Quickselect cannot do this. This is why heap-of-size-K is the “production” answer.

6.5 Median of medians (BFPRT) — deterministic O(n)

Worst-case O(n) selection algorithm. Beautiful theory; slow in practice due to large constants. Mention only if asked “deterministic linear-time selection?” — answer: BFPRT. Don’t implement under pressure.

7. Anti-Pattern Analysis

Sort + index as your only answer — fails the “two acceptable answers” signal. Always mention heap and quickselect.
MAX-heap of size n, pop k times — O(n + k log n). For k near n this is fine, but for k=5, n=10^6 it’s worse than min-heap-of-size-K. Wrong reflex.
Quickselect with nums[0] or nums[-1] pivot — adversarial inputs degrade to O(n²). ALWAYS randomize.
Quickselect with Lomuto partition on [3,3,3,…,3] — degrades to O(n²). Use 3-way (Dutch flag) split OR random pivot.
nlargest(k, nums)[-1] — works (under the hood uses heap-of-size-K), but in an interview you must show you understand it, not just call the library.
sorted(nums)[-k] — fine as baseline; bad as primary answer.
Mutating the input without warning — quickselect mutates; if the caller cares, copy first.
Forgetting heappushpop is atomic — heappush then heappop is two O(log K) ops instead of one. Marginal but a code-quality signal.
Returning the heap, not heap[0] — off-by-one bug under pressure.
K-th distinct — different problem. Verify with interviewer.

8. Skills & Takeaways

Min-heap of size K for top-K (counter-intuitive: MIN for LARGEST).
heapq.heappushpop — atomic push-then-pop, single O(log K).
Quickselect with random pivot + 3-way split.
Heap = production, Quickselect = “do better than O(n log n)”.
Streaming variant (LC 703) — heap only.
Median of medians (BFPRT) as worst-case linear deterministic backup answer.
Negate values for max-heap in Python (heapq is min-heap).

9. When to Move On

Move on when:

You can write heap solution in ≤ 5 min.
You can write quickselect with random pivot in ≤ 10 min.
You can articulate the production-vs-interview trade-off without prompting.
You handle [3,3,3], k=2 → 3 correctly (K-th by rank, not distinct).
Stress test on 1000 random arrays agrees with sorted(nums)[-k].

10. Company Context

Meta:

One of the top-5 most-asked problems at Meta. Bar Raiser expects BOTH solutions named, heap implemented, quickselect sketched.
Scorecard phrase: “Algorithm — named two valid approaches, chose deliberately, articulated worst-case trade-offs.”
Common follow-up: “now make it streaming” → LC 703.

Amazon:

L5+ staple. Bar Raiser tests the “K-th in a 1TB file you can’t fit in memory” external-sort variant.
Expects discussion of partitioning across workers.

Google:

L4/L5. Often paired with “implement quickselect” specifically. Random pivot is a hard requirement.

Microsoft:

L63/L64. Often phrased as “find the K closest stars to Earth from a list” (variant — same template).

Apple:

Common in iOS performance loops (e.g., “K largest battery drainers”). Streaming variant frequent.

Bloomberg:

Loves this problem; usually paired with “now do it over a sliding window of size W” (heap + lazy deletion).

11. Interviewer’s Lens

Signal	Junior	Mid	Senior	Staff/Principal
First answer	`sorted(nums)[-k]`	Heap	Heap + names quickselect	+ names BFPRT
Heap orientation	Min for smallest	Min-heap of size K for K-th largest	+ explains why MIN heap for LARGEST	+ streaming variant
Quickselect pivot	First element	Last element	Random	+ 3-way for duplicates
Streaming follow-up	“Re-sort each time”	“Maintain sorted list”	“Heap of size K”	+ lazy deletion for sliding window
Big-O on duplicates	“O(n log n)”	“O(n log K)”	+ “quickselect degrades on duplicates without 3-way”	+ median-of-medians worst case

12. Level Delta

Mid (L4):

Heap solution clean.
Sketches quickselect; may have pivot bug.
States O(n log K).

Senior (L5):

Names heap + quickselect + sort; picks heap for production with justification.
Quickselect with random pivot, 3-way split.
Handles streaming follow-up (LC 703).

Staff (L6):

Discusses median-of-medians for deterministic O(n).
Discusses sliding-window variant with lazy-deletion heap.
Discusses external sort for arrays > memory.

Principal (L7+):

Discusses production: percentile dashboards (P95/P99 latency), recommendation top-K, search ranking.
Discusses approximate top-K: Space-Saving (Misra-Gries) algorithm, Count-Min Sketch for unbounded streams with bounded memory.
Discusses parallelization: per-shard top-K, merge with K · S-heap (S shards).
Discusses adversarial defense: random pivot prevents algorithmic DoS via crafted input.

13. Follow-up Q&A

Q1: K-th SMALLEST? A: Mirror everything. Max-heap of size K (heapq with negation), or quickselect targeting k-1 (0-indexed).

Q2: Streaming (LC 703)? A: Heap-of-size-K only; quickselect needs full array. See class skeleton in §6.4.

Q3: K-th distinct? A: Different problem. Dedupe first (set), then K-th over distinct values. Verify with interviewer.

Q4: K-th over sliding window of size W? A: Heap of size W with lazy deletion: on slide-out, push tombstone (value, "removed", index); on access, pop tombstones from root until valid. Or: use a sorted container (e.g., SortedList from sortedcontainers) with O(log W) add/remove.

Q5: Arrays too large for memory (e.g., 1TB)? A: External sort + skip to position. OR: per-chunk top-K (heap of size K per chunk), merge with K·M-heap (M = #chunks). Total O(N log K) work, O(M·K) merge memory.

Q6: Find ALL elements ranked between K1 and K2? A: Two quickselects (for K1 and K2 boundaries), then linear pass for elements in range. Or: nth_element twice (C++ STL).

14. Solution Walkthrough

See solution.py:

find_kth_largest_heap(nums, k) — min-heap of size K.
find_kth_largest_quickselect(nums, k) — 3-way split with random pivot.
find_kth_largest_sort(nums, k) — baseline.
find_kth_largest_nlargest(nums, k) — heapq.nlargest (library shortcut).
KthLargest — LC 703 streaming class.
edge_cases() + stress_test() — 1000 random arrays, all four agree.

Run: python3 solution.py.

15. Beyond the Problem

Principal-engineer code review comment:

“Three things. (1) The heap version is the right default for production — but make the size-K invariant explicit in a comment because future-readers WILL try to optimize by removing the len(heap) < k branch and break it. (2) For quickselect, the random pivot is non-negotiable; add a property-test that runs against sorted, reverse-sorted, all-duplicates, and a known adversarial input — those are the cases where pivot choice matters. (3) For the streaming variant, consider whether you need APPROXIMATE top-K (e.g., dashboard top-K queries over billions of events): the Space-Saving algorithm gives bounded-memory approximate top-K with provable error bounds; exact heap-of-size-K doesn’t scale beyond what fits in memory. The interview question hides this scaling cliff — call it out in the design doc.”

Connections forward:

p42 — Top K Frequent (heap + count map).
p43 — Running median (two heaps).
p44 — Merge K sorted lists (heap of heads).
LC 703 — Kth Largest in a Stream.
LC 973 — K Closest Points to Origin (same template).
LC 692 — Top K Frequent Words (heap with tie-break).
Phase 02 Lab 09 — Heap Top-K (lab generalization).
Production: percentile dashboards (P50/P95/P99), recommendation top-K, search ranking, hot-key detection.

LeetCode Interview — Extreme Coding