p41 — Kth Largest Element in an Array

Source: LeetCode 215 · Medium · Topics: Array, Heap, Divide & Conquer, Quickselect, Sorting Companies (2024–2025 frequency): Meta (very high), Amazon (very high), Google (high), Microsoft (high), Apple (medium-high), Bloomberg (high) Loop position: Phone screen or first algo round. The “two acceptable answers” problem — interviewers expect you to NAME both and pick deliberately.

1. Quick Context

Return the K-th largest element in nums. Note: K-th largest by rank, not K-th distinct value.

Three acceptable solutions:

ApproachTimeSpaceWhen to pick
Sort + indexO(n log n)O(1)Tiny n, lazy answer; mention as baseline
Min-heap of size KO(n log K)O(K)Production — streaming-friendly, predictable
QuickselectO(n) expected, O(n²) worstO(1) avg“Can you do better than O(n log n)?” → yes, expected linear

The discriminator is the named trade-off: candidates who give only one answer fail the signal; candidates who name all three and explain when each wins clear the bar.

Streaming variant: for “K-th largest in a STREAM” (LC 703), the heap-of-size-K is the only viable approach — quickselect needs the full array.


🔗 https://leetcode.com/problems/kth-largest-element-in-an-array/

STOP. 20-min timer. Plan both heap AND quickselect before coding.


3. Prerequisite Concepts

  • heapq module: heappush, heappop, heapify, heappushpop, nlargest/nsmallest.
  • Partition (Lomuto or Hoare) for quickselect.
  • The min-heap of size K trick: keep the K LARGEST seen so far in a MIN-heap so the smallest of the K is at the root.

4. How to Approach (FRAMEWORK Steps 1–9)

Step 1 — Restate: “Given nums and k, return the element that would be at index n - k if nums were sorted ascending. Equivalently, K-th largest by rank (duplicates count separately).”

Step 2 — Clarify:

  • “K-th distinct, or K-th by rank?” (By rank — [3,3,3], k=2 returns 3.)
  • “1 ≤ K ≤ n always?” (Yes per LC; defensive check otherwise.)
  • “Negatives allowed?” (Yes.)
  • “Can I sort in place?” (Mutates input — ASK; quickselect mutates, heap does not.)
  • “n bound?” (LC: n ≤ 10^5; both heap and quickselect comfortable.)
  • “Stream or batch?” (Batch per LC; if stream → heap only.)

Step 3 — Constraints: n ≤ 10^5, k ≤ n. Sort O(n log n) ≈ 10^5 · 17 ≈ 2M ops — passes but suboptimal.

Step 4 — Examples:

  • [3,2,1,5,6,4], k=2 → 5
  • [3,2,3,1,2,4,5,5,6], k=4 → 4
  • [1], k=1 → 1
  • [7,7,7], k=2 → 7

Step 5 — Brute Force: Sort, return nums[-k]. O(n log n).

Step 6 — Complexity: Heap O(n log K); Quickselect O(n) expected, O(n²) worst.

Step 7 — Pattern: “Top K / K-th element” → heap-of-size-K is the production default; quickselect is the interview “linear-time” payoff.

Step 8 — Develop: see solution.py.

Step 9 — Test: k=1 (max), k=n (min), all duplicates, single element, negatives.


5. Progressive Hints

If stuck for 5 min, open hints.md.


6. Deeper Insight — Heap, Quickselect, and the streaming case

6.1 Min-heap of size K — the most underrated trick

To find the K largest, maintain a MIN-heap of the K largest seen so far. The root is the smallest-of-the-K — i.e., the current candidate for K-th largest. For each new element:

  • If heap has fewer than K: push.
  • Else if element > heap[0]: heappushpop(heap, element) (atomic push-then-pop).

At end, heap[0] is the K-th largest.

import heapq
heap = []
for x in nums:
    if len(heap) < k:
        heapq.heappush(heap, x)
    elif x > heap[0]:
        heapq.heappushpop(heap, x)
return heap[0]

Why this is brilliant: never holds more than K elements → O(K) memory regardless of n. Stream-friendly.

Counter-intuitive part: to find the K LARGEST, use a MIN-heap. (The min-of-the-largest-K is the K-th largest.)

6.2 Quickselect — expected O(n)

Variant of quicksort that recurses into only ONE side.

def quickselect(arr, k):  # 0-indexed: returns the element that would be at sorted index k
    if len(arr) == 1: return arr[0]
    pivot = random.choice(arr)
    lows  = [x for x in arr if x < pivot]
    highs = [x for x in arr if x > pivot]
    pivots = [x for x in arr if x == pivot]
    if k < len(lows): return quickselect(lows, k)
    elif k < len(lows) + len(pivots): return pivot
    else: return quickselect(highs, k - len(lows) - len(pivots))

Expected runtime: each level discards a constant fraction → T(n) = T(n/2) + O(n) = O(n).

Worst case O(n²) if pivot is always extremal — defeated by random pivot. NEVER use first/last element as pivot in interview code without acknowledging the adversarial input risk.

In-place version (Lomuto partition) avoids the list-comprehension allocation; better cache locality, but harder to write under pressure. Above 3-way split (lows/equals/highs) handles duplicates cleanly — important for [3,3,3,3,3] where Lomuto degrades.

6.3 Why heap is the production default

ConcernHeapQuickselect
Worst caseO(n log K) deterministicO(n²) — DoS risk on adversarial input
Predictable latencyYesNo
StreamingYes — process one at a timeNo — needs full array
Mutates inputNoYes (or copy)
Code complexity5 lines15+ lines

In a real system serving requests, the O(n²) tail risk of quickselect is unacceptable. Heap wins production. Quickselect wins the “do better than O(n log n)” interview question only.

6.4 LC 703 — Kth Largest in a Stream

The streaming variant: a class that supports add(val) and returns the K-th largest so far. Heap-of-size-K is the canonical answer:

class KthLargest:
    def __init__(self, k, nums):
        self.k = k
        self.heap = []
        for x in nums: self.add(x)
    def add(self, val):
        if len(self.heap) < self.k: heapq.heappush(self.heap, val)
        elif val > self.heap[0]: heapq.heappushpop(self.heap, val)
        return self.heap[0]

Quickselect cannot do this. This is why heap-of-size-K is the “production” answer.

6.5 Median of medians (BFPRT) — deterministic O(n)

Worst-case O(n) selection algorithm. Beautiful theory; slow in practice due to large constants. Mention only if asked “deterministic linear-time selection?” — answer: BFPRT. Don’t implement under pressure.


7. Anti-Pattern Analysis

  1. Sort + index as your only answer — fails the “two acceptable answers” signal. Always mention heap and quickselect.
  2. MAX-heap of size n, pop k times — O(n + k log n). For k near n this is fine, but for k=5, n=10^6 it’s worse than min-heap-of-size-K. Wrong reflex.
  3. Quickselect with nums[0] or nums[-1] pivot — adversarial inputs degrade to O(n²). ALWAYS randomize.
  4. Quickselect with Lomuto partition on [3,3,3,…,3] — degrades to O(n²). Use 3-way (Dutch flag) split OR random pivot.
  5. nlargest(k, nums)[-1] — works (under the hood uses heap-of-size-K), but in an interview you must show you understand it, not just call the library.
  6. sorted(nums)[-k] — fine as baseline; bad as primary answer.
  7. Mutating the input without warning — quickselect mutates; if the caller cares, copy first.
  8. Forgetting heappushpop is atomicheappush then heappop is two O(log K) ops instead of one. Marginal but a code-quality signal.
  9. Returning the heap, not heap[0] — off-by-one bug under pressure.
  10. K-th distinct — different problem. Verify with interviewer.

8. Skills & Takeaways

  • Min-heap of size K for top-K (counter-intuitive: MIN for LARGEST).
  • heapq.heappushpop — atomic push-then-pop, single O(log K).
  • Quickselect with random pivot + 3-way split.
  • Heap = production, Quickselect = “do better than O(n log n)”.
  • Streaming variant (LC 703) — heap only.
  • Median of medians (BFPRT) as worst-case linear deterministic backup answer.
  • Negate values for max-heap in Python (heapq is min-heap).

9. When to Move On

Move on when:

  • You can write heap solution in ≤ 5 min.
  • You can write quickselect with random pivot in ≤ 10 min.
  • You can articulate the production-vs-interview trade-off without prompting.
  • You handle [3,3,3], k=2 → 3 correctly (K-th by rank, not distinct).
  • Stress test on 1000 random arrays agrees with sorted(nums)[-k].

10. Company Context

Meta:

  • One of the top-5 most-asked problems at Meta. Bar Raiser expects BOTH solutions named, heap implemented, quickselect sketched.
  • Scorecard phrase: “Algorithm — named two valid approaches, chose deliberately, articulated worst-case trade-offs.”
  • Common follow-up: “now make it streaming” → LC 703.

Amazon:

  • L5+ staple. Bar Raiser tests the “K-th in a 1TB file you can’t fit in memory” external-sort variant.
  • Expects discussion of partitioning across workers.

Google:

  • L4/L5. Often paired with “implement quickselect” specifically. Random pivot is a hard requirement.

Microsoft:

  • L63/L64. Often phrased as “find the K closest stars to Earth from a list” (variant — same template).

Apple:

  • Common in iOS performance loops (e.g., “K largest battery drainers”). Streaming variant frequent.

Bloomberg:

  • Loves this problem; usually paired with “now do it over a sliding window of size W” (heap + lazy deletion).

11. Interviewer’s Lens

SignalJuniorMidSeniorStaff/Principal
First answersorted(nums)[-k]HeapHeap + names quickselect+ names BFPRT
Heap orientationMin for smallestMin-heap of size K for K-th largest+ explains why MIN heap for LARGEST+ streaming variant
Quickselect pivotFirst elementLast elementRandom+ 3-way for duplicates
Streaming follow-up“Re-sort each time”“Maintain sorted list”“Heap of size K”+ lazy deletion for sliding window
Big-O on duplicates“O(n log n)”“O(n log K)”+ “quickselect degrades on duplicates without 3-way”+ median-of-medians worst case

12. Level Delta

Mid (L4):

  • Heap solution clean.
  • Sketches quickselect; may have pivot bug.
  • States O(n log K).

Senior (L5):

  • Names heap + quickselect + sort; picks heap for production with justification.
  • Quickselect with random pivot, 3-way split.
  • Handles streaming follow-up (LC 703).

Staff (L6):

  • Discusses median-of-medians for deterministic O(n).
  • Discusses sliding-window variant with lazy-deletion heap.
  • Discusses external sort for arrays > memory.

Principal (L7+):

  • Discusses production: percentile dashboards (P95/P99 latency), recommendation top-K, search ranking.
  • Discusses approximate top-K: Space-Saving (Misra-Gries) algorithm, Count-Min Sketch for unbounded streams with bounded memory.
  • Discusses parallelization: per-shard top-K, merge with K · S-heap (S shards).
  • Discusses adversarial defense: random pivot prevents algorithmic DoS via crafted input.

13. Follow-up Q&A

Q1: K-th SMALLEST? A: Mirror everything. Max-heap of size K (heapq with negation), or quickselect targeting k-1 (0-indexed).

Q2: Streaming (LC 703)? A: Heap-of-size-K only; quickselect needs full array. See class skeleton in §6.4.

Q3: K-th distinct? A: Different problem. Dedupe first (set), then K-th over distinct values. Verify with interviewer.

Q4: K-th over sliding window of size W? A: Heap of size W with lazy deletion: on slide-out, push tombstone (value, "removed", index); on access, pop tombstones from root until valid. Or: use a sorted container (e.g., SortedList from sortedcontainers) with O(log W) add/remove.

Q5: Arrays too large for memory (e.g., 1TB)? A: External sort + skip to position. OR: per-chunk top-K (heap of size K per chunk), merge with K·M-heap (M = #chunks). Total O(N log K) work, O(M·K) merge memory.

Q6: Find ALL elements ranked between K1 and K2? A: Two quickselects (for K1 and K2 boundaries), then linear pass for elements in range. Or: nth_element twice (C++ STL).


14. Solution Walkthrough

See solution.py:

  • find_kth_largest_heap(nums, k) — min-heap of size K.
  • find_kth_largest_quickselect(nums, k) — 3-way split with random pivot.
  • find_kth_largest_sort(nums, k) — baseline.
  • find_kth_largest_nlargest(nums, k)heapq.nlargest (library shortcut).
  • KthLargest — LC 703 streaming class.
  • edge_cases() + stress_test() — 1000 random arrays, all four agree.

Run: python3 solution.py.


15. Beyond the Problem

Principal-engineer code review comment:

“Three things. (1) The heap version is the right default for production — but make the size-K invariant explicit in a comment because future-readers WILL try to optimize by removing the len(heap) < k branch and break it. (2) For quickselect, the random pivot is non-negotiable; add a property-test that runs against sorted, reverse-sorted, all-duplicates, and a known adversarial input — those are the cases where pivot choice matters. (3) For the streaming variant, consider whether you need APPROXIMATE top-K (e.g., dashboard top-K queries over billions of events): the Space-Saving algorithm gives bounded-memory approximate top-K with provable error bounds; exact heap-of-size-K doesn’t scale beyond what fits in memory. The interview question hides this scaling cliff — call it out in the design doc.”

Connections forward:

  • p42 — Top K Frequent (heap + count map).
  • p43 — Running median (two heaps).
  • p44 — Merge K sorted lists (heap of heads).
  • LC 703 — Kth Largest in a Stream.
  • LC 973 — K Closest Points to Origin (same template).
  • LC 692 — Top K Frequent Words (heap with tie-break).
  • Phase 02 Lab 09 — Heap Top-K (lab generalization).
  • Production: percentile dashboards (P50/P95/P99), recommendation top-K, search ranking, hot-key detection.