Lab 01 — Edge Case Taxonomy (Find the Median of an Unsorted Array)

Goal

Build a reusable, systematic edge-case taxonomy you can apply to any new problem in under 90 seconds. Use “find the median of an unsorted integer array” as the anchor — a problem that looks trivial but has at least 12 edge cases that a careless candidate will miss. By the end you should be able to enumerate 8+ edge cases for any array problem before writing a single line of code.

Background Concepts

An edge case is an input that is technically legal under the constraints but exercises a degenerate or boundary behavior in your algorithm. They fall into a small number of universal categories:

  1. Empty / null — what does your function do with [] or None?
  2. Singleton — one element
  3. Identical elements — all equal; tests duplicate handling
  4. Boundary valuesINT_MAX, INT_MIN, 0, negatives
  5. Sorted / reverse-sorted — tests algorithms that assume scrambled input
  6. Maximum size — N at the constraint upper bound; tests complexity
  7. Output ambiguity — multiple valid answers; tests the spec
  8. Arithmetic overflow — sums/products that exceed INT_MAX

The taxonomy is universal. The application is problem-specific.

Interview Context

“Find the median” is asked as a warm-up at Meta, Microsoft, and Bloomberg phone screens. The interviewer is not testing whether you know quickselect. They are testing whether you ask “what do you mean by median for an even-length array — average of the two middles or either one?” before writing code. Candidates who skip this question lose the point even if their code is otherwise correct.

The senior signal is to list out edges aloud before coding: “Empty array — should I return None or throw? Single element — that’s the median. Two elements — average. Even vs odd length — different formulas. Are values bounded so the sum of two won’t overflow?” Five sentences. Then code.

Problem Statement

Given an unsorted array of integers nums, return the median. If nums has odd length, return the middle value after sorting. If nums has even length, return the average of the two middle values.

Constraints

  • 0 ≤ |nums| ≤ 10^5
  • -10^9 ≤ nums[i] ≤ 10^9
  • The return type may be a float (because of averaging)

Clarifying Questions

  1. Empty input? What should I return — None, NaN, raise an exception?
  2. Even length: average or either middle? Lower middle, upper middle, or the float average?
  3. Are duplicates allowed? (Yes; median definition handles them naturally.)
  4. Floating point precision concerns? If nums[i] is up to 10^9, sum of two is 2×10^9 — fits in 32-bit signed int barely, but using (a + b) / 2.0 in C++ overflows for INT_MAX + INT_MAX. Better: a/2.0 + b/2.0 or a + (b - a)/2.0.
  5. Modify input allowed? (Affects whether you can sort in place or need to copy.)

Examples

nums = [3, 1, 2]            → 2          (odd length)
nums = [3, 1, 2, 4]         → 2.5        (even, average of 2 and 3)
nums = [5]                  → 5          (singleton)
nums = []                   → None / raise (clarify with interviewer)
nums = [7, 7, 7, 7]         → 7.0        (all duplicates, even length)
nums = [INT_MAX, INT_MAX]   → INT_MAX    (overflow risk in average)
nums = [-3, -1, -2]         → -2         (negatives)

Initial Brute Force

Sort, then index. Two lines of code.

def median(nums):
    if not nums:
        return None
    s = sorted(nums)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

Brute Force Complexity

Time O(N log N), space O(N) (or O(1) if you sort in place and the caller allows mutation). For N = 10^5 this is ~1.7×10^6 comparisons — well within any interview time limit.

Optimization Path

The interviewer may now ask: “Can you do better than O(N log N)?” The answer is quickselect, which finds the k-th smallest in expected O(N) using partition-based recursion. Worst case O(N²); use median-of-medians for guaranteed O(N) if pressed.

For the edge-case lab, do not optimize. The point is to enumerate edges before the algorithm matters. Quickselect has more edge cases (recursion depth on degenerate partitions, pivot selection bias) so optimizing without first nailing edges makes the bug surface larger.

Final Expected Approach

  1. Validate input. Return None (or raise) on empty.
  2. Sort a copy (do not mutate caller’s array unless agreed).
  3. Compute middle index mid = n // 2.
  4. If odd, return sorted[mid].
  5. If even, return sorted[mid - 1] + sorted[mid] divided by 2, using a + (b - a) / 2 form to avoid overflow.

Data Structures Used

  • A sortable copy of the array. In Python sorted() returns a new list. In Java use Arrays.sort() on a clone; in C++ std::sort on a copy.
  • No auxiliary structures.

Correctness Argument

After sorting, by definition the value at index n // 2 is the lower-middle (0-indexed); the value at n // 2 - 1 is the upper-lower; their average is the median for even lengths. For odd lengths, n // 2 is exactly the middle. The sort guarantees the ordering invariant required.

Complexity

Time O(N log N) sort + O(1) lookup. Space O(N) for the copy (or O(1) if in-place sort is allowed). Quickselect: expected O(N), worst O(N²); median-of-medians: worst O(N) with larger constant.

Implementation Requirements

  • Function signature should accept any iterable convertible to a list.
  • Do not mutate the caller’s input.
  • Return type: float (even for odd-length inputs, for consistency) or use a tagged return; document which.
  • Handle empty input explicitly with the chosen convention.

Tests

Smoke (3 normal)

assert median([3, 1, 2]) == 2
assert median([1, 2, 3, 4]) == 2.5
assert median([5, 2, 8, 1, 9]) == 5

Edge (5 — exceeds the 3 minimum because this lab is about edges)

assert median([]) is None              # empty
assert median([42]) == 42              # singleton
assert median([1, 2]) == 1.5           # even, smallest
assert median([7, 7, 7, 7]) == 7       # all duplicates
assert median([-3, -1, -2]) == -2      # all negatives
assert median([10**9, 10**9]) == 10**9 # overflow boundary
assert median([-10**9, 10**9]) == 0    # mixed extremes

Large

import random
random.seed(0)
big = [random.randint(-10**9, 10**9) for _ in range(10**5)]
result = median(big)
assert isinstance(result, (int, float))  # didn't crash; didn't take >1s

Randomized verifier

def brute_median(nums):
    s = sorted(nums)
    n = len(s)
    return s[n//2] if n % 2 else (s[n//2 - 1] + s[n//2]) / 2

for _ in range(1000):
    n = random.randint(1, 50)
    nums = [random.randint(-100, 100) for _ in range(n)]
    assert median(nums) == brute_median(nums)

Invalid input

try:
    median(None)
    assert False, "should have raised"
except TypeError:
    pass

Follow-up Questions

  1. Streaming median. Find the median as numbers arrive one at a time. → Two heaps (max-heap of lower half, min-heap of upper half). O(log N) per insert, O(1) per query.
  2. Median of two sorted arrays. Classic LC 4 hard. → Binary search on partition; O(log min(N, M)).
  3. k-th smallest in unsorted. → Quickselect; O(N) expected.
  4. Weighted median. Each value has a weight; find the value where cumulative weight crosses half. → Sort + prefix scan; O(N log N).
  5. Approximate median in one pass with O(1) memory. → Reservoir sampling + recursion, or P² algorithm.

Product Extension

  • Latency percentiles in distributed monitoring. P50 (median), P99, P99.9. Cannot store all latencies — use t-digest or HdrHistogram for compact mergeable approximations.
  • A/B testing. Comparing median user session length between buckets requires bootstrap confidence intervals because medians don’t have closed-form variance.
  • Recommendation systems. “Median rating per item” for cold-start ranking.

Language/Runtime Follow-ups

  • Python: sorted() is TimSort, O(N log N) worst case; uses additional O(N) memory. list.sort() is in-place. nums[n//2] is O(1) indexing.
  • Java: Arrays.sort(int[]) uses dual-pivot quicksort (O(N log N) average, O(N²) worst on adversarial input). Arrays.sort(Object[]) uses TimSort. Auto-boxing Integer adds overhead.
  • C++: std::sort is introsort (quicksort + heapsort fallback); worst-case O(N log N). std::nth_element is O(N) average for quickselect. Beware integer overflow in (a + b) / 2 for signed 32-bit; use a + (b - a) / 2.
  • Go: sort.Ints is introsort, O(N log N). No overflow checks in int arithmetic; wraps silently on 32-bit platforms.
  • JavaScript: Array.prototype.sort() defaults to lexicographic string comparison[10, 9, 2].sort() returns [10, 2, 9]. Always pass a comparator: sort((a, b) => a - b).

Common Bugs

  1. Empty input crash. s[n // 2] with n == 0 is s[0] on an empty list → IndexError.
  2. Integer overflow on average. (a + b) / 2 overflows when a + b > INT_MAX. Use a + (b - a) / 2 or use floating-point earlier.
  3. Integer division for even-length median. In Python 2 / Java, (a + b) / 2 truncates. In Python 3, / is float — but // is integer. Be explicit.
  4. Mutating caller’s array. Passing nums.sort() to a function modifies the original. Use sorted(nums).
  5. Off-by-one for even length. n // 2 is the upper middle (0-indexed); n // 2 - 1 is the lower. Confusing these gives the wrong answer for [1, 2, 3, 4].
  6. JavaScript default sort. Returns string-sorted order for numbers.

Debugging Strategy

If your function returns the wrong value:

  1. Print the sorted array. Is it actually sorted? (Confirms no JavaScript-style default-sort bug.)
  2. Print n, n // 2, n % 2. Is the index what you expect?
  3. Check parity branch — did you accidentally swap the odd/even branches?
  4. For overflow: print s[n//2 - 1] + s[n//2] before dividing; check if it matches the expected sum.
  5. For mutation bugs: print the input both before and after the call. If it changed, you mutated.

If you TLE on the large test, you wrote O(N²) accidentally (e.g., used insertion sort, or sorted inside a loop).

Mastery Criteria

  • Wrote the function correctly on the first try with all edge cases handled
  • Listed all 8+ edge cases aloud before writing code (time yourself: under 90 seconds)
  • Identified the overflow risk in (a + b) / 2 without prompting
  • Wrote the randomized verifier in under 5 minutes
  • Can recite the universal edge-case taxonomy (empty / singleton / two / duplicates / sorted / boundary / overflow / mixed) without looking
  • Re-applied the taxonomy to one Phase 2 problem and caught at least one edge case you previously missed