Lab 06 — Performance Profiling (Three LIS Implementations)

Goal

Measure and compare three implementations of Longest Increasing Subsequence: O(N²) DP, O(N log N) patience sort, and a poorly-written “O(N log N)” with hidden O(N) inside the inner loop. Use profiling tools to detect the discrepancy between claimed complexity and actual behavior. By the end you should never again submit a solution thinking “this should be fast enough” without measuring.

Background Concepts

Empirical complexity verification: if a function is O(f(N)), then running it at N and 2N should produce runtimes that scale as f(2N) / f(N). For O(N): 2×. For O(N log N): ~2.1×. For O(N²): 4×. For O(N³): 8×. Measure the ratio; mismatch reveals the bug.

Profiling tools:

  • Python: cProfile for function-level timing; line_profiler for line-by-line; py-spy for sampling without code changes; tracemalloc for memory.
  • Java: async-profiler for low-overhead sampling; JFR (Java Flight Recorder); JMH for microbenchmarks; jmap for heap snapshots.
  • Go: pprof (CPU and heap); runtime/trace for goroutine scheduling; go test -bench with -benchmem.
  • C++: perf (Linux) with flamegraphs; Valgrind/Callgrind for instruction counts; gperftools.
  • Node.js: built-in --inspect + Chrome DevTools; clinic.js for higher-level analysis.

Common deceptions:

  • A “constant time” operation that’s actually O(N) (e.g., list.insert(0, x) in Python, s += c in a loop in Java).
  • Hash collisions in adversarial input turning O(1) lookups into O(N).
  • Garbage collector pauses inflating measurements.
  • JIT warmup masking the cold-start cost.

Interview Context

Interviewers ask “what’s the complexity?” on every problem. They sometimes follow up with “are you sure?” — and the right answer is “I claim O(N log N); I can verify by running at doubling sizes if you’d like.” Candidates who can articulate empirical verification (even without running it) signal a different level of rigor.

Practical-engineering interviews (Phase 8) often include “make this faster” or “profile this for me” as a follow-up. You need fluency with at least one language’s profiler.

Problem Statement

Implement three versions of LIS:

Version A (O(N²) DP): dp[i] = length of LIS ending at index i. Transition: dp[i] = max(dp[j] + 1) for all j < i with a[j] < a[i].

Version B (O(N log N) patience sort): Maintain tails[k] = smallest tail value of any increasing subsequence of length k+1. For each element, binary-search-and-replace; if no replacement, append.

Version C (“fake” O(N log N)): Same as B, but uses Python list with index() (O(N)) to find the replacement position instead of binary search. Looks like O(N log N) at a glance; actually O(N²).

Measure runtime at N = 1000, 2000, 4000, 8000, 16000. Verify the doubling ratios. Profile to find the bottleneck in Version C.

Constraints

  • 1 ≤ N ≤ 16000 for measurement
  • Values: random integers in [0, 10^6]

Clarifying Questions

(LIS is standard; the lab is about measurement.)

Examples

LIS([10, 9, 2, 5, 3, 7, 101, 18]) == 4   (e.g., [2, 5, 7, 101] or [2, 3, 7, 18])
LIS([0, 1, 0, 3, 2, 3]) == 4
LIS([7, 7, 7, 7]) == 1

Initial Brute Force

Enumerate all 2^N subsequences, check each. O(2^N · N).

Brute Force Complexity

O(2^N · N). Valid only for N ≤ 20. Useful as the verifier in your stress harness from Lab 5.

Optimization Path

DP from O(2^N) → O(N²) → O(N log N).

Final Expected Approach

import bisect
import time
import random

# Version A: O(N^2) DP
def lis_dp(a):
    if not a: return 0
    n = len(a)
    dp = [1] * n
    for i in range(1, n):
        for j in range(i):
            if a[j] < a[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)

# Version B: O(N log N) — real
def lis_patience(a):
    tails = []
    for x in a:
        idx = bisect.bisect_left(tails, x)
        if idx == len(tails):
            tails.append(x)
        else:
            tails[idx] = x
    return len(tails)

# Version C: "fake" O(N log N) — uses linear search disguised
def lis_fake(a):
    tails = []
    for x in a:
        # Linear scan to find first tail >= x — O(N)!
        idx = None
        for i, t in enumerate(tails):
            if t >= x:
                idx = i; break
        if idx is None:
            tails.append(x)
        else:
            tails[idx] = x
    return len(tails)

Measurement Harness

def benchmark(fn, n, trials=3):
    random.seed(n)
    times = []
    for _ in range(trials):
        a = [random.randint(0, 10**6) for _ in range(n)]
        start = time.perf_counter()
        fn(a)
        times.append(time.perf_counter() - start)
    return min(times)  # min is most reliable; mean is noisy

def doubling_test(fn, name, sizes=[1000, 2000, 4000, 8000, 16000]):
    prev = None
    print(f"\n{name}")
    print(f"{'N':>8} {'time (s)':>12} {'ratio':>8}")
    for n in sizes:
        t = benchmark(fn, n)
        ratio = f"{t/prev:.2f}x" if prev else "—"
        print(f"{n:>8} {t:>12.4f} {ratio:>8}")
        prev = t

doubling_test(lis_dp,       "Version A — O(N^2) DP")
doubling_test(lis_patience, "Version B — O(N log N) patience")
doubling_test(lis_fake,     "Version C — fake O(N log N)")

Expected Output (approximate, on modern laptop)

Version A — O(N^2) DP
       N    time (s)    ratio
    1000     0.0420       —
    2000     0.1680    4.00x
    4000     0.6720    4.00x
    8000     2.6900    4.00x
   16000    10.7600    4.00x

Version B — O(N log N) patience
       N    time (s)    ratio
    1000     0.0008       —
    2000     0.0017    2.13x
    4000     0.0036    2.12x
    8000     0.0076    2.11x
   16000     0.0160    2.10x

Version C — fake O(N log N)
       N    time (s)    ratio
    1000     0.0210       —
    2000     0.0840    4.00x   ← 4x means O(N^2), not O(N log N)!
    4000     0.3360    4.00x
    8000     1.3440    4.00x
   16000     5.3760    4.00x

Reading the ratios: if you claimed O(N log N) and see 4× per doubling, your algorithm is actually O(N²). The doubling test is the cheapest, most reliable complexity-verifier in your toolbox.

Profiling Version C

import cProfile, pstats
random.seed(0)
a = [random.randint(0, 10**6) for _ in range(8000)]

pr = cProfile.Profile()
pr.enable()
lis_fake(a)
pr.disable()
pstats.Stats(pr).sort_stats('cumulative').print_stats(10)

Expected output: the lis_fake function dominates; the inner for i, t in enumerate(tails) is the hot spot. With line_profiler:

$ pip install line_profiler
$ kernprof -l -v script.py
# Add @profile decorator to lis_fake first

Output will show the inner loop accounts for ~95% of the time per call, confirming the linear-search bottleneck.

Data Structures Used

  • Plain list (Python) — supports bisect for true O(log N) search
  • Profiler outputs (text/HTML/flamegraph depending on tool)

Correctness Argument

All three versions produce the same output on random input. The patience sort correctness: tails[k] always stores the smallest possible tail of a length-(k+1) LIS seen so far. When a new element is smaller than tails[k], replacing improves the future extensibility; when it’s larger than all tails, it extends to a new length. Final length is len(tails). Inductive proof omitted; see CLRS / standard references.

Complexity

  • A: O(N²) time, O(N) space.
  • B: O(N log N) time, O(N) space.
  • C: claimed O(N log N), actually O(N²) due to linear inner search.

Implementation Requirements

  • Use time.perf_counter(), not time.time() — higher resolution, monotonic.
  • Take the min of multiple trials, not the mean — min rejects GC/scheduler noise.
  • Warm up before timing if testing JIT’d languages (Java, JS).
  • Use __slots__ and pre-allocated arrays in hot Python paths.

Tests

Smoke

assert lis_dp([10, 9, 2, 5, 3, 7, 101, 18]) == 4
assert lis_patience([10, 9, 2, 5, 3, 7, 101, 18]) == 4
assert lis_fake([10, 9, 2, 5, 3, 7, 101, 18]) == 4

Edge

for fn in (lis_dp, lis_patience, lis_fake):
    assert fn([]) == 0
    assert fn([42]) == 1
    assert fn([1, 2, 3, 4, 5]) == 5         # already sorted
    assert fn([5, 4, 3, 2, 1]) == 1         # reverse sorted
    assert fn([7, 7, 7, 7]) == 1            # all duplicates

Performance assertions

# Verify Version B's doubling ratio
times = [benchmark(lis_patience, n) for n in (1000, 2000, 4000)]
ratios = [times[1]/times[0], times[2]/times[1]]
for r in ratios:
    assert 1.8 < r < 2.5, f"Version B ratio {r:.2f} not in O(N log N) range"

# Verify Version C is actually quadratic (this assertion *should* pass — proving the bug)
times = [benchmark(lis_fake, n) for n in (1000, 2000, 4000)]
ratios = [times[1]/times[0], times[2]/times[1]]
for r in ratios:
    assert 3.5 < r < 4.5, f"Version C ratio {r:.2f} not in O(N^2) range — bug may be fixed?"

Randomized verifier (A == B == C)

random.seed(0)
for _ in range(50):
    a = [random.randint(0, 100) for _ in range(random.randint(1, 50))]
    assert lis_dp(a) == lis_patience(a) == lis_fake(a)

Follow-up Questions

  1. What if the input is nearly sorted? Version A’s actual runtime degrades less; profile to confirm.
  2. Memory profile: which version allocates most? Use tracemalloc.start() and tracemalloc.get_traced_memory().
  3. Reconstruct the LIS, not just its length. Track parent pointers; doesn’t change asymptotic complexity but doubles space.
  4. LIS in a stream (one pass, can’t store everything). Use patience sort with a fixed buffer; gives approximate answer.
  5. Parallelize LIS. Hard — DP dependencies are sequential. Pipeline by chunks; merge with care.

Product Extension

  • Code review of “this should be fast” claims — every senior engineer learns to verify before trusting.
  • Database query planners — the planner estimates I/O cost; profiling validates the estimate against real query times.
  • CDN cache eviction policies — when comparing LRU vs LFU vs SLRU under real traffic, microbenchmarks lie; full profiles win.
  • Production hot-path detection — flamegraphs reveal that 80% of CPU is spent in 3% of the code; optimize there.

Language/Runtime Follow-ups

  • Python: time.perf_counter() is the right clock. cProfile overhead is ~30%; for tight loops use line_profiler or py-spy (sampling, ~0% overhead). The GIL means CPU profiling is mostly sequential; for asyncio code use aiomonitor or asyncio debug mode.
  • Java: JMH (Java Microbenchmark Harness) handles JIT warmup, dead-code elimination, and constant folding correctly — handwritten timing loops in Java are often wrong. Use -prof gc to see allocation cost.
  • Go: go test -bench=. -benchmem is the standard. -cpuprofile and -memprofile write pprof files; visualize with go tool pprof.
  • C++: perf record + perf report for sampling; perf stat for cache misses and IPC. Use -O2 or -O3 for measurement; debug builds have very different performance.
  • Node.js: V8’s --prof flag dumps tick logs; --inspect for Chrome DevTools. Beware turbofan optimization — code that runs cold for the first 10K iterations is suddenly 10× faster after JIT.

Common Bugs

  1. Timing the cold start — first call includes import/parse/JIT warmup; throw it out.
  2. Using time.time() — wall clock; affected by NTP, sleep, system load.
  3. Mean over trials — one stop-the-world GC pause skews the mean; use min instead.
  4. Measuring with assertions on — Python -O flag strips asserts; default mode keeps them, slowing hot loops.
  5. Forgetting random.seed() — runs aren’t reproducible.
  6. Comparing implementations on different input distributions — random LIS, sorted LIS, and reversed LIS have wildly different runtimes for some algorithms.

Debugging Strategy

When complexity doesn’t match your claim:

  1. Verify with the doubling test. This is the only way.
  2. Profile to find the hot function. It’s almost always one inner loop.
  3. Read the standard library docs for any “built-in” operation you used — list.insert, dict.update, str +=, Vector.add may not be what you think.
  4. Check for hidden quadratic behavior in concatenation: result = result + small_thing in a loop is the classic Java/Python beginner trap.
  5. Verify memory with tracemalloc — sometimes the “slow” is actually paging, not CPU.

Mastery Criteria

  • Implemented all three LIS versions
  • Ran the doubling test and observed the 4× / 2.1× / 4× ratios
  • Profiled Version C with cProfile (or equivalent) and identified the linear-search bottleneck
  • Wrote performance assertions that would catch a regression
  • Can recite the expected doubling ratio for O(N), O(N log N), O(N²), O(N³)
  • Applied profiling to one of your own past solutions and identified one inefficiency
  • Familiarity with at least one profiler in your primary language (output format, common flags, how to interpret)