Lab 06 — Performance Profiling (Three LIS Implementations)
Goal
Measure and compare three implementations of Longest Increasing Subsequence: O(N²) DP, O(N log N) patience sort, and a poorly-written “O(N log N)” with hidden O(N) inside the inner loop. Use profiling tools to detect the discrepancy between claimed complexity and actual behavior. By the end you should never again submit a solution thinking “this should be fast enough” without measuring.
Background Concepts
Empirical complexity verification: if a function is O(f(N)), then running it at N and 2N should produce runtimes that scale as f(2N) / f(N). For O(N): 2×. For O(N log N): ~2.1×. For O(N²): 4×. For O(N³): 8×. Measure the ratio; mismatch reveals the bug.
Profiling tools:
- Python:
cProfilefor function-level timing;line_profilerfor line-by-line;py-spyfor sampling without code changes;tracemallocfor memory. - Java:
async-profilerfor low-overhead sampling; JFR (Java Flight Recorder); JMH for microbenchmarks; jmap for heap snapshots. - Go:
pprof(CPU and heap);runtime/tracefor goroutine scheduling;go test -benchwith-benchmem. - C++:
perf(Linux) with flamegraphs; Valgrind/Callgrind for instruction counts; gperftools. - Node.js: built-in
--inspect+ Chrome DevTools;clinic.jsfor higher-level analysis.
Common deceptions:
- A “constant time” operation that’s actually O(N) (e.g.,
list.insert(0, x)in Python,s += cin a loop in Java). - Hash collisions in adversarial input turning O(1) lookups into O(N).
- Garbage collector pauses inflating measurements.
- JIT warmup masking the cold-start cost.
Interview Context
Interviewers ask “what’s the complexity?” on every problem. They sometimes follow up with “are you sure?” — and the right answer is “I claim O(N log N); I can verify by running at doubling sizes if you’d like.” Candidates who can articulate empirical verification (even without running it) signal a different level of rigor.
Practical-engineering interviews (Phase 8) often include “make this faster” or “profile this for me” as a follow-up. You need fluency with at least one language’s profiler.
Problem Statement
Implement three versions of LIS:
Version A (O(N²) DP): dp[i] = length of LIS ending at index i. Transition: dp[i] = max(dp[j] + 1) for all j < i with a[j] < a[i].
Version B (O(N log N) patience sort): Maintain tails[k] = smallest tail value of any increasing subsequence of length k+1. For each element, binary-search-and-replace; if no replacement, append.
Version C (“fake” O(N log N)): Same as B, but uses Python list with index() (O(N)) to find the replacement position instead of binary search. Looks like O(N log N) at a glance; actually O(N²).
Measure runtime at N = 1000, 2000, 4000, 8000, 16000. Verify the doubling ratios. Profile to find the bottleneck in Version C.
Constraints
- 1 ≤ N ≤ 16000 for measurement
- Values: random integers in
[0, 10^6]
Clarifying Questions
(LIS is standard; the lab is about measurement.)
Examples
LIS([10, 9, 2, 5, 3, 7, 101, 18]) == 4 (e.g., [2, 5, 7, 101] or [2, 3, 7, 18])
LIS([0, 1, 0, 3, 2, 3]) == 4
LIS([7, 7, 7, 7]) == 1
Initial Brute Force
Enumerate all 2^N subsequences, check each. O(2^N · N).
Brute Force Complexity
O(2^N · N). Valid only for N ≤ 20. Useful as the verifier in your stress harness from Lab 5.
Optimization Path
DP from O(2^N) → O(N²) → O(N log N).
Final Expected Approach
import bisect
import time
import random
# Version A: O(N^2) DP
def lis_dp(a):
if not a: return 0
n = len(a)
dp = [1] * n
for i in range(1, n):
for j in range(i):
if a[j] < a[i]:
dp[i] = max(dp[i], dp[j] + 1)
return max(dp)
# Version B: O(N log N) — real
def lis_patience(a):
tails = []
for x in a:
idx = bisect.bisect_left(tails, x)
if idx == len(tails):
tails.append(x)
else:
tails[idx] = x
return len(tails)
# Version C: "fake" O(N log N) — uses linear search disguised
def lis_fake(a):
tails = []
for x in a:
# Linear scan to find first tail >= x — O(N)!
idx = None
for i, t in enumerate(tails):
if t >= x:
idx = i; break
if idx is None:
tails.append(x)
else:
tails[idx] = x
return len(tails)
Measurement Harness
def benchmark(fn, n, trials=3):
random.seed(n)
times = []
for _ in range(trials):
a = [random.randint(0, 10**6) for _ in range(n)]
start = time.perf_counter()
fn(a)
times.append(time.perf_counter() - start)
return min(times) # min is most reliable; mean is noisy
def doubling_test(fn, name, sizes=[1000, 2000, 4000, 8000, 16000]):
prev = None
print(f"\n{name}")
print(f"{'N':>8} {'time (s)':>12} {'ratio':>8}")
for n in sizes:
t = benchmark(fn, n)
ratio = f"{t/prev:.2f}x" if prev else "—"
print(f"{n:>8} {t:>12.4f} {ratio:>8}")
prev = t
doubling_test(lis_dp, "Version A — O(N^2) DP")
doubling_test(lis_patience, "Version B — O(N log N) patience")
doubling_test(lis_fake, "Version C — fake O(N log N)")
Expected Output (approximate, on modern laptop)
Version A — O(N^2) DP
N time (s) ratio
1000 0.0420 —
2000 0.1680 4.00x
4000 0.6720 4.00x
8000 2.6900 4.00x
16000 10.7600 4.00x
Version B — O(N log N) patience
N time (s) ratio
1000 0.0008 —
2000 0.0017 2.13x
4000 0.0036 2.12x
8000 0.0076 2.11x
16000 0.0160 2.10x
Version C — fake O(N log N)
N time (s) ratio
1000 0.0210 —
2000 0.0840 4.00x ← 4x means O(N^2), not O(N log N)!
4000 0.3360 4.00x
8000 1.3440 4.00x
16000 5.3760 4.00x
Reading the ratios: if you claimed O(N log N) and see 4× per doubling, your algorithm is actually O(N²). The doubling test is the cheapest, most reliable complexity-verifier in your toolbox.
Profiling Version C
import cProfile, pstats
random.seed(0)
a = [random.randint(0, 10**6) for _ in range(8000)]
pr = cProfile.Profile()
pr.enable()
lis_fake(a)
pr.disable()
pstats.Stats(pr).sort_stats('cumulative').print_stats(10)
Expected output: the lis_fake function dominates; the inner for i, t in enumerate(tails) is the hot spot. With line_profiler:
$ pip install line_profiler
$ kernprof -l -v script.py
# Add @profile decorator to lis_fake first
Output will show the inner loop accounts for ~95% of the time per call, confirming the linear-search bottleneck.
Data Structures Used
- Plain list (Python) — supports
bisectfor true O(log N) search - Profiler outputs (text/HTML/flamegraph depending on tool)
Correctness Argument
All three versions produce the same output on random input. The patience sort correctness: tails[k] always stores the smallest possible tail of a length-(k+1) LIS seen so far. When a new element is smaller than tails[k], replacing improves the future extensibility; when it’s larger than all tails, it extends to a new length. Final length is len(tails). Inductive proof omitted; see CLRS / standard references.
Complexity
- A: O(N²) time, O(N) space.
- B: O(N log N) time, O(N) space.
- C: claimed O(N log N), actually O(N²) due to linear inner search.
Implementation Requirements
- Use
time.perf_counter(), nottime.time()— higher resolution, monotonic. - Take the
minof multiple trials, not the mean —minrejects GC/scheduler noise. - Warm up before timing if testing JIT’d languages (Java, JS).
- Use
__slots__and pre-allocated arrays in hot Python paths.
Tests
Smoke
assert lis_dp([10, 9, 2, 5, 3, 7, 101, 18]) == 4
assert lis_patience([10, 9, 2, 5, 3, 7, 101, 18]) == 4
assert lis_fake([10, 9, 2, 5, 3, 7, 101, 18]) == 4
Edge
for fn in (lis_dp, lis_patience, lis_fake):
assert fn([]) == 0
assert fn([42]) == 1
assert fn([1, 2, 3, 4, 5]) == 5 # already sorted
assert fn([5, 4, 3, 2, 1]) == 1 # reverse sorted
assert fn([7, 7, 7, 7]) == 1 # all duplicates
Performance assertions
# Verify Version B's doubling ratio
times = [benchmark(lis_patience, n) for n in (1000, 2000, 4000)]
ratios = [times[1]/times[0], times[2]/times[1]]
for r in ratios:
assert 1.8 < r < 2.5, f"Version B ratio {r:.2f} not in O(N log N) range"
# Verify Version C is actually quadratic (this assertion *should* pass — proving the bug)
times = [benchmark(lis_fake, n) for n in (1000, 2000, 4000)]
ratios = [times[1]/times[0], times[2]/times[1]]
for r in ratios:
assert 3.5 < r < 4.5, f"Version C ratio {r:.2f} not in O(N^2) range — bug may be fixed?"
Randomized verifier (A == B == C)
random.seed(0)
for _ in range(50):
a = [random.randint(0, 100) for _ in range(random.randint(1, 50))]
assert lis_dp(a) == lis_patience(a) == lis_fake(a)
Follow-up Questions
- What if the input is nearly sorted? Version A’s actual runtime degrades less; profile to confirm.
- Memory profile: which version allocates most? Use
tracemalloc.start()andtracemalloc.get_traced_memory(). - Reconstruct the LIS, not just its length. Track parent pointers; doesn’t change asymptotic complexity but doubles space.
- LIS in a stream (one pass, can’t store everything). Use patience sort with a fixed buffer; gives approximate answer.
- Parallelize LIS. Hard — DP dependencies are sequential. Pipeline by chunks; merge with care.
Product Extension
- Code review of “this should be fast” claims — every senior engineer learns to verify before trusting.
- Database query planners — the planner estimates I/O cost; profiling validates the estimate against real query times.
- CDN cache eviction policies — when comparing LRU vs LFU vs SLRU under real traffic, microbenchmarks lie; full profiles win.
- Production hot-path detection — flamegraphs reveal that 80% of CPU is spent in 3% of the code; optimize there.
Language/Runtime Follow-ups
- Python:
time.perf_counter()is the right clock.cProfileoverhead is ~30%; for tight loops useline_profilerorpy-spy(sampling, ~0% overhead). The GIL means CPU profiling is mostly sequential; for asyncio code useaiomonitororasynciodebug mode. - Java: JMH (Java Microbenchmark Harness) handles JIT warmup, dead-code elimination, and constant folding correctly — handwritten timing loops in Java are often wrong. Use
-prof gcto see allocation cost. - Go:
go test -bench=. -benchmemis the standard.-cpuprofileand-memprofilewrite pprof files; visualize withgo tool pprof. - C++:
perf record+perf reportfor sampling;perf statfor cache misses and IPC. Use-O2or-O3for measurement; debug builds have very different performance. - Node.js: V8’s
--profflag dumps tick logs;--inspectfor Chrome DevTools. Beware turbofan optimization — code that runs cold for the first 10K iterations is suddenly 10× faster after JIT.
Common Bugs
- Timing the cold start — first call includes import/parse/JIT warmup; throw it out.
- Using
time.time()— wall clock; affected by NTP, sleep, system load. - Mean over trials — one stop-the-world GC pause skews the mean; use min instead.
- Measuring with assertions on — Python
-Oflag strips asserts; default mode keeps them, slowing hot loops. - Forgetting
random.seed()— runs aren’t reproducible. - Comparing implementations on different input distributions — random LIS, sorted LIS, and reversed LIS have wildly different runtimes for some algorithms.
Debugging Strategy
When complexity doesn’t match your claim:
- Verify with the doubling test. This is the only way.
- Profile to find the hot function. It’s almost always one inner loop.
- Read the standard library docs for any “built-in” operation you used —
list.insert,dict.update,str +=,Vector.addmay not be what you think. - Check for hidden quadratic behavior in concatenation:
result = result + small_thingin a loop is the classic Java/Python beginner trap. - Verify memory with
tracemalloc— sometimes the “slow” is actually paging, not CPU.
Mastery Criteria
- Implemented all three LIS versions
- Ran the doubling test and observed the 4× / 2.1× / 4× ratios
- Profiled Version C with cProfile (or equivalent) and identified the linear-search bottleneck
- Wrote performance assertions that would catch a regression
- Can recite the expected doubling ratio for O(N), O(N log N), O(N²), O(N³)
- Applied profiling to one of your own past solutions and identified one inefficiency
- Familiarity with at least one profiler in your primary language (output format, common flags, how to interpret)