Lab 06 — Stress Testing: Finding Bugs You Can’t Spot by Reading
Goal
Build a stress-testing harness in <10 minutes that pits a brute-force oracle against a candidate solution on randomly generated small inputs, automatically catching mismatches. Use it to find two intentionally-injected bugs in a candidate solution where reading the code wouldn’t reveal them. By the end, stress testing is your default tool when a CP solution passes samples but fails hidden tests, and you can build a fresh harness for any problem in <10 minutes.
Background
In CP and high-stakes interviews, you frequently face: “my code passes the samples but WAs on hidden tests.” Reading harder doesn’t help — your mental model of the algorithm is exactly what’s wrong. Stress testing breaks this by replacing your brain with the machine: write a slow-but-obviously-correct brute, write a fast random input generator, and let the computer compare outputs on millions of small cases. The first mismatch is a tiny counterexample you can debug by hand.
Top competitive programmers (red CF) use stress testing constantly. It’s the single most underused tool by interview candidates and the fastest way to debug a “looks right but doesn’t work” solution.
Interview Context
Interview problems rarely give you the freedom to write a stress test under time pressure, but the meta-skill — converting “I’m stuck” into a structured experiment — is exactly what staff-level interviews probe. In quant interviews, “How would you validate this code?” is a standard question; “I’d write a brute oracle and stress-test against random inputs” is a strong answer. In CP, stress testing is required at the Div 2/Div 1 level.
This lab is the meta-lab for the entire phase: build the tooling that will save you in every other lab.
Problem Statement
Given a candidate solution and a brute-force oracle for some problem, build a harness that:
- Generates random inputs of small size (so brute is fast).
- Runs both solutions.
- Compares outputs and prints / dies on the first mismatch.
- Uses a seeded RNG so failures are reproducible.
We’ll use prefix sum range queries as the test problem. Sub-problems:
- Brute: for each query
(l, r), suma[l..r]directly.O(N)per query. - Candidate: precompute
prefix[i] = a[0] + ... + a[i-1], answer each query asprefix[r+1] - prefix[l].O(1)per query.
We will deliberately introduce two bugs in the candidate and use the harness to find each.
Constraints
- For stressing:
N ≤ 50, values in[-10, 10],Q ≤ 50queries. Small enough that brute isO(N · Q) = 2500ops per test, allowing ~10⁵ tests / second. - The candidate should pass when correct, fail clearly when buggy.
Clarifying Questions
- “What’s the comparison rule for outputs?” — exact equality (lists of ints).
- “How small should random inputs be?” — small enough that brute finishes in microseconds per test, big enough to expose edge cases. Rule of thumb: smallest size where the candidate’s structure is non-trivial.
- “Is determinism required?” — yes; seed the RNG so the same failing test re-runs identically.
- “Output format on mismatch?” — input that triggered, both outputs, the seed.
Examples
A passing harness run prints nothing (or a PASSED line). A failing run prints the first counterexample:
MISMATCH at iteration 47 (seed=12345):
input: N=5, a=[3, -2, 1, 4, -1], queries=[(0, 4), (1, 3), (2, 2)]
brute: [5, 3, 1]
candidate:[5, 3, 0]
Initial Brute Force
The brute oracle is the brute force here:
def prefix_sum_brute(a, queries):
return [sum(a[l:r+1]) for l, r in queries]
It’s O(N · Q), slow but unambiguous.
Brute Force Complexity
O(N · Q) per test case. For N = Q = 50, ~2500 ops per test. Running 100,000 stress iterations completes in seconds.
Optimization Path
The harness itself doesn’t optimize. The thing being tested (candidate solution) does. The harness’s job is to detect when the optimization is incorrect.
Final Expected Approach
The candidate (intentionally with bugs to discover):
def prefix_sum_candidate(a, queries):
n = len(a)
prefix = [0] * (n + 1)
for i in range(n):
prefix[i + 1] = prefix[i] + a[i]
# BUG 1: should be prefix[r+1] - prefix[l], not prefix[r] - prefix[l]
return [prefix[r] - prefix[l] for l, r in queries]
The harness:
import random
def stress(brute, candidate, gen, n_iters=10000, seed=0):
rng = random.Random(seed)
for it in range(n_iters):
inp = gen(rng)
b_out = brute(*inp)
c_out = candidate(*inp)
if b_out != c_out:
print(f"MISMATCH at iteration {it} (seed={seed}):")
print(f" input: {inp}")
print(f" brute: {b_out}")
print(f" candidate: {c_out}")
return False
print(f"PASSED {n_iters} iterations.")
return True
def gen_prefix_sum(rng):
n = rng.randint(1, 10)
a = [rng.randint(-5, 5) for _ in range(n)]
q = rng.randint(1, 5)
queries = []
for _ in range(q):
l = rng.randint(0, n - 1)
r = rng.randint(l, n - 1)
queries.append((l, r))
return (a, queries)
stress(prefix_sum_brute, prefix_sum_candidate, gen_prefix_sum, n_iters=1000, seed=42)
The harness will fire and report the first failing input within a few iterations. Fix prefix[r] to prefix[r+1]. Re-run.
Now introduce BUG 2 (subtle): use prefix[i] initialized as 0 for i = 0 but set prefix[i+1] = prefix[i] + a[i+1] (off-by-one in the recurrence). Stress finds it again.
After both fixes, the harness reports PASSED 10000 iterations. and you know the candidate is (probably) correct.
Data Structures Used
- A seeded RNG (
random.Random(seed)in Python;mt19937in C++). - A generator function returning a random valid input.
- A brute oracle.
- A candidate solution.
- The harness loop.
Correctness Argument
Why this works. If the brute oracle is correct (small enough that we can verify by hand), and the candidate disagrees, then the candidate is wrong on that input. We have a counterexample. Conversely, if the candidate matches the brute on n_iters = 10⁵ random small inputs, it’s probably correct — the chance that a non-trivial bug survives all of them is small for most input distributions, but not zero. Add adversarial inputs (all same, all max, all min, edge sizes 0, 1) explicitly to the generator to harden coverage.
Why determinism matters. When the harness fires, you want to re-run with the same seed to reproduce; without a seed, the bug might evaporate next run.
Why small inputs. The smaller the input, the faster brute runs (more iterations), and the easier the counterexample is to debug by hand. CF-grade stress tests use N ≤ 5 for the first pass.
Complexity
Per stress iteration: brute is O(N · Q); candidate is O(N + Q); comparison is O(Q). Harness overhead is negligible. For N = Q = 10 and n_iters = 10⁵, total is ~10⁷ ops — under 1 second in Python.
Implementation Requirements
- Seed the RNG explicitly. Print the seed on failure.
- Generator must produce valid inputs (respects all problem constraints — non-empty arrays, valid index ranges, etc.).
- On mismatch, print the minimal failing input. (Optional refinement: shrink the input by retrying with smaller sizes once you’ve found a failure.)
- Cover edge cases: empty array, single element, all-same values, max-size inputs.
Tests
The harness is itself code; it should be tested.
def test_harness_finds_planted_bug():
def buggy(a, queries):
return [sum(a[l:r]) for l, r in queries] # off-by-one: should be a[l:r+1]
# The harness should fire (return False) on a buggy candidate.
result = stress(prefix_sum_brute, buggy, gen_prefix_sum, n_iters=1000, seed=1)
assert result == False
def test_harness_passes_correct_candidate():
def correct(a, queries):
n = len(a)
prefix = [0] * (n + 1)
for i in range(n):
prefix[i + 1] = prefix[i] + a[i]
return [prefix[r + 1] - prefix[l] for l, r in queries]
result = stress(prefix_sum_brute, correct, gen_prefix_sum, n_iters=1000, seed=1)
assert result == True
Follow-up Questions
- “What if brute is also buggy?” — write brute as straightforwardly as possible (read the problem statement and implement it word-for-word). Skip optimizations. If both brute and candidate agree on a bug, you have no oracle and stress testing fails. Mitigation: cross-check brute against the problem’s sample I/O before stressing.
- “How to shrink a counterexample?” — once a failing input is found, repeatedly remove array elements / queries / values; if it still fails, keep the smaller input. Greedy; good enough.
- “Stress testing for graph problems?” — generator emits random small graphs (
N ≤ 6vertices, random edges). Brute is BFS / DFS over all paths. - “What if the answer isn’t unique?” — write a checker instead of a comparator: validate the candidate’s output as one of many valid answers (e.g., for “any valid topological order”).
- “Multi-threaded stress?” — easy with a process pool; each worker has its own seed offset.
Product Extension
Property-based testing in production: tools like Hypothesis (Python), QuickCheck (Haskell), proptest (Rust) generate random inputs and check invariants — same idea, different framing. Fuzz testing for security: AFL, libFuzzer feed random / mutated bytes to a binary and check for crashes. Differential testing across implementations: compare a new compiler against an old one on random programs (CSmith for C, csmith for SQL via SQLancer, etc.). The harness pattern transfers directly.
Language / Runtime Follow-ups
- Python:
random.Random(seed)—mt19937under the hood.pytest+hypothesisfor property-based testing in production code. - C++:
std::mt19937 rng(seed); std::uniform_int_distribution<int> dist(lo, hi);— never userand()(low-quality, broken seeding on Windows). - Java:
java.util.Random(seed)(LCG, low-quality but reproducible) orSplittableRandom(better statistical quality). - Go:
rand.New(rand.NewSource(seed)). The defaultmath/randglobal is not thread-safe. - JS/TS: seedable RNG requires a library (
seedrandom); built-inMath.random()is not seedable.
Common Bugs
- Forgetting to seed the RNG → non-reproducible failures.
- Brute oracle has the same bug as the candidate → false negative (stress passes a buggy solution).
- Generator produces invalid inputs (e.g., negative array sizes) → both brute and candidate crash → not a useful comparison.
- Generator’s distribution is too narrow → never hits edge cases (all-equal, sorted, reverse-sorted, single element).
- Output comparison uses
==on floats with rounding errors → spurious mismatches; use tolerance. - Harness exits silently on first iteration if generator throws → wrap in try/except and report.
- Letting
Ngrow too large → brute is too slow → fewer iterations → less coverage.
Debugging Strategy
When stress fires:
- Print the failing input. Run just that input through both brute and candidate.
- If brute disagrees with your hand-computed answer, brute is buggy. Fix brute first.
- If candidate disagrees with brute (and brute matches your hand-computed answer): trace candidate’s execution on the failing input by hand or in a debugger. The bug is local to a few lines.
- Once fixed, re-run stress with the same seed; if it passes, increment seed and run again.
When stress passes but the real submission still WA:
- Generator might not cover the failing case. Inspect the failing test’s input distribution (size, value range, special structure) and update the generator.
- Add explicit corner cases: empty input, single element, max-size input, all duplicates, all distinct, sorted asc, sorted desc.
- Push
Nhigher; some bugs surface only at scale.
Mastery Criteria
- Build a stress harness from blank for an array problem in <10 minutes.
- Find a planted off-by-one bug in <30 seconds of harness runtime.
- Articulate why the brute oracle must be obviously-correct in one sentence.
- Generate adversarial corner cases (empty, single, all-equal) deliberately, not only random.
- Use the same harness pattern across array, graph, and string problems.
- When a real submission WAs, default to “stress test” instead of “read the code again”.