Lab 06 — Stress Testing: Finding Bugs You Can’t Spot by Reading

Goal

Build a stress-testing harness in <10 minutes that pits a brute-force oracle against a candidate solution on randomly generated small inputs, automatically catching mismatches. Use it to find two intentionally-injected bugs in a candidate solution where reading the code wouldn’t reveal them. By the end, stress testing is your default tool when a CP solution passes samples but fails hidden tests, and you can build a fresh harness for any problem in <10 minutes.

Background

In CP and high-stakes interviews, you frequently face: “my code passes the samples but WAs on hidden tests.” Reading harder doesn’t help — your mental model of the algorithm is exactly what’s wrong. Stress testing breaks this by replacing your brain with the machine: write a slow-but-obviously-correct brute, write a fast random input generator, and let the computer compare outputs on millions of small cases. The first mismatch is a tiny counterexample you can debug by hand.

Top competitive programmers (red CF) use stress testing constantly. It’s the single most underused tool by interview candidates and the fastest way to debug a “looks right but doesn’t work” solution.

Interview Context

Interview problems rarely give you the freedom to write a stress test under time pressure, but the meta-skill — converting “I’m stuck” into a structured experiment — is exactly what staff-level interviews probe. In quant interviews, “How would you validate this code?” is a standard question; “I’d write a brute oracle and stress-test against random inputs” is a strong answer. In CP, stress testing is required at the Div 2/Div 1 level.

This lab is the meta-lab for the entire phase: build the tooling that will save you in every other lab.

Problem Statement

Given a candidate solution and a brute-force oracle for some problem, build a harness that:

Generates random inputs of small size (so brute is fast).
Runs both solutions.
Compares outputs and prints / dies on the first mismatch.
Uses a seeded RNG so failures are reproducible.

We’ll use prefix sum range queries as the test problem. Sub-problems:

Brute: for each query (l, r), sum a[l..r] directly. O(N) per query.
Candidate: precompute prefix[i] = a[0] + ... + a[i-1], answer each query as prefix[r+1] - prefix[l]. O(1) per query.

We will deliberately introduce two bugs in the candidate and use the harness to find each.

Constraints

For stressing: N ≤ 50, values in [-10, 10], Q ≤ 50 queries. Small enough that brute is O(N · Q) = 2500 ops per test, allowing ~10⁵ tests / second.
The candidate should pass when correct, fail clearly when buggy.

Clarifying Questions

“What’s the comparison rule for outputs?” — exact equality (lists of ints).
“How small should random inputs be?” — small enough that brute finishes in microseconds per test, big enough to expose edge cases. Rule of thumb: smallest size where the candidate’s structure is non-trivial.
“Is determinism required?” — yes; seed the RNG so the same failing test re-runs identically.
“Output format on mismatch?” — input that triggered, both outputs, the seed.

Examples

A passing harness run prints nothing (or a PASSED line). A failing run prints the first counterexample:

MISMATCH at iteration 47 (seed=12345):
input:  N=5, a=[3, -2, 1, 4, -1], queries=[(0, 4), (1, 3), (2, 2)]
brute:    [5, 3, 1]
candidate:[5, 3, 0]

Initial Brute Force

The brute oracle is the brute force here:

def prefix_sum_brute(a, queries):
    return [sum(a[l:r+1]) for l, r in queries]

It’s O(N · Q), slow but unambiguous.

Brute Force Complexity

O(N · Q) per test case. For N = Q = 50, ~2500 ops per test. Running 100,000 stress iterations completes in seconds.

Optimization Path

The harness itself doesn’t optimize. The thing being tested (candidate solution) does. The harness’s job is to detect when the optimization is incorrect.

Final Expected Approach

The candidate (intentionally with bugs to discover):

def prefix_sum_candidate(a, queries):
    n = len(a)
    prefix = [0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + a[i]
    # BUG 1: should be prefix[r+1] - prefix[l], not prefix[r] - prefix[l]
    return [prefix[r] - prefix[l] for l, r in queries]

The harness:

import random

def stress(brute, candidate, gen, n_iters=10000, seed=0):
    rng = random.Random(seed)
    for it in range(n_iters):
        inp = gen(rng)
        b_out = brute(*inp)
        c_out = candidate(*inp)
        if b_out != c_out:
            print(f"MISMATCH at iteration {it} (seed={seed}):")
            print(f"  input:     {inp}")
            print(f"  brute:     {b_out}")
            print(f"  candidate: {c_out}")
            return False
    print(f"PASSED {n_iters} iterations.")
    return True

def gen_prefix_sum(rng):
    n = rng.randint(1, 10)
    a = [rng.randint(-5, 5) for _ in range(n)]
    q = rng.randint(1, 5)
    queries = []
    for _ in range(q):
        l = rng.randint(0, n - 1)
        r = rng.randint(l, n - 1)
        queries.append((l, r))
    return (a, queries)

stress(prefix_sum_brute, prefix_sum_candidate, gen_prefix_sum, n_iters=1000, seed=42)

The harness will fire and report the first failing input within a few iterations. Fix prefix[r] to prefix[r+1]. Re-run.

Now introduce BUG 2 (subtle): use prefix[i] initialized as 0 for i = 0 but set prefix[i+1] = prefix[i] + a[i+1] (off-by-one in the recurrence). Stress finds it again.

After both fixes, the harness reports PASSED 10000 iterations. and you know the candidate is (probably) correct.

Data Structures Used

A seeded RNG (random.Random(seed) in Python; mt19937 in C++).
A generator function returning a random valid input.
A brute oracle.
A candidate solution.
The harness loop.

Correctness Argument

Why this works. If the brute oracle is correct (small enough that we can verify by hand), and the candidate disagrees, then the candidate is wrong on that input. We have a counterexample. Conversely, if the candidate matches the brute on n_iters = 10⁵ random small inputs, it’s probably correct — the chance that a non-trivial bug survives all of them is small for most input distributions, but not zero. Add adversarial inputs (all same, all max, all min, edge sizes 0, 1) explicitly to the generator to harden coverage.

Why determinism matters. When the harness fires, you want to re-run with the same seed to reproduce; without a seed, the bug might evaporate next run.

Why small inputs. The smaller the input, the faster brute runs (more iterations), and the easier the counterexample is to debug by hand. CF-grade stress tests use N ≤ 5 for the first pass.

Complexity

Per stress iteration: brute is O(N · Q); candidate is O(N + Q); comparison is O(Q). Harness overhead is negligible. For N = Q = 10 and n_iters = 10⁵, total is ~10⁷ ops — under 1 second in Python.

Implementation Requirements

Seed the RNG explicitly. Print the seed on failure.
Generator must produce valid inputs (respects all problem constraints — non-empty arrays, valid index ranges, etc.).
On mismatch, print the minimal failing input. (Optional refinement: shrink the input by retrying with smaller sizes once you’ve found a failure.)
Cover edge cases: empty array, single element, all-same values, max-size inputs.

Tests

The harness is itself code; it should be tested.

def test_harness_finds_planted_bug():
    def buggy(a, queries):
        return [sum(a[l:r]) for l, r in queries]   # off-by-one: should be a[l:r+1]
    # The harness should fire (return False) on a buggy candidate.
    result = stress(prefix_sum_brute, buggy, gen_prefix_sum, n_iters=1000, seed=1)
    assert result == False

def test_harness_passes_correct_candidate():
    def correct(a, queries):
        n = len(a)
        prefix = [0] * (n + 1)
        for i in range(n):
            prefix[i + 1] = prefix[i] + a[i]
        return [prefix[r + 1] - prefix[l] for l, r in queries]
    result = stress(prefix_sum_brute, correct, gen_prefix_sum, n_iters=1000, seed=1)
    assert result == True

Follow-up Questions

“What if brute is also buggy?” — write brute as straightforwardly as possible (read the problem statement and implement it word-for-word). Skip optimizations. If both brute and candidate agree on a bug, you have no oracle and stress testing fails. Mitigation: cross-check brute against the problem’s sample I/O before stressing.
“How to shrink a counterexample?” — once a failing input is found, repeatedly remove array elements / queries / values; if it still fails, keep the smaller input. Greedy; good enough.
“Stress testing for graph problems?” — generator emits random small graphs (N ≤ 6 vertices, random edges). Brute is BFS / DFS over all paths.
“What if the answer isn’t unique?” — write a checker instead of a comparator: validate the candidate’s output as one of many valid answers (e.g., for “any valid topological order”).
“Multi-threaded stress?” — easy with a process pool; each worker has its own seed offset.

Product Extension

Property-based testing in production: tools like Hypothesis (Python), QuickCheck (Haskell), proptest (Rust) generate random inputs and check invariants — same idea, different framing. Fuzz testing for security: AFL, libFuzzer feed random / mutated bytes to a binary and check for crashes. Differential testing across implementations: compare a new compiler against an old one on random programs (CSmith for C, csmith for SQL via SQLancer, etc.). The harness pattern transfers directly.

Language / Runtime Follow-ups

Python: random.Random(seed) — mt19937 under the hood. pytest + hypothesis for property-based testing in production code.
C++: std::mt19937 rng(seed); std::uniform_int_distribution<int> dist(lo, hi); — never use rand() (low-quality, broken seeding on Windows).
Java: java.util.Random(seed) (LCG, low-quality but reproducible) or SplittableRandom (better statistical quality).
Go: rand.New(rand.NewSource(seed)). The default math/rand global is not thread-safe.
JS/TS: seedable RNG requires a library (seedrandom); built-in Math.random() is not seedable.

Common Bugs

Forgetting to seed the RNG → non-reproducible failures.
Brute oracle has the same bug as the candidate → false negative (stress passes a buggy solution).
Generator produces invalid inputs (e.g., negative array sizes) → both brute and candidate crash → not a useful comparison.
Generator’s distribution is too narrow → never hits edge cases (all-equal, sorted, reverse-sorted, single element).
Output comparison uses == on floats with rounding errors → spurious mismatches; use tolerance.
Harness exits silently on first iteration if generator throws → wrap in try/except and report.
Letting N grow too large → brute is too slow → fewer iterations → less coverage.

Debugging Strategy

When stress fires:

Print the failing input. Run just that input through both brute and candidate.
If brute disagrees with your hand-computed answer, brute is buggy. Fix brute first.
If candidate disagrees with brute (and brute matches your hand-computed answer): trace candidate’s execution on the failing input by hand or in a debugger. The bug is local to a few lines.
Once fixed, re-run stress with the same seed; if it passes, increment seed and run again.

When stress passes but the real submission still WA:

Generator might not cover the failing case. Inspect the failing test’s input distribution (size, value range, special structure) and update the generator.
Add explicit corner cases: empty input, single element, max-size input, all duplicates, all distinct, sorted asc, sorted desc.
Push N higher; some bugs surface only at scale.

Mastery Criteria

Build a stress harness from blank for an array problem in <10 minutes.
Find a planted off-by-one bug in <30 seconds of harness runtime.
Articulate why the brute oracle must be obviously-correct in one sentence.
Generate adversarial corner cases (empty, single, all-equal) deliberately, not only random.
Use the same harness pattern across array, graph, and string problems.
When a real submission WAs, default to “stress test” instead of “read the code again”.

← Lab 05 — Coordinate Compression · Phase 7 README

LeetCode Interview — Extreme Coding