Lab 05 — Stress Testing Harness (Two-Pointer Variants)

Goal

Build a reusable stress-testing harness: a randomized input generator + a known-correct brute-force verifier + a diff loop that finds the smallest failing input. This is the single most valuable debugging tool in competitive programming, and it is shockingly underused in interview prep. After this lab you should reach for the harness automatically whenever your solution passes the given examples but you don’t trust it.

Background Concepts

A stress test has three components:

Generator (gen) — produces random valid inputs, parameterized by size and seed for reproducibility.
Brute force (brute) — a known-correct slow solution. Often O(N²) or O(2^N), valid only for tiny N.
Fast solution (fast) — the optimized solution you’re testing.

The harness loops: generate input → run both → compare. On mismatch, print the input and both outputs and stop. Then shrink the failing input to the smallest case that still fails — this is what makes debugging fast.

Why this works: brute force is correct by construction (it tries everything). Any discrepancy is your bug, not the brute force’s. Random testing covers cases you didn’t think of.

Why people don’t use it: they think writing brute force is “wasted time.” It is not; in interview prep, the brute force is also your starting point for the optimization conversation with the interviewer.

Interview Context

Stress testing rarely happens during a 45-min interview, but the practice habit shows up in your interview performance:

You instantly know how to write the brute force (which the interviewer always wants you to articulate first).
You catch bugs in practice that would otherwise be silently learned-wrong, then deployed mid-interview.
You build pattern recognition for “this kind of two-pointer has off-by-one risk” because you’ve seen the harness flag them.

At competitive companies (Jane Street, Hudson River, Citadel) and at top-tier interviews (Google L6+), interviewers will sometimes ask “how would you verify this is correct beyond running the examples?” The answer is the stress harness.

Problem Statement

Implement and stress-test three two-pointer problems known to have subtle off-by-one bugs:

A. Two Sum II (sorted array). Given a sorted array a and target t, return indices (i, j) with i < j and a[i] + a[j] == t, or (-1, -1) if no such pair exists.

B. Container With Most Water (LC 11). Given heights h, find indices (i, j) maximizing (j - i) * min(h[i], h[j]).

C. 3Sum (LC 15). Given nums, return all unique triples (a, b, c) with a + b + c == 0. Each triple sorted ascending; output deduplicated.

For each, write the fast solution, write the brute force, and write the stress harness. Run for ≥1000 random trials.

Constraints

|a| ≤ 1000 for stress; ≤ 10^5 for the real fast solution
-1000 ≤ a[i] ≤ 1000

Clarifying Questions

(These problems are standard; the lab is about the harness, not disambiguation.)

Examples

two_sum_sorted([1, 3, 4, 5, 7], 9)  → (1, 3)   # 3 + 5 = 8? no — actually (2, 3) since 4+5=9
container([1, 8, 6, 2, 5, 4, 8, 3, 7]) → 49     # i=1 (h=8), j=8 (h=7), width=7
3sum([-1, 0, 1, 2, -1, -4]) → [[-1, -1, 2], [-1, 0, 1]]

Initial Brute Force

Two Sum: O(N²) double loop. Trivially correct. Container: O(N²) double loop, take max. 3Sum: O(N³) triple loop; collect, sort each triple, deduplicate via set of tuples.

Brute Force Complexity

O(N²), O(N²), O(N³). All valid for N ≤ 100 in <1 sec.

Optimization Path

All three are classic two-pointer problems. After sorting (for 3Sum and Two Sum), pointers move from both ends inward based on the comparison.

Final Expected Approach

Fast solutions

def two_sum_sorted(a, t):
    i, j = 0, len(a) - 1
    while i < j:
        s = a[i] + a[j]
        if s == t: return (i, j)
        elif s < t: i += 1
        else: j -= 1
    return (-1, -1)

def container(h):
    i, j = 0, len(h) - 1
    best = 0
    while i < j:
        best = max(best, (j - i) * min(h[i], h[j]))
        if h[i] < h[j]: i += 1
        else: j -= 1
    return best

def three_sum(nums):
    nums = sorted(nums)
    n = len(nums)
    res = []
    for i in range(n - 2):
        if i > 0 and nums[i] == nums[i-1]: continue   # skip dup anchor
        j, k = i + 1, n - 1
        while j < k:
            s = nums[i] + nums[j] + nums[k]
            if s == 0:
                res.append([nums[i], nums[j], nums[k]])
                j += 1; k -= 1
                while j < k and nums[j] == nums[j-1]: j += 1   # skip dup j
                while j < k and nums[k] == nums[k+1]: k -= 1   # skip dup k
            elif s < 0: j += 1
            else: k -= 1
    return res

Brute forces

def brute_two_sum(a, t):
    for i in range(len(a)):
        for j in range(i+1, len(a)):
            if a[i] + a[j] == t: return (i, j)
    return (-1, -1)

def brute_container(h):
    best = 0
    for i in range(len(h)):
        for j in range(i+1, len(h)):
            best = max(best, (j - i) * min(h[i], h[j]))
    return best

def brute_3sum(nums):
    n = len(nums)
    found = set()
    for i in range(n):
        for j in range(i+1, n):
            for k in range(j+1, n):
                if nums[i] + nums[j] + nums[k] == 0:
                    found.add(tuple(sorted([nums[i], nums[j], nums[k]])))
    return sorted([list(t) for t in found])

The Stress Harness

import random

def stress(gen, brute, fast, normalize, trials=2000, seed=0):
    random.seed(seed)
    for t in range(trials):
        inp = gen()
        b = normalize(brute(*inp))
        f = normalize(fast(*inp))
        if b != f:
            print(f"FAIL on trial {t}")
            print(f"  input: {inp}")
            print(f"  brute: {b}")
            print(f"  fast:  {f}")
            # Shrink: try to find a smaller failing input
            shrunk = shrink_input(inp, brute, fast, normalize)
            print(f"  smallest failing input: {shrunk}")
            return False
    print(f"PASS {trials} trials")
    return True

def shrink_input(inp, brute, fast, normalize):
    """Greedy shrink — drop elements one at a time, keep if still fails."""
    arr, *rest = inp
    current = list(arr)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i+1:]
            if len(candidate) < 2: continue
            try:
                if normalize(brute(candidate, *rest)) != normalize(fast(candidate, *rest)):
                    current = candidate; changed = True; break
            except Exception:
                continue
    return (current, *rest)

# Generators
def gen_two_sum():
    n = random.randint(2, 20)
    a = sorted(random.randint(-30, 30) for _ in range(n))
    t = random.randint(-60, 60)
    return (a, t)

def gen_container():
    n = random.randint(2, 30)
    return ([random.randint(0, 20) for _ in range(n)],)

def gen_3sum():
    n = random.randint(3, 15)
    return ([random.randint(-10, 10) for _ in range(n)],)

# Normalizers (canonicalize output before comparison)
def norm_two_sum(r):
    # Both -1, -1 OR a valid pair; for the pair, the sum is what matters, not index
    if r == (-1, -1): return None
    return "found"  # we only care that one was found; index may differ
    # NOTE: If indices must match exactly, change the brute force to scan in two-pointer order

def norm_container(x): return x   # int, direct compare
def norm_3sum(triples): return sorted([sorted(t) for t in triples])

# Run
stress(gen_two_sum, brute_two_sum, two_sum_sorted, norm_two_sum)
stress(gen_container, brute_container, container, norm_container)
stress(gen_3sum, brute_3sum, three_sum, norm_3sum)

Data Structures Used

Lists of integers
Set of tuples for 3Sum deduplication in brute force
A small library of helper functions (gen, brute, fast, normalize, stress, shrink_input) that you reuse across problems

Correctness Argument

The brute force is correct because it enumerates all valid candidates (O(N^k) for k-sum). Any output from the fast solution that disagrees with the brute is a bug in the fast solution. Random sampling over thousands of trials gives high confidence (though not certainty) that the fast is correct; the smaller the input space (e.g., values in [-10, 10] with N ≤ 15), the more confident.

Complexity

Harness adds zero asymptotic cost — the fast solution’s complexity is unchanged. Each trial costs O(brute) which is the bottleneck; with N ≤ 30 it runs ~2000 trials in <2 seconds.

Implementation Requirements

Use random.seed() for reproducibility — failures must be re-runnable
Print the failing input before the outputs, so you can copy-paste and re-run
Always normalize outputs before comparison (canonical sort order, etc.)
Implement shrinking — a 20-element failure is hard to debug; a 4-element failure is obvious

Tests

The harness itself, as a test

# Sanity: plant a bug in the fast solution and verify the harness catches it
def buggy_two_sum(a, t):
    i, j = 0, len(a) - 1
    while i < j:
        s = a[i] + a[j]
        if s == t: return (i, j)
        elif s < t: i += 1
        else: j -= 1
    return (0, 0)   # BUG: should return (-1, -1)

assert stress(gen_two_sum, brute_two_sum, buggy_two_sum, norm_two_sum, trials=500) is False

Pass-through on the correct solutions

assert stress(gen_two_sum, brute_two_sum, two_sum_sorted, norm_two_sum, trials=2000) is True
assert stress(gen_container, brute_container, container, norm_container, trials=2000) is True
assert stress(gen_3sum, brute_3sum, three_sum, norm_3sum, trials=2000) is True

Edge generators

Add specialized generators that stress edge cases:

def gen_two_sum_edge():
    """Heavy on duplicates and boundary targets."""
    n = random.randint(2, 10)
    a = sorted([random.choice([-1, 0, 1]) for _ in range(n)])
    t = random.choice([-2, 0, 2])
    return (a, t)

Follow-up Questions

Generator-based testing (Hypothesis library). Python’s hypothesis library generates inputs and shrinks them automatically. Show how to convert the harness into Hypothesis strategies.
Detecting performance regressions. Add timing to the harness; flag when fast > 10× the previous run on the same seed.
Coverage-guided fuzzing. Use atheris or similar to mutate inputs that increase code coverage; finds rarer bugs than purely random.
Concurrent stress testing. Run brute and fast on different threads; useful for testing thread-safe versions.
What if there’s no brute force? Then write two independent fast solutions (different algorithms) and stress them against each other. Common for geometry problems.

Product Extension

CI-integrated fuzzing. Google’s OSS-Fuzz runs continuous random testing on open-source projects; finds thousands of bugs annually.
Property-based testing in production. Stripe, Jane Street, Klarna use property tests to validate financial logic where the brute force is “the spec.”
Differential testing. Compare two implementations of the same protocol (e.g., two JSON parsers) on random inputs to find spec ambiguities.

Language/Runtime Follow-ups

Python: hypothesis is the gold standard for property-based testing. random.seed() is per-thread; for parallel stress, use independent random.Random instances.
Java: jqwik or junit-quickcheck for property-based; JMH for performance regression detection.
Go: Built-in testing/quick and (Go 1.18+) native fuzzing with go test -fuzz.
C++: rapidcheck (QuickCheck-style); LLVM’s libFuzzer for coverage-guided.
Rust: proptest and quickcheck crates; native cargo fuzz.

Common Bugs

Non-reproducible failures — forgot random.seed(); can’t re-run the failing case.
Output comparison fails due to ordering — set vs list, dict iteration order; always normalize.
Brute force itself is buggy — verify the brute on the given problem examples first.
Generator produces invalid inputs — e.g., for Two Sum Sorted, the generator must produce a sorted array. Verify with assert all(a[i] <= a[i+1] for i in range(len(a)-1)).
Shrinker breaks the input invariant — for Two Sum Sorted, dropping an element keeps the array sorted; but for a tree-structured input, dropping a node may break invariants. Custom shrinkers per problem.

Debugging Strategy

When the harness reports a failure:

Read the smallest failing input that the shrinker produced. If it’s ≤ 5 elements, trace by hand.
Run only the fast solution with prints on that small input. Compare to expected.
The bug is almost always in a boundary condition — empty input, single element, all duplicates, exact-target match.
If the bug only appears with duplicates, suspect your dedup logic (3Sum is famous for this).
If the bug only appears with negatives, suspect signed comparisons or abs() misuse.

Mastery Criteria

Built the stress harness in under 20 minutes for the three target problems
Caught at least one bug by planting one and verifying the harness flagged it
Wrote a shrinker that reduces failures to ≤ 10 elements
Ran 2000+ trials per problem with no failure
Built a reusable harness module you can drop into any future problem
Applied the harness to one Phase 2 or Phase 5 problem and either confirmed correctness or found a bug
Can explain why random testing complements (not replaces) edge-case enumeration

LeetCode Interview — Extreme Coding