Lab 15 — Retry With Exponential Backoff and Jitter

Goal

Implement a reusable retry(fn, policy) primitive that retries a callable on failure with exponential backoff plus decorrelated jitter, bounded by max attempts and total deadline, with an explicit retryable-error predicate so non-retryable errors fail fast. After this lab you should be able to write a production-shaped retry helper from a blank screen in <15 minutes and articulate why naive sleep(2 ** attempt) is wrong in <60 seconds.

Background Concepts

A retry primitive has four orthogonal knobs: (a) how many times to retry (max attempts and / or total deadline), (b) how long to wait between attempts (the backoff schedule), (c) which errors are retryable (a predicate), and (d) what to do on final failure (raise, return a sentinel, surface diagnostics). The non-trivial knob is (b). Naive exponential backoff wait = base * 2 ** attempt causes a thundering herd: when a downstream service recovers, every retrying client wakes simultaneously and re-overloads it. The fix is jitter: randomize the wait. The two industry-standard schedules are full jitter (wait = uniform(0, base * 2 ** attempt)) and decorrelated jitter (wait = uniform(base, prev_wait * 3), capped at max_wait). Decorrelated jitter is preferred when retries cluster across many clients because its waits are less correlated across attempts.

The total deadline matters as much as the attempt count. A 5-attempt schedule with base=1s, cap=30s can spend up to two minutes blocked — unacceptable for a request-path retry. Production retry helpers always take a deadline.

Interview Context

This is a 20-minute warmup at Stripe, Uber, Cloudflare, and any team whose service makes downstream calls. It’s also a frequent follow-up to the rate-limiter and circuit-breaker labs. Candidates who write for i in range(5): try: return fn() except: time.sleep(2 ** i) get a partial credit; candidates who name jitter, deadline, retryable-error predicate, and the relationship to the circuit breaker (Lab 16) get a strong signal.

Problem Statement

Implement retry(fn, max_attempts, base_delay, max_delay, deadline_s, is_retryable, jitter='decorrelated') that calls fn() repeatedly until it succeeds or the policy gives up. On non-retryable exceptions, fail immediately. On retryable exceptions, sleep according to the schedule and try again. On exceeding max_attempts or deadline_s, raise the last exception wrapped in a RetryExhausted.

Constraints

  • max_attempts ≥ 1 (1 means “no retries”; the function is called at most once).
  • base_delay > 0, max_delaybase_delay.
  • deadline_s may be None (no deadline) or a positive float (wall-clock seconds from retry() invocation).
  • is_retryable: Exception -> bool must be a pure function.
  • The implementation must not busy-spin and must respect both the per-attempt cap and the deadline (whichever fires first).

Clarifying Questions

  1. Is the deadline measured from retry() invocation or from the first failure? (From invocation — simpler reasoning.)
  2. Should fn() be called at least once even if the deadline is already past at start? (Yes — at least one attempt.)
  3. Should we sleep after the final attempt? (No — pointless.)
  4. Does is_retryable apply to the last attempt’s exception, or do we always re-raise the last? (Re-raise the last; non-retryable short-circuits.)
  5. Synchronous or async? (Both — implement sync first, async variant in follow-ups.)
  6. Should we surface the attempt count and total elapsed time in the wrapped exception? (Yes — operational visibility.)

Examples

retry(lambda: http_get(url), max_attempts=5, base_delay=0.1, max_delay=10, deadline_s=30,
      is_retryable=lambda e: isinstance(e, (TimeoutError, ConnectionError)))
# → returns the response if any attempt succeeds within 30s and 5 tries.
# → raises RetryExhausted("timeout", attempts=5, elapsed=12.3s) on timeout.
# → raises ValueError immediately if fn() raises ValueError (non-retryable).

Initial Brute Force

def retry_naive(fn, max_attempts):
    for i in range(max_attempts):
        try:
            return fn()
        except Exception:
            if i == max_attempts - 1:
                raise
            time.sleep(2 ** i)

This is what most candidates write first. It has all four bugs listed above: no jitter (thundering herd), no deadline (unbounded wait), no retryable predicate (retries on programming errors), no cap on max_delay.

Brute Force Complexity

Time: dominated by sleeps; up to Σ 2^i ≈ 2^max_attempts seconds in the worst case. For max_attempts=10, that’s 17 minutes. Space: O(1).

Optimization Path

Add (1) max_delay cap → bounds per-attempt sleep, (2) deadline_s total cap → bounds end-to-end blocking, (3) is_retryable predicate → fast-fails on programmer errors, (4) jitter → spreads herd, (5) structured exception with diagnostics → operational legibility. Each addition is a one-knob change; together they take the primitive from “buggy in production” to “shippable”.

Final Expected Approach

Loop up to max_attempts times. Track wall-clock start. On each attempt, call fn(). On success, return the value. On failure, check is_retryable; if false, re-raise. If we’re at the last attempt or past the deadline, raise RetryExhausted. Otherwise compute the next sleep using the chosen jitter scheme, clip to remaining-deadline so we don’t oversleep, and time.sleep(wait). Log each attempt.

Data Structures Used

  • A monotonic clock reference (time.monotonic()) to compute deadlines — wall-clock can jump.
  • A small RetryExhausted exception class carrying attempts, elapsed, last_exception.
  • An optional Logger for per-attempt diagnostics (don’t print; inject a logger).

Correctness Argument

We make at most max_attempts calls to fn (loop bound). We sleep between attempts but never after the final one (the loop returns or raises before sleeping past the last attempt). We respect deadline_s by computing remaining = deadline - elapsed and clipping the sleep; if remaining ≤ 0 we raise immediately. Non-retryable exceptions short-circuit by re-raising before the sleep. The exception we surface is always the last underlying failure, wrapped with diagnostics.

Complexity

AspectCost
Wall-clockbounded by min(deadline_s, Σ wait_i)
CPU per failed attemptO(1) plus fn’s own cost
MemoryO(1)

Implementation Requirements

A complete working implementation is required.

import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryExhausted(Exception):
    def __init__(self, message: str, attempts: int, elapsed: float, last_exception: BaseException):
        super().__init__(f"{message} (attempts={attempts}, elapsed={elapsed:.2f}s)")
        self.attempts = attempts
        self.elapsed = elapsed
        self.last_exception = last_exception


@dataclass
class RetryPolicy:
    max_attempts: int = 5
    base_delay: float = 0.1
    max_delay: float = 30.0
    deadline_s: Optional[float] = None
    jitter: str = "decorrelated"            # "decorrelated" | "full" | "none"
    is_retryable: Callable[[BaseException], bool] = lambda e: True

    def __post_init__(self):
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")
        if self.base_delay <= 0 or self.max_delay < self.base_delay:
            raise ValueError("invalid delay bounds")


def _next_wait(policy: RetryPolicy, attempt: int, prev_wait: float) -> float:
    if policy.jitter == "none":
        w = min(policy.base_delay * (2 ** attempt), policy.max_delay)
    elif policy.jitter == "full":
        cap = min(policy.base_delay * (2 ** attempt), policy.max_delay)
        w = random.uniform(0, cap)
    elif policy.jitter == "decorrelated":
        w = min(random.uniform(policy.base_delay, max(prev_wait, policy.base_delay) * 3),
                policy.max_delay)
    else:
        raise ValueError(f"unknown jitter scheme: {policy.jitter}")
    return w


def retry(fn: Callable[[], T], policy: RetryPolicy, *, sleep=time.sleep, clock=time.monotonic) -> T:
    start = clock()
    last_exc: Optional[BaseException] = None
    prev_wait = policy.base_delay
    for attempt in range(policy.max_attempts):
        try:
            return fn()
        except BaseException as e:
            last_exc = e
            if not policy.is_retryable(e):
                raise
            if attempt == policy.max_attempts - 1:
                break
            elapsed = clock() - start
            if policy.deadline_s is not None and elapsed >= policy.deadline_s:
                break
            wait = _next_wait(policy, attempt, prev_wait)
            if policy.deadline_s is not None:
                wait = min(wait, max(0.0, policy.deadline_s - elapsed))
            if wait > 0:
                sleep(wait)
            prev_wait = wait
    elapsed = clock() - start
    raise RetryExhausted("retry exhausted", attempt + 1, elapsed, last_exc) from last_exc

sleep and clock are dependency-injected so tests do not have to wait real time.

Tests

def test_succeeds_first_try():
    assert retry(lambda: 42, RetryPolicy(max_attempts=3)) == 42

def test_succeeds_after_failures():
    n = {"i": 0}
    def fn():
        n["i"] += 1
        if n["i"] < 3: raise TimeoutError()
        return "ok"
    assert retry(fn, RetryPolicy(max_attempts=5, base_delay=0.001)) == "ok"
    assert n["i"] == 3

def test_non_retryable_short_circuits():
    n = {"i": 0}
    def fn():
        n["i"] += 1
        raise ValueError("bad")
    policy = RetryPolicy(is_retryable=lambda e: not isinstance(e, ValueError))
    try: retry(fn, policy)
    except ValueError: pass
    assert n["i"] == 1

def test_exhaustion_wraps_exception():
    def fn(): raise TimeoutError("nope")
    try: retry(fn, RetryPolicy(max_attempts=2, base_delay=0.001))
    except RetryExhausted as e:
        assert e.attempts == 2
        assert isinstance(e.last_exception, TimeoutError)

def test_deadline_respected():
    fake_time = [0.0]
    def fn(): raise TimeoutError()
    sleeps = []
    def fake_sleep(t): sleeps.append(t); fake_time[0] += t
    def fake_clock(): return fake_time[0]
    try:
        retry(fn, RetryPolicy(max_attempts=100, base_delay=1, deadline_s=5, jitter="none"),
              sleep=fake_sleep, clock=fake_clock)
    except RetryExhausted as e:
        assert e.elapsed <= 5.001

Follow-up Questions

  1. How would you make it thread-safe? The function is reentrant — no shared state across calls. The injected sleep and clock should themselves be thread-safe (the stdlib ones are). Per-call state (attempt counter, prev_wait) is local. No locks needed.
  2. How would you observe and monitor it? Emit (a) retry.attempts counter labeled by callsite and outcome (success_first_try, success_after_retry, exhausted, non_retryable), (b) retry.elapsed histogram, (c) retry.attempt_count histogram. Log per-attempt at DEBUG, per-final-failure at WARN.
  3. How would you handle a poison-pill input? A request that always raises a retryable error wastes the deadline on every retry. Wrap repeated callers behind a circuit breaker (Lab 16); after N consecutive RetryExhausteds, open the breaker and fail fast for a cooldown period.
  4. What configuration knobs would you expose? max_attempts, base_delay, max_delay, deadline_s, jitter strategy, is_retryable predicate. Defaults: 5 / 100ms / 30s / None / decorrelated / lambda e: True. Don’t expose internal multipliers (the 3× in decorrelated jitter) — they’re stable and tuning them in production is a smell.
  5. How would you test it deterministically? Inject sleep and clock; advance fake time inside the fake sleep. Seed random for reproducible jitter. The test for the deadline above uses this pattern.
  6. What is the relationship to the circuit breaker? A retry without a circuit breaker is dangerous: if the downstream is fully down, every caller retries the full schedule, multiplying load. The right composition is circuit_breaker(retry(fn)) — the breaker short-circuits the retry once it has seen enough failures.

Product Extension

Retry primitives are the workhorse of every microservice’s outbound RPC layer. AWS SDK, Google Cloud SDK, and gRPC all ship retry helpers; their default schedules are decorrelated jitter with deadlines. The is_retryable predicate in production is the hardest knob: HTTP 5xx is usually retryable, 4xx usually is not, but 429 is retryable with Retry-After honored. Lift this complexity into the predicate.

Language/Runtime Follow-ups

  • Python: as above. For async, swap time.sleep for asyncio.sleep and make retry an async def.
  • Java: use Resilience4j or Failsafe in production. Hand-rolled: a RetryPolicy builder, a Callable<T> argument, Thread.sleep (or ScheduledExecutorService.schedule in async).
  • Go: a function Retry(ctx context.Context, fn func() error, policy Policy) error. Use time.NewTimer so a ctx.Done() can cancel mid-sleep. Cancellation is the deadline mechanism.
  • C++: std::this_thread::sleep_for and std::chrono for the deadline. Pass a stop-token to support cancellation.
  • JS/TS: await new Promise(r => setTimeout(r, ms)). The retry function is async. Use AbortSignal for the deadline.

Common Bugs

  1. Sleeping after the final attempt — wastes wall-clock.
  2. Using time.time() instead of time.monotonic() — wall-clock can jump backwards across NTP corrections, causing negative elapsed and crashes.
  3. Catching BaseException and swallowing KeyboardInterrupt / SystemExit — never make these retryable. Either narrow the catch or have the predicate exclude them.
  4. Computing the next wait before the deadline check — you sleep past the deadline. Always check elapsed first.
  5. Forgetting to clip wait to remaining = deadline - elapsed — a 30s sleep when only 2s of deadline remain.
  6. Not seeding random deterministically in tests — flaky test failures.

Debugging Strategy

When retries don’t fire: print is_retryable(e) for the actual exception; assert it returns True. When they fire too long: print attempt, wait, and clock() - start per attempt — the bug is almost always a missing deadline check or an uncapped jitter computation. When tests are flaky: confirm sleep is injected (no real sleeps in unit tests) and random.seed(0) at the top of the test.

Mastery Criteria

  • Wrote the brute force naive retry in <2 minutes from cold start.
  • Added max_delay, deadline_s, is_retryable, and jitter incrementally, justifying each.
  • Wrote both full-jitter and decorrelated-jitter formulas from memory.
  • Stated the difference between time.time() and time.monotonic() and which to use here.
  • Wrote deterministic tests using injected sleep and clock.
  • Articulated the retry+circuit-breaker composition in <60 seconds.
  • Solved this from a blank screen in <15 minutes including 5 unit tests.
  • Listed the four bugs in the naive for i in range: sleep(2**i) retry without prompting.