Lab 15 — Retry With Exponential Backoff and Jitter
Goal
Implement a reusable retry(fn, policy) primitive that retries a callable on failure with exponential backoff plus decorrelated jitter, bounded by max attempts and total deadline, with an explicit retryable-error predicate so non-retryable errors fail fast. After this lab you should be able to write a production-shaped retry helper from a blank screen in <15 minutes and articulate why naive sleep(2 ** attempt) is wrong in <60 seconds.
Background Concepts
A retry primitive has four orthogonal knobs: (a) how many times to retry (max attempts and / or total deadline), (b) how long to wait between attempts (the backoff schedule), (c) which errors are retryable (a predicate), and (d) what to do on final failure (raise, return a sentinel, surface diagnostics). The non-trivial knob is (b). Naive exponential backoff wait = base * 2 ** attempt causes a thundering herd: when a downstream service recovers, every retrying client wakes simultaneously and re-overloads it. The fix is jitter: randomize the wait. The two industry-standard schedules are full jitter (wait = uniform(0, base * 2 ** attempt)) and decorrelated jitter (wait = uniform(base, prev_wait * 3), capped at max_wait). Decorrelated jitter is preferred when retries cluster across many clients because its waits are less correlated across attempts.
The total deadline matters as much as the attempt count. A 5-attempt schedule with base=1s, cap=30s can spend up to two minutes blocked — unacceptable for a request-path retry. Production retry helpers always take a deadline.
Interview Context
This is a 20-minute warmup at Stripe, Uber, Cloudflare, and any team whose service makes downstream calls. It’s also a frequent follow-up to the rate-limiter and circuit-breaker labs. Candidates who write for i in range(5): try: return fn() except: time.sleep(2 ** i) get a partial credit; candidates who name jitter, deadline, retryable-error predicate, and the relationship to the circuit breaker (Lab 16) get a strong signal.
Problem Statement
Implement retry(fn, max_attempts, base_delay, max_delay, deadline_s, is_retryable, jitter='decorrelated') that calls fn() repeatedly until it succeeds or the policy gives up. On non-retryable exceptions, fail immediately. On retryable exceptions, sleep according to the schedule and try again. On exceeding max_attempts or deadline_s, raise the last exception wrapped in a RetryExhausted.
Constraints
max_attempts≥ 1 (1 means “no retries”; the function is called at most once).base_delay> 0,max_delay≥base_delay.deadline_smay beNone(no deadline) or a positive float (wall-clock seconds fromretry()invocation).is_retryable: Exception -> boolmust be a pure function.- The implementation must not busy-spin and must respect both the per-attempt cap and the deadline (whichever fires first).
Clarifying Questions
- Is the deadline measured from
retry()invocation or from the first failure? (From invocation — simpler reasoning.) - Should
fn()be called at least once even if the deadline is already past at start? (Yes — at least one attempt.) - Should we sleep after the final attempt? (No — pointless.)
- Does
is_retryableapply to the last attempt’s exception, or do we always re-raise the last? (Re-raise the last; non-retryable short-circuits.) - Synchronous or async? (Both — implement sync first, async variant in follow-ups.)
- Should we surface the attempt count and total elapsed time in the wrapped exception? (Yes — operational visibility.)
Examples
retry(lambda: http_get(url), max_attempts=5, base_delay=0.1, max_delay=10, deadline_s=30,
is_retryable=lambda e: isinstance(e, (TimeoutError, ConnectionError)))
# → returns the response if any attempt succeeds within 30s and 5 tries.
# → raises RetryExhausted("timeout", attempts=5, elapsed=12.3s) on timeout.
# → raises ValueError immediately if fn() raises ValueError (non-retryable).
Initial Brute Force
def retry_naive(fn, max_attempts):
for i in range(max_attempts):
try:
return fn()
except Exception:
if i == max_attempts - 1:
raise
time.sleep(2 ** i)
This is what most candidates write first. It has all four bugs listed above: no jitter (thundering herd), no deadline (unbounded wait), no retryable predicate (retries on programming errors), no cap on max_delay.
Brute Force Complexity
Time: dominated by sleeps; up to Σ 2^i ≈ 2^max_attempts seconds in the worst case. For max_attempts=10, that’s 17 minutes. Space: O(1).
Optimization Path
Add (1) max_delay cap → bounds per-attempt sleep, (2) deadline_s total cap → bounds end-to-end blocking, (3) is_retryable predicate → fast-fails on programmer errors, (4) jitter → spreads herd, (5) structured exception with diagnostics → operational legibility. Each addition is a one-knob change; together they take the primitive from “buggy in production” to “shippable”.
Final Expected Approach
Loop up to max_attempts times. Track wall-clock start. On each attempt, call fn(). On success, return the value. On failure, check is_retryable; if false, re-raise. If we’re at the last attempt or past the deadline, raise RetryExhausted. Otherwise compute the next sleep using the chosen jitter scheme, clip to remaining-deadline so we don’t oversleep, and time.sleep(wait). Log each attempt.
Data Structures Used
- A monotonic clock reference (
time.monotonic()) to compute deadlines — wall-clock can jump. - A small
RetryExhaustedexception class carryingattempts,elapsed,last_exception. - An optional
Loggerfor per-attempt diagnostics (don’t print; inject a logger).
Correctness Argument
We make at most max_attempts calls to fn (loop bound). We sleep between attempts but never after the final one (the loop returns or raises before sleeping past the last attempt). We respect deadline_s by computing remaining = deadline - elapsed and clipping the sleep; if remaining ≤ 0 we raise immediately. Non-retryable exceptions short-circuit by re-raising before the sleep. The exception we surface is always the last underlying failure, wrapped with diagnostics.
Complexity
| Aspect | Cost |
|---|---|
| Wall-clock | bounded by min(deadline_s, Σ wait_i) |
| CPU per failed attempt | O(1) plus fn’s own cost |
| Memory | O(1) |
Implementation Requirements
A complete working implementation is required.
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar
T = TypeVar("T")
class RetryExhausted(Exception):
def __init__(self, message: str, attempts: int, elapsed: float, last_exception: BaseException):
super().__init__(f"{message} (attempts={attempts}, elapsed={elapsed:.2f}s)")
self.attempts = attempts
self.elapsed = elapsed
self.last_exception = last_exception
@dataclass
class RetryPolicy:
max_attempts: int = 5
base_delay: float = 0.1
max_delay: float = 30.0
deadline_s: Optional[float] = None
jitter: str = "decorrelated" # "decorrelated" | "full" | "none"
is_retryable: Callable[[BaseException], bool] = lambda e: True
def __post_init__(self):
if self.max_attempts < 1:
raise ValueError("max_attempts must be >= 1")
if self.base_delay <= 0 or self.max_delay < self.base_delay:
raise ValueError("invalid delay bounds")
def _next_wait(policy: RetryPolicy, attempt: int, prev_wait: float) -> float:
if policy.jitter == "none":
w = min(policy.base_delay * (2 ** attempt), policy.max_delay)
elif policy.jitter == "full":
cap = min(policy.base_delay * (2 ** attempt), policy.max_delay)
w = random.uniform(0, cap)
elif policy.jitter == "decorrelated":
w = min(random.uniform(policy.base_delay, max(prev_wait, policy.base_delay) * 3),
policy.max_delay)
else:
raise ValueError(f"unknown jitter scheme: {policy.jitter}")
return w
def retry(fn: Callable[[], T], policy: RetryPolicy, *, sleep=time.sleep, clock=time.monotonic) -> T:
start = clock()
last_exc: Optional[BaseException] = None
prev_wait = policy.base_delay
for attempt in range(policy.max_attempts):
try:
return fn()
except BaseException as e:
last_exc = e
if not policy.is_retryable(e):
raise
if attempt == policy.max_attempts - 1:
break
elapsed = clock() - start
if policy.deadline_s is not None and elapsed >= policy.deadline_s:
break
wait = _next_wait(policy, attempt, prev_wait)
if policy.deadline_s is not None:
wait = min(wait, max(0.0, policy.deadline_s - elapsed))
if wait > 0:
sleep(wait)
prev_wait = wait
elapsed = clock() - start
raise RetryExhausted("retry exhausted", attempt + 1, elapsed, last_exc) from last_exc
sleep and clock are dependency-injected so tests do not have to wait real time.
Tests
def test_succeeds_first_try():
assert retry(lambda: 42, RetryPolicy(max_attempts=3)) == 42
def test_succeeds_after_failures():
n = {"i": 0}
def fn():
n["i"] += 1
if n["i"] < 3: raise TimeoutError()
return "ok"
assert retry(fn, RetryPolicy(max_attempts=5, base_delay=0.001)) == "ok"
assert n["i"] == 3
def test_non_retryable_short_circuits():
n = {"i": 0}
def fn():
n["i"] += 1
raise ValueError("bad")
policy = RetryPolicy(is_retryable=lambda e: not isinstance(e, ValueError))
try: retry(fn, policy)
except ValueError: pass
assert n["i"] == 1
def test_exhaustion_wraps_exception():
def fn(): raise TimeoutError("nope")
try: retry(fn, RetryPolicy(max_attempts=2, base_delay=0.001))
except RetryExhausted as e:
assert e.attempts == 2
assert isinstance(e.last_exception, TimeoutError)
def test_deadline_respected():
fake_time = [0.0]
def fn(): raise TimeoutError()
sleeps = []
def fake_sleep(t): sleeps.append(t); fake_time[0] += t
def fake_clock(): return fake_time[0]
try:
retry(fn, RetryPolicy(max_attempts=100, base_delay=1, deadline_s=5, jitter="none"),
sleep=fake_sleep, clock=fake_clock)
except RetryExhausted as e:
assert e.elapsed <= 5.001
Follow-up Questions
- How would you make it thread-safe? The function is reentrant — no shared state across calls. The injected
sleepandclockshould themselves be thread-safe (the stdlib ones are). Per-call state (attempt counter, prev_wait) is local. No locks needed. - How would you observe and monitor it? Emit (a)
retry.attemptscounter labeled by callsite and outcome (success_first_try,success_after_retry,exhausted,non_retryable), (b)retry.elapsedhistogram, (c)retry.attempt_counthistogram. Log per-attempt at DEBUG, per-final-failure at WARN. - How would you handle a poison-pill input? A request that always raises a retryable error wastes the deadline on every retry. Wrap repeated callers behind a circuit breaker (Lab 16); after N consecutive
RetryExhausteds, open the breaker and fail fast for a cooldown period. - What configuration knobs would you expose?
max_attempts,base_delay,max_delay,deadline_s,jitterstrategy,is_retryablepredicate. Defaults: 5 / 100ms / 30s / None / decorrelated /lambda e: True. Don’t expose internal multipliers (the 3× in decorrelated jitter) — they’re stable and tuning them in production is a smell. - How would you test it deterministically? Inject
sleepandclock; advance fake time inside the fakesleep. Seedrandomfor reproducible jitter. The test for the deadline above uses this pattern. - What is the relationship to the circuit breaker? A retry without a circuit breaker is dangerous: if the downstream is fully down, every caller retries the full schedule, multiplying load. The right composition is
circuit_breaker(retry(fn))— the breaker short-circuits the retry once it has seen enough failures.
Product Extension
Retry primitives are the workhorse of every microservice’s outbound RPC layer. AWS SDK, Google Cloud SDK, and gRPC all ship retry helpers; their default schedules are decorrelated jitter with deadlines. The is_retryable predicate in production is the hardest knob: HTTP 5xx is usually retryable, 4xx usually is not, but 429 is retryable with Retry-After honored. Lift this complexity into the predicate.
Language/Runtime Follow-ups
- Python: as above. For async, swap
time.sleepforasyncio.sleepand makeretryanasync def. - Java: use
Resilience4jorFailsafein production. Hand-rolled: aRetryPolicybuilder, aCallable<T>argument,Thread.sleep(orScheduledExecutorService.schedulein async). - Go: a function
Retry(ctx context.Context, fn func() error, policy Policy) error. Usetime.NewTimerso actx.Done()can cancel mid-sleep. Cancellation is the deadline mechanism. - C++:
std::this_thread::sleep_forandstd::chronofor the deadline. Pass a stop-token to support cancellation. - JS/TS:
await new Promise(r => setTimeout(r, ms)). The retry function isasync. UseAbortSignalfor the deadline.
Common Bugs
- Sleeping after the final attempt — wastes wall-clock.
- Using
time.time()instead oftime.monotonic()— wall-clock can jump backwards across NTP corrections, causing negativeelapsedand crashes. - Catching
BaseExceptionand swallowingKeyboardInterrupt/SystemExit— never make these retryable. Either narrow the catch or have the predicate exclude them. - Computing the next wait before the deadline check — you sleep past the deadline. Always check elapsed first.
- Forgetting to clip
waittoremaining = deadline - elapsed— a 30s sleep when only 2s of deadline remain. - Not seeding
randomdeterministically in tests — flaky test failures.
Debugging Strategy
When retries don’t fire: print is_retryable(e) for the actual exception; assert it returns True. When they fire too long: print attempt, wait, and clock() - start per attempt — the bug is almost always a missing deadline check or an uncapped jitter computation. When tests are flaky: confirm sleep is injected (no real sleeps in unit tests) and random.seed(0) at the top of the test.
Mastery Criteria
- Wrote the brute force naive retry in <2 minutes from cold start.
-
Added
max_delay,deadline_s,is_retryable, and jitter incrementally, justifying each. - Wrote both full-jitter and decorrelated-jitter formulas from memory.
-
Stated the difference between
time.time()andtime.monotonic()and which to use here. -
Wrote deterministic tests using injected
sleepandclock. - Articulated the retry+circuit-breaker composition in <60 seconds.
- Solved this from a blank screen in <15 minutes including 5 unit tests.
-
Listed the four bugs in the naive
for i in range: sleep(2**i)retry without prompting.