Lab 03 — Rate Limiter
Goal
Implement four rate-limiting algorithms — token bucket, leaky bucket, sliding window log, sliding window counter — and articulate the tradeoffs between them. After this lab you should be able to pick the right algorithm for a stated workload in under 30 seconds and implement any of the four in under 15 minutes.
Background Concepts
A rate limiter caps the number of requests a key (user, IP, API token) may make over a time window. The four standard algorithms differ in how much history they keep and what kind of bursts they allow:
- Token bucket: a bucket of capacity
Brefills at rateRper second; each request consumes one token; if no token, reject. Allows bursts up toB. - Leaky bucket: requests enter a queue of capacity
Qthat drains at rateR; if the queue is full, reject. Smooths bursts (output rate is constant). - Sliding window log: keep a list of timestamps over the last
Wseconds; reject iflen(log) ≥ N. Most accurate, most memory. - Sliding window counter: keep a count for the current and previous fixed window; estimate by linear interpolation. Cheap; mildly inaccurate at boundaries.
The token bucket is by far the most-common production choice (used by AWS, Stripe, GitHub) because it gives sensible burst tolerance with O(1) memory per key. Sliding window log is the choice when you must guarantee strict request-count caps (e.g., quota enforcement against a legal contract). Leaky bucket is used in network shaping. Sliding window counter is the right pick when memory is constrained and approximate is acceptable.
Interview Context
Rate limiter is asked at every senior+ practical round at Stripe, Cloudflare, Uber, and most high-scale API companies. The strong answer compares the four algorithms, picks one, justifies the pick, implements it, and answers the inevitable follow-ups about distributed coordination (multiple servers must share one quota), persistence, and observability. The weak answer implements one variant without acknowledging the others exist.
Problem Statement
Design RateLimiter with allow(key) -> bool. Configurable rate R requests per W seconds. Implement all four algorithms behind a common interface so they can be benchmarked.
Constraints
- Up to 10^6 distinct keys
- Up to 10^5 QPS aggregate
- Sub-millisecond per-call latency
- Configurable rate per key (follow-up)
Clarifying Questions
- Per-key or global limit? (Per-key by convention.)
- Should refused requests be queued or rejected? (Token bucket rejects; leaky bucket queues.)
- Time source: monotonic clock or wall clock? (Always monotonic — wall clock can jump backward.)
- Distributed across N servers, or single-process? (Often a follow-up; default single-process.)
- Burst tolerance — yes or no? (Token bucket allows bursts; sliding window log enforces strict.)
Examples
Rate = 5 req / 1 s.
Token bucket (capacity 5, refill 5/s):
t=0: 5 quick requests → all allow (bucket drains 5→0)
t=0.1: 1 request → reject (bucket=0.5 < 1)
t=1.0: 1 request → allow (bucket refilled to 5; now 4)
Sliding window log (limit 5 over 1 s):
t=0..0.5: 5 requests → all allow
t=0.6: 1 request → reject (6 in last 1 s)
t=1.1: 1 request → allow (the t=0 request slid out)
Initial Brute Force
A dict[key, list[timestamp]] and on every allow, filter out timestamps older than W and check len. This is the sliding window log baseline; it is O(history-size) per call and unbounded memory. Acceptable for low-rate testing; not acceptable for production at 10^5 QPS.
Brute Force Complexity
Per call: O(N) where N is the request count in the window. Memory: O(N · keys). At 10^5 QPS over 1-second windows for 10^6 keys, the memory could approach 10^11 timestamps — orders of magnitude too high.
Optimization Path
For each algorithm, the optimization target is different:
- Token bucket: store
(tokens, last_refill_time)per key. Compute refill on demand:tokens = min(B, tokens + (now - last_refill) * R). O(1) per call, O(1) memory per key. - Leaky bucket: equivalent to token bucket if reject-on-full; if queue requests, store the queue.
- Sliding window log: same as brute force, but trim the prefix lazily on each call. O(amortized 1) per call.
- Sliding window counter: store
(curr_count, prev_count, curr_window_start). Approximate the rolling count withprev * (1 - elapsed/W) + curr. O(1) per call, O(1) memory.
The token bucket is the dominant choice; the other three are presented for comparison.
Final Expected Approach
Define a RateLimiter interface with allow(key) -> bool. Implement four classes. Each takes rate: float (per second) and capacity (or window). Use time.monotonic(). Make all four thread-safe via per-key fine-grained locks (a dict[key, Lock] lazily created — or just a single global lock, which is simpler and adequate for most workloads).
Data Structures Used
| Algorithm | Per-key state |
|---|---|
| Token bucket | (float tokens, float last_refill_t) |
| Leaky bucket (reject) | (float queue_size, float last_drain_t) |
| Sliding log | deque[float] of timestamps |
| Sliding counter | (int curr, int prev, float window_start) |
Correctness Argument
Token bucket: tokens are produced at rate R continuously and capped at B; consumption is one per allowed request. Equivalent to the differential equation dt/dt = R - consumption, integrated by the lazy refill formula. Correct provided we never let tokens go below 0 (we check >= 1 before decrementing) or above B (the min(B, ...)).
Sliding window log: the invariant is “at any time, the deque contains exactly the timestamps in [now - W, now]”. We maintain it by trimming the prefix on every call. Then allow is len(deque) < N.
Sliding window counter: the approximation is estimate = prev_count * (1 - elapsed/W) + curr_count. This is exact when requests are uniformly distributed within each window and an upper bound otherwise (off by at most one window’s burst). Acceptable for most production rate limiters.
Complexity
| Algorithm | Time / call | Space / key |
|---|---|---|
| Token bucket | O(1) | O(1) |
| Leaky bucket | O(1) | O(1) |
| Sliding log | O(1) amortized | O(N) |
| Sliding counter | O(1) | O(1) |
Implementation Requirements
import time, threading
from collections import deque
from typing import Hashable
class TokenBucket:
def __init__(self, rate: float, capacity: float):
self._rate, self._cap = rate, capacity
self._state: dict[Hashable, list[float]] = {} # key -> [tokens, last_t]
self._lock = threading.Lock()
def allow(self, key: Hashable) -> bool:
now = time.monotonic()
with self._lock:
s = self._state.get(key)
if s is None:
s = [self._cap, now]; self._state[key] = s
tokens, last = s
tokens = min(self._cap, tokens + (now - last) * self._rate)
if tokens >= 1:
s[0] = tokens - 1; s[1] = now
return True
s[0] = tokens; s[1] = now
return False
class SlidingWindowLog:
def __init__(self, max_in_window: int, window_s: float):
self._max, self._w = max_in_window, window_s
self._logs: dict[Hashable, deque[float]] = {}
self._lock = threading.Lock()
def allow(self, key: Hashable) -> bool:
now = time.monotonic()
with self._lock:
dq = self._logs.setdefault(key, deque())
cutoff = now - self._w
while dq and dq[0] < cutoff:
dq.popleft()
if len(dq) >= self._max:
return False
dq.append(now)
return True
class SlidingWindowCounter:
def __init__(self, max_in_window: int, window_s: float):
self._max, self._w = max_in_window, window_s
self._state: dict[Hashable, list] = {} # [curr, prev, window_start]
self._lock = threading.Lock()
def allow(self, key: Hashable) -> bool:
now = time.monotonic()
with self._lock:
s = self._state.get(key)
if s is None:
s = [0, 0, now]; self._state[key] = s
curr, prev, ws = s
elapsed = now - ws
if elapsed >= 2 * self._w:
curr = prev = 0; ws = now
elif elapsed >= self._w:
prev, curr = curr, 0; ws += self._w
elapsed -= self._w
estimate = prev * (1 - elapsed / self._w) + curr
if estimate >= self._max:
s[0], s[1], s[2] = curr, prev, ws
return False
s[0], s[1], s[2] = curr + 1, prev, ws
return True
class LeakyBucket:
"""Reject-on-full leaky bucket. Equivalent to token bucket when reject."""
def __init__(self, rate: float, capacity: float):
self._rate, self._cap = rate, capacity
self._state: dict[Hashable, list[float]] = {}
self._lock = threading.Lock()
def allow(self, key: Hashable) -> bool:
now = time.monotonic()
with self._lock:
s = self._state.get(key)
if s is None:
s = [0.0, now]; self._state[key] = s
level, last = s
level = max(0.0, level - (now - last) * self._rate)
if level + 1 > self._cap:
s[0], s[1] = level, now
return False
s[0], s[1] = level + 1, now
return True
Tests
import unittest, time
class TestTokenBucket(unittest.TestCase):
def test_burst_then_steady(self):
rl = TokenBucket(rate=5, capacity=5)
# Burst: 5 quick allows
for _ in range(5):
self.assertTrue(rl.allow("k"))
# 6th: reject (bucket empty)
self.assertFalse(rl.allow("k"))
# Wait 0.4s → 2 tokens accumulated
time.sleep(0.4)
self.assertTrue(rl.allow("k"))
self.assertTrue(rl.allow("k"))
def test_per_key_isolation(self):
rl = TokenBucket(rate=1, capacity=1)
self.assertTrue(rl.allow("a"))
self.assertTrue(rl.allow("b")) # different key, full bucket
self.assertFalse(rl.allow("a"))
class TestSlidingLog(unittest.TestCase):
def test_strict_count(self):
rl = SlidingWindowLog(max_in_window=3, window_s=1.0)
for _ in range(3):
self.assertTrue(rl.allow("k"))
self.assertFalse(rl.allow("k"))
time.sleep(1.05)
self.assertTrue(rl.allow("k")) # log has slid out
Follow-up Questions
(3) How would you scale to N nodes? This is the key follow-up. Options: (a) sticky routing — route all requests for a key to a fixed node by hash(key) % N; each node enforces locally. Simple, but rebalancing on add/remove is painful. (b) Centralized counter in Redis using INCR + EXPIRE per window. Network round-trip per call — only works at moderate QPS. (c) Approximate distributed: each node enforces R/N locally and accepts that bursts up to R are possible. The pragmatic real-world answer; documents your error budget. (d) Token bucket in Redis with Lua script for atomic refill+decrement — Stripe and GitHub do this in production.
(7) How would you handle backpressure? The whole point of a rate limiter is backpressure on upstream traffic. The question becomes: when the limiter rejects, what does the client see? HTTP 429 with Retry-After header is the standard. Enable cooperative backoff so clients don’t retry-storm. Optionally include X-RateLimit-Remaining and X-RateLimit-Reset headers (GitHub convention).
(9) What’s the eviction policy? Per-key state grows unbounded if you never clean up. Two strategies: (a) lazy expiry — when a key has been silent for >2W, drop its state on the next access. (b) Background scavenger — periodically scan and remove stale entries. The lazy approach is preferred; it’s O(0) overhead in the steady state.
(11) What configuration knobs? rate and capacity (or window) per limit class. Optionally per-key overrides (a dict[key, (rate, cap)] for VIP customers). Knobs not to expose: the algorithm choice (pick one and stick).
(4) How would you observe / monitor? Allow rate (counter), reject rate (counter), reject ratio (gauge), per-key reject rate (top-K dashboard for finding hot keys). Bucket-fill / queue-length gauge for diagnosing whether you’re rate-limited because of bursts or steady overload.
Product Extension
Stripe’s API uses token bucket with per-account capacity. AWS API Gateway uses token bucket per stage. GitHub’s API uses sliding window with hourly windows visible to clients. Twitter (X) uses fixed windows for some endpoints, sliding for others. The choice depends on the contract you offer customers (“up to 5 burst requests” → token bucket; “exactly 5000 per hour” → sliding log or sliding counter).
Language/Runtime Follow-ups
- Python: the implementation above. For high QPS, replace the global lock with a
dict[key, Lock]lazily, or shard byhash(key) % N. - Java: Guava’s
RateLimiteris a token bucket with smoothing options. For distributed, Bucket4j is excellent. - Go:
golang.org/x/time/rateis a token bucket (Allow/Wait). For distributed, use Redis with a Lua script. - C++: no stdlib; use
std::chrono::steady_clock::now(). Folly has a token bucket. - JS/TS:
bottleneck(npm) is the canonical client-side. Server-side: Redis-backed for distributed.
Common Bugs
- Using
time.time()(wall clock) instead oftime.monotonic()— clock skew or NTP adjustments cause negative deltas and free tokens. - Token bucket: not capping at
B— bucket grows unbounded over idle periods; first burst is huge. - Sliding log: not trimming on every call, only on insert — memory grows for read-heavy patterns.
- Sliding counter: failing to advance the window pointer when 2W has passed (key idle for long enough that both windows are stale).
- Forgetting per-key isolation — a single shared bucket across all keys.
Debugging Strategy
Log every (key, allow/reject, bucket_state) transition for a single key. Hand-trace against expected behavior. For distributed bugs, capture the Lua script’s input and output and replay against a local Redis. For thundering-herd bugs (many clients see “reset” simultaneously and all retry at once), add jitter on Retry-After (server-side recommends Retry-After: random_in_range(t, 2t)).
Mastery Criteria
- Implemented all four algorithms in <60 minutes total (15 min each).
- All tests pass first run.
- Compared the four algorithms verbally in <90 seconds, naming a workload where each is the right choice.
-
Stated why
time.monotonic()is required without prompting. - Answered follow-ups #3 (distributed), #4, #7, #9, #11 in <90 seconds each.
- Identified that the leaky bucket and reject-on-full token bucket are mathematically equivalent when reject (different when queue).