Lab 03 — Rate Limiter

Goal

Implement four rate-limiting algorithms — token bucket, leaky bucket, sliding window log, sliding window counter — and articulate the tradeoffs between them. After this lab you should be able to pick the right algorithm for a stated workload in under 30 seconds and implement any of the four in under 15 minutes.

Background Concepts

A rate limiter caps the number of requests a key (user, IP, API token) may make over a time window. The four standard algorithms differ in how much history they keep and what kind of bursts they allow:

Token bucket: a bucket of capacity B refills at rate R per second; each request consumes one token; if no token, reject. Allows bursts up to B.
Leaky bucket: requests enter a queue of capacity Q that drains at rate R; if the queue is full, reject. Smooths bursts (output rate is constant).
Sliding window log: keep a list of timestamps over the last W seconds; reject if len(log) ≥ N. Most accurate, most memory.
Sliding window counter: keep a count for the current and previous fixed window; estimate by linear interpolation. Cheap; mildly inaccurate at boundaries.

The token bucket is by far the most-common production choice (used by AWS, Stripe, GitHub) because it gives sensible burst tolerance with O(1) memory per key. Sliding window log is the choice when you must guarantee strict request-count caps (e.g., quota enforcement against a legal contract). Leaky bucket is used in network shaping. Sliding window counter is the right pick when memory is constrained and approximate is acceptable.

Interview Context

Rate limiter is asked at every senior+ practical round at Stripe, Cloudflare, Uber, and most high-scale API companies. The strong answer compares the four algorithms, picks one, justifies the pick, implements it, and answers the inevitable follow-ups about distributed coordination (multiple servers must share one quota), persistence, and observability. The weak answer implements one variant without acknowledging the others exist.

Problem Statement

Design RateLimiter with allow(key) -> bool. Configurable rate R requests per W seconds. Implement all four algorithms behind a common interface so they can be benchmarked.

Constraints

Up to 10^6 distinct keys
Up to 10^5 QPS aggregate
Sub-millisecond per-call latency
Configurable rate per key (follow-up)

Clarifying Questions

Per-key or global limit? (Per-key by convention.)
Should refused requests be queued or rejected? (Token bucket rejects; leaky bucket queues.)
Time source: monotonic clock or wall clock? (Always monotonic — wall clock can jump backward.)
Distributed across N servers, or single-process? (Often a follow-up; default single-process.)
Burst tolerance — yes or no? (Token bucket allows bursts; sliding window log enforces strict.)

Examples

Rate = 5 req / 1 s.

Token bucket (capacity 5, refill 5/s):
  t=0:   5 quick requests   → all allow (bucket drains 5→0)
  t=0.1: 1 request          → reject (bucket=0.5 < 1)
  t=1.0: 1 request          → allow (bucket refilled to 5; now 4)

Sliding window log (limit 5 over 1 s):
  t=0..0.5: 5 requests      → all allow
  t=0.6: 1 request          → reject (6 in last 1 s)
  t=1.1: 1 request          → allow (the t=0 request slid out)

Initial Brute Force

A dict[key, list[timestamp]] and on every allow, filter out timestamps older than W and check len. This is the sliding window log baseline; it is O(history-size) per call and unbounded memory. Acceptable for low-rate testing; not acceptable for production at 10^5 QPS.

Brute Force Complexity

Per call: O(N) where N is the request count in the window. Memory: O(N · keys). At 10^5 QPS over 1-second windows for 10^6 keys, the memory could approach 10^11 timestamps — orders of magnitude too high.

Optimization Path

For each algorithm, the optimization target is different:

Token bucket: store (tokens, last_refill_time) per key. Compute refill on demand: tokens = min(B, tokens + (now - last_refill) * R). O(1) per call, O(1) memory per key.
Leaky bucket: equivalent to token bucket if reject-on-full; if queue requests, store the queue.
Sliding window log: same as brute force, but trim the prefix lazily on each call. O(amortized 1) per call.
Sliding window counter: store (curr_count, prev_count, curr_window_start). Approximate the rolling count with prev * (1 - elapsed/W) + curr. O(1) per call, O(1) memory.

The token bucket is the dominant choice; the other three are presented for comparison.

Final Expected Approach

Define a RateLimiter interface with allow(key) -> bool. Implement four classes. Each takes rate: float (per second) and capacity (or window). Use time.monotonic(). Make all four thread-safe via per-key fine-grained locks (a dict[key, Lock] lazily created — or just a single global lock, which is simpler and adequate for most workloads).

Data Structures Used

Algorithm	Per-key state
Token bucket	`(float tokens, float last_refill_t)`
Leaky bucket (reject)	`(float queue_size, float last_drain_t)`
Sliding log	`deque[float]` of timestamps
Sliding counter	`(int curr, int prev, float window_start)`

Correctness Argument

Token bucket: tokens are produced at rate R continuously and capped at B; consumption is one per allowed request. Equivalent to the differential equation dt/dt = R - consumption, integrated by the lazy refill formula. Correct provided we never let tokens go below 0 (we check >= 1 before decrementing) or above B (the min(B, ...)).

Sliding window log: the invariant is “at any time, the deque contains exactly the timestamps in [now - W, now]”. We maintain it by trimming the prefix on every call. Then allow is len(deque) < N.

Sliding window counter: the approximation is estimate = prev_count * (1 - elapsed/W) + curr_count. This is exact when requests are uniformly distributed within each window and an upper bound otherwise (off by at most one window’s burst). Acceptable for most production rate limiters.

Complexity

Algorithm	Time / call	Space / key
Token bucket	O(1)	O(1)
Leaky bucket	O(1)	O(1)
Sliding log	O(1) amortized	O(N)
Sliding counter	O(1)	O(1)

Implementation Requirements

import time, threading
from collections import deque
from typing import Hashable

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self._rate, self._cap = rate, capacity
        self._state: dict[Hashable, list[float]] = {}    # key -> [tokens, last_t]
        self._lock = threading.Lock()

    def allow(self, key: Hashable) -> bool:
        now = time.monotonic()
        with self._lock:
            s = self._state.get(key)
            if s is None:
                s = [self._cap, now]; self._state[key] = s
            tokens, last = s
            tokens = min(self._cap, tokens + (now - last) * self._rate)
            if tokens >= 1:
                s[0] = tokens - 1; s[1] = now
                return True
            s[0] = tokens; s[1] = now
            return False


class SlidingWindowLog:
    def __init__(self, max_in_window: int, window_s: float):
        self._max, self._w = max_in_window, window_s
        self._logs: dict[Hashable, deque[float]] = {}
        self._lock = threading.Lock()

    def allow(self, key: Hashable) -> bool:
        now = time.monotonic()
        with self._lock:
            dq = self._logs.setdefault(key, deque())
            cutoff = now - self._w
            while dq and dq[0] < cutoff:
                dq.popleft()
            if len(dq) >= self._max:
                return False
            dq.append(now)
            return True


class SlidingWindowCounter:
    def __init__(self, max_in_window: int, window_s: float):
        self._max, self._w = max_in_window, window_s
        self._state: dict[Hashable, list] = {}    # [curr, prev, window_start]
        self._lock = threading.Lock()

    def allow(self, key: Hashable) -> bool:
        now = time.monotonic()
        with self._lock:
            s = self._state.get(key)
            if s is None:
                s = [0, 0, now]; self._state[key] = s
            curr, prev, ws = s
            elapsed = now - ws
            if elapsed >= 2 * self._w:
                curr = prev = 0; ws = now
            elif elapsed >= self._w:
                prev, curr = curr, 0; ws += self._w
                elapsed -= self._w
            estimate = prev * (1 - elapsed / self._w) + curr
            if estimate >= self._max:
                s[0], s[1], s[2] = curr, prev, ws
                return False
            s[0], s[1], s[2] = curr + 1, prev, ws
            return True


class LeakyBucket:
    """Reject-on-full leaky bucket. Equivalent to token bucket when reject."""
    def __init__(self, rate: float, capacity: float):
        self._rate, self._cap = rate, capacity
        self._state: dict[Hashable, list[float]] = {}
        self._lock = threading.Lock()

    def allow(self, key: Hashable) -> bool:
        now = time.monotonic()
        with self._lock:
            s = self._state.get(key)
            if s is None:
                s = [0.0, now]; self._state[key] = s
            level, last = s
            level = max(0.0, level - (now - last) * self._rate)
            if level + 1 > self._cap:
                s[0], s[1] = level, now
                return False
            s[0], s[1] = level + 1, now
            return True

Tests

import unittest, time

class TestTokenBucket(unittest.TestCase):
    def test_burst_then_steady(self):
        rl = TokenBucket(rate=5, capacity=5)
        # Burst: 5 quick allows
        for _ in range(5):
            self.assertTrue(rl.allow("k"))
        # 6th: reject (bucket empty)
        self.assertFalse(rl.allow("k"))
        # Wait 0.4s → 2 tokens accumulated
        time.sleep(0.4)
        self.assertTrue(rl.allow("k"))
        self.assertTrue(rl.allow("k"))

    def test_per_key_isolation(self):
        rl = TokenBucket(rate=1, capacity=1)
        self.assertTrue(rl.allow("a"))
        self.assertTrue(rl.allow("b"))      # different key, full bucket
        self.assertFalse(rl.allow("a"))


class TestSlidingLog(unittest.TestCase):
    def test_strict_count(self):
        rl = SlidingWindowLog(max_in_window=3, window_s=1.0)
        for _ in range(3):
            self.assertTrue(rl.allow("k"))
        self.assertFalse(rl.allow("k"))
        time.sleep(1.05)
        self.assertTrue(rl.allow("k"))      # log has slid out

Follow-up Questions

(3) How would you scale to N nodes? This is the key follow-up. Options: (a) sticky routing — route all requests for a key to a fixed node by hash(key) % N; each node enforces locally. Simple, but rebalancing on add/remove is painful. (b) Centralized counter in Redis using INCR + EXPIRE per window. Network round-trip per call — only works at moderate QPS. (c) Approximate distributed: each node enforces R/N locally and accepts that bursts up to R are possible. The pragmatic real-world answer; documents your error budget. (d) Token bucket in Redis with Lua script for atomic refill+decrement — Stripe and GitHub do this in production.

(7) How would you handle backpressure? The whole point of a rate limiter is backpressure on upstream traffic. The question becomes: when the limiter rejects, what does the client see? HTTP 429 with Retry-After header is the standard. Enable cooperative backoff so clients don’t retry-storm. Optionally include X-RateLimit-Remaining and X-RateLimit-Reset headers (GitHub convention).

(9) What’s the eviction policy? Per-key state grows unbounded if you never clean up. Two strategies: (a) lazy expiry — when a key has been silent for >2W, drop its state on the next access. (b) Background scavenger — periodically scan and remove stale entries. The lazy approach is preferred; it’s O(0) overhead in the steady state.

(11) What configuration knobs? rate and capacity (or window) per limit class. Optionally per-key overrides (a dict[key, (rate, cap)] for VIP customers). Knobs not to expose: the algorithm choice (pick one and stick).

(4) How would you observe / monitor? Allow rate (counter), reject rate (counter), reject ratio (gauge), per-key reject rate (top-K dashboard for finding hot keys). Bucket-fill / queue-length gauge for diagnosing whether you’re rate-limited because of bursts or steady overload.

Product Extension

Stripe’s API uses token bucket with per-account capacity. AWS API Gateway uses token bucket per stage. GitHub’s API uses sliding window with hourly windows visible to clients. Twitter (X) uses fixed windows for some endpoints, sliding for others. The choice depends on the contract you offer customers (“up to 5 burst requests” → token bucket; “exactly 5000 per hour” → sliding log or sliding counter).

Language/Runtime Follow-ups

Python: the implementation above. For high QPS, replace the global lock with a dict[key, Lock] lazily, or shard by hash(key) % N.
Java: Guava’s RateLimiter is a token bucket with smoothing options. For distributed, Bucket4j is excellent.
Go: golang.org/x/time/rate is a token bucket (Allow / Wait). For distributed, use Redis with a Lua script.
C++: no stdlib; use std::chrono::steady_clock::now(). Folly has a token bucket.
JS/TS: bottleneck (npm) is the canonical client-side. Server-side: Redis-backed for distributed.

Common Bugs

Using time.time() (wall clock) instead of time.monotonic() — clock skew or NTP adjustments cause negative deltas and free tokens.
Token bucket: not capping at B — bucket grows unbounded over idle periods; first burst is huge.
Sliding log: not trimming on every call, only on insert — memory grows for read-heavy patterns.
Sliding counter: failing to advance the window pointer when 2W has passed (key idle for long enough that both windows are stale).
Forgetting per-key isolation — a single shared bucket across all keys.

Debugging Strategy

Log every (key, allow/reject, bucket_state) transition for a single key. Hand-trace against expected behavior. For distributed bugs, capture the Lua script’s input and output and replay against a local Redis. For thundering-herd bugs (many clients see “reset” simultaneously and all retry at once), add jitter on Retry-After (server-side recommends Retry-After: random_in_range(t, 2t)).

Mastery Criteria

Implemented all four algorithms in <60 minutes total (15 min each).
All tests pass first run.
Compared the four algorithms verbally in <90 seconds, naming a workload where each is the right choice.
Stated why time.monotonic() is required without prompting.
Answered follow-ups #3 (distributed), #4, #7, #9, #11 in <90 seconds each.
Identified that the leaky bucket and reject-on-full token bucket are mathematically equivalent when reject (different when queue).

LeetCode Interview — Extreme Coding