Lab 10 — Consistent Hashing
Goal
Implement a consistent hash ring with virtual nodes that minimizes key remapping when servers are added or removed. After this lab you should be able to design and implement consistent hashing in under 30 minutes and articulate why it beats hash(key) % N.
Background Concepts
The naive sharding scheme hash(key) % N has a catastrophic failure mode: when N changes (a server is added or removed), nearly every key remaps to a different shard. For an in-memory cache fleet, this means the entire cache is invalidated; for a stateful sharded store, this means most data must be physically migrated.
Consistent hashing solves this. Servers are placed on a ring (a circular hash space, e.g., [0, 2^64)). Each key is hashed onto the ring and assigned to the next server clockwise. When a server is added, only keys between its predecessor and itself on the ring are remapped. When a server is removed, only its keys are remapped — to its successor.
Without virtual nodes (vnodes), the ring is unbalanced: a 4-server ring assigns wildly unequal slices. Virtual nodes fix this: each physical server gets V ring positions (e.g., V=200). The ring becomes statistically balanced in O(1/sqrt(V)) deviation.
Interview Context
Consistent hashing is the default sharding mechanism for distributed caches (Memcached client libraries, Redis Cluster’s slot variant), distributed databases (DynamoDB, Cassandra), and load balancers (HAProxy, Envoy with ring_hash). It is asked at infrastructure roles at every Big Tech and many high-scale companies.
Problem Statement
Implement ConsistentHashRing(vnodes_per_server):
add_server(server_id)— add a server with V vnodes.remove_server(server_id)— remove all vnodes for the server.get_server(key) -> server_id— return the server responsible forkey.keys_moved(key, before, after)— for analysis: did this key remap?
Constraints
- 1 ≤ servers ≤ 10^4
- 1 ≤ vnodes per server ≤ 1000
- 10^5 lookups / second
- Lookup latency: O(log N · V)
Clarifying Questions
- Hash function: cryptographic or fast? (Use a fast non-crypto hash: MurmurHash, xxHash. Stable across processes.)
- Vnode count V: hard-coded or configurable? (Configurable, default 100–200.)
- Replication: should
get_serverreturn one or multiple distinct servers? (Often a follow-up; primary is one.) - Hot-spotting awareness: do we know any keys are extremely hot? (Bounded-load consistent hashing is a follow-up.)
Examples
ring = ConsistentHashRing(vnodes_per_server=100)
ring.add_server("s1"); ring.add_server("s2"); ring.add_server("s3")
ring.get_server("user-42") -> "s2"
ring.add_server("s4")
ring.get_server("user-42") -> "s2" or "s4" (only some keys remap)
# ~25% of keys remap on adding a 4th server, not 75% as with mod-N
Initial Brute Force
hash(key) % N. Simple and balanced; catastrophic on N change. Useful for understanding the problem, not as the solution.
Brute Force Complexity
O(1) per lookup; O(N · keys / N) = O(keys) remap on N change — not the problem; the brute force is mod-N and we’re moving away from it.
Optimization Path
Replace mod-N with a sorted ring. Servers map to multiple positions; lookup is binary search. On add/remove, only insert/delete vnode positions; existing positions don’t move.
Final Expected Approach
A sorted list of (hash_value, server_id) tuples kept in ring order. get_server(key): hash the key, binary-search for the smallest ring position ≥ key_hash; wrap around if past the end. add_server: insert V positions. remove_server: remove all V positions.
Data Structures Used
| Structure | Purpose |
|---|---|
Sorted list of (hash, server) | the ring |
dict[server_id, list[hash_positions]] | bookkeeping for removal |
bisect for binary search | O(log N · V) lookup |
Correctness Argument
Key locality on resize: when adding server S with vnodes [v_1, …, v_V], the only keys whose owner changes are those whose hash falls in some (predecessor(v_i), v_i] range. The expected fraction of keys affected is V / (total vnodes) ≈ 1/(N+1) — exactly the right number to assign to the new server, and no more.
Balanced load: with V vnodes per server, the variance of the load assigned to each server scales as O(log N / V). At V=100, N=10, the imbalance is < 5%; at V=1000, < 1.5%.
Complexity
get_server: O(log(N · V))add_server: O(V · log(N · V)) per insert; total O(V log N) per server- Space: O(N · V)
Implementation Requirements
import bisect, hashlib
from typing import Optional
def _hash(s: str) -> int:
"""Fast, deterministic hash. Use MD5 for speed; SHA-1 also fine."""
return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")
class ConsistentHashRing:
def __init__(self, vnodes_per_server: int = 100):
self._v = vnodes_per_server
self._ring: list[tuple[int, str]] = [] # sorted by hash
self._server_positions: dict[str, list[int]] = {}
def add_server(self, server_id: str) -> None:
if server_id in self._server_positions:
return
positions = []
for i in range(self._v):
h = _hash(f"{server_id}#{i}")
bisect.insort(self._ring, (h, server_id))
positions.append(h)
self._server_positions[server_id] = positions
def remove_server(self, server_id: str) -> None:
positions = self._server_positions.pop(server_id, None)
if positions is None: return
# Rebuild filtered ring (O(N V) — acceptable; remove is rare)
self._ring = [(h, s) for (h, s) in self._ring if s != server_id]
def get_server(self, key: str) -> Optional[str]:
if not self._ring:
return None
kh = _hash(key)
idx = bisect.bisect_left(self._ring, (kh, ""))
if idx == len(self._ring):
idx = 0 # wrap around
return self._ring[idx][1]
def server_count(self) -> int:
return len(self._server_positions)
# Bounded-load variant for hot-spot mitigation:
class BoundedLoadRing:
"""Consistent hashing with bounded-load: each server's load ≤ avg * (1+ε)."""
def __init__(self, vnodes_per_server: int = 100, epsilon: float = 0.25):
self._inner = ConsistentHashRing(vnodes_per_server)
self._eps = epsilon
self._load: dict[str, int] = {}
def add_server(self, sid: str) -> None:
self._inner.add_server(sid); self._load.setdefault(sid, 0)
def remove_server(self, sid: str) -> None:
self._inner.remove_server(sid); self._load.pop(sid, None)
def get_server(self, key: str, total_keys: int) -> Optional[str]:
n = self._inner.server_count()
if n == 0: return None
cap = (total_keys / n) * (1 + self._eps)
# Walk forward from the first candidate until we find one under cap.
kh = _hash(key)
ring = self._inner._ring
idx = bisect.bisect_left(ring, (kh, ""))
if idx == len(ring): idx = 0
for offset in range(len(ring)):
_, sid = ring[(idx + offset) % len(ring)]
if self._load.get(sid, 0) < cap:
self._load[sid] = self._load.get(sid, 0) + 1
return sid
return ring[idx][1] # all over cap; pick first
Tests
import unittest, random, statistics
class TestRing(unittest.TestCase):
def test_basic(self):
r = ConsistentHashRing(vnodes_per_server=10)
r.add_server("s1"); r.add_server("s2"); r.add_server("s3")
self.assertIsNotNone(r.get_server("k1"))
self.assertIn(r.get_server("k1"), {"s1", "s2", "s3"})
r.remove_server("s2")
self.assertIn(r.get_server("k1"), {"s1", "s3"})
def test_minimal_remapping(self):
r = ConsistentHashRing(vnodes_per_server=200)
for s in ["s1", "s2", "s3"]: r.add_server(s)
keys = [f"key-{i}" for i in range(10000)]
before = {k: r.get_server(k) for k in keys}
r.add_server("s4")
after = {k: r.get_server(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
# Expected ~25% remapping (from 3 servers to 4).
# mod-N would have moved ~75%.
self.assertLess(moved, 3500)
self.assertGreater(moved, 1500)
def test_balance(self):
r = ConsistentHashRing(vnodes_per_server=200)
for i in range(10): r.add_server(f"s{i}")
keys = [f"k-{i}" for i in range(20000)]
loads = {}
for k in keys:
s = r.get_server(k)
loads[s] = loads.get(s, 0) + 1
avg = 2000
# With 200 vnodes per server, variance should be small.
for cnt in loads.values():
self.assertLess(abs(cnt - avg), 350) # ≤17% deviation
def test_empty_ring(self):
r = ConsistentHashRing(vnodes_per_server=10)
self.assertIsNone(r.get_server("k"))
Follow-up Questions
(3) Scale to N nodes? Already designed for it. The ring scales because lookup is O(log N · V). The bottleneck on add_server is O(V log N) insertions; sorted-tree (red-black tree) implementations get O(V log N) similarly. For very large N, use a B-tree or skip list. For replication: get_servers(key, R) returns the next R unique servers clockwise.
(8) Partial failure? A server going down is naturally handled — its vnodes are removed and keys remap to the successor. The challenge is hot spots: when one server dies, all its load moves to one successor. Bounded-load consistent hashing (Mirsky’s variant) caps each server at (1+ε) × avg_load, spilling overflow to the next server. Implemented above.
(10) Consistency model? The ring itself is a routing function. The actual stored data has whatever consistency model the underlying store offers (linearizable, eventual, etc.). One subtlety: when a server is added, the data on the predecessor needs to be transferred to the new server before the routing change takes effect, or you serve stale/missing data. Two-phase add: install vnodes-as-readonly → migrate keys → activate.
(11) Configuration knobs? vnodes_per_server (100–500 covers most workloads), hash_function. Not to expose: ring data structure, balance/heuristics.
(4) Observe / monitor? Per-server load (gauge), key remapping events (counter), p99 lookup latency (histogram). Imbalance alert: trigger if any server’s load > 1.5x avg.
Product Extension
DynamoDB uses consistent hashing with explicit ranges. Cassandra uses 256 vnodes per node by default. Memcached’s clients (ketama, libmemcached) use consistent hashing. Envoy’s ring_hash load balancer uses it for sticky-session routing. Discord’s chat sharding originally used hash(channel) % N and famously hit the rebalance problem; they migrated to a fixed-bucket scheme. The point: even big companies get this wrong if they pick mod-N.
Language/Runtime Follow-ups
- Python:
bisectis the right tool for sorted-list maintenance. For very large rings, usesortedcontainers.SortedList(skip-list-backed) for O(log N) inserts. - Java:
TreeMap<Long, String>—ceilingKey(hash)does the lookup. Idiomatic. - Go:
sort.Searchover a[]uint64for the ring. Good locality, fast. - C++:
std::map<uint64_t, std::string>withlower_bound. Or sort a vector and binary-search. - JS/TS: no sorted-tree in stdlib; use the
sorted-array-functionsnpm package or maintain a sorted array manually.
Common Bugs
- Using
hash(key) % len(ring)to pick a vnode index — that is mod-N inside the ring. Use the ring’s actual hash space. - Forgetting to wrap around —
bisectreturnslen(ring)for a key past the last vnode; you must wrap to index 0. - Hash collisions on small rings — two servers’ vnodes land on the same hash. Either accept “first inserted wins” (deterministic) or perturb the suffix until unique.
- Removing a server but leaving its vnodes in
_server_positions(memory leak; subsequentadd_serverfor the same id silently no-ops because of theif … in …guard). - Using
hash(key)(Python’s built-in) — randomized per-process. Different processes route the same key to different servers. Use a stable hash like MD5 or MurmurHash.
Debugging Strategy
For “wrong server” complaints: log the (hashed-key, ring-position-found, server). For imbalance: dump load distribution and check vnode count. For “everything remapped” after add: count the remapped key fraction; if > 1/N, vnode count is too low or the hash function is poor.
Mastery Criteria
- Implemented ring + lookup + add/remove in <30 minutes.
- Stated minimal-remapping invariant (≤ 1/N keys move on add) without prompting.
- All four tests pass.
- Articulated why vnodes are needed in <60 seconds.
- Compared mod-N, ring-no-vnodes, ring-with-vnodes, bounded-load on a whiteboard.
- Answered follow-ups #3, #8 (bounded load), #10 (data migration coordination), #11 crisply.