Lab 10 — Consistent Hashing

Goal

Implement a consistent hash ring with virtual nodes that minimizes key remapping when servers are added or removed. After this lab you should be able to design and implement consistent hashing in under 30 minutes and articulate why it beats hash(key) % N.

Background Concepts

The naive sharding scheme hash(key) % N has a catastrophic failure mode: when N changes (a server is added or removed), nearly every key remaps to a different shard. For an in-memory cache fleet, this means the entire cache is invalidated; for a stateful sharded store, this means most data must be physically migrated.

Consistent hashing solves this. Servers are placed on a ring (a circular hash space, e.g., [0, 2^64)). Each key is hashed onto the ring and assigned to the next server clockwise. When a server is added, only keys between its predecessor and itself on the ring are remapped. When a server is removed, only its keys are remapped — to its successor.

Without virtual nodes (vnodes), the ring is unbalanced: a 4-server ring assigns wildly unequal slices. Virtual nodes fix this: each physical server gets V ring positions (e.g., V=200). The ring becomes statistically balanced in O(1/sqrt(V)) deviation.

Interview Context

Consistent hashing is the default sharding mechanism for distributed caches (Memcached client libraries, Redis Cluster’s slot variant), distributed databases (DynamoDB, Cassandra), and load balancers (HAProxy, Envoy with ring_hash). It is asked at infrastructure roles at every Big Tech and many high-scale companies.

Problem Statement

Implement ConsistentHashRing(vnodes_per_server):

  • add_server(server_id) — add a server with V vnodes.
  • remove_server(server_id) — remove all vnodes for the server.
  • get_server(key) -> server_id — return the server responsible for key.
  • keys_moved(key, before, after) — for analysis: did this key remap?

Constraints

  • 1 ≤ servers ≤ 10^4
  • 1 ≤ vnodes per server ≤ 1000
  • 10^5 lookups / second
  • Lookup latency: O(log N · V)

Clarifying Questions

  1. Hash function: cryptographic or fast? (Use a fast non-crypto hash: MurmurHash, xxHash. Stable across processes.)
  2. Vnode count V: hard-coded or configurable? (Configurable, default 100–200.)
  3. Replication: should get_server return one or multiple distinct servers? (Often a follow-up; primary is one.)
  4. Hot-spotting awareness: do we know any keys are extremely hot? (Bounded-load consistent hashing is a follow-up.)

Examples

ring = ConsistentHashRing(vnodes_per_server=100)
ring.add_server("s1"); ring.add_server("s2"); ring.add_server("s3")
ring.get_server("user-42")     -> "s2"
ring.add_server("s4")
ring.get_server("user-42")     -> "s2" or "s4" (only some keys remap)
# ~25% of keys remap on adding a 4th server, not 75% as with mod-N

Initial Brute Force

hash(key) % N. Simple and balanced; catastrophic on N change. Useful for understanding the problem, not as the solution.

Brute Force Complexity

O(1) per lookup; O(N · keys / N) = O(keys) remap on N change — not the problem; the brute force is mod-N and we’re moving away from it.

Optimization Path

Replace mod-N with a sorted ring. Servers map to multiple positions; lookup is binary search. On add/remove, only insert/delete vnode positions; existing positions don’t move.

Final Expected Approach

A sorted list of (hash_value, server_id) tuples kept in ring order. get_server(key): hash the key, binary-search for the smallest ring position ≥ key_hash; wrap around if past the end. add_server: insert V positions. remove_server: remove all V positions.

Data Structures Used

StructurePurpose
Sorted list of (hash, server)the ring
dict[server_id, list[hash_positions]]bookkeeping for removal
bisect for binary searchO(log N · V) lookup

Correctness Argument

Key locality on resize: when adding server S with vnodes [v_1, …, v_V], the only keys whose owner changes are those whose hash falls in some (predecessor(v_i), v_i] range. The expected fraction of keys affected is V / (total vnodes)1/(N+1) — exactly the right number to assign to the new server, and no more.

Balanced load: with V vnodes per server, the variance of the load assigned to each server scales as O(log N / V). At V=100, N=10, the imbalance is < 5%; at V=1000, < 1.5%.

Complexity

  • get_server: O(log(N · V))
  • add_server: O(V · log(N · V)) per insert; total O(V log N) per server
  • Space: O(N · V)

Implementation Requirements

import bisect, hashlib
from typing import Optional

def _hash(s: str) -> int:
    """Fast, deterministic hash. Use MD5 for speed; SHA-1 also fine."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, vnodes_per_server: int = 100):
        self._v = vnodes_per_server
        self._ring: list[tuple[int, str]] = []   # sorted by hash
        self._server_positions: dict[str, list[int]] = {}

    def add_server(self, server_id: str) -> None:
        if server_id in self._server_positions:
            return
        positions = []
        for i in range(self._v):
            h = _hash(f"{server_id}#{i}")
            bisect.insort(self._ring, (h, server_id))
            positions.append(h)
        self._server_positions[server_id] = positions

    def remove_server(self, server_id: str) -> None:
        positions = self._server_positions.pop(server_id, None)
        if positions is None: return
        # Rebuild filtered ring (O(N V) — acceptable; remove is rare)
        self._ring = [(h, s) for (h, s) in self._ring if s != server_id]

    def get_server(self, key: str) -> Optional[str]:
        if not self._ring:
            return None
        kh = _hash(key)
        idx = bisect.bisect_left(self._ring, (kh, ""))
        if idx == len(self._ring):
            idx = 0                               # wrap around
        return self._ring[idx][1]

    def server_count(self) -> int:
        return len(self._server_positions)


# Bounded-load variant for hot-spot mitigation:
class BoundedLoadRing:
    """Consistent hashing with bounded-load: each server's load ≤ avg * (1+ε)."""
    def __init__(self, vnodes_per_server: int = 100, epsilon: float = 0.25):
        self._inner = ConsistentHashRing(vnodes_per_server)
        self._eps = epsilon
        self._load: dict[str, int] = {}

    def add_server(self, sid: str) -> None:
        self._inner.add_server(sid); self._load.setdefault(sid, 0)

    def remove_server(self, sid: str) -> None:
        self._inner.remove_server(sid); self._load.pop(sid, None)

    def get_server(self, key: str, total_keys: int) -> Optional[str]:
        n = self._inner.server_count()
        if n == 0: return None
        cap = (total_keys / n) * (1 + self._eps)
        # Walk forward from the first candidate until we find one under cap.
        kh = _hash(key)
        ring = self._inner._ring
        idx = bisect.bisect_left(ring, (kh, ""))
        if idx == len(ring): idx = 0
        for offset in range(len(ring)):
            _, sid = ring[(idx + offset) % len(ring)]
            if self._load.get(sid, 0) < cap:
                self._load[sid] = self._load.get(sid, 0) + 1
                return sid
        return ring[idx][1]                       # all over cap; pick first

Tests

import unittest, random, statistics

class TestRing(unittest.TestCase):
    def test_basic(self):
        r = ConsistentHashRing(vnodes_per_server=10)
        r.add_server("s1"); r.add_server("s2"); r.add_server("s3")
        self.assertIsNotNone(r.get_server("k1"))
        self.assertIn(r.get_server("k1"), {"s1", "s2", "s3"})
        r.remove_server("s2")
        self.assertIn(r.get_server("k1"), {"s1", "s3"})

    def test_minimal_remapping(self):
        r = ConsistentHashRing(vnodes_per_server=200)
        for s in ["s1", "s2", "s3"]: r.add_server(s)
        keys = [f"key-{i}" for i in range(10000)]
        before = {k: r.get_server(k) for k in keys}
        r.add_server("s4")
        after = {k: r.get_server(k) for k in keys}
        moved = sum(1 for k in keys if before[k] != after[k])
        # Expected ~25% remapping (from 3 servers to 4).
        # mod-N would have moved ~75%.
        self.assertLess(moved, 3500)
        self.assertGreater(moved, 1500)

    def test_balance(self):
        r = ConsistentHashRing(vnodes_per_server=200)
        for i in range(10): r.add_server(f"s{i}")
        keys = [f"k-{i}" for i in range(20000)]
        loads = {}
        for k in keys:
            s = r.get_server(k)
            loads[s] = loads.get(s, 0) + 1
        avg = 2000
        # With 200 vnodes per server, variance should be small.
        for cnt in loads.values():
            self.assertLess(abs(cnt - avg), 350)   # ≤17% deviation

    def test_empty_ring(self):
        r = ConsistentHashRing(vnodes_per_server=10)
        self.assertIsNone(r.get_server("k"))

Follow-up Questions

(3) Scale to N nodes? Already designed for it. The ring scales because lookup is O(log N · V). The bottleneck on add_server is O(V log N) insertions; sorted-tree (red-black tree) implementations get O(V log N) similarly. For very large N, use a B-tree or skip list. For replication: get_servers(key, R) returns the next R unique servers clockwise.

(8) Partial failure? A server going down is naturally handled — its vnodes are removed and keys remap to the successor. The challenge is hot spots: when one server dies, all its load moves to one successor. Bounded-load consistent hashing (Mirsky’s variant) caps each server at (1+ε) × avg_load, spilling overflow to the next server. Implemented above.

(10) Consistency model? The ring itself is a routing function. The actual stored data has whatever consistency model the underlying store offers (linearizable, eventual, etc.). One subtlety: when a server is added, the data on the predecessor needs to be transferred to the new server before the routing change takes effect, or you serve stale/missing data. Two-phase add: install vnodes-as-readonly → migrate keys → activate.

(11) Configuration knobs? vnodes_per_server (100–500 covers most workloads), hash_function. Not to expose: ring data structure, balance/heuristics.

(4) Observe / monitor? Per-server load (gauge), key remapping events (counter), p99 lookup latency (histogram). Imbalance alert: trigger if any server’s load > 1.5x avg.

Product Extension

DynamoDB uses consistent hashing with explicit ranges. Cassandra uses 256 vnodes per node by default. Memcached’s clients (ketama, libmemcached) use consistent hashing. Envoy’s ring_hash load balancer uses it for sticky-session routing. Discord’s chat sharding originally used hash(channel) % N and famously hit the rebalance problem; they migrated to a fixed-bucket scheme. The point: even big companies get this wrong if they pick mod-N.

Language/Runtime Follow-ups

  • Python: bisect is the right tool for sorted-list maintenance. For very large rings, use sortedcontainers.SortedList (skip-list-backed) for O(log N) inserts.
  • Java: TreeMap<Long, String>ceilingKey(hash) does the lookup. Idiomatic.
  • Go: sort.Search over a []uint64 for the ring. Good locality, fast.
  • C++: std::map<uint64_t, std::string> with lower_bound. Or sort a vector and binary-search.
  • JS/TS: no sorted-tree in stdlib; use the sorted-array-functions npm package or maintain a sorted array manually.

Common Bugs

  1. Using hash(key) % len(ring) to pick a vnode index — that is mod-N inside the ring. Use the ring’s actual hash space.
  2. Forgetting to wrap around — bisect returns len(ring) for a key past the last vnode; you must wrap to index 0.
  3. Hash collisions on small rings — two servers’ vnodes land on the same hash. Either accept “first inserted wins” (deterministic) or perturb the suffix until unique.
  4. Removing a server but leaving its vnodes in _server_positions (memory leak; subsequent add_server for the same id silently no-ops because of the if … in … guard).
  5. Using hash(key) (Python’s built-in) — randomized per-process. Different processes route the same key to different servers. Use a stable hash like MD5 or MurmurHash.

Debugging Strategy

For “wrong server” complaints: log the (hashed-key, ring-position-found, server). For imbalance: dump load distribution and check vnode count. For “everything remapped” after add: count the remapped key fraction; if > 1/N, vnode count is too low or the hash function is poor.

Mastery Criteria

  • Implemented ring + lookup + add/remove in <30 minutes.
  • Stated minimal-remapping invariant (≤ 1/N keys move on add) without prompting.
  • All four tests pass.
  • Articulated why vnodes are needed in <60 seconds.
  • Compared mod-N, ring-no-vnodes, ring-with-vnodes, bounded-load on a whiteboard.
  • Answered follow-ups #3, #8 (bounded load), #10 (data migration coordination), #11 crisply.