Lab 05 — Huffman Coding

Goal

Implement Huffman coding from scratch using a min-heap, and prove its optimality via the canonical exchange argument: in some optimal prefix-free code tree, the two least-frequent symbols are siblings at maximum depth.

Background

Huffman coding is the apex example of “greedy via min-heap” and the most-cited example of greedy optimality in CS curricula. The proof has two non-trivial steps: a swap-to-leaf-depth lemma (any internal node at maximum depth can be assumed to have the two least-frequent symbols), and an induction on the merged tree (the greedy is optimal on n - 1 symbols, and combining the two smallest preserves optimality). Mastering this proof teaches a more sophisticated form of exchange argument than the linear-scan greedy of Labs 1–4.

Interview Context

Huffman is asked occasionally at top-tier interviews — Google, Apple, AWS — usually as an open-ended “design a compression algorithm” or as a follow-up to a lab on heap usage. More commonly, the technique (greedy via min-heap, with optimality proof) appears in adjacent problems: LC 1167 — Minimum Cost to Connect Sticks, LC 23 — Merge K Sorted Lists, and the rope-merging problem. Mastery of Huffman = mastery of the entire family.

Problem Statement

Given a frequency map freq: Symbol -> int over n distinct symbols (n ≥ 2), construct a prefix-free binary code such that the expected code length Σ freq[s] * len(code[s]) is minimized. Return either the code map or the encoding tree.

For interview formulation, often phrased as: “Given n ropes of given lengths, you can merge two ropes at a cost equal to the sum of their lengths. Find the minimum total cost to merge all ropes into one.” (Equivalent to Huffman; rope lengths = frequencies.)

LeetCode reference: LC 1167 — Minimum Cost to Connect Sticks (the rope formulation).

Constraints

  • 2 ≤ n ≤ 10^4
  • 1 ≤ freq[s] ≤ 10^4
  • Time O(n log n); space O(n).
  • Tie-break on equal frequencies: any order is acceptable; the optimal cost is invariant.

Clarifying Questions

  1. “Should I return the codes or just the cost?” — usually cost (rope formulation); for full Huffman, return the tree or the codes.
  2. “Are frequencies guaranteed positive?” — yes (zero-frequency symbols don’t need codes).
  3. “Are there always at least 2 symbols?” — assume yes; with 1 symbol, prefix-free coding is trivially “0” (or empty, depending on definition).
  4. “Is n = 0 a valid input?” — typically no.
  5. “Should the codes be canonical?” — usually no; any optimal-length code is acceptable.

Examples

  • Frequencies: {a: 5, b: 9, c: 12, d: 13, e: 16, f: 45}. Codes (one valid set): f: 0, c: 100, d: 101, a: 1100, b: 1101, e: 111. Total cost: 5*4 + 9*4 + 12*3 + 13*3 + 16*3 + 45*1 = 20 + 36 + 36 + 39 + 48 + 45 = 224.
  • Ropes [2, 4, 3]: merge 2+3=5 (cost 5), then 5+4=9 (cost 9), total 14.
  • Ropes [1, 8, 3, 5]: merge 1+3=4, merge 4+5=9, merge 9+8=17. Total = 4+9+17=30. Or: 1+3=4, 4+5=9, 8+9=17 → same. Min cost 30.

Initial Brute Force

Try every binary-tree topology over the leaves; compute the weighted external path length; return the minimum-cost tree. Catalan-number many trees → infeasible past n = 10.

# Sketched only — exponential
def huffman_brute(freq):
    # enumerate all binary trees with leaves = freq, return min weighted path length
    ...

Brute Force Complexity

O(C_n) where C_n is the Catalan number — C_{10} ≈ 16800, C_{15} ≈ 9.7M. Useful only as stress-test for n ≤ 8.

Optimization Path

  1. Brute — exhaustive trees.
  2. DP on intervals — possible if leaves are ordered (matrix-chain style), but Huffman’s leaves are unordered, so this doesn’t apply.
  3. Greedy with min-heapO(n log n) time, O(n) space. Optimality from the exchange argument below.

Final Expected Approach

import heapq

def huffman_cost(freqs):
    heap = list(freqs)            # frequencies only, for cost-only variant
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        s = a + b
        total += s
        heapq.heappush(heap, s)
    return total

def huffman_codes(freqs):
    # freqs: list of (symbol, freq) tuples
    heap = [[f, [[s, ""]]] for s, f in freqs]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[1]:
            pair[1] = '0' + pair[1]
        for pair in hi[1]:
            pair[1] = '1' + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], lo[1] + hi[1]])
    return dict(heap[0][1])

Data Structures Used

  • A min-heap of (frequency, optional payload) pairs.
  • The implicit binary tree formed by the merge sequence.

Correctness Argument

We prove the greedy is optimal by induction on n (the number of symbols), using two lemmas.

Lemma 1 (Swap-to-deepest). In some optimal prefix code tree T*, the two least-frequent symbols x and y are siblings at the maximum depth of any leaf.

Proof. Take any optimal tree T'. Let a and b be two siblings at the maximum-depth leaf level of T' (such a pair exists in any full binary tree where every internal node has 2 children). If {a, b} = {x, y}, done. Otherwise, suppose WLOG freq[x] ≤ freq[a] and x ≠ a. Swap x with a (place the symbol x at a’s leaf and vice versa). The cost change is:

Δ = freq[x] * depth(a) + freq[a] * depth(x) - freq[x] * depth(x) - freq[a] * depth(a) = (freq[a] - freq[x]) * (depth(x) - depth(a))

Since freq[a] ≥ freq[x] (because x is among the two least-frequent) and depth(a) ≥ depth(x) (because a is at maximum depth), Δ ≤ 0. Equality holds; the new tree is also optimal. Repeat with y and b. We’ve moved x, y to the maximum-depth pair without increasing cost. QED for Lemma 1.

Lemma 2 (Greedy preserves optimality on the residual). Let x, y be the two least-frequent symbols. Construct freq' by replacing x and y with a single super-symbol z of frequency freq[x] + freq[y]. Then any optimal tree T* for freq' extends to an optimal tree for freq by replacing z’s leaf with an internal node whose children are leaves for x and y.

Proof. Let T_extended be the extension of T* (replace z-leaf with internal node + x, y children). The cost satisfies:

cost(T_extended) = cost(T*) + freq[x] + freq[y]

(The + freq[x] + freq[y] comes from x, y being one level deeper than z was.)

Conversely, any tree T for freq where x, y are siblings (which by Lemma 1 we can assume WLOG) collapses to a tree T_collapsed for freq' by merging x, y into z:

cost(T_collapsed) = cost(T) - freq[x] - freq[y]

So cost(T) = cost(T_collapsed) + freq[x] + freq[y] ≥ cost(T*) + freq[x] + freq[y] = cost(T_extended). Therefore T_extended is at least as good as any tree where x, y are siblings; combined with Lemma 1, T_extended is optimal for freq. QED for Lemma 2.

Inductive proof of greedy optimality. Base case n = 2: only one tree possible, greedy gives it. Inductive step: greedy merges the two least-frequent symbols x, y first, recurses on the residual of size n - 1, and by inductive hypothesis the recursive call produces an optimal tree for freq'. By Lemma 2, the extended tree is optimal for freq. QED.

The two lemmas together are the full exchange-argument proof. Lemma 1 is the swap step; Lemma 2 is the induction step.

Complexity

  • Time O(n log n)n - 1 merge operations, each with two pop and one push, each O(log n).
  • Space O(n) — heap and tree.

Implementation Requirements

  • Use a min-heap (heapq in Python uses a min-heap by default; in Java, PriorityQueue is min by default).
  • Tie-breakers: when frequencies are equal, the heap may pick either; correctness is unaffected. For deterministic output, add a secondary index.
  • For the cost-only variant, the symbol payload can be omitted.

Tests

def test_huffman_cost():
    assert huffman_cost([2, 3, 4]) == 14  # 2+3=5, 5+4=9
    assert huffman_cost([1, 8, 3, 5]) == 30
    assert huffman_cost([5]) == 0  # n=1 edge: merge cost is 0
    # uniform
    assert huffman_cost([1, 1, 1, 1]) == 8  # 1+1=2, 1+1=2, 2+2=4. Total=2+2+4=8.
    # large skew
    assert huffman_cost([1, 1, 1000]) == 1003  # 1+1=2, 2+1000=1002. Total=2+1002=1004?
    # actually: heap [1,1,1000] → pop 1, pop 1, push 2. heap [2, 1000]. pop 2, pop 1000, push 1002. cost = 2 + 1002 = 1004.

Wait — let me re-verify the last test. Heap [1, 1, 1000]. Pop 1 + 1 = 2 (cost contribution: 2). Push 2. Heap [2, 1000]. Pop 2 + 1000 = 1002 (cost contribution: 1002). Total = 2 + 1002 = 1004. So the test should be assert huffman_cost([1, 1, 1000]) == 1004. Correct your tests.

Follow-up Questions

  • Adaptive Huffman — frequencies are unknown a priori; encoder and decoder maintain a tree that updates as symbols arrive. Used in older compression standards.
  • Canonical Huffman — codes are normalized so only code lengths need to be transmitted, not the tree structure. Used in DEFLATE / zlib.
  • Length-limited Huffman (max code length L) — the package-merge algorithm, more complex than vanilla Huffman.
  • Arithmetic coding — beats Huffman for non-power-of-2 frequencies; not greedy.

Product Extension

Used in: gzip / DEFLATE (with length-limited variant), HTTP/2 HPACK header compression, JPEG entropy coding stage, MP3 audio coding. Whenever a known-frequency distribution must be losslessly compressed with a prefix-free code, Huffman or its variants are the workhorse.

Language / Runtime Follow-ups

  • Python: heapq for the heap. Watch out: heapq is min-heap; for tie-breaking, use (freq, counter, payload) to avoid comparing payloads.
  • Java: PriorityQueue<Node> with Comparator.comparingInt(n -> n.freq).
  • Go: implement heap.Interface (5 methods) on a slice of nodes; standard library does not provide a generic typed heap pre-1.21.
  • C++: std::priority_queue<Node, std::vector<Node>, std::greater<Node>>. Define operator< on Node to compare by frequency.
  • JS/TS: no built-in heap; either bring a library (@datastructures-js/priority-queue) or hand-roll a binary heap.

Common Bugs

  • Mixing up min-heap and max-heap: with a max-heap, you’d merge the two largest — the answer is wrong by a lot.
  • Pushing the merged node back with the wrong frequency (e.g., max(a, b) instead of a + b).
  • For the codes variant: assigning ‘0’ to high-freq and ‘1’ to low-freq and forgetting that the prefix is built bottom-up (so the last prepended bit is the root’s assignment — make sure the prepend order is right).
  • Heap of size 1 at the start (single symbol): the while len(heap) > 1 loop is correct; cost is 0.

Debugging Strategy

  • Hand-trace a small example (e.g., [1, 1, 1, 1]) and verify each merge step.
  • Compare cost output against the brute force for n ≤ 6.
  • For codes: verify the tree visually — every internal node has exactly two children, every leaf is a symbol, and code lengths are weighted appropriately.

Mastery Criteria

  • You can implement Huffman cost-only in <8 minutes from blank.
  • You can implement Huffman with full code map in <15 minutes.
  • You can deliver Lemma 1 (swap-to-deepest) in <2 minutes, out loud.
  • You can deliver Lemma 2 (induction on residual) in <2 minutes, out loud.
  • You can recognize LC 1167 / connect-sticks as a Huffman variant in <30 seconds.
  • You can articulate when Huffman is not optimal (e.g., when the alphabet allows non-binary codes, or when arithmetic coding is admissible).

← Lab 04 — Gas Station · Phase 6 README · Lab 06 — Greedy Vs DP →