Huffman Coding: The Tree of Brevity
The Story: The Morse Code Problem
In the late 1830s, Samuel Morse devised a code to send messages over wires. He noticed that the letter ‘E’ is the most common letter in English, while ‘Z’ is very rare. So he made ‘E’ a single dot (.) and ‘Z’ a longer sequence (--..). By making common things short and rare things long, he saved time.
In 1952, David Huffman formalized this idea mathematically. He proved that if you build a specific type of binary tree based on frequency, you can create the optimal prefix-free code for any set of data.
This is Huffman Coding. It is the reason ZIP files are small and JPEG images load fast.
Why do we need it?
- File Compression: ZIP, GZIP, PNG, JPEG, MP3 all use Huffman coding (or its variants) as a final step to squeeze out bits.
- Bandwidth Savings: Sending less data over the network means faster websites and lower costs.
How the Algorithm “Thinks”
The algorithm is a Bottom-Up Builder.
- Count Frequencies: ‘A’: 5, ‘B’: 9, ‘C’: 12, ‘D’: 13, ‘E’: 16, ‘F’: 45.
- The Priority Queue: Put all characters into a Min-Heap sorted by frequency.
- Merge:
- Pick the two smallest nodes (5 and 9).
- Merge them into a new node (Sum: 14).
- Put 14 back into the heap.
- Repeat: Keep merging the two smallest until only one node (the Root) remains.
- Assign Codes:
- Left branch = ‘0’
- Right branch = ‘1’
The most frequent characters end up near the top (short codes). The rare characters end up deep at the bottom (long codes).
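The merge loop above can be sketched with Python's `heapq`, tracking which two frequencies get combined at each round (a minimal trace using the example frequencies; it records sums only, without building the actual tree):

```python
import heapq

# Frequencies from the example above: A:5, B:9, C:12, D:13, E:16, F:45
freqs = [5, 9, 12, 13, 16, 45]
heapq.heapify(freqs)

merge_order = []
while len(freqs) > 1:
    a = heapq.heappop(freqs)  # smallest
    b = heapq.heappop(freqs)  # second smallest
    heapq.heappush(freqs, a + b)  # the sum goes back into the heap
    merge_order.append((a, b, a + b))

print(merge_order)
# [(5, 9, 14), (12, 13, 25), (14, 16, 30), (25, 30, 55), (45, 55, 100)]
```

Note how the merged sums (14, 25, 30, 55) re-enter the heap and compete with the original frequencies, which is exactly why a priority queue is the right tool here.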
Engineering Context: Prefix-Free Codes
A crucial property of Huffman codes is that no code is a prefix of another.
- If ‘E’ is 0 and ‘A’ is 01, the decoder gets confused when it sees 0… is it ‘E’ or the start of ‘A’?
- In Huffman, if 0 is taken by ‘E’, no other code starts with 0. This allows for instant, unambiguous decoding without separator markers.
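Decoding shows why this property matters. Below is a minimal sketch that decodes a bit string greedily, using one valid Huffman code for the example frequencies above (the exact 0/1 assignment depends on tie-breaking, so a real run may produce different but equally optimal codes):

```python
# One valid Huffman code for A:5, B:9, C:12, D:13, E:16, F:45
codes = {"F": "0", "C": "100", "D": "101", "A": "1100", "B": "1101", "E": "111"}
decode_map = {v: k for k, v in codes.items()}

def decode(bits, decode_map):
    result, buffer = [], ""
    for bit in bits:
        buffer += bit
        # Because no code is a prefix of another, the first match is final:
        # no backtracking or separator markers are ever needed.
        if buffer in decode_map:
            result.append(decode_map[buffer])
            buffer = ""
    return "".join(result)

print(decode("0100111", decode_map))  # FCE
```

The decoder never has to look ahead: the moment the buffer matches a code, it can emit that character and reset.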
Implementation (Python)
```python
import heapq
from collections import Counter

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    # Comparison operator for the Priority Queue
    def __lt__(self, other):
        return self.freq < other.freq

def build_huffman_tree(text):
    # 1. Count frequencies
    freq = Counter(text)

    # 2. Build Min-Heap
    heap = [Node(char, f) for char, f in freq.items()]
    heapq.heapify(heap)

    # 3. Merge nodes
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)

        # Create internal node (char is None)
        merged = Node(None, left.freq + right.freq)
        merged.left = left
        merged.right = right
        heapq.heappush(heap, merged)

    return heap[0]  # Root

def generate_codes(node, prefix="", code_map=None):
    # Default to None: a mutable default dict would be shared across
    # calls and leak codes from one tree into the next.
    if code_map is None:
        code_map = {}
    if node:
        if node.char is not None:
            # Leaf node: record its code ("0" covers the one-symbol edge case)
            code_map[node.char] = prefix or "0"
        generate_codes(node.left, prefix + "0", code_map)
        generate_codes(node.right, prefix + "1", code_map)
    return code_map

# Usage
text = "BEEP BOOP BEER!"
root = build_huffman_tree(text)
codes = generate_codes(root)
print(f"Codes: {codes}")
# Common letters like 'E' will have shorter binary strings
```

Summary
Huffman Coding teaches us that equality is inefficient. Treating every character as equal (8 bits) wastes space. By embracing inequality—giving privileges (short codes) to the frequent and burdens (long codes) to the rare—we achieve the most efficient system possible. It reminds us that context (frequency) defines value.
