HyperLogLog (HLL)
HyperLogLog (HLL)
Introduction: The “Unique Visitors” Problem
You are the CTO of a social media site. You want to count how many unique users visited a specific page today (Cardinality Estimation).
- Set Approach: Store every User ID in a Hash Set. 100 million IDs 8 bytes 800MB. For 10,000 pages, you need 8TB of RAM. Impossible.
- HyperLogLog: Count those 100 million users using only 1.5 KB of memory with a error.
HyperLogLog is a probabilistic algorithm for the count-distinct problem. It’s used by Redis, BigQuery, and AWS Redshift to handle massive scale.
What Problem does it solve?
- Input: A stream of elements.
- Output: An estimate of how many unique elements have been seen.
- The Promise: Fixed, tiny memory usage regardless of how many items you count.
Core Idea (Intuition)
Think of Flipping a Coin.
- If I tell you “I flipped a coin and got 3 Tails in a row at the start (TTT…),” you’d guess I flipped it maybe 8 times.
- If I tell you “I got 20 Tails in a row (TTTTTTTTTTTTTTTTTTTT…),” you’d guess I flipped it millions of times.
HyperLogLog applies this to data:
- Hash an item to a binary string.
- Look for the number of leading zeros.
- The more zeros you see, the more unique items likely exist in the set.
- To avoid luck, it splits the data into many “buckets” and averages the results.
Why use it?
- ✅ Memory Efficiency: It is the undisputed king of space. It can count billions of items in a space smaller than a single tweet.
- ✅ Mergeable: You can calculate HLL for “Monday” and HLL for “Tuesday,” and merge them to get the unique count for the “Whole Week” perfectly.
Typical Business Scenarios
✅ DAU/MAU Counting: Estimating Daily/Monthly Active Users in real-time.
✅ Ad-Tech: Counting unique people who saw an ad across billions of impressions.
✅ Network Security: Counting unique IP addresses connecting to a server to detect a DDoS attack.
❌ Exact Counts Needed: If you need the exact number for billing or legal reasons, HLL is not for you. It has a standard error of (about 2% for typical configs).
Performance & Complexity
- Time: to add an item; to query.
- Space: (practically constant, e.g., 1.5KB).
Summary
“HyperLogLog is the ‘Estimation Miracle’. It turns the impossible task of counting billions into a tiny statistical game of counting zeros. Use it when ‘close enough’ is worth 10,000x memory savings.”
