Count-Min Sketch
Count-Min Sketch
Introduction: The “Frequent Items” Problem
Imagine you are managing a global network. You want to know: “How many requests did IP 1.2.3.4 send in the last hour?” In a stream of billions of packets, storing a counter for every single IP would consume terabytes of RAM.
Count-Min Sketch is a probabilistic structure that estimates the frequency of events in a stream. It’s like a Bloom Filter, but instead of bits (0/1), it stores counters. It might overestimate the frequency slightly, but it will never underestimate it.
What Problem does it solve?
- Input: A stream of elements.
- Output: An estimated frequency (count) for any given element.
- The Promise: Fixed, low memory usage to track frequencies in massive datasets.
Core Idea (Intuition)
Think of a Multi-Layered Grid of Counters.
- Grid: Create rows and columns of counters.
- Add an Item: For each row, use a different hash function to find a column, and increment that specific counter (
count[row][col]++). - Query an Item: Hash it with the same functions. You get different counts (one from each row).
- The Answer: The minimum of these counts is your estimate.
Why the minimum? Because of collisions, some counters might be shared by multiple items (causing overestimation). The probability of all counters being heavily collided for the same item is extremely low.
Why use it?
✅ Heavy Hitters: Finding the most frequent items (the “Top 1%”) in a stream without storing everything.
✅ DDoS Protection: Identifying IPs that are sending an unusually high volume of traffic.
❌ Overestimation: It can say an item appeared 1,005 times when it only appeared 1,000. It will never say 999.
❌ Distinct Counts: If you want to know “how many unique items,” use HyperLogLog. CM Sketch is for “how many times did THIS item appear.”
Typical Business Scenarios
- ✅ Real-time Analytics: Tracking trending hashtags on Twitter as they happen.
- ✅ Content Delivery: Determining which videos are “Hot” right now to push them to edge cache.
- ✅ Databases: Query optimizers use sketches to estimate the cost of joins by knowing how many duplicate values exist in a column.
Performance & Complexity
- Time: where is the number of hash functions (usually very small, like 5-10).
- Space: - fixed memory based on your desired error tolerance.
Summary
“Count-Min Sketch is the ‘Frequency Estimator’. By taking the minimum across multiple hashed counters, it filters out noise and gives you a reliable upper bound on how often an event occurred.”
