MinHash & LSH: The Art of the Sketch
The Story: The Portrait Artist and the Crowd
Imagine you are looking for a specific person in a crowd of 10 million people. You don’t have time to look at everyone’s eyes, ears, and fingerprints.
Instead, you have a Portrait Artist. For every person, the artist quickly sketches three things: their height, the color of their hat, and the shape of their glasses. These three features are their “Signature.”
If two people have the same signature, they are likely the same person (or at least very similar).
In 1997, Andrei Broder developed MinHash to help AltaVista (an early search engine) find duplicate web pages. A year later, Piotr Indyk and Rajeev Motwani formalized Locality-Sensitive Hashing (LSH), a way to group these “signatures” so that similar items land in the same “buckets.” It was the birth of large-scale similarity search.
Why do we need it?
In the era of Big Data, comparing every pair of items exactly is infeasible: a million documents already mean roughly half a trillion pairwise comparisons.
- Search Engines: Detecting “Near-Duplicates.” If two news articles are 95% the same, Google only wants to show one.
- Copyright Detection: Identifying if a song or an image is a slightly modified version of an original.
- Recommendation Systems: “Users who bought X also bought Y.” Finding similar users among millions.
How the Algorithm “Thinks”
The algorithm is a clever strategist who uses Probability to avoid heavy lifting.
The Signature (MinHash): It takes a massive document and breaks it into small sets of word sequences (Shingles). Instead of storing the whole set, it applies dozens of random hash functions and, for each function, keeps only the minimum value. This small list of minimum values is the MinHash Signature, the “Sketch” of the document. The trick: under a random hash function, two sets share the same minimum with probability equal to their Jaccard similarity, so the fraction of matching positions in two signatures estimates how similar the original sets are.
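As a minimal sketch of those two ideas, here is one common way to build word-level shingles and to estimate Jaccard similarity from two signatures. The helper names get_shingles and estimate_jaccard, and the choice of 3-word shingles, are illustrative rather than a fixed standard:

def get_shingles(text, k=3):
    # Break text into overlapping k-word shingles, e.g.
    # "the quick brown fox" -> {"the quick brown", "quick brown fox"}
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def estimate_jaccard(sig_a, sig_b):
    # Fraction of positions where the two signatures agree;
    # this converges to the true Jaccard similarity of the shingle sets.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)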
The Projection (LSH): Even with signatures, comparing everything pairwise is still O(n²) in the number of documents. LSH solves this by being “Locality-Sensitive.” Normally, hash functions are designed to avoid collisions; LSH is deliberately designed to cause collisions for similar items. It chops the signature into bands and hashes each band into a bucket.
The Bucket Logic: If two documents are similar, at least one of their bands will likely land in the same bucket. We only compare the documents that land in the same bucket. We skip 99.9% of the crowd.
Engineering Context: Fast but “Fuzzy”
MinHash and LSH are Probabilistic Algorithms.
- The Trade-off: They are incredibly fast and use very little memory, but they can have False Positives (unrelated items in the same bucket) or False Negatives (missing a similar pair).
- Precision Control: Engineers tune the number of hash functions, the number of bands, and the rows per band to balance speed and accuracy (see the sketch below).
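The standard analysis behind that tuning: if two documents have signature similarity s, and the signature is split into b bands of r rows each, the probability that they become candidates (share at least one band) is 1 - (1 - s^r)^b. A small sketch of that S-curve:

def candidate_probability(s, bands=20, rows=5):
    # Probability that at least one band matches, given that
    # each signature position agrees with probability s:
    # P = 1 - (1 - s**rows)**bands
    return 1 - (1 - s ** rows) ** bands

print(candidate_probability(0.9))  # ~1.0: near-duplicates almost always collide
print(candidate_probability(0.3))  # ~0.05: dissimilar pairs rarely collide

More bands push the curve left (fewer false negatives, more false positives); more rows per band push it right.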
Implementation (Python Sketch)
import hashlib

def get_minhash_signature(content_set, num_hashes=100):
    # content_set: set of "shingles" (sub-strings)
    signature = []
    for i in range(num_hashes):
        # In reality, use a proper family of hash functions.
        # Here we simulate one per index i by salting a stable hash
        # (hashlib is used because Python's built-in hash() is
        # randomized between runs for strings).
        min_hash = min(
            int(hashlib.md5(f"{item}_{i}".encode()).hexdigest(), 16)
            for item in content_set
        )
        signature.append(min_hash)
    return signature

def lsh_bucket(signature, bands=20):
    # Divide the signature into bands of equal size
    rows = len(signature) // bands
    buckets = []
    for i in range(0, len(signature), rows):
        band = tuple(signature[i:i + rows])
        # Include the band index so band 0 of one document can
        # only collide with band 0 of another, never with band 7
        buckets.append(hash((i, band)))
    return buckets  # if two docs share any bucket, they are candidates
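A usage sketch combining the helpers above with the get_shingles and estimate_jaccard helpers from earlier (the example sentences are arbitrary; with 100 hashes split into 20 bands the outcome is probabilistic, but near-duplicates will usually share a bucket):

text_a = "minhash turns a large document into a short signature that preserves set similarity surprisingly well"
text_b = "minhash turns a large document into a compact signature that preserves set similarity surprisingly well"

sig_a = get_minhash_signature(get_shingles(text_a))
sig_b = get_minhash_signature(get_shingles(text_b))

print(estimate_jaccard(sig_a, sig_b))  # roughly 0.6: one changed word shifts 3 shingles

shared = set(lsh_bucket(sig_a)) & set(lsh_bucket(sig_b))
print("candidate pair" if shared else "no shared bucket")  # usually "candidate pair"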
Summary
MinHash and LSH teach us that perfection is the enemy of scale. By using “Sketches” and “Buckets,” we can navigate a sea of billions with ease. They remind us that to understand the vastness of the world, we don’t need to see every detail; we just need to know which details matter.
