Locality Sensitive Hashing (LSH)
Locality Sensitive Hashing: Birds of a Feather
The Story: The Logic of the Artistâs Palette
Imagine you are an artist with 1 million paintings. Someone brings you a new sketch and asks: âWhich of your paintings is most similar to this?â
In a normal database (a Map), you would hash the sketch. But if the sketch has one extra pencil stroke, the hash changes completely. You would find nothing.
In 1998, Indyk and Motwani designed a âbrokenâ hash function on purpose. Instead of maximizing the difference, they wanted to maximize the Collision. They imagined projecting the paintings onto a low-dimensional wall. If two paintings look similar from 10 different angles, they are probably the same painting.
This is Locality Sensitive Hashing (LSH). It is the art of grouping âClose Enoughâ items together without checking every pair (), which would be impossible at scale.
Why do we need it?
LSH is the engine for Similarity Search.
- Recommendation Systems: âFind users who have similar movie tastes to Luke.â
- Copyright Detection: âIs this YouTube video a slightly cropped version of a copyrighted movie?â
- Plagiarism Check: âIs this essay 90% similar to an existing article?â
- Image Search: Finding âVisually similarâ photos.
How the Algorithm âThinksâ
The algorithm is a creative grouper.
- Vectorization: It turns data (text, images) into a list of numbers (Vectors).
- Projection (The Shadow): It projects these vectors onto random hyperplanes. Think of it as taking a âsnapshotâ of the data from a specific angle.
- The Signature: It records the result of these projections as a short signature.
- The Bucket: Items with the same signature are put into the same âBucket.â
- The Search: To find something similar, you only look inside the same bucket. You ignore the billions of items in other buckets.
Engineering Context: The Precision-Recall Trade-off
- Speed: Searching a bucket of 1,000 items is much faster than searching a database of 1 billion.
- Accuracy: You might miss some similar items (Recall) or find some slightly different items (Precision).
- Optimization: Engineers use multiple âBandsâ of LSH. If two items match in any band, they are considered candidates. This is the MinHash + LSH combo used by Google and Amazon.
Implementation (Python Sketch)
import numpy as np
class SimpleLSH:
def __init__(self, input_dim, num_projections=10):
# Create random hyperplanes (the 'angles' we look from)
self.projections = np.random.randn(num_projections, input_dim)
self.buckets = {}
def _get_hash(self, vector):
# Project vector and get a binary signature (above/below plane)
sig = (np.dot(self.projections, vector) > 0).astype(int)
return "".join(sig.astype(str))
def add(self, label, vector):
h = self._get_hash(vector)
if h not in self.buckets:
self.buckets[h] = []
self.buckets[h].append(label)
def query(self, vector):
h = self._get_hash(vector)
return self.buckets.get(h, [])
# Usage
lsh = SimpleLSH(input_dim=128, num_projections=8)
vec_luke = np.random.randn(128)
vec_similar = vec_luke + 0.1 * np.random.randn(128) # Slightly different
lsh.add("Luke", vec_luke)
print(f"Similar to query: {lsh.query(vec_similar)}") # Likely returns ['Luke']Summary
LSH teaches us that sometimes, the details donât matter. By blurring the lines of identity, we gain the ability to find connections in a world of overwhelming complexity. It reminds us that in the digital age, being âSimilarâ is often more useful than being âEqual.â
