Inverted Index
Inverted Index: The World Flipped Upside Down
The Story: The Index Cards of the 19th Century
Before computers, historians and researchers faced a nightmare. If you wanted to find every mention of “Liberty” in a library of 10,000 books, you would have to read every single one.
To solve this, librarians began creating Concordances. They would take a book, list every unique word, and write down every page number where that word appeared on a small index card.
Instead of: Book A [Word 1, Word 2, Word 3…] They flipped it to: Word 1 [Book A, Book B, Book D…]
This “flip” changed the history of human knowledge. It allowed a researcher to find a needle in a haystack in seconds. Today, this 19th-century technique—now called the Inverted Index—is the core technology behind Elasticsearch, Lucene, and every search bar you use.
Why do we need it?
In a standard database (like SQL), you usually search for a row by its ID. But in a search engine, you search for a “Document” by its content.
- The Problem: If you have 100 million blog posts and you search for “Common Algorithms,” a standard database has to scan every single word of every single post (). This takes minutes.
- The Solution: The Inverted Index. When a document is saved, we break it into words (tokens) and store a map of which word belongs to which document. When you search, we just look at the map ().
How the Algorithm “Thinks”
The algorithm is a meticulous librarian who performs a “Reverse Mapping”:
Tokenization (The Breakdown): It takes a sentence like “To be or not to be” and breaks it into individual words:
[to, be, or, not]. It ignores capital letters and punctuation.The Great Flip (Inversion): Instead of remembering “Document 1 has these words,” it builds a giant table where each Word is the key.
to[Doc 1, Doc 2, Doc 5]be[Doc 1, Doc 3]or[Doc 1]
The Intersection (The Search): When you search for “to be,” the librarian looks at the lists for
toandbeand finds the Intersection: the documents that appear in both lists.
Engineering Context: The Cost of Speed
The Inverted Index is why Elasticsearch is so fast at searching but relatively slow at writing.
- The Trade-off: Every time you add a document, the system has to update thousands of lists in the index. This is computationally expensive and uses a lot of disk space.
- Compression: Because these lists (Postings Lists) can be massive, engineers use clever bit-packing techniques to keep the index small enough to fit in RAM.
Implementation (Python)
from collections import defaultdict
def create_inverted_index(documents):
"""
documents: Dictionary {doc_id: "content string"}
"""
index = defaultdict(set)
for doc_id, content in documents.items():
# 1. Tokenization: Lowercase and split into words
words = content.lower().split()
# 2. Inversion: Map word -> doc_id
for word in words:
index[word].add(doc_id)
return index
def search(query, index):
query_words = query.lower().split()
if not query_words:
return set()
# 3. Intersection: Find docs that contain ALL query words
results = index.get(query_words[0], set())
for word in query_words[1:]:
results = results.intersection(index.get(word, set()))
return results
# Example Usage
# docs = {
# 1: "Common algorithms for engineers",
# 2: "The soul of a new machine",
# 3: "How to find common ground"
# }
# idx = create_inverted_index(docs)
# print(search("common", idx)) # Output: {1, 3}Summary
The Inverted Index teaches us that if you want to find something fast, you must stop looking through the lens of “what I have” and start looking through the lens of “what people want.” By flipping the world upside down, we turned the impossible task of reading everything into the simple task of looking at a map.
