KMP Algorithm
KMP Algorithm: The One-Way Dancer
The Story: The Deja Vu
Imagine you are reading a book, looking for the word âABABACâ. You read: âA⊠B⊠A⊠B⊠AâŠâ Great! You have matched 5 characters. You are excited. Then you read the 6th character⊠itâs âDâ. âABABADâ. It doesnât match âABABACâ.
In a naive algorithm, you would sigh, go back to the 2nd character of the book, and start all over again. But KMP is smarter. It thinks: âWait. I just matched âABABAâ. I know what I saw. The last part âABAâ could be the start of a new match! I donât need to go back. I just need to shift my pattern so that the beginning âABAâ aligns with the ending âABAâ I just saw.â
This is KMP (Knuth-Morris-Pratt). It never moves backwards in the text pointer. It uses a pre-calculated table to slide the pattern forward as much as possible.
Why do we need it?
- Ctrl+F: Searching for a keyword in a massive document.
- DNA Sequencing: Finding a specific gene pattern in a genome of billions of base pairs.
- Streaming Data: KMP can search a stream of data (like keyboard input) in real-time because it never needs to ârewindâ the stream.
How the Algorithm âThinksâ
The algorithm has two phases:
The Preparation (LPS Table): Before reading the book, KMP analyzes the Pattern itself. It builds an array called
LPS(Longest Prefix which is also Suffix).- For âABABACâ:
- LPS Table says: âIf you fail at index 5, you can safely jump to index 3.â
The Dance (Search):
- It matches characters one by one.
- If characters match: Move forward.
- If mismatch: Donât move the text pointer back. Look at the LPS table to see where to reset the pattern pointer.
Engineering Context: Naive vs. KMP
- Naive Search: Worst case is . If you search for âAAAAAâ in âAAAAAAAAAâ, itâs very slow.
- KMP: Guaranteed . The time depends only on the length of the text and the pattern.
Implementation (Python)
def compute_lps(pattern):
# Longest Prefix Suffix table
lps = [0] * len(pattern)
length = 0 # Length of the previous longest prefix suffix
i = 1
while i < len(pattern):
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
# Tricky part: Fall back to previous LPS
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
def kmp_search(text, pattern):
if not pattern: return 0
lps = compute_lps(pattern)
i = 0 # Index for text
j = 0 # Index for pattern
while i < len(text):
if pattern[j] == text[i]:
i += 1
j += 1
if j == len(pattern):
return i - j # Match found at this index
# j = lps[j-1] # To find multiple matches
elif i < len(text) and pattern[j] != text[i]:
if j != 0:
j = lps[j-1] # The KMP Jump!
else:
i += 1
return -1
# Usage
print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB"))Summary
KMP teaches us that failure contains information. Instead of treating a mismatch as a total loss, KMP uses it as a clue to skip ahead. It reminds us that we donât always need to start from scratch; if we know ourselves (the pattern), we can recover from errors gracefully.
