Levenshtein Distance
Levenshtein Distance: The Error Measure
The Story: The Typos
Imagine you are a teacher grading a spelling test.
- The student wrote: āKITTENā
- The correct word is: āSITTINGā
How many points should you deduct?
- Substitute āKā with āSā -> SITTEN
- Insert āIā after āSā -> SIITTEN (Wait, thatās wrongā¦)
- Letās try again.
- K -> S (Substitute)
- I -> I (Match)
- T -> T (Match)
- T -> T (Match)
- E -> I (Substitute)
- N -> N (Match)
- End -> Add G (Insert)
Total operations: 3. This count, 3, is the Levenshtein Distance. It is the minimum number of Insertions, Deletions, or Substitutions required to transform one string into another.
Why do we need it?
- Spell Checkers: āDid you meanā¦?ā If you type āGogleā, the distance to āGoogleā is 1. The distance to āAppleā is 4. The computer suggests āGoogleā.
- DNA Alignment: Comparing two gene sequences to see how closely related two species are.
- Plagiarism Detection: Measuring how similar two essays are, even if the student changed a few words.
How the Algorithm āThinksā
The algorithm uses Dynamic Programming. It builds a grid (matrix) where cell[i][j] represents the distance between the first i characters of Word A and the first j characters of Word B.
The Rules of the Grid:
- If characters match: No cost.
cell[i][j] = cell[i-1][j-1] - If mismatch: We take the minimum of three options + 1 cost:
- Insert: Look Left (
cell[i][j-1]) - Delete: Look Up (
cell[i-1][j]) - Replace: Look Diagonal (
cell[i-1][j-1])
- Insert: Look Left (
Engineering Context: The Complexity
- Time: where N and M are string lengths.
- Space: for the full matrix.
- Optimization: You only need the previous row to calculate the current row. So space can be reduced to .
Implementation (Python)
def levenshtein_distance(s1, s2):
m, n = len(s1), len(s2)
# Create a matrix of size (m+1) x (n+1)
# dp[i][j] = distance between s1[:i] and s2[:j]
dp = [[0] * (n + 1) for _ in range(m + 1)]
# Initialize first row and column
for i in range(m + 1):
dp[i][0] = i # Transforming s1[:i] to "" requires i deletions
for j in range(n + 1):
dp[0][j] = j # Transforming "" to s2[:j] requires j insertions
# Fill the matrix
for i in range(1, m + 1):
for j in range(1, n + 1):
if s1[i - 1] == s2[j - 1]:
# Characters match, no cost
dp[i][j] = dp[i - 1][j - 1]
else:
# Mismatch, take min of Insert, Delete, Replace
dp[i][j] = 1 + min(
dp[i][j - 1], # Insert
dp[i - 1][j], # Delete
dp[i - 1][j - 1] # Replace
)
return dp[m][n]
# Usage
print(f"Distance: {levenshtein_distance('kitten', 'sitting')}") # Output: 3Summary
Levenshtein Distance teaches us that difference can be quantified. By breaking down vague concepts like āsimilarityā into concrete atomic actions (insert, delete, replace), we can teach computers to understand human error. It reminds us that in a world full of noise, the ability to measure the gap between expectation and reality is the first step to bridging it.
