Spelling correction with Levenshtein distance
How similar are the words ‘hear’ and ‘here’?
There are many answers to that question. First and foremost, ‘hear’ and ‘here’ are indistinguishable in sound. Meaning, a fluent English speaker will pronounce ‘here’ just as they pronounce ‘hear’. However, ‘here’ and ‘hear’ are different semantically or, rather, their definitions are distinct. ‘Here’ and ‘hear’ are also different in spelling.
Spelling is a particularly important issue for our users, here, at Babbel. It’s common for language learners to make spelling mistakes in a new language. It’s also common for users to make typo mistakes while using a mobile or web application. As we’re constantly looking for ways to improve our language learners’ experience, these issues were addressed in a recent project by using a popular algorithm called Levenshtein distance.
There’s no need to reinvent the wheel this time. Many years of computer science research have resulted in a vast library of algorithms to solve problems like spelling.
In this post, I’ll explore the Levenshtein distance algorithm. It was discovered and documented by a scientist named Vladmir Levenshtein in 1965.
At its core, the purpose of Levenshtein distance is to quantify the differences between 2 strings.
The Levenshtein distance algorithm defines 3 distinct edit operations that are needed to transform 1 string into another. They are as follows.
Each operation has an edit distance of 1. For example, I will transform the word ‘competers’ to ‘computer’ in operational steps.
- ‘competers’ to ‘competer’ (delete the ‘s’)
- ‘competer’ to ‘computer’ (substitute the first ‘e’ with ‘u’)
Above, there are 2 operations needed to convert ‘competers’ into ‘computer’. Therefore the Levenshtein distance between ‘competers’ and ‘computer’ is 2.
For programmatic reasons, it’s necessary to define the mathematical specifications. The Levenshtein distance Wikipedia article includes the following mathematical formula.
It can look intimidating, so I’ll walk through it step-by-step with an example.
As stated before, the words ‘here’ and ‘hear’ sound the same in English, but are spelled differently. I would like to know how differently they are spelled or, rather, the edit distance in a numeric value.
First, I’ll create a grid with the word ‘hear’ on the horizontal axis (
x coordinates) and ‘here’ on the vertical axis (
y coordinates). I’ll also fill in the first corresponding row and column with an increasing count starting from 0 to represent the distance from the start of the word.
Then, I’ll approach the first position
(1, 1). Because the first character of ‘h’ in ‘here’ matches the first character of ‘h’ in ‘hear’, I’ll look for the value in the diagonally inferior coordinate
(x - 1, y - 1). Its value is 0, so I’ll copy 0 from
(0, 0) to
Then, I’ll approach another coordinate either horizontally or vertically adjacent to the
(1, 1). I’ll choose the box above,
(1, 2). The character ‘h’ from ‘hear’ does not match the character ‘e’ from ‘here’. So I have to choose the minimum value from a neighboring box (
(0, 1), or
(1, 0)) and add 1. The lowest value would be the box below
(1, 1) and its value is 0. The sum of
0 + 1 is 1, so I’ll insert the value of 1 into
If I continue to do this for all of the boxes, I end up with the following grid.
The top right box
(4, 4) becomes the final edit distance between the strings. Therefore, the edit distance between ‘here’ and ‘hear’ is… 2!
It’s also possible to arrive at this conclusion by noticing that there are 2 edits necessary to turn ‘here’ into ‘hear’. I would have to substitute 2 characters ‘re’ to ‘ar’. However, this grid example is necessary to introduce the logic for the algorithm.
Implementation in Ruby
The best way, in my opinion, to understand an algorithm is to implement it in a programming language you know well. Here, at Babbel, we use Ruby in many of our backend applications. I’ll use Ruby for implementation, but it is also possble to translate the code below into any imperative programming language.
Above, I used a grid to explain the algorithm step-by-step. It’s essential to do the same for a programmatic solution as well. Grids are referred to as matrices in Ruby (and generally in mathematics). The Matrix class in Ruby is basically a 2-dimensional array with added mathematical functionality. I can access this class by requiring it.
Values in Ruby’s
Matrix class are immutable. For the purpose of this algorithm, I need to mutate values. To do this, I’ll inherit from the standard class and expose a normally private method.
class MutableMatrix < Matrix def =(a, b, value) @rows[a][b] = value end end
Now that I have a mutable matrix, I can use it to populate the boxes like above. Matrices define vertical positions increasingly from top to bottom, so position
(0, 0) will refer to the top left position (or box).
def mutable_matrix(first_string, second_string) MutableMatrix.build( first_string.length + 1, second_string.length + 1 ) do |row, col| if row == 0 col elsif col == 0 row else 0 end end end # mutable_matrix('aaa', 'bbbb') # => MutableMatrix[[0, 1, 2, 3, 4], # [1, 0, 0, 0, 0], # [2, 0, 0, 0, 0], # [3, 0, 0, 0, 0]]
Because of the deviation from normal algebraic order, I’ll refer to the positions using
y. Given that I now have a mutable matrix, I will build the first conditional. In the example of ‘here’ and ‘hear’ above, I first checked to see if the first characters were equal.
# (a, b) as a position first_string[a - 1] == second_string[b - 1]
If this conditional returns true, it’s necessary to find the value in the diagonally inferior position.
matrix[a - 1, b - 1]
If the characters do not match, I need to find the minimum distance to the neighboring positions.
[ matrix[a, b - 1], # Insertion: Look directly above the current position matrix[a - 1, b], # Deletion: Look left to the current position matrix[a - 1, b - 1] # Substitution: Look diagonally lower from the current position ].min + 1 # Conclusion: Return the minimum value from these 3 options and add 1
Now, I’ll combine these 2 cases in a method.
def cost(matrix:, first_string:, second_string:, a:, b:) if first_string[a - 1] == second_string[b - 1] matrix[a - 1, b - 1] else [ matrix[a, b - 1], matrix[a - 1, b], matrix[a - 1, b - 1] ].min + 1 end end
The edit cost is defined. Next, I need to fill in every position in the matrix. This requires… looping! It’s fine to loop over either string first.
(1..second_string.length).each do |b| (1..first_string.length).each do |a| matrix[a, b] = cost( matrix: matrix, first_string: first_string, second_string: second_string, a: a, b: b ) end end
Finally, I need to grab the last value from the matrix to determine the overall edit distance between 2 strings.
Here is the composition of all prior logic in a nice Ruby class. There’s multiple ways to implement it. I’ve chosen large methods with named arguments.
require 'matrix' class MutableMatrix < Matrix def =(a, b, value) @rows[a][b] = value end end class Levenshtein def distance(first_string, second_string) first_s_length = first_string.length second_s_length = second_string.length matrix = mutable_matrix( first_s_length: first_s_length, second_s_length: second_s_length ) (1..second_s_length).each do |b| (1..first_s_length).each do |a| matrix[a, b] = cost( matrix: matrix, first_string: first_string, second_string: second_string, a: a, b: b ) end end matrix[first_s_length, second_s_length] end private def cost(matrix:, first_string:, second_string:, a:, b:) if first_string[a - 1] == second_string[b - 1] matrix[a - 1, b - 1] else [ matrix[a, b - 1], matrix[a - 1, b], matrix[a - 1, b - 1] ].min + 1 end end def mutable_matrix(first_s_length:, second_s_length:) MutableMatrix.build( first_s_length + 1, second_s_length + 1 ) do |row, col| if row == 0 col elsif col == 0 row else 0 end end end end
Finally, running this implementation in Ruby is as simple as the following.
levenshtein = Levenshtein.new levenshtein.distance('competers', 'computer') # => 2 levenshtein.distance('hear', 'here') # => 2
Like most algorithms, Levenshtein distance has some faults. For one, there’s no collectively accepted rule for handling accent marks, capital letters, or punctuation. Another clear shortcoming is character swapping.
As an example, the word ‘receive’ in English is easy to misspell. Often ‘receive’ is misspelled as ‘recieve’, with the ‘i’ and ‘e’ being switched. Using the Levenshtein distance algorithm, this mistake would have an edit distance of 2 (for 2 substitutions). The word ‘receipt’ is also both 2 substitutional edits away from ‘receive’. This is not ideal because it’s more likely for someone to spell ‘receive’ as ‘recieve’ while the word ‘receipt’ is very different semantically. Seeing this as a major issue, another scientist named Frederick Damerau proposed an improvement. He added another edit to the Levenshtein distance algorithm, called transpositional edits, to account for character swapping. This improvement differentiates Levenshtein distance from Damerau-Levenshtein distance.
Modern language learning applications desire logic to help correct common learner mistakes with grammar and verb conjugation. It’s not quite possible to accomplish this with basic edit distance. However, the field of natural language processing introduces more refined models and algorithms to help with these problems. Still, Levenshtein distance is extremely useful in its simplicity and performance.