Babbel Bytes

Insights from the Babbel engineering team

Spelling correction with Levenshtein distance

Josephine Wright

How similar are the words ‘hear’ and ‘here’?

There are many answers to that question. First and foremost, ‘hear’ and ‘here’ are indistinguishable in sound. Meaning, a fluent English speaker will pronounce ‘here’ just as they pronounce ‘hear’. However, ‘here’ and ‘hear’ are different semantically or, rather, their definitions are distinct. ‘Here’ and ‘hear’ are also different in spelling.

Spelling is a particularly important issue for our users, here, at Babbel. It’s common for language learners to make spelling mistakes in a new language. It’s also common for users to make typo mistakes while using a mobile or web application. As we’re constantly looking for ways to improve our language learners’ experience, these issues were addressed in a recent project by using a popular algorithm called Levenshtein distance.


Levenshtein distance

There’s no need to reinvent the wheel this time. Many years of computer science research have resulted in a vast library of algorithms to solve problems like spelling.

In this post, I’ll explore the Levenshtein distance algorithm. It was discovered and documented by a scientist named Vladmir Levenshtein in 1965.

At its core, the purpose of Levenshtein distance is to quantify the differences between 2 strings.

Edit Operations

The Levenshtein distance algorithm defines 3 distinct edit operations that are needed to transform 1 string into another. They are as follows.

  1. Insertion
  2. Deletion
  3. Substitution

Each operation has an edit distance of 1. For example, I will transform the word ‘competers’ to ‘computer’ in operational steps.

  1. ‘competers’ to ‘competer’ (delete the ‘s’)
  2. ‘competer’ to ‘computer’ (substitute the first ‘e’ with ‘u’)

Above, there are 2 operations needed to convert ‘competers’ into ‘computer’. Therefore the Levenshtein distance between ‘competers’ and ‘computer’ is 2.

Formula

For programmatic reasons, it’s necessary to define the mathematical specifications. The Levenshtein distance Wikipedia article includes the following mathematical formula.

Levenshtein Formula

It can look intimidating, so I’ll walk through it step-by-step with an example.

Example

As stated before, the words ‘here’ and ‘hear’ sound the same in English, but are spelled differently. I would like to know how differently they are spelled or, rather, the edit distance in a numeric value.

First, I’ll create a grid with the word ‘hear’ on the horizontal axis (x coordinates) and ‘here’ on the vertical axis (y coordinates). I’ll also fill in the first corresponding row and column with an increasing count starting from 0 to represent the distance from the start of the word.

Levenshtein Grid, Step 1

Then, I’ll approach the first position (1, 1). Because the first character of ‘h’ in ‘here’ matches the first character of ‘h’ in ‘hear’, I’ll look for the value in the diagonally inferior coordinate (x - 1, y - 1). Its value is 0, so I’ll copy 0 from (0, 0) to (1, 1).

Levenshtein Grid, Step 2

Then, I’ll approach another coordinate either horizontally or vertically adjacent to the (1, 1). I’ll choose the box above, (1, 2). The character ‘h’ from ‘hear’ does not match the character ‘e’ from ‘here’. So I have to choose the minimum value from a neighboring box ((0, 0), (0, 1), or (1, 0)) and add 1. The lowest value would be the box below (1, 1) and its value is 0. The sum of 0 + 1 is 1, so I’ll insert the value of 1 into (1, 2).

Levenshtein Grid, Step 3

If I continue to do this for all of the boxes, I end up with the following grid.

Levenshtein Grid, Step 4

The top right box (4, 4) becomes the final edit distance between the strings. Therefore, the edit distance between ‘here’ and ‘hear’ is… 2!

It’s also possible to arrive at this conclusion by noticing that there are 2 edits necessary to turn ‘here’ into ‘hear’. I would have to substitute 2 characters ‘re’ to ‘ar’. However, this grid example is necessary to introduce the logic for the algorithm.

Implementation in Ruby

The best way, in my opinion, to understand an algorithm is to implement it in a programming language you know well. Here, at Babbel, we use Ruby in many of our backend applications. I’ll use Ruby for implementation, but it is also possble to translate the code below into any imperative programming language.

Matrices

Above, I used a grid to explain the algorithm step-by-step. It’s essential to do the same for a programmatic solution as well. Grids are referred to as matrices in Ruby (and generally in mathematics). The Matrix class in Ruby is basically a 2-dimensional array with added mathematical functionality. I can access this class by requiring it.

require 'matrix'

Values in Ruby’s Matrix class are immutable. For the purpose of this algorithm, I need to mutate values. To do this, I’ll inherit from the standard class and expose a normally private method.

class MutableMatrix < Matrix
  def []=(a, b, value)
    @rows[a][b] = value
  end
end

Algorithm

Now that I have a mutable matrix, I can use it to populate the boxes like above. Matrices define vertical positions increasingly from top to bottom, so position (0, 0) will refer to the top left position (or box).

def mutable_matrix(first_string, second_string)
  MutableMatrix.build(
    first_string.length + 1,
    second_string.length + 1
  ) do |row, col|
    if row == 0
      col
    elsif col == 0
      row
    else
      0
    end
  end
end

# mutable_matrix('aaa', 'bbbb')
# => MutableMatrix[[0, 1, 2, 3, 4],
#                  [1, 0, 0, 0, 0],
#                  [2, 0, 0, 0, 0],
#                  [3, 0, 0, 0, 0]]

Because of the deviation from normal algebraic order, I’ll refer to the positions using a and b, not x and y. Given that I now have a mutable matrix, I will build the first conditional. In the example of ‘here’ and ‘hear’ above, I first checked to see if the first characters were equal.

# (a, b) as a position
first_string[a - 1] == second_string[b - 1]

If this conditional returns true, it’s necessary to find the value in the diagonally inferior position.

matrix[a - 1, b - 1]

If the characters do not match, I need to find the minimum distance to the neighboring positions.

[
  matrix[a, b - 1],    # Insertion:    Look directly above the current position
  matrix[a - 1, b],    # Deletion:     Look left to the current position
  matrix[a - 1, b - 1] # Substitution: Look diagonally lower from the current position
].min + 1              # Conclusion:   Return the minimum value from these 3 options and add 1

Now, I’ll combine these 2 cases in a method.

def cost(matrix:, first_string:, second_string:, a:, b:)
  if first_string[a - 1] == second_string[b - 1]
    matrix[a - 1, b - 1]
  else
    [
      matrix[a, b - 1],
      matrix[a - 1, b],
      matrix[a - 1, b - 1]
    ].min + 1
  end
end

The edit cost is defined. Next, I need to fill in every position in the matrix. This requires… looping! It’s fine to loop over either string first.

(1..second_string.length).each do |b|
  (1..first_string.length).each do |a|
    matrix[a, b] = cost(
      matrix:        matrix, 
      first_string:  first_string, 
      second_string: second_string, 
      a:             a, 
      b:             b
    )
  end
end

Finally, I need to grab the last value from the matrix to determine the overall edit distance between 2 strings.

matrix[first_string.length, second_string.length]

Here is the composition of all prior logic in a nice Ruby class. There’s multiple ways to implement it. I’ve chosen large methods with named arguments.

require 'matrix'

class MutableMatrix < Matrix
  def []=(a, b, value)
    @rows[a][b] = value
  end
end

class Levenshtein
  def distance(first_string, second_string)
    first_s_length = first_string.length
    second_s_length = second_string.length
    matrix = mutable_matrix(
      first_s_length:  first_s_length,
      second_s_length: second_s_length
    )

    (1..second_s_length).each do |b|
      (1..first_s_length).each do |a|
        matrix[a, b] = cost(
          matrix:        matrix,
          first_string:  first_string,
          second_string: second_string,
          a:             a,
          b:             b
        )
      end
    end

    matrix[first_s_length, second_s_length]
  end

  private

  def cost(matrix:, first_string:, second_string:, a:, b:)
    if first_string[a - 1] == second_string[b - 1]
      matrix[a - 1, b - 1]
    else
      [
        matrix[a, b - 1],
        matrix[a - 1, b],
        matrix[a - 1, b - 1]
      ].min + 1
    end
  end

  def mutable_matrix(first_s_length:, second_s_length:)
    MutableMatrix.build(
      first_s_length + 1,
      second_s_length + 1
    ) do |row, col|
      if row == 0
        col
      elsif col == 0
        row
      else
        0
      end
    end
  end
end

Finally, running this implementation in Ruby is as simple as the following.

levenshtein = Levenshtein.new
levenshtein.distance('competers', 'computer')
# => 2
levenshtein.distance('hear', 'here')
# => 2

Shortcomings

Like most algorithms, Levenshtein distance has some faults. For one, there’s no collectively accepted rule for handling accent marks, capital letters, or punctuation. Another clear shortcoming is character swapping.

As an example, the word ‘receive’ in English is easy to misspell. Often ‘receive’ is misspelled as ‘recieve’, with the ‘i’ and ‘e’ being switched. Using the Levenshtein distance algorithm, this mistake would have an edit distance of 2 (for 2 substitutions). The word ‘receipt’ is also both 2 substitutional edits away from ‘receive’. This is not ideal because it’s more likely for someone to spell ‘receive’ as ‘recieve’ while the word ‘receipt’ is very different semantically. Seeing this as a major issue, another scientist named Frederick Damerau proposed an improvement. He added another edit to the Levenshtein distance algorithm, called transpositional edits, to account for character swapping. This improvement differentiates Levenshtein distance from Damerau-Levenshtein distance.

Modern language learning applications desire logic to help correct common learner mistakes with grammar and verb conjugation. It’s not quite possible to accomplish this with basic edit distance. However, the field of natural language processing introduces more refined models and algorithms to help with these problems. Still, Levenshtein distance is extremely useful in its simplicity and performance.

Facebook Twitter Google+ Reddit EMail
Comments