We want to know how “close” words are to other words. A similarity measure is a function that assigns a score to a pair of words (or, more generally, character strings). For example, we want the distance between “ppl” and “people” to be low, but the distance between “ppl” and “cheese” to be high. In an automated cheese shop, one would expect people to sometimes misspell these words. If you have a way to measure the distance between such abbreviations and the words in a dictionary, you could guess that someone meant “Red Leicester” when she wrote “Red Lester”, because the distance between them is very small. (Perceptive readers might notice that this is a convoluted way of saying: “spelling correction”.)
One distance measure that is often used is the Levenshtein edit distance. It assigns a cost of one to each single-character edit (insertion, deletion, or substitution) needed to turn one string into the other. The Levenshtein distance between “Tilsit” and “Tulsit” is 1. The distance between “Caerphilly” and “Carfilly” is 3, since two letters were dropped (or inserted, depending on where you are coming from) and one letter had to change.
How would one compute the distance? One might think at first that you just compare the first letter of the first word with the first letter of the second word, then the second letters, and so forth, incrementing a counter each time the letters differ. Unfortunately that would say that the distance between “Czechoslovakian sheep’s milk cheese” and “Mud” is 35, while in actual fact it is only 34. The naive approach doesn’t take character insertions and deletions into account. To handle those, one has to somehow align the two strings:
Algorithm to compute the edit distance
An efficient way to calculate the correct distance is by using dynamic programming. Start with an empty table (using a somewhat smaller example):
Then beginning in the top-left corner, fill in the table. The number in each cell represents the lowest distance one could get if the two strings were aligned such that the letter represented by the column and the letter represented by the row were aligned. In the above case, the red cell would contain the distance if we had aligned “em” and “mel” such that “m” and “l” aligned.
The first column (and row) then represents the case where the two words are aligned so that the first letter of one word lines up with either the first letter of the other word, or its second letter, or its third, and so forth.
Now we just need to fill in the rest. If the words have been aligned up to the previous character, the cells above, to the left of, and diagonally above-left of the current cell will already be filled in.
- Moving from the left cell to the current cell represents inserting a letter into the top string.
- Moving from the above cell to the current cell represents inserting a letter into the left string.
- Moving from the above-left cell to the current cell represents changing the character represented by the row into the character represented by the column (or vice versa).
Each of these operations has a cost of 1, except when the characters represented by the row and column are already the same; then the cost of moving diagonally is 0. From these three choices we select the one that gives the lowest total cost (we are looking for the minimum edit distance; otherwise it could become arbitrarily large).
The filled-in table gives the minimum distance in the green lower-right cell, since this represents the minimum cost when the two strings are aligned so that the last characters of both strings are taken into account. One possible way of aligning the words to get that distance is highlighted in yellow.
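The table-filling procedure above can be sketched in Python. This is a minimal implementation (the function name and variable names are my own), building the full table row by row:

```python
def levenshtein(a: str, b: str) -> int:
    # table[i][j] holds the minimum edit distance between a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    # First column: turn a[:i] into the empty string by i deletions.
    for i in range(len(a) + 1):
        table[i][0] = i
    # First row: build b[:j] from the empty string by j insertions.
    for j in range(len(b) + 1):
        table[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Moving diagonally is free when the characters already match.
            diagonal = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,             # delete from the left word
                table[i][j - 1] + 1,             # insert into the left word
                table[i - 1][j - 1] + diagonal,  # substitute (or match)
            )
    return table[len(a)][len(b)]
```

The lower-right cell, `table[len(a)][len(b)]`, is the green cell from the figure: `levenshtein("Tilsit", "Tulsit")` gives 1 and `levenshtein("Caerphilly", "Carfilly")` gives 3, as claimed above.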
We might also want to assign a cost of 1 when two adjacent letters swap around. The normal Levenshtein distance in such a case would be 2. Adding this case is simple: at each cell, just check for it in addition to the three cases already considered.
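The swap check can be bolted onto the same table fill. This variant is commonly called the optimal string alignment (or Damerau–Levenshtein) distance; the sketch below (names are mine) adds a fourth candidate that looks two cells back diagonally:

```python
def osa_distance(a: str, b: str) -> int:
    # Levenshtein distance extended so that swapping two adjacent
    # letters also costs 1 (optimal string alignment variant).
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i
    for j in range(len(b) + 1):
        table[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,          # deletion
                table[i][j - 1] + 1,          # insertion
                table[i - 1][j - 1] + cost,   # substitution / match
            )
            # The extra case: the two neighbouring letters are swapped.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                table[i][j] = min(table[i][j], table[i - 2][j - 2] + 1)
    return table[len(a)][len(b)]
```

For example, `osa_distance("Tilsit", "Tlisit")` is 1, where plain Levenshtein would give 2.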
Going even further
In the same way we can add even more cases. We might want certain transitions to cost less or more. For example, the cost of going from “i” to “u” can be set to 0.8. When making a diagonal transition where the row is “i” and the column is “u”, we just check whether “i changes to u” appears in a list of non-standard transition costs. If it does, add that cost to the total instead of the usual cost of 1. We can also keep a separate list for insertions. One would guess that inserting vowels (or, equivalently, deleting them in the other string) should be cheap, so just add all the vowels to the deletion list with a cost of, say, 0.5.
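A sketch of this weighted variant follows, using the example costs from the text (0.8 for “i” to “u”, 0.5 for vowel insertions/deletions). The cost tables and names are mine, and for simplicity this sketch applies the same cheap-vowel cost in both directions, so it is symmetric; keeping separate insertion and deletion lists would make it asymmetric:

```python
# Hypothetical cost tables, following the examples in the text.
SUBSTITUTION_COST = {("i", "u"): 0.8}
DELETION_COST = {vowel: 0.5 for vowel in "aeiou"}  # default cost is 1

def weighted_distance(a: str, b: str) -> float:
    table = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    # Deleting cheap characters also makes the border rows cheaper.
    for i in range(1, len(a) + 1):
        table[i][0] = table[i - 1][0] + DELETION_COST.get(a[i - 1], 1)
    for j in range(1, len(b) + 1):
        table[0][j] = table[0][j - 1] + DELETION_COST.get(b[j - 1], 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            else:
                # Use the non-standard transition cost if one is listed.
                sub = SUBSTITUTION_COST.get((a[i - 1], b[j - 1]), 1)
            table[i][j] = min(
                table[i - 1][j] + DELETION_COST.get(a[i - 1], 1),
                table[i][j - 1] + DELETION_COST.get(b[j - 1], 1),
                table[i - 1][j - 1] + sub,
            )
    return table[len(a)][len(b)]
```

With these weights, `weighted_distance("ppl", "people")` comes out at 1.5: three vowel insertions at 0.5 each.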
Furthermore, some gangsta might be very likely to spell “Ilchester” as “ilchesta”, “Venezuelan Beaver Cheese” as “Venezuelan Beava Cheese”, or “Perle de Champagne” as “Perle da Shampane”. We might therefore want to add “er to a” with cost 0.7, and “ch to sh” with cost 0.9. Finally, for the likely case that someone changes “Roquefort” to “roc4t”, we add “q to c” with cost 0.9 and “for to 4” with cost 0.8.
Because one character in the first string can now map to two or more characters in the second, we have to compare costs from more than 3 cells. For example, when we stand on the blue cell, we can move from any of the usual 3 surrounding cells, with costs 4.4, 4.4, or 5.4. Or we can recognise the “for to 4” transformation in our list. The string “for” has a length of 3, so we look 3 characters back to find a total cost of 1.9 + 0.8 = 2.7. This is the lowest of these 4 possibilities, so we fill in the cell with that value.
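The multi-character lookup can be sketched as follows (the transformation list uses the costs from the text; function and variable names are mine). For each cell, besides the usual three neighbours, we check every listed transformation and, when both prefixes end with its source and target substrings, jump back by the substring lengths:

```python
# Hypothetical multi-character transformations from the text:
# (source substring, target substring, cost).
TRANSFORMS = [("er", "a", 0.7), ("ch", "sh", 0.9),
              ("q", "c", 0.9), ("for", "4", 0.8)]

def transform_distance(a: str, b: str) -> float:
    table = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(table[i - 1][j] + 1)   # deletion
            if j > 0:
                candidates.append(table[i][j - 1] + 1)   # insertion
            if i > 0 and j > 0:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                candidates.append(table[i - 1][j - 1] + cost)
            # Look further back for multi-character transformations.
            for src, dst, cost in TRANSFORMS:
                if a[:i].endswith(src) and b[:j].endswith(dst):
                    candidates.append(table[i - len(src)][j - len(dst)] + cost)
            table[i][j] = min(candidates)
    return table[len(a)][len(b)]
```

For example, `transform_distance("ilchester", "ilchesta")` gives 0.7 via the “er to a” rule, where the standard distance would be 2.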
We present here the confused cheese matrix:
This is not very useful, but it has a good name. (Note that we did not define the transformations to be symmetrical, so the distance from “ppl” to “people” is 1.5, but from “people” to “ppl” it is 3.) Below we list the top 10 variants from a dictionary for “ppl”, “ilchesta”, “roc4t”, and “cthns”.
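Producing such a list amounts to ranking every dictionary word by its distance from the input and keeping the closest few. A sketch, here using the plain Levenshtein distance and a toy dictionary of my own choosing (a real shop would load a proper word list and one of the weighted distances above):

```python
# A toy dictionary; any of the distance functions above could be
# plugged in instead of plain Levenshtein.
DICTIONARY = ["people", "cheese", "Ilchester", "Roquefort",
              "Caerphilly", "Tilsit", "Caithness", "Red Leicester"]

def levenshtein(a: str, b: str) -> int:
    # Row-by-row Levenshtein, keeping only the previous row in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def best_matches(word, dictionary, distance, n=10):
    # Rank dictionary entries by distance from the (possibly
    # misspelled) input word, closest first.
    return sorted(dictionary, key=lambda entry: distance(word, entry))[:n]
```

With this, `best_matches("ppl", DICTIONARY, levenshtein)` puts “people” at the top of the list.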
The next and final step is of course to implement a fully functional automated cheese shop, which is left as an exercise to the reader.