In computer science applications, it is often necessary to store a large set of words—or a dictionary—from some natural language, say English. Particularly, dictionaries feature as a crucial component in many tasks in the field of natural language processing, such as spell checking, autocompletion and speech recognition.
Naively, dictionaries may be implemented by using any standard collection data structure and populating it with words. However, most standard collection implementations ignore much of the structure apparent in modern natural languages. For example, in the English language, the words “linguist”, “linguistic”, “linguistically”, “linguistics” and “linguists” all share the common prefix “linguist-”, which should ideally be stored only once.
Furthermore, certain operations which are common for dictionaries may not be efficient for a naive dictionary implementation. For example, we would likely often want to check whether a certain word occurs in a dictionary, known as the
search operation. If we use a simple unsorted array for storing dictionaries, this operation is very expensive, since we must do a linear search through the array. Sorting the array solves this problem, since we may then do an efficient binary search. However, if we wish to add a word to a dictionary—the
add operation—then we must potentially shift all the elements in the array to open the correct space for the new word, resulting in an expensive operation. The
remove operation is similarly expensive for both of these data structures.
Enter the hash table
A hash table is a truly remarkable data structure: it promises to perform all three operations mentioned so far in constant time (not entirely accurate), regardless of the size of the dictionary. Actually, the time taken to
remove a particular word is proportional to the length of that word, since we must calculate a special function of the word, called the hash function, which looks at every letter in the word. However, if we assume that the length of words is small compared to the number of words in the dictionary, then this may be treated as a constant.
It would seem that we have found the perfect data structure for storing dictionaries! Indeed, the three common operations are all performed very efficiently. But wait, what was that thing we said about the word “linguist” and its derivatives? That’s right, if we store the English dictionary using a hash table, we’ll be storing many duplicate copies of the prefix “linguist-”. Is there perhaps some way to avoid this?
A deterministic model
Deterministic finite automata (or DFAs), also known by the name of state machines, are a simple, but very powerful concept from theoretical computer science. They were invented as a way of mathematically modelling computer logic, but it turns out that they’re useful for many other things as well, as we shall soon see.
DFAs are a kind of labelled directed graph. We refer to the nodes as states, and to the directed edges as transitions. Each label is a single character from an alphabet, say the ASCII alphabet. There is also one special state, called the initial state, and potentially many final states, each denoted with a double circle. The purpose of this state machine is to either accept or reject each string of characters from the alphabet. Inituitively, we start at the initial state, then for each character read from the string, we follow the transition labelled with that character, or reject the string if there is no such transition. After reading the entire string, we accept it if we are in a final state, and reject it if we are not.
This lengthy explanation deserves an example. Let’s see what a DFA accepting the words “cat”, “catacombs”, “cats”, “hair”, “haircombs” and “hairs” could look like.
The construction of this DFA is apparent. When we add a word, we simply follow the existing transitions as far as we can, then we add a state and transition for each letter not already present in the DFA, marking the state corresponding to the last letter as final. It should be clear that this DFA accepts the desired words, and rejects all others. This kind of DFA construction is known as a trie. The effect of storing a dictionary this way is that all common prefixes are stored only once. Hooray! This solves our “linguist-” problem.
What about our common operations? The
search operation simply involves reading the word and seeing whether we are in a final state. The
add operation has already been discussed. For the
remove operation, we first do a
search, and if it is successful, we mark the final state as non-final, also removing any states and transitions which have now become unnecessary. The time taken by each of these operations on a given word is proportional to the length of the word, or constant for our purposes. This is comparable to the performance of a hash table.
The minimalistic approach
Can we do better? The reader may have observed that in the previous example, there are two copies of nodes and transitions spelling the suffix “-combs”. Aha! If DFAs give us a way of storing common prefixes only once, perhaps they also give us a way of storing common suffixes only once. Indeed, there is a special kind of DFA called a minimal DFA (or MDFA) which does just that. It turns out that for every DFA, there is a unique minimal DFA—one which has as few states as possible. There are numerous DFA minimization algorithms, which will not be discussed here. However, using these algorithms still give efficient implementations of the three common operations (although addition is slightly more expensive, but not immensely). The minimal DFA accepting the words “cat”, “catacombs”, “cats”, “hair”, “haircombs” and “hairs” is shown below.
As we can see, we save even more with the storage of this small dictionary, since each suffix is also stored only once. Is there some way in which we can measure exactly how great these savings are for an arbitrary dictionary?
Finding a formula to describe the memory usage of trie and minimal DFA dictionaries would be an impossible task. Indeed, it greatly depends on parameters of the language being stored. A language with many common suffixes and prefixes (like English) should see significant savings, but for other languages with different structures, the savings may be marginal. Therefore, we resort to running some simulations on a corpus of 99 171 English words. Multiple samples of different sizes were taken from this corpus, and dictionaries were constructed using both kinds of data structures. The average number of transitions was recorded for each data structure with each corpus size. Recording the number of states is unnecessary, since we are working only with connected graphs. For comparison to the standard collection data structures, the total number of letters in each corpus was also recorded.
As we would expect, the total number of characters increases linearly with the number of words. The number of transitions for tries and MDFAs are clearly sublinear, but to which degree? We can answer this by drawing the same plot on a log-log scale.
All three log-log plots are now approximately linear, which means that the original plots can be described as nd, where n is the number of words and d is the respective slope of each log-log plot. Measuring these slopes yields a memory consumption of n0.9996 for standard collection data structures (very close to n, as expected), n0.8520 for tries, and n0.7648 for MDFAs. For large dictionaries, these slower growth rates mean massive savings. Thus, these data structures are very successful in reducing the memory consumption (at least for the English language!) of natural language dictionaries, while also providing very efficient running times for the common dictionary operations.