I have always been fascinated by language. A human language (also called a natural language) is a complex, highly sophisticated system that allows us to express thoughts and to communicate with one another. A language is built up from words, phrases and sentences to which speakers of the language attach meaning. The vocabulary of a language is the set of words that belong to that language. Words can be seen as made up of letters, when written, or phonemes (sounds), when spoken. A sentence in a language is a well-formed sequence of words that anyone who knows the language can understand.
The power of language lies in its creativity: a speaker can use a finite number of words to construct an infinite number of previously unspoken sentences that are valid in the language. However, there can only be a finite number of rules or principles that govern the construction and interpretation of sentences. To describe and make sense of a language, we need to construct such a set of rules, called a grammar. A grammar can serve two purposes: to distinguish between valid and invalid sentences, and to give an analysis of valid sentences.
Why is this important? In Natural Language Processing (NLP), we need to make sense of large volumes of text. This text consists of sentences in some natural language, and to analyse it, our computer programs need to know something about that language. We represent an approximation of our knowledge of a language with some formal description, a model. The most successful models used in NLP today are probabilistic models: they assign probabilities to events, in our case to the occurrence of words. An important reason for using probabilistic models is the ambiguity present at all levels of language. To construct a model, we use training data (text) to set the model's parameters. The training data may also be annotated (in our case, with information about the structure of the sentences).
A simple yet very useful approach is the n-gram model. An n-gram is a sequence of n consecutive words. (In NLP, a common choice of n is 3.) To construct an n-gram model, we find all the n-grams that occur in the training data and count how many times each occurs. Given the previous n-1 words, we compute the probability of the next word as the number of times that word follows the n-1 words, divided by the total number of times the n-1 words occur. Large n-gram language models play an important role in NLP applications such as Automatic Speech Recognition and Statistical Machine Translation. The Google n-gram viewer is a large database of n-grams collected from books.
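The counting scheme described above can be sketched in a few lines of Python. This is a minimal bigram (n = 2) example; the three-sentence corpus and the function names are illustrative assumptions, and a real model would of course be trained on far more text:

```python
from collections import defaultdict

def train_ngrams(sentences, n=2):
    """Count all n-grams and all (n-1)-word contexts in the training data."""
    counts = defaultdict(int)          # how often each n-gram occurs
    context_counts = defaultdict(int)  # how often each (n-1)-word context occurs
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            ngram = tuple(words[i:i + n])
            counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return counts, context_counts

def probability(counts, context_counts, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    if context_counts[tuple(context)] == 0:
        return 0.0
    return counts[tuple(context) + (word,)] / context_counts[tuple(context)]

# Toy training corpus (an assumption for illustration only).
corpus = ["the cat sat", "the cat ran", "the dog sat"]
counts, context_counts = train_ngrams(corpus, n=2)
print(probability(counts, context_counts, ("the",), "cat"))  # 2/3: "cat" follows "the" twice out of three
```

In practice, unseen n-grams (probability zero here) are a serious problem, which is why real n-gram models add smoothing techniques on top of these raw counts.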
The information encoded in an n-gram model is very useful. However, an n-gram model cannot capture non-local or variable-length dependencies between words. Nor does an n-gram model reflect the way sentences are structured, or give any meaningful analysis of sentences. Therefore, we need more powerful models. In formal language theory, there are classes of formal languages with different degrees of expressive power. These classes are organised in the Chomsky hierarchy.
The lowest class in the Chomsky hierarchy is that of the regular languages. Regular languages have more expressive power than n-gram models. A regular language can be expressed by a finite-state automaton or by regular expressions. A finite-state automaton (FSA) is a theoretical device consisting of states and labeled transitions between states. The FSA reads a sequence from left to right; on reading a word, it follows a transition from its current state to a following state. A probability can be assigned to each accepted word sequence. However, there are still language constructions that a regular language cannot properly express. The most important of these is centre-embedding. An example of a sentence with centre-embedded structure is:
The rat the cat the dog chased ate died.
Though this sentence is difficult to read and mostly of theoretical significance, it is a valid sentence and formal grammars should be able to process it.
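The finite-state automaton described above can be sketched directly as a transition table. The states, words, and accepted language below are illustrative assumptions, not from the text: this toy FSA accepts word sequences of the form "the (old)* man walks":

```python
def accepts(sequence):
    """Return True if the FSA reaches its accepting state on this sequence."""
    # (current state, word) -> next state; missing entries mean "reject".
    transitions = {
        ("q0", "the"): "q1",
        ("q1", "old"): "q1",   # a loop: regular languages allow unbounded repetition
        ("q1", "man"): "q2",
        ("q2", "walks"): "q3",
    }
    state = "q0"
    for word in sequence:
        state = transitions.get((state, word))
        if state is None:
            return False       # no transition for this word: reject
    return state == "q3"       # q3 is the single accepting state

print(accepts("the old old man walks".split()))  # True
print(accepts("the man the walks".split()))      # False
```

Note what the loop on q1 can and cannot do: it handles arbitrary repetition, but no finite-state machine can match each "rat/cat/dog" with its verb in the centre-embedded sentence above, because that requires remembering an unbounded amount of nesting.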
The next class of languages in the Chomsky hierarchy is that of the context-free languages, described by context-free grammars. Context-free grammars provide a natural way to express the structure of languages. Words can be grouped into phrases, which can recursively be grouped into phrases again, assigning a hierarchical structure to sentences. In linguistics, the structure of a sentence is called its syntax. Each word in a language has a part of speech (word category), such as noun, verb or adjective. Phrases are labeled according to the part of speech of the word in the phrase that contributes most to its meaning, e.g., noun phrase (NP) or verb phrase (VP). These hierarchical structures are also expressed as syntax trees.
Here is an example of a syntax tree, written here in bracketed notation, for the sentence "The man took the book":

(Sentence (NP (Det The) (Noun man)) (VP (Verb took) (NP (Det the) (Noun book))))
The context-free grammar rules used in this tree are:
Sentence -> NP VP
VP -> Verb NP
NP -> Det Noun
Det -> The
Noun -> man
Noun -> book
Verb -> took
In NLP, a program that analyses the syntax of sentences, usually based on a context-free grammar, is known as a parser. One application of parsers is in extracting information from text.
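The idea of a parser can be sketched with a minimal recursive-descent implementation of the toy grammar above. This is only an illustration: real parsers need far more machinery (ambiguity, left recursion, broad-coverage grammars), and the lowercase "the" rule is an assumption added so the second determiner in the example sentence can be matched:

```python
# Toy grammar from the text, as a dict from non-terminal to its productions.
GRAMMAR = {
    "Sentence": [["NP", "VP"]],
    "VP": [["Verb", "NP"]],
    "NP": [["Det", "Noun"]],
    "Det": [["The"], ["the"]],  # lowercase variant added here as an assumption
    "Noun": [["man"], ["book"]],
    "Verb": [["took"]],
}

def parse(symbol, words, pos):
    """Try to parse `symbol` starting at words[pos]; return (tree, new_pos) or None."""
    for production in GRAMMAR.get(symbol, []):
        children, i = [], pos
        for part in production:
            if part in GRAMMAR:                        # non-terminal: recurse
                result = parse(part, words, i)
                if result is None:
                    break                              # this production fails; try the next
                subtree, i = result
                children.append(subtree)
            elif i < len(words) and words[i] == part:  # terminal: match the next word
                children.append(part)
                i += 1
            else:
                break
        else:
            return (symbol, children), i               # whole production matched
    return None

tree, end = parse("Sentence", "The man took the book".split(), 0)
print(tree)  # a nested (symbol, children) structure mirroring the syntax tree
```

Each node of the returned structure corresponds to one grammar rule application, so the parser's output is exactly the syntax tree assigned by the context-free grammar.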
There are other, more expressive classes in the Chomsky hierarchy: the context-sensitive languages and the recursively enumerable languages. There is an ongoing debate about which class of languages is most adequate for representing natural language. Though there is no accepted proof that English is not context-free, there are linguistic structures that may be expressed more appropriately by more powerful grammar formalisms.
The large amount of text on the web that needs to be processed increases the need for techniques that can deal with natural language. There are many challenges in this task. The reward, however, is technology that can interact with users more naturally, and perhaps new insights into how language really works.