Jan Buys (MSc Computer Science).

Probabilistic Tree Automata for Language Models and Grammar Correction

I am an MSc student in Computer Science at the MIH Media Lab. My research is in the area of Natural Language Processing. I have a background in Mathematics, Applied Mathematics and Computer Science. Automata theory and machine learning are central to my research.

Language models are an important component of many natural language processing (NLP) applications, including automatic speech recognition and statistical machine translation (e.g. Google Translate). Probabilistic models have proven very successful in NLP, and probabilistic n-gram models are commonly used as language models. Estimated from n-gram counts (for example, those made available through Google’s Ngram Viewer), these models are simple and efficient. However, because they condition each word only on the few words immediately preceding it, they cannot model long-distance dependencies in natural languages, such as wh-movement. Syntax-based language models, which describe the structure of sentences by parse trees, are a promising alternative to n-gram models. It has been shown that syntax-based methods for statistical machine translation can increase the quality of translations. Recently, probabilistic tree automata have been used to describe the syntactic structure and ambiguity of natural languages.
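
As a minimal sketch of the n-gram idea (not the models used in this research), the following Python snippet estimates a bigram model with add-one smoothing from a toy corpus. The corpus, the function names and the choice of smoothing are all illustrative assumptions.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (toy example)."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]   # sentence boundary markers
        contexts.update(padded[:-1])           # every word that has a successor
        bigrams.update(zip(padded, padded[1:]))
    # Words the model can predict: all corpus words plus the end marker.
    vocab = {w for tokens in sentences for w in tokens} | {"</s>"}
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sentence_prob(tokens, contexts, bigrams, vocab):
    """Probability of a sentence as a product of bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, word, contexts, bigrams, vocab)
    return prob

# Hypothetical two-sentence corpus, for illustration only.
corpus = [["the", "dog", "barks"], ["the", "dogs", "bark"]]
contexts, bigrams, vocab = train_bigram(corpus)
p_seen = sentence_prob(["the", "dog", "barks"], contexts, bigrams, vocab)
p_unseen = sentence_prob(["the", "dog", "bark"], contexts, bigrams, vocab)
```

Each word here is conditioned only on the word directly before it, so the model has no way of relating a verb to a subject that appears several words earlier. This is the long-distance limitation that syntax-based models aim to address.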

To analyze text such as social media and blog posts (possibly written by second-language speakers), one approach is to first translate the posts into a grammatically correct form. Once they are in this form, a standard NLP pipeline can be used for further analysis. My interest is thus in using syntax-based techniques for grammar correction.

My contact details: janbuys@ml.sun.ac.za