Wikipedia is an amazing source of information for readers, for people who want to check facts quickly, and sometimes the last resort for settling a heated debate. Recently, Wikipedia has also attracted the attention of researchers, who recognize its unique structure and scale as a great tool for research and experiments.
Wikipedia’s structure is built around articles. Currently, the English Wikipedia consists of 3.8 million articles, of which 1.3 million are counted as good articles because they contain more than 200 readable characters (~33 words) and have at least one incoming link from a different page within Wikipedia. The rate at which the English Wikipedia grows has decreased over the last couple of years; nevertheless, about 900 new articles are added each day (1).
Wikipedia’s articles are written and organized according to editorial and structural guidelines. For example, an article may describe only a single concept, and there is only one article for each concept. The title of an article also has to be similar to the corresponding entry found in traditional thesauri. If a title has to be qualified further, an expression enclosed in parentheses is appended to it. For example, Car_(function) is the title of the article about the operation on linked lists used in the LISP programming language. Furthermore, the titles of Wikipedia articles are case sensitive: Optic_nerve describes the nerve that transmits information from the retina to the brain, while Optic_Nerve is the title of an article about the comic book of the same name.
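The title conventions above can be illustrated with a minimal sketch. The helper name `split_title` and the regular expression are assumptions made for this example; they are not part of any Wikipedia tooling, and real titles have more edge cases than this handles.

```python
import re

# Hypothetical pattern: a base term, an underscore, and one trailing
# qualifier in parentheses, as in "Car_(function)".
_QUALIFIED = re.compile(r"^(?P<base>.+?)_\((?P<qualifier>[^()]+)\)$")


def split_title(title):
    """Split a title into (base term, qualifier); qualifier is None if absent."""
    match = _QUALIFIED.match(title)
    if match is None:
        return title, None
    return match.group("base"), match.group("qualifier")


print(split_title("Car_(function)"))
print(split_title("Optic_nerve"))

# Titles are compared as plain strings here, so differences in case
# identify different articles.
print("Optic_nerve" == "Optic_Nerve")
```

Treating titles as exact strings keeps Optic_nerve and Optic_Nerve apart, which matches the case-sensitivity rule described above.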
Wikipedia redirects play an important role in navigating between articles, since equivalent terms are linked to an article through redirects. Usually the most general term, in its singular form, is chosen as the title of an article. For example, if you type http://en.wikipedia.org/wiki/Cars into the browser, you are automatically redirected to the article with the title “Automobile”. Currently, Wikipedia contains 5.2 million redirects. If the desired article cannot be determined automatically, a disambiguation page lists all possible meanings of the ambiguous term to help the reader find the right article. The English Wikipedia contains about 100,000 disambiguation pages.
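As a rough sketch of how redirects and disambiguation pages could be modelled once extracted, consider the following. Only the "Cars" to "Automobile" redirect comes from the text; the other entries, the function names, and the `max_hops` limit are illustrative assumptions.

```python
# Toy data. Only "Cars" -> "Automobile" is taken from the text;
# the remaining entries are made up for illustration.
REDIRECTS = {
    "Cars": "Automobile",
    "Motorcars": "Motorcar",
    "Motorcar": "Automobile",
}
DISAMBIGUATION = {
    "Mercury": ["Mercury_(planet)", "Mercury_(element)", "Mercury_(mythology)"],
}


def resolve_redirect(title, redirects, max_hops=5):
    """Follow redirects until a non-redirect title is reached."""
    seen = {title}
    hops = 0
    while title in redirects:
        if hops == max_hops:
            raise ValueError("too many redirect hops")
        title = redirects[title]
        hops += 1
        if title in seen:
            raise ValueError("redirect loop at " + title)
        seen.add(title)
    return title


def lookup(term, redirects, disambiguation):
    """Return the candidate articles for a term (several if it is ambiguous)."""
    target = resolve_redirect(term, redirects)
    return disambiguation.get(target, [target])


print(lookup("Cars", REDIRECTS, DISAMBIGUATION))
print(lookup("Mercury", REDIRECTS, DISAMBIGUATION))
```

The loop check is a defensive choice for hand-built data. Whether chains or loops occur in a given dump is not something the sketch assumes either way.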
Another important property of Wikipedia is that it contains nearly 80 million internal links connecting articles with each other. When analysing the link structure, not only the destination of a link matters but also the anchor text used for the link. The anchor text can be used, for example, for word sense disambiguation: the same anchor may point to different articles, and the distribution of its targets indicates which sense is most common.
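A minimal sketch of the anchor-text idea: count how often each anchor text points to each article and treat the relative frequencies as a rough sense distribution. The link pairs below are invented for illustration, and the function names are assumptions rather than an established method from the cited work.

```python
from collections import Counter, defaultdict

# Invented (anchor text, link target) pairs; real pairs would be parsed
# from the article markup of a Wikipedia dump.
LINKS = [
    ("jaguar", "Jaguar"),
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar"),
    ("Jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar"),
    ("big cat", "Jaguar"),
]


def build_anchor_index(links):
    """Map each lower-cased anchor text to a Counter of its link targets."""
    index = defaultdict(Counter)
    for anchor, target in links:
        index[anchor.lower()][target] += 1
    return index


def sense_distribution(index, anchor):
    """Return the relative frequency of each target for the given anchor."""
    counts = index.get(anchor.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {target: count / total for target, count in counts.items()}


index = build_anchor_index(LINKS)
print(sense_distribution(index, "jaguar"))
```

Lower-casing the anchor is a simplification made here. Because article titles are case sensitive, a real system might prefer to keep the original case.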
Articles in Wikipedia are also organized into a category structure. For example, the article Car belongs to the category “Wheeled vehicles”, which is a subcategory of “Land Vehicles”, which in turn belongs to the category “Land Transport”. Currently, Wikipedia contains around 800,000 categories. The purpose of categories is to organize articles into a hierarchical structure. Interestingly, this structure is not a strict tree but also contains cycles. For example, “Education” belongs to “Social Science”, which belongs to “Academic disciplines”, which belongs to “Academia”, which again falls under the category “Education”. This cycle indicates that people can be educated about education.
Other aspects of Wikipedia have also served as a basis for research, but I am not going to discuss them further here: infoboxes, templates, discussion pages and edit histories.
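The cycle in the category structure can be found with a depth-first search. The sketch below encodes the two category chains mentioned above as a child-to-parents mapping. The representation and the function name `find_cycle` are assumptions for this example, and the recursive form is only suitable for small graphs.

```python
# Child -> parent categories, taken from the two examples in the text.
PARENTS = {
    "Car": ["Wheeled vehicles"],
    "Wheeled vehicles": ["Land Vehicles"],
    "Land Vehicles": ["Land Transport"],
    "Education": ["Social Science"],
    "Social Science": ["Academic disciplines"],
    "Academic disciplines": ["Academia"],
    "Academia": ["Education"],
}


def find_cycle(parents, start):
    """Return a cycle reachable from start as a list of names, or None."""
    path = []        # categories on the current search path, in order
    on_path = set()  # same categories, for fast membership tests
    done = set()     # categories fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # Walked back into the current path: cut out the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        path.append(node)
        for parent in parents.get(node, []):
            cycle = visit(parent)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.remove(node)
        done.add(node)
        return None

    return visit(start)


print(find_cycle(PARENTS, "Car"))
print(find_cycle(PARENTS, "Education"))
```

For a graph with hundreds of thousands of categories, an iterative search with an explicit stack would avoid Python's recursion limit.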
The accuracy of Wikipedia articles is always under scrutiny and has also been a major topic of research. Giles (2) compared randomly chosen articles from Wikipedia with the corresponding articles from Encyclopedia Britannica. It turned out that both encyclopedias contained equally many significant errors, while Wikipedia contained more subtle errors such as misleading statements and omitted facts. Other approaches assess the quality of Wikipedia articles with metrics based on the number of authors who worked on an article, the total number of edits, the size of the article, the number of internal and external links, or the stability of the article (3, 4, 5). Wikipedia’s articles have also been used to create language models that capture particular characteristics of a language (6).
Wikipedia’s internal link and category structure is an ideal starting point for creating thesauri. Terms in a thesaurus are related by four different kinds of relations:
1) the two terms are equivalent
2) one term is broader than the other
3) one term is narrower than the other
4) the terms are connected by any other type of semantic relation
Wikipedia’s redirect pages provide precisely the information that is needed to define equivalent terms. Furthermore, the category structure can be used to define broader and narrower relationships between terms. Lastly, the internal links of articles can be mined to define other semantic relations between two terms. The most important aspect is that Wikipedia contains articles in more than 270 languages. Therefore, Wikipedia is an ideal source for translating thesauri into other languages.
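The mapping from Wikipedia structures to the four thesaurus relations can be sketched as follows. This is an illustrative data model, not the construction method of the cited papers; the class, the function, and the toy data (including the "Engine" link and the assignment of "Automobile" to "Wheeled vehicles") are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str
    equivalent: set = field(default_factory=set)  # from redirects
    broader: set = field(default_factory=set)     # from parent categories
    narrower: set = field(default_factory=set)    # inverse of broader
    related: set = field(default_factory=set)     # from internal links


def build_thesaurus(redirects, parents, links):
    """Combine redirects, category links and internal links into entries."""
    entries = {}

    def entry(term):
        if term not in entries:
            entries[term] = ThesaurusEntry(term)
        return entries[term]

    for alias, target in redirects.items():
        entry(target).equivalent.add(alias)
    for child, categories in parents.items():
        for category in categories:
            entry(child).broader.add(category)
            entry(category).narrower.add(child)
    for source, target in links:
        entry(source).related.add(target)
        entry(target).related.add(source)
    return entries


# Toy input data, for illustration only.
thesaurus = build_thesaurus(
    redirects={"Cars": "Automobile"},
    parents={"Automobile": ["Wheeled vehicles"]},
    links=[("Automobile", "Engine")],
)
print(thesaurus["Automobile"])
```

Internal links are treated as symmetric "related" pairs here, which is a simplification. A real system would likely weight or filter them first.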
Wikipedia articles can also be used to create ontologies, in which informative relations between articles, such as “Tree consists-of Wood”, are derived by mining redirects, internal links, category links, category names etc. (7, 8).
Wikipedia’s link structure has also been used to create a network on which algorithms such as the HITS algorithm and Google’s PageRank algorithm were applied (9). Interestingly, the PageRank results indicate that articles closely related to religion achieve high scores, while the HITS algorithm prefers articles about famous people, common words, animals and abstract concepts such as music, philosophy and religion. As such, the articles “Pope”, “God” and “Priest” were the highest ranking articles for PageRank, compared to “Television”, “Scientific classification” and “Animal” for the HITS algorithm.
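To make the difference between the two algorithms concrete, here is a small sketch of both as plain power iterations on a toy link graph. The graph, the damping factor of 0.85 and the fixed iteration count are assumptions for this example; the sketch says nothing about how the cited study (9) was implemented.

```python
import math

# Toy link graph: page -> pages it links to. Entirely made up.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def all_nodes(graph):
    return sorted(set(graph) | {t for targets in graph.values() for t in targets})


def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; pages without outgoing links spread rank evenly."""
    nodes = all_nodes(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph.get(v, []) or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def _normalize(scores):
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {v: s / norm for v, s in scores.items()}


def hits(graph, iterations=50):
    """Return (hub scores, authority scores) after repeated mutual updates."""
    nodes = all_nodes(graph)
    hubs = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # A page is a good authority if good hubs link to it.
        auth = {v: 0.0 for v in nodes}
        for v in nodes:
            for t in graph.get(v, []):
                auth[t] += hubs[v]
        auth = _normalize(auth)
        # A page is a good hub if it links to good authorities.
        hubs = _normalize({v: sum(auth[t] for t in graph.get(v, [])) for v in nodes})
    return hubs, auth


ranks = pagerank(GRAPH)
hubs, authorities = hits(GRAPH)
print(max(ranks, key=ranks.get), max(authorities, key=authorities.get))
```

PageRank assigns one score per page, whereas HITS separates hub and authority roles. That difference is one plausible reason the two rankings of Wikipedia articles diverge, although this toy graph is far too small to reproduce the effect.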
(1) Wikipedia Statistics.
(2) Internet encyclopaedias go head to head.
(3) Wikipedia as participatory journalism: reliable sources? Metrics for evaluating collaborative media as a news source.
(4) Cooperation and quality in Wikipedia.
(5) Extracting trust from domain analysis: a case study on the Wikipedia Project.
(6) N-Gram-based text categorization.
(7) Extracting semantic relationships between Wikipedia categories.
(8) A thesaurus construction method from large scale web dictionaries.
(9) Network analysis for Wikipedia.