An example of topic models on the web

Comic which shows some rage about coffee tweets

Have you ever followed someone on Twitter expecting great tweets, but instead you only see tweets about coffee and muffins? Topic models will come to your rescue.

It all comes down to the fact that users need more control on Twitter. Let’s be honest: most tweets in your stream only receive a cursory glance. Have you ever wanted a feature that lets you follow people based on the topics they really tweet about, instead of the (often misleading) description in their biography? Luckily, the idea is not far-fetched, because topic models can be used to identify people who tweet about topics that interest you.

Microsoft Research created a research prototype, Twahpic, that partially addresses the problem by illustrating that tweets can be categorized according to the topics learned from them. Twahpic takes a Twitter username or query as input and performs an analysis of the collected tweets to categorize them into 4 categories: substance, social, style and status. Under each category the most talked-about topics are displayed, as well as the words in the tweets that contribute to them. The result is represented visually in tag clouds.

Before we continue with Twahpic, let’s look at some background information on topic models.

Some Background on Topic Models

Topic models work by discovering patterns of word use and connecting documents that exhibit similar patterns. They have emerged as a powerful technique for finding structure in an otherwise unstructured collection of documents. Formally, topic models are probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical analysis of the original text. Topic models are useful for creating automated methods to organize, manage and deliver the content of digital libraries.

Latent Dirichlet Allocation (LDA)

LDA is an example of a topic model; it is a completely unsupervised algorithm that models each document as a mixture of topics. The model generates automatic summaries of topics in terms of a discrete probability distribution over words for each topic, and further infers a per-document discrete distribution over topics. LDA makes the assumption that each word is generated from an underlying topic.
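To make the generative assumption concrete, here is a minimal sketch of how LDA imagines a document being produced: draw a per-document distribution over topics, then draw each word from one underlying topic. The toy topics, the `alpha` values and the function names are assumptions made for illustration; real LDA works in the opposite direction and infers these distributions from observed text.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate (topic, word) pairs following LDA's generative story."""
    topic_names = list(topics)
    # Per-document discrete distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    document = []
    for _ in range(length):
        # Each word is generated from exactly one underlying topic.
        topic = rng.choices(topic_names, weights=theta)[0]
        words = list(topics[topic])
        weights = list(topics[topic].values())
        word = rng.choices(words, weights=weights)[0]
        document.append((topic, word))
    return theta, document


# Hypothetical toy topics: each one is a discrete distribution over words.
TOPICS = {
    "coffee": {"coffee": 0.5, "muffin": 0.3, "latte": 0.2},
    "web": {"web": 0.5, "html": 0.3, "browser": 0.2},
}

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
```

Here `theta` is the document's mixture over the two toy topics and `doc` lists which topic produced each word.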

Using LDA we can discover trends in language usage, thus finding words that tend to occur together. LDA is used to learn, for example, 100 latent topics in a document collection, each of which is represented by a cluster of words. The table is an example from Blei and Lafferty [1], who applied LDA to the JSTOR archive of the journal Science. The table shows 5 topics from the 50-topic model.


The table illustrates the latent topics discovered in the JSTOR archive; it shows that a topic is a collection of highly probable words.

Labeled LDA

Twahpic uses a variation of LDA known as Labeled LDA (L-LDA), which incorporates a supervised label set into the algorithm. LDA is not appropriate for multi-labeled corpora because, as an unsupervised model, it offers no obvious way of incorporating a supervised label set into its learning procedure. L-LDA adds this supervision by constraining the topic model to use only those topics that correspond to a document’s observed label set, which means that a document’s topics are restricted to its labels. It is most appropriate for document archives where each document carries multiple tags (labels).
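The label constraint can be pictured with a small sketch: given some weights over all topics, keep only the topics that appear in the document's label set and renormalize. This is a simplified illustration of the restriction, not the actual L-LDA inference procedure, and the function name and toy weights are assumptions.

```python
def restrict_to_labels(topic_weights, labels):
    """Keep only topics named in `labels` and renormalize their weights."""
    allowed = {t: w for t, w in topic_weights.items() if t in labels}
    total = sum(allowed.values())
    if total == 0:
        raise ValueError("document has no labeled topic with positive weight")
    return {t: w / total for t, w in allowed.items()}


# Hypothetical weights over three topics; the document is labeled with two.
weights = {"sports": 0.2, "music": 0.3, "politics": 0.5}
restricted = restrict_to_labels(weights, labels={"sports", "music"})
```

After the restriction, "politics" can no longer explain any word in the document, no matter how likely it was beforehand.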

Now that we have a bit of background on topic models, let’s have a look at the Twahpic implementation.

Twahpic implementation

Twahpic has defined 4 categories: substance, social, status and style.

  • Substance: Things that are concrete in nature, for example sports, politics and music.
  • Social: Topics with terms about being social, for example “thanks” and “party”.
  • Status: Topics that contain terms typically used in status updates, such as “coming”, “going” and “party”.
  • Style: Topics in this category contain a particular language use, dialect or linguistic flair.

Twahpic uses LDA to learn 200 latent topics, which are manually sorted into the 4 categories. The latent topics capture the broad trends on Twitter. Labeled LDA is then used to discover specific, smaller trends. L-LDA requires every document to have labels, so labels must be constructed for each tweet being analyzed. The labels are constructed from the metadata of a tweet and consist of “reply”, “@mention”, emoticons and hashtags. The authors settled on a set of 504 labels for their implementation; not surprisingly, most of the labels are hashtags.
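As a rough illustration of how such labels could be derived from a tweet, the sketch below inspects the tweet text directly. The emoticon list, the rule that a tweet starting with "@" counts as a reply, and the function name are all assumptions; the text above does not spell out the exact rules Twahpic uses.

```python
import re

# Hypothetical, deliberately small emoticon list.
EMOTICONS = (":)", ":(", ";)", ":D")


def tweet_labels(text):
    """Build a label set for one tweet from reply/mention/emoticon/hashtag cues."""
    labels = set()
    if text.lstrip().startswith("@"):
        # Assumption: a tweet that opens with a username is a reply.
        labels.add("reply")
    elif re.search(r"@\w+", text):
        labels.add("@mention")
    for emoticon in EMOTICONS:
        if emoticon in text:
            labels.add(emoticon)
    # Each distinct hashtag becomes its own label, lowercased.
    labels.update(tag.lower() for tag in re.findall(r"#\w+", text))
    return labels


labels = tweet_labels("@w3c thanks for the #HTML5 update :)")
```

A tweet with none of these cues gets an empty label set, so a real pipeline would need to decide how to handle unlabeled tweets.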

The learned labeled topics are sorted into the categories by noting that posts labeled as “reply” or “@mention” can usually be placed under the social category, since a “reply” is a response to a person and an “@mention” refers to a person in a tweet. The emoticons can be considered for the style or social category, while hashtags can be considered as substance. This may seem like a crude way of categorizing the labeled topics, since not all labels clearly fall into a category. There is, however, a large number of labels that can easily be categorized automatically.
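The sorting rule described here can be written down as a small lookup. The sketch returns the set of candidate categories for a label rather than a single answer, because emoticons may belong to either style or social; the function name and the empty result for unrecognized labels are assumptions.

```python
def candidate_categories(label):
    """Map a tweet label to the categories it may be sorted into."""
    if label in ("reply", "@mention"):
        return {"social"}
    if label.startswith("#"):
        return {"substance"}
    if any(mark in label for mark in (":", ";")):
        # Emoticons are ambiguous: they may count as style or social.
        return {"style", "social"}
    # Anything else would need manual review.
    return set()


categories = {label: candidate_categories(label)
              for label in ("reply", "#html5", ":)")}
```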

Interpreting the results

Now that we have a little background on the mechanics of Twahpic, let’s interpret the generated results. In this example we will use the @w3c (World Wide Web Consortium) Twitter account. The included image, generated by Twahpic, is a breakdown of @w3c tweets according to the 4 categories as well as the topics mentioned in the collection of tweets. As expected, the largest category is substance, while the most tweeted topic is “Web”. Under each topic is a tag cloud representing the words. The size of a word represents its global frequency in that topic, while its shade is proportional to its frequency in @w3c’s tweets. For example, in the web topic the word google occurs frequently in tweets across Twitter, while web and open occur frequently in the tweets by @w3c.
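The two visual channels of the tag cloud can be sketched as two normalized numbers per word: a size taken from the word's global count in the topic and a shade taken from its count in the user's own tweets. The counts and tweets below are made up, and scaling both values to the range 0 to 1 is an assumption about how such a display might be driven, not a description of Twahpic's code.

```python
from collections import Counter


def tag_cloud_weights(topic_word_counts, user_tweets):
    """Return {word: (size, shade)} with both values scaled to 0..1.

    Assumes topic_word_counts is non-empty and has positive counts.
    """
    user_counts = Counter(
        word for tweet in user_tweets for word in tweet.lower().split()
    )
    max_global = max(topic_word_counts.values())
    max_user = max(user_counts[word] for word in topic_word_counts)
    weights = {}
    for word, global_count in topic_word_counts.items():
        # Size: how common the word is in the topic across all of Twitter.
        size = global_count / max_global
        # Shade: how common the word is in this user's tweets.
        shade = user_counts[word] / max_user if max_user else 0.0
        weights[word] = (size, shade)
    return weights


# Hypothetical counts for a "web" topic and two made-up tweets.
topic = {"google": 10, "web": 5, "open": 5}
tweets = ["Open web standards", "The web is open"]
weights = tag_cloud_weights(topic, tweets)
```

With this toy data, google gets the largest size but the lightest shade, while web and open are smaller but fully shaded, which mirrors the pattern described for @w3c.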

The topics and categories from the @w3c Twitter stream.

Selecting an individual tweet provides more information about the portion of each category evident in that tweet, as well as the topics in the tweet. The image below shows a @w3c tweet that falls into the substance and social categories. The tweet exhibits the time topic (substance) and the social media 1 topic (social). The result agrees with the content of the tweet, which contains a reference to Twitter and a reference to time.

A tweet made by @w3c. Image of the LDA time topic. Image of the social media topic.

Reviewing the results, we can see that the process is not perfect and that a number of tweets do not fit their assigned topics. Even so, topic models still provide a new and interesting visualization of tweets on Twitter.


Twahpic is an interesting research prototype that illustrates the application of topic models on the web. It is useful in detecting topics as well as categorizing tweets. Although only partially addressing the problem mentioned in the introduction, it illustrates how Twitter can become more useful by applying topic models.


[1] Blei, D. M. and Lafferty, J. D., “Topic Models”, Text Mining: Classification, Clustering, and Applications, 2009, pp. 71-89, Chapman & Hall

[2] Ramage, D., Dumais, S., & Liebling, D., “Characterizing microblogs with topic models”. International AAAI Conference on Weblogs and Social Media. The AAAI Press.

[3] Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009). “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora”. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (pp. 248–256). Association for Computational Linguistics.
