Clustering algorithms find natural structure in data and can be used in any domain where the similarity between two data points can be quantified.
For Twitter, we will try to group users by the topics they tweet about most. A simple approach is to represent each user by their latest 100 tweets, treat those tweets as a single document, and weight the terms of that document with TF-IDF. We can then use cosine similarity to measure how similar two user documents are and apply k-means to perform the actual clustering.
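To make the representation concrete, here is a minimal sketch of TF-IDF weighting and cosine similarity in plain Python. It is not the code used for the experiment: the function names (`tf_idf_vectors`, `cosine_similarity`), the whitespace tokeniser, the particular TF-IDF variant (relative term frequency times `log(N / df)`) and the three tiny "user documents" are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase + whitespace tokenisation.
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up stand-ins for "latest 100 tweets joined into one document".
user_documents = [
    "new album out tonight studio tour album",   # hypothetical music user
    "tour dates announced new album soon",       # hypothetical music user
    "great game tonight team won the match",     # hypothetical sports user
]

vectors = tf_idf_vectors(user_documents)
music_vs_music = cosine_similarity(vectors[0], vectors[1])
music_vs_sports = cosine_similarity(vectors[0], vectors[2])
print(music_vs_music > music_vs_sports)
```

In this toy data the two music-like documents share several weighted terms, so their similarity comes out higher than the similarity between the music-like and the sports-like document.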
Now let us create a small proof of concept. We collect a set of 24 users, divided into 4 categories of 6 users each. Our goal is to use k-means to find 4 clusters that contain the same 6 users as the categories.
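The k-means step itself can be sketched as follows. This is an illustrative implementation under stated assumptions, not the one used for the experiment: the vectors are hypothetical three-term weights, they are scaled to unit length so that Euclidean distance orders pairs the same way as cosine similarity (for unit vectors, squared distance equals 2 − 2·cosine), and the initial centroids are chosen with a deterministic farthest-first rule purely to make the example reproducible.

```python
import math


def normalise(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100):
    """Return a cluster index (0..k-1) for every point."""
    # Assumption: deterministic farthest-first initialisation for reproducibility.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical term-weight vectors for six users from two topics.
raw_vectors = [
    [5, 1, 0], [4, 0, 1],   # topic A
    [0, 1, 5], [1, 0, 4],   # topic B
    [6, 1, 1],              # topic A
    [1, 1, 6],              # topic B
]
labels = k_means([normalise(v) for v in raw_vectors], k=2)
print(labels)
```

For the proof of concept described above, `k` would be 4 and each point would be the TF-IDF vector of one user.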
The following table shows the categories as well as the users in each category we collected.
| Music | Sports | Family | Technology |
|---|---|---|---|
| @wale, @tyga, @drake, @rickyrozay, @officialozzy, @dolly_parton | @jozyaltidore, @espn, @robdyrdek, @floydmayweather, @ryansheckler, @warrensapp | @bigcitymoms, @real_simple, @pbsparents, @playgrounddad, @preschoolers, @sesamestreet | @vkhosla, @padmasree, @sacca, @google, @bbcclick, @timoreilly |
Next, we run k-means and we find the following clusters.
| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|---|---|---|---|
| @tyga, @floydmayweather, @rickyrozay, @wale, @drake, @official_ozzy | @padmasree, @robdyrdek, @sesamestreet, @pbsparents, @sacca, @real_simple, @preschoolers, @playgrounddad | @espn, @warrensapp, @bigcitymoms, @ryansheckler, @jozyaltidore | @dolly_parton, @vkhosla, @timoreilly, @google, @bbcclick |
Judged liberally, each cluster corresponds to one category: Cluster 1 to Music, Cluster 2 to Family, Cluster 3 to Sports and Cluster 4 to Technology. In total, 18 of the 24 users end up in the cluster that matches their category (counting @officialozzy and @official_ozzy as the same account). We then create a tag cloud of the hashtags in each cluster.
The hashtag clouds are much more ambiguous than the cluster results above and are open to interpretation. The first hashtag cloud was constructed from Cluster 3, the second from Cluster 2, the third from Cluster 4 and the fourth from Cluster 1.
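A tag cloud is driven by how often each hashtag occurs in a cluster. The sketch below shows one way such counts could be gathered; the `hashtag_counts` helper, the regular expression and the sample tweets are assumptions for illustration, and rendering the actual cloud is left out.

```python
import re
from collections import Counter


def hashtag_counts(tweets):
    """Count hashtags (case-insensitively) across a list of tweet texts."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: a hashtag is '#' followed by word characters.
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", tweet))
    return counts


# Made-up tweets standing in for all tweets of the users in one cluster.
cluster_tweets = [
    "Back in the studio tonight #NewMusic #studio",
    "Tour starts next week #tour #newmusic",
    "Thanks for coming out! #tour",
]

counts = hashtag_counts(cluster_tweets)
print(counts.most_common(2))
```

The font size of each hashtag in the cloud would then be scaled by its count.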
The result for our small example is positive: a clustering algorithm can construct clusters that closely resemble the original categories. The hashtag representation was not a success, however, and other representation techniques should be considered to visualise the topics in each cluster.
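The likeness between the clusters and the original categories can also be expressed as a number. The sketch below computes cluster purity (the share of users that belong to the majority category of their cluster) from the two tables above. Purity is a measure chosen here for illustration, and the `canonical` helper encodes the assumption that @officialozzy and @official_ozzy refer to the same account.

```python
from collections import Counter

categories = {
    "Music": ["@wale", "@tyga", "@drake", "@rickyrozay", "@officialozzy", "@dolly_parton"],
    "Sports": ["@jozyaltidore", "@espn", "@robdyrdek", "@floydmayweather", "@ryansheckler", "@warrensapp"],
    "Family": ["@bigcitymoms", "@real_simple", "@pbsparents", "@playgrounddad", "@preschoolers", "@sesamestreet"],
    "Technology": ["@vkhosla", "@padmasree", "@sacca", "@google", "@bbcclick", "@timoreilly"],
}

clusters = {
    "Cluster 1": ["@tyga", "@floydmayweather", "@rickyrozay", "@wale", "@drake", "@official_ozzy"],
    "Cluster 2": ["@padmasree", "@robdyrdek", "@sesamestreet", "@pbsparents", "@sacca",
                  "@real_simple", "@preschoolers", "@playgrounddad"],
    "Cluster 3": ["@espn", "@warrensapp", "@bigcitymoms", "@ryansheckler", "@jozyaltidore"],
    "Cluster 4": ["@dolly_parton", "@vkhosla", "@timoreilly", "@google", "@bbcclick"],
}


def canonical(handle):
    # Assumption: underscores are ignored so that @officialozzy and
    # @official_ozzy are treated as the same account.
    return handle.lower().replace("_", "")


category_of = {
    canonical(user): name for name, users in categories.items() for user in users
}

majority_of = {}
matched = 0
for name, users in clusters.items():
    counts = Counter(category_of[canonical(user)] for user in users)
    majority, size = counts.most_common(1)[0]
    majority_of[name] = majority
    matched += size
    print(f"{name}: {majority} ({size} of {len(users)})")

purity = matched / sum(len(users) for users in clusters.values())
print(f"purity = {purity:.2f}")
```

With the data from the tables, 18 of the 24 users fall into the majority category of their cluster, which gives a purity of 0.75.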