Semantics, tagging and Twitter:

Another failed “Semantic Web” experiment, or a potential gold mine?

Twitter recently announced a new feature, called “Annotations”, at the Chirp Twitter developers’ conference. Annotations are a way of adding extra metadata to your tweets, and arguably an inevitable expansion beyond the original self-imposed 140-character limit, which has since become one of Twitter’s strongest trademarks.

Annotations can be seen as the counterpart of the “tags” often used on blog posts. They provide a context for the tweet – a semantics of sorts. Most Twitter users are familiar with “hash tags”, which can be seen as the informal precursor to annotations.

Necessity is the mother of invention, and because natural language can be very ambiguous (ask any computational linguist!), the community of short message system users quickly started using hash tags to overcome this problem. The main motivation was to get the intended message across with the least amount of ambiguity, within the 140 character restriction.

The second motivation behind the widespread use of tags in general, and hash tags in particular, is that they allow the author to provide a concise, high-level summary of the intended message – a delineation of the conceptual ground it attempts to cover. Messages with matching tags could therefore indicate a stronger conceptual match than messages that merely contain the same keywords.

For instance, say you were interested in users’ opinions of Apple products in general. If a search for “apple products” presented you with only a list of ten page titles, it might be fair to assume that an article titled “Apple Products” relates to the high-tech company, Apple Inc., and some of their products. However, can you completely rule out the possibility that it relates to McCutcheon’s Apple Products Inc? Well yes, if the author included tags such as “#high-tech”, “#gadgets”, “#apple-company” or “#review”!

But it goes deeper than this: once objects are tagged, one can explore “themes” in data. And the notion of tags is not limited to textual messages; it can be applied to all sorts of multimedia and hard-to-index data. This is one of the easiest forms of recommending “related” objects to users: let users collaboratively tag objects with descriptive tags, and cluster the objects based on those tags – i.e. the more their tags agree, the more “related” they are. We are still a long way from being able to automatically analyse, say, video content to detect the main themes, moods or other hard-to-define characteristics contained therein. Humans, however, have been doing this sort of thing for ages.
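
A minimal sketch of this idea in Python (the objects and tags below are invented for illustration): tag agreement can be measured with, for instance, the Jaccard coefficient – the fraction of tags two objects share.

```python
from itertools import combinations

# Hypothetical objects with user-supplied tags.
tagged = {
    "video_a": {"cooking", "italian", "pasta"},
    "video_b": {"cooking", "baking", "dessert"},
    "video_c": {"travel", "italian", "rome"},
}

def jaccard(a, b):
    """Fraction of tags the two sets share: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Rank pairs by tag agreement: the higher the score, the more "related".
pairs = sorted(
    ((jaccard(tagged[x], tagged[y]), x, y)
     for x, y in combinations(tagged, 2)),
    reverse=True,
)
for score, x, y in pairs:
    print(f"{x} ~ {y}: {score:.2f}")
```

Real recommenders are of course far more involved, but the core signal – overlap in human-supplied tags – is exactly this simple.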

So how does this relate to Twitter’s chirpy new feature, “Annotations”? Twitter deliberately avoided defining exactly what annotations are for and how they should be used, in order to “encourage innovation in their use” from the community. Initially, you get an additional 512 bytes of data, which “might be increased to up to 2 KB”, within which you can embed your own annotations as namespace/key/value triples in JSON. Notably, the use of namespaces allows users to specify the semantics underlying their annotations.
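
To make that shape concrete, here is a hypothetical payload – Twitter published no vocabulary at the time, so the namespaces and keys below are invented; only the namespace/key/value structure and the 512-byte budget come from the announcement:

```python
import json

# Hypothetical annotations: a list of {namespace: {key: value}} maps.
annotations = [
    {"review": {"product": "iPad", "rating": "4/5"}},
    {"geo": {"city": "Stellenbosch", "country": "ZA"}},
]

payload = json.dumps(annotations)
# The serialised payload has to fit the initial 512-byte budget.
assert len(payload.encode("utf-8")) <= 512
```

The namespace (“review”, “geo”) is what lets two independent clients agree on what a key like “rating” is supposed to mean.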

What are the possible implications of this? Well, first off, providing dedicated extra space for annotations could see hash tags being moved from the main body of the tweet into the annotation, thus freeing up precious tweet-estate. Also, and very importantly, Twitter’s compressed format paved the way and created a market for services like URL shorteners. With a dedicated separate space for URLs, using those services would become a choice, not a necessity.

But that’s just the surface stuff. It is important to realise that what Twitter has created here is not merely an added “feature”, but potentially a new platform. The open-ended way in which this is being rolled out provides at once amazing possibilities and a potential developer’s nightmare.

The Semantic Web and Linked Data concepts have slowly progressed over the last decade from a mere pipe dream to a serious initiative gaining traction worldwide. With Twitter’s recently disclosed user base of 105 million users and their phenomenal growth, they might just be in a position to lend this initiative some serious momentum.

At its heart, the Semantic Web notion strives to make the meaning or semantics of information available to machines, in order to provide a platform where machines can more readily meet the growing information needs of users. The incredible volume of tweets – 55 million tweets a day, 600 million searches a day, as revealed by Twitter – coupled with the ability to potentially make these understandable by machines via some agreed upon semantics, whether formally defined or decided via crowd-sourcing, sets the table for some potentially very interesting applications in the near future.

One could argue that this provides a great platform for incrementally refining machines’ understanding of linguistic “meaning” from the way people tweet and retweet content, and the way they apply semantic annotations to it. Also, and admittedly much less ambitious, having added annotations for location, platform, temporal information and so on greatly increases the possibilities for personalised services catering to your likes and dislikes. This could happen for the simple reason that, wait for it, the service now knows more about you.

All in all, it seems this new addition to the Twitter camp was a necessary move for them as a service to move forward. It brings many hitherto impossible opportunities to the table, but simultaneously calls for a serious and collective effort in defining and engineering the way forward. However, if this pans out well, this might just be referred to in years to come as the #twitularity :)



  1. Otavio Ferreira on Friday 23 April, 16:32

    Good post Stephan.

    However, I don’t believe ad-hoc annotations are the best way of solving the ambiguity problem, or the problem of poor relevance.

    As you know, the Semantic Web has been developed on top of ontologies, which are formal and *shared* descriptions of domain concepts.

    Twitter can solve its own problem, but the proposed solution won’t support the broader environment, in which several different applications share the same understanding of a particular concept.

    It’s good that Twitter is at least thinking of metadata, but they have to go open-protocol in order to solve the problem properly.

    XML, XSD, RDF, RDFS, OWL, and OWL-S would do the trick.

  2. Stephan Gouws on Friday 23 April, 18:12

    Hi Otavio, thanks for the comment! I was hoping you’d chime in on this one! :)

    I agree that a completely organic come-what-may approach could lead to potential chaos.

    From a knowledge engineering point of view, the more formal, well-defined and universal the semantics used to define the underlying ontology, the better in terms of automated reasoning. But there’s a reason why we call people building formal ontologies knowledge ENGINEERS :P I think they are scared to introduce something as hairy as that so soon to the tweep on the street who just wants to get his message out..

    It’s also worth bearing in mind that, traditionally, formal ontologies have generally worked best in well-defined, niche areas where knowledge engineers can map out the relationships between concepts and classes in a reasonable amount of time. Twitter might not be the best place for this.

    On the other hand, I kind of like the ‘do what you want’ wild west approach they’re following. Who knows, maybe something comes out of this that none of us foresees? I think they’re betting on that.

    If all else fails and either adoption is low or the lack of standards gets out of hand, they could always fall back on a more standardised approach and chalk it up to experience..

    What could be really, really cool is a sort of centralised annotation regulation service. Where we currently have URL shorteners, we could now see ‘Entity Disambiguators’: you enter your tags as usual in your Twitter client, which sends them off to the annotation service; the service tries to match them to any already defined entities (using your tweet’s message body and all the tags you provided for disambiguation) and presents you with the best disambiguated entities.

    Now the crowd-sourcing part: If it fails to resolve certain entities to a certain level of confidence, you define a new category and if you like, provide some ontological relationships to other already contained entities. The next person to use the same concept would get your bootstrapped entry, etc. I.e. a collaboratively edited ontology.

    As you know this could be immensely useful purely from a knowledge engineering perspective, but the direct benefits to users are just as huge: You can now formally classify tweets based on the collaboratively agreed upon ontology. Looking for tweets on #Series that #Aired in #August? Well, without the user having had to physically tag their tweet as subClassOf:#Series, we can return tweets on Lost, Big Bang, 24, etc, since other people had already established that relationship.
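
    A toy sketch of that lookup (all tag names and relations below are made up for illustration):

```python
# A crowd-sourced "subClassOf" map lets a search for the concept
# Series match tweets tagged only with a specific show.
sub_class_of = {
    "Lost": "Series",
    "24": "Series",
    "BigBang": "Series",
    "iPad": "Gadget",
}

tweets = [
    {"text": "Season finale tonight!", "tags": {"Lost", "August"}},
    {"text": "New tablet day", "tags": {"iPad"}},
]

def matches(tweet, concept):
    """True if any tag equals the concept or is a (transitive) subclass."""
    for tag in tweet["tags"]:
        t = tag
        while t is not None:
            if t == concept:
                return True
            t = sub_class_of.get(t)
    return False

# Looking for tweets on #Series: the Lost tweet matches even though it
# was never explicitly tagged #Series.
series_tweets = [t["text"] for t in tweets if matches(t, "Series")]
```

    The interesting part is that the subClassOf map is built by the crowd, not by the tweet’s author.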

    Interesting times indeed..

  3. Jacques Bruwer on Friday 23 April, 18:26

    Very interesting post, thx. Stephan, I have to agree with you on the point about the average tweeter just posting his message. I think Raffi Krikorian (Twitter platform) said that 90% of Twitter developers don’t know what the Semantic Web is, but that there’s certainly room for standards lovers to work within the Annotations scheme.

    Whether this will turn out to be true I’m not sure, but as mentioned, a very interesting time lies ahead for the Twitter platform.

  4. Otavio Ferreira on Friday 23 April, 21:00

    Hey Stephan. Thank you for replying.

    Yes, I fully agree with you. Developers cannot expose the underlying complexity of the Semantic Web to end users.

    So I believe researchers have to come up with intuitive solutions for fulfilling the upper layers of the Semantic Web architecture, especially the User Interface & Applications tier (the “Layer Cake”).

    Your centralized annotation regulation service suggestion is indeed a better, more sophisticated solution. I’d give my contribution to this approach as follows:

    1) The service output would be an ontology based on the keywords analyzed. This formal description would then be applied without human interaction.

    2) Instead of centralizing the reasoning and overall decision making, several distributed services could be set up by different parties around the globe. Of course, an open protocol would be required.

    The Semantic Web for Human Beings is within reach!
