
Some observations about the NUS collection of SMSs

 

  • Ya i am doin too much.hereafter i wnt ask any one
  • Hey sorry I didnt give ya a a bell earlier hunny, just been in bed but mite go 2 the pub l8tr if u wana mt up? loads a luv Jen.
  • Pay credit card bill for my sis… Ur sis lesson until wat time? U had ur lunch already?
  • Not dat i dun wan sign up but i wan only for a mth… Try try first mah… Then y my sis got 1 mth one…
  • Where r u now? I finish oredi…
  • May b:-)lol

Spelling

These strange messages were selected randomly from a collection of SMS messages that the National University of Singapore (NUS) is putting together. They show some of the typical spelling variations and language phenomena present in SMSs. Sometimes:

  • “unnecessary” letters are dropped, especially vowels: doing – doin, month – mth, what – wat,
  • phonetic spelling is used: you – u, your – ur, maybe – may b, might – mite,
  • numbers are used because of the way they sound: to – 2, later – l8tr, and
  • abbreviations are used: laugh out loud – lol.

Notice that most of these tricks are used to decrease the time it takes to type the message on a cellphone’s numerical keypad – a very limiting input device.
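The four patterns above can be collected in a small lookup table. The sketch below is only an illustration built from the word pairs mentioned in this post, not a description of any existing normalisation tool; the names `SMS_TO_STANDARD` and `normalise` are made up for the example.

```python
import re

# Word pairs taken from the examples in this post.
# A usable table would need to be far larger.
SMS_TO_STANDARD = {
    "doin": "doing", "mth": "month", "wat": "what",  # dropped letters
    "u": "you", "ur": "your", "mite": "might",       # phonetic spelling
    "2": "to", "l8tr": "later",                      # numbers used for their sound
    "lol": "laugh out loud",                         # abbreviations
}

# A token is either a run of letters and digits, or one other character
# (space, punctuation), so the message can be rebuilt exactly.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]")


def normalise(message):
    """Replace every known SMS spelling with its standard form."""
    parts = []
    for token in TOKEN.findall(message):
        # Unknown tokens are kept unchanged; matching ignores case.
        parts.append(SMS_TO_STANDARD.get(token.lower(), token))
    return "".join(parts)


print(normalise("mite go 2 the pub l8tr"))  # might go to the pub later
```

A plain lookup like this cannot resolve ambiguity: “2” may stand for “to”, “too” or “two”, and a two-token form such as “may b” is not matched at all.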

Typing methods

Speed and brevity are not necessarily the only motivations for the spelling. Look at the word “l8tr”, which could also be spelled “l8r”, since the “8” already ends in a “t” sound. On a cellphone keypad, the “8” and the “t” are on the same key, so typing them one after the other on a multi-tap keypad would be very inefficient. When you look at the rest of the message, it becomes clear that this message was not typed on a multi-tap keypad. “Sorry” and “hunny” both have repeated letters, something that is mostly avoided on these keypads. Unfortunately, we don’t know whether this specific message was typed using a qwerty keyboard, predictive text or something else, but the NUS collection contains metadata for many of the other messages:

Percentage of SMSs in collection typed using different input methods. (Graphics from Chen, 2011)

It therefore seems that for many words, creative ways were found to type them quickly. These forms became entrenched, and people still use them even after the original reason disappeared.
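The multi-tap argument can be made concrete by counting key presses. The sketch below assumes the standard letter layout of a 12-key phone keypad and, as a further assumption, that the digit on each key is reached after its letters (so “8” takes four presses of the 8 key); real handsets differ on this. The function name `multitap_cost` is hypothetical.

```python
# Letter layout of a standard 12-key phone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Map each character to (key, number of presses).
# Assumption: the digit comes after the letters on its key.
PRESSES = {}
for key, letters in KEYPAD.items():
    for position, char in enumerate(letters + key, start=1):
        PRESSES[char] = (key, position)


def multitap_cost(word):
    """Return (key presses, pauses) needed to type `word` with multi-tap.

    A pause is counted whenever two consecutive characters sit on the
    same key, because the user has to wait before continuing.
    Characters outside the table (spaces, "0", "1") raise KeyError.
    """
    presses = 0
    pauses = 0
    previous_key = None
    for char in word.lower():
        key, count = PRESSES[char]
        presses += count
        if key == previous_key:
            pauses += 1
        previous_key = key
    return presses, pauses


for word in ("later", "l8r", "l8tr", "hunny"):
    print(word, multitap_cost(word))
# later (10, 0)
# l8r (10, 0)
# l8tr (11, 1)
# hunny (11, 1)
```

Under these assumptions “l8tr” needs more presses than the full word “later” and forces a pause as well, which is consistent with the message not having been typed on a multi-tap keypad.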

Language

Apart from the input device, the fact that SMS language is “close” to spoken conversation seems to influence the spelling. Contractions like “wana” and the informal “sis” are used frequently. We also see differences in spoken dialects reflected in the spelling. “Oredi”, in message 5, is a variant of “already” often used in the collection by users from Singapore. We assume that this reflects the way Singaporeans pronounce the word. The contraction of “that” to “dat” and “what” to “wat” also seems region-specific. In South African English we would probably contract “that” to “tht” and “what” to “wht”, although it would be interesting to see some real data from South Africa.

The syntax also shows signs of contraction. The words “I have” are omitted from the second message (“just been in bed”). Words are omitted elsewhere too, and some of the more ungrammatical sentences, like “I finish oredi…”, may have been sent by a non-native English speaker.

Percentage of SMSs in collection sent by native English speakers.

Conclusion

We quickly took a peek at a small sample of the messages in the NUS SMS Corpus. We hope to use this collection to investigate the automatic translation of SMS-speak (or text-speak) to standard English. The percentage of tweets that were sent from mobile phones jumped from 25% in 2009 to more than 40% in 2010. We therefore hope that Twitter and SMS language share some features, and that tools for SMS normalisation will also be useful for microblog-message normalisation.

However, because the use of qwerty keyboards and predictive text on mobile phones lessens the use of SMS-speak, we are interested in whether SMS-speak is still relevant. We would also like to have some sample messages from South Africa. Please help the investigation by submitting your (old and used) English messages to the collection at http://wing.comp.nus.edu.sg:8080/SMSCorpus/contribution.jsp .

References

Chen, Tao: Statistics of NUS SMS Corpus (For English Corpus 2011.04.01), available from http://wing.comp.nus.edu.sg:8080/SMSCorpus/statistic.jsp. Retrieved April 2011.

Rafi, M.S.: SMS Text Analysis: Language, Gender and Current Practices, the 26th Annual TESOL France Colloquium. Retrieved September, volume 20, 2009, 2008

Further reading

Tagg, C.: A Corpus Linguistics Study of SMS Text Messaging, The University of Birmingham, 2009
