Have you ever wanted to search a database of documents for something “similar” to a reference document? Or needed to identify near-duplicate records in a database? When dealing with large collections of data, manual sifting may not be feasible. Highly sophisticated pattern recognition techniques exist to extract features from data, and cluster items accordingly. However, you don’t necessarily want to be a PatRec expert just to compare items in a data set. This article will present a surprisingly simple, surprisingly general (and reasonably accurate) way of comparing two documents, data items or records. It’s simple enough that you will probably be able to do basic comparisons from the command line of your operating system, or compare large data sets using only a few lines of code in your favourite programming language.
The automatic comparison technique that we will look at is called the normalised compression distance (NCD), and it is based on the information-theoretical concept of Kolmogorov complexity. The only requirement is that the documents or records being compared compress well. Consequently, text documents respond well to NCD-based clustering, but already-compressed documents like JPEG images or MP3 audio files won't. To compare the similarity of two chosen documents, you only need a compression technique that works particularly well for the type of document you're working with. In the examples below, we'll use the bzip2 algorithm, since it compresses both data sets we're working with (plain text and XML) particularly well.
Some background on complexity
The notion of normalised compression distance (NCD) is based on a fairly theoretical concept from information theory known as Kolmogorov complexity. You don't need to know any of this background information to be able to use NCD, but it's quite interesting to see where it comes from.
Kolmogorov complexity can be summarised as "a thing is as complex as the length of its shortest description". For example, "89681442309683453279" is more complex than "11111111111111111111", because the latter can be described as "20 ones", whereas the first sequence seems to have no pattern and can only be described by spelling out the sequence itself. We say that "11111111111111111111" has a lower Kolmogorov complexity than "89681442309683453279".
Sometimes, seemingly random sequences can have low Kolmogorov complexity, which we can discover if we can find the underlying pattern. For example, the sequence "31415926535897932384" can be described as "the 1st 20 digits of π". As a matter of fact, the infinite, nonrepeating, seemingly random sequence "3.14159265…" has the simple description of "the ratio of a circle's circumference to its diameter", or simply "π". Sometimes such an underlying pattern could be difficult to discover – chances are you wouldn't have been able to recognise the sequences that reduce to "the ratio of a circle's circumference to half its radius" or "the 1st 100 consonants in Dante's Inferno".
From this point of view, the true Kolmogorov complexity of a sequence is impractical to determine (in fact, it is uncomputable in general). But it's clear that we may be able to create algorithms that give their best shot at coming as close to the Kolmogorov limit as possible.
As a matter of fact, compression algorithms can be seen as trying to achieve exactly this: a good compression algorithm manages to reduce the length of the input text by as much as possible. Try this: generate a binary file with a million 1s, and compress it using your favourite compression tool; the compressed file should be pretty small. Next, generate a binary file with completely random bytes; it shouldn't compress at all. Compression algorithms try to discover underlying patterns, and represent them in as compact a form as possible. Since this is an important objective in, for example, data storage and communication, much effort has gone into developing good compression techniques for many different data sources, including text, images, sound and video. Almost as a side effect, compression algorithms turn out to be good complexity estimators too, since they try to find the shortest sequence that describes the input data. In complexity estimation, we're not as much interested in the shortest sequence itself, just in how long it is.
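The compressibility gap between patterned and random data is easy to check for yourself. A minimal sketch in Python, using the standard-library bz2 module:

```python
import bz2
import os

# Highly regular input: a million identical characters.
regular = b"1" * 1_000_000

# Patternless input: a million random bytes.
noise = os.urandom(1_000_000)

# The regular file compresses to almost nothing, since "a million ones"
# is a very short description; the random file barely shrinks at all.
print(len(bz2.compress(regular)))
print(len(bz2.compress(noise)))
```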
Complexity and similarity
But what does complexity measurement have to do with estimating how similar two documents or data records are? Well, if you know how complex two documents are, and you then concatenate the two documents, you would expect the new document’s complexity to be at most the sum of the individual documents’ complexity. If the documents are fairly similar, you would expect their concatenation to have a lower total complexity. For example, if the first document were “31415926535897932384″, and the second document were identical, you could describe their concatenation as “the 1st 20 digits of π, repeated twice”.
If you were to use a compression technique to estimate complexity, you would first compress the two individual documents, and find the compressed file sizes L1 and L2. Then you would concatenate the two documents, and compress the stitched-together version to get a length of L12. If the new file's compressed size is close to L1 + L2, nothing that occurred in the first document helped us to better describe (compress) the second document. Therefore they are completely dissimilar. However, if the compressed concatenated file is much smaller than that, the documents are likely very similar. Of course, the compressed version of the concatenated file can never be smaller than either L1 or L2, because it contains more information than either one of the original documents.
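This concatenation test takes only a few lines to try out. The sketch below uses two made-up documents, the second a lightly edited copy of the first, and the bz2 module as the complexity estimator:

```python
import bz2

# Two made-up documents; the second is a near-duplicate of the first.
doc1 = b"The quick brown fox jumps over the lazy dog. " * 40
doc2 = b"The quick brown fox leaps over the lazy dog! " * 40

L1 = len(bz2.compress(doc1))
L2 = len(bz2.compress(doc2))
L12 = len(bz2.compress(doc1 + doc2))

# For similar documents, the concatenation compresses to far less
# than the sum of the individually compressed sizes.
print(L1, L2, L12)
```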
For example, if I were to compress Lewis Carroll's The Jabberwocky using the bzip2 compression algorithm, the 954 bytes of the original poem are reduced to a compressed 530 bytes. This is L1, our estimate of The Jabberwocky's complexity. If I compare the poem to a copy of itself, by concatenating the two copies and then compressing the result, the twice-repeated version of The Jabberwocky compresses to L12 = 630 bytes, much less than 530 + 530 = 1060 bytes (the sum of the individual complexities of the two identical versions of the poem). This indicates that the copy of the poem is "very" similar to the original version.
Note that real-life compression algorithms always have some overhead. Ideally we'd like The Jabberwocky to compress to 530 bytes, and the twice-repeated version to something like 531 bytes, to indicate that it contains almost no extra information, since the two versions were identical. Unfortunately, the price we pay for the ease of using such a blunt measurement device is that perfect similarity will never be measured exactly.
Normalised compression distance
Formally, the normalised compression distance (NCD) is defined as

NCD = (L12 - min(L1, L2)) / max(L1, L2)
with L1, L2 and L12 the compressed length of the first document, the second document and the concatenated documents, respectively. A lower NCD indicates similar documents; a higher NCD indicates dissimilar documents. It is assumed that you’re using a compression technique that works well on the type of data you’re working with.
What's so convenient about this is that it uses off-the-shelf algorithms (most modern programming languages have compression libraries or modules available), and that it needs no specialised knowledge of the underlying data. Be warned, however, that you pay for this generality and convenience: using NCD for clustering will almost always be inferior (and quite possibly computationally more expensive) than using a tailor-made pattern recognition technique that is carefully configured and trained to fit your problem domain's data. But you certainly can't beat the price, and it turns out to work surprisingly well for many applications, such as comparing natural-language documents, or searching for near-duplicate records in a dirty database.
NCD is simple enough to use from the command line. In Linux,
cat a.txt | bzip2 | wc -c
estimates the Kolmogorov complexity of a.txt (L1). Similarly,
cat a.txt b.txt | bzip2 | wc -c
estimates the complexity of the two concatenated files (L12).
In a language like Python, the zlib or bz2 modules can be used, e.g.

import bz2
L1 = len(bz2.compress(A))  # A is a bytes object
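Putting the pieces together, the whole NCD computation fits in one short function. A sketch using bz2 (any compressor that suits your data would do):

```python
import bz2

def ncd(a: bytes, b: bytes) -> float:
    """Normalised compression distance between two byte strings."""
    L1 = len(bz2.compress(a))
    L2 = len(bz2.compress(b))
    L12 = len(bz2.compress(a + b))
    return (L12 - min(L1, L2)) / max(L1, L2)
```

Comparing a document with itself should score close to zero (not exactly zero, because of compressor overhead), while two unrelated documents score near one.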
To demonstrate the usefulness of the normalised compression distance, we look at two simple case studies. In the first, we have a handful of database records, including one which is a corrupted duplicate of another one. It will be shown that the NCD can clearly point out the near-duplicate record. In the second example, we compare a number of books, and cluster them by similarity.
In the first example, we take an excerpt of five XML records (Albania, Austria, France, Germany and Spain) from the standard MONDIAL database. We then duplicate one of the records (Germany) and randomly change some of its fields, e.g. changing record ID numbers, rounding off population sizes, and deleting some city records. This emulates typical small changes that one might find in a "dirty" database containing near-duplicate entries that might have been obtained from different sources. Identifying such entries by hand could be cumbersome, but an NCD comparison of the entries yields the following distances (remember that smaller values correspond to more similar entries):
The near-duplicate entry is clearly flagged. In a practical application, entries with an NCD below an application-determined threshold could be flagged for review by the database manager.
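A near-duplicate scan along these lines can be sketched as follows. The records here are short, made-up XML stand-ins (not the actual MONDIAL entries), and a real application would read the records from the database and tune its review threshold on known duplicates:

```python
import bz2
from itertools import combinations

def ncd(a: bytes, b: bytes) -> float:
    L1, L2 = len(bz2.compress(a)), len(bz2.compress(b))
    return (len(bz2.compress(a + b)) - min(L1, L2)) / max(L1, L2)

# Made-up stand-ins for database records.
records = {
    "albania": b"<country><name>Albania</name><capital>Tirana</capital>"
               b"<population>3249136</population><city>Durres</city></country>",
    "germany": b"<country><name>Germany</name><capital>Berlin</capital>"
               b"<population>83536115</population><city>Hamburg</city>"
               b"<city>Munich</city><city>Cologne</city></country>",
    "germany-dup": b"<country><name>Germany</name><capital>Berlin</capital>"
                   b"<population>83500000</population><city>Hamburg</city>"
                   b"<city>Munich</city></country>",
}

# Rank every pair of records by distance; the near-duplicate
# pair should surface at the top of the list.
pairs = sorted(
    (ncd(a, b), name_a, name_b)
    for (name_a, a), (name_b, b) in combinations(records.items(), 2)
)
for d, name_a, name_b in pairs:
    print(f"{name_a} / {name_b}: NCD = {d:.2f}")
```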
In the next example, we compare a collection of six books, to establish which texts are more similar to each other. For this experiment, we select two works of Shakespeare (King Lear and Macbeth), two German texts (Goethe’s Faust and Nietzsche’s Also Sprach Zarathustra) and two English translations of the book of Genesis (from the King James Bible and the World English Bible). We expect the two versions of Genesis to be quite similar, we hope to notice some similarity between the two Shakespeare plays, and the German texts are expected to be somewhat similar to each other, but dissimilar from the English texts. All texts were plain text (UTF-8) versions sourced from Project Gutenberg. The results are shown below:
In general, the NCD values are higher than for the XML records, since we are dealing with less constrained text. The two versions of Genesis show the greatest similarity. The English texts cluster well, with Macbeth and King Lear the next-closest pair. Likewise, Also Sprach Zarathustra is the closest text to Faust in this corpus, grouping the two German texts together.
Note, however, how much noisier the measurement has become. The documents all cluster around unity, with the Genesis pair being the only reasonably strong outlier (but not nearly as clear as for the duplicate XML record in the previous example). We cannot cluster with as much confidence now, and may make mistakes in difficult problems like "separate works of Shakespeare from works of Francis Bacon". What is surprising, however, is that a quick-and-dirty technique like NCD can be applied at all to tricky domains such as natural language processing.
Normalised compression distance is a way to compare documents or data units of an almost arbitrary nature, and requires no specialised knowledge of the underlying data. The most important requirement is the availability of a compression technique that works well on the type of data at hand. It is very simple to implement, and can give surprisingly good results, even in domains where more sophisticated pattern recognition techniques would be very difficult to implement. Note that it is not designed for speed or scalability, and may not be the best choice for processing very large data sets.
Despite its limitations, NCD gives an interesting glimpse into information theory, and the close relationship between complexity and compressibility.
References

Cilibrasi, R. and Vitányi, P., "Clustering by Compression", IEEE Transactions on Information Theory, Vol. 51, No. 4, April 2005, pp. 1523-1545.
Cebrián, M., Alfonseca, M. and Ortega, A., "Common Pitfalls Using the Normalized Compression Distance: What to Watch out for in a Compressor", Communications in Information and Systems, Vol. 5, No. 4, 2005, pp. 367-384.