Every so often, I dink around with benchmarking common lossless compressors. One of the best sites for it is, I think, Werner Bergman’s Maximum Compression, which is a rather comprehensive running benchmark of just about every lossless compressor under the sun. Really, there are a lot. What you have to understand about the world of compressors is that they are very often academic projects or toys that very smart people play with in their free time. There are also companies (but not many) that invest in their own proprietary algorithms for lossless compression.
Here’s the catch, though: the quality of a compressor isn’t measured by its final compression ratio alone. The PAQ series of compressors, for instance, offers great compression and really, truly awful compression times. The same goes for the highest compression levels of WinRK (a proprietary Win32 format with an accompanying GUI). But disk is cheap: nobody really cares about a fraction of a percentage point of compression efficiency, do they? What people really want is for their (inevitable) archiving GUI to take less time doing what it does.
In this spirit, I have compiled not so much an exhaustive list of possible compression algorithms (I’ll leave that to Werner, who is very good at what he does), but rather a short list of the most common formats, tested on three different (relatively well-known) corpuses: the Calgary Corpus, the newer Canterbury Corpus, and Andrew Tridgell’s 1999 Large Corpus. The first two of these are corpuses used to test the very kind of academic project which I’ve avoided. I dislike using them because they are small in size, which means that there is significantly less opportunity for differences between compression formats to manifest themselves. In the interest of verifiability, however, I have used them. I also included Andrew Tridgell’s large corpus because it’s been my experience that results on small test corpuses tend to vary too much due to disk I/O latency and other vagaries of compression algorithms.
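To make the idea of measuring time alongside ratio concrete, here is a minimal sketch of that kind of measurement. It is not the harness used for the tables in this post: the three codecs are simply the ones that ship with Python's standard library, standing in for the archive formats actually tested, and the names `CODECS` and `benchmark` are made up for illustration. It compresses in memory, so it sidesteps disk I/O entirely.

```python
import bz2
import lzma
import time
import zlib

# Assumption: standard-library codecs used as stand-ins, not the
# archivers benchmarked in this post.
CODECS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def benchmark(data):
    """Return one row per codec: name, compressed size, ratio, seconds."""
    if not data:
        raise ValueError("need at least one byte of input")
    rows = []
    for name, compress in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        rows.append({
            "codec": name,
            "bytes": len(packed),
            # Compressed size over original size: smaller is better.
            "ratio": len(packed) / len(data),
            "seconds": elapsed,
        })
    return rows


if __name__ == "__main__":
    # Hypothetical sample input; a real run would read a corpus file.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
    for row in benchmark(sample):
        print(f"{row['codec']:5} {row['bytes']:8d} bytes  "
              f"ratio {row['ratio']:.4f}  {row['seconds']:.4f}s")
```

A single timed pass like this is noisy, which is part of why small inputs are awkward to benchmark: repeating each run several times and keeping the best time would give steadier numbers.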
What follows is a data table for each corpus, along with some brief observations about each.
