Every so often, I dink around with benchmarking common lossless compressors. One of the best sites for it is, I think, Werner Bergmans’ Maximum Compression, a rather comprehensive running benchmark of just about every lossless compressor under the sun. Really, there’s a lot. What you have to understand about the world of compressors is that they are very often academic projects or toys that very smart people play with in their free time. There are also companies (but not many) that invest in their own proprietary algorithms for lossless compression.

Here’s the catch, though: the quality of a compressor isn’t measured by its final compression ratio alone. The PAQ series of compressors1, for instance, offers great compression and really, truly awful compression times. The same goes for the highest compression levels of WinRK (a proprietary Win32 format with an accompanying GUI). But disk is cheap: nobody really cares about a fraction of a percentage point of compression efficiency, do they? What people really want is for their (inevitable) archiving GUI to take less time doing what it does.

In this spirit, I have compiled not so much an exhaustive list of possible compression algorithms (I’ll leave that to Werner, who is very good at what he does) as a short list of the most common formats, tested on three different (relatively well-known) corpuses: the Calgary Corpus, the newer Canterbury Corpus, and Andrew Tridgell’s 1999 Large Corpus2. The first two are corpuses used to test the very kind of academic project I’ve avoided. I dislike using them because they are small, which means there is significantly less opportunity for variations between compression formats to manifest themselves. In the interest of verifiability, however, I have used them. I also included Tridgell’s Large Corpus because it’s been my experience that results on small test corpuses tend to vary too much due to disk I/O latency and the other vagaries of benchmarking compression algorithms.

What will follow is a data table for each corpus, followed by some brief observations about each.

First, a note about the test environment:

  • Windows Vista x64
  • Intel Q6600 Quad-Core
  • 4GB Corsair PC2 6400
  • Western Digital Caviar WD1600YS SATAII, 160GB (system drive)
  • Timer: Igor Pavlov’s timer.exe (times reported are “Process” times).

Next, a note about compressor versions:

Compressor versions used

Compressor  | Version   | Source
tar         | 1.13      | GnuWin32
gzip        | 1.3.12    | GnuWin32
bzip2       | 1.0.4     | GnuWin32
zip/unzip   | 2.32/5.52 | GnuWin32
7z (32-bit) | 4.57      | 7-Zip
rar         | 3.71      | RarLabs
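
For the curious, here is a minimal sketch of the kind of driver loop that produces numbers like these. It is not my actual test harness: it times each run by wall clock from Python rather than with timer.exe’s “Process” times, it only covers the compressors that write to stdout, and the corpus file name is just a placeholder.

    import os
    import subprocess
    import time

    # Placeholder file name: point this at whichever tarred corpus you are testing.
    CORPUS_TAR = "calgary.tar"

    # Each entry: (label, command, output file). gzip and bzip2 write to stdout
    # with -c, so the original tar is left in place between runs.
    RUNS = [
        ("gzip fast",  ["gzip",  "-1", "-c", CORPUS_TAR], CORPUS_TAR + ".gz"),
        ("gzip best",  ["gzip",  "-9", "-c", CORPUS_TAR], CORPUS_TAR + ".gz"),
        ("bzip2 best", ["bzip2", "-9", "-c", CORPUS_TAR], CORPUS_TAR + ".bz2"),
    ]

    def timed_run(cmd, out_path):
        """Run one compressor, streaming its stdout to out_path; return elapsed seconds."""
        with open(out_path, "wb") as out:
            start = time.perf_counter()
            subprocess.run(cmd, check=True, stdout=out)
            return time.perf_counter() - start

    tar_size = os.path.getsize(CORPUS_TAR)
    for label, cmd, out_path in RUNS:
        elapsed = timed_run(cmd, out_path)
        size = os.path.getsize(out_path)
        print(f"{label:10s} {elapsed:7.3f} s  {size:>12,d} b  ratio {size / tar_size:.3f}")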

Now on to the benchmarks…

Calgary Corpus
Codec | Setting | Enc. Speed (s) | Dec. Speed (s) | Size (b)  | Ratio
tar   |         | 0.000          | 0.000          | 3,265,536 | 1.000
gzip  | fast    | 0.171          | 0.062          | 1,244,763 | 0.381
gzip  |         | 0.312          | 0.093          | 1,070,276 | 0.328
gzip  | best    | 0.561          | 0.062          | 1,062,584 | 0.325
bzip2 | fast    | 0.499          | 0.218          | 961,633   | 0.294
bzip2 |         | 0.514          | 0.202          | 891,321   | 0.273
bzip2 | best    | 0.483          | 0.218          | 891,321   | 0.273
zip   | -1      | 0.187          | 0.078          | 1,244,985 | 0.381
zip   |         | 0.358          | 0.078          | 1,070,495 | 0.328
zip   | -9      | 0.516          | 0.046          | 1,062,803 | 0.325
7z    | 1       | 0.436          | 0.171          | 962,460   | 0.295
7z    | 6       | 1.996          | 0.140          | 856,273   | 0.262
7z    | 9       | 2.152          | 0.140          | 853,686   | 0.261
rar   | m1      | 0.265          | 0.140          | 1,167,991 | 0.358
rar   | m3      | 1.950          | 0.140          | 935,499   | 0.286
rar   | m5      | 1.762          | 0.891          | 788,671   | 0.242

The Calgary Corpus dates back to the late ’80s. It’s become the test to perform, but it may or may not adequately represent the standard compressor workload in 2008. You’ll notice that WinRAR’s maximum setting produces the smallest archive, and does so more quickly than the neighboring 7-Zip runs. Notice, too, that among the lowest times there tends to be a sort of “bottoming-out” point, at which the compressor is no longer limited by CPU work but by the speed of the disk.
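
As a quick sanity check on these numbers: the ratio column is simply the compressed size divided by the size of the uncompressed tar, which is why the tar row reads 1.000. In Python, using the WinRAR -m5 row above:

    # ratio = compressed size / size of the uncompressed tar
    tar_size = 3_265_536   # the tarred Calgary Corpus
    rar_m5 = 788_671       # WinRAR at its maximum (-m5) setting
    print(round(rar_m5 / tar_size, 3))   # 0.242, matching the table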

Canterbury Corpus
Codec | Setting | Enc. Speed (s) | Dec. Speed (s) | Size (b) | Ratio
tar   |         | 0.000          | 0.000          | 2,821,120 | 1.000
gzip  | fast    | 0.140          | 0.062          | 872,570   | 0.309
gzip  |         | 0.249          | 0.062          | 739,066   | 0.262
gzip  | best    | 1.138          | 0.062          | 736,223   | 0.261
bzip2 | fast    | 0.390          | 0.156          | 584,964   | 0.207
bzip2 |         | 0.514          | 0.171          | 570,856   | 0.202
bzip2 | best    | 0.390          | 0.156          | 570,856   | 0.202
zip   | -1      | 0.140          | 0.078          | 872,795   | 0.309
zip   |         | 0.343          | 0.062          | 739,286   | 0.262
zip   | -9      | 1.170          | 0.062          | 736,443   | 0.261
7z    | 1       | 0.280          | 0.930          | 569,953   | 0.202
7z    | 6       | 1.950          | 0.124          | 487,919   | 0.173
7z    | 9       | 2.199          | 0.124          | 485,391   | 0.172
rar   | m1      | 0.218          | 0.124          | 772,369   | 0.274
rar   | m3      | 1.232          | 0.093          | 515,831   | 0.183
rar   | m5      | 1.170          | 0.561          | 427,178   | 0.151

I’m still not entirely able to figure out the Canterbury Corpus; it’s ostensibly an “update” to the aging Calgary Corpus. One would think that, having been created more than a decade after its predecessor with the express purpose of more accurately representing the compressor workload of 2001, it would at least be larger (hard disks and file sizes have grown since 1989, believe it or not), but in fact it isn’t. That was somewhat of a disappointment to me, as I saw entirely the same trends as with the previous corpus. Is that an accurate reflection of the algorithms in question? Maybe not; read on.

Large Corpus
Codec | Setting | Enc. Speed (s) | Dec. Speed (s) | Size (b)    | Ratio
tar   |         | 0.000          | 0.000          | 247,933,952 | 1.000
gzip  | fast    | 7.347          | 2.698          | 65,782,177  | 0.265
gzip  |         | 13.072         | 3.151          | 53,870,968  | 0.217
gzip  | best    | 21.855         | 2.449          | 53,536,722  | 0.216
bzip2 | fast    | 40.591         | 9.360          | 52,791,871  | 0.213
bzip2 |         | 54.506         | 10.567         | 39,372,759  | 0.159
bzip2 | best    | 54.228         | 10.935         | 39,372,759  | 0.159
zip   | -1      | 6.349          | 2.208          | 65,782,411  | 0.265
zip   |         | 12.682         | 2.527          | 53,871,197  | 0.217
zip   | -9      | 21.529         | 2.433          | 53,536,951  | 0.216
7z    | 1       | 19.578         | 6.608          | 47,343,400  | 0.191
7z    | 6       | 128.645        | 4.035          | 26,373,931  | 0.106
7z    | 9       | 172.677        | 3.712          | 24,722,887  | 0.100
rar   | m1      | 9.016          | 4.446          | 48,939,730  | 0.197
rar   | m3      | 125.128        | 3.868          | 31,916,951  | 0.129
rar   | m5      | 138.435        | 23.852         | 29,200,310  | 0.118

Most interestingly, in Tridgell’s “large” corpus we finally see 7-Zip spring ahead of WinRAR in terms of pure compression ratio (and, in some cases, in speed, too). I’m not an expert on compression, so I can’t tell you why certain efficiencies only manifest themselves over large datasets, but clearly 7-Zip wins in the more modern cases where large datasets (mostly text, if Tridgell’s description is accurate) are involved.

Clearly, the LZMA algorithm (the heart of 7-Zip) is something to be proud of: not only is it open source, but it often outperforms the popular WinRAR in both pure compression ratio and overall efficiency. I’m a little surprised that the 7-Zip *nix port, p7zip, hasn’t gained more traction on Linux, but I suppose old ways die hard. The cheapness of disk and bandwidth nowadays points toward more transparent compression as the ideal, rather than toward whichever archiving format produces the best purely numeric results.
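
If you’d rather poke at these algorithms than at the archivers that wrap them, Python’s standard library happens to ship bindings for deflate (zlib), bzip2 (bz2), and LZMA/xz (lzma, in Python 3.3 and later). Here is a rough sketch of a ratio comparison along the same lines; the input file name is a placeholder, and these are the library implementations rather than the exact gzip, bzip2, and 7z binaries tested above.

    import bz2
    import lzma
    import zlib

    # Placeholder input: any large-ish file will do. Bigger inputs give LZMA's
    # large dictionary more room to work, which matches the large-corpus results.
    with open("large.tar", "rb") as f:
        data = f.read()

    for name, compress in [
        ("zlib/deflate", lambda d: zlib.compress(d, 9)),
        ("bzip2",        lambda d: bz2.compress(d, 9)),
        ("lzma/xz",      lambda d: lzma.compress(d, preset=9)),
    ]:
        out = compress(data)
        print(f"{name:12s} {len(out):>12,d} b  ratio {len(out) / len(data):.3f}")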

For those of you looking for a decent free archiving program, check out 7-Zip; for those of you who lust after data tables of compression benchmarks, give Werner’s site a look: it’ll satiate your desire for tabular results in ways you never thought possible.

  1. These compressors are, for the record, GPL; this phenomenon is actually pretty rare. For some unknowable reason, much of the research work in compression has gone on in the land of Windows and its sometimes-associated proprietary formats. Meanwhile, Linux has had gzip, bzip2, and zlib, and that’s about it, with notable exceptions.
  2. A note of history: Andrew Tridgell, usually associated with the Samba project, developed a “large” corpus in 1999 to test his fork of bzip2, known as rzip. The latter was tuned to allow for better compression on large files by implementing a much larger history buffer.
