Introduction

In my lust for benchmarking, I decided to attack some typical Linux CLI compressors for a very general comparison of their relative efficiency.

If you want to skip straight to the results, go ahead and view the Linux CLI compression benchmarks in PDF format.

First, however, I need to explain my corpus, compressors, and methodology.

Test Information

Corpus
OSDB is a database from a benchmarking project on SourceForge, used to test open-source database performance. The version I downloaded was v0.21.
linux-2.6.19 is, I hardly need say, the source code for version 2.6.19 of the Linux kernel.
final_fantasy_tactics is a binary dump of the PlayStation disc for Final Fantasy Tactics: a good mix of video, audio, and executable code.
Webster's 1913 dictionary is also just what it sounds like. I pulled it from Project Gutenberg as a single file, but that file appears in the meantime to have disappeared, or at least been split up by letter.
Compressors (official packages from Ubuntu 6.10 repositories)
tar (1.15.91) is not a compressor but an archiver, used to gather multiple files into a single file which can then be compressed. I made tarballs out of everything for consistency's sake.
gzip (1.3.5) is a very lightweight compressor used in large part for transparent compression across networks (the page you’re reading at this very moment was probably gzipped before it was sent to your browser, which then unpacked it)
bzip2 (1.0.3) is a powerful block-sorting compressor—the bigger, stronger, slower brother of gzip.
zip (2.32) is probably the most common compression format in Windows. There are a number of different implementations, but this is the standard Info-ZIP version found in virtually every Linux distribution. It uses the common DEFLATE method.
rzip (2.1) is a modified version of bzip2, the advantage of which lies in its 900MB history buffer, meaning that it can take advantage of data redundancies over much longer distances than its brethren. Its principal advantage is apparent with large files.
7za (4.42) is a Unix port of the esteemed 7-Zip archiver for Windows. It has several different compression methods for its 7z format (PPMd and LZMA, the latter being the principal one), of which two were tested (the third is bzip2, which is already being tested on its own).
rar (3.60) is the non-free command-line version of the famous WinRAR archiver.
arj (3.10.22) was a very popular archiver in the early-to-mid 90s, alongside pkzip. It isn't of much use now, but it exists as Free Software in any decent Linux repo, and so was also tested.
lha (1.14i) is an open-source compressor of the LZH format, an early format (1988) that has also fallen out of vogue. It has no switches or useful options.
lzop (1.01) is a free implementation of the super-fast LZO algorithm, providing very basic compression very quickly. It has been surpassed in efficiency by QuickLZ, but the latter didn't have a Unix port at the time of this test.
zoo (2.10) is a compression format from the 1980s that was mostly popular on VAX/VMS servers. It is not seriously used anymore, except perhaps on legacy machines.
Methodology
Any relevant switches are documented in the data table. All compressions were run from the bash shell. I created a script for each compressor that compressed and then decompressed the file at each setting in turn. Each run was invoked with GNU time, which wrote the elapsed time to a log file. After each compression, an ls -lrt command was run to get the precise size of the compressed archive, and this too was written to the log file. All of this was done on an Ubuntu 6.10 system, in the GNOME environment. Any other programs in use stayed running for the duration of the test, so consistency should not be an issue.
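
For reference, each driver script looked roughly like the sketch below. This is a minimal reconstruction, not the original script; the file name, log name, and switch list are stand-ins (shown here with gzip).

    #!/bin/bash
    # Minimal sketch of the per-compressor benchmark loop; FILE and LOG
    # are hypothetical stand-ins for the actual corpus files and logs.
    FILE=webster.tar
    LOG=gzip.log
    for level in 1 2 3 4 5 6 7 8 9; do
        # /usr/bin/time is GNU time (not the bash builtin);
        # -a appends to the log, -o names it
        /usr/bin/time -a -o "$LOG" gzip -"$level" -c "$FILE" > "$FILE.gz"
        ls -lrt "$FILE.gz" >> "$LOG"    # record the precise compressed size
        /usr/bin/time -a -o "$LOG" gzip -dc "$FILE.gz" > /dev/null
    done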

Data

The table itself is too big to display within the confines of my blog, so you'll need to view it separately: Linux CLI compression benchmarks (PDF format).

Analysis

What conclusions can be drawn from these benchmarks? Linux's relative paucity of readily-available compressors isn't really a hindrance, because only a few of the formats here are actually useful: gzip and bzip2, obviously, are the foundation upon which Linux compression is built. zip offers no advantage over gzip except that it is both an archiver and a compressor (a point made largely moot by tar and pipes, e.g. tar -czf foo.tar.gz bar/, as shown below), but it is good to have for compatibility purposes. rzip, while interesting, is useful only for large files; if you're attempting to squeeze every last ounce of extra space from a file, it would probably behoove you to use a different compressor anyway.
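
To spell out the tar-and-pipes point: the archiver and the compressor stay separate programs, but a pipe glues them into one command, which is why zip's two-in-one design buys you little. Both of the following produce the same foo.tar.gz (names are illustrative):

    # one step: tar calls gzip itself via -z
    tar -czf foo.tar.gz bar/
    # the same thing with an explicit pipe:
    # tar archives to stdout, gzip compresses the stream
    tar -cf - bar/ | gzip > foo.tar.gz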

In terms of the heavy-duty archivers like 7-Zip and RAR, both offer quite a bit of power at the expense of being either awkward to invoke (7z creation is still broken in file-roller on Ubuntu systems) or non-free (rar is technically shareware, and not readily available in all distributions). It is interesting to note that 7-Zip still spanks WinRAR in general performance. With regard to 7-Zip, I've concluded that PPMd is not generally a worthwhile algorithm to use: in some cases it offers a modest increase in efficiency over LZMA, but its decompression times are horrible, generally equal to or greater than its compression times. LZMA, meanwhile, offers modest compression times (relative to its ratio) and lightning-fast decompression. One odd behavior is that compression efficiency seems to work in plateaus: with LZMA, for instance, there are only 3 distinct compression ratios across 9 switches (1-4, 5-6, and 7-9).
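
For anyone wanting to reproduce the 7-Zip runs: 7za selects the method with -m0 and the level with -mx. The invocations below are illustrative (the archive and input names are placeholders); check 7za's own help for your version.

    # LZMA, the principal method, at maximum compression
    7za a -t7z -m0=LZMA -mx=9 kernel.7z linux-2.6.19.tar
    # PPMd at the same level; expect decompression
    # to take roughly as long as compression
    7za a -t7z -m0=PPMd -mx=9 kernel.7z linux-2.6.19.tar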

lzop, or even LZO compression in general, doesn't offer much in the way of compression improvement over gzip on its default settings, though it does offer a slight improvement in decompression speed.
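
A quick way to see that trade-off for yourself; the exact timings will vary with hardware, and the file name is just a placeholder:

    # -c writes to stdout, leaving the original files untouched
    time lzop -c webster.tar > webster.tar.lzo
    time gzip -c webster.tar > webster.tar.gz
    time lzop -dc webster.tar.lzo > /dev/null   # decompression, lzop's strong suit
    time gzip -dc webster.tar.gz  > /dev/null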

Other, more esoteric formats like arj, zoo, and lha are really no more than historical curiosities. The only one of them that approaches the capabilities of gzip or bzip2 is lha, and if you already have those two, why bother with lha at all?

Caveats

If you look at the data closely, you'll notice some strange behavior, especially in the first test of each series. At first, I believed that the least-intensive compressions tended to be limited by disk I/O rather than CPU cycles. Eventually, however, I began to wonder whether it had more to do with reading the file into memory than with file writes. Running in a script, the original file likely stayed cached after being read the first time, speeding up subsequent compressions by sparing them the disk read. Take from it what you will: these are very much unofficial benchmarks.
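
If I were redoing these benchmarks, one way to rule the page cache out (assuming a 2.6.16-or-later kernel, and run as root) would be to flush it before each pass:

    sync                                   # flush dirty pages to disk first
    echo 3 > /proc/sys/vm/drop_caches      # drop page cache, dentries, and inodes
    /usr/bin/time gzip -9 -c webster.tar > /dev/null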

§1601 · February 1, 2007

2 Comments to “Linux command-line compressor benchmarks”

  1. Lasse Reinhold says:

    The timings for lzop have large errors because the switches -2, -3, -4, -5, -6 are all equivalent but the compression timings range from 0.18 to 0.45 seconds.

    It’s probably due to unpredictable write buffering and/or low timer resolution. Try timing with multiple passes of the same file on a ramdrive.

  2. Jos says:

    An advantage of zip files is the optional presence of a table of contents which allows faster access of single files. For large tar.{gz,bz2} files, getting at the last file in the archive is slow.
