Eighteen months ago, I did an informal test of GUI archivers and their various formats. Recently, I’ve been reading about data compression and formats, and got it into my head to do some more testing.

Because someone like Werner Bergmans says just about everything there is to say about compressors of all stripes, I thought my own testing would focus primarily on the sorts of programs and formats that one would see in everyday use. Thus, no esoteric command line encoders from universities and enthusiasts will make it into this test. Rather, I will be looking at the old standby ZIP format, as well as 7-Zip’s 7z, WinRAR’s rar, WinACE’s ace, UHARC, and the stalwart *nix tools, gzip and bzip2. All of these tools are readily usable in GUIs for the Windows platform.

Two other programs that I considered but did not end up using were KGB Archiver and WinRK. The former is a frontend for PAQ6v2, and its compression time, even under the “normal” setting, was prohibitively long, though had I used the maximum setting, it might have surpassed the other formats after a few days of compressing. The same with WinRK: its maximum (PWCM) mode would have gotten extremely good results after a ridiculous and unusable amount of time. For that reason, both have been excluded.

Testbed

Windows consists of my current Windows directory in a tarball: almost 6,500 files in just over 1GB of data, a mix of text, binary, and media.

Linux 2.6.18 consists of the source for v2.6.18 of the Linux kernel.

Maximum Compression consists of Werner Bergmans’ test corpus, which he offers for download (a pack of the individual data types, not the 510-file corpus used for MFC tests).

Genesis ROMs is a collection of over 1,200 pure binary dumps from Sega Genesis cartridges.

Website backup is a tarball of a website I created some time ago; most of its size comes from the original 300dpi source photographs used in the content.

The Data

View a much nicer PDF version of my original table

Compression Data

| Corpus | Type | Version | Mode | Size (bytes) | Ratio |
|---|---|---|---|---|---|
| Windows | tar | 1.13 | n/a | 1,026,192,384 | 1.000 |
| Windows | zip | 4.43* | dictionary = 64K; word size = 128 | 417,771,283 | 0.407 |
| Windows | gzip | 1.3.5 | -9 | 423,443,208 | 0.413 |
| Windows | bzip2 | 1.0.3 | -9 | 413,704,539 | 0.403 |
| Windows | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 317,998,198 | 0.310 |
| Windows | rar | 3.61 | “best”; dictionary = 4096K; solid | 355,591,630 | 0.347 |
| Windows | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 370,975,963 | 0.362 |
| Windows | uharc | 0.6b | -mx -md32768 -mm+ | 326,051,179 | 0.318 |
| Linux 2.6.18 | tar | 1.13 | n/a | 240,138,240 | 1.000 |
| Linux 2.6.18 | zip | 4.43* | dictionary = 64K; word size = 128 | 47,386,583 | 0.197 |
| Linux 2.6.18 | gzip | 1.3.5 | -9 | 52,467,357 | 0.218 |
| Linux 2.6.18 | bzip2 | 1.0.3 | -9 | 41,863,580 | 0.174 |
| Linux 2.6.18 | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 33,974,276 | 0.141 |
| Linux 2.6.18 | rar | 3.61 | “best”; dictionary = 4096K; solid | 35,342,737 | 0.147 |
| Linux 2.6.18 | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 40,184,600 | 0.167 |
| Linux 2.6.18 | uharc | 0.6b | -mx -md32768 -mm+ | 30,251,189 | 0.126 |
| Maximum Compression Corpus | tar | 1.13 | n/a | 53,144,064 | 1.000 |
| Maximum Compression Corpus | zip | 4.43* | dictionary = 64K; word size = 128 | 13,962,749 | 0.263 |
| Maximum Compression Corpus | gzip | 1.3.5 | -9 | 14,953,388 | 0.281 |
| Maximum Compression Corpus | bzip2 | 1.0.3 | -9 | 13,532,091 | 0.255 |
| Maximum Compression Corpus | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 12,371,876 | 0.233 |
| Maximum Compression Corpus | rar | 3.61 | “best”; dictionary = 4096K; solid | 12,536,286 | 0.236 |
| Maximum Compression Corpus | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 13,213,213 | 0.249 |
| Maximum Compression Corpus | uharc | 0.6b | -mx -md32768 -mm+ | 11,516,783 | 0.217 |
| Genesis ROMs | tar | 1.13 | n/a | 1,652,285,440 | 1.000 |
| Genesis ROMs | zip | 4.43* | dictionary = 64K; word size = 128 | 837,796,034 | 0.507 |
| Genesis ROMs | gzip | 1.3.5 | -9 | 883,292,469 | 0.535 |
| Genesis ROMs | bzip2 | 1.0.3 | -9 | 914,155,101 | 0.553 |
| Genesis ROMs | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 515,316,845 | 0.312 |
| Genesis ROMs | rar | 3.61 | “best”; dictionary = 4096K; solid | 599,596,546 | 0.363 |
| Genesis ROMs | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 611,391,679 | 0.370 |
| Genesis ROMs | uharc | 0.6b | -mx -md32768 -mm+ | 529,355,573 | 0.320 |
| Website Backup | tar | 1.13 | n/a | 276,687,872 | 1.000 |
| Website Backup | zip | 4.43* | dictionary = 64K; word size = 128 | 271,871,915 | 0.983 |
| Website Backup | gzip | 1.3.5 | -9 | 272,458,212 | 0.985 |
| Website Backup | bzip2 | 1.0.3 | -9 | 269,352,089 | 0.973 |
| Website Backup | 7zip | 4.43 | -m0=lzma: a=1: d=0: lc=8: LP0: PB0: mf=bt2** | 268,550,888 | 0.971 |
| Website Backup | rar | 3.61 | “best”; dictionary = 4096K; solid | 269,030,245 | 0.972 |
| Website Backup | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 269,240,799 | 0.973 |
| Website Backup | uharc | 0.6b | -mx -md32768 -mm+ | 264,782,664 | 0.957 |

* ZIP has a number of implementations. The open-source zip that is part of the GNU toolchain doesn’t yet support deflate64 as an algorithm, so instead I used the Zip capabilities in 7zip 4.43, which does.

** When invoked through the GUI, 7zip inexplicably choked on this test corpus at 24%. Invoked through the command line with the switch string shown in the Mode column (taken from Werner Bergmans’ configuration), it compressed without issue.
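The Ratio column appears to be the archive size divided by the size of the uncompressed tarball, so lower is better. As a rough sketch of that arithmetic, the snippet below recomputes the Linux 2.6.18 rows from the byte counts in the table; the function names `ratio` and `ranked` are my own and purely illustrative.

```python
# Sizes in bytes, copied from the Linux 2.6.18 rows of the table.
TAR_SIZE = 240_138_240
ARCHIVES = {
    "zip": 47_386_583,
    "gzip": 52_467_357,
    "bzip2": 41_863_580,
    "7zip": 33_974_276,
    "rar": 35_342_737,
    "ace": 40_184_600,
    "uharc": 30_251_189,
}


def ratio(compressed: int, original: int) -> float:
    """Compressed size as a fraction of the original (lower is better)."""
    return compressed / original


def ranked(archives: dict, original: int) -> list:
    """Return (format, ratio rounded to 3 places) pairs, best compression first."""
    pairs = [(name, round(ratio(size, original), 3)) for name, size in archives.items()]
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, r in ranked(ARCHIVES, TAR_SIZE):
        # 1 - ratio is the fraction of space saved relative to the tarball.
        print(f"{name:6s} {r:.3f}  ({(1 - r) * 100:.1f}% saved)")
```

Swapping in the tar and archive sizes from another corpus reproduces that corpus's Ratio column the same way.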

Conclusion

Interestingly, the top spot seems split between 7-Zip and UHARC. 7-Zip even beats WinRAR in compressing JPEGs, which surprises me, given the former’s lack of a multimedia filter. UHARC, of course, is a heftier compressor, meaning it takes longer to run and its resultant archives are less flexible. Still, one can easily make an SFX package with it.

In general, 7-Zip and WinRAR were the fastest, even at their maximum settings; these two programs get the best compression in the best time, a fact that I will point out even though time was not a factor in my tests. They also have the nicest user interfaces, in my estimation. WinACE is a waste of time. Using the ZIP format on its highest setting resulted in mediocre compression but required one of the longest compression times, a result that doesn’t speak highly of the format.

§1505 · November 13, 2006

7 Comments to “More on compression”

  1. You know what would be interesting to see is the processor time used by each. Most people will be doing other things at the same time, so why not measure what it takes away from the user?

  2. Ben says:

    gzip and bzip2 aren’t as culpable, but archivers like 7Zip and WinRAR in “best” mode eat up as much CPU as you’ll allow them, which is why they come with options to run the compression in the background. I’m not sure WinACE has this option, and while the UHARC frontend I used was very nice, it doesn’t have such an option, either.

    UHARC is so interesting to me, because it was the archiver for a long time, especially in the warez scene. I still remember seeing Class rips that had to be unpacked and injected with sound files. Then, at some point, RAR became the de facto standard for compression in the “Scene,” even though its multimedia compression doesn’t particularly impress me. Blame the ready availability of the unrar DLL, and the limitations inherent to UHARC: once games got to be over 2GB in size, there were some problems, apparently.

    As I said before, some formats like PAQ or RK were so ridiculous in their requirements as to be untenable for realistic desktop use. UHARC pushes the limits (especially decompression time), but I included it as a foil for WinRAR (UHARC being the first widespread compressor with a multimedia filter, iirc) and also because there are some nice GUIs for it now.

  3. Kabari says:

    Hey Ben! I stumbled across this site and just wanted to leave a post saying what’s up. Good to see you’re doing well! Are you still studying English man? I just got my English degree and am going to be attending UofC for grad school.
    Take care,
    -Kabari

  4. Ben says:

    Well, color me surprised. It’s been a hell of a long time, Kabari. I’m actually not studying English as a major (I’m an Info Tech guy; can you tell by the content of the above post?), but I have been taking a number of English classes for fun and presenting at conferences (look at the bottom of the Colophon).

  5. Ben says:

    By the way, I love your company. Standards-based, Web2.0-y design? Yum.

  6. Kabari says:

    Thanks! Good read by the way. I was looking for compression info because I wanted to compress my javascript files but wasn’t sure how much of a difference gzipping would make.

  7. Ben says:

    For javascript, a fair bit, but it has to do with how much bandwidth you expect to use in the first place. On my meager workstation here at work, gzip with default switches compresses the prototype.js library to <25% of its original size in about 47 milliseconds. Provided you’re caching efficiently on the server side, and you have visitors who are using browsers capable of unzipping on the client side, you could see significant savings in bandwidth and load times.

    The real question, though, is whether you want to run compression with Apache (mod_deflate, for instance), which will compress everything compressible, or if you want to use the native compression capabilities of the language you’re working with. PHP has a rather effective gzip handler that you can use just by sticking the following code at the top of your javascript file and changing the extension to .php.

    
    <?php
    	// Compress the output with gzip when the browser advertises support for it.
    	ob_start("ob_gzhandler");
    	header("Content-Type: text/javascript; charset=UTF-8");
    	header("Cache-Control: must-revalidate");
    	// Let clients cache the script for one hour.
    	$offset = 60 * 60;
    	$ExpStr = "Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT";
    	header($ExpStr);
    ?>
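
To see how much gzip would save on a particular script before touching any server configuration, one can measure it locally. This is a minimal sketch in Python, assuming the standard-library gzip module at level 6 (the usual default of the gzip command-line tool); the helper name `gzip_stats` and the sample data are made up for illustration, and timings will vary by machine.

```python
import gzip
import time


def gzip_stats(data: bytes, level: int = 6):
    """Return (compressed size in bytes, ratio, elapsed milliseconds)."""
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(compressed), len(compressed) / len(data), elapsed_ms


if __name__ == "__main__":
    # Stand-in for a real script; in practice, read the file's bytes instead,
    # e.g. open("prototype.js", "rb").read() (the path is hypothetical here).
    sample = b"function add(a, b) { return a + b; }\n" * 2000
    size, ratio, ms = gzip_stats(sample)
    print(f"{len(sample)} -> {size} bytes (ratio {ratio:.3f}) in {ms:.1f} ms")
```

Highly repetitive sample text like this compresses far better than real code would, so the number worth trusting is the one measured on the actual file.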
    
    
