18 months ago, I did an informal test with GUI archivers and their various formats. Recently, I’ve been reading about data compression and formats, and got it into my head to do some more testing.
Because someone like Werner Bergmans says just about everything there is to say about compressors of all stripes, I thought my own testing would focus primarily on the sorts of programs and formats that one would see in everyday use. Thus, no esoteric command line encoders from universities and enthusiasts will make it into this test. Rather, I will be looking at the old standby ZIP format, as well as 7-Zip’s 7z, WinRAR’s rar, WinACE’s ace, UHARC, and the stalwart *nix tools, gzip and bzip2. All of these tools are readily usable in GUIs for the Windows platform.
Two other programs that I considered but did not end up using were KGB Archiver and WinRK. The former is a frontend for PAQ6v2, and its compression time, even under the “normal” setting, was prohibitively long, though had I used the maximum setting, it might have surpassed the other formats after a few days of compressing. The same with WinRK: its maximum (PWCM) mode would have gotten extremely good results after a ridiculous and unusable amount of time. For that reason, both have been excluded.
Testbed
- Windows: my current Windows directory in a tarball, almost 6,500 files in just over 1GB of data, a mix of text, binary, and media.
- Linux 2.6.18: the source for v2.6.18 of the Linux kernel.
- Maximum Compression: Werner Bergmans’ test corpus, which he offers for download (a pack of the individual data types, not the 510-file corpus used for MFC tests).
- Genesis ROMs: a collection of over 1,200 pure binary dumps from Sega Genesis cartridges.
- Website backup: a tarball of a website I created some time ago, of which most of the size comes from the original, 300dpi source photographs used in the content.
The Data
View a much nicer PDF version of my original table
| Corpus | Type | Version | Mode | Size (bytes) | Ratio |
|---|---|---|---|---|---|
| Windows | tar | 1.13 | n/a | 1,026,192,384 | 1.000 |
| Windows | zip | 4.43* | dictionary = 64K; word size = 128 | 417,771,283 | 0.407 |
| Windows | gzip | 1.3.5 | -9 | 423,443,208 | 0.413 |
| Windows | bzip2 | 1.0.3 | -9 | 413,704,539 | 0.403 |
| Windows | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 317,998,198 | 0.310 |
| Windows | rar | 3.61 | “best”; dictionary = 4096K; solid | 355,591,630 | 0.347 |
| Windows | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 370,975,963 | 0.362 |
| Windows | uharc | 0.6b | -mx -md32768 -mm+ | 326,051,179 | 0.318 |
| Linux 2.6.18 | tar | 1.13 | n/a | 240,138,240 | 1.000 |
| Linux 2.6.18 | zip | 4.43* | dictionary = 64K; word size = 128 | 47,386,583 | 0.197 |
| Linux 2.6.18 | gzip | 1.3.5 | -9 | 52,467,357 | 0.218 |
| Linux 2.6.18 | bzip2 | 1.0.3 | -9 | 41,863,580 | 0.174 |
| Linux 2.6.18 | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 33,974,276 | 0.141 |
| Linux 2.6.18 | rar | 3.61 | “best”; dictionary = 4096K; solid | 35,342,737 | 0.147 |
| Linux 2.6.18 | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 40,184,600 | 0.167 |
| Linux 2.6.18 | uharc | 0.6b | -mx -md32768 -mm+ | 30,251,189 | 0.126 |
| Maximum Compression Corpus | tar | 1.13 | n/a | 53,144,064 | 1.000 |
| Maximum Compression Corpus | zip | 4.43* | dictionary = 64K; word size = 128 | 13,962,749 | 0.263 |
| Maximum Compression Corpus | gzip | 1.3.5 | -9 | 14,953,388 | 0.281 |
| Maximum Compression Corpus | bzip2 | 1.0.3 | -9 | 13,532,091 | 0.255 |
| Maximum Compression Corpus | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 12,371,876 | 0.233 |
| Maximum Compression Corpus | rar | 3.61 | “best”; dictionary = 4096K; solid | 12,536,286 | 0.236 |
| Maximum Compression Corpus | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 13,213,213 | 0.249 |
| Maximum Compression Corpus | uharc | 0.6b | -mx -md32768 -mm+ | 11,516,783 | 0.217 |
| Genesis ROMs | tar | 1.13 | n/a | 1,652,285,440 | 1.000 |
| Genesis ROMs | zip | 4.43* | dictionary = 64K; word size = 128 | 837,796,034 | 0.507 |
| Genesis ROMs | gzip | 1.3.5 | -9 | 883,292,469 | 0.535 |
| Genesis ROMs | bzip2 | 1.0.3 | -9 | 914,155,101 | 0.553 |
| Genesis ROMs | 7zip | 4.43 | LZMA; dictionary = 64MB; word size = 128; solid | 515,316,845 | 0.312 |
| Genesis ROMs | rar | 3.61 | “best”; dictionary = 4096K; solid | 599,596,546 | 0.363 |
| Genesis ROMs | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 611,391,679 | 0.370 |
| Genesis ROMs | uharc | 0.6b | -mx -md32768 -mm+ | 529,355,573 | 0.320 |
| Website Backup | tar | 1.13 | n/a | 276,687,872 | 1.000 |
| Website Backup | zip | 4.43* | dictionary = 64K; word size = 128 | 271,871,915 | 0.983 |
| Website Backup | gzip | 1.3.5 | -9 | 272,458,212 | 0.985 |
| Website Backup | bzip2 | 1.0.3 | -9 | 269,352,089 | 0.973 |
| Website Backup | 7zip | 4.43 | -m0=lzma: a=1: d=0: lc=8: LP0: PB0: mf=bt2** | 268,550,888 | 0.971 |
| Website Backup | rar | 3.61 | “best”; dictionary = 4096K; solid | 269,030,245 | 0.972 |
| Website Backup | ace | 2.65 | “maximum”; dictionary = 4096K; solid | 269,240,799 | 0.973 |
| Website Backup | uharc | 0.6b | -mx -md32768 -mm+ | 264,782,664 | 0.957 |
* ZIP has a number of implementations. The open-source zip that is part of the GNU toolchain doesn’t yet support deflate64 as an algorithm, so I used the Zip capabilities in 7zip 4.43, which does.
** When invoked with the GUI, 7zip inexplicably choked on this test corpus at 24%. Invoked through the command line with the switch string shown in the table (taken from Werner Bergmans’ configuration), it compressed without issue.
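For anyone who wants to check the Ratio column, it is simply the compressed size divided by the size of the uncompressed tarball, so lower is better. Here is a minimal sketch of that arithmetic using a few of the Linux 2.6.18 rows; the function and variable names are my own illustrative choices, not part of any tool used in the test.

```python
def compression_ratio(compressed_bytes: int, original_bytes: int) -> float:
    """Return the compressed size as a fraction of the original size."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return compressed_bytes / original_bytes


# Sizes in bytes, copied from the Linux 2.6.18 rows of the table.
original = 240_138_240
sizes = {
    "zip": 47_386_583,
    "bzip2": 41_863_580,
    "7zip": 33_974_276,
    "uharc": 30_251_189,
}

# Round to three places, the same precision the table uses.
ratios = {
    name: round(compression_ratio(size, original), 3)
    for name, size in sizes.items()
}

for name, ratio in ratios.items():
    print(f"{name:6s} {ratio:.3f}")
```

The same function works for any row: pass the archive size first and the matching tar size second.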
Conclusion
Interestingly, the top spot seems split between 7-Zip and UHARC. 7-Zip even beats WinRAR in compressing JPEGs, which surprises me, given the former’s lack of a multimedia filter. UHARC, of course, is a heftier compressor, meaning it takes longer to run and its resultant archives are less flexible. Still, one can easily make an SFX package with it.
In general, 7-Zip and WinRAR were the fastest, even at their maximum settings; these two programs get the best compression in the best time, a fact that I will point out even though time was not a factor in my tests. They also have the nicest user interfaces, in my estimation. WinACE is a waste of time. Using the ZIP format on its highest setting resulted in mediocre compression but required one of the longest compression times, a result that doesn’t speak highly of it.
You know what would be interesting to see is the processor time used by each. Most people will be doing other things at the same time, so why not see what it would take away from the user?
gzip and bzip2 aren’t as culpable, but archivers like 7Zip and WinRAR in “best” mode eat up as much CPU as you’ll allow them, which is why they come with options to run the compression in the background. I’m not sure WinACE has this option, and while the UHARC frontend I used was very nice, it doesn’t have such an option, either.
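A rough way to get at the processor-time question is to measure CPU time rather than wall-clock time. The sketch below is only an illustration under stated assumptions: it uses Python's standard-library `zlib`, `bz2`, and `lzma` modules as stand-ins for the deflate, bzip2, and LZMA families, not the GUI archivers tested above, and the sample data is a made-up placeholder, so the numbers say nothing about WinRAR, WinACE, or UHARC.

```python
import bz2
import lzma
import time
import zlib


def cpu_seconds(compress, data: bytes):
    """Run compress(data) and return (CPU seconds used, compressed size)."""
    start = time.process_time()  # CPU time of this process, not wall-clock
    packed = compress(data)
    return time.process_time() - start, len(packed)


# Hypothetical sample data; substitute the bytes of a real file.
sample = b"The quick brown fox jumps over the lazy dog.\n" * 20_000

# Stand-in codecs from the standard library (an assumption of this sketch).
codecs = {
    "deflate (zlib, level 9)": lambda d: zlib.compress(d, 9),
    "bzip2 (level 9)": lambda d: bz2.compress(d, 9),
    "LZMA (preset 6)": lambda d: lzma.compress(d, preset=6),
}

results = {name: cpu_seconds(fn, sample) for name, fn in codecs.items()}

for name, (seconds, size) in results.items():
    print(f"{name:24s} {seconds:8.3f} s CPU  {size:>9,d} bytes")
```

Because `time.process_time()` counts only the time this process spends on the CPU, it gives a fairer idea of what a compressor takes away from the user than a stopwatch would on a busy machine.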
UHARC is so interesting to me, because it was the archiver for a long time, especially in the warez scene. I still remember seeing Class rips that had to be unpacked and injected with sound files. Then, at some point, RAR became the de facto standard for compression in the “Scene,” even though its multimedia compression doesn’t particularly impress me. Blame the ready availability of the unrar DLL, and the limitations inherent to UHARC: once games got to be over 2GB in size, there were some problems, apparently.

As I said before, some formats like PAQ or RK were so ridiculous in their requirements as to be untenable for realistic desktop use. UHARC pushes the limits (especially decompression time), but I included it as a foil for WinRAR (UHARC being the first widespread compressor with a multimedia filter, iirc) and also because there are some nice GUIs for it now.
Hey Ben! I stumbled across this site and just wanted to leave a post saying what’s up. Good to see you’re doing well! Are you still studying English man? I just got my English degree and am going to be attending UofC for grad school.
Take care,
-Kabari
Well, color me surprised. It’s been a hell of a long time, Kabari. I’m actually not studying English as a major (I’m an Info Tech guy; can you tell by the content of the above post?), but I have been taking a number of English classes for fun and presenting at conferences (look at the bottom of the Colophon).
By the way, I love your company. Standards-based, Web2.0-y design? Yum.
Thanks! Good read by the way. I was looking for compression info because I wanted to compress my javascript files but wasn’t sure how much of a difference gzipping would make.
For javascript, a fair bit, but it has to do with how much bandwidth you expect to use in the first place. On my meager workstation here at work, gzip with default switches compresses the prototype.js library to less than 25% of its original size in about 47 milliseconds. Provided you’re caching efficiently on the server side, and you have visitors who are using browsers capable of unzipping on the client side, you could see a significant savings in bandwidth and load times.

The real question, though, is whether you want to run compression with Apache (mod_deflate, for instance), which will compress everything compressible, or if you want to use the native compression capabilities of the language you’re working with. PHP has a rather effective gzip handler that you can use just by sticking the following code at the top of your javascript file and changing the extension to .php.

```php
<?php
// Gzip the output when the browser says it can accept it
ob_start("ob_gzhandler");
header("Content-type: text/javascript; charset=UTF-8");
header("Cache-Control: must-revalidate");
// Let clients cache the file for one hour
$offset = 60 * 60;
$ExpStr = "Expires: " . gmdate("D, d M Y H:i:s", time() + $offset) . " GMT";
header($ExpStr);
?>
```
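If you want to estimate the saving before touching the server, you can gzip a script locally and compare sizes. This is a minimal sketch using Python's standard `gzip` module at level 6, which matches the default level of the gzip command-line tool; the `gzip_saving` helper and the sample script text are hypothetical, and a repetitive sample like this one compresses far better than real code would.

```python
import gzip


def gzip_saving(text: str, level: int = 6):
    """Return (original bytes, gzipped bytes, ratio) for a piece of text."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw, compresslevel=level)
    return len(raw), len(packed), len(packed) / len(raw)


# Hypothetical stand-in for a script file; read your own .js file instead.
script = "function add(a, b) {\n    return a + b;\n}\n" * 500

original, packed, ratio = gzip_saving(script)
print(f"{original} bytes -> {packed} bytes ({ratio:.1%} of original)")
```

To try it on a real file, replace `script` with the contents of that file, for example `open("prototype.js", encoding="utf-8").read()`.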