A Modest Construct

Tag: compression

File Compressors in 64-bit

Though I’m not the sort of person who believes that native 64-bit compilations of programs will automagically make them perform faster or better, I do like to keep an eye on the state of the art, since I was an early adopter of native 64-bit OSes (I’ve been using 64-bit Linux since about Fedora Core 2 or 3, and beta versions of Windows XP x64) when AMD launched their K8 platform.

Previously, I’ve casually benchmarked the JavaScript speeds of 64-bit browsers v. their 32-bit counterparts (here); more recently, I benchmarked a 64-bit compile of FLAC against several other 32-bit compiles of the same version (here).

This time, I decided to test various and sundry file compression utilities, specifically those which offer both 32- and 64-bit versions. This benchmark did not exhaustively test all potential combinations of compression options (if you’re interested in that, see Werner Bergman’s excellent Maximum Compression and Matt Mahoney’s Data Compression Programs), nor did it compare the various compressors to each other. It does not even list how well the programs actually compressed, since that’s not really a consideration here. The sole purpose of the benchmark was to compare the execution time of a 32-bit program with that of its 64-bit version.

Tracking LZMA efficiency

I’m a big fan of 7-Zip. It isn’t the best-looking application ever written, but that could be because its creator, Igor Pavlov, is concerned much more with its compression methods than its interface. 7-Zip has its own container format, but more important is the LZMA compression algorithm that Igor wrote and put into the public domain.

I decided to do some quick and dirty benchmarks to track the progress of LZMA/7-Zip over time. I went back as far as Igor supplied binaries, including one from the very old 3.x series. Rather than test every single release between then and now, I used only “stable” releases, with the exception of version 4.65, which is the latest version of any sort, as well as 4.66, which uses an alpha version of Igor’s new LZMA2 codec (and, as you’ll see, provides a definite performance improvement).

I used Igor’s Timer utility to time the process (global time was reported). The corpus in this case was the Linux kernel source, v2.6.28. I conducted these tests on a RAM disk to eliminate hard disk latency issues (especially for decompressions, which improved by about 25% from my initial HDD-based tests). My rig is an Intel Core 2 Quad Q6600 (2.4 GHz) with 4GB of RAM (1GB of it dedicated to the RAM disk), running Vista SP1 x64.

The command line setup was an approximation of the 7-Zip GUI’s “ultra” settings: -t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on, letting the archiver auto-choose the number of threads to spawn.
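For anyone who wants to try something similar, here is a minimal Python sketch of that setup. It is not the harness I used (that was Igor’s Timer utility); it only shows how the switches above fit into a 7-Zip “add” command and how wall-clock time could be measured around it. The executable name 7z, the function names, and the archive and source paths are assumptions for illustration.

```python
import subprocess
import time

# Switches copied from the settings listed above.
ULTRA_SWITCHES = ["-t7z", "-m0=lzma", "-mx=9", "-mfb=64", "-md=32m", "-ms=on"]


def build_command(archive, source, exe="7z"):
    """Build the argument list for a 7-Zip 'a' (add) run.

    `exe` is an assumption: the binary may be 7z, 7za, or a full path,
    depending on the install.
    """
    return [exe, "a", *ULTRA_SWITCHES, archive, source]


def time_command(args):
    """Run a command to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical paths; point these at a real corpus before running.
    cmd = build_command("kernel.7z", "linux-2.6.28")
    print(" ".join(cmd))
    # elapsed = time_command(cmd)  # uncomment once 7-Zip is on the PATH
```

Wall-clock time measured this way should be roughly comparable to Timer’s global time, though it also includes a little process start-up overhead from Python itself.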

Using YUI compressor in a web project

Last year, I moved our small programming department from using JDeveloper and editing shared files directly on a network drive to using NetBeans 6.x and a proper version control system (Subversion).

After the initial learning curve, this has all been going swimmingly. I merged my first development branch into the trunk yesterday, and this branch just so happens to dovetail nicely into the whole point of this post, which is the YUI compressor, an open-source JavaScript and CSS minification tool developed by Yahoo’s YUI team.

Size curves for office file formats

Just a few days ago, I compared the relative sizes of Microsoft’s Office Open XML (OOXML) and OASIS’s OpenDocument format (ODF). I noticed that while OOXML was smaller for smaller amounts of text, ODF was smaller for larger documents. I was curious as to the turning point for this curve, which I hypothesize has to do with the complexity of OOXML’s markup.

I ran a brief test using generated Lorem Ipsum text in approximate amounts (the leftmost column), and recorded its size (in bytes) when pasted into Notepad, and then as OpenDocument Text (OpenOffice.org 2.3.1), and then as OOXML (Office 2007 SP1).

After the data table is a graphical representation of the results. It’s clear that ODF slips below OOXML somewhere between 300KB and 400KB of raw textual data.

Comparison of file format sizes
Size   Text (bytes)  OOXML (bytes)  ODF (bytes)
5k             5030          12209        29408
25k           25158          14173        29715
50k           50318          15116        30039
100k         100638          18020        30616
200k         201276          24901        31670
300k         301918          31238        32676
400k         402558          37594        33634
800k         805118          61805        37418
1600k       1610238         110468        44881

[Figure: file sizes]
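To narrow down that turning point, one option is to interpolate linearly between the two rows where the size difference changes sign. The sketch below does this with the figures from the table; the assumption that both formats grow linearly between measurements is mine, and the function name is just illustrative, so treat the result as a rough estimate rather than a measurement.

```python
# (raw text bytes, OOXML bytes, ODF bytes), copied from the table above.
ROWS = [
    (5030, 12209, 29408),
    (25158, 14173, 29715),
    (50318, 15116, 30039),
    (100638, 18020, 30616),
    (201276, 24901, 31670),
    (301918, 31238, 32676),
    (402558, 37594, 33634),
    (805118, 61805, 37418),
    (1610238, 110468, 44881),
]


def crossover(rows):
    """Estimate the raw text size at which ODF becomes the smaller file.

    Walks adjacent rows, finds where (OOXML - ODF) goes from negative to
    non-negative, and interpolates linearly inside that interval.
    Returns None if the sign never changes.
    """
    for (t0, x0, d0), (t1, x1, d1) in zip(rows, rows[1:]):
        g0, g1 = x0 - d0, x1 - d1  # positive means OOXML is larger
        if g0 < 0 <= g1:
            return t0 + (t1 - t0) * (-g0) / (g1 - g0)
    return None


if __name__ == "__main__":
    print(f"Estimated crossover: {crossover(ROWS):.0f} bytes of raw text")
```

With only nine data points the estimate is coarse; more measurements between 300k and 400k would pin it down better.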

Comparing office file formats

A while ago, as OpenOffice.org 2.0 approached completion, I compared the file sizes of Microsoft Office’s binary format against OpenOffice’s new OpenDocument format. Recall that OpenDocument is an XML-based storage format that is ultimately compressed into a zip file, creating smaller file sizes. Microsoft’s new Office Open XML is essentially the same thing, but with a totally different XML schema.

I decided to revisit this kind of test, and had four test files:

  1. The text of Ulysses, in HTML format. I chose HTML format to test the extra markup, as it should theoretically create a more complex document.
  2. A very large generated Lorem Ipsum block (205′000+ characters), which is pseudo-random, but with a lot of redundancy.
  3. A one-page block of Lorem Ipsum text, in order to test the handling of small files.
  4. A randomly generated CSV with multiple kinds of text and 5′000 records, converted using OpenOffice Calc and Microsoft Excel.

Read on for the data table and observations.

Common compression and corpuses

Every so often, I dink around with benchmarking common lossless compressors. One of the best sites for it is, I think, Werner Bergman’s Maximum Compression, which is a rather comprehensive running benchmark of just about every lossless compressor under the sun. Really, there are a lot. What you have to understand about the world of compressors is that they are very often academic projects or toys that very smart people play with in their free time. There are also companies (but not many) who invest in their own proprietary algorithms for lossless compression.

Here’s the catch, though: the quality of a compressor isn’t measured by its final compression ratio alone. The PAQ series of compressors, for instance, offers great compression and really, truly awful compression times. The same goes for the highest compression levels of WinRK (a proprietary Win32 format with an accompanying GUI). But disk is cheap: nobody really cares about a fraction of a percentage of compression efficiency, do they? What people really want is for their (inevitable) archiving GUI to take less time doing what it does.

In this spirit, I have compiled not so much an exhaustive list of possible compression algorithms (I’ll leave that to Werner, who is very good at what he does), but rather a short list of the most common formats, tested on three different (relatively well-known) corpuses: the Calgary Corpus, the newer Canterbury Corpus, and Andrew Tridgell’s 1999 Large Corpus. The first two of these are corpuses used to test the very kind of academic project which I’ve avoided. I dislike using them because they are small in size, which means that there is significantly less opportunity for variations in compression formats to manifest themselves. In the interest of verifiability, however, I have used them. I also included Andrew Tridgell’s large corpus because it’s been my experience that results on small test corpuses tend to vary too much due to disk I/O latency and other vagaries of compression algorithms.

What will follow is a data table for each corpus, followed by some brief observations about each.
