<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>A Modest Construct &#187; compression</title>
	<atom:link href="http://heliologue.com/tag/compression/feed/" rel="self" type="application/rss+xml" />
	<link>http://heliologue.com</link>
	<description></description>
	<lastBuildDate>Fri, 03 Feb 2012 17:18:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>File Compressors in 64-bit</title>
		<link>http://heliologue.com/2010/03/06/file-compressors-in-64-bit/</link>
		<comments>http://heliologue.com/2010/03/06/file-compressors-in-64-bit/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 04:22:20 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://heliologue.com/?p=4991</guid>
		<description><![CDATA[Though I&#8217;m not the sort of person who believes that native 64-bit compilations of programs will automagically make them perform faster or better, I do like to keep an eye on the state of the art, since I was an early adopter of native 64-bit OSes (I&#8217;ve been using 64-bit Linux since about Fedora Core [...]]]></description>
			<content:encoded><![CDATA[<p>Though I&#8217;m not the sort of person who believes that native 64-bit compilations of programs will automagically make them perform faster or better, I do like to keep an eye on the state of the art, since I was an early adopter of native 64-bit OSes (I&#8217;ve been using 64-bit Linux since about Fedora Core 2 or 3, and beta versions of Windows XP x64) when AMD launched their K8 platform.</p>
<p>Previously, I&#8217;ve casually benchmarked the Javascript speeds of 64-bit browsers v. their 32-bit counterparts (<a href="http://heliologue.com/2008/03/23/javascript-engines-in-32-bit-and-64-bit-browsers/">here</a>);  more recently, I benchmarked a 64-bit compile of FLAC against several other 32-bit  compiles of the same version (<a href="http://heliologue.com/2008/12/21/flac-compile-benchmarks/">here</a>).</p>
<p>This time, I decided to test various and sundry file compression utilities, specifically those which offer both 32- and 64-bit versions of themselves.  This benchmark does not exhaustively test all potential combinations of compression options (if you&#8217;re interested in that, see Werner Bergmans&#8217; excellent <a href="http://www.maximumcompression.com/" rel="external">Maximum Compression</a> and Matt Mahoney&#8217;s <a href="http://mattmahoney.net/dc/" rel="external">Data Compression Programs</a>), nor does it compare the compressors to each other;  it doesn&#8217;t even list how well the programs actually compressed, since that&#8217;s not really a consideration here.  The sole purpose of the benchmark is to compare the execution time of a 32-bit program with that of its 64-bit version.</p>
<p><span id="more-4991"></span></p>
<p>The corpus in this case was <a href="http://mattmahoney.net/dc/textdata.html" rel="external">enwik9</a>, compressed to a RAM disk to minimize the potential effects of write latency.  I wanted a corpus large enough to tease out significant differences in these compressors.</p>
<p>Each compressor&#8217;s results are listed on its own page, along with an explanation of the compressor, its origin, and any additional notes.  One notable omission from this list is WinRK, which is available in a 64-bit version but has no command-line interface.</p>
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2010/03/06/file-compressors-in-64-bit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tracking LZMA efficiency</title>
		<link>http://heliologue.com/2009/02/09/tracking-lzma-efficiency/</link>
		<comments>http://heliologue.com/2009/02/09/tracking-lzma-efficiency/#comments</comments>
		<pubDate>Tue, 10 Feb 2009 04:51:26 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[codecs]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://heliologue.com/?p=3580</guid>
		<description><![CDATA[I&#8217;m a big fan of 7-Zip. It isn&#8217;t the best-looking application ever written, but that could be because its creator, Igor Pavlov, is concerned much more with its compression methods than its interface. 7-Zip has its own container format, but more important is the LZMA compression algorithm that Igor wrote and put into the public [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a big fan of <a href="http://7-zip.org">7-Zip</a>. It isn&#8217;t the best-looking application ever written, but that could be because its creator, Igor Pavlov, is concerned much more with its compression methods than its interface.  7-Zip has its own container format, but more important is the <a href="http://en.wikipedia.org/wiki/LZMA">LZMA</a> compression algorithm that Igor wrote and put into the public domain.</p>
<p>I decided to do some quick and dirty benchmarks to track the progress of LZMA/7-Zip over time.  I went back as far as Igor supplied binaries, including one from the very old 3.x series.  Rather than test every single release between then and now, I used only &#8220;stable&#8221; releases, with the exception of version 4.65, which is the latest version of any sort, as well as 4.66, which uses an alpha version of Igor&#8217;s new LZMA2 codec (and, as you&#8217;ll see, provides a definite performance improvement).</p>
<p>I used Igor&#8217;s Timer utility to time the process (global time was reported).  The corpus in this case was the <a href="http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.28.tar.bz2">Linux kernel source, v2.6.28</a>.  I conducted these tests on a RAM disk to eliminate hard disk latency issues (especially for decompression, which improved by about 25% from my initial HDD-based tests). My rig is an Intel Core 2 Quad Q6600 [2.4GHz], with 4GB of RAM (1GB of it dedicated to the RAM disk), running Vista SP1 x64.</p>
<p>The command line setup was an approximation of the 7-Zip GUI&#8217;s &#8220;ultra&#8221; settings:  <code>-t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on</code>, letting the archiver auto-choose the number of threads to spawn.  <span id="more-3580"></span></p>
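<p>For anyone who wants to repeat a run, here is a minimal Python sketch of the timing step. It is not the setup used for the numbers below (those came from Igor&#8217;s Timer utility); it simply wraps the call in <code>time.perf_counter()</code>. The function names are made up for illustration, and the executable name and paths are up to the caller. The switches are the ones quoted above.</p>

```python
import subprocess
import time

# Switches quoted in the post: an approximation of the GUI "ultra" preset.
ULTRA_SWITCHES = ["-t7z", "-m0=lzma", "-mx=9", "-mfb=64", "-md=32m", "-ms=on"]


def build_command(exe, archive, source):
    """Return the argument list for one compression run ('a' adds to an archive)."""
    return [exe, "a", *ULTRA_SWITCHES, archive, source]


def time_command(argv):
    """Run argv to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start
```

<p>With a 7-Zip command-line binary on the path, something like <code>time_command(build_command("7z", "R:/out.7z", "R:/linux-2.6.28"))</code> would return the elapsed seconds for one run; the <code>R:</code> RAM disk paths are hypothetical placeholders.</p>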
<h3>The Data</h3>
<table class="sortable rowstyle-even">
<caption>
		LZMA Efficiency<br />
	</caption>
<thead>
<tr>
<th class="sortable-text">
				7-zip version
			</th>
<th class="sortable-numeric">
				encoding time (s)
			</th>
<th class="sortable-numeric">
				decoding time (s)
			</th>
</tr>
</thead>
<tbody>
<tr>
<td>
				3.13
			</td>
<td>
				541.271
			</td>
<td>
				43.379
			</td>
</tr>
<tr>
<td>
				4.20
			</td>
<td>
				531.457
			</td>
<td>
				44.040
			</td>
</tr>
<tr>
<td>
				4.23
			</td>
<td>
				527.871
			</td>
<td>
				42.425
			</td>
</tr>
<tr>
<td>
				4.32
			</td>
<td>
				341.290
			</td>
<td>
				42.126
			</td>
</tr>
<tr>
<td>
				4.42
			</td>
<td>
				219.451
			</td>
<td>
				42.211
			</td>
</tr>
<tr>
<td>
				4.57
			</td>
<td>
				174.064
			</td>
<td>
				44.163
			</td>
</tr>
<tr>
<td>
				4.62
			</td>
<td>
				170.973
			</td>
<td>
				42.836
			</td>
</tr>
<tr>
<td>
				4.65
			</td>
<td>
				170.917
			</td>
<td>
				43.058
			</td>
</tr>
<tr>
<td>
				4.66 (lzma2)
			</td>
<td>
				126.259
			</td>
<td>
				46.663
			</td>
</tr>
</tbody>
</table>
<h3>The Analysis</h3>
<p><a href="/img/albums/Software/lzma_compression_graph.png" class="right" rel="lightbox" title="Tracking LZMA efficiency"><img src="/img/albums/Software/lzma_compression_graph_thumb.png" alt="LZMA efficiency graph" /></a></p>
<p>Without conducting a more thorough battery of tests on a variety of different configurations, it&#8217;s difficult to say with certainty just <em>where</em> the performance improvements came from, be it better use of threading or multiprocessors, general algorithmic improvements, or something else.  I also don&#8217;t know if the performance increases we see reside in improvements to LZMA itself as Igor was finalizing it, or just the code quality of 7-Zip, which <em>implements</em> LZMA.</p>
<p>In any case, the improvements since 3.13 are very clear (remember that lower is better), at least for compression, and for &#8220;ultra&#8221; settings.  Decompression remained largely similar, which surprised me.  Some of these results might be directly tied to the number and type of files that were compressed in this case:  4.66, for instance, improves decompression speed for incompressible files, but no such files exist here since it&#8217;s source code.</p>
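<p>To put numbers on &#8220;very clear,&#8221; here is a short Python sketch that divides the 3.13 encoding time by each later version&#8217;s time, using only the figures from the table above. The names are just for illustration.</p>

```python
# Encoding times in seconds, copied from the table above.
ENCODE_SECONDS = {
    "3.13": 541.271,
    "4.20": 531.457,
    "4.23": 527.871,
    "4.32": 341.290,
    "4.42": 219.451,
    "4.57": 174.064,
    "4.62": 170.973,
    "4.65": 170.917,
    "4.66 (lzma2)": 126.259,
}


def speedups(times, baseline="3.13"):
    """Return how many times faster each version encodes than the baseline."""
    base = times[baseline]
    return {version: base / seconds for version, seconds in times.items()}


for version, factor in speedups(ENCODE_SECONDS).items():
    print("%-13s %.2fx" % (version, factor))
```

<p>By this measure, 4.65 encodes a little over three times as fast as 3.13, and the 4.66 alpha a little over four times as fast.</p>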
<p>Hats off to Igor Pavlov for his steady improvement on both a really great compression standard and one of my favorite pieces of software for Windows.</p>
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2009/02/09/tracking-lzma-efficiency/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using YUI compressor in a web project</title>
		<link>http://heliologue.com/2008/09/22/using-yui-compressor-in-a-web-project/</link>
		<comments>http://heliologue.com/2008/09/22/using-yui-compressor-in-a-web-project/#comments</comments>
		<pubDate>Mon, 22 Sep 2008 23:35:42 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[css]]></category>
		<category><![CDATA[Internet Explorer]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[jquery]]></category>
		<category><![CDATA[web design]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://heliologue.com/?p=2692</guid>
		<description><![CDATA[Last year, I moved our small programming department from using JDeveloper and editing shared files directly on a network drive to using Netbeans 6.x and a proper version control system (Subversion). After the initial learning curve, this has all been going swimmingly. I merged my first development branch into the trunk yesterday, and this branch [...]]]></description>
			<content:encoded><![CDATA[<p>Last year, I moved our small programming department from using JDeveloper and editing shared files directly on a network drive to using Netbeans 6.x and a proper version control system (Subversion).</p>
<p>After the initial learning curve, this has all been going swimmingly.  I merged my first development branch into the trunk yesterday, and this branch just so happens to dovetail nicely into the whole point of this post, which is the <a href="http://developer.yahoo.com/yui/compressor/">YUI compressor</a>, an open-source javascript and CSS minification tool developed by Yahoo&#8217;s <abbr title="Yahoo User Interface">YUI</abbr> team.</p>
<p><span id="more-2692"></span></p>
<h3>The Problem</h3>
<p>To understand quickly why one should minify production client-side code, consider only that with the upward trend of size in javascript libraries (and the necessary files for such libraries), it&#8217;s possible to be downloading a <em>lot</em> of client-side code in a typical web application, especially as its scope grows.</p>
<p>For a long time, I was using <a href="http://dean.edwards.name/packer/">Dean Edwards&#8217; Packer</a>, as was everyone, because its Base62 encoding produced the very lowest file size.  However, what should have been obvious to everyone is that <code>eval(bunch_of_stuff_goes_here)</code> makes the browser do a lot more work, and this is a problem on dinosaurs like IE6.</p>
<p>To make matters worse, the nature of such encoding also meant that for servers which tried to compress outgoing content like javascript (either with zlib or gzip), the compression ratio suffered.  Just look at this table that Julien Lecomte posted last August.</p>
<table class="sortable rowstyle-even">
<caption>Javascript compression</caption>
<thead>
<tr>
<th class="sortable-text">File</th>
<th class="sortable-numeric">File size (bytes)</th>
<th class="sortable-numeric">Gzipped file size (bytes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Original jQuery libraries</td>
<td>62,885</td>
<td>19,758</td>
</tr>
<tr>
<td>jQuery minified with <a href="http://www.crockford.com/javascript/jsmin.html">JSMin</a></td>
<td>36,391</td>
<td>11,541</td>
</tr>
<tr>
<td>jQuery minified with Packer</td>
<td>21,557</td>
<td>11,119</td>
</tr>
<tr>
<td>jQuery minified with the YUI compressor</td>
<td>31,822</td>
<td>10,818</td>
</tr>
</tbody>
</table>
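<p>The gzip penalty is easier to see as a percentage. This small Python sketch, using only the byte counts from the table above, computes how much gzip shaved off each variant; the function name is made up for illustration.</p>

```python
# (file size, gzipped size) in bytes, copied from the table above.
SIZES = {
    "original": (62885, 19758),
    "JSMin": (36391, 11541),
    "Packer": (21557, 11119),
    "YUI compressor": (31822, 10818),
}


def gzip_saving(raw, gzipped):
    """Return the percentage of bytes that gzip removed."""
    return 100.0 * (raw - gzipped) / raw


for name, (raw, gzipped) in SIZES.items():
    print("%-15s %5.1f%%" % (name, gzip_saving(raw, gzipped)))
```

<p>Gzip removes roughly two-thirds of the bytes from the original, JSMin, and YUI output, but less than half from the Packer output, which is why Packer loses its lead once the server compresses responses.</p>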
<p>I said to myself, &#8220;Hey, we use a lot of this stuff, and we still support a lot of users with slow computers and slow browsers.&#8221;  So, I moved our project from a Packer-based compression to a YUI-based compression, and turned on server-side GZIP compression for javascript files.  The only problem was that I was storing the javascript files already minified, and simply pasted them into a couple of large global <code>.js</code> files.  I had to keep a separate source directory, along with any customizations.</p>
<p>This got to be a pain in the ass, as you might well expect, and so when it occurred to me that I might be able to use the YUI library at build-time in our Netbeans project, I immediately sprang into action.</p>
<p>This was around April; one night, while attending <a href="http://heliologue.com/2008/04/13/anaheim-and-other-larks/">Sungard Summit</a> in Anaheim, I did the initial work to get our Netbeans project into a state that could use the YUI compressor at build-time, creating separate source file directories and breaking our massive javascript file into modules;  I did the same with our CSS, splitting it up based on what it decorated.</p>
<p>There are a few tutorials about using the YUI library.  <a href="http://blog.gomilko.com/2007/11/29/yui-compression-tool-as-ant-task/">Some of them</a> involve adding the YUI library to Ant&#8217;s classpath (didn&#8217;t want to go down this route); <a href="http://www.henke.ws/machblog/index.cfm?event=showEntry&#038;entryId=8A5CAB53-19B9-BA51-EECADB57919F9714">a lot of them</a> involve invoking the library as an external executable during the build process, which is messy.</p>
<p>The solution I finally settled on was <a href="http://code.google.com/p/javaflight-code/">yui-compressor-ant-task</a>, a small library that allows Ant to use YUI as a build task.  Adding this library and the YUI compressor library to our common libraries folder, and enabling them at build only (and not for deploying in the web archive), makes using the compressor pretty easy.</p>
<p>Here&#8217;s part of our <code>build.xml</code>:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">&lt;!--</span>
<span style="color: #808080; font-style: italic;">   * minify will concatenate all of our non-TinyMCE javascripts and stylesheets</span>
<span style="color: #808080; font-style: italic;">   * then use the YUI compressor library to compress them</span>
<span style="color: #808080; font-style: italic;">   --&gt;</span>
   <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;target</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;minify&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
       <span style="color: #808080; font-style: italic;">&lt;!--${libs} is path to the downloaded jars --&gt;</span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property</span></span>
<span style="color: #009900;">           <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;yui-compressor.jar&quot;</span></span>
<span style="color: #009900;">           <span style="color: #000066;">location</span>=<span style="color: #ff0000;">&quot;${file.reference.yuicompressor.jar}&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;property</span></span>
<span style="color: #009900;">           <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;yui-compressor-ant-task.jar&quot;</span></span>
<span style="color: #009900;">           <span style="color: #000066;">location</span>=<span style="color: #ff0000;">&quot;${file.reference.yui-compressor-ant-task.jar}&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
&nbsp;
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;path</span> <span style="color: #000066;">id</span>=<span style="color: #ff0000;">&quot;task.classpath&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;pathelement</span> <span style="color: #000066;">location</span>=<span style="color: #ff0000;">&quot;${yui-compressor.jar}&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;pathelement</span> <span style="color: #000066;">location</span>=<span style="color: #ff0000;">&quot;${yui-compressor-ant-task.jar}&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/path<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
&nbsp;
       <span style="color: #808080; font-style: italic;">&lt;!-- yui-compressor task definition --&gt;</span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;taskdef</span></span>
<span style="color: #009900;">           <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;yui-compressor&quot;</span></span>
<span style="color: #009900;">           <span style="color: #000066;">classname</span>=<span style="color: #ff0000;">&quot;net.noha.tools.ant.yuicompressor.tasks.YuiCompressorTask&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
&nbsp;
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;classpath</span> <span style="color: #000066;">refid</span>=<span style="color: #ff0000;">&quot;task.classpath&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/taskdef<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
&nbsp;
       <span style="color: #808080; font-style: italic;">&lt;!-- concatenation of javascript --&gt;</span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;echo</span> <span style="color: #000066;">message</span>=<span style="color: #ff0000;">&quot;Building global javascript&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;concat</span> <span style="color: #000066;">destfile</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/js/global.js&quot;</span> <span style="color: #000066;">force</span>=<span style="color: #ff0000;">&quot;no&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
           <span style="color: #808080; font-style: italic;">&lt;!-- explicitly order js concat because ordering matters here --&gt;</span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.bgiframe.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.hoverIntent.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/ui.core.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/ui.autocomplete.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/ui.datepicker.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/ui.tabs.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/tablesort.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/customsort.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.blockUI.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.form.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.ifixpng.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.superfish.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.cluetip.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.scrollTo.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.jqModal.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/validation.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/js/jquery.timeentry.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/concat<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
&nbsp;
       <span style="color: #808080; font-style: italic;">&lt;!-- concatenation of cascading stylesheets --&gt;</span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;echo</span> <span style="color: #000066;">message</span>=<span style="color: #ff0000;">&quot;Building global cascading stylesheets&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;concat</span> <span style="color: #000066;">destfile</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/css/global.css&quot;</span> <span style="color: #000066;">force</span>=<span style="color: #ff0000;">&quot;no&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/base.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/superfish.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/announcements.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/myvt.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/forms.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/cluetip.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>   
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/tables.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/ui.tabs.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/ui.datepicker.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/ui.autocomplete.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/linkspan.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/stepMenu.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/print.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;fileset</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;web/common/css/youHaveMessages.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/concat<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
&nbsp;
       <span style="color: #808080; font-style: italic;">&lt;!-- invoke compressor --&gt;</span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;yui-compressor</span> <span style="color: #000066;">warn</span>=<span style="color: #ff0000;">&quot;false&quot;</span> <span style="color: #000066;">charset</span>=<span style="color: #ff0000;">&quot;UTF-8&quot;</span> <span style="color: #000066;">fromdir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span> <span style="color: #000066;">todir</span>=<span style="color: #ff0000;">&quot;${build.dir}&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;include</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;web/common/js/global.js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
           <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;include</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;web/common/css/global.css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/yui-compressor<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
&nbsp;
   <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/target<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
&nbsp;
   <span style="color: #808080; font-style: italic;">&lt;!--</span>
<span style="color: #808080; font-style: italic;">   * purge-src takes our compressed files, moves them to the base /common dir</span>
<span style="color: #808080; font-style: italic;">   * and deletes the source js and css dirs from the build dir</span>
<span style="color: #808080; font-style: italic;">   --&gt;</span>
   <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;target</span> <span style="color: #000066;">name</span>=<span style="color: #ff0000;">&quot;purge-src&quot;</span> <span style="color: #000066;">depends</span>=<span style="color: #ff0000;">&quot;minify&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;echo</span> <span style="color: #000066;">message</span>=<span style="color: #ff0000;">&quot;Purging javascript and stylesheet sources&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
&nbsp;
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;move</span> <span style="color: #000066;">file</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/js/global-min.js&quot;</span> <span style="color: #000066;">tofile</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/global.js&quot;</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;move</span> <span style="color: #000066;">file</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/css/global-min.css&quot;</span> <span style="color: #000066;">tofile</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/global.css&quot;</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span>
&nbsp;
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;delete</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/js&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
       <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;delete</span> <span style="color: #000066;">dir</span>=<span style="color: #ff0000;">&quot;${build.dir}/web/common/css&quot;</span> <span style="color: #000000; font-weight: bold;">/&gt;</span></span>
&nbsp;
   <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/target<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>What you see there is essentially four steps.</p>
<ol>
<li>Concatenate all the constituent source files into a <code>global.js</code> and a <code>global.css</code></li>
<li>Compress both of these files, which creates <code>global-min.js</code> and <code>global-min.css</code> (default behavior)</li>
<li>Move these files out of the source directories and into the root of the common web directory as <code>global.js</code> and <code>global.css</code></li>
<li>Delete the source directories in our build folder so they don&#8217;t get deployed with the web archive</li>
</ol>
<p>Because certain browsers (IE) break without explicit ordering, we unfortunately can&#8217;t just use &#8220;*.js&#8221; and &#8220;*.css&#8221; in our concatenation step, but having to explicitly list our components in the build file certainly isn&#8217;t the end of the world.  The nice thing is that the Ant task will even print out handy statistics on just how much you&#8217;ve been able to compress the files down.  In our case, we have about 441.8KB of common JavaScript and CSS in our source code that, by the time it gets sent to the user, has been minified and/or gzipped to about 89KB.</p>
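<p>As a rough check on those figures, here is a small Python sketch of the saving they imply.  The function name <code>percent_saved</code> is purely illustrative and not part of the build.</p>

```python
def percent_saved(original_kb, final_kb):
    """Return the percentage of the original size that was eliminated."""
    return (1 - final_kb / original_kb) * 100

# Figures quoted above: 441.8KB of source, about 89KB delivered.
print(round(percent_saved(441.8, 89), 1))
```

<p>That works out to a reduction of roughly 80%.</p>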
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2008/09/22/using-yui-compressor-in-a-web-project/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Size curves for office file formats</title>
		<link>http://heliologue.com/2008/02/15/size-curves-for-office-file-formats/</link>
		<comments>http://heliologue.com/2008/02/15/size-curves-for-office-file-formats/#comments</comments>
		<pubDate>Fri, 15 Feb 2008 20:19:43 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://heliologue.com/2008/02/15/size-curves-for-office-file-formats/</guid>
		<description><![CDATA[Just a few days ago, I compared the relative sizes of Microsoft&#8217;s Office Open XML (OOXML) and OASIS&#8217;s OpenDocument format (ODF). I noticed that while OOXML was smaller for smaller amounts of text, ODF was smaller for larger documents. I was curious as to the turning point for this curve, which I hypothesize has to [...]]]></description>
			<content:encoded><![CDATA[<p>Just a few days ago, I compared the relative sizes of Microsoft&#8217;s Office Open XML (OOXML) and OASIS&#8217;s OpenDocument format (ODF).  I noticed that while OOXML was smaller for smaller amounts of text, ODF was smaller for larger documents.  I was curious as to the turning point for this curve, which I hypothesize has to do with the complexity of OOXML&#8217;s markup.</p>
<p>I ran a brief test using <a href="http://lipsum.com">generated</a> Lorem Ipsum text in approximate amounts (the leftmost column), and recorded its size (in <strong>bytes</strong>) when pasted into Notepad, and then as OpenDocument Text (OpenOffice.org 2.3.1), and then as OOXML (Office 2007 SP1). </p>
<p>After the data table is a graphical representation of the results.  It&#8217;s clear that ODF slips below OOXML somewhere between 300KB and 400KB of raw textual data.</p>
<table class="sortable rowstyle-even">
<caption>Comparison of file format sizes</caption>
<thead>
<tr>
<th class="sortable-numeric" scope="col">Size</th>
<th class="sortable-numeric" scope="col">Text</th>
<th class="sortable-numeric" scope="col">OOXML</th>
<th class="sortable-numeric" scope="col">ODF</th>
</tr>
</thead>
<tbody>
<tr>
<td>5k</td>
<td>5030</td>
<td>12209</td>
<td>29408</td>
</tr>
<tr>
<td>25k</td>
<td>25158</td>
<td>14173</td>
<td>29715</td>
</tr>
<tr>
<td>50k</td>
<td>50318</td>
<td>15116</td>
<td>30039</td>
</tr>
<tr>
<td>100k</td>
<td>100638</td>
<td>18020</td>
<td>30616</td>
</tr>
<tr>
<td>200k</td>
<td>201276</td>
<td>24901</td>
<td>31670</td>
</tr>
<tr>
<td>300k</td>
<td>301918</td>
<td>31238</td>
<td>32676</td>
</tr>
<tr>
<td>400k</td>
<td>402558</td>
<td>37594</td>
<td>33634</td>
</tr>
<tr>
<td>800k</td>
<td>805118</td>
<td>61805</td>
<td>37418</td>
</tr>
<tr>
<td>1600k</td>
<td>1610238</td>
<td>110468</td>
<td>44881</td>
</tr>
</tbody>
</table>
<p><a href="/img/albums/Software/filesizes_chart.png" class="center" rel="lightbox" title="Size curves for office file formats"><img src="/img/albums/Software/filesizes_chart_thumb.png" alt="file sizes" /></a></p>
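<p>To put a rough number on the turning point, the sketch below linearly interpolates between the two samples where the OOXML and ODF sizes swap order.  The names <code>ROWS</code> and <code>crossover</code> are illustrative, and the assumption that sizes grow linearly between samples is mine; the data does not guarantee it.</p>

```python
# (text, OOXML, ODF) sizes in bytes, copied from the table above.
ROWS = [
    (5030, 12209, 29408),
    (25158, 14173, 29715),
    (50318, 15116, 30039),
    (100638, 18020, 30616),
    (201276, 24901, 31670),
    (301918, 31238, 32676),
    (402558, 37594, 33634),
    (805118, 61805, 37418),
    (1610238, 110468, 44881),
]

def crossover(rows):
    """Estimate the text size where OOXML stops being smaller than ODF.

    Assumes sizes change linearly between adjacent samples.
    """
    for (t0, x0, d0), (t1, x1, d1) in zip(rows, rows[1:]):
        gap0, gap1 = x0 - d0, x1 - d1  # negative while OOXML is smaller
        if gap0 < 0 and gap1 >= 0:
            return t0 + (t1 - t0) * -gap0 / (gap1 - gap0)
    return None  # no sign change in the sampled range

print(round(crossover(ROWS)))
```

<p>Under that assumption the estimate lands at roughly 329,000 bytes of raw text, which is consistent with the 300k and 400k rows.</p>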
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2008/02/15/size-curves-for-office-file-formats/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comparing office file formats</title>
		<link>http://heliologue.com/2008/02/12/comparing-office-file-formats/</link>
		<comments>http://heliologue.com/2008/02/12/comparing-office-file-formats/#comments</comments>
		<pubDate>Tue, 12 Feb 2008 22:49:21 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[OpenDocument]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://heliologue.com/2008/02/12/comparing-office-file-formats/</guid>
		<description><![CDATA[A while ago, as OpenOffice.org 2.0 approached completion, I compared the file sizes of Microsoft Office&#8217;s binary format against OpenOffice&#8217;s new OpenDocument format. Recall that OpenDocument is an XML-based storage format that is ultimately compressed into a zip file, creating smaller file sizes. Microsoft&#8217;s new Office Open XML is essentially the same thing, but with [...]]]></description>
			<content:encoded><![CDATA[<p>A while ago, as OpenOffice.org 2.0 approached completion, I <a href="http://heliologue.com/2005/03/02/ooo/">compared</a> the file sizes of Microsoft Office&#8217;s binary format against OpenOffice&#8217;s new OpenDocument format.  Recall that OpenDocument is an XML-based storage format that is ultimately compressed into a zip file, creating smaller file sizes.  Microsoft&#8217;s new Office Open XML is essentially the same thing, but with a totally different XML schema.</p>
<p>I decided to revisit this kind of test, and had four test files:</p>
<ol>
<li>The text of <a href="http://www.gutenberg.org/etext/4300"><cite>Ulysses</cite></a>, in HTML format.  I chose HTML format to test the extra markup, as it should theoretically create a more complex document.</li>
<li>A very large generated Lorem Ipsum block (205&#8217;000+ characters), which is pseudo-random, but with a lot of redundancy.</li>
<li>A one-page block of Lorem Ipsum text, in order to test the handling of small files</li>
<li>A randomly generated CSV with multiple kinds of text and 5&#8217;000 records.  Converted using OpenOffice Calc and Microsoft Excel.</li>
</ol>
<p>Read on for the data table and observations.</p>
<p><span id="more-1977"></span></p>
<table class="sortable rowstyle-even">
<caption>Comparison of office file format sizes (in bytes)</caption>
<thead>
<tr>
<th class="sortable-text" scope="col">Corpus</th>
<th class="sortable-numeric" scope="col">Ulysses</th>
<th class="sortable-numeric" scope="col">Lorem Ipsum (long)</th>
<th class="sortable-numeric" scope="col">Lorem Ipsum (short)</th>
<th class="sortable-numeric" scope="col">Random CSV</th>
</tr>
</thead>
<tbody>
<tr>
<td>ASCII/HTML</td>
<td>1,646,749</td>
<td>1,586,432</td>
<td>4,069</td>
<td>23,784</td>
</tr>
<tr>
<td>MS Word 97/2000/XP</td>
<td>1,753,088</td>
<td>1,710,592</td>
<td>36,864</td>
<td>58,880</td>
</tr>
<tr>
<td>Office Open XML</td>
<td>876,755</td>
<td>599,277</td>
<td>13,344</td>
<td>26,906</td>
</tr>
<tr>
<td>OpenOffice 1.x</td>
<td>692,315</td>
<td>411,819</td>
<td>27,528</td>
<td>20,671</td>
</tr>
<tr>
<td>OpenDocument</td>
<td>693,672</td>
<td>412,629</td>
<td>27,675</td>
<td>20,826</td>
</tr>
</tbody>
</table>
<p>Remember that both XML formats (OpenDocument and OOXML) are basically XML representations of the data, markup mappings, and any associated files, all wrapped up in a ZIP file, which means everything is compressed by the Deflate algorithm.</p>
<p>I know for a fact that OOXML generates <em>more</em> markup—that is, larger XML files—than OpenDocument.  However, Microsoft appears to use a higher level of compression, since it&#8217;s clear from the &#8220;Lorem Ipsum (short)&#8221; corpus that OOXML excels at smaller files.  On the long Lorem Ipsum test, however, the OpenDocument format is significantly smaller, leading me to believe that there is a point at which the complexity of OOXML&#8217;s markup finally outweighs Office&#8217;s higher compression.</p>
<p>Note that compression level is not part of the spec for either format (that I know of):  the compression strength is set by the program.  If you want, you can unzip the resulting file and rezip it with a stronger setting, and the programs will open it just fine (but you&#8217;ll lose the extra compression as soon as they save it again).</p>
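<p>The unzip-and-rezip step can be sketched with Python&#8217;s standard <code>zipfile</code> module.  This is a minimal sketch rather than a vetted tool: the function name <code>recompress</code> is hypothetical, it does not preserve entry timestamps, and it assumes that keeping entry order and leaving stored (uncompressed) entries alone, such as ODF&#8217;s <code>mimetype</code> file, is enough for the office suite to accept the result.</p>

```python
import zipfile

def recompress(src_path, dst_path, level=9):
    """Copy every entry of a ZIP-based document into a new archive,
    re-deflating compressed entries at the requested level."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():  # iterate in the original entry order
            data = src.read(info.filename)
            if info.compress_type == zipfile.ZIP_STORED:
                # Leave stored entries (e.g. ODF's "mimetype") uncompressed.
                dst.writestr(info.filename, data, compress_type=zipfile.ZIP_STORED)
            else:
                dst.writestr(
                    info.filename,
                    data,
                    compress_type=zipfile.ZIP_DEFLATED,
                    compresslevel=level,
                )
```

<p>A call such as <code>recompress("report.docx", "report-small.docx")</code> (file names hypothetical) writes a second copy to compare against the original; whether it actually comes out smaller depends on the level the office suite used in the first place.</p>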
<p>Also note Office&#8217;s older binary format (.doc, .xls, <i>&amp;c.</i>) is pretty much abysmal, offering no compression at all, but rather the size of plain ASCII text plus Office&#8217;s internal markup structures.  Makes you wonder why this was the dominant file format for so long&#8230;</p>
<p>I would have liked to test database formats as well, but ran into snags.  While Microsoft Access has no problem importing a CSV file, OpenOffice&#8217;s &#8220;Base&#8221; module, a Java-based database program that debuted with OpenOffice 2.0, lacks even a rudimentary data import.  On this one, I suppose, Microsoft wins by default.</p>
<p>Sometime in the near future, I want to plot the markup/compression curve for OOXML v. ODF. Stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2008/02/12/comparing-office-file-formats/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Common compression and corpuses</title>
		<link>http://heliologue.com/2008/01/19/common-compression-and-corpuses/</link>
		<comments>http://heliologue.com/2008/01/19/common-compression-and-corpuses/#comments</comments>
		<pubDate>Sun, 20 Jan 2008 04:43:21 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[technology]]></category>
		<category><![CDATA[Windows]]></category>

		<guid isPermaLink="false">http://heliologue.com/blog/2008/01/19/common-compression-and-corpuses/</guid>
		<description><![CDATA[Every so often, I dink around with benchmarking common lossless compressors. One of the best sites for it is, I think, Werner Bergmans&#8217; Maximum Compression, which is a rather comprehensive running benchmark of just about every lossless compressor under the sun. Really, there&#8217;s a lot. What you have to understand about the world of [...]]]></description>
			<content:encoded><![CDATA[<p>Every so often, I dink around with benchmarking common lossless compressors.  One of the best sites for it is, I think, Werner Bergmans&#8217; <a href="http://www.maximumcompression.com/">Maximum Compression</a>, which is a rather comprehensive running benchmark of just about every lossless compressor under the sun.  Really, there&#8217;s a <em>lot</em>.  What you have to understand about the world of compressors is that they are very often academic projects or toys that very smart people play with in their free time.  There are also companies (but not many) who invest in their own proprietary algorithms for lossless compression.</p>
<p>Here&#8217;s the catch, though:  the quality of a compressor isn&#8217;t measured by its final compression ratio alone.  The <a href="http://en.wikipedia.org/wiki/PAQ">PAQ</a> series of compressors, for instance, offers great compression and really, truly awful compression times.  The same goes for the highest compression levels of WinRK (a proprietary Win32 format with an accompanying GUI).  But disk is cheap:  nobody <em>really</em> cares about a fraction of a percentage of compression efficiency, do they?  What people really want is for their (inevitable) <a href="http://heliologue.com/free-software/archivers-and-compressors/">archiving GUI</a> to take less time doing what it does.</p>
<p>In this spirit, I have compiled not so much an exhaustive list of possible compression algorithms (I&#8217;ll leave that to Werner, who is very good at what he does), but rather a short list of the most common formats, tested on three different (relatively well-known) corpuses:  the <a href="http://en.wikipedia.org/wiki/Calgary_Corpus">Calgary Corpus</a>, the newer <a href="http://en.wikipedia.org/wiki/Canterbury_Corpus">Canterbury Corpus</a>, and Andrew Tridgell&#8217;s 1999 <a href="http://samba.org/ftp/tridge/large-corpus/">Large Corpus</a>.  The first two of these are corpuses used to test the very kind of academic project which I&#8217;ve avoided.  I dislike using them because they are small in size, which means that there is significantly less opportunity for variations in compression formats to manifest themselves.  In the interest of verifiability, however, I have used them.  I also included Andrew Tridgell&#8217;s large corpus because it&#8217;s been my experience that results on small test corpuses tend to vary too much due to disk I/O latency and other vagaries of compression algorithms.</p>
<p>What will follow is a data table for each corpus, followed by some brief observations about each.</p>
<p><span id="more-1954"></span></p>
<p>First, a note about the test environment:</p>
<ul>
<li>Windows Vista x64</li>
<li>Intel Q6600 Quad-Core</li>
<li>4GB Corsair PC2 6400</li>
<li>Western Digital Caviar WD1600YS SATAII, 160GB (system drive)</li>
<li><strong>Timer</strong>: Igor Pavlov&#8217;s <a href="http://www.7-zip.org/dl/utils/timer301.zip">timer.exe</a> (times reported are &#8220;Process&#8221; times).</li>
</ul>
<p>Next, a note about compressor versions:</p>
<table class="sortable rowstyle-even" summary="Compressors and versions">
<caption>
Compressor versions used<br />
</caption>
<thead>
<tr>
<th class="sortable-text">Compressor</th>
<th class="sortable-numeric">Version</th>
<th>Source</th>
</tr>
</thead>
<tbody>
<tr>
<td>
tar
</td>
<td>
1.13
</td>
<td>
<a href="http://gnuwin32.sf.net">GnuWin32</a>
</td>
</tr>
<tr>
<td>
gzip
</td>
<td>
1.3.12
</td>
<td>
<a href="http://gnuwin32.sf.net">GnuWin32</a>
</td>
</tr>
<tr>
<td>
bzip2
</td>
<td>
1.0.4
</td>
<td>
<a href="http://gnuwin32.sf.net">GnuWin32</a>
</td>
</tr>
<tr>
<td>
zip/unzip
</td>
<td>
2.32/5.52
</td>
<td>
<a href="http://gnuwin32.sf.net">GnuWin32</a>
</td>
</tr>
<tr>
<td>
7z (32-bit)
</td>
<td>
4.57
</td>
<td>
<a href="http://www.7-zip.org">7-Zip</a>
</td>
</tr>
<tr>
<td>
rar
</td>
<td>
3.71
</td>
<td>
<a href="http://rarlabs.com">RarLabs</a>
</td>
</tr>
</tbody>
</table>
<p>Now on to the benchmarks&#8230;</p>
<table class="sortable rowstyle-even" id="calgary-corpus">
<caption>
                Calgary Corpus<br />
            </caption>
<thead>
<tr>
<th class="sortable-text" scope="col">
                        Codec
                    </th>
<th class="sortable-text" scope="col">
                        Setting
                    </th>
<th class="sortable-numeric" scope="col">
                        Enc. Time (s)
                    </th>
<th class="sortable-numeric" scope="col">
                        Dec. Time (s)
                    </th>
<th class="sortable-numeric" scope="col">
                        Size (bytes)
                    </th>
<th class="sortable-numeric" scope="col">
                        Ratio
                    </th>
</tr>
</thead>
<tbody>
<tr>
<td>
                        tar
                    </td>
<td>
                    </td>
<td>
                        0.000
                    </td>
<td>
                        0.000
                    </td>
<td>
                        3,265,536
                    </td>
<td>
                        1.000
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                        fast
                    </td>
<td>
                        0.171
                    </td>
<td>
                        0.062
                    </td>
<td>
                        1,244,763
                    </td>
<td>
                        0.381
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                    </td>
<td>
                        0.312
                    </td>
<td>
                        0.093
                    </td>
<td>
                        1,070,276
                    </td>
<td>
                        0.328
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                        best
                    </td>
<td>
                        0.561
                    </td>
<td>
                        0.062
                    </td>
<td>
                        1,062,584
                    </td>
<td>
                        0.325
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                        fast
                    </td>
<td>
                        0.499
                    </td>
<td>
                        0.218
                    </td>
<td>
                        961,633
                    </td>
<td>
                        0.294
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                    </td>
<td>
                        0.514
                    </td>
<td>
                        0.202
                    </td>
<td>
                        891,321
                    </td>
<td>
                        0.273
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                        best
                    </td>
<td>
                        0.483
                    </td>
<td>
                        0.218
                    </td>
<td>
                        891,321
                    </td>
<td>
                        0.273
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                        -1
                    </td>
<td>
                        0.187
                    </td>
<td>
                        0.078
                    </td>
<td>
                        1,244,985
                    </td>
<td>
                        0.381
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                    </td>
<td>
                        0.358
                    </td>
<td>
                        0.078
                    </td>
<td>
                        1,070,495
                    </td>
<td>
                        0.328
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                        -9
                    </td>
<td>
                        0.516
                    </td>
<td>
                        0.046
                    </td>
<td>
                        1,062,803
                    </td>
<td>
                        0.325
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        1
                    </td>
<td>
                        0.436
                    </td>
<td>
                        0.171
                    </td>
<td>
                        962,460
                    </td>
<td>
                        0.295
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        6
                    </td>
<td>
                        1.996
                    </td>
<td>
                        0.140
                    </td>
<td>
                        856,273
                    </td>
<td>
                        0.262
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        9
                    </td>
<td>
                        2.152
                    </td>
<td>
                        0.140
                    </td>
<td>
                        853,686
                    </td>
<td>
                        0.261
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m1
                    </td>
<td>
                        0.265
                    </td>
<td>
                        0.140
                    </td>
<td>
                        1,167,991
                    </td>
<td>
                        0.358
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m3
                    </td>
<td>
                        1.950
                    </td>
<td>
                        0.140
                    </td>
<td>
                        935,499
                    </td>
<td>
                        0.286
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m5
                    </td>
<td>
                        1.762
                    </td>
<td>
                        0.891
                    </td>
<td>
                        788,671
                    </td>
<td>
                        0.242
                    </td>
</tr>
</tbody>
</table>
<p>The Calgary Corpus dates back to the late 80s.  It&#8217;s become <em>the</em> test to perform, but it may or may not adequately represent the standard compressor workload in 2008.  You&#8217;ll notice that WinRAR&#8217;s maximum setting produces the smallest archive, and does so more quickly than the neighboring 7-Zip runs.  Notice, too, that among the fastest times, there tends to be a sort of &#8220;bottoming-out&#8221; point at which the compressor is limited by the speed of the disk rather than by the CPU.</p>
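<p>For anyone wanting to reproduce the shape of these tables, the following Python sketch measures the same columns (encode time, decode time, size, and ratio as compressed size divided by original size) using the codecs in Python&#8217;s standard library.  It is only an approximation of the method: these are not the command-line binaries benchmarked here, the <code>lzma</code> module writes an .xz container rather than a .7z archive, everything happens in memory, and the name <code>benchmark</code> and the sample data are made up for illustration.</p>

```python
import bz2
import gzip
import lzma
import time

def benchmark(data):
    """Return (name, encode seconds, decode seconds, size, ratio) per codec."""
    codecs = [
        ("gzip", gzip.compress, gzip.decompress),
        ("bzip2", bz2.compress, bz2.decompress),
        ("xz/LZMA", lzma.compress, lzma.decompress),
    ]
    results = []
    for name, encode, decode in codecs:
        start = time.process_time()  # CPU time of this process
        packed = encode(data)
        enc_time = time.process_time() - start
        start = time.process_time()
        unpacked = decode(packed)
        dec_time = time.process_time() - start
        assert unpacked == data  # lossless round trip
        results.append((name, enc_time, dec_time, len(packed), len(packed) / len(data)))
    return results

# Highly repetitive sample input; a real corpus would be read from disk.
sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
for row in benchmark(sample):
    print("%-8s enc=%.3fs dec=%.3fs size=%d ratio=%.3f" % row)
```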
<table class="sortable rowstyle-even" id="canterbury-corpus">
<caption>
                Canterbury Corpus<br />
            </caption>
<thead>
<tr>
<th class="sortable-text" scope="col">
                        Codec
                    </th>
<th class="sortable-text" scope="col">
                        Setting
                    </th>
<th class="sortable-numeric" scope="col">
                        Enc. Time (s)
                    </th>
<th class="sortable-numeric" scope="col">
                        Dec. Time (s)
                    </th>
<th class="sortable-numeric" scope="col">
                        Size (bytes)
                    </th>
<th class="sortable-numeric" scope="col">
                        Ratio
                    </th>
</tr>
</thead>
<tbody>
<tr>
<td>
                        tar
                    </td>
<td>
                    </td>
<td>
                        0.000
                    </td>
<td>
                        0.000
                    </td>
<td>
                        2,821,120
                    </td>
<td>
                        1.000
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                        fast
                    </td>
<td>
                        0.140
                    </td>
<td>
                        0.062
                    </td>
<td>
                        872,570
                    </td>
<td>
			0.309
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                    </td>
<td>
                        0.249
                    </td>
<td>
                        0.062
                    </td>
<td>
                        739,066
                    </td>
<td>
			0.262
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                        best
                    </td>
<td>
                        1.138
                    </td>
<td>
                        0.062
                    </td>
<td>
                        736,223
                    </td>
<td>
			0.261
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                        fast
                    </td>
<td>
                        0.390
                    </td>
<td>
                        0.156
                    </td>
<td>
                        584,964
                    </td>
<td>
			0.207
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                    </td>
<td>
                        0.514
                    </td>
<td>
                        0.171
                    </td>
<td>
                        570,856
                    </td>
<td>
			0.202
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                        best
                    </td>
<td>
                        0.390
                    </td>
<td>
                        0.156
                    </td>
<td>
                        570,856
                    </td>
<td>
			0.202
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                        -1
                    </td>
<td>
                        0.140
                    </td>
<td>
                        0.078
                    </td>
<td>
                        872,795
                    </td>
<td>
			0.309
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                    </td>
<td>
                        0.343
                    </td>
<td>
                        0.062
                    </td>
<td>
                        739,286
                    </td>
<td>
			0.262
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                        -9
                    </td>
<td>
                        1.170
                    </td>
<td>
                        0.062
                    </td>
<td>
                        736,443
                    </td>
<td>
			0.261
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        1
                    </td>
<td>
                        0.280
                    </td>
<td>
                        0.930
                    </td>
<td>
                        569,953
                    </td>
<td>
			0.202
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        6
                    </td>
<td>
                        1.950
                    </td>
<td>
                        0.124
                    </td>
<td>
                        487,919
                    </td>
<td>
			0.173
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        9
                    </td>
<td>
                        2.199
                    </td>
<td>
                        0.124
                    </td>
<td>
                        485,391
                    </td>
<td>
			0.172
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m1
                    </td>
<td>
                        0.218
                    </td>
<td>
                        0.124
                    </td>
<td>
                        772,369
                    </td>
<td>
			0.274
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m3
                    </td>
<td>
                        1.232
                    </td>
<td>
                        0.093
                    </td>
<td>
                        515,831
                    </td>
<td>
			0.183
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m5
                    </td>
<td>
                        1.170
                    </td>
<td>
                        0.561
                    </td>
<td>
                        427,178
                    </td>
<td>
			0.151
                    </td>
</tr>
</tbody>
</table>
<p>I&#8217;m still not entirely able to figure out the Canterbury Corpus;  it&#8217;s ostensibly an &#8220;update&#8221; to the aging Calgary Corpus.  One would think that, having been created more than a decade after its predecessor with the express purpose of more accurately representing the compressor workload of 2001, it would at least be <em>larger</em> (hard disks and files <em>have</em> grown since 1989, believe it or not).  In fact it&#8217;s not, which was something of a disappointment to me, as I saw exactly the same trends as with the previous corpus.  Is that an accurate assessment of the algorithms in question?  Maybe not; read on.</p>
<table class="sortable rowstyle-even" id="tridge-large-corpus">
<caption>
                Large-Corpus<br />
            </caption>
<thead>
<tr>
<th class="sortable-text" scope="col">
                        Codec
                    </th>
<th class="sortable-text" scope="col">
                        Setting
                    </th>
<th class="sortable-numeric" scope="col">
                        Enc. Time (s)
                    </th>
<th class="sortable-numeric" scope="col">
                        Dec. Time (s)
                    </th>
<th class="sortable-numeric" scope="col">
                        Size (bytes)
                    </th>
<th class="sortable-numeric" scope="col">
                        Ratio
                    </th>
</tr>
</thead>
<tbody>
<tr>
<td>
                        tar
                    </td>
<td>
                    </td>
<td>
                        0.000
                    </td>
<td>
                        0.000
                    </td>
<td>
                        247,933,952
                    </td>
<td>
                        1.000
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                        fast
                    </td>
<td>
                        7.347
                    </td>
<td>
                        2.698
                    </td>
<td>
                        65,782,177
                    </td>
<td>
			0.265
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                    </td>
<td>
                        13.072
                    </td>
<td>
                        3.151
                    </td>
<td>
                        53,870,968
                    </td>
<td>
			0.217
                    </td>
</tr>
<tr>
<td>
                        gzip
                    </td>
<td>
                        best
                    </td>
<td>
                        21.855
                    </td>
<td>
                        2.449
                    </td>
<td>
                        53,536,722
                    </td>
<td>
			0.216
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                        fast
                    </td>
<td>
                        40.591
                    </td>
<td>
                        9.360
                    </td>
<td>
                        52,791,871
                    </td>
<td>
			0.213
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                    </td>
<td>
                        54.506
                    </td>
<td>
                        10.567
                    </td>
<td>
                        39,372,759
                    </td>
<td>
			0.159
                    </td>
</tr>
<tr>
<td>
                        bzip2
                    </td>
<td>
                        best
                    </td>
<td>
                        54.228
                    </td>
<td>
                        10.935
                    </td>
<td>
                        39,372,759
                    </td>
<td>
			0.159
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                        -1
                    </td>
<td>
                        6.349
                    </td>
<td>
                        2.208
                    </td>
<td>
                        65,782,411
                    </td>
<td>
			0.265
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                    </td>
<td>
                        12.682
                    </td>
<td>
                        2.527
                    </td>
<td>
                        53,871,197
                    </td>
<td>
			0.217
                    </td>
</tr>
<tr>
<td>
                        zip
                    </td>
<td>
                        -9
                    </td>
<td>
                        21.529
                    </td>
<td>
                        2.433
                    </td>
<td>
                        53,536,951
                    </td>
<td>
			0.216
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        1
                    </td>
<td>
                        19.578
                    </td>
<td>
                        6.608
                    </td>
<td>
                        47,343,400
                    </td>
<td>
			0.191
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        6
                    </td>
<td>
                        128.645
                    </td>
<td>
                        4.035
                    </td>
<td>
                        26,373,931
                    </td>
<td>
			0.106
                    </td>
</tr>
<tr>
<td>
                        7z
                    </td>
<td>
                        9
                    </td>
<td>
                        172.677
                    </td>
<td>
                        3.712
                    </td>
<td>
                        24,722,887
                    </td>
<td>
			0.100
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m1
                    </td>
<td>
                        9.016
                    </td>
<td>
                        4.446
                    </td>
<td>
                        48,939,730
                    </td>
<td>
			0.197
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m3
                    </td>
<td>
                        125.128
                    </td>
<td>
                        3.868
                    </td>
<td>
                        31,916,951
                    </td>
<td>
			0.129
                    </td>
</tr>
<tr>
<td>
                        rar
                    </td>
<td>
                        m5
                    </td>
<td>
                        138.435
                    </td>
<td>
                        23.852
                    </td>
<td>
                        29,200,310
                    </td>
<td>
			0.118
                    </td>
</tr>
</tbody>
</table>
<p>Most interestingly, in Tridgell&#8217;s &#8220;large-corpus&#8221; we finally see 7-Zip spring ahead of WinRAR in pure compression ratio (and, in some cases, in speed too).  I&#8217;m not an expert on compression, so I can&#8217;t tell you why certain efficiencies only manifest themselves over large datasets, but clearly 7-Zip wins in more modern cases where large datasets (mostly text, if Tridgell&#8217;s description is accurate) are present.</p>
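<p>The Ratio column is just the compressed size divided by the size of the uncompressed tarball.  A minimal sketch in Python (my choice of language for illustration; the sizes are copied from the large-corpus table, while the names <code>TAR_SIZE</code>, <code>SIZES</code>, and <code>ratio</code> are made up for this example) reproduces the 7-Zip and RAR rows:</p>

```python
# Sizes in bytes, copied from the large-corpus table.
TAR_SIZE = 247_933_952
SIZES = {
    ("7z", "1"): 47_343_400,
    ("7z", "6"): 26_373_931,
    ("7z", "9"): 24_722_887,
    ("rar", "m1"): 48_939_730,
    ("rar", "m3"): 31_916_951,
    ("rar", "m5"): 29_200_310,
}


def ratio(compressed_size, original_size=TAR_SIZE):
    """Compressed size as a fraction of the uncompressed tarball."""
    return compressed_size / original_size


# Rounded to three places, as in the table.
RATIOS = {key: round(ratio(size), 3) for key, size in SIZES.items()}

if __name__ == "__main__":
    for (codec, setting), value in RATIOS.items():
        print(f"{codec:4} {setting:3} {value:.3f}")
```

<p>Lower is better here:  a ratio of 0.100 means the archive is a tenth the size of the original tarball.</p>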
<p>Clearly, the LZMA algorithm (the heart of 7-Zip) is something to be proud of;  not only is it free software (7-Zip itself is released under the LGPL), but it often outperforms the popular WinRAR in both pure compression and efficiency.  I&#8217;m a little surprised that the 7-Zip *nix port, p7zip, hasn&#8217;t gained more traction in Linux, but I suppose that old ways die hard.  The cheapness of disk and bandwidth nowadays points to more transparent compression as the ideal, rather than whichever archiving format posts the best purely numeric results.</p>
<p>For those of you looking for a decent free archiving program, check <a href="http://7-zip.org">7-Zip</a> out;  for those of you who lust after data tables of compression benchmarks, give <a href="http://maximumcompression.com">Werner&#8217;s</a> a look:  it&#8217;ll satiate your desire for tabular results in ways you never thought possible.</p>
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2008/01/19/common-compression-and-corpuses/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linux command-line compressor benchmarks</title>
		<link>http://heliologue.com/2007/02/01/linux-command-line-compressor-benchmarks/</link>
		<comments>http://heliologue.com/2007/02/01/linux-command-line-compressor-benchmarks/#comments</comments>
		<pubDate>Thu, 01 Feb 2007 18:08:05 +0000</pubDate>
		<dc:creator>Ben</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[compression]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://heliologue.com/blog/2007/02/01/linux-command-line-compressor-benchmarks/</guid>
		<description><![CDATA[Introduction In my lust for benchmarking compressors and things of that sort, I decided to attack some typical Linux CLI compressors for a very general comparison of their relative efficiency. If you want to skip straight to the results, go ahead and view the Linux CLI compression benchmarks in PDF format. First, however, I need [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>In my lust for benchmarking compressors and things of that sort, I decided to attack some typical Linux CLI compressors for a very general comparison of their relative efficiency.</p>
<p>If you want to skip straight to the results, go ahead and view the <a href='http://heliologue.com/pdf/linux_cli_compression.pdf' title='Linux CLI compression benchmarks'>Linux CLI compression benchmarks in PDF format</a>.</p>
<p>First, however, I need to explain my corpus, compressors, and methodology. <span id="more-1601"></span></p>
<h3>Test Information</h3>
<dl>
<dt>Corpus</dt>
<dd><a href="http://sourceforge.net/project/showfiles.php?group_id=18681&#038;package_id=13797&#038;release_id=383554"><strong>OSDB</strong></a> is a database from a benchmarking project on SourceForge used to test open-source database performance.  The version I downloaded was v0.21</dd>
<dd><a href="http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.19.tar.bz2"><strong>linux-2.6.19</strong></a> is, I hardly need say, the source code for version 2.6.19 of the Linux kernel</dd>
<dd><strong>final_fantasy_tactics</strong> is a binary dump of the Playstation disc for <cite>Final Fantasy Tactics</cite>, a good mix of video, audio, and executable code.</dd>
<dd><strong>Webster&#8217;s 1913 dictionary</strong> is also just what it sounds like.  I pulled it from <a href="http://gutenberg.org">Project Gutenberg</a> as one file, but it appears in the meantime to have disappeared, or at least been split up by letter.</dd>
<dt>Compressors (official packages from Ubuntu 6.10 repositories)</dt>
<dd><a href="http://www.gnu.org/software/tar/"><strong>tar</strong></a> (1.15.91) is not a compressor, but is rather an archiver, used to gather multiple files together into a file which can then be compressed.  I made tarballs out of everything for consistency&#8217;s sake</dd>
<dd><a href="http://www.gzip.org/"><strong>gzip</strong></a> (1.3.5) is a very lightweight compressor used in large part for transparent compression across networks (the page you&#8217;re reading at this very moment was probably gzipped before it was sent to your browser, which then unpacked it)</dd>
<dd><a href="http://www.bzip.org/"><strong>bzip2</strong></a> (1.0.3) is a powerful block-sorting compressor—the bigger, stronger, slower brother of gzip.</dd>
<dd><a href="http://www.info-zip.org/"><strong>zip</strong></a> (2.32) is probably the most common compression format in Windows.  There are a number of different implementations, but this is the standard version that comes in any GNU toolchain.  It uses the common DEFLATE method.</dd>
<dd><a href="http://rzip.samba.org/"><strong>rzip</strong></a> (2.1) is a modified version of bzip2, the advantage of which lies in its 900MB history buffer, meaning that it can take advantage of data redundancies over much longer distances than its brethren.  Its principal advantage is apparent for large files.</dd>
<dd><a href="http://7-zip.org"><strong>7za</strong></a> (4.42) is a Unix port of the esteemed 7-Zip archiver for Windows.  It has several compression modes for its 7z format, of which two were tested:  PPMd and LZMA, the latter being the principal one (the third is bzip2, which is already being tested).</dd>
<dd><a href="http://rarlabs.com"><strong>rar</strong></a> (3.60) is the non-free command-line version of the famous WinRAR archiver.</dd>
<dd><a href="http://testcase.newmail.ru/"><strong>arj</strong></a> (3.10.22) was a very popular archiver in the early-to-mid 90s, alongside pkzip.  It isn&#8217;t of much use now, but exists as Free Software in any decent Linux repo, and so was also tested.</dd>
<dd><a href="http://www2m.biglobe.ne.jp/~dolphin/lha/lha.htm"><strong>lha</strong></a> (1.14i) is an open-source compressor of the LZH format, an early format (1988) that has also fallen out of vogue.  It has no switches or useful options.</dd>
<dd><a href="http://www.lzop.org/"><strong>lzop</strong></a> (1.01) is a free implementation of the super-fast LZO algorithm, providing very basic compression very quickly.  It has been surpassed in efficiency by <a href="http://www.quicklz.com/">QuickLZ</a>, but the latter didn&#8217;t have a Unix port by the time of this test</dd>
<dd><strong>zoo</strong> (2.10) is a compression format from the 1980s that was mostly popular with OpenVAX servers.  It is not seriously used anymore except perhaps on legacy machines.</dd>
<dt>Methodology</dt>
<dd>Any relevant switches have been documented in the data table.  All compressions were run from the bash shell.  I created a script for each compressor that compressed and then decompressed the file for each setting in turn.  Each run was invoked under GNU <code>time</code>, which wrote the time used to a log file.  After each compression, an <code>ls -lrt</code> command was invoked to get the precise size of the compressed archive, and this was also written to the log file.  This was done on an Ubuntu 6.10 system, in the GNOME environment.  Any other programs that were running stayed running for the duration of the test, so consistency should not be an issue.</dd>
</dl>
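<p>For readers who want to try something similar, here is a rough sketch of the same compress, decompress, and measure loop.  It is <em>not</em> the bash script I used:  it is written in Python, and the standard-library <code>gzip</code>, <code>bz2</code>, and <code>lzma</code> modules stand in for the command-line tools, so the labels in <code>CODECS</code> (a name invented for this example) are only approximate equivalents of the real switches.</p>

```python
import bz2
import gzip
import lzma
import time

# Hypothetical stand-ins for the command-line tools: each label maps to a
# (compress, decompress) pair from the Python standard library.
CODECS = {
    "gzip -1": (lambda d: gzip.compress(d, compresslevel=1), gzip.decompress),
    "gzip -9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
    "bzip2 -9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
    "xz -6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
}


def benchmark(data):
    """Return one row per codec: label, encode s, decode s, size, ratio."""
    rows = []
    for label, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        enc_seconds = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = decompress(packed)
        dec_seconds = time.perf_counter() - start

        # Make sure the timing is for a correct round trip.
        if unpacked != data:
            raise ValueError(label + " did not round-trip")
        rows.append((label, enc_seconds, dec_seconds,
                     len(packed), len(packed) / len(data)))
    return rows


if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
    for label, enc, dec, size, ratio in benchmark(sample):
        print(f"{label:9} {enc:8.3f} {dec:8.3f} {size:10,d} {ratio:.3f}")
```

<p>Because everything happens in memory, this measures only the codecs themselves, not disk reads or writes, so the timings are not comparable to those in my tables.</p>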
<h3>Data</h3>
<p>The table itself is too big to display within the confines of my blog, so you&#8217;ll need to view it separately. <a href='http://heliologue.com/pdf/linux_cli_compression.pdf' title='Linux CLI compression benchmarks, PDF format'>Linux CLI compression benchmarks, PDF format</a></p>
<h3>Analysis</h3>
<p>What conclusions can be drawn from these benchmarks?  Linux&#8217;s relative paucity of readily-available compressors isn&#8217;t really a hindrance, since few of the formats here are actually useful.  <code>gzip</code> and <code>bzip2</code>, obviously, are the foundation upon which Linux compression is built.  <code>zip</code> offers no advantages over gzip except that it is both an archiver <em>and</em> a compressor (a point made largely moot by <code>tar</code> and pipes, <i>e.g.</i> <code>tar -czf foo.tar.gz bar/</code>), but it is good to have for compatibility purposes.  <code>rzip</code>, while interesting, is useful only for large files:  if you&#8217;re attempting to squeeze every last ounce of extra space from a file, it would probably behoove you to use a different compressor, anyway.</p>
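<p>The reason a long history buffer only pays off on large files can be shown with a toy example.  This sketch does not use <code>rzip</code> itself (it isn&#8217;t in the Python standard library);  instead it contrasts DEFLATE, which can only look back 32 KiB, with LZMA at preset 6, whose dictionary is several MiB.  The input is deliberately artificial:  one MiB of random bytes followed by an exact copy of it.</p>

```python
import lzma
import os
import zlib

# Build 2 MiB of input: 1 MiB of random (incompressible) bytes, repeated once.
# The only redundancy is the second copy, which sits 1 MiB after the first.
block = os.urandom(1024 * 1024)
data = block + block

# DEFLATE (used by gzip and zip) can only refer back 32 KiB, so it cannot
# reach the first copy while encoding the second.
deflate_size = len(zlib.compress(data, 9))

# LZMA at preset 6 uses a dictionary of several MiB, so the second copy
# can be encoded as cheap back-references to the first.
lzma_size = len(lzma.compress(data, preset=6))

print("original:", len(data))
print("deflate :", deflate_size)
print("lzma    :", lzma_size)
```

<p>DEFLATE should come out at roughly the original size, while LZMA should land near half of it.  On a file smaller than 32 KiB the two would have the same reach, which is the sense in which a big buffer needs big files.</p>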
<p>The heavy-duty archivers, 7-Zip and RAR, both offer quite a bit of power at the expense of being either difficult to invoke (7z creation is still broken in <code>file-roller</code> on Ubuntu systems) or non-free (<code>rar</code> is technically shareware, and not readily available in all distributions).  It is interesting to note that 7-Zip still spanks WinRAR in general performance.  With regard to 7-Zip, I&#8217;ve concluded that PPMd is not generally a worthwhile algorithm to use:  in some cases, it offers a modest increase in efficiency over LZMA, but its decompression times are horrible, generally equal to or greater than its compression times.  LZMA, meanwhile, offers modest compression times (relative to its ratio) and lightning-fast decompression.  One odd behavior is that compression efficiency seems to work on plateaus:  with LZMA, for instance, there are only 3 different compression ratios for 9 switches (1-4, 5-6, and 7-9).</p>
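<p>If you want to look for this kind of plateau yourself, a small sketch along these lines will do it.  A caveat:  it uses the presets 0 through 9 of Python&#8217;s <code>lzma</code> module (the xz presets), which are not the same thing as the <code>7za</code> switches I tested, so there is no reason to expect the same grouping.  The helper names <code>sizes_by_preset</code> and <code>plateaus</code> are invented for the example.</p>

```python
import lzma


def sizes_by_preset(data):
    """Compressed size in bytes for each xz preset level, 0 through 9."""
    return {preset: len(lzma.compress(data, preset=preset))
            for preset in range(10)}


def plateaus(sizes):
    """Group consecutive presets that produced exactly the same size."""
    groups = []
    for preset in sorted(sizes):
        last_group = groups[-1] if groups else None
        if last_group and sizes[last_group[-1]] == sizes[preset]:
            last_group.append(preset)
        else:
            groups.append([preset])
    return groups


if __name__ == "__main__":
    # Arbitrary sample input; substitute the bytes of a real file.
    sample = b"".join(str(n * n).encode() + b"\n" for n in range(50000))
    result = sizes_by_preset(sample)
    for group in plateaus(result):
        print(group, result[group[0]])
```

<p>Each printed line is one plateau:  the presets that gave an identical archive size, followed by that size in bytes.</p>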
<p><code>lzop</code>, or even LZO compression in general, doesn&#8217;t offer much in the way of compression improvement over <code>gzip</code> on default settings, though it does offer a slight improvement in decompression speed.</p>
<p>Other, more esoteric formats like <code>arj</code>, <code>zoo</code>, and <code>lha</code> are really no more than historical curiosities.  The only compressor among these that approaches the capabilities of <code>gzip</code> or <code>bzip2</code> is <code>lha</code>, at which point, why even bother with <code>lha</code>?</p>
<h3>Caveats</h3>
<p>If you look at the data closely, you&#8217;ll notice some strange behavior, especially regarding the first test in a series.  At first, I believed that the least-intensive compressions tended to be limited by disk I/O, rather than CPU cycles.  Eventually, however, I began to wonder if it had more to do with reading the file into memory than it did with file writes.  Running in a script, the original file might stay cached after it was read in the first time, speeding up additional compressions by not having to read the file from the disk again.  Take from it what you will:  these are very much unofficial benchmarks.</p>
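<p>One way to keep that first-run effect out of the numbers, which I did <em>not</em> do for these benchmarks, is to run each job once untimed before measuring it.  A minimal sketch in Python follows;  <code>timed_runs</code> and <code>compress_once</code> are names invented for the example, and the temporary file merely stands in for a corpus tarball.</p>

```python
import gzip
import os
import tempfile
import time


def timed_runs(action, repeats=3, warmups=1):
    """Call action() warmups times untimed, then return repeats timings (s)."""
    for _ in range(warmups):
        action()  # untimed: lets the OS pull the input into its file cache
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical input file standing in for a corpus tarball.
        path = os.path.join(workdir, "corpus.tar")
        with open(path, "wb") as handle:
            handle.write(b"benchmark filler line\n" * 500000)

        def compress_once():
            with open(path, "rb") as handle:
                gzip.compress(handle.read(), compresslevel=1)

        print(["%.3f" % seconds for seconds in timed_runs(compress_once)])
```

<p>Note that a file written moments ago is normally still cached, so this demo only shows the shape of the harness;  the warm-up matters when the input has not been touched recently.</p>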
]]></content:encoded>
			<wfw:commentRss>http://heliologue.com/2007/02/01/linux-command-line-compressor-benchmarks/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

