A while ago, as OpenOffice.org 2.0 approached completion, I compared the file sizes of Microsoft Office’s binary format against OpenOffice’s new OpenDocument format. Recall that OpenDocument is an XML-based storage format that is ultimately compressed into a zip file, creating smaller file sizes. Microsoft’s new Office Open XML is essentially the same thing, but with a totally different XML schema.

I decided to revisit this kind of test, this time with four test files:

  1. The text of Ulysses, in HTML format. I chose HTML format to test the extra markup, as it should theoretically create a more complex document.
  2. A very large generated Lorem Ipsum block (205’000+ characters), which is pseudo-random, but with a lot of redundancy.
  3. A one-page block of Lorem Ipsum text, in order to test the handling of small files.
  4. A randomly generated CSV with multiple kinds of text and 5’000 records, converted for use in OpenOffice Calc and Microsoft Excel.

Read on for the data table and observations.

Comparison of office file format sizes (in bytes)

| Format              | Ulysses   | Lorem Ipsum (long) | Lorem Ipsum (short) | Random CSV |
|---------------------|-----------|--------------------|---------------------|------------|
| ASCII/HTML          | 1,646,749 | 1,586,432          | 4,069               | 23,784     |
| MS Word 97/2000/XP  | 1,753,088 | 1,710,592          | 36,864              | 58,880     |
| Office Open XML     | 876,755   | 599,277            | 13,344              | 26,906     |
| OpenOffice 1.x      | 692,315   | 411,819            | 27,528              | 20,671     |
| OpenDocument        | 693,672   | 412,629            | 27,675              | 20,826     |
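
To make the rows easier to compare, each size can be expressed as a fraction of the plain ASCII/HTML baseline. The sketch below does that arithmetic in Python. The byte counts are copied from the table; the names `SIZES`, `CORPORA`, and `ratios_vs_plain` are made up for illustration.

```python
# Sizes in bytes, copied from the table above.
CORPORA = ("Ulysses", "Lorem Ipsum (long)", "Lorem Ipsum (short)", "Random CSV")
SIZES = {
    "ASCII/HTML":         (1_646_749, 1_586_432,  4_069, 23_784),
    "MS Word 97/2000/XP": (1_753_088, 1_710_592, 36_864, 58_880),
    "Office Open XML":    (  876_755,   599_277, 13_344, 26_906),
    "OpenOffice 1.x":     (  692_315,   411_819, 27_528, 20_671),
    "OpenDocument":       (  693_672,   412_629, 27_675, 20_826),
}

def ratios_vs_plain(sizes=SIZES, baseline="ASCII/HTML"):
    """Return each format's size divided by the baseline size, per corpus."""
    base = sizes[baseline]
    return {
        fmt: tuple(round(value / b, 2) for value, b in zip(values, base))
        for fmt, values in sizes.items()
        if fmt != baseline
    }

# Print one row per format; a value below 1.00 means smaller than plain text.
print(f"{'':20s}", "  ".join(CORPORA))
for fmt, row in ratios_vs_plain().items():
    print(f"{fmt:20s}", "  ".join(f"{r:5.2f}" for r in row))
```

Anything above 1.00 means the office file is larger than the plain text it was made from.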

Remember that both XML formats (OpenDocument and OOXML) are basically XML representations of the data, markup mappings, and any associated files, all wrapped up in a ZIP file, which means everything is compressed by the Deflate algorithm.
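
Because both formats are ordinary ZIP archives, you can look inside one with any ZIP library. Here is a minimal Python sketch using the standard `zipfile` module that lists each member, how it was stored, and its size before and after compression. The helper name `describe_container` is hypothetical, and the demo archive is a tiny stand-in built in memory, not a real document saved by either office suite.

```python
import io
import zipfile

METHODS = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflate"}

def describe_container(source):
    """List (member name, method, raw bytes, stored bytes) for a ZIP-based file.

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        return [
            (info.filename,
             METHODS.get(info.compress_type, str(info.compress_type)),
             info.file_size,
             info.compress_size)
            for info in zf.infolist()
        ]

# Stand-in archive built in memory so the example runs without a real document.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                compress_type=zipfile.ZIP_STORED)
    zf.writestr("content.xml", "<p>Lorem ipsum dolor sit amet</p>" * 500,
                compress_type=zipfile.ZIP_DEFLATED)
demo.seek(0)

for name, method, raw, packed in describe_container(demo):
    print(f"{name:12s} {method:8s} {raw:6d} -> {packed:6d} bytes")
```

Pointing `describe_container` at a real .odt or .docx path should show where the bytes actually go: XML parts, embedded files, and so on.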

I know for a fact that OOXML generates more markup (that is, larger XML files) than OpenDocument. However, Microsoft appears to use a higher level of compression, since it’s clear from the “Lorem Ipsum (short)” corpus that OOXML excels[1] at smaller files: 13,344 bytes against OpenDocument’s 27,675. On the long Lorem Ipsum test, however, the OpenDocument format is significantly smaller (412,629 bytes against 599,277), leading me to believe that there is a point at which the complexity of OOXML’s markup finally outweighs Office’s higher compression.

Note that compression level is not part of the spec for either format (that I know of): the compression strength is set by the program. If you want, you can unzip the resulting file and rezip it with a stronger setting, and the programs will open it just fine (but you’ll lose the extra compression as soon as they save it again).
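
The unzip-and-rezip trick can be scripted. What follows is a rough sketch, assuming Python 3.7 or later (for the `compresslevel` argument) and the standard `zipfile` module; `recompress` is a made-up helper name, and the demo input is generated filler text rather than a real document. It copies members by name, so original timestamps are not preserved.

```python
import io
import random
import zipfile

def recompress(src, dst, level=9):
    """Copy every member of the ZIP `src` into `dst`, deflating at `level`."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            # Members that were stored uncompressed stay uncompressed.
            if info.compress_type == zipfile.ZIP_STORED:
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            zout.writestr(info.filename, zin.read(info.filename),
                          compress_type=method, compresslevel=level)

# Demo input: filler text zipped at the weakest Deflate level.
rng = random.Random(0)
words = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()
text = " ".join(rng.choice(words) for _ in range(20000))

weak = io.BytesIO()
with zipfile.ZipFile(weak, "w", zipfile.ZIP_DEFLATED, compresslevel=1) as zf:
    zf.writestr("content.xml", text)
weak.seek(0)

# Rewrite the same archive at the strongest level and compare sizes.
strong = io.BytesIO()
recompress(weak, strong)
print(len(weak.getvalue()), "->", len(strong.getvalue()), "bytes")
```

With real files you would pass paths instead, for example `recompress("report.odt", "report-small.odt")` (file names hypothetical).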

Also note that Office’s older binary format (.doc, .xls, &c.) is pretty much abysmal, offering no compression at all: the file is roughly the size of the plain ASCII text plus Office’s internal markup structures. Makes you wonder why this was the dominant file format for so long…

I would have liked to test database formats as well, but ran into snags. While Microsoft Access has no problem importing a CSV file, OpenOffice’s “Base” module, a Java-based database program that debuted with OpenOffice 2.0, lacks even a rudimentary data import. On this one, I suppose, Microsoft wins by default.

Sometime in the near future, I want to plot the markup/compression curve for OOXML v. ODF. Stay tuned.

  1. No pun intended
§1977 · February 12, 2008

3 Comments to “Comparing office file formats”

  1. Andrew Z says:

    How to get data into Base:
    http://searchenterpriselinux.techtarget.com/tip/0,289483,sid39_gci1222186,00.html

    Just for discussion, what’s the "big deal" (no pun intended) about large file sizes? Bandwidth and storage capacity are increasing quickly.

  2. Conor says:

    While I agree with Andrew that it’s kind of silly to be fussing over kilobytes, I do think there’s a lot of merit to the stance that one should always choose the best tool for the job. If one compression algorithm is better than the other, you should use it—after having also taken into account possible restrictions on its use, of course.

    It’s like saying you could bump the RAM in your system up by one percent. Would you? Why the hell not? And comparing OOXML (how I hate that sneaky name) and ODF, the change is more like 25%. Ultimately we’re still talking kilobytes here, but it’s the principle of the thing.

    And Ben, shame on you for not including your units in that table!

  3. Ben says:

    Hunh. I purposely didn’t include it in the table itself, but I could’ve sworn I mentioned it in the preceding text. Well, I guess I’ll make the change.

    To address Andrew Z’s question: naturally, the idea of squabbling over kilobytes is silly. But I’m not marketing ODF as a way to squeeze a bit of extra space out of your documents—though for large corpuses, the difference will be substantial.

    What I’m really trying to highlight is two things: one, the relative overhead of each format, and two, the size (and complexity) of the generated markup. Clearly, OOXML has a lot of cruft.
