rev. 12 December 2006.
The aspect of an operating system most often overlooked is its filesystem, the method by which data is stored to more permanent media, most often a magnetic hard drive. On any relatively modern Windows system, the only choice is NTFS, a complex, proprietary filesystem that provides excellent performance. Since variants of Unix (and Linux, which continues to gain market share, especially in the server market) are very popular, it is increasingly likely that IT administrators will have to choose a *nix filesystem for use on mission-critical servers. There are many Free (libre)[1] high-performance filesystems for Unix systems which, compared with the standard ext2/3 filesystem (Second or Third Extended Filesystem), provide faster access to data at the price of a higher risk of data loss. For configurations which can eliminate the points of failure that lead to this risk, high-performance filesystems like XFS, JFS, or ReiserFS offer benchmarks comparable to or better than those of Microsoft’s NTFS, without the vendor lock-in.
XFS was originally developed by Silicon Graphics Inc. for its IRIX variant of Unix, and it is the oldest journaling filesystem available for Unix systems (Robbins, “Part 9” 1). In 2000, it was made available under the GPL for Linux. The first 2.4-series Linux kernel with native support was 2.4.25 (Tosatti 1), though the 2.4 series limits parallel buffered writes to around 300 MiB/s. The 2.6 kernel series allows much greater speeds, in excess of 1 GiB/s, according to Dave Chinner and Jeremy Higdon of SGI (1).
XFS is a 64-bit journaling filesystem with support for volumes (and files) of up to 9 exabytes. It supports extents, a feature shared by other high-performance filesystems that the “basic” ext2/3 filesystem for Linux will not have until the inclusion of ext4, which only recently began development. Like Microsoft’s NTFS, XFS is based on B+ trees, which significantly improve search times and space allocation (“Open Source XFS for Linux” 2).
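To make the idea of extents concrete, here is a minimal sketch. It assumes an extent is simply a (starting block, length) pair describing a contiguous run of blocks; the function name and block numbers are illustrative and are not taken from XFS itself.

```python
from typing import List, Tuple


def blocks_to_extents(blocks: List[int]) -> List[Tuple[int, int]]:
    """Collapse a sorted list of block numbers into (start, length) extents."""
    extents: List[Tuple[int, int]] = []
    for block in blocks:
        if extents and block == extents[-1][0] + extents[-1][1]:
            # The block continues the current run, so grow the last extent.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # A gap starts a new extent.
            extents.append((block, 1))
    return extents


# A file stored in two contiguous runs needs two extents, not six block pointers.
file_blocks = [100, 101, 102, 103, 200, 201]
print(blocks_to_extents(file_blocks))  # [(100, 4), (200, 2)]
```

The saving grows with file size: a large file laid out contiguously can be described by a handful of extents, where a block-mapped filesystem would need one pointer per block.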
XFS displays good performance, especially for large files, but its continuing drawback seems to be that it can cause data loss in the event of a power outage or other catastrophic system failure. A major contributor to both XFS’s performance numbers and its inherent danger is the way it aggressively caches delayed writes in RAM. Under ideal circumstances, this leads to more intelligent allocations and faster writes to disk; in the case of an event which produces many small temporary files, XFS may never need to write those blocks at all. In the event of a system failure, however, there is the potential to lose a lot of data (Robbins, “Part 9” 4). This author’s personal contact with three developers of the renowned Debian distribution of Linux reveals that they all experimented with XFS and lost or corrupted data due to either bugs or outages; as a result, they have reverted to ext3, the preferred filesystem of many Debian and kernel developers. Importantly, these were workstations without redundant storage or power, but this underscores the need for proper hardware if XFS is to be used.
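The trade-off described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class and its methods are hypothetical, "disk" is a plain dictionary, and nothing here reflects how XFS is actually implemented.

```python
class DelayedWriteCache:
    """Toy model: writes sit in RAM until flush() commits them to 'disk'."""

    def __init__(self):
        self.pending = {}      # name -> data held only in RAM
        self.disk = {}         # name -> data that reached stable storage
        self.disk_writes = 0   # how many files were actually written out

    def write(self, name, data):
        self.pending[name] = data

    def delete(self, name):
        # A file deleted before it is flushed never costs a disk write.
        self.pending.pop(name, None)
        self.disk.pop(name, None)

    def flush(self):
        for name, data in self.pending.items():
            self.disk[name] = data
            self.disk_writes += 1
        self.pending.clear()

    def crash(self):
        # Power loss: whatever was still only in RAM is gone.
        lost = list(self.pending)
        self.pending.clear()
        return lost


cache = DelayedWriteCache()
cache.write("tmp.1", b"scratch")        # short-lived temporary file
cache.delete("tmp.1")                   # removed before any flush
cache.write("report.txt", b"important")
cache.flush()                           # only report.txt reaches the disk
cache.write("notes.txt", b"unsaved")
lost = cache.crash()                    # notes.txt was never flushed
print(cache.disk_writes, sorted(cache.disk), lost)
```

The temporary file costs no disk write at all, which is the performance benefit; the unflushed file is lost in the simulated crash, which is the risk.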
Similar to XFS is JFS, though it is technically known as JFS2. The origins of JFS stretch back to IBM’s proprietary Unix, AIX. JFS as it is known to the Linux world is the second generation of JFS, which first shipped with OS/2 Warp in 1999. At that point, a snapshot of the source was “opened” to the Linux community for porting (Gordon and Haddad 2). JFS has been a part of the 2.6 kernel since its development, but enjoys only limited official support from vendors like Novell/SuSe and Red Hat; in addition, it is only available for the 2.4 series in the form of patches against the kernel source (3). It has, however, enjoyed successful use in carrier-grade Linux platforms in the telecom industry (7).
One of JFS’s most appealing features, despite its average raw performance, is its recovery time, the very thing that makes it enticing for carrier-grade platforms. It journals only metadata (some journaling filesystems, such as ext3 in its data=journal mode, can also write file data to the journal), so it boasts a sub-second recovery time. It is a 64-bit filesystem that supports files and partitions in the petabyte range, and it too uses B+ trees for its indices (Best 2). In addition, it is considered one of the computationally cheapest filesystems available, making it an excellent choice for servers with only moderate computing power.
One other filesystem that has enjoyed a long and storied existence is ReiserFS, a very fast solution optimized for many small files. “ReiserFS” refers to the third version of the filesystem, which is no longer being actively developed. Reiser4 is a somewhat controversial new filesystem, still in development, that boasts very fast speeds and a database-like nature, hearkening back to BeFS’s relational database model of storage. Because Reiser4 is stable but still experimental, and has so far been denied inclusion in the Linux kernel proper (and may or may not ever find its way into the 2.4 series still being used in data centers across the globe), it isn’t as yet a serious contender for the world’s hard drives. The deprecated ReiserFS, however, has enjoyed hundreds of thousands of adherents. It remains the default filesystem of many Linux distributions, and it was the default for Novell’s SuSe Enterprise Linux until October of 2006, when the project officially selected ext3 as its new default (Shankland 1).
ReiserFS debuted in 2001 with version 2.4.1 of the Linux kernel. At the time, it was the most advanced journaling filesystem available, offering metadata-only journaling before the port of JFS was finished. Its age is apparent now, however, in that it supports only 8TB files and 16TB volumes, a fraction of the scalability offered by some of its more modern competitors. It is built around a modified B+ tree (sometimes called an S+ tree or a B* tree), and was unique at its introduction in its approach to allocation: it allocates blocks flexibly instead of in fixed amounts (Robbins, “Part 1” 3-4).
Like XFS, ReiserFS does logical journaling instead of physical-block journaling, which means that data integrity is second to performance: in the event of a failure, cached data will be lost. What’s more, ReiserFS has gained some notoriety for corrupting data even further when its tree rebuild is run after a crash. Theodore Ts’o, who developed the ext2 and ext3 filesystems, warns that all filesystems that do logical journaling are high-risk without high-end hardware and redundant power (2). Furthermore, ReiserFS performs well in speed, but benchmarks show that it tends to consume more CPU cycles than its competitors. Jeff Mahoney, a developer who maintains ReiserFS code for SuSe, says simply, “ReiserFS v3 is a dead end.”
If data integrity were the only concern for the IT professional, the obvious choice would be ext3, as it provides reasonable performance and data security by default: even its most secure mode doesn’t seem to impact its performance. However, certain situations require filesystems with more robust performance; ext3’s numbers simply won’t do. Luckily, there are a number of excellent published benchmarks that clearly illustrate the relative performance benefits of these filesystems. As Ts’o cautions, however, filesystems which perform aggressive caching and logical journaling can be misleading in benchmarks because of the artificial nature of the test. “[Logical journaling] takes much less space, so you can pack many more journal updates into a single disk block. For filesystem benchmarks that try to saturate the filesystem’s write bandwidth, and/or that also have very high levels of metadata updates, XFS and ReiserFS will tend to do much better than does ext3. Fortunately, many real-world workloads don’t have this characteristic, which is why ext3 tends to perform just fine in practice in many applications, despite what would appear to be much worse benchmarks numbers, at least for some benchmarks” (2).
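Ts’o’s point about packing journal updates into a disk block can be shown with rough arithmetic. The figures below are assumptions chosen only for illustration: a 4096-byte block, and a logical journal record of 128 bytes per update. The model also ignores the fact that several updates to the same block can share one logged copy under physical journaling.

```python
BLOCK_SIZE = 4096           # bytes per disk block (assumed)
LOGICAL_RECORD_SIZE = 128   # bytes per logical journal record (assumed, illustrative)


def journal_blocks_needed(updates: int, bytes_per_update: int,
                          block_size: int = BLOCK_SIZE) -> int:
    """Journal blocks consumed when each update logs bytes_per_update bytes."""
    per_block = block_size // bytes_per_update
    return -(-updates // per_block)  # ceiling division


updates = 1000
# Physical journaling: a whole modified block is copied into the journal.
physical = journal_blocks_needed(updates, BLOCK_SIZE)
# Logical journaling: only a compact description of the change is logged.
logical = journal_blocks_needed(updates, LOGICAL_RECORD_SIZE)
print(physical, logical)  # 1000 32
```

Under these assumptions the logical journal writes roughly a thirtieth as many blocks for the same number of metadata updates, which is why a benchmark that saturates write bandwidth favors it.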
Testing done by Chinner and Higdon for SGI shows XFS with by far the highest throughput and an efficiency that rivals that of ext3. ReiserFS has the lowest throughput and the lowest efficiency. JFS and the extended filesystems fall somewhere in the middle for throughput and rank well for efficiency (6).
A comprehensive benchmark done in April 2006 by Justin Piszcz using Debian’s testing branch is the clearest look at the performance of these filesystems running under modern software[2]. There are two important factors in measuring pure performance: speed and CPU utilization (efficiency).
In terms of the speed of file operations, Piszcz saw ext3 trailing behind and Reiser4, the experimental new filesystem intended to replace ReiserFS, also delivering an underwhelming performance. Of XFS and JFS, the two journaling filesystems with enterprise pedigrees, the latter has slightly more impressive numbers. It is unknown how the benefit of a RAID array would affect these results. For CPU utilization, JFS is the clear winner, consistently using the fewest resources; XFS was for the most part comparable to ext3 (5-26).
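The two measures discussed above, elapsed time and CPU utilization, can be collected for any operation with a few lines of code. The helper below is a hypothetical sketch and not Piszcz’s actual test harness; it assumes CPU utilization means CPU time (user plus system, including child processes) divided by wall-clock time.

```python
import os
import time


def measure(task):
    """Run task() and return (elapsed_seconds, cpu_utilization_percent)."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    task()
    wall_after = time.perf_counter()
    cpu_after = os.times()

    # CPU time spent by this process and any children it waited for.
    cpu = (cpu_after.user - cpu_before.user) + (cpu_after.system - cpu_before.system)
    cpu += cpu_after.children_user - cpu_before.children_user
    cpu += cpu_after.children_system - cpu_before.children_system

    elapsed = wall_after - wall_before
    utilization = 100.0 * cpu / elapsed if elapsed > 0 else 0.0
    return elapsed, utilization


def busy():
    # Stand-in workload; replace with the file operation under test.
    sum(i * i for i in range(200_000))


elapsed, cpu_pct = measure(busy)
print(f"{elapsed:.3f} s, {cpu_pct:.0f}% CPU")
```

A task that mostly waits on the disk shows low utilization, while one that keeps the processor busy approaches 100%; very short tasks may report imprecise figures because CPU time is accounted in coarse ticks.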
M. Benoit’s exhaustive benchmarks using Bonnie++ and IOZone reveal a number of interesting things with regard to the 2.6 kernel of Linux. He concludes that either JFS or XFS offers the best balance of speed and security: XFS seems to perform slightly faster, but JFS utilizes far fewer CPU cycles than any other filesystem tested. Benoit recommends ReiserFS only if the server will be handling a large number of extremely small files (Benoit 1; Ivers 4).
One other point that Benoit broaches with regard to mission-critical servers is that certain filesystems benefit from SCSI drives more than others. Since SCSI drives tend to be a staple of large datacenters, it’s more than likely that a prospective IT administrator will end up dealing with one. According to Benoit’s benchmarks, JFS and ext3 benefit the least from a SCSI drive as opposed to an EIDE drive; Reiser4 benefits the most, though it remains to be seen how it will perform when it becomes stable (1).
Benoit makes perhaps the most important point of all: “Only a fool would base his/her decision on benchmark results from a single source” (1). Each filesystem has strengths and weaknesses: its appropriateness for a given server’s storage unit has everything to do with what’s being stored, how important it is, and what kind of data center it’s running in. Benchmarks can give a rough idea of where a filesystem sits in the performance hierarchy, and certainly they might help to illustrate existing bugs or compatibility problems that will affect the decision to adopt a given filesystem, but the real test of a filesystem should be done on the hardware in question, and only after a careful analysis of the server’s needs.
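As a starting point for testing on one’s own hardware, the sketch below times three small-file operations loosely modeled on the first rows of the tabular benchmark data (touch, find, and remove). The function name and file count are illustrative assumptions, and the temporary directory is only a stand-in for a mount point on the filesystem under test.

```python
import os
import tempfile
import time


def time_small_file_ops(directory: str, count: int = 1000) -> dict:
    """Time creating, listing, and removing `count` empty files in `directory`."""
    results = {}

    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(directory, f"f{i:05d}"), "w"):
            pass  # create an empty file, like `touch`
    results["touch"] = time.perf_counter() - start

    start = time.perf_counter()
    found = [entry.name for entry in os.scandir(directory)]
    results["find"] = time.perf_counter() - start

    start = time.perf_counter()
    for name in found:
        os.remove(os.path.join(directory, name))
    results["remove"] = time.perf_counter() - start

    results["files_seen"] = len(found)
    return results


# Pass dir="/path/on/tested/filesystem" to TemporaryDirectory to target a real mount.
with tempfile.TemporaryDirectory() as scratch:
    timings = time_small_file_ops(scratch, count=1000)
    for name, value in timings.items():
        print(name, value)
```

Because nothing is forced to disk, the numbers mostly reflect cached operations; this is exactly the kind of artificial condition Ts’o warns about, so results should be read alongside tests that resemble the server’s real workload.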
It is impossible to accurately compare the performance of a *nix filesystem against Microsoft’s NTFS: there are too many “X factors” that would skew the data one way or the other. Certainly, it need not be proven that NTFS has the necessary scalability and performance to be placed on very active servers: Microsoft’s own website is a perfect example. However, for occasions which require running a *nix box, it is important to know that there are a number of flexible, stable filesystems that can fit just about any need, and even more are coming: Sun Microsystems has proudly unveiled a new 128-bit filesystem called ZFS, which has ostensible data limits that would be physically impossible to reach; it is available for Sun’s Solaris variant of Unix, and is being ported to work in the userland of other variants like Linux. Whatever the technical requirements of the IT community, the programmers and hackers and developers of the world continue to push the envelope in the realm of free (libre) software.
- Benoit, M. Linux File System Benchmarks. Ed. M Benoit. Oct. 2003. NetNation. 28 Oct. 2006 <http://fsbench.netnation.com>.
- Best, Steve. “JFS overview.” DeveloperWorks. 1 Jan. 2000. IBM. 28 Oct. 2006 <http://www-128.ibm.com/developerworks/library/l-jfs.html>.
- Chinner, Dave, and Jeremy Higdon. “Exploring High Bandwidth Filesystems on Large Systems.” SGI Developer Central. July 2006. SGI. 28 Oct. 2006 <http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf>.
- Gordon, Steve, and Ibrahim Haddad. “IBM’s Journaled Filesystem.” Linux Journal 1 Jan. 2003. 28 Oct. 2006 <http://www.linuxjournal.com/article/6268>.
- Ivers, Hans. “Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch.” Debian Administration. 21 Apr. 2006. 28 Oct. 2006 <http://www.debian-administration.org/articles/388>.
- Tosatti, Marcelo. “XFS merged in 2.4.” Linux Kernel Mailing List 3 Dec. 2003. 28 Oct. 2006 <http://lkml.org/lkml/2003/12/8/35>.
- “Open Source XFS for Linux.” SGI Developer Central. 1 July 2006. SGI. 28 Oct. 2006 <http://oss.sgi.com/projects/xfs/datasheet.pdf>.
- Piszcz, Justin. “Benchmarking Filesystems Part II.” Linux Gazette Jan. 2006. 28 Oct. 2006 <http://linuxgazette.net/122/piszcz.html>.
- Robbins, Daniel. “Advanced filesystem implementor’s guide, Part 1.” DeveloperWorks. 1 Jun. 2001. IBM. 28 Oct. 2006
- —. “Advanced filesystem implementor’s guide, Part 9.” DeveloperWorks. 1 Jan. 2002. IBM. 28 Oct. 2006
- Shankland, Stephen. “Novell makes file storage software shift.” CNET News. 12 Oct. 2006. CNet Networks. 28 Oct. 2006 <http://news.com.com/Novell+makes+file-storage+software+shift/2100-1016_3-6125509.html>.
- Sharma, Mayank. “Novell will switch from ReiserFS to ext3.” Linux.com. 12 Oct. 2006. Open Source Technology Group. 28 Oct. 2006 <http://enterprise.linux.com/article.pl?sid=06/10/12/2120204>.
- Ts’o, Theodore. “ReiserFS.” LinuxMafia KnowledgeBase. 4 Dec. 2004. 28 Oct. 2006 <http://linuxmafia.com/faq/Filesystems/reiserfs.html>.
Tabular data from Justin Piszcz’s filesystem benchmarks
|001||Touch 10,000 Files||48.25||48.25||34.59||33.59||34.08||37.47|
|002||Find 10,000 Files||0.03||0.03||0.03||0.03||0.07||0.04|
|003||Remove 10,000 Files||0.16||0.16||1.64||1.84||2.98||2.51|
|004||Make 10,000 Directories||49.76||49.87||34.32||33.74||34.68||37.17|
|005||Find 10,000 Directories||0.65||0.65||0.63||1.07||1.46||0.72|
|006||Remove 10,000 Directories||1.66||1.67||3.58||43.48||119.42||5.39|
|007||Copy Tarball from Other to Current Disk||5.17||5.15||5.74||5.12||7.34||4.26|
|008||Copy Tarball from Current to Other Disk||6.96||7.00||6.97||6.89||8.21||6.69|
|009||UnTAR Kernel Tarball||14.92||15.19||27.64||26.92||21.45||40.81|
|010||TAR Kernel Source Tree||14.05||14.08||13.05||33.49||25.82||36.19|
|011||Remove Kernel Source Tree||2.47||2.64||6.17||5.65||10.15||9.10|
|012||Copy Kernel Tarball 10 Times||39.48||38.29||39.13||45.15||62.16||46.34|
|013||Create a 1GB File||15.02||15.02||15.12||15.96||25.40||15.87|
|014||Copy a 1GB File||36.87||36.51||38.54||47.60||50.63||41.25|
|015||Split 10M File into 1000 Byte Pieces||57.26||57.77||2.99||4.35||2.95||4.87|
|016||Split 10M File into 1024 Byte Pieces||28.73||28.97||2.24||4.04||2.61||4.01|
|017||Split 10M File into 2048 Byte Pieces||7.02||6.98||1.39||2.26||1.55||1.95|
|018||Split 10M File into 4096 Byte Pieces||1.85||1.83||0.67||1.05||0.99||0.98|
|019||Split 10M File into 8192 Byte Pieces||0.58||0.58||0.36||0.56||0.62||0.57|
|020||Copy Kernel Source Tree||10.02||10.06||35.76||31.64||20.17||43.42|
|021||CAT 1GB File to /dev/null||18.90||18.59||18.00||37.33||21.37||18.70|
|001||Touch 10,000 Files||99.00||99.00||99.00||99.00||99.33||99.00|
|002||Find 10,000 Files||94.00||93.00||94.00||95.00||97.00||95.66|
|003||Remove 10,000 Files||98.00||98.66||73.66||99.00||99.00||91.66|
|004||Make 10,000 Directories||98.00||97.33||99.00||99.00||99.66||99.00|
|005||Find 10,000 Directories||99.00||99.00||99.00||99.00||99.00||99.00|
|006||Remove 10,000 Directories||99.00||99.00||88.66||99.00||99.00||97.00|
|007||Copy Tarball from Other to Current Disk||74.66||74.66||76.00||74.66||61.33||92.33|
|008||Copy Tarball from Current to Other Disk||60.00||59.33||59.33||62.00||86.00||62.66|
|009||UnTAR Kernel Tarball||42.33||41.33||27.33||53.00||80.00||26.00|
|010||TAR Kernel Source Tree||44.00||43.66||51.33||26.66||48.66||21.00|
|011||Remove Kernel Source Tree||39.66||36.66||33.00||89.33||88.33||63.66|
|012||Copy Kernel Tarball 10 Times||79.33||80.66||93.33||74.33||73.00||90.33|
|013||Create a 1GB File||56.00||55.66||67.33||57.00||50.00||64.33|
|014||Copy a 1GB File||42.00||42.00||47.00||37.33||52.00||49.33|
|015||Split 10M File into 1000 Byte Pieces||99.00||99.00||64.33||96.33||98.00||86.33|
|016||Split 10M File into 1024 Byte Pieces||99.00||99.00||77.33||97.66||99.00||97.00|
|017||Split 10M File into 2048 Byte Pieces||99.00||99.00||64.00||96.66||99.00||97.33|
|018||Split 10M File into 4096 Byte Pieces||99.00||99.00||69.33||99.00||99.00||97.33|
|019||Split 10M File into 8192 Byte Pieces||99.00||99.00||87.00||89.66||99.00||97.66|
|020||Copy Kernel Source Tree||65.33||65.00||21.33||41.33||70.33||25.33|
|021||CAT 1GB File to /dev/null||26.33||27.00||27.33||36.66||46.33||30.00|
Graphical data from Justin Piszcz’s filesystem benchmarks
- [1] When discussing “free” software, it is important to differentiate between free in a monetary sense (commonly called “free as in beer”) and free in a legal or philosophical sense (“free as in freedom” or simply libre), which implies the ability to make changes to the software and redistribute it.
- [2] See Appendices A and B.