FAQ ReiserFS

This FAQ is very ReiserFS centric and often a bit dated. The Reiser4 filesystem is mentioned as upcoming. Be sure to search the mailing list archives and help update this FAQ - Thanks!

1 What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?
2 Mount fails after reiserfsck --rebuild-tree failure
3 Why is the execution time for a find . -type f | xargs cat {} \; command much longer when using ReiserFS than for the same command when using ext2?
4 Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS?
5 I am getting some errors in my kernel logs, that I do not know how to interpret
6 Will ReiserFS implement streams, extended attributes, etc.?
7 Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an ls(1) in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed?
8 I get attempt to read past the end of the partition error messages; is ReiserFS corrupted?
9 Can I use VMware with ReiserFS?
10 How do I install Debian potato with ReiserFS as root partition?
11 Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why?
12 Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device?
13 How do I change root from ext2 to ReiserFS without loss of data?
14 mount: /dev/hda5 has wrong major or minor number - what does that mean?
15 Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS?
16 The ReiserFS module doesn't insert properly - why?
17 Can I use ReiserFS with the software RAID?
18 Can I use ReiserFS with 3ware RAID?
19 Why do things freeze on my IDE hard drive for annoying amounts of time?
20 du(1) says ReiserFS makes space efficiency worse.
21 mkreiserfs(8) fails after repartitioning
22 Performance is poor, and my disk at 96% full still has free space.
23 Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2?
24 But it doesn't happen with ext2?
25 Can I use ReiserFS on other architectures than i386?
26 I need a program which will help me in rebuilding/recreating my partition table.
27 What partition type should I use for ReiserFS?
28 Can I use 32GB+ IDE Hard Drives with ReiserFS?
29 What about resizing ReiserFS?
30 What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems?
31 Why are ReiserFS filesystems not fscked on reboot after a crash?
32 Can I interactively repair a filesystem that was corrupted?
33 Can I use dump(8) and restore(8) with ReiserFS? Any caveats?
34 Does ReiserFS support snapshots?
35 Can I check reiserfs filesystems for errors without unmounting them?
36 What ReiserFS mount options should I use to get the performance winner for a mail server?
37 Does using ReiserFS mean I can just press the power off button without running /sbin/shutdown? Does it mean there is no risk of data loss?
38 How does ReiserFS support bad block handling?
39 I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.
40 I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable?
41 How can I put a label (like allowed by -L option of mkfs.ext2) on a ReiserFS instance?
42 Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything.
43 RedHat does not unmount / (/dev/root) with ReiserFS on halt. How to fix it?
44 How do I run programs from reiserfsprogs package on encrypted devices?
45 Are there any recommendations for or against any particular hard drive manufacturers for use with ReiserFS?
46 I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?
47 In my program I am using fsync(2) calls after each write to the file to guarantee integrity of my file data, and this is very slow, how can I improve the performance?
48 Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects?

What are the specs for ReiserFS: maximum number of files, of files a directory can have, of sub-dirs in a dir, of links to a file, maximum file size, maximum filesystem size, etc.?

Specifications for ReiserFS:

property	3.5	3.6
max number of files	2^32 - 3 => 4 Gi - 3	2^32 - 3 => 4 Gi - 3
max number of files a dir can have	518701895 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)	2^32 - 4 => 4 Gi - 4 (but in practice this value is limited by the hash function; the r5 hash allows about 1 200 000 file names without collisions)
max file size	2^31 - 1 => 2 Gi - 1	2^60 bytes => 1 Ei, but the page cache limits this to 8 Ti on architectures with 32 bit int
max number of links to a file	2^16 => 64 Ki	2^32 => 4 Gi
max filesystem size	2^32 (4K) blocks => 16 Ti	2^32 (4K) blocks => 16 Ti
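
The 16 Ti and 8 Ti figures can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of any ReiserFS tool. It assumes the 4 KiB block size given in the table. For the page cache limit it additionally assumes 4 KiB pages indexed by a signed 32-bit integer (2^31 pages); that is an assumption chosen because it reproduces the 8 Ti figure, as the table itself only says "32 bit int".

```python
KIB = 2**10
TIB = 2**40
EIB = 2**60

BLOCK_SIZE = 4 * KIB  # "4K" blocks, as stated in the table


def max_filesystem_bytes(block_count=2**32, block_size=BLOCK_SIZE):
    """Largest filesystem: number of addressable blocks times block size."""
    return block_count * block_size


def page_cache_limit_bytes(index_bits=31, page_size=4 * KIB):
    """File size reachable through the page cache.

    Assumption: pages are indexed by a signed 32-bit int, i.e. 2**31 pages
    of 4 KiB each. This is an illustration, not taken from kernel source.
    """
    return 2**index_bits * page_size


if __name__ == "__main__":
    print(max_filesystem_bytes() // TIB, "Ti max filesystem size")
    print(page_cache_limit_bytes() // TIB, "Ti page cache limit")
    print(2**60 // EIB, "Ei max file size in the 3.6 format")
```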

ReiserFS does meta-data journaling, enabling fast crash recovery without the expense of full data journaling. There are separate patches from Chris Mason that implement full data journaling for ReiserFS for Linux 2.4.16.

Note: Full data journaling is considered by many to be a good way to achieve file data integrity across system crashes. However, although file data may appear consistent from the kernel's point of view, no API is exported to userspace to control transactions. An application may therefore make two write requests (as parts of one logical transaction) and have only one of them journaled before the system crashes. From the application's point of view, the file then contains inconsistent data. Such issues should be addressed by the upcoming Reiser4, which will export a transaction API to userspace that all programs needing transactions will be able to use.

Mount fails after reiserfsck --rebuild-tree failure

When reiserfsck --rebuild-tree is run, the first thing it does is set the root inode value to -1. This makes the filesystem unmountable. (So if reiserfsck fails later on because the filesystem contains serious errors, the filesystem cannot be mounted.) Therefore, once reiserfsck --rebuild-tree has failed for one of your filesystems, mounting of that partition is disabled. To correct the error, first check that you have the latest reiserfsprogs package installed. If that fails, please send a bug report to our mailing list and be ready to answer our questions.

Why is the execution time for a `find . -type f | xargs cat {} \;` command much longer when using ReiserFS than for the same command when using ext2?

This effect is observed if the measured file set was produced by untarring an archive that was not created from a ReiserFS partition (or by copying files from a non-ReiserFS partition, or by running a program that writes a bunch of files in some order). The readdir() operation on a ReiserFS partition returns filenames not in the original write order but in hash order (dependent on the hash function used). Thus, when reading the files' contents, the hard drive heads must seek when going from one file to the next. If you want ReiserFS to outperform any other filesystem in your setup, here is one solution: copy the entire directory that you are not satisfied with to the same partition under a different name (use cp -a), then remove the old directory and rename the new one to the old name. If the partition does not have enough space available, another approach is to tar(1) up the whole partition, clear it, and then untar the previously saved data.
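
The copy, remove, and rename procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the ".reordered" suffix are hypothetical, and shutil.copytree is not a full substitute for cp -a (it does not preserve ownership or hard links, for example). What it does share with cp -a is that it creates the new files in the order in which the filesystem lists the old ones.

```python
import os
import shutil


def rewrite_directory(path):
    """Recreate `path` so its files are written in current readdir() order.

    Steps: copy the tree under a temporary name on the same partition,
    remove the original, then rename the copy to the original name.
    """
    path = os.path.abspath(path)
    tmp = path + ".reordered"  # hypothetical temporary name
    if os.path.exists(tmp):
        raise FileExistsError(tmp)
    # symlinks=True keeps symbolic links as links; the default copy
    # function (shutil.copy2) keeps modification times and permissions.
    shutil.copytree(path, tmp, symlinks=True)
    shutil.rmtree(path)
    os.rename(tmp, path)
    return path
```

As with the cp -a approach, this needs enough free space for a second copy of the directory, and nothing else should be writing to the directory while it runs.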

Is quota-support built-in in the vanilla 2.4 kernels for ReiserFS?

No. Quota support for the 2.4 branch of Linux kernels is distributed as separate patches by Chris Mason. They were once available at SuSE (now gone) and are still mirrored at TU-Wien. These patches were not included in the 2.4 kernel branch because they implement a new quota format and need new quota code too, which is too big a change for the 2.4 series of kernels. Various Linux distribution vendors (e.g. SuSE) do ship reiserfs-quota enabled kernels, though.

I am getting some errors in my kernel logs, that I do not know how to interpret

Messages like:

 vs-13070: reiserfs_read_inode2: i/o failure occurred trying to find stat data of 
 [1718696 1718710 0x0 SD]"

 zam-7001: io error in reiserfs_find_entry

most likely accompanied by messages like the samples below, are definite signs of hard disk problems (bad sectors):

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6599945,
 sector=4286584
 end_request: I/O error, dev 03:03 (hda), sector 4286584

or

 scsi0: ERROR on channel 0, id 1, lun 0, CDB: Read (10) 00 00 01 ee 60 00 00 08 00
 Current sd 08:00: sense key Medium Error

or

 I/O error: dev 08:21, sector 65704

Messages about "access beyond end of device" may have many different causes, ranging from not rebooting after fdisk requested it, to unfinished resizing, to data corruption.

The following messages mean you have a noisy IDE cable, or that its quality is too low for the chosen UDMA mode. Try replacing the cable with a better one, or choose a slower UDMA mode:

 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
 hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
 hda: dma_intr: error=0x84 { DriveStatusError BadCRC }

If you see any message from ReiserFS that you cannot interpret and there is nothing similar to messages above around it, mail the message to us and we will explain it to you.

Will ReiserFS implement streams, extended attributes, etc.?

Here is the one page answer.

Reiserfs appears to be very slow while the RAID is resyncing. Mounting takes several minutes. Once mounted, an `ls(1)` in the mounted directory hangs. Forever. Once the RAID is sync'ed, things appear to work pretty well. How can that be fixed?

First of all, a patch that makes mounting faster has been included in the Linux kernel since 2.4.19. You can grab the patch for earlier kernels here.

Also, the RAID drivers have a minimum guaranteed and a maximum possible RAID rebuild bandwidth. These values are controlled through the /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max sysctl variables (values are in KiB/s). It seems that the RAID logic cannot always tell whether the disk subsystem is busy at a given time. When it thinks the disk subsystem is idle, it tries to rebuild the RAID array at speed_limit_max speed, which defaults to 100 MB per second. Decrease this value to something more suitable (a bit of experimentation might be needed).
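
Lowering the limit amounts to writing a smaller number into the sysctl file. The sketch below is a hedged illustration in Python: the function names are hypothetical, the value is assumed to be in KiB/s, and writing to /proc requires root. The path is a parameter so the logic can be tried on an ordinary file first.

```python
from pathlib import Path

# Path named in the text above; only present when the md driver is loaded.
SPEED_LIMIT_MAX = Path("/proc/sys/dev/raid/speed_limit_max")


def read_limit(path=SPEED_LIMIT_MAX):
    """Return the current limit as an integer (assumed to be KiB/s)."""
    return int(Path(path).read_text().strip())


def lower_limit(new_value, path=SPEED_LIMIT_MAX):
    """Write new_value only if it is lower than the current limit.

    Returns the limit that is in effect afterwards.
    """
    current = read_limit(path)
    if new_value < current:
        Path(path).write_text(f"{new_value}\n")
        return new_value
    return current
```

For example, lower_limit(10000) would cap the rebuild at roughly 10 MB per second if the current limit is higher. The right value depends on the hardware, so treat any number as a starting point for experimentation.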

I get attempt to read past the end of the partition error messages; is ReiserFS corrupted?

You changed your partition sizes and then ran mkreiserfs before rebooting. The kernel does not update its idea of the partition sizes until reboot. (This is fixable, but nobody has fixed it as of Dec. 2001.) mkreiserfs therefore created a filesystem with the wrong notion of how large its partition is. The filesystem's notion of the partition boundaries persists past reboot even though the kernel's notion changes. So yes, it is corrupted. Some other kinds of metadata breakage can also lead to such messages.

Can I use VMware with ReiserFS?

VMware was tested on SuSE Linux with Windows98 Guest OS on a ReiserFS partition. There's one trick at the beginning: the following line was added to the VMware config file

 host.FSSupportLocking1 = 0x52654973
 # (0x52654973 == *(u32 *) "ReIs")

Thanks to Gregory K. Ade for this hint.

How do I install Debian potato with ReiserFS as root partition?

Here are instructions by Dr. A.V. Le Blanc.

Starting with linux kernel v2.4.21 I cannot mount my FS anymore. Why?

Special sanity checks were added to the kernel code to prohibit mounting of filesystems that are bigger than the underlying block device. If you now see this message on mount:

  Filesystem on xx:yy cannot be mounted because it is bigger than the device

you may need to run fsck or increase the size of your LVM partition. Or maybe you forgot to reboot after fdisk when it told you to.

If you do not use LVM, that usually means you need to run reiserfsck --rebuild-sb on your filesystem and agree to change its size to the proposed one.

Is it ok to use ReiserFS on a small size storage device: e.g. 16MB NAND flash block device?

Here are instructions.

How do I change root from ext2 to ReiserFS without loss of data?

Here are instructions.

`mount: /dev/hda5 has wrong major or minor number` - what does that mean?

The kernel does not know anything about ReiserFS: it is neither compiled in nor available as a module.

Will it be possible to read/write ReiserFS partitions created now with future versions of ReiserFS?

Yes. ReiserFS-3.6.x (Linux-2.4.x) works with both the old (3.5) and the new (3.6) formats. ReiserFS-3.5.x (Linux-2.2.x) can only work with the old (3.5) disk format. There is no way to convert the new (3.6) disk format to the old (3.5), but the old (3.5) format can be converted to the new one (3.6) with the -o conv mount option.

The ReiserFS module doesn't insert properly - why?

After applying the patch, recompile the whole kernel including the modules target, reboot, then try to insert the module.

Can I use ReiserFS with the software RAID?

Yes, for all RAID levels using any Linux >= 2.4.1, but DO NOT use RAID with Linux 2.2.x. Our journaling and their RAID code step on each other in the buffering code. Also, mirroring is not safe in the 2.2.x kernels because online mirror rebuilds in 2.2.x break the write ordering requirements for the log. If you crash in the middle of an online rebuild, your meta-data may be corrupted. The only RAID level that is safe with ReiserFS in the 2.2.x kernels is the striping/concatenation level.

Can I use ReiserFS with 3ware RAID?

Yes, but you need to use Linux 2.2.19 or later for reasons other than ReiserFS. Also, if you encounter problems, be suspicious that it might not be ReiserFS that has the bug. See the special instructions (archive.org).

Why do things freeze on my IDE hard drive for annoying amounts of time?

Because when large writes are scheduled all at once, reads can starve. A fix for this is evolving; the later your ReiserFS patch, the better we handle this.

`du(1)` says ReiserFS makes space efficiency worse.

Use df(1) not du(1), or use raw option for du(1) if it's supported. st_blocks summed up is less accurate than st_size for ReiserFS because we pack tails, and st_blocks rounds numbers up.
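
The difference between the two measures can be seen by summing both over a directory tree. This is a minimal Python sketch with a hypothetical function name. It assumes st_blocks is counted in 512-byte units, which is the usual convention on Linux.

```python
import os


def directory_sizes(root):
    """Return (apparent_bytes, allocated_bytes) for regular files under root.

    apparent_bytes sums st_size (exact file lengths).
    allocated_bytes sums st_blocks * 512 (rounded-up block usage, which is
    what a plain du reports and what overstates usage when tails are packed).
    """
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            allocated += st.st_blocks * 512  # assumed 512-byte units
    return apparent, allocated
```

On a directory with many small files, the second number is typically noticeably larger than the first, which is the effect described above.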

`mkreiserfs(8)` fails after repartitioning

The kernel requires you to reboot after repartitioning (for all filesystems). We intend to fix that.

Performance is poor, and my disk at 96% full still has free space.

Once a disk drive gets more than 85% full, performance starts to suffer unless a repacker is used (which isn't implemented yet). You can probably get away with 92%, but if performance is valued you are making a mistake to keep it any fuller. This is true for almost all filesystems. ReiserFS, because it packs tails together, packs more data into a given percentage used, but it is still subject to the rules for maximum recommended percentage used.

If you create the whole disk with one copy and then mount it read-only, then you can fully pack it without problem. Please be sure that you copy it from (or tar it from) a reiserfs partition so that files are created in reiserfs readdir() order as this will improve performance.

Why do I get a signal 11 when compiling the kernel using ReiserFS and not ext2?

Your CPU is overheating and/or you have bad RAM.

But it doesn't happen with ext2?

ext2 uses less heat sensitive gates in the CPU :-) Seriously, ext2 and ReiserFS contain random differences, and overheating and bad RAM have random sensitivities. (Signal 11 is not due to ReiserFS. One user had a cable blocking the fan; it did not affect ext2, but it wasn't until he fixed the cable-fan problem that ReiserFS worked.)

Can I use ReiserFS on other architectures than i386?

Yes, starting from the Linux kernel 2.4.13, ReiserFS can be run on any Linux supported arch.

I need a program which will help me in rebuilding/recreating my partition table.

gpart is a utility that handles ext2, FAT, Linux swap, HPFS, NTFS, FreeBSD and Solaris/x86 disklabels, Minix, ReiserFS. It prints a proposed content for the primary partition table and is well-documented.

What partition type should I use for ReiserFS?

Linux native filesystem (83)

Can I use 32GB+ IDE Hard Drives with ReiserFS?

Yes, if you use Linux kernel 2.4 or later.

What about resizing ReiserFS?

This can be done with resize_reiserfs.

What should I put into the fifth (aka dump, fs_freq ) and the sixth (aka pass, fs_passno ) fields of /etc/fstab for ReiserFS filesystems?

You'd put in "0 0", e.g.

 /dev/sda3    /var     reiserfs notail,nodev,nosuid,noexec 0 0

Why are ReiserFS filesystems not fscked on reboot after a crash?

Because ReiserFS provides journaling of meta-data. After a crash, the consistency of a filesystem is restored by replaying the transaction log.

Can I interactively repair a filesystem that was corrupted?

This is done with reiserfsck.

Can I use `dump(8)` and `restore(8)` with ReiserFS? Any caveats?

No. dump(8) uses knowledge of the internal structure of ext2 and works together with restore, which also uses ext2 specific knowledge, to back up ext2 files. dump and restore are specific to ext2 and will not work with ReiserFS.

To back up ReiserFS files use tar(1), which is universal and can be applied to almost any reasonable Linux filesystem.

It is well known among system administrators that dump(8) is more complete than Unix tar, and that there is quite a list of things that Unix tar will fail to back up properly. This is not true of GNU tar, which is quite complete. Basically, the only real disadvantage of GNU tar compared to dump(8) is speed. Unfortunately, because it shares its name with Unix tar(1), people are reluctant to believe this. (Yes, GNU tar has incremental backups, etc.) We will performance-optimize ReiserFS backups for you (and the rest of the world) for $30K, which is not a lot if you are a large site spending a few hundred thousand on equipment for backups.

Does ReiserFS support snapshots?

No, but you can create ReiserFS on top of LVM logical volume and use LVM snapshot capabilities.

Can I check reiserfs filesystems for errors without unmounting them?

reiserfsck in checking mode may run over filesystems mounted read-only. There is no official way to fix mounted filesystems, though. You MUST completely unmount your filesystem in order to have it fixed. If you have LVM, you can check consistency of filesystems mounted read-write, here is the script contributed by Andreas Dilger:

What ReiserFS mount options should I use to get the performance winner for a mail server?

Craig Sanders answered in detail:

By the time I got around to running bonnie, the postmark and postal benchmarks had convinced me that notail was essential.

Host system:

Debian GNU/Linux (of course :)
Linux kernel 2.4.2 with latest 20010305 ReiserFS patch
dual P3-866 (256K cache)
512MB RAM
Adaptec 19160 SCSI Controller

External drive box:

Domex 8230u RAID controller, 32MB battery-backed cache.
6 x 18GB IBM DDYS-T18350M drives

For this particular hardware I was using, ReiserFS/notail on RAID5 was the clear performance winner for a mail server with lots of synced random I/O.

Does using ReiserFS mean I can just press the power off button without running `/sbin/shutdown`? Does it mean there is no risk of data loss?

No, definitely not. As of now, ReiserFS only provides meta-data journaling - that is, it records which files have been created or opened, whether they have had their size changed, or where they have been relocated. It guarantees that the structure of the internal ReiserFS tree will be correct, thereby allowing you after an unclean shutdown to start back up without having to run fsck on all the files that have not been changed.

Data in files that were being used at the time of the crash could have been corrupted. This is usual for most filesystems. Data journaling filesystems guarantee that there will be no garbage written into a file, but they don't guarantee that a file update will be atomic. (Only Reiser4 guarantees that filesystem operations are performed as atomic operations, and provides atomic transaction functionality.)

ReiserFS does not guarantee the file contents themselves are uncorrupted nor that no data is lost. Moreover, even given that all of your system is on ReiserFS, many system components (like daemons, database managers, etc) require the shut down procedure for proper functioning.

However, there is a separate implementation of data logging that will soon go into the mainstream kernel.

How does ReiserFS support bad block handling?

This is covered here.

I have a motherboard with VIA MVP3 chipset and experience ReiserFS problems.

William Oster answers:

If you are using a motherboard with a VIA MVP3 chipset, you may have ReiserFS problems caused by the way your kernel is configured for the so called PCI quirks. My experience is with kernel 2.2.18 and 2.2.19 but it may affect the 2.4.x series too if you are using MVP3 chipset (popular in socket 7 type motherboards, such as used by AMD K6 and classic Pentium).

I've confirmed this problem with several motherboards using the VIA MVP3 chipset, ReiserFS 3.5.29 to 3.5.32, and NCR 53c8xx SCSI. But please note: It probably affects any controller which uses DMA and PCI bus mastering.

Problems which I was inclined to attribute to ReiserFS were actually problems with this kernel [mis]configuration.

If you fit this profile, DO NOT enable the CONFIG_PCI_QUIRKS configuration option in the /usr/src/linux/.config file. Although the Linux documentation suggests that this option can be enabled if in doubt, DO NOT enable it. It was never intended for the VIA MVP3 chipset anyway. It affects the way DMA is handled, and the combination of ReiserFS (and possibly NCR SCSI) can cause random disk corruption which eventually will result in ReiserFS and/or SCSI errors.

Evidently ReiserFS exercises the DMA and SCSI bus very thoroughly. The problems seem less likely under the ext2 filesystem. Check your /usr/src/linux/.config file. You are safe from this problem if you find this line:

 # CONFIG_PCI_QUIRKS is not set

Any other setting could be dangerous to MVP3 chipset ReiserFS users especially when using PCI bus mastering controllers such as the NCR 53c8xx series. Re-configure your kernel to disable the "PCI quirks" option, then make dep, rebuild, and reinstall.
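
The check described above can be automated. The following Python sketch uses a hypothetical function name and follows the rule as stated in this answer: only the explicit "is not set" line counts as safe, and any other setting (or no mention of the option at all) is treated as not confirmed safe.

```python
def pci_quirks_disabled(config_text):
    """Return True only if the kernel config explicitly disables PCI quirks.

    config_text is the content of a kernel .config file as a string.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# CONFIG_PCI_QUIRKS is not set":
            return True
        if line.startswith("CONFIG_PCI_QUIRKS="):
            # Any explicit value counts as "any other setting".
            return False
    # Option not mentioned: cannot be confirmed safe by this rule.
    return False
```

A typical use would be to read /usr/src/linux/.config into a string and pass it to the function before building the kernel.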

I am having extensive problems using ReiserFS; it seems to have bugs all over the place. I'm not compiling with a buggy compiler. What is happening? How can this be stable?

You have hardware problems. Really, you do. Even if the bugs don't show up with ext2, you have hardware problems. (See the signal 11 question). Most SuSE users use ReiserFS. Obscure bugs probably still exist; but if you find bugs as easily as using Windows, you have bad RAM, bad CPU, bad cable, bad cooling, VIA chipset with PCI quirks turned on, or other hardware or other software layer bugs.

ReiserFS is stable. You can be sure that if the bugs are encountered easily and commonly with normal usage patterns, it is not us. This does not mean that the next release won't somehow break something though :-/

Real bug reports are at the time of writing outnumbered 10 to 1 by hardware bugs that trigger error messages. We are working on making our error messages better at catching hardware bugs and identifying them as such. There is only so far we can go though in runtime consistency checking without serious speed reductions. We don't release software unless it goes through extensive testing; so if you don't think that our testing could have missed the bug, it is probably hardware.

How can I put a label (like allowed by `-L` option of `mkfs.ext2`) on a ReiserFS instance?

Currently, this feature is only implemented for the ReiserFS v3.6 disk format. Adding it to the v3.5 disk format would break the existing disk format, and there is not enough free space in the superblock. With a recent reiserfsprogs package you can set a label (and UUID) on a ReiserFS v3.6 filesystem using the -l switch (-u for UUID) of reiserfstune (for existing partitions) or mkreiserfs (for partitions being created). Support for labels and UUIDs was integrated into reiserfsprogs starting from version 3.x.1a.

Why, when I'm working on files (i.e. having open files) on my laptop, does ReiserFS access the disk every 5 seconds? This effectively prevents the disk from spinning down, i.e. APM modes to take over, even when I'm not writing anything.

Brent Graveland answers:

It's the atime update. Every time you run sync(1), the sync program's atime is updated. The next sync() writes this atime update, then sync(1) gets updated again.

RedHat does not unmount `/` (`/dev/root`) with ReiserFS on halt. How to fix it?

RedHat users kindly provided these patches (not tested by us):

Note that if you have RedHat Linux 7.2 or later, you do not need these patches.

How do I run programs from reiserfsprogs package on encrypted devices?

In order to access such encrypted entities you need to use the losetup(8) tool to bind your entity to a loop device.

Are there any recommendations for or against any particular hard drive manufacturers for use with ReiserFS?

No. Bad hard drives are not a ReiserFS-specific problem; they affect all filesystems:

There is basically no preference; the general rule applies as always: the faster the drive and the lower the seek time, the better. On the other hand, almost every hard drive manufacturer has a widely known broken series of hard drives. The most recent example is IBM's Deskstar series, especially the DTLA models produced in Hungary in 2000-2001. These are known to fail very often, to the point that you probably don't want to use them even if you already paid for them. Other Deskstar drives also seem to be a poor choice. IBM released a note saying that Deskstar drives should not run for more than 8 hours/day on average. These drives are also known to be very sensitive to temperature and to fail when overheating. There is a class action lawsuit against IBM over that drive series.

I am using RedHat 7.0 with gcc 2.96; why does ReiserFS seem unstable with it?

Use the most recent version of RedHat (gcc 2.96-85 or later with RedHat 7.2, although 7.1 is also okay for ReiserFS).

The choice of an unstable, unreleased version of gcc 2.96 by RedHat as the default gcc was a Slashdot controversy. gcc 2.96 on RedHat 7.0 was unstable, and ReiserFS was one of the things that would fail with it. There are two versions of gcc 2.96: 2.96 and 2.96-85. 2.96-85 works for ReiserFS, and the other (the one on RedHat 7.0) surely does not. Read the Linux kernel instructions about what compiler to use. The solution to code not working on broken compilers is the one RedHat has taken: fix the compiler. They fixed the compiler and thereby allowed the correctly compiled ReiserFS to work.

In my program I am using `fsync(2)` calls after each write to the file to guarantee integrity of my file data, and this is very slow, how can I improve the performance?

Answer from Chris Mason:

The main thing to remember is that fsyncs introduce a bunch of disk writes, and force the FS to wait on the buffers. The key to keeping performance up is to make it easy for the FS to do as much as possible before the fsync() call.

So, if your application modifies 3 files, and you want to make sure all 3 changes are safely on disk:

 write(file1)
 write(file2)
 write(file3)
 fsync(file1)
 fsync(file2)
 fsync(file3)

is much faster than:

 write(file1)
 fsync(file1)
 write(file2)
 fsync(file2)
 write(file3)
 fsync(file3)

It is also faster to write over existing bytes in the file than it is to append new bytes onto the end of a file. When you overwrite existing bytes in the file, you don't have to commit new metadata to disk on fsync(), the FS can just write the data blocks. This is fewer seeks. The more you write to a single file before calling fsync(), the faster overall things will run.

 write(8k)
 fsync(file)

is much faster than:

 write(4k)
 fsync(file)
 write(4k)
 fsync(file)

Trying to optimize for those 3 things alone can make a huge performance difference overall.
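
The first pattern above (write everything, then sync everything) can be sketched in Python with os.write and os.fsync. The helper names are hypothetical and this only illustrates the ordering; it is not a benchmark and makes no claim about how much faster it is on a given system.

```python
import os


def write_all(fd, data):
    """Write all of data to fd, looping in case of short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_then_sync(paths_and_data):
    """Issue every write first, then fsync each file.

    paths_and_data is a list of (path, bytes) pairs. Batching the writes
    before the first fsync gives the filesystem as much work as possible
    to do before it has to wait on the buffers.
    """
    fds = []
    try:
        for path, data in paths_and_data:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append(fd)
            write_all(fd, data)
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)
```

The same idea applies within a single file: collect the data and issue one larger write followed by one fsync, rather than syncing after every small write.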

Answer from Josh MacDonald:

You have to understand that even using fsync() after every write() makes no guarantees. If the system crashes during either the write() or fsync() operation your data may be lost or corrupted.

Suppose the fsync() does complete, does your application keep its data in multiple files? If that is the case and you need to write() to multiple files as part of a transaction, you have even greater problems.

The only safe and easy way for you to implement some kind of transaction with the traditional file system guarantees is to use rename():

Keep all of your data in a single file.
Periodically write a complete copy of your database to a temporary file.
Rename the temporary file to the original database name.
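
The three steps above can be sketched in Python as follows. The function name is hypothetical. The fsync of the temporary file before the rename is an addition not listed in the steps; it is included on the assumption that the new contents should be on disk before the new name replaces the old one.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` using a temporary file and rename.

    Readers see either the complete old file or the complete new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # assumption: flush data before renaming
        os.replace(tmp_path, path)  # performs rename() on POSIX systems
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

For stricter durability some applications also fsync the containing directory after the rename; that step is left out here to keep the sketch close to the steps listed above.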

Addition from Nikita Danilov:

One can implement something like a phase-tree at user-level and use rename() to atomically switch root of the tree. This overcomes the "everything-in-one-file" limitation but has the added complexity of requiring crash-recovery.

Stop your development for now and wait until the Reiser4 filesystem is released; it has a transaction API exported to userspace. That transaction API would solve all of your problems.

Our program needs to access a lot of working files. What is the recommended way to organize files to get the best results out of ReiserFS? Should all the files be placed in a single directory, or should the files be spread across a directory tree to limit the number of files per directory? Can you also summarize the relevant caching and locking effects?

Traditional file systems typically have poor performance when there are many files in a single directory, but not ReiserFS. These other file systems perform poorly because they use a linear search algorithm to find and replace entries in a directory. This means that the file system must scan, on average, half the blocks of a directory for every access. Typically, applications are required to work around this problem by manually structuring a tree of directories, allowing each individual directory to remain limited in size. For example, see how the Squid web proxy stores a large collection of files.

ReiserFS does not have this problem because it uses an internal tree to store all directories and file metadata. Directory operations remain efficient even for very large directories, so you can write your application free from this performance concern. However, there are several issues that complicate this matter, namely locking and locality.

The Linux VFS currently imposes locking restrictions that serialize many operations on directories, so if concurrent processes or threads will access the collection of files then you may be better off using multiple directories. Reiser4 will improve upon this restriction, although it is still under development.
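
One common way to spread files over multiple directories is to derive subdirectory names from a hash of the file name. The sketch below is a generic illustration in Python; the function name, the use of SHA-256, and the two-level layout are assumptions for this example and are not the scheme used by Squid or by ReiserFS.

```python
import hashlib
import os


def bucket_path(root, name, levels=2, width=2):
    """Map a file name to root/<h1>/<h2>/name using a hash of the name.

    With levels=2 and width=2 (hex digits) this yields 256 * 256 buckets,
    so concurrent workers mostly touch different directories.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name)


if __name__ == "__main__":
    path = bucket_path("/tmp/workfiles", "report-0001.dat")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    print(path)
```

Because the bucket is computed from the name alone, any process can locate a file without scanning directories. The trade-off is that files which are used together may end up in different directories.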

ReiserFS attempts to store all of the files in a directory, along with the directory itself, in nearby locations on disk. An application may exploit this spatial locality if it can predict which files will be accessed with temporal locality. You may be better off using multiple directories to store your files if you can predict that many files within a directory will be accessed at the same time.

To summarize, ReiserFS supports efficient access to large directories and most traditional file systems do not. However, locking and locality issues may guide your decision to use manually structured directory trees instead, at least until ReiserFS exports control over packing locality to users, and improves its locking.