This document has been retrieved from archive.org in its version from 2006-11-13.
It was written by its respective author(s) and not by the author(s) of this article.
Please apply only formatting changes, adding or correcting sources and maybe
spelling- and punctuation fixes to this document. Thanks!
Reiser4 Transaction Design Document Last Update: Apr. 5, 2002 Joshua MacDonald, Hans Reiser and Alex Zarochentcev
Reiser4 will feature advanced new transaction capabilities. The transaction model we describe for version 4 allows the file system programmer to specify a set of operations and guarantees that all or none of those operations will survive a system failure (i.e., crash). The name for this specialized notion of a transaction is a transcrash.
In traditional Unix semantics, a sequence of write() system calls are not expected to be atomic, meaning that an in-progress write could be interrupted by a crash and leave part new and part old data behind. Writes are not even guaranteed to be ordered in the traditional semantics, meaning that recently-written data could survive a crash even though less-recently-written data does not survive. Some file systems offer a kind of write-atomicity, known as data-journaling, in which an individual data block is written to a log file before overwriting its real location, but this only ensures that individual blocks are written atomically, not the entire buffer of a write() system call. This technique doubles the amount of data written to the disk, which becomes significant when the disk transfer rate is a limiting performance factor.
Something more clever is possible. Instead of writing every modified block twice, we can write the block only once to a new location and then update the block's address in its parent node in the file system. However, the parent modification must also be included in the transaction. The WAFL (Write Anywhere File Layout) technique Hitz94 handles this by propagating file modifications all the way to the root node of the file system, which is then updated atomically.
In general, it is possible to use either approach to update a block - log a copy of the block and overwrite its original location or relocate the block and modify its parent block within the same transaction. In Reiser4 this decision is made independently for each block by a block-allocation plugin based on the set of modified blocks, the current file system layout, and the associated costs of each update method.
Definition of Atomicity
Most file systems perform write caching, meaning that modified data are not immediately written to the disk. Writes are deferred for a period of time, which allows the system greater control over disk scheduling. A system crash can happen at any time causing some recent modifications to be lost, and this can be a serious problem if an application has made several interdependent modifications, some of which are lost when others are not. Such an application is said to require atomicity—a guarantee that all or none of a sequence of interdependent operations will survive a crash. Without atomicity, a system crash can leave the file system in an inconsistent state.
Dependent modifications may also arise when an application reads modified data and then produces further output. Consider the following sequence of events:
* Process 1 writes file A * Process 2 reads file A * Process 2 writes file B
At this point, file B may be dependent on file A. If the write-caching strategy can reverse the commit order of these operations, meaning to commit file B before file A, these processes are exposed to possible inconsistency in the event of a crash.
By commiting the sequence of write operations atomically there is no exposure to inconsistency. It is still possible for the write-caching strategy to write these files in any order, as long as the commit mechanism realizes that both writes must complete for the transaction to commit successfully. This means that standard disk-scheduling techniques such as the elevator algorithm are not ruled out by atomicity requirements.
A transcrash is a set of operations, of which all or none must survive a crash. An atom maintains the collection of data blocks that a transcrash has attempted to modify along with all data blocks of other atoms that fused with it. Two atoms fuse when one transcrash attempts to read or write data blocks that are part of another atom./
There are two types of transcrash: read-write fusing, and write-only fusing. A write-only-fusing transcrash by default only causes atoms to fuse together as it writes to data blocks outside its own atom. A read-write-fusing transcrash causes atoms to fuse together whenever it reads or writes data blocks outside its own atom. One may always specify within a write-only-fusing transcrash that a specific operation is read-fusing. Put another way, read-write-fusing transcrashes assume there is read dependency whereas write-only-fusing transcrashes support explicit read dependency.
A block-capture request is the underlying mechanism used to dynamically associate transcrashes, data blocks, and atoms together. Initially, transcrashes and data blocks have no associated atom. When a block-capture request specifies a transcrash and block belonging to different atoms, those atoms are fused together (subject to a few restrictions discussed later).
Persons familiar with the database literature will note that these definitions do not imply isolation or serializability between processes. Isolation requires the ability to undo a sequence of operations when lock conflicts cause a deadlock to occur.
Rollback is the ability to abort and undo the effects of the operations in an uncommitted transcrash. Transcrashes do not provide isolation, which is needed to support separate rollback of separate transcrashes. We only support unified rollback of all transcrashes in progress at the time of crash recovery. However, our architecture is designed to support separate, concurrent atoms so that it can be expanded to implement fully isolated transactions in the future.
Currently, the only reason a transcrash will be aborted is due to a system crash. The system cannot individually abort a transcrash, and this means that transcrashes are only made available to trusted plugins inside the kernel. Once we have implemented isolation it will be possible for untrusted applications to access the transcrash interface for creating (only) isolated transcrashes.
Stage One: Capturing and Fusing
The initial stage starts when an atom begins. The beginning of an atom is controlled by the transaction manager itself, but the event is always triggered by a block-capture request.
A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, which means it has reached a state where it will be completed even if there is a crash. The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner.
The relocatable set is the set of blocks that have a dirty parent in the atom.
The relocate set is those members of the relocatable set that we choose to relocate rather than overwrite. Whether we relocate or overwrite is a decision made for performance reasons. By writing the relocate set to different locations we avoid writing a second copy of each block to the log. When the current location of a block is its optimal location, relocation is a possible cause of file system fragmentation. We discuss relocation policies in a later section.
The overwrite set contains all dirty blocks not in the relocate set (i.e., those which do not have a dirty parent and those for which overwrite is the better policy). A wandered copy of each overwrite block is written as part of the log before the atom commits and a second write replaces the original contents after the atom commits. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.
(An alternative definition is the minimum overwrite set, which uses the same definition as above with the following modification. If at least three dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. The parent's dirty children are removed from the overwrite set and placed in the relocate set. This optimization will be saved for a later version.)
The system responds to memory pressure by selecting dirty blocks to be flushed. When dirty blocks are written during stage one it is called early flushing because the atom remains uncommitted. When early flushing is needed we only select blocks from the relocate set because their buffers can be released, whereas the overwrite set remain pinned in memory until after the atom commits.
We must enforce that atoms make progress so they can eventually commit. An atom can only commit when it has no open transcrashes, but allowing atoms to fuse allows open transcrashes to join an existing atom which may be trying to commit.
For this reason, an age is associated with each atom and when an atom reaches expiration it begins actively flushing to disk. An expired atom takes steps to avoid new transcrashes prolonging its lifetime: (1) an expired atom will not accept any new transcrashes and (2) non-expired atoms will block rather than fuse with an expired atom. An expired atom is still allowed to fuse with any other stage-one atom to avoid stalling expired atoms.
Once an expired atom has no open transcrashes it is ready to close, meaning that it is ready to begin commit processing. All repacking, balancing, and allocation tasks have been performed by this point.
Applications that are required to wait for synchronous commit (e.g., using fsync()) may have to wait for a lot of unrelated blocks to flush since a large atom may have captured the bitmaps. We will only provide an interface for lazy transcrash commit that closes a transcrash and waits for it to commit. An application that would like to synchronize its data as early as possible would perhaps benefit from logical logging, which is not currently supported by our architecture, or NVRAM.
To finish stage one we have:
- The in-memory free space bitmaps have been updated such that the new relocate block locations are now allocated.
- The old locations of the relocate set and any blocks deleted by this atom are not immediately deallocated as they cannot be reused until this atom commits. We must maintain two bitmaps: commit_bitmap is logged to disk as part of the overwrite set prior to commit, and working_bitmap is the working in-memory copy. In the working_bitmap the old locations of the relocate set and deleted blocks are not deallocated until after commit.
- Each atom collects a data structure representing its deallocate set, which is a list of the blocks it must deallocate once it commits. The deallocate set can be represented in a number of ways: as a list of block locations, a set of bitmaps, or using extent-compression. We expect to use a bitmap representation in our first implementation. Regardless of the representation, the deallocate set data structure is included in the commit record of this atom where it will be used during crash recovery. The deallocate set is also used after the atom commits to update the in-memory bitmaps.
- Wandered locations are allocated for the overwrite set and a list of the association between wandered and real overwrite block locations for this atom is included in the commit record.
- The final commit record is formatted now, although it is not needed until stage three.
Stage Two: Completing Writes
At this stage we begin to write the remaining dirty blocks of the atom. Any blocks that were captured and never modified can be released immediately, since they do not take part in the commit operation. To "release" a block means to allow another atom to capture it freely. Relocate blocks and overwrite blocks are treated separately at this point. Relocate Blocks
A relocate block can be released once it has been flushed to disk. All relocate blocks that were early-flushed in stage one are considered clean at this point, so they are released immediately.
The remaining non-flushed relocate blocks are written at this point. Now we consider what happens if another atom requests to capture the block while the write request is being serviced.
A read-capture request is granted just as if the block did not belong to any atom at this point—it is considered clean despite belonging to a not-yet-committed atom. The only requirement on this interaction is that no atom can jump ahead in the commit ordering. Atoms must commit in the order that they reach stage two or else read-capture from a non-committed atom must explicitly construct and maintain this dependency.
A write-capture request can be granted by copying the block. This introduces the first major optimization called copy-on-capture. The capturing process assumes control of the block, and the committing atom retains an anonymous copy. When the write request completes, the anonymous copy is released (freed). Copy-on-capture is an optimization not performed in ReiserFS version 3 (which creates a copy of each dirty page at commit), but in that version the optimization is less important because the copying does not apply to unformatted nodes.
If a relocate block-write finishes before the block is captured it is released without further processing. Despite releasing relocate blocks in stage two, the atom still requires a list of old relocate block locations for deallocation purposes.
The overwrite blocks (including modified bitmaps and the superblock) are written at this point to their wandered locations as part of the log. Unlike relocate blocks, overwrite blocks are still needed after these writes complete as they must also be written back to their real location.
Similar to relocate blocks, a read-capture request is granted as if the block did not belong to any atom.
A write-capture request is granted by copying the block using the copy-on-capture method described above.
One issue with the copy-on-capture approach is that it does not address the use of memory-mapped files, which can have their contents modified at any point by a process. One answer to this is to exclude mmap() writers from any atomicity guarantees. A second alternative is to use hardware-level copy-on-write protection. A third alternative is to unmap the mapped blocks and allow ordinary page faults to capture them back again.
Stage Three: Commit
When all of the outstanding stage two disk writes have completed, the atom reaches stage three, at which time it finally commits by writing its commit record to the log. Once this record reaches the disk, crash recovery will replay the transaction. Stage Four: Post-commit Disk Writes The fourth stage begins when the commit record has been forced to the log.
Overwrite blocks need to be written to their real locations at this point, but there is also an ordering constraint. If a number of atoms commit in sequence that involve the same overwrite blocks, they must be sure to overwrite them in the proper order. This requires synchronization for atoms that have reached stage four and are writing overwrite blocks back to their real locations. This also suggests the second major optimization potential which is labeled steal-on-capture.
The steal-on-capture optimization is an extension of the copy-on-capture optimization that applies only to the overwrite set. The idea is that only the last transaction to modify an overwrite block actually needs to write that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction.
With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing-atom commits.
When an overwrite block-write finishes the block is released without further processing.
April 2002 note: We are revising our strategy for bitmap handling.
The deallocate set can be deallocated in the in-memory bitmap blocks at this point. The bitmap modifications are not considered part of this atom (since it has committed). Instead, the deallocations are performed in the context of a different stage-one atom (or atoms). We call this process repossession, whereby a stage-one atom assumes responsibility for committing bitmap modifications on behalf of another atom in time.
For each bitmap block with pending deallocations in this stage, a separate stage-one atom may be chosen to repossess and deallocate blocks in that bitmap. This avoids the need to fuse atoms as a result of deallocation. A stage-one atom that has already captured a particular bitmap block will repossess for that block, otherwise a new atom can be selected.
For crash recovery purposes, each atom must maintain a list of atoms for which it repossesses bitmap blocks. This repossesses for list is included in the commit record for each atom. The issue of crash recovery and deallocation will be treated in the next section.
Stage Five: Commit Completion
When all of the outstanding stage four disk writes are complete and all of the atoms that stole from this atom commit, the atom no longer needs to be replayed during crash recovery—the overwrite set is either completely written or will be completely written by replaying later atoms. Before the log space occupied by this atom can be reclaimed, however, another topic must be discussed.
Wandered Overwrite-Block Allocation
Overwrite blocks were written to wandered locations during stage two. Wandered block locations are considered part of the log in most respects—they are only needed for crash recovery of an atom that completes stage three but does not complete stage five. In the simplest approach, wandered blocks are not allocated or deallocated in the ordinary sense, instead they are appended to a cyclical log area. There are some problems with this approach, especially when considering LVM configurations: (1) the overwrite set can be a bottleneck because it is entirely written to the same region of the logical disk and (2) it places limits on the size of the overwrite set.
For these reasons, we allow wandered blocks to be written anywhere in the disk, and as a consequence we allocate wandered blocks in stage one similarly to the relocate set. For maximum performance, the wandered set should be written using a sequential write. To achieve sequential writes in the common case, we allow the system to be configured with an optional number of areas specially reserved for wandered block allocation. In an LVM configuration, for example, reserved wandered block areas can be spread throughout the logical disk space to avoid any single disk being a bottleneck for the wandered set.
Wandered block locations still need to be deallocated with this approach, but we must prevent replay of the atom's overwrites before these blocks can be deallocated. At this point (stage five), a log record is written signifying that the atom should not have overwrites be replayed.
Stage Six: Deallocating Wandered Blocks
Once the do-not-replay-overwrites record for this atom has been forced to the log, the wandered block locations are deallocated using repossession, the same process used for the deallocate set.
At this point, a number of atoms may have repossessed bitmap blocks on behalf of this atom, for both the deallocate set and the wandered set. This atom must wait for all of those atoms to commit (i.e., reach stage four) before the log can wrap around and destroy this atom's commit record. Until such point, the atom is still needed during crash recovery because its deallocations may be incomplete.
This completes the life of an atom. Now we must discuss several special topics.
The file system must be able to ensure that there are adequate disk space reserves to complete all active transactions. Since the previous contents of modified blocks are preserved until a transaction commits, the transaction must reserve one block of free disk space for every block it modifies.
Ordinarily, it would be possible to simply fail an operation that cannot reserve enough free space to complete. Such a failure leaves the transaction in a state where it cannot likely make further progress. With isolated transactions, it is possible to simply abort the transaction at this point, but another solution is needed to handle this situation without isolated transactions. There are several possible solutions:
- Explicit space reservation — allow the transaction to pre-reserve the amount of space it intends to use. The application makes calls to an interface that reserves the required space. The call to reserve space may fail, so the application should only request a reservation at points where it is possible to recover a consistent state without exceeding the previous reservation. This is the only general purpose solution until there is support for isolated transactions. The other solutions are best avoided.
- Allow operations to fail when space reserves are exceeded. This presents possible file system inconsistency because it may not be possible to recover a consistent state.
- Crash the system. To avoid inconsistency the entire system can be artificially crashed, effectively aborting every non-committed atom in the system.
- Crash just one atom. It is possible to abort a non-committed atom without taking down the entire system, but this has extreme implications. Every process that has taken part in the atom is effected by this act, not just the transcrash that has exceeded its reservation.
We will implement explicit space reservation, but there is always the possibility that an application exceeds its own reservation, forcing us to use at least one of the other solutions as a backup measure. Space reservation is a service agreement between the transaction manager and the application, and as long as the application stays within its reservation it can expect to complete its transactions without failure or crashing.
To achieve some room for error, we will maintain emergency space reserves, disk space reserved for applications that make incorrect explicit reservations. This is an attempt to prevent faulty applications from failing or bringing down the system. The use of emergency space reserves will be reported to the system log so that faulty applications can be corrected. Note that these measures will not in general protect against attack: a malicious user could exploit a faulty application to bring down the system or compromise data integrity.
All of these options will be configurable on a per-file system basis:
- (1) how much emergency space to reserve (e.g., 5% of disk space) and
- (2) whether to fail the operation or crash the system when reserves are exceeded.
Write Atomicity Options
The transcrash interface provides the application with the ability to make a entire sequence of operations atomic, including all write() system calls. Even unmodified applications use a transcrash internally for each system call to protect file system consistency, but this requires special treatment for the write() system call.
Atomically writing a large buffer over pre-existing contents requires a large space reservation, reservation that is not required by the write semantics (this does not apply to create or append). It is acceptable for the system to break a large write into smaller atomic units to reduce space reservation requirements.
We will provide a per-file system option to limit the size of atomic writes when they are performed outside the scope of an existing transaction (i.e., when the system starts a transcrash internally to protect consistency). This allows the system administrator to choose atomic writes up to some size (the space reservation requirement), beyond which writes will be broken into smaller atomic units.
Crash Recovery Algorithm
April 2002 note: We are revising our strategy for crash recovery of bitmaps.
Some atoms may not be completely processed at the time of a crash. The crash recovery algorithm is responsible for determining what steps must be taken to make the file system consistent again. This includes making sure that:
- (1) all overwrites are complete and
- (2) all blocks have been deallocated.
We avoid discussing potential optimizations of the algorithm at present, to reduce complexity. Assume that after a crash occurs, the recovery manager has a way to detect the active log fragment, which contains the relevant set of log records that must be reprocessed. Also assume that each step can be performed using separate forward and/or reverse passes through the log. Later on we may choose to optimize these points.
Overwrite set processing is relatively simple. Every atom with a commit record found in the active log fragment, but without a corresponding do-not-replay record, has its overwrite set copied from wandered to real block locations. Overwrite recovery processing should be complete before deallocation processing begins.
Deallocation processing must deal separately with deallocation of the deallocate set (from stage four—deleted blocks and the old relocate set) and the wandered set (from stage six). The procedure is the same in each case, but since each atom performs two deallocation steps the recovery algorithm must treat them separately as well.
The deallocation of an atom may be found in three possible states depending on whether none, some, or all of the deallocate blocks were repossessed and later committed. For each bitmap that would be modified by an atom's deallocation, the recovery algorithm must determine whether a repossessing atom later commits the same bitmap block.
For each atom with a commit record in the active log fragment, the recovery algorithm determines: (1) which bitmap blocks are committed as part of its overwrite set and (2) which bitmap blocks are affected by its deallocation. For every committed atom that repossesses for another atom, the committed bitmap blocks are subtracted from the deallocate-affected bitmap blocks of the repossessed-for atom. After performing this computation, we know the set of deallocate-affected blocks that were not committed by any repossessing atoms; these deallocations are then reapplied to the on-disk bitmap blocks. This completes the crash recovery algorithm.
Relocation and Fragmentation
As previously discussed, the choice of which blocks to relocate (instead of overwrite) is a policy decision and, as such, not directly related to transaction management. However, this issue affects fragmentation in the file system and therefore influences performance of the transaction system in general. The basic tradeoff here is between optimizing read and write performance.
The relocate policy optimizes write performance because it allows the system to write blocks without costly seeks whenever possible. This can adversely affect read performance, since blocks that were once adjacent may become scattered throughout the disk.
The overwrite policy optimizes read performance because it attempts to maintain on-disk locality by preserving the location of existing blocks. This comes at the cost of write performance, since each block must be written twice per transaction.
Since system and application workloads vary, we will support several relocation policies:
- Always Relocate: This policy includes a block in the relocate set whenever it will reduce the number of blocks written to the disk.
- Never Relocate: This policy disables relocation. Blocks are always written to their original location using overwrite logging.
- Left Neighbor: This policy puts the block in the nearest available location to its left neighbor in the tree ordering. If that location is occupied by some member of the atom being written it makes the block a member of the overwrite set, otherwise the policy makes the block a member of the relocate set. This policy is simple to code, effective in the absence of a repacker, and will integrate well with an online repacker once that is coded. It will be the default policy initially.
Much more complex optimizations are possible, but deferred for a later release.
Unlike WAFL, we expect the use of a repacker to play an important role.
Meta-data journaling is a restricted operating mode in which only file system meta-data are subject to atomicity constraints. In meta-data journaling mode, file data blocks (unformatted nodes) are not captured and therefore need not be flushed as the result of transaction commit. In this case, file data blocks are not considered members of the either relocate or the overwrite set because they do not participate in the atomic update protocol—memory pressure and age are the only factors that cause unformatted nodes to be written to disk in the meta-data journaling mode. This mode is expected to be mostly of academic interest.
Bitmap Blocks Special Handling
Reiser4 allocates temporary blocks for wandered logging. That means we have a difference between the commit bitmap block content, which is what we should restore after a system crash, and the working bitmap block content which is used for free block search/allocation. (The changes to the bitmap are logged data that we write to disk at atom commit.)
We keep each bitmap block in memory in two versions: one for the WORKING BITMAP, and another one for the COMMIT BITMAP.
The working bitmap is used just for searching for free blocks, if their bits are not set in the working bitmap, the corresponding blocks can be allocated. The working bitmap gets updated at every block allocation. The commit bitmap reflects changes which are done in already committed atoms or in the atom which is currently being committed (we assume that atom commits are serialized, and only one atom can be committed in one period of time). The commit bitmap is updated at every atom commit. No bitmap data conversion (WORKING -> COMMIT) is needed, we only update the COMMIT bitmap at each transaction commit.
We should note that block allocation/deallocation does not touch COMMIT BITMAP until one atom reaches commit stage. At that stage we apply the atom's changes which were made during the transaction. We take deallocated block numbers from atom's deleted set and we take freshly allocated block numbers from the atom's captured lists, we apply those changes to the commit bitmap before we write modified commit bitmap blocks to disk. After applying the changes, the commit bitmap blocks are added to transaction as usual (try_capture() is called).
Having two bitmaps in memory gives us a great advantage because it allows one particular bitmap block handling optimization. The optimization is that we can allow several independent atoms to modify one bitmap block.
Any number of atoms are allowed to allocate new blocks in any bitmap block without capturing it. Block deallocation should be deferred until atom finishes commit stage (one reason for this is the elimination of unnecessary dependence between atoms). This means that unnecessary atom fusion could be avoided. We can keep atoms independent as long as they touch different data blocks and different internal nodes (in principle we could keep atoms independent even if they touch the same non-data internal node blocks, but this would require logical versioning rather than simple node versioning, and in our implementation we do simple node versioning except for bitmap blocks).
Another hot spot in the reiser4 filesystem is the super block, which contains the free blocks counter. A similar technique should be applied to allow several atoms to modify the free blocks counter. In general, points of high contention between multiple atoms can benefit from logical versioning rather than node versioning, and as the system matures more flavors of logical versioning will be added.
- Hitz94, Dave Hitz, James Lau, and Michael Malcolm, "File System Design for an NFS File Server Appliance", Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, January 1994, 235-246.