Revision as of 14:23, 14 July 2017
Reiser4 is a software framework for creating, assembling and customizing file systems that manage local (simple or logical) storage volumes of an operating system. Reiser4 is the successor of ReiserFS (also known as ReiserFS version 3). Reiser4 absorbed the results of academic research in the area of data storage, conducted since 1992 by engineers of Namesys labs in collaboration with Moscow State University and the Program Systems Institute of the Russian Academy of Sciences. For historical reasons Reiser4 currently works only on Linux. However, thanks to its modular infrastructure it can be easily ported to other operating systems.
History of Reiser4
Namesys was created by Hans Reiser in approximately 1993 from some of the last graduates trained under the old education system of the Soviet Union (USSR). By the beginning of the new century Namesys engineers had accumulated a number of innovative ideas in the area of data storage software. However, it was rather problematic to implement them within the then-existing ReiserFS (v3), mostly because of design problems. ReiserFS (v3) also had a number of shortcomings that were hard to fix for the same reasons. So a decision was made to develop a new file system from scratch, which was supposed to absorb the experience of previous developments. In 2002 Namesys got a grant from DARPA for this. Reiser4 development was also sponsored by Linspire. However, in commercial terms Namesys was not successful, and this eventually led to financial problems. Since the arrest of Hans Reiser (in October 2006) Reiser4 has been maintained by Edward Shishkin, a former Namesys employee, mathematician and programmer with a Ph.D. Development currently continues on a non-commercial basis. In this development mode Reiser4 acquired stability and a number of new features, such as modules for transparent compression (announced in 2007), different transaction models (Journaling, Write-Anywhere (COW), and Hybrid), precise asynchronous discard support for SSD drives, metadata and inline-data checksums, failover, etc. A new Reiser4 disk format version (4.0.1) was released (*).
Reiser4 and upstream
In contrast with its predecessor (ReiserFS, v3), Reiser4 was not accepted into the upstream Linux kernel for political reasons. Later Edward Shishkin expressed an interest (*) in porting Reiser4 to other operating systems, in particular to FreeBSD, which is, from his standpoint, "more open to academic researches". Given this, it would be illogical to expect Reiser4 to be tightly integrated with any particular operating system. Thus, Reiser4 is developed as a standalone, independent project (*). The archive of ports for upstream Linux kernels can be found at the project's sites on GitHub and SourceForge (*).
Efficiency of disk space usage
Reiser4 provides the most efficient disk space usage among all file systems in all scenarios and workloads. In particular, on Compilebench ((c) Oracle) Reiser4 shows disk space efficiency 50 times (5000%) better than ext4, and 12 times (1200%) better than Btrfs with compression. The problem of internal fragmentation in Reiser4 is completely resolved by a special technique of liquid records (or virtual keys). This means that for any fraction 0 < Q < 1, every keyed record in the tree can be split into 2 parts in the proportion Q, with unique keys quickly allocated for both parts. Such a split (as well as a merge) is performed by plugins of the ITEM interface when packing data and metadata into tree nodes at flush time (just before writing to disk).
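The split-and-merge idea can be illustrated with a minimal sketch. It assumes, purely for illustration, that a key is a pair (object id, byte offset) and that the key of the second part is derived by adding the cut position to the offset. The names Key, Record, split_record and merge_records are hypothetical and are not taken from the Reiser4 sources.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Key:
    # Hypothetical key layout: object id plus byte offset inside the object.
    objectid: int
    offset: int


@dataclass(frozen=True)
class Record:
    key: Key
    body: bytes


def split_record(rec: Record, q: float) -> Tuple[Record, Record]:
    """Split rec so that the head holds roughly the fraction q of the body."""
    if not 0 < q < 1:
        raise ValueError("q must be strictly between 0 and 1")
    cut = int(len(rec.body) * q)
    head = Record(rec.key, rec.body[:cut])
    # The tail key is computed arithmetically from the head key,
    # so no search for an unused key is needed.
    tail = Record(Key(rec.key.objectid, rec.key.offset + cut), rec.body[cut:])
    return head, tail


def merge_records(head: Record, tail: Record) -> Record:
    """Inverse of split_record: glue two adjacent records back together."""
    expected = Key(head.key.objectid, head.key.offset + len(head.body))
    if tail.key != expected:
        raise ValueError("records are not adjacent in key space")
    return Record(head.key, head.body + tail.body)
```

Because both keys follow from the original key and the cut position, a record can be cut wherever the free space of a node ends, which is what lets nodes be packed without leaving unused tails.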
Reiser4 structure. Plugins. Heterogeneity in time and in space
A file system, as a complicated subsystem of a modern OS, is the one most exposed to creeping featurism driven by progress in hardware and software technologies. To resist this problem, Namesys engineers decided to develop not simply a file system but a whole software framework providing a reusable environment. Reiser4 has two separate code bases: one for the kernel module and one for the user-space utilities. Both have a modular infrastructure. This means that every computation in the context of the file system (or a user-space utility) looks like the execution of a module of some interface. The Reiser4 kernel module has an abstract base, which is a directed acyclic graph (DAG) of interfaces. Every vertex of that graph represents an interface, and every directed edge represents a client-supplier relationship between interfaces. All top interfaces of the graph are suppliers for the VFS. Every interface of the graph is implemented by one or more modules. At file system creation time the user can specify those modules depending on the type of workload and storage media. For some interfaces it is possible to switch to another module at any moment (such switching is usually accompanied by a corresponding conversion of run-time objects). Finally, for some interfaces Reiser4 performs such switching automatically (without user intervention). Thus, the file system is in permanent evolution, adapting to current conditions.
For historical reasons Reiser4 modules are called plugins. Heterogeneity means the ability to choose different modules of the same interface to manage different objects of the same type. For example, bodies of small and large files in Reiser4 are managed by different plugins of the ITEM interface (bodies of small files are packed into formatted blocks, whereas bodies of large files are stored as a number of unformatted blocks (extents)). Another example is compressed and non-compressed files, which are managed by different plugins of the FILE interface. Heterogeneity in time means the ability to switch an object to a different managing plugin (thus, one can switch to whichever plugin is preferable at a given moment). Heterogeneity in space means the ability to assign different plugins to different components of the same compound object (e.g. a logical volume). In addition, the modular design makes it possible to safely add new features whose emergence is driven by the continuous development of hardware storage technologies.
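A minimal sketch of heterogeneity in time is given below. It models one interface with two interchangeable plugins and an object that is switched from one to the other when it grows. The class names, the 4096-byte threshold and the block size are assumptions made for illustration only and do not reproduce the Reiser4 plugin API.

```python
BLOCK_SIZE = 4096        # assumed block size, for illustration only
SMALL_FILE_LIMIT = 4096  # hypothetical switching threshold


class ItemPlugin:
    """Common interface: every plugin knows how to lay out a file body."""
    name = "abstract"

    def layout(self, body: bytes) -> str:
        raise NotImplementedError


class PackedItem(ItemPlugin):
    """Small bodies are packed into formatted blocks."""
    name = "packed"

    def layout(self, body: bytes) -> str:
        return f"{len(body)} bytes packed into a formatted block"


class ExtentItem(ItemPlugin):
    """Large bodies are stored as unformatted blocks (extents)."""
    name = "extent"

    def layout(self, body: bytes) -> str:
        blocks = -(-len(body) // BLOCK_SIZE)  # ceiling division
        return f"{blocks} unformatted block(s)"


class FileObject:
    def __init__(self) -> None:
        self.body = b""
        self.plugin: ItemPlugin = PackedItem()

    def write(self, data: bytes) -> None:
        self.body += data
        # Heterogeneity in time: pick the plugin that suits the object now
        # and convert the object if the choice has changed.
        wanted = PackedItem if len(self.body) <= SMALL_FILE_LIMIT else ExtentItem
        if not isinstance(self.plugin, wanted):
            self.plugin = wanted()
```

The caller only ever talks to the ItemPlugin interface, so the switch is invisible to it. Two FileObject instances using different plugins at the same moment illustrate heterogeneity in general.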
Atomicity of operations
Atomicity means that file system operations either occur entirely or do not occur at all, and never leave the file system corrupted by occurring only halfway. All operations in Reiser4 except long writes are atomic. In the case of long writes Reiser4 is forced to close transactions in order to free dirty pages in response to memory pressure notifications, and to reopen them for the rest of the user's data. So long writes in Reiser4 are split into a number of atomic writes. The maximal length of an atomic write depends on the file plugin. Edward Shishkin suggested a design for full atomicity (atomic writes of any length) in the Write-Anywhere transaction model, where an atom can be flushed without closing a transaction.
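The effect of splitting a long write can be sketched as follows. This toy model assumes a fixed maximal atomic length and simulates a crash after a given number of committed transactions. The function names and the crash_after parameter are hypothetical, and real transaction boundaries depend on memory pressure and the file plugin rather than on a constant.

```python
from typing import List, Optional


def split_long_write(data: bytes, max_atomic_len: int) -> List[bytes]:
    """Cut a long write into pieces that each fit into one transaction."""
    if max_atomic_len <= 0:
        raise ValueError("max_atomic_len must be positive")
    return [data[i:i + max_atomic_len]
            for i in range(0, len(data), max_atomic_len)]


def write_with_crash(data: bytes, max_atomic_len: int,
                     crash_after: Optional[int] = None) -> bytes:
    """Return what is on disk if the system crashes after
    crash_after committed transactions (None means no crash)."""
    on_disk = bytearray()
    for n, chunk in enumerate(split_long_write(data, max_atomic_len)):
        if crash_after is not None and n >= crash_after:
            break
        # Each chunk is committed as a whole or not at all.
        on_disk += chunk
    return bytes(on_disk)
```

In this model a crash can leave only a prefix made of whole chunks on disk, never a partially written chunk.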
Different transaction models
Reiser4 offers different transaction models, so at mount time the user can choose the one that best suits the type of storage media and workload. The Journaling transaction model is recommended for HDD devices, as it does not lead to avalanche-like external fragmentation, which degrades performance on rotating media. The Write-Anywhere transaction model is recommended for SSD devices, for which external fragmentation is not critical. In this transaction model the number of IO requests issued by the file system is minimal (it does not write to a journal and then overwrite blocks on disk), which is also important for SSD devices. Reiser4 also offers a unique "hybrid" transaction model, which provides a strong invariant: a parent-first order on the storage tree nodes in terms of disk addresses. This transaction model is recommended for HDD users who do not perform a huge number of random overwrites. In this transaction model one part of an atom's dirty pages (the overwrite set) is committed via the journal, and another part (the relocate set) is written to a different location on disk. All other file systems offer only one hardcoded transaction model: either only journaling (ext4, XFS, JFS, etc.) or only write-anywhere, AKA "copy-on-write" (ZFS, Btrfs, etc.).
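The difference in IO volume between the three models can be made concrete with a rough count of block writes per atom. The sketch assumes that a journaled block costs two writes (one to the journal, one in place) and a relocated block costs one. It ignores commit records and other bookkeeping, and the names TxModel and block_writes are hypothetical.

```python
from enum import Enum


class TxModel(Enum):
    JOURNAL = "journal"
    WRITE_ANYWHERE = "write-anywhere"
    HYBRID = "hybrid"


def block_writes(model: TxModel, dirty: int, overwrite: int = 0) -> int:
    """Estimate block writes needed to commit an atom with `dirty` pages.

    `overwrite` is the size of the overwrite set and is used only by the
    hybrid model; the remaining pages form the relocate set.
    """
    if dirty < 0 or not 0 <= overwrite <= dirty:
        raise ValueError("need 0 <= overwrite <= dirty")
    if model is TxModel.JOURNAL:
        return 2 * dirty                 # journal copy + in-place write
    if model is TxModel.WRITE_ANYWHERE:
        return dirty                     # one write to a new location
    relocate = dirty - overwrite
    return 2 * overwrite + relocate      # mixed: journaled + relocated
```

For an atom of 100 dirty pages of which 30 fall into the overwrite set, this estimate gives 200 writes for journaling, 100 for write-anywhere and 130 for the hybrid model.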
Three-level block allocator
The first (lowest) level implements a map of free space (currently Reiser4 supports only a bitmap, but it can also be implemented as a tree of extents). A tight relationship between block allocation and transaction models was discovered at Namesys labs and implemented in Reiser4: the second level implements a transaction model. On this level the block allocator decides whether a dirty page will be written to its old place or will get a new location on disk. The third (highest) level implements allocation policies in 2 contexts (forward and reverse) for the whole locality of a specified node.
Off-line file system check
Any corrupted Reiser4 volume can be repaired off-line by a special user-space utility, fsck.reiser4, which is part of reiser4progs. Fsck.reiser4 performs 3 passes. In the first pass it checks the integrity of the basic data structure (the tree). In the second pass fsck scans the twig level and checks the integrity of the extent regions (contiguous regions on disk where bodies of large files are stored). In the third (semantic) pass, fsck scans the leaf level and checks the integrity of "semantic" objects (directories, regular files, symlinks, etc). Fsck.reiser4 absorbed the development experience of its predecessor reiserfsck. In particular, fsck.reiser4 is free from a shortcoming inherent to reiserfsck, whose rebuilding process gets confused by ReiserFS (v3) images stored in the volume being repaired.
Precise Asynchronous Discard support for SSD drives
In contrast with other file systems, Reiser4 does not simply inform the block layer about extents being freed on disk. For every such extent Reiser4 checks whether the head and tail of the respective erase units are free in the map of free disk space. If so, Reiser4 issues discard requests for the larger regions. This policy does not lead to an accumulation of "not discarded garbage" on disk, and hence there is no need to periodically run tools like fstrim, which scan the disk and issue discard requests for such "garbage". In Reiser4 issuing discard requests is a delayed action, performed on a per-atom basis at transaction commit time. This reduces the number of discard commands (because extents that need to be discarded are merged).
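The head and tail check can be sketched as follows. The free-space map is modelled as a list of booleans (True means the block is free), and an extent as a pair (start, length) in blocks. The function names are hypothetical. Leaving an edge unchanged when its padding is still in use is an assumption of this sketch, since the text does not say what is done in that case.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start block, length in blocks)


def precise_discard_extent(free_map: List[bool], start: int, length: int,
                           erase_unit: int) -> Extent:
    """Grow a freed extent to erase-unit boundaries where the padding is free."""
    end = start + length                              # exclusive
    head = (start // erase_unit) * erase_unit         # round down
    tail = min(-(-end // erase_unit) * erase_unit,    # round up
               len(free_map))
    # all() over an empty slice is True, so aligned edges pass trivially.
    if all(free_map[head:start]):
        start = head
    if all(free_map[end:tail]):
        end = tail
    return start, end - start


def merge_extents(extents: List[Extent]) -> List[Extent]:
    """Merge touching or overlapping extents collected for one atom."""
    merged: List[Extent] = []
    for s, l in sorted(extents):
        if merged and s <= merged[-1][0] + merged[-1][1]:
            ps, pl = merged[-1]
            merged[-1] = (ps, max(ps + pl, s + l) - ps)
        else:
            merged.append((s, l))
    return merged
```

With an erase unit of 4 blocks, freeing blocks 5 and 6 while block 4 is already free and block 7 is in use grows the discard region to blocks 4 to 6. Once block 7 is free too, the whole erase unit 4 to 7 is discarded.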