Welcome to the Reiser4 Wiki, the Wiki for users and developers of the ReiserFS and Reiser4 filesystems.

For now, most of the documentation is just a snapshot of the old Namesys site (archive.org, 2007-09-29).

There was also a Reiser4 Wiki (archive.org, 2007-07-06) once on pub.namesys.com.


From Reiser4 FS Wiki
(Difference between revisions)
Jump to: navigation, search
m (.)
(containers linked)
Line 318: Line 318:
Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is.
Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data [[containers|container]] that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is.
Items allow us to not have to round up to 4k the amount of space required to store an object.
Items allow us to not have to round up to 4k the amount of space required to store an object.
Line 725: Line 725:
* Defining a pluginid.
* Defining a pluginid.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
* Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
* Defining a set of items that act as the storage [[containers]] of the object, or reusing existing items from other plugins (e.g. regular files).
* Implementing item handlers for all of the new items you create.
* Implementing item handlers for all of the new items you create.
* Creating a key assignment algorithm for all of the new items.
* Creating a key assignment algorithm for all of the new items.

Latest revision as of 21:59, 25 August 2016

 This document has been retrieved from archive.org in its version from 2006-11-13.
It was written by its respective author(s) and not by the author(s) of this article.
Please apply only formatting changes, adding or correcting sources and maybe
spelling- and punctuation fixes to this document. Thanks!

Reasons why Reiser4 is great for you:

  • Reiser4 is the fastest filesystem, and here are the benchmarks.
  • Reiser4 is an atomic filesystem, which means that your filesystem operations either entirely occur, or they entirely don't, and they don't corrupt due to half occuring. We do this without significant performance losses, because we invented algorithms to do it without copying the data twice.
  • Reiser4 uses dancing trees, which obsolete the balanced tree algorithms used in databases (see farther down). This makes Reiser4 more space efficient than other filesystems because we squish small files together rather than wasting space due to block alignment like they do. It also means that Reiser4 scales better than any other filesystem. Do you want a million files in a directory, and want to create them fast? No problem.
  • Reiser4 is based on plugins, which means that it will attract many outside contributors, and you'll be able to upgrade to their innovations without reformatting your disk. If you like to code, you'll really like plugins....
  • Reiser4 is architected for military grade security. You'll find it is easy to audit the code, and that assertions guard the entrance to every function.

V3 of ReiserFS is used as the default filesystem for SuSE, Lindows, FTOSX, Libranet, Xandros and Yoper. We don't touch the V3 code except to fix a bug, and as a result we don't get bug reports for the current mainstream kernel version. It shipped before the other journaling filesystems for Linux, and is the most stable of them as a result of having been out the longest. We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3.


[edit] Software Engineering Based Reiser4 Design Principles

[edit] Equal Source Code Access Is A Civil Right

Copyright and patent laws were invented to give you an incentive to share your knowledge with the rest of the world in return for a limited time monopoly on what you shared. That is not the way it works with software though, because software companies are allowed to keep their source code secret, but are still given monopoly rights over their software. There is little meaningful sharing of knowledge when binaries only are shared with the world, and all the rest is kept as a secret. The reasons for the existence of copyright and patent laws have been forgotten, their workings have been twisted, and greed and turf defense are what remain of them. Monopoly interests have taken laws intended to promote progress in the arts and sciences, and now use them to to further their own control over us by ensuring that innovations not theirs cannot enter the market for improvements to software.

Think of software objects as forming a society, not yet at the level of an AI society, but still a group of programs interacting, and choosing whether to interact, with each other. Think of social lockout, whether it be in the form of racial discrimination as in the civil rights movement, Mercantilism as happened a few centuries ago, or the endless other forms of division in human society. Is it so surprising that this evil casts its shadow on cyberspace? Is it so surprising that our cybershadows also find ways to engage in social lockout of others? Most of the cyber-world of software lives under tyranny today.

We are part of a movement to create a free cyber-world we can all participate equally in. Namesys does not oppose copyright laws as they were invented (14 year monopolies which disclosed everything that was temporarily monopolized), it opposes copyright laws as they have been twisted. Namesys opposes unlimited time monopolies which disclose nothing, and lockout all other inventors.

Many others in this movement are opposed to copyright law, even the version of it in which it was first created. We feel they are not acknowledging that a trade-off is being made, and that this trade-off has value.

Yet still we choose to give our software away for free for use with software that is given away for free (e.g. Gnu/Linux). Since we don't have a lot of illusions about our ability to entirely change the world, and it is amusing to sell free software, for those who do not want to disclose their software and do not want to give it away for free, we charge a license fee and let them keep their improvements to our software without sharing them. These fees help substantially in allowing us to survive as an organization.

We don't make nearly as much money as we would from charging everyone for usage rights, but we do make just enough to get by, and that is important.;-) We don't really feel that everyone should follow our example and make their software no charge for most users (it is too hard to survive fiscally doing this), but we do think that everyone should disclose their source code, and no one should design their software to exclude working with other software (e.g. Microsoft's Palladium which makes such a mockery of Athena).

[edit] Software Libre Takes More Than A License --- It Takes A Design

Making the source code available to you is not enough by itself to bring you all of the possible benefits of software libre. Many file systems are so difficult to modify that only someone who has worked with the code for years finds it feasible to modify it, and even then small changes can take months of labor due to their ripple effects on the other code and the difficulties of dealing with disk format changes.

This is why we have a plugin based architecture in Reiser4, so that it is not just possible, but easy, to improve the software.

Imagine that you were an experimental physicist who had spent his life using only the tools that were in his local hardware store. Then one day you joined a major research lab with a machine shop and a whole bunch of other physicists. All of a sudden you are not using just whatever tools the large tool companies who have never heard of you have made for you. You are now part of a cooperative of physicists all making your own tools, swapping tools with each other, suddenly empowered to have tools that are exactly what you want them to be, or even merely exactly what your colleagues want them to be, rather than what some big tool company, that has to do a market analysis before giving you what you want, wants them to be. That is the transition you will make when you go from version 3 to version 4 of ReiserFS. The tools your colleagues and sysadmins (your machinists) make are going to be much better for what you need.

[edit] Why Limit Interactions With Objects Strictly?

You may wonder why the design we will present is so highly structured, why every object is allowed to control what is done to it by its providing a limited interface, and why we pass requests to objects to do things rather than doing things directly to the object? Surely we limit our functionality by doing so, yes? Indeed we do, but is there a reason why the price is worth paying? Is there something that becomes crucial as complexity grows?

Chaos theory offers the answer. If you disturb one thing, and disturbing that thing inherently disturbs another thing, which in turn disturbs the first thing plus maybe a whole bunch of other things, and those things all disturb the first thing again, and...., etc., you get what chaos theory calls a feedback loop. These loops have a marvelous tendency for the end effect of the disturbance to be incalculable, and our inability to calculate such loops is perhaps a significant aspect of our being mere mortals. Of course, as you probably know most programmers want to be gods, and when they are unable to know what the effect will be of a change they make to their code, they dislike this. As a result, they go to great lengths to reduce the tendency of code changes to the design of one object to have ripple effects upon other objects. A vitaly important way to do this is to have very strictly defined interfaces to objects, and for the designer of each object to be able to know that the interface will never be violated when he writes it. This is called "object oriented design", or "structured programming", and if used well it can do a lot to reduce a type of chaotic behavior known as bugs.;-) Verifying the avoidance of interactions that violate the design for an object is a key task in security auditing (inspecting the code to see if it has security holes).

The expressive power of an information system is proportional not to the number of objects that get implemented for it, but instead is proportional to the number of possible effective interactions between objects in it. (Reiser's Law Of Information Economics)

This is similar to Adam Smith's observation that the wealth of nations is determined not by the number of their inhabitants, but by how well connected they are to each other. He traced the development of civilization throughout history, and found a consistent correlation between connectivity via roads and waterways, and wealth. He also found a correlation between specialization and wealth, and suggested that greater trade connectivity makes greater specialization economically viable.

You can think of namespaces as forming the roads and waterways that connect the components of an operating system. The cost of these connecting namespaces is influenced by the number of interfaces that they must know how to connect to. That cost is, if they are not clever to avoid it, N times N, where N is the number of interfaces, since they must write code that knows how to connect every kind to every kind.

One very important way to reduce the cost of fully connective namespaces is to teach all the objects how to use the same interface, so that the namespace can connect them without adding any code to the namespace. Very commonly, objects with different interfaces are segregated into different namespaces.

If you have two namespaces, one with N objects, and another with M objects, the expressive power of the objects they connect is proportional to (N times N) plus (M times M), which is less than (N plus M) times (N plus M). Try it on a calculator for some arbitrary N and M. Usually the cost of inventing the namespaces is much less than the cost of the users creating all the objects. This is what makes namespaces so exciting to work with: you can have an enormous impact on the productivity of the whole system just by being a bit fanatical in insisting on simplicity and consistency in a few areas.

Please remember this analysis later when we describe why we implement everything to support a "file" or "directory" interface, and why we aren't eager to support objects with unnecessarily different namespaces/interfaces --- such as "attributes" that cannot interact with files in all the same ways that files can interact with files.

[edit] Basic Semantics

To interact with an object you name it, and you say what you want it to do. The filesystem takes the name you give, and looks through things we call directories to find the object, and then gives the object your request to do something.

[edit] Files

character holding an object that looks like a sequence

A file is something that tries to look like a sequence of bytes.

You can read the bytes, and write the bytes. You can specify what byte to start to read/write from (the offset), and the number of bytes to read/write (the count). [Diagram needed]. You can also cut bytes off of the end of the file.

character sawing off end of file

Cutting bytes out of the middle or the beginning of a file, and inserting bytes into the middle of a file, are not permitted by any of our current file plugins, all of which implement fairly ancient Unix file semantics, but this is likely to change someday.

[edit] The Software Engineering Lurking Below File Plugins

Your interactions with a file are handled by the file's "plugin". These interactions are structured (in programming, such structures are generally called "interfaces") into a set of limited and defined interactions. (We are too lazy to perform the infinite work of programming plugins to handle infinite types of interactions.) Each way you can interact with a plugin is called a "method". A plugin is composed as a set of such methods.

Among programmers, laziness is considered the highest art form, and we do our best to express our souls in this art. This is why we have layers and layers of laziness built into our plugin architecture.

Each method is composed from a library of functions we thought would be useful in constructing plugin methods. Each plugin is composed from a library of methods used by plugins, and a plugin can be considered a one-to-one mapping (that's where you have two sets of things, and for every member of one set, you specify a member of the other set as its match) of every way of interacting with the plugin to a method handling it. For every file, there is a file pluginid. Whenever you attempt to interact with a file, we take the name of the file, find the pluginid for the file, and inside the kernel we have an array of plugins [diagram needed that is suitable for persons who don't know what an array or offset is], and we use the pluginid as the offset of that file's plugin within that array. (An offset is a position relative to something else, and in programming it is typically measured in bytes.)

This implies that when you invent a new file plugin, you have to recompile (Programmers don't actually write programs, they got too lazy for that long ago, instead they write instructions for the computer on how to write the program, and when the computer follows these instructions ("source code"), it is called "compiling", which programmers usually pretend was done by them when they speak about it, as in "I recompiled the kernel for my exact CPU this time, and now playing pong is noticeably faster.".) the kernel, and you can only add plugins to the end of the list, and you can never reuse or change pluginids for a plugin, or else you will have to go through the whole filesystem changing all of the pluginids that are no longer correct. Someday in a later version we will revise this so that plugins are "dynamically loadable" (which is when you can add something to a program while it is running), and you can add support for new plugins to a running kernel. When we do that we will carefully benchmark and ensure that there is no loss of performance (or we won't do it) from using dynamic loading.

Programs are often "layered", which is when the program is divided into layers, and each layer only talks to the layer immediately above it, or immediately below it, and never talks to a part of the program two levels below it, etc. This reduces the complexity of the interfaces for the various parts of the program, and most of the complexity of a program is in coding its interfaces.

characters each communicating with adjacent characters only

Reiser4 has a "semantic layer", and this semantic layer concerns itself with naming objects and specifying what to do to the objects, and doesn't concern itself with such things as how to pack objects into particular places on disk or in the tree.

An IO to a file may affect more than one physical sequence of bytes, or no physical sequence of bytes, it may affect the sequences of bytes offered by other files to the semantic layer, and the file plugin may invoke other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read and/or write what the caller sees as being a single sequence of bytes. Appearances are what is wanted.

When we say that security attributes are implemented as files, we mean that security attributes look like a sequence of bytes, but the security attributes may be stored in some compressed form that perhaps might be of fixed length, or even be just a single bit. For the filesystem to offer the benefits of simplicity it need merely provide a uniform appearance that all things it stores are sequences of bytes, and there is nothing to prevent it from gaining efficiency through using many different storage implementations to offer this uniform appearance.

For many files it is valuable for them to support efficient tree traversal to any offset in the sequence of bytes. It is not required though, and Unix/Gnu/Linux has traditionally supported some types of files which could not do this. A pipe will allow you take the output of one command, and connect it to the input of another command, and each of the commands will see the pipe as a file. This pipe is an example of a file for which you cannot simply jump to the middle of the file efficiently but instead you must go through it from beginning to end in sequential order.

[edit] Names and Objects

A name is a means of selecting an object. An object is anything that acts as though it is a single unified entity. What is an object is context dependent. For instance, if you tell an object to delete itself, many distinctly named entities (that are distinct objects in other ways such as reading) might well disappear as though they are a single object in response to the delete request.

A namespace is a mapping of names to objects. Filesystems, databases, search engines, environment variable names within shells, are all examples of namespaces. The early papers using the term tended to seek to convey that namespaces have commonality in their structure, are not fundamentally different, should be based on common design principles, and should be unified.

Such unification is a bit of a quest for a holy grail. In British mythology King Arthur sent his knights out on a quest for the holy grail, and if only they could become worthy of it, it would appear to them. None of them found it, and yet the quest made them what they became. Namespaces will never be unified, but the closer we can come to it, the more expressive power the OS will have. Reiser4 seeks to create a storage layer effective for such an eventually unified namespace, and gives it a semantic layer with some minor advantages over the state of the art. Later versions will add more and more expressive semantics to the storage layer.

Finding objects is layered. The semantic layer takes names and converts them into keys (we call this "resolving" the name). The storage layer (which contains the tree traversing code) takes keys and finds the bytes that store the parts of the object.

Keys are the fundamental name used by the Reiser4 tree. They are the name that the storage layer at the bottom of it all understands. They can be used to find anything in the tree, not just whole objects, but parts of objects as well.

Everything in the tree has exactly one key. Duplicate keys are allowed, but their use usually means that all duplicates must be examined to see if they really contain what is sought, and so duplicates are usually rare if high performance is desired. Allowing duplicates can allow keys to be more compact in some circumstances (e.g. hashed directory entries).

An objectid cannot be used for finding an object, only keys can. Objectids are used to compose keys so as to ensure that keys are unique.

[edit] Ordering of Name Components

When designing the naming system described in the future vision whitepaper I broke names from human and computer languages into their pieces, and then looked at their pieces to see which ones differed from each other in meaningful ways vs. which pieces were different expressions that provided the same functionality. (In more formal language, I would say that I systematically decomposed the ways of naming things that we use in human and computer languages into orthogonal primitives, and then determined their equivalence classes.) I then selected one way of expression from each set of ways that provided equivalent functionality. (Since that whitepaper is focused on what is not yet implemented, the whitepaper does not list all of the equivalence classes for names, but instead describes those which I thought I could say something interesting to the reader about. For instance, the NOT operator is simply unmentioned in it, as I really have nothing interesting to say about NOT, though it is very useful and will be documented when implemented.)

The ordering of two components of a name either has meaning, or it does not. If the resolution of one component of the name depends on what is named by another component, then that pair of name components forms a hierarchical name. Hierarchy can be indicated by means other than ordering. Many human languages indicate structure by use of suffix or tag mechanisms (e.g. Russian and Japanese). The syntactical mechanism one chooses to express hierarchy does not determine the possible semantics one can express so long as at least one effective method for expressing hierarchy is allowed. I choose to only offer one expression from each equivalence class of naming primitives, and here I chose the '/' separated file pathname expression traditional to Unix for pragmatic compatibility with existing operating systems. Reiser4 handles only hierarchical names, and non-hierarchical names are planned only for SSN Reiserfs.

[edit] Directories

Hierarchical names are implemented in Reiser4 by use of directories. The first component of a hierarchical name is the name of the directory, and the components that follow are passed to the directory to interpret. We use `/' to separate the components of a hierarchical name.

Directories may choose to delegate parts of their task to their sub-directories. The unix directory plugin when supplied with a name will use the part of the name before the first / to select a sub-directory (if there is a / in what it is resolving), and delegate resolving the part of the name after the first / to the sub-directory.

A directory can employ any arbitrary method at all of resolving the name components passed to it, so long as it returns a set of keys of objects as the result. In Reiser4, this set of keys always contains exactly one member, but this is designed to change in SSN Reiserfs. (Reiser4 also needs to interact with a standard interface for Unix filesystems called VFS (Virtual File System), and directories are also designed to be able to return what VFS understands, which we won't go into here.)

Directories will also return a list of names when asked. This list is not required to be a complete list of all names that they can resolve, and sometimes it is not desirable that it be so. Names can be hidden names in Reiser4. Directory plugins may be able to resolve more names than they can list, especially if they are written such that the number of names that they can resolve is infinite.

In partuclar, such names can resolve to the objects behaving like ordinary files (with respect to standard file system interface: read, write, readdir, etc.), but not backed up by storage layer. Such objects are called "pseudo files". Here is a list of pseudo files currently implemented in Reiser4 with description of their semantics.

[edit] The Unix Directory Plugin

The unix directory plugin implements directories by storing a set of directory entries per directory. These directory entries contain a name, and a key. When given a name to resolve, the unix directory plugin finds the directory entry containing that name, and then returns the key that is in the directory entry (more precisely, since a key selects not just the file but a particular byte within a file, it returns that part of the key which is sufficient to select the file, and which is sufficient to allow the code to determine what the full keys for those various parts when the byte offset and some other fields (like item type) are added to the partial key to form a whole key). The key can then be used by the tree storage layer to find all the pieces of that which was named.

[edit] Some Historical Details Of Design Flaws In The Unix Directory Interface

Unix differs from Multics, in that Multics defined a file to be a sequence of elements (the elements could be bytes, directory entries, or something else....), while Unix defines a file to be purely a sequence of bytes. In Multics directories were then considered to be a particular type of file which was a sequence of directory entries. For many years, all implementations of Unix directories were as sequences of bytes, and the notion of location within a Unix directory is tied not to a name as you might expect, but to a byte offset within the directory.

The problem is that one is using a byte offset to represent a location whose true meaning is not a byte offset but a directory entry, and doing so for a particular file in a system which meaningfully names that file not by byte offset within the directory but by filename. Various efforts are being made in the Unix community to pretend that this byte offset is something more general than a byte offset, and they often try to do so without increasing the size used to store the thing which they pretend is not a byte offset. Since byte offsets are normally smaller than filenames are allowed to be, the result is ugliness and pathetic kludges. Trust me that you would rather not know about the details of those kludges unless you absolutely have to, and let me say no more. Directories Are Unordered

Unix/Linux makes no promises regarding the order of names within directories. The order in which files are created is not necessarily the order in which names will be listed in a directory, and the use of lexicographic (alphabetic) order is surprisingly rare. The unix utilities typically sort directory listings after they are returned by the filesystem, which is why it seems like the filesystem sorts them, and is why listing very large directories can be slow. (Our current default plugin sorts filenames that are less than 15 letters long lexicographically. For those that are more than 15 characters long it sorts them first by their first 8 letters then by the hash of the whole name.)

There is value to allowing the user to specify an arbitrary order for names using an arbitrary ordering function the user supplies. This is not done in Reiser4, but is planned as a feature of later versions. Allowing the creation of a hash plugin is a limited form of this that is currently implemented.

[edit] Files That Are Also Directories

In Reiser4 (but not ReiserFS 3) an object can be both a file and a directory at the same time. If you access it as a file, you obtain the named sequence of bytes. If you use it as a directory you can obtain files within it, directory listings, etc. There was a lengthy discussion on the Linux Kernel Mailing List about whether this was technically feasible to do. I won't reproduce it here except to summarize that Linus showed that this was feasible without "breaking" VFS.

Allowing an object to be both a file and a directory is one of the features necessary to to compose the functionality present in streams and attributes using files and directories.

To implement a regular unix file with all of its metadata, we use a file plugin for the body of the file, a directory plugin for finding file plugins for each of the metadata, and particular file plugins for each of the metadata. We use a unix_file file plugin to access the body of the file, and a unix_file_dir directory plugin to resolve the names of its metadata to particular file plugins for particular metadata. These particular file plugins for unix file metadata (owner, permissions, etc.) are implemented to allow the metadata normally used by unix files to be quite compactly stored. Hidden Directory Entries

A file can exist but not be visible when using readdir in the usual way. WAFL does this with the .snapshots directory; it works well for them without disturbing users. This is useful for adding access to a variety of new features and their applications without disturbing the user when they are not relevant.

[edit] New Security Attributes and Set Theoretic Semantic Purity

character holding primitive icons Minimizing Number Of Primitives Is Important In Abstract Constructions

To a theoretician it is extremely important to minimize the number of primitives with which one achieves the desired functionality in an abstract construction. It is a bit hard to explain why this is so, but it is well accepted that breaking an abstract model into more basic primitives is very important. A not very precise explanation of why is to say that by breaking complex primitives into their more basic primitives, then recombining those basic primitives differently, you can usually express new things that the original complex primitives did not express. Let's follow this grand tradition of theoreticians and see what happens if we apply it to Gnu/Linux files and directories.

[edit] Can We Get By Using Just Files and Directories

(Composing Streams And Attributes From Files And Directories)?

In Gnu/Linux we have files, directories, and attributes. In NTFS they also have streams. Since Samba is important to Gnu/Linux, there frequently are requests that we add streams to ReiserFS. There are also requests that we add more and more different kinds of attributes using more and more different APIs. Can we do everything that can be done with {files, directories, attributes, streams} using just {files, directories}? I say yes--if we make files and directories more powerful and flexible. I hope that by the end of reading this you will agree.

Let us have two basic objects. A file is a sequence of bytes that has a name. A directory is a name space mapping names to a set of objects "within" the directory. We connect these directory name spaces such that one can use compound names whose subcomponents are separated by a delimiter '/'. What is missing from files and directories now that attributes and streams offer?

In ReiserFS 3, there exist file attributes. File attributes are out-of-band data describing the sequence of bytes which is the file. For example, the permissions defining who can access a file, or the last modification time, are file attributes. File attributes have their own API; creating new file attributes creates new code complexity and compatibility issues galore. ACLs are one example of new file attributes users want.

Since in Reiser4 files can also be directories, we can implement traditional file attributes as simply files. To access a file attribute, one need merely name the file, followed by a '/', followed by an attribute name. That is: a traditional file will be implemented to possess some of the features of a directory; it will contains files within the directory corresponding to file attributes which you can access by their names; and it will contain a file body which is what you access when you name the "directory" rather than the file.

Unix currently has a variety of attributes that are distinct from files (ACLS, permissions, timestamps, other mostly security related attributes, ...). This is because a variety of people needed this feature and that, and there was no infrastructure that would allow implementing the features as fully orthogonal features that could be applied to any file. Reiser4 will create that infrastructure.

[edit] List Of Features Needed To Get Attribute And Stream Functionality From Files And Directories

  • api efficient for small files
  • efficient storage for small files
  • plugins, including plugins that can compress a file serving as an attribute into a single bit
  • files that also act as directories when accessed as directories
  • inheritance (includes file aggregation)
  • constraints
  • transactions
  • hidden directory entries

Each of these additional features is a feature that would benefit the filesystem. So we add them in v4. Basic Tree Concepts Trees, Nodes, and Items

One way of organizing information is to put it into trees.

When we organize information in a computer, we typically sort it into piles (nodes we call them), and there is a name (a pointer) for each pile that the computer will be able to use to find the pile.

A height =4, 4 level, fanout = 3, balanced tree. It start with a root node, then traverses 2 internal nodes, and ends with the leaf nodes which hold the data and have no children.

Figure 1. One Example Of A Tree.

Some of the nodes can contain pointers, and we can go looking through the nodes to find those pointers to (usually other) nodes.

We are particularly interested in how to organize so that we can find things when we search for them. A tree is an organization structure that has some useful properties for that purpose. Definition of Tree:

  1. A tree is a set of nodes organized into a root node, and zero or more additional sets of nodes called subtrees.
  2. Each of the subtrees is a tree.
  3. No node in the tree points to the root node, and exactly one pointer from a node in the tree points to each non-root node in the tree.
  4. The root node has a pointer to each of its subtrees, which is, a pointer to the root node of the subtree.

[edit] Fine Points of the Definition

The absolutely most trivial of all graphs, the single, isolated node.

Figure 2. The simplest tree.

A trivial, connected, linear (unary) graph-a linear sequence of nodes connected by paths (edges, pointers).

Figure 3. A trivial, linear tree.

It is interesting to argue over whether finite should be a part of the definition of trees. There are many ways of defining trees, and which is the best definition depends on what your purpose is. Donald Knuth (a well known author of algorithm textbooks) supplies several definitions of tree. As his primary definition of tree he even supplies one which has no pointers/edges/lines in the definition, just sets of nodes.

Reiser4 uses a finite tree (the number of nodes is limited). Knuth defines trees as being finite sets of nodes. There are papers on infinite trees on the Internet. I think it more appropriate to consider finite an additional qualifier on trees, rather than bundling finite into the definition. However, I personally only deal with finite trees in my storage layer research. It is interesting to consider whether storage layers are inherently more motivated than semantic layers to limit themselves to finite trees rather than infinite trees. This is where some writers would say ".... is left as an exercise for the reader". :-) Oh the temptation.... I will remind the reader of my explanation of why storage layer trees are more motivated to be acyclic, and, at the cost of some effort at honesty, constrain myself to saying that doing more than providing that hint is beyond my level of industry.;-)

Edge is a term often used in tree definitions. A pointer is unidirectional (you can follow it from the node that has it to the node it points to, but you cannot follow it back from the node it points to to the node that has it). An edge is bidirectional (you can follow it in both directions).

Here are three alternative tree definitions, which are interesting in how they are mathematically equivalent to each other, though they are not equivalent to the definition I supplied because edges are not equivalent to pointers:

For all three of these definitions, let there be not more than one edge connecting the same two nodes.

  • a set of vertices (aka points) connected by edges (aka lines) for which the number of edges is one less than the number of vertices
  • or a set of vertices connected by edges which has no cycles (a cycle is a path from a vertex to itself)
  • or a set of vertices connected by edges for which there is exactly one path connecting any two vertices

The three alternative definitions do not have a unique root in their tree, and such trees are called free trees.

The definition I supplied is a definition of a rooted tree not a free tree. It also has no cycles, it has one less pointer than it has nodes, and there is exactly one path from the root to any node.

Please feel encouraged to read Knuth's writings for more discussions of these topics.

[edit] Graphs vs. Trees

Consider the purposes for which you might want to use a graph, and those for which you might want to use a tree? In a tree there is exactly one path from the root to each node in the tree, and a tree has the minimum number of pointers sufficient to connect all the nodes. This makes it a simple and efficient structure. Trees are useful for when efficiency with minimal complexity is what is desired, and there is no need to reach a node by more than one route. Reiser4 has both graphs and trees, with trees used for when the filesystem chooses the organization (in what we call the storage layer, which tries to be simple and efficient), and graphs for when the user chooses the organization (in the semantic layer, which tries to be expressive so that the user can do whatever he wants). Ordering The Tree Aids Searching Through It Keys

We assign everything stored in the tree a key. We find things by their keys. Use of keys gives us additional flexibility in how we sort things, and if the keys are small, it gives us a compact means of specifying enough to find the thing. It also limits what information we can use for finding things.

This limit restricts its usefulness, and so we have a storage layer, which finds things by keys, and a semantic layer, which has a rich naming system. The storage layer chooses keys for things solely to organize storage in a way that will improve performance, and the semantic layer understands names that have meaning to users. As you read, you might want to think about whether this is a useful separation that allows freedom in adding improvements that aid performance in the storage layer, while escaping paying a price for the side effects of those improvements on the flexible naming objectives of the semantic layer.

[edit] Choosing Which Subtree

We start our search at the root, because from the root we can reach every other node. How do we choose which subtree of the root to go to from the root?

The root contains pointers to its subtrees.

For each pointer to a subtree there is a corresponding left delimiting key . Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree.

Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node.

Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer, and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys we know what subtree of that node contains the thing we are looking for.

Duplicate keys are a topic for another time. For now I will just hint that when searching through objects with duplicate keys we find the first of them in the tree, and then we search through all duplicates one-by-one until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a tradeoff between key size and the average frequency of such inefficient linear searches. Using duplicate keys can also allow, if one defines one's insertion algorithms such that they always insert at the end of a set of duplicate keys, ordering objects with the same key by creation time.

The contents of each node in the tree are sorted within the node. So, the entire tree is sorted by key, and for a given key we know just where to go to find at least one thing with that key.

[edit] Nodes

[edit] Leaves, Twigs, and Branches

Leaves are nodes that have no children. Internal nodes are nodes that have children.

A height =4, 4 level, fanout = 3, balanced tree. It start with an internal root node, then traverses 2 internal branch nodes, and ends with the leaf nodes which hold the data and have no children. )

Figure 4. A height = 4, fanout = 3, balanced tree. A search will start with the root node, the sole level 4 internal node, traverse 2 more internal nodes, and end with a leaf node which holds the data and has no children.

A node that contains items is called a formatted node.

If an object is large, and is not compressed and doesn't need to support efficient insertions (compressed objects are special because they need to be able to change their space usage when you write to their middles because the compression might not be equally efficient for the new data), then it can be more efficient to store it in nodes without any use of items at all. We do so by default for objects larger than 16k.

Unformatted leaves (unfleaves) are leaves that contain only data, and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes.

Pointers to unfleaves are different in their structure from pointers to formatted nodes. Extent pointers point to unfleaves. An extent is a sequence of contiguous in block number order unfleaves that belong to the same object. An extent pointer contains the starting block number of the extent, and a length. [diagram needed] Because the extent belongs to just one object, we can store just one key for the extent, and then we can calculate the key of any byte within that extent. If the extent is at least 2 blocks long, extent pointers are more compact than regular node pointers would be.

Node Pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they are probably soon to come. Notice how with extent pointers we don't have to store the delimiting key of each node pointed to, and with node pointers we need to. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well since they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date.

Twigs are parents of leaves. Extent Pointers exist only in twigs. This is a very controversial design decision I will discuss a bit later. Branches are internal nodes that are not twigs.

You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful to number as 1 the level with the leaves where object data is stored. The height of the tree will depend upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be.

For reasons of code simplicity, we find it easiest to implement Reiser4 such that it has a minimum height of 2, and the root is always an internal node. There is nothing deeper than judicial laziness to this: it simplifies the code to not deal with one node trees, and nobody cares about the waste of space.

An example of a Reiser4 tree:

A tree, starting with a root node, then traversing branch nodes, including the internal nodes called twig nodes (A Reiser4 feature), and ending with the leaf nodes which hold the data and have no children.

Figure 5. This Reiser4 tree is a 4 level, balanced tree with a fanout of 3. In practice Reiser4 fanout is much higher and varies from node to node, but a 4 level tree diagram with 16 million leaf nodes won't fit easily onto my monitor so I drew something smaller....;-)

[edit] Size of Nodes

We choose to make the nodes equal in size. This makes it much easier to allocate the unused space between nodes, because it will be some multiple of node sized, and there are no problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms.

If having the nodes be equal in size is not very important, perhaps due to the tree fitting into RAM, then using a class of algorithms called skip lists is worthy of consideration.

Reiser4 nodes are usually equal to the size of a page, which if you use Gnu/Linux on an Intel CPU is currently 4096 (4k) bytes. There is no measured empirical reason to think this size is better than others, it is just the one that Gnu/Linux makes easiest and cleanest to program into the code, and we have been too busy to experiment with other sizes. Sharing Blocks Saves Space

If nodes are of equal size, how do we store large objects? We chop them into pieces. We call these pieces items. Items are sized to fit within a single node.

Conventional filesystems store files in whole blocks. Roughly speaking, this means that on average half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, then the space wasted is much larger than the file. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because it will waste more than 90% of the space in the blocks it stores them in. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per item formatting overhead, whose percentage of total space consumed depends on average item size, and for that reason is hard to quantify.

Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process's directly accessible address space. Due to some implementation details mmap() needs file data to be 4k aligned, and if the data is already 4k aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files that are larger than 16k are 4k aligned. We don't yet have enough empirical data and experience to know whether 16k is the precise optimal default value for this cutoff point, but so far it seems to at least be a decent choice.

[edit] Items

Nodes in the tree are smaller than some of the objects they hold, and larger than some of the objects they hold, so how do we store them? One way is to pour them into items. An item is a data container that is contained entirely within a single node, and it allows us to manage space within nodes. For the default 4.0 node format, every item has a key, an offset to where in the node the item body starts, a length of the item body, and a pluginid that indicates what type of item it is.

Items allow us to not have to round up to 4k the amount of space required to store an object. The Structure of an Item Item_Body . . separated . . Item_Head

 	  	Item_Key 	Item_Offset 	Item_Length 	Item_Plugin_id

Types Of Items

Reiser4 includes many different kinds of items designed to hold different kinds of information.

  • static_stat_data: holds the owner, permissions, last access time, creation time, last modification time, size, and the number of links (names) to a file.
  • cmpnd_dir_item: holds directory entries, and the keys of the files they link to.
  • extent pointers explained above
  • node pointers: explained above
  • bodies: holds parts of files that are not large enough to be stored in unfleafs.

[edit] Units

We call a unit that which we must place as a whole into an item, without splitting it across multiple items. When traversing an item's contents it is often convenient to do so in units:

  • For body items the units are bytes.
  • For directory items the units are directory entries. The directory entries contain a name and a key of the file named (or at least the item plugin can pretend they do, in practice the name and key may be compressed).
  • For extent items the units are extents. Extent items only contain extents from the same file.
  • For static_stat_data the whole stat data item is one indivisible unit of fixed size.

What the Default Node Formats For ReiserFS 4.0 Look Like

An unformatted leaf node (unfleaf node), which is the only node without a Node_Header, has the trivial structure: .................................................................................................................................................................................................................................

The Structure of an Item Item_Body . . separated . . Item_Head

 	  	Item_Key 	Item_Offset 	Item_Length 	Item_Plugin_id

Aformatted leaf nodehas the structure: Block_Head Item_Body0 Item_Body1 - - - Item_Bodyn ....Free Space.... Item_Headn - - - Item_Head1 Item_Head0

A twig node has the structure: Block_Head Item_Body0 NodePointer0 Item_Body1 ExtentPointer1 Item_Body2 NodePointer2 Item_Body3 ExtentPointer3 - - - Item_Bodyn NodePointern ....Free Space.... Item_Headn - - - Item_Head0

A branch node has the structure: Block_Head Item_Body0 NodePointer0 - - - Item_Bodyn NodePointern ........Free Space...... Item_Headn - - - Item_Head0 Tree Design Concepts Height Balancing versus Space Balancing

Height Balanced Trees are trees such that each possible search path from root node to leaf node has exactly the same length (Length = number of nodes traversed from root node to leaf node). For instance the height of the tree in Figure 1 is four while the height of the left hand tree in Figure 1.3 is three and of the single node in Figure 2 is 1.

The term balancing is used for several very distinct purposes in the balanced tree literature. Two of the most common are: to describe balancing the height, and to describe balancing the space usage within the nodes of the tree. These quite different definitions are unfortunately a classic source of confusion for readers of the literature.

Most algorithms for accomplishing height balancing do so by only growing the tree at the top. Thus the tree never gets out of balance.

This is a 4 level unbalanced tree with fanout N = 3 that has then lost some nodes to deletions and needs to be balanced

Figure 6. This is an unbalanced tree. Three principle considerations in tree design

Three of the principle considerations in tree design are:

  • the fanout rate (see below)
  • the tightness of packing
  • the amount of the shifting of items in the tree from one node to another that is performed (which creates delays due to waiting while things move around in RAM, and on disk).

[edit] Fanout

The fanout rate n refers to how many nodes may be pointed to by each level's nodes. (see Figure 7) If each node can point to n nodes of the level below it, then starting from the top, the root node points to n internal nodes at the next level, each of which points to n more internal nodes at its next level, and so on... m levels of internal nodes can point to nm leaf nodes containing items in the last level. The more you want to be able to store in the tree, the larger you have to the fields in the key that first distinguish the objects (the objectids ), and then select parts of the object (the offsets). This means your keys must be larger, which decreases fanout (unless you compress your keys, but that will wait for our next version....).

A four level tree with fanout N = 1 is shown. It has just four nodes starting from the root node, traversing the internal and twig nodes and ending with the leaf node which contains the data. Then there is a graph with N = 2; that is it starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes) and each of these twig nodes points to 2 leaf nodes for a total of 8 leaf nodes in the four levels. Lastly, a fanout N = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.

Figure 7. Three 4 level, height balanced trees with fanouts n = 1, 2, and 3. The first graph is a four level tree with fanout n = 1. It has just four nodes, starts with the (red) root node, traverses the (burgundy) internal and (blue) twig nodes, and ends with the (green) leaf node which contains the data. The second tree, with 4 levels and fanout n = 2, starts with a root node, traverses 2 internal nodes, each of which points to two twig nodes (for a total of four twig nodes), and each of these points to 2 leaf nodes for a total of 8 leaf nodes. Lastly, a 4 level, fanout n = 3 tree is shown which has 1 root node, 3 internal nodes, 9 twig nodes, and 27 leaf nodes.

[edit] What Are B+Trees, and Why Are They Better than B-Trees

It is possible to store not just pointers and keys in internal nodes, but also to store the objects those keys correspond to in the internal nodes. This is what the original B-tree algorithms did.

Then B+trees were invented in which only pointers and keys are stored in internal nodes, and all of the objects are stored at the leaf level.

Figure 8.

Figure 9.

Warning! I found from experience that most persons who don't first deeply understand why B+trees are better than B-Trees won't later understand explanations of the advantages of putting extents on the twig level rather than using BLOBs. The same principles that make B+Trees better than B-Trees, also make Reiser4 faster than using BLOBs like most databases do. So make sure this section fully digests before moving on to the next section, ok?;-) B+Trees Have Higher Fanout Than B-Trees

Fanout is increased when we put only pointers and keys in internal nodes, and don't dilute them with object data. Increased fanout increases our ability to cache all of the internal nodes because there are fewer internal nodes.

Often persons respond to this by saying, "but B-trees cache objects, and caching objects is just as valuable". It is not, on average, is the answer. Of course, discussing averages makes the discussion much harder.

We need to discuss some cache design principles for a while before we can get to this.

[edit] Cache Design Principles

[edit] Reiser's Untie The Uncorrelated Principle of Cache Design

Tying the caching of things whose usage does not strongly correlate is bad.


  • you have two sets of things, A and B.
  • you need things from those two sets at semi-random, with there existing a tendency for some items to be needed much more frequently than others, but which items those are can shift slowly over time.
  • you can keep things around after you use them in a cache of limited size.
  • you tie the caching of every thing from A to the caching of another thing from B. (that means, whenever you fetch something from A into the cache, you fetch its partner from B into the cache)

Then this increases the amount of cache required to store everything recently accessed from A. If there is a strong correlation between the need for the two particular objects that are tied in each of the pairings, stronger than the gain from spending those cache resources on caching more members of B according to the LRU algorithm, then this might be worthwhile. If there is no such strong correlation, then it is bad.

But wait, you might say, you need things from B also, so it is good that some of them were cached. Yes, you need some random subset of B. The problem is that without a correlation existing, the things from B that you need are not especially likely to be those same things from B that were tied to the things from A that were needed.

This tendency to inefficiently tie things that are randomly needed exists outside the computer industry. For instance, suppose you like both popcorn and sushi, with your need for them on a particular day being random. Suppose that you like movies randomly. Suppose a theater requires you to eat only popcorn while watching the movie you randomly found optimal to watch, and not eat sushi from the restaurant on the corner while watching that movie. Is this a socially optimum system? Suppose quality is randomly distributed across all the hot dog vendors: if you can only eat the hot dog produced by the best movie displayer on a particular night that you want to watch a movie, and you aren't allowed to bring in hot dogs from outside the movie theater, is it a socially optimum system? Optimal for you?

Tying the uncorrelated is a very common error in designing caches, but it is still not enough to describe why B+Trees are better. With internal nodes, we store more than one pointer per node. That means that pointers are not separately cached. You could well argue that pointers and the objects they point to are more strongly correlated than the different pointers. We need another cache design principle.

[edit] Reiser's Maximize The Variance Principle of Cache Design

If two types of things that are cached and accessed, in units that are aggregates, have different average temperatures, then segregating the two types into separate units helps caching.

For balanced trees, these units of aggregates are nodes. This principle applies to the situation where it may be necessary to tie things into larger units for efficient access, and guides what things should be tied together.

Suppose you have R bytes of RAM for cache, and D bytes of disk. Suppose that 80% of accesses are to the most recently used things which are stored in H (hotset) bytes of nodes. Reducing the size of H to where it is smaller than R is very important to performance. If you evenly disperse your frequently accessed data, then a larger cache is required and caching is less effective.

  1. If, all else being equal, we increase the variation in temperature among all aggregates (nodes), then we increase the effectiveness of using a fast small cache.
  2. If two types of things have different average temperatures (ratios of likelihood of access to size in bytes), then separating them into separate aggregates (nodes) increases the variation in temperature in the system as a whole.
  3. Conclusion: If all else is equal, if two types of things cached several to an aggregate (node) have different average temperatures then segregating them into separate nodes helps caching.

Pointers To Nodes Have A Higher Average Temperature Than The Nodes They Point To

Pointers to nodes tend to be frequently accessed relative to the number of bytes required to cache them. Consider that you have to use the pointers for all tree traversals that reach the nodes beneath them and they are smaller than the nodes they point to.

Putting only node pointers and delimiting keys into internal nodes concentrates the pointers. Since pointers tend to be more frequently accessed per byte of their size than items storing file bodies, a high average temperature difference exists between pointers and object data.

According to the caching principles described above, segregating these two types of things with different average temperatures, pointers and object data, increases the efficiency of caching.

[edit] Segregating By Temperature Directly

Now you might say, well, why not segregate by actual temperature instead of by type which only correlates with temperature? We do what we can easily and effectively code, with not just temperature segregation in consideration. There are tree designs which rearrange the tree so that objects which have a higher temperature are higher in the tree than pointers with a lower temperature. The difference in average temperature between object data and pointers to nodes is so high that I don't find such designs a compelling optimization, and they add complexity. I could be wrong.

If one had no compelling semantic basis for aggregating objects near each other (this is true for some applications), and if one wanted to access objects by nodes rather than individually, it would be interesting to have a node repacker sort object data into nodes by temperature. You would need to have the repacker change the keys of the objects it sorts. Perhaps someone will have us implement that for some application someday for Reiser4. BLOBs Unbalance the Tree, Reduce Segregation of Pointers and Data, and Thereby Reduce Performance

BLOBs, Binary Large OBjects, are a method of storing objects larger than a node by storing pointers to nodes containing the object. These pointers are commonly stored in what is called the leaf nodes (level 1, except that the BLOBs are then sort of a basement "level B" :-\ ) of a "B*" tree.

This is a tree that was four levels until a BLOB was inserted with a pointer from a leaf node. In this case the BLOB's blocks are all contiguous.

Figure 10. A Binary Large OBject (BLOB) has been inserted with, in a leaf node, pointers to its blocks. This is what a ReiserFS V3 tree looks like.

BLOBs are a significant unintentional definitional drift, albeit one accepted by the entire database community. This placement of pointers into nodes containing data is a performance problem for ReiserFS V3 which uses BLOBs (Never accept that "let's just try it my way and see and we can change it if it doesn't work" argument. It took years and a disk format change to get BLOBs out of ReiserFS, and performance suffered the whole time (if tails were turned on.)). Because the pointers to BLOBs are diluted by data, it makes caching all pointers to all nodes in RAM infeasible for typical file sets.

Reiser4 returns to the classical definition of a height balanced tree in which the lengths of the paths to all leaf nodes are equal. It does not try to pretend that all of the nodes storing objects larger than a node are somehow not part of the tree even though the tree stores pointers to them. As a result, the amount of RAM required to store pointers to nodes is dramatically reduced. For typical configurations, RAM is large enough to hold all of the internal nodes.

This is a Reiser4 tree with extents in the level 1 Leaf Nodes and the pointer to it in the level 2 Twig Nodes. In this case the BLOB's blocks are all contiguous.

Figure 11. A Reiser4, 4 level, height balanced tree with fanout = 3 and the data that was stored in BLOBs now stored in extents in the level 1 leaf nodes and pointed to by extent pointers stored in the level 2 twig nodes.

Gray and Reuter say the criterion for searching external memory is to "minimize the number of different pages along the average (or longest) search path. ....by reducing the number of different pages for an arbitrary search path, the probability of having to read a block from disk is reduced." (1993, Transaction Processing: concepts and techniques, Morgan Kaufman Publishers, San Francisco, CA, p.834 ...)

My problem with this explanation of why the height balanced approach is effective is that it does not convey that you can get away with having a moderately unbalanced tree provided that you do not significantly increase the total number of internal nodes. In practice, most trees that are unbalanced do have significantly more internal nodes. In practice, most moderately unbalanced trees have a moderate increase in the cost of in-memory tree traversals, and an immoderate increase in the amount of IO due to the increased number of internal nodes. But if one were to put all the BLOBs together in the same location in the tree, since the amount of internal nodes would not significantly increase, the performance penalty for having them on a lower level of the tree than all other leaf nodes would not be a significant additional IO cost. There would be a moderate increase in that part of the tree traversal time cost which is dependent on RAM speed, but this would not be so critical. Segregating BLOBs could perhaps substantially recover the performance lost by architects not noticing the drift in the definition of height balancing for trees. It might be undesirable to segregate objects by their size rather than just their semantics though. Perhaps someday someone will try it and see what results.

[edit] Dancing Trees Are Faster Than Balanced Trees

character shoving tree-like characters to left

Balanced trees have traditionally employed fixed criterion for determining whether nodes should be squeezed together into fewer nodes so as to save space. This criterion is traditionally satisfied at the end of every modification to the tree. A typical such criterion is to guarantee that after each modification to the tree the modified node cannot be squeezed together with its left and right neighbor into two or fewer nodes. ReiserFS V3 uses that criterion for its leaf nodes.

The more neighboring nodes you consider for squeezing into one fewer nodes, the more memory bandwidth you consume on average per modification to the tree, and the more likely you are to need to read those nodes because they are not in memory.

It is a typical pattern in memory management algorithm design that the more tightly packed memory is kept, the more overhead is added to the cost of changing what is stored where in it. This overhead can be significant enough that some commercial databases actually only delete nodes when they are completely empty, and they feel that in practice this works well.

Trees that adhere to fixed space usage balancing criteria can have many things rigorously proven about their worst case performance in publishable papers. This is different from their being optimal. An algorithm can have worse bounds on its theoretical worst case performance and be a better algorithm. Just because one cannot rigorously define average usage patterns does not mean they are the slightest bit less important. Sorry mere mortal mathematicians, that is life. Maybe some might prefer to think about the questions that they can define and answer rigorously, but this does not in the slightest make them the right questions. Yes, I am a chaotic....

In Reiser4 we employ not balanced trees, but dancing trees. Dancing trees merge insufficiently full nodes, not with every modification to the tree, but instead:

  • in response to memory pressure triggering a flush to disk,
  • as a result of a transaction closure flushing nodes to disk

If It Is In RAM, Dirty, and Contiguous, Then Squeeze It ALL Together Just Before Writing

Let a slum be defined as a sequence of contiguous in the tree order, and dirty in this transaction, nodes. (In simpler words, a bunch of dirty nodes that are right next to each other.) A dancing tree responds to memory pressure by squeezing and flushing slums.

It is possible that merely squeezing a slum might free up enough space that flushing is unnecessary, but the current implementation of Reiser4 always flushes the slums it squeezes. This is not necessarily the right approach, but we found it simpler and good enough for now. Another simplification we choose to engage in for now is that instead of trying to estimate whether squeezing a slum will save space before squeezing it, we just squeeze it and see.

Balanced trees have an inherent tradeoff between balancing cost and space efficiency. If they consider more neighboring nodes, for the purpose of merging them to save a node, with every change to the tree, then they can pack the tree more tightly at the cost of moving more data with every change to the tree.

By contrast, with a dancing tree, you simply take a large slum, shove everything in it as far to the left as it will go, and then free all the nodes in the slum that are left with nothing remaining in them, at the time of committing the slum's contents to disk in response to memory pressure. This gives you extreme space efficiency when slums are large, at a cost in data movement that is lower than it would be with an invariant balancing criterion because it is done less often. By compressing at the time one flushes to disk, one compresses less often, and that means one can afford to do it more thoroughly. By compressing dirty nodes that are in memory, one avoids performing additional I/O as a result of balancing. Procrastination Leads To Wiser Decisions: Allocate on Flush

ReiserFS V3 assigns block numbers to nodes as it creates them. XFS is smarter, they wait until the last moment just before writing nodes to disk. I'd like to thank the XFS team for making an effort to ensure that I understood the merits of their approach. The easy way to see its merits is to consider a file that is deleted before it reaches disk. Such a file should have no effect on the disk layout.

character squeezing a folding form Reiser4 The Atomic Filesystem Reducing The Damage of Crashing

When a computer crashes there is data in RAM which has not reached disk that is lost. You might at first be tempted to think that we want to then keep all of the data that did reach disk.

Suppose that you were performing a transfer of $10 from bank account A to bank account B, and this consisted of two operations 1) debit $10 from A, and 2) credit $10 to B.

Suppose that 1) but not 2) reached disk and the computer crashed. It would be better to disregard 1) than to let 1) but not 2) take effect, yes?

When there is a set of operations which we will ensure will all take effect, or none take effect, we call the set as a whole an atom. Reiser4 implements all of its filesystem system calls (requests to the kernel to do something are called system calls ) as fully atomic operations, and allows one to define new atomic operations using its plugin infrastructure. Why don't all filesystems do this? Performance.

Reiser4 possesses employs new algorithms that allow it to make these operations atomic at little additional cost where other filesystems have paid a heavy, usually prohibitive, price to do that. We hope to share with you how that is done.

[edit] A Brief History Of How Filesystems Have Handled Crashes

[edit] Filesystem Checkers

Originally filesystems had filesystem checkers that would run after every crash. The problem with that was that 1) the checkers can not handle every form of damage well, and 2) the checkers run for a long time. The amount of data stored on hard drives increased faster than the transfer rate (the rate at which hard drives transfer their data from the platter spinning inside them into the computer's RAM when they are asked to do one large continuous read, or the rate in the other direction for writes), which means that the checkers took longer to run, and as the decades ticked by it became less and less reasonable for a mission critical server to wait for the checker.

[edit] Fixed Location Journaling

A solution to this was adopted of first writing each atomic operation to a location on disk called the journal or log, and then, only after each atom had fully reached the journal, writing it to the committed area of the filesystem.

The problem with this is that twice as much data needs to be written. On the one hand, if the workload is dominated by seeks, this is not as much of a burden as one might think. On the other hand, for writes of large files, it halves performance because such writes are usually transfer time dominated.

For this reason, meta-data journaling came to dominate general purpose usage. With meta-data journaling, the filesystem guarantees that all of its operations on its meta-data will be done atomically. If a file is being written to, the data in that file being written may be corrupted as a result of non-atomic data operations, but the filesystem's internals will all be consistent. The performance advantage was substantial. V3 of reiserfs offers both meta-data and data journaling, and defaults to meta-data journaling because that is the right solution for most users. Oddly enough, meta-data journaling is much more work to implement because it requires being precise about what needs to be journaled. As is so often the case in programming, doing less work requires more code.

With fixed location data journaling, the overhead of making each operation atomic is too high for it to be appropriate for average applications that don't especially need it --- because of the cost of writing twice. Applications that do need atomicity are written to use fsync and rename to accomplish atomicity, and these tools are simply terrible for that job. Terrible in performance, and terrible in the ugliness they add to the coding of applications. Stuffing a transaction into a single file just because you need the transaction to be atomic is hardly what one would call flexible semantics. Also, data journaling, with all its performance cost, still does not necessarily guarantee that every system call is fully atomic, much less that one can construct sets of operations that are fully atomic. It usually merely guarantees that the files will not contain random garbage, however many blocks of them happen to get written, and however much the application might view the result as inconsistent data. I hope you understand that we are trying to set a new expectation here for how secure a filesystem should keep your data, when we provide these atomicity guarantees.

[edit] Wandering Logs

One way to avoid having to write the data twice is to change one's definition of where the log area and the committed area are, instead of moving the data from the log to the committed area.

There is an annoying complication to this though, in that there are probably a number of pointers to the data from the rest of the filesystem, and we need for them to point to the new data. When the commit occurs, we need to write those pointers so that they point to the data we are committing. Fortunately, these pointers tend to be highly concentrated as a result of our tree design. But wait, if we are going to update those pointers, then we want to commit those pointers atomically also, which we could do if we write them to another location and update the pointers to them, and.... up the tree the changes ripple. When we get to the top of the tree, since disk drives write sectors atomically, the block number of the top can be written atomically into the superblock by the disk thereby committing everything the new top points to. This is indeed the way WAFL, the Write Anywhere File Layout filesystem invented by Dave Hitz at Network Appliance, works. It always ripples changes all the way to the top, and indeed that works rather well in practice, and most of their users are quite happy with its performance. Writing Twice May Be Optimal Sometimes

Suppose that a file is currently well laid out, and you write to a single block in the middle of it, and you then expect to do many reads of the file. That is an extreme case illustrating that sometimes it is worth writing twice so that a block can keep its current location while committing atomically. If one writes a node twice in this way, one also does not need to update its parent and ripple all the way to the top of the tree. Our code is a toolkit that can be used to implement different layout policies, and one of the available choices is whether to write over a block in its current place, or to relocate it to somewhere else. I don't think there is one right answer for all usage patterns.

If a block is adjacent to many other dirty blocks in the tree, then this decreases the significance of the cost to read performance of relocating it and its neighbors. If one knows that a repacker will run once a week (a repacker is expected for V4.1, and is (a bit oddly) absent from WAFL), this also decreases the cost of relocation. After a few years of experimentation, measurement, and user feedback, we will say more about our experiences in constructing user selectable policies.

Do we pay a performance penalty for making Reiser4 atomic? Yes, we do. Is it an acceptable penalty? We picked up a lot more performance from other improvements in Reiser4 than we lost to atomicity, and so it is not isolated in our measurements, but I am unscientifically confident that the answer is yes. If changes are either large or batched together with enough other changes to become large, the performance penalty is low and drowned out by other performance improvements. Scattered small changes threaten us with read performance losses compared to overwriting in place and taking our chances with the data's consistency if there is a crash, but use of a repacker will mostly alleviate this scenario. I have to say that in my heart I don't have any serious doubts that for the general purpose user the increase in data security is worthwhile. The users though will have the final say. Committing

A transaction preserves the previous contents of all modified blocks in their original location on disk until the transaction commits, and commit means the transaction has hereby reached a state where it will be completed even if there is a crash.

The dirty blocks of an atom (which were captured and subsequently modified) are divided into two sets, relocate and overwrite, each of which is preserved in a different manner.

The relocatable set is the set of blocks that have a dirty parent in the atom.

The relocate set is those members of the relocatable set that will be written to a new or first location rather than overwritten.

The overwrite set contains all dirty blocks in the atom that need to be written to their original locations, which is all those not in the relocate set. In practice this is those which do not have a parent we want to dirty, plus also those for which overwrite is the better layout policy despite the write twice cost. Note that the superblock is the parent of the root node and the free space bitmap blocks have no parent. By these definitions, the superblock and modified bitmap blocks are always part of the overwrite set.

The wandered set is the set of blocks that the overwrite set will be written to temporarily until the overwrite set commits.

An interesting definition is the minimum overwrite set, which uses the same definitions as above with the following modification. If at least two dirty blocks have a common parent that is clean then its parent is added to the minimum overwrite set. The parent's dirty children are removed from the overwrite set and placed in the relocate set. This policy is an example of what will be experimented with in later versions of Reiser4 using the layout toolkit.

For space reasons, we leave out the full details on exactly when we relocate vs. overwrite, and the reader should not regret this because years of experimenting is probably ahead before we can speak with the authority necessary for a published paper on the effects of the many details and variations possible.

When we commit we write a wander list which consists of a mapping of the wander set to the overwrite set. The wander list is a linked list of blocks containing pairs of block numbers. The last act of committing a transaction is to update the super block to point to the front of that list. Once that is done, if there is a crash, the crash recovery will go through that list and "play" it, which means to write the wandered set over the overwrite set. If there is not a crash, we will also play it.

There are many more details of how we handle the deallocation of wandered blocks, the handling of bitmap blocks, and so forth. You are encouraged to read the comments at the top of our source code files (e.g. wander.c) for such details.... Journalling optimizations

[edit] Copy-on-capture

Suppose one wants to capture a node which belongs to an atom with stage >= ASTAGE_PRE_COMMIT. This capture request should wait (sleep in capture_fuse_wait()) when atom is committed. The copy-on-capture optimization allows to satisfy capture request by creating a copy of a node which is being captured. The commit process takes control on one copy of the node, the capturing process takes control over another copy. It does not lead to any node versions confilicts because it is guaranted that one copy below the commit process will not be modified. Steal-on-capture

The idea of steal-on-capture optimization is that only the last committed transaction to modify an overwrite block actually needs to write that block. Other transactions can skip post-commit that block. This optimization, which is also present in ReiserFS version 3, means that frequently modified overwrite blocks will be written less than two times per transaction.

With this optimization a frequently modified overwrite block may avoid being overwritten by a series of atoms; as a result crash recovery must replay more atoms than without the optimization. If an atom has overwrite blocks stolen, the atom must be replayed during crash recovery until every stealing-atom commits. Repacker

Another way of escaping from the balancing time vs. space efficiency tradeoff is to use a repacker. 80% of files on the disk remain unchanged for long periods of time. It is efficient to pack them perfectly, by using a repacker that runs much less often than every write to disk. This repacker goes through the entire tree ordering, from left to right and then from right to left, alternating each time it runs. When it goes from left to right in the tree ordering, it shoves everything as far to the left as it will go, and when it goes from right to left it shoves everything as far to the right as it will go. (Left means small in key or in block number:-) ). In the absence of FS activity the effect of this over time is to sort by tree order (defragment), and to pack with perfect efficiency.

Reiser4.1 will modify the repacker to insert controlled "air holes", as it is well known that insertion efficiency is harmed by overly tight packing.

I hypothesize that it is more efficient to periodically run a repacker that systematically repacks using large IOs than to perform lots of 1 block reads of neighboring nodes of the modification points so as to preserve a balancing invariant in the face of poorly localized modifications to the tree. Plugins

man holding 3 plugins 8 Kinds of Plugins Make Reiser4 The Most Tweakable Filesystem Going File Plugins

Every file possesses a plugin id, and every directory possesses a plugin id. This plugin id will identify a set of methods. The set of methods will embody all of the different possible interactions with the file or directory that come from sources external to ReiserFS. It is a layer of indirection added between the external interface to ReiserFS, and the rest of ReiserFS. Each method will have a methodid. It will be usual to mix and match methods from other plugins when composing plugins. Directory Plugins

Reiser4 will implement a plugin for traditional directories. It will implement directory style access to file attributes as part of the plugin for regular files. Later we will describe why this is useful. Other directory plugins we will leave for later versions. There is no deep reason for this deferral. It is simply the randomness of what features attract sponsors and make it into a release specification; there are no sponsors at the moment for additional directory plugins. I have no doubt that they will appear later; new directory plugins will be too much fun to miss out on.:-) Hash Plugins

Directory is mapping from file name to file itself. This mapping is implemented through Reiser4 internal balanced tree. Unfortunately file names cannot be used as keys until keys of variable length are implemented, or unreasonable limitations on maximal file name length are imposed. To work around this file name is hashed and hash is used as key in a tree. No hash function is perfect and there always be hash collisions, that is, file names having the same value of a hash. Previous versions of reiserfs (3.5 and 3.6) used "generation counter" to overcome this problem: keys for file names having the same hash value were distinguished by having different generation counters. This allowed to amortize hash collisions at the cost of reducing number of bits used for hashing. This "generation counter" technique is actually some ad hoc form of support for non-unique keys. Keeping in mind that some form of this have to be implemented anyway, it seemed justifiable to implement more regular support for non-unique keys in Reiser4.

Another reason for using hashes is that some (arguable brain-dead) interfaces require them: telldir(3), and seekdir(3). These functions presume that file system can issue 64 bit "cookies" that can be used to resume a readdir. Cookies are implemented in most filesystems as byte offsets within a directory (which means they cannot shrink directories), and in ReiserFS as hashes of file names plus a generation counter.

Curiously enough, Single UNIX specification tags telldir(3), and seekdir(3) as "Extension", because "returning to a given point in a directory is quite difficult to describe formally, in spite of its intuitive appeal, when systems that use B-trees, hashing functions, or other similar mechanisms to order their directories are considered".

We order directory entries in ReiserFS by their cookies. This costs us performance compared to ordering lexicographically. (But is immensely faster than the linear searching employed by most other Unix filesystems.) Depending on the hash and its match to the application usage pattern there may be more or less performance lossage. Hash plugins will probably remain until version 5 or so, when directory plugins and ordering function plugins will obsolete them. Directory entries will then be ordered by file names like they should be (and possibly stem compressed as well). Security Plugins

Security plugins handle all security checks. They are normally invoked by file and directory plugins.

Example of reading a file:

  • Access the pluginid for the file.
  • Invoke the read method for the plugin.
  • The read method determines the security plugin for the file.
  • That security plugin invokes its read check method for determining whether to permit the read.
  • The read check method for the security plugin reads file/attributes containing the permissions on the file.
  • Since file/attributes are also files, this means invoking the plugin for reading the file/attribute.
  • The pluginid for this particular file/attribute for this file happens to be inherited (saving space and centralizing control of it).
  • The read method for the file/attribute is coded such that it does not check permissions when called by a sec plug method. (Endless recursion is thereby avoided.)
  • The file/attribute plugin employs a decompression algorithm specially designed for efficient decompression of our encoding of ACLs.
  • The security plugin determines that the read should be permitted.
  • The read method continues and completes.

Item Plugins

The balancing code will be able to balance an item iff it has an item plugin implemented for it. The item plugin will implement each of the methods the balancing code needs (methods such as splitting items, estimating how large the split pieces will be, overwriting, appending to, cutting from, or inserting into the item, etc).

In addition to all of the balancing operations, item plugins will also implement intra-item search plugins.

V3 of ReiserFS understood the structure of the items it balanced. This made adding new types of items storing such new security attributes as other researchers might develop too expensive in coding time, greatly inhibiting the addition of them to ReiserFS. In writing Reiser4 we hoped that there would be a great proliferation in the types of security attributes in ReiserFS if we made it a matter requiring not a modification of the balancing code by our most experienced programmers, but the writing of an item handler. This is necessary if we are to achieve our goal of making the adding of each new security attribute an order of magnitude or more easier to perform than it is now.

[edit] Key Assignment Plugins

When assigning the key to an item, the key assignment plugin is invoked, and it has a key assignment method for each item type. A single key assignment plugin is defined for the whole FS at FS creation time. We know from experience that there is no "correct" key assignment policy; squid has very different needs from average user home directories. Yes, there could be value in varying it more flexibly than just at FS creation time, but we have to draw the line somewhere when deciding what goes into each release.... Node Search and Item Search Plugins

Every node layout has a search method for that layout, and every item that is searched through has a search method for that item. (When doing searches, we search through a node to find an item, and then search within the item for those items that contain multiple things to find.) Putting Your New Plugin To Work Will Mean Recompiling

If you want to add a new plugin, we think your having to ask the sysadmin to recompile the kernel with your new plugin added to it will be acceptable for version 4.0. We will initially code plugin-id lookup as an in-kernel fixed length array lookup, methodids as function pointers, and make no provision for post-compilation loading of plugins. Performance, and coding cost, motivates this.

character almost drowning while other character hands him a plugin Without Plugins We Will Drown

People often ask, as ReiserFS grows in features, how will we keep the design from being drowned under the weight of the added complexity and from reaching the point where it is difficult to work on the code?

The infrastructure to support security attributes implemented as files also enables lots of features not necessarily security related. The plugins we are choosing to implement in v4.0 are all security related because of our funding source, but users will add other sorts of plugins just as they took DARPA's TCP/IP and used it for non-military computers. Only requiring that all features be implemented in the manner that maximizes code reuse will ReiserFS coding complexity down to where we can manage it over the long term. Plugins: FS Programming For The Lazy

Most plugins will have only a very few of their features unique to them and the rest of the plugin will be reused code. What Namesys sees as its role as a DARPA contractor is not primarily supplying a suite of security plugins, though we are doing that, but creating an architectural (not just the license) enabling of lots of outside vendors to efficiently create lots of innovative security plugins that Namesys would never have imagined if working by itself. Enhancing Security

superman character complaining about emergency

By far most casualties in wars have always been to civilians. In future information infrastructure attacks, who will take more damage, civilian or military installations? DARPA is funding us to make all Gnu/Linux computers throughout the world a little bit more resistant to attack. Fine Graining Security

Good Security Requires Precision In Specification Of Security

Suppose you have a large file with many components. A general principle of security is that good security requires precision of permissions. When security lacks precision, it increases the burden of being secure; the extent to which users adhere to security requirements in practice is a function of the burden of adhering to it. Space Efficiency Concerns Motivate Imprecise Security

Many filesystems make it space usage ineffective to store small components as separate files for various reasons. Not being separate files means that they cannot have separate permissions. One of the reasons for using overly aggregated units of security is space efficiency. ReiserFS currently improves this by an order of magnitude over most of the existing alternative art. Space efficiency is the hardest of the reasons to eliminate; its elimination makes it that much more enticing to attempt to eliminate the other reasons. Security Definition Units And Data Access Patterns Sometimes Inherently Don't Align

Applications sometimes want to operate on a collection of components as a single aggregated stream. (Note that commonly two different applications want to operate on data with different levels of aggregation; the infrastructure for solving this security issue will also solve that problem as well.) /etc/passwd As Example

I am going to use the /etc/passwd file as an example, not because I think that other solutions won't solve its problems better, but because the implementation of it as a single flat file in the early Unixes is a wonderful illustrative example of poorly granularized security that the readers may share my personal experiences with. I hope they will be able to imagine that other data files less famous could have similar problems.

Have you ever tried to figure out just exactly what part of your continually changing /etc/passwd file changed near the time of a break-in? Have you ever wished that you could have a modification time on each field in it? Have you ever wished the users could change part of it, such as the gecos field, themselves (setuid utilities have been written to allow this, but this is a pedagogical not a practical example), but not have the power to change it for other users?

There were good reasons why /etc/passwd was first implemented as a single file with one single permission governing the entire file. If we can eliminate them one by one, the same techniques for making finer grained security effective will be of value to other highly secure data files. Aggregating Files Can Improve The User Interface To Them

Consider the use of emacs on a collection of a thousand small 8-32 byte files like you might have if you deconstructed /etc/passwd into small files with separable acls for every field. It is more convenient in screen real estate, buffer management, and other user interface considerations, to operate on them as an aggregation all placed into a single buffer rather than as a thousand 8-32 byte buffers. How Do We Write Modifications To An Aggregation

Suppose we create a plugin that aggregates all of the files in a directory into a single stream. How does one handle writes to that aggregation that change the length of the components of that aggregation?

Richard Stallman pointed out to me that if we separate the aggregated files with delimiters, then emacs need not be changed at all to acquire an effective interface for large numbers of small files accessed via an aggregation plugin. If /new_syntax_access_path/big_directory_of_small_files/.glued is a plugin that aggregates every file in big_directory_of_small_files with a delimiter separating every file within the aggregation, then one can simply type emacs /new_syntax_access_path/big_directory_of_small_files/.glued, and the filesystem has done all the work emacs needs to be effective at this. Not a line of emacs needs to be changed.

One needs to be able to choose different delimiting syntax for different aggregation plugins so that one can, for say the passwd file, aggregate subdirectories into lines, and files within those subdirectories into colon separate fields within the line. XML would benefit from yet other delimiter construction rules. (We have been told by Philipp Guehring of LivingXML.NET that ReiserFS is higher performance than any database for storing XML, so this issue is not purely theoretical.)

[edit] Aggregation Is Best Implemented As Inheritance

In summary, to be able to achieve precision in security we need to have inheritance with specifiable delimiters and we need whole file inheritance to support ACLs. One Plugin Using Delimiters That Resemble sys_reiser4() Syntax

We provide the infrastructure for your constructing plugins that implement arbitrary processing of writes to inheriting files, but we also supply one generic inheriting file plugin that intentionally uses delimiters very close to the sys_reiser4() syntax. We will document the syntax more fully when that code is working, for now syntax details are in the comments in the file invert.c in the source code. API Suitable For Accessing Files That Store Security Attributes

A new system call sys_reiser4() will be implemented to support applications that don't have to be fooled into thinking that they are using POSIX. Through this entry point a richer set of semantics will access the same files that are also accessible using POSIX calls. Reiser4() will not implement more than hierarchical names. A full set theoretic naming system as described on our future vision page will not be implemented before SSN Reiserfs is implemented (Distributed Reiserfs is our distributed filesystem, Semi-Structured Naming Reiserfs is our enhanced semantics, whether we implement Didtrubuted Reiserfs or SSN Reiserfs first depends on which sponsors we find ;-) ). Reiser4() will implement all features necessary to access ACLs as files/directories rather than as something neither file nor directory. These include opening and closing transactions, performing a sequence of I/Os in one system call, and accessing files without use of file descriptors (necessary for efficient small I/O). SSN Reiserfs will use a syntax suitable for evolving into SSN Reiserfs syntax with its set theoretic naming. Flaws In Traditional File API When Applied To Security Attributes

Security related attributes tend to be small. The traditional filesystem API for reading and writing files has these flaws in the context of accessing security attributes:

  • Creating a file descriptor is excessive overhead and not useful when accessing an 8 byte attribute.
  • A system call for every attribute accessed is too much overhead when accessing lots of little attributes.
  • Lacking constraints: it is important to constrain what is written to the attribute, often in complex ways.
  • Lacking atomic semantics: Often one needs to update multiple attributes as one action that is guaranteed to either fully succeed or fully fail.

The Usual Resolution Of These Flaws Is A One-Off Solution

The usual response to these flaws is that people adding security related and other attributes create a set of methods unique to their attributes, plus non-reusable code to implement those methods in which their particular attributes are accessed and stored not using the methods for files, but using their particular methods for that attribute. Their particular API for that attribute typically does a one-off instantiation of a lightweight single system call write constrained atomic access with no code being reusable by those who want to modify file bodies. It is basic and crucial to system design to decompose desired functionality into reusable, orthogonal separated components. Persons designing security attributes are typically doing it without the filesystem that they want offering them a proper foundation and tool kit. They need more help from us core FS developers. Linus said that we can have a system call to use as our experimental plaything in this. With what I have in mind for the API, one rather flexible system call is all we want for creating atomic lightweight batched constrained accesses to files, with each of those adjectives to accesses being an orthogonal optional feature that may or may not be invoked in a particular instance of the new system call. One-Off Solutions Are A Lot of Work To Do A Lot Of

Looking at the coin from the other side, we want to make it an order of magnitude less work to add features to ReiserFS so that both users and Namesys can add at least an order of magnitude more of them. To verify that it is truly more extensible you have to do some extending, and our DARPA funding motivates us to instantiate most of those extensions as new security features.

This system call's syntax enables attributes to be implemented as a particular type of file. It avoids uglifying the semantics with two APIs for two supposedly different kinds of objects that don't truly need different treatment. All of its special features that are useful for accessing particular attributes are all also available for use on files. It has symmetry, and its features have been fully orthogonalized. There is nothing particularly interesting about this system call to a languages specialist (It's ideas were explored decades ago except by filesystem developers.) until SSN Reiserfs, when we will further evolve it into a set theoretic syntax that deconstructs tuple structured names into hierarchy and vicinity set intersection. That is described at www.namesys.com/whitepaper.html

[edit] Steps For Creating A Security Attribute

You can create a new security attribute by:

  • Defining a pluginid.
  • Composing a set of methods for the plugin from ones you create or reuse from other existing plugins.
  • Defining a set of items that act as the storage containers of the object, or reusing existing items from other plugins (e.g. regular files).
  • Implementing item handlers for all of the new items you create.
  • Creating a key assignment algorithm for all of the new items.

reiser4() System Call Description

The reiser4() system call (still being debugged at the time of writing) executes a sequence of commands separated by commas.

Assignment, and transaction, are the commands supported in Reiser4(); more commands will appear in SSN Reiserfs. <- and <<- are two of the assignment operators.

lhs(assignment target) values:

   * /..../process/range/(offset<-(loff_t),last_byte<-(loff_t))
     assigns (writes) to the buffer starting at address offset in the process address space, ending at last_byte. (The assignment source may be smaller or larger than the assignment target.) Representation of offset and last_byte is left to the coder to determine. It is an issue that will be of much dispute and little importance. Notice / is used to indicate that the order of the operands matters; see the future vision whitepaper for details of why this is appropriate syntax design. Note the lack of a file descriptor.
   * /filename
     assigns to the file named filename.
   * /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t))
     writes to the body, starting at offset, ending not past last_byte
   * /filename/..../range/(offset<-(loff_t) )
     writes to the body starting at ofset

rhs (assignment source) values:

   * /..../process/range/(offset<-(loff_t),last_byte<-(loff_t))
     reads from the buffer starting at address offset in the process address space, ending at last_byte. Representation of offset, last_byte is left to the coder to determine, as it is an issue that will be of much dispute and little importance.
   * /filename
     reads the entirety of the file named filename.
   * /filename/..../range/(offset<-(loff_t),last_byte<-(loff_t))
     reads from the body, starting at first_byte, ending not past last_byte
   * /filename/..../range/(offset<-(loff_t))
     reads from the body starting at offset until the end
   * /filename/..../stat/owner
     reads from the ownership field of the stat data (stat data is that which is returned by the stat() system call (owner, permissions, etc.) and stored on a per file basis by the FS.)

Note that "...." and "process" are style conventions for the name of a hidden subdirectory implementing methods and accessing metadata supported by a plugin. It is possible to rename it, etc. We had a discussion about whether to instead use names that could not clash with any legitimate name likely to be used by users. Vladimir Demidov suggested that cryptic names historically have harmed the acceptance of several languages, and so it was realized that being novice unfriendly in the naming was worse than risking a name collision, especially since it could be cured by using rename on "...." and "process" for the few cases where it is necessary. Constraints

(Note: this is not yet coded.) Another way security may be insufficiently fine grained is in values: it can be useful to allow persons to change data but only within certain constraints. For this project we will implement plugins; one type of plugin will be write constraints. Write-constraints are invoked upon write to a file; if they return non-error then the write is allowed. We will implement two trivial sample write-constraint plugins. One will be in the form of a kernel function loadable as a kernel module which returns non-error (thus allowing the write) if the file consists of the strings "secret" or "sensitive" but not "top-secret". The other, which does exactly the same, will be in the form of a perl program residing in a file and executed in user-space. Use of kernel functions will have performance advantages, particularly for small functions, but severe disadvantages in power of scripting, flexibility, and ability to be installed by non-secure sources. Both types of plugins will have their place.

Note that ACLs will also embody write constraints.

We will implement both constraints that are compiled into the kernel, and constraints that are implemented as user space processes. Specifically, we will implement a plugin that executes an arbitrary constraint contained in an arbitary named file as a user space process, passes the proposed new file contents to that process as standard input, and iff the process exits without error allows the write to occur.

It can be useful to have read constraints as well as write constraints. Auditing

(Note: this is not yet coded.) We will implement a plugin that notifies administrators by email when access is made to files, e.g. read access.

With each plugin implemented creating additional plugins becomes easier as the available toolkit is enriched. Auditing constitutes a major additional security feature, yet it will be easy to implement once the infrastructure to support it exists. (It would be substantial work to implement it without that infrastructure.)

The scope of this project is not the creation of plugins themselves, but the creation of the infrastructure that plugin authors would find useful. We want to enable future contributors to implement more secure systems on the Gnu/Linux platform, not implement them ourselves. By laying a proper foundation and creating a toolkit for them, we hope to reduce the cost of coding new security attributes for those who follow us by an order of magnitude. Employing a proper set of well orthogonalized primitives also changes the addition of these attributes from being a complexity burden upon the architecture into being an empowering extension of the architecture. Increasing the Allowed Granularity of Security

man holding sieve, only objects of a certain size go through.

(This feature is not yet coded.) Inheritance of security attributes is important to providing flexibility in their administration. We have spoken about making security more fine grained, but sometimes it needs to be larger grained. Sometimes a large number of files are logically one unit in regards to their security and it is desirable to have a single point of control over their security. Inheritance of attributes is the mechanism for implementing that. Security administrators should have the power to choose whatever units of security they desire without having to distort them to make them correspond to semantic units. Inheritance of file bodies using aggregation plugins allows the units of security to be smaller than files; inheritance of attributes allows them to be larger than files. Encryption On Commit

Currently, encrypted files suffer severely in their write performance when implemented using schemes that encrypt at every write() rather than at every commit to disk. We encrypt on flush such that a file with an encryption plugin id is encrypted not at the time of write, but at the time of flush to disk.

Encryption is implemented as a special form of repacking on flush, and it occurs for any node which has its CONTAINS_ENCRYPTED_DATA state flag set on it. Conclusion

Reiser4 offers a dramatically better infrastructure for creating new filesystem features. Files and directories have all of the features needed to make it not necessary to have file attributes be something different from files. The effectiveness of this new infrastructure is tested using a variety of new security features. Performance is greatly improved by the use of dancing trees, wandering logs, allocate on flush, a repacker, and encryption on commit. It was an important question whether we could increase the level of abstraction in our design without harming performance. Reiser4 gives you BOTH the most cleanly abstracted storage AND the highest performance storage of any filesystem.

[edit] Citations

     Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques". Morgan Kaufmann Publishers, Inc., 1993. Old but good textbook on transactions. Available at http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-190-2
     D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Proceedings of the 1994 USENIX Winter Technical Conference, pp. 235-246, San Francisco, CA, January 1994 Available at http://citeseer.nj.nec.com/hitz95file.html
     D. Hitz. "A Storage Networking Appliance". Tech. Rep TR3001, Network Appliance, Inc., 1995 Available at http://www.netapp.com/tech_library/3001.html
     D. Hitz, J. Lau and M. Malcolm. "File system design for an NFS file server appliance". Tech. Rep. TR3002, Network Appliance, Inc., 1995 Available at http://www.netapp.com/tech_library/3002.html
     J. Ousterhout and F. Douglis. "Beating the I/O Bottleneck: A Case for Log-Structured File Systems". ACM Operating System Reviews, Vol. 23, No. 1, pp.11-28, January 1989 Available at http://citeseer.nj.nec.com/ousterhout88beating.html
     M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison". Proceedings of the 1995 USENIX Technical Conference, pp. 249-264, New Orleans, LA, January 1995 Available at http://citeseer.nj.nec.com/seltzer95file.html
     M. Seltzer. "LFS and FFS Supplementary Information". 1995 http://www.eecs.harvard.edu/~margo/usenix.195/
     J. Ousterhout. "A Critique of Seltzer's 1993 USENIX Paper" http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique1.html
     J. Ousterhout. "A Critique of Seltzer's LFS Measurements" http://www.eecs.harvard.edu/~margo/usenix.195/ouster_critique2.html
     A. Sweeny, D. Doucette, W. Hu, C. Anderson, M. Nishimoto and G. Peck. "Scalability in the XFS File System". Proceedings of the 1996 USENIX Technical Conference, pp. 1-14, San Diego, CA, January 1996 Available at http://citeseer.nj.nec.com/sweeney96scalability.html
     G.M. Adel'son-Vel'skii and E.M. Landis, An algorithm for the organization of information, Soviet Math. Doklady 3, 1259-1262, 1972, This paper on AVL trees can be thought of as the founding paper of the field of storing data in trees. Those not conversant in Russian will want to read the [Lewis and Denenberg] treatment of AVL trees in its place. [Wood] contains a modern treatment of trees.
     Inside Macintosh, Files, by Apple Computer Inc., Addison-Wesley, 1992. Employs balanced trees for filenames, it was an interesting filesystem architecture for its time in a number of ways, now its problems with internal fragmentation have become more severe as disk drives have grown larger. I look forward to the replacement they are working on.
     Maurice J. Bach. "The Design of the Unix Operating System". 1986, Prentice-Hall Software Series, Englewood Cliffs, NJ, superbly written but sadly dated, contains detailed descriptions of the filesystem routines and interfaces in a manner especially useful for those trying to implement a Unix compatible filesystem. See [Vahalia].
     R. Haskin, Raymond A. Lorie: On Extending the Functions of a Relational Database System. SIGMOD Conference (body of paper not on web) 1982: 207-212, Reiser4 obsoletes this approach.
     Chen, P.M. Patterson, David A., A New Approach to I/O Performance Evaluation---Self-Scaling I/O Benchmarks, Predicted I/O Performance, 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, also available on Chen's web page.
     Ganger, Gregory R., Kaashoek, M. Frans. "Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files." A very well written paper focused on 1-10k file size issues, they use some similar notions (most especially their concept of grouping compared to my packing localities). Note that they focus on the 1-10k file size range, and not the sub-1k range. The 1-10k range is the weakpoint in ReiserFS V3 performance. The page with link to postscript paper available at http://amsterdam.lcs.mit.edu/papers/cffs.html
     by Remi Card extensive information, source code is available Probably our toughest current competitor, it is showing its age though, and recent enhancements of it (journaling, htrees, etc.) have not been performance effective. It embodies both the strengths and weaknesses of the incrementalist approach to coding, and substantially resembles the older FFS filesystem from BSD.
     M. McKusick, W. Joy, S. Leffler, R. Fabry. "A Fast File System for UNIX". ACM Transactions on Computer Systems, Vol. 2, No. 3, pp. 181-197, August 1984 describes the implementation of a filesystem which employs parent directory location knowledge in determining file layout. It uses large blocks for all but the tail of files to improve I/O performance, and uses small blocks called fragments for the tails so as to reduce the cost due to internal fragmentation. Numerous other improvements are also made to what was once the state-of-the-art. FFS remains the architectural foundation for many current block allocation filesystems, and was later bundled with the standard Unix releases. Note that unrequested serialization and the use of fragments places it at a performance disadvantage to ext2fs, though whether ext2fs is thereby made less reliable is a matter of dispute that I take no position on (Reiser4 is an atomic filesystem, which is a different level of reliability entirely). Available at http://citeseer.nj.nec.com/mckusick84fast.html.
     Gregory R. Ganger, Yale N. Patt. "Metadata Update Performance in File Systems". ( Abstract only)
     Describes a filesystem enriched to have more than hierarchical semantics, he shares many goals with this author, forgive me for thinking his work worthwhile. If I had to suggest one improvement in a sentence, I would say his semantic algebra needs closure.(Postscript only).
     [Hitz, Dave]
     A rather well designed filesystem optimized for NFS and RAID in combination. Note that RAID increases the merits of write-optimization in block layout algorithms. Available at http://www.netapp.com/technology/level3/3002.html
     [Holton and Das]
     Holton, Mike., Das, Raj. "The XFS space manager and namespace manager use sophisticated B-Tree indexing technology to represent file location information contained inside directory files and to represent the structure of the files themselves (location of information in a file)". Note that it is still a block (extent) allocation based filesystem, no attempt is made to store the actual file contents in the tree. It is targeted at the needs of the other end of the file size usage spectrum from ReiserFS, and is an excellent design for that purpose (though most filesystems including Reiser4 do well at writing large files, and I think it is medium-sized and smaller files where filesystems can substantively differentiate themselves.) SGI has also traditionally been a leader in resisting the use of unrequested serialization of I/O. Unfortunately, the paper is a bit vague on details. Available at http://www.sgi.com/Technology/xfs-whitepaper.html
     Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satayanarayanan, N., Sidebotham, R.N., West, M.J. "Scale and Performance in a Distributed File System". ACM Transactions on Computer Systems, 6(1), February 1988 A classic benchmark, it was too CPU bound to effectively stress ext2fs and ReiserFS, and is no longer very effective for modern filesystems.
     Knuth, D.E., The Art of Computer Programming, Vol. 3 (Sorting and Searching), Addison-Wesley, Reading, MA, 1973, the earliest reference discussing trees storing records of varying length.
     Wittle, Mark., and Bruce, Keith. "LADDIS: The Next Generation in NFS File Server Benchmarking", Proceedings of the Summer 1993 USENIX Conference., July 1993, pp. 111-128
     [Lewis and Denenberg]
     Lewis, Harry R., Denenberg, Larry. "Data Structures & Their Algorithms", HarperCollins Publishers, NY, NY, 1991, an algorithms textbook suitable for readers wishing to learn about balanced trees and their AVL predecessors.
     McCreight, E.M., Pagination of B*-trees with variable length records, Commun. ACM 20 (9), 670-674, 1977, describes algorithms for trees with variable length records.
     [McVoy and Kleiman]
     The implementation of write-clustering for Sun's UFS. Available at http://www.sun.ca/white-papers/ufs-cluster.html
     "Inside OLE" by Kraig Brockshmidt, discusses Structured Storage, abstract only. Structured storage is what you get when application developers need features to better manage the storage of objects on disk by the applications they write, and the filesystem group at their company can't be bothered with them. Miserable performance, miserable semantics. Available at http://www.microsoft.com/mspress/books/abs/5-843-2b.htm.
     J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M.D. Kupfer, and J.G. Thompson. "A Trace-driven Analysis of the UNIX 4.2BSD File System". In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15--24, Orcas Island, WA, December 1985.
     "Inside the Windows NT File System" the book is written by Helen Custer, NTFS is architected by Tom Miller with contributions by Gary Kimura, Brian Andrew, and David Goebel, Microsoft Press, 1994, an easy to read little book, they fundamentally disagree with me on adding serialization of I/O not requested by the application programmer, and I note that the performance penalty they pay for their decision is high, especially compared with ext2fs. Their FS design is perhaps optimal for floppies and other hardware eject media beyond OS control. A less serialized higher performance log structured architecture is described in [Rosenblum and Ousterhout]. That said, Microsoft is to be commended for recognizing the importance of attempting to optimize for small files, and leading the OS designer effort to integrate small objects into the file name space. This book is notable for not referencing the work of persons not working for Microsoft, or providing any form of proper attribution to previous authors such as [Rosenblum and Ousterhout]. Though perhaps they really didn't read any of the literature and it explains why theirs is the worst performing filesystem in the industry....
     K. Peacock. "The CounterPoint Fast File System". Proceedings of the Usenix Conference Winter 1988
     Rob Pike and Peter Weinberger, The Hideous Name, USENIX Summer 1985 Conference Proceedings, pp. 563, Portland Oregon, 1985. Short, informal, and drives home why inconsistent naming schemes in an OS are detrimental. Available at http://achille.cs.bell-labs.com/cm/cs/doc/85/1-05.ps.gz. His discussion of naming in plan 9: http://plan9.bell-labs.com/plan9/doc/names.html
     [Rosenblum and Ousterhout]
     M. Rosenblum and J. Ousterhout. "The Design and Implementation of a Log-Structured File System". ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 26-52, February 1992. Available at http://citeseer.nj.nec.com/rosenblum91design.html. This paper was quite influential in a number of ways on many modern filesystems, and the notion of using a cleaner may be applied to a future release of ReiserFS. There is an interesting on-going debate over the relative merits of FFS vs. LFS architectures, and the interested reader may peruse http://www.scriptics.com/people/john.ousterhout/seltzer93.html and the arguments by Margo Seltzer it links to.
     "tmpfs: A Virtual Memory File System" discusses a filesystem built to use swap space and intended for temporary files, due to a complete lack of disk synchronization it offers extremely high performance.
     Uresh Vahalia, "Unix Kernal Internals"
     Reiser, Hans T., Future Vision Whitepaper, 1984, Revised 1993. Available at http://www.namesys.com/whitepaper.html.
Personal tools