Revision as of 13:05, 12 November 2020

Reiser4 offers a brand new method of aggregating block devices into logical volumes on a local machine. It is a qualitatively new level in the development of file systems (and operating systems) - local volumes with parallel scaling out.

Unlike ZFS and similar systems, Reiser4 doesn't implement its own block layer. In our approach scaling out is performed by file system means, rather than by block layer means. The flow of IO-requests issued against each device is controlled by the user. To add a device to a logical volume with parallel scaling out, you first need to format that device - at first glance, this is the difference between parallel and non-parallel scaling out. The principal difference between the two will be discussed below.

Systems with parallel scaling out provide better scalability and resolve a number of problems inherent to non-parallel ones. In logical volumes with parallel scaling out, devices of smaller size and/or throughput don't become bottlenecks, as happens e.g. in RAID-0 and its popular modifications.

1 Fundamental shortcomings of logical volumes composed by block-layer means
2 The previous art
3 Our approach. O(1) space allocator
4 Parallel scaling out as disk resources conservation. Definitions and examples
5 Overhead of parallelism for local FS
6 Asymmetric logical volumes. Data and meta-data bricks
7 Stripes. Fibers. Distribution, allocation and migration. Basic definitions
8 Data distribution and migration. Fiber-Striping. Burst Buffers
9 Atomicity of volume operations
10 Limitations on asymmetric logical volumes
11 Security issues
12 Software and Disk Version 5.1.3. Compatibility
13 TODO

Fundamental shortcomings of logical volumes composed by block-layer means

1. Local file systems don't participate in scaling out. They just face a huge virtual block device, for which they need to maintain a free space map. Such maps grow as the volume fills with data. This increases the latency of free block allocation and, consequently, causes a substantial performance drop on large volumes which are almost full.

2. Loss of disk resources (space and throughput) on logical volumes composed of devices with different physical and geometric parameters (because of poor translations provided by "classic" RAID levels). Low-performance devices become a bottleneck in RAID arrays. Attempts to replace RAID levels with better algorithms lead to inevitable and unfixable degradation of logical volumes. Indeed, defragmentation tools work only on the space of virtual disk addresses. If you use classic RAID levels, then everything is fine here: reducing fragmentation on the virtual device always results in reducing fragmentation on the physical ones. However, if you use more sophisticated translations to save disk space and bandwidth, then fragmentation on the real devices tends to accumulate, and you are not able to reduce it just by defragmenting the virtual device. Note that what matters is always the real devices - no one actually cares what happens on virtual ones.

3. With only block layer means it is impossible to build heterogeneous storage composed of devices of different nature. You are not able to use different approaches to devices-components of the same logical volume (e.g. defragment only rotational drives, issue discard requests only for solid state drives, etc).

4. It is impossible to efficiently implement data migration (and, hence, data tiering) on logical volumes composed by block-layer means.

The previous art

Previously, there was only one method for scaling out local volumes - by block layer means. That is, the file system deals only with virtual disk addresses (allocation, defragmentation, etc), and the block layer translates virtual disk addresses to real ones and back. The most common complaint is about the performance drop on such logical volumes when they are large and more than 70% full.

Mostly it is related to disk space allocators, which are, to put it mildly, not excellent, and introduce big latency when searching for free blocks on extra-large volumes. Moreover, nothing better has been invented for the past 30 years. It may well be that better algorithms for free space management simply do not exist.

Some file systems (ZFS and the like) implement their own block layers. It helps to implement failover; however, the mentioned problem doesn't disappear - if the block layer does its job very well, then the file system, again, faces a huge logical block device, which is hard to handle.

Significant progress in scaling out was made by parallel network file systems (GPFS, Lustre, etc). However, it was unclear how to apply their technologies to a local FS. Mostly, it is because local file systems don't have the luxury of a "backend storage" as the network ones do. What a local FS does have is only an extremely poor interface for interaction with the block layer. For example, in Linux a local FS can only compose and issue an IO request against some buffer (page). In other words, it was unclear what a "local parallel file system" is.

Our approach. O(1) space allocator

In ~2010 we realized that the first approach (implementing its own block layer inside a local FS) is totally wrong. Instead we need to pay attention to parallel network file systems and adopt their methods. However, as I mentioned, there is nothing even close to a direct analogy - it means that for a local FS we need to design "parallelism" from scratch. The same goes for distribution algorithms - we are totally unhappy with the existing ones. Of course, you can deploy a networking FS on the same local machine for a number of block devices, but that would not be a serious solution. We state that a serious analogy can be defined and implemented in a properly designed local FS - meet Reiser5.

The basic idea is pretty simple - to not mess with large free space maps (whose sizes depend on the volume size). Instead, we need to manage many small ones of limited size. At any moment the file system should be able to pick the proper small space map and work only with it. Needless to say, for any logical volume, however big, search time in such a map will also be limited by some value which doesn't depend on the logical volume size. For this reason, we'll call it an O(1) space allocator.

The simplest way is to maintain one space map per block device that is a component of the logical volume. If some device is too large, simply split it into a number of partitions to make sure that no space map exceeds some upper limit. Thus, users also need to make some effort on their side to keep the space allocator O(1).
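The idea can be illustrated with a minimal sketch, assuming a bitmap-style free space map per brick. The names `BrickSpaceMap`, `MAX_MAP_BLOCKS` and `split_device` are hypothetical and are not part of the Reiser4/Reiser5 code base; only the principle (search cost bounded by a per-map limit, independent of the total volume size) is taken from the text above.

```python
MAX_MAP_BLOCKS = 1024  # hypothetical upper limit on blocks tracked by one space map


class BrickSpaceMap:
    """Free space map of a single brick; never larger than MAX_MAP_BLOCKS."""

    def __init__(self, nr_blocks):
        if nr_blocks > MAX_MAP_BLOCKS:
            raise ValueError("brick too large: split it into partitions first")
        self.used = [False] * nr_blocks

    def allocate(self):
        """Linear scan, but over at most MAX_MAP_BLOCKS entries."""
        for addr, busy in enumerate(self.used):
            if not busy:
                self.used[addr] = True
                return addr
        return None  # this brick is full


def split_device(nr_blocks):
    """Split a large device into partitions that respect the map limit."""
    sizes = []
    while nr_blocks > 0:
        part = min(nr_blocks, MAX_MAP_BLOCKS)
        sizes.append(part)
        nr_blocks -= part
    return sizes


# A volume is a list of small maps; one allocation touches exactly one of them,
# so its cost does not grow when more bricks are added.
volume = [BrickSpaceMap(n) for n in split_device(2500)]
brick_id = 1  # assumed to be chosen elsewhere, before allocation
disk_addr = volume[brick_id].allocate()
full_address = (brick_id, disk_addr)
```

The worst-case scan length here is `MAX_MAP_BLOCKS` regardless of how many bricks the volume has, which is the sense in which the allocator is called O(1).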

Parallel scaling out as disk resources conservation. Definitions and examples

Here we'll consider an abstract subsystem S of the operating system managing a logical volume composed of one, or more removable components.

Definition 1. We'll say that S saves disk space and throughput of the logical volume, if

1) its data capacity is a sum of data capacities of its components

2) its disk bandwidth is a sum of disk bandwidths of its components

We'll say that a logical volume (LV) managed by such a system features parallel scaling out (PSO).

There is a good analogy to understand the feature of PSO: imagine that it rains and you put out several cylindrical buckets with openings of different sizes to collect water. In this example raindrops represent IO-requests, and the set of buckets represents a logical volume. Note that the amount of water falling into each bucket is proportional to the area of its opening (considered as throughput). In this example all buckets are filled with water evenly and fairly: if one bucket is full, then the other ones are also full. Note that a non-cylindrical form of buckets will likely break the fairness of water distribution between them, so that PSO won't take place in this case.

In practice, however, IO-systems are more complicated: IO requests are distributed, queued, etc. And conservation of disk resources usually doesn't take place: the disk bandwidth of any logical volume turns out to be smaller than the sum of the bandwidths of its components. Nevertheless, if the loss of resources is small and doesn't increase with the growth of the volume, then we'll say that such a system features parallel scaling out.

In complex IO-systems the "leak" of disk bandwidth has a complex nature and can happen in any subsystem: in the file system, in the block layer, etc. The loss can also be caused by interface properties, etc. The fundamental reason for almost all resource leaks is that the mentioned subsystems were poorly designed (because better algorithms were not known at that moment, or for other reasons).

The classic example of disk space and throughput loss is RAID arrays. Linear RAID saves disk space, but always drops disk bandwidth. RAID-0 composed of devices of different size and bandwidth drops both disk space and disk bandwidth of the resulting logical volume. The same holds for their modifications like RAID-5. In all mentioned examples the loss of disk bandwidth is caused by poor algorithms (specifically, by the fact that IO requests are directed to every component in wrong proportions).

Definition 2. A file system managing a logical volume is said to be with parallel scaling out, if it saves disk space and bandwidth of that logical volume. In other words, if it doesn't drop the mentioned disk resources.

Note that a file system is only a part of an IO-subsystem. It can easily happen that the file system saves disk resources while the whole system does not, for example because of a poorly designed block layer which puts IO requests issued for different block devices into the same queue on a local machine.

As an example, let's calculate disk bandwidth of a logical volume composed of 2 devices, the first of which has disk bandwidth 200M/sec, second - 300M/sec. We'll consider 3 systems: in the first one the mentioned devices compose linear RAID, in the second one - striped RAID (RAID-0), in the third one they are managed by a file system with parallel scaling out.

Linear RAID distributes IO requests unevenly: first we write to the first device. Once it is full, we write to the second one. Disk bandwidth of linear RAID is defined by the throughput of the device we are currently writing to. Thus it is never more than the throughput of the faster device, i.e. 300 M/sec.

RAID-0 distributes IO requests evenly (but not fairly). Suppose N megabytes of IO-requests are written to the array. In the same interval of time the same amount N/2 will be issued against each device. On the first device it will be written in (N/2)/200 = N/400 sec. On the second device it will be written in (N/2)/300 = N/600 sec. Note that the first device is slower, therefore we should wait N/400 sec for all N megabytes to be written to the array. So the throughput of RAID-0 in our case is N/(N/400) = 400 M/sec.

FS with parallel scaling out distributes IO requests evenly and fairly. In the same interval of time the number of blocks issued against each device is N*C, where C is relative throughput of the device. Relative throughput of the first device is 200/(200+300) = 0.4. Of the second one - 300/(200+300) = 0.6

Portion of IO-requests issued for each device will be written in parallel in the same time 0.4N/200 = 0.6N/300 sec. Therefore, throughput of our logical volume in this case is N/(0.4N/200) = 500 M/sec.

The resulting table of throughput:

Linear RAID:               <= 300 M/sec
RAID-0:                       400 M/sec
Parallel scaling out FS:      500 M/sec
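The three figures can be re-derived with a short calculation. This is only a sketch of the idealized model used in the example (no queuing or other losses); the function names are made up for illustration.

```python
from fractions import Fraction


def linear_raid_bound(bandwidths):
    """Only one device is written at a time, so the fastest one is the ceiling."""
    return max(bandwidths)


def raid0_throughput(bandwidths):
    """Equal portions per device; the slowest device defines completion time."""
    n = Fraction(1)                      # total amount of data (arbitrary unit)
    portion = n / len(bandwidths)
    slowest_time = max(portion / b for b in bandwidths)
    return n / slowest_time


def pso_throughput(bandwidths):
    """Portions proportional to relative throughput; all devices finish together."""
    n = Fraction(1)
    total = sum(bandwidths)
    times = [n * Fraction(b, total) / b for b in bandwidths]
    return n / max(times)


devices = [200, 300]  # M/sec, as in the example above
table = {
    "linear (upper bound)": linear_raid_bound(devices),
    "raid0": raid0_throughput(devices),
    "pso": pso_throughput(devices),
}
```

Exact rational arithmetic is used so that the results are not blurred by floating point rounding.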

According to the definitions above, any local file system built on top of RAID/LVM does NOT possess parallel scaling out (first, because RAID and LVM don't save disk resources; second, because the latency introduced by the free space allocator grows with the volume). For the same reasons any local FS which implements its own block layer (ZFS, Btrfs, etc) does NOT possess parallel scaling out. Note that any network FS built on top of two or more local FS managing simple partitions as backend saves disk resources.

Overhead of parallelism for local FS

As we mentioned above, the characteristic feature of any FS with PSO is that before adding a device to a logical volume you should format it. Of course, it adds some overhead to the system. However, that overhead is not dramatically large. Specifically, with the reiser4 disk format40 specifications the disk overhead includes 80K at the beginning of each device-component. Next, for each device Reiser5 reads the on-disk super-block and loads it into memory. Thus, memory overhead includes one persistent in-memory super-block (~500 bytes) per device-component of a logical volume. That is, a logical volume composed of one million devices will take ~500M of memory (pinned). I think that a person maintaining such a volume will be able to find $30 for an additional memory card. That overhead is the single disadvantage of FS with PSO. At least, we don't know of other ones.
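As a quick check of the arithmetic, the overhead can be computed for a volume of any size. The constants are the approximate figures quoted above; the function name is hypothetical.

```python
DISK_OVERHEAD_PER_BRICK = 80 * 1024  # bytes reserved at the start of each brick
MEM_OVERHEAD_PER_BRICK = 500         # approximate bytes of pinned memory per brick


def volume_overhead(nr_bricks):
    """Return (total reserved disk bytes, total pinned memory bytes)."""
    return (nr_bricks * DISK_OVERHEAD_PER_BRICK,
            nr_bricks * MEM_OVERHEAD_PER_BRICK)


disk_bytes, mem_bytes = volume_overhead(1_000_000)
```

For one million bricks the memory part comes out at 500,000,000 bytes, which matches the ~500M estimate.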

Asymmetric logical volumes. Data and meta-data bricks

So, any logical volume with parallel scaling out is composed of block devices formatted by the mkfs utility. Such a device has a special name: brick, or storage subvolume of a logical volume.

To begin with, we have implemented the simplest approach, where meta-data is located on dedicated block devices - we'll call them meta-data bricks. As a reminder, in reiser4 the notion of "meta-data" includes all kinds of items (key'ed records in the storage tree), and the notion of data means unformatted blocks pointed to by "extent pointers". Such unformatted nodes are used to store bodies of regular files.

Meta-data bricks are allowed to contain unformatted data blocks. Data bricks contain only unformatted data blocks. For obvious reasons such logical volumes are called "asymmetric".

Stripes. Fibers. Distribution, allocation and migration. Basic definitions

A stripe is a logical unit of distribution, that is, a minimal object no parts of which can be stored on different bricks.

A set of distribution units dispatched to the same brick is called fiber.

Comment. In the previous art fibers were called stripes (case of RAID-0), and logical units of distribution didn't have a special name. For a number of adjacent sectors forming such a unit a notion of "stripe width" was used.

Data stripe is a logical block of some size at some offset in a file.

Meta-data striping also can be defined, but we don't consider it here for simplicity.

File system block is, as usual, an allocation unit on some brick.

A stripe is said to be allocated if all its parts got disk addresses on some brick.

From these definitions it directly follows that file system block can not contain more than one stripe. On the other hand, an allocated stripe can occupy many blocks.

For any file system block its full address in a logical volume is defined as a pair (brick-ID, disk-address).

A stripe is said to be dispatched if it got the first component (brick-ID) of its full address in the logical volume.

A stripe is said to be migrated if its old disk addresses got released, and new ones (possibly on another brick) got allocated.

The core difference between parallel and non-parallel scaling out in terms of distribution and allocation: in PSO-systems any stripe first gets distributed, then allocated. In systems with non-parallel scaling out it is the other way around - any stripe first gets allocated, then distributed. An example is any local FS built on top of a RAID-0 array. Indeed, such an FS first allocates a virtual disk address for a logical block, then the block layer assigns a real device-ID and translates that virtual address to a real one.

Data distribution and migration. Fiber-Striping. Burst Buffers

Distribution defines which device-component of a logical volume an IO request composed for a dirty buffer (page) will be issued against.

In systems with non-parallel scaling out the "destination" device is always defined by the virtual disk address allocated for that request. E.g. for RAID-0 the ID of the destination device is defined as (N % M), where N is the virtual address (block number) allocated by the file system, and M is the number of disks in the array.

In our approach (O(1) space allocator) we allocate disk addresses on each physical device independently, so for every IO-request we first need to assign a destination device, then ask a block allocator managing that device to allocate a block number for this request. So, in our approach distribution doesn't depend on allocation.
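The two orders of operations can be contrasted in a small sketch. Everything here is illustrative: `Brick`, `pso_write` and the `placeholder` distribution function are hypothetical names, the real address `N // M` assumes one-block chunks, and the placeholder is not the distribution algorithm used by Reiser5.

```python
def raid0_translate(virtual_addr, nr_disks):
    """Non-parallel scaling out: the FS has already allocated virtual_addr,
    and the block layer derives the device from it as N % M."""
    device_id = virtual_addr % nr_disks
    real_addr = virtual_addr // nr_disks  # assumption: one-block chunks
    return device_id, real_addr


class Brick:
    """A brick with its own, independent block allocator."""

    def __init__(self, nr_blocks):
        self.nr_blocks = nr_blocks
        self.next_free = 0

    def allocate(self):
        if self.next_free == self.nr_blocks:
            raise OSError("no space left on brick")
        addr = self.next_free
        self.next_free += 1
        return addr


def pso_write(stripe_key, bricks, distribute):
    """Parallel scaling out: dispatch first, then allocate on the chosen brick."""
    brick_id = distribute(stripe_key, len(bricks))  # step 1: distribution
    disk_addr = bricks[brick_id].allocate()         # step 2: allocation
    return brick_id, disk_addr


def placeholder(key, nr_bricks):
    """Stand-in distribution function, used only to make the sketch runnable."""
    return key % nr_bricks


bricks = [Brick(100), Brick(100)]
first = pso_write(5, bricks, placeholder)
second = pso_write(7, bricks, placeholder)
```

In `pso_write` the brick is chosen without consulting any allocator, and the disk address is then obtained only from the allocator of that one brick.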

By default Reiser5 offers distribution based on algorithms (so-called fiber-striping) invented by Eduard Shishkin (patented stuff). With our algorithms all your data will be distributed evenly and fairly among all devices-components of the logical volume. It means that the portion of IO requests issued against each device is equal to the relative capacity of that device assigned by the user. The operation of adding/removing a device to/from a logical volume automatically invokes data migration, so that the resulting distribution is also fair. The portion of migrated data is always equal to the relative capacity of the added/removed device. The speed of data migration is mostly determined by the throughput of the device being added/removed.
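The fiber-striping algorithms themselves are not described here, so the sketch below uses a different, well-known technique, weighted rendezvous (highest random weight) hashing, purely to illustrate the two properties stated above: each brick receives a share of stripes close to its relative capacity, and adding a brick moves a share of stripes close to the relative capacity of the new brick. The function names and brick labels are hypothetical.

```python
import hashlib
import math


def _uniform(stripe_key, brick_id):
    """Deterministic pseudo-random number in (0, 1) for a (stripe, brick) pair."""
    digest = hashlib.sha256(f"{stripe_key}:{brick_id}".encode()).digest()
    x = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (x + 0.5) / 2**52


def dispatch(stripe_key, capacities):
    """Weighted rendezvous hashing: pick the brick with the highest score.
    capacities maps brick-ID to the capacity assigned by the user."""
    return max(
        capacities,
        key=lambda b: -capacities[b] / math.log(_uniform(stripe_key, b)),
    )


bricks = {"a": 200, "b": 300}
keys = range(20000)
before = {k: dispatch(k, bricks) for k in keys}
share_b = sum(1 for v in before.values() if v == "b") / len(before)

bricks["c"] = 500  # add a brick with relative capacity 500/1000
after = {k: dispatch(k, bricks) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
moved_share = len(moved) / len(after)
```

With this technique a stripe changes its brick only if the new brick wins for it, so `share_b` should be close to 0.6 and `moved_share` close to 0.5, with every moved stripe landing on brick "c".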

Alternatively, Reiser4 allows users to control data distribution and migration themselves.

An important application of distribution and migration is data tiering in the HPC area, in the form of so-called Burst Buffers (dumping "hot data" to a high-performance proxy-device, followed by its migration to "persistent storage" in the background).

In all cases the file system remembers the location of stripes.

Atomicity of volume operations

Almost all volume operations (adding/removing a brick, changing brick capacity, etc) involve re-balancing (i.e. massive migration of data blocks), so it is technically difficult to implement full atomicity of such operations. Instead, we issue 2 checkpoints (the first before re-balancing, the second after), and handle 2 cases depending on where in relation to those points the volume operation was interrupted. In the first case the user should repeat the operation, in the second case the user should complete the operation (in the background mode) using the volume.reiser4 utility. See the administration guide on reiser4 logical volumes for details.
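One possible reading of this scheme can be written down as a small decision function. It is an interpretation, not the actual implementation: it assumes that "the first case" means an interruption before the first checkpoint was recorded and "the second case" means an interruption between the two checkpoints. The names `Checkpoint` and `recovery_action` are hypothetical.

```python
from enum import Enum


class Checkpoint(Enum):
    NONE = 0              # interrupted before the first checkpoint
    BEFORE_REBALANCE = 1  # first checkpoint reached, re-balancing not finished
    AFTER_REBALANCE = 2   # second checkpoint reached


def recovery_action(last_checkpoint):
    """What the user is expected to do after an interrupted volume operation."""
    if last_checkpoint is Checkpoint.NONE:
        return "repeat the operation"
    if last_checkpoint is Checkpoint.BEFORE_REBALANCE:
        return "complete the operation in the background"
    return "nothing to do"


action = recovery_action(Checkpoint.BEFORE_REBALANCE)
```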

Limitations on asymmetric logical volumes

Maximal number of bricks in a logical volume:

in the "builtin" distribution mode - 2^32
in the "custom" distribution mode  - 2^64

In the "builtin" distribution mode any 2 bricks of the same logical volume can not differ in size by more than 2^19 (524,288, i.e. roughly half a million) times. For example, your logical volume can not contain both 1M and 2T bricks, which differ by a factor of 2^21.

Maximal number of stripe pointers held by one 4K-metadata block: 75 (for node40 format).

Maximal number of data blocks served by 1 meta-data block: 75*S, where S is stripe width in file system blocks. For example, for 128K- stripes and 4K blocks (S=32) one meta-data block can serve not more than 2400 data blocks. In particular, when all bricks are of equal capacity, it means that one meta-data brick can serve not more than 2400 data bricks.
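The 75*S bound is easy to recompute for other stripe and block sizes. The constant comes from the node40 limit stated above; the function name is made up for illustration.

```python
STRIPE_POINTERS_PER_META_BLOCK = 75  # node40 format, 4K meta-data block


def data_blocks_per_meta_block(stripe_size, block_size):
    """Upper bound on data blocks served by one meta-data block (75 * S)."""
    if stripe_size % block_size:
        raise ValueError("stripe size must be a multiple of the block size")
    stripe_width = stripe_size // block_size  # S, in file system blocks
    return STRIPE_POINTERS_PER_META_BLOCK * stripe_width


limit = data_blocks_per_meta_block(128 * 1024, 4 * 1024)
```

For 128K stripes and 4K blocks this gives S = 32 and a limit of 2400, as in the example.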

For the best quality of "builtin" distribution it is recommended that:

a) stripe size is not larger than 1/10000 of total volume size.

b) number of bricks in your logical volume is a power of 2 (i.e. 2, 4, 8, 16, etc). If you cannot afford it, then make sure that number of hash space segments (a property of your logical volume, which can be increased online) is not smaller than 100 * number-of-bricks.
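The size-ratio limit and the two recommendations can be checked mechanically before creating a volume. The sketch below is only an illustration: `check_builtin_volume` is a hypothetical helper, not an option of any reiser4 utility, and it assumes that "total volume size" means the sum of brick sizes.

```python
MAX_SIZE_RATIO = 2**19  # limit for the "builtin" distribution mode


def check_builtin_volume(brick_sizes, stripe_size, nr_segments):
    """Return a list of violated limits/recommendations (empty if all is fine).
    Sizes are in bytes; nr_segments is the number of hash space segments."""
    problems = []
    if max(brick_sizes) > MAX_SIZE_RATIO * min(brick_sizes):
        problems.append("bricks differ in size by more than 2^19 times")
    if stripe_size * 10000 > sum(brick_sizes):
        problems.append("stripe size is larger than 1/10000 of the volume size")
    n = len(brick_sizes)
    is_power_of_two = (n & (n - 1)) == 0
    if not is_power_of_two and nr_segments < 100 * n:
        problems.append("fewer than 100 hash space segments per brick")
    return problems


MIB, TIB = 2**20, 2**40
bad = check_builtin_volume([1 * MIB, 2 * TIB], 128 * 1024, 1024)
good = check_builtin_volume([TIB, TIB, TIB], 128 * 1024, 300)
```

The first call reports the size-ratio violation from the 1M/2T example; the second volume has three bricks (not a power of 2) but enough hash space segments, so nothing is reported.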

Not more than one volume operation can be executed on the same logical volume at a time. If some volume operation is not completed, then attempts to execute other ones will return an error (EBUSY).

Security issues

"Builtin" distribution combines random and deterministic methods. It is "salted" with volume-ID, which is known only to root. Once it is compromised (revealed), the logical volume can be subjected to "free space attack" - with known volume-ID an attacker (non-privileged user) will be able to fill some data brick up to 100%, while others have a lot of free space. Thus, nobody will be able to write anymore to that volume. So, keep your volume-ID a secret!

Software and Disk Version 5.1.3. Compatibility

To implement parallel scaling out we upgraded Reiser4 code base with the following new plugins:

1) "asymmetric" volume plugin (new interface);
2) "fsx32" distribution plugin (new interface);
3) "striped" file plugin (existing interface);
4) "extent41" item plugin (existing interface);
5) "format41" disk format plugin (existing interface).

In the best traditions we increment version numbers. The old disk and software version was 4.0.2. "Minor" number (2) is incremented because of (1-4). "Major" number (0) is incremented because of (5) and changes in the format super-block. "Principal" number (4) is incremented because of changes in master super-block. For more details about compatibility see this

Old reiser4 partitions (of format 4.0.X) will be supported by Reiser5 kernel module. For this you need to enable option "support "Plan-A key allocation scheme" (not default), when configuring the kernel. Note that it will automatically disable support of logical volumes. Such mutual exclusiveness is due to performance reasons.

Reiser4progs of software release number 5.X.Y don't support old reiser4 partitions of format 4.0.X. To fsck the latter, use reiser4progs of software release number 4.X.Y - it will exist as a separate branch.

TODO

. Upgrading FSCK to work with logical volumes;

. Asymmetric LV w/ more than 1 meta-data brick per-volume;

. Symmetric logical volumes (meta-data on all bricks);

. 3D-snapshots of LV (snapshots with an ability to roll back not only file operations, but also volume operations);

. Global (networking) logical volumes.


Logical Volumes Background