Revision as of 16:57, 16 August 2020

A logical volume (LV) can be composed of any number of block devices that differ in physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.

WARNING: This software is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before its stabilization.

IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save/update the configuration of your LV in a file which doesn't belong to that volume.

1 Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing
2 Prepare Software and Hardware
3 Creating a logical volume
4 Adding a data brick to LV
5 Removing a data brick from LV
6 Changing brick's capacity
7 Operations with meta-data brick
8 Unmounting a logical volume
9 Deploying a logical volume after correct unmount
10 Deploying a logical volume after correct shutdown
11 Deploying a logical volume after hard reset or system crash
12 LV monitoring
13 Checking free space
14 Checking quality of data distribution
15 FAQ

Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing

Basic configuration of a logical volume is the following information:

1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).

The item #4 is to handle incomplete operations interrupted by various reasons (system crash, hard reset, etc).

For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume.
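One way to keep such a record is a small file on another file system. The sketch below is only an illustration: the class, its field names and the JSON layout are hypothetical and are not defined by Reiser5 or reiser4progs.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class VolumeConfig:
    """Hypothetical record of items 1-4 of the basic volume configuration."""
    volume_uuid: str                                  # item 1
    bricks: List[str] = field(default_factory=list)   # item 3
    pending_brick: Optional[str] = None               # item 4

    @property
    def brick_count(self) -> int:                     # item 2
        return len(self.bricks)

    def begin_operation(self, brick: str) -> None:
        """Record the brick to be added/removed BEFORE the volume operation."""
        self.pending_brick = brick

    def finish_add(self) -> None:
        """Call after a successful add: count the brick, clear item 4."""
        if self.pending_brick is None:
            raise ValueError("no operation in progress")
        self.bricks.append(self.pending_brick)
        self.pending_brick = None

    def finish_remove(self) -> None:
        """Call after a successful removal: drop the brick, clear item 4."""
        if self.pending_brick is None:
            raise ValueError("no operation in progress")
        if self.pending_brick in self.bricks:
            self.bricks.remove(self.pending_brick)
        self.pending_brick = None

    def to_json(self) -> str:
        """Serialize the record; write the result to a file off the volume."""
        data = asdict(self)
        data["brick_count"] = self.brick_count
        return json.dumps(data, indent=2)
```

Saving the output of to_json() before and after every volume operation gives a record from which the set of bricks to register can be read back later.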

Abstract capacity (or simply capacity) of a brick is a positive integer. Capacity is a property of the brick defined by the user. Don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity, and in which units. In particular, it can be the size of the block device in kilobytes, or its size in megabytes, or its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that capacities of all bricks of the same logical volume are measured in the same units. Also, it would be utterly pointless to assign different properties as abstract capacities for bricks of the same LV, for example, size of the block device for one brick and disk bandwidth for another.

Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user.

Capacity of a logical volume is defined as the sum of the capacities of its component bricks.

Relative capacity of a brick is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick.

Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1.

(Real) data space usage on a brick is number of data blocks, stored on that brick.

Ideal (or expected) data space usage on a brick is T*C, where T is total number of data blocks stored in the volume. C is relative capacity of the brick.
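The definitions above can be restated as a few lines of arithmetic. This is a minimal sketch, not code from reiser4progs; the function names are made up for illustration.

```python
from typing import List, Sequence


def partitioning(capacities: Sequence[int]) -> List[float]:
    """Relative capacities (C1, C2, ...) of all bricks; they sum to 1."""
    if not capacities or any(c <= 0 for c in capacities):
        raise ValueError("capacities must be positive")
    total = sum(capacities)  # capacity of the logical volume
    return [c / total for c in capacities]


def ideal_usage(capacities: Sequence[int], total_data_blocks: int) -> List[float]:
    """Ideal data space usage T*C for every brick."""
    return [total_data_blocks * c for c in partitioning(capacities)]
```

For example, bricks with capacities 2, 1 and 1 give the partitioning (0.5, 0.25, 0.25), so of 1000 data blocks the first brick is expected to hold 500.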

It is recommended to compose volumes in the way so that space-based partitioning coincides with throughput-based one - it would be the optimal volume configuration, which provides true parallelism. If it is impossible for some reason, then choose a preferred partitioning method (space-based, or throughput-based). Note that space-based partitioning saves volume space, whereas throughput based one saves volume throughput.

When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity.

Most volume operations are accompanied by rebalancing, which keeps fairness of distribution. For example, adding a brick to a logical volume changes its partitioning, and hence, breaks fairness of the distribution, so we need to move some data stripes to the new brick to make distribution fair. Also you can not simply remove a brick from a logical volume - all data stripes should be moved from that brick to other bricks of the logical volume.

Every time the user performs a volume operation, Reiser5 marks the LV as "not balanced". After successful balancing the status of the LV is changed to "balanced". If the balancing procedure fails for some reason, it should be resumed manually (with the volume.reiser4 utility).

It is allowed to perform regular file operations on not balanced LV. However, in this case: a) we don't guarantee a good quality of data distribution on your LV. b) you won't be able to perform volume operations on your LV except balancing - any other volume operation will return error (EBUSY).

So, don't forget to bring your LV to the balanced state as soon as possible!

Prepare Software and Hardware

Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found here.

Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure there is the following message in kernel logs:

"Loading Reiser4 (Software Framework Release: 5.X.Y)"

Build and install the latest libaal.

Download, build and install the latest version 2.A.B of Reiser4progs package.

Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine:

# volume.reiser4 -?

Creating a logical volume

Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with proper tools (e.g. uuid(1)) and store it in an environment variable for convenience:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 256K:

# STRIPE=256K
# echo "Using stripe size $STRIPE"
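The 1/10000 rule of thumb can be checked with simple integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, sizes are plain byte counts, and 256K is assumed to mean 256 * 1024 bytes.

```python
RECOMMENDED_STRIPES_PER_VOLUME = 10000  # from the 1/10000 rule of thumb


def max_recommended_stripe(volume_size_bytes: int) -> int:
    """Largest stripe size (in bytes) that stays within 1/10000 of the volume."""
    return volume_size_bytes // RECOMMENDED_STRIPES_PER_VOLUME


def stripe_is_recommended(stripe_bytes: int, volume_size_bytes: int) -> bool:
    """True if the stripe does not exceed 1/10000 of the volume size."""
    return stripe_bytes <= max_recommended_stripe(volume_size_bytes)
```

Under these assumptions a 256K stripe fits the recommendation for a 10 GiB volume but not for a 1 GiB one.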

Start by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:

# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1

Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it will be enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
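The 1/200 sizing rule works out as follows. This is a rough sketch of the rule of thumb only; the function names are hypothetical and sizes are given in bytes.

```python
META_TO_VOLUME_RATIO = 200  # meta-data brick >= 1/200 of maximal volume size


def min_meta_brick_size(max_volume_size: int) -> int:
    """Smallest recommended meta-data brick size (rounded up)."""
    return -(-max_volume_size // META_TO_VOLUME_RATIO)  # ceiling division


def max_volume_size_for(meta_brick_size: int) -> int:
    """Largest volume a meta-data brick of this size is expected to service."""
    return meta_brick_size * META_TO_VOLUME_RATIO
```

With decimal units, max_volume_size_for(100 * 10**9) returns 20 * 10**12, which matches the 100G / ~20T example.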

Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks.

Mount your initial logical volume consisting of one meta-data brick:

# mount /dev/vdb1 /mnt

Find a record about your volume in the output of the following command:

# volume.reiser4 -l

Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume!

Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).

Adding a data brick to LV

At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there are no other volume operations (e.g. removing a brick) in progress on your volume, otherwise your operation will fail with EBUSY.

Obviously, adding a brick will increase capacity of your volume.

Choose a block device for the new data brick. Make sure that it is neither too large nor too small. Capacities of any 2 bricks of the same logical volume can not differ by more than 2^19 (524288, about half a million) times. E.g. your logical volume can not contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.
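Before formatting a device, the capacity limit can be checked in advance. The sketch below assumes that all capacities use the same units and that a ratio of exactly 2^19 is still accepted; the latter is an assumption, not something stated by the text. The function name is hypothetical.

```python
from typing import Sequence

MAX_CAPACITY_RATIO = 2 ** 19  # limit quoted in the text (524288)


def capacities_compatible(capacities: Sequence[int],
                          max_ratio: int = MAX_CAPACITY_RATIO) -> bool:
    """True if no two capacities differ by more than max_ratio times."""
    if not capacities or min(capacities) <= 0:
        raise ValueError("capacities must be positive")
    # Comparing the extremes is enough: every other pair differs by less.
    return max(capacities) <= min(capacities) * max_ratio
```

Using byte sizes as capacities, capacities_compatible([2**20, 2 * 2**40]) is False, which corresponds to the 1M and 2T example.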

Format the new device with the same volume ID and stripe size as you used for the meta-data brick, but also specify the "-a" option (to not restrict data capacity).

# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2

Important: the data brick must be formatted with the same volume ID and stripe size as the meta-data brick of your logical volume. Otherwise, the operation of adding a data brick will fail.

Update item #4 of your volume configuration with UUID or name of the brick you want to add.

To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point:

# volume.reiser4 -a /dev/vdb2 /mnt

The procedure of adding a brick automatically invokes re-balancing, which moves a portion of data stripes to the newly added brick (so that the resulting distribution will be fair).

Portion of data blocks moved during such rebalancing is equal to the relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small.
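The cost property above can be estimated before adding a brick. This is a sketch of the arithmetic only, with hypothetical function names; actual numbers reported by the file system may differ slightly.

```python
from typing import Sequence


def fraction_moved_on_add(existing_capacities: Sequence[int],
                          new_capacity: int) -> float:
    """Relative capacity of the new brick in the updated volume."""
    updated_total = sum(existing_capacities) + new_capacity
    return new_capacity / updated_total


def blocks_moved_on_add(existing_capacities: Sequence[int],
                        new_capacity: int,
                        total_data_blocks: int) -> int:
    """Estimated number of data blocks moved to the new brick."""
    fraction = fraction_moved_on_add(existing_capacities, new_capacity)
    return round(total_data_blocks * fraction)
```

For instance, adding a brick of capacity 50 to two bricks of capacity 100 each gives a fraction of 50/250 = 0.2, so about one fifth of the data blocks would be moved.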

Like other user-space utilities, the operation of adding a brick can return error, even in the assumption that the brick you wanted to add is properly formatted. In this case check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of bricks in your LV. Most likely it is the same as it was before the failed operation. In this case simply repeat the operation of adding a brick from scratch.

Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4).

Removing a data brick from LV

At any time you are able to remove a data brick from your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that there are no other volume operations (e.g. adding a brick) in progress on your volume, otherwise your operation will fail with EBUSY.

Obviously, removing a brick will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC).

Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt.

Update your volume configuration with the UUID and name of the brick you want to remove (item #4).

To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point:

# volume.reiser4 -r /dev/vdb2 /mnt

The procedure of brick removal automatically invokes re-balancing, which distributes data of the brick to be removed among other bricks, so that the resulting distribution is also fair. The portion of data stripes moved during such rebalancing is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity).

It can happen, that the command above completes with error (like other user-space applications). In this case check the status of your LV:

# volume.reiser4 /mnt

If volume is not balanced, then simply complete balancing manually:

# volume.reiser4 -b /mnt

Otherwise, check the number of the bricks in your logical volume - it should be the same as before the failed operation. The error -ENOSPC indicates that free space on other bricks is not enough to fit all the data of the brick you want to remove.

On success update your volume configuration: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: they should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).

Changing brick's capacity

At any time (in the assumption that no other volume operation is in progress) you can change abstract capacity of any brick to some new value, different from 0. Changing capacity always changes volume partitioning, and therefore, breaks fairness of distribution, so Reiser5 automatically launches rebalancing to make sure that resulted distribution is fair for the new set of capacities.

In particular, increasing a brick's capacity will move some data from other bricks to the brick whose capacity was increased. Decreasing a brick's capacity will move some data from the brick whose capacity was decreased to other bricks.

To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run

# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt

Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt".

The operation of changing capacity can return error. Most likely, it is -ENOSPC, which is a side effect of concurrent regular file writes. In this case check the status of your LV. If it is unbalanced, then consider removing some files from your LV and complete balancing by running

# volume.reiser4 -b /mnt

Otherwise, repeat the operation from scratch.

Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.

Operations with meta-data brick

Meta-data brick can also contain data stripes and participate in data distribution like other data bricks. So that all the volume operations described above are also applicable to meta-data brick. Note, however, that it is impossible to completely remove meta-data brick from the logical volume for obvious reasons (meta-data need to be stored somewhere), so brick removal operation applied to the meta-data brick actually removes it from Data-Storage Array (DSA), not from the logical volume. DSA is a subset of LV consisting of bricks, participating in data distribution. Once you remove meta-data brick from DSA, that brick will be used only to store meta-data. Operation of adding a brick, being applied to a meta-data brick, returns the last one back to DSA.

Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied.

To check the status of meta-data brick simply run

# volume.reiser4 /mnt

and compare values of "bricks total" and "bricks in DSA". If they are equal, then meta-data brick participates in data distribution. Otherwise, "bricks total" should be 1 more than "bricks in DSA" - it indicates that meta-data brick doesn't participate in data distribution (and therefore, doesn't contain data blocks). Note that other cases are impossible: for data bricks participation in LV and DSA is always equivalent.

Unmounting a logical volume

To terminate a mount session just issue usual umount command with the mount point specified.

Note that after unmounting the volume all bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:

# volume.reiser4 -u BRICK_NAME

Deploying a logical volume after correct unmount

Make sure (by checking your volume configuration) that all bricks of the volume are registered in the system. To register a brick issue the following command:

# volume.reiser4 -g BRICK_NAME

The list of all volumes and bricks registered in the system can be found in the output of the following command:

# volume.reiser4 -l

Issue usual mount(8) command against one of the bricks of your volume. It is recommended to issue it against meta-data brick.

NOTE: Reiser5 will refuse to mount a logical volume, in the case, when a wrong (incomplete or redundant) set of bricks is registered in the system. Redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.

Deploying a logical volume after correct shutdown

To mount your LV, first make sure that all its bricks (data and meta-data) are registered in the system. Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that users keep track of their LV: store its configuration somewhere, but not on this volume! And don't forget to update that configuration after _every_ volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore it: currently there are no tools to manage off-line logical volumes, so users have to keep track of the configuration on their own. Doing so is not at all difficult.

To register a brick in the system use the following command:

# volume.reiser4 -g BRICK_NAME

To print a list of all registered bricks use

# volume.reiser4 -l

To mount your LV simply issue a mount(8) command against one of the bricks of your LV. We recommend to issue it against meta-data brick.

Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue a mount command against.

Deploying a logical volume after hard reset or system crash

If no volume operations were interrupted by hard reset or system crash, then just follow the instructions in the previous section.

In Reiser5 only a limited number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.

Depending on a kind of interrupted volume operation, perform one of the following actions:

Adding a brick was interrupted

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the operation) and try to mount. In the case of error register also the brick you wanted to add and try to mount again.

Check the status of your LV by running

# volume.reiser4 /mnt

If the volume is unbalanced, then complete balancing manually by running

# volume.reiser4 -b /mnt

Check "bricks total" of your LV in the output of

# volume.reiser4 /mnt

Compare it with the old number of bricks in the configuration. The new value should be greater than the old one by 1. If the number of bricks is the same, then your operation of adding a brick was completely rolled back by the transaction manager, so you need to repeat it from scratch. Otherwise, your operation was successfully completed - update your volume configuration accordingly.

Brick removal was interrupted

Check your volume configuration. Register the old set of bricks (that is, the set of bricks that the volume had before applying the interrupted operation) except the brick you wanted to remove. Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again.

Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced then complete balancing manually by running

# volume.reiser4 -b /mnt

Comment. After successful completion of balancing the brick will be automatically removed from the volume. Make sure of it by checking the status of your LV:

# volume.reiser4 /mnt

Update your volume configuration respectively.

Another volume operation was interrupted

Using the volume configuration, register the new set of bricks and try to mount the volume. The mount should be successful.

Check the status of your LV:

# volume.reiser4 /mnt

If the volume is unbalanced then complete balancing manually by running

# volume.reiser4 -b /mnt

LV monitoring

Common info about LV mounted at /mnt

# volume.reiser4 /mnt

ID:		Volume UUID
volume:		ID of plugin managing the volume
distribution:	ID of distribution plugin
stripe:		Stripe size in bytes
segments:	Number of hash space segments (for distribution)
bricks total:	Total number of bricks in the volume
bricks in DSA:	Number of bricks participating in data distribution
balanced:	Balanced status of the volume

Info about any brick of the volume, specified by its index J

# volume.reiser4 -p J /mnt

internal ID:	Brick's "internal ID" and its status in the volume
external ID:	Brick's UUID
device name:	Name of the block device associated with the brick
block count:	Size of the block device in blocks
blocks used:	Total number of occupied blocks on the device
system blocks:	Minimal possible number of busy blocks on that device
data capacity:	Abstract capacity of the brick
space usage:	Portion of occupied blocks on the device
in DSA:		Participation in regular data distribution
is proxy:	Participation in data tiering (Burst Buffers, etc)

Comment. When retrieving brick's info make sure that no volume operations over that volume are in progress. Otherwise the command above will return error (EBUSY).

WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info run sync(1) and make sure that no regular file operations are in progress.

Checking free space

To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run

# sync
# df --block-size=4K /mnt

To check number of free blocks on the brick of index J run

# volume.reiser4 -p J /mnt

Then calculate the difference between "block count" and "blocks used".

Comment. Not all free blocks on a brick/volume are available for use. Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).

NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks.
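The relation between the two numbers can be approximated as follows. This is only a sketch of the ~95% rule stated above, with hypothetical function names; the exact reservation computed by the file system may differ, so treat the result as an estimate.

```python
RESERVED_PERCENT = 5  # approximate share of free blocks reserved by Reiser4


def free_blocks(block_count: int, blocks_used: int) -> int:
    """Total free blocks on a brick: 'block count' minus 'blocks used'."""
    return block_count - blocks_used


def available_blocks(block_count: int, blocks_used: int) -> int:
    """Estimate of free blocks available for use (~95% of free blocks)."""
    free = free_blocks(block_count, blocks_used)
    return free * (100 - RESERVED_PERCENT) // 100
```

For a brick with "block count" 1000 and "blocks used" 400 this gives 600 free blocks, of which roughly 570 are available.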

"Space usage" statistics shows a portion of busy blocks on individual brick. For the reasons explained above "space usage" on any brick can not be more than 0.95

Checking quality of data distribution

Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality.

Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one.

If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of actual data distribution can be rather bad: in this case the file system's priority is to keep low-performance devices from becoming a bottleneck, and efficient space usage is not a high priority.

Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks.

To check quality of distribution

1) make sure that meta-data brick doesn't contain data blocks;
2) make sure that no regular file and volume operations are currently in progress;
3) find "blocks used", "system blocks" and "data capacity" statistics for each data brick:

# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt

4) find real data space usage on each brick;
5) calculate partitioning and ideal data space usage on each data brick;
6) find deviation of (4) from (5).
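Once the statistics are collected, the remaining steps can be sketched as a short calculation. The function below is an illustration, not part of reiser4progs: its name and the dictionary keys are hypothetical, and the statistics are expected to be copied by hand from the output of volume.reiser4 -p. Quality is taken as 1 minus the largest relative deviation.

```python
from typing import Dict, List


def distribution_quality(bricks: List[Dict[str, int]]) -> float:
    """Quality of distribution over data bricks.

    Each entry holds the 'blocks_used', 'system_blocks' and
    'data_capacity' statistics of one data brick.
    """
    # Real data space usage: blocks used minus system blocks.
    real = [b["blocks_used"] - b["system_blocks"] for b in bricks]
    total_real = sum(real)
    total_capacity = sum(b["data_capacity"] for b in bricks)
    if total_real <= 0 or total_capacity <= 0:
        raise ValueError("need stored data and positive capacities")
    # Ideal usage: relative capacity times total data space usage.
    ideal = [total_real * b["data_capacity"] / total_capacity for b in bricks]
    # Relative deviation of real usage from ideal usage, per brick.
    deviation = [(r - i) / total_real for r, i in zip(real, ideal)]
    return 1 - max(abs(d) for d in deviation)
```

The function uses unrounded relative capacities, so its intermediate values can differ slightly from a hand calculation that rounds them to four digits.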

Example.

Let's build an LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G), vdd1 (5G)) with space-based partitioning:

# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"

# mkfs.reiser4 -U $VOL_ID -y    -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1

# mount /dev/vdb1 /mnt

Fill the meta-data brick with data:

# dd if=/dev/zero of=/mnt/myfile bs=256K

No space left on device...

Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:

# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt

Move all data blocks to the newly added bricks:

# volume.reiser4 -r /dev/vdb1 /mnt
# sync

Now meta-data brick doesn't contain data blocks (only meta-data ones), so that we can calculate quality of data distribution

# volume.reiser4 /mnt -p0
blocks used: 503

# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069

# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391

Based on the statistics above, calculate the quality of distribution.

Total data capacity of the volume:

C = 2621069 + 1310391 = 3931460

Relative capacities of data bricks:

C1 = 2621069 /(2621069 + 1310391) = 0.6667
C2 = 1310391 /(2621069 + 1310391) = 0.3333

Real space usage on data bricks (blocks used - system blocks):

R1 = 1657203 - 115 = 1657088
R2 = 833001 - 73 = 832928

Space usage on the volume:

R = R1 + R2 = 1657088 + 832928 = 2490016

Ideal data space usage on data bricks:

I1 = C1 * R = 0.6667 * 2490016 = 1660094
I2 = C2 * R = 0.3333 * 2490016 = 829922

Deviation:

D = (R1, R2) - (I1, I2) = (-3006, 3006)

Relative deviation:

D/R = (-0.0012, 0.0012)

Quality of distribution:

Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988

Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q.

Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.

FAQ

Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume?

A. Bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend that you consider mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.

Logical Volumes Administration