Logical Volumes Administration
Before working with logical volumes you need to understand some basic principles.
A logical volume (LV) can be composed of any number of block devices with different physical and geometric parameters. However, the optimal configuration (true parallelism) imposes some restrictions and dependencies on the sizes of such devices.
WARNING: This software is not stable. Don't put important data on logical volumes managed by software of release number 5.X.Y. Also, don't mount your old partitions in kernels with Reiser4 of SFRN 5.X.Y before it is stabilized.
IMPORTANT: Currently there are no tools to manage Reiser5 logical volumes off-line, so it is strongly recommended to save and update the configuration of your LV in a file that doesn't reside on that volume.
Basic definitions. Volume configuration. Brick's capacity. Partitioning. Fair distribution. Balancing
Basic configuration of a logical volume is the following information:
1) Volume UUID;
2) Number of bricks in the volume;
3) List of brick names or UUIDs in the volume;
4) UUID or name of the brick to be added/removed (if any). That brick is not counted in (2) and (3).
Item #4 is needed to handle incomplete operations, interrupted for various reasons (system crash, hard reset, etc), when bringing logical volumes on-line.
For each volume its configuration should be stored somewhere (but not on that volume!) and properly updated before and after each volume operation performed on that volume. We make the user responsible for this. Volume configuration is needed to facilitate deploying a volume.
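One way to keep such a record is a small plain-text file written by a helper like the one below. This is only a sketch: the function name `save_lv_config`, the file layout and the key names are hypothetical and are not part of reiser4progs. It stores the four items listed above; the pending brick (item #4) is deliberately not counted among the bricks.

```shell
# Hypothetical helper (not part of reiser4progs): write the four
# configuration items to a plain-text file kept OUTSIDE the volume.
# Usage: save_lv_config FILE VOL_UUID PENDING_BRICK BRICK...
# Pass "-" as PENDING_BRICK when no add/remove operation is in flight.
save_lv_config() {
    cfg_file=$1; vol_uuid=$2; pending=$3
    shift 3
    {
        echo "volume_uuid=$vol_uuid"      # item 1
        echo "brick_count=$#"             # item 2
        for brick in "$@"; do
            echo "brick=$brick"           # item 3
        done
        echo "pending_brick=$pending"     # item 4
    } > "$cfg_file"
}
```

For example, `save_lv_config /root/lv.conf "$VOL_ID" /dev/vdb2 /dev/vdb1` would record a one-brick volume with /dev/vdb2 about to be added; the path /root/lv.conf is an arbitrary choice.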
Capacity of a brick (or abstract capacity) is a positive integer. It is a property of the brick defined by the user; don't confuse it with the size of the block device. Think of it as the brick's "weight" in some units. It is the user who decides which property of the brick to assign as its abstract capacity and in which units. In particular, it can be the size of the block device in kilobytes or in megabytes, its throughput in M/sec, or another geometric or physical parameter of the device associated with the brick. It is important that the capacities of all bricks of the same logical volume are measured in the same units. Likewise, it would be pointless to use different properties as abstract capacities for bricks of the same LV, for example, block device size for one brick and disk bandwidth for another.
Capacity of each brick gets initialized by mkfs utility. By default it is calculated as number of free blocks on the device at the very end of the formatting procedure. For meta-data brick it is calculated as 70% of such amount. Capacity of any brick can be changed on-line by user.
Capacity of a logical volume is defined as the sum of the capacities of its component bricks.
Relative capacity of a brick is the ratio of brick's capacity to volume's capacity. Relative capacity defines a portion of IO-requests that will be issued against that brick.
Array of relative capacities (C1, C2, ...) of all bricks is called volume partitioning. Obviously, C1 + C2 + ... = 1.
Real data space usage on a brick is number of data blocks, stored on that brick.
Ideal data space usage on a brick is defined as T*C, where T is the total number of data blocks stored in the volume and C is the relative capacity of the brick.
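The definitions above can be tried out with a few lines of shell and awk. This is a minimal sketch; the function names `partitioning` and `ideal_usage` are made up for illustration, and the capacities are arbitrary numbers in arbitrary (but identical) units.

```shell
# Print the volume partitioning (relative capacities) for a list of
# abstract brick capacities.
partitioning() {
    echo "$@" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        for (i = 1; i <= NF; i++)
            printf "%s%.4f", (i > 1 ? " " : ""), $i / total
        printf "\n"
    }'
}

# Ideal data space usage T*C of one brick, where C is the brick's
# capacity divided by the volume's capacity.
# Usage: ideal_usage T BRICK_CAPACITY VOLUME_CAPACITY
ideal_usage() {
    awk -v t="$1" -v c="$2" -v total="$3" \
        'BEGIN { printf "%.0f\n", t * c / total }'
}

partitioning 2 1 1       # prints: 0.5000 0.2500 0.2500
ideal_usage 1000 2 4     # prints: 500
```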
It is recommended to compose volumes so that space-based partitioning coincides with throughput-based partitioning. This is the optimal volume configuration, which provides true parallelism. If that is impossible for some reason, then choose a preferred partitioning method (space-based or throughput-based). Note that space-based partitioning saves volume space, whereas throughput-based partitioning saves volume throughput.
When performing regular file operations, Reiser5 distributes data stripes throughout the volume evenly and fairly. It means that portion of IO-requests issued against each brick is equal to its relative capacity, that is, to the portion of capacity that the brick adds to the total volume's capacity.
In contrast with regular file operations, volume operations break fairness of data distribution on your logical volume. To restore fairness of distribution, a special balancing procedure should be run on the volume. For example, after adding a brick to a logical volume, the balancing procedure will populate the new brick with data, moved from other bricks.
All volume operations except brick removal are fast, atomic and leave the volume in unbalanced state.
Operation of brick removal always includes balancing, which moves data from the brick you want to remove to other bricks of the volume. If that data migration is interrupted for some reason, then the volume is marked as a "volume with incomplete brick removal".
It is allowed to perform regular file operations and volume operations on an unbalanced LV (provided that the imbalance was not caused by an incomplete removal). However, in this case we don't guarantee a good quality of data distribution on your LV. In addition, on a volume with incomplete removal you won't be able to perform regular volume operations: first you will need to complete the removal by running a special removal completion procedure on your volume.
Prepare Software and Hardware
Build, install and boot kernel with Reiser4 of software framework release number 5.X.Y. Kernel patches can be found here.
Note that the Linux kernel and GNU utilities still recognize the testing software as "Reiser4". Make sure that the following message appears in the kernel logs:
"Loading Reiser4 (Software Framework Release: 5.X.Y)"
Build and install the latest libaal.
Download, build and install the latest version 2.A.B of Reiser4progs package.
Make sure that utility for managing logical volumes is installed (as a part of reiser4progs package) on your machine:
# volume.reiser4 -?
Creating a logical volume
Start by choosing a unique ID (uuid) for your volume. By default it is generated by the mkfs utility. However, you can generate it yourself with a suitable tool (e.g. uuidgen(1)) and store it in an environment variable for convenience:
# VOL_ID=`uuidgen`
# echo "Using uuid $VOL_ID"
Choose a stripe size for your logical volume. For a good quality of distribution it is recommended that stripe doesn't exceed 1/10000 of volume size. On the other hand, too small stripes will increase space consumption on your meta-data brick. In our example we choose stripe size 512K:
# STRIPE=512K
# echo "Using stripe size $STRIPE"
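The 1/10000 recommendation is easy to check before formatting. The sketch below assumes both sizes are given in bytes; the function name `stripe_ok` is hypothetical.

```shell
# Return success if the stripe does not exceed 1/10000 of the volume size.
# Usage: stripe_ok VOLUME_BYTES STRIPE_BYTES
stripe_ok() {
    [ $(( $2 * 10000 )) -le "$1" ]
}

# A 512K stripe fits a 10G volume, but is too large for a 1G one.
stripe_ok 10737418240 524288 && echo "512K stripe is fine for 10G"
stripe_ok 1073741824 524288 || echo "512K stripe is too large for 1G"
```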
Begin by creating the first brick of your volume, the meta-data brick, passing the volume ID and stripe size to the mkfs.reiser4 utility:
# mkfs.reiser4 -U $VOL_ID -t $STRIPE /dev/vdb1
Currently only one meta-data brick per volume is supported, so it is recommended that the block device for the meta-data brick is not too small. In most cases it is enough if your meta-data brick is not smaller than 1/200 of the maximal volume size. For example, a 100G meta-data brick will be able to service a ~20T logical volume.
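The 1/200 rule of thumb can be turned into a one-line calculation. This is a sketch only; `min_meta_size` is a hypothetical name, and the result is in the same units as the argument, rounded up.

```shell
# Smallest recommended meta-data brick for a given maximal volume size
# (1/200 of the volume size, rounded up, same units as the argument).
min_meta_size() {
    echo $(( ($1 + 199) / 200 ))
}

min_meta_size 20000    # a 20000G (~20T) volume -> prints: 100
```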
Data and meta-data bricks don't differ from the standpoint of disk format, and there is no special option to inform mkfs utility that we want to create exactly meta-data brick: the first brick in the volume automatically becomes a meta-data brick, and other bricks are interpreted as data bricks.
Mount your initial logical volume consisting of one meta-data brick:
# mount /dev/vdb1 /mnt
Find a record about your volume in the output of the following command:
# volume.reiser4 -l
Create configuration of your logical volume (its definition is above) and store it somewhere, but not on that volume!
Your logical volume is now on-line and ready to use. You can perform regular file operations and volume operations (e.g. add a data brick to your LV).
Adding a data brick to LV
At any time you are able to add a data brick to your LV. You can do it in parallel with regular file operations executing on this volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal" and that no other volume operations are in progress on your volume. Otherwise your operation will fail with EBUSY.
Obviously, adding a brick will increase capacity of your volume.
Choose a block device for the new data brick. Make sure that it is neither too large nor too small: capacities of any two bricks of the same logical volume cannot differ by more than a factor of 2^19 (about half a million). For example, your logical volume cannot contain both 1M and 2T bricks. Any attempt to add a brick of improper capacity will fail with an error.
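A quick pre-check of this limit might look as follows. It is a sketch under two assumptions: both capacities are positive integers in the same units, and "differ more than 2^19 times" is read as "the larger one exceeds the smaller one multiplied by 2^19". The names `MAX_RATIO` and `capacities_compatible` are hypothetical.

```shell
MAX_RATIO=524288    # 2^19, the limit stated above

# Return success if two brick capacities may coexist in one volume.
# Usage: capacities_compatible CAPACITY_A CAPACITY_B
capacities_compatible() {
    if [ "$1" -lt "$2" ]; then small=$1; large=$2; else small=$2; large=$1; fi
    [ "$large" -le $(( small * MAX_RATIO )) ]
}

# 1M and 2T (in bytes) differ by a factor of 2^21, so they are rejected.
capacities_compatible 1048576 2199023255552 || echo "1M and 2T: incompatible"
# 1G and 2T differ by a factor of 2^11, so they are accepted.
capacities_compatible 1073741824 2199023255552 && echo "1G and 2T: ok"
```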
Format it with the same volume ID and stripe size, as you used for meta-data brick, but specify also "-a" option (to not restrict data capacity).
# mkfs.reiser4 -U $VOL_ID -t $STRIPE -a /dev/vdb2
It is important that data brick is formatted with the same volume ID and stripe size, as the meta-data brick of your logical volume. Otherwise, operation of adding a data brick will fail.
Update item #4 of your volume configuration with UUID or name of the brick you want to add.
To add a brick simply pass its name as an argument for the option "-a" and specify your LV via its mount point:
# volume.reiser4 -a /dev/vdb2 /mnt
By default the operation of adding a brick is fast and atomic and leaves the volume in an unbalanced state. After adding a brick you might therefore want to run a balancing procedure, which moves a portion of data from the other bricks of the logical volume to the new brick and makes data distribution on your volume fair again:
# volume.reiser4 -b /mnt
Portion of data blocks, being moved during such rebalancing, is equal to relative capacity of the new brick, that is to the portion of capacity that the new brick adds to updated LV's capacity. This important property defines the cost of balancing procedure. If the portion of capacity added by a brick is small, then number of stripes moved during balancing is also small.
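To estimate the cost of balancing in advance, compute the relative capacity that the new brick will have in the updated volume. A minimal sketch (the function name `moved_portion_on_add` is hypothetical):

```shell
# Portion of data blocks moved by balancing after adding a brick.
# Usage: moved_portion_on_add CURRENT_VOLUME_CAPACITY NEW_BRICK_CAPACITY
moved_portion_on_add() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.4f\n", new / (old + new) }'
}

# Adding a brick of capacity 1 to a volume of capacity 3 moves a quarter
# of the data.
moved_portion_on_add 3 1    # prints: 0.2500
```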
Adding a brick in conjunction with the option -B (--with-balance) will automatically trigger the balancing procedure:
# volume.reiser4 -Ba /dev/vdb2 /mnt
Upon successful completion update your volume configuration. That is, increment (#2), add info about the new brick to (#3) and remove records at (#4).
When adding more than one brick at once, call volume.reiser4 with option -a for each brick individually in any order. It will be reasonable to not complete each call with balancing. Run balancing only after adding the last brick.
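The "add several bricks, balance once" pattern can be wrapped in a small function. This is a sketch: the function name `add_bricks` and the `VOLUME_TOOL` variable are hypothetical conveniences, only the -a and -b options described above are used, and updating your volume configuration before and after each call is left to you.

```shell
# Tool to invoke; can be overridden, e.g. for a dry run.
VOLUME_TOOL=${VOLUME_TOOL:-volume.reiser4}

# Add every listed brick without balancing, then balance once at the end.
# Usage: add_bricks MOUNT_POINT BRICK...
add_bricks() {
    mnt=$1; shift
    for brick in "$@"; do
        "$VOLUME_TOOL" -a "$brick" "$mnt" || return 1
    done
    "$VOLUME_TOOL" -b "$mnt"
}
```

For example: `add_bricks /mnt /dev/vdc1 /dev/vdd1`.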
Removing a data brick from LV
At any time you are able to remove any data brick from your LV. You can do it in parallel with regular file operations executing on that volume. Make sure, however, that your volume is not marked as "volume with incomplete brick removal" and that no other volume operations are in progress on your volume. Otherwise your operation will fail with EBUSY.
Obviously, the removal operation will decrease abstract capacity of your LV. Note that other bricks should have enough space to store all data blocks of the brick you want to remove, otherwise, the removal operation will return error (ENOSPC).
Suppose you want to remove brick /dev/vdb2 from your LV mounted at /mnt.
Update your volume configuration with the UUID or name of the brick you want to remove (item #4).
To remove a brick simply pass its name as an argument for option "-r" and specify the logical volume by its mount point:
# volume.reiser4 -r /dev/vdb2 /mnt
The procedure of brick removal starts by moving all data from the brick you want to remove to the other bricks of your volume, so that the resulting data distribution among the remaining bricks is also fair.
The portion of data stripes moved during such migration is equal to the relative capacity of the brick to be removed (that is, to the portion of capacity that the brick added to the LV's capacity). Successful brick removal always leaves the volume in a balanced state.
So, in contrast with the operation of adding a brick, removing a brick is a rather long operation, which can be interrupted for various reasons. In this case volume will be marked as a "volume with incomplete brick removal".
To check removal status of your LV simply run
# volume.reiser4 /mnt
and check the field "health".
To complete brick removal in the current mount session simply run
# volume.reiser4 -R /mnt
Note, that the option -R (--finish-removal) doesn't accept any arguments.
On success update your volume configuration: remove the information about the brick /dev/vdb2 at #3 and #4. Check your kernel logs: it should contain a message that brick /dev/vdb2 has been unregistered. Now device /dev/vdb2 doesn't belong to the logical volume any more, and you can reuse it for other purposes (re-format, etc).
Changing brick's capacity
At any time (assuming that your volume is not marked as a "volume with incomplete brick removal") you can change the abstract capacity of any brick to some new value, different from 0.
Changing capacity always changes volume partitioning, and therefore, breaks fairness of data distribution on the volume. By default operation of changing brick capacity leaves the volume in unbalanced state, so after changing brick capacity you might want to run a balancing procedure to make data distribution on your volume fair. In particular, after increasing brick capacity the balancing procedure will move some data from other bricks to the brick, whose capacity was increased. After decreasing bricks capacity the balancing procedure will move some data from the brick, whose capacity was decreased, to other bricks.
To change abstract capacity of a brick /dev/vdb1 to a new value (e.g. 200000), simply run
# volume.reiser4 -z /dev/vdb1 -c 200000 /mnt
Pronounced as "resize brick /dev/vdb1 to new capacity 200000 in volume mounted at /mnt".
By default the operation of changing capacity is fast and atomic and leaves the volume in an unbalanced state. To automatically invoke balancing, use this operation in conjunction with the option -B (--with-balance). You can also run a balancing procedure later at any time by executing
# volume.reiser4 -b /mnt
When changing capacities of more than one brick at once, call the volume.reiser4 utility for each brick individually in any order. It will be reasonable to not complete each call with balancing. Run balancing after changing capacity of the last brick.
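Changing several capacities and balancing once can be scripted in the same spirit. This is a sketch: `set_capacities`, the `VOLUME_TOOL` variable and the `BRICK=CAPACITY` argument syntax are hypothetical; only the -z, -c and -b options described above are used.

```shell
# Tool to invoke; can be overridden, e.g. for a dry run.
VOLUME_TOOL=${VOLUME_TOOL:-volume.reiser4}

# Change capacities of several bricks, then balance once at the end.
# Usage: set_capacities MOUNT_POINT BRICK=CAPACITY...
set_capacities() {
    mnt=$1; shift
    for pair in "$@"; do
        brick=${pair%%=*}    # text before the first "="
        cap=${pair#*=}       # text after the first "="
        "$VOLUME_TOOL" -z "$brick" -c "$cap" "$mnt" || return 1
    done
    "$VOLUME_TOOL" -b "$mnt"
}
```

For example: `set_capacities /mnt /dev/vdb1=200000 /dev/vdb2=400000`.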
Comment. Changing a brick's capacity to 0 is undefined and will return an error. Consider the brick removal operation instead.
Operations with meta-data brick
The meta-data brick can also contain data stripes and participate in data distribution like the data bricks. All the volume operations described above are also applicable to the meta-data brick. Note, however, that it is impossible to completely remove the meta-data brick from the logical volume, because meta-data needs to be stored somewhere. The brick removal operation applied to the meta-data brick therefore removes it only from the Data-Storage Array (DSA), which is the subset of the LV consisting of the bricks that participate in regular data distribution in accordance with their abstract capacities.
Once you remove the meta-data brick from the DSA, that brick will be used only to store meta-data. The operation of adding a brick, applied to the meta-data brick, returns it to the DSA.
Important: Reiser5 doesn't count busy data and meta-data blocks separately. So in contrast with data bricks (which contain only data) you are not able to find out real space occupied by data blocks on the meta-data brick - Reiser5 knows only total space occupied.
To check the status of meta-data brick of the volume mounted at /mnt simply run
# volume.reiser4 -p0 /mnt
and check the field "in DSA".
Unmounting a logical volume
To terminate a mount session just issue the usual umount command against the mount point:
# umount /mnt
Note that after unmounting the volume all its bricks by default remain registered in the system until system shutdown. If you want to unregister a brick before system shutdown, then simply issue the following command:
# volume.reiser4 -u BRICK_NAME
Deploying a logical volume after correct unmount
After unmounting a logical volume all its bricks remain registered in the system. So, if you want to mount the volume again, simply issue the mount command against one of its bricks. It is recommended to issue it against the meta-data brick.
NOTE: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was earlier removed from the logical volume.
Deploying a logical volume after correct shutdown
First of all, check configuration of your volume and make sure that all its bricks (data and meta-data ones) are registered in the system. The list of registered bricks can be printed by
# volume.reiser4 -l
Also make sure that the set of bricks registered for the volume doesn't contain bricks that are not mentioned in the volume configuration.
Important: Reiser5 will refuse to mount a logical volume when a wrong (incomplete or redundant) set of bricks is registered in the system. A redundant set of bricks appears, for example, when you mistakenly register a brick that was removed from the logical volume. For this reason we strongly recommend that you keep track of your LV: store its configuration somewhere, but not on this volume! And don't forget to update that configuration after every volume operation. If you lose the configuration of your LV and don't remember it (which is most likely for large volumes), then it will be rather painful to restore: currently there are no tools to manage logical volumes off-line. Users are therefore expected to keep the configuration on their own. It is not at all difficult.
To register a brick in the system use the following command:
# volume.reiser4 -g BRICK_NAME
To print a list of all registered bricks use
# volume.reiser4 -l
Now mount your LV by simply issuing a mount(8) command against one of the bricks of your LV. We recommend issuing it against the meta-data brick.
Comment. Reiser5 always tries to register the brick which is passed to the mount command as an argument, so it is not necessary to preregister the brick you want to issue a mount command against.
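The registration and mount steps above can be sketched as one function. The names `deploy_lv`, `VOLUME_TOOL` and `MOUNT_TOOL` are hypothetical; the function relies only on the -g option and on the fact, stated in the comment above, that the brick passed to mount is registered automatically.

```shell
# Tools to invoke; can be overridden, e.g. for a dry run.
VOLUME_TOOL=${VOLUME_TOOL:-volume.reiser4}
MOUNT_TOOL=${MOUNT_TOOL:-mount}

# Register all data bricks, then mount the volume via its meta-data brick.
# Usage: deploy_lv MOUNT_POINT META_DATA_BRICK DATA_BRICK...
deploy_lv() {
    mnt=$1; meta=$2; shift 2
    for brick in "$@"; do
        "$VOLUME_TOOL" -g "$brick" || return 1
    done
    "$MOUNT_TOOL" "$meta" "$mnt"
}
```

For example: `deploy_lv /mnt /dev/vdb1 /dev/vdc1 /dev/vdd1`, with the brick list taken from your saved volume configuration.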
Deploying a logical volume after hard reset or system crash
If no volume operations were interrupted by hard reset or system crash, then just follow the instructions for deploying a logical volume after correct shutdown.
In Reiser5 only a restricted number of bricks participate in every transaction. The maximal number of such bricks can be specified by the user. At mount time a transaction replay procedure will be launched on each such brick independently, in parallel.
Depending on a kind of interrupted volume operation, perform one of the following actions:
Volume balancing was interrupted
Check your volume configuration.
Register the complete set of bricks and mount the volume by issuing the mount command against one of its bricks.
Check the balanced status of your LV by running
# volume.reiser4 /mnt
and checking "balanced" value. If the volume is unbalanced, then complete balancing by running
# volume.reiser4 -b /mnt
Brick removal was interrupted
Check your volume configuration. Register the new set of bricks (that is, the set of bricks without the brick you wanted to remove). Try to mount the volume. In the case of error register also the brick you wanted to remove and try to mount again.
Check the status of your LV by running
# volume.reiser4 /mnt
and checking the value of "health". If required, complete brick removal by running
# volume.reiser4 -R /mnt
Note, that the option -R doesn't accept any arguments. After successful removal completion the brick will be automatically removed from the volume and unregistered. Make sure of it by checking status of your LV and the list of registered bricks:
# volume.reiser4 /mnt
# volume.reiser4 -l
Upon successful completion update your volume configuration respectively.
Common info about LV mounted at /mnt
# volume.reiser4 /mnt
ID: Volume UUID
volume: ID of plugin managing the volume
distribution: ID of distribution plugin
stripe: Stripe size in bytes
segments: Number of hash space segments (for distribution)
bricks total: Total number of bricks in the volume
bricks in DSA: Number of bricks participating in data distribution
balanced: Balanced status of the volume
health: Brick removal completion status
Info about a brick of index J
# volume.reiser4 -p J /mnt
internal ID: Brick's "internal ID" and its status in the volume
external ID: Brick's UUID
device name: Name of the block device associated with the brick
block count: Size of the block device in blocks
blocks used: Total number of occupied blocks on the device
system blocks: Minimal possible number of busy blocks on that device
data capacity: Abstract capacity of the brick
space usage: Portion of occupied blocks on the device
in DSA: Participation in regular data distribution
is proxy: Participation in data tiering (Burst Buffers, etc)
Comment. When retrieving a brick's info make sure that no volume operations are in progress on that volume. Otherwise the command above will return an error (EBUSY).
WARNING. Brick info obtained this way is not necessarily the most recent. To get up-to-date info run sync(1) and make sure that no regular file operations are in progress.
Checking free space
To check number of available free blocks on a volume mounted at /mnt, make sure that no regular file operations, as well as volume operations, are in progress on that volume, then run
# sync
# df --block-size=4K /mnt
To check number of free blocks on the brick of index J run
# volume.reiser4 -p J /mnt
Then calculate the difference between "block count" and "blocks used".
Comment. Not all free blocks on a brick/volume are available for use. Number of available free blocks is always ~95% of total number of free blocks (Reiser4 reserves 5% to make sure that regular file truncate operations won't fail).
NOTE: volume.reiser4 shows total number of free blocks, whereas df(1) shows number of available free blocks.
The "space usage" statistic shows the portion of busy blocks on an individual brick. For the reasons explained above, "space usage" on any brick cannot be more than 0.95.
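The arithmetic of this section can be sketched as follows. The function names are hypothetical, and the 95% factor is only the approximation quoted above, so the second result is an estimate rather than the exact number that df(1) would report.

```shell
# Total free blocks on a brick: "block count" minus "blocks used".
# Usage: brick_free_blocks BLOCK_COUNT BLOCKS_USED
brick_free_blocks() {
    echo $(( $1 - $2 ))
}

# Rough estimate of available free blocks (~95% of total free blocks).
# Usage: available_estimate FREE_BLOCKS
available_estimate() {
    echo $(( $1 * 95 / 100 ))
}

free=$(brick_free_blocks 1000 400)    # 600 free blocks in total
available_estimate "$free"            # prints: 570
```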
Checking quality of data distribution
Quality of data distribution is a measure of deviation of the real data space usage from the ideal one defined by volume partitioning. The smaller the deviation, the better the distribution quality.
Checking quality of distribution makes sense only in the case when your volume partitioning is space-based, or if it coincides with the space-based one.
If your partitioning is throughput-based and doesn't coincide with the space-based one, then the quality of actual data distribution can be rather bad: in this case the file system makes sure that low-performance devices do not become a bottleneck, and effective space usage is not a high priority.
Checking quality of data distribution is based on the free blocks accounting, provided by the file system. Note that file system doesn't count busy data and meta-data blocks separately, so you are not able to find real data space usage, and hence to check quality of distribution in the case when meta-data brick contains data blocks.
To check quality of distribution:
1) make sure that the meta-data brick doesn't contain data blocks;
2) make sure that no regular file and volume operations are currently in progress;
3) find "blocks used", "system blocks" and "data capacity" statistics for each data brick:
# sync
# volume.reiser4 -p 1 /mnt
...
# volume.reiser4 -p N /mnt
4) find real data space usage on each brick;
5) calculate partitioning and ideal data space usage on each data brick;
6) find deviation of (4) from (5).
Let's build a LV of 3 bricks (one 10G meta-data brick vdb1, and two data bricks: vdc1 (10G), vdd1 (5G)) with space-based partitioning:
# VOL_ID=`uuid -v4`
# echo "Using uuid $VOL_ID"
# mkfs.reiser4 -U $VOL_ID -y -t 256K /dev/vdb1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdc1
# mkfs.reiser4 -U $VOL_ID -y -a -t 256K /dev/vdd1
# mount /dev/vdb1 /mnt
Fill the meta-data brick with data:
# dd if=/dev/zero of=/mnt/myfile bs=256K
No space left on device...
Add data bricks /dev/vdc1 and /dev/vdd1 to the volume:
# volume.reiser4 -a /dev/vdc1 /mnt
# volume.reiser4 -a /dev/vdd1 /mnt
Move all data blocks to the newly added bricks:
# volume.reiser4 -r /dev/vdb1 /mnt
# sync
Now the meta-data brick doesn't contain data blocks (only meta-data ones), so we can calculate the quality of data distribution:
# volume.reiser4 /mnt -p0
blocks used: 503
# volume.reiser4 /mnt -p1
blocks used: 1657203
system blocks: 115
data capacity: 2621069
# volume.reiser4 /mnt -p2
blocks used: 833001
system blocks: 73
data capacity: 1310391
Based on the statistics above, calculate the quality of distribution.
Total data capacity of the volume:
C = 2621069 + 1310391 = 3931460
Relative capacities of data bricks:
C1 = 2621069 / (2621069 + 1310391) = 0.6667
C2 = 1310391 / (2621069 + 1310391) = 0.3333
Real space usage on data bricks (blocks used - system blocks):
R1 = 1657203 - 115 = 1657088
R2 = 833001 - 73 = 832928
Space usage on the volume:
R = R1 + R2 = 1657088 + 832928 = 2490016
Ideal data space usage on data bricks:
I1 = C1 * R = 0.6667 * 2490016 = 1660094
I2 = C2 * R = 0.3333 * 2490016 = 829922
Deviation of real data space usage from the ideal one:
D = (R1, R2) - (I1, I2) = (-3006, 3006)
Relative deviation:
D/R = (-0.0012, 0.0012)
Quality of distribution:
Q = 1 - max(|D1|, |D2|)/R = 1 - 0.0012 = 0.9988
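The same calculation can be automated. The sketch below is not part of reiser4progs: the function name `distribution_quality` and its input format (one line per data brick with "blocks used", "system blocks" and "data capacity") are assumptions made for illustration. It works with unrounded relative capacities, so its intermediate values differ slightly from a hand calculation that rounds them to four digits.

```shell
# Read one line per data brick: BLOCKS_USED SYSTEM_BLOCKS DATA_CAPACITY
# and print the quality of distribution Q = 1 - max|Ri - Ii| / R.
# Assumes at least one data block is stored on the data bricks.
distribution_quality() {
    awk '
    {
        real[NR] = $1 - $2          # real data space usage of the brick
        cap[NR]  = $3               # abstract capacity of the brick
        total_real += real[NR]      # R: data space usage on the volume
        total_cap  += cap[NR]       # C: capacity of the volume
    }
    END {
        worst = 0
        for (i = 1; i <= NR; i++) {
            ideal = total_real * cap[i] / total_cap
            dev = (real[i] - ideal) / total_real
            if (dev < 0) dev = -dev
            if (dev > worst) worst = dev
        }
        printf "%.4f\n", 1 - worst
    }'
}

# Statistics of the two data bricks from the example above.
printf '%s\n' "1657203 115 2621069" "833001 73 1310391" | distribution_quality
# prints: 0.9988
```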
Comment. For any specified number of bricks N and quality of distribution Q it is possible to find a configuration of a logical volume composed of N bricks, so that quality of distribution on that volume will be better than Q.
Comment. Quality of distribution Q doesn't depend on the number of bricks in the logical volume. This is a theorem, which can be strictly proven.
Q. What happens if I lose a device-component (due to a breakdown, etc) of my logical volume?
A. Bodies of some of your regular files will become "punched" in random places. The portion of such files depends on the relative capacity of the lost brick, on the number of bricks in the logical volume, and on other factors. Fsck will be able to detect and remove such files with corrupted bodies. Nevertheless, we recommend that you consider mirroring your bricks (e.g. by software or hardware RAID-1) to avoid such highly unpleasant situations.