Btrfs is probably the most modern of the widely used Linux filesystems. In this article we explain how to use Btrfs as the only filesystem on a server machine, and how that enables some sweet capabilities: very resilient RAID-1, flexible adding and replacing of disk drives, snapshots for quick backups and so on.
The techniques described in this article were tested on an Ubuntu 16.04 server install, but they are applicable on any system with roughly the same versions of btrfs-progs, Grub (2.02) and the kernel (4.4).
The hardware requirements for a Btrfs-based RAID-1 setup are very flexible. Any number of disks from two upwards will do. The disks in the RAID array do not need to be identical in size, because Btrfs RAID-1 works at the data level, not just at the device level like traditional mdadm does. Btrfs also includes the features traditionally provided by LVM, so it conveniently replaces both mdadm and LVM with a single easy-to-use tool. A good practice is to start with a setup of 2–4 disks and later add new disks as more space is needed, picking whatever disk size has the best price/capacity ratio at that time.
The Btrfs setup
When the hardware is ready, the next step is to install the operating system (i.e. Linux). During the partitioning phase, create one big partition that fills each disk. There is no need to create a separate /boot partition or a swap partition. For Grub compatibility reasons we need to create a real partition (e.g. sda1, sdb1 and so on) on every disk instead of giving the whole raw disk to Btrfs, even though Btrfs would support that too. Remember to mark every primary partition bootable in the partition table.
After the partitioning step, select the first disk partition (e.g. sda1) as the root filesystem and use Btrfs as the filesystem type. Complete the installation and boot.
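The partitioning can also be scripted. The sketch below uses parted and assumes an MBR (msdos) partition table and the device names sda–sdd from this article; adjust both to your hardware before running anything.

```shell
# Illustrative sketch: create one bootable, disk-filling partition per drive.
# DESTRUCTIVE if actually run -- double-check the device names first.
partition_disks() {
    for x in a b c d; do
        parted -s "/dev/sd$x" -- \
            mklabel msdos \
            mkpart primary btrfs 1MiB 100% \
            set 1 boot on
    done
}
```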
After boot you can expand the root filesystem to use all disks with the command:
sudo btrfs device add /dev/sdb1 /dev/sdc1 /dev/sdd1 /
You can check the status of the Btrfs system with btrfs fi show (fi is short for filesystem):
$ sudo btrfs fi show
Label: 'root' uuid: 31e77d75-c07d-44dd-b969-d640dfdf5f81
Total devices 4 FS bytes used 1.78GiB
devid 1 size 884.94GiB used 4.02GiB path /dev/sda1
devid 2 size 265.42GiB used 0.00B path /dev/sdb1
devid 3 size 283.18GiB used 0.00B path /dev/sdc1
devid 4 size 265.42GiB used 0.00B path /dev/sdd1
This pools the devices together and creates a big root filesystem. To make it a RAID-1 system run:
sudo btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /
After this, the available disk space halves, but the system becomes resilient against single disk failures. The read speed might also increase a bit, as data can be accessed in parallel on at least two devices.
The relatively new command btrfs fi usage explains how disk space is used and how much might still be available:
$ sudo btrfs fi usage /
Device size: 1.66TiB
Device allocated: 6.06GiB
Device unallocated: 1.65TiB
Device missing: 0.00B
Free (estimated): 846.76GiB (min: 846.76GiB)
Data ratio: 2.00
Metadata ratio: 2.00
Global reserve: 32.00MiB (used: 0.00B)
Data,RAID1: Size:2.00GiB, Used:1.69GiB
Metadata,RAID1: Size:1.00GiB, Used:72.81MiB
System,RAID1: Size:32.00MiB, Used:16.00KiB
By default the Linux system boot will hang if any of the devices used by the Btrfs root filesystem is missing. This is not ideal in a server environment, where we would rather have the system boot and continue operating in degraded mode, so that services keep working and admins can log in remotely to assess the next steps.
To enable Btrfs to boot in degraded mode we need to add the ‘degraded’ mount option in two locations. First we need to make sure that Grub can mount the root filesystem and access the kernel. To do that, edit the rootflags entry in the Grub configuration to include the option ‘degraded’.
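Where exactly this is done varies; on the Ubuntu release used here, the rootflags value is generated by the /etc/grub.d/10_linux script. As a sketch, assuming that script contains the usual rootflags line, the edit would look like this:

```shell
# In /etc/grub.d/10_linux (assumption: your version generates rootflags here),
# prepend 'degraded,' so the kernel can mount / with a device missing:
GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX}"
```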
For the Grub config to take effect we need to run ‘update-grub’, and after that install the new Grub on the master boot record (MBR) of every disk. That can easily be scripted like this:
for x in a b c d; do sudo grub-install /dev/sd$x; done
Secondly, we need to allow the Linux system to mount its filesystems in degraded mode by adding the same option to /etc/fstab like this:
UUID=.... / btrfs degraded,noatime,nodiratime,subvol=@ 0 1
Note that noatime and nodiratime are also selected: they increase performance at the cost of not recording access times of files and directories, a feature that almost nothing uses, so in practice there is no drawback.
With the setup above we now have a system with four disks, each containing one partition, and those partitions are pooled together with Btrfs RAID-1. If any of the disks fails, the system will continue to operate and can also resume operating after a reboot (thanks to the mount option ‘degraded’), and it does not matter which of the disks breaks, as any disk is good for booting (thanks to having Grub in every disk’s MBR). If a disk failure occurs, it is up to the system administrator to detect it (e.g. from syslog), add a new disk and run ‘btrfs device replace...’ as explained in our Btrfs recovery article.
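Failure detection can be helped along with ‘btrfs device stats’, which reports per-device I/O and corruption error counters. A minimal sketch of a cron-able check (the function name is our invention):

```shell
# Sketch: print any non-zero btrfs error counters and exit non-zero if found.
# Output lines of 'btrfs device stats' look like:
#   [/dev/sda1].write_io_errs   0
check_btrfs_errors() {
    btrfs device stats / | awk '$2 != 0 { print; bad = 1 } END { exit bad ? 1 : 0 }'
}
```

A non-zero exit status makes the function easy to wire into cron or a monitoring agent.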
Using ZRAM for swap
Note that this setup does not have any swap partitions. We can’t put a swap partition on a raw disk, as there is no redundancy there: if any of the disks failed, the swap partition and all memory paged out to it would be lost, and the kernel would most likely panic and halt. As Btrfs RAID-1 does not operate at the block level, we cannot put a swap partition on top of it either. We could use a swap file, but Btrfs is not well suited for hosting swap files. Our solution was to have no traditional swap at all and instead use ZRAM, which stores swapped-out memory in a compressed format in RAM.
To install zram simply run:
sudo apt install zram-config
After the next reboot there will automatically be a zram device that the system uses for swapping. It does not matter how much RAM a system has: at some point the kernel will swap something out of active memory anyway, to use the active memory more efficiently. Using ZRAM for swap keeps that data off the real disks and therefore makes both swap-out and swap-in faster (at the cost of some extra CPU use).
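To confirm the zram device is in use after the reboot, /proc/swaps should list it (typically as /dev/zram0; the exact name and size depend on the zram-config defaults):

```shell
# List active swap areas; a zram-backed setup shows /dev/zram0 here.
cat /proc/swaps
```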
Using snapshots for backups
Would you like to make a full system backup that does not consume any disk space? On a copy-on-write filesystem like Btrfs it is possible to create snapshots as a window into the filesystem state at a certain point in time.
A practical way to do this is to have a directory called /snapshots/ under the root filesystem and save snapshots there at regular intervals. With the -r option we make the snapshot read-only, which is ideal for backups.
$ sudo mkdir /snapshots
$ sudo btrfs subvolume snapshot -r / /snapshots/root.$(date +%Y%m%d-%H%M)
Create a readonly snapshot of '/' in '/snapshots/root.20160919-0954'
$ tree -L 3 /snapshots
|-- initrd.img -> boot/initrd.img-4.4.0-36-generic
|-- initrd.img.old -> boot/initrd.img-4.4.0-31-generic
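Creating such snapshots lends itself to a cron job. A minimal sketch (the helper function names are our invention; the /snapshots path and naming scheme follow this article, and the snapshot step needs root in practice):

```shell
#!/bin/sh
# Compute a timestamped snapshot name like root.20160919-0954.
snapshot_name() {
    echo "root.$(date +%Y%m%d-%H%M)"
}

# Create a read-only (-r) snapshot of / under /snapshots.
make_snapshot() {
    btrfs subvolume snapshot -r / "/snapshots/$(snapshot_name)"
}
```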
To track how much disk space a snapshot uses, or more precisely to view the amount of data that changed between two snapshots, we can use Btrfs quota groups. They are not enabled by default, so start by running:
$ sudo btrfs quota enable /
After that you can view the subvolumes (snapshots) disk usage:
$ sudo btrfs qgroup show /
qgroupid rfer excl
-------- ---- ----
0/5 16.00KiB 16.00KiB
0/257 1.75GiB 47.74MiB
0/258 48.00KiB 48.00KiB
0/267 0.00B 16.00EiB
0/268 48.00KiB 16.00EiB
0/269 1.75GiB 44.95MiB
To find out which subvolume ID is mounted as what, list them with:
$ sudo btrfs subvolume list /
ID 257 gen 5367 top level 5 path @
ID 258 gen 5366 top level 5 path @home
ID 269 gen 5354 top level 257 path snapshots/root.20160919-0954
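Old snapshots should eventually be deleted, or the data unique to them keeps occupying space. A sketch of a retention helper (the function name and keep-count are our assumptions; it relies on the root.YYYYMMDD-HHMM names sorting chronologically):

```shell
# Delete all but the newest N snapshots from /snapshots (default: keep 7).
prune_snapshots() {
    keep=${1:-7}
    ls -1 /snapshots | sort | head -n -"$keep" |
    while read -r snap; do
        btrfs subvolume delete "/snapshots/$snap"
    done
}
```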