Btrfs (pronounced Better FS) is a relatively new filesystem that operates on the copy-on-write principle (abbreviated COW, which gives rise to a friendlier pronunciation for btrfs: Butter FS). Btrfs includes a lot of interesting functionality and replaces traditional Linux disk and filesystem tools like LVM (volume manager, disk snapshots) and mdadm (software RAID).
In RAID usage btrfs is much more flexible and space efficient than traditional mdadm, because in btrfs the disks in the RAID array do not need to conform to any predefined size or count requirements. You can attach any number of disks of any size to a btrfs RAID array, and btrfs will automatically balance the data across the devices according to the requirements of the selected RAID level. RAID levels 0 and 1 are currently supported while RAID 5 and 6 are under development and will be available as officially supported configurations soon.
Let’s take an example case to see how btrfs RAID works. In traditional mdadm-based RAID, if you have two 1 TB disks configured to mirror each other in RAID 1 mode and you want to expand this setup, you basically need to install two more disks, and they need to be of equal size. In btrfs, life is much more flexible.
To expand an array, you can add a disk of any size and btrfs will automatically adjust to the situation. So if you have two 1 TB disks and you add a 2 TB disk to the array, then all 4 TB would be utilised and you will have 2 TB of usable storage. Btrfs will make sure that, as RAID 1 mandates, each file exists in two copies and that those copies are stored on physically different disks. If you expanded the array of two 1 TB disks with one more 1 TB disk, then btrfs would give you 1.5 TB of usable disk space in total while still satisfying the RAID 1 replication requirements.
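The arithmetic behind those two figures can be sketched in a few lines of Python. This is only an illustration, assuming two-copy RAID 1 and ignoring metadata overhead and chunk allocation granularity; the function name raid1_usable_capacity is made up for this example and is not part of any btrfs tool.

```python
def raid1_usable_capacity(disk_sizes_tb):
    """Estimate usable space for two-copy RAID 1 on disks of mixed sizes."""
    if len(disk_sizes_tb) < 2:
        raise ValueError("RAID 1 needs at least two devices")
    total = sum(disk_sizes_tb)
    largest = max(disk_sizes_tb)
    # Every block needs a second copy on a different disk, so the largest
    # disk can only be filled as far as the other disks together can mirror it.
    return min(total / 2, total - largest)


print(raid1_usable_capacity([1, 1, 2]))  # 2.0
print(raid1_usable_capacity([1, 1, 1]))  # 1.5
```

The second term matters when one disk dominates: a 1 TB and a 3 TB disk together give only about 1 TB of mirrored space, because the extra 2 TB on the large disk has nothing to pair with.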
Btrfs RAID is also flexible in the sense that you can either create the RAID array when you create the filesystem by running something like
mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdc1
or you can at any later time convert an existing filesystem to use RAID by running a command like
btrfs balance start -dconvert=raid1 -mconvert=raid1 /home
If RAID is active, it will be visible in the btrfs filesystem df command:
$ btrfs fi df /home
Data, RAID1: total=1.59TiB, used=1.59TiB
System, RAID1: total=32.00MiB, used=256.00KiB
Metadata, RAID1: total=10.00GiB, used=8.27GiB
unknown, single: total=512.00MiB, used=0.00
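If you want to check the profiles from a script, the output above is easy to parse. The following is a rough sketch, assuming the line format stays exactly as shown in this example; the helper names parse_fi_df and non_raid1_block_groups are invented here for illustration.

```python
import re

# Matches lines such as "Data, RAID1: total=1.59TiB, used=1.59TiB".
LINE_RE = re.compile(
    r"^(?P<kind>[^,]+), (?P<profile>[^:]+): "
    r"total=(?P<total>[^,]+), used=(?P<used>\S+)$"
)


def parse_fi_df(output):
    """Turn 'btrfs fi df' output into a list of dicts, one per block group type."""
    rows = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def non_raid1_block_groups(rows, kinds=("Data", "Metadata", "System")):
    """Return the block group types that are not stored with the RAID1 profile."""
    return [r["kind"] for r in rows if r["kind"] in kinds and r["profile"] != "RAID1"]


SAMPLE = """\
Data, RAID1: total=1.59TiB, used=1.59TiB
System, RAID1: total=32.00MiB, used=256.00KiB
Metadata, RAID1: total=10.00GiB, used=8.27GiB
unknown, single: total=512.00MiB, used=0.00
"""

print(non_raid1_block_groups(parse_fi_df(SAMPLE)))  # []
```

An empty list means data, metadata and system block groups all use RAID1. The "unknown, single" line is deliberately ignored by the check.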
Keep in mind that the free space values shown by the normal df command are not reliable, as it is hard to predict how disk usage will behave in a copy-on-write and snapshotting filesystem like btrfs.
Recovering from failed disks in btrfs
If a disk in a btrfs RAID 1 array fails, then btrfs will refuse to mount that filesystem and error messages will be visible in the syslog. If it was the root filesystem, then the system will refuse to boot normally and will usually drop to an initramfs console. Luckily all decent systems that support btrfs (like Ubuntu 14.04) have the btrfs tools included in the initramfs environment, so you can run btrfs commands from there and try to recover from the situation without the need to boot the system from alternative media, like a live CD.
A btrfs volume with a failed (missing) disk will output something like this:
$ btrfs fi show
Label: none uuid: 4e90ec15-e6f5-470d-96be-677f654a5c79
Total devices 3 FS bytes used 1.59TiB
devid 1 size 2.71TiB used 1.60TiB path /dev/sda1
devid 2 size 1.80TiB used 1.48TiB path
devid 3 size 447.14GiB used 121.00GiB path /dev/sdc1
To force the btrfs volume to mount anyway, the degraded option can be used:
$ mount -t btrfs -o degraded /dev/sda2 /home
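The correct thing to do when a disk in a RAID array fails is to replace it. Once you have a new disk in place, notify btrfs about it with the replace command, which takes the outgoing device (or its devid), the new device and the mount point as arguments: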
$ btrfs replace start /dev/sdd1 /dev/sdb1 /home
This command reads files both from the original drive (if still accessible) and from other disks in the RAID array, and uses that information to populate the new clean disk.
If needed, the -r flag will prevent the system from trying to read from the outgoing drive if possible. Replacement operations can be canceled, but they cannot be paused. Once the operation is complete, the outgoing drive will no longer be a part of the array and can be disposed of.
In some cases one could also run btrfs device delete missing /home (the last argument is the mount point of the filesystem) and then add a new drive, but the replace command is the primary command and can be run even if the old drive is completely dead.
Recovering from filesystem corruption in btrfs
In cases where the physical disk has not failed but instead something in the btrfs journal or checksum trees is corrupted and does not match, and the filesystem refuses to mount, this is the recommended procedure to try:
First make a backup of the volume.
After that try to mount the volume in the read-only recovery mode:
$ mount -t btrfs -o ro,recovery /dev/sda2 /home
If that fails, look in syslog (or run dmesg) and look for btrfs errors:
[ 74.926506] Btrfs loaded
[ 74.927393] BTRFS: device fsid 4e90ec15-e6f5-470d-96be-677f654a5c79 devid 2 transid 691061 /dev/sdc1
[ 77.439765] BTRFS info (device sdc1): disk space caching is enabled
[ 77.440620] BTRFS: failed to read the system array on sdc1
If there are messages relating to the log tree (not in the example above), then reset the log tree by running btrfs rescue zero-log on the affected device.
If syslog shows problems regarding the chunk tree, then btrfs rescue chunk-recover may be of use to replace the chunk blocks with new ones that should work (but some data may be lost). Each disk has multiple copies of the superblock, and they are very unlikely to all get corrupted at the same time, but if it happens they can be recovered with the rescue command:
$ btrfs rescue super-recover -v /dev/sde1
All Devices:
    Device: id = 3, name = /dev/sdc1
    Device: id = 1, name = /dev/sda2
    Device: id = 2, name = /dev/sde1
Before Recovering:
    [All good supers]:
        device name = /dev/sdc1
        superblock bytenr = 65536
        device name = /dev/sdc1
        superblock bytenr = 67108864
        device name = /dev/sdc1
        superblock bytenr = 274877906944
        device name = /dev/sda2
        superblock bytenr = 65536
        device name = /dev/sda2
        superblock bytenr = 67108864
        device name = /dev/sda2
        superblock bytenr = 274877906944
    [All bad supers]:
        device name = /dev/sde1
        superblock bytenr = 65536
        device name = /dev/sde1
        superblock bytenr = 67108864
        device name = /dev/sde1
        superblock bytenr = 274877906944
After those, try btrfsck, possibly with the options -s1, -s2 or -s3 to make it use one of the alternative superblock copies. If the volume is still not mountable, then btrfsck --repair --init-extent-tree may be necessary if the extent tree was corrupted. If there is corruption in the checksums, try --init-csum-tree.
The last resort is to run btrfs check --repair, but it is not recommended because it might write changes to the disk that destroy data.
Generic tools might also be useful. For example the tool testdisk is able to scan disks and find lost partition tables, including ones with btrfs partitions.
Restoring files from a broken btrfs filesystem
If it is simply impossible to mount a btrfs filesystem, it is possible to use the command btrfs restore to fetch files from within a damaged btrfs partition. The default command will get all files from the root volume.
Sometimes a plain restore isn’t enough. For example in Ubuntu, by default the /home directory is a separate btrfs subvolume. To fetch files from there, the correct volume root must be defined via the -r option. Also, you might not be interested in restoring all possible files, maybe just one particular directory, and for such use a filename filter can be defined with the --path-regex option.
To fetch all files from /home/otto/Kuvat on a system where the @home subvolume object id is 258, the task can be accomplished with the command:
$ btrfs restore -i -vvvv -r 258 --path-regex "^/(otto(|/Kuvat(|/.*)))$" /dev/sdc1 .
...
Restoring ./otto/Kuvat/2015/08/09/IMG_2888.JPG
Restoring ./otto/Kuvat/2015/08/09/IMG_2889.JPG
Restoring ./otto/Kuvat/2015/08/09/IMG_2890.JPG
Restoring ./otto/Kuvat/2015/08/09/IMG_2891.JPG
Restoring ./otto/Kuvat/2015/08/09/IMG_2892.JPG
Restoring ./otto/Kuvat/2015/08/09/IMG_2893.JPG
Found objectid=18094703, key=18094702
Done searching /otto/Kuvat/2015/08/09
Found objectid=18094632, key=18094631
Done searching /otto/Kuvat/2015/08
Found objectid=8304076, key=8304075
Done searching /otto/Kuvat/2015
Found objectid=272, key=271
Done searching /otto/Kuvat
Found objectid=258, key=257
Done searching /otto
Found objectid=257, key=256
Done searching
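The nested shape of the regular expression in the command above is needed because the pattern has to match every parent directory on the way down (/otto and /otto/Kuvat) as well as everything below the target. Writing it by hand gets error-prone for deeper paths, so here is a small Python sketch that builds the same kind of pattern. The function build_path_regex is a hypothetical helper, not something shipped with btrfs, and Python's re module is used only to illustrate the matching; btrfs restore uses its own regex engine, so test the result before relying on it.

```python
import re


def build_path_regex(path):
    """Build a nested regex that matches a directory, its parents and its contents."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        raise ValueError("path must contain at least one component")
    # Innermost group: either stop at the target directory or match anything below it.
    inner = "(|/.*)"
    # Wrap the remaining components from the deepest one outwards, each one optional
    # so that the parent directories themselves also match.
    for part in reversed(parts[1:]):
        inner = "(|/" + re.escape(part) + inner + ")"
    return "^/(" + re.escape(parts[0]) + inner + ")$"


pattern = build_path_regex("/otto/Kuvat")
print(pattern)  # ^/(otto(|/Kuvat(|/.*)))$

for candidate in ["/otto", "/otto/Kuvat", "/otto/Kuvat/2015/08/09/IMG_2888.JPG", "/otto/Documents"]:
    print(candidate, bool(re.match(pattern, candidate)))
```

For /otto/Kuvat this reproduces the pattern used in the example, and the last candidate (/otto/Documents, a made-up path) is rejected because it is neither a parent of the target nor inside it.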
Detecting data corruption
If btrfs detects errors, they will be logged to syslog. Btrfs also maintains error counters, which on normal healthy drives should always list all zeros:
$ btrfs device stats /mnt
[/dev/sda2].write_io_errs 0
[/dev/sda2].read_io_errs 0
[/dev/sda2].flush_io_errs 0
[/dev/sda2].corruption_errs 0
[/dev/sda2].generation_errs 0
[/dev/sdc1].write_io_errs 0
[/dev/sdc1].read_io_errs 0
[/dev/sdc1].flush_io_errs 0
[/dev/sdc1].corruption_errs 0
[/dev/sdc1].generation_errs 0
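Since healthy drives should show only zeros, a monitoring script merely needs to look for anything else. The following Python sketch parses output in the format shown above and reports the non-zero counters per device. The function name nonzero_counters is an invention for this example, and the parsing assumes the "[device].counter value" layout from the listing.

```python
import re

# Matches lines such as "[/dev/sda2].write_io_errs 0".
STAT_RE = re.compile(r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def nonzero_counters(output):
    """Return {device: {counter: value}} for every error counter above zero."""
    problems = {}
    for line in output.splitlines():
        match = STAT_RE.match(line.strip())
        if match and int(match.group("value")) > 0:
            device = problems.setdefault(match.group("device"), {})
            device[match.group("counter")] = int(match.group("value"))
    return problems


SAMPLE = """\
[/dev/sda2].write_io_errs 0
[/dev/sda2].read_io_errs 0
[/dev/sdc1].corruption_errs 0
"""

print(nonzero_counters(SAMPLE))  # {}
```

In practice the text would come from running btrfs device stats on the mount point, for example via subprocess, and a non-empty result would be the cue to check syslog and the SMART data of the affected drive.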
Btrfs automatically calculates CRC-32C checksums for both data and metadata blocks and verifies them whenever the blocks are read. If data corruption is detected, btrfs will log errors to syslog. If RAID 1 is enabled, btrfs will also automatically fix the corrupted data by overwriting it with the correct duplicate. The same verification can also be triggered manually for the whole filesystem by running btrfs scrub start on the mount point.
Checking disk health
Disks that support the SMART standard are able to report their health status. The GNOME tool ‘Disks’ provides a very easy way to access SMART data. Just open Disks, select a device and choose ‘Show SMART Data & Self-Tests’.
Alternatively a command line representation of the same data can be fetched with:
$ sudo smartctl -A -H /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-90-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       6322
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       2646
175 Program_Fail_Count_Chip 0x0032   100   100   010    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   010    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0013   086   086   010    Pre-fail  Always       -       503
178 Used_Rsvd_Blk_Cnt_Chip  0x0013   094   094   010    Pre-fail  Always       -       184
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   094   094   010    Pre-fail  Always       -       336
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   094   094   010    Pre-fail  Always       -       6192
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   253   253   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0013   094   094   000    Pre-fail  Always       -       3080
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       15812437286
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       11879385910
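When reading the attribute table, the columns to compare are VALUE and THRESH: an attribute is considered failed once its normalized value has dropped to the threshold or below. A rough Python sketch of that check follows. It assumes the ten-column layout of the listing above, and the function name failing_attributes is made up for this example.

```python
def failing_attributes(smartctl_output):
    """List SMART attributes whose normalized VALUE is at or below THRESH."""
    failing = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # headers and banner lines are skipped by this test.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # A threshold of 000 means the attribute is informational only.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0013   086   086   010    Pre-fail  Always       -       503
"""

print(failing_attributes(SAMPLE))  # []
```

In the sample from this SSD, Wear_Leveling_Count has dropped to 086 against a threshold of 010, so it is worn but still far from the failure limit.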
Storing data efficiently and dependably is a fundamental task in computing. Data integrity and data availability are key principles in information security. Btrfs is a promising technology that will help system administrators fulfill these goals. In particular with SSDs, btrfs should be the default filesystem choice for all new systems, and it is a must for sysadmins to read up on.
Amendment 29.5.2016: Freeing up space on a Btrfs RAID system
If you ever end up in a situation where your production disk is full and you urgently need to free up space, you can convert a live Btrfs RAID device to a normal single data device. This naturally carries the risk that your whole system could fail miserably if a disk failed during the time you don’t have RAID, but it might be worth taking the chance if an imminent full disk situation can be avoided and operations can thus continue until you have time to work out a long-term solution. Basically, if your Btrfs device had RAID 1 and you convert it to normal single mode, your disk space will double:
sudo btrfs balance start -dconvert=single -mconvert=raid1 /data
This is a pretty nice feature that is made possible by the unique design in Btrfs. For example ZFS does not have a balance command at all.