Sunday, 10 November 2013

Part II : ZFS - Howto: Ubuntu Home Server (ZFS + virtual IpFire)

Series : Ubuntu Home Server (ZFS + virtual IpFire) 

-->Part II  : Setting up ZFS (this post)<--

Setting up ZFS 

Table of Contents:

Installation of ZFS
Creation of the first zpool
Having a look at your first pool
Using Datasets
Dataset properties - Network sharing
Dataset properties - Quotas
Dataset properties - Compression
Data Scrubbing
Give user permission for status polling of ZFS

Towards a next generation file system at home

This blog continues from the previous post describing the installation of Ubuntu and a virtualized IpFire dedicated firewall: http://blog.gjpvanwesten.nl/2013/11/howto-ubuntu-home-server-with-zfs-and.html. Here I will quickly describe the installation of ZFS, setting up your first zpool, some network shares, and simple quotas. I will also include a health script I found on Calomel.org.


As a reminder, here's what I set up:

  • Ubuntu 12.04.3 LTS
  • ZFS on Linux (6 disk RAIDZ2 , double parity)
  • Virtual firewall (IpFire) running through Virtualbox
  • DHCP / DNS / samba file sharing
  • LAN runs in the 192.168.10.xxx range.

Running on:
  • Athlon II X3
  • 16 GB ECC RAM
  • 3 Ethernet adapters (2 for the firewall and 1 for the host OS)
  • 6 WD 2 TB disks (mix of Green and Red)
  • 500 GB bootdrive (2.5" WD)

Installation of ZFS:

Installing ZFS is actually very simple when you use the native implementation from ZFS on Linux (ZoL). They have a PPA with binaries available, which makes the process rather trivial. Before we do so, make sure you have performed the necessary preparations.

Open a terminal and become root (sudo -s).

First install dkms to allow rebuilding the kernel modules for zfs after kernel updates:
sudo apt-get install dkms

Also don't forget the kernel headers. You may need to reboot after installing these; if you already have them, you can continue.
Again, in a terminal as root, do:

sudo apt-get install linux-headers-`uname -r` linux-headers-generic build-essential

Now to install zfs:

apt-add-repository ppa:zfs-native/stable
apt-get update
apt-get install ubuntu-zfs

This may require another reboot, after which you can again open a terminal and become root.

Type :
zpool status

If you don't get an error but see "no pools available" or equivalent, then installation was successful and we can move forward to creating the zpool.

Creation of the first zpool :

A zpool is a virtual device that contains the pooled storage space of the underlying devices. Hence a zpool can be as simple as a single drive, but it can also be a complex combination of virtual RAID5 devices that are striped and form a RAID50. For further reading I would refer you to the FAQ of ZoL or the Oracle documentation. Here I assume some prior knowledge.

I will use a pool consisting of a single RAIDZ2 virtual device (vdev) as an example. This is a pool consisting of 6 2TB disks, of which 2 are used for parity. Hence this array can sustain two disk failures and still function. The total size is 6 * 2 TB minus 2 * 2TB (parity) = 8 TB.
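As a quick sanity check, that capacity calculation is just (disks - parity) * disk size; a one-liner:

```shell
# usable capacity of a raidz2 pool: (disks - parity) * disk size in TB
disks=6; parity=2; size_tb=2
echo "usable: $(( (disks - parity) * size_tb )) TB"   # prints: usable: 8 TB
```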

The creation of a zpool can be done with any valid disk identifier, but it is recommended to use disk ids. The reason for this is that the pool will not break when you add new disks to the system (which can happen when using names like /dev/sda) and that the zpool does not depend on the actual hardwired connectivity. Using ids, I have successfully switched disks in my pool from being connected to a RAID expansion card to being connected via the onboard SATA, and there was no problem whatsoever.

Now to make life a bit easier, ZoL allows you to define an alias that resolves to a disk by id. This is easier when creating the pool, but is also easier when reading out your zpool status. To use these aliases create a tab delimited text file /etc/zfs/vdev_id.conf:
leafpad /etc/zfs/vdev_id.conf

Now I assume you already have some dedicated disks you want to use to create the zpool (it is recommended to use full disks, but you can also create a pool from a number of partitions).
In the file enter the following (each line represents a disk; this is my current setup):

# by-vdev
# name     fully qualified or base name of device link
alias 2tb1 /dev/disk/by-id/scsi-SATA_WDC_WD20EFRX-68_WD-WMC301348703
alias 2tb2 /dev/disk/by-id/scsi-SATA_WDC_WD20EFRX-68_WD-WMC301355426
alias 2tb3 /dev/disk/by-id/scsi-SATA_WDC_WD20EARX-00_WD-WCAZAL405566
alias 2tb4 /dev/disk/by-id/scsi-SATA_WDC_WD20EFRX-68_WD-WCC300687955
alias 2tb5 /dev/disk/by-id/scsi-SATA_WDC_WD20EARX-00_WD-WCAZAL762406
alias 2tb6 /dev/disk/by-id/scsi-SATA_WDC_WD20EZRX-00_WD-WCC300740429

Now close the file and run the following to enable the aliases:
udevadm trigger
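With many disks, typing out the alias lines is tedious. A small sketch that generates them from a list of ids (the ids below are placeholders; list your real ones with ls /dev/disk/by-id/):

```shell
# generate vdev_id.conf alias lines from a list of disk ids
# (placeholder ids; substitute the output of: ls /dev/disk/by-id/)
ids="scsi-SATA_EXAMPLE_A scsi-SATA_EXAMPLE_B"
i=0
for id in $ids; do
  i=$((i + 1))
  printf 'alias 2tb%d /dev/disk/by-id/%s\n' "$i" "$id"
done
```

Redirect the output into /etc/zfs/vdev_id.conf once you are happy with it.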

The next step is the actual creation of the pool. I use a raidz2, but you can also create a raidz1 or a mirror, or a set of both. When creating the pool you can choose between 512 byte sector sizes and 4096 byte sector sizes (the latter common on modern drives). The latter should perform better but will lead to more lost space, in particular when storing smaller files. I am not going into all the details, but some further documentation can be found here, here and here.

In principle, when using raidz, one should ideally use a power-of-two number of data disks plus the number of parity disks required by the raidz level. So:

2 data + 1 parity = 3
4 data + 1 parity = 5
n data + 1 parity = n + 1

2 data + 2 parity = 4 (although wasteful)
4 data + 2 parity = 6
n data + 2 parity = n + 2

2 data + 3 parity = 5 (although wasteful)
4 data + 3 parity = 7
n data + 3 parity = n + 3
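The sizing rule above is plain arithmetic and can be reproduced with a small loop (nothing ZFS-specific here):

```shell
# total disks needed for each data/parity combination
for parity in 1 2 3; do
  for data in 2 4; do
    echo "$data data + $parity parity = $((data + parity)) disks"
  done
done
```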

I am creating a 6 disk raidz2.

The command is (in a terminal with su):
zpool create -o ashift=12 -O casesensitivity=mixed -O compression=lz4 zfspool raidz2 2tb1 2tb2 2tb3 2tb4 2tb5 2tb6

Let me go through all the parts of the command in turn:
zpool (zpool is used for the administration of zpools)
create (tell zpool to create a new pool)
-o ashift=12 (use a 4096 byte sector size; ashift is the base-2 exponent, 2^12 = 4096)
-O casesensitivity=mixed (I intend to create shares for windows using samba)
-O compression=lz4 (turn on block compression on all files, found to perform well and be very fast)
zfspool (name of the pool to be created)
raidz2 (type of the pool, can be substituted with raidz1, mirror, etc)
2tb1 2tb2 2tb3 2tb4 2tb5 2tb6 (individual disk members)

Uppercase 'O' sets properties of datasets (children of zpools, inherited by default); lowercase 'o' sets properties of the zpool itself.

Having a look at your first pool:
If you were successful again run "zpool status":

root@gertdus-server:~# zpool status
  pool: zfspool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfspool     ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            2tb1    ONLINE       0     0     0
            2tb2    ONLINE       0     0     0
            2tb3    ONLINE       0     0     0
            2tb4    ONLINE       0     0     0
            2tb5    ONLINE       0     0     0
            2tb6    ONLINE       0     0     0

errors: No known data errors

The zpool will be automounted by default under its pool name, hence you can find it at /zfspool.

Obviously you can change the mountpoint (further reading):
zfs set mountpoint=/foo/bar zfspool

Now let's interpret the output. The overview shows the name of your pool:

state - can be ONLINE, DEGRADED (member failed but not critical), FAULTED (critical failure), OFFLINE (manually set in offline state)

scan - indicates if a data scan (scrub) or rebuild (resilvering in zfs) is running

config - lists the different pools on the system (here 1). Shown are the pool and its state, the underlying vdevs and their state, and the state of each individual disk / partition. If all is well, the value for state is "ONLINE" and the values for READ/WRITE/CKSUM are '0'.

Other states vdevs and pools can be in are DEGRADED (member failed but not critical), FAULTED (critical failure), and OFFLINE (manually set in offline state).

Disks can be UNAVAILABLE (missing), FAULTED (SMART error, corrupt data), or OFFLINE (manually set in offline state).

With the command 'zpool list' you can get an overview of some of the properties and the space used by the pool. Note that it displays the raw size and does not correct for the double parity.

NAME     SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
zfspool  10.9T  6.37T  4.51T  58%  1.00x  ONLINE  -

Shown are the name, the total raw size (6 * 1.82 TiB = 10.9 TiB) and the allocated space (6.37 TiB, of which 4/6 is actual data (4 data disks, 2 parity disks), so 4.24 TiB). Free space is 4.51 TiB (again 4/6 usable, so 3.0 TiB), meaning that the pool is 58% full. Lastly it shows how much space is gained by deduplication, but I have not turned that on due to the memory requirements.

The pool is created with owner root (given that it was created as su), so take ownership of the pool and subfolders (in a terminal with su):
chown -R username:username /zfspool

Note that you can destroy the pool with:
zpool destroy zfspool

Using datasets:

When using ZFS it is recommended to create file systems (datasets) which are nested under the zpool. The idea is that this allows for much better management of the space. Ideally one combines files of the same type in a dataset (e.g. a dataset for music, one for documents, etc.).

These datasets can have different options / properties, just like the zpool. Among these are compression, deduplication, mixed case sensitivity, and network sharing. So the actual network sharing can be a property of the dataset (both samba and nfs).

Above we used the zpool command, which controls and manages zpools. For datasets we use the 'zfs' command, which controls and manages datasets.

Creation of a dataset is easy:
zfs create zfspool/backup

creating a dataset named "backup" which will be automounted under the zpool directory (/zfspool/backup).

Obviously this can be more complicated. Remember we already set casesensitivity and compression as properties of the pool above; if you don't want those inherited by this dataset, they can be overridden at creation:
zfs create -o casesensitivity=sensitive -o compression=off zfspool/backup
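If you plan several datasets, a loop saves typing. A dry-run sketch (the dataset names are examples of mine, not from the commands above; the echo is left in so you can inspect the commands before removing it to actually run them):

```shell
# dry-run: print the zfs create commands for a handful of example datasets
for ds in music documents backup; do
  echo sudo zfs create -o compression=lz4 "zfspool/$ds"
done
```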

You can now show the properties of the dataset with:
zfs list

root@gertdus-server:~# zfs list
NAME             USED   AVAIL  REFER  MOUNTPOINT
zfspool          4.24T  2.89T  416K   /zfspool
zfspool/backup   242G   108G   242G   /zfspool/backup

This shows the space used and available (I have defined a quota here) and the mountpoints. Note that now the correct space used is shown (contrary to zpool list, where space lost to parity was not subtracted). The total is identical though.

Dataset properties - Network sharing:

Setting up network sharing using zfs is extremely simple, though on Linux it requires a configured samba / nfs system. If you haven't configured one yet, install samba / nfs. Make sure that the user you want to share for is also available as a samba user and that the password is correct (see part III for further info on this).

Installing samba and nfs
apt-get install nfs-kernel-server samba

If your username is steve then use:
smbpasswd steve

Setting the actual share is as simple as:
sudo zfs set sharesmb=on zfspool/backup
sudo zfs set sharenfs=on zfspool/backup

You can then verify with
zfs get all | grep share

root@gertdus-server:~# zfs get all | grep share
zfspool                sharenfs              off                     default
zfspool                sharesmb              off                     default
zfspool/backup         sharenfs              off                     default
zfspool/backup         sharesmb              on                      local

Make sure that steve has permission to access all the files in your shared folders. This command inside your shares should help (change the name, obviously):
cd /zfspool/backup
sudo chown -R username:username *

You might wish to change the workgroup name for Samba in /etc/samba/smb.conf.

Dataset properties - Quotas:

Setting a dataset quota is also very simple. Unfortunately, to my knowledge there is no support for individual users or groups, just quotas on a per-dataset level. Not setting them has the effect that each and every dataset you create gets listed as having the full zpool worth of free space (which can be messy).

Setting the quota (two examples; both G and T suffixes work):
zfs set quota=1862G zfspool/backup
zfs set quota=1T zfspool/backup

check with
zfs get all | grep quota

root@gertdus-server:~# zfs get all | grep quota
zfspool                quota                 none                    default
zfspool                refquota              none                    default
zfspool/backup         quota                 350G                    local
zfspool/backup         refquota              none                    default

Dataset properties - Compression:

Compression is another property that can be turned on. In particular LZ4 can be highly advantageous, as it requires little CPU (a resource usually not very stressed on a file server) but translates into much more effective use of space and quicker access.

When compressing simple clear-text files (e.g. *.xml, *.txt, *.sdf) the compression can be impressive (I get 1.73, meaning the data takes up 1.73 times less space, i.e. 58% of the space it would occupy uncompressed).
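That percentage follows directly from the ratio: the compressed data occupies 1/ratio of its logical size. For my 1.73x:

```shell
# percentage of logical size occupied on disk for a given compressratio
awk 'BEGIN { printf "%.0f%%\n", 100 / 1.73 }'   # prints: 58%
```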

Hence I would recommend always enabling it at the pool level (as shown above). ZFS supports multiple types: LZ4, LZJB, and gzip. Just using 'on' turns on LZJB (the default); however LZ4 has been shown to compress better and run faster.

Again this is a property that can be set with zfs:
zfs set compression=lz4 zfspool/werk

check with (using compress gives both the property and ratio)
zfs get all | grep compress

root@gertdus-server:~# zfs get all | grep compress
zfspool                compressratio         1.09x                   -
zfspool                compression           lz4                     local
zfspool                refcompressratio      1.00x                   -
zfspool/backup         compressratio         1.10x                   -
zfspool/backup         compression           lz4                     local
zfspool/backup         refcompressratio      1.10x                   -
zfspool/werk           compressratio         1.73x                   -
zfspool/werk           compression           lz4                     local
zfspool/werk           refcompressratio      1.73x                   -

Easy as pie... Note that the ratio for the whole pool is 1.09, so the data takes up 92% of the space, saving me in total 541 GB (8% of the total 7.26 TB)!

Data Scrubbing:

ZFS provides the unique ability to validate the quality of your data via data scrubbing. Herein the data blocks are validated against their checksums and can be repaired. It is recommended to do this every once in a while to check for silent corruption. It is generally accepted that the time between scrubs can be longer with enterprise-grade disks than with consumer disks. Some people claim that the stress caused on the disks by doing a scrub leads to a lower life expectancy.

However, the people at Backblaze show that this life expectancy is actually pretty good (http://blog.backblaze.com/2013/11/12/how-long-do-disk-drives-last/), so I have settled on a scrub every 30 days (as included in the ZFS health script below).

Running a scrub is easy:
zpool scrub zfspool

And there you go. The time it takes differs based on the number of disks, controller speed, CPU, etc. (I get about 230 MB/s, so a scrub runs in around 8-9 hours).
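You can estimate the duration yourself from the allocated space shown by 'zpool list' and your scrub throughput. A rough sketch with my numbers (6.37 TB allocated at 230 MB/s, glossing over TB vs TiB):

```shell
# rough scrub-time estimate: allocated space / scrub throughput
awk -v tb=6.37 -v mbs=230 \
    'BEGIN { printf "%.1f hours\n", tb * 1024 * 1024 / mbs / 3600 }'   # prints: 8.1 hours
```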


Give user permission for status polling of ZFS (tested on Ubuntu 12 / CentOS 6):

This trick allows a user to run zpool status, zfs get, etc. without sudo (handy for quick reference; it also allows you to use this cool app).

Perform all of the following as root or using sudo
leafpad /etc/udev/rules.d/91-zfs-permissions.rules

insert the contents:
#Use this to add a group and more permissive permissions for zfs
#so that you don't always need to run it as root. Beware: non-root users
#can do nearly EVERYTHING, including, but not limited to, destroying
#volumes and deleting datasets. They CANNOT mount datasets, create new
#volumes, export datasets via NFS, or do other things that require root
#permissions outside of ZFS.
ACTION=="add", KERNEL=="zfs", MODE="0660", GROUP="zfs"

Adjust "GROUP" and "MODE" to your needs, then run the following commands:
groupadd zfs
gpasswd -a gertdus zfs

again, adjusting "zfs" to match the group in your udev file.



Troubleshooting:

Here I will list some problems I encountered and simple solutions to them (makes life easier).

When datasets are not automounted:

It's a bit dirty, but it works. I have had automount problems with 0.6.2, not with 0.6.1. Add to /etc/rc.local:
zfs mount -a

When kernel modules are not recompiled after a kernel upgrade:

Reinstall the zfs dkms package:
apt-get install --reinstall zfs-dkms

Get the version number of the registered modules:
dkms status

(The version will differ if you use e.g. the daily ppa.)
Try to build the modules manually, filling in the version reported by dkms status:
dkms remove -m zfs -v <version> --all
dkms remove -m spl -v <version> --all
dkms add -m spl -v <version>
dkms add -m zfs -v <version>
dkms install -m spl -v <version>
dkms install -m zfs -v <version>

If you get the same error, then reinstall the headers package:
apt-get install linux-headers-`uname -r` linux-headers-generic build-essential

Useful zpool commands:

sudo zpool list
list info on zfs pools rather than datasets

sudo zpool history zpoolname
shows history for named zpool

sudo zpool clear zpoolname
removes errors, starts resilvering if needed

sudo zdb -DD zpoolname
shows deduplication histogram

sudo zpool iostat
shows IO parameters

sudo zpool scrub zpoolname
manually starts scrub job

zpool set propertyname=value zpoolname
zfs set propertyname=value datasetname
change a single property: e.g. zfs set dedup=on zfstest / zfs set sharesmb=on zfstest

Useful zfs commands:

zfs mount
display mounted sets

zfs get mountpoint datasetname
get mountpoint of dataset

zfs set mountpoint=/foo_mount data
set mountpoint of dataset

sudo zfs mount -a
mounts all datasets

sudo zfs mount datasetname
mounts said dataset

zfs unmount datasetname
zfs unmount -a
unmounts the named dataset / all datasets, analogous to the mount commands above

zfs get all zpoolname
zfs get all datasetname
get all information on the zpool or dataset

zfs get propertyname zpoolname
zfs get propertyname datasetname
get information on 1 property : e.g. zfs get dedup zfstest / zfs get sharesmb zfstest

ZFS health script from Calomel.org:

#! /bin/bash
# Calomel.org
#     https://calomel.org/zfs_health_check_script.html
#     FreeBSD 9.1 ZFS Health Check script
#     zfs_health.sh @ Version 0.15

# Check health of ZFS volumes and drives. On any faults send email. In FreeBSD
# 10 there is supposed to be a ZFSd daemon to monitor the health of the ZFS
# pools. For now, in FreeBSD 9, we will make our own checks and run this script
# through cron a few times a day.

# 99 problems but ZFS ain't one

problems=0

# Health - Check if all zfs volumes are in good condition. We are looking for
# any keyword signifying a degraded or broken array.

condition=$(/sbin/zpool status | egrep -i '(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|FAIL|DESTROYED|corrupt|cannot|unrecover)')
if [ "${condition}" ]; then
        emailSubject="`hostname` - ZFS pool - HEALTH fault"
        problems=1
fi

# Capacity - Make sure pool capacities are below 80% for best performance. The
# percentage really depends on how large your volume is. If you have a 128GB
# SSD then 80% is reasonable. If you have a 60TB raid-z2 array then you can
# probably set the warning closer to 95%.
#
# ZFS uses a copy-on-write scheme. The file system writes new data to
# sequential free blocks first and when the uberblock has been updated the new
# inode pointers become valid. This method is true only when the pool has
# enough free sequential blocks. If the pool is at capacity and space limited,
# ZFS will have to randomly write blocks. This means ZFS can not create an
# optimal set of sequential writes and write performance is severely impacted.

maxCapacity=80

if [ ${problems} -eq 0 ]; then
   capacity=$(/sbin/zpool list -H -o capacity)
   for line in ${capacity//%/}
     do
       if [ $line -ge $maxCapacity ]; then
         emailSubject="`hostname` - ZFS pool - Capacity Exceeded"
         problems=1
       fi
     done
fi

# Errors - Check the columns for READ, WRITE and CKSUM (checksum) drive errors
# on all volumes and all drives using "zpool status". If any non-zero errors
# are reported an email will be sent out. You should then look to replace the
# faulty drive and run "zpool scrub" on the affected volume after resilvering.

if [ ${problems} -eq 0 ]; then
   errors=$(/sbin/zpool status | grep ONLINE | grep -v state | awk '{print $3 $4 $5}' | grep -v 000)
   if [ "${errors}" ]; then
        emailSubject="`hostname` - ZFS pool - Drive Errors"
        problems=1
   fi
fi

# Scrub Expired - Check if all volumes have been scrubbed in at least the last
# 8 days. The general guide is to scrub volumes on desktop quality drives once
# a week and volumes on enterprise class drives once a month. You can always
# use cron to schedule "zpool scrub" in off hours. We scrub our volumes every
# Sunday morning for example.
#
# Scrubbing traverses all the data in the pool once and verifies all blocks can
# be read. Scrubbing proceeds as fast as the devices allow, though the
# priority of any I/O remains below that of normal calls. This operation might
# negatively impact performance, but the file system will remain usable and
# responsive while scrubbing occurs. To initiate an explicit scrub, use the
# "zpool scrub" command.
#
# The scrubExpire variable is in seconds. So for 8 days we calculate 8 days
# times 24 hours times 3600 seconds to equal 691200 seconds.

scrubExpire=2592000   # 30 days (I scrub monthly; Calomel's 8 day example would be 691200)

if [ ${problems} -eq 0 ]; then
   currentDate=$(date +%s)
   zfsVolumes=$(/sbin/zpool list -H -o name)

   for volume in ${zfsVolumes}
     do
      if [ $(/sbin/zpool status $volume | egrep -c "none requested") -ge 1 ]; then
          echo "ERROR: You need to run \"zpool scrub $volume\" before this script can monitor the scrub expiration time."
          break
      fi
      if [ $(/sbin/zpool status $volume | egrep -c "scrub in progress|resilver") -ge 1 ]; then
          break
      fi

      ### FreeBSD with *nix supported date format
      #scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $15 $12 $13}')
      #scrubDate=$(date -j -f '%Y%b%e-%H%M%S' $scrubRawDate'-000000' +%s)

      ### Ubuntu with GNU supported date format
      scrubRawDate=$(/sbin/zpool status $volume | grep scrub | awk '{print $11" "$12" "$13" "$14" "$15}')
      scrubDate=$(date -d "$scrubRawDate" +%s)

      if [ $(($currentDate - $scrubDate)) -ge $scrubExpire ]; then
          emailSubject="`hostname` - ZFS pool - Scrub Time Expired. Scrub Needed on $volume"
          problems=1
      fi
     done
fi

# Notifications - On any problems send email with drive status information and
# capacities including a helpful subject line to root. Also use logger to write
# the email subject to the local logs. This is the place you may want to put
# any other notifications like:
# + Update an anonymous twitter account with your ZFS status (https://twitter.com/zfsmonitor)
# + Playing a sound file or beep the internal speaker
# + Update Nagios, Cacti, Zabbix, Munin or even BigBrother

if [ "$problems" -ne 0 ]; then
   echo -e "$emailSubject \n\n\n `/sbin/zpool list` \n\n\n `/sbin/zpool status`" | mail -s "$emailSubject" root
   logger $emailSubject
fi

if [ "$problems" -eq 0 ]; then
   echo "ZFS Healthy"
fi

### EOF ###
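As the script's header notes, it is meant to run through cron a few times a day. Hypothetical root crontab entries to do that, plus the monthly scrub (the path, times, and pool name are examples, not from the original setup):

```
# m h dom mon dow  command
0 */8 *   *   *    /root/zfs_health.sh
0 2   1   *   *    /sbin/zpool scrub zfspool
```

Add these with 'crontab -e' as root.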