Sunday, August 24, 2014

Part IV: ZFS - RAIDZx - Howto: Ubuntu Home Server (ZFS + virtual IpFire)

Series : Ubuntu Home Server (ZFS + virtual IpFire) 

-->Part IV: How much space do you lose with a RaidZ1/RaidZ2/RaidZ3? (this post)<--
Part V : Poor man's deduplication

How much space do you lose with a RaidZ1/RaidZ2/RaidZ3?

(hint: RAID calculators are wrong...)

Table of Contents:

Why space is lost
Sector sizes
How I simulated the lost space
Results
Conclusions
Update - tested with actual data

Why space is lost

How much space do I lose for a 7 disk RaidZ2? 

How much space is available on 9 disks of 4TB in RaidZ1?

The idea behind any RAID is to introduce redundancy (RAID stands for Redundant Array of Inexpensive Disks), which is also why some argue that RAID-0 is not an actual RAID (hence the 0). Having redundancy by definition means losing net space. This can be a lot (half, as in RAID-1) or relatively little (1/nth of the disk space with n disks, as in RaidZ1), but space is always lost.

However, in the case of ZFS RaidZ arrays, more space is lost than just this traditional parity overhead. This is also why traditional RAID calculators commonly overestimate the amount of available space in a ZFS array.

ZFS uses a variable block size: blocks can be smaller than 128 KB and are 128 KB at most by default. Any file larger than 128 KB is split into blocks of 128 KB, which are distributed evenly over the (data) disks. With an array of 4 data disks this means 32 KB per disk, a nice round number. With a different number of drives (e.g. 7) the number is uglier (18.29 KB), and the drive then usually loses more sectors than strictly needed (see below), as it cannot write half sectors. This means that ZFS can disappoint when a non-optimal number of drives is placed in a RaidZx array.
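As a quick sketch of the split described above (assuming the simple 128 KB record divided over the data disks, ignoring sector rounding for now):

```python
def per_disk_kb(record_kb, data_disks):
    """Per-disk share of one full ZFS record, before sector rounding."""
    return record_kb / data_disks

print(per_disk_kb(128, 4))            # 4 data disks: a clean 32 KB each
print(round(per_disk_kb(128, 7), 2))  # 7 data disks: an ugly 18.29 KB each
```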

Sector sizes


Traditionally, hard drives had 512 (2^9) byte sectors. Hard drives can only write full sectors (either written with data or empty), which means that any file (even a tiny one) occupies a multiple of 512 bytes, i.e. a whole number of sectors. However, as larger drives (with denser platters) were introduced, this led to problems requiring better error correction. The problem was solved by the introduction of 4096 (2^12) byte sectors, eight times the earlier sector size. Drives with 4K sectors are known as Advanced Format drives. A large number (but not all!) of Advanced Format drives can also emulate 512 byte sectors.

ZFS can deal with both by setting the property ashift upon pool creation.

Note that ashift cannot be changed later and requires the pool to be destroyed if you want to change it so think carefully before placing data on your pool!

ashift = 9 (from 2^9) means that ZFS uses 512 byte sectors; ashift = 12 (from 2^12) means that ZFS uses 4096 byte sectors (see also part 2).

It is expected that next generation drives will not be able to emulate 512 byte sectors, so these drives cannot be added to a pool created with ashift = 9! 

The problem is that, in our earlier example, a 7 data disk RaidZx would require 37 sectors of 512 bytes per disk (= 18.50 KB per disk, about 1% extra space) on a 512 byte sector disk. On a 4K disk it would require 5 sectors of 4096 bytes per disk (= 20.00 KB per disk, about 9% extra space). So you can see a difference start to develop.
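The rounding above can be checked with a small calculation (a sketch of the sector padding only, not of ZFS's full allocation logic):

```python
import math

def pad_to_sectors(payload_bytes, sector_bytes):
    """Round a per-disk payload up to whole sectors, as a drive must."""
    sectors = math.ceil(payload_bytes / sector_bytes)
    return sectors, sectors * sector_bytes

payload = 128 * 1024 / 7  # per-disk share for 7 data disks: ~18725.7 bytes
for sector in (512, 4096):
    n, padded = pad_to_sectors(payload, sector)
    print(f"{sector} byte sectors: {n} sectors = {padded / 1024:.2f} KB "
          f"({padded / payload - 1:.0%} extra)")
```

This reproduces the 37 sectors (18.50 KB, ~1% extra) and 5 sectors (20.00 KB, ~9% extra) from the text.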

How I simulated the lost space


A lot of people have done a lot of work on accurately calculating this, but it is difficult to get right. So I figured I would simulate it (more or less accurately). I set up a virtual machine in VirtualBox and created 20 hard disk files of 4 gigabytes the way HDD manufacturers count them (4,000,000,000 bytes, so approximately 4 manufacturer's GBs; I'll call them mGB). Using this virtual machine I created every possible RaidZx array using between 3 and 20 drives (RaidZ1 / Z2 / Z3). Afterwards I used 'zpool list' and 'zfs list' to view the net space available. I expect these results to translate directly to arrays built from 4TB drives as sold by manufacturers (or 2TB (50%), 1TB (25%)).

So basically I created and destroyed 108 zfs pools in a virtual machine.
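For reference, the create/destroy loop can be sketched as a command generator. This is a sketch only: the pool name 'tank' and the /dev/vdisk device paths are placeholders, not the actual names used in the VM.

```python
import itertools

def pool_commands(max_disks=20):
    """Build the zpool create/destroy command pairs for every RaidZ1/Z2/Z3
    layout that fits in max_disks drives, for both ashift=9 and ashift=12."""
    cmds = []
    for parity, ashift in itertools.product((1, 2, 3), (9, 12)):
        # data disks range from 1 up to (max_disks - parity) total drives
        for data_disks in range(1, max_disks - parity + 1):
            disks = " ".join(f"/dev/vdisk{i}" for i in range(data_disks + parity))
            cmds.append((
                f"zpool create -o ashift={ashift} tank raidz{parity} {disks}",
                "zpool destroy tank",
            ))
    return cmds

print(len(pool_commands()))  # 108, matching the number of pools tested
```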

Figure 1: Virtual machine set up


This was a lot of fun, see screenshot below

Figure 2: 16 virtual disks RaidZ3 array


Results


The results are displayed below in plots and a table. By theoretical space I mean the space available after removing the parity disks, so you would expect a 7 data drive RaidZx to have 7 * 3724 GB = 26068 GB / 25.46 TB available (assuming manufacturer 4TB drives, which are actually 3.64 TB / 3724 GB).

However, this requires 8 drives in RaidZ1, 9 drives in RaidZ2, and 10 drives in RaidZ3 to provide parity (1/nth of the disks in RaidZ1, 2/nth in RaidZ2, 3/nth in RaidZ3), so the gross space is 29.09 TB (Z1), 32.73 TB (Z2), and 36.37 TB (Z3). In the table below these additional disks have been removed and are not considered!
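The gross figures follow directly from converting manufacturer TB (10^12 bytes) to binary TB (2^40 bytes); tiny differences with the numbers in the text are rounding:

```python
def binary_tb(manufacturer_tb):
    """Convert manufacturer 'TB' (10^12 bytes) to binary TB (2^40 bytes)."""
    return manufacturer_tb * 1e12 / 2**40

per_disk = binary_tb(4)  # a '4TB' drive is ~3.64 TB
data_disks = 7
for parity in (1, 2, 3):
    total = data_disks + parity
    print(f"RaidZ{parity}: {total} drives, gross {total * per_disk:.2f} TB, "
          f"theoretical net {data_disks * per_disk:.2f} TB")
```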

Table 1 fraction of space available of the theoretical available space.
Data disks RaidZ1-0.5k RaidZ2-0.5k RaidZ3-0.5k RaidZ1-4k RaidZ2-4k RaidZ3-4k
1 0.98 0.98 0.98 0.98 0.98 0.98
2 0.98 0.98 0.98 0.98 0.95 0.98
3 0.98 0.98 0.98 0.95 0.97 0.92
4 0.99 0.99 0.99 0.99 0.99 0.99
5 0.98 0.98 0.98 0.94 0.92 0.89
6 0.98 0.98 0.98 0.97 0.93 0.91
7 0.98 0.98 0.98 0.94 0.97 0.94
8 0.99 0.98 0.99 0.99 0.94 0.99
9 0.98 0.98 0.98 0.97 0.92 0.95
10 0.99 0.98 0.98 0.97 0.90 0.93
11 0.98 0.98 0.98 0.95 0.96 0.91
12 0.99 0.98 0.97 0.95 0.94 0.89
13 0.98 0.98 0.98 0.94 0.93 0.88
14 0.98 0.98 0.97 0.94 0.92 0.87
15 0.98 0.97 0.97 0.93 0.92 0.86
16 0.99 0.99 0.99 0.99 0.99 0.94
17 0.98 0.98 0.98 0.98 0.98 0.92
18 0.98 0.97 n/a 0.98 0.97 n/a
19 0.98 n/a n/a 0.97 n/a n/a

As expected, using 512 byte sectors leads to little lost space (whether in RaidZ1 or the other formats). However, in particular in the case of RaidZ3, sometimes a lot of space is lost. For instance, a 15 data disk RaidZ3 using 4K sectors has only 86% of the space available. This means that with 4TB disks you actually lose 7.64 TB (0.14 * (15 * 3.64 TB))! In other words, your RaidZ3 does not cost you 3 drives but 3 plus an additional 2.1 drives. So you place 18 * 3.64 TB = 65.52 TB in your chassis and get only 12.9 * 3.64 TB = 46.96 TB of storage....
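The arithmetic for that worst case can be checked directly (using the 0.86 efficiency from Table 1 and 3.64 TB per manufacturer's 4TB drive):

```python
drive_tb = 3.64                # a manufacturer '4TB' drive in binary TB
data_disks, parity = 15, 3
efficiency = 0.86              # RaidZ3-4k with 15 data disks (Table 1)

lost = (1 - efficiency) * data_disks * drive_tb
installed = (data_disks + parity) * drive_tb
usable = efficiency * data_disks * drive_tb

print(f"lost: {lost:.2f} TB (= {lost / drive_tb:.1f} extra drives)")
print(f"installed: {installed:.2f} TB, usable: {usable:.2f} TB")
```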

Perhaps the results are clearer in a chart: first a 512 byte sector size chart, followed by a 4K sector size chart.

Figure 3 : 512 bytes sector size net available fraction


Figure 4: 4k sector size net available fraction

Another way of looking at the data is to see what fraction of the last disk you added is actually used. Ideally, if you have 3 disks (again 3.64 TB each) the capacity is 10.92 TB, and this goes up to 14.56 TB if you add a 4th. However, this is not always the case.


Table 2 fraction of space from last added drive actually used.
Data disks RaidZ1-0.5k RaidZ2-0.5k RaidZ3-0.5k RaidZ1-4k RaidZ2-4k RaidZ3-4k
1 0.97 0.97 0.97 0.97 0.97 0.97
2 0.98 0.97 0.97 0.98 0.91 0.97
3 0.97 0.98 0.97 0.88 1.00 0.80
4 0.99 0.99 0.99 1.07 1.02 1.16
5 0.96 0.96 0.93 0.77 0.66 0.52
6 0.96 0.96 0.96 1.07 0.96 0.99
7 0.93 0.99 0.96 0.80 1.18 1.07
8 1.04 0.93 1.04 1.26 0.71 1.32
9 0.91 0.99 0.91 0.85 0.77 0.69
10 1.04 0.96 0.96 0.91 0.71 0.74
11 0.93 0.93 0.96 0.82 1.54 0.69
12 1.02 0.99 0.88 0.91 0.77 0.69
13 0.93 0.96 1.07 0.85 0.77 0.69
14 0.88 0.93 0.88 0.82 0.77 0.80
15 0.99 0.93 0.96 0.85 0.91 0.69
16 1.13 1.18 1.18 1.81 1.98 2.03
17 0.88 0.82 0.80 0.88 0.82 0.74
18 0.88 0.85 n/a 0.88 0.85 n/a
19 1.04 n/a n/a 0.91 n/a n/a

So the worst case is a 5 data disk RaidZ3: only 52% of the 5th disk is used! Again, using 512 byte sectors is more efficient.
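Table 2 can be derived from Table 1: the fraction of the n-th data disk actually gained equals n times the cumulative efficiency at n disks, minus (n-1) times the efficiency at n-1 disks. Values above 1 occur when adding a disk also reduces the padding loss on the existing disks. Because Table 1 is rounded to two decimals, this reconstruction only approximates the Table 2 values:

```python
def marginal_fraction(n, eff_n, eff_prev):
    """Fraction of the n-th data disk actually gained, from the cumulative
    efficiencies (Table 1) at n and n-1 data disks."""
    return n * eff_n - (n - 1) * eff_prev

# RaidZ1-4k, adding the 4th data disk (Table 1: 0.95 -> 0.99)
print(round(marginal_fraction(4, 0.99, 0.95), 2))
```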

Conclusions

So please use this as a lookup post to figure out how many drives you want to place in your array. One last noteworthy thing: it always pays off to add more drives (an array with more drives is always larger, just sometimes not by much).

Figure 5: total array size for various RaidZx configurations



I had a lot of fun doing this, also experimenting with more exotic pool configurations just to see how it's done.



Update

I received some feedback that 'zfs list' only provides an estimate and hence might be erroneous. So I chose to test some selected configurations (4, 7, 8, and 15 data disks with 4K sectors in Z1, Z2, and Z3). I used dd to fill a dataset created on the volume with data from /dev/zero using blocks of 1M.

The deviations I found were < 1% (4/12 were 100% correct, 6/12 were 0.1% smaller than the estimate, 1/12 was 0.1% larger, and 1/12 was 0.4% larger). This led me to conclude that the estimate from 'zfs list' is accurate enough not to warrant a full data investigation.

Figure 6: Table listing results of actual data tests.

4 comments:

  1. Nice Article!

    Actually, you can add 4K devices to an ashift=9 pool but you must enforce this with the -o ashift=9 option when adding the drive. I've tried and it worked. It's a relatively new feature.

    The estimated free space by zfs list is based on 128K blocks so it should be no surprise that your figures matched your dd testing. If you start to write a lot of small files to the pool, things will become ugly, especially for RAIDZ3, because parity is done at the file/block level, so more files = more parity, if I'm correct.

  2. Thanks for your comment!

    The new ashift feature is great! Although I read on your blog (kudos for your posts btw) that performance deteriorates rather fast using ashift=9?

    On your second point: so with more smaller files the numbers would be more negative (i.e. the net space would be smaller). I will add it to the text. Actually this is (for me) a bias in the right direction, as I would tend to avoid the setups with a > 6% loss anyway.

  3. Great post man! But I'm wondering, is this something that will be an "issue" on disks with real 4K sectors (aka, not Advanced Format) or do we only lose this much space on disks with fake 4K sectors?

  4. Hi Kai,

    As far as I know this is true in both emulated and real 4k disks.

    Although I did learn that ZFS can now force 512 byte sectors using ashift=9 on 4K native disks. Yet this comes at a significant performance penalty....
