Saturday, 30 August 2014

Part V: ZFS - Poor man's deduplication - Howto: Ubuntu Home Server (ZFS + virtual IpFire)



Series : Ubuntu Home Server (ZFS + virtual IpFire) 

Part IV: How much space do you lose with a RaidZ1/RaidZ2/RaidZ3?
-->Part V : Poor man's deduplication  (this post)<--


Poor man's deduplication..

Table of Contents:

Why deduplication?
How to check for duplicate files
Processing and sorting your list
Consolidating directories
Removing duplicates



Why deduplication?

I use our ZFS server / NAS mostly for backup and network storage. As I work in cheminformatics, my job produces a lot of data, and supervising students adds to this. Much of this data and many old files need to be stored (in particular, if you have published on the data it needs to remain reproducible). I like to think I am pretty organised in the way I store data, but it turns out I am not... I sometimes work from home, work on my workstation at work, and sometimes on a laptop (and I used to make backups on external drives in the pre-ZFS era).

When I installed my NAS I consciously turned off deduplication, for two reasons: firstly, the server hardware was probably not powerful enough (an X3 with 16 GB RAM); secondly, I thought I did not need it (convinced that my file system was well organised...).

However, ZFS showed that I had a fair number of duplicate blocks. In fact, the output of zdb -S showed that I could in theory obtain a 1.10 dedup ratio (about a 9% reduction) by deduplication. The output is listed here:

root@KarelDoorman:~# zdb -S zfspool
Simulated DDT histogram:

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    35.7M   4.43T   4.13T   4.15T    35.7M   4.43T   4.13T   4.15T
     2    2.70M    333G    274G    278G    5.80M    717G    589G    597G
     4     275K   30.7G   23.6G   24.3G    1.29M    146G    113G    116G
     8    32.4K   3.39G   2.66G   2.74G     317K   33.0G   26.2G   27.0G
    16    4.02K    254M    207M    224M    81.7K   4.86G   3.94G   4.28G
    32      913   16.9M   9.36M   15.1M    40.1K    685M    365M    625M
    64       59   1.49M    590K    975K    4.83K    124M   46.0M   77.5M
   128       18    574K     15K    144K    3.28K    117M   2.72M   26.2M
   256        6    390K   4.50K   48.0K    2.01K    151M   1.58M   16.0M
   512        5    258K      4K   40.0K    4.10K    151M   3.12M   32.8M
    1K        4    257K      3K   32.0K    5.52K    379M   4.24M   44.1M
    2K        2      1K      1K   16.0K    5.18K   2.59M   2.59M   41.4M
    8K        1    128K      1K   7.99K    10.1K   1.27G   10.1M   81.0M
   16K        1    128K      1K   7.99K    25.8K   3.22G   25.8M    206M
 Total    38.7M   4.79T   4.42T   4.44T    43.3M   5.32T   4.85T   4.87T

dedup = 1.10, compress = 1.10, copies = 1.01, dedup * compress / copies = 1.20

However, this also shows that there are in total 38.7 million unique blocks in my file system, which would require 38.7 million × 320 bytes ≈ 11.5 GiB of RAM for the dedup table alone (and roughly four times that in total, since the dedup table should stay below about 25% of the ARC: some 46-48 GB of RAM).
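As a quick sanity check, the arithmetic behind that estimate can be sketched in a few lines of shell (320 bytes per dedup-table entry is the commonly quoted rule of thumb; real entry sizes vary):

#!/bin/sh
# rough estimate of dedup table (DDT) RAM usage from the zdb -S block count
blocks=38700000                       # total allocated blocks reported above
ddt=$((blocks * 320))                 # ~320 bytes per DDT entry (rule of thumb)
echo "DDT size  : $((ddt / 1024 / 1024 / 1024)) GiB"
echo "RAM needed: $((ddt * 4 / 1024 / 1024 / 1024)) GiB"  # DDT <= ~25% of ARC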

At current prices (Tweakers.net) that amount of RAM costs about 800 EUR, corresponding to 5.7 disks of 2 TB. So deduplication in ZFS is out of my budget; hence, poor man's (manual) deduplication.

If you're interested, this is what it looks like graphically (note the logarithmic y scale):

This shows that there are 35,700,000 blocks that are unique, 2,700,000 that exist in duplicate, 275,000 that are referenced three or four times, and so on. It even shows that there is a single block referenced more than 16,000 times. Hence I thought I'd try the same thing at the file level; maybe there was something to gain there.



How to check for duplicate files? 

Checking for duplicates can be rather tedious: files can be named differently, have different timestamps, etc. For this a brilliant program has been written: fdupes. GitHub page: https://github.com/adrianlopezroche/fdupes. It even has its own Wikipedia page: http://en.wikipedia.org/wiki/Fdupes

"The program first compares file size and MD5 signatures and then performs a byte-by-byte check for verification." 

So you can be rather sure that a duplicate is an actual duplicate. I ran this on my /zfspool folder (which is the root folder of all ZFS datasets). In total about 1.7 million files are stored on my ZFS pool, taking up approximately 5 TB of space. The results were as follows (note that I processed and grouped the output using pipelining tools such as KNIME or Pipeline Pilot):


So it turns out that of those 1.7 million files, about 665,692 were duplicates, taking up 893.51 gigabytes!

So much for an organised file system, as the majority of these duplicates were actually in my work folder...
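For completeness: fdupes is available straight from the Ubuntu repositories, and its summary mode produces totals of this kind (a minimal sketch, assuming the /zfspool root used throughout this post):

# install fdupes from the standard Ubuntu repositories
sudo apt-get install fdupes

# -r recurses into subdirectories, -m prints only a summary
# (the number of duplicate files and the space they occupy)
fdupes -rm /zfspool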

Processing and sorting your list

By default fdupes outputs the duplicates with their full paths; this output can be redirected to a text file, in which the duplicate files are grouped together with blank lines separating the groups.
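For example (a sketch using the pool root and the output file that the scripts below expect):

# list all duplicate groups under the pool root and redirect them to a file;
# each group of identical files is separated from the next by a blank line
fdupes -r /zfspool > /media/dupes/duplicates.txt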

Now I wanted to gain a lot of space quickly and hence wanted to start with the largest files; moreover, I wanted a unique identifier per file to browse through. So, using two simple bash scripts, I did the following:
1. Remove the empty lines:

#!/bin/sh
# strip the blank lines that fdupes uses to separate the duplicate groups
files="/media/dupes/duplicates.txt"
for i in $files
do
  sed '/^$/d' "$i" > duplicates_out.txt
done


2. For each file, calculate the MD5 hash (also because I am paranoid) and append the size in bytes (for later sorting):

#!/bin/sh
# reads file names (one per line) from the de-blanked list produced in step 1,
# e.g.: ./hashlist.sh < duplicates_out.txt > duplicates_hashed.txt
# (script and output names are just examples)
while IFS= read -r name
do
  hash=`md5sum "$name" | awk '{print $1}'`
  size=`ls -all "$name" | awk '{print $5}'`
  printf '%s\t%s\t%s\n' "$name" "$hash" "$size"
done

Afterwards you get a file with three tab-separated columns: name, hash, and size (in bytes).

Using Pipeline Pilot I created a unique ID (hash_size), calculated the size in MB / GB, and flagged the first occurrence of each ID (the protocol is available HERE, but this can just as easily be done with KNIME).
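If you do not have Pipeline Pilot or KNIME at hand, roughly the same can be done with sort and awk (a sketch; the input name duplicates_hashed.txt and the output name are just examples matching the usage note in the script above):

#!/bin/sh
# sort the name/hash/size list by size (3rd column), largest first, then
# build a unique ID (hash_size) and flag the first occurrence of each ID
sort -t "$(printf '\t')" -k3,3nr duplicates_hashed.txt | \
awk -F '\t' '{
  id = $2 "_" $3                   # unique ID: MD5 hash plus size in bytes
  first = (id in seen) ? 0 : 1     # 1 = first copy (the one to keep)
  seen[id] = 1
  printf "%s\t%s\t%.1f MiB\t%d\n", $1, id, $3 / 1048576, first
}' > duplicates_ranked.txt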

Consolidating directories

From this I observed that several directories were duplicates of other directories; with rsync you can easily merge them (I used timestamps and kept a log):

rsync -avhP /directory1/ /directory2/ > /loggind_dir/transferlogs/1.txt
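It can be worth doing a dry run first; with -n (--dry-run) rsync only reports what it would transfer. Also note that the trailing slash on the source makes rsync merge the contents of /directory1/ into /directory2/ rather than creating a subdirectory:

# dry run: list what would be copied, without changing anything
rsync -avhPn /directory1/ /directory2/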

Removing duplicates

After this I repeated the fdupes run and the processing, and deleted the remaining duplicates (one possible way of doing that is sketched below). In total I freed about 700 GB!
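One way to delete the remaining duplicates is fdupes' own delete mode (use with care: -d asks per group which file to keep, and -N keeps the first file of every group without asking):

# interactive: fdupes asks, per duplicate group, which file(s) to keep
fdupes -rd /zfspool

# non-interactive: keep the first file of each group and delete the rest
fdupes -rdN /zfspool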


