Series: Ubuntu Home Server (ZFS + virtual IPFire)
Part IV: How much space do you lose with a RaidZ1/RaidZ2/RaidZ3?
-->Part V : Poor man's deduplication (this post)<--
Poor man's deduplication
Table of Contents:
Why deduplication?
How to check for duplicate files
Processing and sorting your list
Consolidating directories
Removing duplicates
Why deduplication?
I use our ZFS server / NAS mostly for backup and network storage. As I work in cheminformatics, my job produces a lot of data, and supervising students adds to this. Much of this data and many old files need to be stored (in particular, if you have published on the data, it needs to remain reproducible). I like to think I am pretty organised in the way I store data, but it turns out I am not... I sometimes work from home, sometimes on my workstation at work, and sometimes on a laptop (and in the pre-ZFS era I used to make backups on external drives).
When I installed my NAS I consciously turned off deduplication, for two reasons: firstly, the server hardware was probably not powerful enough (an X3 with 16 GB RAM); secondly, I thought I did not need it (convinced that my file system was well organised...).
However, ZFS showed that I had a number of duplicate blocks. In fact, the output of zdb -S showed that I could in theory obtain a 1.10x reduction (about 9%) from deduplication. The output is listed here:
root@KarelDoorman:~# zdb -S zfspool
Simulated DDT histogram:
bucket allocated referenced
______ ______________________________ ______________________________
refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE
------ ------ ----- ----- ----- ------ ----- ----- -----
1 35.7M 4.43T 4.13T 4.15T 35.7M 4.43T 4.13T 4.15T
2 2.70M 333G 274G 278G 5.80M 717G 589G 597G
4 275K 30.7G 23.6G 24.3G 1.29M 146G 113G 116G
8 32.4K 3.39G 2.66G 2.74G 317K 33.0G 26.2G 27.0G
16 4.02K 254M 207M 224M 81.7K 4.86G 3.94G 4.28G
32 913 16.9M 9.36M 15.1M 40.1K 685M 365M 625M
64 59 1.49M 590K 975K 4.83K 124M 46.0M 77.5M
128 18 574K 15K 144K 3.28K 117M 2.72M 26.2M
256 6 390K 4.50K 48.0K 2.01K 151M 1.58M 16.0M
512 5 258K 4K 40.0K 4.10K 151M 3.12M 32.8M
1K 4 257K 3K 32.0K 5.52K 379M 4.24M 44.1M
2K 2 1K 1K 16.0K 5.18K 2.59M 2.59M 41.4M
8K 1 128K 1K 7.99K 10.1K 1.27G 10.1M 81.0M
16K 1 128K 1K 7.99K 25.8K 3.22G 25.8M 206M
Total 38.7M 4.79T 4.42T 4.44T 43.3M 5.32T 4.85T 4.87T
dedup = 1.10, compress = 1.10, copies = 1.01, dedup * compress / copies = 1.20
However, this also shows that there are in total 38.7 million unique blocks in my file system, which would require 38.7 million * 320 bytes ≈ 11.5 GB of RAM for the dedup table alone. Since the rule of thumb is that the dedup table should take up no more than about 25% of the ARC, you would want roughly four times that amount of RAM, so about 48 GB.
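As a quick back-of-the-envelope check (the ~320 bytes per DDT entry is the commonly quoted rule of thumb, not something zdb reports directly):

echo "scale=1; 38700000 * 320 / 1024 / 1024 / 1024" | bc     # ~11.5 GiB for the dedup table itself
echo "scale=1; 38700000 * 320 * 4 / 1024 / 1024 / 1024" | bc  # ~46 GiB of RAM to keep the table under ~25% of ARC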
At current prices (Tweakers.net) that amount of RAM costs about 800 EUR, which corresponds to 5.7 disks of 2 TB. So deduplication in ZFS is out of my budget; hence, poor man's (manual) deduplication.
If you're interested, this is what it looks like graphically (note the logarithmic y scale):
This shows that there are 35,700,000 blocks that are unique, 2,700,000 that are referenced twice, 275,000 that are referenced three to four times, etc. It even shows that there is a single block referenced over 16,000 times. Hence I thought I'd try deduplication at the file level instead; maybe there was something to gain there.
How to check for duplicate files?
Checking for duplicates can be rather tedious. Files can be named differently, have different timestamps, etc. For this a brilliant program has been written: fdupes. GitHub page: https://github.com/adrianlopezroche/fdupes. It even has its own Wikipedia page: http://en.wikipedia.org/wiki/Fdupes
"The program first compares file size and MD5 signatures and then performs a byte-by-byte check for verification."
So you can be rather sure that a duplicate is an actual duplicate. I ran fdupes on my /zfspool folder, which is the root folder of all my ZFS datasets, with an invocation along the lines shown below. In total about 1.7 million files are stored on the pool, taking up approximately 5 TB of space.
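The run itself was something like this (the -r flag makes fdupes recurse into subdirectories; the output file is the one the scripts below read):

fdupes -r /zfspool > /media/dupes/duplicates.txt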
I then processed and grouped the output using pipelining tools (e.g. KNIME or Pipeline Pilot). It turns out that of the 1.7 million files, about 665,692 were duplicates, taking up 893.51 gigabytes!
So much for an organised file system: the majority of these duplicates were actually in my work folder...
Processing and sorting your list
By default fdupes outputs the duplicates with their full paths; this can be redirected to a text file. The duplicates are grouped, with blank lines separating the groups of identical files.
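For illustration, the grouped output looks something like this (hypothetical paths); each blank-line-separated block is one set of identical files:

/zfspool/work/project1/results.sdf
/zfspool/backup/laptop/project1/results.sdf

/zfspool/work/papers/figure3.png
/zfspool/work/old_backup/figure3.png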
Now I wanted to gain a lot of space quickly, so I wanted to start with the largest files; moreover, I wanted a unique identifier per file to browse through. Using two simple shell scripts I did the following:
1. Remove the empty lines:
#!/bin/sh
# Strip the blank lines that fdupes uses to separate the groups of duplicates
files="/media/dupes/duplicates.txt"
for i in $files
do
    sed '/^$/d' "$i" > duplicates_out.txt
done
2. For each file, calculate the MD5 hash (also because I am paranoid) and add the size in bytes (for later sorting):
#!/bin/sh
# Read the de-blanked list and print: name <TAB> md5 hash <TAB> size in bytes
while IFS= read -r name
do
    hash=$(md5sum "$name" | awk '{print $1}')
    size=$(ls -l "$name" | awk '{print $5}')
    printf '%s\t%s\t%s\n' "$name" "$hash" "$size"
done < duplicates_out.txt
Redirect the script's output to a file and you get three columns separated by tabs: name, hash, and size (in bytes).
Using Pipeline Pilot I created a unique ID (hash_size), calculated the size in MB / GB, and flagged the first occurrence of each ID (the protocol is available HERE, but this can easily be done with KNIME).
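If you do not have Pipeline Pilot or KNIME at hand, a rough shell equivalent of the ID-plus-sort step would be something like this (my own sketch, not the protocol above; assumes bash for the $'\t' tab separator, and duplicates_hashed.txt is whatever file you redirected the output of step 2 into):

# append a hash_size ID as a fourth column, then sort by size (column 3), largest first
awk -F'\t' -v OFS='\t' '{print $0, $2 "_" $3}' duplicates_hashed.txt | sort -t$'\t' -k3,3nr > duplicates_sorted.txt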
Consolidating directories
From this I observed that several directories were duplicates of other directories; with rsync you can easily merge them (I used timestamps and kept a log):
rsync -avhP /directory1/ /directory2/ > /logging_dir/transferlogs/1.txt
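It is worth doing a dry run first: with -n (--dry-run) rsync only reports what it would transfer, without touching anything:

rsync -avhPn /directory1/ /directory2/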
Removing duplicates
After this I re-ran fdupes, repeated the processing, and deleted the remaining duplicates. In total I freed about 700 GB!
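If you trust it, fdupes can also do the deleting for you: -d prompts you for which copy to keep in each set, and adding -N keeps the first file of each set without asking. The dataset path below is just a placeholder; be careful with this on anything you have not backed up.

fdupes -rd /zfspool/some/dataset     # interactive: choose which copy to keep per set
fdupes -rdN /zfspool/some/dataset    # non-interactive: keeps the first file of each set, deletes the rest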