I am redesigning my homelab servers from scratch, and I want to give ZFS a try after some experimentation I've done in a VM.
I know there is a limitation: all disks in a vdev should be the same size, otherwise every member is treated as if it were only as large as the smallest one. That's a problem for me - I have an assorted bunch of variously sized HDDs, and unfortunately I have no budget for upgrades right now. So my goal is to squeeze the maximum out of what I already have without compromising ZFS.
Disclaimer: I know ZFS is an enterprise file system, and that in the enterprise world it's cheaper to buy a bag of identical disks rather than pay engineers to do what I am going to do.
So I have come up with the following workaround, which I want to validate.
My initial setup:
- 4 disks of 16 TB each (sd[abcd])
- 3 disks of 8 TB each (sd[efg])
- 2 disks of 6 TB each (sd[hi])
I carve a partition on every disk as large as the smallest non-zero free space - accounting for partition alignment etc., of course - and repeat this until no disk has free space left.
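A minimal sketch of that carving pass with sgdisk, assuming the device names below and nominal sizes (in practice the last slice on each disk would simply take whatever space remains):
# 16 TB disks: 6 TB + 2 TB + the remainder (~8 TB)
for d in /dev/sd{a..d}; do
    sgdisk -n1:0:+6T -n2:0:+2T -n3:0:0 "$d"
done
# 8 TB disks: 6 TB + the remainder (~2 TB)
for d in /dev/sd{e..g}; do
    sgdisk -n1:0:+6T -n2:0:0 "$d"
done
# 6 TB disks: a single slice covering the whole disk
for d in /dev/sd{h,i}; do
    sgdisk -n1:0:0 "$d"
done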
Final picture:
- 4 disks of 16 TB each (sd[abcd]):
part1: 6 TB
part2: 2 TB
part3: 8 TB
- 3 disks of 8 TB each (sd[efg]):
part1: 6 TB
part2: 2 TB
- 2 disks of 6 TB each (sd[hi]):
part1: 6 TB
+-------------------------+------------+----------------------------------+
sda: | sda1: 6 TB | sda2: 2 TB | sda3: 8 TB |
+-------------------------+------------+----------------------------------+
sdb: | sdb1: 6 TB | sdb2: 2 TB | sdb3: 8 TB |
+-------------------------+------------+----------------------------------+
sdc: | sdc1: 6 TB | sdc2: 2 TB | sdc3: 8 TB |
+-------------------------+------------+----------------------------------+
sdd: | sdd1: 6 TB | sdd2: 2 TB | sdd3: 8 TB |
+-------------------------+------------+----------------------------------+
sde: | sde1: 6 TB | sde2: 2 TB |
+-------------------------+------------+
sdf: | sdf1: 6 TB | sdf2: 2 TB |
+-------------------------+------------+
sdg: | sdg1: 6 TB | sdg2: 2 TB |
+-------------------------+------------+
sdh: | sdh1: 6 TB |
+-------------------------+
sdi: | sdi1: 6 TB |
+-------------------------+
Now I have equally sized partitions on physically different disks that I can use as building blocks for multiple RAIDZ vdevs:
# vdev 1 - part1 slices (6 TB) from all nine disks
# vdev 2 - part2 slices (2 TB) from the 16 TB and 8 TB disks
# vdev 3 - part3 slices (8 TB) from the 16 TB disks only
zpool create tank \
    raidz2 sda1 sdb1 sdc1 sdd1 sde1 sdf1 sdg1 sdh1 sdi1 \
    raidz2 sda2 sdb2 sdc2 sdd2 sde2 sdf2 sdg2 \
    raidz2 sda3 sdb3 sdc3 sdd3
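If my math is right, this buys a lot of capacity: (9-2)×6 + (7-2)×2 + (4-2)×8 = 42 + 10 + 16 = 68 TB usable out of 100 TB raw, versus (9-2)×6 = 42 TB if the nine whole disks went into a single RAIDZ2 (where each would count as 6 TB).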
I think the following should be true:
- every RAIDZ vdev uses only partitions that sit on physically different drives, so if one disk fails, at most one device per RAIDZ fails
- when one drive fails (say sda), all three RAIDZ vdevs will degrade, but replacing the disk and repartitioning it the same way will let ZFS recover transparently (a sketch of the procedure follows this list)
- since ZFS is said to prefer writing to the vdev with the most free space, and the first RAIDZ will have the most of it, the other RAIDZ vdevs shouldn't see much use until the first 6 TB slices fill up, so there shouldn't be IOPS bottlenecks. I hope so.
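For the record, the replacement procedure I have in mind, assuming the new disk comes up as sda again (a sketch, not a tested runbook):
# clone the partition table from a surviving 16 TB disk onto the new sda,
# then give the clone unique GUIDs
sgdisk --replicate=/dev/sda /dev/sdb
sgdisk --randomize-guids /dev/sda
# resilver each of the three degraded vdevs in place
zpool replace tank sda1
zpool replace tank sda2
zpool replace tank sda3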
The final touch is to set the I/O scheduler to "noop", though I am not sure whether ZFS is intelligent enough to schedule across the vdevs and realize that sda1 and sda2 sit on the same spinning-rust device.
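For completeness, a one-liner to do that (on modern blk-mq kernels the old noop scheduler is spelled "none"):
echo none | sudo tee /sys/block/sd{a..i}/queue/scheduler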
In theory I don't see why this setup wouldn't work, but I suppose I might be missing something. Are there any downsides or dangers in running this configuration?
UPD: This question is different from the linked question: its OP is concerned about hardware compatibility (not a focus here) and is not worried about performance (I am), and the answers there are mostly ~10 years old - a decade during which ZFS-on-Linux development has hardly stood still, I believe.
While one of those answers suggests a partitioning scheme similar to the one in my question, it only mentions that "There may be some performance issues due to having multiple pools on the same physical disks", without expanding on what those performance issues would be or how they could be mitigated.
Answers suggesting that I just ditch the uneven disks and buy matching ones are not applicable to me due to budget constraints - bringing both servers up to the same baseline (EXOS X16) would cost around $3500, which is well outside my upgrade budget.
UPD2: I have run fio tests in two environments:
- "whole" (prefixed "W: " in the report) - giving the uneven disks to ZFS as they are and accepting that 46 TB out of the 100 TB raw are unusable even for parity (the pool layout is sketched right after this list).
- "slices" (prefixed "S: ") - the partitioning scheme described above in the question.
The job is this:
[global]
numjobs=$ncpus # 48
runtime=30m
ramp_time=5m
time_based=1
directory=/testpool
direct=1
buffered=0
unified_rw_reporting=mixed
randrepeat=1
randseed=42
fallocate=native
fadvise_hint=1
size=32G
ioengine=libaio
iodepth=32
steadystate=iops:3%
steadystate_ramp_time=30s
steadystate_duration=1m
steadystate_check_interval=10s
group_reporting
stonewall
[randread_4k]
rw=randread
bs=4k
[randread_1m]
rw=randread
bs=1m
[read_4k]
rw=read
bs=4k
[read_1m]
rw=read
bs=1m
[randwrite_4k]
rw=randwrite
bs=4k
[randwrite_1m]
rw=randwrite
bs=1m
[write_4k]
rw=write
bs=4k
[write_1m]
rw=write
bs=1m
[randrw_4k]
rw=randrw
bs=4k
[randrw_1m]
rw=randrw
bs=1m
[rw_4k]
rw=rw
bs=4k
[rw_1m]
rw=rw
bs=1m
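The invocation was simply (job file name assumed):
fio --output=report.txt uneven-disks.fio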
The full report is too large to inline in the question, so I've posted it on pastebin. The important parts are below:
randread_4k: (groupid=0, jobs=48)
W: mixed: IOPS=340, BW=1430KiB/s (1464kB/s)(126MiB/90419msec)
S: mixed: IOPS=382, BW=1600KiB/s (1638kB/s)(141MiB/90304msec)
W: clat (usec): min=12, max=6645.0k, avg=4261163.60, stdev=680324.05
S: clat (usec): min=12, max=4974.1k, avg=3798524.13, stdev=537804.44
W: lat (msec): min=74, max=6820, avg=4401.94, stdev=686.47
S: lat (msec): min=57, max=5108, avg=3923.71, stdev=540.23
W: bw ( KiB/s): min= 600, max= 2231, per=95.28%, avg=1362.87, stdev= 6.45, samples=8636
S: bw ( KiB/s): min= 731, max= 2573, per=94.52%, avg=1512.83, stdev= 7.12, samples=8592
W: iops : min= 144, max= 557, avg=339.38, stdev= 1.62, samples=8636
S: iops : min= 149, max= 629, avg=359.56, stdev= 1.84, samples=8592
randread_1m: (groupid=1, jobs=48)
W: mixed: IOPS=144, BW=148MiB/s (155MB/s)(62.3GiB/430760msec)
S: mixed: IOPS=115, BW=132MiB/s (138MB/s)(11.7GiB/90813msec)
W: clat (usec): min=19, max=12389k, avg=10163959.96, stdev=1001629.54
S: clat (usec): min=17, max=15459k, avg=12095875.47, stdev=2649313.67
W: lat (msec): min=284, max=12746, avg=10495.85, stdev=1004.54
S: lat (msec): min=187, max=15939, avg=12513.38, stdev=2652.76
W: bw ( KiB/s): min=97488, max=240331, per=98.01%, avg=148655.25, stdev=1018.29, samples=40904
S: bw ( KiB/s): min=95560, max=215943, per=90.67%, avg=122503.29, stdev=887.62, samples=8313
W: iops : min= 48, max= 198, avg=98.08, stdev= 1.00, samples=40904
S: iops : min= 48, max= 185, avg=72.89, stdev= 0.88, samples=8313
read_4k: (groupid=2, jobs=48)
W: mixed: IOPS=116k, BW=454MiB/s (476MB/s)(40.7GiB/91840msec)
S: mixed: IOPS=56.0k, BW=219MiB/s (230MB/s)(38.8GiB/181331msec)
W: clat (usec): min=6, max=1918.8k, avg=12789.05, stdev=91865.16
S: clat (usec): min=5, max=1788.7k, avg=26548.54, stdev=106850.51
W: lat (usec): min=26, max=1918.8k, avg=13197.96, stdev=93308.57
S: lat (usec): min=17, max=1788.7k, avg=27400.32, stdev=108457.21
W: bw ( KiB/s): min=39179, max=1295017, per=100.00%, avg=632679.95, stdev=3638.22, samples=6454
S: bw ( KiB/s): min= 9910, max=1849259, per=100.00%, avg=251363.49, stdev=6653.90, samples=15377
W: iops : min= 9772, max=323738, avg=158156.16, stdev=909.58, samples=6454
S: iops : min= 2457, max=462297, avg=62823.73, stdev=1663.46, samples=15377
read_1m: (groupid=3, jobs=48)
W: mixed: IOPS=450, BW=464MiB/s (487MB/s)(50.7GiB/111861msec)
S: mixed: IOPS=188, BW=190MiB/s (199MB/s)(210GiB/1131607msec)
W: clat (usec): min=19, max=5368.4k, avg=3249563.28, stdev=507631.91
S: clat (usec): min=17, max=14976k, avg=7846837.00, stdev=2023099.53
W: lat (msec): min=83, max=5370, avg=3355.95, stdev=483.37
S: lat (msec): min=43, max=15076, avg=8100.89, stdev=2051.37
W: bw ( KiB/s): min=124100, max=1303072, per=100.00%, avg=655130.44, stdev=2886.11, samples=7512
S: bw ( KiB/s): min=92741, max=3303459, per=100.00%, avg=283658.91, stdev=5267.93, samples=73760
W: iops : min= 74, max= 1233, avg=596.00, stdev= 2.83, samples=7512
S: iops : min= 48, max= 3191, avg=230.54, stdev= 5.17, samples=73760
randwrite_4k: (groupid=4, jobs=48)
W: mixed: IOPS=5323, BW=20.8MiB/s (21.8MB/s)(36.6GiB/1800036msec)
S: mixed: IOPS=5128, BW=20.0MiB/s (21.0MB/s)(18.8GiB/960158msec)
W: clat (usec): min=13, max=1723.9k, avg=279521.48, stdev=148733.45
S: clat (usec): min=12, max=1329.7k, avg=290129.96, stdev=126039.73
W: lat (msec): min=23, max=1762, avg=288.52, stdev=152.77
S: lat (msec): min=11, max=1361, avg=299.47, stdev=129.52
W: bw ( KiB/s): min= 2808, max=55301, per=99.98%, avg=21292.76, stdev=185.22, samples=171552
S: bw ( KiB/s): min= 4152, max=45879, per=99.95%, avg=20511.67, stdev=150.06, samples=91584
W: iops : min= 666, max=13793, avg=5306.94, stdev=46.40, samples=171552
S: iops : min= 1002, max=11448, avg=5110.17, stdev=37.63, samples=91584
randwrite_1m: (groupid=5, jobs=48)
W: mixed: IOPS=450, BW=451MiB/s (473MB/s)(793GiB/1800067msec)
S: mixed: IOPS=278, BW=279MiB/s (293MB/s)(491GiB/1800135msec)
W: clat (usec): min=19, max=10503k, avg=3302412.28, stdev=1391453.26
S: clat (usec): min=14, max=12881k, avg=5339285.04, stdev=1341220.09
W: lat (msec): min=16, max=10607, avg=3409.04, stdev=1403.59
S: lat (msec): min=16, max=13013, avg=5511.73, stdev=1345.50
W: bw ( KiB/s): min=90819, max=9151429, per=100.00%, avg=531329.96, stdev=10206.46, samples=148975
S: bw ( KiB/s): min=96896, max=2133389, per=100.00%, avg=335613.71, stdev=3668.19, samples=145954
W: iops : min= 48, max= 8909, avg=474.34, stdev=10.02, samples=148975
S: iops : min= 48, max= 2044, avg=282.04, stdev= 3.59, samples=145954
write_4k: (groupid=6, jobs=48)
W: mixed: IOPS=114k, BW=445MiB/s (467MB/s)(39.8GiB/91578msec)
S: mixed: IOPS=51.8k, BW=202MiB/s (212MB/s)(31.7GiB/160445msec)
W: clat (usec): min=5, max=1488.4k, avg=13052.97, stdev=93701.20
S: clat (usec): min=6, max=1806.6k, avg=28719.75, stdev=80942.95
W: lat (usec): min=17, max=1488.5k, avg=13470.28, stdev=95173.31
S: lat (usec): min=19, max=1806.6k, avg=29641.23, stdev=82078.38
W: bw ( KiB/s): min=29984, max=1244892, per=100.00%, avg=631002.26, stdev=3642.36, samples=6327
S: bw ( KiB/s): min=11906, max=1431513, per=100.00%, avg=211374.05, stdev=5364.05, samples=15014
W: iops : min= 7474, max=311204, avg=157736.62, stdev=910.60, samples=6327
S: iops : min= 2952, max=357856, avg=52826.19, stdev=1341.00, samples=15014
write_1m: (groupid=7, jobs=48)
W: mixed: IOPS=432, BW=449MiB/s (471MB/s)(40.3GiB/91833msec)
S: mixed: IOPS=178, BW=192MiB/s (201MB/s)(21.0GiB/111827msec)
W: clat (usec): min=12, max=5519.8k, avg=3357115.69, stdev=546629.30
S: clat (usec): min=19, max=13552k, avg=8020957.79, stdev=2326691.56
W: lat (msec): min=2, max=5523, avg=3467.31, stdev=522.57
S: lat (msec): min=46, max=14460, avg=8289.78, stdev=2354.53
W: bw ( KiB/s): min=164865, max=1209364, per=100.00%, avg=655807.50, stdev=2713.68, samples=5914
S: bw ( KiB/s): min=94597, max=1234339, per=100.00%, avg=277133.84, stdev=4474.84, samples=7036
W: iops : min= 114, max= 1141, avg=596.84, stdev= 2.66, samples=5914
S: iops : min= 48, max= 1164, avg=224.24, stdev= 4.39, samples=7036
randrw_4k: (groupid=8, jobs=48)
W: mixed: IOPS=298, BW=1216KiB/s (1245kB/s)(321MiB/270504msec)
S: mixed: IOPS=322, BW=1359KiB/s (1392kB/s)(120MiB/90368msec)
W: clat (usec): min=12, max=10647k, avg=4938473.04, stdev=926543.76
S: clat (usec): min=10, max=6821.3k, avg=4496138.92, stdev=761258.32
W: lat (msec): min=78, max=10802, avg=5099.21, stdev=943.00
S: lat (msec): min=86, max=7001, avg=4644.53, stdev=764.19
W: bw ( KiB/s): min= 659, max= 3595, per=100.00%, avg=1335.91, stdev= 6.86, samples=43430
S: bw ( KiB/s): min= 653, max= 3754, per=100.00%, avg=1439.44, stdev= 7.67, samples=14594
W: iops : min= 96, max= 847, avg=269.52, stdev= 1.74, samples=43430
S: iops : min= 96, max= 874, avg=291.55, stdev= 1.92, samples=14594
randrw_1m: (groupid=9, jobs=48)
W: mixed: IOPS=151, BW=153MiB/s (160MB/s)(269GiB/1800742msec)
S: mixed: IOPS=145, BW=146MiB/s (153MB/s)(257GiB/1800696msec)
W: clat (usec): min=16, max=19552k, avg=9772423.15, stdev=2043748.17
S: clat (usec): min=20, max=19321k, avg=10226894.99, stdev=2742482.02
W: lat (msec): min=110, max=19891, avg=10088.45, stdev=2072.23
S: lat (msec): min=256, max=19741, avg=10558.10, stdev=2787.20
W: bw ( KiB/s): min=179600, max=1576780, per=100.00%, avg=286874.50, stdev=1699.33, samples=186234
S: bw ( KiB/s): min=187315, max=1611381, per=100.00%, avg=274198.58, stdev=1582.02, samples=186438
W: iops : min= 96, max= 1455, avg=187.48, stdev= 1.67, samples=186234
S: iops : min= 96, max= 1484, avg=173.92, stdev= 1.55, samples=186438
rw_4k: (groupid=10, jobs=48)
W: mixed: IOPS=25.0k, BW=97.7MiB/s (102MB/s)(172GiB/1800222msec)
S: mixed: IOPS=35.1k, BW=137MiB/s (144MB/s)(241GiB/1800293msec)
W: clat (usec): min=10, max=2285.6k, avg=59487.38, stdev=101914.05
S: clat (usec): min=7, max=1669.2k, avg=42385.56, stdev=85914.30
W: lat (usec): min=45, max=2287.1k, avg=61396.99, stdev=103541.65
S: lat (usec): min=159, max=1670.0k, avg=43743.14, stdev=87225.69
W: bw ( KiB/s): min= 5667, max=2212983, per=100.00%, avg=103103.24, stdev=1356.67, samples=328732
S: bw ( KiB/s): min=10227, max=1440658, per=100.00%, avg=142136.15, stdev=1318.62, samples=337133
W: iops : min= 1350, max=553216, avg=25741.71, stdev=339.17, samples=328732
S: iops : min= 2495, max=360137, avg=35498.44, stdev=329.67, samples=337133
rw_1m: (groupid=11, jobs=48)
W: mixed: IOPS=321, BW=323MiB/s (338MB/s)(567GiB/1800214msec)
S: mixed: IOPS=276, BW=277MiB/s (291MB/s)(487GiB/1800221msec)
W: clat (usec): min=14, max=8387.5k, avg=4618327.58, stdev=795144.01
S: clat (usec): min=11, max=12213k, avg=5375807.25, stdev=1426130.93
W: lat (msec): min=105, max=8590, avg=4767.47, stdev=807.78
S: lat (msec): min=125, max=12857, avg=5549.47, stdev=1435.73
W: bw ( KiB/s): min=194610, max=2282744, per=100.00%, avg=388709.51, stdev=2316.33, samples=292390
S: bw ( KiB/s): min=195372, max=2950029, per=100.00%, avg=386392.73, stdev=2803.01, samples=252796
W: iops : min= 96, max= 2166, avg=312.71, stdev= 2.31, samples=292390
S: iops : min= 96, max= 2823, avg=312.04, stdev= 2.78, samples=252796