
I am redesigning my homelab servers from scratch, and I want to give ZFS a try after some experimentation I've done in a VM.

I know there is a limitation - disks in a vdev must all have the same size (otherwise each disk contributes only as much capacity as the smallest one). That's a problem for me - I have an assorted bunch of variously sized HDDs, and unfortunately I have no budget for upgrades right now. So my goal is to squeeze the maximum out of what I already have without compromising ZFS.

Disclaimer: I know ZFS is an enterprise file system, and that in the enterprise world it's cheaper to buy a bag of identical disks rather than pay engineers to do what I am going to do.

So, to do this anyway, I have come up with the following workaround idea, which I want to validate.

My initial setup:

- 4 disks of 16 TB each (sd[abcd])
- 3 disks of 8 TB each (sd[efg])
- 2 disks of 6 TB each (sd[hi])

I carve a partition on every disk as large as the smallest non-zero free space - accounting for partition alignment etc., of course - and repeat this until no disk has any free space left.
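
For example, on one of the 16 TB disks the partitioning could look roughly like this (an sgdisk sketch; exact sector counts and alignment are left to the tool, and the smaller disks simply get fewer slices):

# sketch for a 16 TB disk; the 8 TB disks get only part1+part2, the 6 TB disks only part1
sgdisk --zap-all /dev/sda      # start from a clean GPT
sgdisk -n 1:0:+6T /dev/sda     # part1: 6 TB slice, common to all nine disks
sgdisk -n 2:0:+2T /dev/sda     # part2: 2 TB slice, on the 16 TB and 8 TB disks
sgdisk -n 3:0:0   /dev/sda     # part3: the remainder (~8 TB), 16 TB disks only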

Final picture:

- 4 disks of 16 TB each (sd[abcd]):
    part1: 6 TB
    part2: 2 TB
    part3: 8 TB
- 3 disks of 8 TB each (sd[efg]):
    part1: 6 TB
    part2: 2 TB
- 2 disks of 6 TB each (sd[hi]):
    part1: 6 TB

     +-------------------------+------------+----------------------------------+
sda: | sda1: 6 TB              | sda2: 2 TB | sda3: 8 TB                       |
     +-------------------------+------------+----------------------------------+
sdb: | sdb1: 6 TB              | sdb2: 2 TB | sdb3: 8 TB                       |
     +-------------------------+------------+----------------------------------+
sdc: | sdc1: 6 TB              | sdc2: 2 TB | sdc3: 8 TB                       |
     +-------------------------+------------+----------------------------------+
sdd: | sdd1: 6 TB              | sdd2: 2 TB | sdd3: 8 TB                       |
     +-------------------------+------------+----------------------------------+
sde: | sde1: 6 TB              | sde2: 2 TB |
     +-------------------------+------------+
sdf: | sdf1: 6 TB              | sdf2: 2 TB |
     +-------------------------+------------+
sdg: | sdg1: 6 TB              | sdg2: 2 TB |
     +-------------------------+------------+
sdh: | sdh1: 6 TB              |
     +-------------------------+
sdi: | sdi1: 6 TB              |
     +-------------------------+

Now I have created equally sized partitions on physically different disks that I can use as building blocks for multiple RAIDZ vdevs:

#         |----- 16 TB -----|  |--- 8 TB ---|  |- 6 TB -|
zpool create tank \
  raidz2  sda1 sdb1 sdc1 sdd1  sde1 sdf1 sdg1  sdh1 sdi1 \
  raidz2  sda2 sdb2 sdc2 sdd2  sde2 sdf2 sdg2 \
  raidz2  sda3 sdb3 sdc3 sdd3
# raidz2-0: part1 (6 TB) across all nine disks
# raidz2-1: part2 (2 TB) across the 16 TB and 8 TB disks
# raidz2-2: part3 (8 TB) across the 16 TB disks only

I think the following should be true:

  • every RAIDZ vdev uses only partitions that sit on physically different drives - so if one disk fails, at most one device in each RAIDZ vdev will fail
  • when one drive fails (say sda), it will cause all three RAIDZ vdevs to degrade, but replacing the disk and repartitioning it the same way will let ZFS recover transparently (see the sketch after this list)
  • since ZFS is said to prefer writing to the vdev with the most free space, and the first RAIDZ will have the most of it, the other vdevs shouldn't see much use until the first 6 TB tier fills up, so there shouldn't be IOPS bottlenecks. I hope so.
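
As a sketch of the second point, replacing a failed sda would mean cloning the partition table from a surviving 16 TB disk onto the new drive and then replacing each affected member (the sgdisk flow here is an assumption of mine, not something I have tested yet):

# hypothetical replacement flow for a failed sda
sgdisk --backup=table.gpt /dev/sdb        # save the layout of a surviving 16 TB disk
sgdisk --load-backup=table.gpt /dev/sda   # apply it to the replacement drive
sgdisk --randomize-guids /dev/sda         # give the clone its own GUIDs
zpool replace tank sda1                   # one replace per degraded raidz2 vdev
zpool replace tank sda2
zpool replace tank sda3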

The final touch is to set the I/O scheduler to "noop", though I am not sure whether ZFS is intelligent enough to schedule across vdevs properly and realize that sda1 and sda2 live on the same spinning-rust device.
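
A sketch of how that could be applied (assuming a blk-mq kernel, where the no-op scheduler is called "none"; disk names as in the example above):

# set the no-op scheduler on every ZFS member disk (as root)
for disk in sda sdb sdc sdd sde sdf sdg sdh sdi; do
    echo none > /sys/block/"$disk"/queue/scheduler
done
cat /sys/block/sda/queue/scheduler   # the active scheduler is shown in [brackets]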

In theory, I don't see why this setup wouldn't work, but I suppose I might be missing something else. Are there any downsides or dangers in running this configuration?


UPD: This question is different from the linked question: the OP of the linked question is concerned about hardware compatibility (which is not a focus of this question) and is not worried about performance (I am), and the answers are mostly ~10 years old, during which time, I believe, ZFS-on-Linux development has not been standing still.

While one of the answers suggests a partitioning approach similar to the one in my question, it only mentions that "There may be some performance issues due to having multiple pools on the same physical disks", without expanding on what those performance issues would be or how they might be mitigated.

Answers suggesting to just ditch the uneven disks and use matching ones are not applicable to me due to budget constraints - bringing both servers up to the same baseline (EXOS X16) would cost me around $3500, which is well outside my upgrade budget.


UPD2: I have run fio tests for two environments:

  • "whole" (prefixed "W: " in the report) - just giving uneven disks to ZFS as is and agreeing with 46 TB out of 100 TB to be unusable even for parity.
  • "slices" (prefixed "S: ") - the partitioning scheme mentioned above in the question.

The job is this:

[global]
numjobs=$ncpus # 48
runtime=30m
ramp_time=5m
time_based=1
directory=/testpool

direct=1
buffered=0

unified_rw_reporting=mixed
randrepeat=1
randseed=42
fallocate=native
fadvise_hint=1

size=32G
ioengine=libaio
iodepth=32

steadystate=iops:3%
steadystate_ramp_time=30s
steadystate_duration=1m
steadystate_check_interval=10s

group_reporting
stonewall

[randread_4k]
rw=randread
bs=4k

[randread_1m]
rw=randread
bs=1m

[read_4k]
rw=read
bs=4k

[read_1m]
rw=read
bs=1m

[randwrite_4k]
rw=randwrite
bs=4k

[randwrite_1m]
rw=randwrite
bs=1m

[write_4k]
rw=read
bs=4k

[write_1m]
rw=read
bs=1m

[randrw_4k]
rw=randrw
bs=4k

[randrw_1m]
rw=randrw
bs=1m

[rw_4k]
rw=rw
bs=4k

[rw_1m]
rw=rw
bs=1m

The report is too large to inline in the question, so I've posted it on pastebin.

Important parts are here:

randread_4k: (groupid=0, jobs=48)
W:   mixed: IOPS=340, BW=1430KiB/s (1464kB/s)(126MiB/90419msec)
S:   mixed: IOPS=382, BW=1600KiB/s (1638kB/s)(141MiB/90304msec)
W:     clat (usec): min=12, max=6645.0k, avg=4261163.60, stdev=680324.05
S:     clat (usec): min=12, max=4974.1k, avg=3798524.13, stdev=537804.44
W:      lat (msec): min=74, max=6820, avg=4401.94, stdev=686.47
S:      lat (msec): min=57, max=5108, avg=3923.71, stdev=540.23
W:    bw (  KiB/s): min=  600, max= 2231, per=95.28%, avg=1362.87, stdev= 6.45, samples=8636
S:    bw (  KiB/s): min=  731, max= 2573, per=94.52%, avg=1512.83, stdev= 7.12, samples=8592
W:    iops        : min=  144, max=  557, avg=339.38, stdev= 1.62, samples=8636
S:    iops        : min=  149, max=  629, avg=359.56, stdev= 1.84, samples=8592

randread_1m: (groupid=1, jobs=48)
W:   mixed: IOPS=144, BW=148MiB/s (155MB/s)(62.3GiB/430760msec)
S:   mixed: IOPS=115, BW=132MiB/s (138MB/s)(11.7GiB/90813msec)
W:     clat (usec): min=19, max=12389k, avg=10163959.96, stdev=1001629.54
S:     clat (usec): min=17, max=15459k, avg=12095875.47, stdev=2649313.67
W:      lat (msec): min=284, max=12746, avg=10495.85, stdev=1004.54
S:      lat (msec): min=187, max=15939, avg=12513.38, stdev=2652.76
W:    bw (  KiB/s): min=97488, max=240331, per=98.01%, avg=148655.25, stdev=1018.29, samples=40904
S:    bw (  KiB/s): min=95560, max=215943, per=90.67%, avg=122503.29, stdev=887.62, samples=8313
W:    iops        : min=   48, max=  198, avg=98.08, stdev= 1.00, samples=40904
S:    iops        : min=   48, max=  185, avg=72.89, stdev= 0.88, samples=8313

read_4k: (groupid=2, jobs=48)
W:   mixed: IOPS=116k, BW=454MiB/s (476MB/s)(40.7GiB/91840msec)
S:   mixed: IOPS=56.0k, BW=219MiB/s (230MB/s)(38.8GiB/181331msec)
W:     clat (usec): min=6, max=1918.8k, avg=12789.05, stdev=91865.16
S:     clat (usec): min=5, max=1788.7k, avg=26548.54, stdev=106850.51
W:      lat (usec): min=26, max=1918.8k, avg=13197.96, stdev=93308.57
S:      lat (usec): min=17, max=1788.7k, avg=27400.32, stdev=108457.21
W:    bw (  KiB/s): min=39179, max=1295017, per=100.00%, avg=632679.95, stdev=3638.22, samples=6454
S:    bw (  KiB/s): min= 9910, max=1849259, per=100.00%, avg=251363.49, stdev=6653.90, samples=15377
W:    iops        : min= 9772, max=323738, avg=158156.16, stdev=909.58, samples=6454
S:    iops        : min= 2457, max=462297, avg=62823.73, stdev=1663.46, samples=15377

read_1m: (groupid=3, jobs=48)
W:   mixed: IOPS=450, BW=464MiB/s (487MB/s)(50.7GiB/111861msec)
S:   mixed: IOPS=188, BW=190MiB/s (199MB/s)(210GiB/1131607msec)
W:     clat (usec): min=19, max=5368.4k, avg=3249563.28, stdev=507631.91
S:     clat (usec): min=17, max=14976k, avg=7846837.00, stdev=2023099.53
W:      lat (msec): min=83, max=5370, avg=3355.95, stdev=483.37
S:      lat (msec): min=43, max=15076, avg=8100.89, stdev=2051.37
W:    bw (  KiB/s): min=124100, max=1303072, per=100.00%, avg=655130.44, stdev=2886.11, samples=7512
S:    bw (  KiB/s): min=92741, max=3303459, per=100.00%, avg=283658.91, stdev=5267.93, samples=73760
W:    iops        : min=   74, max= 1233, avg=596.00, stdev= 2.83, samples=7512
S:    iops        : min=   48, max= 3191, avg=230.54, stdev= 5.17, samples=73760

randwrite_4k: (groupid=4, jobs=48)
W:   mixed: IOPS=5323, BW=20.8MiB/s (21.8MB/s)(36.6GiB/1800036msec)
S:   mixed: IOPS=5128, BW=20.0MiB/s (21.0MB/s)(18.8GiB/960158msec)
W:     clat (usec): min=13, max=1723.9k, avg=279521.48, stdev=148733.45
S:     clat (usec): min=12, max=1329.7k, avg=290129.96, stdev=126039.73
W:      lat (msec): min=23, max=1762, avg=288.52, stdev=152.77
S:      lat (msec): min=11, max=1361, avg=299.47, stdev=129.52
W:    bw (  KiB/s): min= 2808, max=55301, per=99.98%, avg=21292.76, stdev=185.22, samples=171552
S:    bw (  KiB/s): min= 4152, max=45879, per=99.95%, avg=20511.67, stdev=150.06, samples=91584
W:    iops        : min=  666, max=13793, avg=5306.94, stdev=46.40, samples=171552
S:    iops        : min= 1002, max=11448, avg=5110.17, stdev=37.63, samples=91584

randwrite_1m: (groupid=5, jobs=48)
W:   mixed: IOPS=450, BW=451MiB/s (473MB/s)(793GiB/1800067msec)
S:   mixed: IOPS=278, BW=279MiB/s (293MB/s)(491GiB/1800135msec)
W:     clat (usec): min=19, max=10503k, avg=3302412.28, stdev=1391453.26
S:     clat (usec): min=14, max=12881k, avg=5339285.04, stdev=1341220.09
W:      lat (msec): min=16, max=10607, avg=3409.04, stdev=1403.59
S:      lat (msec): min=16, max=13013, avg=5511.73, stdev=1345.50
W:    bw (  KiB/s): min=90819, max=9151429, per=100.00%, avg=531329.96, stdev=10206.46, samples=148975
S:    bw (  KiB/s): min=96896, max=2133389, per=100.00%, avg=335613.71, stdev=3668.19, samples=145954
W:    iops        : min=   48, max= 8909, avg=474.34, stdev=10.02, samples=148975
S:    iops        : min=   48, max= 2044, avg=282.04, stdev= 3.59, samples=145954

write_4k: (groupid=6, jobs=48)
W:   mixed: IOPS=114k, BW=445MiB/s (467MB/s)(39.8GiB/91578msec)
S:   mixed: IOPS=51.8k, BW=202MiB/s (212MB/s)(31.7GiB/160445msec)
W:     clat (usec): min=5, max=1488.4k, avg=13052.97, stdev=93701.20
S:     clat (usec): min=6, max=1806.6k, avg=28719.75, stdev=80942.95
W:      lat (usec): min=17, max=1488.5k, avg=13470.28, stdev=95173.31
S:      lat (usec): min=19, max=1806.6k, avg=29641.23, stdev=82078.38
W:    bw (  KiB/s): min=29984, max=1244892, per=100.00%, avg=631002.26, stdev=3642.36, samples=6327
S:    bw (  KiB/s): min=11906, max=1431513, per=100.00%, avg=211374.05, stdev=5364.05, samples=15014
W:    iops        : min= 7474, max=311204, avg=157736.62, stdev=910.60, samples=6327
S:    iops        : min= 2952, max=357856, avg=52826.19, stdev=1341.00, samples=15014

write_1m: (groupid=7, jobs=48)
W:   mixed: IOPS=432, BW=449MiB/s (471MB/s)(40.3GiB/91833msec)
S:   mixed: IOPS=178, BW=192MiB/s (201MB/s)(21.0GiB/111827msec)
W:     clat (usec): min=12, max=5519.8k, avg=3357115.69, stdev=546629.30
S:     clat (usec): min=19, max=13552k, avg=8020957.79, stdev=2326691.56
W:      lat (msec): min=2, max=5523, avg=3467.31, stdev=522.57
S:      lat (msec): min=46, max=14460, avg=8289.78, stdev=2354.53
W:    bw (  KiB/s): min=164865, max=1209364, per=100.00%, avg=655807.50, stdev=2713.68, samples=5914
S:    bw (  KiB/s): min=94597, max=1234339, per=100.00%, avg=277133.84, stdev=4474.84, samples=7036
W:    iops        : min=  114, max= 1141, avg=596.84, stdev= 2.66, samples=5914
S:    iops        : min=   48, max= 1164, avg=224.24, stdev= 4.39, samples=7036

randrw_4k: (groupid=8, jobs=48)
W:   mixed: IOPS=298, BW=1216KiB/s (1245kB/s)(321MiB/270504msec)
S:   mixed: IOPS=322, BW=1359KiB/s (1392kB/s)(120MiB/90368msec)
W:     clat (usec): min=12, max=10647k, avg=4938473.04, stdev=926543.76
S:     clat (usec): min=10, max=6821.3k, avg=4496138.92, stdev=761258.32
W:      lat (msec): min=78, max=10802, avg=5099.21, stdev=943.00
S:      lat (msec): min=86, max=7001, avg=4644.53, stdev=764.19
W:    bw (  KiB/s): min=  659, max= 3595, per=100.00%, avg=1335.91, stdev= 6.86, samples=43430
S:    bw (  KiB/s): min=  653, max= 3754, per=100.00%, avg=1439.44, stdev= 7.67, samples=14594
W:    iops        : min=   96, max=  847, avg=269.52, stdev= 1.74, samples=43430
S:    iops        : min=   96, max=  874, avg=291.55, stdev= 1.92, samples=14594

randrw_1m: (groupid=9, jobs=48)
W:   mixed: IOPS=151, BW=153MiB/s (160MB/s)(269GiB/1800742msec)
S:   mixed: IOPS=145, BW=146MiB/s (153MB/s)(257GiB/1800696msec)
W:     clat (usec): min=16, max=19552k, avg=9772423.15, stdev=2043748.17
S:     clat (usec): min=20, max=19321k, avg=10226894.99, stdev=2742482.02
W:      lat (msec): min=110, max=19891, avg=10088.45, stdev=2072.23
S:      lat (msec): min=256, max=19741, avg=10558.10, stdev=2787.20
W:    bw (  KiB/s): min=179600, max=1576780, per=100.00%, avg=286874.50, stdev=1699.33, samples=186234
S:    bw (  KiB/s): min=187315, max=1611381, per=100.00%, avg=274198.58, stdev=1582.02, samples=186438
W:    iops        : min=   96, max= 1455, avg=187.48, stdev= 1.67, samples=186234
S:    iops        : min=   96, max= 1484, avg=173.92, stdev= 1.55, samples=186438

rw_4k: (groupid=10, jobs=48)
W:   mixed: IOPS=25.0k, BW=97.7MiB/s (102MB/s)(172GiB/1800222msec)
S:   mixed: IOPS=35.1k, BW=137MiB/s (144MB/s)(241GiB/1800293msec)
W:     clat (usec): min=10, max=2285.6k, avg=59487.38, stdev=101914.05
S:     clat (usec): min=7, max=1669.2k, avg=42385.56, stdev=85914.30
W:      lat (usec): min=45, max=2287.1k, avg=61396.99, stdev=103541.65
S:      lat (usec): min=159, max=1670.0k, avg=43743.14, stdev=87225.69
W:    bw (  KiB/s): min= 5667, max=2212983, per=100.00%, avg=103103.24, stdev=1356.67, samples=328732
S:    bw (  KiB/s): min=10227, max=1440658, per=100.00%, avg=142136.15, stdev=1318.62, samples=337133
W:    iops        : min= 1350, max=553216, avg=25741.71, stdev=339.17, samples=328732
S:    iops        : min= 2495, max=360137, avg=35498.44, stdev=329.67, samples=337133

rw_1m: (groupid=11, jobs=48)
W:   mixed: IOPS=321, BW=323MiB/s (338MB/s)(567GiB/1800214msec)
S:   mixed: IOPS=276, BW=277MiB/s (291MB/s)(487GiB/1800221msec)
W:     clat (usec): min=14, max=8387.5k, avg=4618327.58, stdev=795144.01
S:     clat (usec): min=11, max=12213k, avg=5375807.25, stdev=1426130.93
W:      lat (msec): min=105, max=8590, avg=4767.47, stdev=807.78
S:      lat (msec): min=125, max=12857, avg=5549.47, stdev=1435.73
W:    bw (  KiB/s): min=194610, max=2282744, per=100.00%, avg=388709.51, stdev=2316.33, samples=292390
S:    bw (  KiB/s): min=195372, max=2950029, per=100.00%, avg=386392.73, stdev=2803.01, samples=252796
W:    iops        : min=   96, max= 2166, avg=312.71, stdev= 2.31, samples=292390
S:    iops        : min=   96, max= 2823, avg=312.04, stdev= 2.78, samples=252796


1 Answer


To address the safety of the setup described above:

I started a fio test of random writes (with subsequent reads) to provide load, with a total of 48 concurrent processes working on a 32 GB file each.

Then I physically ejected the cradles of the HDDs on phy0 and phy1. At first I got complaints about unrecoverable errors (I didn't save the message), then the disks became "REMOVED":

  pool: testpool
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
config:
    NAME                                       STATE     READ WRITE CKSUM
    testpool                                   DEGRADED     0     0     0
      raidz2-0                                 DEGRADED     0     0     0
        pci-0000:82:00.0-sas-phy0-lun-0-part1  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy1-lun-0-part1  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy5-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy6-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy3-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy4-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy7-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy2-lun-0-part1  ONLINE       0     0     0
        pci-0000:00:1f.2-ata-1.0-part1         ONLINE       0     0     0
      raidz2-1                                 DEGRADED     0     0     0
        pci-0000:82:00.0-sas-phy0-lun-0-part2  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy1-lun-0-part2  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy5-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy6-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy3-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy4-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy7-lun-0-part2  ONLINE       0     0     0
      raidz2-2                                 DEGRADED     0     0     0
        pci-0000:82:00.0-sas-phy0-lun-0-part3  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy1-lun-0-part3  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy5-lun-0-part3  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy6-lun-0-part3  ONLINE       0     0     0

errors: No known data errors

fio keeps running in the background without errors.

Now I attach these disks to another server, create a zpool over these partitions (just putting everything into a single top-level vdev) and destroy it - I do this to wipe any ZFS labels.
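Roughly like this (the pool and partition names here are placeholders):

# on the other server: a throwaway pool spanning the pulled partitions, then destroy it
zpool create -f scratch sdx1 sdx2 sdx3
zpool destroy scratch
# running 'zpool labelclear -f' against each partition would be another way to wipe the labels

Then I plug the disks back into the original server: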

  pool: testpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Aug 17 21:30:39 2024
        63.2G / 82.7G scanned, 2.88G / 55.2G issued at 22.4M/s
        192M resilvered, 5.22% done, 00:39:57 to go
config:
    NAME                                       STATE     READ WRITE CKSUM
    testpool                                   DEGRADED     0     0     0
      raidz2-0                                 ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy0-lun-0-part1  ONLINE       0     0     0  (resilvering)
        pci-0000:82:00.0-sas-phy1-lun-0-part1  ONLINE       0     0     0  (resilvering)
        pci-0000:82:00.0-sas-phy5-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy6-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy3-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy4-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy7-lun-0-part1  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy2-lun-0-part1  ONLINE       0     0     0
        pci-0000:00:1f.2-ata-1.0-part1         ONLINE       0     0     0
      raidz2-1                                 ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy0-lun-0-part2  ONLINE       0     0     0  (resilvering)
        pci-0000:82:00.0-sas-phy1-lun-0-part2  ONLINE       0     0     0  (awaiting resilver)
        pci-0000:82:00.0-sas-phy5-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy6-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy3-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy4-lun-0-part2  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy7-lun-0-part2  ONLINE       0     0     0
      raidz2-2                                 DEGRADED     0     0     0
        pci-0000:82:00.0-sas-phy0-lun-0-part3  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy1-lun-0-part3  REMOVED      0     0     0
        pci-0000:82:00.0-sas-phy5-lun-0-part3  ONLINE       0     0     0
        pci-0000:82:00.0-sas-phy6-lun-0-part3  ONLINE       0     0     0

errors: No known data errors

After a few more minutes all devices are (resilvering). After half an hour or so, resilvering completes and the pool is no longer DEGRADED.

So it seems that degrading three raidz2 vdevs at once poses no risk to ZFS, even when the disks are yanked abruptly under heavy random-write load. This is a safe configuration.


To address the speed concerns:

For random 4k reads and writes, the sliced approach delivers results on par with the whole-disk layout.

However, the situation worsens once a lot of sequential blocks are read or written: random 1M performs somewhat worse, while sequential 4k and 1M perform significantly worse, about 2-2.5x slower.

The scheduler used was mq-deadline; it is said to perform better than none with ZFS on spinning-rust disks, though I haven't done that comparison myself yet.


Unfortunately, this performance degradation is severe enough to make the layout in question impractical. Perhaps it could have been avoided if ZFS had native support for unevenly sized disks, but for now I have to consider other options.
