
Just last weekend, I set up a new (clean install) backup server for my main FreeNAS machine, and started a manual complete pool backup between them. Both machines are enterprise hardware and run fast; the link between them is a direct 10G optical LAN (Chelsio), both machines have plenty of fast NVMe ZIL/cache and 128GB of fast DDR4, with Xeon v4 CPUs and Supermicro baseboards. The pool I'm replicating/copying is 14TB of actual data, deduped, with 35TB of referenced data (2.5x dedup). The pools are striped mirrors (4 sets of 3-way mirrors with enterprise 6+TB 7200 disks), not RaidZ, so they don't even have parity to slow them. Nothing else is running on the servers or their connection except the SSH sessions used to run the transfers. The zfs send command includes the args needed to send the data deduped (although by oversight, not compressed).

Command on sender:

zfs send -vvDRLe mypool@latest_snapshot | nc -N BACKUP_IP BACKUP_PORT

Command on recipient:

nc -l PORT | zfs receive -vvFsd my_pool
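
For reference, a rough gloss of those flags from the zfs man pages: on the send side, -D builds a deduplicated stream, -R sends a full replication stream (descendant datasets, snapshots and properties), -L allows large blocks and -e uses embedded-block records where possible; on the receive side, -F forces a rollback of the target, -s makes the receive resumable if interrupted, and -d keeps the stream's dataset names under the target pool. The -vv is just verbosity.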

I was expecting one of two things to happen - either it sends 14TB and finishes, or it sends 35TB but the ~21TB of it that's duplicate (deduped) data goes really fast, so only 14-and-a-bit TB actually needs to be sent. Instead it seems intent on sending all 35TB in full, and incredibly slowly at that - did I do something wrong or misunderstand?

What I don't get is that even with serialising the snapshots/datasets, the backup server's disks have been running at almost 100% according to gstat, and have been doing so for 4 full days now. The data is arriving correctly (I can mount the snaps/datasets that have completed). But sending the entire pool looks like it'll take about 7 days all-in, with almost 100% disk activity the whole time.

Transferring 14TB or even 35TB on a 10G link between 2 fast servers - whatever status info is displayed on console - just shouldn't take that long, unless it's incredibly inefficient, which seems unlikely.
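
To put rough numbers on that: 35TB at the ~1 GB/sec a 10G link can carry is around 35,000 seconds, i.e. roughly 10 hours, and even at a disk-limited 500 MB/s it's under a day. 35TB spread over 7 days works out to an average of only about 58 MB/sec.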

Both systems can read/write even the HDD spinners at almost 500 MB/s, ZFS optimises disk access, and it doesn't need to re-dedup the data since it's sent already deduped.

Why is it taking so long? Why isn't it just sending the raw blocks in the pool one time only?

Replying to some points from comments:

  1. netcat (nc): netcat (nc) provides a bare, transparent, unencrypted TCP transport/tunnel to pipe data between two systems (among other uses) - a bit like ssh/VPN but with no slowdown or repackaging beyond bare TCP on the wire. As far as zfs send/zfs receive are concerned they are in direct communication, and beyond a tiny latency the netcat link should run at the maximum speed that send/receive can handle. (A quick way to benchmark the raw nc path on its own is sketched after this list.)
  2. Mirror disk speed: A mirror writes at the speed of its slowest disk, but ZFS stripes the data across the 4 mirror vdevs on both systems. With the source pool 55% full and the dest pool empty, and assuming the CPUs can keep up, zfs should be able to read from all 12 source disks simultaneously and write at the combined speed of the 4 destination vdevs, and the writes should be pretty much all sequential - there's no other IO activity. I figure the slowest disk in any mirror can do sequential writes at >= 125MB/s, which is well below the rate of a modern enterprise 7200 HDD, and the backup pool can be filled sequentially rather than with random IO. That's where I get an expected sustained replication rate of 500MB/s or more.
  3. Dedup table/RAM adequacy: The dedup table is about 40GB in RAM (from bytes per entry x total blocks in the source pool, per zdb - see the check sketched after this list). I've set a sysctl on both systems to reserve 85GB of RAM for the dedup table and other metadata, leaving about 35GB for cached data, before any use of L2ARC (if it's even used by send/receive). So dedup and metadata shouldn't be evicted from RAM on either machine.
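
For anyone wanting to reproduce the checks behind points 1 and 3 (commands sketched from memory - substitute your own pool name, IP and port): the raw nc path can be benchmarked with no disks involved by running this on the receiver:

nc -l PORT | dd of=/dev/null bs=1M

and this on the sender:

dd if=/dev/zero bs=1M count=10240 | nc -N BACKUP_IP BACKUP_PORT

dd prints bytes/sec when the stream ends, which gives the bare TCP rate the netcat link can sustain. For the dedup table, the size can be read straight out of zdb rather than estimated:

zdb -DD mypool

which prints the DDT histogram along with entry counts and the on-disk / in-core size per entry.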

Speed and progress update:

  • After 5 days runtime, I have some updated progress stats. It's sending data at about 58 MB/sec average. Not completely disastrous, but still, it underpins the question above. I'd expect a rate about 10x that, since the disk sets can read from up to 12 HDDs at a time (almost 2 GB/sec) and write at up to 4 vdevs' worth of throughput (about 500 MB/sec). It doesn't have to dedup or re-dedup the data (AFAIK), it's running on 3.5 GHz 4 + 8 core Xeon v4's with tons of RAM on both systems, and a LAN that can do 1GB/sec.
Stilez

1 Answer


From what you mentioned about compression, I’m assuming all the storage sizes / speeds you described were in uncompressed sizes. If not, that could make transfer times longer by a factor equal to your average compression ratio (but not if disk access is the bottleneck, since the decompression / compression happens after reading from disk in zfs send and before writing to disk in zfs receive).

Based on the information you’ve collected so far, it sounds like you’re bottlenecked on the disk bandwidth, not on the network connection. You mentioned that each system can read/write at ~500MB/s, so your best-case transfer time for 35TB is around 20 hours (about 2.5x slower than just transferring through the 10Gb/s network). But, based on your mirroring setup, I’m surprised that reads and writes would get the same throughput — are you sure about that? On the send system you only need to read from one disk (so you can parallelize reads across three disks), but on the receive system you have to write to all three disks (so you’re bound by the throughput of the slowest disk at any given time). To test the write throughput on the receive side, you could run dd if=/dev/urandom of=some_file_in_pool bs=1M count=1024 conv=fdatasync.

Since you said the receiving disks are at 100% busy, my guess is that it’s not reaching 500MB/s write bandwidth. This could either be because the real write limit is lower than that (the dd command above should confirm), or it could be that the system is having to do metadata reads during the receive, and that’s breaking your nice large-IO-size write workload by adding a bunch of disk seeks into the mix. You should be able to investigate the second hypothesis more deeply using DTrace to see what the io provider thinks your read/write sizes are.
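
For example, a one-liner along these lines (adapted from the standard io provider examples, untested on this particular setup) would show the distribution of IO sizes on the receive side, split into reads and writes:

dtrace -n 'io:::start { @[args[0]->b_flags & B_READ ? "read" : "write"] = quantize(args[0]->b_bcount); }'

A pile of small reads mixed in with the large writes would point towards the metadata-seek theory.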

Dan