
I'm currently building a new iteration of my DIY router - the new system has a pair of 10 gig ports. I'm running Ubuntu 23.04 on an R68S U1.

Initially, during speed tests with iperf2 against a system I know can handle 10 gig line speeds, I was only getting 5 gig speeds. One of Asus's guides for their equipment suggested testing with iperf and an 800k window:

geek@router-t1:~$ iperf -s -w 800k
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  416 KByte (WARNING: requested  781 KByte)
------------------------------------------------------------
[  1] local 10.0.0.1 port 5001 connected with 10.0.0.2 port 52191
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-60.0461 sec  35.0 GBytes  5.01 Gbits/sec

Interestingly, this indicated that my actual TCP window size was smaller than requested - precisely what Asus warned about.

This never happens with Windows clients, only Linux ones... which is curious, but probably a separate issue.

Adding the following lines to my sysctl configuration - as suggested here -

net.core.wmem_max=4194304
net.core.rmem_max=12582912
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 87380 4194304

roughly doubled my benchmark results:

geek@router-t1:~$ iperf -s -w 800k -B 10.0.0.1
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 1.53 MByte (WARNING: requested  781 KByte)
------------------------------------------------------------
[  1] local 10.0.0.1 port 5001 connected with 10.0.0.2 port 57480
[ ID] Interval       Transfer     Bandwidth
[  1] 0.0000-60.0443 sec  69.2 GBytes  9.90 Gbits/sec

I understand these settings adjust the maximum socket receive/send buffer sizes and the buffer sizes given to newly created sockets - IBM has a pretty good explanation of what they do here.
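
For reference, this is roughly how I checked the old limits and reloaded the new ones without rebooting (assuming the lines live in /etc/sysctl.conf):

geek@router-t1:~$ sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem
geek@router-t1:~$ sudo sysctl -p

The first command prints the current values; sysctl -p re-applies the file.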

How would I work out the appropriate size for a given system, and why does this have such a dramatic effect?

Journeyman Geek

1 Answer


For maximum throughput, TCP window sizes must be large enough to let the sender "keep the pipe full": it has to be able to keep sending full-sized segments as fast as the link can bear, for long enough to get an ACK back for the first packet. That ACK time is the network path's round-trip time (RTT), which is what ping(8) measures - but note that the RTT ping measures on an idle network is likely to be lower than the RTT your TCP sender sees when keeping a 10 Gbps link full.
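
To get a baseline RTT figure for the calculation below, a plain ping from one iperf endpoint to the other is a reasonable start (10.0.0.2 is just the peer address from your iperf output):

$ ping -c 10 10.0.0.2

The avg figure in the closing "rtt min/avg/max/mdev" line is your idle RTT; as noted, expect the RTT under load to be somewhat higher.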

We usually call this calculation the "Bandwidth x Delay Product" (BDP): multiply your bandwidth in bits per second by your RTT in seconds. The seconds cancel out, leaving you with the number of bits that may need to be "in flight" ("in the network", "in the pipe") at any given time to keep the pipe full.

So using 10 Gbps and the 0.3 ms idle ping times I typically see on wired Ethernet LANs, that would be 10,000,000,000 bits/sec x 0.0003 seconds = 3,000,000 bits = 375,000 bytes, or about 366 KiB.
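
If you'd rather let the shell do that arithmetic, a one-liner like this works (plug in your own bandwidth in bits/sec and RTT in seconds):

$ awk 'BEGIN { print 10e9 * 0.0003 / 8 " bytes" }'
375000 bytes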

So to a first approximation, your original 416 KiB TCP window should have allowed full-speed performance. That it didn't suggests my 0.3 ms RTT estimate was low for what TCP actually sees over your 10 Gbps link when it's full.

If I revise my RTT estimate up to 1 ms, I get a BDP of 10,000,000 bits = 1.25 MB, or about 1.2 MiB - much more in line with the 1.53 MiB window your Linux TCP stack autoscaled to in your successful run, once your sysctl changes gave it room to scale.
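
Rather than estimating, you can also watch what the stack is actually doing: while an iperf run is active, ss from iproute2 will show TCP's own RTT measurement and how far the buffers have autoscaled (10.0.0.2 again being your iperf peer):

$ ss -tmi dst 10.0.0.2

Look at the rtt field (TCP's smoothed RTT estimate) and the skmem/rcv_space figures for the receive buffer.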

Giving each TCP connection a few MiB of RAM for a receive window is fine for personal machines on a LAN, but it can be a problem for busy servers that handle hundreds or thousands of simultaneous TCP connections - you can run out of RAM pretty quickly. That's part of the reason most modern TCP implementations have window autoscaling algorithms built in and enabled by default, capped by a sysctl.
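
To put a rough number on that, take a hypothetical server holding 10,000 connections, each autoscaled up to the 4 MiB cap from your sysctls; in the worst case that's

$ awk 'BEGIN { print 10000 * 4 / 1024 " GiB" }'
39.0625 GiB

of RAM committed to receive buffers alone.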

Spiff