I'm testing the floating-point code generation of a compiler. The code in question calculates a simple sum:
double sum = 0.0;
for (int i = 1; i <= n; i++) {
    sum += 1.0 / i;
}
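For reference, a self-contained version of what I'm measuring looks roughly like this (the timing harness below is only a sketch, not my exact test program; n is taken from the command line):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 1000000000;
    clock_t start = clock();

    double sum = 0.0;
    for (int i = 1; i <= n; i++) {
        sum += 1.0 / i;
    }

    clock_t end = clock();
    /* Print the result so the loop cannot be optimized away. */
    printf("sum = %.15f, time = %.2f s\n",
           sum, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}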
I get two versions of this code, a simple one:
L017: cvtsi2sd xmm1, eax
add eax, byte 1
movsd xmm2, qword [rel L007]
divsd xmm2, xmm1
addsd xmm0, xmm2
cmp edx, eax
jge short L017
and one with the loop unrolled:
L017: cvtsi2sd xmm1, ecx
movsd xmm2, qword [rel L008]
divsd xmm2, xmm1
addsd xmm2, xmm0
lea edx, [rcx+1H]
add ecx, byte 2
cvtsi2sd xmm0, edx
movsd xmm1, qword [rel L008]
divsd xmm1, xmm0
addsd xmm2, xmm1
movapd xmm0, xmm2
cmp eax, ecx
jnz short L017
The unrolled one runs much longer than the simple one: 4 s vs 1.6 s for n = 1000000000.
The problem is resolved by clearing the register before cvtsi2sd:
L017: pxor xmm1, xmm1
cvtsi2sd xmm1, ecx
movsd xmm2, qword [rel L008]
...
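To spell out what the fix amounts to (the intrinsics form below is only my illustration, not source the compiler produces): the conversion now starts from a freshly zeroed register instead of merging into whatever happened to be in xmm1.

#include <emmintrin.h>

/* Intrinsics equivalent of the patched sequence:
   pxor xmm1, xmm1 ; cvtsi2sd xmm1, ecx */
static inline double int_to_double_clean(int i) {
    __m128d zero = _mm_setzero_pd();         /* pxor: start from a known-zero register */
    __m128d conv = _mm_cvtsi32_sd(zero, i);  /* convert into the zeroed register */
    return _mm_cvtsd_f64(conv);              /* extract the low double */
}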
The new unrolled version runs as fast as the simple one. OK, this is a partial register stall, although I'd never expected it to be so bad. There are, however, two questions:
1. Why does it not affect the simple version? It looks like tight loops do not suffer from this stall.
2. What makes cvtsi2sd so much different from other scalar FP instructions, such as addsd? They also affect only part of an xmm register (in non-VEX mode, and this is the mode used).
I observed this on three processors:
- i7-6820HQ (Skylake)
- i7-8665U (Whiskey Lake)
- Xeon Gold 5120 (Skylake)