
I'm testing the floating-point code generation of a compiler. The code in question computes a simple sum:

    double sum = 0.0;
    for (int i = 1; i <= n; i++) {
        sum += 1.0 / i;
    }
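
A complete test program is not shown here; a minimal self-contained harness along these lines (only the inner loop above is the code under test, the timing scaffolding is just a sketch) reproduces the measurement:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const int n = 1000000000;      /* same n as in the timings below */
        double sum = 0.0;

        clock_t t0 = clock();
        for (int i = 1; i <= n; i++) {
            sum += 1.0 / i;
        }
        clock_t t1 = clock();

        /* print the result so the loop cannot be optimized away */
        printf("sum = %.15f  time = %.2f s\n",
               sum, (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }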

I get two versions of this code, a simple one:

L017:   cvtsi2sd xmm1, eax
        add     eax, byte 1
        movsd   xmm2, qword [rel L007]
        divsd   xmm2, xmm1
        addsd   xmm0, xmm2
        cmp     edx, eax 
        jge     short L017

and one with the loop unrolled:

L017:   cvtsi2sd xmm1, ecx
        movsd   xmm2, qword [rel L008]
        divsd   xmm2, xmm1
        addsd   xmm2, xmm0
        lea     edx, [rcx+1H]
        add     ecx, byte 2
        cvtsi2sd xmm0, edx
        movsd   xmm1, qword [rel L008]
        divsd   xmm1, xmm0
        addsd   xmm2, xmm1
        movapd  xmm0, xmm2
        cmp     eax, ecx
        jnz     short L017

The unrolled one runs much longer than the simple one: 4 s vs. 1.6 s with n = 1000000000. The problem is resolved by clearing the register before cvtsi2sd:

L017:   pxor xmm1, xmm1
        cvtsi2sd xmm1, ecx
        movsd   xmm2, qword [rel L008]
...

The new unrolled version runs as fast as the simple one. OK, so this is a partial register stall, although I never expected it to be this bad. There are, however, two questions:

  1. Why does it not affect the simple version? It looks like tight loops do not suffer from this stall.

  2. What makes cvtsi2sd so different from other scalar FP instructions, such as addsd? They also affect only part of an xmm register (in non-VEX encoding, which is the encoding used here).

I observed this on three processors:

  • i7-6820HQ (Skylake)
  • i7-8665U (Whiskey Lake)
  • Xeon Gold 5120 (Skylake)
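
For reference, the same dependency-breaking zeroing can also be expressed from C with SSE2 intrinsics. This is only an illustrative sketch of the pattern (the helper below is mine, not something the compiler under test emits): _mm_cvtsi32_sd merges the converted value into its first operand, so passing an explicitly zeroed vector typically compiles to the pxor + cvtsi2sd sequence shown above.

    #include <emmintrin.h>   /* SSE2 */

    /* Convert i to double through an explicitly zeroed register.
       Passing _mm_setzero_pd() as the merge operand usually compiles to
       pxor + cvtsi2sd, so the conversion no longer depends on the
       previous contents of the destination register. */
    static inline double int_to_double(int i) {
        __m128d v = _mm_cvtsi32_sd(_mm_setzero_pd(), i);
        return _mm_cvtsd_f64(v);
    }
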
– Pasha
    Wow, that "unrolled" version is *incredibly* pathological. I don't think a compiler wrote that? Basically, you have 2 long-latency operations: `cvtsi2sd` and `divsd`. In the first version, due to the dependency chain (only `xmm0` and `eax` feeding into `xmm0`, and `eax` is modified in every loop iteration too), `divsd` of one iteration and `cvtsi2sd` of the next iteration can execute in parallel. In the "unrolled" code, your dependencies are just all over the place, and that prevents meaningful instruction level parallelism. – EOF Dec 20 '21 at 16:52
  • "partial register *stall*" has a specific meaning (described in [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502) for Intel P6-family), and this isn't it. This is a *false dependency* caused by `cvtsi2sd` only writing a partial register; Intel's short-sighted design for SSE to make it faster on Pentium-III. – Peter Cordes Dec 20 '21 at 19:53
  • What makes `cvtsi2sd` different from other scalar FP operations is that it had the *chance* to be write-only, but Intel didn't take it, instead giving it a false dependency. `addsd` already reads its destination so it's a true dependency. Seems like a duplicate of [Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?](https://stackoverflow.com/q/60688348) – Peter Cordes Dec 20 '21 at 19:58
