Integer describing number of floating point arguments in xmm registers not passed to rax

Question

I have got a function which is declared as follows:

double foo(int ** buffer, int size, ...);

The function is part of the C++ implementation of a program.

I use the last parameter to pass multiple double variables to the function.

The problem is that on Mac I do not receive a valid number in the rax register; on Ubuntu, on the other hand, it works as expected.

A simple example:

CPP

#include <iostream>
extern "C" double foo(int ** buffer, int buffer_size, ...);

int main() {
    int* buffer [] = {new int(2), new int(3), new int(4)};
    std::cout<< foo(buffer, 2, 1.0, 2.0, 3.0) << '\n';
    std::cout<< foo(buffer, 3, 2.0, 3.0) << '\n';
    std::cout<< foo(buffer, 3) << '\n';
}

Assembly, NASM2

global foo

section .text

foo:
    cvtsi2sd xmm0, rax
    ret

Mac output:

1.40468e+14
1.40736e+14
1.40736e+14

Ubuntu output:

3
2
0

The program is 64-bit

Note that this number is stored in `al`, not `rax`. Try `movzx eax, al` beforehand to clear out the high bits. - fuz, Jan 13 '19 at 01:20
@MateuszStompór: That is correct. The only part of RAX that matters is AL; the other bits can be anything. As fuz suggested, use `movzx eax, al` to make all of RAX equal to the value in AL. On Ubuntu the upper bits of RAX happened to be 0 already; on your Mac they weren't. Try `foo:` `movzx eax, al` `cvtsi2sd xmm0, rax` `ret`. Note that when the destination of an instruction is a 32-bit register, the CPU automatically zero-extends the result across the entire 64-bit register. - Michael Petch, Jan 13 '19 at 01:45

score 1 · Accepted Answer · answered Jan 13 '19 at 15:36

The x86-64 System V ABI says the FP register arg count is passed in AL, and that the upper bytes of RAX are allowed to contain garbage. (Same as any narrow integer or FP arg. But see also this Q&A about clang assuming zero- or sign-extension of narrow integer args to 32 bit. This only applies to function args proper, not al.)

Use movzx eax, al to zero-extend AL into RAX. (Writing EAX implicitly zero-extends into RAX, unlike writing an 8 or 16-bit register.)
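To make the effect of the zero-extension concrete, here is a minimal C++ sketch that models the two instructions as plain integer arithmetic. The function names and the `garbage_rax` value are hypothetical and chosen only for illustration; the real register contents depend on what the caller left behind. It assumes C++11 or later.

```cpp
#include <cstdint>

// Model of `movzx eax, al`: keep the low 8 bits and zero everything above them.
constexpr std::uint64_t movzx_eax_al(std::uint64_t rax) {
    return rax & 0xFFu;
}

// Model of `cvtsi2sd xmm0, rax`: read RAX as a signed 64-bit integer
// and convert it to double.
constexpr double cvtsi2sd_rax(std::uint64_t rax) {
    return static_cast<double>(static_cast<std::int64_t>(rax));
}

// Hypothetical register contents: garbage in the upper bytes, count 3 in AL.
constexpr std::uint64_t garbage_rax = 0x00007FFF12345603ULL;

// Without the zero-extension the conversion sees the garbage as well.
static_assert(cvtsi2sd_rax(garbage_rax) != 3.0,
              "converting all of RAX includes the upper bytes");

// With the zero-extension only the argument count survives.
static_assert(movzx_eax_al(garbage_rax) == 3, "only AL is kept");
static_assert(cvtsi2sd_rax(movzx_eax_al(garbage_rax)) == 3.0,
              "the conversion now yields the count");
```

A value with upper bytes like the one assumed here converts to a number on the order of 1e14, the same magnitude as the Mac output in the question.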

If there's another integer register you can clobber, use movzx ecx,al so mov-elimination on Intel CPUs can work, making it zero latency and not needing an execution port. Intel's mov-elimination fails when the src and dst are parts of the same register.

There's also zero benefit to using a 64-bit source for conversion to FP. cvtsi2sd xmm0, eax is one byte shorter (no REX prefix), and after zero-extension into EAX you know that the signed 2's complement interpretation of EAX and RAX that cvtsi2sd uses are identical.
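The claim that the 32-bit and 64-bit signed interpretations agree after zero-extension can be checked exhaustively, since AL has only 256 possible values. This is a sketch in portable C++ (assuming C++14 or later for the `constexpr` loop); `signed_views_agree` is a hypothetical helper name, not part of any API.

```cpp
#include <cstdint>

// After `movzx eax, al`, EAX holds a value in 0..255 and RAX holds the
// same value zero-extended. Returns true if reading EAX as a signed 32-bit
// integer and RAX as a signed 64-bit integer gives the same number for
// every possible AL.
constexpr bool signed_views_agree() {
    for (std::uint32_t al = 0; al <= 0xFF; ++al) {
        const std::uint32_t eax = al;   // result of movzx eax, al
        const std::uint64_t rax = eax;  // writing EAX zero-extends into RAX
        const std::int64_t from_eax = static_cast<std::int32_t>(eax);
        const std::int64_t from_rax = static_cast<std::int64_t>(rax);
        if (from_eax != from_rax) {
            return false;
        }
    }
    return true;
}

static_assert(signed_views_agree(),
              "EAX and RAX give the same signed value after movzx eax, al");
```

Because bit 31 is always clear for values 0..255, neither interpretation can be negative, which is why the shorter `cvtsi2sd xmm0, eax` is safe here.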

On your Mac, clang/LLVM chose to leave garbage in the upper bytes of RAX. LLVM's optimizer is less careful about avoiding false dependencies than gcc's, so it will sometimes write partial registers. (Sometimes even when it doesn't save code size, but in this case it does).

From your results, we can conclude that you used clang on Mac, and gcc or ICC on Ubuntu.

It's easier to look at the compiler-generated asm from a simplified example (new and std::cout::operator<< result in a lot of code).

extern "C" double foo(int, ...);
int main() {
    foo(123, 1.0, 2.0);
}

Compiles to this asm on the Godbolt compiler explorer, with gcc and clang -O3:

### clang7.0 -O3
.section .rodata
.LCPI0_0:
    .quad   4607182418800017408     # double 1
.LCPI0_1:
    .quad   4611686018427387904     # double 2

.text
main:                                   # @main
    push    rax                  # align the stack by 16 before a call
    movsd   xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
    movsd   xmm1, qword ptr [rip + .LCPI0_1] # xmm1 = mem[0],zero
    mov     edi, 123
    mov     al, 2                # leave the rest of RAX unmodified
    call    foo
    xor     eax, eax             # return 0
    pop     rcx
    ret

GCC emits basically the same thing, but with mov eax, 2 in place of mov al, 2:

 ## gcc8.2 -O3
    ...
    mov     eax, 2               # AL = RAX = 2   FP args in regs
    mov     edi, 123
    call    foo
    ...

mov eax,2 instead of mov al,2 avoids a false dependency on the old value of RAX, on CPUs that don't rename AL separately from the rest of RAX. (Only Intel P6-family and Sandybridge rename AL separately; IvyBridge and later do not, and neither do any AMD CPUs, Pentium 4, or Silvermont.)
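The architectural difference between the two caller-side instructions can be sketched in C++ as follows. This models only the register contents, not the microarchitectural dependency tracking; the function names and the `stale_rax` value are hypothetical. It assumes C++11 or later.

```cpp
#include <cstdint>

// Model of `mov al, imm8`: only the low byte changes, so the result still
// depends on the old value of RAX (bits 8..63 are carried over).
constexpr std::uint64_t mov_al(std::uint64_t old_rax, std::uint8_t imm) {
    return (old_rax & ~std::uint64_t{0xFF}) | imm;
}

// Model of `mov eax, imm32`: the whole of RAX is replaced, so the old
// value is irrelevant.
constexpr std::uint64_t mov_eax(std::uint64_t /*old_rax*/, std::uint32_t imm) {
    return imm;
}

// Hypothetical leftover value in RAX before the call is set up.
constexpr std::uint64_t stale_rax = 0x00007FFF5FBFF000ULL;

// clang-style setup: AL is right, the upper bytes are whatever was there.
static_assert(mov_al(stale_rax, 2) == 0x00007FFF5FBFF002ULL,
              "upper bytes of RAX are preserved");
static_assert((mov_al(stale_rax, 2) & 0xFF) == 2, "AL holds the count");

// gcc-style setup: all of RAX equals the count.
static_assert(mov_eax(stale_rax, 2) == 2, "RAX equals the count");
```

Both forms satisfy the ABI, since only AL is specified. A callee that reads all of RAX only appears to work with the second one.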

See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent for more about how IvB and later are different from Core2 / Nehalem.
