The x86-64 System V ABI says the number of FP register args is passed in AL (strictly, an upper bound on the number of vector registers used), and that the upper bytes of RAX are allowed to contain garbage. (Same as any narrow integer or FP arg. But see also this Q&A about clang assuming zero- or sign-extension of narrow integer args to 32 bit. That only applies to function args proper, not AL.)
Use movzx eax, al to zero-extend AL into RAX. (Writing EAX implicitly zero-extends into RAX, unlike writing an 8 or 16-bit register.)
If there's another integer register you can clobber, use movzx ecx,al so mov-elimination on Intel CPUs can work, making it zero latency and not needing an execution port. Intel's mov-elimination fails when the src and dst are parts of the same register.
There's also zero benefit to using a 64-bit source for conversion to FP. cvtsi2sd xmm0, eax is one byte shorter (no REX prefix), and after zero-extension into EAX you know that the signed 2's complement interpretation of EAX and RAX that cvtsi2sd uses are identical.
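As a rough C++ model of the two points above (not code from the question; the function names are made up for illustration): masking to the low byte stands in for movzx eax, al, and the two conversion helpers stand in for cvtsi2sd with a 32-bit and a 64-bit source.

```cpp
#include <cstdint>

// Models `movzx eax, al`: keep the low 8 bits of RAX and zero everything above.
// (Hypothetical helper name; the real thing is a single instruction.)
constexpr std::uint64_t zero_extend_al(std::uint64_t rax) {
    return rax & 0xFFu;
}

// Models `cvtsi2sd xmm0, eax`: treat the low 32 bits as a signed integer.
constexpr double convert_from_eax(std::uint64_t rax) {
    return static_cast<double>(
        static_cast<std::int32_t>(static_cast<std::uint32_t>(rax)));
}

// Models `cvtsi2sd xmm0, rax`: treat all 64 bits as a signed integer.
constexpr double convert_from_rax(std::uint64_t rax) {
    return static_cast<double>(static_cast<std::int64_t>(rax));
}
```

With garbage in the upper bytes, e.g. RAX = 0xDEADBEEF12345602, zero_extend_al gives 2 and both conversions of that result give 2.0, so the shorter 32-bit form loses nothing. Converting the unmasked 64-bit value directly would instead give a huge negative number.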
On your Mac, clang/LLVM chose to leave garbage in the upper bytes of RAX. LLVM's optimizer is less careful about avoiding false dependencies than gcc's, so it will sometimes write partial registers. (Sometimes even when it doesn't save code size, but in this case it does).
From your results, we can conclude that you used clang on Mac, and gcc or ICC on Ubuntu.
It's easier to look at the compiler-generated asm from a simplified example (new and std::cout::operator<< result in a lot of code).
extern "C" double foo(int, ...);
int main() {
    foo(123, 1.0, 2.0);
}
Compiles to this asm on the Godbolt compiler explorer, with gcc and clang -O3:
### clang7.0 -O3
.section .rodata
.LCPI0_0:
.quad 4607182418800017408 # double 1
.LCPI0_1:
.quad 4611686018427387904 # double 2
.text
main: # @main
push rax # align the stack by 16 before a call
movsd xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
movsd xmm1, qword ptr [rip + .LCPI0_1] # xmm1 = mem[0],zero
mov edi, 123
mov al, 2 # leave the rest of RAX unmodified
call foo
xor eax, eax # return 0
pop rcx
ret
GCC emits basically the same thing, but with
## gcc8.2 -O3
...
mov eax, 2 # AL = RAX = 2 FP args in regs
mov edi, 123
call foo
...
mov eax,2 instead of mov al,2 avoids a false dependency on the old value of RAX, on CPUs that don't rename AL separately from the rest of RAX. (Only Intel P6-family and Sandybridge do that, not IvyBridge and later. And not any AMD CPUs, or Pentium 4, or Silvermont.)
See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" for more about how IvB and later are different from Core2 / Nehalem.
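For context on who consumes AL: when the callee is written in C or C++, you never read AL yourself. The compiler-generated prologue of a variadic function checks it to decide whether the XMM argument registers need to be spilled for va_arg. A minimal sketch of such a callee follows. This definition of foo is hypothetical (the example above only declares it), and treating the first argument as the number of doubles that follow is an assumption made for illustration.

```cpp
#include <cstdarg>

// Hypothetical definition of foo: sums `count` double arguments.
// The meaning of the first parameter is an assumption, not from the question.
extern "C" double foo(int count, ...) {
    va_list args;
    va_start(args, count);      // prologue has already saved the arg registers
    double sum = 0.0;
    for (int i = 0; i < count; ++i) {
        sum += va_arg(args, double);  // FP varargs are always passed as double
    }
    va_end(args);
    return sum;
}
```

Calling foo(2, 1.0, 2.0) returns 3.0. A hand-written asm callee has to do the equivalent work itself, which is where zero-extending AL before using it as a count comes in.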