What is a short x86 instruction sequence to move the xmm0 register to eax and edx?
1 Answer
Which parts of xmm0 do you want?
    movd   eax, xmm0
    pextrd edx, xmm0, 1    ; SSE4.1
This gets the low 64 bits of xmm0 into edx:eax. If you need all 4 elements, consider storing to memory and reloading: store-forwarding to loads has more latency but better throughput than shuffles (fewer total uops), especially if you can use the reloaded values as memory source operands instead of just loading them with mov.
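As a rough illustration of the store-and-reload idea, here is a minimal C sketch using SSE2 intrinsics rather than hand-written asm. The helper name `extract_all_lanes` is made up for this example, and the compiler decides which store instruction it actually emits (typically a `movdqu` or `movdqa`), so treat it as a sketch of the approach rather than a guaranteed instruction sequence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: spill all four 32-bit elements of v to memory with
   one vector store, so scalar code can reload any of them afterwards
   (possibly as memory source operands of other instructions). */
static void extract_all_lanes(__m128i v, uint32_t out[4])
{
    /* Unaligned store, so `out` needs no particular alignment. */
    _mm_storeu_si128((__m128i *)out, v);
}
```

After the call, `out[0]` holds the lowest element (what `movd eax, xmm0` would give) and `out[3]` the highest.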
(But if you want a horizontal sum or something, normally do that with SIMD shuffles like pshufd / paddd twice, to reduce 4 elements to 2 and then to 1. That said, movd eax, xmm0 plus a movdqa [esp], xmm0 store and 3 scalar add eax, [esp + 4/8/12] instructions is actually not bad for total uops or latency in this case, unlike scalar FP where latency is higher and you want the result in an XMM reg anyway.)
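The shuffle-and-add reduction mentioned above could look something like this in C with SSE2 intrinsics. The function name `hsum_epi32` is just a label chosen for this sketch, and the specific shuffle constants are one reasonable choice among several; a compiler may also pick different shuffle instructions than the `pshufd` named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical helper: sum the four 32-bit elements of v (wrapping on overflow). */
static uint32_t hsum_epi32(__m128i v)
{
    /* pshufd: swap the two 64-bit halves, then paddd: 4 elements -> 2 partial sums */
    __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum64 = _mm_add_epi32(v, hi64);

    /* pshufd: swap adjacent 32-bit elements, then paddd: 2 partial sums -> 1 */
    __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);

    /* movd: the total is now in the low element */
    return (uint32_t)_mm_cvtsi128_si32(sum32);
}
```

Only the low element of the final vector is read here, although with these particular shuffles every element ends up holding the same total.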
In 64-bit code, movq rax, xmm0 / shld rdx, rax, 32 might be better than pextrd, and doesn't require SSE4.1.
A more normal mov rdx, rax / shr rdx, 32 might be more efficient than SHLD, even though it costs more uops on Intel CPUs. shld is slow on AMD CPUs, 8 uops on Zen. (https://uops.info/)
BMI2 rorx rdx, rax, 32 is a good way to copy-and-shift, and is efficient on all CPUs that support it. It of course leaves the high half of RDX probably non-zero, but that's fine.
Another option would be two movd/movq instructions, if you're not close to bottlenecked on throughput for the single port they compete for. On most CPUs they can't actually run in parallel, so movd/movq competing for a port does still cost latency for the 2nd one. On a modern CPU with mov-elimination (Zen or IvyBridge), mov rdx, rax with zero latency is better. But this does get your values in EAX and EDX zero-extended into RAX and RDX.
    movq rdx, xmm0
    movd eax, xmm0    ; or schedule this first if you can use EAX right away
    shr  rdx, 32
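For comparison, the same "one 64-bit move, then split with a shift" idea can be sketched in C with intrinsics. This assumes an x86-64 target, since `_mm_cvtsi128_si64` is not available in 32-bit mode, and the name `low64_to_pair` is invented for this example. The compiler is free to choose how it implements the split (mov + shr, rorx, or something else), so this only shows the data flow, not a particular instruction choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: copy the low 64 bits of v into two 32-bit halves.
   Assumes x86-64, where _mm_cvtsi128_si64 maps to movq r64, xmm. */
static void low64_to_pair(__m128i v, uint32_t *lo, uint32_t *hi)
{
    uint64_t q = (uint64_t)_mm_cvtsi128_si64(v);  /* movq rax, xmm0 */
    *lo = (uint32_t)q;                            /* element 0: the EAX value */
    *hi = (uint32_t)(q >> 32);                    /* element 1: the EDX value, like shr rdx, 32 */
}
```

Both outputs are plain 32-bit values, which corresponds to the zero-extended EAX and EDX results of the sequence above.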
See the x86 tag wiki for instruction-set references and other stuff.
See Agner Fog's excellent Optimizing Assembly guide for tips on which instructions to use.
@tgiphil: ok, the low 64 is what I was guessing. Is there some reason you haven't accepted this answer? Do you need a 32-bit SSE2 version using a vector shift or shuffle to get the second word down to element 0 for another `movd`? – Peter Cordes Jun 10 '16 at 09:10
@tgiphil: `pshufd` + `movd`, or any other convenient shuffle to bring the element you want to the low 64 or 32 bits. – Peter Cordes May 22 '19 at 12:00