I have a function that I'm writing in assembly, and I want to be sure which approach will give me the best throughput.

I have a 64-bit value in RAX. I need to extract the most significant byte and perform some operations on it, and I'm wondering what the best way to do that is.

shr  rax, 56    ; This will get me the most significant byte in al.

However, is this more efficient than...

rol  rax, 8
and  rax, r12   ; I already have the value 255 in r12

The reason I'm asking is that on some architectures, the time a shift takes depends on the shift count. If I recall, on the 680x0 chips it was 6 + 2n cycles, where n was the shift count. I don't think this is true on x86 architectures, but I'm not sure, so some enlightenment would be appreciated. (I understand about latency.)

Or is there an easy way to swap bits 0-31 of RAX with bits 32-63 rather than rotating or shifting? Something like what `swap` did on the 680x0.

The Welder
  • There is `bswap` but that still leaves the other bits and depending on model it's slower than `shr` anyway. – Jester Dec 06 '15 at 13:07
  • Yeah I know about bswap, but that's really for endian exchange – The Welder Dec 06 '15 at 13:59
  • Sure, but it does bring the highest byte down to the lowest, which is what you wanted. – Jester Dec 06 '15 at 14:03
  • Shift timing that varies with the shift count only occurs on old systems or microcontrollers. Modern high-performance CPUs all have a [barrel shifter](https://en.wikipedia.org/wiki/Barrel_shifter), so shifts can be done in a single cycle – phuclv Dec 06 '15 at 16:51
  • The shift distance usually doesn't matter, but [rotations have smaller throughput than shifts](https://stackoverflow.com/q/12770667/581205), except for rotation by one. This difference is only relevant if you saturate the execution units; e.g., you can do two shifts per cycle, but only a single rotation by amount greater than one. – maaartinus Dec 30 '17 at 04:15

1 Answer

According to the instruction tables at http://agner.org/optimize/, rol with an immediate count is a single-uop/m-op instruction with 1 cycle latency on Intel (Pentium M to Haswell) and AMD (K8 to Steamroller). Throughput ranges from one per clock to three per clock.

Rotate with a variable count (rol r, cl) is slower on Intel, same speed on AMD.
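For reference, here is a minimal C sketch of the two variants from the question, plus the half-swap it asks about. It is written in C rather than assembly only so that it can be compiled and checked. The function names are made up for illustration, and the assumption that an optimizing compiler turns `rotl64` into a single `rol` holds for typical GCC/Clang builds but is not guaranteed.

```c
#include <stdint.h>

/* Top byte via one logical right shift (the `shr rax, 56` variant). */
static inline uint8_t top_byte_shift(uint64_t x) {
    return (uint8_t)(x >> 56);
}

/* Rotate left by n bits. Masking the counts keeps n == 0 well defined.
 * Assumption: the compiler recognizes this idiom and emits `rol`. */
static inline uint64_t rotl64(uint64_t x, unsigned n) {
    return (x << (n & 63)) | (x >> (-n & 63));
}

/* Top byte via rotate-and-mask (the `rol rax, 8` / `and rax, 255` variant). */
static inline uint8_t top_byte_rotate(uint64_t x) {
    return (uint8_t)(rotl64(x, 8) & 0xFF);
}

/* Swap bits 0-31 with bits 32-63: a rotate by 32, i.e. `rol rax, 32`. */
static inline uint64_t swap_halves(uint64_t x) {
    return rotl64(x, 32);
}
```

Both extraction variants give the same byte, so the choice between them comes down to instruction count and port pressure: the shift version needs one instruction, the rotate version needs two.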

Obviously, read Agner Fog's guides if you're asking this kind of question, since there's more to high performance than single instructions taken alone.


If you're doing this on multiple data items, you could use vector shuffles on 16B (xmm registers with SSE) or 32B (ymm registers with AVX) chunks at once. pshufd xmm, xmm, imm will let you pick any input dword for each output dword. (So you can broadcast and stuff, as well as just plain shuffle.)
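As a rough sketch of the shuffle idea, the C function below uses the SSE2 intrinsic `_mm_shuffle_epi32` (which maps to `pshufd`) to swap the 32-bit halves of two 64-bit values at once. The function name `swap_halves_x2` and the array-based interface are assumptions made for illustration. It requires an x86 target with SSE2, which is baseline on x86-64.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_shuffle_epi32 -> pshufd */
#include <stdint.h>

/* Swap the 32-bit halves of each of two 64-bit values with one pshufd.
 * Hypothetical helper, not taken from the original answer. */
static inline void swap_halves_x2(const uint64_t in[2], uint64_t out[2]) {
    /* Unaligned load of 16 bytes (two qwords, i.e. four dwords). */
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Output dwords 0,1,2,3 take input dwords 1,0,3,2 respectively. */
    __m128i s = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));

    _mm_storeu_si128((__m128i *)out, s);
}
```

This only pays off when the data is already in (or headed to) vector registers. For a single value sitting in RAX, moving it to an xmm register and back would likely cost more than a scalar rotate.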

Peter Cordes
  • Wow awesome, that's the kind of resources I'm looking for. Thanks. Shuffling seems to be the way to go for this except I was rather hoping to avoid having to use xmm registers... but still. – The Welder Dec 06 '15 at 15:58