
What is the most efficient way to spread bits from memory evenly over multiple vector registers? All data must end up in the least-significant bits of the target registers.

For example, how can 2 bytes from memory be spread over 8 words (across two vector registers)?

      V0.4S              |  V1.4S
S[3]: [data bits 6 + 7]  |  [data bits 14 + 15]
S[2]: [data bits 4 + 5]  |  [data bits 12 + 13]
S[1]: [data bits 2 + 3]  |  [data bits 10 + 11]
S[0]: [data bits 0 + 1]  |  [data bits 8 + 9]

The 8-, 16- and 32-bit split-ups are easy with LD1 and widening instructions. A 3-bit split-up may be messy.
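
For reference, the easy byte-to-word case can be sketched with NEON intrinsics roughly as follows (assuming arm_neon.h on AArch64; the helper name is just for illustration):

#include <arm_neon.h>
#include <stdint.h>

// Widen 4 input bytes into four 32-bit lanes via LD1 + widening moves.
static inline uint32x4_t widen_bytes_to_words(const uint8_t *src)
{
    uint8x8_t  b = vld1_u8(src);            // LD1: loads 8 bytes, low 4 used here
    uint16x8_t h = vmovl_u8(b);             // UXTL: bytes -> halfwords
    return vmovl_u16(vget_low_u16(h));      // UXTL: halfwords -> words
}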

Pascal de Kloe
  • Probably broadcast, shift with a per-element variable count, and mask. Similar to some strategies for bitmap to vector mask on x86 ([is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) links various ideas). With more than 1 bit per element, you wouldn't be using a mask+compare, but ARM has much better variable-count shifts than x86. (Although x86 can shift variable amounts within 32-bit elements with AVX2 `vpsrlvd`, but not narrower.) – Peter Cordes Jul 24 '23 at 20:45
  • That is exactly what I feared, Peter. Too many operations negate the wins from parallel processing. UBFX is King. – Pascal de Kloe Jul 24 '23 at 22:44
  • Load + broadcast + shift + AND is only 4 operations and gives you 4 results that can be stored with a single store. If you ultimately want the results in scalar GP registers, then for sure just UBFX, or `and` with a shifted source operand, otherwise SIMD seems like a clear win, especially if you have some SIMD processing to do with the result. – Peter Cordes Jul 24 '23 at 23:16
  • `ld1r` is load+broadcast in one instruction. I might try a 32-bit `ld1r`, four 4S shifts by appropriate preloaded vectors, and four ANDs. This processes 4 bytes of input, making 16 words of output, with short dependency chains. – Nate Eldredge Jul 25 '23 at 03:23
  • Oh wait, `ld4r` is four load+broadcast. Expensive, but then you can heat up all the pipelines by processing 16 bytes of input into 16 registers. – Nate Eldredge Jul 25 '23 at 03:27
  • Problem is that you can't shift vector elements individually, @NateEldredge. So the best we've got for the example is 2 `LDRB`, 8 `UBFX` and 8 `VMOV`. – Pascal de Kloe Jul 25 '23 at 11:00
  • @PascaldeKloe: Yes you can; that's what [`USHL` (register)](https://developer.arm.com/documentation/dui0801/l/A64-SIMD-Vector-Instructions/USHL--vector---A64-?lang=en) does (and its friend `SSHL`). Despite the name it also does right shifts if you specify a negative shift count. So your two-byte scatter can be done with one `LD1R`, two `USHL` and two vector `AND`. Plus a couple extra instructions to initialize registers with the shift counts and mask - but assuming this will be done in a loop, the initialization only needs to be done once. – Nate Eldredge Jul 26 '23 at 01:40
  • Oh good, ARM has broadcast-load as a single instruction (like x86 with AVX), so only 3 operations per vector of results. – Peter Cordes Jul 26 '23 at 02:07

1 Answer


Vector USHL/SSHL allow for per-element shift counts, where negative counts produce a right shift. So follow it with a mask and you are in business.

Start by initializing some registers with our needed constants. This only needs to be done once.

V8.4S = { 0, -2, -4, -6 }
V9.4S = { -8, -10, -12, -14 }
V10.4S = {3, 3, 3, 3}

and then

LD1R   { V2.8H }, [X0]   // load 2 bytes, replicate across all elements
                         // note we only really care about half of them
USHL   V0.4S, V2.4S, V8.4S
USHL   V1.4S, V2.4S, V9.4S
AND    V0.16B, V0.16B, V10.16B
AND    V1.16B, V1.16B, V10.16B
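
For anyone who prefers intrinsics, a minimal C sketch of the same sequence (assuming arm_neon.h and a GCC/Clang-style AArch64 target; the function name spread2 is made up for illustration) could look like:

#include <arm_neon.h>
#include <stdint.h>

// Spread 2 input bytes over eight 32-bit lanes, 2 bits per lane:
// lo lane 0 = data bits 0+1, ..., hi lane 3 = data bits 14+15.
static inline void spread2(const uint16_t *src, uint32x4_t *lo, uint32x4_t *hi)
{
    static const int32_t sh_lo[4] = { 0, -2, -4, -6 };      // V8
    static const int32_t sh_hi[4] = { -8, -10, -12, -14 };  // V9
    const uint32x4_t mask = vdupq_n_u32(3);                 // V10

    // LD1R: load 16 bits, replicate across all halfword lanes
    uint32x4_t v = vreinterpretq_u32_u16(vld1q_dup_u16(src));
    // USHL with negative counts shifts right; AND keeps the low 2 bits
    *lo = vandq_u32(vshlq_u32(v, vld1q_s32(sh_lo)), mask);
    *hi = vandq_u32(vshlq_u32(v, vld1q_s32(sh_hi)), mask);
}

In a loop the compiler hoists the constant loads, so the per-iteration work matches the assembly above: one load, two shifts and two ANDs.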

Alternatively, to save a constant, you could do

V8.4S = {0, -2, -4, -6}
V10.4S = {3, 3, 3, 3}

LD2R   { V2.16B, V3.16B }, [X0]
USHL   V0.4S, V2.4S, V8.4S
USHL   V1.4S, V3.4S, V8.4S
AND    V0.16B, V0.16B, V10.16B
AND    V1.16B, V1.16B, V10.16B

where each of the two bytes is replicated across its own register.
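
The same variant can be sketched in intrinsics too (same includes as the previous sketch; note it is an assumption that your arm_neon.h is recent enough to provide vld2q_dup_u8, which maps to LD2R):

// Byte 0 is replicated into rep.val[0], byte 1 into rep.val[1] (LD2R).
static inline void spread2_ld2r(const uint8_t *src, uint32x4_t *lo, uint32x4_t *hi)
{
    static const int32_t sh[4] = { 0, -2, -4, -6 };
    const int32x4_t shifts = vld1q_s32(sh);
    const uint32x4_t mask = vdupq_n_u32(3);

    uint8x16x2_t rep = vld2q_dup_u8(src);
    *lo = vandq_u32(vshlq_u32(vreinterpretq_u32_u8(rep.val[0]), shifts), mask);
    *hi = vandq_u32(vshlq_u32(vreinterpretq_u32_u8(rep.val[1]), shifts), mask);
}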

You can load four bytes at a time by starting with LD1R { V2.4S }, [X0] (together with four different shift-count vectors), or with LD4R { V2.16B, ..., V5.16B }, [X0] following the second approach. You can even load 16 bytes at a time with LD4R { V2.4S, ..., V5.4S }, [X0] and then repeat the first version four times.
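
Here is a rough intrinsics sketch of the four-bytes-at-a-time variant (function name again just for illustration):

// Spread 4 input bytes over sixteen 32-bit lanes, 2 bits per lane.
// Lane k of out[i] receives data bits (8*i + 2*k) and (8*i + 2*k + 1).
static inline void spread4(const uint32_t *src, uint32x4_t out[4])
{
    static const int32_t sh[16] = {
          0,  -2,  -4,  -6,
         -8, -10, -12, -14,
        -16, -18, -20, -22,
        -24, -26, -28, -30,
    };
    const uint32x4_t mask = vdupq_n_u32(3);
    uint32x4_t v = vld1q_dup_u32(src);   // LD1R { V2.4S }, [X0]
    for (int i = 0; i < 4; i++)
        out[i] = vandq_u32(vshlq_u32(v, vld1q_s32(sh + 4 * i)), mask);
}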

Nate Eldredge