I am trying to broadcast a 32-bit integer into a 256-bit YMM register in C, efficiently, using Intel intrinsics. The twist is that I want each bit of the 32-bit integer to be translated into either a 0x00 or a 0xFF byte in the register, depending on whether that bit was 0 or 1 in my integer.
For instance, if I had a 4-bit integer with the bits 0011 and a 16-bit register, I would want the register to end up with the content 0000 0000 1111 1111 (each input bit expanded to a group of four bits here, standing in for a byte).
If I use the usual Intel broadcast intrinsics, I instead end up with a register of the form 0011 0011 0011 0011.
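To pin down the mapping I am after, here is a minimal scalar sketch in plain C. The helper name expand_bits_scalar and the 32-byte output buffer are made up for illustration; this only describes the desired result, not an efficient way to get it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper (name chosen for illustration only):
   writes one byte per bit of x, so that out[i] is 0xFF when bit i of x
   is set and 0x00 otherwise. */
static void expand_bits_scalar(uint32_t x, uint8_t out[32])
{
    for (size_t i = 0; i < 32; ++i) {
        /* (x >> i) & 1u is 0 or 1. Subtracting it from 0 in unsigned
           arithmetic gives 0 or all ones, and the cast keeps the low
           byte: 0x00 or 0xFF. */
        out[i] = (uint8_t)(0u - ((x >> i) & 1u));
    }
}
```

With x = 0x3 (bits 0011), out[0] and out[1] come out as 0xFF and every other byte as 0x00, which is the byte-wise version of the 4-bit example above.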
Since the finest granularity the Intel shuffle intrinsics offer is a byte, I cannot shuffle individual bits into place afterwards.
The only solution I have found is to test each bit with an if before touching the register, preparing the data in memory first and then loading it into the register, as in this pseudo-C snippet:
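uint8_t expanded_bit[32] = {0};  /* one byte per bit, all start as 0x00 */
if (some_int & 1) {
    expanded_bit[0] = 0xFF;
}
if (some_int & 2) {
    expanded_bit[1] = 0xFF;
}
if (some_int & 4) {
    expanded_bit[2] = 0xFF;
}
if (some_int & 8) {
    expanded_bit[3] = 0xFF;
}
/* ... and so on for the remaining bits, up to bit 31 ... */
/* unaligned 256-bit load of the prepared bytes */
some_register = _mm256_loadu_si256((const __m256i *)expanded_bit);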
This does not seem efficient, though. One could argue that it defeats the purpose of SIMD if the overhead of preparing the data equals the time gained by using SIMD operations.