
I'm vectorizing some image processing code using hand-written 32-bit assembly to access AVX2 instructions, but I've run into a roadblock. The results of the vector operations end up in a YMM register, and I need a population count (POPCNT) of that register. I can't find any instruction or trick for quickly getting a population count of a YMM register.

My only recourse for the moment is to copy the contents of the YMM register to memory and run the normal 32-bit POPCNT on it. This would require eight POPCNT instructions as well as seven additions to sum the results. It would be nice if there were a way to get the population count of the YMM register using fewer instructions.
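
For reference, the store-and-count fallback described above can be modelled in C roughly as follows. This is only a sketch: `popcount256_scalar` is a hypothetical name, the 32-byte buffer stands in for the YMM register after it has been stored to memory, and `__builtin_popcount` is a GCC/Clang builtin that only becomes a single `popcnt` instruction when that extension is enabled (e.g. with `-mpopcnt`).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts the set bits in 32 bytes (one YMM register's
 * worth of data that has already been stored to memory). */
unsigned popcount256_scalar(const void *bytes32)
{
    uint32_t dwords[8];
    memcpy(dwords, bytes32, sizeof dwords);   /* the eight stored DWORDs */

    unsigned total = 0;
    for (int i = 0; i < 8; i++) {
        /* one 32-bit popcount per DWORD, accumulated into the total */
        total += (unsigned)__builtin_popcount(dwords[i]);
    }
    return total;
}
```

In hand-written assembly the loop would be unrolled into the eight `popcnt` instructions and seven `add`s mentioned above.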

It would have been perfect if AVX2 allowed me to do something like:

    POPCNT [EBP - 4], YMM1
Peter Cordes
Niya
  • Which instruction set extensions are you permitted to use? Do you want the full population of the register or rather the population of each dword or qword? – fuz Apr 03 '22 at 15:44
  • @fuz It's actually the sum of the population counts of 8 DWORDs. Since the preceding calculations are done in YMM registers, 8 DWORDs at a time, I was hoping for a way to get a population count of all 8 DWORDs at once while they are still in the YMM register. Effectively, I just want a population count of the entire contents of the YMM register, which is the same as summing the pop counts of each DWORD. As for the instruction set, I want to use nothing later than AVX2. – Niya Apr 03 '22 at 16:08
  • In this case, it's a bit tricky. You'll need to manually implement a population count procedure. Muła has some good approaches. – fuz Apr 03 '22 at 16:30
  • Please forgive my ignorance. But who or what is "Mula"? I'm still relatively new to the world of assembly programming. – Niya Apr 03 '22 at 16:39
  • Muła is a [guy](http://0x80.pl/articles/index.html#population-count-new) who did some nice research on population counts with SIMD instructions. – fuz Apr 03 '22 at 17:30
  • @fuz Ah ok, I see. Those solutions are a little over my head at the moment, though. I'm far from being an assembly language expert. I think I could use it once I figure out the technical details: how it works, how to adapt it to my specific needs, etc. – Niya Apr 03 '22 at 19:10
  • I can try and write an example implementation for you. Do you have a single ymm register worth of results? Or possibly more? What does the code around that look like? Is it all SIMD code? Or are there scalar operations, too? – fuz Apr 03 '22 at 21:16
  • You aren't by any chance counting compare results where every element is always 0 or 1? If so you can horizontally add them (e.g. with `psadbw` against zero if they're byte elements.) But anyway, no, until AVX-512, there isn't an instruction for SIMD popcnt, and even then it's separate counts in each element so you have to hsum the result. (But with qword element size that's only 2 buckets for an XMM vector). As Fuz said, if you're popcounting a whole array, you don't want to reduce each vector to scalar, just do that at the end after a popcount within smaller chunks. – Peter Cordes Apr 04 '22 at 03:16
  • [Fast counting the number of set bits in \_\_m128i register](https://stackoverflow.com/q/17354971) shows how to count within chunks (look at compiler output if you want it to translate intrinsics to asm for you). It shows how to sum up to a count of all 128 bits; [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) has links that show hsums for 256-bit AVX2. (just bring down the top half with `vextracti128` / `vpaddb` and then the problem reduces to the XMM problem Marat's answer solves.) If you have a more specific Q, edit. – Peter Cordes Apr 04 '22 at 03:18
  • And BTW, writing 64-bit code would let you use scalar 64-bit chunks, instead of being limited to 32-bit. On an AMD CPU especially, for a single vector it's worth considering store / 4x memory-source `popcnt` with 64-bit operand-size, and 3x `add` (32-bit operand-size is fine of course). – Peter Cordes Apr 04 '22 at 03:22
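
The approach the comments point to (a per-nibble lookup table via `vpshufb`, as in Muła's articles and the linked `__m128i` question, followed by a `vpsadbw` horizontal sum) might look roughly like this with intrinsics. It is a sketch, not a tuned implementation: `popcount256_avx2` is a hypothetical name, the input is assumed to be 32 bytes in memory rather than a live register, and the `target("avx2")` attribute assumes GCC or Clang. The comments name the instruction each intrinsic normally maps to, so the sequence can be carried over to hand-written assembly.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: popcount of 32 bytes using only AVX2-level
 * instructions. Must only be called on a CPU that supports AVX2. */
__attribute__((target("avx2")))
unsigned popcount256_avx2(const void *bytes32)
{
    const __m256i v = _mm256_loadu_si256((const __m256i *)bytes32); /* vmovdqu */

    /* Popcount of every 4-bit value, repeated for both 128-bit lanes
     * because vpshufb looks up within each lane separately. */
    const __m256i lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    const __m256i nibble_mask = _mm256_set1_epi8(0x0f);

    /* Split every byte into its low and high nibble. */
    __m256i lo = _mm256_and_si256(v, nibble_mask);                  /* vpand */
    __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4),          /* vpsrlw */
                                  nibble_mask);                     /* vpand */

    /* Look up both nibble counts and add them: one count (0..8) per byte. */
    __m256i per_byte = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),  /* vpshufb */
                                       _mm256_shuffle_epi8(lut, hi)); /* vpshufb, vpaddb */

    /* vpsadbw against zero sums each group of 8 bytes into a 64-bit element. */
    __m256i per_qword = _mm256_sad_epu8(per_byte, _mm256_setzero_si256());

    /* Fold the four 64-bit partial sums down to one. */
    __m128i sum = _mm_add_epi64(_mm256_castsi256_si128(per_qword),
                                _mm256_extracti128_si256(per_qword, 1)); /* vextracti128, vpaddq */
    sum = _mm_add_epi64(sum, _mm_unpackhi_epi64(sum, sum));         /* vpunpckhqdq, vpaddq */
    return (unsigned)_mm_cvtsi128_si32(sum);                        /* vmovd */
}
```

When counting a whole array rather than a single vector, the final fold would normally be left out of the loop: keep adding the `vpsadbw` results into a vector accumulator with `vpaddq` and reduce to a scalar once at the end, as the comments above suggest.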

0 Answers