I have the following C++ function that sums all the elements of a 128-bit SSE float register. It simply performs two horizontal adds, as shown below:

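#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

float sum4(__m128 x) {
        // First hadd: (x0+x1, x2+x3, x0+x1, x2+x3)
        const __m128 hsum_0 = _mm_hadd_ps(x, x);
        // Second hadd must take hsum_0, not x, to get x0+x1+x2+x3 in every lane
        const __m128 hsum_1 = _mm_hadd_ps(hsum_0, hsum_0);
        return _mm_cvtss_f32(hsum_1);
}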

Is this the most efficient way to sum all the elements of a 128-bit SSE register? I ask because I read that horizontal operations should be avoided for dense processing (http://wiki.ros.org/PatrickMihelich/pcl_simd#Horizontal_or_vertical.3F), so I am concerned that calling sum4() many times during program execution will hurt performance significantly.

Thanks in advance for any help!

Aswathy - Intel
  • Your question as asked is a duplicate. But yes, as Patrick says you should avoid designing your algorithms to need this at all except outside loops. e.g. if you're doing a dot product, `_mm_add_ps()` the mulps results, and don't horizontal sum until the end. When it does matter, `haddps` is one of the slower ways, but even an efficient hsum is far worse than none at all. – Peter Cordes Oct 04 '19 at 10:57
  • Thanks for pointing out the original question, Peter. I searched for it but couldn't find it! OK, I will take that into account! – César Gouveia Oct 04 '19 at 11:07
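
To make the advice in the comments concrete, here is a minimal sketch of a dot product that accumulates vertically with `_mm_add_ps()` inside the loop and does a single horizontal sum at the end. The function names `hsum_ps` and `dot` are hypothetical, chosen only for this illustration. The horizontal sum uses a shuffle-based sequence instead of `haddps`; this is one common alternative, not necessarily the fastest on every CPU, and it assumes an x86 target with SSE available.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper: horizontal sum of four floats using shuffles (SSE1 only).
inline float hsum_ps(__m128 v) {
    // Swap neighbours within each pair: (v1, v0, v3, v2)
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    // Pairwise sums: (v0+v1, v0+v1, v2+v3, v2+v3)
    __m128 sums = _mm_add_ps(v, shuf);
    // Move the high pair of sums into the low half: low element = v2+v3
    shuf = _mm_movehl_ps(shuf, sums);
    // Scalar add in the low element: (v0+v1) + (v2+v3)
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}

// Hypothetical example: dot product with vertical accumulation in the loop
// and one horizontal sum after it.
float dot(const float* a, const float* b, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);  // unaligned loads, no alignment assumed
        const __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  // vertical add, no hsum here
    }
    float result = hsum_ps(acc);  // single horizontal sum, outside the loop
    for (; i < n; ++i) {
        result += a[i] * b[i];  // scalar tail for leftover elements
    }
    return result;
}
```

With this structure the cost of the horizontal sum is paid once per call to `dot` rather than once per four elements, which is the point of keeping it outside the loop.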
