
I have 4 `__m256d` variables. How can I find the minimum among all 16 values? How can I tell which `__m256d` variable the minimum comes from, and which element it is? Assume some of the values may be equal across different `__m256d` variables.

Here is my attempt, but it doesn't work:

#include <immintrin.h>
#include <float.h>
#include <stdio.h>

int main()
{
    // either v1[0] or v3[2] is the answer.
    __m256d v1 = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d v2 = _mm256_set_pd(5.0, 6.0, 7.0, 8.0);
    __m256d v3 = _mm256_set_pd(3.0, 4.0, 1.0, 2.0);
    __m256d v4 = _mm256_set_pd(6.0, 5.0, 8.0, 7.0);

    __m256d min = _mm256_set1_pd(DBL_MAX);

    // Find the minimum among all 16 values
    min = _mm256_min_pd(min, v1);
    min = _mm256_min_pd(min, v2);
    min = _mm256_min_pd(min, v3);
    min = _mm256_min_pd(min, v4);

    // Get a 4-bit mask of the minimum elements
    int mask = _mm256_movemask_pd(_mm256_cmp_pd(min, min, _CMP_EQ_OQ));

    // Extract the index of the minimum element
    int index = __builtin_ffs(mask) - 1;

    // Determine which __m256d variable the minimum value comes from and which element it is
    __m256d* v[4] = {&v1, &v2, &v3, &v4};
    int v_index = index / 4;
    int elem_index = index % 4;

    printf("The minimum value is %lf from v%d at element %d\n", min[elem_index], v_index + 1, elem_index);

    return 0;
}
holmessh
    `_mm256_min_pd` is a vertical min, not horizontal. You're doing 4 different `min` operations and then comparing each element against itself, which is like `!isnan()`. After reducing to one vector of mins, you need [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) but with `min` instead of `add`, then search for a match in each of 4 vectors. And maybe pack the compare results down from 64-bit elements to narrower, so you can get them all with one `_mm256_movemask_epi8` and one search. – Peter Cordes May 10 '23 at 15:54
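Not from the thread itself: the two-phase idea the comment describes (first reduce the 16 values to a single minimum, then search the vectors for the first matching element) can be sketched in plain scalar C. `find_min16` is a hypothetical helper name, not anything from the post:

```c
#include <stddef.h>

// Scalar sketch of the comment's approach: phase 1 reduces the 16
// values to one minimum; phase 2 scans for the first element equal
// to it. Returns the flat index (vector * 4 + lane).
static int find_min16(const double v[4][4])
{
    double m = v[0][0];
    for (size_t i = 0; i < 4; i++)
        for (size_t j = 0; j < 4; j++)
            if (v[i][j] < m)
                m = v[i][j];

    for (size_t i = 0; i < 4; i++)
        for (size_t j = 0; j < 4; j++)
            if (v[i][j] == m)
                return (int)(i * 4 + j);
    return -1; // unreachable for non-NaN input
}
```

The SIMD version in the answer below follows the same two phases, just with vector min and vector compares instead of loops.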

1 Answer


Assuming you have AVX1 but not AVX2, I would do it like this.

#include <immintrin.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

struct sMin16
{
    // The minimum value
    double val;
    // Index of the first minimum element
    int index;
};

// Compute minimum of 16 FP64 numbers, stored in 4 AVX vectors
sMin16 min16( __m256d v0, __m256d v1, __m256d v2, __m256d v3 )
{
    // Compute vertical minimum of the 4 vectors
    __m256d t0 = _mm256_min_pd( v0, v1 );
    __m256d t1 = _mm256_min_pd( v2, v3 );
    t0 = _mm256_min_pd( t0, t1 );

    // Compute broadcasted horizontal minimum of `t0` vector
    // Swap 16-byte pieces, compute minimum
    t1 = _mm256_permute2f128_pd( t0, t0, 0x01 );
    t0 = _mm256_min_pd( t0, t1 );
    // Swap elements pairwise, compute minimum
    t1 = _mm256_shuffle_pd( t0, t0, 0b0101 );
    t0 = _mm256_min_pd( t0, t1 );

    // Store the minimum value
    sMin16 result;
    result.val = _mm256_cvtsd_f64( t0 );

    // Compare numbers for equality with the broadcasted minimum,
    // and make a bitmap of the results
    uint32_t mask;
    mask = (uint32_t)_mm256_movemask_pd( _mm256_cmp_pd( t0, v0, _CMP_EQ_OQ ) );
    mask |= (uint32_t)_mm256_movemask_pd( _mm256_cmp_pd( t0, v1, _CMP_EQ_OQ ) ) << 4;
    mask |= (uint32_t)_mm256_movemask_pd( _mm256_cmp_pd( t0, v2, _CMP_EQ_OQ ) ) << 8;
    mask |= (uint32_t)_mm256_movemask_pd( _mm256_cmp_pd( t0, v3, _CMP_EQ_OQ ) ) << 12;

    // We have a bitmap of 16 bits, a bit is set for element[s] equal to the minimum
    // Compute index of the first element equal to the minimum
#ifdef _MSC_VER
    unsigned long idx;
    _BitScanForward( &idx, mask );
    result.index = idx;
#else
    result.index = __builtin_ctz( mask );
#endif
    return result;
}

The lowest 2 bits of the computed index contain the lane index within a vector, and bits [ 2 .. 3 ] contain the index of the vector holding the first minimum element.
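As a small sketch of that decoding step (`decode_idx` is an illustrative helper, not part of the answer):

```c
// Split the flat 0..15 index into (vector, lane): the low 2 bits are
// the lane within a __m256d, the next 2 bits select one of the 4 vectors.
typedef struct { int vec; int lane; } IdxParts;

static IdxParts decode_idx(int index)
{
    IdxParts p;
    p.lane = index & 3;  // bits [ 0 .. 1 ]
    p.vec  = index >> 2; // bits [ 2 .. 3 ]
    return p;
}
```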

Note that the _mm256_set_pd intrinsic reverses the order of its arguments, so _mm256_set_pd( 1, 2, 3, 4 ) creates a vector with the values [ 4, 3, 2, 1 ], not [ 1, 2, 3, 4 ].
For this reason, the function returns minimum index = 3 for your test case: that's the last lane of the v0 vector.
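That claim can be cross-checked in scalar code, writing out the four vectors in lane order (low lane first) after accounting for the argument reversal; this snippet is an editorial sketch, not part of the answer:

```c
// Scalar cross-check: index of the first minimum among the question's
// 16 values, laid out in lane order (low lane first).
static int first_min_index(void)
{
    double v[4][4] = {
        { 4.0, 3.0, 2.0, 1.0 }, // _mm256_set_pd(1.0, 2.0, 3.0, 4.0)
        { 8.0, 7.0, 6.0, 5.0 }, // _mm256_set_pd(5.0, 6.0, 7.0, 8.0)
        { 2.0, 1.0, 4.0, 3.0 }, // _mm256_set_pd(3.0, 4.0, 1.0, 2.0)
        { 7.0, 8.0, 5.0, 6.0 }  // _mm256_set_pd(6.0, 5.0, 8.0, 7.0)
    };
    double m = v[0][0];
    int idx = 0;
    for (int i = 1; i < 16; i++)
        if (v[i / 4][i % 4] < m) { m = v[i / 4][i % 4]; idx = i; }
    return idx; // 3: vector 0, lane 3, value 1.0
}
```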

Soonts
  • Since you want the `min` broadcasted, consider using 256-bit shuffles like `vshufpd ymm` / `vperm2f128` so all 4 elements get the min at the same time. Those are both single-uop on Zen 2 and Intel. (And you can still get the low scalar out of the `__m256d` for free). That saves a `vinsertf128`. Also, `vpermilpd` (p5) has worse throughput on Intel Ice Lake than `vshufpd same,same` (p15). Very silly, IDK what's up with that, why they couldn't just decode `vpermilpd imm8` as `vshufpd same,same, imm8` with the same immediate and have it able to run on port 1. – Peter Cordes May 11 '23 at 18:34
  • @PeterCordes Good idea, updated – Soonts May 14 '23 at 09:44
  • While you're looking at this again, did you consider using AVX2 `vpackssdw` to pack pairs of compare results before movemask? Or maybe if you're going to pack at all with annoying AVX2 in-lane shuffles, it's important to go all the way to one vector so you can put them back in linear order. Maybe even narrowing to a `__m128i` with `vextracti128` and a final `vpacksswb` step so you can `vpshufb`. So that might actually be worse than 4x `vmovmskpd` plus integer shift / OR. (The shift count is too high for `lea` to help even if you used `add` :/) – Peter Cordes May 14 '23 at 09:53
  • @PeterCordes I considered it but chose not to, for two reasons: (1) the OP hasn't tagged the question AVX2; (2) the throughput of `vmovmskpd` is pretty good, 1 cycle. Latency is not great at 5-7 cycles, but the latency of `vpmovmskb` is about the same, so collecting bytes with AVX2 shuffles will probably be slower, given the extra instructions to gather the bytes for `vpmovmskb`. While it would save scalar shifts + bitwise ORs, I believe those are going to be faster than the AVX2 code needed to produce the `__m128i` with these 16 bytes. – Soonts May 14 '23 at 10:20