
I cannot figure out how to implement:

__m256d min(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d result;

    // result should contain 4 minimal values out of 16 : A[0], A[1], A[2], A[3], B[0], ... , D[3]
    // moreover it should be result[0] <= result[1] <= result[2] <= result[3]

     return result;
}

Any ideas of how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way?

==================================================

This is where I've got so far. After:

    __m256d T = _mm256_min_pd(A, B);
    __m256d Q = _mm256_max_pd(A, B);
    A = T; B = Q;
    T = _mm256_min_pd(C, D);
    Q = _mm256_max_pd(C, D);
    C = T; D = Q;
    T = _mm256_min_pd(B, C);
    Q = _mm256_max_pd(B, C);
    B = T; C = Q;
    T = _mm256_min_pd(A, B);
    Q = _mm256_max_pd(A, B);
    A = T; B = Q;
    T = _mm256_min_pd(C, D);
    Q = _mm256_max_pd(C, D);
    C = T; D = Q;
    T = _mm256_min_pd(B, C);
    Q = _mm256_max_pd(B, C);
    B = T; C = Q;

we have: A[i] <= B[i] <= C[i] <= D[i] for every lane i = 0..3,

so the minimal value is among A's, second minimal is among A's or B's, ... Not sure where to go from there ...
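By the way, a scalar reference (names arbitrary, only meant for checking any vectorized attempt against) is easy to write with std::partial_sort:

```cpp
#include <algorithm>

// Scalar oracle: copy the 16 candidate values, move the 4 smallest
// (sorted ascending) to the front, and return them. Only for verifying
// a vectorized min4 -- not meant to be fast.
static void min4_reference(const double in[16], double out[4])
{
    double buff[16];
    std::copy(in, in + 16, buff);
    std::partial_sort(buff, buff + 4, buff + 16);
    std::copy(buff, buff + 4, out);
}
```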

========================================================

Second idea is that the problem is reducible to itself, but with 2 input __m256d vectors. If this can be done, then just do min4(A,B) --> P, min4(C,D) --> Q, min4(P,Q) --> return value.

No idea how to do that for two vectors though :)
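In scalar terms the reduction would look like this (min4_2 here is just a placeholder built on std::partial_sort; the whole point would be to replace it with a shuffle/min-based kernel):

```cpp
#include <algorithm>

// Placeholder 2-input kernel: the 4 smallest of 8 values, ascending.
static void min4_2(const double a[4], const double b[4], double out[4])
{
    double buff[8];
    std::copy(a, a + 4, buff);
    std::copy(b, b + 4, buff + 4);
    std::partial_sort(buff, buff + 4, buff + 8);
    std::copy(buff, buff + 4, out);
}

// The reduction itself: 4 inputs -> 2 -> 1. Correct because the 4 smallest
// of all 16 must be among the 4 smallest of (A,B) plus the 4 smallest of (C,D).
static void min4_reduce(const double A[4], const double B[4],
                        const double C[4], const double D[4], double out[4])
{
    double P[4], Q[4];
    min4_2(A, B, P);
    min4_2(C, D, Q);
    min4_2(P, Q, out);
}
```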

=======================================================================

Update 2: problem almost solved -- the following function computes the 4 minimal values.

__m256d min4(__m256d A, __m256d B, __m256d C, __m256d D)
{
    __m256d T;
    T = _mm256_min_pd(A, B);
    B = _mm256_max_pd(A, B);            
    B = _mm256_permute_pd(B, 0x5);
    A = _mm256_min_pd(T, B);            
    B = _mm256_max_pd(T, B);            
    B = _mm256_permute2f128_pd(B, B, 0x1);
    T = _mm256_min_pd(A, B);
    B = _mm256_max_pd(A, B);
    B = _mm256_permute_pd(B, 0x5);
    A = _mm256_min_pd(T, B);

    T = _mm256_min_pd(C, D);
    D = _mm256_max_pd(C, D);            
    D = _mm256_permute_pd(D, 0x5);
    C = _mm256_min_pd(T, D);            
    D = _mm256_max_pd(T, D);            
    D = _mm256_permute2f128_pd(D, D, 0x1);
    T = _mm256_min_pd(C, D);
    D = _mm256_max_pd(C, D);
    D = _mm256_permute_pd(D, 0x5);
    C = _mm256_min_pd(T, D);

    T = _mm256_min_pd(A, C);
    C = _mm256_max_pd(A, C);            
    C = _mm256_permute_pd(C, 0x5);
    A = _mm256_min_pd(T, C);            
    C = _mm256_max_pd(T, C);            
    C = _mm256_permute2f128_pd(C, C, 0x1);
    T = _mm256_min_pd(A, C);
    C = _mm256_max_pd(A, C);
    C = _mm256_permute_pd(C, 0x5);
    A = _mm256_min_pd(T, C);

    return A;
}

All that remains is to sort the values inside A in increasing order before returning.
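One possible way to do that final in-register sort (a sketch, assuming AVX2 is available for the cross-lane vpermpd; the blend masks are my guess at the comparator wiring and worth double-checking):

```cpp
#include <immintrin.h>

// Sort the 4 doubles of a __m256d ascending with the classic 5-comparator
// network for 4 elements: (0,2)(1,3), then (0,1)(2,3), then (1,2).
// Each stage: shuffle a "partner" vector, take min/max, blend them back.
__attribute__((target("avx2")))
static __m256d sort4(__m256d x)
{
    // compare-exchange (0,2) and (1,3): partner = 128-bit halves swapped
    __m256d y  = _mm256_permute2f128_pd(x, x, 0x1);   // [x2 x3 x0 x1]
    __m256d lo = _mm256_min_pd(x, y);
    __m256d hi = _mm256_max_pd(x, y);
    x = _mm256_blend_pd(lo, hi, 0xC);                 // mins in lanes 0,1; maxes in 2,3

    // compare-exchange (0,1) and (2,3): partner = in-lane swap
    y  = _mm256_permute_pd(x, 0x5);                   // [x1 x0 x3 x2]
    lo = _mm256_min_pd(x, y);
    hi = _mm256_max_pd(x, y);
    x = _mm256_blend_pd(lo, hi, 0xA);                 // mins in lanes 0,2; maxes in 1,3

    // compare-exchange (1,2): partner = middle elements swapped (AVX2 vpermpd)
    y  = _mm256_permute4x64_pd(x, _MM_SHUFFLE(3, 1, 2, 0)); // [x0 x2 x1 x3]
    lo = _mm256_min_pd(x, y);
    hi = _mm256_max_pd(x, y);
    return _mm256_blend_pd(lo, hi, 0x4);              // max only in lane 2
}

// Array wrapper so callers/tests don't themselves need AVX codegen.
__attribute__((target("avx2")))
static void sort4_array(const double in[4], double out[4])
{
    _mm256_storeu_pd(out, sort4(_mm256_loadu_pd(in)));
}
```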

  • What is the specific issue that you're running into? This is a pretty broad question. – Adam B Mar 11 '16 at 16:44
  • You're looking for one that gets the lowest 4 doubles out of all 16 doubles into one vector, in order, right? Google SIMD sorting network, and stuff like that. You might find that unpacking to two `__m128d` vectors is useful for some steps, but maybe not. If you only care about the smallest 4 elements, and not a complete sort, it might be harder to beat scalar code with a SIMD sorting network. – Peter Cordes Mar 11 '16 at 16:49
  • Correct -- the lowest 4 doubles out of all 16 doubles into one vector. These 4 vectors contain 16 values that are result of a SIMD computation that works very very well. At the end 4 lowest have to be selected. The goal is not to beat scalar code, but just to avoid it. It seems illogical to me to unload the values to memory, then do sort, then load again. – Fedor_Govnjukoff Mar 11 '16 at 17:11
  • What's the most important performance criterion here? Latency, throughput, or total fused-domain uops (i.e. impact on throughput of surrounding code)? Can out-of-order execution potentially have multiple sorts in flight at once, or overlapping with other work, or is this part of a loop-carried dependency? – Peter Cordes Mar 12 '16 at 07:35
  • BTW, the formal name for finding the smallest k elements out of n is [Selection algorithm](https://en.wikipedia.org/wiki/Selection_algorithm). (technically, a selection algorithm only finds the kth order statistic, not all k..n or 0..k (partial sort). We want a partial sort that doesn't do extra work to make sure the rest of the array (or registers) still have meaningful data.) Anyway, I didn't find a lot of discussion of *very* small `k` where n is also small when googling [simd selection algorithm](https://www.google.com/?q=simd%20selection%20algorithm). :/ – Peter Cordes Mar 12 '16 at 08:51
  • If you were doing this vertically it would be much easier - would your use case allow for 4 such operations in parallel? – Paul R Mar 12 '16 at 09:43

2 Answers


It might be best to do some SIMD compares to reduce to 8 or 4 (like you have now) candidates, and then unpack to scalar doubles in vector registers. This doesn't have to involve a memory round-trip: vextractf128 the high half (_mm256_extractf128_pd), and cast the low half. Maybe use movhlps (_mm_movehl_ps) to get the high half of a __m128 down to the low half (although on CPUs with AVX, you only save a code byte or two from using that instead of a shuffle with an immediate; it's not faster like it is on some old CPUs).

IDK whether unpacking with shuffles or simply storing is the way to go. Maybe a mix of both, to keep the shuffle ports and the store/load ports busy would do well. Obviously the low double in every vector is already in place as a scalar, so that's one you don't have to load. (And compilers are bad at seeing through a store-and-reload-as-scalars to take advantage of this, even to a local array.)
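A sketch of that unpack (my arrangement of the intrinsics mentioned above; the load wrapper is only there for callers or tests compiled without AVX flags):

```cpp
#include <immintrin.h>

// No memory round-trip: cast for the low half (zero instructions),
// vextractf128 for the high half, then pull scalars out of xmm registers.
__attribute__((target("avx")))
static void unpack4(__m256d v, double out[4])
{
    __m128d lo = _mm256_castpd256_pd128(v);      // lanes 0,1
    __m128d hi = _mm256_extractf128_pd(v, 1);    // lanes 2,3
    out[0] = _mm_cvtsd_f64(lo);                  // lane 0 is already a scalar
    out[1] = _mm_cvtsd_f64(_mm_unpackhi_pd(lo, lo));
    out[2] = _mm_cvtsd_f64(hi);
    out[3] = _mm_cvtsd_f64(_mm_unpackhi_pd(hi, hi));
}

// Convenience wrapper for testing from non-AVX translation units.
__attribute__((target("avx")))
static void unpack4_from(const double in[4], double out[4])
{
    unpack4(_mm256_loadu_pd(in), out);
}
```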

Even without narrowing down the set of candidates very much, some SIMD comparators before unpacking could reduce the amount of swaps / shuffles expected from branchy scalar code, reducing branch mispredict penalties.


As I described in comments on Paul R's answer, in scalar code you'd probably do well with an insertion-sort type of algorithm. But it's more like a priority queue: only insert into the first 4 elements. If a new candidate is greater than the biggest existing candidate, just move on. Otherwise insertion-sort it into the list of 4 candidates which you maintain in sorted order.
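A scalar sketch of that pqueue-style insertion (the 16-element input size and the names are just for illustration):

```cpp
#include <algorithm>

// Keep best[0..3] sorted ascending; for each later element, bail out early
// in the common case (candidate >= current 4th-smallest), otherwise
// insertion-sort it in, evicting the old 4th element.
static void min4_pqueue(const double in[16], double best[4])
{
    std::copy(in, in + 4, best);
    std::sort(best, best + 4);            // seed the queue with the first 4

    for (int i = 4; i < 16; ++i) {
        double v = in[i];
        if (v >= best[3]) continue;       // not a candidate: cheap reject
        int j = 3;                        // shift larger entries up...
        while (j > 0 && best[j - 1] > v) {
            best[j] = best[j - 1];
            --j;
        }
        best[j] = v;                      // ...and drop v into its slot
    }
}
```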


I found a really nice paper on SIMD sorting networks, with specific discussion of AVX. They go into detail about the required shuffles when using SIMD packed-min / packed-max instructions to sort a couple vector-registers of data. They even use intrinsics like _mm512_shuffle_epi32 in their examples. They say their results are applicable to AVX, even though they use AVX-512 mask registers in their examples.

It's only the last bit of the paper where they talk about merging to use the small sort as a building block for a big parallel sort. I can't find their actual code anywhere, so maybe they never published the full implementation they benchmarked to make their graphs. :(

BTW, I wrote a previous answer with some not-very-great ideas about sorting 64bit structs by a float member, but that's not really applicable here since I was only addressing the complications of dealing with a payload (which you don't have).


I don't have time right now to finish this answer, so I'll just post a summary of my idea:

Adapt the 2-register method from that paper to AVX (or AVX2). I'm not sure how best to emulate their AVX512 masked min/max instructions. :/ I may update this later. You might want to email the authors and ask about the code they used to benchmark the desktop CPU.

Anyway, use the 2-register function on pairs of regs, to reduce from 4 to 2 regs, then again to reduce to 1 reg. Unlike your version, theirs produces a fully-sorted output register.

Trying to avoid cross-lane shuffles whenever possible may be tricky. I'm not sure if you can gain anything from using shufpd (__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int select);) to combine data from two source regs while shuffling. The 256b version can do a different shuffle on each lane, using 4 bits of the imm8 instead of just 2.

It's an interesting problem, but I unfortunately shouldn't take the time to write up a full solution myself. If I had time, I'd like to compare an insertion-sort priority queue and a sorting-network fully-unrolled implementation of the same pqueue, on 4, 8, 12, and 16 elements each. (different levels of SIMD sorting network before going scalar).

Links I've found:

Peter Cordes
  • Good answer! I will add another link to a paper which I think looks promising. It got pseudo-code on page 6 (1279) http://www.vldb.org/pvldb/vol8/p1274-inoue.pdf – gustf Mar 13 '16 at 20:13
  • @gustf: not just pseudo-code: actual C++ with intrinsics. Interesting: I keep forgetting about `palignr` for combining elements from two vectors. Of course, this question is asking about float, so palignr would cause extra latency forwarding to minpd/maxpd. They're using it to transfer a single element, so it doesn't map to `_mm256_permute2f128_pd`, unfortunately. – Peter Cordes Mar 13 '16 at 20:16
  • True, I meant to write intrinsics, don't know what happened really :) And yes you are right about the `alignr`, but I was thinking that the actual algorithm could be of interest. – gustf Mar 13 '16 at 20:47

This is a purely "horizontal" operation and not really suited to SIMD - I suspect it will be quicker to just stash the four vectors in memory, sort the 16 values, and then load the first four into your result vector:

#include <algorithm>   // for std::partial_sort

__m256d min(__m256d A, __m256d B, __m256d C, __m256d D)
{
    double buff[16] __attribute__ ((aligned(32)));

    _mm256_store_pd(&buff[0], A);
    _mm256_store_pd(&buff[4], B);
    _mm256_store_pd(&buff[8], C);
    _mm256_store_pd(&buff[12], D);

    std::partial_sort(buff, buff+4, buff+16);

    return _mm256_load_pd(&buff[0]);    
}

For improved performance you could implement an inline custom sort routine which is hard-coded for 16 elements.
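For example, a pruned bubble-style selection that stops after the 4 smallest (a sketch, not benchmarked -- 15+14+13+12 = 54 compare-exchanges instead of a full sort):

```cpp
#include <utility>

// Four bubble passes; pass k pushes the smallest remaining element down to
// buff[k], so afterwards buff[0..3] hold the 4 smallest in ascending order.
// The rest of the array is left partially shuffled, which is fine here.
static void select4_of_16(double buff[16])
{
    for (int k = 0; k < 4; ++k)
        for (int i = 15; i > k; --i)
            if (buff[i] < buff[i - 1])
                std::swap(buff[i], buff[i - 1]);
}
```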

Paul R
  • I am aware of this obvious solution, sir. – Fedor_Govnjukoff Mar 11 '16 at 17:22
  • OK - you might want to add that information to your question then. – Paul R Mar 11 '16 at 17:29
  • 1
    use [`std::partial_sort(buff, buff+4, buff+16)`](http://www.cplusplus.com/reference/algorithm/partial_sort/) to not waste time sorting the whole array. – Peter Cordes Mar 12 '16 at 02:51
  • @PeterCordes: thanks - good suggestion - I didn't even know that existed. – Paul R Mar 12 '16 at 06:15
  • 1
    I wasn't sure there was an STL function for it until I looked, but I knew the concept existed. Hopefully it has different strategies for when the partial-sort range is very small. e.g. insertion-sort into the first 4 elements, and then stop checking. Or maintain a queue of the highest 4 elements seen so far. `std::partial_sort` still does more work than needed, because it can't leave the rest of the array corrupt (e.g. duplicate copies of elements). Maybe there's an STL function even better suited to this, but partial_sort was what I found first. – Peter Cordes Mar 12 '16 at 06:43
  • 1
    Also, with `-std=gnu++11`, you can use `alignas(32) double buff[16];`. gcc and clang then generate the requisite `and rsp, -32` [or equivalent](http://goo.gl/sDlD79) after setting up a stack frame. Unfortunately it doesn't look as efficient as it could be :/ Probably some vector sorting-network stuff and then a scalar horizontal over 4 or 8 elements is better. – Peter Cordes Mar 12 '16 at 07:41
  • @PeterCordes: I think if we are to unload values to a memory array and partially sort them, then the simplest thing to do is to run a bubble-sort-like loop 4 times from the top to the bottom+i of the array, each time pushing the lowest element down. This will require 15 + 14 + 13 + 12 = 54 compare-and-exchange-if-greater operations. – Fedor_Govnjukoff Mar 12 '16 at 10:12
  • 1
    @Fedor_Govnjukoff: insertion sort is considered optimal for small arrays. – Paul R Mar 12 '16 at 10:19
  • @PaulR: Maybe insertion sort is faster for small arrays, but how to adapt it for finding 4 lowest elements is not clear. – Fedor_Govnjukoff Mar 12 '16 at 10:49
  • @Fedor_Govnjukoff: true - I've never tried that. Network sorts can also be similarly fast for small arrays, especially when you have branchless min/max, and you could prune the sorting network so that it only generates the 4 smallest outputs. – Paul R Mar 12 '16 at 11:22
  • 1
    @Fedor_Govnjukoff: adapting insertion sort: operate normally on the first 4 elements, so they're sorted. After that, only consider inserting into the first 4 elements. When shifting to make room for a new element, you can stop shifting after writing a new 4th element. So you're basically treating the first 4 elements as a priority queue that you insert into. For every following element, first check the largest element in the pqueue to see if the element is smaller than any of the current min4. On an Intel CPU with low latency gp<->vector, maybe broadcast load and cmpps->movmsk->bsf – Peter Cordes Mar 12 '16 at 11:26