16

Using MSVC 2013 and AVX1, I've got 8 floats in a register:

__m256 foo = _mm256_fmadd_ps(a, b, c);

Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrinsics would make this rather complicated:

print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
// ...

but MSVC doesn't even have either of these two intrinsics. Sure, I could write back the values to memory and load from there, but I suspect that at assembly level there's no need to spill a register.

Bonus Q: I'd of course like to write

for (int i = 0; i != 8; ++i)
    print(_castu32_f32(_mm256_extract_epi32(foo, i)));

but the element index for these intrinsics has to be a compile-time constant, and MSVC doesn't understand that such intrinsics require the loop to be unrolled. How do I write a loop over the 8x32-bit floats in __m256 foo?

MSalters
  • If you're going to be printing data then it hardly matters about spilling a register to memory - just use a suitable union. – Paul R Jun 03 '16 at 11:12
  • @PaulR: Simplified example. – MSalters Jun 03 '16 at 11:19
  • It matters whether `print()` is standing in for a function that can really be fully inlined, or if the compiler has to eventually `call` a function it can't see the code for. What's really going on? – Peter Cordes Jun 03 '16 at 18:44
  • If you only care about MSVC, something like `foo.m256_f32[i]` may work (that is `foo[i]` with gcc). – Marc Glisse Jun 03 '16 at 18:49
  • @PeterCordes: Actually, I introduced AVX here because it's a common pattern. The actual calls are all inlineable, at about 20-50 instructions themselves, _but_ a few of those call non-inlineable functions in turn. But AVX isn't going away, and I can't be the only one who is going to interface SIMD AVX code with classic SISD code, so a generic answer is welcome. – MSalters Jun 03 '16 at 23:00
  • @MSalters: I think my answer covers both cases pretty well: An optimal pattern for shuffling data out to separate xmm registers, and when store + scalar-loads is better (i.e. for integer, or when calling functions that will clobber vector regs). Generally storing to memory for this isn't bad; it's only about 5c latency for a store-forwarding round trip on Intel CPUs. It's only bad for a horizontal sum or other reduction. I'm working on a 2nd answer about the C++ side of things, and convenient syntax for looping over vector elements. (e.g. gcc unrolls for you with `-O1` or higher) – Peter Cordes Jun 03 '16 at 23:42
  • @PeterCordes: I've got recursive template code for Paul's idea. No unrolling needed; just inlining of 7 levels of helper functions. Pretty straightforward really - one call to `print`, one recursive call. Should be trivial to inline, which means I end up with 8 calls in a row, just like with an unrolled loop. Still got to check the assembly though. – MSalters Jun 04 '16 at 00:23

5 Answers

10

Assuming you only have AVX (i.e. no AVX2), you could do something like this:

float extract_float(const __m128 v, const int i)
{
    float x;
    _MM_EXTRACT_FLOAT(x, v, i);
    return x;
}

void print(const __m128 v)
{
    print(extract_float(v, 0));
    print(extract_float(v, 1));
    print(extract_float(v, 2));
    print(extract_float(v, 3));
}

void print(const __m256 v)
{
    print(_mm256_extractf128_ps(v, 0));
    print(_mm256_extractf128_ps(v, 1));
}

However I think I would probably just use a union:

union U256f {
    __m256 v;
    float a[8];
};

void print(const __m256 v)
{
    const U256f u = { v };

    for (int i = 0; i < 8; ++i)
        print(u.a[i]);
}
Paul R
  • His `print()` function takes a float argument. `extract_ps` extracts to memory, or an integer register. A float shuffle (like shufps) is a far better choice (and then `_mm_cvtss_f32` to cast the vector to its scalar low element). If you want to use an SSE4.1 instruction, use `_mm_insert_ps`, which can select any source element and put it into any destination element, and also zero specified elements in the destination. (The SysV ABI allows garbage in upper xmm elements of a reg used to pass a scalar, so you don't need the zeroing. I assume Windows is the same.) – Peter Cordes Jun 03 '16 at 18:28
  • My bad - I wrote this in a hurry this morning - I think there's an `_MM_EXTRACT_PS` which does the right thing here ? (I can't easily check as I'm at a cricket match right now.) – Paul R Jun 03 '16 at 19:23
  • Apparently there is a [wrapper macro called `_MM_EXTRACT_FLOAT`](http://stackoverflow.com/a/3130397/224132). You use it as `_MM_EXTRACT_FLOAT(dest_float, src_m128, element_index)`, so it's weird (it doesn't evaluate to an expression, except as a GNU C statement-expression). g++ compiles it to a shufps when used on a vector already in a register, or a `movss` with an offset on a reference to a vector. gcc defines it in `/usr/lib/gcc/x86_64-linux-gnu/5.2.1/include/smmintrin.h`, in terms of `__builtin_ia32_vec_ext_v4sf`, not a specific Intel intrinsic. So yeah, good choice here I guess. – Peter Cordes Jun 03 '16 at 19:36
  • Also, it's only available with `-msse4.1` or higher! That's insane, because there's no need for anything beyond SSE1 to implement it. – Peter Cordes Jun 03 '16 at 19:50
  • @PeterCordes: Yeah, I found that one as well. I'm pretty confident I can use SSE4.1; I've got `/proc/cpuinfo` data from the actual servers so I can check on Monday. That's also how I know I have to use AVX1, not AVX2. (And I don't think you can have AVX without SSE4.1 anyway.) MSVC is because the code also has to run on our Windows development environments. – MSalters Jun 03 '16 at 23:06
  • @MSalters: That's correct, AVX guarantees all previous Intel SSE extensions (and provides VEX-encoded non-destructive destination versions of all of them). – Peter Cordes Jun 04 '16 at 00:05
  • @PeterCordes: answer fixed now - the `_MM_EXTRACT_FLOAT` approach is rather clunky - I think I prefer the union method even more now. – Paul R Jun 04 '16 at 08:01
5

Careful: _mm256_fmadd_ps isn't part of AVX1. FMA3 has its own feature bit, and was only introduced on Intel with Haswell. AMD introduced FMA3 with Piledriver (AVX1+FMA4+FMA3, no AVX2).
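
If you need to check at run time, here's a minimal MSVC sketch (has_fma3 is a hypothetical name; a complete check should also verify AVX and OS support of YMM state via OSXSAVE/_xgetbv):

#include <intrin.h>

bool has_fma3() {
    int info[4];
    __cpuid(info, 1);                   // CPUID leaf 1: feature flags
    return (info[2] & (1 << 12)) != 0;  // ECX bit 12 = FMA3
}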


At the asm level, if you want to get eight 32-bit elements into integer registers, it is actually faster to store to the stack and then do scalar loads. pextrd is a 2-uop instruction on SnB-family and Bulldozer-family CPUs (and on Nehalem and Silvermont, which don't support AVX).

The only CPU where vextractf128 + 2xmovd + 6xpextrd isn't terrible is AMD Jaguar. (cheap pextrd, and only one load port.) (See Agner Fog's insn tables)

A wide aligned store can forward to overlapping narrow loads. (Of course, you can use movd to get the low element, so you have a mix of load port and ALU port uops).


Of course, you seem to be extracting floats by using an integer extract and then converting it back to a float. That seems horrible.

What you actually need is each float in the low element of its own xmm register. vextractf128 is obviously the way to start, bringing element 4 to the bottom of a new xmm reg. Then 6x AVX shufps can easily get the other three elements of each half. (Or use movshdup and movhlps, which have shorter encodings: no immediate byte.)
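
As intrinsics, that unpack pattern might look like this (a sketch reusing the OP's print; print8 is a hypothetical name, and compilers may pick slightly different shuffles):

void print8(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);    // elements 0-3: no instruction needed
    __m128 hi = _mm256_extractf128_ps(v, 1);  // elements 4-7: vextractf128

    print(_mm_cvtss_f32(lo));                                // a0: already in place
    print(_mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 1)));         // a1: shufps
    print(_mm_cvtss_f32(_mm_movehl_ps(lo, lo)));             // a2: movhlps, no immediate
    print(_mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 3)));         // a3
    print(_mm_cvtss_f32(hi));                                // a4
    print(_mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 1)));         // a5
    print(_mm_cvtss_f32(_mm_movehl_ps(hi, hi)));             // a6
    print(_mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 3)));         // a7
}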

7 shuffle uops are worth considering vs. 1 store and 7 load uops, but not if you were going to spill the vector for a function call anyway.


ABI considerations:

You're on Windows, where xmm6-15 are call-preserved (only the low 128 bits; the upper halves of ymm6-15 are call-clobbered). This is yet another reason to start with vextractf128.

In the SysV ABI, all the xmm / ymm / zmm registers are call-clobbered, so every print() function requires a spill/reload. The only sane thing to do there is store to memory and call print with the original vector (i.e. print the low element, because it will ignore the rest of the register). Then movss xmm0, [rsp+4] and call print on the 2nd element, etc.
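
A minimal sketch of that store-then-reload pattern (print_via_store is a hypothetical name; alignas needs VS2015+, so on MSVC 2013 use __declspec(align(32)) instead):

void print_via_store(__m256 v)
{
    alignas(32) float buf[8];
    _mm256_store_ps(buf, v);                          // one 32-byte store
    print(_mm_cvtss_f32(_mm256_castps256_ps128(v)));  // element 0 is already a scalar
    for (int i = 1; i < 8; ++i)
        print(buf[i]);                                // movss reloads, forwarded from the store
}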

It does you no good to get all 8 floats nicely unpacked into 8 vector regs, because they'd all have to be spilled separately anyway before the first function call!

Peter Cordes
  • I suppose when you say "bring element 4 to the bottom of a new XMM register", it also includes moving elements 5-7 to the same register? Because the AVX lane concept appears to imply that I need to have all 4 high elements cross into the low lane. (Or more realistically, an YMM register is just a pair of XMM registers and I need to rename a high register to a low register). Am I right that your assembly is essentially the same as Paul R's intrinsics? The `vextractf128` is `_mm256_extractf128_ps(v, 1)`; the six `shufps` are `_mm_extract_ps(v, 1)` to `3`. – MSalters Jun 03 '16 at 23:27
  • @MSalters: `vextractf128` (`_mm256_extractf128_ps`) does what the name implies. Yes, you can shuffle its 128b result to get elements 5-7. I was just pointing out the bonus feature that element 4 is already in place, so you don't need any more shuffling for it. With AVX2, you could use 7x `vpermps` to get each element in turn, or in-lane 256b shuffles and then `vextractf128` the low element of the upper 128, but those options are both worse. – Peter Cordes Jun 04 '16 at 00:12
  • But `_mm_extract_ps` is *not* `shufps`. It's the intrinsic for `extractps`, and the result is an `int`. You can [search on instruction names in Intel's intrinsics guide.](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=shufps&expand=1048,1048,1356,2263,2936,1798,2377&techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX). If you're lucky, a compiler might optimize `reinterpret_cast<float&>(_mm_extract_ps(v, 1))` to a `shufps` or `vpermilps`, but it might instead emit `extractps` / `movd`. The _MM_EXTRACT_FLOAT macro discussed in comments on Paul's answer may be convenient. – Peter Cordes Jun 04 '16 at 00:17
  • @MSalters: BTW, I don't like Intel's intrinsic names. They're WAY too long to type, and less memorable than asm mnemonics. Especially having to type `_mm256` or `__m256` all the time is ridiculous, and `epi32` vs. `epi8` could have been more compact. But even besides that noise, I don't like the actual names they choose. They're different from the asm mnemonics, and the intrinsic for `vpermilps` for example is just `_mm_permute_ps` or `_mm_permutevar_ps`. It's hard to remember whether that's `shufps`, `vpermilps`, or AVX2 cross-lane `vpermps`. – Peter Cordes Jun 04 '16 at 00:23
  • Agner Fog's vector class library is pretty good, and gcc defines `__m256` as a gcc native vector type, so you can do simple ops like `a + b` on vector types (for float anyway. For integer, you need Agner's VCL or something to pick the right element size). – Peter Cordes Jun 04 '16 at 00:25
2

(Unfinished answer. Posting anyway in case it helps anyone, or in case I come back to it. Generally if you need to interface with scalar that you can't vectorize, it's not bad to just store a vector to a local array, and then reload it one element at a time.)

See my other answer for asm details. This answer is about the C++ side of things.


void foo(__m256 v) {
    alignas(32) float vecbuf[8];   // 32-byte aligned array allows aligned store
                                   // avoiding the risk of cache-line splits
    _mm256_store_ps(vecbuf, v);

    float v0 = _mm_cvtss_f32(_mm256_castps256_ps128(v));  // the bottom of the register
    float v1 = vecbuf[1];
    float v2 = vecbuf[2];
    ...
    // or loop over vecbuf[i]
    // if you do need all 8 elements one at a time, this is a good way
}

or loop over vecbuf[i]. A vector store can forward to a scalar reload of one of its elements, so this only introduces about 6 cycles of latency, and multiple reloads can be in flight at once. (So it's very good for throughput on modern CPUs that can do two loads per clock.)

Note that I avoided reloading the low element; the low element of a vector in a register already is a scalar float. _mm_cvtss_f32( _mm256_castps256_ps128(v) ) is simply how you keep the compiler's type system happy; it compiles to zero asm instructions and so it's literally free (barring missed-optimization bugs). (See Intel's intrinsics guide.) XMM registers are the low 128 bits of the corresponding YMM register, and scalar float / double are the low 32 or 64 bits of an XMM register. (Garbage in the upper half doesn't matter.)

Extracting the low element first this way gives OoO exec something to do while waiting for the rest to arrive. You might consider shuffling to get a 2nd element with vunpckhps or vmovhlps on the low 128 bits, so you have 2 elements ready quickly, if that helps fill the latency bubble.
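
For example (a sketch; vmovhlps here grabs element 2, and any second element is fine for this purpose):

__m128 lo = _mm256_castps256_ps128(v);
float v0 = _mm_cvtss_f32(lo);                     // free: already in place
float v2 = _mm_cvtss_f32(_mm_movehl_ps(lo, lo));  // one vmovhlps: element 2 ready early
// v1 and the rest can come from the vecbuf reloads above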

In GNU C/C++, you can index a vector type like an array, with v[1] or even a variable index like v[i]. The compiler will choose between shuffle or store/reload.
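
For example, this compiles with gcc/clang but not MSVC (a minimal sketch; get_elem is a hypothetical name):

float get_elem(__m256 v, int i) {
    return v[i];   // gcc picks a shuffle for constant i, store/reload for variable i
}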

But this isn't portable to MSVC which defines __m256 in terms of a union with some named members.

Storing to an array and reloading is portable, and compilers can sometimes even optimize it into a shuffle. (If you don't want that, check the generated asm.)

e.g. clang optimizes a function that just returns vecbuf[1] into a simple vshufps. https://godbolt.org/z/tHJH_V
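
Presumably something along these lines (a reconstruction, not copied from the link):

float elem1(__m256 v) {
    alignas(32) float vecbuf[8];
    _mm256_store_ps(vecbuf, v);
    return vecbuf[1];    // clang sees through the store and emits a single vshufps
}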


If you actually want to add up all the elements of a vector into a scalar total, shuffle and SIMD add: see Fastest way to do horizontal float vector sum on x86.

(Same for multiply, min, max or other associative reductions over the elements of a single vector. Of course if you have multiple vectors, do vertical ops down to one vector, like _mm256_add_ps(v1,v2))
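
For example, an AVX1 horizontal sum along the lines of what that Q&A recommends (a sketch, not copied from there; hsum256_ps is a hypothetical name):

float hsum256_ps(__m256 v) {
    __m128 lo   = _mm256_castps256_ps128(v);
    __m128 hi   = _mm256_extractf128_ps(v, 1);
    __m128 sum  = _mm_add_ps(lo, hi);          // add the two 128-bit halves
    __m128 shuf = _mm_movehdup_ps(sum);        // duplicate odd elements (SSE3)
    sum  = _mm_add_ps(sum, shuf);
    shuf = _mm_movehl_ps(shuf, sum);           // high pair down to the low pair
    sum  = _mm_add_ss(sum, shuf);
    return _mm_cvtss_f32(sum);
}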


Using Agner Fog's Vector Class Library, his wrapper classes overload operator[] to work exactly the way you'd expect, even for non-constant args. This often compiles to a store/reload, but it makes it easy to write the code in C++. With optimization enabled, you'll probably get decent results. (except the low element might get stored/reloaded, instead of just getting used in place. So you might want to special-case vec[0] into _mm_cvtss_f32(vec) or something.)

(VCL used to be licensed under the GPL, but the current version is now a simple Apache license.)

See also my github repo with mostly-untested changes to Agner's VCL, to generate better code for some functions.
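
Usage is then as simple as you'd hope (a sketch assuming VCL's Vec8f wrapper, which converts to/from __m256; print_all is a hypothetical name):

#include "vectorclass.h"   // Agner Fog's VCL

void print_all(Vec8f v) {
    for (int i = 0; i < 8; i++)
        print(v[i]);       // operator[] is a read-only extract
}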


There's a _MM_EXTRACT_FLOAT wrapper macro, but it's weird and only defined with SSE4.1. I think it's intended to go with SSE4.1 extractps (which can extract the binary representation of a float into an integer register, or store to memory). gcc does compile it into an FP shuffle when the destination is a float, though. Be careful that other compilers don't compile it to an actual extractps instruction if you want the result as a float, because that's not what extractps does. (That is what insertps does, but a simpler FP shuffle would take fewer instruction bytes. e.g. shufps with AVX is great.)

It's weird because it takes 3 args: _MM_EXTRACT_FLOAT(dest, src_m128, idx), so you can't even use it as an initializer for a float local.


To loop over a vector

gcc will unroll a loop like that for you, but only with -O1 or higher. At -O0, it will give you an error message.

float bad_hsum(__m128 & fv) {
    float sum = 0;
    for (int i=0 ; i<4 ; i++) {
        float f;
        _MM_EXTRACT_FLOAT(f, fv, i);  // works only with -O1 or higher
        sum += f;
    }
    return sum;
}
Peter Cordes
  • Are you aware of anything like VCL but that isn't copyleft licensed? – BeeOnRope Dec 22 '18 at 17:18
  • @BeeOnRope: Chuck Walbourn's DirectXMath is I think similar: https://github.com/Microsoft/DirectXMath. But with more emphasis on matrix math and functions that can't map to a single instruction. – Peter Cordes Dec 22 '18 at 17:25
  • Maybe this reasoning is stupid but, according to this (https://godbolt.org/z/h4M94z), extracting the value with (i.e. vecbuf[i]) or without (i.e. v[i]) first storing it into an array produces the same ASM code. Nonetheless, it is probable that I am missing something or that I am not considering other side effects of indexing an AVX register. In that case, what implications does that approach have? – horro Sep 30 '20 at 17:55
  • Update: Agner Fog's VCL is now Apache licensed. Apparently that was a for-pay option before, but I don't remember ever noticing him advertising you could get a copy under a different license. – Peter Cordes Feb 21 '23 at 20:07
2
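
The idea: extract the 128-bit lane holding element i, shuffle that element down to the bottom of the xmm register, then read it with _mm_cvtss_f32:
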
    float valueAVX(__m256 a, int i){

        float ret = 0;
        switch (i){

            case 0:
//                 a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)      ( a3, a2, a1, a0 )
// cvtss_f32             a0 

                ret = _mm_cvtss_f32(_mm256_extractf128_ps(a, 0));
                break;
            case 1: {
//                     a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)     lo = ( a3, a2, a1, a0 )
// shuffle(lo, lo, 1)      ( - , a3, a2, a1 )
// cvtss_f32                 a1 
                __m128 lo = _mm256_extractf128_ps(a, 0);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 1));
            }
                break;
            case 2: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)   lo = ( a3, a2, a1, a0 )
// movehl(lo, lo)        ( - , - , a3, a2 )
// cvtss_f32               a2 
                __m128 lo = _mm256_extractf128_ps(a, 0);
                ret = _mm_cvtss_f32(_mm_movehl_ps(lo, lo));
            }
                break;
            case 3: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 0)   lo = ( a3, a2, a1, a0 )
// shuffle(lo, lo, 3)    ( - , - , - , a3 )
// cvtss_f32               a3 
                __m128 lo = _mm256_extractf128_ps(a, 0);                    
                ret = _mm_cvtss_f32(_mm_shuffle_ps(lo, lo, 3));
            }
                break;

            case 4:
//                 a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)      ( a7, a6, a5, a4 )
// cvtss_f32             a4 
                ret = _mm_cvtss_f32(_mm256_extractf128_ps(a, 1));
                break;
            case 5: {
//                     a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)     hi = ( a7, a6, a5, a4 )
// shuffle(hi, hi, 1)      ( - , a7, a6, a5 )
// cvtss_f32                 a5 
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 1));
            }
                break;
            case 6: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)   hi = ( a7, a6, a5, a4 )
// movehl(hi, hi)        ( - , - , a7, a6 )
// cvtss_f32               a6 
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_movehl_ps(hi, hi));
            }
                break;
            case 7: {
//                   a = ( a7, a6, a5, a4, a3, a2, a1, a0 )
// extractf(a, 1)   hi = ( a7, a6, a5, a4 )
// shuffle(hi, hi, 3)    ( - , - , - , a7 )
// cvtss_f32               a7 
                __m128 hi = _mm256_extractf128_ps(a, 1);
                ret = _mm_cvtss_f32(_mm_shuffle_ps(hi, hi, 3));
            }
                break;
        }

        return ret;
    }
  • For case 3 and 7, you should use `_mm_shuffle_ps(tw, tw, 3)` instead of using two shuffles. (`vshufps` is faster than `vmovhlps` + `vshufps`). `movehl_ps` is good for case 2 and 6, though: saves 1 byte of code size because it doesn't need an immediate. If you had AVX2, case 6 could be done with a `vpermpd ymm, ymm, imm8` immediate shuffle. And cases 5 and 7 could be done with a `vpermps` lane-crossing shuffle if you had a shuffle control vector in a register. – Peter Cordes Dec 22 '18 at 06:12
  • While this code snippet may be the solution, [including an explanation](//meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – HMD Dec 22 '18 at 06:16
0

On Visual Studio, I tried the following:

// Single-precision floats: 32 bits (1 sign, 8 exponent, 23 mantissa)
__m256 _zd = { 17.236f, 19.336f, 72.35f, 47.391f, 8.354f, 9.336f };
__asm nop;
float (*ArrPtr)[8] = (float(*)[8])&_zd;
std::cout << *(*ArrPtr) << " Extracted values " << *((*ArrPtr) + 1) << std::endl;

  • This has strict-aliasing undefined behaviour on other compilers, like GCC and clang, unless you use `-fno-strict-aliasing`. At least with `int*` pointing at a `__m256i`, as in [GCC AVX __m256i cast to int array leads to wrong values](https://stackoverflow.com/q/71364764). This might work in practice on GCC/clang for `float*` pointing into `typedef float __m256 __attribute__((vector_size(32)))` since the element type is actually `float`. See also [print a __m128i variable](https://stackoverflow.com/q/13257166) – Peter Cordes Jul 12 '23 at 04:29
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 13 '23 at 18:41