Loading XMM registers from address location

Question

I'm trying to load 128 bits from (and store them back to) memory referenced by a char pointer, using the XMM0 register on a 32-bit operating system.

What I tried is very simple:

int main() {
    char *data = new char[33];
    for (int i = 0; i < 32; i++)
        data[i] = 'a';
    data[32] = 0;
    __asm
    {
        movdqu xmm0,[data]
    }

    delete[] data;
}

The problem is that this doesn't seem to work. The first time I debugged the Win32 application I got:

xmm0 = 0024F8380000000000F818E30055F158

The second time I debugged it I got:

xmm0 = 0043FD6800000000002C18E3008CF158

So there must be something with the line:

movdqu xmm0,[data]

I tried using this instead:

movdqu xmm0,data

but I got the same result.

What I thought was the problem is that I copy the address instead of the data at the address. However the value shown at the xmm0 register is too large for a 32-bit address, so it must be copying memory from another address.

I also tried some other instructions I found on the internet, but with the same result.

Is it the way I'm passing the pointer or am I misunderstanding something about xmm basics?

A valid solution with an explanation will be appreciated.

Even though I found the solution (finally after three hours), I would still like an explanation:

__asm
    {
        push eax
        mov eax,data
        movdqu xmm0,[eax]
        pop eax
    }

Why should I pass the pointer to a 32-bit register?

Can you please try if local variable `char data[33];` instead of new/delete with pointer can be used directly, as in original post with `[data]`? I can't debug now, but I think this may work, as I can imagine the compiled source. What is puzzling me at the moment, what is the C++ difference from `char *data`. From the C++ point of view they look to be equivalent. I'm probably overlooking something. (and in that second version, that `mov eax,data` is compiled to `mov eax,[data]`, right?) — Ped7g, Aug 18 '16 at 14:00
x86 does not have a "memory indirect" addressing mode. You are loading the pointer into `xmm0`. Since `xmm0` is larger than a pointer, you are also reading garbage bytes in memory beyond the end of where the pointer is stored. — Raymond Chen, Aug 18 '16 at 14:06
It works with non-pointers. Yep, it is compiled to mov eax,[data] (Ped7g). Hmm, what is 'data' actually in assembly? Maybe I have mistaken how ASM treats 'data'. I thought 'data' is treated as it's treated in C++. It seems that it refers to the address of the 'data' variable, not to the address stored in the 'data' pointer as it would in C++. That explanation looks logical to me — user2377766, Aug 18 '16 at 14:07
The recommended way of doing this is the [`_mm_loadu_si128`](https://msdn.microsoft.com/en-us/library/f4k12ae8(v=vs.100).aspx) intrinsic. — Raymond Chen, Aug 18 '16 at 14:14
Don't use inline asm, especially not MSVC inline asm. Use intrinsics instead, so the compiler can optimize. Unless you can write a whole loop in asm that's more efficient than what the compiler generates, don't even think of using MSVC inline asm. It only works in 32-bit mode anyway, and if you want performance you normally want x86-64 mode. Some experienced asm devs say [MSVC asm "was never very reliable"](http://stackoverflow.com/questions/3323445/what-is-the-difference-between-asm-and-asm/35959859#comment59576185_35959859). GNU C is safe, but: https://gcc.gnu.org/wiki/DontUseInlineAsm — Peter Cordes, Aug 18 '16 at 16:13
@Peter Mortensen: XMM vs. xmm is not a big deal, there's no standard convention. Assemblers are not case sensitive for register names, either. And BTW, `xmm0` (in code-formatting) was a valid option, but so is all-caps XMM0. (Intel's manuals tend to use all-caps for literal instruction and register names mixed into sentences, which does seem less noisy than SO's code formatting which changes the background colour). — Peter Cordes, Aug 27 '16 at 04:12

chqrlie · Answer 1 · 2023-06-26T05:37:33.263

4

The problem with your code is that data is a pointer. The instruction movdqu xmm0,[data] loads the 16 bytes stored at the address of the variable data into register xmm0: the 4 bytes (8 in a 64-bit build) that make up the pointer value, followed by whatever bytes happen to come next in memory. Because movdqu is the unaligned load, it does not fault on a misaligned address; the aligned form movdqa would fault unless the address were a multiple of 16, and nothing guarantees that alignment here.

The alternative of using an automatic array, char data[33];, would solve the addressing problem (movdqu would then load the array's contents), but it says nothing about alignment. movdqu tolerates any address, so this only matters if you switch to the aligned form movdqa, which faults on an address that is not a multiple of 16. Again, nothing guarantees how the compiler aligns an array with automatic storage.

The solution you found is probably a good approach, but unlike malloc(), I am not sure whether the pointer returned by new is suitably aligned for anything larger than the specified type. Furthermore, neither new nor malloc() guarantees the 16-byte alignment required by aligned SIMD instructions such as movdqa. Most systems have memory allocation APIs that ensure 16-byte or wider alignment:

POSIX systems have posix_memalign:

int posix_memalign(void **memptr, size_t alignment, size_t size);

Linux systems also support memalign:

void *memalign(size_t alignment, size_t size);

The preferred solution is the C Standard function aligned_alloc defined in <stdlib.h>, added in C11, but potentially not available on all systems:

void *aligned_alloc(size_t alignment, size_t size);

If this function is available on your system, you could write:

#include <stdlib.h>

int main(void) {
    char *data = aligned_alloc(16, 32);
    for (int i = 0; i < 32; i++) {
        data[i] = 'a';
    }
    __asm {
        mov    eax, data
        movdqu xmm0, [eax]
    }
    free(data);
    return 0;
}

As commented by Peter Cordes, it is much better to use intrinsics for this kind of thing, namely _mm_loadu_si128. There are two primary reasons: first, the inline assembly syntax is not standard and differs from one compiler to another and between 32- and 64-bit builds, so by using intrinsics, your code becomes slightly more portable. Second, the compiler does a relatively poor job of optimizing inline assembly, and in particular, tends to do a lot of pointless memory stores and loads. The compiler does a much better job optimizing intrinsics, which makes your code run faster (which is the whole point in using inline assembly!).

edited Jun 26 '23 at 05:37

answered Aug 18 '16 at 14:23

chqrlie

Don't use push/pop inside an inline-asm statement. MSVC reads your asm and saves/restores any registers you use. More importantly, don't use MSVC inline asm for this at all. You'll get better results with intrinsics. – Peter Cordes Aug 18 '16 at 15:55
@PeterCordes Thanks for the advice, but I'm really starting to get sick seeing 'use intrinsics' everywhere as the purpose of asm is not only for coding but also for codebreaking. I'm really missing the point why VS2016 is disabling x64 inline asm. With all the useless things in this world why would they disable the nature of programming ? Going in this direction people in 22nd century might not even know what programming is just bcs some layer on top of it might be more useful ? – user2377766 Aug 20 '16 at 01:13
@user2377766: If you want to write in asm, just write whole functions in asm and assemble + link them separately. [MSVC's inline-asm syntax makes it impossible to produce efficient code, and the implementation is reportedly not robust](http://stackoverflow.com/questions/3323445/what-is-the-difference-between-asm-and-asm/35959859#comment59577407_35959859). GNU C inline asm is supported on all targets, and doesn't force inputs/outputs to be stored/reloaded, so if you really like inline asm, I highly recommend that. – Peter Cordes Aug 20 '16 at 01:25
@PeterCordes so what ? In the end people will not know what's actually happening under the program they are producing just bcs It's easier not to. What's the diff with the sheep module most of us have already chosen! I agree person might not know anything but I believe that every person must have the opportunity to choose if something is useless for him or not. Not to be chosen for him by a person who either doesn't care to provide it or it's more comfortable not to. So why rev-eng when you can just run the app with no care. Wonder when 'intrinsics' and C++ will be inappropriate – user2377766 Aug 20 '16 at 01:26
@user2377766: WTF are you talking about? I just said that there are lots of good ways to write asm, but MSVC inline asm syntax isn't one of them, so I'm not sad to see it gone. (Of course, I do agree with you that making the implementation suck less and extending the syntax to allow efficient inputs/output without a store/reload would have been even better.) I'm not offended, I just think you completely missed the point of what I said, because I'm a huge fan of asm. It's only MSVC inline asm that sucks, IMO. You obviously didn't have time to read my answer on that link. – Peter Cordes Aug 20 '16 at 01:30
@user2377766 Besides that, knowing asm is still useful even if you don't actually write in it directly. e.g. see my answer on [SSE horizontal sums](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86), and also [hand-holding the compiler into making optimal asm](http://stackoverflow.com/questions/34407437/what-is-the-efficient-way-to-count-set-bits-at-a-position-or-lower/34410357#34410357), based on looking at the asm output and tweaking the source. This avoids all the pitfalls of https://gcc.gnu.org/wiki/DontUseInlineAsm. – Peter Cordes Aug 20 '16 at 01:34
@user2377766: `Wonder when 'intrinsics' and C++ will be inappropriate`: They already are in a lot of use-cases, according to a lot of people. I hate that a lot of software packages these days are bloated piles of crap that uses hundreds of times more memory and CPU than they should (e.g. web browsers, graphical desktops). I wish people would take more time to make software efficient instead of churning out more crap faster. But taking the time to micro-optimize with manual vectorization is often not justified; compilers continue to improve, and large-scale optimizations are important. – Peter Cordes Aug 20 '16 at 01:40
`inline assembly is not supported for 64-bit builds` - inline `asm` is supported in gcc 64-bit. Not sure about visual studio. – annoying_squid Jul 31 '18 at 16:58
_"pointer returned by new is valid for any alignment"_ — luckily, it is good for `sizeof(void*)` alignment. Without at least that, C++ code wouldn't work well on many CPUs. However, for things such as SSE and now AVX, you should instead use a C function such as `posix_memalign()`. – Alexis Wilke Jun 25 '23 at 18:47

score 1 · Accepted Answer · edited Aug 27 '16 at 04:16

#include <cstdio>

int main()
{
    char *dataptr = new char[33];
    char datalocal[33];
    dataptr[0] = 'a';   dataptr[1] = 0;
    datalocal[0] = 'a'; datalocal[1] = 0;
    // %p expects void*, so cast each pointer explicitly
    printf("%p %p %c\n", (void *)dataptr, (void *)&dataptr, dataptr[0]);
    printf("%p %p %c\n", (void *)datalocal, (void *)&datalocal, datalocal[0]);
    delete[] dataptr;
}

Output:

0xd38050 0x7635bd709448 a
0x7635bd709450 0x7635bd709450 a

As we can see, the dynamic pointer dataptr is really a pointer variable (32 or 64 bits wide, stored at 0x7635BD709448) containing a pointer to the heap, 0xD38050.

The local variable is directly a 33 characters long buffer, allocated at address 0x7635BD709450.

But the datalocal works also as a char * value.

I'm a bit confused about what the formal C++ explanation of this is. While writing C++ code, this feels quite natural, and dataptr[0] is the first element in the heap memory (reached through two memory accesses: first the pointer, then the element). In assembler, however, you see the true nature of the symbol dataptr, which is the address of the pointer variable. So you first have to load the heap pointer with mov eax,[data], which loads eax with 0xD38050, and then you can load the contents of 0xD38050 into XMM0 by using [eax].

With a local variable there is no variable with the address of it; the symbol datalocal is already the address of the first element, so movdqu xmm0,[data] will work then.

In the "wrong" case you can still do movdqu xmm0,[data]; it's not a problem of the CPU to load 128 bits from a 32-bit variable. It will simply continue reading beyond the 32 bits and read another 96 bits belonging to other variables/code. In case you are around a memory boundary and this is the last memory page of the application, it will crash on an invalid access.

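Alignment was mentioned a few times in the comments. That's a valid point, although movdqu itself is the unaligned load and accepts any address; it is the aligned form movdqa that requires a 16-byte-aligned address. If you want aligned data, check your C++ compiler's extensions. For Visual Studio this should work: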

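#include <malloc.h>  // _aligned_malloc, _aligned_free (Visual Studio)

__declspec(align(16)) char datalocal[33];
// C++ needs the cast from void*; the argument order is (size, alignment)
char *dataptr = static_cast<char *>(_aligned_malloc(33, 16));
_aligned_free(dataptr);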

About my C++ interpretation: Maybe I got this wrong since the beginning.

The dataptr is the value of the dataptr symbol, that is, that heap address. Then dataptr[0] is dereferencing the heap address, accessing the first element of the allocated memory. &dataptr is the address of the dataptr value. This makes sense also with syntax like dataptr = nullptr;, where you are storing the nullptr value into the dataptr variable, not overwriting the dataptr symbol address.

With datalocal[] there's basically no sense in assigning to the bare datalocal, as in datalocal = 'a';, since it's an array variable, so you should always provide the [] index. And &datalocal is the address of that array. The bare datalocal is then a shortcut for easier pointer math with arrays, etc., converting to the char * type; but even if the bare datalocal were a syntax error, it would still be possible to write C++ code (using &datalocal for the pointer, datalocal[..] for elements), and it would fit that dataptr logic completely.

Conclusion: You had your example wrong since the beginning, because in assembly language [data] is loading the value of data, which is the pointer to the heap returned by new.

This is my own explanation, and now some C++ expert will come and tear it to pieces from a formal point of view... :)))

In most contexts (e.g. [passing as a function arg](http://stackoverflow.com/questions/38800044/what-kind-of-c11-data-type-is-an-array-according-to-the-amd64-abi#comment64984890_38800044), or when used with operators like `+` or `[]`), arrays work like pointers. The address isn't stored anywhere, though; it's more like an immediate constant. Or a compile-time-constant offset from the stack pointer. But a pointer variable *does* actually store a pointer in memory or a register. BTW, `&datalocal` gives a warning, but does compile to the same code as `&datalocal[0]`. https://godbolt.org/g/05S5XS — Peter Cordes, Aug 18 '16 at 16:03
I thought `movdqu` was unaligned access? If so it isn't required to be aligned. If it is known to be aligned then I'd suggest `movdqa` — Michael Petch, Aug 27 '16 at 04:14
