
I am quite used to Intel-format inline assembly. Does anyone know how to convert the two AT&T lines in the code below into Intel format? They basically load a local variable's address into a register.

int main(int argc, const char *argv[]){
    float x1[256];
    float x2[256];

    for(int x=0; x<256; ++x){
        x1[x] = x;
        x2[x] = 0.5f;
    }

    asm("movq %0, %%rax"::"r"(&x1[0])); // how to convert to Intel format?
    asm("movq %0, %%rbx"::"r"(&x2[0])); // how to convert to Intel format?

    asm(".intel_syntax noprefix\n"
        "mov rcx, 32\n"
"re:\n"
        "vmovups ymm0, [rax]\n"
        "vmovups ymm1, [rbx]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovups [rax], ymm0\n"
        "add rax, 32\n"
        "add rbx, 32\n"
        "loopnz re"
    );
}

Specifically, loading on-stack local variables using mov eax, [var_a] is allowed when compiled in 32-bit mode. For example,

// a32.cpp
#include <stdint.h>
extern "C" void f(){
    int32_t a=123;
    asm(".intel_syntax noprefix\n"
        "mov eax, [a]"
    );
}

It compiles and links fine:

xuancong@ubuntu:~$ rm -f a32.so && g++-7 -mavx -fPIC -masm=intel -shared -o a32.so -m32 a32.cpp && ls -al a32.so
-rwxr-xr-x 1 501 dialout 6580 Aug 28 09:26 a32.so

However, the same syntax is not allowed when compiled in 64-bit mode:

// a64.cpp
#include <stdint.h>
extern "C" void f(){
    int64_t a=123;
    asm(".intel_syntax noprefix\n"
        "mov rax, [a]"
    );
}

It does not build; the failure comes from the linker:

xuancong@ubuntu:~$ rm -f a64.so && g++-7 -mavx -fPIC -masm=intel -shared -o a64.so -m64 a64.cpp && ls -al a64.so
/usr/bin/ld: /tmp/cclPNMoq.o: relocation R_X86_64_32S against undefined symbol `a' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status

So, is there some way to make this work without using the input/output/clobber lists? After all, simple local variables and function arguments can be accessed directly via mov rax, [rsp+##] or mov rax, [rbp+##] without clobbering any other registers.

xuancong84
  • I don't know about GCC, but with Visual Studio, there is an issue when dealing with 64 bit versus 32 bit addresses that require a compile time switch (large address aware: yes or no). – rcgldr Aug 24 '20 at 15:19
  • What about having `"a" (&x1[0]), "b" (&x2[0])` in the input list of your last block? (You probably also need a clobber list anyway) – user786653 Aug 24 '20 at 15:21
  • Note that even when converting the two lines, the code is still wrong. You are not allowed to simply clobber random registers and you cannot assume that particular registers hold particular values between `asm` statements. Consider using intrinsic functions instead. That said, `movq %0, %%rax` is just `mov rax, offset %0`. Nothing special going on here. – fuz Aug 24 '20 at 16:03
  • Unfortunately your example code demonstrates a number of the reasons why you shouldn't use inline assembly. Aside from the errors that will result from your broken code, your assembly code is less optimal than what a compiler can generate. For such a simple loop you can rely on auto-vectorization optimizations. For more complex code you can use intrinsics instead. – Ross Ridge Aug 24 '20 at 16:15
  • There are *many* things wrong here. The [lockless](https://locklessinc.com/articles/gcc_asm/) tutorial is *very* good, though quite dense to work through. The documentation for gcc's [extended asm](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm) has improved greatly with subsequent releases. If you provided pseudo-code of what you're trying to achieve - preferably with well-documented C code - you might get more help. It's quite possible you may achieve better results with vector intrinsics! – Brett Hale Aug 24 '20 at 18:47
  • Why would you ever manually vectorize with AVX but then use a [slow `loopnz` instruction](https://stackoverflow.com/q/35742570)?!! Especially when it makes no sense to test for `add rbx, 32` having set ZF, just `dec ecx` / `jnz` would be the sane option. If this is the kind of asm you're writing by hand, you should really just switch to intrinsics. As well as inefficient, this is super broken because you don't declare clobbers on registers you modify, or tell the compiler about the memory you read and write. Expect this to break, especially if compiled with optimization enabled. – Peter Cordes Aug 24 '20 at 21:37
  • One bit I might add to what the others have said is to consider why you are using inline asm at all. While it *can* be better than the code generated by the compiler, there's no guarantee of that. Indeed, even if the code *inside* the asm is better, the compiler may have to perform so many adjustments in how it generates the surrounding code to allow for inlining the asm, that any benefit is lost. That's in addition to the other reasons [not to use inline asm](https://gcc.gnu.org/wiki/DontUseInlineAsm). Just because "it's always been like that" is a poor reason to keep doing so. – David Wohlferd Aug 24 '20 at 21:38
  • Also, `vaddps` can use an unaligned memory source operand. You only need one `vmovups`. You're already incrementing two pointers instead of one index so it can [stay micro-fused](https://stackoverflow.com/a/31027695/10461973) in the back-end on Haswell/Skylake CPUs. – Peter Cordes Aug 24 '20 at 21:40
  • @PeterCordes - I want to see better than [this](https://godbolt.org/z/6P51c1). Is this the case where `vaddps` is using an unaligned operand? Can I cut down on loop variables? Re-open, Cordes. You know you want to... – Brett Hale Aug 25 '20 at 07:00
  • @BrettHale: Yeah, GCC folds `_mm256_loadu_ps` into a memory source for `vaddps`. (Which works even without the `and rsp, -32` it uses for efficiency to avoid cache-line splits on any of the loads.) – Peter Cordes Aug 25 '20 at 07:21
  • @BrettHale: I guess I can add [Looping over arrays with inline assembly](//stackoverflow.com/q/34244185) to the list of duplicates. Does this need to be reopened? The Q really seems to be about using Intel-syntax for GNU inline asm, which is easy with gcc but impossible? with clang. But also about using inline asm completely wrong, perhaps incorrectly translated from MSVC `_asm{}` block syntax? There are some duplicates for that, too; it's not a new problem and the OP really just needs to read a tutorial. Their entire approach is not viable so the details of it are uninteresting. – Peter Cordes Aug 25 '20 at 07:22
  • @PeterCordes - that's true. The question is 'not even wrong' with its approach to inline asm. I just got sidetracked by the potential of the AVX encodings - which is more interesting to me than the actual question. – Brett Hale Aug 25 '20 at 07:31
  • @BrettHale: yup, that's happened to me sometimes, too. :P Added a few more duplicates, including [How to access C struct/variables from inline asm?](https://stackoverflow.com/q/32741032) which covers using inline asm completely wrong. – Peter Cordes Aug 25 '20 at 07:34
  • @BrettHale: related to the AVX alignment rabbit-hole you went down: [Aligned and unaligned memory access with AVX/AVX2 intrinsics](https://stackoverflow.com/q/31089502) – Peter Cordes Aug 25 '20 at 07:37
  • @PeterCordes, thanks for your comments, for at least making me learn why loopnz is slower than dec+jnz. But if you can build a piece of faster code, then show to me, I will test it out and give credits to you. FYI, this is only for my testing, in the real application in audio signal processing, both buffer size and buffer pointers are passed in. I can request compiler to allocate aligned memory buffers to make it faster. – xuancong84 Aug 25 '20 at 09:29
  • @xuancong84: Brett already [commented](https://stackoverflow.com/questions/63563740/gcc-inline-assembly-cannot-load-local-variables-address-into-register-in-x64-in?noredirect=1#comment112418132_63563740) with an intrinsics loop that compiles to good asm. I have no interest in writing an inline-asm version of this trivial problem; compilers are already good at it, and saving a uop or two with clever looping tricks isn't going to make much difference, or maybe a bit if data is hot in L1d cache. (And yes, I'd recommend aligning your buffers by 32 or 64.) – Peter Cordes Aug 25 '20 at 09:32
  • Not all of those use 64-bit code, but [How to set gcc to use intel syntax permanently?](https://stackoverflow.com/q/38953951) does, and the concept is identical regardless of 32 vs. 64-bit. Hopefully you don't need someone to spoon-feed you examples. See also [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) re: safely using pointers from inline asm. – Peter Cordes Aug 26 '20 at 00:13
  • @PeterCordes - just edit the question at the end and list those links. It's still the OP's question, but at least your links won't be buried in the comments. – Brett Hale Aug 26 '20 at 12:18
  • @BrettHale: IMO we should re-close this syntax / basics question. Performance tuning should be a separate question. Entangling those two things would lead to a combinatorial explosion of questions that shouldn't be duplicates. Making the same correctness / syntax mistake as an earlier question *while trying to make something different run fast* doesn't stop it from being a duplicate. Also, apparently someone deleted my earlier comment with the list of duplicates. (Or multiple someones flagging it.) Also, I hope everyone can agree that correctness/safety is necessary before optimizing. – Peter Cordes Aug 28 '20 at 01:48
  • @PeterCordes - agreed. OP just doesn't seem to want to listen. – Brett Hale Aug 28 '20 at 03:31
  • @BrettHale: I can't cast a close vote since I already used mine up on the initial closing. You should be able to if you only cast a reopen vote, not a close vote, on this question. – Peter Cordes Aug 28 '20 at 03:35
  • To address the original question: `"mov eax, [a]"` doesn't generate the code you think it does. If you were to review the generated code you would discover that it generates a mov from a memory operand that isn't on the stack. GCC inline assembly doesn't support accessing variables directly. The GCC manual warns that this isn't supported even for global variables (which aren't on the stack). I assume from your question that you may have been developing on MSVC using Microsoft's inline assembly? – Michael Petch Aug 28 '20 at 03:37
  • If you were to use `clang++` and the `-fms-extensions` option you would be able to generate accesses to variables outside the inline assembly. You'd code it like MSVC: `asm { mov rax, [a] }`. GCC doesn't support MS extensions. – Michael Petch Aug 28 '20 at 03:44
  • @MichaelPetch You are right, I had been developing on MSVC's inline assembly. Recently, I attempted to recompile my code on Linux and MacOS. Quite sad to hear that "GCC inline assembly doesn't support accessing variables directly", but that is the answer I am looking for. – xuancong84 Aug 28 '20 at 07:33
  • GCC inline asm *does* support accessing variables directly, just not by doing it that way where the compiler has to parse asm instructions and understand the side effects of every instruction as well as look for C symbol names. Instead you use named-operand syntax like `[x] "+r" (x)` so you can use `%[x]` in your asm template string, and it will use the register where the compiler put your C variable. – Peter Cordes Aug 28 '20 at 07:46
  • @PeterCordes Yup, it should be "GCC inline assembly doesn't support accessing variables directly without using extension IO syntax". 'Extension IO syntax' means the `input:output:clobber` in the end of asm() block. This is a trade-off limitation: doing so facilitates compiler optimization over different CPU architectures, but the programmer will lose some degrees of control over the actual register assignment, for example, if I insist to use RAX and RBX register to receive local variables a and b. – xuancong84 Aug 28 '20 at 09:56
  • *but the programmer will lose some degrees of control over the actual register assignment, for example* - incorrect. For x86 specifically, there are constraint letters for each individual register, like `"a"` for al/ax/eax/rax (depending on type width), or `"D"` for dil/di/edi/rdi. For other ISAs, you can use `register int foo asm("r3")` to make `"r"` constraints pick that register. https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html (scroll down / search for x86) – Peter Cordes Aug 28 '20 at 10:33
  • You have much *more* control and efficiency than MSVC, and don't have to force the compiler to spill your inputs to memory before the asm statement. MSVC inline asm is garbage for wrapping a single insn, but GNU C can do something nearly as good as a builtin with no store/reload round trip for input or output. (Except of course it still defeats constant-propagation, and other optimizations like value-range analysis (e.g. having the compiler know the output must be even, positive, or whatever other useful thing).) – Peter Cordes Aug 28 '20 at 10:34
  • @PeterCordes Thanks, so the solution is put loading/storing in `input:output:clobber` list, right? For example, `::"a"(&x1[0])` if I want to force the compiler to load &x1[0] into RAX. – xuancong84 Aug 29 '20 at 02:56
  • Yes. You can also ask for a pointer as an input and use addressing modes manually inside your inline asm statement if you want to write a whole loop in inline asm. (But beware that you need to tell the compiler about memory you're reading / writing, so it can limit optimization accordingly). [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) / [Looping over arrays with inline assembly](https://stackoverflow.com/q/34244185). Or you can just use intrinsics like a normal person; compilers make fairly good asm. – Peter Cordes Aug 29 '20 at 03:17
  • Thanks @PeterCordes . But how about the output? e.g., if I want to store xmm4's lowest 32-bits into a local variable "float out". The "gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html" does not contain any xmm registers other than xmm0. Also, when I use Yz, it crashes on my MacOS clang compiler but it compiles fine in Linux g++. – xuancong84 Aug 31 '20 at 02:10
  • Use `register float foo asm("xmm4")` to make an "x" constraint pick xmm4. ("x" normally picks *any* XMM register). [Using a specific zmm register in inline asm](https://stackoverflow.com/a/52018348) shows an example. – Peter Cordes Aug 31 '20 at 02:51
  • Thanks @PeterCordes, but I am passing in a float pointer from function argument, now I want to store xmm# into that float pointer so as to return results. – xuancong84 Sep 02 '20 at 01:11
  • So what? `"=x"(foo)` as an output constraint to capture xmm4, then `*ptr = foo;` to get the compiler to emit a store instruction if it doesn't optimize away the store. Or use `"=m"(*ptr)` if you want to actually write a `movss` instruction in the asm template, but normally it's best to leave `mov` instructions to the compiler outside the asm template. – Peter Cordes Sep 02 '20 at 01:55
  • Thanks @PeterCordes . For the float pointer, there are 2 cases: 1. it is pointing to a single float variable, I expect `movss`; 2. it is pointing to a float buffer of size 4 (or 8), I expect `movps xmm#` (or `vmovps ymm#`). – xuancong84 Sep 03 '20 at 02:38
  • It would be much easier to just use intrinsics, but if you read the links in my earlier comments you'll see examples of getting vectors outputs from inline asm statements, or of safely passing in a pointer you can store into. – Peter Cordes Sep 03 '20 at 02:56
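Putting the advice from the comments above into code may help. The literal conversion of the two AT&T statements is only a syntax change and, as the commenters stress, still fragile, since nothing guarantees rax/rbx survive to the next asm statement: with -masm=intel, which the build commands in the question already pass, the template is written in Intel syntax and %0 expands to the register GCC chose, so `asm("mov rax, %0" : : "r"(&x1[0]) : "rax");` is the direct equivalent. The safer pattern the comments describe lets the constraints do the loads instead. A minimal, hedged sketch (hypothetical function g, 64-bit, assuming -masm=intel as in the command lines above; the add/mov bodies are only for illustration):

#include <stdint.h>

extern "C" int64_t g(void)
{
    int64_t a = 123;
    int64_t v;

    // "+a" pins the operand to rax: the compiler itself loads `a` into rax
    // and knows the register is modified, so no stray mov rax, [a] is needed.
    asm("add %0, 1"
        : "+a"(a)
        :
        : "cc");                       // the add writes the flags

    // Named operands with an "m" constraint: %[mem] expands to the real
    // stack reference (e.g. QWORD PTR [rbp-8]) chosen by the compiler.
    asm("mov %[out], %[mem]"
        : [out] "=r"(v)
        : [mem] "m"(a));

    return a + v;
}

Because every register and memory location the statements touch is declared, this keeps working when inlined or built with -O3.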

1 Answer


Great, let us look at the test results:

#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <cstdint>   // int64_t
#include <time.h>
#include <immintrin.h>

#define N 256000000
using namespace std;

void f1a(float *a, float *b, int64_t n){
    asm("movq %0, %%rax"::"r"(a));
    asm("movq %0, %%rbx"::"r"(b));
    asm("movq %0, %%rcx"::"r"(n));

    asm(".intel_syntax noprefix\n"
        "shr rcx, 3\n"
"re:\n"
        "vmovaps ymm0, [rax]\n"
        "vmovaps ymm1, [rbx]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovaps [rax], ymm0\n"
        "add rax, 32\n"
        "add rbx, 32\n"
        "loopnz re"
    );
}

void f1b(float *a, float *b, int64_t n){
    asm("movq %0, %%rax"::"r"(a));
    asm("movq %0, %%rbx"::"r"(b));
    asm("movq %0, %%rcx"::"r"(n));

    asm(".intel_syntax noprefix\n"
        "shr rcx, 3\n"
"re1:\n"
        "vmovaps ymm0, [rax]\n"
        "vmovaps ymm1, [rbx]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovaps [rax], ymm0\n"
        "add rax, 32\n"
        "add rbx, 32\n"
        "dec rcx\n"
        "jnz re1"
    );
}

void f1c(float *a, float *b, int64_t n){
    asm("movq %0, %%rax"::"r"(a));
    asm("movq %0, %%rbx"::"r"(b));
    asm("movq %0, %%rcx"::"r"(n));

    asm(".intel_syntax noprefix\n"
"re2:\n"
        "sub rcx, 8\n"
        "vmovaps ymm0, [rax+rcx*4]\n"
        "vmovaps ymm1, [rbx+rcx*4]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovaps [rax+rcx*4], ymm0\n"
        "jnz re2"
    );
}

void f2a(float *a, float *b, int64_t n){
    for(int i=n-8; i>=0; i-=8) {
        __m256 x8 = _mm256_load_ps(&a[i]);
        __m256 y8 = _mm256_load_ps(&b[i]);
        __m256 s = _mm256_add_ps(x8, y8);
        _mm256_store_ps(&a[i], s);
    }
}

void f2b(float *a, float *b, int64_t n){
    for(int i=(n>>3)-1; i>=0; --i) {
        __m256 x8 = _mm256_load_ps(&a[i*8]);
        __m256 y8 = _mm256_load_ps(&b[i*8]);
        __m256 s = _mm256_add_ps(x8, y8);
        _mm256_store_ps(&a[i*8], s);
    }
}

void f3(float *a, float *b, int64_t n){
    for(int i=n-1; i>=0; --i)
        a[i] += b[i];
}

void test(float *a, float *b, void(*func)(float*, float*, int64_t), const char *name){
    clock_t t;
    printf("Testing %s():", name); fflush(stdout);
    t = clock();
    func(a, b, N);
    printf("%lu\n", clock()-t); fflush(stdout);
}

alignas(64) float x1[N];
alignas(64) float x2[N];

int main(int argc, const char *argv[]){
    printf("Preparing buffer ...");
    fflush(stdout);
    for(int x=0; x<N; ++x){
        x1[x] = x/10.0f;
        x2[x] = 0.5f+1.0f/(x+1);
    }
    printf("Done!\n");
    fflush(stdout);

    test(x1, x2, f3, "warm-up-cache");
    test(x1, x2, f1a, "f1a");
    test(x1, x2, f1b, "f1b");
    test(x1, x2, f1c, "f1c");
    test(x1, x2, f2a, "f2a");
    test(x1, x2, f2b, "f2b");
    test(x1, x2, f3, "f3");

    return 0;
}

Output:

Preparing buffer ...Done!
Testing warm-up-cache():551638
Testing f1a():179409
Testing f1b():159309
Testing f1c():172496
Testing f2a():247539
Testing f2b():245975
Testing f3():520559

Since the inline assembly does not compile with -O3, I have commented out the f1* functions and compiled with -O3. The -O3 test results are as follows:

Testing warm-up-cache():233775
Testing f2a():170199
Testing f2b():187909
Testing f3():181979

The improvement is not that significant on this simple example. However, a solution to the original question is still absent: the suggested-duplicate posts do not contain a 64-bit Intel-syntax solution.
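For completeness, here is a hedged sketch (hypothetical f1d, not benchmarked) of what the comments here and on the question suggest a 64-bit Intel-syntax version could look like: a single asm statement, the pointers and count pinned to rax/rbx/rcx via the "a"/"b"/"c" constraints, a memory-source vaddps, dec/jnz instead of loopnz, and the modified register, flags and memory declared as clobbers. It assumes 32-byte-aligned buffers, n a non-zero multiple of 8, and -masm=intel as in the rest of this post:

void f1d(float *a, float *b, int64_t n){
    int64_t blocks = n >> 3;                           // 8 floats per ymm register
    asm volatile(
        "1:                                       \n\t"
        "vmovaps ymm0, YMMWORD PTR [rax]          \n\t"
        "vaddps  ymm0, ymm0, YMMWORD PTR [rbx]    \n\t" // fold the second load
        "vmovaps YMMWORD PTR [rax], ymm0          \n\t"
        "add     rax, 32                          \n\t"
        "add     rbx, 32                          \n\t"
        "dec     rcx                              \n\t"
        "jnz     1b                               \n\t"
        : "+a"(a), "+b"(b), "+c"(blocks)   // rax/rbx/rcx are all modified
        :
        : "xmm0", "memory", "cc");         // xmm0 covers ymm0; the arrays change; flags
}

Because the operands and clobbers are declared, this survives inlining and -O3, which is exactly what the f1* versions above cannot guarantee.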

xuancong84
  • Rather than belabor the (still) bad asm, I've [tried](https://godbolt.org/z/Yh7bEP) to clean that up a bit so we can see some comparisons (and build using -O3). It's hard to get good timings on godbolt, due to variations in server load. In general, f2a seems to perform better than f1a. f2b seems slightly worse than f1b, but that appears related to how `n` is passed. Replacing with `N` improves performance. Looks like intrinsics are safer, faster, and (way) easier to maintain. BTW: doesn't your asm write moving up and your c++ write moving down? Not exactly apples-apples. – David Wohlferd Aug 25 '20 at 20:13
  • [How to set gcc to use intel syntax permanently?](https://stackoverflow.com/q/38953951) answered the title question and does use x86-64. If you want to fix the totally broken way you're using inline asm, see [GCC Inline-Assembly Error: "Operand size mismatch for 'int'"](https://stackoverflow.com/a/46016608) for example, and [What is the difference between 'asm', '\_\_asm' and '\_\_asm\_\_'?](https://stackoverflow.com/q/3323445). – Peter Cordes Aug 26 '20 at 00:18
  • Once you can write safe inline asm, optimizing this specific loop with intrinsics vs. inline asm could be posted as a separate question, here or on codereview.SE. But this question is about using intel-syntax for GNU C inline asm, and already has answers elsewhere. – Peter Cordes Aug 26 '20 at 00:19
  • Many thanks @DavidWohlferd , so f1b() is still faster by a small margin (about 10%). Intrinsics are not effective here, they don't even perform better than raw C++ loop in f3() under -O3: Preparing buffer ...Done! Testing warm-up-cache():2137 Testing f1a():1983 Testing f1b():1506 Testing f1c():1844 Testing f2a():1700 Testing f2b():1845 Testing f3():1719 – xuancong84 Aug 26 '20 at 02:09
  • Did you try changing from `n` to `N` like I suggested? That looked to make f2b faster than f1b. – David Wohlferd Aug 26 '20 at 03:27
  • Your misuse of extended inline asm contributed to the question being closed in the first place. You cannot make assumptions about what the compiler is doing *in between* separate asm statements, and there are still no correct entries for the `input: output: clobber` specifications. One example: you clobber `%rbx`, which should be 'callee'-saved. It's simply good luck that the program doesn't crash. Myself, and others re-opened the question mostly because the AVX encodings were interesting in themselves. – Brett Hale Aug 26 '20 at 12:04
  • Also - using `clock()` for timing is effectively useless here. The tests are too fast, introducing too much uncertainty. Try running each test 100 times, or 1000 times (for example) and / or use the `rdtsc` time-stamp counter. – Brett Hale Aug 26 '20 at 12:14
  • @BrettHale, I have run the entire test code for a few times and take average. Running core test codes in a loop will introduce some overhead, for example, the compiler might need to preserve rcx register for the outer loop, although it applies to every function, it will distort the time proportion. – xuancong84 Aug 27 '20 at 15:23
  • @BrettHale You are right in saying that "you cannot make assumptions about what the compiler is doing in between separate asm statements". However, if no `input: output: ` is used, the register state will not change between consecutive asm blocks. This is because when you use `input : output :`, sometimes, the compiler needs to use additional registers to load/store complicated variables (e.g., inside a struct/class, etc.). That is why you also need to use clobber to tell compiler to avoid those registers (or preserve their state) when loading/storing variables. – xuancong84 Aug 27 '20 at 15:33
  • @BrettHale Yup, AVX is an interesting topic. What is more interesting is that although according to Wikipedia, all registers other than AX/CX/DX are callee-saved, however in practice, if you clobber AX/BX/CX/DX/SI/DI in your global C function, it will still work almost all the time. In one of my class projects, ( https://sites.google.com/site/xuancong84/music-tempo-visualization ), I have used EBX extensively (even EDI in one function) in my SIMD functions without saving, and it works. So it is not something as simple as luck ^_^ – xuancong84 Aug 27 '20 at 16:01
  • "the register state will not change between consecutive asm blocks" - Says who? There is *absolutely no guarantee of this*. It might happen that way, or it might not. Indeed (per the [docs](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html)): *Do not expect a sequence of asm statements to remain perfectly consecutive after compilation, even when you are using the volatile qualifier. If certain instructions need to remain consecutive in the output, put them in a single multi-instruction asm statement.* If something gets interleaved with your asm, registers can change. – David Wohlferd Aug 28 '20 at 01:36
  • "still work almost all the time" - Who wants code that works "almost" all the time? Follow the rules and your code will work dependably instead of "hoping" that it will run correctly this time. Have you tried using N instead of n in f2b as I suggested? And "walking up" for both the asm and c++ code (instead of up for asm and down for c++)? My expectation is that a correctly written f2b will outperform the asm in f1b. Which would call into question the whole motivation for writing inline asm at all. – David Wohlferd Aug 28 '20 at 01:42
  • @DavidWohlferd: Equally importantly, destroying a bunch of integer regs without telling GCC about it is almost certain to cause hard-to-debug breakage if you call this from somewhere in the middle of a larger program that doesn't just exit right away. Trashing RBX (a call-preserved register) means that even a call/ret boundary for a non-inlined function can't save this code, although it would make the other problems (compiler using registers between statements) more likely to happen-to-work. – Peter Cordes Aug 28 '20 at 01:51
  • Thanks @DavidWohlferd, changing n to N will defeat the purpose of the function, because the size of the large array is often variable in practice. "a correctly written f2b will outperform the asm in f1b", yup I agree. But my habit is that rather than learning all those wrapper rules (compiler intrinsics) and hoping the compiler will generate code as I want, I prefer to write out the code my own ^_^ – xuancong84 Aug 28 '20 at 02:23
  • "changing n to N will defeat the purpose of the function" - Perhaps so. But you are asking the c++ compiler to handle signed values, while allowing the asm to use unsigned. What's the point of comparing performance if the routines are doing different things? As for "learning the rules," the rules for inline asm are many and weird. What's more, you keep ignoring the experts when they tell you what the rules are. If your goal is to have absolute control over the code, write 100% asm routines (possibly calling them from c). It would be way easier than trying to understand inline asm. – David Wohlferd Aug 28 '20 at 02:39
  • In your deleted answer, you wrote "*does anyone know the Intel syntax for the MOV instruction that can load/store a local variable and can be compiled in 64-bit environment?*" - the reason nobody's giving you that is that writing a `mov` instruction inside an `asm` statement is not an efficient way to do that. Use constraints to get the compiler to put the value you want in a register for you, with compiler-generated instructions. Or if you insist on doing it wrong, compile with `-masm=intel` and use `mov rax, %0`. – Peter Cordes Aug 28 '20 at 05:09
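On the timing concern raised above (a single clock() pass is noisy), here is a hedged sketch of the repeat-and-keep-the-best idea, using the TSC via __rdtsc from <x86intrin.h>; the helper name best_of and the repetition count are made up for illustration:

#include <cstdint>
#include <x86intrin.h>

// Run one of the f1*/f2*/f3 functions `reps` times and report the fastest
// pass in TSC ticks, which is far less sensitive to one-off noise than a
// single clock() measurement.
static uint64_t best_of(void (*func)(float*, float*, int64_t),
                        float *a, float *b, int64_t n, int reps)
{
    uint64_t best = ~0ull;
    for (int r = 0; r < reps; ++r) {
        uint64_t t0 = __rdtsc();
        func(a, b, n);
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

A call such as best_of(x1, x2, f2a, N, 100) would then replace the single timed call inside test().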