
par2 has a small and fairly clean C++ codebase, which I think builds fine on GNU/Linux, OS X, and Windows (with MSVC++).

I'd like to incorporate an x86-64 asm version of the one function that takes nearly all the CPU time. (mailing list posts with more details. My implementation/benchmark here.)

Intrinsics would be the obvious solution, but gcc doesn't generate good enough code for getting one byte at a time from a 64bit register for use as an index into a LUT. I might also take the time to schedule instructions so each uop cache line holds a multiple of 4 uops, since uop throughput is the bottleneck even when the input/output buffer is a decent size.
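To make the access pattern concrete, here is a minimal C++ sketch of the kind of loop being described: load 64 bits at once, then peel off one byte at a time and use each byte as a table index. The function name `lut_xor_words`, its parameters, and the two 256-entry tables are hypothetical and are not taken from par2; the sketch also assumes a little-endian target such as x86-64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical kernel: for every 16-bit word,
//   dst[i] ^= lo[low byte of src[i]] ^ hi[high byte of src[i]]
// Four words are fetched with a single 64-bit load, then each byte is
// extracted from the register and used as a LUT index.
void lut_xor_words(std::uint16_t* dst, const std::uint16_t* src,
                   std::size_t nwords,
                   const std::uint16_t* lo, const std::uint16_t* hi)
{
    assert(nwords == 0 || (dst != nullptr && src != nullptr));

    std::size_t i = 0;
    for (; i + 4 <= nwords; i += 4) {
        std::uint64_t chunk;
        std::memcpy(&chunk, src + i, sizeof chunk);   // one 64-bit load

        for (int w = 0; w < 4; ++w) {
            const unsigned l = static_cast<unsigned>(chunk & 0xff);
            const unsigned h = static_cast<unsigned>((chunk >> 8) & 0xff);
            dst[i + w] = static_cast<std::uint16_t>(dst[i + w] ^ lo[l] ^ hi[h]);
            chunk >>= 16;                             // next word (little-endian order)
        }
    }

    // Tail: fewer than four words left.
    for (; i < nwords; ++i)
        dst[i] = static_cast<std::uint16_t>(dst[i] ^ lo[src[i] & 0xff] ^ hi[src[i] >> 8]);
}
```

How well a compiler turns the inner byte extraction into `movzx`/shift sequences is exactly the part that may be worth hand-writing.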

I'd prefer not to introduce a build-dependency on yasm, since many people have gcc installed, but not yasm.

Is there a way to write a function in asm in a separate file that gcc / clang and MSVC can assemble? The goals are:

  • no extra software as a build-dep. (no YASM).
  • only one version of each asm function. (no maintaining MASM & AT&T versions of the same code.)

Par2cmdline's build system is autoconf/automake for Unix, and an MSVC .sln for Windows.

I know the GNU assembler has a .intel_syntax noprefix directive, but that only changes instruction formats, not other assembler directives, e.g. .align 16 vs. align 16. My code is fairly simple and small, so it would be OK to work around the different directives with C-preprocessor #defines, if that can work.

I'm assuming that doing CPU-detection and setting a function pointer based on the result shouldn't be a problem in C++, even if I have to use some #ifdef conditional compilation for that.
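A minimal sketch of that dispatch, under stated assumptions: the names `process_fn`, `process_generic`, `process_fast_stub`, `select_process` and `process` are hypothetical, the XOR body is only a placeholder for the real kernel, and SSSE3 is chosen purely as an example feature. It uses `__builtin_cpu_supports` on gcc/clang and `__cpuid` from `<intrin.h>` on MSVC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // __cpuid
#endif

// Hypothetical signature of the hot function.
using process_fn = void (*)(std::uint8_t* dst, const std::uint8_t* src, std::size_t len);

// Portable fallback, always available. Placeholder body, not the par2 kernel.
static void process_generic(std::uint8_t* dst, const std::uint8_t* src, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i)
        dst[i] ^= src[i];
}

// Stands in for the asm routine, which would live in a separate file.
static void process_fast_stub(std::uint8_t* dst, const std::uint8_t* src, std::size_t len)
{
    process_generic(dst, src, len);
}

// True if the CPU reports SSSE3; false on compilers/targets not handled here.
static bool cpu_has_ssse3()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") != 0;
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, 1);
    return (regs[2] & (1 << 9)) != 0;   // CPUID leaf 1, ECX bit 9 = SSSE3
#else
    return false;
#endif
}

// Pick an implementation once, based on the running CPU.
process_fn select_process()
{
    return cpu_has_ssse3() ? process_fast_stub : process_generic;
}

// Public entry point: resolves the pointer on first call (thread-safe since C++11).
void process(std::uint8_t* dst, const std::uint8_t* src, std::size_t len)
{
    static const process_fn fn = select_process();
    assert(fn != nullptr);
    fn(dst, src, len);
}
```

Callers only ever see `process()`, so a `./configure` switch that disables asm would just make `select_process()` return the generic version unconditionally.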

If there isn't a solution to what I'm hoping for, I'll probably introduce a build-dependency on yasm and have a ./configure --no-asm option to disable asm speedups for people building on x86 without yasm available.

My preferred plan for handling the different calling conventions in the Windows and Linux ABIs was to use __attribute__((sysv_abi)) on the C prototypes for my asm functions. Then I only have to write the function prologue for the SysV ABI. Does MSVC have anything like that, which would put args into regs according to the SysV ABI for certain functions? (BTW, this tickled a compiler bug, so be careful with this idea if you want your code to work with current gcc.)
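As a rough illustration of the prototype idea, the attribute can be hidden behind a macro so that the same header compiles everywhere. The macro name `ASM_ABI` and the function `asm_sum3` are hypothetical; the function is defined in C++ here only so the sketch is self-contained, whereas in the real project the definition would be the asm routine.

```cpp
#include <cstdint>

// Force the SysV calling convention where the compiler supports it.
// gcc and clang accept this attribute on x86-64 (including when targeting
// Windows); elsewhere the macro expands to nothing and the platform's
// native convention is used.
#if defined(__GNUC__) && defined(__x86_64__)
#  define ASM_ABI __attribute__((sysv_abi))
#else
#  define ASM_ABI
#endif

// Prototype as the C++ callers would see it. With ASM_ABI in effect the
// arguments arrive in rdi, rsi, rdx regardless of the host OS.
ASM_ABI std::uint64_t asm_sum3(std::uint64_t a, std::uint64_t b, std::uint64_t c);

// Stand-in definition for this sketch only.
ASM_ABI std::uint64_t asm_sum3(std::uint64_t a, std::uint64_t b, std::uint64_t c)
{
    return a + b + c;
}
```

With MSVC the macro is empty, so the call uses the Microsoft x64 convention and the asm side would still need a Windows-specific prologue.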

Peter Cordes
  • XY problem - _gcc doesn't generate good enough code for getting one byte at a time from a 64bit register for use as an index into a LUT_ you should really ask this and provide code. – Jester Jul 30 '15 at 21:34
  • @Jester: I went off-topic in that gcc bug report I linked in the new last paragraph I added. Have a look if you want. I should really get around to reporting that performance bug separately. Actually, just compile https://github.com/pcordes/par2-asm-experiments/blob/master/intrin-pinsrw.c – Peter Cordes Jul 30 '15 at 21:36
  • I don't think this is even possible. So I'd go with YASM, people can deal with it. Wouldn't be the first time they had to install something to build some software. – harold Jul 30 '15 at 21:45
  • I don't think it would be practical to do what you're suggesting. Also, YASM is maybe the fourth most likely assembler to be already installed on someone's machine. It would make more sense to use the GNU assembler (or even MASM) because you'll know at least some of your users will need to have it installed. Finally, no, there's no way to tell MSVC to use the System V calling convention. Why would there be? It doesn't target any system that uses it. It would make more sense to use the Microsoft ABI since GCC and clang do support it. – Ross Ridge Jul 30 '15 at 22:02
  • @Jester: I stopped being lazy and reported the gcc problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072 – Peter Cordes Jul 30 '15 at 22:20
  • @RossRidge: I think I liked the SysV ABI since it put the args I wanted in the regs I wanted them in, saving a couple mov instructions (in a function with a moderate-size loop that runs several thousands of times :P). You're probably right that using the MS ABI would make portability easiest. – Peter Cordes Jul 30 '15 at 22:23
  • FWIW I have the same problem, and for now I settled on simply including a binary copy of NASM in my project to avoid the need for any external software install. The licenses for both YASM and NASM seem to be compatible with this. Of course, this only works for the platforms for which I include binaries, and assumes a make system smart enough to be able to call the embedded binaries. It does have the advantage of making the build pretty reproducible since it doesn't use a random version of the assembler the user may have. – BeeOnRope Jun 21 '16 at 23:26
  • For the calling convention thing, I've just been adding a Windows "thunk" which moves the arguments into the expected place and then calls the Linux version. It's almost free, at least when reg-reg `mov`s are almost free. I considered using the symbolic approach to just remap the registers (so each platform would use different registers), but decided against it because it complicates things when you choose additional registers (which may conflict on only one platform) and when dealing with register asymmetries (e.g., longer encodings for certain regs, etc). – BeeOnRope Jun 21 '16 at 23:33
  • @BeeOnRope: if you put the Windows register-move prologue directly before the SysV version, it can just fall through into the rest of the function, saving a jump. – Peter Cordes Jun 22 '16 at 04:59
  • Yes, that's what I did - I was using "calls" loosely above, it doesn't use an actual `call`. Perhaps it can't correctly be called a thunk either. Loose lips sink ships... – BeeOnRope Jun 22 '16 at 05:18

1 Answer


While I have no good solution to remove the dependency on a particular assembler, I do have a suggestion on how to deal with the two different 64-bit calling conventions: Microsoft x64 versus the SysV ABI.

The lowest common denominator is the Microsoft x64 calling convention, since it can only pass the first four values by register. So if you limit yourself to this and use macros to define the registers, you can easily make your code compile for both Unix (Linux/BSD/OSX) and Windows.

For example, look at the file strcat64.asm in Agner Fog's asmlib:

%IFDEF  WINDOWS
%define Rpar1   rcx                    ; function parameter 1
%define Rpar2   rdx                    ; function parameter 2
%define Rpar3   r8                     ; function parameter 3
%ENDIF
%IFDEF  UNIX
%define Rpar1   rdi                    ; function parameter 1
%define Rpar2   rsi                    ; function parameter 2
%define Rpar3   rdx                    ; function parameter 3
%ENDIF

        push    Rpar1                  ; dest
        push    Rpar2                  ; src
        call    A_strlen               ; length of dest
        push    rax                    ; strlen(dest)
        mov     Rpar1, [rsp+8]         ; src
        call    A_strlen               ; length of src
        pop     Rpar1                  ; strlen(dest)
        pop     Rpar2                  ; src
        add     Rpar1, [rsp]           ; dest + strlen(dest)
        lea     Rpar3, [rax+1]         ; strlen(src)+1
        call    A_memcpy               ; copy
        pop     rax                    ; return dest
        ret

;A_strcat ENDP

I don't think four registers is really a limitation, because if you're writing something in assembly it's because you want the best efficiency, in which case the function calling overhead should be negligible compared to the function itself, so pushing/popping some values to/from the stack if you need to when calling the function should not make a difference in performance.

Z boson
  • Yeah, the par2 use-case spends a very long time per-call, and is also a leaf function so I don't need to resort to a stack-args ABI. I'd have to define 5 macros for each ABI, though, to cover all the regs. (Otherwise, I couldn't use rdi as a temporary if it might be the same register as Rpar1.) Macro-names for regs should actually make my code more readable, since I tried to keep the prologue small to reduce code footprint, so I'm using rdi for something other than the dest buffer, and stuff like that. – Peter Cordes Sep 25 '15 at 17:04
  • @PeterCordes, what five ABI's are you referring to? – Z boson Sep 27 '15 at 09:26
  • for both ABIs, I'll need symbolic-name macros for all registers used by either ABI. Symbolic macro-names for registers is a nice idea anyway, since I was using `rdi` for something other than the dest buffer, and stuff like that. I felt like that hurt the human-readability of the code, but I hate sacrificing insn bytes in the function prologue (and everywhere an extra rex prefix would be needed) just for that. – Peter Cordes Sep 27 '15 at 09:38
  • @PeterCordes, I hope you don't mind this comment, but many of your answers and comments remind me of a quote by Pascal: "I am sorry I have had to write you such a long letter, but I did not have time to write you a short one". – Z boson Sep 27 '15 at 09:45
  • @Z boson: That's fair, and amusing. I totally get it. :) I've seen that quote before, and thought it made perfect sense. I think it has something to do with me having ADHD, and instead of making 5 different answers or comments, I pack things into one. I'm sure there are parts of my exhaustive style that aren't just ADHD, and are just me, also. :P I never know which parts of my ideas are obvious to other people, and which *do* require the long explanation. When I'm editing a post, I do try to delete stuff if I find a more compact way to say something. I did hit the 30k char limit on one.. – Peter Cordes Sep 27 '15 at 09:47
  • @PeterCordes, that's okay, you still make less writing mistakes than me (look at the run-on sentence in my answer). I think many people on SO would fail the Turing test (especially many of the C++ purists). That's interesting because I think in a few years their computers would pass the test. That will become the new canonical example of irony. – Z boson Sep 28 '15 at 07:30
  • Are you in Europe? English is my native language and I can type quickly, so I have few barriers to spewing out tons of text. My sleep schedule doesn't usually line up with my time zone (Nova Scotia, Canada, on GMT-3 in the summer), so I bet some people guess I'm in Europe or something. :P I should really have put this in a chat thread, but I'm lazy. – Peter Cordes Sep 28 '15 at 09:10