The asm function strlen receives a pointer to a string (a char array) and returns its length. To do so, the function may use SIMD within general-purpose registers, but must not use xmm registers or SSE instructions.

The function uses bit manipulation to check, in 8-byte steps, whether the string contains the zero byte that marks its end. If it does, it then checks byte by byte for the zero. It works, but I get page fault exceptions, which I don't know how to solve, since the function iterates in single-byte steps in the last loop at the end.

        ; rdi <- const char * (pointer to the string)
        ; rax <- counter + return value
        ; rbx <- current 8 bytes of the string
        ; rcx <- bitmask
        ; rsi <- scratch copy for the calculation
       
    
    strlen:
        PUSH rbx
        PUSH rbp
        MOV  rbp,rsp
        SUB rsp,8      ; calling convention
    
        XOR rax,rax
        MOV rcx,0x7F7F7F7F7F7F7F7F
    
    while_parallel: ;checks if next 8 byte contains end - zero 
        MOV rbx,[rdi+rax]
        MOV rsi,rbx
    
        AND rsi,rcx
        ADD rsi,rcx
        OR  rsi,rbx
        OR  rsi,rcx
        NOT rsi
    
        CMP rsi, qword 0
        JNE while_single
        ADD rax,8
        JMP while_parallel
    
    while_single:  ;check if next byte is zero
        CMP [rdi+rax],byte 0
        JE  end
        INC rax
        JMP while_single
    end:
        ADD rsp,8
        POP rbp
        POP rbx
        RET

Note that I must not use any SSE instructions or xmm registers.

Peter Cordes
HeapUnderStop
  • You read 8 bytes at a time, from a possibly unaligned address. That may very well cross into an unmapped page. If the input is not known to be aligned, then you should do single-byte reads until it is. – Jester Jun 02 '23 at 23:45
  • You could round down the address at the start and then ignore the "garbage bytes", but that's more annoying to work out than just going byte-by-byte until the address is aligned. – harold Jun 02 '23 at 23:50
  • Techniques for handling unaligned addresses that might be within 7 bytes of an unmapped page are the same for SWAR as for `strlen` in general, whether you go a word at a time or unroll to 64 bytes at a time with multiple SIMD vectors. See [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739). See also [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/q/57650895), which quotes glibc's plain-C fallback bithack strlen, which also has to get to an alignment boundary. – Peter Cordes Jun 03 '23 at 00:42
  • You don't need to `sub rsp, 8`; this is a leaf function so there are no function calls that would need a 16-byte-aligned RSP. Also, you don't need to `push` anything: you don't use RBP for anything, and you could use call-clobbered RDX or R8 instead of call-preserved RBX. Also, you should arrange your loop so a `jcc while_parallel` is at the bottom, instead of running an extra `jmp`. This might mean undoing an `add rax, 8` after falling out of the loop. Instead of `NOT rsi` / `cmp rsi, 0`, just do `cmp rsi, -1`. – Peter Cordes Jun 03 '23 at 00:50
  • Also, as discussed in comments on Soonts's answer on your last question, a 64-bit version of the `(((v) - 0x01010101UL) & ~(v) & 0x80808080UL)` bithack should be fewer operations, and still avoids any false-positives. But it would need two constants, so startup overhead is somewhat worse for short strings. Both would benefit from BMI1 for `andn`. In this version, you could invert the constant for a copy-and-and instead of `mov`/`and`. – Peter Cordes Jun 03 '23 at 00:51
  • [parallel-processing] is about multi-threading, not SIMD / SWAR. What you're doing is SWAR, a special case of SIMD: https://en.wikipedia.org/wiki/SWAR. It is taking advantage of data parallelism by doing multiple operations at once, but when people say "parallel processing", they mean something else. That's why I retagged this and your previous question. – Peter Cordes Jun 03 '23 at 00:54
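
The alignment advice in the comments above can be sketched roughly as follows. This is a minimal illustration in NASM syntax, assuming the System V AMD64 calling convention, and not a definitive implementation: the name `strlen_swar` and its local labels are made up for this sketch. It steps byte by byte until the pointer is 8-byte aligned, then applies the same zero-byte bit manipulation as the question to aligned qwords, which cannot straddle a page boundary. It also uses only call-clobbered registers, so no `push`/`pop` or `sub rsp, 8` is needed, and it puts the conditional branch at the bottom of the loop with `cmp rsi, -1` in place of `NOT` plus a compare against zero.

```nasm
; Sketch only: strlen_swar and the .labels are hypothetical names.
; Assumes System V AMD64: rdi = const char *s, length returned in rax.
; rax = cursor, rcx = mask, rdx = current qword, rsi = scratch.
        global  strlen_swar
        section .text

strlen_swar:
        mov     rax, rdi                ; cursor starts at the string

.align_loop:                            ; byte steps until rax is 8-byte aligned
        test    al, 7
        jz      .aligned
        cmp     byte [rax], 0
        je      .done                   ; terminator found before alignment
        inc     rax
        jmp     .align_loop

.aligned:
        mov     rcx, 0x7F7F7F7F7F7F7F7F
        sub     rax, 8                  ; pre-bias so the loop can add first

.qword_loop:                            ; aligned loads stay within one page
        add     rax, 8
        mov     rdx, [rax]
        mov     rsi, rdx
        and     rsi, rcx
        add     rsi, rcx
        or      rsi, rdx
        or      rsi, rcx                ; each byte: 0xFF if nonzero, 0x7F if zero
        cmp     rsi, -1
        je      .qword_loop             ; all ones: no terminator in this qword

.byte_loop:                             ; terminator is in this qword; locate it
        cmp     byte [rax], 0
        je      .done
        inc     rax
        jmp     .byte_loop

.done:
        sub     rax, rdi                ; length = end pointer - start pointer
        ret
```

The aligned loads can still read up to 7 bytes past the terminator, but only inside a page that already holds at least one byte of the string, so they should not fault.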
