Given a 64-bit general-purpose register (not an XMM register) on the x64 architecture, filled with one-byte unsigned values, how can I check all of the bytes for a zero value simultaneously, without using SSE instructions?

Is there a way to do this in parallel, without iterating over the register in 4-bit steps?

I tried comparing it against certain 64-bit masks, but that did not work.

Peter Cordes
HeapUnderStop
  • 3
    What is a "8 bit float value"? 1 sign bit, 3 exp-bits, 4 significant? Do you only want to check for `+0.0` or either `+0` or `-0`? – chtz Jun 01 '23 at 13:33
  • Do you only want to check for the existence of a zero, or do you want to know the position(s)? – Simon Goater Jun 01 '23 at 14:23
  • 1
    https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord for the SWAR bithack used in non-SIMD `strlen` implementations. If they're sign/magnitude floats so `-0.0` is possible, AND to clear the sign bits first. It might still be faster to `movq xmm0, rax` / `pcmpeqb xmm0, xmm1` (with xmm1 being zeroed earlier) / `pmovmskb eax, xmm0`, unless you're in kernel code or something where you can't use XMM regs. – Peter Cordes Jun 01 '23 at 14:28
  • If you also want the position of the first zero byte, that's another point in favour of `pcmpeqb` / `pmovmskb`, since you can just `bsf` or `tzcnt`. Glibc's portable C fallback for `strlen` (using a bithack) just compares each byte in sequence once it finds a chunk containing a zero byte, rather than `__builtin_ctz` or anything ([Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/q/57650895) quotes the code) – Peter Cordes Jun 01 '23 at 16:04
  • Please define the representation format for your 8-bit floats, as I'm not aware of a standard for that. – Nate Eldredge Jun 01 '23 at 16:35
  • Correction to my earlier comment: njuffa's answer on [Speed up strlen using SWAR in x86-64 assembly](https://stackoverflow.com/q/76401479) shows that `bsf` or `tzcnt` can be used on the result of the bithack, so it's still usable if you do want the position. – Peter Cordes Jun 05 '23 at 15:42

1 Answer

Technically, you could do something like this:

#include <cstdint>

// True if any of the 8 bytes in the integer is 0
bool anyZeroByte( uint64_t v )
{
    // Compute bitwise OR of 8 bits in each byte
    v |= ( v >> 4 ) & 0x0F0F0F0F0F0F0F0Full;
    v |= ( v >> 2 ) & 0x0303030303030303ull;
    constexpr uint64_t lowMask = 0x0101010101010101ull;
    v |= ( v >> 1 ) & lowMask;
    // Isolate the lowest bit
    v &= lowMask;
    // Now these bits are 0 for zero bytes, 1 for non-zero;
    // Invert that bit
    v ^= lowMask;
    // Now these bits are 1 for zero bytes, 0 for non-zero
    // Compute the result
    return 0 != v;
}

However, SIMD is going to be way faster. SSE is a baseline requirement of the x64 architecture: all AMD64 processors are required to support SSE1 and SSE2. Here's an SSE2 version:

#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

bool anyZeroByteSse2( uint64_t v )
{
    __m128i vec = _mm_cvtsi64_si128( (int64_t)v );
    __m128i zero = _mm_setzero_si128();
    __m128i eq = _mm_cmpeq_epi8( vec, zero );
    return 0 != ( _mm_movemask_epi8( eq ) & 0xFF );
}

That’s 6 instructions instead of 16: link.
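
For comparison, the comments on this page point to the shorter "ZeroInWord" bithack from the Stanford Bit Twiddling Hacks page. Below is a minimal sketch of that expression widened to 64 bits. The function name `anyZeroByteSwar` and the example helper are made up for illustration, and the instruction counts discussed in the comments were not re-measured here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper name; the expression is the "haszero" bithack
// from the Bit Twiddling Hacks page, applied to a 64-bit word.
constexpr bool anyZeroByteSwar( uint64_t v )
{
    constexpr uint64_t ones  = 0x0101010101010101ull;
    constexpr uint64_t highs = 0x8080808080808080ull;
    // Ignoring borrows, subtracting 1 sets a byte's top bit when the byte
    // was 0 (it wraps to 0xFF) or was >= 0x81. ANDing with ~v drops the
    // bytes whose top bit was already set, leaving only the zero bytes.
    // A borrow can only start at a zero byte, so spurious bits in higher
    // bytes never change the "!= 0" answer.
    return ( ( v - ones ) & ~v & highs ) != 0;
}

// Illustrative checks (not part of the original answer).
inline void anyZeroByteSwarExamples()
{
    assert( anyZeroByteSwar( 0x1122334455660088ull ) );   // second-lowest byte is 0
    assert( !anyZeroByteSwar( 0x8080808080808080ull ) );  // top bits set, no zero byte
    assert( !anyZeroByteSwar( 0xFFFFFFFFFFFFFFFFull ) );
}
```

The two constants are the same 64-bit immediates that the comments mention having to materialize in registers, which is where most of the setup cost of this variant goes.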

Soonts
  • 3
    Peter Cordes supplied a link to a four-instruction SWAR implementation: https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord – amonakov Jun 02 '23 at 13:21
  • @amonakov: It's six x86-64 instructions (including BMI1 `andn` which combines `~` and `and`, and a `setne` to materialize the boolean). It also includes two `mov r64, imm64` to materialize the 64-bit constants. Unless you use memory source operands for the constants, like `test rdi, [rel mask_80]`. Then you could get it down to 5 (including a register-copy or a pure load into a new register, since x86 doesn't have `add dst, src, mem` and the 64-bit constant is too big for LEA to copy-and-add.) – Peter Cordes Jun 02 '23 at 20:03
  • But yeah, the no-false-positives bithack is probably about as cheap as SSE2, maybe lower latency but perhaps worse throughput depending on front-end cost of setting up the constants, as long as you don't care where the zero byte is if there is one. The bitmap from `pmovmskb` can be used with `bsf` or `tzcnt`. – Peter Cordes Jun 02 '23 at 20:09
  • @PeterCordes plus, if the test is in a small loop, then constants can be set up just once prior to the loop – amonakov Jun 02 '23 at 20:28
  • 1
    @amonakov: If the test is in a loop, you should probably be using SIMD even if that means `movq` / `movhps` to load two separate 8-byte chunks, if the data's coming from memory. (Then `test al,al` or `test ah,ah` or `test eax, 0xff00` to test 8-bit halves of the 16-bit mask, if you care about which 8-byte chunk had a zero byte.) Or maybe there are use-cases inside loops for the scalar bithack. With constants loaded ahead of time, the `v - 0x0101...` can be done with LEA to copy-and-add, with `-0x0101...` in a register. – Peter Cordes Jun 02 '23 at 20:47
  • @amonakov: OP's followup question ([page fault error with SIMD strlen (using SWAR in integer registers, not SSE)](https://stackoverflow.com/q/76394019)) did put this in a scalar loop, and made a point of avoiding XMM registers. Maybe they're implementing `strlen` in kernel code; that would be a good use-case for this. – Peter Cordes Jun 03 '23 at 00:56
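
Several comments note that a bit scan (`bsf` / `tzcnt`) can be applied to the bithack result when the position of the zero byte matters. A rough sketch of that idea follows. The name `firstZeroByteIndex` and the "return 8 when nothing is found" convention are assumptions made for this example, and `__builtin_ctzll` assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index of the first zero byte in v, counting from
// the least significant byte (index 0), or 8 if no byte is zero.
// Assumes GCC or Clang for __builtin_ctzll.
constexpr unsigned firstZeroByteIndex( uint64_t v )
{
    constexpr uint64_t ones  = 0x0101010101010101ull;
    constexpr uint64_t highs = 0x8080808080808080ull;
    const uint64_t mask = ( v - ones ) & ~v & highs;
    // __builtin_ctzll is undefined for 0, so handle "no zero byte" first
    if( mask == 0 )
        return 8;
    // Bits above the first zero byte may be spurious because of borrow
    // propagation, but the lowest set bit is always bit 7 of the first
    // zero byte, so counting trailing zeros gives its position.
    return static_cast<unsigned>( __builtin_ctzll( mask ) ) / 8;
}

// Illustrative checks (not from the original thread).
inline void firstZeroByteIndexExamples()
{
    assert( firstZeroByteIndex( 0x1122334455660088ull ) == 1 );
    assert( firstZeroByteIndex( 0x0101010101010101ull ) == 8 );
}
```

On little-endian x86-64, the least significant byte of a word loaded from memory is the one at the lowest address, so index 0 corresponds to the first byte of the chunk in a `strlen`-style loop.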