Converting function to bitwise only?

Question

I have a function that counts the leading 1 bits of a 32-bit value. For example, if the number is 11100011111..., the result is 3, because there are 3 ones in the most significant positions before a 0 is hit.

I need to convert the function to use only bitwise operations (no if statements or while loops), and the total number of operations should be fewer than 50.

Question: How can this be converted to bitwise operations only while using fewer than 50 ops?

Here is the Code:

  int count = 0;
  int i = 28;

  while(i >= 0) {
    int temp = (x>>i) & 0xF;
    i-=4;
    if(temp == 0xF) count+=4;
    else {
      int mask = 0x1;
      int a = (temp>>3) & mask;
      int b = (temp>>2) & mask;
      int c = (temp>>1) & mask;
      int d = temp & mask;

      if (a != 1) break;
      count+=1;
      if (b != 1) break;
      count+=1;
      if (c != 1) break;
      count+=1;
      if (d != 1) break;
      count+=1;
    }
  }

score 7 · Answer 1 · answered Sep 12 '12 at 01:11

_{Notations: I'll use the C notations for bitwise operations: $a \mathbin\& b$ (bitwise and), $a \mathbin| b$ (bitwise or), $a \mathbin\ll b$ (shift left), $a \mathbin\gg b$ (shift right). The shift operators are left-associative (e.g. $x \mathbin\gg m \mathbin\gg n = (x \mathbin\gg m) \mathbin\gg n$). The shift operations put zeroes in newly created bit positions: $a \mathbin\ll b = a \times 2^b$ and $a \mathbin\gg b = \lfloor a \times 2^{-b} \rfloor$. Sequences of digits with a line above (e.g. $\overline{11101000}$, $\overline{z^4z^3z^2z^1z^0}$) represent a number written in base 2 (binary).}

_{If $x$ is an integer between $0$ and $2^N-1$, we say that $x$ is an $N$-bit number, and write $x = \overline{x^{N-1}x^{N-2}\ldots x^1x^0}$, i.e. $x^k$ is bit $k$ of $x$. Bits $k-1$ through $0$ constitute the $k$ least significant bits of $x$, and bits $N-1$ through $N-k$ constitute the $k$ most significant bits of $x$ (where $x$ is considered as an $N$-bit number).}

Intuition

With 32 bits and 50 operations, an approach that looks at one bit of the input at a time isn't going to work. The next obvious thing to try is a recursive (divide-and-conquer) approach where you work on half the bits at a time. There are two simple ways to divide an $N$-bit number: into the $N/2$ most significant bits and the $N/2$ least significant bits, or into odd and even positions. Most/least significant looks promising, because the result counts how many most significant bits are 1.

Or we could approach the problem the other way: the result is a number between 0 and 32, a 6-bit number (for all but one input value, it's a 5-bit number). Can we find the bits of the result, one at a time?

Let $a$ be the input (a 32-bit number) and $z$ the output. Bit $5$ of $z$ is 1 iff $a$ is all-bits-1. Bit $4$ of $z$ is 1 iff the 16 most significant bits of $a$ are all 1 and $a$ is not all-bits-1. What about bit $3$? We have to discriminate on bit $4$: if it's 0, we look at the $8$ most significant bits of $a$ (with an exception when $a$ is all-bits-1); if it's 1, we look at bits $8$ through $15$ instead.

Expression in terms of bitwise operations

Let's express this in terms of bitwise operations. For the time being, assume we have a primitive operation $A_N$ on $N$-bit numbers such that $A_N(x) = 1$ if $x = 2^N-1$ (i.e. $x$ is all-bits-1) and $A_N(x) = 0$ otherwise.

$z^5 = A_{32}(a)$
if $z^5 = 1$ then $z^4 = 0$, else $z^4 = A_{16}(a \mathbin\gg 16)$
if $z^5 = 1$ then $z^3 = 0$, else if $z^4 = 0$ then $z^3 = A_8(a \mathbin\gg (16+8))$, else $z^3 = A_8((a \mathbin\& \overline{1111111100000000}) \mathbin\gg 8)$

That's not very orderly: at this point we're looking to express $z^4$ and $z^3$ in a similar way so that a recursion pattern can emerge. Furthermore we've used conditional statements which we're going to have to eliminate somehow. What seems constant is that $z^k$ is $A_{2^k}$ of some number. So what number? At each stage, we cut some number which is a slice of $a$ in half. To compute the next bit of the result we're looking at the most significant half of one of the halves from the previous step. Let's write $a_{k+1}$ for the input to the step that computes $z^k$ (so that we start with $a_6 = a$ to compute $z^5$). Then, leaving aside for a moment the case when $a = 2^{32}-1$ (in other words, we leave aside the case when $z^5=1$):

$a_5 = a_6$
$z^4 = A_{16}(a_5 \mathbin\gg 16)$
If $z^4 = 1$ then $a_4 = a_5 \mathbin\& (2^{16}-1)$, else $a_4 = a_5 \mathbin\gg 16$. It's awfully tempting to turn that 16 into $z^4 \mathbin\ll 4$, but it's on the wrong side. We could flip a bit (using xor), but this complicates the expression; instead, let's shift the other way, discarding any bits pushed beyond position 31: $a_4 = ((a_5 \mathbin\ll (z^4 \mathbin\ll 4)) \mathbin\& (2^{32}-1)) \mathbin\gg 16$.

And thus a pattern emerges: $$\begin{align*} z^k &= A_{2^k}(a_{k+1} \mathbin\gg 2^k) \\ a_k &= ((a_{k+1} \mathbin\ll (z^k \mathbin\ll k)) \mathbin\& (2^{2^{k+1}}-1)) \mathbin\gg 2^k \\ \end{align*}$$

A simpler recursion

The initial case $z^5$ isn't right. It would be a simple matter to adjust the recursion pattern to accommodate it, but before doing that, I prefer to tweak the notation to reduce the number of shifts. When we compute $a_k$, we're aligning the extracted number to the right. Let's instead align to the left. Furthermore, let's drop the masking: we'll use a new predicate $B_m$ instead of $A_m$ which looks only at the appropriate bits. Define $$ b_{k-1} = b_k \mathbin\ll (z^k \mathbin\ll k) $$ Define $B^n_m(x) = 1$ if the $2^n$-bit number $x$ has its $m$ most significant bits all set, and $B^n_m(x) = 0$ otherwise. $$ z^k = B^n_{2^k}(b_k) $$

For a 32-bit number ($n=5$), we start with $b_5 = a$ and $z^5 = B^5_{32}(b_5)$ (check all 32 bits). Next we compute $b_4 = b_5 \mathbin\ll (z^5 \mathbin\ll 5)$; when $z^5=1$ (i.e. $a=2^{32}-1$) that sets $b_4 = 0$. We can then compute $z^4 = B^5_{16}(b_4)$ (look at the most significant 16 bits), then either $b_3 = b_4$ or $b_3 = b_4 \mathbin\ll 16$ depending on $z^4$, and so on until we find $z^0$.
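The recursion on $b_k$ can be sketched in Python before worrying about how to express $B$ with bitwise operators. This is only an illustration: `MASK32`, `top_bits_all_set` and `leading_ones_reference` are names made up for this sketch, the predicate uses an ordinary comparison rather than bitwise operations, and the explicit masking is there because Python integers are unbounded.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit register with Python's unbounded ints


def top_bits_all_set(x, m):
    """B^5_m(x): 1 if the m most significant bits of the 32-bit x are all 1."""
    top = (MASK32 << (32 - m)) & MASK32  # mask selecting the m top bits
    return 1 if (x & top) == top else 0


def leading_ones_reference(a):
    """Apply b_{k-1} = b_k << (z^k << k), starting from b_5 = a."""
    b = a & MASK32
    z = 0
    for k in range(5, -1, -1):
        zk = top_bits_all_set(b, 1 << k)  # z^k = B^5_{2^k}(b_k)
        z |= zk << k
        b = (b << (zk << k)) & MASK32     # b_{k-1}; a shift by 32 clears b
    return z


print(leading_ones_reference(0xE0000000))  # three leading ones
```

The loop is only a compact way of writing the six steps; a version that respects the operation budget has to unroll it and replace the comparison in the predicate.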

Testing for all-bits-1

We're getting close. We still need to express that $B$ function in terms of bitwise operators. Note that so far, we've used 10 operations (2 shifts per step to compute $b_4$ through $b_0$). That leaves 40 operations to do the all-bits-1 check.

$B^n_1$ is easy: take the most significant bit and shift it into place, i.e. $B^n_1(x) = x \mathbin\gg (2^n-1)$. For $B^n_2$, we need to and the two most significant bits of $x$ together: $B^n_2(x) = (x \mathbin\gg (2^n-1)) \mathbin\& (x \mathbin\gg (2^n-2))$. Going on, we have $B^n_3(x) = (x \mathbin\gg (2^n-1)) \mathbin\& (x \mathbin\gg (2^n-2)) \mathbin\& (x \mathbin\gg (2^n-3))$. A recursive pattern emerges, but it's expensive: one shift and one and per bit. Linear is too slow, we need to go exponential.

Above we were taking the and of a single bit at a time. This is a waste: we can operate on $2^n$ bits in one go. Take $x$, shift left by $m$ and and with $x$: the result has its $m$ most significant bits all-1 iff $x$ has its $2m$ most significant bits all-1. Or in other words: $$B^n_{2m}(x) = B^n_m(x \mathbin\& (x \mathbin\ll m))$$ Recall that $B^n_1(x) = x \mathbin\gg (2^n-1)$. We can compute $B^n_{2^k}$ in $2k+1$ operations. Thus we can compute $z^k$ from $b_k$ in $2k+1$ operations; for $z^0$ through $z^n$, that's a total of $\sum_{k=0}^n (2k+1) = (n+1)^2$ operations.
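Here is a small sketch of the doubling identity for the 32-bit case. The function name `all_set_top` is hypothetical, and the `& MASK32` is only needed because Python integers do not wrap at 32 bits; on a real 32-bit register each round is one shift and one and.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit wraparound


def all_set_top(x, k):
    """B^5_{2^k}(x): 1 if the 2**k most significant bits of x are all 1."""
    c = x & MASK32
    m = 1 << k
    while m > 1:
        m >>= 1
        # After this round, the top m bits of c are all 1
        # iff the top 2m bits were all 1 before it.
        c = c & (c << m) & MASK32
    return c >> 31  # B^5_1: move the most significant bit into position 0


print(all_set_top(0xFFFF0000, 4), all_set_top(0xFFFE0000, 4))
```

For $k=4$ the rounds shift by 8, 4, 2 and 1, which is four shift/and pairs plus the final shift, i.e. $2k+1 = 9$ operations.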

Optimizing some shifts

We have a complete algorithm now, but it's a bit slow; we need to do a little optimization. Notice that we only ever use $z^k$ in the expression $z^k \mathbin\ll k$. So instead of computing $z^k$, we'll directly compute that bit at its final position: $$y_k = z^k \mathbin\ll k = \hat B^n_{2^k}(b_k) \mathbin\& (1 \mathbin\ll k)$$ where $\hat B^n_{2^k}(b_k)$ is computed as described above for $B^n_{2^k}(b_k)$, except that for the last shift (resulting from the expansion of $B^n_{1}(c)$ for some $c$), instead of taking $c \mathbin\gg (2^n-1)$, we take $c \mathbin\gg (2^n-1-k)$. Note that $b_{k-1} = b_k \mathbin\ll y_k$.

The cost so far:

One operation (a shift) per result bit except the most significant one, to compute $b_k$ → $n$
One operation per result bit except one, to or everything together ($z = y_5 \mathbin| y_4 \mathbin| \ldots \mathbin| y_0$) → $n$
$(n+1)^2$ operations to compute the $B^n_{2^k}(b_k)$ values as seen above, plus one more and per value to get $\hat B^n_{2^k}(b_k)$. That's one shift plus one and, times $(k+1)$ for each $k$, with $k$ ranging from $0$ to $n$. → $(n+1)(n+2)$

The total is $C(n) = (n+1)^2 + 3n + 1$ operations. For $n=5$, that's 52. We're almost there!
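The tally can be double-checked with a few lines of Python. This is just arithmetic on the three cost items listed above; the function name `cost` is made up for the check.

```python
def cost(n):
    """Operation count C(n) for a 2**n-bit input, summing the three items."""
    shifts_for_b = n  # one shift for each of b_{n-1} .. b_0
    ors_for_z = n     # n ors to combine y_n .. y_0
    # (k+1) shift/and pairs for each y_k, k = 0 .. n
    hat_b = sum(2 * (k + 1) for k in range(n + 1))
    return shifts_for_b + ors_for_z + hat_b


for n in range(1, 7):
    # The second and third columns should agree with each other.
    print(n, cost(n), (n + 1) ** 2 + 3 * n + 1)
```

For $n=5$ both columns give 52, matching the figure above.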

Further optimizations in the last steps

Let's see the last few operations once we get to $b_1$ (given here with $n=5$ as an example, but you can generalize to any $n \ge 2$). $$\begin{gather*} c_1 = b_1 \mathbin\& (b_1 \mathbin\ll 1) \\ y_1 = (c_1 \mathbin\gg 30) \mathbin\& \overline{10} \\ b_0 = b_1 \mathbin\ll y_1 \\ y_0 = (b_0 \mathbin\gg 31) \mathbin\& \overline{1} \\ \end{gather*}$$ Let's make this more concrete: let $u = b_1^{31}$ and $v = b_1^{30}$ be the two most significant bits of $b_1$ (i.e. $b_1 = \overline{uv\ldots}$ as a 32-bit number). We need to obtain $y_1 = 2$ if $u=v=1$ and $y_1=0$ otherwise. As for $y_0$, it is the most significant bit of $b_0$: when $y_1 = 0$ this is $u$, so $y_0 = 1$ if $u=1$ and $v=0$, and $y_0 = 0$ if $u=0$; when $y_1 = 2$ it is the next bit, $b_1^{29}$. At this stage, we can save some operations by aligning on the right instead of on the left. We take advantage of the fact that if you right-shift a 32-bit number by 30, you get a 2-bit number. The sequence of operations above can be replaced by the following sequence: $$\begin{gather*} w = b_1 \mathbin\gg (2^n-2) \\ y_1 = w \mathbin\& (w \mathbin\ll 1) \\ b_0 = b_1 \mathbin\ll y_1 \\ y_0 = b_0 \mathbin\gg (2^n-1) \\ \end{gather*}$$ That's 5 operations instead of 7, bringing us down to 50 for $n=5$, and more generally to $C'(n)=(n+1)^2+3n-1$.
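To convince yourself that the two sequences agree, here is a quick Python comparison for $n=5$. The function names are invented for this check, and the `& MASK32` steps only emulate 32-bit registers (they would not count as operations on real 32-bit hardware).

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit registers


def tail_left_aligned(b1):
    """The original 7-operation sequence; returns y1 | y0."""
    c1 = b1 & (b1 << 1)
    y1 = (c1 >> 30) & 0b10
    b0 = (b1 << y1) & MASK32
    y0 = (b0 >> 31) & 0b1
    return y1 | y0


def tail_right_aligned(b1):
    """The 5-operation replacement; returns y1 | y0."""
    w = b1 >> 30           # the two top bits u, v as a 2-bit number
    y1 = w & (w << 1)      # 2 if u = v = 1, else 0
    b0 = (b1 << y1) & MASK32
    y0 = b0 >> 31
    return y1 | y0


# Try every combination of the three top bits, with arbitrary low bits.
for top in range(8):
    b1 = (top << 29) | 0x12345
    print(format(top, '03b'), tail_left_aligned(b1), tail_right_aligned(b1))
```

Only the three most significant bits of $b_1$ can influence $y_1$ and $y_0$, so eight cases cover everything.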

Remarks

By the way, the computation of $b_{k-1}$ from $b_k$ can be done with just two temporary variables. When we obtain $y_k$, we can or it into an accumulator to obtain $z$. So we can do the whole computation with four $2^n$-bit registers.
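As a rough sketch of that register usage (not the unrolled program, and without the tail optimization), the whole computation fits in the variables `b`, `c`, `y` and `z`. The function name and the loops are assumptions made for compactness; each loop would be unrolled in a version that must avoid control flow.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit registers


def leading_ones_32(a):
    b = a & MASK32  # register 1: current b_k
    z = 0           # register 2: accumulator for the result
    for k in range(5, -1, -1):
        c = b       # register 3: temporary for the all-bits-1 test
        m = 1 << k
        while m > 1:
            m >>= 1
            c &= c << m  # stays below 2**32 because it is anded with c
        y = (c >> (31 - k)) & (1 << k)  # register 4: y_k = z^k << k
        z |= y
        b = (b << y) & MASK32           # b_{k-1}
    return z


print(leading_ones_32(0xFFF0000A))  # twelve leading ones
```

The shift amounts `m` and `31 - k` are constants once the loops are unrolled, so they do not need registers of their own.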

Code

Here's the 32-bit case played out in Python syntax. The ck_j variables are temporaries used to compute $\hat B^5_{2^k}(b_k)$. I used a single-assignment style; you can get a classical imperative program by stripping all number suffixes from the variable names.

#!/usr/bin/env python
import sys

# Python does bignum arithmetic, but we absolutely need modulo 2^32 arithmetic
# in a few places.
def trim(x): return x & 0xffffffff
def bin5(x): return bin(0x100000000 | trim(x))[3:]

def f5(a, trail_optimization=True, trace=False):
    (y1, y0) = (None, None) # bring these variables to function scope
    b5 = a
    c5_5 = b5
    c5_4 = c5_5 & (c5_5 << 16)
    c5_3 = c5_4 & (c5_4 <<  8)
    c5_2 = c5_3 & (c5_3 <<  4)
    c5_1 = c5_2 & (c5_2 <<  2)
    c5_0 = c5_1 & (c5_1 <<  1)
    y5 = (c5_0 & (2**31)) >> (31 - 5)
    z5 = y5
    b4 = b5 << y5
    c4_4 = b4
    c4_3 = c4_4 & (c4_4 <<  8)
    c4_2 = c4_3 & (c4_3 <<  4)
    c4_1 = c4_2 & (c4_2 <<  2)
    c4_0 = c4_1 & (c4_1 <<  1)
    y4 = (c4_0 & (2**31)) >> (31 - 4)
    z4 = z5 | y4
    b3 = b4 << y4
    c3_3 = b3
    c3_2 = c3_3 & (c3_3 <<  4)
    c3_1 = c3_2 & (c3_2 <<  2)
    c3_0 = c3_1 & (c3_1 <<  1)
    y3 = (c3_0 & (2**31)) >> (31 - 3)
    z3 = z4 | y3
    b2 = b3 << y3
    c2_2 = b2
    c2_1 = c2_2 & (c2_2 <<  2)
    c2_0 = c2_1 & (c2_1 <<  1)
    y2 = (c2_0 & (2**31)) >> (31 - 2)

    z2 = z3 | y2
    b1 = b2 << y2
    if trail_optimization:
        b1 = trim(b1)
        w = b1 >> 30
        #if trace: print 'w=' + bin5(w)[-2:],
        y1 = w & (w << 1)
        b0 = b1 << y1
        b0 = trim(b0)
        y0 = b0 >> 31
    else:
        c1_1 = b1
        c1_0 = c1_1 & (c1_1 <<  1)
        y1 = (c1_0 & (2**31)) >> (31 - 1)
        b0 = b1 << y1
        c0_0 = b0
        y0 = (c0_0 & (2**31)) >> (31 - 0)
    z1 = z2 | y1
    z0 = z1 | y0
    if trace:
        # Dump every intermediate value on a single output line.
        print('b5=' + bin5(b5), 'y5=' + str(y5), end=' ')
        print('b4=' + bin5(b4), 'y4=' + str(y4), end=' ')
        print('b3=' + bin5(b3), 'y3=' + str(y3), end=' ')
        print('b2=' + bin5(b2), 'y2=' + str(y2), end=' ')
        print('b1=' + bin5(b1), 'y1=' + str(y1), end=' ')
        print('b0=' + bin5(b0), 'y0=' + str(y0), end=' ')
    return z0

if __name__ == '__main__':
    # Test inputs: i zero bits at the bottom, and the same value minus one.
    for i in range(33):
        x = (1 << 32) - (1 << i)
        print(bin5(x), end=' ')
        print(f5(x, trace=True))
        if 0 <= i <= 31:
            x = x - 1
            print(bin5(x), end=' ')
            print(f5(x, trace=True))

Realz Slaw · Answer 2 · 2012-09-12T12:11:10.530

See the comments in the code.

#include <assert.h>
#include <stdio.h>



unsigned int should_still_count_next_time[256] = {
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
};

unsigned int byte_left_ones_count[2][256] = {
  {
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  },
  {
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
     4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8
  }
};


unsigned int countrbits32(unsigned int v)
{

  assert (((0xffffffff | v) == 0xffffffff)
          && "v has > 32-bit value");

  //Split the number into four bytes.
  unsigned char msb0_byte = (v >> 24);
  unsigned char msb1_byte = (v >> 16) & 0xff;
  unsigned char msb2_byte = (v >> 8) & 0xff;
  unsigned char msb3_byte = v & 0xff;

  //This is a flag, for each byte we will count the 1s in the byte,
  //but only if we've not encountered any zeros in a previous byte.
  int still_counting = 1;

  int count = 0;

  //Add the number of ones in this byte.
  //The @c still_counting flag will determine if we actually add
  // anything at all, or if we just add 0.
  //@c byte_left_ones_count[1][*] is a lookup into all numbers in
  // range [0-255] and how many 1s there are on the left.
  //For example, @c byte_left_ones_count[1][255] == 8, because
  // 255 == 0b11111111 while @c byte_left_ones_count[1][254] == 7,
  // because 254 == 0b11111110.
  //@c byte_left_ones_count[0][*] == 0. So if @c still_counting == 0,
  // this will add nothing to @c count.
  count += byte_left_ones_count[still_counting][msb0_byte];

  //If we encounter a 0 in this byte, we should stop counting for
  // future bytes.
  //@c should_still_count_next_time[255] == 1 because 255=0b11111111,
  // and has no zeros. The rest of @c should_still_count_next_time
  // is filled with zeros.
  still_counting &= should_still_count_next_time[msb0_byte];

  //On to the next byte.
  count += byte_left_ones_count[still_counting][msb1_byte];
  still_counting &= should_still_count_next_time[msb1_byte];

  count += byte_left_ones_count[still_counting][msb2_byte];
  still_counting &= should_still_count_next_time[msb2_byte];

  count += byte_left_ones_count[still_counting][msb3_byte];
  still_counting &= should_still_count_next_time[msb3_byte];

  return count;
}



int main(int argc, char**argv)
{
  printf( "countrbits32(0xfff0000a): %u\n", countrbits32(0xfff0000a));

  return 0;
}
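
The two tables do not have to be typed by hand. As a sketch (in Python rather than C, with a helper name `leading_ones_in_byte` that is not part of the answer), they can be generated and the same lookup logic replayed like this:

```python
def leading_ones_in_byte(v):
    """Count the 1 bits at the top of an 8-bit value before the first 0."""
    count = 0
    bit = 0x80
    while bit and (v & bit):
        count += 1
        bit >>= 1
    return count


# Same contents as the C tables above, generated instead of written out.
should_still_count_next_time = [1 if v == 0xFF else 0 for v in range(256)]
byte_left_ones_count = [
    [0] * 256,                                      # used once counting stopped
    [leading_ones_in_byte(v) for v in range(256)],  # used while still counting
]


def count_leading_ones_32(v):
    """Replay of the C function's logic, one byte at a time."""
    still_counting = 1
    count = 0
    for shift in (24, 16, 8, 0):
        byte = (v >> shift) & 0xFF
        count += byte_left_ones_count[still_counting][byte]
        still_counting &= should_still_count_next_time[byte]
    return count


print(count_leading_ones_32(0xFFF0000A))
```

The generator uses a loop, which is fine here because the tables are built ahead of time and only the lookups run per input.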

Vor · Answer 3 · 2012-09-11T16:58:41.627

In order to achieve the goal you need to parallelize the operations:

      Input n = 11100010 
First group by 2 bits and count:
0 0 -> 00  no leading 1s
0 1 -> 00  no leading 1s
1 0 -> 01  1 leading 1
1 1 -> 10  2 leading 1s
This can be done using the following operations:
                  n = 11 10 00 10
               mask = 01 01 01 01
               n>>1 = 01 11 00 01
A = (n >> 1) & mask = 01 01 00 01
now add bit 0 of each pair ANDed with the bit 1:
                     mask2 = 10 10 10 10
B = (n & ((n & mask2)>>1)) = 01 00 00 00 
              A + B = 10 01 00 01  (count: 2 1 0 1)
A + B stores the leading 1s for each pair of bits.
Now repeat the same reasoning to groups of 4 bits
mask = 0011 0011, mask2 = 1000 1000
A + B = 0011 0000 (leading 1s for each group of 4 bits)
.... and so on for groups of 8 and 16 bits ....

(Sorry for the quick description, but explaining it takes longer than writing the code :-).
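
The 8-bit example above can be replayed step by step in Python. This is only a check of the worked example; the variable names `pairs`, `quads` and `total` are made up here, and the masks are the 8-bit versions of the ones described above.

```python
n = 0b11100010

# Step 1: leading ones within each pair of bits.
mask = 0b01010101
mask2 = 0b10101010
A = (n >> 1) & mask
B = n & ((n & mask2) >> 1)
pairs = A + B
print(format(pairs, '08b'))

# Step 2: leading ones within each group of 4 bits.
# A low pair only counts if the high pair is full (count 2, i.e. bit pattern 10).
t = pairs & 0b10001000
quads = ((pairs >> 2) & 0b00110011) + (pairs & ((t >> 2) | (t >> 3)))
print(format(quads, '08b'))

# Step 3: leading ones within the whole byte.
# The low nibble only counts if the high nibble is full (count 4, i.e. 0100).
t = quads & 0b01000000
total = ((quads >> 4) & 0b00001111) + (quads & ((t >> 4) | (t >> 5) | (t >> 6)))
print(total)
```

Each step keeps the count of the high half and adds the count of the low half only when the high half is completely filled with ones.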

This is the magic (and working) function in Java (in C replace long with unsigned int):

// count number of leading 1s in a 32bit value
// (0 <= i <= 0xFFFFFFFFL)
public long count(long i) {        
    long t = i & 0xaaaaaaaa;                
    i = ((i>>1) & 0x55555555) + (i & (t >> 1)) ;        
    t = i & 0x88888888;
    i = ((i >> 2) & 0x33333333) + (i & ((t>>2) 
            | (t >> 3)));        
    t = i & 0x40404040;
    i = ((i >> 4) & 0x0f0f0f0f) + (i & ((t>>4) 
            | (t >> 5) | (t >> 6)));        
    t = i & 0x08000800;
    i = ((i >> 8) & 0x00ff00ff) + (i & ((t>>8) | (t>>9) 
            | (t >> 10) | (t >> 11)));
    t = i & 0x00100000;
    i = ((i >> 16) & 0x0000ffff) + (i & ((t>>16) | (t>>17) 
            | (t>>18) | (t >> 19) | (t >> 20)));                
    return i;
}

It has 50 operators ( >>, |, +, & ) and can be further optimized by grouping the ORed shifts.
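
As an illustration of what grouping the ORed shifts could look like, here is a Python port in which the last two steps build their masks by doubling instead of or-ing every shift separately. This is one possible reading of the remark, not code from the answer, and the function name is hypothetical. By my count it uses 46 operators.

```python
def count_leading_ones(i):
    """Leading 1s of a 32-bit value; i must satisfy 0 <= i <= 0xFFFFFFFF."""
    # The first three steps are the same as in the Java version.
    t = i & 0xaaaaaaaa
    i = ((i >> 1) & 0x55555555) + (i & (t >> 1))
    t = i & 0x88888888
    i = ((i >> 2) & 0x33333333) + (i & ((t >> 2) | (t >> 3)))
    t = i & 0x40404040
    i = ((i >> 4) & 0x0f0f0f0f) + (i & ((t >> 4) | (t >> 5) | (t >> 6)))
    # Groups of 16: spread the "high byte is full" flag over 4 bits by doubling.
    t = (i & 0x08000800) >> 8
    t |= t >> 1   # now covers shifts 8 and 9
    t |= t >> 2   # now covers shifts 8 through 11
    i = ((i >> 8) & 0x00ff00ff) + (i & t)
    # Groups of 32: spread the "high half is full" flag over 5 bits.
    t = (i & 0x00100000) >> 16
    t |= t >> 1   # shifts 16 and 17
    t |= t >> 2   # shifts 16 through 19
    t |= t >> 1   # shifts 16 through 20
    i = ((i >> 16) & 0x0000ffff) + (i & t)
    return i


print(count_leading_ones(0xFFF0000A))
```

The first three steps are left as they were because, with only two or three shifts to combine, doubling does not save anything.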