bit bashing

While on c.l.c, I came upon a discussion on bit bashing.

I regularly come across the same questions, and although the question was eventually answered, here is a small discussion, but (I hope…) more complete discussion about it.

Most of what has been discussed in comp.lang.c and what follows here is more accurately described by Henry S. Warren, Jr in his book Hacker’s delight.

I liked this book very much and I recommend it highly – at least if you like binary stuff. A really nice book, and yes a real delight :-)

So, as in the original post, we try to answer the question:

How do I count the number of bits that are set in foo?

A straight-forward approach, (yes, this solution is not efficient), in term of abstraction, is iterating on all foo’s bits and counting them:

for ( int i = 0; i < 32; ++i ){
    // if bit 0 from foo is set, foo &amp; 1 evaluates to 1
    // 0 otherwise
    count += (foo & 1);

    // right-shift foo by one bit, hence next bitwise and
    // will test bit 1
    // At the end of the loop all bits have been tested since they all
    // have been shifted
    foo >>= 1;
}

This method *costs* a lot a CPU power:

  1. the for() loop, use a branching instruction – costly
  2. count is updated by one, same thing for the counter
  3. there is 32 steps for counting bits, even if only one bit is set

The branching costs a lot, and hence we should try to avoid or reduce it.

One solution (well a partial one), is the solution used by the OP:

count = 0;
while( foo ){
    // increment this since foo is not 0 hence it must have at
    // least one of his bit set
    ++count;

   // clears the right most bit
   // x   = 00101100
   // x-1 = 00101011
   // x & (x-1) = 00101000
   // original x right most bit has been cleared
   foo = foo & foo - 1; // foo &= (foo -1}; may be more efficient (but any good compiler will do it for you when parsing foo = foo & foo -1)

This is much better, compared to the previous solution:

  1. the for loop, use a branching instruction – that costs *a lot*
  2. count is updated by one, same thing for the counter
  3. the number of steps for counting bits is not always 32! It really depends on the number of bits that are set !

We achieved our goal, on average, the loop is executed only 16 times instead of 32 – that is very good. We got rid of half of the processing.

Nice but we can do even better, we can get rid of the branching at all. This is the solution the OP was wandering about:

x -= (x >> 1) & 0x55555555;
x = (x & 0x33333333) + ((x >> 2) & 0x33333333 );
x = ( x + (x >> 4)) & 0x0f0f0f0f;
x += x >> 8;
x += x >> 16;
return x & 0x0000003f;

no more branching :-)

A good explanation on what’s going on here is given in this reply made to the OP. Most of the speed is achieved by:

  1. removing the loop
  2. we now have exactly 6 very bit friendly statements that can be processed very quickly by any CPU (way much faster than running the branch/loop)

If you like numbers, here a few small benches, forget the numbers per se, they will vary too much given the hardware you do have. Much more interesting are the ratios between them.

Counting the number of bits of 64 random integers, 1,000,000 times for each integer:

simple solution: 20 sec.
OP solution: 5 sec.

last solution: 1,7 sec.

Can we make it faster ? Well maybe …

As discussed by Henry S. Warren, Jr in his book Hacker’s delight, another solution is to pre-compute the number of bits:

...
int foo = ...;
...

int CountBits( int foo )
{
    const int nbBitsByByte[256] = { 0, 1, 1, 2, 1, 2, 2, 3 ..., 8 };
    unsigned char* ptr = (unsigned char*) foo;
    return nbBitsByByte[ ptr[0] ] +
           nbBitsByByte[ ptr[1] ] +
           nbBitsByByte[ ptr[2] ] +
           nbBitsByByte[ ptr[3] ];
}

This could be much faster since:

  1. there are only memory accesses – these accesses being cache friendly
  2. pipeline friendly, may be executed in parallel by the CPU if it has multiple integer computing units (can be done explicitely by writing a = ptr[ 0 ] etc.)

Definitely worth a try.

Advertisements

About this entry