Are you compiling with optimisations enabled? If so, have you checked the assembly output for your looping version? The compiler may detect this pattern and replace it with a BSR instruction in the binary. Even if it doesn't, it's quite likely that the compiler is vectorising the loop for you, which might explain why you're not seeing a speedup from BSR: not because BSR is slow, but because the compiler is making your other code fast.
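To check this, you could put both versions side by side and compare what the compiler generates. I haven't seen your code, so the following is only a sketch: the function names are made up, I'm assuming GCC or Clang (for `__builtin_clz`), and I'm assuming `unsigned int` is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Loop version: index of the highest set bit, or -1 if x == 0.
   Hypothetical stand-in for the looping code in the question. */
int highest_bit_loop(uint32_t x)
{
    int index = -1;
    while (x != 0) {
        x >>= 1;
        ++index;
    }
    return index;
}

/* Builtin version (GCC/Clang only). __builtin_clz is undefined for 0,
   so that case is handled separately. Assumes a 32-bit unsigned int. */
int highest_bit_builtin(uint32_t x)
{
    if (x == 0)
        return -1;
    return 31 - __builtin_clz(x);
}

/* Sanity check that both versions agree on a few inputs. */
void check_versions_agree(void)
{
    const uint32_t samples[] = { 0u, 1u, 2u, 3u, 255u, 256u, 0x80000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(highest_bit_loop(samples[i]) == highest_bit_builtin(samples[i]));
}
```

Compile it with something like `gcc -O2 -S` (or paste it into Compiler Explorer) and look at the body of each function. If the loop version already contains `bsr` (or `lzcnt` on targets where that is enabled), or has been transformed in some other way, then your benchmark isn't really comparing a naive loop against BSR. Comparing against an `-O0` build can also show how much of the difference comes from the optimiser.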