One place the branchless techniques can come in handy is with SIMD (or simulated SIMD by packing several small ints into longer ones.) A branchless algorithm that's 50% slower can pay for itself if you can process four elements at a time.
Agreed that performance testing is absolutely necessary, preferrably with something that can tell you not only where exactly the processor is spending its time but also whether it's spending its time computing or waiting on memory etc, whether branches are being predicted well...
Agreed that performance testing is absolutely necessary, preferrably with something that can tell you not only where exactly the processor is spending its time but also whether it's spending its time computing or waiting on memory etc, whether branches are being predicted well...