(Intel® 64 and IA-32 Architectures Optimization Reference Manual and read appendix C).
Similar manuals exist for most platforms, although sometimes the embedded vendors get a bit shy. Agner's stuff is good too, and he isn't prone to leaving gaps in latency tables due to forgetfulness or embarrassment, unlike Intel.
2. If you are futzing around with Athlon 64s and Pentium Ms and so on you are retrocomputing. Good for you, but please don't tell us that some microoperations are 'slow' in general. The facts are available in mind-numbing detail; go acquaint yourself with them.
3. Modern x86 - Core 2 and onwards:
The 'slowness' of individual operations - as long as you stay away from really nasty fiascos like INC and DEC and XLATB and integer divide and so on - is NOT necessarily all that important. Even in the unlikely event that you are l337 enough to avoid branch mispredicts, cache misses, etc. - the important thing is to be able to keep your pipeline full. You can issue and retire 4 uops per cycle; 3 ALU ops and 1 load/store.
Frankly, it just doesn't matter whether an instruction takes 3 cycles or one cycle if you've got good reciprocal throughput and a full pipeline. The instructions to stay away from are the ones with both large latency and large reciprocal throughput - these will tie up a port for a startling length of time (like the SSE4.2 string match instructions, which are botched and appear to be getting slower, not quicker).
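To make the latency-vs-throughput distinction concrete, here's a hand-rolled C sketch (mine, not from any manual): 64-bit multiply has ~3 cycles of latency but a reciprocal throughput of 1 per cycle on these cores, so a dependent chain runs at latency speed while independent chains run at throughput speed - same instruction, roughly 3x difference.

```c
#include <stdint.h>

/* Dependency chain: each multiply needs the previous result, so the
   loop is bound by latency (~3 cycles per iteration, two ALU ports idle). */
uint64_t chained(uint64_t x, int n) {
    for (int i = 0; i < n; i++)
        x = x * 3 + 1;          /* next iteration waits on this x */
    return x;
}

/* Four independent chains: the multiplier port can start a fresh
   uop every cycle, so the same instruction sustains ~1/cycle. */
uint64_t interleaved(uint64_t x, int n) {
    uint64_t a = x, b = x + 1, c = x + 2, d = x + 3;
    for (int i = 0; i < n; i++) {
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
        d = d * 3 + 1;
    }
    return a ^ b ^ c ^ d;       /* combine so nothing is optimized away */
}
```

Same opcode in both loops; only the dependency structure changes, and that's what decides whether you run at latency or at throughput.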
Keeping your pipeline full has far, far more to do with having a lot of independent work to do than it does with instruction selection. Variable shift vs fixed shift is a second-order (third?) effect compared to the difference between issuing one instruction per cycle vs. 4 (the latter is unlikely but doable in some loops).
Aspire to data-parallelism, even in single-threaded CPU code. That long sequential dependency is what's killing you. Even a Level 1 cache hit is 4 cycles on a Nehalem or Sandy Bridge; if your algorithm has nothing useful to do for those cycles you're going to be twiddling your thumbs on 3 ALU ports and 1-2 load/store ports for 4 cycles.
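The classic illustration of this (a sketch I'm adding, not part of the original comment) is an array reduction: a single accumulator serializes every add behind a 4-cycle L1 load plus the add latency, while a handful of independent accumulators keep several loads and adds in flight at once.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous sum, so the loop
   crawls along the dependency chain while most ports sit idle. */
long sum_serial(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: four loads and four adds in flight
   per iteration, hiding the L1 hit latency behind independent work. */
long sum_parallel(const long *a, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover tail elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both compute the same sum; the second just gives the out-of-order machinery independent work to chew on while each load is in flight. (Compilers will sometimes do this transformation for you, but for floating point they usually can't without -ffast-math, since it reorders the additions.)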
4. Yes, most of the really obscure instructions suck. Read the aforementioned Optimization Reference Manual and find out which and why.
http://www.intel.com/content/www/us/en/processors/architectu...