It depends largely on the state of the code caches, the microcode and the bus structure. On some architectures, loop unrolling can be the worst thing you can do.
See, the hardware folks have listened in on the compiler people and their problems. They've done things like identifying loops and rewriting them in microcode for optimization. If a short loop fits entirely within the CPU code buffer, speed goes way up. Unroll the loop and you blow the CPU code buffer, defeat the optimization and lose all that.
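To make the trade-off concrete, here is a minimal sketch in C of the same reduction written rolled and unrolled by four. The function names `sum_rolled` and `sum_unrolled4` are made up for illustration, and the unroll factor of four is an arbitrary assumption. Which version runs faster is not something the code can tell you: it depends on the CPU and on what the compiler does with each loop, so it has to be measured on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled version: the loop body is a handful of instructions, so it is
 * the kind of short loop that may fit in a CPU's code buffer.
 * (Hypothetical example function, not taken from any library.) */
long sum_rolled(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Manually unrolled by 4: fewer branches per element, but a larger loop
 * body. On hardware that special-cases short loops, the extra code size
 * may cost more than the saved branches gain. */
long sum_unrolled4(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    size_t i = 0;

    /* Main body: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Remainder: the 0 to 3 elements left over. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

Both functions return the same result for any input, so they can be swapped in a benchmark without changing behavior. Keep in mind that an optimizing compiler may unroll or vectorize either loop on its own, so check the generated assembly before drawing conclusions from timings.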