Let's take a look inside the GTX 460, which has eight shader modules.
Each shader module has 48 CUDA cores. One CUDA core can do one integer and one float operation using its ALU.
Thus the GF104 / GTX 460 card has 8 * 48 = 384 ALUs for integer operations. Now let's see what's inside the ATI Cypress Radeon 5870, which has 20 SIMD engines.
Each SIMD engine (one red row in the picture above) contains 16 thread processors.
Each thread processor contains 4 ALUs + 1 special-purpose ALU for transcendentals (sine, cosine, etc.). Thus one thread processor can do four integer operations using its 4 ALUs.
Thus the Radeon 5870 has 20 * 16 * 5 = 1600 ALUs, of which 20 * 16 * 4 = 1280 can be used for integer ops.
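
For anyone who wants to check the counting, here is a small Python sketch of the arithmetic so far. It only uses the unit counts quoted in this post; the variable names are mine and don't come from any vendor documentation.

```python
# Unit counts as quoted in this post (not taken from vendor datasheets).
gtx460_shader_modules = 8
gtx460_cores_per_module = 48          # one integer-capable ALU per CUDA core

hd5870_simd_engines = 20
hd5870_thread_procs_per_engine = 16
hd5870_general_alus_per_tp = 4        # usable for integer ops
hd5870_special_alus_per_tp = 1        # transcendental unit

# NVIDIA: every CUDA core counts as one integer ALU.
nvidia_int_alus = gtx460_shader_modules * gtx460_cores_per_module

# ATI: total ALUs include the special-purpose unit, integer ALUs do not.
thread_procs = hd5870_simd_engines * hd5870_thread_procs_per_engine
ati_total_alus = thread_procs * (hd5870_general_alus_per_tp + hd5870_special_alus_per_tp)
ati_int_alus = thread_procs * hd5870_general_alus_per_tp

print("NVIDIA integer ALUs:", nvidia_int_alus)
print("ATI total ALUs:", ati_total_alus)
print("ATI integer ALUs:", ati_int_alus)
```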
Now the NVIDIA ALUs run at twice the clock speed, i.e. for a 700 MHz device they run at 1400 MHz, while the ATI ALUs run at the advertised clock speed.
But the ATI ALUs can do shift and rotate in one clock cycle, while the NVIDIA ALUs need more cycles for these operations (4 clock cycles? Please correct me if I'm wrong).
So the maximum overall speedup with an ATI GPU (assuming the same clock speeds) will be 1280 / (384 * 2 / 4) = 6.7X.
We see around 70 MH/s with a GTX 460 and around 400 MH/s with a 5870, a ratio of 5.7X.
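
The estimate can be written as a small Python function so the guessed numbers are easy to swap out. This is only a rough sketch: the 4-cycle cost for NVIDIA shift/rotate is the unconfirmed figure from above, the model treats every operation as if it were a shift or rotate, and the function and parameter names are made up for illustration.

```python
def theoretical_speedup(ati_int_alus, nvidia_int_alus,
                        nvidia_clock_multiplier, nvidia_cycles_per_op):
    """Upper-bound ATI/NVIDIA throughput ratio at equal base clocks.

    nvidia_cycles_per_op is an assumption (4 is an unconfirmed guess).
    """
    # Effective NVIDIA ALUs: doubled shader clock, divided by cycles per op.
    nvidia_effective = nvidia_int_alus * nvidia_clock_multiplier / nvidia_cycles_per_op
    return ati_int_alus / nvidia_effective


estimated = theoretical_speedup(1280, 384, 2, 4)
observed = 400 / 70   # MH/s on the 5870 divided by MH/s on the GTX 460

print(f"estimated max speedup: {estimated:.2f}X")
print(f"observed speedup:      {observed:.2f}X")
```

Changing nvidia_cycles_per_op to 2, for example, would halve the estimate, so the result is quite sensitive to that one guess.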