If there's no carry after the 2nd reduction, all good;
otherwise add "r".
But to check the carry we must do an addition with 0, so we have 5 operations if carry is set, or 1 if unset (plus the condition check).
I wonder what the computational cost of executing a conditional branch on CUDA is, though.
I know that on Intel platforms a CMP/JNZ would've taken longer than a simple bitshift. So in case jumps are expensive on CUDA too, maybe something like this would work:
- create a new variable to store the carry, run UADDC(0,0) on that
- carry bit will be zero at this point
- do a UMULLO(the r value you quoted, carry), i.e. a multiplication that stores the lower 64 bits. If this is correct, it potentially avoids an if statement
- add the result to the value that we are reducing
In this case, if there's no carry, the above just adds zero and is effectively a no-op; otherwise it reduces to the previous code.
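
To make the idea concrete, here is a rough sketch of both variants in plain host-side C++ (not actual CUDA/PTX), just to model the arithmetic. The 4 x 64-bit little-endian limb layout, the function names and the add_carry helper are all my own assumptions for illustration, and r is passed in as a parameter. On the GPU the add_carry chain would presumably map to add-with-carry instructions instead.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a += b + carry_in, returns the carry out (0 or 1).
static inline uint64_t add_carry(uint64_t &a, uint64_t b, uint64_t carry_in) {
    uint64_t s = a + b;
    uint64_t c1 = (s < a) ? 1 : 0;   // a + b wrapped around
    uint64_t t = s + carry_in;
    uint64_t c2 = (t < s) ? 1 : 0;   // adding carry_in wrapped around
    a = t;
    return c1 | c2;                  // both cannot be set at once
}

// Variant with a conditional: only touch x when the carry is set.
uint64_t fold_branch(uint64_t x[4], uint64_t carry, uint64_t r) {
    assert(carry <= 1);
    if (carry == 0) return 0;
    uint64_t c = add_carry(x[0], r, 0);
    c = add_carry(x[1], 0, c);
    c = add_carry(x[2], 0, c);
    c = add_carry(x[3], 0, c);
    return c;
}

// Branchless variant: multiply r by the carry (0 or 1) and always add.
uint64_t fold_branchless(uint64_t x[4], uint64_t carry, uint64_t r) {
    assert(carry <= 1);
    uint64_t addend = r * carry;     // low 64 bits: r if carry == 1, else 0
    uint64_t c = add_carry(x[0], addend, 0);
    c = add_carry(x[1], 0, c);
    c = add_carry(x[2], 0, c);
    c = add_carry(x[3], 0, c);
    return c;
}
```

Both functions should leave the same limbs in x for carry = 0 and for carry = 1. Whether the branchless one is actually faster on a GPU is something that would have to be measured.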
So what do you think about this?