How many raw FE mul/second are you getting on a stock RTX 4090 @450 W? I mean simply multiplying a value by itself many times in a loop, and only saving it to DRAM after many iterations.
My best version reached 85 billion mul/s (40 registers, 1024 threads / block), but I haven't messed around with altering SASS; I've only read it to see which instructions get used. The key is to make good use of both the FP32 and ALU units during the operations (without getting insane). The pseudo-code planning also matters a lot, since it affects that balance. Anyway, I think improving the inversion would be the #1 gainer.
I don't have a 4090 locally but rented one to test. It ran headless, with no DE and nothing else utilising the GPU.
This specific 4090 seems to have a lower power ceiling (300 W).
SMI outputs (during load):
| 0 NVIDIA GeForce RTX 4090 On |
| 37% 60C P2 299W / 300W |
Clock rates during tests:
Graphics: 2415 MHz, SM: 2415 MHz, Memory: 10251 MHz
Graphics: 2400 MHz, SM: 2400 MHz, Memory: 10251 MHz
Graphics: 2385 MHz, SM: 2385 MHz, Memory: 10251 MHz
Tried a few t/b variations, this looked best:
RawFE:: regs=40 threads=1024 block=256 throughput= 87487.5 Mmul/s
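Since this card is capped at 300 W rather than 450 W, it may be easier to compare numbers after normalising by SM count and clock. A rough sketch in Python, assuming the 4090's 128 SMs and taking 2400 MHz as a representative SM clock from the readings above; the function name is made up for illustration and is not part of the benchmark:

```python
# Rough normalisation of the measured throughput; not part of the benchmark itself.
SM_COUNT = 128             # RTX 4090 ships with 128 SMs enabled
SM_CLOCK_HZ = 2400e6       # assumption: middle of the 2385-2415 MHz readings
MEASURED_MMUL_S = 87487.5  # taken from the RawFE output line


def mul_per_sm_cycle(mmul_per_s, sm_count, sm_clock_hz):
    """Field multiplications completed per SM per clock cycle."""
    return (mmul_per_s * 1e6) / (sm_count * sm_clock_hz)


per_cycle = mul_per_sm_cycle(MEASURED_MMUL_S, SM_COUNT, SM_CLOCK_HZ)
cycles_per_mul = 1.0 / per_cycle
print(f"{per_cycle:.4f} mul per SM-cycle, {cycles_per_mul:.2f} SM-cycles per mul")
```

With these inputs that works out to roughly 0.28 mul per SM per cycle, i.e. about 3.5 SM-cycles per multiplication. Plugging in another card's sustained clock and throughput gives a figure that is less sensitive to the power limit.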
This is before altering any SASS.
I'll play around and update after some SASS alterations.