SecretAdmirere
May 07, 2026, 11:22:56 PM
Hello RetiredCoder, I just wanted to express my gratitude for your contribution to all of this. I recently saw your SASS implementation of MulMod256, and it led me to rewrite my own, which turned out better than I imagined it would. It's still nothing compared to your custom version written in SASS (or those from others like kTimesG), but I'm happy with it for now (it is fast and it computes correct results). So in a way, I wouldn't have done it if it weren't for you, or at the very least I wouldn't have done it now, so for that I want to say a BIG THANK YOU!

I'm new to this scene: CUDA programming, PTX, C++, elliptic curves, SHA256, RIPEMD160, registers, occupancy, etc. So there is still a lot to do, learn and improve, but every day there is a new milestone for me. Sometimes it's one step forward and two steps back, sometimes vice versa. For example, I had a bug in the reduction: it was correct 98% of the time, and I didn't notice because I was too lazy to test it properly. It was slightly faster but computed wrong results, so yeah...

I didn't bother benchmarking just the ModMult, since I have an RTX 3070 Ti and currently I have other things that need financial focus rather than spending money on renting GPUs just to benchmark it on a 4090/5090/PRO 6000. But here are some numbers taken from my address hashing kernel (Mkeys/s refers to hash160 checks against a hash160 target). I replaced only the ModMult for each run; everything else was the same for all four:
===============================================================================
https://github.com/FixedPaul/VanitySearch-Bitcrack/blob/master/GPU/GPUMath.h
_ModMult
2020 Mkeys/s - Let's call it a baseline
===============================================================================
https://github.com/RetiredC/RCKangaroo/blob/main/RCGpuUtils.h
MulModP
2020 Mkeys/s - Let's call it a baseline
===============================================================================
My old before rewrite
ModMult
2080 Mkeys/s - increase by ~3% over a baseline
===============================================================================
My new after rewrite
ModMult
2200 Mkeys/s - increase by ~9% over a baseline
===============================================================================
It is written completely in C++ and PTX. I don't think it's worth spending the mental cycles on anything beyond C++ and PTX for RTX 30-series GPUs. Maybe one day, when I'm able to afford a 4090 or 5090, I will spend more time on it, but for the time being it will stay as C++ and PTX.

This is the SASS of the ModMult, taken directly from the compiled kernel, as is. I hope I didn't miss anything. I don't want to share the code as C++ and PTX, so I'm sharing it as SASS: those who can reverse it back to C++ and PTX probably know enough to write a better version themselves, and those "eating magic mushrooms" people from the 660-page thread can't get their hands on it for now.

RC SASS implementation of MulMod256:
47 - IMAD.WIDE.U32.X
27 - IMAD.WIDE.U32
37 - IADD3.X
1 - IADD3
Total: 112 instructions

My new after rewrite ModMult:
68 - IMAD.WIDE.U32
50 - IADD3.X
28 - IMAD.MOV.U32
8 - IMAD.X
7 - IMAD.WIDE.U32.X
4 - IADD3
1 - IMAD.HI.U32
1 - IMAD
1 - SEL
Total: 168 instructions

Line 471
/*0800*/ IMAD.WIDE.U32 R4, R56, R64, RZ ;
Line 475
/*07d0*/ IMAD.MOV.U32 R7, RZ, RZ, RZ ;
/*0820*/ IMAD.MOV.U32 R6, RZ, RZ, R5 ;
/*0840*/ IMAD.WIDE.U32 R6, P2, R56, R65, R6 ;
Line 476
/*0860*/ IMAD.WIDE.U32 R6, P3, R57, R64, R6 ;
/*0890*/ IADD3.X R11, RZ, RZ, RZ, P3, P2 ;
Line 480
/*0880*/ IMAD.MOV.U32 R10, RZ, RZ, R7 ;
/*08c0*/ IMAD.WIDE.U32 R10, P2, R56, R61, R10 ;
Line 481
/*08e0*/ IMAD.WIDE.U32 R10, P3, R57, R65, R10 ;
/*08f0*/ IADD3.X R12, RZ, RZ, RZ, P3, P2 ;
Line 482
/*0900*/ IMAD.WIDE.U32 R10, P4, R54, R64, R10 ;
/*0910*/ IMAD.X R13, RZ, RZ, R12, P4 ;
Line 486
/*0920*/ IMAD.MOV.U32 R12, RZ, RZ, R11 ;
/*0930*/ IMAD.WIDE.U32 R12, P2, R56, R62, R12 ;
Line 487
/*0940*/ IMAD.WIDE.U32 R12, P3, R57, R61, R12 ;
/*0950*/ IADD3.X R14, RZ, RZ, RZ, P3, P2 ;
Line 488
/*0960*/ IMAD.WIDE.U32 R12, P4, R54, R65, R12 ;
Line 489
/*0970*/ IMAD.WIDE.U32 R12, P5, R55, R64, R12 ;
/*0980*/ IADD3.X R15, RZ, RZ, R14, P5, P4 ;
Line 493
/*0990*/ IMAD.MOV.U32 R14, RZ, RZ, R13 ;
/*09a0*/ IMAD.WIDE.U32 R14, P4, R56, R60, R14 ;
Line 494
/*09b0*/ IMAD.WIDE.U32 R14, P5, R57, R62, R14 ;
/*09c0*/ IADD3.X R66, RZ, RZ, RZ, P5, P4 ;
Line 495
/*09d0*/ IMAD.WIDE.U32 R14, P3, R54, R61, R14 ;
Line 496
/*09e0*/ IMAD.WIDE.U32 R14, P2, R55, R65, R14 ;
/*09f0*/ IADD3.X R66, RZ, RZ, R66, P2, P3 ;
Line 497
/*0a00*/ IMAD.WIDE.U32 R14, P4, R18, R64, R14 ;
/*0a10*/ IMAD.X R67, RZ, RZ, R66, P4 ;
Line 501
/*0a20*/ IMAD.MOV.U32 R66, RZ, RZ, R15 ;
/*0a30*/ IMAD.WIDE.U32 R66, P4, R56, R59, R66 ;
Line 502
/*0a40*/ IMAD.WIDE.U32 R66, P5, R57, R60, R66 ;
/*0a50*/ IADD3.X R68, RZ, RZ, RZ, P5, P4 ;
Line 503
/*0a60*/ IMAD.WIDE.U32 R66, P2, R54, R62, R66 ;
Line 504
/*0a70*/ IMAD.WIDE.U32 R66, P3, R55, R61, R66 ;
/*0a80*/ IADD3.X R68, RZ, RZ, R68, P3, P2 ;
Line 505
/*0a90*/ IMAD.WIDE.U32 R66, P4, R18, R65, R66 ;
Line 506
/*0aa0*/ IMAD.WIDE.U32 R66, P5, R17, R64, R66 ;
/*0ab0*/ IADD3.X R69, RZ, RZ, R68, P5, P4 ;
Line 510
/*0ac0*/ IMAD.MOV.U32 R68, RZ, RZ, R67 ;
/*0ad0*/ IMAD.WIDE.U32 R68, P4, R56, R58, R68 ;
Line 511
/*0ae0*/ IMAD.WIDE.U32 R68, P5, R57, R59, R68 ;
/*0af0*/ IADD3.X R74, RZ, RZ, RZ, P5, P4 ;
Line 512
/*0b00*/ IMAD.WIDE.U32 R68, P2, R54, R60, R68 ;
Line 513
/*0b10*/ IMAD.WIDE.U32 R68, P3, R55, R62, R68 ;
/*0b20*/ IADD3.X R74, RZ, RZ, R74, P3, P2 ;
Line 514
/*0b30*/ IMAD.WIDE.U32 R68, P5, R18, R61, R68 ;
Line 515
/*0b40*/ IMAD.WIDE.U32 R68, P4, R17, R65, R68 ;
/*0b50*/ IADD3.X R74, RZ, RZ, R74, P4, P5 ;
Line 516
/*0b60*/ IMAD.WIDE.U32 R68, P2, R16, R64, R68 ;
/*0b70*/ IMAD.X R75, RZ, RZ, R74, P2 ;
Line 520
/*0b80*/ IMAD.MOV.U32 R74, RZ, RZ, R69 ;
/*0b90*/ IMAD.WIDE.U32 R74, P4, R56, R63, R74 ;
Line 521
/*0ba0*/ IMAD.WIDE.U32 R74, P5, R57, R58, R74 ;
/*0bb0*/ IADD3.X R76, RZ, RZ, RZ, P5, P4 ;
Line 522
/*0bc0*/ IMAD.WIDE.U32 R74, P3, R54, R59, R74 ;
Line 523
/*0bd0*/ IMAD.WIDE.U32 R74, P2, R55, R60, R74 ;
/*0be0*/ IADD3.X R76, RZ, RZ, R76, P2, P3 ;
Line 524
/*0bf0*/ IMAD.WIDE.U32 R74, P0, R18, R62, R74 ;
Line 525
/*0c00*/ IMAD.WIDE.U32 R74, P1, R17, R61, R74 ;
/*0c10*/ IADD3.X R76, RZ, RZ, R76, P1, P0 ;
Line 526
/*0c20*/ IMAD.WIDE.U32 R74, P4, R16, R65, R74 ;
Line 527
/*0c30*/ IMAD.WIDE.U32 R74, P2, R53, R64, R74 ;
/*0c40*/ IADD3.X R77, RZ, RZ, R76, P2, P4 ;
Line 531
/*0c50*/ IMAD.MOV.U32 R76, RZ, RZ, R75 ;
/*0c60*/ IMAD.WIDE.U32 R56, P1, R57, R63, R76 ;
Line 532
/*0c70*/ IMAD.WIDE.U32 R56, P2, R54, R58, R56 ;
/*0c80*/ IADD3.X R64, RZ, RZ, RZ, P2, P1 ;
Line 533
/*0c90*/ IMAD.WIDE.U32 R56, P3, R55, R59, R56 ;
Line 534
/*0ca0*/ IMAD.WIDE.U32 R56, P4, R18, R60, R56 ;
/*0cb0*/ IADD3.X R64, RZ, RZ, R64, P4, P3 ;
Line 535
/*0cc0*/ IMAD.WIDE.U32 R56, P0, R17, R62, R56 ;
Line 536
/*0cd0*/ IMAD.WIDE.U32 R56, P1, R16, R61, R56 ;
/*0ce0*/ IADD3.X R64, RZ, RZ, R64, P1, P0 ;
Line 537
/*0cf0*/ IMAD.WIDE.U32 R56, P2, R53, R65, R56 ;
/*0d00*/ IMAD.X R65, RZ, RZ, R64, P2 ;
Line 541
/*0d10*/ IMAD.MOV.U32 R64, RZ, RZ, R57 ;
/*0d30*/ IMAD.WIDE.U32 R64, P0, R54, R63, R64 ;
Line 542
/*0d40*/ IMAD.WIDE.U32 R64, P1, R55, R58, R64 ;
/*0d50*/ IADD3.X R76, RZ, RZ, RZ, P1, P0 ;
Line 543
/*0d60*/ IMAD.WIDE.U32 R64, P2, R18, R59, R64 ;
Line 544
/*0d70*/ IMAD.WIDE.U32 R64, P3, R17, R60, R64 ;
/*0d80*/ IADD3.X R76, RZ, RZ, R76, P3, P2 ;
Line 545
/*0d90*/ IMAD.WIDE.U32 R64, P4, R16, R62, R64 ;
Line 546
/*0da0*/ IMAD.WIDE.U32 R64, P0, R53, R61, R64 ;
/*0db0*/ IADD3.X R77, RZ, RZ, R76, P0, P4 ;
Line 550
/*0dc0*/ IMAD.MOV.U32 R76, RZ, RZ, R65 ;
/*0dd0*/ IMAD.WIDE.U32 R54, P0, R55, R63, R76 ;
Line 551
/*0de0*/ IMAD.WIDE.U32 R54, P1, R18, R58, R54 ;
/*0df0*/ IADD3.X R76, RZ, RZ, RZ, P1, P0 ;
Line 552
/*0e00*/ IMAD.WIDE.U32 R54, P2, R17, R59, R54 ;
Line 553
/*0e10*/ IMAD.WIDE.U32 R54, P3, R16, R60, R54 ;
/*0e20*/ IADD3.X R76, RZ, RZ, R76, P3, P2 ;
Line 554
/*0e30*/ IMAD.WIDE.U32 R54, P0, R53, R62, R54 ;
/*0e40*/ IMAD.X R77, RZ, RZ, R76, P0 ;
Line 558
/*0e50*/ IMAD.MOV.U32 R76, RZ, RZ, R55 ;
/*0e60*/ IMAD.WIDE.U32 R76, P0, R18, R63, R76 ;
Line 559
/*0e70*/ IMAD.WIDE.U32 R76, P1, R17, R58, R76 ;
/*0ea0*/ IADD3.X R76, RZ, RZ, RZ, P1, P0 ;
Line 560
/*0e80*/ IMAD.WIDE.U32 R76, P2, R16, R59, R76 ;
Line 561
/*0e90*/ IMAD.WIDE.U32 R60, P3, R53, R60, R76 ;
/*0ed0*/ IADD3.X R77, RZ, RZ, R76, P3, P2 ;
Line 565
/*0ef0*/ IMAD.MOV.U32 R76, RZ, RZ, R61 ;
/*0f10*/ IMAD.WIDE.U32 R76, P1, R17, R63, R76 ;
Line 566
/*0f20*/ IMAD.WIDE.U32 R76, P2, R16, R58, R76 ;
/*0f30*/ IADD3.X R11, RZ, RZ, RZ, P2, P1 ;
Line 567
/*0f40*/ IMAD.WIDE.U32 R76, P3, R53, R59, R76 ;
/*0f60*/ IMAD.X R11, RZ, RZ, R11, P3 ;
Line 571
/*0f70*/ IMAD.MOV.U32 R10, RZ, RZ, R77 ;
/*0f90*/ IMAD.WIDE.U32 R16, P3, R16, R63, R10 ;
Line 572
/*1030*/ IMAD.WIDE.U32 R16, P4, R53, R58, R16 ;
/*1050*/ IADD3.X R13, RZ, RZ, RZ, P4, P3 ;
Line 577
/*10a0*/ IMAD.MOV.U32 R12, RZ, RZ, R17 ;
/*10c0*/ IMAD.WIDE.U32 R6, R53, R63, R12 ;
Line 585
/*0eb0*/ IADD3 R5, P0, R6, R56, RZ ;
Line 586
/*0ee0*/ IADD3.X R13, P0, R10, R64, RZ, P0, !PT ;
Line 587
/*0f00*/ IADD3.X R15, P0, R12, R54, RZ, P0, !PT ;
Line 588
/*0fb0*/ IADD3.X R5, P0, R14, R60, RZ, P0, !PT ;
Line 589
/*0fe0*/ IADD3.X R67, P0, R66, R76, RZ, P0, !PT ;
Line 590
/*1060*/ IADD3.X R69, P0, R68, R16, RZ, P0, !PT ;
Line 591
/*10f0*/ IADD3.X R75, P0, R74, R6, RZ, P0, !PT ;
Line 592
/*1140*/ IADD3.X R13, P0, RZ, R7, RZ, P0, !PT ;
Line 593
/*1170*/ SEL R5, RZ, 0x1, !P0 ;
Line 597
/*0ec0*/ IMAD.HI.U32 R6, R56, 0x3d1, RZ ;
Line 598
/*08a0*/ IMAD.MOV.U32 R7, RZ, RZ, RZ ;
/*0f80*/ IMAD.WIDE.U32 R6, P1, R64, 0x3d1, R6 ;
Line 600
/*0fc0*/ IMAD.MOV.U32 R10, RZ, RZ, R7 ;
/*0fd0*/ IMAD.MOV.U32 R11, RZ, RZ, RZ ;
/*1000*/ IMAD.WIDE.U32.X R10, P1, R54, 0x3d1, R10, P1 ;
Line 602
/*0ff0*/ IMAD.MOV.U32 R7, RZ, RZ, RZ ;
/*1010*/ IMAD.MOV.U32 R6, RZ, RZ, R11 ;
/*1040*/ IMAD.WIDE.U32.X R6, P1, R60, 0x3d1, R6, P1 ;
Line 604
/*1070*/ IMAD.MOV.U32 R10, RZ, RZ, R7 ;
/*1090*/ IMAD.MOV.U32 R11, RZ, RZ, RZ ;
/*10b0*/ IMAD.WIDE.U32.X R10, P1, R76, 0x3d1, R10, P1 ;
Line 606
/*10e0*/ IMAD.MOV.U32 R12, RZ, RZ, R11 ;
/*1100*/ IMAD.MOV.U32 R13, RZ, RZ, RZ ;
/*1120*/ IMAD.WIDE.U32.X R12, P1, R16, 0x3d1, R12, P1 ;
Line 608
/*1110*/ IMAD.MOV.U32 R11, RZ, RZ, RZ ;
/*1130*/ IMAD.MOV.U32 R10, RZ, RZ, R13 ;
/*1160*/ IMAD.WIDE.U32.X R10, P1, R6, 0x3d1, R10, P1 ;
Line 610
/*1180*/ IMAD.MOV.U32 R12, RZ, RZ, R11 ;
/*11a0*/ IMAD.WIDE.U32.X R6, P1, R7, 0x3d1, R12, P1 ;
Line 615
/*0d20*/ IMAD R57, R56, 0x3d1, RZ ;
/*0f50*/ IADD3 R4, P2, R57, R4, RZ ;
Line 616
/*0fa0*/ IADD3.X R18, P2, R5, R6, RZ, P2, !PT ;
Line 617
/*1020*/ IADD3.X R55, P2, R13, R10, RZ, P2, !PT ;
Line 618
/*1080*/ IADD3.X R15, P2, R15, R6, RZ, P2, !PT ;
Line 619
/*10d0*/ IADD3.X R14, P2, R5, R10, RZ, P2, !PT ;
Line 620
/*1150*/ IADD3.X R67, P2, R67, R12, RZ, P2, !PT ;
Line 621
/*1190*/ IADD3.X R69, P2, R69, R10, RZ, P2, !PT ;
Line 622
/*11b0*/ IADD3.X R75, P2, R75, R6, RZ, P2, !PT ;
Line 623
/*11c0*/ IADD3.X R57, P2, RZ, R7, RZ, P2, !PT ;
Line 624
/*11d0*/ IADD3.X R11, RZ, RZ, R5, P2, P1 ;
Line 632
/*11e0*/ IMAD.WIDE.U32 R6, R11, 0x3d1, RZ ;
/*11f0*/ IMAD.WIDE.U32 R6, P0, R57, 0x1, R6 ;
/*1200*/ IMAD.WIDE.U32 R56, R57, 0x3d1, RZ ;
/*1220*/ IADD3 R57, P1, R57, R6, RZ ;
/*1230*/ IADD3 R56, P0, R4, R56, RZ ;
/*1250*/ IADD3.X R57, P0, R18, R57, RZ, P0, !PT ;
Line 633
/*1210*/ IMAD.X R5, RZ, RZ, RZ, P0 ;
/*1240*/ IMAD.MOV.U32 R4, RZ, RZ, R7 ;
/*1260*/ IMAD.WIDE.U32.X R4, R11, 0x1, R4, P1 ;
/*1280*/ IADD3.X R54, P0, R55, R4, RZ, P0, !PT ;
/*1290*/ IADD3.X R55, P0, R15, R5, RZ, P0, !PT ;
Line 634
/*12b0*/ IADD3.X R18, P0, RZ, R14, RZ, P0, !PT ;
/*12d0*/ IADD3.X R17, P0, RZ, R67, RZ, P0, !PT ;
Line 635
/*1300*/ IADD3.X R16, P0, RZ, R69, RZ, P0, !PT ;
/*1320*/ IMAD.X R53, RZ, RZ, R75, P0 ;
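
For anyone trying to follow the listing, here is a rough plain C++ sketch of the arithmetic a routine like this performs. It is not the C++/PTX behind the SASS in this post, and it is written for readability rather than speed. It assumes the standard secp256k1 prime p = 2^256 - 2^32 - 977 (977 is the 0x3d1 constant visible in the listing), so that 2^256 mod p = 2^32 + 0x3d1. The names U256, kFoldLow, add_fold and mul_mod_p are made up for this illustration. The steps are: an 8x8 limb schoolbook product, a fold of the high 256 bits back into the low half, a fold of the small overflow, and one final conditional subtraction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical type for this sketch: 256 bits as eight little-endian 32-bit limbs.
using U256 = std::array<uint32_t, 8>;

// Assumption: p = 2^256 - 2^32 - 977, hence 2^256 mod p = 2^32 + 0x3d1.
constexpr uint64_t kFoldLow = 0x3d1;

// r += c * (2^32 + 0x3d1) in place; returns the carry out of bit 256 (0 or 1).
static uint64_t add_fold(U256& r, uint64_t c) {
    assert(c < (1ull << 34));                       // keeps every sum inside 64 bits
    uint64_t acc = uint64_t(r[0]) + c * kFoldLow;   // c * 0x3d1 lands on limb 0
    r[0] = uint32_t(acc);
    acc = (acc >> 32) + r[1] + c;                   // c * 2^32 lands on limb 1
    r[1] = uint32_t(acc);
    acc >>= 32;
    for (int i = 2; i < 8; ++i) {                   // ripple the carry upward
        acc += r[i];
        r[i] = uint32_t(acc);
        acc >>= 32;
    }
    return acc;
}

// Returns a * b mod p. Inputs do not have to be reduced below p.
U256 mul_mod_p(const U256& a, const U256& b) {
    // Step 1: 8x8 schoolbook product into 16 limbs (512 bits).
    uint32_t t[16] = {0};
    for (int i = 0; i < 8; ++i) {
        uint64_t carry = 0;
        for (int j = 0; j < 8; ++j) {
            uint64_t acc = uint64_t(a[i]) * b[j] + t[i + j] + carry;
            t[i + j] = uint32_t(acc);
            carry = acc >> 32;
        }
        t[i + 8] = uint32_t(carry);
    }

    // Step 2: fold the high half: lo + hi * 0x3d1 + (hi << 32).
    U256 r;
    uint64_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += uint64_t(t[i]) + uint64_t(t[i + 8]) * kFoldLow
             + (i > 0 ? t[i + 7] : 0u);
        r[i] = uint32_t(acc);
        acc >>= 32;
    }
    uint64_t c = acc + t[15];   // overflow above bit 256, below 2^33

    // Step 3: fold the overflow until nothing is left above bit 256.
    while (c != 0) c = add_fold(r, c);

    // Step 4: r >= p exactly when r + (2^256 - p) carries out of bit 256.
    U256 s = r;
    if (add_fold(s, 1)) r = s;
    return r;
}
```

A reduction that is right "98% of the time" usually fails on inputs that force the rare carries, so it is worth checking a few of those by hand in addition to random inputs: (p-1)*(p-1) must give 1, 2^128 * 2^128 must give 2^32 + 0x3d1, and p times anything must give 0.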