|
mcdouglasx
|
 |
March 27, 2026, 10:35:43 PM |
|
Rc, don't you think it would be better to change, in RCGpuCore.cu, `jmp_ind = x[0] % JMP_CNT;` to `jmp_ind = x[1] % JMP_CNT;`? I mean, in the middle of `x` you'd have a better, more uniform distribution of jumps and less correlation. You could even use the combination `x[1] ^ x[0]`. I think this would improve uniformity and reduce the loop rate. I'm only including that line of code as a guide, since other lines refer to it as well. However, if you already have a generated database, this could damage it, but for those performing a new search, I think it would be worthwhile.
|
|
|
|
Bram24732
Member


Activity: 322
Merit: 28
|
 |
March 28, 2026, 01:31:29 PM |
|
Rc, don't you think it would be better to change in RCGpuCore.cu: jmp_ind = x[0] % JMP_CNT; to jmp_ind = x[1] % JMP_CNT; I mean, because in the middle of `x` you'd have a better, more uniform distribution of jumps and less correlation. Or even use the combination `x[1] ^ x[0]`. I think this would improve uniformity and reduce the loop rate. I'm only including that line of code as a guide, since there are more that refer to this. However, if you already have a generated database, this could damage it, but for those performing a new search, I think it would be worthwhile.
Why do you think x[1] has a better distribution than x[0]?
|
I solved 67 and 68 using custom software distributing the load across ~25k GPUs. 4090 stocks speeds : ~8.1Bkeys/sec. Don’t challenge me technically if you know shit about fuck, I’ll ignore you. Same goes if all you can do is LLM reply.
|
|
|
|
kTimesG
|
 |
March 28, 2026, 03:41:37 PM |
|
Why do you think x[1] has a better distribution than x[0] ?
No, he is correct. And it's because when the DP uses the same bits as the jump index function, and you get to select a better X between two options, this adds bias for the even indices in the jump table, increasing their probability, hence losing the overall selection uniformity for the pseudo-random walk. This is easily proven (and very visible) by plotting the frequencies of the used jump indices - it's not uniform at all. Even worse when selecting a DP based on more than one bit (powers of two indices get biased the more bits you get to compare between two X candidates).
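The effect described above can be reproduced with a toy model. This is only a sketch under stated assumptions: X is modelled as four independent uniform 64-bit words (`x[0]` lowest), `JMP_CNT = 512` is an assumed table size, and `pick_candidate` is a hypothetical stand-in for "select the better X between two options" that simply prefers a candidate whose lowest bit is clear. It is not the RCKangaroo selection code.

```python
import random
from collections import Counter

JMP_CNT = 512        # assumed jump-table size (hypothetical for this sketch)
SAMPLES = 100_000
rng = random.Random(1)

def random_x():
    # Model X as four independent uniform 64-bit words, x[0] being the lowest.
    return [rng.getrandbits(64) for _ in range(4)]

def pick_candidate(a, b):
    # Toy "better X" rule: prefer a candidate whose low word has bit 0 clear.
    if a[0] & 1 == 0:
        return a
    if b[0] & 1 == 0:
        return b
    return a

def even_fraction(index_word):
    # Share of walks that land on an even jump index after the selection step.
    counts = Counter()
    for _ in range(SAMPLES):
        x = pick_candidate(random_x(), random_x())
        counts[index_word(x) % JMP_CNT] += 1
    return sum(c for i, c in counts.items() if i % 2 == 0) / SAMPLES

shared_bits = even_fraction(lambda x: x[0])    # index reuses the bit used for selection
disjoint_bits = even_fraction(lambda x: x[1])  # index taken from a different word

print(f"even-index share with x[0]: {shared_bits:.3f}")
print(f"even-index share with x[1]: {disjoint_bits:.3f}")
```

In this model the even indices are chosen about 75% of the time when the index shares its low bit with the selection rule, and about 50% of the time when it comes from a word the rule never looks at.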
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
|
mcdouglasx
|
 |
March 28, 2026, 04:57:27 PM |
|
Why do you think x[1] has a better distribution than x[0] ?
The central bits like x[1] are usually freer from algebraic biases derived from the modular arithmetic of the curve.
|
|
|
|
RetiredCoder (OP)
Full Member
 

Activity: 168
Merit: 171
No pain, no gain!
|
 |
March 28, 2026, 05:55:12 PM |
|
Rc, don't you think it would be better to change in RCGpuCore.cu: jmp_ind = x[0] % JMP_CNT; to jmp_ind = x[1] % JMP_CNT;
So you think that the lowest bits of the X of secp256k1 points are not uniformly distributed, right? Why do you think so? I see the same uniform distribution for any bits. Anyway, it's very easy to check and I don't see any difference in my tests.
No, he is correct. And it's because when the DP uses the same bits as the jump index function, and you get to select a better X between two options, this adds bias for the even indices in the jump table, increasing their probability, hence losing the overall selection uniformity for the pseudo-random walk. This is easily proven (and very visible) by plotting the frequencies of the used jump indices - it's not uniform at all. Even worse when selecting a DP based on more than one bit (powers of two indices get biased the more bits you get to compare between two X candidates).
Just don't use the same bits for DP, jumps and point selection to avoid such issues.
|
|
|
|
|
kTimesG
|
 |
March 28, 2026, 06:13:29 PM |
|
Just don't use same bits for DP, jumps and for point selection to avoid such issues
Exactly what I'm doing, the DP bits never intersect the jump function bits. Anyway, I thought this was the "point" behind McD's suggestion, but I guess he had something else in mind.
|
|
|
|
|
mcdouglasx
|
 |
March 28, 2026, 07:33:50 PM |
|
So you think that the lowest bits of the X of secp256k1 points are not uniformly distributed, right? Why do you think so? I see same uniform distribution for any bits. Anyway, it's very easy to check and I don't see any difference in my tests.
RC, I did a quick test with 4 million real pubkeys, replicating the GPU logic exactly (little-endian and Y parity).

PUZZLE 67 (low range):
x[0]: chi2=519.37, bias=0.46%
x[1]: chi2=502.23, bias=0.46%
xor: chi2=499.66, bias=0.45%
Result: xor wins by a narrow margin.

PUZZLE 135 (mid range):
x[0]: chi2=507.51, bias=0.46%
x[1]: chi2=514.07, bias=0.48%
xor: chi2=524.48, bias=0.45%
Result: x[0] wins.

My suggestion to use x[0] ^ x[1] is valid "on paper" because it reduces the inversion bias (inv_flag) and improves uniformity at low ranges, such as Puzzle 67. However, in practice, your current implementation (x[0]) achieves near-perfect entropy at Puzzle 135 (chi2: 507.51), which demonstrates that the bias naturally disappears as the search space grows. Your current code is statistically sound for large puzzles. My proposal to combine fragments acts as a safety net to avoid correlations with the Y coordinate, but it is not strictly necessary for the current kernel performance.
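For readers who want to run a similar check, here is a minimal sketch of a chi-square uniformity test over the three candidate index functions. Several things are assumptions, not taken from the post: `JMP_CNT = 512` bins, random 64-bit words standing in for real public-key X coordinates, and a smaller sample than the 4 million keys used above. The "bias" figure is not reproduced because its definition is not given.

```python
import random

JMP_CNT = 512       # assumption: number of jump-table entries (bins)
SAMPLES = 400_000   # smaller than the 4 million keys in the post, to keep the run short
rng = random.Random(7)

def histogram(selector, xs):
    # Count how often each jump index is produced by the given selector.
    counts = [0] * JMP_CNT
    for x in xs:
        counts[selector(x) % JMP_CNT] += 1
    return counts

def chi_square(counts, total):
    # Pearson chi-square statistic against a uniform expectation.
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Stand-in data: pairs (x[0], x[1]) of uniform 64-bit words, not real pubkeys.
xs = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(SAMPLES)]

selectors = {
    "x[0]": lambda x: x[0],
    "x[1]": lambda x: x[1],
    "xor": lambda x: x[0] ^ x[1],
}
results = {name: chi_square(histogram(sel, xs), SAMPLES)
           for name, sel in selectors.items()}

for name, value in results.items():
    print(f"{name}: chi2={value:.2f}")
```

If the test used 512 bins, the statistic has 511 degrees of freedom, so for uniform data it has mean 511 and a standard deviation of about 32. Under that assumption, values between roughly 480 and 545 are all ordinary, and differences of a few units between selectors are within noise.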
|
|
|
|
Bram24732
Member


Activity: 322
Merit: 28
|
 |
March 29, 2026, 11:05:30 AM |
|
Why do you think x[1] has a better distribution than x[0] ?
No, he is correct. And it's because when the DP uses the same bits as the jump index function, and you get to select a better X between two options, this adds bias for the even indices in the jump table, increasing their probability, hence losing the overall selection uniformity for the pseudo-random walk. This is easily proven (and very visible) by plotting the frequencies of the used jump indices - it's not uniform at all. Even worse when selecting a DP based on more than one bit (powers of two indices get biased the more bits you get to compare between two X candidates).
Oh, it’s not a matter of curve distribution but more how you use it then. I didn’t read RC’s code so I didn’t spot this behaviour.
|
|
|
|
olnev
Newbie

Activity: 3
Merit: 0
|
 |
April 03, 2026, 05:36:58 PM |
|
Thanks for testing it out and I am glad to hear it's working with your 6700. Please let me know if you wish to commit your improvements.
It's working on RX6400 (RDNA 2). Speed is ~226 MKey/s. However, I can't get it working on RX7600XT (RDNA 3):
DBG: RndPnts tame x_init=00000000000000000000000000000000 wild x_init=e25e16cabde26a99bb29878863c7e237
BENCH: Speed: 0 MKeys/s, Err: 0, DPs: 0K/9646K, Time: 0d:00h:00m/213503982334601d:07h:00m
BENCH: Speed: 0 MKeys/s, Err: 0, DPs: 0K/9646K, Time: 0d:00h:00m/213503982334601d:07h:00m
BENCH: Speed: 0 MKeys/s, Err: 0, DPs: 0K/9646K, Time: 0d:00h:00m/213503982334601d:07h:00m
BENCH: Speed: 0 MKeys/s, Err: 0, DPs: 0K/9646K, Time: 0d:00h:00m/213503982334601d:07h:00m
|
|
|
|
|
|
kTimesG
|
 |
April 04, 2026, 06:15:03 PM |
|
How many raw FE mul/second are you getting on a stock RTX 4090 @450 W?
RawFE:: regs=40 threads=1024 block=256 throughput= 87487.5 Mmul/s
This is pre altering SASS. I'll play around and update after some SASS alterations.
115 Gmul/s @ 413 W .... after some better carry chaining.
int32 perf: 19912 G xAdd / s, 19912 G xMul / s
FE limb size: 32
blockDim: 256
Computing FE result [CPU]
a = 0x3be98b46a0db920cfae7933e4bee03a58d393f4b3a6e8e1b3af8f4a1eed5c226
b = 0xe6aeff98065a7a533a9705e4c9d949b357df42916c1c3bc1f2c20fe97552baa3
Total ops: 16777216
r = 0x183f14a5c2e101b5666fbca1d032228a4d2784225255aeccc96eab8578aa790a
[CPU] Total ops: 16777216 Speed: 40.85 Mo/s
Computing FE result [GPU]
TestFE kernel attributes:
registers: 46
max threads / block: 1024
local memory: 0 bytes
const memory: 0 bytes
Launch parameters
gridDim: 128
blockDim: 1024
r = 183f14a5 c2e101b5 666fbca1 d032228a 4d278422 5255aecc c96eab85 78aa790a
Total ops: 2199023255552; Wall clock speed: 114399.89 Mo/s (19222250 ticks)
GPU speed: 114482.21 Mo/s (19208.43 ms); Kernel ticks: 51523633975
AVG ticks/op: 3071.05
SUCCESS!
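As a sanity check on the log above, the two throughput figures can be recomputed from the operation count and the timings. This is a sketch, and it assumes that the wall clock "ticks" are microseconds, which the log does not state but which is the unit that makes the numbers agree.

```python
# Figures copied from the GPU section of the log above.
total_ops = 2_199_023_255_552   # "Total ops" (equal to 2**41)
gpu_ms = 19_208.43              # kernel time reported next to "GPU speed"
wall_ticks = 19_222_250         # wall clock ticks; assumed to be microseconds

# Throughput in millions of operations per second (Mo/s).
gpu_mops = total_ops / (gpu_ms / 1_000) / 1e6
wall_mops = total_ops / (wall_ticks / 1e6) / 1e6

print(f"GPU speed:        {gpu_mops:.2f} Mo/s")
print(f"Wall clock speed: {wall_mops:.2f} Mo/s")
```

Both values match the logged 114482.21 Mo/s and 114399.89 Mo/s to within rounding of the printed timings, which is where the "115 Gmul/s" headline figure comes from.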
|
|
|
|
Bilmehdi93
Newbie

Activity: 1
Merit: 0
|
 |
April 10, 2026, 03:28:47 PM |
|
Hello dear Retired Coder, could I please get the Tame file for puzzle 120 or 125? If possible, please share it to my email, bilmehdi93@gmail.com. Thank you so much.
|
|
|
|
|
RetiredCoder (OP)
Full Member
 

Activity: 168
Merit: 171
No pain, no gain!
|
 |
April 22, 2026, 08:02:09 AM Last edit: April 22, 2026, 01:18:25 PM by RetiredCoder Merited by kTimesG (10), Ykra (10), Cricktor (4) |
|
Finally I had some time to prepare RCAsm for people who are brave enough to make their CUDA kernels faster:
https://github.com/RetiredC/RCAsm

Why ASM? PTX is not powerful enough:
- You still cannot control register usage.
- PTX does not provide all instructions, and some of them can be really important if you are going to create really fast code.
- There is no way to declare fast functions: if you define an "inline" function, its code is simply included, so the main code grows every time you call that function. If it's not inline, calls are very slow.
- There is no way to use uniform registers and instructions directly.
- There is no way to specify control codes.
- There is no good management of carry flags, and some carry-related instructions are missing.
- You have to check what SASS is generated every time, spend time convincing the compiler to produce what you want, etc.

As a result, ASM is often really faster if you know what you are doing.

RCAsm features:
- sm89 and sm120 support.
- variables for R, UR, P.
- asm functions (include/call).
- supports constants and math expressions.
- automatic kernels injection into .cuasm file.
- simple but convenient editor for asm sources.
- #IF #ELSEIF #ENDIF support.
- open source, written in Python.

I hope you will have a lot of fun with this tool and SASS. Also check the Kernel01 sample for a SASS implementation of MulMod256.
|
|
|
|
Torin Keepler
Newbie

Activity: 43
Merit: 0
|
 |
April 22, 2026, 04:46:58 PM Last edit: April 22, 2026, 04:58:57 PM by Torin Keepler |
|
Finally I had some time to prepare RCAsm for people who are brave enough to make their CUDA kernels faster https://github.com/RetiredC/RCAsm Why ASM? PTX is not powerful enough:
Huge thanks for the provided tools and code example! The project is simply super. I managed to compile everything: the assembler and the injection went absolutely successfully and without any errors. Could you please tell me if you plan to release instructions or examples of assembly kernels specifically for the RCKangaroo program, for different architectures, in the future?
Compiling code...
4 public units found:
main.asm: CONST STRIDE
main.asm: CONST INT_SIZE
main.asm: KERNEL mulKernel
mul.asm: FUNCTION MulMod256
kernel mulKernel: compiled, regcnt: 255, asm_lines: 138
appended functions:
injecting SUCCESSFUL
Done(372ms): compiling+injecting SUCCESSFUL
|
|
|
|
|
RetiredCoder (OP)
Full Member
 

Activity: 168
Merit: 171
No pain, no gain!
|
 |
April 22, 2026, 06:19:22 PM |
|
Could you please tell me if you plan to release instructions or examples of assembly kernels specifically for the RCKangaroo program, for different architectures, in the future?
Yes I will publish asm sources for my turbo kernels (both sm89 and sm120) for RCKangaroo as soon as #135 is solved (no matter who solves it). It will happen soon, so you won’t have to wait long.
|
|
|
|
Torin Keepler
Newbie

Activity: 43
Merit: 0
|
 |
April 22, 2026, 07:17:51 PM Last edit: April 26, 2026, 06:27:11 PM by Torin Keepler |
|
Could you please tell me if you plan to release instructions or examples of assembly kernels specifically for the RCKangaroo program, for different architectures, in the future?
Yes I will publish asm sources for my turbo kernels (both sm89 and sm120) for RCKangaroo as soon as #135 is solved (no matter who solves it). It will happen soon, so you won’t have to wait long.
I have questions regarding the current implementation of RCKangaroo. Could you please provide some clarification? Thank you very much.
Potential race condition in KernelB
I noticed a potential issue in KernelB when writing to LoopTable. The code currently uses BLOCK_X at the end of the index:
Kparams.LoopTable[MD_LEN * BLOCK_SIZE * PNT_GROUP_CNT * BLOCK_X + 2 * MD_LEN * BLOCK_SIZE * gr_ind2 + ind * BLOCK_SIZE + BLOCK_X] = RegsA;
Doesn't this cause all threads in the block to overwrite the exact same memory address at the same time? I changed the last BLOCK_X to THREAD_X so that each thread writes to its own unique column, and it seems to work perfectly. Was BLOCK_X just a typo here, or is there a specific reason for it? Thanks in advance!
|
|
|
|
|
RetiredCoder (OP)
Full Member
 

Activity: 168
Merit: 171
No pain, no gain!
|
 |
April 22, 2026, 08:42:53 PM |
|
1. About jump sizes and Out-Of-Bounds drifting When working with large search ranges, the jumps from Table 2 accumulate over a long period, which causes the kangaroos to eventually drift far outside the boundaries of the search range. Because of this, do you think it would be better to use smaller jumps in Table 2 to prevent them from drifting out of bounds?
Jumps for all tables can be right or left (depends on Y), so I don't expect any serious drifts. Also I don't like the idea of reducing these jumps, but you can try it.
2. Potential race condition in KernelB
Also, I noticed a potential issue in KernelB when writing to LoopTable. The code currently uses BLOCK_X at the end of the index:
Kparams.LoopTable[MD_LEN * BLOCK_SIZE * PNT_GROUP_CNT * BLOCK_X + 2 * MD_LEN * BLOCK_SIZE * gr_ind2 + ind * BLOCK_SIZE + BLOCK_X] = RegsA;
Doesn't this cause all threads in the block to overwrite the exact same memory address at the same time? I changed the last BLOCK_X to THREAD_X so that each thread writes to its own unique column, and it seems to work perfectly. Was BLOCK_X just a typo here, or is there a specific reason for it?
Oh, it's a bug, it must be THREAD_X of course! I have version 4.0 with a lot of changes to support asm kernels and some interesting ideas implemented, but I will upload it later (as I said above).
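The collision is easy to see by evaluating the index expression for every thread of one block. The sketch below uses the formula quoted above, but the sizes (`MD_LEN`, `BLOCK_SIZE`, `PNT_GROUP_CNT`) are small hypothetical values chosen for illustration, not the constants from RCKangaroo.

```python
# Hypothetical sizes, chosen small for illustration only.
MD_LEN = 10
BLOCK_SIZE = 8
PNT_GROUP_CNT = 4

def loop_table_index(block_x, gr_ind2, ind, last_term):
    # Same expression as in the quoted KernelB line; only the last term varies.
    return (MD_LEN * BLOCK_SIZE * PNT_GROUP_CNT * block_x
            + 2 * MD_LEN * BLOCK_SIZE * gr_ind2
            + ind * BLOCK_SIZE
            + last_term)

block_x, gr_ind2, ind = 3, 1, 5   # one arbitrary (block, group, entry) combination

# Buggy version: every thread of the block adds BLOCK_X as the last term.
buggy = [loop_table_index(block_x, gr_ind2, ind, block_x)
         for thread_x in range(BLOCK_SIZE)]

# Fixed version: each thread adds its own THREAD_X.
fixed = [loop_table_index(block_x, gr_ind2, ind, thread_x)
         for thread_x in range(BLOCK_SIZE)]

print("distinct addresses, BLOCK_X :", len(set(buggy)))
print("distinct addresses, THREAD_X:", len(set(fixed)))
```

With BLOCK_X all threads of the block target a single element, so only one thread's value survives. With THREAD_X the threads write to consecutive elements, which is also the layout that allows coalesced global memory writes.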
|
|
|
|
RetiredCoder (OP)
Full Member
 

Activity: 168
Merit: 171
No pain, no gain!
|
 |
April 22, 2026, 09:37:48 PM |
|
Regarding the jumps, it is clear that their direction depends on the Y-coordinate, and in most cases, they tend to bounce around approximately within their own local area (halo). However, please observe this closely. After just a month of running, a large number of kangaroos will end up in a space that is 4 bits larger. This is a serious problem.
If you see this issue, probably the best solution is just to restart a kangaroo after it hits DP. Or you can reduce jumps and hope that it won't cause other issues.
By the way, I added an implementation that restricts kangaroo landings to even X-coordinates 75% of the time.
SOTA+ does this trick with the cheap point. If you managed to do it without using cheap point - you found a security issue of secp256k1.
|
|
|
|
Torin Keepler
Newbie

Activity: 43
Merit: 0
|
 |
April 22, 2026, 10:03:55 PM Last edit: April 23, 2026, 01:57:12 PM by Torin Keepler |
|
Regarding the jumps, it is clear that their direction depends on the Y-coordinate, and in most cases, they tend to bounce around approximately within their own local area (halo). However, please observe this closely. After just a month of running, a large number of kangaroos will end up in a space that is 4 bits larger. This is a serious problem.
If you see this issue, probably the best solution is just to restart a kangaroo after it hits DP. Or you can reduce jumps and hope that it won't cause other issues.
By the way, I added an implementation that restricts kangaroo landings to even X-coordinates 75% of the time.
SOTA+ does this trick with the cheap point. If you managed to do it without using cheap point - you found a security issue of secp256k1.
Yes, I have implemented the restart method, but I'm currently using DP30. By the time a kangaroo hits a distinguished point (DP), it ends up doing quite a bit of redundant work. Therefore, I think it is more optimal to use a different loop exit value. Regarding the X-coordinate parity, that is exactly how I implemented it - using a "cheap" second point - but for now, I'm stuck on implementing a cheap loop detector.
|
|
|
|
|
RetiredCoder (OP)
Full Member
 

Activity: 168
Merit: 171
No pain, no gain!
|
 |
April 23, 2026, 06:04:21 AM |
|
Could you briefly explain why you consider the coefficient in the SOTA+ method to be slightly better?
Because I have proofs: https://github.com/RetiredC/Kang-1
|
|
|
|
Ykra
Newbie

Activity: 16
Merit: 32
|
How many raw FE mul/second are you getting on a stock RTX 4090 @450 W?
RawFE:: regs=40 threads=1024 block=256 throughput= 87487.5 Mmul/s
This is pre altering SASS. I'll play around and update after some SASS alterations.
115 Gmul/s @ 413 W .... after some better carry chaining.
Oh, huge improvement! I did go ahead with some SASS tuning and testing on the 4090 I was renting. The best result I achieved was 102318.5 Mmul/s, but it was hard limited by the 300 W power ceiling on that specific card. I've been a bit consumed by other things as of late but am always interested in new developments here.
This is great, really appreciate you sharing your work. Would +merit but too newbie to do so, so pretend I did. This would have saved some of my mental when I was making my own IDE (and CuAsm with fixes) for my 5090 work, then again part of the struggle is part of the fun I guess. I learnt a lot from redplait's blog, with his deep dives into SASS amongst other interesting things. You can see some of his work here: https://github.com/redplait/denvdis
|
|
|
|
|
|