|
kTimesG
|
 |
October 18, 2025, 03:08:20 PM |
|
k = 0.625  benchmark test? -- Wilds are always Even but the Tames are not Relax, Biff! k's down to 0.6 only when all points are golden and stuff's jumpin' both left AND right via the cheapos. So, hash table's up 2x to 1.25 sqrt. By the same logic BSGS's actually faster AND better, bcz the cheapos are in full effect over there and we might as well say that 1.06 sqrt is actually k = 0.53. Bonus: it eats up less space as well (0.71 sqrt). If you stick your head up from the established norms you might understand why Wildos aren't really just at an even distance (and nah, having jump deltas even becomes irrelevant at that time). The perspective of how you look at them plays a big role. I'm not gonna say more ATM because then it becomes obvious.
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
RetiredCoder (OP)
Full Member
 
Offline
Activity: 162
Merit: 141
No pain, no gain!
|
 |
October 18, 2025, 08:40:00 PM |
|
But you know what? Cycles don't really need so much exhaustive burnout if you mastermind the behavioral headaches of roundabouts in a much more simpler way. No hoverboard needed.
I'm glad that you finally like cycles 
|
|
|
|
vdog99
Newbie
Offline
Activity: 14
Merit: 0
|
 |
October 24, 2025, 03:20:28 PM |
|
Hey RC
How do you go about incorporating SOTA+ with the GPU Version of SOTA In the RCKangaroo implementation of your Code (SOTA+ calculates of P - J and P + J causes a loop: A -> B (Add) and B -> A (Sub)) especially the Loop detection and Escape in Kernel A?
Should we just rely on KernelB Loop D&E ((average latency of STEP_CNT/2 = 500 steps)) and remove the L1S2 which would then Lower K value correct since Kernel B is expensive?
In Kang-2 you used a multi-List method but that would not make sense for a GPU Implementation?
Is there a way to integrate SOTA+ in RCKangaroo while doing accurate and correct Loop Detection?
|
|
|
|
|
Efim_Wr
Newbie
Offline
Activity: 4
Merit: 0
|
 |
October 25, 2025, 11:24:59 PM |
|
Hi all. Here are the performance figures for my video cards. CMP 70HX ~1711 MKeys/s RTX 5070 ~3400 MKeys/s CMP 40HX ~1500 MKeys/s
RetiredCoder, you're a genius. Thanks for all your hard work. I have some small developments in the field of EC. You're probably familiar with them, but if you have anything, please send me a private message.
|
|
|
|
|
Cricktor
Legendary
Offline
Activity: 1442
Merit: 3801
|
 |
October 26, 2025, 10:56:11 AM |
|
Sounds like your "small developments in the field of EC" are not for the public? If so, then why talk about it?
|
|
|
|
|
kTimesG
|
 |
October 26, 2025, 06:15:32 PM |
|
Is there a way to integrate SOTA+ in RCKangaroo while doing accurate and correct Loop Detection?
I think SOTA+ simply means always computing both P+J and P-J and choosing the best route of the two. What stops you from doing that, if it's not already in the code? IMHO there is no such thing as an accurate and correct loop detection. 2-cycles are totally trivial. But what if the loop is of size 1024? Good luck detecting that when you're only tracking the last 8 or 16 distances, even if it's close to impossible for something like that to ever occur during the lifespan of our galaxy. Are there algos to detect very complicated cycles? I bet there must be, but it's most likely complicated useless overkill. If you do the math, you'll find out that from a point on, cycles longer than some very short length don't even matter anymore, compared to the huge amount of walks you're supposed to create and maintain. They're the disposable 0.001% of the entire amount of power needed to solve some DLP. Also choosing very fine-tuned jump distances can avoid a lot of the cycles from the very start.
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
RetiredCoder (OP)
Full Member
 
Offline
Activity: 162
Merit: 141
No pain, no gain!
|
 |
October 28, 2025, 06:49:52 AM |
|
How do you go about incorporating SOTA+ with the GPU Version of SOTA In the RCKangaroo implementation of your Code
It's not difficult to upgrade SOTA to SOTA+ in RCKangaroo, but the group size is 24 or 64 there, so cheap point is not so cheap and therefore it won't be faster. In general, SOTA+ is faster only when group size is about 12-16 or less.
|
|
|
|
vdog99
Newbie
Offline
Activity: 14
Merit: 0
|
 |
October 28, 2025, 11:29:34 PM Last edit: October 29, 2025, 02:59:01 PM by vdog99 |
|
How do you go about incorporating SOTA+ with the GPU Version of SOTA In the RCKangaroo implementation of your Code
It's not difficult to upgrade SOTA to SOTA+ in RCKangaroo, but the group size is 24 or 64 there, so cheap point is not so cheap and therefore it won't be faster. Oh that makes sense I couldn't get to improve its K than the current one, in spite of everything I did.
|
|
|
|
|
|
kTimesG
|
 |
October 29, 2025, 06:38:40 PM |
|
But you know what? Cycles don't really need so much exhaustive burnout if you mastermind the behavioral headaches of roundabouts in a much more simpler way. No hoverboard needed.
I'm glad that you finally like cycles  Yo Doc. I'm honestly in a loop on loop handling. I frankesteined my way to do a sorta' but not identical SOTA+ (let's call it SOTA++) in CUDA because I want that 1.0 sqrt after all. Was seriously thinking this is what your RCKang does already (based on some code comments in Kang1 or Kang2 that sounded like "on a GPU, we always compute both of x3, because no benefits etc.". My bad. I'm using 8 as group size so basically I'm jumping 80% faster than a forward-only creature. But it all got nuked when I got to the point of having to escape cycles. Detecting them was a breeze. Escaping them? It's simpler to get out of a Guantanamo Bay prison in one piece. Check this out: a simple 4 cycle: A -> B -> C -> B -> A -> D Now, the problem with this circuit is that B and C are in a 2-cycle that needs to exit at B (to keep up symmetric walk), but also B and A are in a 2-cycle as well (this dude's exiting on A), so D's the exit point because B's just been visited. Crap. Yeah, the 4-cycle auto-exited by itself during the second 2-cycle handling. That's totally f***d up, but also so beautiful. I hate cycles.
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
vdog99
Newbie
Offline
Activity: 14
Merit: 0
|
 |
October 29, 2025, 07:52:07 PM Last edit: October 29, 2025, 08:05:53 PM by vdog99 |
|
But you know what? Cycles don't really need so much exhaustive burnout if you mastermind the behavioral headaches of roundabouts in a much more simpler way. No hoverboard needed.
I'm glad that you finally like cycles  Yo Doc. I'm honestly in a loop on loop handling. I frankesteined my way to do a sorta' but not identical SOTA+ (let's call it SOTA++) in CUDA because I want that 1.0 sqrt after all. Was seriously thinking this is what your RCKang does already (based on some code comments in Kang1 or Kang2 that sounded like "on a GPU, we always compute both of x3, because no benefits etc.". My bad. I'm using 8 as group size so basically I'm jumping 80% faster than a forward-only creature. But it all got nuked when I got to the point of having to escape cycles. Detecting them was a breeze. Escaping them? It's simpler to get out of a Guantanamo Bay prison in one piece. Check this out: a simple 4 cycle: A -> B -> C -> B -> A -> D Now, the problem with this circuit is that B and C are in a 2-cycle that needs to exit at B (to keep up symmetric walk), but also B and A are in a 2-cycle as well (this dude's exiting on A), so D's the exit point because B's just been visited. Crap. Yeah, the 4-cycle auto-exited by itself during the second 2-cycle handling. That's totally f***d up, but also so beautiful. I hate cycles. Yeah K.G It all comes down to Cycles lol From my understanding so far, note it has been less than a month before looking at this and I have a full-time job and school, So take it with a grain of salt but here we go: 1) What is SOTA++ and What the Heuristic you are choosing aka what route are you assigning P+J or P-J? 2) What tracking method are using to Detect Cycles.. RC uses Distance Tracking instead of Point Tracking for efficiency. -- So assuming you are using that Kang Travelling -> A -> B -> C Gets assigned P-J. Goes to B Hence A -> B -> C -> B -> C In Kernel A if you do Continue on, instead of Escaping, your K value will improve Significant but you also have to Pay Kernel B Cost of Escape_Detection. I had lots of these issues as well, couldn't really overcome it. Ill have to Calculate Group Operations and find a way to reduce it as RC mentioned but I never went below 1.37. I just went back to SOTA, and improved efficiency there. I am getting K=1.01 after statistical Significant sampling but more work needs to be done. I really like SOTA+ but I cant really get it to work maybe you will have better luck
|
|
|
|
|
|
kTimesG
|
 |
October 30, 2025, 09:50:20 AM |
|
1) What is SOTA++ and What the Heuristic you are choosing aka what route are you assigning P+J or P-J?
I compute both X3(P+J) and X3(P-J) and their slopes, before Y3. The route is chosen based on what X3 is better. This keeps symmetry without having to do weird checks for Y parity and conditional branching. I have a lot of other tweaks though for faster computations. Also an important idea is that walks are only recreated if a useless DP collision occurs (otherwise a low K but lots of EC multiplications is called cheating the definition of a low complexity). Hence A -> B -> C -> B -> C
Not really. 2-cycles are very frequent so if there are a lot of jumps and no exiting from them until later, K goes higher. Keeping track of whether the next jump goes to the previous point (storing the last jump index and direction) allows to exit 2-cycles while jumping. A is start point/distance. Goes to B. B goes to C. C detects a 2-cycle but is not the exit point (because in a B <-> C cycle, the end point must be deterministic otherwise two walks might diverge) so it goes back to B anyway. B detects a 2-cycle, is the exit point, so goes in the other direction, which happens to be A. A detects a 2-cycle, and is the exit point (in the A <-> B cycle) so it goes to D and cycle is exited. Main problem: cycle detection must keep up with whether the walk exited a cycle. If some long walk is in the middle of some cycle that might exit by itself, the long cycle exiting kinda breaks because the deterministic exit point of a long cycle might never be reached a second time. Sure, recreating the walk is an option that affects speed, but is it lowering K or increasing K? I had lots of these issues as well, couldn't really overcome it. Ill have to Calculate Group Operations and find a way to reduce it as RC mentioned but I never went below 1.37. I just went back to SOTA, and improved efficiency there. I am getting K=1.01 after statistical Significant sampling but more work needs to be done.
You must be doing something wrong if you get a K = 1.37. Make sure all jump points are distinct and DP overhead is correct. I get K ~= 1.01 even with a very high DP. But this only when exiting all cycles as soon as they get detected (e.g. on the CPU).
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
vdog99
Newbie
Offline
Activity: 14
Merit: 0
|
 |
October 30, 2025, 04:50:04 PM Last edit: October 30, 2025, 05:08:43 PM by vdog99 |
|
I compute both X3(P+J) and X3(P-J) and their slopes, before Y3. The route is chosen based on what X3 is better.
What is the Route you are choosing? How are you doing? this might where I am getting things wrong I am assuming the idea is to use the 2 Point at the same time to get to K = 0.99 so how do you decide? A is start point/distance. Goes to B. B goes to C. C detects a 2-cycle but is not the exit point (because in a B <-> C cycle, the end point must be deterministic otherwise two walks might diverge) so it goes back to B anyway. B detects a 2-cycle, is the exit point, so goes in the other direction, which happens to be A. A detects a 2-cycle, and is the exit point (in the A <-> B cycle) so it goes to D and cycle is exited.
Main problem: cycle detection must keep up with whether the walk exited a cycle. If some long walk is in the middle of some cycle that might exit by itself, the long cycle exiting kinda breaks because the deterministic exit point of a long cycle might never be reached a second time. Sure, recreating the walk is an option that affects speed, but is it lowering K or increasing K?
I am not a good CUDA programmer but do you know a way to debug this. Write it to a file or something else. I can do mathematical analysis and figure out what is happening here, i am in the same boat as well, hating on these cycles.
|
|
|
|
|
|
kTimesG
|
 |
October 30, 2025, 06:16:28 PM |
|
What is the Route you are choosing? How are you doing? this might where I am getting things wrong I am assuming the idea is to use the 2 Point at the same time to get to K = 0.99 so how do you decide?
Obviously, the best/default route is going in the direction that has the higher chances of being a DP. If the DP criteria is on X, then you know what to pick. I assume this is why RC's code tries to skip X3's that are odd, maybe I'm wrong, don't really care, I know my way works. Note that this removes the need of checking Y parity since it will always be the same best point for any type of walk (tame, wild1, wild2). If you don't compute both points, like RC, and use Y parity and the X bit, or some 2-cycle is detected and need to reverse, then the 2nd point is needed anyway, so not sure why RC says that the cheapo is too slow on GPU. I get almost the same speed whether I compute just one X or both, so maybe it depends on the rest of the code, IDK  Also note that this means DP density doubles, so the overhead is half then normal. The "worse" point can never be a DP if the "best" point is not a DP itself, so no need to check on it (less than 0.0...01% chances of ever hitting such an alternate DP from some other walk, and chances go lower as DP increases). Even for DP 0, the collision of some other walk straight into a weak DP is less than 1%. I am not a good CUDA programmer but do you know a way to debug this. Write it to a file or something else. I can do mathematical analysis and figure out what is happening here, i am in the same boat as well, hating on these cycles.
In CUDA we can only printf, and its tough to catch a naughty cycle when 40.000 threads are each jumping multiple walks. But there is no bug here, except that some cycles might exit at 2 different points depending on when they're handled (during jumps or at the end). That's rare though, just a fun fact. Of course, assuming cycles aren't simply nuked and recreated, which kinda sucks 
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
Tony8989
Newbie
Offline
Activity: 37
Merit: 0
|
 |
November 04, 2025, 02:56:55 PM |
|
testing bsgs on Puzzle 135 PVK not found. 100.00000 PetaKeys scanned in 11.15 sec. New range k1
|
|
|
|
|
RetiredCoder (OP)
Full Member
 
Offline
Activity: 162
Merit: 141
No pain, no gain!
|
RCKangaroo v3.1: - fixed "gpu illegal memory access" bug. - some small improvements. This time I've included binary for Windows only, too lazy to build it for Linux  But of course you can build it from sources, for both Windows and Linux.
|
|
|
|
Ykra
Newbie
Offline
Activity: 14
Merit: 2
|
 |
November 21, 2025, 03:59:14 PM |
|
RCKangaroo v3.1: - fixed "gpu illegal memory access" bug. - some small improvements. This time I've included binary for Windows only, too lazy to build it for Linux  But of course you can build it from sources, for both Windows and Linux. Thank you for releasing RCKangaroo and the recent fixes. Your sources have been extremely helpful to study. I’ve built a full-featured implementation using your SOTA approach with some optimizations, giving roughly a 10-12% improvement in Gkeys/s over the public RCKangaroo. I still can’t reach the higher performance figures you quoted for the 4090 with your private solver, so I would be interested in seeing the level of optimization you applied once you’ve completed 135 should you be willing to share. On that note, how is progress on 135?
|
|
|
|
|
|
kTimesG
|
 |
November 21, 2025, 10:57:37 PM |
|
I’ve built a full-featured implementation using your SOTA approach with some optimizations, giving roughly a 10-12% improvement in Gkeys/s over the public RCKangaroo. I still can’t reach the higher performance figures you quoted for the 4090 with your private solver
I got to 11.6 GK/s with custom SOTA+ (computing both X3 points) on a 4090 with DP checks and updating jump distances. But just the kernel alone (without any of the EC math) is 3500 lines, full of compilation flags and different things to use (constant memory, shared memory, texture cache, prefetch sizes...) so yeah. Still nowhere close to the 12.8 GK/s suggested by RC. I don't think that speed can be achieved without hardcore hand-written PTX (the only thing I didn't get into so far), a much more optimized inversion, and probably some big architectural changes.
|
Off the grid, training pigeons to broadcast signed messages.
|
|
|
Ykra
Newbie
Offline
Activity: 14
Merit: 2
|
 |
November 22, 2025, 04:40:51 PM |
|
I got to 11.6 GK/s with custom SOTA+ (computing both X3 points) on a 4090 with DP checks and updating jump distances. But just the kernel alone (without any of the EC math) is 3500 lines, full of compilation flags and different things to use (constant memory, shared memory, texture cache, prefetch sizes...) so yeah.
Still nowhere close to the 12.8 GK/s suggested by RC. I don't think that speed can be achieved without hardcore hand-written PTX (the only thing I didn't get into so far), a much more optimized inversion, and probably some big architectural changes.
11.6 GKeys/s on a 4090 is impressive. I'm currently getting ~11.5 GKeys/s on a 5090, so I'm clearly behind/missing a lot of things on the optimisation front. My kernel is still a frankenstein build with borrowed RC code and definitely needs a refactor before I take on new improvements. Thank you for the reply, it's given me some more areas to think on.
|
|
|
|
|
Efim_Wr
Newbie
Offline
Activity: 4
Merit: 0
|
 |
November 25, 2025, 03:07:02 PM |
|
Hi all. Here are the performance figures for my video cards. CMP 70HX ~1711 MKeys/s RTX 5070 ~3400 MKeys/s CMP 40HX ~1500 MKeys/s CMP 90HX ~2700-3000 MKeys/s + shunt mod I was just conducting some observations on the 85th puzzle. The classic Kangaroo didn't finish in 50 minutes, so I guess I need to wait longer, but the SOTA+ finished in 3 minutes. All experiments were run at 9800 MKeys/s.
|
|
|
|
|
Uncle Mark
Newbie
Offline
Activity: 1
Merit: 0
|
 |
November 26, 2025, 08:06:32 PM |
|
How long or how many machines do I need to process bit 115 in hours or day using this code?
|
|
|
|
|
|