AndrewWeb
Jr. Member
Offline
Activity: 81
Merit: 3
December 17, 2024, 08:23:16 AM
If one RTX 4090 card can do it in +/- 249934 days (68 years), then eighty RTX 4090 cards can do it in less than a year.
COBRAS
Member

Offline
Activity: 1137
Merit: 25
December 17, 2024, 08:56:02 AM
If one RTX 4090 card can do it in +/- 249934 days (68 years), then eighty RTX 4090 cards can do it in less than a year.
And if it's not one key like #130, but 2^45 keys in a 2^90 range, how long would it take?
ee1234ee
Jr. Member
Offline
Activity: 50
Merit: 1
December 17, 2024, 09:38:42 AM
If one RTX 4090 card can do it in +/- 249934 days (68 years), then eighty RTX 4090 cards can do it in less than a year.
Who taught you mathematics? 249934 days = 684 years.
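For what it's worth, the division is easy to check; a quick sanity pass in Python, assuming ideal linear scaling across cards and no coordination overhead:

```python
# Sanity check of the figures in this thread (stdlib only).
days_one_card = 249_934                 # claimed runtime for one RTX 4090
years_one_card = days_one_card / 365.25
print(f"one card: {years_one_card:.0f} years")   # ~684 years, not 68

cards = 80
print(f"{cards} cards: {years_one_card / cards:.1f} years")  # ~8.6 years, not < 1
```

So even with perfect scaling, 80 cards land at roughly 8.6 years, not under one.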
Lolo54
Member

Offline
Activity: 133
Merit: 32
December 17, 2024, 10:16:33 AM
#135 remains technically feasible with very large resources like those used by RetiredCoder (400 RTX 4090 GPUs, if he is telling the truth) and with a very well optimized RCKangaroo; depending on the position of the private key, it would take between 0 and 1.2 years to solve. Depending on the price of Bitcoin, the operation can be profitable, but only with resources on that scale. For the 99% of people who try to solve it without them, it is useless: unless they believe in phenomenal, lottery-level luck, it will be a waste of money and time.
abdullahsoliman
Newbie
Offline
Activity: 17
Merit: 0
December 17, 2024, 11:25:19 AM
If one RTX 4090 card can do it in +/- 249934 days (68 years), then eighty RTX 4090 cards can do it in less than a year.
Who taught you mathematics? 249934 days = 684 years.
With this math we need 400+ RTX 4090 GPUs to solve #135 in about 1.7 years, and to do it in half a year we would need 1000+ GPUs, if I'm not wrong. This is a hallucination in this situation. As RetiredCoder tells us, one GPU costs between $0.20 and $0.30 per hour; that is about $180 per month, or over $2160 per year, so 400 cards cost about $864,000 a year and 1000 cards about $2,160,000. If I'm not wrong, I think I will still laugh for the next year.
RetiredCoder (OP)
Full Member
 
Offline
Activity: 163
Merit: 141
No pain, no gain!
December 17, 2024, 01:36:01 PM
“Your implementation of the Kangaroo method to solve the ECDLP (Elliptic Curve Discrete Logarithm Problem) is interesting,
Stop posting AI BS here, I will remove it every time.
Hoesis.USA
Jr. Member
Offline
Activity: 54
Merit: 1
December 17, 2024, 04:07:13 PM
@RetiredCoder Maybe you can share networking and dp load/save options. Do you have any plan on it?
https://github.com/ufodia
RetiredCoder (OP)
Full Member
 
Offline
Activity: 163
Merit: 141
No pain, no gain!
December 17, 2024, 05:16:20 PM
@RetiredCoder Maybe you can share networking and dp load/save options. Do you have any plan on it?
I'm not going to create a serious ready-to-use open-source solution for cracking really high ranges. You should do it yourself if you want to crack #135 and get a lot of money. But I'm going to update RCKangaroo to support old cards better (with higher speed) when I have time.
Lolo54
Member

Offline
Activity: 133
Merit: 32
December 17, 2024, 05:53:38 PM
I'm not going to create a serious ready-to-use open-source solution for cracking really high ranges. You should do it yourself if you want to crack #135 and get a lot of money. But I'm going to update RCKangaroo to support old cards better (with higher speed) when I have time.
It would actually be great if 20xx GPUs could be supported! And if, in addition, a speed optimization for the 20xx and 30xx series were made, even better. Thank you.
RetiredCoder (OP)
Full Member
 
Offline
Activity: 163
Merit: 141
No pain, no gain!
December 17, 2024, 05:58:56 PM
It would actually be great if 20xx GPUs could be supported! And if, in addition, a speed optimization for the 20xx and 30xx series were made, even better. Thank you.
Yes, the new version will support these cards and will be at least 20-30% faster on them.
Wouimbly
Newbie
Offline
Activity: 9
Merit: 0
December 17, 2024, 06:07:36 PM
Some explanations about other GPUs support: 1. I have zero interest in old cards (same for AMD cards) so I don't have them for development/tests and don't support them. 2. You can easily enable support for older nvidia cards, it will work, but my code is designed for the latest generation, for previous generations it's not optimal and the speed is not the best, that's why I disabled them.
Hi! Pretty cool tool :-) Do you plan to implement, if possible: a "continue" option, in case the GPU stops and you want to resume from where it stopped? And a "multiple pubkeys" option, like an input file with a list of pubkeys? Thanks and have a good one.
kTimesG
December 18, 2024, 03:22:19 PM
I must admit you used some really clever tricks to make maximum usage of shared memory (L1) and L2 caches. I'm still trying to figure out the way you keep track of the jump distances using the shared memory instead of updating them using L2.
After adapting my own kernel to load/store stuff using L2 (instead of only once, before and after all the jumps), I reached 9.7 GK/s on an RTX 4090 (64 jump points, DP 32), which was a 75% increase in speed, and I haven't even tried the micro-optimizations I did before. So I guess this was the missing piece of knowledge needed to go beyond the advertised 8+ GK/s stated by others around here, after trying every advanced optimization I could think of to speed things up.
So did you start work on solving 135?
Off the grid, training pigeons to broadcast signed messages.
RetiredCoder (OP)
Full Member
 
Offline
Activity: 163
Merit: 141
No pain, no gain!
December 18, 2024, 10:32:14 PM
Hi! pretty cool tool :-) Do you plan to implement if possible to have - a "continue" option in case the gpu stop and you want to continue from where is stopped ? - a "multiple pubkey" option like an input file with a list of pubkey ? thanks and have a good one
1. Maybe. 2. No, it's a bad idea.
I must admit you used some really clever tricks to make maximum usage of shared memory (L1) and L2 caches. I'm still trying to figure out the way you keep track of the jump distances using the shared memory instead of updating them using L2.
After adapting my own kernel to load/store stuff using L2 (instead of only once, before and after all the jumps) I reached 9.7 GK/s on RTX 4090 (64 jump points, DP 32), which was an increase of 75% in speed, and I haven't even tried to do micro-optimizations on it, like before. So I guess this was the missing lack of knowledge to be able go beyond the advertised 8+ GK/s stated by others around here, after trying every possible advanced optimizations I could think of to speed things up.
So did you start work on solving 135?
Yes, 10 GK/s for a 4090 is OK. And then one day you will understand that the only way to improve it further is to use symmetry and get a sqrt(2) boost. Yes, you will lose some speed, but the total improvement is worth it. From the RCKangaroo readme:
Fastest ECDLP solvers will always use the SOTA method, as it's 1.39 times faster and requires less memory for DPs compared to the best 3-way kangaroos with K=1.6. Even if you already have a faster implementation of kangaroo jumps, incorporating the SOTA method will improve it further. While adding the necessary loop-handling code will cause you to lose about 5-15% of your current speed, the SOTA method itself will provide a 39% performance increase. Overall, this translates to roughly a 25% net improvement, which should not be ignored if your goal is to build a truly fast solver.
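The net-improvement claim in the readme excerpt checks out arithmetically, treating the 5-15% loop-handling loss and the 39% SOTA gain as given:

```python
# Net speedup = SOTA gain x remaining speed after loop-handling overhead.
sota_gain = 1.39                 # SOTA vs best 3-way kangaroos (K=1.6)
for loss in (0.05, 0.15):        # 5-15% raw-speed loss from loop handling
    net = sota_gain * (1 - loss)
    print(f"{loss:.0%} loss -> {net:.2f}x net")
# 5% loss -> 1.32x net
# 15% loss -> 1.18x net
```

So the combined effect stays between roughly 1.18x and 1.32x, i.e. about a 25% net improvement at the midpoint.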
kTimesG
December 19, 2024, 09:31:13 AM
The comparison between SOTA and 3-kangaroo you made is only relevant if the jumping device (RTX 4090 in your case) allows the tradeoffs you mention, in order for the tradeoffs to end up with a more efficient solver.
Let me give you an example.
RTX 3050 or other cards that have a low amount of L2 memory.
Your RCKangaroo is much slower than a non-cycling jumper, like 50% slower, because the overhead of handling the cycles is clearly visible (there is no large L2 to hide the compute cost behind). It is even worse when the kangaroos are stored in global memory (since L2 is too small to cache them).
So the winner in these cases is a normal (optimized) 3-kangaroo algorithm.
But because RTX 4090 has around 75 MB of L2 cache, the lower speed is hidden because it is offset by the fast L2 cache. In other words, the raw speed becomes irrelevant, because now the speed bound is limited by the memory latency of L2, not by the raw computing cores.
In this case you can do pretty much whatever you want in the kernel and go crazy with whatever algorithm you want to use, like your SOTA method, because it won't affect the speed too much.
What I want to say is: in the future you can't really know if new GPUs will have the same tradeoff benefit, so there is no guarantee that the computing size of a CUDA device won't have a greater influence on speed when compared to the cache latency and size of L2 memory.
RetiredCoder (OP)
Full Member
 
Offline
Activity: 163
Merit: 141
No pain, no gain!
December 19, 2024, 10:30:04 AM
The comparison between SOTA and 3-kangaroo you made is only relevant if the jumping device (RTX 4090 in your case) allows the tradeoffs you mention, in order for the tradeoffs to end up with a more efficient solver.
Let me give you an example. RTX 3050 or other cards that have a low amount of L2 memory. Your RCKangaroo is much slower than a non-cycling jumper, like 50% slower, because the overhead of handling the cycles is clearly visible (no L2 memory to hide computing bounds). It is even worse when the kangaroos are stored in global memory (since L2 is too small to cache them). So the winner in these cases is a normal (optimized) 3-kangaroo algorithm.
But because RTX 4090 has around 75 MB of L2 cache, the lower speed is hidden because it is offset by the fast L2 cache. In other words, the raw speed becomes irrelevant, because now the speed bound is limited by the memory latency of L2, not by the raw computing cores. In this case you can do pretty much whatever you want in the kernel and go crazy with whatever algorithm you want to use, like your SOTA method, because it won't affect the speed too much.
What I want to say is: in the future you can't really know if new GPUs will have the same tradeoff benefit, so there is no guarantee that the computing size of a CUDA device won't have a greater influence on speed when compared to the cache latency and size of L2 memory.
You are wrong, it's not related to L2; the loop-handling slowdown is similar for new and old cards. But I won't argue. I will release a new version with a 20-35% speedup for old cards soon. Not much optimized, but anyway faster than now.
kTimesG
December 19, 2024, 11:23:51 AM
If X1, Y1, X2, Y2, Z, and the jump distance are all in registers, you get maximum speed. You can never get faster than that. But you can only do it with a very small number of kangaroos, like up to 6 or 7, depending on how well you use the registers.
The next wall is when L1 + shared memory get used. This is the next maximum possible speed, but with lower speed per kangaroo than above. You can add maybe one more kangaroo this way, because this cache is really small (128 KB per SM).
The third wall is using the L2 cache: much lower speed per kangaroo, even though the overall throughput is greater. This scales up well only if L2 is really big.
This is why: when the L2 cache is small, loading and storing X1, Y1 in and out of L2 before and after each jump is much too slow, because they will actually spill into device global memory. So, if L2 is small, the only logical (and faster) option is to load X1, Y1 once before all the jumps and store them back after all the jumps.
Adding cycle handling in this case reduces the number of X1, Y1 that fit, meaning fewer kangaroos and lower speed.
I have like seven different kernel versions that test these strategies for when and how data is loaded into the different memory levels, so I'm pretty sure about the differences between them.
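A back-of-the-envelope capacity check illustrates why the walls sit where they do. Assuming one kangaroo state is an affine point plus a 256-bit jump distance (an illustrative layout, not necessarily what either implementation actually stores):

```python
# How many kangaroo states fit at each memory level (illustrative layout).
state_bytes = 32 + 32 + 32            # X1, Y1, jump distance (256-bit each)

levels = {
    "L1/shared per SM (Ada)": 128 * 1024,        # the 128 KB mentioned above
    "L2, RTX 4090":           72 * 1024 * 1024,  # ~72 MB
    "L2, RTX 3050":           2 * 1024 * 1024,   # ~2 MB
}
for name, size in levels.items():
    print(f"{name}: ~{size // state_bytes:,} states")
```

The gap between ~786k states fitting in a 4090's L2 and ~21k in a 3050's is what makes the per-card strategies diverge so sharply.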
RetiredCoder (OP)
Full Member
 
Offline
Activity: 163
Merit: 141
No pain, no gain!
December 20, 2024, 12:01:59 PM (Merited by Etar (2), whanau (2))
v2.0 (Windows/Linux): https://github.com/RetiredC/RCKangaroo
- added support for 30xx, 20xx and 1xxx cards.
- some minor changes.
Speed: 4090 - 7.9 GKeys/s, 3090 - 4.1 GKeys/s, 2080Ti - 2.9 GKeys/s. Please report speed for other cards; for old cards the speedup is up to 40%.
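At these speeds, the expected single-card wall time for #135 follows from the kangaroo cost formula: expected group operations are about K * sqrt(range width), taking K = 1.15 as the SOTA figure from the RCKangaroo readme (an assumption here):

```python
import math

K = 1.15                       # SOTA method's K, per the RCKangaroo readme
range_bits = 134               # puzzle #135 range is 2^134 wide
ops = K * math.sqrt(2.0 ** range_bits)   # expected group operations, ~1.7e20

for card, gkeys in (("4090", 7.9), ("3090", 4.1), ("2080Ti", 2.9)):
    seconds = ops / (gkeys * 1e9)
    print(f"{card}: ~{seconds / 86400 / 365.25:.0f} years")
# 4090: ~681 years -- consistent with the ~250k-day figure earlier in the thread
```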
Etar
December 20, 2024, 12:12:58 PM
Please report speed for other cards, for old cards speedup is up to 40%.
Thanks! 1660 Super: 930-948 MKeys/s
MrGPBit
Jr. Member
Offline
Activity: 52
Merit: 1
December 20, 2024, 12:59:24 PM
Please report speed for other cards, for old cards speedup is up to 40%.
NVIDIA GeForce GTX 1660 Ti (Laptop): 975 MKeys/s
Fibonacci_Dev
Newbie
Offline
Activity: 5
Merit: 0
December 20, 2024, 01:16:42 PM
v2.0 (Windows/Linux): https://github.com/RetiredC/RCKangaroo - added support for 30xx, 20xx and 1xxx cards. - some minor changes. Speed: 4090 - 7.9GKeys/s. 3090 - 4.1GKeys/s. 2080Ti - 2.9GKeys/s. Please report speed for other cards, for old cards speedup is up to 40%.
Thank you so much for your incredible contribution to this tool! I believe it's already fantastic, but there's just one feature that could make it truly perfect: an --end parameter. Here's an example to illustrate what I mean. Puzzle 135's range is 40000000000000000000000000000000000000:7fffffffffffffffffffffffffffffff. I'd like to try my luck and search within a specific sub-range, such as: --start 52000000000000000000000000000000000000 --end 5affffffffffffffffffffffffffffff. An --end parameter would be a game-changer for scenarios like this. Thanks in advance for your consideration.