Bitcoin Forum
April 11, 2026, 10:25:27 PM *
News: Latest Bitcoin Core release: 30.2 [Torrent]
 
Author Topic: Bitcoin puzzle transaction ~32 BTC prize to who solves it  (Read 378933 times)
kind_user
Newbie
*
Offline Offline

Activity: 25
Merit: 0


View Profile
April 07, 2026, 01:09:52 PM
 #12981

pscamillo you killed the forum with your software. If you want to give assistance to the noobs, please make a thread on your github...
BlackAKAAngel
Newbie
*
Offline Offline

Activity: 26
Merit: 0


View Profile
April 07, 2026, 05:23:03 PM
 #12982

I have AMD 9950X and RTX 4090 and I am getting 0 Gkeys/s both plain and with -allwild 1, something is not right. I've put "89" instead of "120" in the right place. Can I solve this?

Hi pbies, thanks for trying it out.
 
0 GKeys/s usually means the CUDA kernel isn't launching. A few things to check:
 
1. Make sure you rebuilt after changing the arch. Clean build:
Code:
make clean && make GPU_ARCH="-gencode=arch=compute_89,code=sm_89"

2. Check that your CUDA toolkit version supports sm_89. You need CUDA 12.0+ for Ada Lovelace. Run:
Code:
nvcc --version

3. Verify the GPU is detected. The banner should show your RTX 4090 with "cap 8.9". If it shows a different capability, the arch is wrong.
 
4. Can you share the full output from startup? The banner + first few lines will help me diagnose what's happening.
 
If you were able to run the original RCKangaroo on the same machine, then it's likely just the arch setting. Let me know!
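The cap-to-arch mapping is mechanical, so here is a tiny sketch of it (Python, names are mine and purely illustrative, not from RCKangaroo's build):

```python
# Illustrative helper (mine, not from any real build script): maps the compute
# capability shown in the startup banner (e.g. "cap 8.9") to the nvcc -gencode
# flag the Makefile expects.

def gencode_flag(compute_cap: str) -> str:
    """'8.9' -> '-gencode=arch=compute_89,code=sm_89'"""
    major, minor = compute_cap.split(".")
    sm = f"{major}{minor}"
    return f"-gencode=arch=compute_{sm},code=sm_{sm}"

# RTX 4090 (Ada Lovelace) reports capability 8.9, hence sm_89:
print(gencode_flag("8.9"))   # -gencode=arch=compute_89,code=sm_89
# Blackwell reports 12.0, hence sm_120, which a 4090 cannot execute:
print(gencode_flag("12.0"))  # -gencode=arch=compute_120,code=sm_120
```

So whatever capability the banner prints is exactly the number that must appear in the -gencode flag.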

Yep, I did that. Clean build, CUDA 13.2, GPU is seen but "no kernel image is available for execution on the device".

Images (EDITED the post!):

https://ibb.co/7drY4DRm
https://ibb.co/LXFBB4L8
https://ibb.co/9m5xYKtQ

Commands:

pip install --upgrade torch --extra-index-url https://download.pytorch.org/whl/cu126

sudo apt install nvidia-cuda-toolkit

didn't help.
Try with a small DP: 16 or 18; if that doesn't work, 22 or 25.
pbies
Sr. Member
****
Offline Offline

Activity: 417
Merit: 257



View Profile
April 07, 2026, 08:56:59 PM
 #12983

Yep, I did that. Clean build, CUDA 13.2, GPU is seen but "no kernel image is available for execution on the device".

I can see the issue in your screenshots. Your GPU is detected correctly (RTX 4090, cap 8.9, CUDA 13.1/13.2), but the kernel binary was compiled for the wrong architecture.
 
The error cuSetGpuParams failed: no kernel image is available for execution on the device means the .cu file was compiled for sm_120 (Blackwell) but your GPU needs sm_89 (Ada Lovelace).
 
Please try this exact sequence — all on the command line, don't edit the Makefile:
Code:
make clean
make GPU_ARCH="-gencode=arch=compute_89,code=sm_89"

During compilation, look for this line in the output:
Code:
--gpu-architecture=compute_89

If you see compute_120 instead, the override isn't working. In that case, edit the Makefile directly — change line 24:
Code:
GPU_ARCH ?= -gencode=arch=compute_89,code=sm_89

Then:
Code:
make clean && make

The key thing is that all .o files must be deleted before recompiling. If even one old object file remains, the linker will use the wrong kernel image.

make clean did the trick. I think there were stale .o files built for sm_120. Now it is working at 7.56-7.71 GKeys/s.

BTC: bc1qmrexlspd24kevspp42uvjg7sjwm8xcf9w86h5k
Realman121
Newbie
*
Offline Offline

Activity: 4
Merit: 0


View Profile
April 08, 2026, 07:53:06 AM
Last edit: April 08, 2026, 10:31:42 PM by Mr. Big
 #12984

Hi !
Can somebody tell me how the key 1 was obtained with point G, the generator constant of ECDSA?
I read that it was brute force.
I've just started to think about this, and there are many interesting things in these numbers.
 
I'm sorry for my question, but how do you get 0.72%?
 
p = 2**-51
n = 2**48
 
P(X≥2)=1−P(0)−P(1) ≈ 0.72%
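(For reference, the quoted 0.72% can be reproduced with the Poisson approximation, assuming p = 2^-51 and n = 2^48 as above; this is only a sketch of the arithmetic, not a claim about where the original number came from:)

```python
import math

# Poisson approximation to the binomial: n trials, per-trial probability p.
p = 2.0 ** -51
n = 2.0 ** 48
lam = n * p  # expected number of hits = 2^-3 = 0.125

# P(X >= 2) = 1 - P(0) - P(1) = 1 - e^-lam - lam * e^-lam
p_ge_2 = 1.0 - math.exp(-lam) - lam * math.exp(-lam)
print(f"{p_ge_2:.4%}")  # ~0.72%
```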
 
Sorry, I'm just starting to learn about this topic and may make some mistakes.
For example:
 
51 mod p = 0.627450980392156862745098
1/1.59375
1.59375
1 19/32     1 19/32 - 1 5/8
 
32/51  16/25.5   8/12.75  4/6.375  2/3.1875  
 
51/32 = 1.59375 (=1.98)
408/256 255/160 204/128 153/96 102/64 51/32
51/32 (1; 1,2,6)
 
51/32 25.5/16 12.75/8 6.375/4 3.1875/2 1.59375/1 0.796875/0.5 0.3984375/0.25 0.19921875/0.125 0.099609375/0.0625
1/16=0.0625 0.5^4 1/2^4 6.25%
 
0.0625 = 0.0625/1 0.125/2 0.25/4 0.5/8 1/16 2/32 3/48 4/64 8/128 16/256
1,59375/2=0.796875 (0.474250137805938720703125) (121.4080352783203125 79.6875 79.98 )
just 79.98......
 
121.74375=79.BE6
 
32.90/256=0.128515625 (0.20E6666666666666666666666 0.E6666666666666666666666 0.9)
32.9/255=0.12901960784313725490196078431373
 
32.896/256=0.1285 (0.20E5604189374BC6A7EF9DB23 0.E5604189374BC6A7EF9DB23 0.896)
32.896/255=0.12900392156862745098039215686275
 
1/255 = 0.0(0392156862745098)
 


16-10-22
32-20-50
64-40-100
128-80-296
256-100-598
10=2
100=4
1000=8
60-96-3C-150
60-96-69-45-2d
0.1=1/10=0.199999999999999999999999A=0.0625=1/16=0.5/8=0.25/4=0.125/2=0.0625


48 mod n = 0.9791 (6)
pub key = 032B9434EB24870CE4643966C2C976B6373C864E99EFBC87AD8B0586EE3180B643 (97.9166%)
 
145D08D23DFA2741D5 = 75/60 = 57/96 = 0.59375 (0.98) 0.34850025177001953125 1E3A42941509F0765
 
1/2=0.5 (50.0000%) 0.5*2=G
pubkey     0300000000000000000000003b78ce563f89a0ed9414f5aa28ad0d96d6795f9c63
                0200000000000000000000003b78ce563f89a0ed9414f5aa28ad0d96d6795f9c63
 
1/3=0.(3)  
3 mod n = 0.(6) 66.6666% (2/3)
we have
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA9D1C9E899CA306AD27FE1945DE0242B81 (0.(6)
pubkey     034c7ff4f2ba8603998339c8e42675ceac23ef2e9623fdb260b24b1c944a2ea1a9
2/3*0.5= 0.(3)    
55555555555555555555555555555554E8E4F44CE51835693FF0CA2EF01215C0 (0.(3) pubkey 024c7ff4f2ba8603998339c8e42675ceac23ef2e9623fdb260b24b1c944a2ea1a9 (33.3333%)



0.5/0.75*0.75 we get point 0.5
0.5/0.75*0.75*2=1
maybe it has many solution but I find just 1

what about 0.1378 0.0760498046875 4c 31 80=50=0.5=0.8=128 or just 8 468-0.75=467.25 1d3.4 or smthing else)

mod p
10 = 0.3 (4CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC7FFFFEDB) - 30.0000%
100 = 0.73 (BAE147AE147AE147AE147AE147AE147AE147AE147AE147AE147AE146F333306A) 73.0000%
1000 = 0.273 (45E353F7CED916872B020C49BA5E353F7CED916872B020C49BA5E353B1EB8414) 27.3000%
and so on)
 
I think we have a mixed (combined) measurement system.
 
key 65 1a838b13505b26867 30568377312064202855
 
mod p
650 = BA4A0B0716D7D3E3A4A0B0716D7D3E3A4A0B0716D7D3E3A4A0B0716CC2F42C7C
 
0.BA4A0B0716D7D3E3A4A0B0716D7D3E3A4A0B0716D7D3E3A4A0B0716CC2F42C7C
0.7276923076923076923076923
 
650 mod  p / 60 =1404350210493604113726899382284855908066582121714050943350344544506176379479.4
31AD584628398DD64E08B795B6CC109813BEAC8E9FF43CB46F1DFC1D00C9A57.666666666666666 6666666666
31AD584628398DD64E08B795B6CC109813BEAC8E9FF43CB46F1DFC1D00C9A57
1404350210493604113726899382284855908066582121714050943350344544506176379479
72,7692%
key ba4a0b0716d7d3e3a4a0b0716d7d3e3a4a0b0716d7d3e3a4a0b0716cc2f42c29 - ba4a0b0716d7d3e3a4a0b0716d7d3e3a4a0b0716d7d3e3a4a0b0716cc2f42c64
946/13=72.769230769230769230769230769231
1/13=0.07692307692307692307692307692308 0.13B13B13B13B13B13B13B85F7

and what about this key
0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
0.0044444444444444443866203

pubkey
034646ae5047316b4230d0086c8acec687f00b1cd9d1dc634f6cb358ac0a9a8fff
0.4444%
0.0044444444444444443866203=1/960=0.0010416666666666666666568=0.000248038768768310546875=0.0010416666666666666666666
0.0000000000000000003866203=0.0000000000000000000000467=1d3
 
I will continue my research... I just want to understand the idea )
vneos
Jr. Member
*
Offline Offline

Activity: 41
Merit: 12


View Profile
April 09, 2026, 06:48:44 AM
 #12985

Found some interesting things  Wink

https://mempool.space/zh/testnet4/tx/91f8ed8de613e3f22dcb2047d443a509aa3aba9654741f62b4ff8d8e84dda904
marmaria
Newbie
*
Offline Offline

Activity: 2
Merit: 0


View Profile
April 09, 2026, 09:03:44 AM
 #12986


Care to explain? lol
pscamillo
Newbie
*
Offline Offline

Activity: 7
Merit: 10


View Profile
April 09, 2026, 05:57:30 PM
Last edit: April 09, 2026, 08:29:47 PM by pscamillo
Merited by Cricktor (3)
 #12987

Thanks Cricktor for the question — sorry for the late reply, I wanted to have real numbers before answering.

PSCKangaroo v59 — now with concurrent mode and real benchmarks.

First: kTimesG's feedback was correct. Endomorphism, cheap second point, and XDP were all removed in v57 — none of them helped. The current version focuses on what actually matters for long runs: memory management, crash resilience, and compact storage.

Benchmark: PSC v59 vs RCKangaroo v3.1
Hardware: RTX 5070 / Ryzen 9800X3D / 128 GB RAM / CUDA 12.9 / Linux
Puzzle 80 (79-bit range), 5 runs each:

Code:
Solver                         Median   Mean    Best    Worst   Solved
RCKangaroo DP=16                301s    299s    120s     514s    5/5
PSC v59 concurrent DP=12 8GB    320s    463s    236s     837s    5/5
PSC v59 concurrent DP=14 20GB   423s    536s    142s    1223s    5/5

RC wins by ~6% on median — expected, SOTA K=1.15 is mathematically optimal. Both run the same GPU kernel at ~3.1 GK/s.

So is there a significant advantage?

Not for Puzzle 80 — RC is slightly faster and has zero setup overhead. For short puzzles, use RCKangaroo.

The advantage shows up for long-running puzzles (135+):

1. -ramlimit: RC has no memory limit. On a 128 GB system it OOM-crashes in ~4h at DP=14, ~18h at DP=16, ~12 days at DP=20 (verified from RC source, line 328: (32+4+4) bytes/entry, malloc without NULL check). For safe multi-month runs, RC needs DP≥24.

2. Checkpoint/resume: PSC auto-saves every N hours + on Ctrl+C. RC's -tames feature saves pre-generated TAMEs but not solve progress — a crash during solving loses all WILDs and accumulated state.

3. 16-byte entries: 2.5× more DPs per GB vs RC's ~40 bytes/entry. This allows lower DP values within the same RAM budget.

4. Concurrent mode (v59): runs 33% TAME + 67% WILD from second 1 (same t² dynamics as RC), but with memory protection. Earlier versions had a slow TRAP phase — that’s gone now.

To be honest about the math: Puzzle 135 needs ~2^67 operations — roughly 1,740 years on a single RTX 5070. No solver changes that. We're all playing a probabilistic lottery. PSCKangaroo just makes sure no ticket is wasted by a crash or reboot.
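Back-of-envelope check of the OOM timings and the 1,740-year figure, assuming ~3.1 GK/s and ~40 bytes per RC entry as stated above (rough sketch only; constants come from this post, not from either codebase):

```python
SPEED = 3.1e9        # keys/s, per the benchmark above
RAM = 128 * 2**30    # bytes of system RAM
ENTRY = 40           # approx. bytes per RC hash-table entry, per the post

def hours_to_oom(dp: int) -> float:
    """Hours until RAM fills with distinguished points at a given DP."""
    dps_per_sec = SPEED / 2**dp  # one DP found every 2^dp keys on average
    return RAM / (dps_per_sec * ENTRY) / 3600

print(f"DP=14: {hours_to_oom(14):.1f} h")        # ~5 h (same ballpark as ~4 h claimed)
print(f"DP=16: {hours_to_oom(16):.1f} h")        # ~20 h
print(f"DP=20: {hours_to_oom(20) / 24:.0f} d")   # ~13 days

# Puzzle 135: ~K * sqrt(2^134) = 1.15 * 2^67 expected group operations
years = 1.15 * 2**67 / SPEED / (365.25 * 86400)
print(f"{years:.0f} years")                      # ~1735 years on one GPU
```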

Code: https://github.com/pscamillo/PSCKangaroo
README has the full analysis including OOM timings and mathematical reality of large puzzles.

Update: Windows support added — Visual Studio 2022 project files (.sln/.vcxproj) now included in the repo.

Questions, bug reports or suggestions → GitHub Issues: https://github.com/pscamillo/PSCKangaroo/issues
This way we keep the forum clean.
brainless
Member
**
Offline Offline

Activity: 478
Merit: 35


View Profile
April 09, 2026, 08:19:21 PM
 #12988

Thanks Cricktor for the question — sorry for the late reply, I wanted to have real numbers before answering.

PSCKangaroo v59 — now with concurrent mode and real benchmarks.
[...]
1740 years for 1 GPU 5090.
RC found 125 and 130 when the 5090 didn't exist; maybe 4090s were used.
Could you calculate how many GPUs he used? He claimed to solve within 3 months.

13sXkWqtivcMtNGQpskD78iqsgVy9hcHLF
Realman121
Newbie
*
Offline Offline

Activity: 4
Merit: 0


View Profile
April 10, 2026, 03:29:08 AM
Last edit: April 10, 2026, 06:05:29 AM by Realman121
 #12989


Hi

and if it can help

0.001 pubkey  02 79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798 (1) 1 dec
0.004 pubkey  03 f28773c2d975288bc7d1d205c3748651b075fbc6610e58cddeeddf8f19405aa8 (d) 13 dec
0.005 pubkey  02 9248279b09b4d68dab21a9b066edda83263c3d84e09572e269ca0cd7f5453714 (19) 25 dec
0.006 pubkey  03 d30199d74fb5a22d47b6e054e2f378cedacffcb89904a61d75d0dbd407143e65 (20) 32 dec
0.007 pubkey  02 078c9407544ac132692ee1910a02439958ae04877151342ea96c4b6b35a49f51 (77) 119 dec

0.002 pubkey  02 f9308a019258c31049344f85f89d5229b531c845836f99b08601f113bce036f9 (3) 3 dec

0.003 pubkey  03 fff97bd5755eeea420453a14355235d382f6472f8568a18b2f057a1460297556 (6) 6 dec
0.008 pubkey  02 ee163026e9fd6fe017c38f06a5be6fc125424b371ce2708e7bf4491691e5764a (c5) 197 dec

As I said in another topic about "the Secp256k1 elliptic curve",
maybe we have "reversed" the measurement systems in the code and got the G-point (just maybe).
Or just the 'code'.
So...
2 -10-A-16 or 10 (dec) or 10-1 or 10-1010
3 -11-B-17 11 (dec) 11-2 11-1011 or 11 (dec) 11*11-121 (G)? 1111-BB (BB hex or binary?) 1111-1717  1111-2731-6f1-1273-16f-4369-457-f-10001010111 1111-33-71-17 1111-4-22-112
4-100-A0-160-256 100 (dec) 100-64 100-20 100-1 100-1100100
5-101-A1-161-257 101 (dec) 101-65 101-21 101-2 101-1100101
6-110-B0-170-272 110 (dec) 110-6e 110-30 110-12 110-2 110-1101110    110-116
7-111-B1-171-273 111 (dec) 111-6f 111-31 111-13 111-3 111-1101111    111-117 111-1B
8-1000-A00-1600-4096 1000 (dec) 1000-3e8 1000-200 1000-40 1000-2560 1000-640 1000-1111101000 1000-1
9-1001-A01-1601-4097 1001 (dec) 1001-3e9 1001-201 1001-41 1001-2561 1001-641 1001-1111101001  1001-2


2-10 10-1001
3-11 11-1111
4-100 100-100001
5-101 101-101101
6-110 110-110011
7-111 111-111111

and so on...
maybe it can be done another way


and how
1-1
2-10
3-11
7-111

can help

I'm sorry again for my English; it's hard to find the words I mean.
I want to understand this 'puzzle'
and the 'masking idea',
and I want to try to see whether the elliptic curve is Real or Fake and we have only the code.
Or maybe we have a combined system.
Maybe it's just FAKE.
I haven't decided what to think yet, 50/50,
because I need to do some experiments with the curve and numbers.
But maybe I am mistaken in my thoughts.
And A not rA
like 1001

r3cruit
Newbie
*
Offline Offline

Activity: 2
Merit: 0


View Profile
April 10, 2026, 06:32:13 AM
 #12990

Thanks Cricktor for the question — sorry for the late reply, I wanted to have real numbers before answering.

PSCKangaroo v59 — now with concurrent mode and real benchmarks.
[...]


Why are you doing such tests? Why not take the path of developing new software? At the moment, my kangaroo solves puzzle 61 in 1250 seconds, and I only have a very budget 1050 Ti mobile video card. Testing data:

  Performance:
    Time:       1275.0s
    Positions:  85683339264
    Keys:       359381972608352256
    renameParam compares:2741866856448
    Speed:      67.2 Mpos/s
    Speed:      281864955.5 Mkeys/s equiv

P.S. The source code will not be published anywhere. If you think that speed is deceptive, I can offer a bet.
0xastraeus
Newbie
*
Offline Offline

Activity: 34
Merit: 0


View Profile
April 10, 2026, 11:40:08 AM
 #12991

If you're not going to post your project...don't talk about it.

pscamillo posted his and asked for feedback. He never claimed to be better than another project or the most complex project out there.
He accepted criticism from others and updated his project accordingly.

It's really tiring hearing you people claim to have developed some wonder project, yet have nothing to back it up.

Why are you doing such tests? Why not take the path of developing new software? At the moment, my kangaroo solves 61 puzzles in 1250 seconds. [...]
NUCLEAR7.1
Newbie
*
Offline Offline

Activity: 5
Merit: 0


View Profile
Today at 01:00:57 AM
 #12992

I saw everyone talking again about Adam Back. Do you think he is the creator?   Huh
NotFuzzyWarm
Legendary
*
Offline Offline

Activity: 4326
Merit: 3392


Evil beware: We have waffles!


View Profile
Today at 02:08:17 AM
 #12993

I saw everyone talking again about Adam Back. Do you think he is the creator?   Huh
That is off topic.
Any particular reason you did not post in the correct thread here?

- For bitcoin to succeed the community must police itself -    My info useful? Donations welcome!  3NtFuzyWREGoDHWeMczeJzxFZpiLAFJXYr
 -Sole remaining active Primary developer of cgminer, Kano's repo is here  Discord support invite at https://kano.is/
-Support Sidehacks miner development. Donations to:   1BURGERAXHH6Yi6LRybRJK7ybEm5m5HwTr
XGiftGodX
Newbie
*
Offline Offline

Activity: 2
Merit: 0


View Profile
Today at 10:48:01 AM
 #12994

Thanks Cricktor for the question — sorry for the late reply, I wanted to have real numbers before answering.

PSCKangaroo v59 — now with concurrent mode and real benchmarks.
[...]


Can you help me set up and tweak my laptop to use your program?


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.02              Driver Version: 595.97         CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   35C    P0             14W /   60W |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

GPU 0

   AMD Radeon(TM) Graphics

   Driver version:   31.0.21921.13001
   Driver date:   7/7/2025
   DirectX version:   12 (FL 12.1)
   Physical location:   PCI bus 5, device 0, function 0

   Utilization   2%
   Dedicated GPU memory   0.4/2.0 GB
   Shared GPU memory   0.2/6.9 GB
   GPU Memory   0.5/8.9 GB

GPU 1

   NVIDIA GeForce RTX 3050 Laptop GPU

   Driver version:   32.0.15.9597
   Driver date:   3/17/2026
   DirectX version:   12 (FL 12.2)
   Physical location:   PCI bus 1, device 0, function 0

   Utilization   0%
   Dedicated GPU memory   0.0/4.0 GB
   Shared GPU memory   0.0/6.9 GB
   GPU Memory   0.0/10.9 GB


icqcointokenwallet
Full Member
***
Offline Offline

Activity: 260
Merit: 101


View Profile
Today at 11:26:51 AM
 #12995

I use different GPU rentals. Will the latest CUDA 13.0 work with all CUDA compute capabilities, even legacy ones?
SecretAdmirere
Newbie
*
Offline Offline

Activity: 9
Merit: 1


View Profile
Today at 01:25:52 PM
 #12996

I use different GPU rentals.  Will the latest Cuda 13.0 work with all Cuda capabilities, even legacy?

CUDA 13.x supports Turing and newer (GTX 16 / RTX 20-series and above), while CUDA 12.9 is the last CUDA version that supported Maxwell, Volta and Pascal. The support mainly affects the build process (offline compilation): you can't compile for compute capability < 7.5 using CUDA 13.x. Your program might still run on compute capability < 7.5 thanks to the driver's Just-In-Time (JIT) compilation even though you didn't compile for a legacy compute capability, but most likely it won't. I would recommend going with a CUDA 12.x version if you are worried about compatibility.

Edit: No, it won't run. If you compile for e.g. 8.6, it's not going to run on 6.1; it's forward compatible, not backward. My mistake.
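A toy model of those rules, if it helps (illustrative Python only, not any real CUDA API; capabilities written as integers, e.g. 8.6 -> 86):

```python
# Toy model of CUDA binary compatibility (illustrative only):
# - a cubin built for sm_X runs only on devices of the same major
#   architecture with compute capability >= X;
# - embedded PTX for compute_X can be JIT-compiled by the driver for any
#   device with capability >= X. Neither runs on an OLDER capability.

def cubin_runs(built_cc: int, device_cc: int) -> bool:
    return device_cc // 10 == built_cc // 10 and device_cc >= built_cc

def ptx_jit_runs(ptx_cc: int, device_cc: int) -> bool:
    return device_cc >= ptx_cc

print(cubin_runs(86, 61))    # False: an sm_86 binary cannot run on a 6.1 GPU
print(ptx_jit_runs(75, 89))  # True: compute_75 PTX JITs on an 8.9 GPU
print(ptx_jit_runs(86, 61))  # False: forward compatible, not backward
```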
blankx4729
Newbie
*
Offline Offline

Activity: 4
Merit: 0


View Profile
Today at 02:47:30 PM
Last edit: Today at 04:11:58 PM by blankx4729
 #12997

================================================================================

 🦘 PSCKangaroo  — GPU-accelerated Pollard's Kangaroo for secp256k1 ECDLP

  Fork of RCKangaroo by RetiredCoder (RC)
  Original: https://github.com/RetiredC/RCKangaroo
  Special thanks to RC for SOTA!
  License:  GPLv3

================================================================================

  Core algorithm:  SOTA (Equivalence Classes + Negation Map, K~1.15)
  Optimizations:   SOTA+ | ALL-TAME mode | 16-byte compact entries | Async BSGS
                   Ultra-compact 16-byte DPs | Async BSGS resolver
                   Dual hash table | Table freeze | Uniform jumps
  Modes:           ALL-TAME (recommended 130+ bits) | ALL-WILD | TRAP/HUNT
  Checkpoint:      Auto-save + Ctrl+C safe exit (format RCKDT5C)

  Platform: Linux

CUDA devices: 1, CUDA driver/runtime: 13.0/12.5
GPU 0: NVIDIA GeForce GTX 1060 6GB, 6.00 GB, 10 CUs, cap 6.1, PCI 1, L2 size: 1536 KB
GroupCnt: 24, Kangaroos per GPU: 294912 (1.0x default)
Total GPUs for work: 1
Target:
X: 145D2611C823A396EF6712CE0F712F09B9B4F3135E3E0AA3230FB9B6D08D1E16
Y: 667A05E9A1BDD6F70142B66558BD12CE2C0F9CBC7001B20C8A6A109C80DC5330
Offset: 0000000000000000000000000000004000000000000000000000000000000000

========================================================================
TameStore v58 ALL-TAME + W-W BUFFER (SOTA HYBRID)
========================================================================
  Config: Occupancy=1, GroupCnt=24
  RAM limit: 18.0 GB
  TAME table: 1115553792 entries (16.6 GB) [16 bytes/entry]
  W-W buffer: 58712064 entries (0.87 GB) [5% of RAM]
  Strategy: T-W from TAME table + W1-W2 from W-W buffer (SOTA hybrid)
  Expected K improvement: ~2.0 → ~1.5 (25% fewer ops needed)
  NOTE: Uses BSGS (~400ms) to resolve truncated distances on collision.
  Spatial buckets: 65536
========================================================================
  W-W buffer: allocated OK (58712064 entries)
TameStore ALL-TAME + W-W BUFFER: Ready!
Table freeze: ENABLED (tables become read-only when full, no FP explosion)

*** LOADING CHECKPOINT: wild_checkpoint.dat ***
*** CHECKPOINT LOADED (tables_present: 1) ***
    W1/TAMEs: 194792080 entries
    W2: not saved (ALL-TAME mode)
Checkpoint loaded! Table[0]: 194M / 1115M (17.5%)
  TRAP will resume from 17.5% (need 93% to switch to HUNT)

Auto-checkpoint enabled: Every 1 hours to 'wild_checkpoint.dat'
Press Ctrl+C to save checkpoint and exit safely (works in TRAP and HUNT).


Solving point: Range 134 bits, DP 13

Strategy: CONCURRENT (v59 — RC-style t² growth)
  GPU split: 33% TAME + 33% WILD1 + 33% WILD2 from second 1
  TAMEs stored → table grows while WILDs hunt simultaneously
  Collision probability grows with t² (quadratic) until table full
  Est. table fill: 0.1 days (then switch to 100% WILDs)
  W-W buffer: 5% RAM (T-W + W1-W2 collisions)
  Advantage: t² growth + ramlimit + checkpoint + 16-byte entries

GPU 0: allocated 1010 MB, 327680 kangaroos. OldGpuMode: Yes
CONCURRENT MODE (v59): Starting with 33% TAME + 67% WILDs...
  TAMEs build table while WILDs hunt — t² collision growth!
GPUs ready for CONCURRENT operation!

Started 12 worker threads for LOCK-FREE processing
Precomputing BSGS baby table (131072 entries)...
BSGS baby table ready: 131072 entries, giant range +/-2^32, built in 587 ms
BSGS async resolver: STARTED (4 threads, queue max=262144, precomputed baby table)

CONC: Speed: 0.19 GKeys/s | Time: 0d 00h 58m
  TAMEs: 219M / 1115M (19.6%) | +7609 TAMEs/s
  WILDs: 48M checks | T-W: 7 | W-W: 10 | FP: 34 | 13.8K/s
  W-W buffer: 48M stored, 10 W1-W2 hits
  BSGS [4 thr]: 0 pending, 17 processed, 0 dropped

I don't know what I am doing, but for p80 on a GTX 1060 it took ~6 hours to find the key, not 2 minutes Smiley. Just curious: after this .dat file completes, what are the chances of me finding the priv key with 1.1B "traps" within a year for p135? Less than 5%?
SecretAdmirere
Newbie
*
Offline Offline

Activity: 9
Merit: 1


View Profile
Today at 03:12:21 PM
 #12998

I don't know what I am doing, but for p80 on a GTX 1060 it took ~6 hours to find the key, not 2 minutes Smiley. Just curious: after this .dat file completes, what are the chances of me finding the priv key with 1.1B "traps" within a year for p135? Less than 5%?

You are correct, the chance is indeed less than 5%, but I would highly recommend you unsubscribe from ChatGPT and put that money towards something more productive. Also, you can research the "coin flipping" method recently discovered at MIT, which would be much more effective than whatever that program is. It requires a few basic components to run and does not increase your electricity bill; the requirements are as follows: pen, paper, coin, and a hand strong enough to flip the coin 135 times. Heads = 1, tails = 0; flip it 135 times, write down the bits after each flip, then calculate privkey -> pubkey -> sha256 -> ripemd160. Joke aside: don't, just don't waste SSD lifetime on this, whatever it even is.
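For the curious, a crude estimate with the numbers from the log above (~0.19 GK/s, and the usual ~1.15 * sqrt(2^134) expected operations for puzzle 135) backs up the "way below 5%" call; the birthday-style model is only a sketch, not an exact kangaroo analysis:

```python
import math

speed = 0.19e9                       # keys/s, from the GTX 1060 log above
work_year = speed * 365.25 * 86400   # operations done in one year
expected = 1.15 * 2**67              # ~K * sqrt(2^134) ops for puzzle 135

# Kangaroo success is roughly birthday-like: P ~ 1 - exp(-(t / E)^2),
# which for t << E is about (t / E)^2.
p = 1.0 - math.exp(-((work_year / expected) ** 2))
print(f"{p:.2e}")  # around 1e-9: effectively zero, vastly below 5%
```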
blankx4729
Newbie
*
Offline Offline

Activity: 4
Merit: 0


View Profile
Today at 04:08:16 PM
 #12999

 Grin Grin Thank you for the reply, and you are right, it feels like a waste of hardware and time. But it is also a great way of learning more about how things work, and even if I won't find the key to the puzzle, as a tech enthusiast I find this interesting.
XMieLgNRvyra45MMZg
Newbie
*
Offline Offline

Activity: 14
Merit: 0


View Profile
Today at 07:08:05 PM
Last edit: Today at 07:36:22 PM by XMieLgNRvyra45MMZg
 #13000

Sorry guys, don't you need the public key for Kangaroo?  Grin


This is for a totally different thing but maybe useful for some of you:  Cool

SECP256K1_GPU.h
Code:
/*
 * secp256k1 GPU Implementation for PRNG Attack
 *
 * Optimized scalar multiplication for CUDA
 * Uses Montgomery representation for field arithmetic
 */

#ifndef SECP256K1_GPU_H
#define SECP256K1_GPU_H

#include <cuda_runtime.h>
#include <stdint.h>

// secp256k1 prime: p = 2^256 - 2^32 - 977
__constant__ uint32_t SECP256K1_P[8] = {
    0xFFFFFC2F, 0xFFFFFFFE, 0xFFFFFFFF, 0xFFFFFFFF,
    0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF
};

// secp256k1 order: n
__constant__ uint32_t SECP256K1_N[8] = {
    0xD0364141, 0xBFD25E8C, 0xAF48A03B, 0xBAAEDCE6,
    0xFFFFFFFE, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF
};

// Generator point G.x
__constant__ uint32_t SECP256K1_GX[8] = {
    0x16F81798, 0x59F2815B, 0x2DCE28D9, 0x029BFCDB,
    0xCE870B07, 0x55A06295, 0xF9DCBBAC, 0x79BE667E
};

// Generator point G.y
__constant__ uint32_t SECP256K1_GY[8] = {
    0xFB10D4B8, 0x9C47D08F, 0xA6855419, 0xFD17B448,
    0x0E1108A8, 0x5DA4FBFC, 0x26A3C465, 0x483ADA77
};

// Field element (256-bit)
typedef struct {
    uint32_t d[8];
} fe_t;

// Point in Jacobian coordinates
typedef struct {
    fe_t x;
    fe_t y;
    fe_t z;
} jpoint_t;

// Point in affine coordinates  
typedef struct {
    fe_t x;
    fe_t y;
} apoint_t;

// ============================================================================
// 256-bit Modular Arithmetic (mod p)
// ============================================================================

__device__ __forceinline__ void fe_zero(fe_t *r) {
    #pragma unroll
    for (int i = 0; i < 8; i++) r->d[i] = 0;
}

__device__ __forceinline__ void fe_one(fe_t *r) {
    r->d[0] = 1;
    #pragma unroll
    for (int i = 1; i < 8; i++) r->d[i] = 0;
}

__device__ __forceinline__ void fe_copy(fe_t *r, const fe_t *a) {
    #pragma unroll
    for (int i = 0; i < 8; i++) r->d[i] = a->d[i];
}

__device__ __forceinline__ int fe_is_zero(const fe_t *a) {
    uint32_t x = 0;
    #pragma unroll
    for (int i = 0; i < 8; i++) x |= a->d[i];
    return x == 0;
}

__device__ __forceinline__ int fe_cmp(const fe_t *a, const fe_t *b) {
    for (int i = 7; i >= 0; i--) {
        if (a->d[i] < b->d[i]) return -1;
        if (a->d[i] > b->d[i]) return 1;
    }
    return 0;
}

// r = a + b (mod p)
__device__ void fe_add(fe_t *r, const fe_t *a, const fe_t *b) {
    uint64_t c = 0;
    
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        c += (uint64_t)a->d[i] + b->d[i];
        r->d[i] = (uint32_t)c;
        c >>= 32;
    }
    
    // Reduce if >= p (must compare all limbs, not just the top and bottom words)
    fe_t p_val;
    #pragma unroll
    for (int i = 0; i < 8; i++) p_val.d[i] = SECP256K1_P[i];
    
    if (c || fe_cmp(r, &p_val) >= 0) {
        uint64_t borrow = 0;
        #pragma unroll
        for (int i = 0; i < 8; i++) {
            uint64_t t = (uint64_t)r->d[i] - p_val.d[i] - borrow;
            r->d[i] = (uint32_t)t;
            borrow = (t >> 63) & 1;
        }
    }
}

// r = a - b (mod p)
__device__ void fe_sub(fe_t *r, const fe_t *a, const fe_t *b) {
    int64_t c = 0;
    
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        c += (int64_t)a->d[i] - b->d[i];
        r->d[i] = (uint32_t)c;
        c >>= 32;
    }
    
    // If underflow, add p
    if (c < 0) {
        c = 0;
        #pragma unroll
        for (int i = 0; i < 8; i++) {
            c += (uint64_t)r->d[i] + SECP256K1_P[i];
            r->d[i] = (uint32_t)c;
            c >>= 32;
        }
    }
}

// r = a * b (mod p) - using secp256k1's special prime for fast reduction
__device__ void fe_mul(fe_t *r, const fe_t *a, const fe_t *b) {
    uint64_t product[16] = {0};
    
    // Full 512-bit multiplication
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        uint64_t carry = 0;
        #pragma unroll
        for (int j = 0; j < 8; j++) {
            uint64_t t = product[i + j] + (uint64_t)a->d[i] * b->d[j] + carry;
            product[i + j] = t & 0xFFFFFFFF;
            carry = t >> 32;
        }
        product[i + 8] += carry;
    }
    
    // Fast reduction for p = 2^256 - 2^32 - 977:
    // write a*b = H*2^256 + L; since 2^256 == 2^32 + 977 (mod p),
    // r = L + H*977 + H*2^32. The H*2^32 term shifts each high word up
    // one limb, so limb i receives high[i-1].
    uint64_t c = 0;
    
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        c += product[i];
        c += product[i + 8] * 977;           // high[i] * 977 at limb i
        if (i > 0) c += product[i + 7];      // high[i-1] from the 2^32 term
        r->d[i] = (uint32_t)c;
        c >>= 32;
    }
    c += product[15];                         // high[7] from the 2^32 term (limb 8)
    
    // Handle remaining carries
    while (c) {
        uint64_t d = 0;
        d += r->d[0] + c * 977;
        r->d[0] = (uint32_t)d;
        d >>= 32;
        
        d += r->d[1] + c;
        r->d[1] = (uint32_t)d;
        c = d >> 32;
        
        for (int i = 2; i < 8 && c; i++) {
            d = (uint64_t)r->d[i] + c;
            r->d[i] = (uint32_t)d;
            c = d >> 32;
        }
    }
    
    // Final reduction if >= p
    fe_t p_val;
    #pragma unroll
    for (int i = 0; i < 8; i++) p_val.d[i] = SECP256K1_P[i];
    
    if (fe_cmp(r, &p_val) >= 0) {
        fe_sub(r, r, &p_val);
    }
}

// r = a^2 (mod p) - optimized squaring
__device__ void fe_sqr(fe_t *r, const fe_t *a) {
    fe_mul(r, a, a);  // Can be optimized further
}

// r = a^(-1) (mod p) using Fermat's little theorem: a^(p-2) mod p
__device__ void fe_inv(fe_t *r, const fe_t *a) {
    fe_t x2, x3, x6, x9, x11, x22, x44, x88, x176, x220, x223, t;
    
    fe_sqr(&x2, a);
    fe_mul(&x2, &x2, a);      // x2 = a^3
    
    fe_sqr(&x3, &x2);
    fe_mul(&x3, &x3, a);      // x3 = a^7
    
    fe_copy(&x6, &x3);
    for (int i = 0; i < 3; i++) fe_sqr(&x6, &x6);
    fe_mul(&x6, &x6, &x3);
    
    fe_copy(&x9, &x6);
    for (int i = 0; i < 3; i++) fe_sqr(&x9, &x9);
    fe_mul(&x9, &x9, &x3);
    
    fe_copy(&x11, &x9);
    for (int i = 0; i < 2; i++) fe_sqr(&x11, &x11);
    fe_mul(&x11, &x11, &x2);
    
    fe_copy(&x22, &x11);
    for (int i = 0; i < 11; i++) fe_sqr(&x22, &x22);
    fe_mul(&x22, &x22, &x11);
    
    fe_copy(&x44, &x22);
    for (int i = 0; i < 22; i++) fe_sqr(&x44, &x44);
    fe_mul(&x44, &x44, &x22);
    
    fe_copy(&x88, &x44);
    for (int i = 0; i < 44; i++) fe_sqr(&x88, &x88);
    fe_mul(&x88, &x88, &x44);
    
    fe_copy(&x176, &x88);
    for (int i = 0; i < 88; i++) fe_sqr(&x176, &x176);
    fe_mul(&x176, &x176, &x88);
    
    fe_copy(&x220, &x176);
    for (int i = 0; i < 44; i++) fe_sqr(&x220, &x220);
    fe_mul(&x220, &x220, &x44);
    
    fe_copy(&x223, &x220);
    for (int i = 0; i < 3; i++) fe_sqr(&x223, &x223);
    fe_mul(&x223, &x223, &x3);
    
    // Tail of p-2 after the run of 223 ones: 0, 22 ones, 00001, 011, 01
    fe_copy(&t, &x223);
    for (int i = 0; i < 23; i++) fe_sqr(&t, &t);
    fe_mul(&t, &t, &x22);
    for (int i = 0; i < 5; i++) fe_sqr(&t, &t);
    fe_mul(&t, &t, a);
    for (int i = 0; i < 3; i++) fe_sqr(&t, &t);
    fe_mul(&t, &t, &x2);
    for (int i = 0; i < 2; i++) fe_sqr(&t, &t);
    fe_mul(r, &t, a);
}

// ============================================================================
// Jacobian Point Operations
// ============================================================================

__device__ void jpoint_set_infinity(jpoint_t *p) {
    fe_zero(&p->x);
    fe_zero(&p->y);
    fe_zero(&p->z);
}

__device__ int jpoint_is_infinity(const jpoint_t *p) {
    return fe_is_zero(&p->z);
}

// Convert affine to Jacobian
__device__ void jpoint_from_affine(jpoint_t *r, const apoint_t *a) {
    fe_copy(&r->x, &a->x);
    fe_copy(&r->y, &a->y);
    fe_one(&r->z);
}

// Convert Jacobian to affine
__device__ void jpoint_to_affine(apoint_t *r, const jpoint_t *p) {
    if (jpoint_is_infinity(p)) {
        fe_zero(&r->x);
        fe_zero(&r->y);
        return;
    }
    
    fe_t z_inv, z_inv2, z_inv3;
    fe_inv(&z_inv, &p->z);
    fe_sqr(&z_inv2, &z_inv);
    fe_mul(&z_inv3, &z_inv2, &z_inv);
    
    fe_mul(&r->x, &p->x, &z_inv2);
    fe_mul(&r->y, &p->y, &z_inv3);
}

// Point doubling: r = 2 * p (Jacobian coordinates)
__device__ void jpoint_double(jpoint_t *r, const jpoint_t *p) {
    if (jpoint_is_infinity(p) || fe_is_zero(&p->y)) {
        jpoint_set_infinity(r);
        return;
    }
    
    fe_t s, m, x3, y3, z3, t1, t2;
    
    // S = 4 * X * Y^2
    fe_sqr(&t1, &p->y);          // t1 = Y^2
    fe_mul(&s, &p->x, &t1);      // s = X * Y^2
    fe_add(&s, &s, &s);          // s = 2 * X * Y^2
    fe_add(&s, &s, &s);          // s = 4 * X * Y^2
    
    // M = 3 * X^2 (a = 0 for secp256k1)
    fe_sqr(&m, &p->x);           // m = X^2
    fe_add(&t2, &m, &m);         // t2 = 2 * X^2
    fe_add(&m, &t2, &m);         // m = 3 * X^2
    
    // X3 = M^2 - 2*S
    fe_sqr(&x3, &m);             // x3 = M^2
    fe_sub(&x3, &x3, &s);        // x3 = M^2 - S
    fe_sub(&x3, &x3, &s);        // x3 = M^2 - 2*S
    
    // Y3 = M * (S - X3) - 8 * Y^4
    fe_sub(&t2, &s, &x3);        // t2 = S - X3
    fe_mul(&y3, &m, &t2);        // y3 = M * (S - X3)
    fe_sqr(&t2, &t1);            // t2 = Y^4
    fe_add(&t2, &t2, &t2);       // t2 = 2 * Y^4
    fe_add(&t2, &t2, &t2);       // t2 = 4 * Y^4
    fe_add(&t2, &t2, &t2);       // t2 = 8 * Y^4
    fe_sub(&y3, &y3, &t2);
    
    // Z3 = 2 * Y * Z
    fe_mul(&z3, &p->y, &p->z);
    fe_add(&z3, &z3, &z3);
    
    fe_copy(&r->x, &x3);
    fe_copy(&r->y, &y3);
    fe_copy(&r->z, &z3);
}

// Point addition: r = p + q (mixed: p Jacobian, q affine)
__device__ void jpoint_add_mixed(jpoint_t *r, const jpoint_t *p, const apoint_t *q) {
    if (jpoint_is_infinity(p)) {
        jpoint_from_affine(r, q);
        return;
    }
    
    fe_t z2, u2, s2, h, hh, i, j, rr, v, t1;
    
    // Z^2
    fe_sqr(&z2, &p->z);
    
    // U2 = X2 * Z1^2
    fe_mul(&u2, &q->x, &z2);
    
    // S2 = Y2 * Z1^3
    fe_mul(&s2, &z2, &p->z);
    fe_mul(&s2, &s2, &q->y);
    
    // H = U2 - X1
    fe_sub(&h, &u2, &p->x);
    
    // Check if same point (need doubling)
    if (fe_is_zero(&h)) {
        fe_sub(&t1, &s2, &p->y);
        if (fe_is_zero(&t1)) {
            // Same point - double
            jpoint_double(r, p);
            return;
        }
        // Point at infinity
        jpoint_set_infinity(r);
        return;
    }
    
    // HH = H^2
    fe_sqr(&hh, &h);
    
    // I = 4 * HH
    fe_add(&i, &hh, &hh);
    fe_add(&i, &i, &i);
    
    // J = H * I
    fe_mul(&j, &h, &i);
    
    // r = 2 * (S2 - Y1)
    fe_sub(&rr, &s2, &p->y);
    fe_add(&rr, &rr, &rr);
    
    // V = X1 * I
    fe_mul(&v, &p->x, &i);
    
    // X3 = r^2 - J - 2*V
    fe_sqr(&r->x, &rr);
    fe_sub(&r->x, &r->x, &j);
    fe_sub(&r->x, &r->x, &v);
    fe_sub(&r->x, &r->x, &v);
    
    // Y3 = r * (V - X3) - 2 * Y1 * J
    fe_sub(&t1, &v, &r->x);
    fe_mul(&r->y, &rr, &t1);
    fe_mul(&t1, &p->y, &j);
    fe_add(&t1, &t1, &t1);
    fe_sub(&r->y, &r->y, &t1);
    
    // Z3 = 2 * Z1 * H
    fe_mul(&r->z, &p->z, &h);
    fe_add(&r->z, &r->z, &r->z);
}

// ============================================================================
// Scalar Multiplication
// ============================================================================

// Get generator point
__device__ void get_generator(apoint_t *g) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        g->x.d[i] = SECP256K1_GX[i];
        g->y.d[i] = SECP256K1_GY[i];
    }
}

// Scalar multiplication: r = k * G
__device__ void scalar_mult_base(apoint_t *r, const uint8_t *k) {
    apoint_t G;
    jpoint_t R;
    
    get_generator(&G);
    jpoint_set_infinity(&R);
    
    // Simple double-and-add (can be optimized with precomputation)
    for (int i = 255; i >= 0; i--) {
        jpoint_double(&R, &R);
        
        int byte_idx = i / 8;
        int bit_idx = i % 8;
        
        if ((k[31 - byte_idx] >> bit_idx) & 1) {
            jpoint_add_mixed(&R, &R, &G);
        }
    }
    
    jpoint_to_affine(r, &R);
}

// ============================================================================
// Public Key Computation and Comparison
// ============================================================================

__device__ bool compute_pubkey_and_check(const uint8_t *priv_key, const uint8_t *target_x) {
    apoint_t pub;
    scalar_mult_base(&pub, priv_key);
    
    // Compare X coordinate; target_x must be 32 little-endian bytes to match
    // the limb layout (reverse a standard big-endian pubkey X before passing it)
    for (int i = 0; i < 8; i++) {
        uint32_t target_word = ((uint32_t)target_x[i*4+3] << 24) |
                               ((uint32_t)target_x[i*4+2] << 16) |
                               ((uint32_t)target_x[i*4+1] << 8) |
                               ((uint32_t)target_x[i*4]);
        if (pub.x.d[i] != target_word) return false;
    }
    
    return true;
}

#endif // SECP256K1_GPU_H

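Before trusting that header: the two easiest things to get wrong in it are the fast reduction in fe_mul and the p-2 exponent for fe_inv. A quick host-side Python check of the identities they rely on (plain integers, not the CUDA code itself):

```python
# Sanity checks for the secp256k1 field math in the header above.
P = 2**256 - 2**32 - 977  # the value the SECP256K1_P limbs encode

# 1) Fast-reduction identity behind fe_mul: write a*b = H*2^256 + L.
#    Since 2^256 == 2^32 + 977 (mod P), a*b == L + H*977 + H*2^32 (mod P),
#    so limb i of the partially reduced result receives high[i-1] from the
#    2^32 term.
a = 0xDEADBEEFCAFEBABE123456789ABCDEF01122334455667788AABBCCDDEEFF0011
b = P - 12345
prod = a * b
L = prod % 2**256
H = prod >> 256

# 2) Exponent a correct p-2 addition chain must produce: 223 ones, then a 0
#    and 22 ones (23 squarings + mul by x22), then 00001, 011, 01.
x22 = 2**22 - 1
x223 = 2**223 - 1
e = x223
e = e * 2**23 + x22   # 23 squarings, multiply by x22
e = e * 2**5 + 1      # 5 squarings, multiply by a
e = e * 2**3 + 3      # 3 squarings, multiply by x2 = a^(binary 11)
e = e * 2**2 + 1      # 2 squarings, multiply by a

# 3) Fermat inverse, which is what fe_inv computes.
inv = pow(a, P - 2, P)
```

Any change to the reduction or the inversion chain can be re-checked against these identities before touching the device code.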
SECP256K1_FAST.CUH
Code:
// secp256k1_fast.cuh - Highly optimized GPU secp256k1 implementation
// Features:
// - 8x32-bit limb field arithmetic
// - Jacobian coordinates (no inversions during computation)
// - Precomputed tables for generator G (16 points, 4-bit windows)
// - Windowed scalar multiplication
// - Fast reduction using secp256k1 special form: p = 2^256 - 2^32 - 977

#pragma once
#include <stdint.h>
#include <cuda_runtime.h>

// ============================================================================
// FIELD ELEMENT (256-bit mod p)
// p = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
// ============================================================================

typedef struct {
    uint32_t d[8];  // Little-endian: d[0] is LSW
} fe_t;

// Curve order n
// n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

// ============================================================================
// JACOBIAN POINT (X, Y, Z) where affine (x,y) = (X/Z^2, Y/Z^3)
// ============================================================================

typedef struct {
    fe_t x, y, z;
} jac_point_t;

// ============================================================================
// CONSTANTS - Generator point and precomputed table
// ============================================================================

// Generator G (affine)
__device__ __constant__ uint32_t G_X[8] = {
    0x16F81798, 0x59F2815B, 0x2DCE28D9, 0x029BFCDB,
    0xCE870B07, 0x55A06295, 0xF9DCBBAC, 0x79BE667E
};

__device__ __constant__ uint32_t G_Y[8] = {
    0xFB10D4B8, 0x9C47D08F, 0xA6855419, 0xFD17B448,
    0x0E1108A8, 0x5DA4FBFC, 0x26A3C465, 0x483ADA77
};

// Prime p = 2^256 - 2^32 - 977
__device__ __constant__ uint32_t P[8] = {
    0xFFFFFC2F, 0xFFFFFFFE, 0xFFFFFFFF, 0xFFFFFFFF,
    0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF
};

// 2*p for easier reduction (2*p = 2^257 - 2^33 - 1954)
__device__ __constant__ uint32_t P2[9] = {
    0xFFFFF85E, 0xFFFFFFFD, 0xFFFFFFFF, 0xFFFFFFFF,
    0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x01
};

// Precomputed table: G_TABLE[i] = (i+1)*G for i = 0..15 (affine coordinates)
// These are computed at init time by init_g_table_kernel()
__device__ fe_t G_TABLE_X[16];
__device__ fe_t G_TABLE_Y[16];

// ============================================================================
// FIELD ARITHMETIC
// ============================================================================

// Zero
__device__ __forceinline__ void fe_zero(fe_t *r) {
    #pragma unroll
    for (int i = 0; i < 8; i++) r->d[i] = 0;
}

// Copy
__device__ __forceinline__ void fe_copy(fe_t *r, const fe_t *a) {
    #pragma unroll
    for (int i = 0; i < 8; i++) r->d[i] = a->d[i];
}

// Set from 32 big-endian bytes (converted to little-endian internal limbs)
__device__ __forceinline__ void fe_set_be(fe_t *r, const uint8_t *be32) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        int j = 7 - i;
        r->d[i] = ((uint32_t)be32[j*4] << 24) | ((uint32_t)be32[j*4+1] << 16) |
                  ((uint32_t)be32[j*4+2] << 8) | (uint32_t)be32[j*4+3];
    }
}

// Get as big-endian byte array
__device__ __forceinline__ void fe_get_be(uint8_t *be32, const fe_t *a) {
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        int j = 7 - i;
        be32[j*4] = (a->d[i] >> 24) & 0xff;
        be32[j*4+1] = (a->d[i] >> 16) & 0xff;
        be32[j*4+2] = (a->d[i] >> 8) & 0xff;
        be32[j*4+3] = a->d[i] & 0xff;
    }
}

// Compare: returns -1 if a<b, 0 if a==b, 1 if a>b
__device__ __forceinline__ int fe_cmp(const fe_t *a, const fe_t *b) {
    #pragma unroll
    for (int i = 7; i >= 0; i--) {
        if (a->d[i] > b->d[i]) return 1;
        if (a->d[i] < b->d[i]) return -1;
    }
    return 0;
}

// Is zero?
__device__ __forceinline__ int fe_is_zero(const fe_t *a) {
    uint32_t z = 0;
    #pragma unroll
    for (int i = 0; i < 8; i++) z |= a->d[i];
    return z == 0;
}

// Addition: r = a + b (without reduction)
__device__ __forceinline__ uint32_t fe_add_raw(fe_t *r, const fe_t *a, const fe_t *b) {
    uint64_t c = 0;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        c += (uint64_t)a->d[i] + b->d[i];
        r->d[i] = (uint32_t)c;
        c >>= 32;
    }
    return (uint32_t)c;
}

// Subtraction: r = a - b, returns borrow
__device__ __forceinline__ uint32_t fe_sub_raw(fe_t *r, const fe_t *a, const fe_t *b) {
    int64_t c = 0;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        c += (int64_t)a->d[i] - b->d[i];
        r->d[i] = (uint32_t)c;
        c >>= 32;
    }
    return (uint32_t)(c & 1);
}

// Reduce mod p - assuming r < 2*p
__device__ __forceinline__ void fe_reduce_once(fe_t *r) {
    fe_t p_val;
    #pragma unroll
    for (int i = 0; i < 8; i++) p_val.d[i] = P[i];
    
    fe_t tmp;
    uint32_t borrow = fe_sub_raw(&tmp, r, &p_val);
    
    // If no borrow, use reduced value
    if (!borrow) {
        fe_copy(r, &tmp);
    }
}

// Full reduction mod p
__device__ __forceinline__ void fe_reduce(fe_t *r) {
    fe_reduce_once(r);
    fe_reduce_once(r);
}

// Add mod p: r = a + b mod p
__device__ __forceinline__ void fe_add(fe_t *r, const fe_t *a, const fe_t *b) {
    uint32_t carry = fe_add_raw(r, a, b);
    
    // If carry or >= p, subtract p
    fe_t p_val;
    #pragma unroll
    for (int i = 0; i < 8; i++) p_val.d[i] = P[i];
    
    if (carry || fe_cmp(r, &p_val) >= 0) {
        fe_sub_raw(r, r, &p_val);
    }
}

// Sub mod p: r = a - b mod p
__device__ __forceinline__ void fe_sub(fe_t *r, const fe_t *a, const fe_t *b) {
    uint32_t borrow = fe_sub_raw(r, a, b);
    
    // If borrow, add p
    if (borrow) {
        fe_t p_val;
        #pragma unroll
        for (int i = 0; i < 8; i++) p_val.d[i] = P[i];
        fe_add_raw(r, r, &p_val);
    }
}

// Negate mod p: r = -a mod p = p - a
__device__ __forceinline__ void fe_neg(fe_t *r, const fe_t *a) {
    fe_t p_val;
    #pragma unroll
    for (int i = 0; i < 8; i++) p_val.d[i] = P[i];
    fe_sub_raw(r, &p_val, a);
}

// Multiply mod p using secp256k1 special reduction
// p = 2^256 - 2^32 - 977
// For r = a*b mod p:
// Let a*b = H * 2^256 + L (H is high 256 bits, L is low 256 bits)
// r = L + H * (2^32 + 977) mod p
__device__ void fe_mul(fe_t *r, const fe_t *a, const fe_t *b) {
    uint64_t product[16];
    
    // Full 256x256 -> 512 bit multiplication
    #pragma unroll
    for (int i = 0; i < 16; i++) product[i] = 0;
    
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        uint64_t carry = 0;
        #pragma unroll
        for (int j = 0; j < 8; j++) {
            uint64_t prod = (uint64_t)a->d[i] * b->d[j] + product[i+j] + carry;
            product[i+j] = prod & 0xFFFFFFFF;
            carry = prod >> 32;
        }
        product[i+8] = carry;
    }
    
    // Reduction: r = L + H * (2^32 + 977), where a*b = H*2^256 + L
    uint64_t c = 0;
    
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        c += product[i];
        // H * 977 contribution: high[i] at limb i
        c += product[i + 8] * 977ULL;
        // H * 2^32 contribution: each high word moves up one limb,
        // so limb i receives high[i-1]
        if (i > 0) c += product[i + 7];
        r->d[i] = (uint32_t)c;
        c >>= 32;
    }
    
    // high[7] from the 2^32 term lands at limb 8, i.e. in the carry
    c += product[15];
    
    // Fold the carry back in: carry*2^256 == carry*(2^32 + 977) (mod p)
    while (c) {
        uint64_t lo = c & 0xFFFFFFFF;
        uint64_t hi = c >> 32;
        uint64_t d = (uint64_t)r->d[0] + lo * 977ULL;
        r->d[0] = (uint32_t)d;
        d >>= 32;
        d += (uint64_t)r->d[1] + lo + hi * 977ULL;
        r->d[1] = (uint32_t)d;
        d >>= 32;
        d += (uint64_t)r->d[2] + hi;
        r->d[2] = (uint32_t)d;
        c = d >> 32;
        for (int i = 3; i < 8 && c; i++) {
            d = (uint64_t)r->d[i] + c;
            r->d[i] = (uint32_t)d;
            c = d >> 32;
        }
    }
    
    fe_reduce(r);
}

// Square mod p (can be optimized but using mul for now)
__device__ __forceinline__ void fe_sqr(fe_t *r, const fe_t *a) {
    fe_mul(r, a, a);
}

// Double: r = 2*a mod p
__device__ __forceinline__ void fe_dbl(fe_t *r, const fe_t *a) {
    fe_add(r, a, a);
}

// Modular inverse using Fermat's little theorem: a^(-1) = a^(p-2) mod p
// p-2 = 0xFFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFE FFFFFC2D
// Addition chain: a run of 223 ones, then the low bits 0 | 22 ones | 00001 | 011 | 01
// This is slow but simple. For batch, use Montgomery's trick
__device__ void fe_inv(fe_t *r, const fe_t *a) {
    fe_t x2, x3, x6, x9, x11, x22, x44, x88, x176, x220, x223, t;
    
    fe_sqr(&x2, a);
    fe_mul(&x2, &x2, a);            // x2 = a^(2 ones)
    
    fe_sqr(&x3, &x2);
    fe_mul(&x3, &x3, a);            // x3 = a^(3 ones)
    
    fe_copy(&x6, &x3);
    for (int i = 0; i < 3; i++) fe_sqr(&x6, &x6);
    fe_mul(&x6, &x6, &x3);          // a^(6 ones)
    
    fe_copy(&x9, &x6);
    for (int i = 0; i < 3; i++) fe_sqr(&x9, &x9);
    fe_mul(&x9, &x9, &x3);          // a^(9 ones)
    
    fe_copy(&x11, &x9);
    for (int i = 0; i < 2; i++) fe_sqr(&x11, &x11);
    fe_mul(&x11, &x11, &x2);        // a^(11 ones)
    
    fe_copy(&x22, &x11);
    for (int i = 0; i < 11; i++) fe_sqr(&x22, &x22);
    fe_mul(&x22, &x22, &x11);       // a^(22 ones)
    
    fe_copy(&x44, &x22);
    for (int i = 0; i < 22; i++) fe_sqr(&x44, &x44);
    fe_mul(&x44, &x44, &x22);       // a^(44 ones)
    
    fe_copy(&x88, &x44);
    for (int i = 0; i < 44; i++) fe_sqr(&x88, &x88);
    fe_mul(&x88, &x88, &x44);       // a^(88 ones)
    
    fe_copy(&x176, &x88);
    for (int i = 0; i < 88; i++) fe_sqr(&x176, &x176);
    fe_mul(&x176, &x176, &x88);     // a^(176 ones)
    
    fe_copy(&x220, &x176);
    for (int i = 0; i < 44; i++) fe_sqr(&x220, &x220);
    fe_mul(&x220, &x220, &x44);     // a^(220 ones)
    
    fe_copy(&x223, &x220);
    for (int i = 0; i < 3; i++) fe_sqr(&x223, &x223);
    fe_mul(&x223, &x223, &x3);      // a^(223 ones)
    
    // Tail: append 0, 22 ones, 00001, 011, 01
    fe_copy(&t, &x223);
    for (int i = 0; i < 23; i++) fe_sqr(&t, &t);
    fe_mul(&t, &t, &x22);
    for (int i = 0; i < 5; i++) fe_sqr(&t, &t);
    fe_mul(&t, &t, a);
    for (int i = 0; i < 3; i++) fe_sqr(&t, &t);
    fe_mul(&t, &t, &x2);
    for (int i = 0; i < 2; i++) fe_sqr(&t, &t);
    fe_mul(r, &t, a);
}

// ============================================================================
// POINT OPERATIONS (Jacobian coordinates)
// ============================================================================

// Initialize as point at infinity
__device__ __forceinline__ void jac_set_inf(jac_point_t *p) {
    fe_zero(&p->x);
    fe_zero(&p->y);
    fe_zero(&p->z);
    p->y.d[0] = 1;  // Convention: infinity has Y=1, Z=0
}

// Check if point is at infinity (Z == 0)
__device__ __forceinline__ int jac_is_inf(const jac_point_t *p) {
    return fe_is_zero(&p->z);
}

// Set from affine coordinates (x, y) -> (X=x, Y=y, Z=1)
__device__ __forceinline__ void jac_set_affine(jac_point_t *p, const fe_t *x, const fe_t *y) {
    fe_copy(&p->x, x);
    fe_copy(&p->y, y);
    fe_zero(&p->z);
    p->z.d[0] = 1;
}

// Copy point
__device__ __forceinline__ void jac_copy(jac_point_t *r, const jac_point_t *p) {
    fe_copy(&r->x, &p->x);
    fe_copy(&r->y, &p->y);
    fe_copy(&r->z, &p->z);
}

// Point doubling: r = 2*p (Jacobian)
// Using "dbl-2001-b" formula: 1M + 5S + 1*a + 7add + 1*4 + 2*8
// Since a=0 for secp256k1: 1M + 5S + 6add + 1*4 + 2*8
__device__ void jac_double(jac_point_t *r, const jac_point_t *p) {
    if (jac_is_inf(p)) {
        jac_set_inf(r);
        return;
    }
    
    fe_t s, m, x3, y3, z3, t1, t2;
    
    // S = 4*X*Y^2
    fe_sqr(&t1, &p->y);           // Y^2
    fe_mul(&s, &p->x, &t1);       // X*Y^2
    fe_dbl(&s, &s);               // 2*X*Y^2
    fe_dbl(&s, &s);               // 4*X*Y^2
    
    // M = 3*X^2 + a*Z^4 (a=0 for secp256k1)
    fe_sqr(&m, &p->x);            // X^2
    fe_dbl(&t2, &m);              // 2*X^2
    fe_add(&m, &m, &t2);          // 3*X^2
    
    // X3 = M^2 - 2*S
    fe_sqr(&x3, &m);              // M^2
    fe_dbl(&t2, &s);              // 2*S
    fe_sub(&x3, &x3, &t2);        // M^2 - 2*S
    
    // Y3 = M*(S - X3) - 8*Y^4
    fe_sub(&t2, &s, &x3);         // S - X3
    fe_mul(&y3, &m, &t2);         // M*(S - X3)
    fe_sqr(&t2, &t1);             // Y^4
    fe_dbl(&t2, &t2);             // 2*Y^4
    fe_dbl(&t2, &t2);             // 4*Y^4
    fe_dbl(&t2, &t2);             // 8*Y^4
    fe_sub(&y3, &y3, &t2);        // M*(S-X3) - 8*Y^4
    
    // Z3 = 2*Y*Z
    fe_mul(&z3, &p->y, &p->z);    // Y*Z
    fe_dbl(&z3, &z3);             // 2*Y*Z
    
    fe_copy(&r->x, &x3);
    fe_copy(&r->y, &y3);
    fe_copy(&r->z, &z3);
}

// Point addition: r = p + q (Jacobian + Jacobian)
// Using "add-2007-bl" formula: 11M + 5S + 9add + 4*2
__device__ void jac_add(jac_point_t *r, const jac_point_t *p, const jac_point_t *q) {
    if (jac_is_inf(p)) { jac_copy(r, q); return; }
    if (jac_is_inf(q)) { jac_copy(r, p); return; }
    
    fe_t z1z1, z2z2, u1, u2, s1, s2, h, i, j, rr, v;
    fe_t x3, y3, z3, t1, t2;
    
    fe_sqr(&z1z1, &p->z);         // Z1^2
    fe_sqr(&z2z2, &q->z);         // Z2^2
    
    fe_mul(&u1, &p->x, &z2z2);    // U1 = X1*Z2^2
    fe_mul(&u2, &q->x, &z1z1);    // U2 = X2*Z1^2
    
    fe_mul(&t1, &q->z, &z2z2);    // Z2^3
    fe_mul(&s1, &p->y, &t1);      // S1 = Y1*Z2^3
    
    fe_mul(&t1, &p->z, &z1z1);    // Z1^3
    fe_mul(&s2, &q->y, &t1);      // S2 = Y2*Z1^3
    
    fe_sub(&h, &u2, &u1);         // H = U2 - U1
    
    // If H=0, points have same X, check if same point or inverse
    if (fe_is_zero(&h)) {
        fe_sub(&t1, &s2, &s1);
        if (fe_is_zero(&t1)) {
            jac_double(r, p);     // Same point
            return;
        } else {
            jac_set_inf(r);       // Inverse points
            return;
        }
    }
    
    fe_dbl(&i, &h);               // 2*H
    fe_sqr(&i, &i);               // I = (2*H)^2
    
    fe_mul(&j, &h, &i);           // J = H*I
    
    fe_sub(&rr, &s2, &s1);        // S2 - S1
    fe_dbl(&rr, &rr);             // r = 2*(S2 - S1)
    
    fe_mul(&v, &u1, &i);          // V = U1*I
    
    // X3 = r^2 - J - 2*V
    fe_sqr(&x3, &rr);             // r^2
    fe_sub(&x3, &x3, &j);         // r^2 - J
    fe_dbl(&t1, &v);              // 2*V
    fe_sub(&x3, &x3, &t1);        // r^2 - J - 2*V
    
    // Y3 = r*(V - X3) - 2*S1*J
    fe_sub(&t1, &v, &x3);         // V - X3
    fe_mul(&y3, &rr, &t1);        // r*(V - X3)
    fe_mul(&t1, &s1, &j);         // S1*J
    fe_dbl(&t1, &t1);             // 2*S1*J
    fe_sub(&y3, &y3, &t1);        // r*(V-X3) - 2*S1*J
    
    // Z3 = ((Z1+Z2)^2 - Z1^2 - Z2^2)*H
    fe_add(&t1, &p->z, &q->z);    // Z1+Z2
    fe_sqr(&t1, &t1);             // (Z1+Z2)^2
    fe_sub(&t1, &t1, &z1z1);      // (Z1+Z2)^2 - Z1^2
    fe_sub(&t1, &t1, &z2z2);      // (Z1+Z2)^2 - Z1^2 - Z2^2
    fe_mul(&z3, &t1, &h);         // Z3
    
    fe_copy(&r->x, &x3);
    fe_copy(&r->y, &y3);
    fe_copy(&r->z, &z3);
}

// Mixed addition: r = p (Jacobian) + q (affine, Z=1)
// More efficient: 7M + 4S + 9add + 3*2 + 1*4
__device__ void jac_add_affine(jac_point_t *r, const jac_point_t *p, const fe_t *qx, const fe_t *qy) {
    if (jac_is_inf(p)) {
        fe_copy(&r->x, qx);
        fe_copy(&r->y, qy);
        fe_zero(&r->z);
        r->z.d[0] = 1;
        return;
    }
    
    fe_t z1z1, u2, s2, h, hh, i, j, rr, v;
    fe_t x3, y3, z3, t1;
    
    fe_sqr(&z1z1, &p->z);         // Z1^2
    
    // U1 = X1 (since Z2=1)
    fe_mul(&u2, qx, &z1z1);       // U2 = X2*Z1^2
    
    // S1 = Y1 (since Z2=1)
    fe_mul(&t1, &p->z, &z1z1);    // Z1^3
    fe_mul(&s2, qy, &t1);         // S2 = Y2*Z1^3
    
    fe_sub(&h, &u2, &p->x);       // H = U2 - X1
    
    if (fe_is_zero(&h)) {
        fe_sub(&t1, &s2, &p->y);
        if (fe_is_zero(&t1)) {
            jac_double(r, p);
            return;
        } else {
            jac_set_inf(r);
            return;
        }
    }
    
    fe_sqr(&hh, &h);              // H^2
    fe_dbl(&i, &hh);              // 2*H^2
    fe_dbl(&i, &i);               // I = 4*H^2
    
    fe_mul(&j, &h, &i);           // J = H*I
    
    fe_sub(&rr, &s2, &p->y);      // S2 - Y1
    fe_dbl(&rr, &rr);             // r = 2*(S2 - Y1)
    
    fe_mul(&v, &p->x, &i);        // V = X1*I
    
    // X3 = r^2 - J - 2*V
    fe_sqr(&x3, &rr);
    fe_sub(&x3, &x3, &j);
    fe_dbl(&t1, &v);
    fe_sub(&x3, &x3, &t1);
    
    // Y3 = r*(V - X3) - 2*Y1*J
    fe_sub(&t1, &v, &x3);
    fe_mul(&y3, &rr, &t1);
    fe_mul(&t1, &p->y, &j);
    fe_dbl(&t1, &t1);
    fe_sub(&y3, &y3, &t1);
    
    // Z3 = 2*Z1*H
    fe_mul(&z3, &p->z, &h);
    fe_dbl(&z3, &z3);
    
    fe_copy(&r->x, &x3);
    fe_copy(&r->y, &y3);
    fe_copy(&r->z, &z3);
}

// Convert Jacobian to affine: (X, Y, Z) -> (X/Z^2, Y/Z^3)
__device__ void jac_to_affine(fe_t *rx, fe_t *ry, const jac_point_t *p) {
    if (jac_is_inf(p)) {
        fe_zero(rx);
        fe_zero(ry);
        return;
    }
    
    fe_t z_inv, z2_inv, z3_inv;
    
    fe_inv(&z_inv, &p->z);        // 1/Z
    fe_sqr(&z2_inv, &z_inv);      // 1/Z^2
    fe_mul(&z3_inv, &z2_inv, &z_inv); // 1/Z^3
    
    fe_mul(rx, &p->x, &z2_inv);   // X/Z^2
    fe_mul(ry, &p->y, &z3_inv);   // Y/Z^3
}

// ============================================================================
// SCALAR MULTIPLICATION WITH PRECOMPUTED TABLE
// ============================================================================

// Initialize precomputed table for generator G
// G_TABLE[i] = (i+1)*G for i = 0..15
__global__ void init_g_table_kernel() {
    if (threadIdx.x != 0 || blockIdx.x != 0) return;
    
    // Load G
    fe_t gx, gy;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        gx.d[i] = G_X[i];
        gy.d[i] = G_Y[i];
    }
    
    // G_TABLE[0] = 1*G
    fe_copy(&G_TABLE_X[0], &gx);
    fe_copy(&G_TABLE_Y[0], &gy);
    
    // Compute 2*G, 3*G, ... 16*G
    jac_point_t acc;
    jac_set_affine(&acc, &gx, &gy);
    
    for (int i = 1; i < 16; i++) {
        jac_point_t next;
        jac_add_affine(&next, &acc, &gx, &gy);
        jac_copy(&acc, &next);
        
        // Convert to affine and store
        fe_t ax, ay;
        jac_to_affine(&ax, &ay, &acc);
        fe_copy(&G_TABLE_X[i], &ax);
        fe_copy(&G_TABLE_Y[i], &ay);
    }
}

// Scalar multiplication using 4-bit windowed method with precomputed table
// Computes r = k * G
__device__ void scalar_mult_g(fe_t *rx, fe_t *ry, const uint8_t k[32]) {
    jac_point_t r;
    jac_set_inf(&r);
    
    // Process 4 bits at a time, MSB first
    // 256 bits = 64 nibbles
    for (int i = 63; i >= 0; i--) {
        // Double 4 times
        jac_double(&r, &r);
        jac_double(&r, &r);
        jac_double(&r, &r);
        jac_double(&r, &r);
        
        // Get 4-bit window
        int byte_idx = 31 - (i / 2);
        int nibble = (i & 1) ? (k[byte_idx] >> 4) : (k[byte_idx] & 0x0F);
        
        // Add G_TABLE[nibble-1] if nibble > 0
        if (nibble > 0) {
            jac_add_affine(&r, &r, &G_TABLE_X[nibble - 1], &G_TABLE_Y[nibble - 1]);
        }
    }
    
    jac_to_affine(rx, ry, &r);
}

// Simple scalar multiplication (double-and-add) for arbitrary point
__device__ void scalar_mult(fe_t *rx, fe_t *ry, const uint8_t k[32], const fe_t *px, const fe_t *py) {
    jac_point_t r;
    jac_set_inf(&r);
    
    // Process MSB to LSB
    for (int bit = 255; bit >= 0; bit--) {
        jac_double(&r, &r);
        
        int byte_idx = 31 - (bit / 8);
        int bit_idx = bit % 8;
        
        if (k[byte_idx] & (1 << bit_idx)) {
            jac_add_affine(&r, &r, px, py);
        }
    }
    
    jac_to_affine(rx, ry, &r);
}

// ============================================================================
// BATCH OPERATIONS (for processing multiple keys efficiently)
// ============================================================================

// Batch compute public keys for multiple private keys
// Uses shared intermediate results where possible
__device__ void batch_scalar_mult_g(
    fe_t *rx_arr, fe_t *ry_arr,  // Output: array of public keys
    const uint8_t *k_arr,        // Input: array of 32-byte private keys
    int count                     // Number of keys
) {
    // For now, just loop. A faster version could share doublings across keys
    // or batch the modular inversions in jac_to_affine (Montgomery's trick).
    for (int i = 0; i < count; i++) {
        scalar_mult_g(&rx_arr[i], &ry_arr[i], k_arr + i * 32);
    }
}

// ============================================================================
// HIGH-LEVEL API
// ============================================================================

// Compute public key X coordinate from private key (for matching)
__device__ void privkey_to_pubkey_x(uint8_t pubx[32], const uint8_t privkey[32]) {
    fe_t rx, ry;
    scalar_mult_g(&rx, &ry, privkey);
    fe_get_be(pubx, &rx);
}

// Check if private key generates target public key X
__device__ int check_pubkey_x(const uint8_t privkey[32], const uint8_t target_x[32]) {
    uint8_t computed_x[32];
    privkey_to_pubkey_x(computed_x, privkey);
    
    #pragma unroll
    for (int i = 0; i < 32; i++) {
        if (computed_x[i] != target_x[i]) return 0;
    }
    return 1;
}

// ============================================================================
// HOST INITIALIZATION
// ============================================================================

// Call this once from the host before using the GPU functions above
void secp256k1_init_tables() {
    init_g_table_kernel<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();   // catch launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // catch kernel execution errors
    if (err != cudaSuccess)
        fprintf(stderr, "secp256k1_init_tables: %s\n", cudaGetErrorString(err));
}
