math09183
Member
Offline
Activity: 170
Merit: 58
|
|
October 23, 2021, 03:17:08 PM |
|
LOL That's what happens when you use ad hoc written code, without proper testing. I guess you still did not prepare any set of unit tests to proof your code works? Good luck for the future releases, maybe somewhere around version 20 it will be stable Most of code have bugs. Are you a great programmer who does everything without mistakes? I found this bug and solved it, what's your problem? Relax, haters gonna hate Anyway, good job, I appreciate your work. Sh*t happens.
|
|
|
|
Etar (OP)
|
|
October 23, 2021, 05:13:51 PM Last edit: October 23, 2021, 07:01:36 PM by Etar |
|
KEY[15]: 0x49dccfd96dc5df56487436f5a1b18c4f5d34f65ddb48cb5e2452dd26bc983cd5 Pub: 02b1985389d8ab680dedd67bba7ca781d1a9e6e5974aad2e70518125bad5783eb5 **************************** Found in 1 seconds GPU#0 job finished Working time 00:00:55s FINDpubkey= (55b95bef84a6045a505d015ef15e136e0a31cc2aa00fa4bca62e5df215ee981b, 3b4d6bce33718dc6cf59f28b550648d7e8b2796ac36f25ff0c01f8bc42a16fd9) GPU#0 Cnt:0000000000000000000000000000000000000000000000000000000000000001 GPU#0 Cnt:00000000000000000000000000000000000000000000000045ba000000000001 1110MKey/s x1073741824 2^30.12 x2^31=2^61.12 ***********GPU#0************ KEY[16]: 0x49dccfd96dc5df56487436f5a1b18c4f5d34f65ddb48cb5e7ad38337c7f173c7 Pub: 0355b95bef84a6045a505d015ef15e136e0a31cc2aa00fa4bca62e5df215ee981b **************************** Found in 4 seconds GPU#0 job finished Working time 00:00:59s
double giant step defeated. v1.7.1 released with maximum perfomance.
|
|
|
|
mamuu
Member
Offline
Activity: 73
Merit: 19
|
|
October 23, 2021, 07:13:47 PM |
|
KEY[15]: 0x49dccfd96dc5df56487436f5a1b18c4f5d34f65ddb48cb5e2452dd26bc983cd5 Pub: 02b1985389d8ab680dedd67bba7ca781d1a9e6e5974aad2e70518125bad5783eb5 **************************** Found in 1 seconds GPU#0 job finished Working time 00:00:55s FINDpubkey= (55b95bef84a6045a505d015ef15e136e0a31cc2aa00fa4bca62e5df215ee981b, 3b4d6bce33718dc6cf59f28b550648d7e8b2796ac36f25ff0c01f8bc42a16fd9) GPU#0 Cnt:0000000000000000000000000000000000000000000000000000000000000001 GPU#0 Cnt:00000000000000000000000000000000000000000000000045ba000000000001 1110MKey/s x1073741824 2^30.12 x2^31=2^61.12 ***********GPU#0************ KEY[16]: 0x49dccfd96dc5df56487436f5a1b18c4f5d34f65ddb48cb5e7ad38337c7f173c7 Pub: 0355b95bef84a6045a505d015ef15e136e0a31cc2aa00fa4bca62e5df215ee981b **************************** Found in 4 seconds GPU#0 job finished Working time 00:00:59s
double giant step defeated. v1.7.1 released with maximum perfomance. Hi, Please , can you write a little tutorial on usage?
|
1DWA3Sa8i6eHVWV4AG4UP2SBhYB2XrfiHW
|
|
|
Etar (OP)
|
|
October 23, 2021, 07:16:04 PM |
|
Hi, Please , can you write a little tutorial on usage?
did you look at readme.md on github? first of all set -t parametr as 256 or 512 next set -b equil to SM count of your card next set -p start from 128 then set -w as max as possible to your gpu memory, check -htsz params -w 31 -htsz 29 need around 64GB of RAM to generate all arrays -w 30 -htsz 28 need around 32GB of RAM to generate all arrays -w 29 -htsz 28 -w 28 -htsz 27 -w 27 -htsz 25 if you will have free gpu memory you can increase -p or -b or -t (all params multiple of 2)
|
|
|
|
ssxb
Jr. Member
Offline
Activity: 81
Merit: 2
|
|
October 24, 2021, 03:30:22 AM |
|
Hi, Please , can you write a little tutorial on usage?
did you look at readme.md on github? first of all set -t parametr as 256 or 512 next set -b equil to SM count of your card next set -p start from 128 then set -w as max as possible to your gpu memory, check -htsz params -w 31 -htsz 29 need around 64GB of RAM to generate all arrays -w 30 -htsz 28 need around 32GB of RAM to generate all arrays -w 29 -htsz 28 -w 28 -htsz 27 -w 27 -htsz 25 if you will have free gpu memory you can increase -p or -b or -t (all params multiple of 2) some great work from your side , appreciate just a quick question, do you have any plan to enhanced it for 120bit or more to perform better than JLK?
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 07:55:06 AM |
|
just a quick question, do you have any plan to enhanced it for 120bit or more to perform better than JLK?
Bsgs will never faster then kangaroo on big ranges.Example puzzle #80, width 79bit Kanagaroo need in average 2^40.5 op to solve key 2080ti can serach DPs with speed 1500Mkey/s = 2^30.48 So you need in average 2^10 second to find key. Bsgscuda can search 2^61/s, so 2^79 / 2^61 = 2^18 second to check whole 79bit range. Ofcourse key can be close to the begining and you can found very fast but it is not 100%. we can devide pub to 64 pubkey and try to find one of them in range 1..3ffffffffffffffffff fortunately, the key we need is the first one on the list but it was just lucky. by the way privkey will be 3a869719b73046d6b46 = 2^73.87 so we need 2^73.87 / 2^61/s = 2^12.87 seconds to find key. It is 7 more timer then need for kangaroo
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 04:32:04 PM |
|
Do you use the negation map to speed up the algorithm ? --> pag. 8-9 https://eprint.iacr.org/2015/605.pdfYou need to compute only: sqrt(n) / 2 baby steps : for example, n = 2^60 -> 2^29 baby steps sqrt(n) giant steps : for example, n = 2^60 -> 2^30 giant steps It is like to shift the public key of 2^59 steps, and search in the interval [1,..., 2^59] instead of [1,...., 2^60] exploiting the symmetrie. Besides, in order to compute a batch of k steps, you need to calculate only k/2 (instead of k) elements x^-1 mod p , at the cost of 1 inversions and 3*(k/2 - 1) multiplications mod p.
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 05:02:06 PM |
|
Do you use the negation map to speed up the algorithm ? --> pag. 8-9 https://eprint.iacr.org/2015/605.pdfYou need to compute only: sqrt(n) / 2 baby steps : for example, n = 2^60 -> 2^29 baby steps sqrt(n) giant steps : for example, n = 2^60 -> 2^30 giant steps It is like to shift the public key of 2^59 steps, and search in the interval [1,..., 2^59] instead of [1,...., 2^60] exploiting the symmetrie. Besides, in order to compute a batch of k steps, you need to calculate only k/2 (instead of k) elements x^-1 mod p , at the cost of 1 inversions and 3*(k/2 - 1) multiplications mod p. No, don`t use even don`t know about this. Will try to understand this tweak, thanks.
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 05:16:14 PM |
|
As you already explained few days ago, cuda BSGS is searching the keys one after another, not in parallel. I know that, I read carefully . With my previous post I meant that even if it would run in parallel it would slow down as described. So read my post in conjunctive and if someone would modify it to process in parallel I expect that behaviours. BSGS works in this way: suppose we know that P = k*G, and the private key k is in [1,...., 2^60] range 1) precompute 2^29 baby steps (they are simple public keys): 1*G, 2*G, 3*G, ...., 2^29 * G 2) split the public key P in many other public keys (they are called giant steps, 2^30 public keys): P, P - 1*(2^30*G), P - 2*(2^30*G), P - 3*(2^30)G, ..., P - (2^30 - 1)*(2^30*G) 3) for each giant steps, you check if it lies in [1, ..., 2^30] range (if it is equal to a 'baby step' public key) 4) if P - a*(2^30*G) = +-b*G then P = (a*2^30 +- b)*G then the private key is k = a*2^30 +- b If you want to search 2 public keys, P1 and P2, you can use the same baby steps, but you need to generate 2^31 giant steps instead of 2^30.
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 05:18:43 PM |
|
Besides, in order to compute a batch of k steps, you need to calculate only k/2 (instead of k) elements x^-1 mod p , at the cost of 1 inversions and 3*(k/2 - 1) multiplications mod p.
No, don`t use even don`t know about this. Will try to understand this tweak, thanks. I mean: how do you compute a batch of 'consecutive' keys ? Like P, P+G, P+2G, P+3G, P+4G, P+5G, ... ?
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 05:31:37 PM |
|
Besides, in order to compute a batch of k steps, you need to calculate only k/2 (instead of k) elements x^-1 mod p , at the cost of 1 inversions and 3*(k/2 - 1) multiplications mod p.
No, don`t use even don`t know about this. Will try to understand this tweak, thanks. I mean: how do you compute a batch of 'consecutive' keys ? Like P, P+G, P+2G, P+3G, P+4G, P+5G, ... ? with group inverse, calculate inverse for batch and use it in addition. Sorry if i misunderstud question. In the same way as in bitcrack https://github.com/brichard19/BitCrack/blob/6bf8059ef075eb1622298395866b0bd02375e1d9/cudaMath/secp256k1.cuh#L642and then https://github.com/brichard19/BitCrack/blob/6bf8059ef075eb1622298395866b0bd02375e1d9/cudaMath/secp256k1.cuh#L656
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 05:40:09 PM |
|
Besides, in order to compute a batch of k steps, you need to calculate only k/2 (instead of k) elements x^-1 mod p , at the cost of 1 inversions and 3*(k/2 - 1) multiplications mod p.
No, don`t use even don`t know about this. Will try to understand this tweak, thanks. I mean: how do you compute a batch of 'consecutive' keys ? Like P, P+G, P+2G, P+3G, P+4G, P+5G, ... ? with group inverse, calculate inverse for batch and use it in addition. Sorry if i misunderstud question. Ok. If you have a batch of 100 points, you don't need to compute 100 inversions but only 50 inversions. If you have to compute A + B: If you have to compute A - B, since -B = (xB, n-yB) 1/(xb-xa) is the same as in A + B. Because for example P+2G and P-2G use the same inverse.
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 05:44:48 PM |
|
Ok. If you have a batch of 100 points, you don't need to compute 100 inversions but only 50 inversions. Because for example P+2G and P-2G use the same inverse.
Exacly all this used in bsgs cuda(batch addition and symmetry) befor using symmetry in addition speed was 800Mkeys with -w 30, after using symmetry speed grow to 1150Mkeys
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 05:52:59 PM |
|
Ok. If you have a batch of 100 points, you don't need to compute 100 inversions but only 50 inversions. Because for example P+2G and P-2G use the same inverse.
Exacly all this used in bsgs cuda(batch addition and symmetry) befor using symmetry in addition speed was 800Mkeys with -w 30, after using symmetry speed grow to 1150Mkeys Ok. The square "a**2 mod p" is optimized like here ?
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 05:57:01 PM Last edit: October 24, 2021, 06:07:30 PM by Etar |
|
Ok. If you have a batch of 100 points, you don't need to compute 100 inversions but only 50 inversions. Because for example P+2G and P-2G use the same inverse.
Exacly all this used in bsgs cuda(batch addition and symmetry) befor using symmetry in addition speed was 800Mkeys with -w 30, after using symmetry speed grow to 1150Mkeys Ok. The square "a**2 mod p" is optimized like here ? No. used mult mod P i use optimized square mod P in PB https://github.com/Etayson/BSGS-cuda/blob/e41fff517b8de153b6bf9846ee7abb47524fe43e/lib/Curve64.pb#L2161but need buffer 512bytes for this, so i did not transfer it to cuda ptx Also used double giant step, so we substuct double giant value from pub.
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 06:33:05 PM |
|
Also used double giant step, so we substuct double giant value from pub.
So, instead of computing sqrt(n) / 2 baby steps and sqrt(n) giant steps you compute sqrt(n) baby steps and sqrt(n) / 2 giant steps then, for example, for n = 2^60: P - a * (2^31*G) = b*G where 'a' lies in [1, ..., 2^29] and 'b' lies in [1,...,2^30] means: 1) P - a*(2^31*G) = b*G --> P = [a*(2^31) + b] * G --> priv key = a*(2^31) + b or 2) P - a*(2^31*G) = -b*G --> P = [a*(2^31) - b] * G --> priv key = a*(2^31) - b to save 2^30 giant steps. It seems to me that the program is already optimized.
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 06:46:12 PM |
|
Also used double giant step, so we substuct double giant value from pub.
So, instead of computing sqrt(n) / 2 baby steps and sqrt(n) giant steps you compute sqrt(n) baby steps and sqrt(n) / 2 giant steps then, for example, for n = 2^60: P - a * (2^31*G) = b*G where 'a' lies in [1, ..., 2^29] and 'b' lies in [1,...,2^30] means: 1) P - a*(2^31*G) = b*G --> P = [a*(2^31) + b] * G --> priv key = a*(2^31) + b or 2) P - a*(2^31*G) = -b*G --> P = [a*(2^31) - b] * G --> priv key = a*(2^31) - b to save 2^30 giant steps. It seems to me that the program is already optimized. I compute as max as possible baby array size dependency of GPU memory Baby array is 1G,2G,3G... So this array computed only first time and then redesigned to hashtable. It is one HT for any ranges. Giant array is computed with doubled value of Baby array size, for ex if Baby array have size 2^30 then Giant Array have value G*(2^31), G*(2^32), G*(2^33)... All arrays computed only one time if not changed settings. So you can easy used all arays for different ranges and pubkeys without recompute.
|
|
|
|
arulbero
Legendary
Online
Activity: 1939
Merit: 2080
|
|
October 24, 2021, 06:57:39 PM |
|
I compute as max as possible baby array size dependency of GPU memory Baby array is 1G,2G,3G... So this array computed only first time and then redesigned to hashtable. It is one HT for any ranges. Giant array is computed with doubled value of Baby array size, for ex if Baby array have size 2^30 then Giant Array have value G*(2^31), G*(2^32), G*(2^33)... All arrays computed only one time if not changed settings. So you can easy used all arays for different ranges and pubkeys without recompute.
You can choose: 1) 2^30 baby-steps: 1*G, 2*G, ..., 2^30*G and 2^29 giant steps: P-1*2^31*G, P-2*2^31*G,..,P -a*2^31*G where a is in [1,2,.., 2^29 - 1] or 2) 2^29 baby-steps: 1*G, 2*G, ..., 2^29*G and 2^30 giant steps: P-1*2^30*G, P-2*2^30*G,..., P -a*2^30*G where a is in [1,2,..., 2^30 - 1] I don't understand why: "G*(2^31), G*(2^32), G*(2^33)": in this way you compute only 30 giant steps
|
|
|
|
Etar (OP)
|
|
October 24, 2021, 07:05:38 PM Last edit: October 24, 2021, 07:19:00 PM by Etar |
|
I compute as max as possible baby array size dependency of GPU memory Baby array is 1G,2G,3G... So this array computed only first time and then redesigned to hashtable. It is one HT for any ranges. Giant array is computed with doubled value of Baby array size, for ex if Baby array have size 2^30 then Giant Array have value G*(2^31), G*(2^32), G*(2^33)... All arrays computed only one time if not changed settings. So you can easy used all arays for different ranges and pubkeys without recompute.
You can choose: 1) 2^30 baby-steps: 1*G, 2*G, ..., 2^30*G and 2^29 giant steps: P-1*2^31*G, P-2*2^31*G,..,P -a*2^31*G where a is in [1,2,.., 2^29 - 1] or 2) 2^29 baby-steps: 1*G, 2*G, ..., 2^29*G and 2^30 giant steps: P-1*2^30*G, P-2*2^30*G,..., P -a*2^30*G where a is in [1,2,..., 2^30 - 1] I don't understand why: "G*(2^31), G*(2^32), G*(2^33)": in this way you compute only 30 giant steps it is universal arrays for any ranges. You not need to recompute array if you change range or pubkeys. size of giant array is equil to thread number * block number * pparam for 2080ti i use 512 thread 138 blocks and pparam = 480 so totaly i have 33914880 doubled giant values So each cuda kernel call calculate 33914880 * 2(due to +y/-y in batch additions) giat steps
|
|
|
|
jacky19790729
Jr. Member
Offline
Activity: 82
Merit: 8
|
|
October 29, 2021, 03:42:12 AM |
|
based on above table you can increase speed if you will utilize both bloom+bp https://github.com/iceland2k14/bsgsso CPU cores are less powerful than cuda and i was thinking [not sure possible or not] if we load all bp in RAM and use some bloom in GPU memory perhaps their will be some dramatic speed boost I had try iceland2k14's BSGS Intel i7-7800X + 24 GB DDR4-2400 D:\python\BSGS_ice>python bsgs_dll_secp256k1.py -p 0385a30d8413af4f8f9e6312400f2d194fe14f02e719b24c3f83bf1fd233a8f963 -b bPfile.bin -bl bloomfile.bin -n 1000000000000000 -keyspace 40000000000000:80000000000000
[+] Starting Program : BSGS mode Version [ 13072021 ] [+] Search Started for the Public key: 0485a30d8413af4f8f9e6312400f2d194fe14f02e719b24c3f83bf1fd233a8f9630eb400323654cec63999b56f4ba44e8b21ab92d9d697fabe4666df3678585669 [+] Search Mode: Sequential search in the given range [+] Reading bloom filter from file complete in : 26.70087 sec [+] Reading Baby table from file complete in : 0.03731 sec [+] seq value: 1000000000000000 m value : 3999999999 [+] Search Range: 0x40000000000000 to 0x80000000000000 [+] k1: 0x40000000000000 PVK not found. 1000.00000 Trillion scanned in 1.14 sec. New range [+] k1: 0x438d7ea4c68001 PVK not found. 1000.00000 Trillion scanned in 1.18 sec. New range [+] k1: 0x471afd498d0002 PVK not found. 1000.00000 Trillion scanned in 1.03 sec. New range [+] k1: 0x4aa87bee538003 PVK not found. 1000.00000 Trillion scanned in 1.05 sec. New range [+] k1: 0x4e35fa931a0004 PVK not found. 1000.00000 Trillion scanned in 0.99 sec. New range [+] k1: 0x51c37937e08005 PVK not found. 1000.00000 Trillion scanned in 1.01 sec. New range [+] k1: 0x5550f7dca70006 PVK not found. 1000.00000 Trillion scanned in 1.00 sec. New range [+] k1: 0x58de76816d8007 PVK not found. 1000.00000 Trillion scanned in 1.01 sec. New range [+] k1: 0x5c6bf526340008 PVK not found. 1000.00000 Trillion scanned in 1.09 sec. New range [+] k1: 0x5ff973cafa8009 PVK not found. 1000.00000 Trillion scanned in 0.93 sec. New range [+] k1: 0x6386f26fc1000a PVK not found. 1000.00000 Trillion scanned in 0.96 sec. New range [+] k1: 0x6714711487800b PVK not found. 1000.00000 Trillion scanned in 1.00 sec. New range [+] k1: 0x6aa1efb94e000c ============== KEYFOUND ============== 1 BSGS FOUND PrivateKey 0x6abe1f9b67e114 ====================================== Start 1 : 2021-10-29 11:34:19 Start 1_0: 2021-10-29 11:34:45 END 1 : 2021-10-29 11:34:59
1 CPU speed: BSGS Check 0x38D7EA4C68000 key / second
|
|
|
|
|