hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 09, 2020, 03:59:47 AM |
|
the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr. Yes, bitcoin code. But as I understand bitcoin uses hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae. In fact, ECC is difficult for me, my brain melts. I don't understand why, but I feel problem to calculate privkey->pubkey using hybrid formulae if there's much leading same bits. But the more I think about it - the more my brain is going to infinite loop. What about iterations of GCD, I have some ideas to verify in Python, so I will write later.
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 09, 2020, 04:17:32 AM Last edit: December 09, 2020, 04:28:59 AM by gmaxwell |
|
hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated) (Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 09, 2020, 07:20:43 AM Last edit: December 09, 2020, 07:35:59 AM by hakabit |
|
hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated) (Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want). After lot of thinking I've at last got the point how combining by 10bits and hybrid point addition works. Funny, but process much-much easier than I was trying to imagine. And yes, affine+jacobian is just fine. What about iterations count, then I think just better to select 768iterations. Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 09, 2020, 06:22:57 PM |
|
What about iterations count, then I think just better to select 768iterations. Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully. I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound. Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound. aside, Pieter has started writing an accessible writeup about how safegcd works.
|
|
|
|
j2002ba2
|
|
December 09, 2020, 08:24:35 PM Last edit: December 09, 2020, 09:24:06 PM by j2002ba2 |
|
I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much. Since there are many multiplexers, it might be beneficial to allow the Transmission gate. A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors. Perhaps it would be better to count transistors instead of logic elements. NOT - 2 NOR - 4 NAND - 4 AND - 6 OR - 6 XOR - 8 (6 with PTL) XNOR - 8 TG - 2
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 10, 2020, 01:33:33 AM Last edit: December 10, 2020, 02:27:40 AM by hakabit |
|
I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much. Since there are many multiplexers, it might be beneficial to allow the Transmission gate. A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors. Perhaps it would be better to count transistors instead of logic elements. NOT - 2 NOR - 4 NAND - 4 AND - 6 OR - 6 XOR - 8 (6 with PTL) XNOR - 8 TG - 2
Yeah, not a big deal. I have much more questions to step 9. As we see there's checking leading to point doubleing or to infinity. The same checking in bitcoin code provided by gmaxwell leading to secp256k1_gej_double_var(r, a, rzr); or to secp256k1_gej_set_infinity(r);. gmaxwell said, that infinity is not possible, I dumb in ECC, so let's trust to this statement. And what about doubleing ? Can it be ignored ? If not, then I can't understand how to get smaller than 17mul per one point addition(input hybrid affine+J). gmaxwell, please, clear up this question. What about transmission gate and transistors. If to implement secp256k1 in hardware, then it's very good idea..
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 10, 2020, 01:56:51 AM |
|
What about iterations count, then I think just better to select 768iterations. Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully. I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound. Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound. aside, Pieter has started writing an accessible writeup about how safegcd works. Today I've programmed logging iterations with quite simple logic - logging largest iter2, iter3, iter4, average summ of them, and largest summ of them. In fact, largest summ can be interpreted as unrolled iterations count. In 6hours test-run on random inputs we get as high as 728. I feel running script for some weeks we can get as high as +-760. Also I carefully(carefully!) re-read article related to 2/ln(k)*ln(A) formulae. It gives result for "a upper bound for the average number of iterations". It sound very interesting, upper bound of average. So........................ Results: (test#, lrgiter2, lrgiter3, lrgiter4, avgsum, lrgsum): 0 169 183 92 444 444 1 186 183 97 458 466 3 197 183 97 458 477 4 236 183 123 458 542 5 236 190 123 458 549 9 255 190 123 458 568 11 255 191 123 458 569 13 255 197 123 458 575 16 255 197 125 458 577 24 255 197 250 510 702 58 255 204 250 510 709 431 255 205 250 510 710 2952 255 209 250 510 714 8611 255 211 250 510 716 still trying... 2020-12-09 20:00:50.639906 54438 255 214 250 511 719 76276 255 215 250 511 720 325434 255 216 250 517 721 1085191 255 217 250 518 722 2017092 255 222 250 518 727 48955014 255 223 250 524 728still trying... 2020-12-10 01:56:08.532261
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 10, 2020, 03:52:10 AM Last edit: December 10, 2020, 04:03:09 AM by hakabit |
|
we still have a lot of multiplications. no matter affine-affine or affine-Jacobian used. gmaxwell, You was providing link to PDF related to multiplications. but... it's really too difficult for me to read, high-grade math terminology used, I do not understand at least 70% there. But I already have general understanding in multiplication algorithms. I understand principles of Toom(and subset called Karatsuba.). I do not understand principles of Schonhage–Strassen algorithm, Furer's algorithm and many more, but they are in any case not effective for 2^256 numbers. After multiplication made we need to reduce by module. I see Barrett reduction is the best suited for secp256k1 case, but I still don't understand principles of such. Initially, I and j2002ba2 are using binary algorithm for estimations, 1mod-multiplication = 255+256=511mod-additions. Also there's hybrid algorithm called Montgomery multiplication. gmaxwell, how do You think, what algorithm is the most effective(=smaller amount of gates) for secp256k1 ? 1) 1mul = 511adds 2) multiply by Toom then reduce by Barrett 3) Montgomery multiplication 4) any more good ?!
|
|
|
|
j2002ba2
|
|
December 10, 2020, 04:28:19 AM |
|
I have much more questions to step 9. As we see there's checking leading to point doubleing or to infinity.
When the input is sanitized there are no infinity points. Step 9 is skipped.
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 10, 2020, 05:12:02 AM |
|
Toom is likely appropriate at these sizes. I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder. In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter. The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like. On doubling: You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2. With that there is no need for doubling at all. Though if you do have a need for doubling you can do it in 3mul 4sqr. Computing scalar times point will never have an infinity as an intermediate. Infinity is functionally 0. If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum. E.g. If I want to compute 1984G 4 + 80 + 900 + 1000 ... none of those sums can be zero. If you consider it, you can see this applies to any number base. Infinity can only be encountered if the scalar is larger than the group.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 10, 2020, 06:31:20 AM Last edit: December 10, 2020, 07:13:49 PM by hakabit |
|
Toom is likely appropriate at these sizes. I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder. In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter. The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1 This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like. On doubling: You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2. With that there is no need for doubling at all. Though if you do have a need for doubling you can do it in 3mul 4sqr. Computing scalar times point will never have an infinity as an intermediate. Infinity is functionally 0. If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum. E.g. If I want to compute 1984G 4 + 80 + 900 + 1000 ... none of those sums can be zero. If you consider it, you can see this applies to any number base. Infinity can only be encountered if the scalar is larger than the group. Ok, let's forget Montgomery-multiplication. Only two variants left: 511adds or Toom+reduct. What about infinity, then I just trust You. But what about checking leading to doubleing - I just see it in bitcoin code, so it was good to ask to be sure. That's very good, that no need to include doubleing calculations. There are algorithms: 14.3.4 Reduction methods for moduli of special form (A. Menezes, P. van Oorschot, S. Vanstone. Handbook of Applied Cryptography.) Algorithm 10.25 Fast reduction for special form moduli (H. Cohen, G. Frey, R. Avanzi, C. Doche, T. Lange, K. Nguyen, F. Vercauteren. Handbook of Elliptic and Hyperelliptic Curve Cryptography.) And I found golang code realizing them with description. I see they splitted secp256k1 modulo in a very interesting way. https://github.com/btcsuite/btcd/blob/master/btcec/field.go" // The secp256k1 prime is equivalent to 2^256 - 4294968273, so it fits // this criteria. // // 4294968273 in field representation (base 2^26) is: // n[0] = 977 // n[1] = 64 // That is to say (2^26 * 64) + 977 = 4294968273" So how do You think will it be much better than Barrett ? While I don't understand this algorithm for special module, but I suppose if it much more efficient for calculating module, then the same algorithm principles can be used for multiplication. Isn't it ?! And at all, how much more efficient Toom+Reduction than 511adds can be ? Will we save only +-10% gates or much more ? If not bigger than 10%, then I think prefer in my final estimations to use 511adds, because they are much easier to understand and realize.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 10, 2020, 04:40:15 PM Last edit: December 10, 2020, 07:10:32 PM by hakabit |
|
Also I have question related to Toom. Toom can have different coefficients.
In wikipedia provided only these: Toom-1 = Toom.km(1).kn(1) = classic school multiplication algorithm Toom-1,5 = Toom.km(2).kn(1) Toom-2 = Toom.km(2).kn(2) = Karatsuba Toom-2,5 = Toom.km(3).kn(2) Toom-3 = Toom.km(3).kn(2) - most frequently used Toom algorithm
But I can imagine we can go further: Toom-3,5 Toom-4 Toom-4,5 Toom-5 ...
And further, and further... Up to 255 splits each 1bit. But as I understand the more we splitting number - the more carry bits we will have for adding. So where's the gold balance point ? What are the best Toom coefficients(km, kn) for 256bit numbers ?!
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 11, 2020, 07:39:56 AM |
|
I think I found the answer. There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication. So we can use this number for estimations.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 11, 2020, 07:49:15 AM |
|
What about toom coefficients, then in libtommath library I found such table: Split into n Parts Exponent Notes 2 1.584962501 This is Karatsuba Multiplication. 3 1.464973520 This is Toom-Cook Multiplication. 4 1.403677461 5 1.365212389 10 1.278753601 100 1.149426538 1000 1.100270931 10000 1.075252070 5.2. MULTIPLICATION 105 Figure 5.6: Asymptotic Running Time of Polynomial Basis Multiplication But that's theory, in practice we will have not such good results. For example, here http://binary.cr.yp.to/m.htmlprovided info that school = 130thousands gates classic Karatsuba = 44thousands gates, while in theory there should 10times smaller than school results...
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 11, 2020, 04:34:04 PM |
|
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P). Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. ) and isn't what is needed here.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 13, 2020, 05:19:50 AM Last edit: December 13, 2020, 09:46:41 AM by hakabit |
|
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P). Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. ) and isn't what is needed here. Thanks ! Sorry, I paid no attention to field structure. Before Your correction I was surprised that we can use only 7K gates for 256bits mod-mul. In any case now we have such numbers for 256bits non-modular multiplying: ) school 130K gates ) Karatsuba 44K gates ) Bernstein 35K gates In this article https://homes.esat.kuleuven.be/~fvercaut/papers/bar_mont.pdfI found info that mod-mul Barrett uses 33.15K gates, mod-mul Montgomery uses 33.4K gates. These numbers related to field with anykind (2^(n−1) ≤ M < 2^n) module, but I am not sure that it's asynchronous(unrolled) schematic. All this means, that initial j2002ba2 estimations can be lowered dramatically. (mine can be optimized of course too Still going. ) Also I found article http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdfwith formulaes to estimate Toom coefficients. But high-grade math terminology used. Could anyone help to estimate best Toom coefficients for 256bits ?! I feel there's problem with complexity calculations. Usually formulaes are provided to calculate only the general, but not all logic. For example for School-algorithm formulae is O(n2). Using this formulae we can only calculate XOR gates = 65536, while additionaly we have the same number of AND gates.
|
|
|
|
|