Bitcoin Forum
November 13, 2024, 02:19:46 PM *
News: Latest Bitcoin Core release: 28.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: « 1 2 [3]  All
  Print  
Author Topic: Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1)  (Read 1259 times)
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 09, 2020, 03:59:47 AM
 #41

the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr.


Yes, bitcoin code. But as I understand bitcoin uses hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.

In fact, ECC is difficult for me, my brain melts.

I don't understand why, but I feel problem to calculate privkey->pubkey using hybrid formulae if there's much leading same bits.
But the more I think about it - the more my brain is going to infinite loop.  Sad



What about iterations of GCD, I have some ideas to verify in Python, so I will write later.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
December 09, 2020, 04:17:32 AM
Last edit: December 09, 2020, 04:28:59 AM by gmaxwell
 #42

hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated)

(Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 09, 2020, 07:20:43 AM
Last edit: December 09, 2020, 07:35:59 AM by hakabit
 #43

hybrid affine+Jacobian. while for Jacobian-Jacobian addition I found only 17mul formulae.
Affine + Jacobian is actually what you want for this, your circuit never performs a GEJ+GEJ operation. (You might perform a GEJ doubling, but doubling is special and much simpler operation though as j2002ba2 points out, the doubling can and should be eliminated)

(Though there is a gej_gej addition in that codebase a little bit higher, which is 12 mul, 4 sqr-- but that isn't what you want).


After lot of thinking I've at last got the point how combining by 10bits and hybrid point addition works.
Funny, but process much-much easier than I was trying to imagine.
And yes, affine+jacobian is just fine.




What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.

gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
December 09, 2020, 06:22:57 PM
 #44

What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully.  I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound.

Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound.

aside,  Pieter has started writing an accessible writeup about how safegcd works.
j2002ba2
Full Member
***
Offline Offline

Activity: 206
Merit: 447


View Profile
December 09, 2020, 08:24:35 PM
Last edit: December 09, 2020, 09:24:06 PM by j2002ba2
 #45

I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much.

Since there are many multiplexers, it might be beneficial to allow the Transmission gate.

A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors.

Perhaps it would be better to count transistors instead of logic elements.
 NOT - 2
 NOR - 4
NAND - 4
 AND - 6
  OR - 6
 XOR - 8 (6 with PTL)
XNOR - 8
  TG - 2
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 10, 2020, 01:33:33 AM
Last edit: December 10, 2020, 02:27:40 AM by hakabit
 #46

I missed something from J+A point addition - step 14 in 3.22. There should be additional 256-bit-add-minus-p & 256-bit-mux, for 1280 more gates. It doesn't change the outcome much.

Since there are many multiplexers, it might be beneficial to allow the Transmission gate.

A 256-bit multiplexer would be 512*TG + 1*NOT. This is 3 times less transistors.

Perhaps it would be better to count transistors instead of logic elements.
 NOT - 2
 NOR - 4
NAND - 4
 AND - 6
  OR - 6
 XOR - 8 (6 with PTL)
XNOR - 8
  TG - 2


Yeah, not a big deal.

I have much more questions to step 9.
As we see there's checking leading to point doubleing or to infinity.
The same checking in bitcoin code provided by gmaxwell leading to secp256k1_gej_double_var(r, a, rzr); or to secp256k1_gej_set_infinity(r);.
gmaxwell said, that infinity is not possible, I dumb in ECC, so let's trust to this statement.
And what about doubleing ? Can it be ignored ? If not, then I can't understand how to get smaller than 17mul per one point addition(input hybrid affine+J).

gmaxwell, please, clear up this question.



What about transmission gate and transistors. If to implement secp256k1 in hardware, then it's very good idea..
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 10, 2020, 01:56:51 AM
 #47

What about iterations count, then I think just better to select 768iterations.
Modinverse is only on final step, so are there 768 or only 68 iterations - not a big deal.
There is an informal proof up thread that 765 is enough, I haven't considered it carefully.  I find it a little concerning that he found a 760 iteration input so easily-- for safegcd we've found it fairly hard to find inputs close to the bound.

Perhaps for estimation purposes you could just assume 765 but I think that if I were going to fabricate this circuit I'd want to make a more formal proof of the bound.

aside,  Pieter has started writing an accessible writeup about how safegcd works.

Today I've programmed logging iterations with quite simple logic - logging largest iter2, iter3, iter4, average summ of them, and largest summ of them.
In fact, largest summ can be interpreted as unrolled iterations count.

In 6hours test-run on random inputs we get as high as 728. I feel running script for some weeks we can get as high as +-760.

Also I carefully(carefully!) re-read article related to 2/ln(k)*ln(A) formulae. It gives result for "a upper bound for the average number of iterations". It sound very interesting, upper bound of average.  Smiley

So........................  Roll Eyes Roll Eyes



Results:
(test#, lrgiter2, lrgiter3, lrgiter4, avgsum, lrgsum):
0 169 183 92 444 444
1 186 183 97 458 466
3 197 183 97 458 477
4 236 183 123 458 542
5 236 190 123 458 549
9 255 190 123 458 568
11 255 191 123 458 569
13 255 197 123 458 575
16 255 197 125 458 577
24 255 197 250 510 702
58 255 204 250 510 709
431 255 205 250 510 710
2952 255 209 250 510 714
8611 255 211 250 510 716
still trying... 2020-12-09 20:00:50.639906
54438 255 214 250 511 719
76276 255 215 250 511 720
325434 255 216 250 517 721
1085191 255 217 250 518 722
2017092 255 222 250 518 727
48955014 255 223 250 524 728
still trying... 2020-12-10 01:56:08.532261
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 10, 2020, 03:52:10 AM
Last edit: December 10, 2020, 04:03:09 AM by hakabit
 #48

we still have a lot of multiplications. no matter affine-affine or affine-Jacobian used.

gmaxwell, You was providing link to PDF related to multiplications. but... it's really too difficult for me to read, high-grade math terminology used, I do not understand at least 70% there.  Undecided

But I already have general understanding in multiplication algorithms. I understand principles of Toom(and subset called Karatsuba.). I do not understand principles of Schonhage–Strassen algorithm, Furer's algorithm and many more, but they are in any case not effective for 2^256 numbers.
After multiplication made we need to reduce by module. I see Barrett reduction is the best suited for secp256k1 case, but I still don't understand principles of such.

Initially, I and j2002ba2 are using binary algorithm for estimations, 1mod-multiplication = 255+256=511mod-additions.

Also there's hybrid algorithm called Montgomery multiplication.


gmaxwell, how do You think, what algorithm is the most effective(=smaller amount of gates) for secp256k1 ?
1) 1mul = 511adds
2) multiply by Toom then reduce by Barrett
3) Montgomery multiplication
4) any more good ?!
j2002ba2
Full Member
***
Offline Offline

Activity: 206
Merit: 447


View Profile
December 10, 2020, 04:28:19 AM
 #49


I have much more questions to step 9.
As we see there's checking leading to point doubleing or to infinity.


When the input is sanitized there are no infinity points. Step 9 is skipped.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
December 10, 2020, 05:12:02 AM
 #50

Toom is likely appropriate at these sizes.

I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder.

In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter.

The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1

This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like.

On doubling:  You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2.

With that there is no need for doubling at all.  Though if you do have a need for doubling you can do it in 3mul 4sqr.

Computing  scalar times point will never have an infinity as an intermediate.  Infinity is functionally 0.   If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum.  E.g. If I want to compute 1984G   4 + 80 + 900 + 1000  ... none of those sums can be zero.  If you consider it, you can see this applies to any number base.   Infinity can only be encountered if the scalar is larger than the group.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 10, 2020, 06:31:20 AM
Last edit: December 10, 2020, 07:13:49 PM by hakabit
 #51

Toom is likely appropriate at these sizes.

I do not expect Montgomery to be useful except where there are many consecutive multiplies and squares-- such as performing an inversion with a power ladder.

In software at least delayed carry processing is a big performance increase, maybe in logic it wouldn't matter.

The secp256k1 field has special structure: 2^256 - 2^32 - 2^9 - 2^8 - 2^7 - 2^6 - 2^4 - 1

This simplifies reduction but I haven't really given any thought to what a dedicated logic implementation would look like.

On doubling:  You don't need to double because you should use a table per digit where each table entry has already been multiplied by a power of 2.

With that there is no need for doubling at all.  Though if you do have a need for doubling you can do it in 3mul 4sqr.

Computing  scalar times point will never have an infinity as an intermediate.  Infinity is functionally 0.   If you are adding up some number by summing its digits, you'll never encounter 0 in your incremental sum.  E.g. If I want to compute 1984G   4 + 80 + 900 + 1000  ... none of those sums can be zero.  If you consider it, you can see this applies to any number base.   Infinity can only be encountered if the scalar is larger than the group.


Ok, let's forget Montgomery-multiplication. Only two variants left: 511adds or Toom+reduct.

What about infinity, then I just trust You. But what about checking leading to doubleing - I just see it in bitcoin code, so it was good to ask to be sure.
That's very good, that no need to include doubleing calculations.

There are algorithms:
14.3.4 Reduction methods for moduli of special form
(A. Menezes, P. van Oorschot, S. Vanstone. Handbook of Applied Cryptography.)

Algorithm 10.25 Fast reduction for special form moduli
(H. Cohen, G. Frey, R. Avanzi, C. Doche, T. Lange, K. Nguyen, F. Vercauteren.
Handbook of Elliptic and Hyperelliptic Curve Cryptography.)

And I found golang code realizing them with description. I see they splitted secp256k1 modulo in a very interesting way.
https://github.com/btcsuite/btcd/blob/master/btcec/field.go
"   // The secp256k1 prime is equivalent to 2^256 - 4294968273, so it fits
   // this criteria.
   //
   // 4294968273 in field representation (base 2^26) is:
   // n[0] = 977
   // n[1] = 64
   // That is to say (2^26 * 64) + 977 = 4294968273"

So how do You think will it be much better than Barrett ?
While I don't understand this algorithm for special module, but I suppose if it much more efficient for calculating module, then the same algorithm principles can be used for multiplication. Isn't it ?!


And at all, how much more efficient Toom+Reduction than 511adds can be ? Will we save only +-10% gates or much more ?
If not bigger than 10%, then I think prefer in my final estimations to use 511adds, because they are much easier to understand and realize.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 10, 2020, 04:40:15 PM
Last edit: December 10, 2020, 07:10:32 PM by hakabit
 #52

Also I have question related to Toom. Toom can have different coefficients.

In wikipedia provided only these:
Toom-1     = Toom.km(1).kn(1) = classic school multiplication algorithm
Toom-1,5  = Toom.km(2).kn(1)
Toom-2     = Toom.km(2).kn(2) = Karatsuba
Toom-2,5  = Toom.km(3).kn(2)
Toom-3     = Toom.km(3).kn(2) - most frequently used Toom algorithm

But I can imagine we can go further:
Toom-3,5
Toom-4
Toom-4,5
Toom-5
...

And further, and further... Up to 255 splits each 1bit.
But as I understand the more we splitting number - the more carry bits we will have for adding.
So where's the gold balance point ? What are the best Toom coefficients(km, kn) for 256bit numbers ?!
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 11, 2020, 07:39:56 AM
 #53

I think I found the answer.
There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.

So we can use this number for estimations.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 11, 2020, 07:49:15 AM
 #54

What about toom coefficients, then in libtommath library I found such table:

Split into n Parts         Exponent         Notes
2         1.584962501         This is Karatsuba Multiplication.
3         1.464973520         This is Toom-Cook Multiplication.
4         1.403677461         
5         1.365212389         
10         1.278753601         
100         1.149426538         
1000         1.100270931         
10000         1.075252070         
5.2. MULTIPLICATION 105                  
Figure 5.6: Asymptotic Running Time of Polynomial Basis Multiplication                  

But that's theory, in practice we will have not such good results.

For example, here http://binary.cr.yp.to/m.html
provided info that
school = 130thousands gates
classic Karatsuba = 44thousands gates, while in theory there should 10times smaller than school results...

gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4270
Merit: 8805



View Profile WWW
December 11, 2020, 04:34:04 PM
 #55

There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P).  Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. Smiley ) and isn't what is needed here.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 13, 2020, 05:19:50 AM
Last edit: December 13, 2020, 09:46:41 AM by hakabit
 #56

There's article on http://binary.cr.yp.to/auth256.html indicating, that we can go to as low as to only 29*256=7424gates for 256bits mod-multiplication.
Alas, that page is about binary field multiplication, arithmetic in GF(2^n) rather than in GF(P).  Binary field arithmetic has special structure that make logic implementations especially efficient (and public key cryptography especially weak. Smiley ) and isn't what is needed here.

Thanks ! Sorry, I paid no attention to field structure. Before Your correction I was surprised that we can use only 7K gates for 256bits mod-mul.



In any case now we have such numbers for 256bits non-modular multiplying:
) school 130K gates
) Karatsuba 44K gates
) Bernstein 35K gates

In this article
https://homes.esat.kuleuven.be/~fvercaut/papers/bar_mont.pdf
I found info that mod-mul Barrett uses 33.15K gates, mod-mul Montgomery uses 33.4K gates.
These numbers related to field with anykind (2^(n−1) ≤ M < 2^n) module, but I am not sure that it's asynchronous(unrolled) schematic.

All this means, that initial j2002ba2 estimations can be lowered dramatically. (mine can be optimized of course too  Smiley Still going. )



Also I found article
http://www.bodrato.it/papers/WhatAboutToomCookMatricesOptimality.pdf
with formulaes to estimate Toom coefficients. But high-grade math terminology used. Could anyone help to estimate best Toom coefficients for 256bits ?!

I feel there's problem with complexity calculations. Usually formulaes are provided to calculate only the general, but not all logic. For example for School-algorithm formulae is O(n2). Using this formulae we can only calculate XOR gates = 65536, while additionaly we have the same number of AND gates.
Pages: « 1 2 [3]  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!