Bitcoin Forum
June 17, 2024, 09:28:37 AM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Pages: « 1 [2] 3 »  All
  Print  
Author Topic: Theoretical minimum # of logic operations to perform privkey->pubkey(secp256k1)  (Read 1198 times)
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4200
Merit: 8440



View Profile WWW
December 05, 2020, 03:15:51 AM
 #21

The algorithm you are discussing is a binary (extended) gcd, not the "original" Euclid's algorithm.  The original Euclid's has an arbitrary division (well a modular reduction, but same deal) in the inner loop.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 06, 2020, 02:36:32 PM
Last edit: December 06, 2020, 03:15:38 PM by hakabit
 #22

The algorithm you are discussing is a binary (extended) gcd, not the "original" Euclid's algorithm.  The original Euclid's has an arbitrary division (well a modular reduction, but same deal) in the inner loop.



Yes, I know difference. Even for same input EGCD(or BEGCD) probably have different amount of iterations compared to GCD.
However I was thinking that worstiest EGCD(or BEGCD) scenario cannot be worstier than GCD.

Here
https://tuengr.com/V10/001.pdf
I found formulae(based on proved by Ishmukhametov formulaes) for EGCD.
2/ln(k)*ln(A)
it gives 2*0.7*177 = 248max iterations(2 to 256 bits numbers).



Also here
http://www.joppebos.com/files/CTInversion.pdf
https://link.springer.com/content/pdf/10.1007%2F3-540-36400-5_6.pdf
https://www.researchgate.net/publication/3044233_The_Montgomery_modular_inverse_-_Revisited
https://www.researchgate.net/publication/3387259_Improved_Montgomery_modular_inverse_algorithm

I found one more algo called "Montgomery Inverse" with additional phase to calculate back from montgomery-domain.
Proved(?!) fomulae: log2(a) < 2log2(a)
it gives min 256 and max 512 iterations for each phase(if 256bits numbers).

Each iteration contain 5*256bits additions with signed numbers.(substraction ~= addition)

However each second iteration amount of logical gates can be lowered by 1bit.

So in total we have 512 * 5 * half = 1280 additions for 1inverse. (not included muxes)

I think that's ideal algo for point inverse if we are planning to calculate secp256k1 using millions, but not trillions logical operations.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4200
Merit: 8440



View Profile WWW
December 06, 2020, 04:19:09 PM
 #23

However I was thinking that worstiest EGCD(or BEGCD) scenario cannot be worstier than GCD.
No, that is entirely untrue, unfortunately.

Binary GCD is faster to use not because it makes fewer iterations but because there isn't a very expensive division in each iteration.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 06, 2020, 08:38:44 PM
 #24

However I was thinking that worstiest EGCD(or BEGCD) scenario cannot be worstier than GCD.
No, that is entirely untrue, unfortunately.

Binary GCD is faster to use not because it makes fewer iterations but because there isn't a very expensive division in each iteration.


I am sorry, I messed up with abbreviations.
I mean (and had to write) that binary version cannot be worstier than non-binary.

I provided proof that non-binary version has max 390 iterations. So I was trying to say, that there are no needs to process 735iterations for binary version.
But 390 - was upper(but not lower) bound limiter. Today I provided proof for 248 max iterations. (and as I understand that's lower bound limiter ?!)

What about 735... in other post You provided URL to PDF on cr.yp.to website. This PDF named "Fast constant-time
gcd computation and modular inversion".

In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course constant-time GCD algorithm can contain 735 and even more iterations - most of them are dummy(fake, useless).
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 06, 2020, 09:32:19 PM
Last edit: December 06, 2020, 11:11:02 PM by hakabit
 #25

I'mmm sorrry.

Calculating 2/ln(k)*ln(A) I've made kid's mistake. Not divided 2 to, but multiplied to 2.

So... 2/(0.7) * 177 = 506 max iterations.  I'm a little bit unobservant. Embarrassed Embarrassed

2/ln(2)*ln(2^256) = (2/0,6931471805599453)*177,445678223346 = 512,00000000000000923324826168937

So... 512 max iterations.

 Roll Eyes Roll Eyes Roll Eyes Roll Eyes Roll Eyes

But now I am a little bit puzzled. 390 max iterations for non-binary and 506 512 max iterations for binary.
I hope maximizing iterations is in interchange with lowering logic gates(operations).
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4200
Merit: 8440



View Profile WWW
December 06, 2020, 10:42:00 PM
 #26

In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course constant-time GCD algorithm can contain 735 and even more iterations - most of them are dummy(fake, useless).

In the case of an unrolled logic implementation, *all* candidates will be constant time.  Algorithms like binary gcds are made constant time by unrolling to their maximum number of operations.

If you allow memory and looping then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for them memory for a point, the input, a couple hundred bits of temporaries, and the program itself)

Quote
I mean (and had to write) that binary version cannot be worstier than non-binary.
That is how I understood what you wrote. But all of the binary gcd variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical gcd to reason about the iteration counts of some binary gcd.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 07, 2020, 06:37:20 PM
 #27

In cryptography constant-time algorithms are used to prevent side-channel attacks.
So of course constant-time GCD algorithm can contain 735 and even more iterations - most of them are dummy(fake, useless).

In the case of an unrolled logic implementation, *all* candidates will be constant time.  Algorithms like binary gcds are made constant time by unrolling to their maximum number of operations.

If you allow memory and looping then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for them memory for a point, the input, a couple hundred bits of temporaries, and the program itself)

Quote
I mean (and had to write) that binary version cannot be worstier than non-binary.
That is how I understood what you wrote. But all of the binary gcd variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical gcd to reason about the iteration counts of some binary gcd.


Now I need to agree that I understand GCD algos in not so good way as I was thinking. Need some time to analyse them additionally.

In any case I found proofs how to calculate bound limits both for non-binary and binary GCD.
While I see that You was writing that "AFAIK no one knows how to compute an tight bound for this kind of algorithm, but useful upper limits exist which is all you need."
(BTW, if You are intrested in proof details, then I will try to find initial Ishmukhametov's articles and post them here. Just let me know.)

Calculations for 256bits numbers shows that "worst-scenario cannot be bigger than 350 iterations for non-binary GCD"
and that "worst-scenario should be no smaller than 512 iterations for binary GCD."
As result it means that "it is guaranteed to get result in no more than 512 iterations for worst-scenario (both for binary both for non-binary GCD)".

You are speaking about GCD algorithm with 735iterations. Why do we need 735 if 512 enough ? My answer - lot of them are dummy(fake).
They deliberately(willfully) added dummy iterations generated by quasi-random algorithm as technique to fight against side-channel attacks.
Either they was unrolling cycles, but unrolled them in non-correct way, adding useless iterations without special will for doing that.

And what do You think ?




P.S.
Of course, any algorithm can be calculated using only one NAND(or NOR) logic gate + a lot of memory and muxes.

I would like to write much about unrolled logic and constant-time, but prefer to wait Your answer related to iterations overcounting.
j2002ba2
Full Member
***
Offline Offline

Activity: 206
Merit: 444


View Profile
December 07, 2020, 09:15:27 PM
 #28

For binary euclidean gcd the maximum number of iterations is <=760.

To achieve it let u is zero with msb=1. Then 6 does 255 cycles (the number of zeroes), and we end with u=1. Next 21 and 12 are alternated, except for the few zero bits in p. 12 is executed 255 times, and 21 - 249. Finally u and v are both 1, and 18 is executed.

Another way to look at it is both 6 and 12 have to be executed 255 times max. This limits either 18 or 21 to be executed at most 255 times. This gives <=765 cycles.

Even A tends towards zero, odd A tends towards p. After every execution of 18 u is even and 6 is executed at least once. Same for 21 and 12. Then A and C are between -2p and 2p (after 18/21). 256+1 bits plus 1 sign bit, for toal 258 bits.


1-bit full-adder: 2*XOR + 2*AND + 1*OR
1-bit half-adder: 1*XOR + 1*AND
1-bit full-add-to-zero: 1*XOR + 1*AND
1-bit full-add-to-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-zero: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-one: 1*XNOR + 1*OR + 1*NOT
1-bit half-add-to-one: 1*NOT + wire
1-bit half-add-to-one-drop-lsb: wire

258-bit add to p drop lsb (A+p)/2: 249*full-add-to-one + 1*half-add-to-one-drop-lsb + 8*full-add-to-zero
  = 249*XNOR + 249*OR + 8*XOR + 8*AND
256-bit sub: 255*full-subtract + 1*half-subtract
  = 511*XOR + 511*AND + 255*OR + 511*NOT
258-bit sub: 258*full-subtract
  = 516*XOR + 516*AND + 258*OR + 516*NOT
256-bit mux: 768*NAND + 1*NOT
258-bit mux: 774*NAND + 1*NOT
256-bit compare ge: 256*XNOR + 256*AND + 256*NOT +   255*AND +   255*AND + 255*NOT + 255*AND
  = 256*XNOR + 1021*AND + 511*NOT

We have the following possible operations in the cycle:
1. 256-bit check for zero: 255*OR (+ 1*NOT)   <- not omitted
2. 256-bit >=: 256*XNOR + 1021*AND + 511*NOT
3. 256-bit subtraction: 511*XOR + 511*AND + 255*OR + 511*NOT
4. 258-bit add p, div 2: 249*XNOR + 249*OR + 8*XOR + 8*AND
5. 258-bit subtraction: 516*XOR + 516*AND + 258*OR + 516*NOT
3,4,5 are done twice

Control:
u1 = not 1. & not u[0]              - 1*AND + 1*NOT
v1 = not 1. & u[0] & not v[0]       - 2*AND + 1*NOT
u2 = not 1. & u[0] & v[0] & 2.      - 3*AND
v2 = not 1. & u[0] & v[0] & not 2.  - 1*AND   (repeat from up)
u = not (u1 | u2)                   - 1*OR + 1*NOT
v = not (v1 | v2)                   - 1*OR + 1*NOT
A1 = u1 & not A[0]                  - 1*AND + 1*NOT
A2 = u1 & A[0]                      - 1*AND
A3 = u2
A = u
C1 = v1 & not C[0]                  - 1*AND + 1*NOT
C2 = v1 & C[0]                      - 1*AND
C3 = v2
C = v
= 11*AND + 2*OR + 6*NOT

One cycle:
1. 255*OR
2. 256*XNOR + 1021*AND + 511*NOT
3. 1022*XOR + 1022*AND + 510*OR + 1022*NOT
4. 498*XNOR + 498*OR + 16*XOR + 16*AND
5. 1032*XOR + 1032*AND + 516*OR + 1032*NOT
ctrl. 11*AND + 2*OR + 6*NOT
mux. 10800*NAND + 14*NOT
= 754*XNOR + 2070*XOR + 3102*AND + 1782*OR + 10800*NAND + 2585*NOT

760 cycles: 573040*XNOR + 1573200*XOR + 2357520*AND + 1354320*OR + 8208000*NAND + 1964600*NOT
765 cycles: 576810*XNOR + 1583550*XOR + 2373030*AND + 1363230*OR + 8262000*NAND + 1977525*NOT

---

C mod p:
-2p <= C <= 2p, hence two times add to p, and two times add to -p
256-bit add to p: 249*full-add-to-one + 1*half-add-to-one + 6*full-add-to-zero
  = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit add to -p: 248*full-add-to-zero + 1*half-add-to-one + 7*full-add-to-one:
  = 248*XOR + 248*AND + 1*NOT + 7*XNOR + 7*OR

selecting the output:
257-bit mux: 771*NAND + 1*NOT
256-bit mux: 768*NAND + 1*NOT

control the output:
omitted, <16 elements

= 512*XNOR + 508*XOR + 508*AND + 512*OR + 3078*NAND + 8*NOT

---

total:
760 cycle: 573552*XNOR + 1573708*XOR + 2358028*AND + 1354832*OR + 8211078*NAND + 1964608*NOT
765 cycle: 577322*XNOR + 1584058*XOR + 2373538*AND + 1363742*OR + 8265078*NAND + 1977533*NOT

760 cycle: 16,035,806 gates
765 cycle: 16,141,271 gates

this is close to 8.5 times more gates than the wrong estimate
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4200
Merit: 8440



View Profile WWW
December 07, 2020, 09:29:15 PM
 #29

In any case I found proofs how to calculate bound limits both for non-binary and binary GCD.
While I see that You was writing that "AFAIK no one knows how to compute an tight bound for this kind of algorithm, but useful upper limits exist which is all you need."
(BTW, if You are intrested in proof details, then I will try to find initial Ishmukhametov's articles and post them here. Just let me know.)

Calculations for 256bits numbers shows that "worst-scenario cannot be bigger than 350 iterations for non-binary GCD"
and that "worst-scenario should be no smaller than 512 iterations for binary GCD."
As result it means that "it is guaranteed to get result in no more than 512 iterations for worst-scenario (both for binary both for non-binary GCD)".
By a "tight bound" I mean the exact number of iterations which is enough for the worst input, but no longer.  Instead, what we can compute is a loose bound: a number which is large enough for any input but might be too large.

"worst-scenario should be no smaller than 512 iterations for binary GCD." might well be true, but "no smaller than" is not useful to you. For determining the circuit size you need to know that the worst case is no larger than some size.

As an aside, I found some random binary GCD from an internet search and it typically needed around a bit over 1000 iterations for random inputs and the secp256k1 field prime. (I would have tried the one from j2002ba2's earlier post but it gave wrong results when I tested it, though perhaps I made an error while transliterating it).

Quote
You are speaking about GCD algorithm with 735iterations. Why do we need 735 if 512 enough ? My answer - lot of them are dummy(fake).
They deliberately(willfully) added dummy iterations generated by quasi-random algorithm as technique to fight against side-channel attacks.
Nope that is just incorrect.

There aren't any dummy iterations except in the sense that on some numbers it takes fewer than the maximum so there will be some iterations that don't do anything useful-- but those iterations will still be there in the logic implementation. ... also in the sense that the known maximum bound isn't tight, even on the worst number it probably takes a few less.

The algorithm of the safegcd paper takes more iterations than some other constant time binary egcds because it is a least-significant bit algorithm:  each iteration only looks at and updates the least significant bits of their numbers.  In software with 32/64 bit numbers this ends up being a LOT faster than versions that have to update the whole 256 bit numbers at each step.

Implemented as raw logic I suspect a MSB binary egcd will be more efficient because you can shrink the numbers that it works on a bit at a time as their magnitude is guaranteed to be decreased by the algorithm.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4200
Merit: 8440



View Profile WWW
December 07, 2020, 09:30:31 PM
 #30

760 cycle: 16,035,806 gates
765 cycle: 16,141,271 gates

this is close to 8.5 times more gates than the wrong estimate
This sounds more reasonable.
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 08, 2020, 02:01:07 AM
Last edit: December 08, 2020, 02:47:04 AM by hakabit
 #31

So here is my python code to test iterations.(see at the end)

And some part of test-results:

test# in (out, iterx, iter1, iter2, iter3, itersum):
0 51387622388232169033260144332317924780174711222857253999722512827559407809221 (42194227499242744317322493862270938245042350176722270391089169414969189792505, 173, 0, 177, 193, 370)
1 49303630616562965676786885508349510956169277855033004449417649058314657299485 (47016891994125063651541437063141305860063754426223854778073828393151612275997, 178, 0, 192, 172, 364)
2 60564124354985471474884769266985666656540021107491550318511011936878151315223 (96860706416467978990701996056838683430055307461271246547818608610050676002308, 181, 0, 175, 187, 362)
3 26876395035512405675280385359306825136587569085110256960487392157911296412919 (68790574322733353789092487174594866418670893295029170296485893186408773988152, 182, 0, 176, 183, 359)
4 11023636690140810652109138350092873223350285162460178239756198575491250070806 (79574857778441618190374605398778396331605356310850458929090676313728623789452, 186, 0, 163, 191, 354)
5 48673899963758261337290584766356417468862845552291965980487962550310458587722 (43493957935191599280071520123333639912283544665443371757241332743464486359707, 183, 0, 177, 184, 361)
6 13375659417630302272707017036247711675902473928798522670864387374916032135255 (91371322322092500654928225171492304811395845400276376022117681208712975765739, 174, 0, 178, 179, 357)
7 18933447562660498799151803539411934730546092368277561544957971138167446797703 (58387591255298048564603862915862446716810659550968633559759220573583656442311, 185, 0, 195, 167, 362)
#
#
#

Test-runs shows that:
1) step5 cannot be skipped as it was suggested by j2002ba2, because usually it will lead to infinite loop, especially for small numbers. try to send as input 12.
2) algortihm has internal bug leading to output results bigger than 256bits. try to send as input 10321180907436651991825056156721508500075068309065038805868074283166084193390. I've placed a fix - to send back moduled resuts.
3) itersum shows that it's quite impossible to get more than 512 iterations, I feel 384 is the biggest possible.
keep in mind that iter2 and iter3 loops are not stand-alone, they are running internally in iterx loop.

In fact I am not sure if it's good idea to use itersum=iter2+iter3 as iterations indicator. possibly it's more correct to pay attention only to iterx, or to largest iter.


Code:
#code by hakabit supraspecially for gmaxwell
#algorithm source: http://cacr.uwaterloo.ca/hac/about/chap14.pdf, page 608, section "14.61 Algorithm Binary extended gcd algorithm"

import random
#print(random.getrandbits(256))

def ext_binary_gcd(x, y):

    g = 1

    iterx = 0
    iter1 = 0
    iter2 = 0
    iter3 = 0

    while (x % 2 == 0) and (y % 2 == 0) :    # yeah, funny cycle when one number is prime :-)

        iter1 = iter1 + 1

        x = x // 2
        y = y // 2
        g = 2 * g

    u = x
    v = y
    A = 1
    B = 0
    C = 0
    D = 1

    while (u != 0) :

        iterx = iterx + 1

        #step4
        while (u % 2 == 0) :

            iter2 = iter2 + 1

            u = u // 2

            if (A % 2 == 0) and (B % 2 == 0) :

                A = A // 2
                B = B // 2

            else :

                A = (A + y) // 2
                B = (B - x) // 2

        while (v % 2 == 0) :

            iter3 = iter3 + 1

            v = v // 2

            if (C % 2 == 0) and (D % 2 == 0) :

                C = C // 2
                D = D // 2

            else :

                C = (C + y) // 2
                D = (D - x) // 2

        if (u >= v) :

            u = u - v
            A = A - C
            B = B - D

        else :

            v = v - u
            C = C - A
            D = D - B

    #a = C
    #b = D

    pp = 2 ** 256 - 2 ** 32 - 977
    b = D % pp   #BLM-rig

    #return (a, b, g*v)
    return (b, iterx, iter1, iter2, iter3, iter2 + iter3)

p = 2**256 - 2**32 - 977

#a=12
#res=ext_binary_gcd(p,a)
#print ("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print ("in (out, iterbig, iter1, iter2, iter3): ",a,res)



print("test# in (out, iterx, iter1, iter2, iter3, itersum): ")
for i in range(1000):

    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p

    if (ii < 2):
        ii = ii + 2

    res = ext_binary_gcd(p, ii)
    print(i, ii, res)

#end
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 08, 2020, 04:14:09 AM
Last edit: December 08, 2020, 04:24:34 AM by hakabit
 #32

Here we go !
One more test code. no over 256bits bug, returns also symmetric by module result. a little bit fuzzy conditions checking logic, I can call it quasi-recursive. it's binary in it's nature, just realized in a kind strange way.

Part of results:

test# calls# in [out, mod symm out]
0 364 75316636100529471829340583998430204208230490280793297630878645139432980850156 [-21481456315583653538655726848489351884800251925024345069131289520822185938657, 33025674478099610959444009195219596039238050737410647384245105885135854051132]
1 385 39888426508246177183753123701414070167668113673592482698911298409940354217536 [-29960760575314177163053332470999754258247890690156906035102057524026276070769, 86973073792154943490949046208733472044349727771710623997020383424312036814343]
2 360 14620752759159978489146343667071343355650196237127974340021988638599091328303 [12174829735737702474779493260773244294171916983485771075179095797017886235155, -96421093662667652633700306227539987869225167103772496913502848083575395070788]
3 383 102729399071985721675035906741175076856965109198477353566672326776001633036975 [70580439539010164434200376999720640409016804834521829709981617744376762996302, -79555186999421925612988503851178736579078677622324526318628793408380057609471]
4 370 46339676628294076970974548809934900743231066044015694600462844368554430445143 [2539235054366037579706180091206398081627721978267432321188130108233551671292, -6344958648894604413478372293328991555118887139277093896810498521216296850165]
5 368 26660698059602402176427980216008131458879504029132938908230058389025515567933 [16540677529517895161096648910563987919480254183872798943295671897534551358308, -71839064538438794363947864942753805375404185837480918551631678838489697347191]
6 354 594851748071041868996674850995031382743985475622021247651263573480083167890 [-64638328581828718437634047594178337042202536377421456333765449858808997403, 12582306659716172481129154954429736186019811569145338237994747628567568503971]
7 359 80681079014327548209785902873606497791641955222184391014151038138305717861451 [44686311015859290531071619784617096434102136509364594238047222505744964738512, -64133020728639186296951506401894836855168854511276042070060900652427696988205]
8 372 82669520055979716581712547816646165489596227017635063706256738823534854476311 [45051940896140100306048968032786700593661378537963560664532177163878090793890, -63102560133743198571360833063775568154007385207992635050991710591879861205579]
9 360 55943743540139384302837049042449505019549115662377158567347079861108763368228 [-21482989564914934180757589310789560187172090496876602635626422561634142769637, 44465387680038776841839359752521363942810960966395512063353292864302945775019]
10 349 66724607627136654255694803441709862788832479174235937080362394192249729343898 [29069806528358354934610693165080779089541395635574791746246411666180327081765, -50446960294663850907746139715414905993368124572672804919153802516270964965753]
#
#
#

As in previous code iterations count close to 384number.
But it's not good iterations indicator. I have idea how to count real true unrolled iterations, but need some time to implement.


Code:
#code by hakabit supraspecially for gmaxwell
#algorithm source: https://dspace.cvut.cz/bitstream/handle/10467/77227/F3-DP-2018-Zhumaniezov-Alisher-priloha-Testing.py

import random

def bin_ext_gcd(a, b):

    global cc
    cc = cc + 1

    if a == 0:
        #print(a, b, 0, 1)
        return [0, 1]

    if b == 0:
        #print(a, b, 1, 0)
        return [1, 0]

    if a == b:
        #print(a, b, 1, 0)
        return [1, 0]

    if (a % 2 == 0) & (b % 2 == 0):
        x, y = bin_ext_gcd(a // 2, b // 2)
        #print(a, b, x, y)
        return [x, y]

    if a % 2 == 0:
        x, y = bin_ext_gcd(a // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y
        else:
            x, y = (x + b) // 2, y - a // 2
        #print(a, b, x, y)
        return [x, y]

    if b % 2 == 0:
        x, y = bin_ext_gcd(a, b // 2)
        if y % 2 == 0:
            x, y = x, y // 2
        else:
            x, y = x - b // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

    if a > b:
        x, y = bin_ext_gcd((a - b) // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y - x // 2
        else:
            x, y = (x + b) // 2, y - (a - b) // 2 - (x + b) // 2
        #print(a, b, x, y)
        return [x, y]

    if a < b:
        x, y = bin_ext_gcd(a, (b - a) // 2)
        if y % 2 == 0:
            x, y = x - y // 2, y // 2
        else:
            x, y = x - (b - a) // 2 - (y + a) // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

cc = 0

p = 2**256 - 2**32 - 977


#cc=0
#a=12
#res=bin_ext_gcd(p,a)
#print ("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print (res,cc)



print ("test# calls# in [out, mod symm out]")
for i in range(100):

    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p

    if (ii < 2):
        ii = ii + 2

    cc = 0
    res = bin_ext_gcd(p, ii)
    print(i,cc,ii,res)

#end
j2002ba2
Full Member
***
Offline Offline

Activity: 206
Merit: 444


View Profile
December 08, 2020, 08:50:08 AM
 #33

Multiplication is cheaper since p is fixed.

1-bit full-add: 2*XOR + 2*AND + 1*OR
1-bit half-add: 1*XOR + 1*AND
1-bit full-add-zero: 1*XOR + 1*AND
1-bit full-add-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit half-add-one: 1*NOT + wire

256-bit-add: 255*full-add + 1*half-add
  = 511*XOR + 511*AND + 1*OR
256-bit sub: 255*full-subtract + 1*half-subtract
  = 511*XOR + 511*AND + 255*OR + 511*NOT
256-bit-add-p: 249*full-add-one + 1*half-add-one + 6*full-add-zero
  = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit-add-minus-p: 249*full-add-zero + 1*half-add-one + 6*full-add-one
  = 249*XOR + 249*AND + 1*NOT + 6*XNOR + 6*OR
256-bit mux: 768*NAND + 1*NOT

add-mod-p: 256-bit-add + 256-bit-add-minus-p + 256-bit-mux
 = 6*XNOR + 760*XOR + 760*AND + 7*OR + 768*NAND + 2*NOT
sub-mod-p: 256-bit-sub + 256-bit-add-p + 256-bit-mux
 = 249*XNOR + 517*XOR + 517*AND + 504*OR + 768*NAND + 513*NOT
mul-mod-p: 256*add-mod-p
 = 0x600*XNOR + 0x2f800*XOR + 0x2f800*AND + 0x700*OR + 0x30000*NAND + 0x200*NOT

mul-mod-p is 589,568 gates; this makes it ~27 times cheaper than inversion

j2002ba2
Full Member
***
Offline Offline

Activity: 206
Merit: 444


View Profile
December 08, 2020, 10:30:22 AM
Merited by NotATether (5)
 #34

Point addition

In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p
  = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769

Affine coordinates:
bit  PA   256-bit-muxes
 1: 255   2*256             4,421,791,553
 2: 127   2*4*128           2,202,816,961
 4:  63   2*16*64           1,093,920,257
 8:  31   2*256*32            550,102,561
16:  15   2*65536*16        1,872,792,113
 9:  28   2*(512*28 + 16)     507,560,196
10:  25   2*(1024*25 + 64)    472,941,607
11:  23   2*(2048*23 + 8 )    471,251,001
12:  21   2*(4096*21 + 16)    496,432,331


Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit  PA   256-bit-muxes
 1: 255   2*256 + 256           1,676,343,279
 2: 127   2*4*128 + 128           844,170,607
 4:  63   2*16*64 + 64            428,674,863
 8:  31   2*256*32 + 32           231,557,647
16:  15   2*65536*16 + 16       1,727,597,439
 9:  28   2*(512*28 + 16) + 29    221,518,452
10:  25   2*(1024*25 + 64) + 26   219,403,033
11:  23   2*(2048*23 + 8 ) + 24   239,381,207



Using mixed Jacobian-affine coordinates,
adding in groups of 10 bits (25 1024-point memories plus 64-point memory),
we get 219,403,033 gates.

hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 08, 2020, 01:05:13 PM
Last edit: December 08, 2020, 01:25:52 PM by hakabit
 #35

Point addition


Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit  PA   256-bit-muxes

1: 255   2*256 + 256           1,676,343,279

10:  25   2*(1024*25 + 64) + 26   219,403,033




Using mixed Jacobian-affine coordinates,
adding in groups of 10 bits (25 1024-point memories plus 64-point memory),
we get 219,403,033 gates.



Thanks for fresh info !

I still can't understand how You optimizing gates count by combining bits.
In Your initial post You was using such technique for additions, now I see You are using it for any operations.

Please, explain me in a very simple steps.

For example we have two 32bits numbers to summarize.

00111111111111111111111111111111
00000000000000000000000000000001

And the same numbers splitted to 10bit limbs.

1111111111 1111111111 1111111111
0000000000 0000000000 0000000001

Let's say we are using classical cascaded full-adder(ripple-carry adder)
How amount of gates can be lowered due to splitting ?!


Either... do You mean using not 2input, but 10input gates and summarize in parallel not two, but ten numbers ? But it's against my initial limitations.

Or is this technique in some kind similiar to Karatsuba algorithm ?!
hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 08, 2020, 01:35:55 PM
 #36


In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p
  = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p
  = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769


Some more questions.
1) what is Your formulae for affine point addition with only two multiplications ?
the most optimized version I found has multiplication+multiplication+squaring=3multiplications.
2) what is Your formulae for Jacobian point addition with only 11multiplications ?
the most optimized version I found has 17multiplications.

of course, there are Jacobian formulaes with smaller multiplications amount, but they are limited by Z1=1, so can be used only to add pre-computated or re-converted affine<->Jacobian points.

Also I would like to note that by test-runs for affine we have +-384iterations, but not +-760.
j2002ba2
Full Member
***
Offline Offline

Activity: 206
Merit: 444


View Profile
December 08, 2020, 03:35:55 PM
 #37


I still can't understand how You optimizing gates count by combining bits.


The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.

i.e. at lsb position
bits
00 - 0*G
01 - 1*G
10 - 2*G
11 - 3*G

next group would be
00 - 0*G
01 - 4*G
10 - 8*G
11 - 12*G

and so on.

If we use groups of 4 bits - 24 points memory per group.

It turns out that with affine points best is to use groups of 11 bits, plus 3 bits for the last group.
For Jacobian+affine point additions best is groups of 10 bits, plus 6 bits for the last group.


Some more questions.
1) what is Your formulae for affine point addition with only two multiplications ?
the most optimized version I found has multiplication+multiplication+squaring=3multiplications.
2) what is Your formulae for Jacobian point addition with only 11multiplications ?
the most optimized version I found has 17multiplications.

Also I would like to note that by test-runs for affine we have +-384iterations, but not +-760.


1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.

For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.

hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 08, 2020, 06:24:00 PM
Last edit: December 08, 2020, 08:58:28 PM by hakabit
 #38


I still can't understand how You optimizing gates count by combining bits.


The most naive point addition is with memory 1 point (G). Then we double the generator and add points appropriately. However all doubled points are constant, hence we could precompute all 255 doublings. The result is 256 points memory, no doubling. Since point addition is more expensive compared to implementing memory through multiplexers, we could go further. Split the private key in groups of 2 bits, precompute for each group the four combinations (resulting in 4*128 points memory), and feed the appropriate points to 127 point adders.




1. It is quite possible, that I have a mistake here, missing one multiplication. Anyway Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squares, 6 subtractions.

For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.


Big thanks for detailed answer !

Yeah, I understand Your idea. When I was constructing my trillionized scheme I was estimating such grouping.
I got:
group by 1 = 256 pre-computations
group by 2 = about 32768 pre-computations
group by 3 = about 2.7mln pre-computation
after seen agressively exponential grow I understood that it's bad idea.

Now I understand that I've made a big combinatorical logic mistake.

After mistake detected(thanks to You), now I can say, of course such grouping is great idea, and yes, more than likely optimal grouping is by ~10bits.


I found "Algorithm 3.22". I see it's hybrid for affine+Jacobian. But I can't understand one thing.
Ok, let's pre-compute points grouping up to 11bits.
Then let's take privkey 111111111111111111 (2+16 leading ones) {skipped some bits} 100000000000000001 (16 leeding zeros)
How can be pubkey calculated using this algorithm without re-converts affine<->Jacobian ?!

(from other side, I can agree, that using privkey with more than 11 leading same bits is not good idea, but...)


Now what about iterations. Ouch... my test-runs shows about 384iterations. but... for such multi-conditional multi-loop algorith it's quite complex to count iterations.
My idea is that I need to place special triggers on each loop, then log results to array, then analyse array in detecting consequental and-or parallel triggers switching.
But I am not too good in Python or Ruby, so it will took a time to implement...

As result I suppose that there will be either 384 or 2*384=768 max iterations( if max input number = 2^256).
of, course in secp256k1 module is slightly smaller than 2^256, so there can be sligthly smaller than 768iters...
...but should be 512 !  Cool

hakabit (OP)
Jr. Member
*
Offline Offline

Activity: 42
Merit: 18


View Profile
December 08, 2020, 08:47:09 PM
 #39





For binary euclidean gcd I get 760 iterations, when running with input 0x8000...00 and p.


Please, post Your code here.  Wink
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4200
Merit: 8440



View Profile WWW
December 09, 2020, 02:21:22 AM
 #40

the most optimized version I found has 17multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE that uses 8mul+3sqr.
Pages: « 1 [2] 3 »  All
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!