gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 05, 2020, 03:15:51 AM |
|
The algorithm you are discussing is a binary (extended) GCD, not the "original" Euclid's algorithm. The original Euclid's algorithm has an arbitrary division (well, a modular reduction, but same deal) in its inner loop.
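To make the distinction concrete, here is a minimal sketch (mine, not from the post) of the two inner loops: Euclid's with a modular reduction every iteration, versus Stein's binary GCD using only shifts, compares, and subtractions:

```python
def euclid_gcd(a, b):
    # classical Euclid: an expensive modular reduction in each iteration
    while b:
        a, b = b, a % b
    return a

def binary_gcd(a, b):
    # Stein's binary GCD: only shifts, compares, and subtractions
    if a == 0:
        return b
    if b == 0:
        return a
    shift = 0
    while (a | b) & 1 == 0:      # strip common factors of two
        a >>= 1
        b >>= 1
        shift += 1
    while a & 1 == 0:
        a >>= 1
    while b:
        while b & 1 == 0:
            b >>= 1
        if a > b:
            a, b = b, a
        b -= a                   # difference of two odd numbers is even
    return a << shift
```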
|
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 06, 2020, 04:19:09 PM |
|
However, I was thinking that the worst-case EGCD (or BEGCD) scenario cannot be worse than GCD.
No, that is entirely untrue, unfortunately. Binary GCD is faster in practice not because it makes fewer iterations but because there isn't a very expensive division in each iteration.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 06, 2020, 08:38:44 PM |
|
However, I was thinking that the worst-case EGCD (or BEGCD) scenario cannot be worse than GCD.
No, that is entirely untrue, unfortunately. Binary GCD is faster in practice not because it makes fewer iterations but because there isn't a very expensive division in each iteration.

I am sorry, I mixed up the abbreviations. I meant (and should have written) that the binary version cannot be worse than the non-binary one. I provided a proof that the non-binary version takes at most 390 iterations. So I was trying to say that there is no need to process 735 iterations in the binary version. But 390 was an upper (not a lower) bound. Today I provided a proof for a maximum of 248 iterations (and, as I understand it, that is a lower bound?!).

What about 735... in another post you provided a URL to a PDF on the cr.yp.to website. This PDF is named "Fast constant-time gcd computation and modular inversion". In cryptography, constant-time algorithms are used to prevent side-channel attacks. So of course a constant-time GCD algorithm can contain 735 or even more iterations; most of them are dummy (fake, useless).
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 06, 2020, 09:32:19 PM Last edit: December 06, 2020, 11:11:02 PM by hakabit |
|
I'mmm sorrry. Calculating 2/ln(k)*ln(A) I made a kid's mistake: I did not divide by 2 but multiplied by 2. So... 2/(0.7) * 177 = 506 max iterations. I'm a little bit unobservant: 2/ln(2)*ln(2^256) = (2/0.6931471805599453) * 177.445678223346 = 512.00000000000000923324826168937. So... 512 max iterations.

But now I am a little bit puzzled: 390 max iterations for the non-binary version and 512 max iterations for the binary one. I hope the larger iteration count is a trade-off for fewer logic gates per iteration.
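The corrected arithmetic is easy to sanity-check (a quick script of mine, not part of the post); 2/ln(2) * ln(2^256) simplifies exactly to 2*256:

```python
import math

# 2/ln(2) * ln(2^256) = 2 * log2(2^256) = 2 * 256 = 512
bound = 2 / math.log(2) * math.log(2**256)
print(bound)  # approximately 512, up to floating-point noise
```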
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 06, 2020, 10:42:00 PM |
|
In cryptography, constant-time algorithms are used to prevent side-channel attacks. So of course a constant-time GCD algorithm can contain 735 or even more iterations; most of them are dummy (fake, useless).
In the case of an unrolled logic implementation, *all* candidates will be constant time. Algorithms like binary GCDs are made constant time by unrolling to their maximum number of operations. If you allow memory and looping, then the problem can be computed with an extremely small number of gates... (just the smallest computer with an ALU you can find, plus enough gates for the memory for a point, the input, a couple hundred bits of temporaries, and the program itself.)

I mean (and had to write) that the binary version cannot be worse than the non-binary one.
That is how I understood what you wrote. But all of the binary GCD variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical GCD to reason about the iteration counts of some binary GCD.
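As an illustration of "constant time by unrolling", here is a sketch I put together (not from the thread) of a binary GCD that always performs the same fixed number of iterations, padding with do-nothing steps once the result is reached; an unrolled circuit simply flattens such a loop into that many identical stages:

```python
def fixed_iter_gcd(a, b, bits=256):
    # assumes a is odd (e.g. a prime modulus); always runs 2*bits + 2
    # iterations regardless of input, so an unrolled circuit has fixed depth
    u, v = a, b
    for _ in range(2 * bits + 2):
        if v == 0:
            pass                  # dummy iteration: result already reached
        elif v & 1 == 0:
            v >>= 1               # halve the even operand
        else:
            if u > v:
                u, v = v, u       # keep the smaller odd value in u
            v = v - u             # odd minus odd is even; halved next time
    return u
```

Each subtraction is followed by at least one halving, and v can be halved at most 255 times, so 2*256 + 2 iterations always suffice for 256-bit inputs; the remainder are dummies.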
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 07, 2020, 06:37:20 PM |
|
In cryptography, constant-time algorithms are used to prevent side-channel attacks. So of course a constant-time GCD algorithm can contain 735 or even more iterations; most of them are dummy (fake, useless).
In the case of an unrolled logic implementation, *all* candidates will be constant time. Algorithms like binary GCDs are made constant time by unrolling to their maximum number of operations. If you allow memory and looping, then the problem can be computed with an extremely small number of gates... That is how I understood what you wrote. But all of the binary GCD variants will perform (many) more iterations than the classical GCD. You cannot use the behaviour of the classical GCD to reason about the iteration counts of some binary GCD.

Now I have to agree that I do not understand GCD algorithms as well as I thought. I need some time to analyse them further. In any case, I found proofs of how to calculate bound limits both for non-binary and binary GCD, while I see that you wrote that "AFAIK no one knows how to compute a tight bound for this kind of algorithm, but useful upper limits exist which is all you need." (BTW, if you are interested in the proof details, I will try to find the original Ishmukhametov articles and post them here. Just let me know.)

Calculations for 256-bit numbers show that "the worst-case scenario cannot be bigger than 350 iterations for non-binary GCD" and that "the worst-case scenario should be no smaller than 512 iterations for binary GCD". As a result, it means that "it is guaranteed to get the result in no more than 512 iterations in the worst case (both for binary and for non-binary GCD)".

You are speaking about a GCD algorithm with 735 iterations. Why do we need 735 if 512 are enough? My answer: a lot of them are dummy (fake). They deliberately (willfully) added dummy iterations generated by a quasi-random algorithm as a technique to fight side-channel attacks. Or they unrolled the loops in an incorrect way, adding useless iterations without intending to. What do you think?

P.S. Of course, any algorithm can be computed using only one NAND (or NOR) logic gate + a lot of memory and muxes. I would like to write much about unrolled logic and constant time, but prefer to wait for your answer regarding the iteration overcounting.
|
|
|
|
j2002ba2
|
|
December 07, 2020, 09:15:27 PM |
|
For binary Euclidean GCD the maximum number of iterations is <=760.
To achieve it, let u be zero in every bit except msb=1 (i.e. u = 2^255). Then step 6 runs 255 cycles (the number of zero bits), and we end with u=1. Next, steps 21 and 12 alternate, except for the few zero bits in p: step 12 is executed 255 times, and step 21 is executed 249 times. Finally u and v are both 1, and step 18 is executed.
Another way to look at it: steps 6 and 12 each have to be executed at most 255 times. This limits either step 18 or step 21 to at most 255 executions, which gives <=765 cycles.
Even A tends towards zero, odd A tends towards p. After every execution of step 18, u is even and step 6 is executed at least once; the same holds for steps 21 and 12. Then A and C are between -2p and 2p (after step 18/21): 256+1 bits plus 1 sign bit, for a total of 258 bits.
1-bit full-adder: 2*XOR + 2*AND + 1*OR
1-bit half-adder: 1*XOR + 1*AND
1-bit full-add-to-zero: 1*XOR + 1*AND
1-bit full-add-to-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-zero: 1*XOR + 1*AND + 1*NOT
1-bit full-sub-one: 1*XNOR + 1*OR + 1*NOT
1-bit half-add-to-one: 1*NOT + wire
1-bit half-add-to-one-drop-lsb: wire
258-bit add to p drop lsb (A+p)/2: 249*full-add-to-one + 1*half-add-to-one-drop-lsb + 8*full-add-to-zero = 249*XNOR + 249*OR + 8*XOR + 8*AND
256-bit sub: 255*full-subtract + 1*half-subtract = 511*XOR + 511*AND + 255*OR + 511*NOT
258-bit sub: 258*full-subtract = 516*XOR + 516*AND + 258*OR + 516*NOT
256-bit mux: 768*NAND + 1*NOT
258-bit mux: 774*NAND + 1*NOT
256-bit compare ge: 256*XNOR + 256*AND + 256*NOT + 255*AND + 255*AND + 255*NOT + 255*AND = 256*XNOR + 1021*AND + 511*NOT
We have the following possible operations in the cycle:
1. 256-bit check for zero: 255*OR (+ 1*NOT) <- not omitted
2. 256-bit >=: 256*XNOR + 1021*AND + 511*NOT
3. 256-bit subtraction: 511*XOR + 511*AND + 255*OR + 511*NOT
4. 258-bit add p, div 2: 249*XNOR + 249*OR + 8*XOR + 8*AND
5. 258-bit subtraction: 516*XOR + 516*AND + 258*OR + 516*NOT
3, 4, 5 are done twice
Control:
u1 = not 1. & not u[0] - 1*AND + 1*NOT
v1 = not 1. & u[0] & not v[0] - 2*AND + 1*NOT
u2 = not 1. & u[0] & v[0] & 2. - 3*AND
v2 = not 1. & u[0] & v[0] & not 2. - 1*AND (repeat from up)
u = not (u1 | u2) - 1*OR + 1*NOT
v = not (v1 | v2) - 1*OR + 1*NOT
A1 = u1 & not A[0] - 1*AND + 1*NOT
A2 = u1 & A[0] - 1*AND
A3 = u2
A = u
C1 = v1 & not C[0] - 1*AND + 1*NOT
C2 = v1 & C[0] - 1*AND
C3 = v2
C = v
= 11*AND + 2*OR + 6*NOT
One cycle:
1. 255*OR
2. 256*XNOR + 1021*AND + 511*NOT
3. 1022*XOR + 1022*AND + 510*OR + 1022*NOT
4. 498*XNOR + 498*OR + 16*XOR + 16*AND
5. 1032*XOR + 1032*AND + 516*OR + 1032*NOT
ctrl. 11*AND + 2*OR + 6*NOT
mux. 10800*NAND + 14*NOT
= 754*XNOR + 2070*XOR + 3102*AND + 1782*OR + 10800*NAND + 2585*NOT
760 cycles: 573040*XNOR + 1573200*XOR + 2357520*AND + 1354320*OR + 8208000*NAND + 1964600*NOT
765 cycles: 576810*XNOR + 1583550*XOR + 2373030*AND + 1363230*OR + 8262000*NAND + 1977525*NOT
---
C mod p: -2p <= C <= 2p, hence two times add to p, and two times add to -p
256-bit add to p: 249*full-add-to-one + 1*half-add-to-one + 6*full-add-to-zero = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit add to -p: 248*full-add-to-zero + 1*half-add-to-one + 7*full-add-to-one = 248*XOR + 248*AND + 1*NOT + 7*XNOR + 7*OR
selecting the output:
257-bit mux: 771*NAND + 1*NOT
256-bit mux: 768*NAND + 1*NOT
control the output: omitted, <16 elements
= 512*XNOR + 508*XOR + 508*AND + 512*OR + 3078*NAND + 8*NOT
---
total:
760 cycles: 573552*XNOR + 1573708*XOR + 2358028*AND + 1354832*OR + 8211078*NAND + 1964608*NOT
765 cycles: 577322*XNOR + 1584058*XOR + 2373538*AND + 1363742*OR + 8265078*NAND + 1977533*NOT

760 cycles: 16,035,806 gates
765 cycles: 16,141,271 gates
this is close to 8.5 times more gates than the wrong estimate
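Taking the per-cycle and output-stage tallies above as given, the totals can be re-derived mechanically; a small verification script of mine, using the post's numbers:

```python
# per-cycle gate counts from the analysis above
cycle = {'XNOR': 754, 'XOR': 2070, 'AND': 3102, 'OR': 1782, 'NAND': 10800, 'NOT': 2585}
# the one-off final "C mod p" output stage
out_stage = {'XNOR': 512, 'XOR': 508, 'AND': 508, 'OR': 512, 'NAND': 3078, 'NOT': 8}

def total_gates(cycles):
    # multiply the per-cycle counts out and add the output stage
    return {g: cycles * n + out_stage[g] for g, n in cycle.items()}

t760, t765 = total_gates(760), total_gates(765)
print(sum(t760.values()))  # 16035806
print(sum(t765.values()))  # 16141271
```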
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 07, 2020, 09:29:15 PM |
|
In any case, I found proofs of how to calculate bound limits both for non-binary and binary GCD, while I see that you wrote that "AFAIK no one knows how to compute a tight bound for this kind of algorithm, but useful upper limits exist which is all you need." (BTW, if you are interested in the proof details, I will try to find the original Ishmukhametov articles and post them here. Just let me know.)
Calculations for 256-bit numbers show that "the worst-case scenario cannot be bigger than 350 iterations for non-binary GCD" and that "the worst-case scenario should be no smaller than 512 iterations for binary GCD". As a result, it means that "it is guaranteed to get the result in no more than 512 iterations in the worst case (both for binary and for non-binary GCD)".
By a "tight bound" I mean the exact number of iterations which is enough for the worst input, but no longer. Instead, what we can compute is a loose bound: a number which is large enough for any input but might be too large. "The worst-case scenario should be no smaller than 512 iterations for binary GCD" might well be true, but "no smaller than" is not useful to you. For determining the circuit size you need to know that the worst case is no *larger* than some size. As an aside, I found some random binary GCD from an internet search and it typically needed a bit over 1000 iterations for random inputs and the secp256k1 field prime. (I would have tried the one from j2002ba2's earlier post, but it gave wrong results when I tested it, though perhaps I made an error while transliterating it.)

You are speaking about a GCD algorithm with 735 iterations. Why do we need 735 if 512 are enough? My answer: a lot of them are dummy (fake). They deliberately (willfully) added dummy iterations generated by a quasi-random algorithm as a technique to fight side-channel attacks.
Nope, that is just incorrect. There aren't any dummy iterations, except in the sense that on some numbers it takes fewer than the maximum, so there will be some iterations that don't do anything useful-- but those iterations will still be there in the logic implementation. ...also in the sense that the known maximum bound isn't tight; even on the worst number it probably takes a few fewer. The algorithm of the safegcd paper takes more iterations than some other constant-time binary EGCDs because it is a least-significant-bit algorithm: each iteration only looks at and updates the least significant bits of the numbers. In software with 32/64-bit words this ends up being a LOT faster than versions that have to update the whole 256-bit numbers at each step. Implemented as raw logic, I suspect an MSB binary EGCD will be more efficient, because you can shrink the numbers it works on a bit at a time, as their magnitude is guaranteed to decrease as the algorithm runs.
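For reference, the least-significant-bit step the safegcd paper iterates is roughly the following (my sketch of the paper's divstep; a real constant-time version runs the proven worst-case number of steps unconditionally instead of looping until g reaches zero):

```python
def divstep(delta, f, g):
    # one safegcd "division step": decisions depend only on the sign of
    # delta and the lowest bit of g; f stays odd throughout
    if delta > 0 and g & 1:
        return 1 - delta, g, (g - f) // 2
    return 1 + delta, f, (g + (g & 1) * f) // 2

def gcd_by_divsteps(f, g, max_iters=1000):
    # f must be odd; iterate until g reaches 0, counting the steps taken
    delta, steps = 1, 0
    while g != 0 and steps < max_iters:
        delta, f, g = divstep(delta, f, g)
        steps += 1
    return abs(f), steps
```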
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 07, 2020, 09:30:31 PM |
|
760 cycles: 16,035,806 gates
765 cycles: 16,141,271 gates
this is close to 8.5 times more gates than the wrong estimate
This sounds more reasonable.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 08, 2020, 02:01:07 AM Last edit: December 08, 2020, 02:47:04 AM by hakabit |
|
So here is my Python code to test iterations (see at the end). And some of the test results:

test# in (out, iterx, iter1, iter2, iter3, itersum):
0 51387622388232169033260144332317924780174711222857253999722512827559407809221 (42194227499242744317322493862270938245042350176722270391089169414969189792505, 173, 0, 177, 193, 370)
1 49303630616562965676786885508349510956169277855033004449417649058314657299485 (47016891994125063651541437063141305860063754426223854778073828393151612275997, 178, 0, 192, 172, 364)
2 60564124354985471474884769266985666656540021107491550318511011936878151315223 (96860706416467978990701996056838683430055307461271246547818608610050676002308, 181, 0, 175, 187, 362)
3 26876395035512405675280385359306825136587569085110256960487392157911296412919 (68790574322733353789092487174594866418670893295029170296485893186408773988152, 182, 0, 176, 183, 359)
4 11023636690140810652109138350092873223350285162460178239756198575491250070806 (79574857778441618190374605398778396331605356310850458929090676313728623789452, 186, 0, 163, 191, 354)
5 48673899963758261337290584766356417468862845552291965980487962550310458587722 (43493957935191599280071520123333639912283544665443371757241332743464486359707, 183, 0, 177, 184, 361)
6 13375659417630302272707017036247711675902473928798522670864387374916032135255 (91371322322092500654928225171492304811395845400276376022117681208712975765739, 174, 0, 178, 179, 357)
7 18933447562660498799151803539411934730546092368277561544957971138167446797703 (58387591255298048564603862915862446716810659550968633559759220573583656442311, 185, 0, 195, 167, 362)
|
# # #

Test runs show that:
1) step 5 cannot be skipped as was suggested by j2002ba2, because it will usually lead to an infinite loop, especially for small numbers. Try sending 12 as input.
2) the algorithm has an internal bug leading to output results bigger than 256 bits. Try sending 10321180907436651991825056156721508500075068309065038805868074283166084193390 as input. I've placed a fix: the results are reduced mod p before being returned.
3) itersum shows that it's quite impossible to get more than 512 iterations; I feel 384 is the biggest possible. Keep in mind that the iter2 and iter3 loops are not stand-alone; they run inside the iterx loop. In fact I am not sure it's a good idea to use itersum = iter2 + iter3 as the iterations indicator; possibly it's more correct to pay attention only to iterx, or to the largest iter.

#code by hakabit supraspecially for gmaxwell
#algorithm source: http://cacr.uwaterloo.ca/hac/about/chap14.pdf, page 608, section "14.61 Algorithm Binary extended gcd algorithm"
import random
#print(random.getrandbits(256))

def ext_binary_gcd(x, y):
    g = 1
    iterx = 0
    iter1 = 0
    iter2 = 0
    iter3 = 0
    while (x % 2 == 0) and (y % 2 == 0):  # yeah, funny cycle when one number is prime :-)
        iter1 = iter1 + 1
        x = x // 2
        y = y // 2
        g = 2 * g
    u = x
    v = y
    A = 1
    B = 0
    C = 0
    D = 1
    while u != 0:
        iterx = iterx + 1
        #step4
        while u % 2 == 0:
            iter2 = iter2 + 1
            u = u // 2
            if (A % 2 == 0) and (B % 2 == 0):
                A = A // 2
                B = B // 2
            else:
                A = (A + y) // 2
                B = (B - x) // 2
        while v % 2 == 0:
            iter3 = iter3 + 1
            v = v // 2
            if (C % 2 == 0) and (D % 2 == 0):
                C = C // 2
                D = D // 2
            else:
                C = (C + y) // 2
                D = (D - x) // 2
        if u >= v:
            u = u - v
            A = A - C
            B = B - D
        else:
            v = v - u
            C = C - A
            D = D - B
    #a = C
    #b = D
    pp = 2 ** 256 - 2 ** 32 - 977
    b = D % pp  # BLM-rig
    #return (a, b, g*v)
    return (b, iterx, iter1, iter2, iter3, iter2 + iter3)

p = 2**256 - 2**32 - 977

#a = 12
#res = ext_binary_gcd(p, a)
#print("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print("in (out, iterbig, iter1, iter2, iter3): ", a, res)

print("test# in (out, iterx, iter1, iter2, iter3, itersum): ")
for i in range(1000):
    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p
    if ii < 2:
        ii = ii + 2
    res = ext_binary_gcd(p, ii)
    print(i, ii, res)
#end
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 08, 2020, 04:14:09 AM Last edit: December 08, 2020, 04:24:34 AM by hakabit |
|
Here we go! One more test code: no over-256-bits bug, and it also returns the result symmetric by module. The condition-checking logic is a little fuzzy; I'd call it quasi-recursive. It's binary in its nature, just realized in a kind of strange way.

Part of the results:

test# calls# in [out, mod symm out]
0 364 75316636100529471829340583998430204208230490280793297630878645139432980850156 [-21481456315583653538655726848489351884800251925024345069131289520822185938657, 33025674478099610959444009195219596039238050737410647384245105885135854051132]
1 385 39888426508246177183753123701414070167668113673592482698911298409940354217536 [-29960760575314177163053332470999754258247890690156906035102057524026276070769, 86973073792154943490949046208733472044349727771710623997020383424312036814343]
2 360 14620752759159978489146343667071343355650196237127974340021988638599091328303 [12174829735737702474779493260773244294171916983485771075179095797017886235155, -96421093662667652633700306227539987869225167103772496913502848083575395070788]
3 383 102729399071985721675035906741175076856965109198477353566672326776001633036975 [70580439539010164434200376999720640409016804834521829709981617744376762996302, -79555186999421925612988503851178736579078677622324526318628793408380057609471]
4 370 46339676628294076970974548809934900743231066044015694600462844368554430445143 [2539235054366037579706180091206398081627721978267432321188130108233551671292, -6344958648894604413478372293328991555118887139277093896810498521216296850165]
5 368 26660698059602402176427980216008131458879504029132938908230058389025515567933 [16540677529517895161096648910563987919480254183872798943295671897534551358308, -71839064538438794363947864942753805375404185837480918551631678838489697347191]
6 354 594851748071041868996674850995031382743985475622021247651263573480083167890 [-64638328581828718437634047594178337042202536377421456333765449858808997403, 12582306659716172481129154954429736186019811569145338237994747628567568503971]
7 359 80681079014327548209785902873606497791641955222184391014151038138305717861451 [44686311015859290531071619784617096434102136509364594238047222505744964738512, -64133020728639186296951506401894836855168854511276042070060900652427696988205]
8 372 82669520055979716581712547816646165489596227017635063706256738823534854476311 [45051940896140100306048968032786700593661378537963560664532177163878090793890, -63102560133743198571360833063775568154007385207992635050991710591879861205579]
9 360 55943743540139384302837049042449505019549115662377158567347079861108763368228 [-21482989564914934180757589310789560187172090496876602635626422561634142769637, 44465387680038776841839359752521363942810960966395512063353292864302945775019]
10 349 66724607627136654255694803441709862788832479174235937080362394192249729343898 [29069806528358354934610693165080779089541395635574791746246411666180327081765, -50446960294663850907746139715414905993368124572672804919153802516270964965753]
|
# # #

As in the previous code, the iterations count is close to 384. But it's not a good iterations indicator. I have an idea how to count the real, true unrolled iterations, but need some time to implement it.

#code by hakabit supraspecially for gmaxwell
#algorithm source: https://dspace.cvut.cz/bitstream/handle/10467/77227/F3-DP-2018-Zhumaniezov-Alisher-priloha-Testing.py
import random

def bin_ext_gcd(a, b):
    global cc
    cc = cc + 1
    if a == 0:
        #print(a, b, 0, 1)
        return [0, 1]
    if b == 0:
        #print(a, b, 1, 0)
        return [1, 0]
    if a == b:
        #print(a, b, 1, 0)
        return [1, 0]
    if (a % 2 == 0) & (b % 2 == 0):
        x, y = bin_ext_gcd(a // 2, b // 2)
        #print(a, b, x, y)
        return [x, y]
    if a % 2 == 0:
        x, y = bin_ext_gcd(a // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y
        else:
            x, y = (x + b) // 2, y - a // 2
        #print(a, b, x, y)
        return [x, y]
    if b % 2 == 0:
        x, y = bin_ext_gcd(a, b // 2)
        if y % 2 == 0:
            x, y = x, y // 2
        else:
            x, y = x - b // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]
    if a > b:
        x, y = bin_ext_gcd((a - b) // 2, b)
        if x % 2 == 0:
            x, y = x // 2, y - x // 2
        else:
            x, y = (x + b) // 2, y - (a - b) // 2 - (x + b) // 2
        #print(a, b, x, y)
        return [x, y]
    if a < b:
        x, y = bin_ext_gcd(a, (b - a) // 2)
        if y % 2 == 0:
            x, y = x - y // 2, y // 2
        else:
            x, y = x - (b - a) // 2 - (y + a) // 2, (y + a) // 2
        #print(a, b, x, y)
        return [x, y]

cc = 0
p = 2**256 - 2**32 - 977

#cc = 0
#a = 12
#res = bin_ext_gcd(p, a)
#print("initial test, should be in=12, out=48246703848881748093154577086953294938862493610683568349773993336628681113193")
#print(res, cc)

print("test# calls# in [out, mod symm out]")
for i in range(100):
    #print(i)
    ii = random.getrandbits(256)
    ii = ii % p
    if ii < 2:
        ii = ii + 2
    cc = 0
    res = bin_ext_gcd(p, ii)
    print(i, cc, ii, res)
#end
|
|
|
|
j2002ba2
|
|
December 08, 2020, 08:50:08 AM |
|
Multiplication is cheaper since p is fixed.
1-bit full-add: 2*XOR + 2*AND + 1*OR
1-bit half-add: 1*XOR + 1*AND
1-bit full-add-zero: 1*XOR + 1*AND
1-bit full-add-one: 1*XNOR + 1*OR
1-bit full-subtract: 2*XOR + 2*AND + 1*OR + 2*NOT
1-bit half-subtract: 1*XOR + 1*AND + 1*NOT
1-bit half-add-one: 1*NOT + wire

256-bit-add: 255*full-add + 1*half-add = 511*XOR + 511*AND + 1*OR
256-bit sub: 255*full-subtract + 1*half-subtract = 511*XOR + 511*AND + 255*OR + 511*NOT
256-bit-add-p: 249*full-add-one + 1*half-add-one + 6*full-add-zero = 249*XNOR + 249*OR + 1*NOT + 6*XOR + 6*AND
256-bit-add-minus-p: 249*full-add-zero + 1*half-add-one + 6*full-add-one = 249*XOR + 249*AND + 1*NOT + 6*XNOR + 6*OR
256-bit mux: 768*NAND + 1*NOT

add-mod-p: 256-bit-add + 256-bit-add-minus-p + 256-bit-mux = 6*XNOR + 760*XOR + 760*AND + 7*OR + 768*NAND + 2*NOT
sub-mod-p: 256-bit-sub + 256-bit-add-p + 256-bit-mux = 249*XNOR + 517*XOR + 517*AND + 504*OR + 768*NAND + 513*NOT
mul-mod-p: 256*add-mod-p = 0x600*XNOR + 0x2f800*XOR + 0x2f800*AND + 0x700*OR + 0x30000*NAND + 0x200*NOT
mul-mod-p is 589,568 gates; this makes it ~27 times cheaper than inversion
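Behaviourally, that multiplier is 256 cascaded conditional-add-mod-p stages. A high-level sketch of the same shift-and-add structure (mine, msb-first; not the gate-level design above):

```python
P = 2**256 - 2**32 - 977  # the fixed secp256k1 field prime

def mul_mod_p(a, b):
    # assumes 0 <= a < P; b is any 256-bit integer
    acc = 0
    for i in range(255, -1, -1):   # one stage per bit of b, msb first
        acc += acc                 # shift the accumulator left by one
        if acc >= P:
            acc -= P               # conditional subtract = add-minus-p + mux
        if (b >> i) & 1:
            acc += a               # conditional add of the multiplicand
            if acc >= P:
                acc -= P
    return acc
```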
|
|
|
|
j2002ba2
|
|
December 08, 2020, 10:30:22 AM Merited by NotATether (5) |
|
Point addition
In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769

Affine coordinates:
bit PA 256-bit-muxes
1: 255 2*256 4,421,791,553
2: 127 2*4*128 2,202,816,961
4: 63 2*16*64 1,093,920,257
8: 31 2*256*32 550,102,561
16: 15 2*65536*16 1,872,792,113
9: 28 2*(512*28 + 16) 507,560,196
10: 25 2*(1024*25 + 64) 472,941,607
11: 23 2*(2048*23 + 8) 471,251,001
12: 21 2*(4096*21 + 16) 496,432,331

Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit PA 256-bit-muxes
1: 255 2*256 + 256 1,676,343,279
2: 127 2*4*128 + 128 844,170,607
4: 63 2*16*64 + 64 428,674,863
8: 31 2*256*32 + 32 231,557,647
16: 15 2*65536*16 + 16 1,727,597,439
9: 28 2*(512*28 + 16) + 29 221,518,452
10: 25 2*(1024*25 + 64) + 26 219,403,033
11: 23 2*(2048*23 + 8) + 24 239,381,207
Using mixed Jacobian-affine coordinates, adding in groups of 10 bits (25 1024-point memories plus 64-point memory), we get 219,403,033 gates.
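Taking the per-block totals above as given, the 10-bit rows can be reproduced arithmetically (my check, using the post's own numbers):

```python
MUX = 769                      # gates per 256-bit multiplexer
AFFINE_PA = 17_338_815         # affine point addition (incl. one inversion)
MIXED_PA = 6_503_656           # mixed Jacobian+affine point addition
FINAL = 16_141_271 + 2 * 589_568   # final inversion plus 2 mul-mod-p

muxes = 2 * (1024 * 25 + 64)   # 25 ten-bit tables plus one 64-point table
affine_total = 25 * AFFINE_PA + muxes * MUX
jacobian_total = 25 * MIXED_PA + FINAL + (muxes + 26) * MUX
print(affine_total)    # 472941607
print(jacobian_total)  # 219403033
```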
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 08, 2020, 01:05:13 PM Last edit: December 08, 2020, 01:25:52 PM by hakabit |
|
Point addition
Jacobian coordinates (including final inversion and 2 mul-mod-p):
bit PA 256-bit-muxes
1: 255 2*256 + 256 1,676,343,279
10: 25 2*(1024*25 + 64) + 26 219,403,033
Using mixed Jacobian-affine coordinates, adding in groups of 10 bits (25 1024-point memories plus 64-point memory), we get 219,403,033 gates.
Thanks for the fresh info! I still can't understand how you optimize the gate count by combining bits. In your initial post you used this technique for additions; now I see you are using it for any operation. Please explain it to me in very simple steps.

For example, we have two 32-bit numbers to add:
00111111111111111111111111111111
00000000000000000000000000000001
And the same numbers split into 10-bit limbs:
1111111111 1111111111 1111111111
0000000000 0000000000 0000000001

Let's say we are using a classical cascaded full adder (ripple-carry adder). How can the number of gates be lowered by this splitting?! Or... do you mean using not 2-input but 10-input gates, and adding not two but ten numbers in parallel? But that is against my initial limitations. Or is this technique somehow similar to the Karatsuba algorithm?!
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 08, 2020, 01:35:55 PM |
|
In affine coordinates: 1*inversion + 2*mul-mod-p + 6*sub-mod-p = 16,141,271 + 2*589,568 + 6*3,068 = 17,338,815
Jacobian plus affine: 11*mul-mod-p + 6*sub-mod-p = 11*589,568 + 6*3,068 = 6,503,656
256-bit-multiplex: 769
Some more questions.
1) What is your formula for affine point addition with only two multiplications? The most optimized version I found has multiplication + multiplication + squaring = 3 multiplications.
2) What is your formula for Jacobian point addition with only 11 multiplications? The most optimized version I found has 17 multiplications. Of course, there are Jacobian formulas with a smaller multiplication count, but they are limited by Z1=1, so they can only be used to add pre-computed or re-converted affine<->Jacobian points.

Also I would like to note that my test runs for affine show about 384 iterations, not about 760.
|
|
|
|
j2002ba2
|
|
December 08, 2020, 03:35:55 PM |
|
I still can't understand how you optimize the gate count by combining bits.
The most naive point addition uses a memory of 1 point (G). Then we double the generator and add points appropriately. However, all doubled points are constant, hence we could precompute all 255 doublings. The result is a 256-point memory and no doubling. Since point addition is more expensive than implementing memory through multiplexers, we could go further: split the private key into groups of 2 bits and precompute the four combinations for each group (resulting in 4*128 points of memory), feeding the appropriate points to 127 point adders. I.e. at the lsb position the bits select
00 - 0*G
01 - 1*G
10 - 2*G
11 - 3*G
and the next group would select
00 - 0*G
01 - 4*G
10 - 8*G
11 - 12*G
and so on. If we use groups of 4 bits, that is 2^4 points of memory per group. It turns out that with affine points it is best to use groups of 11 bits, plus 3 bits for the last group. For Jacobian+affine point additions it is best to use groups of 10 bits, plus 6 bits for the last group.

Some more questions. 1) What is your formula for affine point addition with only two multiplications? The most optimized version I found has multiplication + multiplication + squaring = 3 multiplications. 2) What is your formula for Jacobian point addition with only 11 multiplications? The most optimized version I found has 17 multiplications.
Also I would like to note that my test runs for affine show about 384 iterations, not about 760.
1. It is quite possible that I have a mistake here, missing one multiplication. Anyway, Jacobian+affine performs better.
2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squarings, and 6 subtractions.

For binary Euclidean GCD I get 760 iterations when running with input 0x8000...00 and p.
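The grouping idea can be demonstrated end to end with a small behavioural model (mine, in plain Python; here the window tables are computed on the fly, whereas the circuit would hold them as precomputed constants):

```python
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def ec_add(p1, p2):
    # affine secp256k1 point addition; None is the point at infinity
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = 3 * x1 * x1 * pow(2 * y1, P - 2, P) % P
    else:
        lam = (y2 - y1) * pow(x2 - x1, P - 2, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def scalar_mult(k):
    # reference double-and-add
    acc, pt = None, G
    while k:
        if k & 1:
            acc = ec_add(acc, pt)
        pt = ec_add(pt, pt)
        k >>= 1
    return acc

W = 10  # window width, as discussed in the post

def windowed_mult(k):
    # split k into 10-bit groups; each group selects one precomputed
    # multiple d * 2^(10*j) * G, leaving only ~25 point additions
    acc, adds = None, 0
    for j in range((256 + W - 1) // W):
        d = (k >> (W * j)) & ((1 << W) - 1)
        if d:
            acc = ec_add(acc, scalar_mult(d << (W * j)))
            adds += 1
    return acc, adds
```

With 256 bits and 10-bit windows there are 25 full groups plus one 6-bit group, matching the "25 1024-point memories plus a 64-point memory" above.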
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 08, 2020, 06:24:00 PM Last edit: December 08, 2020, 08:58:28 PM by hakabit |
|
I still can't understand how you optimize the gate count by combining bits.
The most naive point addition uses a memory of 1 point (G). Then we double the generator and add points appropriately. However, all doubled points are constant, hence we could precompute all 255 doublings. The result is a 256-point memory and no doubling. Since point addition is more expensive than implementing memory through multiplexers, we could go further: split the private key into groups of 2 bits, precompute the four combinations for each group (resulting in 4*128 points of memory), and feed the appropriate points to 127 point adders.
1. It is quite possible that I have a mistake here, missing one multiplication. Anyway, Jacobian+affine performs better. 2. "Guide to Elliptic Curve Cryptography", Algorithm 3.22. It has 8 multiplications, 3 squarings, and 6 subtractions. For binary Euclidean GCD I get 760 iterations when running with input 0x8000...00 and p.

Big thanks for the detailed answer! Yeah, I understand your idea. When I was constructing my trillionized scheme I considered such grouping. I got:
group by 1 = 256 pre-computations
group by 2 = about 32768 pre-computations
group by 3 = about 2.7 mln pre-computations
After seeing this aggressively exponential growth I decided it was a bad idea. Now I understand that I made a big combinatorial-logic mistake. With the mistake detected (thanks to you), I can now say that such grouping is of course a great idea, and yes, more than likely the optimal grouping is by ~10 bits.

I found "Algorithm 3.22". I see it is a hybrid for affine+Jacobian. But I can't understand one thing. OK, let's pre-compute points grouped up to 11 bits. Then let's take a privkey like
1 11111111111111111 (2+16 leading ones)
{skipped some bits}
1 00000000000000001 (16 leading zeros)
How can the pubkey be calculated using this algorithm without affine<->Jacobian re-conversions?! (From the other side, I can agree that using a privkey with more than 11 equal leading bits is not a good idea, but...)

Now, about iterations. Ouch... my test runs show about 384 iterations, but... for such a multi-conditional multi-loop algorithm it is quite complex to count iterations. My idea is to place special triggers in each loop, log the results to an array, then analyse the array to detect sequential and/or parallel trigger switching. But I am not too good at Python or Ruby, so it will take time to implement... As a result, I suppose there will be either 384 or 2*384 = 768 max iterations (if the max input number = 2^256). Of course, in secp256k1 the modulus is slightly smaller than 2^256, so there can be slightly fewer than 768 iterations... but it should be 512!
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 08, 2020, 08:47:09 PM |
|
For binary Euclidean GCD I get 760 iterations when running with input 0x8000...00 and p.
Please post your code here.
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 09, 2020, 02:21:22 AM |
|
the most optimized version I found has 17 multiplications.
If you look at the links in my prior posts, I linked to a GEJ+GE formula that uses 8 mul + 3 sqr.
|
|
|
|
|