hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 03, 2020, 01:10:24 AM |
|
Greetings, bitcointalkers! On the bitcointalk forum I found the thread "Theoretical minimum # of logic operations to perform double iterated SHA256?" (https://bitcointalk.org/index.php?topic=1029536.0). I have the same question, but about secp256k1 rather than SHA-256.

So: what is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey in secp256k1? With some additional limitations: only standard 2-input OR, AND, XOR, NOR, NAND, XNOR and 1-input NOT gates are used; only asynchronous logic, and no backward or input-as-output connections, meaning no RS flip-flops or anything similar to memory relays.

I tried to make some estimates myself, but got absurdly large results: about 1 trillion (not billion!) NAND gates. That would mean roughly 4 trillion transistors, while an AMD Ryzen 9 CPU has only about 9 billion transistors.

Thanks in advance for real advice and help. 8-)
|
|
|
|
pooya87
Legendary
Offline
Activity: 3626
Merit: 11031
Crypto Swap Exchange
|
|
December 03, 2020, 04:52:05 AM |
|
But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.
|
|
|
|
NotATether
Legendary
Offline
Activity: 1778
Merit: 7374
Top Crypto Casino
|
|
December 03, 2020, 09:58:54 AM |
|
But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.
We could make a rough estimate if we look at an ECDSA implementation that uses 256-bit wide keys (like the secp256k1 curve used in bitcoin). As the key size changes, the size of the base point and order also changes, which I think will increase the number of logic gates correspondingly.
So according to https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm#Signature_generation_algorithm, we need to find the number of logic gates needed to:
- make a SHA-256 hash
- do a right shift of N bits (to get the leftmost Ln bits)
- generate a random number
- do a curve multiplication
- do a modulus
- do a regular multiplication, an addition, the modular inverse of said random number, and then a second modulus.
All in 256 bits. So I think a separate circuit is needed for each of the 256-bit addition/multiplication/modular-inverse/modulus operations, and the number of logic gates in each can be found independently. The number of logic gates for the right shift should also be simple to get, and OP already has a rough idea for SHA-256. Curve multiplication still needs to be accounted for, as does random number generation.
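The scalar arithmetic among the signing steps above fits in a few lines. This is only a toy sketch with made-up values for z, d, k, and r (a real r would come out of the curve multiplication), using Python's built-in modular inverse and the secp256k1 group order n:

```python
# Hedged sketch of the ECDSA signing scalar arithmetic described above.
# z, d, k, r are made-up example values, not real key material.
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

z = 0x1111  # pretend "leftmost bits of SHA-256(message)"
d = 0x2222  # pretend private key
k = 0x3333  # pretend random nonce
r = 0x4444  # pretend x-coordinate of k*G reduced mod n

k_inv = pow(k, -1, n)          # modular inverse (Python 3.8+)
s = (k_inv * (z + r * d)) % n  # the final signature scalar

# Sanity check: multiplying back by k must recover z + r*d mod n.
assert (k * s) % n == (z + r * d) % n
print(hex(s))
```

Every step except the curve multiplication and the hash is plain 256-bit modular arithmetic like this, which is why the gate counts of add/sub/mul/inverse mod n dominate the question.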
|
|
|
|
j2002ba2
|
|
December 03, 2020, 12:19:22 PM |
|
But unlike SHA256 (or hashes in general) there aren't any bitwise operations in elliptic curve cryptography. It is all pure math that also contains multiplication and division that can't be converted to logical operations either.
This is a ridiculous statement. All math on a computer is done via logic elements. Always has been.

So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?
With some additional limitations: only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used. only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.
From your gates construct add-mod-p, sub-mod-p, mul-mod-p, and the binary extended gcd algorithm. Mix them appropriately.

add-mod-p would be a 256+256-to-257-bit adder, a 257-bit adder, and a 256-bit multiplexer.
sub-mod-p would be a 256-bit NOT, a 256+256+1-to-257-bit adder, a 257-bit adder, and a 256-bit multiplexer.
mul-mod-p would use add-mod-p + a multiplexer (+ sub-mod-p for doubling) up to 256 times, depending on the number we multiply by.
The binary extended gcd can be found in chapter 14 of the Handbook of Applied Cryptography, available online at http://cacr.uwaterloo.ca/hac/

Inversion is done only once via b-e-gcd: since v=y=1, skip steps 2 and 5.

Since G is fixed, you should precompute 2^0*G through 2^255*G, and select them via multiplexer. If it is less costly (fewer gates), you could group powers of G by 2, 3, or more bits, using more multiplexers.

A minimal example (there might be mistakes):

256-bit adder: 255*(2*XOR + 2*AND + 1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT
add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex = 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex = 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT

Looks like inversion is less costly than mul-mod-p. (There might be a more efficient mul-mod-p method, though.) We'll use affine coordinates then.

In affine coordinates a point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p = 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT
256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT
some missing logic for the leading zero bits: less than 4K
missing inversion logic: less than 1M
total: 0x5f421900 = 1,598,167,296 gates.

Combining several bits:
2: 128 PA + 4*128 mux = 799,378,944
4: 64 PA + 16*64 mux = 400,280,064
7: 37 PA + 128*37 mux = 234,598,648
8: 32 PA + 256*32 mux = 206,045,952
9: 29 PA + 512*29 mux = 192,438,200
10: 26 PA + 1024*26 mux = 182,767,728
11: 24 PA + 2048*24 mux = 187,607,616
16: 16 PA + 65536*16 mux = 906,228,096

Looks like the best combination is to do additions in groups of 10 bits: 182,767,728 gates (plus maybe a million more).
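The windowing table above can be recomputed mechanically. This only replays the post's own per-point-addition (PA) and per-multiplexer (MUX) gate totals, so it inherits all of those assumptions; it just confirms the arithmetic and that 10-bit windows are the minimum:

```python
# Hedged recomputation of the gate counts above (arithmetic check only,
# not an authoritative circuit). PA/MUX totals are summed from the
# XOR/AND/OR/NOT figures given in the post.
from math import ceil

PA = 0x181906 + 0x280706 + 0x161606 + 0x90906  # gates per affine point addition
MUX = 512 + 256 + 1                            # gates per 256-bit multiplexer

def windowed_total(w):
    # w bits per group: ceil(256/w) point additions, 2^w muxes per addition
    additions = ceil(256 / w)
    return additions * PA + (2 ** w) * additions * MUX

# Bit-at-a-time baseline from the post: 256 point-additions + 256 muxes.
baseline = 256 * (PA + MUX)
assert baseline == 1_598_167_296

for w, stated in [(2, 799_378_944), (4, 400_280_064), (7, 234_598_648),
                  (8, 206_045_952), (9, 192_438_200), (10, 182_767_728),
                  (11, 187_607_616), (16, 906_228_096)]:
    assert windowed_total(w) == stated

best = min(range(1, 17), key=windowed_total)
print(best, windowed_total(best))  # 10-bit windows give the minimum
```

The tradeoff is visible in the formula: wider windows divide the number of point additions but multiply the selection muxes by 2^w, and the two effects balance around w = 10.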
|
|
|
|
NotATether
Legendary
Offline
Activity: 1778
Merit: 7374
Top Crypto Casino
|
|
December 03, 2020, 07:26:22 PM |
|
From your gates construct add-mod-p, sub-mod-p, mul-mod-p, binary-extended-gcd-algorithm. Mix them appropriately. add-mod-p would be 256+256-to-257 bit adder, 257-bit adder, 256-bit multiplexer. sub-mod-p would be 256 bit not, 256+256+1-to-257 bit adder, 257-bit adder, 256-bit multiplexer. mul-mod-p would use add-mod-p + multiplexor (+ sub-mod-p for doubling) up to 256 times, depending on the number we multiply to. binary-extended-gcd can be found in chapter 14 of Handbook of Applied Cryptography, available online at http://cacr.uwaterloo.ca/hac/ Inversion is done only once via b-e-gcd: since v=y=1 skip steps 2 and 5. Since G is fixed, you should precompute 2^0*G through 2^255*G, and use them via multiplexer. If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers.

This all counts the gates for just the operations after the point multiplication step, right? For the point multiplication there's an algorithm here (https://eprint.iacr.org/2014/427.pdf, section 3, Algorithm 1) that uses Montgomery ladders to implement the point multiplication for Koblitz curves. We need about n-1 point-addition blocks, where n is a random integer between 1 and (group order - 1) and likely extremely large. I'm not sure how you'd fit all those gates, unless you construct some kind of loop around the add-mod-p block (is this possible with multiplexers?)
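For the ladder structure itself, the gate count does not grow with the scalar's value, only with its bit length: every bit costs exactly one point addition and one doubling. A hedged software sketch of that structure (plain affine formulas on secp256k1, not the Koblitz-specific Algorithm 1 from the paper):

```python
# Hedged sketch of a Montgomery ladder for secp256k1 scalar
# multiplication: one add + one double per key bit, regardless of the
# scalar's value. Affine formulas with Python big integers; None is
# the point at infinity.
P = 2**256 - 2**32 - 977
G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)

def ec_add(p1, p2):
    if p1 is None: return p2
    if p2 is None: return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                               # Q + (-Q) = infinity
    if p1 == p2:
        lam = (3 * x1 * x1) * pow(2 * y1, -1, P) % P   # doubling slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P) % P      # addition slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def ladder(k, point):
    r0, r1 = None, point       # invariant: r1 - r0 == point
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            r0 = ec_add(r0, r1)
            r1 = ec_add(r1, r1)
        else:
            r1 = ec_add(r0, r1)
            r0 = ec_add(r0, r0)
    return r0

pub = ladder(0x18E14A7B6A307F426A94F8114701E7C8E774E7F9A47E2C2035DB29A206321725, G)
assert (pub[1] ** 2 - pub[0] ** 3 - 7) % P == 0   # result lies on the curve
```

In hardware the 256 ladder iterations would be unrolled into 256 copies of the add/double datapath (or looped, which the thread's no-feedback restriction forbids); the scalar bits only drive the multiplexer selection, so "n is extremely large" costs selection logic, not more adders.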
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 03, 2020, 10:40:49 PM Last edit: December 03, 2020, 10:55:37 PM by hakabit |
|
pooya87, no way! Any math formula can be represented using a combination of AND, OR, and NOT gates. Then, using De Morgan's laws, it can be converted to an AIG (and-inverter graph, only AND and NOT gates) or an OIG (or-inverter graph, only OR and NOT gates), and from there to NAND-only or NOR-only schematics. However, such conversions grow the transistor count, so to minimize transistors it is usually worth using a mix of AND, OR, and NOT gates, and worth adding XOR gates too.
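As a tiny concrete illustration of the conversions described above (a sketch, not part of any real circuit): XOR rebuilt from NAND gates alone, the classic 4-NAND construction, checked against the whole truth table:

```python
# XOR expressed purely in NAND gates (the standard 4-gate construction),
# verified exhaustively against Python's ^ operator.
def nand(a, b):
    return 1 - (a & b)

def xor_from_nand(a, b):
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

for a in (0, 1):
    for b in (0, 1):
        assert xor_from_nand(a, b) == a ^ b
print("XOR == 4 NANDs")
```

This is also why NAND-only conversion inflates the count: one XOR, a single gate in a mixed-gate design, becomes four gates here.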
NotATether, you are speaking about a slightly different thing. You speak about signing messages using ECC; I speak about privkey-to-pubkey calculation using ECC.
j2002ba2, your answer is brilliant! I see you have an excellent knowledge base and practical experience in ECC. Bravissimo! It took me hours to carefully read and analyse your message. I was checking my calculations against yours, and also analysing the binary extended gcd algorithm, and so on...
I will write more in my next messages...
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 03, 2020, 10:54:50 PM |
|
So I had an estimation draft, but was too shy to include it in my first post. Now I include it. Here is my Ruby code to calculate privkey->pubkey in ECC (secp256k1) (see the end of this post).

As we see, the calculation consists of 256 rounds. Each round (with precomputed points) consists of:
256-bit modular addition (SUMM): 0 operations
256-bit modular subtraction (SUBM): 7 operations
256-bit modular multiplication (MULM): 3 operations
256-bit modular exponentiation (POWM): 1 operation

So 1 round = 0 SUMM + 7 SUBM + 3 MULM + 1 POWM. SUBM uses almost the same logic gates as SUMM, so to simplify the estimate let's agree that SUBM ~= SUMM. Using the binary algorithm, MULM can be represented as 511 SUMM, and POWM as 511 MULM. So:

1 round = 7 SUMM + 3*511 SUMM + 511*511 SUMM = 262661 SUMM
256 rounds = 256 * 262661 SUMM = 67241216 SUMM

Some years ago I was investigating electrical circuits. As I remember, a 256-bit modular adder consists of roughly 2000 NOT, 3000 AND, and 7000 XOR gates. However, there are many other ways such an adder can be built from other combinations of gates; I feel (I hope?!) it can be optimized a lot.

1 round = 262661 SUMM * (2000 NOT, 3000 AND, 7000 XOR) = 525322000 NOT, 787983000 AND, 1838627000 XOR
256 rounds = 256 * 262661 SUMM * (2000 NOT, 3000 AND, 7000 XOR) = 134482432000 NOT, 201723648000 AND, 470688512000 XOR

Hm... 134 billion NOT gates, 201 billion AND gates, 470 billion XOR gates... equal to almost 1 trillion NAND gates?!...

***

Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
P  = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F # = 2**256 - 2**32 - 977

def powm(n, e, mod)
  fail ArgumentError, 'negative exponent' if e < 0
  prod = 1
  base = n % mod
  until e.zero?
    prod = (prod * base) % mod if e.odd? # e.odd? can raise an error on some Ruby versions
    e >>= 1
    base = (base * base) % mod
  end
  prod
end

def add(x1, y1, x2, y2)
  px, py = x1, y1
  qx, qy = x2, y2
  if px == qx && py == qy
    lam = (3 * px * px) * powm(2 * py, P - 2, P) # doubling; can be ignored in estimations
  else
    lam = (qy - py) * powm(qx - px, P - 2, P)
  end
  rx = lam * lam - px - qx
  ry = lam * (px - rx) - py
  [rx % P, ry % P]
end

privkey = 0x18E14A7B6A307F426A94F8114701E7C8E774E7F9A47E2C2035DB29A206321725

tx, ty = Gx, Gy
rx, ry = 0, 0
for i in 0..255
  if privkey[i] == 1
    if rx == 0 && ry == 0
      rx, ry = tx, ty
    else
      rx, ry = add(rx, ry, tx, ty)
    end
  end
  tx, ty = add(tx, ty, tx, ty)
end
puts rx, ry
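The round and gate arithmetic above can be replayed mechanically. This takes the post's own assumptions at face value (1 MULM = 511 SUMM, 1 POWM = 511 MULM, SUBM ~ SUMM, and the remembered 2000/3000/7000 gate figures), so it only checks the multiplication, not the assumptions:

```python
# Hedged check of the per-round operation counts and gate totals above,
# entirely under the post's own unit conversions.
SUMM_GATES = {"NOT": 2000, "AND": 3000, "XOR": 7000}  # remembered figures

subm, mulm = 1, 511            # in units of SUMM
powm = 511 * mulm
round_summ = 7 * subm + 3 * mulm + powm
assert round_summ == 262_661

total_summ = 256 * round_summ
assert total_summ == 67_241_216

totals = {g: total_summ * n for g, n in SUMM_GATES.items()}
assert totals == {"NOT": 134_482_432_000,
                  "AND": 201_723_648_000,
                  "XOR": 470_688_512_000}
print(sum(totals.values()))    # ~0.8 trillion gates under these assumptions
```

The numbers check out, which means the trillion-gate scale comes entirely from the POWM term (511*511 SUMM per round), i.e. from using Euler's theorem for the inversion.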
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 04, 2020, 01:40:13 AM |
|
or smth similiar to memory relays used.
Not using memory will brutally hurt your performance.

This means about ~4trillion transistors while AMD Ryzen 9 CPU has only 9billion transistors.

Well, 4 trillion sounds wrong, but a Ryzen 9 CPU takes a great many clock cycles to complete one operation. It wouldn't be shocking to me if an unrolled implementation had a gate count comparable to a big CPU... though also most of that CPU's gates are in SRAM, and you've excluded memory, sooo...
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 04, 2020, 02:24:44 AM Last edit: December 04, 2020, 01:13:47 PM by hakabit |
|
So, What is the theoretical minimum number of logical operations an ASIC needs to perform to compute privkey->pubkey secp256k1 ?
With some additional limitations: only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used. only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used.
A minimal example (there might be mistakes):
256-bit adder: 255*(2*XOR + 2*AND + 1*OR) + 1*XOR + 1*AND = 511*XOR + 511*AND + 256*OR
257-bit adder: 513*XOR + 513*AND + 257*OR
256-bit adder with carry input: 512*XOR + 512*AND + 256*OR
256-bit sub: 512*XOR + 512*AND + 256*OR + 256*NOT
256-bit multiplexer: 512*AND + 256*OR + 1*NOT
add-mod-p: 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT
sub-mod-p: 0x401*XOR + 0x601*AND + 0x301*OR + 0x101*NOT
mul-mod-p: 0x100*add-mod-p + 0x100*sub-mod-p + 0x200*multiplex = 0x80100*XOR + 0xe0100*AND + 0x80200*OR + 0x30200*NOT
inversion: 0x100*add + 0x300*sub + 0x200*multiplex = 0x7ff00*XOR + 0xbff00*AND + 0x60000*OR + 0x30200*NOT
Looks like inversion is less costly than mul-mod-p. We'll use affine coordinates then.
In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p = 0x181906*XOR + 0x280706*AND + 0x161606*OR + 0x90906*NOT
256 point-additions + 256 multiplexers: 0x18190600*XOR + 0x28090600*AND + 0x14170600*OR + 0x9090700*NOT
total: 0x5f421900 = 1,598,167,296 gates.

I see we are getting almost identical results for one round, but using slightly different terminology. My version is "So 1round = 0SUMM + 7SUBM + 3MULM + 1POWM"; your version is "In affine coordinates point addition is 1*inversion + 2*mul-mod-p + 6*sub-mod-p".

SUBM = sub-mod-p
MULM = mul-mod-p
My POWM means I use Euler's theorem for the modular inversion; your "inversion" leads to the binary extended gcd (Euclidean) algorithm.

Now, about the differences:

1) For SUMM (add-mod-p) I provided 2000 NOT, 3000 AND, 7000 XOR logic gates. You provide 0x400*XOR + 0x600*AND + 0x301*OR + 1*NOT (in hex), or 1024*XOR + 1536*AND + 769*OR + 1*NOT (in decimal). Maybe my memory is at fault and it was 200 NOT, 300 AND, 700 XOR gates. In any case, I need to do fresh research to count how many gates a 256-bit modular full adder really needs.

2) As I understand it, I forgot to include the multiplexers that switch between the precomputed points.

3) In "14.61 Algorithm Binary extended gcd algorithm", step 6, I see "If u >= v then". This means we need a 256-bit comparator, which as I understand you forgot to include. By the way, a 256-bit comparator can consume even more logic gates than a 256-bit full adder (I am researching this at the moment). That is because an adder uses only 1 bit (1 wire) for carry-in and carry-out, while a comparator uses either an aggregated 2-bit (2-wire) or a segregated 3-bit (3-wire) quasi-carry for carry-in and carry-out.

4) In "14.61 Algorithm Binary extended gcd algorithm", step 7, I see "otherwise, go to step 4". This means the algorithm loops. How many iterations can there be at most? I suppose a maximum of 256. So we need a dummy cycle of 256 iterations, each containing all(!) the calculations of algorithm 14.61 except steps 2 and 5 (and also excluding steps 1 and 3, oops). I feel that after including 256 iterations, the Euclidean algorithm will consume almost the same number of logic gates as the approach via Euler's theorem.

It's an interesting note that v=y=1. Yes, v=1, actually. That y=1 is not obvious to me at first sight, but I feel you are correct.

And here are some more comments:

1) "14.61 Algorithm Binary extended gcd algorithm" takes unsigned integers as input, but all further operations are on signed integers. Operations on signed integers consume 2-3 times more logic gates than operations on unsigned integers.

2) In step 4.2 I see "If A ≡ B ≡ 0 (mod 2) then". If A>0 and B>0 this is a very light operation, equal to "if A and B are both even": we just check that the last bit is 0. But there can be situations when A<0 and/or B<0. Is it enough to check the last bit for negative integers too? At this moment I am not sure; I need some time to think.

3) You say: "If it is less costly (less gates), you could group powers of G by 2, 3, or more bits, using more multiplexers." Do you mean that the privkey does not always look like
11 011111 01111 011 {skipped some bits} 11111 011111 0111
but can also look like
1 001111 00111 0011 {skipped some bits} 11111 00111 00111 (a reason for grouping powers of G by 2)
or like
1 00011 00011 00011 {skipped some bits} 1111 00011 000111 (a reason for grouping powers of G by 3)?
If I understand your idea correctly, my opinion is that such grouping has many pros and cons. I don't know if it is less costly; I think many test iterations would be needed to weigh the pros against the cons.

4) I don't understand your idea about "Combining several bits" and the table row "10: 26 PA + 1024*26 mux = 182,767,728". Do you mean a stand-alone 10-bit full adder, with 10+10 bits of input and 10 bits of output, to which we sequentially pass A and B and get back C = A+B? If yes, that is against the limitation "only asynchronous logic and no backward or input-as-output connections, this means no RS-triggers or smth similiar to memory relays used": such a trick lowers the number of gates but not the number of logic operations, and it makes estimating the operations much harder. Or do you mean logic gates with 10 inputs instead of 2? Then it is against the limitation "only standard 2input OR, AND, XOR, NOR, NAND, XNOR and 1input NOT gates used". Or do you mean something different?

Waiting for your answers and comments to proceed to the final estimates.
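On comment 2 above (whether the lowest bit still signals evenness for negative integers): in two's-complement representation it does, and Python's bitwise `&` on negative ints behaves as an infinite two's-complement encoding, so this can be checked directly:

```python
# In two's complement, bit 0 decides parity regardless of sign.
# Python's & on negative ints models two's complement, so we can
# verify the claim over a range of signed values.
for n in range(-1000, 1000):
    assert (n & 1 == 0) == (n % 2 == 0)
print("bit 0 decides parity for signed values too")
```

So the evenness test in step 4.2 stays a single-wire check even when A or B goes negative; only the subtractions need full signed-width datapaths.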
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 04, 2020, 03:19:10 AM |
|
or smth similiar to memory relays used.
Not using memory will brutally hurt your performance. ... It wouldn't be shocking to me if an unrolled implementation had a gate count comparable to a big cpu.

But my question is theoretical, so performance is not important. You see, previously someone had an interest in SHA-256; now I have an interest in secp256k1. I think for many bitcointalkers it's very interesting what on earth the number of logical operations needed to calculate secp256k1 is.

Yeah, 4 trillion transistors sounds absurdly large to me too, which is why I am looking for help on the forum. Now I am almost sure I made a mistake saying that a 256-bit modular adder consists of 2000 NOT, 3000 AND, 7000 XOR logic gates; it's more correct to say 200 NOT, 300 AND, 700 XOR. So in fact we would have not 4 trillion transistors, but 400 billion. That is still an unacceptably large amount.

j2002ba2 arrived at a total of 0x5f421900 = 1,598,167,296 gates, but I feel he just skipped the fact that the Euclidean algorithm is iterative. His final result could be even larger than mine.
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 04, 2020, 01:39:14 PM |
|
I am sorry; after some thinking, I believe I was mistaken in saying that j2002ba2 ignored the iteration count.
He wrote "inversion: 0x100*add + 0x300*sub + 0x200*multiplex" (that's in hex); in decimal it's "inversion: 256*add + 768*sub + 512*multiplex".
So it coincides with 256 iterations.
...and I am still trying to understand how he optimized from 1,598,167,296 gates down to 182,767,728 gates...
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 04, 2020, 02:12:06 PM Last edit: December 04, 2020, 02:54:16 PM by hakabit |
|
Today I was trying to find more effective ways to do ECC point additions. I paid attention to the Jacobian coordinate system. I found some Python code on the internet and mixed it into my test code (see the end of this post). On my PC with Python 2 it gives the answer
(36422191471907241029883925342251831624200921388586025344128047678873736520530L, 20277110887056303803699431755396003735040374760118964734768299847012543114150L)
and that is the correct answer.

We need to pay attention that it still needs a modular inversion, but that is done only for the last point addition of all 256 in secp256k1. However, j2002ba2 provided such a great algorithm that there's almost no advantage in using Jacobian coordinates. Here is the list of operations:

U1 = (Xp * Zq ** 2) % P
U2 = (Xq * Zp ** 2) % P
S1 = (Yp * Zq ** 3) % P
S2 = (Yq * Zp ** 3) % P
H = U2 - U1
R = S2 - S1
H2 = (H * H) % P
H3 = (H * H2) % P
U1H2 = (U1 * H2) % P
nx = (R ** 2 - H3 - 2 * U1H2) % P
ny = (R * (U1H2 - nx) - S1 * H3) % P
nz = (H * Zp * Zq) % P

In fact, 17 modular multiplications and 6 modular subtractions per round.

def inv(a, m):
    if a < 0 or m <= a:
        a = a % m
    c, d = a, m
    uc, vc, ud, vd = 1, 0, 0, 1
    while c != 0:
        q, c, d = divmod(d, c) + (c,)
        uc, vc, ud, vd = ud - q*uc, vd - q*vc, uc, vc
    if ud > 0:
        return ud
    else:
        return ud + m

def to_jacobian((Xp, Yp)):
    """
    Convert point to Jacobian coordinates

    :param (Xp,Yp): Point you want to convert
    :return: Point in Jacobian coordinates
    """
    return (Xp, Yp, 1)

def from_jacobian((Xp, Yp, Zp), P):
    """
    Convert point back from Jacobian coordinates

    :param (Xp,Yp,Zp): Point you want to convert
    :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    :return: Point in default coordinates
    """
    z = inv(Zp, P)
    return ((Xp * z**2) % P, (Yp * z**3) % P)

def jacobian_add((Xp, Yp, Zp), (Xq, Yq, Zq), A, P):
    """
    Add two points in elliptic curves

    :param (Xp,Yp,Zp): First Point you want to add
    :param (Xq,Yq,Zq): Second Point you want to add
    :param P: Prime number in the module of the equation Y^2 = X^3 + A*X + B (mod p)
    :param A: Coefficient of the first-order term of the equation Y^2 = X^3 + A*X + B (mod p)
    :return: Point that represents the sum of First and Second Point
    """
    if not Yp:
        return (Xq, Yq, Zq)
    if not Yq:
        return (Xp, Yp, Zp)
    U1 = (Xp * Zq ** 2) % P
    U2 = (Xq * Zp ** 2) % P
    S1 = (Yp * Zq ** 3) % P
    S2 = (Yq * Zp ** 3) % P
    if U1 == U2:
        if S1 != S2:
            return (0, 0, 1)
        return jacobian_double((Xp, Yp, Zp), A, P)
    H = U2 - U1
    R = S2 - S1
    H2 = (H * H) % P
    H3 = (H * H2) % P
    U1H2 = (U1 * H2) % P
    nx = (R ** 2 - H3 - 2 * U1H2) % P
    ny = (R * (U1H2 - nx) - S1 * H3) % P
    nz = (H * Zp * Zq) % P
    return (nx, ny, nz)

def fast_add(a, b, A, P):
    return from_jacobian(jacobian_add(to_jacobian(a), to_jacobian(b), A, P), P)

P = 2**256 - 2**32 - 977
A = 0
a = (39006303077722201472019215118849942292522572644014512222183524570138876820125,
     67456145645462716033455438706152384917972612258004579542185540136844047420439)
b = (4032983015753143990395647783770666587927265353624430905763286836981504199392,
     44353125519324157186344456159742269880631179110473143840214086765587351124293)
g2 = fast_add(a, b, A, P)
print g2
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 04, 2020, 04:35:04 PM |
|
However, j2002ba2 provided such great algorithm that there's almost no advantages in using Jacobian coordinates.
I am puzzled by the fact that you wrote this but then pasted code that uses the Jacobian addition law as your example. What is your point?

Aside: the extended gcd is not guaranteed to complete in 256 steps (in fact, I expect it seldom completes in 256 steps) -- it can take many more. The constant-time LSB euclidean we have for libsecp256k1 is only guaranteed to complete in 735 steps, and although non-LSB versions have somewhat smaller bounds, they're still a lot larger than 256. It's a little annoying to prove an upper bound on the operations of these GCD algorithms.

I'm surprised to hear the claim that when working with basic logic the inversion could be cheaper than a field multiply. In fact, I don't believe it: I think this is an artifact of radically over-costing the modular multiply and underestimating the number of iterations needed.

Even if inversion is slow, depending on what you're doing it still makes sense to use affine logic, because inversions can be batched (though batching only makes sense if inversion is at least three times the cost of a modular multiplication).
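The batching remark refers to a standard trick (Montgomery's batch inversion): many field elements are inverted with a single modular inversion plus about three multiplications per element. A hedged sketch:

```python
# Montgomery batch inversion: invert n field elements with one real
# inversion plus ~3n multiplications. P is the secp256k1 field prime.
P = 2**256 - 2**32 - 977

def batch_inverse(values, p=P):
    # prefix[i] = values[0] * ... * values[i] mod p
    prefix, acc = [], 1
    for v in values:
        acc = acc * v % p
        prefix.append(acc)
    inv_acc = pow(acc, -1, p)          # the only real inversion
    out = [0] * len(values)
    for i in range(len(values) - 1, 0, -1):
        out[i] = inv_acc * prefix[i - 1] % p   # peel off one element
        inv_acc = inv_acc * values[i] % p      # drop it from the product
    out[0] = inv_acc
    return out

vals = [3, 5, 0xDEADBEEF, 2**200 + 1]
for v, iv in zip(vals, batch_inverse(vals)):
    assert v * iv % P == 1
```

This is why affine formulas can stay competitive even with an expensive inversion: across many point additions the inversion cost is amortized down to a few multiplications each.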
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 04, 2020, 08:14:00 PM Last edit: December 04, 2020, 08:40:26 PM by hakabit |
|
However, j2002ba2 provided such great algorithm that there's almost no advantages in using Jacobian coordinates.
I am puzzled by the fact that you wrote this but then pasted code that uses Jacobian addition law as your example. What is your point? Aside, the extended gcd is not guaranteed to complete in 256 steps (in fact, I expect it seldom completes in 256 steps)-- it can take much more. The constant time LSB euclidean we have for libsecp256k1 is only guaranteed to complete in 735 steps, and although non-LSB versions have somewhat smaller bounds they're still a lot larger than 256.

gmaxwell, I have only basic knowledge of cryptography, some knowledge of programming and electronic circuits, and no knowledge of higher mathematics. I am doing my research from scratch, so I apologize in advance if I say something really wrong.

My point is that:
0) (My first version, using the Euler's-theorem algorithm, needs ~511 mul per round. No way to discuss.)
1) The Jacobian algorithm uses 17 modular multiplications (mul) and 6 modular subtractions (sub) per round. Preliminarily, 1 mul ~= 511 modular additions (add) and 1 sub ~= 1 add. So in total 17 mul + 6 sub = 17*511 add + 6 add = 8693 add.
2) The binary Euclidean algorithm uses 256*(1 add + 3 sub + 2 mux) per round for the inversion, plus 3 mul + 6 sub for the additional calculations. Preliminarily, 1 mux ~= 1 add. So in total 256*(6 add) + 3*511 add + 6 add = 1536 + 1533 + 6 = 3075 add.

So 3075 add < 8693 add.

Now, you are saying that 256 iterations are not enough and we need 735 (let's round up to 768). OK: (768/256 - 256/256) * 1536 add = 2 * 1536 add = 3072 add need to be added. 3075 add + 3072 add = 6147 add. Still 6147 add < 8693 add.

What about modular multiplication optimization: Jacobian has 17 mul and Euclidean has 3, so any mod-mul optimization helps both algorithms. Of course, in relative terms it helps Jacobian more, but it's not a given that Jacobian wins in absolute numbers. For optimizing classical (non-modular) multiplication I've heard of the Karatsuba algorithm; for modular multiplication the Montgomery algorithm is usually mentioned. At this moment I absolutely do not understand how either of them works. :-)

Also, gmaxwell, I am not sure: do you feel the affine or the Jacobian algorithm should use the smaller number of logical operations?
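The comparison above can be rechecked mechanically. This replays the post's own unit conversions (1 mul-mod-p ~ 511 add, 1 sub ~ 1 add, 1 mux ~ 1 add), so the conclusion inherits all of those assumptions:

```python
# Hedged recheck of the add-count comparison, in the post's own units.
jacobian_adds = 17 * 511 + 6               # 17 mul + 6 sub per round
assert jacobian_adds == 8693

gcd_cycle = 1 + 3 + 2                      # 1 add + 3 sub + 2 mux per gcd cycle
affine_256 = 256 * gcd_cycle + 3 * 511 + 6 # 256 gcd cycles + 3 mul + 6 sub
assert affine_256 == 3075

affine_768 = affine_256 + (768 - 256) * gcd_cycle
assert affine_768 == 6147                  # with 768 gcd cycles, still < 8693
print(jacobian_adds, affine_256, affine_768)
```

Under these conversions the affine round stays cheaper even at 768 gcd cycles; the whole comparison flips only if mul-mod-p is made much cheaper than 511 additions (e.g. via Karatsuba/Montgomery-style multipliers), which is exactly gmaxwell's objection.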
|
|
|
|
j2002ba2
|
|
December 04, 2020, 10:46:35 PM |
|
Most probably I made a mistake in binary extended gcd. It should be slower than mul-mod-p.
I'll post a detailed algorithm -> logic tomorrow.
|
|
|
|
j2002ba2
|
|
December 04, 2020, 10:51:41 PM |
|
From "SPA vulnerabilities of the binary extended Euclidean algorithm":

Binary extended Euclidean algorithm
Inputs: Integers k and p such that gcd(k, p) = 1
Output: k^-1 mod p
 1: u = k
 2: v = p
 3: A = 1
 4: C = 0
 5: while u != 0 do
 6:   while even(u) do
 7:     u = u/2
 8:     if even(A) then
 9:       A = A/2
10:     else
11:       A = (A+p)/2
12:   while even(v) do
13:     v = v/2
14:     if even(C) then
15:       C = C/2
16:     else
17:       C = (C+p)/2
18:   if u >= v then
19:     u = u-v
20:     A = A-C
21:   else
22:     v = v-u
23:     C = C-A
24: return C mod p

After unrolling the cycles, at each iteration we could do any one of:
* u = u/2, A = A/2 or (A+p)/2
* v = v/2, C = C/2 or (C+p)/2
* u = u-v, A = A-C
* v = v-u, C = C-A

So a single loop cycle would be:

1. In parallel:
* check whether u = 0
* check whether u >= v
* u1 = u/2 (drop lsb)
* u2 = u-v (256-bit subtraction)
* v1 = v/2
* v2 = v-u
* A1 = A/2
* A2 = (A+p)/2 (n-bit addition, dropping lsb)
* A3 = A-C (n-bit subtraction)
* C1 = C/2
* C2 = (C+p)/2
* C3 = C-A
For u and v we need 3 muxes each: passthrough (old value), u1, u2. For A and C we need 4 muxes each: passthrough, A1, A2, A3.

2. Select what happens in this cycle, default being passthrough, highest priority first:
* u = 0 : all passthrough
* u is even : u1, and if A is even then A1 else A2
* v is even : v1, and if C is even then C1 else C2
* u >= v : u2, A3
* else : v2, C3
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.
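The unrolled cycle above can be modelled in software to sanity-check it and to probe both open questions at once. This is only a sketch of my own, mirroring the stated priority order (not a gate-level design); it counts cycles and tracks how wide the signed A and C values grow:

```python
# Software model of the flattened loop above: each "cycle" performs
# exactly one action, chosen with the same priority order
# (u even -> halve u; else v even -> halve v; else subtract).
def beea_flat(k, p):
    u, v, A, C = k, p, 1, 0
    cycles, width = 0, 0
    while u != 0:
        cycles += 1
        if u % 2 == 0:
            u //= 2
            A = A // 2 if A % 2 == 0 else (A + p) // 2
        elif v % 2 == 0:
            v //= 2
            C = C // 2 if C % 2 == 0 else (C + p) // 2
        elif u >= v:
            u, A = u - v, A - C
        else:
            v, C = v - u, C - A
        width = max(width, abs(A).bit_length(), abs(C).bit_length())
    return C % p, cycles, width

P = 2**256 - 2**32 - 977
k = 0x18E14A7B6A307F42
inv, cycles, width = beea_flat(k, P)
assert inv * k % P == 1
print(cycles, width)   # cycle count and max signed bit-width of A/C
```

Running this for a sample of inputs gives empirical (not worst-case) answers to "how many iterations" and "how many bits n"; the worst-case bound still needs a proper argument, as the later posts discuss.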
|
|
|
|
NotATether
Legendary
Offline
Activity: 1778
Merit: 7374
Top Crypto Casino
|
|
December 05, 2020, 12:20:38 AM |
|
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.
If p is odd - and it should be, because if it were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite common factor, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops.

So for the entire even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice, because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary - p or p - (1+p)/arbitrary depending on which is larger, keeping in mind that 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p - (1+p)/arbitrary; at any rate, less than 1+p.

So I think we can put a ceiling on the number of bits at ceil(log2(1+p)) bits for C and A [this is for the special case where gcd(k,p) is a single prime > 2]. The maximum-iterations problem still stands, though - maybe there are research papers on the internet documenting this?
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 05, 2020, 02:42:02 AM |
|
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.
If p is odd - and it should be, because if it were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So across all the even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice, because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p-(1+p)/arbitrary; at any rate, less than 1+p. So I think we can put a ceiling on the number of bits at ceil(log2(1+p)) for C and A [this is for the special case where gcd(k,p) is a single prime >2]. The maximum #iterations problem still stands though - maybe there are research papers on the internet documenting this?

On https://math.stackexchange.com/questions/1483118/proving-the-number-of-iterations-in-the-euclidean-algorithm I found a great post:

"In two steps we move from (m, n) to (m′ = n, n′ = m mod n) and then to (m′′ = n′, n′′ = m′ mod n′). Assume n ≤ 2^k. Then m′, n′ ≤ 2^k and n′′ ≤ min{m′, n′, |m′ − n′|}. Hence n′′ > 2^(k−1) could only happen if both |m′ − n′| > 2^(k−1) and min{m′, n′} > 2^(k−1), which would imply max{m′, n′} > 2^(k−1) + 2^(k−1) = 2^k, a contradiction. Hence after two steps n ≤ 2^k becomes n′′ ≤ 2^(k−1). Hence if n ≤ 2^k initially, then after 2k steps we arrive at a situation with n ≤ 2^0 = 1. Since it takes one additional step to reach n = 0 and terminate, and since initially we may have n > m, it seems we may need up to 2k + 2 rounds.

Indeed, the desired bounds are not fully correct, as for example m = 1, n = 42 (so k = 0) requires 2 instead of just 2·0 = 0 steps: (1, 42) ↦ (42, 1) ↦ (1, 0).

Remark: A deeper analysis should exhibit that the behaviour is governed by powers of the golden ratio φ ≈ 1.618 rather than √2 ≈ 1.414."

In secp256k1 we are working with a field where k = 256, so the result should be at most 2·256 + 2 = 514 iterations. But the poster also mentions the golden ratio, and I don't understand: should 514 be multiplied by 1.618 (≈832) or divided by it (≈318)?!
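For what it's worth, that 2k + 2 ceiling can be sanity-checked empirically. Below is a small experiment of my own, counting loop iterations of a textbook binary (Stein) GCD as a stand-in for the cycle being discussed; the exact counts for the extended version may differ, but each step here removes at least one bit from the product a·b, so the count provably stays under 2·256 + 2:

```python
import random

def bgcd_steps(a, b):
    # Count loop iterations of a textbook binary (Stein) GCD variant.
    # Each iteration halves one operand, or replaces the larger of two
    # odd operands by half their (even) difference.
    steps = 0
    while a != 0 and b != 0:
        if a % 2 == 0:
            a //= 2
        elif b % 2 == 0:
            b //= 2
        elif a >= b:
            a = (a - b) // 2
        else:
            b = (b - a) // 2
        steps += 1
    return steps

p = 2**256 - 2**32 - 977  # the secp256k1 field prime
random.seed(1)
worst = max(bgcd_steps(random.randrange(1, p), p) for _ in range(1000))
print(worst)  # never exceeds the 2*256 + 2 = 514 ceiling
```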
|
|
|
|
gmaxwell
Moderator
Legendary
Offline
Activity: 4270
Merit: 8805
|
|
December 05, 2020, 02:52:10 AM |
|
The maximum #iterations problem still stands though - maybe there are research papers on the internet documenting this?
As I mentioned, it's not very simple. https://gcd.cr.yp.to/safegcd-20190413.pdf covers doing this for a somewhat different GCD algorithm (a variant designed to work only on the LSBs of values, which makes it much faster in software). The bounds analysis in the paper is extremely complicated.

Pieter came up with a simpler-to-understand approach that also produces a somewhat tighter bound (for safegcd). The idea comes from the fact that the steps of the algorithm are all affine transformations on the inputs. This means that if, instead of feeding a value into a step of the algorithm, you feed in a convex hull that captures all the possible inputs, the result is a convex hull that contains all the possible outputs of that step (and maybe a bit more). You can 'feed in' a hull by just sending its vertices through, because each step performs an affine transformation. You handle the branches at each iteration by executing all of them and taking the union of the resulting hulls. Keep running until the hull shrinks below 1, which means that every input would have finished by then. It's kind of like running the algorithm on all inputs concurrently. At least for safegcd this actually converges. We know from testing with small inputs that the result is still somewhat conservative rather than exact. AFAIK no one knows how to compute a tight bound for this kind of algorithm, but useful upper limits exist, which is all you need.

You could, of course, just use safegcd... though I think an MSB binary GCD will be more efficient implemented in raw logic. I don't know a good cite for one off the top of my head; there is probably one in the safegcd paper. IIRC, GMP has a constant-time GCD that implements one of the binary MSB algorithms. Depending on the application it might even be acceptable for the result to sometimes be incorrect: e.g. if this were being used for mining, you could pick a bound that is wrong less than 1 in 1,000,000 times based on experimental testing, and that would probably be fine.

Now you are saying that 256 iterations are not enough, and we need 735.

No, I was saying that a different, related algorithm needed 735. I don't know what the bound is on this particular one; I just know it's going to be greater than 256. (If you look at the general structure, you should see that it usually won't shave off more than one bit per iteration; only when the values are extremely close will it shave off more, and it's not even guaranteed to do that. That should give you an intuition as to why 256 is far too low.)

You see, Jacobian has 17 mul.

Well, don't use needlessly inefficient algorithms. One of your inputs is always affine, and the group law can be algebraically simplified for secp256k1: https://github.com/bitcoin-core/secp256k1/blob/master/src/group_impl.h#L389 (For point*scalar you will never get an infinity in the calculation either.)

For optimizing classical (non-modular) multiplications I've heard about the Karatsuba algorithm.

Sure, and Toom-Cook. There is some useful overview in https://cr.yp.to/lineartime/multapps-20080515.pdf. (Squaring is faster than multiplying for pretty much the same reason that Karatsuba is a speedup.) Montgomery multiplication helps when you have consecutive squarings or multiplications, saving you from having to reduce at the expense of doubling the size of your numbers. But you need a conversion to/from Montgomery form before performing addition/subtraction, so I wouldn't expect Montgomery to be a win for point addition. The secp256k1 field order is somewhat more efficient to reduce modulo than an arbitrary number; there is a complicated implementation in libsecp256k1, though it's optimized for processors and doesn't make the right tradeoffs for dumb logic.

As for Jacobian vs. affine: if you're able to operate on multiple points in parallel and share work in the inversion, then affine will be faster for sure. If you cannot, then I remain sceptical that you're not miscounting something.
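Since Karatsuba and Toom-Cook came up: for readers following along, here is a minimal recursive Karatsuba sketch. The cutoff value is an arbitrary tuning choice, and real bignum libraries split on machine limbs rather than arbitrary bit positions; this only illustrates the three-multiplication trick:

```python
def karatsuba(x, y, cutoff=64):
    # Karatsuba: replace one n-bit multiply with three ~n/2-bit
    # multiplies plus shifts and additions.
    if x < (1 << cutoff) or y < (1 << cutoff):
        return x * y  # small enough: fall back to a plain multiply
    h = max(x.bit_length(), y.bit_length()) // 2
    x1, x0 = x >> h, x & ((1 << h) - 1)   # split x = x1*2^h + x0
    y1, y0 = y >> h, y & ((1 << h) - 1)   # split y = y1*2^h + y0
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0  # the third multiply
    return (z2 << (2 * h)) + (z1 << h) + z0
```

In raw logic the same recursion trades one big partial-product array for three smaller ones, which is where the transistor savings would come from.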
|
|
|
|
hakabit (OP)
Jr. Member
Offline
Activity: 42
Merit: 18
|
|
December 05, 2020, 03:05:29 AM |
|
What remains is the maximum number of iterations of the cycle, and how many bits (n) are needed for C and A.
If p is odd - and it should be, because if it were even we could make an optimization that returns 1 or 2 depending on the parity of k [this falls apart if k and p have a composite gcd though, like 6 or 10] - we can take advantage of the fact that even(A) and even(C) will only run once in the inner while loops. So across all the even(u) and even(v) iterations, each addition of A+p and C+p should increase their length by no more than ceil(log2(p)) bits. That means C+p will only be computed twice, because 0+p => odd, 0+p+p => even. A+p will also only be done twice: at 1, and at A-C or C-A, which is (1+p)/arbitrary-p or p-(1+p)/arbitrary depending on which is larger, keeping in mind 1+p is always even. The second A+p addition will therefore be either (1+p)/arbitrary or p-(1+p)/arbitrary; at any rate, less than 1+p. So I think we can put a ceiling on the number of bits at ceil(log2(1+p)) for C and A [this is for the special case where gcd(k,p) is a single prime >2]. The maximum #iterations problem still stands though - maybe there are research papers on the internet documenting this?

But I found one more idea, on https://en.wikipedia.org/wiki/Euclidean_algorithm#Worst-case :

"Worst-case. If the Euclidean algorithm requires N steps for a pair of natural numbers a > b > 0, the smallest values of a and b for which this is true are the Fibonacci numbers F_{N+2} and F_{N+1}, respectively.[95] More precisely, if the Euclidean algorithm requires N steps for the pair a > b, then one has a ≥ F_{N+2} and b ≥ F_{N+1}. This can be shown by induction.[96] If N = 1, b divides a with no remainder; the smallest natural numbers for which this is true are b = 1 and a = 2, which are F_2 and F_3, respectively. Now assume that the result holds for all values of N up to M − 1. The first step of the M-step algorithm is a = q_0·b + r_0, and the Euclidean algorithm requires M − 1 steps for the pair b > r_0. By the induction hypothesis, one has b ≥ F_{M+1} and r_0 ≥ F_M. Therefore, a = q_0·b + r_0 ≥ b + r_0 ≥ F_{M+1} + F_M = F_{M+2}, which is the desired inequality. This proof, published by Gabriel Lamé in 1844, represents the beginning of computational complexity theory,[97] and also the first practical application of the Fibonacci numbers.[95]

This result suffices to show that the number of steps in Euclid's algorithm can never be more than five times the number of its digits (base 10).[98] For if the algorithm requires N steps, then b is greater than or equal to F_{N+1}, which in turn is greater than or equal to φ^(N−1), where φ is the golden ratio. Since b ≥ φ^(N−1), then N − 1 ≤ log_φ(b). Since log_10(φ) > 1/5, (N − 1)/5 < log_10(φ)·log_φ(b) = log_10(b). Thus, N ≤ 5·log_10(b). Thus, the Euclidean algorithm always needs less than O(h) divisions, where h is the number of digits in the smaller number b."

In secp256k1 we are working with the modulus m = 115792089237316195423570985008687907853269984665640564039457584007908834671663, and 2^256 = 115792089237316195423570985008687907853269984665640564039457584007913129639936; in any case those are the largest possible numbers. They both contain 78 digits, so 78·5 = 390 maximum iterations.
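The Fibonacci form of Lamé's theorem can be evaluated directly, which gives a slightly tighter ceiling than the five-times-the-digits shortcut. A small sketch of my own (note it bounds the classical division-step Euclidean algorithm, not the binary variant discussed earlier in the thread):

```python
def lame_bound(b):
    # Lamé's theorem: if classical Euclid takes N division steps on a
    # pair a > b > 0, then b >= F_{N+1}. So the largest N satisfying
    # F_{N+1} <= b bounds the step count.
    fib = [1, 1]                      # F_1, F_2
    while fib[-1] <= b:
        fib.append(fib[-1] + fib[-2])
    # fib[-1] = F_{len(fib)} is the first Fibonacci number > b,
    # hence F_{N+1} <= b forces N + 1 <= len(fib) - 1.
    return len(fib) - 2

p = 2**256 - 2**32 - 977  # the secp256k1 field prime
print(lame_bound(p))      # 369: a bit tighter than the 5*78 = 390 digit bound
```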
|
|
|
|
|