Bitcoin Forum

Bitcoin => Development & Technical Discussion => Topic started by: ecdsa123 on November 28, 2022, 12:48:10 PM



Title: secp256k1 library in pure assembly
Post by: ecdsa123 on November 28, 2022, 12:48:10 PM
Hello All


I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.

anybody knows?


Title: Re: secp256k1 library in pure assembly
Post by: PowerGlove on November 28, 2022, 01:45:03 PM
Off-topic (sorry), but just wanted to say that seeing questions like this on Bitcointalk really cheers me up! :)

I'm probably reading too much into it (nostalgia will sometimes do that to you), but someone asking for an assembly listing is (no joke) the highlight of my week (I really miss the old days).

Anybody else remember downloading TASM/MASM/NASM over dial-up and working through binders of printed out tutorials? ;D


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on November 28, 2022, 01:55:38 PM
I have check and analyse for comparison : sha256 in pure asm (rewrite by myself) and c++.
in pure asm we have almost 120x faster than c++ (in line asm optimised)

so you will be shocked when you will check how many "optimised " are design in pure asm for differents mathematics problems in 2022 yr:)




Title: Re: secp256k1 library in pure assembly
Post by: garlonicon on November 28, 2022, 03:49:41 PM
Quote
I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.
If you are looking for a library, then check headers like "immintrin.h". The keyword is "intrinsics", for example x86 intrinsics list (https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170) that means you can just call a C-like function in your code, and it will be converted to the pure assembly instruction when compiled. It will be easier to write some code in C++ or similar language, and call some functions, than writing everything in assembly, unless you know assembly very well.

Quote
Anybody else remember downloading TASM/MASM/NASM over dial-up and working through binders of printed out tutorials?
I used FASM and downloaded it in modern times, few years ago, when exploring how BIOS is constructed: http://flatassembler.net/
I also used it some time ago when I tried to write my own operating system from scratch: https://wiki.osdev.org/Main_Page

Quote
so you will be shocked when you will check how many "optimised " are design in pure asm for differents mathematics problems in 2022 yr:)
Well, assembly can speed things up if you can use the right opcodes, and if your processor supports it. In other cases, you may end up with code, that is correct, but not supported by your processor. So, the first thing is checking your hardware, and what is available, because some opcodes may trigger an error. Also, it is a high chance that using your CPU is not the best way of solving that, and if your code will be hardware-specific by design, then it may be profitable to prepare your code for some GPU or ASIC (but then you probably would need some custom hardware).

Edit: Also note that typical compilers has some flags that can be used to optimize it for some architecture with a given features. Compilers like GCC can produce those instructions, so check them first, it may be faster than your code in assembly, unless you really know that language well.


Title: Re: secp256k1 library in pure assembly
Post by: PowerGlove on November 28, 2022, 05:26:53 PM
I used FASM and downloaded it in modern times, few years ago, when exploring how BIOS is constructed: http://flatassembler.net/
Yup. FASM is pretty special and Tomasz Grysztar is an exceptional programmer! Writing a self-hosting assembler puts him in a very small group.

I also used it some time ago when I tried to write my own operating system from scratch: https://wiki.osdev.org/Main_Page
Nice one! 0xAA55 (and 0x7C00) is burned in my memory, too. ;D


Title: Re: secp256k1 library in pure assembly
Post by: pooya87 on November 29, 2022, 03:50:38 AM
I have check and analyse for comparison : sha256 in pure asm (rewrite by myself) and c++.
in pure asm we have almost 120x faster than c++ (in line asm optimised)
Have you ever published this code or has anybody else (more specifically a c++ expert) seen the code because 120x speed up does not sound right to me unless the code written in c++ is bad or entirely different (eg. simple implementation of SHA256 vs using intel SHA intrinsics) or your benchmark could be flawed.


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on November 29, 2022, 06:02:18 PM
Hello All


I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.

anybody knows?

I mean if you challenge me to do it, I might actually come up with an optimized secp256k1 ASM for GNU/Linux one day... who knows  :D

I'm already working on a version that uses GMP which itself is heavily optimized.


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on November 30, 2022, 03:06:51 PM
as I see there is no known library for this


Title: Re: secp256k1 library in pure assembly
Post by: albert0bsd on December 01, 2022, 12:56:23 AM
I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.

Write code in ASM is really hard, i have a long time without write anythin in ASM by my self the last code that check in ASM and edit just some lines was the libaesni for some of my old projects.

If there are any other developers interesting in write this code please let me know.


Title: Re: secp256k1 library in pure assembly
Post by: pooya87 on December 01, 2022, 03:59:44 AM
as I see there is no known library for this
Writing an entire ECC library in ASM is impossible, we are talking about thousands of lines of code that would be a lot more in ASM and as I said before the benefits is not as great as you'd think. However parts of the code can be written in ASM like what libsecp256k1 (https://github.com/bitcoin-core/secp256k1/tree/master/src/asm) does by writing the field element code in ASM.


Title: Re: secp256k1 library in pure assembly
Post by: PowerGlove on December 01, 2022, 09:23:44 AM
as I see there is no known library for this
Writing an entire ECC library in ASM is impossible, [...]
I don't know about that, man; heavier lifts have been made before. FASM is one example (an assembler written in assembly). If I check how many lines of code that has:

Code:
grep '^$' -rv ./fasm-1.73.30/fasm/source | wc -l

I get 35483. Even an elaborate, fully-featured secp256k1 library in x86-64 assembly would fit (more than) comfortably in 1/4 of that.

If that doesn't convince you (i.e. you feel that a significant fraction of FASM's source code is likely table-generated) then think of feats like the first RollerCoaster Tycoon game: Chris Sawyer wrote that in (99%) assembly. I don't know how familiar you are with gamedev, but something like RollerCoaster Tycoon completely dwarfs a secp256k1 library in terms of complexity.


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 01, 2022, 10:44:00 AM
in secp256k1 - we have only five main macros:
- add   - easy implement
- substract - easy implement
- multiply - easy implement
- divide - easy implement

and it is "easy" problem.

The main problem in secp256k1 is modulo p


Title: Re: secp256k1 library in pure assembly
Post by: albert0bsd on December 01, 2022, 12:33:33 PM
The main problem in secp256k1 is modulo p

Actually we only need a good framework to do operations with big numbers, this also using all the capabilities of modern CPU.

Regards.


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 01, 2022, 06:04:10 PM
The main problem in secp256k1 is modulo p

Actually we only need a good framework to do operations with big numbers, this also using all the capabilities of modern CPU.

Regards.


Literally this.

Have you heard of GAP? It's a C language framework for doing huge integer math and knows about group theory and such. A really smart assembly guy recommended it to me a few months ago.

https://www.gap-system.org/Download/


Title: Re: secp256k1 library in pure assembly
Post by: albert0bsd on December 02, 2022, 03:45:20 PM
also some good ecc implementation based on gmp: https://github.com/masterzorag/ec_gmp

I used that library for some tools that I made but it is not optimized for secp256k1 also it is some kind of vulnerable to some side channels attacks and incomplete because it declare EC.b parameter but it never use.

A lot of improvements can be made to that implementation.

The fastest implementation for secp256k1 code that I ever see and use it is already inside of kangaroo tool.

https://github.com/JeanLucPons/Kangaroo/tree/master/SECPK1

Same library that I actually use in my keyhunt code.






Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 02, 2022, 05:42:43 PM
The fastest implementation for secp256k1 code that I ever see and use it is already inside of kangaroo tool.

https://github.com/JeanLucPons/Kangaroo/tree/master/SECPK1

I wonder if there is a way to optimize it further though? Do you know whether it's making use of SSE?  But even more important than that, maybe there's a series of assembly instructions you can run to run repeated calls faster.

But since I use secp256k1 curve only for testing and research I do no care much for any of  possible vulnerabilities and attacks.

The safest (not necessary the fastest) secp256k1 is the one used in Bitcoin Core. But I don't use it because I keep getting wrong answers when I do arithmetic. Maybe the privkey bytes are not being filled correctly or something.


Title: Re: secp256k1 library in pure assembly
Post by: albert0bsd on December 02, 2022, 07:11:36 PM
And EC.b can be used here:

Thank you for Pointing where b is used, I recheck the code and that function is not there, anyway it is easy to add.


Do you know whether it's making use of SSE?  But even more important than that, maybe there's a series of assembly instructions you can run to run repeated calls faster.

The SSE are used in some other code of JLP those are used to generate 4 sha256 and 4 rmd160 hashes of fixed data at the same time.

That is only useful when you are working with addresses or hashes rmd160 those SSE functions boost that part of the code some 2 or 3 times faster.

Code here:

https://github.com/JeanLucPons/VanitySearch/blob/master/SECP256K1.cpp#L380

Functions are inside of the hash folder

Also I use those in keyhunt for address and rmd160 search.

About if that code can be optimized, I already check it a lot of times and it is already optimized.

Regards


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 03, 2022, 11:58:35 AM
but still it is slower than in pure asm.

at the moment I implemented "adding" in pure asm 256/512 bit

it is fastest than c++

(without modulo)




Title: Re: secp256k1 library in pure assembly
Post by: albert0bsd on December 03, 2022, 02:38:24 PM
Ok, lets to to write it by parts using openAI

Code:
Write in assembly a code to add two numbers of 512 bits

Code:
; Define the two 512-bit numbers to be added
.data
num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF
num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA

; Define the 512-bit result
result: .quad 0, 0, 0, 0

; Main program
.text
.global main
main:

  ; Initialize the result to 0
  mov $0, %rax
  mov %rax, result(,%rax,8)

  ; Loop through each of the 64-bit words in the numbers
  mov $0, %rax
  loop:

    ; Add the corresponding 64-bit words from num1 and num2
    movq num1(,%rax,8), %rdx
    addq num2(,%rax,8), %rdx

    ; Store the result in the corresponding 64-bit word in the result
    mov %rdx, result(,%rax,8)

    ; Increment the loop counter and check if it has reached 4
    inc %rax
    cmp $4, %rax
    jne loop

  ; End the program
  ret


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 03, 2022, 04:25:27 PM
Ok, lets to to write it by parts using openAI

Code:
Write in assembly a code to add two numbers of 512 bits

Code:
; Define the two 512-bit numbers to be added
.data
num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF
num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA

; Define the 512-bit result
result: .quad 0, 0, 0, 0

; Main program
.text
.global main
main:

  ; Initialize the result to 0
  mov $0, %rax
  mov %rax, result(,%rax,8)

  ; Loop through each of the 64-bit words in the numbers
  mov $0, %rax
  loop:

    ; Add the corresponding 64-bit words from num1 and num2
    movq num1(,%rax,8), %rdx
    addq num2(,%rax,8), %rdx

    ; Store the result in the corresponding 64-bit word in the result
    mov %rdx, result(,%rax,8)

    ; Increment the loop counter and check if it has reached 4
    inc %rax
    cmp $4, %rax
    jne loop

  ; End the program
  ret

Damn! I didn't know OpenAI could write code.

I used Dall-E to generate images before, but I wasn't aware of anything like this. Yeah, I've used Copilot, but I haven't generated any assembly with it.

This stuff could be very useful if it indeed works (AI generated code is sometimes buggy). It may not know how to generate a secp256k1 operation in ASM yet, but I think we'll get there soon (plus, ARM support!).


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 03, 2022, 07:33:02 PM
why you will not use xmm and xmmword? it is 128 bit? less code, less operation, fastest

when I finish i will upload.

the problem is with modulo for my self:)


try something like:

Code:
    %define arg1f XMM0
    %define arg2f XMM1
    %define arg3f XMM2
    %define arg4f XMM3
    %define arg1 RDI
    %define arg2 RSI
    %define arg3 RDX
    %define arg4 RC
    %define arg5 R8
    %define arg6 R9


%macro MULT 1,2,3
;Multi XMM0:XMM1:XMM4:XMM5 by XMM2:XMM3:XMM6:XMM7
        movups arg3f,[%1+%3]
        movaps xmm7,arg3f

        mulps  arg3f,arg1f
        movshdup    arg4f, arg3f
        addps       arg3f, arg4f
        movaps xmm6,xmm7
        movhlps     arg4f, arg3f
        addss       arg3f, arg4f
        movss  [%2+%3], arg3f;

        mulps  xmm6,arg2f
        movshdup    arg4f, xmm6
        addps       xmm6, arg4f
        movaps arg3f,xmm7
        movhlps     arg4f, xmm6
        addss       xmm6, arg4f
        movss  [%2+4+%3], xmm6

        mulps  arg3f,xmm4
        movshdup    arg4f, arg3f
        addps       arg3f, arg4f
        movaps xmm6,xmm7
        movhlps     arg4f, arg3f
        addss       arg3f, arg4f
        movss  [%2+8+%3], arg3f

        mulps  xmm6,xmm5
        movshdup    arg4f, xmm6
        addps       xmm6, arg4f
        movhlps     arg4f, xmm6
        addss       xmm6, arg4f
        movss  [%2+8+4+%3], xmm6

%endmacro


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 04, 2022, 04:57:42 PM
why you will not use xmm and xmmword? it is 128 bit? less code, less operation, fastest

when I finish i will upload.

the problem is with modulo for my self:)


try something like:

Code:
    %define arg1f XMM0
    %define arg2f XMM1
    %define arg3f XMM2
    %define arg4f XMM3
    %define arg1 RDI
    %define arg2 RSI
    %define arg3 RDX
    %define arg4 RC
    %define arg5 R8
    %define arg6 R9


%macro MULT 1,2,3
;Multi XMM0:XMM1:XMM4:XMM5 by XMM2:XMM3:XMM6:XMM7
        movups arg3f,[%1+%3]
        movaps xmm7,arg3f

        mulps  arg3f,arg1f
        movshdup    arg4f, arg3f
        addps       arg3f, arg4f
        movaps xmm6,xmm7
        movhlps     arg4f, arg3f
        addss       arg3f, arg4f
        movss  [%2+%3], arg3f;

        mulps  xmm6,arg2f
        movshdup    arg4f, xmm6
        addps       xmm6, arg4f
        movaps arg3f,xmm7
        movhlps     arg4f, xmm6
        addss       xmm6, arg4f
        movss  [%2+4+%3], xmm6

        mulps  arg3f,xmm4
        movshdup    arg4f, arg3f
        addps       arg3f, arg4f
        movaps xmm6,xmm7
        movhlps     arg4f, arg3f
        addss       arg3f, arg4f
        movss  [%2+8+%3], arg3f

        mulps  xmm6,xmm5
        movshdup    arg4f, xmm6
        addps       xmm6, arg4f
        movhlps     arg4f, xmm6
        addss       xmm6, arg4f
        movss  [%2+8+4+%3], xmm6

%endmacro

Are you saving some bits at the end of each dword for the carry? Otherwise you're going to lose accuracy in the final answer, because SIMD stuff will just overflow instead of carry.

I was thinking of applying the excellent technique of 5x52-bit numbers used in libsecp256k1, where we use 5 quadwords that each hold 52 bits of the real number each (except for the most significant qword which holds less). There's also a 10x26-bit variant for 32-bit machines.

If that technique is applied, we can defer the carry and add the numbers hundreds of times without corruption.

We would only need 3 SSE adds, or 2 AVX adds in that case. But I have heard that using the AVX instructions incurs some kind of speed penalty like AVX-512?

An alternative is just stuffing them in the R8-R15 registers and RAX/RBX for the 5 quadwords for both operands and then clobber one of the operands with the sum.


Title: Re: secp256k1 library in pure assembly
Post by: j2002ba2 on December 05, 2022, 05:48:07 PM
Ok, lets to to write it by parts using openAI

Code:
...
    addq num2(,%rax,8), %rdx
...
As usual, the so called AI gave wrong result. The carry flag is completely ignored, so instead it adds two vectors, each having four 64bit numbers.


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 06, 2022, 03:39:53 AM
Ok, lets to to write it by parts using openAI

Code:
...
    addq num2(,%rax,8), %rdx
...
As usual, the so called AI gave wrong result. The carry flag is completely ignored, so instead it adds two vectors, each having four 64bit numbers.

This is a recurring pattern I've seen in the AI-generated results. Nearly all of them have comments, so I know that the ones that don't talk about the point additions expressions are wrong, and there are also samples where I saw arithmetic using just "eax" instead of the necessary 4/5 64-bit registers.

There were instances where I got merely 256-bit number addition instead of point addition.


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 21, 2022, 07:42:25 PM
I have done implement parts of secp256k1 in pure asm

I would like to inform that perofrmence is fucking fast (for me):

100 000 000 performs modulo n on secp256k1 vals:

ubuntu@ubuntu2004:~/Downloads/ma$ nasm -f elf64 main.asm
ubuntu@ubuntu2004:~/Downloads/ma$ ld -s -o main main.o
ubuntu@ubuntu2004:~/Downloads/ma$ time ./main

real   0m4.856s
user   0m4.850s
sys   0m0.005s
ubuntu@ubuntu2004:~/Downloads/ma$


Title: Re: secp256k1 library in pure assembly
Post by: albert0bsd on December 22, 2022, 02:30:38 AM
I have done implement parts of secp256k1 in pure asm

I would like to inform that perofrmence is fucking fast (for me):

100 000 000 performs modulo n on secp256k1 vals:

Can you explain what kind of numbers are those 100 Million inputs. What is your hardware and against what other implementation are you comparing your results?


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 22, 2022, 09:51:35 AM
And moreover VanitySeacrh is only fast at "Point pub = secp256k1->ComputePublicKey(&privKey);" since it generates a "Point GTable[256*32];"
to be used in "Q = Add2(Q, GTable[256 * i + (b-1)])" having Jacobian Coordinates representation as calculation scheme.
If you take Add2 to add one million points in a sequence it is faster than classic point_addition with inversion by far.
But to use any Jacobian point after calculation further you need to use "Q.Reduce();" beforehand the same thing as inversion.
And in that situation classic scheme using GMP becomes faster since you just add point and can use it further immediately without any Reduce().

That's a sort of tradeoff you'll have when using data structures which can hold additional overflow at the cost of requiring a "rebuild" later. But most applications that do many hetrogeneous arithmetic ops should look into writing a function to sort of do it all at once with some optimization if possible. Just so that you can retain the benefit of using overflow structures.


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 22, 2022, 12:59:05 PM
And moreover VanitySeacrh is only fast at "Point pub = secp256k1->ComputePublicKey(&privKey);" since it generates a "Point GTable[256*32];"
to be used in "Q = Add2(Q, GTable[256 * i + (b-1)])" having Jacobian Coordinates representation as calculation scheme.
If you take Add2 to add one million points in a sequence it is faster than classic point_addition with inversion by far.
But to use any Jacobian point after calculation further you need to use "Q.Reduce();" beforehand the same thing as inversion.
And in that situation classic scheme using GMP becomes faster since you just add point and can use it further immediately without any Reduce().

That's a sort of tradeoff you'll have when using data structures which can hold additional overflow at the cost of requiring a "rebuild" later. But most applications that do many hetrogeneous arithmetic ops should look into writing a function to sort of do it all at once with some optimization if possible. Just so that you can retain the benefit of using overflow structures.


Different projective points representations boost scalar multiplication only but not points addition.
That is all very interesting. But can you point to some source code to see that in action.
In case with GMP performance boost comes when you put local variables of functions to global scope.
So that you initialize them at the start and clear in the end. And functions use them directly for reading and writing throughout the program run.


Libsecp256k1 that is used in Bitcoin Core has two special-purpose structs for adding & multiplying a point many times before requiring a rebuild. https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52.h and https://github.com/bitcoin-core/secp256k1/blob/master/src/field_10x26.h

These are special because there is no intermediate "modulus" during any of the operations, so they are quite fast. Unfortunately these functions are only correct for public keys (as the modulus is hardcoded), and you'd need to write a different set of functions to operate with private keys.

What I'm thinking is: Imagine if you need to perform the following (arbitrary) sequence:

E = a*G + b*G + c*G + d*G

I am multiplying 4 times and adding three times.

Now we could do (a+b+c+d)*G, and since these functions can last for a number of ops without a rebuild, I should be safe doing a Montgomery ladder for the multiplication by G, with this sum. Instead of multiplying 4 times I only multiply one time.

Now let's look at something I just pasted from JeanLucPons Kangaroo:

Code:
      dy.ModSub(p2y,p1y);
      _s.ModMulK1(&dy,&dx[g]);
      _p.ModSquareK1(&_s);

      rx.ModSub(&_p,p1x);
      rx.ModSub(p2x);

      ry.ModSub(p2x,&rx);
      ry.ModMulK1(&_s);
      ry.ModSub(p2y);

It roughly translates to (((p2y-p1y) * dx)^2 - p1x - p2x) = rx
and (p2x - (((p2y-p1y) * dx)^2 - p1x - p2x)) * ((p2y-p1y) * dx) - p2y = ry

That's 6 subtractions, 3 multiplications, and one squaring in total.

What if there were a way to optimize this calculation through bit shifts and the like, just like how libsecp256k1 multiplies two numbers with not only muls and adds but also bitwise arithmetic?

We have a pattern: (a - b) * c which occurs twice in this calculation in total.

Rather than consuming a subtraction and multiplication, I wonder if there was a way to calculate them at once (not just concatenate them together).

That would be incredible.


Title: Re: secp256k1 library in pure assembly
Post by: arulbero on December 22, 2022, 05:33:57 PM

Now let's look at something I just pasted from JeanLucPons Kangaroo:

Code:
      dy.ModSub(p2y,p1y);
      _s.ModMulK1(&dy,&dx[g]);
      _p.ModSquareK1(&_s);

      rx.ModSub(&_p,p1x);
      rx.ModSub(p2x);

      ry.ModSub(p2x,&rx);
      ry.ModMulK1(&_s);
      ry.ModSub(p2y);

It roughly translates to (((p2y-p1y) * dx)^2 - p1x - p2x) = rx
and (p2x - (((p2y-p1y) * dx)^2 - p1x - p2x)) * ((p2y-p1y) * dx) - p2y = ry

That's 6 subtractions, 3 multiplications, and one squaring in total.

What if there were a way to optimize this calculation through bit shifts and the like, just like how libsecp256k1 multiplies two numbers with not only muls and adds but also bitwise arithmetic?

We have a pattern: (a - b) * c which occurs twice in this calculation in total.

Rather than consuming a subtraction and multiplication, I wonder if there was a way to calculate them at once (not just concatenate them together).


The point addition formula is:

Code:
Point Addition for X₁ ≠ X₂             #


            s = (y2 - y1) / (x2 - x1)
            x3 = s ** 2 - x1 - x2
            y3 = s * (x1 - x3) - y1

If you have to generate a batch of public keys with same distance, for example B =  10*G, 15*G, 20*G, 25*G ...

you can generate only x coordinates (y coordinates are not necesssary) and compute only 1 inversion for the entire batch B; you don't need Montgomery ladder, projective coordinates, 5x52 limbs, ...

To generate 1000 public keys you need to compute:

1000 squares   -->  s**2
1000 multiplications   (y2-y1) * 1/(x2-x1)
1 inversion
1500 multiplications --> to get 1/(x2-x1)

tot:   2500 mult + 1000 sqr + 1 inv to compute 1000 public keys ->  2.5 mul + 1 sqr + 4 subtractions for each key.


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 22, 2022, 05:39:17 PM
I have done implement parts of secp256k1 in pure asm

I would like to inform that perofrmence is fucking fast (for me):

100 000 000 performs modulo n on secp256k1 vals:

Can you explain what kind of numbers are those 100 Million inputs. What is your hardware and against what other implementation are you comparing your results?


so: test was performed on:
Intel Core i5-1240P  - without multithreading - on just one core.
Power set up on balance (not maximum performance)

test has been done on:

xor rcx,rcx
_loop:
x*x mod n
so first multiply x by x ( x is value Point of generator Secp256k1 ( i mean the x value of G(x,y)) then mod n
so 256 bit *256 bit value then mod n

inc rcx
cmp rcx,100 000 000
jne _loop
 


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 22, 2022, 07:20:42 PM

Now let's look at something I just pasted from JeanLucPons Kangaroo:

Code:
      dy.ModSub(p2y,p1y);
      _s.ModMulK1(&dy,&dx[g]);
      _p.ModSquareK1(&_s);

      rx.ModSub(&_p,p1x);
      rx.ModSub(p2x);

      ry.ModSub(p2x,&rx);
      ry.ModMulK1(&_s);
      ry.ModSub(p2y);

It roughly translates to (((p2y-p1y) * dx)^2 - p1x - p2x) = rx
and (p2x - (((p2y-p1y) * dx)^2 - p1x - p2x)) * ((p2y-p1y) * dx) - p2y = ry

That's 6 subtractions, 3 multiplications, and one squaring in total.



What if there were a way to optimize this calculation through bit shifts and the like, just like how libsecp256k1 multiplies two numbers with not only muls and adds but also bitwise arithmetic?

We have a pattern: (a - b) * c which occurs twice in this calculation in total.

Rather than consuming a subtraction and multiplication, I wonder if there was a way to calculate them at once (not just concatenate them together).


The point addition formula is:

Code:
Point Addition for X₁ ≠ X₂             #


            s = (y2 - y1) / (x2 - x1)
            x3 = s ** 2 - x1 - x2
            y3 = s * (x1 - x3) - y1

If you have to generate a batch of public keys with same distance, for example B =  10*G, 15*G, 20*G, 25*G ...

you can generate only x coordinates (y coordinates are not necesssary) and compute only 1 inversion for the entire batch B; you don't need Montgomery ladder, projective coordinates, 5x52 limbs, ...

To generate 1000 public keys you need to compute:

1000 squares   -->  s**2
1000 multiplications   (y2-y1) * 1/(x2-x1)
1 inversion
1500 multiplications --> to get 1/(x2-x1)

tot:   2500 mult + 1000 sqr + 1 inv to compute 1000 public keys ->  2.5 mul + 1 sqr + 4 subtractions for each key.

this is intresting


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 23, 2022, 04:05:06 AM
The point addition formula is:

Code:
Point Addition for X_1 != X_2             #


            s = (y2 - y1) / (x2 - x1)
            x3 = s ** 2 - x1 - x2
            y3 = s * (x1 - x3) - y1

If you have to generate a batch of public keys with same distance, for example B =  10*G, 15*G, 20*G, 25*G ...

you can generate only x coordinates (y coordinates are not necesssary) and compute only 1 inversion for the entire batch B; you don't need Montgomery ladder, projective coordinates, 5x52 limbs, ...

To generate 1000 public keys you need to compute:

1000 squares   -->  s**2
1000 multiplications   (y2-y1) * 1/(x2-x1)
1 inversion
1500 multiplications --> to get 1/(x2-x1)

tot:   2500 mult + 1000 sqr + 1 inv to compute 1000 public keys ->  2.5 mul + 1 sqr + 4 subtractions for each key.

It is not clear to me to what this 1 inversion would be applied to. At a first glance, I only see the (x2 - x1) being inverted, but it's not clear how you'd do the inversion last, or only once, because x1 && x2 will usually not both be constant.


Title: Re: secp256k1 library in pure assembly
Post by: arulbero on December 23, 2022, 07:56:04 AM
It is not clear to me to what this 1 inversion would be applied to. At a first glance, I only see the (x2 - x1) being inverted, but it's not clear how you'd do the inversion last, or only once, because x1 && x2 will usually not both be constant.

Of course you have to get many different inversions, 1/(x2-x1), 1/(x2'-x1') and so on, but you can get these values by computing only one inversion for the entire group.

The idea is:

imagine you need 1/a and 1/b

you don't want to compute 2 inversions, because inversion is a very expensive operation.

Then:

1) compute a*b        -> 1 multiplication
2) compute 1/(a*b)   -> 1 inversion, you invert only a*b
3) compute 1((a*b) * b = 1/a  -> 1 multiplication
4) compute 1/(a*b) * a = 1/b  -> 1 multiplication

now you got 2 inversions (1/a and 1/b) computing only 3 multiplications and 1 'real' inversion.

If you have more than 2 elements, for example a,b,c,d,e

1) compute a*b                  -> 1 mul  (you have to store this result)
2) compute (a*b)*c            -> 1 mul  (you have to store this result)
3) compute (a*b*c)*d        -> 1 mul  (you have to store this result)
4) compute (a*b*c*d)*e    ->  1mul   (you have to store this result)
5) compute 1/(a*b*c*d*e) -> 1 inv

--> about 1 mul for each element + 1 inversion  -> (n-1)mul + 1 inv


6) compute 1/(a*b*c*d*e) * (a*b*c*d) = 1/e  -> 1 mul
7) compute 1/(a*b*c*d*e) * e = 1/(a*b*c*d)  -> 1 mul

8 ) compute 1/(a*b*c*d) * (a*b*c)  = 1/d  -> 1 mul
9) compute 1/(a*b*c*d) * d = 1/(a*b*c)  -> 1 mul

10) compute 1/(a*b*c) * (a*b)  = 1/c  -> 1 mul
11) compute 1/(a*b*c) * c = 1/(a*b)   -> 1 mul

12) compute 1/(a*b) * (a)  = 1/b  -> 1 mul
13) compute 1/(a*b) * (b) = 1/a   -> 1 mul

->   2*(n-2) + 2 = 2*(n-1) mul


If you do the same with a very long batch,

you need then 3*(n-1) mul + only 1 real inversion, in this way you minimize the impact of the real inversion.

On average you can consider then 3 mul for each element to invert.

-------------------------------------------------------------------------------------------------------------------------------------------------------


The function that implements this idea in Vanitysearch is 'ModInv' :

https://github.com/JeanLucPons/VanitySearch/blob/master/IntGroup.cpp/#L35

first n-1 mul -> https://github.com/JeanLucPons/VanitySearch/blob/master/IntGroup.cpp/#L42-L44

the only one real inversion -> https://github.com/JeanLucPons/VanitySearch/blob/master/IntGroup.cpp/#L48

second and third n-1 mul -> https://github.com/JeanLucPons/VanitySearch/blob/master/IntGroup.cpp/#L50-L54


that function belongs to the class IntGroup:

https://github.com/JeanLucPons/VanitySearch/blob/master/IntGroup.h#L24


This trick works only if you know many 'x1' and 'x2' before starting the computation.


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 24, 2022, 09:59:45 AM

intresting things:

I'm plying in pure asm:


during multiply x by x , but combination of size of registers I got:

x is our X of Generator point.

Code:
SECTION .text
GLOBAL _start

_start:

call def_Point_Addition

    ; Terminate program
mov rax,1            ; 'exit' system call
mov rbx,0            ; exit with error code 0
int 80h


def_Point_Addition:           ;Point Addition for P1= P2   (tangent with slope)
mov512 num_1_512,x_256
mov512 num_2_512,x_256
mov rdi,num_1_512
mov rsi,num_2_512
call mul_u256s

call print_u512     ; print n1 after addition
call print_nl
ret

I get output:


buntu@ubuntu2004:/mnt/hgfs/plik/asm$ nasm -f elf64 proba.asm
ubuntu@ubuntu2004:/mnt/hgfs/plik/asm$ ld -s -o proba proba.o
ubuntu@ubuntu2004:/mnt/hgfs/plik/asm$ time ./proba
483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b85dbbf79179d08fe 08f9a320029fccfe5423baf1bc4ca9debda56d7f7609edd60

real   0m0.004s
user   0m0.001s
sys   0m0.001s
ubuntu@ubuntu2004:/mnt/hgfs/plik/asm$

the first of 64 bytes are y of x!

I do not understand:)




Title: Re: secp256k1 library in pure assembly
Post by: arulbero on December 24, 2022, 11:23:28 AM
I have done implement parts of secp256k1 in pure asm

I would like to inform that performance is fucking fast (for me):

100 000 000 performs modulo n on secp256k1 vals:

ubuntu@ubuntu2004:~/Downloads/ma$ nasm -f elf64 main.asm
ubuntu@ubuntu2004:~/Downloads/ma$ ld -s -o main main.o
ubuntu@ubuntu2004:~/Downloads/ma$ time ./main

real   0m4.856s
user   0m4.850s
sys   0m0.005s
ubuntu@ubuntu2004:~/Downloads/ma$


Can you explain what kind of numbers are those 100 Million inputs. What is your hardware and against what other implementation are you comparing your results?


so: test was performed on:
Intel Core i5-1240P  - without multithreading - on just one core.
Power set up on balance (not maximum performance)

test has been done on:

xor rcx,rcx
_loop:
x*x mod n
so first multiply x by x ( x is value Point of generator Secp256k1 ( i mean the x value of G(x,y)) then mod n
so 256 bit *256 bit value then mod n

inc rcx
cmp rcx,100 000 000
jne _loop
 


I tried a similar test with my mul function:

Code:
int main(){

 uint64_t  x[4] =  {0x59f2815b16f81798, 0x029bfcdb2dce28d9, 0x55a06295ce870b07, 0x79be667ef9dcbbac};
 uint64_t  a[4] =  {0x59f2815b16f81798, 0x029bfcdb2dce28d9, 0x55a06295ce870b07, 0x79be667ef9dcbbac};


 uint64_t* ptrx = &x[0];
 uint64_t* ptra = &a[0];

 uint32_t i;

 for(i=0; i < 100000000; i++){
     mul(ptrx,ptra,ptrx);  //a*x -> x
 }
 

}


12th Gen Intel(R) Core(TM) i7-12700H  - without multithreading - on just one core.
Power set up on balance (not maximum performance)

$ gcc   -O2  main.c

$ time ./a.out

real   0m0,886s
user   0m0,882s
sys   0m0,005s


Title: Re: secp256k1 library in pure assembly
Post by: ecdsa123 on December 24, 2022, 12:10:55 PM
could you add mod n after mult?

i'm intresting to see the benchmark.
in your example we have only multiply without mod n


Title: Re: secp256k1 library in pure assembly
Post by: arulbero on December 24, 2022, 12:34:40 PM
could you add mod n after mult?

i'm intresting to see the benchmark.
in your example we have only multiply without mod n

My mul function has mod too.

Code:
// compute a * b = (low. high)
#define MultiplyWordsLoHi(low, high, a, b) asm  ( "mulx  %2, %0, %1;" : "=r"(low), "=r"(high) :  "gr" (a), "d" (b) : "cc");

#define AccAdd4WordsBy4_wc(a0, a1, a2, a3, b0, b1, b2)   asm  ("addq %4, %0; adcx %5, %1; adcx %6, %2; adcq $0, %3;" : "+r"(a0), "+r"(a1), "+r"(a2), "+r"(a3) : "r"(b0), "r"(b1), "r"(b2) : "cc");
// (a0, a1, a2, a3) = (a0, a1, a2, a3) + (b0, b1, b2, 0) without carry

#define MulAcc(c, a0, a1, a, b) asm      ("mulx %3, %3, %4; addq %3, %1; adcq %4, %2; adcq $0, %0;" : "+r"(c), "+r"(a0), "+r"(a1), "=a"(a), "=d"(b) : "a"(a), "d"(b) : "cc");

#define MulAcc_11(a0, a1, c0, a, b)      asm ("mulx %3, %0, %1; addq %2, %0; adcq $0, %1;" : "+r"(a0), "+r"(a1): "r"(c0), "r"(a), "d"(b) : "cc");



// compute u*v mod p
inline void
mul(uint64_t *r, uint64_t *u, uint64_t *v) {
  
  uint64_t u0 = u[0];
  uint64_t u1 = u[1];
  uint64_t u2 = u[2];
  uint64_t u3 = u[3];


  uint64_t v0 = v[0];
  uint64_t v1 = v[1];
  uint64_t v2 = v[2];
  uint64_t v3 = v[3];

  uint64_t r0, r1, r2, r3, r4, r5, r6, r7;
  uint64_t z1, z2, z3, z4, z5, z6, z7, z8, z44, z66;

  z2 = z3 = z4 = z5 = z6 = z7 = z8 = r1 = r2 = r3 = r4 = r5 = r6 = r7 = 0;


  MultiplyWordsLoHi(r0, z1, u0, v0) //x1 --> r0 ok
  MultiplyWordsLoHi(z2, z3, u1, v0)
  MultiplyWordsLoHi(z4, z5, u2, v0)
  MultiplyWordsLoHi(z6, z7, u3, v0)
  AccAdd4WordsBy4_wc(z2, z4, z6, z7, z1, z3, z5)


  MulAcc_11(r1, z1, z2, u0, v1) //x1 --> r1 ok
  MultiplyWordsLoHi(z2, z3, u1, v1)
  MultiplyWordsLoHi(z44, z5, u2, v1)
  MultiplyWordsLoHi(z66, z8, u3, v1)
  AccAdd4WordsBy4_wc(z1, z3, z5, z8, z4, z6, z7)
  AccAdd4WordsBy4_wc(z2, z44, z66, z8, z1, z3, z5)

  
  MulAcc_11(r2, z1, z2, u0, v2) //x1 --> r2 ok
  MultiplyWordsLoHi(z2, z3, u1, v2)
  MultiplyWordsLoHi(z4, z5, u2, v2)
  MultiplyWordsLoHi(z6, z7, u3, v2)
  AccAdd4WordsBy4_wc(z1, z3, z5, z7, z44, z66, z8)
  AccAdd4WordsBy4_wc(z2, z4, z6, z7, z1, z3, z5)

  MulAcc_11(r3, z1, z2, u0, v3) //x1 --> r3 ok
  MultiplyWordsLoHi(r4, z3, u1, v3)
  MultiplyWordsLoHi(r5, z5, u2, v3)
  MultiplyWordsLoHi(r6, r7, u3, v3)
  AccAdd4WordsBy4_wc(z1, z3, z5, r7, z4, z6, z7)
  AccAdd4WordsBy4_wc(r4, r5, r6, r7, z1, z3, z5) //r4, r5, r6, r7 ok
  

  //Reduction, mod 2^256 - 0x1000003d1
  
  uint64_t p = 0x1000003d1;
  MulAcc_11(z1, z2, r0, r4, p)
  MultiplyWordsLoHi(z3, z4, r5, p)
  MultiplyWordsLoHi(z5, z6, r6, p)
  MultiplyWordsLoHi(z7, z8, r7, p)
  

  //MulAcc_11(z1, z2, r0, r4, p)
  AccAdd4WordsBy4_wc(z2, z4, z6, z8, r1, r2, r3)

  
  uint64_t c = 0;
  AccAdd4WordsBy4_wc(z3, z5, z7, z8, z2, z4, z6)
  MulAcc(c, z1, z3, p, z8)
  

  r[0] = z1;
  r[1] = z3;
 
  
  if(c == 1){

  asm (
     "addq $1, %0; adcq $0, %1; \n"
       : "=r" (z5), "=r" (z7)
       : : "cc");

  }
  
  r[2] = z5;
  r[3] = z7;
  
}

EDIT:


With 1 000 000 000 iterations:

mul without mod:

real   0m5,771s
user   0m5,770s
sys   0m0,000s



mul with mod:

real   0m8,438s
user   0m8,434s
sys   0m0,004s



Title: Re: secp256k1 library in pure assembly
Post by: Pieter Wuille on December 26, 2022, 02:52:18 PM
But since I use secp256k1 curve only for testing and research I do no care much for any of  possible vulnerabilities and attacks.

The safest (not necessary the fastest) secp256k1 is the one used in Bitcoin Core. But I don't use it because I keep getting wrong answers when I do arithmetic. Maybe the privkey bytes are not being filled correctly or something.

Perhaps you should open an issue or discussion topic (https://github.com/bitcoin-core/secp256k1/issues or https://github.com/bitcoin-core/secp256k1/discussions/categories/q-a) on the library, because that's definitely not supposed to happen.


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 26, 2022, 04:56:50 PM
But since I use secp256k1 curve only for testing and research I do no care much for any of  possible vulnerabilities and attacks.

The safest (not necessary the fastest) secp256k1 is the one used in Bitcoin Core. But I don't use it because I keep getting wrong answers when I do arithmetic. Maybe the privkey bytes are not being filled correctly or something.

Perhaps you should open an issue or discussion topic (https://github.com/bitcoin-core/secp256k1/issues or https://github.com/bitcoin-core/secp256k1/discussions/categories/q-a) on the library, because that's definitely not supposed to happen.

To update this thread with info I posted in the meantime, it appears to be that the multiplication has the secp256k1 characteristic modulus hardcoded into the algorithm, making it suitable only for public key multiplication.

I was thinking about making an adapted version of the multiplication algorithm that uses the curve order (subtracted from 2^256) in its place, so that private key multiplication is covered as well.

I haven't explored the 10x26 legs imply too deep, but I'm assuming it's a similar case.

Since you're here though, let me ask: Was the multiplication assembly (and possibly the C version of it in another file using int128) only intended for public keys?


Title: Re: secp256k1 library in pure assembly
Post by: arulbero on December 26, 2022, 06:48:47 PM
it appears to be that the multiplication has the secp256k1 characteristic modulus hardcoded into the algorithm, making it suitable only for public key multiplication.

I was thinking about making an adapted version of the multiplication algorithm that uses the curve order (subtracted from 2^256) in its place, so that private key multiplication is covered as well.

I haven't explored the 10x26 legs imply too deep, but I'm assuming it's a similar case.

Since you're here though, let me ask: Was the multiplication assembly (and possibly the C version of it in another file using int128) only intended for public keys?

You have to distinguish between:

1) field operations (mod p, space of coordinates x and y):

https://github.com/bitcoin-core/secp256k1/tree/master/src/   all files with name : field*

and

2) scalar operations (mod n, space of private keys):

https://github.com/bitcoin-core/secp256k1/tree/master/src/  all files with name : scalar*


If you want to multiply 2 private keys (mod n):

you could use secp256k1_scalar_mul_512 (256bit * 256 bit -> 512 bit)  (assembly multiplication)

https://github.com/bitcoin-core/secp256k1/blob/master/src/scalar_4x64_impl.h#L587
Code:
inline void secp256k1_scalar_mul_512(uint64_t* l, const uint64_t *a,  const uint64_t *b) {
    
    //uint64_t l[8];
    
    __asm__ __volatile__ (
    /* Preload */
    "movq 0(%%rdi), %%r15\n"
    "movq 8(%%rdi), %%rbx\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 0(%%rdx), %%r11\n"
    "movq 8(%%rdx), %%r12\n"
    "movq 16(%%rdx), %%r13\n"
    "movq 24(%%rdx), %%r14\n"
    /* (rax,rdx) = a0 * b0 */
    "movq %%r15, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 */
    "movq %%rax, 0(%%rsi)\n"
    /* (r8,r9,r10) = (rdx) */
    "movq %%rdx, %%r8\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r10, %%r10\n"
    /* (r8,r9,r10) += a0 * b1 */
    "movq %%r15, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    /* (r8,r9,r10) += a1 * b0 */
    "movq %%rbx, %%rax\n"
    "mulq %%r11\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    /* Extract l1 */
    "movq %%r8, 8(%%rsi)\n"
    "xorq %%r8, %%r8\n"
    /* (r9,r10,r8) += a0 * b2 */
    "movq %%r15, %%rax\n"
    "mulq %%r13\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    /* (r9,r10,r8) += a1 * b1 */
    "movq %%rbx, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    /* (r9,r10,r8) += a2 * b0 */
    "movq %%rcx, %%rax\n"
    "mulq %%r11\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    /* Extract l2 */
    "movq %%r9, 16(%%rsi)\n"
    "xorq %%r9, %%r9\n"
    /* (r10,r8,r9) += a0 * b3 */
    "movq %%r15, %%rax\n"
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    /* Preload a3 */
    "movq 24(%%rdi), %%r15\n"
    /* (r10,r8,r9) += a1 * b2 */
    "movq %%rbx, %%rax\n"
    "mulq %%r13\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    /* (r10,r8,r9) += a2 * b1 */
    "movq %%rcx, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    /* (r10,r8,r9) += a3 * b0 */
    "movq %%r15, %%rax\n"
    "mulq %%r11\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    /* Extract l3 */
    "movq %%r10, 24(%%rsi)\n"
    "xorq %%r10, %%r10\n"
    /* (r8,r9,r10) += a1 * b3 */
    "movq %%rbx, %%rax\n"
    "mulq %%r14\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    /* (r8,r9,r10) += a2 * b2 */
    "movq %%rcx, %%rax\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    /* (r8,r9,r10) += a3 * b1 */
    "movq %%r15, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    /* Extract l4 */
    "movq %%r8, 32(%%rsi)\n"
    "xorq %%r8, %%r8\n"
    /* (r9,r10,r8) += a2 * b3 */
    "movq %%rcx, %%rax\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    /* (r9,r10,r8) += a3 * b2 */
    "movq %%r15, %%rax\n"
    "mulq %%r13\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    /* Extract l5 */
    "movq %%r9, 40(%%rsi)\n"
    /* (r10,r8) += a3 * b3 */
    "movq %%r15, %%rax\n"
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    /* Extract l6 */
    "movq %%r10, 48(%%rsi)\n"
    /* Extract l7 */
    "movq %%r8, 56(%%rsi)\n"
    : "+d"(b)
    : "S"(l), "D"(a)
    : "rax", "rbx", "rcx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
}  

where for example

Code:
const uint64_t  a[4] =  {0x59f2815b16f81798, 0x029bfcdb2dce28d9, 0x55a06295ce870b07, 0x79be667ef9dcbbac};
const uint64_t  b[4] =  {0x9c47d08ffb10d4b8, 0xfd17b448a6855419, 0x5da4fbfc0e1108a8, 0x483ada7726a3c465};

uint64_t c[8] = {0};

secp256k1_scalar_mul_512(c, a, b);

printf("%016lx%016lx%016lx%016lx%016lx%016lx%016lx%016lx\n", c[7], c[6], c[5], c[4], c[3], c[2], c[1], c[0]);

Result:
225989dbbc349b6f319ca3eed777a46f55b1dc22e97af11261167d213c1f060d29520a21508989b06ed1194129efb1517cee385a708abe44718bc509775ad540

and then apply the function secp256k1_scalar_reduce_512 (from 512 bit to 256 bit mod n)

https://github.com/bitcoin-core/secp256k1/blob/master/src/scalar_4x64_impl.h#L274

to get the result mod n

https://github.com/bitcoin-core/secp256k1/blob/master/src/scalar_4x64_impl.h#L761-L765

Code:
805714a252d0c0b58910907e85b5b801fff610a36bdf46847a4bf5d9ae2d10ed

I got the same results with my function, with libsecp256k1 functions and with python3 too:

Code:
>>> n=0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
>>> a=0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
>>> b=0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
>>> hex(a*b)
'0x225989dbbc349b6f319ca3eed777a46f55b1dc22e97af11261167d213c1f060d29520a21508989b06ed1194129efb1517cee385a708abe44718bc509775ad540'
>>> hex(a*b % n)
'0x805714a252d0c0b58910907e85b5b801fff610a36bdf46847a4bf5d9ae2d10ed'


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 27, 2022, 05:43:41 PM
You have to distinguish between:

1) field operations (mod p, space of coordinates x and y):

https://github.com/bitcoin-core/secp256k1/tree/master/src/   all files with name : field*

and

2) scalar operations (mod n, space of private keys):

https://github.com/bitcoin-core/secp256k1/tree/master/src/  all files with name : scalar*


...

Alright.

There's one other thing to address: in the secp256k1_fe_mul (or something like that) function, the all but the last leg are multiplied by the constant R. This causes a result different from when I calculated an example in Python. So inside the fe_mul function, I need to modify it to avoid multiplying the values in the result (stack) by R, and send that multiplication to a temporary instead.


Title: Re: secp256k1 library in pure assembly
Post by: Pieter Wuille on December 28, 2022, 03:19:38 PM
Alright.

There's one other thing to address: in the secp256k1_fe_mul (or something like that) function, the all but the last leg are multiplied by the constant R. This causes a result different from when I calculated an example in Python. So inside the fe_mul function, I need to modify it to avoid multiplying the values in the result (stack) by R, and send that multiplication to a temporary instead.

That makes no sense; fe_mul just multiplies two field elements modulo p. That R constant is an implementation detail, that even differs between 32-bit and 64-bit platforms. It's not actually multiplying the result by this value.

Note that field elements are internally stored in a denormalized representation where the limbs can overflow. If you want to convert it to a portable format, use fe_get_b32.


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on December 28, 2022, 04:41:36 PM
Note that field elements are internally stored in a denormalized representation where the limbs can overflow. If you want to convert it to a portable format, use fe_get_b32.

That must be why I was getting different results while testing. I'll check out this function and run my C++ and Python mod-mul programs again. It will be interesting to see the results of this.


Title: Re: secp256k1 library in pure assembly
Post by: esa_a on February 09, 2023, 08:55:12 PM
And moreover VanitySeacrh is only fast at "Point pub = secp256k1->ComputePublicKey(&privKey);" since it generates a "Point GTable[256*32];"
to be used in "Q = Add2(Q, GTable[256 * i + (b-1)])" having Jacobian Coordinates representation as calculation scheme.
If you take Add2 to add one million points in a sequence it is faster than classic point_addition with inversion by far.
But to use any Jacobian point after calculation further you need to use "Q.Reduce();" beforehand the same thing as inversion.
And in that situation classic scheme using GMP becomes faster since you just add point and can use it further immediately without any Reduce().

Hey, did you try piggypiggy's implementation? Was it any faster?


Title: Re: secp256k1 library in pure assembly
Post by: alexeyneu on February 09, 2023, 11:03:30 PM
Ok, lets to to write it by parts using openAI

Code:
Write in assembly a code to add two numbers of 512 bits

Code:
; Define the two 512-bit numbers to be added
.data
num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF
num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA

; Define the 512-bit result
result: .quad 0, 0, 0, 0

; Main program
.text
.global main
main:

  ; Initialize the result to 0
  mov $0, %rax
  mov %rax, result(,%rax,8)

  ; Loop through each of the 64-bit words in the numbers
  mov $0, %rax
  loop:

    ; Add the corresponding 64-bit words from num1 and num2
    movq num1(,%rax,8), %rdx
    addq num2(,%rax,8), %rdx

    ; Store the result in the corresponding 64-bit word in the result
    mov %rdx, result(,%rax,8)

    ; Increment the loop counter and check if it has reached 4
    inc %rax
    cmp $4, %rax
    jne loop

  ; End the program
  ret
but dude it's sandbox asm

some kind of this stuff https://stackoverflow.com/questions/64726805/understanding-a-basic-assembly-code-with-lea-instruction#comment114452574_64726805

it's not executable nor ring zero os part . it's nothing really. why to use asm then?


Title: Re: secp256k1 library in pure assembly
Post by: NotATether on February 14, 2023, 08:15:08 AM
but dude it's sandbox asm

some kind of this stuff https://stackoverflow.com/questions/64726805/understanding-a-basic-assembly-code-with-lea-instruction#comment114452574_64726805

it's not executable nor ring zero os part . it's nothing really. why to use asm then?

There is really no point to be writing code in assembly that is not using instructions that are faster than the ones that gcc is compiling down to.

For example, there are a bunch of MOVs, CMPs, JMPs, Add/Xor/Lea instructions when you compile some C file down to assembly. There are only two ways to make this faster:

1 - you can somehow reformat the assembly to remove excessive MOVs, so that its using as few instructions as possible (will not result in a large performance improvement)
2 - your use case can be accelerated by SIMD instructions (will result in a much faster performance).


Title: Re: secp256k1 library in pure assembly
Post by: alexeyneu on February 14, 2023, 07:13:34 PM
but dude it's sandbox asm

some kind of this stuff https://stackoverflow.com/questions/64726805/understanding-a-basic-assembly-code-with-lea-instruction#comment114452574_64726805

it's not executable nor ring zero os part . it's nothing really. why to use asm then?

There is really no point to be writing code in assembly that is not using instructions that are faster than the ones that gcc is compiling down to.

For example, there are a bunch of MOVs, CMPs, JMPs, Add/Xor/Lea instructions when you compile some C file down to assembly. There are only two ways to make this faster:

1 - you can somehow reformat the assembly to remove excessive MOVs, so that its using as few instructions as possible (will not result in a large performance improvement)
2 - your use case can be accelerated by SIMD instructions (will result in a much faster performance).

but you can do simd in c. stuff like this _mm_setzero_si128() . in protected mode  os you executable is just a task with little to no hw access. so asm has no much sense . if you remove some mov's in exe it'll not increase performance
about this code(or smth like that)  posted here - it's something stylized to msdos real mode asm . it's unrelated to binary produced after it.