ecdsa123 (OP)
Jr. Member
Offline
Activity: 51
Merit: 107
|
1
|
|
|
|
PowerGlove
|
|
November 28, 2022, 01:45:03 PM |
|
Off-topic (sorry), but just wanted to say that seeing questions like this on Bitcointalk really cheers me up! I'm probably reading too much into it (nostalgia will sometimes do that to you), but someone asking for an assembly listing is (no joke) the highlight of my week (I really miss the old days). Anybody else remember downloading TASM/MASM/NASM over dial-up and working through binders of printed out tutorials?
|
|
|
|
garlonicon
Copper Member
Legendary
Offline
Activity: 926
Merit: 2217
|
|
November 28, 2022, 03:49:41 PM Last edit: November 28, 2022, 04:12:00 PM by garlonicon Merited by Welsh (4), vapourminer (1), ABCbits (1) |
|
I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version. If you are looking for a library, then check headers like "immintrin.h". The keyword is "intrinsics", for example x86 intrinsics list that means you can just call a C-like function in your code, and it will be converted to the pure assembly instruction when compiled. It will be easier to write some code in C++ or similar language, and call some functions, than writing everything in assembly, unless you know assembly very well. Anybody else remember downloading TASM/MASM/NASM over dial-up and working through binders of printed out tutorials? I used FASM and downloaded it in modern times, few years ago, when exploring how BIOS is constructed: http://flatassembler.net/I also used it some time ago when I tried to write my own operating system from scratch: https://wiki.osdev.org/Main_Pageso you will be shocked when you will check how many "optimised " are design in pure asm for differents mathematics problems in 2022 yr:) Well, assembly can speed things up if you can use the right opcodes, and if your processor supports it. In other cases, you may end up with code, that is correct, but not supported by your processor. So, the first thing is checking your hardware, and what is available, because some opcodes may trigger an error. Also, it is a high chance that using your CPU is not the best way of solving that, and if your code will be hardware-specific by design, then it may be profitable to prepare your code for some GPU or ASIC (but then you probably would need some custom hardware). Edit: Also note that typical compilers has some flags that can be used to optimize it for some architecture with a given features. Compilers like GCC can produce those instructions, so check them first, it may be faster than your code in assembly, unless you really know that language well.
|
|
|
|
PowerGlove
|
|
November 28, 2022, 05:26:53 PM |
|
I used FASM and downloaded it in modern times, few years ago, when exploring how BIOS is constructed: http://flatassembler.net/Yup. FASM is pretty special and Tomasz Grysztar is an exceptional programmer! Writing a self-hosting assembler puts him in a very small group. Nice one! 0xAA55 (and 0x7C00) is burned in my memory, too.
|
|
|
|
pooya87
Legendary
Offline
Activity: 3640
Merit: 11041
Crypto Swap Exchange
|
|
November 29, 2022, 03:50:38 AM |
|
I have check and analyse for comparison : sha256 in pure asm (rewrite by myself) and c++. in pure asm we have almost 120x faster than c++ (in line asm optimised)
Have you ever published this code or has anybody else (more specifically a c++ expert) seen the code because 120x speed up does not sound right to me unless the code written in c++ is bad or entirely different (eg. simple implementation of SHA256 vs using intel SHA intrinsics) or your benchmark could be flawed.
|
|
|
|
NotATether
Legendary
Offline
Activity: 1792
Merit: 7390
Top Crypto Casino
|
|
November 29, 2022, 06:02:18 PM |
|
Hello All
I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.
anybody knows?
I mean if you challenge me to do it, I might actually come up with an optimized secp256k1 ASM for GNU/Linux one day... who knows I'm already working on a version that uses GMP which itself is heavily optimized.
|
|
|
|
albert0bsd
|
|
December 01, 2022, 12:56:23 AM Last edit: December 01, 2022, 01:13:11 AM by albert0bsd |
|
I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.
Write code in ASM is really hard, i have a long time without write anythin in ASM by my self the last code that check in ASM and edit just some lines was the libaesni for some of my old projects. If there are any other developers interesting in write this code please let me know.
|
|
|
|
pooya87
Legendary
Offline
Activity: 3640
Merit: 11041
Crypto Swap Exchange
|
|
December 01, 2022, 03:59:44 AM |
|
as I see there is no known library for this
Writing an entire ECC library in ASM is impossible, we are talking about thousands of lines of code that would be a lot more in ASM and as I said before the benefits is not as great as you'd think. However parts of the code can be written in ASM like what libsecp256k1 does by writing the field element code in ASM.
|
|
|
|
PowerGlove
|
as I see there is no known library for this
Writing an entire ECC library in ASM is impossible, [...]I don't know about that, man; heavier lifts have been made before. FASM is one example (an assembler written in assembly). If I check how many lines of code that has: grep '^$' -rv ./fasm-1.73.30/fasm/source | wc -l I get 35483. Even an elaborate, fully-featured secp256k1 library in x86-64 assembly would fit (more than) comfortably in 1/4 of that. If that doesn't convince you (i.e. you feel that a significant fraction of FASM's source code is likely table-generated) then think of feats like the first RollerCoaster Tycoon game: Chris Sawyer wrote that in (99%) assembly. I don't know how familiar you are with gamedev, but something like RollerCoaster Tycoon completely dwarfs a secp256k1 library in terms of complexity.
|
|
|
|
albert0bsd
|
The main problem in secp256k1 is modulo p
Actually we only need a good framework to do operations with big numbers, this also using all the capabilities of modern CPU. Regards.
|
|
|
|
NotATether
Legendary
Offline
Activity: 1792
Merit: 7390
Top Crypto Casino
|
|
December 01, 2022, 06:04:10 PM |
|
The main problem in secp256k1 is modulo p
Actually we only need a good framework to do operations with big numbers, this also using all the capabilities of modern CPU. Regards. Literally this. Have you heard of GAP? It's a C language framework for doing huge integer math and knows about group theory and such. A really smart assembly guy recommended it to me a few months ago. https://www.gap-system.org/Download/
|
|
|
|
albert0bsd
|
|
December 02, 2022, 03:45:20 PM |
|
I used that library for some tools that I made but it is not optimized for secp256k1 also it is some kind of vulnerable to some side channels attacks and incomplete because it declare EC.b parameter but it never use. A lot of improvements can be made to that implementation. The fastest implementation for secp256k1 code that I ever see and use it is already inside of kangaroo tool. https://github.com/JeanLucPons/Kangaroo/tree/master/SECPK1Same library that I actually use in my keyhunt code.
|
|
|
|
NotATether
Legendary
Offline
Activity: 1792
Merit: 7390
Top Crypto Casino
|
|
December 02, 2022, 05:42:43 PM |
|
I wonder if there is a way to optimize it further though? Do you know whether it's making use of SSE? But even more important than that, maybe there's a series of assembly instructions you can run to run repeated calls faster. But since I use secp256k1 curve only for testing and research I do no care much for any of possible vulnerabilities and attacks.
The safest (not necessary the fastest) secp256k1 is the one used in Bitcoin Core. But I don't use it because I keep getting wrong answers when I do arithmetic. Maybe the privkey bytes are not being filled correctly or something.
|
|
|
|
albert0bsd
|
|
December 02, 2022, 07:11:36 PM |
|
And EC.b can be used here:
Thank you for Pointing where b is used, I recheck the code and that function is not there, anyway it is easy to add. Do you know whether it's making use of SSE? But even more important than that, maybe there's a series of assembly instructions you can run to run repeated calls faster.
The SSE are used in some other code of JLP those are used to generate 4 sha256 and 4 rmd160 hashes of fixed data at the same time. That is only useful when you are working with addresses or hashes rmd160 those SSE functions boost that part of the code some 2 or 3 times faster. Code here: https://github.com/JeanLucPons/VanitySearch/blob/master/SECP256K1.cpp#L380Functions are inside of the hash folder Also I use those in keyhunt for address and rmd160 search. About if that code can be optimized, I already check it a lot of times and it is already optimized. Regards
|
|
|
|
albert0bsd
|
|
December 03, 2022, 02:38:24 PM |
|
Ok, lets to to write it by parts using openAI Write in assembly a code to add two numbers of 512 bits ; Define the two 512-bit numbers to be added .data num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA
; Define the 512-bit result result: .quad 0, 0, 0, 0
; Main program .text .global main main:
; Initialize the result to 0 mov $0, %rax mov %rax, result(,%rax,8)
; Loop through each of the 64-bit words in the numbers mov $0, %rax loop:
; Add the corresponding 64-bit words from num1 and num2 movq num1(,%rax,8), %rdx addq num2(,%rax,8), %rdx
; Store the result in the corresponding 64-bit word in the result mov %rdx, result(,%rax,8)
; Increment the loop counter and check if it has reached 4 inc %rax cmp $4, %rax jne loop
; End the program ret
|
|
|
|
NotATether
Legendary
Offline
Activity: 1792
Merit: 7390
Top Crypto Casino
|
|
December 03, 2022, 04:25:27 PM |
|
Ok, lets to to write it by parts using openAI Write in assembly a code to add two numbers of 512 bits ; Define the two 512-bit numbers to be added .data num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA
; Define the 512-bit result result: .quad 0, 0, 0, 0
; Main program .text .global main main:
; Initialize the result to 0 mov $0, %rax mov %rax, result(,%rax,8)
; Loop through each of the 64-bit words in the numbers mov $0, %rax loop:
; Add the corresponding 64-bit words from num1 and num2 movq num1(,%rax,8), %rdx addq num2(,%rax,8), %rdx
; Store the result in the corresponding 64-bit word in the result mov %rdx, result(,%rax,8)
; Increment the loop counter and check if it has reached 4 inc %rax cmp $4, %rax jne loop
; End the program ret
Damn! I didn't know OpenAI could write code. I used Dall-E to generate images before, but I wasn't aware of anything like this. Yeah, I've used Copilot, but I haven't generated any assembly with it. This stuff could be very useful if it indeed works (AI generated code is sometimes buggy). It may not know how to generate a secp256k1 operation in ASM yet, but I think we'll get there soon (plus, ARM support!).
|
|
|
|
NotATether
Legendary
Offline
Activity: 1792
Merit: 7390
Top Crypto Casino
|
|
December 04, 2022, 04:57:42 PM |
|
why you will not use xmm and xmmword? it is 128 bit? less code, less operation, fastest when I finish i will upload. the problem is with modulo for my self:) try something like: %define arg1f XMM0 %define arg2f XMM1 %define arg3f XMM2 %define arg4f XMM3 %define arg1 RDI %define arg2 RSI %define arg3 RDX %define arg4 RC %define arg5 R8 %define arg6 R9
%macro MULT 1,2,3 ;Multi XMM0:XMM1:XMM4:XMM5 by XMM2:XMM3:XMM6:XMM7 movups arg3f,[%1+%3] movaps xmm7,arg3f
mulps arg3f,arg1f movshdup arg4f, arg3f addps arg3f, arg4f movaps xmm6,xmm7 movhlps arg4f, arg3f addss arg3f, arg4f movss [%2+%3], arg3f;
mulps xmm6,arg2f movshdup arg4f, xmm6 addps xmm6, arg4f movaps arg3f,xmm7 movhlps arg4f, xmm6 addss xmm6, arg4f movss [%2+4+%3], xmm6
mulps arg3f,xmm4 movshdup arg4f, arg3f addps arg3f, arg4f movaps xmm6,xmm7 movhlps arg4f, arg3f addss arg3f, arg4f movss [%2+8+%3], arg3f
mulps xmm6,xmm5 movshdup arg4f, xmm6 addps xmm6, arg4f movhlps arg4f, xmm6 addss xmm6, arg4f movss [%2+8+4+%3], xmm6
%endmacro
Are you saving some bits at the end of each dword for the carry? Otherwise you're going to lose accuracy in the final answer, because SIMD stuff will just overflow instead of carry. I was thinking of applying the excellent technique of 5x52-bit numbers used in libsecp256k1, where we use 5 quadwords that each hold 52 bits of the real number each (except for the most significant qword which holds less). There's also a 10x26-bit variant for 32-bit machines. If that technique is applied, we can defer the carry and add the numbers hundreds of times without corruption. We would only need 3 SSE adds, or 2 AVX adds in that case. But I have heard that using the AVX instructions incurs some kind of speed penalty like AVX-512? An alternative is just stuffing them in the R8-R15 registers and RAX/RBX for the 5 quadwords for both operands and then clobber one of the operands with the sum.
|
|
|
|
j2002ba2
|
|
December 05, 2022, 05:48:07 PM |
|
Ok, lets to to write it by parts using openAI ... addq num2(,%rax,8), %rdx ...
As usual, the so called AI gave wrong result. The carry flag is completely ignored, so instead it adds two vectors, each having four 64bit numbers.
|
|
|
|
NotATether
Legendary
Offline
Activity: 1792
Merit: 7390
Top Crypto Casino
|
|
December 06, 2022, 03:39:53 AM |
|
Ok, lets to to write it by parts using openAI ... addq num2(,%rax,8), %rdx ...
As usual, the so called AI gave wrong result. The carry flag is completely ignored, so instead it adds two vectors, each having four 64bit numbers. This is a recurring pattern I've seen in the AI-generated results. Nearly all of them have comments, so I know that the ones that don't talk about the point additions expressions are wrong, and there are also samples where I saw arithmetic using just "eax" instead of the necessary 4/5 64-bit registers. There were instances where I got merely 256-bit number addition instead of point addition.
|
|
|
|
albert0bsd
|
|
December 22, 2022, 02:30:38 AM Merited by PowerGlove (1) |
|
I have done implement parts of secp256k1 in pure asm
I would like to inform that perofrmence is fucking fast (for me):
100 000 000 performs modulo n on secp256k1 vals:
Can you explain what kind of numbers are those 100 Million inputs. What is your hardware and against what other implementation are you comparing your results?
|
|
|
|
|