1

November 28, 2022, 01:45:03 PM

Off-topic (sorry), but just wanted to say that seeing questions like this on Bitcointalk really cheers me up!

I'm probably reading too much into it (nostalgia will sometimes do that to you), but someone asking for an assembly listing is (no joke) the highlight of my week (I really miss the old days).

Anybody else remember downloading TASM/MASM/NASM over dial-up and working through binders of printed out tutorials? Grin

garlonicon

Legendary

Offline

Activity: 852
Merit: 2089

Pawns are the soul of chess

November 28, 2022, 03:49:41 PM
Last edit: November 28, 2022, 04:12:00 PM by garlonicon

Merited by Welsh (4), vapourminer (1), ABCbits (1)

Quote

I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.

If you are looking for a library, then check headers like "immintrin.h". The keyword is "intrinsics", for example x86 intrinsics list that means you can just call a C-like function in your code, and it will be converted to the pure assembly instruction when compiled. It will be easier to write some code in C++ or similar language, and call some functions, than writing everything in assembly, unless you know assembly very well.

Quote

Anybody else remember downloading TASM/MASM/NASM over dial-up and working through binders of printed out tutorials?

I used FASM and downloaded it in modern times, few years ago, when exploring how BIOS is constructed: http://flatassembler.net/
I also used it some time ago when I tried to write my own operating system from scratch: https://wiki.osdev.org/Main_Page

Quote

so you will be shocked when you will check how many "optimised " are design in pure asm for differents mathematics problems in 2022 yr:)

Well, assembly can speed things up if you can use the right opcodes, and if your processor supports it. In other cases, you may end up with code, that is correct, but not supported by your processor. So, the first thing is checking your hardware, and what is available, because some opcodes may trigger an error. Also, it is a high chance that using your CPU is not the best way of solving that, and if your code will be hardware-specific by design, then it may be profitable to prepare your code for some GPU or ASIC (but then you probably would need some custom hardware).

Edit: Also note that typical compilers has some flags that can be used to optimize it for some architecture with a given features. Compilers like GCC can produce those instructions, so check them first, it may be faster than your code in assembly, unless you really know that language well.

My pizza is ready. Find the brainwallet. Good luck: https://mempool.space/testnet4/tx/914a6348ba832b79c631a6f70ea9e2c3e1e433ee0b90633d51e3fe1164c05dc0

PowerGlove

Hero Member

Offline

Activity: 576
Merit: 4895

Quote from: garlonicon on November 28, 2022, 03:49:41 PM

November 28, 2022, 05:26:53 PM

I used FASM and downloaded it in modern times, few years ago, when exploring how BIOS is constructed: http://flatassembler.net/

Yup. FASM is pretty special and Tomasz Grysztar is an exceptional programmer! Writing a self-hosting assembler puts him in a very small group.

Quote from: garlonicon on November 28, 2022, 03:49:41 PM

I also used it some time ago when I tried to write my own operating system from scratch: https://wiki.osdev.org/Main_Page

Nice one! 0xAA55 (and 0x7C00) is burned in my memory, too. Grin

pooya87

Legendary

Offline

Activity: 3570
Merit: 10851

Quote from: ecdsa123 on November 28, 2022, 01:55:38 PM

November 29, 2022, 03:50:38 AM

Merited by NeuroticFish (2), ABCbits (2), NotATether (2), garlonicon (2), PowerGlove (2), vapourminer (1), DdmrDdmr (1)

I have check and analyse for comparison : sha256 in pure asm (rewrite by myself) and c++.
in pure asm we have almost 120x faster than c++ (in line asm optimised)

Have you ever published this code or has anybody else (more specifically a c++ expert) seen the code because 120x speed up does not sound right to me unless the code written in c++ is bad or entirely different (eg. simple implementation of SHA256 vs using intel SHA intrinsics) or your benchmark could be flawed.

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

NotATether

Legendary

Offline

Activity: 1722
Merit: 7250

In memory of o_e_l_e_o

Quote from: ecdsa123 on November 28, 2022, 12:48:10 PM

November 29, 2022, 06:02:18 PM

Hello All

I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.

anybody knows?

I mean if you challenge me to do it, I might actually come up with an optimized secp256k1 ASM for GNU/Linux one day... who knows Cheesy

I'm already working on a version that uses GMP which itself is heavily optimized.

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

albert0bsd

Hero Member

Offline

Activity: 930
Merit: 677

Quote from: ecdsa123 on November 28, 2022, 12:48:10 PM

December 01, 2022, 12:56:23 AM
Last edit: December 01, 2022, 01:13:11 AM by albert0bsd

I'm looking for library written in pure nasm/masm - assembly for x8086-64 (not arm) Intel version.

Write code in ASM is really hard, i have a long time without write anythin in ASM by my self the last code that check in ASM and edit just some lines was the libaesni for some of my old projects.

If there are any other developers interesting in write this code please let me know.

pooya87

Legendary

Offline

Activity: 3570
Merit: 10851

Quote from: ecdsa123 on November 30, 2022, 03:06:51 PM

December 01, 2022, 03:59:44 AM

as I see there is no known library for this

Writing an entire ECC library in ASM is impossible, we are talking about thousands of lines of code that would be a lot more in ASM and as I said before the benefits is not as great as you'd think. However parts of the code can be written in ASM like what libsecp256k1 does by writing the field element code in ASM.

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

PowerGlove

Hero Member

Offline

Activity: 576
Merit: 4895

Quote from: pooya87 on December 01, 2022, 03:59:44 AM

December 01, 2022, 09:23:44 AM

Merited by Welsh (4), ABCbits (3), pooya87 (2), vapourminer (1), npuath (1)

Quote from: ecdsa123 on November 30, 2022, 03:06:51 PM

as I see there is no known library for this

Writing an entire ECC library in ASM is impossible, [...]

I don't know about that, man; heavier lifts have been made before. FASM is one example (an assembler written in assembly). If I check how many lines of code that has:

Code:

grep '^$' -rv ./fasm-1.73.30/fasm/source | wc -l

I get 35483. Even an elaborate, fully-featured secp256k1 library in x86-64 assembly would fit (more than) comfortably in 1/4 of that.

If that doesn't convince you (i.e. you feel that a significant fraction of FASM's source code is likely table-generated) then think of feats like the first RollerCoaster Tycoon game: Chris Sawyer wrote that in (99%) assembly. I don't know how familiar you are with gamedev, but something like RollerCoaster Tycoon completely dwarfs a secp256k1 library in terms of complexity.

albert0bsd

Hero Member

Offline

Activity: 930
Merit: 677

Quote from: ecdsa123 on December 01, 2022, 10:44:00 AM

December 01, 2022, 12:33:33 PM

Merited by Welsh (4), NotATether (3)

#10

The main problem in secp256k1 is modulo p

Actually we only need a good framework to do operations with big numbers, this also using all the capabilities of modern CPU.

Regards.

NotATether

Legendary

Offline

Activity: 1722
Merit: 7250

In memory of o_e_l_e_o

Quote from: albert0bsd on December 01, 2022, 12:33:33 PM

December 01, 2022, 06:04:10 PM

#11

Quote from: ecdsa123 on December 01, 2022, 10:44:00 AM

The main problem in secp256k1 is modulo p

Actually we only need a good framework to do operations with big numbers, this also using all the capabilities of modern CPU.

Regards.

Literally this.

Have you heard of GAP? It's a C language framework for doing huge integer math and knows about group theory and such. A really smart assembly guy recommended it to me a few months ago.

https://www.gap-system.org/Download/

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

albert0bsd

Hero Member

Offline

Activity: 930
Merit: 677

Quote from: AlexanderCurl on December 02, 2022, 04:20:19 AM

December 02, 2022, 03:45:20 PM

#12

also some good ecc implementation based on gmp: https://github.com/masterzorag/ec_gmp

I used that library for some tools that I made but it is not optimized for secp256k1 also it is some kind of vulnerable to some side channels attacks and incomplete because it declare EC.b parameter but it never use.

A lot of improvements can be made to that implementation.

The fastest implementation for secp256k1 code that I ever see and use it is already inside of kangaroo tool.

https://github.com/JeanLucPons/Kangaroo/tree/master/SECPK1

Same library that I actually use in my keyhunt code.

NotATether

Legendary

Offline

Activity: 1722
Merit: 7250

In memory of o_e_l_e_o

Quote from: albert0bsd on December 02, 2022, 03:45:20 PM

December 02, 2022, 05:42:43 PM

#13

The fastest implementation for secp256k1 code that I ever see and use it is already inside of kangaroo tool.

https://github.com/JeanLucPons/Kangaroo/tree/master/SECPK1

I wonder if there is a way to optimize it further though? Do you know whether it's making use of SSE? But even more important than that, maybe there's a series of assembly instructions you can run to run repeated calls faster.

Quote from: AlexanderCurl on December 02, 2022, 05:39:30 PM

But since I use secp256k1 curve only for testing and research I do no care much for any of possible vulnerabilities and attacks.

The safest (not necessary the fastest) secp256k1 is the one used in Bitcoin Core. But I don't use it because I keep getting wrong answers when I do arithmetic. Maybe the privkey bytes are not being filled correctly or something.

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

albert0bsd

Hero Member

Offline

Activity: 930
Merit: 677

Quote from: AlexanderCurl on December 02, 2022, 05:39:30 PM

December 02, 2022, 07:11:36 PM

#14

And EC.b can be used here:

Thank you for Pointing where b is used, I recheck the code and that function is not there, anyway it is easy to add.

Quote from: NotATether on December 02, 2022, 05:42:43 PM

Do you know whether it's making use of SSE? But even more important than that, maybe there's a series of assembly instructions you can run to run repeated calls faster.

The SSE are used in some other code of JLP those are used to generate 4 sha256 and 4 rmd160 hashes of fixed data at the same time.

That is only useful when you are working with addresses or hashes rmd160 those SSE functions boost that part of the code some 2 or 3 times faster.

Code here:

https://github.com/JeanLucPons/VanitySearch/blob/master/SECP256K1.cpp#L380

Functions are inside of the hash folder

Also I use those in keyhunt for address and rmd160 search.

About if that code can be optimized, I already check it a lot of times and it is already optimized.

Regards

albert0bsd

Hero Member

Offline

Activity: 930
Merit: 677

December 03, 2022, 02:38:24 PM

Merited by NotATether (5), ABCbits (2), vapourminer (1)

#15

Ok, lets to to write it by parts using openAI

Code:

Write in assembly a code to add two numbers of 512 bits

Code:

; Define the two 512-bit numbers to be added
.data
num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF
num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA

; Define the 512-bit result
result: .quad 0, 0, 0, 0

; Main program
.text
.global main
main:

  ; Initialize the result to 0
  mov $0, %rax
  mov %rax, result(,%rax,8)

  ; Loop through each of the 64-bit words in the numbers
  mov $0, %rax
  loop:

    ; Add the corresponding 64-bit words from num1 and num2
    movq num1(,%rax,8), %rdx
    addq num2(,%rax,8), %rdx

    ; Store the result in the corresponding 64-bit word in the result
    mov %rdx, result(,%rax,8)

    ; Increment the loop counter and check if it has reached 4
    inc %rax
    cmp $4, %rax
    jne loop

  ; End the program
  ret

NotATether

Legendary

Offline

Activity: 1722
Merit: 7250

In memory of o_e_l_e_o

Quote from: albert0bsd on December 03, 2022, 02:38:24 PM

December 03, 2022, 04:25:27 PM

#16

Ok, lets to to write it by parts using openAI

Code:

Write in assembly a code to add two numbers of 512 bits

Code:

; Define the two 512-bit numbers to be added
.data
num1: .quad 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF, 0x1234567890ABCDEF
num2: .quad 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA, 0x0987654321FEDCBA

; Define the 512-bit result
result: .quad 0, 0, 0, 0

; Main program
.text
.global main
main:

  ; Initialize the result to 0
  mov $0, %rax
  mov %rax, result(,%rax,8)

  ; Loop through each of the 64-bit words in the numbers
  mov $0, %rax
  loop:

    ; Add the corresponding 64-bit words from num1 and num2
    movq num1(,%rax,8), %rdx
    addq num2(,%rax,8), %rdx

    ; Store the result in the corresponding 64-bit word in the result
    mov %rdx, result(,%rax,8)

    ; Increment the loop counter and check if it has reached 4
    inc %rax
    cmp $4, %rax
    jne loop

  ; End the program
  ret

Damn! I didn't know OpenAI could write code.

I used Dall-E to generate images before, but I wasn't aware of anything like this. Yeah, I've used Copilot, but I haven't generated any assembly with it.

This stuff could be very useful if it indeed works (AI generated code is sometimes buggy). It may not know how to generate a secp256k1 operation in ASM yet, but I think we'll get there soon (plus, ARM support!).

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

NotATether

Legendary

Offline

Activity: 1722
Merit: 7250

In memory of o_e_l_e_o

Quote from: ecdsa123 on December 03, 2022, 07:33:02 PM

December 04, 2022, 04:57:42 PM

#17

why you will not use xmm and xmmword? it is 128 bit? less code, less operation, fastest

when I finish i will upload.

the problem is with modulo for my self:)

try something like:

Code:

    %define arg1f XMM0
    %define arg2f XMM1
    %define arg3f XMM2
    %define arg4f XMM3
    %define arg1 RDI
    %define arg2 RSI
    %define arg3 RDX
    %define arg4 RC
    %define arg5 R8
    %define arg6 R9 


%macro MULT 1,2,3
;Multi XMM0:XMM1:XMM4:XMM5 by XMM2:XMM3:XMM6:XMM7
        movups arg3f,[%1+%3]
        movaps xmm7,arg3f

        mulps  arg3f,arg1f
        movshdup    arg4f, arg3f
        addps       arg3f, arg4f
        movaps xmm6,xmm7
        movhlps     arg4f, arg3f
        addss       arg3f, arg4f
        movss  [%2+%3], arg3f;

        mulps  xmm6,arg2f
        movshdup    arg4f, xmm6
        addps       xmm6, arg4f
        movaps arg3f,xmm7
        movhlps     arg4f, xmm6
        addss       xmm6, arg4f
        movss  [%2+4+%3], xmm6 

        mulps  arg3f,xmm4
        movshdup    arg4f, arg3f
        addps       arg3f, arg4f
        movaps xmm6,xmm7
        movhlps     arg4f, arg3f
        addss       arg3f, arg4f
        movss  [%2+8+%3], arg3f

        mulps  xmm6,xmm5
        movshdup    arg4f, xmm6
        addps       xmm6, arg4f
        movhlps     arg4f, xmm6
        addss       xmm6, arg4f
        movss  [%2+8+4+%3], xmm6

%endmacro

Are you saving some bits at the end of each dword for the carry? Otherwise you're going to lose accuracy in the final answer, because SIMD stuff will just overflow instead of carry.

I was thinking of applying the excellent technique of 5x52-bit numbers used in libsecp256k1, where we use 5 quadwords that each hold 52 bits of the real number each (except for the most significant qword which holds less). There's also a 10x26-bit variant for 32-bit machines.

If that technique is applied, we can defer the carry and add the numbers hundreds of times without corruption.

We would only need 3 SSE adds, or 2 AVX adds in that case. But I have heard that using the AVX instructions incurs some kind of speed penalty like AVX-512?

An alternative is just stuffing them in the R8-R15 registers and RAX/RBX for the 5 quadwords for both operands and then clobber one of the operands with the sum.

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

j2002ba2

Full Member

Offline

Activity: 206
Merit: 447

Quote from: albert0bsd on December 03, 2022, 02:38:24 PM

December 05, 2022, 05:48:07 PM

#18

Ok, lets to to write it by parts using openAI

Code:

...
    addq num2(,%rax,8), %rdx
...

As usual, the so called AI gave wrong result. The carry flag is completely ignored, so instead it adds two vectors, each having four 64bit numbers.

NotATether

Legendary

Offline

Activity: 1722
Merit: 7250

In memory of o_e_l_e_o

Quote from: j2002ba2 on December 05, 2022, 05:48:07 PM

December 06, 2022, 03:39:53 AM

#19

Quote from: albert0bsd on December 03, 2022, 02:38:24 PM

Ok, lets to to write it by parts using openAI

Code:

...
    addq num2(,%rax,8), %rdx
...

As usual, the so called AI gave wrong result. The carry flag is completely ignored, so instead it adds two vectors, each having four 64bit numbers.

This is a recurring pattern I've seen in the AI-generated results. Nearly all of them have comments, so I know that the ones that don't talk about the point additions expressions are wrong, and there are also samples where I saw arithmetic using just "eax" instead of the necessary 4/5 64-bit registers.

There were instances where I got merely 256-bit number addition instead of point addition.

██
██
██
██
██
██
██
██
██
██
██
██
██

██
██
██
██
██
██
██
██
██
██
██
██
██

BTC XMR ETH TON 800+ coins

██
██
██
██
██
██
██
██
██
██
██
██
██

albert0bsd

Hero Member

Offline

Activity: 930
Merit: 677