Author Topic: secp256k1 library and Intel cpu  (Read 3949 times)
arulbero (OP), Legendary
January 05, 2017, 11:37:48 PM
Merited by ABCbits (1)
 #1

Does the secp256k1 library exploit the new instructions on Intel CPUs for large-integer arithmetic?

https://software.intel.com/en-us/blogs/2014/06/11/optimizing-big-data-processing-with-haswell-256-bit-integer-simd-instructions

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

achow101, Moderator, Legendary
January 05, 2017, 11:41:23 PM
 #2

Probably not.

gmaxwell, Moderator, Legendary
January 06, 2017, 09:41:44 AM
Last edit: January 14, 2017, 04:46:02 PM by gmaxwell
Merited by ABCbits (2)
 #3

The content in your first link would likely not be helpful.

The content in your second link would be beneficial, though potentially not by much, because the dependency chain in the arithmetic is hand-constructed to reduce conflicts.  If someone would like to try them out, it shouldn't be very hard.
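
(A minimal sketch of the kind of carry-chain code that paper is about, using the _addcarry_u64 intrinsic; this assumes an x86-64 compiler providing <immintrin.h>, it is not libsecp256k1 code, and the function name add_256 is made up for illustration.)

Code:
#include <immintrin.h>

/* Illustration only: add two 256-bit numbers held as four 64-bit limbs.
 * The point of the MULX/ADCX/ADOX instructions described in the paper is
 * that two carry chains like this one can be kept in flight at once,
 * instead of serializing everything on a single carry flag. */
static void add_256(unsigned long long r[4],
                    const unsigned long long a[4],
                    const unsigned long long b[4]) {
    unsigned char carry = 0;
    carry = _addcarry_u64(carry, a[0], b[0], &r[0]);
    carry = _addcarry_u64(carry, a[1], b[1], &r[1]);
    carry = _addcarry_u64(carry, a[2], b[2], &r[2]);
    (void)_addcarry_u64(carry, a[3], b[3], &r[3]);
}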

In the context of Bitcoin development (thus achow101's response), signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.
arulbero (OP), Legendary
January 06, 2017, 11:27:40 AM
 #4

Quote
Signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.

I thought signature verification was the biggest limiting factor in the IBD process (obviously with a slow CPU).
gmaxwell, Moderator, Legendary
January 06, 2017, 11:36:11 AM
Merited by ABCbits (1)
 #5

Quote
I thought signature verification was the biggest limiting factor in the IBD process (obviously with a slow CPU).

On ARM, which doesn't yet use the libsecp256k1 assembly optimizations by default -- perhaps. Otherwise, no, not since libsecp256k1 made signature validation many times faster. Maybe on some really unbalanced system with a slow single-core CPU, a huge dbcache, a fast network and an SSD it might be the majority.
rico666, Legendary
January 08, 2017, 10:22:21 AM
 #6

I am also interested in a faster secp256k1.

Unfortunately, it seems Pieter Wuille et al. were neither serious about providing a fast secp256k1 nor about documenting it well.  Roll Eyes

I have done some hacking on some pretty basic functions in there

https://bitcointalk.org/index.php?topic=1573035.msg17365068#msg17365068

- which alone made the LBC generator about 10% faster overall -
and later also changed the field_5x52 code in secp256k1_fe_set_b32 to:

Code:
    r->n[0] = (uint64_t)a[31]
            | (uint64_t)a[30] << 8
            | (uint64_t)a[29] << 16
            | (uint64_t)a[28] << 24
            | (uint64_t)a[27] << 32
            | (uint64_t)a[26] << 40
            | (uint64_t)(a[25] & 0xF)  << 48;

    r->n[1] = (uint64_t)((a[25] >> 4) & 0xF)
            | (uint64_t)a[24] << 4
            | (uint64_t)a[23] << 12
            | (uint64_t)a[22] << 20
            | (uint64_t)a[21] << 28
            | (uint64_t)a[20] << 36
            | (uint64_t)a[19] << 44;

    r->n[2] = (uint64_t)a[18]
            | (uint64_t)a[17] << 8
            | (uint64_t)a[16] << 16
            | (uint64_t)a[15] << 24
            | (uint64_t)a[14] << 32
            | (uint64_t)a[13] << 40
            | (uint64_t)(a[12] & 0xF) << 48;

    r->n[3] = (uint64_t)((a[12] >> 4) & 0xF)
            | (uint64_t)a[11] << 4
            | (uint64_t)a[10] << 12
            | (uint64_t)a[9]  << 20
            | (uint64_t)a[8]  << 28
            | (uint64_t)a[7]  << 36
            | (uint64_t)a[6]  << 44;

    r->n[4] = (uint64_t)a[5]
            | (uint64_t)a[4] << 8
            | (uint64_t)a[3] << 16
            | (uint64_t)a[2] << 24
            | (uint64_t)a[1] << 32
            | (uint64_t)a[0] << 40;

I have been pointed to https://github.com/llamasoft/secp256k1_fast_unsafe but am not sure if that is still maintained/developed.
Is there any place where R&D discussion about further development is ongoing? Before I'd start reimplementing that mess from scratch, I'd prefer to participate in some "official endeavor".

However, I have little hope that will make sense:

Quote
Signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.

Together with other statements from gmaxwell @ github ("this is alpha, don't expect other doc than source..." - something like that, from memory), it seems to me there is not much motivation for further development from the "official side".


Rico

AlexGR, Legendary
January 08, 2017, 11:41:37 AM
Last edit: March 19, 2017, 09:15:13 AM by AlexGR
 #7

Quote
I am also interested in a faster secp256k1.

I've been tampering with the asm a bit and have had some luck, especially in scalar performance, by merging the various stages into one function. However, do note that it's gcc-only (clang had some issues with rsi/rdi use in the manually inlined function; to work around it one has to move rsi or rdi to some xmm register and restore it before the function ends - which adds a few cycles).

https://github.com/Alex-GR/secp256k1/blob/master/src/scalar_4x64_impl.h
https://github.com/Alex-GR/secp256k1/blob/master/src/field_5x52_asm_impl.h

In the 5x52 asm I found 2-3% by simply doing what the CPU scheduler should be doing and issuing the adds and muls together (the CPU integer unit typically has one add unit and one mul unit, and ideally we want to be using them at the same time).

For example:

/* d += a3 * b1 */
    "movq 8(%%rbx),%%rax\n"
    "mulq %%r13\n"
    "addq %%rax,%%rcx\n"
    "adcq %%rdx,%%r15\n"
    /* d += a2 * b2 */
    "movq 16(%%rbx),%%rax\n" <=== this must be moved upwards
    "mulq %%r12\n"

becomes:

/* d += a3 * b1 */
    "movq 8(%%rbx),%%rax\n"
    "mulq %%r13\n"
    "addq %%rax,%%rcx\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%r15\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"

So the adcq and mulq are issued together to the add and mul units of the CPU respectively. (Edit: apparently the dedicated mul+add units are in the SIMD part; in the integer part it's just 3 integer units waiting to do parallel work of any kind - except load/store, which is 2 at a time.) I'm baffled as to why the CPU scheduler wasn't already doing this, but then again I have an older CPU (Core 2 Quad / 45 nm) to play with - it might not be an issue with modern ones.

Also, of the three temporary variables used for storage in the field asm, some can be eliminated by rearranging the code a bit, thus saving a couple of memory accesses.

(ps. I don't claim the code is safe - I've only used it for benchmarks and the builtin test)

One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?
piotr_n, Legendary
January 08, 2017, 01:14:55 PM
Last edit: January 08, 2017, 01:26:47 PM by piotr_n
 #8

The original secp256k1 lib is a great piece of work, but it surely can be improved even more.

Any optimizations are very welcome - I'm sure sipa (the original author) would agree. He's a very nice guy and it's easy to contact him directly (probably best via IRC). Although, regarding the code used by Core, he'll probably tell you that it's being maintained by other people now.

Otherwise, just publish your improvements here - even if not for Bitcoin Core, someone else will use them, trust me, because time is money Smiley

gmaxwell, Moderator, Legendary
January 09, 2017, 11:50:01 AM
Last edit: January 09, 2017, 12:50:14 PM by gmaxwell
 #9

Quote
I am also interested in a faster secp256k1.

Unfortunately, it seems Pieter Wuille et al. were neither serious about providing a fast secp256k1 nor about documenting it well.  Roll Eyes

Can you show another library for the same application within a factor of _five_ of the performance?  Or with more than 1/5th the documentation?

I'm just ... beside myself at your comment, it's not even insulting: it's just too absurd.

Quote
- which alone made the LBC generator about 10% faster overall -
and later also changing the field_5x52 code in secp256k1_fe_set_b32 to
Thanks; that function is not performance-critical in anything the library was intended for ... e.g. not brainwallet cracking. But making it faster would be fine.

You might have noticed that the constant unpacking macro does effectively the same thing: https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52.h#L22 but a word at a time.
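
(To make the comparison concrete: unpacking a word at a time into the same 5x52 limb split looks roughly like the sketch below, assuming eight 32-bit words d[0] (least significant) through d[7]. It is only an illustration of the idea, not the library's macro, and the function name is made up.)

Code:
#include <stdint.h>

/* Illustration only: combine eight 32-bit words d[0] (least significant)
 * through d[7] into five limbs of 52/52/52/52/48 bits, i.e. the same split
 * that secp256k1_fe_set_b32 produces byte by byte, done a word at a time. */
static void fe_from_words(uint64_t n[5], const uint32_t d[8]) {
    n[0] = (uint64_t)d[0]
         | ((uint64_t)(d[1] & 0xFFFFFUL) << 32);
    n[1] = (uint64_t)(d[1] >> 20)
         | ((uint64_t)d[2] << 12)
         | ((uint64_t)(d[3] & 0xFFUL) << 44);
    n[2] = (uint64_t)(d[3] >> 8)
         | ((uint64_t)(d[4] & 0xFFFFFFFUL) << 24);
    n[3] = (uint64_t)(d[4] >> 28)
         | ((uint64_t)d[5] << 4)
         | ((uint64_t)(d[6] & 0xFFFFUL) << 36);
    n[4] = (uint64_t)(d[6] >> 16)
         | ((uint64_t)d[7] << 16);
}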

If you open a PR on the function you should add it to the bench_internal benchmarks as you go.

Quote
Is there any place where R&D discussion about further development is ongoing? Before I'd start re-implementing that mess from scratch I'd prefer to participate in some "official endeavor".

same place it's always been https://github.com/bitcoin-core/secp256k1/

Quote
However, I have little hope that will make sense:
Quote
Signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.
... just because something isn't a limiting factor isn't a reason not to improve it.


Quote
Together with other statements from gmaxwell @ github ("this is alpha/, don't expect other doc shan source.." - something like that from memory) I see there seems not much motivation in further development from the "official side".
Disinterest in spoon-feeding people who sound like wanna-be thieves whose thefts would scare ordinary people away from Bitcoin, and who can't even be bothered to RTFM (there has been fairly detailed documentation for the library for years), is not evidence of a lack of interest in further development.

Quote
Any optimizations are very welcome - I'm sure sipa (the original author) would agree. He's a very nice guy and it's easy to contact him directly (probably best via IRC). Although, about the code used by the core, he'll probably tell you that it's being maintained by other people now.
Uh, what?


Quote
In 5x52 asm I found 2-3% by simply doing what the cpu scheduler should be doing and issuing together the adds + muls (the cpu integer unit typically has one add and one mul unit and ideally we want to be using them at the same time).
Awesome! PRs welcome!

Quote
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?
It is potentially patent-encumbered for another couple of years, restricting it to experimental use.

Quote
So the adcq+mulq are issued together to the mul and add units of the cpu respectively. I'm baffled on why the cpu scheduler wasn't already doing this but then again I do have an older cpu (core2 quad / 45nm) to play with - it might not be an issue with modern ones.
Generally newer CPUs work better, but the performance is also more important on older ones (and on in-order CPUs like the Atoms) since they're slower to begin with.
rico666, Legendary
January 13, 2017, 09:55:40 AM
 #10

Quote
Can you show another library for the same application within a factor of _five_ of the performance? Or with more than 1/5th the documentation?

I'm just ... beside myself at your comment, it's not even insulting: it's just too absurd.

Good. We seem to live in different worlds then, each other's world seeming absurd to the other.

I, for one, find your "argument" of requesting to be shown another lib doing the same thing within a factor of five of the performance, or with 1/5th the documentation, quite absurd. It's like claiming that anything in the world by definition cannot suck when there is nothing else that sucks less. Really absurd.

Quote
You might have noticed that the constant unpacking macro does effectively the same thing: https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52.h#L22 but a word at a time.

No, I haven't, because reading the secp256k1 code is a major PITA, but thanks for the pointer. That wheel would probably be the next thing I'd have reinvented.

Quote
If you open a PR on the function you should add it to the bench_internal benchmarks as you go.

What's a PR? Press release? Ah, pull request, I suppose. You're assuming too much, e.g. that the whole world runs on - or is - git.

Quote
Quote
Is there any place where R&D discussion about further development is ongoing? Before I'd start re-implementing that mess from scratch I'd prefer to participate in some "official endeavor".

same place it's always been https://github.com/bitcoin-core/secp256k1/

If that is the "discussion" place, no wonder I didn't see it. Srsly?

Quote
Disinterest in spoon-feeding people who sound like wanna-be thieves whose thefts would scare ordinary people away from Bitcoin, and who can't even be bothered to RTFM (there has been fairly detailed documentation for the library for years), is not evidence of a lack of interest in further development.

My dear young gmaxwell: evidently, after you did the diligent work of opening a PR Smiley, benchmarked the code in the process, reformatted it for the worse but reordered it for the better, and found out and stated that my code is about 6-7 orders of magnitude faster than the original code (and I am not primarily a C hacker) ... you have the guts to use the term "wanna-be" when addressing any of your texts towards me?

The mere fact that such tremendously suboptimal code was in there for such a long time should teach you something. Let me assure you, your perception of the situation here is skewed at best. From what I see in the secp256k1 lib, the collective who did the job are good programmers with potential. Motivated, young, inexperienced, but potential.

If it weren't for this LBC hobby project of mine, I would never have looked into the mess that is the secp256k1 library. I did, and I commented. You don't like the comment, maybe feel it is insulting, puzzling or absurd. Fine. I have high hopes that before you are in your mid-40s you will understand what my comment was about. As I said: "potential".

Rationalizing the poor state of an open source project is something I have come across in the Linux world since the 90s. To me, your statements are neither new nor original. So should you - some day - come to the conclusion that you could attract a certain kind of programmer once the open source software reaches a certain level of quality, it'd be swell. Let me tell you, it's not about "spoon-feeding people who sound like wanna-be thieves". It's about preparing the ground for people who may have twice your experience and otherwise scoff at the project.

No offence - ofc.

Rico

rico666, Legendary
January 13, 2017, 11:54:48 AM
 #11

Quote
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look at benchmarks on my system, USE_ENDOMORPHISM is nothing you'd want to enable:

Code:
Times for tests:

gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real    0m14.365s
user    0m14.357s
sys     0m0.007s

2) CFLAGS -O3 -march=sklake
real    0m13.549s
user    0m13.547s
sys     0m0.000s

3) CFLAGS -O3 -march=sklake & USE_ENDOMORPHISM 1
real    0m15.660s
user    0m15.660s
sys     0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real    0m16.139s
user    0m16.137s
sys     0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real    0m14.849s
user    0m14.847s
sys     0m0.000s

6) CFLAGS -O3 -march=sklake & undef USE_ASM_X86_64
real    0m14.520s
user    0m14.517s
sys     0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism.
On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).

Rico

AlexGR, Legendary
January 13, 2017, 12:30:47 PM
Last edit: January 13, 2017, 12:42:27 PM by AlexGR
 #12

Quote
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look on benchmarks on my system, USE_ENDOMORPHISM is nothing you'd like to enable:

Code:
Times for tests:

gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real    0m14.365s
user    0m14.357s
sys     0m0.007s

2) CFLAGS -O3 -march=sklake
real    0m13.549s
user    0m13.547s
sys     0m0.000s

3) CFLAGS -O3 -march=sklake & USE_ENDOMORPHISM 1
real    0m15.660s
user    0m15.660s
sys     0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real    0m16.139s
user    0m16.137s
sys     0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real    0m14.849s
user    0m14.847s
sys     0m0.000s

6) CFLAGS -O3 -march=sklake & undef USE_ASM_X86_64
real    0m14.520s
user    0m14.517s
sys     0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism.
On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).

Rico


There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) when endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.
piotr_n, Legendary
January 13, 2017, 01:34:34 PM
Last edit: January 13, 2017, 01:56:57 PM by piotr_n
 #13

It wasn't 'the collective'.
Sipa wrote the entire library, from scratch, all by himself.
The guys just took his code, added some pretty useless checks and a heavy build system around it, and now it's 'officially' hosted at bitcoin/secp256k1 as the 'community project'. Which means that if you want to change any bit there, you have to play the bitcoin celebrities' PR game, which is mostly about indulging the big egos of a few funny characters.

But if you check the history of sipa/secp256k1 you can see that it used to be quite easy to commit optimizations into that code.

It was all done by one person as his personal, partially experimental project, and I personally admire the work.
And you can't seriously expect any coder to work on his personal hobby project and then deliver it with industry-standard documentation.
What kind of documentation would you even expect from a lib that provides simple EC math functions? The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it, then no document is going to help you anyway.

rico666, Legendary
January 13, 2017, 06:13:06 PM
 #14

Quote
And you can't seriously expect any coder to work on his personal hobby project and then deliver it with industry-standard documentation.

Depends on the coder, I guess. As long as I have interest in my hobby project, I want it to be perfect. I for one have no problem admitting that my LBC project still sucks badly in many places right now. I intend to improve: documentation, ease of use, speed. One of the goals is to get decent EC performance on a GPU - that's why I am looking into this at all.
 
Quote
What kind of documentation would you even expect from a lib that provides simple EC math functions? The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it, then no document is going to help you anyway.

Well - for me the names are pretty bad. "secp256k1_fe_cmov" *thumbs up*. But it's not only that: the data structures are pretty sideways too. Actually, I understand the math pretty well; that's why I am so puzzled about what I see - still unsure if the lib is seriously doing what I think it's doing. I don't think the person who wrote this really cared about performance - he probably just wanted something that sucked less.

Sipa you say? Why did he abandon the project? Was it just some proof of work?


Rico

rico666, Legendary
January 13, 2017, 06:35:04 PM
 #15

Quote
There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Built your program with endomorphism (./configure --enable-endomorphism) and report back with the results, should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us  (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us  (without)

bench_internal shows no improvements (within measurement tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us     (without)

I doubt this would cause the speedup from above.


Rico

piotr_n, Legendary
January 13, 2017, 09:46:57 PM
 #16


Quote
Sipa you say? Why did he abandon the project? Was it just some proof of work?

Proof of work that has already found so many applications, including one in your project. I guess you can call it whatever you like.

I don't know him and can't speak for him, but I wouldn't say he abandoned it. Rather, he decided it was complete enough and moved on with his life, like we all do all the time. Even satoshi did that with his big project.
And you're also not going to work on a single project all your life, are you?

piotr_n, Legendary
January 13, 2017, 10:12:09 PM
Last edit: January 13, 2017, 10:26:50 PM by piotr_n
 #17

There is no question that at this moment sipa's secp256k1 lib is the fastest solution on the market.

And you can complain all you want about messy coding or poor documentation, but unless you provide an alternative to prove that it can be done so much better... well, then it's just going to be moaning.

And mere moaning isn't very professional.

Sipa didn't go to the OpenSSL forum saying how shitty their implementation was - he just made a better one, to prove the point. And he proved it so well that now nobody bothers to make an effort to beat him. Smiley

rico666, Legendary
January 14, 2017, 08:21:41 AM
 #18

Quote
There is no question that at this moment sipa's secp256k1 lib is the fastest solution on the market.

And you can complain all you want about messy coding or poor documentation, but unless you provide an alternative to prove that it can be done so much better... well, then it's just going to be moaning.

And mere moaning isn't very professional.

Strange. I was under the impression that

Code:
field_get_b32: min 0.647us / avg 0.666us / max 0.751us
field_set_b32: min 0.551us / avg 0.571us / max 0.624us

becomes

Code:
field_get_b32: min 0us / avg 0.0000000477us / max 0.000000238us
field_set_b32: min 0us / avg 0.0000000238us / max 0.000000238us

allowed me a tiny little bit of moaning.


Rico

gmaxwell, Moderator, Legendary
January 14, 2017, 03:14:27 PM
Last edit: January 14, 2017, 04:36:50 PM by gmaxwell
 #19

Quote
Strange. I was under the impression that

allowed me a tiny little bit of moaning.
It's a trivial optimization that already existed in three other places in the code. Thanks for noticing that it hadn't been performed there, and providing code for one of the two places it needed to be improved, but come on-- you're just making yourself look foolish here with the rude attitude while you're clearly fairly ignorant about what you're talking about overall. Case in point:

The endomorphism makes verification and ECDH significantly faster. It doesn't do anything else beyond additional endomorphism-related tests.

It does not make pubkey generation faster and really can't (well, it could be used, with some effort, to halve the size of the in-memory tables at a small performance penalty).

It's a little absurd that you insult a ~27% performance increase to verification while bragging about an under-half-percent change to verification performance. It seems to me that you're trying to compensate for ignorance by insulting a lot; it might fool a few people who just don't know much of anything-- but not anyone else. And it prevents you from learning. I do think wanna-be applies, in spades, and if you keep up that attitude it will probably continue to do so.

Quote
I don't think the person who wrote this really cared about performance.

And again, I ask you-- suggest something else similar which is 1/5th as fast, 1/5th as well documented-- I could also add 1/5th as well tested. I'm not aware of anything.

Quote
No I haven't, because reading the secp256k1 code is a major PITA, but thanks for the pointer.

Odd to hear that, since the libsecp256k1 code has received many positive comments about its clarity and organization. One researcher described it as the cleanest production ECC code that he'd read.

Style preferences differ, of course -- let's see what your own code looks like:

ftp://ftp.cryptoguru.org/LBC/client/LBC

 <blink> <blink>

Quote
It wasn't 'the collective'.
Sipa wrote the entire library,  from scratch,  all by himself.
That is far from an accurate history, but it doesn't matter-- Pieter did do the lion's share of the work, but he didn't do it in isolation. But less fortunate is where your internal imagination continues and you write:

Quote
Which means that if you want to change any bit there, you have to play the bitcoin celebrities' PR game, which is mostly about indulging the big egos of a few funny characters.
Which is just something you're purely making up. And it's unfortunate because if other people read it and don't know that you're extrapolating based on your imagination and fears they might not contribute where they otherwise might. That does the world a disservice.

Quote
And you can't seriously expect any coder to work on his personal hobby project and then deliver it with industry-standard documentation.
What kind of documentation would you even expect from a lib that provides simple EC math functions? The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it, then no document is going to help you anyway.

But what is strange is that it is _extensively_ documented.

E.g. above, rico666 commented on secp256k1_fe_cmov -- even if someone were ignorant enough of the subject and the conventions used not to immediately know that this function performs a conditional move of a field element, or ignorant enough of programming not to know what a conditional move is, there is documentation (in this case apparently added by me):

Code:
/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);

... even though this is purely internal code and is not accessible to an end user of the library.
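
(For completeness: a constant-time conditional move of that shape is typically implemented with masks rather than a branch. Below is a minimal sketch of the idea, assuming the 5x52 limb layout; it is an illustration only, not the library's actual implementation, and the type name fe_5x52 is made up.)

Code:
#include <stdint.h>

typedef struct { uint64_t n[5]; } fe_5x52;  /* illustrative stand-in for secp256k1_fe */

/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time:
 *  the mask is either all ones or all zeros, so no secret-dependent branch
 *  is taken. Assumes flag is 0 or 1. */
static void fe_cmov(fe_5x52 *r, const fe_5x52 *a, int flag) {
    uint64_t mask0 = flag + ~((uint64_t)0);  /* flag == 0 -> all ones, flag == 1 -> 0 */
    uint64_t mask1 = ~mask0;                 /* flag == 0 -> 0, flag == 1 -> all ones */
    int i;
    for (i = 0; i < 5; i++) {
        r->n[i] = (r->n[i] & mask0) | (a->n[i] & mask1);
    }
}
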
AlexGR, Legendary
January 14, 2017, 03:54:56 PM
Last edit: January 14, 2017, 04:38:58 PM by AlexGR
 #20

Quote
There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Built your program with endomorphism (./configure --enable-endomorphism) and report back with the results, should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us  (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us  (without)

bench_internal shows no improvements (within measure tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us     (without)

I doubt this would cause the speedup from above.

Rico


I'll upload here 2 versions of /src/field_5x52_asm_impl.h that I've kind of hacked, one using memory, the other xmm registers.

The commentary is not good because it's not production-level - just fooling around* with the data flow so that the data get from one end to the other faster, with a smaller code footprint. I've never had them run on anything besides my Q8200, and I'm wondering about the behavior of modern CPUs. I'd appreciate it if you (or anyone else) could run a benchmark (baseline) plus these 2, and perhaps a time ./tests as a more real-world performance check.

If I do a time ./tests, both run faster by a second (58.2 seconds baseline with endomorphism down to 57.2 seconds on my underclocked Q8200 @ 1.86 GHz), although the memory version seems faster in the benchmarks. I have a theory on why the xmm version sucks in the benchmarks (OS context switches being more expensive because the xmm register set also has to be saved?), but the bottom line is it seems faster than baseline when doing a timed test run (a more real-world application)... Security-wise, I wouldn't want to leave data hanging around in the XMM registers though.

(*What I wanted to do was to reduce opcode size, instruction count and memory accesses by reducing the number of temporary variables from 3 to 2 or 1, while interleaving muls with adds.)


Version 1 - normal/memory:

Code:
/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            r15:rcx = d
 *            r10-r14 = a0-a4
 *            rbx     = b
 *            rdi     = r
 *            rsi     = a / t?
 */
  uint64_t tmp1, tmp2;
__asm__ __volatile__(
    "movq 24(%%rsi),%%r13\n"
    "movq 0(%%rbx),%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d += a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq %%rax,%%r9\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "movq $0x1000003d10,%%rcx\n"
    "movq $0xfffffffffffff,%%r15\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d = a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "mulq %%rcx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%q1\n"  
    /* d >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%q2\n"
    /*"movq %q2,%%r15\n" */
    "movq 0(%%rbx),%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /*Q3 transfered*/
    
    /* u0 = d & M (%%r15) */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */

    /* u0 = (u0 << 4) | tx (%%r15) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %q1,%%r10\n"
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%r11\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %q2,%%rsi\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "xor %%ecx,%%ecx\n"
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    /* c += d * R */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1), "=m"(tmp2)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a / t?
 */
  uint64_t tmp1a;
__asm__ __volatile__(
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 became rsi*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "movq $0x1000003d10,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%r8\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%r15\n"
    "shrq $48,%%r15\n" /*Q3=R15*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%q1\n"/*Q2 OUT - renamed to q1*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq $0xfffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%r15,%%rdx\n" /*Q3 - R15 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "movq $0xfffffffffffff,%%r15\n" /*R15 back in its place*/
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %q1,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%r13\n"
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1*/
    /* c += d * R */
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

#endif


Version 2 - more xmm reg use

Code:
/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            r15:rcx = d
 *            r10-r14 = a0-a4
 *            rbx     = b
 *            rdi     = r
 *            rsi     = a / t?
 */
/* xmm0 = q1 xmm6=q2    */
/* This has 17 mem accesses + 17 xmm uses vs 35 mem access and no xmm use*/

__asm__ __volatile__(
    "push %%rbx\n"
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq %%rdi, %%xmm3\n"
    "movq 0(%%rbx),%%rdi\n"
    "movq 8(%%rbx),%%rbp\n"
    "movq 16(%%rbx),%%rsp\n"
    "movq %%rdi,%%xmm4\n"
    
    "movq 24(%%rsi),%%r13\n"
    "movq %%rdi,%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d += a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq %%rax,%%r9\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "movq 24(%%rbx),%%rcx\n"
    "movq 32(%%rbx),%%rbx\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d = a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq $0xfffffffffffff,%%r15\n"
    "movq %%rax,%%r8\n"
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%xmm0\n"  
    /* d >>= 52 */
    "movq %%rdi,%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq $0x1000003d10,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%%xmm6\n"
    /*"movq %q2,%%r15\n" */
    "movq %%rdi,%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rcx,%%xmm5\n"
    "movq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /*Q3 transfered*/
    
    /* u0 = d & M (%%r15) */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */

    /* u0 = (u0 << 4) | tx (%%r15) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%rdx\n"
        /* c >>= 52 */
    "movq %%rdi,%%rax\n"
    "movq %%xmm3, %%rdi\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    "movq %%rdx,0(%%rdi)\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%xmm4,%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %%xmm0,%%r10\n"
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%rbx\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rbx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += d * R */
    "mulq %%rbx\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "movq %%xmm6,%%rsi\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
    "pop %%rbx\n"
: "+S"(a)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a / t?
 */
/* tmp1a = xmm0 */
__asm__ __volatile__(
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rsp\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 OUT*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%rbp\n"
    "shrq $48,%%rbp\n" /*Q3 OUT*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%%xmm0\n"/*Q2 OUT*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq %%r15,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%rbp,%%rdx\n" /*Q3 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%xmm0,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1 RETURNS*/
    /* c += d * R */
    "mulq %%rsp\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"

: "+S"(a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

#endif
