Bitcoin Forum
Author Topic: Speeding up signature verification  (Read 23806 times)
MatthewLM
Legendary
*
Offline Offline

Activity: 1190
Merit: 1004


View Profile
January 13, 2013, 06:28:23 PM
 #21

If you are to use a GPU, why not also use the CPU alongside it? If time is important, then surely using as much processing power as is available is recommended. Understandably, there are other considerations, such as power usage and the operation of other programs on the same system.

At least you can take advantage of multiple CPU cores with multiple threads.
smtp
Member
**
Offline Offline

Activity: 70
Merit: 10


View Profile
January 13, 2013, 06:48:35 PM
 #22

If you are to use a GPU, why not also use the CPU alongside it? If time is important, then surely using as much processing power as is available is recommended. Understandably, there are other considerations, such as power usage and the operation of other programs on the same system.

At least you can take advantage of multiple CPU cores with multiple threads.
Of course, but two important points must/should be decided (as usual):

1) Effort: who should write all this code (for which GPUs?) and merge it with bitcoin-0.?, i.e. how much real time would a (few) developer(s) have to spend?
2) Benefit: how big is the average profit from this new code (across all bitcoin-client users)?

smtp
Pieter Wuille
Legendary
*
qt
Offline Offline

Activity: 1072
Merit: 1174


View Profile WWW
January 13, 2013, 06:53:55 PM
 #23

I simply should concatenate my current block chain (3 files) into 1 huge file (almost 4.9 GB), name this "bootstrap.dat" and 0.7.* will load this without signature-checking into the bitcoin-client if I start it (and of course build a new blkindex.dat)?

It will load it, but only the first 193000 blocks will be done without signature checking.

Also, I think 0.7.x has a bug where no more than 2G of bootstrap.dat is loaded anyway. You can use -loadblock=file1 -loadblock=file2, ... though.
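For anyone hitting that 2 GB limit, a splitter along these lines would work (a sketch, not something from the thread: the magic value is mainnet's, and the chunk size, file names, and helper name are illustrative assumptions). Records in bootstrap-style files are framed as 4 network-magic bytes plus a 4-byte little-endian block length, so a splitter only needs to read lengths and cut at record boundaries:

```python
import struct

MAGIC = bytes.fromhex("f9beb4d9")     # mainnet network magic
LIMIT = 2 * 1024**3 - 2**20           # stay safely below 2 GB per piece

def split_bootstrap(path, prefix="chunk", limit=LIMIT):
    """Split a concatenated block file at block-record boundaries so
    that each output piece stays below `limit` bytes."""
    parts, out, size, idx = [], None, 0, 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            magic, length = header[:4], struct.unpack("<I", header[4:])[0]
            assert magic == MAGIC, "not at a block record boundary"
            block = f.read(length)
            if out is None or size + 8 + length > limit:
                if out is not None:
                    out.close()
                name = f"{prefix}{idx}.dat"
                idx += 1
                parts.append(name)
                out = open(name, "wb")
                size = 0
            out.write(header)
            out.write(block)
            size += 8 + length
    if out is not None:
        out.close()
    return parts
```

Each resulting piece could then be passed with its own -loadblock=chunkN.dat argument.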

I do Bitcoin stuff.
smtp
Member
**
Offline Offline

Activity: 70
Merit: 10


View Profile
January 13, 2013, 07:13:43 PM
 #24

I simply should concatenate my current block chain (3 files) into 1 huge file (almost 4.9 GB), name this "bootstrap.dat" and 0.7.* will load this without signature-checking into the bitcoin-client if I start it (and of course build a new blkindex.dat)?

It will load it, but only the first 193000 blocks will be done without signature checking.

Also, I think 0.7.x has a bug where no more than 2G of bootstrap.dat is loaded anyway. You can use -loadblock=file1 -loadblock=file2, ... though.

All this I knew -- I thought that I had misunderstood that this 193000 blocks limit was hardcoded into the client. :-/

I did it with -loadblock... already, hence my annoying personal real-time experience.
Thanks anyway; then I can stop my attempt now -- I had just started 0.7.1 with this 4.9 GB bootstrap.dat and was waiting to see whether it would all succeed.
193000 blocks looks about the same as 216000 blocks, but the sad truth is: they hold only about 50% of the transactions, and that is what matters. :-/

smtp
Pieter Wuille
Legendary
*
qt
Offline Offline

Activity: 1072
Merit: 1174


View Profile WWW
January 13, 2013, 08:24:49 PM
 #25

All this I knew -- I thought that I had misunderstood that this 193000 blocks limit was hardcoded into the client. :-/

Yes, it is hardcoded. 0.7.x have a latest checkpoint at 193000. The current development code has an extra one at 210000.

I do Bitcoin stuff.
Diapolo
Hero Member
*****
Offline Offline

Activity: 769
Merit: 500



View Profile WWW
January 14, 2013, 08:04:31 AM
 #26

But before the last checkpoint (= exactly what is stored in bootstrap.dat), signature checking is disabled anyway (and this hasn't changed in 0.8).
Do I fall victim to a misunderstanding, or have you just told me a trick? :)

I simply should concatenate my current block chain (3 files) into 1 huge file (almost 4.9 GB), name this "bootstrap.dat" and 0.7.* will load this without signature-checking into the bitcoin-client if I start it (and of course build a new blkindex.dat)?

smtp

You perhaps won't succeed with a file > 4 GB. There were some limits with fseek or the like, as far as I remember.

Dia

Liked my former work for Bitcoin Core? Drop me a donation via:
1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x
bitcoin:1PwnvixzVAKnAqp8LCV8iuv7ohzX2pbn5x?label=Diapolo
smtp
Member
**
Offline Offline

Activity: 70
Merit: 10


View Profile
January 14, 2013, 09:06:58 AM
Last edit: January 14, 2013, 04:29:56 PM by smtp
 #27

You perhaps won't succeed with a file > 4 GB. There were some limits with fseek or the like, as far as I remember.
On 64-bit Linux the limit is 2^63 as file position, and on 32-bit Linux I guess 2^31 (at least, depending on the size of the long data type), but even the (official) 193000-block bootstrap file is > 2 GB in size.

Addendum: I should add that it really depends on the filesystem you are using. I was talking about ext2, ext3 and ext4 (the default Linux filesystems).

smtp
Pieter Wuille
Legendary
*
qt
Offline Offline

Activity: 1072
Merit: 1174


View Profile WWW
January 14, 2013, 08:34:40 PM
 #28

0.7.x indeed has a bug in that it needs the ability to seek to positions in block files being imported. This limits it to 2 GB on some systems.

The current git head code doesn't need this, and can import blocks from arbitrary-sized files.

I do Bitcoin stuff.
deuteragenie
Newbie
*
Offline Offline

Activity: 36
Merit: 0


View Profile
January 29, 2014, 10:00:48 PM
 #29


The book tells (well, hints) how to compute lambda and beta, and here are the values I found:
lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72
beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee


Isn't lambda a solution of l*l+l=-1 mod n ?

If that's the case, there are actually two possible lambdas:

5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
and
AC9C52B33FA3CF1F5AD9E3FD77ED9BA4A880B9FC8EC739C2E0CFC810B51283CE

What made you choose the smaller one ?
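Both candidate values, and the beta quoted above, can be checked numerically against the standard secp256k1 constants (a quick sketch; n and p below are the curve's published group order and field prime):

```python
# Numeric check of the two candidate lambdas against the standard
# secp256k1 constants: group order n and field prime p.
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
p = 2**256 - 2**32 - 977

lam1 = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
lam2 = 0xAC9C52B33FA3CF1F5AD9E3FD77ED9BA4A880B9FC8EC739C2E0CFC810B51283CE
beta = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE

# Both values solve l*l + l + 1 == 0 (mod n), i.e. l*l + l == -1 ...
assert (lam1 * lam1 + lam1 + 1) % n == 0
assert (lam2 * lam2 + lam2 + 1) % n == 0
# ... and each is the square of the other: the two non-trivial cube
# roots of unity mod n.
assert pow(lam1, 2, n) == lam2 and pow(lam2, 2, n) == lam1
assert pow(lam1, 3, n) == 1
# beta is the matching cube root of unity in the base field (mod p).
assert pow(beta, 3, p) == 1
```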
ysangkok
Newbie
*
Offline Offline

Activity: 10
Merit: 3


View Profile
January 04, 2015, 11:59:50 PM
 #30

Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it?

The secp256k1-specific optimization has been implemented, and there's a pull request to integrate it in the reference client. It does indeed achieve some 20% improvement.

What smtp asked about was the parallel batch verification, with potentially much higher speedups (Hal mentioned 4x). The problem is that this is a much more invasive change, and the risk of getting crypto code wrong is extremely high. It's also not certain what degree of speedup is possible, as we need to recover R before batch verification is possible (and changing this basically requires a hard fork, as the signatures are part of the scripting rules).

That said, if someone is interested in investigating, and potentially supplying a patch that is well testable... by all means, do so.

It looks as if this code is slated for inclusion in 0.10, but the given reasons are not performance, but security:

https://github.com/bitcoin/bitcoin/blob/0.10/doc/release-notes.md#improved-signing-security

Is the new code doing the batch verification previously mentioned or is it doing the "some 20%" optimization? Or maybe the optimizations vanished to mitigate timing attacks?

Thanks in advance.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4158
Merit: 8382



View Profile WWW
January 05, 2015, 01:53:35 AM
Merited by vapourminer (1)
 #31

5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
and
AC9C52B33FA3CF1F5AD9E3FD77ED9BA4A880B9FC8EC739C2E0CFC810B51283CE
What made you choose the smaller one ?
Lambda (mod N) has multiplicative order three; the other value is just lambda applied twice. Unfortunately, it appears there is no way to use that for additional speedups, because there is no small orthogonal basis that can be formed from a reduction on [n, lambda, lambda^2].

It looks as if this code is slated for inclusion in 0.10, but the given reasons are not performance, but security:
Absolutely not. The changes in 0.10 you're referring to are exclusively related to signing, as the release notes say, and do not change verification.

Quote
Is the new code doing the batch verification previously mentioned or is it doing the "some 20%" optimization? Or maybe the optimizations vanished to mitigate timing attacks?
Verification in libsecp256k1 on x86_64 is now probably closer to ten times faster than OpenSSL's implementation, but the software is very much still under development. This is without batch validation. True batch validation requires additional side information to have any hope of being faster, and even with it the speed increase is not that enormous due to the additional blinding required to preserve security.
AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
April 26, 2016, 03:25:31 AM
Merited by vapourminer (1)
 #32

Is there any downside to using SIMD instructions (loading multiple values into different segments of the same registers and performing "packed" SSE/AVX instructions for the math operations) to batch-process multiple signatures per thread?

It would theoretically get the CPU cycles per round much lower than doing it serially.
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4158
Merit: 8382



View Profile WWW
April 26, 2016, 04:45:49 AM
 #33

Is there any downside to using SIMD instructions (loading multiple values into different segments of the same registers and performing "packed" SSE/AVX instructions for the math operations) to batch-process multiple signatures per thread?

It would theoretically get the CPU cycles per round much lower than doing it serially.
You lose the 64,64->128 widening multiplies. The net result appears to be a loss of performance.

If someone were able to get the field arithmetic to operate usefully in parallel, there wouldn't be a need to perform multiple verifications at once to get performance gains, fwiw.

But actually getting it to be faster while losing the wide multiplies appears to be elusive. Even in the AVX-512 integer instructions I do not see a quadword multiply.
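The widening-multiply point can be made concrete: in a 5x52 field representation a single limb product needs up to 104 bits, which the scalar 64x64->128 MUL delivers in one instruction but packed SSE/AVX integer multiplies do not. A numeric sketch (plain Python arithmetic, just to illustrate the sizes involved):

```python
# A product of two 52-bit limbs needs up to 104 bits: scalar x86-64
# gets this in one MUL (64x64 -> 128 into rdx:rax), but SSE/AVX
# integer SIMD tops out at 32x32 -> 64 (PMULUDQ-style).
a = (1 << 52) - 1          # largest 52-bit limb value
b = (1 << 52) - 1
product = a * b
assert product.bit_length() == 104   # does not fit in 64 bits

# With only 32x32 -> 64 available, each 52-bit limb must be split into
# 26-bit halves and the product reassembled from four partials:
a_lo, a_hi = a & ((1 << 26) - 1), a >> 26
b_lo, b_hi = b & ((1 << 26) - 1), b >> 26
reassembled = (a_lo * b_lo
               + ((a_lo * b_hi + a_hi * b_lo) << 26)
               + ((a_hi * b_hi) << 52))
assert reassembled == product
```

The extra splits, shifts, and carry handling are what eat the lane-parallelism gains.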



AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
April 26, 2016, 11:02:52 PM
Last edit: April 27, 2016, 04:26:28 PM by AlexGR
 #34

Does this help at all?

https://en.wikipedia.org/wiki/CLMUL_instruction_set

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/carry-less-multiplication-instruction-in-gcm-mode-paper.pdf

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf

"Using the PCLMULQDQ instruction, the performance of ECC can be
boosted significantly on a range of IA processor cores, making it an
attractive option over other public key algorithms."

There are also some newer "regular" instructions that might be useful for speedups (mulx, adcx, adox): http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

Quote

MULX Instruction
The mulx instruction is an extension of the existing mul instruction, with the
difference being in the effect on flags:
mulx dest_hi, dest_lo, src1
The instruction also uses an implicit src2 register, edx or rdx depending on
whether the 32-bit or 64-bit version is being used.
The operation is:
 dest_hi:dest_lo = src1 * r/edx
The reg/mem source operand src1 is multiplied by rdx/edx, and the result is
stored in the two destination registers dest_hi:dest_lo. No flags are
modified.
This provides two key advantages over the existing mul instruction:
- Greater flexibility in register usage, as current mul destination
registers are implicitly defined. With mulx, the destination registers
may be distinct from the source operands, so that the source
operands are not over-written.
- Since no flags are modified, mulx instructions can be mixed with
add-carry instructions without corrupting the carry chain.

ADCX/ADOX Instructions

The adcx and adox instructions are extensions of the adc instruction,
designed to support two separate carry chains. They are defined as:
adcx dest/src1, src2
adox dest/src1, src2
Both instructions compute the sum of src1 and src2 plus a carry-in and
generate an output sum dest and a carry-out. The difference between these
two instructions is that adcx uses the CF flag for the carry in and carry out
(leaving the OF flag unchanged), whereas the adox instruction uses the OF
flag for the carry in and carry out (leaving the CF flag unchanged).

The primary advantage of these instructions over adc is that they support two
independent carry chains. Note that the two carry chains can be initialized by
an instruction that clears both the CF and OF flags, for example “xor reg,reg”

As for AVX512, I think AVX512-DQ (Q = quadword) might do (some of) the job, but its use will be too limited (Xeons only).
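The quoted semantics can be pinned down with a small emulation (a sketch in plain Python; the function names are mine, not Intel's, and only the flag behaviour the paper describes is modelled):

```python
MASK64 = (1 << 64) - 1

def mulx(src1, rdx):
    """MULX: dest_hi:dest_lo = src1 * rdx; no flags are modified."""
    prod = (src1 & MASK64) * (rdx & MASK64)
    return (prod >> 64) & MASK64, prod & MASK64

def adcx(dest, src, cf):
    """ADCX: dest = dest + src + CF; carries through CF, OF untouched."""
    total = (dest & MASK64) + (src & MASK64) + cf
    return total & MASK64, total >> 64   # (result, new CF)

def adox(dest, src, of):
    """ADOX: same sum, but carries through OF instead of CF."""
    total = (dest & MASK64) + (src & MASK64) + of
    return total & MASK64, total >> 64   # (result, new OF)

# Because MULX leaves flags alone and ADCX/ADOX use disjoint flags,
# two carry chains can be interleaved without corrupting each other.
hi, lo = mulx(2**63, 6)            # 6 * 2^63 = 3 * 2^64
assert (hi, lo) == (3, 0)
r1, cf = adcx(MASK64, 1, 0)        # wraps around, sets CF
r2, of = adox(5, 7, 1)             # independent OF chain
assert (r1, cf) == (0, 1) and (r2, of) == (13, 0)
```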
gmaxwell
Moderator
Legendary
*
expert
Offline Offline

Activity: 4158
Merit: 8382



View Profile WWW
April 27, 2016, 08:30:18 AM
 #35

"Using the PCLMULQDQ instruction, the performance of ECC can be
boosted significantly on a range of IA processor cores, making it an
attractive option over other public key algorithms."
Not really relevant: the 'carry-less multiply' is useful for characteristic-2 ECC, which only crazy people would use.

There are also some newer "regular" instructions that might be useful for speedups (mulx, adcx, adox): http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf

Quote
MULX Instruction
ADCX/ADOX Instructions
Yes, these are somewhat useful... but they are only in very new processors. E.g. I just got Broadwell-EP chips that support ADCX a couple of weeks ago (released April 1st). But they don't let us do things efficiently with SIMD; they would just likely make the existing approach somewhat faster if we used them.

AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
April 08, 2017, 12:14:52 AM
Last edit: April 08, 2017, 10:55:23 PM by AlexGR
Merited by vapourminer (1)
 #36

After tampering a bit with the code over the past year, I now understand that I initially asked the wrong question about running multiple verifications in parallel in a SIMD manner, as the calculations required for a single signature are too many by themselves.

Since this is the case, the question should be rephrased to whether SIMD can be used on the multiple calculations of a single pass. As you rightly pointed out, quadword->octaword multiplication is generally not supported.

Now, I think I found a way to use vectorization that bypasses the quadword->octaword obstacle, since it doesn't seem that it will get hardware support anytime soon.

The answer lies in breaking the multiplications up into 32x32-bit products with 64-bit results, as the 10x26 field representation does, and then packing the sources for packed multiplications.

I put the multiplications of 10x26 in vectorized SIMD loops, #pragma simd'ed (icc*), and it pulled it off (result on the left).



This is the PMULUDQ instruction (http://www.felixcloutier.com/x86/PMULUDQ.html). It can be used to take 4x 32-bit sources and produce 2x 64-bit outputs.

On my Q8200 (no AVX2), in x64 mode (utilizing the 32-bit field), this is ~5-10% slower than issuing multiple 64-bit IMULs. IMULs in x64 mode are very, very convenient for 32x32-bit => 64-bit, and this is what is normally generated in x64 mode.

In an AVX2 scenario, packed multiplication goes up to 8x 32-bit sources / 4x 64-bit results.

In an AVX512 scenario (assuming there is a similar instruction), it should pack 16x 32-bit sources into eight 64-bit results. Plus, with an extra 16 registers, it should eliminate a lot of memory accesses. If verification speed becomes an issue in the future, we might be able to exploit this.

As an incidental "discovery", while compiling with -m32 to check the 32-bit output, I saw that the uint64s are produced with plain MULs in eax:edx (which is to be expected), although many 32-bit machines DO have SSE2 support. In x86+SSE2 mode, you can get either one or two 64-bit outputs (depending on how you word the instruction) without going through MULs/ADDs in eax:edx, which is slower.

*gcc doesn't like the d=d+dd part inside the loop and hesitates to vectorize even with #pragma GCC ivdep. One needs to do the result additions outside the loop, or write it manually in asm. ICC does it like a boss with PADDs from the results.
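PMULUDQ's lane behaviour is easy to model: it multiplies the even-numbered 32-bit lanes of its two 128-bit sources unsigned, producing two full 64-bit products. A minimal sketch (plain Python lists standing in for the vector registers; this only illustrates the lane layout, not real SIMD):

```python
def pmuludq(xmm_a, xmm_b):
    """Model PMULUDQ: each operand is four 32-bit lanes [l0, l1, l2, l3];
    lanes 0 and 2 are multiplied unsigned, giving two 64-bit results."""
    assert len(xmm_a) == len(xmm_b) == 4
    return [(xmm_a[0] & 0xFFFFFFFF) * (xmm_b[0] & 0xFFFFFFFF),
            (xmm_a[2] & 0xFFFFFFFF) * (xmm_b[2] & 0xFFFFFFFF)]

# Two 26-bit limbs (as in the 10x26 field) multiplied in one packed op;
# each product is at most 52 bits, fitting comfortably in a 64-bit lane.
out = pmuludq([(1 << 26) - 1, 0, 12345, 0],
              [(1 << 26) - 1, 0, 67890, 0])
assert out[0] == ((1 << 26) - 1) ** 2
assert out[1] == 12345 * 67890
```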


edit: Something else I remembered regarding MULX + ADCX which I've mentioned previously. I tried building with

./configure --enable-benchmark --enable-endomorphism --with-asm=no CFLAGS="-O3 -march=skylake"

...to check the asm output of new compilers.

GCC 6.3 now outputs MULXs (no ADCXs, though) in field and scalar multiplications.
Clang 3.9 now employs both MULXs and ADCXs.

(from clang)

000000000040c300 <secp256k1_fe_mul>:
...
  40c319:   48 8b 4e 08             mov    0x8(%rsi),%rcx
  40c31d:   48 89 4c 24 b8          mov    %rcx,-0x48(%rsp)
  40c322:   c4 62 e3 f6 c8          mulx   %rax,%rbx,%r9
  40c327:   49 89 c3                mov    %rax,%r11
  40c32a:   4c 89 5c 24 d8          mov    %r11,-0x28(%rsp)
  40c32f:   49 8b 52 10             mov    0x10(%r10),%rdx
  40c333:   48 89 54 24 a0          mov    %rdx,-0x60(%rsp)
 40c338:   c4 62 fb f6 c1          mulx   %rcx,%rax,%r8
  40c33d:   48 01 d8                add    %rbx,%rax
 40c340:   66 4d 0f 38 f6 c1       adcx   %r9,%r8
  40c346:   48 8b 6e 10             mov    0x10(%rsi),%rbp
  40c34a:   49 8b 1a                mov    (%r10),%rbx
...

...although how much faster it is compared to MUL/ADC C code or asm, I can't say.
AlexGR
Legendary
*
Offline Offline

Activity: 1708
Merit: 1049



View Profile
April 13, 2018, 08:56:55 PM
Merited by suchmoon (4), ABCbits (2), vapourminer (1), Last of the V8s (1)
 #37

I currently have an i5-8250U @ 3.4 GHz laptop, and I tried MULXing / ADCXing / ADOXing the 5x52 field asm. I found ADCX/ADOX to be useless for speeding up short parallel chains; performance was the same as or worse than ADD/ADC.

MULX, on paper, is 4 cycles vs the classic MUL at 3 cycles. The benefit, though, is that you can keep MULXing into other registers, without waiting for the rax/rdx pair to be added before going on.

So results are in. Test with gcc8 (I tried gcc7 + clang, similar performance improvements found) and parameters as follows (default cflags):

./configure --enable-endomorphism --enable-openssl-tests=no

Default run (normal asm / unmodified secp):

perf stat -d ./bench_verify
ecdsa_verify: min 46.8us / avg 46.8us / max 47.1us

9377.185057      task-clock (msec)         #    1.000 CPUs utilized          
27      context-switches          #    0.003 K/sec                  
2      cpu-migrations            #    0.000 K/sec                  
3,284      page-faults               #    0.350 K/sec                  
31,765,926,255      cycles                    #    3.388 GHz                      (62.49%)
85,725,991,961      instructions             #    2.70  insn per cycle           (75.00%)
1,822,885,107      branches                  #  194.396 M/sec                    (75.00%)
64,368,756      branch-misses             #    3.53% of all branches          (75.00%)
13,413,927,724      L1-dcache-loads           # 1430.486 M/sec                    (75.03%)
7,487,706      L1-dcache-load-misses     #    0.06% of all L1-dcache hits    (75.08%)
4,131,670      LLC-loads                 #    0.441 M/sec                    (49.96%)
93,400      LLC-load-misses           #    2.26% of all LL-cache hits     (49.92%)

9.376614679 seconds time elapsed

./bench_internal
scalar_add: min 0.00982us / avg 0.00996us / max 0.0103us
scalar_negate: min 0.00376us / avg 0.00381us / max 0.00411us
scalar_sqr: min 0.0389us / avg 0.0393us / max 0.0405us
scalar_mul: min 0.0395us / avg 0.0398us / max 0.0413us
scalar_split: min 0.175us / avg 0.178us / max 0.186us
scalar_inverse: min 11.3us / avg 11.5us / max 11.7us
scalar_inverse_var: min 2.65us / avg 2.70us / max 2.85us
field_normalize: min 0.00988us / avg 0.00995us / max 0.0102us
field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00411us
field_sqr: min 0.0187us / avg 0.0189us / max 0.0194us
field_mul: min 0.0233us / avg 0.0236us / max 0.0254us

field_inverse: min 5.10us / avg 5.11us / max 5.14us
field_inverse_var: min 2.61us / avg 2.62us / max 2.69us
field_sqrt: min 5.07us / avg 5.08us / max 5.13us
group_double_var: min 0.149us / avg 0.150us / max 0.153us
group_add_var: min 0.337us / avg 0.338us / max 0.341us
group_add_affine: min 0.288us / avg 0.289us / max 0.292us
group_add_affine_var: min 0.243us / avg 0.244us / max 0.246us
group_jacobi_var: min 0.212us / avg 0.219us / max 0.251us
wnaf_const: min 0.0799us / avg 0.0830us / max 0.104us
ecmult_wnaf: min 0.528us / avg 0.532us / max 0.552us
hash_sha256: min 0.324us / avg 0.328us / max 0.345us
hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.30us
hash_rfc6979_hmac_sha256: min 7.00us / avg 7.00us / max 7.03us
context_verify: min 7007us / avg 7038us / max 7186us
context_sign: min 33.4us / avg 34.1us / max 36.7us
num_jacobi: min 0.109us / avg 0.111us / max 0.126us

After MULXing:
Custom field 5x52 asm with MULXs:

perf stat -d ./bench_verify
ecdsa_verify: min 39.9us / avg 39.9us / max 40.2us

8003.494101      task-clock (msec)         #    1.000 CPUs utilized          
28      context-switches          #    0.003 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
3,278      page-faults               #    0.410 K/sec                  
27,113,041,440      cycles                    #    3.388 GHz                      (62.46%)
70,772,097,848      instructions             #    2.61  insn per cycle           (75.01%)
1,872,709,155      branches                  #  233.986 M/sec                    (75.01%)
63,567,635      branch-misses             #    3.39% of all branches          (75.01%)
20,812,623,788      L1-dcache-loads           # 2600.442 M/sec                    (75.01%)
7,187,062      L1-dcache-load-misses     #    0.03% of all L1-dcache hits    (75.01%)
4,098,304      LLC-loads                 #    0.512 M/sec                    (49.98%)
97,108      LLC-load-misses           #    2.37% of all LL-cache hits     (49.98%)

8.003690210 seconds time elapsed

./bench_internal
scalar_add: min 0.00982us / avg 0.00988us / max 0.0100us
scalar_negate: min 0.00376us / avg 0.00377us / max 0.00384us
scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0402us
scalar_mul: min 0.0395us / avg 0.0397us / max 0.0412us
scalar_split: min 0.176us / avg 0.178us / max 0.184us
scalar_inverse: min 11.3us / avg 11.4us / max 11.6us
scalar_inverse_var: min 2.66us / avg 2.71us / max 3.01us
field_normalize: min 0.00988us / avg 0.00998us / max 0.0104us
field_normalize_weak: min 0.00404us / avg 0.00406us / max 0.00415us
field_sqr: min 0.0153us / avg 0.0155us / max 0.0164us
field_mul: min 0.0172us / avg 0.0175us / max 0.0182us

field_inverse: min 4.17us / avg 4.18us / max 4.22us
field_inverse_var: min 2.62us / avg 2.63us / max 2.65us
field_sqrt: min 4.12us / avg 4.12us / max 4.13us
group_double_var: min 0.127us / avg 0.128us / max 0.131us
group_add_var: min 0.270us / avg 0.270us / max 0.272us
group_add_affine: min 0.250us / avg 0.250us / max 0.252us
group_add_affine_var: min 0.200us / avg 0.200us / max 0.201us
group_jacobi_var: min 0.212us / avg 0.214us / max 0.219us
wnaf_const: min 0.0799us / avg 0.0802us / max 0.0817us
ecmult_wnaf: min 0.528us / avg 0.535us / max 0.569us
hash_sha256: min 0.324us / avg 0.334us / max 0.355us
hash_hmac_sha256: min 1.27us / avg 1.27us / max 1.31us
hash_rfc6979_hmac_sha256: min 6.99us / avg 6.99us / max 7.03us
context_verify: min 5934us / avg 5966us / max 6127us
context_sign: min 30.9us / avg 31.3us / max 33.0us
num_jacobi: min 0.111us / avg 0.113us / max 0.120us

From 9.37 secs to 8.00 secs (0.85x). field_mul at 0.73x. field_sqr at 0.81x.

After doing some rearranging (c file) of group_impl.h:

perf stat -d ./bench_verify
ecdsa_verify: min 38.2us / avg 38.3us / max 38.7us

7675.837387      task-clock (msec)         #    1.000 CPUs utilized          
37      context-switches          #    0.005 K/sec                  
0      cpu-migrations            #    0.000 K/sec                  
3,268      page-faults               #    0.426 K/sec                  
25,993,738,895      cycles                   #    3.386 GHz                      (62.48%)
70,649,153,999      instructions              #    2.72  insn per cycle           (74.99%)
1,872,833,433      branches                  #  243.991 M/sec                    (74.99%)
64,040,465      branch-misses             #    3.42% of all branches          (74.99%)
20,969,673,428      L1-dcache-loads           # 2731.907 M/sec                    (75.04%)
7,260,544      L1-dcache-load-misses     #    0.03% of all L1-dcache hits    (75.09%)
4,076,705      LLC-loads                 #    0.531 M/sec                    (49.97%)
110,695      LLC-load-misses           #    2.72% of all LL-cache hits     (49.92%)

7.675396172 seconds time elapsed

./bench_internal
scalar_add: min 0.00980us / avg 0.00984us / max 0.00987us
scalar_negate: min 0.00376us / avg 0.00377us / max 0.00382us
scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0396us
scalar_mul: min 0.0392us / avg 0.0396us / max 0.0403us
scalar_split: min 0.176us / avg 0.177us / max 0.181us
scalar_inverse: min 11.3us / avg 11.4us / max 11.7us
scalar_inverse_var: min 2.65us / avg 2.69us / max 2.93us
field_normalize: min 0.00991us / avg 0.00999us / max 0.0103us
field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00414us
field_sqr: min 0.0153us / avg 0.0154us / max 0.0158us
field_mul: min 0.0172us / avg 0.0173us / max 0.0175us

field_inverse: min 4.17us / avg 4.18us / max 4.20us
field_inverse_var: min 2.62us / avg 2.62us / max 2.65us
field_sqrt: min 4.12us / avg 4.13us / max 4.16us
group_double_var: min 0.121us / avg 0.122us / max 0.123us
group_add_var: min 0.267us / avg 0.268us / max 0.271us
group_add_affine: min 0.249us / avg 0.249us / max 0.252us
group_add_affine_var: min 0.192us / avg 0.193us / max 0.196us
group_jacobi_var: min 0.211us / avg 0.214us / max 0.224us
wnaf_const: min 0.0799us / avg 0.0802us / max 0.0818us
ecmult_wnaf: min 0.528us / avg 0.534us / max 0.574us
hash_sha256: min 0.324us / avg 0.327us / max 0.341us
hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.28us
hash_rfc6979_hmac_sha256: min 6.98us / avg 6.98us / max 7.01us
context_verify: min 5885us / avg 5916us / max 6039us
context_sign: min 30.9us / avg 31.5us / max 35.9us
num_jacobi: min 0.110us / avg 0.111us / max 0.122us

Time spent on ./bench_verify = 0.81x, from 31.77bn -> 25.99bn cycles.

Most of these are hardware specific though.

I have a few things I'm currently tampering with to get it down to 0.78x: a special sqr_inner2 function which takes 3 parameters (r: output, a: input, counter). The counter is the number of times to square the same number without re-calling the function; the function does the looping on its own. ~10% of the time spent in ./bench_verify is looped field squarings. It takes inversions/squarings from 4.2us down to 3.9us. The gain is much larger on less optimized sqr functions; the C int128 5x52 sqr goes 20%+ faster if looped internally with a counter.


[1] mulx-asm: https://github.com/Alex-GR/secp256k1/blob/master/field_5x52_asm_impl.h
[2] reordered group_impl.h (most gains in double_var and is probably not-hardware specific since I'm seeing 2% gains even on a 10-yr old quad core): https://github.com/Alex-GR/secp256k1/blob/master/group_impl.h

Code is free to use/reuse/take ideas from/implement, etc., although I don't claim it's heavily tested, or even safe. It does pass the tests, though.
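The looped-square idea amounts to this: when an exponentiation ladder needs a^(2^k), square k times inside one call instead of making k separate calls. A sketch of the invariant such a primitive must preserve (Python; fe_sqr_n is a hypothetical stand-in, not the sqr_inner2 from the linked branch):

```python
P = 2**256 - 2**32 - 977   # the secp256k1 field prime

def fe_sqr_n(a, k):
    """Hypothetical looped-square primitive: returns a^(2^k) mod P by
    squaring k times in one call instead of k separate sqr calls."""
    for _ in range(k):
        a = a * a % P
    return a

# Invariant: k chained squarings equal one exponentiation by 2^k.
x = 0xDEADBEEF
assert fe_sqr_n(x, 5) == pow(x, 2**5, P)
# Fermat inversion (x^(P-2)) is dominated by exactly such squaring
# runs, which is where the looped version shaves its time off.
assert fe_sqr_n(x, 1) * x % P == pow(x, 3, P)
```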
fillippone
Legendary
*
Offline Offline

Activity: 2142
Merit: 15383


Fully fledged Merit Cycler - Golden Feather 22-23


View Profile WWW
September 26, 2020, 10:46:31 AM
 #38

I implemented an optimized ECDSA verify for the secp256k1 curve used by Bitcoin, based on pages 125-129 of the Guide to Elliptic Curve Cryptography, by Hankerson, Menezes and Vanstone. I own the book but I also found a PDF on a Russian site which is more convenient.
<...>

I am celebrating this post because today the patent covering this idea has finally expired, and this is going to be activated in the next bitcoin core release.

Thank you to all the great minds who contributed to this.
Link

Karartma1
Legendary
*
Offline Offline

Activity: 2310
Merit: 1422



View Profile
September 26, 2020, 02:24:53 PM
 #39

I implemented an optimized ECDSA verify for the secp256k1 curve used by Bitcoin, based on pages 125-129 of the Guide to Elliptic Curve Cryptography, by Hankerson, Menezes and Vanstone. I own the book but I also found a PDF on a Russian site which is more convenient.
<...>

I am celebrating this post because today the patent covering this idea has finally expired, and this is going to be activated in the next bitcoin core release.

Thank you to all the great minds who contributed to this.
Link
Yeah man, if you really think about it, there is no doubt that those earlier contributors working on making the Bitcoin protocol better were some of the smartest and greatest minds around. I personally did not know this. There is always space for learning.
Cheers, Hal et al.