Title: Speeding up signature verification Post by: Hal on February 07, 2011, 10:31:27 PM In another thread, [mike] wrote:
I don't know of any algorithms that will make ECDSA much faster using GPUs. The best bet for speeding this up (assuming we care) is to exploit the special properties of the secp256k1 curve: http://portal.acm.org/citation.cfm?id=1514739 http://www.hyperelliptic.org/tanja/KoblitzC.html I'm trying to implement the secp256k1 shortcut. Should have results shortly. Unfortunately I only expect about a 20% speedup. We'll see. I'm also looking at batch signature verification: http://cseweb.ucsd.edu/~mihir/papers/batch.pdf http://www.math.snu.ac.kr/~jhcheon/publications/2007/BatchVer_PKC2007_CL.pdf This can theoretically speed up verification by almost a factor of 4 (table 6 in the 2nd paper) if you have a lot of signatures in a batch. It requires a slight change in the ECDSA signature format: (r, s) is replaced by (R, s), where R is the EC point of which r is the x coordinate. This change could be applied retroactively, as R is calculated and discarded every time the sig is verified. We do tend to have sigs in batches, namely blocks; sometimes several dozen or even hundreds, and this will grow. Batch verification returns true iff all sigs are valid. A good block should never have invalid signatures, so it makes sense to batch the verify. I need to research some security aspects, namely: does R need to be checked to see if it's on the curve (i.e. y^2 = x^3 + 7)? And what is an appropriate security parameter for the probability that the batch verify test could be fooled? The papers talk about 2^80, but that seems too conservative. Title: Re: Speeding up signature verification Post by: Hal on February 08, 2011, 07:11:52 AM I implemented an optimized ECDSA verify for the secp256k1 curve used by Bitcoin, based on pages 125-129 of the Guide to Elliptic Curve Cryptography, by Hankerson, Menezes and Vanstone.
I own the book but I also found a PDF on a Russian site which is more convenient.
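As a side note on the batch-verification question from the first post (whether a recovered R must be checked to lie on the curve): since p ≡ 3 (mod 4) for secp256k1, lifting an x coordinate to a curve point is a single modular exponentiation. A minimal sketch follows; the helper name lift_x and the use of the generator's coordinates are illustrative, not code from any patch in this thread:

```python
# Sketch: recover a point R = (x, y) on secp256k1 from its x coordinate, and
# check it lies on the curve y^2 = x^3 + 7 (mod p). Illustrative only -- real
# code must also reject non-residues and out-of-range x values.

p = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
assert p % 4 == 3   # so a square root, when it exists, is rhs^((p+1)/4) mod p

def lift_x(x):
    """Return (x, y) with y^2 = x^3 + 7 mod p, or None if x is not on the curve."""
    rhs = (x * x * x + 7) % p
    y = pow(rhs, (p + 1) // 4, p)
    if (y * y) % p != rhs:
        return None          # rhs is not a quadratic residue: no such point
    return (x, y)

# The generator's x coordinate must lift to the generator (up to y-sign):
Gx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
Gy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
R = lift_x(Gx)
assert R is not None and R[1] in (Gy, p - Gy)
```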
secp256k1 uses the following prime for its x and y coordinates: p = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f and the curve order is: n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141 The first step is to compute values beta, lambda such that for any curve point Q = (x,y): lambda * Q = (beta*x mod p, y) This is the so-called efficiently computable endomorphism, and what it means is, you can multiply any curve point by this special value lambda very quickly, by doing a single mod-p multiply. The book tells (well, hints) how to compute lambda and beta, and here are the values I found: lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72 beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee Given that we can multiply by lambda quickly, here is the trick to compute k*Q. First use the shortcut to compute Q' = lambda*Q. Next, k must be decomposed into two parts k1 and k2, each about half the width of n, such that: k = k1 + k2*lambda mod n Then k*Q = (k1 + k2*lambda)*Q = k1*Q + k2*lambda*Q = k1*Q + k2*Q' That last expression can be evaluated efficiently via a double multiply algorithm, and since k1 and k2 are half length, we get the speedup. The missing piece is splitting k into k1 and k2. This uses the following 4 values: a1 = 0x3086d221a7d46bcde86c90e49284eb15 b1 = -0xe4437ed6010e88286f547fa90abfe4c3 a2 = 0x114ca50f7a8e2f3f657c1108d9d44cfd8 b2 = 0x3086d221a7d46bcde86c90e49284eb15 (it's ok that a1 = b2) Use these as follows to split k: c1 = RoundToNearestInteger(b2*k/n) c2 = RoundToNearestInteger(-b1*k/n) k1 = k - c1*a1 - c2*a2 k2 = -c1*b1 - c2*b2 With all this, I measure about a 25% speedup on raw signature verifications. 
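The constants and the split described above can be checked numerically. A sketch follows (pure Python big-integer arithmetic; the real implementation does the rounding with fixed-width integers, so this is a correctness check, not a performance model):

```python
# Numeric check of the secp256k1 endomorphism constants and of the
# k = k1 + k2*lambda (mod n) decomposition described in the post above.

p = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141
lam  = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72
beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee

# lambda and beta are cube roots of unity mod n and mod p, respectively
assert pow(lam, 3, n) == 1 and pow(beta, 3, p) == 1

a1 = 0x3086d221a7d46bcde86c90e49284eb15
b1 = -0xe4437ed6010e88286f547fa90abfe4c3
a2 = 0x114ca50f7a8e2f3f657c1108d9d44cfd8
b2 = 0x3086d221a7d46bcde86c90e49284eb15

def split(k):
    """Split k into (k1, k2) with k = k1 + k2*lambda (mod n), both half-width."""
    c1 = (b2 * k + n // 2) // n    # RoundToNearestInteger(b2*k/n); b2*k >= 0
    c2 = (-b1 * k + n // 2) // n   # RoundToNearestInteger(-b1*k/n); -b1 > 0
    k1 = k - c1 * a1 - c2 * a2
    k2 = -c1 * b1 - c2 * b2
    return k1, k2

import random
random.seed(1)
for _ in range(100):
    k = random.randrange(n)
    k1, k2 = split(k)
    assert (k1 + k2 * lam) % n == k            # decomposition is correct
    assert abs(k1) < 2**129 and abs(k2) < 2**129   # about half the width of n
```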
For Bitcoin, initial block load from a wifi-connected local node is reduced from: real 36m21.537s user 24m43.277s sys 0m27.950s to: real 32m59.777s user 18m21.145s sys 0m28.262s Not a big difference, and it would probably be even less significant when fetching over the net. Title: Re: Speeding up signature verification Post by: Mike Hearn on February 08, 2011, 12:28:00 PM Great stuff! A 25% win is pretty significant when you look at it in terms of hardware costs rather than initial load time. It would certainly drop the cost of running a VISA scale node.
The batch verification thing does sound interesting. It's almost ideal for BitCoin in fact. Title: Re: Speeding up signature verification Post by: Hal on February 08, 2011, 09:08:58 PM I've put the code up at github. I'm a C programmer more than C++; it shows.
https://github.com/halfinney/bitcoin/tree/secp256k1 Here's a self-contained test program that compares the speeds. Build with -lcrypto. Code: #include <stdio.h> Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 02:25:57 PM The batch verification thing does sound interesting. It's almost ideal for BitCoin in fact. Since nearly 2 years have gone by but no activity on this "batch verification" algorithm has (seemingly) been done/published, I'd like to ask: why? smtp Title: Re: Speeding up signature verification Post by: Peter Todd on January 13, 2013, 03:24:21 PM Since nearly 2 years have gone by but no activity on this "batch verification" algorithm has (seemingly) been done/published, I'd like to ask: why? smtp Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it? Title: Re: Speeding up signature verification Post by: MatthewLM on January 13, 2013, 03:41:08 PM Why won't GPUs help much? Surely checking signatures in parallel would be an improvement?
Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 04:02:30 PM Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it? The secp256k1-specific optimization has been implemented, and there's a pull request to integrate it in the reference client. It does indeed achieve some 20% improvement. What smtp asked was about the parallel batch verification, with potentially much higher speedups (Hal mentioned x4). The problem is that this is a much more invasive change, and the risk of getting crypto stuff wrong is extremely high. It's also not certain which degree of speedup is possible, as we need to do recovery of R before batch verification is possible (and changing this requires a hard fork basically, as the signatures are part of the scripting rules). That said, if someone is interested in investigating, and potentially supplying a patch that is well testable... by all means, do so. Title: Re: Speeding up signature verification Post by: Etlase2 on January 13, 2013, 04:09:38 PM Why won't GPUs help much? Surely checking signatures in parallel would be an improvement? It's not a matter of parallel vs. not parallel; it has to do with the fact that GPUs are terrible at point multiplication, the difficult part of signature verification. A regular CPU far outclasses them at it, and for much less energy. Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 04:15:27 PM It's not a matter of parallel vs. not parallel; it has to do with the fact that GPUs are terrible at point multiplication, the difficult part of signature verification. A regular CPU far outclasses them at it, and for much less energy.
I'm pretty sure that a good OpenCL implementation of ECDSA verification, running on high-end ATI GPUs, will be significantly faster than what OpenSSL (or any implementation) can do on a CPU. People have already written EC addition in OpenCL, reaching speeds of tens of millions of operations per second. EC multiplication is much slower and more complex to implement efficiently, but I expect it would reach tens of thousands of operations per second at least. Someone just needs to write it... Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 04:17:15 PM Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it? Yes, in principle I agree, 25% is not much (I only got 22%). But if Hal's secp256k1Verify code is integrated into pre-0.8.0, it could end up being included in the next 0.8.0 release officially. This means other people recognize this as an important improvement. And in light of the fact that this ECDSA_verify is (or has just become) the probable bottleneck, I can understand their decision. To answer your question "Why risk it?": To not lose more and more potential new users because they are annoyed by the huge amount of real time necessary to verify the current block chain with the official bitcoin versions. It seems most people don't see this real danger -- this real-time behaviour might restrict (or even kill) bitcoin usage/spread seriously. :-( smtp Title: Re: Speeding up signature verification Post by: Peter Todd on January 13, 2013, 04:30:17 PM To answer your question "Why risk it?": To not lose more and more potential new users because they are annoyed by the huge amount of real time necessary to verify the current block chain with the official bitcoin versions.
It seems most people don't see this real danger -- this real-time behaviour might restrict (or even kill) bitcoin usage/spread seriously. :-( You know what turns potential new users off of Bitcoin? Headlines like "Bitcoin hacked! Thousands lose all their Bitcoins!" If having to wait ten minutes to an hour is what makes you not want to use Bitcoin, how about you wait until we come up with an overlay network on top of Bitcoin with instant transactions? It'll happen eventually; I'm working on one such concept myself. Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 04:32:01 PM Someone just needs to write it... Emphasis is "just" ... since February 2011. :->> But what I dislike is the idea of expecting in the future not only a CPU for a bitcoin client, but also a (high-end?) GPU for the customer to get reasonable real-time behaviour from his bitcoin client. This I call a misconception of the future. smtp Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 04:37:32 PM If having to wait ten minutes to an hour is what makes you not want to use Bitcoin, how about you wait until we come up with an overlay network on top of Bitcoin with instant transactions? It'll happen eventually; I'm working on one such concept myself. I guess waiting > 24 h for the whole block chain is closer to the truth (not 10 minutes to an hour); this is annoying. :-( At least 0.8.0 will have an option to not verify the bootstrap.dat, I think. smtp Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 04:50:07 PM But what I dislike is the idea of expecting in the future not only a CPU for a bitcoin client, but also a (high-end?) GPU for the customer to get reasonable real-time behaviour from his bitcoin client. This I call a misconception of the future. In the future - at some point (near or far) - it's not reasonable to expect that every end user will run a fully verifying bitcoin node.
If they use a desktop client at all (I assume much more will happen on mobile devices, for example), it will probably be with an SPV client or even lighter. At this point in the future, "reasonable real-time behaviour for the bitcoin-client" will only apply to those that have to (miners, payment processors, merchants) or volunteer to. I expect such people to run a node 24/7, and the time to do a sync will only apply once. That doesn't mean we should tell everyone now to move away from fully validating clients (even though many people are, unfortunately but understandably). While the economy grows, we need a network to sustain it. In this respect, you are right, and we should make an effort to keep performance as good as possible (and I believe I do what I can for that). But as retep says, correctness is of much higher importance - if there is a problem with the code, it can be a disaster (lost coins, forked network, ...). In particular, I have reasonable confidence in the 25% speedup optimization for secp256k1 after having spent time on it myself and tested it by comparing the output of the optimized operations with the non-optimized OpenSSL ones. I'd still like a professional cryptographer to look at the code, though. GPU code to do ECDSA verification would in the first place be an experiment, in my opinion. If it indeed helps in significantly faster verification, some may consider it. But as things stand now, I don't think there's an actual need. Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 04:57:10 PM At least 0.8.0 will have an option to not verify the bootstrap.dat, I think. You can't build a UTXO set (needed for validation) without doing most of the verification anyway (everything except signature checking, basically).
If you mean trusting someone to build a pre-indexed blockchain for you, you're far better off running an SPV node - it's faster, needs less disk, needs less bandwidth to maintain, and doesn't require trust in a centralized source for your block data. The current git head (pre-0.8) code still has some problems that cause initial syncup to be slow (in particular, the block fetching algorithm is crappy, and gets very easily confused to the point where it stops doing anything for several minutes). From reports, it also seems it's significantly slower at verifying on Windows systems. Still, if it's combined with parallel+optimized signature verification code (which may or may not make it into the 0.8 release), and the block fetching system gets fixed, I hope we can get close to a sync time of 1 hour on reasonably recent systems. Title: Re: Speeding up signature verification Post by: Peter Todd on January 13, 2013, 05:11:38 PM If having to wait ten minutes to an hour is what makes you not want to use Bitcoin, how about you wait until we come up with an overlay network on top of Bitcoin with instant transactions? It'll happen eventually; I'm working on one such concept myself. I guess waiting > 24 h for the whole block chain is closer to the truth (not 10 minutes to an hour); this is annoying. :-( It took me a week to apply for my PayPal account back in the day. Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 05:42:17 PM You can't build a UTXO set (needed for validation) without doing most of the verification anyway (everything except signature checking, basically). "most of the verification anyway" ...
I'd like to weight this "most" by real time, not by code size or "logical" verification classes, and then it turns out that the signature checking needs 99% or far more of it (depending on your disk speed). Or to put it in actual numbers: up to block height 216283 I need 112.7 sec of CPU time to put all data into place for every possible check (including script validation). Real time is about 2-3 times larger, depending on how fast the disk I read the block chain from is (I needed e.g. 260 sec). And doing full verification (with script verifications) needs 22419827/940 sec > 6.5 hours (I did not do it, only estimated it from a sample of about 890000 OP_CHECKSIG) in the most recent redeeming scripts. And I don't believe this huge real-time saving is because I use hashtables and not LevelDB structures. :-) Therefore I expect that loading a new bootstrap.dat (say to block height 216283) from disk into the client with pre-0.8.0 will only need about 5 minutes. BTW: B-trees may now be much better justified, because you now need many more deletions in your data structures due to the redeemed scripts. smtp Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 05:52:01 PM You can't build a UTXO set (needed for validation) without doing most of the verification anyway (everything except signature checking, basically). "most of the verification anyway" ... I'd like to weight this "most" by real time, not by code size or "logical" verification classes, and then it turns out that the signature checking needs 99% or far more of it (depending on your disk speed). Sure, I fully agree. In terms of CPU usage, signature validation outweighs everything. But before the last checkpoint (= exactly what is stored in bootstrap.dat), signature checking is disabled anyway (and this hasn't changed in 0.8). Also, I assume that importing bootstrap.dat on your hardware will take more than 5 minutes.
Done completely in RAM, it's probably possible, but the client also maintains a database on disk, and does synchronous writes at times to make sure block data and index/coins remain consistent with each other. Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 06:22:40 PM But before the last checkpoint (= exactly what is stored in bootstrap.dat), signature checking is disabled anyway (and this hasn't changed in 0.8). ??? Have I fallen victim to a misunderstanding, or have you just told me a trick? :) I simply concatenate my current block chain (3 files) into 1 huge file (almost 4.9 GB), name it "bootstrap.dat", and 0.7.* will load it into the bitcoin client without signature checking when I start it (and of course build a new blkindex.dat)? smtp Title: Re: Speeding up signature verification Post by: MatthewLM on January 13, 2013, 06:28:23 PM If you are to use a GPU, why not also use the CPU alongside it? If time is important, then surely using as much processing power as is available is recommended. Understandably there are other considerations such as power usage and the operation of other programs on the same system.
At least you can take advantage of multiple CPU cores with multiple threads. Title: Re: Speeding up signature verification Post by: smtp on January 13, 2013, 06:48:35 PM If you are to use a GPU, why not also use the CPU alongside it? If time is important, then surely using as much processing power as is available is recommended. Understandably there are other considerations such as power usage and the operation of other programs on the same system. Of course, but two important points must/should be decided (as usual): At least you can take advantage of multiple CPU cores with multiple threads. 1) Effort: Who should write all this code (for which GPUs?) and merge it with bitcoin-0.? (how much real time spent by a (few) developer(s)?) 2) Benefit: How big is the average benefit of this new code (for all bitcoin-client users)? smtp Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 06:53:55 PM I simply concatenate my current block chain (3 files) into 1 huge file (almost 4.9 GB), name it "bootstrap.dat", and 0.7.* will load it into the bitcoin client without signature checking when I start it (and of course build a new blkindex.dat)? It will load it, but only the first 193000 blocks will be done without signature checking. Also, I think 0.7.x has a bug where no more than 2G of bootstrap.dat is loaded anyway. You can use -loadblock=file1 -loadblock=file2, ... though.
You can use -loadblock=file1 -loadblock=file2, ... though. All this I knew -- I thought that I had misunderstood that this 193000 blocks limit was hardcoded into the client. :-/ I did it with -loadblock... already, thus my personal annoying real-time experience. Thanks anyway; then I can stop my attempt now -- I had just started 0.7.1 with this 4.9 GB bootstrap.dat and was waiting to see whether it would all succeed. 193000 blocks looks about the same as 216000 blocks, but the sad truth is: it's only about 50% of the transactions, and that is what matters. :-/ smtp Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 13, 2013, 08:24:49 PM All this I knew -- I thought that I had misunderstood that this 193000 blocks limit was hardcoded into the client. :-/ Yes, it is hardcoded. 0.7.x has its latest checkpoint at 193000. The current development code has an extra one at 210000. Title: Re: Speeding up signature verification Post by: Diapolo on January 14, 2013, 08:04:31 AM But before the last checkpoint (= exactly what is stored in bootstrap.dat), signature checking is disabled anyway (and this hasn't changed in 0.8). ??? Have I fallen victim to a misunderstanding, or have you just told me a trick? :) I simply concatenate my current block chain (3 files) into 1 huge file (almost 4.9 GB), name it "bootstrap.dat", and 0.7.* will load it into the bitcoin client without signature checking when I start it (and of course build a new blkindex.dat)? smtp You perhaps won't succeed with a file > 4GB. There were some limits with fseek or such, as far as I remember. Dia Title: Re: Speeding up signature verification Post by: smtp on January 14, 2013, 09:06:58 AM You perhaps won't succeed with a file > 4GB. There were some limits with fseek or such, as far as I remember.
On 64-bit Linux the limit is 2^63 as the file position, and I guess on 32-bit Linux 2^31 (at least, depending on the size of the long data type), but the (official) 193000-block bootstrap file already has a size > 2GB. Addendum: I should add: really it depends on the filesystem you are using. I was talking about Ext2, Ext3 and Ext4 (the default Linux filesystems). smtp Title: Re: Speeding up signature verification Post by: Pieter Wuille on January 14, 2013, 08:34:40 PM 0.7.x indeed has a bug in that it needs the ability to seek to positions in block files being imported. This limits it to 2 GB on some systems.
The current git head code doesn't need this, and can import blocks from arbitrary-sized files. Title: Re: Speeding up signature verification Post by: deuteragenie on January 29, 2014, 10:00:48 PM The book tells (well, hints) how to compute lambda and beta, and here are the values I found: lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72 beta = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee Isn't lambda a solution of l*l + l = -1 mod n? If that's the case, there are actually two possible lambdas: 5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72 and AC9C52B33FA3CF1F5AD9E3FD77ED9BA4A880B9FC8EC739C2E0CFC810B51283CE What made you choose the smaller one? Title: Re: Speeding up signature verification Post by: ysangkok on January 04, 2015, 11:59:50 PM Risk vs reward. No offense to Hal, but 25% isn't a very big improvement, and signature verification is something that absolutely must work correctly. Get it wrong and people can lose a lot of money very quickly. Why risk it? The secp256k1-specific optimization has been implemented, and there's a pull request to integrate it in the reference client. It does indeed achieve some 20% improvement. What smtp asked was about the parallel batch verification, with potentially much higher speedups (Hal mentioned x4). The problem is that this is a much more invasive change, and the risk of getting crypto stuff wrong is extremely high. It's also not certain which degree of speedup is possible, as we need to do recovery of R before batch verification is possible (and changing this requires a hard fork basically, as the signatures are part of the scripting rules). That said, if someone is interested in investigating, and potentially supplying a patch that is well testable... by all means, do so.
It looks as if this code is slated for inclusion in 0.10, but the given reasons are not performance, but security: https://github.com/bitcoin/bitcoin/blob/0.10/doc/release-notes.md#improved-signing-security Is the new code doing the batch verification previously mentioned or is it doing the "some 20%" optimization? Or maybe the optimizations vanished to mitigate timing attacks? Thanks in advance. Title: Re: Speeding up signature verification Post by: gmaxwell on January 05, 2015, 01:53:35 AM Quote: 5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72 and AC9C52B33FA3CF1F5AD9E3FD77ED9BA4A880B9FC8EC739C2E0CFC810B51283CE What made you choose the smaller one? Lambda (mod N) has multiplicative order three; the other value is just lambda applied twice. Unfortunately, it appears there is no way to use that for additional speedups, because there is no small orthogonal basis that can be formed from a reduction on [n, lambda, lambda^2]. Quote: It looks as if this code is slated for inclusion in 0.10, but the given reasons are not performance, but security: Absolutely not. The changes in 0.10 you're referring to are exclusively related to signing, as the release notes say, and do not change verification. Quote: Is the new code doing the batch verification previously mentioned or is it doing the "some 20%" optimization? Or maybe the optimizations vanished to mitigate timing attacks? Verification in libsecp256k1 on x86_64 is now probably closer to ten times faster than OpenSSL's implementation, but the software is very much still under development. This is without batch validation.
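The order-three claim is easy to check numerically. A quick sketch (the two hex constants are the ones quoted in the posts above):

```python
# Check the remark above: lambda has multiplicative order 3 mod n, and the
# second root of l^2 + l + 1 = 0 (mod n) is just lambda applied twice.

n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141
lam1 = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72
lam2 = 0xAC9C52B33FA3CF1F5AD9E3FD77ED9BA4A880B9FC8EC739C2E0CFC810B51283CE

assert pow(lam1, 3, n) == 1                     # multiplicative order three
assert pow(lam1, 2, n) == lam2                  # "lambda applied twice"
assert (lam1 * lam1 + lam1 + 1) % n == 0        # both solve l^2 + l + 1 = 0 mod n
assert (lam2 * lam2 + lam2 + 1) % n == 0
```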
True batch validation requires additional side information to have any hope of being faster, and even with it the speed increase is not that enormous due to the additional blinding required to preserve security. Title: Re: Speeding up signature verification Post by: AlexGR on April 26, 2016, 03:25:31 AM Is there any downside in using SIMD instructions (loading multiple values in different segments of the same registers and performing "packed" SSE/AVX instructions for the math operations) to batch-process multiple signatures per thread?
It would theoretically get the CPU cycles per round much lower than doing it in a serial manner. Title: Re: Speeding up signature verification Post by: gmaxwell on April 26, 2016, 04:45:49 AM Is there any downside in using SIMD instructions (loading multiple values in different segments of the same registers and performing "packed" SSE/AVX instructions for the math operations) to batch-process multiple signatures per thread? You lose the 64,64->128 widening multiplies. The net result appears to be a loss of performance. It would theoretically get the CPU cycles per round much lower than doing it in a serial manner. If someone was able to get the field arithmetic to operate usefully in parallel, there wouldn't be a need to perform multiple verifications at once to get performance gains, fwiw. But actually getting it to be faster while losing the wide multiplies appears to be elusive. Even in the AVX-512 integer stuff I do not see a quadword multiply. Title: Re: Speeding up signature verification Post by: AlexGR on April 26, 2016, 11:02:52 PM Does this help at all?
https://en.wikipedia.org/wiki/CLMUL_instruction_set http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/carry-less-multiplication-instruction-in-gcm-mode-paper.pdf http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/polynomial-multiplication-instructions-paper.pdf "Using the PCLMULQDQ instruction, the performance of ECC can be boosted significantly on a range of IA processor cores, making it an attractive option over other public key algorithms." There are also some newer "regular" instructions that might be useful for speedups (mulx, adcx, adox): http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf Quote MULX Instruction The mulx instruction is an extension of the existing mul instruction, with the difference being in the effect on flags: mulx dest_hi, dest_lo, src1 The instruction also uses an implicit src2 register, edx or rdx depending on whether the 32-bit or 64-bit version is being used. The operation is: dest_hi:dest_lo = src1 * r/edx The reg/mem source operand src1 is multiplied by rdx/edx, and the result is stored in the two destination registers dest_hi:dest_lo. No flags are modified. This provides two key advantages over the existing mul instruction: - Greater flexibility in register usage, as current mul destination registers are implicitly defined. With mulx, the destination registers may be distinct from the source operands, so that the source operands are not over-written. - Since no flags are modified, mulx instructions can be mixed with add-carry instructions without corrupting the carry chain. ADCX/ADOX Instructions The adcx and adox instructions are extensions of the adc instruction, designed to support two separate carry chains. They are defined as: adcx dest/src1, src2 adox dest/src1, src2 Both instructions compute the sum of src1 and src2 plus a carry-in and generate an output sum dest and a carry-out. 
The difference between these two instructions is that adcx uses the CF flag for the carry in and carry out (leaving the OF flag unchanged), whereas the adox instruction uses the OF flag for the carry in and carry out (leaving the CF flag unchanged). The primary advantage of these instructions over adc is that they support two independent carry chains. Note that the two carry chains can be initialized by an instruction that clears both the CF and OF flags, for example "xor reg,reg". As for AVX512, I think AVX512-DQ (Q=quadword) might do (some of) the job, but it will be too limited in use (Xeons). Title: Re: Speeding up signature verification Post by: gmaxwell on April 27, 2016, 08:30:18 AM "Using the PCLMULQDQ instruction, the performance of ECC can be boosted significantly on a range of IA processor cores, making it an attractive option over other public key algorithms." Not really: the 'carryless multiply' is useful for characteristic-2 ECC, which only crazy people would use. There are also some newer "regular" instructions that might be useful for speedups (mulx, adcx, adox): http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf Yes, these are somewhat useful... but they are only in very new processors. E.g. I just got Broadwell-EP chips that support ADCX a couple of weeks ago (released April 1st). But they don't let us do things efficiently with SIMD; they would just likely make the existing approach somewhat faster if we use them. Title: Re: Speeding up signature verification Post by: AlexGR on April 08, 2017, 12:14:52 AM After tinkering a bit with the code over the past year, I now understand that I initially asked the wrong question regarding putting multiple verifications in parallel in a SIMD manner, as the calculations required in a single signature are too many by themselves.
Since this is the case, the question should be rephrased as whether SIMD can be used on these multiple calculations of a single pass. As you rightly pointed out, quadword->octaword is generally not supported. Now, I think I found a way to use vectorization, bypassing the quadword->octaword obstacle - since it doesn't seem that it will be getting hardware support anytime soon. The answer lies in breaking up the multiplications into 32x32-bit multiplies producing 64-bit results, like the 10x26 field does, then packing the sources for packed multiplications. I put the multiplications of 10x26 in vectorized SIMD loops, #pragma simd'ed (icc*), and it pulled it off (result on the left). https://i.imgur.com/jbDXe83.jpg This is the http://www.felixcloutier.com/x86/PMULUDQ.html instruction. It can be used for taking 4x 32-bit sources and producing 2x 64-bit outputs. On my Q8200 (no AVX2), in x64 mode (utilizing the 32-bit field), this is ~5-10% slower than issuing multiple 64-bit imuls. Imuls in x64 mode are very, very convenient for 32x32 bit => 64 bit, and this is what is normally generated in x64 mode. In an AVX2 scenario, packed multiplication goes up to 8x32-bit sources / 4x64-bit results. In an AVX512 scenario (assuming there is a similar instruction), it should pack 16x32-bit sources into eight 64-bit results. Plus, with an extra 16 registers, it should eliminate a lot of memory accesses. If verification speed is an issue in the future, we might be able to exploit this. As an incidental "discovery", while compiling with -m32 to check the 32-bit output, I saw that the uint64s are produced with plain muls in eax:edx (which is to be expected), although many 32-bit machines DO have support for SSE2. Now in x86+SSE2 mode, you can get either one or two 64-bit outputs (depending on how you word the instruction) without going through MULs/ADDs in eax:edx - which is slower. *gcc doesn't like the d=d+dd part inside the loop and it hesitates to vectorize even with #pragma GCC ivdep.
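The 32x32->64 limb idea can be sanity-checked in isolation. A simplified sketch (limb splitting and schoolbook partial products only; the carry propagation and modular reduction that the real 10x26 C code performs are deliberately omitted):

```python
# Simplified model of the 10x26 idea: split 256-bit field elements into 26-bit
# limbs so that every partial product is a 32x32->64 multiply (what PMULUDQ
# offers), and check the schoolbook product against Python's big integers.

MASK26 = (1 << 26) - 1

def to_limbs(a):
    """Split a (< 2^260) into 10 limbs of 26 bits each, least significant first."""
    return [(a >> (26 * i)) & MASK26 for i in range(10)]

def limb_mul(a, b):
    """Schoolbook multiply on 26-bit limbs; each a[i]*b[j] fits in 52 < 64 bits."""
    al, bl = to_limbs(a), to_limbs(b)
    prod = [0] * 19                       # positions 0..18 of the result
    for i in range(10):
        for j in range(10):
            assert al[i] * bl[j] < 1 << 52   # fits a 64-bit SIMD lane
            prod[i + j] += al[i] * bl[j]
    return sum(t << (26 * k) for k, t in enumerate(prod))

p = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
a = 0x123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
b = p - 12345
assert limb_mul(a, b) == a * b            # exact, before any reduction mod p
```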
One needs to do the additions of the results outside the loop, or write it manually in asm. ICC does it like a boss with the padds from the results.

edit: Something else I remembered regarding MULX + ADCX, which I've mentioned previously. I tried building with

./configure --enable-benchmark --enable-endomorphism --with-asm=no CFLAGS="-O3 -march=skylake"

...to check the asm output of new compilers. GCC 6.3 now outputs MULXs (no ADCXs though) in field and scalar multiplications. Clang 3.9 now employs both MULXs and ADCXs. (from clang)

000000000040c300 <secp256k1_fe_mul>:
...
  40c319: 48 8b 4e 08          mov 0x8(%rsi),%rcx
  40c31d: 48 89 4c 24 b8       mov %rcx,-0x48(%rsp)
  40c322: c4 62 e3 f6 c8       mulx %rax,%rbx,%r9
  40c327: 49 89 c3             mov %rax,%r11
  40c32a: 4c 89 5c 24 d8       mov %r11,-0x28(%rsp)
  40c32f: 49 8b 52 10          mov 0x10(%r10),%rdx
  40c333: 48 89 54 24 a0       mov %rdx,-0x60(%rsp)
  40c338: c4 62 fb f6 c1       mulx %rcx,%rax,%r8
  40c33d: 48 01 d8             add %rbx,%rax
  40c340: 66 4d 0f 38 f6 c1    adcx %r9,%r8
  40c346: 48 8b 6e 10          mov 0x10(%rsi),%rbp
  40c34a: 49 8b 1a             mov (%r10),%rbx
...

...although how much faster it is compared to MUL/ADC C code or asm, I can't say.

Title: Re: Speeding up signature verification Post by: AlexGR on April 13, 2018, 08:56:55 PM

I currently have an i5-8250u @ 3.4GHz laptop, and I tried MULXing / ADCXing / ADOXing the field 5x52 asm. I found ADCXs/ADOXs to be useless in speeding up short parallel chains; performance was the same as or worse than ADD/ADC.
MULX, on paper, is 4 cycles vs the classic MUL at 3 cycles. The benefit, though, is that you can continue MULXing on other registers, not waiting for the rax/rdx pair to be added before going on. So the results are in. Tested with gcc8 (I tried gcc7 + clang; similar performance improvements found) and parameters as follows (default cflags):

./configure --enable-endomorphism --enable-openssl-tests=no

Default run (normal asm / unmodified secp):

perf stat -d ./bench_verify
ecdsa_verify: min 46.8us / avg 46.8us / max 47.1us

 9377.185057 task-clock (msec) # 1.000 CPUs utilized
 27 context-switches # 0.003 K/sec
 2 cpu-migrations # 0.000 K/sec
 3,284 page-faults # 0.350 K/sec
 31,765,926,255 cycles # 3.388 GHz (62.49%)
 85,725,991,961 instructions # 2.70 insn per cycle (75.00%)
 1,822,885,107 branches # 194.396 M/sec (75.00%)
 64,368,756 branch-misses # 3.53% of all branches (75.00%)
 13,413,927,724 L1-dcache-loads # 1430.486 M/sec (75.03%)
 7,487,706 L1-dcache-load-misses # 0.06% of all L1-dcache hits (75.08%)
 4,131,670 LLC-loads # 0.441 M/sec (49.96%)
 93,400 LLC-load-misses # 2.26% of all LL-cache hits (49.92%)
 9.376614679 seconds time elapsed

./bench_internal
scalar_add: min 0.00982us / avg 0.00996us / max 0.0103us
scalar_negate: min 0.00376us / avg 0.00381us / max 0.00411us
scalar_sqr: min 0.0389us / avg 0.0393us / max 0.0405us
scalar_mul: min 0.0395us / avg 0.0398us / max 0.0413us
scalar_split: min 0.175us / avg 0.178us / max 0.186us
scalar_inverse: min 11.3us / avg 11.5us / max 11.7us
scalar_inverse_var: min 2.65us / avg 2.70us / max 2.85us
field_normalize: min 0.00988us / avg 0.00995us / max 0.0102us
field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00411us
field_sqr: min 0.0187us / avg 0.0189us / max 0.0194us
field_mul: min 0.0233us / avg 0.0236us / max 0.0254us
field_inverse: min 5.10us / avg 5.11us / max 5.14us
field_inverse_var: min 2.61us / avg 2.62us / max 2.69us
field_sqrt: min 5.07us / avg 5.08us / max 5.13us
group_double_var: min 0.149us / avg 0.150us / max 0.153us
group_add_var: min 0.337us / avg 0.338us / max 0.341us
group_add_affine: min 0.288us / avg 0.289us / max 0.292us
group_add_affine_var: min 0.243us / avg 0.244us / max 0.246us
group_jacobi_var: min 0.212us / avg 0.219us / max 0.251us
wnaf_const: min 0.0799us / avg 0.0830us / max 0.104us
ecmult_wnaf: min 0.528us / avg 0.532us / max 0.552us
hash_sha256: min 0.324us / avg 0.328us / max 0.345us
hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.30us
hash_rfc6979_hmac_sha256: min 7.00us / avg 7.00us / max 7.03us
context_verify: min 7007us / avg 7038us / max 7186us
context_sign: min 33.4us / avg 34.1us / max 36.7us
num_jacobi: min 0.109us / avg 0.111us / max 0.126us

After MULXing (custom field 5x52 asm with MULXs):

perf stat -d ./bench_verify
ecdsa_verify: min 39.9us / avg 39.9us / max 40.2us

 8003.494101 task-clock (msec) # 1.000 CPUs utilized
 28 context-switches # 0.003 K/sec
 0 cpu-migrations # 0.000 K/sec
 3,278 page-faults # 0.410 K/sec
 27,113,041,440 cycles # 3.388 GHz (62.46%)
 70,772,097,848 instructions # 2.61 insn per cycle (75.01%)
 1,872,709,155 branches # 233.986 M/sec (75.01%)
 63,567,635 branch-misses # 3.39% of all branches (75.01%)
 20,812,623,788 L1-dcache-loads # 2600.442 M/sec (75.01%)
 7,187,062 L1-dcache-load-misses # 0.03% of all L1-dcache hits (75.01%)
 4,098,304 LLC-loads # 0.512 M/sec (49.98%)
 97,108 LLC-load-misses # 2.37% of all LL-cache hits (49.98%)
 8.003690210 seconds time elapsed

./bench_internal
scalar_add: min 0.00982us / avg 0.00988us / max 0.0100us
scalar_negate: min 0.00376us / avg 0.00377us / max 0.00384us
scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0402us
scalar_mul: min 0.0395us / avg 0.0397us / max 0.0412us
scalar_split: min 0.176us / avg 0.178us / max 0.184us
scalar_inverse: min 11.3us / avg 11.4us / max 11.6us
scalar_inverse_var: min 2.66us / avg 2.71us / max 3.01us
field_normalize: min 0.00988us / avg 0.00998us / max 0.0104us
field_normalize_weak: min 0.00404us / avg 0.00406us / max 0.00415us
field_sqr: min 0.0153us / avg 0.0155us / max 0.0164us
field_mul: min 0.0172us / avg 0.0175us / max 0.0182us
field_inverse: min 4.17us / avg 4.18us / max 4.22us
field_inverse_var: min 2.62us / avg 2.63us / max 2.65us
field_sqrt: min 4.12us / avg 4.12us / max 4.13us
group_double_var: min 0.127us / avg 0.128us / max 0.131us
group_add_var: min 0.270us / avg 0.270us / max 0.272us
group_add_affine: min 0.250us / avg 0.250us / max 0.252us
group_add_affine_var: min 0.200us / avg 0.200us / max 0.201us
group_jacobi_var: min 0.212us / avg 0.214us / max 0.219us
wnaf_const: min 0.0799us / avg 0.0802us / max 0.0817us
ecmult_wnaf: min 0.528us / avg 0.535us / max 0.569us
hash_sha256: min 0.324us / avg 0.334us / max 0.355us
hash_hmac_sha256: min 1.27us / avg 1.27us / max 1.31us
hash_rfc6979_hmac_sha256: min 6.99us / avg 6.99us / max 7.03us
context_verify: min 5934us / avg 5966us / max 6127us
context_sign: min 30.9us / avg 31.3us / max 33.0us
num_jacobi: min 0.111us / avg 0.113us / max 0.120us

From 9.37 secs to 8.00 secs (0.85x). field_mul at 0.73x, field_sqr at 0.81x.
After doing some rearranging (C file) of group_impl.h:

perf stat -d ./bench_verify
ecdsa_verify: min 38.2us / avg 38.3us / max 38.7us

 7675.837387 task-clock (msec) # 1.000 CPUs utilized
 37 context-switches # 0.005 K/sec
 0 cpu-migrations # 0.000 K/sec
 3,268 page-faults # 0.426 K/sec
 25,993,738,895 cycles # 3.386 GHz (62.48%)
 70,649,153,999 instructions # 2.72 insn per cycle (74.99%)
 1,872,833,433 branches # 243.991 M/sec (74.99%)
 64,040,465 branch-misses # 3.42% of all branches (74.99%)
 20,969,673,428 L1-dcache-loads # 2731.907 M/sec (75.04%)
 7,260,544 L1-dcache-load-misses # 0.03% of all L1-dcache hits (75.09%)
 4,076,705 LLC-loads # 0.531 M/sec (49.97%)
 110,695 LLC-load-misses # 2.72% of all LL-cache hits (49.92%)
 7.675396172 seconds time elapsed

./bench_internal
scalar_add: min 0.00980us / avg 0.00984us / max 0.00987us
scalar_negate: min 0.00376us / avg 0.00377us / max 0.00382us
scalar_sqr: min 0.0389us / avg 0.0391us / max 0.0396us
scalar_mul: min 0.0392us / avg 0.0396us / max 0.0403us
scalar_split: min 0.176us / avg 0.177us / max 0.181us
scalar_inverse: min 11.3us / avg 11.4us / max 11.7us
scalar_inverse_var: min 2.65us / avg 2.69us / max 2.93us
field_normalize: min 0.00991us / avg 0.00999us / max 0.0103us
field_normalize_weak: min 0.00404us / avg 0.00405us / max 0.00414us
field_sqr: min 0.0153us / avg 0.0154us / max 0.0158us
field_mul: min 0.0172us / avg 0.0173us / max 0.0175us
field_inverse: min 4.17us / avg 4.18us / max 4.20us
field_inverse_var: min 2.62us / avg 2.62us / max 2.65us
field_sqrt: min 4.12us / avg 4.13us / max 4.16us
group_double_var: min 0.121us / avg 0.122us / max 0.123us
group_add_var: min 0.267us / avg 0.268us / max 0.271us
group_add_affine: min 0.249us / avg 0.249us / max 0.252us
group_add_affine_var: min 0.192us / avg 0.193us / max 0.196us
group_jacobi_var: min 0.211us / avg 0.214us / max 0.224us
wnaf_const: min 0.0799us / avg 0.0802us / max 0.0818us
ecmult_wnaf: min 0.528us / avg 0.534us / max 0.574us
hash_sha256: min 0.324us / avg 0.327us / max 0.341us
hash_hmac_sha256: min 1.26us / avg 1.27us / max 1.28us
hash_rfc6979_hmac_sha256: min 6.98us / avg 6.98us / max 7.01us
context_verify: min 5885us / avg 5916us / max 6039us
context_sign: min 30.9us / avg 31.5us / max 35.9us
num_jacobi: min 0.110us / avg 0.111us / max 0.122us

Time spent on ./bench_verify = 0.81x, from 31.76bn -> 25.99bn cycles. Most of these gains are hardware specific though. I have a few things I'm currently tampering with, getting it down to 0.78x: a special sqr_inner2 function which takes 3 parameters (r = output, a = input, counter). The counter is the number of times one wants to square the same number without re-calling the function; the function does the looping on its own. ~10% of the time spent in ./bench_verify is looped field squarings. It takes inversions/squarings from 4.2us down to 3.9us. The gain is much larger on less optimized sqr functions: the C int128 5x52 sqr impl goes 20%+ faster if looped internally with a counter.

[1] mulx asm: https://github.com/Alex-GR/secp256k1/blob/master/field_5x52_asm_impl.h
[2] reordered group_impl.h (most gains are in double_var, and it is probably not hardware specific since I'm seeing 2% gains even on a 10-year-old quad core): https://github.com/Alex-GR/secp256k1/blob/master/group_impl.h

Code is free to use/reuse/get ideas from/implement, etc. - although I don't claim it's heavily tested, or even safe. It does pass the tests though.

Title: Re: Speeding up signature verification Post by: fillippone on September 26, 2020, 10:46:31 AM

Quote: "I implemented an optimized ECDSA verify for the secp256k1 curve used by Bitcoin, based on pages 125-129 of the Guide to Elliptic Curve Cryptography, by Hankerson, Menezes and Vanstone. I own the book but I also found a PDF on a Russian site which is more convenient. <...>"

I am celebrating this post because today the patent covering this idea has finally expired, and this is going to be activated in the next Bitcoin Core release.
Thank you to all the great minds who contributed to this. Link (https://bitcointalk.org/index.php?topic=5207455.msg55266239#msg55266239)

Title: Re: Speeding up signature verification Post by: Karartma1 on September 26, 2020, 02:24:53 PM

Quote from fillippone: "<...> I am celebrating this post because today the patent covering this idea has finally expired, and this is going to be activated in the next bitcoin core release. Thank you to all the great minds who contributed to this."

Cheers Hal et al.