Latest posts of: AlexGR


321

Alternate cryptocurrencies / Announcements (Altcoins) / Re: [ANN][DASH] Dash (dash.org) | First Self-Funding Self-Governing Crypto Currency

on: February 11, 2017, 06:39:58 AM

Quote from: arielbit on February 11, 2017, 05:24:07 AM
now...after all this shills appearing around, did anyone of these bitches answered if Evan did really hide the "masternode concept" before launching and after 1 month since this shitcoin is launched?

You do understand that masternodes were a game-theory outgrowth of preventing mixing nodes from being Sybil-attacked, right?

322

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: February 10, 2017, 03:48:20 AM

Quote from: JimboToronto on February 09, 2017, 09:19:14 PM
Quote from: ImI on February 09, 2017, 02:37:41 PM
Next step would obv be to close a small exchange. Then, after several smaller ones, proceed with the bigger ones. Then go after the miners. Then go after every retailer that accepts BTC.

It seems to me that the PBOC isn't going after Bitcoin. They're simply trying to prevent the outflow of capital from their country.

If you are serious about preventing it, you don't allow citizens to export 50k USD per year... especially in a country with >1bn citizens. Or you don't allow the buying of gold with unlimited quantities of CNY (which is like converting your local currency into an international asset and leaving the central bank of China to deal with gold imports and USD outflows).

I think the currency issue is exaggerated by sites like zerohedge. Even if CNY goes down, Chinese goods become cheaper, exports increase (a lot) and there is a growing volume of USD inflows due to increased exports, which then reinforces the balance of CNY/USD. There's a natural equilibrium in situations involving flows of capital, currency values and trade balances - and as long as you are a serious net exporter (China is), it's unlikely that many things can affect you.

The only argument I've seen about how China's reserves of 2-3 trillion USD are somehow "inadequate" is that they are only a small fraction of the local currency. Lol? Supposedly this is somehow "scary" if all Chinese wanted to convert their CNY into USD and the USD ran out. Yet this can only happen through some kind of international media campaign to undermine the CNY through FUD. And there is such an ongoing campaign in both mainstream and alternative western media. Even bitcoin-related news uses false narratives (like "people are using BTC to bypass capital controls") regarding the CNY weaknesses etc etc, to artificially create runs to foreign currencies.

Of course, if these gain traction, they become self-fulfilling prophecies. It's like bank runs. When people hear that bank X is unsafe, even if it isn't, they might move their money. And then when sufficient momentum accumulates from people moving their money, the bank actually becomes problematic and the news stories are "vindicated" for reporting it early.

323

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: February 09, 2017, 02:19:02 PM

The good thing is that people got a new "reminder". If you don't get the coins in your wallet, you don't really own them. So if you buy coins => get them right away in your own wallet, don't leave them in an exchange... Exchanges get "hacked", get government interventions, etc etc.

324

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: February 07, 2017, 09:42:53 PM

Quote from: conspirosphere.tk on February 07, 2017, 05:43:48 PM
it will get critical at gold parity (1 ounce is 31.1 grams): http://pricedingold.com/bitcoin/

I like to think of parity in terms of ~380 ounces per BTC. That's due to the ratio of 6.1 bn above-ground ounces vs 16.15 mn bitcoins mined. That would require a price of $456k per BTC - but at least the scarcity ratio would be accounted for...

As a side note, the problem with gold spiking upwards is its large marketcap and large annual production levels. At current prices, ~3500 tons of gold per year of new mining output (without factoring in recycling) is ~110mn oz. That requires 132bn USD to absorb. A tenfold increase in the price of gold would suddenly require 1.32 trillion USD per year just to buy annual production. The problem is that there is no such liquidity in the system to allow this.

On the other hand, silver or bitcoin can do much larger runs due to their smaller marketcap and much smaller liquidity requirements to buy their annual production.

BTC at 10500$ (10x) would require just 6.9 billion per year to buy the annual production (657k coins x 10.5k USD).
Silver at 170$ (10x) would require 136 billion per year to buy the annual mining output of ~800mn oz.
Gold is priced so high that the numbers involved are too high at 1.32 trillion USD (in a scenario of 10x price) for its 110mn oz. It could happen in a hyperinflation scenario, but then the money one takes wouldn't be worth it anyway.

In a sense, gold is constrained from doing a huge run by its large marketcap and the liquidity requirements in the fiat system to sustain prices of 10x+. Silver less so, and bitcoin even less so. Bitcoin seems to be the best bet in terms of upwards potential because it's so small and its fiat requirements to sustain its rise and mining output absorption are equally small.
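The arithmetic in this post can be rechecked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the unit prices (gold ~$1,200/oz, silver ~$17/oz, BTC ~$1,050) are assumptions backed out of the totals quoted above rather than figures stated in the post.

```c
#include <assert.h>
#include <stdio.h>

/* USD needed per year to buy an asset's whole annual output at a given price. */
double annual_absorption_usd(double annual_units, double price_usd) {
    return annual_units * price_usd;
}

/* Above-ground gold ounces per mined bitcoin (the "scarcity ratio" in the post). */
double gold_oz_per_btc(double gold_oz, double btc_mined) {
    assert(btc_mined > 0.0);
    return gold_oz / btc_mined;
}

/* Prints the 10x scenario. The 10x prices below are assumptions derived from
   the post's totals, not quoted market data. */
void print_10x_scenario(void) {
    printf("gold oz per BTC: %.0f\n", gold_oz_per_btc(6.1e9, 16.15e6));
    printf("gold   10x: %.2f bn USD/yr\n", annual_absorption_usd(110e6, 12000.0) / 1e9);
    printf("silver 10x: %.2f bn USD/yr\n", annual_absorption_usd(800e6, 170.0) / 1e9);
    printf("BTC    10x: %.2f bn USD/yr\n", annual_absorption_usd(657e3, 10500.0) / 1e9);
}
```

Calling print_10x_scenario() from a main() prints the ratio and the three yearly absorption figures side by side, which makes the gap between gold and the two smaller markets easy to see.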

325

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: February 05, 2017, 09:35:52 PM

Quote from: European Central Bank on February 05, 2017, 08:54:44 PM
Quote from: AlexGR on February 05, 2017, 08:52:07 PM
By hiring "forking" actors. Forks and threats of forks are traditionally bad for price. And just when price is starting to go well, just like clockwork, the forking drama is reignited. We are just lucky that demand is so heavy that it exceeds the possible short-term problems created by the drama.

i've yet to see any fork talk affect the price.

Yeah, so far so good. But there is the history of XT and Classic threats and their impact on price.

Quote
it's a long term and ongoing cancer, not like a china ban or gox death that's instant. the only time there will be an effect is if it's clear there is going to be contentious fork after which there might not a be a price at all.

Exactly. At that point BTC is as good as dead, for setting a precedent that allows anyone to fork it for whatever bullshit reason. And some people are trying very hard to do it.

Quote
and traders don't seem to care about a clogged network either.

They shouldn't, because it's a technical and game theory issue. Even at 1 GB blocks, there could be a guy setting up a script that makes sure to fill 3 GB of txs in 10 mins and claims the network is clogged and needs upgrading. It's doable because Bitcoin does not have any proper measure to protect itself from this kind of abuse, other than the blocksize limit satoshi implemented. An alternative would be a relatively big fixed fee for the minimum tx to act as a disincentive.

The bottom line is that broadcasting txs doesn't cost anything, and the tx won't cost anything if it doesn't get processed (by putting it at a fee level lower than the bulk of other txs) - so you can always claim that there is a shortage of space and clogging.

Wallet software is where the game is at, rather than traders, because that's what affects usability. If the user doesn't understand that they have to put X fee to get their tx in by Y blocks, then you have complaints that they are waiting too long. Plus an option to bump one's fee up - in case one issued a tx with a low fee - should be implemented in the user interface.

326

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: February 05, 2017, 08:52:07 PM

Quote from: Genesis1337 on February 05, 2017, 08:47:48 PM
People are finally opening their eyes to bitcoin!!! How can the governments around the world crush this rally?

By hiring "forking" actors. Forks and threats of forks are traditionally bad for price. And just when price is starting to go well, just like clockwork, the forking drama is reignited. These people are cancer. We are just lucky that demand is so heavy that it exceeds the possible short-term problems created by the drama.

327

Bitcoin / Development & Technical Discussion / Re: Processor speed and blockchain synchronisation

on: February 05, 2017, 01:20:27 PM

Quote from: Jet Cash on February 05, 2017, 10:59:42 AM
Quote from: AlexGR on February 04, 2017, 09:38:42 PM
Another tip is perhaps disabling full journaling of the filesystem for initial sync as that does tend to slow it down, whether SSD or mechanical. Plus on SSD it would double all the writes (reducing the disk's lifetime).

That sounds like a good idea, but I can't find any info about it. By initial sync, do you mean the initial downloading of the blockchain? With this computer, I copied the blockchain directory from the i5 computer, and started the new node with the address of the copied file. That seems to have worked without any problems.

It's usually a mount option of the filesystem. You can check with the mount command. For example:

mount
/dev/sda5 on /home/alex/100GB type ext4 (rw,noatime,noacl,data=journal)

There are three modes: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

data=journal
All data are committed into the journal prior to being written into the main file system. Enabling this mode will disable delayed allocation and O_DIRECT support.
Note: this writes data twice. It's the slowest but the safest option for data corruption prevention.

data=ordered
All data are forced directly out to the main file system prior to its metadata being committed to the journal.

data=writeback
Data ordering is not preserved, data may be written into the main file system after its metadata has been committed to the journal.
Note: this is the fastest but the least safe if there is a sudden shutdown.
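To check the journaling mode programmatically instead of reading the mount output by eye, the option list in the parentheses can be scanned for a data= entry. The sketch below does only that string scan; ext4_data_mode is a hypothetical helper name, not part of any library, and it assumes you already have the comma-separated option string (for example from the mount output shown above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Looks for a "data=<mode>" entry in a comma-separated mount option string
   such as "rw,noatime,noacl,data=journal". Copies <mode> into out and
   returns 1 if found; returns 0 and writes an empty string otherwise. */
int ext4_data_mode(const char *opts, char *out, size_t outlen) {
    assert(opts != NULL && out != NULL);
    if (outlen == 0) return 0;
    const char *p = opts;
    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, ',');               /* end of this option */
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len > 5 && strncmp(p, "data=", 5) == 0) {
            size_t vlen = len - 5;
            if (vlen >= outlen) vlen = outlen - 1;      /* truncate to fit */
            memcpy(out, p + 5, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p = end ? end + 1 : NULL;                       /* next option */
    }
    out[0] = '\0';
    return 0;
}
```

If no data= entry is listed, the kernel documentation linked above marks data=ordered as the ext4 default, so a return value of 0 usually means ordered mode.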

328

Bitcoin / Development & Technical Discussion / Re: Processor speed and blockchain synchronisation

on: February 04, 2017, 09:38:42 PM

Quote from: Carlton Banks on February 04, 2017, 12:26:33 PM
It's not just the clock speed of the processor, there are qualitative differences between Celerons and i5's; your i5 probably has "hyper threading', otherwise known as SIMD

SMT, not SIMD. SMT (simultaneous multithreading) is multithreading on the same core, utilizing the different pipelines of the same CPU when they are sitting idle. SIMD refers to the SSE/AVX instructions, where you batch-process multiple data with one instruction.

Quote from: Jet Cash on February 04, 2017, 03:54:21 PM
I'll give you guys one tip though. If you are going to run a rig like this, don't forget to connect the SSD drive. Core doesn't get into a sweat about it, but it does suggest that it downloads the blockchain onto your internal SSD, and that may not be quite what you want.

Another tip is perhaps disabling full journaling of the filesystem for initial sync as that does tend to slow it down, whether SSD or mechanical. Plus on SSD it would double all the writes (reducing the disk's lifetime).

329

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: January 28, 2017, 07:57:15 PM

Year of the rooster, not chicken

330

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: January 27, 2017, 03:56:43 PM

Quote from: ImI on January 27, 2017, 03:49:37 PM
LePen hates Bitcoin.

If she leaves the Euro, then it doesn't matter whether she hates Bitcoin.

331

Alternate cryptocurrencies / Announcements (Altcoins) / Re: [ANN] [PASC] PascalCoin, deletable blockchain & bank account system [PASA]

on: January 26, 2017, 01:22:55 PM

Cryptocurrency in Pascal? Nice

332

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: January 24, 2017, 11:35:19 AM

Quote from: GBattaglia on January 24, 2017, 11:26:46 AM
Quote from: AlexGR on January 24, 2017, 11:22:27 AM
Quote from: simmo77 on January 24, 2017, 11:19:14 AM
Quote from: GBattaglia on January 24, 2017, 11:12:14 AM
I feel $890 will be the resistance point for this drop, but I question whether we will be able to hit it. I think if we drop to $880 then a large sell off will occur. Funny thing is the drop isn't coming from china, but they are following.

I'm not ballsy (read: stupid) enough to have fiat sitting around on exchanges to take advantage of little dips like this. I know there are ways of insta-buying BTC, but none of them are available to me, so I'll be watching from the sidelines again...

And that's precisely the reason for higher volatility. As people become wiser and not letting their funds on exchanges, the order books become thinner.

Fat order books = High risk for exchange "hacking" / lower volatility due to inertia (=fat order book)
Thin order books = Better prepared for exchange "hacking" / bigger volatility

I have to admit I don't have a huge fear of "hacking" with BTC-E. It is always a risk and I wouldn't store life savings there or anything, but they have been one of the most reliable in that regard for a very long time aside from the temporary DDOS downtimes that happen every now and then.

Personally I don't really believe the hacking narrative half the time, hence the "". Most of the time it's an inside job. I mean, Karpeles had like 200k coins stashed somewhere? Bitfinex had top-notch security and everything went out the window to make the impossible possible? Mintpal got bought and then got "hacked"? Yeah right.

So, from that perspective, people should be careful.

333

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: January 24, 2017, 11:22:27 AM

Quote from: simmo77 on January 24, 2017, 11:19:14 AM
Quote from: GBattaglia on January 24, 2017, 11:12:14 AM
I feel $890 will be the resistance point for this drop, but I question whether we will be able to hit it. I think if we drop to $880 then a large sell off will occur. Funny thing is the drop isn't coming from china, but they are following.

I'm not ballsy (read: stupid) enough to have fiat sitting around on exchanges to take advantage of little dips like this. I know there are ways of insta-buying BTC, but none of them are available to me, so I'll be watching from the sidelines again...

And that's precisely the reason for higher volatility. As people become wiser and stop leaving their funds on exchanges, the order books become thinner.

Fat order books = High risk for exchange "hacking" / lower volatility due to inertia (=fat order book)
Thin order books = Better prepared for exchange "hacking" / bigger volatility
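To put numbers on the fat vs thin order book point, here is a rough C sketch that walks the ask side of a book and returns the average fill price of a market buy. The level_t type, the function name and any book contents you feed it are hypothetical, chosen only to illustrate the mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* One price level on the ask side of an order book (hypothetical layout). */
typedef struct {
    double price;  /* USD per BTC */
    double size;   /* BTC offered at this price */
} level_t;

/* Average price paid by a market buy of qty BTC against asks[0..n-1],
   which must be sorted from best (lowest) price upwards.
   Returns -1.0 if qty is not positive or the book cannot fill the order. */
double market_buy_avg_price(const level_t *asks, size_t n, double qty) {
    assert(asks != NULL || n == 0);
    if (qty <= 0.0) return -1.0;
    double filled = 0.0, cost = 0.0;
    for (size_t i = 0; i < n && filled < qty; i++) {
        double take = asks[i].size;
        if (take > qty - filled) take = qty - filled;  /* partial fill of level */
        filled += take;
        cost += take * asks[i].price;
    }
    return (filled < qty) ? -1.0 : cost / filled;
}
```

With 100 BTC resting at the best ask, a 25 BTC buy fills entirely at that price. With only 10 BTC per level, the same order has to eat through three levels, so the last traded price moves further for the same order size.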

334

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: January 18, 2017, 09:15:39 PM

Quote from: bitcoinvest on January 18, 2017, 09:06:11 PM
This is nothing?? or is something? Who can put this orders? Somebody who has many maybe? Is some kind of luquidation or something...??

If you see similar and/or frequently used amounts on sell/buy side, whether in alts, or btc markets, it's more often than not a trading bot. It could be selling and rebuying these 25 BTCs (at a lower price) a hundred times a day.

335

Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion

on: January 16, 2017, 10:07:03 AM

Bitcoin Difficulty: 336,899,932,796
Estimated Next Difficulty: 380,176,964,299 (+12.85%)
Adjust time: After 1113 Blocks, About 6.8 days
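The percentage and the implied block pace can be rechecked from the three numbers above. A small sketch, with function names invented for illustration:

```c
#include <assert.h>

/* Percentage change from the current difficulty to the estimated next one. */
double difficulty_change_pct(double current, double next) {
    assert(current > 0.0);
    return (next / current - 1.0) * 100.0;
}

/* Average minutes per block implied by "N blocks left, about D days". */
double implied_block_minutes(double blocks_left, double days_left) {
    assert(blocks_left > 0.0);
    return days_left * 24.0 * 60.0 / blocks_left;
}
```

difficulty_change_pct(336899932796.0, 380176964299.0) comes out at roughly 12.85, matching the figure above, and implied_block_minutes(1113.0, 6.8) is a bit under 9 minutes, i.e. blocks arriving faster than the 10-minute target, which is why the estimate points upwards.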

336

Bitcoin / Development & Technical Discussion / Re: secp256k1 library and Intel cpu

on: January 14, 2017, 10:06:59 PM

Quote from: gmaxwell on January 14, 2017, 04:49:28 PM

Neat. You shouldn't benchmark using the tests: they're full of debugging instrumentation that distorts the performance and spend a lot of their time on random things. Compile with --enable-benchmarks and use the benchmarks.

A quick check on i7-4600U doesn't give a really clear result:

Before:
field_sqr: min 0.0915us / avg 0.0917us / max 0.0928us
field_mul: min 0.116us / avg 0.116us / max 0.117us
field_inverse: min 25.2us / avg 25.7us / max 28.5us
field_inverse_var: min 13.8us / avg 13.9us / max 14.0us
field_sqrt: min 24.9us / avg 25.0us / max 25.2us
ecdsa_verify: min 238us / avg 238us / max 239us

After (v1):
field_sqr: min 0.0924us / avg 0.0924us / max 0.0928us
field_mul: min 0.117us / avg 0.117us / max 0.117us
field_inverse: min 25.4us / avg 25.5us / max 25.9us
field_inverse_var: min 13.7us / avg 13.7us / max 14.0us
field_sqrt: min 25.1us / avg 25.3us / max 26.1us
ecdsa_verify: min 237us / avg 237us / max 237us

After (v2):
field_sqr: min 0.0942us / avg 0.0942us / max 0.0944us
field_mul: min 0.118us / avg 0.118us / max 0.119us
field_inverse: min 25.9us / avg 26.0us / max 26.4us
field_inverse_var: min 13.6us / avg 13.7us / max 13.8us
field_sqrt: min 25.6us / avg 25.9us / max 27.8us
ecdsa_verify: min 243us / avg 244us / max 246us

Hmm... interesting how different architectures are affected. Unless you are underclocked, I think the times are pretty slow for that particular CPU - is there any debugging or performance-logging framework running on top of this that creates overhead and distorts the performance? (Although I do expect newer chips to have better schedulers.) Realistically, you should be quite a bit faster than me. (My lib is built with ./configure --enable-benchmark and gcc default flags, no endomorphism.)

For comparison (Q8200 @ 1.86 GHz):

Before:
field_sqr: min 0.0680us / avg 0.0681us / max 0.0683us
field_mul: min 0.0833us / avg 0.0835us / max 0.0841us
field_inverse: min 18.5us / avg 18.6us / max 18.8us
field_inverse_var: min 6.32us / avg 6.32us / max 6.33us
field_sqrt: min 18.4us / avg 18.6us / max 18.9us
ecdsa_verify: min 243us / avg 243us / max 245us

(v1)
field_sqr: min 0.0654us / avg 0.0660us / max 0.0667us
field_mul: min 0.0819us / avg 0.0822us / max 0.0825us
field_inverse: min 18.4us / avg 18.4us / max 18.5us
field_inverse_var: min 6.35us / avg 6.36us / max 6.37us
field_sqrt: min 18.4us / avg 18.4us / max 18.5us
ecdsa_verify: min 235us / avg 236us / max 237us

(v2)
field_sqr: min 0.0660us / avg 0.0675us / max 0.0679us
field_mul: min 0.0858us / avg 0.0861us / max 0.0862us
field_inverse: min 18.8us / avg 18.8us / max 18.8us
field_inverse_var: min 6.31us / avg 6.31us / max 6.31us
field_sqrt: min 18.5us / avg 18.6us / max 18.7us
ecdsa_verify: min 243us / avg 243us / max 244us

I've always used the benchmarks provided, but I think they may lack real-world correlation. My wake-up call came a few months ago, when I disassembled bench_internal to see what clang and gcc were doing differently in terms of asm... one was inlining/merging the benchmark and the function to be benchmarked, thus saving the overhead of calling it and distorting the result. I think it was clang that was merging it - and that particular benchmark was faster for it. So, because of this type of distortion, I couldn't tell which implementation was actually faster. I think it would be a nice addition if we had something like the validation of, say, a given amount of bitcoin blocks (let's say 10-20 MB of data loaded in RAM) as a more real-life-like benchmark.
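One common way to keep a compiler from merging the benchmarked function into the timing loop is to mark it noinline and feed its result through a volatile variable. The sketch below uses the GCC/Clang noinline attribute; work_under_test is a stand-in routine invented for this example, not a libsecp256k1 function, and this is not how bench_internal is actually written.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the routine being measured (hypothetical). The attribute asks
   GCC/Clang not to inline it, so the call overhead stays in the measurement. */
__attribute__((noinline)) uint64_t work_under_test(uint64_t x) {
    return x * 0x9E3779B97F4A7C15ULL + 1;
}

/* Calls work_under_test iters times, chaining each result into the next call.
   Returns elapsed CPU seconds and stores the final value in *out. */
double bench_calls(uint64_t iters, uint64_t *out) {
    assert(out != NULL);
    volatile uint64_t sink = 1;   /* volatile: loads and stores cannot be elided */
    clock_t t0 = clock();
    for (uint64_t i = 0; i < iters; i++) {
        sink = work_under_test(sink);
    }
    clock_t t1 = clock();
    *out = sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

This only levels the playing field between compilers for the call itself. It says nothing about how the routine behaves inside a real workload, which is the argument for a block-validation benchmark.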

Btw, I remember having seen a video where you gave a lecture about the library at a university (?) and commented on the tests of the library, saying something to the effect that perhaps in the future a bounty could be issued for bugs that exist but can't be detected by the tests.

Asm tampering (especially if you try to repurpose the rdi/rsi registers) is definitely one of the fields where you can have the tests run fine and then have bench_internal or bench_verify abort due to an error. Or the opposite (benchmarks run OK, tests crash). Or have it be entirely OK in one compiler (tests/benchmarks) and then crash in another. This is due to the compiler using the registers differently before or after the functions, in conjunction with other functions. The same code is sometimes OK in certain use cases (executables) and crashes in others (different executables), so it's much trickier than C, because I have no idea how a test could catch these. After all, it can only test its own execution.

My "manual" testing routine to see if everything is OK is to run ./tests, ./bench_internal and ./bench_verify. If everything passes, it's probably good. This is not for the 5x52 (which doesn't have unstable code in it) but for my custom-made 4x64 impl.h (with different secp256k1_scalar_mul and secp256k1_scalar_sqr).

I wanted to put the whole file in so that no cutting and splicing are needed for 2 functions, but the forum notification is bugged (saying I have a post >64kbytes when I don't) so I had to cut down on the text. Anyway...

Code:

static void secp256k1_scalar_mul(secp256k1_scalar *r, const secp256k1_scalar *a, const secp256k1_scalar *b) {
 #ifdef USE_ASM_X86_64
    uint64_t l[8];
    const uint64_t *pb = b->d;
    
    __asm__ __volatile__(
    /* Preload */
    "movq 0(%%rdi), %%r15\n"
    "movq 8(%%rdi), %%rbx\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 0(%%rdx), %%r11\n"
    "movq 8(%%rdx), %%r9\n"
    "movq 16(%%rdx), %%r10\n"
    "movq 24(%%rdx), %%r8\n"
    /* (rax,rdx) = a0 * b0 */
    "movq %%r15, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 */
    "movq %%rax, 0(%%rsi)\n"
    /* (r14,r12,r13) = (rdx) */
    "movq %%rdx, %%r14\n"
    "xorq %%r12, %%r12\n"
    "xorq %%r13, %%r13\n"
    /* (r14,r12,r13) += a0 * b1 */
    "movq %%r15, %%rax\n"
    "mulq %%r9\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    /* Extract l1 */
    "movq %%r14, 8(%%rsi)\n"
    "movq $0, %%r14\n"
    /* (r12,r13,r14) += a0 * b2 */
    "movq %%r15, %%rax\n"
    "adcq $0, %%r13\n"
    "mulq %%r10\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a1 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a2 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    /* Extract l2 */
    "movq %%r12, 16(%%rsi)\n"
    "movq $0, %%r12\n"
    /* (r13,r14,r12) += a0 * b3 */
    "movq %%r15, %%rax\n"
    "adcq $0, %%r14\n"
    "mulq %%r8\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Preload a3 */
    "movq 24(%%rdi), %%r15\n"
    /* (r13,r14,r12) += a1 * b2 */
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r10\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r12\n"
    /* (r13,r14,r12) += a2 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r12\n"
    /* (r13,r14,r12) += a3 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Extract l3 */
    "movq %%r13, 24(%%rsi)\n"
    "movq $0, %%r13\n"
    /* (r14,r12,r13) += a1 * b3 */
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r8\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a2 * b2 */
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a3 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r13\n"
    /* Extract l4 */
   /* "movq %%r14, 32(%%rsi)\n"*/
    /* (r12,r13,r14) += a2 * b3 */
    "mulq %%r8\n"
    "movq %%r14, %%r11\n"
    "xorq %%r14, %%r14\n"
    "addq %%rax, %%r12\n"
    "movq %%r15, %%rax\n"
    "adcq %%rdx, %%r13\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a3 * b2 */
    "mulq %%r10\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r14\n"
    /* Extract l5 */
    /*"movq %%r12, 40(%%rsi)\n"*/
    /* (r13,r14) += a3 * b3 */
    "mulq %%r8\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Extract l6 */
    /*"movq %%r13, 48(%%rsi)\n"*/
    /* Extract l7 */
    /*"movq %%r14, 56(%%rsi)\n"*/
    : "+d"(pb)
    : "S"(l), "D"(a->d)
    : "rax", "rbx", "rcx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
    

      __asm__ __volatile__(
    /* Preload. */
  /*  "movq 32(%%rsi), %%r11\n" */
  /*  "movq 40(%%rsi), %%r12\n" */
   /*"movq 48(%%rsi), %%r13\n" */
  /*   "movq 56(%%rsi), %%r14\n" */
    "movq 0(%%rsi), %%rbx\n"  
    "movq %3, %%rax\n"
    "movq %%rax, %%r10\n"
    "xor %%ecx, %%ecx\n"  
    "xorq %%r15, %%r15\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r8, %%r8\n"
    "mulq %%r11\n"
    "addq %%rax, %%rbx\n" /*q0 into rbx*/
    "adcq %%rdx, %%rcx\n"
    "addq 8(%%rsi), %%rcx\n" 
    "movq %%r10, %%rax\n"
    "adcq %%r9, %%r15\n"
    "mulq %%r12\n"
    "addq %%rax, %%rcx\n" /*q1 stored to rcx*/
    "adcq %%rdx, %%r15\n"
    "movq %4, %%rax\n" 
    "adcq %%r9, %%r8\n"
    "mulq %%r11\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r15\n"
    "adcq %%r9, %%r8\n"
    "addq 16(%%rsi), %%r15\n"
    "adcq %%r9, %%r8\n"
    "movq %%r10, %%rax\n"
    "adcq %%r9, %%r9\n"
    "mulq %%r13\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r8\n"
    "movq %4, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r12\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "movq %%r10, %%rax\n"
    "movq $0, %%r10\n"
    "addq %%r11, %%r15\n" /*q2 into r15*/
    "adcq $0, %%r8\n"
    "adcq $0, %%r9\n"
    "addq 24(%%rsi), %%r8\n"
    "adcq $0, %%r9\n"
    "adcq %%r10, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "movq %4, %%rax\n"  
    "movq %%rax, %%rsi\n"  
    "adcq $0, %%r10\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%r8, %%r12\n" /* q3 into r12*/
    "adcq $0, %%r9\n"
    "movq $0, %%r8\n"
    "movq %%rsi, %%rax\n" 
    "adcq $0, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq %%r8, %%r8\n"
    "addq %%r9, %%r13\n" /*q4 into r13*/
    "adcq $0, %%r10\n" 
    "adcq $0, %%r8\n" 
    "addq %%r14, %%r10\n" /* q5 into r10 */ 
    "movq %3, %%rax\n"
    "movq %%rax, %%r9\n"
    "adcq $0, %%r8\n" /*q6 into r8*/
  
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
    %q6 input for second operation is %q1 output from first / RCX as the connecting link
    %q7 input for second operation is %q2 output from first / R15 as the connecting link
    %q8 input for second operation is %q3 output from first / R12 as the connecting link
    %q9  input for second operation is %q4 output from first / R13 as the connecting link*
    %q10 input for second operation is %q5 output from first / R10 as the connecting link*
    %q11 input for second operation is %q6 output from first  / R8 as the connecting link */    
    
    /* Reduce 385 bits into 258. */

    "mulq %%r13\n"
    "xorq %%r14, %%r14\n"
    "xorq %%r11, %%r11\n"
    "addq %%rax, %%rbx\n" /* q0 output*/
    "adcq %%rdx, %%r14\n"
    "addq %%rcx, %%r14\n" 
    "mov $0, %%ecx\n"  
    "movq %%r9, %%rax\n"
    "adcq %%r11, %%r11\n"
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r11\n"
    "movq %%rsi, %%rax\n"
    "adcq %%rcx, %%rcx\n"
    "mulq %%r13\n"
    "addq %%rax, %%r14\n" /* q1 output */
    "movq %%r9, %%rax\n"
    "adcq %%rdx, %%r11\n"
    "adcq $0, %%rcx\n"
    "xorq %%r9, %%r9\n"
    "addq %%r15, %%r11\n" 
    "adcq %%r9, %%rcx\n"
    "movq %%rax, %%r15\n"
    "adcq %%r9, %%r9\n"
    "mulq %%r8\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r10\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "adcq $0, %%r9\n"
    "addq %%r13, %%r11\n" /* q2 output */
    "adcq $0, %%rcx\n"
    "adcq $0, %%r9\n"
    "addq %%r12, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r8\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r9\n"
    "addq %%r10, %%rcx\n"    /* q3 output */
    "adcq $0, %%r9\n"
    "movq %%r15, %%rax\n"
    "addq %%r8, %%r9\n" /* q4 output */
    
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
    %q2 input for next operation is %q1 output from prior / R14 as the connecting link 
    %q3 input for next operation is %q2 output from prior / R11 as the connecting link  
    %q4 input for next operation is %q3 output from prior / RCX as the connecting link
    %q5 input for next operation is %q4 output from prior / R9 as the connecting link   */
        
    /* Reduce 258 bits into 256. */

    "mulq %%r9\n"   
    "addq %%rbx, %%rax\n"
    "adcq $0, %%rdx\n"
    "movq %%rax, %%r8\n"  /* 0(q2) output */
    "movq %%rdx, %%r12\n" 
    "xorq %%r13, %%r13\n"
    "addq %%r14, %%r12\n"
    "movq %%rsi, %%rax\n"
    "adcq %%r13, %%r13\n"
    "mulq %%r9\n"
    "addq %%rax, %%r12\n" /* 8(q2) output */
    "adcq %%rdx, %%r13\n" 
    "xor %%ebx, %%ebx\n"
    "addq %%r9, %%r13\n"
    "adcq %%rbx, %%rbx\n"
    "movq $0xffffffffffffffff, %%r14\n"
    "addq %%r11, %%r13\n" /* 16(q2) output */
    "movq $0, %%r11\n"
    "adcq $0, %%rbx\n"
    "addq %%rcx, %%rbx\n"  /* 24(q2) output */
    "adcq $0, %%r11\n" /* c  output */

    
/*FINAL REDUCTION */
    
/*    r8 carries ex 0(%%rdi), 
       r12 carries ex 8(%%rdi),
       r13 carries ex 16(%%rdi), 
       rbx carries ex 24(%%rdi)
       r11 carries c */
    "movq $0xbaaedce6af48a03b,%%r9\n"
    "movq $0xbaaedce6af48a03a,%%rcx\n"
    "movq $0xbfd25e8cd0364140,%%r10\n"
    "cmp   %%r14 ,%%rbx\n"
    "setne %%dl\n"
    "cmp   $0xfffffffffffffffd,%%r13\n"
    "setbe %%al\n"
    "or     %%eax,%%edx\n"
    "cmp  %%rcx,%%r12\n"
    "setbe %%cl\n"
    "or     %%edx,%%ecx\n"
    "cmp  %%r9,%%r12\n"
    "movzbl %%dl,%%edx\n"
    "seta  %%r9b\n"
    "cmp  %%r10,%%r8\n"
    "movzbl %%cl,%%ecx\n"
    "seta  %%r10b\n"
    "not   %%ecx\n"
    "not   %%edx\n"
    "or     %%r10d,%%r9d\n"
    "movzbl %%r9b,%%r9d\n"
    "and   %%r9d,%%ecx\n"
    "xor    %%r9d,%%r9d\n"
    "cmp   %%r14,%%r13\n"
    "sete  %%r9b\n"
    "xor   %%r10d,%%r10d\n"
    "and   %%r9d,%%edx\n"
    "or     %%edx,%%ecx\n"
    "xor   %%edx,%%edx\n"
    "add  %%ecx,%%r11d\n"
    "imulq %%r11,%%r15\n"
    "addq  %%r15,%%r8\n"
    "adcq  %%rdx,%%r10\n"  
    "imulq %%r11,%%rsi\n"
    "xorq %%r15,%%r15\n"
    "xor   %%eax,%%eax\n"
    "movq  %%r8,0(%q2)\n"
    "xor   %%edx,%%edx\n"
    "addq %%r12,%%rsi\n"
    "adcq %%rdx,%%rdx\n" 
    "addq %%rsi,%%r10\n"
    "movq %%r10,8(%q2)\n"
    "adcq %%rdx,%%r15\n"
    "addq %%r11,%%r13\n"
    "adcq %%rax,%%rax\n" 
    "addq %%r15,%%r13\n"
    "movq %%r13,16(%q2)\n"
    "adcq $0,%%rax\n"
    "addq %%rbx,%%rax\n"
    "movq %%rax,24(%q2)\n"
    : "=D"(r)
    : "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory"); 
     

#else
    uint64_t l[8];
    secp256k1_scalar_mul_512(l, a, b);
    secp256k1_scalar_reduce_512(r, l);
#endif   
}

Code:

static void secp256k1_scalar_sqr(secp256k1_scalar *r, const secp256k1_scalar *a) {
 #ifdef USE_ASM_X86_64
    uint64_t l[8];
    
    __asm__ __volatile__(
    /* Preload */
    "movq 0(%%rdi), %%r11\n"
    "movq 8(%%rdi), %%r12\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 24(%%rdi), %%r14\n"
    /* (rax,rdx) = a0 * a0 */
    "movq %%r11, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 */
    "movq %%rax, %%rbx\n" /*0(%%rsi)\n"*/
    /* (r8,r9,r10) = (rdx,0) */
    "movq %%rdx, %%r15\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r10, %%r10\n"
    "xorq %%r8, %%r8\n"
    /* (r8,r9,r10) += 2 * a0 * a1 */
    "movq %%r11, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%rax, %%r15\n" /*8 rsi in r15*/
    "adcq %%rdx, %%r9\n"
    "movq %%r11, %%rax\n"
    "adcq $0, %%r10\n"
    /* Extract l1 */
   /* 8(rsi) in r15*/
    /* (r9,r10,r8) += 2 * a0 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r8\n"
    /* (r9,r10,r8) += a1 * a1 */
    "mulq %%r12\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    /* Extract l2 */
    "movq %%r9, 16(%%rsi)\n"
    "movq %%r11, %%rax\n"
    "movq $0, %%r9\n"
    /* (r10,r8,r9) += 2 * a0 * a3 */
    "adcq $0, %%r8\n"
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r9\n"
    /* (r10,r8,r9) += 2 * a1 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "movq %%r10, %%r13\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r9\n"
    /* Extract l3 */
    /*"movq %%r10, 24(%%rsi)\n"*/

    /* (r8,r9,r10) += 2 * a1 * a3 */
    "mulq %%r14\n"
    "xorq %%r10, %%r10\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r10\n"
    /* (r8,r9,r10) += a2 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    /* Extract l4 */
    /*"movq %%r8, 32(%%rsi)\n"*/
    "movq %%r8, %%r11\n"
    "movq %%rcx, %%rax\n"
    "movq $0, %%r8\n"
    /* (r9,r10,r8) += 2 * a2 * a3 */
    "adcq $0, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "movq %%r14, %%rax\n"
    "adcq $0, %%r8\n"
    /* Extract l5 */
    /*"movq %%r9, 40(%%rsi)\n"*/
 /*   "movq %%r9, %%r12\n"*/
    /* (r10,r8) += a3 * a3 */
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    /* Extract l6 */
    /*"movq %%r10, 48(%%rsi)\n"*/
    /*"movq %%r10, %%rcx\n"*/
    /* Extract l7 */
    /*"movq %%r8, 56(%%rsi)\n"*/
    /*"movq %%r8, %%r14\n"*/
    :
    : "S"(l), "D"(a->d)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
        
      __asm__ __volatile__(
    /* Preload. */
  /*  "movq 32(%%rsi), %%r11\n" */
  /*  "movq 40(%%rsi), %%r9\n" */
  /*   "movq 48(%%rsi), %%r10\n" */
  /*   "movq 56(%%rsi), %%r8\n" */
  /*  "movq 0(%%rsi), %%rbx\n"  */
 /*   "movq %%rcx, %%r13\n"*/
    "movq %3, %%rax\n"
    "adcq %%rdx, %%r8\n"
    "mulq %%r11\n"
    "xor %%ecx, %%ecx\n" 
    "xorq %%r12, %%r12\n"
    "xorq %%r14, %%r14\n"
    "addq %%rax, %%rbx\n" /*q0 into rbx*/
    "adcq %%rdx, %%rcx\n"
 /*   "addq 8(%%rsi), %%rcx\n" */
    "addq %%r15, %%rcx\n" 
    "mov $0, %%r15d\n"
    "movq %3, %%rax\n"
    "adcq %%r12, %%r15\n"
    "mulq %%r9\n"
    "addq %%rax, %%rcx\n" /*q1 stored to rcx*/
    "adcq %%rdx, %%r15\n"
    "movq %4, %%rax\n" 
    "adcq %%r12, %%r14\n"
    "mulq %%r11\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r15\n"
    "adcq %%r12, %%r14\n"
    "addq 16(%%rsi), %%r15\n"
    "adcq %%r12, %%r14\n"
    "movq %3, %%rax\n"
    "adcq %%r12, %%r12\n"
    "mulq %%r10\n"
    "movq %4, %%rsi\n"  
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r14\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r9\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r14\n"
    "adcq $0, %%r12\n"
    "movq %3, %%rax\n"
    "addq %%r11, %%r15\n" /*q2 into r15*/
    "adcq $0, %%r14\n"
    "adcq $0, %%r12\n"
    "addq %%r13, %%r14\n"
    "movq $0, %%r13\n"
    "adcq $0, %%r12\n"
    "adcq $0, %%r13\n"
    "mulq %%r8\n"
    "addq %%rax, %%r14\n"
    "movq %%rsi, %%rax\n"  
    "adcq %%rdx, %%r12\n"
    "adcq $0, %%r13\n"
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "adcq $0, %%r13\n"
    "addq %%r14, %%r9\n" /* q3 into r9*/
    "adcq $0, %%r12\n"
    "movq %%rsi, %%rax\n" 
    "movq $0, %%r14\n"
    "adcq $0, %%r13\n"
    "mulq %%r8\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "adcq %%r14, %%r14\n"
    "addq %%r12, %%r10\n" /*q4 into r10*/
    "adcq $0, %%r13\n" 
    "adcq $0, %%r14\n" 
    "addq %%r8, %%r13\n" /* q5 into r13 */ 
    "movq %3, %%rax\n"
    "movq %%rax, %%r12\n"
    "adcq $0, %%r14\n" /*q6 into r14*/
  
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
    %q6 input for second operation is %q1 output from first / RCX as the connecting link
    %q7 input for second operation is %q2 output from first / R15 as the connecting link
    %q8 input for second operation is %q3 output from first / r9 as the connecting link
    %q9  input for second operation is %q4 output from first / r10 as the connecting link*
    %q10 input for second operation is %q5 output from first / r13 as the connecting link*
    %q11 input for second operation is %q6 output from first  / r14 as the connecting link */    
    
    /* Reduce 385 bits into 258. */

    "mulq %%r10\n"
    "xorq %%r8, %%r8\n"
    "xorq %%r11, %%r11\n"
    "addq %%rax, %%rbx\n" /* q0 output*/
    "adcq %%rdx, %%r8\n"
    "addq %%rcx, %%r8\n" 
    "movq %%r12, %%rax\n"
    "mov $0, %%ecx\n"  
    "adcq %%r11, %%r11\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r11\n"
    "movq %%rsi, %%rax\n"
    "adcq %%rcx, %%rcx\n"
    "mulq %%r10\n"
    "addq %%rax, %%r8\n" /* q1 output */
    "movq %%r12, %%rax\n"
    "adcq %%rdx, %%r11\n"
    "adcq $0, %%rcx\n"
    "xorq %%r12, %%r12\n"
    "addq %%r15, %%r11\n" 
    "adcq %%r12, %%rcx\n"
    "movq %%rax, %%r15\n"
    "adcq %%r12, %%r12\n"
    "mulq %%r14\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r13\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "adcq $0, %%r12\n"
    "addq %%r10, %%r11\n" /* q2 output */
    "adcq $0, %%rcx\n"
    "adcq $0, %%r12\n"
    "addq %%r9, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r14\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r12\n"
    "addq %%r13, %%rcx\n"    /* q3 output */
    "adcq $0, %%r12\n"
    "movq %%r15, %%rax\n"
    "addq %%r14, %%r12\n" /* q4 output */
    
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
    %q2 input for next operation is %q1 output from prior / r8 as the connecting link 
    %q3 input for next operation is %q2 output from prior / R11 as the connecting link  
    %q4 input for next operation is %q3 output from prior / RCX as the connecting link
    %q5 input for next operation is %q4 output from prior / r12 as the connecting link   */
        
    /* Reduce 258 bits into 256. */

    "mulq %%r12\n"   
    "addq %%rbx, %%rax\n"
    "adcq $0, %%rdx\n"
    "movq %%rax, %%r14\n"  /* 0(q2) output */
    "movq %%rdx, %%r9\n" 
    "xorq %%r10, %%r10\n"
    "addq %%r8, %%r9\n"
    "movq %%rsi, %%rax\n"
    "adcq %%r10, %%r10\n"
    "mulq %%r12\n"
    "addq %%rax, %%r9\n" /* 8(q2) output */
    "adcq %%rdx, %%r10\n" 
    "xor %%ebx, %%ebx\n"
    "addq %%r12, %%r10\n"
    "adcq %%rbx, %%rbx\n"
    "movq $0xffffffffffffffff, %%r8\n"
    "addq %%r11, %%r10\n" /* 16(q2) output */
    "movq $0, %%r11\n"
    "adcq $0, %%rbx\n"
    "addq %%rcx, %%rbx\n"  /* 24(q2) output */
    "adcq $0, %%r11\n" /* c  output */

    
/*FINAL REDUCTION */
    
/*    r14 carries ex 0(%%rdi), 
       r9 carries ex 8(%%rdi),
       r10 carries ex 16(%%rdi), 
       rbx carries ex 24(%%rdi)
       r11 carries c */
    "movq $0xbaaedce6af48a03b,%%r12\n"
    "movq $0xbaaedce6af48a03a,%%rcx\n"
    "movq $0xbfd25e8cd0364140,%%r13\n"
    "cmp   %%r8 ,%%rbx\n"
    "setne %%dl\n"
    "cmp   $0xfffffffffffffffd,%%r10\n"
    "setbe %%al\n"
    "or     %%eax,%%edx\n"
    "cmp  %%rcx,%%r9\n"
    "setbe %%cl\n"
    "or     %%edx,%%ecx\n"
    "cmp  %%r12,%%r9\n"
    "movzbl %%dl,%%edx\n"
    "seta  %%r12b\n"
    "cmp  %%r13,%%r14\n"
    "movzbl %%cl,%%ecx\n"
    "seta  %%r13b\n"
    "not   %%ecx\n"
    "not   %%edx\n"
    "or     %%r13d,%%r12d\n"
    "movzbl %%r12b,%%r12d\n"
    "and   %%r12d,%%ecx\n"
    "xor    %%r12d,%%r12d\n"
    "cmp   %%r8,%%r10\n"
    "sete  %%r12b\n"
    "xor   %%r13d,%%r13d\n"
    "and   %%r12d,%%edx\n"
    "or     %%edx,%%ecx\n"
    "xor   %%edx,%%edx\n"
    "add  %%ecx,%%r11d\n"
    "imulq %%r11,%%r15\n"
    "addq  %%r15,%%r14\n"
    "adcq  %%rdx,%%r13\n"  
    "imulq %%r11,%%rsi\n"
    "xorq %%r15,%%r15\n"
    "xor   %%eax,%%eax\n"
    "movq  %%r14,0(%q2)\n"
    "xor   %%edx,%%edx\n"
    "addq %%r9,%%rsi\n"
    "adcq %%rdx,%%rdx\n" 
    "addq %%rsi,%%r13\n"
    "movq %%r13,8(%q2)\n"
    "adcq %%rdx,%%r15\n"
    "addq %%r11,%%r10\n"
    "adcq %%rax,%%rax\n" 
    "addq %%r15,%%r10\n"
    "movq %%r10,16(%q2)\n"
    "adcq $0,%%rax\n"
    "addq %%rbx,%%rax\n"
    "movq %%rax,24(%q2)\n"
    : "=D"(r)
    : "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");      
     
#else
    uint64_t l[8];
    secp256k1_scalar_sqr_512(l, a);
    secp256k1_scalar_reduce_512(r, l);
#endif    
}

This, measured right now, gives

(original)
scalar_sqr: min 0.134us / avg 0.135us / max 0.136us
scalar_mul: min 0.141us / avg 0.143us / max 0.144us
scalar_inverse: min 40.5us / avg 40.6us / max 40.9us

(my hacked version - only gcc)
scalar_sqr: min 0.122us / avg 0.122us / max 0.122us
scalar_mul: min 0.126us / avg 0.127us / max 0.127us
scalar_inverse: min 36.7us / avg 36.9us / max 37.1us
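
To put those numbers in relative terms, here is a minimal C sketch that converts the avg columns above into the percentage of time saved. The helper names speedup_percent and print_speedups are made up for illustration and are not part of the library or its benchmark tool; the inputs are just the averages quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: percentage of time saved going from `before` to `after`. */
static double speedup_percent(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* Prints the relative gain for the three averages quoted in the post. */
static void print_speedups(void) {
    printf("scalar_sqr:     %.1f%%\n", speedup_percent(0.135, 0.122));
    printf("scalar_mul:     %.1f%%\n", speedup_percent(0.143, 0.127));
    printf("scalar_inverse: %.1f%%\n", speedup_percent(40.6, 36.9));
}
```

Calling print_speedups() from a main() gives roughly 9 to 11% less time per call for each of the three operations on this one machine.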

The original code is very readable and easy to maintain (unlike my crap), but if one disassembles it, the flow looks something like this:

1) mul512 or sqr512 starts and then writes its output to variables

2) Then we have pops and pushes for the next function which is reduce512

3) The reduce512 function imports the data from the outputs of #1

4) Reduce512 goes in 3 stages, with each stage writing its own distinct output to variables that the next stage then imports as its input. (The three stages can be streamlined by merging them and always using registers. Distinct output and input points then become redundant: fewer moves and no need for variables.)

5) As reduce512 ends, it puts its own output to variables

6) Final reduction imports the output of (5) and processes it.

My rationale was that if mul512 (or sqr512) + reduce512 + final reduction go together in one asm block, one saves a lot of inputs/outputs and pushes/pops. Plus, code size goes down significantly (1300 bytes => 1000 bytes), which leaves some extra L1 cache for other stuff. Reduce512/mul512/sqr512 still exist as code (altered), but they aren't really called. What gets called is the unified secp256k1_scalar_mul and secp256k1_scalar_sqr, which have everything inside them. This was a proof of concept, so to speak, because I was looking at the disassembled output and I was like "AARRRGGHHH, why can't one stage or function simply forward its results in registers, instead of all this pushing and popping and RAM and variables".
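
The same idea can be sketched in plain C on a toy problem. This is only an analogy under stated assumptions: it multiplies two 64-bit values and reduces modulo 2^61 - 1 (a toy modulus, not the secp256k1 group order), it relies on the GCC/Clang unsigned __int128 extension, and all the toy_* names are invented for the example. The staged version writes the wide product to an array and reads it back, the fused version keeps it in a local.

```c
#include <stdint.h>

typedef unsigned __int128 u128;            /* GCC/Clang extension */
#define P61 ((UINT64_C(1) << 61) - 1)      /* toy modulus, NOT the secp256k1 order */

/* Fold a value below 2^128 down to [0, P61), using 2^61 == 1 (mod P61). */
static uint64_t fold61(u128 x) {
    x = (x & P61) + (x >> 61);             /* now below 2^68 */
    x = (x & P61) + (x >> 61);             /* now at most P61 + 64 */
    uint64_t r = (uint64_t)x;
    return r >= P61 ? r - P61 : r;
}

/* Staged, step 1: store the wide product to memory (like mul512 writing l[]). */
static void toy_mul_wide(uint64_t l[2], uint64_t a, uint64_t b) {
    u128 p = (u128)a * b;
    l[0] = (uint64_t)p;
    l[1] = (uint64_t)(p >> 64);
}

/* Staged, step 2: load it back and reduce (like reduce512 reading l[]). */
static uint64_t toy_reduce_wide(const uint64_t l[2]) {
    return fold61(((u128)l[1] << 64) | l[0]);
}

static uint64_t toy_mulmod_staged(uint64_t a, uint64_t b) {
    uint64_t l[2];
    toy_mul_wide(l, a, b);
    return toy_reduce_wide(l);
}

/* Fused: the product never leaves local variables. */
static uint64_t toy_mulmod_fused(uint64_t a, uint64_t b) {
    return fold61((u128)a * b);
}
```

Both versions return the same result. Whether the staged one really costs extra stores and loads depends on what the compiler inlines; with hand-written asm blocks like the ones above, the compiler cannot see through the boundary, which is where the redundant traffic comes from.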

For example, this is the behavior between (5) and (6) in the disassembled output of the original reduce512:

  406246:   4c 89 4f 10     mov %r9,0x10(%rdi)
  40624a:   4d 31 c9    xor %r9,%r9
  40624d:   49 01 f0    add %rsi,%r8
  406250:   49 83 d1 00     adc $0x0,%r9
  406254:   4c 89 47 18     mov %r8,0x18(%rdi)
  406258:   4c 89 cb    mov %r9,%rbx
  40625b:   4c 8b 5f 18     mov 0x18(%rdi),%r11
  40625f:   48 8b 77 10     mov 0x10(%rdi),%rsi

My thoughts were like "ok, these RAM moves are redundant and HAVE TO GO. Why should r8 be written to RAM and then get re-imported from RAM into r11? Why should r9 go to RAM and get re-imported instead of going straight to rsi? Waste of time". (I had the same reaction every time I spotted data going out and then getting moved back in as input instead of being used as is.)

Still, the source is very readable the way it is right now, and the performance cost is not that large compared to the benefit of understanding what each thing does.

337

Bitcoin / Development & Technical Discussion / Re: secp256k1 library and Intel cpu

on: January 14, 2017, 03:54:56 PM

Quote from: rico666 on January 13, 2017, 06:35:04 PM

Quote from: AlexGR on January 13, 2017, 12:30:47 PM

There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us (without)

bench_internal shows no improvements (within measurement tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us (without)

I doubt this would cause the speedup from above.

Rico

I'll upload here 2 versions of /src/field_5x52_asm_impl.h that I've kind of hacked, one using memory, the other xmm registers.

The commentary is not good because it's not production level - just fooling around* with the data flow so that the data get from one end to the other faster, with a smaller code footprint. I've never run them on anything besides my Q8200, and I'm wondering about the behavior of modern CPUs. I'd appreciate it if you (or anyone else) could run a benchmark (baseline) + these 2, and perhaps a time ./tests as a more real-world performance measure.

If I do a time ./tests, both run faster by a second (58.2 seconds baseline with endomorphism, down to 57.2 seconds on my underclocked Q8200 @ 1.86 GHz), although the memory version seems faster in the benchmarks. I have a theory on why the xmm version sucks in benchmarks (OS context switches being more expensive because they also save the xmm register set?), but the bottom line is that it seems faster than baseline in a timed test run (a more real-world application)... Security-wise, I wouldn't want to leave data hanging around in the XMM registers though.

(*What I wanted to do is to reduce opcode size, instruction count and memory accesses by reducing the number of temporary variables from 3 to 2 or 1, while interleaving muls with adds).
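
For anyone not used to the 5x52 layout, here is a rough C sketch of what a single column of the listings below computes: a handful of limb products accumulated into a 128-bit value d, then split with d & M and d >>= 52. It assumes the GCC/Clang unsigned __int128 extension; the names shrd52 and column3 are invented for illustration and do not exist in the library.

```c
#include <stdint.h>

typedef unsigned __int128 u128;          /* GCC/Clang extension */
#define M52 UINT64_C(0xFFFFFFFFFFFFF)    /* 52-bit limb mask, "M" in the asm comments */

/* What "shrdq $52,hi,lo" leaves in lo: the low 64 bits of (hi:lo) >> 52. */
static uint64_t shrd52(uint64_t lo, uint64_t hi) {
    return (lo >> 52) | (hi << 12);
}

/* One column: d = a0*b3 + a1*b2 + a2*b1 + a3*b0, then t3 = d & M; d >>= 52.
 * Returns t3 and leaves the carry in *d for the next column. */
static uint64_t column3(const uint64_t a[5], const uint64_t b[5], u128 *d) {
    *d = (u128)a[0] * b[3] + (u128)a[1] * b[2]
       + (u128)a[2] * b[1] + (u128)a[3] * b[0];
    uint64_t t3 = (uint64_t)*d & M52;
    *d >>= 52;
    return t3;
}
```

In the asm, each product is a mulq followed by an addq/adcq pair into the two registers holding d, and the load of the next b limb is slotted in between the add and the adc, which is the interleaving mentioned above.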

Version 1 - normal/memory:

Code:

/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
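/**
 * Registers: rdx:rax = multiplication accumulator
 *            rcx:r8  = c
 *            rsi:r9  = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rbx     = b
 *            rdi     = r
 *            rsi     = a (until a0-a4 are loaded, then high half of d)
 *            %q1/%q2 = tmp1 (t3) / tmp2 (t4), kept in memory
 */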
  uint64_t tmp1, tmp2;
__asm__ __volatile__(
    "movq 24(%%rsi),%%r13\n"
    "movq 0(%%rbx),%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d = a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq %%rax,%%r9\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "movq $0x1000003d10,%%rcx\n"
    "movq $0xfffffffffffff,%%r15\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "mulq %%rcx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%q1\n"  
    /* d >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%q2\n"
    /*"movq %q2,%%r15\n" */
    "movq 0(%%rbx),%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /*Q3 transfered*/
    
    /* u0 = d & M (%%r15) */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */

    /* u0 = (u0 << 4) | tx (%%r15) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %q1,%%r10\n" 
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%r11\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %q2,%%rsi\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "xor %%ecx,%%ecx\n"
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    /* c += d * R */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1), "=m"(tmp2)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a / t?
 */
  uint64_t tmp1a;
__asm__ __volatile__(
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 became rsi*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "movq $0x1000003d10,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%r8\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%r15\n"
    "shrq $48,%%r15\n" /*Q3=R15*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%q1\n"/*Q2 OUT - renamed to q1*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq $0xfffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%r15,%%rdx\n" /*Q3 - R15 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "movq $0xfffffffffffff,%%r15\n" /*R15 back in its place*/
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %q1,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%r13\n"
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1*/
    /* c += d * R */
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

#endif

Version 2 - more xmm reg use

Code:

/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
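/**
 * Registers: rdx:rax = multiplication accumulator
 *            rcx:r8  = c
 *            rsi:r9  = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi, rbp, rsp = b0, b1, b2
 *            rcx     = b3 (parked in xmm5 once rcx is needed for c)
 *            rbx     = b, then b4
 *            rsi     = a (until a0-a4 are loaded, then high half of d)
 *            xmm0    = t3 (was tmp1 / q1), xmm6 = t4 (was tmp2 / q2)
 *            xmm1-xmm3 = saved rsp, rbp, rdi (r)
 *            xmm4    = copy of b0
 */
/* This has 17 mem accesses + 17 xmm uses vs 35 mem accesses and no xmm use */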

__asm__ __volatile__(
    "push %%rbx\n"
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq %%rdi, %%xmm3\n"
    "movq 0(%%rbx),%%rdi\n"
    "movq 8(%%rbx),%%rbp\n"
    "movq 16(%%rbx),%%rsp\n"
    "movq %%rdi,%%xmm4\n"
    
    "movq 24(%%rsi),%%r13\n"
    "movq %%rdi,%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d = a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq %%rax,%%r9\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "movq 24(%%rbx),%%rcx\n"
    "movq 32(%%rbx),%%rbx\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq $0xfffffffffffff,%%r15\n"
    "movq %%rax,%%r8\n"
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%xmm0\n"  
    /* d >>= 52 */
    "movq %%rdi,%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq $0x1000003d10,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%%xmm6\n"
    /*"movq %q2,%%r15\n" */
    "movq %%rdi,%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rcx,%%xmm5\n"
    "movq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /* Q3 transferred: tx moves from %%r15 to %%rax */

    /* u0 = d & M (%%rdx); d >>= 52; M reloaded into %%r15 */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"
    /* u0 = (u0 << 4) | tx (%%rdx) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%rdx\n"
    /* c >>= 52 */
    "movq %%rdi,%%rax\n"
    "movq %%xmm3, %%rdi\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    "movq %%rdx,0(%%rdi)\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%xmm4,%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* b3 (xmm5) -> %%rax; t3 is fetched from xmm0 into %%r10 (overwriting a0) during the next multiply */
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %%xmm0,%%r10\n" 
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%rbx\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rbx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += d * R */
    "mulq %%rbx\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "movq %%xmm6,%%rsi\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
    "pop %%rbx\n"
: "+S"(a)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
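/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c (r14:r8 for the final carry)
 *            rcx:rbx = d
 *            r10-r14 = a0-a4 (r12 is reused for t4 after the last use of a2)
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a, then t3
 *            rsp     = R (0x1000003d10); original value parked in xmm1
 *            rbp     = tx; original value parked in xmm2
 *            xmm0    = t4
 * Note: rsp is used as a scratch register, so nothing may touch the stack
 * (push/pop/call, or an asynchronous signal handler) until it is restored.
 */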
__asm__ __volatile__(
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rsp\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 OUT*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rdx) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%rbp\n"
    "shrq $48,%%rbp\n" /*Q3 OUT*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%%xmm0\n"/*Q2 OUT*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rdx) */
    "movq %%rbx,%%rdx\n"
    "movq %%r15,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rdx, tx in %%rbp) */
    "shlq $4,%%rdx\n"
    "orq %%rbp,%%rdx\n" /*Q3 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%xmm0,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* t3 stays in %%rsi; t4 was just restored from xmm0 into %%r12 (overwriting a2) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1 RETURNS*/
    /* c += d * R */
    "mulq %%rsp\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"

: "+S"(a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "%xmm0", "%xmm1", "%xmm2", "cc", "memory");
}

#endif
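
For anyone trying to follow the squaring asm, here is roughly what it computes, written as plain C with unsigned __int128. This is only a sketch reconstructed from the comments in the asm, not the library's actual source. The name fe_sqr_ref is made up for illustration, it assumes a compiler that supports unsigned __int128 (gcc or clang on x86_64), and the limb-size asserts are my assumption about what the 5x52 code expects from its inputs. R = 0x1000003d10 is 2^260 mod p, because p = 2^256 - 0x1000003d1 and five 52-bit limbs cover 260 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;  /* gcc/clang extension, assumed available */

/* Hypothetical reference for the squaring asm: r = a^2 mod p, 5 limbs of 52 bits. */
static void fe_sqr_ref(uint64_t *r, const uint64_t *a) {
    const uint64_t M = 0xFFFFFFFFFFFFFULL;  /* 52-bit limb mask */
    const uint64_t R = 0x1000003D10ULL;     /* 2^260 mod p */
    uint64_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4];
    uint64_t t3, t4, tx, u0;
    u128 c, d;

    /* Assumed input bounds: low limbs below 2^56, top limb below 2^52. */
    assert((a0 >> 56) == 0 && (a1 >> 56) == 0 && (a2 >> 56) == 0);
    assert((a3 >> 56) == 0 && (a4 >> 52) == 0);

    /* Column 3, plus column 8 folded back through R. */
    d  = (u128)(a0 * 2) * a3 + (u128)(a1 * 2) * a2;
    c  = (u128)a4 * a4;
    d += (c & M) * R;  c >>= 52;
    t3 = (uint64_t)(d & M);  d >>= 52;

    /* Column 4; the top 4 bits of t4 (tx) are folded into column 0 below. */
    a4 *= 2;
    d += (u128)a0 * a4 + (u128)(a1 * 2) * a3 + (u128)a2 * a2;
    d += c * R;
    t4 = (uint64_t)(d & M);  d >>= 52;
    tx = t4 >> 48;  t4 &= (M >> 4);

    /* Column 0 (with column 5 and tx folded in). */
    c  = (u128)a0 * a0;
    d += (u128)a1 * a4 + (u128)(a2 * 2) * a3;
    u0 = (uint64_t)(d & M);  d >>= 52;
    u0 = (u0 << 4) | tx;
    c += (u128)u0 * (R >> 4);
    r[0] = (uint64_t)(c & M);  c >>= 52;

    /* Column 1 (with column 6 folded in). */
    a0 *= 2;
    c += (u128)a0 * a1;
    d += (u128)a2 * a4 + (u128)a3 * a3;
    c += (d & M) * R;  d >>= 52;
    r[1] = (uint64_t)(c & M);  c >>= 52;

    /* Column 2 (with column 7 folded in). */
    c += (u128)a0 * a2 + (u128)a1 * a1;
    d += (u128)a3 * a4;
    c += (d & M) * R;  d >>= 52;
    r[2] = (uint64_t)(c & M);  c >>= 52;

    /* Columns 3 and 4: add back the parked t3 and t4. */
    c += d * R + t3;
    r[3] = (uint64_t)(c & M);  c >>= 52;
    c += t4;
    r[4] = (uint64_t)c;
}
```

A quick sanity check: squaring the element with limbs {0, 0, 0, 0, 1} (that is, 2^208) gives 2^416 = 2^160 * 2^256, which reduces to 2^156 * 0x1000003d10, so the result should have 0x1000003d10 in r[3] and zeros elsewhere. The result is only partially reduced (limbs fit in 52 bits), the same as the asm, so a full normalization step is still needed before comparing against canonical values.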

338

Bitcoin / Development & Technical Discussion / Re: secp256k1 library and Intel cpu

on: January 13, 2017, 12:30:47 PM

Quote from: rico666 on January 13, 2017, 11:54:48 AM
Quote from: AlexGR on January 08, 2017, 11:41:37 AM
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look at benchmarks on my system, USE_ENDOMORPHISM is nothing you'd like to enable:

Code:
Times for tests: gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real 0m14.365s
user 0m14.357s
sys  0m0.007s

2) CFLAGS -O3 -march=skylake
real 0m13.549s
user 0m13.547s
sys  0m0.000s

3) CFLAGS -O3 -march=skylake & USE_ENDOMORPHISM 1
real 0m15.660s
user 0m15.660s
sys  0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real 0m16.139s
user 0m16.137s
sys  0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real 0m14.849s
user 0m14.847s
sys  0m0.000s

6) CFLAGS -O3 -march=skylake & undef USE_ASM_X86_64
real 0m14.520s
user 0m14.517s
sys  0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism. On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).

Rico

There are 3 benchmarks (bench_internal, bench_verify, bench_sign), which are built by ./configure --enable-benchmark.

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus extra tests for endomorphism) if endomorphism is on. Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.

339

Local / Ελληνικά (Greek) / Re: [INFO] Exchange Rate Discussion

on: January 13, 2017, 11:52:58 AM

Quote from: herrhausen5 on January 13, 2017, 10:40:49 AM
Quote from: AlexGR on January 13, 2017, 09:10:20 AM
In general, the West exaggerates when it labels the trade volume as fake. With a 0% commission it makes sense for the volume to be large -

From what we read, it seems the commission is not 0% but is built into the price. Someone pays for and buys a one-euro chocolate bar. Then he takes 24 cents out of his wallet to pay the VAT. Does someone else, who buys the chocolate bar with the VAT already included in the price (1.24), pay less than the first person?

From what I understood, the markup (and the exchanges' profit) is applied when money enters and leaves the exchange (e.g. withdrawal fees) and not per trade. If I have understood it correctly, then it pays for holders to make as many trades as possible in order to increase their CNY or their BTC.

340

Local / Ελληνικά (Greek) / Re: [INFO] Exchange Rate Discussion

on: January 13, 2017, 09:10:20 AM

In general, the West exaggerates when it labels the trade volume as fake. With a 0% commission it makes sense for the volume to be large, since a trader (or a trading bot) can make a profit even on a 0.1% move, which on a Western exchange would not even cover the trade fees (0.2 - 0.3% per trade). That does not mean the volume is fictitious. The lower the trade fee, the higher the volume, and the more trading bots are used.
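
A rough sketch of the arithmetic behind the fee argument, assuming a round trip is one buy plus one sell and that the fee is taken as a fraction of each trade. The function name round_trip_net and this fee model are my own simplification for illustration, not how any particular exchange computes fees.

```c
#include <assert.h>

/* Net return of one buy+sell round trip.
 * gross_move:    price change captured by the trader (0.001 = 0.1%)
 * fee_per_trade: fraction charged on each of the two trades (0.002 = 0.2%) */
static double round_trip_net(double gross_move, double fee_per_trade) {
    assert(fee_per_trade >= 0.0 && fee_per_trade < 1.0);
    return (1.0 + gross_move) * (1.0 - fee_per_trade) * (1.0 - fee_per_trade) - 1.0;
}
```

With a 0.1% move and a 0% fee the trader keeps the whole 0.1%. With the same move and a 0.2% fee per trade, the round trip loses roughly 0.3%, so that trade is simply never made on a fee-charging exchange.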
