Bitcoin Forum
261  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: May 20, 2017, 09:22:07 PM
+10% in China in 24 hours, haven't seen that in a while.

Chinese FOMO is the best kind of FOMO Tongue
262  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: May 20, 2017, 07:15:39 AM
Coinmarketcap price:

1   Bitcoin Bitcoin   $32,727,657,375   $2002.84   16,340,625 BTC   $1,078,680,000   3.65%
263  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: May 16, 2017, 06:13:48 AM
In the meantime, ripple is bubbling to more than 1/3 of BTC's market cap in no time. Do you think it's ripe to short the hell out of, or will it continue pumping?

My suggestion is to never short banker(-backed) coins. Let it pump, let it dump - who cares. If crypto people wanted centralized solutions they'd own bank stocks.
264  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: May 09, 2017, 05:21:31 PM
I once thought $10K was crazy, but in a world where Bitcoin is pretty well accepted, I now believe that number is actually quite small. Even at $10K, Bitcoin is still worth significantly less than Amazon, Apple, or Facebook.

Exactly. High numbers are only impressive because of the low number of total coins. Factoring in that there are just 16 million coins, the price is actually pretty low.
265  Local / Ελληνικά (Greek) / Re: [INFO] Συζήτηση για την Ισοτιμία on: May 09, 2017, 04:42:51 PM
The billions aren't being thrown in by some average Joe sitting in the poloniex trollbox; the money that has poured in over the last few weeks is definitely institutional...
unless there are millions of average Joes each putting in a tenner? Or is this money systemic, printed with nothing backing it?

It would take a lot of "orchestration" for millions of small guys to each put in, say, 10 EUR... and into the same altcoins at the same points in time, no less.
266  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: May 09, 2017, 04:56:11 AM
Just got home from the ballpark to see that both my teams won tonight.

Jays 4-2 and a Bitcoin ATH of $1685USD (Bitcoinaverage).

Life is good.  Smiley

Coinmarketcap has BTC @ $1742.27 (+6.44%). I think it factors in volume, including btc/alt pairs, but excluding zero-fee trading.
267  Local / Ελληνικά (Greek) / Re: [INFO] Συζήτηση για την Ισοτιμία on: May 08, 2017, 11:38:43 PM
The billions aren't being thrown in by some average Joe sitting in the poloniex trollbox; the money that has poured in over the last few weeks is definitely institutional...
268  Local / Ελληνικά (Greek) / Re: [INFO] Συζήτηση για την Ισοτιμία on: May 08, 2017, 10:21:12 PM
The price is basically capable of reaching 2k+ with ease, given all the accumulated "firepower" from the alts.

It's one thing to have a $1700 price with a btc that has, say, 90% crypto market share, and a completely different thing with a "reserve" of billions that equals bitcoin's market cap all over again. That reserve can dry up the supply of btc very quickly if there are even small liquidations across 20-30% of the inflated altcoins.
269  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: May 08, 2017, 12:19:34 PM
We first need to hit parity with the 400oz gold bars Cool
270  Local / Ελληνικά (Greek) / Re: [INFO] Συζήτηση για την Ισοτιμία on: May 03, 2017, 11:04:23 PM
The price is laughable compared to what it should be Tongue
271  Bitcoin / Development & Technical Discussion / Re: How to use properly secp256k1 library on: April 23, 2017, 09:16:38 AM
Code:
SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint32_t *r, const uint32_t *a) {
  /*  uint64_t c, d;*/
    uint64_t result[17]; /*temp storage array*/
    uint32_t tempstor[2]; /*temp storage array*/
    const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL  /*, R1 = 0x400UL*/ ;

  tempstor[0]=M;
  tempstor[1]=R0;
/*tempstor[2] for R1 isn't needed. It's 1024, so shifting left by 10 instead*/

  __asm__ __volatile__(  
 /* Part #1: The multiplications and additions of
  *
  *  
  *   (uint64_t)(a[0]*2) * a[1]        =result0
  *
  *   (uint64_t)(a[0]*2) * a[2]
       + (uint64_t)a[1] * a[1]          =result1
            
      (uint64_t)(a[0]*2) * a[3]
       + (uint64_t)(a[1]*2) * a[2]    =result2
      
       (uint64_t)(a[0]*2) * a[4]       =result3
       + (uint64_t)(a[1]*2) * a[3]
       + (uint64_t)a[2] * a[2];        
      
      (uint64_t)(a[0]*2) * a[5]
       + (uint64_t)(a[1]*2) * a[4]    =result4
       + (uint64_t)(a[2]*2) * a[3];     */

 "MOVD 0(%0), %%MM0\n"  /*a0 */
 "MOVD 8(%0), %%MM2\n"  /*a2 */
 "MOVD 4(%0), %%MM1\n" /* a1 */
 "MOVQ %%MM0, %%MM6\n" /*cloning a0 to mm6*/
 "MOVQ %%MM0, %%MM4\n" /*cloning a0 to mm4*/
 "MOVQ %%MM1, %%MM5\n" /*cloning a1 to mm5*/
 "PMULUDQ %%MM2, %%MM6\n" /*a2 * a0*/
 "MOVQ %%MM1, %%MM7\n"  /*cloning a1 to mm7*/
 "PMULUDQ %%MM5, %%MM5\n" /*a1 * a1*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a1*/
 "MOVD 12(%0), %%MM3\n" /* a3 */
 "PADDQ %%MM6, %%MM6\n"  /* doubled a2 * a0 */
 "PADDQ %%MM5, %%MM6\n" /* (a2*a0*2) + (a1*a1) */
 "PADDQ %%MM7, %%MM7\n" /* doubled a0 * a1 */
 "MOVQ %%MM1, %%MM5\n" /*cloning a1 to mm5*/
 "MOVQ %%MM6, 8(%1)\n" /*result[1] */  
 "MOVQ %%MM7, 0(%1)\n" /*result[0] */
 
 "PMULUDQ %%MM3, %%MM4\n" /*a3 * a0*/
 "PMULUDQ %%MM2, %%MM5\n" /*a2 * a1*/  
 "MOVD 16(%0), %%MM7\n" /*a4 to mm7*/
 "PADDQ %%MM4, %%MM5\n" /* (a3 * a0) + (a2*a1) */
 "PADDQ %%MM5, %%MM5\n" /* (a3 * a0) + (a2*a1) doubled */
 "MOVQ %%MM5, 16(%1)\n" /*result[2] */  
 
 "MOVQ %%MM2, %%MM6\n" /*cloning a2 to mm6*/
 "MOVQ %%MM7, %%MM4\n" /*cloning a4 to mm4*/
 "MOVQ %%MM3, %%MM5\n" /*cloning a3 to mm5*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a4*/
 "PMULUDQ %%MM1, %%MM5\n" /*a1 * a3*/
 "PMULUDQ %%MM2, %%MM6\n" /*a2 * a2*/  
 "PADDQ %%MM7, %%MM5\n" /* (a0*a4) + (a1*a3) */
 "PADDQ %%MM5, %%MM5\n"/* doubling (a0*a4) + (a1*a3) */
 "MOVD 20(%0), %%MM7\n" /*prefetch a5 to mm7*/
 "PADDQ %%MM6, %%MM5\n" /*  (2* (a0*a4) + (a1*a3)) + (a2*a2) */
 "MOVQ %%MM5, 24(%1)\n" /*result[3] */  
 
 "MOVQ %%MM4, %%MM6\n" /*cloning a4 to mm6*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a5*/
 "MOVQ %%MM3, %%MM5\n" /*cloning a3 to mm5*/
 "PMULUDQ %%MM1, %%MM6\n" /*a1 * a4*/
 "PMULUDQ %%MM2, %%MM5\n" /*a2 * a3*/
 "PADDQ %%MM6, %%MM7\n"
 "MOVD 24(%0), %%MM6\n" /*prefetch a6 to mm6*/
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 20(%0), %%MM5\n" /*prefetch a5 to mm5*/
 "PADDQ %%MM7, %%MM7\n" /*adding and doubling a1*a4 + a0*a5 + a2*a3*/
 "MOVQ %%MM7, 32(%1)\n" /*result[4] */  
 
 /* Part #2: The multiplications and additions of
  *  
       (uint64_t)(a[0]*2) * a[6]
       + (uint64_t)(a[1]*2) * a[5]
       + (uint64_t)(a[2]*2) * a[4]        =result5
       + (uint64_t)a[3] * a[3]
      
       (uint64_t)(a[0]*2) * a[7]
       + (uint64_t)(a[1]*2) * a[6]
       + (uint64_t)(a[2]*2) * a[5]
       + (uint64_t)(a[3]*2) * a[4]       =result6
      
        (uint64_t)(a[0]*2) * a[8]
       + (uint64_t)(a[1]*2) * a[7]
       + (uint64_t)(a[2]*2) * a[6]
       + (uint64_t)(a[3]*2) * a[5]
       + (uint64_t)a[4] * a[4]           =result7      */

 "PMULUDQ %%MM3, %%MM3\n" /*a3 * a3*/
 "MOVQ %%MM4, %%MM7\n" /*a4 to mm7*/
 "PMULUDQ %%MM0, %%MM6\n" /*a0 * a6*/
 "PMULUDQ %%MM1, %%MM5\n" /*a1 * a5*/
 "PMULUDQ %%MM2, %%MM7\n" /*a2 * a4*/
 "PADDQ %%MM6, %%MM5\n"
 "MOVD 24(%0), %%MM6\n" /*prefetch a6 to mm6*/  
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 20(%0), %%MM5\n" /*prefetch a5 to mm5*/  
 "PADDQ %%MM7, %%MM7\n" /*adding and doubling a0*a6 + a1*a5 + a2*a4*/
 "PADDQ %%MM3, %%MM7\n" /* adding non-doubled a3*a3 */
 "MOVD 12(%0), %%MM3\n" /*prefetch a3 to mm3*/
 "MOVQ %%MM7, 40(%1)\n" /*result[5] */  
 
 "PMULUDQ %%MM2, %%MM5\n" /*a2 * a5*/
 "PMULUDQ %%MM1, %%MM6\n" /*a1 * a6*/
 "PMULUDQ %%MM3, %%MM4\n" /*a3 * a4*/
 "MOVD 28(%0), %%MM7\n" /*a7 to mm7*/
 "PMULUDQ %%MM0, %%MM7\n" /*a0 * a7*/
 "PADDQ %%MM5, %%MM6\n"
 "PADDQ %%MM6, %%MM4\n"   /* adding 4 prior multiplications */
 "MOVD 24(%0), %%MM5\n" /*prefetch a6 to mm5*/
 "PADDQ %%MM4, %%MM7\n"
 "MOVD 20(%0), %%MM4\n" /*prefetch a5 to mm4*/  
 "PADDQ %%MM7, %%MM7\n"  /* doubling the result */
 "MOVQ %%MM7, 48(%1)\n" /*result[6] */    

 "PMULUDQ %%MM4, %%MM3\n" /*a5 * a3*/
 "MOVD 28(%0), %%MM6\n" /*a7 to mm6*/
 "PMULUDQ %%MM5, %%MM2\n" /*a6 * a2*/
 "MOVD 32(%0), %%MM7\n" /*a8 to mm7*/  
 "PMULUDQ %%MM6, %%MM1\n" /*a7 * a1*/
 "MOVD 16(%0), %%MM4\n" /*a4 to mm4*/  
 "PMULUDQ %%MM7, %%MM0\n" /*a8 * a0*/
 "PMULUDQ %%MM4, %%MM4\n" /*a4 * a4*/
 "PADDQ %%MM3, %%MM2\n"
 "PADDQ %%MM2, %%MM1\n"   /* adding the first 4 multiplications */
 "PADDQ %%MM1, %%MM0\n"
 "MOVD 36(%0), %%MM2\n" /*a9 to mm2*/  
 "PADDQ %%MM0, %%MM0\n"  /* doubling the result of the first 4 multiplications */
 "MOVQ %%MM7, %%MM1\n"  /* cloning a8 to mm1*/
 "PADDQ %%MM4, %%MM0\n"  /* adding the non-doubled a4*a4 */
 "MOVQ %%MM2, %%MM3\n"  /* cloning a9 to mm3*/
 "MOVQ %%MM0, 56(%1)\n" /*result[7] */  
 
/* Part #3: The multiplications and additions of
  *  
  *   (uint64_t)a[9] * a[9]               = result8
      
       (uint64_t)(a[8]*2) * a[9]         = result9
          
       (uint64_t)(a[7]*2) * a[9]
       + (uint64_t)a[8] * a[8]            = result10
      
       (uint64_t)(a[6]*2) * a[9]
       + (uint64_t)(a[7]*2) * a[8]      = result11    
      
       (uint64_t)(a[5]*2) * a[9]
       + (uint64_t)(a[6]*2) * a[8]      = result12
       + (uint64_t)a[7] * a[7]
      
       (uint64_t)(a[4]*2) * a[9]
       + (uint64_t)(a[5]*2) * a[8]      = result13
       + (uint64_t)(a[6]*2) * a[7]
      
       (uint64_t)(a[3]*2) * a[9]
       + (uint64_t)(a[4]*2) * a[8]
       + (uint64_t)(a[5]*2) * a[7]
       + (uint64_t)a[6] * a[6]            = result14          */

 "PMULUDQ %%MM3, %%MM3\n" /*a9 * a9*/
 "MOVQ %%MM2, %%MM4\n"  /* cloning a9 to mm4*/  
 "PADDQ %%MM7, %%MM7\n" /*a8 doubled */
 "MOVQ %%MM1, %%MM0\n" /*cloning a8 to mm0*/
 "MOVQ %%MM3, 64(%1)\n" /*result[8] = a9*a9  */
 
 "PMULUDQ %%MM7, %%MM4\n" /*(a8*2) * a9*/
 "MOVQ %%MM2, %%MM3\n"  /* cloning a9 to mm3*/  
 "MOVQ %%MM4, 72(%1)\n" /*result[9]*/
 "PMULUDQ %%MM0, %%MM0\n" /*a8*a8*/  
 "PMULUDQ %%MM6, %%MM3\n" /*a7*a9*/
 "MOVQ %%MM2, %%MM4\n"  /* cloning a9 to mm4*/  
 "PMULUDQ %%MM5, %%MM4\n" /*a6*a9*/
 "PADDQ %%MM3, %%MM3\n" /*a7*a9 doubled*/  
 "PADDQ %%MM0, %%MM3\n" /*a7*a9 doubled + a8*a8  */
 "MOVQ %%MM1, %%MM0\n" /*cloning a8 to mm0*/
 "MOVQ %%MM2, %%MM7\n" /*cloning a9 to mm7*/
 "MOVQ %%MM3, 80(%1)\n" /*result[10]*/
 
 "PMULUDQ %%MM6, %%MM0\n" /*a7*a8*/  
 "MOVQ %%MM1, %%MM3\n" /*cloning a8 to mm3*/
 "PADDQ %%MM4, %%MM0\n" /*a7*a8 + a6*a9 */
 "MOVD 20(%0), %%MM4\n" /*a5 to mm4*/  
 "PADDQ %%MM0, %%MM0\n" /*a7*a8 + a6*a9 doubling*/
 "MOVQ %%MM0, 88(%1)\n" /*result[11]*/  

 "PMULUDQ %%MM5, %%MM3\n" /*a6*a8*/  
 "PMULUDQ %%MM4, %%MM7\n" /*a5*a9*/  
 "MOVQ %%MM6, %%MM0\n" /*cloning a7*/  
 "PMULUDQ %%MM0, %%MM0\n" /*a7*a7*/  
 "PADDQ %%MM3, %%MM7\n" /*a5*a9   +   a6*a8*/
 "PADDQ %%MM7, %%MM7\n"/*a5*a9   +   a6*a8   doubled*/
 "MOVD 16(%0), %%MM3\n" /*a4*/
 "PADDQ %%MM0, %%MM7\n"/*a5*a9   +   a6*a8   doubled     + a7*a7*/
 "MOVQ %%MM2, %%MM0\n" /*cloning a9*/
 "MOVQ %%MM7, 96(%1)\n" /*result[12]*/

 "PMULUDQ %%MM3, %%MM0\n" /*a4*a9*/    
 "MOVQ %%MM1, %%MM7\n" /*cloning a8*/
 "MOVQ %%MM6, %%MM3\n" /*cloning a7*/  
 "PMULUDQ %%MM4, %%MM7\n" /*a5*a8*/    
 "PMULUDQ %%MM5, %%MM3\n" /*a6*a7*/  
 "PADDQ %%MM0, %%MM7\n"
 "MOVD 12(%0), %%MM0\n" /* a3 in mm0*/
 "PADDQ %%MM3, %%MM7\n"  /*adding the 3 prior multiplications*/
 "PADDQ %%MM7, %%MM7\n"  /* doubling the result of the 3 multiplications*/
 "MOVQ %%MM7, 104(%1)\n" /*result[13]*/

 "PMULUDQ %%MM5, %%MM5\n" /*a6*a6*/    
 "MOVQ %%MM6, %%MM3\n" /*cloning a7*/
 "MOVD 16(%0), %%MM7\n" /* a4 in mm7*/    
 "PMULUDQ %%MM2, %%MM0\n" /*a3*a9*/  
 "PMULUDQ %%MM4, %%MM3\n" /*a5*a7*/  
 "PMULUDQ %%MM1, %%MM7\n" /*a4*a8*/  
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 8(%0), %%MM0\n" /*a2 in mm0*/
 "PADDQ %%MM3, %%MM7\n" /*adding the last 3 multiplications*/
 "MOVD 16(%0), %%MM3\n" /*a4 in mm3*/  
 "PADDQ %%MM7, %%MM7\n" /*doubling the sum of the 3 multiplications*/
 "PADDQ %%MM5, %%MM7\n" /*adding non-doubled a6*a6*/
 "MOVQ %%MM7, 112(%1)\n" /*result[14]*/

/* Part #4: The multiplications and additions of
  *  
  *   (uint64_t)(a[2]*2) * a[9]
       + (uint64_t)(a[3]*2) * a[8]
       + (uint64_t)(a[4]*2) * a[7]
       + (uint64_t)(a[5]*2) * a[6]        =result 15
      
       (uint64_t)(a[1]*2) * a[9]
       + (uint64_t)(a[2]*2) * a[8]
       + (uint64_t)(a[3]*2) * a[7]
       + (uint64_t)(a[4]*2) * a[6]
       + (uint64_t)a[5] * a[5]              =result16
      
        (uint64_t)(a[0]*2) * a[9]
       + (uint64_t)(a[1]*2) * a[8]
       + (uint64_t)(a[2]*2) * a[7]
       + (uint64_t)(a[3]*2) * a[6]
       + (uint64_t)(a[4]*2) * a[5]        =result17  / initial value of d  */

 "MOVD 24(%0), %%MM5\n" /*a6 in mm5*/
 "PMULUDQ %%MM2, %%MM0\n" /*a2*a9*/  
 "MOVD 12(%0), %%MM7\n" /*a3 in mm7*/
 "PMULUDQ %%MM6, %%MM3\n" /*a4*a7*/
 "PMULUDQ %%MM4, %%MM5\n" /*a5*a6*/
 "PMULUDQ %%MM1, %%MM7\n" /*a3*a8*/  
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 4(%0), %%MM0\n" /*a1 in mm0*/  
 "PADDQ %%MM3, %%MM5\n"
 "MOVD 16(%0), %%MM3\n" /*a4 in mm3*/  
 "PADDQ %%MM5, %%MM7\n" /* ...adding all the results */
 "MOVD 24(%0), %%MM5\n" /*a6 in mm5*/
 "PADDQ %%MM7, %%MM7\n" /* doubling them */
 "MOVQ %%MM7, 120(%1)\n" /*result[15]*/

 "PMULUDQ %%MM3, %%MM5\n" /* a4*a6 */
 "PMULUDQ %%MM2, %%MM0\n" /* a1*a9 */
 "PMULUDQ %%MM4, %%MM4\n" /* a5*a5 */  
 "MOVD 8(%0), %%MM7\n" /*a2 in mm7*/
 "MOVD 12(%0), %%MM3\n" /*a3 in mm3*/  
 "PMULUDQ %%MM1, %%MM7\n" /* a2*a8 */  
 "PMULUDQ %%MM6, %%MM3\n" /* a3*a7 */
 "PADDQ %%MM5, %%MM0\n"
 "MOVD 24(%0), %%MM5\n" /*prefetch a6 in mm5*/
 "PADDQ %%MM0, %%MM7\n"
 "MOVD  0(%0), %%MM0\n" /*prefetch a0 in mm0*/
 "PADDQ %%MM3, %%MM7\n" /*adding the 4 multiplications except a5*a5*/
 "MOVD  4(%0), %%MM3\n" /*prefetch a1 in mm3*/
 "PADDQ %%MM7, %%MM7\n" /*...and doubling them*/
 "PADDQ %%MM4, %%MM7\n" /*adding the non-doubled a5*a5*/
 "MOVD 20(%0), %%MM4\n" /*prefetch a5 in mm4*/
 "MOVQ %%MM7, 128(%1)\n" /*result[16]*/  
 
 "MOVD  8(%0), %%MM7\n" /*a2 in mm7*/
 "PMULUDQ %%MM3, %%MM1\n" /* a1*a8 */
 "PMULUDQ %%MM0, %%MM2\n" /* a0*a9 */  
 "MOVD 12(%0), %%MM3\n" /*a3 in mm3*/
 "PMULUDQ %%MM7, %%MM6\n" /* a2*a7 */
 "MOVD 16(%0), %%MM7\n" /*a4 in mm7*/
 "PMULUDQ %%MM0, %%MM0\n" /* a0*a0 =  c initial value*/
 "PMULUDQ %%MM3, %%MM5\n" /* a3*a6 */  
 "PMULUDQ %%MM7, %%MM4\n" /* a4*a5 */    
 "PADDQ %%MM2, %%MM1\n"
 "MOVD 0(%3), %%MM2\n"  /*prefetch M to MM2 */
 "PADDQ %%MM1, %%MM6\n"
 "MOVQ %%MM0, %%MM7\n" /* move c to MM7 (a0*a0) */
 "MOVQ 128(%1), %%MM3\n" /*prefetch result16*/
 "PADDQ %%MM5, %%MM6\n"  
 "MOVQ %%MM2, %%MM0\n" /*cloning M to MM0 for secondary storage */
 "PADDQ %%MM4, %%MM6\n" /*adding all multiplications together except a0*a0 which is a different result*/
 "MOVD 4(%3), %%MM4\n"  /*R0 to MM4 */
 "PADDQ %%MM6, %%MM6\n" /*doubling the result  -- initial value for d*/
/* "MOVQ %%MM6, 144(%1)\n" */  /*result[17] - which is also the initial value for d so it can be used directly*/  
  
 
 "PAND %%MM6, %%MM2\n"   /* r[9] = d & M;  */
 "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */    
 "MOVD %%MM2, 36(%2)\n" /* r[9] = d & M;  */
 "MOVQ %%MM0, %%MM2\n"  /* M back to mm2 */
 "PADDQ %%MM3, %%MM6\n"  /*  d += (uint64_t)result[16]  */

  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ 120(%1), %%MM3\n" /*prefetch result15*/
  "MOVQ 0(%1), %%MM5\n"  /*prefetch result0 */    
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[15]  */  
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 0(%2)\n" /* exporting t0/r[0] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[0] */    

  "MOVQ 112(%1), %%MM3\n"   /*prefetch result14*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[14]  */  
  "MOVQ 8(%1), %%MM5\n"  /*prefetch result1 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 4(%2)\n" /* exporting t1/r[1] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[1] */    

  "MOVQ 104(%1), %%MM3\n"  /*prefetch result13*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[13]  */  
  "MOVQ 16(%1), %%MM5\n"  /*prefetch result2 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 8(%2)\n" /* exporting t2/r[2] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[2] */    

  "MOVQ 96(%1), %%MM3\n"  /*prefetch result12*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[12]  */  
  "MOVQ 24(%1), %%MM5\n"  /*prefetch result3 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 12(%2)\n" /* exporting t3/r[3] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[3] */      
  
  "MOVQ 88(%1), %%MM3\n"  /*prefetch result11*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[11]  */  
  "MOVQ 32(%1), %%MM5\n"  /*prefetch result4 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 16(%2)\n" /* exporting t4/r[4] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[4] */      

  "MOVQ 80(%1), %%MM3\n"  /*prefetch result10*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[10]  */
  "MOVQ 40(%1), %%MM5\n"  /*prefetch result5 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 20(%2)\n" /* exporting t5/r[5] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[5] */      

  "MOVQ 72(%1), %%MM3\n"  /*prefetch result9*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[9]  */  
  "MOVQ 48(%1), %%MM5\n"  /*prefetch result6 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 24(%2)\n" /* exporting t6/r[6] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[6] */        

  "MOVQ 64(%1), %%MM3\n"  /*prefetch result8*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[8]  */
  "MOVQ 56(%1), %%MM5\n"  /*prefetch result7 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 28(%2)\n" /* exporting t7/r[7] = c & M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[7] */    

  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "MOVD %%MM3, 32(%2)\n" /* exporting t8/r[8] = c & M */
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  
  "MOVQ %%MM0, %%MM2\n" /*cloning M*/
  "MOVD 36(%2), %%MM1\n" /* R[9] in */    
  "PMULUDQ %%MM6, %%MM4\n" /* d * R0 */
  "PADDQ %%MM1, %%MM4\n" /* d * R0 + r[9]*/  
  "PADDQ %%MM4, %%MM7\n" /* c+=  d * R0 + r[9]*/
  
  "PSRLQ $4, %%MM0\n"    /*  M >>= 4; */
  "PAND %%MM7, %%MM0\n" /*c & M >>4*/
  "MOVD %%MM0, 36(%2)\n"  /*  r[9] = c & (M >> 4); */
  
  "PSRLQ $22, %%MM7\n"    /*  c >>= 22 */  
  "PSLLQ $14, %%MM6\n" /* d * (R1 << 4). Since (R1 << 4) equals 16384, it's essentially a left shift by 14 */
  "PADDQ %%MM6, %%MM7\n" /* c += d * (R1 << 4); */

  "MOVQ %%MM7, %%MM3\n" /*cloning c*/
  "PSLLQ $6, %%MM7\n" /*  result of c * (R1 >> 4) which equals c shifted left by 6, since (R1 >> 4) = 64 */    

  "MOVQ %%MM3, %%MM0\n"     /*this is a manual attempt at multiplying c with x3D1 or 977 decimal, by shifting and adding copies of c ...*/
  "MOVQ %%MM3, %%MM1\n"     /*all this segment, is, in reality, just a (c*977) single line multiplication */
  "MOVQ %%MM3, %%MM6\n"     /* which for some reason doesn't want to work otherwise with a plain PMULUDQ c * 977 constant  */
  "MOVQ %%MM3, %%MM4\n"
  "MOVQ %%MM3, %%MM5\n"  
  "PSLLQ $9, %%MM0\n" /* x512 */    
  "PSLLQ $8, %%MM1\n" /* x256 */    
  "PSLLQ $7, %%MM6\n" /* x128 */      
  "PSLLQ $6, %%MM4\n" /* x64 */        
  "PSLLQ $4, %%MM5\n" /* x16 */        /*512+256+128+64+16 = 976 times c, so +1 add on top = 977 times c, or c * 0x3D1 */
  "PADDQ %%MM3, %%MM0\n"
  "PADDQ %%MM1, %%MM0\n"
  "MOVD 0(%2), %%MM3\n"          /*prefetch r[0] to MM3 */
  "PADDQ %%MM6, %%MM0\n"  
  "PADDQ %%MM4, %%MM0\n"  
  "PADDQ %%MM0, %%MM5\n"     /*  result of c * (R0 >> 4) */    
    
  "PADDQ %%MM3, %%MM5\n"  /* d = r[0] + c (R0 >> 4) */
  "MOVD 4(%2), %%MM6\n"  /*r[1] to MM6 */  
  "MOVQ %%MM5, %%MM3\n" /*cloning d */
  
  "PAND %%MM2, %%MM5\n" /*d&M*/
  "MOVD 8(%2), %%MM0\n"  /*r[2] to MM0 */      
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */    
  "PADDQ %%MM7, %%MM6\n" /* c * (R1 >> 4) + r[1] */
  "PADDQ %%MM6, %%MM3\n" /*d   += c * (R1 >> 4) + r[1];  */
  "MOVD %%MM5, 0(%2)\n" /* export d to r[0] */
  "MOVQ %%MM3, %%MM7\n" /*cloning d */
  
  "PAND %%MM2, %%MM7\n" /*d&M*/
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */      
  "PADDQ %%MM0, %%MM3\n"  /*d   += r[2];*/
  "MOVD %%MM7, 4(%2)\n"  /*r[1] = d & M; */
  "MOVD %%MM3, 8(%2)\n"  /*r[2] =d */
  "EMMS\n"
  
 :
 : "q"(a), "q"(result), "q"(r) , "q" (tempstor)
 : "memory", "%mm0", "%mm1", "%mm2", "%mm3", "%mm4", "%mm5", "%mm6", "%mm7"
 );  
}    

If anyone wants to use it or integrate it somewhere, feel free (no credits required), but I can't claim it'll be 100% consistent when I don't even understand why two multiplications broke and required a workaround.

I may try this again in the future with a packed version on ymm/zmm registers. In a packed manner it might even trounce the 5x52 field. We'll see how it goes...


edit: Damn it, I just realized why it broke. I'm multiplying doublewords to get a quadword, yet c at ~46 bits can't fit in a doubleword source. Doh...
272  Bitcoin / Development & Technical Discussion / Re: How to use properly secp256k1 library on: April 23, 2017, 09:16:11 AM
The verification angle also has two other issues you aren't considering: consensus consistency.  Use of GMP in validation would make the exact behavior of GMP consensus critical, which is a problem because different systems run different code. (GMP also has a license which is more restrictive than the Bitcoin software).

I would never have thought consistency could be a problem with integer values until I learned it the hard way, losing a day getting beaten by two multiplications at the end of the 10x26 field code, while converting the whole thing into SSE2 run on MMX registers for my 12-year-old pentium-m laptop...

d   += c * (R1 >> 4) + t1

and

d    = c * (R0 >> 4) + t0;

where R1 >> 4 = 64, and R0 >> 4 = 977... there was just no way to get it to compute c * 64 or c * 977 like this: PMULUDQ with (register holding 64 or 977) as source and (register holding c) as target.

I still don't know why a left shift by 6 is not the same as a multiplication by 64, if overflow and wraparound aren't at play (or maybe they are, I don't know, but from what I see the bit counts should be 53 and 56, far from overflowing).

    d    = c * (R0 >> 4) + t0;
    VERIFY_BITS(d, 56);

    d   += c * (R1 >> 4) + t1;
    VERIFY_BITS(d, 53);


Anyway, my mind is still perplexed by this, but I had to work around it until the tests stopped breaking.

For the *64 I did it by shifting left by 6.

For the *977 I did it with a monstrosity where c gets copied into multiple registers, each copy shifted appropriately, and all the shifted copies added together.

Anyway, despite this expensive workaround, native 64-bit arithmetic on mmx registers trounced the gcc/clang/icc-compiled versions of a -m32 build. Opcode size was also reduced (1900 vs 2800+ bytes for field_mul, 1300 vs 2200+ bytes for field_sqr).

Quote
If you don't mean using GMP but just using different operations for non-sidechannel sensitive paths-- the library already does that in many places-- though not for FE mul/add. FE normalizes do take variable time in verification. If you have a speedup based on that feel free to submit it!

Right now the only massive speedup I have achieved is in the 32-bit version, where compilers use 32-bit registers (instead of 64-bit mmx/sse registers). For the 64-bit version, with non-avx use, I'm at ~10% gains in non-commented code (which seems bad for reviewing), of which 3-5% was a recent gain from reducing the clobbering parameters on the asm and manually interleaving the initial pushes and final pops at convenient stages (like multiplication and addition stalls).


The faster 32-bit version is below. I did add comments on what each line does, in case it's useful to others, but I wasn't considering it for "submission" due to the array use, which is not sidechannel-resistant (and I thought that was a "requirement"). Essentially an array is used at the start for storing the results of the multiplications and additions, which are non-linear and can thus be computed up front.

On my laptop (pentium-m 2.13ghz), bench_verify (with endomorphism) is down to 350us from 570us.

On my desktop (q8200 @ 1.86ghz) it's at 404us, down from 652us. It could probably drop a further 5-10% if code readability is allowed to suffer a lot (it's already suffering from some manual interleaving of memory operations with muls and adds to gain ~7-10%).

c version (-m32 / field 10x26):
field_sqr: min 0.187us / avg 0.188us / max 0.189us
field_mul: min 0.275us / avg 0.277us / max 0.278us
field_inverse: min 55.4us / avg 55.6us / max 55.8us
field_inverse_var: min 55.4us / avg 55.6us / max 55.9us
field_sqrt: min 53.8us / avg 54.1us / max 54.5us
...
context_verify: min 77649us / avg 77741us / max 77891us
context_sign: min 267us / avg 268us / max 269us
...

asm version  (-m32 / field 10x26)  (sse2 on mmx* registers):
field_sqr: min 0.101us / avg 0.101us / max 0.102us
field_mul: min 0.135us / avg 0.135us / max 0.135us
field_inverse: min 28.3us / avg 28.3us / max 28.4us
field_inverse_var: min 28.3us / avg 28.3us / max 28.4us
field_sqrt: min 28.0us / avg 28.0us / max 28.0us
...
context_verify: min 42876us / avg 43099us / max 43391us
context_sign: min 170us / avg 170us / max 170us

* On core2, it doesn't make a difference whether it's xmm or mm registers (except for a 2-3% speedup from removing the EMMS at the end). On the pentium-m, the xmm register operations are very slow to begin with; I suspect they are emulated in terms of width, leading to twice the operations internally, while the mmx registers are mapped onto the 80-bit FPU registers, which have full actual width and proper speed.

Code:
SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint32_t *r, const uint32_t *a, const uint32_t * SECP256K1_RESTRICT b) {
/*  uint64_t c, d;*/
    uint64_t result[19]; /*temp storage array*/
    uint32_t tempstor[2]; /*temp storage array*/
    const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL /* ,R1 = 0x400UL */ ;
    
tempstor[0]=M;
tempstor[1]=R0;
/*tempstor[2] for R1 isn't needed. It's 1024, so shifting left by 10 instead*/
    
  __asm__ __volatile__(  
  
 /* Part #1: The multiplications and additions of
  *
  *   d  = (uint64_t)(uint64_t)a[0] * b[9]
       + (uint64_t)a[1] * b[8]
       + (uint64_t)a[2] * b[7]
       + (uint64_t)a[3] * b[6]
       + (uint64_t)a[4] * b[5]
       + (uint64_t)a[5] * b[4]
       + (uint64_t)a[6] * b[3]
       + (uint64_t)a[7] * b[2]
       + (uint64_t)a[8] * b[1]
       + (uint64_t)a[9] * b[0]; */
 
 "MOVD  0(%0), %%MM0\n"  /*a0 */
 "MOVD 36(%1), %%MM2\n"  /*b9 */
 "MOVD  4(%0), %%MM1\n" /* a1 */
 "MOVD 32(%1), %%MM3\n" /* b8 */
 "PMULUDQ %%MM0, %%MM2\n" /*a0 * b9*/
 "PMULUDQ %%MM1, %%MM3\n" /*a1 * b8*/
 "MOVD  8(%0), %%MM4\n"  /*a2 */
 "MOVD 28(%1), %%MM6\n"  /*b7 */
 "MOVD 12(%0), %%MM5\n" /* a3 */
 "MOVD 24(%1), %%MM7\n" /* b6 */
 "PMULUDQ %%MM4, %%MM6\n" /*a2 * b7*/
 "PMULUDQ %%MM5, %%MM7\n" /*a3 * b6*/
 "MOVD 16(%0), %%MM0\n"  /*a4 */
 "MOVD 20(%0), %%MM1\n" /* a5 */
 "PADDQ %%MM2, %%MM3\n"
 "PADDQ %%MM6, %%MM7\n"
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 20(%1), %%MM2\n"  /*b5 */
 "MOVD 16(%1), %%MM3\n" /* b4 */
 "PMULUDQ %%MM0, %%MM2\n" /*a4 * b5*/
 "PMULUDQ %%MM1, %%MM3\n" /*a5 * b4*/
 "MOVD 24(%0), %%MM0\n"  /*a6 */
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 28(%0), %%MM1\n" /* a7 */
 "PADDQ %%MM3, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD 12(%1), %%MM2\n"  /*b3 */
 "MOVD  8(%1), %%MM3\n" /* b2 */
 "PMULUDQ %%MM0, %%MM2\n" /*a6 * b3*/
 "PMULUDQ %%MM1, %%MM3\n" /*a7 * b2*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 32(%0), %%MM0\n"  /*a8 */
 "MOVD 36(%0), %%MM1\n" /* a9 */
 "PADDQ %%MM3, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD  4(%1), %%MM2\n"  /*b1 */
 "MOVD  0(%1), %%MM3\n" /* b0 */
 "PMULUDQ %%MM0, %%MM2\n" /*a8 * b1*/
 "PMULUDQ %%MM1, %%MM3\n" /*a9 * b0*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD  4(%1), %%MM2\n"  /*b1 */
 "PADDQ %%MM3, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD  8(%1), %%MM3\n" /* b2 */
 "MOVQ %%MM7, 0(%2)\n"    /* extract result[0] */
  
  /* Part #2: The multiplications and additions of
  *
  *
  d += (uint64_t)a[1] * b[9]
       + (uint64_t)a[2] * b[8]
       + (uint64_t)a[3] * b[7]
       + (uint64_t)a[4] * b[6]
       + (uint64_t)a[5] * b[5]
       + (uint64_t)a[6] * b[4]
       + (uint64_t)a[7] * b[3]
       + (uint64_t)a[8] * b[2]
       + (uint64_t)a[9] * b[1]; */
 
 "PMULUDQ %%MM1, %%MM2\n" /*a9 * b1*/
 "PMULUDQ %%MM0, %%MM3\n" /*a8 * b2*/
 "MOVD 28(%1), %%MM6\n"  /*b7 */
 "MOVD 32(%1), %%MM7\n" /* b8 */
 "PMULUDQ %%MM5, %%MM6\n" /*a3 * b7*/
 "PMULUDQ %%MM4, %%MM7\n" /*a2 * b8*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD  4(%0), %%MM0\n"  /*a1 */
 "MOVD 36(%1), %%MM2\n"  /*b9 */
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 16(%0), %%MM1\n" /* a4 */
 "PADDQ %%MM6, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 24(%1), %%MM3\n" /* b6 */
 "PMULUDQ %%MM0, %%MM2\n" /*a1 * b9*/
 "PMULUDQ %%MM1, %%MM3\n" /*a4 * b6*/
 "MOVD 28(%0), %%MM4\n" /* a7 */
 "MOVD 12(%1), %%MM5\n" /* b3 */
 "PADDQ %%MM2, %%MM7\n"
 "MOVD 20(%0), %%MM0\n"  /*a5 */
 "MOVD 24(%0), %%MM1\n" /* a6 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 20(%1), %%MM2\n"  /*b5 */
 "MOVD 16(%1), %%MM3\n" /* b4 */
 "PMULUDQ %%MM0, %%MM2\n" /*a5 * b5*/
 "PMULUDQ %%MM1, %%MM3\n" /*a6 * b4*/
 "PMULUDQ %%MM4, %%MM5\n" /*a7 * b3*/
 "MOVD 20(%1), %%MM6\n" /* b5 */
 "PADDQ %%MM2, %%MM7\n"
 "MOVD 16(%0), %%MM2\n"  /*a4 */
 "PADDQ %%MM5, %%MM7\n"
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 24(%1), %%MM5\n"  /*b6 */
 "MOVQ %%MM7, 8(%2)\n"   /* extract result[1] */  
 
   /* Part #3: The multiplications and additions of
  *
  *     d += (uint64_t)a[2] * b[9]
       + (uint64_t)a[3] * b[8]
       + (uint64_t)a[4] * b[7]  
       + (uint64_t)a[5] * b[6]
       + (uint64_t)a[6] * b[5]
       + (uint64_t)a[7] * b[4]
       + (uint64_t)a[8] * b[3]
       + (uint64_t)a[9] * b[2];*/

 "PMULUDQ %%MM1, %%MM6\n" /*a6 * b5*/
 "MOVD 16(%1), %%MM7\n"  /*b4 */
 "PMULUDQ %%MM0, %%MM5\n" /*a5 * b6*/
 "PMULUDQ %%MM4, %%MM7\n" /*a7 * b4*/
 "MOVD 36(%1), %%MM3\n"  /*b9 */
 "MOVD 12(%0), %%MM1\n" /* a3 */
 "PADDQ %%MM6, %%MM5\n"
 "MOVD 32(%1), %%MM4\n" /* b8 */
 "PADDQ %%MM5, %%MM7\n"  /*keeping result additions in mm7*/
 "MOVD  8(%0), %%MM0\n"  /* a2 */
 "MOVD 28(%1), %%MM5\n"  /*b7 */
 "PMULUDQ %%MM1, %%MM4\n" /*a3 * b8*/
 "PMULUDQ %%MM2, %%MM5\n" /*a4 * b7*/
 "PMULUDQ %%MM0, %%MM3\n" /*a2 * b9*/
 "MOVD 12(%1), %%MM6\n"  /*b3 */
 "PADDQ %%MM4, %%MM7\n"
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 36(%0), %%MM4\n" /* a9 */
 "MOVD 32(%0), %%MM5\n"  /*a8 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD  8(%1), %%MM3\n"  /*b2 */
 "PMULUDQ %%MM5, %%MM6\n" /*a8 * b3*/
 "PMULUDQ %%MM4, %%MM3\n" /*a9 * b2  - (order is b2 * a9) */
 "MOVD 12(%1), %%MM0\n"  /*b3 */
 "PADDQ %%MM6, %%MM7\n"
 "MOVD 32(%1), %%MM6\n"  /*b8 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/  
 "MOVD 16(%1), %%MM3\n"  /*b4 */
 "MOVQ %%MM7, 16(%2)\n"   /* extract result[2] */  
 
  /* Part #4: The multiplications and additions of
  *
  *
  *    d += (uint64_t)a[3] * b[9]
       + (uint64_t)a[4] * b[8]
       + (uint64_t)a[5] * b[7]
       + (uint64_t)a[6] * b[6]
       + (uint64_t)a[7] * b[5]
       + (uint64_t)a[8] * b[4]
       + (uint64_t)a[9] * b[3]; */

 "PMULUDQ %%MM4, %%MM0\n" /*a9 * b3*/
 "MOVD 36(%1), %%MM7\n" /* b9 */
 "PMULUDQ %%MM5, %%MM3\n" /*a8 * b4*/
 "PMULUDQ %%MM2, %%MM6\n" /*a4 * b8*/
 "PMULUDQ %%MM1, %%MM7\n" /*a3 * b9*/  
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 24(%1), %%MM4\n"  /*b6 */
 "MOVD 20(%1), %%MM5\n" /* b5 */
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 20(%0), %%MM0\n"  /*a5 */
 "MOVD 28(%0), %%MM2\n"  /*a7 */
 "PADDQ %%MM6, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 24(%0), %%MM6\n"  /*a6 */
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "PMULUDQ %%MM2, %%MM5\n" /*a7 * b5*/
 "PMULUDQ %%MM6, %%MM4\n" /*a6 * b6 */
 "PMULUDQ %%MM0, %%MM3\n" /*a5 * b7*/
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 16(%0), %%MM1\n"  /*a4 */  
 "PADDQ %%MM4, %%MM7\n"
 "MOVD 32(%1), %%MM5\n"  /*b8 */  
 "PADDQ %%MM3, %%MM7\n"
 "MOVD 28(%1), %%MM4\n"  /*b7 */
 "MOVQ %%MM7, 24(%2)\n"   /* extract result[3] */  
 
  /* Part #5: The multiplications and additions of
  *  
        d += (uint64_t)a[4] * b[9]
       + (uint64_t)a[5] * b[8]
       + (uint64_t)a[6] * b[7]
       + (uint64_t)a[7] * b[6]
       + (uint64_t)a[8] * b[5]
       + (uint64_t)a[9] * b[4];  */
    
 "PMULUDQ %%MM6, %%MM4\n" /*a6 * b7 */
 "MOVD 24(%1), %%MM3\n" /* b6 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */
 "PMULUDQ %%MM0, %%MM5\n" /*a5 * b8*/
 "PMULUDQ %%MM2, %%MM3\n" /*a7 * b6*/
 "PMULUDQ %%MM1, %%MM7\n" /*a4 * b9*/
 "PADDQ %%MM4, %%MM5\n"
 "MOVD 36(%0), %%MM4\n"  /*a9 */
 "PADDQ %%MM5, %%MM3\n"
 "MOVD 20(%1), %%MM5\n"  /*b5 */
 "PADDQ %%MM3, %%MM7\n" /*keeping result additions in mm7*/
 "MOVD 32(%0), %%MM3\n" /* a8 */
 "MOVD 16(%1), %%MM1\n"  /*b4 */
 "PMULUDQ %%MM3, %%MM5\n" /*a8 * b5 */
 "PMULUDQ %%MM4, %%MM1\n" /*a9 * b4 */
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "MOVD 32(%1), %%MM5\n"  /*b8 */  
 "PADDQ %%MM1, %%MM7\n"
 "MOVQ %%MM7, 32(%2)\n"   /* extract result[4] */  
 
  /* Part #6: The multiplications and additions of
  *
  *
  *   d += (uint64_t)a[5] * b[9]
       + (uint64_t)a[6] * b[8]
       + (uint64_t)a[7] * b[7]
       + (uint64_t)a[8] * b[6]
       + (uint64_t)a[9] * b[5];
       */
  
 "PMULUDQ %%MM2, %%MM3\n" /*a7 * b7*/
 "MOVD 20(%1), %%MM1\n" /* b5 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */
 "PMULUDQ %%MM6, %%MM5\n" /*a6 * b8*/
 "PMULUDQ %%MM4, %%MM1\n" /*a9 * b5 */
 "PMULUDQ %%MM0, %%MM7\n" /*a5 * b9*/
 "PADDQ %%MM3, %%MM5\n"
 "MOVD 24(%1), %%MM3\n"  /*b6 */
 "PADDQ %%MM1, %%MM5\n"
 "MOVD 32(%0), %%MM1\n" /* a8 */
 "PADDQ %%MM5, %%MM7\n" /*keeping result additions in mm7*/
 "PMULUDQ %%MM1, %%MM3\n" /*a8 * b6 */
 "MOVD 24(%1), %%MM0\n" /* b6 */
 "MOVD 32(%1), %%MM5\n"  /*b8 */  
 "PADDQ %%MM3, %%MM7\n"
 "MOVQ %%MM7, 40(%2)\n"   /* extract result[5] */  
 
   /* Part #7: The multiplications and additions of
  *
  *
  *    d += (uint64_t)a[6] * b[9]
       + (uint64_t)a[7] * b[8]
       + (uint64_t)a[8] * b[7]
       + (uint64_t)a[9] * b[6]; */
 
 "PMULUDQ %%MM4, %%MM0\n" /*a9 * b6 */
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */  
 "PMULUDQ %%MM2, %%MM5\n" /*a7 * b8*/
 "PMULUDQ %%MM1, %%MM3\n" /*a8 * b7*/
 "PMULUDQ %%MM6, %%MM7\n" /*a6 * b9*/
 "PADDQ %%MM0, %%MM5\n"
 "PADDQ %%MM3, %%MM7\n"
 "MOVD 24(%1), %%MM0\n" /* b6 */
 "PADDQ %%MM5, %%MM7\n" /*adding results to mm7 */
 "MOVQ %%MM7, 48(%2)\n"   /* extract result[6] */
 
 
 /* Part #8: The multiplications and additions of 3 separate results
  *
  *  
       d += (uint64_t)a[7] * b[9]
       + (uint64_t)a[8] * b[8]
       + (uint64_t)a[9] * b[7];        result 7

       d += (uint64_t)a[8] * b[9]
       + (uint64_t)a[9] * b[8];        result 8

      d += (uint64_t)a[9] * b[9];    result 9 */
    
  
 "MOVD 28(%1), %%MM3\n"  /*b7 */
 "MOVD 32(%1), %%MM5\n"  /*b8 */
 "MOVD 36(%1), %%MM7\n"  /*b9 */  
 "PMULUDQ %%MM4, %%MM3\n" /*a9 * b7 */
 "PMULUDQ %%MM1, %%MM5\n" /*a8 * b8*/
 "MOVQ %%MM7, %%MM6\n"  /*b9 */  
 "PMULUDQ %%MM2, %%MM7\n" /*a7 * b9*/
 "PADDQ %%MM3, %%MM5\n"
 "PADDQ %%MM5, %%MM7\n"
 "MOVQ %%MM6, %%MM3\n"  /*b9 */  
 "MOVD 32(%1), %%MM5\n"  /*b8 */
 "MOVQ %%MM7, 56(%2)\n"   /* extract result[7] */
 "PMULUDQ %%MM1, %%MM6\n" /*a8 * b9*/  
 "PMULUDQ %%MM4, %%MM5\n" /*a9 * b8*/
 "PMULUDQ %%MM4, %%MM3\n" /*a9 * b9*/
 "MOVD 8(%0), %%MM7\n" /*a2*/
 "PADDQ %%MM5, %%MM6\n"
 "MOVQ %%MM3,  72(%2)\n"   /* extract result[9] */
 "MOVQ %%MM6,  64(%2)\n"   /* extract result[8] */
 
  /* Part #9: The multiplications and additions of
  *
  *       c += (uint64_t)a[0] * b[8]
       + (uint64_t)a[1] * b[7]  
       + (uint64_t)a[2] * b[6]  
       + (uint64_t)a[3] * b[5]  
       + (uint64_t)a[4] * b[4]
       + (uint64_t)a[5] * b[3]
       + (uint64_t)a[6] * b[2]
       + (uint64_t)a[7] * b[1]
       + (uint64_t)a[8] * b[0];   */
 
  
 "PMULUDQ %%MM7, %%MM0\n" /*a2 * b6 */
 "MOVD  0(%1), %%MM3\n" /* b0 */
 "PMULUDQ %%MM1, %%MM3\n" /*a8 * b0*/
 "MOVD  4(%1), %%MM7\n" /* b1 */
 "PMULUDQ %%MM2, %%MM7\n" /*a7 * b1*/
 "MOVD 4(%0), %%MM1\n" /*a1*/
 "PADDQ %%MM0, %%MM3\n"
 "MOVD 20(%1), %%MM4\n"  /*b5*/  
 "PADDQ %%MM3, %%MM7\n"  
 "MOVD 12(%0), %%MM2\n"  /*a3*/
 "MOVD 28(%1), %%MM5\n"  /*b7*/
 "PMULUDQ %%MM2, %%MM4\n" /*a3 * b5*/
 "MOVD 16(%0), %%MM3\n"  /*a4*/
 "MOVD 16(%1), %%MM6\n" /*b4*/
 "PMULUDQ %%MM1, %%MM5\n" /*a1 * b7*/
 "MOVD 0(%0), %%MM0\n" /*a0*/
 "PMULUDQ %%MM6, %%MM3\n" /*b4 * a4*/
 "MOVD 32(%1), %%MM6\n"  /*b8*/
 "PMULUDQ %%MM0, %%MM6\n" /*a0 * b8*/
 "PADDQ %%MM4, %%MM5\n"
 "MOVD 24(%0), %%MM4\n" /*a6*/
 "PADDQ %%MM6, %%MM3\n"
 "PADDQ %%MM5, %%MM7\n"
 "MOVD 8(%1), %%MM6\n"  /*b2*/
 "PADDQ %%MM3, %%MM7\n"
 "MOVD 12(%1), %%MM5\n"  /*b3*/
 "MOVD 20(%0), %%MM3\n" /*a5*/
 "PMULUDQ %%MM4, %%MM6\n" /*a6 * b2*/
 "PMULUDQ %%MM3, %%MM5\n" /*a5 * b3*/
 "PADDQ %%MM6, %%MM7\n"
 "MOVD 0(%1), %%MM4\n" /*b0*/
 "PADDQ %%MM5, %%MM7\n" /*addition results on mm7*/
 "MOVD 8(%0), %%MM3\n"  /*a2*/  
 "MOVD 4(%1), %%MM5\n"  /*b1*/
 "MOVQ %%MM7, 80(%2)\n"  /* extract result[10] */
 
 
   /* Part #10: The multiplications and additions of
  *
  *         c += (uint64_t)a[0] * b[3]
       + (uint64_t)a[1] * b[2]
       + (uint64_t)a[2] * b[1]
       + (uint64_t)a[3] * b[0];       result11  
  
       c += (uint64_t)a[0] * b[1]
       + (uint64_t)a[1] * b[0];       result12

        c  = (uint64_t)a[0] * b[0];  result13      
      
           c += (uint64_t)a[0] * b[2]
       + (uint64_t)a[1] * b[1]
       + (uint64_t)a[2] * b[0];       result14   */
        

 "PMULUDQ  %%MM4, %%MM2\n" /*b0 * a3*/
 "PMULUDQ  %%MM5, %%MM3\n" /*b1 * a2*/
 "MOVD 8(%1), %%MM7\n"  /*b2*/
 "MOVD 12(%1), %%MM6\n"  /*b3*/
 "PMULUDQ  %%MM7, %%MM1\n" /*b2 * a1*/
 "PMULUDQ  %%MM6, %%MM0\n" /*b3 * a0*/
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 8(%1), %%MM2\n"  /*b2*/  
 "PADDQ %%MM1, %%MM0\n"  
 "MOVD 4(%1), %%MM1\n"  /*b1*/  
 "PADDQ %%MM0, %%MM3\n"  
 "MOVD 0(%1), %%MM0\n"  /*b0*/  
 "MOVQ %%MM3, 88(%2)\n"  /* extract result[11] */  
 
 "MOVD 0(%0), %%MM4\n" /*a0*/
 "MOVQ %%MM0, %%MM3\n" /*b0*/
 "MOVD 4(%0), %%MM5\n"  /*a1*/
 "MOVD 8(%0), %%MM7\n"  /*a2*/
 "MOVQ %%MM1, %%MM6\n" /*b1*/
 "PMULUDQ %%MM5, %%MM3\n" /*a1 * b0*/
 "PMULUDQ %%MM4, %%MM6\n" /*a0 * b1*/
 "PADDQ %%MM3, %%MM6\n"
 "MOVQ %%MM0, %%MM3\n" /*b0*/
 "MOVQ %%MM6, 96(%2)\n"  /* extract result[12] */  
 "PMULUDQ %%MM4, %%MM3\n" /*a0 * b0*/
 "PMULUDQ %%MM4, %%MM2\n" /*a0 * b2*/
 "MOVQ %%MM3, 104(%2)\n"  /* extract result[13] */    
 "MOVQ %%MM1, %%MM6\n" /*b1*/
 "PMULUDQ %%MM5, %%MM6\n" /*a1 * b1*/
 "MOVQ %%MM0, %%MM3\n" /*b0*/
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b0*/  
 "PADDQ %%MM2, %%MM6\n"
 "PADDQ %%MM6, %%MM3\n"
 "MOVD 16(%1), %%MM2\n"  /*b4*/  
 "MOVQ %%MM3, 112(%2)\n"  /* extract result[14] */  
 
  /* Part #11: The multiplications and additions of
  *
  *  
     c += (uint64_t)a[0] * b[4]  
       + (uint64_t)a[1] * b[3]  
       + (uint64_t)a[2] * b[2]
       + (uint64_t)a[3] * b[1]
       + (uint64_t)a[4] * b[0]  */

 "PMULUDQ %%MM4, %%MM2\n" /*a0 * b4 */
 "MOVD 16(%0), %%MM3\n"  /*a4*/    
 "MOVD 12(%0), %%MM6\n" /* a3*/
 "PMULUDQ %%MM0, %%MM3\n" /*b0 * a4 */
 "PMULUDQ %%MM1, %%MM6\n" /*b1 * a3 */  
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 12(%1), %%MM2\n"  /*b3*/    
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 8(%1), %%MM3\n"  /*b2*/  
 "PMULUDQ %%MM5, %%MM2\n" /*a1 * b3 */
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b2 */
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 20(%1), %%MM2\n"  /*b5*/    
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 20(%0), %%MM3\n"  /*a5*/    
 "MOVQ %%MM6, 120(%2)\n"  /* extract result[15] */  
 
   /* Part #12: The multiplications and additions of
  *
  *       c += (uint64_t)a[0] * b[5]
       + (uint64_t)a[1] * b[4]
       + (uint64_t)a[2] * b[3]
       + (uint64_t)a[3] * b[2]
       + (uint64_t)a[4] * b[1]
       + (uint64_t)a[5] * b[0] */  
  
 "PMULUDQ %%MM4, %%MM2\n" /*a0 * b5 */
 "MOVD 16(%0), %%MM6\n" /* a4*/
 "PMULUDQ %%MM0, %%MM3\n" /*b0 * a5 */
 "PMULUDQ %%MM1, %%MM6\n" /*b1 * a4*/  
 "PADDQ %%MM2, %%MM3\n"
 "MOVD 16(%1), %%MM2\n"  /*b4*/  
 "PADDQ %%MM3, %%MM6\n" /*adding results to mm6*/
 "MOVD 12(%1), %%MM3\n"  /*b3*/    
 "PMULUDQ %%MM5, %%MM2\n" /*a1 * b4 */
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b3 */
 "MOVD 8(%1), %%MM0\n"  /*b2*/  
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 12(%0), %%MM2\n" /* a3*/
 "PADDQ %%MM3, %%MM6\n"
 "PMULUDQ %%MM0, %%MM2\n" /*b2 * a3 */    
 "MOVD 24(%0), %%MM3\n"  /*a6*/  
 "MOVD 0(%1), %%MM0\n"  /*b0*/  
 "PADDQ %%MM2, %%MM6\n" /* all additions end up in mm6*/
 "MOVD 24(%1), %%MM2\n"  /*b6*/  
 "MOVQ %%MM6, 128(%2)\n"  /* extract result[16] */  
  
    /* Part #13: The multiplications and additions of
  *  
  *     c += (uint64_t)a[0] * b[6]
       + (uint64_t)a[1] * b[5]
       + (uint64_t)a[2] * b[4]
       + (uint64_t)a[3] * b[3]
       + (uint64_t)a[4] * b[2]
       + (uint64_t)a[5] * b[1]
       + (uint64_t)a[6] * b[0];   */
  
 "PMULUDQ %%MM0, %%MM3\n" /*a6 * b0 */
 "MOVD 20(%0), %%MM6\n" /* a5*/  
 "PMULUDQ %%MM4, %%MM2\n" /*b6 * a0 */
 "PMULUDQ %%MM1, %%MM6\n" /*a5 * b1*/    
 "PADDQ %%MM2, %%MM3\n"
 "PADDQ %%MM3, %%MM6\n" /*adding all results on mm6*/
 "MOVD 20(%1), %%MM2\n"  /*b5*/  
 "MOVD 16(%1), %%MM3\n"  /*b4*/    
 "PMULUDQ %%MM5, %%MM2\n" /*a1 * b5*/
 "PMULUDQ %%MM7, %%MM3\n" /*a2 * b4*/
 "MOVD 8(%1), %%MM4\n"  /*b2*/  
 "MOVD 12(%1), %%MM1\n"  /*b3*/  
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 12(%0), %%MM2\n"  /*a3*/  
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 16(%0), %%MM3\n"  /*a4*/    
 "PMULUDQ %%MM1, %%MM2\n" /*b3 * a3 */
 "PMULUDQ %%MM4, %%MM3\n" /*b2 * a4 */
 "PADDQ %%MM2, %%MM6\n"
 "MOVD 4(%1), %%MM1\n"  /*b1*/
 "MOVD 24(%0), %%MM2\n" /*a6*/
 "PADDQ %%MM3, %%MM6\n"
 "MOVD 0(%0), %%MM4\n"  /*a0*/  
 "MOVQ %%MM6, 136(%2)\n"  /* extract result[17] */  
  
 /* Part #14: The multiplications and additions of
  *  
  *       c += (uint64_t)a[0] * b[7]
       + (uint64_t)a[1] * b[6]  
       + (uint64_t)a[2] * b[5]
       + (uint64_t)a[3] * b[4]
       + (uint64_t)a[4] * b[3]
       + (uint64_t)a[5] * b[2]
       + (uint64_t)a[6] * b[1]  
       + (uint64_t)a[7] * b[0];    */
      
 "PMULUDQ %%MM2, %%MM1\n" /*a6 * b1 */
 "MOVD 28(%0), %%MM6\n" /*a7*/
 "MOVD 28(%1), %%MM3\n" /*b7*/  
 "PMULUDQ %%MM6, %%MM0\n" /*a7 * b0 */
 "PMULUDQ %%MM3, %%MM4\n" /*b7 * a0*/    
 "MOVD 24(%1), %%MM6\n" /*b6*/
 "MOVD 20(%1), %%MM2\n" /*b5*/
 "PMULUDQ %%MM6, %%MM5\n" /*b6 * a1 */
 "PMULUDQ %%MM2, %%MM7\n" /*b5 * a2 */
 "PADDQ %%MM0, %%MM1\n"
 "MOVQ 8(%2), %%MM3\n"/*prefetch result1*/
 "PADDQ %%MM4, %%MM5\n"
 "MOVD 12(%0), %%MM0\n" /*a3*/
 "MOVD  8(%1), %%MM6\n" /*b2*/
 "PADDQ %%MM1, %%MM5\n"
 "MOVD 20(%0), %%MM2\n" /*a5*/
 "MOVD 12(%1), %%MM4\n" /*b3*/
 "PADDQ %%MM5, %%MM7\n"
 "PMULUDQ %%MM2, %%MM6\n" /*a5 * b2 */
 "MOVD 16(%0), %%MM1\n" /*a4*/
 "MOVD 16(%1), %%MM5\n" /*b4*/
 "PMULUDQ %%MM1, %%MM4\n" /*a4 * b3 */
 "PMULUDQ %%MM0, %%MM5\n" /*a3 * b4*/    
 "PADDQ %%MM6, %%MM4\n"
 "MOVD 0(%4), %%MM2\n"  /*prefetch M to MM2 */
 "PADDQ %%MM4, %%MM5\n"
 "PADDQ %%MM7, %%MM5\n"
 "MOVQ 0(%2), %%MM6\n" /* prefetch d in from result[0]*/
 "MOVQ %%MM2, %%MM0\n" /*M secondary storage */
 "MOVQ %%MM5, 144(%2)\n"  /* extract result[18] */  
  

  "MOVQ 104(%2), %%MM7\n" /* c in from result[13] */
  "PAND %%MM6, %%MM2\n"   /* r[9] = d & M;  */
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */    
  "MOVD 4(%4), %%MM4\n"  /*R0 to MM4 */  
  "MOVD %%MM2, 36(%3)\n" /* extract r[9] = d & M;  */
  "PADDQ %%MM3, %%MM6\n"  /*  d += (uint64_t)result[1]  */
  "MOVQ %%MM0, %%MM2\n"  /* M back to mm2 */
  
  "MOVQ 16(%2), %%MM3\n"  /*prefetch result2*/
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[2]  */
  "MOVQ 96(%2), %%MM5\n"  /* prefetch result12 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 0(%3)\n" /* exporting t0/r[0] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[12] */    
  
  "MOVQ 24(%2), %%MM3\n"/*prefetch result3*/  
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[3]  */
  "MOVQ 112(%2), %%MM5\n"  /* prefetch result14 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 4(%3)\n" /* exporting t1/r[1] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[14] */    
  
  "MOVQ 32(%2), %%MM3\n"/*prefetch result4*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[4]  */
  "MOVQ 88(%2), %%MM5\n"  /* prefetch result11 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 8(%3)\n" /* exporting t2/r[2] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[11] */    

  "MOVQ 40(%2), %%MM3\n"/*prefetch result5*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[5]  */
  "MOVQ 120(%2), %%MM5\n"  /* prefetch result15 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 12(%3)\n" /* exporting t3/r[3] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */    
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[15] */    
  
  "MOVQ 48(%2), %%MM3\n"/*prefetch result6*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[6]  */
  "MOVQ 128(%2), %%MM5\n"  /* prefetch result16 */    
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 16(%3)\n" /* exporting t4/r[4] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[16] */    

  "MOVQ 56(%2), %%MM3\n"/*prefetch result7*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[7]  */
  "MOVQ 136(%2), %%MM5\n"  /* prefetch result16 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 20(%3)\n" /* exporting t5/r[5] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */    
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[16] */      
  
  "MOVQ 64(%2), %%MM3\n"/*prefetch result8*/    
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[8]  */
  "MOVQ 144(%2), %%MM5\n"  /* prefetch result18 */        
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 24(%3)\n" /* exporting t6/r[6] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */  
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[18] */      
  
  "MOVQ 72(%2), %%MM3\n"/*prefetch result9*/  
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM3, %%MM6\n"  /*   d += (uint64_t)result[9]  */
  "MOVQ 80(%2), %%MM5\n"  /* prefetch result10 */      
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 28(%3)\n" /* exporting t7/r[7] = c & M */
  "MOVQ %%MM0, %%MM2\n"  /* M */    
  "PADDQ %%MM5, %%MM7\n"  /*   c += (uint64_t)result[10] */      
  
  "PAND %%MM6, %%MM2\n"   /* u0 = d & M;  */
  "MOVQ %%MM2, %%MM1\n"   /*u0 out to temp mm1*/  
  "PSRLQ $26, %%MM6\n"    /*  d >>= 26; */
  "PSLLQ $10, %%MM1\n" /* R1 * u0  - since R1 equals 1024, it's a shift left by 10 for u0*/  
  "PMULUDQ %%MM4, %%MM2\n" /* R0 * u0 */
  "PADDQ %%MM2, %%MM7\n"  /* c = (result from R0*u0) + c */
  "MOVQ %%MM7, %%MM3\n" /*cloning c for the ANDing */
  "PAND %%MM0, %%MM3\n"  /*c & M*/
  "PSRLQ $26, %%MM7\n"    /*  c >>= 26; */  
  "MOVQ %%MM0, %%MM2\n" /*cloning M to mm2*/
  "PADDQ %%MM1, %%MM7\n" /* c += u0 * R1 */
  "MOVD %%MM3, 32(%3)\n" /* exporting t8/r[8] = c & M */

  "PMULUDQ %%MM6, %%MM4\n" /* d * R0 */
  "PSRLQ $4, %%MM0\n"    /*  M >>= 4 */  
  "MOVD 36(%3), %%MM1\n" /* R[9] in */    
  "PADDQ %%MM4, %%MM7\n" /* c+=  d * R0 */
  "PADDQ %%MM1, %%MM7\n" /* c+=  r[9]     ===> c+= d * R0 + r[9]*/  

  "PAND %%MM7, %%MM0\n"  /* c & (M >> 4) */
  "MOVD %%MM0, 36(%3)\n"  /*  r[9] = c & (M >> 4) */

  "PSRLQ $22, %%MM7\n"    /*  c >>= 22 */  
  "PSLLQ $14, %%MM6\n" /* d * (R1 << 4). Since (R1 << 4) equals 16384, it's essentially a left shift by 14 */
  "PADDQ %%MM6, %%MM7\n" /* c += d * (R1 << 4) */
 
  
  "MOVQ %%MM7, %%MM3\n" /*cloning c*/
  "PSLLQ $6, %%MM7\n" /*  result of c * (R1 >> 4) which equals c shifted left by 6, since (R1 >> 4) = 64 */    

  "MOVQ %%MM3, %%MM0\n"     /*this is a manual attempt at multiplying c with x3D1 or 977 decimal, by shifting and adding copies of c ...*/
  "MOVQ %%MM3, %%MM1\n"     /*all this segment, is, in reality, just a (c*977) single line multiplication */
  "MOVQ %%MM3, %%MM6\n"     /* which for some reason doesn't want to work otherwise with a plain PMULUDQ c * 977 constant  */
  "MOVQ %%MM3, %%MM4\n"
  "MOVQ %%MM3, %%MM5\n"  
  "PSLLQ $9, %%MM0\n" /* x512 */    
  "PSLLQ $8, %%MM1\n" /* x256 */    
  "PSLLQ $7, %%MM6\n" /* x128 */      
  "PSLLQ $6, %%MM4\n" /* x64 */        
  "PSLLQ $4, %%MM5\n" /* x16 */        /*512+256+128+64 = 976x, so +1 add on top =977 or 0x3D1 */
  "PADDQ %%MM3, %%MM0\n"
  "PADDQ %%MM1, %%MM6\n"
  "MOVD 0(%3), %%MM3\n"  /*prefetch r[0] to MM3 */    
  "PADDQ %%MM4, %%MM0\n"
  "PADDQ %%MM6, %%MM0\n"  
  "PADDQ %%MM0, %%MM5\n"     /*  result of c * (R0 >> 4) */    
  
  "PADDQ %%MM3, %%MM5\n"  /* d = r[0] + c (R0 >> 4) */
  "MOVD 4(%3), %%MM4\n"  /*r[1] to MM4 */  
  "MOVD 8(%3), %%MM0\n"  /*r[2] to MM5 */    
  "MOVQ %%MM5, %%MM3\n" /*cloning d */

  "PAND %%MM2, %%MM5\n" /*d&M*/
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */    
  "PADDQ %%MM7, %%MM4\n" /* c * (R1 >> 4) + r[1] */
  "PADDQ %%MM4, %%MM3\n" /*d   += c * (R1 >> 4) + r[1];  */
  "MOVD %%MM5, 0(%3)\n" /* export d to r[0] */
  "MOVQ %%MM3, %%MM7\n" /*cloning d */
  
  "PAND %%MM2, %%MM7\n" /*d&M*/
  "PSRLQ $26, %%MM3\n"    /*  d >>= 26 */      
  "PADDQ %%MM0, %%MM3\n"  /*d   += r[2];*/
  "MOVD %%MM7, 4(%3)\n"  /*r[1] = d & M; */
  "MOVD %%MM3, 8(%3)\n" /*r[2]=d;*/
  "EMMS\n"

:
: "q"(a), "q"(b), "q"(result), "q"(r), "S"(tempstor)
: "memory", "%mm0", "%mm1", "%mm2", "%mm3", "%mm4", "%mm5", "%mm6", "%mm7"
);
}

273  Bitcoin / Development & Technical Discussion / Re: How to use properly secp256k1 library on: April 22, 2017, 03:59:25 PM
libsecp256k1 is a crypto library-- not a bignum library.

Its primitive operations are constant time for sidechannel resistance, they're not expected to be faster than GMP for most things-- just not gratuitously slower.

From what I understand, side-channel resistance is required for signing, so that the private key is not leaked, right? Verification involves only public data, so it wouldn't matter (in the case of BTC).

If that is the case, and the rationale is correct, then it would probably make sense for an app like Bitcoin, with asymmetrical verification/signing loads (you need to verify thousands upon thousands of signatures to sync the blockchain for the last few days, yet you only send money a couple of times), to have two versions of the operations:

- One which isn't side-channel resistant but is very fast, and
- one which is slower but extremely hardened against side-channel attacks, so that you won't lose money.

Depending on what you want to do (syncing the blockchain or signing a new tx), you call a different function and enjoy the best of both worlds: maximum speed when you need to verify txs in bulk, and maximum security when you sign a tx.

Am I wrong somewhere?
274  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: April 15, 2017, 02:20:40 AM
A $200 gap between some of the main exchanges (OKCoin - Finex).

This is madness.

I think some prices in China are due to the withdrawal issue (you can withdraw in cash, not in BTC, so you sell the BTC to get your cash).

Not sure if they've allowed normal withdrawals.
275  Bitcoin / Development & Technical Discussion / Re: Speeding up signature verification on: April 08, 2017, 12:14:52 AM
After tinkering with the code over the past year, I now understand that I initially asked the wrong question about running multiple verifications in parallel in a SIMD manner: the calculations required for a single signature are already too many by themselves.

Since this is the case, the question should be rephrased as whether SIMD can be used on the multiple calculations within a single pass. As you rightly pointed out, quadword->octaword multiplication is generally not supported.

Now, I think I found a way to use vectorization, bypassing the quadword->octaword obstacle - since it doesn't seem that it will be getting hardware support anytime soon.

The answer lies in breaking the multiplications up into 32x32-bit multiplies producing 64-bit results, as the 10x26 field representation does, and then packing the sources for packed multiplications.

I put the multiplications of 10x26 in vectorized SIMD loops, #pragma simd'ed (icc*), and it pulled it off (result on the left).



This is the PMULUDQ instruction (http://www.felixcloutier.com/x86/PMULUDQ.html). It takes 4x 32-bit sources and produces 2x 64-bit outputs.

On my Q8200 (no AVX2), in x64 mode (utilizing the 32-bit field), this is ~5-10% slower than issuing multiple 64-bit imuls. Imuls in x64 mode are very convenient for 32x32-bit => 64-bit multiplies, and that is what is normally generated in x64 mode.

In an AVX2 scenario, packed multiplication goes up to 8x32 bit sources / 4x64 bit results.

In an AVX512 scenario (assuming there is a similar instruction), it should pack 16x 32-bit sources into eight 64-bit results. Plus, with an extra 16 registers, it should eliminate a lot of memory accesses. If verification speed becomes an issue in the future, we might be able to exploit this.

As an incidental "discovery", while compiling with -m32 to check the 32-bit output, I saw that the uint64s are produced with plain MULs in eax:edx (which is to be expected), although many 32-bit machines DO have support for SSE2. In x86+SSE2 mode, you can get either one or two 64-bit outputs (depending on how you word the instruction) without going through MULs/ADDs in eax:edx - which is slower.

*gcc doesn't like the d=d+dd part inside the loop and hesitates to vectorize even with #pragma GCC ivdep. One needs to do the additions of the results outside the loop, or write it manually in asm. ICC does it like a boss with the PADDs from the results.


edit: Something else I remembered regarding MULX + ADCX which I've mentioned previously. I tried building with

./configure --enable-benchmark --enable-endomorphism --with-asm=no CFLAGS="-O3 -march=skylake"

...to check the asm output of new compilers.

GCC 6.3 now outputs MULXs (no ADCXs though) in field and scalar multiplications.
Clang 3.9 now employs both MULXs and ADCXs.

(from clang)

000000000040c300 <secp256k1_fe_mul>:
...
  40c319:   48 8b 4e 08             mov    0x8(%rsi),%rcx
  40c31d:   48 89 4c 24 b8          mov    %rcx,-0x48(%rsp)
  40c322:   c4 62 e3 f6 c8          mulx   %rax,%rbx,%r9
  40c327:   49 89 c3                mov    %rax,%r11
  40c32a:   4c 89 5c 24 d8          mov    %r11,-0x28(%rsp)
  40c32f:   49 8b 52 10             mov    0x10(%r10),%rdx
  40c333:   48 89 54 24 a0          mov    %rdx,-0x60(%rsp)
 40c338:   c4 62 fb f6 c1          mulx   %rcx,%rax,%r8
  40c33d:   48 01 d8                add    %rbx,%rax
 40c340:   66 4d 0f 38 f6 c1       adcx   %r9,%r8
  40c346:   48 8b 6e 10             mov    0x10(%rsi),%rbp
  40c34a:   49 8b 1a                mov    (%r10),%rbx
...

...although how much faster it is compared to MUL/ADC C code or asm, I can't say.
276  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: April 01, 2017, 01:38:00 AM
Difficulty History

Date   Difficulty   Change   Hash Rate
Mar 30 2017   499,635,929,817   5.03%   3,576,533,297 GH/s
Mar 17 2017   475,705,205,062   3.24%   3,405,230,497 GH/s
Mar 03 2017   460,769,358,091   4.54%   3,298,315,540 GH/s
Feb 18 2017   440,779,902,287   4.41%   3,155,225,442 GH/s
Feb 04 2017   422,170,566,884   7.43%   3,022,014,630 GH/s
Jan 22 2017   392,963,262,344   16.64%   2,812,940,600 GH/s
Jan 10 2017   336,899,932,796   6.05%   2,411,623,656 GH/s
Dec 28 2016   317,688,400,354   2.43%   2,274,102,150 GH/s

Estimated Next Difficulty:   512,261,078,186 (+2.53%)

---

LOL

"Bitcoin CEO" Cheesy
277  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: March 28, 2017, 12:55:21 PM
Is it possible to increase the blocksize without a fork? Even just in theory?

I think Vitalik had proposed something like a soft fork that increases the frequency of issued blocks by tampering with the timestamps and difficulty, so that you get more blocks per hour (and thus more txs per hour). That was 1-2 years ago, so I don't remember the details. It's effectively similar to a blocksize increase, but through a workaround that also allows faster confirmations. It's not the most technically elegant solution though.

Can someone give a sane answer to why ...

Altcoin pumping.
278  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: March 25, 2017, 11:35:01 PM
I'm not sure what to think about a fork... or if it will happen lol, but ETH and ETH classic have done relatively well for themselves since. Of course ETH will prevail as the leader because of its dev support. Although that was a timed and controlled fork.

ETH is a centralized coin that can avoid chaotic circumstances, precisely due to its centralized control.

Didn't realize it was a centralized coin; I thought all cryptocurrencies, altcoins included, are decentralized.
Well, that changes my perspective on how I view Ethereum now.

No, it's not an inherent property of cryptocurrencies. The ideal is of course to be decentralized, but this is rarely the case. Plus, centralization is not a black-or-white value; it comes in degrees. There are also various types of centralization: some cryptocurrencies are heavily centralized in the developers who set the course, others in miners (PoW coins), others in stakeholders (PoS coins), etc.

I'm not sure what to think about a fork... or if it will happen lol, but ETH and ETH classic have done relatively well for themselves since. Of course ETH will prevail as the leader because of its dev support. Although that was a timed and controlled fork.

ETH is a centralized coin that can avoid chaotic circumstances, precisely due to its centralized control.

Well, not really centralized... I don't think any coin is really centralized. I think miners choose to mine it because its support and hash-per-dollar are much better, at least right now lol

Maybe centralized in that it has strong dev support that can add to or alter the code, etc. So? Idk lol

Altcoins with an evolving feature set are typically very dev-centralized. If the dev issues 5 feature forks or bugfix forks a year, you can't really stay behind, because you'd be sitting on a dead chain.
279  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: March 25, 2017, 11:11:08 PM
I'm not sure what to think about a fork... or if it will happen lol, but ETH and ETH classic have done relatively well for themselves since. Of course ETH will prevail as the leader because of its dev support. Although that was a timed and controlled fork.

ETH is a centralized coin that can avoid chaotic circumstances, precisely due to its centralized control.
280  Economy / Speculation / Re: Wall Observer BTC/USD - Bitcoin price movement tracking & discussion on: March 25, 2017, 10:27:18 PM
Main thing for me is, in the end I want only one coin. This talk of 2 chains is ridiculous.

2 coins would at least put an end for good to this endless drama.
The market will sort out the good coins from the bad ones.

I don't think there will ever be enough consensus for a fork. However, if there is one then doesn't it mean all Bitcoin holders can double their money, provided the price stays the same?

Price will not stay the same! The BUcoin price will crash immediately as a result of oversupply; everybody will sell it. The price of Bitcoin will shoot up to the moon, firstly because everybody will convert BUcoins into Bitcoins, and secondly because there will be no obstacles anymore to implementing segwit, LN and sidechains.

So we get to keep all our coins, sell the fork's coins for a fair amount of cash, and the coins we hold shoot up to the moon in value?

That sounds like the kind of sweet deal that will entice speculators to buy in advance.

Forking MONEY is self-destructive for the system. Confusion, loss of confidence, precedent for future forks, etc etc.