Bitcoin Forum
661  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 23, 2019, 05:42:24 PM
Very strange error.

If I modify the function __device__ void _ModMult(uint64_t *r, uint64_t *a, uint64_t *b) in any way, for example like this:

Code:
  // Reduce from 320 to 256 
  UADD1(t[4],0ULL);
  UMULLO(al,t[4], 0x1000003D1ULL);
  UMULHI(ah,t[4], 0x1000003D1ULL);
  UADDO(r[0],r512[0], al);
  UADDC(r[1],r512[1], ah);
  UADDC(r[2],r512[2], 0ULL);
  UADD(r[3],r512[3], 0ULL);

  UADD1(r[3],0x07ULL);  <-- error!!!

With the -check option I get all errors, as expected:

Code:
CPU found 1539 items
GPU: point   correct [0/271]
GPU: endo #1 correct [0/248]
GPU: endo #2 correct [0/260]
GPU: sym/point   correct [0/255]
GPU: sym/endo #1 correct [0/265]
GPU: sym/endo #2 correct [0/240]
GPU/CPU check Failed !

but with the standard command I still get the correct result:

Code:
~/VanitySearch$ ./VanitySearch -stop -t 7 -gpu 1111
Difficulty: 16777216
Search: 1111 [Compressed]
Start Sat Mar 23 18:39:22 2019
Base Key:12FF1E3D528DC8068438E8ED181E1F2505E877A7543869B0B38E500F5FA284F9
Number of CPU thread: 7
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(64x128)

Pub Addr: 1111Cf8ucVbgUtANTRGwQsWVpXVZvqFT6
Prv Addr: 5HxepgskWZ53AokCCvk8d1ZZGinupSX4Sm7tNQygZ9zQpkftRQJ
Prv Key : 0x12FF1E3D528DC8068438E8ED181E1F2505E877A7543869B5B38E500F5FA4D5D3
Check   : 1DFm6mzxxKqFo9bysKC9x1TxEz5Z9d9uAb
Check   : 1111Cf8ucVbgUtANTRGwQsWVpXVZvqFT6 (comp)

How is this possible?
662  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 22, 2019, 01:32:23 PM
Yes, I already did it.

It is driving me crazy.
It works on my 2 configs, and a user on GitHub just posted a report from a GeForce GTX 1080 Ti (ccap=6.1) running Ubuntu 18.04 where it works fine (he uses CUDA 10).

Don't worry, cuda 8 needs g++ 4.9, that's the problem.


About performance: I think most people use only compressed addresses.

If you write a specific ComputeKeys for compressed keys only (don't compute y at all!):

Code:
    for (uint32_t i = 0; i < HSIZE; i++) {

      // P = StartPoint + i*G
      Load256(px, sx);
      Load256(py, sy);
      ModSub256(dy, Gy[i], py);

      _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);

      ModSub256(px, _p2,px);
      ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
      /*
      ModSub256(py, Gx[i], px);
      _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      ModSub256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  
      */
      CHECK_PREFIX(GRP_SIZE / 2 + (i + 1));
      
      // P = StartPoint - i*G, if (x,y) = i*G then (x,-y) = -i*G
      Load256(px, sx);
      Load256(py, sy);
      //ModNeg256(dy,Gy[i]);
      //ModSub256(dy, py);
      ModSub256(dy, pyn, Gy[i]);

      _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);

      ModSub256(px, _p2, px);
      ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
      /*
      ModSub256(py, Gx[i], px);
      _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      
      ModAdd256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  

      //ModSub256(py, sx, px);
      //_ModMult(py, _s);             // py = - s*(ret.x-p2.x)
      //ModSub256(py, sy);
      */
      CHECK_PREFIX(GRP_SIZE / 2 - (i + 1));

    }
    
    // First point (startP - (GRP_SIZE/2)*G)
    Load256(px, sx);
    Load256(py, sy);
    ModNeg256(dy, Gy[i]);
    ModSub256(dy, py);

    _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
    //_ModMult(_p2, _s, _s);        // _p = pow2(s)
    _ModSqr(_p2, _s);

    ModSub256(px, _p2, px);
    ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;
    /*
    ModSub256(py, Gx[i], px);
    _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
    
    ModAdd256(py, Gy[i]);         // py = - p2.y - s*(ret.x-p2.x);  
    */
    CHECK_PREFIX(0);

    i++;

    // Next start point (startP + GRP_SIZE*G)
    Load256(px, sx);
    Load256(py, sy);
    ModSub256(dy, _2Gny, py);

    _ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
    //_ModMult(_p2, _s, _s);        // _p = pow2(s)
    _ModSqr(_p2, _s);

    ModSub256(px, _p2, px);
    ModSub256(px, _2Gnx);         // px = pow2(s) - p1.x - p2.x;

    ModSub256(py, _2Gnx, px);
    _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
    //_ModSqr(py, _s);
    ModSub256(py, _2Gny);         // py = - p2.y - s*(ret.x-p2.x);  

    Load256(sx, px);
    Load256(sy, py);

  }

  // Update starting point
  __syncthreads();
  Store256A(startx, sx);

you can save time. Then hash both candidate encodings, SHA256(02 || x) and SHA256(03 || x), without caring about the y value.

On my system I got about an 8% performance increase.

Obviously, at the end you have to do a double check to know whether the correct private key for the found address is k or n-k, but only for the address found.
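
To illustrate the idea of hashing both parities, here is a minimal host-side C++ sketch. It is not taken from VanitySearch: the names XCoord, CompressedKey, encode_compressed and compressed_candidates are hypothetical, and it assumes x is available as 32 big-endian bytes. It only builds the two 33-byte encodings that would then be hashed; the hashing itself is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

using XCoord = std::array<uint8_t, 32>;         // x coordinate, big-endian (assumed layout)
using CompressedKey = std::array<uint8_t, 33>;  // prefix byte followed by x

// Build one compressed encoding: 0x02 (even y) or 0x03 (odd y), then x.
CompressedKey encode_compressed(uint8_t prefix, const XCoord& x) {
  assert(prefix == 0x02 || prefix == 0x03);
  CompressedKey out{};
  out[0] = prefix;
  for (std::size_t i = 0; i < x.size(); i++) out[i + 1] = x[i];
  return out;
}

// Hypothetical helper: both candidates for a given x, without knowing y.
std::pair<CompressedKey, CompressedKey> compressed_candidates(const XCoord& x) {
  return { encode_compressed(0x02, x), encode_compressed(0x03, x) };
}
```

Only when one of the two hashes matches the prefix does y need to be computed, to decide whether the matching encoding belongs to k or to n-k.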
663  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 22, 2019, 01:20:43 PM
You may have noticed that I changed the makefile.
Now you should call it like this:

Code:
make gpu=1 ccap=50 all

And also set the right variables:
Code:
CUDA       = /usr/local/cuda-8.0
CXXCUDA    = /usr/bin/g++-4.8

The readme is up-to-date

Yes, I already did it.
664  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 22, 2019, 01:16:39 PM
Unfortunately all wrong!!!

That's strange. Maybe I introduced another bug.
Does it work if you restore the volatile?


No.
665  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 22, 2019, 12:56:49 PM
I removed the volatile again and added "memory" to the clobber list of the inline assembly. This should prevent the compiler from reordering instructions (for pipelining optimization) and thereby losing a carry or picking up an unexpected one.

Please test the source on GitHub and tell me whether you still get the errors.

This is my last idea...

Unfortunately all wrong!!!

Code:
CPU found 1577 items
GPU: point   correct [0/243]
GPU: endo #1 correct [0/251]
GPU: endo #2 correct [0/268]
GPU: sym/point   correct [0/257]
GPU: sym/endo #1 correct [0/256]
GPU: sym/endo #2 correct [0/302]
GPU/CPU check Failed !
666  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 21, 2019, 05:31:18 PM
OK thanks, it works.

On my GTX 645, same performance. Sqr brings a few more spill moves (there are more temp variables than in ModMult).
I haven't tried it yet on the old Quadro 600.
I will see if I can save a few registers.

With Sqr
1>    33280 bytes stack frame, 128 bytes spill stores, 436 bytes spill loads
Without Sqr
1>    33280 bytes stack frame, 120 bytes spill stores, 424 bytes spill loads

You can delete:

Code:
  //uint64_t r0 = 0x0ULL;
  //uint64_t r1 = 0x0ULL;
  //uint64_t r3 = 0x0ULL;
  //uint64_t r4 = 0x0ULL;

  uint64_t r0;
  uint64_t r1;
  uint64_t r3 ;
  uint64_t r4 ;

and delete u0, u1, u2 ,u3, r0, r1, r2, r3
667  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 21, 2019, 05:22:45 PM
Code:
CPU found 1559 items
GPU: point   correct [249/249]
GPU: endo #1 correct [203/281]
GPU: endo #2 correct [220/286]
GPU: sym/point   correct [102/246]
GPU: sym/endo #1 correct [180/248]
GPU: sym/endo #2 correct [179/249]
GPU/CPU check Failed !

Where do you compute endo and sym?
Without "volatile" I always get these errors.
668  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 21, 2019, 05:00:42 PM
From 153 MKeys/s to 160 MKeys/s

using a _ModSqr instead of _ModMult

Thanks, I tried but the -check failed.
I will have a look at it.
I committed the patch with a few of your mods; I also reviewed the main loop a bit.


There were errors. Now it should work:
Code:
__device__ void _ModSqr(uint64_t *rp, const uint64_t *up) {


  uint64_t u0 = up[0];
  uint64_t u1 = up[1];
  uint64_t u2 = up[2];
  uint64_t u3 = up[3];


  uint64_t u10, u11;


  uint64_t r0 = 0x0ULL;
  uint64_t r1 = 0x0ULL;
  uint64_t r3 = 0x0ULL;
  uint64_t r4 = 0x0ULL;

  uint64_t t1 = 0x0ULL;
  uint64_t t2 = 0x0ULL;

  uint64_t s0, s1, s2, s3, s4, s5, s6, s7;
  uint64_t z1, z2, z3, z4, z5, z6, z7, z8;

  //k=0
  UMULLO(s0, u0, u0);
  UMULHI(r1, u0, u0);

  //k=1
  UMULLO(r3, u0, u1);
  UMULHI(r4, u0, u1);
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, 0x0ULL);
  UADD1(t1, 0x0ULL);
  s1 = r3;



  //k=2
  UMULLO(r0, u0, u2);
  UMULHI(r1, u0, u2);
  UADDO1(r0, r0);
  UADDC1(r1, r1);
  UADD1(t2, 0x0ULL);
  UMULLO(u10, u1, u1);
  UMULHI(u11, u1, u1);
  UADDO1(r0, u10);
  UADDC1(r1, u11);
  UADD1(t2, 0x0ULL);
  UADDO1(r0, r4);
  UADDC1(r1, t1);
  UADD1(t2, 0x0ULL);


  s2 = r0;
  t1 = 0;
 
  //k=3
  UMULLO(r3, u0, u3);
  UMULHI(r4, u0, u3);
  UMULLO(u10, u1, u2);
  UMULHI(u11, u1, u2);
  UADDO1(r3, u10);
  UADDC1(r4, u11);
  UADD1(t1, 0x0ULL);
  t1 += t1;
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, t2);
  UADD1(t1, 0x0ULL);

  s3 = r3;
  t2 = 0;

  //k=4

  UMULLO(r0, u1, u3);
  UMULHI(r1, u1, u3);
  UADDO1(r0, r0);
  UADDC1(r1, r1);
  UADD1(t2, 0x0ULL);
  UMULLO(u10, u2, u2);
  UMULHI(u11, u2, u2);
  UADDO1(r0, u10);
  UADDC1(r1, u11);
  UADD1(t2, 0x0ULL);
  UADDO1(r0, r4);
  UADDC1(r1, t1);
  UADD1(t2, 0x0ULL);

  s4 = r0;
  t1 = 0;

  //k=5
  UMULLO(r3, u2, u3);
  UMULHI(r4, u2, u3);
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, t2);
  UADD1(t1, 0x0ULL);

  s5 = r3;



  //k=6
  UMULLO(r0, u3, u3);
  UMULHI(r1, u3, u3);
  UADDO1(r0, r4);
  UADD1(r1, t1);
  s6 = r0;
 
  //k=7
  s7 = r1;
 
  //Reduction
  UMULLO(z3, s5, 0x1000003d1ULL);
  UMULHI(z4, s5, 0x1000003d1ULL);
  UMULLO(z5, s6, 0x1000003d1ULL);
  UMULHI(z6, s6, 0x1000003d1ULL);
  UMULLO(z7, s7, 0x1000003d1ULL);
  UMULHI(z8, s7, 0x1000003d1ULL);
  UMULLO(z1, s4, 0x1000003d1ULL);
  UMULHI(z2, s4, 0x1000003d1ULL);
  UADDO1(z1, s0);
  UADD1(z2, 0x0ULL);


  UADDO1(z2, s1);
  UADDC1(z4, s2);
  UADDC1(z6, s3);
  UADD1(z8, 0x0ULL);

 
  //uint64_t c = 0;


  UADDO1(z3, z2);
  UADDC1(z5, z4);
  UADDC1(z7, z6);
  UADD1(z8, 0x0ULL);
 
  UMULLO(u10, z8, 0x1000003d1ULL);
  UMULHI(u11, z8, 0x1000003d1ULL);
  UADDO1(z1, u10);
  UADDC1(z3, u11);
  UADDC1(z5, 0x0ULL);
  UADD1(z7, 0x0ULL);

  /*
  UADD1(c, 0x0ULL);   
     
  rp[0] = z1;
  rp[1] = z3;
  if(c == 1){

     UADDO1(z5, 0x1ULL);
     UADD1(z7, 0x0ULL);

  }
  */

  rp[0] = z1;
  rp[1] = z3;
  rp[2] = z5;
  rp[3] = z7;
 
 
}
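
As a cross-check for hand-written squaring code like the function above, the following host-side C++ sketch compares a symmetric square against a generic schoolbook product. It is only an illustration and not part of VanitySearch: it uses eight 32-bit limbs instead of four 64-bit limbs for portability, it stops before the modular reduction, and all names (Limbs8, Limbs16, mul_schoolbook, sqr_symmetric, check_square) are hypothetical. The saving comes from computing each cross product a[i]*a[j] (i < j) once and doubling it.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Limbs8 = std::array<uint32_t, 8>;    // 256-bit value, least significant limb first
using Limbs16 = std::array<uint32_t, 16>;  // 512-bit product

// Generic schoolbook product: 64 limb multiplications.
Limbs16 mul_schoolbook(const Limbs8& a, const Limbs8& b) {
  Limbs16 r{};
  for (int i = 0; i < 8; i++) {
    uint64_t carry = 0;
    for (int j = 0; j < 8; j++) {
      uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
      r[i + j] = (uint32_t)t;
      carry = t >> 32;
    }
    r[i + 8] = (uint32_t)carry;
  }
  return r;
}

// Square using symmetry: 28 cross products + 8 diagonal squares.
Limbs16 sqr_symmetric(const Limbs8& a) {
  Limbs16 r{};
  // 1) Sum of the cross products a[i]*a[j] with i < j.
  for (int i = 0; i < 8; i++) {
    uint64_t carry = 0;
    for (int j = i + 1; j < 8; j++) {
      uint64_t t = (uint64_t)a[i] * a[j] + r[i + j] + carry;
      r[i + j] = (uint32_t)t;
      carry = t >> 32;
    }
    r[i + 8] = (uint32_t)carry;
  }
  // 2) Double that sum (shift the whole 512-bit value left by one bit).
  uint32_t top = 0;
  for (int k = 0; k < 16; k++) {
    uint32_t next = r[k] >> 31;
    r[k] = (r[k] << 1) | top;
    top = next;
  }
  // 3) Add the diagonal squares a[i]^2 at limb position 2*i.
  uint64_t carry = 0;
  for (int i = 0; i < 8; i++) {
    uint64_t sq = (uint64_t)a[i] * a[i];
    uint64_t lo = (uint64_t)r[2 * i] + (uint32_t)sq + carry;
    r[2 * i] = (uint32_t)lo;
    uint64_t hi = (uint64_t)r[2 * i + 1] + (sq >> 32) + (lo >> 32);
    r[2 * i + 1] = (uint32_t)hi;
    carry = hi >> 32;
  }
  return r;
}

// Self-check: the symmetric square must equal the generic product a*a.
void check_square(const Limbs8& a) {
  assert(sqr_symmetric(a) == mul_schoolbook(a, a));
}
```

With four 64-bit limbs the same idea needs 10 products (4 squares and 6 cross products) instead of 16, which matches the UMULLO/UMULHI pairs in the k=0..6 columns above.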
669  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 21, 2019, 03:20:19 PM
From 153 MKeys/s to 160 MKeys/s

using a _ModSqr instead of _ModMult

in GPUCompute.h, __device__ void ComputeKeys
Code:
      //_ModMult(_p2, _s, _s);        // _p = pow2(s)
      _ModSqr(_p2, _s);


      //_ModMult(py, _s);            
      _ModSqr(py, _s);


in GPUEngine.cu:
Code:
__device__ void _ModSqr(uint64_t *rp, const uint64_t *up) {


  uint64_t u0 = up[0];
  uint64_t u1 = up[1];
  uint64_t u2 = up[2];
  uint64_t u3 = up[3];


  uint64_t u10, u11;


  uint64_t r0 = 0;
  uint64_t r1 = 0;
  uint64_t r3 = 0;
  uint64_t r4 = 0;

  uint64_t t1 = 0;
  uint64_t t2 = 0;


  uint64_t s0, s1, s2, s3, s4, s5, s6, s7;
  uint64_t z1, z2, z3, z4, z5, z6, z7, z8;

  z1 = z2 = 0;

  //k=0
  UMULLO(s0, u0, u0);
  UMULHI(r1, u0, u0);

  //k=1
  UMULLO(r3, u0, u1);
  UMULHI(r4, u0, u1);
  UADDO1(r4, r4);
  UADDC1(u0, u0);
  UADDC1(r4, u1);
  UADDC1(u0, 0x0ULL);
  UADD1(r3, 0x0ULL);
  s1 = r3;



  //k=2
  UMULLO(r0, u0, u2);
  UMULHI(r1, u0, u2);
  UADDO1(r0, r0);
  UADDC1(r1, r1);
  UADD1(t2, 0x0ULL);
  UMULLO(u10, u1, u1);
  UMULHI(u11, u1, u1);
  UADDO1(r0, u10);
  UADDC1(r1, u11);
  UADD1(t2, 0x0ULL);
  UADDO1(r0, r4);
  UADDC1(r1, t1);
  UADD1(t2, 0x0ULL);


  s2 = r0;
  t1 = 0;
  
  //k=3
  UMULLO(r3, u0, u3);
  UMULHI(r4, u0, u3);
  UMULLO(u10, u1, u2);
  UMULHI(u11, u1, u2);
  UADDO1(r3, u10);
  UADDC1(r4, u11);
  UADD1(t1, 0x0ULL);
  t1 += t1;
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, t2);
  UADD1(t1, 0x0ULL);

  s3 = r3;
  t2 = 0;

  //k=4

  UMULLO(r0, u1, u3);
  UMULHI(r1, u1, u3);
  UADDO1(r0, r0);
  UADDC1(r1, r1);
  UADD1(t2, 0x0ULL);
  UMULLO(u10, u2, u2);
  UMULHI(u11, u2, u2);
  UADDO1(r0, u10);
  UADDC1(r1, u11);
  UADD1(t2, 0x0ULL);
  UADDO1(r0, r4);
  UADDC1(r1, t1);
  UADD1(t2, 0x0ULL);

  s4 = r0;
  t1 = 0;

  //k=5
  UMULLO(r3, u2, u3);
  UMULHI(r4, u2, u3);
  UADDO1(r3, r3);
  UADDC1(r4, r4);
  UADD1(t1, 0x0ULL);
  UADDO1(r3, r1);
  UADDC1(r4, t2);
  UADD1(t1, 0x0ULL);

  s5 = r3;



  //k=6
  UMULLO(r0, u3, u3);
  UMULHI(r1, u3, u3);
  UADDO1(r0, r4);
  UADD1(r1, t1);
  s6 = r0;
  
  //k=7
  s7 = r1;
  
  //Reduction
  UMULLO(z3, s5, 0x1000003d1);
  UMULHI(z4, s5, 0x1000003d1);
  UMULLO(z5, s6, 0x1000003d1);
  UMULHI(z6, s6, 0x1000003d1);
  UMULLO(z7, s7, 0x1000003d1);
  UMULHI(z8, s7, 0x1000003d1);
  UMULLO(z1, s4, 0x1000003d1ULL);
  UMULHI(z2, s4, 0x1000003d1ULL);
  UADDO1(z1, s0);
  UADD1(z2, 0x0ULL);


  UADDO1(z2, s1);
  UADDC1(z4, s2);
  UADDC1(z6, s3);
  UADD1(z8, 0x0ULL);

  
  uint64_t c = 0;


  UADDO1(z3, z2);
  UADDC1(z5, z4);
  UADDC1(z7, z6);
  UADD1(z8, 0x0ULL);
 
  UMULLO(u10, z8, 0x1000003d1ULL);
  UMULHI(u11, z8, 0x1000003d1ULL);
  UADDO1(z1, u10);
  UADDC1(z3, u11);
  UADD1(c, 0x0ULL);  
      
  rp[0] = z1;
  rp[1] = z3;
  if(c == 1){

     UADDC1(z5, 0x1ULL);
     UADD1(z7, 0x0ULL);

  }

  rp[2] = z5;
  rp[3] = z7;
  
 
}
670  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 21, 2019, 03:01:56 PM
Arg...
Could you try this (for the 2 ModMult)?
With these mods, all instructions of the ModMult will be volatile and, theoretically, cannot be moved or removed by the compiler.

Code:
#define SET0(a) asm volatile ("mov.u64 %0,0;" : "=l"(a))

// ---------------------------------------------------------------------------------------
// Compute a*b*(mod n)
// a and b must be lower than n
// ---------------------------------------------------------------------------------------

__device__ void _ModMult(uint64_t *r, uint64_t *a, uint64_t *b) {

  uint64_t r512[8];
  uint64_t t[NBBLOCK];
  uint64_t ah,al;

  SET0(r512[5]);
  SET0(r512[6]);
  SET0(r512[7]);

  // 256*256 multiplier


Still errors.
671  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 21, 2019, 12:36:55 PM
Hello,

@arulbero

Could you try this file:
http://zelda38.free.fr/VanitySearch/GPUEngine.cu

I unrolled the UMult macro; maybe nvcc performs a wrong optimization because of it.
The volatile causes a 10% performance loss on my Windows machine, a bit less on my Linux one.

Code:
// Reduce from 512 to 320 
-  UMult(t,(r512 + 4), 0x1000003D1ULL);
+  UMULLO(t[0],r512[4],0x1000003D1ULL);
+  UMULLO(t[1],r512[5],0x1000003D1ULL);
+  MADDO(t[1], r512[4],0x1000003D1ULL,t[1]);
+  UMULLO(t[2],r512[6],0x1000003D1ULL);
+  MADDC(t[2],r512[5],0x1000003D1ULL, t[2]);
+  UMULLO(t[3],r512[7],0x1000003D1ULL);
+  MADDC(t[3],r512[6],0x1000003D1ULL, t[3]);
+  MADD(t[4],r512[7],0x1000003D1ULL, 0ULL);


No, still errors!
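
For reference when testing the "Reduce from 512 to 320" step discussed in this post: it relies on the fact that 2^256 is congruent to 0x1000003D1 modulo the secp256k1 field prime p = 2^256 - 2^32 - 977. The sketch below is a plain host-side C++ model, not the GPU code: the names K, mul64 and fold_512_to_320 are hypothetical, and unlike the macro version it adds the low 256 bits in the same pass instead of in a separate step.

```cpp
#include <cassert>
#include <cstdint>

// 2^256 mod p for the secp256k1 field prime p = 2^256 - 2^32 - 977.
static const uint64_t K = 0x1000003D1ULL;

// Portable 64x64 -> 128 bit multiply (plays the role of UMULLO/UMULHI).
static void mul64(uint64_t a, uint64_t b, uint64_t& hi, uint64_t& lo) {
  uint64_t a0 = a & 0xFFFFFFFFULL, a1 = a >> 32;
  uint64_t b0 = b & 0xFFFFFFFFULL, b1 = b >> 32;
  uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
  uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFULL) + (p10 & 0xFFFFFFFFULL);
  lo = (p00 & 0xFFFFFFFFULL) | (mid << 32);
  hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

// Fold a 512-bit value (8 limbs, least significant first) into 320 bits:
// t = low256 + high256 * K, which is congruent to the input modulo p.
void fold_512_to_320(uint64_t t[5], const uint64_t r512[8]) {
  uint64_t carry = 0;
  for (int i = 0; i < 4; i++) {
    uint64_t hi, lo;
    mul64(r512[4 + i], K, hi, lo);
    assert(hi < (1ULL << 33));     // K < 2^33, so the high word stays small
    uint64_t s = r512[i] + lo;     // add the low product word
    uint64_t c1 = s < lo;          // carry out of that addition
    uint64_t s2 = s + carry;       // add the carry from the previous limb
    uint64_t c2 = s2 < carry;
    t[i] = s2;
    carry = hi + c1 + c2;          // cannot overflow because hi < 2^33
  }
  t[4] = carry;                    // fifth limb: what "Reduce from 320 to 256" consumes
}
```

A second, much cheaper fold of t[4] * K into the low limbs then brings the value back to 256 bits, as in the "Reduce from 320 to 256" snippet quoted elsewhere in this thread.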
672  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 04:05:48 PM
Another try:

GPU/GPUEngine.cu: 465
and
GPU/GPUEngine.cu: 514

Code:
   volatile uint64_t r512[8];

volatile prevents the compiler from optimizing accesses to the variable and from removing code that is actually used.
I had a problem like this with gcc 6 on the CPU release.

It works!!! A little slower, but it is correct now!
673  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 03:24:16 PM
I compiled a CUDA 8 binary, if you want to check whether you get the same behavior.
http://zelda38.free.fr/VanitySearch/1.9/VanitySearch50_cuda8

On my install with SDK 8, it uses 135 registers and 0 spill move.
With SDK 10, only 120 registers and also 0 spill move.


Still errors:

Code:
~/VanitySearch50_cuda8$ ./VanitySearch50_cuda8 -check -g 1
GetBase10() Results OK
Add() Results OK : 333.333 MegaAdd/sec
Mult() Results OK : 29.674 MegaMult/sec
Div() Results OK : 5.556 MegaDiv/sec
ModInv()/ModExp() Results OK
ModInv() Results OK : 341.867 KiloInv/sec
IntGroup.ModInv() Results OK : 7.327 MegaInv/sec
ModMulK1() Results OK : 11.682 MegaMult/sec
ModMulK1order() Results OK : 6.460 MegaMult/sec
ModSqrt() Results OK !
Check Generator :OK
Check Double :OK
Check Add :OK
Check GenKey :OK
Adress : 15t3Nt1zyMETkHbjJTTshxLnqPzQvAtdCe OK!
Adress : 1BoatSLRHtKNngkdXEeobR76b53LETtpyT OK!
Adress : 1JeanLucgidKHxfY5gkqGmoVjo1yaU4EDt OK(comp)!
Adress : 1Test6BNjSJC5qwYXsjwKVLvz7DpfLehy OK!
Adress : 1BitcoinP7vnLpsUHWbzDALyJKnNo16Qms OK(comp)!
Check Calc PubKey (full) 1ViViGLEawN27xRzGrEhhYPQrZiTKvKLo :OK
Check Calc PubKey (even) 1Gp7rQ4GdooysEAEJAS2o4Ktjvf1tZCihp:OK
Check Calc PubKey (odd) 18aPiLmTow7Xgu96msrDYvSSWweCvB9oBA:OK
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(64x128)
Seed: 596970
123.502 MegaKey/sec
ComputeKeys() found 1594 items , CPU check...
Expected item not found 3412910a c97422a4 6f11601a 8c75dbba a494e3c4 (thread=87, incr=-540, endo=0)
Expected item not found 34124e60 837e83bf aba37043 d981e8a7 3ba919f9 (thread=99, incr=-257, endo=0)
Expected item not found 34124b15 09d084f5 c09be79e b9e74233 a5d04c9a (thread=133, incr=184, endo=2)
Expected item not found fefed61a e1a5ee3e d71f81fa 7ed01482 1df88b0f (thread=149, incr=850, endo=2)
Expected item not found fefeb4ca 86752243 387f97b1 1ec5fc4f ab2e23cd (thread=204, incr=682, endo=1)
Expected item not found 3412af0c e80a5462 96280598 760e3541 3c0c7c79 (thread=207, incr=-470, endo=0)
Expected item not found 34122971 0483c8a0 0f392737 ffd3e8aa 20f36367 (thread=234, incr=-91, endo=2)
Expected item not found 3412b84c 7dd3e53f e5c00f67 d44fac8f 594dc830 (thread=249, incr=-547, endo=1)
Expected item not found 34127635 e84de0de f0b9672f ef7f52eb 853b6579 (thread=278, incr=-153, endo=0)
Expected item not found 3412e146 03eaa33c 3e4e3cfc 32448e75 87ddbc8c (thread=300, incr=-648, endo=0)
Expected item not found fefe49af b082f946 430aa009 d722e7b9 85848f2e (thread=309, incr=576, endo=2)
Expected item not found fefe67ad c0e86d66 4c92c703 e853c833 ee684ddc (thread=350, incr=865, endo=1)
Expected item not found 341293f0 85b21f8d 2c97f992 b66f8417 d5762b62 (thread=357, incr=-283, endo=0)
Expected item not found 34126be8 99868951 6f0abbbc 45b5acb9 7a8b8978 (thread=357, incr=-950, endo=1)
Expected item not found fefe4071 da662ebc 6e1132df 9fc940aa 4c73f6b4 (thread=414, incr=277, endo=1)
Expected item not found 3412be76 2b3f96d1 3c1f70fd 19e54210 8bb78a9a (thread=422, incr=-773, endo=1)
Expected item not found fefe1392 83313cc8 622f7b04 8f1acfcc a6973c04 (thread=441, incr=508, endo=2)
Expected item not found fefe356e dd82a5cc ad8f25d7 7e048d04 6cb9668d (thread=474, incr=-461, endo=1)
Expected item not found 34123606 dbee7d71 ff8fa64a 189afb61 71eede71 (thread=486, incr=-534, endo=0)
Expected item not found fefe7242 ab68602b f635577a 9f44ea15 2c7f99ca (thread=504, incr=439, endo=1)
Expected item not found 341210cd d27ced94 b10cda99 0cb8eef3 25bccc2e (thread=524, incr=-929, endo=2)
Expected item not found 3412b95e a84c3c11 04a60e99 2b662810 ce5bb025 (thread=530, incr=-507, endo=2)
Expected item not found fefec926 3c641602 28123d8a ef66b036 2d6d5298 (thread=564, incr=-581, endo=0)
Expected item not found 34124dfe f8227df3 39cc2aac 5fa89e87 1d48a18b (thread=578, incr=-690, endo=0)
Expected item not found fefea0bd 871357d4 6711cb08 415cb045 13054cd4 (thread=620, incr=-1012, endo=1)
Expected item not found fefe81a3 8ac675ce 43d1af2f 4032ffdd 1b9e2c41 (thread=622, incr=720, endo=1)
Expected item not found fefeee16 10039563 1325c5a1 7e4008e0 dfeb643b (thread=626, incr=-815, endo=2)
Expected item not found fefe3f11 1d5af4c0 02531103 27245668 e16e18bb (thread=631, incr=-224, endo=1)
Expected item not found fefe0722 e8c35df1 59dedc91 75c0b34c 53e207d0 (thread=720, incr=610, endo=1)
Expected item not found 341205e3 8ae3fe31 8bb77fe3 d6770770 4fbb5142 (thread=737, incr=-585, endo=0)
Expected item not found 3412a4dd 15b0f82a 37b8f95b a13d6403 40a179d9 (thread=745, incr=348, endo=1)
Expected item not found 3412e545 6a30b568 10894417 65d1c745 f0b36472 (thread=752, incr=-299, endo=0)
Expected item not found 3412c1b2 fb6e7210 acd4429c 00f57161 f02c555c (thread=780, incr=312, endo=2)
.....
CPU found 1548 items
GPU: point   correct [238/238]
GPU: endo #1 correct [213/273]
GPU: endo #2 correct [202/271]
GPU: sym/point   correct [108/226]
GPU: sym/endo #1 correct [207/277]
GPU: sym/endo #2 correct [202/263]
GPU/CPU check Failed !
674  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 02:24:29 PM

I tried your function on my Linux config but it does not bring a significant performance increase.
This is mainly because adding temporary variables adds more spill moves, which are slower; sometimes it is better to recompute.
On your hardware you have many more registers available, so the performance increase should be more significant.

A tip: maybe you can try to play with the max register count in the makefile; for compute cap 5.0, nvcc (CUDA 10) uses 120 registers.
The random problem you have may also be due to wrong register sharing between threads, which could explain the strange and random behavior. Reducing the number of registers used by inlining also reduces the probability that this happens.
It might be an explanation...

With "-maxrregcount=50" I got 188 MKeys/s speed (but there are still errors).
675  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 11:57:57 AM
Many thanks for the tips.
I will try this.

Don't you want to try the binary? The libcudart.so.10.0 is also available from the given link. You do not need to set up the CUDA SDK 10 (unless a driver problem appears, but this may work without installing anything).
You can just copy VanitySearch50 and libcudart.so.10.0 into a directory and set LD_LIBRARY_PATH.
Code:
export LD_LIBRARY_PATH=.
./VanitySearch50 ...

This is mainly to see if the problem is solved with CUDA 10 or if it comes from elsewhere.


Already tried with "LD_LIBRARY_PATH"; the problem is the driver. I have Ubuntu 17.04 and I cannot install a new driver on it.
676  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 11:47:42 AM
Another sub function, if you want to test it:


Code:
__device__ void ModSub256(uint64_t *rp, uint64_t *ap, uint64_t *bp) {

 
  uint64_t a0, a1, a2, a3, b0, b1, b2, b3, r0, r1, r2, r3;
  int8_t c0, c1, c2, c3;


  a0 = ap[0];
  a1 = ap[1];
  a2 = ap[2];
  a3 = ap[3];

  b0 = bp[0];
  b1 = bp[1];
  b2 = bp[2];
  b3 = bp[3];
 
  /*
  r0 = a0 - b0;
  c0 = (a0 < b0) ? 1 : -1;
  c0 = (r0 == 0) ? 0 : c0;
 
  r1 = a1 - b1;
  c1 = (a1 < b1) ? 1 : -1;
  c1 = (r1 == 0) ? c0 : c1;
  r1 = r1 - (c0 == 1);
  
  r2 = a2 - b2;
  c2 = (a2 < b2) ? 1 : -1;
  c2 = (r2 == 0) ? c1 : c2;
  r2 = r2 - (c1 == 1);

  r3 = a3 - b3;
  c3 = (a3 < b3) ? 1 : -1;
  c3 = (r3 == 0) ? c2 : c3;
  r3 = r3 - (c2 == 1);
  */


  
  c0 = a0 < b0;
  r0 = a0 - b0;
  
  c1 = a1 < b1;
  r1 = a1 - b1;
  if(r1 == 0){ c1 = c0;}
  if(c0) {r1 = r1 - 1;}
  

  c2 = a2 < b2;
  r2 = a2 - b2;
  if(r2 == 0){ c2 = c1;}
  if(c1) {r2 = r2 - 1;}

  c3 = a3 < b3;
  r3 = a3 - b3;
  if(r3 == 0){ c3 = c2;}
  if(c2) {r3 = r3 - 1;}

  
  if(c3 == 1){

    if(r0 > 0x1000003d0){  //almost always --> no borrow

      r0 = r0 - 0x1000003d1;

    }
    else{

      //c[0] = (r0 < 0x1000003d1) ? 1 : -1;
      //c0 = (r0 == 0x1000003d1) ? 0 : 1;
      //c0 = 1; // for sure r0 < 0x1000003d1

      r0 = r0 - 0x1000003d1;
      r1 = r1 - 1;  //c0 is 1

      c1 = (r1 == 0xffffffffffffffff) ? 1 : -1;
      c2 = (r2 == 0) ? c1 : -1;

      if(c1 == 1) r2 = r2 - 1;
      if(c2 == 1) r3 = r3 - 1;

    }
  }
  
  
  
  rp[0] = r0;
  rp[1] = r1;
  rp[2] = r2;
  rp[3] = r3;


  return;
 
}
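
A hand-optimized function like the one above is easy to get subtly wrong, so a slow but obvious reference helps when comparing results on the CPU. The sketch below is such a reference in plain C++, under the assumption that both inputs are already reduced (a < p and b < p). The names P, sub256, add256 and mod_sub_ref are hypothetical and not part of VanitySearch.

```cpp
#include <cassert>
#include <cstdint>

// secp256k1 field prime p, least significant limb first.
static const uint64_t P[4] = {0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
                              0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL};

// r = a - b over 256 bits; returns the final borrow (1 if a < b).
uint64_t sub256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
  uint64_t borrow = 0;
  for (int i = 0; i < 4; i++) {
    uint64_t d = a[i] - b[i];
    uint64_t b1 = a[i] < b[i];   // borrow from the limb subtraction
    uint64_t b2 = d < borrow;    // borrow from subtracting the incoming borrow
    r[i] = d - borrow;
    borrow = b1 | b2;
  }
  return borrow;
}

// r = a + b over 256 bits; returns the final carry.
uint64_t add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
  uint64_t carry = 0;
  for (int i = 0; i < 4; i++) {
    uint64_t s = a[i] + b[i];
    uint64_t c1 = s < b[i];      // carry from the limb addition
    uint64_t s2 = s + carry;
    uint64_t c2 = s2 < s;        // carry from adding the incoming carry
    r[i] = s2;
    carry = c1 | c2;
  }
  return carry;
}

// r = (a - b) mod p, assuming a < p and b < p.
void mod_sub_ref(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
  if (sub256(r, a, b)) {
    // The result wrapped around 2^256; adding p back wraps it once more.
    uint64_t carry = add256(r, r, P);
    assert(carry == 1);
    (void)carry;
  }
}
```

Running both versions on random reduced inputs and comparing all four limbs is a quick way to validate the branchy variant before trying it on the GPU.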


677  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 11:36:29 AM
New version is slower on my pc (132 MKeys/s against 162 MKeys/s).

On my Windows machine, performance is the same as the previous release (CUDA 10).
Slightly slower on Linux (CUDA 8.0), from 39.5MK/s to 37.9MK/s.

Anyway,
Do you compile or do you use the Linux binaries?
Did you solve your problem? I haven't managed to reproduce the issue yet.


I compile the source myself. No, my problem is not solved. I have only Cuda 8.0.


Some ideas for (maybe) a little speed improvement:


1) in __device__ void ComputeKeys (GPUCompute.h) instead of doing HSIZE times

Code:
ModNeg256(dy,Gy[i]);  <--
ModSub256(dy, py);

you could do:

Code:
ModSub256(dy, pyn, Gy[i]);

and you compute only once pyn:

Code:
ModNeg256(pyn,py);

2) instead of

Code:
ModAdd256(py, Gy[i]);

Code:
ModSub256(py, sy);

To sum up:

Code:
ModSub256(dy, pyn, Gy[i]);

_ModMult(_s, dy, dx[i]);      //  s = (p2.y-p1.y)*inverse(p2.x-p1.x)
 _ModMult(_p2, _s, _s);        // _p = pow2(s)

ModSub256(px, _p2, px);
ModSub256(px, Gx[i]);         // px = pow2(s) - p1.x - p2.x;

ModSub256(py, sx, px);
 _ModMult(py, _s);             // py = - s*(ret.x-p2.x)
 ModSub256(py, sy);         // py = - p2.y - s*(ret.x-p2.x);  


3) in __device__ void ModSub256 instead of

Code:
     if ((int64_t)t < 0) {
    UADDO1(r[0], _P[0]);
    UADDC1(r[1], _P[1]);
    UADDC1(r[2], _P[2]);
    UADD1(r[3], _P[3]);
  }

it would be better something like that:

Code:
  if ((int64_t)t < 0) {
    USUBO1(r[0], 0x01000003d1);
    USUBC1(r[1], 0ULL);
    USUBC1(r[2], 0ULL);
    USUBC1(r[3], 0ULL);
  }

(I'm not sure what C means; I suppose it means "with carry".)
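
On the naming question: in PTX, add.cc / sub.cc write the carry flag, addc / subc read it, and addc.cc / subc.cc do both. Assuming the VanitySearch macros follow that scheme (O = carry out only, C = carry in and out, no suffix = carry in only), the last line of a chain would normally use the plain variant. The reason idea 3) works at all is that p = 2^256 - 0x1000003D1, so adding p modulo 2^256 gives the same four limbs as subtracting 0x1000003D1. The host-side C++ sketch below checks that equivalence; the names K, P, add_p, sub_k and check_equivalent are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

static const uint64_t K = 0x1000003D1ULL;  // 2^256 - p
static const uint64_t P[4] = {0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
                              0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL};

// Variant 1: r += p (mod 2^256), full four-limb carry chain.
void add_p(uint64_t r[4]) {
  uint64_t carry = 0;
  for (int i = 0; i < 4; i++) {
    uint64_t s = r[i] + P[i];
    uint64_t c1 = s < P[i];
    uint64_t s2 = s + carry;
    uint64_t c2 = s2 < s;
    r[i] = s2;
    carry = c1 | c2;
  }
  // The final carry is dropped on purpose: arithmetic is modulo 2^256.
}

// Variant 2: r -= K (mod 2^256); the borrow usually stops at limb 0.
void sub_k(uint64_t r[4]) {
  uint64_t borrow = r[0] < K;
  r[0] -= K;
  for (int i = 1; i < 4 && borrow; i++) {
    borrow = (r[i] == 0);  // the borrow continues only through zero limbs
    r[i] -= 1;
  }
}

// Both variants must agree for any 256-bit input.
void check_equivalent(const uint64_t v[4]) {
  uint64_t x[4], y[4];
  for (int i = 0; i < 4; i++) { x[i] = v[i]; y[i] = v[i]; }
  add_p(x);
  sub_k(y);
  for (int i = 0; i < 4; i++) assert(x[i] == y[i]);
}
```

On the GPU the subtraction form mainly saves loading the three all-ones limbs of p; whether that is measurable would need to be benchmarked.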
678  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 20, 2019, 09:40:53 AM
A new release of VanitySearch (1.9) is out:

Code:
Added -b option (Search compressed or uncompressed addresses)
Improved performance for loading large prefix list
Fixed difficulty calculation bug for prefix containing only '1'


New version is slower on my pc (132 MKeys/s against 162 MKeys/s).
679  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 17, 2019, 05:32:56 PM
Ok Thanks, could you try to run cuda-memcheck on the release version.



Code:
~/VanitySearch-1.8$ /usr/local/cuda-8.0/bin/cuda-memcheck --tool memcheck VanitySearch -g 1 -check
========= CUDA-MEMCHECK
GetBase10() Results OK
Add() Results OK : 123.457 MegaAdd/sec
Mult() Results OK : 23.148 MegaMult/sec
Div() Results OK : 5.208 MegaDiv/sec
ModInv()/ModExp() Results OK
ModInv() : 341.317 KiloInv/sec
IntGroup.ModInv() : 9.130 MegaInv/sec
ModMulK1() : 12.968 MegaMult/sec
ModSqrt() OK !
Check Generator :OK
Check Double :OK
Check Add :OK
Check GenKey :OK
Adress : 15t3Nt1zyMETkHbjJTTshxLnqPzQvAtdCe OK!
Adress : 1BoatSLRHtKNngkdXEeobR76b53LETtpyT OK!
Adress : 1JeanLucgidKHxfY5gkqGmoVjo1yaU4EDt OK(comp)!
Adress : 1Test6BNjSJC5qwYXsjwKVLvz7DpfLehy OK!
Adress : 1BitcoinP7vnLpsUHWbzDALyJKnNo16Qms OK(comp)!
Check Calc PubKey (full) 1ViViGLEawN27xRzGrEhhYPQrZiTKvKLo :OK
Check Calc PubKey (even) 1Gp7rQ4GdooysEAEJAS2o4Ktjvf1tZCihp:OK
Check Calc PubKey (odd) 18aPiLmTow7Xgu96msrDYvSSWweCvB9oBA:OK
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(1x128)
Seed: 223215
95.697 KiloKey/sec
ComputeKeys() found 26 items , CPU check...
Expected item not found 3412bb65 cb39a716 67dcd486 209b19df c65e364c
Expected item not found fefea644 d535267a 46308e46 c579e91b 0aad3ee2
Expected item not found 3412726b 9830f325 9c5f0d95 a99e2a9b 6c473922
Expected item not found 341292e1 b4a39d2c 59e34f3d 38725b42 dfc2e801
Expected item not found fefeba57 c1209e3d 1b79200c b9529018 de0e35e4
Expected item not found fefe4aaa 34f02402 4ed76c83 a1d60efc 8c79f7a6
Expected item not found fefe8742 63e9b7bc b13a08f1 28229fd8 30987ed3
CPU found 22 items
========= ERROR SUMMARY: 0 errors
680  Bitcoin / Development & Technical Discussion / Re: VanitySearch (Yet another address prefix finder) on: March 17, 2019, 04:09:58 PM
I committed a new Makefile with debug option.

Code:
make clean
make gpu=1 debug=1 all

In debug mode no inlining is done.

But, obviously it is much slower.
So launch

Code:
pons@linpons:~/VanitySearch$ ./VanitySearch -g 1 -check


Code:
./VanitySearch -g 1 -check
GetBase10() Results OK
Add() Results OK : 108.696 MegaAdd/sec
Mult() Results OK : 10.684 MegaMult/sec
Div() Results OK : 1.656 MegaDiv/sec
ModInv()/ModExp() Results OK
ModInv() Results OK : 132.041 KiloInv/sec
IntGroup.ModInv() Results OK : 2.222 MegaInv/sec
ModMulK1() Results OK : 3.661 MegaMult/sec
ModMulK1order() Results OK : 1.700 MegaMult/sec
ModSqrt() Results OK !
Check Generator :OK
Check Double :OK
Check Add :OK
Check GenKey :OK
Adress : 15t3Nt1zyMETkHbjJTTshxLnqPzQvAtdCe OK!
Adress : 1BoatSLRHtKNngkdXEeobR76b53LETtpyT OK!
Adress : 1JeanLucgidKHxfY5gkqGmoVjo1yaU4EDt OK(comp)!
Adress : 1Test6BNjSJC5qwYXsjwKVLvz7DpfLehy OK!
Adress : 1BitcoinP7vnLpsUHWbzDALyJKnNo16Qms OK(comp)!
Check Calc PubKey (full) 1ViViGLEawN27xRzGrEhhYPQrZiTKvKLo :OK
Check Calc PubKey (even) 1Gp7rQ4GdooysEAEJAS2o4Ktjvf1tZCihp:OK
Check Calc PubKey (odd) 18aPiLmTow7Xgu96msrDYvSSWweCvB9oBA:OK
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(1x128)
Seed: 888394
193.110 KiloKey/sec
ComputeKeys() found 26 items , CPU check...
GPU/CPU check OK


Code:
~/VanitySearch$ /usr/local/cuda-8.0/bin/cuda-memcheck --tool memcheck VanitySearch -g 1 -check
========= CUDA-MEMCHECK
GetBase10() Results OK
Add() Results OK : 109.890 MegaAdd/sec
Mult() Results OK : 10.695 MegaMult/sec
Div() Results OK : 1.818 MegaDiv/sec
ModInv()/ModExp() Results OK
ModInv() Results OK : 130.572 KiloInv/sec
IntGroup.ModInv() Results OK : 2.182 MegaInv/sec
ModMulK1() Results OK : 3.602 MegaMult/sec
ModMulK1order() Results OK : 1.684 MegaMult/sec
ModSqrt() Results OK !
Check Generator :OK
Check Double :OK
Check Add :OK
Check GenKey :OK
Adress : 15t3Nt1zyMETkHbjJTTshxLnqPzQvAtdCe OK!
Adress : 1BoatSLRHtKNngkdXEeobR76b53LETtpyT OK!
Adress : 1JeanLucgidKHxfY5gkqGmoVjo1yaU4EDt OK(comp)!
Adress : 1Test6BNjSJC5qwYXsjwKVLvz7DpfLehy OK!
Adress : 1BitcoinP7vnLpsUHWbzDALyJKnNo16Qms OK(comp)!
Check Calc PubKey (full) 1ViViGLEawN27xRzGrEhhYPQrZiTKvKLo :OK
Check Calc PubKey (even) 1Gp7rQ4GdooysEAEJAS2o4Ktjvf1tZCihp:OK
Check Calc PubKey (odd) 18aPiLmTow7Xgu96msrDYvSSWweCvB9oBA:OK
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(1x128)
Seed: 781110
15.061 KiloKey/sec
ComputeKeys() found 26 items , CPU check...
GPU/CPU check OK
========= ERROR SUMMARY: 0 errors

Code:
~/VanitySearch$ /usr/local/cuda-8.0/bin/cuda-memcheck --tool memcheck VanitySearch -g 32 -check
========= CUDA-MEMCHECK
GetBase10() Results OK
Add() Results OK : 80.000 MegaAdd/sec
Mult() Results OK : 10.030 MegaMult/sec
Div() Results OK : 1.883 MegaDiv/sec
ModInv()/ModExp() Results OK
ModInv() Results OK : 130.924 KiloInv/sec
IntGroup.ModInv() Results OK : 2.221 MegaInv/sec
ModMulK1() Results OK : 3.659 MegaMult/sec
ModMulK1order() Results OK : 1.704 MegaMult/sec
ModSqrt() Results OK !
Check Generator :OK
Check Double :OK
Check Add :OK
Check GenKey :OK
Adress : 15t3Nt1zyMETkHbjJTTshxLnqPzQvAtdCe OK!
Adress : 1BoatSLRHtKNngkdXEeobR76b53LETtpyT OK!
Adress : 1JeanLucgidKHxfY5gkqGmoVjo1yaU4EDt OK(comp)!
Adress : 1Test6BNjSJC5qwYXsjwKVLvz7DpfLehy OK!
Adress : 1BitcoinP7vnLpsUHWbzDALyJKnNo16Qms OK(comp)!
Check Calc PubKey (full) 1ViViGLEawN27xRzGrEhhYPQrZiTKvKLo :OK
Check Calc PubKey (even) 1Gp7rQ4GdooysEAEJAS2o4Ktjvf1tZCihp:OK
Check Calc PubKey (odd) 18aPiLmTow7Xgu96msrDYvSSWweCvB9oBA:OK
GPU: GPU #0 Quadro M2200 (8x128 cores) Grid(32x128)
Seed: 639838
59.308 KiloKey/sec
ComputeKeys() found 721 items , CPU check...
GPU/CPU check OK
========= ERROR SUMMARY: 0 errors