It could be due to a wrong optimization concerning a carry somewhere, which would explain why it only works from time to time. I had a similar problem with the CPU release when I compiled with gcc 6; gcc 7 and Visual C++ worked flawlessly. The patch (a volatile as well) is at IntMod.cpp:859 and IntMod.cpp:915.
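For illustration only (this is not the actual IntMod.cpp patch): the kind of carry-propagation helper where marking the intermediate sum volatile stops an over-aggressive optimizer from reordering or folding away the carry computation.

#include <stdint.h>

// Hypothetical add-with-carry step, 64-bit limbs, carry is 0 or 1.
static uint64_t AddCarry(uint64_t a, uint64_t b, uint64_t *carry) {
  // volatile forces the intermediate sum to be materialized, so the
  // compiler cannot fold it into the carry test below.
  volatile uint64_t s = a + b + *carry;
  *carry = (s < a) || (s == a && *carry);   // carry-out of the 64-bit addition
  return s;
}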
|
|
|
You can also try VanitySearch -u -check. It will perform the check using uncompressed addresses and therefore use the CheckHashUncomp() function, which is similar except that it calls GetHash160() instead of GetHash160Comp().
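For readers who don't know the two paths, a minimal host-side sketch of the difference between the compressed and uncompressed public-key serializations that HASH160 (RIPEMD160 of SHA256) is applied to; the GPU kernels do the equivalent directly on 64-bit limbs in registers, so this is only an illustration.

#include <stdint.h>
#include <string.h>

// Serialize a public key (X,Y given as 32-byte big-endian coordinates)
// the way the two check paths expect it before hashing.
// Returns the serialized length: 33 bytes (compressed) or 65 (uncompressed).
static int SerializePubKey(const uint8_t x[32], const uint8_t y[32],
                           int compressed, uint8_t out[65]) {
  if (compressed) {
    out[0] = (y[31] & 1) ? 0x03 : 0x02;   // prefix encodes the parity of Y
    memcpy(out + 1, x, 32);
    return 33;                            // hashed by GetHash160Comp()
  }
  out[0] = 0x04;                          // uncompressed marker
  memcpy(out + 1, x, 32);
  memcpy(out + 33, y, 32);
  return 65;                              // hashed by GetHash160()
}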
|
|
|
endo and sym are computed in CheckHashComp() in GPUCompute.h. I quoted my last post and added a few comments. The point (px,py) is always OK, so no errors before CHECK_POINT(h, incr, 0); the errors randomly appear after this line. It seems that nvcc generates (in your case) wrong code.

__device__ __noinline__ void CheckHashComp(prefix_t *prefix, uint64_t *px, uint64_t *py,
                                           int32_t incr, uint32_t tid, uint32_t *lookup32, uint32_t *out) {

  uint32_t h[20];
  uint64_t pe1x[4];
  uint64_t pe2x[4];

  // Point
  _GetHash160Comp(px, py, (uint8_t *)h);
  CHECK_POINT(h, incr, 0);                    // <-- 100% OK up to here, means that (px,py) is good

  // Endo #1: if (x, y) = k*G, then (beta*x, y) = lambda*k*G
  _ModMult(pe1x, px, _beta);
  _GetHash160Comp(pe1x, py, (uint8_t *)h);    // <-- 50% wrong from here
  CHECK_POINT(h, incr, 1);

  // Endo #2: if (x, y) = k*G, then (beta2*x, y) = lambda2*k*G
  _ModMult(pe2x, px, _beta2);
  _GetHash160Comp(pe2x, py, (uint8_t *)h);
  CHECK_POINT(h, incr, 2);

  ModNeg256(py);

  // Symmetric points
  _GetHash160Comp(px, py, (uint8_t *)h);
  CHECK_POINT(h, -incr, 0);
  _GetHash160Comp(pe1x, py, (uint8_t *)h);
  CHECK_POINT(h, -incr, 1);
  _GetHash160Comp(pe2x, py, (uint8_t *)h);
  CHECK_POINT(h, -incr, 2);
}
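As background (my own summary, not part of the post): the reason this routine checks six candidates per point is the secp256k1 endomorphism. With beta a non-trivial cube root of 1 modulo the field prime p, and lambda the matching cube root of 1 modulo the curve order n:

  if (x, y) = k*G then (beta*x,  y) = lambda*k  * G
                       (beta2*x, y) = lambda2*k * G   where beta2 = beta^2, lambda2 = lambda^2
  and negating y gives (x, -y) = -k*G, and likewise for the two endomorphism points

so one computed point plus one ModNeg256 yields six HASH160 candidates, which is exactly what the six CHECK_POINT calls above cover.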
|
|
|
OK thanks, it works. On my GTX 645, same performance. Sqr brings a few more spill moves (there are more temporary variables than in ModMult). I didn't try yet on the old Quadro 600. I will see if I can win a few registers.

With Sqr:
1> 33280 bytes stack frame, 128 bytes spill stores, 436 bytes spill loads
Without Sqr:
1> 33280 bytes stack frame, 120 bytes spill stores, 424 bytes spill loads
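For anyone who wants to see these numbers on their own build: they come from ptxas, and nvcc prints them when you pass the verbose flag. The command line below is only an example; the real build adds arch and include options.

nvcc -c -Xptxas -v GPU/GPUEngine.cu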
|
|
|
From 153 MKeys/s to 160 MKeys/s
using a _ModSqr instead of _ModMult
Thanks, I tried but the -check failed. I will have a look at it. I committed the patch with a few of your mods; I also reviewed the main loop a bit.
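As background on why a dedicated squaring routine can beat a general modular multiply: when both operands are equal, every cross partial product a[i]*a[j] (i != j) occurs twice, so it can be computed once and doubled. A rough host-side sketch with 32-bit limbs follows; this is not the _ModSqr from the repository, which works on 64-bit limbs with PTX carry instructions and includes the modular reduction.

#include <stdint.h>

#define NLIMB 8   // 8 x 32-bit limbs = 256 bits

// Square a 256-bit integer into a 512-bit result (no reduction).
// An n-limb square needs ~n*(n+1)/2 limb multiplies instead of n*n.
static void Square256(const uint32_t a[NLIMB], uint32_t r[2 * NLIMB]) {
  uint64_t acc[2 * NLIMB] = {0};
  // Off-diagonal products a[i]*a[j], i<j: computed once, counted twice.
  for (int i = 0; i < NLIMB; i++) {
    for (int j = i + 1; j < NLIMB; j++) {
      uint64_t p = (uint64_t)a[i] * a[j];
      acc[i + j]     += (p & 0xFFFFFFFFULL) << 1;
      acc[i + j + 1] += (p >> 32) << 1;
    }
  }
  // Diagonal products a[i]^2: counted once.
  for (int i = 0; i < NLIMB; i++) {
    uint64_t p = (uint64_t)a[i] * a[i];
    acc[2 * i]     += p & 0xFFFFFFFFULL;
    acc[2 * i + 1] += p >> 32;
  }
  // Propagate the accumulated carries into 32-bit result limbs.
  uint64_t carry = 0;
  for (int i = 0; i < 2 * NLIMB; i++) {
    uint64_t v = acc[i] + carry;
    r[i] = (uint32_t)v;
    carry = v >> 32;
  }
}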
|
|
|
Still errors.
OK, thanks for testing. I give up for the moment, I've run out of ideas. I'll leave the volatile in. I hope I will manage to reproduce this.
|
|
|
Arg... Could you try this (for the 2 ModMult)? With these mods, all instructions of the ModMult will be volatile and, theoretically, cannot be moved or removed by the compiler.

#define SET0(a) asm volatile ("mov.u64 %0,0;" : "=l"(a))

// ---------------------------------------------------------------------------------------
// Compute a*b*(mod n)
// a and b must be lower than n
// ---------------------------------------------------------------------------------------

__device__ void _ModMult(uint64_t *r, uint64_t *a, uint64_t *b) {

  uint64_t r512[8];
  uint64_t t[NBBLOCK];
  uint64_t ah, al;

  SET0(r512[5]);
  SET0(r512[6]);
  SET0(r512[7]);

  // 256*256 multiplier
|
|
|
Hello @arulbero, could you try this file: http://zelda38.free.fr/VanitySearch/GPUEngine.cu
I unrolled the UMult macro, maybe nvcc performs a wrong optimization due to it. The volatile causes a 10% performance loss on my Windows, a bit less on my Linux.

  // Reduce from 512 to 320
- UMult(t,(r512 + 4), 0x1000003D1ULL);
+ UMULLO(t[0],r512[4],0x1000003D1ULL);
+ UMULLO(t[1],r512[5],0x1000003D1ULL);
+ MADDO(t[1],r512[4],0x1000003D1ULL,t[1]);
+ UMULLO(t[2],r512[6],0x1000003D1ULL);
+ MADDC(t[2],r512[5],0x1000003D1ULL,t[2]);
+ UMULLO(t[3],r512[7],0x1000003D1ULL);
+ MADDC(t[3],r512[6],0x1000003D1ULL,t[3]);
+ MADD(t[4],r512[7],0x1000003D1ULL,0ULL);
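For readers following along: these macros wrap single PTX instructions. The definitions below are a plausible reconstruction only and the real ones in the repository may differ in detail; UMULLO is a 64-bit low-word multiply, MADDO starts a multiply-add chain and sets the carry flag, MADDC continues it consuming and producing the carry, and MADD consumes the carry without producing one.

// Hypothetical definitions, for illustration only:
#define UMULLO(lo, a, b)   asm volatile ("mul.lo.u64 %0, %1, %2;"         : "=l"(lo) : "l"(a), "l"(b))
#define MADDO(r, a, b, c)  asm volatile ("mad.lo.cc.u64 %0, %1, %2, %3;"  : "=l"(r)  : "l"(a), "l"(b), "l"(c))
#define MADDC(r, a, b, c)  asm volatile ("madc.lo.cc.u64 %0, %1, %2, %3;" : "=l"(r)  : "l"(a), "l"(b), "l"(c))
#define MADD(r, a, b, c)   asm volatile ("madc.lo.u64 %0, %1, %2, %3;"    : "=l"(r)  : "l"(a), "l"(b), "l"(c))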
|
|
|
Hello, is it possible, Jean Luc, to compile it as an .exe for CUDA 8 under Windows, or does it only work on Linux with CUDA 8?
It is on my task list, but on Windows it is not easy to play with several releases of Visual C++. On Linux, it is clearer and simple enough. For Windows, I have to set up a full config with the right compiler for CUDA 8.

It works!!! A little slower, but it is correct now!
Good news, I'll add the patch in the next release.
|
|
|
Another try: GPU/GPUEngine.cu:465 and GPU/GPUEngine.cu:514

volatile uint64_t r512[8];

volatile prevents the compiler from optimizing the variable and from removing code that is actually used. I had a problem with gcc 6 concerning this on the CPU release.
|
|
|
Already tried with "LD_LIBRARY_PATH"; the problem is the driver. I have Ubuntu 17.04, I cannot install a new driver on it.
OK, that's too bad that the driver is not compatible. I tried your function on my Linux config but it does not bring a significant performance increase, mainly due to the fact that adding temporary variables adds more spill moves, which are slower; sometimes it is better to recompute. On your hardware you have many more available registers, so the performance increase should be more significant. A tip: maybe you can try to play with the max register count in the Makefile; for compute capability 5.0, nvcc CUDA 10, use 120 registers. The random problem you have may also be due to wrong register sharing between threads, which could explain the strange and random behavior. Reducing the number of registers used, by inlining, also reduces the probability that this happens. It might be an explanation...
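If useful, the register cap is passed to nvcc via the -maxrregcount option; for example (the command line is only illustrative, the real build goes through the Makefile):

nvcc -maxrregcount=120 -gencode arch=compute_50,code=sm_50 -c GPU/GPUEngine.cu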
|
|
|
(I'm not sure what C means, I suppose it means "with carry")
Yes:
ADD0 is the initial add, without carry-in, that sets the carry flag.
ADDC is an add with carry-in that sets the carry flag.
ADD is an add with carry-in that does not set the carry flag.
Same for SUB. Functions may have a 1 suffix for the unary version.
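A minimal sketch of what such macros typically look like as inline PTX; the names and semantics follow the description above, but the actual definitions in the repository may differ.

// add.cc sets the carry flag, addc consumes it, the trailing .cc keeps propagating it
#define ADD0(r, a, b)  asm volatile ("add.cc.u64 %0, %1, %2;"  : "=l"(r) : "l"(a), "l"(b))
#define ADDC(r, a, b)  asm volatile ("addc.cc.u64 %0, %1, %2;" : "=l"(r) : "l"(a), "l"(b))
#define ADD(r, a, b)   asm volatile ("addc.u64 %0, %1, %2;"    : "=l"(r) : "l"(a), "l"(b))
// and similarly SUB0 / SUBC / SUB with sub.cc.u64 / subc.cc.u64 / subc.u64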
|
|
|
Many thanks for the tips, I will try this.

You don't want to try the binary? The libcudart.so.10.0 is also available from the given link. You do not need to set up the CUDA SDK 10 (unless a driver problem appears, this may work without installing anything). You can just copy VanitySearch50 and libcudart.so.10.0 into a directory and set LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=.
./VanitySearch50 ...
This is mainly to see if the problem is solved with CUDA 10 or if it comes from elsewhere.
|
|
|
New version is slower on my pc (132 MKeys/s against 162 MKeys/s).
On my Windows, the performance is the same as the previous release (CUDA 10). Slightly slower on Linux (CUDA 8.0), from 39.5 MK/s to 37.9 MK/s. Anyway, do you compile yourself or do you use the Linux binaries? Did you solve your problem? I haven't managed to reproduce the issue yet.
|
|
|
Hello, it ran, but just closed after finding it. Did it generate the private keys into a file? I am confused.
To output the key to a file, use the -o option:
VanitySearch -stop -gpu -o key.txt 1stortz
Many thanks, stivensons, for the report.
|
|
|
If you post a Windows release, I can test it too.

You can test with the release you have. You can try:
VanitySearch -gpuId 0 -check
VanitySearch -gpuId 6 -check (on the 3GB)
Thanks. Tomorrow, I will try to set up the CUDA SDK 10 on recent hardware (Linux) and see if I can reproduce the issue.
|
|
|
OK thanks, could you try to run cuda-memcheck on the release version?
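For reference, cuda-memcheck just wraps the normal command line; a typical invocation against the check mode discussed earlier (binary name and flags taken from the previous posts) would be:

cuda-memcheck ./VanitySearch -gpuId 0 -check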
|
|
|