azwccc
|
|
July 23, 2013, 11:10:43 PM |
|
I would definitely donate if the CUDA miner can outperform the AMD OpenCL miner (on a GPU in the same price range).
Best of luck to you.
|
Bitrated user: azwccc.
|
|
|
jaakkop
|
|
July 24, 2013, 05:52:16 AM |
|
Donated some BTC just now, keep up the good work! BTW, are there any guides on how to compile this for Windows (or how to cross-compile from Linux?), or should the Bitcoin-Qt building guides apply to this as well?
|
I'd buy that for a dollar bitcoin!
|
|
|
wetroof
Member
Offline
Activity: 75
Merit: 10
|
|
July 24, 2013, 06:03:48 AM |
|
If I do mine with this, I'll pledge to give 10% of the total XPM I mine in the first 5 days to primedigger. I have two CUDA compute capability 2.0 cards and one 3.0 card.
|
|
|
|
teknohog
|
|
July 24, 2013, 09:21:44 AM |
|
Based on previous prime-number research projects, CUDA has outperformed OpenCL.

AFAIK, there are two reasons why the Lucas-Lehmer test for Mersenne primes has been done in CUDA rather than OpenCL: it uses floating-point math, and there are better FFT libraries available for CUDA. (For the huge numbers involved, multiplication is more efficient via Fourier transform and convolution.) On the other hand, trial division (another important part of the Mersenne prime search) seems to be more efficient with OpenCL on AMD, as it uses integer math. I assume Primecoin would work fine with just integer math. Sources: http://mersenneforum.org/forumdisplay.php?f=92

Of course, there is also the more ideological point of using a language that works across many platforms (CPUs and GPUs from several vendors) rather than tying yourself to Nvidia, especially in an open source project.
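To make the convolution remark concrete: schoolbook multiplication of digit arrays is exactly a convolution followed by carry propagation, and an FFT evaluates that convolution in O(n log n), which is why good FFT libraries matter for huge operands. A minimal host-side sketch, illustration only (base-10 digits for clarity):

Code:
#include <cstdint>
#include <cstdio>
#include <vector>

// Schoolbook multiplication of little-endian digit arrays. The double loop
// is exactly the convolution c[k] = sum_i a[i] * b[k-i], which an FFT can
// evaluate in O(n log n) instead of O(n^2).
std::vector<uint64_t> mul_convolve(const std::vector<uint32_t>& a,
                                   const std::vector<uint32_t>& b) {
    std::vector<uint64_t> c(a.size() + b.size(), 0);
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j)
            c[i + j] += uint64_t(a[i]) * b[j];   // raw convolution
    for (size_t k = 0; k + 1 < c.size(); ++k) {  // carry propagation
        c[k + 1] += c[k] / 10;
        c[k] %= 10;
    }
    return c;
}

int main() {
    auto c = mul_convolve({9, 9}, {9, 9});  // 99 * 99 = 9801
    for (auto it = c.rbegin(); it != c.rend(); ++it)
        printf("%llu", (unsigned long long)*it);
    printf("\n");
    return 0;
}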
|
|
|
|
maco
|
|
July 24, 2013, 09:35:40 AM |
|
How is the progress so far? CUDA support is a must.
|
|
|
|
primedigger (OP)
Member
Offline
Activity: 75
Merit: 10
|
|
July 24, 2013, 02:55:20 PM |
|
Progress update:
- The GPU does Fermat tests now and the CPU does the rest. Fermat tests seem to work fine now.
- I can find prime chains, but couldn't find a block on testnet in a timely manner.

Todo:
- Transferring mpz types as strings is slow, so I will convert GMP mpz's directly to my CUDA format on the CPU.
- A lot has happened with the high-performance client; I will update my codebase to hp7.
- The changes in hp6 could also be useful on the GPU (-> fast divisibility tests before doing the expensive Fermat test).
- Interleave CPU+GPU computations and async memory copies (see the sketch below). Without this, my client won't be very fast.

I don't like to put an ETA on this; it's done when it's done. But I hope to have something by next week that outperforms my old Intel Core 2 Quad.

My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Edit: And thanks for the donations / pledges, guys!
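A minimal sketch of the interleaving idea from the todo list, using CUDA streams with pinned host memory and a placeholder fermat_kernel (hypothetical names, not the repo's actual code):

Code:
#include <cuda_runtime.h>

// Placeholder for the real Fermat-test kernel: one candidate per thread.
__global__ void fermat_kernel(const unsigned* cand, unsigned* res, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) res[i] = cand[i] & 1;   // stand-in for the actual test
}

int main() {
    const int N = 2048, BATCHES = 8;
    const size_t bytes = N * sizeof(unsigned);
    unsigned *h_cand[2], *h_res[2], *d_cand[2], *d_res[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost(&h_cand[s], bytes);  // pinned memory is required
        cudaMallocHost(&h_res[s], bytes);   // for truly async copies
        cudaMalloc(&d_cand[s], bytes);
        cudaMalloc(&d_res[s], bytes);
        cudaStreamCreate(&stream[s]);
    }
    for (int b = 0; b < BATCHES; ++b) {
        int s = b % 2;
        // wait until buffer s is free again (batch b-2 done); its results
        // in h_res[s] can be consumed by the CPU chain checks here
        cudaStreamSynchronize(stream[s]);
        // ... CPU sieving fills h_cand[s] with fresh candidates here ...
        cudaMemcpyAsync(d_cand[s], h_cand[s], bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        fermat_kernel<<<N / 256, 256, 0, stream[s]>>>(d_cand[s], d_res[s], N);
        cudaMemcpyAsync(h_res[s], d_res[s], bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
        // while stream[s] works on this batch, the next iteration overlaps
        // CPU work and the other stream's transfers with it
    }
    for (int s = 0; s < 2; ++s) cudaStreamSynchronize(stream[s]);
    return 0;
}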
|
|
|
|
hasle2
|
|
July 24, 2013, 07:30:02 PM |
|
Progress update:
- The GPU does Fermat tests now and the CPU does the rest. Fermat tests seem to work fine now.
- I can find prime chains, but couldn't find a block on testnet in a timely manner.

Todo:
- Transferring mpz types as strings is slow, so I will convert GMP mpz's directly to my CUDA format on the CPU.
- A lot has happened with the high-performance client; I will update my codebase to hp7.
- The changes in hp6 could also be useful on the GPU (-> fast divisibility tests before doing the expensive Fermat test).
- Interleave CPU+GPU computations and async memory copies. Without this, my client won't be very fast.

I don't like to put an ETA on this; it's done when it's done. But I hope to have something by next week that outperforms my old Intel Core 2 Quad.

My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Edit: And thanks for the donations / pledges, guys!
The fact that you are finding blocks on testnet is quite an achievement. Keep up the good work!
|
|
|
|
wetroof
Member
Offline
Activity: 75
Merit: 10
|
|
July 24, 2013, 09:13:52 PM |
|
Progress update:
- The GPU does Fermat tests now and the CPU does the rest. Fermat tests seem to work fine now.
- I can find prime chains, but couldn't find a block on testnet in a timely manner.

Todo:
- Transferring mpz types as strings is slow, so I will convert GMP mpz's directly to my CUDA format on the CPU.
- A lot has happened with the high-performance client; I will update my codebase to hp7.
- The changes in hp6 could also be useful on the GPU (-> fast divisibility tests before doing the expensive Fermat test).
- Interleave CPU+GPU computations and async memory copies. Without this, my client won't be very fast.

I don't like to put an ETA on this; it's done when it's done. But I hope to have something by next week that outperforms my old Intel Core 2 Quad.

My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Edit: And thanks for the donations / pledges, guys!
I donated 0.2 BTC. Thanks for the update.
|
|
|
|
jhxhlj
Newbie
Offline
Activity: 26
Merit: 0
|
|
July 25, 2013, 01:59:10 AM |
|
|
|
|
|
Sunny King
Legendary
Offline
Activity: 1205
Merit: 1010
|
|
July 25, 2013, 02:12:02 AM |
|
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as much on the GPU. However, I think that if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.
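A minimal sketch of that division of labor: the GPU crosses off sieve offsets divisible by small primes, and only surviving offsets go on to the expensive Fermat test. (Hypothetical layout, not from either codebase; first_mult would be precomputed on the CPU with GMP, since the chain origin is a multi-precision number.)

Code:
#include <cuda_runtime.h>
#include <cstdio>

// One thread per small prime: cross off every offset k in the sieve window
// where base + k is divisible by that prime. Offsets whose flag is still 0
// afterwards become Fermat-test candidates. Racy same-value byte writes
// from different threads are benign here.
__global__ void sieve_kernel(const unsigned* primes, const unsigned* first_mult,
                             unsigned char* composite, unsigned window, int np) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= np) return;
    for (unsigned k = first_mult[i]; k < window; k += primes[i])
        composite[k] = 1;                  // base + k has a small factor
}

int main() {
    const unsigned window = 1u << 20;
    // first_mult[i] = (-base) mod primes[i], from GMP's mpz_tdiv_ui on the
    // CPU; base = 0 here purely for the demo.
    unsigned h_primes[] = {3, 5, 7, 11, 13}, h_first[] = {0, 0, 0, 0, 0};
    unsigned *d_primes, *d_first;
    unsigned char* d_comp;
    cudaMalloc(&d_primes, sizeof h_primes);
    cudaMalloc(&d_first, sizeof h_first);
    cudaMalloc(&d_comp, window);
    cudaMemset(d_comp, 0, window);
    cudaMemcpy(d_primes, h_primes, sizeof h_primes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_first, h_first, sizeof h_first, cudaMemcpyHostToDevice);
    sieve_kernel<<<1, 32>>>(d_primes, d_first, d_comp, window, 5);
    cudaDeviceSynchronize();
    printf("sieve done\n");  // host would now collect offsets still flagged 0
    return 0;
}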
|
|
|
|
primedigger (OP)
Member
Offline
Activity: 75
Merit: 10
|
|
July 25, 2013, 01:06:02 PM |
|
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as much on the GPU. However, I think that if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found, and which I started to extend, uses the most basic ones. mpz_powmod needs fast multiplication of big ints: GMP's algorithm is most likely O(n*log(n)), while the schoolbook multiplication the GPU currently uses is O(n^2). I had hoped that for the ~400-bit numbers involved it wouldn't make such a difference.

Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better bignum implementation. But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU, or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division by the most common primes).

Anybody with a better GPU than the GeForce 570 Ti I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter 24303 prime/h 490729 test/h 47 5-chains/h

prime/h and test/h fluctuate enormously and seem rather meaningless; as most tests are on the GPU, I have no idea whether this is even measuring the tests right. 5-chains/h is accurate, though. You have to use setgenerate true 1, i.e. one CPU thread for mining.
|
|
|
|
paulthetafy
|
|
July 25, 2013, 01:13:24 PM |
|
Good job primedigger! Just looking at your 5-chains score, I think that means this is slower than most CPUs though, right? My crappy 4-core AMD does 400-800 5-chains/hour (around 3000 PPS).
|
|
|
|
primedigger (OP)
Member
Offline
Activity: 75
Merit: 10
|
|
July 25, 2013, 01:20:57 PM |
|
Good job primedigger! Just looking at your 5-chains score, I think that means this is slower than most CPUs though, right? My crappy 4-core AMD does 400-800 5-chains/hour (around 3000 PPS).

Yes, it's an extremely crappy score, but it's a start. The high-performance client does a good job of squeezing as much as possible out of your CPU, though!
|
|
|
|
K1773R
Legendary
Offline
Activity: 1792
Merit: 1008
/dev/null
|
|
July 25, 2013, 01:31:53 PM Last edit: July 25, 2013, 01:44:27 PM by K1773R |
|
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as much on the GPU. However, I think that if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available; the student library I found, and which I started to extend, uses the most basic ones. mpz_powmod needs fast multiplication of big ints: GMP's algorithm is most likely O(n*log(n)), while the schoolbook multiplication the GPU currently uses is O(n^2). I had hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (at a rate of 10 per second), but my CPU is still faster due to better algorithms and a better bignum implementation. Anybody with a better GPU than the GeForce 570 Ti I own, please test this! You have to use setgenerate true 1, i.e. one CPU thread for mining.

Running the current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU: no heat, and used memory doesn't increase when using the Qt client. Anything special to activate so it mines with the GPU? I've got a powerful GPU to test with.

EDIT:
2013-07-25 13:35:43 primemeter 0 prime/h 34261932 test/h 0 5-chains/h
Seems the miner thread which should launch the CUDA code is borked?

EDIT2: From debug.log:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated
After this message it segfaults; going to debug with gdb.
|
[GPG Public Key] BTC/DVC/TRC/FRC: 1K1773RbXRZVRQSSXe9N6N2MUFERvrdu6y ANC/XPM: AK1773RTmRKtvbKBCrUu95UQg5iegrqyeA NMC: NK1773Rzv8b4ugmCgX789PbjewA9fL9Dy1 LTC: LKi773RBuPepQH8E6Zb1ponoCvgbU7hHmd EMC: EK1773RxUes1HX1YAGMZ1xVYBBRUCqfDoF BQC: bK1773R1APJz4yTgRkmdKQhjhiMyQpJgfN
|
|
|
primedigger (OP)
Member
Offline
Activity: 75
Merit: 10
|
|
July 25, 2013, 01:45:44 PM |
|
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as much on the GPU. However, I think that if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. [...] Anybody with a better GPU than the GeForce 570 Ti I own, please test this!

Running the current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU: no heat, and used memory doesn't increase when using the Qt client. Anything special to activate so it mines with the GPU? I've got a powerful GPU to test with.

EDIT:
2013-07-25 13:35:43 primemeter 0 prime/h 34261932 test/h 0 5-chains/h
Seems the miner thread which should launch the CUDA code is borked?

EDIT2: From debug.log:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated

You can also run it with -printmining -printtoconsole to see that output directly. Could you compile the CUDA portion with -G -g (change the Qt project file where it invokes nvcc) and give me the output of cuda-memcheck? You can also #define CUDA_DEBUG in the .cu file to see the GPU printfs on the console.
|
|
|
|
Vorksholk
Legendary
Offline
Activity: 1713
Merit: 1029
|
|
July 25, 2013, 01:47:59 PM |
|
If someone could give me some specific compilation directions (or a Windows binary!), I can test on a 780.
|
|
|
|
K1773R
Legendary
Offline
Activity: 1792
Merit: 1008
/dev/null
|
|
July 25, 2013, 01:48:56 PM Last edit: July 25, 2013, 02:00:26 PM by K1773R |
|
My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as much on the GPU. However, I think that if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.

There is indeed a problem with the speed of Fermat tests on the GPU. [...] Anybody with a better GPU than the GeForce 570 Ti I own, please test this!

Running the current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU: no heat, and used memory doesn't increase when using the Qt client. Anything special to activate so it mines with the GPU? I've got a powerful GPU to test with.

EDIT2: From debug.log:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated

You can also run it with -printmining -printtoconsole to see that output directly. Could you compile the CUDA portion with -G -g (change the Qt project file where it invokes nvcc) and give me the output of cuda-memcheck? You can also #define CUDA_DEBUG in the .cu file to see the GPU printfs on the console.

I was already running with -g, just waiting for the "Cuda start" message; stopped it now and recompiling with -D CUDA_DEBUG.

EDIT: It's up and running, waiting for the CUDA init + crash.
EDIT2: Why does it take so long until the miner starts the CUDA thread? That seems stupid :S
EDIT3: Here we go, it crashed.

debug.log:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, unspecified launch failure

stdout:
[0] start!
sizeof(struct) = 400
mpz_print: mpz_capacity: 0
[0] string candidate is
[0] N is: mpz_capacity: 30 ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[0] E is: mpz_capacity: 30 fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe

gdb: don't want to spam; sending it per PM, as the message is too big -.-
|
[GPG Public Key] BTC/DVC/TRC/FRC: 1K1773RbXRZVRQSSXe9N6N2MUFERvrdu6y ANC/XPM: AK1773RTmRKtvbKBCrUu95UQg5iegrqyeA NMC: NK1773Rzv8b4ugmCgX789PbjewA9fL9Dy1 LTC: LKi773RBuPepQH8E6Zb1ponoCvgbU7hHmd EMC: EK1773RxUes1HX1YAGMZ1xVYBBRUCqfDoF BQC: bK1773R1APJz4yTgRkmdKQhjhiMyQpJgfN
|
|
|
TheSwede75
|
|
July 25, 2013, 02:11:30 PM |
|
More than willing to help perform tests as instructed if a Windows binary is posted. Got an old GTX 475 rattling around that I could put to work...
|
|
|
|
primedigger (OP)
Member
Offline
Activity: 75
Merit: 10
|
|
July 25, 2013, 03:27:25 PM |
|
Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0, and I'm using 5.5 now, which works for me.
|
|
|
|
Schleicher
|
|
July 25, 2013, 04:14:24 PM |
|
Would it make any difference if we used __restrict__ pointers in the CUDA code?
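For context: __restrict__ promises the compiler that the pointers don't alias, so loads can be hoisted into registers or routed through the read-only cache path. A hypothetical kernel illustrating the qualifier (not from the repo):

Code:
// Without __restrict__, the compiler must assume a store through 'out'
// could alter what 'in' points to and reload it every iteration; with it,
// in[i] can be loaded once and kept in a register.
__global__ void scale_add(const float* __restrict__ in,
                          float* __restrict__ out,
                          float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i] + out[i];
}

Whether it measurably helps here depends on how memory-bound the bignum kernels actually are.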
|
|
|
|
|