Bitcoin Forum
November 06, 2024, 09:43:34 PM *
Author Topic: [XPM] CUDA enabled qt client miner for primecoins. Source code inside. WIP  (Read 31756 times)
azwccc
Sr. Member
July 23, 2013, 11:10:43 PM
 #81

I would definitely donate if the CUDA miner can outperform the AMD OCL miner (on a GPU in the same price range).

Best of luck to you.

Bitrated user: azwccc.
jaakkop
Member
July 24, 2013, 05:52:16 AM
 #82

Donated some BTC just now, keep up the good work :)

BTW, are there any guides on how to compile this for Windows (or how to cross-compile from Linux)? Or should the Bitcoin-Qt build guides apply to this as well?

I'd buy that for a dollar bitcoin!
wetroof
Member
July 24, 2013, 06:03:48 AM
 #83

If I do mine with this, I pledge to give 10% of the total XPM I mine in the first 5 days to primedigger. I have two cards with CUDA compute capability 2.0 and one with 3.0.
teknohog
Sr. Member
July 24, 2013, 09:21:44 AM
 #84

Based on previous prime-number-based-research projects, CUDA has outperformed OpenCL.

AFAIK, there are two reasons why the Lucas-Lehmer test for Mersenne primes has been done in CUDA rather than OpenCL. It uses floating point math, and there are better FFT libraries available for CUDA. (For the huge numbers involved, multiplication is more efficient via Fourier transform and convolution.)
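
As a rough illustration of the Lucas-Lehmer test mentioned above, here is a minimal Python sketch. It relies on Python's built-in big integers instead of the floating-point FFT multiplication that GPU implementations use, and the function name `lucas_lehmer` is made up for this example.

```python
def lucas_lehmer(p):
    """Return True if the Mersenne number 2**p - 1 is prime (p must be an odd prime)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        # The squaring is the expensive step; for huge p this is the
        # multiplication that FFT-based convolution speeds up.
        s = (s * s - 2) % m
    return s == 0


# Exponents 11 and 23 give composite Mersenne numbers, the others give primes.
print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 31) if lucas_lehmer(p)])
```

For real searches the exponent p has millions of digits' worth of bits, so almost all the time goes into that one squaring per iteration.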

On the other hand, trial division (another important part of Mersenne prime search) seems to be more efficient with OpenCL on AMD, as it uses integer math. I assume Primecoin would work fine with just integer math.
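
To show why trial division is a pure integer workload, here is a small Python sketch (an illustration only, not code from any of the projects mentioned). It uses two standard facts: any factor q of 2^p - 1 with p an odd prime has the form 2kp + 1, and it is 1 or 7 modulo 8. The name `find_mersenne_factor` and the `max_k` limit are assumptions for this example.

```python
def find_mersenne_factor(p, max_k):
    """Search for a proper factor q = 2*k*p + 1 of 2**p - 1 using only integer math."""
    m = (1 << p) - 1
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if q * q > m:
            break  # no proper factor can exceed sqrt(m)
        if q % 8 not in (1, 7):
            continue  # factors of Mersenne numbers are always 1 or 7 mod 8
        if pow(2, p, q) == 1:  # q divides 2**p - 1 exactly when 2**p = 1 (mod q)
            return q
    return None


print(find_mersenne_factor(11, 100))  # 2**11 - 1 = 2047 = 23 * 89
print(find_mersenne_factor(13, 100))  # 2**13 - 1 = 8191 is prime
```

Each candidate q needs only a modular exponentiation with a small modulus, which is why this step maps well to integer units.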

Sources: http://mersenneforum.org/forumdisplay.php?f=92

Of course, there is also a more ideological point in favor of using a language that works across many platforms (CPUs and GPUs from several vendors) rather than tying yourself to Nvidia, especially in an open source project.

world famous math art | masternodes are bad, mmmkay?
Every sha(sha(sha(sha()))), every ho-o-o-old, still shines
maco
Sr. Member
July 24, 2013, 09:35:40 AM
 #85

How is progress so far? CUDA is a must :)
primedigger (OP)
Member
July 24, 2013, 02:55:20 PM
 #86

Progress update:

- The GPU now does the Fermat tests and the CPU does the rest. The Fermat tests seem to work fine now.
- I can find prime chains, but couldn't find a block on testnet in a timely manner
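
For readers unfamiliar with the two terms above, here is a minimal Python sketch of a base-2 Fermat test and of counting the length of a Cunningham chain of the first kind. It is only an illustration of the idea, not the miner's code; Primecoin's actual chain verification has more steps, and the function names are made up.

```python
def fermat_test(n):
    """Base-2 Fermat probable-prime test: True if 2**(n-1) = 1 (mod n)."""
    return n > 2 and pow(2, n - 1, n) == 1


def chain_length(n):
    """Length of the chain n, 2n+1, 4n+3, ... whose members all pass the Fermat test."""
    length = 0
    while fermat_test(n):
        length += 1
        n = 2 * n + 1
    return length


# 89, 179, 359, 719, 1439, 2879 are all prime; the next term 5759 = 13 * 443 is not.
print(chain_length(89))  # prints 6
```

Note that a Fermat test can be fooled by pseudoprimes such as 341 = 11 * 31, so it yields probable primes only.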

Todo:
- Transferring mpz types as strings is slow, so I will convert GMP mpz's directly to my CUDA format on the CPU.
- A lot has happened with the high performance client; I will update my codebase to hp7.
- The changes in hp6 could also be useful on the GPU (-> fast divisibility tests before doing the expensive Fermat test).
- Interleave CPU+GPU computations and async memory copies. Without this, my client won't be very fast.
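
The first todo item, replacing the string transfer with a direct conversion, can be sketched as follows. This is a Python illustration under the assumption that the GPU-side format is a fixed-size array of 32-bit limbs, least significant first; the actual layout of the CUDA library is not shown in this thread. In C, GMP's `mpz_export` is one way to get the words out of an mpz without going through a string.

```python
LIMB_BITS = 32  # assumed limb width of the GPU-side format
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(n, limb_count):
    """Split a non-negative integer into limb_count little-endian 32-bit limbs."""
    if n < 0 or n >> (LIMB_BITS * limb_count):
        raise ValueError("value does not fit into the requested number of limbs")
    return [(n >> (LIMB_BITS * i)) & LIMB_MASK for i in range(limb_count)]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


n = (1 << 399) + 12345            # a ~400-bit value
limbs = to_limbs(n, 13)           # 13 limbs * 32 bits = 416 bits of room
print(from_limbs(limbs) == n)     # prints True
```

A fixed limb count keeps every candidate the same size in memory, which makes a single bulk copy to the device straightforward.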

I don't like to put an ETA on this; it's done when it's done. But I hope to have something by next week that outperforms my old Intel Core 2 Quad.

My 2 cents: mining entirely on the GPU won't be easy and is impractical, but tandem mining with interleaved CPU+GPU computations may very well give good speedups.
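
The interleaving idea can be pictured as a producer/consumer pipeline. The Python sketch below is only a model of it: a worker thread stands in for the GPU, and names such as `fake_gpu_worker` and `run_pipeline` are invented for this example. In real CUDA code the same overlap would come from streams and asynchronous copies.

```python
import queue
import threading


def fermat_test(n):
    return pow(2, n - 1, n) == 1


def fake_gpu_worker(in_q, out_q):
    """Stand-in for the GPU: tests each batch it receives until it sees None."""
    while True:
        batch = in_q.get()
        if batch is None:
            out_q.put(None)
            return
        out_q.put([n for n in batch if fermat_test(n)])


def run_pipeline(batches):
    in_q = queue.Queue(maxsize=2)  # the CPU stays at most two batches ahead
    out_q = queue.Queue()
    worker = threading.Thread(target=fake_gpu_worker, args=(in_q, out_q))
    worker.start()
    for batch in batches:  # CPU side: could be a generator that sieves lazily
        in_q.put(batch)
    in_q.put(None)  # tell the worker there is no more work
    survivors = []
    while True:
        result = out_q.get()
        if result is None:
            break
        survivors.extend(result)
    worker.join()
    return survivors


print(run_pipeline([[3, 5, 7, 9], [11, 13, 15]]))  # prints [3, 5, 7, 11, 13]
```

While the worker tests one batch, the producer is free to prepare the next, so neither side sits idle for long.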

Edit: And thanks for the donations / pledges guys!
hasle2
Full Member
July 24, 2013, 07:30:02 PM
 #87

The fact that you are finding blocks on testnet is quite an achievement. Keep up the good work :)
wetroof
Member
July 24, 2013, 09:13:52 PM
 #88

I donated 0.2 BTC. Thanks for the update.
jhxhlj
Newbie
July 25, 2013, 01:59:10 AM
 #89

Any update? :D
Sunny King
Legendary
July 25, 2013, 02:12:02 AM
 #90

Some feedback from knowledgeable people indicates that mod_exp probably would not speed up as well on the GPU. However, I think that if the GPU can do the sieve much more efficiently, it could generate far fewer candidates for the Fermat test, which could speed things up quite a bit.
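
To make the sieve idea concrete, here is a simplified Python sketch. It assumes candidates of the form h*k - 1 for a fixed h and crosses out every multiplier k for which a small prime divides the candidate. The function name, the candidate form and the parameters are assumptions chosen for illustration; this is not the sieve from the Primecoin client.

```python
def sieve_multipliers(h, max_k, small_primes):
    """Return the k in 1..max_k for which h*k - 1 has no factor among small_primes."""
    survives = [True] * (max_k + 1)
    for p in small_primes:
        if h % p == 0:
            continue  # then h*k - 1 = -1 (mod p), so p never divides it
        first_k = pow(h, -1, p)  # h * first_k = 1 (mod p); needs Python 3.8+
        for k in range(first_k, max_k + 1, p):
            if h * k - 1 != p:  # do not cross out the prime p itself
                survives[k] = False
    return [k for k in range(1, max_k + 1) if survives[k]]


# With h = 6 only k = 6 is removed, because 6*6 - 1 = 35 = 5 * 7.
print(sieve_multipliers(6, 10, [5, 7, 11, 13]))  # prints [1, 2, 3, 4, 5, 7, 8, 9, 10]
```

Each small prime costs one modular inverse plus a strided pass over the array, so the work per prime is independent of the size of the candidates. Only the survivors need a Fermat test.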
primedigger (OP)
Member
July 25, 2013, 01:06:02 PM
 #91

There is indeed a problem with the speed of Fermat tests on the GPU. GNU GMP uses the most sophisticated algorithms available, whereas the student library I found and started to extend uses the most basic ones.

Mpz_powmod needs fast multiplication of big ints. GMP's algorithm is most likely in O(n log n), while the schoolbook multiplication that the GPU now uses is O(n^2). I hoped that for the ~400-bit numbers involved it wouldn't make such a difference. Currently, the new version in my repo does Fermat tests on the GPU (the rate is 10 per second), but my CPU is still faster due to better algorithms and a better bignum implementation.
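
For reference, schoolbook multiplication on limb arrays looks roughly like the Python sketch below (32-bit little-endian limbs are an assumption, and the helper names are made up). A ~400-bit number occupies 13 such limbs, so one multiplication takes 13 * 13 = 169 limb products. As a side note, GMP chooses its multiplication algorithm by operand size and, as far as its documentation describes, also uses a schoolbook-style base case for operands of only a few words, so at this size the gap may owe more to implementation quality than to asymptotic complexity.

```python
LIMB_BITS = 32
BASE = 1 << LIMB_BITS


def to_limbs(n, limb_count):
    return [(n >> (LIMB_BITS * i)) & (BASE - 1) for i in range(limb_count)]


def from_limbs(limbs):
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def schoolbook_mul(a, b):
    """Multiply two little-endian limb lists with the O(n^2) schoolbook method."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            t = result[i + j] + x * y + carry
            result[i + j] = t % BASE   # low 32 bits stay in this position
            carry = t // BASE          # high bits move to the next position
        result[i + len(b)] = carry     # this position has not been written yet
    return result


a = (1 << 399) + 0xDEADBEEF
b = (1 << 398) + 12345
print(from_limbs(schoolbook_mul(to_limbs(a, 13), to_limbs(b, 13))) == a * b)  # prints True
```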

But don't worry, I won't give up so fast! The current situation is that I either need to look into porting better algorithms to the GPU, or do something other than Fermat tests on the GPU to sieve candidates (e.g. trial division by the most common primes).
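
One cheap way to do trial division by the most common primes is to reduce the candidate modulo the product of those primes once and then take a gcd on the small remainder. The Python sketch below illustrates this; the prime list and the function names are choices made for this example, not taken from the miner.

```python
import math

SMALL_PRIMES = (3, 5, 7, 11, 13, 17, 19, 23, 29)
PRODUCT = math.prod(SMALL_PRIMES)  # 3234846615, still fits in 32 bits


def has_small_factor(n):
    """True if n shares a factor with any prime in SMALL_PRIMES."""
    # gcd(n, PRODUCT) equals gcd(n mod PRODUCT, PRODUCT), so only one
    # big-number reduction is needed; the gcd runs on word-sized values.
    return math.gcd(n % PRODUCT, PRODUCT) != 1


def fermat_test(n):
    return pow(2, n - 1, n) == 1


def probable_primes(candidates):
    """Run the expensive Fermat test only on candidates that survive the prefilter."""
    return [n for n in candidates if not has_small_factor(n) and fermat_test(n)]


m127 = 2**127 - 1  # a known Mersenne prime
print(probable_primes([m127, 3 * m127, 31 * 37]) == [m127])  # prints True
```

The check would also flag the small primes themselves, which does not matter for candidates that are hundreds of bits long.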

Anybody with a better GPU than the Geforce 570 TI I own, please test this! My stats (base version is still hp4):

2013-07-24 21:53:38 primemeter     24303 prime/h    490729 test/h        47 5-chains/h

prime/h and test/h seem to fluctuate enormously and seem rather meaningless. As most tests are on the GPU, I have no idea whether this even measures the tests correctly. The 5-chains figure is accurate, though.

You have to use setgenerate true 1, i.e. one CPU thread for mining. 
paulthetafy
Hero Member
July 25, 2013, 01:13:24 PM
 #92

Good job primedigger! Just looking at your 5-chains score, I think that means this is slower than most CPUs though, right? My crappy 4-core AMD does 400-800 5-chains/hour (around 3000 PPS).
primedigger (OP)
Member
July 25, 2013, 01:20:57 PM
 #93

Yes, it's an extremely crappy score, but it's a start. The high performance client does a good job of squeezing as much as possible out of your CPU though!
K1773R
Legendary
July 25, 2013, 01:31:53 PM
Last edit: July 25, 2013, 01:44:27 PM by K1773R
 #94

running current git (b0c7062f3925482935f5eb352b17737d21b95c5b) and I can't see any usage of my GPU: neither heat nor used memory increases when using the Qt client. Is there anything special to activate so it mines with the GPU? I've got a powerful GPU to test with ;)

EDIT:
Code:
2013-07-25 13:35:43 primemeter         0 prime/h   34261932 test/h         0 5-chains/h
seems the miner thread which should launch the CUDA is borked?

EDIT2:
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, the launch timed out and was terminated
from debug.log
After the message it segfaults; going to debug with gdb ;)
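
A note on the error text: "the launch timed out and was terminated" is what CUDA reports when a kernel runs longer than the display driver's watchdog allows, which can happen on a GPU that also drives a screen. Whether that is the actual cause here is not established in this thread. One possible workaround, sketched below in Python, is to split a batch into several shorter launches. The rate of 10 tests per second is the figure reported earlier for a different card, and the 2-second budget is a placeholder; both are assumptions for illustration, as are the function names.

```python
def chunk_size_for_budget(tests_per_second, budget_seconds):
    """Largest number of tests expected to finish within the time budget (at least 1)."""
    return max(1, int(tests_per_second * budget_seconds))


def split_into_launches(candidates, chunk_size):
    """Cut the candidate list into consecutive chunks, one per kernel launch."""
    return [candidates[i:i + chunk_size] for i in range(0, len(candidates), chunk_size)]


candidates = list(range(2400))            # "Have 2400 candidates after main loop"
size = chunk_size_for_budget(10, 2.0)     # assumed rate and assumed time budget
launches = split_into_launches(candidates, size)
print(size, len(launches))                # prints 20 120
```

At the assumed rate, testing all 2400 candidates in one launch would take around 240 seconds, far beyond a typical watchdog limit of a few seconds.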

[GPG Public Key]
BTC/DVC/TRC/FRC: 1K1773RbXRZVRQSSXe9N6N2MUFERvrdu6y ANC/XPM AK1773RTmRKtvbKBCrUu95UQg5iegrqyeA NMC: NK1773Rzv8b4ugmCgX789PbjewA9fL9Dy1 LTC: LKi773RBuPepQH8E6Zb1ponoCvgbU7hHmd EMC: EK1773RxUes1HX1YAGMZ1xVYBBRUCqfDoF BQC: bK1773R1APJz4yTgRkmdKQhjhiMyQpJgfN
primedigger (OP)
Member
July 25, 2013, 01:45:44 PM
 #95

You can also run it with -printmining -printtoconsole to see that output directly. Could you compile the CUDA portion with -G -g (change the Qt project file where it invokes nvcc) and give me the output of cuda-memcheck?

You can also #define CUDA_DEBUG in the .cu file to see the GPU printfs on the console.
Vorksholk
Legendary
July 25, 2013, 01:47:59 PM
 #96

If someone could give me some specific compilation directions (or a Windows binary!) I can test on a 780. :)

VeriBlock: Securing The World's Blockchains Using Bitcoin
https://veriblock.org
K1773R
Legendary
July 25, 2013, 01:48:56 PM
Last edit: July 25, 2013, 02:00:26 PM by K1773R
 #97

was already running with -g, just waiting for the "Cuda start!" message; stopped it now and am recompiling with -D CUDA_DEBUG
EDIT: it's up and running, waiting for the CUDA init + crash ;)
EDIT2: why does it take so long until the miner starts the CUDA thread? that seems stupid :S
EDIT3: here we go, it crashed :)
debug.log
Code:
Have 2400 candidates after main loop
Cuda start!
Cuda error: cudaMemcpy: cudaMemcpyDeviceToHost, unspecified launch failure
stdout
Code:
[0] start! 
sizeof(struct) = 400
mpz_print:mpz_capacity: 0
[0] string candidate is 
[0] N is: mpz_capacity: 30 ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
[0] E is: mpz_capacity: 30 fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffe
gdb: dont want to spam, sending per PM and message too big -.-

TheSwede75
Full Member
July 25, 2013, 02:11:30 PM
 #98

More than willing to help perform tests as instructed if a Windows binary is posted. Got an old GTX 475 rattling around that I could put to work.
primedigger (OP)
Member
July 25, 2013, 03:27:25 PM
 #99

Please check that you're using the latest SDK. I also encountered memory problems with CUDA 5.0; I'm using 5.5 now, which works for me.
Schleicher
Hero Member
July 25, 2013, 04:14:24 PM
 #100

Would it make any difference if we used __restrict__ pointers in the CUDA code?
