Bitcoin Forum
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214322 times)
djm34
March 21, 2017, 06:08:15 PM
 #901

Quote:
Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.
(If you don't specify a stream, they both go to the default stream and are executed serially.)
Serial execution: one stream (the default stream).
Parallel execution: two streams.
Make sure that the serial-stream kernel code uses async calls like cudaallocasync, etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running Round 0 with 32 threads per block or fewer should be enough...

It's not that simple; in most cases you only get partial overlap. It all depends on GPU usage. If one kernel alone uses 100% of the GPU (or close to it), the other will just wait for resources to become available. The only way to get two kernels to run in parallel is if each kernel uses only 50% of the GPU. So in most cases it isn't practical.

lol, why is this discussion happening in an AMD thread? lol
Parallelizing too?
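
To put rough numbers on the partial-overlap point, here is a toy Python model. It is not how a real GPU scheduler works: it assumes kernel A is never slowed down and kernel B scales linearly with whatever share of the GPU A leaves free. `makespan` and its parameters are made-up names for illustration.

```python
def makespan(a_util, a_time, b_util, b_time):
    """Total time to finish kernels A and B when B only gets the GPU share A leaves free.

    a_util, b_util: fraction of the GPU each kernel wants (0 < util <= 1).
    a_time, b_time: run time of each kernel when it has the GPU to itself.
    Assumption: A is never slowed down; B scales linearly with the share it gets.
    """
    spare = max(0.0, 1.0 - a_util)
    b_rate = min(1.0, spare / b_util)   # B's relative speed while A is running
    b_done = b_rate * a_time            # B's progress (in "alone" time units) during A
    if b_done >= b_time:
        return a_time                   # B finished inside A's run: full overlap
    return a_time + (b_time - b_done)   # the remainder of B runs alone afterwards


print(makespan(1.0, 10, 1.0, 10))  # both kernels want 100% of the GPU
print(makespan(0.5, 10, 0.5, 10))  # both kernels want 50% of the GPU
print(makespan(0.8, 10, 0.5, 10))  # A wants 80%, B wants 50%
```

In this model the 100%/100% case degenerates to serial execution, the 50%/50% case overlaps completely, and anything in between only overlaps partially.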

djm34 facebook page
BTC: 1NENYmxwZGHsKFmyjTc5WferTn5VTFb7Ze
Pledge for neoscrypt ccminer to that address: 16UoC4DmTz2pvhFvcfTQrzkPTrXkWijzXw
zawawa (OP)
March 21, 2017, 06:18:57 PM
 #902

It seems like I need to weave Round 0 into Rounds 1 through 8 explicitly with AMD drivers.
I thought this was an easy task, but, alas, it wasn't...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
March 21, 2017, 06:24:45 PM
 #903

Quote from: djm34 on March 21, 2017, 06:08:15 PM
lol, why is this discussion happening in an AMD thread? lol
Parallelizing too?


I hope this thread will be bipartisan again once I catch up with Claymore's miner and Optiminer.
This AMD Zcash miner competition has been dragging on for way too long...

sp_
March 21, 2017, 06:26:18 PM
 #904

Quote from: djm34 on March 21, 2017, 06:08:15 PM
The only way to get two kernels to run in parallel is if each kernel uses only 50% of the GPU. So in most cases it isn't practical.

Round 0 only takes 20% of the time, so you can run the Round 0 thread at 25% speed and it will still complete its task before the other thread. Rounds 1-8 spend most of their time waiting for memory, so you can reduce the GPU load there as well.
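
The arithmetic behind that claim can be checked in a few lines of Python. This assumes, as the post does, that Rounds 1-8 are not slowed down by sharing the GPU; the variable names are only for illustration.

```python
from fractions import Fraction

round0_share = Fraction(20, 100)  # Round 0 takes 20% of a serial run (from the post)
round0_speed = Fraction(25, 100)  # Round 0 throttled to 25% speed (from the post)

rest_time = 1 - round0_share               # Rounds 1-8, in units of one serial run
round0_time = round0_share / round0_speed  # throttled Round 0, in the same units

print(f"Rounds 1-8: {float(rest_time):.2f}, throttled Round 0: {float(round0_time):.2f}")
fits = round0_time <= rest_time
print("Round 0 fits inside Rounds 1-8:", fits)
```

With these numbers the throttled Round 0 takes exactly as long as Rounds 1-8, so 25% is about the slowest it can run without becoming the bottleneck.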

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW ZILLIQA + dual mining + triple mining): https://github.com/sp-hash/TeamBlackMiner
sp_
March 21, 2017, 06:27:24 PM
 #905

Quote from: djm34 on March 21, 2017, 06:08:15 PM
lol, why is this discussion happening in an AMD thread? lol
Parallelizing too?

Because dual mining works on both NVIDIA and AMD hardware. Claymore has already done it...

djeZo
March 21, 2017, 06:37:49 PM
 #906

Quote from: sp_ on March 21, 2017, 06:26:18 PM
Round 0 only takes 20% of the time, so you can run the Round 0 thread at 25% speed and it will still complete its task before the other thread. Rounds 1-8 spend most of their time waiting for memory, so you can reduce the GPU load there as well.

I just tried it with software-scheduled work (odd thread blocks doing Round 1, even thread blocks doing Round 0) and the results are worse: lower speed. It could be tweaked a bit more with proper parameters, but not by much. As I said, because of the large amount of shared memory needed, only a few thread blocks run on each SM, which lowers occupancy. And let's not forget that even Round 0 does the slowest thing of all: random memory writes, which are hard to coalesce, so there are many memory transactions.
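
A rough sketch of what that experiment looks like, in Python rather than kernel code. The parity split follows the post; the shared-memory sizes and block limit are invented placeholders, not measurements from any real GPU or from this miner.

```python
def role_for_block(block_id):
    """Even thread blocks work on Round 0, odd ones on Round 1 (as in the post)."""
    return "round0" if block_id % 2 == 0 else "round1"


def resident_blocks_per_sm(shared_per_sm, shared_per_block, hw_block_limit):
    """Blocks that fit on one SM when shared memory is the limiting resource."""
    return min(hw_block_limit, shared_per_sm // shared_per_block)


# Hypothetical numbers, chosen only to show the effect.
SHARED_PER_SM = 96 * 1024  # bytes of shared memory per SM (assumed)
HW_BLOCK_LIMIT = 32        # maximum resident blocks per SM (assumed)

small = resident_blocks_per_sm(SHARED_PER_SM, 4 * 1024, HW_BLOCK_LIMIT)
large = resident_blocks_per_sm(SHARED_PER_SM, 40 * 1024, HW_BLOCK_LIMIT)

print([role_for_block(b) for b in range(4)])
print("resident blocks with 4 KiB per block:", small)
print("resident blocks with 40 KiB per block:", large)
```

When only a couple of blocks are resident per SM, splitting them between two rounds leaves each round with very little parallelism to hide memory latency, which would be consistent with the slowdown described.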

sp_
March 21, 2017, 06:59:17 PM
 #907

Quote from: djeZo on March 21, 2017, 06:37:49 PM
I just tried it with software-scheduled work (odd thread blocks doing Round 1, even thread blocks doing Round 0) and the results are worse: lower speed. It could be tweaked a bit more with proper parameters, but not by much. As I said, because of the large amount of shared memory needed, only a few thread blocks run on each SM, which lowers occupancy. And let's not forget that even Round 0 does the slowest thing of all: random memory writes, which are hard to coalesce, so there are many memory transactions.

Round 1 is not slow enough, so if you divide the load 50/50, Round 1 will run at Round 0 speed (slower).

Try it with two buffers and two streams, and let Round 0 run with very few threads (to emulate 25% GPU load). I've already outlined the pseudocode.



sp_
March 21, 2017, 07:17:11 PM
Last edit: March 21, 2017, 07:27:46 PM by sp_
 #908

But you're probably lazy.

Try running Round 0 at 1/5 of the intensity on all threads < 32, at the same time as Round 1. You also need to increase the total number of threads in the kernel by 32 and fix the indexing. You might need to remove 75% of the blake2s instructions as well to emulate the effect (but keep the memory reads and writes in Round 0).

How much slower is Round 1?
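
The indexing change being suggested could look roughly like this. It is a Python sketch of the thread-to-work mapping only: `ROUND0_THREADS`, `assign` and the launch size are made-up names and values, and the 1/5 intensity throttling is not modelled.

```python
ROUND0_THREADS = 32  # the first 32 threads are reserved for Round 0 (from the post)


def assign(thread_id):
    """Map a global thread id to (round, index within that round's work)."""
    if thread_id < ROUND0_THREADS:
        return ("round0", thread_id)
    # Shift the index so Round 1 still sees ids starting at 0.
    return ("round1", thread_id - ROUND0_THREADS)


original_round1_threads = 256  # hypothetical launch size before the change
total_threads = original_round1_threads + ROUND0_THREADS  # "increase ... by 32"

work = [assign(t) for t in range(total_threads)]
round1_ids = [index for name, index in work if name == "round1"]
print(work[0], work[31], work[32], work[-1])
```

Round 1 keeps the same index range it had before the extra 32 threads were added, which is the point of "fix the indexing".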

nerdralph
March 21, 2017, 07:28:58 PM
 #909

Quote from: zawawa on March 21, 2017, 06:18:57 PM
It seems like I need to weave Round 0 into Rounds 1 through 8 explicitly with AMD drivers.
I thought this was an easy task, but, alas, it wasn't...

Did you try:

clEnqueueNDRangeKernel(q1, round0, ...)
clFinish(q1);
clEnqueueNDRangeKernel(q2, round0, ...)
clEnqueueNDRangeKernel(q1, round1, ...)
...
zawawa (OP)
March 22, 2017, 04:51:53 AM
 #910

Quote from: nerdralph on March 21, 2017, 07:28:58 PM
Did you try:

clEnqueueNDRangeKernel(q1, round0, ...)
clFinish(q1);
clEnqueueNDRangeKernel(q2, round0, ...)
clEnqueueNDRangeKernel(q1, round1, ...)
...


I essentially did the same thing with two threads, two queues, and a mutex, but that didn't work.
The original idea seems to work, though. I'm almost done with my implementation.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
niksdt101
March 22, 2017, 05:06:21 AM
 #911

Can someone post the URL for the latest build? The thread is pretty big. I'm looking for Windows x64.
lexele
March 22, 2017, 06:59:13 AM
 #912

It's in zawawa's signature, just before your post.
zawawa (OP)
March 22, 2017, 09:42:26 AM
 #913

There was a problem with storing slot data for Round 0, but I should be able to fix that pretty easily by adding an extra buffer and row counters. It's getting pretty late, so I will save the fun for tomorrow. I cannot wait to enjoy the full speed of ZEC mining with my own miner...

zawawa (OP)
March 22, 2017, 10:13:24 PM
 #914

The code is working now. I just need to tweak the parameters, I think...

zawawa (OP)
March 22, 2017, 10:50:24 PM
 #915

My current algo splits Round 0 into four chunks and assigns each of them to one of the other Equihash rounds.
These chunks are then processed in the background while slot data is loaded into the LDS.
The VALU is pretty much doing nothing during those global memory reads, so there shouldn't be any performance penalty in theory.
However, for some unknown reason, Rounds 3 and 4 become much slower than Rounds 1 and 2 with this optimization.
This result is rather counterintuitive, as the former are supposed to be shorter than the latter.
What I'm going to do next is assign the four chunks to Rounds 1, 2, 7, and 8.
As they say: "In theory, theory and practice are the same. In practice, they are not."
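
As a sketch of the chunking described above: split Round 0's work range into four roughly equal pieces and attach each one to a host round. The host rounds 1, 2, 7 and 8 come from the post; the function name and the item count are placeholders, and nothing here reflects the miner's actual kernel code.

```python
def split_into_chunks(total_items, num_chunks):
    """Return (start, end) half-open ranges that cover total_items as evenly as possible."""
    base, extra = divmod(total_items, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        ranges.append((start, start + size))
        start += size
    return ranges


HOST_ROUNDS = [1, 2, 7, 8]  # rounds that carry a Round 0 chunk (from the post)
ROUND0_ITEMS = 1_000_002    # hypothetical amount of Round 0 work

schedule = dict(zip(HOST_ROUNDS, split_into_chunks(ROUND0_ITEMS, len(HOST_ROUNDS))))
for host_round, (start, end) in schedule.items():
    print(f"Round {host_round}: also processes Round 0 items [{start}, {end})")
```

Changing which rounds host the chunks is then a one-line edit to `HOST_ROUNDS`, which makes it cheap to try different assignments while profiling.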

sp_
March 23, 2017, 03:05:28 AM
 #916

You need to profile each round and check the total time used. I believe Rounds 7 and 8 are the fastest rounds, so you need to add fewer blake2s instructions there.

zawawa (OP)
March 23, 2017, 01:38:25 PM
 #917

Quote from: sp_ on March 23, 2017, 03:05:28 AM
You need to profile each round and check the total time used. I believe Rounds 7 and 8 are the fastest rounds, so you need to add fewer blake2s instructions there.

Yeah, I was measuring the duration of each Equihash round with CodeXL, but the results are pretty unpredictable. This whole dual mining thing is actually very complicated, as code inserted in the background has unintended effects and interferes with the foreground threads. I would say Claymore deserves respect just for the fact that he was able to pull it off. I have tried various ways to parallelize two Equihash runs, with better results. We will see.

zawawa (OP)
March 24, 2017, 01:44:27 AM
 #918

I think I figured out why the performance of the new code is not as good as expected.
It is because the L2 cache gets contaminated by Round 0 in the background.
Very tricky, huh.
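
The cache contamination effect can be illustrated with a toy LRU cache in Python. Everything here is an assumption made for the demo: real GPU L2 caches are not plain LRU, and the capacity, working-set size and access pattern are invented.

```python
from collections import OrderedDict


def foreground_hit_ratio(accesses, capacity):
    """Run an LRU cache over (owner, address) accesses; return the hit ratio of owner 'fg'."""
    cache = OrderedDict()
    hits = total = 0
    for access in accesses:
        is_fg = access[0] == "fg"
        total += is_fg
        if access in cache:
            cache.move_to_end(access)      # mark as most recently used
            hits += is_fg
        else:
            cache[access] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total


CAPACITY = 100    # cache lines (hypothetical)
WORKING_SET = 64  # lines the foreground round keeps re-reading (hypothetical)
PASSES = 10

fg_only = [("fg", i) for _ in range(PASSES) for i in range(WORKING_SET)]

mixed = []
for n, access in enumerate(fg_only):
    mixed.append(access)
    mixed.append(("bg", n))  # the background streams through a new address every time

alone = foreground_hit_ratio(fg_only, CAPACITY)
shared = foreground_hit_ratio(mixed, CAPACITY)
print(f"foreground alone: {alone:.2f}, with background traffic: {shared:.2f}")
```

In this model the foreground working set fits in the cache on its own, but the streaming background traffic pushes every line out before it is reused.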

zawawa (OP)
March 24, 2017, 02:36:36 AM
 #919

GG is finally running faster with the parallelized Round 0.
I fused Round 0 with Rounds 7 and 8 to alleviate cache contamination and to improve the cache hit ratio for the next Round 1. I could even merge it with the solution-searching kernel for better results.
Good stuff.

sp_
March 24, 2017, 06:41:38 AM
 #920

Quote from: zawawa on March 24, 2017, 02:36:36 AM
GG is finally running faster with the parallelized Round 0.
I fused Round 0 with Rounds 7 and 8 to alleviate cache contamination and to improve the cache hit ratio for the next Round 1. I could even merge it with the solution-searching kernel for better results.
Good stuff.

So how much faster?
