Bitcoin Forum
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214332 times)
zawawa (OP)
March 17, 2017, 08:17:26 PM
 #881

Quote
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nerdralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized this way.
Now let me fix the kernel patch one more time...

Quote
Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefit of using the full 64KB is offset by slower GDS access caused by contention with 4 instances of the kernel running.

Quote
Pretty sure both Optiminer and Claymore are running two kernel threads.

It looks that way now... I think the Data Share unit is overloaded with my current implementation of Equihash.
What to do, what to do...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
ioglnx
March 17, 2017, 08:23:09 PM
 #882

Take a deep breath and take two steps back.

GTX 1080Ti rocks da house... seriously... this card is a beast³
Owning by now 18x GTX1080Ti :-D @serious love of efficiency
zawawa (OP)
March 19, 2017, 10:17:22 AM
 #883

Quote
Take a deep breath and take two steps back.


That's hard to do, though... I'm getting 260 sol/s with stock RX 480 right now.
Got to check everything one more time.

joaocha
March 20, 2017, 10:23:14 PM
 #884

Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
zawawa (OP)
March 21, 2017, 12:52:54 AM
 #885

After I tried everything with my Equihash kernel, I reached the conclusion that the current bottleneck is not in my kernel but elsewhere.
Sure enough, I found that a considerable amount of CPU time was being spent in sgminer's helper functions.
I don't think anybody has touched them since the super-nice folks at Genesis Mining ported SA's old kernel to sgminer-gm.
Let me see...

zawawa (OP)
March 21, 2017, 01:11:02 AM
 #886

Quote
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

I have been thinking about that for quite some time now.
I will wrap up Equihash optimizations once I'm done with helper functions.

zawawa (OP)
March 21, 2017, 08:57:15 AM
 #887

I'm trying to hook Linux system calls from user space so that GG can access a larger GDS segment without a kernel patch.
The work never ends...

sp_
March 21, 2017, 09:03:23 AM
Last edit: March 21, 2017, 09:17:49 AM by sp_
 #888

Quote
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the BLAKE2b pass (round 0) be removed completely? Could it be dual mined with the memory accesses? I am not talking about 2 threads, but about one thread with round0 merged into the other rounds.

The miner would need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0), merged into the round1-round8 code, per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW ZILLIQA + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
zawawa (OP)
March 21, 2017, 09:19:03 AM
 #889


I must admit that this is a brilliant idea. Thank you so much for sharing it!

laik2
March 21, 2017, 12:22:46 PM
 #890

Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by 2 different people...

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
sp_
March 21, 2017, 12:31:23 PM
 #891


I just pointed the open-source development in the right direction. Time to give team Claymore some open-source competition...

nerdralph
March 21, 2017, 01:52:27 PM
 #892

Quote
I must admit that this is a brilliant idea. Thank you so much for sharing it!

Meh.  ZEC uses a truncated blake2 using 2x200 bits out of 512.  I doubt you'll find a way to re-use the blake calculations for another algo like dcr or sia.

To optimize round0, use the same idea as bitcoin sha-256 optimization, by looking for parts of the algorithm that can be skipped.  For example, since the last 112 bits are ignored, you might be able to skip some parts of the blake algo.  And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block.  Since you're already building a custom llvm, you can probably get the kernel compile and dispatch time down to a few ms.

p.s.  Here's some bedtime reading for you on bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
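
A minimal sketch of the "don't recompute what stays constant" idea, using midstate reuse at run time rather than the compile-time kernel generation suggested above. It assumes Zcash-style parameters (n=200, k=9, a 16-byte personalization of "ZcashPoW" plus n and k, a 140-byte header including the nonce, and a 4-byte little-endian index); none of these values come from this thread, and the function names and header bytes are hypothetical placeholders.

```python
import hashlib
import struct

# Assumed Zcash-style Equihash parameters (n=200, k=9); not taken from the thread.
N, K = 200, 9
DIGEST_BYTES = 2 * N // 8  # 50 bytes: two 200-bit strings per hash
PERSON = b"ZcashPoW" + struct.pack("<II", N, K)  # 16-byte personalization


def round0_naive(header_and_nonce: bytes, index: int) -> bytes:
    """Hash the full input from scratch for every index."""
    h = hashlib.blake2b(digest_size=DIGEST_BYTES, person=PERSON)
    h.update(header_and_nonce)
    h.update(struct.pack("<I", index))
    return h.digest()


def make_midstate(header_and_nonce: bytes):
    """Absorb once the part of the input that is constant across indices."""
    h = hashlib.blake2b(digest_size=DIGEST_BYTES, person=PERSON)
    h.update(header_and_nonce)
    return h


def round0_from_midstate(midstate, index: int) -> bytes:
    """Clone the saved state and absorb only the 4-byte index."""
    h = midstate.copy()
    h.update(struct.pack("<I", index))
    return h.digest()


# Placeholder 140-byte header-plus-nonce; not real block data.
header_and_nonce = bytes(range(140))
midstate = make_midstate(header_and_nonce)
hashes = [round0_from_midstate(midstate, i) for i in range(4)]
```

The saved state only has to be rebuilt when the header or nonce changes, so the per-index work shrinks to absorbing four bytes and finalizing. Skipping parts of the compression function because output bits are discarded would need a hand-written BLAKE2b and is not shown here.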

zawawa (OP)
March 21, 2017, 03:24:08 PM
Last edit: March 21, 2017, 03:37:04 PM by zawawa
 #893




I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice that the VALU was idle quite often during memory transfers, but I did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

nerdralph
March 21, 2017, 04:02:53 PM
 #894


Just running multiple instances of the kernel should help; just don't launch them at exactly the same time.  Ideally the 2nd instance should be launched after the first has finished round0.
sp_
March 21, 2017, 04:10:25 PM
 #895


But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes does not. With proper code, you can make sure that this always happens. No need for 5 threads. (NiceHash djeZo kernel on the GTX 1080)

djeZo
March 21, 2017, 04:52:47 PM
 #896



At first I didn't fully understand what you meant, but I think I do now. Your idea is the following: while round0 is being executed, you would like to execute the other rounds in parallel with it but with a different nonce, so that the resources of the card are better utilized (during round0 there are few memory ops and mostly ALU ops, while rounds 1+ have more memory ops and fewer ALU ops).

I had this idea, but here is the problem for CUDA: you would need to be able to launch two kernels at the same time, and I am not talking about launching from various threads, but about actually making the NVIDIA driver execute two kernels in parallel. That is not how CUDA works, to my knowledge. CUDA, at the driver level, will always execute a certain kernel, then move on to the next one.

To achieve parallel solving of rounds, you would need to do it in your own code (e.g. each odd thread block does round0 and each even thread block does round1), but round0 and round1 have different needs, which would lower your occupancy and probably make everything slower (round0 doesn't need shared memory but needs more registers; round1 needs lots of shared memory but fewer registers).

sp_
March 21, 2017, 05:00:52 PM
Last edit: March 21, 2017, 05:18:36 PM by sp_
 #897

Use a double buffer and 2 CUDA streams in parallel.

do
1.launch round0 buffer1 (thread1)
2.launch round1-round8 buffer2 (thread2)
sync
swap buffer pointers
loop

Or permute the rounds so that the rounds that give the most speed are executed in parallel. Here round3-round8 run at the same time as round0.

For example:

do
launch round1-round2 (thread2)
wait for thread2
1.launch round0 buffer1 (thread1)
2.launch round3-round8 buffer2 (thread2)
sync
swap buffer pointers
loop

round0 takes around 20% of the total time
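
The first loop above can be modelled on the host side as follows. This is only a sketch of the ordering (fill one buffer while the other is consumed, sync, swap); it uses two Python worker threads in place of CUDA streams, and `round0`, `rounds1_to_8`, and `pipelined` are hypothetical stand-ins, not code from any miner.

```python
from concurrent.futures import ThreadPoolExecutor


def round0(nonce: int) -> list:
    """Stand-in for the hash-heavy round 0: fill a buffer from a nonce."""
    return [(nonce * 31 + i) % 97 for i in range(8)]


def rounds1_to_8(buffer: list) -> int:
    """Stand-in for the memory-heavy rounds 1-8: reduce a filled buffer."""
    return sum(buffer)


def pipelined(nonces: list) -> list:
    """Overlap round0 of nonce i+1 with rounds 1-8 of nonce i."""
    results = []
    if not nonces:
        return results
    with ThreadPoolExecutor(max_workers=2) as pool:
        ready = round0(nonces[0])  # prologue: fill the first buffer
        for next_nonce in nonces[1:]:
            fill = pool.submit(round0, next_nonce)    # plays the role of stream 1
            solve = pool.submit(rounds1_to_8, ready)  # plays the role of stream 2
            results.append(solve.result())            # sync
            ready = fill.result()                     # swap buffer pointers
        results.append(rounds1_to_8(ready))  # epilogue: drain the last buffer
    return results


nonces = [1, 2, 3, 4]
sequential = [rounds1_to_8(round0(n)) for n in nonces]
overlapped = pipelined(nonces)
```

The pipelined version returns the same results as the sequential one, which is the property a real implementation would also have to preserve. As a rough upper bound, if round0 is 20% of a run and overlaps perfectly with the remaining 80%, each iteration takes about 0.8 of the original time, i.e. at most about a 1.25x gain; contention for registers and memory would reduce that.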




Quote
Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.

On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.

This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.

Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.

http://stackoverflow.com/questions/8473617/are-cuda-kernel-calls-synchronous-or-asynchronous

djeZo
March 21, 2017, 05:49:26 PM
 #898

Quote
Only after the first kernel is finished the second one will execute.

Like I said... it doesn't matter if you have threads, streams, etc. In the end, on the GPU, only one kernel can be executed at a time. Equihash notably gets more speed with several threads because there are many kernels to be executed (from round0 to round9), and between executions there is a pause that the CUDA driver can use to execute a kernel from another thread.

sp_
March 21, 2017, 05:56:31 PM
 #899

2 kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

 (if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that a kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or less should be enough...

djeZo
March 21, 2017, 06:02:47 PM
 #900


Source of these claims?
