Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!

zawawa (OP)

Sr. Member

Miner Developer

Re: Gateless Gate: zawawa's open-source ZEC/ETH/XMR/PASC miner (250 S/s on RX 480)

March 17, 2017, 08:17:26 PM

#881

Quote from: jstefanop on March 17, 2017, 04:49:04 PM

Quote from: nerdralph on March 17, 2017, 01:03:04 PM

Quote from: zawawa on March 17, 2017, 09:07:54 AM

It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nerdralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized this way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS. It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention between 4 instances of the kernel.

Pretty sure both Optiminer and Claymore are running two kernel threads.

It looks that way now... I think the Data Share unit is overloaded with my current implementation of Equihash.
What to do, what to do...
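A quick way to check the arithmetic behind the two readings above (four instances filling the GDS vs. two instances using half of it). The sketch assumes the 64 KB total and 16 KB per-segment figures quoted in this exchange, and that each kernel instance (CPU thread) gets its own segment; it is an illustration, not a measurement.

```python
# Back-of-the-envelope GDS budget for an RX 480, using the figures quoted above.
# Assumption: every kernel instance (CPU thread) gets its own GDS segment.
GDS_TOTAL_KB = 64      # full GDS size mentioned in the thread
MAX_SEGMENT_KB = 16    # apparent per-segment maximum reported by zawawa

def gds_used_kb(instances: int, segment_kb: int = MAX_SEGMENT_KB) -> int:
    """GDS in use when each instance holds one segment, capped at the total."""
    return min(instances * segment_kb, GDS_TOTAL_KB)

max_instances = GDS_TOTAL_KB // MAX_SEGMENT_KB
print(max_instances)    # 4: four instances fill the GDS
print(gds_used_kb(2))   # 32: two instances use half of it
```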

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

ioglnx

Sr. Member

Fighting mob law and inquisition in this forum

March 17, 2017, 08:23:09 PM

#882

Take a deep breath and take two steps back.

GTX 1080Ti rocks da house... seriously... this card is a beast³
I own 18x GTX 1080 Ti by now :-D Serious love of efficiency.

zawawa (OP)

Sr. Member

Miner Developer

March 19, 2017, 10:17:22 AM

#883

Quote from: ioglnx on March 17, 2017, 08:23:09 PM

Take a deep breath and take two steps back.

That's hard to do, though... I'm getting 260 sol/s with a stock RX 480 right now.
Got to check everything one more time.

joaocha

Full Member

March 20, 2017, 10:23:14 PM

#884

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

zawawa (OP)

Sr. Member

Miner Developer

March 21, 2017, 12:52:54 AM

#885

After I tried everything with my Equihash kernel, I reached the conclusion that the current bottleneck is not in my kernel but elsewhere.
Surely enough, I found that a considerable amount of CPU time was spent in sgminer's helper functions.
I don't think anybody has touched them since the super-nice folks at Genesis Mining ported SA's old kernel to sgminer-gm.
Let me see...

zawawa (OP)

Sr. Member

Miner Developer

March 21, 2017, 01:11:02 AM

#886

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

I have been thinking about that for quite some time now.
I will wrap up Equihash optimizations once I'm done with helper functions.

zawawa (OP)

Sr. Member

Miner Developer

March 21, 2017, 08:57:15 AM

#887

I'm trying to hook Linux system calls from user space so that GG can access a larger GDS segment without a kernel patch.
The work never ends...

sp_

Legendary

Team Black developer

March 21, 2017, 09:03:23 AM
Last edit: March 21, 2017, 09:17:49 AM by sp_

#888

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the BLAKE2b pass (round 0) be removed completely, i.e. "dual mined" alongside the memory accesses? I am not talking about two threads, but about one thread with round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one BLAKE2b (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (in djeZo's open-source ZEC kernel).
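The one-iteration lag described above can be sketched as plain bookkeeping. In this sketch, `round0` and `rounds_1_to_8` are hypothetical callables standing in for the GPU work; in a real merged kernel the two pieces of work in each iteration would be interleaved within the same thread, whereas Python simply runs them one after the other. The point is only that each solution set belongs to the nonce whose round0 data was built one iteration earlier.

```python
from typing import Callable, Iterable, List, Optional, Tuple

def lagged_search(nonces: Iterable[int],
                  round0: Callable[[int], str],
                  rounds_1_to_8: Callable[[str], str]) -> List[Tuple[int, str]]:
    """Each iteration builds round0 data for the current nonce and runs rounds 1-8
    on the data built in the previous iteration (a one-step software pipeline)."""
    results: List[Tuple[int, str]] = []
    pending: Optional[Tuple[int, str]] = None  # (nonce, round0 output) from last iteration
    for nonce in nonces:
        fresh = (nonce, round0(nonce))         # "background" round0 for this nonce
        if pending is not None:                # rounds 1-8 on the previous nonce's data
            results.append((pending[0], rounds_1_to_8(pending[1])))
        pending = fresh
    if pending is not None:                    # drain: the last nonce still needs rounds 1-8
        results.append((pending[0], rounds_1_to_8(pending[1])))
    return results

out = lagged_search(range(3), lambda n: f"r0({n})", lambda t: f"sols[{t}]")
print(out)  # [(0, 'sols[r0(0)]'), (1, 'sols[r0(1)]'), (2, 'sols[r0(2)]')]
```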

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner

zawawa (OP)

Sr. Member

Miner Developer

March 21, 2017, 09:19:03 AM

#889

Quote from: sp_ on March 21, 2017, 09:03:23 AM

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the BLAKE2b pass (round 0) be removed completely, i.e. "dual mined" alongside the memory accesses? I am not talking about two threads, but about one thread with round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one BLAKE2b (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (in djeZo's open-source ZEC kernel).

I must admit that this is a brilliant idea. Thank you so much for sharing it!

laik2

Sr. Member

March 21, 2017, 12:22:46 PM

#890

Quote from: sp_ on March 21, 2017, 09:03:23 AM

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by two different people...

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/

sp_

Legendary

Team Black developer

March 21, 2017, 12:31:23 PM

#891

Quote from: laik2 on March 21, 2017, 12:22:46 PM

Quote from: sp_ on March 21, 2017, 09:03:23 AM

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by two different people...

I just pointed open-source development in the right direction. Time to give team Claymore some open-source competition...

nerdralph

Sr. Member

March 21, 2017, 01:52:27 PM

#892

Quote from: zawawa on March 21, 2017, 09:19:03 AM

Quote from: sp_ on March 21, 2017, 09:03:23 AM

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the BLAKE2b pass (round 0) be removed completely, i.e. "dual mined" alongside the memory accesses? I am not talking about two threads, but about one thread with round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one BLAKE2b (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (in djeZo's open-source ZEC kernel).

I must admit that this is a brilliant idea. Thank you so much for sharing it!

Meh. ZEC uses a truncated BLAKE2b, using 2x200 bits out of 512. I doubt you'll find a way to reuse the Blake calculations for another algo like DCR or Sia.

To optimize round0, use the same idea as Bitcoin SHA-256 optimization: look for parts of the algorithm that can be skipped. For example, since the last 112 bits (512 - 2x200) are ignored, you might be able to skip some parts of the Blake algorithm. And since everything but the nonce is constant for ~2.5 minutes (one ZEC block), you can probably move some of the calculations to compile time and generate a new kernel for each new block. Since you're already building a custom LLVM, you can probably get the kernel compile and dispatch time down to a few ms.

p.s. Here's some bedtime reading for you on bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
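A CPU-side illustration of the "hoist the constant part" idea, using Python's `hashlib`. It assumes Zcash's Equihash parameters (n=200, k=9), the `ZcashPoW` BLAKE2b personalization with n and k appended as little-endian 32-bit integers, and an input of a 108-byte header prefix plus a 32-byte nonce, followed by a 4-byte little-endian index. `copy()` reuses the already-absorbed header prefix; a GPU kernel would instead hard-code precomputed state, so this only shows that reusing the prefix gives identical digests, not how much work it saves.

```python
import hashlib
import struct

N, K = 200, 9
HASHES_PER_CALL = 512 // N                       # 2 Equihash outputs per BLAKE2b call
DIGEST_BYTES = HASHES_PER_CALL * N // 8          # 50 bytes = 400 of BLAKE2b's 512 bits
PERSON = b"ZcashPoW" + struct.pack("<II", N, K)  # 16-byte personalization

def prefix_state(header108: bytes):
    """Absorb the header prefix, which stays constant while only the nonce changes."""
    h = hashlib.blake2b(digest_size=DIGEST_BYTES, person=PERSON)
    h.update(header108)
    return h

def hash_for(prefix, nonce32: bytes, g: int) -> bytes:
    h = prefix.copy()                 # reuse the absorbed prefix instead of redoing it
    h.update(nonce32)
    h.update(struct.pack("<I", g))    # little-endian 32-bit index
    return h.digest()

def hash_from_scratch(header108: bytes, nonce32: bytes, g: int) -> bytes:
    h = hashlib.blake2b(digest_size=DIGEST_BYTES, person=PERSON)
    h.update(header108 + nonce32 + struct.pack("<I", g))
    return h.digest()

header = bytes(range(108))            # dummy header for the example
state = prefix_state(header)
nonce = (7).to_bytes(32, "little")
assert hash_for(state, nonce, 3) == hash_from_scratch(header, nonce, 3)
print(512 - 8 * DIGEST_BYTES)         # 112 bits of the 512-bit final state are dropped
```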

zawawa (OP)

Sr. Member

Miner Developer

March 21, 2017, 03:24:08 PM
Last edit: March 21, 2017, 03:37:04 PM by zawawa

#893

Quote from: nerdralph on March 21, 2017, 01:52:27 PM

Quote from: zawawa on March 21, 2017, 09:19:03 AM

Quote from: sp_ on March 21, 2017, 09:03:23 AM

Quote from: joaocha on March 20, 2017, 10:23:14 PM

Maybe you could build a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.

Could the BLAKE2b pass (round 0) be removed completely, i.e. "dual mined" alongside the memory accesses? I am not talking about two threads, but about one thread with round0 merged into the other rounds.

The miner will need to do one nonce search (round1-round8) and one BLAKE2b (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data.

On NVIDIA, round0 takes 20% of the time (in djeZo's open-source ZEC kernel).

I must admit that this is a brilliant idea. Thank you so much for sharing it!

I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but about executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice that the VALU was idle quite often during memory transfers, but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.

nerdralph

Sr. Member

March 21, 2017, 04:02:53 PM

#894

Quote from: zawawa on March 21, 2017, 03:24:08 PM

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.
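An idealized upper bound for what staggered overlap can buy. This toy model assumes round0 uses only the ALUs, rounds 1-8 use only memory bandwidth, and the two overlap perfectly, so in steady state each run costs the longer of the two phases rather than their sum. The 20% input is the figure quoted earlier in the thread for djeZo's kernel on NVIDIA, not a measurement on an RX 480.

```python
def overlapped_speedup(round0_frac: float) -> float:
    """Idealized speedup from overlapping round0 of one run with rounds 1-8 of another.

    Assumes round0 only uses the ALUs, rounds 1-8 only use memory bandwidth, and the
    two overlap perfectly, so in steady state each run costs the longer of the two
    phases instead of their sum (normalized to 1.0).
    """
    if not 0.0 <= round0_frac <= 1.0:
        raise ValueError("round0_frac must be between 0 and 1")
    return 1.0 / max(round0_frac, 1.0 - round0_frac)

print(round(overlapped_speedup(0.20), 2))   # 1.25
```

Launching the second instance one round0 duration after the first is what lets the phases line up this way; real kernels contend for both resources, so the actual gain should be expected to be lower.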

sp_

Legendary

Team Black developer

March 21, 2017, 04:10:25 PM

#895

Quote from: nerdralph on March 21, 2017, 04:02:53 PM

Quote from: zawawa on March 21, 2017, 03:24:08 PM

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.

But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps and sometimes doesn't. With proper code, you can make sure that this always happens. No need for 5 threads. (NiceHash djeZo kernel on the GTX 1080)

djeZo

Hero Member

March 21, 2017, 04:52:47 PM

#896

Quote from: sp_ on March 21, 2017, 04:10:25 PM

Quote from: nerdralph on March 21, 2017, 04:02:53 PM

Quote from: zawawa on March 21, 2017, 03:24:08 PM

Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.

At first, I didn't fully understand what you meant, but I think I do now. Your idea is the following: while round0 is being executed, you would like to execute other rounds in parallel with it, but with a different nonce, so that the card's resources are better utilized (round0 has few memory ops and mostly ALU ops, while rounds 1+ have more memory ops and fewer ALU ops). I had this idea too, but here is the problem for CUDA: you would need to be able to launch two kernels at the same time, and I am not talking about launching them from different threads, but about actually making the NVIDIA driver execute two kernels in parallel. To my knowledge, that is not how CUDA works. At the driver level, CUDA will always execute one kernel, then move on to the next one. To achieve parallel solving of rounds, you would need to do it in your own code (e.g. each odd block does round0 and each even block does round1), but the different needs of round0 and round1 would lower your occupancy and probably make everything slower (round0 doesn't need shared memory but needs more registers; round1 needs lots of shared memory but fewer registers).
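The occupancy concern can be made concrete with a toy model. The per-SM limits below are assumed compute capability 6.1 values (as on a GTX 1080); the per-block register and shared-memory needs of round0 and round1 are hypothetical numbers chosen only to show the effect, and register allocation granularity is ignored. A fused kernel whose blocks can play either role must reserve the worst case of both, while separate round0 and round1 blocks can share an SM using complementary resources.

```python
from dataclasses import dataclass
from typing import List

# Per-SM limits assumed for a compute capability 6.1 part such as the GTX 1080.
REGS_PER_SM = 65536
SMEM_PER_SM = 96 * 1024          # bytes
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32

@dataclass(frozen=True)
class BlockShape:
    threads: int
    regs_per_thread: int
    smem_bytes: int

    @property
    def regs(self) -> int:
        return self.threads * self.regs_per_thread

def fits(blocks: List[BlockShape]) -> bool:
    """Can these blocks be resident on one SM at the same time?"""
    return (sum(b.regs for b in blocks) <= REGS_PER_SM
            and sum(b.smem_bytes for b in blocks) <= SMEM_PER_SM
            and sum(b.threads for b in blocks) <= MAX_THREADS_PER_SM
            and len(blocks) <= MAX_BLOCKS_PER_SM)

def resident(shape: BlockShape) -> int:
    """How many identical blocks of this shape fit on one SM."""
    n = 0
    while fits([shape] * (n + 1)):
        n += 1
    return n

def fused(a: BlockShape, b: BlockShape) -> BlockShape:
    """One kernel whose blocks may play either role must reserve the worst case of both."""
    return BlockShape(max(a.threads, b.threads),
                      max(a.regs_per_thread, b.regs_per_thread),
                      max(a.smem_bytes, b.smem_bytes))

# Hypothetical needs: round0 is register-heavy, round1 is shared-memory-heavy.
ROUND0 = BlockShape(threads=256, regs_per_thread=64, smem_bytes=0)
ROUND1 = BlockShape(threads=256, regs_per_thread=16, smem_bytes=32 * 1024)

print(resident(fused(ROUND0, ROUND1)))    # 3 fused blocks per SM
print(fits([ROUND0] * 3 + [ROUND1] * 3))  # True: 6 separate blocks share the SM
```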

Trust thread: https://bitcointalk.org/index.php?topic=392425.0

sp_

Legendary

Team Black developer

March 21, 2017, 05:00:52 PM
Last edit: March 21, 2017, 05:18:36 PM by sp_

#897

Use a double buffer and two CUDA streams in parallel.

do
  1. launch round0 on buffer1 (thread1)
  2. launch round1-round8 on buffer2 (thread2)
  sync
  swap buffer pointers
loop

Or permute the rounds so that the rounds that give the most speedup are executed in parallel. Here, for example, round3-round8 run at the same time as round0:

do
  launch round1-round2 (thread2)
  wait for thread2
  1. launch round0 on buffer1 (thread1)
  2. launch round3-round8 on buffer2 (thread2)
  sync
  swap buffer pointers
loop

round0 takes around 20% of the total time.
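The first double-buffer loop can be sketched in Python, with a two-worker `ThreadPoolExecutor` standing in for the two streams. `round0` and `rounds_1_to_8` are hypothetical placeholders that fill and read plain lists; because of Python's GIL this shows only the buffer hand-off and synchronization structure, not real GPU concurrency. In CUDA, the two submissions would correspond to kernel launches on two separate non-default streams, followed by a synchronize before the pointer swap.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

def round0(nonce: int, buf: List[int]) -> None:
    """Hypothetical stand-in: fill a buffer with this nonce's 'hash table'."""
    buf[:] = [nonce * 10 + i for i in range(4)]

def rounds_1_to_8(buf: List[int]) -> int:
    """Hypothetical stand-in: consume a filled buffer and return a 'result'."""
    return sum(buf)

def double_buffered(nonces: List[int]) -> List[Tuple[int, int]]:
    results: List[Tuple[int, int]] = []
    if not nonces:
        return results
    buf_a: List[int] = []   # read by rounds 1-8 in the current iteration
    buf_b: List[int] = []   # written by round0 for the next nonce
    round0(nonces[0], buf_a)                       # prime the pipeline
    with ThreadPoolExecutor(max_workers=2) as pool:  # two workers ~ two streams
        for i, nonce in enumerate(nonces):
            nxt = nonces[i + 1] if i + 1 < len(nonces) else None
            f0 = pool.submit(round0, nxt, buf_b) if nxt is not None else None
            f1 = pool.submit(rounds_1_to_8, buf_a)  # touches a different buffer than f0
            results.append((nonce, f1.result()))    # sync both "streams" ...
            if f0 is not None:
                f0.result()
            buf_a, buf_b = buf_b, buf_a             # ... then swap buffer pointers
    return results

print(double_buffered([0, 1]))  # [(0, 6), (1, 46)]
```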

Quote

Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.

On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.

This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.

Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.

http://stackoverflow.com/questions/8473617/are-cuda-kernel-calls-synchronous-or-asynchronous

djeZo

Hero Member

March 21, 2017, 05:49:26 PM

#898

Quote

Only after the first kernel is finished the second one will execute.

Like I said... it doesn't matter whether you have threads, streams, etc.; in the end, only one kernel can be executed at a time on the GPU. Equihash notably gets more speed with several threads because there are many kernels to execute (from round0 to round9), and between executions there is a pause that the CUDA driver can use to execute another thread's kernel.

sp_

Legendary

Team Black developer

March 21, 2017, 05:56:31 PM

#899

Two kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.

(if you don't specify a stream they both go to the default stream and are executed serially)

Serial execution: one stream (the default stream)
Parallel execution: two streams.

Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, such as threads per block. Running round0 with 32 threads per block or fewer should be enough...

djeZo

Hero Member

March 21, 2017, 06:02:47 PM

#900

Quote from: sp_ on March 21, 2017, 05:56:31 PM

Source of these claims?
