zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 17, 2017, 08:17:26 PM |
|
It seems that the real maximum size of GDS segments for RX 480 is 16KB. It's a little disappointing, but still much better than 4KB without the kernel patch. This number is also consistent with nerdralph's report that Optiminer runs four CPU threads per GPU as GDS utilization can be maximized this way. Now let me fix the kernel patch one more time...
Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS. It's possible (I'd even say probable) the benefits of using a full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.
Pretty sure both Optiminer and Claymore are running two kernel threads.
It looks that way now... I think the Data Share unit is overloaded with my current implementation of Equihash. What to do, what to do...
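For anyone following the numbers, here is a rough sketch of the GDS budget arithmetic discussed above. It only uses the figures quoted in this thread (64KB total, 16KB or 4KB per segment); the function names are made up for illustration and nothing here talks to a driver.

```python
# Figures quoted in the thread; the helper names are hypothetical.
GDS_TOTAL_KB = 64  # full GDS size mentioned above


def max_instances(segment_kb, total_kb=GDS_TOTAL_KB):
    """Number of kernel instances needed to cover the whole GDS,
    assuming each instance gets one segment of segment_kb."""
    return total_kb // segment_kb


def gds_fraction_used(instances, segment_kb, total_kb=GDS_TOTAL_KB):
    """Share of the GDS in use when `instances` kernels each hold one segment."""
    return min(1.0, instances * segment_kb / total_kb)


if __name__ == "__main__":
    for segment_kb in (4, 16):
        print(segment_kb, "KB segments:", max_instances(segment_kb), "instances for full GDS")
    print("2 instances at 16 KB use", gds_fraction_used(2, 16), "of the GDS")
```

With 16KB segments, four instances cover the full 64KB and two instances cover half of it, which is the trade-off between the "four threads" and "two kernel threads" readings above.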
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
ioglnx
Sr. Member
Offline
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
|
|
March 17, 2017, 08:23:09 PM |
|
Take a deep breath and take two steps back.
|
GTX 1080Ti rocks da house... seriously... this card is a beast³ Owning by now 18x GTX1080Ti :-D @serious love of efficiency
|
|
|
zawawa (OP)
|
|
March 19, 2017, 10:17:22 AM |
|
Take a deep breath and take two steps back.
That's hard to do, though... I'm getting 260 sol/s with stock RX 480 right now. Got to check everything one more time.
|
|
|
|
joaocha
|
|
March 20, 2017, 10:23:14 PM |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
|
|
|
|
zawawa (OP)
|
|
March 21, 2017, 12:52:54 AM |
|
After I tried everything with my Equihash kernel, I reached the conclusion that the current bottleneck is not in my kernel but elsewhere. Sure enough, I found that a considerable amount of CPU time was spent in sgminer's helper functions. I don't think anybody has touched them since the super-nice folks at Genesis Mining ported SA's old kernel to sgminer-gm. Let me see...
|
|
|
|
zawawa (OP)
|
|
March 21, 2017, 01:11:02 AM |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
I have been thinking about that for quite some time now. I will wrap up Equihash optimizations once I'm done with helper functions.
|
|
|
|
zawawa (OP)
|
|
March 21, 2017, 08:57:15 AM |
|
I'm trying to hook Linux system calls from user space so that GG can access a larger GDS segment without a kernel patch. The work never ends...
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
March 21, 2017, 09:03:23 AM Last edit: March 21, 2017, 09:17:49 AM by sp_ |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
Could the BLAKE2b pass be removed completely (round0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with round0 merged into the other rounds. The miner will need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data. On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).
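To put a number on the idea, here is a back-of-the-envelope sketch of the best case. It assumes round0 of the next run is hidden behind round1-round8 of the current run and that the overlap costs nothing, which a real GPU will not quite deliver. The function name and the `overlap` parameter are made up for illustration; only the 20% figure comes from the post above.

```python
def overlapped_speedup(round0_fraction, overlap=1.0):
    """Best-case speedup from hiding round0 behind the other rounds.

    round0_fraction: share of one serial iteration spent in round0 (0 <= f < 1).
    overlap: share of the hideable time that really runs concurrently (0..1).
    """
    if not (0.0 <= round0_fraction < 1.0 and 0.0 <= overlap <= 1.0):
        raise ValueError("fractions must be in range")
    # At most the shorter of the two phases can be hidden behind the other.
    hidden = overlap * min(round0_fraction, 1.0 - round0_fraction)
    return 1.0 / (1.0 - hidden)


if __name__ == "__main__":
    for overlap in (0.0, 0.5, 1.0):
        print(f"overlap {overlap:.0%}: x{overlapped_speedup(0.2, overlap):.3f}")
```

With round0 at 20% of the time, a perfect overlap caps the gain at 1/0.8, i.e. about 25% more solutions per second; anything that contends for the same units eats into that.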
|
|
|
|
zawawa (OP)
|
|
March 21, 2017, 09:19:03 AM |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
Could the BLAKE2b pass be removed completely (round0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with round0 merged into the other rounds. The miner will need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data. On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).
I must admit that this is a brilliant idea. Thank you so much for sharing it!
|
|
|
|
laik2
|
|
March 21, 2017, 12:22:46 PM |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
Could the BLAKE2b pass be removed completely (round0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with round0 merged into the other rounds. The miner will need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data. On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).
Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by 2 different people...
|
|
|
|
sp_
|
|
March 21, 2017, 12:31:23 PM |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
Could the BLAKE2b pass be removed completely (round0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with round0 merged into the other rounds. The miner will need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0) merged into the round1-round8 code per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data. On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).
Jeeesus, sometimes I think sp_ is a jerk and sometimes normal... I believe either you have a personality disorder or your account is being used by 2 different people...
I just pointed the open-source development in the right direction. Time to give team Claymore some open-source competition...
|
|
|
|
nerdralph
|
|
March 21, 2017, 01:52:27 PM |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
Could the BLAKE2b pass be removed completely (round0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with round0 merged into the other rounds. The miner will need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data. On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).
I must admit that this is a brilliant idea. Thank you so much for sharing it!
Meh. ZEC uses a truncated BLAKE2, using 2x200 bits out of 512. I doubt you'll find a way to re-use the BLAKE calculations for another algo like DCR or SIA. To optimize round0, use the same idea as the Bitcoin SHA-256 optimization: look for parts of the algorithm that can be skipped. For example, since the last 112 bits are ignored, you might be able to skip some parts of the BLAKE algo. And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block. Since you're already building a custom LLVM, you can probably get the kernel compile and dispatch time down to a few ms.
p.s. Here's some bedtime reading for you on Bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
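A small CPU-side sketch of the "don't redo the constant part" idea, using Python's hashlib. It is not the compile-time kernel specialization suggested above, just the same principle at the API level: absorb the constant header once, then only feed the 4-byte index per hash. The parameters (n=200, k=9, a "ZcashPoW" personalization, a 140-byte header plus nonce) are assumptions taken from the Zcash flavour of Equihash rather than from this thread, and `round0_hashes` is a made-up name. Note also that BLAKE2 takes the digest length as a parameter, so the 50-byte output is requested up front rather than cut from a 64-byte digest.

```python
import hashlib
import struct

N, K = 200, 9                      # assumed Equihash parameters (Zcash)
DIGEST_BYTES = 2 * N // 8          # two 200-bit strings per hash = 50 bytes
UNUSED_BITS = 512 - 2 * N          # the 112 bits that are never used

# Assumed personalization layout: b"ZcashPoW" + le32(N) + le32(K) (16 bytes).
PERSON = b"ZcashPoW" + struct.pack("<II", N, K)


def round0_hashes(header_and_nonce, indices):
    """Yield (index, digest) pairs, absorbing the constant prefix only once."""
    base = hashlib.blake2b(digest_size=DIGEST_BYTES, person=PERSON)
    base.update(header_and_nonce)          # same bytes for every index
    for i in indices:
        h = base.copy()                    # reuse the state built so far
        h.update(struct.pack("<I", i))     # only the index changes
        yield i, h.digest()


if __name__ == "__main__":
    dummy_header = bytes(140)              # placeholder input, not a real block header
    for i, digest in round0_hashes(dummy_header, range(2)):
        print(i, len(digest), digest[:8].hex())
```

On a GPU the equivalent would be precomputing the state after the first 128-byte block on the host and passing it to the kernel, but how much that saves depends on the kernel, so treat this only as an illustration.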
|
|
|
|
zawawa (OP)
|
|
March 21, 2017, 03:24:08 PM Last edit: March 21, 2017, 03:37:04 PM by zawawa |
|
Maybe you can do a dual miner. You should focus on Ethash too; give the ideas time to cook in your head, then go back to Equihash.
Could the BLAKE2b pass be removed completely (round0)? Dual mined with the memory accesses? I am not talking about 2 threads, but one thread with round0 merged into the other rounds. The miner will need to do 1 nonce search (round1-round8) and one BLAKE2b pass (round0) per iteration. Since you work on the round0 data from the previous run, the nonce found would be the result of the previous padding data. On NVIDIA, round0 takes 20% of the time (the open-source ZEC djeZo kernel).
I must admit that this is a brilliant idea. Thank you so much for sharing it!
Meh. ZEC uses a truncated BLAKE2, using 2x200 bits out of 512. I doubt you'll find a way to re-use the BLAKE calculations for another algo like DCR or SIA. To optimize round0, use the same idea as the Bitcoin SHA-256 optimization: look for parts of the algorithm that can be skipped. For example, since the last 112 bits are ignored, you might be able to skip some parts of the BLAKE algo. And since everything but the nonce is constant for ~2.5 minutes, you can probably move some of the calculations to compile time and generate a new kernel for each new block. Since you're already building a custom LLVM, you can probably get the kernel compile and dispatch time down to a few ms.
p.s. Here's some bedtime reading for you on Bitcoin mining optimization.
http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
|
|
|
|
nerdralph
|
|
March 21, 2017, 04:02:53 PM |
|
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.
|
|
|
|
sp_
|
|
March 21, 2017, 04:10:25 PM |
|
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.
But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps, and sometimes not. With proper code, you can make sure that this always happens. No need for 5 threads. (NiceHash djeZo kernel on the GTX 1080)
|
|
|
|
djeZo
|
|
March 21, 2017, 04:52:47 PM |
|
I will definitely look into further optimizations for Round 0. I think sp_ was talking not about reusing the Blake calculations but executing Round 0 of the next Equihash run in the background during Rounds 1 through 8. I did notice the VALU was being idle quite often during memory transfers but did not know what to do with it until now. I should be able to implement this idea *pretty* soon.
Just running multiple instances of the kernel should help; just don't launch them at exactly the same time. Ideally the 2nd instance should be launched after the first has finished round0.
But you want to make sure that round1 starts at exactly the same time as round0. Running with multiple threads sometimes helps, and sometimes not. With proper code, you can make sure that this always happens. No need for 5 threads. (NiceHash djeZo kernel on the GTX 1080)
At first, I didn't fully understand what you meant, but I think I do now. Your idea is the following: while round0 is being executed, you would like to execute the other rounds in parallel with round0 but with a different nonce, so that the resources of the card can be better utilized (during round0 there are not many memory ops but mostly ALU ops, and during rounds 1+ there are more memory ops and fewer ALU ops). I had this idea, but here is the problem for CUDA: you would need to be able to launch two kernels at the same time, and I am not talking about launching from various threads, but actually making the NVIDIA driver execute two kernels in parallel. That is not how CUDA works, to my knowledge. CUDA, at the driver level, will always execute a certain kernel, then move on to the next one. To achieve parallel solving of rounds, you would need to do it in code on your own (e.g. say that each odd thread block is doing round0 and each even thread block is doing round1), but the different needs of round0 and round1 would lower your occupancy and probably make everything slower (round0 doesn't need shared memory but needs more registers; round1 needs lots of shared memory but fewer registers).
|
|
|
|
sp_
|
|
March 21, 2017, 05:00:52 PM Last edit: March 21, 2017, 05:18:36 PM by sp_ |
|
Use double buffering and 2 CUDA streams in parallel.
do
  1. launch round0 buffer1 (thread1)
  2. launch round1-round8 buffer2 (thread2)
  sync
  swap buffer pointers
loop
Or permute the rounds so that the round that gives the most speed is executed in parallel. Here round3-round8 is running at the same time as round0, e.g.:
do
  launch round1-round2 (thread2)
  wait for thread2
  1. launch round0 buffer1 (thread1)
  2. launch round3-round8 buffer2 (thread2)
  sync
  swap buffer pointers
loop
round0 takes around 20% of the total time.
Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.
This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.
Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.
http://stackoverflow.com/questions/8473617/are-cuda-kernel-calls-synchronous-or-asynchronous
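The double-buffer loop above can be modelled on the host side roughly like this. To be clear, this is plain Python with two worker threads standing in for the two CUDA streams; there is no GPU code here, and `round0`, `rounds_1_to_8` and `mine` are placeholder names. It only shows the control flow: fill one buffer while the other is being consumed, sync, then swap.

```python
from concurrent.futures import ThreadPoolExecutor


def round0(nonce):
    """Stand-in for the hashing pass: fills a buffer for one nonce."""
    return {"nonce": nonce, "rows": f"round0 data for nonce {nonce}"}


def rounds_1_to_8(buffer):
    """Stand-in for the collision rounds: consumes a filled buffer."""
    return f"solutions for nonce {buffer['nonce']}"


def mine(nonces):
    """Run round0 for nonce i+1 while round1-round8 process nonce i."""
    nonces = list(nonces)
    results = []
    if not nonces:
        return results
    with ThreadPoolExecutor(max_workers=2) as pool:
        ready = round0(nonces[0])                      # prime the pipeline
        for next_nonce in nonces[1:] + [None]:
            # "stream 1": fill the spare buffer for the next nonce, if any
            fill = pool.submit(round0, next_nonce) if next_nonce is not None else None
            # "stream 2": solve on the buffer filled in the previous iteration
            solve = pool.submit(rounds_1_to_8, ready)
            results.append(solve.result())             # sync
            ready = fill.result() if fill else None    # swap buffer pointers
    return results


if __name__ == "__main__":
    print(mine(range(3)))
```

Whether the two phases really overlap on the GPU depends on the hardware having free resources for both kernels, which is exactly what is being argued about in this thread.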
|
|
|
|
djeZo
|
|
March 21, 2017, 05:49:26 PM |
|
Only after the first kernel is finished the second one will execute.
Like I said... it doesn't matter if you have threads, streams etc... in the end, on the GPU, only one kernel can be executed at a time. Equihash notably gets more speed with several threads, because there are many kernels to be executed (from round0 to round9) and between each execution there is a pause that can be used by the CUDA driver to execute another kernel of another thread.
|
|
|
|
sp_
|
|
March 21, 2017, 05:56:31 PM |
|
2 kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.
(if you don't specify a stream they both go to the default stream and are executed serially)
Serial execution: one stream (the default stream). Parallel execution: two streams.
Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, like threads per block. Running round0 with 32 threads per block or less should be enough...
|
|
|
|
djeZo
|
|
March 21, 2017, 06:02:47 PM |
|
2 kernels can of course be executed at the same time, but you need to specify a separate stream for each of them.
(if you don't specify a stream they both go to the default stream and are executed serially)
Serial execution: one stream (the default stream). Parallel execution: two streams.
Make sure that the serial-stream kernel code uses async calls like cudaallocasync etc. You also need to make sure that the kernel isn't using all of the resources on the chip, like threads per block.
Source of these claims?
|
|
|
|
|