Bitcoin Forum
April 24, 2024, 05:43:15 PM *
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214337 times)
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
January 28, 2017, 08:26:14 AM
 #461

It turns out that I really need to reduce bank conflicts in the GDS.
I also need to revisit global syncs with GWS instructions.
I wonder if Claymore already did research on these mostly undocumented features for his other miner programs.
In the meantime, I will upload a preliminary assembly version for GCN1 for testing purposes.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
January 28, 2017, 11:50:51 AM
 #462

Now that I think about it, it is no wonder that bank conflicts are a serious problem, considering that the GDS's 32 banks are shared across all the compute units, unlike the LDS. My next plan is to reduce the number of wavefronts to avoid bank conflicts in the GDS. We will see.
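
As a rough illustration of the bank-conflict concern, here is a minimal sketch of how accesses map onto 32 banks. It assumes each bank is one dword (4 bytes) wide, as documented for the LDS; whether the GDS uses the same layout is an assumption, and the function names are made up for this example.

```python
from collections import Counter

NUM_BANKS = 32        # bank count quoted in the post
BANK_WIDTH_BYTES = 4  # assumed: one dword per bank, as documented for the LDS


def bank_of(byte_address):
    """Bank that a byte address falls into under the assumed layout."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def max_accesses_per_bank(byte_addresses):
    """Worst-case number of accesses landing on a single bank."""
    counts = Counter(bank_of(a) for a in byte_addresses)
    return max(counts.values())


# 32 consecutive dword counters: one access per bank.
consecutive = [4 * i for i in range(32)]
# Counters spaced 128 bytes apart: every access lands on bank 0.
strided = [128 * i for i in range(4)]

print(max_accesses_per_bank(consecutive))
print(max_accesses_per_bank(strided))
```

A result above 1 means at least two accesses would have to be serialized on one bank under this model. Because the GDS is shared by every compute unit, more wavefronts in flight means more addresses competing for the same 32 banks.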

zawawa (OP)
January 28, 2017, 12:55:00 PM
 #463

WTF. This cannot be right.


zawawa (OP)
January 28, 2017, 01:33:50 PM
 #464

This is a profile of the non-GDS version. This looks OK...


zawawa (OP)
January 28, 2017, 01:57:52 PM
 #465

Looks like GG is not taking advantage of the available memory bandwidth of the 7990. No wonder...


zawawa (OP)
January 28, 2017, 02:46:28 PM
 #466

I think I need to reintroduce the variable NR_ROWS_LOG.
I dropped it when I switched to sgminer-gm because the code became too complex and the AMD OpenCL driver crapped out.
I need to keep the code really simple this time around.

nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
January 28, 2017, 04:31:02 PM
 #467

Looks like GG is not taking advantage of the available memory bandwidth of the 7990. No wonder...



Tahiti is quirky with 6 memory channels, making memory stride calculations more complicated than 4 or 8 channels.
nerdralph
January 28, 2017, 04:40:55 PM
 #468

Now that I think about it, it is no wonder that bank conflicts are a serious problem, considering that the GDS's 32 banks are shared across all the compute units, unlike the LDS. My next plan is to reduce the number of wavefronts to avoid bank conflicts in the GDS. We will see.

But that's still much better than 4 or 8 memory channels when the row counters are stored in RAM/L2.  On Hawaii with 8 memory channels you should be able to do 4 GDS updates for every write to RAM.

Given the architectural description of the GDS is a bit vague, it's possible that the atomic units only support single-cycle increment/decrement, while add might lock the GDS for two or three cycles to do read/add/write.  Even then, your bandwidth limit should still be the external memory channels.
zawawa (OP)
January 29, 2017, 02:36:15 AM
 #469

Hm, let me try atomic_inc, then. Thanks a lot for the pointers!
My tentative impression is that the GDS is much slower than your description suggests, though.
Do you have any actual numbers from your experiments with GDS?
I am beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

FFI2013
Hero Member
*****
Offline Offline

Activity: 906
Merit: 507


View Profile
January 29, 2017, 03:37:47 AM
 #470

I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?
zawawa (OP)
January 29, 2017, 07:01:25 AM
Last edit: January 29, 2017, 07:14:15 AM by zawawa
 #471

I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?

I will upload a new version tomorrow, so we should wait until then.

zawawa (OP)
January 29, 2017, 07:22:36 AM
 #472

I am sick and tired of inserting GDS instructions into 6000 lines of disassembled GCN code every time I update the OpenCL kernel. Let's see if I can use GCN inline assembly with LLVM...

zawawa (OP)
January 29, 2017, 01:11:27 PM
 #473

It seems like the speedup with GDS counters would be a little over 10%, which is consistent with what I observed with the ASM and non-ASM versions of Claymore's. There are other really neat tricks with the GCN assembly, but I don't think Claymore used them. (His real strength is at the algorithmic level anyway.) I will now work on GDS counters for the RX 480 on Linux, and I will prepare the next point release when I'm done.

FFI2013
January 29, 2017, 03:17:46 PM
 #474

I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?

I will upload a new version tomorrow, so we should wait until then.
Thanks. I will also use the bat file that was posted to mine some for you, for the great work you're doing.
nerdralph
January 29, 2017, 04:48:18 PM
 #475

My tentative impression is that the GDS is much slower than your description suggests, though.
Do you have any actual numbers from your experiments with GDS?
I am beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

I've done very little coding in the last few weeks due to other priorities.
I did look back over the docs, and they clearly state, "The GDS is identical to the local data shares, except that it is shared by all compute units".  Access is through the export unit instead of local to the CU, but I haven't seen a single reference that suggests that adds any latency.  Since the LDS and GDS are independent, there is more potential for concurrency, because an LDS and a GDS instruction can be dispatched concurrently.

Although I haven't looked at Optiminer's asm code yet, I'd wager that the primary gains are from GDS.  As I've previously pointed out, the core clock is the limit when the row counters are in L2.  With 32 bytes of data per slot, it's impossible to process 8 slots in anything less than 5 core clocks (I'm talking Rx and other devices with 4 memory channels).  2 of those are for updating the row counters, so using the GDS allows you to get that down to 3.  It looks like ds_add_rtn_u32 takes 2 clocks (vs 1 for ds_read_b32), but that still allows you to update 32 row counters in 2 cycles.

It's unclear how GDS arbitration is done, so my guess is that conflicts for GDS access between CUs are exacerbating the problem.  Specifically, I suspect GDS access by one CU blocks access by all other CUs, even if the other CUs are attempting to access idle banks in the GDS.  I'd suggest using a 4-way (or even 8-way) xor_and_store and a local work size of 64.  With an 8-way store, 32 work-items will usually (~80% of the time) hit 4 banks with no conflicts.  With a 4-way store, 87.5% of the time you'll get a bank conflict causing the ds_add to take 4 or more cycles instead of 2 (though 12.5% of the time you'll update 8 counters in 2 cycles).
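
The "~80%" figure can be sanity-checked with a back-of-the-envelope model. The sketch below assumes each counter update lands on one of the 32 banks uniformly and independently, which is a simplification and not a statement about how the GDS actually arbitrates; the function name is made up for this example.

```python
from fractions import Fraction


def p_no_conflict(counters, banks=32):
    """Probability that `counters` uniformly random bank picks are all distinct."""
    p = Fraction(1)
    for i in range(counters):
        # The i-th update must avoid the i banks already taken.
        p *= Fraction(banks - i, banks)
    return p


# 8-way store with 32 work-items -> 4 counter updates.
print(float(p_no_conflict(4)))
# 4-way store with 32 work-items -> 8 counter updates.
print(float(p_no_conflict(8)))
```

Under this model, four updates are conflict-free about 82% of the time, in line with the "~80%" above. Eight updates come out conflict-free about 39% of the time, which is higher than the 12.5% quoted, so that figure presumably rests on a different model (for example, correlated addresses or cross-CU blocking).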

nerdralph
January 29, 2017, 04:56:52 PM
 #476

I am sick and tired of inserting GDS instructions into 6000 lines of disassembled GCN code every time I update the OpenCL kernel. Let's see if I can use GCN inline assembly with LLVM...

With clang/llvm 3.9, generating asm from OpenCL + inline asm was pretty easy:
Code:
${CLANG} -x cl -Xclang -finclude-default-header -Dcl_clang_storage_class_specifiers -target amdgcn -mcpu=tonga -S -o ${f}.s ${f}.cl

Linking with libclc was where I ran into problems.
nerdralph
January 29, 2017, 05:05:46 PM
 #477

It seems like the speedup with GDS counters would be a little over 10%, which is consistent with what I observed with the ASM and non-ASM versions of Claymore's. There are other really neat tricks with the GCN assembly, but I don't think Claymore used them. (His real strength is at the algorithmic level anyway.) I will now work on GDS counters for the RX 480 on Linux, and I will prepare the next point release when I'm done.

Maybe 10% is all you can get for Tahiti, but for Ellesmere you should be able to get a 20-25% boost.  Even if Optiminer has a reduced dev fee of 5%, the gross speed for the Rx 480 would be 270/.95 = 284.
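
The dev-fee arithmetic generalizes as follows. This is a small sketch that assumes the fee is taken as a fraction of mining time, so the user-visible rate is the gross rate times (1 - fee); the function name is made up for this example.

```python
def gross_rate(net_rate, dev_fee):
    """Rate the kernel must sustain for the user to see `net_rate` after the fee."""
    if not 0 <= dev_fee < 1:
        raise ValueError("dev_fee must be a fraction in [0, 1)")
    return net_rate / (1 - dev_fee)


# Figures from the post: a net rate of 270 with a 5% dev fee.
print(round(gross_rate(270, 0.05)))
```

With the post's numbers this reproduces the 284 figure, which is the rate a fee-free miner would have to reach to match it.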
zawawa (OP)
January 29, 2017, 06:16:16 PM
 #478

My tentative impression is that the GDS is much slower than your description suggests, though.
Do you have any actual numbers from your experiments with GDS?
I am beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

I've done very little coding in the last few weeks due to other priorities.
I did look back over the docs, and they clearly state, "The GDS is identical to the local data shares, except that it is shared by all compute units".  Access is through the export unit instead of local to the CU, but I haven't seen a single reference that suggests that adds any latency.  Since the LDS and GDS are independent, there is more potential for concurrency, because an LDS and a GDS instruction can be dispatched concurrently.

Although I haven't looked at Optiminer's asm code yet, I'd wager that the primary gains are from GDS.  As I've previously pointed out, the core clock is the limit when the row counters are in L2.  With 32 bytes of data per slot, it's impossible to process 8 slots in anything less than 5 core clocks (I'm talking Rx and other devices with 4 memory channels).  2 of those are for updating the row counters, so using the GDS allows you to get that down to 3.  It looks like ds_add_rtn_u32 takes 2 clocks (vs 1 for ds_read_b32), but that still allows you to update 32 row counters in 2 cycles.

It's unclear how GDS arbitration is done, so my guess is that conflicts for GDS access between CUs are exacerbating the problem.  Specifically, I suspect GDS access by one CU blocks access by all other CUs, even if the other CUs are attempting to access idle banks in the GDS.  I'd suggest using a 4-way (or even 8-way) xor_and_store and a local work size of 64.  With an 8-way store, 32 work-items will usually (~80% of the time) hit 4 banks with no conflicts.  With a 4-way store, 87.5% of the time you'll get a bank conflict causing the ds_add to take 4 or more cycles instead of 2 (though 12.5% of the time you'll update 8 counters in 2 cycles).



Very insightful. Thank you so much! I will start with 4-way writes and 128 work-items and see how that changes things. In the meantime, I urgently need to set up clang/llvm, as I feel like I am losing hair dealing with disassembled code...

zawawa (OP)
January 29, 2017, 06:43:37 PM
 #479

I just uploaded a new pre-release:

https://github.com/zawawawa/gatelessgate/releases/tag/v0.1.3-pre0

The new assembly version is for GCN1 and Windows only for now.
I will work on the Linux version today.
As always, I appreciate your feedback, donations, and even stars on GitHub. Enjoy!

nerdralph
January 29, 2017, 06:57:57 PM
 #480

I urgently need to set up clang/llvm, as I feel like I am losing hair dealing with disassembled code...

If you are using Ubuntu for development, setup is pretty easy.

Code:
miner@l1:~/bin$ apt-cache policy clang-3.9
clang-3.9:
  Installed: 1:3.9~svn288847-1~exp1
  Candidate: 1:3.9~svn288847-1~exp1
  Version table:
 *** 1:3.9~svn288847-1~exp1 0
        500 http://llvm.org/apt/trusty/ llvm-toolchain-trusty-3.9/main amd64 Packages

You'll also probably want libclc-dev.