Bitcoin Forum
January 16, 2018, 04:22:58 PM *
Author Topic: Gateless Gate Sharp 1.2.2: zawawa's open-source dual ETH/XMR/PASC/LBC/FTC miner  (Read 171625 times)
zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 27, 2017, 08:44:38 PM
 #461

I will look into that. The assembly version for 7990 is almost ready...

Have people looked at the returns for Pascal Coin right now? I would forget about mining it unless you have almost free electricity and a very high hashrate. http://whattomine.com/coins/172-pasc-pascal



No wonder... If it uses a variant of SHA-256, it wouldn't be within the reach of GPUs anyway.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 28, 2017, 08:26:14 AM
 #462

It turned out that I really need to reduce bank conflicts in GDS.
I also need to revisit global syncs with GWS instructions.
I wonder if Claymore already did research on these mostly undocumented features for his other miner programs.
In the meantime, I will upload a preliminary assembly version for GCN1 for testing purposes.

zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 28, 2017, 11:50:51 AM
 #463

Now that I think about it, it is no wonder that bank conflicts are a serious problem, given that the GDS's 32 banks are shared across all the compute units, unlike the LDS. My next plan is to reduce the number of wavefronts to avoid bank conflicts in the GDS. We will see.

zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 28, 2017, 12:55:00 PM
 #464

WTF. This cannot be right.


zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 28, 2017, 01:33:50 PM
 #465

This is a profile of the non-GDS version. This looks OK...


zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 28, 2017, 01:57:52 PM
 #466

Looks like GG is not taking advantage of the available memory bandwidth of the 7990. No wonder...


zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 28, 2017, 02:46:28 PM
 #467

I think I need to reintroduce the variable NR_ROWS_LOG.
I dropped it when I switched to sgminer-gm because the code became too complex and the AMD OpenCL driver crapped out.
I need to keep the code really simple this time around.

nerdralph
Sr. Member
****
Offline Offline

Activity: 434


View Profile
January 28, 2017, 04:31:02 PM
 #468

Looks like GG is not taking advantage of the available memory bandwidth of the 7990. No wonder...



Tahiti is quirky with its 6 memory channels, which make memory stride calculations more complicated than with 4 or 8 channels.
nerdralph
Sr. Member
****
Offline Offline

Activity: 434


View Profile
January 28, 2017, 04:40:55 PM
 #469

Now that I think about it, it is no wonder that bank conflicts are a serious problem, given that the GDS's 32 banks are shared across all the compute units, unlike the LDS. My next plan is to reduce the number of wavefronts to avoid bank conflicts in the GDS. We will see.

But that's still much better than 4 or 8 memory channels when the row counters are stored in RAM/L2.  On Hawaii with 8 memory channels you should be able to do 4 GDS updates for every write to RAM.

Given that the architectural description of the GDS is a bit vague, it's possible that the atomic units only support single-cycle increment/decrement, while add might lock the GDS for two or three cycles to do read/add/write.  Even then, your bandwidth limit should still be the external memory channels.
zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 29, 2017, 02:36:15 AM
 #470

Hm, let me try atomic_inc, then. Thanks a lot for the pointers!
My tentative impression is that GDS is much slower than your description suggests, though.
Do you have any actual numbers from your experiments with GDS?
I am beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

FFI2013
Hero Member
*****
Offline Offline

Activity: 517


THEKEY - Unlock the future


View Profile
January 29, 2017, 03:37:47 AM
 #471

I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?


zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 29, 2017, 07:01:25 AM
 #472

I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?

I will upload a new version tomorrow, so we should wait until then.

zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 29, 2017, 07:22:36 AM
 #473

I am sick and tired of inserting GDS instructions into 6000 lines of disassembled GCN code every time I update the OpenCL kernel. Let's see if I can use GCN inline assembly with LLVM...

zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 29, 2017, 01:11:27 PM
 #474

It seems like the speedup with GDS counters would be a little over 10%, which is consistent with what I observed with the ASM and non-ASM versions of Claymore's. There are other really neat tricks with the GCN assembly, but I don't think Claymore used them. (His real strength is at the algorithmic level anyway.) I will now work on GDS counters for the RX 480 on Linux, and I will prepare the next point release when I'm done.

FFI2013
Hero Member
*****
Offline Offline

Activity: 517


THEKEY - Unlock the future


View Profile
January 29, 2017, 03:17:46 PM
 #475

I wanted to see what rates I could get with my 390/290 and 280s. Has anyone figured out good settings, or is it still too early?

I will upload a new version tomorrow, so we should wait until then.
Thanks. I will also use the bat file that was posted to mine some for you, for the great work you're doing.


nerdralph
Sr. Member
****
Offline Offline

Activity: 434


View Profile
January 29, 2017, 04:48:18 PM
 #476

My tentative impression is that GDS is much slower than your description suggests, though.
Do you have any actual numbers from your experiments with GDS?
I am beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

I've done very little coding in the last few weeks due to other priorities.
I did look back over the docs, and they clearly state, "The GDS is identical to the local data shares, except that it is shared by all compute units".  Access is through the export unit instead of local to the CU, but I haven't seen a single reference that suggests that this adds any latency.  Since the LDS and GDS are independent, there is more potential for concurrency, because an LDS and a GDS instruction can be dispatched concurrently.

Although I haven't looked at Optiminer's asm code yet, I'd wager that the primary gains are from GDS.  As I've previously pointed out, the core clock is the limit when the row counters are in L2.  With 32 bytes of data per slot, it's impossible to process 8 slots in anything less than 5 core clocks (I'm talking Rx and other devices with 4 memory channels).  2 of those are for updating the row counters, so using the GDS allows you to get that down to 3.  It looks like ds_add_rtn_u32 takes 2 clocks (vs 1 for ds_read_b32), but that still allows you to update 32 row counters in 2 cycles.

It's unclear how GDS arbitration is done, so my guess is that conflicts for GDS access between CUs are exacerbating the problem.  Specifically, I suspect GDS access by one CU blocks access by all other CUs, even if the other CUs are attempting to access idle banks in the GDS.  I'd suggest using a 4-way (or even 8-way) xor_and_store and a local work size of 64.  With an 8-way store, 32 work-items will usually (~80% of the time) hit 4 banks with no conflicts.  With a 4-way store, 87.5% of the time you'll get a bank conflict causing the ds_add to take 4 or more cycles instead of 2 (though 12.5% of the time you'll update 8 counters in 2 cycles).
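
For anyone who wants to play with the bank-conflict odds, here is a rough sketch of one way to estimate them. It assumes each counter update picks one of the 32 banks independently and uniformly at random (a birthday-problem model); that is an assumption for illustration, not something the GCN docs state, and the function name is made up. With 4 updates (32 work-items, 8-way store) it comes out at roughly 82% conflict-free, close to the ~80% mentioned above. The 4-way figures above seem to rest on a different model, so this sketch is not expected to reproduce them.

```python
from fractions import Fraction


def conflict_free_probability(accesses: int, banks: int = 32) -> Fraction:
    """Probability that `accesses` independent, uniformly random bank
    choices all land on distinct banks (assumed model, see lead-in)."""
    if accesses > banks:
        # More accesses than banks: a conflict is unavoidable.
        return Fraction(0)
    p = Fraction(1)
    for i in range(accesses):
        # The i-th access must avoid the i banks already in use.
        p *= Fraction(banks - i, banks)
    return p


if __name__ == "__main__":
    for ways in (8, 4):
        updates = 32 // ways  # counter updates issued by 32 work-items
        p = conflict_free_probability(updates)
        print(f"{ways}-way store: {updates} updates, "
              f"conflict-free {float(p):.1%}")
```

Changing `banks` or the number of work-items shows how quickly the conflict-free case becomes rare as more updates are issued at once.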

nerdralph
Sr. Member
****
Offline Offline

Activity: 434


View Profile
January 29, 2017, 04:56:52 PM
 #477

I am sick and tired of inserting GDS instructions into 6000 lines of disassembled GCN code every time I update the OpenCL kernel. Let's see if I can use GCN inline assembly with LLVM...

With clang/llvm 3.9, generating asm from OpenCL + inline asm was pretty easy:
Code:
${CLANG} -x cl -Xclang -finclude-default-header -Dcl_clang_storage_class_specifiers -target amdgcn -mcpu=tonga -S -o ${f}.s ${f}.cl

linking with libclc was where I ran into problems.
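
If you want to run that command over several kernels from a script, a small helper along these lines might do. It only builds the argument list from the flags quoted above (as used with clang/llvm 3.9; other versions may need different flags). The function name, the default `clang` executable name, and the `kernel` file stem are placeholders, not anything from a real build setup.

```python
import shlex
from typing import List


def clang_asm_command(stem: str, clang: str = "clang",
                      mcpu: str = "tonga") -> List[str]:
    """Build the argv for compiling <stem>.cl to GCN assembly <stem>.s,
    using the flags from the command quoted above."""
    return [
        clang, "-x", "cl",
        "-Xclang", "-finclude-default-header",
        "-Dcl_clang_storage_class_specifiers",
        "-target", "amdgcn", f"-mcpu={mcpu}",
        "-S", "-o", f"{stem}.s", f"{stem}.cl",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(...) once the clang path is known to be right.
    print(shlex.join(clang_asm_command("kernel")))
```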
nerdralph
Sr. Member
****
Offline Offline

Activity: 434


View Profile
January 29, 2017, 05:05:46 PM
 #478

It seems like the speedup with GDS counters would be a little over 10%, which is consistent with what I observed with the ASM and non-ASM versions of Claymore's. There are other really neat tricks with the GCN assembly, but I don't think Claymore used them. (His real strength is at the algorithmic level anyway.) I will now work on GDS counters for the RX 480 on Linux, and I will prepare the next point release when I'm done.

Maybe 10% is all you can get for Tahiti, but for Ellesmere you should be able to get a 20-25% boost.  Even if Optiminer has a reduced dev fee of 5%, the gross speed for the Rx 480 would be 270/.95 = 284.
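
The 270/.95 arithmetic, spelled out as a tiny sketch: the reported rate is what is left after the dev fee, so the gross rate is the reported rate divided by (1 - fee). The function name is made up for illustration; the 270 and 5% figures are the ones above.

```python
def gross_rate(net_rate: float, dev_fee: float) -> float:
    """Rate before the dev fee is taken, given the rate the user sees."""
    if not 0 <= dev_fee < 1:
        raise ValueError("dev_fee must be a fraction in [0, 1)")
    return net_rate / (1 - dev_fee)


if __name__ == "__main__":
    print(round(gross_rate(270, 0.05)))  # 284
```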
zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 29, 2017, 06:16:16 PM
 #479

My tentative impression is that GDS is much slower than your description suggests, though.
Do you have any actual numbers from your experiments with GDS?
I am beginning to wonder whether Claymore's and Optiminer's optimizations have more to do with GWS instructions, which do depend on GDS, than with GDS proper.

I've done very little coding in the last few weeks due to other priorities.
I did look back over the docs, and they clearly state, "The GDS is identical to the local data shares, except that it is shared by all compute units".  Access is through the export unit instead of local to the CU, but I haven't seen a single reference that suggests that this adds any latency.  Since the LDS and GDS are independent, there is more potential for concurrency, because an LDS and a GDS instruction can be dispatched concurrently.

Although I haven't looked at Optiminer's asm code yet, I'd wager that the primary gains are from GDS.  As I've previously pointed out, the core clock is the limit when the row counters are in L2.  With 32 bytes of data per slot, it's impossible to process 8 slots in anything less than 5 core clocks (I'm talking Rx and other devices with 4 memory channels).  2 of those are for updating the row counters, so using the GDS allows you to get that down to 3.  It looks like ds_add_rtn_u32 takes 2 clocks (vs 1 for ds_read_b32), but that still allows you to update 32 row counters in 2 cycles.

It's unclear how GDS arbitration is done, so my guess is that conflicts for GDS access between CUs are exacerbating the problem.  Specifically, I suspect GDS access by one CU blocks access by all other CUs, even if the other CUs are attempting to access idle banks in the GDS.  I'd suggest using a 4-way (or even 8-way) xor_and_store and a local work size of 64.  With an 8-way store, 32 work-items will usually (~80% of the time) hit 4 banks with no conflicts.  With a 4-way store, 87.5% of the time you'll get a bank conflict causing the ds_add to take 4 or more cycles instead of 2 (though 12.5% of the time you'll update 8 counters in 2 cycles).



Very insightful. Thank you so much! I will start with 4-way writes and 128 work-items and see how that changes things. In the meantime, I urgently need to set up clang/llvm, as I feel like I am losing hair dealing with disassembled code...

zawawa
Sr. Member
****
Offline Offline

Activity: 448


Miner Developer


View Profile
January 29, 2017, 06:43:37 PM
 #480

I just uploaded a new pre-release:

https://github.com/zawawawa/gatelessgate/releases/tag/v0.1.3-pre0

The new assembly version is for GCN1 and Windows only for now.
I will work on the Linux version today.
As always, I appreciate your feedback, donations, and even stars on GitHub. Enjoy!
