Bitcoin Forum
April 25, 2024, 06:23:59 AM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Pages: « 1 2 [3] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 ... 197 »
  Print  
Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214337 times)
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
December 16, 2016, 05:08:45 AM
 #41

Not bad zawawa.  You still have room to improve ht_store.
Code:
p = slot.ui8
Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.
Code:
p = slot.ui4[0]
Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then write back.  This will waste a lot of GDDR cycles due to the bus turnaround delay.  The solution is to have a n-way operation where n threads write 32/n bytes.  That will be just core one cycle to xfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips.  With NR_SLOTS even, the first write to a given row will always be to an even memory chip.  With more slots per row this becomes less significant because the rows don't fill up equally.  Using an odd number for NR_SLOTS may also reduce channel conflicts.




I tried 4-way writes with mixed results. The 4-way write version was actually slower than the single-thread-write version, but the former seems to speed up the last few rounds. It makes sense as these rounds are more memory-intensive. I will explore this approach further.

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  Branching code where each thread executes a different store instruction will make performance worse because the L1 cache is write-thu, resulting in 4 cache lines written to the L2 with 8 dirty bytes each instead of 1 cache line written with 32 dirty bytes.  If you send me your 4-way code I can take a look.  It should even be possible to do it as a 2-way where the thread pairs execute a store_dwordx4.

1714026239
Hero Member
*
Offline Offline

Posts: 1714026239

View Profile Personal Message (Offline)

Ignore
1714026239
Reply with quote  #2

1714026239
Report to moderator
1714026239
Hero Member
*
Offline Offline

Posts: 1714026239

View Profile Personal Message (Offline)

Ignore
1714026239
Reply with quote  #2

1714026239
Report to moderator
"Bitcoin: the cutting edge of begging technology." -- Giraffe.BTC
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
QuintLeo
Legendary
*
Offline Offline

Activity: 1498
Merit: 1030


View Profile
December 16, 2016, 09:08:07 AM
 #42

R9 280x w/ modded bios - 85 s/s with instances=1 and 90-95 s/s with instances=2(not stable), like as original SA miner v.5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending from memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

 It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1 though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles) but offered no performance advantage).

 16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WQHL seems to be the only real difference between those two for miners).
 They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

 16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
 Avoid it.



 I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

 (This will of course change when Vega hits the street and requires newer drivers for support).





I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 09:10:00 AM
Last edit: December 16, 2016, 09:22:01 AM by zawawa
 #43

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 09:23:46 AM
 #44

R9 280x w/ modded bios - 85 s/s with instances=1 and 90-95 s/s with instances=2(not stable), like as original SA miner v.5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending from memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

 It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1 though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles) but offered no performance advantage).

 16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WQHL seems to be the only real difference between those two for miners).
 They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

 16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
 Avoid it.



 I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

 (This will of course change when Vega hits the street and requires newer drivers for support).






Thanks a lot for the great suggestion. I will definitely consider that.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
laik2
Sr. Member
****
Offline Offline

Activity: 652
Merit: 266



View Profile WWW
December 16, 2016, 09:24:09 AM
 #45

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
It's slower, you can check for yourself I've updated repo on the test linux.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 09:40:30 AM
 #46

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
laik2
Sr. Member
****
Offline Offline

Activity: 652
Merit: 266



View Profile WWW
December 16, 2016, 09:56:08 AM
 #47

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...
It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?
1 - 168/170S/s
2 - 148/150S/s
4 - 128/130S/s
8 - 82/84S/s

Changing threads in param.h
So basicly there is no change except that multithreading doesn't seem to work under linux as supposed to.

EDIT: -t value has no effect as of "THREADS_PER_WRITE" , it has to be hardcoded in param.h and recompiled to have effect.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 02:15:57 PM
 #48

Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
laik2
Sr. Member
****
Offline Offline

Activity: 652
Merit: 266



View Profile WWW
December 16, 2016, 02:32:19 PM
 #49

Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.
Ok Smiley
I'm just giving some feedback.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 02:47:13 PM
 #50

I just ordered GTX 1060, and I haven't told my wife about it yet.
Donations are always welcome, guys! My BTC address is in my signature.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
December 16, 2016, 03:51:47 PM
Last edit: December 16, 2016, 04:18:43 PM by nerdralph
 #51

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

p.s. even with 1 thread per write, although it builds, no solutions are found (make test fails).

p.p.s I tried going back to v0.0.1, but it seems I also need to merge back the unix compile fixes first...
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 04:06:08 PM
 #52

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^


That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
December 16, 2016, 04:24:47 PM
 #53

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^


That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 04:48:50 PM
 #54

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?  

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks.  I'll certainly give you credit for pumping the code out faster than I do.  I thing you forgot to push something to the repo though:
Code:
"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^


That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.


Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
December 16, 2016, 05:13:32 PM
 #55

I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.


Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs.  What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 05:31:32 PM
 #56

I'm using Ubuntu 14.04 & fglrx.  If it builds OK for you I'm surprised.  I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.


Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs.  What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html


I don't think the specs prohibit the syntax. See:

https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/opencl-c/
http://stackoverflow.com/questions/11978024/opencl-store-pointer-to-global-memory-in-local-memory

Code:
__global char * __local lgc[8];  // 8 pointers stored on the local memory that points to a char located on the global memory

I appreciate your detail-oriented approach, though  Wink

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 06:09:59 PM
 #57

In any case, this is the portion of the ISA in question for 2-way writes.
It seems very clean to me with a single FLAT_STORE_DWORDX4 at the end.
Maybe I need to separate reads/XOR's and writes into two sections.

Code:
label_0228:
 0x003CA0 S_ANDN2_B64 exec s[28:29] exec 4 Scalar 89FE7E1C
 0x003CA4 V_MOV_B32 v28 0 4 Vector ALU 7E380280
 0x003CA8 S_MOV_B64 exec s[28:29] 4 Scalar BEFE011C
 0x003CAC S_WAITCNT vmcnt(0) Varies Flow Control BF8C0F70
 0x003CB0 DS_READ_B64 v[33:34] v19 offset:7024 Varies LDS D8EC1B70 21000013
 0x003CB8 S_WAITCNT lgkmcnt(0) Varies Flow Control BF8C007F
 0x003CBC V_CMP_NE_I64 vcc 0 v[33:34] Varies Vector ALU 7DCA4280
 0x003CC0 S_AND_SAVEEXEC_B64 s[28:29] vcc 4 Scalar BE9C206A
 0x003CC4 S_CBRANCH_EXECZ label_0261 4/16 Branch BF88002F
 0x003CC8 S_MOV_B32 s8 0x05040c00 4 Scalar BE8800FF 05040C00
 0x003CD0 S_MOV_B32 s30 0x0c0c000c 4 Scalar BE9E00FF 0C0C000C
 0x003CD8 V_PERM_B32 v35 v13 v43 s8 4 Vector ALU D1ED0023 0022570D
 0x003CE0 V_PERM_B32 v36 v44 v44 s30 4 Vector ALU D1ED0024 007A592C
 0x003CE8 S_MOV_B32 s8 0x04030201 4 Scalar BE8800FF 04030201
 0x003CF0 V_OR_B32 v35 v35 v36 4 Vector ALU 28464923
 0x003CF4 V_PERM_B32 v8 v48 v45 s8 4 Vector ALU D1ED0008 00225B30
 0x003CFC V_PERM_B32 v10 v39 v48 s8 4 Vector ALU D1ED000A 00226127
 0x003D04 V_PERM_B32 v39 v40 v39 s8 4 Vector ALU D1ED0027 00224F28
 0x003D0C V_MOV_B32 v52 v35 4 Vector ALU 7E680323
 0x003D10 V_MOV_B32 v53 v8 4 Vector ALU 7E6A0308
 0x003D14 V_PERM_B32 v8 v29 v40 s8 4 Vector ALU D1ED0008 0022511D
 0x003D1C V_MOV_B32 v54 v10 4 Vector ALU 7E6C030A
 0x003D20 V_PERM_B32 v10 v24 v29 s8 4 Vector ALU D1ED000A 00223B18
 0x003D28 V_MOV_B32 v55 v39 4 Vector ALU 7E6E0327
 0x003D2C V_LSHRREV_B32 v24 8 v24 4 Vector ALU 20303088
 0x003D30 V_MOV_B32 v56 v8 4 Vector ALU 7E700308
 0x003D34 V_MOV_B32 v57 v10 4 Vector ALU 7E72030A
 0x003D38 V_MOV_B32 v58 v24 4 Vector ALU 7E740318
 0x003D3C V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D40 V_CNDMASK_B32 v8 v52 v56 vcc 4 Vector ALU 00107134
 0x003D44 V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D48 V_CNDMASK_B32 v10 v53 v57 vcc 4 Vector ALU 00147335
 0x003D4C V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D50 V_CNDMASK_B32 v45 v54 v58 vcc 4 Vector ALU 005A7536
 0x003D54 V_MOV_B32 v60 v55 4 Vector ALU 7E780337
 0x003D58 V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D5C V_CNDMASK_B32 v48 v55 v59 vcc 4 Vector ALU 00607737
 0x003D60 V_ADD_U32 v33 vcc v33 v22 4 Vector ALU 32422D21
 0x003D64 V_ADDC_U32 v34 vcc v34 0 vcc 4 Vector ALU D11C6A22 01A90122
 0x003D6C V_MOV_B32 v35 v8 4 Vector ALU 7E460308
 0x003D70 V_MOV_B32 v36 v10 4 Vector ALU 7E48030A
 0x003D74 V_MOV_B32 v37 v45 4 Vector ALU 7E4A032D
 0x003D78 V_MOV_B32 v38 v48 4 Vector ALU 7E4C0330
 0x003D7C FLAT_STORE_DWORDX4 v[33:34] v[35:38]

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
December 16, 2016, 09:12:20 PM
 #58

In any case, this is the portion of the ISA in question for 2-way writes.
It seems very clean to me with a single FLAT_STORE_DWORDX4 at the end.
Maybe I need to separate reads/XOR's and writes into two sections.

Code:
label_0228:
 0x003CA0 S_ANDN2_B64 exec s[28:29] exec 4 Scalar 89FE7E1C
 0x003CA4 V_MOV_B32 v28 0 4 Vector ALU 7E380280
 0x003CA8 S_MOV_B64 exec s[28:29] 4 Scalar BEFE011C
 0x003CAC S_WAITCNT vmcnt(0) Varies Flow Control BF8C0F70
 0x003CB0 DS_READ_B64 v[33:34] v19 offset:7024 Varies LDS D8EC1B70 21000013
 0x003CB8 S_WAITCNT lgkmcnt(0) Varies Flow Control BF8C007F
 0x003CBC V_CMP_NE_I64 vcc 0 v[33:34] Varies Vector ALU 7DCA4280
 0x003CC0 S_AND_SAVEEXEC_B64 s[28:29] vcc 4 Scalar BE9C206A
 0x003CC4 S_CBRANCH_EXECZ label_0261 4/16 Branch BF88002F
 0x003CC8 S_MOV_B32 s8 0x05040c00 4 Scalar BE8800FF 05040C00
 0x003CD0 S_MOV_B32 s30 0x0c0c000c 4 Scalar BE9E00FF 0C0C000C
 0x003CD8 V_PERM_B32 v35 v13 v43 s8 4 Vector ALU D1ED0023 0022570D
 0x003CE0 V_PERM_B32 v36 v44 v44 s30 4 Vector ALU D1ED0024 007A592C
 0x003CE8 S_MOV_B32 s8 0x04030201 4 Scalar BE8800FF 04030201
 0x003CF0 V_OR_B32 v35 v35 v36 4 Vector ALU 28464923
 0x003CF4 V_PERM_B32 v8 v48 v45 s8 4 Vector ALU D1ED0008 00225B30
 0x003CFC V_PERM_B32 v10 v39 v48 s8 4 Vector ALU D1ED000A 00226127
 0x003D04 V_PERM_B32 v39 v40 v39 s8 4 Vector ALU D1ED0027 00224F28
 0x003D0C V_MOV_B32 v52 v35 4 Vector ALU 7E680323
 0x003D10 V_MOV_B32 v53 v8 4 Vector ALU 7E6A0308
 0x003D14 V_PERM_B32 v8 v29 v40 s8 4 Vector ALU D1ED0008 0022511D
 0x003D1C V_MOV_B32 v54 v10 4 Vector ALU 7E6C030A
 0x003D20 V_PERM_B32 v10 v24 v29 s8 4 Vector ALU D1ED000A 00223B18
 0x003D28 V_MOV_B32 v55 v39 4 Vector ALU 7E6E0327
 0x003D2C V_LSHRREV_B32 v24 8 v24 4 Vector ALU 20303088
 0x003D30 V_MOV_B32 v56 v8 4 Vector ALU 7E700308
 0x003D34 V_MOV_B32 v57 v10 4 Vector ALU 7E72030A
 0x003D38 V_MOV_B32 v58 v24 4 Vector ALU 7E740318
 0x003D3C V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D40 V_CNDMASK_B32 v8 v52 v56 vcc 4 Vector ALU 00107134
 0x003D44 V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D48 V_CNDMASK_B32 v10 v53 v57 vcc 4 Vector ALU 00147335
 0x003D4C V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D50 V_CNDMASK_B32 v45 v54 v58 vcc 4 Vector ALU 005A7536
 0x003D54 V_MOV_B32 v60 v55 4 Vector ALU 7E780337
 0x003D58 V_CMP_EQ_I32 vcc 16 v22 4 Vector ALU 7D842C90
 0x003D5C V_CNDMASK_B32 v48 v55 v59 vcc 4 Vector ALU 00607737
 0x003D60 V_ADD_U32 v33 vcc v33 v22 4 Vector ALU 32422D21
 0x003D64 V_ADDC_U32 v34 vcc v34 0 vcc 4 Vector ALU D11C6A22 01A90122
 0x003D6C V_MOV_B32 v35 v8 4 Vector ALU 7E460308
 0x003D70 V_MOV_B32 v36 v10 4 Vector ALU 7E48030A
 0x003D74 V_MOV_B32 v37 v45 4 Vector ALU 7E4A032D
 0x003D78 V_MOV_B32 v38 v48 4 Vector ALU 7E4C0330
 0x003D7C FLAT_STORE_DWORDX4 v[33:34] v[35:38]

That looks OK.  Even with ~60 VGPRs used, that would still allow for 4 waves per SIMD.  I'm going to try to get the code to build with fglrx so I can get a better idea of why it's not performing better.
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
December 16, 2016, 09:36:01 PM
 #59

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
laik2
Sr. Member
****
Offline Offline

Activity: 652
Merit: 266



View Profile WWW
December 16, 2016, 10:05:43 PM
 #60

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.
I've turned on the linux server if you want to test.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/
Pages: « 1 2 [3] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 ... 197 »
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!