Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!

nerdralph

Sr. Member

Offline

Activity: 588
Merit: 251

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 05:08:45 AM

#41

Quote from: zawawa on December 16, 2016, 03:10:03 AM

Quote from: nerdralph on December 15, 2016, 11:31:30 PM

Not bad zawawa. You still have room to improve ht_store.

Code:

p = slot.ui8

Will at best result in 2 store_dwordx4 instructions, and 2 core cycles to the memory controller.

Code:

p = slot.ui4[0]

Which you use after round 5 should only be one cycle, but it will force a 32-byte read burst from the GDDR into the L2, modification of 16 bytes, and then write back. This will waste a lot of GDDR cycles due to the bus turnaround delay. The solution is to have a n-way operation where n threads write 32/n bytes. That will be just core one cycle to xfer 32 bytes to the L2, and a single 32-byte write burst to one of the 2 GDDR5 chips per memory controller channel.
I also think using an odd number for NR_SLOTS should be a tiny bit faster by balancing out the writes between the odd and even memory chips. With NR_SLOTS even, the first write to a given row will always be to an even memory chip. With more slots per row this becomes less significant because the rows don't fill up equally. Using an odd number for NR_SLOTS may also reduce channel conflicts.

I tried 4-way writes with mixed results. The 4-way write version was actually slower than the single-thread-write version, but the former seems to speed up the last few rounds. It makes sense as these rounds are more memory-intensive. I will explore this approach further.

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2? Branching code where each thread executes a different store instruction will make performance worse because the L1 cache is write-thu, resulting in 4 cache lines written to the L2 with 8 dirty bytes each instead of 1 cache line written with 32 dirty bytes. If you send me your 4-way code I can take a look. It should even be possible to do it as a 2-way where the thread pairs execute a store_dwordx4.

QuintLeo

Legendary

Offline

Activity: 1498
Merit: 1030

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:08:07 AM

#42

Quote from: zawawa on December 15, 2016, 09:14:51 PM

Quote from: Vetal_inside on December 15, 2016, 07:56:28 PM

R9 280x w/ modded bios - 85 s/s with instances=1 and 90-95 s/s with instances=2(not stable), like as original SA miner v.5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending from memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1 though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles) but offered no performance advantage).

16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WQHL seems to be the only real difference between those two for miners).
They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
Avoid it.

I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

(This will of course change when Vega hits the street and requires newer drivers for support).

I'm no longer legendary just in my own mind!
Like something I said? Donations gratefully accepted. LYLnTKvLefz9izJFUvEGQEZzSkz34b3N6U (Litecoin)
1GYbjMTPdCuV7dci3iCUiaRrcNuaiQrVYY (Bitcoin)

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:10:00 AM
Last edit: December 16, 2016, 09:22:01 AM by zawawa

#43

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:23:46 AM

#44

Quote from: QuintLeo on December 16, 2016, 09:08:07 AM

Quote from: zawawa on December 15, 2016, 09:14:51 PM

Quote from: Vetal_inside on December 15, 2016, 07:56:28 PM

R9 280x w/ modded bios - 85 s/s with instances=1 and 90-95 s/s with instances=2(not stable), like as original SA miner v.5.
Win8.1, x64, drivers 15.12

add: with CM it shows 210-220 s/s, depending from memclock

The slow speed is probably due either to the modded BIOS or to the driver. Mods for Claymore's do not necessarily work with Gateless Gate/SILENTARMY. I would try the stock BIOS first. Also, I only tested the miner with Crimson drivers. I suppose I need to be more clear about requirements...

15.12 is the original Crimson driver - for the pre-RX cards it's the best and fastest version overall per everything I've ever run it on (quite a wide assortment).

It was also the last LINUX version that supported pre-GCN cards (Windows had to suffer with 15.7.1 though there is a "legacy" 16.2 version that basically repackaged 15.7.1 with some of the newer bells and whistles) but offered no performance advantage).

16.9.2 or 16.10.1 seem to be the best mining options for the RX series cards (16.10.1 is WQHL seems to be the only real difference between those two for miners).
They also seem to work as well with the R9 and HD 7xxx series GCN cards in my somewhat limited testing.

16.12.1 is total bloated junk and reduced hashrate 5-10% on EVERYTHING I tried it on (HD7870, R9 280x, RX 470).
Avoid it.

I would suggest that you make the 15.12 for pre-RX cards and the 16.10.1 for RX series your "tested with and recommended" driver options.

(This will of course change when Vega hits the street and requires newer drivers for support).

Thanks a lot for the great suggestion. I will definitely consider that.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

laik2

Sr. Member

Offline

Activity: 652
Merit: 266

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:24:09 AM

#45

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

It's slower, you can check for yourself I've updated repo on the test linux.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:40:30 AM

#46

Quote from: laik2 on December 16, 2016, 09:24:09 AM

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

laik2

Sr. Member

Offline

Activity: 652
Merit: 266

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:56:08 AM

#47

Quote from: zawawa on December 16, 2016, 09:40:30 AM

Quote from: laik2 on December 16, 2016, 09:24:09 AM

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

It's slower, you can check for yourself I've updated repo on the test linux.

Are you referring to multi-threaded writes, or the default settings?

1 - 168/170S/s
2 - 148/150S/s
4 - 128/130S/s
8 - 82/84S/s

Changing threads in param.h
So basicly there is no change except that multithreading doesn't seem to work under linux as supposed to.

EDIT: -t value has no effect as of "THREADS_PER_WRITE" , it has to be hardcoded in param.h and recompiled to have effect.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 02:15:57 PM

#48

Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

laik2

Sr. Member

Offline

Activity: 652
Merit: 266

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 02:32:19 PM

#49

Quote from: zawawa on December 16, 2016, 02:15:57 PM

Like I said before, parallel writes are slower than single thread writes at this point.
This is still an experimental feature.

Ok

I'm just giving some feedback.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 02:47:13 PM

#50

I just ordered GTX 1060, and I haven't told my wife about it yet.
Donations are always welcome, guys! My BTC address is in my signature.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

nerdralph

Sr. Member

Offline

Activity: 588
Merit: 251

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 03:51:47 PM
Last edit: December 16, 2016, 04:18:43 PM by nerdralph

#51

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

p.s. even with 1 thread per write, although it builds, no solutions are found (make test fails).

p.p.s I tried going back to v0.0.1, but it seems I also need to merge back the unix compile fixes first...

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 04:06:08 PM

#52

Quote from: nerdralph on December 16, 2016, 03:51:47 PM

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

nerdralph

Sr. Member

Offline

Activity: 588
Merit: 251

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 04:24:47 PM

#53

Quote from: zawawa on December 16, 2016, 04:06:08 PM

Quote from: nerdralph on December 16, 2016, 03:51:47 PM

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 04:48:50 PM

#54

Quote from: nerdralph on December 16, 2016, 04:24:47 PM

Quote from: zawawa on December 16, 2016, 04:06:08 PM

Quote from: nerdralph on December 16, 2016, 03:51:47 PM

Quote from: zawawa on December 16, 2016, 09:10:00 AM

Quote from: nerdralph on December 16, 2016, 05:08:45 AM

Did you check the isa to make sure your 4-way ht_store is has a single store_dwordx2?

I didn't take a really close look, but it seems that way after skimming through the ISA. It turned out that register usage doubles when multi-threaded writes are enabled, and occupancy suffers as a result. I just pushed support for multi-threaded writes to the repo, so you could take a look. (You can enable it in param.h as usual.) I will examine the ISA tomorrow to see what's going on. It's 1 a.m. my time, time to go to bed...

Thanks. I'll certainly give you credit for pumping the code out faster than I do. I thing you forgot to push something to the repo though:

Code:

"input.cl", line 586: error: variable with automatic storage duration cannot
          be stored in the named address space
   __local global_pointer_to_slot_t slot_ptrs[64 / 2];
                                    ^

"input.cl", line 708: error: identifier "slot_ptrs" is undefined
       &slot_ptrs[get_local_id(0) / 2]);
        ^

That's very strange... laik2 was able to build the latest version without problems.
I'm using Win 10, Crimson 16.11.2, and RX 480, and laik2 is using Ubuntu 16.04 LTS. What are yours?
By the way, "make test" may be broken as I don't use it on Windows.

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

nerdralph

Sr. Member

Offline

Activity: 588
Merit: 251

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 05:13:32 PM

#55

Quote from: zawawa on December 16, 2016, 04:48:50 PM

Quote from: nerdralph on December 16, 2016, 04:24:47 PM

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs. What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 05:31:32 PM

#56

Quote from: nerdralph on December 16, 2016, 05:13:32 PM

Quote from: zawawa on December 16, 2016, 04:48:50 PM

Quote from: nerdralph on December 16, 2016, 04:24:47 PM

I'm using Ubuntu 14.04 & fglrx. If it builds OK for you I'm surprised. I'm pretty sure "__local __global" is not defined in OpenCL, and should be an error.

Ah, it must be the driver, then. That makes a perfect sense as fglrx was a nightmare to deal with.
The code runs perfectly fine with Crimson drivers.
It's not "__local __global", but a "pointer to a global object stored in local memory," so I don't see anything wrong with that.
It's good to know the code is not compatible with fglrx, though.

I know how you are intending to declare slot_ptrs. What I'm saying is it is not valid syntax.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/global.html

I don't think the specs prohibit the syntax. See:

https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/opencl-c/
http://stackoverflow.com/questions/11978024/opencl-store-pointer-to-global-memory-in-local-memory

Code:

__global char * __local lgc[8];  // 8 pointers stored on the local memory that points to a char located on the global memory

I appreciate your detail-oriented approach, though Wink

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 06:09:59 PM

#57

In any case, this is the portion of the ISA in question for 2-way writes.
It seems very clean to me with a single FLAT_STORE_DWORDX4 at the end.
Maybe I need to separate reads/XOR's and writes into two sections.

Code:

label_0228:
 0x003CA0	S_ANDN2_B64	exec s[28:29] exec	 4 	 Scalar	89FE7E1C
 0x003CA4	V_MOV_B32	v28 0	 4 	 Vector ALU	7E380280
 0x003CA8	S_MOV_B64	exec s[28:29]	 4 	 Scalar	BEFE011C
 0x003CAC	S_WAITCNT	vmcnt(0)	 Varies 	 Flow Control	BF8C0F70
 0x003CB0	DS_READ_B64	v[33:34] v19 offset:7024	 Varies 	 LDS	D8EC1B70 21000013
 0x003CB8	S_WAITCNT	lgkmcnt(0)	 Varies 	 Flow Control	BF8C007F
 0x003CBC	V_CMP_NE_I64	vcc 0 v[33:34]	 Varies 	 Vector ALU	7DCA4280
 0x003CC0	S_AND_SAVEEXEC_B64	s[28:29] vcc	 4 	 Scalar	BE9C206A
 0x003CC4	S_CBRANCH_EXECZ	label_0261	 4/16 	 Branch	BF88002F
 0x003CC8	S_MOV_B32	s8 0x05040c00	 4 	 Scalar	BE8800FF 05040C00
 0x003CD0	S_MOV_B32	s30 0x0c0c000c	 4 	 Scalar	BE9E00FF 0C0C000C
 0x003CD8	V_PERM_B32	v35 v13 v43 s8	 4 	 Vector ALU	D1ED0023 0022570D
 0x003CE0	V_PERM_B32	v36 v44 v44 s30	 4 	 Vector ALU	D1ED0024 007A592C
 0x003CE8	S_MOV_B32	s8 0x04030201	 4 	 Scalar	BE8800FF 04030201
 0x003CF0	V_OR_B32	v35 v35 v36	 4 	 Vector ALU	28464923
 0x003CF4	V_PERM_B32	v8 v48 v45 s8	 4 	 Vector ALU	D1ED0008 00225B30
 0x003CFC	V_PERM_B32	v10 v39 v48 s8	 4 	 Vector ALU	D1ED000A 00226127
 0x003D04	V_PERM_B32	v39 v40 v39 s8	 4 	 Vector ALU	D1ED0027 00224F28
 0x003D0C	V_MOV_B32	v52 v35	 4 	 Vector ALU	7E680323
 0x003D10	V_MOV_B32	v53 v8	 4 	 Vector ALU	7E6A0308
 0x003D14	V_PERM_B32	v8 v29 v40 s8	 4 	 Vector ALU	D1ED0008 0022511D
 0x003D1C	V_MOV_B32	v54 v10	 4 	 Vector ALU	7E6C030A
 0x003D20	V_PERM_B32	v10 v24 v29 s8	 4 	 Vector ALU	D1ED000A 00223B18
 0x003D28	V_MOV_B32	v55 v39	 4 	 Vector ALU	7E6E0327
 0x003D2C	V_LSHRREV_B32	v24 8 v24	 4 	 Vector ALU	20303088
 0x003D30	V_MOV_B32	v56 v8	 4 	 Vector ALU	7E700308
 0x003D34	V_MOV_B32	v57 v10	 4 	 Vector ALU	7E72030A
 0x003D38	V_MOV_B32	v58 v24	 4 	 Vector ALU	7E740318
 0x003D3C	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D40	V_CNDMASK_B32	v8 v52 v56 vcc	 4 	 Vector ALU	00107134
 0x003D44	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D48	V_CNDMASK_B32	v10 v53 v57 vcc	 4 	 Vector ALU	00147335
 0x003D4C	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D50	V_CNDMASK_B32	v45 v54 v58 vcc	 4 	 Vector ALU	005A7536
 0x003D54	V_MOV_B32	v60 v55	 4 	 Vector ALU	7E780337
 0x003D58	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D5C	V_CNDMASK_B32	v48 v55 v59 vcc	 4 	 Vector ALU	00607737
 0x003D60	V_ADD_U32	v33 vcc v33 v22	 4 	 Vector ALU	32422D21
 0x003D64	V_ADDC_U32	v34 vcc v34 0 vcc	 4 	 Vector ALU	D11C6A22 01A90122
 0x003D6C	V_MOV_B32	v35 v8	 4 	 Vector ALU	7E460308
 0x003D70	V_MOV_B32	v36 v10	 4 	 Vector ALU	7E48030A
 0x003D74	V_MOV_B32	v37 v45	 4 	 Vector ALU	7E4A032D
 0x003D78	V_MOV_B32	v38 v48	 4 	 Vector ALU	7E4C0330
 0x003D7C	FLAT_STORE_DWORDX4	v[33:34] v[35:38]

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

nerdralph

Sr. Member

Offline

Activity: 588
Merit: 251

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:12:20 PM

#58

Quote from: zawawa on December 16, 2016, 06:09:59 PM

In any case, this is the portion of the ISA in question for 2-way writes.
It seems very clean to me with a single FLAT_STORE_DWORDX4 at the end.
Maybe I need to separate reads/XOR's and writes into two sections.

Code:

label_0228:
 0x003CA0	S_ANDN2_B64	exec s[28:29] exec	 4 	 Scalar	89FE7E1C
 0x003CA4	V_MOV_B32	v28 0	 4 	 Vector ALU	7E380280
 0x003CA8	S_MOV_B64	exec s[28:29]	 4 	 Scalar	BEFE011C
 0x003CAC	S_WAITCNT	vmcnt(0)	 Varies 	 Flow Control	BF8C0F70
 0x003CB0	DS_READ_B64	v[33:34] v19 offset:7024	 Varies 	 LDS	D8EC1B70 21000013
 0x003CB8	S_WAITCNT	lgkmcnt(0)	 Varies 	 Flow Control	BF8C007F
 0x003CBC	V_CMP_NE_I64	vcc 0 v[33:34]	 Varies 	 Vector ALU	7DCA4280
 0x003CC0	S_AND_SAVEEXEC_B64	s[28:29] vcc	 4 	 Scalar	BE9C206A
 0x003CC4	S_CBRANCH_EXECZ	label_0261	 4/16 	 Branch	BF88002F
 0x003CC8	S_MOV_B32	s8 0x05040c00	 4 	 Scalar	BE8800FF 05040C00
 0x003CD0	S_MOV_B32	s30 0x0c0c000c	 4 	 Scalar	BE9E00FF 0C0C000C
 0x003CD8	V_PERM_B32	v35 v13 v43 s8	 4 	 Vector ALU	D1ED0023 0022570D
 0x003CE0	V_PERM_B32	v36 v44 v44 s30	 4 	 Vector ALU	D1ED0024 007A592C
 0x003CE8	S_MOV_B32	s8 0x04030201	 4 	 Scalar	BE8800FF 04030201
 0x003CF0	V_OR_B32	v35 v35 v36	 4 	 Vector ALU	28464923
 0x003CF4	V_PERM_B32	v8 v48 v45 s8	 4 	 Vector ALU	D1ED0008 00225B30
 0x003CFC	V_PERM_B32	v10 v39 v48 s8	 4 	 Vector ALU	D1ED000A 00226127
 0x003D04	V_PERM_B32	v39 v40 v39 s8	 4 	 Vector ALU	D1ED0027 00224F28
 0x003D0C	V_MOV_B32	v52 v35	 4 	 Vector ALU	7E680323
 0x003D10	V_MOV_B32	v53 v8	 4 	 Vector ALU	7E6A0308
 0x003D14	V_PERM_B32	v8 v29 v40 s8	 4 	 Vector ALU	D1ED0008 0022511D
 0x003D1C	V_MOV_B32	v54 v10	 4 	 Vector ALU	7E6C030A
 0x003D20	V_PERM_B32	v10 v24 v29 s8	 4 	 Vector ALU	D1ED000A 00223B18
 0x003D28	V_MOV_B32	v55 v39	 4 	 Vector ALU	7E6E0327
 0x003D2C	V_LSHRREV_B32	v24 8 v24	 4 	 Vector ALU	20303088
 0x003D30	V_MOV_B32	v56 v8	 4 	 Vector ALU	7E700308
 0x003D34	V_MOV_B32	v57 v10	 4 	 Vector ALU	7E72030A
 0x003D38	V_MOV_B32	v58 v24	 4 	 Vector ALU	7E740318
 0x003D3C	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D40	V_CNDMASK_B32	v8 v52 v56 vcc	 4 	 Vector ALU	00107134
 0x003D44	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D48	V_CNDMASK_B32	v10 v53 v57 vcc	 4 	 Vector ALU	00147335
 0x003D4C	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D50	V_CNDMASK_B32	v45 v54 v58 vcc	 4 	 Vector ALU	005A7536
 0x003D54	V_MOV_B32	v60 v55	 4 	 Vector ALU	7E780337
 0x003D58	V_CMP_EQ_I32	vcc 16 v22	 4 	 Vector ALU	7D842C90
 0x003D5C	V_CNDMASK_B32	v48 v55 v59 vcc	 4 	 Vector ALU	00607737
 0x003D60	V_ADD_U32	v33 vcc v33 v22	 4 	 Vector ALU	32422D21
 0x003D64	V_ADDC_U32	v34 vcc v34 0 vcc	 4 	 Vector ALU	D11C6A22 01A90122
 0x003D6C	V_MOV_B32	v35 v8	 4 	 Vector ALU	7E460308
 0x003D70	V_MOV_B32	v36 v10	 4 	 Vector ALU	7E48030A
 0x003D74	V_MOV_B32	v37 v45	 4 	 Vector ALU	7E4A032D
 0x003D78	V_MOV_B32	v38 v48	 4 	 Vector ALU	7E4C0330
 0x003D7C	FLAT_STORE_DWORDX4	v[33:34] v[35:38]

That looks OK. Even with ~60 VGPRs used, that would still allow for 4 waves per SIMD. I'm going to try to get the code to build with fglrx so I can get a better idea of why it's not performing better.

zawawa (OP)

Sr. Member

Offline

Activity: 728
Merit: 304

Miner Developer

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 09:36:01 PM

#59

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ

laik2

Sr. Member

Offline

Activity: 652
Merit: 266

Re: Gateless Gate: zawawa's open-source cross-platform OpenCL Zcash miner

December 16, 2016, 10:05:43 PM

#60

Quote from: zawawa on December 16, 2016, 09:36:01 PM

Yeah, that would be great. I just pushed an improved version of parallel writes.
It is much faster now, but it's still slower than the single thread version.

In the mean time, I will work on other optimizations.
I think I'm getting a hang of this whole thing.
I am expecting another 10-20% speedup today.
We will see.

I've turned on the linux server if you want to test.

Miners Mining Platform [ MMP OS ] - https://app.mmpos.eu/