sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
March 09, 2017, 07:42:29 PM |
|
All this useless work and so little speed improvements. I think you miss the difference between a slut and slot. You should spend more time with your slut than your slot's..
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 09, 2017, 08:08:28 PM |
|
I was just able to directly access the GPU through libdrm, so I'm getting pretty close... Thank God I read optiminer's README. I wouldn't have thought of doing all this otherwise.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 09, 2017, 08:31:43 PM |
|
int amdgpu_device_initialize(int fd, uint32_t *major_version, uint32_t *minor_version, amdgpu_device_handle *device_handle)
int amdgpu_query_gds_info(amdgpu_device_handle dev, struct amdgpu_gds_resource_info *gds_info) Very nice, very nice.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
marvykkio
|
|
March 09, 2017, 08:45:57 PM |
|
more or less you know when to release your miner?
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 09, 2017, 09:38:31 PM |
|
$ sudo tests/modeprint/modeprint amdgpu Starting test gds_gfx_partition_size: 4096 compute_partition_size: 4096 gds_total_size: 65536 gws_per_gfx_partition: 4 gws_per_compute_partition: 4 oa_per_gfx_partition: 4 oa_per_compute_partition: 1 Ok I knew it! What a piece of junk...
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
nerdralph
|
|
March 09, 2017, 11:59:21 PM |
|
$ sudo tests/modeprint/modeprint amdgpu Starting test gds_gfx_partition_size: 4096 compute_partition_size: 4096 gds_total_size: 65536 gws_per_gfx_partition: 4 gws_per_compute_partition: 4 oa_per_gfx_partition: 4 oa_per_compute_partition: 1 Ok I knew it! What a piece of junk... I'm impressed, and a bit surprised. They're obviously using the GDS for GWS, but that *should* be documented somewhere. I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 10, 2017, 06:45:25 AM |
|
$ sudo tests/modeprint/modeprint amdgpu Starting test gds_gfx_partition_size: 4096 compute_partition_size: 4096 gds_total_size: 65536 gws_per_gfx_partition: 4 gws_per_compute_partition: 4 oa_per_gfx_partition: 4 oa_per_compute_partition: 1 Ok I knew it! What a piece of junk... I'm impressed, and a bit surprised. They're obviously using the GDS for GWS, but that *should* be documented somewhere. I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too. Well, it's AMD... What else can I say? I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
nerdralph
|
|
March 10, 2017, 01:19:28 PM |
|
I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.
I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS. So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance? Not likely IMHO. Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS. Even getting near 200 requires custom BIOS strap mods. p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900. It gets 279 sols (optiminer is 268 on the same box).
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 10, 2017, 01:27:15 PM |
|
I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800. So it seems there is a way to initialize the GDS in Windoze too.
I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS. So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance? Not likely IMHO. Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS. Even getting near 200 requires custom BIOS strap mods. p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900. It gets 279 sols (optiminer is 268 on the same box). Oh, I see. I thought the Linux version of his miner was still faster. My bad. By the way, GG is already running faster even with a fairly limited amount of the GDS. We will see...
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
eXtremal
|
|
March 10, 2017, 06:49:48 PM |
|
This patch gives +5-6% on NVidia GTX10xx cards: --- a/Core/kernel/equihash.cl +++ b/Core/kernel/equihash.cl @@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t; __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot) { - return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round); + // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits + const uint RowFragmentLog = 4; + const uint SlotsInRow = 1 << RowFragmentLog; + const uint SlotMask = (1 << RowFragmentLog) - 1; + return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round); }
Can't test on AMD now, may be it gives same performance advantage. Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 10, 2017, 07:08:07 PM |
|
This patch gives +5-6% on NVidia GTX10xx cards: --- a/Core/kernel/equihash.cl +++ b/Core/kernel/equihash.cl @@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t; __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot) { - return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round); + // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits + const uint RowFragmentLog = 4; + const uint SlotsInRow = 1 << RowFragmentLog; + const uint SlotMask = (1 << RowFragmentLog) - 1; + return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round); }
Can't test on AMD now, may be it gives same performance advantage. Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070. I tried several variations of this patch on RX480, but they didn't work... Thanks for the patch anyway. I would love to work on optimizations for NVIDIA, but first thing first, you know.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 11, 2017, 06:54:55 AM |
|
more or less you know when to release your miner?
I will release it when I'm satisfied with the performance. Hopefully within the next few days.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 11, 2017, 04:27:40 PM |
|
GDS counters and a new implementation of parallel writes are working on RX 480 now. I just need to optimize them further at this point. I wish I could do miner development full-time. It's so much fun and engaging. Oh well.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 11, 2017, 04:41:00 PM |
|
This is the new version of parallel writes. It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes. The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations. I am trying to merge these two inline assembly sections so that I could squeeze conditionals in between ds_swizzle_b32's. const int swap_data = (get_local_id(0) & 0x1); __global uint4 *second_p = p + 1; uint4 second_ui4 = slot.ui4[1]; __asm(// See: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ "ds_swizzle_b32 %0.x, %2.x offset:0x041f\n" "ds_swizzle_b32 %0.y, %2.y offset:0x041f\n" "ds_swizzle_b32 %1.x, %3.x offset:0x041f\n" "ds_swizzle_b32 %1.y, %3.y offset:0x041f\n" "ds_swizzle_b32 %1.z, %3.z offset:0x041f\n" "ds_swizzle_b32 %1.w, %3.w offset:0x041f\n" "s_waitcnt lgkmcnt(0)\n" : "=v" (second_p), "=v" (second_ui4) : "0" (second_p), "1" (second_ui4) : "memory"); __asm("flat_store_dwordx4 %0, %2\n" "flat_store_dwordx4 %1, %3\n" : : "v" (swap_data ? second_p : p), "v" (swap_data ? p : second_p), "v" (swap_data ? second_ui4 : slot.ui4[0]), "v" (swap_data ? slot.ui4[0] : second_ui4) : "memory");
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
nerdralph
|
|
March 11, 2017, 04:57:10 PM |
|
This is the new version of parallel writes. It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes. The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.
And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 11, 2017, 05:04:05 PM |
|
This is the new version of parallel writes. It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes. The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.
And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti. Actually, GCN1 supports ds_swizzle. It is ds_permute and ds_bpermute that were newly introduced with GCN3.
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 12, 2017, 04:47:58 AM |
|
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
zawawa (OP)
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
March 12, 2017, 06:02:12 AM |
|
Looks good to me... I just need to set these registers, no? #define mmGDS_VMID0_BASE 0x3300 #define mmGDS_VMID1_BASE 0x3302 #define mmGDS_VMID2_BASE 0x3304 #define mmGDS_VMID3_BASE 0x3306 #define mmGDS_VMID4_BASE 0x3308 #define mmGDS_VMID5_BASE 0x330a #define mmGDS_VMID6_BASE 0x330c #define mmGDS_VMID7_BASE 0x330e #define mmGDS_VMID8_BASE 0x3310 #define mmGDS_VMID9_BASE 0x3312 #define mmGDS_VMID10_BASE 0x3314 #define mmGDS_VMID11_BASE 0x3316 #define mmGDS_VMID12_BASE 0x3318 #define mmGDS_VMID13_BASE 0x331a #define mmGDS_VMID14_BASE 0x331c #define mmGDS_VMID15_BASE 0x331e #define mmGDS_VMID0_SIZE 0x3301 #define mmGDS_VMID1_SIZE 0x3303 #define mmGDS_VMID2_SIZE 0x3305 #define mmGDS_VMID3_SIZE 0x3307 #define mmGDS_VMID4_SIZE 0x3309 #define mmGDS_VMID5_SIZE 0x330b #define mmGDS_VMID6_SIZE 0x330d #define mmGDS_VMID7_SIZE 0x330f #define mmGDS_VMID8_SIZE 0x3311 #define mmGDS_VMID9_SIZE 0x3313 https://github.com/torvalds/linux/blob/5924bbecd0267d87c24110cbe2041b5075173a25/drivers/gpu/drm/amd/include/asic_reg/gca/gfx_7_0_d.h
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4VBTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
nerdralph
|
|
March 12, 2017, 04:42:14 PM |
|
Looks promising, however I find you really need to dig through the code and experiment to see what does and does not work. I don't consider myself a kernel module developer, so you might already know more about this than I do. With closed-source drivers like AMDGPU-Pro, it's hard to figure out which parts of the kernel drm API are implemented, and even if they are implemented whether they work. For example the 16.40 drivers implements the powerplay function force_clock_level(), but it only seems to support type PP_SCLK.
|
|
|
|
|