Bitcoin Forum
April 26, 2024, 04:09:56 PM *
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
   Home   Help Search Login Register More  
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 8 (15.1%)
It would be nice. - 12 (22.6%)
It's not worth it anymore. - 33 (62.3%)
Total Voters: 53

Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 [42] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 ... 197 »
  Print  
Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 214337 times)
sp_
Legendary
*
Offline Offline

Activity: 2898
Merit: 1087

Team Black developer


View Profile
March 09, 2017, 07:42:29 PM
 #821

All this useless work and so little speed improvements. I think you miss the difference between a slut and slot. You should spend more time with your slut than your slot's..  Grin

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW ZILLIQA + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
1714147796
Hero Member
*
Offline Offline

Posts: 1714147796

View Profile Personal Message (Offline)

Ignore
1714147796
Reply with quote  #2

1714147796
Report to moderator
TalkImg was created especially for hosting images on bitcointalk.org: try it next time you want to post an image
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction.
1714147796
Hero Member
*
Offline Offline

Posts: 1714147796

View Profile Personal Message (Offline)

Ignore
1714147796
Reply with quote  #2

1714147796
Report to moderator
1714147796
Hero Member
*
Offline Offline

Posts: 1714147796

View Profile Personal Message (Offline)

Ignore
1714147796
Reply with quote  #2

1714147796
Report to moderator
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 09, 2017, 08:08:28 PM
 #822

I was just able to directly access the GPU through libdrm, so I'm getting pretty close...
Thank God I read optiminer's README. I wouldn't have thought of doing all this otherwise.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 09, 2017, 08:31:43 PM
 #823

Code:
int amdgpu_device_initialize(int fd,
     uint32_t *major_version,
     uint32_t *minor_version,
     amdgpu_device_handle *device_handle)

int amdgpu_query_gds_info(amdgpu_device_handle dev,
struct amdgpu_gds_resource_info *gds_info)

Very nice, very nice.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
marvykkio
Hero Member
*****
Offline Offline

Activity: 798
Merit: 1000


View Profile
March 09, 2017, 08:45:57 PM
 #824

more or less you know when to release your miner?
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 09, 2017, 09:38:31 PM
 #825

Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
March 09, 2017, 11:59:21 PM
 #826

Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised.  They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 10, 2017, 06:45:25 AM
 #827

Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised.  They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


Well, it's AMD... What else can I say? I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
March 10, 2017, 01:19:28 PM
 #828

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance?  Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS.  Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900.  It gets 279 sols (optiminer is 268 on the same box).
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 10, 2017, 01:27:15 PM
 #829

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance?  Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS.  Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900.  It gets 279 sols (optiminer is 268 on the same box).


Oh, I see. I thought the Linux version of his miner was still faster. My bad.

By the way, GG is already running faster even with a fairly limited amount of the GDS.
We will see...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
eXtremal
Sr. Member
****
Offline Offline

Activity: 2106
Merit: 282


👉bit.ly/3QXp3oh | 🔥 Ultimate Launc


View Profile WWW
March 10, 2017, 06:49:48 PM
 #830

This patch gives +5-6% on NVidia GTX10xx cards:
Quote
--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;
 
 __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
 {
-    return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+  // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+  const uint RowFragmentLog = 4;
+  const uint SlotsInRow = 1 << RowFragmentLog;
+  const uint SlotMask = (1 << RowFragmentLog) - 1;
+  return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
 }
Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.

TONUP██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
██
▄▄███████▄▄
▄▄███████████████▄▄
▄███████████████████▄
▄█████▄░▄▄▀█████▀▄████▄
▄███████▄▀█▄▀██▀▄███████▄
█████████▄▀█▄▀▄██████████
██████████▄▀█▄▀██████████
██████████▀▄▀█▄▀█████████
▀███████▀▄██▄▀█▄▀███████▀
▀████▀▄█████▄▀▀░▀█████▀
▀███████████████████▀
▀▀███████████████▀▀
▀▀███████▀▀
▄▄▄███████▄▄▄
▄▄███████████████▄▄
▄███████████████████▄
▄██████████████▀▀█████▄
▄██████████▀▀█████▐████▄
██████▀▀████▄▄▀▀█████████
████▄▄███▄██▀█████▐██████
█████████▀██████████████
▀███████▌▐██████▐██████▀
▀███████▄▄███▄████████▀
▀███████████████████▀
▀▀███████████████▀▀
▀▀▀███████▀▀▀
▄▄▄███████▄▄▄
▄▄███████████████▄▄
▄███████████████████▄
▄█████████████████████▄
▄████▀▀███▀▀███▀▀██▀███▄
████▀███████▀█▀███▀█████
██████████████████████
████▄███████▄█▄███▄█████
▀████▄▄███▄▄███▄▄██▄███▀
▀█████████████████████▀
▀███████████████████▀
▀▀███████████████▀▀
▀▀▀███████▀▀▀
████████
██
██
██
██
██
██
██
██
██
██
██
████████
████████████████████████████████████████████████████████████████████████████████
.
JOIN NOW
.
████████████████████████████████████████████████████████████████████████████████
████████
██
██
██
██
██
██
██
██
██
██
██
████████
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 10, 2017, 07:08:07 PM
 #831

This patch gives +5-6% on NVidia GTX10xx cards:
Quote
--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;
 
 __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
 {
-    return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+  // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+  const uint RowFragmentLog = 4;
+  const uint SlotsInRow = 1 << RowFragmentLog;
+  const uint SlotMask = (1 << RowFragmentLog) - 1;
+  return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
 }
Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.

I tried several variations of this patch on RX480, but they didn't work...
Thanks for the patch anyway. I would love to work on optimizations for NVIDIA, but first thing first, you know.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 11, 2017, 06:54:55 AM
 #832

more or less you know when to release your miner?

I will release it when I'm satisfied with the performance.
Hopefully within the next few days.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 11, 2017, 04:27:40 PM
 #833

GDS counters and a new implementation of parallel writes are working on RX 480 now.
I just need to optimize them further at this point.
I wish I could do miner development full-time.
It's so much fun and engaging. Oh well.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 11, 2017, 04:41:00 PM
 #834

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.
I am trying to merge these two inline assembly sections so that I could squeeze conditionals in between ds_swizzle_b32's.

Code:
    const int swap_data = (get_local_id(0) & 0x1);
    __global uint4 *second_p = p + 1;
    uint4 second_ui4 = slot.ui4[1];
    __asm(// See: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
          "ds_swizzle_b32 %0.x, %2.x offset:0x041f\n"
          "ds_swizzle_b32 %0.y, %2.y offset:0x041f\n"
          "ds_swizzle_b32 %1.x, %3.x offset:0x041f\n"
          "ds_swizzle_b32 %1.y, %3.y offset:0x041f\n"
          "ds_swizzle_b32 %1.z, %3.z offset:0x041f\n"
          "ds_swizzle_b32 %1.w, %3.w offset:0x041f\n"
          "s_waitcnt lgkmcnt(0)\n"
          : "=v" (second_p),
            "=v" (second_ui4)
          : "0" (second_p),
            "1" (second_ui4)
          : "memory");
   
    __asm("flat_store_dwordx4 %0, %2\n"
          "flat_store_dwordx4 %1, %3\n"
          :
          : "v" (swap_data ? second_p    : p),
            "v" (swap_data ? p           : second_p),
            "v" (swap_data ? second_ui4  : slot.ui4[0]),
            "v" (swap_data ? slot.ui4[0] : second_ui4)
          : "memory");

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
March 11, 2017, 04:57:10 PM
 #835

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 11, 2017, 05:04:05 PM
 #836

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.


Actually, GCN1 supports ds_swizzle. It is ds_permute and ds_bpermute that were newly introduced with GCN3.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 12, 2017, 03:18:05 AM
 #837

I just found out that you can directly send commands to the GPU without the root privileges by using the DRM render node:

https://en.wikipedia.org/wiki/Direct_Rendering_Manager#Render_nodes

I think nirvana is *pretty* close...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 12, 2017, 04:47:58 AM
 #838

Bingo!

https://github.com/torvalds/linux/search?utf8=%E2%9C%93&q=cs_partition_size

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa (OP)
Sr. Member
****
Offline Offline

Activity: 728
Merit: 304


Miner Developer


View Profile
March 12, 2017, 06:02:12 AM
 #839

Looks good to me... I just need to set these registers, no?

Code:
#define mmGDS_VMID0_BASE                                                        0x3300
#define mmGDS_VMID1_BASE                                                        0x3302
#define mmGDS_VMID2_BASE                                                        0x3304
#define mmGDS_VMID3_BASE                                                        0x3306
#define mmGDS_VMID4_BASE                                                        0x3308
#define mmGDS_VMID5_BASE                                                        0x330a
#define mmGDS_VMID6_BASE                                                        0x330c
#define mmGDS_VMID7_BASE                                                        0x330e
#define mmGDS_VMID8_BASE                                                        0x3310
#define mmGDS_VMID9_BASE                                                        0x3312
#define mmGDS_VMID10_BASE                                                       0x3314
#define mmGDS_VMID11_BASE                                                       0x3316
#define mmGDS_VMID12_BASE                                                       0x3318
#define mmGDS_VMID13_BASE                                                       0x331a
#define mmGDS_VMID14_BASE                                                       0x331c
#define mmGDS_VMID15_BASE                                                       0x331e
#define mmGDS_VMID0_SIZE                                                        0x3301
#define mmGDS_VMID1_SIZE                                                        0x3303
#define mmGDS_VMID2_SIZE                                                        0x3305
#define mmGDS_VMID3_SIZE                                                        0x3307
#define mmGDS_VMID4_SIZE                                                        0x3309
#define mmGDS_VMID5_SIZE                                                        0x330b
#define mmGDS_VMID6_SIZE                                                        0x330d
#define mmGDS_VMID7_SIZE                                                        0x330f
#define mmGDS_VMID8_SIZE                                                        0x3311
#define mmGDS_VMID9_SIZE                                                        0x3313
https://github.com/torvalds/linux/blob/5924bbecd0267d87c24110cbe2041b5075173a25/drivers/gpu/drm/amd/include/asic_reg/gca/gfx_7_0_d.h

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 588
Merit: 251


View Profile
March 12, 2017, 04:42:14 PM
 #840


Looks promising, however I find you really need to dig through the code and experiment to see what does and does not work.  I don't consider myself a kernel module developer, so you might already know more about this than I do.  With closed-source drivers like AMDGPU-Pro, it's hard to figure out which parts of the kernel drm API are implemented, and even if they are implemented whether they work.  For example the 16.40 drivers implements the powerplay function force_clock_level(), but it only seems to support type PP_SCLK.
Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 [42] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 ... 197 »
  Print  
 
Jump to:  

Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!