Bitcoin Forum
June 19, 2018, 07:20:04 PM *
News: Latest stable version of Bitcoin Core: 0.16.1  [Torrent]. (New!)
 
   Home   Help Search Donate Login Register  
Poll
Question: Do you want to see improvements in Ethash dual-mining with GGS?
I desperately need it. - 6 (17.1%)
It would be nice. - 7 (20%)
It's not worth it anymore. - 22 (62.9%)
Total Voters: 35

Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 [42] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 ... 194 »
  Print  
Author Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480!  (Read 197520 times)
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 09, 2017, 06:12:47 PM
 #821

If optiminer directly initializes the GDS, it could probably do it for both Linux and Windows.

It looks like it is possible, on Linux, to send PM4 packets from the user space through libdrm, though, whereas that functionality is not exposed on Windows:

https://github.com/Lucretia/libdrm-amdgpu/search?utf8=%E2%9C%93&q=gds

I will try libdrm and see what I can do with it.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
1529436004
Hero Member
*
Offline Offline

Posts: 1529436004

View Profile Personal Message (Offline)

Ignore
1529436004
Reply with quote  #2

1529436004
Report to moderator
1529436004
Hero Member
*
Offline Offline

Posts: 1529436004

View Profile Personal Message (Offline)

Ignore
1529436004
Reply with quote  #2

1529436004
Report to moderator
1529436004
Hero Member
*
Offline Offline

Posts: 1529436004

View Profile Personal Message (Offline)

Ignore
1529436004
Reply with quote  #2

1529436004
Report to moderator
Automated Bitcoin Fork Extraction Tool WE DO TOUGH WALLETS: BCH | BTG | BCD | SBTC | UBTC | B2X | BCX | BTF Electrum 2FA, Trezor, Ledger, SegWit, Bech32
Advertised sites are not endorsed by the Bitcoin Forum. They may be unsafe, untrustworthy, or illegal in your jurisdiction. Advertise here.
1529436004
Hero Member
*
Offline Offline

Posts: 1529436004

View Profile Personal Message (Offline)

Ignore
1529436004
Reply with quote  #2

1529436004
Report to moderator
1529436004
Hero Member
*
Offline Offline

Posts: 1529436004

View Profile Personal Message (Offline)

Ignore
1529436004
Reply with quote  #2

1529436004
Report to moderator
1529436004
Hero Member
*
Offline Offline

Posts: 1529436004

View Profile Personal Message (Offline)

Ignore
1529436004
Reply with quote  #2

1529436004
Report to moderator
sp_
Legendary
*
Offline Offline

Activity: 1428
Merit: 1040

Ccminer developer


View Profile
March 09, 2017, 07:42:29 PM
 #822

All this useless work and so little speed improvements. I think you miss the difference between a slut and slot. You should spend more time with your slut than your slot's..  Grin
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 09, 2017, 08:08:28 PM
 #823

I was just able to directly access the GPU through libdrm, so I'm getting pretty close...
Thank God I read optiminer's README. I wouldn't have thought of doing all this otherwise.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 09, 2017, 08:31:43 PM
 #824

Code:
int amdgpu_device_initialize(int fd,
     uint32_t *major_version,
     uint32_t *minor_version,
     amdgpu_device_handle *device_handle)

int amdgpu_query_gds_info(amdgpu_device_handle dev,
struct amdgpu_gds_resource_info *gds_info)

Very nice, very nice.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
marvykkio
Hero Member
*****
Offline Offline

Activity: 798
Merit: 1000


View Profile
March 09, 2017, 08:45:57 PM
 #825

more or less you know when to release your miner?
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 09, 2017, 09:38:31 PM
 #826

Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 560
Merit: 251


View Profile
March 09, 2017, 11:59:21 PM
 #827

Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised.  They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 10, 2017, 06:45:25 AM
 #828

Code:
$ sudo tests/modeprint/modeprint amdgpu
Starting test
gds_gfx_partition_size: 4096
compute_partition_size: 4096
gds_total_size: 65536
gws_per_gfx_partition: 4
gws_per_compute_partition: 4
oa_per_gfx_partition: 4
oa_per_compute_partition: 1
Ok

I knew it! What a piece of junk...

I'm impressed, and a bit surprised.  They're obviously using the GDS for GWS, but that *should* be documented somewhere.

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


Well, it's AMD... What else can I say? I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 560
Merit: 251


View Profile
March 10, 2017, 01:19:28 PM
 #829

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance?  Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS.  Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900.  It gets 279 sols (optiminer is 268 on the same box).
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 10, 2017, 01:27:15 PM
 #830

I just tested out claymore 12.3 on Windows with 16.10.1 drivers, and get ~255 sols with a Rx 470 clocked at 1050/1800.  So it seems there is a way to initialize the GDS in Windoze too.


I believe that extra 30+ sol/s with Claymore's comes from optimizations in the GCN assembly unrelated to GDS.

So you're saying Claymore on Windows doesn't use GDS, while under Linux it does, but still only gets the same general performance?  Not likely IMHO.
Plus, as I've explained before, it's impossible to get much more than 200 sols from a Rx 470 clocked at 1250/1750 without using GDS.  Even getting near 200 requires custom BIOS strap mods.

p.s. I just tested Claymore 12.3 on Linux 4.8 with AMDGPU-Pro 16.40 on a Rx 470 clocked at 1200/1900.  It gets 279 sols (optiminer is 268 on the same box).


Oh, I see. I thought the Linux version of his miner was still faster. My bad.

By the way, GG is already running faster even with a fairly limited amount of the GDS.
We will see...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
eXtremal
Sr. Member
****
Offline Offline

Activity: 480
Merit: 253


View Profile
March 10, 2017, 06:49:48 PM
 #831

This patch gives +5-6% on NVidia GTX10xx cards:
Quote
--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;
 
 __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
 {
-    return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+  // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+  const uint RowFragmentLog = 4;
+  const uint SlotsInRow = 1 << RowFragmentLog;
+  const uint SlotMask = (1 << RowFragmentLog) - 1;
+  return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
 }
Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 10, 2017, 07:08:07 PM
 #832

This patch gives +5-6% on NVidia GTX10xx cards:
Quote
--- a/Core/kernel/equihash.cl
+++ b/Core/kernel/equihash.cl
@@ -102,7 +102,11 @@ typedef __global slot_t *global_pointer_to_slot_t;
 
 __global char *get_slot_ptr(__global char *ht, uint round, uint row, uint slot)
 {
-    return ht + (row * _NR_SLOTS(round) + slot) * _SLOT_LEN(round);
+  // Split row into several sub-rows with 2^RowFragmentLog slots, it gives more L2 cache hits
+  const uint RowFragmentLog = 4;
+  const uint SlotsInRow = 1 << RowFragmentLog;
+  const uint SlotMask = (1 << RowFragmentLog) - 1;
+  return ht + ((slot >> RowFragmentLog)*_NR_ROWS(round)*_SLOT_LEN(round)*SlotsInRow) + (row*_SLOT_LEN(round)*SlotsInRow) + (slot & SlotMask)*_SLOT_LEN(round);
 }
Can't test on AMD now, may be it gives same performance advantage.
Unfortunatelly, this miner useless for NV cards.. only 280sols/s on GTX1070.

I tried several variations of this patch on RX480, but they didn't work...
Thanks for the patch anyway. I would love to work on optimizations for NVIDIA, but first thing first, you know.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 11, 2017, 06:54:55 AM
 #833

more or less you know when to release your miner?

I will release it when I'm satisfied with the performance.
Hopefully within the next few days.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 11, 2017, 04:27:40 PM
 #834

GDS counters and a new implementation of parallel writes are working on RX 480 now.
I just need to optimize them further at this point.
I wish I could do miner development full-time.
It's so much fun and engaging. Oh well.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 11, 2017, 04:41:00 PM
 #835

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.
I am trying to merge these two inline assembly sections so that I could squeeze conditionals in between ds_swizzle_b32's.

Code:
    const int swap_data = (get_local_id(0) & 0x1);
    __global uint4 *second_p = p + 1;
    uint4 second_ui4 = slot.ui4[1];
    __asm(// See: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
          "ds_swizzle_b32 %0.x, %2.x offset:0x041f\n"
          "ds_swizzle_b32 %0.y, %2.y offset:0x041f\n"
          "ds_swizzle_b32 %1.x, %3.x offset:0x041f\n"
          "ds_swizzle_b32 %1.y, %3.y offset:0x041f\n"
          "ds_swizzle_b32 %1.z, %3.z offset:0x041f\n"
          "ds_swizzle_b32 %1.w, %3.w offset:0x041f\n"
          "s_waitcnt lgkmcnt(0)\n"
          : "=v" (second_p),
            "=v" (second_ui4)
          : "0" (second_p),
            "1" (second_ui4)
          : "memory");
   
    __asm("flat_store_dwordx4 %0, %2\n"
          "flat_store_dwordx4 %1, %3\n"
          :
          : "v" (swap_data ? second_p    : p),
            "v" (swap_data ? p           : second_p),
            "v" (swap_data ? second_ui4  : slot.ui4[0]),
            "v" (swap_data ? slot.ui4[0] : second_ui4)
          : "memory");

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
nerdralph
Sr. Member
****
Offline Offline

Activity: 560
Merit: 251


View Profile
March 11, 2017, 04:57:10 PM
 #836

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 11, 2017, 05:04:05 PM
 #837

This is the new version of parallel writes.
It uses ds_swizzle_b32 and does not rely on LDS for sharing data with adjacent lanes.
The problem is that the data share unit is still overloaded with 6 consecutive ds_swizzle_b32 operations.

And I think ds_swizzle was only introduced in GCN3, so it would not work on Hawaii and Tahiti.


Actually, GCN1 supports ds_swizzle. It is ds_permute and ds_bpermute that were newly introduced with GCN3.

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 12, 2017, 03:18:05 AM
 #838

I just found out that you can directly send commands to the GPU without the root privileges by using the DRM render node:

https://en.wikipedia.org/wiki/Direct_Rendering_Manager#Render_nodes

I think nirvana is *pretty* close...

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 12, 2017, 04:47:58 AM
 #839

Bingo!

https://github.com/torvalds/linux/search?utf8=%E2%9C%93&q=cs_partition_size

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
zawawa
Sr. Member
****
Offline Offline

Activity: 588
Merit: 277


Miner Developer


View Profile
March 12, 2017, 06:02:12 AM
 #840

Looks good to me... I just need to set these registers, no?

Code:
#define mmGDS_VMID0_BASE                                                        0x3300
#define mmGDS_VMID1_BASE                                                        0x3302
#define mmGDS_VMID2_BASE                                                        0x3304
#define mmGDS_VMID3_BASE                                                        0x3306
#define mmGDS_VMID4_BASE                                                        0x3308
#define mmGDS_VMID5_BASE                                                        0x330a
#define mmGDS_VMID6_BASE                                                        0x330c
#define mmGDS_VMID7_BASE                                                        0x330e
#define mmGDS_VMID8_BASE                                                        0x3310
#define mmGDS_VMID9_BASE                                                        0x3312
#define mmGDS_VMID10_BASE                                                       0x3314
#define mmGDS_VMID11_BASE                                                       0x3316
#define mmGDS_VMID12_BASE                                                       0x3318
#define mmGDS_VMID13_BASE                                                       0x331a
#define mmGDS_VMID14_BASE                                                       0x331c
#define mmGDS_VMID15_BASE                                                       0x331e
#define mmGDS_VMID0_SIZE                                                        0x3301
#define mmGDS_VMID1_SIZE                                                        0x3303
#define mmGDS_VMID2_SIZE                                                        0x3305
#define mmGDS_VMID3_SIZE                                                        0x3307
#define mmGDS_VMID4_SIZE                                                        0x3309
#define mmGDS_VMID5_SIZE                                                        0x330b
#define mmGDS_VMID6_SIZE                                                        0x330d
#define mmGDS_VMID7_SIZE                                                        0x330f
#define mmGDS_VMID8_SIZE                                                        0x3311
#define mmGDS_VMID9_SIZE                                                        0x3313
https://github.com/torvalds/linux/blob/5924bbecd0267d87c24110cbe2041b5075173a25/drivers/gpu/drm/amd/include/asic_reg/gca/gfx_7_0_d.h

Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V
BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
Pages: « 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 [42] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 ... 194 »
  Print  
 
Jump to:  

Sponsored by , a Bitcoin-accepting VPN.
Powered by MySQL Powered by PHP Powered by SMF 1.1.19 | SMF © 2006-2009, Simple Machines Valid XHTML 1.0! Valid CSS!