ioglnx
Sr. Member
Offline
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
|
|
November 22, 2016, 07:05:58 PM |
|
Look at my update :-D The increase is there, hitting 500 Sol/s on 4 cards (2 GTX 1080 and 2 GTX 1070) @ 601-619 W; before, with all miners, it was around 450 Sol/s @ 750 W. It's instances=2 which is working... threads is not. :-D
Total 504.4 sol/s [dev0 117.7, dev1 132.8, dev2 124.3, dev3 134.0] 90 shares
Total 504.3 sol/s [dev0 117.1, dev1 133.1, dev2 124.1, dev3 135.2] 90 shares
Total 504.5 sol/s [dev0 117.1, dev1 133.4, dev2 124.6, dev3 135.5] 90 shares
Total 504.3 sol/s [dev0 117.8, dev1 132.7, dev2 125.1, dev3 134.2] 92 shares
Total 504.2 sol/s [dev0 118.7, dev1 132.4, dev2 124.4, dev3 134.6] 93 shares
Total 504.3 sol/s [dev0 119.5, dev1 131.8, dev2 125.7, dev3 134.7] 93 shares
Total 504.2 sol/s [dev0 119.7, dev1 131.6, dev2 126.3, dev3 134.0] 93 shares
Total 504.4 sol/s [dev0 119.2, dev1 131.7, dev2 126.7, dev3 133.3] 93 shares
Total 504.6 sol/s [dev0 119.4, dev1 131.6, dev2 127.0, dev3 133.1] 93 shares
Total 504.6 sol/s [dev0 120.5, dev1 131.6, dev2 127.2, dev3 133.6] 93 shares
Total 504.6 sol/s [dev0 120.9, dev1 130.5, dev2 126.0, dev3 132.6] 93 shares
Total 505.0 sol/s [dev0 120.8, dev1 130.3, dev2 126.0, dev3 133.0] 93 shares
Total 505.0 sol/s [dev0 121.1, dev1 130.6, dev2 125.5, dev3 132.7] 93 shares
Total 505.0 sol/s [dev0 120.2, dev1 130.2, dev2 125.6, dev3 133.8] 94 shares
Total 505.1 sol/s [dev0 120.3, dev1 130.1, dev2 125.6, dev3 133.9] 94 shares
Total 504.9 sol/s [dev0 119.7, dev1 131.6, dev2 124.9, dev3 134.4] 94 shares
Total 504.9 sol/s [dev0 119.6, dev1 132.0, dev2 123.9, dev3 133.8] 94 shares
Total 504.8 sol/s [dev0 119.0, dev1 131.5, dev2 123.8, dev3 132.9] 94 shares
Total 505.0 sol/s [dev0 119.9, dev1 132.0, dev2 124.8, dev3 132.2] 94 shares
Total 505.1 sol/s [dev0 119.3, dev1 132.6, dev2 125.3, dev3 132.8] 94 shares
Total 505.1 sol/s [dev0 118.2, dev1 132.6, dev2 125.9, dev3 130.7] 94 shares
Total 505.1 sol/s [dev0 118.8, dev1 132.6, dev2 125.3, dev3 130.6] 95 shares
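For anyone who wants to compare efficiency rather than raw hashrate, here is a rough Python sketch that parses one status line of the form shown above and works out Sol/s per watt. The function names (`parse_status`, `sol_per_watt`) are made up for illustration and are not part of the miner, and the 610 W figure is just an assumed midpoint of the 601-619 W range quoted in the post.

```python
import re

# One status line, copied from the miner output in this post.
LINE = "Total 504.4 sol/s [dev0 117.7, dev1 132.8, dev2 124.3, dev3 134.0] 90 shares"


def parse_status(line):
    """Split a status line into (total Sol/s, per-device Sol/s, share count).

    Hypothetical helper: the line format is inferred from the log above.
    """
    m = re.match(r"Total\s+([\d.]+)\s+sol/s\s+\[(.*?)\]\s+(\d+)\s+shares", line)
    if m is None:
        raise ValueError("unrecognized status line: %r" % line)
    total = float(m.group(1))
    devices = {
        name: float(rate)
        for name, rate in re.findall(r"(dev\d+)\s+([\d.]+)", m.group(2))
    }
    return total, devices, int(m.group(3))


def sol_per_watt(sols, watts):
    """Efficiency in Sol/s per watt."""
    return sols / watts


total, devices, shares = parse_status(LINE)

# Assumption: 610 W as the midpoint of the reported 601-619 W range.
new_eff = sol_per_watt(total, 610.0)
# Earlier figure from the post: about 450 Sol/s at 750 W.
old_eff = sol_per_watt(450.0, 750.0)
gain = new_eff / old_eff - 1.0

print("new: %.3f Sol/W, old: %.3f Sol/W, gain: %.0f%%" % (new_eff, old_eff, gain * 100))
```

Note that the per-device numbers need not add up exactly to the total, so the sketch keeps them separate instead of summing them.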
|
GTX 1080Ti rocks da house... seriously... this card is a beast³ Owning by now 18x GTX1080Ti :-D @serious love of efficiency
|
|
|
TIKCrazy
Member
Offline
Activity: 73
Merit: 10
|
|
November 22, 2016, 07:12:53 PM |
|
testing2 is working with 104 Sol/s on a 1060
|
|
|
|
zawawa
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
November 22, 2016, 07:13:36 PM |
|
Look at my update :-D The increase is there hitting 500 on 4 Cards..2GTX1080 and 2 GTX1070 @601-619Watts before with all miners around 450Sol/s @ 750Watts.
It's instances=2 which is working ..threads is not.. :-D
VERY interesting... It's the other way around with AMD's drivers for Windows. Gotta love those crazy OpenCL implementations...
|
Gateless Gate Sharp, an open-source ETH/XMR miner: http://bit.ly/2rJ2x4V BTC: 1BHwDWVerUTiKxhHPf2ubqKKiBMiKQGomZ
|
|
|
zawawa
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
November 22, 2016, 07:14:40 PM |
|
testing2 is a keeper, then. Very well.
|
|
|
|
|
ioglnx
Sr. Member
Offline
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
|
|
November 22, 2016, 07:26:46 PM |
|
Lol, do you need to mention "not other vendors"? :-D Just saying "NV only" is much shorter :-P There is basically just AMD left: S3 gone, VIA gone, 3Dfx long gone, SGI gone :-D so...
For the last 5 minutes the pool side has been reporting 601.30 Sol/s.
|
|
|
|
zawawa
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
November 22, 2016, 07:32:43 PM |
|
You would be surprised to know that some people asked me in the past if they could use Intel HD Graphics for GPGPU... You are mostly right about "other vendors," though. I wish these companies were still around.
|
|
|
|
ioglnx
Sr. Member
Offline
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
|
|
November 22, 2016, 07:36:24 PM |
|
Many of us share that dream... but at least some 3Dfx tech saw a revival in Maxwell GPUs, and Pascal has some further additions from 3Dfx :-D Maybe NVIDIA started sorting through the idea boards / paper bags from the 3Dfx offices.
|
|
|
|
laik2
|
|
November 22, 2016, 07:38:07 PM |
|
Many of us wish that dream..but at least some 3Dfx tech made another revival in Maxwell GPUs and also in Pascal there are some additions again from 3Dfx :-D Maybe nvidia started to sort the ideas board / paper bags of 3Dfx offices.
I think someone said that on NVIDIA one must use 1 instance; 2 won't work (or will give the same result). @zawawa - tell 'em that they can, but it's not worth it at all... they don't have the capacity of NV/AMD.
|
|
|
|
Amph
Legendary
Offline
Activity: 3248
Merit: 1070
|
|
November 22, 2016, 07:38:51 PM |
|
That is a good boost over the sp one: I'm getting 120 Sol/s per GPU (1070) with my -502 mem setting and zero core. In my case overclocking gives only a very small boost over underclocking, so it's not worth it.
|
|
|
|
ioglnx
Sr. Member
Offline
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
|
|
November 22, 2016, 07:38:56 PM |
|
That was weeks ago, before all these optimizations took place. For me it's working and gives me 10 Sol/s more :-D
@Amph: Did you notice less power consumption too?
|
|
|
|
zawawa
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
November 22, 2016, 07:39:26 PM |
|
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate. You have to wait a little, but you get a more accurate number that way.
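To illustrate why a 5 min average takes a while to settle but is steadier than an instantaneous reading, here is a minimal Python sketch of a rolling-window rate. It is only an assumption about how such an average could be kept; the class name `RollingHashrate` and its methods are hypothetical and are not taken from the miner's source.

```python
from collections import deque


class RollingHashrate:
    """Average solutions per second over the last `window` seconds (hypothetical)."""

    def __init__(self, start, window=300.0):
        self.start = start      # time the miner started, in seconds
        self.window = window    # 300 s = 5 minutes
        self.samples = deque()  # (timestamp, solutions found) pairs

    def add(self, timestamp, solutions):
        """Record how many solutions were found at `timestamp`."""
        self.samples.append((timestamp, solutions))

    def rate(self, now):
        """Return Sol/s averaged over the window ending at `now`."""
        cutoff = now - self.window
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        # During warm-up, divide by the time actually elapsed.
        elapsed = min(self.window, now - self.start)
        if elapsed <= 0:
            return 0.0
        return sum(n for _, n in self.samples) / elapsed


meter = RollingHashrate(start=0.0)

# Ten minutes of a steady 500 solutions per second.
for t in range(1, 601):
    meter.add(float(t), 500)
steady = meter.rate(600.0)

# A single one-second spike to 800 barely moves the 5 min average.
meter.add(601.0, 800)
spiked = meter.rate(601.0)

print(steady, spiked)
```

With a 300 s window, one unusually good second shifts the reported figure by only 1 Sol/s here, which is the smoothing effect described above.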
|
|
|
|
laik2
|
|
November 22, 2016, 07:41:19 PM |
|
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate. You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.
|
|
|
|
Amph
Legendary
Offline
Activity: 3248
Merit: 1070
|
|
November 22, 2016, 07:41:43 PM |
|
Its weeks ago before all these optimizations have been taken place. For me its working and give me 10sols more :-D
@AMPH: Did you noticed less power consumption too?
Yeah, only 650 W or thereabouts, but without 200 Sol/s per GPU it's still not competitive enough against AMD... sadly.
|
|
|
|
laik2
|
|
November 22, 2016, 07:47:51 PM |
|
Its weeks ago before all these optimizations have been taken place. For me its working and give me 10sols more :-D
@AMPH: Did you noticed less power consumption too?
yeah, only 650 W or thereabouts, but without 200 Sol/s per GPU it's still not competitive enough against AMD... sadly
200 with Claymore, and the price is a much higher electricity bill... I can't do more than 110 with my RX 480s, but wattage is only 450 (4 cards).
|
|
|
|
zawawa
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
November 22, 2016, 07:54:13 PM |
|
Oh, I forgot to mention that my Windows port always shows a 5 min average for total hashrate. You have to wait a little, but you get a more accurate number that way.
I wish I could do 130S/s with my AMDs.
I'm pretty sure we will get there. It is just that there are no "easy" optimizations left for AMD cards because they were the first targets of this miner. The next optimization requires a massive rewrite, but it can be done, methinks.
|
|
|
|
|
laik2
|
|
November 22, 2016, 08:34:02 PM Last edit: November 22, 2016, 08:48:01 PM by laik2 |
|
This is basically what I could understand from your writings. wrong paste
Using LDS or L1 Cache
There are a number of considerations when deciding between LDS and L1 cache for a given algorithm.
LDS supports read/modify/write operations, as well as atomics. It is well-suited for code that requires fast read/write, read/modify/write, or scatter operations that otherwise are directed to global memory. On current AMD hardware, L1 is part of the read path; hence, it is suited to cache-read-sensitive algorithms, such as matrix multiplication or convolution.
LDS is typically larger than L1 (for example: 64 kB vs 16 kB on Southern Islands devices). If it is not possible to obtain a high L1 cache hit rate for an algorithm, the larger LDS size can help. On the AMD Radeon HD 7970 device, the theoretical LDS peak bandwidth is 3.8 TB/s, compared to L1 at 1.9 TB/sec.
The native data type for L1 is a four-vector of 32-bit words. On L1, fill and read addressing are linked. It is important that L1 is initially filled from global memory with a coalesced access pattern; once filled, random accesses come at no extra processing cost.
Currently, the native format of LDS is a 32-bit word. The theoretical LDS peak bandwidth is achieved when each thread operates on a two-vector of 32-bit words (16 threads per clock operate on 32 banks). If an algorithm requires coalesced 32-bit quantities, it maps well to LDS. The use of four-vectors or larger can lead to bank conflicts, although the compiler can mitigate some of these.
From an application point of view, filling LDS from global memory, and reading from it, are independent operations that can use independent addressing. Thus, LDS can be used to explicitly convert a scattered access pattern to a coalesced pattern for read and write to global memory. Or, by taking advantage of the LDS read broadcast feature, LDS can be filled with a coalesced pattern from global memory, followed by all threads iterating through the same LDS words simultaneously.
LDS reuses the data already pulled into cache by other wavefronts. Sharing across work-groups is not possible because OpenCL does not guarantee that LDS is in a particular state at the beginning of work-group execution. L1 content, on the other hand, is independent of work-group execution, so that successive work-groups can share the content in the L1 cache of a given Vector ALU. However, it currently is not possible to explicitly control L1 sharing across work-groups.
The use of LDS is linked to GPR usage and wavefront-per-Vector ALU count. Better sharing efficiency requires a larger work-group, so that more work-items share the same LDS. Compiling kernels for larger work-groups typically results in increased register use, so that fewer wavefronts can be scheduled simultaneously per Vector ALU. This, in turn, reduces memory latency hiding. Requesting larger amounts of LDS per work-group results in fewer wavefronts per Vector ALU, with the same effect.
LDS typically involves the use of barriers, with a potential performance impact. This is true even for read-only use cases, as LDS must be explicitly filled in from global memory (after which a barrier is required before reads can commence).
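The peak bandwidth figures in the quoted text can be roughly reproduced with a few lines of arithmetic. The Python sketch below is only a sanity check under stated assumptions: 32 compute units and a 925 MHz engine clock are the published reference specs of the HD 7970, the 128 bytes per clock per compute unit for LDS follows from the 32 banks of 32-bit words mentioned above, and the 64 bytes per clock per compute unit for L1 is an assumption chosen because it matches the quoted 1.9 TB/s.

```python
# Assumed HD 7970 reference specs (not stated in the quoted text).
COMPUTE_UNITS = 32
ENGINE_CLOCK_HZ = 925e6

# From the quoted text: 32 LDS banks, each serving one 32-bit word per clock.
LDS_BANKS = 32
BYTES_PER_WORD = 4

# Assumption: L1 delivers 64 bytes per clock per compute unit.
L1_BYTES_PER_CLOCK = 64


def peak_bandwidth(bytes_per_clock_per_cu,
                   compute_units=COMPUTE_UNITS,
                   clock_hz=ENGINE_CLOCK_HZ):
    """Theoretical peak bandwidth in bytes per second across the whole chip."""
    return bytes_per_clock_per_cu * compute_units * clock_hz


lds_bytes_per_clock = LDS_BANKS * BYTES_PER_WORD  # 128 bytes per clock per CU

# Cross-check against the text: 16 threads per clock, each operating on a
# two-vector of 32-bit words, keeps all 32 banks busy.
threads_per_clock = 16
words_per_thread = 2
assert threads_per_clock * words_per_thread * BYTES_PER_WORD == lds_bytes_per_clock

lds_tb = peak_bandwidth(lds_bytes_per_clock) / 1e12
l1_tb = peak_bandwidth(L1_BYTES_PER_CLOCK) / 1e12

print("LDS ~%.1f TB/s, L1 ~%.1f TB/s" % (lds_tb, l1_tb))  # LDS ~3.8 TB/s, L1 ~1.9 TB/s
```

The factor of two between the results comes entirely from the bytes-per-clock assumption, so it says nothing about latency or about how often a real kernel reaches either peak.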
|
|
|
|
zawawa
Sr. Member
Offline
Activity: 728
Merit: 304
Miner Developer
|
|
November 22, 2016, 08:48:20 PM |
|
Marvelous! An excellent analysis! I will take up the challenge with the GCN assembly. This is so much fun!!
|
|
|
|
antantti
Legendary
Offline
Activity: 1176
Merit: 1015
|
|
November 22, 2016, 08:54:45 PM |
|
4x 970 with zawawawa-r12-nv doing ~400 Sol/s on W7 with some OC. Power consumption is down a bit from the sp_ version.
r6 was hashing ~340 and sp_1 ~380 with the same clocks.
As for competition against AMD in Equihash, I am afraid we haven't seen anything yet from the high-end older AMD cards.
|
|
|
|
|