Bitcoin Forum
April 30, 2017, 09:12:47 AM
News: Latest stable version of Bitcoin Core: 0.14.1  [Torrent]. (New!)
 
Pages: « 1 2 3 4 5 6 [7] 8 9 10 11 12 13 14 15 16 17 18 19 20 21 »  All
Author Topic: [ANN][GRS][DMD] Pallas optimized groestlcoin / diamond etc. opencl kernel  (Read 50079 times)
Wolf0
Legendary
Offline
Activity: 1554
Miner Developer
November 23, 2014, 03:06:50 PM  #121

@pallas could you find the current state-of-the-art mining software for DMD Groestl and post links in the DMD ANN? We will then update the software on the website.

It would be great if it already included your performance-boost tricks.

I think no one on our core team runs AMD cards any longer, so your help would be welcome.


The problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on the 14.9 drivers (only 20 MH/s vs 25 with 14.6), so it's not enough to just replace diamond.cl in sgminer 4.1 or 5.
That's why I still prefer that people visit this post, with all the info and troubleshooting, for best performance.
The only clean way is to create a fork of sgminer, for Tahiti and Hawaii cards only, with the precompiled binary; some changes are needed so that it always uses the binary and doesn't compile the .cl sources.
Not sure I like it, but it might work for many... what do you think?

Just an aside - I've gotten the same results - 21 MH/s vs. 25 MH/s. It's frustrating, but all I've tried so far is the lookup-table implementation.

Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
pallas
Legendary
Online
Activity: 1274
Black Belt Developer
November 23, 2014, 04:15:49 PM  #122

@pallas could you find the current state-of-the-art mining software for DMD Groestl and post links in the DMD ANN? We will then update the software on the website.

It would be great if it already included your performance-boost tricks.

I think no one on our core team runs AMD cards any longer, so your help would be welcome.


The problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on the 14.9 drivers (only 20 MH/s vs 25 with 14.6), so it's not enough to just replace diamond.cl in sgminer 4.1 or 5.
That's why I still prefer that people visit this post, with all the info and troubleshooting, for best performance.
The only clean way is to create a fork of sgminer, for Tahiti and Hawaii cards only, with the precompiled binary; some changes are needed so that it always uses the binary and doesn't compile the .cl sources.
Not sure I like it, but it might work for many... what do you think?

Just an aside - I've gotten the same results - 21 MH/s vs. 25 MH/s. It's frustrating, but all I've tried so far is the lookup-table implementation.

Well, that means there is probably little room for improvement in that kind of implementation.
I'm curious to see whether a bitslice version can be faster on AMD GPUs, but I have no time (and no interest, given the negative revenue) to try it myself.

Wolf0
Legendary
Offline
Activity: 1554
Miner Developer
November 23, 2014, 04:18:11 PM  #123

@pallas could you find the current state-of-the-art mining software for DMD Groestl and post links in the DMD ANN? We will then update the software on the website.

It would be great if it already included your performance-boost tricks.

I think no one on our core team runs AMD cards any longer, so your help would be welcome.


The problem with my kernel is that, no matter how hard I try, I can't get the best hashrate on the 14.9 drivers (only 20 MH/s vs 25 with 14.6), so it's not enough to just replace diamond.cl in sgminer 4.1 or 5.
That's why I still prefer that people visit this post, with all the info and troubleshooting, for best performance.
The only clean way is to create a fork of sgminer, for Tahiti and Hawaii cards only, with the precompiled binary; some changes are needed so that it always uses the binary and doesn't compile the .cl sources.
Not sure I like it, but it might work for many... what do you think?

Just an aside - I've gotten the same results - 21 MH/s vs. 25 MH/s. It's frustrating, but all I've tried so far is the lookup-table implementation.

Well, that means there is probably little room for improvement in that kind of implementation.
I'm curious to see whether a bitslice version can be faster on AMD GPUs, but I have no time (and no interest, given the negative revenue) to try it myself.

I think it might be: 14.9 killed my X11 hashrate at first, down from 10 MH/s on a 290X to 2-point-something. After redesigning Groestl, still based on lookup tables, I got it back up to about 6.5 MH/s. Still dismal...

Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
pallas
Legendary
Online
Activity: 1274
Black Belt Developer
December 27, 2014, 10:18:22 PM  #124

Could someone please share their hashrate with an R9 285? I'm curious whether it outperforms the 280 and how much power it uses.

lpedretti
Full Member
Offline
Activity: 145
December 29, 2014, 04:19:19 PM  #125

I was having issues using the optimized .cl and the precompiled binaries: no HW errors, but shares were very occasional and the pools reported a very low hashrate. The problem turned out to be the sgminer version I was using; I'm now using sgminer-develop, which has optimized NeoScrypt kernels, and with that version it works like a charm!
Running Lubuntu 14.04 with Catalyst 14.x (don't remember which one).
Clocked at 930 MHz, 0.95 V, 13.5 MH/s each on an XFX 7970 DD and a Gigabyte 280X WindForce.

Great job!

Best regards!

AC: ANuRoFPkCjZSxsw2S41djrrA1D4xMMmwhs
realhet
Jr. Member
Offline
Activity: 32
December 31, 2014, 10:59:08 PM  #126

Hi all,

I registered here because I need a little help from those of you who develop this OpenCL kernel.
A month ago I found the Groestl algo on the AMD dev forums, thanks to Wolf0, who mentioned it there. I thought it would be a good algo for testing my skills in GCN asm, and I'd like to play with it; maybe I can optimize it better than the OpenCL compiler (or maybe not, but at least I can learn from it).

So the help I'm seeking is this:
- Please send me the latest version of this kernel (I see everyone altering it a bit, and I don't know which is which).
- And please give me a test vector with these things:
  - global kernel dimensions and workgroup size (I guess it's 256)
  - kernel parameters: a dump of "char *block", and the "target" value
- And of course the above test case must find a GroestlCoin hash.

Thank you in advance.

(I already sent this to Wolf0 on the AMD dev forums, but moderation there can take more time, and later I found this more appropriate place for my question.)

And have a Happy New Year, btw.
Wolf0
Legendary
Offline
Activity: 1554
Miner Developer
January 01, 2015, 12:17:47 AM  #127

Hi all,

I registered here because I need a little help from those of you who develop this OpenCL kernel.
A month ago I found the Groestl algo on the AMD dev forums, thanks to Wolf0, who mentioned it there. I thought it would be a good algo for testing my skills in GCN asm, and I'd like to play with it; maybe I can optimize it better than the OpenCL compiler (or maybe not, but at least I can learn from it).

So the help I'm seeking is this:
- Please send me the latest version of this kernel (I see everyone altering it a bit, and I don't know which is which).
- And please give me a test vector with these things:
  - global kernel dimensions and workgroup size (I guess it's 256)
  - kernel parameters: a dump of "char *block", and the "target" value
- And of course the above test case must find a GroestlCoin hash.

Thank you in advance.

(I already sent this to Wolf0 on the AMD dev forums, but moderation there can take more time, and later I found this more appropriate place for my question.)

And have a Happy New Year, btw.

I don't check there often - how exactly do you do GCN ASM? I'm interested.

Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
realhet
Jr. Member
Offline
Activity: 32
January 01, 2015, 09:22:12 AM  #128

how exactly do you do GCN ASM? I'm interested.

I wrote an assembler for it. You can try it at realhet.wordpress.com. (Use Catalyst 13.4 or older, otherwise the examples will crash.)

My first thoughts on compiling the OpenCL kernel (on a 7770):
- It's 2.5 times bigger than the instruction cache (and there are no loops in it, so I guess it often reads from RAM).
- T0 and T1 are located in GPU RAM.
- The VReg count is above 128 -> that allows only the minimum of 4 wavefronts/CU, so there is no latency hiding via parallel wavefronts.
- Too short a kernel with too much initialization: ideally I'd let every workgroup run for at least 0.5 s, so that kernel launch and LDS table initialization take negligible time compared to the actual work.
- Better instructions: BitFieldExtract for 64-bit rotates, ds_read2_b64 for 128-bit LDS reads.
- Balancing the load between LDS and L1 cache.

I don't know which of the above is an actual bottleneck or will be useful, but I want to find out.
pallas
Legendary
Online
Activity: 1274
Black Belt Developer
January 01, 2015, 11:04:15 PM  #129

how exactly do you do GCN ASM? I'm interested.

I wrote an assembler for it. You can try it at realhet.wordpress.com. (Use Catalyst 13.4 or older, otherwise the examples will crash.)

My first thoughts on compiling the OpenCL kernel (on a 7770):
- It's 2.5 times bigger than the instruction cache (and there are no loops in it, so I guess it often reads from RAM).
- T0 and T1 are located in GPU RAM.
- The VReg count is above 128 -> that allows only the minimum of 4 wavefronts/CU, so there is no latency hiding via parallel wavefronts.
- Too short a kernel with too much initialization: ideally I'd let every workgroup run for at least 0.5 s, so that kernel launch and LDS table initialization take negligible time compared to the actual work.
- Better instructions: BitFieldExtract for 64-bit rotates, ds_read2_b64 for 128-bit LDS reads.
- Balancing the load between LDS and L1 cache.

I don't know which of the above is an actual bottleneck or will be useful, but I want to find out.

I'm going to try your assembler; very interesting project!
About your observations: first of all, keep in mind that the compiler is pretty unpredictable; many optimizations just don't make sense, yet they work. Also, I have only tested the kernel with Tahiti and Hawaii cards.
Kernel size: it can easily be made smaller (for example by including a single table instead of two), but in all my tests that doesn't bring any advantage.
T0 and T1 are not in GPU RAM: it would be much slower if they were. They are in constant RAM, I believe.
Short kernel: even though you could design it to process multiple hashes in a single run, I don't think it's worth it. Simple proof: algos that are tens of times faster than Groestl, like Keccak, still do a single hash per kernel run. Another reason is that making the kernel run longer results in more rejected shares.
Balancing the load between local RAM and cache (or any balancing of memory reads): I believe many optimizations that don't make sense work because they introduce small delays that let the threads' memory reads fit together better. In fact, modifying other parts of the code may make the same optimization worthless. Interesting speed variations can come from reordering instructions or grouping local RAM reads differently, for example.

Hope that helps.

Atomicat
Hero Member
Offline
Activity: 812
January 02, 2015, 04:10:02 AM  #130

Learn something new every day. My instinct is to push it till it moves, crank it to 11, but that doesn't work with the R9 290, because it throttles for power reasons long before you hit 1150. I just dropped my voltages right down and finally got 23.5 at 1125, I-20. This new understanding of how to handle the card will make for better benchmarks, for true.

Oh, nice price jump today, from 60k to 70k. Yeah, I'll take credit for that. I put some orders up last night and woke up to find that I basically owned it on Cryptsy! Drop a line with your DMD wallet address and I'll give you a well-earned reward from my ill-gotten gains.


realhet
Jr. Member
Offline
Activity: 32
January 02, 2015, 10:05:10 AM  #131

"T0 and T1 are not in GPU RAM: it would be much slower if they were."

Thanks for the ideas!

Actually, I knew from the disassembly that it uses RAM instead of LDS for T0 and T1. (Note that there is no such thing as constant memory in GCN. It can either read a single value with the scalar ALU and broadcast it across all the wavefront's work-items, or read 64 values for a whole wavefront with the vector ALU. Because T0 is addressed by data, it must be read by the VALU through the L1 cache; there is a scalar cache too.)
And from there I had the idea of balancing the two sources (LDS and L1).

I did a simple test: I renamed T0 and T1, allocated a new T0 and T1 in __local, and initialized them properly. Result: all tbuffer memory-read instructions disappeared from the disassembly, and the hashrate dropped from 3.99 MH/s down to 3.841. I don't know how much of that is the penalty of copying T0 and T1 into the LDS, though.
By the 'textbook': the L1 cache can read 4 bytes/cycle, the LDS 8 bytes/cycle.

And yes, the OpenCL compiler is totally unpredictable.

Important question: in the MH/s calculation, one kernel thread execution means 2 hashes, right?

(I have an HD 7770 @ 1000 MHz, and it's at 4 MH/s, which looks similar to Wolf0's report on dev.amd.com: R9 290 @ 1200, 20 MH/s. Using 14.9, where the compiler generates slower code.)

Now I have to convert all the math into asm. That's painful :D
utahjohn
Hero Member
Offline
Activity: 616
January 03, 2015, 03:22:57 AM  #132

@realhet
Please share your work with the rest of us if the assembly optimization works out. I looked at an R9 285 review today; it looks promising as long as the smaller memory bus (256-bit vs 384-bit) doesn't bottleneck. It should be faster than a 280X and on par with, or maybe even better than, a 290, with a lower power requirement...
It may need tweaks for each architecture... can it be written to detect which card it's running on and auto-select the best?

A quote from the AnandTech review:
A complete Tonga configuration will contain 2048 SPs, just like its Tahiti predecessor, with 1792 of those SPs active on R9 285. This is paired with the card’s 32 ROPs attached to a 256-bit memory bus, and a 4-wide (4 geometry processor) frontend. Compared to Tahiti the most visible change is the memory bus size, which has gone from 384-bit to 256-bit. In our look at GCN 1.2 we’ll see why AMD is able to get away with this – the short answer is compression – but it’s notable since at an architectural level Tahiti had to use a memory crossbar between the ROPs and memory bus due to their mismatched size (each block of 4 ROPs wants to be paired with a 32bit memory channel). The crossbar on Tahiti exposes the cards to more memory bandwidth, but it also introduces some inefficiencies of its own that make the subject a tradeoff.

Meanwhile Tonga’s geometry frontend has received an upgrade similar to Hawaii’s, expanding the number of geometry units (and number of polygons per clock) from 2 to 4. And there are actually some additional architectural efficiency improvements in here that should further push performance per clock beyond what Hawaii can do in the real world.

DMD: dUTjohnrXHGYkh7jELWrZkGJbMnE6mdsuh (Staking)
BTC: 1HANJQygp3jHuzutceBgMT7wfCgEug6h4L (Donation)
ETH: 0xba90d7c1ab2bb9d5c07d843476153d1722637250 Mine ETH for 0.5% http://donkeypool.com
pallas
Legendary
Online
Activity: 1274
Black Belt Developer
January 03, 2015, 01:26:36 PM  #133

Yes, they are two chained iterations of Groestl.
But they run slightly different code: the first is optimised because part of the input is known in advance, and the second because the whole hash is not needed.

pallas
Legendary
Online
Activity: 1274
Black Belt Developer
January 03, 2015, 01:28:54 PM  #134

Is anyone willing to donate or lend a 285 so I can optimise for Tonga?

realhet
Jr. Member
Offline
Activity: 32
January 03, 2015, 07:34:53 PM  #135

Hi again,

Finally I'm at the point where it has produced a correct result for the first time ever.
The speed test was surprisingly good: HD 7770 @ 1000 MHz (640 streams, GCN 1.0 chip), Cat 14.9 (the 20%-slower driver), total work-items: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x, where the baseline is the OpenCL implementation (found on amd.com in Wolf0's post), which does 4.00 MH/s.

And the first optimization was really a cheap shot ;D. Unlike OpenCL, I was able to get it under 128 VGPRs (I currently use 120; it was kind of close). So, as each vector ALU can now choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms, 9.365 MH/s, gain: 2.34x.

And I'm full of ideas to try :D Next will be to shrink the code to fit into the 32 KB instruction cache. Right now it's 300 KB; it's one massive macro unroll at the moment. Pallas's original OpenCL version is 110 KB; I wonder why the 3x multiplier, though. Anyway, on GCN we can have loops with only 1 cycle of overhead, and I can even write subroutines with call/ret instructions, so I have to try how fast it is when the instruction cache has no misses at all.

An OpenCL thing: while simplifying the code (I chopped out the first/last-round optimizations because they would be hard to implement in asm at the moment), I noticed something I already knew from the past: the OpenCL -> LLVM -> AMD IL -> GCN asm toolchain eliminates all constant calculations and all calculations whose results are not used at all. I watched the timings while making these modifications and they stayed around 4 MH/s. Sometimes they dropped below 3.7 when I put measurement code at various places to compare the original kernel with mine: if (gid==1234 && flag==1) for (int i=0; i<16; ++i) output[i] = g[i];
pallas
Legendary
Online
Activity: 1274
Black Belt Developer
January 03, 2015, 10:21:32 PM  #136

Hi again,

Finally I'm at the point where it has produced a correct result for the first time ever.
The speed test was surprisingly good: HD 7770 @ 1000 MHz (640 streams, GCN 1.0 chip), Cat 14.9 (the 20%-slower driver), total work-items: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x, where the baseline is the OpenCL implementation (found on amd.com in Wolf0's post), which does 4.00 MH/s.

And the first optimization was really a cheap shot ;D. Unlike OpenCL, I was able to get it under 128 VGPRs (I currently use 120; it was kind of close). So, as each vector ALU can now choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms, 9.365 MH/s, gain: 2.34x.

And I'm full of ideas to try :D Next will be to shrink the code to fit into the 32 KB instruction cache. Right now it's 300 KB; it's one massive macro unroll at the moment. Pallas's original OpenCL version is 110 KB; I wonder why the 3x multiplier, though. Anyway, on GCN we can have loops with only 1 cycle of overhead, and I can even write subroutines with call/ret instructions, so I have to try how fast it is when the instruction cache has no misses at all.

An OpenCL thing: while simplifying the code (I chopped out the first/last-round optimizations because they would be hard to implement in asm at the moment), I noticed something I already knew from the past: the OpenCL -> LLVM -> AMD IL -> GCN asm toolchain eliminates all constant calculations and all calculations whose results are not used at all. I watched the timings while making these modifications and they stayed around 4 MH/s. Sometimes they dropped below 3.7 when I put measurement code at various places to compare the original kernel with mine: if (gid==1234 && flag==1) for (int i=0; i<16; ++i) output[i] = g[i];

Great progress, very interesting!
The first improvement, 1.17x, is about the same as the 20% that is lost on 14.9 compared to the 14.6 beta, so the two implementations are equivalent.
The second, 2.34x, is really impressive: I have tried multiple times to reduce the number of variables as much as possible (down to three 16-ulong arrays, 2 ulongs and 2 uints), but the results were always worse, so that improvement probably can't be implemented in OpenCL, or at least I don't know how.
The same goes for code size and the instruction cache: I was able to squeeze it to about 50 KB, but at a speed loss.
About the compiler being able to eliminate constant calculations: I noticed that, but doing it by hand works best in terms of both speed and kernel size.
Finally, a question about your work: do you plan to open-source it?

realhet
Jr. Member
Offline
Activity: 32
January 04, 2015, 11:26:22 PM  #137

Hi,

The Groestl asm code is open source (I just uploaded it). My compiler and IDE are closed source, though, but once you have compiled the kernel into an .ELF binary you can use it even on Linux, not just Windows.

The first asm version is documented on my blog. Check it out here -> http://realhet.wordpress.com/
It's only a development version, and the kernel parameters are incompatible with pallas's OpenCL kernel. I have a hard time reverse-engineering how params are passed through registers, not to mention that it can differ in every Catalyst version, so I keep the parameters simple. One buffer with pinned memory for all data IO is the fastest anyway.
I'm planning to post about many optimizations; let's see how far I can go. With just getting under 128 VGPRs it is already at a 2.3x speedup, and I'm expecting more. ;D
I believe OpenCL is so generalized, and so far from the actual GCN hardware, that for some projects it's worth going low level. (Not all projects: for example, I failed with Litecoin; it's better for that one to stay in maintainable OpenCL code.)
realhet
Jr. Member
Offline
Activity: 32
January 05, 2015, 05:12:13 PM  #138

The first 2 optimizations are done; I wrote a blog post about them. I'm at 2.65x now.
pallas
Legendary
Online
Activity: 1274
Black Belt Developer
January 05, 2015, 06:29:19 PM  #139

The first 2 optimizations are done; I wrote a blog post about them. I'm at 2.65x now.

Thanks very much!
Unfortunately, it appears the two optimisations are hard to implement in OpenCL: the minimum code size I was able to achieve was 50 KB, far from 32 KB, and reducing the number of variables as much as possible didn't provide any speedup. Maybe the number of VGPRs is still higher than 128...

realhet
Jr. Member
Offline
Activity: 32
January 06, 2015, 04:39:59 AM  #140

Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D Check out the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob at mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 and MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test with? I want to make sure it calculates 100% correctly. And I can't wait to see whether it really does 70 MH/s on a 290X beast.