Bitcoin Forum
Author Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels  (Read 61214 times)
realhet (Newbie)
January 01, 2015, 09:22:12 AM  #121

how exactly do you do GCN ASM? I'm interested.

I wrote an assembler for it. You can try it at realhet.wordpress.com. (Use Cat 13.4 or older, otherwise examples will crash.)

My first thoughts after compiling the OCL kernel (on a 7770):
- It's 2.5 times bigger than the instruction cache (and there are no loops in it, so I guess it often reads from RAM).
- T0 and T1 are located in the GPU RAM.
- The VReg count is above 128 -> that allows only the minimum of 4 wavefronts/CU, so there is no latency hiding via parallel wavefronts.
- Too short a kernel with too much initialization: ideally I'd let every workgroup run for a minimum of 0.5 sec, so that kernel launch and LDS table initialization take no time compared to the actual work.
- Better instructions: BitFieldExtract for 64-bit rotate, ds_read2_b64 for 128-bit LDS reads (see the rotate sketch below).
- Balancing load between LDS and L1 cache.

I don't know which of the above is an actual bottleneck or will be useful, but I wanna find out.
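(On the 64-bit rotate point, a minimal OpenCL sketch of the usual GCN trick, assuming the cl_amd_media_ops extension; amd_bitalign maps to a single v_alignbit_b32, which serves the same purpose as the BitFieldExtract idea, though it is a different instruction:)

Code:
#pragma OPENCL EXTENSION cl_amd_media_ops : enable

// Rotate the 64-bit value (hi:lo) right by r, for 0 < r < 32.
// Each half comes from one amd_bitalign, i.e. one v_alignbit_b32
// on GCN, instead of a 64-bit shift/shift/or sequence.
ulong ror64(uint hi, uint lo, uint r)
{
    uint nlo = amd_bitalign(hi, lo, r); // low 32 bits of the result
    uint nhi = amd_bitalign(lo, hi, r); // high 32 bits (wrapped)
    return upsample(nhi, nlo);          // reassemble into a ulong
}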
pallas (OP) (Legendary)
January 01, 2015, 11:04:15 PM  #122

Quote from: realhet on January 01, 2015, 09:22:12 AM

I'm going to try your assembler, very interesting projects!
About your observations, first of all keep in mind that the compiler is pretty unpredictable: many optimizations just don't make sense, yet they work. Also, I have only tested with Tahiti and Hawaii cards.
Kernel size: it can easily be made smaller (for example by including a single table instead of 2), but in all my tests that doesn't bring any advantage.
T0 and T1 are not in GPU RAM: it would be much slower if they were. They are in constant RAM, I believe.
Short kernel: even though you could design it to process multiple hashes in a single run, I don't think it's worth it. A simple proof: algos that are tens of times faster than groestl, like keccak, still do a single hash per kernel run. Another reason is that making the kernel run longer results in more rejected shares.
Balancing load between local RAM and cache (or any other balancing of memory reads): I believe many optimizations that don't make sense work because they introduce little delays that allow better memory reads across the threads; they sort of fit together better. In fact, modifying other parts of the code may make the same optimization worthless. Interesting speed variations can be obtained by reordering instructions or grouping local RAM reads differently, for example.

Hope that helps.

Atomicat (Legendary)
January 02, 2015, 04:10:02 AM  #123

Learn something new every day. My instinct is to push it till it moves, crank it to 11, but that doesn't work with the R9 290. It doesn't work because the card is throttling for power considerations long before you hit 1150. I just dropped my voltages right down and finally got 23.5 at 1125, I-20. A new understanding of how to handle this card will make for better benchmarks, for true.

Oh, nice price jump today, from 60k to 70k. Yeah, I'll take credit for that. I put some orders up last night and woke up to find that I basically owned it on Cryptsy! Drop a line with your DMD wallet address and I'll give you a well-earned reward from my ill-gotten gains.

realhet (Newbie)
January 02, 2015, 10:05:10 AM  #124

"T0 and T1 are not in gpu ram: it would be much slower if they were."

Thanks for the ideas!

Actually, I knew from the disasm that it uses RAM instead of LDS for T0 and T1. (Note that there is no such thing as constant memory on GCN. It can either read a single value with the Scalar ALU and broadcast it across all the wavefront's workitems, or read 64 values for a whole wavefront with the Vector ALU. Because T0 is addressed by data, it must be read by the VALU through the L1 cache (there is a scalar cache too).)
And from there I had the idea of balancing the two sources (LDS and L1).

I did a simple test: I renamed T0 and T1, allocated a new T0 and T1 in __local, and initialized them properly. Result: all tbuffer memory-read instructions disappeared from the disasm, and the hash rate dropped from 3.99 MH/s to 3.84 MH/s. I don't know how much of that is the penalty of copying T0 and T1 into the LDS, though.
By the 'textbook', the L1 cache can read 4 bytes/cycle and the LDS 8 bytes/cycle.
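(For reference, the staging described above looks roughly like this in OpenCL; a sketch only, assuming 256-entry ulong tables and made-up kernel and argument names:)

Code:
// Hypothetical prologue: copy the lookup tables from global memory
// into LDS once per workgroup, then barrier before the rounds use them.
__kernel void groestl_example(__global const ulong *restrict T0_g,
                              __global const ulong *restrict T1_g)
{
    __local ulong T0[256], T1[256];
    for (uint i = get_local_id(0); i < 256; i += get_local_size(0)) {
        T0[i] = T0_g[i];  // tbuffer reads disappear from the disasm;
        T1[i] = T1_g[i];  // ds_* reads take their place
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... the rounds now index T0[]/T1[] in LDS ...
}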

And yes, the OpenCL compiler is totally unpredictable.

An important question: in the MH/s calculation, one kernel thread execution counts as 2 hashes, right?

(I have an HD 7770 @ 1000 MHz and it's at 4 MH/s, which looks consistent with Wolf0's report on dev.amd.com: R9 290 @ 1200 MHz, 20 MH/s. I'm using 14.9, where the compiler generates slower code.)

Now I have to convert all the math into asm. That's painful Cheesy
utahjohn (Hero Member)
January 03, 2015, 03:22:57 AM  #125

@realhet
Please share your work with the rest of us if the assembly optimization works out. I looked at an R9 285 review today; it looks promising, as long as the smaller memory bus (256-bit vs 384-bit) doesn't become a bottleneck. It should be faster than a 280X and on par with, or maybe even better than, a 290, with lower power requirements...
It may need tweaks for each architecture... can it be written to detect which card it's running on and auto-select the best kernel? (See the host-side sketch after the quote.)

A quote from the AnandTech review:
A complete Tonga configuration will contain 2048 SPs, just like its Tahiti predecessor, with 1792 of those SPs active on R9 285. This is paired with the card’s 32 ROPs attached to a 256-bit memory bus, and a 4-wide (4 geometry processor) frontend. Compared to Tahiti the most visible change is the memory bus size, which has gone from 384-bit to 256-bit. In our look at GCN 1.2 we’ll see why AMD is able to get away with this – the short answer is compression – but it’s notable since at an architectural level Tahiti had to use a memory crossbar between the ROPs and memory bus due to their mismatched size (each block of 4 ROPs wants to be paired with a 32bit memory channel). The crossbar on Tahiti exposes the cards to more memory bandwidth, but it also introduces some inefficiencies of its own that make the subject a tradeoff.

Meanwhile Tonga’s geometry frontend has received an upgrade similar to Hawaii’s, expanding the number of geometry units (and number of polygons per clock) from 2 to 4. And there are actually some additional architectural efficiency improvements in here that should further push performance per clock beyond what Hawaii can do in the real world.
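(On the auto-selection question above: the host side can query the device name and pick a kernel variant; a hypothetical C sketch with made-up file names, not sgminer's actual logic:)

Code:
#include <CL/cl.h>
#include <string.h>

// Hypothetical kernel selection by device name: AMD devices report
// their chip codename ("Tahiti", "Hawaii", "Tonga", ...) via
// CL_DEVICE_NAME.
static const char *pick_kernel(cl_device_id dev)
{
    char name[256] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name) - 1, name, NULL);
    if (strstr(name, "Hawaii")) return "groestl_hawaii.bin";
    if (strstr(name, "Tahiti")) return "groestl_tahiti.bin";
    if (strstr(name, "Tonga"))  return "groestl_tonga.bin";
    return "groestl_generic.cl"; // fall back to the OpenCL source
}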
pallas (OP) (Legendary)
January 03, 2015, 01:26:36 PM  #126

Yes, they are two chained iterations of groestl.
But they run slightly different code: the first is optimised because part of the input is known in advance, and the second because the whole hash is not needed.
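(In outline, with a hypothetical groestl512 helper; the real kernel fuses the two calls and applies the shortcuts just described:)

Code:
void groestl512(uchar *out, const uchar *in, uint len); // assumed helper

// Two chained groestl-512 iterations per nonce. Pass 1 can precompute
// the rounds that only touch the fixed part of the 80-byte header;
// pass 2 only needs the words that are compared against the target.
void double_groestl(const uchar header[80], uchar out[64])
{
    uchar mid[64];
    groestl512(mid, header, 80); // first iteration
    groestl512(out, mid, 64);    // second iteration
}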

pallas (OP) (Legendary)
January 03, 2015, 01:28:54 PM  #127

Is anyone willing to donate or lend a 285 so I can optimise for Tonga?

realhet (Newbie)
January 03, 2015, 07:34:53 PM  #128

Hi again,

Finally I'm at the point where it has produced a correct result for the first time ever.
The speed test was surprisingly good: HD 7770 @ 1000 MHz (640 streams, GCN 1.0 chip), Cat 14.9 (the 20% slower driver), total workitems: 256*10*512, elapsed: 558.613 ms, 4.693 MH/s, gain: 1.17x, where the baseline is the OpenCL implementation (found in Wolf0's post on amd.com), which is 4.00 MH/s.
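(A sanity check on those numbers, given pallas's two-chained-iterations answer above: 256*10*512 = 1,310,720 workitems, times 2 hashes each, divided by 0.558613 s, is about 4.69 MH/s, which matches the reported 4.693 MH/s.)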

And the first optimization was really a cheap shot Grin Unlike OCL, I was able to get it under 128 VGPRs (I currently use 120; it was kinda close). So, as each Vector ALU can now choose from 2 wavefronts at any time, latency hiding finally kicked in -> elapsed: 279.916 ms, 9.365 MH/s, gain: 2.34x.
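(The arithmetic behind that cheap shot: a GCN CU has 4 SIMDs with 256 VGPRs per lane each, and occupancy is floor(256 / VGPRs per workitem). Above 128 VGPRs that is 1 wavefront per SIMD, the minimum of 4 per CU mentioned earlier; at 120 VGPRs it becomes 2 per SIMD, so a SIMD can switch to the other wavefront whenever one stalls on memory.)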

And I'm full of ideas to try Cheesy Next will be to shrink the code to fit into the 32 KB instruction cache. Right now it is 300 KB; it's a massive macro unroll at the moment. Pallas' original OCL version is 110 KB, and I wonder where the 3x multiplier comes from. Anyway, on GCN we can have loops with only 1 cycle of overhead, or I can even write subroutines with call/ret instructions, so I gotta try how fast it is when the instruction cache has no misses at all.

An OpenCL thing: while simplifying the code (I chopped out the first/last-round optimizations because they would be hard to implement in asm at the moment), I noticed something I already knew from the past: the OpenCL -> LLVM -> AMD IL -> GCN ASM toolchain eliminates all constant calculations and all calculations whose results are never used. I watched the timings while making these modifications and they stayed around 4 MH/s. Sometimes it dropped below 3.7 when I put measurement code at various places to compare the original kernel with mine: if (gid == 1234 && flag == 1) for (int i = 0; i < 16; ++i) output[i] = g[i];
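(The guard in that last line is the usual way to benchmark a partial kernel without the optimizer deleting the work; a minimal sketch with stand-in math:)

Code:
// Hypothetical benchmark kernel: g[] is written out from a single
// workitem only, so the OpenCL -> LLVM -> IL -> ASM chain cannot
// prove the computation dead, yet almost no bandwidth is spent.
__kernel void bench(__global ulong *output, uint flag)
{
    uint gid = get_global_id(0);
    ulong g[16];
    for (int i = 0; i < 16; ++i)
        g[i] = (ulong)gid * (i + 1); // stand-in for the real rounds
    if (gid == 1234 && flag == 1)    // almost never taken, but the
        for (int i = 0; i < 16; ++i) // compiler must assume it can be
            output[i] = g[i];
}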
pallas (OP) (Legendary)
January 03, 2015, 10:21:32 PM  #129

Quote from: realhet on January 03, 2015, 07:34:53 PM

Great progress, very interesting!
The first improvement, 1.17x, is about the same as the 20% that is lost on 14.9 compared to the 14.6 beta, so the two implementations are equivalent.
The second, 2.34x, is really impressive: I have tried multiple times to reduce the number of variables as much as possible (down to 3x16 ulong arrays, 2 ulongs and 2 uints), but the results were always worse, so that improvement probably can't be implemented in OpenCL, or at least I don't know how.
The same goes for code size and the instruction cache: I was able to squeeze it to about 50K, but at a speed loss.
About the compiler eliminating the constant calculations: I noticed that, but doing it by hand works best, both in terms of speed and kernel size.
Finally, a question about your work: do you plan to opensource it?

realhet (Newbie)
January 04, 2015, 11:26:22 PM  #130

Hi,

The Groestl asm code is open source (I just uploaded it). My compiler and IDE are closed source though, but once you've compiled the kernel into an .ELF binary with it, you can use it even on Linux, not just Windows.

The first asm version is documented on my blog. Check it out here -> http://realhet.wordpress.com/
It's only a development version, and the kernel parameters are incompatible with Pallas's OpenCL kernel. I have a hard time reverse engineering how params are passed through registers, not to mention that it can differ in every Catalyst version, so I keep the parameters simple. One buffer with pinned memory for all data IO is the fastest anyway (see the sketch at the end of this post).
I'm planning to post about many more optimizations. Let's see how far I can go. With only 128 VGPRs it is already at a 2.3x speedup, and I'm expecting more. Grin
I believe OCL is so generalized, and so far from the actual GCN hardware, that for some projects it's worth going low level. (Not all projects: for example, I failed with LiteCoin; that one is better off staying in maintainable OCL code.)
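(What "one buffer for all data IO" might look like from the OpenCL side; purely illustrative, with made-up field names, since the actual binary interface isn't spelled out here:)

Code:
// Hypothetical single-buffer interface: the input block, the target
// and the result slots all live in one pinned buffer, so only one
// pointer has to be passed to the kernel.
typedef struct {
    uint header[20];   // 80-byte block header; nonce patched per item
    uint target;       // simplified share target
    uint nonce_count;  // number of results written so far
    uint nonces[255];  // found nonces
} io_t;

__kernel void search(__global io_t *io)
{
    uint nonce = get_global_id(0);
    // ... hash the header with this nonce; on a hit:
    // io->nonces[atomic_inc(&io->nonce_count)] = nonce;
}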
realhet (Newbie)
January 05, 2015, 05:12:13 PM  #131

The first 2 optimizations are done; I wrote a blog post about them. I'm at 2.65x now.
pallas (OP) (Legendary)
January 05, 2015, 06:29:19 PM  #132

Quote from: realhet on January 05, 2015, 05:12:13 PM

Thanks very much!
Unfortunately it appears the two optimisations are hard to implement in OpenCL: the minimum code size I was able to achieve was 50K, far from 32K, and reducing the number of variables as much as possible didn't provide any speedup. Maybe the number of VGPRs is still higher than 128...

realhet (Newbie)
January 06, 2015, 04:39:59 AM  #133

Hi, I think I'm done with the things I wanted to try. It's at 3.48x now Grin Check out the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.
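(A flavour of that cooperation in OpenCL terms; an illustrative split only, though the last post on this page mentions the real ratio ended up at 2 RAM + 6 LDS reads:)

Code:
// Illustrative round fragment: some table lookups hit a copy of the
// table in LDS, the rest hit a copy in global memory served by the
// vector L1 cache, so two independent memory paths feed the ALU in
// parallel instead of queueing on one.
ulong lookup_mix(uint x,
                 __local  const ulong *T_lds,
                 __global const ulong *T_l1)
{
    ulong r = 0;
    r ^= T_lds[(x >>  0) & 0xff]; // LDS path
    r ^= T_lds[(x >>  8) & 0xff]; // LDS path
    r ^= T_l1 [(x >> 16) & 0xff]; // L1 path
    r ^= T_l1 [(x >> 24) & 0xff]; // L1 path
    return r;
}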

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob at mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 and MSVC, or maybe under Visual Studio Express? Do you have actual test vectors to test against? I want to make sure it calculates 100% correctly. And I can't wait to see whether it really does 70 MH/s on a 290X beast.
realhet (Newbie)
January 06, 2015, 04:49:51 AM  #134

(Oops, an important part was missing from my blog post -> now it's corrected.)
pallas (OP) (Legendary)
January 06, 2015, 11:35:27 AM  #135

Quote from: realhet on January 06, 2015, 04:39:59 AM

Thanks for the update.
I've been using Linux only for many years now, so I can't help you with Windows compiling; just know that it's trivial to compile the miner on Linux, and it runs in a terminal so it doesn't need Qt.
About the software version, I prefer the good old sph-sgminer, which is based on sgminer 4.1 (I modified it a bit), but you can use the latest sgminer 5.x as well.
To test the kernel you can simply point it at a pool, printf the hash, or whatever.
Back to my OpenCL effort: I've reduced the number of VGPRs to 147, but I'm struggling to get past that.

sp_ (Legendary, Team Black developer)
January 06, 2015, 01:06:50 PM  #136

Does your assembler support self-modifying code? Wink Then you could use the instruction cache as a precalc buffer as well. The advantage is that most GPUs can read from the instruction cache in parallel with the level 1 cache.

qwep1 (Hero Member)
January 06, 2015, 03:38:19 PM  #137

And will there be a version for Windows? Smiley

pallas (OP) (Legendary)
January 07, 2015, 10:34:17 PM  #138

On Hawaii only, I've managed to get down to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)
I believe the asm version is still faster on Hawaii, and of course much faster on smaller cards.

utahjohn (Hero Member)
January 07, 2015, 11:37:09 PM  #139

New optimized CL, or a BIN? (I'll test on a 280X and a 7950.)
realhet (Newbie)
January 08, 2015, 07:20:00 AM  #140

"On hawaii only, I've managed to get to 123 VGRPS and 28K ISA size, so now I have all the optimizations of the asm code :-)"

Then it's got all the goodies: VGPRs, icache, and the 2 RAM + 6 LDS reads. The speedup must be the same 3.5x! Is it that much?

It should be good on small cards too; the only important difference is the number of CUs anyway.