Bitcoin Forum
June 29, 2017, 04:08:39 AM *
Author Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels  (Read 53489 times)
This is a self-moderated topic. If you do not want to be moderated by the person who started this topic, create a new topic.
realhet
Jr. Member
*
Offline Offline

Activity: 32


View Profile WWW
January 06, 2015, 04:49:51 AM
 #141

(Oops, an important part was missing from my blog post -> now it's corrected)
Wolf0
Legendary
*
Offline Offline

Activity: 1610


Miner Developer


View Profile
January 06, 2015, 06:12:39 AM
 #142

Quote from: realhet
Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 and MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really reaches 70MH/s on a 290X beast.

You first need to have the target passed to the kernel.

Code:
Donations: BTC: 1WoLFdwcfNEg64fTYsX1P25KUzzSjtEZC -- XMR: 45SLUTzk7UXYHmzJ7bFN6FPfzTusdUVAZjPRgmEDw7G3SeimWM2kCdnDQXwDBYGUWaBtZNgjYtEYA22aMQT4t8KfU3vHLHG
pallas
Legendary
*
Offline Offline

Activity: 1330


Black Belt Developer


View Profile
January 06, 2015, 11:35:27 AM
 #143

Quote from: realhet
Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 and MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really reaches 70MH/s on a 290X beast.

Thanks for the update.
I've been using Linux only for many years now, so I can't help you with compiling on Windows; just know that it's trivial to compile the miner on Linux, and since it runs in a terminal it doesn't need Qt.
About the software version, I prefer the good old sph-sgminer, which is based on sgminer 4.1 (I modified it a bit), but you can use the latest sgminer 5.X as well.
To test the kernel you can simply point it to a pool, printf the hash or whatever.
Back to my OpenCL effort: I've reduced the number of VGPRs to 147, but I'm struggling to get past that.

sp_
Legendary
*
Offline Offline

Activity: 1078

Ccminer developer


View Profile
January 06, 2015, 01:06:50 PM
 #144

Does your assembler support self-modifying code? ;) Then you can use the instruction cache as a precalc buffer as well. The advantage is that most GPUs can read from the instruction cache in parallel with the level 1 cache.

BTC: 1CTiNJyoUmbdMRACtteRWXhGqtSETYd6Vd
qwep1
Sr. Member
****
Offline Offline

Activity: 462


View Profile
January 06, 2015, 03:38:19 PM
 #145

And will there be a version for Windows? :)
pallas
January 07, 2015, 10:34:17 PM
 #146

On Hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)
I believe the asm version is still faster on Hawaii, and of course much faster on smaller cards.
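
Why getting from 147 down to 123 VGPRs matters can be sketched with the usual GCN occupancy rule of thumb. The three constants below are assumptions that are not stated anywhere in this thread (256 VGPRs available per SIMD lane, allocation in steps of 4, at most 10 wavefronts per SIMD), and the function name is made up for illustration. Under those assumptions, any kernel above 128 registers leaves room for a single wavefront per SIMD, while 123 allows two.

```python
# Rough GCN occupancy estimate. All three constants are assumptions,
# not figures taken from this thread.
VGPRS_PER_SIMD_LANE = 256   # assumed register budget per lane
ALLOC_GRANULARITY = 4       # assumed VGPR allocation step
MAX_WAVES_PER_SIMD = 10     # assumed hardware cap


def waves_per_simd(vgprs_used: int) -> int:
    """Estimate how many wavefronts fit on one SIMD for a given VGPR count."""
    if vgprs_used <= 0:
        raise ValueError("vgprs_used must be positive")
    # Round the request up to the allocation granularity.
    allocated = -(-vgprs_used // ALLOC_GRANULARITY) * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // allocated)


for count in (147, 123):
    print(count, "VGPRs ->", waves_per_simd(count), "wave(s) per SIMD")
```

This is only an estimate of how many wavefronts can be resident at once; it says nothing about the actual hashrate, which also depends on code size, LDS use and memory traffic.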

utahjohn
Hero Member
*****
Offline Offline

Activity: 616


View Profile WWW
January 07, 2015, 11:37:09 PM
 #147

New optimized CL or a BIN? (I'll test on a 280X and a 7950.)

DMD: dUTjohnrXHGYkh7jELWrZkGJbMnE6mdsuh (Staking)
BTC: 1HANJQygp3jHuzutceBgMT7wfCgEug6h4L (Donation)
ETH: 0xba90d7c1ab2bb9d5c07d843476153d1722637250 Mine ETH for 0.5% http://donkeypool.com
realhet
January 08, 2015, 07:20:00 AM
 #148

"On Hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)"

Then it has all the goodies: VGPRs, icache and 2 RAM + 6 LDS reads. The speedup must be the same 3.5x! Is it that much?

It must be good on small cards too; the only important difference is the number of CUs anyway.
realhet
January 08, 2015, 07:36:35 AM
 #149

And you have the first/last round optimizations so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?
pallas
January 08, 2015, 09:28:52 AM
 #150

Quote from: realhet
And you have the first/last round optimizations so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?

Unfortunately it's only about 25% faster, but we should compare apples to apples: could you try your code on a Hawaii chipset so we have a constant testbed?
Now I'm working on further first round optimizations; they bring little improvement, but it's still worth it imho.

realhet
January 08, 2015, 09:59:22 AM
 #151

25% seems to be just the recovery of the loss that came with 14.9. I really thought you had it 3.5x faster.

Are you sure that it uses only 123 VGPRs AND the code size is only 28KB? Or has it started to use scratch regs (those are terribly slow)?

Unfortunately I can't try it on anything other than an HD7770. But I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog if someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instructions from time to time, and maybe cuts down double precision performance), but with this particular program I'm pretty sure that it will bring the 3.48x speedup on the R9 290X too, because all the CUs can work alone, using the LDS, L1 cache and ICache on their own. So if the current OCL code on the R9 290X runs at 20MH/s, then the latest asm code should run at 70MH/s.
pallas
January 08, 2015, 10:03:12 AM
 #152

Quote from: realhet
25% seems to be just the recovery of the loss that came with 14.9. I really thought you had it 3.5x faster.

Are you sure that it uses only 123 VGPRs AND the code size is only 28KB? Or has it started to use scratch regs (those are terribly slow)?

Unfortunately I can't try it on anything other than an HD7770. But I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog if someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instructions from time to time, and maybe cuts down double precision performance), but with this particular program I'm pretty sure that it will bring the 3.48x speedup on the R9 290X too, because all the CUs can work alone, using the LDS, L1 cache and ICache on their own. So if the current OCL code on the R9 290X runs at 20MH/s, then the latest asm code should run at 70MH/s.

25% compared to 14.6; it's 43% compared to 14.9.
No scratch reg use (when I triggered it a couple of times, it slowed down to less than 1 MH/s).
I'd like to try your asm code myself, but I'd need the Linux version of the assembler.
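
The percentages traded in the last few posts can be checked with a little arithmetic. The sketch below assumes that both gains (25% and 43%) describe the same new kernel, measured against the old kernel under Catalyst 14.6 and 14.9 respectively; that reading is an assumption, and the function name is made up for illustration. It also reproduces the extrapolation of a 20 MH/s baseline by the 3.48x speedup.

```python
# Figures quoted in the thread.
gain_vs_14_6 = 0.25   # new kernel vs. old kernel on Catalyst 14.6
gain_vs_14_9 = 0.43   # new kernel vs. old kernel on Catalyst 14.9


def implied_driver_ratio(gain_fast_driver: float, gain_slow_driver: float) -> float:
    """Old-kernel speed on the slower driver relative to the faster driver.

    Assumes both gains are measured against one and the same new-kernel speed.
    """
    return (1.0 + gain_fast_driver) / (1.0 + gain_slow_driver)


ratio = implied_driver_ratio(gain_vs_14_6, gain_vs_14_9)
print(f"old kernel on 14.9 runs at about {ratio:.1%} of its 14.6 speed")

# Extrapolation from the thread: 20 MH/s baseline times the 3.48x speedup.
projected = 20.0 * 3.48
print(f"projected asm hashrate: {projected:.1f} MH/s")
```

Under that reading, the old kernel lost roughly an eighth of its speed going from 14.6 to 14.9, and the projected figure comes out just under the 70 MH/s quoted above.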

realhet
January 08, 2015, 12:55:13 PM
 #153

Now I've managed to build sgminer 5.1 on my system. I still have to make my kernel work with it.

Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

"I'd need the Linux version of the assembler."
Sorry, it's impossible. It's not even written in C++, so it can't be compiled on any system other than Windows.

And to make things more complicated :D you have to compile with it for every type of GCN card, multiplied by every Catalyst driver version that the AMD developers altered. My compiler only patches the binary into the .elf; the actual ELF file is generated by the current Catalyst driver for the currently selected graphics card.
pallas
January 08, 2015, 01:39:04 PM
 #154

Quote from: realhet
Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

There is a simple "benchmark" option:

--benchmark         Run sgminer in benchmark mode - produces no shares

realhet
January 08, 2015, 03:43:38 PM
 #155

Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? Now I have a groestl wallet, but where can I get a username from? What parameters should I use other than -k groestl and -d 1?
pallas
January 08, 2015, 03:57:32 PM
 #156

Quote from: realhet
Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? Now I have a groestl wallet, but where can I get a username from? What parameters should I use other than -k groestl and -d 1?

They probably removed it; I'm using an older version.
I run it like this, for solo mining:

sgminer -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:GROESTLCOIN_RPC_PORT -u YOURUSER -p YOURPASSWORD

Then you have to find and add your best intensity and worksize (my OS kernel works with 256 only).
username and password are set in groestlcoin.conf; the port you can easily find in their thread (or via netstat).
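
A small sketch of how the solo-mining command above might be assembled from a script. It only uses flags that appear in this thread; the helper name `build_sgminer_args` and its defaults are made up for illustration, and the port, user and password stay as placeholders. It also checks that the multiplier 0.0039062500 is exactly 1/256.

```python
from fractions import Fraction

# The multiplier from the command above is exactly 1/256.
DIFFICULTY_MULTIPLIER = "0.0039062500"
assert Fraction(DIFFICULTY_MULTIPLIER) == Fraction(1, 256)


def build_sgminer_args(rpc_port, user, password, worksize=256, intensity=None):
    """Hypothetical helper: build an argv list from flags seen in this thread."""
    args = [
        "sgminer",
        "-k", "groestlcoin",
        "--difficulty-multiplier", DIFFICULTY_MULTIPLIER,
        "-o", f"http://localhost:{rpc_port}",
        "-u", user,
        "-p", password,
        "--worksize", str(worksize),
    ]
    # Intensity is left out unless the caller has found a good value.
    if intensity is not None:
        args += ["--intensity", str(intensity)]
    return args


print(" ".join(build_sgminer_args("GROESTLCOIN_RPC_PORT", "YOURUSER", "YOURPASSWORD")))
```

Returning a list rather than one string keeps each argument separate, so it can be handed to a process launcher without shell quoting issues.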

realhet
January 08, 2015, 08:08:22 PM
 #157

Thanks for the help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2MH/s (avg), which is half of the 4MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?
utahjohn
January 08, 2015, 08:30:59 PM
 #158

Quote from: realhet
Thanks for the help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2MH/s (avg), which is half of the 4MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?

Are you sure it is running your kernel? Look in your sgminer dir for a .bin file generated by OCL; it may be running the default groestlcoin OCL. Delete the .bin and replace it with your own generated one of the same name; it will not be regenerated if it exists in the dir. You must delete the .bin whenever you change configs to force an OCL recompile ... but you don't want that, you want to run your asm kernel ... so you will have to figure out the parameter passing from sgminer ...
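
The cache check described above could be scripted roughly as follows. This is a sketch only: it assumes the cached kernel binaries sit directly in the sgminer directory with a .bin extension, as the post describes, and the helper name is made up for illustration. It defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path


def remove_cached_kernel_bins(miner_dir, dry_run=True):
    """Hypothetical helper: list (and optionally delete) *.bin files in miner_dir.

    Returns the sorted list of matching paths in both modes.
    """
    found = sorted(Path(miner_dir).glob("*.bin"))
    if not dry_run:
        for path in found:
            path.unlink()
    return found


# Dry run: only report what would be removed from the current directory.
for path in remove_cached_kernel_bins("."):
    print("cached kernel binary:", path)
```

Calling it with `dry_run=False` removes the files, which should force the OCL recompile on the next start.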

pallas
January 08, 2015, 08:41:59 PM
 #159

Quote from: realhet
Thanks for the help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2MH/s (avg), which is half of the 4MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?

Intensity 24 is too much, I'd stay between 20 and 22, otherwise you'll produce a lot of rejected shares (or orphans if solo mining).
The shaders option is ignored for groestl.
The hashrate should be calculated on the full computation, i.e. 2 chained hashes.
What kernel are you using?
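
The factor of two can be written out explicitly. The sketch assumes that the earlier 4 MH/s estimate counted single Groestl passes, while the miner counts one hash per full computation of 2 chained hashes; the first part is an assumption, and the function name is made up for illustration.

```python
GROESTL_PASSES_PER_HASH = 2  # "2 chained hashes" per full computation


def reported_hashrate(single_passes_per_second: float) -> float:
    """Convert a rate of single Groestl passes into full-computation hashes."""
    return single_passes_per_second / GROESTL_PASSES_PER_HASH


estimate = 4.0e6  # single passes per second, the earlier 4 MH/s figure
print(reported_hashrate(estimate) / 1e6, "MH/s")
```

Under that assumption the 2 MH/s shown by the miner and the 4 MH/s estimate describe the same amount of work.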

realhet
January 08, 2015, 09:48:35 PM
 #160

I'm using your kernel: groestlcoin.cl.

Now I've disassembled a dummy kernel with the appropriate parameters, and I had forgotten about the T buffers. OpenCL uploads them in an extra buffer automatically. I don't even want to know how the driver sends that extra buffer, and most importantly I can't make an automatic skeleton kernel to get a binary with a placeholder for constant data that my program can patch with the output of the assembler.

So the easiest way would be to modify sgminer to handle my kernel. I have found the 'queue_sph_kernel()' function, which I can start from.