Bitcoin Forum
June 03, 2024, 05:01:16 PM
News: Latest Bitcoin Core release: 27.0 [Torrent]
 
Pages: « 1 ... 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 [213] 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 »
Author Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner  (Read 877798 times)
pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 15, 2016, 10:47:39 AM
 #4241

That shows the importance of simd. Now you know where to work ;-)
Thanks for the information, please keep us updated on the progress!
Please tell me if I can be of any help.

I don't know - I know one thing for certain now, though: Kachur's Blake was about the same speed as mine; his BMW, however, needed a bit of work. Mine improved overall X11 hash by 1.35% (remember that BMW itself is quite a small part of X11, so the raw improvement of my BMW over his is much larger.)

EDIT: search2, originally Groestl-512, did not take to a simple kernel replacement and will have to be investigated further (manual study of the disassembly.) Skipping for now.

EDIT2: search2 may not have been fucked because of a difference in output, but in where the bloody constants are in global. For JH, I'm going to make an all new test kernel which takes a read-only buffer for JH's constants rather than trying to reference constant memory right now. Simpler. I should then be able to put that in place of the Kachur JH and modify SGMiner to pass a constant buffer on that kernel.

Is search2 faster than yours? or is it just simd?
Maybe Kachur has found a way to make AES-like algos better...
BTW I wouldn't mind a frankenbin if it's faster and stable ;-)

I can't tell - without straight up replacement of a kernel, I dunno if he's done some kind of fuckery with part of a hash in one kernel, and part in another, for example. What I suspect is SIMD has been cut into two parts (at least.)

Now, even if his Groestl is faster than mine, my current Groestl is outdated anyways. My R & D area has a bitsliced Groestl that I have not yet played too much with - parallelization using 4 work items like it's done in CUDA should be possible. I can drop to GCN ASM for ds_swizzle_b32 - limits me to a 4-way, as it's not a 32-way shuffle like CUDA, but it's enough for me. I've just got a lot to do atm - maybe there is something we could work on together... a Groestl, perhaps? If you could look at the code and see if you could split it over multiple work-items and use LDS for the data sharing, I could probably remove said LDS usage by disassembling and modifying the kernel before reassembling it?

SIMD: tonight I was thinking about it and slicing into two parts is the natural way of doing it; I think I could try that. The only little annoyance is that the data to be passed between the (at least two) parts won't just be a hash but a bigger set of data, so the standard sgminer searchX() system wouldn't work.

GROESTL (and similar): I always had the idea that nvidia had to do the bitslice thing because its shared memory was slower than GCN's; in fact nvidia bitsliced is only on par with GCN LDS. As a logical consequence, if bitslicing on GCN is presumably going to be slower than the LDS approach, I wouldn't even try it.
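The SIMD split pallas describes above - two kernel passes handing off more data than a single hash - can be sketched on the host side like this. This is a minimal C sketch under assumed sizes (`HASH_BYTES`, `STATE_BYTES`) with placeholder arithmetic; it only illustrates why the standard searchX() hash-in/hash-out chaining wouldn't fit, not the real SIMD compression.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes: a normal sgminer searchX() stage hands the next
 * stage a 64-byte hash; a SIMD kernel split in two would have to hand
 * over its whole expanded state (assumed 256 bytes here). */
#define HASH_BYTES  64
#define STATE_BYTES 256

/* Part 1: expand the incoming hash into the larger intermediate state.
 * Placeholder arithmetic - the real kernel would run the first half of
 * the SIMD compression here. */
static void simd_part1(const uint8_t *hash_in, uint8_t *state_out)
{
    for (int i = 0; i < STATE_BYTES; i++)
        state_out[i] = (uint8_t)(hash_in[i % HASH_BYTES] ^ (uint8_t)i);
}

/* Part 2: fold the intermediate state back down to a 64-byte hash. */
static void simd_part2(const uint8_t *state_in, uint8_t *hash_out)
{
    memset(hash_out, 0, HASH_BYTES);
    for (int i = 0; i < STATE_BYTES; i++)
        hash_out[i % HASH_BYTES] ^= state_in[i];
}
```

The point is the interface: part 1's output is four times the size of the usual inter-kernel hash, so the two launches would need a dedicated intermediate buffer instead of sgminer's standard hash buffer.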

24core
Sr. Member
****
Offline Offline

Activity: 582
Merit: 250


An Impressive Purely Anonymous Currency.


View Profile WWW
January 15, 2016, 11:00:48 AM
 #4242

But with xintensity at 1024 you get invalid shares, so a lower hashrate on the pool. 256 or 512 is the highest xintensity I can use.
Default was 640. I used this w/o a problem )))

Can you please share your config for the 7950?

Also slightly confused about the Kachur miner, as the NiceHash download is 5.2.1 - is there another link?
https://github.com/nicehash/NiceHashMiner/releases use sgminer-5-1-0-optimized
Code:
setx GPU_MAX_ALLOC_PERCENT 100
sgminer.exe --algorithm darkcoin-mod -o stratum+tcp://x11.eu.nicehash.com:3336 -u 1M948TedPdVkbk59TMnYJhtW5BFTXKPfFL -p d=0.04 -d 0 --xintensity 640 -g 1 -w 64 --gpu-memclock 1500 --gpu-engine 1100 -s 0 --expiry 10 --queue 0 --gpu-powertune 10 --keccak-unroll 0 --hamsi-expand-big 4 --gpu-fan 30-95 --temp-cutoff 95 --temp-overheat 90 --temp-target 75 --auto-fan

I tried those settings and many others, but I get a very low hash rate and my CPU is going mental.

I only have a dual-core Intel G3220 @ 3 GHz - does the optimised miner need more CPU, meaning my mining rig requires a CPU upgrade?

z0n0
Legendary
*
Offline Offline

Activity: 1274
Merit: 1006



View Profile
January 15, 2016, 11:33:56 AM
 #4243

pallas & Wolf0:
when I read your (coding 'n shit) dialog, here is what I read: kjndcknsdvnlsdnvlsvlsdlkvlksdmvlmskcnsjvnljsnvljsnvlsdclkslvjhsdljvnlxnvcdnvl


 Grin


ldw-com
Full Member
***
Offline Offline

Activity: 229
Merit: 100


View Profile
January 15, 2016, 11:37:48 AM
 #4244

pallas & Wolf0:
when I read your (coding 'n shit) dialog, here is what I read: kjndcknsdvnlsdnvlsvlsdlkvlksdmvlmskcnsjvnljsnvljsnvlsdclkslvjhsdljvnlxnvcdnvl


 Grin




Oh, is that what you're reading?

Well, over here it looks like this "这是我读一堆奇怪的迹象" ("this is me reading a pile of strange signs")

:p
z0n0
Legendary
*
Offline Offline

Activity: 1274
Merit: 1006



View Profile
January 15, 2016, 11:38:57 AM
 #4245

Oh, is that what you're reading?

Well, over here it looks like this "这是我读一堆奇怪的迹象"

:p

hahaha, yeah or that Smiley


Hey guys, what's the X11 hash speed with an R9 380 (4GB Sapphire)?
ldw-com
Full Member
***
Offline Offline

Activity: 229
Merit: 100


View Profile
January 15, 2016, 11:41:28 AM
 #4246

Hey guys, what's the X11 hash speed with an R9 380 (4GB Sapphire)?

I'm getting 18-19 MH/s with a 390, so around 17 I guess?
pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 15, 2016, 11:50:22 AM
 #4247

You might not be looking at the big picture with Groestl - look at that fucking shitty amount of waves in flight you get due to LDS (ab)use.

That's an issue with <= tahiti only, hence why I hate optimizing for those chips ;-)

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 15, 2016, 12:12:49 PM
 #4248

Not the case - two waves in flight, and your kernel is STILL not actually using the GPU's parallelism like it's supposed to be. One Groestl-512 hash is a big job, and it's parallelizable. If you're doing a throughput of 64 hashes per local workgroup, then use 256 for Groestl, and do 4 work-items per actual hash. Tune to taste.

I understand what you mean: it's like the good old cgminer "vector size". I will think about it.
Besides, I haven't worked on groestl for a long while, but on whirlpool and variants I can easily get 3 waves on >= hawaii.
It's a lighter job, I know, but I haven't had any interest in developing groestl recently.
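Wolf0's "256 work-items, 4 per hash" layout above boils down to simple index arithmetic. These helpers are illustrative only - the names and sizes are assumptions, not taken from any actual kernel:

```c
/* Wolf0's suggested layout: a 256-wide workgroup in which every run of
 * 4 consecutive work-items cooperates on one Groestl hash. */
#define WG_SIZE        256
#define LANES_PER_HASH 4

/* Which hash this work-item contributes to. */
static int hash_index(int local_id) { return local_id / LANES_PER_HASH; }

/* Which quarter of that hash's state this work-item owns. */
static int lane_index(int local_id) { return local_id % LANES_PER_HASH; }

/* Throughput per workgroup: 256 / 4 = 64 hashes, as in the quote. */
static int hashes_per_group(void)   { return WG_SIZE / LANES_PER_HASH; }
```

With this mapping, four consecutive lanes share one hash's state through LDS (or a swizzle), while occupancy stays at 64 hashes per group.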

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 15, 2016, 01:27:06 PM
 #4249

No, it is the OPPOSITE of vector size. You don't get how the GPU is ACTUALLY supposed to solve issues, I don't think - it really doesn't fucking like large code size, or very complex problems in one work-item - you know this.

Vectors were profitable before because of the old architectures - VLIW based. GCN abolished hardware vectors, and instead made VGPRs 4 bytes. Why, you may ask? Occupancy! This way, if you need to work on a problem that can't be efficiently vectorized like that, you don't waste most of your VGPR.

But, but, but... mah parallelism! GCN has you covered - you just need to think of the shit differently. Instead of parallelizing in vectors, do it in work-items. To give you the cleanest example I've worked with demonstrating this (in X11), take Echo-512.

You have a 256-byte state which I'll now refer to as W. W can be represented as an array of 16 uint4s. If you're looking at the shitty darkcoin-mod.cl trying to visualize this, just look at the 64-bit W vars and imagine them as 32-bit, and an array. Now, if I were going to demonstrate this technique with Echo - each work-item has an array of 4 uint4s. This is my W. To figure out which part of the hash you are, you can choose two ways: launch the kernel with a throughput * 4, 1, 1 local size, or do throughput, 4, 1. Since the latter is cleaner, I'll assume that notation: lid = get_local_id(0), and hashid = get_local_id(1).

If hashid is < 2 (i.e. 0 or 1), we fill up W with (512, 0, 0, 0) (uint4, remember) over all four array indices. If hashid == 2, W becomes the input (the input being 16 uints, it may be represented as 4 uint4s as well), and if hashid == 3, we fill up W with the odds & ends - for X11, these are (0x80, 0, 0, 0) for W[0], (0, 0, 0, 0) for W[1], (0, 0, 0, 0x02000000) for W[2], and (512, 0, 0, 0) for W[3]. Now, go pull up darkcoin-mod.cl and look at it until this and the previous paragraph make sense.

I'll continue with rounds and output calculation in another post in just a bit.

Thanks for the explanation: let me put it in simple words so you can easily tell whether I got it or not :-)
Basically you are dividing the state into 4 parts which will be computed by different work items. Fewer regs per kernel leads to more waves and generally better parallelism.
Looking at Echo, there is a good amount of work which can be done on a single slice of the input, but in this case:

#define BIG_SHIFT_ROWS   do { \
    SHIFT_ROW1(1, 5, 9, D); \
    SHIFT_ROW2(2, 6, A, E); \
    SHIFT_ROW3(3, 7, B, F); \
  } while (0)

there are overlaps. I assume you'll use LDS to communicate between work items, or maybe shuffle, but that would need assembly.
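Wolf0's per-work-item state initialization quoted above can be sketched in C. The 256-byte Echo state is split into four quarters, each modelled here as W[4][4] of uint32_t (four uint4s); the constants follow his description for X11 input, while the function name and layout are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Each of the 4 cooperating work-items (hashid 0..3) initializes the
 * quarter of Echo-512's state it owns, per Wolf0's description. */
static void echo_init_quarter(uint32_t W[4][4], int hashid,
                              const uint32_t input[16])
{
    memset(W, 0, 16 * sizeof(uint32_t));
    if (hashid < 2) {              /* hashid 0 or 1: (512, 0, 0, 0) x4 */
        for (int i = 0; i < 4; i++)
            W[i][0] = 512;
    } else if (hashid == 2) {      /* this quarter carries the input */
        memcpy(W, input, 16 * sizeof(uint32_t));
    } else {                       /* hashid 3: the odds & ends */
        W[0][0] = 0x80;            /* padding byte */
        W[2][3] = 0x02000000;
        W[3][0] = 512;
    }
}
```

Each quarter then runs its share of the rounds, using hashid to select its slice, which is where the LDS (or shuffle) communication pallas mentions comes in.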

pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 16, 2016, 08:39:32 PM
 #4250

Wolf0, I'm curious to know whether you tried that technique (splitting across multiple work items) on a kernel, and what the outcome was.

go6ooo1212
Legendary
*
Offline Offline

Activity: 1512
Merit: 1000


quarkchain.io


View Profile
January 16, 2016, 10:51:28 PM
 #4251

Has anyone tested the hash power of the R9 Nano on Ethereum?
Eliovp
Legendary
*
Offline Offline

Activity: 1050
Merit: 1293

Huh?


View Profile WWW
January 16, 2016, 11:06:08 PM
 #4252

Has anyone tested the hash power of the R9 Nano on Ethereum?

Sure

I have

26 MH/s stock

 Smiley

go6ooo1212
Legendary
*
Offline Offline

Activity: 1512
Merit: 1000


quarkchain.io


View Profile
January 16, 2016, 11:24:31 PM
 #4253

26 MH/s stock

 Smiley

Pf, I expected more of an improvement compared to the 7970/280X.
Eliovp
Legendary
*
Offline Offline

Activity: 1050
Merit: 1293

Huh?


View Profile WWW
January 17, 2016, 01:26:41 AM
 #4254


Pf, I expected more of an improvement compared to the 7970/280X.

You're not the only one...

chrysophylax
Legendary
*
Offline Offline

Activity: 2828
Merit: 1091


--- ChainWorks Industries ---


View Profile WWW
January 17, 2016, 04:32:22 AM
 #4255

Pf, I expected more of an improvement compared to the 7970/280X.

You're not the only one...

just wait till wolf decides that it's worth his time to get on the code for the new cards ... if he does, of course ...

Wink ...

#crysx

RyanX
Full Member
***
Offline Offline

Activity: 140
Merit: 100


fastdice.com The Worlds Fastest Bitcoin Dice


View Profile WWW
January 17, 2016, 10:33:47 AM
 #4256

Has anyone tested the hash power of the R9 Nano on Ethereum?


https://forum.ethereum.org/discussion/comment/16112/#Comment_16112

"I have a Nano. HBM does you no good for ethash. HBM memory cycle time (effective) is the same as GDDR5 cards, although GDDR5 has higher latency. The advantage of HBM is that it has a 4096 bit wide bus, so the *bandwidth* is much higher, not the access time, which is what's critical for ethash. The issue is that ethash does more or less random 128 byte reads of the Dag at the core of the inner loop, which can't take advantage of HBM's super wide bus. That is compounded by the fact that the address of the next read (of the Dag) is computed from the 128 byte Dag currently being "mixed", so pre-fetch isn't happening to any significant degree. The algorithm's memory accesses also render memory caching of limited use. Starting to get the picture? Smiley The GPU does overlap global memory accesses with compute work, so some of the memory access time is hidden, but that has nothing to do with HBM. The fact is ethash is a bugger, period."
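The access pattern described in that quote - each DAG read's address computed from the mix produced by the previous read - can be modelled with a toy serial loop. Sizes and the mixing arithmetic below are stand-ins, not real ethash; the sketch only shows why prefetching and a wide bus don't help:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: every iteration reads one 128-byte "DAG" entry whose index
 * is derived from the current mix, so the next address is unknown until
 * the current read finishes - no useful prefetch, no sequential bursts. */
#define DAG_ENTRIES     1024
#define WORDS_PER_ENTRY 32           /* 128 bytes as 32-bit words */

static uint32_t toy_ethash_mix(uint32_t dag[][WORDS_PER_ENTRY],
                               uint32_t seed, int rounds)
{
    uint32_t mix = seed;
    for (int r = 0; r < rounds; r++) {
        uint32_t idx = mix % DAG_ENTRIES;     /* address depends on mix */
        for (int w = 0; w < WORDS_PER_ENTRY; w++)
            mix = (mix * 33u) ^ dag[idx][w];  /* serial dependency chain */
    }
    return mix;
}
```

Because `idx` for round r+1 only exists after round r completes, the loop is latency-bound on individual 128-byte fetches, which is exactly the regime where HBM's wide bus buys nothing over GDDR5.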

revelacaogr
Legendary
*
Offline Offline

Activity: 1316
Merit: 1021

2009 Alea iacta est


View Profile
January 20, 2016, 09:12:49 AM
Last edit: January 20, 2016, 09:27:34 AM by revelacaogr
 #4257

I hope WOLF0 will make a good public algo for Decred when mining starts...

PS: Wolf0, you must know that you are my favorite... you prove your work with facts and not with many sauces... THANKS FOR ALL!
pallas
Legendary
*
Offline Offline

Activity: 2716
Merit: 1094


Black Belt Developer


View Profile
January 20, 2016, 09:54:26 AM
 #4258

I hope WOLF0 will make a good public algo for Decred when mining starts...
I hope Wolf0 will prove me wrong, but I fear there is not a lot to optimise in 14-round Blake.

ldw-com
Full Member
***
Offline Offline

Activity: 229
Merit: 100


View Profile
January 20, 2016, 10:36:29 AM
 #4259

PS: Wolf0, you must know that you are my favorite... you prove your work with facts and not with many sauces... THANKS FOR ALL!

That sounds so weird :p
revelacaogr
Legendary
*
Offline Offline

Activity: 1316
Merit: 1021

2009 Alea iacta est


View Profile
January 20, 2016, 10:00:44 PM
 #4260

"You prove your work with facts and not with many sauces": Not Just Words, But Deeds...

What is weird about that?