pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 15, 2016, 10:47:39 AM |
|
That shows the importance of SIMD. Now you know where to work ;-) Thanks for the information; please keep us updated on the progress, and tell me if I can be of any help.
I don't know - I know one thing for certain now, though: Kachur's Blake was about the same speed as mine; his BMW, however, needed a bit of work. Mine improved the overall X11 hash by 1.35% (remember that BMW itself is quite a small part of X11, so the raw improvement of my BMW over his is much larger.) EDIT: search2, originally Groestl-512, did not take to a simple kernel replacement and will have to be investigated further (manual study of the disassembly.) Skipping for now. EDIT2: search2 may not have broken because of a difference in output, but because of where the bloody constants sit in global. For JH, I'm going to make an all-new test kernel which takes a read-only buffer for JH's constants rather than trying to reference constant memory right now. Simpler. I should then be able to put that in place of the Kachur JH and modify SGMiner to pass a constant buffer to that kernel.
Is search2 faster than yours, or is it just SIMD? Maybe Kachur has found a way to make AES-like algos better... BTW, I wouldn't mind a frankenbin if it's faster and stable ;-)
I can't tell - without a straight-up replacement of a kernel, I dunno if he's done some kind of fuckery with part of a hash in one kernel and part in another, for example. What I suspect is that SIMD has been cut into two parts (at least.) Now, even if his Groestl is faster than mine, my current Groestl is outdated anyway. My R&D area has a bitsliced Groestl that I have not yet played with much - parallelization using 4 work-items, like it's done in CUDA, should be possible. I can drop to GCN ASM for ds_swizzle_b32 - it limits me to a 4-way, as it's not a 32-way shuffle like CUDA's, but that's enough for me. I've just got a lot to do atm.
Maybe there is something we could work on together... a Groestl, perhaps?
If you could look at the code and see if you could split it over multiple work-items and use LDS for the data sharing, I could probably remove said LDS usage by disassembling and modifying the kernel before reassembling it.
SIMD: tonight I was thinking about it, and slicing it into two parts is the natural way of doing it; I think I could try that. The only little annoyance is that the data to be passed between the (at least two) parts won't just be a hash but a bigger set of data, so the standard sgminer searchX() system wouldn't work.
GROESTL (and similar): I always had the idea that nvidia had to do the bitslice thing because its shared memory was slower than GCN's; in fact, nvidia bitsliced is only on par with GCN LDS. As a logical consequence, if bitslice on GCN is presumed to be slower than on nvidia, I wouldn't even try it.
|
|
|
|
24core
|
|
January 15, 2016, 11:00:48 AM |
|
But with Xintensity at 1024 you get invalid shares, so a lower hashrate on the pool. 256 or 512 is the highest I can use.
Default was 640. I used this w/o a problem )))
Can you please share your config for the 7950? Also slightly confused about the Kachur miner, as the NiceHash download is 5.2.1 - is there another link?
https://github.com/nicehash/NiceHashMiner/releases - use sgminer-5-1-0-optimized
setx GPU_MAX_ALLOC_PERCENT 100
sgminer.exe --algorithm darkcoin-mod -o stratum+tcp://x11.eu.nicehash.com:3336 -u 1M948TedPdVkbk59TMnYJhtW5BFTXKPfFL -p d=0.04 -d 0 --xintensity 640 -g 1 -w 64 --gpu-memclock 1500 --gpu-engine 1100 -s 0 --expiry 10 --queue 0 --gpu-powertune 10 --keccak-unroll 0 --hamsi-expand-big 4 --gpu-fan 30-95 --temp-cutoff 95 --temp-overheat 90 --temp-target 75 --auto-fan
I tried those settings and many others, but I get a very low hash rate and my CPU is going mental. I only have a dual-core Intel G3220 @ 3 GHz - does the optimised miner require extra CPU, and hence does my mining rig need a CPU upgrade?
|
|
|
|
z0n0
Legendary
Offline
Activity: 1274
Merit: 1006
|
|
January 15, 2016, 11:33:56 AM |
|
pallas & Wolf0: when I read your (coding 'n shit) dialog, here is what I read: kjndcknsdvnlsdnvlsvlsdlkvlksdmvlmskcnsjvnljsnvljsnvlsdclkslvjhsdljvnlxnvcdnvl
|
|
|
|
ldw-com
|
|
January 15, 2016, 11:37:48 AM |
|
pallas & Wolf0: when I read your (coding 'n shit) dialog, here is what I read: kjndcknsdvnlsdnvlsvlsdlkvlksdmvlmskcnsjvnljsnvljsnvlsdclkslvjhsdljvnlxnvcdnvl Oh, is that what you're reading? Well, over here it looks like this: "这是我读一堆奇怪的迹象" ("this is me reading a bunch of strange signs") :p
|
|
|
|
z0n0
Legendary
Offline
Activity: 1274
Merit: 1006
|
|
January 15, 2016, 11:38:57 AM |
|
hahaha, yeah, or that. Hey guys, what's the hash rate at X11 with an R9 380 (4GB Sapphire)?
|
|
|
|
ldw-com
|
|
January 15, 2016, 11:41:28 AM |
|
I'm getting 18-19 MH/s with a 390, so around 17 I guess?
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 15, 2016, 11:50:22 AM |
|
You might not be looking at the big picture with Groestl - look at that fucking shitty number of waves in flight you get due to LDS (ab)use.
That's an issue with <= Tahiti only, hence why I hate optimizing for those chips ;-)
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 15, 2016, 12:12:49 PM |
|
Not the case - two waves in flight, and your kernel is STILL not actually using the GPU's parallelism like it's supposed to. One Groestl-512 hash is a big job, and it's parallelizable. If you're doing a throughput of 64 hashes per local workgroup, then use 256 for Groestl and do 4 work-items per actual hash. Tune to taste.
I understand what you mean: it's like the good old cgminer "vector size". I will think about it. Besides, I haven't worked on Groestl for a long while, but on Whirlpool and variants I can easily get 3 waves on >= Hawaii. It's a lighter job, I know, but I haven't had any interest in developing Groestl recently.
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 15, 2016, 01:27:06 PM |
|
No, it is the OPPOSITE of vector size. You don't get how the GPU is ACTUALLY supposed to solve issues, I don't think - it really doesn't fucking like large code size, or very complex problems in one work-item - you know this.
Vectors were profitable before because of the old architectures - VLIW-based. GCN abolished hardware vectors and instead made VGPRs 4 bytes. Why, you may ask? Occupancy! This way, if you need to work on a problem that can't be efficiently vectorized like that, you don't waste most of your VGPRs.
But, but, but... mah parallelism! GCN has you covered - you just need to think of the shit differently. Instead of parallelizing in vectors, do it in work-items. To give you the cleanest example I've worked with demonstrating this (in X11), take Echo-512.
You have a 256-byte state which I'll now refer to as W. W can be represented as an array of 16 uint4s. If you're looking at the shitty darkcoin-mod.cl trying to visualize this, just look at the 64-bit W vars and imagine them as 32-bit, and an array. Now, if I were going to demonstrate this technique with Echo - I have an array of 4 uint4s per work-item. This is my W. To figure out which part of the hash you are, you can choose between two ways: launch the kernel with a (throughput * 4, 1, 1) local size, or do (throughput, 4, 1). Since the latter is cleaner, I'll assume that notation: lid = get_local_id(0), and hashid = get_local_id(1).
If hashid is < 2 (i.e. 0 or 1), we fill up W with (512, 0, 0, 0) (uint4, remember) over all four array indices. If hashid == 2, W becomes the input (the input being 16 uints, it may be represented as 4 uint4s as well), and if hashid == 3, we fill up W with the odds & ends - for X11, these are (0x80, 0, 0, 0) for W[0], (0, 0, 0, 0) for W[1], (0, 0, 0, 0x02000000) for W[2], and (512, 0, 0, 0) for W[3]. Now go pull up darkcoin-mod.cl and look at it until this and the previous paragraph make sense.
I'll continue with rounds and output calculation in another post in just a bit.
Thanks for the explanation; let me put it in simple words so you can easily tell whether I got it or not :-) Basically, you are dividing the state into 4 parts which will be computed by different work-items. Fewer regs per kernel leads to more waves and generally better parallelism. Looking at Echo, there is a good amount of work which can be done on a single slice of the input, but in this case:
#define BIG_SHIFT_ROWS do { \
    SHIFT_ROW1(1, 5, 9, D); \
    SHIFT_ROW2(2, 6, A, E); \
    SHIFT_ROW3(3, 7, B, F); \
} while (0)
there are overlaps. I assume you'll use LDS to communicate between work-items, or maybe shuffle, but that would need assembly.
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 16, 2016, 08:39:32 PM |
|
Wolf0, I'm curious to know if you tried that technique (splitting across multiple work-items) on a kernel, and what the outcome was.
|
|
|
|
go6ooo1212
Legendary
Offline
Activity: 1512
Merit: 1000
quarkchain.io
|
|
January 16, 2016, 10:51:28 PM |
|
Has anyone tested the hash power of the R9 Nano on Ethereum?
|
|
|
|
Eliovp
Legendary
Offline
Activity: 1050
Merit: 1293
Huh?
|
|
January 16, 2016, 11:06:08 PM |
|
Has someone been testing the hash-power of R9 Nano , on Ethereum ?
Sure, I get 26 MH/s stock.
|
|
|
|
go6ooo1212
Legendary
Offline
Activity: 1512
Merit: 1000
quarkchain.io
|
|
January 16, 2016, 11:24:31 PM |
|
Has someone been testing the hash-power of R9 Nano , on Ethereum ?
Pff, I expected a bit more improvement compared to the 7970/280X.
|
|
|
|
Eliovp
Legendary
Offline
Activity: 1050
Merit: 1293
Huh?
|
|
January 17, 2016, 01:26:41 AM |
|
You're not the only one...
|
|
|
|
chrysophylax
Legendary
Offline
Activity: 2828
Merit: 1091
--- ChainWorks Industries ---
|
|
January 17, 2016, 04:32:22 AM |
|
Just wait till Wolf decides it's worth his time to get on the code for the new cards ... if he does, of course ... ... #crysx
|
|
|
|
RyanX
Full Member
Offline
Activity: 140
Merit: 100
fastdice.com The Worlds Fastest Bitcoin Dice
|
|
January 17, 2016, 10:33:47 AM |
|
https://forum.ethereum.org/discussion/comment/16112/#Comment_16112
"I have a Nano. HBM does you no good for ethash. HBM memory cycle time (effective) is the same as on GDDR5 cards, although GDDR5 has higher latency. The advantage of HBM is that it has a 4096-bit-wide bus, so the *bandwidth* is much higher, not the access time, which is what's critical for ethash. The issue is that ethash does more or less random 128-byte reads of the DAG at the core of the inner loop, which can't take advantage of HBM's super-wide bus. That is compounded by the fact that the address of the next read (of the DAG) is computed from the 128-byte DAG entry currently being "mixed", so pre-fetch isn't happening to any significant degree. The algorithm's memory accesses also render memory caching of limited use. Starting to get the picture? The GPU does overlap global memory accesses with compute work, so some of the memory access time is hidden, but that has nothing to do with HBM. The fact is ethash is a bugger, period."
|
|
|
|
revelacaogr
Legendary
Offline
Activity: 1316
Merit: 1021
2009 Alea iacta est
|
|
January 20, 2016, 09:12:49 AM Last edit: January 20, 2016, 09:27:34 AM by revelacaogr |
|
i hope WOLF0 will make a good public algo for decred coin when the mining starts.....
ps: wolf0 u must know that u r my favorite...U prove yr work with facts & not with many sauces.....THANKS FOR ALL!
|
|
|
|
pallas
Legendary
Offline
Activity: 2716
Merit: 1094
Black Belt Developer
|
|
January 20, 2016, 09:54:26 AM |
|
i hope WOLF0 will make a good pablic algo for decred coin when the mining will start.....
ps: wolf0 u must know that u r my favorite...U prove yr work with facts & not with many sauces.....THANKS FOR ALL!
I hope Wolf0 will prove me wrong, but I fear there is not a lot to optimise in 14-round Blake.
|
|
|
|
ldw-com
|
|
January 20, 2016, 10:36:29 AM |
|
ps: wolf0 u must know that u r my favorite...U prove yr work with facts & not with many sauces.....THANKS FOR ALL!
That sounds so weird :p
|
|
|
|
revelacaogr
Legendary
Offline
Activity: 1316
Merit: 1021
2009 Alea iacta est
|
|
January 20, 2016, 10:00:44 PM |
|
U prove yr work with facts & not with many sauces: Not Just Words, But Deeds ...
what is weird about that?
|
|
|
|
|