joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 04:40:30 PM |
|
joblo, does your quark optimisation work at the end? not sure I understand your conversation with sp_ fully: where does the +30% come from?
Mostly from more efficient management of the groestl ctx. Because quark can run groestl twice per round, the init function was being run twice every time the hash function was called - and the hash function was called in a scanhash loop. That's 2x the number of hash calls for something that only needs to be done once. That was a big boost, though I don't recall exactly how much. The reduction in the number of inits also helped other algos like x11. I also created a fast reinit function that skips the constants, so now a full init is done once when scanhash is called and any subsequent reinits are fast. That alone added another 5%. I have another idea: factor the full init out of scanhash so the ctx is fully initted only once, ever, before entering the thread loop.
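A minimal sketch of the idea in C - the context layout and function names here are illustrative stand-ins, not the real sph_groestl API:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ctx layout: the constants never change after the first
   init, only the chaining state must be reset between hashes. */
typedef struct {
    uint64_t round_consts[8];  /* expensive to compute, never changes */
    uint64_t chaining[8];      /* must be reset for every new hash */
} groestl_ctx;

/* Full init: computes the constants AND resets the state. */
void groestl_full_init(groestl_ctx *ctx) {
    for (int i = 0; i < 8; i++)
        ctx->round_consts[i] = 0x9e3779b97f4a7c15ULL * (uint64_t)(i + 1);
    memset(ctx->chaining, 0, sizeof ctx->chaining);
}

/* Fast reinit: skips the constants, only resets the per-hash state. */
void groestl_reinit(groestl_ctx *ctx) {
    memset(ctx->chaining, 0, sizeof ctx->chaining);
}

/* scanhash does ONE full init, then the hot loop uses the fast reinit
   (instead of the old pattern of a full init on every hash call). */
uint64_t scanhash_sketch(uint32_t max_nonce) {
    groestl_ctx ctx;
    uint64_t work = 0;
    groestl_full_init(&ctx);            /* once per scanhash call */
    for (uint32_t n = 0; n < max_nonce; n++) {
        groestl_reinit(&ctx);           /* cheap: no constants */
        ctx.chaining[0] ^= n;           /* stand-in for the real hash */
        work += ctx.chaining[0] & 1;
    }
    return work;
}
```

The further step described above is moving `groestl_full_init` out of `scanhash_sketch` entirely, into the thread setup, so it runs once per thread lifetime rather than once per scanhash call.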
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 04:50:42 PM |
|
Joblo's optimization impacts CPU validation of any found shares. This is usually insignificant, but since he's also mining with all CPU cores, it did have an impact for him: his CPU mining was slowing down ccminer.
Joblo: you're invited for a beer over at #ccminer @freenode - there's friendlier dev talk there, some collaboration now and then, and certainly a lot less BS.
Thanks for the invite. I plan to join #ccminer (and github, and...) when things settle down, which they are beginning to do. I've been so busy trying to get all the algos supported and delivering the quick optimizations that I'm only now starting to think longer term. I'm working on a design to modularize algos that doesn't require any base code changes when adding a new algo. But that's a big feature that requires a lot of thought. I have high standards and don't want to present a half-baked plan.
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 06:20:16 PM |
|
I was hoping to get a better response to my technical trolls, but all I got was more bluster. I was trying to find out if our skills were complementary. I am a complete noob when it comes to cuda, so I was hoping SP could implement some of my ideas with his knowledge of cuda. When I provided a demonstration of my skills he responded with "silly you, that was cpu verification code" and "why don't you do better", without ever considering the technical merit or other applications for the changes I made.
He's more interested in selling what he has over and over again rather than providing anything new that sells itself. I'm afraid SP has turned into a telemarketer.
|
|
|
|
bensam1231
Legendary
Offline
Activity: 1750
Merit: 1024
|
|
January 28, 2016, 07:34:21 PM Last edit: January 28, 2016, 08:12:47 PM by bensam1231 |
|
Faster but not profitable. I didn't reach 5 MH/s yet.
Well, did you modify SIMD in it? And if you feel like sharing, how much was gained from SIMD alone in X11, as a speed percentage?
Why don't you stick with AMD. Disassemble the kochur bins and check for yourself.
I am, thanks. What I'm interested in is how much there is to be gained from SIMD. While the architecture is different, many things are similar - if there's an unexpectedly massive improvement from SIMD on Nvidia GPUs, it is quite likely there is on AMD. Also, why so defensive? I have no intention of encroaching on your turf here - I could do more CUDA if I wanted, but for now it does not interest me. You don't need to see me as a threat.
You are no competition to me...
Then why are you afraid? Grills grills, there are plenty of optimizations to go around! Since wolf0 is hot stuff, he should make Eth better.
Oh, by the way, sp_ - the comment about open sourcing some of my work is, I think, a little unfounded. For example, I semi-recently did not just the ONLY open-source implementation of a CryptoNight AMD miner, but one not based on existing code infected with the GPL. This means there's now a base that's not only open, but MIT/BSD licensed, for others to work off of. And on top of this, the community around the coins using the CryptoNight PoW really needed it, because the only existing AMD miner for it before mine was Claymore's, which was closed-source with a fee, and WAS Windows-only for the longest time. I even forked my own project and made a CryptoNight-Lite miner for that PoW - Claymore refused to implement it. You can find my CryptoNight miner here: https://github.com/wolf9466/wolf-xmr-miner -- and my CryptoNight-Lite miner here: https://github.com/wolf9466/wolf-aeon-miner
Unless it's like 2x faster than Claymore, it's not worth mining with. Monero hasn't been profitable for some time. Botnets consumed Cryptonote. Quick check on Vanillacoin.
I get about 2.9 GH/s per 970; it doesn't appear to be more profitable than Ethereum right now, but it's always nice to have options.
|
I buy private Nvidia miners. Send information and/or inquiries to my PM box.
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
January 28, 2016, 08:34:08 PM |
|
Assembler for NVIDIA Maxwell architecture https://github.com/NervanaSystems/maxas
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 08:58:26 PM |
|
Thanks, that will be useful when I learn how to use it. I'm looking for docs that describe the cuda processor architecture in detail so I can determine things like how many loads to queue up to fill the pipe, how many execution units there are, user cache management, etc. That kind of information is necessary to maximize instruction throughput at the processor level. Do you know of any available docs with this kind of info?
|
|
|
|
bensam1231
Legendary
Offline
Activity: 1750
Merit: 1024
|
|
January 28, 2016, 09:02:55 PM |
|
Trying out Vanillacoin on Nova, anyone getting a lot of duplicate shares? I'm using the build of SP .78 off of Cryptominingblog as well since SP hasn't updated his releases yet.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
January 28, 2016, 09:10:39 PM Last edit: January 28, 2016, 09:50:08 PM by sp_ |
|
There is not much info available, but if you disassemble compiled code you will see that the maxwell is superscalar with 2 pipes: 2 instructions per cycle. It's able to execute instructions while writing to memory if the code is in the instruction cache, and to avoid ALU stalls you need to reorder your instructions carefully. There are vector instructions that can write bigger chunks of memory with fewer instructions, etc. The compiler is usually doing a good job here - little to gain. Ask DJM34 for more info. He is good in the random stuff...
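A rough CPU-side sketch of that "bigger chunks, fewer instructions" point (on the GPU the analogue would be the hardware's wide vector loads/stores, which the compiler usually emits on its own when alignment allows - this C version and its function name are illustrative only):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Copy n bytes 8 at a time instead of 1 at a time: fewer instructions
   issued for the same amount of data moved. memcpy through a uint64_t
   keeps this free of alignment and strict-aliasing traps; compilers
   lower each small memcpy to a single wide load or store. */
void copy_wide(uint8_t *dst, const uint8_t *src, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, src + i, 8);   /* one 64-bit load  */
        memcpy(dst + i, &chunk, 8);   /* one 64-bit store */
    }
    for (; i < n; i++)                /* byte tail */
        dst[i] = src[i];
}
```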
|
|
|
|
bensam1231
Legendary
Offline
Activity: 1750
Merit: 1024
|
|
January 28, 2016, 09:19:41 PM |
|
Update: SP's miner for Vanillacoin seems to be messed up on Suprnova. Tpruvot's version works fine for Vanillacoin, no duplicate share issues.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
January 28, 2016, 09:24:31 PM |
|
He has added a duplicate checker in the code.
I also added it. You need to recompile latest@git
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 28, 2016, 10:03:28 PM |
|
Thanks again. Have you tried interleaving memory accesses with arithmetic instructions so they can be issued the same clock? When copying mem, do you issue the first load and then the first store immediately after it? The first load fills the cache line and the first store waits for the first bytes to become available. Then you can queue up enough loads to fill the pipe and do other things while waiting for mem. Multi-buffering is a given, being careful not to overuse regs. If you're doing a load, process, and store it's even better, because you can have one instruction slot focused on memory while the other does the processing. These are things I'd like to try but haven't got the time. Although I've done similar in the past, there were no performance tests that could quantify the effect, good or bad. If you think this has merit, give it a shot. Like I said, if it works just keep it open, because I could still implement it myself.
The hotter the code segments you choose the bigger the result should be. Some of the assembly routines would be logical targets.
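A sketch of that load/process/store pattern on the CPU side - `stream_xor` is a hypothetical hot routine, with the "process" step reduced to an XOR; a dual-issue or out-of-order core can overlap the batched loads with the ALU work:

```c
#include <stdint.h>
#include <stddef.h>

/* Multi-buffered load -> process -> store. The four loads are issued
   up front so their latencies overlap; the XOR+store pairs (the other
   "instruction slot") then run while later loads are still in flight.
   Register pressure stays low: four temporaries. */
void stream_xor(uint64_t *dst, const uint64_t *src, size_t n, uint64_t key) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t a = src[i + 0];   /* queue up loads to fill the pipe */
        uint64_t b = src[i + 1];
        uint64_t c = src[i + 2];
        uint64_t d = src[i + 3];
        dst[i + 0] = a ^ key;      /* ALU + store while loads drain */
        dst[i + 1] = b ^ key;
        dst[i + 2] = c ^ key;
        dst[i + 3] = d ^ key;
    }
    for (; i < n; i++)             /* tail */
        dst[i] = src[i] ^ key;
}
```

On an in-order core the interleaving would have to be written out explicitly, as discussed below for the GPU case; on modern x86 the hardware reorders this on its own.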
|
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 29, 2016, 03:13:10 AM |
|
GDS (global memory), LDS (local memory), and work-item shuffle all require a little waiting period before they complete. So, say I'm using ds_swizzle_b32 (work-item shuffle) like I had fun with in my 4-way Echo-512... On AMD GCN, you can do some shit like so:
[code]
# These shuffle in the correct state dwords from other work-items that are adjacent.
# This is done in place of BigShiftRows, but before BigMixColumns.
# So, my uint4 variables (in OpenCL notation) named b and c are now loaded
# properly without the need for shifting rows.
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y
[/code]
Each and every one of these takes time, however - and each one increments a little counter. What I can do is this: since the first row in the state is not shifted, the a variable is already ready. It's in registers and ready to be used. The first thing I do in the OpenCL after loading up the proper state values - in BigMixColumns - is a ^ b. So, I can do something like this:
[code]
s_waitcnt lgkmcnt(4)
[/code]
What this does is wait on the pending operations until there are four left. They're queued in the order the instructions were issued - so the b uint4 should now be loaded. Note, however, that the c uint4 is NOT guaranteed to have been loaded, and cannot be relied on (yet). Now, I can process the XOR while the swizzle operation on the c uint4 is working!
[code]
v_xor_b32 v42, v15, v36 # v42 = a.z ^ b.z
v_xor_b32 v43, v16, v37 # v43 = a.w ^ b.w
v_xor_b32 v38, v74, v38 # v38 = a.x ^ b.x
v_xor_b32 v39, v75, v39 # v39 = a.y ^ b.y
[/code]
And then we can put in an instruction to wait for the c uint4 before we continue...
[code]
s_waitcnt lgkmcnt(0)
[/code]
In case you're wondering, I load the d uint4 later in the code. Also, if you *really* wanna try your damndest to maximize the time spent executing compute shit during loads, you could do this (although you've probably figured it out by now):
[code]
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w
ds_swizzle_b32 v38, v78 offset:0x8039 # b.x
ds_swizzle_b32 v39, v79 offset:0x8039 # b.y
ds_swizzle_b32 v15, v84 offset:0x804E # c.z
ds_swizzle_b32 v16, v85 offset:0x804E # c.w
ds_swizzle_b32 v33, v82 offset:0x804E # c.x
ds_swizzle_b32 v34, v83 offset:0x804E # c.y

s_waitcnt lgkmcnt(7)
v_xor_b32 v42, v15, v36 # v42 = a.z ^ b.z
s_waitcnt lgkmcnt(6)
v_xor_b32 v43, v16, v37 # v43 = a.w ^ b.w
s_waitcnt lgkmcnt(5)
v_xor_b32 v38, v74, v38 # v38 = a.x ^ b.x
s_waitcnt lgkmcnt(4)
v_xor_b32 v39, v75, v39 # v39 = a.y ^ b.y
[/code]
You get the idea...
I think I follow, even though that syntax is completely foreign to me. I think what you did is what I was talking about, but I would go one step farther. It may not apply, because I don't understand the wait instructions unless there are synchronization issues. In addition to what you did, I would put the first xor on b immediately after the first load. I know it's stalled waiting for data, but I want its dependent instruction already queued for when the data becomes available. Secondly, that first load will fill the cache line, so there is no need to queue up the load instruction until the first load completes. Subsequent loads will finish immediately because they hit the cache.
What I would not do is have a string of identical instructions, because they all compete for the same execution unit and can only be issued one per clock. I would interleave the swizzles and xors so they can both be issued on the same clock, assuming all dependencies are met. With comments:
[code]
ds_swizzle_b32 v36, v80 offset:0x8039 # b.z // start filling the cache with b
v_xor_b32 v42, v15, v36 # v42 = a.z ^ b.z // queue first xor for when b is ready
ds_swizzle_b32 v37, v81 offset:0x8039 # b.w // completes one clock after the previous swizzle, so...
v_xor_b32 v43, v16, v37 # v43 = a.w ^ b.w // make sure we're ready for it
[/code]
I think you get it. When all the b vars are loaded you can queue the c vars while still processing and saving the first batch. I would even go one step farther, to the loading of a, if possible. I would start with swizzle a, immediately followed by swizzle b, then the first xor. There will be a lot of stalling waiting for memory here, so if there are any other trivial tasks, do them next. Loading a & b in parallel may seem odd, but once both are in the cache you're flying. Then you can mix saving processed data and loading new data, giving priority to loads to keep the GPU hot, and you can stick in the first swizzle c early to get the data ready.
I learned some of this stuff on a company-paid Motorola course. The instructor was a geek and our class was pretty sharp, so we covered the material early and then started having fun. At the time we were in a performance crunch, with customers demanding more capacity, so we focused on code scheduling and user cache management. One of the more bizarre instructions was the delayed branch. It essentially means branch AFTER the next instruction; that next instruction was often returning the rc. It took some getting used to, but it gives an idea of the level of optimization they were into at the time. It's the same CPU that had the ability to mark a cache line valid without touching mem. It's great for malloc, because the data is initially undefined anyway: who cares whether the garbage comes from mem or stale cache, it's all garbage. Imagine mallocing 1k and having it cached without ever touching the bus. They also have an instruction to preload the cache for real - that is essentially what I was simulating above. It also had a user flush, so you could flush data at any convenient time after you no longer needed it, instead of a system-initiated flush when you are stalled waiting for new data.
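The "preload the cache for real" instruction has a direct modern counterpart; a sketch using the GCC/Clang `__builtin_prefetch` hint (the prefetch distance of 16 elements is a guess here, not a tuned value):

```c
#include <stddef.h>

/* Walk a large array, hinting the cache to start fetching the data we
   will need ~16 iterations from now, so those loads overlap with the
   adds instead of stalling when we reach them. The hint is free to be
   ignored by the hardware; correctness never depends on it. */
double sum_with_prefetch(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 0, 3); /* read, high locality */
        s += x[i];
    }
    return s;
}
```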
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 29, 2016, 04:40:11 AM |
|
NO, NO, NO. The swizzle operation, like LDS and GDS loads, takes TIME - clock cycles. If you try to use the result without using s_waitcnt to be sure the operation has completed, more than likely you'll be reading garbage. The likelihood of this occurring becomes greater the closer your read instruction is to the load instruction that must be waited on - or, more accurately, the fewer clock cycles that have passed.
The uint4 I named a is already in registers - if you wanna walk it back, it's actually from an AES operation before, which may be done via LDS lookups into a table and XORs, or a bitsliced AES S-box followed by an otherwise mostly classic-style AES implementation. I think your misunderstanding is that you think v_xor_b32 queues something. It doesn't. ds_* instructions you might be able to say "queue" something, in the sense that they trigger an LDS read/write and immediately allow the next instruction to execute. v_xor_b32 is an immediate XOR of two registers. It couldn't give a fuck less what's in them, or what you meant to put in them - it's going to XOR them and put the result into the destination register, and if it's not what you intended it to be, that's your problem.
Keep in mind - there is no swizzle for the uint4 named a; the first row is not shifted. This is why you don't see any swizzle ops for it - it is entirely contained within the single work-item. This is why I swizzle b, then c, and then begin my XORs. Keep in mind, again, that this triggers the start of the swizzle and immediately goes to the next instruction - this means that if I do b AND c one after another, and only wait on b in order to XOR it with a, I'm putting more clock cycles between the time I initiated the swizzle for c and the time I need it to complete. It's entirely possible that by the time I call s_waitcnt to ensure the c variables are ready, they already are, and the instruction takes basically no time at all.
You're also thinking about cache, which doesn't apply here at all - swizzle is a 4-way crossbar that allows transfer of values between work-items on a compute unit. In addition, even if it wasn't, I couldn't give a fuck less if it's in the cache - hell, I'd rather it NOT be. Why? Simply because I just loaded those values into registers, and were they in memory, I would never be reading them from memory again. X11 and friends can be done using ZERO global memory at all (besides getting work, storing the state for the next kernel, and output, of course) - if you work at it, it can even be done without using LDS for storage of shit like AES tables. Now, you *may* want to use LDS for other reasons to create an optimal GPU implementation, but these are related more to parallelizing the hash functions by unrolling them across multiple WIs (like this Echo-512 we're discussing) rather than actual storage of data that's honestly needed to compute the hash function. Because of this, cache is really irrelevant, and in an extremely well optimized X11 kernel set, you should be able to downclock memory to hell and have it not matter one iota. Fun fact: this is why the claim of X11 being "ASIC-resistant" is more or less a flat-out lie.
What most people call "ASIC-resistant" is actually "ASIC-unprofitable" - meaning that the ASIC would cost so much that its advantage over the currently used mining hardware doesn't justify making it. Usually, this is done via memory usage, at least for now. But X11 isn't memory-hard - shit, it doesn't really need memory at all, especially if implemented in hardware.

Perhaps your example was not well chosen, too many new concepts for me. Try to think of it in the general sense, where data is loaded from mem, some processing is done, and the result is stored back in mem. I usually see a string of 4 or 8 loads followed by a similar string of xors or adds or whatever, and then a string of stores. This is OK in the sense that it uses multi-buffering, but it's inefficient because it can't take advantage of multiple instruction issue. It's all serial. There's also no need to rush the second load because the first one will get the cache line filled (assuming there is cache).

And user cache management doesn't depend on whether the application caches well. It's useful because the coder can manage the cache to overcome the app's deficiency. Need some data soon but have other things to do first? Preload it so it's ready when you are. Done with a buffer? Flush the cache line to get rid of the data and free up the bus for the next data you need.

It's all about managing the data and planning when you need it and how to have it when you need it, so you don't have to wait as long. With mem being the bottleneck, you want to prioritize managing mem accesses to reduce latency. You don't want the bus sitting idle while you do a shitload of alu stuff just to have to wait when you ask for more data. When I mentioned queueing the xor I didn't mean it literally. I just meant have it ready to be issued as soon as the data arrives.
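For what it's worth, the preload-early / flush-when-done idea can be sketched on the CPU side with GCC's __builtin_prefetch hint. This is only an illustration, not anyone's actual miner code: the xor with a constant stands in for "other things to do", and the lookahead distance of 8 words is a made-up tuning parameter.

```c
#include <stdint.h>
#include <stddef.h>

/* Prefetch the line we will need a few iterations ahead, so the
   memory request is in flight while the alu works on the current
   word. Flags: 0 = read, 0 = low temporal locality (don't keep
   the line around after use). */
void xor_stream(uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (i + 8 < words)
            __builtin_prefetch(&src[i + 8], 0, 0);
        dst[i] = src[i] ^ 0xdeadbeefu; /* alu work overlaps the fill */
    }
}
```

The compiler builtin is only a hint and the hardware prefetcher often makes it redundant on a linear walk like this; the win the post describes comes from explicit control when the access pattern is known in advance.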
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 29, 2016, 06:25:38 AM |
|
Here's the short answer, assuming a 64-bit mem bus and a 32-byte cache line (i.e. 1 address cycle and 4 data cycles per burst to fill the cache line), a 4-deep mem queue, and 2 instructions per clock. An optimized memcpy in pseudo-asm:
ld r0, src            ; start loading 1st src cache line
ld r4, src+4          ; start loading 2nd src cache line
preallocate dst cache ; intent to write, so no cache fill from mem required, no wait
st r0, dst            ; ready as soon as the first word arrives, stall here
ld r1, src+1          ; load 2nd word of 1st line, cached now, no wait
st r1, dst+1          ; store it immediately, no stall
ld r2, src+2          ; etc
st r2, dst+2
ld r3, src+3
st r3, dst+3
flush src             ; flush the first source cache line unmodified, no writeback
flush dst             ; modified, writeback to mem; now, to keep the bus busy:
st r4, dst+4          ; by now the second cache line is filled, no wait
ld r0, src+8          ; start filling 3rd cache line
finish copying the second cache line, etc.
This does not maximize dual instruction issue because all the instructions use the same execution unit. The bus is kept busy after an initial wait for the first word; while you wait, do anything else you can that uses another execution unit, like incrementing counters. Those instructions are essentially free. If the function were modified to process every word, that processing would also be free. In fact, the more processing you do the more efficient it gets, because you are using the alu more, all while the mem bus is busy doing its thing as fast as it can. If the mem interface is designed properly there should be no problem with collisions. It should always prioritize reads before writes.
If this model can be implemented in CUDA we should see some gains. I just don't have the CUDA knowledge to know if it can be done or how.
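The load-ahead / store-behind interleaving of the pseudo-asm above can be sketched in plain C. This is illustrative only: a compiler and an out-of-order core will reschedule the scalar version anyway, but it shows the intent of keeping the next load in flight while the previous store retires.

```c
#include <stdint.h>
#include <stddef.h>

/* Software-pipelined copy: the load for word i+1 is issued before
   the store of word i, so a memory read is always outstanding
   while the store completes. */
void copy_pipelined(uint32_t *dst, const uint32_t *src, size_t n)
{
    if (n == 0)
        return;
    uint32_t cur = src[0];           /* prime the pipeline */
    for (size_t i = 0; i + 1 < n; i++) {
        uint32_t next = src[i + 1];  /* load ahead */
        dst[i] = cur;                /* store behind */
        cur = next;
    }
    dst[n - 1] = cur;                /* drain the pipeline */
}
```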
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
January 29, 2016, 06:57:56 AM |
|
Use the official version. They are using my optimized kernel as well..
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 29, 2016, 07:38:02 AM |
|
You're not getting the real issue here - GDS is read ONCE. ONLY ONCE. In pretty much all X algos. Well, per kernel.
I'm not getting what you're saying. It's not about repeated accesses to the same data, it's about accessing different data in the same cache line only once. Preloading the cache line with the initial load instruction means the subsequent data will be available sooner. Anyway, SP doesn't seem interested and it's his thread, so I should probably drop it.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
January 29, 2016, 07:46:30 AM |
|
You're not getting the real issue here - GDS is read ONCE. ONLY ONCE. In pretty much all X algos. Well, per kernel.
I'm not getting what you're saying. It's not about repeated accesses to the same data, it's about accessing different data in the same cache line only once. Preloading the cache line with the initial load instruction means the subsequent data will be available sooner. Anyway, SP doesn't seem interested and it's his thread, so I should probably drop it.
X11 and quark only read memory linearly. In my mod I use vector instructions in the GPU to load many 32-bit words in one instruction. If you compile this to ptx you will see what I mean.

#include "cuda_vector.h"
...
uint32_t h[16];
uint28 *phash = (uint28*)hash;
uint28 *outpt = (uint28*)h;
outpt[0] = phash[0];
outpt[1] = phash[1];
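A rough CPU-side analogue of that uint28 vector copy: copying through a 4-word struct turns four 32-bit loads/stores into one wide access (a compiler will typically emit a single 128-bit move per assignment). The `uint4_t` name here is made up for illustration; the real code uses the `uint28` type from ccminer's cuda_vector.h.

```c
#include <stdint.h>

/* Made-up 4-word vector type standing in for CUDA's uint4/uint28. */
typedef struct { uint32_t x, y, z, w; } uint4_t;

/* Copy a 16-word hash state in 4 wide moves instead of 16 scalar
   ones. The pointer casts ignore strict aliasing, just as the
   original CUDA snippet does. */
void copy_state(uint32_t h[16], const uint32_t hash[16])
{
    const uint4_t *in  = (const uint4_t *)hash;
    uint4_t       *out = (uint4_t *)h;
    out[0] = in[0];  /* 4 words per assignment */
    out[1] = in[1];
    out[2] = in[2];
    out[3] = in[3];
}
```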
|
|
|
|
joblo
Legendary
Offline
Activity: 1470
Merit: 1114
|
|
January 29, 2016, 08:33:30 AM |
|
You're not getting the real issue here - GDS is read ONCE. ONLY ONCE. In pretty much all X algos. Well, per kernel.
I'm not getting what you're saying. It's not about repeated accesses to the same data, it's about accessing different data in the same cache line only once. Preloading the cache line with the initial load instruction means the subsequent data will be available sooner. Anyway, SP doesn't seem interested and it's his thread, so I should probably drop it.
X11 and quark only read memory linearly. In my mod I use vector instructions in the GPU to load many 32-bit words in one instruction. If you compile this to ptx you will see what I mean.

#include "cuda_vector.h"
...
uint32_t h[16];
uint28 *phash = (uint28*)hash;
uint28 *outpt = (uint28*)h;
outpt[0] = phash[0];
outpt[1] = phash[1];

That makes sense. I presume the size of the vector is the same as a cache line. That pretty much neutralizes what I intended to accomplish. What I was proposing had two stages: fill the cache, then load registers from cache with other instructions in between. If CUDA does all that in one instruction, I just have to wait. Got it now. I'll look for some suitable code in cpuminer to try it on.
|
|
|
|
|