WindMaster (OP)
May 18, 2013, 06:27:03 PM
You would indeed be one of the people I'd expect to have modified your own OpenCL code for scrypt+chacha fairly early on. Anyway, willing to post some hash rate info for your kernel at the current N=128 for a given GPU type and lookup gap (if any)? You can probably post that info safely without giving anyone a head-start on making the modifications themselves.
The knowledge might give others incentive to do it, but oh well. Currently it does 3.4MH/s on a core-underclocked (830->738) HD6990, with lookup_gap at 1, thus no gap. As a curiosity, at N=32, it does 7.3MH/s under the same setup.
Oyy, my implementation was pretty shitty then. You went 20x faster than I did at N=32, though I was on a 6950.
Out of curiosity, how many hours after launch or after you started modifying your OpenCL kernel did it take you to make the changes? Mainly a point of curiosity, for comparison with how many hours it took me. I went from scratch rather than modifying the Reaper/cgminer kernel though, so my hour comparison will differ a bit because of that.
I was a bit late, I started coding the miner about 16h after the launch. It took me 13.5h from start, to a working implementation. It was very intensive, as you might imagine. Difficulty rising like no tomorrow, and my code only gave errors, until it finally worked.
About 8 hours here, but as you can see, my benchmark test was almost catastrophically slower than yours. And that was just to the point of being able to get valid hashes for benchmark purposes, not to finish out an all-up miner. Now I'd be inclined to say my implementation is flawed. Debugging OpenCL code is horrible. :-)
+1 The knowledge might give others incentive to do it
There's still a pretty large technical knowledge barrier to entry though. I suspect everyone with the correct skillset and OpenCL experience already went for it. Though this may give everyone incentive to start figuring out how best to optimize it.

rbdrbd
May 18, 2013, 06:52:36 PM
You would indeed be one of the people I'd expect to have modified your own OpenCL code for scrypt+chacha fairly early on. Anyway, willing to post some hash rate info for your kernel at the current N=128 for a given GPU type and lookup gap (if any)? You can probably post that info safely without giving anyone a head-start on making the modifications themselves.
The knowledge might give others incentive to do it, but oh well. Currently (N=128) it does 3.4MH/s on a core-underclocked (830->738) HD6990, with lookup_gap at 1, thus no gap. As a curiosity, at N=32, it does 7.3MH/s under the same setup.
This confirms the criticism that N started out too low. My calculations show N/KB increases at/around the following dates:
5/21: 256, 32KB
5/30: 512, 64KB
6/2: 1024, 128KB
6/26: 2048, 256KB
7/8: 4096, 512KB
8/14: 8192, 1024KB
How do you feel your YAC GPU kernel performance will hold up off of those adjustments (in absolute terms and in relative terms to a high end CPU)?
I did check out Tacotime's MC2 paper, I like the approach he takes with varying the hash algorithm to achieve maximum ASIC/FPGA resistance. Unfortunately, building GPU resistance for any good length of time looks like a much harder (impossible?) task.
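The N/KB pairs in that schedule are consistent with the usual scrypt scratchpad size of 128 * r * N bytes when r = 1 (N=256 gives 32KB). A minimal C sketch of the arithmetic follows. The value r = 1 is an assumption inferred from the figures in the post rather than read from the YACoin source, and the function names are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

/* Bytes of scratchpad (the V array) needed per scrypt hash: 128 * r * N.
   r = 1 is an assumption inferred from the N/KB pairs quoted in the post. */
uint64_t scrypt_scratchpad_bytes(uint64_t n, uint64_t r)
{
    return 128u * r * n;
}

/* Print the per-hash scratchpad size for each N in the posted schedule. */
void print_scratchpad_schedule(void)
{
    static const uint64_t n_values[] = { 128, 256, 512, 1024, 2048, 4096, 8192 };
    for (size_t i = 0; i < sizeof n_values / sizeof n_values[0]; i++) {
        uint64_t kb = scrypt_scratchpad_bytes(n_values[i], 1) / 1024u;
        printf("N=%llu -> %llu KB per hash\n",
               (unsigned long long)n_values[i], (unsigned long long)kb);
    }
}
```

Since a GPU runs many hashes at once, the per-hash figure gets multiplied by the number of concurrent work items, which is why the later steps in the schedule matter more for GPUs than the absolute sizes suggest.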

snaidervp
Sr. Member
May 18, 2013, 06:54:27 PM
Could some kind person here tell me, in simple steps, how to remove that warning from the YAC client, or what's the worst that can happen if I don't? Ty.

FlyLord
Newbie
May 18, 2013, 06:56:38 PM
So, I'll chime in, I wrote one too and get pretty abysmal rates (just got around to testing it -- 150 kH/s or so, under N=6) using a 7950. I know my implementation can be cleaned up but it would require understanding keccak better and since I'm new to OpenCL and basic optimization information is pretty hard to find, I'm not sure I want to put in the effort. I put in a lookup gap capability but don't need to use it.
Btw, does anyone know if using the vector types in OpenCL on AMD is faster than using the scalar types? I haven't been able to find any definite statements leaning one way or the other ...
I wonder if mtrlt wrote his own keccak implementation or went about porting the optimized version they released. I went about it by copying over the code from scrypt-jane and directly porting it to OpenCL.
Oh and as an fyi, I didn't do it to actually make a miner, I just wanted to see how much effort it would take and how fast it would be.

WindMaster (OP)
May 18, 2013, 07:03:47 PM
I did check out Tacotime's MC2 paper, I like the approach he takes with varying the hash algorithm to achieve maximum ASIC/FPGA resistance. Unfortunately, building GPU resistance for any good length of time looks like a much harder (impossible?) task.
This is likely true. I think the best one can hope for is to narrow the performance gap between CPUs and GPUs by making the memory usage large enough that it gets pushed out to significant amounts of external RAM. At that point, external RAM bandwidth is the deciding factor. We're already seeing GPUs with wider external memory buses than many CPUs, however. At best it's an arms race. It appears that with YAC, the lag time for people to implement it on OpenCL still gave early CPU miners a significant head-start (I just wish I was one of those that actually got the head-start).

hanzac
May 18, 2013, 07:04:17 PM
Last weekend I tried to port the scrypt-jane code to OpenCL, but I only reached 60 kH/s on a 7950, and the hashes weren't even valid. So I gave up, because at that point I didn't think it was worthwhile to keep going.
I think the main problem is that Radeon GPUs are big-endian, and I might not be handling that correctly somewhere.
I used cgminer, but I know very little about its implementation, and I'm also new to OpenCL and its system calls (too many functions ... that's why I don't like using them).
The way I debug is to compile the CL kernel source (slightly modified) directly with gcc. That actually produces the same hash as the original code. But the endian problem still exists ...
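When a kernel is checked on the host with gcc like this, one way to take byte order out of the picture is to assemble and split 32-bit words byte by byte instead of casting buffers. The sketch below is plain host-side C, not code from scrypt-jane or cgminer, and the helper names are hypothetical. It only tells you about the host; the device's byte order is a separate question.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the machine running this code stores the least
   significant byte first, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1u;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Read a 32-bit little-endian word from 4 bytes, independent of the
   byte order of the machine doing the reading. */
uint32_t load32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit word as 4 little-endian bytes, again independent of
   the machine's own byte order. */
void store32_le(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xffu);
    p[1] = (unsigned char)((v >> 8) & 0xffu);
    p[2] = (unsigned char)((v >> 16) & 0xffu);
    p[3] = (unsigned char)((v >> 24) & 0xffu);
}
```

If the host build and the GPU build both go through helpers like these at the input and output boundaries, a mismatch in the final hash points at the kernel logic rather than at byte order.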

FlyLord
Newbie
May 18, 2013, 07:06:45 PM
? The Radeon HDs are little-endian. The easiest way to check is to query the device properties via OpenCL.
Also, you can make a coin GPU-resistant by requiring a lot of memory to get good speed. The thing about GPUs is that they're highly parallel but slow at serial work. So if you can modify scrypt to remove the TMTO (time-memory trade-off) and force all N precalculated entries to be kept, it would reduce their effectiveness. Go even further and make N *and* r increase over time.

hanzac
May 18, 2013, 07:11:05 PM
This confirms the criticism that N started out too low. My calculations show N/KB increases at/around the following dates:
5/21: 256, 32KB
5/30: 512, 64KB
6/2: 1024, 128KB
6/26: 2048, 256KB
7/8: 4096, 512KB
8/14: 8192, 1024KB
How do you feel your YAC GPU kernel performance will hold up off of those adjustments (in absolute terms and in relative terms to a high end CPU)?
I did check out Tacotime's MC2 paper, I like the approach he takes with varying the hash algorithm to achieve maximum ASIC/FPGA resistance. Unfortunately, building GPU resistance for any good length of time looks like a much harder (impossible?) task.
I don't think building in resistance to any particular technology is a good approach. The key points are the strength of the network, the security of the network, and a fair reward for keeping the network running efficiently.

WindMaster (OP)
May 18, 2013, 07:11:41 PM
So, I'll chime in, I wrote one too and get pretty abysmal rates (just got around to testing it -- 150 kH/s or so, under N=6) using a 7950. I know my implementation can be cleaned up but it would require understanding keccak better and since I'm new to OpenCL and basic optimization information is pretty hard to find, I'm not sure I want to put in the effort. I put in a lookup gap capability but don't need to use it.
Interestingly, your hash rate is right in the same ballpark as mine for N=6 on a 6950. Guess that makes 2 of us in the catastrophically unoptimized OpenCL implementation club. Probably won't surprise anyone that I'm staring at my OpenCL code right now trying to figure out what I did wrong, as I'd consider mtrlt to be a reliable source (as the first person to publicly write an OpenCL scrypt implementation, and just about everyone mining Litecoin on GPUs is using his OpenCL code).

hanzac
May 18, 2013, 07:16:32 PM
? The Radeon HDs are little-endian. The easiest way to check is to query the device properties via OpenCL.
Also, you can make a coin GPU-resistant by requiring a lot of memory to get good speed. The thing about GPUs is that they're highly parallel but slow at serial work. So if you can modify scrypt to remove the TMTO (time-memory trade-off) and force all N precalculated entries to be kept, it would reduce their effectiveness. Go even further and make N *and* r increase over time.
I searched on Google, and there are some pages saying "Radeon GPUs are big-endian" ...

FlyLord
Newbie
May 18, 2013, 07:20:53 PM
If I had to guess, he read the keccak whitepaper, which talks about optimizing the code; apparently some instructions are not necessary or something. I'm also not taking advantage of the vector types. The core ChaCha code, as an example, takes 16 uints and could be done using either a uint4[4] or uint8[2] or uint16 -- but, as I mentioned, I've got no clue if that would actually make it faster.
After reading the PBKDF2 specs and pseudocode, I also think there are too many steps in the scrypt-jane code -- maybe I'm just reading it wrong -- that are not necessary for what we're doing.
Also, to figure out endianness, query the "CL_DEVICE_ENDIAN_LITTLE" property through OpenCL; that way you'll know for sure.

WindMaster (OP)
May 18, 2013, 07:22:00 PM
Last edit: May 18, 2013, 07:55:41 PM by WindMaster
Are you aware Terracoin network is forked into like 4 chains right now?
Yeah. Their hard forks came about through incompatible changes to the client though, if I understand correctly. I kinda hold hard forks of the blockchain to be generally bad policy unless absolutely necessary, because it enforces changes upon the whole of a coin's population to change the parameters of the coin to something different than they understood the coin to be when they adopted the coin. For example, the Elacoin miners trying to hard fork their blockchain and change their client so their mining reward is even higher for late adopter miners than what everyone understood it to be when they adopted Elacoin and started mining it (effectively, jacking up their rate of inflation). And some mention of trying to get all the Elacoin mining pools onboard with the change to gain majority hashpower and try to force the change on everyone using Elacoin. That really rubbed me the wrong way how they were going about that.
I suggest leaving checkpoint age code but increasing check period from 10 to 60+ days. Given how many things can be added and improved, there could be even more than 1 upgrade per 2 months, mandatory or not.
Perhaps. Here's what's in there now:
    if (Checkpoints::IsSyncCheckpointTooOld(60 * 60 * 24 * 10) && !fTestNet && !IsInitialBlockDownload())
    {
        nPriority = 100;
        strStatusBar = "WARNING: Checkpoint is too old. Wait for block chain to download, or notify developers.";
    }
So, after 10 days it throws that particular warning. Changing it to 60 days could be workable but we definitely need to rewrite the warning, I'd think. As it is, it already leaves someone with no clue whether their client is even working anymore. Anyone have any thoughts on what it should actually say? Perhaps something like "WARNING: Your YACoin client is more than 60 days old and may not contain the most recent checkpoints. You may want to check if a more recent version of the YACoin client is available."

hanzac
May 18, 2013, 07:23:57 PM
So, I'll chime in, I wrote one too and get pretty abysmal rates (just got around to testing it -- 150 kH/s or so, under N=6) using a 7950. I know my implementation can be cleaned up but it would require understanding keccak better and since I'm new to OpenCL and basic optimization information is pretty hard to find, I'm not sure I want to put in the effort. I put in a lookup gap capability but don't need to use it.
Btw, does anyone know if using the vector types in OpenCL on AMD is faster than using the scalar types? I haven't been able to find any definite statements leaning one way or the other ...
I wonder if mtrlt wrote his own keccak implementation or went about porting the optimized version they released. I went about it by copying over the code from scrypt-jane and directly porting it to OpenCL.
Oh and as an fyi, I didn't do it to actually make a miner, I just wanted to see how much effort it would take and how fast it would be.
I also ported directly from the scrypt-jane source code. For ROTL64 & ROTL32, I use OpenCL's rotate(x,y) function. I also use some vector types. That's why I need to be much more careful about endian problems ... Maybe I should start with small changes.

hanzac
May 18, 2013, 07:27:13 PM
If I had to guess, he read the keccak whitepaper, which talks about optimizing the code; apparently some instructions are not necessary or something. I'm also not taking advantage of the vector types. The core ChaCha code, as an example, takes 16 uints and could be done using either a uint4[4] or uint8[2] or uint16 -- but, as I mentioned, I've got no clue if that would actually make it faster.
After reading the PBKDF2 specs and pseudocode, I also think there are too many steps in the scrypt-jane code -- maybe I'm just reading it wrong -- that are not necessary for what we're doing.
Also, to figure out endianness, query the "CL_DEVICE_ENDIAN_LITTLE" property through OpenCL; that way you'll know for sure.
;-) for chacha I ported to this:
    uint16 chacha_core(uint16 state)
    {
        uint rounds;
        uint16 x;
        uint t;

        x = state;

        for (rounds = 8; rounds > 0; rounds -= 2) {
            quarter( x.s0, x.s4, x.s8, x.sC);
            quarter( x.s1, x.s5, x.s9, x.sD);
            quarter( x.s2, x.s6, x.sA, x.sE);
            quarter( x.s3, x.s7, x.sB, x.sF);
            quarter( x.s0, x.s5, x.sA, x.sF);
            quarter( x.s1, x.s6, x.sB, x.sC);
            quarter( x.s2, x.s7, x.s8, x.sD);
            quarter( x.s3, x.s4, x.s9, x.sE);
        }

        state += x;
        return state;
    }
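The quarter macro used in that kernel isn't shown in the post. For comparison, here is the standard ChaCha quarter-round written as plain host-side C, with a scalar version of the same core loop. This is a sketch of the textbook operation, not hanzac's actual macro, which may differ. In an OpenCL kernel the rotations would typically use the built-in rotate() rather than the rotl32 helper assumed here.

```c
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits; n must be in 1..31. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32u - n));
}

/* Standard ChaCha quarter-round on four 32-bit words:
   add, xor, rotate by 16, 12, 8 and 7 bits. */
void chacha_quarter(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d = rotl32(*d ^ *a, 16);
    *c += *d; *b = rotl32(*b ^ *c, 12);
    *a += *b; *d = rotl32(*d ^ *a, 8);
    *c += *d; *b = rotl32(*b ^ *c, 7);
}

/* Scalar counterpart of the posted kernel: run `rounds` rounds (two per
   loop iteration, columns then diagonals) on a copy of the state, then
   add the result back into the state. The posted kernel uses rounds = 8. */
void chacha_core(uint32_t state[16], unsigned rounds)
{
    uint32_t x[16];
    memcpy(x, state, sizeof x);

    for (; rounds >= 2; rounds -= 2) {
        /* Column round */
        chacha_quarter(&x[0], &x[4], &x[8],  &x[12]);
        chacha_quarter(&x[1], &x[5], &x[9],  &x[13]);
        chacha_quarter(&x[2], &x[6], &x[10], &x[14]);
        chacha_quarter(&x[3], &x[7], &x[11], &x[15]);
        /* Diagonal round */
        chacha_quarter(&x[0], &x[5], &x[10], &x[15]);
        chacha_quarter(&x[1], &x[6], &x[11], &x[12]);
        chacha_quarter(&x[2], &x[7], &x[8],  &x[13]);
        chacha_quarter(&x[3], &x[4], &x[9],  &x[14]);
    }

    for (int i = 0; i < 16; i++)
        state[i] += x[i];
}
```

A host-side reference like this gives known-good intermediate values to compare against when the GPU kernel produces invalid hashes.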

mtrlt
Member
May 18, 2013, 07:47:41 PM
How do you feel your YAC GPU kernel performance will hold up off of those adjustments (in absolute terms and in relative terms to a high end CPU)?
I've not yet tested high N values extensively, however I know that N=512 is the first N where lookup_gap=2 is faster than lookup_gap=1.
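For readers unfamiliar with lookup_gap: it is a time-memory trade-off in which only every gap-th scratchpad entry is stored and the skipped entries are recomputed on demand. The toy C sketch below shows the idea on a one-word state. The toy_mix function is a hypothetical stand-in for the real block mix (it is not Salsa or ChaCha), and none of this is mtrlt's kernel code. The point is only that any gap gives the same result while storing about N/gap entries.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the scrypt block mix; NOT a real cipher core. */
static uint32_t toy_mix(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x * 2654435761u + 1u;
}

/* Number of scratchpad entries actually stored for a given N and gap. */
uint32_t lookup_slots(uint32_t n, uint32_t gap)
{
    return (n + gap - 1u) / gap;
}

/* ROMix-style loop on a one-word state with a lookup gap (gap >= 1).
   gap = 1 stores every entry; gap = 2 stores every second entry and
   recomputes the missing ones, and so on. Returns 0 if allocation fails. */
uint32_t romix_gap(uint32_t x, uint32_t n, uint32_t gap)
{
    uint32_t *v = malloc(lookup_slots(n, gap) * sizeof *v);
    if (v == NULL)
        return 0;

    /* Fill phase: walk the chain V[0..n-1], keeping every gap-th value. */
    for (uint32_t i = 0; i < n; i++) {
        if (i % gap == 0)
            v[i / gap] = x;
        x = toy_mix(x);
    }

    /* Lookup phase: fetch V[j]; if it was skipped, start from the nearest
       stored entry below it and redo the missing (j % gap) mix steps. */
    for (uint32_t i = 0; i < n; i++) {
        uint32_t j = x % n;
        uint32_t vj = v[j / gap];
        for (uint32_t k = 0; k < j % gap; k++)
            vj = toy_mix(vj);
        x = toy_mix(x ^ vj);
    }

    free(v);
    return x;
}
```

Each skipped lookup costs up to gap - 1 extra mix calls, so a larger gap only pays off once memory, rather than arithmetic, is the limiting factor. That is in line with the observation above that lookup_gap=2 starts winning at N=512.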

FlyLord
Newbie
May 18, 2013, 07:50:15 PM
mtrlt, want to give us a hint as to what you optimized?

mtrlt
Member
May 18, 2013, 07:59:10 PM
Right now, I'm hesitant to reveal details. I'd absolutely love it if you (FlyLord, WindMaster, hanzac, others?) could PM me your OpenCL code, but I don't know what I could give in return.

WindMaster (OP)
May 18, 2013, 08:02:26 PM
Right now, I'm hesitant to reveal details. I'd absolutely love it if you (FlyLord, WindMaster, hanzac, others?) could PM me your OpenCL code, but I don't know what I could give in return.
But for what purpose, if we all achieved way shittier hash rates than you did?
but I don't know what I could give in return.
Oh, I think I have a good idea what everyone will probably ask for in return, and it's not something you're likely to give. It's sorta like everyone that keeps PM'ing me asking for my Verilog implementation for FPGAs, thinking they're going to run it on off-the-shelf BTC mining FPGA boards, or not reading closely enough that it was an implementation for N=32 specifically. I'm saving that especially for the next altcoin that gets the bright idea to fork YACoin into yet another useless copy-pasta altcoin launch with difficulty set to 0.

mtrlt
Member
May 18, 2013, 08:12:18 PM
Right now, I'm hesitant to reveal details. I'd absolutely love it if you (FlyLord, WindMaster, hanzac, others?) could PM me your OpenCL code, but I don't know what I could give in return.
But for what purpose, if we all achieved way shittier hash rates than you did?
I just want to see how others think. So far, except for BTC, I've always been the first to make an open source GPU miner for a currency that has had a new hash function. (My list only contains Solidcoin 2.0 and Litecoin, though. Maybe I've missed some altcoins?) I've not seen the OpenCL development practices of anyone else. I'm just curious, that's all. You don't have to send me code if you don't want to.