Epsylon3
Legendary
Offline
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
|
|
December 16, 2014, 10:26:42 AM |
|
Me, I was not paid by the VTC team to do it... but somebody was not able to compile it on Linux, which is also why I imported it.
The next time you don't want me to use published code (published by VTC, not you), tell us before we commit it... not 2 days after. I was planning to release it today, but the default difficulty setting is still not clear; 128 and 256 seem OK on most pools I've tested.
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 16, 2014, 11:04:51 AM |
|
That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algos are already pretty fast. The main problem with uint2 is that it does not work at all with compute 3.0 (the logic should work but it seems there is a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti; that's what I call "almost anything". I implemented it now: BMW +32%, keccak +40%, skein +30%. x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.
Not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...
|
djm34 facebook page | BTC: 1NENYmxwZGHsKFmyjTc5WferTn5VTFb7Ze | Pledge for neoscrypt ccminer to that address: 16UoC4DmTz2pvhFvcfTQrzkPTrXkWijzXw
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
December 16, 2014, 11:20:24 AM |
|
That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algos are already pretty fast. The main problem with uint2 is that it does not work at all with compute 3.0 (the logic should work but it seems there is a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti; that's what I call "almost anything". I implemented it now: BMW +32%, keccak +40%, skein +30%. x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.
Not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...
Not sure. Will test this weekend on the 970 and 980. Note that with the last improvements, and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3 MHASH on stock clocks mining x11.
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 16, 2014, 12:01:28 PM |
|
Me, I was not paid by the VTC team to do it... but somebody was not able to compile it on Linux, which is also why I imported it.
The next time you don't want me to use published code (published by VTC, not you), tell us before we commit it... not 2 days after. I was planning to release it today, but the default difficulty setting is still not clear; 128 and 256 seem OK on most pools I've tested.
It isn't that I don't want you to publish my code, but I prefer to be the first (aside from the VTC team release...) to do it, for an obvious reason: my own publicity.
The way it happened (or my perception of the event): I sent the code to the VTC team for the test release/release candidate before going to bed, and we had some rounds of corrections before they decided to go to sleep. When I woke up the next morning, someone (you) had published his own version of my code, with no clear reference to the author, on another forum where I don't have an account. Sorry, but without enough coffee it looks pretty sneaky, and with more coffee it still looks sneaky. (And I am pretty sure you would have reacted the same way if someone had done that to you.)
Then I told you to wait (or asked jk_14 to ask you to wait, as I didn't have an account on litecointalk) because it wasn't finished, and as it was a test release it is way better if people report problems from only one release and don't start to use code we don't know... (I also told you to wait on the IRC channel, as I was planning to move the code to your interface.)
Then, later in the week, you sent your code (actually the same, while the RC had already incorporated new parts) to the VTC team with no clear intention, since they asked me what they should do with it. That time it just looked like you were trying to go behind my back on the job I was hired to do...
That's why I feel a bit pissed off about the whole situation, and I am pretty sure you would have felt the same in my situation. I don't think all this was intentional on your part, but it was kind of careless... anyway...
Regarding the difficulty, 128 or 256 should do it: just check the pool hashrate and use the one which gives the result closest to the real hashrate.
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 16, 2014, 12:22:36 PM |
|
That improves for sure keccak and skein, but in x11 context, you won't see almost anything as these algos are already pretty fast. The main problem with uint2 is that it does not work at all with compute 3.0 (the logic should work but it seems there is a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti; that's what I call "almost anything". I implemented it now: BMW +32%, keccak +40%, skein +30%. x11 is 70-100 KHASH faster on the 750ti. Blake512 is running slower with uint2.
Not bad, what are the numbers on the 9xx series? In principle it should be even better, however it might require some tuning...
Not sure. Will test this weekend on the 970 and 980. Note that with the last improvements, and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3 MHASH on stock clocks mining x11.
For the 970 and 980, if you use compute 5.2, it can be a bit tricky (register allocation is a bit tricky with uint2 and compute 5.2).
|
|
|
|
Epsylon3
Legendary
Offline
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
|
|
December 16, 2014, 12:27:25 PM |
|
Not sure. Will test this weekend on the 970 and 980. Note that with the last improvements, and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3 MHASH on stock clocks mining x11.
There we are :) Good job... I see 2950 with your current git on Linux and 2905 on mine. I think I will grab the skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow. We were talking about possible uint2 improvements some days ago with djm on IRC... nice you made it :p
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 16, 2014, 01:20:44 PM |
|
Not sure. Will test this weekend on the 970 and 980. Note that with the last improvements, and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3 MHASH on stock clocks mining x11.
There we are :) Good job... I see 2950 with your current git on Linux and 2905 on mine. I think I will grab the skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow. We were talking about possible uint2 improvements some days ago with djm on IRC... nice you made it :p
I did the uint2 shit on AMD - little help. Replacing one rotate in Blake2b helps, though.
I tried to use it for the whole routine, but for some reason I haven't been able to get it to work... (Actually, I'm not sure it will help a lot: AMD is a lot stronger than nvidia on 64-bit calculation, which is why it makes such a difference on nvidia.)
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
December 16, 2014, 01:38:41 PM |
|
I tried to use it for the whole routine, but for some reason I haven't been able to get it to work... (Actually, I'm not sure it will help a lot: AMD is a lot stronger than nvidia on 64-bit calculation, which is why it makes such a difference on nvidia.)
In BMW I added a couple of operators: - and SHL/SHR for the uint2 types. The funnel shift on nvidia only calculates 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code this means that a ROL64 is 4-5 instructions while a SHL64 is 1 instruction. When working in uint2 mode, a lot of the 32-bit -> 64-bit merge code is removed. This means that the ROL64 is down to 2 instructions. For BMW that caused the final routine to use fewer registers (no memory spills). But 33% faster was a surprise to me, since BMW has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).
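To make the "2 instructions instead of 1" remark concrete, here is a rough host-side C++ sketch (not the CUDA from the repository) of what shift and subtract operators on a pair of 32-bit words could look like. The names `U2`, `split`, `join`, `shl2`, `shr2` and `sub2` are hypothetical, and the sketch assumes `x` holds the low word and `y` the high word; the actual BMW operators are not shown in the thread and may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's uint2: x = low word, y = high word (assumed).
struct U2 { uint32_t x, y; };

inline U2 split(uint64_t v) { return U2{ static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) }; }
inline uint64_t join(U2 v) { return (static_cast<uint64_t>(v.y) << 32) | v.x; }

// 64-bit shift left on the pair, valid for 0 < n < 32:
// the high word also receives the bits that fall out of the low word.
inline U2 shl2(U2 v, int n) {
    return U2{ v.x << n, (v.y << n) | (v.x >> (32 - n)) };
}

// 64-bit logical shift right on the pair, valid for 0 < n < 32.
inline U2 shr2(U2 v, int n) {
    return U2{ (v.x >> n) | (v.y << (32 - n)), v.y >> n };
}

// 64-bit subtraction on the pair: subtract the low words, then propagate the borrow.
inline U2 sub2(U2 a, U2 b) {
    const uint32_t lo = a.x - b.x;
    const uint32_t borrow = (a.x < b.x) ? 1u : 0u;
    return U2{ lo, a.y - b.y - borrow };
}
```

Each operation touches both words, which is where the extra instruction per shift, add or subtract comes from; the saving described above comes from not having to merge the halves back into one 64-bit register around every rotate.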
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 16, 2014, 01:50:44 PM |
|
I tried to use it for the whole routine, but for some reason I haven't been able to get it to work... (Actually, I'm not sure it will help a lot: AMD is a lot stronger than nvidia on 64-bit calculation, which is why it makes such a difference on nvidia.)
In BMW I added a couple of operators: - and SHL/SHR for the uint2 types. The funnel shift on nvidia only calculates 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code this means that a ROL64 is 4-5 instructions while a SHL64 is 1 instruction. When working in uint2 mode, a lot of the 32-bit -> 64-bit merge code is removed. This means that the ROL64 is down to 2 instructions. For BMW that caused the final routine to use fewer registers (no memory spills). But 33% faster was a surprise to me, since BMW has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).
Yes, but the number of uint32 operations per cycle is larger than twice the number of uint64 operations (I think... something like that; a table on that was posted recently in the thread).
|
|
|
|
ZeroFossilFuel
|
|
December 17, 2014, 03:37:37 AM |
|
Christian, A while back I posted a question regarding performance gains to ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04) so I compiled v1.2 with Cuda 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch! Z
|
|
|
|
scryptr
Legendary
Offline
Activity: 1797
Merit: 1028
|
|
December 17, 2014, 06:18:03 AM |
|
NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
December 17, 2014, 06:56:56 AM |
|
Not sure. Will test this weekend on the 970 and 980. Note that with the last improvements, and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3 MHASH on stock clocks mining x11.
There we are :) Good job... I see 2950 with your current git on Linux and 2905 on mine. I think I will grab the skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow. We were talking about possible uint2 improvements some days ago with djm on IRC... nice you made it :p
Check the head on github. I committed the 50 KHASH echo improvement. You should now get around 3 MHASH on x11 on the stock-clocked Windforce Black 750ti.
|
|
|
|
cbuchner1 (OP)
|
|
December 17, 2014, 09:35:29 AM Last edit: December 17, 2014, 09:47:56 AM by cbuchner1 |
|
Christian, A while back I posted a question regarding performance gains to ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04) so I compiled v1.2 with Cuda 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch! Z
I think the third party code forks around here have even more hash rate improvements. I've reduced my involvement in the mining scene a lot since last summer. The last thing I did was a GPU miner for CoinShield's CPU channel (together with ChrisH), but that was never made public. Also, it looks like CoinShield was quite a dud (i.e. the coin's value is not taking off and the feature rollout of the wallet is really slow).
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 17, 2014, 10:31:36 AM |
|
NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
Current status is: "not yet". Actually, I was thinking of implementing neoscrypt before I got hired to work on lyra. But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help). If there was a bounty for nvidia, that would definitely help (greedy me...). But for the moment I am still looking into improvements to lyra... once I am done with that, maybe...
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
December 17, 2014, 10:45:41 AM |
|
NEOSCRYPT- DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
Current status is: "not yet". Actually, I was thinking of implementing neoscrypt before I got hired to work on lyra. But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help). If there was a bounty for nvidia, that would definitely help (greedy me...). But for the moment I am still looking into improvements to lyra... once I am done with that, maybe...
The groestl(256) in the lyra implementation is different. Is it faster than the groestl in x11 (512)? I guess it needs some more work to support 512 rounds?
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 17, 2014, 10:53:47 AM Last edit: December 17, 2014, 11:04:30 AM by djm34 |
|
NEOSCRYPT- DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt. Would you be able to give a status report on how developed this code is? Is it possible to mine neoscrypt on GPU with your ccminer? --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
Current status is: "not yet". Actually, I was thinking of implementing neoscrypt before I got hired to work on lyra. But I must admit I kinda lack motivation on this project... (maybe trying to beat Wolf0 will help). If there was a bounty for nvidia, that would definitely help (greedy me...). But for the moment I am still looking into improvements to lyra... once I am done with that, maybe...
The groestl(256) in the lyra implementation is different. Is it faster than the groestl in x11 (512)? I guess it needs some more work to support 512 rounds?
It is based on post killer groestl code... the only reason it is faster than groestl 512 is that it does less mixing (by definition, small_core versus big_core in the sph definition). In principle a killer groestl256 (bit slicing) should be faster... but it isn't obvious to write, even from the existing code... I also tried a 64-bit implementation with uint2, but it wasn't faster either... I think I should get uint2 working for compute 3.0; there was a bug (actually it was just wrong) in the uint2 rotation without funnelshift...
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
December 17, 2014, 11:03:10 AM |
|
Yes, this code is wrong:
__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
    uint2 result;
    result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
    result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
    return result;
}
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
December 17, 2014, 11:05:53 AM |
|
Yes, this code is wrong:
__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
    uint2 result;
    result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
    result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
    return result;
}
This one should work:
__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
    // Rotate left by n, treating v.y:v.x as one 64-bit value (x = low word, as the shifts imply).
    // Note: n == 0 or n == 32 shifts a 32-bit word by 32 in the first branch, which C++ leaves undefined.
    uint2 result;
    if (n <= 32) {
        result.y = ((v.y << (n)) | (v.x >> (32 - n)));
        result.x = ((v.x << (n)) | (v.y >> (32 - n)));
    } else {
        result.y = ((v.x << (n - 32)) | (v.y >> (64 - n)));
        result.x = ((v.y << (n - 32)) | (v.x >> (64 - n)));
    }
    return result;
}
There is a way to do it without a conditional statement, but I haven't looked into it yet.
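The branch-free variant is not shown in the thread. One possible way to do it, as a host-side C++ sketch rather than CUDA: swap the two words with a mask when bit 5 of n is set (a rotate by 32 is just a word swap), then rotate by the remaining n & 31 using a two-step shift so that no 32-bit word is ever shifted by 32. `U2`, `split`, `join`, `rol2_branchless` and `matches_reference` are hypothetical names, and `x` is assumed to be the low word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's uint2: x = low word, y = high word (assumed).
struct U2 { uint32_t x, y; };

inline U2 split(uint64_t v) { return U2{ static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) }; }
inline uint64_t join(U2 v) { return (static_cast<uint64_t>(v.y) << 32) | v.x; }

// Rotate the 64-bit value y:x left by n (0 <= n < 64) without a branch.
inline U2 rol2_branchless(U2 v, int n) {
    // All ones when n >= 32, zero otherwise.
    const uint32_t swap = 0u - ((static_cast<uint32_t>(n) >> 5) & 1u);
    // Select the words, swapped if n >= 32.
    const uint32_t lo = (v.x & ~swap) | (v.y & swap);
    const uint32_t hi = (v.y & ~swap) | (v.x & swap);
    const int s = n & 31;
    // (w >> 1) >> (31 - s) equals w >> (32 - s) for s >= 1 and gives 0 for s == 0,
    // so every individual shift count stays in the range 0..31.
    return U2{ (lo << s) | ((hi >> 1) >> (31 - s)),
               (hi << s) | ((lo >> 1) >> (31 - s)) };
}

// Compare against a plain 64-bit rotate for every rotation count.
inline bool matches_reference(uint64_t v) {
    for (int n = 0; n < 64; ++n) {
        const uint64_t expected = (v << n) | (v >> ((64 - n) & 63));
        if (join(rol2_branchless(split(v), n)) != expected) return false;
    }
    return true;
}
```

Whether this is any faster on a GPU than the version with the `if` is not something the thread settles; it only shows that the condition can be removed while also covering n == 0 and n == 32.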
|
|
|
|
Epsylon3
Legendary
Offline
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
|
|
December 17, 2014, 11:10:48 AM |
|
There will be cleanup to do in the next version, so... nice for the final binary size.
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
December 17, 2014, 11:21:49 AM |
|
This one should work:
__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
    // Rotate left by n, treating v.y:v.x as one 64-bit value (x = low word, as the shifts imply).
    // Note: n == 0 or n == 32 shifts a 32-bit word by 32 in the first branch, which C++ leaves undefined.
    uint2 result;
    if (n <= 32) {
        result.y = ((v.y << (n)) | (v.x >> (32 - n)));
        result.x = ((v.x << (n)) | (v.y >> (32 - n)));
    } else {
        result.y = ((v.x << (n - 32)) | (v.y >> (64 - n)));
        result.x = ((v.y << (n - 32)) | (v.x >> (64 - n)));
    }
    return result;
}
There is a way to do it without a conditional statement, but I haven't looked into it yet.
This one looks better. The conditional statement will be removed by the compiler when n is a constant. I think the uint2 implementations will be faster on compute 3.0 devices as well: fewer register spills. Will test on my GTX 650 later today.
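One way to make the "n is a constant" case explicit, rather than relying on the optimizer, is to pass the rotation count as a template parameter so the word swap and shift amounts are fixed at compile time. This is only a host-side C++ sketch under that assumption; `U2`, `split`, `join` and `rol2_const` are hypothetical names and do not come from the ccminer sources.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's uint2: x = low word, y = high word (assumed).
struct U2 { uint32_t x, y; };

inline U2 split(uint64_t v) { return U2{ static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) }; }
inline uint64_t join(U2 v) { return (static_cast<uint64_t>(v.y) << 32) | v.x; }

// Rotate the 64-bit value y:x left by the compile-time constant N.
template <int N>
inline U2 rol2_const(U2 v) {
    static_assert(N >= 0 && N < 64, "rotation count must be in [0, 63]");
    constexpr bool swap_words = (N & 32) != 0;  // a rotate by 32 is a word swap
    constexpr int s = N & 31;                   // remaining rotate within 32 bits
    constexpr int r = (32 - s) & 31;            // complementary shift, kept in 0..31
    const U2 w = swap_words ? U2{ v.y, v.x } : v;
    if (s == 0) return w;                       // known at compile time for each N
    return U2{ (w.x << s) | (w.y >> r), (w.y << s) | (w.x >> r) };
}
```

For example, `rol2_const<32>(v)` reduces to a swap of the two words, and `rol2_const<8>(v)` to two shift-and-or pairs. Hash rounds generally use fixed rotation counts, so this form fits them naturally.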
|
|
|
|
|