Bitcoin Forum
Author Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]  (Read 3426878 times)
Epsylon3
Legendary
*
Offline Offline

Activity: 1484
Merit: 1082


ccminer/cpuminer developer


View Profile WWW
December 16, 2014, 10:26:42 AM
 #22021

Me, I was not paid by the VTC team to do it... but somebody was not able to compile it on Linux, which is also why I imported it.

Next time you don't want me to use published code (published by VTC, not you), tell us before we commit it... not 2 days after. I was planning to release it today, but the default difficulty setting is still not clear... 128 and 256 seem OK on most pools I've tested.

BTC: 1FhDPLPpw18X4srecguG3MxJYe4a1JsZnd - My Projects: ccminer - cpuminer-multi - yiimp - Forum threads : ccminer - cpuminer-multi - yiimp
djm34
Legendary
*
Offline Offline

Activity: 1400
Merit: 1050


View Profile WWW
December 16, 2014, 11:04:51 AM
 #22022

That improves keccak and skein for sure, but in the x11 context you won't see almost anything, as these algos are already pretty fast. The main problem with uint2 is that it does not work at all with compute 3.0 (the logic should work, but there seems to be a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti.
that's what I call "almost anything"

I implemented it now: BMW +32%, keccak +40%, skein +30%; x11 is 70-100 KHASH faster on the 750ti. Blake512 runs slower with uint2.


Not bad, what are the numbers on the 9xx series? In principle it should be even better, though it might require some tuning...

djm34 facebook page
BTC: 1NENYmxwZGHsKFmyjTc5WferTn5VTFb7Ze
Pledge for neoscrypt ccminer to that address: 16UoC4DmTz2pvhFvcfTQrzkPTrXkWijzXw
sp_
Legendary
*
Offline Offline

Activity: 2912
Merit: 1087

Team Black developer


View Profile
December 16, 2014, 11:20:24 AM
 #22023

That improves keccak and skein for sure, but in the x11 context you won't see almost anything, as these algos are already pretty fast. The main problem with uint2 is that it does not work at all with compute 3.0 (the logic should work, but there seems to be a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti.
that's what I call "almost anything"
I implemented it now: BMW +32%, keccak +40%, skein +30%; x11 is 70-100 KHASH faster on the 750ti. Blake512 runs slower with uint2.
Not bad, what are the numbers on the 9xx series? In principle it should be even better, though it might require some tuning...

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3MHASH on stock clocks mining x11. :)

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
djm34
December 16, 2014, 12:01:28 PM
 #22024

Me, I was not paid by the VTC team to do it... but somebody was not able to compile it on Linux, which is also why I imported it.

Next time you don't want me to use published code (published by VTC, not you), tell us before we commit it... not 2 days after. I was planning to release it today, but the default difficulty setting is still not clear... 128 and 256 seem OK on most pools I've tested.

It isn't that I don't want you to publish my code, but I prefer to be the first (aside from the VTC team release...) to do it, for an obvious reason (my own publicity, obviously... I guess this one is really obvious ;D)

The way it happened (or my perception of the events): I sent the code to the VTC team for the test release/release candidate before going to bed, and we had a few rounds of corrections before they decided to go to sleep. When I woke up the next morning, someone (you) had published his own version of my code, with no clear reference to the author, on another forum where I don't have an account. Sorry, but without enough coffee it looks pretty sneaky, and with more coffee it still looks sneaky (and I am pretty sure you would have reacted the same way if someone had done that to you).

Then I told you to wait (or rather asked jk_14 to ask you to wait, as I didn't have an account on litecointalk) because it wasn't finished, and since it was a test release it is much better if people report problems from only one release and don't start using code we don't know... (I also told you to wait on the IRC channel, as I was planning to move the code to your interface.)

Then later in the week you sent your code (actually the same code, while the RC had already incorporated new parts) to the VTC team with no clear intention, since they asked me what they should do with it. At that point it just looked like you were trying to go behind my back on the job I was hired to do... ;D

That's why I feel a bit pissed off about the whole situation, and I am pretty sure you would have felt the same in my situation.
I don't think all this was intentional on your part, but it was kind of careless...

Anyway...

Regarding the difficulty, 128 or 256 should do it; just check the pool hashrate and use the one which gives the result closest to the real hashrate.



djm34
December 16, 2014, 12:22:36 PM
 #22025

That improves keccak and skein for sure, but in the x11 context you won't see almost anything, as these algos are already pretty fast. The main problem with uint2 is that it does not work at all with compute 3.0 (the logic should work, but there seems to be a problem with register allocation...)
The uint2 keccak made x11 20-25 KHASH faster on a 750ti.
that's what I call "almost anything"
I implemented it now: BMW +32%, keccak +40%, skein +30%; x11 is 70-100 KHASH faster on the 750ti. Blake512 runs slower with uint2.
Not bad, what are the numbers on the 9xx series? In principle it should be even better, though it might require some tuning...

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3MHASH on stock clocks mining x11. :)

For the 970 and 980, if you use compute 5.2 it can be a bit tricky (register allocation is a bit tricky with uint2 and compute 5.2).

Epsylon3
December 16, 2014, 12:27:25 PM
 #22026

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3MHASH on stock clocks mining x11. :)

There we are! Good job... I see 2950 with your current git on Linux and 2905 on mine. I think I will grab the skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow.

We were talking about possible uint2 improvements some days ago with djm on IRC... nice that you made it :p

djm34
December 16, 2014, 01:20:44 PM
 #22027

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3MHASH on stock clocks mining x11. :)

There we are! Good job... I see 2950 with your current git on Linux and 2905 on mine. I think I will grab the skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow.

We were talking about possible uint2 improvements some days ago with djm on IRC... nice that you made it :p

I did the uint2 shit on AMD - little help. Replacing one rotate in Blake2b helps, though.

I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(Actually, I'm not sure it will help a lot; AMD is a lot stronger than NVIDIA at 64-bit calculation, which is why it makes such a difference on NVIDIA.)

sp_
December 16, 2014, 01:38:41 PM
 #22028

I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(Actually, I'm not sure it will help a lot; AMD is a lot stronger than NVIDIA at 64-bit calculation, which is why it makes such a difference on NVIDIA.)

In BMW I added a couple of operators: - and SHL/SHR for the uint2 types. The funnel shift on NVIDIA only computes 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code this means that a ROL64 takes 4-5 instructions, while a SHL64 is 1 instruction.
When working in uint2 mode, a lot of the 32-bit -> 64-bit merge code is removed. This means that the ROL64 is down to 2 instructions. For BMW that caused the final routine to use fewer registers (no memory spills).
But 33% faster was a surprise to me, since BMW has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).
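
To make the instruction count above concrete, here is a minimal host-side C++ sketch of a 64-bit rotate built from two 32-bit funnel shifts. It is not taken from anyone's repository. `funnelshift_l` is a stand-in that emulates the documented behaviour of CUDA's `__funnelshift_l` intrinsic, the `uint2` struct stands in for CUDA's built-in type, and `rol2_funnel`, `pack`, `unpack` and `check_rol2_funnel` are names made up for the example. It assumes `x` is the low word and `y` the high word.

```cpp
#include <cassert>
#include <cstdint>

// Host stand-in for CUDA's built-in uint2 (assumption: x = low word, y = high word).
struct uint2 { uint32_t x; uint32_t y; };

// Emulates CUDA's __funnelshift_l(lo, hi, shift): shift the 64-bit value hi:lo
// left by (shift & 31) bits and return the upper 32 bits.
inline uint32_t funnelshift_l(uint32_t lo, uint32_t hi, uint32_t shift) {
    const uint64_t wide = (static_cast<uint64_t>(hi) << 32) | lo;
    return static_cast<uint32_t>((wide << (shift & 31u)) >> 32);
}

// Hypothetical helpers to move between a 64-bit integer and its two halves.
inline uint2 unpack(uint64_t v) {
    return uint2{ static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) };
}
inline uint64_t pack(uint2 v) {
    return (static_cast<uint64_t>(v.y) << 32) | v.x;
}

// 64-bit rotate left on the two halves: one funnel shift per output word.
// For n >= 32 the halves swap roles; the funnel shift itself only uses n & 31.
inline uint2 rol2_funnel(uint2 v, uint32_t n) {
    n &= 63u;
    uint2 r;
    if (n >= 32u) {
        r.x = funnelshift_l(v.x, v.y, n);
        r.y = funnelshift_l(v.y, v.x, n);
    } else {
        r.x = funnelshift_l(v.y, v.x, n);
        r.y = funnelshift_l(v.x, v.y, n);
    }
    return r;
}

// Small self-check against hand-computed rotations.
inline void check_rol2_funnel() {
    const uint2 v = unpack(0x0123456789ABCDEFull);
    assert(pack(rol2_funnel(v, 8)) == 0x23456789ABCDEF01ull);
    assert(pack(rol2_funnel(v, 40)) == 0xABCDEF0123456789ull);
}
```

On the device each emulated call would be a single `__funnelshift_l`, which is presumably where the "2 instructions" figure comes from, and because the result stays split in two 32-bit words there is nothing left to merge afterwards.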

djm34
December 16, 2014, 01:50:44 PM
 #22029

I tried to use it for the whole routine, but for some reason I haven't been able to get it to work...
(Actually, I'm not sure it will help a lot; AMD is a lot stronger than NVIDIA at 64-bit calculation, which is why it makes such a difference on NVIDIA.)

In BMW I added a couple of operators: - and SHL/SHR for the uint2 types. The funnel shift on NVIDIA only computes 32 bits at a time, so you need 2 instructions to do a full rotate, and then you need code that merges the upper/lower registers. In code this means that a ROL64 takes 4-5 instructions, while a SHL64 is 1 instruction.
When working in uint2 mode, a lot of the 32-bit -> 64-bit merge code is removed. This means that the ROL64 is down to 2 instructions. For BMW that caused the final routine to use fewer registers (no memory spills).
But 33% faster was a surprise to me, since BMW has a lot of SHL/SHR, + and -, which I suspected would run slower on uint2 (2 instructions instead of 1).

Yes, but the number of uint32 operations per cycle is more than twice the number of uint64 operations (I think... something like that; a table on this was posted recently in the thread).

ZeroFossilFuel
Sr. Member
****
Offline Offline

Activity: 346
Merit: 250



View Profile
December 17, 2014, 03:37:37 AM
 #22030

Christian,

A while back I posted a question regarding performance gains in ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04), so I compiled v1.2 with CUDA 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. A 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch! ;D

Z
scryptr
Legendary
*
Offline Offline

Activity: 1796
Merit: 1028



View Profile WWW
December 17, 2014, 06:18:03 AM
 #22031

NEOSCRYPT-

DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr

TIPS:  BTC - 1Fs4uZ6a9ABYBTaHGUfqcwCQmeBRxkKRQT    DASH - XrK81tW31SLsVvZ2WX9VhTjpT6GXJPLdbQ
          SCRYPTR'S NOTEBOOK: https://bitcointalk.org/index.php?topic=5035515.msg46035530#msg46035530
          GITHUB: "github.com/scryptr"  MERIT is appreciated, also.  Thanks!
sp_
December 17, 2014, 06:56:56 AM
 #22032

Not sure. Will test this weekend on the 970 and 980.
Note that with the latest improvements and my last echo change (which is not checked in yet), a Windforce 750ti Black Edition should do above 3MHASH on stock clocks mining x11. :)
There we are! Good job... I see 2950 with your current git on Linux and 2905 on mine. I think I will grab the skein one too for the 1.5.1 release... but all your merges with KlausT make your work really hard to follow.
We were talking about possible uint2 improvements some days ago with djm on IRC... nice that you made it :p

Check the head on GitHub. I committed the 50 KHASH echo improvement. You should now get around 3MHASH@x11 on the stock-clocked Windforce Black 750ti.

cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
December 17, 2014, 09:35:29 AM
Last edit: December 17, 2014, 09:47:56 AM by cbuchner1
 #22033

Christian,

A while back I posted a question regarding performance gains in ccminer versions newer than the v1.0 that I was using. Recently I had to reload my computer from scratch (Xubuntu 14.04), so I compiled v1.2 with CUDA 6.5, driver 340.29. Whereas with two GT640 (GDDR5) mining BTQ I was getting ~1.6Mh/s combined, I'm now getting ~2.1Mh/s. A 31% boost for zero added hardware. There will be a little something from me in your stocking soon. Thanks a bunch! ;D

Z


I think the third party code forks around here have even more hash rate improvements.

I've reduced my involvement in the mining scene a lot since last summer.

The last thing I did was a GPU miner for CoinShield's CPU channel (together with ChrisH), but that was never made public. Also it looks like CoinShield was quite a dud (i.e. the coin's value is not taking off and the feature rollout of the wallet is really slow)
djm34
December 17, 2014, 10:31:36 AM
 #22034

NEOSCRYPT-

DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr

I wonder how that will do - I just cleared 610kh/s on 290X.

Current status is: "not yet".
Actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kind of lack motivation for this project... (maybe trying to beat Wolf0 will help ;D)
If there were a bounty for NVIDIA that would definitely help (greedy me...) ;D
But for the moment I am still looking into improvements to lyra... once I am done with that, maybe...

sp_
December 17, 2014, 10:45:41 AM
 #22035

NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
Current status is: "not yet".
Actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kind of lack motivation for this project... (maybe trying to beat Wolf0 will help ;D)
If there were a bounty for NVIDIA that would definitely help (greedy me...) ;D
But for the moment I am still looking into improvements to lyra... once I am done with that, maybe...

The groestl (256) in the lyra implementation is different. Is it faster than the groestl (512) in x11? I guess it needs some more work to support 512 bits?

djm34
December 17, 2014, 10:53:47 AM
Last edit: December 17, 2014, 11:04:30 AM by djm34
 #22036

NEOSCRYPT-
DJM34, I now notice that your version of ccminer has blocks of code labelled Neoscrypt.  Would you be able to give a status report on how developed this code is?  Is it possible to mine neoscrypt on GPU with your ccminer?       --scryptr
I wonder how that will do - I just cleared 610kh/s on 290X.
Current status is: "not yet".
Actually, I was thinking of implementing neoscrypt before I got hired to work on lyra.
But I must admit I kind of lack motivation for this project... (maybe trying to beat Wolf0 will help ;D)
If there were a bounty for NVIDIA that would definitely help (greedy me...) ;D
But for the moment I am still looking into improvements to lyra... once I am done with that, maybe...

The groestl (256) in the lyra implementation is different. Is it faster than the groestl (512) in x11? I guess it needs some more work to support 512 bits?

It is based on post killer groestl code... The only reason it is faster than groestl 512 is that it does less mixing (by definition, small_core versus big_core in the sph definition). In principle a killer groestl256 (bit slicing) should be faster, but it isn't obvious to write, even starting from the existing code...

I also tried a 64-bit implementation with uint2, but it wasn't faster either...

I think I should get uint2 working for compute 3.0; there was a bug (actually it was just wrong ;D) in the uint2 rotation without funnelshift...

sp_
December 17, 2014, 11:03:10 AM
 #22037

yes, this code is wrong.

__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   uint2 result;
   result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
   result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
   return result;
}

djm34
December 17, 2014, 11:05:53 AM
 #22038

yes, this code is wrong.

__inline__ __device__ uint2 ROL2(const uint2 v, const int n)
{
   uint2 result;
   result.x = (((v.x) << (n)) | ((v.x) >> (64 - (n))));
   result.y = (((v.y) << (n)) | ((v.y) >> (64 - (n))));
   return result;
}

This one should work:
__inline__ __device__ uint2 ROL2(const uint2 v, const int n) {
   // v.x holds the low 32 bits, v.y the high 32 bits
   // note: n == 0 and n == 32 shift a 32-bit word by 32 bits, which C/C++ leaves undefined
   uint2 result;
   if (n <= 32) {
      result.y = ((v.y << (n)) | (v.x >> (32 - n)));
      result.x = ((v.x << (n)) | (v.y >> (32 - n)));
   }
   else {
      // n > 32: same as swapping the halves, then rotating by n - 32
      result.y = ((v.x << (n - 32)) | (v.y >> (64 - n)));
      result.x = ((v.y << (n - 32)) | (v.x >> (64 - n)));
   }
   return result;
}

There is a way to do it without a conditional statement, but I haven't looked into it yet.
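
One possible branch-free variant, as a hedged host-side sketch (not necessarily the trick meant above): pick the two halves with a mask derived from bit 5 of `n`, then shift by `n & 31`. The `uint2` struct is a stand-in for CUDA's built-in type, and `rol2_branchless`, `rol64_ref`, `matches_reference`, `pack` and `unpack` are hypothetical names. The right shift is split into two steps so that no 32-bit word is ever shifted by 32 bits, which plain C++ leaves undefined.

```cpp
#include <cassert>
#include <cstdint>

// Host stand-in for CUDA's built-in uint2 (assumption: x = low word, y = high word).
struct uint2 { uint32_t x; uint32_t y; };

inline uint2 unpack(uint64_t v) {
    return uint2{ static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) };
}
inline uint64_t pack(uint2 v) {
    return (static_cast<uint64_t>(v.y) << 32) | v.x;
}

// Reference rotate on a native 64-bit integer, used only for checking.
inline uint64_t rol64_ref(uint64_t v, unsigned n) {
    n &= 63u;
    return n == 0u ? v : (v << n) | (v >> (64u - n));
}

// Rotate left without any branch on n.
inline uint2 rol2_branchless(uint2 v, unsigned n) {
    const uint32_t m = n & 31u;                  // shift inside one 32-bit word
    const uint32_t swap = 0u - ((n >> 5) & 1u);  // all ones when bit 5 of n is set
    const uint32_t lo = (v.x & ~swap) | (v.y & swap);
    const uint32_t hi = (v.y & ~swap) | (v.x & swap);
    uint2 r;
    // (w >> 1) >> (31 - m) equals w >> (32 - m), and is 0 when m == 0.
    r.x = (lo << m) | ((hi >> 1) >> (31u - m));
    r.y = (hi << m) | ((lo >> 1) >> (31u - m));
    return r;
}

// Compare against the reference for every rotate count 0..63.
inline bool matches_reference(uint64_t value) {
    for (unsigned n = 0; n < 64u; ++n) {
        if (pack(rol2_branchless(unpack(value), n)) != rol64_ref(value, n)) {
            return false;
        }
    }
    return true;
}

inline void check_rol2_branchless() {
    assert(matches_reference(0x0123456789ABCDEFull));
}
```

Whether this beats the conditional version on real hardware is untested here; with a constant `n` both may well compile to the same thing.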

Epsylon3
December 17, 2014, 11:10:48 AM
 #22039

There will be some cleanup to do in the next version, so... nice for the final binary size.

sp_
December 17, 2014, 11:21:49 AM
 #22040

This one should work:
__inline__ __device__ uint2 ROL2(const uint2 v, const int n) {
   // v.x holds the low 32 bits, v.y the high 32 bits
   // note: n == 0 and n == 32 shift a 32-bit word by 32 bits, which C/C++ leaves undefined
   uint2 result;
   if (n <= 32) {
      result.y = ((v.y << (n)) | (v.x >> (32 - n)));
      result.x = ((v.x << (n)) | (v.y >> (32 - n)));
   }
   else {
      // n > 32: same as swapping the halves, then rotating by n - 32
      result.y = ((v.x << (n - 32)) | (v.y >> (64 - n)));
      result.x = ((v.y << (n - 32)) | (v.x >> (64 - n)));
   }
   return result;
}
There is a way to do it without a conditional statement, but I haven't looked into it yet.

This one looks better.

The conditional statement will be removed by the compiler when n is a constant.
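
A hedged host-side illustration of that point: if the rotate is written as a `constexpr` function, a rotate with a literal count can be evaluated entirely at compile time, so the branch on `n` cannot survive into the generated code for that call. This is plain C++14, not CUDA. The `uint2` struct stands in for the built-in type, `rol2_cond`, `pack`, `unpack` and `check_runtime_count` are hypothetical names, and `n == 0` and `n == 32` are handled separately (an addition for this sketch) to keep every shift count below 32. What nvcc actually emits for the device function would still need to be checked in the PTX or SASS.

```cpp
#include <cassert>
#include <cstdint>

// Host stand-in for CUDA's built-in uint2 (assumption: x = low word, y = high word).
struct uint2 { uint32_t x; uint32_t y; };

constexpr uint2 unpack(uint64_t v) {
    return uint2{ static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32) };
}
constexpr uint64_t pack(uint2 v) {
    return (static_cast<uint64_t>(v.y) << 32) | v.x;
}

// Conditional rotate, same structure as the version quoted above,
// plus explicit handling of n == 0 and n == 32.
constexpr uint2 rol2_cond(uint2 v, unsigned n) {
    n &= 63u;
    if (n == 0u)  return v;
    if (n == 32u) return uint2{ v.y, v.x };
    if (n < 32u) {
        return uint2{ (v.x << n) | (v.y >> (32u - n)),
                      (v.y << n) | (v.x >> (32u - n)) };
    }
    return uint2{ (v.y << (n - 32u)) | (v.x >> (64u - n)),
                  (v.x << (n - 32u)) | (v.y >> (64u - n)) };
}

// With literal counts the whole call, branch included, is folded by the compiler.
static_assert(pack(rol2_cond(unpack(0x0123456789ABCDEFull), 8)) == 0x23456789ABCDEF01ull,
              "rotate by 8 is evaluated at compile time");
static_assert(pack(rol2_cond(unpack(0x0123456789ABCDEFull), 40)) == 0xABCDEF0123456789ull,
              "rotate by 40 is evaluated at compile time");

// The same function still works with a count only known at run time:
// rotating the value 1 left by n must give 1 << n.
inline void check_runtime_count(unsigned n) {
    assert(pack(rol2_cond(unpack(1ull), n)) == (1ull << (n & 63u)));
}
```

The `static_assert` lines only compile because the calls are constant expressions, which is the host-side analogue of the compiler dropping the condition when `n` is a constant.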

I think the uint2 implementations will be faster on compute 3.0 devices as well. Fewer register spills. Will test on my GTX 650 later today.
