ilovecudacompute
Member
Offline
Activity: 112
Merit: 10
|
|
July 23, 2014, 10:24:34 PM |
|
I'm still working on stabilizing the rest of my optimizations, so I'm not creating binaries yet.
Thanks a ton for your efforts, KEEP US UPDATED! EDIT: tsiv, I sent you a tiny BTC donation :PPP
|
|
|
|
CodyF86
|
|
July 23, 2014, 10:53:40 PM |
|
Not sure if this is what you meant, djm, but you can add the files you want git to ignore to the .gitignore file in your repo (if you have one; it's pretty easy to set up) and you won't have to worry about that.
|
|
|
|
cayars
|
|
July 24, 2014, 12:48:58 AM |
|
Hey guys, just wanted to give you an update.
I was compiling a new version of nvMiner with djm34's X17 compiled in. He changed his GitHub about 3 hours into the compile, so I killed the compile, pulled down his latest and started compiling again. Not sure what went wrong, but my VS project file got messed up. After 6 hours of compiling with CUDA 5.5 I killed it. I reverted to my earlier version and I'm starting over to keep things clean on my side.
If this was a profitable algo I could have done a quick compile with just that algo to get it out the door for you guys to use, but since there really is no reason to mine this coin (maybe a clone in the future) I see no reason to do it "wrong". So I'm going to take my time with it and get it right. Right now PPL/X17 isn't worth mining, so there is no reason to push a release other than to have an nvidia miner that supports one more algo. I'm not into "bragging rights on number of algos supported" if it doesn't make sense.
I'm also going to benchmark the stuff sp_ has published in the last 5 to 10 pages. Plus, if TSIV doesn't release source code for his recent mods to CryptoNight, I'm going to benchmark Wolf's changes one by one and include whatever I find benefits us.
So long story short, I'm going to delay the next release until I'm ready and have done some testing. The next nvMiner will have:
1) x17
2) any speed-up proposed by sp_ or Wolf if it pans out for either Kepler or Maxwell
3) TSIV's changes, if he does a source release (should be 24 hours or less)
Also, since CUDA 6.5 is right around the corner, using 5.5 will basically mean being 2 versions behind. There comes a point when it's not worth supporting older software and I think we are getting there. The next nvMiner WILL SUPPORT 5.5 but I don't know about future releases. CUDA 6.0 (3.0/3.5/5.0) compiles a lot faster than 5.5 does with both 3.0/3.5.
So in the future we will move to 6.0 for nvMiner when all algos have been tested on both Maxwell and Kepler and work OK. Right now (or at last test) I had a problem with FRESH on 6.0. I delay all nvMiner releases until I have tested EVERY algo after each build. Damn you djm34, because you are starting to make my testing time take longer.
So the moral of this post is to start upgrading your rigs to the latest nvidia drivers. For the last 6 months (at least) I've been running the latest beta drivers at all times with no problems at all on both Maxwell and Kepler GPUs, so I see no reason not to run the latest or beta releases (what I run). I'll compile this next version as 5.5, probably release the version after this as 6.0 first and then 5.5, and the 3rd version from now might very well be 6.0 or greater only.
SO I JUST WANTED TO GIVE A HEADS UP on my plans to move to CUDA 6.0, which is a normal release and not a beta. If during testing I find this performs worse I'll let you guys know and will re-think this (we want the highest hash rates of course). So start thinking about or doing upgrades to the latest nvidia driver releases.
Carlo
|
|
|
|
yellowduck2
|
|
July 24, 2014, 01:08:49 AM |
|
Thank you very much. I think we should seriously think about Nvidia Miner Foundation and a foundation donation address.
|
|
|
|
tsiv
|
|
July 24, 2014, 01:51:38 AM |
|
Note to self: __CUDA_ARCH__ is a fickle bitch. I think I got the damn thing to use the new 4-way version of the phase 2 kernel for compute 3.0+ and the old one for 2.0. Since __CUDA_ARCH__ is apparently not defined when compiling the host code I didn't see much choice but to fire up the kernel with four threads per hash even if it's the single thread per hash compute 2.0 version. Dealt with it by making the single thread kernel do work only on the first of the four subthreads. Not very happy with it but it doesn't seem to matter that much performance-wise. Bottom line: Fuck all difference on Maxwell, apparently some other compute 3.0+ cards like the new 4-way kernel and gain some performance, compute 2.0 should work like before. I'll look into pulling some of Wolf's mods, also got some ideas for the phase 1&3 kernels but we'll see. Win32 binary at https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15/ccminer-cryptonight_20140724.zip
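The workaround described above (always launching four threads per hash, with the single-thread kernel doing work only on the first of the four subthreads) can be modelled in plain Python. This is only a sketch of the indexing idea, not tsiv's kernel code: the mapping `tid % 4` for the subthread and `tid // 4` for the hash index is an assumption, and the function names are hypothetical.

```python
THREADS_PER_HASH = 4  # launch width used for both kernel variants


def single_thread_kernel(tid, log):
    """Legacy variant: only subthread 0 of each group of four does any work."""
    sub = tid % THREADS_PER_HASH          # assumed subthread index
    if sub != 0:
        return                            # the other three subthreads idle
    log.append(tid // THREADS_PER_HASH)   # hash index handled by this thread


def four_way_kernel(tid, log):
    """4-way variant: every subthread contributes to its hash."""
    log.append((tid // THREADS_PER_HASH, tid % THREADS_PER_HASH))


def launch(kernel, n_hashes):
    """Simulate one launch with four threads per hash, whatever the kernel."""
    log = []
    for tid in range(n_hashes * THREADS_PER_HASH):
        kernel(tid, log)
    return log


print(launch(single_thread_kernel, 3))
print(len(launch(four_way_kernel, 3)))
```

Both variants see the same launch geometry, which is why the host code does not need to know which one was compiled in.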
|
|
|
|
bathrobehero
Legendary
Offline
Activity: 2002
Merit: 1051
ICO? Not even once.
|
|
July 24, 2014, 02:35:08 AM |
|
So I take back that 64 bit builds are faster with x15. I just have so many different versions of ccminer at this point (~20 GB) that I ended up using a borked 32 bit version which used the GPU less and the CPU more than its 64 bit brother.
Anyway, here are the average hashrates of a 750 Ti and a 780 Ti rig per card, running a couple of minutes per algo with djm34's commit 58 compiled with CUDA 5.5:
           750 Ti          780 Ti
           x32     x64     x32     x64
x11        2.4     2.3     5.5     5.2
x13        2.0     1.8     4.0     3.8
x14        1.9     1.8     4.0     3.8
x15        1.7     1.6     3.7     3.5
x17        1.6     1.5     3.6     3.4
jackpot    5.0     5.1     11      11
qubit      4.0     3.7     8.9     8.1
nist5      7.7     7.7     16.3    16.4
fresh      3.1     2.8     7.2     6.2
groestl    7.3     7.3     14.5    14.7
Gigabyte cards, solo mining, very slight 60 MHz core overclock.
|
Not your keys, not your coins!
|
|
|
tsiv
|
|
July 24, 2014, 05:07:13 AM |
|
Something I pretty much suspected but never bothered to check up on: run times for the various parts of the hash. Well, actually I did benchmark the core loops earlier and found the second one to be the biggest hog. Throw in the numbers for the prep and final phases and you get this:
Prepare: 0.001388 sec
Phase 1: 0.148383 sec
Phase 2: 1.414880 sec
Phase 3: 0.147834 sec
Final: 0.003590 sec
That's 32x15 hashes on a GTX 750 Ti. Can't tell how it works out on other cards since all I've got is a bunch of 750 Tis, but in this case optimizing the living fuck out of the prep and final parts all the way to instant completion with zero run time would bump up the total hashrate by 0.3%. Don't get me wrong, Wolf's doing nice work on unfucking stuff I pretty much just yanked out of cpuminer-multi and left as is. I just prefer to focus on shit that matters, again, no offense intended. Too bad I'm not even making a dent on that goddamn clusterfuck that is the second main loop
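For anyone who wants to check the 0.3% figure, the arithmetic can be reproduced from the posted timings. This is a small sketch using only the numbers in the post; the variable names are made up for illustration.

```python
# Phase run times in seconds for 32x15 hashes on a GTX 750 Ti, as posted.
phase_seconds = {
    "prepare": 0.001388,
    "phase1": 0.148383,
    "phase2": 1.414880,
    "phase3": 0.147834,
    "final": 0.003590,
}
hashes = 32 * 15

total = sum(phase_seconds.values())
hashrate = hashes / total

# Best case for prep/final tuning: both phases take zero time.
trimmed = total - phase_seconds["prepare"] - phase_seconds["final"]
gain_percent = (total / trimmed - 1) * 100

# Share of the run spent in the second main loop.
phase2_share = phase_seconds["phase2"] / total * 100

print(f"total {total:.6f} s, {hashrate:.1f} H/s")
print(f"gain with free prep/final: {gain_percent:.2f}%")
print(f"phase 2 share of run time: {phase2_share:.1f}%")
```

The gain comes out at roughly 0.29%, in line with the 0.3% quoted, while phase 2 accounts for about 82% of the run time.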
|
|
|
|
Equitum
Newbie
Offline
Activity: 29
Merit: 0
|
|
July 24, 2014, 05:37:54 AM |
|
I know this really doesn't have to do with ccminer discussions as much as nVidia mining/hashing in general, but I figured a good deal of people come to this thread to discuss the most profitable way to use nVidia power, so here goes: I've done a little trial run over the past day and a half of using Folding@Home to "mine" Curecoins, and I'm getting a pretty good payout (payout should be at about 30-35 or so Curecoins/day for my card's PPD; roughly 0.0035-0.0042 BTC/day at the current rate). It certainly doesn't hurt that my GPU is folding proteins and helping researchers while using all that power, instead of doing random hashing, but even without those considerations, the profit margin speaks for itself (for reference, Bombadil's profit calculator shows my 3 most profitable mining options at: {TAG: VEIL | Name:Veilcoin | Algo: X13 | BTC/day: .00316889, TAG: PP9X11 | Name:Multipool X11 (PP) | Algo: X11 | BTC/day: .00302276, TAG: XMR | Name:Monero | Algo: CryptoNight | BTC/day: .00215359}).
I get about 250k PPD with my 780 Ti and i5-4670k, so folding might be more relevant to single-card/gaming rigs than to pure mining rigs, but looking into F@H couldn't hurt for other kinds of rigs (I'd be interested to see what a full 750 Ti rig with a mid-range processor could put out in terms of PPD)!
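The comparison above can be put into numbers with a few lines of Python. This is a rough sketch using only the figures quoted in the post; pairing the low Curecoin estimate with the low BTC estimate (and high with high) is an assumption.

```python
# Figures quoted in the post.
best_mining_btc_per_day = 0.00316889      # VEIL (X13), top calculator entry
folding_btc_per_day = (0.0035, 0.0042)    # estimated folding payout range
curecoins_per_day = (30, 35)              # matching Curecoin range (assumed pairing)

# BTC per Curecoin implied by the two ends of the range.
implied_rate = [btc / cure for btc, cure in zip(folding_btc_per_day, curecoins_per_day)]

# Percentage advantage of folding over the best mining option.
advantage = [(btc / best_mining_btc_per_day - 1) * 100 for btc in folding_btc_per_day]

for rate, adv in zip(implied_rate, advantage):
    print(f"{rate:.6f} BTC/Curecoin, {adv:.1f}% above best mining option")
```

On these numbers folding pays roughly 10% to 33% more than the best listed mining option, before any difference in power draw.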
|
|
|
|
sp_
Legendary
Offline
Activity: 2954
Merit: 1087
Team Black developer
|
|
July 24, 2014, 06:43:06 AM |
|
So I take back that 64 bit builds are faster with x15. I just have so many different versions of ccminer at this point (~20 GB) that I ended up using a borked 32 bit version which used the GPU less and the CPU more than its 64 bit brother.
What we need is 32 bit addressing with 64 bit hashing (use 100% of the CUDA cores per cycle instead of 50%). This is not done by changing the build target to 64 bit. Each hash needs to be re-implemented in CUDA asm from scratch. Compute 3.0 has max 63 32-bit registers per thread, compute 3.5 has 255 registers, etc. But there are no speedups when compiling ccminer for 5.0. This means that the generated code is suboptimal and needs to be fine-tuned (preferably in 100% CUDA asm): remove latency, remove registers, pipeline instructions, improve cache hits, etc.
Today each thread in ccminer computes 1 hash by doing a full run-through of all algorithms: x1->x2->x3->x4->x5->x6->x7->x8->x9->x10->x11
This is suboptimal. An FPGA implementation will run at the speed of the slowest x, thus eliminating the other x'es since they are done in parallel. We should do something similar. The slowest algorithm for the 750 Ti is groestl. This algorithm runs at 7.5 MHASH on a single 750 Ti. The target for the optimized GPU miner will be in the range 5-7.5 MHASH on the 750 Ti for x11 (darkcoin).
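The difference between a chained run-through and an FPGA-style pipeline can be sketched with a simple throughput model. Only the groestl figure (7.5 MH/s) comes from the post; the other ten per-stage rates below are made-up placeholders, and the model assumes every stage has its own dedicated hardware, which is how an FPGA behaves and not necessarily how a GPU does.

```python
# Per-stage throughput in MH/s if a stage ran alone.
# Only groestl (7.5) is taken from the post; all other values are placeholders.
stage_rates = {
    "blake": 60.0, "bmw": 50.0, "groestl": 7.5, "skein": 40.0,
    "jh": 30.0, "keccak": 45.0, "luffa": 35.0, "cubehash": 25.0,
    "shavite": 15.0, "simd": 10.0, "echo": 12.0,
}

# Chained: one unit does all 11 stages back to back, so the times add up.
chained_rate = 1.0 / sum(1.0 / r for r in stage_rates.values())

# Pipelined on dedicated hardware: every stage runs at once on a different
# hash, so the slowest stage sets the pace.
pipelined_rate = min(stage_rates.values())

print(f"chained:   {chained_rate:.2f} MH/s")
print(f"pipelined: {pipelined_rate:.2f} MH/s")
```

With these placeholder rates the chained figure is well below the slowest single stage, which is the gap the post is pointing at. Whether a GPU, where all stages share the same cores, can get close to the pipelined figure is the open question.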
|
|
|
|
DougB62
|
|
July 24, 2014, 06:47:26 AM |
|
Now you're gettin' serious... and I like that!!
|
|
|
|
yellowduck2
|
|
July 24, 2014, 06:50:10 AM |
|
The target for the optimized GPU miner will be in the range 5-7.5 MHASH on the 750 Ti for x11 (darkcoin).
Wow. This is big if it's true. It would more than double the speed of pretty much every algo that ccminer is using. Very big improvement. This is the kind of improvement that is groundbreaking! cbuchner1 aka Nvidia Satoshi, any comment on sp_'s theory?
|
|
|
|
Amph
Legendary
Offline
Activity: 3248
Merit: 1070
|
|
July 24, 2014, 07:32:00 AM |
|
Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single thread per hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand it performs a lot more reasonably with various launch configurations, 15 blocks of 32 threads works out equally well as the original 8x60 magic bullet for 750 Ti. At this point I'm starting to think I'll just forget about that part and start looking if there's something else to be improved. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something else than a 750 Ti in would be willing to take it for a spin, I'd appreciate it. I've added the number for SMX/SMM/Whateverthingmabobs into the miner thread start-up info, you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads a power of 2. 4/8/16/32/64 are the best bets. https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip
Also, any chances for this code to get released already? Or are you competing against Wolf0
It works like a charm, 220H/s for GTX760, before it was 190. GTX750TIs seem unchanged.
I get 270H (peaks of 297H with -l 8x50) with this release and a GTX 760 overclocked --> v0.15-rc1 ccminer-cryptonight_20140723
Thanks for that launch setting, 306H/s (MSI gaming, +180core, +500mem). Still have to test what's the most stable, but thanks for giving me a start
Ooh damn, you've released that a looong time ago, tsiv. Should've noticed ^^" EDIT: 320H/s with +222core, +666mem. I'm waiting anxiously for a driver crash
Fantastic Bombadill... I'm on +180 core +300 memory. If you find any better launch configs please post them. I also asked in the other thread if there are binaries for Wolf's nvidia xmr miner
Dunno, but I can't OC my cards, at least one of them keeps crashing if I do so. Did you change the power limit in the BIOS?
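The launch-config rule of thumb quoted above (block count a multiple of the SMX count, thread count a power of 2 from 4 to 64) is easy to turn into a list of candidates to try. This is a hypothetical helper, not part of the miner; it returns (blocks, threads) pairs and deliberately does not format them as `-l` arguments. The count of 5 multiprocessors in the usage line is the 750 Ti's SMM count; use whatever number the miner prints at thread start-up for your card.

```python
def candidate_launch_configs(sm_count, max_multiplier=4,
                             thread_choices=(4, 8, 16, 32, 64)):
    """List (blocks, threads) pairs that follow the rule of thumb.

    sm_count       -- number of SMX/SMM units reported by the miner
    max_multiplier -- try block counts up to sm_count * max_multiplier
    """
    configs = []
    for k in range(1, max_multiplier + 1):
        blocks = sm_count * k
        for threads in thread_choices:
            configs.append((blocks, threads))
    return configs


# Example: a card with 5 multiprocessors, such as the 750 Ti.
for blocks, threads in candidate_launch_configs(5):
    print(f"{blocks} blocks of {threads} threads")
```

The "15 blocks of 32 threads" setting mentioned in the quote is one of the candidates this produces for a 5-multiprocessor card.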
|
|
|
|
sp_
|
|
July 24, 2014, 07:34:53 AM Last edit: July 24, 2014, 07:49:08 AM by sp_ |
|
I'm pretty sure the output of the last algorithm is used as the input of the next one for X11, precisely so you can't do that.
Yes you can. If each thread is working on a different hash. Example: 4 threads, 4 hashes
HASH1: x1->x2->x3->
HASH2: x4->x5->x6->
HASH3: x7->x8->x9->
HASH4: x10->x11
Swap the 4 hashes
HASH4: x1->x2->x3->
HASH1: x4->x5->x6->
HASH2: x7->x8->x9->
HASH3: x10->x11
Swap the 4 hashes
HASH3: x1->x2->x3->
HASH4: x4->x5->x6->
HASH1: x7->x8->x9->
HASH2: x10->x11
Swap the 4 hashes
HASH2: x1->x2->x3->
HASH3: x4->x5->x6->
HASH4: x7->x8->x9->
HASH1: x10->x11
Complete
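The rotation scheme above can be checked with a small simulation: each stage group works on a different hash in every round, yet every hash still sees x1 to x11 in order. This is a sketch under assumptions: the eleven stage functions are salted SHA-256 stand-ins (not the real X11 algorithms), and the pipeline starts empty and drains at the end, whereas the post shows the steady state.

```python
import hashlib


def make_stage(i):
    """Stand-in for stage x_i: SHA-256 salted with the stage number."""
    def stage(data: bytes) -> bytes:
        return hashlib.sha256(bytes([i]) + data).digest()
    return stage


STAGES = [make_stage(i) for i in range(1, 12)]                      # x1 .. x11
GROUPS = [STAGES[0:3], STAGES[3:6], STAGES[6:9], STAGES[9:11]]      # as in the post


def sequential(data: bytes) -> bytes:
    """One thread runs the full chain x1->...->x11 on one hash."""
    for stage in STAGES:
        data = stage(data)
    return data


def pipelined(inputs):
    """Each round, group g works on hash (round - g); hashes advance one group per round."""
    n, g = len(inputs), len(GROUPS)
    state = list(inputs)
    rounds = n + g - 1                      # fill + steady state + drain
    for r in range(rounds):
        for gi, group in enumerate(GROUPS):
            item = r - gi
            if 0 <= item < n:               # this group has a hash to work on
                for stage in group:
                    state[item] = stage(state[item])
    return state, rounds


inputs = [b"hash1", b"hash2", b"hash3", b"hash4"]
results, rounds = pipelined(inputs)
print(results == [sequential(x) for x in inputs], rounds)
```

Within a round no two groups touch the same hash, so the chaining of one algorithm's output into the next is preserved even though the groups run side by side.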
|
|
|
|
sp_
|
|
July 24, 2014, 07:47:35 AM |
|
I'm pretty sure the output of the last algorithm is used as the input of the next one for X11, precisely so you can't do that.
Yes you can. If each thread is working on a different hash.
Oh, I get it. Clever. You're not going to raise the hash to that of the slowest alg, though, because the GPU is partially occupied by the other hashes going on. However, I see no reason why that won't work.
Yes, the GPU is occupied, but on separate and non-overlapping memory blocks. The slowest alg can be optimized...
|
|
|
|
PVmining
|
|
July 24, 2014, 08:26:16 AM |
|
Just saw tsiv's parallelization of the second loop. Quite impressive.
...he's a cool guy. Hey tsiv thanks a lot for your launch-config change for kopiemtu - that's really awesome!
|
|
|
|
sp_
|
|
July 24, 2014, 08:38:46 AM |
|
I haven't looked at TSIV's code. Isn't CryptoNight just a variation of x11 + scryptn? A 20% gain is a good job. Now do another 20%. Anyway, I will start implementing some code soon. I will start with the 11 x'es, one by one.
|
|
|
|
bigjme
|
|
July 24, 2014, 08:48:45 AM |
|
Ouch, I am getting left behind. My mining rig has been off for over a week and this thread just looks like a developer chatroom. It is great to see so many of you all working together. Who's this Christian guy that releases stuff? I've never seen him here.
|
Owner of: cudamining.co.uk
|
|
|
yellowduck2
|
|
July 24, 2014, 09:14:47 AM |
|
Ouch, I am getting left behind. My mining rig has been off for over a week and this thread just looks like a developer chatroom. It is great to see so many of you all working together. Who's this Christian guy that releases stuff? I've never seen him here.
He is Nvidia Satoshi. Retired behind the scenes.
|
|
|
|
S_tring
Full Member
Offline
Activity: 252
Merit: 102
OPEN Platform - Powering Blockchain Acceptance
|
|
July 24, 2014, 09:16:52 AM |
|
Linux users might be pleased to know that the profit switching capability of ccManager is coming along nicely, too. It uses TradeMyBit for now, and I've just coded a facility to stop mining on TMB altogether if the daily profit projection is poor. In this case it switches to an alternative pool of your choice (last resort pool), or it stops mining altogether and monitors TMB for a decent profit margin before starting again.
I should have the GitHub repo updated with something for you to play with some time next week.
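The switching behaviour described above boils down to a three-way decision. The following is a minimal sketch of that decision only, not ccManager's actual code: the function name, parameters and return values are all hypothetical, and how the profit projection is obtained from TradeMyBit is left out.

```python
def choose_action(projected_btc_per_day, min_btc_per_day, last_resort_pool=None):
    """Decide what to do given the projected daily profit on TradeMyBit.

    Returns a (action, pool) tuple:
      ("mine", "TradeMyBit")    -- projection is good enough
      ("mine", <last resort>)   -- projection is poor, fall back to the chosen pool
      ("wait", None)            -- projection is poor and no fallback is set;
                                   stop mining and keep monitoring
    """
    if projected_btc_per_day >= min_btc_per_day:
        return ("mine", "TradeMyBit")
    if last_resort_pool is not None:
        return ("mine", last_resort_pool)
    return ("wait", None)


# Example values are made up for illustration.
print(choose_action(0.0040, 0.0030))
print(choose_action(0.0020, 0.0030, last_resort_pool="my-fallback-pool"))
print(choose_action(0.0020, 0.0030))
```

A caller would run this on every polling interval, so that a "wait" result turns back into "mine" as soon as the projection recovers.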
|
|
|
|
yellowduck2
|
|
July 24, 2014, 09:18:02 AM |
|
I haven't looked at TSIV's code. Isn't CryptoNight just a variation of x11 + scryptn? A 20% gain is a good job. Now do another 20%. Anyway, I will start implementing some code soon. I will start with the 11 x'es, one by one.
Do you mind me asking if you have a degree / master's / PhD in computer science? You spotted something that no one here understood at first. I see you had to explain it in so many posts before people got it. Are you related to Satoshi?
|
|
|
|
|