djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
February 04, 2014, 01:34:18 PM |
|
A brave tester with 8 Fermi cards Tesla M2090 (thanks Choseh) just figured out the performance regression between 2013-12-18 and 2014-02-02.
If you change the #if 0 in the fermi_kernel.cu to #if 1 (thereby enabling the previous version of the Salsa20/8 round function) you should see the previous performance figures again. Those who can compile the code themselves and want to mine on Fermi are welcome to make this change themselves.
Also, there seems to be a bug in the autotuning code in salsa_kernel.cu:
hash_sec = (double)WU_PER_LAUNCH / tdelta;
should very likely be
hash_sec = (double)WU_PER_LAUNCH * repeat / tdelta;
to factor in the number of repetitions in the measurement (we want to measure for 50ms minimum for better timer accuracy). So autotune was drunk after all!
So, it seems I should release fixes (new binary release) for these problems tonight.
Christian
Yes, it works better this way. However, there is still the problem with the power increase between configs, but it is less apparent. (Strangely, I don't have that problem with the GTX 660; its power stays at 100% and doesn't fluctuate.)
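For anyone patching this by hand, the autotune correction quoted above boils down to the sketch below. It is only an illustration of the arithmetic: the function name `hashes_per_second` and its parameter names are made up here, and only WU_PER_LAUNCH, repeat and tdelta come from the post.

```cpp
#include <cassert>

// Hash rate of one timed autotune measurement.
// wu_per_launch: hashes computed by a single kernel launch (WU_PER_LAUNCH in the post)
// repeat:        how many launches were timed back to back
// tdelta:        elapsed wall-clock time for all launches, in seconds
// The function name is hypothetical; it is not taken from salsa_kernel.cu.
double hashes_per_second(unsigned wu_per_launch, unsigned repeat, double tdelta) {
    assert(repeat > 0);
    assert(tdelta > 0.0);
    // Total work is launches * work per launch. Leaving out "repeat"
    // under-reports the rate by exactly that factor.
    return static_cast<double>(wu_per_launch) * repeat / tdelta;
}
```

With 1000 hashes per launch, 5 launches and 0.5 s elapsed, this gives 10000 hashes/s, whereas the uncorrected formula would report 2000.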
|
djm34 facebook page BTC: 1NENYmxwZGHsKFmyjTc5WferTn5VTFb7Ze Pledge for neoscrypt ccminer to that address: 16UoC4DmTz2pvhFvcfTQrzkPTrXkWijzXw
|
|
|
morbooo
Newbie
Offline
Activity: 4
Merit: 0
|
|
February 04, 2014, 01:43:01 PM |
|
I think your bug report is the one that made my mind go http://www.digitalsherpa.com/wp-content/uploads/2012/11/lightbulb1.gif
The CUDA constant memory (the c_N loop trip count, etc.) of most CUDA kernels is only initialized properly for the first GPU (a single static variable marks initialization instead of a thread-specific static variable). This explains the majority of the crashes people are seeing with multi-GPU. Thank you. The Fermi owners use a kernel that doesn't yet make use of such constants, and hence the multi-GPU support is working fine for them. So this is also on the FIXME list for tonight.
Awesome, looking forward to the fix. Thanks for the support.
However, I think that in your case, where you run two cudaminer instances, this cannot be the root cause. So we will have to keep looking.
Oh no, I don't run two instances. I meant that one of the GPUs within the same cudaMiner instance produced invalid results, which is in line with your explanation above. Running two instances of cudaMiner (one for each GPU) actually works perfectly, so this also confirms your hypothesis.
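To make the initialization bug described above concrete, here is a host-only sketch. It does not touch CUDA at all: a plain array stands in for each GPU's constant memory, and all names (`ConstantMemory`, `upload_with_shared_flag`, `upload_with_per_device_flag`, `kMaxDevices`) are invented for illustration. It also uses a per-device flag rather than the thread-specific static mentioned in the post; with one mining thread per GPU the two should behave alike, but that is an assumption.

```cpp
#include <array>
#include <cassert>

constexpr int kMaxDevices = 8;  // hypothetical upper bound on GPUs

// One simulated constant-memory slot (think c_N) per GPU; 0 means "never uploaded".
using ConstantMemory = std::array<unsigned, kMaxDevices>;

// Buggy pattern: one flag shared by every device, so only the first caller uploads.
void upload_with_shared_flag(ConstantMemory& mem, int device, unsigned n) {
    assert(device >= 0 && device < kMaxDevices);
    static bool initialized = false;
    if (!initialized) {
        mem[device] = n;
        initialized = true;
    }
}

// Fixed pattern: one flag per device, so every GPU gets its constants once.
void upload_with_per_device_flag(ConstantMemory& mem, int device, unsigned n) {
    assert(device >= 0 && device < kMaxDevices);
    static std::array<bool, kMaxDevices> initialized{};
    if (!initialized[device]) {
        mem[device] = n;
        initialized[device] = true;
    }
}
```

Calling the first variant for devices 0 and 1 leaves device 1 with an unset value, which would match the invalid results seen on the second GPU of a single instance. Two separate processes each get their own flag, which would explain why running one instance per GPU works.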
|
|
|
|
cbuchner1 (OP)
|
|
February 04, 2014, 01:54:05 PM |
|
Mate, any idea why, although I have 2 GTX 780 (and two cuda miners), one shows 520 khps and the other 605? Can it be that the one the monitor is plugged into loses hash power because of it? Any idea what is going on?
I have 3 GTX 780Ti in one PC and two of them hash 10-20 kHash/s less than the fastest one. I attribute this to subtle differences in the PCI Express connectivity. But 100 kHash/s difference - ouch! Have you played with the -H options yet?
|
|
|
|
trell0z
Newbie
Offline
Activity: 43
Merit: 0
|
|
February 04, 2014, 02:11:09 PM |
|
Mate, any idea why, although I have 2 GTX 780 (and two cuda miners), one shows 520 khps and the other 605? Can it be that the one the monitor is plugged into loses hash power because of it? Any idea what is going on?
I have 3 GTX 780Ti in one PC and two of them hash 10-20 kHash/s less than the fastest one. I attribute this to subtle differences in the PCI Express connectivity. But 100 kHash/s difference - ouch! Have you played with the -H options yet?
Have you guys monitored your cards in Afterburner? The topmost cards might be throttling more than the bottom one, or they may simply boost to different MHz. A custom BIOS with boost disabled is awesome in general.
|
|
|
|
bathrobehero
Legendary
Offline
Activity: 2002
Merit: 1051
ICO? Not even once.
|
|
February 04, 2014, 02:24:11 PM |
|
Mate, any idea why, although I have 2 GTX 780 (and two cuda miners), one shows 520 khps and the other 605? Can it be that the one the monitor is plugged into loses hash power because of it? Any idea what is going on?
Primary cards are always going to perform worse, as they are stressed by the OS, your browser, background apps and so on. Also, the -H flag could cause it, so try -H 2 to exclude the CPU. If you're not using risers, chances are one of your cards is hotter than the other, or at least requires a higher fan speed to keep it at lower temps, so the fans are using more power on one card, which may well mean lower core frequencies when it comes to Kepler. And those are not the only possible explanations, but my break is over...
|
Not your keys, not your coins!
|
|
|
xblackdemonx
Newbie
Offline
Activity: 6
Merit: 0
|
|
February 04, 2014, 02:41:34 PM Last edit: February 04, 2014, 03:06:01 PM by xblackdemonx |
|
Hi, I'm using 2x GTX 560 here.
With the 2013-12-10 version I get about 290 kh/s.
With the 2013-12-18 version I get about 310 kh/s, but it freezes often.
With the 2014-02-02 version I get about 270 kh/s.
I'm using: cudaminer.exe -d 0,1 -i 0,0 -l F7x16,F7x16 -H 1,1 -C 1,1
|
|
|
|
lordaccess
Member
Offline
Activity: 69
Merit: 10
|
|
February 04, 2014, 02:42:36 PM |
|
Mate, any idea why, although I have 2 GTX 780 (and two cuda miners), one shows 520 khps and the other 605? Can it be that the one the monitor is plugged into loses hash power because of it? Any idea what is going on?
I have 3 GTX 780Ti in one PC and two of them hash 10-20 kHash/s less than the fastest one. I attribute this to subtle differences in the PCI Express connectivity. But 100 kHash/s difference - ouch! Have you played with the -H options yet?
Since they are the same, I use the same configuration. My problem is that if I start either one alone, it does reach 605. If I start them together, they both reach 605, but after 2-3 minutes the GPU clock drops on one of them, and the voltage and the hash rate with it (down to 520). It's the upper card, the one the monitor is plugged into, meaning its PCIe slot is the most powerful. This card also runs hotter (85C) than the one that works at max hash (75C), but that is probably due to the limited space it has to breathe. EDIT: I also saw the next 2 answers. Thanks, I'll try to play with -H (although I doubt I'll see any difference).
|
|
|
|
tron666
Member
Offline
Activity: 112
Merit: 10
|
|
February 04, 2014, 02:46:34 PM Last edit: February 04, 2014, 03:02:51 PM by tron666 |
|
When recompiling this, is there anything wrong with doing a git pull, running autogen, configure and then make? Or is it better to just delete everything and start from scratch? Doesn't git make sure you're not mixing any files?
BTW, the new commits are definitely improving the performance on Fermi, but it is still below what it was with 2014-01-20.
|
CCO MNVPaetsHpxr97mRDqqPuV6PQoSVbFgPVE NEM-test TBZXHE-TD6AO6-PHSFZL-SZ7MWS-JEEI7C-EFCUC2-7Y7V LTC LcAQUMNhqDYesRRYMxMAsE5rhAAseDMDp7 XPM AHDtLd993oYke4Zrm5dDG5WGtgyaUaMTCK NXT 16706883867271464458 DOGE DBGiKBD1HZ8yfTdTcX5m8T7mY4X4cUVnEz
|
|
|
Lacan82
|
|
February 04, 2014, 02:51:51 PM |
|
Well I just got ripped off of a YACoin block, it said the Yay!!! thing but the damn client never actually showed the block. The client even said it found a block but it never appeared in my wallet, so sad...
Yacoin takes ~520 confirms. That usually takes a few hours after a found block.
The same happened to me with an UltraCoin block. My wallet lists 3 transactions in total, but only 2 incoming transactions from mining are actually displayed. If you find a way to recover that missing transaction, please let me know.
Command-line option:
-rescan    Rescan the block chain for missing wallet transactions
Have you tried this?
|
|
|
|
trell0z
Newbie
Offline
Activity: 43
Merit: 0
|
|
February 04, 2014, 02:53:15 PM |
|
Mate, any idea why, although I have 2 GTX 780 (and two cuda miners), one shows 520 khps and the other 605? Can it be that the one the monitor is plugged into loses hash power because of it? Any idea what is going on?
I have 3 GTX 780Ti in one PC and two of them hash 10-20 kHash/s less than the fastest one. I attribute this to subtle differences in the PCI Express connectivity. But 100 kHash/s difference - ouch! Have you played with the -H options yet?
Since they are the same, I use the same configuration. My problem is that if I start either one alone, it does reach 605. If I start them together, they both reach 605, but after 2-3 minutes the GPU clock drops on one of them, and the voltage and the hash rate with it (down to 520). It's the upper card, the one the monitor is plugged into, meaning its PCIe slot is the most powerful. This card also runs hotter (85C) than the one that works at max hash (75C), but that is probably due to the limited space it has to breathe. EDIT: I also saw the next 2 answers. Thanks, I'll try to play with -H (although I doubt I'll see any difference).
Have you tried undervolting your cards? With the new kernel I can undervolt massively and still have a high overclock (GTX 780). I'm currently running +310 on the core, which gives me 1254 MHz, no memory OC, and -50 mV, which makes it 1.100 V under load. The lower heat might let your cards stay at higher clocks for longer. You should also use Afterburner to set the priority to the power target and not the temp target.
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
February 04, 2014, 04:25:19 PM |
|
Something strange (or not, that's the question...). For most coins, the kernel formerly known as Z is the fastest, especially with scrypt coins. However, for Vertcoin (scrypt:2048) it is much slower (difference > 50 kHash/s) than the kernel formerly known as T. Is there any reason for this?
|
|
|
|
cbuchner1 (OP)
|
|
February 04, 2014, 04:31:39 PM Last edit: February 04, 2014, 05:01:41 PM by cbuchner1 |
|
Something strange (or not, that's the question...). For most coins, the kernel formerly known as Z is the fastest, especially with scrypt coins. However, for Vertcoin (scrypt:2048) it is much slower (difference > 50 kHash/s) than the kernel formerly known as T. Is there any reason for this?
So you're saying the current "T" (alias name Z) kernel is slower than the current "t" kernel (formerly known as T) for VertCoin? Or are you comparing current cudaminer performance with some older prerelease version?
Either way, this is surprising. There should not be much of a difference between N=1024 and N=2048 scrypt coins, really. At high N the low register count kernels have a significant advantage: they reach higher occupancy under tight memory constraints, and they can do a lookup gap without running into much register pressure. But N=2048 isn't high...
Christian
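As a rough illustration of the memory constraint and lookup gap mentioned above, the sketch below computes the scratchpad one scrypt hash needs. It assumes the standard scrypt layout of N entries of 128*r bytes each and r=1 for these coins; neither figure is stated in the thread, and the function name `scratchpad_bytes` is hypothetical. The lookup gap is modelled simply as storing every gap-th entry and recomputing the rest.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of scratchpad memory needed per scrypt hash.
// n:          scrypt N parameter (1024, 2048, ...)
// r:          scrypt block size parameter (assumed to be 1 for these coins)
// lookup_gap: 1 stores every entry, 2 stores every second entry, and so on;
//             skipped entries have to be recomputed on demand
std::size_t scratchpad_bytes(std::size_t n, std::size_t r, std::size_t lookup_gap) {
    assert(n > 0 && r > 0 && lookup_gap > 0);
    const std::size_t block_bytes = 128 * r;
    // Round up so a partial group at the end still keeps its stored entry.
    const std::size_t stored_entries = (n + lookup_gap - 1) / lookup_gap;
    return block_bytes * stored_entries;
}
```

Under these assumptions N=1024 needs 128 KiB per hash and N=2048 needs 256 KiB, so half as many hashes fit into the same GPU memory. A lookup gap of 2 at N=2048 brings the footprint back to 128 KiB at the cost of extra computation.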
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
February 04, 2014, 04:37:04 PM |
|
Yes, the Z kernel is the slowest for Vertcoin (this has been the case ever since it was introduced).
|
|
|
|
bathrobehero
Legendary
Offline
Activity: 2002
Merit: 1051
|
|
February 04, 2014, 05:44:47 PM |
|
either way, this is surprising. There should not be much of a difference between N=1024 and N=2048 scrypt coins, really. At high N the low register count kernels have a significant advantage - they reach higher occupancy under tight memory constraints. And they can do a lookup gap without running into much register pressure. But N=2048 isn't high...
Christian
GTX 660:
N:1024, Y5x32, ~240 kH/s
N:2048, Y5x32, ~128 kH/s
Edit: Y5x32 seems to be the fastest kernel/config, even though autotune tends to find Y5x28 the fastest most of the time.
|
|
|
|
sin242
|
|
February 04, 2014, 06:03:03 PM |
|
either way, this is surprising. There should not be much of a difference between N=1024 and N=2048 scrypt coins, really. At high N the low register count kernels have a significant advantage - they reach higher occupancy under tight memory constraints. And they can do a lookup gap without running into much register pressure. But N=2048 isn't high...
Christian
GTX 660:
N:1024, Y5x32, ~240 kH/s
N:2048, Y5x32, ~128 kH/s
Edit: Y5x32 seems to be the fastest kernel/config, even though autotune tends to find Y5x28 the fastest most of the time.
Hi there, long-time lurker. Registered to post about this. I'm seeing the same trend on GTX 670s.
1024 would get me ~280 with best results from Y14x20
2048 it's ~133 with the best results from K7x32
|
Dark: Xk9BoVerBd41JCjWQEhnxoowP7YNUK439z BTC: 1JzPN2h8WGSi7kQeY5wuP4PjVD2hxkHJQM
|
|
|
cbuchner1 (OP)
|
|
February 04, 2014, 06:04:21 PM |
|
1024 would get me ~280 with best results from Y14x20
2048 it's ~133 with the best results from K7x32
Isn't Y an alias for K?
|
|
|
|
sin242
|
|
February 04, 2014, 06:13:14 PM |
|
1024 would get me ~280 with best results from Y14x20
2048 it's ~133 with the best results from K7x32
Isn't Y an alias for K?
Actually, it's funny you mention that. If I try to run Y7x32, it fails horribly and crashes the driver.
|
|
|
|
cbuchner1 (OP)
|
|
February 04, 2014, 06:21:25 PM |
|
1024 would get me ~280 with best results from Y14x20
2048 it's ~133 with the best results from K7x32
Isn't Y an alias for K?
Actually, it's funny you mention that. If I try to run Y7x32, it fails horribly and crashes the driver.
According to the code, it shouldn't crash... K and Y really do the same thing.
    switch (kernelid)
    {
        case 'T': case 'Z': *kernel = new NV2Kernel(); break;
        case 't': *kernel = new TitanKernel(); break;
        case 'K': case 'Y': *kernel = new NVKernel(); break;
        case 'k': *kernel = new KeplerKernel(); break;
        case 'F': case 'L': *kernel = new FermiKernel(); break;
        case 'f': case 'X': *kernel = new TestKernel(); break;
        case ' ': // choose based on device architecture
            *kernel = Best_Kernel_Heuristics(props);
            break;
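For readers unsure how a string like Y7x32 maps onto that switch, here is a minimal parsing sketch. It assumes the launch configuration is an optional kernel letter followed by blocks x warps per block; the struct `LaunchConfig` and the function `parse_launch_config` are invented for this example and are not cudaminer's own parser.

```cpp
#include <cassert>
#include <cstdio>

// Pieces of a launch configuration string such as "Y7x32".
struct LaunchConfig {
    char kernel_id;        // 'Y', 'K', ... or ' ' when no prefix letter was given
    int blocks;            // number before the 'x'
    int warps_per_block;   // number after the 'x' (assumed meaning)
};

// Returns true and fills "out" when "text" has the form [letter]<int>x<int>.
bool parse_launch_config(const char* text, LaunchConfig& out) {
    assert(text != nullptr);
    out.kernel_id = ' ';
    // An optional leading letter selects the kernel class.
    if ((*text >= 'A' && *text <= 'Z') || (*text >= 'a' && *text <= 'z')) {
        out.kernel_id = *text++;
    }
    int consumed = 0;
    // %n records how many characters were read so trailing junk can be rejected.
    if (std::sscanf(text, "%dx%d%n", &out.blocks, &out.warps_per_block, &consumed) != 2) {
        return false;
    }
    return text[consumed] == '\0' && out.blocks > 0 && out.warps_per_block > 0;
}
```

Under this reading, Y7x32 and K7x32 differ only in `kernel_id`, and both letters select the same kernel class in the switch, so a crash that appears with one prefix but not the other would point at something outside the kernel selection.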
|
|
|
|
djm34
Legendary
Offline
Activity: 1400
Merit: 1050
|
|
February 04, 2014, 06:22:00 PM |
|
either way, this is surprising. There should not be much of a difference between N=1024 and N=2048 scrypt coins, really. At high N the low register count kernels have a significant advantage - they reach higher occupancy under tight memory constraints. And they can do a lookup gap without running into much register pressure. But N=2048 isn't high...
Christian
GTX 660:
N:1024, Y5x32, ~240 kH/s
N:2048, Y5x32, ~128 kH/s
Edit: Y5x32 seems to be the fastest kernel/config, even though autotune tends to find Y5x28 the fastest most of the time.
Same kernel here with my GTX 660 OEM 1.5 GB, although at 2048 it is rather Y6x20. But it is well known that the GTX 660 OEM is not really a GTX 660.
|
|
|
|
sin242
|
|
February 04, 2014, 06:28:09 PM |
|
1024 would get me ~280 with best results from Y14x20
2048 it's ~133 with the best results from K7x32
Isn't Y an alias for K?
Actually, it's funny you mention that. If I try to run Y7x32, it fails horribly and crashes the driver.
According to the code, it shouldn't crash... K and Y really do the same thing.
    switch (kernelid)
    {
        case 'T': case 'Z': *kernel = new NV2Kernel(); break;
        case 't': *kernel = new TitanKernel(); break;
        case 'K': case 'Y': *kernel = new NVKernel(); break;
        case 'k': *kernel = new KeplerKernel(); break;
        case 'F': case 'L': *kernel = new FermiKernel(); break;
        case 'f': case 'X': *kernel = new TestKernel(); break;
        case ' ': // choose based on device architecture
            *kernel = Best_Kernel_Heuristics(props);
            break;
After tinkering some more, K7x32 isn't working either. It has to be something on my end. There's a separate 670 in another machine that's happily hashing away with K7x32, but now the 670s won't. Going to reinstall drivers/CUDA and see if anything changes.
|
|
|
|
|