cbuchner1 (OP)
|
 |
January 29, 2014, 02:09:16 PM |
|
I have Visual Studio 2012, but I can't load the solution file. So I probably have to wait a few more days.
Is CUDA 5.5 installed? VS 2012 should be able to upgrade the solution file automatically. But then you will have to make sure that all the other dependencies are available (OpenSSL, pthreads, libcurl...).
|
|
|
|
ManIkWeet
|
 |
January 29, 2014, 02:10:38 PM |
|
So my 780 getting over 5 isn't too bad then
but 6 or 7 would be nicer. I have one optimization in mind that swaps the state of threads within the lookup_gap loop. The intention is to order threads by their loop trip count (some have to run for 0 loops, others for a couple more, up to the specified lookup_gap). By ordering them, some of the warps will terminate much earlier and not consume any computational resources. This would (in theory) reduce the workload by nearly a factor of 2, but it introduces some overhead for sorting the threads and for shuffling the state around. Whether a net speed gain remains is yet to be seen. I will save that optimization for February (it would delay this release...)

Christian

I shall happily beta test this when it is out in February
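A minimal sketch of how such a reordering could look, purely to illustrate the idea above; this is not the actual cudaminer code, and STATE_WORDS, the shared-memory layout and the naive O(n^2) ranking are placeholder assumptions:

Code:
#include <stdint.h>

#define STATE_WORDS 32   // placeholder size of the per-hash state kept in registers

// Reorder the per-thread work inside a block so that threads with the fewest
// remaining lookup_gap iterations end up in the lowest-numbered warps, which
// can then retire early instead of idling alongside long-running lanes.
__device__ void reorder_by_trip_count(uint32_t *s_trips,   // blockDim.x entries (shared)
                                      uint32_t *s_state,   // blockDim.x * STATE_WORDS words (shared)
                                      uint32_t &my_trips,
                                      uint32_t  my_state[STATE_WORDS])
{
    int tid = threadIdx.x;

    // publish trip counts so every thread can rank itself
    s_trips[tid] = my_trips;
    __syncthreads();

    // naive rank: number of threads with fewer trips (ties broken by thread id);
    // a real kernel would use a proper block-wide sort or prefix sum instead
    int rank = 0;
    for (int j = 0; j < blockDim.x; ++j)
        if (s_trips[j] < my_trips || (s_trips[j] == my_trips && j < tid))
            ++rank;
    __syncthreads();

    // scatter this thread's work to its sorted slot...
    s_trips[rank] = my_trips;
    for (int i = 0; i < STATE_WORDS; ++i)
        s_state[rank * STATE_WORDS + i] = my_state[i];
    __syncthreads();

    // ...and pick up whatever work now lives at our own slot
    my_trips = s_trips[tid];
    for (int i = 0; i < STATE_WORDS; ++i)
        my_state[i] = s_state[tid * STATE_WORDS + i];
}

Whether the extra __syncthreads() barriers and the shared-memory traffic eat up the savings is exactly the open question mentioned above.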
|
BTC donations: 18fw6ZjYkN7xNxfVWbsRmBvD6jBAChRQVn (thanks!)
|
|
|
cbuchner1 (OP)
|
 |
January 29, 2014, 02:11:35 PM |
|
To be more specific, I was using 4 streams on nv_scrypt_core_kernelA<ALGO_SCRYPT_JANE> and nv_scrypt_core_kernelB<ALGO_SCRYPT_JANE> inside the NVKernel::run_kernel. So those are the kernels I was referring to. Too bad the code in these kernels looks like witchcraft to me at the moment. LOL
What you have to know is that the "A" kernels write to the scratchpad (yes, the ENTIRE scratchpad) and the kernels labeled "B" read from random positions in the scratchpad. So there is an A->B dependency: A has to complete before B can run.

If you still want to try running multiple streams, divide the hashing (nonce) range into 4 equally sized regions; then you can run A -> B for region 1, A -> B for region 2, A -> B for region 3 and A -> B for region 4 simultaneously on 4 streams, as their scratchpad areas do not overlap.

But somehow I do not see the advantage of this. Typically the launch configurations are chosen so that a single stream already fully loads the GPU's multiprocessors. I currently use two streams for different hashing (nonce) ranges, but fully concurrent execution would only be allowed if you allocated several scratchpads (one per stream). Considering that video card memory is a scarce resource, this is probably not the best idea. Especially with scrypt-jane coins this is a problem.

Christian
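For what it's worth, a rough sketch of that 4-region layout. The kernel names, signatures and the per-hash scratchpad sizing are assumptions for illustration; the real templated kernels take different arguments:

Code:
#include <cuda_runtime.h>
#include <stdint.h>

// assumed stand-ins for the real templated kernels
__global__ void scrypt_core_kernelA(uint32_t *pad, uint32_t first_nonce, uint32_t n);
__global__ void scrypt_core_kernelB(uint32_t *pad, uint32_t first_nonce, uint32_t n);

void run_four_regions(uint32_t *d_scratchpad,     // one big device allocation
                      size_t    words_per_hash,   // scratchpad words per nonce (assumed)
                      uint32_t  first_nonce,
                      uint32_t  hashes_total,     // assumed divisible by 4
                      dim3 grid, dim3 block)
{
    cudaStream_t streams[4];
    uint32_t per_region = hashes_total / 4;

    for (int r = 0; r < 4; ++r) {
        cudaStreamCreate(&streams[r]);

        // each region gets its own non-overlapping slice of the scratchpad
        uint32_t *region_pad   = d_scratchpad + (size_t)r * per_region * words_per_hash;
        uint32_t  region_nonce = first_nonce + r * per_region;

        // A fills the region's scratchpad, B reads it back; putting both into the
        // same stream preserves the A -> B ordering, while the four streams may
        // overlap with each other
        scrypt_core_kernelA<<<grid, block, 0, streams[r]>>>(region_pad, region_nonce, per_region);
        scrypt_core_kernelB<<<grid, block, 0, streams[r]>>>(region_pad, region_nonce, per_region);
    }

    for (int r = 0; r < 4; ++r) {
        cudaStreamSynchronize(streams[r]);
        cudaStreamDestroy(streams[r]);
    }
}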
|
|
|
|
patoberli
Member

Offline
Activity: 106
Merit: 10
|
 |
January 29, 2014, 02:15:16 PM |
|
Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin. There are a few "does not validate on CPU" results though, and they also don't seem to count:
[2014-01-29 15:11:26] GPU #1: GeForce GT 640 result does not validate on CPU (i=23, s=0)!
[2014-01-29 15:11:27] GPU #1: GeForce GT 640, 1.45 khash/s
[2014-01-29 15:11:27] accepted: 51/51 (100.00%), 1.45 khash/s (yay!!!)
Otherwise it's running smoothly on Windows. The build is from today. Start parameters: cudaminer.exe -a scrypt-jane -i 0 -l K27x2 -o http://yac.coinmine.pl:8882 -O ...:... -H 2 -d 1
|
YAC: YA86YiWSvWEGSSSerPTMy4kwndabRUNftf BTC: 16NqvkYbKMnonVEf7jHbuWURFsLeuTRidX LTC: LTKCoiDwqEjaRCoNXfFhDm9EeWbGWouZjE
|
|
|
cbuchner1 (OP)
|
 |
January 29, 2014, 02:16:38 PM |
|
Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin. There are a few "does not validate on CPU" results though, and they also don't seem to count:
I wish I knew what is causing these... Try passing -b 8192 for a bit more speed.
|
|
|
|
patoberli
Member

Offline
Activity: 106
Merit: 10
|
 |
January 29, 2014, 03:39:38 PM |
|
Thanks. Does the -b parameter have any influence when running with -i 0? In any case, it didn't change the speed visibly.
|
YAC: YA86YiWSvWEGSSSerPTMy4kwndabRUNftf BTC: 16NqvkYbKMnonVEf7jHbuWURFsLeuTRidX LTC: LTKCoiDwqEjaRCoNXfFhDm9EeWbGWouZjE
|
|
|
cbuchner1 (OP)
|
 |
January 29, 2014, 04:21:19 PM |
|
Thanks. Does the -b parameter have any influence when running with -i 0? In any case, it didn't change the speed visibly.
-b still has an influence, as it reduces the overhead of CUDA kernel calls. Bigger chunks of data to work with mean less overhead.
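A back-of-the-envelope way to see it; the function and its parameter names are illustrative assumptions, not measurements of cudaminer itself:

Code:
// Rough cost model: each kernel call pays a fixed launch overhead, so for a
// fixed total amount of work, fewer but larger batches (-b) waste less time.
double estimated_runtime(double hashes_total, double batch_size,
                         double launch_overhead_s, double seconds_per_hash)
{
    double batches = hashes_total / batch_size;  // larger -b  ->  fewer batches
    return batches * launch_overhead_s           // fixed cost paid per kernel call
         + hashes_total * seconds_per_hash;      // the actual hashing work
}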
|
|
|
|
apluscarp
Newbie
Offline
Activity: 12
Merit: 0
|
 |
January 29, 2014, 04:55:17 PM |
|
Where can I find the 2014-01-17 version?
|
|
|
|
bigjme
|
 |
January 29, 2014, 05:29:55 PM Last edit: January 29, 2014, 05:49:19 PM by bigjme |
|
Results from my latest build. Here is the launch config I stuck with and the results for -L2 to -L6:
./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4
-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65
Running with -L4, I have noticed that it has now settled down at 4.96 - 5.05 after being on for 4 hours, and I still have a lot of system resources left over. So I'm wondering how much it is actually using. Memory-wise it is using 2412 MiB / 3071 MiB; I'm not sure about the actual GPU usage. I wonder if I could get more memory usage than I do now; it might get me more out of it.
|
Owner of: cudamining.co.uk
|
|
|
whitesand77
|
 |
January 29, 2014, 05:38:28 PM |
|
But somehow I do not see the advantage of this. Typically the launch configurations are chosen so that a single stream already fully loads the GPU's multiprocessors.
If this were true, my stream test would have given me a lower hash rate due to overhead, but it doubled. Just because MSI Afterburner or another program is reporting 90-something % GPU usage doesn't mean streams won't help. When I ran the two kernels, I ran all the A's first, synced, then the B's. So as long as the batch size for the NFactor is small enough to spawn off enough kernels, they'll run concurrently. Again, it's another optimization that won't work so well for lower NFactors. But I was actually seeing 99-100% GPU usage with the doubled hash rate.

I had the same thought as you before I discovered streams, but the NVIDIA Visual Profiler and the sample code convinced me otherwise. I had another raster compression routine that, when streamed, gave me a 40% increase when I thought it was already maxed out.

I'm just talking this out here. I know that in the current state the results won't be valid, due to the kernels tripping all over each other's memory space. I just wanted to see the potential. I'm going to take your suggestion, since you know the code, and see if I can understand it well enough to break it up into 4 regions. So far, I can tell this will be a steep learning curve.

Thanks
Joe
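A short sketch of that "all A's, sync, all B's" pattern, again with placeholder kernel names and arguments rather than the real test code:

Code:
#include <cuda_runtime.h>
#include <stdint.h>

// assumed stand-ins for the real kernels, as in the earlier sketch
__global__ void scrypt_core_kernelA(uint32_t *pad, uint32_t first_nonce, uint32_t n);
__global__ void scrypt_core_kernelB(uint32_t *pad, uint32_t first_nonce, uint32_t n);

void run_a_then_b(uint32_t *d_pad, uint32_t first_nonce,
                  uint32_t hashes_per_launch, int num_streams,
                  dim3 grid, dim3 block, cudaStream_t *streams)
{
    // queue every A launch into its own stream so the small launches can overlap
    for (int s = 0; s < num_streams; ++s)
        scrypt_core_kernelA<<<grid, block, 0, streams[s]>>>(
            d_pad, first_nonce + s * hashes_per_launch, hashes_per_launch);

    // device-wide barrier: no B may start before every A has finished
    cudaDeviceSynchronize();

    for (int s = 0; s < num_streams; ++s)
        scrypt_core_kernelB<<<grid, block, 0, streams[s]>>>(
            d_pad, first_nonce + s * hashes_per_launch, hashes_per_launch);

    cudaDeviceSynchronize();
}

Note that all launches here share a single scratchpad, which is exactly why the results don't validate; the per-region split sketched earlier avoids that at the cost of extra memory.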
|
|
|
|
13G
Newbie
Offline
Activity: 17
Merit: 0
|
 |
January 29, 2014, 09:03:10 PM |
|
Results from my latest build. Here is the launch config I stuck with and the results for -L2 to -L6:
./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4
-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65
Running with -L4, I have noticed that it has now settled down at 4.96 - 5.05 after being on for 4 hours, and I still have a lot of system resources left over.

Great improvement! Thank you! My GTX TITAN with "-a scrypt-jane -d 0 -i 0 -H 2 -C 0 -m 0 -b 32768 -L 5 -l T69x4 -s 120" now does 4.7 khash/s!
|
|
|
|
bigjme
|
 |
January 29, 2014, 09:05:10 PM |
|
No problem. I am hoping to find a way to allocate some more memory and GPU power, and with the improvement Christian has in store it should jump up a lot more.
|
Owner of: cudamining.co.uk
|
|
|
cbuchner1 (OP)
|
 |
January 29, 2014, 09:12:06 PM |
|
No problem. I am hoping to find a way to allocate some more memory and GPU power, and with the improvement Christian has in store it should jump up a lot more.
make that a "might" jump up a lot more. I've had my fair share of optimization failures...
|
|
|
|
bigjme
|
 |
January 29, 2014, 09:13:28 PM |
|
Even getting it to use more memory should give me an increase. I say should
|
Owner of: cudamining.co.uk
|
|
|
bathrobehero
Legendary
Offline
Activity: 2002
Merit: 1051
ICO? Not even once.
|
 |
January 29, 2014, 09:24:17 PM |
|
What's the reason behind failing to allocate more than 3 GB of VRAM on Titans? It seems that wherever you look (games, applications, whatever), there are always problems on that front.
|
Not your keys, not your coins!
|
|
|
bigjme
|
 |
January 29, 2014, 09:26:51 PM |
|
I believe it's to do with the memory bus speed limiting the amount of memory it can use.
|
Owner of: cudamining.co.uk
|
|
|
ManIkWeet
|
 |
January 29, 2014, 09:53:04 PM |
|
I believe it's to do with the memory bus speed limiting the amount of memory it can use.
Do you have any idea how logical that sounds? /sarcasm off. It probably has to do with the whole 32/64-bit thing; running an x64 build doesn't necessarily fix that either.
|
BTC donations: 18fw6ZjYkN7xNxfVWbsRmBvD6jBAChRQVn (thanks!)
|
|
|
bigjme
|
 |
January 29, 2014, 10:02:57 PM |
|
Repeating what someone else said lmao. Sarcasm not needed
|
Owner of: cudamining.co.uk
|
|
|
cbuchner1 (OP)
|
 |
January 29, 2014, 10:04:03 PM Last edit: January 29, 2014, 11:15:47 PM by cbuchner1 |
|
cbuchner1, did you note my earlier post about autotune problems and K kernel performance regression?
Okay, I have just replaced the ailing PSU in my main development PC, which allows me to put more stress on the GPUs again without it turning off unexpectedly.

So that regression really is bad: 254 kHash/s down to 204 kHash/s with the same kernel launch parameters between 2013-12-18 and the current github. That's a 20% drop in performance. I might play around a bit to see what I can find. I did not find the same problem with the T kernel, even though it underwent very similar changes!

EDIT1: the majority of the discrepancy stems from my redefinition of what "warp" means in Dave's Kepler kernel (to be more in line with the CUDA definition of a warp). Hence the equivalent launch config for the current github release has to use four times the number of blocks to be comparable. So I have to go from -l K7x32 to -l K28x32. Then I end up with a drop from 254 kHash/s to only 220 kHash/s. Still bad, but not quite that much.

EDIT2: I find my "simplifications" in read_keys_direct and write_keys_direct to be the culprit. Turns out this has a huge performance impact, despite requiring far fewer instructions.

Christian
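If the factor of four is read as the old kernel treating one "warp" as a group of four CUDA warps (an assumption for illustration, not taken from the code), the equivalence of the two configs is just this bookkeeping:

Code:
// old convention (assumed): one "warp" = 4 CUDA warps = 128 threads
//   K7x32  ->  7 blocks * 32 "warps" * 128 threads = 28672 threads
// new convention: a warp is the CUDA warp of 32 threads
//   K28x32 -> 28 blocks * 32 warps   *  32 threads = 28672 threads
// same total thread count, hence four times the number of blocks
static_assert(7 * 32 * 128 == 28 * 32 * 32, "equivalent launch configs");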
|
|
|
|
ollyweg
Newbie
Offline
Activity: 19
Merit: 0
|
 |
January 29, 2014, 10:40:11 PM |
|
Hi, I was wondering if you guys are doing only simple overclocking with Afterburner, or forcing P-states. On my EVGA 660 Ti I use NVIDIA Inspector to force the P2 state, which gets me up to 1215 MHz (300 MHz over stock). My hashrate also got a bit more stable with this, and autotuning has given much more precise results since then.
This gets me 340 khash/s for scrypt and 3.5 khash/s for scrypt-jane. This is with the 2014-01-22 version. So I'm wondering why I'm actually still below a value from the scrypt-jane spreadsheet which apparently uses almost no OC. I've tested lots of kernel configs but I can't seem to get any higher.
Any ideas?
My exact config:
scrypt: --interactive=0 --hash-parallel=2 --launch-config=Y112x2 --texture-cache=1 --single-memory=0
scrypt-jane: --interactive=0 --hash-parallel=1 --launch-config=K7x23 --texture-cache=0 --single-memory=0 --lookup-gap=3
OC: P0: clock offset +160, mem offset +300, power target 153% (actually uses about 142% TDP); P2: clock 1215 MHz, forced
Driver: 332.21
|
|
|
|
|