Bitcoin Forum
February 23, 2024, 05:45:11 PM *
News: Latest Bitcoin Core release: 26.0 [Torrent]
 
Pages: « 1 ... 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 [163] 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 ... 1135 »
Author Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]  (Read 3426758 times)
cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
January 29, 2014, 02:09:16 PM
 #3241

I have Visual Studio 2012, but I can't load the solution file. So I probably have to wait a few more days.

CUDA 5.5 is installed? VS 2012 should be able to upgrade the solution file automatically. But then you will have to make sure that all the other dependencies are available (OpenSSL, pthreads, libcurl....)
ManIkWeet
Full Member
***
Offline Offline

Activity: 182
Merit: 100


View Profile
January 29, 2014, 02:10:38 PM
 #3242

So my 780 getting over 5 isn't too bad then

but 6 or 7 would be nicer.

I have one optimization in mind that swaps the state of threads within the lookup_gap loop. The intention is to order threads by the loop trip count (some have to run for 0 loops, others a couple more up to the specified lookup_gap). By ordering them, some of the warps will terminate much earlier and not consume any computational resources.

This would (in theory) reduce the workload by nearly a factor of 2, but it introduces some overhead for sorting the threads and for shuffling the state around. Whether a net speed gain remains is yet to be seen.

I will save that optimization for February (it would delay this release...)

Christian
I shall happily beta test this when it is out in February  Grin

BTC donations: 18fw6ZjYkN7xNxfVWbsRmBvD6jBAChRQVn (thanks!)
cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
January 29, 2014, 02:11:35 PM
 #3243


To be more specific, I was using 4 streams on nv_scrypt_core_kernelA<ALGO_SCRYPT_JANE> and nv_scrypt_core_kernelB<ALGO_SCRYPT_JANE> inside the NVKernel::run_kernel.  So those are the kernels I was referring to.  Too bad the code in these kernels looks like witchcraft to me at the moment. LOL

what you have to know is that the kernels named "A" write to the scratchpad (yes, the ENTIRE scratchpad) and the kernels labeled "B" read from random positions in the scratchpad. So there is an A->B dependency: first A has to complete before B can run.

If you still want to try running multiple streams, divide the hashing (nonce) range into 4 equally sized regions

then you can run

A -> B region 1
A -> B region 2
A -> B region 3
A -> B region 4

all simultaneously on 4 streams, as their scratchpad areas do not overlap. But somehow I do not see the advantage of this: launch configurations are typically chosen so that a single stream already fully loads the GPU's multiprocessors.


I currently use two streams for different hashing (nonce) ranges, but fully concurrent execution would only be possible if you allocated several scratchpads (one per stream). Considering that video card memory is a scarce resource, this is probably not the best idea; with scrypt-jane coins especially, this is a problem.

Christian
patoberli
Member
**
Offline Offline

Activity: 106
Merit: 10


View Profile
January 29, 2014, 02:15:16 PM
 #3244

Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:
Code:
[2014-01-29 15:11:26] GPU #1: GeForce GT 640 result does not validate on CPU (i=23, s=0)!
[2014-01-29 15:11:27] GPU #1: GeForce GT 640, 1.45 khash/s
[2014-01-29 15:11:27] accepted: 51/51 (100.00%), 1.45 khash/s (yay!!!)

Otherwise it's running smoothly on Windows. The build is from today.
Start Parameters:
cudaminer.exe -a scrypt-jane -i 0 -l K27x2 -o http://yac.coinmine.pl:8882 -O ...:... -H 2 -d 1

YAC: YA86YiWSvWEGSSSerPTMy4kwndabRUNftf
BTC: 16NqvkYbKMnonVEf7jHbuWURFsLeuTRidX
LTC: LTKCoiDwqEjaRCoNXfFhDm9EeWbGWouZjE
cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
January 29, 2014, 02:16:38 PM
 #3245

Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:

I wish I knew what is causing these... try passing -b 8192 for a bit more speed.
patoberli
Member
**
Offline Offline

Activity: 106
Merit: 10


View Profile
January 29, 2014, 03:39:38 PM
 #3246

Thanks, does the -b parameter have an influence when running with -i 0?
In any case, it didn't change the speed visibly.

YAC: YA86YiWSvWEGSSSerPTMy4kwndabRUNftf
BTC: 16NqvkYbKMnonVEf7jHbuWURFsLeuTRidX
LTC: LTKCoiDwqEjaRCoNXfFhDm9EeWbGWouZjE
cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
January 29, 2014, 04:21:19 PM
 #3247

Thanks, does the -b parameter have an influence when running with -i 0?
In any case, it didn't change the speed visibly.

-b still has an influence, as it reduces the overhead of CUDA kernel calls. Bigger chunks of data to work with mean less overhead.
apluscarp
Newbie
*
Offline Offline

Activity: 12
Merit: 0


View Profile
January 29, 2014, 04:55:17 PM
 #3248

Where can I find the 2014-01-17 version?
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 05:29:55 PM
Last edit: January 29, 2014, 05:49:19 PM by bigjme
 #3249

Results from my latest build. Here is the launch config I stuck with, and the results for -L2 to -L6:

./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4

-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65

Running with -L4 I have noticed that it has now settled down at 4.96 - 5.05 after being on for 4 hours, and I still have a lot of system resources free, so I'm wondering how much it is actually using.

Memory-wise it is using 2412MiB / 3071MiB.
Not sure of the actual GPU usage; I wonder if I could get more memory usage than I have now, which may get me more out of it.

Owner of: cudamining.co.uk
whitesand77
Full Member
***
Offline Offline

Activity: 125
Merit: 100


View Profile
January 29, 2014, 05:38:28 PM
 #3250

But somehow I do not see the advantage of this: launch configurations are typically chosen so that a single stream already fully loads the GPU's multiprocessors.

If this were true my stream test would have given me a lower hash rate due to overhead but it doubled.  Just because MSI Afterburner or another program is reporting a 90 something % GPU usage doesn't mean streams won't help.  When I ran the two kernels I ran all A's first, synced, then the B's.  So as long as the batch size for the NFactor is small enough to spawn off enough kernels, they'll run concurrently.  Again, another optimization that won't work so well for lower NFactors.  But I was actually seeing 99-100% GPU usage with the doubled hash rate.  I had the same thought as you before I discovered streams but the NVIDIA Visual Profiler and the sample code convinced me otherwise.  I had another raster compression routine when streamed give me a 40% increase when I thought it was already maxed out.

I'm just talking this out here. I know that in the current state the results won't be valid, due to the kernels tripping all over each other's memory space. I just wanted to see the potential.

I'm going to take your suggestion, since you know the code, and see if I can understand it well enough to break it up into 4 regions.  So far, I can tell this will be a steep learning curve.

Thanks

Joe
13G
Newbie
*
Offline Offline

Activity: 17
Merit: 0


View Profile
January 29, 2014, 09:03:10 PM
 #3251

Results from my latest build. Here is the launch config I stuck with, and the results for -L2 to -L6:

./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4

-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65

Running with -L4 I have noticed that it has now settled down at 4.96 - 5.05 after being on for 4 hours, and I still have a lot of system resources free, so I'm wondering how much it is actually using.

Memory-wise it is using 2412MiB / 3071MiB.
Not sure of the actual GPU usage; I wonder if I could get more memory usage than I have now, which may get me more out of it.


Great improvement! Thank you!
GTX TITAN  with "-a scrypt-jane -d 0 -i 0 -H 2 -C 0 -m 0 -b 32768 -L 5 -l T69x4 -s 120" now 4.7khash/s !
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 09:05:10 PM
 #3252

No problem.
I am hoping to find a way to allocate some more memory and GPU power. And with the improvement Christian has in store it should jump up a lot more

Owner of: cudamining.co.uk
cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
January 29, 2014, 09:12:06 PM
 #3253

No problem.
I am hoping to find a way to allocate some more memory and GPU power. And with the improvement Christian has in store it should jump up a lot more

make that a "might" jump up a lot more.

I've had my fair share of optimization failures...
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 09:13:28 PM
 #3254

Even getting it to use more memory should give me an increase. I say should

Owner of: cudamining.co.uk
bathrobehero
Legendary
*
Offline Offline

Activity: 2002
Merit: 1050


ICO? Not even once.


View Profile
January 29, 2014, 09:24:17 PM
 #3255

What's the reason behind failing to allocate more than 3GB of VRAM on Titans?
It seems that wherever you look (games, applications, whatever) there are always problems on that front.

Not your keys, not your coins!
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 09:26:51 PM
 #3256

I believe it's to do with the memory bus speed limiting the amount of memory it can use

Owner of: cudamining.co.uk
ManIkWeet
Full Member
***
Offline Offline

Activity: 182
Merit: 100


View Profile
January 29, 2014, 09:53:04 PM
 #3257

I believe it's to do with the memory bus speed limiting the amount of memory it can use
You have any idea how logical that sounds?
/sarcasm off
Probably has to do with the whole 32/64-bit thing; running an x64 build doesn't necessarily fix that either.

BTC donations: 18fw6ZjYkN7xNxfVWbsRmBvD6jBAChRQVn (thanks!)
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 10:02:57 PM
 #3258

Repeating what someone else said lmao. Sarcasm not needed

Owner of: cudamining.co.uk
cbuchner1 (OP)
Hero Member
*****
Offline Offline

Activity: 756
Merit: 502


View Profile
January 29, 2014, 10:04:03 PM
Last edit: January 29, 2014, 11:15:47 PM by cbuchner1
 #3259

cbuchner1, did you note my earlier post about autotune problems and K kernel performance regression?

okay, I have just replaced the ailing PSU in my main development PC, which allows me to put more stress on the GPUs again without it turning off unexpectedly.

So that regression really is bad: 254 kHash/s down to 204 kHash/s with the same kernel launch parameters between 2013-12-18 and the current github.
That's a 20% drop in performance. I might play around a bit to see what I can find.

I did not find the same problem with the T kernel, even though it underwent very similar changes!

EDIT1: the majority of the discrepancy stems from my redefinition of what "warp" means in Dave's Kepler kernel (to be more in line with the CUDA definition of a warp). Hence the equivalent launch config for the current github release has to use four times the number of blocks to be comparable, so I have to go from -l K7x32 to -l K28x32. Then I end up with a drop from 254 kHash/s to only 220 kHash/s. Still bad, but not quite as much.

EDIT2: I find my "simplifications" in read_keys_direct and write_keys_direct to be the culprit. Turns out these have a huge performance impact, despite requiring far fewer instructions.

Christian

ollyweg
Newbie
*
Offline Offline

Activity: 19
Merit: 0


View Profile
January 29, 2014, 10:40:11 PM
 #3260

Hi, I was wondering if you guys are doing only simple overclocking with Afterburner, or forced P-states.
On my EVGA 660 Ti I use NVIDIA Inspector to force the P2 state, which gets me up to 1215MHz (300MHz over stock).
My hashrate also got a bit more stable with this, and autotuning has given much more precise results since then.

This gets me 340khash/s for scrypt and 3.5khash/s for scrypt-jane.
This is with the 2014-1-22 version.
So I'm wondering why I'm actually still below a value from the scrypt-jane spreadsheet, which apparently uses almost no OC.
I've tested lots of kernel cfgs but I can't seem to get any higher.

Any ideas?

My exact config:
scrypt: --interactive=0 --hash-parallel=2 --launch-config=Y112x2 --texture-cache=1 --single-memory=0
scrypt-jane: --interactive=0 --hash-parallel=1 --launch-config=K7x23 --texture-cache=0 --single-memory=0 --lookup-gap=3

OC:
P0: clock-offset +160; mem-offset +300; power-target 153% (actually uses about 142% TDP)
P2: clock 1215MHz; forced
Driver: 332.21