Bitcoin Forum
April 26, 2019, 09:07:35 AM *
Author Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]  (Read 3408633 times)
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 12:20:23 PM
Last edit: January 29, 2014, 12:38:59 PM by bigjme
 #3241

Yes, that is weird. I'm going to do a full run from -L 2 to -L 6 and will post what I get for each.

How much is your 660 Ti getting, Christian?

Owner of: cudamining.co.uk
Espie
Newbie
*
Offline Offline

Activity: 5
Merit: 0


View Profile
January 29, 2014, 12:51:44 PM
 #3242

Hi Christian,

I noticed some nice development on the Nvidia Developer Zone. When will you make this version available?
https://devtalk.nvidia.com/default/topic/643428/cuda-programming-and-performance/could-anyone-benchmark-this-for-me-on-a-780-ti-or-titan-/

Dennis

bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 01:08:46 PM
 #3243

OK, so just a quick drop of numbers before I post the results tonight. The latest cudaminer on my 780 is now getting 5.03 khash/s.

I still have use of my desktop with it, and that is with T69x4 and -L5. My CPU is also slightly more free now, so it does a constant 0.72 khash/s. So I've gone from 4.3 khash/s to 5.8 khash/s.

Not a bad jump.

Owner of: cudamining.co.uk
ghur
Full Member
***
Offline Offline

Activity: 154
Merit: 100


View Profile
January 29, 2014, 01:27:15 PM
 #3244

<snip>

Alright, thank you.

Sounds reasonable enough for me :)

doge: D8q8dR6tEAcaJ7U65jP6AAkiiL2CFJaHah
Automated faucet, pays daily: Qoinpro
whitesand77
Full Member
***
Offline Offline

Activity: 125
Merit: 100


View Profile
January 29, 2014, 01:30:34 PM
 #3245

I've been experimenting with streams on the Y kernel. So far I've tested this on YAC and got 5.3 khash/s on my 660 Ti. Too bad it doesn't validate on the CPU, though. The kernel must not be concurrency-safe =).
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 01:43:53 PM
 #3246

I've been experimenting with streams on the Y kernel. So far I've tested this on YAC and got 5.3 khash/s on my 660 Ti. Too bad it doesn't validate on the CPU, though. The kernel must not be concurrency-safe =).

Yes, right: there is one scratchpad but two streams. The scrypt_core kernels have to be serialized, or they would destroy each other's scratchpad. This is why I am using CUDA events.

Some overlap of memcpy and kernels would be desirable (it is not happening now because of the order in which commands are issued), and possibly the SHA256/Keccak kernels of one stream could be executed concurrently with the scrypt_core kernels of the other stream. This is also not happening now, because my CUDA events currently serialize these as well (I need to change when the events are generated and synchronized upon).

I intend to get rid of memcpy altogether by checking hashes on the GPU instead, so the memcpy/kernel overlap issue is moot.

Christian
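
To illustrate why two streams that share one scratchpad have to be serialized, here is a minimal host-side C++ sketch. It is not cudaMiner code and uses no CUDA at all: `Phase`, `Step`, `expected_value` and `count_valid_streams` are hypothetical names, and a plain `std::vector` stands in for the GPU scratchpad. It only replays each stream's write phase (A) and read phase (B) in a given order and counts how many streams read back what they wrote.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: phase A fills the whole scratchpad, phase B reads it back.
enum class Phase { A, B };

struct Step {
    int stream;   // which stream issues this step
    Phase phase;  // write phase (A) or read phase (B)
};

// Value that stream `s` writes to (and later expects at) scratchpad slot `i`.
inline uint32_t expected_value(int s, std::size_t i) {
    return static_cast<uint32_t>(s) * 1000003u + static_cast<uint32_t>(i);
}

// Replays `schedule` in order against ONE shared scratchpad and returns the
// number of streams whose B phase found exactly what their A phase wrote.
int count_valid_streams(const std::vector<Step>& schedule, std::size_t pad_size) {
    assert(pad_size > 0);
    std::vector<uint32_t> pad(pad_size, 0);
    int valid = 0;
    for (const Step& st : schedule) {
        if (st.phase == Phase::A) {
            for (std::size_t i = 0; i < pad_size; ++i)
                pad[i] = expected_value(st.stream, i);
        } else {
            bool ok = true;
            for (std::size_t i = 0; i < pad_size; ++i)
                if (pad[i] != expected_value(st.stream, i)) ok = false;
            if (ok) ++valid;
        }
    }
    return valid;
}
```

With the serialized order A0, B0, A1, B1 both streams validate. With the interleaved order A0, A1, B0, B1, stream 0 reads data written by stream 1 and fails, which is the kind of scratchpad corruption the events are there to prevent.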
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 01:45:38 PM
 #3247

Hi Christian,

I noticed some nice development on the Nvidia Developer Zone. When will you make this version available?
https://devtalk.nvidia.com/default/topic/643428/cuda-programming-and-performance/could-anyone-benchmark-this-for-me-on-a-780-ti-or-titan-/

Dennis


Get it from GitHub, or wait for the next official release (only a few more days...)
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 01:45:55 PM
 #3248

Yes, that is weird. I'm going to do a full run from -L 2 to -L 6 and will post what I get for each.

How much is your 660 Ti getting, Christian?

3.7 kHash/s give or take.
bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 01:47:35 PM
 #3249

So my 780 getting over 5 isn't too bad then.

Owner of: cudamining.co.uk
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 01:50:34 PM
 #3250

So my 780 getting over 5 isn't too bad then.

but 6 or 7 would be nicer.

I have one optimization in mind that swaps the state of threads within the lookup_gap loop. The intention is to order threads by the loop trip count (some have to run for 0 loops, others a couple more up to the specified lookup_gap). By ordering them, some of the warps will terminate much earlier and not consume any computational resources.

This would (in theory) reduce the workload by nearly a factor of 2, but it introduces some overhead for sorting the threads and for shuffling the state around. Whether a net speed gain remains is yet to be seen.

I will save that optimization for February (it would delay this release...)

Christian
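
The expected saving can be estimated with a small host-side model in C++. This is only a sketch under a simplified lockstep assumption: a warp costs as many loop iterations as its slowest thread, and the sorting and state-shuffling overhead mentioned above is ignored. The names `cyclic_trip_counts`, `warp_loop_cost` and `warp_loop_cost_sorted` are made up for this example, and the cyclic trip-count pattern is an arbitrary test input, not measured data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Arbitrary test input: thread i needs (i % lookup_gap) loop iterations.
std::vector<unsigned> cyclic_trip_counts(std::size_t n, unsigned lookup_gap) {
    assert(lookup_gap > 0);
    std::vector<unsigned> trips(n);
    for (std::size_t i = 0; i < n; ++i)
        trips[i] = static_cast<unsigned>(i % lookup_gap);
    return trips;
}

// Lockstep cost model: each warp iterates until its slowest thread is done,
// so its cost is the maximum trip count among its threads.
std::size_t warp_loop_cost(const std::vector<unsigned>& trips, std::size_t warp_size) {
    assert(warp_size > 0);
    std::size_t total = 0;
    for (std::size_t base = 0; base < trips.size(); base += warp_size) {
        std::size_t end = std::min(base + warp_size, trips.size());
        auto first = trips.begin() + static_cast<std::ptrdiff_t>(base);
        auto last = trips.begin() + static_cast<std::ptrdiff_t>(end);
        total += *std::max_element(first, last);
    }
    return total;
}

// Same cost model after ordering threads by trip count, so that threads
// with similar loop counts end up in the same warp.
std::size_t warp_loop_cost_sorted(std::vector<unsigned> trips, std::size_t warp_size) {
    std::sort(trips.begin(), trips.end());
    return warp_loop_cost(trips, warp_size);
}
```

For 128 threads, a warp size of 32 and trip counts cycling through 0 to 3, this model gives 12 warp iterations unsorted and 6 sorted, which matches the factor of 2 hoped for in theory.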

bigjme
Sr. Member
****
Offline Offline

Activity: 350
Merit: 250


View Profile
January 29, 2014, 01:53:16 PM
Last edit: January 29, 2014, 02:09:33 PM by bigjme
 #3251

That would be nice to see.
I will gladly test it, Christian, if you want to send it through while you're working on it.

Owner of: cudamining.co.uk
Espie
Newbie
*
Offline Offline

Activity: 5
Merit: 0


View Profile
January 29, 2014, 02:04:00 PM
 #3252

Hi Christian,

I noticed some nice development on the Nvidia Developer Zone. When will you make this version available?
https://devtalk.nvidia.com/default/topic/643428/cuda-programming-and-performance/could-anyone-benchmark-this-for-me-on-a-780-ti-or-titan-/

Dennis


Get it from GitHub, or wait for the next official release (only a few more days...)

I have Visual Studio 2012, but I can't load the solution file. So I probably have to wait a few more days.
whitesand77
Full Member
***
Offline Offline

Activity: 125
Merit: 100


View Profile
January 29, 2014, 02:08:43 PM
 #3253

I've been experimenting with streams on the Y kernel. So far I've tested this on YAC and got 5.3 khash/s on my 660 Ti. Too bad it doesn't validate on the CPU, though. The kernel must not be concurrency-safe =).

Yes, right: there is one scratchpad but two streams. The scrypt_core kernels have to be serialized, or they would destroy each other's scratchpad. This is why I am using CUDA events.

Some overlap of memcpy and kernels would be desirable (it is not happening now because of the order in which commands are issued), and possibly the SHA256/Keccak kernels of one stream could be executed concurrently with the scrypt_core kernels of the other stream. This is also not happening now, because my CUDA events currently serialize these as well (I need to change when the events are generated and synchronized upon).

I intend to get rid of memcpy altogether by checking hashes on the GPU instead, so the memcpy/kernel overlap issue is moot.

Christian


To be more specific, I was using 4 streams on nv_scrypt_core_kernelA<ALGO_SCRYPT_JANE> and nv_scrypt_core_kernelB<ALGO_SCRYPT_JANE> inside the NVKernel::run_kernel.  So those are the kernels I was referring to.  Too bad the code in these kernels looks like witchcraft to me at the moment. LOL
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 02:09:16 PM
 #3254

I have Visual Studio 2012, but I can't load the solution file. So I probably have to wait a few more days.

Is CUDA 5.5 installed? VS 2012 should be able to upgrade the solution file automatically. But then you will have to make sure that all the other dependencies are available (OpenSSL, pthreads, libcurl...).
ManIkWeet
Full Member
***
Offline Offline

Activity: 182
Merit: 100


View Profile
January 29, 2014, 02:10:38 PM
 #3255

So my 780 getting over 5 isn't too bad then.

but 6 or 7 would be nicer.

I have one optimization in mind that swaps the state of threads within the lookup_gap loop. The intention is to order threads by the loop trip count (some have to run for 0 loops, others a couple more up to the specified lookup_gap). By ordering them, some of the warps will terminate much earlier and not consume any computational resources.

This would (in theory) reduce the workload by nearly a factor of 2, but it introduces some overhead for sorting the threads and for shuffling the state around. Whether a net speed gain remains is yet to be seen.

I will save that optimization for February (it would delay this release...)

Christian
I shall happily beta test this when it is out in February :D

BTC donations: 18fw6ZjYkN7xNxfVWbsRmBvD6jBAChRQVn (thanks!)
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 02:11:35 PM
 #3256


To be more specific, I was using 4 streams on nv_scrypt_core_kernelA<ALGO_SCRYPT_JANE> and nv_scrypt_core_kernelB<ALGO_SCRYPT_JANE> inside the NVKernel::run_kernel.  So those are the kernels I was referring to.  Too bad the code in these kernels looks like witchcraft to me at the moment. LOL

What you have to know is that the kernels named "A" write to the scratchpad (yes, the ENTIRE scratchpad) and the kernels labeled "B" read from random positions in the scratchpad. So there is an A->B dependency: A has to complete before B can run.

If you still want to try running multiple streams, divide the hashing (nonce) range into 4 equally sized regions.

Then you can run

A -> B region 1
A -> B region 2
A -> B region 3
A -> B region 4

all simultaneously on 4 streams, as their scratchpad areas do not overlap. But somehow I do not see the advantage of this. Typically, launch configurations are chosen so that a single stream already fully loads the GPU's multiprocessors.


I currently use two streams for different hashing (nonce) ranges, but fully concurrent execution would only be possible if you allocated several scratchpads (one per stream). Considering that video card memory is a scarce resource, this is probably not the best idea. This is a problem especially with scrypt-jane coins.

Christian
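
The region split described above can be sketched on the host side in C++. This is an illustration only, not cudaMiner's actual bookkeeping: `StreamRegion` and `split_nonce_range` are hypothetical names, and the scratchpad size per hash is passed in as a parameter instead of being derived from the algorithm or the lookup gap. Each stream gets an equal share of the nonce range and its own non-overlapping slice of the scratchpad.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the work handed to one stream.
struct StreamRegion {
    uint32_t first_nonce;        // first nonce hashed by this stream
    uint32_t nonce_count;        // number of nonces in this region
    std::size_t scratch_offset;  // byte offset of this stream's scratchpad slice
    std::size_t scratch_bytes;   // size of that slice in bytes
};

// Splits [first_nonce, first_nonce + total_nonces) into equally sized regions,
// one per stream, each with a private, non-overlapping scratchpad slice.
std::vector<StreamRegion> split_nonce_range(uint32_t first_nonce,
                                            uint32_t total_nonces,
                                            unsigned num_streams,
                                            std::size_t bytes_per_hash) {
    assert(num_streams > 0 && total_nonces % num_streams == 0);
    const uint32_t per_stream = total_nonces / num_streams;
    std::vector<StreamRegion> regions;
    regions.reserve(num_streams);
    for (unsigned s = 0; s < num_streams; ++s) {
        StreamRegion r;
        r.first_nonce = first_nonce + s * per_stream;
        r.nonce_count = per_stream;
        r.scratch_bytes = static_cast<std::size_t>(per_stream) * bytes_per_hash;
        r.scratch_offset = static_cast<std::size_t>(s) * r.scratch_bytes;
        regions.push_back(r);
    }
    return regions;
}
```

As an example, with 4096 nonces, 4 streams and 131072 bytes per hash (the 128 * N * r footprint of scrypt with N = 1024 and r = 1, before any lookup-gap reduction), each stream gets 1024 nonces and a 128 MiB slice, 512 MiB in total. That shows how quickly per-stream scratchpads eat into video memory.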
patoberli
Member
**
Offline Offline

Activity: 106
Merit: 10


View Profile
January 29, 2014, 02:15:16 PM
 #3257

Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:
Code:
[2014-01-29 15:11:26] GPU #1: GeForce GT 640 result does not validate on CPU (i=23, s=0)!
[2014-01-29 15:11:27] GPU #1: GeForce GT 640, 1.45 khash/s
[2014-01-29 15:11:27] accepted: 51/51 (100.00%), 1.45 khash/s (yay!!!)

Otherwise it's running smoothly on Windows. The build was made today.
Start Parameters:
cudaminer.exe -a scrypt-jane -i 0 -l K27x2 -o http://yac.coinmine.pl:8882 -O ...:... -H 2 -d 1

YAC: YA86YiWSvWEGSSSerPTMy4kwndabRUNftf
BTC: 16NqvkYbKMnonVEf7jHbuWURFsLeuTRidX
LTC: LTKCoiDwqEjaRCoNXfFhDm9EeWbGWouZjE
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 02:16:38 PM
 #3258

Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:

I wish I knew what is causing these... Try passing -b 8192 for a bit more speed.
patoberli
Member
**
Offline Offline

Activity: 106
Merit: 10


View Profile
January 29, 2014, 03:39:38 PM
 #3259

Thanks. Does the -b parameter have an influence when running with -i 0?
In any case, it didn't change the speed visibly.

YAC: YA86YiWSvWEGSSSerPTMy4kwndabRUNftf
BTC: 16NqvkYbKMnonVEf7jHbuWURFsLeuTRidX
LTC: LTKCoiDwqEjaRCoNXfFhDm9EeWbGWouZjE
cbuchner1
Hero Member
*****
Offline Offline

Activity: 756
Merit: 500


View Profile
January 29, 2014, 04:21:19 PM
 #3260

Thanks. Does the -b parameter have an influence when running with -i 0?
In any case, it didn't change the speed visibly.

-b still has an influence: it reduces the overhead of CUDA kernel calls. Bigger chunks of data to work with mean less overhead.
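
The effect of bigger chunks on call overhead can be put into numbers with a small C++ sketch. It is a back-of-the-envelope model, not something taken from cudaMiner: `kernel_launches` and `launch_overhead_us` are hypothetical helpers, the unit that -b actually counts is left abstract as "work units", and the per-launch overhead is a made-up input, not a measurement.

```cpp
#include <cassert>
#include <cstdint>

// Number of kernel launches needed to cover `total_units` of work when each
// launch processes at most `batch` units (ceiling division).
uint64_t kernel_launches(uint64_t total_units, uint64_t batch) {
    assert(batch > 0);
    return (total_units + batch - 1) / batch;
}

// Total fixed launch overhead in microseconds, assuming every launch costs
// the same `overhead_per_launch_us` regardless of how much work it carries.
double launch_overhead_us(uint64_t total_units, uint64_t batch,
                          double overhead_per_launch_us) {
    return static_cast<double>(kernel_launches(total_units, batch)) *
           overhead_per_launch_us;
}
```

For 65536 work units, a batch of 1024 needs 64 launches while a batch of 8192 needs only 8, so the fixed per-call cost is paid an eighth as often.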