Bitcoin Forum
May 22, 2024, 05:48:38 PM *
Author Topic: CCminer(SP-MOD) Modded GPU kernels.  (Read 2347500 times)
djeZo
Hero Member
*****
Offline Offline

Activity: 588
Merit: 520


View Profile
October 16, 2015, 02:21:40 PM
 #6881

What speeds do you get on GTX 980 Ti and GTX 950 with Lyra2REv2?
I get:
GTX 980 Ti ... 17.450 khs
GTX 950 ... 5.480 khs
Clocks? OS? Build?
Around 1400 MHz on both cards, Windows, latest SP build. But I tweaked some parameters; originally I was getting 17 khs on the 980 Ti and 5 khs on the 950.

Would you mind sharing your parameters? :)

Sure:

Code:
// Per-card launch parameters, matched on the CUDA device name.
// "980 Ti" and "750 Ti" are tested before "980" and "750" because
// strstr() would otherwise match the shorter name first.
if (strstr(props.name, "970"))
{
    intensity = 256 * 256 * 20;
}
else if (strstr(props.name, "980 Ti"))
{
    intensity = 256 * 256 * 18;
    tpb = 8;
}
else if (strstr(props.name, "980"))
{
    intensity = 256 * 256 * 16;
}
else if (strstr(props.name, "750 Ti"))
{
    intensity = 256 * 256 * 5;
    tpb = 16;
}
else if (strstr(props.name, "750"))
{
    intensity = 256 * 256 * 5;
    tpb = 16;
}
else if (strstr(props.name, "960"))
{
    intensity = 256 * 256 * 6;
}
else if (strstr(props.name, "950"))
{
    intensity = 256 * 256 * 18;
    tpb = 11;
}
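
The same name matching can also be written as a lookup table, which makes the ordering requirement explicit. This is only a minimal sketch in plain C: the struct, the table and the function name `find_launch_cfg` are hypothetical and not taken from the ccminer source, and treating `intensity` as the work size per launch is an assumption. The values are copied from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Launch parameters for one GPU model (hypothetical type, for illustration). */
struct launch_cfg {
    const char *name_substr;  /* substring searched for in the device name */
    unsigned    intensity;    /* work size from the snippet (assumed: hashes per launch) */
    unsigned    tpb;          /* threads per block; 0 means "keep the default" */
};

/* More specific names must come before their prefixes ("980 Ti" before
 * "980"), because the first strstr() hit wins. */
static const struct launch_cfg k_cfgs[] = {
    { "970",    256 * 256 * 20,  0 },
    { "980 Ti", 256 * 256 * 18,  8 },
    { "980",    256 * 256 * 16,  0 },
    { "750 Ti", 256 * 256 * 5,  16 },
    { "750",    256 * 256 * 5,  16 },
    { "960",    256 * 256 * 6,   0 },
    { "950",    256 * 256 * 18, 11 },
};

/* Returns the first matching entry, or NULL if the card is not listed. */
const struct launch_cfg *find_launch_cfg(const char *device_name)
{
    assert(device_name != NULL);
    for (size_t i = 0; i < sizeof k_cfgs / sizeof k_cfgs[0]; i++) {
        if (strstr(device_name, k_cfgs[i].name_substr) != NULL)
            return &k_cfgs[i];
    }
    return NULL;
}
```

Calling `find_launch_cfg("GeForce GTX 980 Ti")` returns the entry with `tpb` 8; a card that is not in the table yields NULL, so the caller can fall back to its defaults.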

djeZo
Hero Member
*****
Offline Offline

Activity: 588
Merit: 520


View Profile
October 16, 2015, 02:24:39 PM
 #6882

I was experiencing the same kind of issue when I was writing the Axiom CUDA algo. On my 980 Ti, which has 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle, which would explain an actual memory cutoff at around 2.1 GB, the same as other Maxwell v2 cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks; maybe it can reveal another hint (I used GPU-Z to measure bus interface load).

Bus interface load is, to my knowledge, how heavily the PCIe bus is loaded with data. My algorithm implementation was sending very little data over this bus, nothing that should load PCIe 3.0 x16 so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I more or less gave up. But now that you have shown this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory, as if some part of GPU memory was set to stay in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this large allocated buffer. I believe CUDA even has a name for this: Unified Memory. To my knowledge, it is only active when you explicitly request it. What if it is active even when you do not? Or maybe it is a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.
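
To put the 30-50% figure and the 1x riser test into numbers, here is a rough back-of-the-envelope sketch in C. It uses the PCIe 3.0 signalling rate (8 GT/s per lane, 128b/130b coding) and ignores protocol overhead. That the monitoring tool's load percentage scales linearly with link utilisation is an assumption, and the function names are made up for this example.

```c
#include <assert.h>

/* Theoretical one-direction PCIe 3.0 rate in bytes per second:
 * 8 GT/s per lane with 128b/130b line coding, protocol overhead ignored. */
double pcie3_bytes_per_sec(int lanes)
{
    assert(lanes > 0);
    const double transfers_per_sec = 8e9;           /* one bit per transfer per lane */
    const double coding_efficiency = 128.0 / 130.0; /* 128b/130b line coding */
    return lanes * transfers_per_sec * coding_efficiency / 8.0;
}

/* Traffic implied by a load percentage, ASSUMING the reading is a linear
 * fraction of the link's theoretical rate. */
double implied_traffic(int lanes, double load_percent)
{
    return pcie3_bytes_per_sec(lanes) * load_percent / 100.0;
}
```

Under these assumptions an x16 link tops out at roughly 15.75 GB/s, so 30% load would be about 4.7 GB/s, several times more than an x1 link (about 0.98 GB/s) can carry. If the traffic is real, the slowdown on a 1x riser should therefore be far worse.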

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the allocation is split into two intensity blocks. -i 25 will cause an out-of-memory error, while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (the blake 512 rewrite), so it might be slower.
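
The arithmetic behind "same memory, split in two" can be sketched as follows. This assumes that an integer intensity i means 2^i hashes per launch and uses a purely hypothetical per-hash footprint; neither the helper names nor the 64-byte figure in the note below come from the miner's source.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: intensity i means 2^i hashes per launch. */
uint64_t hashes_per_launch(unsigned intensity)
{
    assert(intensity < 64);
    return (uint64_t)1 << intensity;
}

/* Size of one buffer for a given (hypothetical) per-hash footprint. */
uint64_t bytes_per_buffer(unsigned intensity, uint64_t bytes_per_hash)
{
    return hashes_per_launch(intensity) * bytes_per_hash;
}

/* Total over all -g instances, each with its own separate buffer. */
uint64_t total_bytes(unsigned instances, unsigned intensity,
                     uint64_t bytes_per_hash)
{
    return instances * bytes_per_buffer(intensity, bytes_per_hash);
}
```

With a made-up footprint of 64 bytes per hash, -i 25 would need one 2 GiB buffer, while -g 2 -i 24 would need two 1 GiB buffers: the total is identical, but no single allocation is as large.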


I did some further analysis and there must be some serious memory sync or copying or something happening over PCIE bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK, even if total memory usage is above 2 GB?

bensam1231
Legendary
*
Offline Offline

Activity: 1750
Merit: 1024


View Profile
October 16, 2015, 03:32:54 PM
 #6883

Still testing out different difficulty settings for Myr-Gr for DigiByte and nothing is working out. It would definitely be worth looking into, as it's more worth mining than Quark right now, even without an SP-enhanced version (I'm using tpruvot's).

I buy private Nvidia miners. Send information and/or inquiries to my PM box.
Genoil
Sr. Member
****
Offline Offline

Activity: 438
Merit: 250


View Profile
October 16, 2015, 04:59:12 PM
 #6884

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK, even if total memory usage is above 2 GB?

Tried chunks on ethminer kernel. No difference at all.

ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d
BTC: 1Nu2fMCEBjmnLzqb8qUJpKgq5RoEWFhNcW
djeZo
Hero Member
*****
Offline Offline

Activity: 588
Merit: 520


View Profile
October 16, 2015, 05:07:36 PM
 #6885

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK, even if total memory usage is above 2 GB?

Tried chunks on ethminer kernel. No difference at all.

Yep, I checked that on my Axiom algo: no difference, it actually got even slower.

It looks as if WDDM has a hardcoded limit of 2 GB: when you load that much or more, it starts paging to host memory, regardless of how much memory the video card has. This seems to me like a big issue and makes all high end NVIDIA cards useless for mining as you can never exploit their true potential for mining.

Also, since GTX 750 Ti has this limit at 1 giga, it makes all 2 gig versions of GTX 750 Ti useless for mining (memory hard algorithms).

Sad sad sad...

myagui
Legendary
*
Offline Offline

Activity: 1154
Merit: 1001



View Profile
October 16, 2015, 05:12:55 PM
 #6886

[...] This seems to me like a big issue and makes all high end NVIDIA cards useless for mining as you can never exploit their true potential for mining.

Also, since GTX 750 Ti has this limit at 1 giga, it makes all 2 gig versions of GTX 750 Ti useless for mining (memory hard algorithms).

Sad sad sad...

Or ... you can switch to Linux?

Grim
Sr. Member
****
Offline Offline

Activity: 504
Merit: 252


View Profile
October 16, 2015, 05:39:00 PM
 #6887

[...] This seems to me like a big issue and makes all high end NVIDIA cards useless for mining as you can never exploit their true potential for mining.

Also, since GTX 750 Ti has this limit at 1 giga, it makes all 2 gig versions of GTX 750 Ti useless for mining (memory hard algorithms).

Sad sad sad...

Or ... you can switch to Linux?

Linux is degrading as well! Just not as badly as Windows WDDM.
sp_ (OP)
Legendary
*
Offline Offline

Activity: 2912
Merit: 1087

Team Black developer


View Profile
October 16, 2015, 05:43:05 PM
Last edit: October 16, 2015, 05:53:40 PM by sp_
 #6888

Chunked memory with linear reads and writes works. It is 10% faster in my private cryptonight mod. I can use the optimal launch config that previously only worked on Linux.

But on Windows 7 only.

8.1 and 10 don't work.
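
For readers wondering what "chunked memory with linear reads and writes" could look like, here is a minimal sketch in C of the index mapping only. It is not the private mod's code: the struct and function names are hypothetical, and the real kernel layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a logical element inside a buffer split into equal chunks. */
struct chunk_pos {
    size_t chunk;   /* which allocation the element lives in */
    size_t offset;  /* element index inside that allocation */
};

/* Number of chunks needed to hold total_elems (rounded up). */
size_t chunk_count(size_t total_elems, size_t chunk_elems)
{
    assert(chunk_elems > 0);
    return (total_elems + chunk_elems - 1) / chunk_elems;
}

/* Map a logical index to (chunk, offset). Consecutive indices stay
 * consecutive inside a chunk, so a linear walk remains linear. */
struct chunk_pos locate(size_t index, size_t chunk_elems)
{
    assert(chunk_elems > 0);
    struct chunk_pos p = { index / chunk_elems, index % chunk_elems };
    return p;
}
```

Each chunk would be its own allocation kept below the size at which the slowdown appears. A kernel that walks indices in order only crosses a chunk boundary once per `chunk_elems` elements, which a random access pattern cannot guarantee.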

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
myagui
Legendary
*
Offline Offline

Activity: 1154
Merit: 1001



View Profile
October 16, 2015, 05:55:34 PM
 #6889

@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.

Genoil
Sr. Member
****
Offline Offline

Activity: 438
Merit: 250


View Profile
October 16, 2015, 09:48:25 PM
 #6890

@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.

I haven't produced the Linux values myself, so it could be unknown circumstances that have caused that drop. Then again, the values for the TCC driver that @allanmac showed look pretty similar.

Also, since GTX 750 Ti has this limit at 1 giga, it makes all 2 gig versions of GTX 750 Ti useless for mining (memory hard algorithms).
Sad sad sad...

I got a report from somebody with a 4 GB 750 Ti that the limit was at 1 GB there too. On Win8/10 the limit is 512 MB for Maxwell 1 and 1024 MB for Maxwell 2.
Chunked memory with linear reads and writes works. It is 10% faster in my private cryptonight mod. I can use the optimal launch config that previously only worked on Linux.

But on Windows 7 only.

8.1 and 10 don't work.

ETH uses a random access pattern, so that may explain the difference.
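
The thresholds mentioned in this thread can be collected into a small lookup, sketched here in C. The numbers are only what posters reported, not documented driver behaviour. That the 1 GB (750 Ti) and roughly 2 GB (Maxwell 2) figures belong to Windows 7 is inferred from context, and the enum and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

enum gpu_gen { MAXWELL_1, MAXWELL_2 };   /* e.g. GTX 750 Ti vs. GTX 9xx */
enum win_ver { WIN_7, WIN_8_OR_10 };

/* Approximate total buffer size in MiB above which posters saw the
 * slowdown. Values as reported in the thread, not measured here. */
unsigned reported_limit_mib(enum gpu_gen gen, enum win_ver os)
{
    assert(gen == MAXWELL_1 || gen == MAXWELL_2);
    if (os == WIN_7)
        return gen == MAXWELL_1 ? 1024u : 2048u;
    return gen == MAXWELL_1 ? 512u : 1024u;
}

/* Nonzero if a planned buffer is larger than the reported threshold. */
int exceeds_reported_limit(uint64_t buffer_bytes, enum gpu_gen gen,
                           enum win_ver os)
{
    return buffer_bytes > (uint64_t)reported_limit_mib(gen, os) * 1024u * 1024u;
}
```

A miner could use a check like this to warn before picking an intensity, but the table would need confirming on real hardware first.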

ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d
BTC: 1Nu2fMCEBjmnLzqb8qUJpKgq5RoEWFhNcW
hashbrown9000
Sr. Member
****
Offline Offline

Activity: 427
Merit: 250


View Profile
October 16, 2015, 10:27:20 PM
 #6891

Does someone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with "above target" error.

Here's mine for the ACX EVGA 750ti:

Code:
./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x

Pinkcoin:
ETH:
VTC:
BTC:
myagui
Legendary
*
Offline Offline

Activity: 1154
Merit: 1001



View Profile
October 17, 2015, 12:05:32 AM
 #6892

Does someone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with "above target" error.

Here's mine for the ACX EVGA 750ti:

Code:
./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x


IIRC, Leo is now at nfactor 16, so:
-a scrypt-jane:16

Edit: And you will probably need to tweak those launch parameters, as is customary across nfactor changes.
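
Why the launch parameters need retuning: each nfactor step doubles the scratchpad per hash. A small sketch in C, assuming the usual scrypt-jane convention N = 2^(nfactor + 1) with r = 1, and not modelling the lookup gap (-L), which trades memory for recomputation. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* N parameter for a given nfactor: N = 2^(nfactor + 1). */
uint64_t scrypt_jane_n(unsigned nfactor)
{
    assert(nfactor < 56);   /* keeps 128 * N within 64 bits */
    return (uint64_t)1 << (nfactor + 1);
}

/* Scratchpad bytes per hash: 128 * r * N, with r = 1 assumed. */
uint64_t scratchpad_bytes(unsigned nfactor)
{
    const uint64_t r = 1;
    return 128u * r * scrypt_jane_n(nfactor);
}
```

Under these assumptions nfactor 15 needs 8 MiB per hash and nfactor 16 needs 16 MiB, so a launch config that fit in memory at 15 can run out of memory at 16 unless fewer hashes run concurrently.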

coinut
Full Member
***
Offline Offline

Activity: 253
Merit: 100


View Profile
October 17, 2015, 12:30:23 AM
 #6893

Just compiled and tested your latest commit, 1117, from 16/10/15.

Card is a GTX 750 Ti @ 1400 MHz, installed directly on the motherboard
Algo is Lyra2REv2
OS is Win7 x64

With it I see 5106 displayed locally on the miner side.

With the commit from 19/9/15:

I get 5155 locally on the miner.



 
hashbrown9000
Sr. Member
****
Offline Offline

Activity: 427
Merit: 250


View Profile
October 17, 2015, 03:34:00 AM
 #6894

found a good launch config on the ACX 750ti: -l T15x1. With n-factor 16, I get ~ 310 H/s with tpruvot's miner.  Tried same settings with sp_'s and it errors out.

Code:
[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)

[2015-10-17 05:30:33] GPU #0: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #1: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #0: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)

[2015-10-17 05:30:33] GPU #0: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)

[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)

Pinkcoin:
ETH:
VTC:
BTC:
ZenFr
Legendary
*
Offline Offline

Activity: 1260
Merit: 1046



View Profile
October 17, 2015, 09:30:39 AM
 #6895

Happy birthday CCMiner_SP ;-).
antantti
Legendary
*
Offline Offline

Activity: 1176
Merit: 1015


View Profile
October 17, 2015, 09:53:53 AM
 #6896

Skål again!

f225813a4921c02318b86bbdb58ca48036ef4c481b7d123f9b429d6fd34ba800
sp_ (OP)
Legendary
*
Offline Offline

Activity: 2912
Merit: 1087

Team Black developer


View Profile
October 17, 2015, 01:03:14 PM
 #6897

Skål again!
f225813a4921c02318b86bbdb58ca48036ef4c481b7d123f9b429d6fd34ba800

I don't think this was to my BTC address?

But thanks for your support..

The birthday present will be a little delayed

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner
antantti
Legendary
*
Offline Offline

Activity: 1176
Merit: 1015


View Profile
October 17, 2015, 01:24:02 PM
 #6898

Skål again!
f225813a4921c02318b86bbdb58ca48036ef4c481b7d123f9b429d6fd34ba800

I don't think this was to my BTC address?

But thanks for your support..

The birthday present will be a little delayed

Oops!

97b0c5263d8fb86375bf0da8a56e9fd81f591fcd3bb26f546a50ffb5f937b5f3
flipclip
Member
**
Offline Offline

Activity: 111
Merit: 10


View Profile
October 17, 2015, 02:24:40 PM
 #6899

found a good launch config on the ACX 750ti: -l T15x1. With n-factor 16, I get ~ 310 H/s with tpruvot's miner.  Tried same settings with sp_'s and it errors out.
Code:
[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)
Same here.  It seems to have been a problem since the original import.
sp_ (OP)
Legendary
*
Offline Offline

Activity: 2912
Merit: 1087

Team Black developer


View Profile
October 17, 2015, 03:02:02 PM
Last edit: October 17, 2015, 06:33:03 PM by sp_
 #6900

-Faster lyra2v2 on compute 5.2 devices (GTX 950: +500 kH/s, GTX 980 Ti: +500 kH/s), with help from djeZo's launch configs
-Faster quark on compute 5.2
-Added the whirlpool algo (joincoin)

1.5.71(sp-MOD) is available here: (17-oct-2015)

https://github.com/sp-hash/ccminer/releases/

The source code is available here:

https://github.com/sp-hash/ccminer

Team Black Miner (ETHB3 ETH ETC VTC KAWPOW FIROPOW MEOWPOW + dual mining + tripple mining.. https://github.com/sp-hash/TeamBlackMiner