djeZo
|
|
October 16, 2015, 02:21:40 PM |
|
What speeds do you get on the GTX 980 Ti and GTX 950 with Lyra2REv2? I get 17.450 kH/s on the GTX 980 Ti and 5.480 kH/s on the GTX 950.
Clocks? OS? Build? Around 1400 MHz on both cards, Windows, latest SP build... but I tweaked some parameters; originally I was getting 17 khs on the 980 Ti and 5 khs on the 950.
Would you mind sharing your parameters? Sure:

if (strstr(props.name, "970"))         { intensity = 256 * 256 * 20; }
else if (strstr(props.name, "980 Ti")) { intensity = 256 * 256 * 18; tpb = 8; }
else if (strstr(props.name, "980"))    { intensity = 256 * 256 * 16; }
else if (strstr(props.name, "750 Ti")) { intensity = 256 * 256 * 5;  tpb = 16; }
else if (strstr(props.name, "750"))    { intensity = 256 * 256 * 5;  tpb = 16; }
else if (strstr(props.name, "960"))    { intensity = 256 * 256 * 6; }
else if (strstr(props.name, "950"))    { intensity = 256 * 256 * 18; tpb = 11; }
|
|
|
|
djeZo
|
|
October 16, 2015, 02:24:39 PM |
|
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even at idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as on other Maxwell v2 cards.
I don't have an account there to post, but measure bus interface load during these bottlenecks - maybe it can reveal another hint (I used GPU-Z to measure bus interface load).
Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. My algorithm implementation was sending very little data over this bus - not nearly enough to load PCIe 3.0 x16 so heavily that it would show 30-50% load. I could not explain why the bus load was so high; googling gave no results and I more or less gave up. But now that you have shown this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? This would explain the massive bus load, since my algo was making a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. To my knowledge it is only active when you explicitly request it. What if it is active even in cases when you do not? Or maybe it is a bug in the CUDA software - sending data over the bus even though there is no need for a synced memory space?
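The unified-memory suspicion above can be checked directly. Here is a hedged sketch (assuming a CUDA 6+ toolkit; the buffer size is just an example) contrasting the two allocation paths: only cudaMallocManaged creates memory the driver migrates over the bus on demand, while plain cudaMalloc memory should stay in VRAM unless the driver itself decides to page it.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = (size_t)2560 * 1024 * 1024;  // ~2.5 GB, roughly where the slowdown appeared

    // Plain device allocation: lives only in VRAM, no implicit host sync.
    void *dev = NULL;
    cudaError_t err = cudaMalloc(&dev, bytes);
    printf("cudaMalloc:        %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(dev);

    // Unified (managed) memory: the driver migrates pages between host and
    // device on demand - this is the CUDA mode that syncs memory over the
    // bus automatically, and it has to be requested explicitly like this.
    void *managed = NULL;
    err = cudaMallocManaged(&managed, bytes, cudaMemAttachGlobal);
    printf("cudaMallocManaged: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(managed);
    return 0;
}
```

If a miner never calls cudaMallocManaged and the bus is still 30-50% loaded, the traffic is coming from the driver (e.g. WDDM paging), not from unified memory.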
You could easily test it by running the same thing on the same card but through a 1x riser.
That's why I made the -g switch. You get problems in Windows when allocating big buffers. Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the two intensity blocks are split in two. -i 25 will cause out-of-memory while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake512 rewrite), so it might be slower.
I did some further analysis, and there must be some serious memory sync or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952
Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?
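The chunking idea being discussed can be sketched like this (hypothetical sizes and chunk count, not the actual ccminer allocator): instead of one big cudaMalloc that trips the limit, request several smaller allocations and have the kernel index into the right one.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define NUM_CHUNKS 4

int main(void)
{
    size_t total = (size_t)3 * 1024 * 1024 * 1024;   // 3 GB total scratchpad (example size)
    size_t chunk = total / NUM_CHUNKS;               // four ~768 MB allocations

    void *chunks[NUM_CHUNKS] = { 0 };
    for (int i = 0; i < NUM_CHUNKS; i++) {
        cudaError_t err = cudaMalloc(&chunks[i], chunk);
        if (err != cudaSuccess) {
            printf("chunk %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
    }

    // A kernel would then pick its chunk, e.g. chunks[blockIdx.x % NUM_CHUNKS],
    // and offset inside it, instead of indexing one huge contiguous buffer.

    for (int i = 0; i < NUM_CHUNKS; i++)
        if (chunks[i]) cudaFree(chunks[i]);
    return 0;
}
```

This sidesteps single-allocation failures, but as the replies below report, it does not help if the driver pages based on total memory in use rather than allocation size.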
|
|
|
|
bensam1231
Legendary
Offline
Activity: 1750
Merit: 1024
|
|
October 16, 2015, 03:32:54 PM |
|
Still testing out different difficulty settings for Myr-Gr on Digibyte and nothing is working out. It would definitely be worth looking into, as it's worth more to mine than Quark right now, even without the SP enhanced version (I'm using tpruvot's).
|
I buy private Nvidia miners. Send information and/or inquiries to my PM box.
|
|
|
Genoil
|
|
October 16, 2015, 04:59:12 PM |
|
[...] Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?
Tried chunks on the ethminer kernel. No difference at all.
|
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d BTC: 1Nu2fMCEBjmnLzqb8qUJpKgq5RoEWFhNcW
|
|
|
djeZo
|
|
October 16, 2015, 05:07:36 PM |
|
[...] Tried chunks on the ethminer kernel. No difference at all.
Yep, I checked that with my Axiom algo - no difference; it actually got even slower. It looks as if WDDM has a hardcoded value of 2 GB: once you allocate that much or more, it starts paging memory to host memory, regardless of how much memory the video card actually has. This seems like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential. Also, since the GTX 750 Ti hits this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining memory-hard algorithms. Sad sad sad...
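The hardcoded-cutoff hypothesis can be probed empirically. A sketch (my own test idea, not something from this thread's miners): allocate progressively larger buffers, run a trivial write kernel over each, and watch for the step change in kernel time that would mark where the driver starts paging.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void touch(unsigned char *buf, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (unsigned char)i;   // force every page to be resident on the GPU
}

int main(void)
{
    for (size_t mb = 512; mb <= 4096; mb += 256) {
        size_t n = mb * 1024 * 1024;
        unsigned char *buf;
        if (cudaMalloc(&buf, n) != cudaSuccess) break;

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        touch<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%4zu MB: %.2f ms\n", (size_t)mb, ms);  // a sudden jump marks the paging threshold

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(buf);
    }
    return 0;
}
```

If the reports in this thread are right, on Win7 WDDM the per-pass time should step up sharply around the 2 GB mark on Maxwell v2, and around 1 GB on a 750 Ti.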
|
|
|
|
myagui
Legendary
Offline
Activity: 1154
Merit: 1001
|
|
October 16, 2015, 05:12:55 PM |
|
[...] This seems to me like a big issue and makes all high end NVIDIA cards useless for mining as you can never exploit their true potential for mining.
Also, since GTX 750 Ti has this limit at 1 giga, it makes all 2 gig versions of GTX 750 Ti useless for mining (memory hard algorithms).
Sad sad sad...
Or ... you can switch to Linux?
|
|
|
|
Grim
|
|
October 16, 2015, 05:39:00 PM |
|
[...] This seems to me like a big issue and makes all high end NVIDIA cards useless for mining as you can never exploit their true potential for mining.
Also, since GTX 750 Ti has this limit at 1 giga, it makes all 2 gig versions of GTX 750 Ti useless for mining (memory hard algorithms).
Sad sad sad...
Or ... you can switch to Linux?
Linux is degrading as well, just not as badly as Windows WDDM.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
October 16, 2015, 05:43:05 PM Last edit: October 16, 2015, 05:53:40 PM by sp_ |
|
Chunked memory with linear reads and writes works... In my private cryptonight mod it is 10% faster. I can use the optimal launch config that only used to work on Linux.
But Windows 7 only...
8.1 and 10 don't work.
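The "linear reads and writes" point can be illustrated with a hedged sketch (a hypothetical kernel, not sp_'s private cryptonight code): when consecutive threads touch consecutive words, every warp access is coalesced and the driver sees a predictable sequential pattern, which apparently keeps WDDM's paging out of the way.

```cuda
// Hypothetical illustration of linear (coalesced) access over a chunked
// scratchpad - not the actual cryptonight kernel discussed above.
__global__ void linear_pass(unsigned int *chunk, size_t total_words)
{
    size_t tid    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: adjacent threads read adjacent words, so each
    // warp issues one fully coalesced memory transaction per step.
    for (size_t i = tid; i < total_words; i += stride)
        chunk[i] ^= 0x9e3779b9u;   // arbitrary linear read-modify-write
}
```

A memory-hard algo with a random access pattern (as Genoil notes below about ETH) cannot be rewritten this way, which may be why chunking helped here but not on the ethminer kernel.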
|
|
|
|
myagui
Legendary
Offline
Activity: 1154
Merit: 1001
|
|
October 16, 2015, 05:55:34 PM |
|
@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.
|
|
|
|
Genoil
|
|
October 16, 2015, 09:48:25 PM |
|
[...] I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.
I haven't produced the Linux values myself, so unknown circumstances could have caused that drop. Then again, the values for the TCC driver that @allanmac showed look pretty similar.
[...] Also, since the GTX 750 Ti hits this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining memory-hard algorithms. Sad sad sad...
I got a report from somebody with a 4 GB 750 Ti that the limit was at 1 GB there too. On Win8/10 the limit for Maxwell 1 is 512 MB and for Maxwell 2 it is 1024 MB.
[...] Chunked memory with linear reads and writes works... But Windows 7 only.
ETH uses a random access pattern, so that may explain the difference.
|
ETH: 0xeb9310b185455f863f526dab3d245809f6854b4d BTC: 1Nu2fMCEBjmnLzqb8qUJpKgq5RoEWFhNcW
|
|
|
hashbrown9000
|
|
October 16, 2015, 10:27:20 PM |
|
Does someone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with an "above target" error. Here's mine for the ACX EVGA 750 Ti:

./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x
|
Pinkcoin: ETH: VTC: BTC:
|
|
|
myagui
Legendary
Offline
Activity: 1154
Merit: 1001
|
|
October 17, 2015, 12:05:32 AM |
|
Does someone have a working scrypt-jane command line? [...]
IIRC, Leo is now at nfactor 16, so: -a scrypt-jane:16
Edit: And you will probably need to tweak those launch parameters, as is customary across nfactor changes.
|
|
|
|
coinut
|
|
October 17, 2015, 12:30:23 AM |
|
Just compiled and tested your latest commit (1117, 16 Oct 2015).
Card is a GTX 750 Ti @ 1400 MHz installed directly in the motherboard; algo is Lyra2REv2; OS is Win7 x64.
I see 5106 displayed locally on the miner side.
With the commit from 19 Sep 2015,
I get 5155 locally on the miner.
|
|
|
|
hashbrown9000
|
|
October 17, 2015, 03:34:00 AM |
|
Found a good launch config on the ACX 750 Ti: -l T15x1. With nfactor 16, I get ~310 H/s with tpruvot's miner. Tried the same settings with sp_'s and it errors out:

[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)
[2015-10-17 05:30:33] GPU #0: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #1: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #0: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)
[2015-10-17 05:30:33] GPU #0: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)
[2015-10-17 05:30:33] GPU #2: Err 77: an illegal memory access was encountered (salsa_kernel.cu:870)
[2015-10-17 05:30:33] GPU #2: CUDA error `an illegal memory access was encountered` while executing the kernel.
[2015-10-17 05:30:33] GPU #2: cudaError 77 (an illegal memory access was encountered) calling 'cudaMemcpyToSymbol(pdata, host_pdata, 20*sizeof(uint32_t), 0, cudaMemcpyHostToDevice)' (keccak.cu line 471)
|
Pinkcoin: ETH: VTC: BTC:
|
|
|
ZenFr
Legendary
Offline
Activity: 1260
Merit: 1046
|
|
October 17, 2015, 09:30:39 AM |
|
Happy birthday CCMiner_SP ;-).
|
|
|
|
antantti
Legendary
Offline
Activity: 1176
Merit: 1015
|
|
October 17, 2015, 09:53:53 AM |
|
Skål again!
f225813a4921c02318b86bbdb58ca48036ef4c481b7d123f9b429d6fd34ba800
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
October 17, 2015, 01:03:14 PM |
|
Skål again! f225813a4921c02318b86bbdb58ca48036ef4c481b7d123f9b429d6fd34ba800
I don't think this went to my BTC address? But thanks for your support... The birthday present will be a little delayed.
|
|
|
|
antantti
Legendary
Offline
Activity: 1176
Merit: 1015
|
|
October 17, 2015, 01:24:02 PM |
|
Skål again! f225813a4921c02318b86bbdb58ca48036ef4c481b7d123f9b429d6fd34ba800
I don't think this was to my BTC adress? But thanks for your support.. The birthday present will be a little delayed Oops! 97b0c5263d8fb86375bf0da8a56e9fd81f591fcd3bb26f546a50ffb5f937b5f3
|
|
|
|
flipclip
Member
Offline
Activity: 111
Merit: 10
|
|
October 17, 2015, 02:24:40 PM |
|
Found a good launch config on the ACX 750 Ti: -l T15x1. [...] Tried the same settings with sp_'s and it errors out. [...]
Same here. It seems to have been a problem since the original import.
|
|
|
|
sp_ (OP)
Legendary
Offline
Activity: 2912
Merit: 1087
Team Black developer
|
|
October 17, 2015, 03:02:02 PM Last edit: October 17, 2015, 06:33:03 PM by sp_ |
|
-Faster lyra2v2 on compute 5.2 devices (GTX 950 +500 kH/s, GTX 980 Ti +500 kH/s), with help from djeZo's launch configs
-Faster quark on compute 5.2
-Added the whirlpool algo (Joincoin)

1.5.71 (sp-MOD) is available here (17-Oct-2015): https://github.com/sp-hash/ccminer/releases/
The source code is available here: https://github.com/sp-hash/ccminer
|
|
|
|
|