Bitcoin Forum

Other => CPU/GPU Bitcoin mining hardware => Topic started by: MrTeal on October 26, 2012, 03:37:12 AM

Title: Interesting 5970 HW Problem
Post by: MrTeal on October 26, 2012, 03:37:12 AM
I'm having an issue with one of my 5970s, and it's kind of stumping me. In the last couple of days, one of the cores on the card has started to show hardware errors in CGminer. It's only the one core, and it happens regardless of frequency or voltage. The weird thing is that the number of accepted shares is the same as the number of HW errors. I.e., if I have 3000 accepted shares I'll have close to 3000 HW errors, within less than 1%.

I'm using CGminer 2.7.5 and the Diablo kernel, and restarts don't seem to help. Normally, if I saw this distribution of errors in what should be a random process I would think a bit is stuck, but I'm not sure how that would apply in this case. Has anyone seen anything like this or have any ideas what could be causing it?
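To illustrate why a 1:1 split between good and bad results points toward a stuck bit, here is a toy sketch in Python. It is only a model under stated assumptions: results are treated as uniformly random 32-bit values, one bit is forced to a fixed value, and a result counts as corrupted when the forced bit differs from the true one. The function name and parameters are hypothetical and have nothing to do with how CGminer or the GPU actually computes hashes.

```python
import random


def simulate_stuck_bit(trials, bit=0, stuck_value=1, seed=0):
    """Return the fraction of random 32-bit values changed by a stuck bit.

    Toy model (assumption): each true value is uniformly random, and the
    hardware forces one bit position to `stuck_value` on every result.
    """
    rng = random.Random(seed)
    mask = 1 << bit
    corrupted = 0
    for _ in range(trials):
        true_value = rng.getrandbits(32)
        if stuck_value:
            observed = true_value | mask   # bit stuck at 1
        else:
            observed = true_value & ~mask  # bit stuck at 0
        if observed != true_value:
            corrupted += 1
    return corrupted / trials


if __name__ == "__main__":
    # Roughly half of the values come out wrong, whichever bit is stuck.
    print(simulate_stuck_bit(100_000))
```

In this model about half of all results are corrupted, which is the same shape as an almost equal count of accepted shares and HW errors.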

Title: Re: Interesting 5970 HW Problem
Post by: Aseras on October 26, 2012, 04:03:45 AM
Open up GPU-Z and look at all the GPU temps on the sensor tab, not just the "main" one. I'll bet you have a hot spot on part of the core where the thermal paste is migrating. I've had mine report the temp at 55 while a hot spot was at 87.

Title: Re: Interesting 5970 HW Problem
Post by: Easy2Mine on October 26, 2012, 04:08:57 AM
Sorry, I can't help.
I have never seen it before either.

Last month I restarted one of my rigs with TeamViewer from my mobile phone, and after the restart the 7950 in that rig only hashed at 50 MH/s while the 6970s were hashing normally.
I never had any problems with TeamViewer before.
Nothing I tried helped until I uninstalled all the drivers and installed them again.
I also had CGminer 2.7.5 installed on that rig.
I have since upgraded CGminer to the latest version.
Have you already tried uninstalling and reinstalling the GPU drivers?

I hope reinstalling the drivers will help.

Title: Re: Interesting 5970 HW Problem
Post by: MrTeal on October 26, 2012, 04:14:09 AM
Open up GPU-Z and look at all the GPU temps on the sensor tab, not just the "main" one. I'll bet you have a hot spot on part of the core where the thermal paste is migrating. I've had mine report the temp at 55 while a hot spot was at 87.

At 750MHz/200MHz/1V they're running 68/63/62, with the GPU VRMs at 77/78/76. As I said though, this is independent of clock speed or voltage and by extension temperature. I can run at 400MHz/0.95V and turn off the other core and I still see utilization about half of what it should be, with an (almost) equal distribution of HW errors and accepted shares.

Title: Re: Interesting 5970 HW Problem
Post by: DobZombie on October 26, 2012, 12:54:36 PM
crack open the GPU, clean off any thermal paste and just put some new stuff on.

come back and tell us what happens

Title: Re: Interesting 5970 HW Problem
Post by: GenTarkin on October 26, 2012, 02:49:43 PM
I've had a problem similar to this, but my 5970 (one GPU) flagged HW errors about 25% of the time. Mine is because the RAM gradually got worse and worse. By lowering the RAM from 300 to 170, the HW errors are gone, for now.
Are your HW errors "invalid nonce" errors? That's what mine appear as for the shares in cgminer.
I'm willing to bet there's not a single ounce of good RAM on the GPU that's flagging the errors.

Title: Re: Interesting 5970 HW Problem
Post by: MrTeal on October 26, 2012, 03:43:45 PM
I've had a problem similar to this, but my 5970 (one GPU) flagged HW errors about 25% of the time. Mine is because the RAM gradually got worse and worse. By lowering the RAM from 300 to 170, the HW errors are gone, for now.
Are your HW errors "invalid nonce" errors? That's what mine appear as for the shares in cgminer.
I'm willing to bet there's not a single ounce of good RAM on the GPU that's flagging the errors.

 [2012-10-26 09:16:20] GPU 3 found something?
 [2012-10-26 09:16:20] OCL NONCE 1931483178 found in slot 0
 [2012-10-26 09:16:20] No best_g found! Error in OpenCL code?

For comparison, this is what I normally see
 [2012-10-26 09:16:26] GPU 2 found something?
 [2012-10-26 09:16:26] OCL NONCE 2442951545 found in slot 0
 [2012-10-26 09:16:26]  Proof: 00000000410ba6fe4fff57acaa2a9a4a358e5fdbc20a441a6951b215356b1725

I'll try running the RAM at 150 instead of 200 to see if that helps.
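For anyone who wants to compare cores from a saved debug log rather than watching it scroll by, here is a minimal Python sketch that tallies good and bad nonces per GPU. It assumes the log looks like the excerpts above: a "GPU N found something?" line followed by either a "Proof:" line (good) or a "No best_g found" line (bad). The function name `tally_nonces` is hypothetical, and the sketch assumes entries from different GPUs are not interleaved.

```python
import re
from collections import Counter

# Matches the line that announces which GPU reported a candidate nonce.
GPU_LINE = re.compile(r"GPU (\d+) found something\?")


def tally_nonces(lines):
    """Count good and bad nonce reports per GPU from debug log lines.

    Assumption: each "GPU N found something?" line is followed by either
    a "Proof:" line (valid) or a "No best_g found" line (hardware error)
    before the next GPU line appears.
    """
    good, bad = Counter(), Counter()
    current = None
    for line in lines:
        match = GPU_LINE.search(line)
        if match:
            current = int(match.group(1))
        elif current is not None and "No best_g found" in line:
            bad[current] += 1
            current = None
        elif current is not None and "Proof:" in line:
            good[current] += 1
            current = None
    return good, bad


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < debug.log
    good, bad = tally_nonces(sys.stdin)
    for gpu in sorted(set(good) | set(bad)):
        print(f"GPU {gpu}: {good[gpu]} good, {bad[gpu]} bad")
```

A core with a roughly even split between good and bad counts would match the symptom described in this thread.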

Title: Re: Interesting 5970 HW Problem
Post by: GenTarkin on October 26, 2012, 04:57:28 PM
Actually, I never looked at the error in the debug output; it went by too fast. I just noticed that when a HW error was thrown in normal mode, it said "HW error, invalid nonce".

I got the HW errors to vanish at 170MHz RAM. If it doesn't work at 150MHz RAM, your RAM is most likely completely fuxxored. I still strongly suspect that mining somehow damages the 5970's RAM.

Title: Re: Interesting 5970 HW Problem
Post by: Aseras on October 26, 2012, 06:00:22 PM
How are you controlling your clocks? Flash modding, Afterburner, cgminer, or something else?

Title: Re: Interesting 5970 HW Problem
Post by: Beaflag VonRathburg on October 28, 2012, 06:49:34 AM
One of my rigs had a very similar issue. I put two 5970s together into one machine. I copied my 2.7.5 folder from a machine that was running the exact same setup (64 ultimate, 11.12, SDK 2.1, cgminer 2.7.5). When I fired up cgminer I would only get HW errors. It refused to accept any shares no matter what I had the cards clocked at. As a test I installed 2.7.7, and it has worked fine since without any issues whatsoever.

Title: Re: Interesting 5970 HW Problem
Post by: johnyj on October 28, 2012, 09:09:52 AM
Same here. One of my 5970 cores now runs with 50% HW errors. I tried everything: changing voltage and frequency, replacing the thermal compound. Nothing works. I'm running version 2.3.3 of cgminer and never had a problem before.

After cgminer starts, it works without problems for the first several minutes, but once the temperature rises above 40°C the errors start to appear. This began weeks after I changed the heatsink to an Accelero Xtreme and the thermal compound to Coollaboratory Liquid Ultra, so the GPU cooling actually got much better.

Maybe, as someone said, it is a RAM problem, since I did not put any heat pads on the RAM with the new cooler, but in any case it stayed at 150MHz.

Title: Re: Interesting 5970 HW Problem
Post by: DobZombie on October 28, 2012, 02:50:04 PM
Shouldn't they be set to at least 300MHz?

Title: Re: Interesting 5970 HW Problem
Post by: johnyj on October 28, 2012, 08:16:50 PM
Shouldn't they be set to at least 300MHz?

I tried many different RAM clock combinations without luck. I've been running at 150MHz for months without a problem. I think it is the GPU that is starting to degrade, but that is difficult to prove, since raising the voltage does not really help.

Title: Re: Interesting 5970 HW Problem
Post by: DobZombie on October 29, 2012, 03:35:17 PM
The most stable settings I've seen so far with a 5970 are
800-820 (core clock)
300 (mem)

Title: Re: Interesting 5970 HW Problem
Post by: MrTeal on October 29, 2012, 06:23:14 PM
The most stable settings I've seen so far with a 5970 are
800-820 (core clock)
300 (mem)

Wow, what kind of temps do you get with that overclock and overvolt?

Title: Re: Interesting 5970 HW Problem
Post by: DobZombie on October 31, 2012, 12:31:12 PM
Sitting in a 35-40 degree Celsius environment, they run at about 75-80 degrees (some go up to 85).

When they were sitting on my balcony they ran at 70-75.

This is with the latest (2011) BIOS updates and the fan set to automatic. The fans are usually zooming away at 80% (4200 RPM).