Bitcoin Forum
Author Topic: Interesting 5970 HW Problem  (Read 4497 times)
MrTeal (OP)
October 26, 2012, 03:37:12 AM
 #1

I'm having an issue with one of my 5970s, and it's kind of stumping me. In the last couple of days, one of the cores on the card has started to show hardware errors in CGminer. It's only the one core, and it happens regardless of frequency or voltage. The weird thing is that the number of accepted shares is the same as the number of HW errors. I.e., if I have 3000 accepted shares I'll have close to 3000 HW errors, within less than 1%.

I'm using CGminer 2.7.5 and the Diablo kernel, and restarts don't seem to help. Normally, if I saw this distribution of errors in what should be a random process I would think a bit is stuck, but I'm not sure how that would apply in this case. Has anyone seen anything like this or have any ideas what could be causing it?
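Editor's sketch, not part of the original post: the numbers above work out to an error rate of almost exactly one in two. A minimal Python check follows. It assumes HW errors and accepted shares can be treated as two outcomes of the same pool of found nonces; the helper name `hw_error_fraction` and the 2980 figure are hypothetical stand-ins for "close to 3000, within less than 1%".

```python
def hw_error_fraction(accepted: int, hw_errors: int) -> float:
    """Fraction of counted results (accepted + HW errors) that were HW errors."""
    total = accepted + hw_errors
    if total == 0:
        raise ValueError("no shares or HW errors recorded yet")
    return hw_errors / total


# Figures from the post: about 3000 accepted shares and close to 3000 HW errors.
accepted = 3000
hw_errors = 2980  # assumed example value within 1% of the accepted count

print(f"HW error fraction: {hw_error_fraction(accepted, hw_errors):.1%}")
```

A fraction pinned this close to 50% at every clock and voltage looks more like a systematic fault than a marginal overclock, which is what the stuck-bit suspicion is getting at.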
Aseras
October 26, 2012, 04:03:45 AM
 #2

Open up GPU-Z and look at all the GPU temps on the Sensors tab, not just the "main" one. I'll bet you have a hot spot somewhere on the core where the thermal paste is migrating. I've had mine report the temp at 55 and had a hot spot that was at 87.
Easy2Mine
October 26, 2012, 04:08:57 AM
 #3

Sorry, I can't help.
I have never seen it before either.

I restarted one of my rigs last month with TeamViewer on my mobile phone, and after the restart the 7950 in that rig only hashed at 50 MH/s while the 6970s were hashing normally.
I never had any problems with TeamViewer before.
Nothing I tried helped until I uninstalled all the drivers and installed them again.
I also had CGminer 2.7.5 installed on that rig.
I have since upgraded CGminer to the latest version.
Have you already tried uninstalling and reinstalling the GPU drivers?

I hope reinstalling drivers will help

MrTeal (OP)
October 26, 2012, 04:14:09 AM
 #4

Quote from: Aseras on October 26, 2012, 04:03:45 AM
Open up GPU-Z and look at all the GPU temps on the Sensors tab, not just the "main" one. I'll bet you have a hot spot somewhere on the core where the thermal paste is migrating. I've had mine report the temp at 55 and had a hot spot that was at 87.

At 750MHz/200MHz/1V they're running 68/63/62, with the GPU VRMs at 77/78/76. As I said though, this is independent of clock speed or voltage and by extension temperature. I can run at 400MHz/0.95V and turn off the other core and I still see utilization about half of what it should be, with an (almost) equal distribution of HW errors and accepted shares.
DobZombie
October 26, 2012, 12:54:36 PM
 #5

crack open the GPU, clean off any thermal paste and just put some new stuff on.

come back and tell us what happens

Tip Me if believe BTC1 will hit $1 Million by 2030
1DobZomBiE2gngvy6zDFKY5b76yvDbqRra
GenTarkin
October 26, 2012, 02:49:43 PM
 #6

I've had a problem similar to this, but my 5970 (one GPU) flagged HW errors about 25% of the time. Mine was because the RAM gradually got worse and worse. By lowering the RAM clock from 300 to 170, the HW errors are gone, for now.
Are your HW errors "invalid nonce" errors? That's what mine appear as for the shares in cgminer.
I'm willing to bet there's not a single ounce of good RAM on the GPU that's flagging the errors.

GenTarkin's MOD Kncminer Titan custom firmware! v1.0.4! -- !!NO LONGER AVAILABLE!!
Donations: bitcoin- 1Px71mWNQNKW19xuARqrmnbcem1dXqJ3At || litecoin- LYXrLis3ik6TRn8tdvzAyJ264DRvwYVeEw
MrTeal (OP)
October 26, 2012, 03:43:45 PM
 #7

Quote from: GenTarkin on October 26, 2012, 02:49:43 PM
I've had a problem similar to this, but my 5970 (one GPU) flagged HW errors about 25% of the time. Mine was because the RAM gradually got worse and worse. By lowering the RAM clock from 300 to 170, the HW errors are gone, for now.
Are your HW errors "invalid nonce" errors? That's what mine appear as for the shares in cgminer.
I'm willing to bet there's not a single ounce of good RAM on the GPU that's flagging the errors.

 [2012-10-26 09:16:20] GPU 3 found something?
 [2012-10-26 09:16:20] OCL NONCE 1931483178 found in slot 0
 [2012-10-26 09:16:20] No best_g found! Error in OpenCL code?

For comparison, this is what I normally see:
 [2012-10-26 09:16:26] GPU 2 found something?
 [2012-10-26 09:16:26] OCL NONCE 2442951545 found in slot 0
 [2012-10-26 09:16:26]  Proof: 00000000410ba6fe4fff57acaa2a9a4a358e5fdbc20a441a6951b215356b1725

I'll try running the RAM at 150 instead of 200 to see if that helps.
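Editor's sketch, not part of the original post: to see which GPU the failures belong to over a longer run, the debug output can be tallied per GPU. The Python below is a rough sketch that uses only the six log lines quoted above. The function name `tally` is hypothetical, and the parser assumes each "No best_g found" or "Proof:" line belongs to the most recent "GPU n found something?" line, which may not hold if output from several GPUs interleaves.

```python
import re
from collections import Counter

# The debug lines quoted in the post, used here as sample input.
LOG = """\
 [2012-10-26 09:16:20] GPU 3 found something?
 [2012-10-26 09:16:20] OCL NONCE 1931483178 found in slot 0
 [2012-10-26 09:16:20] No best_g found! Error in OpenCL code?
 [2012-10-26 09:16:26] GPU 2 found something?
 [2012-10-26 09:16:26] OCL NONCE 2442951545 found in slot 0
 [2012-10-26 09:16:26]  Proof: 00000000410ba6fe4fff57acaa2a9a4a358e5fdbc20a441a6951b215356b1725
"""

FOUND = re.compile(r"GPU (\d+) found something\?")


def tally(log_text: str) -> Counter:
    """Count 'valid' and 'hw_error' outcomes per GPU.

    Assumption: an outcome line always follows the 'found something?'
    line of the GPU that produced it.
    """
    counts = Counter()
    gpu = None
    for line in log_text.splitlines():
        match = FOUND.search(line)
        if match:
            gpu = int(match.group(1))
        elif gpu is not None and "No best_g found" in line:
            counts[(gpu, "hw_error")] += 1
            gpu = None
        elif gpu is not None and "Proof:" in line:
            counts[(gpu, "valid")] += 1
            gpu = None
    return counts


for (gpu, outcome), n in sorted(tally(LOG).items()):
    print(f"GPU {gpu}: {n} {outcome}")
```

Run against a saved debug log instead of `LOG`, the per-GPU split would show whether the roughly 50/50 pattern is confined to GPU 3.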
GenTarkin
October 26, 2012, 04:57:28 PM
 #8

Actually, I never looked at the error in the debug output; it went by too fast. I just noticed that when a HW error was thrown in normal mode, it said "HW error, invalid nonce".

I got the HW errors to vanish at 170 MHz RAM. If it doesn't work at 150 MHz, your RAM is most likely completely fuxxored. I still strongly suspect that mining somehow damages the 5970's RAM.

Aseras
October 26, 2012, 06:00:22 PM
 #9

How are you controlling your clocks? Flash modding, Afterburner, cgminer, or something else?
Beaflag VonRathburg
October 28, 2012, 06:49:34 AM
 #10

One of my rigs had a very similar issue. I put two 5970s together into one machine. I copied my 2.7.5 folder from a machine that was running the exact same setup (64 ultimate, 11.12, SDK 2.1, cgminer 2.7.5). When I fired up cgminer I would only get HW errors. It refused to accept any shares no matter what I had the cards clocked at. As a test I installed 2.7.7, and it has worked fine since without any issues whatsoever.

johnyj
October 28, 2012, 09:09:52 AM
 #11

Same here. One of my 5970 cores now runs with 50% HW errors. I've tried everything: changing the voltage and frequency, replacing the thermal compound. Nothing works. I'm running version 2.3.3 of cgminer and never had a problem before.

After cgminer starts, it works without problems for the first several minutes, but once the temperature goes above 40°C the errors start to appear. This began weeks after I changed the heatsink to an Accelero Xtreme and the thermal compound to Coollaboratory Liquid Ultra, so the GPU cooling actually got much better.

Maybe, as someone said, it is a RAM problem, since I did not put any heat pads on the RAM with the new cooler, but in any case the RAM has stayed at 150 MHz.

DobZombie
October 28, 2012, 02:50:04 PM
 #12

Shouldn't they be set to at least 300 MHz?

johnyj
October 28, 2012, 08:16:50 PM
 #13

Quote from: DobZombie on October 28, 2012, 02:50:04 PM
Shouldn't they be set to at least 300 MHz?

I tried many different RAM clock combinations without luck. I've been running at 150 MHz for months without a problem. I think it is the GPU that is starting to degrade, but that is difficult to prove, since raising the voltage does not really help.

DobZombie
October 29, 2012, 03:35:17 PM
 #14

The most stable settings I've seen so far with a 5970 are:
1.075v
800-820 (core clock)
300 (mem)

MrTeal (OP)
October 29, 2012, 06:23:14 PM
 #15

Quote from: DobZombie on October 29, 2012, 03:35:17 PM
The most stable settings I've seen so far with a 5970 are:
1.075v
800-820 (core clock)
300 (mem)

Wow, what kind of temps do you get with that overclock and overvolt?
DobZombie
October 31, 2012, 12:31:12 PM
 #16

Sitting in a 35-40 degree Celsius environment, they run at about 75-80 degrees (some go up to 85).

When they were sitting on my balcony they ran at 70-75.

This is with the LATEST (2011) BIOS updates, and the fan set to automatic. The fans are usually zooming away at 80% (4200 RPM).

