Bitcoin Forum

Other => CPU/GPU Bitcoin mining hardware => Topic started by: DeathAndTaxes on March 06, 2012, 06:22:56 PM



Title: I think one of my GPUs has the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 06:22:56 PM
On one of my rigs I kept noticing the same GPU (#4, grr) going DEAD.  I lowered the clock: dead.  Lowered the clock again: dead.  Finally I put it at only 750 MHz.  It was fine for a couple of days and then I noticed it dead again.

I verified it is now no longer stable even @ 750.  Three "dead"s in 24 hours.  I dropped it to 725 (stock) and it has been stable for almost a day now (will update the post after longer periods of testing).  So far just that one core on one rig (1 of 48 GPUs) dislikes even a moderate overclock.  The other GPU on the same card is fine @ 800 MHz.  At higher overclocks the "bad core" dies almost instantly.  Those two facts and a 1200W single-rail PSU design make me think it isn't a power issue.

This is a very old card (over a year old; I need to look up the sale date).  It was one of the first 5970s I bought, and it is what convinced me to replace all my other GPUs with 5970s to consolidate my growing farm.  It has never been overvolted, but last summer it did run "hot" for about a month before I bought a mini-split system to supplement the house's main AC.  It has run overclocked but nothing crazy.  I ran it at (guesstimate) 835 early on.  I dropped all the clocks to 820 and then 800 once I got up to 6 rigs because micromanaging them became a hassle.  I can't try overvolting at this point (which can "restore" performance on circuits suffering from electromigration) because it is in a Linux rig.

Other than 3 fan failures (and me dropping a motherboard down the stairs) this is the first casualty of mining I have had.   As our GPUs get older I wonder if anyone else has noticed something similar.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: Mousepotato on March 06, 2012, 06:26:44 PM
Hmm, that's certainly something I've been worrying about as time goes on.  What I'm more curious about, though, is how the hardware vendors will handle warranty claims against cards that are nigh irreplaceable at this point.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: rjk on March 06, 2012, 06:29:01 PM
Hmm that's certainly something I've been worrying about as time goes on.  What I'm more curious though is how the hardware vendors will handle warranty claims against cards that are nigh irreplaceable at this point.
From what I hear, 5970's are being replaced by 6990s when there is a warranty claim.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 06:30:20 PM
From what I hear, 5970's are being replaced by 6990s when there is a warranty claim.

Oh god.  :(

At this point I am not sure I can even warranty it.  It runs perfectly stable at stock clock and I doubt any warranty guarantees stability above stock clock.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: racerguy on March 06, 2012, 07:11:48 PM
Can it play games at stock clocks?  Mining is easier on the card than playing games.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: malevolent on March 06, 2012, 07:24:50 PM
Does it happen at different mem clocks too?
How high were the temperatures (vrms too)?


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: bulanula on March 06, 2012, 07:28:09 PM
Does it happen at different mem clocks too?
How high were the temperatures (vrms too)?

That is the important bit here: Linux does not properly support overvolting (to mask the effects of EM afterwards), undervolting (to prevent EM in the first place), or reading VRM temps for the 5870 / 5970.

That sucks and needs to be fixed. :(


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: Buckwheet on March 06, 2012, 08:18:24 PM
I mentioned to you in my PM that recently the quality of the 5970s I have been getting on flea-bay has been sub par. I am seeing a ~40% "failure rate" over the past ~30 days in the 5970s I got. Most of the failures involve the card starting up just fine and running without problems for about two to three minutes, and then the temp gauges suddenly go from 60-65C to 120ishC and the card stops responding and crashes the whole system.

When I set the clocks down to normal levels the card will run without issue, but it runs 10-15C hotter than the 5970s next to it that are overclocked to 800. They are all spaced out with enough room to fit two dual-slot cards between them. I also have box fans blowing cool air into the rack they are mounted on, so I know it's not a heat problem.

I also have AX1200s for all my rigs and I still experienced this problem when I did the following:

1. Brand new motherboard of the same make/model.
2. Replaced riser.
3. Replaced PSU with brand new AX1200
4. Replaced outlet.
5. Replaced breaker.
6. Clean install.

It "feels" like people run into this issue and, since they are out of warranty, they throw the card on Ebay and sell it off. It could also be pure luck that I have gotten bad cards lately. With the 5870s, it always seems to be the fan or fan controller that burns out on me a couple of weeks after getting them.

Edit: It happens to me at any mem clock. I don't have a good number on specific temps.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: The-Real-Link on March 06, 2012, 08:22:49 PM

  One of my cards seems to have a dead core quite often, but IDK if it's due to EM or simply a dud card.  Diamond's looking into it though, so we'll see.  I wonder if EM is also the issue with the second card in that system, as it randomly sees and doesn't see the cores.  Either way, I'm sending both back.

  Can't vouch for standard clock instability really so far.  I'll get back to you in a year :)


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 08:30:35 PM
That is the important bit here: Linux does not properly support overvolting (to mask the effects of EM afterwards), undervolting (to prevent EM in the first place), or reading VRM temps for the 5870 / 5970.

One correction.  Linux has no problem undervolting.  I have one rig running @ 730/240 1.00V (vs 1.05V stock) as a test.  cgminer has no problem setting any voltage in the allowed range, and the 5870/5970 allow core voltage to be lowered significantly (I tested down to 0.6V and it ran fine).  However, it relies on the ADL library and as such will only apply values allowed by the BIOS, so no overvolting without a BIOS flash to increase the allowable range.

Would be nice if AMD exposed all functionality via drivers and allowed larger range of values.  It would make 3rd party hack utilities obsolete.
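
For anyone who wants to repeat the undervolt test, here is a rough Python sketch that builds a cgminer command line with per-GPU values.  It assumes cgminer's ADL options `--gpu-engine`, `--gpu-memclock` and `--gpu-vddc` take comma-separated per-GPU lists, so check `cgminer --help` on your build first.  The pool URL, the credentials and the `build_cgminer_args` helper are made up for illustration.  ADL will still refuse anything outside the range the BIOS allows.

```python
def build_cgminer_args(pool_url, user, password, gpus):
    """Build a cgminer argument list from per-GPU settings.

    gpus: list of dicts with 'engine' (MHz), 'mem' (MHz) and 'vddc' (volts).
    The flag names are assumed from cgminer's ADL options; verify them
    against your own cgminer version.
    """
    engine = ",".join(str(g["engine"]) for g in gpus)
    mem = ",".join(str(g["mem"]) for g in gpus)
    vddc = ",".join(f"{g['vddc']:.3f}" for g in gpus)
    return [
        "cgminer",
        "-o", pool_url, "-u", user, "-p", password,
        "--gpu-engine", engine,
        "--gpu-memclock", mem,
        "--gpu-vddc", vddc,
    ]


# Values from the post: 730/240 at 1.00 V (stock is 1.05 V), both cores of one 5970.
# Pool URL and credentials are placeholders.
args = build_cgminer_args(
    "http://pool.example.com:8332", "worker", "secret",
    [{"engine": 730, "mem": 240, "vddc": 1.00},
     {"engine": 730, "mem": 240, "vddc": 1.00}],
)
print(" ".join(args))
```

Building the list per GPU makes it easy to drop only the one flaky core to stock while leaving the others alone.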


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 08:32:30 PM
Does it happen at different mem clocks too?
How high were the temperatures (vrms too)?

Haven't changed the memclock to see if that helps or hurts.  I doubt it will make a difference but I will look into it.

Temps were in the mid 60s C.  Linux has no way to see VRM temps, but based on Windows experience at similar clocks that likely puts the VRMs in the mid 80s C.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: malevolent on March 06, 2012, 08:42:31 PM
temps were mid 60C.  Linux has no way to see VRM temps but based on Windows experience of similar clocks that likely puts VRMs in the mid 80C range.

You had same cards/same settings/same environment when you used windows?
As for the memclock I had a card where it started to lock up at 280MHz after some time, but it did not occur when I increased the mem clock to 350.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 08:44:04 PM
You had same cards/same settings/same environment when you used windows?
As for the memclock I had a card where it started to lock up at 280MHz after some time, but it did not occur when I increased the mem clock to 350.

Yeah.  I will try, but given that 47 of 48 cards work at sub-300 MHz and one card dies instantly at clocks it ran fine at before, I doubt the problem is that the memclock is too low.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: malevolent on March 06, 2012, 08:47:45 PM
Yeah.  I will try, but given that 47 of 48 cards work at sub-300 MHz and one card dies instantly at clocks it ran fine at before, I doubt the problem is that the memclock is too low.

48 cards, that's quite a lot and an increased chance you stumble upon one that is of lower quality (as we know every chip is different, same as with CPUs)
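
To put a rough number on that, here is a minimal Python sketch of the chance that at least one chip in a farm is a weak one.  The per-chip probabilities are invented for illustration (nobody in this thread has real failure data), and it assumes chips fail independently of each other.

```python
def p_at_least_one(p_weak, n_chips):
    """Probability that at least one of n_chips is weak, assuming each chip
    is independently weak with probability p_weak."""
    return 1.0 - (1.0 - p_weak) ** n_chips


# Hypothetical per-chip probabilities; 48 is the GPU count from the thread.
for p in (0.01, 0.02, 0.05):
    print(f"per-chip p = {p:.2f} -> at least one weak chip in 48: "
          f"{p_at_least_one(p, 48):.1%}")
```

Even a small per-chip probability adds up quickly over 48 GPUs, so one marginal core in a farm this size does not by itself prove anything unusual.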


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 08:50:40 PM
48 cards, that's quite a lot and an increased chance you stumble upon one that is of lower quality (as we know every chip is different, same as with CPUs)

Well 48 chips, 24 cards. :)

Yeah, it may be the "weak link".  I just thought it was interesting as this card has been running 24/7 for over a year now.  I figure these topics will become more commonplace as the oldest GPUs hit 18 months old, 24 months old, etc.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: coretechs on March 06, 2012, 10:04:15 PM
One of my 5970s has a similar problem.  It won't mine at 725 MHz / 150 MHz mem / 1.05V without crashing, but it runs fine at 690 MHz / 150 MHz mem / 0.95V.  I always assumed it was a fried VRM because the GPU temps are fine.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: cablepair on March 06, 2012, 10:25:45 PM
5970s are notorious for that

I have one with the same problem. I actually found a (discontinued) Accelero Xtreme 5970 some place and am waiting for it to be delivered. I hope it gives me the cooling I need to keep it mining for a few more months.

The Accelero Xtreme I put on one of my 5870s after its fan died made an amazing difference.

That thing can mine 24/7 clocked at 950/180 and put out 430 Mhash/s at 42 degrees C with fans at 50%. Amazing!



Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 06, 2012, 10:25:56 PM
One of my 5970s has a similar problem.  It wont mine at 725mhz/150mhz mem/1.05v without crashing, but it runs fine at 690mhz/150mhz mem/0.95v.  I always assumed it was a fried VRM because the GPU temps are fine.

Hmm it could be a bad VRM.  I can't check VRM temps because it is in Linux.  I may drop the card in my Windows workstation to see what VRM is reading.  If one of 3 VRMs is missing in GPU-Z that is a good (well bad but clear) sign.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: coretechs on March 06, 2012, 11:31:55 PM
If one of 3 VRMs is missing in GPU-Z that is a good (well bad but clear) sign.

It's better than electromigration because you can still use the card and go for max hash/watt on air by underclocking/undervolting as long as the functional VRMs don't get overloaded.  It's unfortunate for your water-cooled setup though.  Good luck in either case.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: AniceInovation on March 07, 2012, 03:50:56 AM
Not trying to state the obvious, but I did get some of these problems when I was messing with some cards.
It turned out to be bad thermal paste, or I had shifted the coolers slightly and ruined the TIM (thermal interface material).

If you haven't already, remove the heatsink, re-paste it, and check if it gets better.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: cuz0882 on March 07, 2012, 06:34:50 AM
I have 2 like that now. They really cause problems if you plug the monitor into them. One of mine crashes if I try running cgminer on it when the card is not hot. If I run guiminer for 10 mins and then switch over to cgminer, it works fine.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: John (John K.) on March 07, 2012, 07:30:16 AM
From what I hear, 5970's are being replaced by 6990s when there is a warranty claim.

Oh god.  :(

At this point I am not sure I can even warranty it.  It runs perfectly stable at stock clock and I doubt any warranty guarantees stability at > than stock clock.
Oh no. I just rma'ed a 5870 and a 5850 this week - would they give me a 6xxx card instead?  :'(


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: cuz0882 on March 07, 2012, 08:19:01 AM
Oh no. I just rma'ed a 5870 and a 5850 this week - would they give me a 6xxx card instead?  :'(
Let's wait until they only have 7990s.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: John (John K.) on March 07, 2012, 08:23:06 AM
*notes to self*


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: PatrickHarnett on March 07, 2012, 08:32:57 AM
One of my 5970s has a similar problem.  It wont mine at 725mhz/150mhz mem/1.05v without crashing, but it runs fine at 690mhz/150mhz mem/0.95v.  I always assumed it was a fried VRM because the GPU temps are fine.

Hmm it could be a bad VRM.  I can't check VRM temps because it is in Linux.  I may drop the card in my Windows workstation to see what VRM is reading.  If one of 3 VRMs is missing in GPU-Z that is a good (well bad but clear) sign.

VRM problem is more likely.  I have had several types of failures on 5970s, and if it starts more easily when cool (as opposed to when hot), runs hot with no apparent load, or freezes when load is applied (especially this one), those are possible symptoms of VRM failure.  I'm assuming you haven't thrown it into a primary slot to look for RAM failure (artefacts).

My cards were running 24/7 long before mining bitcoin, but not usually with over-clocks, and that helped keep temperatures sensible.  The only RMA I've managed with a 5970 got me a 6970 in return - better than nothing.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: BlackPrapor on March 07, 2012, 12:57:19 PM
I had same issues with an old AMD Duron CPU. Simply repasted the thermal interface and it was running good again. Could be the same issue, that's just a guess. Is it possible that electromigration can affect FPGA/ASIC chips as well? Is there a temperature that would guarantee EM free 24/7 work?


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 07, 2012, 01:30:28 PM
My cards were running 24/7 long before mining bitcoin, but not usually with over-clocks, and that helped keep temperatures sensible.  The only RMA I've managed with a 5970 got me a 6970 in return - better than nothing.

I hope you mean 6990.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 07, 2012, 01:32:03 PM
I had same issues with an old AMD Duron CPU. Simply repasted the thermal interface and it was running good again. Could be the same issue, that's just a guess. Is it possible that electromigration can affect FPGA/ASIC chips as well? Is there a temperature that would guarantee EM free 24/7 work?

Electromigration affects all silicon chips.  Every chip ever produced will eventually be destroyed by electromigration.  Lower current and lower temps could potentially push that timeline decades beyond the chip's economic lifespan, but electromigration is the wear and tear of a chip's metal interconnects and is both unavoidable and irreversible.
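
To give a feel for how much temperature and current matter, here is a rough Python sketch based on Black's equation, the usual textbook model for electromigration lifetime: MTTF = A * J^(-n) * exp(Ea / (k*T)).  The activation energy (0.7 eV) and current exponent (n = 2) are illustrative assumptions, not measured values for any GPU, and the temperatures you plug in should really be on-die metal temperatures, which are not what the sensor reports.  It only gives a ratio between two operating points, never an absolute lifetime.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def relative_lifetime(t_ref_c, t_new_c, j_ratio=1.0, ea_ev=0.7, n=2.0):
    """Return MTTF(new) / MTTF(ref) under Black's equation.

    t_ref_c, t_new_c: temperatures in degrees C.
    j_ratio: J_new / J_ref, the ratio of current densities.
    ea_ev and n are assumed, illustrative parameters.
    """
    t_ref = t_ref_c + 273.15
    t_new = t_new_c + 273.15
    thermal = math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_new - 1.0 / t_ref))
    return thermal / (j_ratio ** n)


# Example: same current density, running 20 C cooler than the reference point.
print(relative_lifetime(85.0, 65.0))
```

A result above 1 means the new operating point is expected to last longer than the reference.  No temperature makes the ratio infinite, which is why there is no "EM free" temperature, only slower wear.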


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: PatrickHarnett on March 07, 2012, 05:43:49 PM
My cards were running 24/7 long before mining bitcoin, but not usually with over-clocks, and that helped keep temperatures sensible.  The only RMA I've managed with a 5970 got me a 6970 in return - better than nothing.

I hope you mean 6990.

No.  It's just a single GPU card.

As for the thermal paste suggestions, I've never had a problem on that front.  I also have a friend in Germany who knows some info on the electro-migration issue, and for the volume of material that needs shifting, even for micro circuitry, his view was it is unlikely (he is another long time 24/7 gpu user).


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: rjk on March 07, 2012, 05:45:15 PM
My cards were running 24/7 long before mining bitcoin, but not usually with over-clocks, and that helped keep temperatures sensible.  The only RMA I've managed with a 5970 got me a 6970 in return - better than nothing.

I hope you mean 6990.

No.  It's just a single GPU card.

As for the thermal paste suggestions, I've never had a problem on that front.  I also have a friend in Germany who knows some info on the electro-migration issue, and for the volume of material that needs shifting, even for micro circuitry, his view was it is unlikely (he is another long time 24/7 gpu user).
The 5970 is dual GPU, and so is the 6990. If you got a single GPU card in return for a dual GPU card, then you got gyped.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: PatrickHarnett on March 07, 2012, 06:28:03 PM

The 5970 is dual GPU, and so is the 6990. If you got a single GPU card in return for a dual GPU card, then you got gyped.

Yes.  But neither the 5970s nor the 6990s had much availability, and I don't think that particular supplier could get any (or they didn't want to spend over $1000 to find a replacement).

The 5970 was second hand, and not a bad price and having the purchase documentation was a bonus (allowing the RMA in the first place).  I don't particularly like the 6970 (noisy and only single gpu). 


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: johnyj on March 07, 2012, 09:11:20 PM
I just RMAed a 5970 weeks ago, and since it is out of stock now, I only got my money back, so I have to hunt for another 5970.

One of my 5970s was never able to run stable above 760 MHz, but it turned out to be a bad PCI-E extender cable. I re-soldered one of the wires and now it runs stable at 800+ MHz.

Another 5870 died 30 seconds into mining because I had installed too thick a heat pad on the memory, which kept the GPU from making even contact with the cooler surface. Replacing the heat pad solved the problem.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 07, 2012, 09:15:14 PM
The 5970 was second hand, and not a bad price and having the purchase documentation was a bonus (allowing the RMA in the first place).  I don't particularly like the 6970 (noisy and only single gpu).  

Regardless of the price you paid, they actually "replaced" a 5970 with a 6970?

I mean by any metric you look at they robbed you:

5970
4.6 TFLOPs
$600 launch price
3200 shaders @ 725 MHz
Ebay value today: ~$400

6970
2.7 TFLOPs  (41% less)
$370 launch price (38% less)
1536 shaders @ 880 MHz (52% fewer shaders; about 42% less once clock speed is counted)
Ebay value today: ~$200 (50% less)
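
For anyone who wants to check the percentages, here is a small Python sketch using the figures exactly as posted above.  The compute figures are only used as a ratio, so the unit does not matter, and the dictionary keys and `percent_less` helper are just names picked for this example.

```python
def percent_less(old, new):
    """How much smaller new is than old, as a percentage of old."""
    return (old - new) / old * 100.0


# Figures as posted in the thread (approximate launch and Ebay prices).
hd5970 = {"compute": 4.6, "launch_usd": 600, "shaders": 3200,
          "clock_mhz": 725, "ebay_usd": 400}
hd6970 = {"compute": 2.7, "launch_usd": 370, "shaders": 1536,
          "clock_mhz": 880, "ebay_usd": 200}

for key in ("compute", "launch_usd", "shaders", "ebay_usd"):
    print(f"{key}: {percent_less(hd5970[key], hd6970[key]):.0f}% less")

# Shader count times clock is a crude throughput proxy.
throughput_5970 = hd5970["shaders"] * hd5970["clock_mhz"]
throughput_6970 = hd6970["shaders"] * hd6970["clock_mhz"]
print(f"shaders x clock: {percent_less(throughput_5970, throughput_6970):.0f}% less")
```

The raw shader count comes out lower than the figure quoted above, while shaders times clock matches it, so the posted shader percentage appears to be clock-adjusted.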


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: DeathAndTaxes on March 07, 2012, 09:16:19 PM
An update.  Running at 750/240, it has been stable for almost 72 hours now.  Still not sure if it is EM or just maybe 1 of 3 VRMs is blown.  Also not sure if it can do more than 750.  I want to run it like this until either it crashes again or it lasts a week.


Title: Re: I think one of my GPUs has the the dreaded electromigration.
Post by: PatrickHarnett on March 08, 2012, 12:54:06 AM
An update.  Running at 750/240, it has been stable for almost 72 hours now.  Still not sure if it is EM or just maybe 1 of 3 VRMs is blown.  Also not sure if it can do more than 750.  I want to run it like this until either it crashes again or it lasts a week.

Don't turn it off.  The first 5970 I had fail was an XFX black edition (fancy name, still just a bit of kit).  It took four months to die (i.e. not turn on when cool), but even when very stuffed, it would drive a screen.  Recycled the fan and heatsinks :)

As for the RMA - I paid about 80% of new price, 2nd hand rates were still 80% of new up until a few months ago (now about 50%), and they could have denied the return because I wasn't the registered purchaser.  Something better than nothing.