Bitcoin Forum
December 06, 2016, 12:31:04 PM *
Topic: Random failed GPUs - definitely dead?
catfish
Sr. Member
Activity: 270
October 23, 2011, 02:04:25 PM
 #1

I have around 20 GPU cards due to my little mining operation. Four of these cards have died - one appears to have been DOA, and the others died after very little use. None have been overclocked idiotically - and importantly, I *never* overvolt my cards.

Interestingly, these 'dead' cards pass the BIOS POST. I get nothing but a black screen, but the monitor doesn't say 'going to sleep' due to no signal - it's actually a real 'black screen'.

All of them will spin their fans up to normal speed when connected to power and PCIe in a test board. They also rev the fans up to full speed if I pull out the PCIe cable (all cables - power and PCIe-extender - are known to be good).

Very oddly, if I configure a normal system (Gigabyte H61M-D2-B3 board, one x16, three x1 PCIe slots, all filled using x1->x16 extenders) with one of the cards being the 'dead' card, Linux will recognise it when interrogating the PCI bus (i.e. all four cards are shown when lspci is executed).
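The bus check described above can be scripted so that a card dropping off the PCIe bus is spotted without eyeballing the output. This is only a rough sketch: it assumes graphics cards appear in lspci output as "VGA compatible controller" or "Display controller" lines, and the helper names (`gpu_lines`, `missing_gpus`, `read_lspci`) are made up for illustration.

```python
import subprocess
from typing import List

# Assumption: lspci labels graphics devices with one of these class names.
GPU_CLASSES = ("VGA compatible controller", "Display controller")


def gpu_lines(lspci_output: str) -> List[str]:
    """Return the lspci lines that describe graphics devices."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if any(cls in line for cls in GPU_CLASSES)
    ]


def missing_gpus(lspci_output: str, expected: int) -> int:
    """How many of the expected cards did not show up on the bus."""
    return max(0, expected - len(gpu_lines(lspci_output)))


def read_lspci() -> str:
    """Run lspci and return its text output (needs pciutils installed)."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout
```

On a four-card rig, `missing_gpus(read_lspci(), expected=4)` returning 0 would match what is described here: the "dead" card is still enumerated, so whatever is wrong only shows up once the driver touches it.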

It's only when I install the ATI drivers (I use 11.6 on Ubuntu 11.04 'natty') that any aticonfig command locks up. It doesn't lock the machine - I can ctrl-C out of the command. But there's something wrong with the card, and it interferes with other cards on the PCIe bus too.
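Since the command hangs but can still be interrupted, one way to test a suspect card without babysitting the terminal is to run the query under a timeout. A minimal sketch, assuming the hung process can be killed (which the ctrl-C behaviour suggests); `command_hangs` is a hypothetical helper, not part of any ATI tool.

```python
import subprocess
import sys
from typing import List


def command_hangs(cmd: List[str], timeout_s: float = 10.0) -> bool:
    """Return True if cmd does not exit within timeout_s seconds.

    subprocess.run kills the child when the timeout expires, so a
    stuck query does not linger. A missing binary still raises
    FileNotFoundError, which is left to the caller.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

For example, `command_hangs(["aticonfig", "--odgt", "--adapter=all"])` uses the temperature query mentioned later in this thread. Running it once with the suspect card fitted and once without would show whether that card is the one stalling the driver.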


Has anyone else had this behaviour, and does anyone else know how to fix it? I'd have thought that a burnt-out card would either fail to operate at all (i.e. no fan, no output from DVI port) or would mess up the logic board to the point of failing the POST. One Asus board I own has separate LEDs for failure points in RAM, CPU and GPU - if the RAM is faulty, the RAM lights up, if the GPU is faulty, the GPU lights up, etc. My 'dead' cards don't light up the GPU LED and the POST seems to complete fine.

What concerns me is that (a) I may be chucking away a perfectly usable card - where I may just need to flash the BIOS or something, or (b) that continuing to attempt to resuscitate the cards could burn out a logic board and three other expensive GPUs.

Incidentally, all but one of the dead cards were XFX brand. One was a 5850 Black Edition, which I was *hoping* would be an overclock-monster and a strong card for hashing... guess not :(

Is there anything I can look for that would be an easy fix? There don't seem to be any burnt-out visible components on the board.

Or is the GPU field yet another consumer-electronics scam where even the tiniest, fixable failure means 'throw it away and buy a new one'? :(

Mining profitability right now is *not* looking good so I'm not keen on buying new hardware just yet...

...so I give in to the rhythm, the click click clack
I'm too wasted to fight back...


BTC: 1A7HvdGGDie3P5nDpiskG8JxXT33Yu6Gct
ocminer
Legendary
Activity: 1568
October 23, 2011, 03:00:06 PM
 #2

In my experience, bad VRMs and MOSFETs on the card(s) cause these symptoms, and sometimes bad caps do too.

They no longer deliver a stable voltage under load, or the output is "dirty". You can see this very clearly with
an oscilloscope: on a card where everything is okay, you get a very stable, flat line which
represents the voltage, e.g. 12 V. On a card with a bad regulator/MOSFET/cap this line looks "fuzzy" or "noisy", which
makes the rest of the card freak out.
That's why they are okay without any load on them but start to crash/hang as soon as the driver tries to activate the card.
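If your scope can export its samples, the "flat line versus noisy line" comparison can be put into numbers as peak-to-peak ripple. This is a sketch only: the 0.12 V default threshold is an illustrative assumption, not a figure from this thread or from any card's datasheet, and both function names are hypothetical.

```python
from typing import Sequence


def peak_to_peak(samples: Sequence[float]) -> float:
    """Peak-to-peak ripple of a series of voltage samples, in volts."""
    return max(samples) - min(samples)


def looks_noisy(samples: Sequence[float], max_ripple_v: float = 0.12) -> bool:
    """True if the ripple exceeds max_ripple_v (assumed threshold)."""
    return peak_to_peak(samples) > max_ripple_v


# Made-up readings for illustration, not real measurements.
good_rail = [12.00, 12.01, 11.99, 12.00]
bad_rail = [12.0, 12.4, 11.6, 12.2]
print(looks_noisy(good_rail), looks_noisy(bad_rail))
```

Comparing a known-good card against a suspect one on the same rig and PSU tells you more than any absolute threshold.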

Unfortunately, without an SMD soldering/reflow station you won't be able to replace the VRMs on these cards. At best you can
swap a cap or a big MOSFET by hand, but you'd see if one of those was bad.

If you can RMA them, do it; otherwise you probably don't have much of a chance.

Try to feel whether the VRMs get really warm even under no load, i.e. when you start up the PC with the card plugged in. If they do, they are probably the cause.

suprnova pools - reliable mining pools - #suprnova on freenet
https://www.suprnova.cc - FOLLOW us @ Twitter ! twitter.com/SuprnovaPools
tritium
Member
Activity: 86
October 23, 2011, 03:17:39 PM
 #3

Have you heard of the oven fix? It's generally seen as a last resort (after BIOS flashes etc.), but it's where you put the card (minus HSF, shroud and all the paste) upside down on a foil-covered baking tray in the oven, propped up with foil balls, at 200C for about 10 mins.

It is supposed to reflow the solder and sort out any hairline cracks, but it is very hit and miss.

1FCzN34C1xCLsDaLxfY7yB5CQKN74ruGHV
RyNinDaCleM
Legendary
Activity: 1988
October 25, 2011, 02:23:54 AM
 #4

Quote from: tritium on October 23, 2011, 03:17:39 PM
...upside down...

This is dangerous! The GPU die may fall off.

This can work, but I'd do it back side down, myself!

PatrickHarnett
Hero Member
Activity: 518
October 25, 2011, 03:58:15 AM
 #5

Before doing any opening or oven fix, do the RMA process if available.

Also, try connecting via another PC to see if it is dead or not - I use RealVNC on a couple of XP machines.  That should show if it's still producing a picture.

How far past the POST screen are you getting before it all goes black?
catfish
Sr. Member
Activity: 270
November 08, 2011, 08:09:38 PM
 #6

OK - I know I'm resuscitating an old thread, but it was mine. So I think I'm allowed to.

A month on, the original 4 cards that died have been joined by two more.

Guess what? Both were XFX brand cards.

So far, out of the batch of three XFX 5850 'new style' cards I bought from videocardshop.co.uk (with the spiral heatpipe fansinks - pretty much identical to the new XFX 5830, but faster due to the better GPU), only one is still operating.

Not to be outdone, out of their slower brothers, the XFX 5830 'new style' cards (of which I bought four, foolishly), only two are still working. I may be able to cut the failures here down to one, since one of these cards was spiking up to 100˚C randomly, and on close inspection it clearly had a dry joint on the connector for the fan on the GPU board. So assuming the GPU isn't irrevocably toasted, re-soldering the fan connector *may* fix it.

However, out of 7 new-model XFX cards (three 5850s and four 5830s), all using the same basic card design and identical spiral heatpipe fansinks, I've had a 66.7% failure rate on the 5850s and a 50% failure rate on the 5830s.
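For anyone checking the arithmetic, here is the same calculation from the counts given in this post (three 5850s with one survivor, four 5830s with two survivors). The batch labels and function names are just illustrative.

```python
from typing import Dict, Tuple

# (cards bought, cards still working), counts taken from the post above
BATCHES: Dict[str, Tuple[int, int]] = {
    "XFX 5850 new style": (3, 1),
    "XFX 5830 new style": (4, 2),
}


def failure_rate(bought: int, working: int) -> float:
    """Percentage of cards in a batch that have failed."""
    return 100.0 * (bought - working) / bought


def overall_failure_rate(batches: Dict[str, Tuple[int, int]]) -> float:
    """Failure percentage across all batches combined."""
    bought = sum(b for b, _ in batches.values())
    working = sum(w for _, w in batches.values())
    return failure_rate(bought, working)


for name, (bought, working) in BATCHES.items():
    print(f"{name}: {failure_rate(bought, working):.1f}% failed")
print(f"combined: {overall_failure_rate(BATCHES):.1f}% failed")
```

Combined, that is 4 failures out of 7 cards, or roughly 57%. With batches this small a single card swings the percentage a lot, so treat the figures as a warning sign and not a statistic.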

The other 'fast' XFX cards I own are a pair of XFX 5850 Black Edition cards, which I was hoping to be good mining cards. Failure rate on these is 50%, with one card preventing two of my test logic boards from even making a POST (no beeps, lights, nothing)... I'm not putting that card anywhere *near* my production miners just in case it trashes the logic board.

Oddly, I've got four XFX 5770 single-slot cards - which are amazingly decent (cross fingers... don't want instant failures...) - I've loaded a tower-case PC with all four cards and it's quiet, reliable and low-power. So it's not as if I should have learned my lesson about XFX back when I bought the first Black Edition 5850s and found the quality terrible. The 5770 single-slot cards may not appear or feel 'military grade' but they're very reliable, so far (non-stop mining, overclocked, undervolted).

Anything higher power from XFX - 5830s, 5850s, of all types - has had absolutely unacceptable failure rates here at Catfish Towers. I have to speak to videocardshop.co.uk but having done some research on the Internet, the Americans have found XFX quality to be similarly appalling and the general experience appears to be endless RMAs.

I'm a Mac OS X guy mainly (and before Bitcoin) so virtually all of my kit is made by Apple. Hence I haven't had to do 'RMAs' at all - there are 11 Mac computers in this office (my cellar...) and only one ever had to be complained about to Apple (at an Apple Store). They gave me a deal-you-can't-refuse on a new Mac Pro as a result (my Quad G5's liquid cooling system started losing efficiency). So I'm not used to (a) DOA or poor quality kit that fails after a week or less; or (b) arguing with telephone retailers about whether I justify a refund or not.

The cards were bought purely to run Bitcoin mining OpenCL kernels. Unless it says that the cards aren't suited for this on the box, I think it's OK to assume that the cards should work - albeit at a slightly reduced lifespan. Remember that I have open-frame shelf rigs with plenty of airflow, so I'm not cooking my cards. However if some retail telephonist decided to accuse *me* of being the cause of the cards failing, then I'd either lose my rag beyond all belief (getting me nowhere), or have no real answer other than 'why are all the dead cards your XFX models, and the rest still work?'...

I've scanned the American NewEgg website for my particular XFX cards, and checked the feedback. Ridiculous numbers of people got duff cards and had to send them back, at their expense, to XFX. But even though this cost each customer $50 or whatever, they still gave good feedback. I guess customer service in general must be pretty lousy in the USA, but I'm not playing this game.

I noticed that my XFX cards appeared to be 'used' before I installed them - some had the cling-film on the card front removed already, some had the accessories (DVI-VGA adapter, etc.) removed, and some honest chap had put *three* current PC games back in the box (if anyone wants these games - give me a PM, as I'm an old bloke with loads of Mac machines and I loved PC first person shooter games AGES ago, for example my favourites ever were Heretic, Quake 2 and Half-Life, I don't play games any more).

So perhaps every XFX card I bought was a returned item, which could have been returned dishonestly after an overclocking experiment gone wrong, or worse a BIOS flash gone wrong. These items were marked 'new' by the retailer, so maybe I've got a leg to stand on, but being honest, I bought them to mine BTC - i.e. tuned to run at their maximum. Mine haven't failed after gross overclocking - two were DOA, the others failed within a few days of *non* overclocked (but memory underclocked) work. This *reduces* temperatures...

The problem is that I have two 6950s from XFX... neither has failed yet, but I'm concerned. A large part of my hash power would evaporate if all XFX cards died. Half of them already have; I can't handle many more failing.


This isn't a pop at videocardshop.co.uk - I'll have a public pop at them if they tell me to eff off when it comes to my complaints, but since I haven't spoken to them about these XFX cards, they may be very understanding. However, it's a big warning to any of you thinking you can save a bit of money by buying these XFX cards in the UK from this retailer. The prices are very attractive... but my sample size is big enough now to make a statement about *why* the prices are so attractive. Stay away from XFX graphics cards. Even with open frames, loads of cool high-flow air, and plenty of PSU headroom, just check out the percentage failure rates above (Unless you go for the single-slot 5770s, which seem to all handle 200 MH/s each happily at 62-70˚C... and seem perfectly reliable. Odd)...


Anyone else found identical issues with XFX brand graphics cards? My best cards are the two Asus DirectCU 6950s, but apart from two expensive cards (too small a sample size), the brand I've had sustained success with is Sapphire... the original 5850 has been running 990 MHz for months, the 5850 'extreme' cards are still hammering away (well they will be, I'm about to replace the XFX cards with them), and the dual-slot 5770s clock to the moon.

And if Sapphire are known to be 'bad' - what would be recommended if you were considering building another 12-GPU shelf and wanted to build it and forget it... i.e. not the daily messing around I'm currently experiencing, not knowing whether my logic boards are failing or whether the GPU has failed / is rescuable / needs killing with fire.

I'm getting pissed off now, back to 'this is why I started using Macs' - whereas the initial PC-building was *entertaining*. The XFX boards have trashed it. Am I just unlucky - or is it well-known that bitcoin miners should avoid XFX boards?


Lastly... when graphics boards do this, is there any hope of recovering them, maybe by refinishing the GPU / heatsink interface, cleaning the whole thing up, etc. (none of the boards show burnt-out components or anything obviously knackered) - or is the damage likely internal and unfixable? Looks like around £1,000 total loss right now. Grrr.

Huh

cicada
Full Member
Activity: 196
November 08, 2011, 08:40:43 PM
 #7

It's likely XFX expected the newer 58xx series cards to be big mid-budget sellers, versus the 5970 (not sure they make one), or the considerably more expensive 68/69xx series, and as such really reined in the component/QC budget to maximize profit in that range.

RMA is going to bite them in the ass for it, especially with their lifetime warranty, but for them it's a balance of failure rate versus profit made on the trouble-free cards.  It's likely their wider reported failure rate on the newer 58xx series cards is within the 10-15% industry standard.

I have 4 XFX 6950s that have been rock solid for me, apart from one of them running a bit warmer (~10C) than the others under the same conditions.  My sample size is small, so I can't say much, but I'm pretty happy with these cards.  All my other cards are Sapphire 5830s, and have been nothing short of amazing.

I never considered the XFX 58xx cards, just due to the poorer reviews compared to their Sapphire and ASUS brethren at the same price.  Guess it's a good thing I didn't ;)

Team Epic!

All your bitcoin are belong to 19mScWkZxACv215AN1wosNNQ54pCQi3iB7
P4man
Hero Member
Activity: 504
November 08, 2011, 10:06:15 PM
 #8

Two things come to mind:

First of all, are you by any chance using GPU-Z and Everest to monitor temps? If so, I know what happened and I'll give you the long version. The short version is: NEVER run both programs at once. Ever. It causes random spikes of vcore to 1.65 V. I killed a 5850 that way.

Secondly, what are your temps like? Are these reference cards with the stock cooler? You may be cooking them. Despite what some people claim, running your cards at 90C (for DispIO, which could be 100+ for the shader) is not safe, nor are 100+C VRM temps 24/7. Reference coolers seem unable to avoid these temps, which is why I avoid reference coolers.

catfish
Sr. Member
Activity: 270
November 08, 2011, 10:32:19 PM
 #9

Quote from: P4man on November 08, 2011, 10:06:15 PM
Two things come to mind:

First of all, are you by any chance using GPU-Z and Everest to monitor temps? If so, I know what happened and I'll give you the long version. The short version is: NEVER run both programs at once. Ever. It causes random spikes of vcore to 1.65 V. I killed a 5850 that way.

Secondly, what are your temps like? Are these reference cards with the stock cooler? You may be cooking them. Despite what some people claim, running your cards at 90C (for DispIO, which could be 100+ for the shader) is not safe, nor are 100+C VRM temps 24/7. Reference coolers seem unable to avoid these temps, which is why I avoid reference coolers.

Nope, I'm running Linux and Mac OS X. No windows in my office (literally as well as metaphorically). And I know about the bug in the Windows Catalyst drivers that causes the spike. I don't think it applies to the Linux Catalyst drivers because I've run AMDOverdriveCtrl *and* aticonfig --odgt --adapter=all at the same time, without problems.

None of my cards are reference cards. My systems aim to keep all cards in a range between 60˚C and 80˚C, with sustained running above 80˚C being undesirable (but unavoidable with the 6950s until I flash the BIOS of *all* of them to permit 300 MHz memory clocks. On Linux I'm stuck with a memory clock no more than 100 MHz below the core clock, so a light OC at 900 MHz core means a mem clock of 800 MHz, which wastes power and generates excess heat).
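A temperature band like this is easy to watch automatically by parsing the output of the aticonfig --odgt --adapter=all query mentioned above. Rough sketch only: it assumes each adapter is reported as an "Adapter N - name" line followed by a "Sensor 0: Temperature - XX.XX C" line, which may differ between driver versions. The function names and the sample text are made up for illustration.

```python
import re
from typing import Dict, List

# Assumed output layout; adjust the patterns if your driver prints differently.
ADAPTER_RE = re.compile(r"^Adapter (\d+) - ")
TEMP_RE = re.compile(r"Temperature - (\d+(?:\.\d+)?) C")


def parse_temps(text: str) -> Dict[int, float]:
    """Map adapter index to reported temperature in degrees C."""
    temps: Dict[int, float] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        adapter = ADAPTER_RE.match(line)
        if adapter:
            current = int(adapter.group(1))
            continue
        temp = TEMP_RE.search(line)
        if temp and current is not None:
            temps[current] = float(temp.group(1))
    return temps


def too_hot(temps: Dict[int, float], limit_c: float = 80.0) -> List[int]:
    """Adapters running above limit_c (80 C is the upper bound quoted above)."""
    return sorted(a for a, t in temps.items() if t > limit_c)


# Illustrative text, not captured from a real rig.
SAMPLE = """
Adapter 0 - ATI Radeon HD 5800 Series
            Sensor 0: Temperature - 71.00 C
Adapter 1 - AMD Radeon HD 6900 Series
            Sensor 0: Temperature - 84.50 C
"""
print(too_hot(parse_temps(SAMPLE)))
```

Running this every minute or so from cron and logging the result would also have caught the card that was randomly spiking to 100˚C.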

I'm sure as hell not running my cards at 90˚C. I've built all sorts of random shit to prevent this. My first nearly-finished rig is reliable, stays within my heat range, and is reasonably efficient. It has mostly Sapphire, Peak (?) and Asus cards in it:



I'm sure you've seen it before. I'm trying to build another one - I have the logic boards, the PSUs, the cables, but getting the GPUs (which are mostly XFX brand) to actually *work* has wasted nearly 2 weeks now... :(

P4man
Hero Member
Activity: 504
November 08, 2011, 10:36:10 PM
 #10

I hadn't seen it actually, but it looks very neat, well done! I guess my two ideas were wrong then, but I thought I'd float them anyway.

JonHind
Full Member
Activity: 126
November 08, 2011, 11:10:29 PM
 #11

I had a similar issue myself, which was traced to a faulty PSU. The PSU sent intermittent spikes down the rails, which I believe is why 2 of my original cards died. Since the PSU was changed, I've had no more problems.