Bitcoin Forum
Topic: Hardware Failures at reasonable temperatures?  (Read 809 times)
catfish
Sr. Member
Activity: 270
teh giant catfesh

September 11, 2011, 05:07:01 PM
#1

Whilst I popped over to Switzerland for a quick break (just over a week), I obviously had to shut the windows and doors in my house. Normally the mining rigs, all in the rear spare bedroom, get cool air from having the windows wide open: a bunch of standing fans shift the air around and, along with convection, draw cool outdoor air in to deal with the 3 kW of heat dissipated by the GPUs.

I've got a webpage that constantly monitors the hash rate, accepted/rejected results, temperature and the last Phoenix result message across all GPUs in all rigs, so I could watch for any problems over the internet from my apartment. As expected, temperatures jumped *significantly* without outdoor air.
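
For anyone wanting to do something similar, the idea is roughly this - a minimal sketch, not my actual page code. It assumes aticonfig is on the PATH and that "aticonfig --odgt --adapter=all" prints a "Temperature - XX.XX C" line per adapter; the threshold and polling interval here are arbitrary:
Code:
# Rough sketch of a GPU temperature poller - not the real monitoring page.
# Assumes aticonfig prints lines like "Sensor 0: Temperature - 72.50 C".
import re
import subprocess
import time

TEMP_LIMIT_C = 80.0   # arbitrary alert threshold
POLL_SECONDS = 60

def gpu_temps():
    out = subprocess.check_output(
        ["aticonfig", "--odgt", "--adapter=all"], universal_newlines=True)
    return [float(t) for t in re.findall(r"Temperature\s*-\s*([\d.]+)\s*C", out)]

while True:
    for idx, temp in enumerate(gpu_temps()):
        flag = "  <-- HOT" if temp >= TEMP_LIMIT_C else ""
        print("GPU %d: %.1f C%s" % (idx, temp, flag))
    time.sleep(POLL_SECONDS)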

Due to the concern, a couple of days in, I logged in remotely and reduced the overclocks significantly, and at no point did any of the GPUs exceed 84˚C. I know this is high, and my *normal* setup keeps the cards below 75˚C. Only a couple of cards - 5830s - get that hot - the rest are usually 60-something.

However, a few cards locked up after a few days, despite the reduced overclocks and the temperatures never getting *stupidly* high.


When I returned, I expected I'd just be able to hard reboot the boxes and everything would be OK. This was not the case. After wasting a LOT of time diagnosing why two rigs wouldn't boot, I found that one 5850 (an XFX Black Edition), and one 5830 (a 'value' card from overclock.co.uk) were broken.

I haven't managed to get them to work in any other machine either - if they're plugged into a PCIe socket, then as soon as the OS probes the card, everything hanging off the bus locks up. In particular, the onboard ethernet dies, which makes further diagnostics impossible (well, I pulled the USB flash drive, mounted it in a working Linux box and examined the logs, but there was nothing in syslog or dmesg before the lockup to indicate anything was wrong).


My question is - has anyone else experienced hardware failures like this? The accepted wisdom here seems to be that the Radeon 58xx GPUs can run up to 110˚C, at which point they thermally throttle to prevent damage. I've even seen comments like 'just keep it below 90 and they'll be OK'.

Yes, I overclock... but would this permanently break a GPU if the core temperature never exceeded 84˚C? In terms of 'that's core temp - the RAM may be hotter' - I underclock *every* card to 275 MHz memory clock. RAM should be fine, no?

What other critical components on Radeon cards could overheat and fail even if the *core* temperature was well within safe limits?


Of course, it may just be bad luck, but if not, I'd like to ensure it doesn't happen again. I'm replacing the two dead cards with a couple of 6xxx-series cards as an experiment - everything is 5xxx series at the moment :)

TIA, catfesh

...so I give in to the rhythm, the click click clack
I'm too wasted to fight back...


BTC: 1A7HvdGGDie3P5nDpiskG8JxXT33Yu6Gct
ovidiusoft
Sr. Member
Activity: 252

September 11, 2011, 05:14:17 PM
#2

Overheated PSUs or even motherboards could do that to your cards - I would suspect the PSUs. If the card temps stayed under 85˚C, the GPUs themselves should have been OK; as you said, a lot of people run them at 100˚C or even more.
Dargo
Legendary
Activity: 1554

September 11, 2011, 06:01:31 PM
#3

Did you monitor your VRM temps as well? Unless you are overvolting, I wouldn't think this is the problem. Maybe just bad luck with those cards, but if so it is strange that more than one died.
catfish
Sr. Member
Activity: 270
teh giant catfesh

September 11, 2011, 06:52:49 PM
#4

Could be PSUs. The rig that didn't kill any cards had my best PSU - the 1000W Coolermaster unit.

That said, the other rig (containing two logic boards and two PSUs - one card from each was killed) had a Corsair 800W and a Coolermaster 600W unit.

The 800W ran two 5850s, one 5830 and a 5770. The 600W ran two 5830s. Fag packet calcs suggest the PSUs had enough headroom - if anything, the 1000W unit in the single-board rig is closer to its limits (two 5850s, two 5830s and a 5770).
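
To show my working - a rough sketch of the fag-packet sums. The per-card wattages (150/175/110 W) and the 100 W allowance for board, CPU and fans are ballpark guesses, not anything I've measured at the wall:
Code:
# Back-of-envelope PSU loading - per-card draws are guesses, not measurements.
CARD_WATTS = {"5850": 150, "5830": 175, "5770": 110}   # rough full-load figures
OVERHEAD_W = 100                                        # board, CPU, fans (guess)

rigs = {
    "Corsair 800W":       (800,  ["5850", "5850", "5830", "5770"]),
    "Coolermaster 600W":  (600,  ["5830", "5830"]),
    "Coolermaster 1000W": (1000, ["5850", "5850", "5830", "5830", "5770"]),
}

for name, (rating, cards) in rigs.items():
    load = sum(CARD_WATTS[c] for c in cards) + OVERHEAD_W
    print("%s: ~%d W estimated vs %d W rated (%.0f%% of rating)"
          % (name, load, rating, 100.0 * load / rating))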

However, the entire first floor of my house felt like a sauna when I returned. Ambient temperatures were probably well above 40˚C :)

Now, the cards and CPUs may be able to take high temperatures, but if PSUs crap out when all they have is *really* hot air to work with, then that could explain it... then again, PSUs live *inside* PC cases too, where the air is usually hotter than ambient, and these PSUs were in open-frame rigs...

Also - how do you monitor VRM temperatures, short of standing next to the card with an IR laser-pointer thermometer and picking off temps at various locations? The standard aticonfig output doesn't give VRM temps.

Besides, I've swapped the dead cards with whatever else I could find (not as fast, obviously) so all PCIe slots are still full, and now both boards are running happily.

Well, apart from the 5770 which thinks it has a core temperature of 510˚C (no typo) and a clock speed of 0, utilising 200% GPU load... I'm hoping this isn't *another* failed card - it's in the same rig.

I didn't think logic boards were put under as much stress by mining as the GPU cards themselves? It'd be a bugger if the logic board had blown somehow, as that could mess up ALL the cards... Still a bit of a mystery to me - Linux usually has very comprehensive logs, but there's *nothing* to say what stopped working. Very odd.

...so I give in to the rhythm, the click click clack
I'm too wasted to fight back...


BTC: 1A7HvdGGDie3P5nDpiskG8JxXT33Yu6Gct
Dargo
Legendary
Activity: 1554

September 11, 2011, 07:30:43 PM
#5

It's hard to monitor VRM temps. In linux you can do it with Radeonvolt, but this only has support for a few types of cards (reference 5850s for example). I was just tossing this out as an example of a component that can overheat badly even though core temps are within acceptable limits (but usually only a problem if you are over stock voltage). Hard to say what happened, but if your house was sauna-hot, that can't be good. Personally, I think I would just turn the rigs off while away rather than have the risk of running them in such hot ambient temperatures again. Or at least undervolt them as much as possible to keep them running cool. I have my 5850s undervolted to 0.95 VDC for efficiency, and they run much cooler that way.   
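
If you want to script the check, something roughly like this would do - it assumes radeonvolt is installed and run as root, and that its default output includes lines mentioning the VRM temperatures for supported (reference-design) cards; check the output format on your own hardware first:
Code:
# Quick-and-dirty VRM temperature check via radeonvolt (run as root).
# Assumes the default output prints VRM temperature lines for supported
# reference-design cards - verify the format on your own card.
import subprocess

out = subprocess.check_output(["radeonvolt"], universal_newlines=True)
for line in out.splitlines():
    if "vrm" in line.lower():
        print(line.strip())
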
catfish
Sr. Member
Activity: 270
teh giant catfesh

September 11, 2011, 08:07:00 PM
#6

Quote from: Dargo on September 11, 2011, 07:30:43 PM
It's hard to monitor VRM temps. In linux you can do it with Radeonvolt, but this only has support for a few types of cards (reference 5850s for example). I was just tossing this out as an example of a component that can overheat badly even though core temps are within acceptable limits (but usually only a problem if you are over stock voltage). Hard to say what happened, but if your house was sauna-hot, that can't be good. Personally, I think I would just turn the rigs off while away rather than have the risk of running them in such hot ambient temperatures again. Or at least undervolt them as much as possible to keep them running cool. I have my 5850s undervolted to 0.95 VDC for efficiency, and they run much cooler that way.
Understood and wisely put... but running the cards I've managed to acquire at standard clocks is crazily slow (without enough power-consumption reduction to justify it).

For instance, my 5830s, in standard trim, mine at around 240 MH/s. My stable, standard voltage overclock gives a minimum of 300 MH/s.

My 5850s are even crazier - in standard trim, the normal output is around 280-290 MH/s. The lamest 5850 I own is an XFX Black Edition and that OCs to 380 MH/s. My old Sapphire non-extreme 5850 has been running over 400 MH/s since I first fired it up. That's a BIG drop in hash rate to lose.

Turning the rigs off while away would have cost me money - I was gone for 10 nights / 11 days, and the BTC mined in that time (even with some of the cards crashing and going offline) meant 4 ounces of silver bullion waiting in the post when I returned. That's why I'm doing this...


Better cooling seems the only answer - unless there's some other overlooked issue, like not having good enough PSUs, or poor quality PCIe extenders? I'm using Corsair, Cooler Master and one Antec PSU. No hard drives or any other power-hungry peripherals - all USB flash drives running a hacked Linux.


I'm not sure I'm ready to drill huge holes in the wall to install extractor fan systems *just* yet - after all, bitcoin mining could become utterly unprofitable in the near future if BTC go to zero. That's partly why I want to max out the kit now, whilst I can still make tangible value from the processing.

Just seems odd that cards that aren't overheating to silly temps are having trouble.


Actually - perhaps it's because I insist on populating every logic board 100% with GPUs, so EVERY PCIe slot is being asked to deliver power and the logic board itself may have to send up to 300W (on a 4-slot board) to PCIe slots that really weren't designed for it - especially when only one is touted as a GPU slot and the other three are x1 peripheral slots.

Maybe replacing every extender cable with a Molex-augmented one would help - has anyone seen x1 -> x16 extenders *with Molex* power input? I've got some very nicely built x1 -> x16 extenders, but they don't have any power input. Would that *genuinely* take load off the logic board, so that whatever supplies the 75W to each PCIe slot no longer has to?
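
Putting rough numbers on it - using the 75W figure everyone quotes for an x16 graphics slot, and assuming (as far as I know the spec goes) that x1 slots are only rated for around 25W, which would make it worse, not better:
Code:
# Rough slot-power budget - spec ceilings as I understand them, not measurements.
SLOT_LIMIT_W = {"x16": 75, "x1": 25}   # approx. per-slot PCIe limits (assumed)
CARD_SLOT_DRAW_W = 75                  # worst case a GPU may pull from its slot

slots = ["x16", "x1", "x1", "x1"]      # the 4-slot board layout described above
print("Worst-case power through the board: %d W" % (CARD_SLOT_DRAW_W * len(slots)))

for kind in slots:
    margin = SLOT_LIMIT_W[kind] - CARD_SLOT_DRAW_W
    print("%s slot: ~%d W limit, card may want %d W -> margin %d W"
          % (kind, SLOT_LIMIT_W[kind], CARD_SLOT_DRAW_W, margin))

# A Molex-fed extender would shift that 75 W per card onto the PSU's 12V
# rails instead of pulling it through the logic board's slot circuitry.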

...so I give in to the rhythm, the click click clack
I'm too wasted to fight back...


BTC: 1A7HvdGGDie3P5nDpiskG8JxXT33Yu6Gct
Dargo
Legendary
Activity: 1554

September 11, 2011, 08:45:13 PM
#7

Yeah, I totally understand about the lost $ from shutting off your rigs. But you also lose $ from dead GPUs. Better cooling is probably your best bet. Unless you want to run AC while you are gone, I would try some kind of fresh air intake/exhaust.

Also, you might try installing Atitweak (search the forum for it) and see what kind of hash rates you can get at lower voltage. My 5850s get about 300 MH/s at 0.95 VDC (750 core, 300 mem, Phoenix with aggression at 9). I think they do about the same as yours at stock settings, and about the same overclocked. Undervolted, they run about 55C at room temp (75F or so, in an open-air rig with a box fan blowing across the cards). Perhaps a compromise would be better cooling plus running the cards at lower voltage while you are away.

Since you were monitoring core temps, it may not be the cards that are the problem - maybe your PSU/mobo instead. You are running your PSUs pretty hard, and maybe they can't quite hack it at higher ambient temps (and of course your cards will suck more power at higher temps too). Your cards will consume *a lot* less power undervolted, so this would also put less stress on the other components of your rigs. With four 5850s in my rig, I'm only pulling 450 watts from the wall at 0.95 VDC.

I've been away for a total of about two weeks this summer, and each time before I leave I think about leaving my rig on, but so far I've chickened out and just shut it off. But my apt gets insanely hot during the summer unless I leave the AC on 24/7 while I'm away. Also, did you check cablesaurus for 1-->16 connectors with Molex? If they don't have them, probably nobody does.
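
Coming back to the power point, rough efficiency sums: the 450 W wall reading and 300 MH/s per card are my numbers from above, but the stock-voltage overclocked figures (380 MH/s per card, 750 W at the wall) are just guesses for comparison:
Code:
# MH/s per watt - the 450 W figure is measured, the stock-voltage row is a guess.
configs = [
    ("undervolted (0.95 VDC)",  4 * 300, 450),   # four 5850s, measured at the wall
    ("stock volts, OC (guess)", 4 * 380, 750),   # guessed wall draw, NOT measured
]

for label, mhs, watts in configs:
    print("%s: %d MH/s / %d W = %.2f MH/s per watt"
          % (label, mhs, watts, float(mhs) / watts))
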
catfish
Sr. Member
Activity: 270
teh giant catfesh

September 11, 2011, 09:32:50 PM
#8

Thanks for the advice Dargo - much appreciated.

I'm in the UK, and most of us don't have AC in our homes... besides, the additional cost of running home AC just to help out hot mining rigs is financially untenable in most cases.

I've heard stories of people digging their gardens up to run ground heat-sink hacks with a load of plastic piping, an old central heating pump and a few old heat exchanger matrices, but it's too much aggro for me.

Given that all rigs are back up and running at full overclocks, with temperatures between 56 and 81˚C (the 81 is the outlier - one spare logic board with a single XFX 5770 single-slot thin card overclocked to generate 220 MH/s) - the cards seem to be able to take it.

Perhaps I should just think about uprating all the PSUs, or actively add cooling to the PSUs.


I'm aware that my current 'use big fans to blow air over the rigs' approach could be catastrophically inefficient - fluid dynamics is an incredibly complicated subject, and without proper measuring equipment I can't tell whether the fans I'm using are creating 'dead spots' in the rigs, especially given that the GPU cards have their own fans and (presumably) professionally designed ducting.

Has anyone simply removed the ducting and fans from their fancy GPU cards, and relied on external forced air cooling? That would, at least, prevent any possibility of 'cavitation' or more likely inefficiencies in the fan flow rates...

...so I give in to the rhythm, the click click clack
I'm too wasted to fight back...


BTC: 1A7HvdGGDie3P5nDpiskG8JxXT33Yu6Gct
bcforum
Full Member
Activity: 140

September 11, 2011, 09:44:02 PM
#9


Bummer on the dead cards. In the future I'd suggest investing in some form of outside ventilation, for example mount a bunch of 120mm fans to a sheet of metal and screw it to the window opening. I have a 20 inch box fan (50cm?) mounted in an open window with bars on the outside to prevent someone from breaking in. It runs 24/7 while mining, but I'll shut it down when the weather turns cold.

If you found this post useful, feel free to share the wealth: 1E35gTBmJzPNJ3v72DX4wu4YtvHTWqNRbM