Whilst I popped over to Switzerland for a quick break (just over a week), I obviously had to shut the windows and doors in my house. Normally, the mining rigs, all in the rear spare bedroom, get cool air from having the windows wide open. A bunch of standing fans shift the air around and, along with convection, pull in enough cool outdoor air to carry away the 3 kW of heat the GPUs dissipate.
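For a rough sense of how much outdoor air that is, here is a back-of-envelope sketch. It assumes all 3 kW leaves with the exhaust air, uses textbook values for air density and specific heat, and takes a 10 K rise between incoming and outgoing air as an illustrative guess rather than anything I measured. The function name is made up for this example.

```python
# Back-of-envelope only: assumes every watt leaves the room in the exhaust air.
AIR_DENSITY = 1.2          # kg/m^3, roughly room-temperature air at sea level
AIR_SPECIFIC_HEAT = 1005   # J/(kg*K)


def required_airflow_m3_per_s(heat_watts, temp_rise_kelvin):
    """Volume of outdoor air per second needed to carry away heat_watts
    while warming by temp_rise_kelvin on its way through the room."""
    if temp_rise_kelvin <= 0:
        raise ValueError("temp_rise_kelvin must be positive")
    return heat_watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * temp_rise_kelvin)


# 3 kW from the post; the 10 K rise is an assumption for illustration.
flow = required_airflow_m3_per_s(3000, 10)
print(f"{flow:.2f} m^3/s")
```

With those assumptions it comes out at about a quarter of a cubic metre of fresh air per second, which gives an idea of what the room loses once the windows are shut.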
I've got a webpage that constantly monitors the hash rate, results accepted / rejected, temperature and the last Phoenix result message - across all GPUs in all rigs - and could watch for any problems using the internet in my apartment. As expected, temperatures jumped *significantly* without outdoor air.
Concerned, a couple of days in I logged in remotely and reduced the overclocks significantly, and at no point did any of the GPUs exceed 84˚C. I know this is high, and my *normal* setup keeps the cards below 75˚C. Only a couple of cards (5830s) get that hot; the rest are usually 60-something.
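The kind of check I'd want the monitoring page to make automatically could look something like the sketch below. It is only an illustration: the two limits are the 75˚C and 84˚C figures above, while the function name, GPU labels and sample readings are all made up and have nothing to do with how my page actually works.

```python
NORMAL_MAX_C = 75   # usual ceiling with the windows open (from the post)
ALERT_MAX_C = 84    # hottest any card got while I was away (from the post)


def classify(temps_by_gpu):
    """Map each GPU label to 'ok', 'warm' or 'hot' based on its core temperature."""
    status = {}
    for gpu, temp_c in temps_by_gpu.items():
        if temp_c > ALERT_MAX_C:
            status[gpu] = "hot"
        elif temp_c > NORMAL_MAX_C:
            status[gpu] = "warm"
        else:
            status[gpu] = "ok"
    return status


# Hypothetical readings, purely for illustration.
sample = {"rig1-gpu0": 68.0, "rig1-gpu1": 80.5, "rig2-gpu0": 86.0}
result = classify(sample)
for gpu, state in result.items():
    print(gpu, state)
```

Note that this only looks at the core temperature, which is all the monitoring reports.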
However, a few cards locked up after a few days, despite the reduced overclock and the temps never getting *stupidly* high.
When I returned, I expected I'd just be able to hard reboot the boxes and everything would be OK. This was not the case. After wasting a LOT of time diagnosing why two rigs wouldn't boot, I found that one 5850 (an XFX Black Edition), and one 5830 (a 'value' card from overclock.co.uk) were broken.
I haven't managed to get them to work in any other machine either - if they're plugged into the PCIe socket, then as soon as the OS probes the cards, everything hanging off the bus locks up. In particular, the onboard ethernet dies, which makes further diagnostics impossible (well, I pulled the USB flashdrive, mounted it in a working Linux box, and examined the logs - but there was nothing in syslog or dmesg before the lockup that indicated anything was wrong).
My question is - has anyone else experienced hardware failures like this? The accepted wisdom here seems to be that the Radeon 58xx GPUs can run up to 110˚C, at which point they thermally throttle to prevent damage. I've even seen comments like 'just keep it below 90 and they'll be OK'.
Yes, I overclock... but would this permanently break a GPU if the core temperature never exceeded 84˚C? In terms of 'that's core temp - the RAM may be hotter' - I underclock *every* card to 275 MHz memory clock. RAM should be fine, no?
What other critical components on Radeon cards could overheat and fail even if the *core* temperature was well within safe limits?
Of course, it may just be bad luck, but if not, I'd like to ensure it doesn't happen again. I'm replacing the two dead cards with a couple of the 6xxx series as an experiment - all my cards are 5xxx series at the moment.