Title: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: x1010101x on January 17, 2012, 04:44:35 AM

Hi. I've got a dedicated mining rig that experiences lots of lockups and a high reject rate. I need help figuring out what the problem(s) are and why they happen.
I'll pay 3 BTC to the first person whose suggestions fix the issue, or at least help lower my lockups and reject rate. If a bunch of folks have valid suggestions, I'll split the bounty among them. This bounty is valid for the next 72 hours, so it ends on Friday, January 19th 8PM PST.

Ok, the two issues are:
My hardware info:
My software info:
Clocks on each GPU:
Code:
user@linuxcoin:~$ sudo aticonfig --odgc && sudo aticonfig --adapter=all --odgt && sudo aticonfig --pplib-cmd 'get activity'

Fan: 80%

Phoenix miner settings on each GPU:
Code:
WORKSIZE=256 VECTORS BFI_INT AGGRESSION=9 -k phatk2

Smartcoin status screen after running the rig for 24 hours with no lockups:
Code:
Smartcoin r657s 10:00:56

Rig pic, my camera sucks: https://i.imgur.com/iQZpG.jpg

Other hints:
My ideas to fix lockups:
My ideas to fix high reject rate:
I know the simplest answer would be to lower my core clock, but I'd prefer not to do that if possible. Let me know if more info is needed, thanks all!

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: deepceleron on January 17, 2012, 09:46:54 AM

Upgrade to Phoenix 1.7.3 and you'll see the reject rate drop quickly. You could also switch to a pool (http://ozco.in/) with known low rejects that doesn't charge the highest fees in the pool biz (which is good for another 3%-10% "boost" to your bottom line).
Not what you want to hear, but I would run the cards closer to stock speed; then, if it still locks up, the overclock is at least eliminated as a cause of your problem. Pushing a card too hard would get me a random reboot every few days.

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: P4man on January 17, 2012, 10:05:18 AM

Quote:
Not what you want to hear, but I would run the cards closer to stock speed; then, if it still locks up, the overclock is at least eliminated as a cause of your problem. Pushing a card too hard would get me a random reboot every few days.

This. Run at stock speed and see what happens. Overclocks don't last forever; electromigration will reduce the stable overclocking speed over time (or kill the card outright). That time period is completely unpredictable: it could be weeks or decades. If you've seriously overheated the card, that isn't going to help; electromigration gets exponentially worse with temperature.

Also, clarify "lock ups". I have no experience with Phoenix, but on cgminer the rig will not lock up; just the cards will "die". That doesn't prevent me from SSH-ing in to the machine. Are you getting complete freezes? Have you checked the dmesg log?

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: jake262144 on January 17, 2012, 12:42:52 PM

*nods his head* P4, DeepC
When I was searching for maximum stable clocks for my cards I did notice that DEAD != DEAD. When my 6950 DCII crashes, it crashes like a ton of bricks, introducing lock-ups of a few dozen seconds to any interaction with the OS. Apparently, some bigshot kernel-mode code freezes the OS up when attempting to speak with the dead, until it times out and lets go. OTOH, necromancgminer has been doing a terrific job raising the 6770s from the dead with no fuss.

If the OS really froze up, the script wouldn't do you much good. Mind sharing your reboot-magic? I want to see how your script detects the "lock up" condition.

And yes, do lower your clocks. Near-death lock-ups and crazy stale counts suggest that at least one of your cards can't take the beating.

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: x1010101x on January 17, 2012, 03:18:13 PM

@deepceleron: Thanks, I'll give those suggestions a go. I'm not entirely opposed to lowering my core clock, especially if it will prolong the life of my cards.
@P4man
Quote:
Also, clarify "lock ups". I have no experience with phoenix, but on cgminer, the rig will not lock up, just the cards will "die". Doesn't prevent me from SSH-ing in to the machine. Are you getting complete freezes? Have you checked dmesg log?

Thanks for the info. By lockup, I don't mean that the rig itself completely locks up. It's pretty much as jake262144 described: it's a GPU hard lockup that freezes the miner. I can still SSH in, but the OS is extremely laggy, and top sometimes reports that at least one instance of the Phoenix miner is at 100% CPU. I'm unable to "kill -9" the miner processes. Smartcoin states that one or more GPUs are "<<< DOWN >>>". The only recourse seems to be to reboot the rig.

@jake262144: Smartcoin has a lockup detection feature. I haven't looked at the source, but from what I understand, if a GPU isn't responsive after X number of Smartcoin screen refreshes/iterations, it declares that a lockup has occurred. You can put a lockup.sh file in the smartcoin directory, and Smartcoin will run it when this occurs. I do a forced reboot, as that seems to be the only way to revive the card(s):
Code:
#!/bin/bash

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: x1010101x on January 18, 2012, 02:57:17 AM

@deepceleron: you nailed it with upgrading to Phoenix 1.7.3 for lowering my reject rate. It's been running all day, and the reject rate is now at 0.134%! I guess Phoenix 1.7.0 had issues with rejected shares?
I lowered my core clock from 780 MHz to 720 MHz; 700 MHz is stock for 5970s. This lowered my hashrate by 100 MHash/s (not a big deal). I still had a couple of lockups today though. I'll bring it down to 700 MHz tomorrow.

As of now at least 1.5 BTC of the bounty is going to deepceleron, and probably the remaining 1.5 BTC too if there are no more suggestions.

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: deepceleron on January 18, 2012, 07:17:20 AM

Quote:
@deepceleron: you nailed it with upgrading to Phoenix 1.7.3 for lowering my reject rate. It's been running all day, and the reject rate is now at 0.134%! I guess Phoenix 1.7.0 had issues with rejected shares?
I lowered my core clock from 780 MHz to 720 MHz; 700 MHz is stock for 5970s. This lowered my hashrate by 100 MHash/s (not a big deal). I still had a couple of lockups today though. I'll bring it down to 700 MHz tomorrow.
As of now at least 1.5 BTC of the bounty is going to deepceleron, and probably the remaining 1.5 BTC too if there are no more suggestions.

That was a bugfix that was rolled out. I didn't find a definitive post in the Phoenix thread stating "this fixes all your stales" for you, but it no longer lags for 0.5 seconds, and it also supports rollntime, so your miner doesn't have to request more work if it completes a nonce space (supported on some pools).

You could pull the one card that is running significantly hotter and make sure it has well-applied thermal grease. It should only have a very thin coating on components (not blobs squeezed out), and it can benefit from an upgraded thermal paste like Arctic Silver 5. Clean the old stuff off with 99% rubbing alcohol (or Everclear), and re-apply a paper-thin coat of new TIM on the chips. Put the heatsink on with the normal screws and take it back off again. You should see from the thermal paste impression on the heatsink that good contact is being made.
If you still lock up that close to stock clocks on the cards, I would remove one card at a time (maybe starting with that hot one) and put your miner back to overclocked. Run the card you removed by itself, mining in its own computer (your desktop computer or a $50 Craigslist Dell), and see if the problem follows one card.

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: dizzy1 on January 18, 2012, 10:23:03 PM

The 5970s have a rear exhaust, right? Which way is the box fan pointed?
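Both the hot-card advice and the airflow question come down to per-card temperatures, so it can help to log them before and after any change. Below is a minimal sketch that reads the output of the `aticonfig --adapter=all --odgt` command from the first post. The function names `parse_temps` and `hottest` are made up for illustration, and the parser assumes the output consists of `Adapter N - <name>` lines each followed by a `Sensor 0: Temperature - NN.NN C` line; adjust the patterns if your driver prints something different.

```shell
#!/bin/bash
# Hypothetical helpers, not part of aticonfig or Smartcoin.

# parse_temps: reads aticonfig --odgt style text on stdin and prints one
# "adapter temperature" pair per line.
# ASSUMPTION: "Adapter N - <name>" lines are followed by lines of the form
# "Sensor 0: Temperature - NN.NN C".
parse_temps() {
    awk '
        /^[[:space:]]*Adapter/ { adapter = $2 }               # remember adapter index
        /Temperature/          { print adapter, $(NF - 1) }   # value sits before the unit
    '
}

# hottest: reads "adapter temperature" pairs on stdin and prints the pair
# with the highest temperature.
hottest() {
    sort -k2,2nr | head -n 1
}
```

On the rig this would be run as `sudo aticonfig --adapter=all --odgt | parse_temps | hottest`; appending the `parse_temps` output to a file with a timestamp gives a simple record to compare after re-pasting a card or turning the fan around.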
Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: ssateneth on January 18, 2012, 11:04:09 PM

Quote:
The 5970s have a rear exhaust right? Which way is the box fan pointed?

They are in the correct direction relative to the fan.

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: dizzy1 on January 18, 2012, 11:27:29 PM

Quote:
The 5970s have a rear exhaust right? Which way is the box fan pointed?
Quote:
They are in the correct direction relative to the fan.

What I meant was, is the box fan pushing the exhaust heat back into or away from the card?

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: Proofer on January 19, 2012, 07:28:26 AM

Quote:
... 700mhz is stock for 5970's. ...

Says here (http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-5000/hd-5970/Pages/ati-radeon-hd-5970-overview.aspx#2):
Engine clock speed: 725 MHz

Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: x1010101x on January 19, 2012, 04:13:24 PM

@deepceleron: Thanks, I'll give that a try this weekend. Post or PM me your BTC addy, I'll send you the bounty.
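Proofer's correction changes how the clock figures in this thread read. As a small worked check, the sketch below computes how far a core clock sits above or below a reference clock, in percent. The function name `overclock_pct` is hypothetical; the inputs are the 780, 720, 700 and 725 MHz figures quoted in the posts.

```shell
#!/bin/bash
# overclock_pct CLOCK_MHZ STOCK_MHZ
# Prints the offset of CLOCK_MHZ from STOCK_MHZ as a percentage with one
# decimal place; negative values mean the card is below the reference clock.
# (Hypothetical helper for illustration only.)
overclock_pct() {
    awk -v c="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (c - s) / s * 100 }'
}

overclock_pct 780 700   # original clock vs. the 700 MHz assumed in the thread
overclock_pct 780 725   # original clock vs. the 725 MHz AMD lists
overclock_pct 720 725   # reduced clock vs. the 725 MHz AMD lists
```

If 725 MHz is the right reference, the reduced 720 MHz setting is already slightly below stock, which would make the overclock a less likely explanation for the lockups that continued at that speed.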
Title: Re: 3 BTC Bounty: help me diagnose my mining rig lockups and reject rate
Post by: cuz0882 on January 25, 2012, 02:21:22 PM

To start, I would replace the thermal paste on the hot card. Then I would run the rig with just one card at a time and see if they both give rejected shares. It appears they both do, but it's possible it's just one. After you eliminate the video card as the problem, you can assume it's a driver or software problem. Beyond that, I would check the internet connection. I would just install Windows and run cgminer. That would be a quick way to eliminate software problems.
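For readers wondering how the lockup detection x1010101x describes earlier in the thread might work (a GPU that stays unresponsive for X refreshes is declared locked up, and lockup.sh is then run), here is a minimal sketch of a consecutive-miss counter. It is not Smartcoin's actual code, which the poster says he has not read; the names `MAX_MISSES`, `update_misses`, `is_locked_up` and `first_lockup` and the threshold of 3 are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch of a consecutive-miss lockup counter. Every name below is
# hypothetical; this does not reproduce Smartcoin's implementation.

MAX_MISSES=3   # assumed value for the "X" refreshes mentioned in the thread

# update_misses COUNT RESPONSIVE
# Prints the new miss count: 0 if the GPU responded (RESPONSIVE=1),
# otherwise COUNT + 1.
update_misses() {
    local count=$1 responsive=$2
    if [ "$responsive" -eq 1 ]; then
        echo 0
    else
        echo $((count + 1))
    fi
}

# is_locked_up COUNT
# Succeeds (exit status 0) once COUNT has reached the threshold.
is_locked_up() {
    [ "$1" -ge "$MAX_MISSES" ]
}

# first_lockup RESULT...
# Feeds a sequence of poll results (1 = responded, 0 = no response) through
# the counter and prints the 1-based poll at which a lockup would be
# declared, or "none" if the threshold is never reached.
first_lockup() {
    local misses=0 poll=0 r
    for r in "$@"; do
        poll=$((poll + 1))
        misses=$(update_misses "$misses" "$r")
        if is_locked_up "$misses"; then
            echo "$poll"
            return 0
        fi
    done
    echo "none"
}
```

Resetting the counter on any successful poll means a single slow refresh does not trigger a reboot; only an unbroken run of misses does. A real watchdog would replace the simulated results with an actual probe of each GPU and call the lockup script where `first_lockup` prints the poll number.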