Bitcoin Forum
Topic: I made an auto-coldrebooter script for BAMT/SMOS
RLutz (OP)
February 27, 2014, 04:51:32 PM
Last edit: March 03, 2014, 05:39:13 AM by RLutz
 #1

I just got my rig set up a few days ago, and the one thing that really bugged me, especially while I was playing around with different settings, is that if a card was unstable and crashed, the only way to fix it was a coldreboot.

A card might crash only once every few days, but when it does you still have to issue a coldreboot. What if you're away from the computer? Think of all those precious coins you could be losing!

Anyway, this script will take care of that for you. I'm using LTCrabbit's customized SMOS-Linux, but it should work for any other similar distro.

First we'll need to make a script. You can use nano, vim, or whatever you prefer. I'll write the tutorial using nano, since it's probably the easiest way to go if you're a Linux newbie. Fire up a root terminal, then run:

Code:
nano /root/autoRebooter.sh

Paste the following contents into that file (make sure to edit your targetMinTemp accordingly!!!):
Code:
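#!/bin/bash

# Set your targeted minimum temp here. The system will issue a cold
# reboot if any card's temp falls below this number.
targetMinTemp=50

# Log to the home directory of the user with UID 1000 (matched on the UID
# field only, so other lines containing "1000" cannot break the path).
logFile="/home/$(awk -F: '$3 == 1000 { print $1; exit }' /etc/passwd)/autoRebooter.log"

# Prefix for every log line: today's date plus the first field of uptime.
stamp() {
  echo "$(date +%m-%d-%Y) $(uptime | awk -F, '{sub(".*ge ",x,$1);print $1}')"
}

# Try coldreboot first. If the box is still up 30 seconds later, fall back
# to the magic SysRq keys: sync the disks (s), then reboot immediately (b).
forceReboot() {
  /sbin/coldreboot &
  sleep 30
  echo s > /proc/sysrq-trigger
  sleep 10
  echo b > /proc/sysrq-trigger
}

# Run viewgpu in the background and keep the first two characters of each
# temperature (column 2), so "71.0c" becomes "71".
(/opt/bamt/viewgpu | awk '{ print $2; }' | cut -c -2 > /tmp/viewgpu) & pid=$!
echo $pid
# Watchdog: give viewgpu 10 seconds, then kill it in case it hung. If it
# already finished, kill prints "No such process", which is harmless.
(sleep 10 && kill $pid)
sleep 15
array=(`cat /tmp/viewgpu`)

# No readings at all means viewgpu itself failed.
if [ ${#array[@]} -eq 0 ]; then
  echo "$(stamp) viewgpu command failed to run, rebooting" >> "$logFile"
  forceReboot
fi

# Reboot as soon as any card reads below the target temperature.
i=0
for temp in "${array[@]}"; do
  if [ "$temp" -lt "$targetMinTemp" ]; then
    echo "$(stamp) card number $i has stopped, its current temp is $temp, coldrebooting" >> "$logFile"
    forceReboot
  fi
  i=$(($i+1))
done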

Use ctrl+o to write the file out, then ctrl+x to exit nano.

Next you'll need to make the script executable

Code:
chmod a+x /root/autoRebooter.sh

Lastly, we'll need to add a cronjob to periodically check in. I set it to run every hour.

Code:
crontab -e

Add the following line to the end of crontab

Code:
0 */1 * * * /root/autoRebooter.sh

Press ctrl+o to write it out, then ctrl+x to exit.
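
For anyone setting up several rigs, the crontab step can also be done without opening an editor. The following is only a sketch: the helper name add_cron_entry is made up for illustration, and the entry uses `0 * * * *`, which means the same thing as `0 */1 * * *` (minute 0 of every hour). The helper reads an existing crontab on stdin and appends the line only if it is not already there, so running it twice does not duplicate the job.

```shell
#!/bin/bash
# Hypothetical helper: read an existing crontab on stdin and print it back
# with ENTRY appended, unless that exact line is already present.
add_cron_entry() {
  local entry="$1" current
  current=$(cat)
  [ -n "$current" ] && printf '%s\n' "$current"
  # -x: whole-line match, -F: fixed string (the * characters are literal).
  printf '%s\n' "$current" | grep -qxF -- "$entry" || printf '%s\n' "$entry"
}

entry='0 * * * * /root/autoRebooter.sh'
# Dry run on an empty crontab, then again on its own output.
once=$(printf '' | add_cron_entry "$entry")
twice=$(printf '%s\n' "$once" | add_cron_entry "$entry")
echo "$twice"
```

To apply it for real as root, something like `crontab -l 2>/dev/null | add_cron_entry "$entry" | crontab -` should work on cron implementations that accept a crontab on stdin.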

There you go, now you never have to worry about a crashed GPU bringing down your hashrate ever again!

If you found this helpful: I'm currently at a whopping 2.3 LTC and 0.01 BTC, and would love to have a few fractions more!

LTC: Lhb3yJGPL9dsUZ2tt5KrbNMm3pVmmA1fkb

BTC: 1NKkGEsY5UwkzSmD63yBcJj9hkrS4YWsbX

edit: I've made improvements in case viewgpu gets stuck or coldreboot fails. Tested and verified to work!
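
One caveat for anyone adapting the script: `cut -c -2` keeps exactly two characters, so a reading of 100 degrees or more would be seen as "10" and a single-digit reading would keep the decimal point. A possible alternative, sketched below, strips everything from the first non-digit onward instead. The function name parse_temps is hypothetical, and the sample lines only imitate the viewgpu output format posted further down this thread; the temperature values are made up.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): read viewgpu-style
# lines on stdin and print one whole-degree temperature per card.
parse_temps() {
  # Column 2 looks like "71.0c"; drop everything from the first
  # non-digit character onward, leaving just the integer part.
  awk '{ sub(/[^0-9].*/, "", $2); print $2 }'
}

# Made-up sample in the viewgpu format: a normal, a cold, and a hot card.
sample='0: 71.0c 480.00 Mh/s http://eu.betarigs.com:3333
1: 9.5c 0.00 Mh/s http://eu.betarigs.com:3333
2: 101.0c 480.00 Mh/s http://eu.betarigs.com:3333'

temps=$(printf '%s\n' "$sample" | parse_temps)
echo $temps
```

In the script this would replace the `awk '{ print $2; }' | cut -c -2` part of the viewgpu pipeline.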
Okilo
February 27, 2014, 09:05:44 PM
 #2

Can anyone confirm if this works?
SR0G
February 28, 2014, 01:09:25 AM
 #3

This will be very useful, but my BAMT 1.3 does not have crontab installed.

Did you have to install it?

edit: nevermind.. I got it working! Thanks.
xbudahx
February 28, 2014, 01:10:47 AM
 #4

I'm running it, but don't see too many hardware issues. I will check back if it happens.
crazyates
February 28, 2014, 02:06:49 AM
 #5

Trying this out as well. Thanks!

Tips? 1crazy8pMqgwJ7tX7ZPZmyPwFbc6xZKM9
RLutz (OP)
February 28, 2014, 04:32:07 PM
 #6

Some people have mentioned that coldreboot can sometimes fail. I've never had this problem myself, but you could use the Linux magic SysRq keys as a failover if coldreboot fails. Just sleep for 30 seconds or so, then:

Code:
#after the /sbin/coldreboot line
sleep 30
echo s > /proc/sysrq-trigger
sleep 10
echo b > /proc/sysrq-trigger
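
The fallback can be rehearsed safely before trusting it on a live rig. In the sketch below, the function name sysrq_reboot and its two parameters are assumptions added for illustration: the trigger path and the pause are arguments only so the sequence can be pointed at a scratch file. According to the kernel's SysRq documentation, writes to /proc/sysrq-trigger by root are honored regardless of the kernel.sysrq setting, which only restricts the keyboard combination.

```shell
#!/bin/bash
# Hypothetical wrapper around the SysRq fallback. TRIGGER and PAUSE are
# parameters only so the sequence can be rehearsed against a scratch file.
sysrq_reboot() {
  local trigger="${1:-/proc/sysrq-trigger}" pause="${2:-10}"
  echo s > "$trigger"   # s = sync: flush dirty buffers to disk
  sleep "$pause"        # give the sync time to finish
  echo b > "$trigger"   # b = reboot immediately, no clean shutdown
}

# Dry run: write to a temp file instead of the real trigger.
scratch=$(mktemp)
sysrq_reboot "$scratch" 0
cat "$scratch"
rm -f "$scratch"
```

Called with no arguments as root, the function would write to the real trigger and reboot the machine, so only do that on purpose.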
hostmaster
February 28, 2014, 04:34:34 PM
 #7

Thanks, very useful data.
RLutz (OP)
February 28, 2014, 04:48:10 PM
 #8

I edited the script in the OP to take care of some additional edge cases where the viewgpu command might hang indefinitely.
RLutz (OP)
February 28, 2014, 06:16:01 PM
 #9

One last edit to clean things up. Currently it deals with any edge cases I've seen come up (coldreboot fails, viewgpu hangs, etc).
xbudahx
March 01, 2014, 03:45:40 PM
Last edit: March 01, 2014, 03:57:47 PM by xbudahx
 #10

My GPUs stopped spinning last night and the rig halted. It didn't reboot, even though I'm running the script.

Here is what I see when I run it manually (I'm using SMOS 1.3):

/root/autoRebooter.sh: line 11: kill: (10583) - No such process
RLutz (OP)
March 01, 2014, 06:42:30 PM
 #11

Hmmm, that line means that the viewgpu process likely completed just fine. Next time that happens, can you run viewgpu and paste me the output? It's possible that there is some failure output that the script doesn't check for yet.

Some people have mentioned that coldreboot fails sometimes. Perhaps I should just remove /sbin/coldreboot and always use the Linux magic keys (echo s, echo b, etc.).
xbudahx
March 01, 2014, 07:48:02 PM
 #12

I see the same message whenever I run it manually. When the rig went down, I wasn't able to SSH into it, so I couldn't tell the status.

root@smos-1:~# /opt/bamt/viewgpu
0: 71.0c 480.00 Mh/s http://eu.betarigs.com:3333
1: 68.0c 480.00 Mh/s http://eu.betarigs.com:3333
2: 69.0c 480.00 Mh/s http://eu.betarigs.com:3333
RLutz (OP)
March 01, 2014, 09:00:35 PM
 #13

When you run it manually, that process should already be dead; the message just means it completed successfully. Basically there's a watchdog that gives the viewgpu command only a few seconds to complete. Without it, a stuck viewgpu command would make the script wait indefinitely. If viewgpu runs normally, it has already terminated by the time kill tries to kill it, hence the message.
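
The same watchdog idea can also be expressed with the `timeout` command, which avoids the "No such process" message entirely. This is only a sketch of an alternative, not what the script in the OP does, and it assumes GNU coreutils `timeout` is installed on the rig. The wrapper name run_with_deadline is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: run a command, but give up after SECS seconds.
# Assumes GNU coreutils `timeout` is available.
run_with_deadline() {
  local secs="$1"; shift
  timeout "$secs" "$@"
}

run_with_deadline 5 echo "finished in time"
echo "exit status: $?"          # 0: the command completed by itself

run_with_deadline 1 sleep 3
echo "exit status: $?"          # 124: timeout had to kill it
```

In the script, something like `timeout 10 /opt/bamt/viewgpu` could stand in for the background subshell plus the sleep and kill lines.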

If you weren't able to SSH in, it's possible that the kernel just blew up. At that point nothing will help you, not even the Linux magic keys. You'll have to use the power button if the kernel explodes.

If you ever see a situation where it doesn't reboot and a card is dead (and you can still actually interact with the system), run the viewgpu command and post the output.

But yeah, a malfunctioning Linux kernel, whether it's a kernel panic or some other kernel explosion, means that your only likely resolution is actually powering it off.
RLutz (OP)
March 03, 2014, 05:40:13 AM
 #14

I improved the script yet again because I got tired of viewgpu's readings being unreliable.

It now goes off your card temperatures, which seems to be very reliable (a card that is sick or dead will quickly cool down to idle temps).

So far this hasn't failed me!
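
The temperature check itself boils down to a comparison per card. A minimal sketch of that logic, separated from the rebooting so it can be tried safely, might look like this. The function name cold_cards and the readings are made up for illustration; the threshold of 50 matches the targetMinTemp default in the OP.

```shell
#!/bin/bash
# Hypothetical helper: print the index of every card whose temperature is
# below MIN. Temperatures are whole numbers, one per argument, in card order.
cold_cards() {
  local min="$1" i=0 temp; shift
  for temp in "$@"; do
    [ "$temp" -lt "$min" ] && echo "$i"
    i=$((i + 1))
  done
  return 0
}

# Made-up readings: card 1 has dropped to idle temperature.
cold_cards 50 71 38 69
```

A healthy rig prints nothing, so the caller only needs to test whether the output is empty. Pick the threshold comfortably above your cards' idle temperature and below their normal mining temperature.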
silvetti
March 03, 2014, 02:40:02 PM
 #15

It would be awesome if, before issuing the shutdown, the script wrote to a log file which GPU had the lower temperature, at what time, etc.

Anyway, great script!
RLutz (OP)
March 03, 2014, 03:41:35 PM
 #16

It does log which GPU had the low temp
silvetti
March 03, 2014, 05:42:42 PM
 #17

Just re-read the code and noticed that ;)

Great!
xbudahx
March 06, 2014, 09:42:56 PM
 #18

This script has saved me a couple trips home from the office to check on my rig.


Now to figure out why it's crapping out after smooth sailing for weeks.
Demontager
April 08, 2014, 04:13:12 PM
 #19

I improved RLutz's original script a bit and added some more features: https://bitcointalk.org/index.php?topic=508355.0
SICK/DEAD cards status also monitored via cgminer API.
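
For readers curious what checking SICK/DEAD status through the cgminer API can look like, here is a rough sketch. It is not taken from the linked script. It assumes the API's `devs` command returns records separated by `|` with comma-separated `key=value` fields including `GPU=` and `Status=`, and the sample response below is hand-written to match that shape. The function name unhealthy_gpus is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read a cgminer "devs" API response on stdin and
# print "<gpu id> <status>" for every GPU whose status is not Alive.
unhealthy_gpus() {
  tr '|' '\n' | awk -F, '/^GPU=/ {
    id = ""; status = ""
    for (n = 1; n <= NF; n++) {
      if ($n ~ /^GPU=/)    id = substr($n, 5)
      if ($n ~ /^Status=/) status = substr($n, 8)
    }
    if (status != "Alive") print id, status
  }'
}

# Hand-written sample response (an assumption, not captured from a rig).
sample='STATUS=S,When=1396973592,Code=9,Msg=2 GPU(s),Description=cgminer 3.7.2|GPU=0,Enabled=Y,Status=Alive,Temperature=71.00|GPU=1,Enabled=Y,Status=Sick,Temperature=38.00|'
printf '%s' "$sample" | unhealthy_gpus
```

On a rig with the API enabled, the input would typically come from something like `echo -n devs | nc 127.0.0.1 4028`, assuming the default API port and that netcat is installed.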