UNATTENDED RIG: Automatic overclocking, FAN control, mail sending, HW safeguards

DEAR BITCOINERS:

I am very glad to share with you the result of several months of work on my mining RIG, which is equipped with 3 ATI Radeon HD 5870 cards.
It is an unattended, automatically controlled mining RIG, and it has brought me a lot of benefits (hardware safeguards, automatic settings based on temperature, FAN control, mail sending, etc.). I hope it will be useful for others.
Of course I know that mining with GPUs is not really profitable now, but I am also waiting for an ASIC with 30 GH/s!!

My idea is to adapt all these scripts for ASIC mining as soon as possible and share them with you, if I see that you find this post valuable.

Note that I provide all this work absolutely free! However, I would be really grateful to receive donations to my BTC address. That would encourage me to publish other ideas and code for the BTC community.
(You can find my address at the end of the post).

First of all, let me indicate the benefits obtained from using these scripts:

All scripts are automatically started when switching on the system
All scripts can be manually started by a single command
All scripts can be manually stopped by a single command
There is a control script in charge of the following actions:
- Automatic overclocking of each GPU based on the GPU temperature, target temperature, etc.
- Automatic downclocking of each GPU based on the GPU temperature, target temperature, etc.
- Automatic restarting of the mining processes if they stop abnormally (5 retries)
- Sending of mails if the retry limit is reached, to inform the user of the problem
- Automatic shutdown of the mining processes if the temperature is very high. This prevents a worse hardware failure in the GPUs
- Automatic restarting of the mining processes once the temperature is safe again
- Automatic FAN control for each GPU from 30% to 100% of speed based on the temperature
There is a monitor script continuously sensing all system parameters for user supervision:
- Current GPUs clocks
- Current FAN speeds
- Current temperature for each GPU
- Temperatures of CPU, Motherboard, etc
- Current hashrate for each GPU
- Last 10 control script actions
The monitor script can be used through an SSH or telnet connection. In my case, I supervise my RIG through SSH from my mobile using the 'ConnectBot' application (see the Android or iPhone markets)

Before using the scripts, have a look at the base system. Your system may differ from mine, but I provide all the code so that you can adapt it to your own system. The base system for running these scripts is the following:

Linux OS (I use Debian). All the scripts are written for the Linux shell (bash).
ATI graphics cards (I have 3 ATI HD 5870). Note that I use the 'aticonfig' tool for almost everything
ATI drivers already installed, AMD SDK, etc. These guides were very useful for me:
- http://eligius.st/wiki/index.php/Ubuntu_Miner_Guide
- http://ewoah.com/technology/a-very-good-guide-to-building-a-bitcoin-mining-rig-cluster-guide/
lm-sensors package already installed (use apt-get or aptitude). This is for retrieving the CPU and motherboard temperatures
screen package already installed (apt-get install screen). This tool lets us run all the scripts from the GUI session while sharing the console output with other sessions, so we can monitor the system over telnet or SSH (thanks to screen!!).
I use poclbm.py for GPU mining. Other mining software will work, but you will need to change the related piece of code in the scripts
My system is fully unattended... so I have also installed VNC software for remote connection to the desktop (I have my wallet there)
I also have automatic login configured, so that when I switch on the system it logs in, the scripts start automatically, and the wallet application also starts and gets updated with the BTC network transactions. (In my case, the RIG is installed in a cooled place very far from my house)

The control script uses the "reboot" and "halt" commands without a sudo password. To set that up, read this link: http://sleekmason.wordpress.com/fluxbox/using-etcsudoers-to-allow-shutdownrestart-without-password/

I use the "mail" command for sending mails. The mail command must be available to the script (look for it on the internet).
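
Before going further, it can help to confirm that the tools named above are actually installed. The following is only a sketch: `missing_commands` is a hypothetical helper name, and the list of commands is simply the set mentioned in this post (aticonfig, sensors from lm-sensors, screen and mail).

```shell
#!/bin/bash
# Hypothetical helper: print every command from the argument list that is not found in PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# The tools the scripts in this post rely on:
missing=$(missing_commands aticonfig sensors screen mail)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names are printed on one line.
    echo "Missing commands:" $missing
else
    echo "All required commands are available."
fi
```

If anything is reported as missing, install it before starting the scripts.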

OK, after this brief introduction, let's go to the scripts. I hope you enjoy them!!

I have the following little scripts for starting the individual mining process on each of my GPUs:

gpu0.sh

Code:

#!/bin/bash
export DISPLAY=:0.0
cd /home/your_path/scripts
DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d0 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu0.log

As you can see, I use a pool for mining (deepbit). You should change this line to match your specific parameters for poclbm.py or other mining software.
Note that I pipe the output to the 'tee' command in order to store the output of the process in a log file called "mining_gpu0.log". This will be useful later for monitoring, as we will retrieve the hashrate of each GPU from these log files.
Note the parameter -r 5: it makes poclbm.py update its output (hashrate) once every 5 seconds. The reason for this will be discussed later...
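
As a rough sketch of how a hashrate could later be pulled out of such a log file: the function below prints the last "MH/s" figure it finds. The log line format used in the sample is an assumption (miner output differs between versions), and `last_hashrate` is a hypothetical helper name, so adjust the pattern to whatever your miner really prints.

```shell
#!/bin/bash
# Hypothetical helper: print the last "[<number> MH/s" value found in a miner log.
# The "[310.52 MH/s" format is an assumption about the miner output.
last_hashrate() {
    # Status updates are often separated by carriage returns, so turn them into newlines first.
    tr '\r' '\n' < "$1" \
        | grep -Eo '\[[0-9]+(\.[0-9]+)? MH/s' \
        | tail -n 1 \
        | tr -d '[' \
        | awk '{print $1}'
}

# Example with a made-up log file:
sample_log=$(mktemp)
printf 'pool:8332 [305.10 MH/s]\rpool:8332 [310.52 MH/s]\n' > "$sample_log"
last_hashrate "$sample_log"
rm -f "$sample_log"
```

Because only the last match is kept, the result reflects the most recent update written by the miner.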

You should create additional gpux.sh files, one for each GPU. In my case I have 3 GPUs, so I have these additional scripts:

gpu1.sh

Code:

#!/bin/bash
export DISPLAY=:0.1
cd /home/your_path/scripts
DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d1 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu1.log

gpu2.sh

Code:

#!/bin/bash
export DISPLAY=:0.2
cd /home/your_path/scripts
DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d2 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu2.log

As I said, I use the "screen" Linux tool to share the output of the running commands. For example, we can run the script gpu0.sh in a shared console with the following command:

Code:

/usr/bin/screen -admS gpu0 ./gpu0.sh

This will create a shared console called "gpu0", which can be accessed through telnet or SSH with the following command:

Code:

screen -x gpu0

This way we can watch the output of gpu0.sh while it runs. Note that to leave a shared console you have to press 'CTRL+A' and then 'D' (to detach from the shared console). Otherwise (for example with CTRL+C) you may stop the execution of gpu0.sh in that console.
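
To check from a script whether a shared console still exists, one option is to look at the output of "screen -ls". The sketch below is only an illustration: `has_screen_session` is a hypothetical helper, and it reads the listing from standard input so that it can be tried without screen installed.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a screen session with the given name appears
# in a "screen -ls" style listing read from standard input.
has_screen_session() {
    # Listing lines look like "<tab>12345.gpu0<tab>(Detached)".
    grep -Eq "^[[:space:]]+[0-9]+\.$1[[:space:]]"
}

# Real use would be:  screen -ls | has_screen_session gpu0
# Demonstration with a canned listing:
listing=$'There is a screen on:\n\t12345.gpu0\t(Detached)\n1 Socket in /var/run/screen/S-your_user.'
if printf '%s\n' "$listing" | has_screen_session gpu0; then
    echo "gpu0 session found"
else
    echo "gpu0 session missing"
fi
```

The pattern requires whitespace right after the name, so a session called gpu0 is not confused with one called gpu01.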

Now, we can define the script start.sh that will launch the mining scripts:

start.sh

Code:

#!/bin/bash

cd /home/your_path/scripts
echo Starting mining scripts... 
/usr/bin/screen -admS gpu0 ./gpu0.sh
/usr/bin/screen -admS gpu1 ./gpu1.sh
/usr/bin/screen -admS gpu2 ./gpu2.sh
...

Now you can add this script (/home/your_path/start.sh) to your startup programs group. You can easily do it from the system menu.
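
The control script below also calls ./stop.sh, the counterpart used to stop all miners with one command. A minimal sketch of what such a script could look like follows. It is not the original stop.sh: it assumes the session names gpu0, gpu1 and gpu2 created by start.sh, and the DRY_RUN variable is a hypothetical addition that prints the commands instead of running them.

```shell
#!/bin/bash
# Sketch of a stop script (not the author's original stop.sh).
# Assumes the screen session names gpu0, gpu1, gpu2 created by start.sh.
# With DRY_RUN=1 the commands are printed instead of executed.
stop_mining() {
    local name
    for name in gpu0 gpu1 gpu2; do
        if [ "${DRY_RUN:-0}" -eq 1 ]; then
            echo "screen -S $name -X quit"
        else
            # "-X quit" asks the named session to terminate, which also ends the miner inside it.
            screen -S "$name" -X quit
        fi
    done
}

# Show what would be executed without touching any session:
DRY_RUN=1 stop_mining
```

Since the control script detects running miners by their screen sessions, closing the sessions is enough for it to see the miners as stopped.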

The control script has a lot of features. It is full of comments, so I expect you will find enough information there.
control.sh

Code:

#!/bin/bash
#---time constants
control_time=5       # Time cycle between control loops (5 seconds)
overclock_delay=180  # Waiting time between overclocking commands, in multiples of control_time (180*5 s = 15 min)
downclock_delay=60   # Waiting time between downclocking commands (60*5 s = 5 min)
downclock_urgent=24  # Waiting time between urgent downclocking commands (24*5 s = 2 min)
timeCounter=0        # time counter

#---GPUs temperatures
target_temp=75      # Target temp for the GPUs. Automatic overclocking/downclocking is performed to approach this temperature as a maximum.
hightemp_alarm=80   # Alarm temperature: if reached, an urgent downclocking is performed
maxtemp_stop=83     # Maximum temperature in the GPUs: above it, the mining processes are stopped for safety reasons.
temp_recover=65     # Recovery temperature: after a mining stop due to high temperature, mining is started again once this safe temperature is reached.
control_gap=3       # Minimum margin below target_temp required for an overclock command. (If the current temperature is very near the target temp, no overclocking is performed... we keep the temp. near but below the limit)

#---CPU temperatures
tempCPU_halt=70     # If this temperature is exceeded by the CPU or motherboard, a HALT is performed to turn off the RIG.

#---Clock limits 
corefreq_min=800    # Minimum freq. to be set by the control algorithm
corefreq_max0=945   # Maximum freq. to be set by the control algorithm in GPU0 (In my case I checked that above 975Mhz this GPU hangs the X session)
corefreq_max1=955   # Maximum freq. to be set by the control algorithm in GPU1 (In my case I checked that above 995Mhz the mining process got zombie)
corefreq_max2=1025  # Maximum freq. to be set by the control algorithm in GPU2 (In my case I checked that above 1055Mhz the mining process got zombie)
mem_freq=300	      # Fixed value for memory clock (normally it is 1200MHz in the GPUs, but using 300MHz reduces the temperature without affecting to the performance)

#--Mail sending 
subject="Important advice from your RIG"
mail1="your_mail@mail.com"
mail2="other_mail@mail.com"
mail3="other_mail@mail.com"

#--control constants
retryMiningAfterFailure=1  # Mining scripts are automatically started after a failure
debug=0		      # enable/disable debugging messages
numRetries=5		      # Limit of retries for restarting the mining processes in the GPUs
reboot=1                   # If zombie mining processes are detected, the control script can perform an automatic system reboot. This will recover mining in all GPUs.

#--FAN constants
FANGPU0=75
FANGPU1=75
FANGPU2=75

#--Internal variables 
GPU0=0
GPU1=1
GPU2=2
mining_stopped=0    # Mining process has been stopped by the control algorithm		
init_coreCLK0=900   # initial overclocking value for GPU0
init_coreCLK1=900   # initial overclocking value for GPU1
init_coreCLK2=900   # initial overclocking value for GPU2
counterLastCLK0=0   # Stores the time of the last overclocking/downclocking performed on GPU0 
counterLastCLK1=0   # Stores the time of the last overclocking/downclocking performed on GPU1
counterLastCLK2=0   # Stores the time of the last overclocking/downclocking performed on GPU2
simulation=0	      # If set to 1, overclocking is disabled and the commands are only logged, for debugging.
retriesGPU0=0
retriesGPU1=0
retriesGPU2=0
alertFailProcessGPU0=0
alertFailProcessGPU1=0
alertFailProcessGPU2=0

# ---------------------------------------------------------------------
# Function Debug: It outputs messages to the console only in debug mode
# Parameters: Text Message to be displayed
# -----------------------------------------------------------------
function debug(){
	if (test $debug -eq 1) 
       then 
	   	 echo -e "[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $@" 
	fi
}

# ---------------------------------------------------------------------
# Function output: This function output messages to the console
# $1: Message
# $2: if $2=1 the message is sent by email
# ---------------------------------------------------------------------
function output(){
	mensaje="[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $1"
	echo -e $mensaje       
	if test $2 -eq 1 
	then
		echo -e "$mensaje" | mail -s "$subject" $mail1
		echo -e "$mensaje" | mail -s "$subject" $mail2
		echo -e "$mensaje" | mail -s "$subject" $mail3
	fi
}

# ----------------------------------------------------------------
# Function FANCommand: This function sets the FAN speed of a GPU
# Params: $1:num_gpu: $GPU0,$GPU1,$GPU2
#         $2:FAN_SPEED: Value from 0 to 100 %, e.g. 100
# Use:  FANCommand $GPU0 100
# ----------------------------------------------------------------
function FANCommand(){
	case $1 in
	  0)	
		DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
	  1)
		DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
	  2)
		DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
	esac
	output "  New setting: FAN GPU$1 to $2 %" 0
}

# ----------------------------------------------------------------
# Function overclock: This function sets the clk of a GPU 
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
#         $2:clkfreq : core clk value to be set, e.g. 850
#         $3:memfreq : mem clk to be set, e.g. 300
# Use:  overclock 0 850 1200
# ----------------------------------------------------------------
function overclock(){
	if test $simulation -eq 0 
	then
		case $1 in
		   0)
			aticonfig --adapter=0 --odsc=$2,$3 >/dev/null;;
		   1)
			aticonfig --adapter=1 --odsc=$2,$3 >/dev/null;;
		   2)
			aticonfig --adapter=2 --odsc=$2,$3 >/dev/null;;
		esac
	fi
	output "  New setting: Overclock GPU$1 to $2 / $3 Mhz" 0
}

# ----------------------------------------------------------------
# Function controlFAN: Calculates the FAN speed depending on the temperature and current FAN speed (num_gpu, currentTemp, currentFAN) 
#			  It performs the FAN speed control of a GPU 
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
#         $2:currentTemp: Current value reported by the GPU (no decimals), e.g. 56
#	   $3:currentFAN:  Current FAN speed for this GPU, e.g. 75 %
#		  
# Use:  controlFAN $GPU0 56 75
# ----------------------------------------------------------------
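function controlFAN(){
	# Start from the current speed: the hysteresis branches below may leave the value
	# untouched, and it must not keep a stale value from a previous call (another GPU).
	controlFAN=$3
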
	#hysteresis of 2ºC around temperature threshold 55º
	if test $2 -lt 54 
	then
		controlFAN=30
	elif test $2 -lt 56 
	then
		if test $3 -ne 45
		then
			controlFAN=30			
		fi
	#hysteresis of 2ºC around temperature threshold 60º
	elif test $2 -lt 59
	then 
		controlFAN=45
	elif test $2 -lt 61
	then
		if test $3 -ne 60
		then
			controlFAN=45
		fi
	#hysteresis of 2ºC around temperature threshold 65º
	elif test $2 -lt 64
	then 
		controlFAN=60
	elif test $2 -lt 66
	then
		if test $3 -ne 75
		then
			controlFAN=60
		fi
	#hysteresis of 2ºC around temperature threshold 70º
	elif test $2 -lt 69
	then 
		controlFAN=75
	elif test $2 -lt 71
	then
		if test $3 -ne 90
		then
			controlFAN=75
		fi
	else 
		controlFAN=90
 	fi

	# It sends the FAN speed command only if the new setting is different to the current one.
	case $1 in
	    0)
	  	  debug "FAN control GPU0: Current FAN=$FANGPU0, controlFAN:$controlFAN"
		  if test $controlFAN -ne $FANGPU0 
                then
		  	FANCommand 0 $controlFAN 
			FANGPU0=$controlFAN
		  fi;;
	    1)
	  	  debug "FAN control GPU1: Current FAN=$FANGPU1, controlFAN:$controlFAN"
		  if test $controlFAN -ne $FANGPU1 
                then
		  	FANCommand 1 $controlFAN 
			FANGPU1=$controlFAN
		  fi;;
	    2)
	  	  debug "FAN control GPU2: Current FAN=$FANGPU2, controlFAN:$controlFAN"
		  if test $controlFAN -ne $FANGPU2 
                then
		  	FANCommand 2 $controlFAN 
			FANGPU2=$controlFAN
		  fi;;
	esac
}

# ----------------------------------------------------------------
# Function controlTemp: Calculates the GPU clock correction to be performed depending on the GPU temperatures (num_gpu, currentTemp, targetTemp) 
#			   It performs the overclocking/downclocking of the GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
#         $2:currentTemp: Current temperature reported by the GPU (no decimals), e.g. 56
#         $3:targetTemp: Maximum target temperature desired in this GPU, e.g. 78
# Outputs:
#	   counterLastCLK0, counterLastCLK1 and counterLastCLK2: Time information of the last CLK correction on each GPU.
#		  
# Use:    controlTemp $GPU0 56 78
# ----------------------------------------------------------------
function controlTemp(){
	offsetCLK=$(expr $3 - $2)
	
	# temperature gap defined in 'control_gap' is guaranteed to avoid causing stress to the GPU when the current temperature is very near to the target temperature
	if (test $offsetCLK -gt 0) && (test $offsetCLK -lt $control_gap) 
	then
		debug "The correction ($offsetCLK) does not exceed the control GAP ($control_gap). CLK is maintained."
		return 1
	fi
	
	# Demanded frequencies are limited to the specific clk ranges of each GPU
	case $1 in
	    0)
		  demandaCLK=$(expr $coreCLK0 + $offsetCLK)
		  if test $demandaCLK -gt $corefreq_max0 
                then
			demandaCLK=$corefreq_max0
 	         fi;;
	    1)
		  demandaCLK=$(expr $coreCLK1 + $offsetCLK)
		  if test $demandaCLK -gt $corefreq_max1 
		  then
			demandaCLK=$corefreq_max1
 	         fi;;

	    2)
		  demandaCLK=$(expr $coreCLK2 + $offsetCLK)
		  if test $demandaCLK -gt $corefreq_max2 
                then
			demandaCLK=$corefreq_max2
		  fi;;
	esac
	
	if test $demandaCLK -lt $corefreq_min 
       then
		demandaCLK=$corefreq_min
	fi
		
	debug "*** GPU$1 --> CurrentTemp:$2 - Target:$3 - Control:$demandaCLK ($offsetCLK)" 
	
	# Sending of overclock command, only if there is a change.
	case $1 in
	    0)
		  if test $demandaCLK -ne $coreCLK0 
                then
		  	overclock 0 $demandaCLK $mem_freq
		 	counterLastCLK0=$timeCounter
			debug "Time counterLastCLK0: $counterLastCLK0" 
		  else
		  	debug "GPU0: Limit is already reached: $corefreq_max0."
		  	counterLastCLK0=$timeCounter
		  fi;;
	    1)
		  if test $demandaCLK -ne $coreCLK1 
                then
			overclock 1 $demandaCLK $mem_freq
			counterLastCLK1=$timeCounter
			debug "Time counterLastCLK1: $counterLastCLK1" 
		  else
		  	debug "GPU1: Limit is already reached: $corefreq_max1."
		  	counterLastCLK1=$timeCounter
		  fi;;
	    2)
		  if test $demandaCLK -ne $coreCLK2 
                then
			overclock 2 $demandaCLK $mem_freq
			counterLastCLK2=$timeCounter
			debug "Time counterLastCLK2: $counterLastCLK2" 
		  else
		  	debug "GPU2: Limit is already reached: $corefreq_max2."
		  	counterLastCLK2=$timeCounter			
		  fi;;	
	esac
}


# ----------------------------------------------------------------
# Function checkOverclockTimeGuard: This function ensures a certain period of time between consecutive overclock commands. 
# 					 - Guard Time between overclocks: $overclock_delay*$control_time (180*5 = 15 minutes)
#					 - Guard Time between downclocks: $downclock_delay*$control_time (60*5 = 5 minutes)
#					 - Guard time between urgent downclocks: $downclock_urgent*$control_time (24*5 = 2 minutes)
# Params: $1:num_gpu : 0,1,2
#         $2:timeCounter: Current value of the time counter
#         $3:up_down: 0:overclock, 1:downclock, 2:urgent downclock
# Outputs:
#	   $return_correction: 0:Not to perform CLK correction. 1:CLK correction can be performed now.
# ----------------------------------------------------------------
function checkOverclockTimeGuard (){
	return_correction=0
	case $1 in
		0)
			if test $3 -eq 0
                     then  ##overclock
				due_time=$(expr $counterLastCLK0 + $overclock_delay)
				if test $2 -ge $due_time
                            then
					return_correction=1
				fi
			elif test $3 -eq 1 
                     then  ##normal downclocking
				due_time=$(expr $counterLastCLK0 + $downclock_delay)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi

			elif test $3 -eq 2 
                     then  ##urgent downclocking
				due_time=$(expr $counterLastCLK0 + $downclock_urgent)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			fi;;
	       1)
			if test $3 -eq 0 
                     then  ##overclocking
				due_time=$(expr $counterLastCLK1 + $overclock_delay)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			elif  test $3 -eq 1 
                     then  ##normal downclocking
				due_time=$(expr $counterLastCLK1 + $downclock_delay)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			elif test $3 -eq 2 
                     then  ##urgent downclocking
				due_time=$(expr $counterLastCLK1 + $downclock_urgent)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			fi;;
	       2) 
			if test $3 -eq 0 
                     then  ##overclocking
				due_time=$(expr $counterLastCLK2 + $overclock_delay)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			elif test $3 -eq 1 
                     then  ##normal downclocking
				due_time=$(expr $counterLastCLK2 + $downclock_delay)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			elif test $3 -eq 2 
                     then  ##urgent downclocking
				due_time=$(expr $counterLastCLK2 + $downclock_urgent)
				if test $2 -ge $due_time 
                            then
					return_correction=1
				fi
			fi;;
	esac
	debug "-------GPU$1: due_time: $due_time, correction: $return_correction"
} 

# ---------------------------------------------------------------------------------------------------------
# MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN
# ---------------------------------------------------------------------------------------------------------
#Enable overclocking 
aticonfig --adapter=0 --od-enable    
aticonfig --adapter=1 --od-enable    
aticonfig --adapter=2 --od-enable   
overclock 0 $init_coreCLK0 $mem_freq  
overclock 1 $init_coreCLK1 $mem_freq  
overclock 2 $init_coreCLK2 $mem_freq   
FANCommand 0 30
FANCommand 1 30
FANCommand 2 30
output "Automatic control algorithm has been started" 1  #This is sent by email.
output "Automatic FAN speed control starts from 30%" 0
while true; do

	 #Fetching of current temperatures 
	 tempGPU0=$(aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
	 tempGPU1=$(aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
	 tempGPU2=$(aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
	 tempCPU=$(sensors |grep CPU |grep Temperature | awk '{print $3}'|cut -c2-3)
	 tempMB=$(sensors |grep NB |grep Temperature | awk '{print $3}'|cut -c2-3)
	 
	 #Fetching of current GPU CLK frequencies 
	 coreCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $4}')
	 memCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $5}')
	 coreCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $4}')
	 memCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $5}')
	 coreCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $4}')
	 memCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $5}')
	 
 	 #It detects if there are mining processes already running
  	 miningIsActive0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l)   # we look for the 'screen' session lock file
        miningIsActive1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)   # we look for the 'screen' session lock file
        miningIsActive2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)   # we look for the 'screen' session lock file
	 
	 loadGPU0=$(aticonfig --adapter=0 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)  # It is also checked the load of each GPU
	 loadGPU1=$(aticonfig --adapter=1 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
	 loadGPU2=$(aticonfig --adapter=2 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
	 debug " --> Temps: $tempGPU0, $tempGPU1, $tempGPU2" 
	 debug " --> Clks: $coreCLK0, $coreCLK1, $coreCLK2" 

	 # --------------------------------Temperature of CPU and Motherboard ---------------------------------------
	 if (test $tempCPU -gt $tempCPU_halt) || (test $tempMB -gt $tempCPU_halt) 
        then
	 	output "ERR: Temperature of CPU/MB is too high! $tempCPU / $tempMB.... \nSWITCHING OFF THE SYSTEM. \n Check the CPU FAN condition and switch the RIG on manually." 1
		/usr/bin/halt
	 fi
	 
	 # Checking of zombie mining processes. 
	 num_defunc=$(ps -Al |grep py|grep defunc| wc -l)  
	 if test $num_defunc -gt 0 
     	 then
		if test $reboot -eq 1 
		then
	 	    output "### ERR: There are one or more zombie mining processes: 
			 \nMaybe a mining process is hung and blocked.
			 \nIt is necessary to restart the system to recover the mining (sudo reboot).
			 \n$(ps -Al |grep py|grep defunc| wc -l) 
			 \n --> PERFORMING AN AUTOMATIC REBOOT OF THE SYSTEM...." 1    # Sent by email
			/usr/bin/reboot
		else
			 output "### ERR: There are one or more zombie mining processes: 
			 \nMaybe a mining process is hung and blocked.
			 \nIt is necessary to restart the system to recover the mining (sudo reboot).
			 \n$(ps -Al |grep py|grep defunc| wc -l)" 1                    # Sent by email
		fi
	 fi
				
	 # --------------------------------Init checkings -------------------------------
	 if (test $miningIsActive0 -eq 0) 
        then
		if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1) 
              then
			if test $retriesGPU0 -lt $numRetries
                     then
				output "### ERR the mining process in GPU0 is not started ......" 0
				output "*** Starting mining on GPU0..." 0
				overclock 0 $init_coreCLK0 $mem_freq  
				coreCLK0=$init_coreCLK0
				/usr/bin/screen -admS gpu0 ./gpu0.sh
				retriesGPU0=$(expr $retriesGPU0 + 1)
			elif test $alertFailProcessGPU0 -eq 0
			then
				output "### ERR Mining retries limit has been reached in the process GPU0.sh
					\n*** Check that the process is not zombie and start it manually " 1    # Sent by email
				alertFailProcessGPU0=1
			fi
		fi
	 fi
	 
	 if (test $miningIsActive1 -eq 0) 
        then
		if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1) 
              then
			if test $retriesGPU1 -lt $numRetries 
                     then
				output "### ERR the mining process in GPU1 is not started ......" 0
				output "*** Starting mining on GPU1..." 0
				overclock 1 $init_coreCLK1 $mem_freq  
				coreCLK1=$init_coreCLK1
				/usr/bin/screen -admS gpu1 ./gpu1.sh
				retriesGPU1=$(expr $retriesGPU1 + 1)
			elif test $alertFailProcessGPU1 -eq 0        
			then
				output "### ERR Mining retries limit has been reached in the process GPU1.sh
					\n*** Check that the process is not zombie and start it manually " 1    # Sent by email
     				alertFailProcessGPU1=1 
			fi
		fi
	 fi

	
	 if (test $miningIsActive2 -eq 0) 
        then
		if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1) 
              then
			if test $retriesGPU2 -lt $numRetries 
                     then
				output "### ERR the mining process in GPU2 is not started ......" 0
				output "*** Starting mining on GPU2..." 0
				overclock 2 $init_coreCLK2 $mem_freq  
				coreCLK2=$init_coreCLK2
				/usr/bin/screen -admS gpu2 ./gpu2.sh
				retriesGPU2=$(expr $retriesGPU2 + 1)
			elif test $alertFailProcessGPU2 -eq 0
			then
				output "### ERR Mining retries limit has been reached in the process GPU2.sh
					\n*** Check that the process is not zombie and start it manually " 1    # Sent by email
				alertFailProcessGPU2=1   
			fi
		fi
	 fi


	 # --------------------------------Automatic switching off control -----------------------------
	 if (test $tempGPU0 -gt $maxtemp_stop) || (test $tempGPU1 -gt $maxtemp_stop) || (test $tempGPU2 -gt $maxtemp_stop)
	 then
		if (test $mining_stopped -eq 0) 
		then
			
			if test $retryMiningAfterFailure -eq 1
			then
				output "ERR: Extreme temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining...
			 	       \n After some minutes it will be started again ..." 1   # Sent by email
			else
				output "ERR: Extreme Temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining...
				       \n Start the mining process manually ..." 1		# Sent by email
			fi
			./stop.sh
			mining_stopped=1
		fi
	 
	 else  # As soon as the GPUs temperatures are below temp_recover, mining is started again. 					  
	    if (test $mining_stopped -eq 1) && (test $retryMiningAfterFailure -eq 1)  	
	    then
			if (test $tempGPU0 -lt $temp_recover) && (test $tempGPU1 -lt $temp_recover) && (test $tempGPU2 -lt $temp_recover) 
			then
				# It Sets safe GPUs clock values
				output "The temperature of the GPUs has been recovered to $tempGPU0 / $tempGPU1 / $tempGPU2" 0
				output "GPU core clocks are set to $init_coreCLK0 / $init_coreCLK1 / $init_coreCLK2 MHz, memory to $mem_freq MHz." 0
				overclock 0 $init_coreCLK0 $mem_freq  
				overclock 1 $init_coreCLK1 $mem_freq  
				overclock 2 $init_coreCLK2 $mem_freq  
				coreCLK0=$init_coreCLK0
				coreCLK1=$init_coreCLK1
				coreCLK2=$init_coreCLK2
				retriesGPU0=0
				retriesGPU1=0
				retriesGPU2=0
				output " --> Starting mining." 0
				./start.sh
				mining_stopped=0
			fi
	    fi
	 fi

	 #------------------------------  Overclocking control on GPU0 ----------------------------------
	 
	 if (test $mining_stopped -eq 0) && (test $miningIsActive0 -eq 1)
	 then
	 	#The temperature is within the control margins, below target temp
	 	if (test $tempGPU0 -lt $target_temp) 
         	then   
			checkOverclockTimeGuard $GPU0 $timeCounter 0 
			if test $return_correction -eq 1 
              	then
				controlTemp $GPU0 $tempGPU0 $target_temp
			fi
			
	 	#The temperature is outside the control margins, below the alarm temp 
	 	elif (test $tempGPU0 -lt $hightemp_alarm)
	        then
			checkOverclockTimeGuard $GPU0 $timeCounter 1 #downclocking 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU0 $tempGPU0 $target_temp
			fi
		 
	 	# Overtemp alarm
	 	elif (test $tempGPU0 -lt $maxtemp_stop)
	        then
			output "Alarm! GPU0 very hot, temperature: $tempGPU0. Performing urgent downclocking ...." 1    #Sent by email
			checkOverclockTimeGuard $GPU0 $timeCounter 2 #urgent downclocking 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU0 $tempGPU0 $target_temp
			fi
	 	fi
		# FAN Speed control
		controlFAN $GPU0 $tempGPU0 $FANGPU0
	 fi

	 #------------------------------  Overclocking control on GPU1 ----------------------------------
	 
	 if (test $mining_stopped -eq 0) && (test $miningIsActive1 -eq 1)
	 then
	 	#The temperature is within the control margins, below target temp
	 	if (test $tempGPU1 -lt $target_temp) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0)  
	        then   
			checkOverclockTimeGuard $GPU1 $timeCounter 0 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU1 $tempGPU1 $target_temp
			fi
			
	 	#The temperature is outside the control margins, below the alarm temp 
	 	elif (test $tempGPU1 -lt $hightemp_alarm) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0) 
	        then
			checkOverclockTimeGuard $GPU1 $timeCounter 1 #downclocking 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU1 $tempGPU1 $target_temp
			fi
	
	 	# Overtemp alarm
	 	elif (test $tempGPU1 -lt $maxtemp_stop) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0) 
	        then
			output "Alarm! GPU1 very hot, temperature: $tempGPU1. Performing urgent downclocking ...." 1    #Sent by email
			checkOverclockTimeGuard $GPU1 $timeCounter 2 #urgent downclocking 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU1 $tempGPU1 $target_temp
			fi
	 	fi
		# FAN Speed control
		controlFAN $GPU1 $tempGPU1 $FANGPU1
	 fi
	 #------------------------------  Overclocking control on GPU2 ----------------------------------
	 if (test $mining_stopped -eq 0) && (test $miningIsActive2 -eq 1)
	 then
	 	#The temperature is within the control margins, below target temp
	 	if (test $tempGPU2 -lt $target_temp) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0) 
	        then   
			checkOverclockTimeGuard $GPU2 $timeCounter 0 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU2 $tempGPU2 $target_temp
			fi
			
	 	#The temperature is outside the control margins, below the alarm temp 
	 	elif (test $tempGPU2 -lt $hightemp_alarm) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0) 
	        then
			checkOverclockTimeGuard $GPU2 $timeCounter 1 #downclocking 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU2 $tempGPU2 $target_temp
			fi
	
	 	# Overtemp alarm
	 	elif (test $tempGPU2 -lt $maxtemp_stop) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0) 
	        then
			output "Alarm! GPU2 very hot, temperature: $tempGPU2. Performing urgent downclocking ...." 1    #Sent by email
			checkOverclockTimeGuard $GPU2 $timeCounter 2 #urgent downclocking 
			if test $return_correction -eq 1 
              	then
				controlTemp  $GPU2 $tempGPU2 $target_temp
			fi
		 fi
		# FAN Speed control
		controlFAN $GPU2 $tempGPU2 $FANGPU2
	 fi
	 timeCounter=$(expr $timeCounter + 1)
	 sleep $control_time;
done

A brief description of the control script:

You can change the minimum/maximum clock settings for each of your GPUs. I identified the limits manually, by hanging the system several times.
I also noticed that when playing with the limits, a mining process sometimes became a zombie (as shown by ps) before the system hung. In this situation I was not able to recover the process, not even by killing the parent process; the only way out was to restart the system. For this reason the control algorithm restarts the system when it finds zombie processes.
You can play with all the constants to tune the script to your own system. Almost everything is configurable (number of retries, mail sending, debugging logs, halt/reboot commands, etc.).
Replace the email addresses with your own.
1. The algorithm first obtains the GPU temperatures, the CPU temperature and the current clock settings, checks whether the mining processes are active, etc.
2. If the CPU temperature is very high (70°C), the script switches off the system and reports it by email. (This protects the system hardware from CPU overtemperature.)
3. It checks whether there are zombie processes (as discussed above). If so, the script can reboot the system and report it by email (only if the constant reboot=1).
4. It checks whether any of the GPUs is not mining. If so, it retries by starting the corresponding gpux.sh script. There is a retry limit of 5; if it is reached, this is also reported by email.
5. If a GPU has reached a very high temperature (83°C), it stops all mining processes. Once the temperature has recovered, it restarts the mining.
6. For each GPU, the script automatically controls the GPU clock by overclocking, downclocking and urgent downclocking when needed.
7. For each GPU, the script automatically controls the FAN speed.
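
The per-GPU temperature decision in steps 5 and 6 can be sketched as a small function. This is only an illustration, not the control script itself: classifyTemp is a hypothetical helper, the values of target_temp and hightemp_alarm are assumptions chosen for the example (only the 83°C stop limit comes from the description above), and temperatures are assumed to be whole numbers.

```shell
#!/bin/bash
# Illustrative thresholds. Only maxtemp_stop=83 is taken from the post;
# the other two values are assumptions for this example.
target_temp=70
hightemp_alarm=78
maxtemp_stop=83

# Hypothetical helper: prints the action the control loop would take
# for an integer GPU temperature given as the first argument.
classifyTemp() {
    local temp=$1
    if [ "$temp" -lt "$target_temp" ]; then
        echo "overclock"            # below target: room to raise the clock
    elif [ "$temp" -lt "$hightemp_alarm" ]; then
        echo "downclock"            # above target, below the alarm level
    elif [ "$temp" -lt "$maxtemp_stop" ]; then
        echo "urgent_downclock"     # alarm level: also reported by email
    else
        echo "stop_mining"          # at or above the stop limit
    fi
}

for t in 65 75 80 85; do
    echo "$t -> $(classifyTemp "$t")"
done
```

Keeping the decision in one function makes it easy to try different thresholds without touching the code that actually changes the clocks.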

As I did for the gpux.sh scripts, I like to launch the control script from another script, piping the output to the tee command in order to store the logs in a file. That way we can easily read them
from the monitor script:

start_control.sh

Code:

#!/bin/bash
cd /home/your_user/scripts
./control.sh | tee control.log
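
Two details of the pipe above are worth knowing: tee without -a overwrites control.log on every start, and only standard output goes through the pipe, so error messages are not stored. If you prefer to keep the old logs and capture errors too, a possible variant is sketched below. run_logged is a hypothetical helper name, and the demo uses a temporary file instead of control.log.

```shell
#!/bin/bash
# Hypothetical helper: runs a command, shows its output on screen and
# appends both stdout and stderr to the given log file.
run_logged() {
    local logfile=$1
    shift
    "$@" 2>&1 | tee -a "$logfile"
}

# Demo with a temporary log file
demo_log=$(mktemp)
run_logged "$demo_log" sh -c 'echo "normal message"; echo "error message" >&2'
run_logged "$demo_log" echo "second run"
```

In start_control.sh this would be used as: run_logged control.log ./control.sh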

Now, let's go to the monitor script:

monitor.sh

Code:

#!/bin/bash
while true; do
        echo "---------------- GPUs Health ----------------"
        aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print "GPU0 Temperature: " $5}' ;
        aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print "GPU1 Temperature: " $5}' ;
        aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print "GPU2 Temperature: " $5}' ;
        echo $(aticonfig --adapter=0 --odgc | grep GPU);
        echo $(aticonfig --adapter=1 --odgc | grep GPU);
	 echo $(aticonfig --adapter=2 --odgc | grep GPU);
        echo "GPU FANS: $(DISPLAY=:0.0 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.1 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.2 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4)"
        echo "Overclocking...." 
        echo "-   Core Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 18) Mhz."
        echo "-   Mem Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 29) Mhz."	
	 #echo " "	
	 echo ----------------  PC health -------------------
        echo $(sensors |grep CPU |grep Temperature) | cut -d ' ' -f 1,2,3
        echo $(sensors |grep NB |grep Temperature) | cut -d ' ' -f 1,2,3
        echo $(sensors |grep SB |grep Temperature) | cut -d ' ' -f 1,2,3
	 echo "HDD Avail: $(df -h |grep sda1 |cut -d ' ' -f 20)"
        #echo " " 
	 echo "---------------- Mining rate ------------------"
        # Check if there are Screen lock files.... 
        IsMining_gpu0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l) 
        IsMining_gpu1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)
        IsMining_gpu2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)
        # Last Hashrate report
	 if test $IsMining_gpu0 -ge 1 
	 then	
		echo "Mining on GPU0: " $(cat mining_gpu0.log | cut -d '[' -f 2)
	 else
		echo "Mining on GPU0:  ERR! Mining Process is stopped!"
	 fi

        if test $IsMining_gpu1 -ge 1
        then
		echo "Mining on GPU1: " $(cat mining_gpu1.log | cut  -d '[' -f 2)
        else
              echo "Mining on GPU1:  ERR! Mining process is stopped!"
        fi
        if test $IsMining_gpu2 -ge 1
        then
		echo "Mining on GPU2: " $(cat mining_gpu2.log | cut -d '[' -f 2)
        else
              echo "Mining on GPU2:  ERR! Mining process is stopped!"
        fi
	 #Erase mining logs... next loop we will find only the hashrate.
	 echo "" > mining_gpu0.log
	 echo "" > mining_gpu1.log
	 echo "" > mining_gpu2.log
        controlIsActive=$(ls /var/run/screen/S-your_user/ |grep control |wc -l)
        #echo " " 
	 echo "---------- Logs Mining Controller -------------"
	 if test $controlIsActive -ge 1
        then 
                tail -5 control.log
        else
                echo "Control algorithm is OFF"
        fi
 	 sleep 5;
        clear
done

A few notes for the monitor script:

Note that the GPU logs are erased on each loop. The time between loops is 5 seconds, exactly the same as the display rate of poclbm.py. This way, we ensure that the logs always contain a single hashrate report. In addition, the logs will not grow forever.
The monitor script also displays the last 5 lines of the control script logs, stored in the file control.log.
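
If the two 5-second timers drift, a log can occasionally hold two reports, or none, when the monitor reads it. A slightly more defensive way of reading the log is sketched below, under some assumptions: last_report is a hypothetical helper, and the sample log lines are invented for the demo (the real miner output may look different, the only thing assumed is that the hashrate follows a '[' character, as in the cut command of monitor.sh).

```shell
#!/bin/bash
# Hypothetical helper: prints the hashrate part of the last non-empty
# line of a mining log, then empties the log.
last_report() {
    local logfile=$1
    # Keep only the last non-empty line, then the text after the first '['
    grep -v '^[[:space:]]*$' "$logfile" | tail -n 1 | cut -d '[' -f 2
    # Truncate the file without writing the blank line that echo "" > leaves
    : > "$logfile"
}

# Demo with an invented log containing two reports
demo_log=$(mktemp)
printf '%s\n' 'host:8332 [401.100 MH/s (~450 MH/s)]' \
              'host:8332 [402.743 MH/s (~458 MH/s)]' > "$demo_log"
echo "Mining on GPU0: $(last_report "$demo_log")"
```

With this, the monitor shows one line per GPU even when more than one report has accumulated since the previous loop.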

As before, I launch the monitor.sh script from start_monitor.sh:

start_monitor.sh

Code:

#!/bin/bash
cd /home/your_user/scripts
monitorIsRunning=$(ls /var/run/screen/S-your_user/ |grep monitor |wc -l) 
if test $monitorIsRunning -ge 1 
then	
	echo "Monitor script is already running in another Screen. Getting attached..." 
	screen -x monitor
else
	echo "Monitor script is not running. Starting..." 
	/usr/bin/screen -admS monitor ./monitor.sh
fi

As you can see, this script works both for starting the monitor and for attaching to the screen in which the monitor is already running.
I made a symbolic link to this file called "m" (see the ln command). Since then, all I need to do to monitor my rig is enter "m" in the console. (This is very convenient when accessing the RIG from my mobile.)
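
For those not familiar with ln, creating such a shortcut could look roughly like this. make_shortcut is a hypothetical helper, and the demo works in a temporary directory with a stand-in script so that it can be tried safely. On a real rig the target would be start_monitor.sh, and the link would have to live in a directory that is on your PATH (which directory that is on your system is an assumption you need to check) so that typing "m" alone works.

```shell
#!/bin/bash
# Hypothetical helper: makes the target script executable and creates
# (or replaces) a symbolic link pointing to it.
make_shortcut() {
    local target=$1
    local link=$2
    chmod +x "$target"
    ln -sf "$target" "$link"
}

# Demo in a temporary directory with a stand-in for start_monitor.sh
demo_dir=$(mktemp -d)
printf '%s\n' '#!/bin/bash' 'echo "monitor started"' > "$demo_dir/start_monitor.sh"
make_shortcut "$demo_dir/start_monitor.sh" "$demo_dir/m"

# Running the link runs the script it points to
"$demo_dir/m"
```

Because start_monitor.sh changes to the scripts directory itself, the link works no matter which directory you are in when you call it.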

This is the output of the monitor script (updated every 5 seconds):

---------------- GPUs Health ----------------
GPU0 Temperature: 57.50
GPU1 Temperature: 54.50
GPU2 Temperature: 57.50
GPU load : 98%
GPU load : 97%
GPU load : 98%
GPU FANS: 45% / 45% / 45%
Overclocking....
- Core Clocks: 945 / 955 / 1020 Mhz.
- Mem Clocks: 300 / 300 / 300 Mhz.
---------------- PC health -------------------
CPU Temperature: +31.0°C
NB Temperature: +43.0°C
SB Temperature: +31.0°C
HDD Avail: 2GB
---------------- Mining rate ------------------
Mining on GPU0: 402.743 MH/s (~458 MH/s)]
Mining on GPU1: 405.513 MH/s (~568 MH/s)]
Mining on GPU2: 435.513 MH/s (~598 MH/s)]
---------- Logs Mining Controller -------------
[Time: 18616 | 3ene15:50:30] New setting: FAN GPU0 to 30 %
[Time: 18617 | 3ene15:50:35] New setting: FAN GPU0 to 45 %
[Time: 18618 | 3ene15:50:40] New setting: FAN GPU1 to 45 %
[Time: 18783 | 3ene16:05:29] New setting: FAN GPU1 to 30 %
[Time: 18787 | 3ene16:05:51] New setting: FAN GPU1 to 45 %

Now we can complete the start.sh script by adding the control and monitor scripts:

start.sh

Code:

#!/bin/bash

cd /home/your_user/scripts
echo Starting mining scripts... 
/usr/bin/screen -admS gpu0 ./gpu0.sh
/usr/bin/screen -admS gpu1 ./gpu1.sh
/usr/bin/screen -admS gpu2 ./gpu2.sh
echo Starting monitor script...
/usr/bin/screen -admS monitor ./monitor.sh
echo Starting automatic control script...
/usr/bin/screen -admS control ./start_control.sh
echo " " 
echo For monitoring the RIG, enter m.

And of course, we will need a stop.sh script for stopping all the mining scripts, monitor and control scripts:

stop.sh

Code:

#!/bin/bash
screen -X -S gpu0 kill
screen -X -S gpu1 kill
screen -X -S gpu2 kill
screen -X -S monitor kill
screen -X -S control kill
killall screen

That's all folks!!

I hope you liked this post and that it will be useful for your mining systems!!

If you liked this post and want to send me a donation, I will be very grateful; it will give me energy to share other work.
BTC Address: 1NKJuhGCx7HM2skXdzAkfnxJyfsubh475A