NEICH (OP)
January 03, 2013, 03:19:55 PM |
DEAR BITCOINERS:
I am very glad to share with you the result of several months of work on my dear mining rig, which has 3 ATI Radeon HD 5870 cards. It is an unattended, automatically controlled mining rig, and it has given me a lot of benefits (hardware protection, automatic settings based on temperature, fan control, mail sending, etc.). I hope it will be useful for others. Of course I know that mining with GPUs is not really profitable now, but I am also waiting for an ASIC with 30 GH/s!! My idea is to adapt all these scripts to ASIC mining as soon as possible and share them with you, if I see that you consider this post valuable. Note that I provide all this work absolutely free! However, I would be really grateful to receive donations to my BTC address. That would encourage me to publish other ideas and code for the BTC community. (You can find my address at the end of the post.)
First of all, let me list the benefits obtained from using these scripts:
- All scripts are automatically started when switching on the system
- All scripts can be manually started by a single command
- All scripts can be manually stopped by a single command
- There is a control script in charge of the following actions:
- Automatic overclocking of each GPU based on the GPU temperature, target temperature, etc.
- Automatic downclocking of each GPU based on the GPU temperature, target temperature, etc.
- Automatic restart of the mining processes if they stop abnormally (5 retries)
- Sending of mails if the retry limit is reached, to inform the user of the problem
- Automatic shutdown of the mining processes if the temperature is very high. This prevents a worse hardware failure in the GPUs
- Automatic restart of the mining processes once the temperature is safe again for mining
- Automatic FAN control for each GPU from 30% to 100% of speed based on the temperature
- There is a monitor script continuously sensing all system parameters for user supervision:
- Current GPUs clocks
- Current FAN clocks
- Current temperature for each GPU
- Temperatures of CPU, Motherboard, etc
- Current hashrate for each GPU
- Last 5 control script actions
- The monitor script can be used through an ssh or telnet connection. In my case, I supervise my rig through SSH from my mobile using the 'ConnectBot' application (see the Android or iPhone markets)
Before using the scripts, have a look at the base system. Your system may differ from this one, but I provide all the code so that you can make any corrections needed to adapt it to your system. The base system for running these scripts is the following:
- Linux OS (I use Debian Linux). All the scripts are written in Linux shell.
- ATI graphic cards (I have 3 ATI HD 5870). Note that I use 'aticonfig' software for almost everything
- ATI drivers already installed, AMD SDK, etc. These guides were very useful for me:
- lm-sensors package already installed (use apt-get or aptitude). This is for retrieving the CPU and motherboard temperatures
- Screen package installed (apt-get install screen). This tool lets us run all the scripts from the GUI while sharing the console output with other sessions. That way we can monitor the system from telnet or SSH, as the console is shared (thanks to screen!!).
- I use poclbm.py for GPU mining. Other mining software will be ok, but you will need to change the related piece of code from the scripts
- My system is fully unattended... so I have also installed VNC software for remote connection to the desktop (I have my wallet there)
- I also have automatic login enabled, so that when I switch on the system, the user is logged in, the scripts are automatically started and the wallet application is also started and gets updated with the BTC network transactions. (In my case, the rig is installed in a cooled place very far from my house)
- The control script uses the "reboot" and "halt" commands without a sudo password. To set this up, read this link: http://sleekmason.wordpress.com/fluxbox/using-etcsudoers-to-allow-shutdownrestart-without-password/
- I use the "mail" command for sending mails. The mail command must be available to the script (look for it on the internet).
OK, after this brief introduction, let's go to the scripts. I hope you enjoy them!!
I have the following little scripts for starting the individual mining process on each of my GPUs:
gpu0.sh
#!/bin/bash
export DISPLAY=:0.0
cd /home/your_path/scripts
DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d0 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu0.log
As you can see, I use a pool for mining (deepbit). You should change this line to match your own parameters for poclbm.py or other mining software. Note that I pipe the output to the 'tee' command in order to store the output of the process in a log file called "mining_gpu0.log". This will be useful later for monitoring, as we will retrieve the hashrate of each GPU from these log files. Note the parameter -r 5: it makes poclbm.py update its output (the hashrate) once every 5 seconds. The reason for this will be discussed later...
You should create additional gpux.sh files, one for each GPU. In my case I have 3 GPUs, so I have these additional scripts:
gpu1.sh
#!/bin/bash
export DISPLAY=:0.1
cd /home/your_path/scripts
DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 75"
# NOTE: -d picks the OpenCL device; check that 0 is the right index for this GPU
DISPLAY=:0 ./poclbm.py -d0 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu1.log
gpu2.sh
#!/bin/bash
export DISPLAY=:0.2
cd /home/your_path/scripts
DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 75"
# NOTE: -d picks the OpenCL device; check that 0 is the right index for this GPU
DISPLAY=:0 ./poclbm.py -d0 -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu2.log
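Since the launchers differ only in the GPU number, they could also be generated with a loop. This is just a minimal sketch, not part of my original scripts: the function name make_gpu_scripts and its arguments are made up for the example, and it assumes that the X screen (:0.N), the log name and the poclbm device index (-dN) all follow the GPU number. The device index in particular is an assumption, so check it against the devices poclbm.py lists on your system.

```shell
#!/bin/bash
# Hypothetical helper: writes gpu0.sh .. gpu(N-1).sh into a directory.
# Assumption: X screen, device index and log name all equal the GPU number.
make_gpu_scripts() {
    local out_dir=$1 num_gpus=$2 i
    for ((i = 0; i < num_gpus; i++)); do
        cat > "$out_dir/gpu$i.sh" <<EOF
#!/bin/bash
export DISPLAY=:0.$i
cd /home/your_path/scripts
DISPLAY=:0.$i aticonfig --pplib-cmd "set fanspeed 0 75"
DISPLAY=:0 ./poclbm.py -d$i -v -r 5 -w128 http://your_user@mail.com:your_password@deepbit.net:8332 | tee mining_gpu$i.log
EOF
        chmod +x "$out_dir/gpu$i.sh"
    done
}

# Example: generate three launchers in a scratch directory and list them.
demo_dir=$(mktemp -d)
make_gpu_scripts "$demo_dir" 3
ls "$demo_dir"
```

On the rig you would pass /home/your_path/scripts instead of the scratch directory.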
As I said, I use the "screen" Linux tool for sharing the output of the running commands. For example, we can run the script gpu0.sh in a shared console with the following command:
/usr/bin/screen -admS gpu0 ./gpu0.sh
This will create a shared console called "gpu0", which can be accessed through telnet or ssh with the following command:
screen -x gpu0
Therefore, we can watch the output of gpu0.sh as it runs. Note that to leave a shared console you have to press 'CTRL+A' and then 'D' (to detach from the shared console). Otherwise, you may stop the execution of gpu0.sh in that console.
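The control and monitor scripts further down find out whether a session is alive by looking for its socket file in the screen directory of the user (/var/run/screen/S-your_user/ on my system; the location can differ on yours). A minimal sketch of that check as a reusable function follows. The name session_active is made up for this example, and it relies on screen naming the socket pid.session_name when the session is started with -S.

```shell
#!/bin/bash
# Hypothetical helper: prints 1 if a screen session with the given name has a
# socket file in the given directory, 0 otherwise.
# Socket files are named <pid>.<session_name>, for example 1234.gpu0
session_active() {
    local socket_dir=$1 name=$2 f
    for f in "$socket_dir"/*."$name"; do
        if [ -e "$f" ]; then
            echo 1
            return 0
        fi
    done
    echo 0
}

# Demo against a scratch directory, so it runs without screen installed.
demo_dir=$(mktemp -d)
touch "$demo_dir/1234.gpu0"
session_active "$demo_dir" gpu0
session_active "$demo_dir" gpu1
```

Unlike a plain grep for "gpu0" over the directory listing, the pattern only matches names that end in exactly the session name.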
Now, we can define the script start.sh that will launch the mining scripts:
start.sh
#!/bin/bash
cd /home/your_path/scripts
echo Starting mining scripts...
/usr/bin/screen -admS gpu0 ./gpu0.sh
/usr/bin/screen -admS gpu1 ./gpu1.sh
/usr/bin/screen -admS gpu2 ./gpu2.sh
...
Now, you can add this script (/home/your_path/start.sh) to your startup programs group. You can easily do it from the system menu.
The control script has a lot of features. It is full of comments, so I expect you will find enough information there.
control.sh
#!/bin/bash
#---Time constants
control_time=5        # Time between control loops (5 seconds)
overclock_delay=180   # Waiting time between overclocking commands, as a multiple of control_time (180*5 s = 15 min)
downclock_delay=60    # Waiting time between downclocking commands (60*5 s = 5 min)
downclock_urgent=24   # Waiting time between urgent downclocking commands (24*5 s = 2 min)
timeCounter=0         # Time counter

#---GPU temperatures
target_temp=75        # Target temp for the GPUs. Automatic overclocking/downclocking is performed to reach this temperature as a maximum.
hightemp_alarm=80     # Alarm temperature: if exceeded, an urgent downclocking is performed
maxtemp_stop=83       # Maximum temperature in the GPUs: the mining process is stopped for safety reasons.
temp_recover=65       # Recovery temperature: after a mining stop due to high temperature, mining is started again when this safe temperature is reached.
control_gap=3         # Distance below target_temp that must be exceeded before an overclock command. (If the current temperature is very near the target temp, no overclocking is performed... we keep the temp. near but below the limit)

#---CPU temperatures
tempCPU_halt=70       # If this temperature is reached by the CPU or motherboard, a HALT is performed to turn off the rig.

#---Clock limits
corefreq_min=800      # Minimum freq. to be set by the control algorithm
corefreq_max0=945     # Maximum freq. to be set by the control algorithm in GPU0 (In my case I checked that above 975 MHz this GPU hangs the X session)
corefreq_max1=955     # Maximum freq. to be set by the control algorithm in GPU1 (In my case I checked that above 995 MHz the mining process became a zombie)
corefreq_max2=1025    # Maximum freq. to be set by the control algorithm in GPU2 (In my case I checked that above 1055 MHz the mining process became a zombie)
mem_freq=300          # Fixed value for the memory clock (normally it is 1200 MHz in the GPUs, but using 300 MHz reduces the temperature without affecting the performance)

#--Mail sending
subject="Important advice from your RIG"
mail1="your_mail@mail.com"
mail2="other_mail@mail.com"
mail3="other_mail@mail.com"

#--Control constants
retryMiningAfterFailure=1   # Mining scripts are automatically started after a failure
debug=0                     # Enable/disable debugging messages
numRetries=5                # Limit of retries for restarting the mining processes in the GPUs
reboot=1                    # If zombie mining processes are detected, the control script can perform an automatic system reboot. This recovers mining in all GPUs.

#--FAN constants
FANGPU0=75
FANGPU1=75
FANGPU2=75

#--Internal variables
GPU0=0
GPU1=1
GPU2=2
mining_stopped=0      # Mining process has been stopped by the control algorithm
init_coreCLK0=900     # Initial overclocking value for GPU0
init_coreCLK1=900     # Initial overclocking value for GPU1
init_coreCLK2=900     # Initial overclocking value for GPU2
counterLastCLK0=0     # Stores the time of the last overclocking/downclocking performed on GPU0
counterLastCLK1=0     # Stores the time of the last overclocking/downclocking performed on GPU1
counterLastCLK2=0     # Stores the time of the last overclocking/downclocking performed on GPU2
simulation=0          # If 1, disables the overclocking and only logs outputs for debugging.
retriesGPU0=0
retriesGPU1=0
retriesGPU2=0
alertFailProcessGPU0=0
alertFailProcessGPU1=0
alertFailProcessGPU2=0
# ---------------------------------------------------------------------
# Function debug: It outputs messages to the console only in debug mode
# Parameters: Text message to be displayed
# ---------------------------------------------------------------------
function debug(){
    if test $debug -eq 1; then
        echo -e "[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $@"
    fi
}

# ---------------------------------------------------------------------
# Function output: This function outputs messages to the console
# $1: Message
# $2: if $2=1 the message is also sent by email
# ---------------------------------------------------------------------
function output(){
    mensaje="[Time: $timeCounter | $(date | awk '{print $3 $2 $4}')] $1"
    echo -e "$mensaje"
    if test $2 -eq 1; then
        echo -e "$mensaje" | mail -s "$subject" $mail1
        echo -e "$mensaje" | mail -s "$subject" $mail2
        echo -e "$mensaje" | mail -s "$subject" $mail3
    fi
}

# ----------------------------------------------------------------
# Function FANCommand: This function sets the FAN speed of a GPU
# Params: $1:num_gpu: $GPU0,$GPU1,$GPU2
#         $2:FAN_SPEED: Value from 0 to 100 %, e.g. 100
# Use: FANCommand $GPU0 100
# ----------------------------------------------------------------
function FANCommand(){
    # Output is discarded (the original ">>null" wrote to a file called "null")
    case $1 in
        0) DISPLAY=:0.0 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
        1) DISPLAY=:0.1 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
        2) DISPLAY=:0.2 aticonfig --pplib-cmd "set fanspeed 0 $2" >/dev/null;;
    esac
    output " New setting: FAN GPU$1 to $2 %" 0
}

# ----------------------------------------------------------------
# Function overclock: This function sets the clk of a GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
#         $2:clkfreq : core clk value to be set, e.g. 850
#         $3:memfreq : mem clk to be set, e.g. 300
# Use: overclock 0 850 1200
# ----------------------------------------------------------------
function overclock(){
    if test $simulation -eq 0; then
        case $1 in
            0) aticonfig --adapter=0 --odsc=$2,$3 >/dev/null;;
            1) aticonfig --adapter=1 --odsc=$2,$3 >/dev/null;;
            2) aticonfig --adapter=2 --odsc=$2,$3 >/dev/null;;
        esac
    fi
    output " New setting: Overclock GPU$1 to $2 / $3 Mhz" 0
}
# ----------------------------------------------------------------
# Function controlFAN: Calculates the FAN speed depending on the temperature and the current FAN speed (num_gpu, currentTemp, currentFAN)
# It performs the FAN speed control of a GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
#         $2:currentTemp: Current value reported by the GPU (no decimals), e.g. 56
#         $3:currentFAN: Current FAN speed for this GPU, e.g. 75 %
#
# Use: controlFAN $GPU0 56 75
# ----------------------------------------------------------------
function controlFAN(){
    # Default: keep the current speed. Inside a hysteresis band no branch below
    # assigns a value, so without this line the value left by the previous call
    # (possibly for another GPU) would be used.
    controlFAN=$3
    #hysteresis of 2ºC around temperature threshold 55º
    if test $2 -lt 54; then
        controlFAN=30
    elif test $2 -lt 56; then
        if test $3 -ne 45; then
            controlFAN=30
        fi
    #hysteresis of 2ºC around temperature threshold 60º
    elif test $2 -lt 59; then
        controlFAN=45
    elif test $2 -lt 61; then
        if test $3 -ne 60; then
            controlFAN=45
        fi
    #hysteresis of 2ºC around temperature threshold 65º
    elif test $2 -lt 64; then
        controlFAN=60
    elif test $2 -lt 66; then
        if test $3 -ne 75; then
            controlFAN=60
        fi
    #hysteresis of 2ºC around temperature threshold 70º
    elif test $2 -lt 69; then
        controlFAN=75
    elif test $2 -lt 71; then
        if test $3 -ne 90; then
            controlFAN=75
        fi
    else
        controlFAN=90
    fi

    # It sends the FAN speed command only if the new setting is different from the current one.
    case $1 in
        0) debug "FAN control GPU0: Current FAN=$FANGPU0, controlFAN:$controlFAN"
           if test $controlFAN -ne $FANGPU0; then
               FANCommand 0 $controlFAN
               FANGPU0=$controlFAN
           fi;;
        1) debug "FAN control GPU1: Current FAN=$FANGPU1, controlFAN:$controlFAN"
           if test $controlFAN -ne $FANGPU1; then
               FANCommand 1 $controlFAN
               FANGPU1=$controlFAN
           fi;;
        2) debug "FAN control GPU2: Current FAN=$FANGPU2, controlFAN:$controlFAN"
           if test $controlFAN -ne $FANGPU2; then
               FANCommand 2 $controlFAN
               FANGPU2=$controlFAN
           fi;;
    esac
}
# ----------------------------------------------------------------
# Function controlTemp: Calculates the GPU clock correction to be performed depending on the GPU temperatures (num_gpu, currentTemp, targetTemp)
# It performs the overclocking/downclocking of the GPU
# Params: $1:num_gpu : $GPU0,$GPU1,$GPU2
#         $2:currentTemp: Current temperature reported by the GPU (no decimals), e.g. 56
#         $3:TargetTemp: Target temperature desired in this GPU as maximum, e.g. 78
# Outputs:
#         counterLastCLK0, counterLastCLK1 and counterLastCLK2: Time information of the last CLK correction on each GPU.
#
# Use: controlTemp $GPU0 56 78
# ----------------------------------------------------------------
function controlTemp(){
    offsetCLK=$(expr $3 - $2)
    # The temperature gap defined in 'control_gap' avoids stressing the GPU when the current temperature is very near the target temperature
    if (test $offsetCLK -gt 0) && (test $offsetCLK -lt $control_gap); then
        debug "The correction ($offsetCLK) does not exceed the control GAP ($control_gap). CLK is maintained."
        return 1
    fi
    # Demanded frequencies are limited to the specific clk range of each GPU
    case $1 in
        0) demandaCLK=$(expr $coreCLK0 + $offsetCLK)
           if test $demandaCLK -gt $corefreq_max0; then
               demandaCLK=$corefreq_max0
           fi;;
        1) demandaCLK=$(expr $coreCLK1 + $offsetCLK)
           if test $demandaCLK -gt $corefreq_max1; then
               demandaCLK=$corefreq_max1
           fi;;
        2) demandaCLK=$(expr $coreCLK2 + $offsetCLK)
           if test $demandaCLK -gt $corefreq_max2; then
               demandaCLK=$corefreq_max2
           fi;;
    esac
    if test $demandaCLK -lt $corefreq_min; then
        demandaCLK=$corefreq_min
    fi
    debug "*** GPU$1 --> CurrentTemp:$2 - Target:$3 - Control:$demandaCLK ($offsetCLK)"
    # Sending of overclock command, only if there is a change.
    case $1 in
        0) if test $demandaCLK -ne $coreCLK0; then
               overclock 0 $demandaCLK $mem_freq
               counterLastCLK0=$timeCounter
               debug "Time contCLK0: $counterLastCLK0"
           else
               debug "GPU0: Limit is already reached: $corefreq_max0."
               counterLastCLK0=$timeCounter
           fi;;
        1) if test $demandaCLK -ne $coreCLK1; then
               overclock 1 $demandaCLK $mem_freq
               counterLastCLK1=$timeCounter
               debug "Time contCLK1: $counterLastCLK1"
           else
               debug "GPU1: Limit is already reached: $corefreq_max1."
               counterLastCLK1=$timeCounter
           fi;;
        2) if test $demandaCLK -ne $coreCLK2; then
               overclock 2 $demandaCLK $mem_freq
               counterLastCLK2=$timeCounter
               debug "Time contCLK2: $counterLastCLK2"
           else
               debug "GPU2: Limit is already reached: $corefreq_max2."
               counterLastCLK2=$timeCounter
           fi;;
    esac
}
# ----------------------------------------------------------------
# Function checkOverclockTimeGuard: This function ensures a certain period of time between consecutive overclock commands.
#   - Guard time between overclocks: $overclock_delay*$control_time (180*5 s = 15 minutes)
#   - Guard time between downclocks: $downclock_delay*$control_time (60*5 s = 5 minutes)
#   - Guard time between urgent downclocks: $downclock_urgent*$control_time (24*5 s = 2 minutes)
# Params: $1:num_gpu : 0,1,2
#         $2:timeCounter: Current value of the time counter
#         $3:up_down: 0:overclock, 1:downclock, 2:urgent downclock
# Outputs:
#         $return_correction: 0:Do not perform a CLK correction. 1:A CLK correction can be performed now.
# ----------------------------------------------------------------
function checkOverclockTimeGuard (){
    return_correction=0
    case $1 in
        0) if test $3 -eq 0; then    ##overclocking
               due_time=$(expr $counterLastCLK0 + $overclock_delay)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           elif test $3 -eq 1; then  ##normal downclocking
               due_time=$(expr $counterLastCLK0 + $downclock_delay)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           elif test $3 -eq 2; then  ##urgent downclocking
               due_time=$(expr $counterLastCLK0 + $downclock_urgent)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           fi;;
        1) if test $3 -eq 0; then    ##overclocking
               due_time=$(expr $counterLastCLK1 + $overclock_delay)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           elif test $3 -eq 1; then  ##normal downclocking
               due_time=$(expr $counterLastCLK1 + $downclock_delay)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           elif test $3 -eq 2; then  ##urgent downclocking
               due_time=$(expr $counterLastCLK1 + $downclock_urgent)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           fi;;
        2) if test $3 -eq 0; then    ##overclocking
               due_time=$(expr $counterLastCLK2 + $overclock_delay)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           elif test $3 -eq 1; then  ##normal downclocking
               due_time=$(expr $counterLastCLK2 + $downclock_delay)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           elif test $3 -eq 2; then  ##urgent downclocking
               due_time=$(expr $counterLastCLK2 + $downclock_urgent)
               if test $2 -ge $due_time; then
                   return_correction=1
               fi
           fi;;
    esac
    debug "-------GPU$1: due_time: $due_time, correction: $return_correction"
}
# ---------------------------------------------------------------------------------------------------------
# MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN - MAIN
# ---------------------------------------------------------------------------------------------------------
#Overclocking enabling
aticonfig --adapter=0 --od-enable
aticonfig --adapter=1 --od-enable
aticonfig --adapter=2 --od-enable
overclock 0 $init_coreCLK0 $mem_freq
overclock 1 $init_coreCLK1 $mem_freq
overclock 2 $init_coreCLK2 $mem_freq
FANCommand 0 30
FANCommand 1 30
FANCommand 2 30
output "Automatic control algorithm has been started" 1   #This is sent by email.
output "Automatic FAN speed control starts from 30%" 0
while true; do
    #Fetching of current temperatures
    tempGPU0=$(aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
    tempGPU1=$(aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
    tempGPU2=$(aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print $5}' | cut -c1-2)
    tempCPU=$(sensors |grep CPU |grep Temperature | awk '{print $3}'|cut -c2-3)
    tempMB=$(sensors |grep NB |grep Temperature | awk '{print $3}'|cut -c2-3)
    #Fetching of current GPU CLK frequencies
    coreCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $4}')
    memCLK0=$(aticonfig --adapter=0 --odgc |grep Clocks | awk '{print $5}')
    coreCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $4}')
    memCLK1=$(aticonfig --adapter=1 --odgc |grep Clocks | awk '{print $5}')
    coreCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $4}')
    memCLK2=$(aticonfig --adapter=2 --odgc |grep Clocks | awk '{print $5}')
    #It detects if there are mining processes already running
    miningIsActive0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l)   # we look for the 'screen' session lock file
    miningIsActive1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)   # we look for the 'screen' session lock file
    miningIsActive2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)   # we look for the 'screen' session lock file
    loadGPU0=$(aticonfig --adapter=0 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)   # The load of each GPU is also checked
    loadGPU1=$(aticonfig --adapter=1 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
    loadGPU2=$(aticonfig --adapter=2 --odgc |grep GPU |awk '{print $4}' | cut -c1-2)
    debug " --> Temps: $tempGPU0, $tempGPU1, $tempGPU2"
    debug " --> Clks: $coreCLK0, $coreCLK1, $coreCLK2"
    # --------------------------------Temperature of CPU and Motherboard ---------------------------------------
    if (test $tempCPU -gt $tempCPU_halt) || (test $tempMB -gt $tempCPU_halt); then
        output "ERR: Temperature of CPU/MB is too high! $tempCPU / $tempMB.... \nSWITCHING OFF THE SYSTEM. \n Check the CPU FAN condition and switch the RIG on manually." 1
        /usr/bin/halt
    fi
    # Checking of zombie mining processes.
    num_defunc=$(ps -Al |grep py|grep defunc| wc -l)
    if test $num_defunc -gt 0; then
        if test $reboot -eq 1; then
            output "### ERR: There are one or more zombie mining processes: \nMaybe a mining process is hung and blocked. \nIt is necessary to restart the system to recover the mining (sudo reboot). \n$(ps -Al |grep py|grep defunc| wc -l) \n --> PERFORMING AN AUTOMATIC REBOOT OF THE SYSTEM...." 1   # Sent by email
            /usr/bin/reboot
        else
            output "### ERR: There are one or more zombie mining processes: \nMaybe a mining process is hung and blocked. \nIt is necessary to restart the system to recover the mining (sudo reboot). \n$(ps -Al |grep py|grep defunc| wc -l)" 1   # Sent by email
        fi
    fi
    # --------------------------------Init checks -------------------------------
    # (the original had "coreCLKx= $init_coreCLKx" with a stray space, which does not assign the value)
    if (test $miningIsActive0 -eq 0); then
        if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1); then
            if test $retriesGPU0 -lt $numRetries; then
                output "### ERR the mining process in GPU0 is not started ......" 0
                output "*** Starting mining on GPU0..." 0
                overclock 0 $init_coreCLK0 $mem_freq
                coreCLK0=$init_coreCLK0
                /usr/bin/screen -admS gpu0 ./gpu0.sh
                retriesGPU0=$(expr $retriesGPU0 + 1)
            elif test $alertFailProcessGPU0 -eq 0; then
                output "### ERR Mining retries limit has been reached in the process GPU0.sh \n*** Check that the process is not zombie and start it manually " 1   # Sent by email
                alertFailProcessGPU0=1
            fi
        fi
    fi
    if (test $miningIsActive1 -eq 0); then
        if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1); then
            if test $retriesGPU1 -lt $numRetries; then
                output "### ERR the mining process in GPU1 is not started ......" 0
                output "*** Starting mining on GPU1..." 0
                overclock 1 $init_coreCLK1 $mem_freq
                coreCLK1=$init_coreCLK1
                /usr/bin/screen -admS gpu1 ./gpu1.sh
                retriesGPU1=$(expr $retriesGPU1 + 1)
            elif test $alertFailProcessGPU1 -eq 0; then
                output "### ERR Mining retries limit has been reached in the process GPU1.sh \n*** Check that the process is not zombie and start it manually " 1   # Sent by email
                alertFailProcessGPU1=1
            fi
        fi
    fi
    if (test $miningIsActive2 -eq 0); then
        if (test $mining_stopped -eq 0) && (test $retryMiningAfterFailure -eq 1); then
            if test $retriesGPU2 -lt $numRetries; then
                output "### ERR the mining process in GPU2 is not started ......" 0
                output "*** Starting mining on GPU2..." 0
                overclock 2 $init_coreCLK2 $mem_freq
                coreCLK2=$init_coreCLK2
                /usr/bin/screen -admS gpu2 ./gpu2.sh
                retriesGPU2=$(expr $retriesGPU2 + 1)
            elif test $alertFailProcessGPU2 -eq 0; then
                output "### ERR Mining retries limit has been reached in the process GPU2.sh \n*** Check that the process is not zombie and start it manually " 1   # Sent by email
                alertFailProcessGPU2=1
            fi
        fi
    fi
    # --------------------------------Automatic switching off control -----------------------------
    if (test $tempGPU0 -gt $maxtemp_stop) || (test $tempGPU1 -gt $maxtemp_stop) || (test $tempGPU2 -gt $maxtemp_stop); then
        if (test $mining_stopped -eq 0); then
            if test $retryMiningAfterFailure -eq 1; then
                output "ERR: Extreme temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining... \n After some minutes it will be started again ..." 1   # Sent by email
            else
                output "ERR: Extreme temperature in GPUs ($tempGPU0, $tempGPU1, $tempGPU2 ºC) - Switching off the mining... \n Start the mining process manually ..." 1   # Sent by email
            fi
            ./stop.sh
            mining_stopped=1
        fi
    else
        # As soon as the GPU temperatures are below temp_recover, mining is started again.
        if (test $mining_stopped -eq 1) && (test $retryMiningAfterFailure -eq 1); then
            # All GPUs must be below temp_recover (the original used ||, which restarts as soon as any one GPU is cool)
            if (test $tempGPU0 -lt $temp_recover) && (test $tempGPU1 -lt $temp_recover) && (test $tempGPU2 -lt $temp_recover); then
                # It sets safe GPU clock values
                output "The temperature of the GPUs has recovered to $tempGPU0 / $tempGPU1 / $tempGPU2" 0
                output "GPU clocks are reset to $init_coreCLK0 / $init_coreCLK1 / $init_coreCLK2 MHz (core) and $mem_freq MHz (memory)." 0
                overclock 0 $init_coreCLK0 $mem_freq
                overclock 1 $init_coreCLK1 $mem_freq
                overclock 2 $init_coreCLK2 $mem_freq
                coreCLK0=$init_coreCLK0
                coreCLK1=$init_coreCLK1
                coreCLK2=$init_coreCLK2
                retriesGPU0=0
                retriesGPU1=0
                retriesGPU2=0
                output " --> Starting mining." 0
                ./start.sh   # relaunches the gpux.sh scripts (the original called ./minar.sh here)
                mining_stopped=0
            fi
        fi
    fi
    #------------------------------ Overclocking control on GPU0 ----------------------------------
    if (test $mining_stopped -eq 0) && (test $miningIsActive0 -eq 1); then
        #The temperature is within the control margins, below target temp
        if (test $tempGPU0 -lt $target_temp); then
            checkOverclockTimeGuard $GPU0 $timeCounter 0
            if test $return_correction -eq 1; then
                controlTemp $GPU0 $tempGPU0 $target_temp
            fi
        #The temperature is outside the control margins, below the alarm temp
        elif (test $tempGPU0 -lt $hightemp_alarm); then
            checkOverclockTimeGuard $GPU0 $timeCounter 1   #downclocking
            if test $return_correction -eq 1; then
                controlTemp $GPU0 $tempGPU0 $target_temp
            fi
        # Overtemp alarm
        elif (test $tempGPU0 -lt $maxtemp_stop); then
            output "Alarm! GPU0 very hot, temperature: $tempGPU0. Performing urgent downclocking ...." 1   #Sent by email
            checkOverclockTimeGuard $GPU0 $timeCounter 2   #urgent downclocking
            # (the original tested the unset variable $return here)
            if test $return_correction -eq 1; then
                controlTemp $GPU0 $tempGPU0 $target_temp
            fi
        fi
        # FAN Speed control
        controlFAN $GPU0 $tempGPU0 $FANGPU0
    fi
    #------------------------------ Overclocking control on GPU1 ----------------------------------
    if (test $mining_stopped -eq 0) && (test $miningIsActive1 -eq 1); then
        #The temperature is within the control margins, below target temp
        if (test $tempGPU1 -lt $target_temp) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0); then
            checkOverclockTimeGuard $GPU1 $timeCounter 0
            if test $return_correction -eq 1; then
                controlTemp $GPU1 $tempGPU1 $target_temp
            fi
        #The temperature is outside the control margins, below the alarm temp
        elif (test $tempGPU1 -lt $hightemp_alarm) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0); then
            checkOverclockTimeGuard $GPU1 $timeCounter 1   #downclocking
            if test $return_correction -eq 1; then
                controlTemp $GPU1 $tempGPU1 $target_temp
            fi
        # Overtemp alarm
        elif (test $tempGPU1 -lt $maxtemp_stop) && (test $miningIsActive1 -eq 1) && (test $mining_stopped -eq 0); then
            output "Alarm! GPU1 very hot, temperature: $tempGPU1. Performing urgent downclocking ...." 1   #Sent by email
            checkOverclockTimeGuard $GPU1 $timeCounter 2   #urgent downclocking
            if test $return_correction -eq 1; then
                controlTemp $GPU1 $tempGPU1 $target_temp
            fi
        fi
        # FAN Speed control
        controlFAN $GPU1 $tempGPU1 $FANGPU1
    fi
    #------------------------------ Overclocking control on GPU2 ----------------------------------
    if (test $mining_stopped -eq 0) && (test $miningIsActive2 -eq 1); then
        #The temperature is within the control margins, below target temp
        if (test $tempGPU2 -lt $target_temp) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0); then
            checkOverclockTimeGuard $GPU2 $timeCounter 0
            if test $return_correction -eq 1; then
                controlTemp $GPU2 $tempGPU2 $target_temp
            fi
        #The temperature is outside the control margins, below the alarm temp
        elif (test $tempGPU2 -lt $hightemp_alarm) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0); then
            checkOverclockTimeGuard $GPU2 $timeCounter 1   #downclocking
            if test $return_correction -eq 1; then
                controlTemp $GPU2 $tempGPU2 $target_temp
            fi
        # Overtemp alarm
        elif (test $tempGPU2 -lt $maxtemp_stop) && (test $miningIsActive2 -eq 1) && (test $mining_stopped -eq 0); then
            output "Alarm! GPU2 very hot, temperature: $tempGPU2. Performing urgent downclocking ...." 1   #Sent by email
            checkOverclockTimeGuard $GPU2 $timeCounter 2   #urgent downclocking
            if test $return_correction -eq 1; then
                controlTemp $GPU2 $tempGPU2 $target_temp
            fi
        fi
        # FAN Speed control
        controlFAN $GPU2 $tempGPU2 $FANGPU2
    fi
    timeCounter=$(expr $timeCounter + 1)
    sleep $control_time;
done
A brief description of the control script:
- You can change the minimum/maximum clock settings for each of your GPUs. I identified the limits manually, by making the system hang quite a few times.
- I also noticed that when playing with the limits, before the system hung, a mining process sometimes became a zombie (as seen with ps). In this situation I was not able to recover the process, not even by trying to kill the parent process... the only way was to restart the system. This control algorithm restarts the system when it finds zombie processes.
- You can play with all the constants to tune the script to your own system. Almost everything is configurable (number of retries, mail sending, debugging logs, halt/reboot commands, etc.).
- Replace the email addresses with yours.
- 1. The algorithm first obtains the GPU temperatures, CPU temperatures and current clock settings, checks whether the mining processes are active, etc.
- 2. If the CPU temperature is very high (above 70ºC), the script switches off the system and reports it by email. (This protects the system hardware from CPU overheating.)
- 3. It checks whether there are zombie processes (as discussed before). If so, the script can reboot the system and report it by email (depending on the constant reboot=1).
- 4. It checks whether any of the GPUs is not mining... if so, it retries the mining by starting the script gpux.sh. There is a retry limit of 5. If it is reached, this is also reported by email.
- 5. If a GPU reaches a very high temperature (above 83ºC), it stops all mining processes. After the temperature has recovered, it restarts the mining.
- 6. For each GPU, the script performs an automatic control of the GPU clock by overclocking, downclocking and urgent downclocking when needed.
- 7. For each GPU, the script performs an automatic control of the fan speed.
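To make steps 6 and 7 easier to follow, here is a minimal sketch of the two calculations on their own, without any aticonfig calls, so it can be tried on any machine. The function names next_clock, band and fan_for_temp are made up for this example; the thresholds and limits are the ones from control.sh. The clock moves by one MHz per degree of distance to the target temperature, stays unchanged inside the control gap, and is clamped to the allowed range. The fan speed goes up in steps, and inside each 2ºC band it keeps the higher step only if the fan is already there.

```shell
#!/bin/bash
# Clock correction sketch.
# Arguments: current_clock current_temp target_temp gap min_clock max_clock
next_clock() {
    local clk=$1 temp=$2 target=$3 gap=$4 min=$5 max=$6
    local offset=$((target - temp))
    # Below the target but closer than the gap: leave the clock alone.
    if [ "$offset" -gt 0 ] && [ "$offset" -lt "$gap" ]; then
        echo "$clk"
        return
    fi
    local demand=$((clk + offset))
    if [ "$demand" -gt "$max" ]; then demand=$max; fi
    if [ "$demand" -lt "$min" ]; then demand=$min; fi
    echo "$demand"
}

# band FAN LOW HIGH: inside a hysteresis band, stay on HIGH if the fan is
# already there, otherwise use LOW.
band() {
    if [ "$1" -eq "$3" ]; then echo "$3"; else echo "$2"; fi
}

# Fan speed sketch with bands around 55, 60, 65 and 70 degrees.
# Arguments: current_temp current_fan_percent
fan_for_temp() {
    local temp=$1 fan=$2
    if   [ "$temp" -lt 54 ]; then echo 30
    elif [ "$temp" -lt 56 ]; then band "$fan" 30 45
    elif [ "$temp" -lt 59 ]; then echo 45
    elif [ "$temp" -lt 61 ]; then band "$fan" 45 60
    elif [ "$temp" -lt 64 ]; then echo 60
    elif [ "$temp" -lt 66 ]; then band "$fan" 60 75
    elif [ "$temp" -lt 69 ]; then echo 75
    elif [ "$temp" -lt 71 ]; then band "$fan" 75 90
    else echo 90
    fi
}

# GPU0 at 900 MHz and 60 degrees, target 75, gap 3, limits 800..945
next_clock 900 60 75 3 800 945
# Same GPU at 79 degrees: the clock is lowered by 4 MHz
next_clock 900 79 75 3 800 945
# 55 degrees with the fan already at 45 %: it stays at 45 %
fan_for_temp 55 45
```

The band is what keeps the fan from flipping between two speeds when the temperature sits right on a threshold.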
As I did for the gpux.sh scripts, I like to launch the control script from another script, piping the output to the tee command in order to store the logs in a file. That way we can easily read them from the monitor script:
start_control.sh
#!/bin/bash
cd /home/your_user/scripts
./control.sh | tee control.log
Now, let's go to the monitor script:
monitor.sh
#!/bin/bash
while true; do
    echo "---------------- GPUs Health ----------------"
    aticonfig --adapter=0 --od-gettemperature | tail -n1 | awk '{print "GPU0 Temperature: " $5}' ;
    aticonfig --adapter=1 --od-gettemperature | tail -n1 | awk '{print "GPU1 Temperature: " $5}' ;
    aticonfig --adapter=2 --od-gettemperature | tail -n1 | awk '{print "GPU2 Temperature: " $5}' ;
    echo $(aticonfig --adapter=0 --odgc | grep GPU);
    echo $(aticonfig --adapter=1 --odgc | grep GPU);
    echo $(aticonfig --adapter=2 --odgc | grep GPU);
    echo "GPU FANS: $(DISPLAY=:0.0 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.1 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4) / $(DISPLAY=:0.2 aticonfig --pplib-cmd 'get fanspeed 0'|grep Result |cut -d ' ' -f 4)"
    echo "Overclocking...."
    echo "- Core Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 18) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 18) Mhz."
    echo "- Mem Clocks: $(aticonfig --adapter=0 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=1 --odgc |grep Clocks |cut -d ' ' -f 29) / $(aticonfig --adapter=2 --odgc |grep Clocks |cut -d ' ' -f 29) Mhz."
    #echo " "
    echo "---------------- PC health -------------------"
    echo $(sensors |grep CPU |grep Temperature) | cut -d ' ' -f 1,2,3
    echo $(sensors |grep NB |grep Temperature) | cut -d ' ' -f 1,2,3
    echo $(sensors |grep SB |grep Temperature) | cut -d ' ' -f 1,2,3
    echo "HDD Avail: $(df -h |grep sda1 |cut -d ' ' -f 20)"
    #echo " "
    echo "---------------- Mining rate ------------------"
    # Check if there are Screen lock files....
    IsMining_gpu0=$(ls /var/run/screen/S-your_user/ |grep gpu0 |wc -l)
    IsMining_gpu1=$(ls /var/run/screen/S-your_user/ |grep gpu1 |wc -l)
    IsMining_gpu2=$(ls /var/run/screen/S-your_user/ |grep gpu2 |wc -l)
    # Last hashrate report
    if test $IsMining_gpu0 -ge 1; then
        echo "Mining on GPU0: " $(cat mining_gpu0.log | cut -d '[' -f 2)
    else
        echo "Mining on GPU0: ERR! Mining process is stopped!"
    fi
    if test $IsMining_gpu1 -ge 1; then
        echo "Mining on GPU1: " $(cat mining_gpu1.log | cut -d '[' -f 2)
    else
        echo "Mining on GPU1: ERR! Mining process is stopped!"
    fi
    if test $IsMining_gpu2 -ge 1; then
        echo "Mining on GPU2: " $(cat mining_gpu2.log | cut -d '[' -f 2)
    else
        echo "Mining on GPU2: ERR! Mining process is stopped!"
    fi
    #Erase mining logs... next loop we will find only the hashrate.
    echo "" > mining_gpu0.log
    echo "" > mining_gpu1.log
    echo "" > mining_gpu2.log
    # Same screen directory as above (the original line pointed to another user's directory)
    controlIsActive=$(ls /var/run/screen/S-your_user/ |grep control |wc -l)
    #echo " "
    echo "---------- Logs Mining Controller -------------"
    # (the original was missing the space before -ge)
    if test $controlIsActive -ge 1; then
        tail -5 control.log
    else
        echo "Control algorithm is OFF"
    fi
    sleep 5;
    clear
done
A few notes for the monitor script:
- Note that the GPU logs are erased at each loop. The time between loops is 5 seconds, exactly the same as the display rate of poclbm.py. This way, each log always holds a single hashrate report when it is read. In addition, the logs will not grow forever.
- The monitor script also displays the last 5 lines of the control script logs, stored in the file control.log.
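The "read the report, then empty the log" idea from the notes above can be sketched as two small functions. This is only an illustration under assumptions: last_hashrate and report_and_reset are made-up names, and the log line format (the report follows a '[' character) is inferred from the cut -d '[' -f 2 used in monitor.sh.

```shell
#!/bin/bash
# Hypothetical helper: print the most recent hashrate report in a miner log.
# Assumes each report line looks like "... [402.743 MH/s (~458 MH/s)]".
last_hashrate() {
    local log_file=$1
    # Skip blank lines, keep only the newest line, then keep the text after '['.
    grep -v '^[[:space:]]*$' "$log_file" | tail -n 1 | cut -d '[' -f 2
}

# Hypothetical helper: print the report, then empty the log without deleting it,
# so the miner's tee can keep writing to the same file.
report_and_reset() {
    local log_file=$1
    last_hashrate "$log_file"
    : > "$log_file"
}

# Example use:
#   echo "Mining on GPU0: $(report_and_reset mining_gpu0.log)"
```

Taking only the last line means the display still shows a single report even if the miner and the monitor drift slightly out of step and two reports pile up in one loop.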
As before, I launch the monitor.sh script from start_monitor.sh:
start_monitor.sh
#!/bin/bash
cd /home/your_user/scripts
monitorIsRunning=$(ls /var/run/screen/S-your_user/ |grep monitor |wc -l)
if test $monitorIsRunning -ge 1
then
    echo "Monitor script is already running in another Screen. Getting attached..."
    screen -x monitor
else
    echo "Monitor script is not running. Starting..."
    /usr/bin/screen -admS monitor ./monitor.sh
fi
As you can see, this script works both for starting the monitor and for attaching to the screen in which the monitor is already running. I created a symbolic link to this file called "m" (see the ln command). Since then, all I need to do to monitor my rig is enter "m" in the console. (This is very convenient when accessing the rig from my mobile.)
This is the output of the monitor script (updated every 5 seconds):
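For anyone not familiar with ln, here is a minimal sketch of how such a shortcut could be created. make_shortcut is a made-up helper name and the example paths are placeholders. For "m" to work as a command from anywhere, the link has to live in a directory listed in your PATH, which is an assumption about your setup.

```shell
#!/bin/bash
# Hypothetical helper: create (or refresh) a short alias for a script
# by means of a symbolic link.
make_shortcut() {
    local target=$1
    local link_path=$2
    # -s: symbolic link, -f: replace an existing link,
    # -n: if link_path is already a link, replace it instead of following it.
    ln -sfn "$target" "$link_path"
}

# Example use (placeholder paths):
#   make_shortcut /home/your_user/scripts/start_monitor.sh /home/your_user/bin/m
#   chmod +x /home/your_user/scripts/start_monitor.sh
```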
---------------- GPUs Health ----------------
GPU0 Temperature: 57.50
GPU1 Temperature: 54.50
GPU2 Temperature: 57.50
GPU load : 98%
GPU load : 97%
GPU load : 98%
GPU FANS: 45% / 45% / 45%
Overclocking....
- Core Clocks: 945 / 955 / 1020 Mhz.
- Mem Clocks: 300 / 300 / 300 Mhz.
---------------- PC health -------------------
CPU Temperature: +31.0°C
NB Temperature: +43.0°C
SB Temperature: +31.0°C
HDD Avail: 2GB
---------------- Mining rate ------------------
Mining on GPU0: 402.743 MH/s (~458 MH/s)]
Mining on GPU1: 405.513 MH/s (~568 MH/s)]
Mining on GPU2: 435.513 MH/s (~598 MH/s)]
---------- Logs Mining Controller -------------
[Time: 18616 | 3ene15:50:30] New setting: FAN GPU0 to 30 %
[Time: 18617 | 3ene15:50:35] New setting: FAN GPU0 to 45 %
[Time: 18618 | 3ene15:50:40] New setting: FAN GPU1 to 45 %
[Time: 18783 | 3ene16:05:29] New setting: FAN GPU1 to 30 %
[Time: 18787 | 3ene16:05:51] New setting: FAN GPU1 to 45 %
Now we can complete the start.sh script by adding the control and monitor scripts:
start.sh
#!/bin/bash
cd /home/your_path/scripts
echo Starting mining scripts...
/usr/bin/screen -admS gpu0 ./gpu0.sh
/usr/bin/screen -admS gpu1 ./gpu1.sh
/usr/bin/screen -admS gpu2 ./gpu2.sh
echo Starting monitor script...
/usr/bin/screen -admS monitor ./monitor.sh
echo Starting automatic control script...
/usr/bin/screen -admS control ./start_control.sh
echo " "
echo For monitoring the RIG, enter m.
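One thing to watch: as far as I know, screen does not refuse duplicate session names, so running start.sh twice would start a second set of sessions. A possible guard is sketched below. session_exists is a made-up helper name, and it relies on the assumption that screen names its socket files "PID.sessionname" inside the per-user directory already used by the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a screen socket named "<pid>.<name>"
# exists in the given directory. Unlike "ls | grep name", this matches
# the session name exactly rather than as a substring.
session_exists() {
    local socket_dir=$1
    local name=$2
    local entry
    for entry in "$socket_dir"/*."$name"; do
        # If nothing matched, the pattern stays literal and -e is false.
        [ -e "$entry" ] && return 0
    done
    return 1
}

# Example use (same directory as in the scripts above):
#   if ! session_exists /var/run/screen/S-your_user gpu0; then
#       /usr/bin/screen -admS gpu0 ./gpu0.sh
#   fi
```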
And of course, we will need a stop.sh script for stopping all the mining, monitor and control scripts:
stop.sh
#!/bin/bash
screen -X -S gpu0 kill
screen -X -S gpu1 kill
screen -X -S gpu2 kill
screen -X -S monitor kill
screen -X -S control kill
killall screen
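Note that the final killall screen also ends any other screen sessions you may have open, not only the mining ones. If that is a problem, a loop over the session names is one alternative. This is a sketch only: stop_sessions and the RUNNER variable are made-up names added for illustration (RUNNER=echo gives a dry run that just prints the commands).

```shell
#!/bin/bash
# Hypothetical variant of stop.sh: stop only the sessions named on the
# command line and leave any unrelated screen sessions alone.
stop_sessions() {
    # RUNNER is an illustrative knob: empty runs the commands,
    # RUNNER=echo prints them instead.
    local runner=${RUNNER:-}
    local name
    for name in "$@"; do
        $runner screen -X -S "$name" kill
    done
}

# Example use:
#   stop_sessions gpu0 gpu1 gpu2 monitor control
#   RUNNER=echo stop_sessions gpu0 monitor    # dry run
```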
That's all folks!!
I hope you liked this post, and that it will be useful for your mining systems!!
If you liked this post and want to send me a donation, I will be very grateful, and it will give me energy for sharing other works. BTC Address: 1NKJuhGCx7HM2skXdzAkfnxJyfsubh475A
|