Bitcoin Forum
21  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 16, 2017, 06:31:13 PM
Guys what is SRR option?

also when i set minimum temp=25 or lower, it wont go lower than 30

(using v0019 with updating)

SRR = Simple Rig Resetter
It's a little board that connects to the power/reset pins on your motherboard. It listens for heartbeats from specific rigs; when a rig's heartbeat stops, it means the rig froze, and the board hard-resets that specific rig.

https://simplemining.net/download/SRR/PDF/SRR-manual-2017-02-10.pdf

I've been using one for a few months, saved me lots of time at the beginning.
Does Linux ever hard freeze?
Been using ubuntu since 2006,
Got lucky and have never experienced it.

Linux is very stable, but it won't protect you from hardware issues: if you OC the cards too much, any OS will freeze. Updates, drivers etc. can also cause issues.
Use cases:
Remote rigs - it can correct a remote issue in minutes, before you're even aware of it.
USB failure issues - I've had USB issues with the Asrock BTC+ 110H boards where sometimes they don't recognize the USB and reboot into BIOS... this cycles the power and the USB is seen again.
I've had rigs hang when OC'ing cards too hard, or from heat buildup during summer at a remote site.
The next mobo manufacturer that includes this functionality in their mining mobos is gonna do well; all server-class machines already have similar capabilities for the same reasons.



22  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 16, 2017, 05:17:00 PM
Guys what is SRR option?

also when i set minimum temp=25 or lower, it wont go lower than 30

(using v0019 with updating)

SRR = Simple Rig Resetter
It's a little board that connects to the power/reset pins on your motherboard. It listens for heartbeats from specific rigs; when a rig's heartbeat stops, it means the rig froze, and the board hard-resets that specific rig.

https://simplemining.net/download/SRR/PDF/SRR-manual-2017-02-10.pdf

I've been using one for a few months, saved me lots of time at the beginning.
23  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 14, 2017, 11:37:11 PM


Cool, thanks! I'll look again in the previous pages to catch up with this discussion.

Sorry, it wasn't in this discussion, but on bitcointalk.org somewhere. You'll have to have a google to find it! Regardless, I think they're still working on full zpool integration into nvOC. But anyways, I don't know anything as I don't use any of these services.

https://bitcointalk.org/index.php?topic=2067256.0


So there is an interesting comment in that thread: the tests were done while altcoins were rising, and because zpool/mph lag in conversion they performed better than nicehash with its "instant" conversion to bitcoin. In a down market, nicehash may get better results (?) ...something to think about.
 
24  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 11, 2017, 11:33:07 AM

Have you set cpu mining to off ?

yep, for sure

looks like 3main selects only one GPU, for a reason I don't know

This is due to a problem with the 3main implementation that wi$em@n found but I haven't fixed yet.  You can work around it manually for now by finding this area in 3main:

Code:
if [ $COIN == "XMR" ]
then
HCD='/home/m1/xmr/stakGPU/bin/xmr-stak-nvidia'
ADDR="$XMR_ADDRESS.$XMR_WORKER"

cat <<EOF >/home/m1/xmr/stakGPU/bin/config.txt

"gpu_threads_conf" : [
  { "index" : 0,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
],

"use_tls" : false,
"tls_secure_algo" : true,
"tls_fingerprint" : "",

"pool_address" : "$XMR_POOL",
"wallet_address" : "$ADDR",
"pool_password" : "x",

"call_timeout" : 10,
"retry_time" : 10,
"giveup_limit" : 0,

"verbose_level" : 4,

"h_print_time" : 60,

"output_file" : "",

"httpd_port" : 0,

"prefer_ipv4" : true
EOF

cd /home/m1/xmr/stakGPU/bin

screen -dmS miner $HCD

if [ $LOCALorREMOTE == "LOCAL" ]
then
screen -r miner
fi

and adding an additional:

"gpu_threads_conf" : [
  { "index" : 0,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
  { "index" : 1,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
  { "index" : 2,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
  { "index" : 3,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
],

index block per GPU to "gpu_threads_conf".

The above would be for 4x GPUs.
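For rigs with more cards, typing one block per GPU by hand gets tedious. Here is a minimal sketch that prints the same block once per GPU; `gen_gpu_threads_conf` is a hypothetical helper (it is not part of 3main), and the thread/block values are simply the ones quoted above. On a rig the count could come from `nvidia-smi -L | wc -l`.

```bash
#!/bin/bash
# Hypothetical helper, not part of nvOC: print a "gpu_threads_conf" section
# with one index block per GPU, using the values from the post above.
gen_gpu_threads_conf() {
  local count=$1 i
  echo '"gpu_threads_conf" : ['
  for ((i = 0; i < count; i++)); do
    cat <<EOF
  { "index" : $i,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
EOF
  done
  echo '],'
}

# On a rig: gen_gpu_threads_conf "$(nvidia-smi -L | wc -l)"
gen_gpu_threads_conf 4
```

The output could be pasted into the heredoc in 3main in place of the single-GPU section.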

thanks a lot Fullzero

I'm now able to use my 7x GPU with your modification

unfortunately watchdog is not working anymore, as it says that GPU utilisation is too low

not a big deal

thanks

I will look into a watchdog conflict with multiple gpus while using xmr-stak.

I noticed that xmr-stak takes a while to load up all the GPUs with work; on 13-card systems this can take up to 3-4 minutes. It also depends on the thread/block count in the xmr-stak config.txt: the settings from fullzero load under 2G of data, but you can tweak those numbers higher, resulting in higher hash rates but also a longer initialization time.
Once stak loads all the cards, and assuming your OC settings are stable, it will behave itself for days. I meant to modify watchdog to recognize that stak is running and increase the initialization time appropriately, but I still need to migrate to v0019. For now, if you want to run watchdog, change the initial timeout to 360 seconds; that should do the trick.
Once all the cards are loaded, stak will spit out a bunch of submits to the pool in one shot.

I've seen an issue with WATCHDOG while using the keccak algo (for MAX COIN) on MPH: the GPU utilization for this algo is always around 90, so WATCHDOG keeps restarting the miner, and eventually the whole rig, based on that usage!

I remember one of the users complained about keccak and asked how to disable it (probably because of this low-utilization issue).

Yes I remember seeing a post like that, I think someone also answered that this can be remedied by lowering the threshold in watchdog which should work.
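Both workarounds in this exchange (a longer initial timeout for xmr-stak, a lower utilization threshold for keccak) amount to picking watchdog settings per coin. A rough sketch of that idea follows; `watchdog_settings` is a hypothetical helper, the `KECCAK` label, the 120-second default and the threshold of 80 are assumptions for illustration, and only the 360 seconds and the 90% figure come from the posts above.

```bash
#!/bin/bash
# Hypothetical helper, not part of the nvOC watchdog.
# Prints "<warm-up seconds> <minimum GPU utilization %>" for a coin/algo label.
watchdog_settings() {
  case "$1" in
    XMR)    echo "360 90" ;;  # xmr-stak needs minutes to load all cards
    KECCAK) echo "120 80" ;;  # keccak sits around 90 %, so use a lower bar (80 is a guess)
    *)      echo "120 90" ;;  # assumed defaults
  esac
}

# Example: read both values into variables a watchdog loop could use.
read -r WARMUP MIN_UTIL < <(watchdog_settings "XMR")
echo "warm-up=${WARMUP}s min_util=${MIN_UTIL}%"
```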
25  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 10, 2017, 06:57:31 PM

Have you set cpu mining to off ?

yep, for sure

looks like 3main selects only one GPU, for a reason I don't know

This is due to a problem with the 3main implementation that wi$em@n found but I haven't fixed yet.  You can work around it manually for now by finding this area in 3main:

Code:
if [ $COIN == "XMR" ]
then
HCD='/home/m1/xmr/stakGPU/bin/xmr-stak-nvidia'
ADDR="$XMR_ADDRESS.$XMR_WORKER"

cat <<EOF >/home/m1/xmr/stakGPU/bin/config.txt

"gpu_threads_conf" : [
  { "index" : 0,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
],

"use_tls" : false,
"tls_secure_algo" : true,
"tls_fingerprint" : "",

"pool_address" : "$XMR_POOL",
"wallet_address" : "$ADDR",
"pool_password" : "x",

"call_timeout" : 10,
"retry_time" : 10,
"giveup_limit" : 0,

"verbose_level" : 4,

"h_print_time" : 60,

"output_file" : "",

"httpd_port" : 0,

"prefer_ipv4" : true
EOF

cd /home/m1/xmr/stakGPU/bin

screen -dmS miner $HCD

if [ $LOCALorREMOTE == "LOCAL" ]
then
screen -r miner
fi

and adding an additional:

"gpu_threads_conf" : [
  { "index" : 0,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
  { "index" : 1,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
  { "index" : 2,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
  { "index" : 3,
    "threads" : 32, "blocks" : 18,
    "bfactor" : 8, "bsleep" :  10,
    "affine_to_cpu" : false,
  },
],

index block per GPU to "gpu_threads_conf".

The above would be for 4x GPUs.

thanks a lot Fullzero

I'm now able to use my 7x GPU with your modification

unfortunately watchdog is not working anymore, as it says that GPU utilisation is too low

not a big deal

thanks

I will look into a watchdog conflict with multiple gpus while using xmr-stak.

I noticed that xmr-stak takes a while to load up all the GPUs with work; on 13-card systems this can take up to 3-4 minutes. It also depends on the thread/block count in the xmr-stak config.txt: the settings from fullzero load under 2G of data, but you can tweak those numbers higher, resulting in higher hash rates but also a longer initialization time.
Once stak loads all the cards, and assuming your OC settings are stable, it will behave itself for days. I meant to modify watchdog to recognize that stak is running and increase the initialization time appropriately, but I still need to migrate to v0019. For now, if you want to run watchdog, change the initial timeout to 360 seconds; that should do the trick.
Once all the cards are loaded, stak will spit out a bunch of submits to the pool in one shot.
26  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 06, 2017, 07:50:54 PM
Hi All, quick question, anyone try connecting over eth-proxy (eth-proxy.py) to nicehash?
If so, have you had any issues passing the rig id over the proxy to nicehash (eth-proxy.py uses "/" while nicehash likes ".")

PS
I am aware of the nicehash integration in nvOC and the available stratum proxies in the included miners, however, I am investigating this particular path at the moment.

I don't remember using/seeing such thing and don't recall any other asking for it!

Do you mind sharing bit more detailed info?

It's a stratum proxy provided by dwarfpool: https://github.com/Atrides/eth-proxy

I found that Genoil's -S option gives rejected submits, but continues mining and watchdog does not catch it as an error.
I have a way of capturing this by logging the stdout and grepping for "rejected" but instead of modding watchdog, I tried setting up my own stratum proxy and use Genoil in Farm mode  (-F) - this way has shown to be stable and captures all the error conditions with watchdog + I can confirm roughly a 10% increase in valid submits vs the other methods, an added bonus is effective failover.
So now I want to move this setup to nicehash, the advantage of repointing the proxy is that I don't need to reconfig and rebackup all my rigs.
It should work without passing the rig ID, but I'd like to still keep individual stats so I was wondering if anyone tried it already.
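The "log stdout and grep for rejected" idea mentioned above could look roughly like the following. This is only a sketch: `count_rejected` and `too_many_rejects` are hypothetical helpers, and it assumes the miner's output is already being written to a log file in which rejected submits contain the word "rejected".

```bash
#!/bin/bash
# Hypothetical helpers; the log format is an assumption.

# Print how many lines of the log mention a rejected share (case-insensitive).
count_rejected() {
  grep -ci 'rejected' "$1"
}

# Succeed (exit 0) when the log holds more rejects than the given limit.
too_many_rejects() {
  local n
  n=$(count_rejected "$1")
  [ "$n" -gt "$2" ]
}

# Demo on a throwaway log with made-up lines.
log=$(mktemp)
printf '%s\n' 'share accepted' 'share REJECTED' 'share rejected' > "$log"
count_rejected "$log"
rm -f "$log"
```

A watchdog loop could call `too_many_rejects` on each pass and restart the miner when it succeeds.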

27  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: September 06, 2017, 04:20:09 PM
Hi All, quick question, anyone try connecting over eth-proxy (eth-proxy.py) to nicehash?
If so, have you had any issues passing the rig id over the proxy to nicehash (eth-proxy.py uses "/" while nicehash likes ".")

PS
I am aware of the nicehash integration in nvOC and the available stratum proxies in the included miners, however, I am investigating this particular path at the moment.
28  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0019 on: August 28, 2017, 12:20:24 PM
v0019 is up on Google Drive see the OP for the Link:
[...snip...]


Thanks fullzero for the xmr-nvidia link and the v0019!

For anyone trying to optimize their GPUs for xmr with xmr-stak-nvidia, here is a useful reddit thread with a Google Doc link where performance tests at various thread settings have been documented:

https://www.reddit.com/r/MoneroMining/comments/6bqk0m/4392_xmrstaknvidia_configuration_tests_so_far/

29  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0018 on: August 12, 2017, 05:13:41 PM
IAmNotAJeep, I thought that quoting everything was going to create a mess.

Have you been using 2 ATX PSUs, or was one a server PSU like in my setup? I don't have quick access to new PSUs, so swapping them is going to take about 5 days to a week. It's very strange how this error seems to appear at completely unpredictable times. I also worked out a strange thing: when I disconnect the PSUs from each other, it seems more stable than with the relay connector synchronizing them. When they are connected it takes a couple of minutes to freeze; otherwise it can work for a couple of hours, even with more than 8 cards (using 11 atm; for some reason it seems impossible to plug in all 12 without something crashing).
Yes those relays are not very stable, out of 3 of them only one seems to work reliably for me.
I can't speak to the actual causes of why it crashes since there are many variables (PSU, motherboard/bios/controllers, GPU manufacturers etc) but after battling with this issue for a few weeks I'm running stable with only one dual PSU system.  
That system has all the cards on one PSU and the CPU/Board/PCI-bus molexes on the other PSU.
30  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0018 on: August 12, 2017, 12:02:40 PM
Hey Guys,

Have a problem with loading the OS as it tells me "xorg PROBLEM DETECTED" and then reboots and shows:
error: unknown filesystem
grub rescue>


What can it be and how can I solve this? Used flashing tools as described and tried it at least twice. I am using ASrock h110 and at the moment just one Manli P106-100 card just so I can test if I can install the OS before installing all 13 cards.



I need one or two of the:

P106-100

to test and ensure nvOC will properly support these GPUs.  A number of members have had problems using these GPUs.  If someone is willing to sell me 1 or preferably 2 Please pm me.



I solved this issue by editing the line "XORG FAIL" to XORG "OK", which worked, and the system started. Now I have a different problem:

When I run more than 6-7 cards on the board, no matter which miner, I either get error 15 "cannot get current temperature" or the system just freezes after about half an hour or a bit longer, and the only thing that helps is to turn the power socket off and on... I have updated the drivers and am now waiting for it to freeze on 6 cards, since that has worked the longest of all the setups I tried. As soon as I try 10 or more GPUs I either get error 15 or a frozen system. I don't think it's one of the cards, since I swapped them around a couple of times, and what I noticed is that it crashes faster as I add more GPUs.

Do you have any idea where the problem can be? I have also 2 PSUs connected together one 600W Zalmann powering motherboard and 2400W HP server PSU which was remade for GPU mining and powers risers and GPUs, both of those seem to be working fine , so I have no idea where to look for solution.

If anyone has had this problem, please let me know; it would be greatly appreciated... I'm even willing to donate to somebody whose advice helps.

PS. The cards are in stock mode, so it's not OC.

UPD Been pretty stable on 6 cards , have been running for 8 hours now , probably will try and add a couple after a 16-20 hour mark...

If you have another rig, I would try testing each of the GPUs and verifying they work individually; if they all do, then I would get a pico for your server PSU and try using only the server PSU.  If you are using an ATX PSU for the mobo and a server PSU for everything else without joining them, this can cause problems.

Thanks for your answer.

Unfortunately this is the test rig and I don't have another rig to try each card by itself.  I've been swapping cards in different orders, and the error seems to come up with different cards all the time and at different intervals: for example half an hour, just a couple of minutes, or even more than 10 hours. I've also tried different risers at different times; they seem to have no effect on when this error comes up. The PSUs are connected with the board adapter, and I've tested each pin connector on each PSU with a voltmeter with no problem detected. Sometimes the system freezes straight away and sometimes it takes a couple of hours, which as I said doesn't seem to depend on the setup, as I've tried around 50 different rotations of risers/video cards. With the same cards and risers the error can come up after different amounts of time. I also can't use just the one server PSU, as it doesn't have 2 molexes to connect to the motherboard for additional power supply, only PICO and 8-pin for CPU power. I tried unplugging the connector between the 2 PSUs, and again it had no effect on when this error comes up.

Could this be a software-related problem, maybe drivers or something else? Also, can you advise on overclocking these cards? None of the methods I could find (trying to attach a "fake" monitor to each GPU using the console, and tweaking the system) made any difference to the OC.

PS. The PSUs are connected to one power plug adapter which turns them on simultaneously anyway, so there shouldn't be any synchronization problems, as the server PSU turns on at the same time as the ATX one through the WIFI plug I have.

UPD. Been running for about 8-9 hours with 7 cards and still froze after that time... Not really sure what went wrong again as I wasn't present at that time but I suspect it was the same error again.

I experienced something similar when I was building 2-PSU rigs, except in my case it would be the same GPUs causing the problems. If I took all the problematic cards out and built a smaller rig (1 PSU) just out of the "unstable" GPUs, they gave no issues and were stable for days. My conclusion was that the problem was the 2-PSU setups. In the end I managed to get one of the 2 double-PSU rigs stable with 11 cards; the other I simply swapped for one 1200W PSU and the problems went away. All the other 1-PSU rigs that came into being with what I thought were defective cards are stable and "set it and forget it".
31  Alternate cryptocurrencies / Mining (Altcoins) / Re: Remove non-existing swap reference on: July 31, 2017, 06:29:19 PM
Sorry if this was already fixed or is a duplicate report. Didn't read the thread for a long time.

I found that my nvOC system boots too long (around 2 minutes). The cause of that is a swap partition in /etc/fstab that does not exist on USB drive (was /dev/sda5 at install time). After commenting it out a 120 seconds swap disk mount wait time has gone.
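For anyone who wants to apply the same fix without editing by hand, here is a minimal sketch that comments out active swap entries. `comment_out_swap` is a hypothetical helper; it only prints the modified text, so it can be tried on a copy of /etc/fstab first, and the demo entries are made up.

```bash
#!/bin/bash
# Hypothetical helper: print the given fstab with every active (uncommented)
# entry whose filesystem type field is "swap" prefixed with '#'.
comment_out_swap() {
  sed -E 's/^([^#[:space:]][^[:space:]]*[[:space:]]+[^[:space:]]+[[:space:]]+swap[[:space:]].*)$/#\1/' "$1"
}

# Demo on a throwaway file; the entries are made-up examples.
demo=$(mktemp)
printf '%s\n' \
  'UUID=1234 / ext4 errors=remount-ro 0 1' \
  '/dev/sda5 none swap sw 0 0' > "$demo"
comment_out_swap "$demo"
rm -f "$demo"
```

Review the printed result before replacing the real file, and keep a backup of the original.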


nice catch!
32  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017 on: July 11, 2017, 11:39:43 PM
[...snip...]
I can add https://github.com/fireice-uk/xmr-stak-cpu

I will add it to the list once I can update the OP again.

The easiest way to do this would be to take a copy of oneBash; rename it, remove everything but the OC settings and implementation.  Then you can run that renamed bashfile whenever to change clocks.



Actually xmr-stak-cpu is a great idea! It's super stable - I added it a while ago to one image of nvOC where I'm also CPU mining, and it plays really nicely with everything else on the distro.
While at it, I would suggest adding xmr-stak-nvidia. I meant to test it when I had some cycles to spare, but since xmr-stak-cpu is on the radar, maybe we can also add its nvidia cousin?


I will add the GPU variant as well.


I keep getting a database error when I try to update the OP:


plusWATCHDOG_oneBash + additional files (includes newest SRR,  switch_v3, reboot, AutoTEMP, Watchdog, Claymore 9.7) Link


I integrated a slightly modified IAmNotAJeep_and_Maxximus007_WATCHDOG, fixed the typo in Maxximus007_AUTO_TEMPERATURE_CONTROL.


saflter your newest version of switch was causing problems when run with a monitor connected (LOCAL); I would recommend relying on the:

IAmNotAJeep_and_Maxximus007_WATCHDOG

to handle miner crashes / 0 hashrates. 

I spent a couple hours testing this, and it is very effective; it is worth noting that it currently only works when the mining process is launched in a screen ( I will make it work for all the clients even when run locally soon: so don't spend a lot of time upgrading rigs with this)

Also, even if your crashes are perfectly handled: if your OC is so high that it crashes every 7 minutes or less, you are losing more time restarting the mining process than you are gaining with a slightly higher hashrate.

Use reasonable OC.

Please provide me with:

# IAmNotAJeep BTC address:  <not yet provided>

# Maxximus007 BTC address:  <not yet provided>

# _Parallax_ BTC address:  <not yet provided>



It's great to see everyone get involved and speaking for myself, to feel like it's OK to contribute once in a while as well.
Hats off to fullzero, Maxximus007 and _Parallax_!

Here you go:
# IAmNotAJeep BTC address:  <13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7>

I am reading the discussion between you and Maxximus007 now.


An unintended consequence of the watchdog script will be that it will keep rebooting the miners if there is no internet connection.
Guess how I know that one lol
33  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017 on: July 11, 2017, 11:01:00 PM

I keep getting a database error when I try to update the OP:


plusWATCHDOG_oneBash + additional files (includes newest SRR,  switch_v3, reboot, AutoTEMP, Watchdog, Claymore 9.7) Link


I integrated a slightly modified IAmNotAJeep_and_Maxximus007_WATCHDOG, fixed the typo in Maxximus007_AUTO_TEMPERATURE_CONTROL.


saflter your newest version of switch was causing problems when run with a monitor connected (LOCAL); I would recommend relying on the:

IAmNotAJeep_and_Maxximus007_WATCHDOG

to handle miner crashes / 0 hashrates. 

I spent a couple hours testing this, and it is very effective; it is worth noting that it currently only works when the mining process is launched in a screen ( I will make it work for all the clients even when run locally soon: so don't spend a lot of time upgrading rigs with this)

Also, even if your crashes are perfectly handled: if your OC is so high that it crashes every 7 minutes or less, you are losing more time restarting the mining process than you are gaining with a slightly higher hashrate.

Use reasonable OC.

Please provide me with:

# IAmNotAJeep BTC address:  <not yet provided>

# Maxximus007 BTC address:  <not yet provided>

# _Parallax_ BTC address:  <not yet provided>



It's great to see everyone get involved and speaking for myself, to feel like it's OK to contribute once in a while as well.
Hats off to fullzero, Maxximus007 and _Parallax_!

Here you go:
# IAmNotAJeep BTC address:  <13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7>

34  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017 on: July 11, 2017, 10:54:35 PM
[...snip...]
I can add https://github.com/fireice-uk/xmr-stak-cpu

I will add it to the list once I can update the OP again.

The easiest way to do this would be to take a copy of oneBash; rename it, remove everything but the OC settings and implementation.  Then you can run that renamed bashfile whenever to change clocks.



Actually xmr-stak-cpu is a great idea! It's super stable - I added it a while ago to one image of nvOC where I'm also CPU mining, and it plays really nicely with everything else on the distro.
While at it, I would suggest adding xmr-stak-nvidia. I meant to test it when I had some cycles to spare, but since xmr-stak-cpu is on the radar, maybe we can also add its nvidia cousin?

 
35  Alternate cryptocurrencies / Mining (Altcoins) / Re: Softcrash watchdog on: July 11, 2017, 06:15:23 PM
Hey fullzero, i have a question,

without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it, everything gets stuck, SSH barely works, average system load jumps to 14.5!! and Xorg takes up 100% of the CPU, its so bad that none of the standard reboot commands work, they just do nothing, the only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger" so i've set up a script that checks the average system load and if its over 2 it uses the command to reboot, and it works, but i dont like this "solution", yesterday after a reboot nvOC got corrupted somehow, lost my customized oneBash and the whole system became read-only (thankfully i had a oneBash backup that was only a few days behind).

So the question is: what can I do to relieve this Xorg error? I run a 7-card rig and never plan on going for a higher number; what can I do with Xorg that would fix this?

Thanks.

@ tempgoga

It seems that whenever a soft crash occurs most of the cards drop to zero, so while the display/keyboard is unresponsive you can catch the soft crash from nvidia-smi. The script below checks card utilization, if it drops below 90% it counts down a minute and if mining hasn't resumed it reboots the system.
This seems to have worked at least once in my case (only got one soft crash this weekend) and the system recovered as expected.
the threshold values work for my setup but others may find different values optimal

Also, if anyone knows a way to iterate the if && statements, we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards", but the way below also works, with manual editing to adjust the watchdog for the number of cards in your individual system.
___________
 
#!/bin/bash
#m1
threshold=90
# Countdown of 12 checks x 5 seconds = 1 minute before rebooting.
# Initialized here so a low first reading does not reboot immediately.
i=12
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave as is; for a different number of cards remove or add the && statements as needed, as in the example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ $i -le 0 ]
         then echo $(date) REBOOT due to soft crash >>~/watchdog.log
         # "sleep -5" is not a valid duration; wait 5 seconds so the log line is written
         sleep 5
         sudo shutdown now -r
        fi
done
___________
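On the question of iterating the "if &&" chain: since the readings are already in the positional parameters after `set -- $number`, one way is to loop over them and fail on the first low value. A minimal sketch follows; `all_above` is a hypothetical helper name, and the numbers in the demo call are made up.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if every reading is above the threshold.
# Usage: all_above <threshold> <util1> <util2> ...
all_above() {
  local threshold=$1 util
  shift
  [ "$#" -gt 0 ] || return 1   # no readings at all counts as a failure
  for util in "$@"; do
    [ "$util" -gt "$threshold" ] || return 1
  done
  return 0
}

# Demo with made-up readings for a 5-card rig.
if all_above 90 98 99 97 100 96; then echo OK; else echo LOW; fi
```

In the script above, the whole chain could then become `if all_above "$threshold" "$@"`, which works for any number of cards.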

Hey thats funny I just made a script doing something similar, although it checks the powerdraw.
Here it is:
Code:
#!/bin/bash

# Miner restart script V001
# By Maxximus007
# for nvOC by fullzero
#
# POWERLIMIT MUST BE SET IN oneBash

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
# Creating a log file to record restarts (defined before first use)
LOG_FILE="/home/m1/restartlog.txt"
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT_LOW_POWER=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read POWERDRAW POWERLIMIT; } < <( nvidia-smi -i $gpu --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)

  let POWER_DIFF=$( printf "%.0f" $POWERLIMIT )-$( printf "%.0f" $POWERDRAW )

  # If current draw is 30 Watt lower than the limit count them:
  if [ "$POWER_DIFF" -gt "30" ]
  then
    let COUNT_LOW_POWER=COUNT_LOW_POWER+1
  fi

  let gpu=gpu+1
done

if [ $COUNT_LOW_POWER -eq $GPUS ]
then
  echo "$(date) - Power draw is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

You can combine the above with your code, and find the utilization like this:
Code:
nvidia-smi -i 1 --query-gpu=utilization.gpu --format=csv,noheader,nounits
You have to iterate the GPU, starting at 0 to get them all
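As an alternative to calling nvidia-smi once per index: when `-i` is left out, the same query prints one line per GPU, so the readings can be processed in a single pass. A small sketch, where `count_low_util` is a hypothetical helper and the demo values are made up:

```bash
#!/bin/bash
# Hypothetical helper: read one utilization value per line on stdin and
# print how many of them are below the given minimum.
count_low_util() {
  local min=$1 util low=0
  while read -r util; do
    [ -z "$util" ] && continue          # skip blank lines
    if [ "$util" -lt "$min" ]; then
      low=$((low + 1))
    fi
  done
  echo "$low"
}

# On a rig:
#   nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | count_low_util 90
# Demo with made-up readings:
printf '%s\n' 99 100 12 0 | count_low_util 90
```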
Okay I've combined the two, perhaps this will work for most of us:
Code:
#!/bin/bash

# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
#

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
# Creating a log file to record restarts (defined before first use)
LOG_FILE="/home/m1/restartlog.txt"
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi

MIN_UTIL=90
RESTART=0

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read UTIL; } < <( nvidia-smi -i $gpu --query-gpu=utilization.gpu --format=csv,noheader,nounits)

  let UTILIZATION=$( printf "%.0f" $UTIL )

  # If current utilizations lower than the limit count them:
  if [ $UTILIZATION -lt $MIN_UTIL ]
  then
    let COUNT=COUNT+1
  fi

  let gpu=gpu+1
done

if [ $COUNT -eq $GPUS ]
then
  if [ $RESTART -gt 1 ]
  then
    echo "$(date) - Utilization is too low: reviving did not work so restarting system" | tee -a ${LOG_FILE}
    sudo shutdown now -r
  fi
  echo "$(date) - Utilization is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
  let RESTART=RESTART+1
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

Pretty cool!  I'll try it tonight, lets hope this put the softcrash issues behind us.


I will try this out as well; good work.

@ Maxximus007
Thanks for putting these together, great collab!
I'm not a bash expert, so maybe I'm reading this wrong, but here are some thoughts.
The combined code seems to evaluate each GPU individually for the fault condition, which means that if one fails and you have, say, 5 other cards working, it keeps going until all the cards give reduced output, since all of them have to fail individually to increment the counter? So if 5/6 fail we keep going? (Again, I'm just tracing it in my head, so maybe I'm reading it wrong.)
The way I was thinking about it, is that I wanted all the cards to work at above 90% efficiency and reboot as soon as any card strays beyond the threshold - this is why I did the "if and" statement and didn't iterate though "if" statements alone (I didn't know how to iterate "if and" based on an unknown number of cards lol). I had a version giving 6xOK and such but I think it's more efficient to just get 1xOK if ALL meet the 90% criteria and start the countdown as soon as anything is out of norm - and if the miner recovers, flush the counter. I observed a number of these conditions with Claymore where it recovers half the time, but then eventually craps out and the script kicks in. I haven't seen it on my Genoil rig yet since my other script has kept it in check without any softcrash for day 3 now.

A thought about the power draw as threshold measure - it is power limit/card specific and I guess people would need to tune their power threshold to their power limit so I agree it's best to use gpu util. (My cards are at 82W limit for example).
Thoughts?


  
The code checks each card individually. At times (with Claymore, not Genoil) I've seen Util (or Powerdraw) drop, maybe even below 90, for a few seconds. In order not to generate too many restarts, I check all cards. We can lower this, or make it so that each of us can decide when it should reboot.
I've combined the restart/reboot so that the first attempt is to restart the miner. If that doesn't work, we reboot the machine. We might want to reset the reboot counter after a while, so we don't lose time with a full reboot.

In the first code I checked Powerdraw -> if 30 Watt less than Powerlimit there might be something wrong. Idling cards use around 10 Watt, so that works for all I think. We can combine this with Util if that helps.

So sure we can make it more advanced, we just have to determine the right parameters. Hope others can let us know in what circumstances they see hanging miners. Just one card, or more or everything? Is Util back to zero? or hanging on to 100%?




OK thanks for the clarification, it's really neat and rewarding to see different approaches to this problem :D
Here is why I coded to test that all the cards meet the threshold as one with "if &&": as an example I'll use an event from my test rig overnight: one card dropped, the "if &&" script waited one minute for Claymore to recover, then rebooted the system and that was that.
Total down time, 2 mins, if you add the 1 minute of reduced capacity waiting for the miner to right itself, 3 minutes impact.

The "if &&" code tests for a graceful miner recovery by continuing to test the cards for above-threshold utilization for 60 seconds after it detects a fault.
If the miner recovers, but just sits there (saw both Claymore/Genoil do exactly that a number of times) that's not good enough and the system gets a boot.
My other miner restart script did not handle this exact case and once every few days I would find the miner sitting pretty and blowing bubbles mining on one or two cards until I noticed because it did not "see" all the cards anymore but it did see some so it thought it "recovered".

If the miner recovers properly, all cards need to hit above threshold and we can flush the counter and life goes on.
On my test rig, graceful miner recovery occurred 5-6 times in the past 24 hours without prompting a restart - which is desirable above either running at reduced capacity or 5-6 reboots (IMHO).
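
The countdown-and-flush behaviour described here can be sketched as one small function. The name tick is hypothetical; the default of 12 matches the i=12 in the posted watchdog (12 checks at 5 seconds each is about one minute).

```shell
#!/bin/bash
# tick COUNTER HEALTHY [START]
# HEALTHY is 1 when every card is above the threshold, 0 otherwise.
# Prints START when healthy (flush the counter), otherwise COUNTER minus one.
# (Hypothetical helper for illustration only.)
tick() {
  local counter=$1 healthy=$2 start=${3:-12}
  if [ "$healthy" -eq 1 ]; then
    echo "$start"
  else
    echo $((counter - 1))
  fi
}

# Possible use in the loop:
#   i=$(tick "$i" "$HEALTHY")
#   [ "$i" -le 0 ] && sudo shutdown -r now
```

A miner that recovers within the minute brings the counter back to 12, so short dips do not add up to a reboot.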

In contrast - if we test each card independently and increment the error counter one by one until it reaches the number of GPUs, then, depending on the number of cards in the system, it could take a long time for all of them to fail - the more cards, the more time to fail (right? am I misunderstanding anything?). So the same event would unfold differently: the test rig would continue at reduced capacity until COUNT reaches the number of GPUs - but since COUNT resets at the next check, we can hobble on with 5, 4, 3, 2, 1 cards until they all die and the script kicks in, or until we freeze and require manual intervention. This could be hours of impact (again, if I'm reading this wrong, my apologies, but this is what I'm getting out of looking at it).

So IMHO, by testing that all the cards meet the 90% utilization threshold (as one, all or nothing = if &&), we avoid hours of impact/decreased capacity. My other concern is that as soon as cards start dropping off one at a time the system gets unstable, increasing the risk of a hang or corrupted file system due to a hard crash.
My view is that it should be cycled at maximum stability for a graceful restart.

Maybe there is a third approach not considered yet, Thoughts?

... edit:
Actually one more thought - I did not test for this yet so I don't know the answer - but in the case where the miner does not see all the cards anymore, does this mean that nvidia-smi ALSO does not see all the cards anymore? If so, and if we get the number of cards from nvidia-smi, wouldn't the script assume that the rig has the right number of cards every time nvidia-smi stops seeing one? I do recall cards disappearing even from nvidia-smi, but I never kept track of this so I don't know how often this condition actually occurs.
  
Thanks for explaining, and you do have valid points here. Like your thinking. I will rework it with this in mind.

Just wondering: Your script reboots the rig, if the miner itself does not recover. Instead we could introduce reloading miner as the first step here. In my experience that resolves the issue almost every time. It will only save 1-2 minutes so it's not a big deal to just reboot (still had the boot time of V0014 in mind).

I have not seen nvidia-smi lose a card while it's still there, but I can imagine that happening with faulty risers. Perhaps we can get the card count from nvidia-smi only once at startup (that saves a call as well) and keep that number during the watchdog process. If we lose a card we have to reboot anyway.
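
Something like this could do the "count once at startup" part. It is only a sketch: cards_missing is a made-up name, and the count comes from the "nvidia-smi -L | wc -l" one-liner posted earlier in the thread.

```shell
#!/bin/bash
# cards_missing EXPECTED CURRENT
# Succeeds (exit 0) when fewer cards are visible now than were counted at startup.
# (Hypothetical helper for illustration only.)
cards_missing() {
  local expected=$1 current=$2
  [ "$current" -lt "$expected" ]
}

# Possible use:
#   EXPECTED=$(nvidia-smi -L | wc -l)        # once, before the watchdog loop
#   ...
#   CURRENT=$(nvidia-smi -L | wc -l)         # inside the loop
#   if cards_missing "$EXPECTED" "$CURRENT"; then
#     echo "$(date) - card lost, rebooting" >> "$LOG_FILE"
#     sudo shutdown -r now
#   fi
```

One caveat: if the rig boots with a card already missing, the startup count is wrong too, so hard-coding the expected number may be safer on a rig whose size never changes.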

One other thought: Perhaps it would be an idea to echo the output of the log to a screen (tail -f) so the former reboots are shown as well?


Hi, if it's a Genoil rig I run this setup: https://bitcointalk.org/index.php?topic=1854250.msg19943144#msg19943144 plus the watchdog script being discussed here in separate "screen -dmS" sessions so I have the watchdog and restart scripts running separately.
For that setup I also tail the "ltail" script, but if we run only one script then it would make sense to echo some diagnostic output of what faults and recoveries it detects (or log it - but then we need to think about logrotate or someone will run out of space in a few months lol).
For the Claymore setup I only run the watchdog, since Claymore has its own fault detection and restarts by itself. If the built-in restart doesn't work, I cycle the box and log the reboot condition only, so I don't have to logrotate.
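
If logrotate feels like overkill for a single watchdog log, a small trim function might be enough. This is just a sketch; trim_log and the line limit are assumptions, not something from the scripts in this thread.

```shell
#!/bin/bash
# trim_log FILE MAX_LINES
# Keeps only the last MAX_LINES lines of FILE. Does nothing if FILE does not exist.
# (Hypothetical helper for illustration only.)
trim_log() {
  local file=$1 max_lines=$2 tmp
  [ -f "$file" ] || return 0
  tmp=$(mktemp) || return 1
  # Write the tail to a temp file first, then copy it back into the same file.
  tail -n "$max_lines" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Possible use, once per watchdog start:
#   trim_log "$LOG_FILE" 1000
```

Copying back into the same file (instead of moving the temp file over it) keeps the original file in place, which is friendlier to a "tail -f" running in another screen session.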
36  Alternate cryptocurrencies / Mining (Altcoins) / Re: Softcrash watchdog on: July 11, 2017, 12:00:58 PM
Hey fullzero, i have a question,

without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it, everything gets stuck, SSH barely works, average system load jumps to 14.5!! and Xorg takes up 100% of the CPU, its so bad that none of the standard reboot commands work, they just do nothing, the only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger" so i've set up a script that checks the average system load and if its over 2 it uses the command to reboot, and it works, but i dont like this "solution", yesterday after a reboot nvOC got corrupted somehow, lost my customized oneBash and the whole system became read-only (thankfully i had a oneBash backup that was only a few days behind).

so the question is, what can i do to relieve this Xorg error? i run a 7 card rig and never plan on going for a higher number, what can i do with Xorg that would fix this?

Thanks.
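
A sketch of the load-based check described in this question, with one change that may help with the corruption problem: asking the kernel to sync and remount read-only before the sysrq reboot ("s" and "u" before "b"). The function names are made up, the limit of 2 is the value from the post, and there is no guarantee this prevents corruption on a badly hung system. It also assumes sysrq is enabled and the script runs as root.

```shell
#!/bin/bash
# load_over LIMIT LOADAVG_LINE
# Succeeds when the 1-minute load (first field of a /proc/loadavg style line) is above LIMIT.
# (Hypothetical helper for illustration only.)
load_over() {
  awk -v limit="$1" -v line="$2" 'BEGIN { split(line, f, " "); exit !(f[1] + 0 > limit + 0) }'
}

# sysrq_reboot: sync disks, remount filesystems read-only, then reboot immediately.
# Defined here but not called; it needs root and an enabled kernel.sysrq setting.
sysrq_reboot() {
  local key
  for key in s u b; do
    echo "$key" > /proc/sysrq-trigger
    sleep 2
  done
}

# Possible use:
#   if load_over 2 "$(cat /proc/loadavg)"; then sysrq_reboot; fi
```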

@ tempgoga

It seems that whenever a soft crash occurs most of the cards drop to zero, so while the display/keyboard is unresponsive you can catch the soft crash from nvidia-smi. The script below checks card utilization; if it drops below 90% it counts down a minute, and if mining hasn't resumed it reboots the system.
This seems to have worked at least once in my case (only got one soft crash this weekend) and the system recovered as expected.
The threshold values work for my setup, but others may find different values optimal.

Also, if anyone knows a way to iterate the if && statements, we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards", but the way below also works with manual editing to adjust the watchdog for the number of cards in your individual system.
___________
 
#!/bin/bash
#m1
threshold=90
# Start the countdown at 12 checks x 5 seconds = 1 minute,
# so that a fault on the very first check does not reboot at once
i=12
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave as is; for a different number of cards remove or add the && statements as needed, as in the example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ "$i" -le 0 ]
         then echo "$(date) REBOOT due to soft crash" >>~/watchdog.log
         # "sleep -5" is not a valid duration; wait 5 seconds before rebooting
         sleep 5
         sudo shutdown -r now
        fi
done
___________
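
On the question of iterating the "if &&" for an unknown number of cards: one way is to loop over all readings and fail on the first one that is not above the threshold. A minimal sketch, assuming the readings are passed as arguments; the name all_above is made up for illustration.

```shell
#!/bin/bash
# all_above THRESHOLD VALUE...
# Succeeds only if at least one value is given and every value is above THRESHOLD.
# (Hypothetical helper for illustration only.)
all_above() {
  local threshold=$1 util
  shift
  # No readings at all (for example nvidia-smi printed nothing) counts as a fault.
  [ "$#" -gt 0 ] || return 1
  for util in "$@"; do
    # Anything that is not a plain number counts as a fault too.
    case $util in
      ''|*[!0-9]*) return 1 ;;
    esac
    [ "$util" -gt "$threshold" ] || return 1
  done
  return 0
}

# Possible use in place of the hand-edited "if &&" chain:
#   set -- $(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
#   if all_above "$threshold" "$@"; then i=12; echo OK; else echo $((i--)); fi
```

This keeps the all-or-nothing behaviour of the original chain, so one card below the threshold is enough to start the countdown, whatever the number of cards.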

Hey that's funny, I just made a script doing something similar, although it checks the power draw.
Here it is:
Code:
#!/bin/bash

# Miner restart script V001
# By Maxximus007
# for nvOC by fullzero
#
# POWERLIMIT MUST BE SET IN oneBash

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
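# Creating a log file to record restarts
# (LOG_FILE must be set before the first tee, otherwise the start message is not written to the log)
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"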

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT_LOW_POWER=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read POWERDRAW POWERLIMIT; } < <( nvidia-smi -i $gpu --query-gpu=power.draw,power.limit --format=csv,noheader,nounits)

  let POWER_DIFF=$( printf "%.0f" $POWERLIMIT )-$( printf "%.0f" $POWERDRAW )

  # If current draw is 30 Watt lower than the limit count them:
  if [ "$POWER_DIFF" -gt "30" ]
  then
    let COUNT_LOW_POWER=COUNT_LOW_POWER+1
  fi

  let gpu=gpu+1
done

if [ $COUNT_LOW_POWER -eq $GPUS ]
then
  echo "$(date) - Power draw is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  # (command substitution added so that kill receives the PIDs printed by awk)
  kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

You can combine the above with your code, and find the utilization like this:
Code:
nvidia-smi -i 1 --query-gpu=utilization.gpu --format=csv,noheader,nounits
You have to iterate the GPU, starting at 0 to get them all
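
As an alternative to one call per card: leaving out "-i" makes the same query print one line per GPU, so a single call can feed a counter. A sketch under that assumption; count_below is a made-up name.

```shell
#!/bin/bash
# count_below MIN
# Reads one utilization value per line on stdin and prints how many are below MIN.
# Readings that are not plain numbers are counted as below MIN.
# (Hypothetical helper for illustration only.)
count_below() {
  local min=$1 count=0 util
  while read -r util; do
    [ -n "$util" ] || continue
    case $util in
      *[!0-9]*) count=$((count + 1)) ;;
      *) if [ "$util" -lt "$min" ]; then count=$((count + 1)); fi ;;
    esac
  done
  echo "$count"
}

# Possible use:
#   LOW=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | count_below 90)
```

Comparing LOW against the number of GPUs gives the "all cards low" test, and testing LOW greater than 0 gives the "any card low" test.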
Okay I've combined the two, perhaps this will work for most of us:
Code:
#!/bin/bash

# Miner restart script V002
# By Maxximus007 && IAmNotAJeep
# for nvOC by fullzero
#

#########################
### BELOW CODE, NO NEED FOR EDITING
#########################
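# Creating a log file to record restarts
# (LOG_FILE must be set before the first tee, otherwise the start message is not written to the log)
LOG_FILE="/home/m1/restartlog.txt"
if [ ! -e "$LOG_FILE" ] ; then
    touch "$LOG_FILE"
fi
echo "$(date) - Starting miner restart script." | tee -a "${LOG_FILE}"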

MIN_UTIL=90
RESTART=0

while true
do
sleep 60

GPUS=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | tail -1)

gpu=0
COUNT=0

while [ $gpu -lt $GPUS ]
do
  { IFS=', ' read UTIL; } < <( nvidia-smi -i $gpu --query-gpu=utilization.gpu --format=csv,noheader,nounits)

  let UTILIZATION=$( printf "%.0f" $UTIL )

  # If the current utilization is lower than the limit, count the card:
  if [ $UTILIZATION -lt $MIN_UTIL ]
  then
    let COUNT=COUNT+1
  fi

  let gpu=gpu+1
done

if [ $COUNT -eq $GPUS ]
then
  if [ $RESTART -gt 1 ]
  then
    echo "$(date) - Utilization is too low: reviving did not work so restarting system" | tee -a ${LOG_FILE}
    sudo shutdown now -r
  fi
  echo "$(date) - Utilization is too low: kill miner and oneBash" | tee -a ${LOG_FILE}
  # If miner runs in screen 'miner' kill the screen
  screen -X -S miner kill
  # Best to restart oneBash - settings might be adjusted already
  # (command substitution added so that kill receives the PIDs printed by awk)
  kill $(ps -ef | awk '$NF~"oneBash" {print $2}')
  let RESTART=RESTART+1
else
  echo "$(date) - All good! Will check again in 60 seconds"
fi

done

Pretty cool! I'll try it tonight, let's hope this puts the softcrash issues behind us.


I will try this out as well; good work. :)

@ Maxximus007
Thanks for putting these together, great collab!
I'm not a bash expert, so maybe I'm reading this wrong, but here are some thoughts.
The combined code seems to evaluate each GPU individually for the fault condition, which means that if one fails and you have, say, 5 other cards working, it keeps going until all the cards give reduced output, since each of them has to fail individually to increment the counter? So if 5/6 fail we keep going? (Again, I'm just looking at it and tracing it in my head, so maybe I'm reading it wrong.)
The way I was thinking about it is that I wanted all the cards to work at above 90% efficiency and reboot as soon as any card strays below the threshold - this is why I did the "if and" statement and didn't iterate through "if" statements alone (I didn't know how to iterate "if and" based on an unknown number of cards lol). I had a version giving 6xOK and such, but I think it's more efficient to just get 1xOK if ALL meet the 90% criteria and start the countdown as soon as anything is out of norm - and if the miner recovers, flush the counter. I observed a number of these conditions with Claymore, where it recovers half the time but then eventually craps out and the script kicks in. I haven't seen it on my Genoil rig yet, since my other script has kept it in check without any softcrash for 3 days now.

A thought about the power draw as threshold measure - it is power limit/card specific and I guess people would need to tune their power threshold to their power limit so I agree it's best to use gpu util. (My cards are at 82W limit for example).
Thoughts?


  
The code checks each card individually. At times (with Claymore, not Genoil) I've seen Util (or Powerdraw) drop, maybe even below 90, for a few seconds. In order not to generate too many restarts I check all cards. We can lower this or make it so that each of us can decide when it should reboot.
I've combined the restart/reboot so that the first attempt is to restart the miner. If that doesn't work, we reboot the machine. We might want to reset the reboot counter after a while, so we don't lose time with a full reboot.

In the first code I checked Powerdraw -> if 30 Watt less than Powerlimit there might be something wrong. Idling cards use around 10 Watt, so that works for all I think. We can combine this with Util if that helps.

So sure we can make it more advanced, we just have to determine the right parameters. Hope others can let us know in what circumstances they see hanging miners. Just one card, or more or everything? Is Util back to zero? or hanging on to 100%?




OK thanks for the clarification, it's really neat and rewarding to see different approaches to this problem :D
Here is why I coded to test that all the cards meet the threshold as one with "if &&": as an example I'll use an event from my test rig overnight: one card dropped, the "if &&" script waited one minute for Claymore to recover, then rebooted the system and that was that.
Total down time, 2 mins, if you add the 1 minute of reduced capacity waiting for the miner to right itself, 3 minutes impact.

The "if &&" code tests for a graceful miner recovery by continuing to test the cards for above-threshold utilization for 60 seconds after it detects a fault.
If the miner recovers, but just sits there (saw both Claymore/Genoil do exactly that a number of times) that's not good enough and the system gets a boot.
My other miner restart script did not handle this exact case and once every few days I would find the miner sitting pretty and blowing bubbles mining on one or two cards until I noticed because it did not "see" all the cards anymore but it did see some so it thought it "recovered".

If the miner recovers properly, all cards need to hit above threshold and we can flush the counter and life goes on.
On my test rig, graceful miner recovery occurred 5-6 times in the past 24 hours without prompting a restart - which is desirable above either running at reduced capacity or 5-6 reboots (IMHO).

In contrast - if we test each card independently and increment the error counter one by one until it reaches the number of GPUs, then, depending on the number of cards in the system, it could take a long time for all of them to fail - the more cards, the more time to fail (right? am I misunderstanding anything?). So the same event would unfold differently: the test rig would continue at reduced capacity until COUNT reaches the number of GPUs - but since COUNT resets at the next check, we can hobble on with 5, 4, 3, 2, 1 cards until they all die and the script kicks in, or until we freeze and require manual intervention. This could be hours of impact (again, if I'm reading this wrong, my apologies, but this is what I'm getting out of looking at it).

So IMHO, by testing that all the cards meet the 90% utilization threshold (as one, all or nothing = if &&), we avoid hours of impact/decreased capacity. My other concern is that as soon as cards start dropping off one at a time the system gets unstable, increasing the risk of a hang or corrupted file system due to a hard crash.
My view is that it should be cycled at maximum stability for a graceful restart.

Maybe there is a third approach not considered yet, Thoughts?

... edit:
Actually one more thought - I did not test for this yet so I don't know the answer - but in the case where the miner does not see all the cards anymore, does this mean that nvidia-smi ALSO does not see all the cards anymore? If so, and if we get the number of cards from nvidia-smi, wouldn't the script assume that the rig has the right number of cards every time nvidia-smi stops seeing one? I do recall cards disappearing even from nvidia-smi, but I never kept track of this so I don't know how often this condition actually occurs.
  
37  Alternate cryptocurrencies / Mining (Altcoins) / Re: Softcrash watchdog on: July 11, 2017, 01:53:12 AM
@ Maxximus007
Thanks for putting these together, great collab!
I'm not a bash expert, so maybe I'm reading this wrong, but here are some thoughts.
The combined code seems to evaluate each GPU individually for the fault condition, which means that if one fails and you have, say, 5 other cards working, it keeps going until all the cards give reduced output, since each of them has to fail individually to increment the counter? So if 5/6 fail we keep going? (Again, I'm just looking at it and tracing it in my head, so maybe I'm reading it wrong.)
The way I was thinking about it is that I wanted all the cards to work at above 90% efficiency and reboot as soon as any card strays below the threshold - this is why I did the "if and" statement and didn't iterate through "if" statements alone (I didn't know how to iterate "if and" based on an unknown number of cards lol). I had a version giving 6xOK and such, but I think it's more efficient to just get 1xOK if ALL meet the 90% criteria and start the countdown as soon as anything is out of norm - and if the miner recovers, flush the counter. I observed a number of these conditions with Claymore, where it recovers half the time but then eventually craps out and the script kicks in. I haven't seen it on my Genoil rig yet, since my other script has kept it in check without any softcrash for 3 days now.

A thought about the power draw as threshold measure - it is power limit/card specific and I guess people would need to tune their power threshold to their power limit so I agree it's best to use gpu util. (My cards are at 82W limit for example).
Thoughts?


  
38  Alternate cryptocurrencies / Mining (Altcoins) / Re: Softcrash watchdog on: July 10, 2017, 06:32:34 PM
Pretty cool! I'll try it tonight, let's hope this puts the softcrash issues behind us.
39  Alternate cryptocurrencies / Mining (Altcoins) / Re: Softcrash watchdog on: July 10, 2017, 05:24:38 PM
Hey fullzero, i have a question,

without a doubt my biggest problem right now is that when my miner crashes it takes the whole rig down with it, everything gets stuck, SSH barely works, average system load jumps to 14.5!! and Xorg takes up 100% of the CPU, its so bad that none of the standard reboot commands work, they just do nothing, the only thing that actually reboots the rig in this state is "echo b > /proc/sysrq-trigger" so i've set up a script that checks the average system load and if its over 2 it uses the command to reboot, and it works, but i dont like this "solution", yesterday after a reboot nvOC got corrupted somehow, lost my customized oneBash and the whole system became read-only (thankfully i had a oneBash backup that was only a few days behind).

so the question is, what can i do to relieve this Xorg error? i run a 7 card rig and never plan on going for a higher number, what can i do with Xorg that would fix this?

Thanks.

@ tempgoga

It seems that whenever a soft crash occurs, utilization on most of the cards drops to zero, so while the display/keyboard is unresponsive you can still catch the soft crash from nvidia-smi. The script below checks card utilization every 5 seconds; if any card drops below 90% it counts down a minute (12 checks), and if mining hasn't resumed by then it reboots the system.
This seems to have worked at least once in my case (I only got one soft crash this weekend) and the system recovered as expected.
The threshold values work for my setup, but others may find different values optimal.

Also, if anyone knows a way to iterate the if && statements, we can get the card count from "cards=$(nvidia-smi -L | wc -l); echo $cards", but the way below also works with manual editing to adjust the watchdog for the number of cards in your individual system.
___________
 
#!/bin/bash
#m1
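threshold=90
# Start the countdown at 12 checks (12 x 5 s = 60 s) so a failed first check does not reboot immediately
i=12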
while sleep 5
 do number=$(nvidia-smi |grep % |awk '{print $13}' |tr -d %)
 set -- $number
 echo -e "$@"
# The "if and" statements below need to be manually adjusted to match the number of cards in your system
# If you have 5 cards, leave as is; for a different number of cards, remove or add && statements as needed, as in the commented example below
        if [[ "$1" -gt "$threshold" ]] && \
           [[ "$2" -gt "$threshold" ]] && \
           [[ "$3" -gt "$threshold" ]] && \
           [[ "$4" -gt "$threshold" ]] && \
           [[ "$5" -gt "$threshold" ]]
# && \
#          [[ "$6" -gt "$threshold" ]]
         then i=12
         echo OK
         else echo $((i--))
        fi
        if [ $i -le 0 ]
         then echo "$(date) REBOOT due to soft crash" >> ~/watchdog.log
         sleep 5
         sudo shutdown now -r
        fi
done
___________
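
On the question of iterating the && checks: one possible approach is to loop over however many readings come back instead of hard-coding one test per card. This is only a sketch. The function name all_above is made up for illustration, and the usage lines assume nvidia-smi's query mode (the same flags used in the other watchdog on this page), which prints one utilization value per card.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if every reading after the first
# argument is strictly greater than the threshold given as the first argument.
all_above() {
  local threshold=$1
  shift
  # No readings at all (e.g. nvidia-smi hung or failed) counts as a failure
  [ "$#" -gt 0 ] || return 1
  local util
  for util in "$@"; do
    [ "$util" -gt "$threshold" ] || return 1
  done
  return 0
}

# Usage inside the watchdog loop (assumption: one integer per card, one per line):
#   readings=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
#   if all_above "$threshold" $readings; then i=12; echo OK; else echo $((i--)); fi
```

Leaving $readings unquoted in the usage line is deliberate: word splitting turns each line of output into a separate argument, so the same script works for any number of cards.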
40  Alternate cryptocurrencies / Mining (Altcoins) / Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017 on: July 09, 2017, 03:15:28 AM
I tried to add the watchdog script. I made both scripts, but I guess I am a bit confused about how to get them to work alongside the terminal that starts running when the system boots up.

If you try to launch the second one while the original terminal is still running, it errors out.

Hi Nexilius, you need to kill the terminal first.
It's important to run oneBash once per boot to set all the variables. After that:
ssh into the box
kill the terminal:

$ kill $(ps aux | grep '[t]ermina' | awk '{print $2}')

then run the other two scripts as daemons:

$ screen -dmS ltail sh ~/eth/Genoil-U/ltail
$ screen -dmS ett bash ~/ett

It's not elegant by any means, but it's solid enough 9 times out of 10 when ethminer crashes out. This approach does not catch the soft crashes, but I should have a working band-aid for that tomorrow.

monitor mining status:
$ screen -r ltail
then, inside screen (these are key presses, not commands):
ctrl-a |   <--- don't forget the pipe (splits the screen vertically)
ctrl-a tab <--- to move to the right screen
ctrl-a c   <--- new prompt
$ screen -r ett   <--- to get the output of ethminer on the right screen; when it dies and ltail kills it, just up arrow to get "screen -r ett" again to monitor the new process

Like I said, not elegant, but stable enough (2+ days uptime before needing reboot).