[OS] nvOC easy-to-use Linux Nvidia Mining

Quote from: salfter on July 05, 2017, 03:01:01 AM

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.
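A rough sketch of how those log lines could be pulled out over SSH. On an Ubuntu-based image such as nvOC, /var/log/dmesg may only hold boot-time messages, so the live `dmesg` output and /var/log/kern.log are worth checking as well. `hw_errors` is a made-up helper name, and the pattern list is only an assumption about which messages matter.

```shell
#!/bin/bash
# hw_errors: print the last N lines of a kernel log that look like
# GPU or PCIe trouble. The pattern list is illustrative, not exhaustive.
hw_errors() {
    local logfile="$1" count="${2:-20}"
    grep -E 'NVRM|Xid|PCIe Bus Error|AER|Oops|BUG:' "$logfile" | tail -n "$count"
}

# Example usage on the rig (uncomment to run):
# hw_errors /var/log/kern.log 40
# dmesg | hw_errors /dev/stdin 40
```

Since this only reads logs, it should be safe to run while the miner is hashing.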


UberDaemon

Newbie

Offline

Activity: 51
Merit: 0

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 03:11:33 AM
Last edit: July 05, 2017, 03:30:01 AM by UberDaemon

#1526

You could configure your router to forward a port other than 22 to port 22 on your mining rig. I haven't bothered with that with mine, though; I can ssh into my FreeNAS media server (or my desktop, if it's booted into Linux...can RDP into it if it's running Windows and set it to reboot into Linux) from outside and then ssh into the mining rig from there.

Yes, I posted about this earlier, but the concern is that it would leave nvOC's SSH daemon open to the WAN with a default password for those of us who don't have another SSH daemon on our LAN to use as an intermediary. Someone could wreak all sorts of havoc if they had access to a Linux box on your local network to use as a launching point, so I personally would want to have my own unique password set before I forward any ports to nvOC. I have a feeling there would be some extra steps involved in changing the password for the m1 user on nvOC, since oneBash runs commands that require escalation, but I'm not sure where oneBash gets the m1 user's password from when it's executing commands. I'm sure OP can clarify this when he gets caught up on posts.

PS, Fullzero, I'm really liking v0017 so far. Excellent work!!

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:21:52 AM

#1527

Quote from: f00ch0w on July 04, 2017, 04:55:34 PM

It seems the 6-pin powered risers didn't solve the issue. The rig did run stable for its longest period so far, I think something over 24 hrs. I unplugged 1 GPU to see whether running 6x GPUs was causing the crash... Every GPU was on a separate PSU cable, but it still crashed... Out of ideas now.

EDIT: Now my x4 1070 rigs are crashing too. Same shit all over again: "Either GPU1/2/3/4 has stopped working" bla bla, and it crashes the whole rig... Can anyone provide me with a solution?

Also, I've got a Gigabyte H110-D3A on those 1070 rigs; could that be the issue? I put one back on an AsRock board as a test, will let ya know.

EDIT2: Yep, AsRock didn't make a difference. Sometimes it crashes with a freeze.

I haven't tested an H110 chipset. Your problem might be related to chipset differences. If this is the problem, running the software updater might solve it.

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:26:16 AM

#1528

Quote from: S9k on July 04, 2017, 05:12:50 PM

Quote from: fullzero on July 04, 2017, 03:54:47 PM

Quote from: S9k on July 04, 2017, 01:09:02 PM

Hi,

Please help!
I have got stuck on this problem.

My configuration:

-ASUS PRIME Z270-P - 2 . I tried both, results are similar.
-EVGA GeForce GTX 1080 GAMING ACX 3.0 - 2
-MSI Geforce GTX 1080 Gaming X- 2
-The Gigabyte power supply unit on 1200 watts

Three video cards work perfectly in any combination,

m1@m1-desktop:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-43453088-0fca-9442-106d-7594d157ebf2)
GPU 1: GeForce GTX 1080 (UUID: GPU-d099b67e-f204-66fa-96dc-365a6b559a7e)
GPU 2: GeForce GTX 1080 (UUID: GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac)
m1@m1-desktop:~$

m1@m1-desktop:~$ lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
m1@m1-desktop:~$

but if I add the fourth (in this case the ID GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac), then the system crashes. Here is what I see in dmesg:

[ 98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[ 98.769072] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769117] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769144] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769169] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769193] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769217] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769241] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.359255] nvidia-modeset: Allocated GPU:1 (GPU-5c9c8e29-a088-90a6-2a20-b2b2b971d1fb) @ PCI:0000:05:00.0
[ 99.398991] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399035] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399063] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399087] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399112] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399136] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399160] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.984670] nvidia-modeset: Allocated GPU:2 (GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac) @ PCI:0000:06:00.0
[ 100.619118] nvidia-modeset: Allocated GPU:3 (GPU-d099b67e-f204-66fa-96dc-365a6b559a7e) @ PCI:0000:03:00.0
[ 100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[ 100.743162] NVRM: GPU Board Serial Number:
[ 100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[ 100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004
[ 102.432593] r8169 0000:07:00.0 enp7s0: link up
[ 102.432600] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[ 103.743306] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ 103.773941] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.501795] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 105.501798] Bluetooth: BNEP filters: protocol multicast
[ 105.501802] Bluetooth: BNEP socket layer initialized
[ 105.613048] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.613106] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704570] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704972] BUG: unable to handle kernel paging request at ffff88167153d830
[ 105.704974] IP: [<ffffffffc0262880>] _nv008171rm+0x620/0x780 [nvidia]
[ 105.705052] PGD 220c067 PUD 0
[ 105.705053] Oops: 0000 [#1] SMP

I have been trying to solve this problem for three days.
I have changed BIOS versions (0325, 0608, 0610) and risers, the 4G option is enabled, and I have updated the NVIDIA drivers to 381.22; nothing helps.
Does anybody have any ideas?

My guess is your mobo is trying to / is using SLI. Are you using an M2 ssd?

There should be some setting in the bios related to SLI; disable it / what slots are you using and are you using risers, if so on which GPUs?

If you are using risers; how are they powered?

Hi,
no, I don't use M2 SSD.
I use risers of the version 006s with the molex socket.

I managed to solve the problem. I modified /etc/default/grub:

m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""

sudo update-grub

I also created the file disable-nouveau.conf, which contains two lines:

m1@m1-desktop:/etc/modprobe.d$ more /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0

sudo reboot

Were you connecting the monitor directly to the mobo?

Not sure why else nouveau would be used.
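To confirm whether nouveau is actually being loaded before or after a change like the one quoted above, something along these lines might help. This is only a sketch: `nouveau_loaded` is a made-up helper name, and it just inspects `lsmod`-style output. On Ubuntu it is often also necessary to run `sudo update-initramfs -u` so a modprobe.d blacklist is picked up at early boot; whether nvOC needs that step is not confirmed in this thread.

```shell
#!/bin/bash
# nouveau_loaded: succeed (exit 0) if the nouveau module appears in
# lsmod-style input on stdin, where the first column is the module name.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Example usage on the rig:
# if lsmod | nouveau_loaded; then
#     echo "nouveau is still loaded"
# else
#     echo "nouveau is not loaded"
# fi
```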

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:28:16 AM

#1529

Quote from: gig410 on July 04, 2017, 09:22:40 PM

Quote from: fullzero on July 03, 2017, 01:57:27 AM

Quote from: gig410 on July 02, 2017, 09:54:35 PM

Quote from: xleejohnx on July 02, 2017, 09:20:57 PM

Quote from: gig410 on July 02, 2017, 09:04:25 PM

My rig crashed from having the settings too high, it went down when I was asleep. I rebooted it and it's up and running but I'm getting a low disk space warning. What file / directory do I delete ?

run this code line and you are golden on space

Code:

sudo apt-get purge $(dpkg -l linux-{image,headers}-"[0-9]*" | awk '/ii/{print $2}' | grep -ve "$(uname -r | sed -r 's/-[a-z]+//')")

that worked. Thank you so much!

Thanks for helping xleejohnx

gig410 what version are you using?

I'm using 0017; sorry for the late response.

Did you add a lot of additional programs, ~2 GB or more?

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:35:15 AM

#1530

Quote from: UberDaemon on July 05, 2017, 01:42:23 AM

Quote from: Nexillus on July 03, 2017, 03:47:13 PM

Quote from: f00ch0w on July 03, 2017, 01:30:23 PM

Seems like 6x pin powered risers solved my issue with 1050ti's crashing. Thanks a lot @fullzero and others

Now, I'm interested, is there a way to see all rigs on API and to be able to see that from outside network? If so, how to configure it with router? I got a MikroTik behind the 24-port switch.

The best way to do this is to set up OpenVPN into the network and allow it on the same subnet. Once you connect to the VPN, the connection will act just as if you were on the home network. It will also be secure if you use a higher level of encryption like AES256-CBC.

You could just use SSH for this if you don't want to set up a VPN server, as SSH also uses AES-256 encryption and is every bit as secure as VPN, plus it's already running! The only config required would be to apply a static DHCP lease in your router so each miner always has the same LAN IP assigned to it, and to forward the appropriate port(s) in your router. For instance, you could set an unused incoming WAN port like 2222 to forward all inbound traffic on that port to LAN port 22 (the default SSH port) on LAN IP 10.20.30.40, if that were the LAN IP for your nvOC rig. If you have multiple rigs, WAN port 2222 forwards to port 22 on 10.20.30.40, WAN port 2223 forwards all incoming traffic to LAN port 22 on IP 10.20.30.41, etc. My only concern here, though, is that I would want to change the default password (miner1) before opening up an outside port to nvOC's SSH daemon, as a clever hacker might scan your WAN IP (which is a thing; bored or malicious people do this), find that open port, and get lucky by trying "miner1" as a password. Changing the system password is as simple as running passwd from guake/SSH, but I wouldn't recommend doing that until OP can give some guidance on whether that will cause problems within oneBash. Most of the commands executed in oneBash require privilege escalation, and I don't know where it finds the "miner1" password.

OP, can you shed any light on that? Is it okay to change the password for the m1 user without editing anything else? I don't see it inside oneBash itself.

I haven't tested everything after changing the m1 or root password, but you should be able to do it without issue. You would want to make sure you change the root password in addition to the m1 password if you set up the static lease and port forwarding as described.

You should probably also change the SSH keys in seahorse.
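For anyone trying the port-forwarding scheme described in the quote, here is a small sketch of the client side. It assumes the mapping from the post (WAN port 2222 for the first rig, 2223 for the second, and so on, each forwarded to port 22 on the rig). `rig_ssh_cmd` and `WAN_HOST` are hypothetical names; substitute your own public IP or hostname.

```shell
#!/bin/bash
# rig_ssh_cmd: print the ssh command for rig number N (0-based), following
# the scheme in the post: WAN port 2222+N forwards to port 22 on that rig.
# The second argument is a placeholder for your public IP or hostname.
rig_ssh_cmd() {
    local index="$1" wan_host="${2:-WAN_HOST}"
    echo "ssh -p $((2222 + index)) m1@${wan_host}"
}

# Example usage:
# rig_ssh_cmd 0 example.com   # prints: ssh -p 2222 m1@example.com
# rig_ssh_cmd 1 example.com   # prints: ssh -p 2223 m1@example.com
#
# On each rig, before opening the port, replace the default password:
# passwd
```

Printing the command rather than running it keeps the helper harmless; paste the output into a terminal once the router rules are in place.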

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:39:11 AM

#1531

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

thanks in advance

look at the syslog:

go to ubuntu button top left and enter:

sy

click on system log

gig410

Newbie

Offline

Activity: 14
Merit: 0

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:41:03 AM

#1532

Quote from: fullzero on July 05, 2017, 04:39:11 AM

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8

no idea what those mean
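With a kern.log that large, a per-device summary may be easier to read than the raw stream. The sketch below counts "PCIe Bus Error" lines by PCI address and severity; `aer_summary` is a made-up helper name and it assumes the log lines are shaped like the ones quoted above.

```shell
#!/bin/bash
# aer_summary: count "PCIe Bus Error" lines per PCI address and severity.
# Expects kern.log lines shaped like:
#   ... pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=...
aer_summary() {
    local logfile="$1"
    grep 'PCIe Bus Error' "$logfile" |
        sed -E 's/.* ([0-9a-f:.]+): PCIe Bus Error: severity=([^,]+),.*/\1 \2/' |
        sort | uniq -c
}

# Example usage on the rig:
# aer_summary /var/log/kern.log
# lspci -nn -s 00:1b.0    # show which device sits at the reporting address
```

In the `[8086:a2eb]` pair from the log, 8086 is Intel's PCI vendor ID (NVIDIA cards show 10de), which suggests the reporting device is a chipset PCIe port rather than a GPU itself. The riser or card plugged into that port could still be the source of the errors.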

gig410

Newbie

Offline

Activity: 14
Merit: 0

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:48:29 AM

#1533

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

thanks in advance

look at the syslog:

go to ubuntu button top left and enter:

sy

click on system log

when I do that it gives me a stream of those messages in my previous post

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 04:59:31 AM

#1534

Quote from: IAmNotAJeep on July 05, 2017, 02:01:14 AM

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8

no idea what those mean

What kind of risers are you using?

Have you checked to ensure they are fully seated in the pcie ports?

fullzero (OP)

Legendary

Offline

Activity: 1260
Merit: 1009

Re: Genoil CUDA crashes watchdog

July 05, 2017, 05:02:25 AM

#1535

First of all big thank you to fullzero and everyone contributing to this distro!

I've been struggling with the Genoil crash issue and lack of watchdog implementation for the past few days and I have a bandaid solution that seems to be actually working quite well, perhaps it can help others in the community:

Essentially you need to split the Genoil output to a file, grep it (we only care about 'error' instances), and then use this output as input for a monitoring script that kills and restarts the misbehaving process.

So we have 2 scripts launched in screen as daemons "ltail" script and "ett" script

$screen -dmS ltail sh ~/eth/Genoil-U/ltail
and
$screen -dmS ett bash ~/ett

ltail:
--------------------------
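#!/bin/bash
# Watch err.log and restart ethminer whenever a new error line appears.
echo listening...
cd ~/eth/Genoil-U/ || exit 1
tail -n0 -f err.log | \
while read -r line ; do
   # Branch on grep's own exit status (a trailing tee would mask it).
   if echo "$line" | grep -q "error"
   then
      DATE=$(date +"%d-%m-%Y %H:%M:%S")
      echo "$DATE $line" | tee -a ~/eth/Genoil-U/timestamp.log
      kill $(ps aux | grep '[e]thminer' | awk '{print $2}')
      sleep 1
      screen -dmS ett bash ~/ett
   fi
done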
-------------------------
ett:
-------------------------
#!/bin/bash
cd ~/eth/Genoil-U
./ethminer -U -F eth-us.dwarfpool.com:80/0xBEbd092a03827C37B75cd4ea314b207AA65c348f/208 2>&1 | tee >(grep error --color=never --line-buffered | tee -a err.log)

-------------------------

Finally, I also send the output of ltail to timestamp.log to track how many times Genoil fails per hour. Roughly aiming at 1 crash per hour, this gives me about 130 MH/s out of 5x GTX 1060, which is a good 20+ MH/s higher than Claymore. Most importantly, it gives stable hashing despite the OC-induced errors. The recovery literally takes seconds.
Oh yeah and I also run
$tail -f ~/eth/Genoil-U/timestamp.log in a screen as well as watch -n 5 'sensors |grep Core' in another screen to fine tune the OC vs crash per hour vs temp
Hope this helps, and I hope the message is not too chaotic.
Cheers!

BTC: 13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7
ETH: 0xBEbd092a03827C37B75cd4ea314b207AA65c348f

Very nice

I will probably add some version of this in a later version. I will include your donation address and ensure you are credited.
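Since the quoted post tunes the overclock against crashes per hour, a quick tally of timestamp.log might be handy. This is only a sketch: `restarts_per_hour` is a made-up helper name, and it assumes every line starts with a "dd-mm-YYYY HH:MM:SS" stamp as written by the ltail script.

```shell
#!/bin/bash
# restarts_per_hour: count timestamp.log entries per hour.
# Assumes each line starts with "dd-mm-YYYY HH:MM:SS".
# Output is grouped by day and hour; because the date is day-first,
# the sort order is not chronological across months.
restarts_per_hour() {
    local logfile="$1"
    awk '{ print $1, substr($2, 1, 2) ":00" }' "$logfile" | sort | uniq -c
}

# Example usage on the rig:
# restarts_per_hour ~/eth/Genoil-U/timestamp.log
```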

car1999

Full Member

Offline

Activity: 350
Merit: 100

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 05:25:04 AM

#1536

Quote from: UberDaemon on July 05, 2017, 01:23:05 AM

Quote from: gigi8114 on July 04, 2017, 09:22:07 AM

Also, I couldn't find how I can see the current mining process. I did see the screen -r commands, but that implies killing the current process and restarting it. I'd like to be able to see, from SSH, the current mining process without killing it. Is this possible?

If you want to monitor the mining process via screen you're going to have to kill the initial gnome-terminal. There's no way around that, as screen can only reconnect to an existing screen session.

This shouldn't be a big deal if you have a stable rig. You only need to do it once per reboot. My process is:

1. From my desktop where I monitor my rigs I initiate a constant ping:

Code:

ping -t 10.20.30.40  # substitute your rig's IP, find it in your router, or by running nmap on your LAN subnet, or by running ifconfig from a guake terminal on the rig if you have a monitor connected

2. Boot the rig
3. Wait until I begin to get ping responses from the rig, thus indicating Ubuntu has booted and rig has network connectivity
4. SSH into the rig (user: m1 password: miner1)
5. Initiate a screen session:

Code:

screen -S [name for your rig, make one up or call it "rig"]

6. Start nvidia-smi dmon to watch for mining process to begin (by waiting until this happens you know OC settings, fan speed settings, etc have been applied. Running those commands from within screen isn't 100% consistent IME as I always see error messages when I tried it that way. It's best to let those settings commands run from gnome-terminal as Ubuntu first boots IMO).

Code:

nvidia-smi dmon

7. Wait until you see wattage go up and GPU utilization go up to 100% (which indicates that the oneBash script concluded and opened the mining process). Exit nvidia-smi with CTRL + c
8. Find the PID for gnome-terminal.

Code:

ps aux | grep gnome-terminal

9. Kill it:

Code:

kill [PID from step 8]

10. Restart mining:

Code:

bash '/media/m1/1263-A96E/oneBash'

It might seem like a lot of steps, but it takes all of 120 seconds and you shouldn't need to do it very often once your rig is dialed in. You're losing maybe 1 minute's worth of hashes on avg of every week? Pretty negligible considering the convenience of monitoring from another workstation, and you're not using up system resources by using Teamviewer. This also lets you go completely headless if you buy a dummy HDMI plug. I just updated from 16 to 17 and didn't need to haul my extra monitor upstairs to do it. Easy peasy.

Run export DISPLAY=:0 before step 5; if not, step 10 throws an error.
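Steps 8 to 10, together with the DISPLAY tip, could be strung together roughly like this. It is a sketch rather than a tested nvOC procedure: `first_pid` is a made-up helper name, and the oneBash path is the one from the quoted post, which may differ on your rig.

```shell
#!/bin/bash
# first_pid: read `ps aux`-style lines on stdin and print the PID (column 2)
# of the first process whose line contains the given name, skipping any
# line that belongs to a grep process.
first_pid() {
    local name="$1"
    awk -v name="$name" 'index($0, name) && !index($0, "grep") { print $2; exit }'
}

# Example usage over SSH, inside a screen session:
# export DISPLAY=:0
# pid=$(ps aux | first_pid gnome-terminal)
# [ -n "$pid" ] && kill "$pid"
# bash '/media/m1/1263-A96E/oneBash'
```

Only the first match is returned, so check the `ps aux` output by hand if more than one gnome-terminal process is running.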

gig410

Newbie

Offline

Activity: 14
Merit: 0

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 05:37:17 AM

#1537

Quote from: fullzero on July 05, 2017, 04:59:31 AM

Quote from: gig410 on July 05, 2017, 05:37:17 AM

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8

no idea what those mean

What kind of risers are you using?

Have you checked to ensure they are fully seated in the pcie ports?

I just checked whether they are seated correctly on the motherboard and on the cards, and they are. I ran an lspci command and it looks like ID a2eb is the first GPU on the rig; it has its own power cord to the power supply on the card and on the riser. The card does work, but it has these errors.

gig410

Newbie

Offline

Activity: 14
Merit: 0

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 06:10:30 AM

#1538

Quote from: fullzero on July 05, 2017, 04:59:31 AM

Quote from: fullzero on July 05, 2017, 04:26:16 AM

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I did not see it happen; I only knew because the screen was blank and the fans on the GPUs had gone up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log I could look at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8

no idea what those mean

What kind of risers are you using?

Have you checked to ensure they are fully seated in the pcie ports?

I just checked whether they are seated correctly on the motherboard and on the cards, and they are. I ran an lspci command and it looks like ID a2eb is the first GPU on the rig; it has its own power cord to the power supply on the card and on the riser. The card does work, but it has these errors.

looks like I was wrong about a2eb being the first gpu. I removed the gpu completely and I'm still getting these errors as soon as I boot, it won't even go into the GUI any more

S9k

Newbie

Offline

Activity: 26
Merit: 0

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 06:12:18 AM

#1539

Quote from: S9k on July 04, 2017, 05:12:50 PM

Quote from: fullzero on July 04, 2017, 03:54:47 PM

Quote from: S9k on July 04, 2017, 01:09:02 PM

Hi,

Please help!
I have got stuck on this problem.

My configuration:

-ASUS PRIME Z270-P - 2 . I tried both, results are similar.
-EVGA GeForce GTX 1080 GAMING ACX 3.0 - 2
-MSI Geforce GTX 1080 Gaming X- 2
-The Gigabyte power supply unit on 1200 watts

Three video cards work perfectly in any combination,

m1@m1-desktop:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-43453088-0fca-9442-106d-7594d157ebf2)
GPU 1: GeForce GTX 1080 (UUID: GPU-d099b67e-f204-66fa-96dc-365a6b559a7e)
GPU 2: GeForce GTX 1080 (UUID: GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac)
m1@m1-desktop:~$

m1@m1-desktop:~$ lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
m1@m1-desktop:~$

but if I add the fourth (in this case the ID GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac), then the system crashes. Here is what I see in dmesg:

[ 98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[ 98.769072] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769117] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769144] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769169] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769193] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769217] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769241] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.359255] nvidia-modeset: Allocated GPU:1 (GPU-5c9c8e29-a088-90a6-2a20-b2b2b971d1fb) @ PCI:0000:05:00.0
[ 99.398991] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399035] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399063] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399087] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399112] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399136] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399160] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.984670] nvidia-modeset: Allocated GPU:2 (GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac) @ PCI:0000:06:00.0
[ 100.619118] nvidia-modeset: Allocated GPU:3 (GPU-d099b67e-f204-66fa-96dc-365a6b559a7e) @ PCI:0000:03:00.0
[ 100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[ 100.743162] NVRM: GPU Board Serial Number:
[ 100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[ 100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004
[ 102.432593] r8169 0000:07:00.0 enp7s0: link up
[ 102.432600] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[ 103.743306] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ 103.773941] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.501795] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 105.501798] Bluetooth: BNEP filters: protocol multicast
[ 105.501802] Bluetooth: BNEP socket layer initialized
[ 105.613048] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.613106] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704570] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704972] BUG: unable to handle kernel paging request at ffff88167153d830
[ 105.704974] IP: [<ffffffffc0262880>] _nv008171rm+0x620/0x780 [nvidia]
[ 105.705052] PGD 220c067 PUD 0
[ 105.705053] Oops: 0000 [#1] SMP
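A log like the one above can be long, so it may help to tally the NVRM Xid lines per PCI address before reading the rest. The following is only a rough sketch and not something from the original post: `xid_summary` is a hypothetical helper name, and it assumes the Xid lines keep the `NVRM: Xid (PCI:...): NN` shape shown in this log.

```shell
# Summarize NVRM Xid errors per PCI address from kernel log text on stdin.
# Hypothetical helper; assumes lines shaped like "NVRM: Xid (PCI:0000:01:00): 56, ...".
xid_summary() {
    # Keep only the "NVRM: Xid (PCI:addr): code" part of each matching line.
    grep -oE 'NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): [0-9]+' |
        # Rewrite it as "PCI:addr xid=code".
        sed -E 's/NVRM: Xid \((PCI:[^)]*)\): ([0-9]+)/\1 xid=\2/' |
        # Count identical address/code pairs, most frequent first.
        sort | uniq -c | sort -rn
}
```

On a rig this would be run as `dmesg | xid_summary`; each output line is a count followed by the PCI address and the Xid code.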

I have been trying to solve this problem for three days.
I changed BIOS versions (0325, 0608, 0610) and risers, the 4G control option is enabled, and I updated the NVIDIA drivers to 381.22; nothing helps.
Maybe somebody has ideas?

My guess is your mobo is trying to / is using SLI. Are you using an M2 ssd?

There should be some setting in the BIOS related to SLI; disable it. Which slots are you using, and are you using risers? If so, on which GPUs?

If you are using risers, how are they powered?

Hi,
no, I don't use M2 SSD.
I use version 006s risers with the Molex socket.

I managed to solve the problem. I modified /etc/default/grub

m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""

sudo update-grub

I have also created the file disable-nouveau.conf, which contains two lines:

m1@m1-desktop:/etc/modprobe.d$ more /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0

sudo reboot
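After the reboot it may be worth confirming that the change took effect. This is a minimal sketch, not part of the original fix: `has_nouveau_blacklist` and `nouveau_loaded` are hypothetical helper names, and the check assumes the config file contains exactly the two lines quoted above.

```shell
# Return 0 if the given modprobe config contains both lines from the post.
# Hypothetical helper; pass the path of the .conf file to check.
has_nouveau_blacklist() {
    conf="$1"
    grep -qE '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$conf" &&
        grep -qE '^[[:space:]]*options[[:space:]]+nouveau[[:space:]]+modeset=0[[:space:]]*$' "$conf"
}

# Return 0 if a nouveau module line appears in lsmod-style text on stdin.
nouveau_loaded() {
    grep -qE '^nouveau[[:space:]]'
}
```

Possible usage: `has_nouveau_blacklist /etc/modprobe.d/disable-nouveau.conf && echo "blacklist present"`, then `lsmod | nouveau_loaded && echo "nouveau is still loaded"`. Running `cat /proc/cmdline` should also show the parameters from GRUB_CMDLINE_LINUX_DEFAULT if update-grub picked them up.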

Were you connecting the monitor directly to the mobo?

Not sure why else nouveau would be used.

salfter

Hero Member

Offline

Activity: 651
Merit: 501

My PGP Key: 92C7689C

Re: [OS] nvOC easy-to-use Linux Nvidia Mining v0017

July 05, 2017, 06:15:20 AM

#1540