UberDaemon
Newbie
Offline
Activity: 51
Merit: 0
|
|
July 05, 2017, 02:34:47 AM |
|
Sort of a broad question here, but does anyone have suggestions as to why my mobo won't turn on? Everything seems like it's connected properly; I probably screwed up the switch or the pins for the switch. Just wondering if anyone has encountered semi-generic problems or common issues? It's a 270 board.
If your mobo is sitting out on a table/etc and isn't inside a case take a photo, upload it to imgur, and post it here. That might help. Off the top of my head: did you connect the 24 pin ATX power AND the 8 pin CPU power? Are you sure you have your power switch connected to the proper headers/pins? If you look closely at the motherboard front panel pins there's a legend to show which pins correspond to power, reset, HDD LED, etc. Remove your front panel connectors/power switch and try taking a flathead screwdriver and touching it to the 2 power pins simultaneously for about a half second (creating a short between them which is what your power button does) and see if it powers up. Last but not least... is the toggle switch on your power supply turned on? Stranger things have happened
|
|
|
|
xleejohnx
|
|
July 05, 2017, 02:38:08 AM |
|
Sort of a broad question here, but does anyone have suggestions as to why my mobo won't turn on? Everything seems like it's connected properly; I probably screwed up the switch or the pins for the switch. Just wondering if anyone has encountered semi-generic problems or common issues? It's a 270 board.
If your mobo is sitting out on a table/etc and isn't inside a case take a photo, upload it to imgur, and post it here. That might help. Off the top of my head: did you connect the 24 pin ATX power AND the 8 pin CPU power? Are you sure you have your power switch connected to the proper headers/pins? If you look closely at the motherboard front panel pins there's a legend to show which pins correspond to power, reset, HDD LED, etc. Remove your front panel connectors/power switch and try taking a flathead screwdriver and touching it to the 2 power pins simultaneously for about a half second (creating a short between them which is what your power button does) and see if it powers up. Last but not least... is the toggle switch on your power supply turned on? Stranger things have happened
I forgot to plug in the CPU power one time. I would try starting it up with nothing in it and go from there. Clear the CMOS. Little things add up.
|
As I see a super coin as the super highway and alt coins as taxis and trucks needed to move transactions. ~philipma1957
|
|
|
gig410
Newbie
Offline
Activity: 14
Merit: 0
|
|
July 05, 2017, 02:47:20 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
thanks in advance
|
|
|
|
salfter
|
|
July 05, 2017, 03:01:01 AM |
|
I went a step lower than Pentium on my 2 rigs and bought $50 G3930 Celeron processors since I am only GPU mining. They run nvOC quite stably (I just returned from a 5 day vacation and both of my rigs that were running v0016 stayed up the entire time I was gone).
Mine's a Celeron G3920...Skylake vs. Kaby Lake. The motherboard I was using the first few weeks (a Biostar Racing Z170GT7) might not have shipped with a BIOS that supported Kaby Lake CPUs out of the box. That board conked out (was an open-box purchase), so I sent it back and am now running an Asus Prime Z270-AR (only difference between it and the Z270-A referenced in the OP is a lack of DisplayPort and DVI ports, AFAIK).
Granted, I am not using Teamviewer like some folks here. That will consume more system resources. I can do everything I need with SSH and the screen command if I'm at home. I did leave one of my Windows workstations online while I was gone so I could Teamviewer into that and from there SSH into my rigs if necessary, but luckily I had no need to.
You could configure your router to forward a port other than 22 to port 22 on your mining rig. I haven't bothered with that with mine, though; I can ssh into my FreeNAS media server (or my desktop, if it's booted into Linux...can RDP into it if it's running Windows and set it to reboot into Linux) from outside and then ssh into the mining rig from there. Never used Teamviewer; tried accessing the mining rig with both RDP and VNC, and neither worked. SSH works better for this purpose anyway, once you're familiar with it.
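The two-hop login described here (outside machine to an always-on box, then on to the rig) can be sketched as an ssh_config entry. This is only a sketch and assumes an OpenSSH client new enough to support ProxyJump (7.3 or later). The alias `rig`, the gateway `me@gateway.example.net` and the LAN IP 10.20.30.40 are placeholders, not values from this post; `m1` is the nvOC user named elsewhere in the thread.

```shell
#!/bin/bash
# Print an ssh_config stanza that reaches a rig through one intermediate host.
# Every argument is a placeholder supplied by the caller.
jump_stanza() {
  local name="$1" rig_ip="$2" rig_user="$3" gateway="$4"
  printf 'Host %s\n' "$name"
  printf '    HostName %s\n' "$rig_ip"
  printf '    User %s\n' "$rig_user"
  printf '    ProxyJump %s\n' "$gateway"
}

# Example: review the output, then add it to ~/.ssh/config by hand.
jump_stanza rig 10.20.30.40 m1 me@gateway.example.net
```

With a stanza like that in place, `ssh rig` from outside would hop through the gateway without forwarding any port straight to the rig.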
|
|
|
|
salfter
|
|
July 05, 2017, 03:04:22 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg. There's also /var/log/messages, but that tends to be less useful for hardware errors.
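A rough sketch of that log check, for anyone who would rather filter than scroll. The pattern list is an assumption built from the kinds of messages quoted in this thread (NVIDIA Xid lines, PCIe bus errors, kernel oopses), and the function name `hw_errors` is made up for illustration. Which file holds the messages depends on the install, so the path in the example comment is a guess too.

```shell
#!/bin/bash
# Print the last N lines of a kernel log that look like GPU or PCIe trouble.
# The patterns are assumptions based on messages quoted in this thread.
hw_errors() {
  local logfile="$1" count="${2:-20}"
  grep -E 'NVRM: Xid|PCIe Bus Error|AER:|BUG:|Oops' "$logfile" | tail -n "$count"
}

# Example (path is an assumption; adjust for your install):
#   hw_errors /var/log/kern.log 40
```

For the kernel's current ring buffer, `dmesg | tail -n 50` shows the most recent messages without touching any file.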
|
|
|
|
UberDaemon
Newbie
Offline
Activity: 51
Merit: 0
|
|
July 05, 2017, 03:11:33 AM Last edit: July 05, 2017, 03:30:01 AM by UberDaemon |
|
You could configure your router to forward a port other than 22 to port 22 on your mining rig. I haven't bothered with that with mine, though; I can ssh into my FreeNAS media server (or my desktop, if it's booted into Linux...can RDP into it if it's running Windows and set it to reboot into Linux) from outside and then ssh into the mining rig from there.
Yes, I posted about this earlier, but the concern is that it would leave nvOC's SSH daemon open to the WAN, running with a default password, for those of us who don't have another SSH daemon on our LAN to use as an intermediary. Someone could wreak all sorts of havoc if they had access to a Linux box on your local network to use as a launching point, so I personally would want to have my own unique password set before I forward any ports to nvOC. I have a feeling there would be some extra steps involved if one were to change the password for the m1 user on nvOC, since oneBash runs commands that require escalation, but I'm not sure where oneBash gets the m1 user's password from when it's executing commands. I'm sure OP can clarify this when he gets caught up on posts.
PS, Fullzero, I'm really liking v0017 so far. Excellent work!!
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 04:21:52 AM |
|
Seems like the 6-pin powered risers didn't solve the issue. The rig did run stable for its longest period so far; I think it was something over 24 hrs. Unplugged 1 GPU to see whether 6x GPUs is causing it to crash... Every GPU was on a separate cable from the PSU but it still crashed... Out of ideas now
EDIT: Now my x4 1070 rigs are crashing too. Same shit all over again, Either GPU1/2/3/4 has stopped working bla bla, crashes the whole rig... Can anyone provide me with a solution?
Also, I've got a Gigabyte H110-D3A on those 1070 rigs, could that be the issue? Put one as a test back on AsRock to see for a test, will let ya know
EDIT2: Yep, AsRock didnt make a difference. Sometimes it crashes with a freeze
I haven't tested an H110 chipset. Your problem might be related to chipset differences. If this is the problem, running the software updater might solve it.
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 04:26:16 AM |
|
Hi, please help! I have got stuck on this problem.
My configuration:
- ASUS PRIME Z270-P - 2. I tried both; results are similar.
- EVGA GeForce GTX 1080 GAMING ACX 3.0 - 2
- MSI GeForce GTX 1080 Gaming X - 2
- A Gigabyte 1200 watt power supply unit
Three video cards work perfectly in any combination:
m1@m1-desktop:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-43453088-0fca-9442-106d-7594d157ebf2)
GPU 1: GeForce GTX 1080 (UUID: GPU-d099b67e-f204-66fa-96dc-365a6b559a7e)
GPU 2: GeForce GTX 1080 (UUID: GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac)
m1@m1-desktop:~$ lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
but if I add the fourth (in this case the ID GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac), then the system crashes. Here is what I see in dmesg:
[ 98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[ 98.769072] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769117] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769144] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769169] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769193] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769217] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769241] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.359255] nvidia-modeset: Allocated GPU:1 (GPU-5c9c8e29-a088-90a6-2a20-b2b2b971d1fb) @ PCI:0000:05:00.0
[ 99.398991] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399035] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399063] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399087] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399112] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399136] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399160] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.984670] nvidia-modeset: Allocated GPU:2 (GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac) @ PCI:0000:06:00.0
[ 100.619118] nvidia-modeset: Allocated GPU:3 (GPU-d099b67e-f204-66fa-96dc-365a6b559a7e) @ PCI:0000:03:00.0
[ 100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[ 100.743162] NVRM: GPU Board Serial Number:
[ 100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[ 100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004
[ 102.432593] r8169 0000:07:00.0 enp7s0: link up
[ 102.432600] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[ 103.743306] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ 103.773941] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.501795] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 105.501798] Bluetooth: BNEP filters: protocol multicast
[ 105.501802] Bluetooth: BNEP socket layer initialized
[ 105.613048] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.613106] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704570] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704972] BUG: unable to handle kernel paging request at ffff88167153d830
[ 105.704974] IP: [<ffffffffc0262880>] _nv008171rm+0x620/0x780 [nvidia]
[ 105.705052] PGD 220c067 PUD 0
[ 105.705053] Oops: 0000 [#1] SMP
I have been trying to solve this problem for three days. I changed BIOS versions (0325, 0608, 0610) and risers, 4G control is enabled, and I have updated the NVIDIA drivers to 381.22; nothing helps. Maybe somebody will have ideas?
My guess is your mobo is trying to use / is using SLI. Are you using an M2 SSD? There should be some setting in the BIOS related to SLI; disable it. What slots are you using, and are you using risers? If so, on which GPUs? If you are using risers, how are they powered?
Hi, no, I don't use an M2 SSD. I use risers of version 006s with the molex socket. I managed to solve the problem. I modified /etc/default/grub:
m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
# info -f grub -n 'Simple configuration'
GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""
sudo update-grub
I also created the file disable-nouveau.conf, which contains two lines:
m1@m1-desktop:/etc/modprobe.d$ more /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0
sudo reboot
Were you connecting the monitor directly to the mobo? Not sure why else nouveau would be used.
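After a change like the GRUB and modprobe edits quoted above, one way to confirm it took effect after the reboot is to look at the running kernel's command line and the loaded module list. A minimal sketch, assuming a Linux system with /proc mounted; the helper name `has_kernel_param` is made up for illustration.

```shell
#!/bin/bash
# Return 0 if a kernel command line string contains the given parameter
# as a whole word, 1 otherwise.
has_kernel_param() {
  local cmdline="$1" param="$2"
  case " $cmdline " in
    *" $param "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether the running kernel was booted with nouveau.modeset=0.
if [ -r /proc/cmdline ]; then
  if has_kernel_param "$(cat /proc/cmdline)" "nouveau.modeset=0"; then
    echo "nouveau.modeset=0 is on the kernel command line"
  else
    echo "nouveau.modeset=0 is NOT on the kernel command line"
  fi
fi

# Report whether the nouveau module is currently loaded.
if [ -r /proc/modules ] && grep -q '^nouveau ' /proc/modules; then
  echo "nouveau module is loaded"
else
  echo "nouveau module is not loaded"
fi
```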
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 04:28:16 AM |
|
My rig crashed from having the settings too high, it went down when I was asleep. I rebooted it and it's up and running but I'm getting a low disk space warning. What file / directory do I delete ?
run this code line and you are golden on space
sudo apt-get purge $(dpkg -l linux-{image,headers}-"[0-9]*" | awk '/ii/{print $2}' | grep -ve "$(uname -r | sed -r 's/-[a-z]+//')")
that worked. Thank you so much!
Thanks for helping xleejohnx
gig410 what version are you using?
Im using 0017, sorry for the late response.
Did you add a lot of additional programs; ~ 2gb or more?
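Before purging anything, it may help to see what is actually taking up the space. The sketch below lists the largest entries directly under a directory. It is only an illustration: the function name `largest_entries` is made up, and pointing it at /var/log is a guess prompted by the 1.7 GB kern.log reported elsewhere in this thread, not something confirmed as the cause here.

```shell
#!/bin/bash
# List the N largest entries directly under a directory, biggest last.
# Sizes are in kilobytes (du -k) so the sort is plain numeric.
largest_entries() {
  local dir="$1" count="${2:-10}"
  du -k -s "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}

# Example: largest_entries /var/log 5
```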
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 04:35:15 AM |
|
Seems like 6x pin powered risers solved my issue with 1050ti's crashing. Thanks a lot @fullzero and others
Now, I'm interested, is there a way to see all rigs on API and to be able to see that from outside network? If so, how to configure it with router? I got a MikroTik behind the 24-port switch.
Best way to do this is to set up an OpenVPN server on the network and allow clients onto the same subnet. Once you connect to the VPN, the connection will act just as if you were on the home network. It will also be secure if you use a higher level of encryption like AES256-CBC.
You could just use SSH for this if you don't want to set up a VPN server, as SSH also supports AES-256 encryption and, set up properly, is every bit as secure as a VPN, plus it's already running! The only config required would be to apply a static DHCP lease in your router so each miner always has the same LAN IP assigned to it, and to forward the appropriate port(s) in your router. For instance, you could set an unused incoming WAN port like 2222 to forward all inbound traffic on that port to LAN port 22 (the default SSH port) on LAN IP 10.20.30.40, if that were the LAN IP for your nvOC rig. If you have multiple rigs, WAN port 2222 forwards to port 22 on 10.20.30.40, WAN port 2223 forwards all incoming traffic to LAN port 22 on IP 10.20.30.41, etc. My only concern here, though, is that I would want to change the default password (miner1) before opening up an outside port to nvOC's SSH daemon, as a clever hacker might scan your WAN IP (which is a thing; bored or malicious people do this), find that open port, and get lucky by trying "miner1" as a password. Changing the system password is as simple as running passwd from guake/SSH, but I wouldn't recommend doing that until OP can give some guidance on whether it will cause problems within oneBash. Most of the commands executed in oneBash require privilege escalation, and I don't know where it finds the "miner1" password. OP, can you shed any light on that? Is it okay to change the password for the m1 user without editing anything else? I don't see it inside oneBash itself.
I haven't tested everything after changing the m1 or root password, but you should be able to do it without issue. You would want to make sure you also change the root password in addition to the m1 password if you make a static route as described. You should probably also change the SSH keys in seahorse as well.
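The port numbering scheme described above (2222 for the first rig, 2223 for the second, and so on) can be written out as a small planning helper. This is only a sketch: it prints the mapping and the matching ssh command, it does not configure any router, and `YOUR_WAN_IP` and the function name `port_map` are placeholders made up for illustration.

```shell
#!/bin/bash
# Print one "WAN port -> LAN IP:22" line per rig, starting at a base WAN port.
# The base port and IPs are placeholders; the actual forwards still have to
# be entered in the router by hand.
port_map() {
  local wan_port="$1"
  shift
  local ip
  for ip in "$@"; do
    printf 'WAN %s -> %s:22   (ssh -p %s m1@YOUR_WAN_IP)\n' "$wan_port" "$ip" "$wan_port"
    wan_port=$((wan_port + 1))
  done
}

port_map 2222 10.20.30.40 10.20.30.41
```

Changing the default password first, as discussed above, still applies before any of these ports are opened.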
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 04:39:11 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
thanks in advance
Look at the syslog: click the Ubuntu button at the top left, type: sy, and then click on System Log.
|
|
|
|
gig410
Newbie
Offline
Activity: 14
Merit: 0
|
|
July 05, 2017, 04:41:03 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg. There's also /var/log/messages, but that tends to be less useful for hardware errors.
I have a keyboard and monitor connected to the rig for now. I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. These are the messages:
m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
and
m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
once in a while I get this
m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8
no idea what those mean
|
|
|
|
gig410
Newbie
Offline
Activity: 14
Merit: 0
|
|
July 05, 2017, 04:48:29 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
thanks in advance
Look at the syslog: click the Ubuntu button at the top left, type: sy, and then click on System Log.
When I do that it gives me a stream of those messages in my previous post.
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 04:59:31 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg. There's also /var/log/messages, but that tends to be less useful for hardware errors.
I have a keyboard and monitor connected to the rig for now. I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. These are the messages:
m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
and
m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
once in a while I get this
m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8
no idea what those mean
What kind of risers are you using? Have you checked to ensure they are fully seated in the PCIe ports?
|
|
|
|
fullzero (OP)
Legendary
Offline
Activity: 1260
Merit: 1009
|
|
July 05, 2017, 05:02:25 AM |
|
First of all big thank you to fullzero and everyone contributing to this distro!
I've been struggling with the Genoil crash issue and lack of watchdog implementation for the past few days and I have a bandaid solution that seems to be actually working quite well, perhaps it can help others in the community:
Essentially you need to split the Genoil output to a file, grep it (we only care about 'error' instances), and then use that output as input for a monitoring script that kills and restarts the misbehaving process.
So we have 2 scripts launched in screen as daemons, the "ltail" script and the "ett" script:
$ screen -dmS ltail bash ~/eth/Genoil-U/ltail
and
$ screen -dmS ett bash ~/ett
ltail:
--------------------------
#!/bin/bash
# Watch err.log; on each new "error" line, log a timestamp and restart the miner.
echo listening...
cd ~/eth/Genoil-U/ || exit 1
tail -fn0 err.log | \
while read -r line ; do
  DATE=$(date +%d-%m-%Y" "%H:%M:%S)
  # Act only when the line really contains "error".
  if echo "$line" | grep -q "error"
  then
    echo "$DATE $line" | tee -a ~/eth/Genoil-U/timestamp.log
    kill $(ps aux | grep '[e]thminer' | awk '{print $2}')
    sleep 1
    screen -dmS ett bash ~/ett
  fi
done
-------------------------
ett:
-------------------------
#!/bin/bash
# Run the miner and copy every output line containing "error" into err.log.
cd ~/eth/Genoil-U || exit 1
./ethminer -U -F eth-us.dwarfpool.com:80/0xBEbd092a03827C37B75cd4ea314b207AA65c348f/208 2>&1 | tee >(grep error --color=never --line-buffered | tee -a err.log)
-------------------------
Finally, I also send the output of ltail to timestamp.log to track how many times Genoil fails per hour. Roughly aiming at 1 crash per hour, this gives me about 130 MH/s out of 5x GTX 1060, which is a good 20+ MH/s higher than Claymore. Most importantly, it gives stable hashing despite the OC-introduced errors. The recovery is literally seconds. Oh yeah, and I also run $ tail -f ~/eth/Genoil-U/timestamp.log in a screen, as well as watch -n 5 'sensors |grep Core' in another screen, to fine-tune the OC vs. crashes per hour vs. temp. Hope this helps, and I hope the message is not too chaotic. Cheers!
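The "crashes per hour" figure mentioned above can be pulled out of timestamp.log with a short helper. This is a sketch only: it assumes each line starts with the "DD-MM-YYYY HH:MM:SS" stamp that the ltail script writes, and the function name `restarts_per_hour` is made up for illustration.

```shell
#!/bin/bash
# Count restarts per hour from a log whose lines start with
# "DD-MM-YYYY HH:MM:SS", the format written by the ltail script.
restarts_per_hour() {
  awk '{ split($2, t, ":"); n[$1 " " t[1] ":00"]++ }
       END { for (k in n) print k, n[k] }' "$1" | sort
}

# Example: restarts_per_hour ~/eth/Genoil-U/timestamp.log
```

Each output line is a date, the hour, and the number of restarts logged in that hour. Because the date is day-first, the plain sort is only chronological within a single month.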
BTC: 13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7 ETH: 0xBEbd092a03827C37B75cd4ea314b207AA65c348f
Very nice I will probably add some version of this in a later version. I will include your donation address and ensure you are credited.
|
|
|
|
car1999
|
|
July 05, 2017, 05:25:04 AM |
|
Also, I couldn't find how I can see the current mining process. I did see the screen -r commands, but that implies killing the current process and restarting it. I'd like to be able to see, from SSH, the current mining process without killing it. Is this possible?
If you want to monitor the mining process via screen you're going to have to kill the initial gnome-terminal. There's no way around that, as screen can only reconnect to an existing screen session. This shouldn't be a big deal if you have a stable rig. You only need to do it once per reboot. My process is:
1. From my desktop where I monitor my rigs I initiate a constant ping: ping -t 10.20.30.40 # substitute your rig's IP; find it in your router, or by running nmap on your LAN subnet, or by running ifconfig from a guake terminal on the rig if you have a monitor connected
2. Boot the rig
3. Wait until I begin to get ping responses from the rig, thus indicating Ubuntu has booted and the rig has network connectivity
4. SSH into the rig (user: m1 password: miner1)
5. Initiate a screen session: screen -S [name for your rig, make one up or call it "rig"]
6. Start nvidia-smi dmon to watch for the mining process to begin (by waiting until this happens you know OC settings, fan speed settings, etc have been applied. Running those commands from within screen isn't 100% consistent IME, as I always saw error messages when I tried it that way. It's best to let those settings commands run from gnome-terminal as Ubuntu first boots IMO).
7. Wait until you see wattage go up and GPU utilization go up to 100% (which indicates that the oneBash script concluded and opened the mining process). Exit nvidia-smi with CTRL + c
8. Find the PID for gnome-terminal: ps aux | grep gnome-terminal
9. Kill it: kill [PID from step 8]
10. Restart mining: bash '/media/m1/1263-A96E/oneBash'
It might seem like a lot of steps, but it takes all of 120 seconds and you shouldn't need to do it very often once your rig is dialed in. You're losing maybe 1 minute's worth of hashes on average every week? Pretty negligible considering the convenience of monitoring from another workstation, and you're not using up system resources by using Teamviewer. This also lets you go completely headless if you buy a dummy HDMI plug. I just updated from 16 to 17 and didn't need to haul my extra monitor upstairs to do it. Easy peasy.
Run export DISPLAY=:0 before step 5; if not, step 10 throws an error.
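Steps 8 and 9 can be folded together with a small filter that pulls the PID column out of `ps aux` output. A sketch, not part of nvOC: the function name `pids_of` is made up, and it assumes the standard `ps aux` layout where the PID is the second column.

```shell
#!/bin/bash
# Read `ps aux` style output on stdin and print the PIDs (column 2) of lines
# that mention the given name, skipping any grep process in the listing.
pids_of() {
  local name="$1"
  awk -v n="$name" 'index($0, n) && !/grep/ { print $2 }'
}

# Example: ps aux | pids_of gnome-terminal
# On systems with procps, `pgrep -f gnome-terminal` does a similar job.
```

Check the printed PIDs before passing them to kill, since anything with the name in its command line will match.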
|
|
|
|
gig410
Newbie
Offline
Activity: 14
Merit: 0
|
|
July 05, 2017, 05:37:17 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg. There's also /var/log/messages, but that tends to be less useful for hardware errors.
I have a keyboard and monitor connected to the rig for now. I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. These are the messages:
m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
and
m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
once in a while I get this
m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8
no idea what those mean
What kind of risers are you using? Have you checked to ensure they are fully seated in the PCIe ports?
Just checked whether they are seated correctly on the motherboard and on the cards, and they are. I ran an lspci command and it looks like id a2eb is the first GPU on the rig; it has its own power cord to the power supply on the card and on the riser. The card does work but it has these errors.
|
|
|
|
gig410
Newbie
Offline
Activity: 14
Merit: 0
|
|
July 05, 2017, 06:10:30 AM |
|
So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it crash; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?
ssh in and look at the tail end of /var/log/dmesg. I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up. The errors show up toward the end of /var/log/dmesg. There's also /var/log/messages, but that tends to be less useful for hardware errors.
I have a keyboard and monitor connected to the rig for now. I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. These are the messages:
m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
and
m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
once in a while I get this
m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8
no idea what those mean
What kind of risers are you using? Have you checked to ensure they are fully seated in the PCIe ports?
Just checked whether they are seated correctly on the motherboard and on the cards, and they are. I ran an lspci command and it looks like id a2eb is the first GPU on the rig; it has its own power cord to the power supply on the card and on the riser. The card does work but it has these errors.
Looks like I was wrong about a2eb being the first GPU. I removed the GPU completely and I'm still getting these errors as soon as I boot; it won't even go into the GUI any more.
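To see which PCIe port is producing the corrected errors, and how often, the AER lines can be counted per address. A sketch under the assumption that the log lines look like the ones quoted above; the function name `aer_by_port` is made up for illustration. For what it's worth, 8086 is Intel's PCI vendor ID (NVIDIA's is 10de), so a device reported as [8086:a2eb] is an Intel part on the board rather than a GPU, which fits the finding that the errors continue with the card removed.

```shell
#!/bin/bash
# Count "AER: Corrected error received" lines per PCIe port in a kernel log.
# Output: one "address count" line per port.
aer_by_port() {
  grep 'AER: Corrected error received' "$1" |
    sed -E 's/.*pcieport ([0-9a-f:.]+): AER.*/\1/' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: aer_by_port /var/log/kern.log
# To see what sits at a reported address: lspci -nn -s 00:1b.0
```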
|
|
|
|
S9k
Newbie
Offline
Activity: 26
Merit: 0
|
|
July 05, 2017, 06:12:18 AM |
|
Hi, please help! I am stuck on this problem.

My configuration:
- ASUS PRIME Z270-P - 2. I tried both, results are similar.
- EVGA GeForce GTX 1080 GAMING ACX 3.0 - 2
- MSI GeForce GTX 1080 Gaming X - 2
- Gigabyte 1200 W power supply unit

Three video cards work perfectly in any combination:

m1@m1-desktop:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-43453088-0fca-9442-106d-7594d157ebf2)
GPU 1: GeForce GTX 1080 (UUID: GPU-d099b67e-f204-66fa-96dc-365a6b559a7e)
GPU 2: GeForce GTX 1080 (UUID: GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac)
m1@m1-desktop:~$
m1@m1-desktop:~$ lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
m1@m1-desktop:~$

But if I add the fourth (in this case the ID GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac), then the system crashes. Here is what I see in dmesg:

[ 98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[ 98.769072] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769117] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769144] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769169] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769193] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769217] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 98.769241] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.359255] nvidia-modeset: Allocated GPU:1 (GPU-5c9c8e29-a088-90a6-2a20-b2b2b971d1fb) @ PCI:0000:05:00.0
[ 99.398991] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399035] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399063] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399087] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399112] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399136] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.399160] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[ 99.984670] nvidia-modeset: Allocated GPU:2 (GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac) @ PCI:0000:06:00.0
[ 100.619118] nvidia-modeset: Allocated GPU:3 (GPU-d099b67e-f204-66fa-96dc-365a6b559a7e) @ PCI:0000:03:00.0
[ 100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[ 100.743162] NVRM: GPU Board Serial Number:
[ 100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[ 100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004
[ 102.432593] r8169 0000:07:00.0 enp7s0: link up
[ 102.432600] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[ 103.743306] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[ 103.773941] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.501795] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 105.501798] Bluetooth: BNEP filters: protocol multicast
[ 105.501802] Bluetooth: BNEP socket layer initialized
[ 105.613048] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.613106] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704570] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[ 105.704972] BUG: unable to handle kernel paging request at ffff88167153d830
[ 105.704974] IP: [<ffffffffc0262880>] _nv008171rm+0x620/0x780 [nvidia]
[ 105.705052] PGD 220c067 PUD 0
[ 105.705053] Oops: 0000 [#1] SMP

I have been trying to solve this for three days. I changed BIOS versions (0325, 0608, 0610) and risers, the 4G setting is enabled, and I updated the NVIDIA drivers to 381.22. Nothing helps. Maybe somebody has ideas?

My guess is your mobo is trying to use / is using SLI. Are you using an M2 SSD? There should be some setting in the BIOS related to SLI; disable it. Which slots are you using, and are you using risers? If so, on which GPUs, and how are they powered?

Hi, no, I don't use an M2 SSD. I use version 006s risers with the molex socket.

I managed to solve the problem. I modified /etc/default/grub:

m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'
GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""

sudo update-grub

I also created the file disable-nouveau.conf, which contains two lines:

m1@m1-desktop:/etc/modprobe.d$ more /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0

sudo reboot

Were you connecting the monitor directly to the mobo? Not sure why else nouveau would be used.

My monitor is connected to GPU0.

m1@m1-desktop:~$ nvidia-smi
Wed Jul 5 02:00:52 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 75%   65C    P2   181W / 180W |    319MiB /  8113MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
| 75%   60C    P2   180W / 180W |    141MiB /  8114MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:05:00.0     Off |                  N/A |
| 75%   72C    P2   165W / 180W |    141MiB /  8114MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:06:00.0     Off |                  N/A |
| 75%   71C    P2   166W / 180W |    141MiB /  8114MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

I stayed on the 381.22 drivers; they seem to me to perform better.
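For anyone hitting the same thing, here is a small Python sketch of two checks that go with this fix: tallying the NVRM Xid events per PCI address from dmesg text, and confirming that nouveau is no longer loaded after the blacklist and reboot. The function names `xid_events` and `nouveau_loaded` are invented for this example, the Xid pattern is based only on the lines in the post above, and the module check assumes the usual /proc/modules layout where the module name is the first field on each line.

```python
import re
from collections import defaultdict

# Based on lines such as:
#   NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+)")


def xid_events(dmesg_text):
    """Return {pci_address: {xid_code: count}} for every Xid in the text.

    finditer() scans the whole string, so it still works when several
    kernel messages have been pasted together on one line.
    """
    events = defaultdict(lambda: defaultdict(int))
    for match in XID_RE.finditer(dmesg_text):
        events[match["bus"]][int(match["code"])] += 1
    return {bus: dict(codes) for bus, codes in events.items()}


def nouveau_loaded(proc_modules_text):
    """True if a module named exactly 'nouveau' appears in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "nouveau":
            return True
    return False


# Demo on two of the Xid lines from the post, run together as they were pasted.
SAMPLE = (
    "[ 100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 "
    "00000801 00000004 00000005 "
    "[ 100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 "
    "00000004 00000005 00000004"
)
print(xid_events(SAMPLE))
```

On the rig itself the inputs would come from the output of `dmesg` and from reading `/proc/modules`. The Xid number can then be looked up in NVIDIA's Xid error documentation; this sketch only counts the events and does not interpret them.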
|
|
|
|
salfter
|
|
July 05, 2017, 06:15:20 AM |
|
m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0: device [8086:a2eb] error status/mask=00000001/00002000
no idea what those mean
Those look like the errors I was getting with some crappy PCIe extenders I had recently ordered. Here's a closeup of the inadequate soldering on the bit that goes in the slot; the other end of the riser is probably similar. I'll be sending these back. There were no reviews when I bought them, but since then someone else has left a 1-star review. I've bought these as replacements, at fullzero's recommendation.
|
|
|
|
|