Bitcoin Forum
March 28, 2024, 06:42:27 PM *
Author Topic: [OS] nvOC easy-to-use Linux Nvidia Mining  (Read 417927 times)
UberDaemon
Newbie
*
Offline Offline

Activity: 51
Merit: 0


View Profile
July 05, 2017, 02:34:47 AM
 #1521

Sort of a broad question here, but does anyone have any suggestions as to why my mobo won't turn on? Everything seems like it's connected properly; I probably screwed up the switch or the pins for the switch. Just wondering if anyone encounters semi-generic problems or common issues? It's a 270 board.

If your mobo is sitting out on a table/etc and isn't inside a case take a photo, upload it to imgur, and post it here.  That might help.

Off the top of my head: did you connect the 24 pin ATX power AND the 8 pin CPU power?  Are you sure you have your power switch connected to the proper headers/pins?  If you look closely at the motherboard front panel pins there's a legend to show which pins correspond to power, reset, HDD LED, etc.  Remove your front panel connectors/power switch and try taking a flathead screwdriver and touching it to the 2 power pins simultaneously for about a half second (creating a short between them which is what your power button does) and see if it powers up.

Last but not least... is the toggle switch on your power supply turned on?  Stranger things have happened ;)
xleejohnx
Hero Member
*****
Offline Offline

Activity: 672
Merit: 500


View Profile
July 05, 2017, 02:38:08 AM
 #1522

Sort of a broad question here, but anyone have any suggestions to why my mobo wont turn on? everything seems like its connected properly, probably screwed the switch up or the pins for the switch. just wondering if anyone encounters semi generic problems or common issues? its a 270 board.

If your mobo is sitting out on a table/etc and isn't inside a case take a photo, upload it to imgur, and post it here.  That might help.

Off the top of my head: did you connect the 24 pin ATX power AND the 8 pin CPU power?  Are you sure you have your power switch connected to the proper headers/pins?  If you look closely at the motherboard front panel pins there's a legend to show which pins correspond to power, reset, HDD LED, etc.  Remove your front panel connectors/power switch and try taking a flathead screwdriver and touching it to the 2 power pins simultaneously for about a half second (creating a short between them which is what your power button does) and see if it powers up.

Last but not least... is the toggle switch on your power supply turned on?  Stranger things have happened ;)

I forgot one time to plug in the CPU power.
I would try starting it up with nothing in it and go from there
Clear the cmos. Little things add up

As I see a super coin as the super highway and alt coins as taxis and trucks needed to move transactions. ~philipma1957
gig410
Newbie
*
Offline Offline

Activity: 14
Merit: 0


View Profile
July 05, 2017, 02:47:20 AM
 #1523

So my rig crashed again; it was up for about 19 hours with the current settings. The previous time it crashed I had not been able to see it happen; I just knew because the screen was blank and the fans on the GPUs went up to 100 percent. This time I was sitting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is whether there is some kind of log that could be looked at to see what caused the crash, or whether one can be enabled that only keeps the last hour of activity?

thanks in advance
salfter
Hero Member
*****
Offline Offline

Activity: 651
Merit: 501


My PGP Key: 92C7689C


View Profile WWW
July 05, 2017, 03:01:01 AM
 #1524

I went a step lower than Pentium on my 2 rigs and bought $50 G3930 Celeron processors since I am only GPU mining.  They run nvOC quite stably (I just returned from a 5 day vacation and both of my rigs that were running v0016 stayed up the entire time I was gone). 

Mine's a Celeron G3920...Skylake vs. Kaby Lake.  The motherboard I was using the first few weeks (a Biostar Racing Z170GT7) might not have shipped with a BIOS that supported Kaby Lake CPUs out of the box.  That board conked out (was an open-box purchase), so I sent it back and am now running an Asus Prime Z270-AR (only difference between it and the Z270-A referenced in the OP is a lack of DisplayPort and DVI ports, AFAIK).

Quote
Granted, I am not using Teamviewer like some folks here.  That will consume more system resources.  I can do everything I need with SSH and the screen command if I'm at home.  I did leave one of my windows workstations online while I was gone so I could teamviewer into that and from there SSH into my rigs if necessary, but luckily I had no need to. 

You could configure your router to forward a port other than 22 to port 22 on your mining rig.  I haven't bothered with that with mine, though; I can ssh into my FreeNAS media server (or my desktop, if it's booted into Linux...can RDP into it if it's running Windows and set it to reboot into Linux) from outside and then ssh into the mining rig from there.  Never used Teamviewer; tried accessing the mining rig with both RDP and VNC, and neither worked.  SSH works better for this purpose anyway, once you're familiar with it.

Tipjars: BTC 1TipsGocnz2N5qgAm9f7JLrsMqkb3oXe2 LTC LTipsVC7XaFy9M6Zaf1aGGe8w8xVUeWFvR | My Bitcoin Note Generator | Pool Auto-Switchers: zpool MiningPoolHub NiceHash
Bitgem Resources: Pool Explorer Paper Wallet
salfter
Hero Member
*****
Offline Offline

Activity: 651
Merit: 501


My PGP Key: 92C7689C


View Profile WWW
July 05, 2017, 03:04:22 AM
 #1525

So my rig crashed again, it was up for about 19 hours with the current settings. The previous it crashed I had not been able to see it crash, I just knew it because the screen was blank and the fans on the gpus went up to 100 percent. This time I was siting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is if there is some kind of log that could be looked at to see what caused the crash or can one be enabled that only keeps the last one hour of activity?

ssh in and look at the tail end of /var/log/dmesg.  I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up.  The errors show up toward the end of /var/log/dmesg.

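There's also /var/log/messages, but that tends to be less useful for hardware errors.  (On Ubuntu-based images such as nvOC that file may not exist; /var/log/syslog and /var/log/kern.log play the same role.)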
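A minimal sketch of that log check, assuming an Ubuntu-style layout where kernel messages land in /var/log/kern.log. The function name gpu_errors and the list of patterns are illustrative only and are not part of nvOC:

```shell
# Hypothetical helper (not part of nvOC): show the newest GPU/PCIe-related
# kernel messages from a log file.
#   $1 = log file (default: /var/log/kern.log, an assumption about the distro)
#   $2 = how many matching lines to show (default: 20)
gpu_errors() {
    grep -E 'NVRM: Xid|PCIe Bus Error|AER: |BUG: unable to handle' \
        "${1:-/var/log/kern.log}" | tail -n "${2:-20}"
}
```

After a crash and reboot, something like gpu_errors /var/log/kern.log.1 50 would look at the rotated log from before the restart, if one exists.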

Tipjars: BTC 1TipsGocnz2N5qgAm9f7JLrsMqkb3oXe2 LTC LTipsVC7XaFy9M6Zaf1aGGe8w8xVUeWFvR | My Bitcoin Note Generator | Pool Auto-Switchers: zpool MiningPoolHub NiceHash
Bitgem Resources: Pool Explorer Paper Wallet
UberDaemon
Newbie
*
Offline Offline

Activity: 51
Merit: 0


View Profile
July 05, 2017, 03:11:33 AM
Last edit: July 05, 2017, 03:30:01 AM by UberDaemon
 #1526

You could configure your router to forward a port other than 22 to port 22 on your mining rig.  I haven't bothered with that with mine, though; I can ssh into my FreeNAS media server (or my desktop, if it's booted into Linux...can RDP into it if it's running Windows and set it to reboot into Linux) from outside and then ssh into the mining rig from there.  

Yes, I posted about this earlier, but the concern is that would leave nvOC's SSH daemon open to the WAN running with a default password for those of us who don't have another SSH daemon on our LAN to use as an intermediary.  Someone could wreak all sorts of havoc if they had access to a Linux box on your local network to use as a launching point, so I personally would want to have my own unique password set before I forward any ports to nvOC.  I have a feeling there would be some extra steps involved if one were to change the password for the m1 user on nvOC since oneBash runs commands that require escalation, but I'm not sure where oneBash gets the m1 user's password from when it's executing commands.  I'm sure OP can clarify this when he gets caught up on posts.

PS, Fullzero, I'm really liking v0017 so far.  Excellent work!!
fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 04:21:52 AM
 #1527

Seems like the 6-pin powered risers didn't solve the issue. The rig did run stable for its longest period so far, I think it was something over 24 hrs. I unplugged 1 GPU to see if running 6x GPUs was causing it to crash... Every GPU was on a separate cable to the PSU but it still crashed... Out of ideas now

EDIT: Now my x4 1070 rigs are crashing too. Same shit all over again, Either GPU1/2/3/4 has stopped working bla bla, crashes the whole rig... Can anyone provide me with a solution?

Also, I've got a Gigabyte H110-D3A on those 1070 rigs, could that be the issue? Put one as a test back on AsRock to see for a test, will let ya know

EDIT2: Yep, AsRock didnt make a difference. Sometimes it crashes with a freeze

I haven't tested a H110 chipset.  Your problem might be related to chipset differences.  If this is the problem; running software updater might solve it.
fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 04:26:16 AM
 #1528

Hi,

Please help!
I am stuck on this problem.
My configuration:

-ASUS PRIME Z270-P - 2 . I tried both, results are similar.
-EVGA GeForce GTX 1080 GAMING ACX 3.0 - 2
-MSI Geforce GTX 1080 Gaming X-  2
-The Gigabyte power supply unit on 1200 watts


Three video cards work perfectly in any combination:

m1@m1-desktop:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-43453088-0fca-9442-106d-7594d157ebf2)
GPU 1: GeForce GTX 1080 (UUID: GPU-d099b67e-f204-66fa-96dc-365a6b559a7e)
GPU 2: GeForce GTX 1080 (UUID: GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac)
m1@m1-desktop:~$


m1@m1-desktop:~$ lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
m1@m1-desktop:~$



but if I add the fourth (in this case the ID GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac), then the system crashes. Here is what I see in dmesg:


[   98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[   98.769072] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769117] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769144] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769169] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769193] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769217] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769241] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.359255] nvidia-modeset: Allocated GPU:1 (GPU-5c9c8e29-a088-90a6-2a20-b2b2b971d1fb) @ PCI:0000:05:00.0
[   99.398991] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399035] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399063] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399087] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399112] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399136] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399160] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.984670] nvidia-modeset: Allocated GPU:2 (GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac) @ PCI:0000:06:00.0
[  100.619118] nvidia-modeset: Allocated GPU:3 (GPU-d099b67e-f204-66fa-96dc-365a6b559a7e) @ PCI:0000:03:00.0
[  100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[  100.743162] NVRM: GPU Board Serial Number:
[  100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[  100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004

[  102.432593] r8169 0000:07:00.0 enp7s0: link up
[  102.432600] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[  103.743306] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  103.773941] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[  105.501795] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[  105.501798] Bluetooth: BNEP filters: protocol multicast
[  105.501802] Bluetooth: BNEP socket layer initialized
[  105.613048] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[  105.613106] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[  105.704570] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004

[  105.704972] BUG: unable to handle kernel paging request at ffff88167153d830
[  105.704974] IP: [<ffffffffc0262880>] _nv008171rm+0x620/0x780 [nvidia]
[  105.705052] PGD 220c067 PUD 0
[  105.705053] Oops: 0000 [#1] SMP

I have been trying to solve this problem for three days.
I changed BIOS versions (0325, 0608, 0610) and risers, the 4G control option is enabled, and I updated the NVIDIA drivers to 381.22; nothing helps.
Does anybody have any ideas?
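When a dmesg dump like the one above gets long, it can help to tally the NVRM Xid lines per PCI address before reading them one by one. A small sketch, assuming the exact line format shown in this post; xid_summary is a made-up name:

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<count> <pci-address> Xid <code>" for each distinct address/code pair,
# most frequent first.
xid_summary() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-f:.]*\)): \([0-9]*\),.*/\1 Xid \2/p' \
        | sort | uniq -c | sort -rn
}
```

Usage would be along the lines of dmesg | xid_summary. For the log above, every Xid line points at PCI:0000:01:00 with code 56; the meaning of each code is listed in NVIDIA's Xid documentation.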

My guess is your mobo is trying to / is using SLI.  Are you using an M2 ssd?

There should be some setting in the bios related to SLI; disable it / what slots are you using and are you using risers, if so on which GPUs?

If you are using risers; how are they powered?

Hi,
no, I don't use M2 SSD.
I use risers of the version 006s with the molex socket.

I managed to solve the problem. I modified /etc/default/grub:

m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""

sudo update-grub

I also created the file disable-nouveau.conf, which contains two lines:

m1@m1-desktop:/etc/modprobe.d$ more /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0


sudo reboot

Were you connecting the monitor directly to the mobo?

Not sure why else nouveau would be used.
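To confirm that the nouveau blacklist took effect after the reboot, one option is to look at the loaded-module list. A sketch only: module_loaded is a hypothetical helper, and /proc/modules is simply the standard Linux file that lsmod reads.

```shell
# Hypothetical helper: succeed if a kernel module appears in the module list.
#   $1 = module name
#   $2 = module list file (default: /proc/modules)
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

For example, module_loaded nouveau && echo "nouveau is still loaded". If it is still loaded, it may be because the initramfs was built before the blacklist file existed; on Ubuntu, running sudo update-initramfs -u and rebooting is the usual way to refresh it.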
fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 04:28:16 AM
 #1529

My rig crashed from having the settings too high, it went down when I was asleep. I rebooted it and it's up and running but I'm getting a low disk space warning. What file / directory do I delete ?
run this code line and you are golden on space
Code:
sudo apt-get purge $(dpkg -l linux-{image,headers}-"[0-9]*" | awk '/ii/{print $2}' | grep -ve "$(uname -r | sed -r 's/-[a-z]+//')")
that worked. Thank you so much!

Thanks for helping xleejohnx

gig410 what version are you using?

Im using 0017, sorry for the late response.

Did you add a lot of additional programs; ~ 2gb or more?
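Since the purge one-liner quoted above removes kernel packages, it may be worth previewing what it would select before running it. A rough sketch of the same filter as a function that only prints names; old_kernel_packages is a made-up helper, and it assumes package names of the form linux-image-<version>-<flavour>:

```shell
# Hypothetical helper: from a list of package names on stdin, print the kernel
# image/header packages that do NOT belong to the running kernel.
#   $1 = running kernel release, i.e. the output of `uname -r`
old_kernel_packages() {
    # Drop the flavour suffix (e.g. "-generic") so plain header packages
    # for the running kernel are kept as well.
    grep -E '^linux-(image|headers)-[0-9]' \
        | grep -v -F -e "$(printf '%s\n' "$1" | sed 's/-[a-z][a-z]*$//')"
}
```

It could be fed the same way as the original command: dpkg -l 'linux-image-[0-9]*' 'linux-headers-[0-9]*' | awk '/^ii/{print $2}' | old_kernel_packages "$(uname -r)". Passing -s to apt-get (simulate) is another way to see what a purge would do without changing anything.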

fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 04:35:15 AM
 #1530

Seems like 6x pin powered risers solved my issue with 1050ti's crashing. Thanks a lot @fullzero and others

Now, I'm interested, is there a way to see all rigs on API and to be able to see that from outside network? If so, how to configure it with router? I got a MikroTik behind the 24-port switch.

Best way to do this is to setup a OpenVPN into the network and allowing it on the same subnet. Once you VPN, the connection will act just like if you were on the home network. It will also be secure if you use higher level of encryption like AES256-CBC.

You could just use SSH for this if you don't want to set up a VPN server, as SSH can also use AES-256 encryption and is every bit as secure as a VPN, plus it's already running!  The only config required would be to apply a static DHCP lease in your router so each miner always has the same LAN IP assigned to it, and to forward the appropriate port(s) in your router.  For instance, you could set an unused incoming WAN port like 2222 to forward all inbound traffic on that port to LAN port 22 (the default SSH port) on LAN IP 10.20.30.40, if that were the LAN IP for your nvOC rig.  If you have multiple rigs, WAN port 2222 forwards to port 22 on 10.20.30.40, WAN port 2223 forwards to port 22 on 10.20.30.41, etc.  My only concern here is that I would want to change the default password (miner1) before opening up an outside port to nvOC's SSH daemon, as a clever hacker might scan your WAN IP (which is a thing; bored or malicious people do this), find that open port, and get lucky by trying "miner1" as a password.  Changing the system password is as simple as running passwd from guake/SSH, but I wouldn't recommend doing that until OP can give some guidance on whether it will cause problems within oneBash.  Most of the commands executed in oneBash require privilege escalation and I don't know where it finds the "miner1" password.

OP, can you shed any light on that?  Is it okay to change the password for the m1 user without editing anything else?  I don't see it inside oneBash itself.

I haven't tested everything after changing the m1 or root password, but you should be able to do it without issue.  You would want to make sure you also change the root password in addition to the m1 password if you set up port forwarding as described.

You should probably also change the SSH keys in seahorse as well.
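The port scheme from the quoted post (WAN port 2222 to the first rig, 2223 to the second, and so on) can be sketched as two small helpers. BASE_PORT, rig_port and rig_ssh_cmd are all invented names for illustration, and the functions only print text; they do not connect anywhere:

```shell
# Sketch of the forwarding scheme described above: rig 0 -> WAN port 2222,
# rig 1 -> 2223, and so on. BASE_PORT is an arbitrary choice.
BASE_PORT=2222

# Print the WAN port that the router forwards to port 22 on rig number $1 (0-based).
rig_port() {
    echo $(( BASE_PORT + $1 ))
}

# Print the ssh command for rig $1 via WAN address $2 (nothing is executed).
rig_ssh_cmd() {
    echo "ssh -p $(rig_port "$1") m1@$2"
}
```

Before forwarding anything, the passwords would be changed on each rig as discussed: passwd while logged in as m1, and sudo passwd root for the root account.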
fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 04:39:11 AM
 #1531

So my rig crashed again, it was up for about 19 hours with the current settings. The previous it crashed I had not been able to see it crash, I just knew it because the screen was blank and the fans on the gpus went up to 100 percent. This time I was siting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is if there is some kind of log that could be looked at to see what caused the crash or can one be enabled that only keeps the last one hour of activity?

thanks in advance

look at the syslog:

go to ubuntu button top left and enter:

sy

click on system log
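On the "only the last hour" part of the question: the kernel lines carry an uptime stamp in square brackets, so the recent tail of a log can be cut out after the fact. A sketch under the assumption that the file covers a single boot (the stamps restart from zero on every reboot); last_window is a hypothetical name:

```shell
# Hypothetical helper: print only the kernel log lines whose uptime stamp lies
# within the last N seconds before the newest stamp in the file.
#   $1 = log file
#   $2 = window in seconds (default: 3600)
last_window() {
    awk -v w="${2:-3600}" '
        # Pull the bracketed uptime stamp out of a line, e.g. "[105577.938217]"
        function stamp(line,   s) {
            if (match(line, /\[ *[0-9]+\.[0-9]+\]/)) {
                s = substr(line, RSTART + 1, RLENGTH - 2)
                return s + 0
            }
            return -1
        }
        # First pass over the file: remember the newest stamp
        NR == FNR { t = stamp($0); if (t > max) max = t; next }
        # Second pass: print lines that fall inside the window
        { t = stamp($0); if (t >= 0 && t >= max - w) print }
    ' "$1" "$1"
}
```

For example, last_window /var/log/kern.log 3600 would print roughly the final hour before the last logged kernel message.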
gig410
Newbie
*
Offline Offline

Activity: 14
Merit: 0


View Profile
July 05, 2017, 04:41:03 AM
 #1532

So my rig crashed again, it was up for about 19 hours with the current settings. The previous it crashed I had not been able to see it crash, I just knew it because the screen was blank and the fans on the gpus went up to 100 percent. This time I was siting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is if there is some kind of log that could be looked at to see what caused the crash or can one be enabled that only keeps the last one hour of activity?

ssh in and look at the tail end of /var/log/dmesg.  I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up.  The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8


no idea what those mean
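Going by their own text, these are PCIe Advanced Error Reporting (AER) messages: the port at 0000:00:1b.0 is reporting physical-layer receiver errors that the hardware corrected. A quick tally per port and severity shows whether a single slot is responsible for the flood. This is a sketch that assumes the kern.log format pasted above; aer_summary is a made-up name:

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<count> <port> <severity>" for each distinct port/severity pair.
aer_summary() {
    sed -n 's/.*pcieport \([0-9a-f:.]*\): PCIe Bus Error: severity=\([A-Za-z]*\).*/\1 \2/p' \
        | sort | uniq -c | sort -rn
}
```

Something like aer_summary < /var/log/kern.log would do; with a 1.7 GB file it may take a while to run.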
gig410
Newbie
*
Offline Offline

Activity: 14
Merit: 0


View Profile
July 05, 2017, 04:48:29 AM
 #1533

So my rig crashed again, it was up for about 19 hours with the current settings. The previous it crashed I had not been able to see it crash, I just knew it because the screen was blank and the fans on the gpus went up to 100 percent. This time I was siting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is if there is some kind of log that could be looked at to see what caused the crash or can one be enabled that only keeps the last one hour of activity?

thanks in advance

look at the syslog:

go to ubuntu button top left and enter:

sy

click on system log

when I do that it gives me a stream of those messages in my previous post
fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 04:59:31 AM
 #1534

So my rig crashed again, it was up for about 19 hours with the current settings. The previous it crashed I had not been able to see it crash, I just knew it because the screen was blank and the fans on the gpus went up to 100 percent. This time I was siting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is if there is some kind of log that could be looked at to see what caused the crash or can one be enabled that only keeps the last one hour of activity?

ssh in and look at the tail end of /var/log/dmesg.  I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up.  The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8


no idea what those mean


What kind of risers are you using?

Have you checked to ensure they are fully seated in the pcie ports?
fullzero (OP)
Legendary
*
Offline Offline

Activity: 1260
Merit: 1009



View Profile
July 05, 2017, 05:02:25 AM
 #1535

First of all big thank you to fullzero and everyone contributing to this distro!

I've been struggling with the Genoil crash issue and lack of watchdog implementation for the past few days and I have a bandaid solution that seems to be actually working quite well, perhaps it can help others in the community:

Essentially you need to split the Genoil output to a file, grep it (we only care about 'error' instances), and then use this output as input for a monitoring script that kills and restarts the misbehaving process.

So we have 2 scripts launched in screen as daemons "ltail" script and "ett" script

$screen -dmS ltail sh ~/eth/Genoil-U/ltail
and
$screen -dmS ett bash ~/ett

ltail:
--------------------------
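#!/bin/bash
# Watch err.log and restart ethminer whenever a new "error" line appears.
echo listening...
cd ~/eth/Genoil-U/ || exit 1
tail -fn0 err.log | \
while read -r line ; do
        DATE=$(date +"%d-%m-%Y %H:%M:%S")
        # Test grep directly: $? after "grep | tee" reports tee's status, which is always 0
        if echo "$line" | grep -q "error"
        then
                echo "$DATE $line" | tee -a ~/eth/Genoil-U/timestamp.log
                pkill -x ethminer               # stop the misbehaving miner
                sleep 1
                screen -dmS ett bash ~/ett      # relaunch it in a fresh screen session
        fi
done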
-------------------------
ett:
-------------------------
#!/bin/bash
cd ~/eth/Genoil-U
./ethminer -U -F eth-us.dwarfpool.com:80/0xBEbd092a03827C37B75cd4ea314b207AA65c348f/208 2>&1 | tee >(grep error --color=never --line-buffered | tee -a err.log)

-------------------------

Finally, I also send the output of ltail to timestamp.log to track how many times Genoil fails per hour. Aiming at roughly 1 crash per hour gives me about 130MHs out of 5xGTX1060, which is a good 20+ MHs higher than Claymore... most importantly, it gives stable hashing despite the OC-introduced errors. The recovery is literally seconds.
Oh yeah and I also run
$tail -f ~/eth/Genoil-U/timestamp.log in a screen as well as watch -n 5 'sensors |grep Core' in another screen to fine tune the OC vs crash per hour vs temp
Hope this helps, and I hope the message is not too chaotic.
Cheers!

BTC: 13PnEKpfVzNseWkrm6LoueKcCMPj74zPv7
ETH: 0xBEbd092a03827C37B75cd4ea314b207AA65c348f


Very nice :)  I will probably add some version of this in a later version.  I will include your donation address and ensure you are credited.
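The crashes-per-hour tracking mentioned in the quoted post can be read straight from timestamp.log. A sketch, assuming each line starts with the DD-MM-YYYY HH:MM:SS stamp that the ltail script writes; crashes_per_hour is an invented name:

```shell
# Hypothetical helper: count restart entries per clock hour.
#   $1 = timestamp log written by the ltail script
# Output lines look like "<count> <DD-MM-YYYY> <HH>:00".
crashes_per_hour() {
    awk '{ print $1, substr($2, 1, 2) ":00" }' "$1" | sort | uniq -c
}
```

Running crashes_per_hour ~/eth/Genoil-U/timestamp.log gives one line per hour that had at least one restart. Note that the sort is plain text order, so it is chronological within a day but not across months.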

car1999
Full Member
***
Offline Offline

Activity: 350
Merit: 100


View Profile
July 05, 2017, 05:25:04 AM
 #1536

Also, I couldn't find how I can see the current mining process. I did see the screen -r commands, but that implies killing the current process and restarting it. I'd like to be able to see, from SSH, the current mining process without killing it. Is this possible?

If you want to monitor the mining process via screen you're going to have to kill the initial gnome-terminal.  There's no way around that, as screen can only reconnect to an existing screen session.

This shouldn't be a big deal if you have a stable rig.  You only need to do it once per reboot.  My process is:

1. From my desktop where I monitor my rigs I initiate a constant ping:
Code:
ping -t 10.20.30.40  # substitute your rig's IP, find it in your router, or by running nmap on your LAN subnet, or by running ifconfig from a guake terminal on the rig if you have a monitor connected
2. Boot the rig
3. Wait until I begin to get ping responses from the rig, thus indicating Ubuntu has booted and rig has network connectivity
4. SSH into the rig (user: m1  password: miner1)
5. Initiate a screen session:
Code:
screen -S [name for your rig, make one up or call it "rig"]
6. Start nvidia-smi dmon to watch for mining process to begin (by waiting until this happens you know OC settings, fan speed settings, etc have been applied.  Running those commands from within screen isn't 100% consistent IME as I always see error messages when I tried it that way.  It's best to let those settings commands run from gnome-terminal as Ubuntu first boots IMO).
Code:
nvidia-smi dmon
7. Wait until you see wattage go up and GPU utilization go up to 100% (which indicates that the oneBash script concluded and opened the mining process).  Exit nvidia-smi with CTRL + c
8. Find the PID for gnome-terminal.  
Code:
ps aux | grep gnome-terminal
9. Kill it:
Code:
kill [PID from step 8]
10. Restart mining:
Code:
bash '/media/m1/1263-A96E/oneBash'

It might seem like a lot of steps, but it takes all of 120 seconds and you shouldn't need to do it very often once your rig is dialed in.  You're losing maybe 1 minute's worth of hashes on avg of every week?  Pretty negligible considering the convenience of monitoring from another workstation, and you're not using up system resources by using Teamviewer.  This also lets you go completely headless if you buy a dummy HDMI plug.  I just updated from 16 to 17 and didn't need to haul my extra monitor upstairs to do it.  Easy peasy.
Run export DISPLAY=:0 before step 5; otherwise step 10 throws an error.
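Step 7 of the quoted procedure (watching until every GPU is busy) could also be checked by a script instead of by eye. A sketch only: all_gpus_busy is a hypothetical helper and the 90 percent default is an arbitrary threshold, not something nvOC defines.

```shell
# Hypothetical helper: read one GPU utilisation percentage per line on stdin
# and succeed only if there is at least one value and every value is at or
# above the threshold.
#   $1 = threshold in percent (default: 90, an arbitrary choice)
all_gpus_busy() {
    awk -v min="${1:-90}" '
        { seen = 1; if ($1 + 0 < min) low = 1 }
        END { exit ((seen && !low) ? 0 : 1) }
    '
}
```

It is meant to be fed from nvidia-smi, e.g. nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | all_gpus_busy 90 && echo "mining has started", repeated in a loop with a short sleep until it succeeds.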
gig410
Newbie
*
Offline Offline

Activity: 14
Merit: 0


View Profile
July 05, 2017, 05:37:17 AM
 #1537

So my rig crashed again, it was up for about 19 hours with the current settings. The previous it crashed I had not been able to see it crash, I just knew it because the screen was blank and the fans on the gpus went up to 100 percent. This time I was siting in front of it doing something else when the screen went blank and the fans kicked up to 100 percent. My question is if there is some kind of log that could be looked at to see what caused the crash or can one be enabled that only keeps the last one hour of activity?

ssh in and look at the tail end of /var/log/dmesg.  I have some crappy PCIe extenders here that would interrupt the connection between the GPU and the computer as soon as mining software fired up.  The errors show up toward the end of /var/log/dmesg.

There's also /var/log/messages, but that tends to be less useful for hardware errors.

I have a keyboard and monitor connected to the rig for now, I found a file named kern.log that is 1.7 GB in size and kern.log.1 that is about 650 MB. these are the messages

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

and

m1-desktop kernel: [105577.995353] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.995360] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.995363] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

once in a while I get this

m1-desktop kernel: [105576.736779] pcieport 0000:00:1b.0: can't find device of ID00d8


no idea what those mean


What kind of risers are you using?

Have you checked to ensure they are fully seated in the pcie ports?

Just checked: they are seated correctly on both the motherboard and the cards. I ran lspci and it looks like ID a2eb is the first GPU on the rig. It has its own power cable to the power supply, both on the card and on the riser. The card does work, but it has these errors.
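
With a 1.7 GB kern.log it is easier to count the AER reports than to read them. Below is a rough Python sketch, not anything shipped with nvOC: the function name `summarize_aer` and the `SAMPLE` text are made up for illustration, and the regexes only assume the line layout pasted above. One thing worth knowing when reading the result: PCI vendor ID 8086 is Intel's (NVIDIA's is 10de), and the `pcieport` driver handles PCIe ports, so `[8086:a2eb]` most likely identifies a port on the board that a riser plugs into rather than a GPU.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line and pulls out port, severity and layer.
BUS_ERROR = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)
# Matches the follow-up line that carries the vendor:device pair, e.g. "device [8086:a2eb]".
DEVICE_ID = re.compile(
    r"pcieport (?P<port>\S+):\s+device \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

# Four lines copied from the post above, used only as demo input.
SAMPLE = """\
m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000
"""


def summarize_aer(lines):
    """Count bus errors per (port, severity, layer) and remember each port's vendor/device ID."""
    counts = Counter()
    devices = {}
    for line in lines:
        m = BUS_ERROR.search(line)
        if m:
            counts[(m["port"], m["severity"], m["layer"])] += 1
            continue
        m = DEVICE_ID.search(line)
        if m:
            devices[m["port"]] = (m["vendor"], m["device"])
    return counts, devices


counts, devices = summarize_aer(SAMPLE.splitlines())
for (port, severity, layer), n in counts.items():
    print(port, severity, layer, n, devices.get(port))
```

To run it against the real log, pass an open file instead of the sample (for example `summarize_aer(open("/var/log/kern.log", errors="replace"))`); iterating over the file object reads it line by line, so the size of the file should not matter much.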
gig410
Newbie
Activity: 14
Merit: 0


July 05, 2017, 06:10:30 AM
 #1538

Looks like I was wrong about a2eb being the first GPU. I removed the GPU completely and I'm still getting these errors as soon as I boot; it won't even go into the GUI any more.
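
On the earlier question about keeping only the last hour of activity: the number in square brackets on each kernel line is seconds since boot, so the tail of a log can be trimmed to a time window after the fact. The following is a minimal Python sketch under a couple of assumptions: `last_window` is a hypothetical helper name, and a timestamp that drops by more than one second is treated as a reboot, which is a guess rather than anything the kernel guarantees.

```python
import re
from collections import deque

# Kernel timestamp such as "[105577.938217]" or "[   98.722227]" (seconds since boot).
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def last_window(lines, seconds=3600.0):
    """Return the trailing lines stamped within `seconds` of the newest line seen."""
    window = deque()
    for line in lines:
        m = TIMESTAMP.search(line)
        if not m:
            continue  # no kernel timestamp on this line, ignore it
        stamp = float(m.group(1))
        # Assumption: a clear step backwards means the machine rebooted,
        # so everything collected from the previous boot is dropped.
        if window and stamp < window[-1][0] - 1.0:
            window.clear()
        window.append((stamp, line.rstrip("\n")))
        # Discard lines that have fallen out of the time window.
        while stamp - window[0][0] > seconds:
            window.popleft()
    return [text for _, text in window]
```

Feeding it an open file (`last_window(open("/var/log/kern.log", errors="replace"))`) keeps only the window in memory, which matters with a log in the gigabyte range.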
S9k
Newbie
Activity: 26
Merit: 0


July 05, 2017, 06:12:18 AM
 #1539

Hi,

Please help!
I'm stuck on this problem.
My configuration:

- ASUS PRIME Z270-P x2 (I tried both boards; the results are similar)
- EVGA GeForce GTX 1080 GAMING ACX 3.0 x2
- MSI GeForce GTX 1080 Gaming X x2
- Gigabyte 1200 W power supply


Three video cards work perfectly in any combination:

m1@m1-desktop:~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 (UUID: GPU-43453088-0fca-9442-106d-7594d157ebf2)
GPU 1: GeForce GTX 1080 (UUID: GPU-d099b67e-f204-66fa-96dc-365a6b559a7e)
GPU 2: GeForce GTX 1080 (UUID: GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac)
m1@m1-desktop:~$


m1@m1-desktop:~$ lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1)
m1@m1-desktop:~$



But if I add the fourth (in this case ID GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac), the system crashes. Here is what I see in dmesg:


[   98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[   98.769072] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769117] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769144] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769169] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769193] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769217] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   98.769241] ACPI Warning: \_SB_.PCI0.RP04.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.359255] nvidia-modeset: Allocated GPU:1 (GPU-5c9c8e29-a088-90a6-2a20-b2b2b971d1fb) @ PCI:0000:05:00.0
[   99.398991] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399035] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399063] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399087] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399112] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399136] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.399160] ACPI Warning: \_SB_.PCI0.RP05.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20150930/nsarguments-95)
[   99.984670] nvidia-modeset: Allocated GPU:2 (GPU-5aacd4db-f68b-917e-8ac2-84caf68d6cac) @ PCI:0000:06:00.0
[  100.619118] nvidia-modeset: Allocated GPU:3 (GPU-d099b67e-f204-66fa-96dc-365a6b559a7e) @ PCI:0000:03:00.0
[  100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[  100.743162] NVRM: GPU Board Serial Number:
[  100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[  100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004

[  102.432593] r8169 0000:07:00.0 enp7s0: link up
[  102.432600] IPv6: ADDRCONF(NETDEV_CHANGE): enp7s0: link becomes ready
[  103.743306] nvidia-modeset: WARNING: GPU:0: Lost display notification (0:0x00000000); continuing.
[  103.773941] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[  105.501795] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[  105.501798] Bluetooth: BNEP filters: protocol multicast
[  105.501802] Bluetooth: BNEP socket layer initialized
[  105.613048] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[  105.613106] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004
[  105.704570] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000000 00000005 00000004

[  105.704972] BUG: unable to handle kernel paging request at ffff88167153d830
[  105.704974] IP: [<ffffffffc0262880>] _nv008171rm+0x620/0x780 [nvidia]
[  105.705052] PGD 220c067 PUD 0
[  105.705053] Oops: 0000 [#1] SMP

I have been trying to solve this for three days.
I have changed BIOS versions (0325, 0608, 0610) and risers, the 4G decoding option is enabled, and I have updated the NVIDIA drivers to 381.22. Nothing helps.
Does anybody have any ideas?
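
One way to read a dmesg dump like this is to tie each NVRM Xid report back to the card that nvidia-modeset allocated on that PCI address. The sketch below is only an illustration: `xid_report` and `SAMPLE` are invented names, and the regexes assume nothing beyond the line formats pasted above. In this excerpt every Xid 56 is reported against PCI:0000:01:00, which is the card allocated as GPU:0, not the newly added one.

```python
import re
from collections import Counter

# "nvidia-modeset: Allocated GPU:0 (GPU-<uuid>) @ PCI:0000:01:00.0"
ALLOCATED = re.compile(
    r"nvidia-modeset: Allocated GPU:(?P<index>\d+) "
    r"\((?P<uuid>GPU-[0-9a-f-]+)\) @ PCI:(?P<pci>[0-9a-f:.]+)"
)
# "NVRM: Xid (PCI:0000:01:00): 56, ..."
XID = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-f:.]+)\): (?P<code>\d+),")

# Lines copied from the dmesg excerpt above, used only as demo input.
SAMPLE = """\
[   98.722227] nvidia-modeset: Allocated GPU:0 (GPU-43453088-0fca-9442-106d-7594d157ebf2) @ PCI:0000:01:00.0
[  100.743159] NVRM: GPU at PCI:0000:01:00: GPU-43453088-0fca-9442-106d-7594d157ebf2
[  100.743164] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 000001e0 00000801 00000004 00000005
[  100.743649] NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000000 00000080 00000004 00000005 00000004
"""


def xid_report(lines):
    """Return (pci_address, xid_code, count, uuid_or_None) for every Xid seen."""
    uuid_by_slot = {}
    xids = Counter()
    for line in lines:
        m = ALLOCATED.search(line)
        if m:
            # Drop the ".0" function suffix so the key matches the Xid lines,
            # which print the address without it.
            uuid_by_slot[m["pci"].rsplit(".", 1)[0]] = m["uuid"]
            continue
        m = XID.search(line)
        if m:
            xids[(m["pci"], int(m["code"]))] += 1
    return [(pci, code, n, uuid_by_slot.get(pci)) for (pci, code), n in xids.items()]


for row in xid_report(SAMPLE.splitlines()):
    print(row)
```

The same function can be fed the saved output of `dmesg` split into lines; the meaning of each Xid code is listed in NVIDIA's own Xid documentation and is deliberately not hard-coded here.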

My guess is your mobo is trying to use / is using SLI.  Are you using an M.2 SSD?

There should be some setting in the BIOS related to SLI; disable it. Which slots are you using, and are you using risers? If so, on which GPUs?

If you are using risers, how are they powered?

Hi,
No, I don't use an M.2 SSD.
I use version 006s risers with the Molex socket.

I managed to solve the problem. I modified /etc/default/grub:

m1@m1-desktop:/etc/default$ more grub
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""

sudo update-grub

I also created the file disable-nouveau.conf, which contains two lines:

m1@m1-desktop:/etc/modprobe.d$ more /etc/modprobe.d/disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0


sudo reboot
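
Before rebooting, the two files can be sanity-checked from their text alone. This is a small Python sketch, assuming the file layouts shown above; `kernel_cmdline_params` and `nouveau_disabled` are hypothetical helper names, and the check only inspects the configuration text. It does not prove that the nouveau module is actually absent after boot.

```python
# Demo input copied from the two files shown above.
GRUB_SAMPLE = '''\
GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_TIMEOUT=10
GRUB_CMDLINE_LINUX_DEFAULT="vga=0 rdblacklist=nouveau nouveau.modeset=0"
GRUB_CMDLINE_LINUX=""
'''
MODPROBE_SAMPLE = '''\
blacklist nouveau
options nouveau modeset=0
'''


def kernel_cmdline_params(grub_text, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return the kernel parameters assigned to `key`; the last assignment wins."""
    params = []
    for raw in grub_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue  # comment or not an assignment
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Strip the surrounding quotes, then split on whitespace.
            params = value.strip().strip("\"'").split()
    return params


def nouveau_disabled(grub_text, modprobe_text):
    """True if both the kernel command line and the modprobe file switch nouveau off."""
    params = kernel_cmdline_params(grub_text)
    # Each modprobe line becomes a list of words, with trailing comments removed.
    directives = [l.split("#", 1)[0].split() for l in modprobe_text.splitlines()]
    return "nouveau.modeset=0" in params and ["blacklist", "nouveau"] in directives


print(kernel_cmdline_params(GRUB_SAMPLE))
print(nouveau_disabled(GRUB_SAMPLE, MODPROBE_SAMPLE))
```

On a live rig the two strings would come from reading /etc/default/grub and /etc/modprobe.d/disable-nouveau.conf; remember that changes to the former only take effect after `sudo update-grub` and a reboot, as in the post.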

Were you connecting the monitor directly to the mobo?

Not sure why else nouveau would be used.



My monitor is connected to GPU0.

m1@m1-desktop:~$ nvidia-smi
Wed Jul  5 02:00:52 2017      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 75%   65C    P2   181W / 180W |    319MiB /  8113MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
| 75%   60C    P2   180W / 180W |    141MiB /  8114MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:05:00.0     Off |                  N/A |
| 75%   72C    P2   165W / 180W |    141MiB /  8114MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:06:00.0     Off |                  N/A |
| 75%   71C    P2   166W / 180W |    141MiB /  8114MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+


I stayed on the 381.22 drivers; they seem to perform better to me.
salfter
Hero Member
Activity: 651
Merit: 501


My PGP Key: 92C7689C


July 05, 2017, 06:15:20 AM
 #1540

m1-desktop kernel: [105577.938217] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
m1-desktop kernel: [105577.949736] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
m1-desktop kernel: [105577.949750] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
m1-desktop kernel: [105577.949757] pcieport 0000:00:1b.0:   device [8086:a2eb] error status/mask=00000001/00002000

no idea what those mean

Those look like the errors I was getting with some crappy PCIe extenders I had recently ordered.  Here's a closeup of the inadequate soldering on the bit that goes in the slot; the other end of the riser is probably similar.

[Image: closeup of the soldering on the riser's slot connector]

I'll be sending these back.  There were no reviews when I bought them, but since then someone else has left a 1-star review.

I've bought these as replacements, at fullzero's recommendation.

Tipjars: BTC 1TipsGocnz2N5qgAm9f7JLrsMqkb3oXe2 LTC LTipsVC7XaFy9M6Zaf1aGGe8w8xVUeWFvR | My Bitcoin Note Generator | Pool Auto-Switchers: zpool MiningPoolHub NiceHash
Bitgem Resources: Pool Explorer Paper Wallet