Title: Cacti template for AMD GPU monitoring
Post by: JinTu on October 18, 2011, 06:13:39 AM

By leveraging my earlier work exposing GPU stats via SNMP (https://bitcointalk.org/index.php?topic=48771.0), I created a Cacti template for monitoring my mining rigs and would like to share my work with all of you.
Here are some teaser graphs from one of my (largely untuned) 6990 GPUs:

Adding graphs to Cacti
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/cacti-web-gui-data-query.jpg

Core clock for GPU 0
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/core-clock-0.png

Memory clock for GPU 0 (underclocked to 150MHz)
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/memory-clock-0.png

Core voltage for GPU 0
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/core-voltage-0.png

Fan speed for GPU 0
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/fan-speed-0.png

Temperature for GPU 0
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/temperature-0.png

Load for GPU 0
http://www.praecogito.com/bitcoin/amd-gpu/cacti-template/images/load-0.png

Prerequisites
Installation instructions
Troubleshooting
A Verbose Query should look like the following:

Code:
+ Running data query [16].

Have fun!

Title: Re: Cacti template for AMD GPU monitoring
Post by: dlasher on February 17, 2012, 08:20:19 PM

EDIT: working for 2 of 9 boxes - grrrrr
Title: Re: Cacti template for AMD GPU monitoring
Post by: JinTu on February 18, 2012, 01:55:24 AM

Quote
EDIT: Ignore me working fine.. I was able to still go add the graphs and it worked.

Glad to hear it! Feel free to share your graphs if you are willing.

Title: Re: Cacti template for AMD GPU monitoring
Post by: dlasher on February 18, 2012, 05:04:52 AM

Quote
EDIT: Ignore me working fine.. I was able to still go add the graphs and it worked.
Glad to hear it! Feel free to share your graphs if you are willing.

Oddly I can only get graphs from 3 out of 9 machines... there's an issue with PCI ID 0a breaking the perl script (noted in your amd perl script thread) and I can't understand the others.. I may post working/notworking when I get a chance tomorrow.

------EDIT------

Kinda wondering whether we're running into a PCI-ID issue on the cacti supporting scripts/items as well.

Data from a non-working one. It shows "success, 0 items 0 rows". Won't show any cards when I go to add graphs.

Quote
Data Query Debug Information
+ Running data query [12].
+ Found type = '3' [snmp query].
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ XML file parsed ok.
+ Executing SNMP walk for list of indexes @ '.1.3.6.1.4.1.8072.1.3.2.4.1.2.5.103.112.117.105.100'
+ No SNMP data returned
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml

--------

root@stats1:~# snmpwalk -v2c -cpublic 10.4.18.22 NET-SNMP-EXTEND-MIB::nsExtendOutLine
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".1 = STRING: 0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".2 = STRING: 1
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".3 = STRING: 2
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpufan".1 = STRING: 34
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpufan".2 = STRING: 72
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpufan".3 = STRING: 33
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuload".1 = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuload".2 = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuload".3 = STRING: 98
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".1 = STRING: 7900
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".2 = STRING: 7850
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".3 = STRING: 7750
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuclock".1 = STRING: 950
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuclock".2 = STRING: 930
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuclock".3 = STRING: 880
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuvcore".1 = STRING: 1100
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuvcore".2 = STRING: 1088
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuvcore".3 = STRING: 1088
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpumemory".1 = STRING: 300
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpumemory".2 = STRING: 300
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpumemory".3 = STRING: 1250
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".1 = STRING: 0a:00.0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".2 = STRING: 09:00.0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".3 = STRING: 04:00.0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpudescription".1 = STRING: ATI Radeon HD 5800 Series
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpudescription".2 = STRING: ATI Radeon HD 5800 Series
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpudescription".3 = STRING: AMD Radeon HD 6900 Series

And a working one, showing "6 items 2 rows"

Quote
+ Running data query [12].
+ Found type = '3' [snmp query].
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ XML file parsed ok.
+ Executing SNMP walk for list of indexes @ '.1.3.6.1.4.1.8072.1.3.2.4.1.2.5.103.112.117.105.100'
+ No SNMP data returned
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'

----------
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".1 = STRING: 0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".2 = STRING: 1
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpufan".1 = STRING: 20
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpufan".2 = STRING: 20
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuload".1 = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuload".2 = STRING: 99
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".1 = STRING: 7000
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".2 = STRING: 6700
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuclock".1 = STRING: 870
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuclock".2 = STRING: 870
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuvcore".1 = STRING: 1100
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuvcore".2 = STRING: 1100
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpumemory".1 = STRING: 1250
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpumemory".2 = STRING: 1250
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".1 = STRING: 03:00.0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".2 = STRING: 04:00.0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpudescription".1 = STRING: AMD Radeon HD 6900 Series
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpudescription".2 = STRING: AMD Radeon HD 6900 Series

I suspect the errors in the cacti.log are telling me, but I need a more verbose output...

Quote
02/20/2012 03:25:02 PM - CMDPHP: Poller[0] Host[2] DS[8] WARNING: Result from SNMP not valid. Partial Result: U
02/20/2012 03:25:04 PM - CMDPHP: Poller[0] Host[8] DS[80] WARNING: Result from SNMP not valid. Partial Result: U
02/20/2012 03:25:08 PM - CMDPHP: Poller[0] Host[9] DS[118] WARNING: Result from CMD not valid. Partial Result: U
02/20/2012 03:25:08 PM - CMDPHP: Poller[0] Host[9] DS[119] WARNING: Result from CMD not valid. Partial Result: U
02/20/2012 03:25:09 PM - CMDPHP: Poller[0] Host[9] DS[120] WARNING: Result from CMD not valid. Partial Result: U
02/20/2012 03:25:09 PM - CMDPHP: Poller[0] Host[9] DS[120] WARNING: Result from CMD not valid. Partial Result: U

Title: Re: Cacti template for AMD GPU monitoring
Post by: JinTu on February 29, 2012, 01:07:42 AM

Quote
Data from a non-working one. It shows "success, 0 items 0 rows". Won't show any cards when I go to add graphs.

Quote
Data Query Debug Information
+ Running data query [12].
+ Found type = '3' [snmp query].
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ XML file parsed ok.
+ Executing SNMP walk for list of indexes @ '.1.3.6.1.4.1.8072.1.3.2.4.1.2.5.103.112.117.105.100'
+ No SNMP data returned
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml'
+ Found data query XML file at '/usr/share/cacti/site/resource/snmp_queries/amd_gpu.xml
<snip>

I suspect the errors in the cacti.log are telling me, but I need a more verbose output...

Quote
02/20/2012 03:25:02 PM - CMDPHP: Poller[0] Host[2] DS[8] WARNING: Result from SNMP not valid. Partial Result: U
02/20/2012 03:25:04 PM - CMDPHP: Poller[0] Host[8] DS[80] WARNING: Result from SNMP not valid. Partial Result: U
02/20/2012 03:25:08 PM - CMDPHP: Poller[0] Host[9] DS[118] WARNING: Result from CMD not valid. Partial Result: U
02/20/2012 03:25:08 PM - CMDPHP: Poller[0] Host[9] DS[119] WARNING: Result from CMD not valid. Partial Result: U
02/20/2012 03:25:09 PM - CMDPHP: Poller[0] Host[9] DS[120] WARNING: Result from CMD not valid. Partial Result: U
02/20/2012 03:25:09 PM - CMDPHP: Poller[0] Host[9] DS[120] WARNING: Result from CMD not valid. Partial Result: U

I worked with dlasher via PM and we were able to resolve this for his setup (SNMP timeout). Please see the update to the first post (Troubleshooting) for additional details.

Title: Re: Cacti template for AMD GPU monitoring
Post by: dlasher on March 03, 2012, 04:39:13 AM

working perfectly now, thank you for the help..
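A note on the long numeric OID in the debug output quoted in this thread: it is the same nsExtendOutLine object that snmpwalk prints symbolically, followed by the length of the extend name and its ASCII codes. The minimal Python sketch below converts between the two forms, which can help when comparing Cacti's debug log against snmpwalk output. The function names are made up for illustration and are not part of the template.

```python
# Numeric form of NET-SNMP-EXTEND-MIB::nsExtendOutLine.
NS_EXTEND_OUT_LINE = ".1.3.6.1.4.1.8072.1.3.2.4.1.2"


def extend_index_oid(name):
    """Build the numeric OID for one snmpd 'extend' entry name.

    The name is encoded as its length followed by one ASCII code per character.
    """
    parts = [str(len(name))] + [str(ord(ch)) for ch in name]
    return NS_EXTEND_OUT_LINE + "." + ".".join(parts)


def decode_extend_name(oid):
    """Recover the extend name from a numeric nsExtendOutLine OID."""
    prefix = NS_EXTEND_OUT_LINE + "."
    if not oid.startswith(prefix):
        raise ValueError("not an nsExtendOutLine OID: " + oid)
    numbers = [int(part) for part in oid[len(prefix):].split(".")]
    length, codes = numbers[0], numbers[1:]
    return "".join(chr(code) for code in codes[:length])


# The OID from the debug output in this thread.
print(decode_extend_name(".1.3.6.1.4.1.8072.1.3.2.4.1.2.5.103.112.117.105.100"))  # gpuid
```

So the index walk in the debug log targets the "gpuid" extend entry, the same rows that appear as nsExtendOutLine."gpuid".N in the snmpwalk output.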
Title: Re: Cacti template for AMD GPU monitoring
Post by: The LT on March 04, 2012, 12:11:24 PM

You sir, are my hero! Will donate in a couple of days after I've set up the system! This is going to be VERY useful.
Title: Re: Cacti template for AMD GPU monitoring
Post by: JinTu on March 05, 2012, 07:18:23 AM

Quote
You sir, are my hero! Will donate in a couple of days after I've set up the system! This is going to be VERY useful.

Thanks mate. Be sure to post some sample graphs when you get it up and running.

Title: Re: Cacti template for AMD GPU monitoring
Post by: The LT on March 07, 2012, 05:38:36 PM

I've hit a snag... Got net-snmp sorted out and snmpwalk seems to work fine on a cacti machine, it connects to both my rigs.
Code:
garage ~ # snmpwalk -v2c -cpublic 192.168.2.4 NET-SNMP-EXTEND-MIB::nsExtendOutLine

But there seems to be some problem with cacti. The verbose Data query information gives this:

Code:
Something is definitely wrong here as far as I can tell... It finds the snmp_queries file as seen on the third line of the log but other than that... meh... I've tried setting the timeout up to 10000 msec and increasing OIDs per get request, but to no avail. I'm not that experienced with Cacti, so maybe JinTu can offer some insight?

Title: Re: Cacti template for AMD GPU monitoring
Post by: The LT on March 07, 2012, 06:37:42 PM

It seems I have the graphing going after all! Will gather some data and report back! This is so great, JinTu! How about a bounty for cgminer stats implementation? Once I mine some BTC I will send you a donation!
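Since the earlier fix in this thread came down to an SNMP timeout, it can help to time the same walk from the Cacti machine before adjusting the timeout in Cacti. The Python sketch below is only an illustration under a few assumptions: snmpwalk is on the PATH, the host and community are the ones from the post above, and the helper names are invented here. It relies on the standard net-snmp options -t (timeout in seconds) and -r (retries); Cacti's setting is in milliseconds, so the value is converted.

```python
import math
import subprocess
import time

WALK_OBJECT = "NET-SNMP-EXTEND-MIB::nsExtendOutLine"


def build_walk_cmd(host, community="public", timeout_ms=10000, retries=1):
    """Return an snmpwalk argument list with an explicit timeout and retry count."""
    # net-snmp's -t takes seconds; round the millisecond value up to a whole second.
    timeout_s = max(1, math.ceil(timeout_ms / 1000))
    return ["snmpwalk", "-v2c", "-c", community,
            "-t", str(timeout_s), "-r", str(retries),
            host, WALK_OBJECT]


def time_walk(cmd):
    """Run the walk and return (elapsed seconds, number of output lines)."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return elapsed, len(result.stdout.splitlines())


# Show the command that would be run for the rig mentioned above.
print(" ".join(build_walk_cmd("192.168.2.4")))
```

Calling time_walk(build_walk_cmd("192.168.2.4")) on the Cacti machine gives a rough figure for how long a full walk takes. If that is close to or above the SNMP timeout configured for the device in Cacti, the "No SNMP data returned" symptom would be consistent with a timeout.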
Title: Re: Cacti template for AMD GPU monitoring
Post by: JinTu on March 07, 2012, 11:37:18 PM

Quote
It seems I have the graphing going after all! Will gather some data and report back! This is so great, JinTu! How about a bounty for cgminer stats implementation? Once I mine some BTC i will send you a donation!

Glad to hear you got it going, and looking forward to your sample graphs. I'm certainly open to a bounty for implementing cgminer stats. It's been on my todo list ever since the 2.1.0 release supporting the JSON API, but it has been near impossible to find the time. A bounty would help goad me into actually doing it.

Title: Re: Cacti template for AMD GPU monitoring
Post by: The LT on March 08, 2012, 10:33:33 AM

Quote
Glad to hear you got it going, and looking forward to your sample graphs. I'm certainly open to a bounty for implementing cgminer stats. It's been on my todo list ever since the 2.1.0 release supporting the JSON API, but it has been near impossible to find the time. A bounty would help goad me into actually doing it.

I've made a small 0.5 BTC donation for your efforts for SNMP and Cacti, I know it's not much but I'm waiting for BFL's to arrive before the hashing power increases. I'll post the graphs soon, when they stop being straight lines and become something "graphy". :)

A quick question, is "sudo" really needed for aticonfig to work? I wanted to minimize my logging. If it isn't explicitly required, then maybe we can add a variable to enable-disable sudo usage?

Title: Re: Cacti template for AMD GPU monitoring
Post by: JinTu on March 08, 2012, 06:29:08 PM

Quote
I've made a small 0.5 BTC donation for your efforts for SNMP and Cacti, I know it's not much but I'm waiting for BFL's to arrive before the hashing power increases. I'll post the graphs soon, when they stop being straight lines and become something "graphy". :)

A quick question, is "sudo" really needed for aticonfig to work? I wanted to minimize my logging. If it isn't explicitly required, then maybe we can add a variable to enable-disable sudo usage?

Donation received, thanks for your support!

Yes, unfortunately aticonfig needs to run as the same user, and against the same display, that your X session is logged in as for most of the commands to work. You can test this yourself by logging in as root (assuming you are already logged into X as a different user) and running the following:

aticonfig --lsa

This should work, e.g.

Code:
* 0. 03:00.0 AMD Radeon HD 6900 Series

aticonfig --odgt --adapter=0

This won't work, e.g.

Code:
No protocol specified

Code:
Default Adapter - AMD Radeon HD 6900 Series

Title: Re: Cacti template for AMD GPU monitoring
Post by: The LT on March 09, 2012, 12:19:29 PM

Oh okay, thanks for clearing that up! The VRM on one of the cards is getting funky, it's really nice to see it nicely graphed. It's also much easier to tune the cards to desired temperature, you can see which parameters affect what.. :) Let me spend a couple of days graphing and I'll post some stats.
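To illustrate what the adapter list from "aticonfig --lsa" contains, here is a small Python sketch that parses lines shaped like the sample above into index, PCI address and description. It is not the perl script from the SNMP thread, only an illustration. The first sample line is the one quoted above; the second line is a hypothetical layout built from the address and description in dlasher's snmpwalk output. The pattern accepts hex digits in the PCI address so that an address such as 0a:00.0 is matched, though whether that was the cause of the "PCI ID 0a" problem mentioned earlier is not confirmed here.

```python
import re

# Optional "*" marks the default adapter, then "<index>. <bus>:<dev>.<fn> <name>".
ADAPTER_RE = re.compile(
    r"^\s*(?P<default>\*)?\s*(?P<index>\d+)\.\s+"
    r"(?P<pci>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])\s+"
    r"(?P<name>.+?)\s*$"
)


def parse_lsa(output):
    """Parse 'aticonfig --lsa' style text into a list of adapter dicts."""
    adapters = []
    for line in output.splitlines():
        match = ADAPTER_RE.match(line)
        if match:
            adapters.append({
                "index": int(match["index"]),
                "pci": match["pci"].lower(),
                "name": match["name"],
                "default": match["default"] is not None,
            })
    return adapters


# First line is from the post above; the second line's layout is an assumption.
SAMPLE = """\
* 0. 03:00.0 AMD Radeon HD 6900 Series
  1. 0a:00.0 ATI Radeon HD 5800 Series
"""

for adapter in parse_lsa(SAMPLE):
    print(adapter["index"], adapter["pci"], adapter["name"])
```

As noted above, the command itself still has to run as the user who owns the X session, otherwise there is no adapter list to parse.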
Title: Re: Cacti template for AMD GPU monitoring
Post by: dlasher on March 16, 2012, 05:40:36 PM

not a lot to see, but average temps across miners1-9.. using CGminer with a target temp of 80C, hence the flat lines around there.

example: https://i.imgur.com/1Athj.png

See the rest at: http://imgur.com/a/Hxssv/all

Thanks again JinTu for your work and help!

Title: Re: Cacti template for AMD GPU monitoring
Post by: JinTu on March 16, 2012, 06:24:51 PM

Quote
not a lot to see, but average temps across miners1-9.. using CGminer with a target temp of 80C, hence the flat lines around there.

example: https://i.imgur.com/1Athj.png

See the rest at: http://imgur.com/a/Hxssv/all

Thanks again JinTu for your work and help!

No problem dlasher. I am glad you are getting some use out of it.

wrt cgminer, your sample miner5 devices 0, 1 and 3 temperature graph has a lot more variability than I would expect to see if you are using the auto features. I see that device 2 of miner5 graphs for the same time frame show more temperature stability. Is this by chance a card with reduced airflow compared with devices 0, 1 and 3 or are you using different settings for these cards?

Title: Re: Cacti template for AMD GPU monitoring
Post by: dlasher on March 17, 2012, 03:40:33 AM

Quote
wrt cgminer, your sample miner5 devices 0, 1 and 3 temperature graph has a lot more variability than I would expect to see if you are using the auto features. I see that device 2 of miner5 graphs for the same time frame show more temperature stability. Is this by chance a card with reduced airflow compared with devices 0, 1 and 3 or are you using different settings for these cards?

Identical settings, miner5 has 4 58xx cards with zero space between them, lots of airflow, but the middle cards stay hotter.

Title: Re: Cacti template for AMD GPU monitoring
Post by: The LT on March 17, 2012, 07:50:11 PM

Happily monitoring my cards for a week now! The graphs were useful in reducing the failing VRM temperatures. :)
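For anyone who wants a quick per-host average like the ones discussed above without opening Cacti, the snmpwalk output format shown earlier in the thread is easy to parse. The Python sketch below groups the nsExtendOutLine lines by row and averages gputemp. It assumes the gputemp values are in hundredths of a degree Celsius, which fits readings like 7900 next to an 80C target but is not stated explicitly in the thread. The sample values are taken from the working host's output quoted earlier, and the function names are invented for this example.

```python
import re
from collections import defaultdict

# Matches e.g.: NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".1 = STRING: 7000
LINE_RE = re.compile(
    r'nsExtendOutLine\."(?P<field>\w+)"\.(?P<row>\d+) = STRING: (?P<value>.*)'
)


def parse_walk(text):
    """Group snmpwalk output into {row number: {field name: value}}."""
    gpus = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            gpus[int(match["row"])][match["field"]] = match["value"].strip()
    return dict(gpus)


def average_temp_c(gpus):
    """Average GPU temperature in Celsius, or None if no gputemp rows exist."""
    # Assumption: gputemp is reported in hundredths of a degree Celsius.
    temps = [int(gpu["gputemp"]) / 100 for gpu in gpus.values() if "gputemp" in gpu]
    return sum(temps) / len(temps) if temps else None


SAMPLE = '''\
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".1 = STRING: 0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuid".2 = STRING: 1
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".1 = STRING: 7000
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gputemp".2 = STRING: 6700
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".1 = STRING: 03:00.0
NET-SNMP-EXTEND-MIB::nsExtendOutLine."gpuaddress".2 = STRING: 04:00.0
'''

print(average_temp_c(parse_walk(SAMPLE)))  # 68.5
```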