Nagios / SNMP tools

Here is a collection of various tools that I wrote or adapted to my needs to ease the task of Nagios monitoring of various services mostly on Linux servers.

SNMP extend

The SNMP daemon from the NetSNMP project, version 5.0 and above, provides a way to access the output of user-supplied scripts via the SNMP protocol. In other words, an SNMP client on one machine can invoke a script on another machine just by sending an SNMP query. After the remote script finishes, its standard/error output, return code and some other values are sent back to the client in an SNMP response.
(NOTE: See the SNMP exec section below if you run an SNMP daemon older than NetSNMP 5.0.)

For example, suppose you want to query the current date and time of a server. Granted, there are some standard OIDs in the System MIB, but you can just as well run /bin/date every time and pass its output through SNMP back to the client. Here is how to do it:

  1. On the remote server, configure the date extension in /etc/snmp/snmpd.conf. Simply add this single line to the end of the config file and reload snmpd:
    extend datecheck /bin/date
    
  2. From any client that is allowed SNMP access to the server, query datecheck with:
    ~$ snmpwalk -v2c -c public remote.server NET-SNMP-EXTEND-MIB::nsExtendOutputFull
    NET-SNMP-EXTEND-MIB::nsExtendOutputFull."datecheck" = STRING: Wed Oct 18 00:01:44 NZDT 2006
    
    That's about it. Easy way to run programs and scripts remotely, isn't it?
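
If you want to consume the extend output from your own tooling rather than read it by eye, the snmpwalk lines shown above can be picked apart with a few lines of Python. This is only a minimal sketch: the function name parse_extend_output is made up for illustration, it expects exactly the textual output format shown in step 2, and the handling of multi-line script output as raw continuation lines is an assumption.

```python
import re

# Matches lines such as:
#   NET-SNMP-EXTEND-MIB::nsExtendOutputFull."datecheck" = STRING: Wed Oct 18 ...
LINE_RE = re.compile(
    r'^NET-SNMP-EXTEND-MIB::nsExtendOutputFull\."([^"]+)" = STRING: ?(.*)$'
)


def parse_extend_output(walk_text):
    """Return a dict {extend name: script output} parsed from snmpwalk text.

    Hypothetical helper, not part of any of the scripts on this page.
    """
    results = {}
    current = None
    for line in walk_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            current = match.group(1)
            results[current] = match.group(2)
        elif current is not None:
            # Assumption: further lines of a multi-line output follow as-is.
            results[current] += "\n" + line
    return results


SAMPLE_WALK = (
    'NET-SNMP-EXTEND-MIB::nsExtendOutputFull."datecheck" = '
    "STRING: Wed Oct 18 00:01:44 NZDT 2006"
)

if __name__ == "__main__":
    for name, output in parse_extend_output(SAMPLE_WALK).items():
        print(name, "->", output)
```

Feeding it the real output of the snmpwalk command from step 2 (for example via a pipe or subprocess) gives one dictionary entry per configured extend line.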

For integration with Nagios, save the script check_snmp_extend.sh to /usr/local/nagios/libexec.local and put the following lines into Nagios' config:

$USER10$=/usr/local/nagios/libexec.local

define command{
	command_name	check_snmp_extend
	command_line	$USER10$/check_snmp_extend.sh $HOSTADDRESS$ $ARG1$
	}

define service{
	## This is an example service configured as
	## extend servicename /path/to/service-check.sh
	## on remote.server in /etc/snmp/snmpd.conf
	use			generic-service
	host_name		remote.server
	service_description	SomeService status
	check_command		check_snmp_extend!servicename
}

Have you found this script useful? Please support author by PayPal donation.


 

SNMP exec

SNMP exec provides functionality similar to extend, but it is less flexible and slightly slower to work with. On the other hand, it is supported by many older SNMP daemon implementations, including UCD-SNMP and NetSNMP 4.x, which are still found on many servers.

The configuration and operation of exec are very similar to those of extend described above. I won't repeat myself: simply replace every occurrence of the string extend with exec in the above Nagios config and download this script to Nagios' libexec.local directory: check_snmp_exec.sh

Scripts for SNMP extend / exec

Your Nagios is now ready to query status from remote scripts via SNMP. The convention for these scripts is simple: the first word on the first line must be one of OK, WARNING, FAIL or UNKNOWN. check_snmp_extend.sh or check_snmp_exec.sh translates this word to the appropriate return code and passes it back to Nagios. That's all.
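
To make the convention concrete, here is a rough Python sketch of the translation step. It is not the code of check_snmp_extend.sh; the function name status_to_exit_code is invented for illustration. The numeric values are the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Treating both FAIL and CRITICAL as code 2, and mapping anything unrecognised to UNKNOWN, are assumptions.

```python
# Standard Nagios plugin return codes.
NAGIOS_CODES = {
    "OK": 0,
    "WARNING": 1,
    "CRITICAL": 2,
    "FAIL": 2,      # assumption: FAIL is treated the same as CRITICAL
    "UNKNOWN": 3,
}


def status_to_exit_code(script_output):
    """Map the first word of the first output line to a Nagios return code."""
    lines = script_output.splitlines()
    words = lines[0].split() if lines else []
    first_word = words[0] if words else ""
    # Assumption: empty or unrecognised output is reported as UNKNOWN.
    return NAGIOS_CODES.get(first_word, 3)


if __name__ == "__main__":
    line = "OK - md0 [UU] has 2 of 2 devices active"
    print(line)
    raise SystemExit(status_to_exit_code(line))
```

A wrapper built this way prints the script's line unchanged, so Nagios shows it as the status text, and exits with the mapped code.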

BTW, from now on I will talk about extend only, but everything applies equally to exec on older daemons.

Now is a good time to introduce some scripts that can be used on the remote servers with extend or exec...

Monitor for Linux Software RAID

Most low-end servers rely on Linux SW-RAID for their data storage. Monitoring such an array and getting an alert as soon as a problem appears is an essential part of maintaining decent availability of your services.

The core part of this monitor is a nagios-linux-swraid.pl script that parses RAID status information from /proc/mdstat and reports a single line similar to the following on its standard output:

OK - md0 [UU] has 2 of 2 devices active (active=sda2,sdb2 failed=none spare=none)
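
For readers curious how such a line can be derived from /proc/mdstat, here is a rough Python sketch. It is not the code of nagios-linux-swraid.pl; the function name check_mdstat is invented, the sample text is a typical two-disk RAID1 entry, and the use of FAIL for a degraded array follows the status convention described earlier on this page and is an assumption about what the real script prints.

```python
import re


def check_mdstat(mdstat_text, device):
    """Build a one-line status for one md device from /proc/mdstat text."""
    lines = mdstat_text.splitlines()
    for index, line in enumerate(lines):
        if not line.startswith(device + " :"):
            continue
        # Members look like "sda2[0]", "sdb2[1](F)" (failed), "sdc2[2](S)" (spare).
        groups = {"": [], "F": [], "S": []}
        for name, flag in re.findall(r"(\w+)\[\d+\](?:\(([FS])\))?", line):
            groups[flag].append(name)
        # The following line carries e.g. "[2/2] [UU]": devices wanted / in use.
        detail = lines[index + 1] if index + 1 < len(lines) else ""
        counts = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", detail)
        if counts is None:
            return f"UNKNOWN - cannot parse status of {device}"
        total, up, flags = int(counts.group(1)), int(counts.group(2)), counts.group(3)
        status = "OK" if up == total and not groups["F"] else "FAIL"

        def fmt(names):
            return ",".join(sorted(names)) or "none"

        return (
            f"{status} - {device} [{flags}] has {up} of {total} devices active "
            f"(active={fmt(groups[''])} failed={fmt(groups['F'])} "
            f"spare={fmt(groups['S'])})"
        )
    return f"UNKNOWN - {device} not found in mdstat"


SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb2[1] sda2[0]
      104320 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(check_mdstat(SAMPLE_MDSTAT, "md0"))
```

On a real host you would pass in the contents of /proc/mdstat instead of the sample text.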

This script is hooked into the SNMP daemon running on the host with SW-RAID using a single line in the /etc/snmp/snmpd.conf config file:

extend raid-md0 /usr/local/bin/nagios-linux-swraid.pl --device=md0

On the Nagios side I assume you have check_snmp_extend correctly configured as described earlier on this page. Then simply add the following service description to your Nagios' config:

define service{
	use			generic-service
	host_name		remote.server
	service_description	RAID status
	check_command		check_snmp_extend!raid-md0
}


System Up-To-Date monitor (APT / YUM)

Keeping the operating system up to date with the latest patches is essential in most environments. The following two scripts check whether new patches are available for download. The first one works with APT-based systems (e.g. Debian, Ubuntu or even the OpenSolaris-based Nexenta) and the second one is for YUM-based systems (e.g. RedHat, Fedora or CentOS).

You could have the script itself run apt or yum to download information about the newest updates from the internet, but that usually takes too long and Nagios times out in the middle. Instead, on my systems I run apt or yum every few hours from cron and let the Nagios script do only the quick tasks. More specifically, on APT systems I run apt-get update from cron, because that downloads data from the net, and then run apt-get -q -s upgrade from the SNMP script, because that only reads the local database and parses the output. On YUM systems I run yum check-update from cron and store its output in a file; the SNMP script then only reads and parses that file. Keep reading for example usage.

First of all, tell cron to run apt or yum every 6 hours:

## /etc/crontab
## Every 6 hours download new data from the net
# For APT-based systems (Debian, Ubuntu, ...)
50 */6 * * *   root    /usr/bin/apt-get -qq update
# For YUM-based systems (RedHat, CentOS, Fedora)
50 */6 * * *   root    /usr/bin/yum check-update > /var/run/yum.check-update

Download the Nagios scripts (check-apt-upgrade.pl or check-yum-update.pl) and tell the SNMP daemon about them:

## /etc/snmp/snmpd.conf
# For APT-based systems (Debian, Ubuntu, ...)
extend sw-updates /usr/local/bin/check-apt-upgrade.pl --run
# For YUM-based systems (RedHat, CentOS, Fedora)
extend sw-updates /usr/local/bin/check-yum-update.pl --file /var/run/yum.check-update

The last step is to configure Nagios to poll these services (same for both APT and YUM):

define service{
	use			generic-service
	host_name		remote.server
	service_description	Software updates
	check_command		check_snmp_extend!sw-updates
}

That's all. You will get an "OK" result if there are no new updates and "WARNING" with a list of packages to update if there are any. On Ubuntu you'll even get a "CRITICAL" result if there are any security updates, and "WARNING" when there are only non-security ones.
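
The quick, local-only part of the APT check can be sketched in a few lines of Python. This is an illustration, not the code of check-apt-upgrade.pl: the function name check_apt_upgrade is invented, it relies on the "Inst <package> ..." lines that apt-get -s upgrade prints for each package it would install, and spotting security updates by looking for "-security" in the archive name is a heuristic assumed here, not taken from the original script.

```python
def check_apt_upgrade(simulated_output):
    """Turn the text printed by `apt-get -q -s upgrade` into a status line."""
    packages = []
    security = []
    for line in simulated_output.splitlines():
        fields = line.split()
        # Simulated installs are reported as "Inst <package> [old] (new archive ...)".
        if len(fields) >= 2 and fields[0] == "Inst":
            packages.append(fields[1])
            # Assumption: security archives carry "-security" in their name.
            if "-security" in line.lower():
                security.append(fields[1])
    if not packages:
        return "OK - no updates available"
    status = "CRITICAL" if security else "WARNING"
    return f"{status} - {len(packages)} package(s) to update: {' '.join(packages)}"


SAMPLE_APT = """\
Reading package lists...
Building dependency tree...
Inst openssl [1.1.1f-1ubuntu2.19] (1.1.1f-1ubuntu2.20 Ubuntu:20.04/focal-security [amd64])
Inst vim [2:8.1.2269-1ubuntu5] (2:8.1.2269-1ubuntu5.1 Ubuntu:20.04/focal-updates [amd64])
Conf openssl (1.1.1f-1ubuntu2.20 Ubuntu:20.04/focal-security [amd64])
Conf vim (2:8.1.2269-1ubuntu5.1 Ubuntu:20.04/focal-updates [amd64])
"""

if __name__ == "__main__":
    print(check_apt_upgrade(SAMPLE_APT))
```

Because the simulation only reads the local package database, a check like this returns quickly enough to be run on every SNMP query.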


System uptime monitor

Reporting system uptime and generating alert on system reboot is often useful. Download check_snmp_uptime.pl that does just that - reads system uptime, records last reading and alerts when current uptime reading is lower than the last recorded one. Easy, eh?

Two important things:

  1. net-snmp provides the snmpd daemon's uptime in DISMAN-EVENT-MIB::sysUpTimeInstance (.1.3.6.1.2.1.1.3.0). The real system uptime is available as HOST-RESOURCES-MIB::hrSystemUptime.0 (.1.3.6.1.2.1.25.1.1.0). Many other devices, like switches, provide only the former OID.
    This script can read either of them. Use --sysUpTime or --hrSystemUptime to select the appropriate OID for each device.
  2. If --dbfile parameter is not used then the script will only check and report the uptime and return OK. No alerts will be generated at all.
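
The reboot detection itself boils down to comparing the current reading with the last recorded one. The following Python sketch illustrates the idea and is not the code of check_snmp_uptime.pl: the function names, the plain-text state file format and the choice of WARNING as the alert level are all assumptions. The only facts relied on are that SNMP prints uptime as "Timeticks: (N) ..." and that timeticks are hundredths of a second.

```python
import os
import re
import tempfile


def parse_timeticks(snmp_line):
    """Extract the raw timeticks value from snmpget-style output, or None."""
    match = re.search(r"Timeticks: \((\d+)\)", snmp_line)
    return int(match.group(1)) if match else None


def check_uptime(current_ticks, dbfile=None):
    """Report uptime; alert if it is lower than the value stored in dbfile."""
    if current_ticks is None:
        return "UNKNOWN - could not read uptime"
    days = current_ticks / 100 / 86400  # timeticks are 1/100 of a second
    if dbfile is None:
        # Without a state file there is nothing to compare against.
        return f"OK - uptime {days:.1f} days"
    previous = None
    if os.path.exists(dbfile):
        with open(dbfile) as handle:
            text = handle.read().strip()
        previous = int(text) if text.isdigit() else None
    with open(dbfile, "w") as handle:
        handle.write(str(current_ticks))  # remember this reading for next time
    if previous is not None and current_ticks < previous:
        # Assumption: a detected reboot is reported as WARNING.
        return f"WARNING - reboot detected, uptime {days:.1f} days"
    return f"OK - uptime {days:.1f} days"


if __name__ == "__main__":
    state = os.path.join(tempfile.mkdtemp(), "uptime.db")
    line = "DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (8640000) 1 day, 0:00:00.00"
    print(check_uptime(parse_timeticks(line), state))
    print(check_uptime(500, state))
```

The second call in the demo simulates a reading taken shortly after a reboot, so it is lower than the stored value and triggers the alert.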


MySQL replication monitor

MySQL provides a relatively easy way to run online mirrors of the master database. The mirror is called slave in MySQL terminology and the mirroring process is called replication. With the following script it is easy to add your MySQL slaves to Nagios monitoring and receive an alert whenever replication drops for some reason.

The script needs to connect as a MySQL user (say, user monitor) with the REPLICATION CLIENT privilege. Use this GRANT command to create such a user:

mysql> GRANT REPLICATION CLIENT ON *.* TO monitor@localhost IDENTIFIED BY 'PassWord';

Then download the script check-mysql-slave.pl, save it to the /usr/local/bin/ directory and tell the SNMP daemon to run it on request:

extend mysql-slave /usr/local/bin/check-mysql-slave.pl --user monitor --pass PassWord --sock /tmp/mysql.sock

The last step, as with all the other monitors, is to configure Nagios to poll the monitor status every now and then:

define service{
	use			generic-service
	host_name		remote.server
	service_description	MySQL replication
	check_command		check_snmp_extend!mysql-slave
}

The script returns "CRITICAL" whenever replication breaks, e.g. the Slave IO or Slave SQL process isn't running, or the replication is too far behind the master (2 minutes by default). "WARNING" is returned when replication is more than 1 minute but less than 2 minutes behind the master, and "OK" is returned when everything is all right. You can also get an "UNKNOWN" return code, usually on misconfiguration or when the script can't connect to the slave server. That's all ;-)
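
The decision rules just described can be written down compactly. The Python sketch below is an illustration, not the code of check-mysql-slave.pl: the function name classify_replication is invented, and it takes a dictionary with the Slave_IO_Running, Slave_SQL_Running and Seconds_Behind_Master columns of SHOW SLAVE STATUS rather than talking to MySQL itself. Treating a lag of exactly 2 minutes as CRITICAL and a missing lag value as UNKNOWN are assumptions.

```python
def classify_replication(status_row, warn_seconds=60, crit_seconds=120):
    """Map one SHOW SLAVE STATUS row (as a dict) to a Nagios status line."""
    if not status_row:
        # E.g. the connection failed or the server is not configured as a slave.
        return "UNKNOWN - no slave status available"
    if status_row.get("Slave_IO_Running") != "Yes":
        return "CRITICAL - Slave IO thread is not running"
    if status_row.get("Slave_SQL_Running") != "Yes":
        return "CRITICAL - Slave SQL thread is not running"
    lag = status_row.get("Seconds_Behind_Master")
    if lag is None:
        # Assumption: an undefined lag with both threads running is UNKNOWN.
        return "UNKNOWN - replication lag is not available"
    lag = int(lag)
    if lag >= crit_seconds:  # assumption: exactly 2 minutes counts as CRITICAL
        return f"CRITICAL - slave is {lag} seconds behind master"
    if lag > warn_seconds:
        return f"WARNING - slave is {lag} seconds behind master"
    return f"OK - slave is {lag} seconds behind master"


if __name__ == "__main__":
    row = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 5,
    }
    print(classify_replication(row))
```

Keeping the thresholds as parameters mirrors the "by default" wording above: the 1 and 2 minute limits are starting points that can be adjusted per server.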


PostgreSQL / Slony cluster monitor

Slony is a popular PostgreSQL replication system. It is a good idea to monitor the status of your Slony cluster and trigger an alert whenever any node gets out of sync. The following script will help you do just that.

  1. Download the script check-slony-cluster.pl and save it to a libexec directory on your Nagios server.
  2. Adjust the permissions on the database node to allow connections from the Nagios server with a username of, say, nagios.
  3. Include the following two blocks in your Nagios config:
    define command{
            command_name    check_slony
            command_line    $USER1$/check-slony-cluster.pl --host $HOSTADDRESS$ --user $ARG1$ --dbname $ARG2$ --cluster $ARG3$ --node $ARG4$
    }
    
    define service{
            use                             generic-service
            host_name                       dbmaster
            service_description             Slony - node 2 dbslave
            check_command                   check_slony!nagios!mydb!mydbcluster!2
            }
    
    Update the hostname, username, database name and cluster name to suit your setup.
  4. That's it. Unfortunately I don't have a Slony cluster on hand to give more detailed setup instructions. But I'm sure you'll figure out the details :-)

Very simple tools

Many Nagios scripts exist for checking core system values on Linux systems. These are the ones I use...
Note that I'm not the author of the scripts below...

  1. Check hostname: check_snmp_hostname.pl
    Checking the hostname is a good idea in large networks to ensure that we're talking to the right host. Other checks could depend on this result.
  2. Check system load: check_snmp_load.pl
  3. Check number of running processes: check_snmp_procs.pl