Thoth Network Integrity Assurance

Roger Burton West
9 November 2001


1. Description

1.1. General

Thoth is a network monitoring system, designed to report on networked devices and services, and alert administrators when they are not functioning normally. It has been designed and developed primarily under Linux, though other Unix systems should be able to run it. It will not be usable under Windows, since it requires the ability to fork() from within Perl.

Features include:

1.2. History

Thoth is the result of two years' work in ISP network operations centres. I have tried a variety of systems - principally NOCOL and Netsaint - but have been dissatisfied with them all. Requiring a database back-end has considerably simplified the process of updating and notification.

The first time Thoth was used seriously, the host it was required to test - which had crashed five times in the previous 24 hours - promptly ran perfectly for the next 48. This performance is not guaranteed to be replicated on your own system.


2. Prerequisites

Perl. All thoth systems work happily on 5.005 and 5.6. Some perl scripts are suid, which may require additional support packages.

A database server supported by DBI. Thoth has been developed with mysql and postgresql, but should work on other databases, as long as an auto_increment or equivalent function is available.

If you plan to use the web interface, a web server.

Perl modules from CPAN. Getopt::Std and POSIX should already be installed in any case. Config::IniFiles and DBI (and of course the DBD module for the selected database) are critical; Net::Ping, Net::SNMP, Net::DNS and especially Net::TCP are used by the service checks; Text::Wrap is used by the plaintext-emitting parts of the system. Text::CSV_XS is used by the cascade monitoring modules.

ssh is required by the sshremote service checker.


3. Components

3.1. Core

bin/thoth is the core component; it handles monitor scheduling, execution and database updates.

3.2. Individual monitoring elements

Installed in check/*

3.2.1. cascademaster

Gathers monitoring information from a remote thoth server.

3.2.2. dns

Tests a Domain Name Service server; will check for authoritative zone serving if required.

3.2.3. ftp

Tests an FTP server; will attempt login if required.

3.2.4. http

Tests a web server with basic-mode authentication if required (but no https support).

3.2.5. imap

Tests an IMAP server; will attempt login if required.

3.2.6. mysql

Tests usability of a MySQL database.

3.2.7. nntp

Tests for connectivity and a valid response (with or without posting permission).

3.2.8. ping

Tests for machine reachability.

3.2.9. pop3

Tests a POP3 server; will attempt login if required.

3.2.10. postgresql

Tests a PostgreSQL database server.

3.2.11. smtp

Tests an SMTP server.

3.2.12. snmp

Tests an SNMP variable.

3.2.13. ssh

Tests an SSH server.

3.2.14. sshremote

Conducts tests of disc space, load, etc., on remote Unix machines.

3.2.15. sybase

Tests a Sybase server (including MS-SQL).

3.2.16. tcp

Tests for port reachability.

3.2.17. vnc

Tests a VNC server.

3.3. Web interface

cgi-bin/status.cgi is the CGI script which generates status information for the web interface. All other content is static and generated when necessary.

3.4. Notification interface

notify/* are the notification programs

3.4.1. email

Given an email address, will send a suitably-formatted notification message.

3.4.2. ssh

Given a user:oldmap:newmap:command as its address, will attempt to make an ssh connection to the host and run the command. This has obvious utility for restarting failed servers. The map specifications determine the status changes allowed: an oldmap of 0,3 and a newmap of 4 will activate the command only for a change from OK or Warning to Critical.

3.4.3. smsq

Will queue a Short Message, which will be sent by modemqueue.

3.4.4. faxq

Will queue a fax, which will be sent by modemqueue.


4. Installation

4.1. create user and group thoth

4.2. extract tarball

My own policy is to put all files under /usr/local/thoth/ .

4.3. create thoth public and private databases.

Thoth-private need only be accessible by thoth itself; thoth-public needs to be accessible by thoth, but possibly also by other processes. The default names (in the thoth.ini supplied) are "thoth" for the public database and "thothprivate" for the private one.

4.4. create tables within databases

See the dbspec.mysql and dbspec.pg files for examples.

4.5. set ownership and permissions in special cases

cgi-bin/status.cgi and condensed.cgi: thoth.thoth, 4755

check/ping: root.thoth, 4770

Ensure that perl-suid (that's the Debian package; it may be called other things in other systems) is installed and available.

4.6. edit thoth.ini

4.7. populate thoth-public monitoring database

This is most easily done with the bin/loader script. An example loader.ini file is supplied and should be customised. Alternatively, use whatever database front-end you usually use... this is the author's preferred method.

4.8. set up notification

Edit crontab:

* * * * * /usr/local/thoth/bin/notify /usr/local/thoth/thoth.ini

4.9. start thoth

Thoth is most easily run within a screen(1) session, with the command:

bin/thoth thoth.ini


5. Configuration

5.1. thoth.ini

Format is a standard Windows-style initialisation file.

5.1.1. [database-public] and [database-private]

DSN, user and pass will be fed directly to DBI->connect.
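As an illustration of the three settings only, here is a minimal Python sketch (not Thoth's own Perl code) that reads them from an ini section. The DSN string, user name and password are invented values, and the lower-case option lookup is a property of Python's configparser rather than of Thoth.

```python
import configparser

# Hypothetical thoth.ini fragment; the values are made up for this example.
SAMPLE_INI = """
[database-public]
DSN = DBI:mysql:database=thoth
user = thoth
pass = secret
"""

def read_db_settings(ini_text, section):
    """Return (dsn, user, password) from the named section of an ini file."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    block = parser[section]
    # configparser lower-cases option names, so "DSN" is read back as "dsn".
    return block["dsn"], block["user"], block["pass"]

dsn, user, password = read_db_settings(SAMPLE_INI, "database-public")
print(dsn, user)
```

In Thoth itself the equivalent three values are what end up as the arguments to the DBI connection call.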

5.1.2. [paths]

bin should point to the main binary directory.

check and notify should point to the binary directories in which the check and notify programs live.

scratch should be a temporary-file directory (which must exist). I'll say that again: the temporary file directory must exist! If all your monitors flip to "unknown" on startup, that's probably why.

html and cgi should be the paths to static html and cgi scripts; make appropriate changes to the http server to allow this to work.

5.1.3. [general]

cycle is the number of seconds between monitoring cycles.

timeout is the number of seconds after which a check command should be terminated.
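The effect of the timeout setting can be sketched as follows. This is a Python illustration under stated assumptions, not Thoth's implementation: the function name run_check is hypothetical, and the two commands are stand-ins for real check programs.

```python
import subprocess
import sys

def run_check(argv, timeout):
    """Run a check command; return its exit code, or None if it timed out."""
    try:
        completed = subprocess.run(
            argv,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising the exception.
        return None
    return completed.returncode

# Stand-in commands: one that exits at once, one that outlives the timeout.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_check(quick, timeout=10))
print(run_check(slow, timeout=0.5))
```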

5.1.4. [web]

html is the HTTP path to the html content.

cgi ditto, for CGI scripts.

refresh is the number of seconds between refreshes of the CGI pages.

5.2. loader.ini

The loader.ini file is used to add hosts to the thoth-public database (see below). Note that it does not delete old entries, or test for duplicates; you may wish to set clean=YES to clear the database while loading.

There are three main parts to the loader.ini file. Refer to the example file given.

5.2.1. [standard_services]

This section sets up the defaults for services to be defined later. In most environments, most service checks will be set up in the same way, with only a few exceptions.

5.2.1.1. checksep

The number of seconds between checks.

5.2.1.2. fails

The number of failures needed in succession for a check to be regarded as failed; normally 2, but could be higher on unstable networks.
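The "failures in succession" rule can be pictured with a small counter. This is a Python sketch of the idea only; the class name and the reset-on-success behaviour are assumptions drawn from the wording above, not taken from the Thoth source.

```python
class FailureCounter:
    """Count consecutive failed checks against a threshold."""

    def __init__(self, fails=2):
        self.fails = fails   # failures needed in succession
        self.streak = 0      # current run of consecutive failures

    def record(self, ok):
        """Record one check result; return True once the check counts as failed."""
        self.streak = 0 if ok else self.streak + 1
        return self.streak >= self.fails

counter = FailureCounter(fails=2)
results = [counter.record(ok) for ok in (False, True, False, False)]
print(results)  # [False, False, False, True]
```

A single failure followed by a success never trips the threshold, which is why a higher value may suit unstable networks.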

5.2.1.3. other entries

Define the default check command for a service, and the default dependency if any.

5.2.2. [contact] sections

Each section defines a single contact. See the contacts table below for more information.

5.2.3. [host] and [service] sections

See the table descriptions below. For convenience, a host may have a "standard_services" entry, containing a space-separated list of standard services as defined in the first part of the file. For hosts which have non-standard services, use a [service] entry, which may be "like=" a standard service.

5.3. Database entry

This is all done to the thoth public database.

HUP the thoth core process to re-load the database.

5.3.1. Table host

field  use
ip     The IP address of the host
fqdn   The fully-qualified domain name of the host
host   The alias of the host - must be unique within the monitoring configuration.

5.3.2. Table monitor

field         use
host          The host alias (as in the hosts table)
service       The service name - should be unique to this host
dephost       The host on which this service depends
depservice    The service on which this service depends
checkcommand  The command with which to check the service
checksep      The interval between checks (in seconds)
fails         The number of failures required to raise a critical alert
disabled      1 if checking for the service should be disabled

Some macros are available to check_command:

macro  use
%i     The IP address of the host
%f     The FQDN of the host
%h     The short name of the host
%s     The name of the service

The basic form of a check_command entry is "<command> -a %i". For most checks, this is sufficient; check the individual command listings below for more information.
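To make the macro substitution concrete, here is a minimal Python sketch. It is not how Thoth itself performs the expansion; the function name and the example host values are hypothetical.

```python
import re

def expand_macros(template, ip, fqdn, host, service):
    """Replace %i, %f, %h and %s in a check command template."""
    values = {"i": ip, "f": fqdn, "h": host, "s": service}
    # A single regex pass, so substituted text is never re-scanned for macros.
    return re.sub(r"%([ifhs])", lambda m: values[m.group(1)], template)

command = expand_macros(
    "ping -a %i",
    ip="192.0.2.10",
    fqdn="www.example.org",
    host="www",
    service="ping",
)
print(command)  # ping -a 192.0.2.10
```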

5.3.3. Table contacts

field     use
host      The host name (NULL for all hosts)
service   The service name (NULL for all services)
contact   Of the form mode:address.
delay     Time after the alert that the contact should be notified, in seconds.
severity  Notify only state changes above this level (0-4; see the severity table below).
timespec  A time period during which the contact should be notified (see below)

At present, only the "email" mode is supported directly. "smsq" and "faxq" modes are supported with sms_client (for sms) and mgetty-fax, sdf and ghostscript (for fax); see (and edit) bin/modemqueue, which should be cronned to run once per minute, and create the relevant database table.

The "ssh" and "local" modes are used to execute commands. The "contact" parameter in this case should be:

local:from:to:command

or

ssh:username:from:to:command

ssh will connect to the machine in question and execute the command; local will execute the command locally. The "from" and "to" parameters are lists of states (0=OK, 1=Pending, 2=Unknown, 3=Warning, 4=Critical); the command will only be activated on that state transition.

For example, "ssh:root:012:34:/etc/init.d/apache restart" would attempt to restart the apache server if it moves to a warning or critical state. The ssh key used for remote command execution is the default one in ~/.ssh/identity.
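The from/to matching described above can be sketched in Python as follows. This is an illustration under assumptions, not the code in notify/ssh: the function names and dictionary keys are hypothetical, and the split limit is chosen so that colons inside the command itself are preserved.

```python
def parse_command_contact(contact):
    """Split an ssh: or local: contact string into its parts."""
    mode, rest = contact.split(":", 1)
    if mode == "ssh":
        # Limit the split so any colons in the command are left alone.
        user, from_states, to_states, command = rest.split(":", 3)
    elif mode == "local":
        user = None
        from_states, to_states, command = rest.split(":", 2)
    else:
        raise ValueError("not a command contact: " + contact)
    return {"mode": mode, "user": user, "from": from_states,
            "to": to_states, "command": command}

def should_run(spec, old_state, new_state):
    """True if the old -> new transition is allowed by the from/to lists."""
    return str(old_state) in spec["from"] and str(new_state) in spec["to"]

spec = parse_command_contact("ssh:root:012:34:/etc/init.d/apache restart")
print(spec["command"])         # /etc/init.d/apache restart
print(should_run(spec, 0, 4))  # True: OK -> Critical
print(should_run(spec, 3, 4))  # False: Warning is not in the "from" list
```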

The delay parameter is used for escalation of alerts. For example, a 600-second delay will only raise an alert if the detected problem is still active after 10 minutes.
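A minimal Python sketch of the escalation rule, assuming times are held as unixtime seconds and that a contact becomes due once the delay has fully elapsed (the function name and the ">=" comparison are assumptions, not taken from the Thoth source):

```python
def due_for_notification(alert_time, now, delay, still_active):
    """True if the problem is still active and the contact's delay has elapsed."""
    return still_active and (now - alert_time) >= delay

# An alert raised at t=1000, checked against a 600-second delay.
print(due_for_notification(1000, 1300, 600, True))   # False: only 5 minutes in
print(due_for_notification(1000, 1600, 600, True))   # True: 10 minutes, still active
print(due_for_notification(1000, 1700, 600, False))  # False: problem has cleared
```

Giving a first-line contact a delay of 0 and a second-line contact a delay of 600 therefore escalates only those problems that last ten minutes.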

A contact notification will not be made outside the time period defined for the contact. Time specifications take the form date:time; both parts are optional, and a null time specification will be interpreted as "at all times". If both parts are present, both must be satisfied for the notification to be made.

The time part is of the form hhmm-hhmm, these being times on the 24-hour clock between which the notification should be performed.
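The time part alone can be checked with a few lines of Python. This sketch assumes that both end points are inclusive and that a window never crosses midnight; neither point is stated in this document, and the function name is hypothetical.

```python
def in_time_window(window, hhmm):
    """True if the hhmm clock time falls within an hhmm-hhmm window."""
    start, end = (int(part) for part in window.split("-"))
    return start <= int(hhmm) <= end

print(in_time_window("0900-1730", "1200"))  # True
print(in_time_window("0900-1730", "1731"))  # False
print(in_time_window("2100-2359", "2330"))  # True
```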

The date part can take either of two forms:

The severity parameter determines the severity of alert necessary before the contact is invoked.

value  severity
0      show all alerts
1      ignore OK/Pending transitions (show Unknown/Warning/Critical only)
2      ignore OK/Pending/Unknown transitions (show Warning/Critical only)
3      ignore OK/Pending/Unknown/Warning transitions (show Critical only)
4      ignore all transitions (disable contact)
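Read against the state codes (0=OK, 1=Pending, 2=Unknown, 3=Warning, 4=Critical), the severity table reduces to a one-line rule. The Python sketch below assumes the filter is applied to the new state of a transition; the function name is hypothetical.

```python
def passes_severity(severity, new_state):
    """True if a transition into new_state should reach this contact."""
    # Severity 0 shows everything; otherwise the new state must be
    # strictly above the severity value.
    return severity == 0 or new_state > severity

for severity in range(5):
    shown = [state for state in range(5) if passes_severity(severity, state)]
    print(severity, shown)
```

For severity 2, for instance, the states shown are 3 and 4 (Warning and Critical), matching the table.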

Examples of timespec:

timespec                     definition
Mon-Fri:0900-1730            standard working hours
Sat:2100-2359;Sun:0000-0900  a night-shift over the weekend

5.4. Individual check modules

5.4.1. dns

The DNS module has one required parameter, -d with the hostname or IP address to be resolved. Its optional parameter is -u; if this is given, the response to the query must be authoritative.

5.4.2. cascademaster

The cascademaster module connects to remote thoth servers via ssh to return their monitoring data. In this way, multiple monitoring servers can be placed network-close to the systems they monitor, while still giving a single integrated view of the network. To enable this function, the remote machine will have to be configured appropriately:

command="bin/cascadeslave"

This is actually optional - thoth will monitor successfully without this restriction - but highly recommended for obvious reasons of security.

5.4.3. sshremote

The sshremote module connects to remote Unix servers via ssh to return data on disc space, free memory and swap, and system load. To enable this function, the remote machine will have to be configured appropriately:

command="/home/seshat/remote"

(or otherwise, depending on the path of the home directory). This is actually optional - thoth will monitor successfully without this restriction - but highly recommended for obvious reasons of security.


6. Operation

6.1. Signals

Thoth will shut down cleanly if it receives any of SIGINT, SIGQUIT, SIGABRT, SIGTERM or SIGSTOP. On a SIGHUP, it will reload the monitoring list from the thoth-public database; this is the normal way of adding or deleting monitors while Thoth is in operation. On a SIGUSR1, it will re-invoke itself without losing current monitored values. This has the same effects as SIGHUP, but can be handy if you are running a version of perl with a progressive memory leak; I have observed this with perl 5.005, though not with perl 5.6.


7. Ancillary programs

7.1. thothconsole

Thothconsole needs write access to the thoth-public database; the user running it should also have permission to HUP the thoth daemon, possibly via sudo or ssh. Thothconsole will configure itself from ~/thoth.ini or /etc/thoth.ini (which should include a database-public section).

Thothconsole requires the Curses, Curses::Widgets and Curses::Forms modules as well as DBI and Config::IniFiles. It need not be run on the same host as thoth, as long as database permissions grant access.

7.2. thothtkconsole

Thothtkconsole needs write access to the thoth-public database; the user running it should also have permission to HUP the thoth daemon, possibly via sudo or ssh. Thothtkconsole will configure itself from ~/thoth.ini or /etc/thoth.ini (which should include a database-public section).

Thothtkconsole requires the Tk and Tk::Pane modules as well as DBI and Config::IniFiles. It need not be run on the same host as thoth, as long as database permissions grant access.

7.3. thothtkviewer

Thothtkviewer needs read access to the thoth-private database. It checks a thoth.ini file for logon details (specifically the [database-private] section); it will also check [web] refresh for its cycle time.

Thothtkviewer requires the Tk and Tk::Pane modules as well as DBI and Config::IniFiles. Its interface is substantially similar to the web interface; however, the popup windows generated in the condensed view will be updated every (cycle) seconds until they are closed.

7.4. Web interface

The web interface is, at present, entirely an observation tool; it does not allow modification of the system in any way. It is mostly self-explanatory: "Summary status" gives a per-host summary, "Errors only" shows any monitors not currently OK, and "Full status" shows all monitors. "Condensed view" shows a summary of all hosts and services; the "Hosts list" is a generated set of static pages giving details of what is being monitored.

From the status lists, it is possible to get views of all services on a specific host, all hosts providing a specific service, or a combination thereof. This also allows viewing of the event log.


8. Expansion

8.1. Check programs

Must as a minimum accept the options:

option  use
-o      Output-file prefix
-s      Service name
-h      Show command-line help

The first two will be supplied automatically by the main thoth program. See any of the supplied monitors, or check/example, for more information.
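For orientation, option handling for a new check program might look like the Python sketch below. It covers only the three options listed here plus the -a address option from the basic check command form; the function name and dictionary keys are hypothetical, and what a check writes to its output file is deliberately left out because it is not described in this document.

```python
import getopt

def parse_check_options(argv):
    """Parse the minimum option set for a check program, plus -a."""
    opts, _ = getopt.getopt(argv, "o:s:a:h")
    parsed = {"output_prefix": None, "service": None,
              "address": None, "help": False}
    for flag, value in opts:
        if flag == "-o":
            parsed["output_prefix"] = value
        elif flag == "-s":
            parsed["service"] = value
        elif flag == "-a":
            parsed["address"] = value
        elif flag == "-h":
            parsed["help"] = True
    return parsed

# Example values only; the path and service name are made up.
print(parse_check_options(["-o", "/tmp/thoth-scratch/x", "-s", "http", "-a", "192.0.2.10"]))
```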

8.2. Notification programs

Should normally return quickly.

Will be given, as successive lines on stdin:

  1. The notification address (from the contacts table)
  2. The host
  3. The service
  4. The old status code
  5. The new status code
  6. The information line
  7. The unixtime
  8. The event log ID
  9. The host's IP
  10. The FQDN of the host
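The ten stdin lines above map naturally onto named fields. The following Python sketch shows one way a notification program could read them; the field names are invented for the example, and the sample values are made up.

```python
import io

# Hypothetical names for the ten lines, in the order listed above.
FIELDS = ("address", "host", "service", "old_status", "new_status",
          "info", "unixtime", "event_id", "ip", "fqdn")

def read_notification(stream):
    """Read one notification (ten successive lines) into a dict."""
    lines = [stream.readline().rstrip("\n") for _ in FIELDS]
    return dict(zip(FIELDS, lines))

# A real program would pass sys.stdin; StringIO stands in for it here.
sample = io.StringIO(
    "ops@example.org\nwww\nhttp\n0\n4\nconnection refused\n"
    "1005300000\n42\n192.0.2.10\nwww.example.org\n"
)
event = read_notification(sample)
print(event["host"], event["service"], event["old_status"], "->", event["new_status"])
```

Since notification programs should normally return quickly, any slow delivery step is better queued than performed inline, as the smsq and faxq modes do.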

9. Support and development

9.1. Mailing list

There is a mailing list, thoth-devel@firedrake.org, for discussion of the program and suggestions for additional features. To subscribe, send mail to thoth-devel-request@firedrake.org with the subject line "subscribe".