1. Description
1.1. General
Thoth is a network monitoring system, designed to report on networked devices and services, and alert administrators when they are not functioning normally. It has been designed and developed primarily under Linux, though other Unix systems should be able to run it. It will not be usable under Windows, since it requires the ability to fork() from within Perl.
Features include:
- Monitoring of network services (SMTP, POP3, HTTP, NNTP, ping, etc.)
- Remote monitoring of Unix host resources (processor load, disk usage, etc.)
- Master/slave cascade monitoring - combine multiple monitor servers from different locations into an integrated view
- Simple plugin interface to allow for user-developed service checks and event handlers
- Contact notifications when problems occur and get resolved (via email, SMS, fax, or user-defined method)
- Ability to define event handlers to be run during service events for automated problem resolution attempts
- Web interface for viewing
  - current network status
  - (NOT YET) log file
  - (NOT YET) notifications
  - (NOT YET) event history
  - (NOT YET) network layout
- Fully parallelized service checks
- Host/service dependencies
- (NOT YET) Automatic log rotation
- (NOT YET) web interface for adding monitors etc.
- (NOT YET) Console-mode status view
- Console-mode client for adding/editing monitors
- Tk client for status view
- Tk client for adding/editing monitors
1.2. History
Thoth is the result of two years' work in ISP network operations centres. I have tried a variety of systems - principally NOCOL and Netsaint - but have been dissatisfied with them all. Requiring a database back-end considerably simplified the process of updating and notification.
The first time Thoth was used seriously, the host it was required to test - which had crashed five times in the previous 24 hours - promptly ran perfectly for the next 48. This performance is not guaranteed to be replicated on your own system.
2. Prerequisites
Perl. All thoth systems work happily on 5.005 and 5.6. Some perl scripts are suid, which may require additional support packages.
A database server supported by DBI. Thoth has been developed with MySQL and PostgreSQL, but should work on other databases, as long as an auto_increment or equivalent function is available.
If you plan to use the web interface, a web server.
Perl modules from CPAN. Getopt::Std and POSIX should already be installed in any case. Config::IniFiles and DBI (and of course the DBD module for the selected database) are critical; Net::Ping, Net::SNMP, Net::DNS and especially Net::TCP are used by the service checks; Text::Wrap is used by the plaintext-emitting parts of the system. Text::CSV_XS is used by the cascade monitoring modules.
ssh is required by the sshremote service checker.
3. Components
3.1. Core
bin/thoth is the core component; it handles monitor scheduling, execution and database updates.
3.2. Individual monitoring elements
Installed in check/*
3.2.1. cascademaster
Gathers monitoring information from a remote thoth server.
3.2.2. dns
Tests a Domain Name Service server; will check for authoritative zone serving if required.
3.2.3. ftp
Tests an FTP server; will attempt login if required.
3.2.4. http
Tests a web server with basic-mode authentication if required (but no https support).
3.2.5. imap
Tests an IMAP server; will attempt login if required.
3.2.6. mysql
Tests usability of a MySQL database.
3.2.7. nntp
Tests for connectivity and a valid response (with or without posting permission).
3.2.8. ping
Tests for machine reachability.
3.2.9. pop3
Tests a POP3 server; will attempt login if required.
3.2.10. postgresql
Tests a PostgreSQL database server.
3.2.11. smtp
Tests an SMTP server.
3.2.12. snmp
Tests an SNMP variable.
3.2.13. ssh
Tests an SSH server.
3.2.14. sshremote
Conducts tests of disc space, load, etc., on remote Unix machines.
3.2.15. sybase
Tests a Sybase server (including MS-SQL).
3.2.16. tcp
Tests for port reachability.
3.2.17. vnc
Tests a VNC server.
3.3. Web interface
cgi-bin/status.cgi is the CGI script which generates status information for the web interface. All other content is static and generated when necessary.
3.4. Notification interface
notify/* are the notification programs
3.4.1. email
Given an email address, will send a suitably-formatted notification message.
3.4.2. ssh
Given a user:oldmap:newmap:command as its address, will attempt to make an ssh connection to the host and run the command. This has obvious utility for restarting failed servers. The map specifications determine the status changes allowed: an oldmap of 0,3 and a newmap of 4 will activate the command only for a change from OK or Warning to Critical.
3.4.3. smsq
Will queue a Short Message, which will be sent by modemqueue.
3.4.4. faxq
Will queue a fax, which will be sent by modemqueue.
4. Installation
4.1. create user and group thoth
4.2. extract tarball
My own policy is to put all files under /usr/local/thoth/ .
4.3. create thoth public and private databases.
Thoth-private need only be accessible by thoth itself; thoth-public needs to be accessible by thoth, but possibly also by other processes. The default names (in the thoth.ini supplied) are "thoth" for the public database and "thothprivate" for the private one.
4.4. create tables within databases
See the dbspec.mysql and dbspec.pg files for examples.
4.5. set ownership and permissions in special cases
cgi-bin/status.cgi and condensed.cgi: thoth.thoth, 4755
check/ping: root.thoth, 4770
Ensure that perl-suid (that's the Debian package; it may be called other things in other systems) is installed and available.
4.6. edit thoth.ini
4.7. populate thoth-public monitoring database
This is most easily done with the bin/loader script. An example loader.ini file is supplied and should be customised. Alternatively, use whatever database front-end you usually use... this is the author's preferred method.
4.8. set up notification
Edit crontab:
* * * * * /usr/local/thoth/bin/notify /usr/local/thoth/thoth.ini
4.9. start thoth
Thoth is most easily run within a screen(1) session, with the command:
bin/thoth thoth.ini
5. Configuration
5.1. thoth.ini
Format is a standard Windows-style initialisation file.
5.1.1. [database-public] and [database-private]
DSN, user and pass will be fed directly to DBI->connect.
5.1.2. [paths]
bin should point to the main binary directory.
check and notify should point to the binary directories in which the check and notify programs live.
scratch should be a temporary-file directory (which must exist). I'll say that again: the temporary file directory must exist! If all your monitors flip to "unknown" on startup, that's probably why.
html and cgi should be the paths to static html and cgi scripts; make appropriate changes to the http server to allow this to work.
5.1.3. [general]
cycle is the number of seconds between monitoring cycles.
timeout is the number of seconds after which a check command should be terminated.
5.1.4. [web]
html is the HTTP path to the html content.
cgi ditto, for CGI scripts.
refresh is the number of seconds between refreshes of the CGI pages.
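As a rough illustration of the sections above, the following sketch checks a thoth.ini-style file for the keys described and for the existence of the scratch directory. It is written in Python purely for illustration (Thoth itself is Perl and reads the file with Config::IniFiles); every value in SAMPLE_INI is a hypothetical placeholder, and check_ini is an invented helper, not part of Thoth.

```python
import configparser
import os
import tempfile

# Hypothetical values; only the section and key names come from the text above.
SAMPLE_INI = """
[database-public]
DSN = DBI:mysql:thoth
user = thoth
pass = secret

[database-private]
DSN = DBI:mysql:thothprivate
user = thoth
pass = secret

[paths]
bin = /usr/local/thoth/bin
check = /usr/local/thoth/check
notify = /usr/local/thoth/notify
scratch = {scratch}
html = /usr/local/thoth/html
cgi = /usr/local/thoth/cgi-bin

[general]
cycle = 60
timeout = 30

[web]
html = /thoth
cgi = /cgi-bin/thoth
refresh = 120
"""

# configparser lower-cases option names, so "DSN" is looked up as "dsn".
REQUIRED = {
    "database-public": ["dsn", "user", "pass"],
    "database-private": ["dsn", "user", "pass"],
    "paths": ["bin", "check", "notify", "scratch", "html", "cgi"],
    "general": ["cycle", "timeout"],
    "web": ["html", "cgi", "refresh"],
}


def check_ini(text):
    """Return a list of problems found in thoth.ini-style text."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    problems = []
    for section, keys in REQUIRED.items():
        if not cfg.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key in keys:
            if not cfg.has_option(section, key):
                problems.append(f"missing {key} in [{section}]")
    # The scratch directory must already exist before thoth starts.
    if cfg.has_option("paths", "scratch"):
        scratch = cfg.get("paths", "scratch")
        if not os.path.isdir(scratch):
            problems.append(f"scratch directory does not exist: {scratch}")
    return problems


# Usage: an existing scratch directory passes, a missing one is reported.
with tempfile.TemporaryDirectory() as scratch_dir:
    print(check_ini(SAMPLE_INI.format(scratch=scratch_dir)))
print(check_ini(SAMPLE_INI.format(scratch="/nonexistent/thoth-scratch")))
```

Running a check of this kind before starting thoth catches the missing scratch directory early, rather than via every monitor flipping to "unknown".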
5.2. loader.ini
The loader.ini file is used to add hosts to the thoth-public database (see below). Note that it does not delete old entries, or test for duplicates; you may wish to set clean=YES to clear the database while loading.
There are three main parts to the loader.ini file. Refer to the example file given.
5.2.1. [standard_services]
This section sets up the defaults for services to be defined later. In most environments, most service checks will be set up in the same way, with only a few exceptions.
5.2.1.1. checksep
The number of seconds between checks
5.2.1.2. fails
The number of failures needed in succession for a check to be regarded as failed; normally 2, but could be higher on unstable networks.
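The "failures in succession" rule can be pictured with the small sketch below. It is Python for illustration only and assumes that any successful check resets the count; how the thoth core actually keeps this count is not described here, and FailureCounter is an invented name.

```python
class FailureCounter:
    """Track consecutive failures for one monitor (illustrative only)."""

    def __init__(self, fails=2):
        self.fails = fails        # threshold, as in the "fails" setting
        self.consecutive = 0      # failures seen in a row so far

    def record(self, check_ok):
        """Record one check result; return True once the monitor counts as failed."""
        if check_ok:
            self.consecutive = 0  # assumption: any success resets the run
        else:
            self.consecutive += 1
        return self.consecutive >= self.fails


counter = FailureCounter(fails=2)
print([counter.record(ok) for ok in (False, True, False, False)])
```

With fails set to 2, a single dropped check followed by a success never raises an alert, which is the point of raising the value on unstable networks.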
5.2.1.3. other entries
Define the default check command for a service, and the default dependency if any.
5.2.2. [contact] sections
Each section defines a single contact. See the contacts table below for more information.
5.2.3. [host] and [service] sections
See the table descriptions below. For convenience, a host may have a "standard_services" entry, containing a space-separated list of standard services as defined in the first part of the file. For hosts which have non-standard services, use a [service] entry, which may be "like=" a standard service.
5.3. Database entry
This is all done to the thoth public database.
HUP the thoth core process to re-load the database.
5.3.1. Table host
| field | use |
| ip | The IP address of the host |
| fqdn | The fully-qualified domain name of the host |
| host | The alias of the host - must be unique within the monitoring configuration. |
5.3.2. Table monitor
| field | use |
| host | The host alias (as in the host table) |
| service | The service name - should be unique to this host |
| dephost | The host on which this service depends |
| depservice | The service on which this service depends |
| checkcommand | The command with which to check the service |
| checksep | The interval between checks (in seconds) |
| fails | The number of failures required to raise a critical alert |
| disabled | 1 if checking for the service should be disabled |
Some macros are available to checkcommand:
| macro | use |
| %i | The IP address of the host |
| %f | The FQDN of the host |
| %h | The short name of the host |
| %s | The name of the service |
The basic form of a checkcommand entry is "<command> -a %i". For most checks, this is sufficient; check the individual command listings below for more information.
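The effect of the macros can be sketched as a simple substitution. This is Python for illustration only; expand_macros is an invented helper, the host values are placeholders, and details of Thoth's own expansion (for example, how a literal "%" is handled) are not covered here.

```python
import re


def expand_macros(command, ip, fqdn, host, service):
    """Expand %i, %f, %h and %s in a check command string."""
    values = {"i": ip, "f": fqdn, "h": host, "s": service}
    # One pass over the string, so text inserted for one macro is never
    # re-scanned for further macros.
    return re.sub(r"%([ifhs])", lambda m: values[m.group(1)], command)


# Hypothetical host record: alias "www", service "ping".
print(expand_macros("ping -a %i", "192.0.2.10", "www.example.com", "www", "ping"))
```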
5.3.3. Table contacts
| field | use |
| host | The host name (NULL for all hosts) |
| service | The service name (NULL for all services) |
| contact | Of the form mode:address. |
| delay | Time after the alert that the contact should be notified, in seconds. |
| severity | Notify only state changes above this level (0-4; see the table below). |
| timespec | A time period during which the contact should be notified (see below) |
At present, only the "email" mode is supported directly. "smsq" and "faxq" modes are supported with sms_client (for sms) and mgetty-fax, sdf and ghostscript (for fax); see (and edit) bin/modemqueue, which should be cronned to run once per minute, and create the relevant database table.
The "ssh" and "local" modes are used to execute commands. The "contact" parameter in this case should be:
local:from:to:command
or
ssh:username:from:to:command
ssh will connect to the machine in question and execute the command; local will execute the command locally. The "from" and "to" parameters are lists of states (0=OK, 1=Pending, 2=Unknown, 3=Warning, 4=Critical); the command will only be activated on that state transition.
For example, "ssh:root:012:34:/etc/init.d/apache restart" would attempt to restart the apache server if it moves to a warning or critical state. The ssh key used for remote command execution is the default one in ~/.ssh/identity.
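The way such an entry breaks down, and when its command would fire, can be sketched as follows. This is Python for illustration only; the function names are invented, the "local" command path is a placeholder, and accepting both "012" and "0,3" as state lists is an assumption made because both spellings appear in this document.

```python
def _states(text):
    """Turn "012" or "0,1,2" into the set {0, 1, 2}."""
    return {int(ch) for ch in text if ch.isdigit()}


def parse_command_contact(contact):
    """Split a "local" or "ssh" contact entry into its parts."""
    mode, rest = contact.split(":", 1)
    if mode == "ssh":
        # maxsplit keeps any colons inside the command itself intact
        user, from_states, to_states, command = rest.split(":", 3)
    elif mode == "local":
        user = None
        from_states, to_states, command = rest.split(":", 2)
    else:
        raise ValueError(f"not a command-executing mode: {mode}")
    return {"mode": mode, "user": user, "from": _states(from_states),
            "to": _states(to_states), "command": command}


def should_run(entry, old_state, new_state):
    """True if the old -> new transition is one the entry asks for."""
    return old_state in entry["from"] and new_state in entry["to"]


entry = parse_command_contact("ssh:root:012:34:/etc/init.d/apache restart")
print(entry["command"], should_run(entry, 0, 4), should_run(entry, 3, 4))
```

Note that under this reading the example entry does not fire on a change from Warning (3) to Critical (4), because 3 is not in its "from" list.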
The delay parameter is used for escalation of alerts. For example, a 600-second delay will only raise an alert if the detected problem is still active after 10 minutes.
A contact notification will not be made outside the time period defined for the contact. Time specifications take the form date:time; both parts are optional, and a null time specification will be interpreted as "at all times". If both parts are present, both must be satisfied for the notification to be made. Several specifications may be joined with semicolons, as in the second example below.
The time part is of the form hhmm-hhmm, these being times on the 24-hour clock between which the notification should be performed.
The date part can take either of two forms:
- Day(-Day), where Day is a three-letter abbreviation of the day of the week;
- Date(-Date), where Date is a numeric date of the form d+/m+/y+. (Four-digit years are required.)
The severity parameter determines the severity of alert necessary before the contact is invoked.
| value | severity |
| 0 | show all alerts |
| 1 | ignore OK/Pending transitions (show Unknown/Warning/Critical only) |
| 2 | ignore OK/Pending/Unknown transitions (show Warning/Critical only) |
| 3 | ignore OK/Pending/Unknown/Warning transitions (show Critical only) |
| 4 | ignore all transitions (disable contact) |
Examples of timespec:
| timespec | definition |
| Mon-Fri:0900-1730 | standard working hours |
| Sat:2100-2359;Sun:0000-0900 | a night-shift over the weekend |
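One possible reading of the timespec rules is sketched below, in Python for illustration only. It is not Thoth's own parser. The assumptions are: semicolons separate alternatives of which any one may match (as the second example suggests), both ends of a range are inclusive, and a day range such as Sat-Mon wraps around the week.

```python
import re
from datetime import date, datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]  # order of date.weekday()
TIME_RE = re.compile(r"^(\d{2})(\d{2})-(\d{2})(\d{2})$")


def _parse_date(text):
    d, m, y = (int(part) for part in text.split("/"))
    return date(y, m, d)


def _date_ok(spec, when):
    first, _, last = spec.partition("-")
    last = last or first                      # "Sat" is treated as "Sat-Sat"
    if "/" in first:                          # numeric d/m/yyyy form
        return _parse_date(first) <= when.date() <= _parse_date(last)
    lo, hi = DAYS.index(first.lower()), DAYS.index(last.lower())
    today = when.weekday()
    if lo <= hi:
        return lo <= today <= hi
    return today >= lo or today <= hi         # wrap-around range (assumption)


def _time_ok(spec, when):
    m = TIME_RE.match(spec)
    if not m:
        raise ValueError(f"bad time part: {spec}")
    h1, m1, h2, m2 = (int(g) for g in m.groups())
    return (h1, m1) <= (when.hour, when.minute) <= (h2, m2)


def timespec_matches(timespec, when):
    """True if a notification would be allowed at datetime `when`."""
    if not timespec:
        return True                           # null timespec: at all times
    for alternative in timespec.split(";"):
        if not alternative:
            continue
        date_part, sep, time_part = alternative.partition(":")
        if not sep and TIME_RE.match(date_part):
            date_part, time_part = "", date_part   # time given without a date
        if ((not date_part or _date_ok(date_part, when)) and
                (not time_part or _time_ok(time_part, when))):
            return True
    return False


# 3 January 2024 was a Wednesday.
print(timespec_matches("Mon-Fri:0900-1730", datetime(2024, 1, 3, 10, 0)))
```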
5.4. Individual check modules
5.4.1. dns
The DNS module has one required parameter, -d with the hostname or IP address to be resolved. Its optional parameter is -u; if this is given, the response to the query must be authoritative.
5.4.2. cascademaster
The cascademaster module connects to remote thoth servers via ssh to return their monitoring data. In this way, multiple monitoring servers can be placed network-close to the systems they monitor, while still giving a single integrated view of the network. To enable this function, the remote machine will have to be configured appropriately:
- Create, under the thoth account on the master machine, an ssh key with null passphrase.
- Copy the identity.pub file of this key into the .ssh/authorized_keys file on the slave machine.
- Add to the beginning of that line in the file the clause
command="bin/cascadeslave"
This is actually optional - thoth will monitor successfully without this restriction - but highly recommended for obvious reasons of security.
5.4.3. sshremote
The sshremote module connects to remote Unix servers via ssh to return data on disc space, free memory and swap, and system load. To enable this function, the remote machine will have to be configured appropriately:
- Create, under the thoth account, an ssh key with null passphrase. (This could be the same key as is used for cascademaster.)
- Create a user account on the remote machine. An existing account can be used if necessary. The default account name is "seshat" (variously attributed as the wife or daughter of Thoth).
- Copy into the seshat account, as .ssh/authorized_keys, the identity.pub file of the ssh key created earlier.
- Add to the beginning of that line in the file the clause
command="/home/seshat/remote"
(or otherwise, depending on the path of the home directory). This is actually optional - thoth will monitor successfully without this restriction - but highly recommended for obvious reasons of security.
- Copy into the seshat account the "remote" script file, and chmod it 700.
6. Operation
6.1. Signals
Thoth will shut down cleanly if it receives any of SIGINT, SIGQUIT, SIGABRT, SIGTERM or SIGSTOP. On a SIGHUP, it will reload the monitoring list from the thoth-public database; this is the normal way of adding or deleting monitors while Thoth is in operation. On a SIGUSR1, it will re-invoke itself without losing current monitored values. This has the same effects as SIGHUP, but can be handy if you are running a version of perl with a progressive memory leak; I have observed this with perl 5.005, though not with perl 5.6.
7. Ancillary programs
7.1. thothconsole
Thothconsole needs write access to the thoth-public database; the user running it should also have permission to HUP the thoth daemon, possibly via sudo or ssh. Thothconsole will configure itself from ~/thoth.ini or /etc/thoth.ini (which should include a database-public section).
Thothconsole requires the Curses, Curses::Widgets and Curses::Forms modules as well as DBI and Config::IniFiles. It need not be run on the same host as thoth, as long as database permissions grant access.
7.2. thothtkconsole
Thothtkconsole needs write access to the thoth-public database; the user running it should also have permission to HUP the thoth daemon, possibly via sudo or ssh. Thothtkconsole will configure itself from ~/thoth.ini or /etc/thoth.ini (which should include a database-public section).
Thothtkconsole requires the Tk and Tk::Pane modules as well as DBI and Config::IniFiles. It need not be run on the same host as thoth, as long as database permissions grant access.
7.3. thothtkviewer
Thothtkviewer needs read access to the thoth-private database. It checks a thoth.ini file for logon details (specifically the [database-private] section); it will also check [web] refresh for its cycle time.
Thothtkviewer requires the Tk and Tk::Pane modules as well as DBI and Config::IniFiles. Its interface is substantially similar to the web interface; however, the popup windows generated in the condensed view will be updated every (cycle) seconds until they are closed.
7.4. Web interface
The web interface is, at present, entirely an observation tool; it does not allow modification of the system in any way. It is mostly self-explanatory: "Summary status" gives a per-host summary, "Errors only" shows any monitors not currently OK, and "Full status" shows all monitors. "Condensed view" shows a summary of all hosts and services; the "Hosts list" is a generated set of static pages giving details of what is being monitored.
From the status lists, it is possible to get views of all services on a specific host, all hosts providing a specific service, or a combination thereof. This also allows viewing of the event log.
8. Expansion
8.1. Check programs
Must as a minimum accept the options:
| option | use |
| -o | Output-file prefix |
| -s | Service name |
| -h | Show command-line help |
The first two will be supplied automatically by the main thoth program. See any of the supplied monitors, or check/example, for more information.
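The option handling alone might look like the sketch below. It is Python for illustration only: the program name is a placeholder, -a is taken from the basic checkcommand form described earlier, and what a check writes to files named with the -o prefix is not covered here (check/example remains the reference for that).

```python
import argparse


def parse_check_args(argv):
    """Parse the options every check program is expected to accept."""
    parser = argparse.ArgumentParser(
        prog="example-check",                  # hypothetical name
        description="Skeleton of a Thoth check program")
    # -h is provided by argparse itself and prints command-line help.
    parser.add_argument("-o", dest="output_prefix", required=True,
                        help="output-file prefix (supplied by thoth)")
    parser.add_argument("-s", dest="service", required=True,
                        help="service name (supplied by thoth)")
    parser.add_argument("-a", dest="address",
                        help="address to test, e.g. from the %%i macro")
    return parser.parse_args(argv)


# Hypothetical invocation, as thoth might build it from "<command> -a %i".
args = parse_check_args(["-o", "/tmp/thoth/www-http", "-s", "http",
                         "-a", "192.0.2.10"])
print(args.output_prefix, args.service, args.address)
```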
8.2. Notification programs
Should normally return quickly.
Will be given, as successive lines on stdin:
- The notification address (from the contacts table)
- The host
- The service
- The old status code
- The new status code
- The information line
- The unixtime
- The event log ID
- The host's IP
- The FQDN of the host
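Reading that input could be sketched as follows. This is Python for illustration only; the field names are invented labels for the ten lines listed above, the sample values are placeholders, and treating the status codes, time and event ID as decimal integers is an assumption.

```python
import io
from collections import namedtuple

Notification = namedtuple("Notification", [
    "address", "host", "service", "old_status", "new_status",
    "info", "unixtime", "event_id", "ip", "fqdn"])


def read_notification(stream):
    """Read the ten input lines, in the order listed above."""
    lines = [stream.readline().rstrip("\n") for _ in Notification._fields]
    n = Notification(*lines)
    # Assumption: status codes, time and event ID arrive as decimal integers.
    return n._replace(old_status=int(n.old_status),
                      new_status=int(n.new_status),
                      unixtime=int(n.unixtime),
                      event_id=int(n.event_id))


# A real notification program would pass sys.stdin; StringIO stands in here.
sample = io.StringIO(
    "admin@example.com\nwww\nhttp\n0\n4\nconnection refused\n"
    "1700000000\n42\n192.0.2.10\nwww.example.com\n")
note = read_notification(sample)
print(note.host, note.service, note.old_status, note.new_status)
```

Reading all ten lines up front and then handing off the slow work (sending mail, queueing a message) keeps the program in line with the requirement to return quickly.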
9. Support and development
9.1. Mailing list
There is a mailing list, thoth-devel@firedrake.org, for discussion of the program and suggestions for additional features. To subscribe, send mail to thoth-devel-request@firedrake.org with the subject line "subscribe".