Better Enterprise/Network Monitoring with Component Diagnostics

And when I say "Enterprise" I do mean a collection of servers and application components, although the "U.S.S. Enterprise" does have a cameo!

Most server-based applications installed today have no system monitoring. The way you find out the system is down is that the phone calls start coming. Sometimes at home. This sort of thing can really put a damper on the pay raise department.

And this topic gets richer - when you have to do an emergency upgrade, how do you know something you changed didn't screw up the production system in a whole different way? Phone calls again? More pay raise dampening.

A collection of simple system monitors and diagnostics can help you to be sure that your application is installed correctly and currently functioning correctly. And when there is a problem, not only can you be notified immediately, you can view the information to find out where, exactly, the problem is. If the problem is fixed quickly, or fixed before anybody notices, this bodes well for the pay raise stuff.

Nagios is a popular tool for a variety of system monitoring, but in my experience it is used almost exclusively for monitoring hardware. So if your application can no longer talk to the database, Nagios thinks everything is just fine. The solution is to add some simple diagnostics to your app that you and Nagios can use to make sure everything is okay.

A simple servlet is a great way to go for this sort of thing. A diagnostic can return something like "OK" or "FAIL", so any web browser can run a diagnostic on a component. Plus, Nagios (and similar products) can exercise web pages too.
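
To make the "OK"/"FAIL" idea concrete, here is a minimal sketch of a helper that builds the text a diagnostic page returns. The class name DiagnosticReport and its methods are made up for illustration; they are not part of the servlet API or of Nagios. A servlet's doGet would simply write render() to the response as plain text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: collects check results and renders the response text. */
final class DiagnosticReport {
    // Check name -> failure detail, or null when the check passed.
    private final Map<String, String> results = new LinkedHashMap<>();

    void pass(String check) {
        results.put(check, null);
    }

    void fail(String check, String detail) {
        results.put(check, detail);
    }

    boolean allPassed() {
        return results.values().stream().allMatch(detail -> detail == null);
    }

    /** First line is "OK" or "FAIL"; the rest is one human-readable line per check. */
    String render() {
        StringBuilder out = new StringBuilder(allPassed() ? "OK" : "FAIL").append('\n');
        for (Map.Entry<String, String> e : results.entrySet()) {
            String status = e.getValue() == null ? "ok" : "FAIL (" + e.getValue() + ")";
            out.append(e.getKey()).append(": ").append(status).append('\n');
        }
        return out.toString();
    }
}
```

Keeping the verdict on the first line means a monitoring tool only has to look at one line, while a person with a browser still gets the details underneath.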

In "Star Trek: The Next Generation", Picard will tell somebody to run a "Level 3 diagnostic" for some part of the Enterprise. It turns out that the writers of this show have worked out what, exactly, that means. The following is an attempt to use their terminology to fit the needs of server development. Please note that level 1 and level 2 diagnostics require shutting down the application. I think that this sort of thing is better represented by testing, so I've left those out.

Level 5 Diagnostic

    Select a few things to do that can complete in two seconds or less.
    Configure Nagios to exercise this diagnostic about once every two minutes.
    The first line of text returned must be "OK" or "FAIL" for Nagios.
    Additional human-readable text would be nice.
    Some things that might be exercised:
      read from the database (verify that communication with the database is functioning)
      ability to put a NO-OP message into a JMS queue (verify that JMS is functioning)
      make sure all JMS queue sizes meet expected parameters
      verify that app can read and write to the file system (testing for permission problems and file handle problems)
      number of objects in memory meets a threshold
      number of objects in the file system meets a threshold
      certain files (logs?) are not getting too large
      certain recent activities are within "normal operating parameters"
      HTTP server is serving a tiny web page
      recently logged error or warning
      is data current?
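
The two-second limit is the part that is easy to get wrong, so here is one way it might be enforced: run every check on its own thread and give all of them one shared deadline. This is a sketch under assumptions, not a definitive implementation. The class Level5Diagnostic, the run method and the sample file system check are names I made up; a database or JMS check would be just another Callable in the map.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical runner for a time-boxed diagnostic. */
final class Level5Diagnostic {

    /** Runs all checks against one shared deadline; first line of the result is "OK" or "FAIL". */
    static String run(Map<String, Callable<Boolean>> checks, long budgetMillis) {
        ExecutorService pool = Executors.newCachedThreadPool();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMillis);

        // Start every check at once so a slow one does not eat the budget of the others.
        Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
        checks.forEach((name, check) -> pending.put(name, pool.submit(check)));

        boolean ok = true;
        StringBuilder details = new StringBuilder();
        for (Map.Entry<String, Future<Boolean>> e : pending.entrySet()) {
            String status;
            try {
                long remaining = Math.max(0, deadline - System.nanoTime());
                status = e.getValue().get(remaining, TimeUnit.NANOSECONDS) ? "ok" : "FAIL";
            } catch (Exception ex) { // timed out, interrupted, or the check itself threw
                status = "FAIL (" + ex.getClass().getSimpleName() + ")";
            }
            ok &= status.equals("ok");
            details.append(e.getKey()).append(": ").append(status).append('\n');
        }
        pool.shutdownNow(); // interrupt anything still running past the budget
        return (ok ? "OK" : "FAIL") + "\n" + details;
    }

    /** Sample check: create, write, read back and delete a file in the temp directory. */
    static boolean fileSystemReadWrite() throws Exception {
        Path p = Files.createTempFile("diag", ".tmp");
        try {
            Files.writeString(p, "ping");
            return "ping".equals(Files.readString(p));
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

A check that hangs (say, a database that accepts the connection and then never answers) shows up as a FAIL with a TimeoutException instead of hanging the diagnostic page itself.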

Level 4 Diagnostic

    Select a rich list of things to do that can complete in 30 seconds or less.
    Configure Nagios to exercise this diagnostic about once every two hours.
    The first line of text returned must be "OK" or "FAIL" for Nagios.
    Additional human-readable text would be nice.
    Some things that might be exercised:
      exercise the Level 5 Diagnostic for this component
      check something in the Level 5 diagnostic list above that could not fit into 2 seconds
      ping components the app depends on (verify that the network is functioning between the two components)
      exercise all JMS queues
      FTP read/write
      exercise EJB interface
      exercise MDB interface
      exercise SOAP interface
      exercise RMI interface
      examine JDBC driver version
      exercise a sophisticated algorithm
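
For the "ping components the app depends on" item, a real ICMP ping is awkward from Java, so this sketch swaps in a plain TCP connect with a timeout. It only shows that something is listening on the other end, not that it is healthy. The class and method names are hypothetical, and the host and port you pass in are whatever your component actually depends on.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: TCP-level reachability check for a dependency. */
final class DependencyPing {

    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean reachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or the host name did not resolve.
            return false;
        }
    }
}
```

The explicit timeout matters: without it, a dependency behind a firewall that silently drops packets could stall the whole diagnostic well past its 30 seconds.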

Level 3 Diagnostic

    Select a rich list of things to do that can complete in 5 minutes or less.
    Configure Nagios to exercise this diagnostic about once every day.
    The first line of text returned must be "OK" or "FAIL" for Nagios.
    Additional human-readable text would be nice.
    Some things that might be exercised:
      exercise the Level 4 Diagnostic for this component
      check for when licenses expire for third party products
      exercise a workflow
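
The license check is a good fit for a once-a-day diagnostic because the useful answer is not "has it expired?" but "will it expire soon?". A rough sketch, assuming you can get the expiry date from somewhere (a properties file, the vendor's API, a hard-coded constant); the class name, the product name and the warning window are all placeholders of mine.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Hypothetical check: fail while there is still time to renew. */
final class LicenseCheck {

    /**
     * Returns "OK" or "FAIL" on the first line, then a human-readable detail line.
     * Fails when fewer than warnDays remain; a negative count means already expired.
     */
    static String check(String product, LocalDate expires, LocalDate today, int warnDays) {
        long daysLeft = ChronoUnit.DAYS.between(today, expires);
        String detail = product + " license: " + daysLeft + " day(s) until " + expires;
        return (daysLeft >= warnDays ? "OK" : "FAIL") + "\n" + detail + "\n";
    }
}
```

Passing today in as a parameter (LocalDate.now() in production) keeps the check easy to try out with fixed dates.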

The needs of the real world go a bit beyond Star Trek. A few more diagnostics for the repertoire:

Ping

    Configure Nagios to exercise this diagnostic about once every 30 seconds.
    Just return "OK".
    This shows that the network and most systems are functioning and that the component is not locked up.
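
To show how little code the ping needs, here is a sketch that can be run without an app server. It uses the small HTTP server that ships with the JDK (com.sun.net.httpserver) as a stand-in for a servlet container; in a real servlet the body of the handler would live in doGet. The class name and the /diag/ping path are my own choices.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical ping endpoint built on the JDK's built-in HTTP server. */
final class PingEndpoint {

    /** Starts a server on the given port (0 picks a free one) that answers "OK" at /diag/ping. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/diag/ping", exchange -> {
            byte[] body = "OK\n".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

After PingEndpoint.start(8080), opening http://localhost:8080/diag/ping in a browser should show OK. The ping deliberately touches nothing else (no database, no queues), so a failure here means the component itself or the network is in trouble.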

Installation Diagnostic

    Called by the person installing the component (not by Nagios) immediately after installation.
    Recycle stuff from the level 3, 4 and 5 diagnostics.
    Check to make sure jar file code can be exercised.
    Can also be called later, as needed, by developers and system operators (not by Nagios).

Status page

    Called as needed by developers and system operators (not by Nagios).
    This might show:
      a detailed list of which systems are functioning correctly and which are not so good
      log data
      the version numbers of the JDBC driver, the app server and Java
      the current sizes of JMS queues
      the number of servlet requests
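
A couple of the status page items can be sketched with nothing but the standard library. The StatusPage class and its method names are hypothetical. The JDBC part relies on the standard DatabaseMetaData interface and assumes you hand it a live Connection from your own pool; the request counter assumes your servlet calls recordRequest() on each request.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.SQLException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical status page content: version info and a request counter. */
final class StatusPage {
    private final AtomicLong requests = new AtomicLong();

    /** Call once per servlet request, for example at the top of doGet. */
    void recordRequest() {
        requests.incrementAndGet();
    }

    /** Driver name and version as reported by the driver itself. */
    static String jdbcDriver(Connection conn) throws SQLException {
        DatabaseMetaData md = conn.getMetaData();
        return md.getDriverName() + " " + md.getDriverVersion();
    }

    /** Plain-text lines for the parts that need no external resources. */
    String render() {
        return "java version: " + System.getProperty("java.version") + "\n"
             + "servlet requests: " + requests.get() + "\n";
    }
}
```

Unlike the diagnostics, this page has no verdict line; it is meant for a person trying to figure out what is going on, not for Nagios.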

If you implement this servlet with nothing but the ping, you have still made a big leap. Calling the ping, even manually, verifies that your component is not locked up and that the network is functioning at least in part. If you later connect Nagios to the component ping, you can probably find out about a problem with your component long before you get that first call from a user of the component.

Overall, implementing all aspects of this servlet would probably take less than a day. Connecting it to Nagios would take about an hour. And then ... if there ever is a problem, you should be able to resolve it about 20 times (maybe 100 times) faster than without it.