Checklists Plus Good Monitoring = Reliability

System health checklists + good automated monitoring will drastically increase the reliability of your system.

The use of checklists to increase the reliability of complex systems and practices, such as aviation and surgery, is less than 100 years old. Checklists are popular in information technology too. They are run manually to verify the health of services after a deployment, or on a regular schedule to check functionality. In the early days of IT, before automated monitoring became normal practice, the checklist (sometimes called the “daily checklist”) was the primary method of assessing the availability and health of an IT system.

A human operator would run through the checklist daily, or every few hours. The process was typically part manual and part scripted. The operator would check that systems and applications were available and look for unexpected errors. The checklist might also collect key performance indicator values and record them in a database, and it could generate reports with graphs and trends.

Now that automated monitoring is normal practice, many organizations see less reason to have a human operator run manual checklists. If the Nagios server can check thousands of mail queues every minute, 24 hours a day, without a break, why also have a human check the same thing once a day? An automated monitoring system undoubtedly has its advantages, but there are still a few reasons to keep your staff running checklists:

  1. Regular Practice Makes a Better and More Friction-Free Practice: When the system has a problem, you need staff who are comfortable and confident using manual tools to validate system health. Your staff should be able to run the commands used to check system health without having to think about what to type. A plain-English question about the system, such as “What are the mail queue backlogs on this list of 1,000 servers?”, should translate easily to a Linux command line without the operator having to consult manuals. Commands like “for mailserver in $(cat listofservers); do echo $mailserver; ssh -q $mailserver 'mailq | tail -1'; done” should flow right out of the fingers of an operator who is asked this question.
  2. Redundancy: Monitoring systems are not perfect. They can fail themselves (though they can, and probably should, be automatically monitored too). They can also be configured incorrectly, for example with an alert threshold set too high, so checklists serve as a human audit of the monitoring system.
  3. Training: Most organizations have junior staff. Junior staff need to practice their skills in order to gain proficiency and grow their skill sets, and running checklists gives them that practice.
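
The mail queue one-liner from the first point can also be kept as a small, reusable checklist item. What follows is only a sketch under some assumptions that are not from the discussion above: the host list is a plain file with one server per line, key-based ssh login already works, and the last line of `mailq` output is a usable summary. The function name `check_mail_queues` and the `REMOTE_CMD` override are hypothetical, introduced here so the loop can be rehearsed without touching real servers.

```shell
#!/bin/sh
# Sketch of one checklist item: print the mail queue summary for each server.
# Assumptions: one hostname per line in the list file, key-based ssh login,
# and `mailq | tail -1` printing a one-line summary on the remote host.

# REMOTE_CMD is a hypothetical override (not a standard variable) so the loop
# can be dry-run with a stand-in command instead of ssh.
# -n keeps ssh from reading the host list on stdin; BatchMode avoids prompts.
REMOTE_CMD=${REMOTE_CMD:-"ssh -n -q -o BatchMode=yes -o ConnectTimeout=5"}

check_mail_queues() {
    server_list=$1
    while IFS= read -r mailserver; do
        # Skip blank lines and comment lines in the host list.
        case $mailserver in
            ''|'#'*) continue ;;
        esac
        # $REMOTE_CMD is unquoted on purpose so its options split into words.
        if summary=$($REMOTE_CMD "$mailserver" 'mailq | tail -1' 2>/dev/null); then
            printf '%s\t%s\n' "$mailserver" "$summary"
        else
            # An unreachable host is itself a checklist finding, so report it.
            printf '%s\tUNREACHABLE\n' "$mailserver"
        fi
    done < "$server_list"
}
```

An operator would call it as `check_mail_queues listofservers`. Reporting unreachable hosts instead of silently skipping them is a deliberate choice here: a server the operator cannot log in to is exactly the kind of thing a manual check should surface, and comparing this output with what the monitoring system shows supports the audit role described in the second point.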

So we advise you to keep running your checklists, even as you improve your monitoring capabilities. System health checklists run by people, whether fully or partially manual, will drastically improve the reliability of your information system when combined with good automated monitoring.