When constructing new checks for Nagios, it is sometimes worthwhile to consider what we’ll call “check resolution”: how many distinct checks you wrap within one service check object. As an example, consider the “check_disk” check, which by default checks all mounted file systems. Assuming more than one file system is mounted, this one service check object is checking multiple file systems. We will call this a “low-resolution” check. But you don’t have to check everything at once. If you want, you can use the “-p” (for path) option to specify exactly which disk or disks to check. So you have the option of checking all disks with one service check object or giving each path its own service check object. We will call the latter “high-resolution” checks.
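
To make the two resolutions concrete, here is a minimal Python sketch that builds the “check_disk” command lines for each approach. The plugin path, the service names, and the helper function names are assumptions made for illustration; only the “-w”, “-c”, and “-p” options come from check_disk itself.

```python
from typing import Dict, List

# Assumed install location; adjust to wherever your plugins live.
PLUGIN = "/usr/local/nagios/libexec/check_disk"


def low_resolution_commands(warn: str, crit: str) -> Dict[str, List[str]]:
    """One service check object. No -p is given, so the plugin
    falls back to its default of checking all mounted file systems."""
    return {"Disk space (all)": [PLUGIN, "-w", warn, "-c", crit]}


def high_resolution_commands(
    paths: List[str], warn: str, crit: str
) -> Dict[str, List[str]]:
    """One service check object per path, each restricted with -p."""
    return {
        f"Disk space {path}": [PLUGIN, "-w", warn, "-c", crit, "-p", path]
        for path in paths
    }


if __name__ == "__main__":
    for name, argv in low_resolution_commands("20%", "10%").items():
        print(name, "->", " ".join(argv))
    for name, argv in high_resolution_commands(["/", "/var"], "20%", "10%").items():
        print(name, "->", " ".join(argv))
```

The low-resolution helper always yields a single service, while the high-resolution helper yields as many services as there are paths, which is exactly the difference in the number of objects you later have to alert on, acknowledge, or snooze.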

Both high- and low-resolution checks have their advantages and disadvantages. With a low-resolution check, you reduce the number of alerts you get. In an environment that has thousands of service checks running, reducing the number of alerts sent to staff can be a good thing. On the downside, with a low-resolution check, you can only schedule downtime for, or acknowledge, the whole check object, including every one of the services it contains. To use “check_disk” as an example again: suppose you used the default to check all disks, had a planned outage on one of the file systems, and wanted to schedule downtime to suppress alerts during the outage. You would not be able to schedule the downtime for just that one disk; the one check covering all disks would have to be snoozed, so you would run the risk of not finding out about a disk space condition on one of the disks not undergoing maintenance.
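
The downtime trade-off can be simulated in a few lines of Python. This is only a sketch of the reasoning, not how Nagios implements downtime: the usage figures, the 80%/90% thresholds, and the function names are all made up for illustration. The numeric states follow the usual plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
from typing import Dict, List, Set, Tuple

OK, WARNING, CRITICAL = 0, 1, 2


def state_for(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a percent-used figure to a plugin-style state."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def alerts_low_resolution(
    usage: Dict[str, float], in_downtime: bool
) -> List[Tuple[str, int]]:
    """One check covers every file system and reports the worst state.
    Downtime on that one check silences all of them."""
    worst = max((state_for(pct) for pct in usage.values()), default=OK)
    if in_downtime or worst == OK:
        return []
    return [("all disks", worst)]


def alerts_high_resolution(
    usage: Dict[str, float], downtime_paths: Set[str]
) -> List[Tuple[str, int]]:
    """One check per file system; downtime applies per path."""
    return [
        (path, state_for(pct))
        for path, pct in usage.items()
        if state_for(pct) != OK and path not in downtime_paths
    ]


if __name__ == "__main__":
    # /data is under planned maintenance; /var is filling up unexpectedly.
    usage = {"/": 50.0, "/data": 99.0, "/var": 95.0}
    print(alerts_low_resolution(usage, in_downtime=True))
    print(alerts_high_resolution(usage, downtime_paths={"/data"}))
```

With the single low-resolution check in downtime, the problem on /var goes unreported. With per-path checks, only /data is silenced and /var still raises an alert.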

This trade-off is magnified in certain types of service architectures. Imagine a cluster of servers running applications that use a queue system to process client input and output. Imagine that each running application has hundreds of queues. Would you want to write and configure one check for each queue? Or one check for ALL queues on each server? Or one check for ALL queues in the cluster?
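
One way to explore those three options is to write the check logic once and make the resolution a parameter. The sketch below is hypothetical throughout: the cluster, server, and queue names, the depth thresholds, and the idea of using queue depth as the health measure are assumptions for illustration, not part of any real plugin.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple

OK, WARNING, CRITICAL = 0, 1, 2


class QueueSample(NamedTuple):
    cluster: str
    server: str
    queue: str
    depth: int  # number of messages waiting (hypothetical metric)


def group_key(sample: QueueSample, resolution: str) -> Tuple[str, ...]:
    """Pick which fields identify one service check at this resolution."""
    if resolution == "queue":
        return (sample.cluster, sample.server, sample.queue)
    if resolution == "server":
        return (sample.cluster, sample.server)
    if resolution == "cluster":
        return (sample.cluster,)
    raise ValueError(f"unknown resolution: {resolution}")


def checks_at(
    samples: Iterable[QueueSample],
    resolution: str,
    warn: int = 100,
    crit: int = 500,
) -> Dict[Tuple[str, ...], int]:
    """Return one state per service check, driven by the deepest queue in it."""
    deepest: Dict[Tuple[str, ...], int] = defaultdict(int)
    for sample in samples:
        key = group_key(sample, resolution)
        deepest[key] = max(deepest[key], sample.depth)
    return {
        key: CRITICAL if depth >= crit else WARNING if depth >= warn else OK
        for key, depth in deepest.items()
    }


if __name__ == "__main__":
    samples = [
        QueueSample("c1", "app1", "orders", 20),
        QueueSample("c1", "app1", "billing", 650),
        QueueSample("c1", "app2", "orders", 120),
        QueueSample("c1", "app2", "billing", 5),
    ]
    for resolution in ("queue", "server", "cluster"):
        print(resolution, checks_at(samples, resolution))
```

The same four samples produce four checks at queue resolution, two at server resolution, and one at cluster resolution. The coarser the grouping, the fewer objects there are to alert on, but the less each alert tells you about which queue is actually backed up.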

This is worth considering during the design phase of your monitoring project. Choosing the right resolution for your service checks strikes a balance between enough detail and too much, and lets you snooze alerts easily without also snoozing alerts that should stay active.

© Copyright 2020 Rex Consulting, Inc. – All rights reserved