4 monitoring and alerting best practices for IT ops
Monitoring is vital in modern IT environments, but the variety of metrics to track can swiftly overtake admins' capacity -- and sanity.
I remember a manager once commented, as a new IT alerting system was installed, that he wanted to capture, page out and remediate every warning and error message across 700 Windows servers. This resulted in a text bill -- at $0.10 per message -- that was close to the United States' national debt.
Staff was sent into a panic correcting every Windows error message, which was a 24/7 job for the entire team. While this directive only lasted a short time, it highlighted how good intentions can go very wrong.
With monitoring and alerting, just because you can do something doesn't mean you should. Modern IT monitoring and alerting tools have expansive capabilities, but enabling all of them can cause information overload and drown real issues in a flood of irrelevant messages.
This deluge typically leads overwhelmed staff to delete messages and ultimately disable alerts -- defeating the purpose of these tools in the first place. To avoid this scenario, follow these monitoring and alerting best practices.
Set alerts for a length of time, not just usage
A core best practice is to always pair a usage level with a duration when you alert on performance counters. Most applications spike resource utilization as they run, and alerts triggered by these momentary spikes will flood inboxes.
Set durations based on experience with the application. Measure the time frame in minutes, rather than seconds. Even if an application is critical, give the workloads time to function. An alert triggered after only a few seconds can create false positives, which cause a wave of alerts that are then ignored.
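As a rough illustration of a duration-based alert, the Python sketch below fires only when a metric stays at or above a level for a full time window. The class name, the 90% level and the five-minute window are assumptions chosen for the example, not settings from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThresholdAlert:
    """Fires only when a metric stays at or above a level for a full duration."""
    threshold: float          # assumed unit: percent, e.g. 90.0 for 90% CPU
    duration_seconds: float   # e.g. 300 for five minutes
    _breach_started: Optional[float] = None  # time of the first sample over the level

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        if value < self.threshold:
            # Any dip below the level resets the clock, so short spikes never fire.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.duration_seconds


# Hypothetical CPU samples, one per minute: (seconds, percent).
cpu_alert = SustainedThresholdAlert(threshold=90.0, duration_seconds=300)
samples = [(0, 95), (60, 97), (120, 40), (180, 96), (240, 98),
           (300, 99), (360, 97), (420, 95), (480, 96)]
fired_at = [t for t, v in samples if cpu_alert.observe(t, v)]
print(fired_at)  # [480]
```

The two-minute spike at the start never fires because the dip at 120 seconds resets the timer. The alert fires only once usage has stayed high from 180 through 480 seconds.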
Storage capacity and memory alerts normally don't show the same spikes and drops as CPU alerts. Memory use tends to grow more slowly and steadily, which provides a little more flexibility. However, be careful with memory values, because some applications reserve RAM for caching: What is allocated might be very different from what is in use.
Add memory "in use" alerts or paging warnings for a more accurate picture of memory use versus allocation. Because storage and networking capacity are easier to measure directly, it is simpler to establish fixed thresholds for them in alert settings.
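To make the allocated-versus-in-use distinction concrete, here is a minimal sketch that grades a memory sample on what is actively in use and treats high allocation alone as informational. The field names, the 85% level and the sample numbers are hypothetical. Real tools expose these values under different names.

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    total_mb: float
    allocated_mb: float  # reserved by applications, including caches (hypothetical field)
    in_use_mb: float     # actively used memory (hypothetical field)


def memory_alert_level(sample: MemorySample, warn_pct: float = 85.0) -> str:
    """Warn on memory in use; treat high allocation alone as informational."""
    allocated_pct = 100.0 * sample.allocated_mb / sample.total_mb
    in_use_pct = 100.0 * sample.in_use_mb / sample.total_mb
    if in_use_pct >= warn_pct:
        return "warn"
    if allocated_pct >= warn_pct:
        # High allocation with low in-use memory usually points to caching.
        return "info"
    return "ok"


caching_app = MemorySample(total_mb=16384, allocated_mb=15000, in_use_mb=6000)
busy_app = MemorySample(total_mb=16384, allocated_mb=15000, in_use_mb=14500)
print(memory_alert_level(caching_app))  # info
print(memory_alert_level(busy_app))     # warn
```

Both samples show the same allocation, but only the second would page anyone.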
Choose an alert type -- or two
Once alerts and levels are set, decide how you want to receive them.
Most admins will choose email or text; both are easy to overuse. Often, email alerts are shuttled to a separate folder and get lost in the mix. This might lead to text-based options, but text is not a guaranteed delivery model either. Messages can be dropped in buildings and areas that don't get ideal cell service. Sending both text and email can create information overload again -- not to mention the question of sending text messages to admins' personal devices if they don't have work phones or stipends.
Instead, split the alert types. Send general errors and non-critical issues via email only, and send system-down alerts by both email and text. This approach uses both channels without abusing the admin's inbox or phone.
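The split described above can be sketched as a small routing table. The severity names and channel labels are placeholders for illustration. An actual tool would map them to its own notification settings.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    NON_CRITICAL = "non_critical"  # general errors and warnings
    SYSTEM_DOWN = "system_down"    # a system is unavailable


# Assumed routing: email for everything, text added only for outages.
ROUTES: Dict[Severity, List[str]] = {
    Severity.NON_CRITICAL: ["email"],
    Severity.SYSTEM_DOWN: ["email", "text"],
}


def channels_for(severity: Severity) -> List[str]:
    """Return the delivery channels for an alert of the given severity."""
    return list(ROUTES[severity])


print(channels_for(Severity.NON_CRITICAL))  # ['email']
print(channels_for(Severity.SYSTEM_DOWN))   # ['email', 'text']
```

Keeping the mapping in one table makes it easy to review which alerts are allowed to reach a phone.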
Create clean communication paths
Besides how someone gets the message, it's also critical to determine who gets it.
Sending messages to everyone contributes to that overflowing mailbox that just gets ignored and purged. Instead, dedicate a person or team to assume ownership over individual applications or service groupings.
A dedicated email account or app on an admin's personal device is a good start, but a dedicated on-call device is even better. This sets clear boundaries about responsibility and time off, based on when the admin has the device.
Plus, the device can be preprogrammed with the numbers of escalation contacts for more complex issues. This leads to clean communication paths that remove the personal aspect and keep the issue within the company contacts and process. Fractured forms of communication, such as some employees using text and others using email, create confusion, broken response paths and angry customers.
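One way to picture ownership and escalation is a lookup that names a single owner per service and an ordered list of escalation contacts. The service names and contact aliases below are invented for the example. The sketch assumes aliases point to on-call devices or team accounts rather than personal numbers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceOwnership:
    owner: str                                            # on-call device or team alias
    escalation: List[str] = field(default_factory=list)   # ordered escalation contacts


# Hypothetical services and contacts.
OWNERSHIP: Dict[str, ServiceOwnership] = {
    "payroll-app": ServiceOwnership("oncall-apps", ["apps-lead", "it-manager"]),
    "file-servers": ServiceOwnership("oncall-infra", ["infra-lead", "it-manager"]),
}


def recipient_for(service: str, escalation_level: int = 0) -> str:
    """Return the single contact responsible at this escalation level."""
    entry = OWNERSHIP[service]  # raises KeyError for services with no owner
    chain = [entry.owner] + entry.escalation
    # Clamp to the last contact so an alert never ends up with no recipient.
    return chain[min(escalation_level, len(chain) - 1)]


print(recipient_for("payroll-app"))      # oncall-apps
print(recipient_for("payroll-app", 1))   # apps-lead
print(recipient_for("file-servers", 9))  # it-manager
```

A service with no entry raises an error instead of falling back to a broadcast, which surfaces ownership gaps early.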
Evaluate communication strategies
IT teams often focus too much on what a tool can monitor, as well as the reports and trending capabilities it might have. Information does little good if it's not communicated well.
This goes beyond text messages and emails. It also entails escalation chains for IT operations teams and mobile-friendly applications and dashboards to ensure staff members are engaged and aren't overwhelmed. Alerts must mature over time, much in the same way the data center has. Don't overlook adjustments and tweaks that make alerts more valuable.
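As one possible way to find alerts worth adjusting, the sketch below scans an alert history for rules that fire often but are rarely acted on. The history format, rule names and cutoffs are assumptions for illustration. In practice the data would come from the alerting tool's own logs.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_rules(history: Iterable[Tuple[str, bool]],
                min_alerts: int = 10,
                max_action_rate: float = 0.2) -> List[str]:
    """Return rules that fired at least min_alerts times but were rarely acted on.

    history holds (rule_name, was_acted_on) pairs, one per alert sent.
    """
    fired: Counter = Counter()
    acted: Counter = Counter()
    for rule, was_acted_on in history:
        fired[rule] += 1
        if was_acted_on:
            acted[rule] += 1
    return sorted(
        rule for rule, count in fired.items()
        if count >= min_alerts and acted[rule] / count <= max_action_rate
    )


# Hypothetical history: a chatty CPU rule, a useful disk rule and a rare rule.
history = ([("cpu-spike", False)] * 19 + [("cpu-spike", True)]
           + [("disk-full", True)] * 12 + [("rare-event", False)] * 3)
print(noisy_rules(history))  # ['cpu-spike']
```

Rules on the resulting list are candidates for a longer duration, a higher level or a quieter channel, not necessarily for removal.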