
Monitoring: Why and What?
Unlike those of you with ops backgrounds, especially in IT and business operations, I didn’t have monitoring as a big part of my day-to-day job. As ops professionals describe it, monitoring is the process of systematically tracking and analyzing the performance, availability, and health of systems, applications, networks, and other critical infrastructure components. System monitoring, for instance, follows server health by tracking CPU usage, memory, disk space, and uptime. Application performance monitoring observes response time, error rates, and throughput. Network monitoring, which is particularly crucial, focuses on traffic, latency, bandwidth usage, and connectivity.
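To make the system monitoring part concrete, here is a minimal sketch of taking a single host health reading using only the Python standard library. The function name `collect_system_snapshot` and the dictionary keys are hypothetical, chosen for illustration. Real monitoring agents collect far more than this, and memory and uptime are left out because the standard library has no portable call for them.

```python
import os
import shutil
import time


def collect_system_snapshot(path="/"):
    """Return a small dict of host-level readings (hypothetical helper)."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_percent = 100 * disk.used / disk.total if disk.total else 0.0

    snapshot = {
        "timestamp": time.time(),        # when the reading was taken
        "cpu_count": os.cpu_count(),     # logical CPUs, may be None
        "disk_used_percent": round(used_percent, 1),
    }

    # os.getloadavg() exists only on Unix-like systems, so fall back to None.
    try:
        snapshot["load_avg_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        snapshot["load_avg_1m"] = None

    return snapshot
```

Calling `collect_system_snapshot()` on a schedule and storing the results is the simplest form of the tracking described above. Everything else in monitoring builds on a stream of readings like this one.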
But why do we monitor? The short answer is to serve our business better. We need to understand critical metrics like CPU usage, response time, and latency to ensure that the systems our business depends on work properly. In the simplest terms, there are three core objectives of monitoring:
- To detect where problems arise, which almost always includes alerting.
- To troubleshoot problems in real time, so we know exactly where in our system the problem is and can get leads on how to resolve it.
- To perform strategic work: reporting, projections, and aiding in the longer-term performance engineering of our applications and systems.
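The first objective, detection with alerting, can be sketched as a comparison of current readings against thresholds. This is only an illustration under stated assumptions: the `Threshold` class, the `evaluate` function, the metric names, and the limits are all hypothetical, and a production alerting system would also handle alert duration, deduplication, and routing.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str                  # hypothetical metric name, e.g. "cpu_percent"
    limit: float                 # value at which we want to be alerted
    above_is_bad: bool = True    # False for metrics where low values are bad


def evaluate(readings, thresholds):
    """Return a list of alert messages for breached or missing metrics."""
    alerts = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is None:
            # A metric that stops reporting is itself worth an alert.
            alerts.append(f"{t.metric}: no data")
            continue
        breached = value > t.limit if t.above_is_bad else value < t.limit
        if breached:
            alerts.append(f"{t.metric}={value} breached limit {t.limit}")
    return alerts
```

For example, `evaluate({"cpu_percent": 93.0}, [Threshold("cpu_percent", 90.0)])` returns one alert, while a reading of 50.0 returns an empty list. Treating missing data as an alert is a design choice. It covers the case where the monitored target is too unhealthy to report at all.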
"Without monitoring, you're flying blind in a digital world where even the smallest issue can have massive consequences."
How do we select what to monitor? The answer is that just about every important aspect of the services we provide should be measured. Many targets require measuring several types of metrics, and choosing the right metric is crucial. Determining which metrics matter often involves asking what the target is expected to deliver. It all starts with the problem you’re seeking to solve and the question you’re asking. For example: Is our system healthy? Are the users of our app on these servers happy? Are they able to log in and get the functionality they need in a reasonable amount of time?
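As a rough illustration of turning a question like "are users getting what they need in a reasonable amount of time?" into metrics, the sketch below summarizes a batch of request records into throughput, error rate, and 95th-percentile latency. The record shape (`(status_code, latency_ms)` tuples), the function names, and the choice to count only 5xx statuses as errors are assumptions made for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_requests(requests):
    """Summarize a non-empty list of (status_code, latency_ms) tuples."""
    total = len(requests)
    # Assumption: server-side failures (HTTP 5xx) are what we call errors.
    errors = sum(1 for status, _ in requests if status >= 500)
    latencies = [latency for _, latency in requests]
    return {
        "throughput": total,                      # requests in this batch
        "error_rate": errors / total,             # fraction between 0 and 1
        "p95_latency_ms": percentile(latencies, 95),
    }
```

A percentile is used here instead of an average because a few very slow requests can hide behind a healthy-looking mean. The slow tail is often what unhappy users actually experience.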
The challenge lies in how we extract these metrics from a given target and make sense of them. This is not always easy; in fact, it often requires specialized tools. Moreover, any discussion that touches on faults in the system tends to be fraught with emotion and is rarely a purely logical conversation.