5 Key Metrics to Monitor for Optimal Application Health

Every application eventually breaks. The difference between a quick recovery and a prolonged outage often comes down to what you were watching before things went wrong. Monitoring isn't about collecting every possible data point—it's about knowing which signals matter and having the discipline to act on them. This guide focuses on five metrics that consistently separate teams who catch problems early from those who only learn about failures from user complaints.

We wrote this for engineers and team leads who already have basic monitoring in place but suspect they're missing something. Maybe your dashboards look fine, yet users still report slowness. Or your alert system fires so often that everyone ignores it. These are symptoms of metric selection that doesn't match how your application actually behaves under load. By the end of this article, you'll have a concrete framework for choosing, thresholding, and responding to the five most impactful health signals.

Why Most Monitoring Strategies Miss the Real Problems

The classic approach to application monitoring is to track CPU, memory, and disk usage, then add a few application-level counters like request count and response time. That works until it doesn't. The problem is that infrastructure metrics can look healthy while the application is failing. A server at 60% CPU might be silently dropping requests because a downstream database is saturated. Meanwhile, the CPU metric alone tells you nothing about user-facing quality.

What actually goes wrong in practice is that teams pick metrics based on what's easiest to collect rather than what's most informative. If your monitoring platform auto-generates dashboards for CPU and memory, those become your de facto health indicators. But application health is ultimately about whether users can complete their tasks successfully and quickly. That means you need metrics that reflect user experience directly—latency, error rate, and throughput—alongside infrastructure signals that help you diagnose the root cause when those user-facing metrics degrade.

Another common failure is treating all metrics as equally important. When everything is critical, nothing is. Teams end up with dozens of alerts, most of which are noise. The key is to identify a small set of primary signals that you respond to immediately, and a larger set of secondary signals that you investigate during normal working hours. This guide's five metrics are designed to be that primary set—the minimum number of indicators that give you a complete picture of application health without overwhelming your on-call rotation.

The audience for this approach includes startups scaling their first production service, established teams migrating to microservices, and anyone who has experienced an outage that monitoring failed to predict. If you've ever said, "I wish we had seen that coming," this framework is for you.

What You Need Before You Start Picking Metrics

Before you can monitor application health effectively, you need a clear definition of what "healthy" means for your specific system. That sounds obvious, but many teams skip this step and end up with generic thresholds that don't match their actual workload. A health definition should include acceptable latency for each critical user journey, the maximum tolerable error rate, and the throughput range your application must handle without degradation.

You also need to understand your application's architecture well enough to know where failures are most likely to occur. A monolithic application might have different failure modes than a microservices deployment. For example, in a monolith, a memory leak can bring down the entire service, while in a distributed system, a single slow service can cause cascading timeouts across multiple dependencies. Your metric selection should reflect these architectural realities.

Another prerequisite is instrumentation. You can't measure what you don't instrument. Ensure your application emits metrics for request latency, error codes, and request counts at minimum. Many frameworks and languages have built-in support for metrics libraries (like Prometheus client libraries, StatsD, or OpenTelemetry). If you're starting from scratch, choose an instrumentation standard that supports the five metrics we'll discuss, and make sure it can tag metrics with useful dimensions like endpoint, HTTP method, or error type.
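As a rough illustration of what "instrumented" means here, the sketch below records latency, status, and request count for each call, tagged by endpoint and method. It uses a small in-memory class instead of a real metrics library, so `RequestMetrics` and `timed_call` are hypothetical names, and treating every exception as a 500 is a simplifying assumption.

```python
import time
from collections import defaultdict


class RequestMetrics:
    """In-memory stand-in for a metrics client (hypothetical, illustration only)."""

    def __init__(self):
        # Keys carry the tags so the data can be sliced by dimension later.
        self.counts = defaultdict(int)          # (endpoint, method, status) -> count
        self.latencies_ms = defaultdict(list)   # (endpoint, method) -> [latency, ...]

    def observe(self, endpoint, method, status, latency_ms):
        self.counts[(endpoint, method, status)] += 1
        self.latencies_ms[(endpoint, method)].append(latency_ms)


def timed_call(metrics, endpoint, method, handler):
    """Run handler() and record its latency and outcome, even if it raises."""
    start = time.perf_counter()
    status = 500  # assumption: any exception counts as a server error
    try:
        result = handler()
        status = 200
        return result
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        metrics.observe(endpoint, method, status, elapsed_ms)


metrics = RequestMetrics()
timed_call(metrics, "/checkout", "POST", lambda: "order accepted")
```

A real client library would replace the dictionaries, but the shape is the same: one observation per request, with tags attached at the point of measurement.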

Finally, establish a baseline. Before you can set alert thresholds, you need to know what normal looks like for your application. Collect at least two weeks of data during typical operation, including peak traffic periods and any known slow times. This baseline helps you distinguish between transient spikes and genuine degradation. Without a baseline, you risk setting thresholds that are either too tight (causing alert fatigue) or too loose (missing real problems).

The Five Metrics: A Step-by-Step Workflow

These five metrics form a cohesive set that covers both user-facing quality and infrastructure health. We present them in order of importance, starting with the most direct indicator of user experience.

1. Latency: The User's View of Speed

Latency measures how long it takes for your application to respond to a request. But not all latency measurements are equal. The most useful metric is tail latency, specifically the 95th or 99th percentile. Average latency can hide a lot of misery. If 98 out of 100 requests complete in 100 milliseconds but two take 10 seconds, the average is just under 300 milliseconds, which can still look acceptable. The 99th percentile, however, is 10 seconds and exposes the outliers. Monitor latency at the service boundary (the API gateway or load balancer) and also at each internal service if you have a distributed system.
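To make the gap between averages and tail percentiles concrete, here is a minimal sketch using the nearest-rank percentile definition. The sample data is made up, and monitoring tools may use other percentile definitions (interpolated or histogram-based), so treat the exact numbers as illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [100] * 98 + [10_000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 298.0
p50_ms = percentile(latencies_ms, 50)            # 100
p99_ms = percentile(latencies_ms, 99)            # 10000
```

Note that with only a single slow request in 100, the nearest-rank 99th percentile would still be 100 ms; that outlier appears only in the maximum. This is one reason to watch more than one percentile on low-traffic endpoints.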

Set thresholds based on your baseline and business requirements. A common starting point is to alert when the 95th percentile latency exceeds twice the baseline for more than five minutes. But adjust based on user expectations—a real-time chat app needs tighter thresholds than a batch report generator.
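The "twice the baseline for more than five minutes" rule can be sketched as a check over timestamped samples. The function name, the one-minute sampling interval, and the 200 ms baseline are assumptions for illustration; most monitoring tools provide an equivalent "for duration" option.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return True if the most recent run of samples has stayed above
    threshold for at least min_duration_s seconds.

    samples: list of (timestamp_seconds, value), oldest first.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any recovery resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration_s


baseline_p95_ms = 200                 # assumed baseline
threshold_ms = 2 * baseline_p95_ms    # "twice the baseline"

# One p95 sample per minute (hypothetical data).
p95_samples = [(0, 180), (60, 450), (120, 460), (180, 470),
               (240, 480), (300, 490), (360, 500)]
alert = sustained_breach(p95_samples, threshold_ms, min_duration_s=300)  # True
```

A single spike that recovers within a minute does not trigger the alert, which is the point of requiring a sustained breach.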

2. Error Rate: When Things Go Wrong

Error rate is the percentage of requests that result in an error (HTTP 5xx, application exceptions, or failed business logic). Even a small increase in error rate can indicate a serious problem. For most applications, an error rate consistently above 1% warrants investigation, but the acceptable rate depends on your service level objectives. Track error rate by endpoint and error type to identify specific failures. A spike in 503 errors might point to a backend service being down, while an increase in 500 errors could indicate a code bug.

One common mistake is only monitoring errors that reach the user. Internal errors—like failed database queries that are caught and retried—can also degrade performance and should be tracked separately. Use a metric like "internal error count" to catch these before they become user-facing.
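A per-endpoint error rate is straightforward to compute once requests are tagged. The sketch below counts only HTTP 5xx responses as errors, which is an assumption; application exceptions and failed business logic would need their own signal. The request data is invented.

```python
from collections import Counter


def error_rates(requests):
    """Return {endpoint: error rate in percent} from (endpoint, status) pairs.

    Only HTTP 5xx responses are counted as errors in this sketch.
    """
    totals = Counter()
    errors = Counter()
    for endpoint, status in requests:
        totals[endpoint] += 1
        if 500 <= status <= 599:
            errors[endpoint] += 1
    return {ep: 100.0 * errors[ep] / totals[ep] for ep in totals}


# Hypothetical traffic: /login has three 503s in 100 requests.
requests = ([("/login", 200)] * 97 + [("/login", 503)] * 3
            + [("/search", 200)] * 50)
rates = error_rates(requests)  # {"/login": 3.0, "/search": 0.0}
```

Breaking the rate down this way shows that the overall 2% error rate is entirely due to one endpoint.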

3. Throughput: The Volume Signal

Throughput measures the number of requests your application handles per unit of time (typically requests per second or requests per minute). A sudden drop in throughput often means users can't reach your application at all—a potential outage. Conversely, a spike in throughput might signal a traffic surge that could overwhelm your infrastructure. Throughput is especially useful when combined with latency and error rate. For example, if throughput drops and latency spikes, you likely have a bottleneck. If throughput drops and latency also drops, users might be unable to connect.

Monitor throughput at the entry point of your system and at each service boundary. Set an alert for a drop of more than 20% from the baseline over a 5-minute window, as well as a spike that exceeds your infrastructure's capacity.
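The way throughput combines with latency can be written down as a small decision function. This is a sketch of the reasoning above, not a complete alert rule: it handles only drops (spikes need a capacity figure), and the function name and return labels are made up for illustration.

```python
def throughput_signal(baseline_rps, current_rps,
                      baseline_p95_ms, current_p95_ms,
                      drop_fraction=0.20):
    """Classify a throughput drop using latency as the second signal."""
    if baseline_rps <= 0:
        raise ValueError("baseline_rps must be positive")
    change = (current_rps - baseline_rps) / baseline_rps
    if change >= -drop_fraction:
        return "ok"  # drop is within the tolerated range
    if current_p95_ms > baseline_p95_ms:
        return "possible bottleneck"  # fewer requests served, and slower
    return "possible connectivity problem"  # fewer requests, latency not worse


# Hypothetical readings against a baseline of 100 req/s and 200 ms p95.
print(throughput_signal(100, 70, 200, 900))  # possible bottleneck
print(throughput_signal(100, 70, 200, 150))  # possible connectivity problem
```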

4. Saturation: The Capacity Ceiling

Saturation measures how close your system is to its capacity limit, which in practice shows up as work waiting on a resource that is already busy. This metric is often the most overlooked, yet it's the one that predicts problems before they happen. Saturation can be measured at different levels: CPU run-queue length, memory pressure (for example, swapping), database connection pool usage, thread pool utilization, or queue depth. The key is to identify the resource that is most likely to become the bottleneck for your application. For a web service, that might be the database connection pool. For a compute-heavy application, it could be CPU.

A good rule of thumb is to alert when saturation exceeds 80% of capacity for more than 10 minutes. But the exact threshold depends on your application's sensitivity to resource exhaustion. Some systems start degrading at 60% saturation, while others can run at 90% without issues. Use your baseline to determine the inflection point where latency starts to increase as saturation rises.
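One way to look for that inflection point in baseline data is to bucket samples by saturation and find the first bucket where latency is clearly worse than at low saturation. The sketch below does this with a fixed bucket width and a 1.5x slowdown factor; both values, and the sample data, are assumptions to adjust for your system.

```python
def saturation_inflection(points, bucket_pct=10, slowdown=1.5):
    """Return the lower edge (in percent) of the lowest saturation bucket
    whose mean latency is at least `slowdown` times the mean latency of
    the least saturated bucket, or None if no bucket qualifies.

    points: iterable of (saturation_percent, latency_ms) from baseline data.
    """
    buckets = {}
    for saturation, latency in points:
        edge = int(saturation // bucket_pct) * bucket_pct
        buckets.setdefault(edge, []).append(latency)
    if not buckets:
        return None
    means = {edge: sum(v) / len(v) for edge, v in buckets.items()}
    edges = sorted(means)
    reference = means[edges[0]]
    for edge in edges[1:]:
        if means[edge] >= slowdown * reference:
            return edge
    return None


# Hypothetical baseline: latency is flat until roughly 60% pool usage.
baseline = [(15, 100), (25, 102), (35, 105), (45, 110),
            (55, 120), (65, 160), (75, 260), (85, 600)]
inflection = saturation_inflection(baseline)  # 60
```

With this data an 80% alert threshold would fire well after latency has started to climb, so a lower threshold would be the safer choice.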

5. Resource Utilization: The Infrastructure Health Check

Resource utilization covers the classic infrastructure metrics: CPU, memory, disk I/O, and network bandwidth. While these are not directly user-facing, they provide context for the other four metrics. A high latency combined with high CPU utilization suggests a compute bottleneck. High error rates with high memory usage might indicate a memory leak. Resource utilization metrics are most valuable when you correlate them with the other signals.

Set thresholds that reflect your application's normal operating range. For CPU, a sustained utilization above 90% is usually a concern. For memory, look for steady increases over time that suggest a leak. For disk I/O, high utilization can cause queueing and slow down database operations. The key is to use these metrics as supporting evidence, not primary alerts—otherwise you'll get too many false positives.
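"Steady increases over time" can be quantified with a simple trend line. The sketch below fits a least-squares slope to memory samples and reports growth per hour; the function name and the sample data are illustrative, and what counts as a worrying slope is something you would decide from your own baseline.

```python
def trend_per_hour(samples):
    """Least-squares slope of value over time, in units per hour.

    samples: list of (timestamp_seconds, value) with at least two
    distinct timestamps.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t * 3600.0


# Hypothetical hourly memory readings in MB, growing by 10 MB each hour.
memory_mb = [(hour * 3600, 500 + 10 * hour) for hour in range(24)]
growth_mb_per_hour = trend_per_hour(memory_mb)  # approximately 10.0
```

A slope that stays positive across restarts-free days is a stronger leak signal than any single high reading.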

Tools and Setup for Collecting These Metrics

Choosing the right monitoring stack depends on your team size, budget, and infrastructure complexity. For small teams or single services, a lightweight solution like Prometheus combined with Grafana for visualization is a popular choice. Prometheus is pull-based, meaning it scrapes metrics from endpoints you expose. It works well for dynamic environments and has a rich ecosystem of exporters for databases, message queues, and web servers. The setup involves instrumenting your application with a Prometheus client library, adding a prometheus.yml configuration to define scrape targets, and setting up a Grafana dashboard to view the data.

For teams already using cloud providers, managed monitoring services like AWS CloudWatch, Google Cloud Monitoring, or Azure Monitor offer deep integration with their ecosystems. These services can automatically collect infrastructure metrics and provide some application-level monitoring through SDKs. The trade-off is cost—at scale, managed monitoring can become expensive—and less flexibility in metric retention and querying compared to self-hosted solutions.

Another option is SaaS platforms like Datadog, New Relic, or Honeycomb. These provide end-to-end monitoring with built-in dashboards, alerting, and correlation across metrics, traces, and logs. They are easier to set up than self-hosted alternatives but come with per-host or per-data-volume pricing. For teams that need to move fast and don't want to manage infrastructure, SaaS is often the best choice. However, be mindful of vendor lock-in and data egress costs if you ever decide to migrate.

Regardless of the tool, ensure you can tag metrics with relevant dimensions. Tags allow you to slice and dice data—for example, viewing latency by endpoint, error rate by deployment version, or throughput by geographic region. This granularity is essential for root cause analysis. Also, plan for metric retention. Keep raw data for at least 30 days for troubleshooting, and aggregated data (like hourly or daily rollups) for longer trend analysis.

Finally, set up alerting with care. Use the five metrics as the basis for your primary alerts, but avoid alerting on every minor deviation. A good pattern is to alert on sustained changes (e.g., latency above threshold for 5 minutes) rather than instantaneous spikes. Use different severity levels: critical alerts for user-facing degradation (high latency, high error rate), warning alerts for potential issues (saturation approaching 80%), and informational alerts for changes that need review (throughput spike).

Adapting the Framework for Different Constraints

Not every team has the same resources or scale. Here are variations for common scenarios.

Startup with a Single Service

If you run a single web service with a database, focus on latency (95th percentile), error rate, and throughput. Saturation can be approximated by database connection pool usage. Resource utilization is secondary—you probably don't need to monitor disk I/O unless you're serving large files. Use a simple setup like Prometheus + Grafana, or even a cloud provider's built-in monitoring if you're on a single VM. Your alerting can be minimal: alert on latency >500ms for 5 minutes, error rate >2%, or throughput drop >30%.

Microservices with Many Services

In a microservices environment, you need to monitor each service independently and also track cross-service latency. Use distributed tracing (e.g., OpenTelemetry) to understand how latency accumulates across service boundaries. Saturation becomes more complex—each service may have different bottlenecks. Standardize on a common metrics format and ensure every service exposes the five metrics. Use a centralized monitoring platform (like Prometheus with Thanos or a SaaS tool) to aggregate data. Alerting should be service-specific, but also have global alerts for overall system health (e.g., any service with error rate >5%).

Low-Budget Team

If you can't afford SaaS or dedicated monitoring infrastructure, use open-source tools and prioritize the most critical metrics. A simple setup could be: instrument your application with a StatsD client, send metrics to a local Telegraf agent, and store them in InfluxDB. Visualize with Grafana running on the same server. For alerting, use a simple script that checks metrics from the database and sends notifications via email or a free Slack integration. This setup is not highly available, but it works for small teams. Focus on latency and error rate as your primary health signals; you can skip saturation and resource utilization initially.
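For the StatsD step, emitting a metric amounts to sending a short text line over UDP. The sketch below formats lines in the plain StatsD protocol (`name:value|type`) and sends them to a local agent. The host, port 8125, and the metric names are assumptions; check what your agent is configured to listen on.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the plain StatsD line protocol: name:value|type."""
    if metric_type not in ("c", "ms", "g"):
        raise ValueError("expected 'c' (counter), 'ms' (timer) or 'g' (gauge)")
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local agent (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names covering latency and error count.
latency_line = statsd_line("api.request.latency", 42, "ms")
error_line = statsd_line("api.request.errors", 1, "c")
# send_metric(latency_line)  # uncomment once an agent is listening
```

Because UDP is fire-and-forget, a stopped agent does not slow down or break the application; the trade-off is that metrics are silently lost while it is down.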

Common Pitfalls and How to Debug Them

Even with the right metrics, things can go wrong. Here are frequent issues and how to address them.

Alert Fatigue

Too many alerts lead to ignored notifications. This usually happens because thresholds are too tight or because you're alerting on every metric for every service. Solution: review your alerts weekly and adjust thresholds based on recent data. Use a "burn rate" approach—alert only when the error budget is being consumed faster than expected. For latency, alert on sustained increases, not single spikes. Also, consider using a different notification channel for warnings vs. critical alerts.
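A burn rate is the observed failure rate divided by the failure rate the SLO allows, so a value of 1.0 means the budget is being spent exactly on pace. The sketch below computes it for two windows and pages only when both are high. The window lengths, the 99% target, and the threshold of 10 are assumptions for illustration, not recommended values.

```python
def burn_rate(failed, total, slo_target):
    """Observed failure rate divided by the failure rate the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / (1.0 - slo_target)


def should_page(short_window_rate, long_window_rate, threshold):
    """Page only when both windows burn fast: the long window filters out
    blips, and the short window confirms the problem is still happening."""
    return short_window_rate >= threshold and long_window_rate >= threshold


# Hypothetical counts for a 5-minute and a 1-hour window.
short = burn_rate(failed=120, total=1_000, slo_target=0.99)   # about 12
long_ = burn_rate(failed=900, total=10_000, slo_target=0.99)  # about 9
page = should_page(short, long_, threshold=10.0)              # False
```

Here the short window alone would have paged, but the hour-long view shows the budget is not yet burning fast enough to wake someone up.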

Missing Metrics

You might discover that a metric you thought was being collected isn't actually instrumented. For example, you might have latency metrics for the API gateway but not for internal services. Solution: create a checklist of all services and ensure each one exposes the five metrics. Use a monitoring dashboard that shows a summary of which services are reporting, and set up a "heartbeat" alert that fires if a service stops sending metrics.
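A heartbeat check can be as simple as comparing each expected service's last-seen timestamp against a maximum silence window. In the sketch below, the service names, timestamps, and the 120-second window are made up; the list of expected services plays the role of the checklist.

```python
def silent_services(expected, last_seen, now, max_silence_s=120):
    """Return the sorted names of expected services that have never
    reported, or whose latest metric is older than max_silence_s.

    expected: iterable of service names (the checklist).
    last_seen: {service_name: unix timestamp of its most recent metric}.
    """
    silent = []
    for name in expected:
        ts = last_seen.get(name)
        if ts is None or now - ts > max_silence_s:
            silent.append(name)
    return sorted(silent)


expected = ["api", "auth", "billing", "search"]
last_seen = {"api": 1000, "billing": 700, "search": 990}
missing = silent_services(expected, last_seen, now=1000)  # ["auth", "billing"]
```

Checking against an explicit list matters: a service that was never instrumented has no last-seen entry at all, and would be invisible to a check that only looks at stale timestamps.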

Misinterpreting Correlation

A common mistake is assuming that two metrics moving together means one caused the other. For example, a spike in CPU utilization and a spike in latency might both be caused by a traffic surge, not by CPU saturation. Solution: always look at a third metric—throughput—to confirm. If throughput also increased, the issue is load, not a bottleneck. If throughput stayed flat while CPU and latency increased, then CPU saturation is likely the cause.
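The throughput check described here can be expressed as a small helper. It assumes CPU and latency have already been observed rising together; the 10% tolerance for "flat" and the labels are assumptions, and the case where throughput falls is left as inconclusive because the reasoning above does not cover it.

```python
def classify_cpu_latency_spike(throughput_change, flat_tolerance=0.10):
    """Interpret a joint CPU and latency increase using throughput.

    throughput_change: fractional change versus baseline (0.25 means +25%).
    """
    if throughput_change > flat_tolerance:
        return "load increase"
    if abs(throughput_change) <= flat_tolerance:
        return "likely CPU saturation"
    return "inconclusive"  # throughput fell; look at other signals


print(classify_cpu_latency_spike(0.40))   # load increase
print(classify_cpu_latency_spike(0.02))   # likely CPU saturation
```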

Thresholds That Don't Match Reality

Setting thresholds based on intuition rather than data leads to false alarms. For instance, you might set a CPU alert at 80%, but your application normally runs at 85% during peak hours. Solution: use the baseline data to set dynamic thresholds. Many monitoring tools support anomaly detection that adjusts thresholds based on historical patterns. If that's not available, set thresholds at the 95th percentile of your baseline plus a buffer.
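Deriving a threshold from the baseline instead of intuition might look like the sketch below, which takes the 95th percentile of the baseline and adds a buffer. It uses `statistics.quantiles` from the Python standard library; the 10% buffer, the optional cap, and the sample data are assumptions.

```python
import statistics


def baseline_threshold(baseline_values, buffer_fraction=0.10, cap=None):
    """Alert threshold = 95th percentile of the baseline plus a buffer.

    cap: optional upper bound, e.g. 100 for percentage metrics.
    """
    if len(baseline_values) < 2:
        raise ValueError("need at least two baseline samples")
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    p95 = statistics.quantiles(baseline_values, n=20, method="inclusive")[18]
    threshold = p95 * (1.0 + buffer_fraction)
    return min(threshold, cap) if cap is not None else threshold


# Hypothetical baseline of 100 latency samples: 1, 2, ..., 100 ms.
baseline_ms = list(range(1, 101))
threshold_ms = baseline_threshold(baseline_ms)  # about 104.6
```

For a metric bounded at 100%, such as CPU, pass `cap=100` so the buffer cannot push the threshold to a value that can never be reached.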

Frequently Asked Questions

How many metrics should I monitor total? Start with these five per service. You can add more as you identify specific failure modes, but resist the urge to monitor everything. Each additional metric adds cognitive load and potential alert noise. A good rule: if you haven't looked at a metric in a month, consider removing it.

Should I use averages or percentiles for latency? Percentiles, specifically the 95th and 99th. Averages hide outliers. If you must use a single number, use the 95th percentile. For high-traffic services, also track the 99.9th percentile to catch rare but severe slowdowns.

What about logs and traces? Metrics are for alerting and dashboards. Logs and traces are for debugging when a metric indicates a problem. The three together form a complete observability stack. Use metrics to tell you something is wrong, logs to tell you what happened, and traces to tell you where in the request flow it happened.

How often should I review my metrics and thresholds? At least monthly, or after any significant deployment or traffic pattern change. Thresholds that worked six months ago may no longer be appropriate as your application evolves. Set a recurring calendar reminder to review your monitoring configuration.

Can I skip saturation if I have good resource utilization metrics? No. Resource utilization tells you how much of a resource is being used, but saturation tells you whether that usage is causing queuing or contention. For example, 70% CPU utilization might be fine, but if the CPU run queue is consistently longer than the number of available cores, work is waiting and you have saturation. Saturation is a leading indicator of performance degradation, while utilization is a lagging indicator.

Next Steps: From Metrics to Action

Having these five metrics in place is only the first step. The real value comes from how you respond when they deviate. Start by creating a runbook for each metric that specifies: what the alert means, what to check first, and who to notify. For example, a latency alert might trigger a check of recent deployments, database query performance, and upstream service health. Document these runbooks and test them during low-traffic periods.

Next, establish a regular review cadence. Meet with your team weekly to look at metric trends, not just alerts. Are there gradual increases in latency that haven't triggered alerts yet? Is error rate creeping up after a recent release? These reviews help you catch problems before they become incidents. Use the time to also adjust thresholds and refine your monitoring setup.

Finally, tie your metrics to service level objectives (SLOs). Define targets for each metric, for example, 99% of requests complete in under 300ms. Track your error budget (the share of requests, or amount of time, that is allowed to miss the SLO target) and use it to prioritize engineering work. When the error budget is nearly exhausted, stop shipping new features and focus on reliability. This creates a direct link between monitoring data and business decisions, which is the ultimate goal of application health monitoring.
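As a worked example of tracking an error budget against a latency SLO, the sketch below computes how much of the budget is left in a reporting period. The request counts and the function name are hypothetical, and a request-based budget is only one way to define it; a time-based budget works the same way with minutes instead of requests.

```python
def error_budget_remaining(total_requests, bad_requests, slo_target):
    """Fraction of the error budget still unspent (negative if overspent).

    A "bad" request is one that missed the SLO, for example by taking
    300 ms or longer when the target is under 300 ms.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_bad = total_requests * (1.0 - slo_target)
    return 1.0 - bad_requests / allowed_bad


# Hypothetical month: 1,000,000 requests, 7,500 of them too slow, 99% target.
# The SLO allows about 10,000 slow requests, so roughly a quarter is left.
remaining = error_budget_remaining(1_000_000, 7_500, slo_target=0.99)
```

When the remaining fraction approaches zero, that is the signal to shift effort from features to reliability.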

Start with one service. Instrument it with the five metrics, set up a dashboard, and define a few critical alerts. Run with that for two weeks, then iterate. You'll quickly learn what works for your specific application and what doesn't. The framework is flexible—adapt it to your context, but keep the core principle: measure what matters to users, not just what's easy to measure.
