Every engineering team tracks something. The question is whether what you track actually helps you ship better software faster, or just fills a dashboard with numbers nobody acts on. We've seen teams obsess over deployment frequency while their users suffer through constant regressions, and teams that monitor server CPU so closely they miss the fact that their application is unusable on slow networks. The problem isn't a lack of metrics — it's a lack of clarity about which metrics matter for your specific context.
This guide is for engineering leaders and team leads who want to move beyond vanity metrics and build a measurement system that drives real decisions. We'll walk through a decision framework, compare the main approaches, and show you how to implement metrics that actually improve your product and your team's health.
Who Must Choose and Why the Clock Is Ticking
Every team reaches a point where its informal sense of 'is it working?' collides with the need for hard data. Maybe you're a startup that has just hired a second backend engineer and suddenly releases feel chaotic. Maybe you're a mid-stage company where leadership is asking for reliability numbers to support a compliance audit. Or maybe you're on an established team that has been measuring the same things for years and suspects the metrics are now lying to you.
In each case, the decision is not whether to measure (you already measure something) but whether to consciously choose what you measure. The cost of not choosing is that default metrics (like uptime percentages or lines of code) will fill your dashboards by inertia, and those defaults often reward the wrong behaviors. A team that optimizes for 99.99% uptime might avoid deploying any change for weeks, while a team that chases deployment frequency might push broken code to production every hour. Both extremes hurt users and the business.
The urgency comes from the fact that teams grow and their needs shift. A two-pizza team can get away with a simple checklist and a shared sense of quality. But as you add people, services, and customers, the feedback loops get longer and the cost of bad metrics compounds. A misaligned metric can steer an entire quarter's worth of work in the wrong direction before anyone notices. By then, the rework cost is high and the team's morale has taken a hit.
We're not here to prescribe a single set of metrics — that would be irresponsible, because your team's stage, product type, and risk tolerance all matter. Instead, we'll give you a decision framework that helps you pick the right metrics for your situation, and show you how to evolve them as you grow. The clock is ticking because every day you measure the wrong thing is a day you're optimizing for the wrong outcome.
Who Should Read This
This guide is written for engineering managers, tech leads, and senior individual contributors who are responsible for defining or improving their team's measurement practices. It assumes you have some familiarity with operational metrics like latency and error rates, but we'll define each term as we go. If you're a new manager feeling overwhelmed by the number of metric frameworks out there, or a seasoned lead looking to validate your current approach, this guide is for you.
The Option Landscape: Three Approaches to Performance Metrics
There is no shortage of metric frameworks. The challenge is that each framework was designed with a specific context in mind, and applying it outside that context can give you misleading signals. We'll look at three broad approaches that cover most real-world scenarios: latency-and-throughput metrics, error-budget-based metrics, and DORA-style software delivery metrics. Each has its strengths, and each can be combined with the others.
Latency and Throughput Metrics
This is the classic approach: measure how fast your system responds (latency) and how much work it can handle (throughput). Latency is typically reported as percentiles — p50, p95, p99 — because averages hide the worst-case experience. Throughput might be requests per second, transactions per minute, or data volume processed. These metrics are essential for any service that interacts with users or other systems.
When they work well: You have a well-defined user-facing endpoint, and you can instrument it end-to-end. The metrics give you a clear picture of user experience and capacity planning.
When they mislead: If your system has multiple tiers (load balancer, application server, database), a single latency number can be hard to attribute. Also, optimizing for p99 latency in isolation can lead to over-provisioning and reduced throughput.
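As a rough illustration of why percentiles are preferred over averages, the sketch below summarizes one window of request latencies. The function names, the sample data, and the nearest-rank percentile method are assumptions chosen for this example; production monitoring tools usually estimate percentiles from histograms instead.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, window_seconds):
    """Return mean, p50/p95/p99 latency and throughput for one measurement window."""
    return {
        "mean_ms": mean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_rps": len(latencies_ms) / window_seconds,
    }


# Hypothetical window: 97 fast requests and 3 slow ones observed over 10 seconds.
latencies_ms = [120] * 97 + [4000] * 3
summary = summarize(latencies_ms, window_seconds=10)
print(summary)
```

With this made-up data the mean is about 236 ms and p50 and p95 are both 120 ms, while p99 is 4000 ms. Only the p99 figure reveals the slow requests.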
Error Budgets
Popularized by Google's SRE model, the error budget approach starts with a service level objective (SLO), say 99.9% availability over a rolling 30-day window. The remaining 0.1% (about 43 minutes of downtime) becomes the error budget. As long as the budget is not exhausted, the team can deploy changes freely. When the budget is low, deployments slow down and the team focuses on reliability improvements.
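The arithmetic behind that figure can be sketched as follows, assuming a time-based availability SLO and a 30-day window. The function names are illustrative and do not come from any particular SRE tool.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

A 99.9% SLO over 30 days leaves 43.2 minutes of budget, so 21.6 minutes of downtime consumes half of it.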
When it works well: You have a mature service with a clear availability target and a culture that accepts controlled risk. The error budget gives a transparent, data-driven way to balance feature velocity and stability.
When it misleads: If your SLO is set too loosely, the error budget never gets used and the mechanism becomes irrelevant. If it's set too tightly, the team becomes risk-averse and innovation stalls. Also, error budgets don't directly capture user experience nuances like slow responses that are not errors but still frustrate users.
DORA Metrics
The DORA (DevOps Research and Assessment) framework defines four key metrics: deployment frequency, lead time for changes, time to restore service, and change failure rate. These metrics focus on the software delivery process rather than the runtime behavior of the system. High-performing teams, according to DORA research, deploy frequently, have short lead times, restore service quickly, and have low change failure rates.
When it works well: You want to improve your team's agility and reliability in the delivery pipeline. The metrics are actionable and comparative: you can benchmark against industry data.
When it misleads: The metrics are about process, not outcomes. A team can have great DORA scores but still build the wrong product or have poor user experience. Also, the metrics assume a certain level of automation and CI/CD maturity; teams just starting out may find the targets demotivating.
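To make the four metrics concrete, here is a minimal sketch that derives them from a list of deployment records. The `Deployment` record, its field names, and the sample data are hypothetical. Measuring time to restore from the moment of deployment is a simplifying assumption; many teams measure it from when the incident was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments, window_days):
    """Compute the four DORA metrics over one window of deployment records."""
    if not deployments:
        raise ValueError("no deployments in window")
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_week": len(deployments) * 7 / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Hypothetical two-week window with four deployments, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deployments = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 4 * hour),
    Deployment(t0 + 7 * day, t0 + 7 * day + 6 * hour,
               failed=True, restored_at=t0 + 7 * day + 7 * hour),
    Deployment(t0 + 10 * day, t0 + 10 * day + 8 * hour),
]
summary = dora_summary(deployments, window_days=14)
print(summary)
```

Medians are used instead of means so that one unusually slow change does not dominate the lead-time figure.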
Comparison Criteria: How to Choose What Fits Your Team
Choosing between these approaches is not about picking the 'best' framework — it's about matching the metrics to your team's constraints. Here are the criteria we recommend using to evaluate any metric or set of metrics.
Alignment with Business Goals
The first question is whether the metric correlates with something your users or your business cares about. Latency matters because slow pages lose revenue. Error budgets matter because outages erode trust. DORA metrics matter because slow delivery frustrates stakeholders. If you can't draw a clear line from the metric to a business outcome, it's probably a vanity metric. For example, lines of code written correlates weakly with value delivered; deployment frequency correlates more strongly but still needs to be paired with quality measures.
Actionability
A good metric should tell you what to do next. If your p99 latency spikes, you can investigate the root cause. If your error budget is half consumed, you might decide to halt deployments. If your change failure rate rises, you can invest in better testing. Metrics that are purely descriptive (like server count or total users) are harder to act on because they don't have a clear lever. We prefer metrics that come with a well-understood feedback loop: measure, compare to a target, decide, and act.
Cost of Collection
Instrumenting every endpoint, calculating percentiles, and storing high-resolution data costs engineering time and infrastructure. For a small team, a simple uptime check and request duration histogram might be enough. For a large team, you might need distributed tracing and a dedicated observability platform. Be honest about what you can afford to measure and maintain. A metric you can't reliably collect is worse than no metric, because it gives false confidence.
Resistance to Gaming
Any metric that becomes a target tends to stop being a good metric. If you reward developers for high deployment frequency, they may split work into trivially small deploys that inflate the count without shipping more value. If you penalize teams for low uptime, they will schedule maintenance during off-hours and hide incidents. Choose metrics that are hard to game because they are measured in a way that reflects genuine user experience. For example, measuring latency from real user monitoring (RUM) is harder to game than measuring it from synthetic probes that hit a specific endpoint.
Trade-offs: A Structured Comparison of the Three Approaches
To make the choice concrete, we've built a comparison table that highlights the key trade-offs across the three approaches. Use it as a starting point, not a final verdict.
| Dimension | Latency/Throughput | Error Budgets | DORA Metrics |
|---|---|---|---|
| Primary focus | Runtime behavior | Reliability risk | Delivery process |
| Best for | User-facing services with strict performance requirements | Mature services with defined SLOs | Teams improving CI/CD and deployment practices |
| Weakness | Can hide system-level bottlenecks if not traced | Requires culture of trust around budgets | Does not measure user experience directly |
| Implementation effort | Medium: need instrumentation and percentile calculation | High: need SLO definition, monitoring, and alerting | Medium: need CI/CD pipeline data and incident tracking |
| Gaming potential | Low if using RUM; higher with synthetic probes | Medium: teams may inflate budgets or hide incidents | Medium: teams may split deployments to increase frequency |
| Scalability | Works at any scale with proper tooling | Best for teams with dedicated SRE or ops roles | Works for teams of 5–50; may need adaptation for larger orgs |
The table makes clear that no single approach covers everything. A common pattern is to combine them: use DORA metrics to track delivery health, latency/throughput to monitor runtime behavior, and error budgets to set boundaries on risk. But combining them also multiplies the complexity, so start with the one that addresses your most pressing pain point.
Composite Scenario: A Startup Scaling Up
Consider a startup that has just raised its Series A and is growing its engineering team from 5 to 15. The product is a B2B SaaS platform with a web app and an API. The current pain point is that deployments are infrequent (once a week) and often break things. The team is considering metrics. A pure latency/throughput focus would tell them about user experience but not about deployment pain. Error budgets would require them to define SLOs, which they haven't done. DORA metrics seem most relevant because they address the deployment bottleneck. The team decides to first instrument their CI/CD pipeline to measure deployment frequency and change failure rate. After three months, they improve to daily deployments with a 5% failure rate. Now they add latency monitoring because the next pain point is API response times under load.
Implementation Path: From Choice to Practice
Once you've chosen your primary metric set, the next step is to implement it in a way that doesn't overwhelm the team. We recommend a phased approach.
Phase 1: Baseline and Instrument
Start by measuring what you already have. Most teams have some monitoring in place — even if it's just a health check. Collect a week of data on the metrics you've chosen. Don't set targets yet; just understand the current distribution. This phase also involves any necessary instrumentation: adding tracing to slow endpoints, setting up deployment tracking, or defining SLOs. Keep the scope narrow. For latency, start with the top three user-facing endpoints. For DORA, start with one service.
Phase 2: Set Targets and Review Weekly
After you have a baseline, set initial targets. For latency, a common starting point is p99 < 500ms for web APIs. For error budgets, define an SLO of 99.9% for the main service. For DORA, aim for deployment frequency of at least once per week and change failure rate below 15%. Review these targets in a weekly team meeting. The goal is not to hit them immediately but to see if they feel right. If the team meets a target every week without effort, it may be too loose; if they never meet it, it may be too tight. Adjust accordingly.
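The loose-versus-tight check can be written down as a small helper for the weekly review. This is a sketch only: the function name, the labels it returns, and the sample numbers are assumptions, and the output is a prompt for discussion, not a verdict.

```python
def review_target(weekly_values, target, lower_is_better=True):
    """Classify a provisional target from recent weekly observations."""
    if not weekly_values:
        raise ValueError("need at least one weekly value")
    if lower_is_better:
        met = [value <= target for value in weekly_values]
    else:
        met = [value >= target for value in weekly_values]
    if all(met):
        return "possibly too loose"   # target met every week
    if not any(met):
        return "possibly too tight"   # target never met
    return "keep for now"


# Hypothetical four weeks of data for each metric.
print(review_target([310, 290, 330, 305], target=500))        # p99 latency in ms
print(review_target([0.22, 0.19, 0.25, 0.20], target=0.15))   # change failure rate
print(review_target([1, 0, 2, 1], target=1, lower_is_better=False))  # deploys per week
```

For this sample data the three calls return "possibly too loose", "possibly too tight", and "keep for now" respectively.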
Phase 3: Integrate into Decision-Making
Metrics are useless if they live in a dashboard nobody looks at. Integrate them into your existing processes. For example, make the error budget visible in your deployment pipeline — if the budget is low, the pipeline can block deployments automatically. Use latency data in your on-call rotation to prioritize incidents. Include DORA metrics in your sprint retrospectives to discuss process improvements. The key is that metrics become part of the conversation, not a report that is reviewed once a month.
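One way the pipeline check might look is sketched below, assuming a request-based SLO where the budget is counted in failed requests. The function name, the 10% threshold, and the source of the request counts are all assumptions; a real pipeline would pull these numbers from its monitoring system.

```python
def deploy_allowed(slo, bad_events, total_events, min_budget_left=0.1):
    """Return (allowed, budget_left) for a request-based SLO.

    budget_left is the unspent fraction of the error budget; deployments are
    blocked once it falls below min_budget_left (a hypothetical threshold).
    """
    if total_events == 0:
        return True, 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1 - slo) * total_events
    budget_left = 1 - bad_events / allowed_bad
    return budget_left >= min_budget_left, budget_left


# Hypothetical window: one million requests against a 99.9% SLO.
allowed, left = deploy_allowed(0.999, bad_events=400, total_events=1_000_000)
print(allowed, round(left, 2))   # True 0.6

allowed, left = deploy_allowed(0.999, bad_events=950, total_events=1_000_000)
print(allowed, round(left, 2))   # False 0.05
```

In a CI step, the boolean could be turned into the job's exit code so that a low budget fails the deployment stage. Giving the team a documented manual override keeps the gate from blocking urgent reliability fixes.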
Phase 4: Iterate and Expand
After a few months, revisit your metric choices. Are they still aligned with your goals? Have new pain points emerged? Expand to additional services or add complementary metrics. For example, if your latency metrics are stable but users still complain about performance, you might need to add client-side metrics like time to interactive. If your DORA metrics look good but the team feels burned out, consider adding a well-being metric like deployment stress score (a subjective measure from the team).
Risks of Choosing Wrong or Skipping Steps
Even with the best intentions, measuring the wrong thing can cause real damage. Here are the most common risks and how to avoid them.
Metric Fixation and Tunnel Vision
When a single metric becomes the focus, the team optimizes for that metric at the expense of everything else. This is Goodhart's Law in action. For example, a team that targets 99.99% uptime might refuse to deploy any change that carries even a tiny risk, leading to stagnation. To avoid this, always track a small set of complementary metrics that cover different dimensions — speed, quality, and reliability.
Dashboard Overload
It's tempting to put every available metric on a dashboard. The result is a wall of numbers that nobody can interpret. We've seen teams with 50+ graphs on a single screen, and the only thing they monitor is the one that turns red. The fix is ruthless prioritization: for each metric, ask whether you would stop a deployment or wake someone up at 3 AM if it went bad. If the answer is no, it doesn't belong on the main dashboard.
False Confidence from Averages
Averages hide outliers. A system with a 200ms average latency might have a p99 of 5 seconds, meaning 1% of requests take 5 seconds or longer. Always use percentiles for latency and avoid relying solely on averages for any metric. Similarly, change failure rate averaged over a month can hide a bad week. Look at distributions and trends.
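The change failure rate case is easy to show with numbers. The weekly counts below are invented for illustration.

```python
def failure_rate(deploys, failures):
    """Share of deployments that failed; 0.0 when nothing was deployed."""
    return failures / deploys if deploys else 0.0


# Hypothetical month: (deployments, failed deployments) for each week.
weeks = [(20, 1), (20, 0), (20, 8), (20, 1)]

monthly = failure_rate(sum(d for d, _ in weeks), sum(f for _, f in weeks))
weekly = [failure_rate(d, f) for d, f in weeks]

print(f"monthly: {monthly:.1%}")                # monthly: 12.5%
print("weekly:", [f"{r:.0%}" for r in weekly])  # weekly: ['5%', '0%', '40%', '5%']
```

The monthly figure of 12.5% looks acceptable on its own, yet the third week had a 40% failure rate.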
Cultural Resistance
Introducing new metrics can feel like surveillance to engineers. If metrics are used punitively — to blame individuals for failures — the team will game them or hide problems. The antidote is transparency and a blameless culture: metrics are for learning, not judging. Share the metrics openly, discuss them in retrospectives, and never tie them to individual performance reviews.
Mini-FAQ: Common Questions About Performance Metrics
How many metrics should a small team track?
Start with 3–5. A good starting set is: p99 latency, error rate, deployment frequency, change failure rate, and one business metric (e.g., signup completion rate). As the team grows, you can add more.
Should we use synthetic monitoring or real user monitoring?
Both have their place. Synthetic monitoring gives you consistent, repeatable measurements but may not reflect real user conditions. Real user monitoring (RUM) captures actual experiences but adds complexity and privacy considerations. For latency, we recommend starting with synthetic for your critical paths and adding RUM later.
How do we set SLOs without historical data?
Start with industry benchmarks or educated guesses. For a web API, 99.9% availability and p99 latency under 500ms are reasonable starting points. After a month, adjust based on what you observe. The important thing is to have a target, even if it's provisional.
What if our metrics look good but users are unhappy?
This is a sign that your metrics are not aligned with user experience. You may be measuring the wrong things (e.g., server-side latency instead of end-to-end page load) or missing qualitative signals. Add user feedback surveys or session replay to complement your quantitative metrics.
How often should we review our metrics?
Automate alerts for real-time issues, but review trends weekly in a team meeting. Quarterly, do a deeper review of whether your metric set still matches your goals. Avoid checking dashboards multiple times a day, because that leads to noise and overreaction.
Recommendation Recap: Next Steps Without Hype
You don't need to adopt every metric framework at once. The most reliable path is to identify your single biggest pain point — slow deployments, frequent outages, or poor user experience — and pick the metric set that addresses it directly. For most teams, that means starting with DORA metrics if delivery is the bottleneck, or latency/throughput if user experience is the issue. Error budgets are best added once you have a stable baseline and a culture that can handle risk budgets.
Here are five specific next moves you can make this week:
- Audit your current dashboards. List every metric you currently track. For each one, write down the action you would take if it went bad. If you can't name an action, remove it.
- Pick one metric to start. Choose a single metric that is aligned with your biggest pain point. Instrument it this week. Don't wait for the perfect tool — a simple script that logs to a file is better than nothing.
- Set a provisional target. Based on your baseline, set a target that feels challenging but achievable. Write it down and share it with the team.
- Create a weekly review slot. Block 30 minutes on the calendar to review the metric and discuss what you learned. Keep it blameless and focused on improvement.
- Plan your next metric. After one month, decide whether to add a second metric. Use the comparison criteria from this guide to choose.
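For the "simple script that logs to a file" mentioned in the list above, something like the following would be enough to start. It is a minimal sketch: the `timed` helper, the `metrics.log` path, and the JSON-lines record format are assumptions for illustration, not a recommended standard.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def timed(name, log_path="metrics.log"):
    """Time a block of code and append one JSON line per measurement."""
    start = time.perf_counter()
    ok = True
    try:
        yield
    except Exception:
        ok = False  # record the failure, then let the error propagate
        raise
    finally:
        record = {
            "ts": time.time(),
            "name": name,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
            "ok": ok,
        }
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")


# Usage (hypothetical handler name):
#
#     with timed("checkout"):
#         handle_checkout(request)
```

Each line in the log is an independent JSON object, so a later script can read the file line by line to compute percentiles or error rates for the weekly review.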
Measuring what matters is not a one-time project. It's a habit of asking whether your metrics still serve your goals, and adjusting when they don't. Start small, stay honest about what the data tells you, and let the metrics guide — not dictate — your decisions.