Application Testing
Why SLOs are a key tool
Monday, August 18, 2025
Austin Harris
Modern systems run at massive scale, where failures put the business, its users, and its reputation at risk. SLOs make reliability measurable by tying metrics to user experience. Meta’s Sergey Sidorov shares what it takes to define and achieve them at scale, and why SLOs are such a key tool.
Modern digital products now operate at scales that would have seemed impossible just a decade ago. YouTube ingests over 500 hours of video every minute, Amazon handles around 12 million orders each day, and large cloud platforms and advertising systems serve billions of real-time requests.
At this level, a failure is a business risk, a user experience problem, and a potential hit to reputation. System behavior needs to be measurable, observable, and predictable. One of the most effective tools for achieving that is the Service Level Objective (SLO).
But how do you define metrics that truly reflect user experience — especially in a high-load distributed system operating across dozens of data centers, with asynchronous processing, caching, sharding, and external dependencies? What does “working well” even mean in such environments, and how do you express that in concrete, measurable terms?
We spoke with Sergey Sidorov, an experienced software engineer and technical leader at Meta. He works on the infrastructure powering core company products such as Ads and Instagram. Sergey has built high-performance systems for asynchronous computing, capable of handling billions of queries per second. In this conversation, he shared how to make reliability measurable in high-load environments, why SLOs matter, and what actually works at scale.
ADM: What does “reliability” mean to you in the context of high-load products?
Sergey: Reliability is the probability that a service will consistently perform as expected over a defined period of time. But to make that definition useful in practice, you need to answer two key questions: What exactly counts as “good enough”? And how do you measure it?
From a user’s perspective, “good enough” is a moving target. Is the app responsive enough? Are key features always available? At what point does a delay or failure become noticeable — or, worse, frustrating?
On the engineering side, we turn these user-facing concerns into measurable signals. SLIs (Service Level Indicators) define what exactly we’re tracking — like success rates or latency thresholds. Then we set SLOs (Service Level Objectives) to define what level of performance is acceptable.
Once your SLOs are in place, you can start speaking in terms that make sense to both engineering and business. For example, 99.95% availability equals roughly 4.4 hours of downtime per year. Or: a 99.95% chance that any given request will succeed.
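The arithmetic behind that equivalence is simple. A quick sketch (function name is illustrative):

```python
# Rough downtime equivalents for a given availability target.
# Illustrative only: 99.95% availability maps to about 4.4 hours
# of permitted downtime per year.

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def allowed_downtime_hours(availability: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted per period at the given availability."""
    return (1.0 - availability) * period_hours

print(round(allowed_downtime_hours(0.9995), 2))  # 4.38 hours/year
print(round(allowed_downtime_hours(0.999), 2))   # 8.76 hours/year
```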
But defining meaningful SLIs and SLOs is hard — especially in large-scale distributed systems. You run into real-world issues like tail latency from hot shards or flaky dependencies, misleading global aggregates that hide regional outages, and noisy tenants distorting shared metrics. And at scale, so-called “rare events” happen constantly. So the ability to separate signal from noise becomes absolutely critical.
ADM: How did you come to adopt SLOs as a primary tool for managing reliability?
Sergey: There were moments when everything looked fine on the dashboards — green graphs, latency within normal range — and yet users were clearly having issues. That disconnect between system health and user experience was a red flag.
That’s when the perspective shifted. Instead of focusing solely on infrastructure metrics, the focus turned to user success — what people actually feel when interacting with the service.
Around the same time, Google’s SRE teams had already been working deeply with SLOs. Studying their approach made it clear how effectively it connects technical indicators with real user outcomes. That led to the creation of an internal platform at Meta called SLICK, which enables teams to define and track SLOs in a consistent, unified way. Beyond tooling, the process evolved: SLOs became part of postmortem analysis and day-to-day operational reviews.
But perhaps the most meaningful shift was cultural. Reliability stopped being just a question of incident count — and became a matter of meeting service-level goals. Today, SLOs are fully embedded in how systems are designed and how production readiness is assessed.
ADM: What makes an SLO truly useful, and what are some common mistakes in defining them?
Sergey: A useful SLO is one that gives teams a clear, measurable signal they can act on. The most common mistake is focusing on internal system stats — like CPU usage or queue depth — instead of what actually matters to users.
To avoid that, pick metrics tied to real user journeys — for example, “95% of page loads complete in under 300 ms” or “99.9% of messages are successfully delivered”. These kinds of indicators are not only measurable, they’re also easy to explain, debug, and improve.
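As a sketch of how such user-journey SLIs might be computed from raw request data, assuming simple in-memory records (the field names and thresholds here are illustrative, not Meta’s actual schema):

```python
# Minimal sketch: computing the two example SLIs from a batch of
# observations. Inputs are plain lists; a real system would read
# these from logs or a metrics pipeline.

def fraction_within(latencies_ms, threshold_ms=300):
    """Share of page loads completing under the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic: treat as meeting the objective
    return sum(1 for ms in latencies_ms if ms < threshold_ms) / len(latencies_ms)

def delivery_rate(outcomes):
    """Share of messages successfully delivered (1 = delivered, 0 = lost)."""
    if not outcomes:
        return 1.0
    return sum(outcomes) / len(outcomes)

page_loads = [120, 240, 180, 560, 90]   # ms
print(fraction_within(page_loads))      # 0.8 -> below a 95% target
```

The point is not the code itself but that each number maps directly to something a user experiences, which makes the SLI easy to explain and debug.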
Another mistake is aiming for perfection too early. It’s better to start with the “golden signals” — latency, errors, traffic, and saturation — and evolve from there. Over time, SLOs should become part of daily operations: used in on-call rotations, postmortems, and product reviews.
Once that happens, reliability becomes visible across the org. Engineers can see how their work impacts users, and teams have a shared contract that guides trade-offs and prioritization.
ADM: You developed an algorithm for fast issue detection based on SLOs — how does it differ from traditional alerting?
Sergey: Traditional alerting often relies on static thresholds — like “trigger an alert if error rate exceeds 5% for 5 minutes.” The problem is, those thresholds are either too sensitive, leading to false positives, or too relaxed, causing real issues to slip through. At scale, this creates alert fatigue and undermines trust in the system.
SLO-based alerting takes a different approach. Instead of watching individual system metrics, it monitors the risk of breaching a user-facing objective. If the current trend suggests that an SLO is likely to be violated, only then does it trigger an alert. This helps filter out noise, ignore short-lived flakiness that resolves on its own, and focus attention on problems that actually impact users.
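One common way to implement this idea is multiwindow burn-rate alerting, popularized by Google’s SRE Workbook. The sketch below is in that spirit, not Meta’s actual algorithm; the 14.4x threshold and window pairing are illustrative defaults:

```python
# Error-budget burn-rate alerting sketch: page only when the budget
# is being consumed fast enough to threaten the SLO.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is burning.
    A rate of 1.0 would exactly exhaust the budget over the SLO window."""
    budget = 1.0 - slo_target
    return error_rate / budget

def should_alert(short_window_errors: float, long_window_errors: float,
                 slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Fire only if BOTH windows burn fast: the long window shows the
    problem is sustained, the short window shows it is still happening."""
    return (burn_rate(short_window_errors, slo_target) >= threshold and
            burn_rate(long_window_errors, slo_target) >= threshold)

# A brief blip that has already recovered does not page anyone:
print(should_alert(0.001, 0.02))  # False
# A sustained 2% error rate against a 99.9% SLO does:
print(should_alert(0.02, 0.02))   # True
```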
In practice, this method has led to more accurate signals, fewer false alarms, and faster detection of real issues — which is critical when operating at the scale of billions of requests per second.
Video: SLICK: SLO Reviews at Meta - David Bartok and Sergey Sidorov
ADM: What are the main challenges in ensuring reliability at the scale of billions of requests per second?
Sergey: At that scale, failures aren’t rare — they’re expected. The key is building systems that absorb them without impacting the user. That means redundancy, retries, and fault tolerance need to be built in by design.
Another major challenge is measurement. The volume of telemetry is massive, and storing everything isn’t feasible. Sampling helps, but it also means that small but important signals — like a latency spike in one region — can be missed.
Ultimately, reliability at scale isn’t just about resilient architecture. It’s about having the right observability in place to detect when something breaks — and understand how it affects the user experience.
ADM: What would you recommend to a team that wants to start using SLOs but doesn’t yet have complex infrastructure?
Sergey: Start simple and start from the user. You don’t need complex tooling to get value from SLOs. The key is to shift the mindset from “Is the system up?” to “Is the user experience good enough?” Then pick one or two core user journeys (loading a page, posting a message, or completing a transaction) and define what “good” means for each flow. That becomes your first SLI.
Then set a realistic objective: maybe 99.9% of requests should succeed within 300ms. It doesn’t have to be perfect. The goal is to create a shared contract between product and engineering about what level of reliability matters, and to measure it. A simple counter of successes vs total requests is enough to start. From there, you can introduce the idea of an error budget, and use it to make trade-offs: Can we ship this risky change now? Should we pause and fix reliability first?
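That “simple counter of successes vs total requests,” plus an error budget, might look like this minimal sketch (the class and method names are hypothetical):

```python
# A minimal starting point for SLO tracking: count successes against
# total requests and track how much of the error budget remains.

class SloTracker:
    def __init__(self, objective: float = 0.999):
        self.objective = objective
        self.total = 0
        self.successes = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        self.successes += int(ok)

    def sli(self) -> float:
        """Observed success rate so far."""
        return self.successes / self.total if self.total else 1.0

    def error_budget_remaining(self) -> float:
        """Fraction of allowed failures not yet spent (negative = SLO breached)."""
        allowed = (1.0 - self.objective) * self.total
        failed = self.total - self.successes
        return 1.0 - failed / allowed if allowed else 1.0

t = SloTracker(objective=0.99)
for ok in [True] * 995 + [False] * 5:
    t.record(ok)
print(t.sli())                               # 0.995
print(round(t.error_budget_remaining(), 3))  # 0.5 -> half the budget spent
```

A tracker like this is enough to start the trade-off conversations Sergey describes: if the budget is half spent mid-quarter, a risky launch may wait.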
The bottom line: complex infrastructure can come later. SLOs give you clarity, alignment, and focus, no matter the scale.
