In recent years, many engineering organizations have embraced DevOps as a means to improve the software development lifecycle and increase software quality. One of the key pillars of the DevOps philosophy is to “measure everything.”
Site Reliability Engineering (SRE) puts DevOps principles into practice through concrete processes and actions. One of the mandates of SRE is establishing Service Level Objectives (SLOs), which helps satisfy the DevOps pillar of measuring everything.
SLOs are also a key component of an organization’s Service Level Agreements (SLAs) with its customers. As such, carefully chosen and established SLOs are critical to meeting these service levels and maintaining a good user experience.
So what constitutes a well-chosen SLO? First, let’s dig a little deeper into SLOs themselves. In order to do this, we must first understand Service Level Indicators (SLIs).
SLIs vs. SLOs
Service Level Indicators (SLIs) are quantifiable measurements of some aspect of a service’s reliability. Examples of commonly used SLIs are:
- Availability: the percentage of a given period of time during which a service is usable
- Application Latency: the time it takes for the application to process a request and respond
- Error Rate: the percentage of requests to the application that resulted in an error over a given period of time
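To make these definitions concrete, here is a minimal Python sketch of how the three SLIs above might be computed from raw measurements. The `Request` record and the function names are hypothetical, chosen only for illustration; a real service would normally pull these numbers from its monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    latency_ms: float  # time taken to process the request and respond
    is_error: bool     # True if the request resulted in an error


def availability_pct(up_seconds: float, total_seconds: float) -> float:
    """Percentage of the period during which the service was usable."""
    return 100.0 * up_seconds / total_seconds


def average_latency_ms(requests: List[Request]) -> float:
    """Mean time to process a request and respond, in milliseconds."""
    if not requests:
        return 0.0
    return sum(r.latency_ms for r in requests) / len(requests)


def error_rate_pct(requests: List[Request]) -> float:
    """Percentage of requests that resulted in an error."""
    if not requests:
        return 0.0
    return 100.0 * sum(r.is_error for r in requests) / len(requests)


# Four sample requests, one of which failed.
sample = [
    Request(latency_ms=120.0, is_error=False),
    Request(latency_ms=300.0, is_error=True),
    Request(latency_ms=180.0, is_error=False),
    Request(latency_ms=200.0, is_error=False),
]

print(average_latency_ms(sample))  # 200.0
print(error_rate_pct(sample))      # 25.0
```

Each function returns a single number for a single period of time, which is what makes an SLI something you can later compare against a target.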
At a basic level, an SLO is simply an objective or goal that’s considered a desirable level of service. SLOs help to create a common understanding of application or service performance expectations. Properly chosen SLOs provide clear goals for teams to work towards and also help to set end user expectations.
An SLO not only sets an acceptable value, but also a specific period of time over which that value needs to be met. Common time periods could be one hour, one day, one week, one month, or one year.
Reasonable SLOs for the above SLIs might be established as follows:
- Availability: 99.5% over a time period of 1 month
- Application Latency: <250ms average over a time period of 1 hour
- Error Rate: <1% over a time period of 10 minutes
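As a rough illustration, the three example SLOs above could be written down as data and checked against measured SLI values. This is only a sketch: the `SLO` class, the metric names, and the sample measurements are hypothetical. It also assumes that "1 month" means 30 days and that the availability target means "at least 99.5%".

```python
import operator
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO: a target value plus the period it applies to."""
    sli_name: str
    target: float
    window: timedelta
    comparator: Callable[[float, float], bool]  # measured vs. target

    def is_met(self, measured: float) -> bool:
        return self.comparator(measured, self.target)


# The three example SLOs from the list above.
SLOS = [
    SLO("availability_pct", 99.5, timedelta(days=30), operator.ge),  # assumes 1 month = 30 days
    SLO("avg_latency_ms", 250.0, timedelta(hours=1), operator.lt),
    SLO("error_rate_pct", 1.0, timedelta(minutes=10), operator.lt),
]


def evaluate(slos: List[SLO], measurements: Dict[str, float]) -> Dict[str, bool]:
    """Map each SLI name to whether its SLO was met for the measured value."""
    return {slo.sli_name: slo.is_met(measurements[slo.sli_name]) for slo in slos}


# Made-up measurements, each taken over the matching SLO window.
measurements = {
    "availability_pct": 99.7,
    "avg_latency_ms": 310.0,
    "error_rate_pct": 0.4,
}

print(evaluate(SLOS, measurements))
# {'availability_pct': True, 'avg_latency_ms': False, 'error_rate_pct': True}
```

Keeping the time period next to the target reflects the point made above: an SLO without a period of time is incomplete.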
There are also other key factors to consider when setting an SLO. To start with, you should make them meaningful. The following are some questions that you could ask when working to establish SLOs.
How do SLOs contribute to better availability, quality, or service? Simply measuring something and picking an arbitrary number as an SLO doesn’t set a reasonable goal for a team to strive towards.
How do SLOs contribute to SLAs with customers? Does a given SLO provide a meaningful goal that contributes to meeting your SLAs, or is it just another target to meet that doesn’t impact overall customer satisfaction?
Are the SLOs realistic? In other words, can you actually achieve them or were these objectives chosen in a vacuum without understanding how you would get there? One commonly chosen SLO that might not be considered realistic is 100% availability. If an organization follows the principles of DevOps, it will know that failure is normal and a goal of 100% might not be achievable in reality.
Are there available resources to dedicate to achieving a given SLO? These might be human or financial resources, or simply the time to work towards them. If you don’t have the resources, how can you get them? Should you reconsider the value you initially assigned to an SLO?
Is the SLO something that can actually be controlled? For example, it is often difficult to predict incoming traffic levels or throughput in an application, but you could control other factors such as application latency by scaling infrastructure or improving code quality.
Choosing your SLOs carefully is also important when it comes to error budgets and dealing with alert fatigue. According to Google, “an error budget is the amount of error that your service can accumulate over a certain period of time before your users start being unhappy.”
Typically, alerts will be set at specific values to notify a service owner if a service becomes at risk of violating its error budget. An overly aggressive SLO might not leave enough error budget for reasonable operation of the service.
This can cause excessive alerts, leading to alert fatigue and engineer burnout. If engineers burn out, either they will leave or your service levels will suffer. Neither outcome is desirable.
Some failure should be expected. Ask yourself whether a given failure is important enough that someone needs to be woken up at 3 a.m. by an alert, and then decide on your SLO.
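The arithmetic behind an error budget can be sketched in a few lines of Python. The example below assumes an availability SLO measured in minutes of downtime over a window of whole days. The function names and the 25% paging threshold are hypothetical choices for illustration, not a recommended policy.

```python
def error_budget_minutes(slo_pct: float, window_days: int) -> float:
    """Minutes of downtime the SLO allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_pct) / 100.0


def budget_remaining_fraction(slo_pct: float, window_days: int,
                              downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent, clamped to [0, 1]."""
    budget = error_budget_minutes(slo_pct, window_days)
    if budget <= 0:
        # A 100% SLO leaves no budget at all: any downtime is a violation.
        return 0.0
    return min(1.0, max(0.0, 1.0 - downtime_minutes / budget))


def should_page(slo_pct: float, window_days: int, downtime_minutes: float,
                threshold: float = 0.25) -> bool:
    """Page the service owner once less than `threshold` of the budget is left.

    The 0.25 default is an arbitrary example value.
    """
    return budget_remaining_fraction(slo_pct, window_days, downtime_minutes) < threshold


print(error_budget_minutes(99.5, 30))   # 216.0
print(error_budget_minutes(100.0, 30))  # 0.0
print(should_page(99.5, 30, downtime_minutes=60))   # False
print(should_page(99.5, 30, downtime_minutes=180))  # True
```

With a 99.5% target over 30 days, the service can be down for 216 minutes before the SLO is violated. Tightening the target shrinks that budget quickly, and a 100% target leaves none, which is why an overly aggressive SLO tends to produce more alerts.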
Know Your Users
Here are a few final thoughts on establishing meaningful SLOs:
When setting SLOs, try to put yourself in your customers’ position and “define service-level objectives like an end-user.” Consider the use case of your application at a very high level.
Who are your users? Internal service teams? Consumers? Small business users? Large enterprises?
What will your users experience? Do you have a simple three-tier web app, a video streaming service, or a large-scale data processing application that ingests terabytes of data per day? What are your customers’ expectations for the end user experience? How might this affect your choice of SLIs and the settings for your SLOs?
How many users will there be? A few dozen? A few thousand? A few million? Will this matter when choosing SLOs? A low-traffic site may have very different SLOs than a high-traffic one.
Don’t be afraid to update your SLOs as your requirements and experiences change. What may be an acceptable SLO in the beginning might not be as suitable in the long term. Create a feedback loop and review your SLOs periodically to see if they still contribute to the overall performance of your teams and services.
All of these questions matter when you’re choosing the appropriate SLOs for your teams. Once you know the answers, you will be better prepared to deliver an optimal end-user experience for your customers.