SRE Culture: How to Put Reliability First



Unreliable services can affect businesses in myriad ways, from slowed development velocity, to unhappy users, to impacted revenue streams. Reliability often takes a backseat to feature releases and other business initiatives that drive development requirements. This post will discuss key elements of SRE practice that you can use to instill a reliability-first culture in your organization, while also meeting business requirements and keeping your users happy.

Why Put Reliability First?

In many companies, product features are often prioritized over other aspects of software development such as architecture, infrastructure, security, or reliability. In an effort to improve reliability, engineering organizations often emphasize tooling and automation, but leave out other factors that can help alleviate some of the pain felt by development teams and end users.

Tooling and automation alone aren’t enough. Organizations need to create a culture of reliability and awareness of risk alongside the automation. DevOps is one approach to solving some of these issues, but DevOps is more of a philosophy than an instruction manual. This is where Site Reliability Engineering, or SRE, comes into play.

Site Reliability Engineering

The primary objective of SRE is to improve system reliability by implementing each of the five DevOps “pillars of success.” These are:

  1. Reducing organizational silos
  2. Accepting failure as normal
  3. Implementing gradual change
  4. Leveraging tooling and automation
  5. Measuring everything

By taking an SRE-oriented approach to these pillars, a reliability-first culture can be instilled into all teams in an organization, not just engineering. Let’s examine each one in detail and turn them into values:

1. Shared Ownership

Production environments are becoming increasingly large and complex. As they grow, it gets harder for any one person or team to understand all of the nuances and interactions between services in a given environment. This is where shared ownership comes into play.

All teams need to be involved in, and have a stake in, the reliable operation of the company. This includes product, marketing, operations, development, and upper management. Collaboration is key to breaking down the traditional barriers among different units of the business. A sense of shared ownership helps increase participation between teams, and also increases understanding of the pain points each is experiencing.

For example, if the product team responds to service incidents alongside operations and development, then it is more likely that fixing the root cause of these incidents will be prioritized into the product roadmap. In another example, if developers own and use the same tooling the operations team uses, it stands to reason that there will be a more collaborative effort to add needed features to the tooling, or reduce technical debt where there are deficiencies.

2. Risk Awareness

Hand-in-hand with a reliability-first culture is a culture of risk awareness: rather than trying to avoid risk altogether, assume that systems are inherently unreliable, and embrace risk. Accept that failure is normal and that it can be measured, quantified, evaluated, and addressed.

Along the lines of shared ownership, communicating and educating everyone about risks and their severity allows teams to align along a common strategy to address these risks and improve reliability without detracting from the product or engineering roadmaps. By having this common strategy, reliability can be prioritized alongside feature releases.

Two key practices that help encourage communication and education are blameless postmortems and a risk assessment. First, conducting blameless postmortems helps to identify risks and their causes, as well as any mitigations that could be implemented to improve reliability. Second, by using a risk assessment, a common language around risk can be established so that all stakeholders understand the level of risk and cost of mitigation compared to the severity of a given risk.
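As a rough illustration of what a shared language around risk could look like, here is a minimal sketch that ranks risks by expected minutes of impact per year. The `Risk` fields, the scoring formula, and the sample numbers are all assumptions made up for this example, not a prescribed method; your organization may weigh severity and mitigation cost quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """A hypothetical risk entry; every field here is illustrative."""
    name: str
    incidents_per_year: float         # estimated frequency
    minutes_lost_per_incident: float  # estimated user-facing impact
    mitigation_cost_days: float       # estimated engineering effort to mitigate

    @property
    def expected_minutes_lost(self) -> float:
        # Expected yearly impact = frequency x impact per incident.
        return self.incidents_per_year * self.minutes_lost_per_incident


def rank_risks(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest expected yearly impact."""
    return sorted(risks, key=lambda r: r.expected_minutes_lost, reverse=True)


# Made-up sample data, for illustration only.
risks = [
    Risk("expired certificate", 0.5, 120, 1),
    Risk("bad config push", 4, 30, 5),
    Risk("database failover", 1, 90, 10),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: ~{risk.expected_minutes_lost:.0f} min/year, "
          f"mitigation ~{risk.mitigation_cost_days:.0f} days")
```

Putting the expected impact next to the mitigation cost gives product, engineering, and management one list to argue over, instead of each team keeping its own private sense of what is risky.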

3. Gradual Change

Implementing gradual change can also help increase reliability. Smaller changes decrease the risk that something big will go wrong when a change is rolled out. Rather than taking an all-or-nothing waterfall approach, iterative development (implementing many small changes over time) helps to increase product development velocity. And if something does go wrong, small changes are generally easier to roll back.

With gradual change, reliability can be built into releases more easily since the task of assessing the risk of a change becomes less onerous. In terms of code reviews alone, it is much easier to review a twenty-line code change than it is to review a five-thousand-line code change.

4. Automation and Tooling

It goes without saying that most people want their jobs to be easier. Automation reduces toil and makes life easier and systems more reliable.

All aspects of an organization are moving towards the idea that everything exists as code. Applications are defined in code. System configuration is defined in code. Infrastructure is defined in code. Even incident response and site reliability are becoming definable as code.

When business systems and processes are defined as code, everything becomes more consistent, repeatable, and reliable. That code can be reviewed and a history of changes retained. Further tooling on top of that can be defined as code to proactively identify potential risks or bad coding practices before code is deployed to production.

Tooling should be shared not just between operations and development teams, but also with other business users such as product, marketing, and finance. Shared tooling also improves consistency and reduces toil since the organization isn’t using multiple tools to achieve the same objective.

Obviously, the marketing team might not be deploying code to production using the CI/CD platform that developers use. But that same marketing team could be pulling reports using the same data extraction and reporting tools that developers use.

This reduces overhead in requests to engineering from non-engineering teams for data they need in order to make business decisions. Training all teams on the same tooling encourages shared ownership of that tooling and helps to ensure it is usable for a variety of users, not just development or operations.

5. Measuring Reliability

DevOps dictates that everything must be measured. But how can an organization proactively measure and improve reliability? First, don’t sweat the small stuff. Instead, focus on the big picture.

Today’s cloud environments can span continents and be made up of thousands of application instances. Just because one application instance fails doesn’t mean a human needs to be alerted to investigate.

No one wants to wake up in the middle of the night for an issue that isn’t important in the larger scheme of things. Excessive alerts can create alert fatigue and burn out engineers.

Through the use of Service Level Indicators and Service Level Objectives, organizations can measure and set objectives for reliability. Using an error budget, teams will know whether or not they need to respond to a given incident and correct the problem. If a problem doesn’t risk exceeding the error budget, it may not be necessary to alert teams to respond to that problem.

A good example of this might be failed requests. It might be acceptable for a certain percentage of requests to the service to fail, as long as that percentage stays within the error budget and doesn’t have a large impact on end users. Establishing what is acceptable for the metrics that are measured is key to setting these limits. An organization should avoid stating a goal (such as 100% uptime) without prescribing a way to achieve that goal.
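To make the failed-requests example concrete, here is a minimal sketch of an error budget check. The `ErrorBudget` class, its field names, the 99.9% target, and the paging threshold are assumptions chosen for illustration; real SLO tooling usually evaluates this over rolling time windows rather than a single pair of counters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical error budget over one measurement window."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that failed in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # A 100% target leaves no budget, so any failure exhausts it.
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures

    def should_page(self, threshold: float = 1.0) -> bool:
        # Page a human only once the chosen share of the budget is used up.
        return self.consumed_fraction >= threshold


# Made-up numbers: a 99.9% SLO over one million requests.
quiet_week = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
bad_week = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)

print(quiet_week.should_page())  # about a quarter of the budget used
print(bad_week.should_page())    # budget exceeded
```

With these numbers the SLO allows roughly 1,000 failed requests, so 250 failures do not justify waking anyone up, while 1,200 do. The zero-budget branch also shows why a 100% target is impractical: a single failed request exhausts it.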

Lastly, let’s add a sixth pillar, one which applies so universally across organizations that it’s often overlooked.

6. Education

Always work to strengthen training and education. Documentation should be complete and accessible. Knowledge-sharing sessions between business units encourage communication and collaboration. If all the teams in an organization undergo the same training, and use the SRE practices above, common ground can be established in which a reliability-first culture can flourish and a business can thrive.

Steve Tidwell has been working in the tech industry for over two decades, and has done everything from end-user support to scaling a global data ingestion and analysis platform to handle data analysis for some of the largest streaming events on the Web. He has worked for a number of companies helping to improve their operations and automate their infrastructure. At the moment, Steve is plotting to take over the world with cloud-based technologies from his corner of the office.

