It’s the year 2021, and Site Reliability Engineering (SRE) has become one of the fastest-growing and hottest professions in the tech industry. With all of the attention on SRE, many software developers and operations engineers are now interested in moving into this burgeoning field.
There is an enormous amount of information about SRE on the Internet – some helpful, some not so much. It can be hard to know where to begin. In this post, we will give you an ultimate guide to SRE resources – all in one place.
Below, we’ve briefly summarized these resources and categorized them in a way that can get you started down the road to learning about SRE without having to do all the digging yourself. Once you’ve had a chance to review some of these resources, you should have a better idea of which direction you want to take.
As you read along, you might notice a number of references to DevOps. This is because both topics share such a close relationship that it’s almost impossible not to mention DevOps when discussing core site reliability subject matter.
You should also note that there may be a cost associated with using some of the courses or services referenced in this post. We’ve minimized these costs where possible by finding free versions or trials so that you don’t blow your budget just to try them. In some cases, you may find that the paid versions are well worth the cost.
Online Training and Education
It’s usually best to start at the beginning, so we’ll start by introducing a few online training resources that you can leverage to get up to speed on the basics of SRE.
The Linux Foundation and EdX
The Linux Foundation and EdX offer an Introduction to DevOps and Site Reliability Engineering course. Targeted primarily at managers or those who are looking for a high-level view of DevOps and SRE topics, this course covers cloud computing, containers, Kubernetes, Infrastructure as Code, CI/CD, and observability.
This is part of a larger paid certificate program covering DevOps Practices and Tools, which we won’t cover here. The Introduction to DevOps and Site Reliability Engineering course is free to audit and takes about 10 to 12 hours total. The graded portion, which includes a certificate, costs $169.
Google and Coursera
Coursera has teamed up with Google to offer a course on Site Reliability Engineering: Measuring and Managing Reliability. It covers topics including an introduction to SRE, measuring reliability and error budgets, Service Level Indicators (SLIs), Service Level Objectives (SLOs), quantifying risks, and managing reliability in an organization.
It takes approximately 12 hours to complete the course, and you can audit it for free. If you use the free version, you will have access to all materials except graded items. If you prefer a more formal setting or wish to earn a certificate, the cost is $49.
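To make the SLI, SLO, and error budget terms from the course outline a little more concrete, here is a minimal sketch of the arithmetic usually associated with them. It is not taken from the course itself; the function names and the request counts are hypothetical, and it assumes a simple request-based SLI (good events divided by total events).

```python
def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator: fraction of events that were 'good'."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the share of events the SLO allows to fail,
    i.e. (1 - slo) * total_events.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over one million requests,
    # of which 500 failed.
    print(f"SLI: {sli(999_500, 1_000_000):.4%}")
    print(f"Budget left: {error_budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

With these made-up numbers, a 99.9% SLO allows about 1,000 failed requests, so 500 failures leave roughly half of the budget unspent.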
Google, the pioneer in the SRE movement, has invested a great deal of resources to promote their way of implementing and running SRE. Below, we have linked articles, books, videos, and web pages that provide an introduction to SRE as well as some in-depth resources. All of the information presented here is completely free.
- Google’s Site Reliability Engineering landing page. This is Google’s portal into everything SRE at Google.
- Videos: class SRE implements DevOps. Before you dive into the rest of Google’s content, take a look at this playlist of short videos (approximately 1 hour of content in total).
- Video: DevOps Vs. SRE: Competing Standards or Friends? This is a talk given by Seth Vargo, a Senior Staff Engineer at Google Cloud, at Cloud Next ’19. In addition, these Google Cloud Tech videos cover almost everything cloud-related, and they’re worth a look.
- Book: Site Reliability Engineering. Frequently referred to as “The SRE Book,” this is Google’s treatise on how they run their production systems.
- Book: The Site Reliability Workbook. Known as “The SRE Workbook,” this is Google’s practical guide to implementing SRE in your organization.
- Book: Building Secure & Reliable Systems. This is a 500+ page deep dive into reliable system design, implementation, and maintenance.
- Multimedia Collection: SRE Foundations and Principles. This is a collection of articles, videos, and books on the basics of SRE.
- Multimedia Collection: SRE Practices and Processes. This is another collection of articles, videos, and books on practical implementations of SRE.
- Collection: SRE Management. This is a smaller collection of articles focused primarily on the organizational side of SRE.
Microsoft’s site reliability engineering documentation covers similar topics to those listed above, and they actually refer to Google’s SRE books. We won’t link everything that’s covered on their landing page because they do an excellent job of aggregating it all in one place themselves. In short, they cover topics including:
- Introductions to SRE
- Improving Reliability Through Modern Operations Practices
- SRE Online Courses
- SRE Resources
- SRE on Azure
- SRE Talks from Microsoft
If you’re building or developing on Azure, this is the site for you.
LinkedIn has put together an amazing collection of training materials for their School of SRE. The School of SRE teaches the foundational skills that someone who wants to step into an SRE role can use to jump start their career.
This free course is hosted on GitHub and is actively updated. It covers a large swath of topics including:
- Fundamentals: Linux Basics, Git, and Linux Networking
- Python, Web, and Flask
- Data: Relational Databases, NoSQL, and Big Data
- Systems Design: Scalability, Availability, and Fault Tolerance
- Metrics and Monitoring: CLI Tools, Third-Party Monitoring, Alerting, Best Practices, and Observability
- Security: Fundamentals, Network Security, Threats, Attacks & Defenses, and Secure Coding
Before we go much further, there is one resource that anyone in the tech industry should know about. Most people are familiar with O’Reilly Media’s “animal books,” which cover all sorts of tech-related topics. You might not know that all of O’Reilly’s books are available online.
In addition to that, they also have several other resources available on their site, including books and videos (by O’Reilly and many other publishers), curated playlists, live events and online training, certification and certification prep resources, and far more than we can cover in this post.
Their service comes with a cost, but it’s well worth it just for the tech books and videos alone. All materials are constantly updated, too, so you don’t have to worry about having outdated materials or a print book that you’ll need to repurchase every time a new edition is released. Besides that, it’s better for the environment.
O’Reilly offers a 7-day free trial. After that, the cost will vary depending on which plan you choose. It can run anywhere from $49 per month to $499 paid annually. One tip we can offer is that they occasionally run specials or offer event-related coupons, so keep an eye out.
Blogs and Websites
Awesome Site Reliability Engineering
A curated list of “awesome” SRE and other engineering resources, Awesome Site Reliability Engineering is an enormous list of links to almost everything related to SRE. Its resources include everything from books and education, to culture and hiring, to postmortems and capacity planning. The list is hosted on GitHub, and you can even contribute to it if you are so inclined.
SRE University is another GitHub-hosted project that aims to provide a complete study guide for becoming an SRE.
Alice Goldfuss is a Senior SRE, Infrastructure, and Systems Engineer who has an excellent post on How to Get Into SRE. It covers some of the nitty-gritty details of SRE that you won’t often find in more formal books and online resources. In particular, she has an exhaustive section of resources at the bottom of the post that’s filled with links that are worth a look.
Liz Fong-Jones
Liz Fong-Jones is a developer advocate and SRE evangelist who has authored or co-authored many fine books and articles, and she has also given talks and presentations on SRE and other topics online and at several events. Her website is well worth exploring for information on all of the above topics, and she also has an O’Reilly playlist on SRE.
Many people are familiar with Reddit, a news and social media website that covers everything from politics and investing to cooking and cat pictures. Did you know that they have a subreddit on SRE? Posts go up almost daily and cover many different SRE topics and related discussions. See their DevOps subreddit for even more posts and discussions.
SRE Weekly bills itself as “a newsletter devoted to everything related to keeping a site or service available as consistently as possible.” They have weekly posts on SRE topics, and you can subscribe by email or RSS feed.
The StackPulse Blog is another great place to read up-to-date articles on SRE, DevOps, and related best practices, as well as other technical articles on subjects like Kubernetes and implementing incident response as code.
An ultimate guide to SRE resources wouldn’t be complete without links to SRE-related tooling. Here are a few of the most important:
DevOps.com and DevOpsTV
DevOps.com is a website devoted to all things DevOps. They also have a YouTube channel called DevOpsTV, which has a great video on Choosing the Right Tools When Building Your SRE Toolchain.
Docs as Code
The Docs as Code philosophy dovetails nicely with the DevOps and SRE notion that everything should be represented as code. Here are a couple of videos on the subject:
- Linux.conf.au has a YouTube channel with a video entitled “A Practical Introduction to Docs-As-Code.” Another presentation that’s worth a look is called Site Reliability Engineering at Dropbox.
- Next Day Video has their own YouTube video called Building Docs like Code: Continuous Integration for Documentation.
USENIX, which has been around since 1975, is “a community of engineers, system administrators, scientists, and technicians working on the cutting edge of the computing world.” They’ve got a link to an interesting SREcon presentation on Deploying SRE Training Best Practices to Production.
Netflix has been a pioneer in the field of Chaos Engineering. The Netflix Technology Blog has a wealth of information on how Netflix maintains their platform and how they try to continually improve their operations and reliability. Their YouTube channel on Reliability, Performance & Cloud Infrastructure has some great videos on SRE-related topics.
Awesome SRE Tools
While there are other resources out there that we could discuss, this guide should give you a broad range of SRE-related resources that will help you learn SRE and improve your skills. If you’re looking to implement an SRE program in your organization, we hope this guide gives you excellent ideas and a great start.