In the current IT market, one of the hottest job roles is the Site Reliability Engineer (SRE). According to LinkedIn in January 2019, SRE was the second most promising job in the USA. These statistics were cited:
- Median Base Salary: $200,000
- Job Openings (YoY Growth): 1,400+ (72%)
- Career Advancement Score (out of 10): 9
In this post, we will look at what an SRE does in their daily work, a little history of Site Reliability Engineering, its foundations, and how you can become an SRE.
What Does an SRE Do?
As stated in the earlier StackPulse blog post ‘How to Implement a DevOps Culture’, DevOps and Site Reliability Engineering are different disciplines, but they are not competitors; they complement each other. That blog post explained the differences between the two. Here we will focus strictly on the characteristics of the SRE role.
Site Reliability Engineering is the application of software engineering to operational problems. The word ‘Reliability’ points to the particular role an SRE has in an organization and in the Software Development Life Cycle (SDLC). SREs teach application developers how to build reliable services. In addition, they ensure that the organization’s computer systems run correctly, 24/7. Security, stability and scalability are very important here, because the business wants reliable services.
Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics.
An SRE is, therefore, a vital role within an organization. Typical SRE activities include:
- Develop and manage scalable, secure and stable systems
- Conduct incident analysis
- Analyze performance and create improvement plans
- Monitor system efficiency
- Manage risks
- Automate manual tasks within the SDLC
- Build automated service tools, logs and test environments to ease the engineers’ workload
- Implement new features
- Select infrastructure tools
- Adapt environments to increasing or decreasing numbers of users
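To make the last activity a little more concrete, here is a minimal Python sketch of a scaling decision. It is only an illustration: the function name `replicas_needed`, the idea that each replica serves a fixed number of users, and the numbers used are all assumptions made for this example, not something taken from a specific tool.

```python
import math


def replicas_needed(current_users: int, users_per_replica: int, min_replicas: int = 1) -> int:
    """Return how many service replicas are needed for the current number of users.

    Hypothetical helper: assumes every replica can serve `users_per_replica`
    users and that at least `min_replicas` replicas should always be running.
    """
    if users_per_replica <= 0:
        raise ValueError("users_per_replica must be positive")
    if current_users < 0:
        raise ValueError("current_users must not be negative")
    # Round up so that a partially filled replica still counts as a full one.
    wanted = math.ceil(current_users / users_per_replica)
    return max(min_replicas, wanted)


if __name__ == "__main__":
    # Assumed capacity: 1,000 users per replica.
    for users in (0, 800, 2500, 10000):
        print(users, "users ->", replicas_needed(users, 1000), "replica(s)")
```

In practice this kind of decision is usually delegated to the autoscaling features of the platform in use; the sketch only shows the reasoning an SRE automates.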
Have a look at “The Ultimate Guide to SRE Acronyms” if you want to learn how to “talk SRE.”
A Little History About SREs
The term ‘Site Reliability Engineer’ was coined at Google by Ben Treynor Sloss, VP of engineering, in 2003. He was hired by Google to manage a team of software developers running a production environment. Continuous development, integration and operations demanded a new way of thinking. That’s how Site Reliability Engineering came to be.
Ben Treynor Sloss explained the core of the SRE role in this interview:
“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
Now that we know the origin of SRE, we can ask: what is this role built on?
What Are the Foundations of SRE?
Site Reliability Engineering is based on the following:
- Scalability – The system can handle a growing amount of work when resources are added to it
- Availability – The system works as required
- Incident Response – Handling incidents that affect the system
- Automation – Automating the Software Development Life Cycle workflow
An SRE balances these fundamental elements in the daily work of the organization. To do this, an SRE needs a toolbox.
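As a rough illustration of the availability foundation listed above, the following Python sketch measures availability as the share of requests that succeeded and compares it with a target. Measuring availability this way is an assumption made for the example (it is one common approach, not the only one), and the names `availability` and `meets_target` as well as the 99.9% target are hypothetical.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded (between 0.0 and 1.0)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests


def meets_target(successful_requests: int, total_requests: int, target: float = 0.999) -> bool:
    """Check the measured availability against a target (assumed here: 99.9%)."""
    return availability(successful_requests, total_requests) >= target


if __name__ == "__main__":
    # Hypothetical numbers for one day of traffic.
    ok, total = 999_500, 1_000_000
    print(f"availability: {availability(ok, total):.4%}")
    print("target met:", meets_target(ok, total))
```

A check like this can feed both monitoring and incident response: when the target is missed, that is a signal to investigate.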
What Is in a Typical SRE Toolbox?
A Site Reliability Engineer works with the following software, languages, and tools:
- Programming languages: Ruby, Python, C++, Bash, Java
- Cloud computing services: AWS, Azure
- Infrastructure tooling: Terraform, CloudFormation, Ansible
- Container tooling: Kubernetes, Docker, Mesos
As you can see, an SRE must have both development and operations skills to automate the manual tasks of a development team.
How to Become an SRE
Currently, SREs are in high demand, but it is not an easy job. As stated earlier, an SRE needs development and operations skills: a Pi-shaped skill set. With this skill set, an SRE is proficient in both trades, not just in one or the other, which would be a T-shaped skill set. This makes SRE a very demanding and practical career. It can be beneficial to have a solid knowledge base to start from; check out the Top 10 SRE Books to Read in 2021. However, it can also be learned on the job with the right motivation and endurance. Most SREs have a background or education in software development or in systems and network engineering.
At Google, SREs spend at least 50% of their daily work on development. An SRE is still a software developer; an engineer doing operations.
Do you want to become an SRE? Big tech companies, Google included, want you because they know SREs are very hard to find. Is this because a good SRE ultimately ‘automates their way out of a job’?
We hope that this article has shown you what a Site Reliability Engineer does, why the role is in high demand and how you can become one. For more information, you can take a look at Google’s take on SRE as well as this excellent series of videos that they posted on YouTube.