Site Reliability Engineering (SRE) has become a hot topic over the last few years. It seems like everyone has been talking about it, but if you ask different people or organizations what SRE means to them, you will likely receive different and highly nuanced answers.
One thing that is consistent across all of these definitions, though, is the idea that there is a core set of best practices that engineering organizations can adopt to achieve more reliable systems. This is especially applicable to Kubernetes, which holds a great deal of promise for enabling software developers and operations teams to rapidly develop, deploy, and scale software applications without the overhead of traditional IT systems and methodologies.
In this article, we will discuss 5 key best practices for SRE that you can start using with Kubernetes right away. This list is far from exhaustive, but it should get you up and running. If you’d like to take a deeper dive into these best practices in another article, let us know.
Best Practice #1: Choose the Right Tool for the Job
In a recent interview with Wojtek Cichon of The New Stack, Seth Vargo, a Senior Staff Engineer at Google, discussed the need for companies to recognize when they need SREs – and when they don’t. Vargo advised organizations to consider their staffing and business needs in order to make an informed decision about implementing Kubernetes for their use case. Here are some key takeaways from his interview:
- Kubernetes is complex, and maintaining it requires specialized knowledge. Is your organization prepared to dedicate a full-time engineer or SRE to implementing and maintaining one or more Kubernetes clusters?
- Kubernetes has many advantages, such as scalability, but these advantages come with attached costs – real costs, such as cloud and staffing, but also soft costs that are harder to quantify. Organizations must change and update their development and operations processes and practices to accommodate a cloud-native paradigm like Kubernetes, and that work has a cost of its own.
- Along these lines, you will also need to consider new tooling that is designed to work with Kubernetes. For example, your old CI/CD pipeline that once worked for deploying some EC2 instances in an autoscaling group with a load balancer (and maybe a database) will probably need drastic alterations.
- This means using containers as your new paradigm for development and considering cloud-native CI/CD solutions designed for Kubernetes (such as Argo CD, Tekton, or Jenkins X). Don’t forget that your monitoring solution may also need to be updated to work with Kubernetes.
Are your dev and ops teams ready for that? If so, you can move on to the next best practice.
Best Practice #2: Use Managed or Hosted Services Whenever Possible
By now, you’ve considered all of the implications discussed above and decided that it makes sense to retool and move to Kubernetes.
Reducing toil is a key tenet of SRE. You should minimize the effort required to deal with the implementation and day-to-day management of your new infrastructure and processes. Here are a few things that can help you accomplish this:
- Whenever possible, don’t build it yourself. Sure, it might be fun to learn and configure all the latest and greatest new tech, but does that help you accomplish your business goals efficiently?
- Consider using a hosted Kubernetes service such as Azure Kubernetes Service (AKS), Amazon’s Elastic Kubernetes Service (EKS), or Google Kubernetes Engine (GKE). Hosted Kubernetes services abstract away many of the complexities of Kubernetes, like managing the control plane and ensuring the implementation of best practices. This lets you and your developers focus on the important things, like developing, testing, and deploying your applications so they get to market quicker.
- The same also applies to other services like monitoring, alerting, and reliability. Don’t roll your own if you can leverage SaaS-based services such as New Relic, Opsgenie, and StackPulse. This way you won’t need to build the infrastructure yourself, which lets you focus on analyzing data and making informed decisions about how to make your applications more reliable.
Best Practice #3: Configure Availability and Recovery Mechanisms
High availability (HA), failover, backups, and recovery mechanisms aren’t new concepts, but they are critical to ensuring a highly reliable system. Even with managed services, you need to deploy Kubernetes in a highly available configuration across zones and regions. You should consider the following:
- Are there multiple masters? Are they deployed into more than just a single AZ or region?
- Do you have workers deployed in a similar fashion?
- Is etcd configured in HA mode?
- Have you reviewed the documentation to confirm that each of these components is configured correctly? (A quick zone-spread check is sketched after this list.)
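If you want a quick way to see how your nodes are spread across zones, a small script can surface it. This is a minimal sketch, assuming the official kubernetes Python client and a kubeconfig that can reach the cluster; the topology.kubernetes.io/zone label is standard, though older clusters may carry a legacy zone label instead.

```python
# Minimal sketch: count nodes per availability zone to spot single-zone clusters.
# Assumes the official `kubernetes` Python client and a working kubeconfig.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
nodes = client.CoreV1Api().list_node().items

zones = Counter()
for node in nodes:
    labels = node.metadata.labels or {}
    zone = labels.get("topology.kubernetes.io/zone", "unknown")
    roles = [k.split("/", 1)[1] for k in labels if k.startswith("node-role.kubernetes.io/")]
    zones[zone] += 1
    print(f"{node.metadata.name}: zone={zone}, roles={roles or ['worker']}")

print("Nodes per zone:", dict(zones))
if len(zones) < 2:
    print("WARNING: all nodes appear to be in a single zone")
```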
What about backups (especially for the stateful parts of Kubernetes, like etcd)? Do you know how you will recover if there is a failure?
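If you run your own control plane, a periodic etcd snapshot is the baseline backup. The sketch below simply wraps etcdctl from Python; the endpoint and certificate paths are typical kubeadm defaults and are assumptions here, and on hosted services the provider generally manages etcd backups for you.

```python
# Rough sketch: take an etcd snapshot with etcdctl and name it by timestamp.
# Endpoint and certificate paths are typical kubeadm defaults -- adjust for your cluster.
import datetime
import os
import subprocess

snapshot = f"/var/backups/etcd-{datetime.datetime.utcnow():%Y%m%d-%H%M%S}.db"
subprocess.run(
    [
        "etcdctl", "snapshot", "save", snapshot,
        "--endpoints=https://127.0.0.1:2379",
        "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
        "--cert=/etc/kubernetes/pki/etcd/server.crt",
        "--key=/etc/kubernetes/pki/etcd/server.key",
    ],
    check=True,
    env={**os.environ, "ETCDCTL_API": "3"},
)
print(f"etcd snapshot written to {snapshot}")
# Restoring is the other half: exercise `etcdctl snapshot restore` regularly, not just the backup.
```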
Finally, don’t forget other failover mechanisms like DNS. Is your DNS pointing statically to a single endpoint in a single region? Do you have health checks in place with failover routes to healthy regions in case one region fails?
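As an illustration of the health checks that back a DNS failover policy, the sketch below probes a health endpoint in each region and reports which ones are serving. The hostnames and /healthz path are hypothetical; in practice your DNS provider (for example, Route 53 health checks with failover routing) runs these probes and switches records for you.

```python
# Minimal sketch: probe a (hypothetical) health endpoint in each region and
# report which regions are serving. A real failover would be driven by your
# DNS provider's health checks, not an ad hoc script.
import urllib.error
import urllib.request

REGIONAL_ENDPOINTS = {
    "us-east-1": "https://us-east-1.app.example.com/healthz",
    "eu-west-1": "https://eu-west-1.app.example.com/healthz",
}

for region, url in REGIONAL_ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            healthy = 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        healthy = False
    print(f"{region}: {'healthy' if healthy else 'UNHEALTHY -- candidate for failover'}")
```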
Best Practice #4: Ensure Visibility and Observability
Now that you’ve considered and hopefully implemented all of the above, you’ll want to think about monitoring, alerting, and observability.
Monitoring tells you what broke (and hopefully why it broke). Alerting notifies you when something is broken. Observability allows you to drill into your applications and infrastructure to find root causes by looking at the internals of your systems. In order to have a complete solution that you can leverage to improve your reliability, you will want to have all three.
Services such as New Relic, Opsgenie, and StackPulse (mentioned earlier) cover these areas and can be integrated into your systems. Of course, you will still need to consider the nuances of your organization, such as how you handle and automate your incident response.
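Whichever tools you choose, your applications still need to expose signals for monitoring and alerting to act on. Here is a minimal sketch of instrumenting a service with the open-source prometheus_client library; the metric names, port, and simulated work are illustrative, and your vendor’s agent may take the place of a Prometheus scrape.

```python
# Minimal sketch: expose request counts and latency so a monitoring system can
# scrape them and alert on error rate or slow requests. Metric names and the
# port are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["status"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        status = "500" if random.random() < 0.05 else "200"
    REQUESTS.labels(status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves Prometheus metrics at http://localhost:8000/metrics
    while True:
        handle_request()
```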
StackPulse provides an excellent resource on automating workflows for low-severity alerts to reduce toil and alert fatigue. Take a look at Automated Kubernetes Pod Restarting Analysis with StackPulse.
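As a rough illustration of the kind of signal such an automation starts from (and not StackPulse’s actual playbook), the sketch below uses the official kubernetes Python client to list pods stuck in CrashLoopBackOff, assuming a working kubeconfig.

```python
# Rough sketch: find pods whose containers are stuck in CrashLoopBackOff --
# a common trigger for automated restart analysis. Assumes a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
pods = client.CoreV1Api().list_pod_for_all_namespaces().items

for pod in pods:
    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting if cs.state else None
        if waiting and waiting.reason == "CrashLoopBackOff":
            print(
                f"{pod.metadata.namespace}/{pod.metadata.name} "
                f"container={cs.name} restarts={cs.restart_count}"
            )
```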
Best Practice #5: Leverage Modern Reliability Methodologies
Now that you’ve done all of the above, it’s time to start digging in and making your systems truly reliable. The best way to do this is to start building a culture of reliability in your organization. Implement DevOps and SRE principles like shared ownership, risk awareness, gradual change, measuring reliability, education, and better automation and tooling.
But how exactly does one measure reliability? Here are a couple of ways (some of which might be obvious):
- Create a reliability plan. While a deep dive into reliability plans is beyond the scope of this post, you can find some great information in this article by Tammy Bryant Butow, Principal Site Reliability Engineer at Gremlin. She explains how to create a reliability plan in detail, including how to find common failure modes and how to test whether what you implemented to harden your Kubernetes cluster actually worked.
- Finally, don’t forget to establish SLOs so that you know whether the best practices you’ve implemented are actually working (a quick error-budget calculation is sketched after the links below). You can find more information about SLOs in these articles on the StackPulse blog:
- Service Level Objectives (SLO): A Guide
- How to Establish Your Service Level Objectives (SLOs).
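To make the idea concrete, here is a minimal sketch of turning an availability SLO into an error budget; the target and request counts are made-up numbers for illustration only.

```python
# Minimal sketch: measure availability against a 99.9% SLO and see how much
# of the error budget has been spent. All numbers below are hypothetical.
SLO_TARGET = 0.999                # 99.9% of requests should succeed
total_requests = 12_500_000       # request volume in the SLO window (e.g., 30 days)
failed_requests = 8_200           # failures observed in the same window

availability = 1 - failed_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests  # failures you can "spend" and still meet the SLO
budget_consumed = failed_requests / error_budget

print(f"Measured availability: {availability:.4%}")          # 99.9344%
print(f"Error budget: {error_budget:,.0f} failed requests")  # 12,500
print(f"Error budget consumed: {budget_consumed:.0%}")       # 66%
if availability < SLO_TARGET:
    print("SLO breached -- prioritize reliability work over new features")
```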
Jump start your Kubernetes reliability automation with K8s playbooks and tools from StackPulse.
Don’t see one for your use case? Get started building your own today, or get in touch with our team to see how we can help.
- Check out more pre-built playbooks to help you save time and deliver more reliable services.
- Learn more about the benefits of code-based, executable playbooks for incident response.
- Get started with the free edition of the StackPulse Reliability Platform.