Kubernetes (K8s) is a container orchestration platform that runs application and service workloads in a scalable manner. Out of the box, on the happy path, K8s is a wonderful platform that almost magically deploys and manages workloads and restarts services when they become unready or unhealthy. K8s works behind the scenes, though there are many hidden caveats when it comes to keeping the platform stable and healthy.
For the seasoned Site Reliability Engineer (SRE), running production-ready services in K8s remains a daunting task, primarily because the platform has so many moving parts. It’s highly customizable, and a wrong configuration can be costly. You can invest quite a lot of time configuring the platform and still suffer sudden pod restarts or other horror stories.
Read on to learn the top five concerns of running services reliably with K8s, based on the author’s experience running multiple workloads across multiple vendors.
#1 Choosing Reliable Kubernetes Distros
There is one open-source Kubernetes project, but there are multiple offerings built on it. Not all of them are suitable and reliable for your business or organization, even if you manage services with a single cloud vendor. You have to consider future maintenance and the cost of provisioning new infrastructure components.
Take, for example, the managed Kubernetes services of the major vendors (AKS, EKS, GKE). Each has its own release channels, policies and upgrade mechanisms. SLA and SLO terms differ between them and depend largely on how much you have at stake with each. Still, using a managed Kubernetes service is frequently recommended, because the vendor absorbs most of the maintenance and reliability burden of running highly available clusters on their infrastructure.
My recommendation, though, is that when you take Kubernetes to production, you check whether your existing tooling allows for multi-cloud deployments. A good way to keep that option open is to write truly cloud-native apps, which we explain next.
#2 Writing Reliable Cloud-Native Apps
If you write applications that work in a container environment from the start (keeping no local state, writing logs to stdout), then you can gain many advantages and benefits from K8s. If you bring in legacy apps that do not tolerate being stopped or restarted at any point, the process is unpleasant. The biggest thing is realizing that your container can be deleted at any moment, and making sure your code is aware of that, as the sketch below illustrates.
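Here is a minimal sketch in Go of a service that handles termination gracefully. The port and timeout values are arbitrary assumptions; the only Kubernetes-specific fact it relies on is that the kubelet sends SIGTERM and waits a grace period (30 seconds by default) before killing the container.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"} // port is an assumption
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err) // logs go to stdout/stderr, not files
		}
	}()

	// Kubernetes sends SIGTERM before deleting the pod; wait for it and
	// finish in-flight requests instead of dying mid-request.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	<-sigs

	// Drain within a budget comfortably below the default 30s grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```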
Writing truly cloud-native apps should be the norm when you deploy them into a Kubernetes cluster and expect them to stay reliable and running. Thankfully, there is plenty of community support, along with reference methodologies like 12-factor apps and modern cloud-native architectures, to guide you through this process.
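For instance, the 12-factor methodology keeps configuration in the environment so the same image runs unchanged in every cluster. A minimal sketch, with hypothetical variable names:

```go
package main

import (
	"log"
	"os"
)

// loadConfig reads settings from the environment (12-factor style) so the
// same container image runs unchanged in any cluster or cloud.
func loadConfig() (dbURL, listenAddr string) {
	dbURL = os.Getenv("DATABASE_URL") // hypothetical variable name
	listenAddr = os.Getenv("LISTEN_ADDR")
	if listenAddr == "" {
		listenAddr = ":8080" // sensible default
	}
	if dbURL == "" {
		log.Fatal("DATABASE_URL must be set") // fail fast and loudly at startup
	}
	return dbURL, listenAddr
}

func main() {
	dbURL, addr := loadConfig()
	log.Printf("starting on %s (db configured: %v)", addr, dbURL != "")
}
```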
The most challenging part, of course, is producing applications in a cloud-agnostic way, whether that means configuration that is not tied to a cloud provider or API calls for object storage, serverless functions or IAM. Ultimately, you want to achieve high levels of fault tolerance, self-healing, efficiency and disaster recovery, and a valid, well-trodden path to that is leveraging multiple cloud providers.
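One common pattern for staying cloud-agnostic is hiding provider SDKs behind your own interface. The sketch below assumes a hypothetical BlobStore interface with a toy in-memory implementation; real adapters would wrap the AWS, GCP or Azure SDKs behind the same methods, so swapping providers touches only the constructor.

```go
package main

import (
	"context"
	"fmt"
)

// BlobStore abstracts object storage so business logic never imports a
// provider SDK directly. Hypothetical interface, not from any library.
type BlobStore interface {
	Put(ctx context.Context, key string, data []byte) error
	Get(ctx context.Context, key string) ([]byte, error)
}

// memStore is a toy in-memory implementation used here for illustration.
type memStore struct{ m map[string][]byte }

func (s *memStore) Put(ctx context.Context, key string, data []byte) error {
	s.m[key] = data
	return nil
}

func (s *memStore) Get(ctx context.Context, key string) ([]byte, error) {
	d, ok := s.m[key]
	if !ok {
		return nil, fmt.Errorf("not found: %s", key)
	}
	return d, nil
}

func main() {
	var store BlobStore = &memStore{m: map[string][]byte{}}
	ctx := context.Background()
	store.Put(ctx, "reports/latest", []byte("ok"))
	d, _ := store.Get(ctx, "reports/latest")
	fmt.Println(string(d))
}
```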
#3 Configuring Reliability Concerns and Services
Once you have deployed applications into K8s, expecting the platform to keep them strong and steady, you want to add cross-cutting concerns like monitoring, logging, observability and alerting. If you don’t, you will be in the dark whenever a pod restarts constantly, or when the available CPUs are exhausted and the scheduler no longer has enough resources to place additional workloads.
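A typical first step is exposing application metrics for Prometheus to scrape, which most Kubernetes monitoring stacks build on. A minimal sketch using the official Go client library; the metric name and port are assumptions:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts handled HTTP requests by path; the metric name is
// hypothetical and registered with the default Prometheus registry.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total HTTP requests handled, by path.",
	},
	[]string{"path"},
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})
	// Expose /metrics for the Prometheus scraper running in the cluster.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```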
You want to be able to grasp how the internal services work, conduct root cause analysis and respond to outages. Nothing prevents a wrongly configured container from wreaking havoc on other services, so you will need to establish rules and admission policies as well. But before you can do that, you need the basic incident response systems in place.
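To make "admission policies" concrete, here is a skeleton of a validating webhook that rejects pods whose containers lack a memory limit. It is an illustrative sketch only: a production webhook must serve TLS with a certificate the API server trusts and be registered via a ValidatingWebhookConfiguration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validate rejects any pod whose containers have no memory limit set.
func validate(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var review admissionv1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}
	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	allowed, msg := true, ""
	for _, c := range pod.Spec.Containers {
		if c.Resources.Limits.Memory().IsZero() {
			allowed, msg = false, fmt.Sprintf("container %q has no memory limit", c.Name)
		}
	}
	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: allowed,
		Result:  &metav1.Status{Message: msg},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/validate", validate)
	// Real admission webhooks must use ListenAndServeTLS; plain HTTP here
	// only keeps the sketch short.
	http.ListenAndServe(":8443", nil)
}
```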
In practical terms, you want to monitor the most typical issues: pod restarts, job failures, memory pressure or CPU limits. Once a threshold is crossed, you want to send emails and Slack messages to the relevant people. Using playbooks is an excellent way to achieve that with reasonable effort. Learn about our Kubernetes playbooks and tools here, or check out the StackPulse Playbook Library in full on GitHub.
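As a toy version of such a check, this client-go sketch lists pods across all namespaces and flags containers above a restart threshold. The threshold and kubeconfig path are assumptions, and a real playbook would page someone or post to Slack rather than print:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; in-cluster config would also work.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	// Empty namespace means "all namespaces".
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	const restartThreshold = 5 // hypothetical alerting threshold
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.RestartCount > restartThreshold {
				fmt.Printf("ALERT: %s/%s container %s restarted %d times\n",
					pod.Namespace, pod.Name, cs.Name, cs.RestartCount)
			}
		}
	}
}
```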
#4 Running Workloads When the Neighbors are Noisy
Even if you run a stable infrastructure, you cannot rest assured all the time. The problem that sooner or later arises is resource contention caused by noisy neighbors. Even if you maintain strict pod resource limits, there is no firm guarantee that the scheduler will be fair or able to respond rapidly to those issues. You can assign priorities, add plugins that control CPU affinity and so on, but that is yet more configuration on top of everything else.
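Requests and limits remain the first line of defense against noisy neighbors. A minimal sketch using the official Go API types (the container name, image and values are illustrative); note that this example yields the Burstable QoS class, while setting requests equal to limits on every container yields Guaranteed, which is evicted last under node pressure:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Requests reserve capacity on the node for scheduling; limits cap what
	// the container may actually consume at runtime.
	container := corev1.Container{
		Name:  "api",                 // hypothetical container name
		Image: "example.com/api:1.0", // hypothetical image
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("250m"),
				corev1.ResourceMemory: resource.MustParse("256Mi"),
			},
			Limits: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("500m"),
				corev1.ResourceMemory: resource.MustParse("512Mi"),
			},
		},
	}
	fmt.Printf("%+v\n", container.Resources)
}
```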
SREs can employ experimental techniques like Chaos Engineering to test how their production services scale and how they behave when their dependencies fail. By pushing pods against their limits under simulated resource contention, you can discover new and more reliable ways to recover from those cases.
#5 Extending Kubernetes Reliably
K8s is, by default, highly configurable and extensible. While you can use it as a commercial off-the-shelf (COTS) product, sooner or later you may have to dig into the details to carry out domain-specific work for your organization.
Practically speaking, this means you will have to write custom controllers, operators or internal services that talk to the Kubernetes API server to perform custom logic. This is one of the most impressive features of Kubernetes, but it also means you should establish the business use cases beforehand.
Writing application code that is aware of the K8s cluster API is a project in itself, and one that needs thorough testing, because the logic is tied to the API and its versioning strategy. If the code relies on functionality that gets deprecated in a future K8s release, you will have to change the code with every upgrade.
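To make the scale of the undertaking concrete, here is a minimal client-go sketch of the watch loop at the heart of most controllers. It is a bare skeleton (no workqueue, no retries; kubeconfig path assumed), and it shows exactly the coupling described above: the code is pinned to specific versioned client-go types.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// A shared informer watches the API server and maintains a local cache,
	// so the controller reacts to events instead of polling.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			if pod.Status.Phase == corev1.PodFailed {
				// Domain-specific reconciliation logic would go here.
				fmt.Printf("pod %s/%s failed\n", pod.Namespace, pod.Name)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block forever; a real controller would handle shutdown
}
```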
Writing a custom controller requires specialized knowledge and is a common source of errors. You want a dedicated team to maintain this piece of software and stay apprised of upstream updates. However, if the value it brings to the table justifies the effort, then maintainability is of less concern.
Another way to extend Kubernetes is to install open- or closed-source operators, like the ones listed on operatorhub.io. Of course, you will have to fully trust and understand how those operators work, because you don’t want insecure operators running inside your infrastructure. The main point, though, is to use operators in the service of reliability: they can make each deployment more reliable and efficient while saving you time.
Taking these challenges and applying them to your own organization is an excellent starting point. As might be expected, every use case is different, but the basic principles stay the same, and keeping on top of K8s reliability best practices is a team effort.
Want to dig deeper on Kubernetes Reliability? Download our free “SRE Guide to Kubernetes Troubleshooting” here.