Kubernetes is hailed for making it simple to provision the resources you need, when you need them. However, it’s challenging to know the exact size and number of nodes that best fit your application, especially when you can’t predict what load you will want to support in the future. If you are allocating resources manually, you may not be quick enough to respond to the changing needs of your application. Fortunately, Kubernetes provides multiple layers of autoscaling functionality: the Horizontal Pod Autoscaler, the Vertical Pod Autoscaler, and the Cluster Autoscaler. Together, these allow you to ensure that each pod and cluster is just the right size to meet your current needs.
Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) scales the number of pod replicas running for a workload in response to its present computational needs. You specify the metrics that will determine the number of pods needed, and set the target values at which pods should be created or removed. The usual metrics are CPU and memory usage, but you can also specify your own custom metrics. Once you’ve set up the HPA, it will continuously check the metrics you’ve chosen (every 15 seconds by default in current Kubernetes releases; older releases used 30 seconds). If the observed value drifts far enough from a target you’ve specified, the HPA updates the desired replica count on the workload it manages, such as a Deployment. This triggers the deployment controller to scale the number of pods, up or down, to meet the desired number of replicas.
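To make the scaling decision concrete, here is a minimal sketch of the ratio rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to the target, rounded up. The function name, the replica bounds, and the sample numbers are illustrative assumptions; the real controller also accounts for details such as pods that are not yet ready.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the HPA ratio rule (simplified; hypothetical helper).

    desired = ceil(current_replicas * current_metric / target_metric),
    skipped when the ratio is within `tolerance` of 1.0, then clamped
    to the configured replica bounds.
    """
    ratio = current_metric / target_metric
    # Small deviations from the target are ignored to avoid thrashing.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
    # 63% against a 60% target is within tolerance, so nothing changes.
    print(desired_replicas(4, 63, 60))
```

Note that the tolerance band is what keeps a small wobble around the target from changing the replica count at all.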
Note that, in order for the HPA to have the data it needs to determine the best number of pod replicas, you must install the metrics-server on your Kubernetes clusters. This will give the HPA access to CPU and memory metrics. If you wish to use custom metrics to determine how the HPA scales your pods, you will need to link Kubernetes to a time series database (such as Prometheus) that holds the metrics you wish to use, typically through a metrics adapter that exposes them to the HPA.
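As a rough illustration, the sketch below builds an `autoscaling/v2` HorizontalPodAutoscaler object that combines a CPU target with one custom per-pod metric. It is written as a Python dictionary and printed as JSON, which `kubectl` accepts as well as YAML. The workload name `web`, the metric name `http_requests_per_second`, and all numeric values are hypothetical, and the custom metric only works if an adapter actually exposes it in your cluster.

```python
import json


def build_hpa(name, deployment, min_replicas, max_replicas,
              cpu_utilization, custom_metric=None, custom_target=None):
    """Return an autoscaling/v2 HPA manifest as a plain dict (illustrative)."""
    metrics = [{
        # CPU utilization, served by metrics-server.
        "type": "Resource",
        "resource": {
            "name": "cpu",
            "target": {"type": "Utilization",
                       "averageUtilization": cpu_utilization},
        },
    }]
    if custom_metric is not None:
        metrics.append({
            # Per-pod custom metric, served by a metrics adapter (assumed).
            "type": "Pods",
            "pods": {
                "metric": {"name": custom_metric},
                "target": {"type": "AverageValue",
                           "averageValue": str(custom_target)},
            },
        })
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1",
                               "kind": "Deployment",
                               "name": deployment},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": metrics,
        },
    }


if __name__ == "__main__":
    hpa = build_hpa("web", "web", 2, 10, 60,
                    custom_metric="http_requests_per_second",
                    custom_target=100)
    print(json.dumps(hpa, indent=2))
```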
One concern with autoscaling is “thrashing”: you don’t want the number of pods constantly fluctuating in response to minute changes in your resource consumption. However, you do want the autoscaling functionality to be sensitive enough to your changing needs that you always have the correct level of resources. Kubernetes offers users the ability to influence the scale velocity of the HPA in two ways:
- You can customize how long the autoscaler has to wait before another downscale operation can be performed after the current one has finished (the default value is five minutes).
- To control how fast the target can scale up, you can limit how many replicas (as an absolute number or as a percentage of the current count) may be added within a given period.
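The scale-down delay can be pictured as a sliding window over recent recommendations: the autoscaler only shrinks to the highest replica count recommended during the window, so a brief dip in load does not remove pods. The following is a simplified sketch of that idea, not the controller’s actual code; the class name and the timestamps are assumptions made for illustration, and the five-minute default is the value mentioned above.

```python
from collections import deque


class ScaleDownStabilizer:
    """Sliding-window damping of scale-down decisions (simplified sketch)."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended replicas)

    def stabilize(self, now, current_replicas, recommendation):
        """Return the replica count to apply at time `now` (seconds)."""
        self._history.append((now, recommendation))
        # Forget recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        if recommendation >= current_replicas:
            # Scale-ups are applied immediately in this sketch.
            return recommendation
        # Scale down only as far as the highest recent recommendation.
        highest_recent = max(r for _, r in self._history)
        return min(current_replicas, highest_recent)


if __name__ == "__main__":
    s = ScaleDownStabilizer(window_seconds=300)
    print(s.stabilize(0, 10, 10))    # steady state
    print(s.stabilize(60, 10, 4))    # dip is ignored for now
    print(s.stabilize(301, 10, 5))   # old peak has aged out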
Vertical Pod Autoscaler
Where the HPA allocates pod replicas in order to manage resources, the Vertical Pod Autoscaler (VPA) simply allocates more (or less) CPU and memory to existing pods. This can be used just to initialize the resources given to each pod at creation, or to actively monitor and scale each pod’s resources over its lifetime. Technically, the VPA does not alter the resources of running pods; rather, it checks whether each managed pod has the correct resource requests set and, if not, kills the pod so that it can be recreated by its controller with the updated requests.
The VPA includes a tool called the VPA Recommender, which monitors the current and past resource consumption and, based on that data, provides recommended values for the containers’ CPU and memory requests. Even if you don’t trust the VPA to manage your pods, you can still use the VPA to get recommendations about what resources would best fit your current load.
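To give a feel for how a recommendation could be derived from usage history and then used to decide which pods to recreate, here is a deliberately simplified sketch. It is not the VPA Recommender’s actual algorithm: the percentile choices, the 15% safety margin, the function names, and the sample CPU readings (in millicores) are all assumptions chosen for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of integers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend(usage_millicores, margin_pct=15):
    """Derive lower/target/upper CPU requests from past usage (sketch)."""
    def padded(pct):
        # Integer arithmetic keeps the result in whole millicores.
        return percentile(usage_millicores, pct) * (100 + margin_pct) // 100

    return {"lower": padded(50), "target": padded(90), "upper": padded(95)}


def needs_recreation(current_request, recommendation):
    """A pod is recreated only if its request is outside the recommended range."""
    return not (recommendation["lower"] <= current_request
                <= recommendation["upper"])


if __name__ == "__main__":
    history = [100, 120, 110, 300, 130, 125, 115, 140, 135, 128]
    rec = recommend(history)
    print(rec)
    print(needs_recreation(500, rec))  # over-provisioned pod
    print(needs_recreation(150, rec))  # request already in range
```

Used in recommendation-only mode, you would read the `target` value and apply it by hand rather than letting pods be recreated automatically.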
The HPA and VPA are both useful tools, so you may be tempted to put both to work managing your container resources. However, this practice has the potential to put the HPA and VPA in direct conflict with one another. If they both detect that more memory is needed, they will both try to resolve this issue at the same time, resulting in the wrong allocation of resources. However, it is possible to use the HPA and VPA together, provided that they rely on different metrics. This will prevent them from being triggered by the same events. The VPA only uses CPU and memory consumption to generate its recommendations, but if you set your HPA to use custom metrics, then both tools can function in parallel.
Cluster Autoscaler
While the HPA and VPA allow you to scale pods, the Cluster Autoscaler (CA) scales the number of nodes in your cluster based on pending pods. It checks whether any pods are pending because no existing node has room for them, and increases the size of the cluster so that these pods can be scheduled. The CA also deallocates idle nodes to keep the cluster at the optimal size. In order to provision more nodes, the CA can interface directly with cloud providers and request the resources needed. It can also use cloud provider-specific logic to specify strategies for scaling clusters.
The Cluster Autoscaler relies on different metrics and has a different goal than either the HPA or VPA. Thus, you can use CA in addition to either the HPA or VPA without conflict. The HPA and CA complement each other for truly efficient scaling. If the load increases, HPA will create new replicas. If there isn’t enough space for these replicas, CA will provision some nodes, so that the HPA-created pods have a place to run.
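The hand-off from HPA to CA can be sketched as a packing problem: given the CPU requests of the pods that could not be scheduled, how many new nodes of a given size are needed? The example below uses a first-fit-decreasing heuristic as a stand-in; the real Cluster Autoscaler simulates the scheduler against each node group, so treat this only as an approximation. The function name, node size, and pod requests are hypothetical.

```python
def nodes_needed(pending_pod_cpu, node_cpu):
    """Estimate new nodes required for pending pods (first-fit decreasing).

    `pending_pod_cpu` holds CPU requests in millicores; `node_cpu` is the
    allocatable CPU of one new node. Only CPU is considered in this sketch.
    """
    free = []  # remaining CPU on each hypothetical new node
    for request in sorted(pending_pod_cpu, reverse=True):
        if request > node_cpu:
            raise ValueError(f"pod requesting {request}m fits on no node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] -= request  # place the pod on an already-added node
                break
        else:
            free.append(node_cpu - request)  # no room anywhere: add a node
    return len(free)


if __name__ == "__main__":
    pending = [1500, 1000, 500, 500, 2000]
    print(nodes_needed(pending, node_cpu=4000))
```

With no pending pods the function returns zero, which mirrors the point above: the CA adds nodes only when pods are waiting for space.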
However, the Kubernetes Cluster Autoscaler should not be used alongside the CPU-based cluster autoscalers offered by some cloud providers. CPU-usage-based cluster autoscalers do not take pods into account when scaling up and down. As a result, they may add a node that will not have any pods, or remove a node that has some system-critical pods on it.
As with the HPA, one concern to keep in mind is the speed at which the CA will deploy (or deallocate) resources. If the scaling is too sensitive, your cluster becomes unstable, but if there is too much latency, your application may experience downtime. In practice, it can take a few minutes for the CA to create a new node. One way to ensure capacity is immediately available is to run low-priority “pause pods” in the cluster, which can be terminated to make room for new pods. In essence, this saves time by reserving space in the cluster for additional pods: when real workloads need the room, the pause pods are evicted, become pending, and prompt the CA to add a node in the background. On the other hand, if you want to slow the scale-up velocity, you can configure a delay interval. To smooth out scale-down operations, you can define PodDisruptionBudgets, which limit how many pods can be evicted at once, to prevent pods from being deleted too abruptly.
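The pause-pod approach and a disruption budget might be expressed with objects like the ones sketched below, again built as Python dictionaries and printed as JSON for `kubectl`. Every name, label, replica count, resource size, and the priority value of -10 is an assumption for illustration; check which priority values your Cluster Autoscaler version treats as eligible to trigger scale-up before relying on it.

```python
import json


def pause_priority_class(name="overprovisioning", value=-10):
    """Low priority so real workloads can preempt the placeholder pods."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": "Placeholder pods that reserve spare capacity.",
    }


def pause_deployment(replicas, cpu, memory, priority_class="overprovisioning"):
    """Deployment of do-nothing pause containers that hold space on nodes."""
    labels = {"app": "overprovisioning"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": priority_class,
                    "containers": [{
                        "name": "pause",
                        "image": "registry.k8s.io/pause:3.9",
                        # The requests define how much room is reserved.
                        "resources": {"requests": {"cpu": cpu,
                                                   "memory": memory}},
                    }],
                },
            },
        },
    }


def disruption_budget(name, app_label, min_available):
    """PodDisruptionBudget keeping at least `min_available` pods running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    for obj in (pause_priority_class(),
                pause_deployment(replicas=2, cpu="500m", memory="512Mi"),
                disruption_budget("web-pdb", "web", min_available=2)):
        print(json.dumps(obj, indent=2))
```

The trade-off is cost: the reserved capacity is paid for whether or not real pods ever need it, so the number and size of pause pods is worth tuning.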
These autoscaling features can save your team money by ensuring you are not over-provisioning, while still ensuring that your application has all the resources it needs to stay operational, despite unpredictable loads. However, configuring them correctly may be a headache, even if you avoid all the pitfalls listed in this article.
Whether to use the HPA, VPA, CA, or some combination depends on the needs of your application. Experimentation is the most reliable way to find which option works best for you, so it might take a few tries to find the right setup. Mastering autoscaling in Kubernetes is a journey and will require continuous learning as these tools mature.
If you want the advantages of autoscaling but want to shortcut the learning process, you may consider a Kubernetes-as-a-service platform, which implements and manages autoscaling for you. No matter how you implement them though, this suite of autoscalers can help you realize the promise of Kubernetes in a right-sized on-demand infrastructure.