This tutorial describes how to overprovision cluster nodes for scaling and failover. Overprovisioning is desirable when you have workloads that need to scale up quickly without waiting for new cluster nodes to be created and to join the cluster.
A similar problem occurs when a node of the Hyperscaler crashes: Kubernetes must replace it as fast as possible. Overprovisioning nodes can solve this problem as well.
Overprovisioning: allocating more compute resources than are strictly necessary.
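A common way to overprovision a Kubernetes cluster is to reserve spare capacity with low-priority placeholder pods: real workloads preempt them instantly, and the cluster autoscaler then creates a new node for the evicted placeholders in the background. As a sketch (the name `overprovisioning` is illustrative, not prescribed by this tutorial), the priority class for such placeholder pods could look like this:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1              # below the default priority of 0, so any regular pod preempts these
globalDefault: false
description: "Priority class for placeholder pods that reserve spare cluster capacity."
```

Because the placeholder pods have a negative priority, the scheduler evicts them the moment a regular pod needs their resources, so scaling up does not have to wait for a new node.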
Below is a description of how the cluster behaves when it needs to scale.
You can apply the above scenario one-to-one to the case in which a node of the Hyperscaler dies.
We ran normal and overprovisioning tests on Gardener clusters on different infrastructure providers (AWS, Azure, GCP, Alicloud). All tests measured the downtime of an application pod running in the cluster when a node dies.
The test results for the different IaaS providers are shown below.
The results are only intended to show the approximate magnitude of the downtimes.
> The downtime results can vary by ±1 minute, because the minimum request interval in UpTime is 1 minute.
Without overprovisioning:

| | AWS | Azure | GCP | Alicloud |
| --- | --- | --- | --- | --- |
| Downtime | 21 min | 18 min | 14 min | 21 min |
With overprovisioning:

| | AWS | Azure | GCP | Alicloud |
| --- | --- | --- | --- | --- |
| Downtime | 2 min | 2 min | 1 min | 2 min |
We deployed an nginx web server and a Service of type LoadBalancer to expose it. This allows us to call our endpoint with external tools like UpTime to check the availability of nginx. It takes only a few seconds to deploy an nginx web server on Kubernetes, so we can say: when your endpoint responds, your node is up and running.
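The test setup could look roughly like the following manifests (a sketch; the exact manifests used for the tests are not shown in this tutorial, and names and image versions are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer   # gives the web server an external IP that UpTime can probe
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
```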
We wanted to measure how much time it takes, when a node gets killed, for the cluster to create a new one and run the application on it.
```shell
kubectl get nodes
# select the node your nginx pod is running on
kubectl delete node <NGINX-HOSTED-NODE>
```
The downtime is measured with UpTime, which sends a request to our endpoint every minute. In addition, we manually checked that the node startup time and the timestamps in UpTime roughly match.
Next, deploy the overprovisioned version of our demo application and kill the node running the nginx pod. As you can see, the pod comes up very quickly and can serve content again.
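A minimal sketch of the overprovisioning part of such a setup, assuming a low-priority class named `overprovisioning` exists in the cluster (the name, replica count, and resource requests are placeholders; size the requests to match the workload you want to fail over quickly):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                       # how many spare "slots" to keep reserved
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9   # does nothing, only holds the requested capacity
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```

When the node dies, the rescheduled nginx pod preempts one of these placeholder pods and starts immediately on the reserved capacity, while the cluster autoscaler replaces the lost node in the background.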