Concepts
1 - Prober
Prober
Overview
Prober starts asynchronous and periodic probes for every shoot cluster. The first probe is the api-server probe, which checks the reachability of the API server from the control plane. The second probe is the lease probe, which is performed only after the api-server probe succeeds and checks whether the number of expired node leases is below a certain threshold.
If the lease probe fails, the prober scales down the dependent kubernetes resources. Once connectivity to the kube-apiserver is reestablished and the number of expired node leases is within the accepted threshold, the prober proactively scales the dependent kubernetes resources it had scaled down back up. The failure threshold fraction for the lease probe and the dependent kubernetes resources are defined in the configuration that is passed to the prober.
Origin
In a shoot cluster (a.k.a. data plane) each node runs a kubelet which periodically renews its lease. Leases serve as heartbeats informing the Kube Controller Manager that the node is alive. The connectivity between the kubelet and the Kube ApiServer can break for different reasons and not recover in time.
As an example, consider a large shoot cluster with several hundred nodes. An issue with a NAT gateway on the shoot cluster prevents the kubelet on any node in the shoot cluster from reaching its control plane Kube ApiServer. As a consequence, the Kube Controller Manager transitions the nodes of this shoot cluster to the Unknown state.
Machine Controller Manager, which also runs in the shoot control plane, reacts to any change in Node status and then takes action to recover the backing VMs/machine(s). It waits for a grace period and then begins to replace the unhealthy machine(s) with new ones.
This replacement of healthy machines due to broken connectivity between the worker nodes and the control plane Kube ApiServer results in undesired downtime for customer workloads that were running on these otherwise healthy nodes. It is therefore required that there be an actor which detects the connectivity loss between the kubelet and the shoot cluster’s Kube ApiServer and proactively scales down components in the shoot control namespace which could worsen the availability of nodes in the shoot cluster.
Dependency Watchdog Prober in Gardener
Prober is a central component which is deployed in the garden
namespace in the seed cluster. Control plane components for a shoot are deployed in a dedicated shoot namespace for the shoot within the seed cluster.
NOTE: If you are not familiar with gardener components like seed and shoot, please see the appendix for links.
Prober periodically probes Kube ApiServer via two separate probes:
- API Server Probe: Local cluster DNS name which resolves to the ClusterIP of the Kube Apiserver
- Lease Probe: Checks whether the number of expired node leases is within the specified threshold. The threshold defines the limit beyond which DWD concludes that the kubelets are not able to reach the API server.
Behind the scenes
For all active shoot clusters (which have not been hibernated, deleted, or moved to another seed via control-plane-migration), prober will schedule a probe to run periodically. During each run of a probe it will do the following (a sketch of this flow follows the list):
- Checks if the Kube ApiServer is reachable via local cluster DNS. This should always succeed and will fail only when the Kube ApiServer has gone down. If the Kube ApiServer is down then there can be no further damage to the existing shoot cluster (barring new requests to the Kube Api Server).
- Only if the probe is able to reach the Kube ApiServer via local cluster DNS, will it attempt to check the number of expired node leases in the shoot. The node lease renewal is done by the Kubelet, and so we can say that the lease probe is checking if the kubelet is able to reach the API server. If the number of expired node leases reaches the threshold, then the probe fails.
- If and when a lease probe fails, then it will initiate a scale-down operation for dependent resources as defined in the prober configuration.
- In subsequent runs it will keep performing the lease probe. If it is successful, then it will start the scale-up operation for dependent resources as defined in the configuration.
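This flow can be expressed as a short sketch. This is not the actual prober code; the Scaler interface and the apiServerReachable/expiredLeaseFraction helpers are illustrative stand-ins for the real components wired in from the prober configuration.

```go
package prober

import (
	"context"
)

// Scaler abstracts the scale-up/scale-down of the dependent resources listed
// in the prober configuration (illustrative interface, not the real one).
type Scaler interface {
	ScaleDown(ctx context.Context) error
	ScaleUp(ctx context.Context) error
}

// runProbe sketches a single probe run following the steps above.
// apiServerReachable and expiredLeaseFraction are assumed helpers: the first
// performs the API server probe via the local cluster DNS name, the second
// lists node leases and returns the fraction that have expired.
func runProbe(
	ctx context.Context,
	apiServerReachable func(ctx context.Context) bool,
	expiredLeaseFraction func(ctx context.Context) (float64, error),
	threshold float64, // failure threshold fraction from the prober configuration
	s Scaler,
) error {
	// Step 1: API server probe. If the API server itself is down, do nothing;
	// no further damage can happen to the shoot and the probe is retried later.
	if !apiServerReachable(ctx) {
		return nil
	}

	// Step 2: lease probe, only attempted when the API server is reachable.
	fraction, err := expiredLeaseFraction(ctx)
	if err != nil {
		// e.g. listing leases failed; no scaling operation is performed.
		return err
	}

	// Steps 3/4: scale down on failure, scale up again once the probe succeeds.
	if fraction >= threshold {
		return s.ScaleDown(ctx)
	}
	return s.ScaleUp(ctx)
}
```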
Prober lifecycle
A reconciler is registered to listen to all events for Cluster resource.
When a Reconciler
receives a request for a Cluster
change, it will query the extension kube-api server to get the Cluster
resource.
In the following cases it will either remove an existing probe for this cluster or skip creating a new probe:
- Cluster is marked for deletion.
- Hibernation has been enabled for the cluster.
- There is an ongoing seed migration for this cluster.
- A new cluster is created with no workers.
- All workers are removed from an existing cluster (in other words, it is made worker-less).
If none of the above conditions are true and there is no existing probe for this cluster then a new probe will be created, registered and started.
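A compact sketch of that decision, assuming a simplified view of the information the reconciler extracts from the Cluster resource (the type and field names below are illustrative, not the real Gardener types):

```go
package prober

// clusterState is a simplified, illustrative view of the information the
// reconciler derives from the Cluster resource and its embedded shoot spec.
type clusterState struct {
	MarkedForDeletion       bool
	HibernationEnabled      bool
	SeedMigrationInProgress bool
	WorkerCount             int
}

// shouldHaveProbe mirrors the conditions listed above: if any of them holds,
// an existing probe is removed (or no new probe is created) for the cluster.
func shouldHaveProbe(c clusterState) bool {
	if c.MarkedForDeletion || c.HibernationEnabled || c.SeedMigrationInProgress {
		return false
	}
	if c.WorkerCount == 0 { // newly created or updated to be worker-less
		return false
	}
	return true
}
```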
Probe failure identification
A DWD probe can either succeed or return an error. If the API server probe fails, the lease probe is not performed and the probes are retried. If the error is a TooManyRequests error due to requests to the Kube-API-Server being throttled, then the probes are retried after a back-off of backOffDurationForThrottledRequests.
If the lease probe fails, the error could be due to a failure in listing the leases. In this case, no scaling operations are performed. If the error in listing the leases is a TooManyRequests error due to requests to the Kube-API-Server being throttled, then the probes are retried after a back-off of backOffDurationForThrottledRequests.
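A rough sketch of that retry behaviour is shown below; the probe function and the back-off duration are placeholders that the real prober wires in from its setup and configuration.

```go
package prober

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// probeWithThrottleBackOff sketches how a probe error caused by API server
// throttling (HTTP 429 / TooManyRequests) leads to a delayed retry instead of
// any scaling action. probe and backOffDurationForThrottledRequests are
// placeholders for the real probe function and the configured back-off.
func probeWithThrottleBackOff(ctx context.Context, probe func(ctx context.Context) error, backOffDurationForThrottledRequests time.Duration) {
	for {
		err := probe(ctx)
		if err != nil && apierrors.IsTooManyRequests(err) {
			// Requests to the kube-apiserver are being throttled; back off and retry.
			select {
			case <-time.After(backOffDurationForThrottledRequests):
				continue
			case <-ctx.Done():
				return
			}
		}
		return
	}
}
```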
If there is no error in listing the leases, then the lease probe fails if the number of expired leases reaches the threshold fraction specified in the configuration. A lease is considered expired in the following scenario:
time.Now() >= lease.Spec.RenewTime + (p.config.KCMNodeMonitorGraceDuration.Duration * expiryBufferFraction)
Here, lease.Spec.RenewTime is the time when the current holder of a lease last updated the lease. config is the probe config generated from the configuration, and KCMNodeMonitorGraceDuration is the amount of time which KCM allows a running node to be unresponsive before marking it unhealthy (see ref). expiryBufferFraction is a hard-coded value of 0.75. Using this fraction allows the prober to intervene before KCM marks a node as unknown, while still allowing the kubelet sufficient retries to renew the node lease (the kubelet renews the lease every 10s, see ref).
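A sketch of this check against the coordination API lease type; the surrounding function names are illustrative, while the 0.75 buffer and the KCM node-monitor grace duration come from the description above.

```go
package prober

import (
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

const expiryBufferFraction = 0.75 // hard-coded buffer fraction described above

// isLeaseExpired applies the expiry formula above to a single node lease.
func isLeaseExpired(lease coordinationv1.Lease, kcmNodeMonitorGraceDuration time.Duration, now time.Time) bool {
	if lease.Spec.RenewTime == nil {
		return true
	}
	expiry := lease.Spec.RenewTime.Add(time.Duration(float64(kcmNodeMonitorGraceDuration) * expiryBufferFraction))
	return !now.Before(expiry) // now >= renewTime + grace * 0.75
}

// expiredLeaseFraction counts expired leases; the lease probe fails when this
// fraction reaches the threshold configured for the prober.
func expiredLeaseFraction(leases []coordinationv1.Lease, kcmNodeMonitorGraceDuration time.Duration, now time.Time) float64 {
	if len(leases) == 0 {
		return 0
	}
	expired := 0
	for _, l := range leases {
		if isLeaseExpired(l, kcmNodeMonitorGraceDuration, now) {
			expired++
		}
	}
	return float64(expired) / float64(len(leases))
}
```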
Appendix
2 - Weeder
Weeder
Overview
Weeder watches for updates to service endpoints and, on receiving such an event, creates a time-bound watch for all configured dependent pods that need to be actively recovered in case they have not yet recovered from the CrashLoopBackOff state. In a nutshell, it accelerates the recovery of pods when an upstream service recovers.
Intervention in the automatic recovery of dependent pods is required because kubernetes restarts a container of a pod in CrashLoopBackOff state with an exponential back-off. This back-off can become quite large if the service stays down for long. The weeder prevents this by deleting such pods so that they are recreated without waiting out the back-off.
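For intuition, here is a tiny illustration of how that restart delay grows, assuming the kubelet defaults of a 10s initial delay that doubles per restart up to a 5-minute cap (the function itself is purely illustrative):

```go
package weeder

import (
	"fmt"
	"time"
)

// printCrashLoopBackOffDelays illustrates why pods stuck in CrashLoopBackOff
// recover slowly on their own: the kubelet's restart delay roughly doubles per
// restart (defaults assumed here: 10s initial delay, capped at 5 minutes).
func printCrashLoopBackOffDelays(restarts int) {
	delay := 10 * time.Second
	const maxDelay = 5 * time.Minute
	for i := 1; i <= restarts; i++ {
		fmt.Printf("restart %d: wait ~%s before next attempt\n", i, delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```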
Prerequisites
Before we understand how Weeder works, we need to be familiar with kubernetes services & endpoints.
NOTE: If a kubernetes service is created with selectors then kubernetes will create a corresponding endpoints resource which has the same name as the service. In the weeder implementation, service and endpoint names are therefore used interchangeably.
Config
Weeder can be configured via command line arguments and a weeder configuration. See configure weeder.
Internals
Weeder keeps a watch on the events for the endpoints specified in the config. For every endpoints resource a list of podSelectors can be specified. Weeder creates a weeder object per endpoints resource when it receives a satisfactory Create or Update event. Then, for every podSelector, it creates a goroutine. This goroutine keeps a watch on the pods with labels matching the podSelector and deletes any pod which turns into CrashLoopBackOff. Each weeder lives for the watchDuration interval, which has a default value of 5 mins if not explicitly set.
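A rough sketch of what each such goroutine does. This simplified version polls instead of using a watch, the function name and polling interval are illustrative, and the CrashLoopBackOff check is reduced to the container waiting reason.

```go
package weeder

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// weedPods deletes pods matching labelSelector that are in CrashLoopBackOff,
// and stops once watchDuration has elapsed. This is a simplified, polling
// sketch of the per-podSelector goroutine; the real weeder reacts to watch events.
func weedPods(ctx context.Context, client kubernetes.Interface, namespace, labelSelector string, watchDuration time.Duration) {
	ctx, cancel := context.WithTimeout(ctx, watchDuration)
	defer cancel()

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return // the weeder for this selector lives only for watchDuration
		case <-ticker.C:
			pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
			if err != nil {
				continue
			}
			for _, pod := range pods.Items {
				if inCrashLoopBackOff(pod) {
					// Deleting the pod lets its controller recreate it immediately,
					// skipping the remaining exponential back-off.
					_ = client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
				}
			}
		}
	}
}

func inCrashLoopBackOff(pod corev1.Pod) bool {
	for _, status := range pod.Status.ContainerStatuses {
		if status.State.Waiting != nil && status.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}
```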
To understand the actions taken by the weeder, let's walk through a sample timeline. Let us also assume the following configuration for the weeder:
```yaml
watchDuration: 2m0s
servicesAndDependantSelectors:
  etcd-main-client: # name of the service/endpoint for etcd statefulset that weeder will receive events for.
    podSelectors: # all pods matching the label selector are direct dependencies for etcd service
      - matchExpressions:
          - key: gardener.cloud/role
            operator: In
            values:
              - controlplane
          - key: role
            operator: In
            values:
              - apiserver
  kube-apiserver: # name of the service/endpoint for kube-api-server pods that weeder will receive events for.
    podSelectors: # all pods matching the label selector are direct dependencies for kube-api-server service
      - matchExpressions:
          - key: gardener.cloud/role
            operator: In
            values:
              - controlplane
          - key: role
            operator: NotIn
            values:
              - main
              - apiserver
```
Only for the sake of demonstration, let's pick the first service -> dependent pods tuple (etcd-main-client as the service endpoint).
- Assume that there are 3 replicas for the etcd statefulset.
- The times below simply illustrate the series of events:
- t=0 -> all etcd pods go down
- t=10 -> kube-api-server pods transition to CrashLoopBackOff
- t=100 -> all etcd pods recover together
- t=101 -> Weeder sees Update event for etcd-main-client endpoints resource
- t=102 -> go routine created to keep watch on kube-api-server pods
- t=103 -> Since kube-api-server pods are still in CrashLoopBackOff, weeder deletes the pods to accelerate the recovery.
- t=104 -> new kube-api-server pod created by replica-set controller in kube-controller-manager
Points to Note
- Weeder only responds to Update events where a notReady endpoints resource turns Ready. That's why there was no weeder action at time t=10 in the example above (see the sketch after this list for how readiness is determined):
  - notReady -> no backing pod is Ready
  - Ready -> at least one backing pod is Ready
- Weeder doesn't respond to Delete events.
- Weeder will always wait for the entire watchDuration. If the dependent pods transition to CrashLoopBackOff after the watch duration, or if they do not recover even after repeated deletion, the weeder will exit. The quality of service offered via a weeder is only best-effort.
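A minimal sketch of that readiness determination on an Endpoints resource (the function name is illustrative): the resource counts as Ready when any subset has at least one ready address, i.e. at least one backing pod is Ready.

```go
package weeder

import (
	corev1 "k8s.io/api/core/v1"
)

// endpointsReady reports whether an Endpoints resource backs at least one
// ready pod; this is the notReady -> Ready transition the weeder acts on.
func endpointsReady(ep corev1.Endpoints) bool {
	for _, subset := range ep.Subsets {
		if len(subset.Addresses) > 0 {
			return true
		}
	}
	return false
}
```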