High Availability of Deployed Components
The gardenlets and extension controllers deploy components via Deployments, StatefulSets, etc., as part of the shoot control plane, or as seed or shoot system components.
Some of the above component deployments must be further tuned to improve fault tolerance / resilience of the service. This document outlines what needs to be done to achieve this goal.
If you want to skip ahead to the list of actions that require developers' attention, see the Convenient Application of These Rules section.
Seed Clusters
The worker nodes of seed clusters can be deployed to one or multiple availability zones. The Seed specification allows you to provide the information which zones are available:
```yaml
spec:
  provider:
    region: europe-1
    zones:
    - europe-1a
    - europe-1b
    - europe-1c
```

Independent of the number of zones, seed system components like the gardenlet or the extension controllers themselves, or others like etcd-druid, dependency-watchdog, etc., should always be running with multiple replicas.
Concretely, all seed system components should respect the following conventions:
Replica Counts
| Component Type | < 3 Zones | >= 3 Zones | Comment |
| --- | --- | --- | --- |
| Observability (Monitoring, Logging) | 1 | 1 | Downtimes accepted due to cost reasons |
| Controllers | 2 | 2 | / |
| (Webhook) Servers | 2 | 2 | / |

Apart from the above, there might be special cases where these rules do not apply, for example:

- istio-ingressgateway is scaled horizontally, hence the above numbers are the minimum values.
- nginx-ingress-controller in the seed cluster is used to advertise all shoot observability endpoints, so due to performance reasons it runs with 2 replicas at all times. In the future, this component might disappear in favor of the istio-ingressgateway anyways.
Topology Spread Constraints
When the component has >= 2 replicas ...

- ... then it should also have a topologySpreadConstraint, ensuring the replicas are spread over the nodes:

  ```yaml
  spec:
    topologySpreadConstraints:
    - topologyKey: kubernetes.io/hostname
      minDomains: 3 # lower value of max replicas or 3
      maxSkew: 1
      whenUnsatisfiable: ScheduleAnyway
      matchLabels: ...
  ```

  minDomains is set when failure tolerance is configured or the annotation high-availability-config.resources.gardener.cloud/host-spread="true" is given.

- ... and the seed cluster has >= 2 zones, then the component should also have a second topologySpreadConstraint, ensuring the replicas are spread over the zones:

  ```yaml
  spec:
    topologySpreadConstraints:
    - topologyKey: topology.kubernetes.io/zone
      minDomains: 2 # lower value of max replicas or number of zones
      maxSkew: 1
      whenUnsatisfiable: DoNotSchedule
      matchLabels: ...
  ```
According to these conventions, even seed clusters with only one availability zone try to be highly available "as good as possible" by spreading the replicas across multiple nodes. Hence, while such seed clusters obviously cannot handle zone outages, they can at least handle node failures.
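As a rough illustration of how the two constraints combine, the following sketch shows a seed system component with 2 replicas in a seed cluster that has at least two zones. The name, namespace, labels, and image are hypothetical. The labelSelector is written out in full here, whereas the snippets above abbreviate it as matchLabels: ..., and minDomains is only shown on the zone constraint.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller          # hypothetical name
  namespace: garden                 # hypothetical namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-controller       # hypothetical label
  template:
    metadata:
      labels:
        app: example-controller
    spec:
      topologySpreadConstraints:
      # Best-effort spread of the replicas over the nodes.
      - topologyKey: kubernetes.io/hostname
        maxSkew: 1
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: example-controller
      # Strict spread over the zones (only when the seed has >= 2 zones).
      - topologyKey: topology.kubernetes.io/zone
        minDomains: 2 # lower value of max replicas or number of zones
        maxSkew: 1
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: example-controller
      containers:
      - name: example-controller
        image: registry.example.com/example-controller:latest # placeholder image
```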
Shoot Clusters
The Shoot specification allows configuring "high availability" as well as the failure tolerance type for the control plane components, see Highly Available Shoot Control Plane for details.
Regarding the seed cluster selection, the only constraint is that shoot clusters with failure tolerance type zone are only allowed to run on seed clusters with at least three zones. All other shoot clusters (non-HA or those with failure tolerance type node) can run on seed clusters with any number of zones.
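For orientation, a minimal sketch of the relevant part of a Shoot requesting failure tolerance type zone might look as follows. The shoot name and namespace are hypothetical, and all unrelated fields are omitted.

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: example-shoot          # hypothetical name
  namespace: garden-example    # hypothetical project namespace
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone             # or "node"; omit highAvailability for a non-HA shoot
  # ... remaining Shoot fields omitted
```

With type zone, such a shoot can only be scheduled to a seed cluster with at least three zones.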
Control Plane Components
All control plane components should respect the following conventions:
Replica Counts
| Component Type | w/o HA | w/ HA (node) | w/ HA (zone) | Comment |
| --- | --- | --- | --- | --- |
| Observability (Monitoring, Logging) | 1 | 1 | 1 | Downtimes accepted due to cost reasons |
| Controllers | 1 | 2 | 2 | / |
| (Webhook) Servers | 2 | 2 | 2 | / |

Apart from the above, there might be special cases where these rules do not apply, for example:

- etcd is a server, though the most critical component of a cluster requiring a quorum to survive failures. Hence, it should have 3 replicas even when the failure tolerance is node only.
- kube-apiserver is scaled horizontally, hence the above numbers are the minimum values (even when the shoot cluster is not HA, there might be multiple replicas).
Topology Spread Constraints
When the component has >= 2 replicas ...

- ... then it should also have a topologySpreadConstraint ensuring the replicas are spread over the nodes:

  ```yaml
  spec:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      matchLabels: ...
  ```

  Hence, the node spread is done on best-effort basis only.

  However, if the shoot cluster has defined a failure tolerance type, the whenUnsatisfiable field should be set to DoNotSchedule.

- ... and the failure tolerance type of the shoot cluster is zone, then the component should also have a second topologySpreadConstraint ensuring the replicas are spread over the zones:

  ```yaml
  spec:
    topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 2 # lower value of max replicas or number of zones
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      matchLabels: ...
  ```
Node Affinity
The gardenlet annotates the shoot namespace in the seed cluster with the high-availability-config.resources.gardener.cloud/zones annotation.

- If the shoot cluster is non-HA or has failure tolerance type node, then the value will be always exactly one zone (e.g., high-availability-config.resources.gardener.cloud/zones=europe-1b).
- If the shoot cluster has failure tolerance type zone, then the value will always contain exactly three zones (e.g., high-availability-config.resources.gardener.cloud/zones=europe-1a,europe-1b,europe-1c).

For backwards-compatibility, this annotation might contain multiple zones for shoot clusters created before gardener/gardener@v1.60 and not having failure tolerance type zone. This is because their volumes might already exist in multiple zones, hence pinning them to only one zone would not work.

Hence, in case this annotation is present, the components should have the following node affinity:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - europe-1a
            # - ...
```

This is to ensure all pods are running in the same (set of) availability zone(s) such that cross-zone network traffic is avoided as much as possible (such traffic is typically charged by the underlying infrastructure provider).
System Components
The availability of system components is independent of the control plane since they run on the shoot worker nodes while the control plane components run on the seed worker nodes (for more information, see the Kubernetes architecture overview). Hence, it only depends on the number of availability zones configured in the shoot worker pools via .spec.provider.workers[].zones. Concretely, the highest number of zones of a worker pool with systemComponents.allow=true is considered.
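To make the "highest number of zones" rule concrete, here is a small sketch of a Shoot's worker pools. The pool names are hypothetical, and all other worker fields (machine type, minimum/maximum, etc.) are omitted.

```yaml
spec:
  provider:
    workers:
    - name: pool-a               # hypothetical pool name
      systemComponents:
        allow: true
      zones:
      - europe-1a
    - name: pool-b               # hypothetical pool name
      systemComponents:
        allow: true
      zones:
      - europe-1a
      - europe-1b
    - name: pool-c               # hypothetical pool name
      systemComponents:
        allow: false             # not considered for system components
      zones:
      - europe-1a
      - europe-1b
      - europe-1c
```

In this sketch, pool-c is ignored because it does not allow system components, so the number of zones considered is 2 (from pool-b).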
All system components should respect the following conventions:
Replica Counts
| Component Type | 1 or 2 Zones | >= 3 Zones |
| --- | --- | --- |
| Controllers | 2 | 2 |
| (Webhook) Servers | 2 | 2 |

Apart from the above, there might be special cases where these rules do not apply, for example:

- coredns is scaled horizontally (today), hence the above numbers are the minimum values (possibly, scaling these components vertically may be more appropriate, but that's unrelated to the HA subject matter).
- Optional addons like nginx-ingress or kubernetes-dashboard are only provided on best-effort basis for evaluation purposes, hence they run with 1 replica at all times.
Topology Spread Constraints
When the component has >= 2 replicas ...

- ... then it should also have a topologySpreadConstraint ensuring the replicas are spread over the nodes:

  ```yaml
  spec:
    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      matchLabels: ...
  ```

  Hence, the node spread is done on best-effort basis only.

- ... and the cluster has >= 2 zones, then the component should also have a second topologySpreadConstraint ensuring the replicas are spread over the zones:

  ```yaml
  spec:
    topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 2 # lower value of max replicas or number of zones
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      matchLabels: ...
  ```
Convenient Application of These Rules
According to the above scenarios and conventions, the replicas, topologySpreadConstraints, or affinity settings of the deployed components might need to be adapted.
To make this easy for developers, Gardener installs a mutating webhook into both seed and shoot clusters which reacts to Deployments and StatefulSets deployed to namespaces with the high-availability-config.resources.gardener.cloud/consider=true label set.
The following actions have to be taken by developers:
- Check if components are prepared to run concurrently with multiple replicas, e.g. controllers usually use leader election to achieve this.
- All components should be generally equipped with PodDisruptionBudgets with .spec.maxUnavailable=1 and unhealthyPodEvictionPolicy=AlwaysAllow:

  ```yaml
  spec:
    maxUnavailable: 1
    unhealthyPodEvictionPolicy: AlwaysAllow
    selector:
      matchLabels: ...
  ```

- Add the label high-availability-config.resources.gardener.cloud/type to deployments or statefulsets, as well as optionally involved horizontalpodautoscalers where the following two values are possible:
  - controller
  - server

Type server is also preferred if a component is a controller and (webhook) server at the same time.
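As an illustration of the labelling step, a Deployment for a component that acts as a (webhook) server might be labelled as sketched below. The names are hypothetical, and the namespace is assumed to carry the high-availability-config.resources.gardener.cloud/consider=true label.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-webhook-server   # hypothetical name
  namespace: example-namespace   # hypothetical; assumed to be labelled consider=true
  labels:
    high-availability-config.resources.gardener.cloud/type: server
spec:
  # replicas, topologySpreadConstraints, and affinity are intentionally
  # left out here, since they are what the webhook is expected to adapt.
  selector:
    matchLabels:
      app: example-webhook-server # hypothetical label
  template:
    metadata:
      labels:
        app: example-webhook-server
    spec:
      containers:
      - name: example-webhook-server
        image: registry.example.com/example-webhook-server:latest # placeholder image
```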
You can read more about the webhook's internals in High Availability Config.
gardenlet Internals
Make sure you have read the above document about the webhook internals before reading this section.
Seed Controller
The gardenlet performs the following changes on all namespaces running seed system components:
- adds the label high-availability-config.resources.gardener.cloud/consider=true.
- adds the annotation high-availability-config.resources.gardener.cloud/zones=<zones>, where <zones> is the list provided in .spec.provider.zones[] in the Seed specification.
Note that neither the high-availability-config.resources.gardener.cloud/failure-tolerance-type, nor the high-availability-config.resources.gardener.cloud/zone-pinning annotations are set, hence the node affinity would never be touched by the webhook.
The only exceptions to this rule are the istio ingress gateway namespaces. This includes the default istio ingress gateway when SNI is enabled, as well as analogous namespaces for exposure classes and zone-specific istio ingress gateways. Those namespaces will additionally be annotated with high-availability-config.resources.gardener.cloud/zone-pinning set to true, resulting in the node affinities and the topology spread constraints being set. The replicas are not touched, as the istio ingress gateways are scaled by a horizontal autoscaler instance.
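The resulting namespace metadata could look roughly like the sketch below, for a seed with the zones europe-1a, europe-1b, and europe-1c. Both namespace names are hypothetical examples. The second document shows the additional zone-pinning annotation described for istio ingress gateway namespaces.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: garden                   # hypothetical seed system namespace
  labels:
    high-availability-config.resources.gardener.cloud/consider: "true"
  annotations:
    high-availability-config.resources.gardener.cloud/zones: europe-1a,europe-1b,europe-1c
---
apiVersion: v1
kind: Namespace
metadata:
  name: istio-ingress            # hypothetical istio ingress gateway namespace
  labels:
    high-availability-config.resources.gardener.cloud/consider: "true"
  annotations:
    high-availability-config.resources.gardener.cloud/zones: europe-1a,europe-1b,europe-1c
    high-availability-config.resources.gardener.cloud/zone-pinning: "true"
```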
Shoot Controller
Control Plane
The gardenlet performs the following changes on the namespace running the shoot control plane components:
- adds the label high-availability-config.resources.gardener.cloud/consider=true. This makes the webhook mutate the replica count and the topology spread constraints.
- adds the annotation high-availability-config.resources.gardener.cloud/failure-tolerance-type with value equal to .spec.controlPlane.highAvailability.failureTolerance.type (or "", if .spec.controlPlane.highAvailability=nil). This makes the webhook mutate the node affinity according to the specified zone(s).
- adds the annotation high-availability-config.resources.gardener.cloud/zones=<zones>, where <zones> is a ...
  - ... random zone chosen from the .spec.provider.zones[] list in the Seed specification (always only one zone, even if there are multiple available in the seed cluster) in case the Shoot has no HA setting (i.e., spec.controlPlane.highAvailability=nil) or when the Shoot has HA setting with failure tolerance type node.
  - ... list of three randomly chosen zones from the .spec.provider.zones[] list in the Seed specification in case the Shoot has HA setting with failure tolerance type zone.
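For a shoot with failure tolerance type zone, the control plane namespace in the seed might therefore end up looking like the following sketch. The namespace name and the concrete zones are hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: shoot--example--my-shoot # hypothetical control plane namespace
  labels:
    high-availability-config.resources.gardener.cloud/consider: "true"
  annotations:
    # Mirrors .spec.controlPlane.highAvailability.failureTolerance.type;
    # this would be "" for a non-HA shoot.
    high-availability-config.resources.gardener.cloud/failure-tolerance-type: zone
    # Three zones for type "zone"; a single zone for non-HA or type "node".
    high-availability-config.resources.gardener.cloud/zones: europe-1a,europe-1b,europe-1c
```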
System Components
The gardenlet performs the following changes on all namespaces running shoot system components:
- adds the label high-availability-config.resources.gardener.cloud/consider=true. This makes the webhook mutate the replica count and the topology spread constraints.
- adds the annotation high-availability-config.resources.gardener.cloud/zones=<zones> where <zones> is the merged list of zones provided in .zones[] with systemComponents.allow=true for all worker pools in .spec.provider.workers[] in the Shoot specification.
Note that neither the high-availability-config.resources.gardener.cloud/failure-tolerance-type, nor the high-availability-config.resources.gardener.cloud/zone-pinning annotations are set, hence the node affinity would never be touched by the webhook.