When a shoot cluster is deleted, Gardener tries to gracefully remove most of the Kubernetes resources inside the cluster.
This prevents infrastructure or other artifacts from remaining after the shoot deletion.
The cleanup is performed in four steps.
Some resources are deleted with a grace period, and all resources are forcefully deleted (by removing blocking finalizers) after some time so that the cluster deletion is not blocked entirely.
Cleanup steps:
1. All ValidatingWebhookConfigurations and MutatingWebhookConfigurations are deleted with a 5m grace period. Forceful finalization happens after 5m.
2. All APIServices and CustomResourceDefinitions are deleted with a 5m grace period. Forceful finalization happens after 1h.
3. All CronJobs, DaemonSets, Deployments, Ingresses, Jobs, Pods, ReplicaSets, ReplicationControllers, Services, StatefulSets, and PersistentVolumeClaims are deleted with a 5m grace period. Forceful finalization happens after 5m.
If the Shoot is annotated with shoot.gardener.cloud/skip-cleanup=true, then only Services and PersistentVolumeClaims are considered.
4. All VolumeSnapshots and VolumeSnapshotContents are deleted with a 5m grace period. Forceful finalization happens after 1h.
It is possible to override the finalization grace periods via annotations on the Shoot:
- shoot.gardener.cloud/cleanup-webhooks-finalize-grace-period-seconds (for the resources handled in step 1)
- shoot.gardener.cloud/cleanup-extended-apis-finalize-grace-period-seconds (for the resources handled in step 2)
- shoot.gardener.cloud/cleanup-kubernetes-resources-finalize-grace-period-seconds (for the resources handled in step 3)
⚠️ If "0" is provided, then all resources are finalized immediately without waiting for any graceful deletion.
Please be aware that this might lead to orphaned infrastructure artifacts.
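For example, a minimal sketch of a Shoot manifest that shortens two of the grace periods (all names and values here are illustrative):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot                 # hypothetical shoot name
  namespace: garden-my-project   # hypothetical project namespace
  annotations:
    # finalize webhooks (step 1) after 60 seconds instead of 5 minutes
    shoot.gardener.cloud/cleanup-webhooks-finalize-grace-period-seconds: "60"
    # finalize extended APIs (step 2) after 10 minutes instead of 1 hour
    shoot.gardener.cloud/cleanup-extended-apis-finalize-grace-period-seconds: "600"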
containerd Registry Configuration
containerd supports configuring registries and mirrors. Using this native containerd feature, Shoot owners can configure containerd to use public or private mirrors for a given upstream registry. More details about the registry configuration can be found in the corresponding upstream documentation.
containerd Registry Configuration Patterns
At the time of writing this document, containerd supports two patterns for configuring registries/mirrors.
Note: Using both patterns at the same time is not supported by containerd. Strictly follow only one of the configuration patterns.
Old and Deprecated Pattern
The old and deprecated pattern is specifying registry.mirrors and registry.configs in the containerd’s config.toml file. See the upstream documentation.
Example of the old and deprecated pattern:
version = 2

[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://public-mirror.example.com"]
In the above example, containerd is configured to first try to pull docker.io images from a configured endpoint (https://public-mirror.example.com). If the image is not available in https://public-mirror.example.com, then containerd will fall back to the upstream registry (docker.io) and will pull the image from there.
Hosts Directory Pattern
The hosts directory pattern is the new and recommended pattern for configuring registries. It is available starting with containerd@v1.5.0. See the upstream documentation.
The above example in the hosts directory pattern looks as follows.
The /etc/containerd/config.toml file has the following section:
version = 2
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
The following hosts directory structure has to be created:
$ tree /etc/containerd/certs.d
/etc/containerd/certs.d
└── docker.io
└── hosts.toml
Finally, for the docker.io upstream registry, we configure a hosts.toml file as follows:
server = "https://registry-1.docker.io"

[host."https://public-mirror.example.com"]
  capabilities = ["pull", "resolve"]
Configuring containerd Registries for a Shoot
Gardener supports configuring containerd registries on a Shoot using the new hosts directory pattern. For each Shoot Node, Gardener creates the /etc/containerd/certs.d directory and adds the following section to containerd’s /etc/containerd/config.toml file:
[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
This allows Shoot owners to use the hosts directory pattern to configure registries for containerd. To do this, the Shoot owners need to create a directory under /etc/containerd/certs.d that is named with the upstream registry host name. In the newly created directory, a hosts.toml file needs to be created. For more details, see the hosts directory pattern section and the upstream documentation.
The registry-cache Extension
There is a Gardener-native extension named registry-cache that supports configuring containerd registry mirrors based on the above-described contract. The feature was added in registry-cache@v0.6.0.
With the reversed VPN tunnel, there are no endpoints with open ports in the shoot cluster required by Gardener.
In order to allow communication to the shoot’s control plane in the seed cluster, there are endpoints shared by multiple shoots of a seed cluster.
Depending on the configured zones or exposure classes, there are different endpoints in a seed cluster. The IP address(es) can be determined by a DNS query for the API Server URL.
The main entry-point into the seed cluster is the load balancer of the Istio ingress-gateway service. Depending on the infrastructure provider, there can be one IP address per zone.
The load balancer of the Istio ingress-gateway service exposes the following TCP ports:
443 for requests to the shoot API Server. The request is dispatched according to the set TLS SNI extension.
8443 for requests to the shoot API Server via api-server-proxy, dispatched based on the proxy protocol target, which is the IP address of kubernetes.default.svc.cluster.local in the shoot.
8132 to establish the reversed VPN connection. It’s dispatched according to an HTTP header value.
kube-apiserver via SNI
DNS entries for api.<external-domain> and api.<shoot>.<project>.<internal-domain> point to the load balancer of an Istio ingress-gateway service.
The Kubernetes client sets the server name to api.<external-domain> or api.<shoot>.<project>.<internal-domain>.
Based on SNI, the connection is forwarded to the respective API Server at TCP layer. There is no TLS termination at the Istio ingress-gateway.
TLS termination happens on the shoot’s API Server. Traffic is end-to-end encrypted between the client and the API Server. The certificate authority and authentication are defined in the corresponding kubeconfig.
Details can be found in GEP-08.
kube-apiserver via apiserver-proxy
Inside the shoot cluster, the API Server can also be reached by the cluster internal name kubernetes.default.svc.cluster.local.
The apiserver-proxy pods are deployed in the host network as a DaemonSet and intercept connections to the Kubernetes service IP address.
The destination address is changed to the cluster IP address of the service kube-apiserver.<shoot-namespace>.svc.cluster.local in the seed cluster.
The connections are forwarded to the Istio ingress-gateway in the seed cluster using the HAProxy PROXY protocol.
The Istio ingress-gateway forwards the connection to the respective shoot API Server by its cluster IP address.
As TLS termination happens at the API Server, the traffic is end-to-end encrypted the same way as with SNI.
As the API Server has to be able to connect to endpoints in the shoot cluster, a VPN connection is established.
This VPN connection is initiated from a VPN client in the shoot cluster.
The VPN client connects to the Istio ingress-gateway and is forwarded to the VPN server in the control-plane namespace of the shoot.
Once the VPN tunnel between the VPN client in the shoot and the VPN server in the seed cluster is established, the API Server can connect to nodes, services and pods in the shoot cluster.
In case a Shoot cluster uses containerd, it is possible to make the containerd process load custom configuration files.
Gardener initializes containerd with the following statement:
imports = ["/etc/containerd/conf.d/*.toml"]
This means that all *.toml files in the /etc/containerd/conf.d directory will be imported and merged with the default configuration.
To prevent unintended configuration overwrites, please be aware that containerd merges config sections, not individual keys (see here and here).
Please consult the upstream containerd documentation for more information.
⚠️ Note that this only applies to nodes which were newly created after gardener/gardener@v1.51 was deployed. Existing nodes are not affected.
Necessary Labeling for Custom CSI Components
Some provider extensions for Gardener are using CSI components to manage persistent volumes in the shoot clusters.
Additionally, most of the provider extensions are deploying controllers for taking volume snapshots (CSI snapshotter).
End-users can deploy their own CSI components and controllers into shoot clusters.
In such situations, there are multiple controllers acting on the VolumeSnapshot custom resources (each responsible for those instances associated with their respective driver provisioner types).
However, this might lead to operational conflicts that cannot be overcome by Gardener alone.
Concretely, Gardener cannot know which custom CSI components were installed by end-users, which can lead to issues, especially during shoot cluster deletion.
You can add a label to your custom CSI components indicating that Gardener should not try to remove them during shoot cluster deletion. This means you have to take care of the lifecycle for these components yourself!
Recommendations
Custom CSI components are typically regular Deployments running in the shoot clusters.
Please label them with the shoot.gardener.cloud/no-cleanup=true label.
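A minimal sketch of such a labeled component, assuming a hypothetical CSI controller Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-csi-controller   # hypothetical component
  namespace: kube-system
  labels:
    # tells Gardener to skip this resource during shoot deletion
    shoot.gardener.cloud/no-cleanup: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-csi-controller
  template:
    metadata:
      labels:
        app: my-csi-controller
    spec:
      containers:
      - name: controller
        image: registry.example.com/my-csi-controller:v1.0.0   # hypothetical image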
Background Information
When a shoot cluster is deleted, Gardener deletes most Kubernetes resources (Deployments, DaemonSets, StatefulSets, etc.). Gardener will also try to delete CSI components if they are not marked with the above-mentioned label.
This can result in VolumeSnapshot resources still having finalizers that will never be cleaned up.
Consequently, manual intervention is required to clean them up before the cluster deletion can continue.
Readiness of Shoot Worker Nodes
Background
When registering new Nodes, kubelet adds the node.kubernetes.io/not-ready taint to prevent scheduling workload Pods to the Node until the Ready condition becomes True.
However, the kubelet does not consider the readiness of node-critical Pods.
Hence, the Ready condition might become True and the node.kubernetes.io/not-ready taint might be removed, for example, before the CNI daemon Pod (e.g., calico-node) has successfully placed the CNI binaries on the machine.
This problem has been discussed extensively in kubernetes, e.g., in kubernetes/kubernetes#75890.
However, several proposals have been rejected because the problem can be solved by using the --register-with-taints kubelet flag and dedicated controllers.
Implementation in Gardener
Gardener makes sure that workload Pods are only scheduled to Nodes where all node-critical components required for running workload Pods are ready.
For this, Gardener follows the proposed solution by the Kubernetes community and registers new Node objects with the node.gardener.cloud/critical-components-not-ready taint (effect NoSchedule).
gardener-resource-manager’s Node controller reacts on newly created Node objects that have this taint.
The controller removes the taint once all node-critical Pods are ready (determined by checking the Pods’ Ready conditions).
The Node controller considers as node-critical all DaemonSets and Pods that run in the kube-system namespace and are labeled with node.gardener.cloud/critical-component=true.
If there are DaemonSets that contain the node.gardener.cloud/critical-component=true label in their metadata and in their Pod template, the Node controller waits for corresponding daemon Pods to be scheduled and to get ready before removing the taint.
Additionally, the Node controller checks for the readiness of csi-driver-node components if a respective Pod indicates that it uses such a driver.
This is achieved through a well-defined annotation prefix (node.gardener.cloud/wait-for-csi-node-).
For example, the csi-driver-node Pod for Openstack Cinder is annotated with node.gardener.cloud/wait-for-csi-node-cinder=cinder.csi.openstack.org.
A key prefix is used instead of a “regular” annotation to allow for multiple CSI drivers being registered by one csi-driver-node Pod.
The annotation key’s suffix can be chosen arbitrarily (in this case cinder) and the annotation value needs to match the actual driver name as specified in the CSINode object.
The Node controller will verify that the used driver is properly registered in this object before removing the node.gardener.cloud/critical-components-not-ready taint.
Note that the csi-driver-node Pod still needs to be labelled and tolerate the taint as described above to be considered in this additional check.
Marking Node-Critical Components
To make use of this feature, node-critical DaemonSets and Pods need to:
- Tolerate the node.gardener.cloud/critical-components-not-ready taint (effect NoSchedule).
- Be labelled with node.gardener.cloud/critical-component=true.
- Be placed in the kube-system namespace.
csi-driver-node Pods additionally need to:
- Be annotated with node.gardener.cloud/wait-for-csi-node-<name>=<full-driver-name>.
- Fulfill the above criteria (label and toleration) as well; a manifest sketch follows this list.
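A minimal sketch of a node-critical csi-driver-node DaemonSet, assuming a hypothetical driver named driver.example.com:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: csi-driver-node-example   # hypothetical name
  namespace: kube-system
  labels:
    node.gardener.cloud/critical-component: "true"
spec:
  selector:
    matchLabels:
      app: csi-driver-node-example
  template:
    metadata:
      labels:
        app: csi-driver-node-example
        node.gardener.cloud/critical-component: "true"
      annotations:
        # the suffix (example) is arbitrary; the value must match the driver name in the CSINode object
        node.gardener.cloud/wait-for-csi-node-example: driver.example.com
    spec:
      tolerations:
      - key: node.gardener.cloud/critical-components-not-ready
        operator: Exists
        effect: NoSchedule
      containers:
      - name: csi-node-driver
        image: registry.example.com/csi-node-driver:v1.0.0   # hypothetical image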
Gardener already marks components like kube-proxy, apiserver-proxy and node-local-dns as node-critical.
Provider extensions mark components like csi-driver-node as node-critical and add the wait-for-csi-node annotation.
Network extensions mark components responsible for setting up CNI on worker Nodes (e.g., calico-node) as node-critical.
If shoot owners manage any additional node-critical components, they can make use of this feature as well.
Taints and Tolerations for Seeds and Shoots
Similar to taints and tolerations for Nodes and Pods in Kubernetes, the Seed resource supports specifying taints (.spec.taints, see this example) while the Shoot resource supports specifying tolerations (.spec.tolerations, see this example).
The feature is used to control scheduling to seeds as well as decisions whether a shoot can use a certain seed.
Compared to Kubernetes, Gardener’s taints and tolerations are heavily stripped down right now and have some behavioral differences.
Please read the following explanations carefully if you plan to use them.
Scheduling
When scheduling a new shoot, the gardener-scheduler will filter out all seed candidates whose taints are not tolerated by the shoot.
As Gardener’s taints/tolerations don’t support effects yet, you can compare this behaviour with using a NoSchedule effect taint in Kubernetes.
Be reminded that taints/tolerations are not a means to define any affinity or selection for seeds - please use .spec.seedSelector in the Shoot to state such desires.
⚠️ Please note that - unlike how it’s implemented in Kubernetes - a certain seed cluster may only be used when the shoot tolerates all the seed’s taints.
This means that specifying .spec.seedName for a seed whose taints are not tolerated will make the gardener-apiserver reject the request.
Consequently, the taints/tolerations feature can be used as means to restrict usage of certain seeds.
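For illustration, a sketch of the corresponding snippets (seed.gardener.cloud/protected is used as an example key):

# Seed
spec:
  taints:
  - key: seed.gardener.cloud/protected

# Shoot
spec:
  tolerations:
  - key: seed.gardener.cloud/protected   # the shoot must tolerate all of the seed's taints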
Toleration Defaults and Whitelist
The Project resource features a .spec.tolerations object that may carry defaults and a whitelist (see this example).
The corresponding ShootTolerationRestriction admission plugin (cf. Kubernetes’ PodTolerationRestriction admission plugin) is responsible for evaluating these settings during creation/update of Shoots.
Whitelist
If a shoot gets created or updated with tolerations, then it is validated that only those tolerations may be used that were added to either a) the Project’s .spec.tolerations.whitelist, or b) to the global whitelist in the ShootTolerationRestriction’s admission config (see this example).
⚠️ Please note that the tolerations whitelist of Projects can only be changed if the user trying to change it is bound to the modify-spec-tolerations-whitelist custom RBAC role, e.g., via the following ClusterRole:
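A sketch of such a ClusterRole (the name is freely chosen):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: project-tolerations-whitelist-modifier   # hypothetical name
rules:
- apiGroups:
  - core.gardener.cloud
  resources:
  - projects
  verbs:
  - modify-spec-tolerations-whitelist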
If a shoot gets created, then the default tolerations specified in both the Project’s .spec.tolerations.defaults and the global default list in the ShootTolerationRestriction admission plugin’s configuration will be added to the .spec.tolerations of the Shoot (unless it already specifies a certain key).
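A sketch of the corresponding Project section, assuming the same example toleration key as above:

apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
  name: my-project   # hypothetical name
spec:
  tolerations:
    defaults:
    - key: seed.gardener.cloud/protected    # added to every new Shoot in this project
    whitelist:
    - key: seed.gardener.cloud/protected    # allowed on Shoots in this project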
Package v1alpha1 is a version of the API.
“authentication.gardener.cloud/v1alpha1” API is already used for CRD registration and must not be served by the API server.
Resource Types:
AdminKubeconfigRequest
AdminKubeconfigRequest can be used to request a kubeconfig with admin credentials
for a Shoot cluster.
Spec is the specification of the AdminKubeconfigRequest.
expirationSeconds int64
(Optional)
ExpirationSeconds is the requested validity duration of the credential. The
credential issuer may return a credential with a different validity duration so a
client needs to check the ‘expirationTimestamp’ field in a response.
Defaults to 1 hour.
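A sketch of such a request body; it is typically sent to the shoots/adminkubeconfig subresource (e.g., via kubectl create --raw) rather than persisted as a regular object:

apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
  expirationSeconds: 3600   # request a kubeconfig valid for 1 hour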
AdminKubeconfigRequestSpec contains the expiration time of the kubeconfig.
expirationSeconds int64
(Optional)
ExpirationSeconds is the requested validity duration of the credential. The
credential issuer may return a credential with a different validity duration so a
client needs to check the ‘expirationTimestamp’ field in a response.
Defaults to 1 hour.
Spec is the specification of the ViewerKubeconfigRequest.
expirationSeconds int64
(Optional)
ExpirationSeconds is the requested validity duration of the credential. The
credential issuer may return a credential with a different validity duration so a
client needs to check the ‘expirationTimestamp’ field in a response.
Defaults to 1 hour.
ViewerKubeconfigRequestSpec contains the expiration time of the kubeconfig.
expirationSeconds int64
(Optional)
ExpirationSeconds is the requested validity duration of the credential. The
credential issuer may return a credential with a different validity duration so a
client needs to check the ‘expirationTimestamp’ field in a response.
Defaults to 1 hour.
SeedSelector contains an optional list of labels on Seed resources that marks those seeds whose shoots may use this provider profile.
An empty list means that all seeds of the same provider type are supported.
This is useful for environments that are of the same type (like openstack) but may have different “instances”/landscapes.
Optionally a list of possible providers can be added to enable cross-provider scheduling. By default, the provider
type of the seed must match the shoot’s provider.
Refer to the Kubernetes API documentation for the fields of the
metadata field.
immutable bool
(Optional)
Immutable, if set to true, ensures that data stored in the Secret cannot
be updated (only object metadata can be modified).
If not set to true, the field can be modified at any time.
Defaulted to nil.
data map[string][]byte
(Optional)
Data contains the secret data. Each key must consist of alphanumeric
characters, ‘-’, ‘_’ or ‘.’. The serialized form of the secret data is a
base64 encoded string, representing the arbitrary (possibly non-string)
data value here. Described in https://tools.ietf.org/html/rfc4648#section-4
stringData map[string]string
(Optional)
stringData allows specifying non-binary secret data in string form.
It is provided as a write-only input field for convenience.
All keys and values are merged into the data field on write, overwriting any existing values.
The stringData field is never output when reading from the API.
Owner is a subject representing a user name, an email address, or any other identifier of a user owning
the project.
IMPORTANT: Be aware that this field will be removed in the v1 version of this API in favor of the owner
role. The only way to change the owner will be by moving the owner role. In this API version the only way
to change the owner is to use this field.
TODO: Remove this field in favor of the owner role in v1.
purpose string
(Optional)
Purpose is a human-readable explanation of the project’s purpose.
Members is a list of subjects representing a user name, an email address, or any other identifier of a user,
group, or service account that has a certain role.
namespace string
(Optional)
Namespace is the name of the namespace that has been created for the Project object.
A nil value means that Gardener will determine the name of the namespace.
This field is immutable.
Backup holds the object store configuration for the backups of shoot (currently only etcd).
If it is not specified, then there won’t be any backups taken for shoots associated with this seed.
If backup field is present in seed, then backups of the etcd from shoot control plane will be stored
under the configured object store.
Region is a name of a region. This field is immutable.
secretBindingName string
(Optional)
SecretBindingName is the name of a SecretBinding that has a reference to the provider secret.
The credentials inside the provider secret will be used to create the shoot in the respective account.
The field is mutually exclusive with CredentialsBindingName.
This field is immutable.
seedName string
(Optional)
SeedName is the name of the seed cluster that runs the control plane of the Shoot.
ControlPlane contains general settings for the control plane of the shoot.
schedulerName string
(Optional)
SchedulerName is the name of the responsible scheduler which schedules the shoot.
If not specified, the default scheduler takes over.
This field is immutable.
CloudProfile contains a reference to a CloudProfile or a NamespacedCloudProfile.
credentialsBindingName string
(Optional)
CredentialsBindingName is the name of a CredentialsBinding that has a reference to the provider credentials.
The credentials will be used to create the shoot in the respective account. The field is mutually exclusive with SecretBindingName.
LastError holds information about the last occurred error during an operation.
observedGeneration int64
(Optional)
ObservedGeneration is the most recent generation observed for this BackupBucket. It corresponds to the
BackupBucket’s generation, which is updated on mutation by the API Server.
LastError holds information about the last occurred error during an operation.
observedGeneration int64
(Optional)
ObservedGeneration is the most recent generation observed for this BackupEntry. It corresponds to the
BackupEntry’s generation, which is updated on mutation by the API Server.
seedName string
(Optional)
SeedName is the name of the seed to which this BackupEntry is currently scheduled. This field is populated
at the beginning of a create/reconcile operation. It is used when moving the BackupEntry between seeds.
SeedSelector contains an optional list of labels on Seed resources that marks those seeds whose shoots may use this provider profile.
An empty list means that all seeds of the same provider type are supported.
This is useful for environments that are of the same type (like openstack) but may have different “instances”/landscapes.
Optionally a list of possible providers can be added to enable cross-provider scheduling. By default, the provider
type of the seed must match the shoot’s provider.
MaxNodeProvisionTime defines how long CA waits for a node to be provisioned (default: 20 mins).
maxGracefulTerminationSeconds int32
(Optional)
MaxGracefulTerminationSeconds is the number of seconds CA waits for pod termination when trying to scale down a node (default: 600).
ignoreTaints []string
(Optional)
IgnoreTaints specifies a list of taint keys to ignore in node templates when considering to scale a node group.
Deprecated: Ignore taints are deprecated as of Kubernetes 1.29 and are treated as startup taints.
NewPodScaleUpDelay specifies how long CA should ignore newly created pods before they have to be considered for scale-up (default: 0s).
maxEmptyBulkDelete int32
(Optional)
MaxEmptyBulkDelete specifies the maximum number of empty nodes that can be deleted at the same time (default: 10).
ignoreDaemonsetsUtilization bool
(Optional)
IgnoreDaemonsetsUtilization allows CA to ignore DaemonSet pods when calculating resource utilization for scaling down (default: false).
verbosity int32
(Optional)
Verbosity allows CA to modify its log level (default: 2).
startupTaints []string
(Optional)
StartupTaints specifies a list of taint keys to ignore in node templates when considering to scale a node group.
Cluster Autoscaler treats nodes tainted with startup taints as unready, but takes them into account during scale-up logic, assuming they will become ready shortly.
statusTaints []string
(Optional)
StatusTaints specifies a list of taint keys to ignore in node templates when considering to scale a node group.
Cluster Autoscaler internally treats nodes tainted with status taints as ready, but filters them out during scale-up logic.
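A sketch of how these settings might look in a Shoot specification (the values mirror the defaults mentioned above):

spec:
  kubernetes:
    clusterAutoscaler:
      maxNodeProvisionTime: 20m
      maxGracefulTerminationSeconds: 600
      maxEmptyBulkDelete: 10
      ignoreDaemonsetsUtilization: false
      verbosity: 2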
SeedSelector contains an optional label selector for seeds. Only if the labels match will this controller be
considered for a deployment.
An empty list means that all seeds are selected.
ControllerResource is a combination of a kind (DNSProvider, Infrastructure, Generic, …) and the actual type for this
kind (aws-route53, gcp, auditlog, …).
kind string
Kind is the resource kind, for example “OperatingSystemConfig”.
type string
Type is the resource type, for example “coreos” or “ubuntu”.
globallyEnabled bool
(Optional)
GloballyEnabled determines if this ControllerResource is required by all Shoot clusters.
This field is defaulted to false when kind is “Extension”.
ReconcileTimeout defines how long Gardener should wait for the resource reconciliation.
This field is defaulted to 3m0s when kind is “Extension”.
primary bool
(Optional)
Primary determines if the controller backed by this ControllerRegistration is responsible for the extension
resource’s lifecycle. This field defaults to true. There must be exactly one primary controller for this kind/type
combination. This field is immutable.
Lifecycle defines a strategy that determines when different operations on a ControllerResource should be performed.
This field is defaulted in the following way when kind is “Extension”.
Reconcile: “AfterKubeAPIServer”
Delete: “BeforeKubeAPIServer”
Migrate: “BeforeKubeAPIServer”
workerlessSupported bool
(Optional)
WorkerlessSupported specifies whether this ControllerResource supports Workerless Shoot clusters.
This field is only relevant when kind is “Extension”.
The mode of the autoscaling to be used for the Core DNS components running in the data plane of the Shoot cluster.
Supported values are horizontal and cluster-proportional.
CoreDNSRewriting contains the setting related to rewriting requests, which are obviously incorrect due to the unnecessary application of the search path.
commonSuffixes []string
(Optional)
CommonSuffixes are expected to be the suffix of a fully qualified domain name. Each suffix should contain at least one or two dots (‘.’) to prevent accidental clashes.
DNS holds information about the provider, the hosted zone id and the domain.
domain string
(Optional)
Domain is the external available domain of the Shoot cluster. This domain will be written into the
kubeconfig that is handed out to end-users. This field is immutable.
Providers is a list of DNS providers that shall be enabled for this shoot cluster. Only relevant if a default
domain is not used.
Deprecated: Configuring multiple DNS providers is deprecated and will be forbidden in a future release.
Please use the DNS extension provider config (e.g. shoot-dns-service) for additional providers.
Domains contains information about which domains shall be included/excluded for this provider.
Deprecated: This field is deprecated and will be removed in a future release.
Please use the DNS extension provider config (e.g. shoot-dns-service) for additional configuration.
primary bool
(Optional)
Primary indicates that this DNSProvider is used for shoot related domains.
Deprecated: This field is deprecated and will be removed in a future release.
Please use the DNS extension provider config (e.g. shoot-dns-service) for additional and non-primary providers.
secretName string
(Optional)
SecretName is a name of a secret containing credentials for the stated domain and the
provider. When not specified, the Gardener will use the cloud provider credentials referenced
by the Shoot and try to find respective credentials there (primary provider only). Specifying this field may override
this behavior, i.e. forcing the Gardener to only look into the given secret.
Zones contains information about which hosted zones shall be included/excluded for this provider.
Deprecated: This field is deprecated and will be removed in a future release.
Please use the DNS extension provider config (e.g. shoot-dns-service) for additional configuration.
EncryptionConfig contains customizable encryption configuration of the API server.
resources []string
Resources contains the list of resources that shall be encrypted in addition to secrets.
Each item is a Kubernetes resource name in plural (resource or resource.group) that should be encrypted.
Note that configuring a custom resource is only supported for versions >= 1.26.
Wildcards are not supported for now.
See https://github.com/gardener/gardener/blob/master/docs/usage/security/etcd_encryption_config.md for more details.
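A sketch of such a configuration in a Shoot specification (the chosen resources are examples):

spec:
  kubernetes:
    kubeAPIServer:
      encryptionConfig:
        resources:
        - configmaps
        - statefulsets.apps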
Data contains the payload required to generate resources
labels map[string]string
(Optional)
Labels are labels of the object
HelmControllerDeployment
HelmControllerDeployment configures how an extension controller is deployed using helm.
This is the legacy structure that used to be defined in gardenlet’s ControllerInstallation controller for
ControllerDeployment’s with type=helm.
While this is not a proper API type, we need to define the structure in the API package so that we can convert it
to the internal API version in the new representation.
Hibernation contains information whether the Shoot is suspended or not.
enabled bool
(Optional)
Enabled specifies whether the Shoot needs to be hibernated or not. If it is true, the Shoot’s desired state is to be hibernated.
If it is false or nil, the Shoot’s desired state is to be awakened.
HibernationSchedule determines the hibernation schedule of a Shoot.
A Shoot will be regularly hibernated at each start time and will be woken up at each end time.
Start or End can be omitted, though at least one of each has to be specified.
start string
(Optional)
Start is a Cron spec at which time a Shoot will be hibernated.
end string
(Optional)
End is a Cron spec at which time a Shoot will be woken up.
location string
(Optional)
Location is the time location in which both start and end shall be evaluated.
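A sketch of a hibernation schedule in a Shoot specification (the cron values are examples):

spec:
  hibernation:
    schedules:
    - start: "0 20 * * *"        # hibernate daily at 8 PM
      end: "0 6 * * *"           # wake up daily at 6 AM
      location: "Europe/Berlin"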
HighAvailability specifies the configuration settings for high availability for a resource. Typical
usages could be to configure HA for shoot control plane or for seed system components.
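A sketch of a high-availability setting for a shoot control plane (assuming a zone failure tolerance):

spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone   # or: node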
HorizontalPodAutoscalerConfig contains horizontal pod autoscaler configuration settings for the kube-controller-manager.
Note: Descriptions were taken from the Kubernetes documentation.
The configurable period at which the horizontal pod autoscaler considers a Pod “not yet ready” given that it’s unready and it has transitioned to unready during that time.
Ingress configures the Ingress specific settings of the cluster
domain string
Domain specifies the IngressDomain of the cluster pointing to the ingress controller endpoint. It will be used
to construct ingress URLs for system applications running in Shoot/Garden clusters. Once set this field is immutable.
AdmissionPlugins contains the list of user-defined admission plugins (additional to those managed by Gardener), and, if desired, the corresponding
configuration.
apiAudiences []string
(Optional)
APIAudiences are the identifiers of the API. The service account token authenticator will
validate that tokens used against the API are bound to at least one of these audiences.
Defaults to [“kubernetes”].
OIDCConfig contains configuration settings for the OIDC provider.
Deprecated: This field is deprecated and will be forbidden starting from Kubernetes 1.32.
Please configure and use structured authentication instead of oidc flags.
For more information check https://github.com/gardener/gardener/issues/9858
TODO(AleksandarSavchev): Drop this field after support for Kubernetes 1.31 is dropped.
runtimeConfig map[string]bool
(Optional)
RuntimeConfig contains information about enabled or disabled APIs.
WatchCacheSizes contains configuration of the API server’s watch cache sizes.
Configuring these flags might be useful for large-scale Shoot clusters with a lot of parallel update requests
and a lot of watching controllers (e.g. large ManagedSeed clusters). When the API server’s watch cache’s
capacity is too small to cope with the amount of update requests and watchers for a particular resource, it
might happen that controller watches are permanently stopped with too old resource version errors.
Starting from kubernetes v1.19, the API server’s watch cache size is adapted dynamically and setting the watch
cache size flags will have no effect, except when setting it to 0 (which disables the watch cache).
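A sketch of such a configuration in a Shoot specification (sizes are illustrative):

spec:
  kubernetes:
    kubeAPIServer:
      watchCacheSizes:
        default: 100
        resources:
        - resource: secrets   # legacy core API group
          size: 500
        - resource: deployments
          apiGroup: apps
          size: 500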
Logging contains configuration for the log level and HTTP access logs.
defaultNotReadyTolerationSeconds int64
(Optional)
DefaultNotReadyTolerationSeconds indicates the tolerationSeconds of the toleration for notReady:NoExecute
that is added by default to every pod that does not already have such a toleration (flag --default-not-ready-toleration-seconds).
The field has effect only when the DefaultTolerationSeconds admission plugin is enabled.
Defaults to 300.
defaultUnreachableTolerationSeconds int64
(Optional)
DefaultUnreachableTolerationSeconds indicates the tolerationSeconds of the toleration for unreachable:NoExecute
that is added by default to every pod that does not already have such a toleration (flag --default-unreachable-toleration-seconds).
The field has effect only when the DefaultTolerationSeconds admission plugin is enabled.
Defaults to 300.
StructuredAuthentication contains configuration settings for structured authentication for the kube-apiserver.
This field is only available for Kubernetes v1.30 or later.
StructuredAuthorization contains configuration settings for structured authorization for the kube-apiserver.
This field is only available for Kubernetes v1.30 or later.
PodEvictionTimeout defines the grace period for deleting pods on failed nodes. Defaults to 2m.
Deprecated: The corresponding kube-controller-manager flag --pod-eviction-timeout is deprecated
in favor of the kube-apiserver flags --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.
The --pod-eviction-timeout flag has no effect when taint-based eviction is enabled. Taint-based
eviction is beta (enabled by default) since Kubernetes 1.13 and GA since Kubernetes 1.18. Hence,
instead of setting this field, set the spec.kubernetes.kubeAPIServer.defaultNotReadyTolerationSeconds and
spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds.
Mode specifies which proxy mode to use.
Defaults to IPTables.
enabled bool
(Optional)
Enabled indicates whether kube-proxy should be deployed or not.
Depending on the networking extensions, switching kube-proxy off might be rejected. Consulting the respective documentation of the used networking extension is recommended before using this field.
Defaults to true if not specified.
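A sketch of the corresponding Shoot section:

spec:
  kubernetes:
    kubeProxy:
      enabled: true
      mode: IPVS   # or: IPTables (the default)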
(Members of KubernetesConfig are embedded into this type.)
kubeMaxPDVols string
(Optional)
KubeMaxPDVols allows to configure the KUBE_MAX_PD_VOLS environment variable for the kube-scheduler.
Please find more information here: https://kubernetes.io/docs/concepts/storage/storage-limits/#custom-limits
Note that using this field is considered alpha-/experimental-level and is at your own risk. You should be aware
of all the side-effects and consequences when changing it.
Profile configures the scheduling profile for the cluster.
If not specified, the used profile is “balanced” (provides the default kube-scheduler behavior).
EvictionHard describes a set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a Pod eviction.
Default:
memory.available: “100Mi/1Gi/5%”
nodefs.available: “5%”
nodefs.inodesFree: “5%”
imagefs.available: “5%”
imagefs.inodesFree: “5%”
evictionMaxPodGracePeriod int32
(Optional)
EvictionMaxPodGracePeriod describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
Default: 90
EvictionMinimumReclaim configures the amount of resources below the configured eviction threshold that the kubelet attempts to reclaim whenever the kubelet observes resource pressure.
Default: 0 for each resource
EvictionPressureTransitionPeriod is the duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.
Default: 4m0s
EvictionSoft describes a set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a Pod eviction.
Default:
memory.available: “200Mi/1.5Gi/10%”
nodefs.available: “10%”
nodefs.inodesFree: “10%”
imagefs.available: “10%”
imagefs.inodesFree: “10%”
EvictionSoftGracePeriod describes a set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
Default:
memory.available: 1m30s
nodefs.available: 1m30s
nodefs.inodesFree: 1m30s
imagefs.available: 1m30s
imagefs.inodesFree: 1m30s
maxPods int32
(Optional)
MaxPods is the maximum number of Pods that are allowed by the Kubelet.
Default: 110
podPidsLimit int64
(Optional)
PodPIDsLimit is the maximum number of process IDs per pod allowed by the kubelet.
failSwapOn bool
(Optional)
FailSwapOn makes the Kubelet fail to start if swap is enabled on the node. (default true).
KubeReserved is the configuration for resources reserved for kubernetes node components (mainly kubelet and container runtime).
When updating these values, be aware that cgroup resizes may not succeed on active worker nodes. Look for the NodeAllocatableEnforced event to determine if the configuration was applied.
Default: cpu=80m,memory=1Gi,pid=20k
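A sketch of a kubeReserved configuration in a Shoot specification (the values mirror the defaults above):

spec:
  kubernetes:
    kubelet:
      kubeReserved:
        cpu: 80m
        memory: 1Gi
        pid: 20k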
SystemReserved is the configuration for resources reserved for system processes not managed by kubernetes (e.g. journald).
When updating these values, be aware that cgroup resizes may not succeed on active worker nodes. Look for the NodeAllocatableEnforced event to determine if the configuration was applied.
Deprecated: Separately configuring resource reservations for system processes is deprecated in Gardener and will be forbidden starting from Kubernetes 1.31.
Please merge existing resource reservations into the kubeReserved field.
TODO(MichaelEischer): Drop this field after support for Kubernetes 1.30 is dropped.
imageGCHighThresholdPercent int32
(Optional)
ImageGCHighThresholdPercent describes the percent of the disk usage which triggers image garbage collection.
Default: 50
imageGCLowThresholdPercent int32
(Optional)
ImageGCLowThresholdPercent describes the percent of the disk to which garbage collection attempts to free.
Default: 40
serializeImagePulls bool
(Optional)
SerializeImagePulls describes whether the images are pulled one at a time.
Default: true
registryPullQPS int32
(Optional)
RegistryPullQPS is the limit of registry pulls per second. The value must not be a negative number.
Setting it to 0 means no limit.
Default: 5
registryBurst int32
(Optional)
RegistryBurst is the maximum size of bursty pulls, temporarily allows pulls to burst to this number,
while still not exceeding registryPullQPS. The value must not be a negative number.
Only used if registryPullQPS is greater than 0.
Default: 10
seccompDefault bool
(Optional)
SeccompDefault enables the use of RuntimeDefault as the default seccomp profile for all workloads.
This requires the corresponding SeccompDefault feature gate to be enabled as well.
This field is only available for Kubernetes v1.25 or later.
StreamingConnectionIdleTimeout is the maximum time a streaming connection can be idle before the connection is automatically closed.
This field cannot be set lower than “30s” or greater than “4h”.
Default:
“4h” for Kubernetes < v1.26.
“5m” for Kubernetes >= v1.26.
Kubelet contains configuration settings for the kubelet.
version string
(Optional)
Version is the semantic Kubernetes version to use for the Shoot cluster.
Defaults to the highest supported minor and patch version given in the referenced cloud profile.
The version can be omitted completely or partially specified, e.g. <major>.<minor>.
VerticalPodAutoscaler contains the configuration flags for the Kubernetes vertical pod autoscaler.
enableStaticTokenKubeconfig bool
(Optional)
EnableStaticTokenKubeconfig indicates whether a static token kubeconfig secret will be created for the Shoot cluster.
Defaults to true for Shoots with Kubernetes versions < 1.26. Defaults to false for Shoots with Kubernetes versions >= 1.26.
Starting with Kubernetes 1.27, the field will be locked to false.
LoadBalancerServicesProxyProtocol controls whether ProxyProtocol is (optionally) allowed for the load balancer services.
allowed bool
Allowed controls whether the ProxyProtocol is optionally allowed for the load balancer services.
This should only be enabled if the load balancer services are already using ProxyProtocol or will be reconfigured to use it soon.
Until the load balancers are configured with ProxyProtocol, enabling this setting may allow clients to spoof their source IP addresses.
The option allows a migration from non-ProxyProtocol to ProxyProtocol without downtime (depending on the infrastructure).
Defaults to false.
Image holds information about the machine image to use for all nodes of this pool. It will default to the
latest version of the first image stated in the referenced CloudProfile if no value has been provided.
architecture string
(Optional)
Architecture is the CPU architecture of machines in this worker pool.
UpdateStrategy is the update strategy to use for the machine image. Possible values are:
- patch: update to the latest patch version of the current minor version.
- minor: update to the latest minor and patch version.
- major: always update to the overall latest version (default).
CRI is the list of container runtimes and interfaces supported by this version.
architectures []string
(Optional)
Architectures is the list of CPU architectures of the machine image in this version.
kubeletVersionConstraint string
(Optional)
KubeletVersionConstraint is a constraint describing the supported kubelet versions by the machine image in this version.
If the field is not specified, it is assumed that the machine image in this version supports all kubelet versions.
Examples:
- ‘>= 1.26’ - supports only kubelet versions greater than or equal to 1.26
- ‘< 1.26’ - supports only kubelet versions less than 1.26
TimeWindow contains information about the time window for maintenance operations.
confineSpecUpdateRollout bool
(Optional)
ConfineSpecUpdateRollout prevents changes/updates to the shoot specification from being rolled out immediately.
Instead, they are rolled out during the shoot’s maintenance time window. There is one exception that will trigger
an immediate rollout: changes to the Spec.Hibernation.Enabled field.
MaintenanceTimeWindow contains information about the time window for maintenance operations.
begin string
Begin is the beginning of the time window in the format HHMMSS+ZONE, e.g. “220000+0100”.
If not present, a random value will be computed.
end string
End is the end of the time window in the format HHMMSS+ZONE, e.g. “220000+0100”.
If not present, the value will be computed based on the “Begin” value.
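A sketch of a maintenance time window in a Shoot specification (times are examples):

spec:
  maintenance:
    timeWindow:
      begin: 220000+0100   # 10 PM (UTC+1)
      end: 230000+0100     # 11 PM (UTC+1)
    confineSpecUpdateRollout: true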
NetworkingStatus contains information about cluster networking such as CIDRs.
pods []string
(Optional)
Pods are the CIDRs of the pod network.
nodes []string
(Optional)
Nodes are the CIDRs of the node network.
services []string
(Optional)
Services are the CIDRs of the service network.
egressCIDRs []string
(Optional)
EgressCIDRs is a list of CIDRs used by the shoot as the source IP for egress traffic as reported by the used
Infrastructure extension controller. For certain environments the egress IPs may not be stable in which case the
extension controller may opt to not populate this field.
NodeLocalDNS contains the settings of the node local DNS components running in the data plane of the Shoot cluster.
enabled bool
Enabled indicates whether node local DNS is enabled or not.
forceTCPToClusterDNS bool
(Optional)
ForceTCPToClusterDNS indicates whether the connection from the node local DNS to the cluster DNS (Core DNS) will be forced to TCP or not.
Default, if unspecified, is to enforce TCP.
forceTCPToUpstreamDNS bool
(Optional)
ForceTCPToUpstreamDNS indicates whether the connection from the node local DNS to the upstream DNS (infrastructure DNS) will be forced to TCP or not.
Default, if unspecified, is to enforce TCP.
disableForwardToUpstreamDNS bool
(Optional)
DisableForwardToUpstreamDNS indicates whether requests from node local DNS to upstream DNS should be disabled.
Default, if unspecified, is to forward requests for external domains to upstream DNS.
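A sketch of enabling node-local DNS in a Shoot specification:

spec:
  systemComponents:
    nodeLocalDNS:
      enabled: true
      forceTCPToClusterDNS: true    # the default behavior when unspecified
      forceTCPToUpstreamDNS: true   # the default behavior when unspecified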
ClientAuthentication can optionally contain client configuration used for kubeconfig generation.
Deprecated: This field has no implemented use and will be forbidden starting from Kubernetes 1.31.
Its use was planned for generating an OIDC kubeconfig: https://github.com/gardener/gardener/issues/1433
TODO(AleksandarSavchev): Drop this field after support for Kubernetes 1.30 is dropped.
clientID string
(Optional)
The client ID for the OpenID Connect client, must be set.
groupsClaim string
(Optional)
If provided, the name of a custom OpenID Connect claim for specifying user groups. The claim value is expected to be a string or array of strings. This flag is experimental, please see the authentication documentation for further details.
groupsPrefix string
(Optional)
If provided, all groups will be prefixed with this value to prevent conflicts with other authentication strategies.
issuerURL string
(Optional)
The URL of the OpenID issuer, only HTTPS scheme will be accepted. Used to verify the OIDC JSON Web Token (JWT).
requiredClaims map[string]string
(Optional)
key=value pairs that describe a required claim in the ID Token. If set, the claim is verified to be present in the ID Token with a matching value.
signingAlgs []string
(Optional)
List of allowed JOSE asymmetric signing algorithms. JWTs with an ‘alg’ header value not in this list will be rejected. Values are defined by RFC 7518 https://tools.ietf.org/html/rfc7518#section-3.1
usernameClaim string
(Optional)
The OpenID claim to use as the user name. Note that claims other than the default (‘sub’) are not guaranteed to be unique and immutable. This flag is experimental, please see the authentication documentation for further details. (default “sub”)
usernamePrefix string
(Optional)
If provided, all usernames will be prefixed with this value. If not provided, username claims other than ‘email’ are prefixed by the issuer URL to avoid clashes. To skip any prefixing, provide the value ‘-’.
OpenIDConnectClientAuthentication contains configuration for OIDC clients.
extraConfig map[string]string
(Optional)
Extra configuration added to kubeconfig’s auth-provider.
Must not be any of idp-issuer-url, client-id, client-secret, idp-certificate-authority, idp-certificate-authority-data, id-token or refresh-token
Subject is representing a user name, an email address, or any other identifier of a user, group, or service
account that has a certain role.
role string
Role represents the role of this member.
IMPORTANT: Be aware that this field will be removed in the v1 version of this API in favor of the roles
list.
TODO: Remove this field in favor of the roles list in v1.
roles []string
(Optional)
Roles represents the list of roles of this member.
Owner is a subject representing a user name, an email address, or any other identifier of a user owning
the project.
IMPORTANT: Be aware that this field will be removed in the v1 version of this API in favor of the owner
role. The only way to change the owner will be by moving the owner role. In this API version the only way
to change the owner is to use this field.
TODO: Remove this field in favor of the owner role in v1.
purpose string
(Optional)
Purpose is a human-readable explanation of the project’s purpose.
Members is a list of subjects representing a user name, an email address, or any other identifier of a user,
group, or service account that has a certain role.
namespace string
(Optional)
Namespace is the name of the namespace that has been created for the Project object.
A nil value means that Gardener will determine the name of the namespace.
This field is immutable.
Whitelist contains a list of tolerations that are allowed to be added to the shoots in this project. Please note
that this list may only be added by users having the spec-tolerations-whitelist verb for project resources.
ControlPlaneConfig contains the provider-specific control plane config blob. Please look up the concrete
definition in the documentation of your provider extension.
InfrastructureConfig contains the provider-specific infrastructure config blob. Please look up the concrete
definition in the documentation of your provider extension.
ProxyMode available in Linux platform: ‘userspace’ (older, going to be EOL), ‘iptables’
(newer, faster), ‘ipvs’ (newest, better in performance and scalability).
As of now, only ‘iptables’ and ‘ipvs’ are supported by Gardener.
On Linux, if the iptables proxy is selected but the system’s kernel or iptables versions are
insufficient, kube-proxy always falls back to the userspace proxy. IPVS mode will be enabled when proxy mode is set to ‘ipvs’,
and the fallback path is firstly iptables and then userspace.
Zones is a list of availability zones in this region.
labels map[string]string
(Optional)
Labels is an optional set of key-value pairs that contain certain administrator-controlled labels for this region.
It can be used by Gardener administrators/operators to provide additional information about a region, e.g. wrt
quality, reliability, etc.
ResourceWatchCacheSize contains configuration of the API server’s watch cache size for one specific resource.
apiGroup string
(Optional)
APIGroup is the API group of the resource for which the watch cache size should be configured.
An unset value is used to specify the legacy core API (e.g. for secrets).
resource string
Resource is the name of the resource for which the watch cache size should be configured
(in lowercase plural form, e.g. secrets).
size int32
CacheSize specifies the watch cache size that should be configured for the specified resource.
SecretBindingProvider defines the provider type of the SecretBinding.
type string
Type is the type of the provider.
For backwards compatibility, the field can contain multiple providers separated by a comma.
However, the usage of a single SecretBinding (hence Secret) for different cloud providers is strongly discouraged.
SecretRef is a reference to a Secret object containing the cloud provider credentials for
the object store where backups should be stored. It should have enough privileges to manipulate
the objects as well as buckets.
SeedSettingDependencyWatchdogProber controls the prober settings for the dependency-watchdog for the seed.
enabled bool
Enabled controls whether the probe controller (prober) of the dependency-watchdog should be enabled. This controller
scales down the kube-controller-manager, machine-controller-manager and cluster-autoscaler of shoot clusters in case their respective kube-apiserver is not
reachable via its external ingress in order to avoid melt-down situations.
SeedSettingDependencyWatchdogWeeder controls the weeder settings for the dependency-watchdog for the seed.
enabled bool
Enabled controls whether the endpoint controller (weeder) of the dependency-watchdog should be enabled. This controller
helps to alleviate the delay where control plane components remain unavailable by finding the respective pods in
CrashLoopBackoff status and restarting them once their dependants become ready and available again.
ExternalTrafficPolicy describes how nodes distribute service traffic they
receive on one of the service’s “externally-facing” addresses.
Defaults to “Cluster”.
Zones controls settings, which are specific to the single-zone load balancers in a multi-zonal setup.
Can be empty for single-zone seeds. Each specified zone has to relate to one of the zones in seed.spec.provider.zones.
ProxyProtocol controls whether ProxyProtocol is (optionally) allowed for the load balancer services.
Defaults to nil, which is equivalent to not allowing ProxyProtocol.
ExternalTrafficPolicy describes how nodes distribute service traffic they
receive on one of the service’s “externally-facing” addresses.
Defaults to “Cluster”.
ProxyProtocol controls whether ProxyProtocol is (optionally) allowed for the load balancer services.
Defaults to nil, which is equivalent to not allowing ProxyProtocol.
Enabled controls whether certain Services deployed in the seed cluster should be topology-aware.
These Services are etcd-main-client, etcd-events-client, kube-apiserver, gardener-resource-manager and vpa-webhook.
SeedSettingVerticalPodAutoscaler controls certain settings for the vertical pod autoscaler components deployed in the
seed.
enabled bool
Enabled controls whether the VPA components shall be deployed into the garden namespace in the seed cluster. It
is enabled by default because Gardener heavily relies on a VPA being deployed. You should only disable this if
your seed cluster already has another, manually/custom managed VPA deployment.
Backup holds the object store configuration for the backups of shoot (currently only etcd).
If it is not specified, then there won’t be any backups taken for shoots associated with this seed.
If backup field is present in seed, then backups of the etcd from shoot control plane will be stored
under the configured object store.
Conditions represents the latest available observations of a Seed’s current state.
observedGeneration int64
(Optional)
ObservedGeneration is the most recent generation observed for this Seed. It corresponds to the
Seed’s generation, which is updated on mutation by the API Server.
clusterIdentity string
(Optional)
ClusterIdentity is the identity of the Seed cluster. This field is immutable.
Backup holds the object store configuration for the backups of shoot (currently only etcd).
If it is not specified, then there won’t be any backups taken for shoots associated with this seed.
If backup field is present in seed, then backups of the etcd from shoot control plane will be stored
under the configured object store.
ServiceAccountConfig is the kube-apiserver configuration for service accounts.
issuer string
(Optional)
Issuer is the identifier of the service account token issuer. The issuer will assert this
identifier in “iss” claim of issued tokens. This value is used to generate new service account tokens.
This value is a string or URI. Defaults to URI of the API server.
extendTokenExpiration bool
(Optional)
ExtendTokenExpiration turns on projected service account expiration extension during token generation, which
helps safe transition from legacy token to bound service account token feature. If this flag is enabled,
admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition,
ignoring value of service-account-max-token-expiration.
MaxTokenExpiration is the maximum validity duration of a token created by the service account token issuer. If an
otherwise valid TokenRequest with a validity duration larger than this value is requested, a token will be issued
with a validity duration of this value.
This field must be within [30d,90d].
acceptedIssuers []string
(Optional)
AcceptedIssuers is an additional set of issuers that are used to determine which service account tokens are accepted.
These values are not used to generate new service account tokens. Only useful when service account tokens are also
issued by another external system or a change of the current issuer that is used for generating tokens is being performed.
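For orientation, these fields come together under spec.kubernetes.kubeAPIServer.serviceAccountConfig in the Shoot specification. A minimal sketch, with placeholder issuer URLs, might look like this:
spec:
  kubernetes:
    kubeAPIServer:
      serviceAccountConfig:
        issuer: https://issuer.example.com        # placeholder issuer URL
        extendTokenExpiration: true
        maxTokenExpiration: 720h                  # must be within [30d, 90d]
        acceptedIssuers:
        - https://old-issuer.example.com          # placeholder for a previous issuer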
Region is a name of a region. This field is immutable.
secretBindingName (string)
(Optional)
SecretBindingName is the name of a SecretBinding that has a reference to the provider secret.
The credentials inside the provider secret will be used to create the shoot in the respective account.
The field is mutually exclusive with CredentialsBindingName.
This field is immutable.
seedName (string)
(Optional)
SeedName is the name of the seed cluster that runs the control plane of the Shoot.
ControlPlane contains general settings for the control plane of the shoot.
schedulerName (string)
(Optional)
SchedulerName is the name of the responsible scheduler which schedules the shoot.
If not specified, the default scheduler takes over.
This field is immutable.
CloudProfile contains a reference to a CloudProfile or a NamespacedCloudProfile.
credentialsBindingName (string)
(Optional)
CredentialsBindingName is the name of a CredentialsBinding that has a reference to the provider credentials.
The credentials will be used to create the shoot in the respective account. The field is mutually exclusive with SecretBindingName.
LastErrors holds information about the last occurred error(s) during an operation.
observedGeneration (int64)
(Optional)
ObservedGeneration is the most recent generation observed for this Shoot. It corresponds to the
Shoot’s generation, which is updated on mutation by the API Server.
RetryCycleStartTime is the start time of the last retry cycle (used to determine how often an operation
must be retried until we give up).
seedName (string)
(Optional)
SeedName is the name of the seed cluster that runs the control plane of the Shoot. This value is only written
after a successful create/reconcile operation. It will be used when control planes are moved between Seeds.
technicalID (string)
TechnicalID is the name that is used for creating the Seed namespace, the infrastructure resources, and
basically everything that is related to this particular Shoot. This field is immutable.
UID is a unique identifier for the Shoot cluster to avoid portability between Kubernetes clusters.
It is used to compute unique hashes. This field is immutable.
clusterIdentity (string)
(Optional)
ClusterIdentity is the identity of the Shoot cluster. This field is immutable.
EvictAfterOOMThreshold defines the threshold that will lead to pod eviction in case it OOMed in less than the given
threshold since its start and if it has only one container (default: 10m0s).
evictionRateBurst (int32)
(Optional)
EvictionRateBurst defines the burst of pods that can be evicted (default: 1)
evictionRateLimit (float64)
(Optional)
EvictionRateLimit defines the number of pods that can be evicted per second. A rate limit set to 0 or -1 will
disable the rate limiter (default: -1).
evictionTolerance (float64)
(Optional)
EvictionTolerance defines the fraction of replica count that can be evicted for update in case more than one
pod can be evicted (default: 0.5).
recommendationMarginFraction (float64)
(Optional)
RecommendationMarginFraction is the fraction of usage added as the safety margin to the recommended request
(default: 0.15).
RecommenderInterval is the interval at which metrics should be fetched (default: 1m0s).
targetCPUPercentile (float64)
(Optional)
TargetCPUPercentile is the usage percentile that will be used as a base for CPU target recommendation.
Doesn’t affect CPU lower bound, CPU upper bound nor memory recommendations.
(default: 0.9)
recommendationLowerBoundCPUPercentile (float64)
(Optional)
RecommendationLowerBoundCPUPercentile is the usage percentile that will be used for the lower bound on CPU recommendation.
(default: 0.5)
recommendationUpperBoundCPUPercentile (float64)
(Optional)
RecommendationUpperBoundCPUPercentile is the usage percentile that will be used for the upper bound on CPU recommendation.
(default: 0.95)
targetMemoryPercentile (float64)
(Optional)
TargetMemoryPercentile is the usage percentile that will be used as a base for memory target recommendation.
Doesn’t affect memory lower bound nor memory upper bound.
(default: 0.9)
recommendationLowerBoundMemoryPercentile (float64)
(Optional)
RecommendationLowerBoundMemoryPercentile is the usage percentile that will be used for the lower bound on memory recommendation.
(default: 0.5)
recommendationUpperBoundMemoryPercentile (float64)
(Optional)
RecommendationUpperBoundMemoryPercentile is the usage percentile that will be used for the upper bound on memory recommendation.
(default: 0.95)
MaxSurge is the maximum number of machines that are created during an update.
This value is divided by the number of configured zones for a fair distribution.
MaxUnavailable is the maximum number of machines that can be unavailable during an update.
This value is divided by the number of configured zones for a fair distribution.
DataVolumes contains a list of additional worker volumes.
kubeletDataVolumeName (string)
(Optional)
KubeletDataVolumeName contains the name of a dataVolume that should be used for storing kubelet state.
zones ([]string)
(Optional)
Zones is a list of availability zones that are used to evenly distribute this worker pool. Optional
as not every provider may support availability zones.
Kubelet contains configuration settings for all kubelets of this worker pool.
If set, all spec.kubernetes.kubelet settings will be overwritten for this worker pool (no merge of settings).
version (string)
(Optional)
Version is the semantic Kubernetes version to use for the kubelet in this worker group.
If not specified, the kubelet version is derived from the global shoot cluster Kubernetes version.
The version must be equal to or lower than the version of the shoot Kubernetes version.
Only one minor version difference to other worker groups and the global Kubernetes version is allowed.
(Members of DefaultSpec are embedded into this type.)
DefaultSpec is a structure containing common fields used by all extension resources.
userData ([]byte)
UserData is the base64-encoded user data for the bastion instance. This should
contain code to provision the SSH key on the bastion instance.
This field is immutable.
SecretRef is a reference to a secret that contains the cloud provider specific credentials.
region (string)
(Optional)
Region is the region of this DNS record. If not specified, the region specified in SecretRef will be used.
If that is also not specified, the extension controller will use its default region.
zone (string)
(Optional)
Zone is the DNS hosted zone of this DNS record. If not specified, it will be determined automatically by
getting all hosted zones of the account and searching for the longest zone name that is a suffix of Name.
name (string)
Name is the fully qualified domain name, e.g. “api.”. This field is immutable.
Purpose describes how the result of this OperatingSystemConfig is used by Gardener. Either it
gets sent to the Worker extension controller to bootstrap a VM, or it is downloaded by the
gardener-node-agent already running on a bootstrapped VM.
This field is immutable.
InfrastructureProviderStatus is a raw extension field that contains the provider status that has
been generated by the controller responsible for the Infrastructure resource.
regionstring
Region is the name of the region where the worker pool should be deployed to. This field is immutable.
CloudConfig contains the generated output for the given operating system
config spec. It contains a reference to a secret as the result may contain confidential data.
File is a file that should get written to the host’s file system. The content can either be inlined or
referenced from a secret in the same namespace.
Field
Description
path (string)
Path is the path of the file system where the file should get written to.
permissions (uint32)
(Optional)
Permissions describes with which permissions the file should get written to the file system.
If no permissions are set, the operating system’s defaults are used.
Inline is a struct that contains information about the inlined data.
transmitUnencoded (bool)
(Optional)
TransmitUnencoded set to true will ensure that the os-extension does not encode the file content when sent to the node.
This can be used, for example, to manipulate the clear-text content before it reaches the node.
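To illustrate how these fields fit together, a sketch of a file entry in an OperatingSystemConfig follows; the path and data are hypothetical:
spec:
  files:
  - path: /etc/example/config.conf                # hypothetical path
    permissions: 0644
    content:
      inline:
        encoding: b64
        data: SGVsbG8sIHdvcmxkIQ==               # base64-encoded file content
      transmitUnencoded: false                    # keep the default encoding behavior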
(Members of DefaultStatus are embedded into this type.)
DefaultStatus is a structure containing common fields used by all extension resources.
nodesCIDR (string)
(Optional)
NodesCIDR is the CIDR of the node network that was optionally created by the acting extension controller.
This might be needed in environments in which the CIDR for the network for the shoot worker node cannot
be statically defined in the Shoot resource but must be computed dynamically.
egressCIDRs ([]string)
(Optional)
EgressCIDRs is a list of CIDRs used by the shoot as the source IP for egress traffic. For certain environments the egress
IPs may not be stable in which case the extension controller may opt to not populate this field.
MachineImage contains logical information about the name and the version of the machine image that
should be used. The logical information must be mapped to the provider-specific information (e.g.,
AMIs, …) by the provider itself.
CloudConfig is a structure for containing the generated output for the given operating system
config spec. It contains a reference to a secret as the result may contain confidential data.
Values are the values configured at the given path. If defined, they are expected in JSON format:
- A given JSON object will be put to the given path.
- If not configured, only the table entry is created.
FilePaths is a list of files the unit depends on. If any file changes a restart of the dependent unit will be
triggered. For each FilePath there must exist a File with matching Path in OperatingSystemConfig.Spec.Files.
minimum (int32)
Minimum is the minimum size of the worker pool.
name (string)
Name is the name of this worker pool.
nodeAgentSecretName (string)
(Optional)
NodeAgentSecretName uniquely identifies selected aspects of the OperatingSystemConfig. If it changes, the
worker pool must be rolled.
UserDataSecretRef references a Secret and a data key containing the data that is sent to the provider’s APIs when
a new machine/VM that is part of this worker pool shall be spawned.
NodeTemplate contains resource information of the machine which is used by Cluster Autoscaler to generate nodeTemplate during scaling a nodeGroup from zero
architecture (string)
(Optional)
Architecture is the CPU architecture of the worker pool machines and machine image.
ShootRef defines the target shoot for a Bastion. The name field of the ShootRef is immutable.
seedName (string)
(Optional)
SeedName is the name of the seed to which this Bastion is currently scheduled. This field is populated
at the beginning of a create/reconcile operation.
providerType (string)
(Optional)
ProviderType is the cloud provider used by the referenced Shoot.
sshPublicKey (string)
SSHPublicKey is the user’s public key. This field is immutable.
LastHeartbeatTimestamp is the time when the bastion was last marked as
not to be deleted. When this is set, the ExpirationTimestamp is advanced
as well.
ExpirationTimestamp is the time after which a Bastion is supposed to be
garbage collected.
observedGeneration (int64)
(Optional)
ObservedGeneration is the most recent generation observed for this Bastion. It corresponds to the
Bastion’s generation, which is updated on mutation by the API Server.
RuntimeCluster is the deployment configuration for the admission in the runtime cluster. The runtime deployment
is responsible for creating the admission controller in the runtime cluster.
VirtualCluster is the deployment configuration for the admission deployment in the garden cluster. The garden deployment
installs necessary resources in the virtual garden cluster e.g. RBAC that are necessary for the admission controller.
Backup contains the object store configuration for backups for the virtual garden etcd.
Field
Description
provider (string)
Provider is a provider name. This field is immutable.
bucketName (string)
(Optional)
BucketName is the name of the backup bucket. If not provided, gardener-operator attempts to manage a new bucket.
In this case, the cloud provider credentials provided in the SecretRef must have enough privileges for creating
and deleting buckets.
SecretRef is a reference to a Secret object containing the cloud provider credentials for the object store where
backups should be stored. It should have enough privileges to manipulate the objects as well as buckets.
DNSDomain defines a DNS domain with optional provider.
Field
Description
name (string)
Name is the domain name.
provider (string)
(Optional)
Provider is the name of the DNS provider as declared in the ‘.spec.dns.providers’ section.
It is optional only if the .spec.dns section is not provided at all.
PollInterval is the interval of how often the GitHub API is polled for issue updates. This field is used as a
fallback mechanism to ensure state synchronization, even when there is a GitHub webhook configuration. If a
webhook event is missed or not successfully delivered, the polling will help catch up on any missed updates.
If this field is not provided and there is no ‘webhookSecret’ key in the referenced secret, it will be
implicitly defaulted to 15m.
Container contains configuration for the dashboard terminal container.
allowedHosts ([]string)
(Optional)
AllowedHosts should consist of permitted hostnames (without the scheme) for terminal connections.
It is important to consider that the usage of wildcards follows the rules defined by the content security policy.
For example, ‘*.seed.local.gardener.cloud’ or ‘*.other-seeds.local.gardener.cloud’. For more information, see
https://github.com/gardener/dashboard/blob/master/docs/operations/webterminals.md#allowlist-for-hosts.
Deployment specifies how an extension can be installed for a Gardener landscape. It includes the specification
for installing an extension and/or an admission controller.
ExtensionDeploymentSpec specifies how to install the extension in a gardener landscape. The installation is split into two parts:
- installing the extension in the virtual garden cluster by creating the ControllerRegistration and ControllerDeployment
- installing the extension in the runtime cluster (if necessary).
SeedSelector contains an optional label selector for seeds. Only if the labels match then this controller will be
considered for a deployment.
An empty list means that all seeds are selected.
AdmissionPlugins contains the list of user-defined admission plugins (additional to those managed by Gardener),
and, if desired, the corresponding configuration.
WatchCacheSizes contains configuration of the API server’s watch cache sizes.
Configuring these flags might be useful for large-scale Garden clusters with a lot of parallel update requests
and a lot of watching controllers (e.g. large ManagedSeed clusters). When the API server’s watch cache’s
capacity is too small to cope with the amount of update requests and watchers for a particular resource, it
might happen that controller watches are permanently stopped with too old resource version errors.
Starting from kubernetes v1.19, the API server’s watch cache size is adapted dynamically and setting the watch
cache size flags will have no effect, except when setting it to 0 (which disables the watch cache).
Domains specify the ingress domains of the cluster pointing to the ingress controller endpoint. They will be used
to construct ingress URLs for system applications running in runtime cluster.
ResourcesToStoreInETCDEvents contains a list of resources which should be stored in etcd-events instead of
etcd-main. The ‘events’ resource is always stored in etcd-events. Note that adding or removing resources from
this list will not migrate them automatically from the etcd-main to etcd-events or vice versa.
CertificateSigningDuration is the maximum length of duration signed certificates will be given. Individual CSRs
may request shorter certs by setting spec.expirationSeconds.
SNI contains configuration options for the TLS SNI settings.
Field
Description
secretName (string)
SecretName is the name of a secret containing the TLS certificate and private key.
domainPatterns ([]string)
(Optional)
DomainPatterns is a list of fully qualified domain names, possibly with prefixed wildcard segments. The domain
patterns also allow IP addresses, but IPs should only be used if the apiserver has visibility to the IP address
requested by a client. If no domain patterns are provided, the names of the certificate are extracted.
Non-wildcard matches take precedence over wildcard matches, and explicit domain patterns take precedence over extracted names.
Enabled controls whether certain Services deployed in the cluster should be topology-aware.
These Services are virtual-garden-etcd-main-client, virtual-garden-etcd-events-client and virtual-garden-kube-apiserver.
Additionally, other components that are deployed to the runtime cluster via other means can read this field and
according to its value enable/disable topology-aware routing for their Services.
SettingVerticalPodAutoscaler controls certain settings for the vertical pod autoscaler components deployed in the
seed.
Field
Description
enabled (bool)
(Optional)
Enabled controls whether the VPA components shall be deployed into this cluster. It is true by default because
the operator (and Gardener) heavily rely on a VPA being deployed. You should only disable this if your runtime
cluster already has another, manually/custom managed VPA deployment. If this is not the case, but you still
disable it, then reconciliation will fail.
MachineImages is the list of machine images that are understood by the controller. It maps
logical names and versions to provider-specific identifiers.
WorkerStatus
WorkerStatus contains information about created worker resources.
MachineImages is a list of machine images that have been used in this worker. Usually, the extension controller
gets the mapping from name/version to the provider-specific machine image data from the CloudProfile. However, if
a version that is still in use gets removed from this componentconfig it cannot reconcile anymore existing Worker
resources that are still using this version. Hence, it stores the used versions in the provider status to ensure
reconciliation is possible.
Equivalences specifies possible group/kind equivalences for objects.
deletePersistentVolumeClaims (bool)
(Optional)
DeletePersistentVolumeClaims specifies if PersistentVolumeClaims created by StatefulSets, which are managed by this
resource, should also be deleted when the corresponding StatefulSet is deleted (defaults to false).
CredentialsRef is a reference to a resource holding the credentials.
Accepted resources are core/v1.Secret and security.gardener.cloud/v1alpha1.WorkloadIdentity.
This field is immutable.
Quotas is a list of references to Quota objects in the same or another namespace.
This field is immutable.
WorkloadIdentity
WorkloadIdentity is a resource that allows workloads to be presented before external systems
by giving them identities managed by the Gardener API server.
The identity of such a workload is represented by a JSON Web Token issued by the Gardener API server.
Workload identities are designed to be used by components running in the Gardener environment,
seed or runtime cluster, that make use of identity federation inspired by the OIDC protocol.
KubeconfigSecretRef is a reference to a secret containing a kubeconfig for the cluster to which gardenlet should
be deployed. This is only used by gardener-operator for a very first gardenlet deployment. After that, gardenlet
will continuously upgrade itself. If this field is empty, gardener-operator deploys it into its own runtime
cluster.
Gardenlet specifies that the ManagedSeed controller should deploy a gardenlet into the cluster
with the given deployment parameters and GardenletConfiguration.
Selector is a label query over ManagedSeeds and Shoots that should match the replica count.
It must match the ManagedSeeds and Shoots template’s labels. This field is immutable.
Template describes the ManagedSeed that will be created if insufficient replicas are detected.
Each ManagedSeed created / updated by the ManagedSeedSet will fulfill this template.
ShootTemplate describes the Shoot that will be created if insufficient replicas are detected for hosting the corresponding ManagedSeed.
Each Shoot created / updated by the ManagedSeedSet will fulfill this template.
UpdateStrategy specifies the UpdateStrategy that will be
employed to update ManagedSeeds / Shoots in the ManagedSeedSet when a revision is made to
Template / ShootTemplate.
revisionHistoryLimit (int32)
(Optional)
RevisionHistoryLimit is the maximum number of revisions that will be maintained
in the ManagedSeedSet’s revision history. Defaults to 10. This field is immutable.
Bootstrap is the mechanism that should be used for bootstrapping gardenlet connection to the Garden cluster. One of ServiceAccount, BootstrapToken, None.
If set to ServiceAccount or BootstrapToken, a service account or a bootstrap token will be created in the garden cluster and used to compute the bootstrap kubeconfig.
If set to None, the gardenClientConnection.kubeconfig field will be used to connect to the Garden cluster. Defaults to BootstrapToken.
This field is immutable.
mergeWithParent (bool)
(Optional)
MergeWithParent specifies whether the GardenletConfiguration of the parent gardenlet
should be merged with the specified GardenletConfiguration. Defaults to true. This field is immutable.
Env is the list of environment variables to set in the gardenlet container.
vpa (bool)
(Optional)
VPA specifies whether to enable VPA for gardenlet. Defaults to true.
Deprecated: This field is deprecated and has no effect anymore. It will be removed in the future.
TODO(rfranzke): Remove this field after v1.110 has been released.
Conditions represents the latest available observations of a Gardenlet’s current state.
observedGeneration (int64)
(Optional)
ObservedGeneration is the most recent generation observed for this Gardenlet. It corresponds to the Gardenlet’s
generation, which is updated on mutation by the API Server.
ManagedSeedSetStatus represents the current state of a ManagedSeedSet.
Field
Description
observedGeneration (int64)
ObservedGeneration is the most recent generation observed for this ManagedSeedSet. It corresponds to the
ManagedSeedSet’s generation, which is updated on mutation by the API Server.
replicas (int32)
Replicas is the number of replicas (ManagedSeeds and their corresponding Shoots) created by the ManagedSeedSet controller.
readyReplicas (int32)
ReadyReplicas is the number of ManagedSeeds created by the ManagedSeedSet controller that have a Ready Condition.
nextReplicaNumber (int32)
NextReplicaNumber is the ordinal number that will be assigned to the next replica of the ManagedSeedSet.
currentReplicas (int32)
CurrentReplicas is the number of ManagedSeeds created by the ManagedSeedSet controller from the ManagedSeedSet version
indicated by CurrentRevision.
updatedReplicas (int32)
UpdatedReplicas is the number of ManagedSeeds created by the ManagedSeedSet controller from the ManagedSeedSet version
indicated by UpdateRevision.
currentRevision (string)
CurrentRevision, if not empty, indicates the version of the ManagedSeedSet used to generate ManagedSeeds with smaller
ordinal numbers during updates.
updateRevision (string)
UpdateRevision, if not empty, indicates the version of the ManagedSeedSet used to generate ManagedSeeds with larger
ordinal numbers during updates.
collisionCount (int32)
(Optional)
CollisionCount is the count of hash collisions for the ManagedSeedSet. The ManagedSeedSet controller
uses this field as a collision avoidance mechanism when it needs to create the name for the
newest ControllerRevision.
PendingReplica, if not empty, indicates the replica that is currently pending creation, update, or deletion.
This replica is in a state that requires the controller to wait for it to change before advancing to the next replica.
Conditions represents the latest available observations of a ManagedSeed’s current state.
observedGeneration (int64)
ObservedGeneration is the most recent generation observed for this ManagedSeed. It corresponds to the
ManagedSeed’s generation, which is updated on mutation by the API Server.
Since is the moment in time since the replica is pending with the specified reason.
retries (int32)
(Optional)
Retries is the number of times the shoot operation (reconcile or delete) has been retried after having failed.
Only applicable if Reason is ShootReconciling or ShootDeleting.
UpdateStrategy specifies the strategy that the ManagedSeedSet
controller will use to perform updates. It includes any additional parameters
necessary to perform the update for the indicated strategy.
Project decides whether to apply the configuration if the
Shoot is in a specific Project matching the label selector.
Use the selector only if the OIDC Preset is opt-in, because end
users may skip the admission by setting the labels.
Defaults to the empty LabelSelector, which matches everything.
OpenIDConnectPreset
OpenIDConnectPreset is an OpenID Connect configuration that is applied
to a Shoot in a namespace.
Server contains the kube-apiserver’s OpenID Connect configuration.
This configuration is not overwriting any existing OpenID Connect
configuration already set on the Shoot object.
Client contains the configuration used for client OIDC authentication
of Shoot clusters.
This configuration is not overwriting any existing OpenID Connect
client authentication already set on the Shoot object.
Deprecated: The OpenID Connect configuration this field specifies is not used and will be forbidden starting from Kubernetes 1.31.
Its use was planned for generating an OIDC kubeconfig: https://github.com/gardener/gardener/issues/1433
TODO(AleksandarSavchev): Drop this field after support for Kubernetes 1.30 is dropped.
ShootSelector decides whether to apply the configuration if the
Shoot has matching labels.
Use the selector only if the OIDC Preset is opt-in, because end
users may skip the admission by setting the labels.
Defaults to the empty LabelSelector, which matches everything.
weight (int32)
Weight associated with matching the corresponding preset,
in the range 1-100.
Required.
KubeAPIServerOpenIDConnect contains configuration settings for the OIDC provider.
Note: Descriptions were taken from the Kubernetes documentation.
Field
Description
caBundle (string)
(Optional)
If set, the OpenID server’s certificate will be verified by one of the authorities in the oidc-ca-file, otherwise the host’s root CA set will be used.
clientID (string)
The client ID for the OpenID Connect client.
Required.
groupsClaim (string)
(Optional)
If provided, the name of a custom OpenID Connect claim for specifying user groups. The claim value is expected to be a string or array of strings. This field is experimental, please see the authentication documentation for further details.
groupsPrefix (string)
(Optional)
If provided, all groups will be prefixed with this value to prevent conflicts with other authentication strategies.
issuerURL (string)
The URL of the OpenID issuer, only HTTPS scheme will be accepted. If set, it will be used to verify the OIDC JSON Web Token (JWT).
Required.
requiredClaims (map[string]string)
(Optional)
key=value pairs that describe a required claim in the ID Token. If set, the claim is verified to be present in the ID Token with a matching value.
signingAlgs ([]string)
(Optional)
List of allowed JOSE asymmetric signing algorithms. JWTs with an ‘alg’ header value not in this list will be rejected. Values are defined by RFC 7518, https://tools.ietf.org/html/rfc7518#section-3.1.
Defaults to [RS256].
usernameClaim (string)
(Optional)
The OpenID claim to use as the user name. Note that claims other than the default (‘sub’) are not guaranteed to be unique and immutable. This field is experimental, please see the authentication documentation for further details.
Defaults to “sub”.
usernamePrefix (string)
(Optional)
If provided, all usernames will be prefixed with this value. If not provided, username claims other than ‘email’ are prefixed by the issuer URL to avoid clashes. To skip any prefixing, provide the value ‘-’.
OpenIDConnectClientAuthentication contains configuration for OIDC clients.
Field
Description
secret (string)
(Optional)
The client Secret for the OpenID Connect client.
extraConfig (map[string]string)
(Optional)
Extra configuration added to kubeconfig’s auth-provider.
Must not be any of idp-issuer-url, client-id, client-secret, idp-certificate-authority, idp-certificate-authority-data, id-token or refresh-token
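Putting the pieces above together, a sketch of an OpenIDConnectPreset could look as follows; the name, namespace, client ID, issuer, and labels are placeholders:
apiVersion: settings.gardener.cloud/v1alpha1
kind: OpenIDConnectPreset
metadata:
  name: example-preset
  namespace: garden-dev                           # hypothetical project namespace
spec:
  shootSelector:
    matchLabels:
      oidc: enabled                               # Shoots opt in via this label
  server:
    clientID: example-client                      # placeholder client ID
    issuerURL: https://oidc.example.com           # placeholder issuer
    usernameClaim: email
  weight: 50                                      # required, range 1-100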
3.1 - DNS Autoscaling
This is a short guide describing the different options for automatically scaling CoreDNS in the shoot cluster.
Background
Currently, Gardener uses CoreDNS as the DNS server. By default, it is installed as a deployment into the shoot cluster that is auto-scaled horizontally to cover for QPS-intensive applications. However, doing so does not seem to be enough to completely circumvent DNS bottlenecks such as:
Cloud provider limits for DNS lookups.
Unreliable UDP connections that force a timeout period when packets are dropped.
Unnecessary node hopping, since CoreDNS is not deployed on all nodes and, as a result, DNS queries end up traversing multiple nodes before reaching the destination server.
Inefficient load-balancing of services (e.g., round-robin might not be enough when using IPTables mode).
Overload of the CoreDNS replicas as the maximum number of replicas is fixed.
and more …
As an alternative with extended configuration options, Gardener provides cluster-proportional autoscaling of CoreDNS. This guide focuses on the configuration of cluster-proportional autoscaling of CoreDNS and its advantages/disadvantages compared to horizontal autoscaling.
Please note that there is also the option to use a node-local DNS cache, which helps mitigate potential DNS bottlenecks (see Trade-offs in conjunction with NodeLocalDNS for considerations regarding using NodeLocalDNS together with one of the CoreDNS autoscaling approaches).
Configuring Cluster-Proportional DNS Autoscaling
All that needs to be done to enable the usage of cluster-proportional autoscaling of CoreDNS is to set the corresponding option (spec.systemComponents.coreDNS.autoscaling.mode) in the Shoot resource to cluster-proportional:
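A minimal excerpt of such a Shoot specification looks like this:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  systemComponents:
    coreDNS:
      autoscaling:
        mode: cluster-proportional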
To switch back to horizontal DNS autoscaling, you can set the spec.systemComponents.coreDNS.autoscaling.mode to horizontal (or remove the coreDNS section).
Once the cluster-proportional autoscaling of CoreDNS has been enabled and the Shoot cluster has been reconciled afterwards, a ConfigMap called coredns-autoscaler will be created in the kube-system namespace with the default settings. The content will be similar to the following:
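The exact defaults may evolve over time, but the ConfigMap can be expected to resemble the following sketch; the values shown follow the usual cluster-proportional-autoscaler linear-mode defaults and may differ in your landscape:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"min":2}'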
It is possible to adapt the ConfigMap according to your needs in case the defaults do not work as desired. The number of CoreDNS replicas is calculated according to the following formula:
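Following the upstream cluster-proportional-autoscaler's linear mode, the result is clamped to min (and to max, if configured):
replicas = max( ceil( cores / coresPerReplica ), ceil( nodes / nodesPerReplica ) )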
Depending on your needs, you can adjust coresPerReplica or nodesPerReplica, but it is also possible to override min if required.
Trade-Offs of Horizontal and Cluster-Proportional DNS Autoscaling
The horizontal autoscaling of CoreDNS as implemented by Gardener is fully managed, i.e., you do not need to perform any configuration changes. It scales according to the CPU usage of CoreDNS replicas, meaning that it will create new replicas if the existing ones are under heavy load. This approach scales between 2 and 5 instances, which is sufficient for most workloads. In case this is not enough, the cluster-proportional autoscaling approach can be used instead, with its more flexible configuration options.
The cluster-proportional autoscaling of CoreDNS as implemented by Gardener is fully managed, but allows more configuration options to adjust the default settings to your individual needs. It scales according to the cluster size, i.e., if your cluster grows in terms of cores/nodes so will the amount of CoreDNS replicas. However, it does not take the actual workload, e.g., CPU consumption, into account.
Experience shows that the horizontal autoscaling of CoreDNS works for a variety of workloads. It does reach its limits if a cluster has a high amount of DNS requests, though. The cluster-proportional autoscaling approach allows fine-tuning the number of CoreDNS replicas. It helps to scale in clusters of changing size. However, please keep in mind that you need to cater for the maximum number of DNS requests, as the replicas will not be adapted according to the workload, but only according to the cluster size (cores/nodes).
Trade-Offs in Conjunction with NodeLocalDNS
Using a node-local DNS cache can mitigate a lot of the potential DNS related problems. It works fine with a DNS workload that can be handled by the cache and reduces the inter-node DNS communication. As node-local DNS cache reduces the amount of traffic being sent to the cluster’s CoreDNS replicas, it usually works fine with horizontally scaled CoreDNS. Nevertheless, it also works with CoreDNS scaled in a cluster-proportional approach. In this mode, though, it might make sense to adapt the default settings as the CoreDNS workload is likely significantly reduced.
Overall, you can view the DNS options on a scale. Horizontally scaled DNS provides a small number of DNS servers. Especially for bigger clusters, a cluster-proportional approach will yield more CoreDNS instances and hence may yield a more balanced DNS solution. By adapting the settings, you can further increase the number of CoreDNS replicas. On the other end of the spectrum, a node-local DNS cache provides DNS on every node and allows reducing the number of (backend) CoreDNS instances, regardless of whether they are horizontally or cluster-proportionally scaled.
3.2 - Shoot Autoscaling
The basics of horizontal Node and vertical Pod auto-scaling
Auto-Scaling in Shoot Clusters
There are three auto-scaling scenarios of relevance in Kubernetes clusters in general and Gardener shoot clusters in particular:
Horizontal node auto-scaling, i.e., dynamically adding and removing worker nodes.
Horizontal pod auto-scaling, i.e., dynamically adding and removing pod replicas.
Vertical pod auto-scaling, i.e., dynamically raising or shrinking the resource requests/limits of pods.
This document provides an overview of these scenarios and how the respective auto-scaling components can be enabled and configured. For more details, please see our pod auto-scaling best practices.
Horizontal Node Auto-Scaling
Every shoot cluster that has at least one worker pool with minimum < maximum nodes configuration will get a cluster-autoscaler deployment.
Gardener is leveraging the upstream community Kubernetes cluster-autoscaler component.
We have forked it to gardener/autoscaler so that it supports the way Gardener manages worker nodes (leveraging gardener/machine-controller-manager).
However, we have not touched the logic of how it makes auto-scaling decisions.
Consequently, please refer to the official documentation for this component.
The Shoot API allows configuring a few flags of the cluster-autoscaler:
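As a sketch of both the global and the per-worker-pool options in the Shoot specification; the values are illustrative, not recommendations, and the selected subset of fields is an assumption:
spec:
  kubernetes:
    clusterAutoscaler:
      expander: least-waste
      scaleDownDelayAfterAdd: 1h
      scaleDownUnneededTime: 30m
      scaleDownUtilizationThreshold: 0.5
  provider:
    workers:
    - name: worker-pool-1
      minimum: 1
      maximum: 5
      clusterAutoscaler:                          # per-pool NodeGroupAutoscalingOptions
        scaleDownUtilizationThreshold: 0.7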
Only some cluster-autoscaler flags can be configured per worker pool; this subset is limited by the NodeGroupAutoscalingOptions of the upstream community Kubernetes repository. This list can be found here.
Horizontal Pod Auto-Scaling
This functionality (HPA) is a standard functionality of any Kubernetes cluster (implemented as part of the kube-controller-manager that all Kubernetes clusters have). It is always enabled.
Vertical Pod Auto-Scaling
This form of auto-scaling (VPA) is enabled by default, but it can be switched off in the Shoot by setting .spec.kubernetes.verticalPodAutoscaler.enabled=false in case you deploy your own VPA into your cluster (having more than one VPA on the same set of pods will lead to issues, eventually).
Gardener is leveraging the upstream community Kubernetes vertical-pod-autoscaler.
If enabled, Gardener will deploy it as part of the control plane into the seed cluster.
It will also be used for the vertical autoscaling of Gardener’s system components deployed into the kube-system namespace of shoot clusters, for example, kube-proxy or metrics-server.
You might want to refer to the official documentation for this component for more information on how to use it.
⚠️ Please note that if you disable VPA, the related CustomResourceDefinitions (ours and yours) will remain in your shoot cluster (whether someone acts on them or not).
You can delete these CustomResourceDefinitions yourself using kubectl delete crd if you want to get rid of them (in case you statically size all resources, which we do not recommend).
3.3 - Pod Auto-Scaling Best Practices
There are two types of pod autoscaling in Kubernetes: Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA). HPA (implemented as part of the kube-controller-manager) scales the number of pod replicas, while VPA (implemented as an independent community project) adjusts the CPU and memory requests for the pods. Both types of autoscaling aim to optimize resource usage/costs and maintain the performance and (high) availability of applications running on Kubernetes.
Horizontal Pod Autoscaling involves increasing or decreasing the number of pod replicas in a deployment, replica set, stateful set, or anything really with a scale subresource that manages pods. HPA adjusts the number of replicas based on specified metrics, such as CPU or memory average utilization (usage divided by requests; most common) or average value (usage; less common). When the demand on your application increases, HPA automatically scales out the number of pods to meet the demand. Conversely, when the demand decreases, it scales in the number of pods to reduce resource usage.
HPA targets (mostly stateless) applications where adding more instances of the application can linearly increase the ability to handle additional load. It is very useful for applications that experience variable traffic patterns, as it allows for real-time scaling without the need for manual intervention.
Note
HPA continuously monitors the metrics of the targeted pods and adjusts the number of replicas based on the observed metrics. It operates solely on the current metrics when it calculates the averages across all pods, meaning it reacts to the immediate resource usage without considering past trends or patterns. Also, all pods are treated equally based on the average metrics. This could potentially lead to situations where some pods are under high load while others are underutilized. Therefore, particular care must be applied to (fair) load-balancing (connection vs. request vs. actual resource load balancing are crucial).
Besides HPA and VPA, CPA and CPVA are further options for scaling horizontally or vertically (neither is deployed by Gardener and must be deployed by the user). Unlike HPA and VPA, CPA and CPVA do not monitor the actual pod metrics, but scale solely on the number of nodes or CPU cores in the cluster. While this approach may be helpful and sufficient in a few rare cases, it is often a risky and crude scaling scheme that we do not recommend. More often than not, cluster-proportional scaling results in either under- or over-reserving your resources.
Vertical Pod Autoscaling, on the other hand, focuses on adjusting the CPU and memory resources allocated to the pods themselves. Instead of changing the number of replicas, VPA tweaks the resource requests (and limits, but only proportionally, if configured) for the pods in a deployment, replica set, stateful set, daemon set, or anything really with a scale subresource that manages pods. This means that each pod can be given more, or fewer resources as needed.
VPA is very useful for optimizing the resource requests of pods that have dynamic resource needs over time. It does so by mutating pod requests (unfortunately, not in-place). Therefore, in order to apply new recommendations, pods that are “out of bounds” (i.e. below a configured/computed lower or above a configured/computed upper recommendation percentile) will be evicted proactively, but also pods that are “within bounds” may be evicted after a grace period. The corresponding higher-level replication controller will then recreate a new pod that VPA will then mutate to set the currently recommended requests (and proportional limits, if configured).
Note
VPA continuously monitors all targeted pods and calculates recommendations based on their usage (one recommendation for the entire target). This calculation is influenced by configurable percentiles, with a greater emphasis on recent usage data and a gradual decrease (=decay) in the relevance of older data. However, this means that VPA doesn’t take into account the individual needs of single pods - eventually, all pods will receive the same recommendation, which may lead to considerable resource waste. Ideally, VPA would update pods in-place depending on their individual needs, but that’s (individual recommendations) not in its design, even if in-place updates get implemented, which may be years away for VPA based on current activity on the component.
Selecting the Appropriate Autoscaler
Before deciding on an autoscaling strategy, it’s important to understand the characteristics of your application:
Interruptibility: Most importantly, if the clients of your workload are too sensitive to disruptions/cannot cope well with terminating pods, then maybe neither HPA nor VPA is an option (both, HPA and VPA cause pods and connections to be terminated, though VPA even more frequently). Clients must retry on disruptions, which is a reasonable ask in a highly dynamic (and self-healing) environment such as Kubernetes, but this is often not respected (or expected) by your clients (they may not know or care you run the workload in a Kubernetes cluster and have different expectations to the stability of the workload unless you communicated those through SLIs/SLOs/SLAs).
Statelessness: Is your application stateless or stateful? Stateless applications are typically better candidates for HPA as they can be easily scaled out by adding more replicas without worrying about maintaining state.
Traffic Patterns: Does your application experience variable traffic? If so, HPA can help manage these fluctuations by adjusting the number of replicas to handle the load.
Resource Usage: Does your application’s resource usage change over time? VPA can adjust the CPU and memory reservations dynamically, which is beneficial for applications with non-uniform resource requirements.
Scalability: Can your application handle increased load by scaling vertically (more resources per pod) or does it require horizontal scaling (more pod instances)?
HPA is the right choice if:
Your application is stateless and can handle increased load by adding more instances.
You experience short-term fluctuations in traffic that require quick scaling responses.
You want to maintain a specific performance metric, such as requests per second per pod.
VPA is the right choice if:
Your application’s resource requirements change over time, and you want to optimize resource usage without manual intervention.
You want to avoid the complexity of managing resource requests for each pod, especially when they run code where it’s impossible for you to suggest static requests.
In essence:
For applications that can handle increased load by simply adding more replicas, HPA should be used to handle short-term fluctuations in load by scaling the number of replicas.
For applications that require more resources per pod to handle additional work, VPA should be used to adjust the resource allocation for longer-term trends in resource usage.
Consequently, if both cases apply (VPA often applies), HPA and VPA can also be combined. However, combining both, especially on the same metrics (CPU and memory), requires understanding and care to avoid conflicts and ensure that the autoscaling actions do not interfere with and rather complement each other. For more details, see Combining HPA and VPA.
Horizontal Pod Autoscaler (HPA)
HPA operates by monitoring resource metrics for all pods in a target. It computes the desired number of replicas from the current average metrics and the desired user-defined metrics as follows:
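In upstream terms, the formula is:
desiredReplicas = ceil( currentReplicas × currentMetricValue / desiredMetricValue )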
HPA checks the metrics at regular intervals, which can be configured by the user. Several types of metrics are supported (classical resource metrics like CPU and memory, but also custom and external metrics like requests per second or queue length can be configured, if available). If a scaling event is necessary, HPA adjusts the replica count for the targeted resource.
Defining an HPA Resource
To configure HPA, you need to create an HPA resource in your cluster. This resource specifies the target to scale, the metrics to be used for scaling decisions, and the desired thresholds. Here’s an example of an HPA configuration:
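The manifest itself is not part of this excerpt; the following reconstruction matches the description below, with the HPA object name being hypothetical:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa                                   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 2                           # average CPU usage per pod
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 8G                          # average memory usage per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 1800
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300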
In this example, HPA is configured to scale foo-deployment based on pod average CPU and memory usage. It will maintain an average CPU and memory usage (not utilization, which is usage divided by requests!) across all replicas of 2 CPUs and 8G or lower with as few replicas as possible. The number of replicas will be scaled between a minimum of 1 and a maximum of 10 based on this target.
In the official documentation ([1] and [2]) you will find examples with average utilization (averageUtilization), not average usage (averageValue), but this is not particularly helpful, especially if you plan to combine HPA together with VPA on the same metrics (generally discouraged in the documentation). If you want to safely combine both on the same metrics, you should scale on average usage (averageValue) as shown above. For more details, see Combining HPA and VPA.
Finally, the behavior section influences how fast you scale up and down. Most of the time (depends on your workload), you like to scale out faster than you scale in. In this example, the configuration will trigger a scale-out only after observing the need to scale out for 30s (stabilizationWindowSeconds) and will then only scale out at most 100% (value + type) of the current number of replicas every 60s (periodSeconds). The configuration will trigger a scale-in only after observing the need to scale in for 1800s (stabilizationWindowSeconds) and will then only scale in at most 1 pod (value + type) every 300s (periodSeconds). As you can see, scale-out happens quicker than scale-in in this example.
HPA (actually KCM) Options
HPA is a function of the kube-controller-manager (KCM).
downscaleStabilization (default 5m): HPA will scale out whenever the formula (in accordance with the behavior section, if present in the HPA resource) yields a higher replica count, but it won’t scale in just as eagerly. This option lets you define a trailing time window that HPA must check and only if the recommended replica count is consistently lower throughout the entire time window, HPA will scale in (in accordance with the behavior section, if present in the HPA resource). If at any point in time in that trailing time window the recommended replica count isn’t lower, scale-in won’t happen. This setting is just a default, if nothing is defined in the behavior section of an HPA resource. The default for the upscale stabilization is 0s and it cannot be set via a KCM option (downscale stabilization was historically more important than upscale stabilization and when later the behavior sections were added to the HPA resources, upscale stabilization remained missing from the KCM options).
tolerance (default +/-10%): HPA will not scale out or in if the desired replica count is (mathematically as a float) near the actual replica count (see source code for details), which is a form of hysteresis to avoid replica flapping around a threshold.
There are a few more configurable options of lesser interest:
syncPeriod (default 15s): How often HPA retrieves the pods and metrics respectively how often it recomputes and sets the desired replica count.
cpuInitializationPeriod (default 30s) and initialReadinessDelay (default 5m): Both settings only affect whether or not CPU metrics are considered for scaling decisions. They can be easily misinterpreted as the official docs are somewhat hard to read (see source code for details, which is more readable, if you ignore the comments). Normally, you have little reason to modify them, but here is what they do:
cpuInitializationPeriod: Defines a grace period after a pod starts during which HPA won’t consider CPU metrics of the pod for scaling if the pod is either not ready or it is ready, but a given CPU metric is older than the last state transition (to ready). This is to ignore CPU metrics that predate the current readiness while still in initialization to not make scaling decisions based on potentially misleading data. If the pod is ready and a CPU metric was collected after it became ready, it is considered also within this grace period.
initialReadinessDelay: Defines another grace period after a pod starts during which HPA won’t consider CPU metrics of the pod for scaling if the pod is not ready and it became not ready within this grace period (the docs/comments want to check whether the pod was ever ready, but the code only checks whether the pod condition last transition time to not ready happened within that grace period which it could have from being ready or simply unknown before). This is to ignore not (ever have been) ready pods while still in initialization to not make scaling decisions based on potentially misleading data. If the pod is ready, it is considered also within this grace period.
So, regardless of the values of these settings, if a pod is reporting ready and it has a CPU metric from the time after it became ready, that pod and its metric will be considered. This holds true even if the pod becomes ready very early into its initialization. These settings cannot be used to “black-out” pods for a certain duration before being considered for scaling decisions. Instead, if it is your goal to ignore a potentially resource-intensive initialization phase that could wrongly lead to further scale-out, you would need to configure your pods to not report as ready until that resource-intensive initialization phase is over.
Considerations When Using HPA
Selection of metrics: Besides CPU and memory, HPA can also target custom or external metrics. Pick those (in addition or exclusively), if you guarantee certain SLOs in your SLAs.
Targeting usage or utilization: HPA supports usage (absolute) and utilization (relative). Utilization is often preferred in simple examples, but usage is more precise and versatile.
Compatibility with VPA: Care must be taken when using HPA in conjunction with VPA, as they can potentially interfere with each other’s scaling decisions.
Vertical Pod Autoscaler (VPA)
VPA operates by monitoring resource metrics for all pods in a target. It computes a resource requests recommendation from the historic and current resource metrics. VPA checks the metrics at regular intervals, which can be configured by the user. Only CPU and memory are supported. If VPA detects that a pod’s resource allocation is too high or too low, it may evict pods (if within the permitted disruption budget), which triggers the creation of a new pod by the corresponding higher-level replication controller, which is then mutated by VPA to match the resource requests recommendation. This happens in three different components that work together:
VPA Recommender: The Recommender observes the historic and current resource metrics of pods and generates recommendations based on this data.
VPA Updater: The Updater component checks the recommendations from the Recommender and decides whether any pod’s resource requests need to be updated. If an update is needed, the Updater will evict the pod.
VPA Admission Controller: When a pod is (re-)created, the Admission Controller modifies the pod’s resource requests based on the recommendations from the Recommender. This ensures that the pod starts with the optimal amount of resources.
Since VPA doesn’t support in-place updates, pods will be evicted. You will want to control voluntary evictions by means of Pod Disruption Budgets (PDBs). Please make yourself familiar with those and use them.
Note
PDBs will not always work as expected and can also get in your way. If the PDB is violated or would be violated, it may block evictions that would actually help your workload, e.g. to get a pod out of an OOMKilled CrashLoopBackoff (if the PDB is or would be violated, not even unhealthy pods would be evicted, as they could theoretically become healthy again, which VPA doesn’t know). In order to overcome this issue, it is now possible (alpha since Kubernetes v1.26 in combination with the feature gate PDBUnhealthyPodEvictionPolicy on the API server, beta and enabled by default since Kubernetes v1.27) to configure the so-called unhealthy pod eviction policy. The default is still IfHealthyBudget, as a change in default would have changed the behavior (as described above), but you can now also set AlwaysAllow at the PDB (spec.unhealthyPodEvictionPolicy). For more information, please check out this discussion, the PR and this document and balance the pros and cons for yourself. In short, the new AlwaysAllow option is probably the better choice in most cases, while IfHealthyBudget is useful only if you have frequent temporary transitions or special cases where you have already implemented controllers that depend on the old behavior.
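For illustration, a minimal PDB that opts into the new policy (name and selector are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: foo-pdb                              # illustrative
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: foo
  unhealthyPodEvictionPolicy: AlwaysAllow    # beta, enabled by default since v1.27
```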
Defining a VPA Resource
To configure VPA, you need to create a VPA resource in your cluster. This resource specifies the target to scale, the metrics to be used for scaling decisions, and the policies for resource updates. Here’s an example of a VPA configuration (a representative manifest matching the description below; the object name is illustrative):
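```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: foo-vpa                      # illustrative
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo-deployment
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledValues: RequestsOnly
      minAllowed:
        cpu: 50m
        memory: 200M
      maxAllowed:
        cpu: 4
        memory: 16G
```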
In this example, VPA is configured to scale foo-deployment requests (RequestsOnly) from 50m cores (minAllowed) up to 4 cores (maxAllowed) and from 200M memory (minAllowed) up to 16G memory (maxAllowed) automatically (updateMode). VPA doesn’t support in-place updates, so in update mode Auto it will evict pods under certain conditions and then mutate the requests (and possibly limits, if you omit controlledValues or set it to RequestsAndLimits, which is the default) of upcoming new pods. The following update modes are available:
Off: In this mode, recommendations are computed, but never applied. This mode is useful if you want to learn more about your workload or if you have a custom controller that depends on VPA’s recommendations but shall act instead of VPA.
Initial: In this mode, recommendations are computed and applied, but pods are never proactively evicted to enforce new recommendations over time. This mode is useful if you want to control pod evictions yourself (similar to the StatefulSet updateStrategy OnDelete) or if your workload is sensitive to evictions, e.g. some brownfield singleton application or a daemon set pod that is critical for the node.
Auto (default): In this mode, recommendations are computed, applied, and pods are even proactively evicted to enforce new recommendations over time. This applies recommendations continuously without you having to worry too much.
As mentioned, controlledValues influences whether only requests or requests and limits are scaled:
RequestsOnly: Updates only requests and doesn’t change limits. Useful if you have defined absolute limits (unrelated to the requests).
RequestsAndLimits (default): Updates requests and proportionally scales limits along with the requests. Useful if you have defined relative limits (related to the requests). In this case, the gap between requests and limits should be either zero for QoS Guaranteed or small for QoS Burstable to avoid useless (way beyond the threshold of unhealthy behavior) or absurd (larger than node capacity) values.
VPA doesn’t offer many more settings that can be tuned per VPA resource (unlike HPA’s behavior section). However, there is one more not shown above, which allows scaling only up or only down (evictionRequirements[].changeRequirement), in case you need that, e.g. to provide resources when needed but avoid disruptions otherwise.
VPA Options
VPA is an independent community project that consists of a recommender (computing target recommendations and bounds), an updater (evicting pods that are out of recommendation bounds), and an admission controller (a mutating webhook applying the target recommendation to newly created pods). As such, they have independent options. The most important recommender options are:
recommendationMarginFraction (default 15%): Safety margin that will be added to the recommended requests.
targetCPUPercentile (default 90%): CPU usage percentile that will be targeted with the CPU recommendation (i.e. the recommendation will “fit” e.g. 90% of the observed CPU usages). This setting is relevant for balancing your requests reservations vs. your costs. If you want to reduce costs, you can reduce this value (higher risk because of potential under-reservation, but lower costs), because CPU is compressible, but then VPA may lack the necessary signals for scale-up, as throttling on an otherwise fully utilized node will go unnoticed by VPA. If you want to err on the safe side, you can increase this value, but you will then increasingly target a worst-case scenario, quickly (maybe even exponentially) increasing the costs.
targetMemoryPercentile (default 90%): Memory usage percentile that will be targeted with the memory recommendation (i.e. the recommendation will “fit” e.g. 90% of the observed memory usages). This setting is relevant for balancing your requests reservations vs. your costs. If you want to reduce costs, you can reduce this value (higher risk because of potential under-reservation, but lower costs), because OOMs will trigger bump-ups, but those will disrupt the workload. If you want to err on the safe side, you can increase this value, but you will then increasingly target a worst-case scenario, quickly (maybe even exponentially) increasing the costs.
There are a few more configurable options of lesser interest:
recommenderInterval (default 1m): How often VPA retrieves pods and metrics, and how often it recomputes the recommendations and bounds.
There are many more options that you can configure only if you deploy your own VPA; we will not discuss them here, but you can check them out here.
Note
Due to an implementation detail (smallest bucket size), VPA cannot create recommendations below 10m cores and 10M memory even if minAllowed is lower.
The most important updater options are:
evictAfterOOMThreshold (default 10m): Pods where at least one container OOMs within this time period since its start will be actively evicted, which will implicitly apply the new target recommendation that will have been bumped up after the OOMKill. Please note, the kubelet may evict pods even before an OOM, but only if kube-reserved is underrun, i.e. node-level resources are running low. In these cases, eviction happens first by pod priority and second by how much the usage overruns the requests.
evictionTolerance (default 50%): Defines a threshold below which no further eligible pod will be evicted, i.e. it limits how many eligible pods may be in eviction in parallel (but at least 1). The threshold is computed as follows: running - evicted > replicas - tolerance. Example: 10 replicas, 9 running, 8 eligible for eviction, 20% tolerance with 10 replicas (which amounts to 2 pods), and no pod evicted in this round yet; then 9 - 0 > 10 - 2 is true and a pod would be evicted, but the next one would violate the threshold, as 9 - 1 = 10 - 2 is no longer strictly greater, so no further pod would be evicted in this round.
evictionRateBurst (default 1): Defines how many eligible pods may be evicted in one go.
evictionRateLimit (default disabled): Defines how many eligible pods may be evicted per second (a value of 0 or -1 disables the rate limiting).
There are a few more configurable options of lesser interest:
updaterInterval (default 1m): How often VPA evicts the pods.
There are many more options that you can configure only if you deploy your own VPA; we will not discuss them here, but you can check them out here. Some of the options above are exposed via the Shoot specification, as sketched below.
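For illustration, a minimal sketch of tuning some of these options through the Shoot specification (the spec.kubernetes.verticalPodAutoscaler fields shown are assumptions; consult the Shoot API reference for the authoritative list):

```yaml
# Sketch: VPA option overrides in a Shoot manifest (field availability assumed)
spec:
  kubernetes:
    verticalPodAutoscaler:
      enabled: true
      recommendationMarginFraction: 0.15   # safety margin on recommendations
      evictAfterOOMThreshold: 10m          # evict pods that OOM early after start
      evictionTolerance: 0.5               # limit parallel evictions
      evictionRateBurst: 1                 # evictions per burst
```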
Considerations When Using VPA
Initial Resource Estimates: VPA requires historical resource usage data to base its recommendations on. Until its recommendations kick in, your initial resource requests apply and should therefore be sensible.
Pod Disruption: When VPA adjusts the resources for a pod, it may need to “recreate” the pod, which can cause temporary disruptions. This should be taken into account.
Compatibility with HPA: Care must be taken when using VPA in conjunction with HPA, as they can potentially interfere with each other’s scaling decisions.
Combining HPA and VPA
HPA and VPA serve different purposes and operate on different axes of scaling. HPA increases or decreases the number of pod replicas based on metrics like CPU or memory usage, effectively scaling the application out or in. VPA, on the other hand, adjusts the CPU and memory reservations of individual pods, scaling the application up or down.
When used together, these autoscalers can provide both horizontal and vertical scaling. However, they can also conflict with each other if used on the same metrics (e.g. both on CPU or both on memory). In particular, if VPA adjusts the requests, the utilization, i.e. the ratio between usage and requests, will approach 100% (for various reasons not exactly right, but for this consideration close enough). This may trigger HPA to scale out, if it’s configured to scale on utilization below 100% (often seen in simple examples), which spreads the load across more pods, which may in turn trigger VPA again to adjust the requests to match the new pod usages.
If desiredMetricValue is utilization and VPA adjusts the requests, which changes the utilization, this may inadvertently trigger HPA and create said feedback loop. On the other hand, if desiredMetricValue is usage and VPA adjusts the requests now, this will have no impact on HPA anymore (HPA will always influence VPA, but we can control whether VPA influences HPA).
Therefore, to safely combine HPA and VPA, consider the following strategies:
Configure HPA and VPA on different metrics: One way to avoid conflicts is to use HPA and VPA based on different metrics. For instance, you could configure HPA to scale based on requests per seconds (or another representative custom/external metric) and VPA to adjust CPU and memory requests. This way, each autoscaler operates independently based on its specific metric(s).
Configure HPA to scale on usage, not utilization, when used with VPA: Another way to avoid conflicts is to use HPA not on average utilization (averageUtilization), but instead on average usage (averageValue) as the replicas driver, which is an absolute metric (requests don’t affect usage). This way, you can combine both autoscalers even on the same metrics (see the fragment below).
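For illustration, a metrics fragment of an HPA resource that scales on absolute CPU usage rather than utilization (the value is a placeholder):

```yaml
# Sketch: target usage (AverageValue) instead of utilization
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: AverageValue     # absolute usage; unaffected by VPA changing requests
      averageValue: 500m
      # a Utilization target (averageUtilization) would depend on the current
      # requests and thus conflict with VPA
```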
Pod Autoscaling and Cluster Autoscaler
Autoscaling within Kubernetes can be implemented at different levels: pod autoscaling (HPA and VPA) and cluster autoscaling (CA). While pod autoscaling adjusts the number of pod replicas or their resource reservations, cluster autoscaling focuses on the number of nodes in the cluster, so that your pods can be hosted. If your workload isn’t static and especially if you make use of pod autoscaling, it only works if you have sufficient node capacity available. The most effective way to do that, without running a worst-case number of nodes, is to configure burstable worker pools in your shoot spec, i.e. define a true minimum node count and a worst-case maximum node count and leave the node autoscaling to Gardener that internally uses the Cluster Autoscaler to provision and deprovision nodes as needed.
Cluster Autoscaler automatically adjusts the number of nodes by adding or removing nodes based on the demands of the workloads and the available resources. It interacts with the cloud provider’s APIs to provision or deprovision nodes as needed. Cluster Autoscaler monitors the utilization of nodes and the scheduling of pods. If it detects that pods cannot be scheduled due to a lack of resources, it will trigger the addition of new nodes to the cluster. Conversely, if nodes are underutilized for some time and their pods can be placed on other nodes, it will remove those nodes to reduce costs and improve resource efficiency.
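For illustration, a minimal sketch of a burstable worker pool in a Shoot specification (pool name, machine type, and counts are placeholders):

```yaml
# Sketch: a burstable worker pool; Gardener's CA scales between minimum and maximum
spec:
  provider:
    workers:
    - name: worker-pool-a      # illustrative
      machine:
        type: m5.large         # illustrative machine type
      minimum: 2               # true minimum node count
      maximum: 10              # worst-case maximum node count
```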
Pod Disruption Budgets (PDBs): Use PDBs to ensure that during scale-down events, the availability of applications is maintained as the Cluster Autoscaler will not voluntarily evict a pod if a PDB would be violated.
The following CA options can be configured globally:
expander (default least-waste): Defines the “expander” algorithm to use during scale-up, see FAQ.
scaleDownDelayAfterAdd (default 1h): Defines how long after scaling up a node, a node may be scaled down.
scaleDownDelayAfterFailure (default 3m): Defines how long scale-down is paused after a failed scale-down.
scaleDownDelayAfterDelete (default 0s): Defines how long after scaling down a node, another node may be scaled down.
The following options can be configured globally and also overridden individually per worker pool (see the sketch after this list):
scaleDownUtilizationThreshold (default 50%): Defines the threshold below which a node becomes eligible for scaling down.
scaleDownUnneededTime (default 30m): Defines the trailing time window the node must be consistently below a certain utilization threshold before it can finally be scaled down.
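For illustration, a minimal sketch of setting these options in a Shoot specification, including a per-pool override (the exact field names are assumptions; consult the Shoot API reference):

```yaml
# Sketch: global CA options plus a per-worker-pool override (fields assumed)
spec:
  kubernetes:
    clusterAutoscaler:
      expander: least-waste
      scaleDownUtilizationThreshold: 0.5
      scaleDownUnneededTime: 30m
  provider:
    workers:
    - name: worker-pool-a                      # illustrative
      clusterAutoscaler:
        scaleDownUtilizationThreshold: 0.7     # per-pool override
```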
There are many more options that you can only configure if you deploy your own CA and which we will not discuss here, but you can check them out here.
Importance of Monitoring
Monitoring is a critical component of autoscaling for several reasons:
Performance Insights: It provides insights into how well your autoscaling strategy is meeting the performance requirements of your applications.
Resource Utilization: It helps you understand resource utilization patterns, enabling you to optimize resource allocation and reduce waste.
Cost Management: It allows you to track the cost implications of scaling actions, helping you to maintain control over your cloud spending.
Troubleshooting: It enables you to quickly identify and address issues with autoscaling, such as unexpected scaling behavior or resource bottlenecks.
To effectively monitor autoscaling, you should leverage the following tools and metrics:
Kubernetes Metrics Server: Collects resource metrics from kubelets and provides them to HPA and VPA for autoscaling decisions (automatically provided by Gardener).
Prometheus: An open-source monitoring system that can collect and store custom metrics, providing a rich dataset for autoscaling decisions.
Grafana/Plutono: A visualization tool that integrates with Prometheus to create dashboards for monitoring autoscaling metrics and events.
Cloud Provider Tools: Most cloud providers offer native monitoring solutions that can be used to track the performance and costs associated with autoscaling.
Key metrics to monitor include:
CPU and Memory Utilization: Track the resource utilization of your pods and nodes to understand how they correlate with scaling events.
Pod Count: Monitor the number of pod replicas over time to see how HPA is responding to changes in load.
Scaling Events: Keep an eye on scaling events triggered by HPA and VPA to ensure they align with expected behavior.
Application Performance Metrics: Track application-specific metrics such as response times, error rates, and throughput.
Based on the insights gained from monitoring, you may need to adjust your autoscaling configurations:
Refine Thresholds: If you notice frequent scaling actions or periods of underutilization or overutilization, adjust the thresholds used by HPA and VPA to better match the workload patterns.
Update Policies: Modify VPA update policies if you observe that the current settings are causing too much or too little pod disruption.
Custom Metrics: If using custom metrics, ensure they accurately reflect the load on your application and adjust them if they do not.
Scaling Limits: Review and adjust the minimum and maximum scaling limits to prevent over-scaling or under-scaling based on the capacity of your cluster and the criticality of your applications.
Quality of Service (QoS)
BestEffort, i.e. pods where no container has CPU or memory requests or limits: Avoid them unless you have really good reasons. The kube-scheduler will place them just anywhere according to its policy, e.g. balanced or bin-packing, but whatever resources these pods consume may bring other pods into trouble, or even the kubelet and the container runtime itself, if it happens very suddenly.
Burstable, i.e. pods where at least one container has CPU or memory requests and at least one has no limits or limits that don’t match the requests: Prefer them unless you have really good reasons for the other QoS classes. Always specify proper requests or use VPA to recommend those. This helps the kube-scheduler to make the right scheduling decisions. Not having limits will additionally provide upward resource flexibility, if the node is not under pressure.
Guaranteed, i.e. pods where all containers have CPU and memory requests and equal limits: Avoid them unless you really know the limits or throttling/killing is intended. While “Guaranteed” sounds like something “positive” in the English language, this class comes with the downside that pods will be actively CPU-throttled and will actively go OOM, even if the node is not under pressure and has excess capacity left. Worse, if containers in the pod are under VPA, their CPU requests/limits will often not be scaled up, as CPU throttling will go unnoticed by VPA.
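For illustration, the recommended Burstable pattern in a container spec is to set requests (or let VPA manage them) and omit limits (values are placeholders):

```yaml
# Sketch: QoS Burstable - requests set, limits omitted
resources:
  requests:
    cpu: 100m
    memory: 256M
```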
Summary
As a rule of thumb, always set CPU and memory requests (or let VPA do that) and always avoid CPU and memory limits.
CPU limits aren’t helpful on an under-utilized node (=may result in needless outages) and even suppress the signals for VPA to act. On a nearly or fully utilized node, CPU limits are practically irrelevant as only the requests matter, which are translated into CPU shares that provide a fair use of the CPU anyway (see CFS). Therefore, if you do not know the healthy range, do not set CPU limits. If you as author of the source code know its healthy range, set them to the upper threshold of that healthy range (everything above, from your knowledge of that code, is definitely an unbound busy loop or similar, which is the main reason for CPU limits, besides batch jobs where throttling is acceptable or even desired).
Memory limits may be more useful, but suffer from a similar, though less severe, downside. As with CPU limits, memory limits aren’t helpful on an under-utilized node (=may result in needless outages), but different than CPU limits, they result in an OOM, which triggers VPA to provide more memory suddenly (it modifies the currently computed recommendations by a configurable factor, defaulting to +20%, see docs). Therefore, if you do not know the healthy range, do not set memory limits. If you as author of the source code know its healthy range, set them to the upper threshold of that healthy range (everything above, from your knowledge of that code, is definitely an unbound memory leak or similar, which is the main reason for memory limits).
Horizontal Pod Autoscaling (HPA): Use for pods that support horizontal scaling. Prefer scaling on usage, not utilization, as this is more predictable (not dependent on a second variable, namely the current requests) and conflict-free with vertical pod autoscaling (VPA).
As a rule of thumb, set the initial replicas to the 5th percentile of the actually observed replica count in production. Since HPA reacts fast, this is not as critical, but it may help reduce initial load on the control plane early after deployment. However, be cautious when you update the higher-level resource not to inadvertently reset the current HPA-controlled replica count (a very easy mistake to make that can lead to a catastrophic loss of pods). HPA modifies the replica count directly in the spec and you do not want to overwrite that. Even though HPA reacts fast, it is not instant (it does not operate via a mutating webhook as VPA does) and the damage may already be done.
As for minimum and maximum, let your high availability requirements determine the minimum and your theoretical maximum load determine the maximum, flanked with alerts to detect erroneous run-away out-scaling or the actual nearing of your practical maximum load, so that you can intervene.
Vertical Pod Autoscaling (VPA): Use for containers that have a significant usage (e.g. any container above 50m CPU or 100M memory) and a significant usage spread over time (by more than 2x), i.e. ignore small (e.g. side-cars) or static (e.g. Java statically allocated heap) containers, but otherwise use it to provide the resources needed on the one hand and keep the costs in check on the other hand.
As a rule of thumb, set the initial requests to the 5th percentile of the actually observed CPU resp. memory usage in production. Since VPA may need some time at first to respond and evict pods, this is especially critical early after deployment. The lower bound, below which pods will be immediately evicted, converges much faster than the upper bound, above which pods will be immediately evicted, but it isn’t instant, e.g. after 5 minutes the lower bound is just at 60% of the computed lower bound; after 12 hours the upper bound is still at 300% of the computed upper bound (see code). Unlike with HPA, you don’t need to be as cautious when updating the higher-level resource in the case of VPA. As long as VPA’s mutating webhook (VPA Admission Controller) is operational (which also the VPA Updater checks before evicting pods), it’s generally safe to update the higher-level resource. However, if it’s not up and running, any new pods that are spawned (e.g. as a consequence of a rolling update of the higher-level resource or for any other reason) will not be mutated. Instead, they will receive whatever requests are currently configured at the higher-level resource, which can lead to catastrophic resource under-reservation. Gardener deploys the VPA Admission Controller in HA - if unhealthy, it is reported under the ControlPlaneHealthy shoot status condition.
If you have defined absolute limits (unrelated to the requests), configure VPA to only scale the requests, or else it will proportionally scale the limits as well, which can easily become useless (way beyond the threshold of unhealthy behavior) or absurd (larger than node capacity). A minimal fragment (placed in the VPA resource’s resourcePolicy):
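```yaml
# Sketch: restrict VPA to scaling requests only, leaving limits untouched
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledValues: RequestsOnly
```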
If you have defined relative limits (related to the requests), the default policy to scale the limits proportionally with the requests is fine, but the gap between requests and limits must be zero for QoS Guaranteed and should best be small for QoS Burstable to avoid useless or absurd limits either, e.g. prefer limits being 5 to at most 20% larger than requests as opposed to being 100% larger or more.
As a rule of thumb, set minAllowed to the highest observed VPA recommendation (usually during the initialization phase or during any periodical activity) for an otherwise practically idle container, so that you avoid needless thrashing (e.g. resource usage calms down over time and recommendations drop consecutively until eviction, which then leads again to initialization or later periodical activity, higher recommendations, and new evictions). ⚠️ You may want to provide higher minAllowed values if you observe that up-scaling takes too long for CPU or memory for too large a percentile of your workload. This will get you out of the danger zone of too few resources for too many pods at the expense of providing too many resources for a few pods. Memory may react faster than CPU, because CPU throttling is not visible and memory gets aided by OOM bump-up incidents, but still, if you observe that up-scaling takes too long, you may want to increase minAllowed accordingly.
As a rule of thumb, set maxAllowed to your theoretical maximum load, flanked with alerts to detect erroneous run-away usage or the actual nearing of your practical maximum load, so that you can intervene. However, VPA can easily recommend requests larger than what is allocatable on a node, so you must either ensure large enough nodes (Gardener can scale up from zero, in case you like to define a low-priority worker pool with more resources for very large pods) and/or cap VPA’s target recommendations using maxAllowed at the node allocatable remainder (after daemon set pods) of the largest eligible machine type (may result in under-provisioning resources for a pod). Use your monitoring and check maximum pod usage to decide about the maximum machine type.
Recommendations in a Box
| Container | When to use | Value |
|-----------|-------------|-------|
| Requests | Set them (recommended) unless:<br>• the pod shall be QoS BestEffort; useful only if the pod can be evicted as often as needed and can pick up where it left off without any penalty | Set requests to the 95th percentile (w/o VPA) resp. the 5th percentile (w/ VPA) of the actually observed CPU resp. memory usage in production (see below) |
| Limits | Avoid them (recommended) unless:<br>• set limits for QoS Guaranteed; useful only if the pod has strictly static resource requirements<br>• set CPU limits if you want to throttle CPU usage for containers that can be throttled w/o any other disadvantage than processing time (never do that when time-critical operations like leases are involved)<br>• set limits if you know the healthy range and want to shield against unbound busy loops, unbound memory leaks, or similar | If you really can (otherwise not), set limits to the healthy theoretical max load |

| Scaler | When to use | Initial | Minimum | Maximum |
|--------|-------------|---------|---------|---------|
| HPA | Use for pods that support horizontal scaling | Set initial replicas to the 5th percentile of the actually observed replica count in production (prefer scaling on usage, not utilization) and make sure to never overwrite it later when controlled by HPA | Set minReplicas to 0 (requires feature gate and custom/external metrics), to 1 (regular HPA minimum), or to whatever the high availability requirements of the workload demand | Set maxReplicas to the healthy theoretical max load |
| VPA | Use for containers that have a significant usage (>50m/100M) and a significant usage spread over time (>2x) | Set initial requests to the 5th percentile of the actually observed CPU resp. memory usage in production | Set minAllowed to the highest observed VPA recommendation (includes start-up phase) for an otherwise practically idle container (avoids pod thrashing when the pod gets evicted after idling) | Set maxAllowed to the fresh node allocatable remainder after daemon set pods (avoids pending pods when requests exceed the fresh node allocatable remainder) or, if you really can (otherwise not), to the healthy theoretical max load (less disruptive than limits as no throttling or OOM happens on under-utilized nodes) |
| CA | Use for dynamic workloads, definitely if you use HPA and/or VPA | N/A | Set minimum to 0 or to the number of nodes required right after cluster creation or wake-up | Set maximum to the healthy theoretical max load |
Note
Theoretical max load may be very difficult to ascertain, especially with modern software that consists of building blocks you do not own or know in detail. If you have comprehensive monitoring in place, you may be tempted to pick the observed maximum and add a safety margin or even factor on top (2x, 4x, or any other number), but this is not to be confused with “theoretical max load” (solely depending on the code, not observations from the outside). At any point in time, your numbers may change, e.g. because you updated a software component or your usage increased. If you decide to use numbers that are set based only on observations, make sure to flank those numbers with monitoring alerts, so that you have sufficient time to investigate, revise, and readjust if necessary.
Conclusion
Pod autoscaling is a dynamic and complex aspect of Kubernetes, but it is also one of the most powerful tools at your disposal for maintaining efficient, reliable, and cost-effective applications. By carefully selecting the appropriate autoscaler, setting well-considered thresholds, and continuously monitoring and adjusting your strategies, you can ensure that your Kubernetes deployments are well-equipped to handle your resource demands while not over-paying for the provided resources at the same time.
As Kubernetes continues to evolve (e.g. in-place updates) and as new patterns and practices emerge, the approaches to autoscaling may also change. However, the principles discussed above will remain foundational to creating scalable and resilient Kubernetes workloads. Whether you’re a developer or operations engineer, a solid understanding of pod autoscaling will be instrumental in the successful deployment and management of containerized applications.
4 - Concepts
4.1 - APIServer Admission Plugins
A list of all gardener managed admission plugins together with their responsibilities
Overview
Similar to the kube-apiserver, the gardener-apiserver comes with a few in-tree managed admission plugins.
If you want to get an overview of the what and why of admission plugins then this document might be a good start.
This document lists all existing admission plugins with a short explanation of what it is responsible for.
ClusterOpenIDConnectPreset, OpenIDConnectPreset
(both enabled by default)
These admission controllers react on CREATE operations for Shoots.
If the Shoot does not specify any OIDC configuration (.spec.kubernetes.kubeAPIServer.oidcConfig=nil), then it tries to find a matching ClusterOpenIDConnectPreset or OpenIDConnectPreset, respectively.
If there are multiple matches, then the one with the highest weight “wins”.
In this case, the admission controller will default the OIDC configuration in the Shoot.
ControllerRegistrationResources
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for ControllerRegistrations.
It validates that there exists only one ControllerRegistration in the system that is primarily responsible for a given kind/type resource combination.
This prevents misconfiguration by the Gardener administrator/operator.
CustomVerbAuthorizer
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for Projects and NamespacedCloudProfiles.
For Projects it validates whether the user is bound to an RBAC role with the modify-spec-tolerations-whitelist verb in case the user tries to change the .spec.tolerations.whitelist field of the respective Project resource.
Usually, regular project members are not bound to this custom verb, allowing the Gardener administrator to manage certain toleration whitelists on Project basis.
For NamespacedCloudProfiles, the modification of specific fields also requires the user to be bound to an RBAC role with custom verbs.
Please see this document for more information.
DeletionConfirmation
(enabled by default)
This admission controller reacts on DELETE operations for Projects, Shoots, and ShootStates.
It validates that the respective resource is annotated with a deletion confirmation annotation, namely confirmation.gardener.cloud/deletion=true.
Only if this annotation is present it allows the DELETE operation to pass.
This prevents users from accidental/undesired deletions.
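For illustration, the confirmation annotation on a resource to be deleted looks as follows:

```yaml
metadata:
  annotations:
    confirmation.gardener.cloud/deletion: "true"
```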
In addition, it applies the “four-eyes principle for deletion” concept if the Project is configured accordingly.
Find all information about it in this document.
Furthermore, this admission controller reacts on CREATE or UPDATE operations for Shoots.
It makes sure that the deletion.gardener.cloud/confirmed-by annotation is properly maintained in case the Shoot deletion is confirmed with above mentioned annotation.
ExposureClass
(enabled by default)
This admission controller reacts on CREATE operations for Shoots.
It mutates Shoot resources which reference an ExposureClass by merging the ExposureClass’s shootSelectors and/or tolerations into the Shoot resource.
ExtensionValidator
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for BackupEntrys, BackupBuckets, Seeds, and Shoots.
For all the various extension types in the specifications of these objects, it validates whether there exists a ControllerRegistration in the system that is primarily responsible for the stated extension type(s).
This prevents misconfigurations that would otherwise allow users to create such resources with extension types that don’t exist in the cluster, effectively leading to failing reconciliation loops.
ExtensionLabels
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for BackupBuckets, BackupEntrys, CloudProfiles, NamespacedCloudProfiles, Seeds, SecretBindings, CredentialsBindings, WorkloadIdentitys and Shoots. For all the various extension types in the specifications of these objects, it adds a corresponding label in the resource. This would allow extension admission webhooks to filter out the resources they are responsible for and ignore all others. This label is of the form <extension-type>.extensions.gardener.cloud/<extension-name> : "true". For example, an extension label for provider extension type aws, looks like provider.extensions.gardener.cloud/aws : "true".
ProjectValidator
(enabled by default)
This admission controller reacts on CREATE operations for Projects.
It prevents creating Projects with a non-empty .spec.namespace if the value in .spec.namespace does not start with garden-.
⚠️ This admission plugin will be removed in a future release and its business logic will be incorporated into the static validation of the gardener-apiserver.
ResourceQuota
(enabled by default)
This admission controller enables object count ResourceQuotas for Gardener resources, e.g. Shoots, SecretBindings, Projects, etc.
⚠️ In addition to this admission plugin, the ResourceQuota controller must be enabled for the Kube-Controller-Manager of your Garden cluster.
ResourceReferenceManager
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for CloudProfiles, Projects, SecretBindings, Seeds, and Shoots.
Generally, it checks whether referred resources stated in the specifications of these objects exist in the system (e.g., if a referenced Secret exists).
However, it also has some special behaviours for certain resources:
CloudProfiles: It rejects removing Kubernetes or machine image versions if there is at least one Shoot that refers to them.
Projects: It sets the .spec.createdBy field for newly created Project resources, and defaults the .spec.owner field in case it is empty (to the same value of .spec.createdBy).
Shoots: It sets the gardener.cloud/created-by=<username> annotation for newly created Shoot resources.
SeedValidator
(enabled by default)
This admission controller reacts on DELETE operations for Seeds.
It rejects the deletion if Shoot(s) reference the seed cluster.
ShootDNS
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for Shoots.
It tries to assign a default domain to the Shoot.
It also validates the DNS configuration (.spec.dns) for shoots.
ShootNodeLocalDNSEnabledByDefault
(disabled by default)
This admission controller reacts on CREATE operations for Shoots.
If enabled, it will enable node local dns within the shoot cluster (for more information, see NodeLocalDNS Configuration) by setting spec.systemComponents.nodeLocalDNS.enabled=true for newly created Shoots.
Already existing Shoots and new Shoots that explicitly disable node local dns (spec.systemComponents.nodeLocalDNS.enabled=false)
will not be affected by this admission plugin.
ShootQuotaValidator
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for Shoots.
It validates the resource consumption declared in the specification against applicable Quota resources.
Only if the applicable Quota resources admit the configured resources in the Shoot then it allows the request.
Applicable Quotas are those referenced in the SecretBinding that is used by the Shoot.
ShootResourceReservation
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for Shoots.
It injects the Kubernetes.Kubelet.KubeReserved setting for kubelet either as global setting for a shoot or on a per worker pool basis.
If the admission configuration (see this example) for the ShootResourceReservation plugin contains useGKEFormula: false (the default), then it sets a static default resource reservation for the shoot.
If useGKEFormula: true is set, then the plugin injects resource reservations based on the machine type similar to GKE’s formula for resource reservation into each worker pool.
Already existing resource reservations are not modified; this also means that resource reservations are not automatically updated if the machine type for a worker pool is changed.
If a shoot contains global resource reservations, then no per worker pool resource reservations are injected.
By default, useGKEFormula: true applies to all Shoots.
Operators can provide an optional label selector via the selector field to limit which Shoots get worker specific resource reservations injected.
ShootVPAEnabledByDefault
(disabled by default)
This admission controller reacts on CREATE operations for Shoots.
If enabled, it will enable the managed VerticalPodAutoscaler components (for more information, see Vertical Pod Auto-Scaling)
by setting spec.kubernetes.verticalPodAutoscaler.enabled=true for newly created Shoots.
Already existing Shoots and new Shoots that explicitly disable VPA (spec.kubernetes.verticalPodAutoscaler.enabled=false)
will not be affected by this admission plugin.
ShootTolerationRestriction
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for Shoots.
It validates the .spec.tolerations used in Shoots against the whitelist of its Project, or against the whitelist configured in the admission controller’s configuration, respectively.
Additionally, it defaults the .spec.tolerations in Shoots with those configured in its Project, and those configured in the admission controller’s configuration, respectively.
ShootValidator
(enabled by default)
This admission controller reacts on CREATE, UPDATE and DELETE operations for Shoots.
It validates certain configurations in the specification against the referred CloudProfile (e.g., machine images, machine types, used Kubernetes version, …).
Generally, it performs validations that cannot be handled by the static API validation due to their dynamic nature (e.g., when something needs to be checked against referred resources).
Additionally, it takes over certain defaulting tasks (e.g., default machine image for worker pools, default Kubernetes version).
ShootManagedSeed
(enabled by default)
This admission controller reacts on UPDATE and DELETE operations for Shoots.
It validates certain configuration values in the specification that are specific to ManagedSeeds (e.g. the nginx-addon of the Shoot has to be disabled, the Shoot VPA has to be enabled).
It rejects the deletion if the Shoot is referred to by a ManagedSeed.
ManagedSeedValidator
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for ManagedSeeds.
It validates certain configuration values in the specification against the referred Shoot, for example Seed provider, network ranges, DNS domain, etc.
Similar to ShootValidator, it performs validations that cannot be handled by the static API validation due to their dynamic nature.
Additionally, it performs certain defaulting tasks, making sure that configuration values that are not specified are defaulted to the values of the referred Shoot, for example Seed provider, network ranges, DNS domain, etc.
ManagedSeedShoot
(enabled by default)
This admission controller reacts on DELETE operations for ManagedSeeds.
It rejects the deletion if there are Shoots that are scheduled onto the Seed that is registered by the ManagedSeed.
ShootDNSRewriting
(disabled by default)
This admission controller reacts on CREATE operations for Shoots.
If enabled, it adds a set of common suffixes configured in its admission plugin configuration to the Shoot (spec.systemComponents.coreDNS.rewriting.commonSuffixes) (for more information, see DNS Search Path Optimization).
Already existing Shoots will not be affected by this admission plugin.
NamespacedCloudProfileValidator
(enabled by default)
This admission controller reacts on CREATE and UPDATE operations for NamespacedCloudProfiles.
It primarily validates if the referenced parent CloudProfile exists in the system. In addition, the admission controller ensures that the NamespacedCloudProfile only configures new machine types, and does not overwrite those from the parent CloudProfile.
4.2 - Architecture
The concepts behind the Gardener architecture
Official Definition - What is Kubernetes?
“Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.”
Introduction - Basic Principle
The foundation of the Gardener (providing Kubernetes Clusters as a Service) is Kubernetes itself, because Kubernetes is the go-to solution to manage software in the Cloud, even when it’s Kubernetes itself (see also OpenStack which is provisioned more and more on top of Kubernetes as well).
While self-hosting, meaning to run Kubernetes components inside Kubernetes, is a popular topic in the community, we apply a special pattern catering to the needs of our cloud platform to provision hundreds or even thousands of clusters. We take a so-called “seed” cluster and seed the control plane (such as the API server, scheduler, controllers, etcd persistence and others) of an end-user cluster, which we call “shoot” cluster, as pods into the “seed” cluster. That means that one “seed” cluster, of which we will have one per IaaS and region, hosts the control planes of multiple “shoot” clusters. That allows us to avoid dedicated hardware/virtual machines for the “shoot” cluster control planes. We simply put the control plane into pods/containers and since the “seed” cluster watches them, they can be deployed with a replica count of 1 and only need to be scaled out when the control plane gets under pressure, but no longer for HA reasons. At the same time, the deployments get simpler (standard Kubernetes deployment) and easier to update (standard Kubernetes rolling update). The actual “shoot” cluster consists only of the worker nodes (no control plane) and therefore the users may get full administrative access to their clusters.
Setting The Scene - Components and Procedure
We provide a central operator UI, which we call the “Gardener Dashboard”. It talks to a dedicated cluster, which we call the “Garden” cluster, and uses custom resources managed by an aggregated API server (one of the general extension concepts of Kubernetes) to represent “shoot” clusters. In this “Garden” cluster runs the “Gardener”, which is basically a Kubernetes controller that watches the custom resources and acts upon them, i.e. creates, updates/modifies, or deletes “shoot” clusters. The creation follows basically these steps:
Create a namespace in the “seed” cluster for the “shoot” cluster, which will host the “shoot” cluster control plane.
Generate secrets and credentials, which the worker nodes will need to talk to the control plane.
Create the infrastructure (using Terraform), which basically consists of the network setup.
Deploy the “shoot” cluster control plane into the “shoot” namespace in the “seed” cluster, containing the “machine-controller-manager” pod.
Create machine CRDs in the “seed” cluster, describing the configuration and the number of worker machines for the “shoot” (the machine-controller-manager watches the CRDs and creates virtual machines out of them).
Wait for the “shoot” cluster API server to become responsive (pods will be scheduled, persistent volumes and load balancers are created by Kubernetes via the respective cloud provider).
Finally, we deploy kube-system daemons like kube-proxy and further add-ons like the dashboard into the “shoot” cluster and the cluster becomes active.
Overview Architecture Diagram
Detailed Architecture Diagram
Note: The kubelet, as well as the pods inside the “shoot” cluster, talks through the front-door (load balancer IP; public Internet) to its “shoot” cluster API server running in the “seed” cluster. The reverse communication from the API server to the pod, service, and node networks happens through a VPN connection that we deploy into the “seed” and “shoot” clusters.
4.3 - Backup and Restore
Understand the etcd backup and restore capabilities of Gardener
Overview
Kubernetes uses etcd as the key-value store for its resource definitions. Gardener supports the backup and restore of etcd. It is the responsibility of the shoot owners to backup the workload data.
Gardener uses an etcd-backup-restore component to backup the etcd backing the Shoot cluster regularly and restore it in case of disaster. It is deployed as a sidecar via etcd-druid. This doc mainly focuses on the backup and restore configuration used by Gardener when deploying these components. For more details on the design and internal implementation details, please refer to GEP-06 and the documentation on individual repositories.
etcd-backup-restore supports full snapshots and delta snapshots on top of full snapshots. In Gardener, this configuration is currently hard-coded to the following parameters:
Full Snapshot schedule:
Daily, 24hr interval.
For each Shoot, the schedule time in a day is randomized based on the configured Shoot maintenance window.
Delta Snapshot schedule:
At 5 min intervals.
If the aggregated events size since the last snapshot exceeds 100 MiB.
Backup History / Garbage backup deletion policy:
Gardener configures etcd-backup-restore to use an Exponential garbage collection policy.
As per policy, the following backups are retained:
All full backups and delta backups for the previous hour.
Latest full snapshot of each previous hour for the day.
Latest full snapshot of each previous day for 7 days.
Latest full snapshot of the previous 4 weeks.
Garbage Collection is configured at 12hr interval.
Listing:
Gardener doesn’t have any API to list the backups.
To find the list of backups, an admin can check out the BackupEntry resource associated with the Shoot, which holds the bucket and prefix details on the object store.
Restoration
The restoration process of etcd is automated through the etcd-backup-restore component from the latest snapshot. Gardener doesn’t support Point-In-Time-Recovery (PITR) of etcd. In case of an etcd disaster, etcd is recovered from the latest backup automatically. For further details, please refer to the Restoration topic. Post restoration of etcd, the Shoot reconciliation loop brings the cluster back to its previous state.
Again, the Shoot owners are responsible for maintaining the backup/restore of their workload. Gardener only takes care of the cluster’s etcd.
4.4 - Cluster API
Understand the evolution of the Gardener API and its relation to the Cluster API
Relation Between Gardener API and Cluster API (SIG Cluster Lifecycle)
In essence, the Cluster API harmonizes how to get to clusters, while Gardener goes one step further and also harmonizes the clusters themselves. The Cluster API delegates the specifics to so-called providers for infrastructures or control planes via specific CR(D)s, while Gardener only has one cluster CR(D). Different Cluster API providers, e.g. for AWS, Azure, GCP, etc., give you vastly different Kubernetes clusters. In contrast, Gardener gives you the exact same clusters with the exact same K8s version, operating system, control plane configuration like for API server or kubelet, add-ons like overlay network, HPA/VPA, DNS and certificate controllers, ingress and network policy controllers, control plane monitoring and logging stacks, down to the behavior of update procedures, auto-scaling, self-healing, etc., on all supported infrastructures. These homogeneous clusters are an essential goal for Gardener, as its main purpose is to simplify operations for teams that need to develop and ship software on Kubernetes clusters on a plethora of infrastructures (a.k.a. multi-cloud).
That means that we follow the Cluster API with great interest and are active members. It was completely overhauled from v1alpha1 to v1alpha2. But because v1alpha2 made too many assumptions about the bring-up of masters and was enforcing master machine operations (for more information, see The Cluster API Book: “As of v1alpha2, Machine-Based is the only control plane type that Cluster API supports”), services that managed their control planes differently like GKE or Gardener couldn’t adopt it. In 2020 v1alpha3 was introduced and made it possible (again) to integrate managed services like GKE or Gardener. The mapping from the Gardener API to the Cluster API is mostly syntactic.
To wrap it up, while the Cluster API knows about clusters, it doesn’t know about their make-up. With Gardener, we wanted to go beyond that and harmonize the make-up of the clusters themselves and make them homogeneous across all supported infrastructures. Gardener can therefore deliver homogeneous clusters with exactly the same configuration and behavior on all infrastructures (see also Gardener’s coverage in the official conformance test grid).
With Cluster API v1alpha3 and the support for declarative control plane management, it has become possible (again) to enable Kubernetes managed services like GKE or Gardener. We would be more than happy if the community were interested in contributing a Gardener control plane provider.
4.5 - etcd
How Gardener uses the etcd key-value store
etcd - Key-Value Store for Kubernetes
etcd is a strongly consistent key-value store and the most prevalent choice for the Kubernetes
persistence layer. All API cluster objects like Pods, Deployments, Secrets, etc., are stored in etcd, which
makes it an essential part of a Kubernetes control plane.
Garden or Shoot Cluster Persistence
Each garden or shoot cluster gets its very own persistence for the control plane.
It runs in the shoot namespace on the respective seed cluster (or in the garden namespace in the garden cluster, respectively).
Concretely, there are two etcd instances per shoot cluster, which the kube-apiserver is configured to use in the following way:
etcd-main
A store that contains all “cluster critical” or “long-term” objects.
These object kinds are typically considered for a backup to prevent any data loss.
etcd-events
A store that contains all Event objects (events.k8s.io) of a cluster.
Events usually have a short retention period and occur frequently, but are not essential for a disaster recovery.
The setup above ensures both that the critical etcd-main is not flooded by Kubernetes Events and that backup space is not occupied by non-critical data.
This separation saves time and resources.
etcd Operator
Configuring, maintaining, and health-checking etcd is outsourced to a dedicated operator called etcd Druid.
When a gardenlet reconciles a Shoot resource or a gardener-operator reconciles a Garden resource, they manage an Etcd resource in the seed or garden cluster, containing necessary information (backup information, defragmentation schedule, resources, etc.).
etcd-druid then manages the lifecycle of the desired etcd instances (today main or events).
Likewise, when the Shoot or Garden is deleted, gardenlet or gardener-operator deletes the Etcd resources and etcd Druid takes care of cleaning up all related objects, e.g. the backing StatefulSets.
Backup
If Seeds specify backups for etcd (example), then Gardener and the respective provider extensions are responsible for creating a bucket on the cloud provider’s side (modelled through a BackupBucket resource).
The bucket stores backups of Shoots scheduled on that Seed.
Furthermore, Gardener creates a BackupEntry, which subdivides the bucket and thus makes it possible to store backups of multiple shoot clusters.
How long backups are stored in the bucket after a shoot has been deleted depends on the configured retention period in the Seed resource.
Please see this example configuration for more information.
For Gardens specifying backups for etcd (example), the bucket must be pre-created externally and provided via the Garden specification.
Both etcd instances are configured to run with a special backup-restore sidecar.
It takes care of regularly backing up etcd data and restoring it in case of data loss (in the main etcd only).
The sidecar also performs defragmentation and other house-keeping tasks.
More information can be found in the component’s GitHub repository.
Housekeeping
etcd maintenance tasks must be performed from time to time in order to regain database storage and to ensure the system’s reliability.
The backup-restore sidecar takes care of this job as well.
For both Shoots and Gardens, a random time within the shoot’s maintenance time window is chosen for scheduling these tasks.
4.6 - gardenadm
Bootstrapping and management of autonomous shoot clusters.
Caution
This tool is currently under development and considered highly experimental.
Do not use it in production environments.
Read more about it in GEP-28.
Overview
To be implemented.
4.7 - Gardener Admission Controller
Functions and list of handlers for the Gardener Admission Controller
Overview
While the Gardener API server works with admission plugins to validate and mutate resources belonging to Gardener related API groups, e.g. core.gardener.cloud, the same is needed for resources belonging to non-Gardener API groups as well, e.g. secrets in the core API group.
Therefore, the Gardener Admission Controller runs an http(s) server with the following handlers, which serve as validating/mutating endpoints for admission webhooks.
It is also used to serve http(s) handlers for authorization webhooks.
Admission Webhook Handlers
This section describes the admission webhook handlers that are currently served.
Admission Plugin Secret Validator
In Shoots, admission plugins can reference other files. This validation handler validates the referred admission plugin secret and ensures that the secret always contains the required data key kubeconfig.
Kubeconfig Secret Validator
Malicious Kubeconfigs applied by end users may cause a leakage of sensitive data.
This handler checks if the incoming request contains a Kubernetes secret with a .data.kubeconfig field and denies the request if the Kubeconfig structure violates Gardener’s security standards.
Namespace Validator
Namespaces are the backing entities of Gardener projects in which shoot cluster objects reside.
This validation handler protects active namespaces against premature deletion requests.
Therefore, it denies deletion requests if a namespace still contains shoot clusters or if it belongs to a non-deleting Gardener project (without .metadata.deletionTimestamp).
Resource Size Validator
Since users directly apply Kubernetes native objects to the Garden cluster, it also involves the risk of being vulnerable to DoS attacks because these resources are continuously watched and read by controllers.
One example is the creation of shoot resources with large annotation values (up to 256 kB per value), which can cause severe out-of-memory issues for the gardenlet component.
Vertical autoscaling can help to mitigate such situations, but we cannot expect to scale infinitely, and thus need means to block the attack itself.
The Resource Size Validator checks arbitrary incoming admission requests against a configured maximum size for the resource’s group-version-kind combination. It denies the request if the object exceeds the quota.
Note
The contents of status subresources and metadata.managedFields are not taken into account for the resource size calculation.
Example for Gardener Admission Controller configuration:
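A minimal sketch of such a configuration, assuming the resourceAdmissionConfiguration structure (limits, unrestrictedSubjects, operationMode) described in this section; the concrete sizes and subjects are illustrative:

server:
  resourceAdmissionConfiguration:
    limits:
    - apiGroups: ["core.gardener.cloud"]
      apiVersions: ["*"]
      resources: ["shoots"]
      size: 100k
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["secrets"]
      size: 100k
    unrestrictedSubjects:
    - kind: Group
      name: gardener.cloud:system:seeds
      apiGroup: rbac.authorization.k8s.io
    operationMode: block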
With the configuration above, the Resource Size Validator denies requests for shoots with Gardener’s core API group which exceed a size of 100 kB. The same is done for Kubernetes secrets.
As this feature is meant to protect the system from malicious requests sent by users, it is recommended to exclude trusted groups, users or service accounts from the size restriction via resourceAdmissionConfiguration.unrestrictedSubjects.
For example, the backing user for the gardenlet should always be capable of changing the shoot resource instead of being blocked due to size restrictions.
This is because the gardenlet itself occasionally changes the shoot specification, labels or annotations, and might violate the quota if the existing resource is already close to the quota boundary.
Also, operators are supposed to be trusted users and subjecting them to a size limitation can inhibit important operational tasks.
Wildcard ("*") in subject name is supported.
Size limitations depend on the individual Gardener setup and choosing the wrong values can affect the availability of your Gardener service.
resourceAdmissionConfiguration.operationMode allows controlling whether a violating request is actually denied (default) or only logged.
It’s recommended to start with log, check the logs for exceeding requests, adjust the limits if necessary and finally switch to block.
4.8 - Gardener API Server
Understand the Gardener API server extension and the resources it exposes
Overview
The Gardener API server is a Kubernetes-native extension based on its aggregation layer.
It is registered via an APIService object and designed to run inside a Kubernetes cluster whose API it wants to extend.
After registration, it exposes the following resources:
CloudProfiles
CloudProfiles are resources that describe a specific environment of an underlying infrastructure provider, e.g. AWS, Azure, etc.
Each shoot has to reference a CloudProfile to declare the environment it should be created in.
In a CloudProfile, the gardener operator specifies certain constraints like available machine types, regions, which Kubernetes versions they want to offer, etc.
End-users can read CloudProfiles to see these values, but only operators can change the content or create/delete them.
When a shoot is created or updated, then an admission plugin checks that only allowed values are used via the referenced CloudProfile.
Additionally, a CloudProfile may contain a providerConfig, which is a special configuration dedicated for the infrastructure provider.
Gardener does not evaluate or understand this config, but extension controllers might need it for declaration of provider-specific constraints, or global settings.
Please see this example manifest and consult the documentation of your provider extension controller to get information about its providerConfig.
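For illustration, a trimmed-down CloudProfile could look like the following sketch; the concrete machine types, regions, and providerConfig are hypothetical and depend on your provider extension:

apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: aws
spec:
  type: aws
  kubernetes:
    versions:
    - version: 1.28.2
    - version: 1.27.5
      expirationDate: "2025-12-31T23:59:59Z"
  machineTypes:
  - name: m5.large
    cpu: "2"
    gpu: "0"
    memory: 8Gi
  regions:
  - name: eu-west-1
  providerConfig:
    apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
    kind: CloudProfileConfig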
NamespacedCloudProfiles
In addition to CloudProfiles, NamespacedCloudProfiles exist to enable project-level customizations of CloudProfiles.
Project administrators can create and manage cloud profiles with overrides or extensions specific to their project.
Please see this example manifest and this usage documentation for further information.
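As a rough sketch (field names are assumptions based on the parent/override semantics described above), a NamespacedCloudProfile that extends the expiration date of a Kubernetes version might look like this:

apiVersion: core.gardener.cloud/v1beta1
kind: NamespacedCloudProfile
metadata:
  name: my-profile
  namespace: garden-my-project
spec:
  parent:
    kind: CloudProfile
    name: aws
  kubernetes:
    versions:
    - version: 1.27.5
      expirationDate: "2026-06-30T23:59:59Z"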
InternalSecrets
End-users can read and/or write Secrets in their project namespaces in the garden cluster. This prevents Gardener components from storing such “Gardener-internal” secrets in the respective project namespace.
InternalSecrets are resources that contain shoot or project-related secrets that are “Gardener-internal”, i.e., secrets used and managed by the system that end-users don’t have access to.
InternalSecrets are defined like plain Kubernetes Secrets, behave exactly like them, and can be used in the same manners. The only difference is that the InternalSecret resource is a dedicated API resource (exposed by gardener-apiserver).
This allows separating access to “normal” secrets and internal secrets by the usual RBAC means.
Gardener uses an InternalSecret per Shoot for syncing the client CA to the project namespace in the garden cluster (named <shoot-name>.ca-client). The shoots/adminkubeconfig subresource signs short-lived client certificates by retrieving the CA from the InternalSecret.
Operators should configure gardener-apiserver to encrypt the internalsecrets.core.gardener.cloud resource in etcd.
Seeds
Seeds are resources that represent seed clusters.
Gardener does not care about how a seed cluster got created - the only requirement is that it is of at least Kubernetes v1.25 and passes the Kubernetes conformance tests.
The Gardener operator has to either deploy the gardenlet into the cluster they want to use as seed (recommended, then the gardenlet will create the Seed object itself after bootstrapping) or provide the kubeconfig to the cluster inside a secret (that is referenced by the Seed resource) and create the Seed resource themselves.
Please see this, this, and optionally this example manifests.
Shoot Quotas
To allow end-users who do not have a dedicated infrastructure account to try out Gardener, the operator can register an account they own and allow it to be used for trial clusters.
Trial clusters can be put under quota so that they don’t consume too many resources (resulting in costs) and that one user cannot consume all resources on their own.
These clusters are automatically terminated after a specified time, but end-users may extend the lifetime manually if needed.
Projects
The first thing before creating a shoot cluster is to create a Project.
A project is used to group multiple shoot clusters together.
End-users can invite colleagues to the project to enable collaboration, and they can either make them admin or viewer.
After an end-user has created a project, they will get a dedicated namespace in the garden cluster for all their shoots.
SecretBindings
Now that the end-user has a namespace, the next step is registering their infrastructure provider account.
Please see this example manifest and consult the documentation of the extension controller for the respective infrastructure provider to get information about which keys are required in this secret.
After the secret has been created, the end-user has to create a special SecretBinding resource that binds this secret.
Later, when creating shoot clusters, they will reference such binding.
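A minimal sketch of such a SecretBinding (names and provider type are illustrative):

apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: my-provider-account
  namespace: garden-my-project
provider:
  type: aws
secretRef:
  name: my-provider-account
  namespace: garden-my-project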
Shoots
Shoot clusters contain various settings that influence what end-user Kubernetes clusters will look like in the end.
As Gardener heavily relies on extension controllers for operating system configuration, networking, and infrastructure specifics, the end-user has the possibility (and responsibility) to provide these provider-specific configurations as well.
Such configurations are not evaluated by Gardener (because it doesn’t know/understand them), but they are only transported to the respective extension controller.
⚠️ This means that any configuration issues/mistakes on the end-user side that relate to a provider-specific flag or setting cannot be caught during the update request itself, but only later during the reconciliation (unless a validator webhook has been registered in the garden cluster by an operator).
Please see this example manifest and consult the documentation of the provider extension controller to get information about its spec.provider.controlPlaneConfig, .spec.provider.infrastructureConfig, and .spec.provider.workers[].providerConfig.
4.9 - Gardener Controller Manager
Understand where the gardener-controller-manager runs and its functionalities
Overview
The gardener-controller-manager (often referred to as “GCM”) is a component that runs next to the Gardener API server, similar to the Kubernetes Controller Manager.
It runs several controllers that do not require talking to any seed or shoot cluster.
Also, as of today, it exposes an HTTP server that is serving several health check endpoints and metrics.
This document explains the various functionalities of the gardener-controller-manager and their purpose.
Bastion Controller
Bastion resources have a limited lifetime which can be extended up to a certain amount by performing a heartbeat on them.
The Bastion controller is responsible for deleting expired or rotten Bastions.
“expired” means a Bastion has exceeded its status.expirationTimestamp.
“rotten” means a Bastion is older than the configured maxLifetime.
The maxLifetime defaults to 24 hours and is an option in the BastionControllerConfiguration which is part of the gardener-controller-manager’s ControllerManagerControllerConfiguration, see the example config file for details.
The controller also deletes Bastions in case the referenced Shoot:
no longer exists
is marked for deletion (i.e., has a non-nil .metadata.deletionTimestamp)
was migrated to another seed (i.e., Shoot.spec.seedName differs from Bastion.spec.seedName).
The deletion of Bastions triggers the gardenlet to perform the necessary cleanups in the Seed cluster, so some time can pass between deletion and the Bastion actually disappearing.
Clients like gardenctl are advised to not re-use Bastions whose deletion timestamp has been set already.
Refer to GEP-15 for more information on the lifecycle of Bastion resources.
CertificateSigningRequest Controller
After the gardenlet gets deployed on the Seed cluster, it needs to establish itself as a trusted party to communicate with the Gardener API server. It runs through a bootstrap flow similar to the kubelet bootstrap process.
On startup, the gardenlet uses a kubeconfig with a bootstrap token which authenticates it as being part of the system:bootstrappers group. This kubeconfig is used to create a CertificateSigningRequest (CSR) against the Gardener API server.
The controller in gardener-controller-manager checks whether the CertificateSigningRequest has the expected organization, common name and usages which the gardenlet would request.
It only auto-approves the CSR if the client making the request is allowed to “create” the certificatesigningrequests/seedclient subresource. Clients with the system:bootstrappers group are bound to the gardener.cloud:system:seed-bootstrapper ClusterRole, hence they have such privileges. As the bootstrap kubeconfig for the gardenlet contains a bootstrap token which is authenticated as being part of the system:bootstrappers group, its created CSR gets auto-approved.
CloudProfile Controller
CloudProfiles are essential when it comes to reconciling Shoots since they contain constraints (like valid machine types, Kubernetes versions, or machine images) and sometimes also some global configuration for the respective environment (typically via provider-specific configuration in .spec.providerConfig).
Consequently, to ensure that CloudProfiles in-use are always present in the system until the last referring Shoot or NamespacedCloudProfile gets deleted, the controller adds a finalizer which is only released when there is no Shoot or NamespacedCloudProfile referencing the CloudProfile anymore.
NamespacedCloudProfile Controller
NamespacedCloudProfiles provide a project-scoped extension to CloudProfiles, allowing for adjustments of a parent CloudProfile (e.g. by overriding expiration dates of Kubernetes versions or machine images). This allows for modifications without global project visibility. Like CloudProfiles do in their spec, NamespacedCloudProfiles also expose the resulting Shoot constraints as a CloudProfileSpec in their status.
The controller ensures that NamespacedCloudProfiles in-use remain present in the system until the last referring Shoot is deleted by adding a finalizer that is only released when there is no Shoot referencing the NamespacedCloudProfile anymore.
ControllerDeployment Controller
Extensions are registered in the garden cluster via ControllerRegistration and deployment of respective extensions are specified via ControllerDeployment. For more info refer to Registering Extension Controllers.
This controller ensures that ControllerDeployments in-use always exist until the last ControllerRegistration referencing them gets deleted. The controller adds a finalizer which is only released when there is no ControllerRegistration referencing the ControllerDeployment anymore.
ControllerRegistration Controller
The ControllerRegistration controller makes sure that the required Gardener Extensions specified by the ControllerRegistration resources are present in the seed clusters.
It also takes care of the creation and deletion of ControllerInstallation objects for a given seed cluster.
The controller has three reconciliation loops.
“Seed” Reconciler
This reconciliation loop watches the Seed objects and determines which ControllerRegistrations are required for them and reconciles the corresponding ControllerInstallation resources to reach the determined state.
To begin with, it computes the kind/type combinations of extensions required for the seed.
For this, the controller examines a live list of ControllerRegistrations, ControllerInstallations, BackupBuckets, BackupEntrys, Shoots, and Secrets from the garden cluster.
For example, it examines the shoots running on the seed and deduces the kind/type, like Infrastructure/gcp.
The seed (seed.spec.provider.type) and DNS (seed.spec.dns.provider.type) provider types are considered when calculating the list of required ControllerRegistrations, as well.
It also decides whether they should always be deployed based on the .spec.deployment.policy.
For the configuration options, please see this section.
Based on these required combinations, each of them is mapped to a ControllerRegistration object and then to its corresponding ControllerInstallation object (if existing).
The controller then creates or updates the required ControllerInstallation objects for the given seed.
It also deletes every existing ControllerInstallation whose referenced ControllerRegistration is not part of the required list.
For example, if the shoots in the seed are no longer using the DNS provider aws-route53, then the controller proceeds to delete the respective ControllerInstallation object.
“ControllerRegistration Finalizer” Reconciler
This reconciliation loop watches the ControllerRegistration resource and adds finalizers to it when it is created.
In case a deletion request comes in for the resource, i.e., if a .metadata.deletionTimestamp is set, it actively scans for a ControllerInstallation resource using this ControllerRegistration, and decides whether the deletion can be allowed.
In case no related ControllerInstallation is present, it removes the finalizer and marks it for deletion.
“Seed Finalizer” Reconciler
This loop also watches the Seed object and adds finalizers to it at creation.
If a .metadata.deletionTimestamp is set for the seed, then the controller checks for existing ControllerInstallation objects which reference this seed.
If no such objects exist, then it removes the finalizer and allows the deletion.
Extension ClusterRole Controller
This reconciler watches two resources in the garden cluster:
ClusterRoles labelled with authorization.gardener.cloud/custom-extensions-permissions=true
ServiceAccounts in seed namespaces matching the selector provided via the authorization.gardener.cloud/extensions-serviceaccount-selector annotation of such ClusterRoles.
Its core task is to maintain a ClusterRoleBinding resource referencing the respective ClusterRole.
This gets bound to all ServiceAccounts in seed namespaces whose labels match the selector provided via the authorization.gardener.cloud/extensions-serviceaccount-selector annotation of such ClusterRoles.
You can read more about the purpose of this reconciler in this document.
CredentialsBinding Controller
CredentialsBindings reference Secrets, WorkloadIdentitys and Quotas and are themselves referenced by Shoots.
The controller adds finalizers to the referenced objects to ensure they don’t get deleted while still being referenced.
Similarly, to ensure that CredentialsBindings in-use are always present in the system until the last referring Shoot gets deleted, the controller adds a finalizer which is only released when there is no Shoot referencing the CredentialsBinding anymore.
Referenced Secrets and WorkloadIdentitys will also be labeled with provider.shoot.gardener.cloud/<type>=true, where <type> is the value of the .provider.type of the CredentialsBinding.
Also, all referenced Secrets and WorkloadIdentitys, as well as Quotas, will be labeled with reference.gardener.cloud/credentialsbinding=true to allow for easily filtering for objects referenced by CredentialsBindings.
Event Controller
With the Gardener Event Controller, you can prolong the lifespan of events related to Shoot clusters.
This is an optional controller which will become active once you provide the below mentioned configuration.
All events in Kubernetes are deleted after a configurable time-to-live (controlled via the kube-apiserver argument --event-ttl, defaulting to 1 hour).
The need to prolong the time-to-live for Shoot cluster events frequently arises when debugging customer issues on live systems.
This controller leaves events involving Shoots untouched, while deleting all other events after a configured time.
In order to activate it, provide the following configuration:
concurrentSyncs: The amount of goroutines scheduled for reconciling events.
ttlNonShootEvents: When an event reaches this time-to-live it gets deleted unless it is a Shoot-related event (defaults to 1h, equivalent to the event-ttl default).
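A sketch of the corresponding section in the gardener-controller-manager component configuration, assuming the two fields above live under controllers.event (the values are illustrative):

controllers:
  event:
    concurrentSyncs: 10
    ttlNonShootEvents: 1h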
⚠️ In addition, you should also configure the --event-ttl for the kube-apiserver to define an upper-limit of how long Shoot-related events should be stored. The --event-ttl should be larger than the ttlNonShootEvents or this controller will have no effect.
ExposureClass Controller
ExposureClass abstracts the ability to expose a Shoot cluster’s control plane in certain network environments (e.g. corporate networks, DMZ, internet) on all Seeds or a subset of the Seeds. For more information, see ExposureClasses.
Consequently, to ensure that ExposureClasses in-use are always present in the system until the last referring Shoot gets deleted, the controller adds a finalizer which is only released when there is no Shoot referencing the ExposureClass anymore.
ManagedSeedSet Controller
ManagedSeedSet objects maintain a stable set of replicas of ManagedSeeds, i.e. they guarantee the availability of a specified number of identical ManagedSeeds on an equal number of identical Shoots.
The ManagedSeedSet controller creates and deletes ManagedSeeds and Shoots in response to changes to the replicas and selector fields. For more information, refer to the ManagedSeedSet proposal document.
The reconciler first gets all the replicas of the given ManagedSeedSet in the ManagedSeedSet’s namespace and with the matching selector. Each replica is a struct that contains a ManagedSeed, its corresponding Seed and Shoot objects.
Then the pending replica is retrieved, if it exists.
Next it determines the ready, postponed, and deletable replicas.
A replica is considered ready when a Seed owned by a ManagedSeed has been registered either directly or by deploying gardenlet into a Shoot, the Seed is Ready and the Shoot’s status is Healthy.
If a replica is not ready and it is not pending, i.e. it is not specified in the ManagedSeed’s status.pendingReplica field, then it is added to the postponed replicas.
A replica is deletable if it has no scheduled Shoots and the replica’s Shoot and ManagedSeed do not have the seedmanagement.gardener.cloud/protect-from-deletion annotation.
Finally, it checks the actual and target replica counts. If the actual count is less than the target count, the controller scales up the replicas by creating new replicas to match the desired target count. If the actual count is more than the target, the controller deletes replicas to match the desired count. Before scale-out or scale-in, the controller first reconciles the pending replica (there can always only be one) and makes sure the replica is ready before moving on to the next one.
Scale-out (actual count < target count)
During the scale-out phase, the controller first creates the Shoot object from the ManagedSeedSet’s spec.shootTemplate field and adds the replica to the status.pendingReplica of the ManagedSeedSet.
For the subsequent reconciliation steps, the controller makes sure that the pending replica is ready before proceeding to the next replica. Once the Shoot is created successfully, the ManagedSeed object is created from the ManagedSeedSet’s spec.template. The ManagedSeed object is reconciled by the ManagedSeed controller and a Seed object is created for the replica. Once the replica’s Seed becomes ready and the Shoot becomes healthy, the replica also becomes ready.
Scale-in (actual count > target count)
During the scale-in phase, the controller first determines the replica that can be deleted. From the deletable replicas, it chooses the one with the lowest priority and deletes it. Priority is determined in the following order:
First, compare replica statuses. Replicas with “less advanced” status are considered lower priority. For example, a replica with StatusShootReconciling status has a lower value than a replica with StatusShootReconciled status. Hence, in this case, a replica with a StatusShootReconciling status will have lower priority and will be considered for deletion.
Then, the replicas are compared with the readiness of their Seeds. Replicas with non-ready Seeds are considered lower priority.
Then, the replicas are compared with the health statuses of their Shoots. Replicas with “worse” statuses are considered lower priority.
Finally, the replica ordinals are compared. Replicas with lower ordinals are considered lower priority.
Quota Controller
A Quota object limits the resources consumed by shoot clusters, either per provider secret or per project/namespace.
Consequently, to ensure that Quotas in-use are always present in the system until the last SecretBinding or CredentialsBinding that references them gets deleted, the controller adds a finalizer which is only released when there is no SecretBinding or CredentialsBinding referencing the Quota anymore.
Project Controller
This reconciler manages a dedicated Namespace for each Project.
The namespace name can either be specified explicitly in .spec.namespace (must be prefixed with garden-) or it will be determined by the controller.
If .spec.namespace is set, it tries to create it. If it already exists, it tries to adopt it.
This will only succeed if the Namespace was previously labeled with gardener.cloud/role=project and project.gardener.cloud/name=<project-name>.
This is to prevent end-users from being able to adopt arbitrary namespaces and escalate their privileges, e.g. the kube-system namespace.
After the namespace was created/adopted, the controller creates several ClusterRoles and ClusterRoleBindings that allow the project members to access related resources based on their roles.
These RBAC resources are prefixed with gardener.cloud:system:project{-member,-viewer}:<project-name>.
Gardener administrators and extension developers can define their own roles; see Extending Project Roles for more information.
In addition, operators can configure the Project controller to maintain a default ResourceQuota for project namespaces.
Quotas can especially limit the creation of user-facing resources, e.g. Shoots, SecretBindings, CredentialsBindings, Secrets, and thus protect the garden cluster from massive resource exhaustion, but also enable operators to align quotas with respective enterprise policies.
⚠️ Gardener itself is not exempted from configured quotas. For example, Gardener creates Secrets for every shoot cluster in the project namespace and at the same time increases the available quota count. Please mind this additional resource consumption.
The controller configuration provides a template section controllers.project.quotas where such a ResourceQuota (see the example below) can be deposited.
The Project controller takes the specified config and creates a ResourceQuota with the name gardener in the project namespace.
If a ResourceQuota resource with the name gardener already exists, the controller will only update fields in spec.hard which are unavailable at that time.
This is done to configure a default Quota in all projects but to allow manual quota increases as the projects’ demands increase.
spec.hard fields in the ResourceQuota object that are not present in the configuration are removed from the object.
Labels and annotations on the ResourceQuota config get merged with the respective fields on existing ResourceQuotas.
An optional projectSelector narrows down the amount of projects that are equipped with the given config.
If multiple configs match for a project, then only the first match in the list is applied to the project namespace.
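A sketch of such a configuration; the quota values are illustrative and must be aligned with your own policies:

controllers:
  project:
    quotas:
    - config:
        apiVersion: v1
        kind: ResourceQuota
        spec:
          hard:
            count/shoots.core.gardener.cloud: "100"
            count/secretbindings.core.gardener.cloud: "10"
            count/secrets: "800"
      projectSelector: {}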
The .status.phase of the Project resources is set to Ready or Failed by the reconciler to indicate whether the reconciliation loop was performed successfully.
Also, it generates Events to provide further information about its operations.
When a Project is marked for deletion, the controller ensures that there are no Shoots left in the project namespace.
Once all Shoots are gone, the Namespace and Project are released.
“Stale Projects” Reconciler
As Gardener is a large-scale Kubernetes as a Service, it is designed for being used by a large amount of end-users.
Over time, it is likely to happen that some of the hundreds or thousands of Project resources are no longer actively used.
Gardener offers the “stale projects” reconciler which will take care of identifying such stale projects, marking them with a “warning”, and eventually deleting them after a certain time period.
This reconciler is enabled by default and works as follows:
Projects are considered as “stale”/not actively used when all of the following conditions apply: The namespace associated with the Project does not have any…
Shoot resources.
BackupEntry resources.
Secret resources that are referenced by a SecretBinding or a CredentialsBinding that is in use by a Shoot (not necessarily in the same namespace).
Quota resources that are referenced by a SecretBinding or a CredentialsBinding that is in use by a Shoot (not necessarily in the same namespace).
The time period when the project was used for the last time (status.lastActivityTimestamp) is longer than the configured minimumLifetimeDays.
If a project is considered “stale”, then its .status.staleSinceTimestamp will be set to the time when it was first detected to be stale.
If it gets actively used again, this timestamp will be removed.
After some time, the .status.staleAutoDeleteTimestamp will be set to a timestamp after which Gardener will auto-delete the Project resource if it still is not actively used.
The component configuration of the gardener-controller-manager offers to configure the following options:
minimumLifetimeDays: Don’t consider newly created Projects as “stale” too early to give people/end-users some time to onboard and get familiar with the system. The “stale project” reconciler won’t set any timestamp for Projects younger than minimumLifetimeDays. When you change this value, then projects marked as “stale” may be no longer marked as “stale” in case they are young enough, or vice versa.
staleGracePeriodDays: Don’t compute auto-delete timestamps for stale Projects that are unused for less than staleGracePeriodDays. This is to not unnecessarily make people/end-users nervous “just because” they haven’t actively used their Project for a given amount of time. When you change this value, then already assigned auto-delete timestamps may be removed if the new grace period is not yet exceeded.
staleExpirationTimeDays: Expiration time after which stale Projects are finally auto-deleted (after .status.staleSinceTimestamp). If this value is changed and an auto-delete timestamp got already assigned to the projects, then the new value will only take effect if it’s increased. Hence, decreasing the staleExpirationTimeDays will not decrease already assigned auto-delete timestamps.
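A sketch of these options in the component configuration, assuming they live under controllers.project (the day values are illustrative):

controllers:
  project:
    minimumLifetimeDays: 30
    staleGracePeriodDays: 14
    staleExpirationTimeDays: 90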
Gardener administrators/operators can exclude specific Projects from the stale check by annotating the related Namespace resource with project.gardener.cloud/skip-stale-check=true.
“Project Activity” Reconciler
Since the other two reconcilers are unable to actively monitor the relevant objects that are used in a Project (Shoot, Secret, etc.), there could be a situation where the user creates and deletes objects in a short period of time. In that case, the Stale Project Reconciler could not see that there was any activity on that project and would still mark it as stale, even though it is actively used.
The Project Activity Reconciler is implemented to take care of such cases. An event handler will notify the reconciler for any activity and then it will update the status.lastActivityTimestamp. This update will also trigger the Stale Project Reconciler.
SecretBinding Controller
SecretBindings reference Secrets and Quotas and are themselves referenced by Shoots.
The controller adds finalizers to the referenced objects to ensure they don’t get deleted while still being referenced.
Similarly, to ensure that SecretBindings in-use are always present in the system until the last referring Shoot gets deleted, the controller adds a finalizer which is only released when there is no Shoot referencing the SecretBinding anymore.
Referenced Secrets will also be labeled with provider.shoot.gardener.cloud/<type>=true, where <type> is the value of the .provider.type of the SecretBinding.
Also, all referenced Secrets, as well as Quotas, will be labeled with reference.gardener.cloud/secretbinding=true to allow for easily filtering for objects referenced by SecretBindings.
Seed Controller
“Main” Reconciler
This reconciliation loop takes care of seed-related operations in the garden cluster. When a new Seed object is created, the reconciler creates a new Namespace in the garden cluster named seed-<seed-name>. Namespaces dedicated to single seed clusters allow us to segregate access permissions, i.e., a gardenlet must not have permissions to access objects in all Namespaces in the garden cluster.
There are objects in a Garden environment which are created once by the operator, e.g., default domain secrets or alerting credentials, and are required for operations happening in the gardenlet. Therefore, we not only need a seed-specific Namespace but also a copy of these “shared” objects. The “main” reconciler takes care of this replication.
“Backup Buckets Check” Reconciler
Every time a BackupBucket object is created or updated, the referenced Seed object is enqueued for reconciliation.
It’s the reconciler’s task to check the status subresource of all existing BackupBuckets that reference this Seed.
If at least one BackupBucket has .status.lastError != nil, the BackupBucketsReady condition on the Seed will be set to False, and consequently the Seed is considered as NotReady.
If the SeedBackupBucketsCheckControllerConfiguration (which is part of gardener-controller-managers component configuration) contains a conditionThreshold for the BackupBucketsReady, the condition will instead first be set to Progressing and eventually to False once the conditionThreshold expires. See the example config file for details.
Once the BackupBucket is healthy again, the seed will be re-queued and the condition will turn true.
“Extensions Check” Reconciler
This reconciler reconciles Seed objects and checks whether all ControllerInstallations referencing them are in a healthy state.
Concretely, all three conditions Valid, Installed, and Healthy must have status True and the Progressing condition must have status False.
Based on this check, it maintains the ExtensionsReady condition in the respective Seed’s .status.conditions list.
“Lifecycle” Reconciler
The “Lifecycle” reconciler processes Seed objects which are enqueued every 10 seconds in order to check if the responsible
gardenlet is still responding and operable. Therefore, it checks renewals via Lease objects of the seed in the garden cluster
which are renewed regularly by the gardenlet.
In case a Lease is not renewed for the configured amount in config.controllers.seed.monitorPeriod.duration:
The reconciler assumes that the gardenlet stopped operating and updates the GardenletReady condition to Unknown.
Additionally, the conditions and constraints of all Shoot resources scheduled on the affected seed are set to Unknown as well,
because a striking gardenlet won’t be able to maintain these conditions any more.
If the gardenlet’s client certificate has expired (identified based on the .status.clientCertificateExpirationTimestamp field in the Seed resource) and if the seed is managed by a ManagedSeed, then the ManagedSeed is triggered for reconciliation. This triggers the bootstrapping process again and allows gardenlets to obtain a fresh client certificate.
Shoot Controller
“Conditions” Reconciler
In case the reconciled Shoot is registered via a ManagedSeed as a seed cluster, this reconciler merges the conditions in the respective Seed’s .status.conditions into the .status.conditions of the Shoot.
This is to provide a holistic view on the status of the registered seed cluster by just looking at the Shoot resource.
“Hibernation” Reconciler
This reconciler is responsible for hibernating or awakening shoot clusters based on the schedules defined in their .spec.hibernation.schedules.
It ignores failed Shoots and those marked for deletion.
“Maintenance” Reconciler
This reconciler is responsible for maintaining shoot clusters based on the time window defined in their .spec.maintenance.timeWindow.
It might auto-update the Kubernetes version or the operating system versions specified in the worker pools (.spec.provider.workers).
It could also add some operation or task annotations. For more information, see Shoot Maintenance.
“Quota” Reconciler
This reconciler might auto-delete shoot clusters in case their referenced SecretBinding or CredentialsBinding is itself referencing a Quota with .spec.clusterLifetimeDays != nil.
If the shoot cluster is older than the configured lifetime, then it gets deleted.
It maintains the expiration time of the Shoot in the value of the shoot.gardener.cloud/expiration-timestamp annotation.
This annotation might be overridden, however only by at most twice the value of the .spec.clusterLifetimeDays.
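For illustration, a Shoot with such an expiration timestamp could look like this (name, namespace, and timestamp are hypothetical):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: trial-cluster
  namespace: garden-trial
  annotations:
    shoot.gardener.cloud/expiration-timestamp: "2026-06-30T10:00:00Z"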
“Reference” Reconciler
Shoot objects may specify references to other objects in the garden cluster which are required for certain features.
For example, users can configure various DNS providers via .spec.dns.providers and usually need to refer to a corresponding Secret with valid DNS provider credentials inside.
Such objects need a special protection against deletion requests as long as they are still being referenced by one or multiple shoots.
Therefore, this reconciler checks Shoots for referenced objects and adds the finalizer gardener.cloud/reference-protection to their .metadata.finalizers list.
The reconciled Shoot also gets this finalizer to enable a proper garbage collection in case the gardener-controller-manager is offline at the moment of an incoming deletion request.
When an object is not actively referenced anymore because the Shoot specification has changed or all related shoots were deleted (are in deletion), the controller will remove the added finalizer again so that the object can safely be deleted or garbage collected.
This reconciler inspects, among others, the Secrets and ConfigMaps referenced in the Shoot specification (e.g., DNS provider secrets or audit policy ConfigMaps).
“Retry” Reconciler
This reconciler is responsible for retrying certain failed Shoots.
Currently, the reconciler retries only failed Shoots with an error code ERR_INFRA_RATE_LIMITS_EXCEEDED. See Shoot Status for more details.
“Status Label” Reconciler
This reconciler is responsible for maintaining the shoot.gardener.cloud/status label on Shoots. See Shoot Status for more details.
4.10 - Gardener Node Agent
How Gardener bootstraps machines into worker nodes and how it installs and maintains gardener-managed node-specific components
Overview
The goal of the gardener-node-agent is to bootstrap a machine into a worker node and maintain node-specific components, which run on the node and are unmanaged by Kubernetes (e.g. the kubelet service, systemd units, …).
It effectively is a Kubernetes controller deployed onto the worker node.
Architecture and Basic Design
This figure visualizes the overall architecture of the gardener-node-agent. On the left side, it starts with an OperatingSystemConfig resource (OSC) with a corresponding worker pool specific cloud-config-<worker-pool> secret being passed by reference through the userdata to a machine by the machine-controller-manager (MCM).
On the right side, the cloud-config secret will be extracted and used by the gardener-node-agent after being installed. Details on this can be found in the next section.
Finally, the gardener-node-agent runs as a systemd service, watching Secret resources located in the kube-system namespace like our cloud-config secret that contains the OperatingSystemConfig. When gardener-node-agent applies the OSC, it installs the kubelet + configuration on the worker node.
Installation and Bootstrapping
This section describes how the gardener-node-agent is initially installed onto the worker node.
In the beginning, there is a very small bash script called gardener-node-init.sh, which will be copied to /var/lib/gardener-node-agent/init.sh on the node with cloud-init data.
This script’s sole purpose is downloading and starting the gardener-node-agent.
The binary artifact is extracted from an OCI artifact and lives at /opt/bin/gardener-node-agent.
Along with the init script, a configuration for the gardener-node-agent is carried over to the worker node at /var/lib/gardener-node-agent/config.yaml.
This configuration contains things like the shoot’s kube-apiserver endpoint, the corresponding certificates to communicate with it, and controller configuration.
In a bootstrapping phase, the gardener-node-agent sets itself up as a systemd service.
It also executes tasks that need to be executed before any other components are installed, e.g. formatting the data device for the kubelet.
Controllers
This section describes the controllers in more details.
Lease Controller
This controller creates a Lease for gardener-node-agent in the kube-system namespace of the shoot cluster.
Each instance of gardener-node-agent creates its own Lease when its corresponding Node was created.
It renews the Lease resource every 10 seconds. This indicates a heartbeat to the external world.
Node Controller
This controller watches the Node object for the machine it runs on.
The correct Node is identified based on the hostname of the machine (Nodes have the kubernetes.io/hostname label).
Whenever the worker.gardener.cloud/restart-systemd-services annotation changes, the controller performs the desired changes by restarting the specified systemd unit files.
See also this document for more information.
After restarting all units, the annotation is removed.
ℹ️ When the gardener-node-agent systemd service itself is requested to be restarted, the annotation is removed first to ensure it does not restart itself indefinitely.
Operating System Config Controller
This controller contains the main logic of gardener-node-agent.
It watches Secrets whose data map contains the OperatingSystemConfig which consists of all systemd units and files that are relevant for the node configuration.
Amongst others, a prominent example is the configuration file for kubelet and its unit file for the kubelet.service.
The controller decodes the configuration and computes the files and units that have changed since its last reconciliation.
It writes or updates the files and units on the file system, removes no longer needed files and units, reloads the systemd daemon, and starts or stops the units accordingly.
After successful reconciliation, it persists the just applied OperatingSystemConfig into a file on the host.
This file will be used for future reconciliations to compute file/unit changes.
The controller also maintains two annotations on the Node:
worker.gardener.cloud/kubernetes-version, describing the version of the installed kubelet.
checksum/cloud-config-data, describing the checksum of the applied OperatingSystemConfig (used in future reconciliations to determine whether it needs to reconcile, and to report that this node is up-to-date).
Token Controller
This controller watches the access token Secrets in the kube-system namespace configured via the gardener-node-agent’s component configuration (.controllers.token.syncConfigs[] field).
Whenever the .data.token field changes, it writes the new content to a file on the configured path on the host file system.
This mechanism is used to download its own access token for the shoot cluster, but also the access tokens of other systemd components (e.g., valitail).
Since the underlying client is based on k8s.io/client-go and the kubeconfig points to this token file, it is dynamically reloaded without the necessity of explicit configuration or code changes.
This procedure ensures that the most up-to-date tokens are always present on the host and used by the gardener-node-agent and the other systemd components.
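A sketch of such a sync configuration in the gardener-node-agent component configuration; the secret name, the path, and the exact field names are assumptions for illustration:

controllers:
  token:
    syncConfigs:
    - secretName: gardener-valitail
      path: /var/lib/valitail/auth-token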
Reasoning
The gardener-node-agent is a replacement for what was called the cloud-config-downloader and the cloud-config-executor, both written in bash. The gardener-node-agent implements this functionality as a regular controller and feels more uniform in terms of maintenance.
With the new architecture we gain a lot, let’s describe the most important gains here.
Developer Productivity
Since the Gardener community develops in Go day by day, writing business logic in bash is difficult, hard to maintain, and almost impossible to test. Getting rid of almost all bash scripts, which are currently in use for this very important part of the cluster creation process, will enhance the speed of adding new features and removing bugs.
Speed
Until now, the cloud-config-downloader runs in a loop every 60s to check if something changed on the shoot which requires modifications on the worker node. This produces a lot of unneeded traffic on the API server and wastes time; it can take up to 60s until a desired modification is started on the worker node.
By writing a “real” Kubernetes controller, we can watch for the Node, the OSC in the Secret, and the shoot-access token in the secret. If any of these objects change, and only then, the required action takes effect immediately.
This will speed up operations and will reduce the load on the API server of the shoot especially for large clusters.
Scalability
The cloud-config-downloader adds a random wait time before restarting the kubelet in case the kubelet was updated or a configuration change was made to it. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster because every kubelet restart transforms a node for several seconds into NotReady state which potentially interrupts service availability.
The decision was made to keep the existing jitter mechanism, which calculates the kubelet-download-and-restart-delay-seconds on the controller itself.
Correctness
The configuration of the cloud-config-downloader is actually done by placing a file for every configuration item on the disk on the worker node. This was done because parsing the content of a single file and using this as a value in bash reduces to something like VALUE=$(cat /the/path/to/the/file). Simple, but it lacks validation, type safety and whatnot.
With the gardener-node-agent, we introduce a new API which is then stored in the gardener-node-agent secret and stored on disk in a single YAML file for comparison with the previously known state. This brings all the benefits of type-safe configuration.
Because actual and previous configuration are compared, removed files and units are also removed and stopped on the worker if removed from the OSC.
Availability
Previously, the cloud-config-downloader simply restarted the systemd units on every change to the OSC, regardless of which services changed. The gardener-node-agent first checks which systemd units were changed, and only restarts those. This prevents unneeded kubelet restarts.
4.11 - Gardener Operator
Understand the component responsible for the garden cluster environment and its various features
Overview
The gardener-operator is responsible for the garden cluster environment.
Without this component, users must deploy ETCD, the Gardener control plane, etc., manually and with separate mechanisms (not maintained in this repository).
This is quite unfortunate since this requires separate tooling, processes, etc.
A lot of production- and enterprise-grade features were built into Gardener for managing the seed and shoot clusters, so it makes sense to re-use them as much as possible also for the garden cluster.
Deployment
There is a Helm chart which can be used to deploy the gardener-operator.
Once deployed and ready, you can create a Garden resource.
Note that there can only be one Garden resource per system at a time.
ℹ️ Similar to seed clusters, garden runtime clusters require a VPA, see this section.
By default, gardener-operator deploys the VPA components.
However, when there already is a VPA available, then set .spec.runtimeCluster.settings.verticalPodAutoscaler.enabled=false in the Garden resource.
The Garden resource offers a few settings that are used to control the behaviour of gardener-operator in the runtime cluster.
This section provides an overview over the available settings in .spec.runtimeCluster.settings:
Load Balancer Services
gardener-operator deploys Istio and relevant resources to the runtime cluster in order to expose the virtual-garden-kube-apiserver service (similar to how the kube-apiservers of shoot clusters are exposed).
In most cases, the cloud-controller-manager (responsible for managing these load balancers on the respective underlying infrastructure) supports certain customization and settings via annotations.
This document provides a good overview and many examples.
By setting the .spec.runtimeCluster.settings.loadBalancerServices.annotations field the Gardener administrator can specify a list of annotations which will be injected into the Services of type LoadBalancer.
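For example, a Garden resource could inject a provider-specific annotation like this (the AWS annotation key is an infrastructure-dependent example):

spec:
  runtimeCluster:
    settings:
      loadBalancerServices:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: nlb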
Vertical Pod Autoscaler
gardener-operator heavily relies on the Kubernetes vertical-pod-autoscaler component.
By default, the Garden controller deploys the VPA components into the garden namespace of the respective runtime cluster.
In case you want to manage the VPA deployment on your own or have a custom one, then you might want to disable the automatic deployment by gardener-operator.
Otherwise, you might end up with two VPAs which will cause erratic behaviour.
By setting the .spec.runtimeCluster.settings.verticalPodAutoscaler.enabled=false you can disable the automatic deployment.
⚠️ In any case, there must be a VPA available for your runtime cluster.
Using a runtime cluster without VPA is not supported.
It is possible to define the minimum size for PersistentVolumeClaims in the runtime cluster created by gardener-operator via the .spec.runtimeCluster.volume.minimumSize field.
This can be relevant in case the runtime cluster runs on an infrastructure that does only support disks of at least a certain size.
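A sketch of this setting in the Garden resource (the size value is illustrative):

spec:
  runtimeCluster:
    volume:
      minimumSize: 20Gi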
Configuration For Virtual Cluster
ETCD Encryption Config
The spec.virtualCluster.kubernetes.kubeAPIServer.encryptionConfig field in the Garden API allows operators to customize encryption configurations for the kube-apiserver of the virtual cluster. It provides options to specify additional resources for encryption. Similarly, the spec.virtualCluster.gardener.gardenerAPIServer.encryptionConfig field allows operators to customize encryption configurations for the gardener-apiserver.
The resources field can be used to specify resources that should be encrypted in addition to secrets. Secrets are always encrypted for the kube-apiserver. For the gardener-apiserver, the following resources are always encrypted:
controllerdeployments.core.gardener.cloud
controllerregistrations.core.gardener.cloud
internalsecrets.core.gardener.cloud
shootstates.core.gardener.cloud
Adding an item to any of the lists will cause patch requests for all the resources of that kind in order to encrypt them in etcd. See Encrypting Confidential Data at Rest for more details.
ℹ️ Note that configuring encryption for a custom resource for the kube-apiserver is only supported for Kubernetes versions >= 1.26.
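A sketch of both encryption configurations in the Garden resource; the listed resources are illustrative:

spec:
  virtualCluster:
    kubernetes:
      kubeAPIServer:
        encryptionConfig:
          resources:
          - configmaps
          - statefulsets.apps
    gardener:
      gardenerAPIServer:
        encryptionConfig:
          resources:
          - shoots.core.gardener.cloud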
Extension Resource
A Gardener installation relies on extensions to provide support for new cloud providers or to add new capabilities.
You can find out more about Gardener extensions and how they can be used here.
The Extension resource is intended to automate the installation and management of extensions in a Gardener landscape.
It contains configuration for the following scenarios:
The deployment of the extension chart in the garden runtime cluster.
The deployment of ControllerRegistration and ControllerDeployment resources in the (virtual) garden cluster.
In the near future, the Extension will be used by the gardener-operator to automate the management of the backup bucket for ETCD and DNS records required by the garden cluster.
To do that, gardener-operator will leverage extensions that support DNSRecord and BackupBucket resources.
As of today, the support for managed DNSRecords and BackupBuckets in the gardener-operator is still being built.
However, the Extension’s specification already reflects the target picture.
The .spec.deployment specifies how an extension can be installed for a Gardener landscape and consists of the following parts:
.spec.deployment.extension contains the deployment specification of an extension.
.spec.deployment.admission contains the deployment specification of an extension admission.
Each one is described in more detail below.
Configuration for Extension Deployment
.spec.deployment.extension contains configuration for the registration of an extension in the garden cluster.
gardener-operator follows the same principles described by this document:
.spec.deployment.extension.helm and .spec.deployment.extension.values are used when creating the ControllerDeployment in the garden cluster.
.spec.deployment.extension.policy and .spec.deployment.extension.seedSelector define the extension’s installation policy as per the ControllerDeployment’s respective fields.
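A rough sketch of an Extension resource using these fields; the name, chart reference, and values are hypothetical, and the helm structure is assumed to follow the ControllerDeployment conventions:

apiVersion: operator.gardener.cloud/v1alpha1
kind: Extension
metadata:
  name: extension-provider-example
spec:
  deployment:
    extension:
      helm:
        ociRepository:
          ref: registry.example.com/charts/extension-provider-example:v1.0.0
      values:
        replicaCount: 2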
Runtime
Extensions can manage resources required by the Garden resource (e.g. BackupBucket, DNSRecord, Extension) in the runtime cluster.
Since the environment in the runtime cluster may differ from that of a Seed, the extension is installed in the runtime cluster with a distinct set of Helm chart values specified in .spec.deployment.extension.runtimeValues.
If no runtimeValues are provided, the extension deployment for the runtime garden is considered superfluous and the deployment is uninstalled.
The configuration allows for precise control over various extension parameters, such as requested resources, priority classes, and more.
Besides the values configured in .spec.deployment.extension.runtimeValues, a runtime deployment flag and a priority class are merged into the values:
gardener:
  runtimeCluster:
    enabled: true # indicates that the extension is enabled for the Garden cluster, e.g. for handling `BackupBucket`, `DNSRecord` and `Extension` objects
    priorityClassName: gardener-garden-system-200
As soon as a Garden object is created and runtimeValues are configured, the extension is deployed in the runtime cluster.
Extension Registration
When the virtual garden cluster is available, the Extension controller manages ControllerRegistration/ControllerDeployment resources
to register extensions for shoots. The fields of .spec.deployment.extension include their configuration options.
Configuration for Admission Deployment
The .spec.deployment.admission defines how an extension admission may be deployed by the gardener-operator.
This deployment is optional and may be omitted.
Typically, the admission is split into two parts:
Runtime
The runtime part contains deployment relevant manifests, required to run the admission service in the runtime cluster.
The following values are passed to the chart during reconciliation:
gardener:
  runtimeCluster:
    priorityClassName: <Class to be used for extension admission>
Virtual
The virtual part contains manifests that are required in the virtual garden cluster (e.g., webhook registrations). The following values are passed to the chart during reconciliation:
gardener:
  virtualCluster:
    serviceAccount:
      name: <Name of the service account used to connect to the garden cluster>
      namespace: <Namespace of the service account>
Extension admissions often need to retrieve additional context from the garden cluster in order to process validating or mutating requests.
For example, the corresponding CloudProfile might be needed to perform a provider specific shoot validation.
Therefore, Gardener automatically injects a kubeconfig into the admission deployment to interact with the (virtual) garden cluster (see this document for more information).
Configuration for Extension Resources
The .spec.resources field refers to the extension resources as defined by Gardener in the extensions.gardener.cloud/v1alpha1 API.
These include both well-known types such as Infrastructure, Worker etc. and generic resources.
The field will be used to populate the respective field in the resulting ControllerRegistration in the garden cluster.
Controllers
The gardener-operator controllers are now described in more detail.
Garden Controller
The reconciler first generates a general CA certificate which is valid for ~30d and auto-rotated when 80% of its lifetime is reached.
Afterwards, it brings up the so-called “garden system components”.
The gardener-resource-manager is deployed first since its ManagedResource controller will be used to bring up the remainders.
Other system components are:
runtime garden system resources (PriorityClasses for the workload resources)
virtual garden system resources (RBAC rules)
Vertical Pod Autoscaler (if enabled via .spec.runtimeCluster.settings.verticalPodAutoscaler.enabled=true in the Garden)
ETCD Druid
Istio
As soon as all system components are up, the reconciler deploys the virtual garden cluster.
It comprises two ETCDs (one “main” etcd, one “events” etcd) which are managed by ETCD Druid via druid.gardener.cloud/v1alpha1.Etcd custom resources.
The whole management works similar to how it works for Shoots, so you can take a look at this document for more information in general.
The virtual garden control plane components are:
virtual-garden-etcd-main
virtual-garden-etcd-events
virtual-garden-kube-apiserver
virtual-garden-kube-controller-manager
virtual-garden-gardener-resource-manager
If .spec.virtualCluster.controlPlane.highAvailability={} is set, then these components will be deployed in a “highly available” mode.
For ETCD, this means that there will be 3 replicas each.
This works similarly to Shoots (see this document), except for the fact that there is no failure tolerance type configurability.
The gardener-resource-manager’s HighAvailabilityConfig webhook makes sure that all pods with multiple replicas are spread on nodes, and if there are at least two zones in .spec.runtimeCluster.provider.zones then they also get spread across availability zones.
Once set, removing .spec.virtualCluster.controlPlane.highAvailability again is not supported.
The virtual-garden-kube-apiserver Deployment is exposed via Istio, similar to how the kube-apiservers of shoot clusters are exposed.
Similar to the Shoot API, the version of the virtual garden cluster is controlled via .spec.virtualCluster.kubernetes.version.
Likewise, specific configuration for the control plane components can be provided in the same section, e.g. via .spec.virtualCluster.kubernetes.kubeAPIServer for the kube-apiserver or .spec.virtualCluster.kubernetes.kubeControllerManager for the kube-controller-manager.
The kube-controller-manager only runs a few controllers that are necessary in the scenario of the virtual garden.
Most prominently, the serviceaccount-token controller is unconditionally disabled.
Hence, the usage of static ServiceAccount secrets is not supported generally.
Instead, the TokenRequest API should be used.
Third-party components that need to communicate with the virtual cluster can leverage the gardener-resource-manager’s TokenRequestor controller and the generic kubeconfig, just like it works for Shoots.
Please note that this functionality is restricted to the garden namespace. The current Secret name of the generic kubeconfig can be found in the annotations (key: generic-token-kubeconfig.secret.gardener.cloud/name) of the Garden resource.
For the virtual cluster, it is essential to provide at least one DNS domain via .spec.virtualCluster.dns.domains.
The respective DNS records are not managed by gardener-operator and should be created manually.
They should point to the load balancer IP of the istio-ingressgateway Service in the virtual-garden-istio-ingress namespace.
The DNS records must be prefixed with both gardener. and api. for all domains in .spec.virtualCluster.dns.domains.
The first DNS domain in this list is used for the server in the kubeconfig, and for configuring the --external-hostname flag of the API server.
Apart from the control plane components of the virtual cluster, the reconciler also deploys the control plane components of Gardener.
gardener-apiserver reuses the same ETCDs like the virtual-garden-kube-apiserver, so all data related to the “the garden cluster” is stored together and “isolated” from ETCD data related to the runtime cluster.
This drastically simplifies backup and restore capabilities (e.g., moving the virtual garden cluster from one runtime cluster to another).
The Gardener control plane components are:
gardener-apiserver
gardener-admission-controller
gardener-controller-manager
gardener-scheduler
Besides those, the gardener-operator is able to deploy the following optional components:
Gardener Dashboard (and the controller for web terminals) when .spec.virtualCluster.gardener.gardenerDashboard (or .spec.virtualCluster.gardener.gardenerDashboard.terminal, respectively) is set.
You can read more about it and its configuration in this section.
Gardener Discovery Server when .spec.virtualCluster.gardener.gardenerDiscoveryServer is set.
The service account issuer of shoots will be calculated in the format https://discovery.<.spec.runtimeCluster.ingress.domains[0]>/projects/<project-name>/shoots/<shoot-uid>/issuer.
This configuration applies for all seeds registered with the Garden cluster. Once set it should not be modified.
The reconciler also manages a few observability-related components (more are planned as part of GEP-19); they are described in the Observability section below.
It is also mandatory to provide an IPv4 CIDR for the service network of the virtual cluster via .spec.virtualCluster.networking.services.
This range is used by the API server to compute the cluster IPs of Services.
The controller maintains the .status.lastOperation which indicates the status of an operation.
.spec.virtualCluster.gardener.gardenerDashboard serves a few configuration options for the dashboard.
This section highlights the most prominent fields:
oidcConfig: The general OIDC configuration is part of .spec.virtualCluster.kubernetes.kubeAPIServer.oidcConfig.
This section allows you to define a few specific settings for the dashboard.
sessionLifetime is the duration after which a session is terminated (i.e., after which a user is automatically logged out).
additionalScopes allows extending the list of scopes of the JWT token that are to be recognized.
You must reference a Secret in the garden namespace containing the client ID and, if applicable, the client secret for the dashboard.
If using a public client, a client secret is not required. The dashboard can function as a public OIDC client, allowing for improved flexibility in environments where secret storage is not feasible.
enableTokenLogin: This is enabled by default and allows logging into the dashboard with a JWT token.
You can disable it in case you want to only allow OIDC-based login.
However, at least one of the two login methods must be enabled.
frontendConfigMapRef: Reference a ConfigMap in the garden namespace containing the frontend configuration in the data with key frontend-config.yaml, for example
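For illustration, a minimal sketch of such a ConfigMap could look as follows (the name gardener-dashboard-frontend and the helpMenuItems content are placeholders, not required values):
apiVersion: v1
kind: ConfigMap
metadata:
  name: gardener-dashboard-frontend
  namespace: garden
data:
  frontend-config.yaml: |
    helpMenuItems:
    - title: Getting Started
      icon: description
      url: https://gardener.cloud/docs/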
Please take a look at this file to get an idea of which values are configurable.
This configuration can also include branding, themes, and colors.
Read more about it here.
Assets (logos/icons) are configured in a separate ConfigMap, see below.
assetsConfigMapRef: Reference a ConfigMap in the garden namespace containing the assets, for example
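A minimal sketch (the ConfigMap name and the asset keys are placeholders; the actual set of supported asset keys is documented in the file referenced below):
apiVersion: v1
kind: ConfigMap
metadata:
  name: gardener-dashboard-assets
  namespace: garden
binaryData:
  logo.svg: <base64-encoded-data>
  favicon-16x16.png: <base64-encoded-data>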
Note that the assets must be provided base64-encoded, hence binaryData (instead of data) must be used.
Please take a look at this file to get more information.
gitHub: You can connect a GitHub repository that can be used to create issues for shoot clusters in the cluster details page.
You have to reference a Secret in the garden namespace that contains the GitHub credentials, for example:
apiVersion: v1
kind: Secret
metadata:
  name: gardener-dashboard-github
  namespace: garden
type: Opaque
stringData:
  # This is for GitHub token authentication:
  authentication.token: <secret>
  # Alternatively, this is for GitHub app authentication:
  authentication.appId: <secret>
  authentication.clientId: <secret>
  authentication.clientSecret: <secret>
  authentication.installationId: <secret>
  authentication.privateKey: <secret>
  # This is the webhook secret, see explanation below
  webhookSecret: <secret>
Note that you can also set up a GitHub webhook to the dashboard such that it receives updates when somebody changes the GitHub issue.
The webhookSecret field is the secret that you enter in GitHub in the webhook configuration.
The dashboard uses it to verify that received traffic indeed originates from GitHub.
If you don't want to set up such a webhook, or if the dashboard is not reachable by the GitHub webhook (e.g., in restricted environments), you can also configure gitHub.pollInterval.
It is the interval of how often the GitHub API is polled for issue updates.
This field is used as a fallback mechanism to ensure state synchronization, even when there is a GitHub webhook configuration.
If a webhook event is missed or not successfully delivered, the polling will help catch up on any missed updates.
If this field is not provided and there is no webhookSecret key in the referenced secret, it will be implicitly defaulted to 15m.
The dashboard will use this to regularly poll the GitHub API for updates on issues.
terminal: This enables the web terminal feature, read more about it here.
When set, the terminal-controller-manager will be deployed to the runtime cluster.
The allowedHosts field is explained here.
The container section allows you to specify a container image and a description that should be used for the web terminals.
Observability
Garden Prometheus
gardener-operator deploys a Prometheus instance in the garden namespace (called “Garden Prometheus”) which fetches metrics and data from garden system components, cAdvisors, the virtual cluster control plane, and the Seeds’ aggregate Prometheus instances.
Its purpose is to provide an entrypoint for operators when debugging issues with components running in the garden cluster.
It also serves as the top-level aggregator of metering across a Gardener landscape.
To extend the configuration of the Garden Prometheus, you can create the prometheus-operator’s custom resources and label them with prometheus=garden, for example:
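A sketch of a ServiceMonitor that would be picked up by the Garden Prometheus (the monitored component my-component, its labels, and the metrics port are placeholders):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    prometheus: garden
  name: garden-my-component
  namespace: garden
spec:
  selector:
    matchLabels:
      app: my-component
  endpoints:
  - port: metrics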
gardener-operator deploys another Prometheus instance in the garden namespace (called “Long-Term Prometheus”) which federates metrics from Garden Prometheus.
Its purpose is to store these metrics with a longer retention than the Garden Prometheus does. It is not possible to define different retention periods for different metrics in Prometheus; hence, using another Prometheus instance is the only option.
This Long-Term Prometheus also has an additional Cortex sidecar container for caching some queries to achieve faster processing times.
To extend the configuration of the Long-Term Prometheus, you can create the prometheus-operator's custom resources and label them with prometheus=longterm, for example:
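This works analogously to the Garden Prometheus sketch above, just with the prometheus=longterm label (again, my-component is a placeholder):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    prometheus: longterm
  name: longterm-my-component
  namespace: garden
spec:
  selector:
    matchLabels:
      app: my-component
  endpoints:
  - port: metrics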
By default, the alertmanager-garden deployed by gardener-operator does not come with any configuration.
It is the responsibility of the human operators to design and provide it.
This can be done by creating monitoring.coreos.com/v1alpha1.AlertmanagerConfig resources labeled with alertmanager=garden (read more about them here), for example:
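A minimal sketch of such a resource (the receiver name and email address are placeholders; any Alertmanager receiver type can be used):
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  labels:
    alertmanager: garden
  name: garden-alerts
  namespace: garden
spec:
  route:
    receiver: ops-email
  receivers:
  - name: ops-email
    emailConfigs:
    - to: ops@example.com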
A Plutono instance is deployed by gardener-operator into the garden namespace for visualizing monitoring metrics and logs via dashboards.
In order to provide custom dashboards, create a ConfigMap in the garden namespace labeled with dashboard.monitoring.gardener.cloud/garden=true that contains the respective JSON documents, for example:
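A sketch of such a ConfigMap (the name and the dashboard JSON are placeholders):
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-custom-dashboard
  namespace: garden
  labels:
    dashboard.monitoring.gardener.cloud/garden: "true"
data:
  my-custom-dashboard.json: |
    {"title": "My Custom Dashboard", "panels": []}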
This reconciler performs four “care” actions related to Gardens.
It maintains the following conditions:
VirtualGardenAPIServerAvailable: The /healthz endpoint of the garden’s virtual-garden-kube-apiserver is called and considered healthy when it responds with 200 OK.
RuntimeComponentsHealthy: The conditions of the ManagedResources applied to the runtime cluster are checked (e.g., ResourcesApplied).
VirtualComponentsHealthy: The virtual components are considered healthy when the respective Deployments (for example virtual-garden-kube-apiserver, virtual-garden-kube-controller-manager) and Etcds (for example virtual-garden-etcd-main) exist and are healthy. Additionally, the conditions of the ManagedResources applied to the virtual cluster are checked (e.g., ResourcesApplied).
ObservabilityComponentsHealthy: This condition is considered healthy when the respective Deployments (for example plutono) and StatefulSets (for example prometheus, vali) exist and are healthy.
If all checks for a certain condition succeed, then its status will be set to True.
Otherwise, it will be set to False or Progressing.
If at least one check fails and there is a threshold configuration for the conditions (in .controllers.gardenCare.conditionThresholds), then the status will be set:
to Progressing if it was True before.
to Progressing if it was Progressing before and the lastUpdateTime of the condition does not exceed the configured threshold duration yet.
to False if it was Progressing before and the lastUpdateTime of the condition exceeds the configured threshold duration.
The condition thresholds can be used to prevent reporting issues too early just because there is a rollout or a short disruption.
Only if the unhealthiness persists for at least the configured threshold duration will the issues be reported (by setting the status to False).
In order to compute the condition statuses, this reconciler considers ManagedResources (in the garden and istio-system namespace) and their status, see this document for more information.
The following table explains which ManagedResources are considered for which condition type:
Condition Type                    ManagedResources are considered when
RuntimeComponentsHealthy          .spec.class=seed and care.gardener.cloud/condition-type label either unset, or set to RuntimeComponentsHealthy
VirtualComponentsHealthy          .spec.class unset or care.gardener.cloud/condition-type label set to VirtualComponentsHealthy
ObservabilityComponentsHealthy    care.gardener.cloud/condition-type label set to ObservabilityComponentsHealthy
Garden objects may specify references to other objects in the Garden cluster which are required for certain features.
For example, operators can configure a secret for ETCD backup via .spec.virtualCluster.etcd.main.backup.secretRef.name or an audit policy ConfigMap via .spec.virtualCluster.kubernetes.kubeAPIServer.auditConfig.auditPolicy.configMapRef.name.
Such objects need a special protection against deletion requests as long as they are still being referenced by the Garden.
Therefore, this reconciler checks Gardens for referenced objects and adds the finalizer gardener.cloud/reference-protection to their .metadata.finalizers list.
The reconciled Garden also gets this finalizer to enable a proper garbage collection in case the gardener-operator is offline at the moment of an incoming deletion request.
When an object is no longer actively referenced, because the Garden specification has changed or the Garden is in deletion, the controller will remove the added finalizer again so that the object can safely be deleted or garbage collected.
This reconciler inspects the following references:
Admission plugin kubeconfig Secrets (.spec.virtualCluster.kubernetes.kubeAPIServer.admissionPlugins[].kubeconfigSecretName and .spec.virtualCluster.gardener.gardenerAPIServer.admissionPlugins[].kubeconfigSecretName)
Audit policy ConfigMaps (.spec.virtualCluster.kubernetes.kubeAPIServer.auditConfig.auditPolicy.configMapRef.name and .spec.virtualCluster.gardener.gardenerAPIServer.auditConfig.auditPolicy.configMapRef.name)
Audit webhook kubeconfig Secrets (.spec.virtualCluster.kubernetes.kubeAPIServer.auditWebhook.kubeconfigSecretName and .spec.virtualCluster.gardener.gardenerAPIServer.auditWebhook.kubeconfigSecretName)
This controller registers controllers which need to be installed in two contexts: if the Garden cluster is used as a Seed cluster at the same time, gardener-operator starts them; if the Garden cluster is separate from the Seed cluster, they are started by gardenlet.
The registration happens as soon as the Garden resource is created.
The Garden resource contains the networking information of the garden runtime cluster, which is required configuration for the NetworkPolicy controller.
Gardener relies on extensions to provide various capabilities, such as supporting cloud providers.
This controller automates the management of extensions by managing all necessary resources in the runtime and virtual garden clusters.
This reconciler reacts on events from BackupBucket, DNSRecord and Extension resources.
Based on these resources and the related Extension specification, it is checked if the extension deployment is required in the garden runtime cluster.
The result is then put into the RuntimeRequired condition and added to the Extension status.
The Gardenlet controller reconciles a seedmanagement.gardener.cloud/v1alpha1.Gardenlet resource in case there is no Seed yet with the same name.
This is used to allow easy deployments of gardenlets into unmanaged seed clusters.
For a general overview, see this document.
On Gardenlet reconciliation, the controller deploys the gardenlet to the cluster (either its own, or the one provided via the .spec.kubeconfigSecretRef) after downloading the Helm chart specified in .spec.deployment.helm.ociRepository and rendering it with the provided values/configuration.
On Gardenlet deletion, nothing happens: gardenlets must always be deleted manually (by deleting the Seed and, once it is gone, the gardenlet Deployment).
Note
This controller only takes care of the very first gardenlet deployment (since it only reacts when there is no Seed resource yet).
After the gardenlet is running, it uses the self-upgrade mechanism by watching the seedmanagement.gardener.cloud/v1alpha1.Gardenlet resource (see this for more details).
After a successful Garden reconciliation, gardener-operator also updates the .spec.deployment.helm.ociRepository.ref to its own version in all Gardenlet resources labeled with operator.gardener.cloud/auto-update-gardenlet-helm-chart-ref=true.
gardenlets then update themselves.
⚠️ If you prefer to manage the Gardenlet resources via GitOps, Flux, or similar tools, then you should manage the .spec.deployment.helm.ociRepository.ref field yourself and not label the resources as mentioned above (to prevent gardener-operator from interfering with your desired state).
Make sure to apply your Gardenlet resources (potentially containing a new version) after the Garden resource was successfully reconciled (i.e., after the Gardener control plane was successfully rolled out; see this for more information).
Webhooks
As of today, the gardener-operator only has one webhook handler which is now described in more detail.
Validation
This webhook handler validates CREATE/UPDATE/DELETE operations on Garden resources.
Simple validation is performed via standard CRD validation.
However, more advanced validation is hard to express via these means and is performed by this webhook handler.
Furthermore, for deletion requests, it is validated that the Garden is annotated with a deletion confirmation annotation, namely confirmation.gardener.cloud/deletion=true.
Only if this annotation is present is the DELETE operation allowed to pass.
This prevents users from accidental/undesired deletions.
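For example, assuming a Garden named local (as used in the local setup later in this document), the confirmation could look like this:
kubectl annotate garden local confirmation.gardener.cloud/deletion=true
kubectl delete garden local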
Another validation is to check that there is only one Garden resource at a time.
It prevents creating a second Garden when there is already one in the system.
Defaulting
This webhook handler mutates the Garden resource on CREATE/UPDATE operations.
Simple defaulting is performed via standard CRD defaulting.
However, more advanced defaulting is hard to express via these means and is performed by this webhook handler.
Using Garden Runtime Cluster As Seed Cluster
In production scenarios, you probably wouldn’t use the Kubernetes cluster running gardener-operator and the Gardener control plane (called “runtime cluster”) as seed cluster at the same time.
However, such a setup is technically possible and might simplify certain scenarios (e.g., development, evaluation, …).
If the runtime cluster is a seed cluster at the same time, gardenlet’s Seed controller will not manage the components which were already deployed (and reconciled) by gardener-operator.
As of today, this applies to:
gardener-resource-manager
vpa-{admission-controller,recommender,updater}
etcd-druid
istio control-plane
nginx-ingress-controller
Those components are so-called “seed system components”.
In addition, there are a few observability components:
fluent-operator
fluent-bit
vali
plutono
kube-state-metrics
prometheus-operator
As all of these components are managed by gardener-operator in this scenario, the gardenlet just skips them.
ℹ️ There is no need to configure anything - the gardenlet will automatically detect when its seed cluster is the garden runtime cluster at the same time.
⚠️ Note that such setup requires that you upgrade the versions of gardener-operator and gardenlet in lock-step.
Otherwise, you might experience unexpected behaviour or issues with your seed or shoot clusters.
Credentials Rotation
The credentials rotation works in the same way as it does for Shoot resources, i.e. there are gardener.cloud/operation annotation values for starting or completing the rotation procedures.
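For example, assuming a Garden named local and the same annotation values as for Shoots:
# start the rotation of all credentials
kubectl annotate garden local gardener.cloud/operation=rotate-credentials-start
# later, after all clients have been adapted to the new credentials, complete it
kubectl annotate garden local gardener.cloud/operation=rotate-credentials-complete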
For certificate authorities, gardener-operator generates one which is automatically rotated roughly each month (ca-garden-runtime) and several CAs which are NOT automatically rotated but only on demand.
🚨 Hence, it is the responsibility of the (human) operator to regularly perform the credentials rotation.
Please refer to this document for more details. As of today, gardener-operator only creates the following types of credentials (i.e., some sections of the document don’t apply for Gardens and can be ignored):
certificate authorities (and related server and client certificates)
ETCD encryption key
observability password for Plutono
ServiceAccount token signing key
WorkloadIdentity token signing key
⚠️ Rotation of static ServiceAccount secrets is not supported since the kube-controller-manager does not enable the serviceaccount-token controller.
When the ServiceAccount token signing key rotation is in Preparing phase, then gardener-operator annotates all Seeds with gardener.cloud/operation=renew-garden-access-secrets.
This causes gardenlet to populate new ServiceAccount tokens for the garden cluster to all extensions, which are now signed with the new signing key.
Read more about it here.
Similarly, when the CA certificate rotation is in Preparing phase, then gardener-operator annotates all Seeds with gardener.cloud/operation=renew-kubeconfig.
This causes gardenlet to request a new client certificate for its garden cluster kubeconfig, which is now signed with the new client CA, and which also contains the new CA bundle for the server certificate verification.
Read more about it here.
Also, when the WorkloadIdentity token signing key rotation is in Preparing phase, then gardener-operator annotates all Seeds with gardener.cloud/operation=renew-workload-identity-tokens.
This causes gardenlet to renew all workload identity tokens in the seed cluster with new tokens now signed with the new signing key.
Migrating an Existing Gardener Landscape to gardener-operator
Since gardener-operator was only developed in 2023, six years after the Gardener project initiation, most users probably already have an existing Gardener landscape.
The most prominent installation procedure is garden-setup, however experience shows that most community members have developed their own tooling for managing the garden cluster and the Gardener control plane components.
Consequently, providing a general migration guide is not possible since the detailed steps vary heavily based on how the components were set up previously.
As a result, this section can only highlight the most important caveats and things to know, while the concrete migration steps must be figured out individually based on the existing installation.
Please test your migration procedure thoroughly.
Note that in some cases it can be easier to set up a fresh landscape with gardener-operator, restore the ETCD data, switch the DNS records, and issue new credentials for all clients.
Please make sure that you configure all your desired fields in the Garden resource.
ETCD
gardener-operator leverages etcd-druid for managing the virtual-garden-etcd-main and virtual-garden-etcd-events, similar to how shoot cluster control planes are handled.
The PersistentVolumeClaim names differ slightly - for virtual-garden-etcd-events it’s virtual-garden-etcd-events-virtual-garden-etcd-events-0, while for virtual-garden-etcd-main it’s main-virtual-garden-etcd-virtual-garden-etcd-main-0.
The easiest approach for the migration is to make your existing ETCD volumes follow the same naming scheme.
Alternatively, backup your data, let gardener-operator take over ETCD, and then restore your data to the new volume.
The backup bucket must be created separately, and its name as well as the respective credentials must be provided via the Garden resource in .spec.virtualCluster.etcd.main.backup.
virtual-garden-kube-apiserver Deployment
gardener-operator deploys a virtual-garden-kube-apiserver into the runtime cluster.
This virtual-garden-kube-apiserver spans a new cluster, called the virtual cluster.
There are a few certificates and other credentials that should not change during the migration.
You have to prepare the environment accordingly by leveraging the secrets manager capabilities.
The existing Cluster CA Secret should be labeled with secrets-manager-use-data-for-name=ca.
The existing Client CA Secret should be labeled with secrets-manager-use-data-for-name=ca-client.
The existing Front Proxy CA Secret should be labeled with secrets-manager-use-data-for-name=ca-front-proxy.
The existing Service Account Signing Key Secret should be labeled with secrets-manager-use-data-for-name=service-account-key.
The existing ETCD Encryption Key Secret should be labeled with secrets-manager-use-data-for-name=kube-apiserver-etcd-encryption-key.
virtual-garden-kube-apiserver Exposure
The virtual-garden-kube-apiserver is exposed via a dedicated istio-ingressgateway deployed to namespace virtual-garden-istio-ingress.
The virtual-garden-kube-apiserver Service in the garden namespace is only of type ClusterIP.
Consequently, DNS records for this API server must target the load balancer IP of the istio-ingressgateway.
Virtual Garden Kubeconfig
gardener-operator does not generate any static token or similar credentials for access to the virtual cluster.
Ideally, human users access it via OIDC only.
Alternatively, you can create an auto-rotated token that you can use for automation like CI/CD pipelines:
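A sketch of such a Secret, following the TokenRequestor contract described later in this document (the ServiceAccount name virtual-garden-user is an arbitrary choice, and its privileges must be set up separately):
apiVersion: v1
kind: Secret
metadata:
  name: shoot-access-virtual-garden
  namespace: garden
  labels:
    resources.gardener.cloud/purpose: token-requestor
  annotations:
    serviceaccount.resources.gardener.cloud/name: virtual-garden-user
    serviceaccount.resources.gardener.cloud/namespace: kube-system
type: Opaque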
The shoot-access-virtual-garden Secret will get a .data.token field which can be used to authenticate against the virtual garden cluster.
See also this document for more information about the TokenRequestor.
gardener-apiserver
Similar to the virtual-garden-kube-apiserver, the gardener-apiserver also uses a few certificates and other credentials that should not change during the migration.
Again, you have to prepare the environment accordingly by leveraging the secrets manager capabilities.
The existing ETCD Encryption Key Secret should be labeled with secrets-manager-use-data-for-name=gardener-apiserver-etcd-encryption-key.
Also note that gardener-operator manages the Service and Endpoints resources for the gardener-apiserver in the virtual cluster within the kube-system namespace (garden-setup uses the garden namespace).
Local Development
The easiest setup is using a local KinD cluster and the Skaffold based approach to deploy and develop the gardener-operator.
Setting Up the KinD Cluster (runtime cluster)
make kind-operator-up
This command sets up a new KinD cluster named gardener-local and stores the kubeconfig in the ./example/gardener-local/kind/operator/kubeconfig file.
It might be helpful to copy this file to $HOME/.kube/config, since you will need to target this KinD cluster multiple times.
Alternatively, make sure to set your KUBECONFIG environment variable to ./example/gardener-local/kind/operator/kubeconfig for all future steps via export KUBECONFIG=$PWD/example/gardener-local/kind/operator/kubeconfig.
All the following steps assume that you are using this kubeconfig.
Setting Up Gardener Operator
make operator-up
This will first build the base images (which might take a bit if you do it for the first time).
Afterwards, the Gardener Operator resources will be deployed into the cluster.
Developing Gardener Operator (Optional)
make operator-dev
This is similar to make operator-up but additionally starts a skaffold dev loop.
After the initial deployment, skaffold starts watching source files.
Once it has detected changes, press any key to trigger a new build and deployment of the changed components.
Debugging Gardener Operator (Optional)
make operator-debug
This is similar to make gardener-debug, but for the Gardener Operator component. Please check Debugging Gardener for details.
Creating a Garden
In order to create a garden, just run:
kubectl apply -f example/operator/20-garden.yaml
You can wait for the Garden to be ready by running:
./hack/usage/wait-for.sh garden local VirtualGardenAPIServerAvailable VirtualComponentsHealthy
Alternatively, you can run kubectl get garden and wait for the RECONCILED status to reach True:
NAME    LAST OPERATION    RUNTIME    VIRTUAL    API SERVER    OBSERVABILITY    AGE
local   Processing        False      False      False         False            1s
(Optional): Instead of creating the above Garden resource manually, you could execute the e2e tests by running:
make test-e2e-local-operator
Accessing the Virtual Garden Cluster
⚠️ Please note that in this setup, the virtual garden cluster is not accessible by default when you download the kubeconfig and try to communicate with it.
The reason is that your host most probably cannot resolve the DNS name of the cluster.
Hence, if you want to access the virtual garden cluster, you have to run the following command which will extend your /etc/hosts file with the required information to make the DNS names resolvable:
cat <<EOF | sudo tee -a /etc/hosts
# Manually created to access local Gardener virtual garden cluster.
# TODO: Remove this again when the virtual garden cluster access is no longer required.
172.18.255.3 api.virtual-garden.local.gardener.cloud
EOF
To access the virtual garden, you can acquire a kubeconfig by running:
kubectl -n garden get secret gardener -o jsonpath={.data.kubeconfig} | base64 -d > /tmp/virtual-garden-kubeconfig
kubectl --kubeconfig /tmp/virtual-garden-kubeconfig get namespaces
Note that this kubeconfig uses a token that has a validity of 12h only; hence, it might expire and cause you to re-download the kubeconfig.
Creating Seeds and Shoots
You can also create Seeds and Shoots from your local development setup.
Please see here for details.
Deleting the Garden
./hack/usage/delete garden local
Tear Down the Gardener Operator Environment
make operator-down
make kind-operator-down
4.12 - Gardener Resource Manager
Set of controllers with different responsibilities running once per seed and once per shoot
Overview
Initially, the gardener-resource-manager was a project similar to the kube-addon-manager.
It manages Kubernetes resources in a target cluster which means that it creates, updates, and deletes them.
Also, it makes sure that manual modifications to these resources are reconciled back to the desired state.
In the Gardener project, we had been using the kube-addon-manager for more than two years.
While we progressed with our extensibility story (moving cloud providers out-of-tree), we decided that the kube-addon-manager is no longer suitable for this use case.
The problem with it is that it needs to have its managed resources on its file system.
This requires storing the resources in ConfigMaps or Secrets and mounting them to the kube-addon-manager pod during deployment time.
The gardener-resource-manager uses CustomResourceDefinitions, which allow resources to be dynamically added, changed, and removed with immediate effect and without the need to reconfigure volume mounts or restart the pod.
Meanwhile, the gardener-resource-manager has evolved to a more generic component comprising several controllers and webhook handlers.
It is deployed by gardenlet once per seed (in the garden namespace) and once per shoot (in the respective shoot namespaces in the seed).
Component Configuration
Similar to other Gardener components, the gardener-resource-manager uses a so-called component configuration file.
It allows specifying certain central settings like log level and formatting, client connection configuration, server ports and bind addresses, etc.
In addition, controllers and webhooks can be configured and sometimes even disabled.
Note that the very basic ManagedResource and health controllers cannot be disabled.
This controller watches custom objects called ManagedResources in the resources.gardener.cloud/v1alpha1 API group.
These objects contain references to Secrets, which themselves contain the resources to be managed.
The reason why a Secret is used to store the resources is that they could contain confidential information like credentials.
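A minimal sketch of such a pair (names are illustrative; in practice, the Secret data is base64-encoded, which stringData takes care of on creation):
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example
  namespace: default
type: Opaque
stringData:
  objects.yaml: |
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: test-1234
      namespace: default
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: test-5678
      namespace: default
---
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example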
In the above example, the controller creates two ConfigMaps in the default namespace.
When a user manually modifies them, they will be reconciled back to the desired state stored in the managedresource-example Secret.
It is also possible to inject labels into all the resources:
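A sketch using the ManagedResource's .spec.injectLabels field (assuming the referenced Secret contains a Deployment manifest, as in the sentence below):
spec:
  injectLabels:
    foo: bar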
In this example, the label foo=bar will be injected into the Deployment, as well as into all created ReplicaSets and Pods.
Preventing Reconciliations
If a ManagedResource is annotated with resources.gardener.cloud/ignore=true, then it will be skipped entirely by the controller (no reconciliations or deletions of managed resources at all).
However, when the ManagedResource itself is deleted (for example when a shoot is deleted), then the annotation is not respected and all resources will be deleted as usual.
This feature can be helpful to temporarily patch/change resources managed as part of such ManagedResource.
Condition checks will be skipped for such ManagedResources.
Modes
The gardener-resource-manager can manage a resource in the following supported modes:
Ignore
The corresponding resource is removed from the ManagedResource status (.status.resources). No action is performed on the cluster.
The resource is no longer “managed” (updated or deleted).
The primary use case is a migration of a resource from one ManagedResource to another one.
The mode for a resource can be specified with the resources.gardener.cloud/mode annotation. The annotation should be specified in the encoded resource manifest in the Secret that is referenced by the ManagedResource.
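For example, a resource manifest inside the referenced Secret could carry the annotation like this (names are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-1234
  namespace: default
  annotations:
    resources.gardener.cloud/mode: Ignore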
Resource Class and Reconciliation Scope
By default, the gardener-resource-manager controller watches for ManagedResources in all namespaces.
The .sourceClientConnection.namespace field in the component configuration restricts the watch to ManagedResources in a single namespace only.
Note that this setting also affects all other controllers and webhooks since it’s a central configuration.
A ManagedResource has an optional .spec.class field that allows it to indicate that it belongs to a given class of resources.
The .controllers.resourceClass field in the component configuration restricts the watch to ManagedResources with the given .spec.class.
A default class is assumed if no class is specified.
For instance, the gardener-resource-manager which is deployed in the Shoot’s control plane namespace in the Seed does not specify a .spec.class and watches only for resources in the control plane namespace by specifying it in the .sourceClientConnection.namespace field.
If the .spec.class changes this means that the resources have to be handled by a different Gardener Resource Manager. That is achieved by:
Cleaning all referenced resources by the Gardener Resource Manager that was responsible for the old class in its target cluster.
Creating all referenced resources by the Gardener Resource Manager that is responsible for the new class in its target cluster.
A ManagedResource has a ManagedResourceStatus, which has an array of Conditions. Conditions currently include:
Condition               Description
ResourcesApplied        True if all resources are applied to the target cluster
ResourcesHealthy        True if all resources are present and healthy
ResourcesProgressing    False if all resources have been fully rolled out
ResourcesApplied may be False when:
the resource apiVersion is not known to the target cluster
the resource spec is invalid (for example the label value does not match the required regex for it)
…
ResourcesHealthy may be False when:
the resource is not found
the resource is a Deployment and the Deployment does not have the minimum availability.
…
ResourcesProgressing may be True when:
a Deployment, StatefulSet or DaemonSet has not been fully rolled out yet, i.e. not all replicas have been updated with the latest changes to spec.template.
there are still old Pods belonging to an older ReplicaSet of a Deployment which are not terminated yet.
Each Kubernetes resource has a different notion of being healthy. For example, a Deployment is considered healthy if the controller observed its current revision and if the number of updated replicas equals the number of replicas.
The following status.conditions section describes a healthy ManagedResource:
conditions:
- lastTransitionTime: "2022-05-03T10:55:39Z"
  lastUpdateTime: "2022-05-03T10:55:39Z"
  message: All resources are healthy.
  reason: ResourcesHealthy
  status: "True"
  type: ResourcesHealthy
- lastTransitionTime: "2022-05-03T10:55:36Z"
  lastUpdateTime: "2022-05-03T10:55:36Z"
  message: All resources have been fully rolled out.
  reason: ResourcesRolledOut
  status: "False"
  type: ResourcesProgressing
- lastTransitionTime: "2022-05-03T10:55:18Z"
  lastUpdateTime: "2022-05-03T10:55:18Z"
  message: All resources are applied.
  reason: ApplySucceeded
  status: "True"
  type: ResourcesApplied
Ignoring Updates
In some cases, it is not desirable to update or re-apply some of the cluster components (for example, if customization is required or needs to be applied by the end-user).
For these resources, the annotation “resources.gardener.cloud/ignore” needs to be set to “true” or a truthy value (Truthy values are “1”, “t”, “T”, “true”, “TRUE”, “True”) in the corresponding managed resource secrets.
This can be done from the components that create the managed resource secrets, for example Gardener extensions or Gardener. Once this is done, the resource will be initially created and later ignored during reconciliation.
Finalizing Deletion of Resources After Grace Period
When a ManagedResource is deleted, the controller deletes all managed resources from the target cluster.
In case the resources still have entries in their .metadata.finalizers[] list, they will remain stuck in the system until another entity removes the finalizers.
If you want the controller to forcefully finalize the deletion after some grace period (i.e., setting .metadata.finalizers=null), you can annotate the managed resources with resources.gardener.cloud/finalize-deletion-after=<duration>, e.g., resources.gardener.cloud/finalize-deletion-after=1h.
Preserving replicas or resources in Workload Resources
The objects which are part of the ManagedResource can be annotated with:
resources.gardener.cloud/preserve-replicas=true in case the .spec.replicas field of workload resources like Deployments, StatefulSets, etc., shall be preserved during updates.
resources.gardener.cloud/preserve-resources=true in case the .spec.containers[*].resources fields of all containers of workload resources like Deployments, StatefulSets, etc., shall be preserved during updates.
This can be useful if there are non-standard horizontal/vertical auto-scaling mechanisms in place.
Standard mechanisms like HorizontalPodAutoscaler or VerticalPodAutoscaler will be auto-recognized by gardener-resource-manager, i.e., in such cases the annotations are not needed.
Origin
All the objects managed by the resource manager get a dedicated annotation resources.gardener.cloud/origin describing the ManagedResource object that describes this object. The default format is <namespace>/<objectname>.
In multi-cluster scenarios (the ManagedResource objects are maintained in a cluster different from the one in which the described objects are managed), it might be useful to include the cluster identity as well.
This can be enforced by setting the .controllers.clusterID field in the component configuration.
Here, several possibilities are supported:
given a direct value: use this as id for the source cluster.
<cluster>: read the cluster identity from a cluster-identity config map in the kube-system namespace (attribute cluster-identity). This is automatically maintained in all clusters managed or involved in a gardener landscape.
<default>: try to read the cluster identity from the config map. If not found, no identity is used.
empty string: no cluster identity is used (completely cluster local scenarios).
By default, cluster id is not used. If cluster id is specified, the format is <cluster id>:<namespace>/<objectname>.
In addition to the origin annotation, all objects managed by the resource manager get a dedicated label resources.gardener.cloud/managed-by. This label can be used to describe these objects with a selector. By default, it is set to "gardener", but this can be overwritten by setting the .controllers.managedResources.managedByLabelValue field in the component configuration.
Compression
The number and size of manifests for a ManagedResource can accumulate to a considerable amount which leads to increased Secret data.
A decent compression algorithm helps to reduce the footprint of such Secrets and the load they put on etcd, the kube-apiserver, and client caches.
We found Brotli to be a suitable candidate for most use cases (see comparison table here).
When the gardener-resource-manager detects a data key with the known suffix .br, it automatically decompresses the data first before processing the contained manifest.
This controller processes ManagedResources that were reconciled by the main ManagedResource Controller at least once.
Its main job is to perform checks for maintaining the well-known conditions ResourcesHealthy and ResourcesProgressing.
Progressing Checks
In Kubernetes, applied changes must usually be rolled out first, e.g. when changing the base image in a Deployment.
Progressing checks detect ongoing roll-outs and report them in the ResourcesProgressing condition of the corresponding ManagedResource.
The following object kinds are considered for progressing checks:
gardener-resource-manager can evaluate the health of specific resources, often by consulting their conditions.
Health check results are regularly updated in the ResourcesHealthy condition of the corresponding ManagedResource.
The following object kinds are considered for health checks:
If a resource owned by a ManagedResource is annotated with resources.gardener.cloud/skip-health-check=true, then the resource will be skipped during health checks by the health controller. The ManagedResource conditions will not reflect the health condition of this resource anymore. The ResourcesProgressing condition will also be set to False.
In Kubernetes, workload resources (e.g., Pods) can mount ConfigMaps or Secrets or reference them via environment variables in containers.
When the content of such a ConfigMap/Secret changes, the respective workload typically does not dynamically reload the configuration, i.e., a restart is required.
The most commonly used approach is probably having the so-called checksum annotations in the pod template, which makes Kubernetes recreate the pod if the checksum changes.
However, it has the downside that old, still running versions of the workload might not be able to properly work with the already updated content in the ConfigMap/Secret, potentially causing application outages.
In order to protect users from such outages (and also to improve the performance of the cluster), the Kubernetes community provides the “immutable ConfigMaps/Secrets feature”.
Enabling immutability requires ConfigMaps/Secrets to have unique names.
Having unique names requires the client to delete ConfigMaps/Secrets no longer in use.
In order to provide a similarly lightweight experience for clients (compared to the well-established checksum annotation approach), the gardener-resource-manager features an optional garbage collector controller (disabled by default).
The purpose of this controller is cleaning up such immutable ConfigMaps/Secrets if they are no longer in use.
How Does the Garbage Collector Work?
The following algorithm is implemented in the GC controller:
List all ConfigMaps and Secrets labeled with resources.gardener.cloud/garbage-collectable-reference=true.
List all Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, Pods, ManagedResources and for each of them:
iterate over the .metadata.annotations and for each of them:
If the annotation key follows the reference.resources.gardener.cloud/{configmap,secret}-<hash> scheme and the value equals <name>, then consider it as “in-use”.
Delete all ConfigMaps and Secrets not considered as “in-use”.
Consequently, clients need to:
Create immutable ConfigMaps/Secrets with unique names (e.g., a checksum suffix based on the .data).
Label such ConfigMaps/Secrets with resources.gardener.cloud/garbage-collectable-reference=true.
Annotate their workload resources with reference.resources.gardener.cloud/{configmap,secret}-<hash>=<name> for all ConfigMaps/Secrets used by the containers of the respective Pods.
⚠️ Add such annotations to .metadata.annotations, as well as to all templates of other resources (e.g., .spec.template.metadata.annotations in Deployments or .spec.jobTemplate.metadata.annotations and .spec.jobTemplate.spec.template.metadata.annotations for CronJobs).
This ensures that the GC controller does not unintentionally consider ConfigMaps/Secrets as “not in use” just because there isn’t a Pod referencing them anymore (e.g., they could still be used by a Deployment scaled down to 0).
ℹ️ For the last step, there is a helper function InjectAnnotations in the pkg/controller/garbagecollector/references, which you can use for your convenience.
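Putting it all together, a client might create resources along these lines (a sketch; the -abc123 name suffix and the annotation hash are illustrative, and in Gardener code the proper annotation key is computed by the referenced helper):
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-abc123
  namespace: default
  labels:
    resources.gardener.cloud/garbage-collectable-reference: "true"
immutable: true
data:
  config.yaml: |
    key: value
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: default
  annotations:
    reference.resources.gardener.cloud/configmap-abc123: app-config-abc123
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      annotations:
        reference.resources.gardener.cloud/configmap-abc123: app-config-abc123
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v1
        volumeMounts:
        - mountPath: /etc/app
          name: config
      volumes:
      - configMap:
          name: app-config-abc123
        name: config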
The Kubernetes community is slowly transitioning from static ServiceAccount token Secrets to ServiceAccount Token Volume Projection.
Typically, when you create a ServiceAccount, the kube-controller-manager's serviceaccount-token controller automatically generates a static, never-expiring token for it in a Secret.
Disabling the serviceaccount-token controller is an option; however, especially in the Gardener context, it may either break end-users or it may not even be possible to control such settings.
Also, even if a future Kubernetes version supports native configuration of the above behaviour, Gardener still supports older versions which won’t get such features but need a solution as well.
This is where the TokenInvalidator comes into play:
Since it is not possible to prevent the kube-controller-manager from generating static ServiceAccount Secrets, the TokenInvalidator is, as its name suggests, invalidating these tokens.
It considers all such Secrets belonging to ServiceAccounts with .automountServiceAccountToken=false.
By default, all namespaces in the target cluster are watched, however, this can be configured by specifying the .targetClientConnection.namespace field in the component configuration.
Note that this setting also affects all other controllers and webhooks since it’s a central configuration.
Any attempt to regenerate the token or to create a new such Secret will again make the component invalidate it.
You can opt-out of this behaviour for ServiceAccounts setting .automountServiceAccountToken=false by labeling them with token-invalidator.resources.gardener.cloud/skip=true.
In order to enable the TokenInvalidator, you have to set both .controllers.tokenInvalidator.enabled=true and .webhooks.tokenInvalidator.enabled=true in the component configuration.
The below graphic shows an overview of the Token Invalidator for Service account secrets in the Shoot cluster.
This controller provides the service to create and auto-renew tokens via the TokenRequest API.
It provides a functionality similar to the kubelet’s Service Account Token Volume Projection.
It was created to handle the special case of issuing tokens to pods that run in a different cluster than the API server they communicate with (hence, using the native token volume projection feature is not possible).
The controller differentiates between source cluster and target cluster.
The source cluster hosts the gardener-resource-manager pod. Secrets in this cluster are watched and modified by the controller.
The target cluster can be configured to point to another cluster. The existence of ServiceAccounts is ensured and token requests are issued against the target.
When the gardener-resource-manager is deployed next to the Shoot's control plane in the Seed, the source cluster is the Seed while the target cluster points to the Shoot.
Reconciliation Loop
This controller reconciles Secrets in all namespaces in the source cluster with the label: resources.gardener.cloud/purpose=token-requestor.
See this YAML file for an example of the secret.
The controller ensures a ServiceAccount exists in the target cluster as specified in the annotations of the Secret in the source cluster:
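A sketch of such a Secret (the Secret name, namespace, and ServiceAccount name are placeholders):
apiVersion: v1
kind: Secret
metadata:
  name: shoot-access-my-component
  namespace: shoot--my-project--my-shoot
  labels:
    resources.gardener.cloud/purpose: token-requestor
  annotations:
    serviceaccount.resources.gardener.cloud/name: my-component
    serviceaccount.resources.gardener.cloud/namespace: kube-system
type: Opaque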
You can optionally annotate the Secret with serviceaccount.resources.gardener.cloud/labels, e.g. serviceaccount.resources.gardener.cloud/labels={"some":"labels","foo":"bar"}.
This will make the ServiceAccount get labeled accordingly.
The requested tokens will act with the privileges which are assigned to this ServiceAccount.
The controller will then request a token via the TokenRequest API and populate it into the .data.token field of the Secret in the source cluster.
Alternatively, the client can provide a raw kubeconfig (in YAML or JSON format) via the Secret’s .data.kubeconfig field.
The controller will then populate the requested token in the kubeconfig for the user used in the .current-context.
For example, if .data.kubeconfig is
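(a sketch; the server URL and names are placeholders, and the token is initially empty)
apiVersion: v1
kind: Config
current-context: shoot
clusters:
- name: shoot
  cluster:
    certificate-authority-data: <base64-encoded-ca-bundle>
    server: https://api.my-shoot.example.com
contexts:
- name: shoot
  context:
    cluster: shoot
    user: shoot
users:
- name: shoot
  user:
    token: ""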
then the .users[0].user.token field of the kubeconfig will be updated accordingly.
The controller also adds an annotation to the Secret to keep track when to renew the token before it expires.
By default, the tokens are issued to expire after 12 hours. The expiration time can be set with the following annotation:
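# the duration value is only an illustration
serviceaccount.resources.gardener.cloud/token-expiration-duration: 6h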
It automatically renews once 80% of the lifetime is reached, or after 24h.
Optionally, the controller can also populate the token into a Secret in the target cluster. This can be requested by annotating the Secret in the source cluster with:
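A sketch of the relevant annotations (the target Secret name and namespace are placeholders):
token-requestor.resources.gardener.cloud/target-secret-name: "token-secret"
token-requestor.resources.gardener.cloud/target-secret-namespace: "kube-system"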
Overall, the TokenRequestor controller provides credentials with limited lifetime (JWT tokens)
used by Shoot control plane components running in the Seed to talk to the Shoot API Server.
Please see the graphic below:
ℹ️ Generally, the controller can run with multiple instances in different components.
For example, gardener-resource-manager might run the TokenRequestor controller, but gardenlet might run it, too.
In order to differentiate which instance of the controller is responsible for a Secret, it can be labeled with resources.gardener.cloud/class=<class>.
The <class> must be configured in the respective controller, otherwise it will be responsible for all Secrets no matter whether they have the label or not.
Gardener configures the kubelets such that they request two certificates via the CertificateSigningRequest API:
client certificate for communicating with the kube-apiserver
server certificate for serving its HTTPS server
For client certificates, the kubernetes.io/kube-apiserver-client-kubelet signer is used (see Certificate Signing Requests for more details).
The kube-controller-manager’s csrapprover controller is responsible for auto-approving such CertificateSigningRequests so that the respective certificates can be issued.
For server certificates, the kubernetes.io/kubelet-serving signer is used.
Unfortunately, the kube-controller-manager is not able to auto-approve such CertificateSigningRequests (see kubernetes/kubernetes#73356 for details).
That’s the motivation for having this controller as part of gardener-resource-manager.
It watches CertificateSigningRequests with the kubernetes.io/kubelet-serving signer and auto-approves them when all the following conditions are met:
The .spec.username is prefixed with system:node:.
There must be at least one DNS name or IP address as part of the certificate SANs.
The common name in the CSR must match the .spec.username.
The organization in the CSR must only contain system:nodes.
There must be a Node object with the same name in the shoot cluster.
There must be exactly one Machine for the node in the seed cluster.
The DNS names part of the SANs must be equal to all .status.addresses[] of type Hostname in the Node.
The IP addresses part of the SANs must be equal to all .status.addresses[] of type InternalIP in the Node.
If any one of these requirements is violated, the CertificateSigningRequest will be denied.
Otherwise, once approved, the kube-controller-manager’s csrsigner controller will issue the requested certificate.
Gardener Node Agent
There is a second use case for the CSR approver: the Gardener Node Agent is able to use client certificates for communication with the kube-apiserver.
These certificates are requested via the CertificateSigningRequest API, using the kubernetes.io/kube-apiserver-client signer.
Three use cases are covered:
Bootstrap a new node.
Renew certificates.
Migrate nodes using gardener-node-agent service account.
There is no auto-approve for these CertificateSigningRequests either.
As there are more users of the kubernetes.io/kube-apiserver-client signer, this controller only handles CertificateSigningRequests whose common name is prefixed with gardener.cloud:node-agent:machine:.
The prefix is followed by the username which must be equal to the machine.Name.
It auto-approves them when the following conditions are met:
Bootstrapping:
The .spec.username is prefixed with system:node:.
A Machine for common name pattern gardener.cloud:node-agent:machine:<machine-name> in the CSR exists.
The Machine does not have a label with key node.
Certificate renewal:
The .spec.username is prefixed with gardener.cloud:node-agent:machine:.
A Machine for common name pattern gardener.cloud:node-agent:machine:<machine-name> in the CSR exists.
The common name in the CSR must match the .spec.username.
Migration:
The .spec.username is equal to system:serviceaccount:kube-system:gardener-node-agent.
A Machine for common name pattern gardener.cloud:node-agent:machine:<machine-name> in the CSR exists.
The Machine has a label with key node.
If the common name in the CSR is not prefixed with gardener.cloud:node-agent:machine:, the CertificateSigningRequest will be ignored.
If any one of these requirements is violated, the CertificateSigningRequest will be denied.
Otherwise, once approved, the kube-controller-manager’s csrsigner controller will issue the requested certificate.
This controller reconciles Services with a non-empty pod selector (.spec.selector).
It creates two NetworkPolicys for each port in the .spec.ports[] list.
For example:
apiVersion: v1
kind: Service
metadata:
  name: gardener-resource-manager
  namespace: a
spec:
  selector:
    app: gardener-resource-manager
  ports:
  - name: server
    port: 443
    protocol: TCP
    targetPort: 10250
leads to
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows ingress TCP traffic to port 10250 for pods
      selected by the a/gardener-resource-manager service selector from pods running
      in namespace a labeled with map[networking.resources.gardener.cloud/to-gardener-resource-manager-tcp-10250:allowed].
  name: ingress-to-gardener-resource-manager-tcp-10250
  namespace: a
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          networking.resources.gardener.cloud/to-gardener-resource-manager-tcp-10250: allowed
    ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      app: gardener-resource-manager
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows egress TCP traffic to port 10250 from pods
      running in namespace a labeled with map[networking.resources.gardener.cloud/to-gardener-resource-manager-tcp-10250:allowed]
      to pods selected by the a/gardener-resource-manager service selector.
  name: egress-to-gardener-resource-manager-tcp-10250
  namespace: a
spec:
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: gardener-resource-manager
    ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      networking.resources.gardener.cloud/to-gardener-resource-manager-tcp-10250: allowed
  policyTypes:
  - Egress
A component that initiates the connection to gardener-resource-manager’s tcp/10250 port can now be labeled with networking.resources.gardener.cloud/to-gardener-resource-manager-tcp-10250=allowed.
That’s all this component needs to do - it does not need to create any NetworkPolicys itself.
Cross-Namespace Communication
Apart from this “simple” case where both communicating components run in the same namespace a, there is also the cross-namespace communication case.
With the above example, let's say there are components running in another namespace b, and they would like to initiate the communication with gardener-resource-manager in a.
To cover this scenario, the Service can be annotated with networking.resources.gardener.cloud/namespace-selectors='[{"matchLabels":{"kubernetes.io/metadata.name":"b"}}]'.
Note that you can specify multiple namespace selectors in this annotation which are OR-ed.
This will make the controller create additional NetworkPolicys as follows:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows ingress TCP traffic to port 10250 for pods selected
      by the a/gardener-resource-manager service selector from pods running in namespace b
      labeled with map[networking.resources.gardener.cloud/to-a-gardener-resource-manager-tcp-10250:allowed].
  name: ingress-to-gardener-resource-manager-tcp-10250-from-b
  namespace: a
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: b
      podSelector:
        matchLabels:
          networking.resources.gardener.cloud/to-a-gardener-resource-manager-tcp-10250: allowed
    ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      app: gardener-resource-manager
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows egress TCP traffic to port 10250 from pods running in
      namespace b labeled with map[networking.resources.gardener.cloud/to-a-gardener-resource-manager-tcp-10250:allowed]
      to pods selected by the a/gardener-resource-manager service selector.
  name: egress-to-a-gardener-resource-manager-tcp-10250
  namespace: b
spec:
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: a
      podSelector:
        matchLabels:
          app: gardener-resource-manager
    ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      networking.resources.gardener.cloud/to-a-gardener-resource-manager-tcp-10250: allowed
  policyTypes:
  - Egress
The components in namespace b now need to be labeled with networking.resources.gardener.cloud/to-a-gardener-resource-manager-tcp-10250=allowed, but that’s already it.
Obviously, this approach also works for namespace selectors different from kubernetes.io/metadata.name to cover scenarios where the namespace name is not known upfront or where multiple namespaces with a similar label are relevant.
The controller creates two dedicated policies for each namespace matching the selectors.
Service Targets In Multiple Namespaces
Finally, let’s say there is a Service called example which exists in different namespaces whose names are not static (e.g., foo-1, foo-2), and a component in namespace bar wants to initiate connections with all of them.
The example Services in these namespaces can now be annotated with networking.resources.gardener.cloud/namespace-selectors='[{"matchLabels":{"kubernetes.io/metadata.name":"bar"}}]'.
As a consequence, the component in namespace bar now needs to be labeled with networking.resources.gardener.cloud/to-foo-1-example-tcp-8080=allowed, networking.resources.gardener.cloud/to-foo-2-example-tcp-8080=allowed, etc.
This approach does not work in practice, however, since the namespace names are neither static nor known upfront.
To overcome this, it is possible to specify an alias for the concrete namespace in the pod label selector via the networking.resources.gardener.cloud/pod-label-selector-namespace-alias annotation.
In the above case, the example Service in the foo-* namespaces could be annotated with networking.resources.gardener.cloud/pod-label-selector-namespace-alias=all-foos.
This would modify the label selector in all NetworkPolicys related to cross-namespace communication, i.e. instead of networking.resources.gardener.cloud/to-foo-{1,2,...}-example-tcp-8080=allowed, networking.resources.gardener.cloud/to-all-foos-example-tcp-8080=allowed would be used.
Now the component in namespace bar only needs this single label and is able to talk to all such Services in the different namespaces.
Real-world examples for this scenario are the kube-apiserver Service (which exists in all shoot namespaces), or the istio-ingressgateway Service (which exists in all istio-ingress* namespaces).
In both cases, the names of the namespaces are not statically known and depend on user input.
Overwriting The Pod Selector Label
For a component which initiates the connection to many other components, it’s sometimes impractical to specify all the respective labels in its pod template.
For example, let’s say a component foo talks to bar{0..9} on ports tcp/808{0..9}.
foo would need to have the ten networking.resources.gardener.cloud/to-bar{0..9}-tcp-808{0..9}=allowed labels.
As an alternative and to simplify this, it is also possible to annotate the targeted Services with networking.resources.gardener.cloud/from-<some-alias>-allowed-ports.
For our example, <some-alias> could be all-bars.
As a result, component foo just needs to have the label networking.resources.gardener.cloud/to-all-bars=allowed instead of all the other ten explicit labels.
⚠️ Note that this also requires specifying the list of allowed container ports as annotation value since the pod selector label will no longer be specific to a dedicated service/port.
For our example, the Service for barX with X in {0..9} needs to be annotated with networking.resources.gardener.cloud/from-all-bars-allowed-ports=[{"port":808X,"protocol":"TCP"}] in addition.
Real-world examples for this scenario are the Prometheis in seed clusters which initiate the communication to a lot of components in order to scrape their metrics.
Another example is the kube-apiserver which initiates the communication to webhook servers (potentially of extension components that are not known by Gardener itself).
Ingress From Everywhere
All above scenarios are about components initiating connections to some targets.
However, some components also receive incoming traffic from sources outside the cluster.
This traffic requires adequate ingress policies so that it can be allowed.
To cover this scenario, the Service can be annotated with networking.resources.gardener.cloud/from-world-to-ports=[{"port":"10250","protocol":"TCP"}].
As a result, the controller creates the following NetworkPolicy:
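A sketch of the resulting policy, based on the a/gardener-resource-manager example used above (an ingress rule without a from clause allows traffic from all sources):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows ingress TCP traffic to port 10250 for pods selected
      by the a/gardener-resource-manager service selector from all pods running in all namespaces.
  name: ingress-to-gardener-resource-manager-tcp-10250-from-world
  namespace: a
spec:
  ingress:
  - ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      app: gardener-resource-manager
  policyTypes:
  - Ingress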
The respective pods don’t need any additional labels.
If the annotation’s value is empty ([]) then all ports are allowed.
Services Exposed via Ingress Resources
The controller can optionally be configured to watch Ingress resources by specifying the pod and namespace selectors for the Ingress controller.
If this information is provided, it automatically creates NetworkPolicy resources allowing the respective ingress/egress traffic for the backends exposed by the Ingresses.
This way, neither custom NetworkPolicys nor custom labels need to be provided.
The needed configuration is part of the component configuration:
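A sketch of this configuration, assuming the Ingress controller runs in the default namespace with pods labeled foo=bar (matching the example policies below):
controllers:
  networkPolicy:
    enabled: true
    ingressControllerSelector:
      namespace: default
      podSelector:
        matchLabels:
          foo: bar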
As a result, the controller would automatically create the following NetworkPolicys:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows ingress TCP traffic to port 10250 for pods
      selected by the a/gardener-resource-manager service selector from ingress controller
      pods running in the default namespace labeled with map[foo:bar].
  name: ingress-to-gardener-resource-manager-tcp-10250-from-ingress-controller
  namespace: a
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          foo: bar
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: default
    ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      app: gardener-resource-manager
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  annotations:
    gardener.cloud/description: Allows egress TCP traffic to port 10250 from pods
      running in the default namespace labeled with map[foo:bar] to pods selected by
      the a/gardener-resource-manager service selector.
  name: egress-to-a-gardener-resource-manager-tcp-10250-from-ingress-controller
  namespace: default
spec:
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: gardener-resource-manager
      namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: a
    ports:
    - port: 10250
      protocol: TCP
  podSelector:
    matchLabels:
      foo: bar
  policyTypes:
  - Egress
ℹ️ Note that Ingress resources reference the service port while NetworkPolicys reference the target port/container port.
The controller automatically translates this when reconciling the NetworkPolicy resources.
Gardenlet configures kubelet of shoot worker nodes to register the Node object with the node.gardener.cloud/critical-components-not-ready taint (effect NoSchedule).
This controller watches newly created Node objects in the shoot cluster and removes the taint once all node-critical components are scheduled and ready.
If the controller finds node-critical components that are not scheduled or not ready yet, it checks the Node again after the duration configured in ResourceManagerConfiguration.controllers.node.backoff.
Please refer to the feature documentation or proposal issue for more details.
This controller computes a reconciliation delay per node by using a simple linear mapping approach based on the index of the nodes in the list of all nodes in the shoot cluster.
This approach ensures that the delays of all instances of gardener-node-agent are distributed evenly.
The minimum and maximum delays can be configured, but they are defaulted to 0s and 5m, respectively.
This approach works well as long as the number of nodes in the cluster is not higher than the configured maximum delay in seconds.
Beyond that, the delay is still computed linearly; however, the more nodes exist in the cluster, the closer the delay times become (which might be of limited use then).
Consider increasing the maximum delay by annotating the Shoot with shoot.gardener.cloud/cloud-config-execution-max-delay-seconds=<value>.
The highest possible value is 1800.
The controller adds the node-agent.gardener.cloud/reconciliation-delay annotation to each Node; its value is read by the node-agents.
Webhooks
Mutating Webhooks
High Availability Config
This webhook is used to conveniently apply the configuration to make components deployed to seed or shoot clusters highly available.
The details and scenarios are described in High Availability Of Deployed Components.
The webhook reacts on creation/update of Deployments, StatefulSets and HorizontalPodAutoscalers in namespaces labeled with high-availability-config.resources.gardener.cloud/consider=true.
The webhook performs the following actions:
The .spec.replicas (or spec.minReplicas respectively) field is mutated based on the high-availability-config.resources.gardener.cloud/type label of the resource and the high-availability-config.resources.gardener.cloud/failure-tolerance-type annotation of the namespace:
| ⬇️ Component Type / Failure Tolerance Type ➡️ | unset | empty | non-empty |
| --- | --- | --- | --- |
| controller | 2 | 1 | 2 |
| server | 2 | 2 | 2 |
The replica count values can be overwritten by the high-availability-config.resources.gardener.cloud/replicas annotation.
It does NOT mutate the replicas when:
the replicas are already set to 0 (hibernation case), or
when the resource is scaled horizontally by HorizontalPodAutoscaler, and the current replica count is higher than what was computed above.
When the high-availability-config.resources.gardener.cloud/zones annotation is NOT empty and either the high-availability-config.resources.gardener.cloud/failure-tolerance-type annotation is set or the high-availability-config.resources.gardener.cloud/zone-pinning annotation is set to true, then it adds a node affinity to the pod template spec:
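A sketch of such an affinity, assuming the zones annotation contains europe-1a and europe-1b (zone names are illustrative):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - europe-1a
            - europe-1b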
This ensures that all pods are pinned to only nodes in exactly those concrete zones.
Topology Spread Constraints are added to the pod template spec when the .spec.replicas are greater than 1. When the high-availability-config.resources.gardener.cloud/zones annotation …
… contains only one zone, then the following is added:
spec:
  topologySpreadConstraints:
  - topologyKey: kubernetes.io/hostname
    minDomains: 3 # lower value of max replicas or 3
    maxSkew: 1
    whenUnsatisfiable: ScheduleAnyway # or DoNotSchedule
    labelSelector: ...
This ensures that the (multiple) pods are scheduled across nodes. minDomains is set when failure tolerance is configured or annotation high-availability-config.resources.gardener.cloud/host-spread="true" is given.
… contains at least two zones, then the following is added:
spec:
  topologySpreadConstraints:
  - topologyKey: kubernetes.io/hostname
    maxSkew: 1
    whenUnsatisfiable: ScheduleAnyway # or DoNotSchedule
    labelSelector: ...
  - topologyKey: topology.kubernetes.io/zone
    minDomains: 2 # lower value of max replicas or number of zones
    maxSkew: 1
    whenUnsatisfiable: DoNotSchedule
    labelSelector: ...
This enforces that the (multiple) pods are scheduled across zones.
The minDomains calculation is based on whichever value is lower: the (maximum) replicas or the number of zones. This is the minimum number of domains required to schedule pods in a highly available manner.
Independent of the number of zones, when one of the following conditions is true, the field whenUnsatisfiable is set to DoNotSchedule for the constraint with topologyKey=kubernetes.io/hostname (which enforces the node-spread):
The high-availability-config.resources.gardener.cloud/host-spread annotation is set to true.
The high-availability-config.resources.gardener.cloud/failure-tolerance-type annotation is set and NOT empty.
Tolerations for taints node.kubernetes.io/not-ready and node.kubernetes.io/unreachable are added to the handled Deployment and StatefulSet if their podTemplates do not already specify them.
The TolerationSeconds are taken from the respective configuration section of the webhook's configuration (see example).
We consider fine-tuned values for those tolerations a matter of high-availability because they often help to reduce recovery times in case of node or zone outages, also see High-Availability Best Practices.
In addition, this webhook handling helps to set defaults for many but not all workload components in a cluster. For instance, Gardener can use this webhook to set defaults for nearly every component in seed clusters but only for the system components in shoot clusters. Any customer workload remains unchanged.
Kubernetes Service Host Injection
By default, when Pods are created, Kubernetes implicitly injects the KUBERNETES_SERVICE_HOST environment variable into all containers.
The value of this variable points to the default Kubernetes service (i.e., kubernetes.default.svc.cluster.local).
This allows pods to conveniently talk to the API server of their cluster.
In shoot clusters, this network path involves the apiserver-proxy DaemonSet which eventually forwards the traffic to the API server.
Hence, it results in an additional network hop.
The purpose of this webhook is to explicitly inject the KUBERNETES_SERVICE_HOST environment variable into all containers and set its value to the FQDN of the API server.
This way, the additional network hop is avoided.
Auto-Mounting Projected ServiceAccount Tokens
When this webhook is activated, it automatically injects projected ServiceAccount token volumes into Pods and all their containers if all of the following preconditions are fulfilled:
The Pod is NOT labeled with projected-token-mount.resources.gardener.cloud/skip=true.
The Pod’s .spec.serviceAccountName field is NOT empty and NOT set to default.
The ServiceAccount specified in the Pod’s .spec.serviceAccountName sets .automountServiceAccountToken=false.
The Pod’s .spec.volumes[] DO NOT already contain a volume with a name prefixed with kube-api-access-.
The expirationSeconds are defaulted to 12h and can be overwritten with the .webhooks.projectedTokenMount.expirationSeconds field in the component configuration, or with the projected-token-mount.resources.gardener.cloud/expiration-seconds annotation on a Pod resource.
The volume will be mounted into all containers specified in the Pod to the path /var/run/secrets/kubernetes.io/serviceaccount.
This is the default location where client libraries expect to find the tokens and mimics the upstream ServiceAccount admission plugin. See Managing Service Accounts for more information.
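A sketch of the injected volume, following the upstream projected ServiceAccount token layout (the volume name is an assumption; per the precondition above, it carries the kube-api-access- prefix):
volumes:
- name: kube-api-access-gardener
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 43200 # defaulted to 12h, see above
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace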
Overall, this webhook is used to inject projected service account tokens into pods running in the Shoot and the Seed cluster.
Hence, it is served from the Seed GRM and each Shoot GRM.
Please find an overview below for pods deployed in the Shoot cluster:
Pod Topology Spread Constraints
When this webhook is enabled, then it mimics the topologyKey feature for Topology Spread Constraints (TSC) on the label pod-template-hash.
Concretely, when a pod is labelled with pod-template-hash, the handler of this webhook extends any topology spread constraint in the pod:
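A sketch of the mutation, assuming a pod labeled with pod-template-hash=123abc (the hash value is illustrative):
metadata:
  labels:
    pod-template-hash: 123abc
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        pod-template-hash: 123abc # added by the webhook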
The procedure circumvents a known limitation with TSCs which leads to imbalanced deployments after rolling updates.
Gardener enables this webhook to schedule pods of deployments across nodes and zones.
Please note that the gardener-resource-manager itself as well as pods labelled with topology-spread-constraints.resources.gardener.cloud/skip are excluded from any mutations.
System Components Webhook
If enabled, this webhook handles scheduling concerns for system component Pods (except those managed by DaemonSets).
The following tasks are performed by this webhook:
Add pod.spec.nodeSelector as given in the webhook configuration.
Add pod.spec.tolerations as given in the webhook configuration.
Add pod.spec.tolerations for any existing nodes matching the node selector given in the webhook configuration. Known taints and tolerations used for taint based evictions are disregarded.
Gardener enables this webhook for kube-system and kubernetes-dashboard namespaces in shoot clusters, selecting Pods being labelled with resources.gardener.cloud/managed-by: gardener.
It adds a configuration, so that Pods will get the worker.gardener.cloud/system-components: true node selector (step 1) as well as tolerate any custom taint (step 2) that is added to system component worker nodes (shoot.spec.provider.workers[].systemComponents.allow: true).
In addition, the webhook merges these tolerations with the ones required for the system component Nodes available in the cluster at that time (step 3).
Both are required to ensure system component Pods can be scheduled or executed during an active shoot reconciliation that is happening due to any modifications to shoot.spec.provider.workers[].taints, e.g., Pods must be scheduled while there are still Nodes not having the updated taint configuration.
You can opt-out of this behaviour for Pods by labeling them with system-components-config.resources.gardener.cloud/skip=true.
EndpointSlice Hints
This webhook mutates EndpointSlices. For each endpoint in the EndpointSlice, it sets the endpoint’s hints to the endpoint’s zone.
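A sketch of a mutated EndpointSlice (addresses and zone names are illustrative):
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: example-abc
addressType: IPv4
endpoints:
- addresses:
  - 10.1.2.3
  zone: europe-1a
  hints:
    forZones:
    - name: europe-1a # set by the webhook to the endpoint's zone
ports:
- port: 8080
  protocol: TCP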
The webhook aims to circumvent issues with the Kubernetes TopologyAwareHints feature that currently does not allow achieving deterministic topology-aware traffic routing. For more details, see kubernetes/kubernetes#113731, which describes the drawbacks of the TopologyAwareHints feature for our use case.
If the above-mentioned issue gets resolved and there is a native support for deterministic topology-aware traffic routing in Kubernetes, then this webhook can be dropped in favor of the native Kubernetes feature.
Validating Webhooks
Unconfirmed Deletion Prevention For Custom Resources And Definitions
As part of Gardener’s extensibility concepts, a lot of CustomResourceDefinitions are deployed to the seed clusters that serve as extension points for provider-specific controllers.
For example, the Infrastructure CRD triggers the provider extension to prepare the IaaS infrastructure of the underlying cloud provider for a to-be-created shoot cluster.
Consequently, these extension CRDs have a lot of power and control large portions of the end-user’s shoot cluster.
Accidental or undesired deletions of those resources can cause tremendous and hard-to-recover-from outages and should be prevented.
When this webhook is activated, it reacts for CustomResourceDefinitions and most of the custom resources in the extensions.gardener.cloud/v1alpha1 API group.
It also reacts for the druid.gardener.cloud/v1alpha1.Etcd resources.
The webhook prevents DELETE requests for those CustomResourceDefinitions labeled with gardener.cloud/deletion-protected=true, and for all mentioned custom resources if they were not previously annotated with the confirmation.gardener.cloud/deletion=true.
This prevents undesired kubectl delete <...> requests from being accepted.
Extension Resource Validation
When this webhook is activated, it reacts for most of the custom resources in the extensions.gardener.cloud/v1alpha1 API group.
It also reacts for the druid.gardener.cloud/v1alpha1.Etcd resources.
The webhook validates the resource specifications for CREATE and UPDATE requests.
Authorization Webhooks
node-agent-authorizer webhook
gardener-resource-manager serves an authorization webhook for shoot kube-apiservers which authorizes requests made by the gardener-node-agent.
It works similarly to the SeedAuthorizer. However, the logic used to make decisions is much simpler, so it does not implement a decision graph.
In many cases, the objects gardener-node-agent is allowed to access depend on the Node it is running on.
The username of the gardener-node-agent used for authorization requests is derived from the name of the Machine resource responsible for the node that the gardener-node-agent is running on. It follows the pattern gardener.cloud:node-agent:machine:<machine-name>.
The name of the Node which runs on a Machine is read from the node label of the Machine.
All gardener-node-agent users are assigned to the gardener.cloud:node-agents group.
Today, the following rules are implemented:
| Resource | Verbs | Description |
| --- | --- | --- |
| CertificateSigningRequests | get, create | Allow create requests for all CertificateSigningRequests. Allow get requests for CertificateSigningRequests created by the same user. |
| Events | create, patch | Allow to create and patch all Events. |
| Leases | get, list, watch, create, update | Allow get, list, watch, create, update requests for Leases with the name gardener-node-agent-<node-name> in the kube-system namespace. |
| Nodes | get, list, watch, patch, update | Allow get, watch, patch, update requests for the Node where gardener-node-agent is running. Allow list requests for all nodes. |
| Secrets | get, list, watch | Allow get, list, watch requests for the gardener-valitail secret and the gardener-node-agent-secret of the worker group of the Node where gardener-node-agent is running. |
4.13 - Gardener Scheduler
Understand the configuration and flow of the controller that assigns a seed cluster to newly created shoots
Overview
The Gardener Scheduler is in essence a controller that watches newly created shoots and assigns a seed cluster to them.
Conceptually, the task of the Gardener Scheduler is very similar to the task of the Kubernetes Scheduler: finding a seed for a shoot instead of a node for a pod.
Either the scheduling strategy or the shoot cluster purpose determines how the scheduler operates.
The following sections explain the configuration and flow in greater detail.
Why Is the Gardener Scheduler Needed?
1. Decoupling
Previously, an admission plugin in the Gardener API server conducted the scheduling decisions.
This implies changes to the API server whenever adjustments of the scheduling are needed.
Decoupling the API server and the scheduler comes with greater flexibility to develop these components independently.
2. Extensibility
It should be possible to easily extend and tweak the scheduler in the future.
Possibly, similar to the Kubernetes scheduler, hooks could be provided which influence the scheduling decisions.
It should be also possible to completely replace the standard Gardener Scheduler with a custom implementation.
Algorithm Overview
The following sequence describes the steps involved to determine a seed candidate:
Determine usable seeds with “usable” defined as follows:
no .metadata.deletionTimestamp
.spec.settings.scheduling.visible is true
.status.lastOperation is not nil
conditions GardenletReady, BackupBucketsReady (if available) are true
Filter seeds:
matching .spec.seedSelector in CloudProfile used by the Shoot
matching .spec.seedSelector in Shoot
having no network intersection with the Shoot’s networks (due to the VPN connectivity between seeds and shoots their networks must be disjoint)
whose taints (.spec.taints) are tolerated by the Shoot (.spec.tolerations)
whose access restrictions (.spec.accessRestrictions) are supporting those configured in the Shoot (.spec.accessRestrictions)
which have at least three zones in .spec.provider.zones if the shoot requests a highly available control plane with failure tolerance type zone
Apply the active strategy, e.g., the Minimal Distance strategy
Choose the least utilized seed, i.e., the one with the lowest number of shoot control planes; it becomes the winner and is written to the .spec.seedName field of the Shoot.
In order to put the scheduling decision into effect, the scheduler sends an update request for the Shoot resource to
the API server. After validation, the gardener-apiserver updates the Shoot to have the spec.seedName field set.
Subsequently, the gardenlet picks up and starts to create the cluster on the specified seed.
Configuration
The Gardener Scheduler configuration has to be supplied on startup via a flag pointing to the configuration file; this flag is mandatory and also the only available one.
This YAML file holds an example scheduler configuration.
Most of the configuration options are the same as in the Gardener Controller Manager (leader election, client connection, …).
However, the Gardener Scheduler does not need a TLS configuration, because there are currently no webhooks configurable.
Strategies
The scheduling strategy is defined in the candidateDeterminationStrategy of the scheduler’s configuration and can have the possible values SameRegion and MinimalDistance.
The SameRegion strategy is the default strategy.
Same Region strategy
The Gardener Scheduler reads the spec.provider.type and .spec.region fields from the Shoot resource.
It tries to find a seed that has the identical .spec.provider.type and .spec.provider.region fields set.
If it cannot find a suitable seed, it adds an event to the shoot stating that it is unschedulable.
Minimal Distance strategy
The Gardener Scheduler tries to find a valid seed with minimal distance to the shoot’s intended region.
Distances are configured via ConfigMap(s), usually per cloud provider in a Gardener landscape.
The configuration is structured like this:
It refers to one or multiple CloudProfiles via annotation scheduling.gardener.cloud/cloudprofiles.
It contains the declaration as region-config via label scheduling.gardener.cloud/purpose.
If a CloudProfile is referred by multiple ConfigMaps, only the first one is considered.
The data fields configure actual distances, where key relates to the Shoot region and value contains distances to Seed regions.
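A sketch of such a ConfigMap, following the structure described above (the CloudProfile reference, region names, and distance values are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: region-distances
  annotations:
    scheduling.gardener.cloud/cloudprofiles: cloudprofile-name-1
  labels:
    scheduling.gardener.cloud/purpose: region-config
data:
  region-1: |
    region-2: 10
    region-3: 20
  region-2: |
    region-1: 10
    region-3: 10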
Gardener provider extensions for public cloud providers usually have an example weight ConfigMap in their repositories.
We suggest to check them out before defining your own data.
If a valid seed candidate cannot be found after consulting the distance configuration, the scheduler will fall back to
the Levenshtein distance to find the closest region. Therefore, the region name
is split into a base name and an orientation. Possible orientations are north, south, east, west and central.
The distance then is twice the Levenshtein distance of the region’s base name plus a correction value based on the
orientation and the provider.
If the orientations of shoot and seed candidate match, the correction value is 0, if they differ it is 2 and if
either the seed’s or the shoot’s region does not have an orientation it is 1.
If the provider differs, the correction value is additionally incremented by 2.
Because of this, a matching region with a matching provider is always preferred.
Special handling based on shoot cluster purpose
Every shoot cluster can have a purpose that describes what the cluster is used for, and also influences how the cluster is setup (see Shoot Cluster Purpose for more information).
In case the shoot has the testing purpose, then the scheduler only reads the .spec.provider.type from the Shoot resource and tries to find a Seed that has the identical .spec.provider.type.
The region does not matter, i.e., testing shoots may also be scheduled on a seed in a completely different region if that is better for balancing the whole Gardener system.
shoots/binding Subresource
The shoots/binding subresource is used to bind a Shoot to a Seed. On creation of a shoot cluster, the scheduler updates the binding automatically if an appropriate seed cluster is available.
Only an operator with the necessary RBAC can update this binding manually. This can be done by changing the .spec.seedName of the shoot. However, if a different seed is already assigned to the shoot, this will trigger a control-plane migration. For required steps, please see Triggering the Migration.
spec.schedulerName Field in the Shoot Specification
Similar to the spec.schedulerName field in Pods, the Shoot specification has an optional .spec.schedulerName field. If this field is set on creation, only the scheduler which relates to the configured name is responsible for scheduling the shoot.
The default-scheduler name is reserved for the default scheduler of Gardener.
Affected Shoots will remain in Pending state if the mentioned scheduler is not present in the landscape.
spec.seedName Field in the Shoot Specification
Similar to the .spec.nodeName field in Pods, the Shoot specification has an optional .spec.seedName field. If this field is set on creation, the shoot will be scheduled to this seed. However, this field can only be set by users having RBAC for the shoots/binding subresource. If this field is not set, the scheduler will assign a suitable seed automatically and populate this field with the seed name.
seedSelector Field in the Shoot Specification
Similar to the .spec.nodeSelector field in Pods, the Shoot specification has an optional .spec.seedSelector field.
It allows the user to provide a label selector that must match the labels of the Seeds in order to be scheduled to one of them.
The labels on the Seeds are usually controlled by Gardener administrators/operators - end users cannot add arbitrary labels themselves.
If provided, the Gardener Scheduler will only consider as “suitable” those seeds whose labels match those provided in the .spec.seedSelector of the Shoot.
By default, only seeds with the same provider as the shoot are selected. By adding a providerTypes field to the seedSelector,
a dedicated set of possible providers (* means all provider types) can be selected.
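A sketch of a seedSelector using the providerTypes field (the label is illustrative):
spec:
  seedSelector:
    matchLabels:
      environment: production
    providerTypes:
    - aws
    - gcp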
Ensuring a Seed’s Capacity for Shoots Is Not Exceeded
Seeds have a practical limit of how many shoots they can accommodate. Exceeding this limit is undesirable, as the system performance will be noticeably impacted. Therefore, the scheduler ensures that a seed’s capacity for shoots is not exceeded by taking into account a maximum number of shoots that can be scheduled onto a seed.
This mechanism works as follows:
The gardenlet is configured with certain resources and their total capacity (and, for certain resources, the amount reserved for Gardener), see /example/20-componentconfig-gardenlet.yaml. Currently, the only such resource is the maximum number of shoots that can be scheduled onto a seed.
The gardenlet seed controller updates the capacity and allocatable fields in the Seed status with the capacity of each resource and how much of it is actually available to be consumed by shoots. The allocatable value of a resource is equal to capacity minus reserved.
When scheduling shoots, the scheduler filters out all candidate seeds whose allocatable capacity for shoots would be exceeded if the shoot is scheduled onto the seed.
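A sketch of the relevant gardenlet configuration (values are illustrative):
resources:
  capacity:
    shoots: 250
  reserved:
    shoots: 10
With these values, the allocatable capacity reported in the Seed status would be 240 shoots.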
Failure to Determine a Suitable Seed
In case the scheduler fails to find a suitable seed, the operation is retried with exponential backoff.
The reason for the failure will be reported in the Shoot’s .status.lastOperation field as well as a Kubernetes event (which can be retrieved via kubectl -n <namespace> describe shoot <shoot-name>).
Current Limitation / Future Plans
Azure unfortunately has a geographically non-hierarchical naming pattern and does not start with the continent. This is the reason why we will exchange the implementation of the MinimalDistance strategy with a more suitable one in the future.
4.14 - gardenlet
Understand how the gardenlet, the primary “agent” on every seed cluster, works and learn more about the different Gardener components
Overview
Gardener is implemented using the operator pattern:
It uses custom controllers that act on our own custom resources,
and apply Kubernetes principles to manage clusters instead of containers.
Following this analogy, you can recognize components of the Gardener architecture
as well-known Kubernetes components, for example, shoot clusters can be compared with pods,
and seed clusters can be seen as worker nodes.
The following Gardener components play a similar role as the corresponding components
in the Kubernetes architecture:
| Gardener Component | Kubernetes Component |
| --- | --- |
| gardener-apiserver | kube-apiserver |
| gardener-controller-manager | kube-controller-manager |
| gardener-scheduler | kube-scheduler |
| gardenlet | kubelet |
Similar to how the kube-scheduler of Kubernetes finds an appropriate node
for newly created pods, the gardener-scheduler of Gardener finds an appropriate seed cluster
to host the control plane for newly ordered clusters.
By providing multiple seed clusters for a region or provider, and distributing the workload,
Gardener also reduces the blast radius of potential issues.
Kubernetes runs a primary “agent” on every node, the kubelet,
which is responsible for managing pods and containers on its particular node.
Decentralizing the responsibility to the kubelet has the advantage that the overall system
is scalable. Gardener achieves the same for cluster management by using a gardenlet
as a primary “agent” on every seed cluster, which is only responsible for shoot clusters
located in its particular seed cluster:
The gardener-controller-manager has controllers to manage resources of the Gardener API. However, instead of letting the gardener-controller-manager talk directly to seed clusters or shoot clusters, the responsibility isn’t only delegated to the gardenlet, but also managed using a reversed control flow: It’s up to the gardenlet to contact the Gardener API server, for example, to share a status for its managed seed clusters.
Reversing the control flow allows placing seed clusters or shoot clusters behind firewalls without the necessity of direct access via VPN tunnels anymore.
TLS Bootstrapping
Kubernetes doesn’t manage worker nodes itself, and it’s also not
responsible for the lifecycle of the kubelet running on the workers.
Similarly, Gardener doesn’t manage seed clusters itself,
so it is also not responsible for the lifecycle of the gardenlet running on the seeds.
As a consequence, both the gardenlet and the kubelet need to prepare
a trusted connection to the Gardener API server
and the Kubernetes API server correspondingly.
To prepare a trusted connection between the gardenlet
and the Gardener API server, the gardenlet initializes
a bootstrapping process after you have deployed it into your seed clusters:
The gardenlet starts up with a bootstrap kubeconfig
having a bootstrap token that allows it to create CertificateSigningRequest (CSR) resources.
After the CSR is signed, the gardenlet downloads
the created client certificate, creates a new kubeconfig with it,
and stores it inside a Secret in the seed cluster.
The gardenlet deletes the bootstrap kubeconfig secret,
and starts up with its new kubeconfig.
The gardenlet starts normal operation.
The gardener-controller-manager runs a control loop
that automatically signs CSRs created by gardenlets.
The gardenlet bootstrapping process is based on the
kubelet bootstrapping process. More information:
Kubelet’s TLS bootstrapping.
If you don’t want to run this bootstrap process, you can create
a kubeconfig pointing to the garden cluster for the gardenlet yourself,
and use the field gardenClientConnection.kubeconfig in the
gardenlet configuration to share it with the gardenlet.
gardenlet Certificate Rotation
The certificate used to authenticate the gardenlet against the API server
has a certain validity based on the configuration of the garden cluster
(--cluster-signing-duration flag of the kube-controller-manager (default 1y)).
You can also configure the validity for the client certificate by specifying .gardenClientConnection.kubeconfigValidity.validity in the gardenlet’s component configuration.
Note that changing this value will only take effect when the kubeconfig is rotated again (it is not picked up immediately).
The minimum validity is 10m (that’s what is enforced by the CertificateSigningRequest API in Kubernetes which is used by the gardenlet).
By default, after about 70-90% of the validity has expired, the gardenlet tries to automatically replace
the current certificate with a new one (certificate rotation).
You can change these boundaries by specifying .gardenClientConnection.kubeconfigValidity.autoRotationJitterPercentage{Min,Max} in the gardenlet’s component configuration.
To use certificate rotation, you need to specify the secret to store
the kubeconfig with the rotated certificate in the field
.gardenClientConnection.kubeconfigSecret of the
gardenlet component configuration.
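A sketch of the relevant fields in the gardenlet component configuration (the secret name/namespace and all values are illustrative):
gardenClientConnection:
  kubeconfigValidity:
    validity: 24h
    autoRotationJitterPercentageMin: 70
    autoRotationJitterPercentageMax: 90
  kubeconfigSecret:
    name: gardenlet-kubeconfig
    namespace: garden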
Rotate Certificates Using Bootstrap kubeconfig
If the gardenlet created the certificate during the initial TLS Bootstrapping
using the Bootstrap kubeconfig, certificates can be rotated automatically.
The same control loop in the gardener-controller-manager that signs
the CSRs during the initial TLS Bootstrapping also automatically signs
the CSR during a certificate rotation.
ℹ️ You can trigger an immediate renewal by annotating the Secret in the seed
cluster stated in the .gardenClientConnection.kubeconfigSecret field with
gardener.cloud/operation=renew. Within 10s, gardenlet detects this and terminates
itself to request new credentials. After it has booted up again, gardenlet will issue a
new certificate independent of the remaining validity of the existing one.
ℹ️ Alternatively, annotate the respective Seed with gardener.cloud/operation=renew-kubeconfig.
This will make gardenlet annotate its own kubeconfig secret with gardener.cloud/operation=renew
and trigger the process described in the previous paragraph.
Rotate Certificates Using Custom kubeconfig
When trying to rotate a custom certificate that wasn’t created by gardenlet
as part of the TLS Bootstrap, the x509 certificate’s Subject field
needs to conform to the following:
the Common Name (CN) is prefixed with gardener.cloud:system:seed:
the Organization (O) equals gardener.cloud:system:seeds
Otherwise, the gardener-controller-manager doesn’t automatically
sign the CSR.
In this case, an external component or user needs to approve the CSR manually,
for example, using the command kubectl certificate approve seed-csr-<...>.
If that doesn’t happen within 15 minutes,
the gardenlet repeats the process and creates another CSR.
Configuring the Seed to Work with gardenlet
The gardenlet works with a single seed, which must be configured in the
GardenletConfiguration under .seedConfig. This must be a copy of the
Seed resource, for example:
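A minimal sketch of such a configuration (field values are illustrative):
apiVersion: gardenlet.config.gardener.cloud/v1alpha1
kind: GardenletConfiguration
seedConfig:
  metadata:
    name: my-seed
    labels:
      environment: production
  spec:
    provider:
      type: aws
      region: eu-central-1
    settings:
      scheduling:
        visible: true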
Similar to how Kubernetes uses Lease objects for node heart beats
(see KEP),
the gardenlet is using Lease objects for heart beats of the seed cluster.
Every two seconds, the gardenlet checks that the seed cluster’s /healthz
endpoint returns HTTP status code 200.
If that is the case, the gardenlet renews the lease in the Garden cluster in the gardener-system-seed-lease namespace and updates
the GardenletReady condition in the status.conditions field of the Seed resource. For more information, see this section.
Similar to the node-lifecycle-controller inside the kube-controller-manager,
the gardener-controller-manager features a seed-lifecycle-controller that sets
the GardenletReady condition to Unknown in case the gardenlet fails to renew the lease.
As a consequence, the gardener-scheduler doesn’t consider this seed cluster for newly created shoot clusters anymore.
/healthz Endpoint
The gardenlet includes an HTTP server that serves a /healthz endpoint.
It’s used as a liveness probe in the Deployment of the gardenlet.
If the gardenlet fails to renew its lease,
then the endpoint returns 500 Internal Server Error, otherwise it returns 200 OK.
Please note that the /healthz only indicates whether the gardenlet
could successfully probe the Seed’s API server and renew the lease with
the Garden cluster.
It does not show that the Gardener extension API server (with the Gardener resource groups)
is available.
However, the gardenlet is designed to withstand such connection outages and
retries until the connection is reestablished.
Controllers
The gardenlet consists of several controllers, which are now described in more detail.
The BackupBucket controller reconciles those core.gardener.cloud/v1beta1.BackupBucket resources whose .spec.seedName value is equal to the name of the Seed the respective gardenlet is responsible for.
A core.gardener.cloud/v1beta1.BackupBucket resource is created by the Seed controller if .spec.backup is defined in the Seed.
The controller adds finalizers to the BackupBucket and the secret mentioned in the .spec.secretRef of the BackupBucket. The controller also copies this secret to the seed cluster. Additionally, it creates an extensions.gardener.cloud/v1alpha1.BackupBucket resource (non-namespaced) in the seed cluster and waits until the responsible extension controller reconciles it (see Contract: BackupBucket Resource for more details).
The status from the reconciliation is reported in the .status.lastOperation field. Once the extension resource is ready and the .status.generatedSecretRef is set by the extension controller, the gardenlet copies the referenced secret to the garden namespace in the garden cluster. An owner reference to the core.gardener.cloud/v1beta1.BackupBucket is added to this secret.
If the core.gardener.cloud/v1beta1.BackupBucket is deleted, the controller deletes the generated secret in the garden cluster and the extensions.gardener.cloud/v1alpha1.BackupBucket resource in the seed cluster and it waits for the respective extension controller to remove its finalizers from the extensions.gardener.cloud/v1alpha1.BackupBucket. Then it deletes the secret in the seed cluster and finally removes the finalizers from the core.gardener.cloud/v1beta1.BackupBucket and the referred secret.
The BackupEntry controller reconciles those core.gardener.cloud/v1beta1.BackupEntry resources whose .spec.seedName value is equal to the name of a Seed the respective gardenlet is responsible for.
Those resources are created by the Shoot controller (only if backup is enabled for the respective Seed) and there is exactly one BackupEntry per Shoot.
The controller creates an extensions.gardener.cloud/v1alpha1.BackupEntry resource (non-namespaced) in the seed cluster and waits until the responsible extension controller has reconciled it (see Contract: BackupEntry Resource for more details).
The status is populated in the .status.lastOperation field.
The core.gardener.cloud/v1beta1.BackupEntry resource has an owner reference pointing to the corresponding Shoot.
Hence, if the Shoot is deleted, the BackupEntry resource also gets deleted.
In this case, the controller deletes the extensions.gardener.cloud/v1alpha1.BackupEntry resource in the seed cluster and waits until the responsible extension controller has deleted it.
Afterwards, the finalizer of the core.gardener.cloud/v1beta1.BackupEntry resource is released so that it finally disappears from the system.
If the .spec.seedName and .status.seedName of the core.gardener.cloud/v1beta1.BackupEntry are different, the controller will migrate it by annotating the extensions.gardener.cloud/v1alpha1.BackupEntry in the source seed with gardener.cloud/operation: migrate, waiting for it to be migrated successfully, and eventually deleting it from the source seed cluster. Afterwards, the controller will recreate the extensions.gardener.cloud/v1alpha1.BackupEntry in the destination seed, annotate it with gardener.cloud/operation: restore, and wait for the restore operation to finish. For more details about control plane migration, please read Shoot Control Plane Migration.
Keep Backup for Deleted Shoots
In some scenarios it might be beneficial to not immediately delete the BackupEntrys (and with them, the etcd backup) for deleted Shoots.
In this case you can configure the .controllers.backupEntry.deletionGracePeriodHours field in the component configuration of the gardenlet.
For example, if you set it to 48, then the BackupEntrys for deleted Shoots will only be deleted 48 hours after the Shoot was deleted.
Additionally, you can limit the shoot purposes for which this applies by setting .controllers.backupEntry.deletionGracePeriodShootPurposes[].
For example, if you set it to [production] then only the BackupEntrys for Shoots with .spec.purpose=production will be deleted after the configured grace period. All others will be deleted immediately after the Shoot deletion.
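A sketch of this configuration in the gardenlet component configuration, matching the examples above:
controllers:
  backupEntry:
    deletionGracePeriodHours: 48
    deletionGracePeriodShootPurposes:
    - production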
In case a BackupEntry is scheduled for future deletion but you want to delete it immediately, add the annotation backupentry.core.gardener.cloud/force-deletion=true.
The Bastion controller reconciles those operations.gardener.cloud/v1alpha1.Bastion resources whose .spec.seedName value is equal to the name of a Seed the respective gardenlet is responsible for.
The controller creates an extensions.gardener.cloud/v1alpha1.Bastion resource in the seed cluster in the shoot namespace with the same name as operations.gardener.cloud/v1alpha1.Bastion. Then it waits until the responsible extension controller has reconciled it (see Contract: Bastion Resource for more details). The status is populated in the .status.conditions and .status.ingress fields.
During the deletion of operations.gardener.cloud/v1alpha1.Bastion resources, the controller first sets the Ready condition to False and then deletes the extensions.gardener.cloud/v1alpha1.Bastion resource in the seed cluster.
Once this resource is gone, the finalizer of the operations.gardener.cloud/v1alpha1.Bastion resource is released, so it finally disappears from the system.
This reconciler is responsible for ControllerInstallations referencing a ControllerDeployment whose type=helm.
For each ControllerInstallation, it creates a namespace on the seed cluster named extension-<controller-installation-name>.
Then, it creates a generic garden kubeconfig and garden access secret for the extension for accessing the garden cluster.
After that, it unpacks the Helm chart tarball in the ControllerDeployment's .providerConfig.chart field and deploys the rendered resources to the seed cluster.
The Helm chart values in .providerConfig.values will be used and extended with some information about the Gardener environment and the seed cluster:
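The injected values look roughly like this (a sketch; the exact set of fields may vary by Gardener version, and the placeholders are illustrative):
gardener:
  version: <gardener-version>
  garden:
    clusterIdentity: <identity-of-garden-cluster>
  seed:
    name: <seed-name>
    clusterIdentity: <identity-of-seed-cluster>
    spec: <the-complete-seed-specification>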
As of today, there are a few more fields in .gardener.seed, but it is recommended to use the .gardener.seed.spec if the Helm chart needs more information about the seed configuration.
The rendered chart will be deployed via a ManagedResource created in the garden namespace of the seed cluster.
It is labeled with controllerinstallation-name=<name> so that one can easily find the owning ControllerInstallation for an existing ManagedResource.
The reconciler maintains the Installed condition of the ControllerInstallation and sets it to False if the rendering or deployment fails.
This reconciler reconciles ControllerInstallation objects and checks whether they are in a healthy state.
It checks the .status.conditions of the backing ManagedResource created in the garden namespace of the seed cluster.
If the ResourcesApplied condition of the ManagedResource is True, then the Installed condition of the ControllerInstallation will be set to True.
If the ResourcesHealthy condition of the ManagedResource is True, then the Healthy condition of the ControllerInstallation will be set to True.
If the ResourcesProgressing condition of the ManagedResource is True, then the Progressing condition of the ControllerInstallation will be set to True.
A ControllerInstallation is considered “healthy” if Applied=Healthy=True and Progressing=False.
This reconciler watches all resources in the extensions.gardener.cloud API group in the seed cluster.
It is responsible for maintaining the Required condition on ControllerInstallations.
Concretely, when there is at least one extension resource in the seed cluster a ControllerInstallation is responsible for, then the status of the Required condition will be True.
If there are no extension resources anymore, its status will be False.
This condition is taken into account by the ControllerRegistration controller part of gardener-controller-manager when it computes which extensions have to be deployed to which seed cluster. See Gardener Controller Manager for more details.
The Gardenlet controller reconciles a Gardenlet resource with the same name as the Seed the gardenlet is responsible for.
This is used to implement self-upgrades of gardenlet based on information pulled from the garden cluster.
For a general overview, see this document.
On Gardenlet reconciliation, the controller deploys the gardenlet within its own cluster after downloading the Helm chart specified in .spec.deployment.helm.ociRepository and rendering it with the provided values/configuration.
On Gardenlet deletion, nothing happens: The gardenlet does not terminate itself - deleting a Gardenlet object effectively means that self-upgrades are stopped.
The ManagedSeed controller in the gardenlet reconciles ManagedSeeds that refer to a Shoot scheduled on a Seed the gardenlet is responsible for.
Additionally, the controller monitors Seeds, which are owned by ManagedSeeds for which the gardenlet is responsible.
On ManagedSeed reconciliation, the controller first waits for the referenced Shoot to undergo a reconciliation process.
Once the Shoot is successfully reconciled, the controller sets the ShootReconciled status of the ManagedSeed to true.
Then, it creates the garden namespace within the target shoot cluster.
The controller also manages secrets related to Seeds, such as the backup and kubeconfig secrets.
It ensures that these secrets are created and updated according to the ManagedSeed spec.
Finally, it deploys the gardenlet within the specified shoot cluster which registers the Seed cluster.
On ManagedSeed deletion, the controller first deletes the corresponding Seed that was originally created by the controller.
Subsequently, it deletes the gardenlet instance within the shoot cluster.
The controller also ensures the deletion of related Seed secrets.
Finally, the dedicated garden namespace within the shoot cluster is deleted.
The NetworkPolicy controller reconciles NetworkPolicys in all relevant namespaces in the seed cluster and provides so-called “general” policies for access to the runtime cluster’s API server, DNS, public networks, etc.
The controller resolves the IP address of the Kubernetes service in the default namespace and creates an egress NetworkPolicy for it.
This reconciler is responsible for managing the seed’s system components.
Those comprise CA certificates, the various CustomResourceDefinitions, the logging and monitoring stacks, and a few central components like gardener-resource-manager, etcd-druid, istio, etc.
The reconciler also deploys a BackupBucket resource in the garden cluster in case the Seed's .spec.backup is set.
It also checks whether the seed cluster’s Kubernetes version is at least the minimum supported version and errors in case this constraint is not met.
This reconciler maintains the .status.lastOperation field, i.e. it sets it:
to state=Progressing before it executes its reconciliation flow.
to state=Error in case an error occurs.
to state=Succeeded in case the reconciliation succeeded.
This reconciler checks whether the seed system components (deployed by the “main” reconciler) are healthy.
It checks the .status.conditions of the backing ManagedResource created in the garden namespace of the seed cluster.
A ManagedResource is considered “healthy” if the conditions ResourcesApplied=ResourcesHealthy=True and ResourcesProgressing=False.
If all ManagedResources are healthy, then the SeedSystemComponentsHealthy condition of the Seed will be set to True.
Otherwise, it will be set to False.
If at least one ManagedResource is unhealthy and there is threshold configuration for the conditions (in .controllers.seedCare.conditionThresholds), then the status of the SeedSystemComponentsHealthy condition will be set:
to Progressing if it was True before.
to Progressing if it was Progressing before and the lastUpdateTime of the condition does not exceed the configured threshold duration yet.
to False if it was Progressing before and the lastUpdateTime of the condition exceeds the configured threshold duration.
The condition thresholds can be used to prevent reporting issues too early just because there is a rollout or a short disruption.
Only if the unhealthiness persists for at least the configured threshold duration, then the issues will be reported (by setting the status to False).
In order to compute the condition statuses, this reconciler considers ManagedResources (in the garden and istio-system namespace) and their status, see this document for more information.
The following table explains which ManagedResources are considered for which condition type:
This reconciler checks whether the connection to the seed cluster’s /healthz endpoint works.
If this succeeds, then it renews a Lease resource in the garden cluster’s gardener-system-seed-lease namespace.
This indicates a heartbeat to the external world, and internally the gardenlet sets its health status to true.
In addition, the GardenletReady condition in the status of the Seed is set to True.
The whole process is similar to what the kubelet does to report heartbeats for its Node resource and its KubeletReady condition. For more information, see this section.
If the connection to the /healthz endpoint or the update of the Lease fails, then the internal health status of gardenlet is set to false.
Also, this internal health status is set to false automatically after some time, in case the controller gets stuck for whatever reason.
This internal health status is available via the gardenlet’s /healthz endpoint and is used for the livenessProbe in the gardenlet pod.
This reconciler is responsible for managing all shoot cluster components and implements the core logic for creating, updating, hibernating, deleting, and migrating shoot clusters.
It is also responsible for syncing the Cluster resource to the seed cluster before and after each successful shoot reconciliation.
The main reconciliation logic is performed in 3 different task flows dedicated to specific operation types:
reconcile (operations: create, reconcile, restore): this is the main flow responsible for creation and regular reconciliation of shoots. Hibernating a shoot also triggers this flow. It is also used for restoration of the shoot control plane on the new seed (second half of a Control Plane Migration)
migrate: this flow is triggered when spec.seedName specifies a different seed than status.seedName. It performs the first half of the Control Plane Migration, i.e., a backup (migrate operation) of all control plane components followed by a “shallow delete”.
delete: this flow is triggered when the shoot’s deletionTimestamp is set, i.e., when it is deleted.
The gardenlet takes special care to prevent unnecessary shoot reconciliations.
This is important for several reasons, e.g., to not overload the seed API servers and to not exhaust infrastructure rate limits too fast.
The gardenlet performs shoot reconciliations according to the following rules:
If status.observedGeneration is less than metadata.generation: this is the case, e.g., when the spec was changed, a manual reconciliation operation was triggered, or the shoot was deleted.
If the shoot is in a failed state, the gardenlet does not perform any reconciliation on the shoot (unless the retry operation was triggered). However, it syncs the Cluster resource to the seed in order to inform the extension controllers about the failed state.
Regular reconciliations are performed with every GardenletConfiguration.controllers.shoot.syncPeriod (defaults to 1h).
Shoot reconciliations are not performed if the assigned seed cluster is not healthy or has not been reconciled by the current gardenlet version yet (determined by the Seed.status.gardener section). This is done to make sure that shoots are reconciled with fully rolled out seed system components after a Gardener upgrade. Otherwise, the gardenlet might perform operations of the new version that doesn’t match the old version of the deployed seed system components, which might lead to unspecified behavior.
There are a few special cases that overwrite or confine how often and under which circumstances periodic shoot reconciliations are performed:
In case the gardenlet config allows it (controllers.shoot.respectSyncPeriodOverwrite, disabled by default), the sync period for a shoot can be increased individually by setting the shoot.gardener.cloud/sync-period annotation. This is always allowed for shoots in the garden namespace. Shoots are not reconciled with a higher frequency than specified in GardenletConfiguration.controllers.shoot.syncPeriod.
In case the gardenlet config allows it (controllers.shoot.respectSyncPeriodOverwrite, disabled by default), shoots can be marked as “ignored” by setting the shoot.gardener.cloud/ignore annotation. In this case, the gardenlet does not perform any reconciliation for the shoot.
In case GardenletConfiguration.controllers.shoot.reconcileInMaintenanceOnly is enabled (disabled by default), the gardenlet performs regular shoot reconciliations only once in the respective maintenance time window (GardenletConfiguration.controllers.shoot.syncPeriod is ignored). The gardenlet randomly distributes shoot reconciliations over the maintenance time window to avoid high bursts of reconciliations (see Shoot Maintenance).
In case Shoot.spec.maintenance.confineSpecUpdateRollout is enabled (disabled by default), changes to the shoot specification are not rolled out immediately but only during the respective maintenance time window (see Shoot Maintenance).
This reconciler performs three “care” actions related to Shoots.
Conditions
It maintains the following conditions:
APIServerAvailable: The /healthz endpoint of the shoot’s kube-apiserver is called and considered healthy when it responds with 200 OK.
ControlPlaneHealthy: The control plane is considered healthy when the respective Deployments (for example kube-apiserver, kube-controller-manager), and Etcds (for example etcd-main) exist and are healthy.
ObservabilityComponentsHealthy: This condition is considered healthy when the respective Deployments (for example plutono) and StatefulSets (for example prometheus, vali) exist and are healthy.
EveryNodeReady: The conditions of the worker nodes are checked (e.g., Ready, MemoryPressure). Also, it’s checked whether the Kubernetes version of the installed kubelet matches the desired version specified in the Shoot resource.
SystemComponentsHealthy: The conditions of the ManagedResources are checked (e.g., ResourcesApplied). Also, it is verified whether the VPN tunnel connection is established (which is required for the kube-apiserver to communicate with the worker nodes).
Sometimes, ManagedResources can have both Healthy and Progressing conditions set to True (e.g., when a DaemonSet rolls out one-by-one on a large cluster with many nodes) while this is not reflected in the Shoot status. In order to catch issues where the rollout gets stuck, one can set .controllers.shootCare.managedResourceProgressingThreshold in the gardenlet’s component configuration. If the Progressing condition is still True for more than the configured duration, the SystemComponentsHealthy condition in the Shoot is set to False, eventually.
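A sketch of this threshold in the gardenlet component configuration (the duration value is illustrative):
controllers:
  shootCare:
    managedResourceProgressingThreshold: 1h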
Each condition can optionally also have error codes in order to indicate which type of issue was detected (see Shoot Status for more details).
If all checks for a certain condition succeed, then its status will be set to True.
Otherwise, it will be set to False.
If at least one check fails and there is threshold configuration for the conditions (in .controllers.shootCare.conditionThresholds), then the status will be set:
to Progressing if it was True before.
to Progressing if it was Progressing before and the lastUpdateTime of the condition does not exceed the configured threshold duration yet.
to False if it was Progressing before and the lastUpdateTime of the condition exceeds the configured threshold duration.
The condition thresholds can be used to prevent reporting issues too early just because there is a rollout or a short disruption.
Only if the unhealthiness persists for at least the configured threshold duration, then the issues will be reported (by setting the status to False).
Besides directly checking the status of Deployments, Etcds, StatefulSets in the shoot namespace, this reconciler also considers ManagedResources (in the shoot namespace) and their status in order to compute the condition statuses, see this document for more information.
The following table explains which ManagedResources are considered for which condition type:
| Condition Type | ManagedResources are considered when |
| --- | --- |
| ControlPlaneHealthy | .spec.class=seed and care.gardener.cloud/condition-type label either unset, or set to ControlPlaneHealthy |
| ObservabilityComponentsHealthy | care.gardener.cloud/condition-type label set to ObservabilityComponentsHealthy |
| SystemComponentsHealthy | .spec.class unset or care.gardener.cloud/condition-type label set to SystemComponentsHealthy |
Stale pods in the shoot namespace in the seed cluster and in the kube-system namespace in the shoot cluster are deleted.
A pod is considered stale when:
it was terminated with reason Evicted.
it was terminated with reason starting with OutOf (e.g., OutOfCpu).
it was terminated with reason NodeAffinity.
it is stuck in termination (i.e., if its deletionTimestamp is more than 5m ago).
This reconciler periodically (default: every 6h) performs backups of the state of Shoot clusters and persists them into ShootState resources into the same namespace as the Shoots in the garden cluster.
It is only started in case the gardenlet is responsible for an unmanaged Seed, i.e. a Seed which is not backed by a seedmanagement.gardener.cloud/v1alpha1.ManagedSeed object.
Alternatively, it can