This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Shoot Operations

1 - Controlling the Kubernetes Versions for Specific Worker Pools

Controlling the Kubernetes Versions for Specific Worker Pools

Since Gardener v1.36, worker pools can have different Kubernetes versions specified than the control plane.

In earlier Gardener versions, all worker pools inherited the Kubernetes version of the control plane. Once the Kubernetes version of the control plane was modified, all worker pools have been updated as well (either by rolling the nodes in case of a minor version change, or in-place for patch version changes).

In order to gracefully perform Kubernetes upgrades (triggering a rolling update of the nodes) with workloads sensitive to restarts (e.g., those dealing with lots of data), it might be required to be able to gradually perform the upgrade process. In such cases, the Kubernetes version for the worker pools can be pinned (.spec.provider.workers[].kubernetes.version) while the control plane Kubernetes version (.spec.kubernetes.version) is updated. This results in the nodes being untouched while the control plane is upgraded. Now a new worker pool (with the version equal to the control plane version) can be added. Administrators can then reschedule their workloads to the new worker pool according to their upgrade requirements and processes.

Example Usage in a `Shoot`

spec:
  kubernetes:
    version: 1.27.4
  provider:
    workers:
    - name: data1
      kubernetes:
        version: 1.26.8
    - name: data2

If .kubernetes.version is not specified in a worker pool, then the Kubernetes version of the kubelet is inherited from the control plane (.spec.kubernetes.version), i.e., in the above example, the data2 pool will use 1.26.8.
If .kubernetes.version is specified in a worker pool, then it must meet the following constraints:
- It must be at most two minor versions lower than the control plane version.
- If it was not specified before, then no downgrade is possible (you cannot set it to 1.26.8 while .spec.kubernetes.version is already 1.27.4). The “two minor version skew” is only possible if the worker pool version is set to the control plane version and then the control plane was updated gradually by two minor versions.
- If the version is removed from the worker pool, only one minor version difference is allowed to the control plane (you cannot upgrade a pool from version 1.25.0 to 1.27.0 in one go).

Automatic updates of Kubernetes versions (see Shoot Maintenance) also apply to worker pool Kubernetes versions.

2 - SecretBinding to CredentialsBinding Migration

SecretBinding to CredentialsBinding Migration

With the introduction of the CredentialsBinding resource a new way of referencing credentials through the Shoot was created. While SecretBindings can only reference Secrets, CredentialsBindings can also reference WorkloadIdentitys which provide an alternative authentication method. WorkloadIdentitys do not directly contain credentials but are rather a representation of the workload that is going to access the user’s account.

As CredentialsBindings cover the functionality of SecretBindings, the latter are considered legacy and will be deprecated in the future. This incurs the need for migration from SecretBinding to CredentialsBinding resources.

Note
Mind that the migration will be allowed only if the old SecretBinding and the new CredentialsBinding refer to the same exact Secret. One cannot do a direct migration to a CredentialsBinding that reference a WorkloadIdentity. For information on how to use WorkloadIdentity, please refer to the following document.

Migration Path

A standard use of SecretBinding can look like the following example.

apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: infrastructure-credentials
  namespace: garden-proj
provider:
  type: foo-provider
secretRef:
  name: infrastructure-credentials-secret
  namespace: garden-proj
---
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: bar
  namespace: garden-proj
spec:
  secretBindingName: infrastructure-credentials
  ...

In order to migrate to CredentialsBinding one should:

Create a CredentialsBinding resource corresponding to the existing SecretBinding. The main difference is that we set the kind and apiVersion of the credentials that the CredentialsBinding is referencing.

apiVersion: security.gardener.cloud/v1alpha1
kind: CredentialsBinding
metadata:
  name: infrastructure-credentials
  namespace: garden-proj
credentialsRef:
  apiVersion: v1
  kind: Secret
  name: infrastructure-credentials-secret
  namespace: garden-proj
provider:
  type: foo-provider

Replace secretBindingName with credentialsBindingName in the Shoot spec.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: bar
  namespace: garden-proj
spec:
  credentialsBindingName: infrastructure-credentials
  ...

3 - Shoot Credentials Rotation

Credentials Rotation for Shoot Clusters

There are a lot of different credentials for Shoots to make sure that the various components can communicate with each other and to make sure it is usable and operable.

This page explains how the varieties of credentials can be rotated so that the cluster can be considered secure.

User-Provided Credentials

Cloud Provider Keys

End-users must provide credentials such that Gardener and Kubernetes controllers can communicate with the respective cloud provider APIs in order to perform infrastructure operations. For example, Gardener uses them to set up and maintain the networks, security groups, subnets, etc., while the cloud-controller-manager uses them to reconcile load balancers and routes, and the CSI controller uses them to reconcile volumes and disks.

Depending on the cloud provider, the required data keys of the Secret differ. Please consult the documentation of the respective provider extension documentation to get to know the concrete data keys (e.g., this document for AWS).

It is the responsibility of the end-user to regularly rotate those credentials. The following steps are required to perform the rotation:

Update the data in the Secret with new credentials.
⚠️ Wait until all Shoots using the Secret are reconciled before you disable the old credentials in your cloud provider account! Otherwise, the Shoots will no longer work as expected. Check out this document to learn how to trigger a reconciliation of your Shoots.
- (Optional) If you want to verify that the new credentials are valid you can trigger immediate maintenance operation for the corresponding shoot cluster(s) and wait for to subsequent reconciliation(s) to complete successfully. The maintenance operation triggers infrastructure reconciliation during the subsequent shoot reconciliation which makes use of the new cloud provider credentials.
After all Shoots using the Secret were reconciled, you can go ahead and deactivate the old credentials in your provider account.

Gardener-Provided Credentials

The below credentials are generated by Gardener when shoot clusters are being created. Those include:

certificate authorities (and related server and client certificates)
observability passwords for Plutono
SSH key pair for worker nodes
ETCD encryption key
ServiceAccount token signing key
…

🚨 There is no auto-rotation of those credentials, and it is the responsibility of the end-user to regularly rotate them.

While it is possible to rotate them one by one, there is also a convenient method to combine the rotation of all of those credentials. The rotation happens in two phases since it might be required to update some API clients (e.g., when CAs are rotated).

Prepare Rotation of All Credentials

In order to start the rotation (first phase), you have to annotate the shoot with the rotate-credentials-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-credentials-start

Tip
You can check the .status.credentials.rotation field in the Shoot to see when the rotation was last initiated and last completed.

Kindly consider the detailed descriptions below to learn how the rotation is performed and what your responsibilities are. Please note that all respective individual actions apply for this combined rotation as well (e.g., worker nodes are rolled out in the first phase).

Tip
If you don’t want the worker nodes to roll out immediately in this phase (and rather trigger it individually at a later time of your convenience), you can use the rotate-credentials-start-without-workers-rollout and rotate-rollout-workers operations instead. Read up all about it here.

Complete Rotation of All Credentials

You can complete the rotation (second phase) by annotating the shoot with the rotate-credentials-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-credentials-complete

Certificate Authorities

Gardener generates several certificate authorities (CAs) to ensure secured communication between the various components and actors. Most of those CAs are used for internal communication (e.g., kube-apiserver talks to etcd, vpn-shoot talks to the vpn-seed-server, kubelet talks to kube-apiserver). However, there is also the “cluster CA” which is part of all kubeconfigs and used to sign the server certificate exposed by the kube-apiserver.

Gardener populates a ConfigMap with the name <shoot-name>.ca-cluster in the project namespace in the garden cluster which contains the following data keys:

ca.crt: the CA bundle of the cluster

This bundle contains one or multiple CAs which are used for signing serving certificates of the Shoot’s API server. Hence, the certificates contained in this ConfigMap can be used to verify the API server’s identity when communicating with its public endpoint (e.g., as certificate-authority-data in a kubeconfig). This is the same certificate that is also contained in the kubeconfig’s certificate-authority-data field.

Shoots created with Gardener >= v1.45 have a dedicated client CA which verifies the legitimacy of client certificates. For older Shoots, the client CA is equal to the cluster CA. With the first CA rotation, such clusters will get a dedicated client CA as well.

All the certificates are valid for 10 years. Since it requires adaptation for the consumers of the Shoot, there is no automatic rotation, and it is the responsibility of the end-user to regularly rotate the CA certificates.

The rotation happens in three stages (see also GEP-18 for the full details):

In stage one, new CAs are created and added to the bundle (together with the old CAs). Client certificates are re-issued immediately.
In stage two, end-users update all cluster API clients that communicate with the control plane.
In stage three, the old CAs are dropped from the bundle and server certificate are re-issued.

Technically, the Preparing phase indicates stage one. Once it is completed, the Prepared phase indicates readiness for stage two. The Completing phase indicates stage three, and the Completed phase states that the rotation process has finished.

You can check the .status.credentials.rotation.certificateAuthorities field in the Shoot to see when the rotation was last initiated, last completed, and in which phase it currently is.

In order to start the rotation (stage one), you have to annotate the shoot with the rotate-ca-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-ca-start

This will trigger a Shoot reconciliation and performs stage one. After it is completed, the .status.credentials.rotation.certificateAuthorities.phase is set to Prepared.

Now you must update all API clients outside the cluster (such as the kubeconfigs on developer machines) to use the newly issued CA bundle in the <shoot-name>.ca-cluster ConfigMap. Please also note that client certificates must be re-issued now.

After updating all API clients, you can complete the rotation by annotating the shoot with the rotate-ca-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-ca-complete

This will trigger another Shoot reconciliation and performs stage three. After it is completed, the .status.credentials.rotation.certificateAuthorities.phase is set to Completed. You could update your API clients again and drop the old CA from their bundle.

Note that the CA rotation also rotates all internal CAs and signed certificates. Hence, most of the components need to be restarted (including etcd and kube-apiserver).
⚠️ In stage one, all worker nodes of the Shoot will be rolled out to ensure that the Pods as well as the kubelets get the updated credentials as well.

Triggering Worker Node Rollout Individually

If you don’t want that all worker nodes of the Shoot get rolled out in phase one, you can start the rotation with rotate-ca-start-without-workers-rollout instead of rotate-ca-start. This allows you to trigger the worker node rollout individually (per worker pool) whenever you are ready for it.

Using this annotation will trigger a Shoot reconciliation and performs stage one. While it’s running, .status.credentials.rotation.certificateAuthorities.phase is set to PreparingWithoutWorkersRollout. Once completed, the phase transitions to WaitingForWorkersRollout.

Now you can update all API clients outside the cluster (see above) and also trigger the rollout of your worker pools like this:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-rollout-workers=<pool1-name>[,<pool2-name>,...]

You can check which worker pools still need to be rolled by reading .status.credentials.rotation.certificateAuthorities.pendingWorkersRollouts[].name. Once this list is empty, the phase transitions to Prepared. Now you can just complete the rotation as usual (see above).

Worker Node with ManualInPlaceUpdate Update Strategy

In case of manual in-place update, shoot CA rotation phase will be at Preparing until all the worker pools are successfully in-place updated and there are no pending worker pools with strategy ManualInPlaceUpdate.

You can check which worker pools still need to be updated by reading .status.inPlaceUpdates.pendingWorkerUpdates.manualInPlaceUpdate. Once this list is empty, the phase transitions to Prepared. After this rotation will be completed as usual (see above).

Observability Password(s) For Plutono and Prometheus

For Shoots with .spec.purpose!=testing, Gardener deploys an observability stack with Prometheus for monitoring, Alertmanager for alerting (optional), Vali for logging, and Plutono for visualization. The Plutono instance is exposed via Ingress and accessible for end-users via basic authentication credentials generated and managed by Gardener.

Those credentials are stored in a Secret with the name <shoot-name>.monitoring in the project namespace in the garden cluster and has multiple data keys:

username: the username
password: the password
auth: the username with SHA-1 representation of the password

It is the responsibility of the end-user to regularly rotate those credentials. In order to rotate the password, annotate the Shoot with gardener.cloud/operation=rotate-observability-credentials. This operation is not allowed for Shoots that are already marked for deletion.

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-observability-credentials

You can check the .status.credentials.rotation.observability field in the Shoot to see when the rotation was last initiated and last completed.

SSH Key Pair for Worker Nodes

Gardener generates an SSH key pair whose public key is propagated to all worker nodes of the Shoot. The private key can be used to establish an SSH connection to the workers for troubleshooting purposes. It is recommended to use gardenctl-v2 and its gardenctl ssh command since it is required to first open up the security groups and create a bastion VM (no direct SSH access to the worker nodes is possible).

The private key is stored in a Secret with the name <shoot-name>.ssh-keypair in the project namespace in the garden cluster and has multiple data keys:

id_rsa: the private key
id_rsa.pub: the public key for SSH

In order to rotate the keys, annotate the Shoot with gardener.cloud/operation=rotate-ssh-keypair. This will propagate a new key to all worker nodes while keeping the old key active and valid as well (it will only be invalidated/removed with the next rotation).

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-ssh-keypair

You can check the .status.credentials.rotation.sshKeypair field in the Shoot to see when the rotation was last initiated or last completed.

The old key is stored in a Secret with the name <shoot-name>.ssh-keypair.old in the project namespace in the garden cluster and has the same data keys as the regular Secret.

ETCD Encryption Key

This key is used to encrypt the data of Secret resources inside etcd (see upstream Kubernetes documentation).

The encryption key has no expiration date. There is no automatic rotation, and it is the responsibility of the end-user to regularly rotate the encryption key.

The rotation happens in three stages:

In stage one, a new encryption key is created and added to the bundle (together with the old encryption key).
In stage two, all Secrets in the cluster and resources configured in the spec.kubernetes.kubeAPIServer.encryptionConfig of the Shoot (see ETCD Encryption Config) are rewritten by the kube-apiserver so that they become encrypted with the new encryption key.
In stage three, the old encryption is dropped from the bundle.

Technically, the Preparing phase indicates the stages one and two. Once it is completed, the Prepared phase indicates readiness for stage three. The Completing phase indicates stage three, and the Completed phase states that the rotation process has finished.

You can check the .status.credentials.rotation.etcdEncryptionKey field in the Shoot to see when the rotation was last initiated, last completed, and in which phase it currently is.

In order to start the rotation (stage one), you have to annotate the shoot with the rotate-etcd-encryption-key-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-etcd-encryption-key-start

This will trigger a Shoot reconciliation and performs the stages one and two. After it is completed, the .status.credentials.rotation.etcdEncryptionKey.phase is set to Prepared. Now you can complete the rotation by annotating the shoot with the rotate-etcd-encryption-key-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-etcd-encryption-key-complete

This will trigger another Shoot reconciliation and performs stage three. After it is completed, the .status.credentials.rotation.etcdEncryptionKey.phase is set to Completed.

`ServiceAccount` Token Signing Key

Gardener generates a key which is used to sign the tokens for ServiceAccounts. Those tokens are typically used by workload Pods running inside the cluster in order to authenticate themselves with the kube-apiserver. This also includes system components running in the kube-system namespace.

The token signing key has no expiration date. Since it might require adaptation for the consumers of the Shoot, there is no automatic rotation, and it is the responsibility of the end-user to regularly rotate the signing key.

The rotation happens in three stages, similar to how the CA certificates are rotated:

In stage one, a new signing key is created and added to the bundle (together with the old signing key).
In stage two, end-users update all out-of-cluster API clients that communicate with the control plane via ServiceAccount tokens.
In stage three, the old signing key is dropped from the bundle.

You can check the .status.credentials.rotation.serviceAccountKey field in the Shoot to see when the rotation was last initiated, last completed, and in which phase it currently is.

In order to start the rotation (stage one), you have to annotate the shoot with the rotate-serviceaccount-key-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-serviceaccount-key-start

This will trigger a Shoot reconciliation and performs stage one. After it is completed, the .status.credentials.rotation.serviceAccountKey.phase is set to Prepared.

Now you must update all API clients outside the cluster using a ServiceAccount token (such as the kubeconfigs on developer machines) to use a token issued by the new signing key. Gardener already generates new secrets for those ServiceAccounts in the cluster, whose static token was automatically created by Kubernetes (typically before v1.22 - ref) However, if you need to create it manually, you can check out this document for instructions.

After updating all API clients, you can complete the rotation by annotating the shoot with the rotate-serviceaccount-key-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-serviceaccount-key-complete

This will trigger another Shoot reconciliation and performs stage three. After it is completed, the .status.credentials.rotation.serviceAccountKey.phase is set to Completed.

⚠️ In stage one, all worker nodes of the Shoot will be rolled out to ensure that the Pods use a new token.

Triggering Worker Node Rollout Individually

Similar to the rotation of the certificate authorities, you can control the worker node rollout individually. Please read this section to get more information. It works the same way for the ServiceAccount token signing key (using rotate-serviceaccount-key-start-without-workers-rollout).

Worker Node with ManualInPlaceUpdate Update Strategy

Similar to the rotation of the certificate authorities, in case of manual in-place update, ServiceAccount token signing key rotation phase will be at Preparing until all the worker pools are successfully in-place updated and there are no pending worker pools with strategy ManualInPlaceUpdate. Please read this section for more information.

OpenVPN TLS Auth Keys

This key is used to ensure encrypted communication for the VPN connection between the control plane in the seed cluster and the shoot cluster. It is currently not rotated automatically and there is no way to trigger it manually.

4 - Shoot Kubernetes and Operating System Versioning in Gardener

Shoot Kubernetes and Operating System Versioning in Gardener

Motivation

On the one hand-side, Gardener is responsible for managing the Kubernetes and the Operating System (OS) versions of its Shoot clusters. On the other hand-side, Gardener needs to be configured and updated based on the availability and support of the Kubernetes and Operating System version it provides. For instance, the Kubernetes community releases minor versions roughly every three months and usually maintains three minor versions (the current and the last two) with bug fixes and security updates. Patch releases are done more frequently.

When using the term Machine image in the following, we refer to the OS version that comes with the machine image of the node/worker pool of a Gardener Shoot cluster. As such, we are not referring to the CloudProvider specific machine image like the AMI for AWS. For more information on how Gardener maps machine image versions to CloudProvider specific machine images, take a look at the individual gardener extension providers, such as the provider for AWS.

Gardener should be configured accordingly to reflect the “logical state” of a version. It should be possible to define the Kubernetes or Machine image versions that still receive bug fixes and security patches, and also vice-versa to define the version that are out-of-maintenance and are potentially vulnerable. Moreover, this allows Gardener to “understand” the current state of a version and act upon it (more information in the following sections).

Overview

As a Gardener operator:

I can classify a version based on it’s logical state (preview, supported, deprecated, and expired; see Version Classification).
I can define which Machine image and Kubernetes versions are eligible for the auto update of clusters during the maintenance time.
I can define a moment in time when Shoot clusters are forcefully migrated off a certain version (through an expirationDate).
I can define an update path for machine images for auto and force updates; see Update path for machine image versions).
I can disallow the creation of clusters having a certain version (think of severe security issues).

As an end-user/Shoot owner of Gardener:

I can get information about which Kubernetes and Machine image versions exist and their classification.
I can determine the time when my Shoot clusters Machine image and Kubernetes version will be forcefully updated to the next patch or minor version (in case the cluster is running a deprecated version with an expiration date).
I can get this information via API from the CloudProfile.

Version Classifications

Administrators can classify versions into four distinct “logical states”: preview, supported, deprecated, and expired. The version classification serves as a “point-of-reference” for end-users and also has implications during shoot creation and the maintenance time.

If a version is unclassified, Gardener cannot make those decision based on the “logical state”. Nevertheless, Gardener can operate without version classifications and can be added at any time to the Kubernetes and machine image versions in the CloudProfile.

As a best practice, versions usually start with the classification preview, then are promoted to supported, eventually deprecated and finally expired. This information is programmatically available in the CloudProfiles of the Garden cluster.

preview: A preview version is a new version that has not yet undergone thorough testing, possibly a new release, and needs time to be validated. Due to its short early age, there is a higher probability of undiscovered issues and is therefore not yet recommended for production usage. A Shoot does not update (neither auto-update or force-update) to a preview version during the maintenance time. Also, preview versions are not considered for the defaulting to the highest available version when deliberately omitting the patch version during Shoot creation. Typically, after a fresh release of a new Kubernetes (e.g., v1.25.0) or Machine image version (e.g., suse-chost 15.4.20220818), the operator tags it as preview until they have gained sufficient experience and regards this version to be reliable. After the operator has gained sufficient trust, the version can be manually promoted to supported.
supported: A supported version is the recommended version for new and existing Shoot clusters. This is the version that new Shoot clusters should use and existing clusters should update to. Typically for Kubernetes versions, the latest Kubernetes patch versions of the actual (if not still in preview) and the last 3 minor Kubernetes versions are maintained by the community. An operator could define these versions as being supported (e.g., v1.27.6, v1.26.10, and v1.25.12).
deprecated: A deprecated version is a version that approaches the end of its lifecycle and can contain issues which are probably resolved in a supported version. New Shoots should not use this version anymore. Existing Shoots will be updated to a newer version if auto-update is enabled (.spec.maintenance.autoUpdate.kubernetesVersion for Kubernetes version auto-update, or .spec.maintenance.autoUpdate.machineImageVersion for machine image version auto-update). Using automatic upgrades, however, does not guarantee that a Shoot runs a non-deprecated version, as the latest version (overall or of the minor version) can be deprecated as well. Deprecated versions should have an expiration date set for eventual expiration.
expired: An expired versions has an expiration date (based on the Golang time package) in the past. New clusters with that version cannot be created and existing clusters are forcefully migrated to a higher version during the maintenance time.

Below is an example how the relevant section of the CloudProfile might look like:

apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: alicloud
spec:
  kubernetes:
    versions:
      - classification: preview
        version: 1.27.0
      - classification: preview
        version: 1.26.3
      - classification: supported
        version: 1.26.2
      - classification: preview
        version: 1.25.5
      - classification: supported
        version: 1.25.4
      - classification: supported
        version: 1.24.6
      - classification: deprecated
        expirationDate: "2022-11-30T23:59:59Z"
        version: 1.24.5

Automatic Version Upgrades

There are two ways, the Kubernetes version of the control plane as well as the Kubernetes and machine image version of a worker pool can be upgraded: auto update and forceful update. See Automatic Version Updates for how to enable auto updates for Kubernetes or machine image versions on the Shoot cluster.

If a Shoot is running a version after its expiration date has passed, it will be forcefully updated during its maintenance time. This happens even if the owner has opted out of automatic cluster updates!

When an auto update is triggered?:

The Shoot has auto-update enabled and the version is not the latest eligible version for the auto-update. Please note that this latest version that qualifies for an auto-update is not necessarily the overall latest version in the CloudProfile:
- For Kubernetes version, the latest eligible version for auto-updates is the latest patch version of the current minor.
- For machine image version, the latest eligible version for auto-updates is controlled by the updateStrategy field of the machine image in the CloudProfile.
The Shoot has auto-update disabled and the version is either expired or does not exist.

The auto update can fail if the version is already on the latest eligible version for the auto-update. A failed auto update triggers a force update. The force and auto update path for Kubernetes and machine image versions differ slightly and are described in more detail below.

Update rules for both Kubernetes and machine image versions

Both auto and force update first try to update to the latest patch version of the same minor.
An auto update prefers supported versions over deprecated versions. If there is a lower supported version and a higher deprecated version, auto update will pick the supported version. If all qualifying versions are deprecated, update to the latest deprecated version.
An auto update never updates to an expired version.
A force update prefers to update to not-expired versions. If all qualifying versions are expired, update to the latest expired version. Please note that therefore multiple consecutive version upgrades are possible. In this case, the version is again upgraded in the next maintenance time.

Update path for machine image versions

Administrators can define three different update strategies (field updateStrategy) for machine images in the CloudProfile: patch, minor, major (default). This is to accommodate the different version schemes of Operating Systems (e.g. Gardenlinux only updates major and minor versions with occasional patches).

patch: update to the latest patch version of the current minor version. When using an expired version: force update to the latest patch of the current minor. If already on the latest patch version, then force update to the next higher (not necessarily +1) minor version.
minor: update to the latest minor and patch version. When using an expired version: force update to the latest minor and patch of the current major. If already on the latest minor and patch of the current major, then update to the next higher (not necessarily +1) major version.
major: always update to the overall latest version. This is the legacy behavior for automatic machine image version upgrades. Force updates are not possible and will fail if the latest version in the CloudProfile for that image is expired (EOL scenario).

Example configuration in the CloudProfile:

machineImages:
  - name: gardenlinux
    updateStrategy: minor
    versions:
     - version: 1096.1.0
     - version: 934.8.0
     - version: 934.7.0
  - name: suse-chost
    updateStrategy: patch
    versions:
    - version: 15.3.20220818 
    - version: 15.3.20221118

Please note that force updates for machine images can skip minor versions (strategy: patch) or major versions (strategy: minor) if the next minor/major version has no qualifying versions (only preview versions).

Update path for Kubernetes versions

For Kubernetes versions, the auto update picks the latest non-preview patch version of the current minor version.

If the cluster is already on the latest patch version and the latest patch version is also expired, it will continue with the latest patch version of the next consecutive minor (minor +1) Kubernetes version, so it will result in an update of a minor Kubernetes version!

Kubernetes “minor version jumps” are not allowed - meaning to skip the update to the consecutive minor version and directly update to any version after that. For instance, the version 1.24.x can only update to a version 1.25.x, not to 1.26.x or any other version. This is because Kubernetes does not guarantee upgradability in this case, leading to possibly broken Shoot clusters. The administrator has to set up the CloudProfile in such a way that consecutive Kubernetes minor versions are available. Otherwise, Shoot clusters will fail to upgrade during the maintenance time.

Consider the CloudProfile below with a Shoot using the Kubernetes version 1.24.12. Even though the version is expired, due to missing 1.25.x versions, the Gardener Controller Manager cannot upgrade the Shoot’s Kubernetes version.

spec:
  kubernetes:
    versions:
    - version: 1.26.10
    - version: 1.26.9
    - version: 1.24.12
      expirationDate: "<expiration date in the past>"

The CloudProfile must specify versions 1.25.x of the consecutive minor version. Configuring the CloudProfile in such a way, the Shoot’s Kubernetes version will be upgraded to version 1.25.10 in the next maintenance time.

spec:
  kubernetes:
    versions:
    - version: 1.26.9
    - version: 1.25.10
    - version: 1.25.9
    - version: 1.24.12
      expirationDate: "<expiration date in the past>"

Version Requirements (Kubernetes and Machine Image)

The Gardener API server enforces the following requirements for versions:

A version that is in use by a Shoot cannot be deleted from the CloudProfile.
Creating a new version with expiration date in the past is not allowed.
There can be only one supported version per minor version.
The latest Kubernetes version cannot have an expiration date.
- NOTE: The latest version for a machine image can have an expiration date. [*]

_{[*] Useful for cases in which support for a given machine image needs to be deprecated and removed (for example, the machine image reaches end of life).}

You might want to read about the Shoot Updates and Upgrades procedures to get to know the effects of such operations.

5 - Shoot Updates and Upgrades

Shoot Updates and Upgrades

This document describes what happens during shoot updates (changes incorporated in a newly deployed Gardener version) and during shoot upgrades (changes for version controllable by end-users).

Updates

Updates to all aspects of the shoot cluster happen when the gardenlet reconciles the Shoot resource.

When are Reconciliations Triggered

Generally, when you change the specification of your Shoot the reconciliation will start immediately, potentially updating your cluster. Please note that you can also confine the reconciliation triggered due to your specification updates to the cluster’s maintenance time window. Please find more information in Confine Specification Changes/Updates Roll Out.

You can also annotate your shoot with special operation annotations (for more information, see Trigger Shoot Operations), which will cause the reconciliation to start due to your actions.

There is also an automatic reconciliation by Gardener. The period, i.e., how often it is performed, depends on the configuration of the Gardener administrators/operators. In some Gardener installations the operators might enable “reconciliation in maintenance time window only” (for more information, see Cluster Reconciliation), which will result in at least one reconciliation during the time configured in the Shoot’s .spec.maintenance.timeWindow field.

Which Updates are Applied

As end-users can only control the Shoot resource’s specification but not the used Gardener version, they don’t have any influence on which of the updates are rolled out (other than those settings configurable in the Shoot). A Gardener operator can deploy a new Gardener version at any point in time. Any subsequent reconciliation of Shoots will update them by rolling out the changes incorporated in this new Gardener version.

Some examples for such shoot updates are:

Add a new/remove an old component to/from the shoot’s control plane running in the seed, or to/from the shoot’s system components running on the worker nodes.
Change the configuration of an existing control plane/system component.
Restart of existing control plane/system components (this might result in a short unavailability of the Kubernetes API server, e.g., when etcd or a kube-apiserver itself is being restarted)

Behavioural Changes

Generally, some of such updates (e.g., configuration changes) could theoretically result in different behaviour of controllers. If such changes would be backwards-incompatible, then we usually follow one of those approaches (depends on the concrete change):

Only apply the change for new clusters.
Expose a new field in the Shoot resource that lets users control this changed behaviour to enable it at a convenient point in time.
Put the change behind an alpha feature gate (disabled by default) in the gardenlet (only controllable by Gardener operators), which will be promoted to beta (enabled by default) in subsequent releases (in this case, end-users have no influence on when the behaviour changes - Gardener operators should inform their end-users and provide clear timelines when they will enable the feature gate).

Upgrades

We consider shoot upgrades to change either the:

Kubernetes version (.spec.kubernetes.version)
Kubernetes version of the worker pool if specified (.spec.provider.workers[].kubernetes.version)
Machine image version of at least one worker pool (.spec.provider.workers[].machine.image.version)

Generally, an upgrade is also performed through a reconciliation of the Shoot resource, i.e., the same concepts as for shoot updates apply. If an end-user triggers an upgrade (e.g., by changing the Kubernetes version) after a new Gardener version was deployed but before the shoot was reconciled again, then this upgrade might incorporate the changes delivered with this new Gardener version.

The UpdateStrategy field in the Shoot specification (.spec.provider.workers[].updateStrategy) gives users the flexibility to define how the machine-controller-manager handles worker pool updates during the upgrade process. Currently gardener support three update strategies:

AutoRollingUpdate
AutoInPlaceUpdate
ManualInPlaceUpdate

⚠️ The above strategies generally require draining the node when changes are made to the Shoot specification. The specific changes that trigger this behavior will be discussed in later sections. For all other spec changes like Kubernetes patch version update, the upgrade is executed without draining the node and the shoot worker nodes remain unchanged. In case of Kubernetes patch version, only the kubelet process is restarted with the updated Kubernetes version binary.

Rolling Updates

The upgrade is performed in a “rolling update” manner, during which nodes in the worker pool are replaced. Similar to how pods in Kubernetes are updated when backed by a Deployment. Worker nodes are terminated one after another and replaced by new nodes. The existing workload is gracefully drained and evicted from the old worker nodes to new worker nodes, respecting the configured PodDisruptionBudgets (see Specifying a Disruption Budget for your Application).

Automatic Rolling Updates

When the auto rolling update strategy is selected, the update process is fully orchestrated by the Gardener and the machine-controller-manager. The machine-controller-manager sequentially terminates the worker nodes and replaces them with new nodes.

ℹ️ This is the default update strategy.

To create workers with AutoRollingUpdate, either omit the Shoot's .spec.provider.workers[].updateStrategy field (it will default to AutoRollingUpdate) or explicitly set the field to AutoRollingUpdate.

spec:
  provider:
    workers:
    - name: cpu-worker
      maxSurge: 0
      maxUnavailable: 2
      updateStrategy: AutoRollingUpdate

Customize Rolling Update Behaviour of Shoot Worker Nodes

The .spec.provider.workers[] list exposes two fields that you might configure based on your workload’s needs: maxSurge and maxUnavailable. The same concepts like in Kubernetes apply. Additionally, you might customize how the machine-controller-manager is behaving. You can configure the following fields in .spec.provider.worker[].machineControllerManager:

machineDrainTimeout: Timeout (in duration) used while draining of machine before deletion, beyond which machine-controller-manager forcefully deletes the machine (default: 2h).
machineHealthTimeout: Timeout (in duration) used while re-joining (in case of temporary health issues) of a machine before it is declared as failed (default: 10m).
machineCreationTimeout: Timeout (in duration) used while joining (during creation) of a machine before it is declared as failed (default: 10m).
maxEvictRetries: Maximum number of times evicts would be attempted on a pod before it is forcibly deleted during the draining of a machine (default: 10).
nodeConditions: List of case-sensitive node-conditions which will change a machine to a Failed state after the machineHealthTimeout duration. It may further be replaced with a new machine if the machine is backed by a machine-set object (defaults: KernelDeadlock, ReadonlyFilesystem , DiskPressure).

Rolling Update Triggers

A rolling update of the shoot worker nodes is triggered for some changes to your worker pool specification (.spec.provider.workers[], even if you don’t change the Kubernetes or machine image version). The complete list of fields that trigger a rolling update:

.spec.kubernetes.version (except for patch version changes)
.spec.provider.workers[].machine.image.name
.spec.provider.workers[].machine.image.version
.spec.provider.workers[].machine.type
.spec.provider.workers[].volume.type
.spec.provider.workers[].volume.size
.spec.provider.workers[].providerConfig (provider extension dependent with feature gate NewWorkerPoolHash)
.spec.provider.workers[].cri.name
.spec.provider.workers[].kubernetes.version (except for patch version changes)
.spec.systemComponents.nodeLocalDNS.enabled
.status.credentials.rotation.certificateAuthorities.lastInitiationTime (changed by Gardener when a shoot CA rotation is initiated) when worker pool is not part of .status.credentials.rotation.certificateAuthorities.pendingWorkersRollouts[]
.status.credentials.rotation.serviceAccountKey.lastInitiationTime (changed by Gardener when a shoot service account signing key rotation is initiated) when worker pool is not part of .status.credentials.rotation.serviceAccountKey.pendingWorkersRollouts[]

If feature gate NewWorkerPoolHash is enabled:

.spec.kubernetes.kubelet.kubeReserved (unless a worker pool-specific value is set)
.spec.kubernetes.kubelet.systemReserved (unless a worker pool-specific value is set)
.spec.kubernetes.kubelet.evictionHard (unless a worker pool-specific value is set)
.spec.kubernetes.kubelet.cpuManagerPolicy (unless a worker pool-specific value is set)
.spec.provider.workers[].kubernetes.kubelet.kubeReserved
.spec.provider.workers[].kubernetes.kubelet.systemReserved
.spec.provider.workers[].kubernetes.kubelet.evictionHard
.spec.provider.workers[].kubernetes.kubelet.cpuManagerPolicy

Changes to kubeReserved or systemReserved do not trigger a node roll if their sum does not change.

Generally, the provider extension controllers might have additional constraints for changes leading to rolling updates, so please consult the respective documentation as well. In particular, if the feature gate NewWorkerPoolHash is enabled and a worker pool uses the new hash, then the providerConfig as a whole is not included. Instead only fields selected by the provider extension are considered.

In-Place Updates

For scenarios where users want to retain the current nodes and avoid deletion during updates, Gardener provides the option of in-place updates. The upgrade is performed without replacing the underlying machines. Although there is an exception where new nodes are created with the updated configuration and old ones are terminated. One such exception is discussed in Automatic inplace update section.

The existing workload is gracefully drained and evicted from the worker nodes, respecting the configured PodDisruptionBudgets (see Specifying a Disruption Budget for your Application).

ℹ️ Currently, in-place updates are controlled by the InPlaceNodeUpdates feature gate in the gardener-apiserver.

For in-place updates, the first requirement is that the operating system must support them. For a specific machine image version, the configuration for in-place updates must be defined in the CloudProfile under spec.machineImages[].versions[].inPlaceUpdates:

The inPlaceUpdates.supported field must be set to true.
The inPlaceUpdates.minVersionForUpdate field specifies the minimum version from which an in-place update to the target machine image version can be performed.

machineImages:
- name: gardenlinux
  versions:
  - version: 1632.0.0
    inPlaceUpdates:
      supported: true
      minVersionForUpdate: 1630.0.0

The inPlaceUpdates field in the Shoot status provides details about in-place updates for the Shoot workers. It includes the pendingWorkerUpdates field, which lists the worker pools that are awaiting in-place updates.

Customize In-Place Update Behaviour of Shoot Worker Nodes

In addition to customisable fields mentioned in section, you can configure the following fields in .spec.provider.worker[].machineControllerManager:

MachineInPlaceUpdateTimeout: Timeout (in duration) after which an in-place update is declared as failed.
DisableHealthTimeout: A boolean value that, when set to true, ignores the health timeout. As a result, machines are never marked as failed, and unhealthy machines are not deleted. The default value is true for in-place updates.

In-Place Update Triggers

An in-place update of the shoot worker nodes is triggered for rolling update triggers listed under Rolling Update Triggers except for the following:

.spec.provider.workers[].machine.image.name
.spec.provider.workers[].machine.type
.spec.provider.workers[].volume.type
.spec.provider.workers[].volume.size
.spec.provider.workers[].cri.name
.spec.systemComponents.nodeLocalDNS.enabled

There are validations which restricts changing the above mentioned exception fields when in-place updates strategy is configured.

When a worker pool is undergoing an in-place update, applying subsequent updates to the same worker pool is restricted. If an in-place update fails and nodes are left in a problematic state, user intervention is required to manually fix the nodes. In cases where a subsequent update is necessary to resolve the issue, users can update the worker pool after adding the force update annotation gardener.cloud/operation=force-in-place-update on the Shoot. Refer to Force-update a worker pool with InPlace update strategy for more details.

⚠️ Changing the update strategy from AutoRollingUpdate to AutoInPlaceUpdate/ManualInPlaceUpdate (and vice versa) is not allowed. However, switching between AutoInPlaceUpdate and ManualInPlaceUpdate is permitted.

Automatic In-Place Updates

In case of AutoInPlaceUpdate update strategy, the update process is fully orchestrated by Gardener and the machine-controller-manager. No user intervention is required. Set .spec.provider.workers[].updateStrategy field in the Shoot spec to AutoInPlaceUpdate.

spec:
  provider:
    workers:
    - name: cpu-worker
      maxSurge: 0
      maxUnavailable: 2
      updateStrategy: AutoInPlaceUpdate

During automatic in-place updates, if the maxSurge value is set to greater than 0, the machine-controller-manager creates new nodes equal to the maxSurge value. All old nodes, except for those equal to the maxSurge value, are updated in place, and the old nodes corresponding to the maxSurge value are terminated. If maxSurge is set to 0, no new nodes are created and all old nodes are updated in-place.

The inPlaceUpdates.pendingWorkerUpdates.autoInPlaceUpdate field in the Shoot status lists the names of worker pools that are pending updates with this strategy.

Manual In-Place Updates

The ManualInPlaceUpdate strategy allows users to control and orchestrate the update process manually. Set .spec.provider.workers[].updateStrategy field in the Shoot spec to ManualInPlaceUpdate.

spec:
  provider:
    workers:
    - name: cpu-worker
      maxSurge: 0
      maxUnavailable: 2
      updateStrategy: ManualInPlaceUpdate

Once machine-controller-manager labels nodes with node.machine.sapcloud.io/candidate-for-update, user can select the candidate nodes for update by labeling them with node.machine.sapcloud.io/selected-for-update=true:

kubectl label node <node-name> node.machine.sapcloud.io/selected-for-update=true

The ManualInPlaceWorkersUpdated constraint in the shoot status indicates that at least one worker pool with the ManualInPlaceUpdate strategy is pending an update. Shoot reconciliation will still succeed even if there are worker pools pending updates.

The inPlaceUpdates.pendingWorkerUpdates.manualInPlaceUpdate field in the Shoot status lists the names of worker pools that are pending updates with this strategy.

6 - Supported Kubernetes Versions

Supported Kubernetes Versions

Currently, Gardener supports the following Kubernetes versions:

Garden Clusters

The minimum version of a garden cluster that can be used to run Gardener is 1.27.x.

Seed Clusters

The minimum version of a seed cluster that can be connected to Gardener is 1.27.x.

Shoot Clusters

Gardener itself is capable of spinning up clusters with Kubernetes versions 1.27 up to 1.33. However, the concrete versions that can be used for shoot clusters depend on the installed provider extension. Consequently, please consult the documentation of your provider extension to see which Kubernetes versions are supported for shoot clusters.

👨🏼‍💻 Developers note: The Adding Support For a New Kubernetes Version topic explains what needs to be done in order to add support for a new Kubernetes version.

7 - Trigger Shoot Operations Through Annotations

Trigger Shoot Operations Through Annotations

You can trigger a few explicit operations by annotating the Shoot with an operation annotation. This might allow you to induct certain behavior without the need to change the Shoot specification. Some of the operations can also not be caused by changing something in the shoot specification because they can’t properly be reflected here. Note that once the triggered operation is considered by the controllers, the annotation will be automatically removed and you have to add it each time you want to trigger the operation.

Please note: If .spec.maintenance.confineSpecUpdateRollout=true, then the only way to trigger a shoot reconciliation is by setting the reconcile operation, see below.

Immediate Reconciliation

Annotate the shoot with gardener.cloud/operation=reconcile to make the gardenlet start a reconciliation operation without changing the shoot spec and possibly without being in its maintenance time window:

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=reconcile

Immediate Maintenance

Annotate the shoot with gardener.cloud/operation=maintain to make the gardener-controller-manager start maintaining your shoot immediately (possibly without being in its maintenance time window). If no reconciliation starts, then nothing needs to be maintained:

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=maintain

Retry Failed Reconciliation

Annotate the shoot with gardener.cloud/operation=retry to make the gardenlet start a new reconciliation loop on a failed shoot. Failed shoots are only reconciled again if a new Gardener version is deployed, the shoot specification is changed or this annotation is set:

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=retry

Force-update a worker pool with InPlace update strategy

Annotate the shoot with gardener.cloud/operation=force-in-place-update to force an update for worker pools using the update strategy AutoInPlaceUpdate or ManualInPlaceUpdate. Without this annotation, any subsequent updates to the same worker pool are denied until the Shoot has been successfully reconciled following the current in-place update.

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=force-in-place-update

Credentials Rotation Operations

Please consult Credentials Rotation for Shoot Clusters for more information.

Restart `systemd` Services on Particular Worker Nodes

It is possible to make Gardener restart particular systemd services on your shoot worker nodes if needed. The annotation is not set on the Shoot resource but directly on the Node object you want to target. For example, the following will restart both the kubelet and the containerd services:

kubectl annotate node <node-name> worker.gardener.cloud/restart-systemd-services=kubelet,containerd

It may take up to a minute until the service is restarted. The annotation will be removed from the Node object after all specified systemd services have been restarted. It will also be removed even if the restart of one or more services failed.

ℹ️ In the example mentioned above, you could additionally verify when/whether the kubelet restarted by using kubectl describe node <node-name> and looking for such a Starting kubelet event.

Force Deletion

When a Shoot fails to be deleted normally, users can force-delete the Shoot if it meets the following conditions:

Shoot has a deletion timestamp.
Shoot status contains at least one of the following ErrorCodes:
- ERR_CLEANUP_CLUSTER_RESOURCES
- ERR_CONFIGURATION_PROBLEM
- ERR_INFRA_DEPENDENCIES
- ERR_INFRA_UNAUTHENTICATED
- ERR_INFRA_UNAUTHORIZED

If the above conditions are satisfied, you can annotate the Shoot with confirmation.gardener.cloud/force-deletion=true, and Gardener will cleanup the Shoot controlplane and the Shoot metadata.

⚠️ You MUST ensure that all the resources created in the IaaS account are cleaned up to prevent orphaned resources. Gardener will NOT delete any resources in the underlying infrastructure account. Hence, use this annotation at your own risk and only if you are fully aware of these consequences.

Shoot Operations

1 - Controlling the Kubernetes Versions for Specific Worker Pools

Controlling the Kubernetes Versions for Specific Worker Pools

Example Usage in a Shoot

2 - SecretBinding to CredentialsBinding Migration

SecretBinding to CredentialsBinding Migration

Migration Path

3 - Shoot Credentials Rotation

Credentials Rotation for Shoot Clusters

User-Provided Credentials

Cloud Provider Keys

Gardener-Provided Credentials

Prepare Rotation of All Credentials

Complete Rotation of All Credentials

Certificate Authorities

Triggering Worker Node Rollout Individually

Worker Node with ManualInPlaceUpdate Update Strategy

Observability Password(s) For Plutono and Prometheus

SSH Key Pair for Worker Nodes

ETCD Encryption Key

ServiceAccount Token Signing Key

Triggering Worker Node Rollout Individually

Worker Node with ManualInPlaceUpdate Update Strategy

OpenVPN TLS Auth Keys

4 - Shoot Kubernetes and Operating System Versioning in Gardener

Shoot Kubernetes and Operating System Versioning in Gardener

Motivation

Overview

Version Classifications

Automatic Version Upgrades

Update path for machine image versions

Update path for Kubernetes versions

Version Requirements (Kubernetes and Machine Image)

Related Documentation

5 - Shoot Updates and Upgrades

Shoot Updates and Upgrades

Updates

When are Reconciliations Triggered

Which Updates are Applied

Behavioural Changes

Upgrades

Rolling Updates

Automatic Rolling Updates

Customize Rolling Update Behaviour of Shoot Worker Nodes

Rolling Update Triggers

In-Place Updates

Customize In-Place Update Behaviour of Shoot Worker Nodes

In-Place Update Triggers

Automatic In-Place Updates

Manual In-Place Updates

Related Documentation

6 - Supported Kubernetes Versions

Supported Kubernetes Versions

Garden Clusters

Seed Clusters

Shoot Clusters

7 - Trigger Shoot Operations Through Annotations

Trigger Shoot Operations Through Annotations

Immediate Reconciliation

Immediate Maintenance

Retry Failed Reconciliation

Force-update a worker pool with InPlace update strategy

Credentials Rotation Operations

Restart systemd Services on Particular Worker Nodes

Force Deletion

Example Usage in a `Shoot`

`ServiceAccount` Token Signing Key

Restart `systemd` Services on Particular Worker Nodes