Usage

1 - Hibernate a Cluster

Hibernate a Cluster

Clusters are only needed 24 hours a day if they run productive workloads. So whenever you use a cluster for development, tests, or demo purposes, you can save a lot of money by scaling down your Kubernetes resources whenever you don’t need them. However, scaling them down manually becomes more time-consuming the more resources you have.

Gardener offers a clever way to automatically scale down all resources to zero: cluster hibernation. You can hibernate a cluster either by pushing a button or by defining a hibernation schedule.

To save costs, it’s recommended to define a hibernation schedule before the creation of a cluster. You can hibernate your cluster or wake up your cluster manually even if there’s a schedule for its hibernation.

What is hibernated?

When a cluster is hibernated, Gardener scales down worker nodes and the cluster’s control plane to free resources at the IaaS provider. This affects:

  • Your workload, for example, pods, deployments, custom resources.
  • The virtual machines running your workload.
  • The resources of the control plane of your cluster.

What isn’t affected by the hibernation?

To be able to scale everything back up to where it was before hibernation, Gardener doesn’t delete state-related information, that is, information stored in persistent volumes. The cluster state as persisted in etcd is also preserved.

Hibernate your cluster manually

The .spec.hibernation.enabled field specifies whether the cluster needs to be hibernated or not. If the field is set to true, the cluster’s desired state is to be hibernated. If it is set to false or not specified at all, the cluster’s desired state is to be awakened.

To hibernate your cluster you can run the following kubectl command:

$ kubectl patch shoot -n $NAMESPACE $SHOOT_NAME -p '{"spec":{"hibernation":{"enabled": true}}}'

Wake up your cluster manually

To wake up your cluster you can run the following kubectl command:

$ kubectl patch shoot -n $NAMESPACE $SHOOT_NAME -p '{"spec":{"hibernation":{"enabled": false}}}'

Create a schedule to hibernate your cluster

You can specify a hibernation schedule to automatically hibernate/wake up a cluster.

Let’s have a look into the following example:

  hibernation:
    enabled: false
    schedules:
    - start: "0 20 * * *" # Start hibernation every day at 8PM
      end: "0 6 * * *"    # Stop hibernation every day at 6AM
      location: "America/Los_Angeles" # Specify a location for the cron to run in

The above section configures a hibernation schedule that hibernates the cluster every day at 08:00 PM and wakes it up at 06:00 AM. The start or end fields can be omitted, though at least one of them has to be specified. Hence, it is possible to configure a hibernation schedule that only hibernates or wakes up a cluster. The location field is the time location used to evaluate the cron expressions.
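
For illustration, a minimal sketch of a schedule that only hibernates the cluster and leaves the wake-up to a manual action could look like this (the cron expression and location are assumptions):

  hibernation:
    schedules:
    - start: "0 20 * * 5"           # Hibernate every Friday at 8PM; no end, so wake-up stays manual
      location: "Europe/Berlin"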

2 - APIServer SNI Injection

APIServerSNI environment variable injection

If the Gardener administrator has enabled the APIServerSNI feature gate for a particular Seed cluster, then a DaemonSet called apiserver-proxy is deployed in each Shoot cluster’s kube-system namespace. It routes traffic to the upstream Shoot Kube APIServer. See the APIServer SNI GEP for more details.

To skip this extra network hop, a mutating webhook called apiserver-proxy.networking.gardener.cloud is deployed next to the API server in the Seed. It adds the KUBERNETES_SERVICE_HOST environment variable to each container and init container that does not specify it. See the webhook repository for more information.
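
For example, a container that already sets KUBERNETES_SERVICE_HOST explicitly is left untouched by the webhook. A minimal sketch (pod name, image, and value are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    env:
    - name: KUBERNETES_SERVICE_HOST       # explicitly set, so the webhook does not overwrite it
      value: "api.example-shoot.internal" # placeholder value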

Opt-out of pod injection

In some cases it’s desirable to opt out of Pod injection:

  • DNS is disabled on that individual Pod, but it still needs to talk to the kube-apiserver.
  • You want to test the kube-proxy and kubelet in-cluster discovery.

Opt-out of pod injection for specific pods

To opt out of the injection, the Pod should be labeled with apiserver-proxy.networking.gardener.cloud/inject: disable e.g.:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        apiserver-proxy.networking.gardener.cloud/inject: disable
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

Opt-out of pod injection on namespace level

To opt out of the injection of all Pods in a namespace, you should label your namespace with apiserver-proxy.networking.gardener.cloud/inject: disable e.g.:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    apiserver-proxy.networking.gardener.cloud/inject: disable
  name: my-namespace

or via kubectl for an existing namespace:

kubectl label namespace my-namespace apiserver-proxy.networking.gardener.cloud/inject=disable

NOTE: Please be aware that it’s not possible to disable injection on namespace level and enable it for individual pods in it.

Opt-out of pod injection for the entire cluster

If the injection is causing problems for different workloads and ignoring individual pods or namespaces is not possible, then the feature can be disabled for the entire cluster by setting the alpha.featuregates.shoot.gardener.cloud/apiserver-sni-pod-injector annotation to disable on the Shoot resource itself:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  annotations:
    alpha.featuregates.shoot.gardener.cloud/apiserver-sni-pod-injector: 'disable'
  name: my-cluster

or via kubectl for an existing shoot cluster:

kubectl annotate shoot my-cluster alpha.featuregates.shoot.gardener.cloud/apiserver-sni-pod-injector=disable

NOTE: Please be aware that it’s not possible to disable injection on cluster level and enable it for individual pods in it.

3 - Configuration

Gardener Configuration and Usage

Gardener automates the full lifecycle of Kubernetes clusters as a service. Additionally, it has several extension points allowing external controllers to plug in to the lifecycle. As a consequence, there are several configuration options for the various custom resources that are partially required.

This document describes the

  1. configuration and usage of Gardener as operator/administrator.
  2. configuration and usage of Gardener as end-user/stakeholder/customer.

Configuration and Usage of Gardener as Operator/Administrator

When we use the terms “operator/administrator” we refer to both the people deploying and operating Gardener. Gardener consists of the following components:

  1. gardener-apiserver, a Kubernetes-native API extension that serves custom resources in the Kubernetes-style (like Seeds and Shoots), and a component that contains multiple admission plugins.
  2. gardener-admission-controller, an HTTP(S) server with several handlers to be used in a ValidatingWebhookConfiguration.
  3. gardener-controller-manager, a component consisting of multiple controllers that implement reconciliation and deletion flows for some of the custom resources (e.g., it contains the logic for maintaining Shoots, reconciling Projects, etc.).
  4. gardener-scheduler, a component that assigns newly created Shoot clusters to appropriate Seed clusters.
  5. gardenlet, a component running in seed clusters and consisting of multiple controllers that implement reconciliation and deletion flows for some of the custom resources (e.g., it contains the logic for reconciliation and deletion of Shoots).

Each of these components has various configuration options. The gardener-apiserver uses the standard API server library maintained by the Kubernetes community, and as such it mainly supports command line flags. Other components use so-called componentconfig files that describe their configuration in a Kubernetes-style versioned object.

Configuration file for Gardener admission controller

The Gardener admission controller only supports one command line flag, which should be a path to a valid admission-controller configuration file. Please take a look at this example configuration.

Configuration file for Gardener controller manager

The Gardener controller manager only supports one command line flag, which should be a path to a valid controller-manager configuration file. Please take a look at this example configuration.

Configuration file for Gardener scheduler

The Gardener scheduler also only supports one command line flag, which should be a path to a valid scheduler configuration file. Please take a look at this example configuration. Information about the concepts of the Gardener scheduler can be found here.

Configuration file for Gardenlet

The Gardenlet also only supports one command line flag, which should be a path to a valid gardenlet configuration file. Please take a look at this example configuration. Information about the concepts of the Gardenlet can be found here.

System configuration

After successful deployment of these components, you need to set up the system. Let’s first focus on some “static” configuration. When the gardenlet starts, it scans the garden namespace of the garden cluster for Secrets that have influence on its reconciliation loops, mainly the Shoot reconciliation:

  • Internal domain secret, contains the DNS provider credentials (having appropriate privileges) which will be used to create/delete so-called “internal” DNS records for the Shoot clusters, please see this for an example.

    • This secret is used in order to establish a stable endpoint for shoot clusters which is used internally by all control plane components.
    • The DNS records are normal DNS records but called “internal” in our scenario because only the kubeconfigs for the control plane components use this endpoint when talking to the shoot clusters.
    • It is forbidden to change the internal domain secret if there are existing shoot clusters.
  • Default domain secrets (optional), contain the DNS provider credentials (having appropriate privileges) which will be used to create/delete DNS records for a default domain for shoots (e.g., example.com), please see this for an example.

    • Not every end-user/stakeholder/customer has their own domain; however, Gardener needs to create a DNS record for every shoot cluster.
    • As landscape operator you might want to define a default domain owned and controlled by you that is used for all shoot clusters that don’t specify their own domain.
    • If you have multiple default domain secrets defined, you can add a priority as an annotation (dns.gardener.cloud/domain-default-priority) to select which domain should be used for new shoots during creation. The domain with the highest priority is selected. If no annotation is defined, the default priority is 0; all non-integer values are also considered as priority 0 (see the sketch after this list).

⚠️ Please note that the mentioned domain secrets are only needed if you have at least one seed cluster that is not specifying .spec.settings.shootDNS.enabled=false. Seeds with this setting disabled don’t create any DNS records for shoots scheduled on them, hence, if you only have such seeds, you don’t need to create the domain secrets.

  • Alerting secrets (optional), contain the alerting configuration and credentials for the AlertManager to send email alerts. It is also possible to configure the monitoring stack to send alerts to an AlertManager not deployed by Gardener to handle alerting. Please see this for an example.

    • If email alerting is configured:
      • An AlertManager is deployed into each seed cluster that handles the alerting for all shoots on the seed cluster.
      • Gardener will inject the SMTP credentials into the configuration of the AlertManager.
      • The AlertManager will send emails to the configured email address in case any alerts are firing.
    • If an external AlertManager is configured:
      • Each shoot has a Prometheus responsible for monitoring components and sending out alerts. The alerts will be sent to a URL configured in the alerting secret.
      • This external AlertManager is not managed by Gardener and can be configured however the operator sees fit.
      • Supported authentication types are no authentication, basic, or mutual TLS.
  • OpenVPN Diffie-Hellman Key secret (optional), contains the self-generated Diffie-Hellman key used by OpenVPN in your landscape, please see this for an example.

    • If you don’t specify a custom key, then a default key is used, but for productive landscapes it’s recommended to create a landscape-specific key and define it.
  • Global monitoring secrets (optional), contain basic authentication credentials for the Prometheus aggregating metrics for all clusters.

    • These secrets are synced to each seed cluster and used to gain access to the aggregate monitoring components.
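
As a sketch of the priority annotation mentioned in the default domain secrets bullet above: only the dns.gardener.cloud/domain-default-priority annotation is the point here; the secret name, label, provider, and domain shown are assumptions loosely based on the referenced example.

apiVersion: v1
kind: Secret
metadata:
  name: default-domain-example-com                       # hypothetical name
  namespace: garden
  labels:
    gardener.cloud/role: default-domain                  # assumption: marks the secret as a default domain secret
  annotations:
    dns.gardener.cloud/provider: aws-route53             # assumption: DNS provider type
    dns.gardener.cloud/domain: example.com                # assumption: the default domain
    dns.gardener.cloud/domain-default-priority: "10"      # higher priority wins when multiple default domains exist
type: Opaque
data: {}                                                  # DNS provider credentials go here (keys omitted in this sketch)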

Apart from this “static” configuration there are several custom resources extending the Kubernetes API and used by Gardener. As an operator/administrator you have to configure some of them to make the system work.

Configuration and Usage of Gardener as End-User/Stakeholder/Customer

As an end-user/stakeholder/customer you are using a Gardener landscape that has been setup for you by another team. You don’t need to care about how Gardener itself has to be configured or how it has to be deployed. Take a look at this document - it describes which resources are offered by Gardener. You may want to have a more detailed look for Projects, SecretBindings, Shoots, and (Cluster)OpenIDConnectPresets.

4 - Control Plane Migration

Control Plane Migration

Preconditions

To be able to use this feature, the SeedChange feature gate has to be enabled on your gardener-apiserver.

Also, the involved Seeds need to have BackupBuckets enabled.

ShootState

ShootState is an API resource which stores non-reconstructible state and data required to completely recreate a Shoot’s control plane on a new Seed. The ShootState resource is created on Shoot creation in its Project namespace and the required state/data is persisted during Shoot creation or reconciliation.

Shoot Control Plane Migration

Triggering the migration is done by changing the Shoot’s .spec.seedName to a Seed that differs from the .status.seedName; we call this Seed the "Destination Seed". This action can only be performed by an operator with the necessary RBAC permissions. If the Destination Seed does not have a backup and restore configuration, the change to spec.seedName is rejected. Additionally, this Seed must not be set for deletion and must be healthy.

If the Shoot’s .spec.seedName and .status.seedName differ, a process is started to prepare the Control Plane for migration:

  1. .status.lastOperation is changed to Migrate.
  2. The Kubernetes API Server is stopped and the extension resources are annotated with gardener.cloud/operation=migrate.
  3. A full snapshot of the ETCD is created and termination of the Control Plane in the Source Seed is initiated.

If the process is successful, we update the status of the Shoot by setting the .status.seedName to the null value. That way, a restoration is triggered in the Destination Seed and .status.lastOperation is changed to Restore. The control plane migration is completed when the Restore operation has completed successfully.

When the CopyEtcdBackupsDuringControlPlaneMigration feature gate is enabled on the gardenlet, the etcd backups will be copied over to the BackupBucket of the Destination Seed during control plane migration and any future backups will be uploaded there. Otherwise, backups will continue to be uploaded to the BackupBucket of the Source Seed.

Triggering the migration

For control plane migration, operators with the necessary RBAC permissions can use the shoots/binding subresource to change the .spec.seedName, with the following commands:

export NAMESPACE=my-namespace
export SHOOT_NAME=my-shoot
kubectl get --raw /apis/core.gardener.cloud/v1beta1/namespaces/${NAMESPACE}/shoots/${SHOOT_NAME} | jq -c '.spec.seedName = "<destination-seed>"' | kubectl replace --raw /apis/core.gardener.cloud/v1beta1/namespaces/${NAMESPACE}/shoots/${SHOOT_NAME}/binding -f - | jq -r '.spec.seedName'

5 - CSI Components

(Custom) CSI Components

Some provider extensions for Gardener are using CSI components to manage persistent volumes in the shoot clusters. Additionally, most of the provider extensions are deploying controllers for taking volume snapshots (CSI snapshotter).

End-users can deploy their own CSI components and controllers into shoot clusters. In such situations, there are multiple controllers acting on the VolumeSnapshot custom resources (each responsible for those instances associated with their respective driver provisioner types).

However, this might lead to operational conflicts that cannot be overcome by Gardener alone. Concretely, Gardener cannot know which custom CSI components were installed by end-users which can lead to issues, especially during shoot cluster deletion. You can add a label to your custom CSI components indicating that Gardener should not try to remove them during shoot cluster deletion. This means you have to take care of the lifecycle for these components yourself!

Recommendations

Custom CSI components are typically regular Deployments running in the shoot clusters.

Please label them with the shoot.gardener.cloud/no-cleanup=true label.
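
A minimal sketch of how this label could look on a custom CSI Deployment (the deployment name is hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-csi-snapshot-controller            # hypothetical custom CSI component
  namespace: kube-system
  labels:
    shoot.gardener.cloud/no-cleanup: "true"   # tells Gardener to skip this resource during shoot cluster deletion
...

For an already deployed component, the label can also be added via kubectl:

kubectl -n kube-system label deployment my-csi-snapshot-controller shoot.gardener.cloud/no-cleanup=true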

Background Information

When a shoot cluster is deleted, Gardener deletes most Kubernetes resources (Deployments, DaemonSets, StatefulSets, etc.). Gardener will also try to delete CSI components if they are not marked with the above mentioned label.

This can result in VolumeSnapshot resources still having finalizers that will never be cleaned up. Consequently, manual intervention is required to clean them up before the cluster deletion can continue.

6 - Custom containerd Configuration

Custom containerd Configuration

In case a Shoot cluster uses containerd (see this document for more information), it is possible to make the containerd process load custom configuration files. Gardener initializes containerd with the following statement:

imports = ["/etc/containerd/conf.d/*.toml"]

This means that all *.toml files in the /etc/containerd/conf.d directory will be imported and merged with the default configuration. Please consult the upstream containerd documentation for more information.
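
For illustration, a drop-in file could raise containerd’s log verbosity. This is a minimal sketch; the file name is arbitrary and the [debug] section shown is a standard containerd configuration section:

# /etc/containerd/conf.d/10-debug-level.toml (hypothetical file name)
[debug]
  level = "debug"   # merged into the default containerd configuration via the imports statement above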

⚠️ Note that this only applies to nodes which were newly created after gardener/gardener@v1.51 was deployed. Existing nodes are not affected.

7 - Custom DNS Configuration

Custom DNS Configuration

Gardener provides Kubernetes-Clusters-As-A-Service where all the system components (e.g., kube-proxy, networking, dns, …) are managed. As a result, Gardener needs to ensure and auto-correct the configuration of those system components to avoid unnecessary downtime.

In some cases, auto-correcting system components can prevent users from deploying applications on top of the cluster that require bits of customization; DNS configuration is a good example.

To allow customizations of the DNS configuration (which could potentially lead to downtime) while having the option to “undo” them, we utilize the import plugin from CoreDNS [1], which enables in-line configuration changes.

How to use

To customize your CoreDNS cluster config, you can simply edit a ConfigMap named coredns-custom in the kube-system namespace. By editing this ConfigMap, you are modifying the CoreDNS configuration, so care is advised.

For example, to apply new config to CoreDNS that would point all .global DNS requests to another DNS pod, simply edit the configuration as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  istio.server: |
    global:8053 {
            errors
            cache 30
            forward . 1.2.3.4
        }    
  corefile.override: |
         # <some-plugin> <some-plugin-config>
         debug
         whoami         

It is important that the ConfigMap keys end with *.server (if you would like to add a new server) or *.override (if you want to customize the current server configuration). Setting both is optional.

[Optional] Reload CoreDNS

As Gardener configures the reload plugin of CoreDNS, a restart of the CoreDNS components is typically not necessary to propagate ConfigMap changes. However, if you don’t want to wait for the default reload interval (30s) to kick in, you can roll out your CoreDNS deployment using:

kubectl -n kube-system rollout restart deploy coredns

This will reload the config into CoreDNS.

The approach we follow here was inspired by AKS’s approach [2].

Anti-Pattern

Applying a configuration that is incompatible with the running version of CoreDNS is an anti-pattern (plugin configurations sometimes change between versions, so simply applying a configuration can break DNS).

If incompatible changes are applied by mistake, simply delete the content of the ConfigMap and re-apply. This should bring the cluster DNS back to a functioning state.

Node Local DNS

Custom DNS configuration may not work as expected in conjunction with NodeLocalDNS. With NodeLocalDNS, ordinary DNS queries targeted at the upstream DNS servers, i.e. non-kubernetes domains, will not end up at CoreDNS, but will instead be sent directly to the upstream DNS server. Therefore, configuration applying to non-kubernetes entities, e.g. the istio.server block in the custom DNS configuration example, may not have any effect with NodeLocalDNS enabled. If this kind of custom configuration is required, forwarding to upstream DNS has to be disabled. This can be done by setting the option (spec.systemComponents.nodeLocalDNS.disableForwardToUpstreamDNS) in the Shoot resource to true:

...
spec:
  ...
  systemComponents:
    nodeLocalDNS:
      enabled: true
      disableForwardToUpstreamDNS: true
...

References

[1] Import plugin [2] AKS Custom DNS

8 - Default Seccomp Profile

Default Seccomp Profile and Configuration

This is a short guide describing how to enable the defaulting of seccomp profiles for Gardener managed workloads in the seed.

Default Kubernetes Behavior

The state of Kubernetes in versions < 1.25 is such that all workloads by default run in Unconfined (seccomp disabled) mode. This is undesirable since this is the least restrictive profile. Also mind that any privileged container will always run as Unconfined. More information about seccomp can be found in this Kubernetes tutorial.

Setting the Seccomp Profile to RuntimeDefault for seed clusters

To address the above issue, Gardener provides a webhook that is capable of mutating pods in the seed clusters, explicitly providing them with a seccomp profile type of RuntimeDefault. This profile is defined by the container runtime and represents a set of default syscalls that are allowed or not.

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

A Pod is mutated when all of the following preconditions are fulfilled:

  1. The Pod is created in a Gardener managed namespace.
  2. The Pod is NOT labeled with seccompprofile.resources.gardener.cloud/skip.
  3. The Pod does NOT explicitly specify .spec.securityContext.seccompProfile.type.
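
A minimal sketch of a Pod that opts out of the mutation via the skip label (the pod name, image, and label value are assumptions; per the preconditions above, it is the presence of the label that matters):

apiVersion: v1
kind: Pod
metadata:
  name: my-special-workload                               # hypothetical pod
  labels:
    seccompprofile.resources.gardener.cloud/skip: "true"  # value assumed; the label's presence skips the mutation
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0                   # placeholder image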

How to Configure

To enable this feature, the Gardenlet DefaultSeccompProfile feature gate must be set to true.

featureGates:
  DefaultSeccompProfile: true

Please refer to the examples here for more information.

Once the feature gate is enabled, the webhook will be registered and configured for the seed cluster. Newly created pods will be mutated to have their seccomp profile set to RuntimeDefault.

Please note that this feature is still in Alpha, so you might see instabilities every now and then.

Setting the Seccomp Profile to RuntimeDefault for shoot clusters

For Kubernetes shoot versions >= 1.25, you can enable the use of RuntimeDefault as the default seccomp profile for all workloads. If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined mode. More information about this feature can be found in the Kubernetes documentation.

To use seccomp profile defaulting, you must run the kubelet with the SeccompDefault feature gate enabled (this is the default for k8s versions >= 1.25).

How to Configure

To enable this feature, the kubelet seccompDefault configuration parameter must be set to true in the shoot’s spec.

spec:
  kubernetes:
    version: 1.25.0
    kubelet:
      seccompDefault: true

Please refer to the examples here for more information.

9 - DNS Autoscaling

DNS Autoscaling

This is a short guide describing different options for automatically scaling CoreDNS in the shoot cluster.

Background

Currently, Gardener uses CoreDNS as the DNS server. By default, it is installed as a deployment into the shoot cluster that is auto-scaled horizontally to cater for QPS-intensive applications. However, doing so does not seem to be enough to completely circumvent DNS bottlenecks such as:

  • Cloud provider limits for DNS lookups.
  • Unreliable UDP connections that force a timeout period in case packets are dropped.
  • Unnecessary node hopping since CoreDNS is not deployed on all nodes, and as a result DNS queries end up traversing multiple nodes before reaching the destination server.
  • Inefficient load-balancing of services (e.g., round-robin might not be enough when using IPTables mode).
  • Overload of the CoreDNS replicas as the maximum amount of replicas is fixed.
  • and more …

As an alternative with extended configuration options, Gardener provides cluster-proportional autoscaling of CoreDNS. This guide focuses on the configuration of cluster-proportional autoscaling of CoreDNS and its advantages/disadvantages compared to the horizontal autoscaling. Please note that there is also the option to use a node-local DNS cache, which helps mitigate potential DNS bottlenecks (see Trade-offs in conjunction with NodeLocalDNS for considerations regarding using NodeLocalDNS together with one of the CoreDNS autoscaling approaches).

Configuring cluster-proportional DNS Autoscaling

All that needs to be done to enable the usage of cluster-proportional autoscaling of CoreDNS is to set the corresponding option (spec.systemComponents.coreDNS.autoscaling.mode) in the Shoot resource to cluster-proportional:

...
spec:
  ...
  systemComponents:
    coreDNS:
      autoscaling:
        mode: cluster-proportional
...

To switch back to horizontal DNS autoscaling you can set the spec.systemComponents.coreDNS.autoscaling.mode to horizontal (or remove the coreDNS section).

Once the cluster-proportional autoscaling of CoreDNS has been enabled and the Shoot cluster has been reconciled afterwards, a ConfigMap called coredns-autoscaler will be created in the kube-system namespace with the default settings. The content will be similar to the following:

linear: '{"coresPerReplica":256,"min":2,"nodesPerReplica":16}'

It is possible to adapt the ConfigMap according to your needs in case the defaults do not work as desired. The number of CoreDNS replicas is calculated according to the following formula:

replicas = max( ceil( cores / coresPerReplica ), ceil( nodes / nodesPerReplica ) )

Depending on your needs, you can adjust coresPerReplica or nodesPerReplica, but it is also possible to override min if required.
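
For example, with the default settings shown above, a cluster with 80 nodes and 320 cores would get max( ceil(320/256), ceil(80/16) ) = max(2, 5) = 5 CoreDNS replicas.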

Trade-offs of horizontal and cluster-proportional DNS Autoscaling

The horizontal autoscaling of CoreDNS as implemented by Gardener is fully managed, i.e. you do not need to perform any configuration changes. It scales according to the CPU usage of CoreDNS replicas meaning that it will create new replicas if the existing ones are under heavy load. This approach scales between 2 and 5 instances, which is sufficient for most workloads. In case this is not enough, the cluster-proportional autoscaling approach can be used instead with its more flexible configuration options.

The cluster-proportional autoscaling of CoreDNS as implemented by Gardener is fully managed, but allows more configuration options to adjust the default settings to your individual needs. It scales according to the cluster size, i.e. if your cluster grows in terms of cores/nodes so will the amount of CoreDNS replicas. However, it does not take the actual workload, e.g. CPU consumption, into account.

Experience shows that the horizontal autoscaling of CoreDNS works for a variety of workloads. It does reach its limits if a cluster has a high amount of DNS requests, though. The cluster-proportional autoscaling approach allows to fine-tune the amount of CoreDNS replicas. It helps to scale in clusters of changing size. However, please keep in mind that you need to cater for the maximum amount of DNS requests as the replicas will not be adapted according to the workload, but only according to the cluster size (cores/nodes).

Trade-offs in conjunction with NodeLocalDNS

Using a node-local DNS cache can mitigate a lot of the potential DNS related problems. It works fine with a DNS workload that can be handled by the cache and it reduces the inter-node DNS communication. As the node-local DNS cache reduces the amount of traffic being sent to the cluster’s CoreDNS replicas, it usually works fine with horizontally scaled CoreDNS. Nevertheless, it also works with CoreDNS scaled in a cluster-proportional approach. In this mode, though, it might make sense to adapt the default settings as the CoreDNS workload is likely significantly reduced.

Overall, you can view the DNS options on a scale. Horizontally scaled DNS provides a small number of DNS servers. Especially for bigger clusters, a cluster-proportional approach will yield more CoreDNS instances and hence may yield a more balanced DNS solution. By adapting the settings you can further increase the number of CoreDNS replicas. On the other end of the spectrum, a node-local DNS cache provides DNS on every node and allows to reduce the number of (backend) CoreDNS instances regardless of whether they are horizontally or cluster-proportionally scaled.

10 - DNS Search Path Optimization

DNS Search Path Optimization

DNS Search Path

Using fully qualified names has some downsides, e.g. it may become harder to move deployments from one landscape to the next. It is far easier and simpler to rely on short/local names, which may have a different meaning depending on the context they are used in.

The DNS search path allows the usage of short/local names. It is an ordered list of DNS suffixes to append to short/local names to create a fully qualified name.

When a short/local name is to be resolved, each entry is appended to it one by one to check whether the result can be resolved. The process stops when either the name could be resolved or the DNS search path ends. As the last step, after trying the search path, the short/local name is attempted to be resolved on its own.

DNS Option ndots

As explained in the section above, the DNS search path is used for short/local names to create fully qualified names. The DNS option ndots specifies how many dots (.) a name needs to have to be considered fully qualified. For names with less than ndots dots (.), the DNS search path will be applied.

DNS Search Path, ndots and Kubernetes

Kubernetes tries to make it easy/convenient for developers to use name resolution. It provides several means to address a service, most notably by its name directly, using the namespace as suffix, utilizing <namespace>.svc as suffix or as a fully qualified name as <service>.<namespace>.svc.cluster.local (assuming cluster.local to be the cluster domain).

This is why the DNS search path is fairly long in Kubernetes, usually consisting of <namespace>.svc.cluster.local, svc.cluster.local, cluster.local and potentially some additional entries coming from the local network of the cluster. For various reasons, the default ndots value in Kubernetes is, at 5, also fairly large. See this comment for a more detailed description.
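
For illustration, a pod’s /etc/resolv.conf typically looks roughly like the following sketch (the namespace, cluster domain, and nameserver address are assumptions):

search some-namespace.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5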

DNS Search Path/ndots Problem in Kubernetes

As the DNS search path is long and ndots is large, a lot of DNS queries might traverse the DNS search path. This results in an explosion of DNS requests.

For example, consider the name resolution of the default kubernetes service kubernetes.default.svc.cluster.local. As this name has only four dots, it is not considered a fully qualified name according to the default ndots=5 setting. Therefore, the DNS search path is applied, resulting in the following queries being created:

  • kubernetes.default.svc.cluster.local.some-namespace.svc.cluster.local
  • kubernetes.default.svc.cluster.local.svc.cluster.local
  • kubernetes.default.svc.cluster.local.cluster.local
  • kubernetes.default.svc.cluster.local.network-domain

In IPv4/IPv6 dual stack systems, the number of DNS requests may even double as each name is resolved for both IPv4 and IPv6.

General Workarounds/Mitigations

Kubernetes provides the capability to set the DNS options for each pod (see Pod DNS config for details). However, this has to be applied for every pod (doing name resolution) to resolve the problem. A mutating webhook may be useful in this regard. Unfortunately, the DNS requirements may be different depending on the workload. Therefore, a general solution may be difficult to impossible.
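
For workloads where it is acceptable, the ndots option can be lowered per pod via the standard Pod dnsConfig mentioned above. A minimal sketch (the pod name, image, and chosen ndots value are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: dns-options-example               # hypothetical pod
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
  dnsConfig:
    options:
    - name: ndots
      value: "2"                          # names with two or more dots then skip the search path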

Another approach is to always use fully qualified names and append a dot (.) to the name to prevent the name resolution system from using the DNS search path. This might be somewhat counterintuitive as most developers are not used to the trailing dot (.). Furthermore, it makes moving to different landscapes more difficult/error-prone.

Gardener specific Workarounds/Mitigations

Gardener allows users to customize their DNS configuration. CoreDNS allows several approaches to deal with the requests generated by the DNS search path. Caching is possible as well as query rewriting. There are also several other plugins available, which may mitigate the situation.

Gardener DNS Query Rewriting

As explained above, the application of the DNS search path may lead to the undesired creation of DNS requests. Especially with the default setting of ndots=5, seemingly fully qualified names pointing to services in the cluster may trigger the DNS search path application.

Gardener allows automatically rewriting some obviously incorrect DNS names, which stem from the application of the DNS search path, to the most likely desired name. The feature can be enabled by setting the Gardenlet feature gate CoreDNSQueryRewriting to true:

featureGates:
  CoreDNSQueryRewriting: true

In case the feature is enabled in the Gardenlet, it can be disabled per shoot cluster by setting the annotation alpha.featuregates.shoot.gardener.cloud/core-dns-rewriting-disabled to any value.

This will automatically rewrite requests like service.namespace.svc.cluster.local.other-namespace.svc.cluster.local to service.namespace.svc.cluster.local. The same holds true for service.namespace.svc.other-namespace.svc.cluster.local, which will also be rewritten to service.namespace.svc.cluster.local.

In case applications also target services for name resolution which are outside of the cluster and have fewer than ndots dots, it might be helpful to prevent search path application for them as well. One way to achieve this is by adding them to the commonSuffixes:

...
spec:
  ...
  systemComponents:
    coreDNS:
      rewriting:
        commonSuffixes:
        - gardener.cloud
        - github.com
...

DNS requests containing a common suffix and ending in <namespace>.svc.cluster.local are assumed to be incorrect application of the DNS search path. Therefore, they are rewritten to everything ending in the common suffix. For example, www.gardener.cloud.namespace.svc.cluster.local would be rewritten to www.gardener.cloud.

Please note that the common suffixes should be long enough and include enough dots (.) to prevent random overlap with other DNS queries. For example, it would be a bad idea to simply put com on the list of common suffixes as there may be services/namespaces, which have com as part of their name. The effect would be seemingly random DNS requests. Gardener enforces at least two dots (.) in the common suffixes.

11 - Docker Shim Removal

Kubernetes dockershim removal

What’s happening?

With Kubernetes v1.20 the built-in dockershim was deprecated and is scheduled to be removed with v1.24. Don’t Panic! The Kubernetes community has published a blogpost and an FAQ with more information.

Gardener also needs to switch from using the built-in dockershim to containerd. Gardener will not change running Shoot clusters. But changes to the container runtime will be coupled to the K8s version selected by the Shoot:

  • starting with K8s version 1.22 Shoots not explicitly selecting a container runtime will get containerd instead of docker. Shoots can still select docker explicitly if needed.
  • starting with K8s version 1.23 docker can no longer be selected.

At this point in time, we have no plans to support other container runtimes, such as cri-o.

What should I do?

As a gardener operator:

As a shoot owner:

  • check if you have dependencies to the docker container runtime. Note: This is not only about your actual workload, but also concerns ops tooling as well as logging, monitoring and metric agents installed on the nodes
  • test with containerd:
  • once testing is successful, switch to containerd with your production workload. You don’t need to wait for Kubernetes v1.22; containerd is considered production-ready as of today
  • if you find dependencies to docker, set .spec.provider.workers[].cri.name: docker explicitly to avoid defaulting to containerd once you update your Shoot to kubernetes v1.22
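
A minimal sketch of the corresponding worker configuration in the Shoot spec (the worker pool name is hypothetical):

spec:
  provider:
    workers:
    - name: worker-pool-1      # hypothetical worker pool name
      cri:
        name: docker           # pin docker explicitly to avoid defaulting to containerd with Kubernetes v1.22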

Timeline

  • 2021-08-04: Kubernetes v1.22 released. Shoots using this version get containerd as default container runtime. Shoots can still select docker explicitly if needed.
  • 2021-12-07: Kubernetes v1.23 released. Shoots using this version can no longer select docker as container runtime.
  • 2022-06-28: Kubernetes v1.21 goes out of maintenance. This is the last version not affected by these changes. Make sure you have tested thoroughly and set the correct configuration for your Shoots!
  • 2022-10-28: Kubernetes v1.22 goes out of maintenance. This is the last version that you can use with docker as container runtime. Make sure you have removed any dependencies to docker as container runtime!

See the official kubernetes documentation for the exact dates for all releases.

Container Runtime support in Gardener Operating System Extensions

Operating System      containerd support
GardenLinux           >= v0.3.0
Ubuntu                >= v1.4.0
SuSE CHost            >= v1.14.0
CoreOS/FlatCar        >= v1.8.0

Note: If you’re using a different Operating System Extension, start evaluating now if it provides support for containerd. Please refer to our documentation of the operatingsystemconfig contract to understand how to support containerd for an Operating System Extension.

Stable Worker node hash support in Gardener Provider Extensions

Upgrade to these versions to avoid a node rollout when a Shoot’s configuration changes from cri: nil to cri.name: docker.

Provider Extension    Stable worker hash support
Alicloud              >= 1.26.0
AWS                   >= 1.27.0
Azure                 >= 1.21.0
GCP                   >= 1.18.0
OpenStack             >= 1.21.0
vSphere               >= 0.11.0

Note: If you’re using a different Provider Extension, start evaluating now if it keeps the worker hash stable when switching from .spec.provider.workers[].cri: nil to .spec.provider.workers[].cri.name: docker. This doesn’t impact functional correctness, however, a node rollout will be triggered when users decide to configure docker for their shoots.

12 - ExposureClasses

ExposureClasses

The Gardener API server provides a cluster-scoped ExposureClass resource. This resource is used to allow exposing the control plane of a Shoot cluster in various network environments, such as restricted corporate networks, a DMZ, etc.

Background

The ExposureClass resource is based on the concept of the RuntimeClass resource in Kubernetes.

A RuntimeClass abstracts the installation of a certain container runtime (e.g. gVisor, Kata Containers) on all nodes or a subset of the nodes in a Kubernetes cluster. See here.

In contrast, an ExposureClass abstracts the ability to expose a Shoot cluster’s control plane in certain network environments (e.g. corporate networks, DMZ, internet) on all Seeds or a subset of the Seeds.

Example: RuntimeClass and ExposureClass

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: gvisorconfig
# scheduling:
#   nodeSelector:
#     env: prod
---
apiVersion: core.gardener.cloud/v1alpha1
kind: ExposureClass
metadata:
  name: internet
handler: internet-config
# scheduling:
#   seedSelector:
#     matchLabels:
#       network/env: internet

Similar to RuntimeClasses, ExposureClasses also define a .handler field, which references the corresponding CRI configuration in the case of a RuntimeClass and the control plane exposure configuration in the case of an ExposureClass.

The CRI handler for RuntimeClasses is usually installed by an administrator (e.g. via a DaemonSet which installs the corresponding container runtime on the nodes). The control plane exposure configuration for ExposureClasses will also be provided by an administrator. This exposure configuration is part of the Gardenlet configuration, as this component is responsible for configuring the control plane accordingly. See here.

The RuntimeClass also supports the selection of a node subset (which has the respective container runtime binaries installed) for pod scheduling via its .scheduling section. The ExposureClass also supports the selection of a subset of available Seed clusters whose Gardenlet is capable of applying the exposure configuration for the Shoot control plane accordingly via its .scheduling section.

Usage by a Shoot

A Shoot can reference an ExposureClass via the .spec.exposureClassName field.

⚠️ When creating a Shoot resource, the Gardener scheduler will try to assign the Shoot to a Seed which will host its control plane. The scheduling behaviour can be influenced via the .spec.seedSelectors and/or .spec.tolerations fields in the Shoot. ExposureClasses can also contain scheduling instructions. If a Shoot is referencing an ExposureClass, then the scheduling instructions of both will be merged into the Shoot. Such a union of scheduling instructions might lead to the selection of a Seed which is not able to deal with the handler of the ExposureClass, and the Shoot creation might end up in an error. In such a case, the Shoot scheduling instructions should be revisited to check that they do not interfere with the ones from the ExposureClass. If this is not feasible, then the combination with the ExposureClass might not be possible and you need to contact your Gardener administrator.

Example: Shoot and ExposureClass scheduling instructions merge flow
  1. Assuming there is the following Shoot which is referencing the ExposureClass below:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: abc
  namespace: garden-dev
spec:
  exposureClassName: abc
  seedSelectors:
    matchLabels:
      env: prod
---
apiVersion: core.gardener.cloud/v1alpha1
kind: ExposureClass
metadata:
  name: abc
handler: abc
scheduling:
  seedSelector:
    matchLabels:
      network: internal
  2. Both seedSelectors would be merged into the Shoot. The result would be the following:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: abc
  namespace: garden-dev
spec:
  exposureClassName: abc
  seedSelectors:
    matchLabels:
      env: prod
      network: internal
  3. Now the Gardener Scheduler would try to find a Seed with those labels.
  • If there are no Seeds with matching labels for the seed selector then the Shoot will be unschedulable
  • If there are Seeds with matching labels for the seed selector then the Shoot will be assigned to the best candidate after the scheduling strategy is applied, see here
    • If the Seed is not able to serve the ExposureClass handler abc then the Shoot will end up in error state
    • If the Seed is able to serve the ExposureClass handler abc then the Shoot will be created

Gardenlet Configuration ExposureClass handlers

The Gardenlet is responsible for realizing the control plane exposure strategy defined in the referenced ExposureClass of a Shoot.

Therefore, the GardenletConfiguration can contain an .exposureClassHandlers list with the respective configuration.

Example of the GardenletConfiguration:

exposureClassHandlers:
- name: internet-config
  loadBalancerService:
    annotations:
      loadbalancer/network: internet
- name: internal-config
  loadBalancerService:
    annotations:
      loadbalancer/network: internal
  sni:
    ingress:
      namespace: ingress-internal
      labels:
        network: internal

Each Gardenlet can define how the handler of a certain ExposureClass needs to be implemented for the Seed(s) it is responsible for.

The .name is the name of the handler config and it must match the .handler in the ExposureClass.

All control planes on a Seed are exposed via a load balancer, either a dedicated one or a central shared one. The load balancer service needs to be configured in a way that it is reachable from the target network environment. Therefore, the load balancer service configuration can be specified via the .loadBalancerService section. The common way to influence load balancer service behaviour is via annotations, to which the respective cloud-controller-manager reacts by configuring the infrastructure load balancer accordingly.

In case the Gardenlet runs with the APIServerSNI feature gate activated (default), the control planes on a Seed will be exposed via a central load balancer and Envoy acting as a TLS SNI passthrough proxy. In this case, the Gardenlet will install a dedicated ingress gateway (Envoy + load balancer + respective configuration) for each handler on the Seed. The configuration of the ingress gateways can be controlled via the .sni section in the same way as for the default ingress gateways.

13 - Istio

Istio

Istio offers a service mesh implementation with a focus on several important features: traffic, observability, security and policy.

Gardener ManagedIstio feature gate

When enabled in the gardenlet, the ManagedIstio feature gate can be used to deploy a Gardener-tailored Istio installation in Seed clusters. Its main usage is to enable features such as Shoot API server SNI. This feature should not be enabled on a Seed cluster where Istio is already deployed.

However, this feature gate is deprecated, turned on by default and will be removed in a future version of Gardener. This means that Gardener will unconditionally deploy Istio with its desired configuration to seed clusters. Consequently, existing/bring-your-own Istio deployments will no longer be supported.

Prerequisites

Differences with Istio’s default profile

The default profile, which is recommended for production deployments, is not suitable for the Gardener use case as it offers more functionality than desired. The upstream installation is also going through heavy refactorings due to the IstioOperator, and the mixture of Helm values + Kubernetes API specification makes configuring and fine-tuning it very hard. Gardener therefore uses a more simplistic deployment. The differences are the following:

  • Telemetry is not deployed.
  • istiod is deployed.
  • istio-ingress-gateway is deployed in a separate istio-ingress namespace.
  • istio-egress-gateway is not deployed.
  • None of the Istio addons are deployed.
  • Mixer (deprecated) is not deployed.
  • Mixer CRDs are not deployed.
  • Kubernetes Services, Istio’s VirtualServices and ServiceEntries are NOT advertised in the service mesh. This means that if a Service needs to be accessed directly from the Istio Ingress Gateway, it should have the networking.istio.io/exportTo: "*" annotation. VirtualService and ServiceEntry resources must have .spec.exportTo: ["*"] set on them respectively (see the sketch after this list).
  • Istio injector is not enabled.
  • mTLS is enabled by default.
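
For example, a Service that must be reachable directly from the Istio Ingress Gateway could carry the exportTo annotation mentioned in the list above. A minimal sketch (the service name, namespace, selector, and port are assumptions):

apiVersion: v1
kind: Service
metadata:
  name: my-service                        # hypothetical service
  namespace: my-namespace
  annotations:
    networking.istio.io/exportTo: "*"     # advertise this Service in the service mesh
spec:
  selector:
    app: my-app
  ports:
  - port: 443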

14 - Logging

Logging stack

Motivation

Kubernetes uses the underlying container runtime logging, which does not persist logs for stopped and destroyed containers. This makes it difficult to investigate issues in the very common case of containers that are no longer running. Gardener provides a solution to this problem for the managed cluster components by introducing its own logging stack.

Components:

  • A Fluent-bit daemonset, which works like a log collector, and a custom Golang plugin which distributes log messages to their Loki instances
  • One Loki Statefulset in the garden namespace, which contains logs for the seed cluster, and one per shoot namespace, which contains logs for the shoot’s control plane
  • One Grafana Deployment in the garden namespace and two Deployments per shoot namespace (one exposed to the end users and one for the operators). Grafana is the UI component used in the logging stack.

Container Logs rotation and retention

Container log rotation in Kubernetes describes a subtle but important implementation detail depending on the type of the used high-level container runtime. When the used container runtime is not CRI compliant (such as dockershim), then the kubelet does not provide any rotation or retention implementations, hence leaving those aspects to the downstream components. When the used container runtime is CRI compliant (such as containerd), then the kubelet provides the necessary implementation with two configuration options:

  • ContainerLogMaxSize for rotation
  • ContainerLogMaxFiles for retention.

Docker container runtime

In this case, the log rotation and retention is implemented by a logrotate service provisioned by Gardener, which rotates logs once a size of 100M is reached. Logs are compressed on a daily basis and retained for a maximum period of 14d.

ContainerD runtime

In this case, it is possible to configure the containerLogMaxSize and containerLogMaxFiles fields in the Shoot specification. Both fields are optional, and if nothing is specified, the kubelet rotates at the same size of 100M as with the docker container runtime. Those fields are part of the provider’s workers definition. Here is an example:

spec:
  provider:
    workers:
      - cri:
          name: containerd
        kubernetes:
          kubelet:
            # accepted values are of resource.Quantity
            containerLogMaxSize: 150Mi
            containerLogMaxFiles: 10

The values of the containerLogMaxSize and containerLogMaxFiles fields need to be considered with care since container log files claim disk space from the host. On the other hand, log rotation at too small sizes may result in frequent rotations, which can be missed by other components (log shippers) observing these rotations.

In the majority of cases, the defaults shall do just fine. Custom configuration might be of use under rare conditions.

Extension of the logging stack

The logging stack is extended to scrape logs from the systemd services of each shoot’s nodes and from all Gardener components in the shoot kube-system namespace. These logs are exposed only to the Gardener operators.

Also, in the shoot control plane an event-logger pod is deployed which scrapes events from the shoot kube-system namespace and shoot control-plane namespace in the seed. The event-logger logs the events to the standard output. Then the fluent-bit gets these events as container logs and sends them to the Loki in the shoot control plane (similar to how it works for any other control plane component).

How to access the logs

The first step is to authenticate in front of the Grafana ingress. There are two Grafana instances from which the logs are accessible.

  1. The user (stakeholder/cluster-owner) Grafana consists of predefined Monitoring and Logging dashboards which help the end-user to get the most important metrics and logs out of the box. This Grafana UI is dedicated only to the end-user and does not show logs from components which could log sensitive information. Also, the Explore tab is not available. Those logs are in the predefined dashboard named Controlplane Logs Dashboard. In this dashboard the user can search logs by pod name, container name, severity and a phrase or regex. The user Grafana URL can be found in the Logging and Monitoring section of a cluster in the Gardener Dashboard alongside the credentials, when opened as cluster owner/user. The secret with the credentials can be found in the garden-<project> namespace under <shoot-name>.monitoring in the garden cluster or in the control-plane (shoot–project–shoot-name) namespace under the observability-ingress-users-<hash> secret in the seed cluster. Also, the Grafana URL can be found in the control-plane namespace under the grafana-users ingress in the seed. The end-user has access only to the logs of some of the control-plane components.

  2. In addition to the dashboards in the User Grafana, the Operator Grafana contains several other dashboards that aim to facilitate the work of operators. The operator Grafana URL can be found in the Logging and Monitoring section of a cluster in the Gardener Dashboard alongside the credentials, when opened as Gardener operator. Also, it can be found in the control-plane namespace under the grafana-operators ingress in the seed. Operators have access to the Explore tab. The secret with the credentials can be found in the control-plane (shoot–project–shoot-name) namespace under the observability-ingress-<hash>-<hash> secret in the seed. From the Explore tab, operators have unlimited abilities to extract and manipulate logs. Grafana itself helps them with suggestions and auto-completion.

NOTE: Operators are people part of the Gardener team with operator permissions, not operators of the end-user cluster!

How to use the Explore tab

If you click on the Log browser > button, you will see all of the available labels. Clicking on a label, you can see all of its available values for the period of time you have specified. If you are searching for logs for the past hour, do not expect to see labels or values for which there were no logs in that period of time. By clicking on a value, Grafana automatically eliminates all other labels and/or values with which no valid log stream can be made. After choosing the right labels and their values, click on the Show logs button. This will build the log query and execute it. This approach is convenient when you don’t know the label names or their values.

Once you feel comfortable, you can start to use the LogQL language to search for logs. Next to the Log browser > button is the place where you can type log queries.

Examples:

  1. If you want to get logs for the calico-node-<hash> pod in the cluster’s kube-system namespace: the name of the node on which calico-node was running is known, but not the hash suffix of the calico-node pod. Also, we want to search for errors in the logs.

    {pod_name=~"calico-node-.+", nodename="ip-10-222-31-182.eu-central-1.compute.internal"} |~ "error"

    Here, you will get as much help as possible from the Grafana by giving you suggestions and auto-completion.

  2. If you want to get the logs from the kubelet systemd service of a given node and search for a pod name in the logs.

    {unit="kubelet.service", nodename="ip-10-222-31-182.eu-central-1.compute.internal"} |~ "pod name"

NOTE: Under the unit label there are only the docker, containerd, kubelet and kernel logs.

  3. If you want to get the logs from the cloud-config-downloader systemd service of a given node and search for a string in the logs.

    {job="systemd-combine-journal",nodename="ip-10-222-31-182.eu-central-1.compute.internal"} | unpack | unit="cloud-config-downloader.service" |~ "last execution was"

NOTE: {job="systemd-combine-journal",nodename="<node name>"} stream pack all logs from systemd services except docker, containerd, kubelet and kernel. To filter those log by unit you have to unpack them first.

  4. Retrieving events:
  • If you want to get the events from the shoot kube-system namespace generated by kubelet and related to the node-problem-detector:

    {job="event-logging"} | unpack | origin_extracted="shoot",source="kubelet",object=~".*node-problem-detector.*"

  • If you want to get the events generated by MCM in the shoot control plane in the seed:

    {job="event-logging"} | unpack | origin_extracted="seed",source=~".*machine-controller-manager.*"

NOTE: In order to group events by origin, one has to specify origin_extracted, because the origin label is reserved for all of the logs from the seed and the event-logger resides in the seed, so all of its logs appear as if they come only from the seed. The actual origin is embedded in the unpacked event. When unpacked, the embedded origin becomes origin_extracted.

Expose logs for a component to the User Grafana

Exposing logs for a new component to the User's Grafana is described here.

Configuration

Fluent-bit

The Fluent-bit configuration can be found in charts/seed-bootstrap/charts/fluent-bit/templates/fluent-bit-configmap.yaml. There are five different specifications (a simplified, illustrative example follows the list below):

  • SERVICE: Defines the location of the server specifications
  • INPUT: Defines the location of the input stream of the logs
  • OUTPUT: Defines the location of the output source (Loki for example)
  • FILTER: Defines filters which match specific keys
  • PARSER: Defines parsers which are used by the filters
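
The following is only a heavily simplified sketch of such a ConfigMap to illustrate the five specification types; plugin names, paths, and labels are placeholders and do not reproduce the actual chart content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: garden
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         30
        Log_Level     info
        Parsers_File  parsers.conf
    [INPUT]
        Name          tail
        Path          /var/log/containers/*.log
        Tag           kubernetes.*
    [FILTER]
        # example filter matching a specific tag and adding a key
        Name          modify
        Match         kubernetes.*
        Add           origin seed
    [OUTPUT]
        Name          loki
        Match         kubernetes.*
        Host          loki.garden.svc
        Port          3100
  parsers.conf: |
    # PARSER sections are defined in a separate parsers file referenced via Parsers_File
    [PARSER]
        Name          docker
        Format        json
        Time_Key      time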

Loki

The Loki configuration can be found in charts/seed-bootstrap/charts/loki/templates/loki-configmap.yaml

The main specifications there are:

  • Index configuration: Currently the following one is used:
    schema_config:
      configs:
      - from: 2018-04-15
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
  • from: The date from which log collection starts. Using a date in the past is okay.
  • store: The DB used for storing the index.
  • object_store: Where the data is stored.
  • schema: The schema version which should be used (v11 is currently recommended).
  • index.prefix: The prefix for the index.
  • index.period: The period for updating the indices.

A new index is added by defining a new config block. Its from field should start at the current day plus the previous index.period and must not overlap with the current index. The prefix should also be different:

    schema_config:
      configs:
      - from: 2018-04-15
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
      - from: 2020-06-18
        store: boltdb
        object_store: filesystem
        schema: v11
        index:
          prefix: index_new_
          period: 24h
  • chunk_store_config Configuration
    chunk_store_config:
      max_look_back_period: 336h

chunk_store_config.max_look_back_period should be the same as the retention_period.

  • table_manager Configuration
    table_manager:
      retention_deletes_enabled: true
      retention_period: 336h

table_manager.retention_period is the lifetime of each log message. Due to the Loki implementation, messages are guaranteed to be kept for at least (table_manager.retention_period - index.period). For example, with retention_period: 336h and index.period: 24h, messages are guaranteed to be kept for at least 312 hours (13 days).

Grafana

The Grafana configurations can be found in charts/seed-bootstrap/charts/templates/grafana/grafana-datasources-configmap.yaml and charts/seed-monitoring/charts/grafana/templates/grafana-datasources-configmap.yaml

This is the Loki configuration that Grafana uses:

    - name: loki
      type: loki
      access: proxy
      url: http://loki.{{ .Release.Namespace }}.svc:3100
      jsonData:
        maxLines: 5000
  • name: The name of the datasource.
  • type: The type of the datasource.
  • access: Should be set to proxy.
  • url: Loki's URL, consisting of the in-cluster Service name and Loki's port (3100).
  • jsonData.maxLines: The limit of the log messages which Grafana will show to the users.

Decrease this value if the browser works slowly!

15 - Managed Seed

Register Shoot as Seed

An existing shoot can be registered as a seed by creating a ManagedSeed resource. This resource contains:

  • The name of the shoot that should be registered as seed.
  • A gardenlet section that contains:
    • gardenlet deployment parameters, such as the number of replicas, the image, etc.
    • The GardenletConfiguration resource that contains controllers configuration, feature gates, and a seedConfig section that contains the Seed spec and parts of its metadata.
    • Additional configuration parameters, such as the garden connection bootstrap mechanism (see TLS Bootstrapping), and whether to merge the provided configuration with the configuration of the parent gardenlet.

gardenlet is deployed to the shoot, and it registers a new seed upon startup based on the seedConfig section.

Note: Earlier Gardener versions allowed specifying a seedTemplate directly in the ManagedSeed resource. This feature is discontinued; any seed configuration must be provided via the GardenletConfiguration.

Note the following important aspects:

  • Unlike the Seed resource, the ManagedSeed resource is namespaced. Currently, managed seeds are restricted to the garden namespace.
  • The newly created Seed resource always has the same name as the ManagedSeed resource. Attempting to specify a different name in seedConfig will fail.
  • The ManagedSeed resource must always refer to an existing shoot. Attempting to create a ManagedSeed referring to a non-existing shoot will fail.
  • A shoot that is being referred to by a ManagedSeed cannot be deleted. Attempting to delete such a shoot will fail.
  • You can omit practically everything from the gardenlet section, including all or most of the Seed spec fields. Proper defaults will be supplied in all cases, based either on the most common use cases or the information already available in the Shoot resource.
  • Also, if your seed is configured to host HA shoot control planes, then gardenlet will be deployed with multiple replicas across nodes or availability zones by default.
  • Some Seed spec fields, for example the provider type and region, networking CIDRs for pods, services, and nodes, etc., must be the same as the corresponding Shoot spec fields of the shoot that is being registered as seed. Attempting to use different values (except empty ones, so that they are supplied by the defaulting mechanism) will fail.

Deploying Gardenlet to the Shoot

To register a shoot as a seed and deploy gardenlet to the shoot using a default configuration, create a ManagedSeed resource similar to the following:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: my-managed-seed
  namespace: garden
spec:
  shoot:
    name: crazy-botany
  gardenlet: {}

For an example that uses non-default configuration, see 55-managed-seed-gardenlet.yaml. A rough sketch of what such a configuration can look like is shown below.
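
The sketch below is purely illustrative; all field values are placeholders and it does not reproduce the referenced example file. Consult that file for the authoritative structure:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: my-managed-seed
  namespace: garden
spec:
  shoot:
    name: crazy-botany
  gardenlet:
    deployment:
      replicaCount: 1
    config:
      apiVersion: gardenlet.config.gardener.cloud/v1alpha1
      kind: GardenletConfiguration
      seedConfig:
        metadata:
          labels:
            environment: playground   # illustrative label
        spec:
          settings:
            excessCapacityReservation:
              enabled: false
    bootstrap: BootstrapToken
    mergeWithParent: true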

Renewing the Gardenlet Kubeconfig Secret

In order to make the ManagedSeed controller renew the gardenlet's kubeconfig secret, annotate the ManagedSeed with gardener.cloud/operation=renew-kubeconfig. This will trigger a reconciliation during which the kubeconfig secret is deleted and the bootstrapping is performed again (during which gardenlet obtains a new client certificate).
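
For example, assuming a ManagedSeed named my-managed-seed in the garden namespace, the annotation can be set like this:

kubectl -n garden annotate managedseed my-managed-seed gardener.cloud/operation=renew-kubeconfig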

It is also possible to trigger the renewal on the secret directly, see here.

Specifying apiServer replicas and autoscaler options

There are a few configuration options that are not supported in a Shoot resource, but for backward compatibility reasons it is possible to specify them for a Shoot that is referred to by a ManagedSeed. These options are:

  • apiServer.autoscaler.minReplicas: Controls the minimum number of kube-apiserver replicas for the shoot registered as seed cluster.
  • apiServer.autoscaler.maxReplicas: Controls the maximum number of kube-apiserver replicas for the shoot registered as seed cluster.
  • apiServer.replicas: Controls how many kube-apiserver replicas the shoot registered as seed cluster gets by default.

It is possible to specify these options via the shoot.gardener.cloud/managed-seed-api-server annotation on the Shoot resource. Example configuration:

  annotations:
    shoot.gardener.cloud/managed-seed-api-server: "apiServer.replicas=3,apiServer.autoscaler.minReplicas=3,apiServer.autoscaler.maxReplicas=6"

Enforced configuration options

The following configuration options are enforced by Gardener API server for the ManagedSeed resources:

  1. The vertical pod autoscaler should be enabled from the Shoot specification.

    The vertical pod autoscaler is a prerequisite for a Seed cluster. It is possible to enable the VPA feature for a Seed (using the Seed spec) and for a Shoot (using the Shoot spec). In the context of ManagedSeeds, enabling the VPA in the Seed spec (instead of the Shoot spec) offers less flexibility and increases network transfer and cost. For these reasons, the Gardener API server enforces that the vertical pod autoscaler is enabled from the Shoot specification.

  2. The nginx-ingress addon should not be enabled for a Shoot referred by a ManagedSeed.

    An Ingress controller is also a prerequisite for a Seed cluster. For a Seed cluster it is possible to enable the Gardener managed Ingress controller or to deploy a self-managed Ingress controller. There is also the nginx-ingress addon that can be enabled for a Shoot (using the Shoot spec). However, the Shoot nginx-ingress addon is deprecated and not recommended for production clusters. For these reasons, the Gardener API server does not allow the Shoot nginx-ingress addon to be enabled for ManagedSeeds.

16 - NodeLocalDNS Configuration

NodeLocalDNS Configuration

This is a short guide describing how to enable DNS caching on the shoot cluster nodes.

Background

Currently, Gardener is using CoreDNS as a deployment that is auto-scaled horizontally to handle QPS-intensive applications. However, this does not seem to be enough to completely circumvent DNS bottlenecks such as:

  • Cloud provider limits for DNS lookups.
  • Unreliable UDP connections that force a period of timeout in case packets are dropped.
  • Unnecessary node hopping, since CoreDNS is not deployed on all nodes; as a result, DNS queries end up traversing multiple nodes before reaching the destination server.
  • Inefficient load-balancing of services (e.g., round-robin might not be enough when using IPTables mode)
  • and more …

To work around the issues described above, node-local-dns was introduced. The architecture is described below. The idea is simple:

  • For new queries, the connection is upgraded from UDP to TCP and forwarded towards the cluster IP for the original CoreDNS server.
  • For previously resolved queries, an immediate response from the same node where the requester workload / pod resides is provided.

node-local-dns-architecture

Configuring NodeLocalDNS

All that needs to be done to enable the usage of the node-local-dns feature is to set the corresponding option (spec.systemComponents.nodeLocalDNS.enabled) in the Shoot resource to true:

...
spec:
  ...
  systemComponents:
    nodeLocalDNS:
      enabled: true
...

It is worth noting that:

  • When migrating from IPVS to IPTables, existing pods will continue to leverage the node-local-dns cache.
  • When migrating from IPTables to IPVS, only newer pods will be switched to the node-local-dns cache.
  • The configuration change will take effect during the next shoot reconciliation. This happens automatically once per day in the maintenance period (unless you have disabled it).
  • During the reconfiguration of node-local-dns there might be a short disruption in terms of domain name resolution, depending on the setup. Usually, DNS requests are repeated for some time, as UDP is an unreliable protocol, but that strictly depends on the application and the way the domain name resolution happens. It is recommended to let the shoot be reconciled during the next maintenance period.
  • If a short DNS outage is not a big issue, you can trigger reconciliation directly after changing the setting, as shown below.
  • Switching node-local-dns off by disabling the setting can be a rather destructive operation that will result in pods without a working DNS configuration.
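
A reconciliation can be triggered directly with the gardener.cloud/operation=reconcile annotation; for example (shoot name and project namespace are placeholders):

kubectl -n garden-my-project annotate shoot my-shoot gardener.cloud/operation=reconcile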

For more information about node-local-dns please refer to the KEP or to the usage documentation.

Known Issues

Custom DNS configuration may not work as expected in conjunction with NodeLocalDNS. Please refer to Custom DNS Configuration.

17 - OpenIDConnect Presets

ClusterOpenIDConnectPreset and OpenIDConnectPreset

This page provides an overview of ClusterOpenIDConnectPresets and OpenIDConnectPresets, which are objects for injecting OpenIDConnect configuration into Shoots at creation time. The injected information contains configuration for the Kube API Server and, optionally, configuration for kubeconfig generation using said configuration.

OpenIDConnectPreset

An OpenIDConnectPreset is an API resource for injecting additional runtime OIDC requirements into a Shoot at creation time. You use label selectors to specify the Shoot to which a given OpenIDConnectPreset applies.

Using OpenIDConnectPresets allows project owners to not have to explicitly provide the same OIDC configuration for every Shoot in their Project.

For more information about the background, see the issue for OpenIDConnectPreset.

How OpenIDConnectPreset works

Gardener provides an admission controller (OpenIDConnectPreset) which, when enabled, applies OpenIDConnectPresets to incoming Shoot creation requests. When a Shoot creation request occurs, the system does the following:

  • Retrieve all OpenIDConnectPresets available for use in the Shoot namespace.

  • Check if the shoot label selectors of any OpenIDConnectPreset match the labels on the Shoot being created.

  • If multiple presets match, then only one is chosen; the candidates are sorted based on:

    1. the .spec.weight value,
    2. lexicographic ordering of their names (e.g. 002preset > 001preset).
  • If the Shoot already has a .spec.kubernetes.kubeAPIServer.oidcConfig, then no mutation occurs.

Simple OpenIDConnectPreset example

This is a simple example to show how a Shoot is modified by the OpenIDConnectPreset:

apiVersion: settings.gardener.cloud/v1alpha1
kind: OpenIDConnectPreset
metadata:
  name:  test-1
  namespace: default
spec:
  shootSelector:
    matchLabels:
      oidc: enabled
  server:
    clientID: test-1
    issuerURL: https://foo.bar
    # caBundle: |
    #   -----BEGIN CERTIFICATE-----
    #   Li4u
    #   -----END CERTIFICATE-----
    groupsClaim: groups-claim
    groupsPrefix: groups-prefix
    usernameClaim: username-claim
    usernamePrefix: username-prefix
    signingAlgs:
    - RS256
    requiredClaims:
      key: value
  client:
    secret: oidc-client-secret
    extraConfig:
      extra-scopes: "email,offline_access,profile"
      foo: bar
  weight: 90

Create the OpenIDConnectPreset:

kubectl apply -f preset.yaml

Examine the created OpenIDConnectPreset:

kubectl get openidconnectpresets
NAME     ISSUER            SHOOT-SELECTOR   AGE
test-1   https://foo.bar   oidc=enabled     1s

Simple Shoot example:

This is a sample of a Shoot with some fields omitted:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: preset
  namespace: default
  labels:
    oidc: enabled
spec:
  kubernetes:
    allowPrivilegedContainers: true
    version: 1.20.2

Create the Shoot:

kubectl apply -f shoot.yaml

Examine the created Shoot:

kubectl get shoot preset -o yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: preset
  namespace: default
  labels:
    oidc: enabled
spec:
  kubernetes:
    kubeAPIServer:
      oidcConfig:
        clientAuthentication:
          extraConfig:
            extra-scopes: email,offline_access,profile
            foo: bar
          secret: oidc-client-secret
        clientID: test-1
        groupsClaim: groups-claim
        groupsPrefix: groups-prefix
        issuerURL: https://foo.bar
        requiredClaims:
          key: value
        signingAlgs:
        - RS256
        usernameClaim: username-claim
        usernamePrefix: username-prefix
    version: 1.20.2

Disable OpenIDConnectPreset

The OpenIDConnectPreset admission control is enabled by default. To disable it use the --disable-admission-plugins flag on the gardener-apiserver.

For example:

--disable-admission-plugins=OpenIDConnectPreset

ClusterOpenIDConnectPreset

A ClusterOpenIDConnectPreset is an API resource for injecting additional runtime OIDC requirements into a Shoot at creation time. In contrast to OpenIDConnectPreset, it is a cluster-scoped resource. You use label selectors to specify the Project and Shoot to which a given ClusterOpenIDConnectPreset applies.

Using ClusterOpenIDConnectPresets allows cluster owners to not have to explicitly provide the same OIDC configuration for every Shoot in a specific Project.

For more information about the background, see the issue for ClusterOpenIDConnectPreset.

How ClusterOpenIDConnectPreset works

Gardener provides an admission controller (ClusterOpenIDConnectPreset) which, when enabled, applies ClusterOpenIDConnectPresets to incoming Shoot creation requests. When a Shoot creation request occurs, the system does the following:

  • Retrieve all ClusterOpenIDConnectPresets available.

  • Check if the project label selector of any ClusterOpenIDConnectPreset matches the labels of the Project in which the Shoot is being created.

  • Check if the shoot label selectors of any ClusterOpenIDConnectPreset match the labels on the Shoot being created.

  • If multiple presets match, then only one is chosen; the candidates are sorted based on:

    1. the .spec.weight value,
    2. lexicographic ordering of their names (e.g. 002preset > 001preset).
  • If the Shoot already has a .spec.kubernetes.kubeAPIServer.oidcConfig, then no mutation occurs.

Note: Due to the previous requirement, if a Shoot is matched by both an OpenIDConnectPreset and a ClusterOpenIDConnectPreset, then the OpenIDConnectPreset takes precedence over the ClusterOpenIDConnectPreset.

Simple ClusterOpenIDConnectPreset example

This is a simple example to show how a Shoot is modified by the ClusterOpenIDConnectPreset:

apiVersion: settings.gardener.cloud/v1alpha1
kind: ClusterOpenIDConnectPreset
metadata:
  name:  test
spec:
  shootSelector:
    matchLabels:
      oidc: enabled
  projectSelector: {} # selects all projects.
  server:
    clientID: cluster-preset
    issuerURL: https://foo.bar
    # caBundle: |
    #   -----BEGIN CERTIFICATE-----
    #   Li4u
    #   -----END CERTIFICATE-----
    groupsClaim: groups-claim
    groupsPrefix: groups-prefix
    usernameClaim: username-claim
    usernamePrefix: username-prefix
    signingAlgs:
    - RS256
    requiredClaims:
      key: value
  client:
    secret: oidc-client-secret
    extraConfig:
      extra-scopes: "email,offline_access,profile"
      foo: bar
  weight: 90

Create the ClusterOpenIDConnectPreset:

kubectl apply -f preset.yaml

Examine the created ClusterOpenIDConnectPreset:

kubectl get clusteropenidconnectpresets
NAME     ISSUER            PROJECT-SELECTOR   SHOOT-SELECTOR   AGE
test     https://foo.bar   <none>             oidc=enabled     1s

This is a sample of a Shoot with some fields omitted:

kind: Shoot
apiVersion: core.gardener.cloud/v1beta1
metadata:
  name: preset
  namespace: default
  labels:
    oidc: enabled
spec:
  kubernetes:
    allowPrivilegedContainers: true
    version: 1.20.2

Create the Shoot:

kubectl apply -f shoot.yaml

Examine the created Shoot:

kubectl get shoot preset -o yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: preset
  namespace: default
  labels:
    oidc: enabled
spec:
  kubernetes:
    kubeAPIServer:
      oidcConfig:
        clientAuthentication:
          extraConfig:
            extra-scopes: email,offline_access,profile
            foo: bar
          secret: oidc-client-secret
        clientID: cluster-preset
        groupsClaim: groups-claim
        groupsPrefix: groups-prefix
        issuerURL: https://foo.bar
        requiredClaims:
          key: value
        signingAlgs:
        - RS256
        usernameClaim: username-claim
        usernamePrefix: username-prefix
    version: 1.20.2

Disable ClusterOpenIDConnectPreset

The ClusterOpenIDConnectPreset admission control is enabled by default. To disable it use the --disable-admission-plugins flag on the gardener-apiserver.

For example:

--disable-admission-plugins=ClusterOpenIDConnectPreset

18 - Pod Security

Migrating From PodSecurityPolicys To PodSecurity Admission Controller

Kubernetes has deprecated the PodSecurityPolicy API in v1.21, and it will be removed in v1.25. With v1.23, a new feature called PodSecurity was promoted to beta. From v1.25 onwards, there will be no API serving PodSecurityPolicys, so you have to clean up all existing PSPs before upgrading your cluster. Detailed migration steps are described here.

After migration, you should disable the PodSecurityPolicy admission plugin. To do so, you have to add:

admissionPlugins:
- name: PodSecurityPolicy
  disabled: true

in the spec.kubernetes.kubeAPIServer.admissionPlugins field in the Shoot resource. Please refer to the example Shoot manifest here.
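
In the context of the Shoot spec, this looks as follows:

spec:
  kubernetes:
    kubeAPIServer:
      admissionPlugins:
      - name: PodSecurityPolicy
        disabled: true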

Only if the PodSecurityPolicy admission plugin is disabled can the cluster be upgraded to v1.25.

⚠️ You should disable the admission plugin and wait until Gardener finishes at least one Shoot reconciliation before upgrading to v1.25. This is to make sure all the PodSecurityPolicy related resources deployed by Gardener are cleaned up.

Admission Configuration For The PodSecurity Admission Plugin

If you wish to add your custom configuration for the PodSecurity plugin and your cluster version is v1.23+, you can do so in the Shoot spec under .spec.kubernetes.kubeAPIServer.admissionPlugins by adding:

admissionPlugins:
- name: PodSecurity
  config:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    # Defaults applied when a mode label is not set.
    #
    # Level label values must be one of:
    # - "privileged" (default)
    # - "baseline"
    # - "restricted"
    #
    # Version label values must be one of:
    # - "latest" (default) 
    # - specific version like "v1.25"
    defaults:
      enforce: "privileged"
      enforce-version: "latest"
      audit: "privileged"
      audit-version: "latest"
      warn: "privileged"
      warn-version: "latest"
    exemptions:
      # Array of authenticated usernames to exempt.
      usernames: []
      # Array of runtime class names to exempt.
      runtimeClasses: []
      # Array of namespaces to exempt.
      namespaces: []

⚠️ Note that pod-security.admission.config.k8s.io/v1 configuration requires v1.25+. For v1.23 and v1.24, use pod-security.admission.config.k8s.io/v1beta1. For v1.22, use pod-security.admission.config.k8s.io/v1alpha1.

Also note that in v1.22 the feature gate PodSecurity is not enabled by default. You have to add:

featureGates:
  PodSecurity: true

under .spec.kubernetes.kubeAPIServer.

For proper functioning of Gardener, kube-system namespace will also be automatically added to the exemptions.namespaces list.

.spec.kubernetes.allowPrivilegedContainers in the Shoot spec

If this field is set to true then all authenticated users can use the “gardener.privileged” PodSecurityPolicy, allowing full unrestricted access to Pod features. However, the PodSecurityPolicy admission plugin is removed in Kubernetes v1.25 and PodSecurity has taken its place as its successor. Therefore, this field doesn’t have any relevance in versions >= v1.25 anymore. If you need to set a default pod admission level for your cluster, follow this documentation.

Note: You should remove this field from the Shoot spec for v1.24 clusters after migrating to the new PodSecurity admission controller, before upgrading your cluster to v1.25.
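
A minimal sketch of how the field can be removed with a JSON patch (shoot name and project namespace are placeholders):

kubectl -n garden-my-project patch shoot my-shoot --type=json \
  -p='[{"op": "remove", "path": "/spec/kubernetes/allowPrivilegedContainers"}]'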

19 - Projects

Projects

The Gardener API server supports a cluster-scoped Project resource which is used for data isolation between individual Gardener consumers. For example, each development team has its own project to manage its own shoot clusters.

Each Project is backed by a Kubernetes Namespace that contains the actual related Kubernetes resources like Secrets or Shoots.

Example resource:

apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
  name: dev
spec:
  namespace: garden-dev
  description: "This is my first project"
  purpose: "Experimenting with Gardener"
  owner:
    apiGroup: rbac.authorization.k8s.io
    kind: User
    name: john.doe@example.com
  members:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: alice.doe@example.com
    role: admin
  # roles:
  # - viewer 
  # - uam
  # - serviceaccountmanager
  # - extension:foo
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: bob.doe@example.com
    role: viewer
# tolerations:
#   defaults:
#   - key: <some-key>
#   whitelist:
#   - key: <some-key>

The .spec.namespace field is optional and is initialized if unset. The name of the resulting namespace will be determined based on the Project name and UID, e.g. garden-dev-5aef3. It’s also possible to adopt existing namespaces by labeling them gardener.cloud/role=project and project.gardener.cloud/name=dev beforehand (otherwise, they cannot be adopted).

When deleting a Project resource, the corresponding namespace is also deleted. To keep a namespace after project deletion, an administrator/operator (not Project members!) can annotate the project-namespace with namespace.gardener.cloud/keep-after-project-deletion.
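
For example, an existing namespace could be prepared for adoption, and later kept after project deletion, like this (namespace names are placeholders; for the annotation, its presence is what matters and the value shown is illustrative):

kubectl label namespace my-existing-namespace gardener.cloud/role=project project.gardener.cloud/name=dev
kubectl annotate namespace my-existing-namespace namespace.gardener.cloud/keep-after-project-deletion=true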

The .spec.description and .spec.purpose fields can be used to describe to fellow team members and Gardener operators what this project is used for.

Each project has one dedicated owner, configured in .spec.owner using the rbac.authorization.k8s.io/v1.Subject type. The owner is the main contact person for Gardener operators. Please note that the .spec.owner field is deprecated and will be removed in future API versions in favor of the owner role, see below.

The list of members (again a list in .spec.members[] using the rbac.authorization.k8s.io/v1.Subject type) contains all the people that are associated with the project in any way. Each project member must have at least one role (currently described in .spec.members[].role, additional roles can be added to .spec.members[].roles[]). The following roles exist:

  • admin: This allows to fully manage resources inside the project (e.g., secrets, shoots, configmaps, and similar). Mind that the admin role has read-only access to service accounts.
  • serviceaccountmanager: This allows to fully manage service accounts inside the project namespace and request tokens for them. The permissions of the created service accounts are instead managed by the admin role. Please refer to Service Account Manager.
  • uam: This allows to add/modify/remove human users or groups to/from the project member list.
  • viewer: This allows to read all resources inside the project except secrets.
  • owner: This combines the admin, uam and serviceaccountmanager roles.
  • Extension roles (prefixed with extension:): Please refer to Extending Project Roles.

The project controller inside the Gardener Controller Manager is managing RBAC resources that grant the described privileges to the respective members.

There are three central ClusterRoles gardener.cloud:system:project-member, gardener.cloud:system:project-viewer and gardener.cloud:system:project-serviceaccountmanager that grant the permissions for namespaced resources (e.g., Secrets, Shoots, ServiceAccounts, etc.). Via referring RoleBindings created in the respective namespace the project members get bound to these ClusterRoles and, thus, the needed permissions. There are also project-specific ClusterRoles granting the permissions for cluster-scoped resources, e.g. the Namespace or Project itself.
For each role, the following ClusterRoles, ClusterRoleBindings, and RoleBindings are created:

| Role | ClusterRole | ClusterRoleBinding | RoleBinding |
| --- | --- | --- | --- |
| admin | gardener.cloud:system:project-member:<projectName> | gardener.cloud:system:project-member:<projectName> | gardener.cloud:system:project-member |
| serviceaccountmanager | | | gardener.cloud:system:project-serviceaccountmanager |
| uam | gardener.cloud:system:project-uam:<projectName> | gardener.cloud:system:project-uam:<projectName> | |
| viewer | gardener.cloud:system:project-viewer:<projectName> | gardener.cloud:system:project-viewer:<projectName> | gardener.cloud:system:project-viewer |
| owner | gardener.cloud:system:project:<projectName> | gardener.cloud:system:project:<projectName> | |
| extension:* | gardener.cloud:extension:project:<projectName>:<extensionRoleName> | | gardener.cloud:extension:project:<projectName>:<extensionRoleName> |

User Access Management

For Projects created before Gardener v1.8 all admins were allowed to manage other members. Beginning with v1.8 the new uam role is being introduced. It is backed by the manage-members custom RBAC verb which allows to add/modify/remove human users or groups to/from the project member list. Human users are subjects with kind=User and name!=system:serviceaccount:*, and groups are subjects with kind=Group. The management of service account subjects (kind=ServiceAccount or name=system:serviceaccount:*) is not controlled via the uam custom verb but with the standard update/patch verbs for projects.

All newly created projects will only bind the owner to the uam role. The owner can still grant the uam role to other members if desired. For projects created before Gardener v1.8 the Gardener Controller Manager will migrate all projects to also assign the uam role to all admin members (to not break existing use-cases). The corresponding migration logic is present in Gardener Controller Manager from v1.8 to v1.13. The project owner can gradually remove these roles if desired.

Stale Projects

When a project is not actively used for some period of time, it is marked as "stale" by a controller called "Stale Projects Reconciler". Once the project has been marked as stale and remains unused for a further time frame, it is deleted by that controller.

20 - Reversed VPN Tunnel

Reversed VPN Tunnel Setup and Configuration

This is a short guide describing how to enable tunneling traffic from shoot cluster to seed cluster instead of the default “seed to shoot” direction.

The OpenVPN Default

By default, Gardener makes use of OpenVPN to connect the shoot controlplane running on the seed cluster to the dataplane running on the shoot worker nodes, usually in isolated networks. This is achieved by having a sidecar to certain control plane components such as the kube-apiserver and prometheus.

With a sidecar, all traffic directed to the cluster is intercepted by iptables rules and redirected to the tunnel endpoint in the shoot cluster deployed behind a cloud loadbalancer. This has the following disadvantages:

  • Every shoot would require an additional loadbalancer; this accounts for additional overhead in terms of both costs and troubleshooting efforts.
  • Private access use-cases would not be possible without having a seed residing in the same private domain as a hard requirement. For example, have a look at this issue.
  • Providing a public endpoint to access components in the shoot poses a security risk.

This is how it looks today with the OpenVPN solution:

APIServer | VPN-seed ---> internet ---> LB --> VPN-Shoot (4314) --> Pods | Nodes | Services

Reversing the Tunnel

To address the above issues, the direction in which the tunnel is established can be reverted, i.e., instead of having the client reside in the seed, the client is deployed in the shoot and initiates the connection from there. This way, there is no need to deploy a special purpose loadbalancer for the sake of addressing the dataplane. In addition to saving costs, this is considered the more secure alternative. For more information on how this is achieved, please have a look at the following GEP.

This is how it looks with the reversed tunnel:

APIServer --> Envoy-Proxy | VPN-Seed-Server <-- Istio/Envoy-Proxy <-- SNI API Server Endpoint <-- LB (one for all clusters of a seed) <--- internet <--- VPN-Shoot-Client --> Pods | Nodes | Services

How to Configure

To enable the usage of the reversed vpn tunnel feature, either the Gardenlet ReversedVPN feature-gate must be set to true as shown below or the shoot must be annotated with "alpha.featuregates.shoot.gardener.cloud/reversed-vpn: true".

featureGates:
  ReversedVPN: true

Please refer to the examples here for more information.

To disable the feature for a specific shoot, the shoot must be annotated with "alpha.featuregates.shoot.gardener.cloud/reversed-vpn: false".
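
For example, the annotation can be set with kubectl as follows (shoot name and project namespace are placeholders); use false instead of true to disable the feature again:

kubectl -n garden-my-project annotate shoot my-shoot alpha.featuregates.shoot.gardener.cloud/reversed-vpn=true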

Once the feature is enabled, a vpn-seed-server deployment will be added to the controlplane. The kube-apiserver will be configured to connect to resources in the dataplane, such as pods, services, and nodes, through the vpn-seed-server via the HTTP proxy/CONNECT protocol. In the dataplane of the cluster, the vpn-shoot will establish the connection to the vpn-seed-server indirectly, using the SNI API Server endpoint as an HTTP proxy. After the connection has been established, requests from the kube-apiserver will be handled by the tunnel.

Please note this feature is still in Beta, so you might see instabilities every now and then.

21 - Seed Bootstrapping

Seed Bootstrapping

Whenever the Gardenlet is responsible for a new Seed resource, its "seed controller" is activated. One part of this controller's reconciliation logic is deploying certain components into the garden namespace of the seed cluster itself. These components are required to spawn and manage control planes for shoot clusters later on. This document provides an overview of which actions are performed during this bootstrapping phase and explains the rationale behind them.

Dependency Watchdog

The dependency watchdog (abbreviation: DWD) is a component developed separately in the gardener/dependency-watchdog GitHub repository. Gardener is using it for two purposes:

  1. Prevention of melt-down situations when the load balancer used to expose the kube-apiserver of shoot clusters goes down while the kube-apiserver itself is still up and running
  2. Fast recovery times for crash-looping pods when depending pods are again available

For the sake of separating these concerns, two instances of the DWD are deployed by the seed controller.

Probe

The dependency-watchdog-probe deployment is responsible for the first point mentioned above.

The kube-apiserver of shoot clusters is exposed via a load balancer, usually with an attached public IP, which serves as the main entry point when it comes to interaction with the shoot cluster (e.g., via kubectl). While end-users are talking to their clusters via this load balancer, other control plane components like the kube-controller-manager or kube-scheduler run in the same namespace/same cluster, so they can communicate via the in-cluster Service directly instead of using the detour with the load balancer. However, the worker nodes of shoot clusters run in isolated, distinct networks. This means that the kubelets and kube-proxys also have to talk to the control plane via the load balancer.

The kube-controller-manager has a special control loop called nodelifecycle which will set the status of Nodes to NotReady in case the kubelet stops regularly renewing its lease / sending its heartbeat. This will trigger other self-healing capabilities of Kubernetes, for example the eviction of pods from such "unready" nodes to healthy nodes. Similarly, the cloud-controller-manager has a control loop that will disconnect load balancers from "unready" nodes, i.e., such workload would no longer be accessible until moved to a healthy node.

While these are awesome Kubernetes features on their own, they have a dangerous drawback when applied in the context of Gardener’s architecture: When the kube-apiserver load balancer fails for whatever reason then the kubelets can’t talk to the kube-apiserver to renew their lease anymore. After a minute or so the kube-controller-manager will get the impression that all nodes have died and will mark them as NotReady. This will trigger above mentioned eviction as well as detachment of load balancers. As a result, the customer’s workload will go down and become unreachable.

This is exactly the situation that the DWD prevents: It regularly tries to talk to the kube-apiservers of the shoot clusters, once by using their load balancer, and once by talking via the in-cluster Service. If it detects that the kube-apiserver is reachable internally but not externally it scales down the kube-controller-manager to 0. This will prevent it from marking the shoot worker nodes as “unready”. As soon as the kube-apiserver is reachable externally again the kube-controller-manager will be scaled up to 1 again.

Endpoint

The dependency-watchdog-endpoint deployment is responsible for the second point mentioned above.

Kubernetes is restarting failing pods with an exponentially increasing backoff time. While this is a great strategy to prevent system overloads it has the disadvantage that the delay between restarts is increasing up to multiple minutes very fast.

In the Gardener context, we are deploying many components that depend on other components. For example, the kube-apiserver depends on a running etcd, and the kube-controller-manager and kube-scheduler depend on a running kube-apiserver. In case such a "higher-level" component fails for whatever reason, the dependent pods will fail and end up in crash-loops. As Kubernetes does not know anything about these hierarchies, it won't recognize that such pods can be restarted faster as soon as the components they depend on are up and running again.

This is exactly the situation in which the DWD will become active: If it detects that a certain Service is available again (e.g., after the etcd was temporarily down while being moved to another seed node) then DWD will restart all crash-looping dependant pods. These dependant pods are detected via a pre-configured label selector.

As of today, the DWD is configured to restart a crash-looping kube-apiserver after etcd became available again, or any pod depending on the kube-apiserver that has a gardener.cloud/role=controlplane label (e.g., kube-controller-manager, kube-scheduler, etc.).

22 - Seed Settings

Settings for Seeds

The Seed resource offers a few settings that are used to control the behaviour of certain Gardener components. This document provides an overview over the available settings:

Dependency Watchdog

Gardenlet can deploy two instances of the dependency-watchdog into the garden namespace of the seed cluster. One instance only activates the endpoint controller while the second instance only activates the probe controller.

Endpoint Controller

The endpoint controller helps to alleviate the delay during which control plane components remain unavailable by finding the respective pods in CrashLoopBackoff status and restarting them once the components they depend on become ready and available again. For example, if etcd goes down, then the kube-apiserver also goes down (into a CrashLoopBackoff state). If etcd comes up again, then (without the endpoint controller) it might take some time until the kube-apiserver gets restarted as well.

It can be enabled/disabled via the .spec.settings.dependencyWatchdog.endpoint.enabled field. It defaults to true.

Probe Controller

The probe controller scales down the kube-controller-manager of shoot clusters in case their respective kube-apiserver is not reachable via its external ingress. This is in order to avoid melt-down situations since the kube-controller-manager uses in-cluster communication when talking to the kube-apiserver, i.e., it wouldn’t be affected if the external access to the kube-apiserver is interrupted for whatever reason. The kubelets on the shoot worker nodes, however, would indeed be affected since they typically run in different networks and use the external ingress when talking to the kube-apiserver. Hence, without scaling down kube-controller-manager, the nodes might be marked as NotReady and eventually replaced (since the kubelets cannot report their status anymore). To prevent such unnecessary turbulences, kube-controller-manager is being scaled down until the external ingress becomes available again.

It can be enabled/disabled via the .spec.settings.dependencyWatchdog.probe.enabled field. It defaults to true.

Reserve Excess Capacity

If the excess capacity reservation is enabled then the Gardenlet will deploy a special Deployment into the garden namespace of the seed cluster. This Deployment’s pod template has only one container, the pause container, which simply runs in an infinite loop. The priority of the deployment is very low, so any other pod will preempt these pause pods. This is especially useful if new shoot control planes are created in the seed. In case the seed cluster runs at its capacity then there is no waiting time required during the scale-up. Instead, the low-priority pause pods will be preempted and allow newly created shoot control plane pods to be scheduled fast. In the meantime, the cluster-autoscaler will trigger the scale-up because the preempted pause pods want to run again. However, this delay doesn’t affect the important shoot control plane pods which will improve the user experience.

It can be enabled/disabled via the .spec.settings.excessCapacityReservation.enabled field. It defaults to true.

Scheduling

By default, the Gardener Scheduler will consider all seed clusters when a new shoot cluster shall be created. However, administrators/operators might want to exclude some of them from being considered by the scheduler. Therefore, seed clusters can be marked as “invisible”. In this case, the scheduler simply ignores them as if they wouldn’t exist. Shoots can still use the invisible seed but only by explicitly specifying the name in their .spec.seedName field.

Seed clusters can be marked visible/invisible via the .spec.settings.scheduling.visible field. It defaults to true.

Shoot DNS

Generally, the Gardenlet creates a few DNS records during the creation/reconciliation of a shoot cluster (see here). However, some infrastructures don’t need/want this behaviour. Instead, they want to directly use the IP addresses/hostnames of load balancers. Another use-case is a local development setup where DNS is not needed for simplicity reasons.

By setting the .spec.settings.shootDNS.enabled field this behavior can be controlled.

ℹ️ In previous Gardener versions (< 1.5) these settings were controlled via taint keys (seed.gardener.cloud/{disable-capacity-reservation,disable-dns,invisible}). The taint keys are no longer supported and removed in version 1.12. The rationale behind it is the implementation of tolerations similar to Kubernetes tolerations. More information about it can be found in #2193.

Load Balancer Services

Gardener creates certain Kubernetes Service objects of type LoadBalancer in the seed cluster. Most prominently, they are used for exposing the shoot control planes, namely the kube-apiserver of the shoot clusters. In most cases, the cloud-controller-manager (responsible for managing these load balancers on the respective underlying infrastructure) supports certain customization and settings via annotations. This document provides a good overview and many examples.

By setting the .spec.settings.loadBalancerServices.annotations field the Gardener administrator can specify a list of annotations which will be injected into the Services of type LoadBalancer.

Vertical Pod Autoscaler

Gardener heavily relies on the Kubernetes vertical-pod-autoscaler component. By default, the seed controller deploys the VPA components into the garden namespace of the respective seed clusters. In case you want to manage the VPA deployment on your own or have a custom one, you might want to disable the automatic deployment by Gardener. Otherwise, you might end up with two VPAs, which will cause erratic behaviour. By setting .spec.settings.verticalPodAutoscaler.enabled=false you can disable the automatic deployment.

⚠️ In any case, there must be a VPA available for your seed cluster. Using a seed without VPA is not supported.

Owner Checks

When a shoot is scheduled to a seed and actually reconciled, Gardener appoints the seed as the current “owner” of the shoot by creating a special “owner DNS record” and checking against it if the seed still owns the shoot in order to guard against “split brain scenario” during control plane migration, as described in GEP-17 Shoot Control Plane Migration “Bad Case” Scenario. This mechanism relies on the DNS resolution of TXT DNS records being possible and highly reliable, since if the owner check fails the shoot will be effectively disabled for the duration of the failure. In environments where resolving TXT DNS records is either not possible or not considered reliable enough, it may be necessary to disable the owner check mechanism, in order to avoid shoots failing to reconcile or temporary outages due to transient DNS failures. By setting the .spec.settings.ownerChecks.enabled=false (default is true) the creation and checking of owner DNS records can be disabled for all shoots scheduled on this seed. Note that if owner checks are disabled, migrating shoots scheduled on this seed to other seeds should be considered unsafe, and in the future will be disabled as well.
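
Putting the settings described above together, the settings section of a Seed manifest could look roughly like the following sketch. The values shown are the defaults where defaults exist, and the load balancer annotation is purely illustrative:

apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed
spec:
  # ... other Seed spec fields omitted
  settings:
    dependencyWatchdog:
      endpoint:
        enabled: true
      probe:
        enabled: true
    excessCapacityReservation:
      enabled: true
    scheduling:
      visible: true
    shootDNS:
      enabled: true
    loadBalancerServices:
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: nlb   # illustrative example
    verticalPodAutoscaler:
      enabled: true
    ownerChecks:
      enabled: true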

23 - Service Account Manager

Service Account Manager

Overview

With Gardener v1.47 a new role called serviceaccountmanager was introduced. This role allows to fully manage ServiceAccounts in the project namespace and request tokens for them. This is the preferred way of managing access to a project namespace, as it aims to replace the usage of the default ServiceAccount secrets that will no longer be generated automatically with Kubernetes v1.24+.

Actions

Once assigned the serviceaccountmanager role, a user can create/update/delete ServiceAccounts in the project namespace.

Create a Service Account

In order to create a ServiceAccount named “robot-user”, run the following kubectl command:

kubectl -n project-abc create sa robot-user

Request a Token for a Service Account

A token for the “robot-user” ServiceAccount can be requested via the TokenRequest API in several ways:

  • using kubectl >= v1.24
kubectl -n project-abc create token robot-user --duration=3600s
  • using kubectl < v1.24
cat <<EOF | kubectl create -f - --raw /api/v1/namespaces/project-abc/serviceaccounts/robot-user/token
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenRequest",
  "spec": {
    "expirationSeconds": 3600
  }
}
EOF
  • directly calling the Kubernetes HTTP API
curl -X POST https://api.gardener/api/v1/namespaces/project-abc/serviceaccounts/robot-user/token \
    -H "Authorization: Bearer <auth-token>" \
    -H "Content-Type: application/json" \
    -d '{
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        "spec": {
          "expirationSeconds": 3600
        }
      }'

Mind that the returned token is not stored within the Kubernetes cluster, will be valid for 3600 seconds, and will be invalidated if the “robot-user” ServiceAccount is deleted. Although expirationSeconds can be modified depending on the needs, the returned token’s validity will not exceed the configured service-account-max-token-expiration duration for the garden cluster. It is advised that the actual expirationTimestamp is verified so that expectations are met. This can be done by asserting the expirationTimestamp in the TokenRequestStatus or the exp claim in the token itself.
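
For example, the expirationTimestamp can be inspected directly from the TokenRequestStatus, here sketched by combining the raw API call from above with jq:

cat <<EOF | kubectl create -f - --raw /api/v1/namespaces/project-abc/serviceaccounts/robot-user/token | jq -r '.status.expirationTimestamp'
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenRequest",
  "spec": {
    "expirationSeconds": 3600
  }
}
EOF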

Delete a Service Account

In order to delete the ServiceAccount named “robot-user”, run the following kubectl command:

kubectl -n project-abc delete sa robot-user

This will invalidate all existing tokens for the “robot-user” ServiceAccount.

24 - Shoot Access

Accessing Shoot Clusters

After creation of a shoot cluster, end-users require a kubeconfig to access it. There are several options available to get to such kubeconfig.

Static Token Kubeconfig

This kubeconfig contains a static token and provides cluster-admin privileges. It is created by default and persisted in the <shoot-name>.kubeconfig secret in the project namespace in the garden cluster.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
...
spec:
  kubernetes:
    enableStaticTokenKubeconfig: true
...

It is not the recommended method to access the shoot cluster as the static token kubeconfig has some security flaws associated with it:

  • The static token in the kubeconfig doesn’t have any expiration date. Read this document to learn how to rotate the static token.
  • The static token doesn't have any user identity associated with it. The user in that token will always be system:cluster-admin, irrespective of the person accessing the cluster. Hence, it is impossible to audit the events in the cluster.

shoots/adminkubeconfig subresource

The shoots/adminkubeconfig subresource allows users to dynamically generate temporary kubeconfigs that can be used to access shoot cluster with cluster-admin privileges. The credentials associated with this kubeconfig are client certificates which have a very short validity and must be renewed before they expire (by calling the subresource endpoint again).

The username associated with such kubeconfig will be the same which is used for authenticating to the Gardener API. Apart from this advantage, the created kubeconfig will not be persisted anywhere.

In order to request such a kubeconfig, you can run the following commands:

export NAMESPACE=my-namespace
export SHOOT_NAME=my-shoot
kubectl create \
    -f <path>/<to>/kubeconfig-request.json \
    --raw /apis/core.gardener.cloud/v1beta1/namespaces/${NAMESPACE}/shoots/${SHOOT_NAME}/adminkubeconfig | jq -r ".status.kubeconfig" | base64 -d

Here, the kubeconfig-request.json has the following content:

{
    "apiVersion": "authentication.gardener.cloud/v1alpha1",
    "kind": "AdminKubeconfigRequest",
    "spec": {
        "expirationSeconds": 1000
    }
}

The gardenctl-v2 tool makes it easy to target shoot clusters and automatically renews such kubeconfig when required.

OpenID Connect

The kube-apiserver of shoot clusters can be provided with OpenID Connect configuration via the ShootSpec:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
...
spec:
  kubernetes:
    oidcConfig:
      ...

It is the end-user's responsibility to incorporate the OpenID Connect configuration into the kubeconfig used for accessing the cluster (i.e., Gardener will not automatically generate a kubeconfig based on these OIDC settings). The recommended way is to use the kubectl plugin called kubectl oidc-login for OIDC authentication.
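
A sketch of a kubeconfig user entry that delegates authentication to the kubectl oidc-login plugin could look like this, assuming the plugin is installed; the issuer URL and client ID are placeholders that must match your oidcConfig:

users:
- name: my-oidc-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=https://identity.example.com
      - --oidc-client-id=my-client-id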

If you want to use the same OIDC configuration for all your shoots by default then you can use the ClusterOpenIDConnectPreset and OpenIDConnectPreset API resources. They allow defaulting the .spec.kubernetes.kubeAPIServer.oidcConfig fields for newly created Shoots such that you don’t have to repeat yourself every time (similar to PodPreset resources in Kubernetes). ClusterOpenIDConnectPreset specified OIDC configuration applies to Projects and Shoots cluster-wide (hence, only available to Gardener operators) while OpenIDConnectPreset is Project-scoped. Shoots have to “opt-in” for such defaulting by using the oidc=enable label.

For further information on (Cluster)OpenIDConnectPreset, refer to this document.

25 - Shoot Auditpolicy

Audit a Kubernetes Cluster

The shoot cluster is a Kubernetes cluster and its kube-apiserver handles the audit events. In order to define which audit events must be logged, a proper audit policy file must be passed to the Kubernetes API server. You can find more information about auditing a Kubernetes cluster here.

Default Audit Policy

By default, Gardener will deploy the shoot cluster with the audit policy defined in the kube-apiserver package.

Custom Audit Policy

If you need a specific audit policy for your shoot cluster, you can deploy the required audit policy in the garden cluster as a ConfigMap resource and set up your shoot to refer to this ConfigMap. Note that the policy must be stored under the key policy in the data section of the ConfigMap.

For example, deploy the auditpolicy ConfigMap in the same namespace as your Shoot resource:

kubectl apply -f example/95-configmap-custom-audit-policy.yaml
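
For illustration, such a ConfigMap could look like this; the namespace must be your Shoot's project namespace, and the policy shown is only a minimal example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: auditpolicy
  namespace: garden-my-project
data:
  policy: |
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
    - level: Metadata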

then set your shoot to refer to that ConfigMap (only the related fields are shown):

spec:
  kubernetes:
    kubeAPIServer:
      auditConfig:
        auditPolicy:
          configMapRef:
            name: auditpolicy

Gardener validates that the Shoot resource refers only to an existing ConfigMap containing a valid audit policy, and rejects the Shoot on failure. If you want to switch back to the default audit policy, you have to remove the section

auditPolicy:
  configMapRef:
    name: <configmap-name>

from the shoot spec.

Rolling Out Changes to the Audit Policy

Gardener is not automatically rolling out changes to the Audit Policy, to minimize the amount of Shoot reconciliations and thereby prevent cloud provider rate limits, etc. Gardener will pick up the changes on the next reconciliation of Shoots referencing the Audit Policy ConfigMap. If users want to immediately roll out Audit Policy changes, they can manually trigger a Shoot reconciliation as described in triggering an immediate reconciliation. This is similar to changes to the cloud provider secret referenced by Shoots.

26 - Shoot Autoscaling

Auto-Scaling in Shoot Clusters

There are two parts that relate to auto-scaling in Kubernetes clusters in general:

  • Horizontal node auto-scaling, i.e., dynamically adding and removing worker nodes
  • Vertical pod auto-scaling, i.e., dynamically raising or shrinking the resource requests/limits of pods

This document provides an overview of both scenarios.

Horizontal Node Auto-Scaling

Every shoot cluster that has at least one worker pool with minimum < maximum nodes configuration will get a cluster-autoscaler deployment. Gardener is leveraging the upstream community Kubernetes cluster-autoscaler component. We have forked it to gardener/autoscaler so that it supports the way Gardener manages the worker nodes (leveraging gardener/machine-controller-manager). However, we have not touched the logic of how it performs auto-scaling decisions. Consequently, please refer to the official documentation for this component.

The Shoot API allows to configure a few flags of the cluster-autoscaler (a combined example follows the list):

  • .spec.kubernetes.clusterAutoscaler.scaleDownDelayAfterAdd defines how long after scale up that scale down evaluation resumes (default: 1h).
  • .spec.kubernetes.clusterAutoscaler.scaleDownDelayAfterDelete defines how long after node deletion that scale down evaluation resumes (defaults to scanInterval).
  • .spec.kubernetes.clusterAutoscaler.scaleDownDelayAfterFailure defines how long after scale down failure that scale down evaluation resumes (default: 3m).
  • .spec.kubernetes.clusterAutoscaler.scaleDownUnneededTime defines how long a node should be unneeded before it is eligible for scale down (default: 30m).
  • .spec.kubernetes.clusterAutoscaler.scaleDownUtilizationThreshold defines the threshold under which a node is being removed (default: 0.5).
  • .spec.kubernetes.clusterAutoscaler.scanInterval defines how often the cluster is reevaluated for scale up or down (default: 10s).
  • .spec.kubernetes.clusterAutoscaler.ignoreTaints specifies a list of taint keys to ignore in node templates when considering to scale a node group (default: nil).
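
A minimal sketch of how a subset of these fields appears in the Shoot spec; the values shown are simply the defaults listed above:

spec:
  kubernetes:
    clusterAutoscaler:
      scaleDownDelayAfterAdd: 1h
      scaleDownUnneededTime: 30m
      scaleDownUtilizationThreshold: 0.5
      scanInterval: 10s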

Vertical Pod Auto-Scaling

This form of auto-scaling is not enabled by default and must be explicitly enabled in the Shoot by setting .spec.kubernetes.verticalPodAutoscaler.enabled=true. The reason is that it was only introduced lately, and some end-users might have already deployed their own VPA into their clusters, i.e., enabling it by default would interfere with such custom deployments and lead to issues, eventually.

Gardener is also leveraging an upstream community tool, i.e., the Kubernetes vertical-pod-autoscaler component. If enabled, Gardener will deploy it as part of the control plane into the seed cluster. It will also be used for the vertical autoscaling of Gardener’s system components deployed into the kube-system namespace of shoot clusters, for example, kube-proxy or metrics-server.

You might want to refer to the official documentation for this component to get more information how to use it.

The Shoot API allows to configure a few flags of the vertical-pod-autoscaler (a combined example follows the list):

  • .spec.kubernetes.verticalPodAutoscaler.evictAfterOOMThreshold defines the threshold that will lead to pod eviction in case it OOMed in less than the given threshold since its start and if it has only one container (default: 10m0s).
  • .spec.kubernetes.verticalPodAutoscaler.evictionRateBurst defines the burst of pods that can be evicted (default: 1).
  • .spec.kubernetes.verticalPodAutoscaler.evictionRateLimit defines the number of pods that can be evicted per second. A rate limit set to 0 or -1 will disable the rate limiter (default: -1).
  • .spec.kubernetes.verticalPodAutoscaler.evictionTolerance defines the fraction of replica count that can be evicted for update in case more than one pod can be evicted (default: 0.5).
  • .spec.kubernetes.verticalPodAutoscaler.recommendationMarginFraction is the fraction of usage added as the safety margin to the recommended request (default: 0.15).
  • .spec.kubernetes.verticalPodAutoscaler.updaterInterval is the interval at which the updater runs (default: 1m0s).
  • .spec.kubernetes.verticalPodAutoscaler.recommenderInterval is the interval at which metrics are fetched (default: 1m0s).
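
Analogously, a hedged example of enabling the VPA and tuning some of these flags in the Shoot manifest (the values repeat the defaults listed above):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    verticalPodAutoscaler:
      enabled: true
      evictAfterOOMThreshold: 10m0s
      evictionRateBurst: 1
      evictionRateLimit: -1
      evictionTolerance: 0.5
      recommendationMarginFraction: 0.15
      updaterInterval: 1m0s
      recommenderInterval: 1m0s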

⚠️ Please note that if you disable the VPA again then the related CustomResourceDefinitions will remain in your shoot cluster (although, nobody will act on them). This will also keep all existing VerticalPodAutoscaler objects in the system, including those that might be created by you. You can delete the CustomResourceDefinitions yourself using kubectl delete crd if you want to get rid of them.

27 - Shoot Cleanup

Cleanup of Shoot clusters in deletion

When a shoot cluster is deleted, Gardener tries to gracefully remove most of the Kubernetes resources inside the cluster. This is to prevent any infrastructure or other artefacts from remaining after the shoot deletion.

The cleanup is performed in five steps. Some resources are deleted with a grace period, and all resources are forcefully deleted (by removing blocking finalizers) after some time to not block the cluster deletion entirely.

Cleanup steps:

  1. All ValidatingWebhookConfigurations and MutatingWebhookConfigurations are deleted with a 5m grace period. Forceful finalization happens after 5m.
  2. All APIServices and CustomResourceDefinitions are deleted with a 5m grace period. Forceful finalization happens after 1h.
  3. All CronJobs, DaemonSets, Deployments, Ingresses, Jobs, Pods, ReplicaSets, ReplicationControllers, Services, StatefulSets, and PersistentVolumeClaims are deleted with a 5m grace period. Forceful finalization happens after 5m.

    If the Shoot is annotated with shoot.gardener.cloud/skip-cleanup=true then only Services and PersistentVolumeClaims are considered.

  4. All VolumeSnapshots and VolumeSnapshotContents are deleted with a 5m grace period. Forceful finalization happens after 1h.
  5. All Namespaces are deleted without any grace period. Forceful finalization happens after 5m.

It is possible to override the finalization grace periods via annotations on the Shoot:

  • shoot.gardener.cloud/cleanup-webhooks-finalize-grace-period-seconds (for the resources handled in step 1)
  • shoot.gardener.cloud/cleanup-extended-apis-finalize-grace-period-seconds (for the resources handled in step 2)
  • shoot.gardener.cloud/cleanup-kubernetes-resources-finalize-grace-period-seconds (for the resources handled in step 3)
  • shoot.gardener.cloud/cleanup-namespaces-finalize-grace-period-seconds (for the resources handled in step 5)

⚠️ If "0" is provided then all resources are finalized immediately without waiting for any graceful deletion. Please be aware that this might lead to orphaned infrastructure artefacts.
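
For example, to give the resources handled in step 3 a finalization grace period of 10 minutes (600 seconds), you could annotate the Shoot as follows (a hypothetical value, adjust it to your needs):

kubectl -n garden-<project-name> annotate shoot <shoot-name> shoot.gardener.cloud/cleanup-kubernetes-resources-finalize-grace-period-seconds=600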

Infrastructure Cleanup Wait Period

After all of the above cleanup steps have been performed and the Infrastructure extension resource has been deleted, the gardenlet waits for a certain duration to allow controllers to properly clean up infrastructure resources.

By default, this duration is set to 5m. Only after this time has passed does the shoot deletion flow continue with the entire tear-down of the remaining control plane components (including the kube-apiserver, etc.).

It is also possible to override this wait period via an annotation on the Shoot:

  • shoot.gardener.cloud/infrastructure-cleanup-wait-period-seconds

ℹ️️ All provided period values larger than the above-mentioned defaults are ignored.
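
For example, to shorten the wait period to 60 seconds (only values below the default are effective, see the note above):

kubectl -n garden-<project-name> annotate shoot <shoot-name> shoot.gardener.cloud/infrastructure-cleanup-wait-period-seconds=60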

28 - Shoot Credentials Rotation

Credentials Rotation For Shoot Clusters

There are a lot of different credentials for Shoots to make sure that the various components can communicate with each other, and to make sure the cluster is usable and operable.

This page explains how the various credentials can be rotated so that the cluster can be considered secure.

User-Provided Credentials

Cloud Provider Keys

End-users must provide credentials such that Gardener and Kubernetes controllers can communicate with the respective cloud provider APIs in order to perform infrastructure operations. For example, Gardener uses them to setup and maintain the networks, security groups, subnets, etc., while the cloud-controller-manager uses them to reconcile load balancers and routes, and the CSI controller uses them to reconcile volumes and disks.

Depending on the cloud provider, the required data keys of the Secret differ. Please consult the documentation of the respective provider extension to get to know the concrete data keys (e.g., this document for AWS).

It is the responsibility of the end-user to regularly rotate those credentials. The following steps are required to perform the rotation:

  1. Update the data in the Secret with new credentials (see the example after this list).
  2. ⚠️ Wait until all Shoots using the Secret are reconciled before you disable the old credentials in your cloud provider account! Otherwise, the Shoots will no longer work as expected. Check out this document to learn how to trigger a reconciliation of your Shoots.
  3. After all Shoots using the Secret were reconciled, you can go ahead and deactivate the old credentials in your provider account.
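
For step 1, a hypothetical example of editing the referenced Secret in the garden cluster (the Secret name is a placeholder and the concrete data keys depend on your provider, as mentioned above):

kubectl -n garden-<project-name> edit secret <infrastructure-secret-name>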

Gardener-Provided Credentials

Below credentials are generated by Gardener when shoot clusters are being created. Those include

  • kubeconfig (if enabled)
  • certificate authorities (and related server and client certificates)
  • observability passwords for Grafana
  • SSH key pair for worker nodes
  • ETCD encryption key
  • ServiceAccount token signing key

🚨 There is no auto-rotation of those credentials, and it is the responsibility of the end-user to regularly rotate them.

While it is possible to rotate them one by one, there is also a convenient method to combine the rotation of all of those credentials. The rotation happens in two phases since it might be required to update some API clients (e.g., when CAs are rotated). In order to start the rotation (first phase), you have to annotate the shoot with the rotate-credentials-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-credentials-start

You can check the .status.credentials.rotation field in the Shoot to see when the rotation was last initiated and last completed.
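
For example, the combined rotation status can be inspected with a plain kubectl query:

kubectl -n <shoot-namespace> get shoot <shoot-name> -o jsonpath='{.status.credentials.rotation}'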

Kindly consider the detailed descriptions below to learn how the rotation is performed and what your responsibilities are. Please note that all respective individual actions apply for this combined rotation as well (e.g., worker nodes are rolled out in the first phase).

You can complete the rotation (second phase) by annotating the shoot with the rotate-credentials-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-credentials-complete

Kubeconfig

If the .spec.kubernetes.enableStaticTokenKubeconfig field is set to true (default) then Gardener generates a kubeconfig with cluster-admin privileges for the Shoots containing credentials for communication with the kube-apiserver (see this document for more information).

This Secret is stored with name <shoot-name>.kubeconfig in the project namespace in the garden cluster and has multiple data keys:

  • kubeconfig: the completed kubeconfig
  • token: token for system:cluster-admin user
  • username/password: basic auth credentials (if enabled via Shoot.spec.kubernetes.kubeAPIServer.enableBasicAuthentication)
  • ca.crt: the CA bundle for establishing trust to the API server (same as in the Cluster CA bundle secret)

Shoots created with Gardener <= 0.28 used to have a kubeconfig based on a client certificate instead of a static token. With the first kubeconfig rotation, such clusters will get a static token as well.

⚠️ This does not invalidate the old client certificate. In order to do this, you should perform a rotation of the CAs (see section below).

It is the responsibility of the end-user to regularly rotate those credentials (or disable this kubeconfig entirely). In order to rotate the token in this kubeconfig, annotate the Shoot with gardener.cloud/operation=rotate-kubeconfig-credentials. This operation is not allowed for Shoots that are already marked for deletion. Please note that only the token (and basic auth password, if enabled) are exchanged. The CA certificate remains the same (see section below for information about the rotation).

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-kubeconfig-credentials

You can check the .status.credentials.rotation.kubeconfig field in the Shoot to see when the rotation was last initiated and last completed.

Certificate Authorities

Gardener generates several certificate authorities (CAs) to ensure secured communication between the various components and actors. Most of those CAs are used for internal communication (e.g., kube-apiserver talks to etcd, vpn-shoot talks to the vpn-seed-server, kubelet talks to kube-apiserver etc.). However, there is also the “cluster CA” which is part of all kubeconfigs and used to sign the server certificate exposed by the kube-apiserver.

Gardener populates a Secret with name <shoot-name>.ca-cluster in the project namespace in the garden cluster which contains the following data keys:

  • ca.crt: the CA bundle of the cluster

This bundle contains one or multiple CAs which are used for signing serving certificates of the Shoot’s API server. Hence, the certificates contained in this Secret can be used to verify the API server’s identity when communicating with its public endpoint (e.g. as certificate-authority-data in a kubeconfig). This is the same certificate that is also contained in the kubeconfig’s certificate-authority-data field.

Shoots created with Gardener >= v1.45 have a dedicated client CA which verifies the legitimacy of client certificates. For older Shoots, the client CA is equal to the cluster CA. With the first CA rotation, such clusters will get a dedicated client CA as well.

All of the certificates are valid for 10 years. Since rotating them requires adaptation by the consumers of the Shoot, there is no automatic rotation and it is the responsibility of the end-user to regularly rotate the CA certificates.

The rotation happens in three stages (see also GEP-18 for the full details):

  • In stage one, new CAs are created and added to the bundle (together with the old CAs). Client certificates are re-issued immediately.
  • In stage two, end-users update all cluster API clients that communicate with the control plane.
  • In stage three, the old CAs are dropped from the bundle and server certificates are re-issued.

Technically, the Preparing phase indicates stage one. Once it is completed, the Prepared phase indicates readiness for stage two. The Completing phase indicates stage three, and the Completed phase states that the rotation process has finished.

You can check the .status.credentials.rotation.certificateAuthorities field in the Shoot to see when the rotation was last initiated, last completed, and in which phase it currently is.

In order to start the rotation (stage one), you have to annotate the shoot with the rotate-ca-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-ca-start

This will trigger a Shoot reconciliation and performs stage one. After it is completed, the .status.credentials.rotation.certificateAuthorities.phase is set to Prepared.

Now you must update all API clients outside the cluster (such as the kubeconfigs on developer machines) to use the newly issued CA bundle in the <shoot-name>.ca-cluster Secret. Please also note that client certificates must be re-issued now.

After updating all API clients, you can complete the rotation by annotating the shoot with the rotate-ca-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-ca-complete

This will trigger another Shoot reconciliation and performs stage three. After it is completed, the .status.credentials.rotation.certificateAuthorities.phase is set to Completed. You could update your API clients again and drop the old CA from their bundle.

Note that the CA rotation also rotates all internal CAs and signed certificates. Hence, most of the components need to be restarted (including etcd and kube-apiserver).

⚠️ In stage one, all worker nodes of the Shoot will be rolled out to ensure that the Pods as well as the kubelets get the updated credentials as well.

Observability Password(s) For Grafana

For Shoots with .spec.purpose!=testing, Gardener deploys an observability stack with Prometheus for monitoring, Alertmanager for alerting (optional), Loki for logging, and Grafana for visualization. The Grafana instance is exposed via Ingress and accessible for end-users via basic authentication credentials generated and managed by Gardener.

Those credentials are stored in a Secret with name <shoot-name>.monitoring in the project namespace in the garden cluster. It has multiple data keys:

  • username: the user name
  • password: the password
  • basic_auth.csv: the user name and password in CSV format
  • auth: the user name with SHA-1 representation of the password

It is the responsibility of the end-user to regularly rotate those credentials. In order to rotate the password, annotate the Shoot with gardener.cloud/operation=rotate-observability-credentials. This operation is not allowed for Shoots that are already marked for deletion.

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-observability-credentials

You can check the .status.credentials.rotation.observability field in the Shoot to see when the rotation was last initiated and last completed.

Operators

Gardener operators have separate credentials to access their own Grafana instance, or to access Prometheus, Alertmanager, and Loki directly. These credentials are only stored in the shoot namespace in the seed cluster and can be retrieved as follows:

kubectl -n shoot--<project>--<name> get secret -l name=observability-ingress,managed-by=secrets-manager,manager-identity=gardenlet

These credentials are only valid for 30d and get automatically rotated with the next Shoot reconciliation when 80% of their validity has elapsed or when there are fewer than 10d left until expiration. There is no way to trigger the rotation manually.

SSH Key Pair For Worker Nodes

Gardener generates an SSH key pair whose public key is propagated to all worker nodes of the Shoot. The private key can be used to establish an SSH connection to the workers for troubleshooting purposes. It is recommended to use gardenctl-v2 and its gardenctl ssh command since it takes care of first opening up the security groups and creating a bastion VM (no direct SSH access to the worker nodes is possible).

The private key is stored in a Secret with name <shoot-name>.ssh-keypair in the project namespace in the garden cluster and has multiple data keys:

  • id_rsa: the private key
  • id_rsa.pub: the public key for SSH

In order to rotate the keys, annotate the Shoot with gardener.cloud/operation=rotate-ssh-keypair. This will propagate a new key to all worker nodes while keeping the old key active and valid as well (it will only be invalidated/removed with the next rotation).

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-ssh-keypair

You can check the .status.credentials.rotation.sshKeypair field in the Shoot to see when the rotation was last initiated or last completed.

The old key is stored in a Secret with name <shoot-name>.ssh-keypair.old in the project namespace in the garden cluster and has the same data keys as the regular Secret.
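
If you nevertheless need the raw key material, for example for a manual connection via a bastion, you can extract and decode the private key from the Secret (a sketch using standard kubectl and base64 tooling):

kubectl -n <shoot-namespace> get secret <shoot-name>.ssh-keypair -o jsonpath='{.data.id_rsa}' | base64 -d > id_rsa
chmod 600 id_rsa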

ETCD Encryption Key

This key is used to encrypt the data of Secret resources inside etcd (see upstream Kubernetes documentation).

The encryption key has no expiration date. There is no automatic rotation and it is the responsibility of the end-user to regularly rotate the encryption key.

The rotation happens in three stages:

  • In stage one, a new encryption key is created and added to the bundle (together with the old encryption key).
  • In stage two, all Secrets in the cluster are rewritten by the kube-apiserver so that they become encrypted with the new encryption key.
  • In stage three, the old encryption key is dropped from the bundle.

Technically, the Preparing phase indicates stages one and two. Once it is completed, the Prepared phase indicates readiness for stage three. The Completing phase indicates stage three, and the Completed phase states that the rotation process has finished.

You can check the .status.credentials.rotation.etcdEncryptionKey field in the Shoot to see when the rotation was last initiated, last completed, and in which phase it currently is.

In order to start the rotation (stage one), you have to annotate the shoot with the rotate-etcd-encryption-key-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-etcd-encryption-key-start

This will trigger a Shoot reconciliation and performs stages one and two. After it is completed, the .status.credentials.rotation.etcdEncryptionKey.phase is set to Prepared. Now you can complete the rotation by annotating the shoot with the rotate-etcd-encryption-key-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-etcd-encryption-key-complete

This will trigger another Shoot reconciliation and performs stage three. After it is completed, the .status.credentials.rotation.etcdEncryptionKey.phase is set to Completed.

ServiceAccount Token Signing Key

Gardener generates a key which is used to sign the tokens for ServiceAccounts. Those tokens are typically used by workload Pods running inside the cluster in order to authenticate themselves with the kube-apiserver. This also includes system components running in the kube-system namespace.

The token signing key has no expiration date. Since it might require adaptation for the consumers of the Shoot, there is no automatic rotation and it is the responsibility of the end-user to regularly rotate the signing key.

The rotation happens in three stages, similar to how the CA certificates are rotated:

  • In stage one, a new signing key is created and added to the bundle (together with the old signing key).
  • In stage two, end-users update all out-of-cluster API clients that communicate with the control plane via ServiceAccount tokens.
  • In stage three, the old signing key is dropped from the bundle.

Technically, the Preparing phase indicates stage one. Once it is completed, the Prepared phase indicates readiness for stage two. The Completing phase indicates stage three, and the Completed phase states that the rotation process has finished.

You can check the .status.credentials.rotation.serviceAccountKey field in the Shoot to see when the rotation was last initiated, last completed, and in which phase it currently is.

In order to start the rotation (stage one), you have to annotate the shoot with the rotate-serviceaccount-key-start operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-serviceaccount-key-start

This will trigger a Shoot reconciliation and performs stage one. After it is completed, the .status.credentials.rotation.serviceAccountKey.phase is set to Prepared.

Now you must update all API clients outside the cluster using a ServiceAccount token (such as the kubeconfigs on developer machines) to use a token issued by the new signing key. Gardener already generates new static token secrets for all ServiceAccounts in the cluster. However, if you need to create such a token secret manually, you can check out this document for instructions.

After updating all API clients, you can complete the rotation by annotating the shoot with the rotate-serviceaccount-key-complete operation:

kubectl -n <shoot-namespace> annotate shoot <shoot-name> gardener.cloud/operation=rotate-serviceaccount-key-complete

This will trigger another Shoot reconciliation and performs stage three. After it is completed, the .status.credentials.rotation.serviceAccountKey.phase is set to Completed.

⚠️ In stage one, all worker nodes of the Shoot will be rolled out to ensure that the Pods use a new token.

OpenVPN TLS Auth Keys

This key is used to ensure encrypted communication for the VPN connection between the control plane in the seed cluster and the shoot cluster. It is currently not rotated automatically and there is no way to trigger it manually.

29 - Shoot High Availability

Highly Available Shoot Control Plane

The Shoot resource offers a way to request a highly available control plane.

Failure Tolerance Types

A highly available shoot control plane can be set up with either a failure tolerance of zone or node.

Node Failure Tolerance

Failure tolerance of node will have the following characteristics:

  • Control plane components will be spread across different nodes within a single availability zone. There will not be more than one replica per node for each control plane component which has more than one replica.
  • The worker pool should have a minimum of 3 nodes.
  • A multi-node etcd (quorum size of 3) will be provisioned offering zero-downtime capabilities with each member in a different node within a single availability zone.

Zone Failure Tolerance

Failure tolerance of zone will have the following characteristics:

  • Control plane components will be spread across different availability zones. There will be at least one replica per zone for each control plane component which has more than one replica.
  • Gardener scheduler will automatically select a seed which has a minimum of 3 zones to host the shoot control plane.
  • A multi-node etcd (quorum size of 3) will be provisioned offering zero-downtime capabilities with each member in a different zone.

Shoot Spec

To request a highly available shoot control plane, Gardener provides the following configuration in the shoot spec:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: <node | zone>

Allowed Transitions

If you already have a shoot cluster with a non-HA control plane, then the following upgrades are possible:

  • Upgrade of non-HA shoot control plane to HA shoot control plane with node failure tolerance.
  • Upgrade of non-HA shoot control plane to HA shoot control plane with zone failure tolerance. However, it is essential that the seed which is currently hosting the shoot control plane is multi-zonal. If it is not, the upgrade request will be rejected.

NOTE: There will be a small downtime during the upgrade, especially for etcd, which will transition from a single-node etcd cluster to a multi-node etcd cluster.

Disallowed Transitions

If you have already set up an HA shoot control plane with node failure tolerance, then an upgrade to zone failure tolerance is currently not supported, mainly because already existing volumes are bound to the zone they were created in.

30 - Shoot Info Configmap

Shoot Info ConfigMap

Overview

Gardenlet maintains a ConfigMap inside the Shoot cluster that contains information about the cluster itself. The ConfigMap is named shoot-info and located in the kube-system namespace.

Fields

The following fields are provided:

apiVersion: v1
kind: ConfigMap
metadata:
  name: shoot-info
  namespace: kube-system
data:
  domain: crazy-botany.core.my-custom-domain.com     # .spec.dns.domain field from the Shoot resource
  extensions: foobar,foobaz                          # List of extensions that are enabled
  kubernetesVersion: 1.20.1                          # .spec.kubernetes.version field from the Shoot resource
  maintenanceBegin: 220000+0100                      # .spec.maintenance.timeWindow.begin field from the Shoot resource
  maintenanceEnd: 230000+0100                        # .spec.maintenance.timeWindow.end field from the Shoot resource
  nodeNetwork: 10.250.0.0/16                         # .spec.networking.nodes field from the Shoot resource
  podNetwork: 100.96.0.0/11                          # .spec.networking.pods field from the Shoot resource
  projectName: dev                                   # .metadata.name of the Project
  provider: <some-provider-name>                     # .spec.provider.type field from the Shoot resource
  region: europe-central-1                           # .spec.region field from the Shoot resource
  serviceNetwork: 100.64.0.0/13                      # .spec.networking.services field from the Shoot resource
  shootName: crazy-botany                            # .metadata.name from the Shoot resource
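
You can inspect this information from within the Shoot cluster with a simple kubectl query:

kubectl -n kube-system get configmap shoot-info -o yaml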

31 - Shoot Maintenance

Shoot Maintenance

Shoots configure a maintenance time window in which Gardener performs certain operations that may restart the control plane, roll out the nodes, result in higher network traffic, etc. This document outlines what happens during a shoot maintenance.

Time Window

Via the .spec.maintenance.timeWindow field in the shoot specification end-users can configure the time window in which maintenance operations are executed. Gardener runs one maintenance operation per day in this time window:

spec:
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100

The offset (+0100) is considered with respect to UTC time. The minimum time window is 30m and the maximum is 6h.

⚠️ Please note that there is no guarantee that a maintenance operation that e.g. starts a node roll-out will finish within the time window. Especially for large clusters it may take several hours until a graceful rolling update of the worker nodes succeeds (also depending on the workload and the configured pod disruption budgets/termination grace periods).

Internally, Gardener subtracts 15m from the end of the time window to try (best effort) to finish the maintenance before the end is reached; however, this might not work in all cases.

If you don’t specify a time window then Gardener will randomly compute it. You can change it later, of course.

Automatic Version Updates

The .spec.maintenance.autoUpdate field in the shoot specification allows you to control how/whether automatic updates of Kubernetes patch and machine image versions are performed. Machine image versions are updated per worker pool.

spec:
  maintenance:
    autoUpdate:
      kubernetesVersion: true
      machineImageVersion: true

During the daily maintenance, the Gardener Controller Manager updates the Shoot’s Kubernetes and machine image version if any of the following criteria applies:

  • there is a higher version available and the Shoot opted in for automatic version updates
  • the currently used version is expired

Gardener creates events with type MaintenanceDone on the Shoot describing the action performed during maintenance including the reason why an update has been triggered.

MaintenanceDone  Updated image of worker-pool 'coreos-xy' from 'coreos' version 'xy' to version 'abc'. Reason: AutoUpdate of MachineImage configured.
MaintenanceDone  Updated Kubernetes version '0.0.1' to version '0.0.5'. This is an increase in the patch level. Reason: AutoUpdate of Kubernetes version configured.
MaintenanceDone  Updated Kubernetes version '0.0.5' to version '0.1.5'. This is an increase in the minor level. Reason: Kubernetes version expired - force update required.

Please refer to this document for more information about Kubernetes and machine image versions in Gardener.

Cluster Reconciliation

Gardener administrators/operators can configure the Gardenlet in a way that it only reconciles shoot clusters during their maintenance time windows. This behaviour is not controllable by end-users but might make sense for large Gardener installations. Concretely, your shoot will be reconciled regularly during its maintenance time window. Outside of the maintenance time window it will only reconcile if you change the specification or if you explicitly trigger it, see also this document.

Confine Specification Changes/Updates Roll Out

Via the .spec.maintenance.confineSpecUpdateRollout field you can control whether you want Gardener to roll out changes/updates to your shoot specification only during the maintenance time window. It is false by default, i.e., any change to your shoot specification triggers a reconciliation (even outside of the maintenance time window). This is helpful if you want to update your shoot but don’t want the changes to be applied immediately. One example use-case would be a Kubernetes version upgrade that you want to roll out during the maintenance time window. Any update to the specification will not increase the .metadata.generation of the Shoot, which is something you should be aware of. Also, even if Gardener administrators/operators have not enabled the “reconciliation in maintenance time window only” configuration (as mentioned above), your shoot will only reconcile in the maintenance time window. The reason is that Gardener cannot differentiate between create/update/reconcile operations.

⚠️ If confineSpecUpdateRollout=true, please note that if you change the maintenance time window itself then it will only be effective after the upcoming maintenance.

⚠️ As exceptions to the above rules, manually triggered reconciliations and changes to the .spec.hibernation.enabled field trigger immediate rollouts. That is, if you hibernate or wake up your shoot, or you explicitly tell Gardener to reconcile your shoot, then Gardener gets active right away.

Shoot Operations

In case you would like to perform a shoot credential rotation or a reconcile operation during your maintenance time window, you can annotate the Shoot with

maintenance.gardener.cloud/operation=<operation>

This will execute the specified <operation> during the next maintenance reconciliation. Note that Gardener will remove this annotation after it has been performed in the maintenance reconciliation.

⚠️ This is skipped when the Shoot’s .status.lastOperation.state=Failed. Make sure to retry your shoot reconciliation beforehand.
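
For example, to confine an SSH key pair rotation (one of the operations described in the credentials rotation document) to the next maintenance time window:

kubectl -n garden-<project-name> annotate shoot <shoot-name> maintenance.gardener.cloud/operation=rotate-ssh-keypair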

Special Operations During Maintenance

The shoot maintenance controller triggers special operations that are performed as part of the shoot reconciliation.

Infrastructure and DNSRecord Reconciliation

The reconciliation of the Infrastructure and DNSRecord extension resources is only demanded during the shoot’s maintenance time window. The rationale behind it is to prevent sending too many requests against the cloud provider APIs, especially on large landscapes or if a user has many shoot clusters in the same cloud provider account.

Restart Control Plane Controllers

Gardener operators can make Gardener restart/delete certain control plane pods during a shoot maintenance. This feature helps to automatically resolve service denials of controllers due to stale caches, deadlocks, or starving routines.

Please note that these are exceptional cases but they are observed from time to time. Gardener, for example, takes this precautionary measure for kube-controller-manager pods.

See this document to see how extension developers can extend this behaviour.

Restart Some Core Addons

Gardener operators can make Gardener restart some core addons, at the moment only CoreDNS, during a shoot maintenance.

CoreDNS benefits from this feature as it automatically solves problems with clients being stuck to a single replica of the deployment and thus overloading it. Please note that these are exceptional cases, but they are observed from time to time.

32 - Shoot Network Policies

Network policies in the Shoot Cluster

In addition to deploying network policies into the Seed, Gardener deploys network policies into the kube-system namespace of the Shoot. These network policies are used by Shoot system components (that are not part of the control plane). Other namespaces in the Shoot do not contain network policies deployed by Gardener.

As a best practice, every pod deployed into the kube-system namespace should use appropriate network policies in order to only allow required network traffic. Therefore, pods should have labels matching the selectors of the available network policies (see the example after the table below).

Gardener deploys the following network policies:

NAME                                       POD-SELECTOR
gardener.cloud--allow-dns                  k8s-app in (kube-dns)
gardener.cloud--allow-from-seed            networking.gardener.cloud/from-seed=allowed
gardener.cloud--allow-to-apiserver         networking.gardener.cloud/to-apiserver=allowed
gardener.cloud--allow-to-dns               networking.gardener.cloud/to-dns=allowed
gardener.cloud--allow-to-from-nginx        app=nginx-ingress
gardener.cloud--allow-to-kubelet           networking.gardener.cloud/to-kubelet=allowed
gardener.cloud--allow-to-public-networks   networking.gardener.cloud/to-public-networks=allowed
gardener.cloud--allow-vpn                  app=vpn-shoot
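
For example, a Pod deployed into the kube-system namespace that needs DNS resolution and access to the API server could carry labels matching the selectors above (a minimal, hypothetical sketch, not an actual Gardener manifest):

apiVersion: v1
kind: Pod
metadata:
  name: example
  namespace: kube-system
  labels:
    networking.gardener.cloud/to-dns: allowed
    networking.gardener.cloud/to-apiserver: allowed
spec:
  containers:
  - name: main
    image: registry.example.com/example:latest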

Additionally, there can be network policies deployed by Gardener extensions such as extension-calico.

NAME                                       POD-SELECTOR
gardener.cloud--allow-from-calico-node     k8s-app=calico-typha

33 - Shoot Networking

Shoot Networking

This document contains network related information for Shoot clusters.

Pod Network

A Pod network is imperative for any kind of cluster communication with Pods not started within the Node’s host network. More information about the Kubernetes network model can be found here.

Gardener allows users to configure the Pod network’s CIDR during Shoot creation:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  networking:
    type: <some-network-extension-name> # {calico,cilium}
    pods: 100.96.0.0/16
    nodes: ...
    services: ...

⚠️ The networking.pods IP configuration is immutable and cannot be changed afterwards. Please consider the following paragraph to choose a configuration which will meet your demands.

One of the network plugin’s (CNI) tasks is to assign IP addresses to Pods started in the Pod network. Different network plugins come with different IP address management (IPAM) features, so we can’t give any definite advice on how IP ranges should be configured. Nevertheless, we want to outline the standard configuration.

Information in .spec.networking.pods matches the --cluster-cidr flag of the Kube-Controller-Manager of your Shoot cluster. This IP range is divided into smaller subnets, also called podCIDRs (default mask /24), which are assigned to Node objects via .spec.podCIDR. Pods get their IP address from this smaller node subnet in a default IPAM setup. Thus, it must be guaranteed that enough of these subnets can be created for the maximum amount of nodes you expect in the cluster.

Example 1

Pod network: 100.96.0.0/16
nodeCIDRMaskSize: /24
-------------------------

Number of podCIDRs: 256 --> max. Node count 
Number of IPs per podCIDRs: 256

With the configuration above a Shoot cluster can at most have 256 nodes which are ready to run workload in the Pod network.

Example 2

Pod network: 100.96.0.0/20
nodeCIDRMaskSize: /24
-------------------------

Number of podCIDRs: 16 --> max. Node count 
Number of IPs per podCIDRs: 256

With the configuration above a Shoot cluster can at most have 16 nodes which are ready to run workload in the Pod network.

Besides the configuration in .spec.networking.pods, users can tune the nodeCIDRMaskSize used by the Kube-Controller-Manager on shoot creation. A smaller IP range per node means more podCIDRs and thus the ability to provision more nodes in the cluster, but fewer available IPs for Pods running on each of the nodes.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubeControllerManager:
      nodeCIDRMaskSize: 24 # default

⚠️ The nodeCIDRMaskSize configuration is immutable and cannot be changed afterwards.

Example 3

Pod network: 100.96.0.0/20
nodeCIDRMaskSize: /25
-------------------------

Number of podCIDRs: 32 --> max. Node count 
Number of IPs per podCIDRs: 128

With the configuration above a Shoot cluster can at most have 32 nodes which are ready to run workload in the Pod network.

34 - Shoot Operations

Trigger Shoot Operations

You can trigger a few explicit operations by annotating the Shoot with an operation annotation. This might allow you to induce certain behavior without the need to change the Shoot specification. Some of the operations also cannot be caused by changing something in the shoot specification because they cannot properly be reflected there. Note that once the triggered operation is considered by the controllers, the annotation will be automatically removed and you have to add it each time you want to trigger the operation.

Please note: If .spec.maintenance.confineSpecUpdateRollout=true then the only way to trigger a shoot reconciliation is by setting the reconcile operation, see below.

Immediate Reconciliation

Annotate the shoot with gardener.cloud/operation=reconcile to make the gardenlet start a reconciliation operation without changing the shoot spec and possibly without being in its maintenance time window:

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=reconcile

Immediate Maintenance

Annotate the shoot with gardener.cloud/operation=maintain to make the gardener-controller-manager start maintaining your shoot immediately (possibly without being in its maintenance time window). If no reconciliation starts then nothing needed to be maintained:

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=maintain

Retry Failed Reconciliation

Annotate the shoot with gardener.cloud/operation=retry to make the gardenlet start a new reconciliation loop on a failed shoot. Failed shoots are only reconciled again if a new Gardener version is deployed, the shoot specification is changed, or this annotation is set:

kubectl -n garden-<project-name> annotate shoot <shoot-name> gardener.cloud/operation=retry

Credentials Rotation Operations

Please consult this document for more information.

Restart systemd Services On Particular Worker Nodes

It is possible to make Gardener restart particular systemd services on your shoot worker nodes if needed. The annotation is not set on the Shoot resource but directly on the Node object you want to target. For example, the following will restart both the kubelet and the docker services:

kubectl annotate node <node-name> worker.gardener.cloud/restart-systemd-services=kubelet,docker

It may take up to a minute until the service is restarted. The annotation will be removed from the Node object after all specified systemd services have been restarted. It will also be removed even if the restart of one or more services failed.

ℹ️ In the example mentioned above, you could additionally verify when/whether the kubelet restarted by using kubectl describe node <node-name> and looking for such a Starting kubelet event.

35 - Shoot Purposes

Shoot Cluster Purpose

The Shoot resource contains a .spec.purpose field indicating how the shoot is used. Its allowed values are as follows:

  • evaluation (default): Indicates that the shoot cluster is for evaluation scenarios.
  • development: Indicates that the shoot cluster is for development scenarios.
  • testing: Indicates that the shoot cluster is for testing scenarios.
  • production: Indicates that the shoot cluster is for production scenarios.
  • infrastructure: Indicates that the shoot cluster is for infrastructure scenarios (only allowed for shoots in the garden namespace).

Behavioral Differences

The following enlists the differences in the way the shoot clusters are set up based on the selected purpose:

  • testing shoot clusters do not get a monitoring or a logging stack as part of their control planes.
  • production shoot clusters get at least two replicas of the kube-apiserver for their control planes. Auto-scaling scale-down of the main etcd is disabled for such clusters.

There are also differences with respect to how testing shoots are scheduled after creation, please consult the Scheduler documentation.

Future Steps

We might introduce more behavioral differences depending on the shoot purpose in the future. As of today, there are no plans yet.

36 - Shoot Scheduling Profiles

Shoot Scheduling Profiles

This guide describes the available scheduling profiles and how they can be configured in the Shoot cluster. It also clarifies how a custom scheduling profile can be configured.

Scheduling profiles

The scheduling process in the kube-scheduler happens in series of stages. A scheduling profile allows configuring the different stages of the scheduling.

As of today, Gardener supports two predefined scheduling profiles:

  • balanced (default)

    Overview

    The balanced profile attempts to spread Pods evenly across Nodes to obtain a more balanced resource usage. This profile provides the default kube-scheduler behavior.

    How does it work?

    The kube-scheduler is started without any profiles. In such case, by default, one profile with the scheduler name default-scheduler is created. This profile includes the default plugins. If a Pod doesn’t specify the .spec.schedulerName field, kube-apiserver sets it to default-scheduler. Then, the Pod gets scheduled by the default-scheduler accordingly.

  • bin-packing (alpha)

    Overview

    The bin-packing profile scores Nodes based on the allocation of resources. It prioritizes Nodes with most allocated resources. By favoring the Nodes with most allocation some of the other Nodes become under-utilized over time (because new Pods keep being scheduled to the most allocated Nodes). Then, the cluster-autoscaler identifies such under-utilized Nodes and removes them from the cluster. In this way, this profile provides a greater overall resource utilization (compared to the balanced profile).

    Note: The decision of when to remove a Node is a trade-off between optimizing for utilization or the availability of resources. Removing under-utilized Nodes improves cluster utilization, but new workloads might have to wait for resources to be provisioned again before they can run.

    Note: The bin-packing profile is considered as alpha feature. Use it only for evaluation purposes.

    How does it work?

    The kube-scheduler is configured with the following bin packing profile:

    apiVersion: kubescheduler.config.k8s.io/v1beta3
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: bin-packing-scheduler
      pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
      plugins:
        score:
          disabled:
          - name: NodeResourcesBalancedAllocation
    

    To impose the new profile, a MutatingWebhookConfiguration is deployed in the Shoot cluster. The MutatingWebhookConfiguration intercepts CREATE operations for Pods and sets the .spec.schedulerName field to bin-packing-scheduler. Then, the Pod gets scheduled by the bin-packing-scheduler accordingly. Pods that specify a custom scheduler (i.e., having .spec.schedulerName different from default-scheduler and bin-packing-scheduler) are not affected.

Configuring the scheduling profile

The scheduling profile can be configured via the .spec.kubernetes.kubeScheduler.profile field in the Shoot:

spec:
  # ...
  kubernetes:
    kubeScheduler:
      profile: "balanced" # or "bin-packing"

Custom scheduling profiles

The kube-scheduler’s component config allows configuring custom scheduling profiles to match the cluster’s needs. As of today, Gardener supports only two predefined scheduling profiles. The profile configuration in the component config is quite expressive and it is not possible to easily define profiles that would match the needs of every cluster. For these reasons, there are no plans to add support for new predefined scheduling profiles. If a cluster owner wants to use a custom scheduling profile, then they have to deploy (and maintain) a dedicated kube-scheduler deployment in the cluster itself.

37 - Shoot Serviceaccounts

ServiceAccount Configurations For Shoot Clusters

The Shoot specification allows configuring some of the settings for the handling of ServiceAccounts:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubeAPIServer:
      serviceAccountConfig:
        issuer: foo
        acceptedIssuers:
        - foo1
        - foo2
        # Deprecated: This field is deprecated and will be removed in a future version of Gardener. Do not use it.
        signingKeySecretName:
          name: my-signing-key-secret
        extendTokenExpiration: true
        maxTokenExpiration: 45d
...

Issuer And Accepted Issuers

The .spec.kubernetes.kubeAPIServer.serviceAccountConfig.{issuer,acceptedIssuers} fields are translated to the --service-account-issuer flag for the kube-apiserver. The issuer will assert its identifier in the iss claim of issued tokens. According to the upstream specification, values need to meet the following requirements:

This value is a string or URI. If this option is not a valid URI per the OpenID Discovery 1.0 spec, the ServiceAccountIssuerDiscovery feature will remain disabled, even if the feature gate is set to true. It is highly recommended that this value comply with the OpenID spec: https://openid.net/specs/openid-connect-discovery-1_0.html. In practice, this means that service-account-issuer must be an https URL. It is also highly recommended that this URL be capable of serving OpenID discovery documents at {service-account-issuer}/.well-known/openid-configuration.

By default, Gardener uses the internal cluster domain as issuer (e.g., https://api.foo.bar.example.com). If you specify the issuer then this default issuer will always be part of the list of accepted issuers (you don’t need to specify it yourself).

⚠️ Caution: If you change from the default issuer to a custom issuer then all previously issued tokens are still valid/accepted. However, if you change from a custom issuer A to another issuer B (custom or default) then you have to add A to the acceptedIssuers so that previously issued tokens are not invalidated. Otherwise, the control plane components as well as system components and your workload pods might fail. You can remove A from the acceptedIssuers when all active tokens were issued by B. This can be ensured by using projected token volumes with a short validity, or by rolling out all pods. Additionally, all ServiceAccount token secrets should be recreated. Apart from this you should wait for at least 12h to make sure the control plane and system components receive a new token from Gardener.

Signing Key Secret

🚨 This field is deprecated and will be removed in a future version of Gardener. Do not use it.

The .spec.kubernetes.kubeAPIServer.serviceAccountConfig.signingKeySecretName.name specifies the name of Secret in the same namespace as the Shoot in the garden cluster. It should look as follows:

apiVersion: v1
kind: Secret
metadata:
  name: <name>
  namespace: <namespace>
data:
  signing-key: base64(signing-key-pem)

The provided key will be used for configuring both the --service-account-signing-key-file and --service-account-key-file flags of the kube-apiserver.

According to the upstream specification, they have the following effects:

  • --service-account-key-file: File containing PEM-encoded x509 RSA or ECDSA private or public keys, used to verify ServiceAccount tokens. The specified file can contain multiple keys, and the flag can be specified multiple times with different files. If specified multiple times, tokens signed by any of the specified keys are considered valid by the Kubernetes API server.
  • --service-account-signing-key-file: Path to the file that contains the current private key of the service account token issuer. The issuer signs issued ID tokens with this private key.

Note that rotation of this key is not yet supported, hence usage is not recommended. By default, Gardener will generate a service account signing key for the cluster.

Token Expirations

The .spec.kubernetes.kubeAPIServer.serviceAccountConfig.extendTokenExpiration configures the --service-account-extend-token-expiration flag of the kube-apiserver. It is enabled by default and has the following specification:

Turns on projected service account expiration extension during token generation, which helps safe transition from legacy token to bound service account token feature. If this flag is enabled, admission injected tokens would be extended up to 1 year to prevent unexpected failure during transition, ignoring value of service-account-max-token-expiration.

The .spec.kubernetes.kubeAPIServer.serviceAccountConfig.maxTokenExpiration configures the --service-account-max-token-expiration flag of the kube-apiserver. It has the following specification:

The maximum validity duration of a token created by the service account token issuer. If an otherwise valid TokenRequest with a validity duration larger than this value is requested, a token will be issued with a validity duration of this value.

⚠️ Note that the value for this field must be in the [30d,90d] range. The background for this limitation is that all Gardener components rely on the TokenRequest API and the Kubernetes service account token projection feature with short-lived, auto-rotating tokens. Any values lower than 30d risk impacting the SLO for shoot clusters, and any values above 90d violate security best practices with respect to the maximum validity of credentials before they must be rotated. Given that the field just specifies the upper bound, end-users can still use lower values for their individual workloads by specifying .spec.volumes[].projected.sources[].serviceAccountToken.expirationSeconds in their PodSpecs.
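
For example, a workload that wants a token with a shorter validity than the configured maximum can request it via a projected token volume (standard Kubernetes API, sketched here with a hypothetical Pod):

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: main
    image: registry.example.com/example:latest
    volumeMounts:
    - name: token
      mountPath: /var/run/secrets/tokens
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 3600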

38 - Shoot Status

Shoot Status

This document provides an overview of the ShootStatus.

Conditions

The Shoot status consists of a set of conditions. A Condition has the following fields:

Field name           Description
type                 Name of the condition.
status               Indicates whether the condition is applicable, with possible values True, False, Unknown, or Progressing.
lastTransitionTime   Timestamp for when the condition last transitioned from one status to another.
lastUpdateTime       Timestamp for when the condition was updated. Usually changes when reason or message in the condition is updated.
reason               Machine-readable, UpperCamelCase text indicating the reason for the condition’s last transition.
message              Human-readable message indicating details about the last status transition.
codes                Well-defined error codes in case the condition reports a problem.

Currently the available Shoot condition types are:

  • APIServerAvailable

    This condition type indicates whether the Shoot’s kube-apiserver is available or not. In particular, the /healthz endpoint of the kube-apiserver is called, and the expected response code is HTTP 200.

  • ControlPlaneHealthy

    This condition type indicates whether all the control plane components deployed to the Shoot’s namespace in the Seed do exist and are running fine.

  • EveryNodeReady

    This condition type indicates whether at least the requested minimum number of Nodes is present per each worker pool and whether all Nodes are healthy.

  • SystemComponentsHealthy

    This condition type indicates whether all system components deployed to the kube-system namespace in the shoot do exist and are running fine. It also reflects whether the tunnel connection between the control plane and the Shoot networks can be established.

The Shoot conditions are maintained by the shoot care control of gardenlet.

Sync Period

The condition checks are executed periodically at interval which is configurable in the GardenletConfiguration (.controllers.shootCare.syncPeriod, defaults to 1m).

Condition Thresholds

The GardenletConfiguration also allows configuring condition thresholds (controllers.shootCare.conditionThresholds). A condition threshold is the amount of time for which a condition is considered as Progressing upon a condition status change.

Let’s check the following example to get a better understanding. Let’s say that the APIServerAvailable condition of our Shoot has status True. If the next condition check fails (for example, the kube-apiserver becomes unreachable), then the condition first goes to the Progressing state. Only if this state remains for the condition threshold amount of time is the condition finally updated to False.

Constraints

Constraints represent conditions of a Shoot’s current state that constrain some operations on it. The current constraints are:

HibernationPossible:

This constraint indicates whether a Shoot is allowed to be hibernated. The rationale behind this constraint is that a Shoot can have ValidatingWebhookConfigurations or MutatingWebhookConfigurations acting on resources that are critical for waking up a cluster. For example, if a webhook has rules for CREATE/UPDATE Pods or Nodes and failurePolicy=Fail, the webhook will block joining Nodes and creating critical system component Pods and thus block the entire wakeup operation, because the server backing the webhook is not running.

Even if the failurePolicy is set to Ignore, high timeouts (>15s) can lead to blocking requests of control plane components. That’s because most control-plane API calls are made with a client-side timeout of 30s, so if a webhook has timeoutSeconds=30, the overall request might still fail as there is overhead in communication with the API server and potentially other webhooks. Generally, it’s best practice to specify low timeouts in WebhookConfigs. Also, it’s best practice to exclude the kube-system namespace from webhooks to avoid blocking critical operations on system components of the cluster. Shoot owners can do so by adding a namespaceSelector similar to this one to their webhook configurations:

namespaceSelector:
  matchExpressions:
  - key: gardener.cloud/purpose
    operator: NotIn
    values:
    - kube-system

If the Shoot still has webhooks with either failurePolicy={Fail,nil} or failurePolicy=Ignore && timeoutSeconds>15 that act on critical resources in the kube-system namespace, Gardener will set the HibernationPossible constraint to False, indicating that the Shoot can probably not be woken up again after hibernation without manual intervention of the Gardener operator. gardener-apiserver will prevent any Shoot with the HibernationPossible constraint set to False from being hibernated, be it via manual hibernation or scheduled hibernation.

By setting .controllers.shootCare.webhookRemediatorEnabled=true in the gardenlet configuration, the auto-remediation of webhooks not following the best practices can be turned on for the shoot clusters. Concretely, missing namespaceSelectors or objectSelectors will be added, and too high timeoutSeconds will be lowered. In some cases, the failurePolicy will be set from Fail to Ignore. Gardenlet will also add an annotation to make it visible to end-users that their webhook configurations were mutated and should be fixed by them in the first place. Note that all of this is not a perfect solution and is just done on a best-effort basis. Only the owner of the webhook can know whether it indeed is problematic and configured correctly.

Webhooks labeled with remediation.webhook.shoot.gardener.cloud/exclude=true will be excluded from auto-remediation.
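
For example, to exclude a particular webhook configuration from the auto-remediation:

kubectl label mutatingwebhookconfiguration <webhook-config-name> remediation.webhook.shoot.gardener.cloud/exclude=true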

MaintenancePreconditionsSatisfied:

This constraint indicates whether all preconditions for a safe maintenance operation are satisfied (see also this document for more information about what happens during a shoot maintenance). As of today, the same checks as for the HibernationPossible constraint are being performed (user-deployed webhooks that might interfere with potential rolling updates of shoot worker nodes). There is no further action being performed on this constraint’s status (maintenance is still being performed). It is meant to make the user aware of potential problems that might occur due to their configuration.

CACertificateValiditiesAcceptable:

This constraint indicates that there is at least one CA certificate which expires in less than 1y. It will not be added to the .status.constraints if there is no such CA certificate. However, if it is visible, then a credentials rotation operation should be considered.

Last Operation

The Shoot status holds information about the last operation that is performed on the Shoot. The last operation field reflects overall progress and the tasks that are currently being executed. Allowed operation types are Create, Reconcile, Delete, Migrate and Restore. Allowed operation states are Processing, Succeeded, Error, Failed, Pending and Aborted. An operation in Error state is an operation that will be retried for a configurable amount of time (controllers.shoot.retryDuration field in GardenletConfiguration, defaults to 12h). If the operation cannot complete successfully for the configured retry duration, it will be marked as Failed. An operation in Failed state is an operation that won’t be retried automatically (to retry such an operation, see Retry failed operation).

Last Errors

The Shoot status also contains information about the last occurred error(s) (if any) during an operation. A LastError consists of the identifier of the task that returned the error, a human-readable message of the error, and error codes (if any) associated with the error.
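
An illustrative, purely hypothetical example of how such an entry might look in the Shoot status (the task and message are made up):

status:
  lastErrors:
  - taskID: "Waiting until shoot infrastructure has been reconciled"
    description: "Failed to create infrastructure: rate limit exceeded"
    codes:
    - ERR_INFRA_RATE_LIMITS_EXCEEDED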

Error Codes

Known error codes are:

  • ERR_INFRA_UNAUTHENTICATED - indicates that the last error occurred due to the client request not being completed because it lacks valid authentication credentials for the requested resource. It is classified as a non-retryable error code.
  • ERR_INFRA_UNAUTHORIZED - indicates that the last error occurred due to the server understanding the request but refusing to authorize it. It is classified as a non-retryable error code.
  • ERR_INFRA_QUOTA_EXCEEDED - indicates that the last error occurred due to infrastructure quota limits. It is classified as a non-retryable error code.
  • ERR_INFRA_RATE_LIMITS_EXCEEDED - indicates that the last error occurred due to exceeded infrastructure request rate limits.
  • ERR_INFRA_DEPENDENCIES - indicates that the last error occurred due to dependent objects on the infrastructure level. It is classified as a non-retryable error code.
  • ERR_RETRYABLE_INFRA_DEPENDENCIES - indicates that the last error occurred due to dependent objects on the infrastructure level, but the operation should be retried.
  • ERR_INFRA_RESOURCES_DEPLETED - indicates that the last error occurred due to depleted resources in the infrastructure.
  • ERR_CLEANUP_CLUSTER_RESOURCES - indicates that the last error occurred due to resources in the cluster that are stuck in deletion.
  • ERR_CONFIGURATION_PROBLEM - indicates that the last error occurred due to a configuration problem. It is classified as a non-retryable error code.
  • ERR_RETRYABLE_CONFIGURATION_PROBLEM - indicates that the last error occurred due to a retryable configuration problem. “Retryable” means that the occurred error is likely to be resolved in an ungraceful manner after a given period of time.
  • ERR_PROBLEMATIC_WEBHOOK - indicates that the last error occurred due to a webhook not following the Kubernetes best practices (https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#best-practices-and-warnings).

Status Label

Shoots will be automatically labeled with the shoot.gardener.cloud/status label. Its value might either be healthy, progressing, unhealthy or unknown depending on the .status.conditions, .status.lastOperation and .status.lastErrors of the Shoot. This can be used as an easy filter method to find shoots based on their “health” status.
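
For example, all unhealthy shoots in a project namespace could be listed like this (a sketch; $NAMESPACE is a placeholder):

$ kubectl get shoots -n $NAMESPACE -l shoot.gardener.cloud/status=unhealthy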

39 - Shoot Supported Architectures

Supported CPU Architectures for Shoot Worker Nodes

Users can create shoot clusters with worker groups having virtual machines of different architectures. The CPU architecture of each worker pool can be specified in the Shoot specification as follows:

Example Usage in a Shoot

spec:
  provider:
    workers:
    - name: cpu-worker
      machine:
        architecture: <some-cpu-architecture> # optional

If no value is specified for the architecture field, it defaults to amd64. For a valid shoot object, a machine type should be present in the respective CloudProfile with the same CPU architecture as specified in the Shoot yaml. Also, a valid machine image should be present in the CloudProfile that supports the required architecture specified in the Shoot worker pool.

Example Usage in a CloudProfile

spec:
  machineImages:
  - name: test-image
    versions:
    - architectures: # optional
      - <architecture-1>
      - <architecture-2>
      version: 1.2.3
  machineTypes:
  - architecture: <some-cpu-architecture>
    cpu: "2"
    gpu: "0"
    memory: 8Gi
    name: test-machine

Currently, Gardener supports the two most widely used CPU architectures:

  • amd64
  • arm64

40 - Shoot Updates

Shoot Updates and Upgrades

This document describes what happens during shoot updates (changes incorporated in a newly deployed Gardener version) and during shoot upgrades (changes of versions controllable by end-users).

Updates

Updates to all aspects of the shoot cluster happen when the gardenlet reconciles the Shoot resource.

When are Reconciliations Triggered

Generally, when you change the specification of your Shoot the reconciliation will start immediately, potentially updating your cluster. Please note that you can also confine the reconciliation triggered due to your specification updates to the cluster’s maintenance time window. Please find more information here.

You can also annotate your shoot with special operation annotations (see this document) which will cause the reconciliation to start due to your actions.
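
For instance, a reconciliation can be triggered manually with the gardener.cloud/operation=reconcile annotation (a sketch; the annotation is typically removed again by Gardener once the operation has been picked up):

$ kubectl annotate shoot -n $NAMESPACE $SHOOT_NAME gardener.cloud/operation=reconcile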

There is also an automatic reconciliation by Gardener. The period, i.e., how often it is performed, depends on the configuration of the Gardener administrators/operators. In some Gardener installations the operators might enable “reconciliation in maintenance time window only” (more information) which will result in at least one reconciliation during the time configured in the Shoot’s .spec.maintenance.timeWindow field.

Which Updates are Applied

As end-users can only control the Shoot resource’s specification but not the used Gardener version, they don’t have any influence on which of the updates are rolled out (other than those settings configurable in the Shoot). A Gardener operator can deploy a new Gardener version at any point in time. Any subsequent reconciliation of Shoots will update them by rolling out the changes incorporated in this new Gardener version.

Some examples for such shoot updates are:

  • Add a new/remove an old component to/from the shoot’s control plane running in the seed, or to/from the shoot’s system components running on the worker nodes.
  • Change the configuration of an existing control plane/system component.
  • Restart of existing control plane/system components (this might result in a short unavailability of the Kubernetes API server, e.g., when etcd or a kube-apiserver itself is being restarted)

Behavioural Changes

Generally, some of such updates (e.g., configuration changes) could theoretically result in a different behaviour of controllers. If such changes were backwards-incompatible, then we usually follow one of these approaches (depending on the concrete change):

  • Only apply the change for new clusters.
  • Expose a new field in the Shoot resource that lets users control this changed behaviour to enable it at a convenient point in time.
  • Put the change behind an alpha feature gate (disabled by default) in the gardenlet (only controllable by Gardener operators) which will be promoted to beta (enabled by default) in subsequent releases (in this case, end-users have no influence on when the behaviour changes - Gardener operators should inform their end-users and provide clear timelines when they will enable the feature gate).

Upgrades

We consider shoot upgrades to change either the

  • Kubernetes version (.spec.kubernetes.version)
  • Kubernetes version of the worker pool if specified (.spec.provider.workers[].kubernetes.version)
  • Machine image version of at least one worker pool (.spec.provider.workers[].machine.image.version)

Generally, an upgrade is also performed through a reconciliation of the Shoot resource, i.e., the same concepts like for shoot updates apply. If an end-user triggers an upgrade (e.g., by changing the Kubernetes version) after a new Gardener version was deployed but before the shoot was reconciled again, then this upgrade might incorporate the changes delivered with this new Gardener version.
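
For illustration, an end-user might trigger such an upgrade by patching the Shoot specification (a sketch; the target version must be available in the respective CloudProfile):

$ kubectl patch shoot -n $NAMESPACE $SHOOT_NAME --type=merge -p '{"spec":{"kubernetes":{"version":"1.24.6"}}}'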

In-Place vs. Rolling Updates

If the Kubernetes patch version is changed then the upgrade happens in-place. This means that the shoot worker nodes remain untouched and only the kubelet process restarts with the new Kubernetes version binary. The same applies for configuration changes of the kubelet.

If the Kubernetes minor version is changed then the upgrade is done in a “rolling update” fashion, similar to how pods in Kubernetes are updated (when backed by a Deployment). The worker nodes will be terminated one after another and replaced by new machines. The existing workload is gracefully drained and evicted from the old worker nodes to new worker nodes, respecting the configured PodDisruptionBudgets (see Kubernetes documentation).

Customize Rolling Update Behaviour of Shoot Worker Nodes

The .spec.provider.workers[] list exposes two fields that you might configure based on your workload’s needs: maxSurge and maxUnavailable. The same concepts as in Kubernetes apply. Additionally, you might customize how the machine-controller-manager (abbrev.: MCM; the component instrumenting this rolling update) behaves. You can configure the following fields in .spec.provider.workers[].machineControllerManager (see the sketch after this list):

  • machineDrainTimeout: Timeout (in duration) used while draining of machine before deletion, beyond which MCM forcefully deletes machine (default: 10m).
  • machineHealthTimeout: Timeout (in duration) used while re-joining (in case of temporary health issues) of machine before it is declared as failed (default: 10m).
  • machineCreationTimeout: Timeout (in duration) used while joining (during creation) of machine before it is declared as failed (default: 10m).
  • maxEvictRetries: Maximum number of times evicts would be attempted on a pod before it is forcibly deleted during draining of a machine (default: 10).
  • nodeConditions: List of case-sensitive node-conditions which will change a machine to a Failed state after the machineHealthTimeout duration. It may further be replaced with a new machine if the machine is backed by a machine-set object (defaults: KernelDeadlock, ReadonlyFilesystem, DiskPressure).
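
A worker pool honoring these settings might look like the following sketch (the values are purely illustrative):

spec:
  provider:
    workers:
    - name: cpu-worker
      maxSurge: 1
      maxUnavailable: 0
      machineControllerManager:
        machineDrainTimeout: 30m
        machineHealthTimeout: 10m
        machineCreationTimeout: 10m
        maxEvictRetries: 20
        nodeConditions:
        - KernelDeadlock
        - DiskPressure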

Rolling Update Triggers

Apart from the above-mentioned triggers, a rolling update of the shoot worker nodes is also triggered for some changes to your worker pool specification (.spec.provider.workers[]), even if you don’t change the Kubernetes or machine image version. The complete list of fields that trigger a rolling update:

  • .spec.kubernetes.version (except for patch version changes)
  • .spec.provider.workers[].machine.image.name
  • .spec.provider.workers[].machine.image.version
  • .spec.provider.workers[].machine.type
  • .spec.provider.workers[].volume.type
  • .spec.provider.workers[].volume.size
  • .spec.provider.workers[].providerConfig
  • .spec.provider.workers[].cri.name
  • .spec.provider.workers[].kubernetes.version (except for patch version changes)
  • .status.credentials.rotation.certificateAuthorities.lastInitiationTime (changed by gardener when a shoot CA rotation is initiated)
  • .status.credentials.rotation.serviceAccountKey.lastInitiationTime (changed by gardener when a shoot service account signing key rotation is initiated)

Generally, the provider extension controllers might have additional constraints for changes leading to rolling updates, so please consult the respective documentation as well.

41 - Shoot Versions

Shoot Kubernetes and Operating System Versioning in Gardener

Motivation

On the one hand, Gardener is responsible for managing the Kubernetes and the Operating System (OS) versions of its Shoot clusters. On the other hand, Gardener needs to be configured and updated based on the availability and support of the Kubernetes and Operating System versions it provides. For instance, the Kubernetes community releases minor versions roughly every three months and usually maintains three minor versions (the current and the last two) with bug fixes and security updates. Patch releases are done more frequently.

When using the term Machine image in the following, we refer to the OS version that comes with the machine image of the node/worker pool of a Gardener Shoot cluster. As such, we are not referring to the CloudProvider-specific machine image, like the AMI for AWS. For more information on how Gardener maps machine image versions to CloudProvider-specific machine images, take a look at the individual Gardener extension providers, such as the provider for AWS.

Gardener should be configured accordingly to reflect the “logical state” of a version. It should be possible to define the Kubernetes or Machine image versions that still receive bug fixes and security patches, and, vice versa, to define the versions that are out of maintenance and potentially vulnerable. Moreover, this allows Gardener to “understand” the current state of a version and act upon it (more information in the following sections).

Overview

As a Gardener operator:

  • I can classify a version based on its logical state (preview, supported, deprecated and expired; see Version Classifications).
  • I can define which Machine image and Kubernetes versions are eligible for the auto update of clusters during the maintenance time.
  • I can disallow the creation of clusters having a certain version (think of severe security issues).

As an end-user/Shoot owner of Gardener:

  • I can get information about which Kubernetes and Machine image versions exist and their classification.
  • I can determine the time when my Shoot cluster’s Machine image and Kubernetes version will be forcefully updated to the next patch or minor version (in case the cluster is running a deprecated version with an expiration date).
  • I can get this information via API from the CloudProfile.

Version Classifications

Administrators can classify versions into four distinct “logical states”: preview, supported, deprecated and expired. The version classification serves as a “point-of-reference” for end-users and also has implications during shoot creation and the maintenance time.

If a version is unclassified, Gardener cannot make those decisions based on the “logical state”. Nevertheless, Gardener can operate without version classifications, and they can be added at any time to the Kubernetes and machine image versions in the CloudProfile.

As a best practice, versions usually start with the classification preview, then are promoted to supported, eventually deprecated and finally expired. This information is programmatically available in the CloudProfiles of the Garden cluster.

  • preview: A preview version is a new version that has not yet undergone thorough testing, possibly a new release, and needs time to be validated. Due to its early age, there is a higher probability of undiscovered issues, and it is therefore not yet recommended for production usage. A Shoot does not update (neither auto-update nor force-update) to a preview version during the maintenance time. Also, preview versions are not considered for defaulting to the highest available version when the patch version is deliberately omitted during Shoot creation. Typically, after a fresh release of a new Kubernetes (e.g. v1.25.0) or Machine image version (e.g. suse-chost 15.4.20220818), the operator tags it as preview until they have gained sufficient experience and regard this version as reliable. Once sufficient trust has been gained, the version can be manually promoted to supported.

  • supported: A supported version is the recommended version for new and existing Shoot clusters. New Shoot clusters should use and existing clusters should update to this version. Typically for Kubernetes versions, the latest patch versions of the current (if not still in preview) and the last 3 minor Kubernetes versions are maintained by the community. An operator could define these versions as being supported (e.g. v1.24.6, v1.23.12 and v1.22.15).

  • deprecated: A deprecated version is a version that approaches the end of its lifecycle and can contain issues which are probably resolved in a supported version. New Shoots should not use this version any more. Existing Shoots will be updated to a newer version if auto-update is enabled (.spec.maintenance.autoUpdate.kubernetesVersion for Kubernetes version auto-update, or .spec.maintenance.autoUpdate.machineImageVersion for machine image version auto-update). Using automatic upgrades, however, does not guarantee that a Shoot runs a non-deprecated version, as the latest version (overall or of the minor version) can be deprecated as well. Deprecated versions should have an expiration date set for eventual expiration.

  • expired: An expired version has an expiration date (based on the Golang time package) in the past. New clusters with that version cannot be created and existing clusters are forcefully migrated to a higher version during the maintenance time.

Below is an example of how the relevant section of the CloudProfile might look:

apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: alicloud
spec:
  kubernetes:
    versions:
      - classification: preview
        version: 1.25.0
      - classification: supported
        version: 1.24.6
      - classification: deprecated
        expirationDate: "2022-11-30T23:59:59Z"
        version: 1.24.5
      - classification: supported
        version: 1.23.12
      - classification: deprecated
        expirationDate: "2023-01-31T23:59:59Z"
        version: 1.23.11
      - classification: supported
        version: 1.22.15
      - classification: deprecated
        version: 1.21.14

Version Requirements (Kubernetes and Machine image)

The Gardener API server enforces the following requirements for versions:

Deletion of a version

  • A version that is in use by a Shoot cannot be deleted from the CloudProfile.

Adding a version

  • A version must not have an expiration date in the past.
  • There can be only one supported version per minor version.
  • The latest Kubernetes version cannot have an expiration date.
  • The latest version for a machine image can have an expiration date. [*]

[*] Useful for cases in which support for a given machine image needs to be deprecated and removed (for example, when the machine image reaches end of life).

Forceful migration of expired versions

If a Shoot is running a version after its expiration date has passed, it will be forcefully migrated during its maintenance time. This happens even if the owner has opted out of automatic cluster updates!

For Machine images, the Shoot’s worker pools will be updated to the latest non-preview version of the pool’s respective image.

For Kubernetes versions, the forceful update picks the latest non-preview patch version of the current minor version.

If the cluster is already on the latest patch version and the latest patch version is also expired, it will continue with the latest patch version of the next consecutive minor Kubernetes version, so it will result in an update of a minor Kubernetes version!

Please note that multiple consecutive minor version upgrades are possible. This can occur if the Shoot is updated to a version that, in turn, is also expired. In this case, the version is again upgraded in the next maintenance time.

Depending on the circumstances described above, it can happen that the cluster receives multiple consecutive minor Kubernetes version updates!

Kubernetes “minor version jumps” are not allowed - meaning it is not possible to skip the update to the consecutive minor version and directly update to any version after that. For instance, the version 1.20.x can only update to a version 1.21.x, not to 1.22.x or any other version. This is because Kubernetes does not guarantee upgradeability in this case, which could lead to broken Shoot clusters. The administrator has to set up the CloudProfile in such a way that consecutive Kubernetes minor versions are available. Otherwise, Shoot clusters will fail to upgrade during the maintenance time.

Consider the CloudProfile below with a Shoot using the Kubernetes version 1.20.12. Even though the version is expired, due to missing 1.21.x versions, the Gardener Controller Manager cannot upgrade the Shoot’s Kubernetes version.

spec:
  kubernetes:
    versions:
    - version: 1.22.8
    - version: 1.22.7
    - version: 1.20.12
      expirationDate: "<expiration date in the past>"

The CloudProfile must specify versions 1.21.x of the consecutive minor version. With the CloudProfile configured as below, the Shoot’s Kubernetes version will be upgraded to version 1.21.10 during the next maintenance time.

spec:
  kubernetes:
    versions:
    - version: 1.22.8
    - version: 1.21.10
    - version: 1.21.09
    - version: 1.20.12
      expirationDate: "<expiration date in the past>"

You might want to read about the Shoot Updates and Upgrades procedures to get to know the effects of such operations.

42 - Supported K8s Versions

Supported Kubernetes Versions

Currently, Gardener supports the following Kubernetes versions:

Garden cluster version

The minimum version of the garden cluster that can be used to run Gardener is 1.20.x.

Seed cluster versions

The minimum version of a seed cluster that can be connected to Gardener is 1.20.x. Please note that Gardener does not support 1.25 seeds yet.

Shoot cluster versions

Gardener itself is capable of spinning up clusters with Kubernetes versions 1.20 up to 1.25. However, the concrete versions that can be used for shoot clusters depend on the installed provider extension. Consequently, please consult the documentation of your provider extension to see which Kubernetes versions are supported for shoot clusters.

👨🏼‍💻 Developers note: This document explains what needs to be done in order to add support for a new Kubernetes version.

43 - Tolerations

Taints and Tolerations for Seeds and Shoots

Similar to taints and tolerations for Nodes and Pods in Kubernetes, the Seed resource supports specifying taints (.spec.taints, see this example) while the Shoot resource supports specifying tolerations (.spec.tolerations, see this example). The feature is used to control scheduling to seeds as well as decisions whether a shoot can use a certain seed.

Compared to Kubernetes, Gardener’s taints and tolerations are heavily stripped down right now and have some behavioral differences. Please read the following explanations carefully if you plan to use them.

Scheduling

When scheduling a new shoot, the gardener-scheduler will filter out all seed candidates whose taints are not tolerated by the shoot. As Gardener’s taints/tolerations don’t support effects yet, you can compare this behaviour with using a NoSchedule effect taint in Kubernetes.

Be reminded that taints/tolerations are not a means to define any affinity or selection for seeds - please use .spec.seedSelector in the Shoot to state such desires.

⚠️ Please note that - unlike how it’s implemented in Kubernetes - a certain seed cluster may only be used when the shoot tolerates all the seed’s taints. This means that specifying .spec.seedName for a seed whose taints are not tolerated will make the gardener-apiserver reject the request.

Consequently, the taints/tolerations feature can be used as means to restrict usage of certain seeds.
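
For illustration, a Seed carrying a custom taint and a Shoot tolerating it might look like this (a sketch; the taint key is a placeholder):

# Seed
spec:
  taints:
  - key: example.foo.io/some-taint

# Shoot
spec:
  tolerations:
  - key: example.foo.io/some-taint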

Toleration Defaults and Whitelist

The Project resource features a .spec.tolerations object that may carry defaults and a whitelist (see this example). The corresponding ShootTolerationRestriction admission plugin (cf. Kubernetes’ PodTolerationRestriction admission plugin) is responsible for evaluating these settings during creation/update of Shoots.

Whitelist

If a shoot gets created or updated with tolerations, then it is validated that only those tolerations may be used which were added to either a) the Project’s .spec.tolerations.whitelist, or b) the global whitelist in the ShootTolerationRestriction’s admission config (see this example).

⚠️ Please note that the tolerations whitelist of Projects can only be changed if the user trying to change it is bound to the modify-spec-tolerations-whitelist custom RBAC role, e.g. via the following ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: full-project-modification-access
rules:
- apiGroups:
  - core.gardener.cloud
  resources:
  - projects
  verbs:
  - create
  - patch
  - update
  - modify-spec-tolerations-whitelist
  - delete

Defaults

If a shoot gets created then the default tolerations specified in both the Project’s .spec.tolerations.defaults and global default list in the ShootTolerationRestriction admission plugin’s configuration will be added to the .spec.tolerations of the Shoot (unless it already specifies a certain key).
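
A Project making use of both settings might look like the following sketch (the toleration key is a placeholder):

apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
  name: my-project
spec:
  tolerations:
    defaults:
    - key: example.foo.io/some-taint
    whitelist:
    - key: example.foo.io/some-taint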

44 - Trouble Shooting Guide

Trouble Shooting Guide

Are there really issues that cannot be fixed :O?

Well, of course not :P. With the continuous development of Gardener, its architecture and API might have to change over time to reduce complexity and support more features. In this process, developers are bound to keep a Gardener version backward compatible with the last two releases. But maintaining backward compatibility is a quite complex and effortful task. So, to save short-term effort, it is common practice in the open-source community to sometimes use workarounds or hacky solutions. This results in rare issues which are supposed to be resolved by human interaction across upgrades of the Gardener version.

This guide records the issues that may well appear across upgrades of the Gardener version, their root cause, and the human action required for a graceful resolution. For troubleshooting bugs which are not yet fixed, please refer to the associated GitHub issue.

Note to maintainers: Please only mention the resolution of issues which are by design. For bugs, please report the temporary resolution on the GitHub issue created for the bug.

Etcd-main pod fails to come up, since the backup-restore sidecar reports RevisionConsistencyCheckErr

Issue

  • The etcd-main pod goes into CrashLoopBackoff.
  • The etcd-backup-restore sidecar reports a validation error with RevisionConsistencyCheckErr.

Environment

  • Gardener version: 0.29.0+

Root Cause

  • From version 0.29.0, Gardener uses a shared backup bucket for storing etcd backups, replacing the old logic of having a single bucket per shoot, as per the proposal.
  • Since the chances that the etcd data directory gets corrupted during this migration are very rare, and to avoid etcd downtime and implementation effort, we decided to switch directly from the old bucket to the new shared bucket without migrating old snapshots from the old bucket to the new one.
  • Just to be on the safe side, a sanity check was added to the etcd-backup-restore sidecar of the etcd-main pod, which checks whether the etcd data revision is greater than the revision of the last snapshot in the old bucket.
  • If the above check fails, some data corruption has surely occurred in etcd, so etcd-backup-restore reports an error and the etcd-main pod goes into CrashLoopBackoff, creating etcd-main down alerts.

Action

  1. Disable Gardener reconciliation for the Shoot by annotating it with shoot.gardener.cloud/ignore=true (see the sketch after this list).
  2. Scale down the etcd-main StatefulSet in the seed cluster.
  3. Find the latest full snapshot and delta snapshots in the old backup bucket. The old backup bucket name is the same as the name of the backupInfra resource associated with the Shoot in the Garden cluster.
  4. Move them manually to the new backup bucket.
  5. Re-enable Gardener reconciliation for the Shoot by removing the shoot.gardener.cloud/ignore=true annotation.
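
The kubectl side of these steps might look like this (a sketch; the project/shoot names and the shoot namespace in the seed are placeholders):

# in the garden cluster
$ kubectl annotate shoot -n garden-my-project my-shoot shoot.gardener.cloud/ignore=true

# in the seed cluster
$ kubectl scale statefulset etcd-main -n shoot--my-project--my-shoot --replicas=0

# after moving the snapshots, in the garden cluster
$ kubectl annotate shoot -n garden-my-project my-shoot shoot.gardener.cloud/ignore-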

45 - Trusted Tls For Control Planes

Trusted TLS certificate for shoot control planes

Shoot clusters are composed of several control plane components deployed by Gardener and the corresponding extensions.

Some components are exposed via Ingress resources which make them addressable under the HTTPS protocol.

Examples:

  • Alertmanager
  • Grafana for operators and end-users
  • Prometheus

Gardener generates the backing TLS certificates which are signed by the shoot cluster’s CA by default (self-signed).

Unlike with a self-contained Kubeconfig file, common internet browsers or operating systems don’t trust a shoot’s cluster CA and adding it as a trusted root is often undesired in enterprise environments.

Therefore, Gardener operators can predefine trusted wildcard certificates under which the mentioned endpoints will be served instead.

Register a trusted wildcard certificate

Since control plane components are published under the ingress domain (core.gardener.cloud/v1beta1.Seed.spec.dns.ingressDomain) a wildcard certificate is required.

For example:

  • Seed ingress domain: dev.my-seed.example.com
  • CN or SAN for certificate: *.dev.my-seed.example.com

A wildcard certificate matches exactly one seed. It must be deployed as part of your landscape setup as a Kubernetes Secret inside the garden namespace of the corresponding seed cluster.

Please ensure that the secret has the gardener.cloud/role label shown below.

apiVersion: v1
data:
  ca.crt: base64-encoded-ca.crt
  tls.crt: base64-encoded-tls.crt
  tls.key: base64-encoded-tls.key
kind: Secret
metadata:
  labels:
    gardener.cloud/role: controlplane-cert
  name: seed-ingress-certificate
  namespace: garden
type: Opaque

Gardener copies the secret during the reconciliation of shoot clusters to the shoot namespace in the seed. Afterwards, Ingress resources in that namespace for the mentioned components will refer to the wildcard certificate.

Best practice

While it is possible to create the wildcard certificates manually and deploy them to seed clusters, it is recommended to let certificate management components do this job. Often, a seed cluster is also a shoot cluster at the same time (ManagedSeed) and might already provide a certificate service extension. Otherwise, a Gardener operator may use solutions like Cert-Management or Cert-Manager.

46 - Worker Pool K8s Versions

Controlling the Kubernetes versions for specific worker pools

Since Gardener v1.36, worker pools can have different Kubernetes versions specified than the control plane.

In earlier Gardener versions, all worker pools inherited the Kubernetes version of the control plane. Once the Kubernetes version of the control plane was modified, all worker pools were updated as well (either by rolling the nodes in case of a minor version change, or in-place for patch version changes).

In order to gracefully perform Kubernetes upgrades (triggering a rolling update of the nodes) with workloads sensitive to restarts (e.g., those dealing with lots of data), it might be required to be able to gradually perform the upgrade process. In such cases, the Kubernetes version for the worker pools can be pinned (.spec.provider.workers[].kubernetes.version) while the control plane Kubernetes version (.spec.kubernetes.version) is updated. This results in the nodes being untouched while the control plane is upgraded. Now a new worker pool (with the version equal to the control plane version) can be added. Administrators can then reschedule their workloads to the new worker pool according to their upgrade requirements and processes.

Example Usage in a Shoot

spec:
  kubernetes:
    version: 1.24.6
  provider:
    workers:
    - name: data1
      kubernetes:
        version: 1.23.13
    - name: data2

  • If .kubernetes.version is not specified in a worker pool, then the Kubernetes version of the kubelet is inherited from the control plane (.spec.kubernetes.version), i.e., in the above example, the data2 pool will use 1.24.6.
  • If .kubernetes.version is specified in a worker pool then it must meet the following constraints:
    • It must be at most two minor versions lower than the control plane version.
    • If it was not specified before, then no downgrade is possible (you cannot set it to 1.23.13 while .spec.kubernetes.version is already 1.24.6). The “two minor version skew” is only possible if the worker pool version is set to the control plane version and the control plane is then gradually updated by two minor versions.
    • If the version is removed from the worker pool, only one minor version difference is allowed to the control plane (you cannot upgrade a pool from version 1.22.0 to 1.24.0 in one go).

Automatic updates of Kubernetes versions (see Shoot Maintenance) also apply to worker pool Kubernetes versions.