This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Concepts

1 - Admission Controller

Gardener Admission Controller

While the Gardener API server works with admission plugins to validate and mutate resources belonging to Gardener related API groups, e.g. core.gardener.cloud, the same is needed for resources belonging to non-Gardener API groups as well, e.g. Secrets in the core API group. Therefore, the Gardener Admission Controller runs a http(s) server with the following handlers which serve as validating/mutating endpoints for admission webhooks. It is also used to serve http(s) handlers for authorization webhooks.

Admission Webhook Handlers

This section describes the admission webhook handlers that are currently served.

Kubeconfig Secret Validator

Malicious Kubeconfigs applied by end users may cause a leakage of sensitive data. This handler checks if the incoming request contains a Kubernetes secret with a .data.kubeconfig field and denies the request if the Kubeconfig structure violates Gardener’s security standards.

Namespace Validator

Namespaces are the backing entities of Gardener projects in which shoot clusters objects reside. This validation handler protects active namespaces against premature deletion requests. Therefore, it denies deletion requests if a namespace still contains shoot clusters or if it belongs to a non-deleting Gardener project (w/o .metadata.deletionTimestamp).

Resource Size Validator

Since users directly apply Kubernetes native objects to the Garden cluster, it also involves the risk of being vulnerable to DoS attacks because these resources are read continuously watched and read by controllers. One example is the creation of Shoot resources with large annotation values (up to 256 kB per value) which can cause severe out-of-memory issues for the Gardenlet component. Vertical autoscaling can help to mitigate such situations, but we cannot expect to scale infinitely, and thus need means to block the attack itself.

The Resource Size Validator checks arbitrary incoming admission requests against a configured maximum size for the resource’s group-version-kind combination and denies the request if the contained object exceeds the quota.

Example for Gardener Admission Controller configuration:

server:
  resourceAdmissionConfiguration:
    limits:
    - apiGroups: ["core.gardener.cloud"]
      apiVersions: ["*"]
      resources: ["shoots"]
      size: 100k
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["secrets"]
      size: 100k
    unrestrictedSubjects:
    - kind: Group
      name: gardener.cloud:system:seeds
      apiGroup: rbac.authorization.k8s.io
 #  - kind: User
 #    name: admin
 #    apiGroup: rbac.authorization.k8s.io
 #  - kind: ServiceAccount
 #    name: "*"
 #    namespace: garden
 #    apiGroup: ""
    operationMode: block #log

With the configuration above, the Resource Size Validator denies requests for shoots with Gardener’s core API group which exceed a size of 100 kB. The same is done for Kubernetes secrets.

As this feature is meant to protect the system from malicious requests sent by users, it is recommended to exclude trusted groups, users or service accounts from the size restriction via resourceAdmissionConfiguration.unrestrictedSubjects. For example, the backing user for the Gardenlet should always be capable of changing the shoot resource instead of being blocked due to size restrictions. This is because the Gardenlet itself occasionally changes the shoot specification, labels or annotations, and might violate the quota if the existing resource is already close to the quota boundary. Also, operators are supposed to be trusted users and subjecting them to a size limitation can inhibit important operational tasks. Wildcard ("*") in subject name is supported.

Size limitations depend on the individual Gardener setup and choosing the wrong values can affect the availability of your Gardener service. resourceAdmissionConfiguration.operationMode allows to control if a violating request is actually denied (default) or only logged. It’s recommended to start with log, check the logs for exceeding requests, adjust the limits if necessary and finally switch to block.

SeedRestriction

Please refer to this document for more information.

Authorization Webhook Handlers

This section describes the authorization webhook handlers that are currently served.

SeedAuthorization

Please refer to this document for more information.

2 - API Server

Gardener API server

The Gardener API server is a Kubernetes-native extension based on its aggregation layer. It is registered via an APIService object and designed to run inside a Kubernetes cluster whose API it wants to extend.

After registration, it exposes the following resources:

CloudProfiles

CloudProfiles are resources that describe a specific environment of an underlying infrastructure provider, e.g. AWS, Azure, etc. Each shoot has to reference a CloudProfile to declare the environment it should be created in. In a CloudProfile the gardener operator specifies certain constraints like available machine types, regions, which Kubernetes versions they want to offer, etc. End-users can read CloudProfiles to see these values, but only operators can change the content or create/delete them. When a shoot is created or updated then an admission plugin checks that only values are used that are allowed via the referenced CloudProfile.

Additionally, a CloudProfile may contain a providerConfig which is a special configuration dedicated for the infrastructure provider. Gardener does not evaluate or understand this config, but extension controllers might need for declaration of provider-specific constraints, or global settings.

Please see this example manifest and consult the documentation of your provider extension controller to get information about its providerConfig.

Seeds

Seeds are resources that represent seed clusters. Gardener does not care about how a seed cluster got created - the only requirement is that it is of at least Kubernetes v1.20 and passes the Kubernetes conformance tests. The Gardener operator has to either deploy the Gardenlet into the cluster they want to use as seed (recommended, then the Gardenlet will create the Seed object itself after bootstrapping), or they provide the kubeconfig to the cluster inside a secret (that is referenced by the Seed resource) and create the Seed resource themselves.

Please see this, this(, and optionally this) example manifests.

ShootQuotas

In order to allow end-users not having their own dedicated infrastructure account to try out Gardener the operator can register an account owned by them that they allow to be used for trial clusters. Trial clusters can be put under quota such that they don’t consume too many resources (resulting in costs), and so that one user cannot consume all resources on their own. These clusters are automatically terminated after a specified time, but end-users may extend the lifetime manually if needed.

Please see this example manifest.

Projects

The first thing before creating a shoot cluster is to create a Project. A project is used to group multiple shoot clusters together. End-users can invite colleagues to the project to enable collaboration, and they can either make them admin or viewer. After an end-user has created a project they will get a dedicated namespace in the garden cluster for all their shoots.

Please see this example manifest.

SecretBindings

Now that the end-user has a namespace the next step is registering their infrastructure provider account.

Please see this example manifest and consult the documentation of the extension controller for the respective infrastructure provider to get information about which keys are required in this secret.

After the secret has been created the end-user has to create a special SecretBinding resource that binds this secret. Later when creating shoot clusters they will reference such a binding.

Please see this example manifest.

Shoots

Shoot cluster contain various settings that influence how end-user Kubernetes clusters will look like in the end. As Gardener heavily relies on extension controllers for operating system configuration, networking, and infrastructure specifics, the end-user has the possibility (and responsibility) to provide these provider-specific configurations as well. Such configurations are not evaluated by Gardener (because it doesn’t know/understand them), but they are only transported to the respective extension controller.

⚠️ This means that any configuration issues/mistake on the end-user side that relates to a provider-specific flag or setting cannot be caught during the update request itself but only later during the reconciliation (unless a validator webhook has been registered in the garden cluster by an operator).

Please see this example manifest and consult the documentation of the provider extension controller to get information about its spec.provider.controlPlaneConfig, .spec.provider.infrastructureConfig, and .spec.provider.workers[].providerConfig.

(Cluster)OpenIDConnectPresets

Please see this separate documentation file.

Overview Data Model

Gardener Overview Data Model

3 - APIServer Admission Plugins

Admission Plugins

Similar to the kube-apiserver, the gardener-apiserver comes with a few in-tree managed admission plugins. If you want to get an overview of the what and why of admission plugins then this document might be a good start.

This document lists all existing admission plugins with a short explanation of what it is responsible for.

ClusterOpenIDConnectPreset, OpenIDConnectPreset

(both enabled by default)

These admission controllers react on CREATE operations for Shoots. If the Shoot does not specify any OIDC configuration (.spec.kubernetes.kubeAPIServer.oidcConfig=nil) then it tries to find a matching ClusterOpenIDConnectPreset or OpenIDConnectPreset, respectively. If there are multiples that match then the one with the highest weight “wins”. In this case, the admission controller will default the OIDC configuration in the Shoot.

ControllerRegistrationResources

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for ControllerRegistrations. It validates that there exists only one ControllerRegistration in the system that is primarily responsible for a given kind/type resource combination. This prevents misconfiguration by the Gardener administrator/operator.

CustomVerbAuthorizer

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Projects. It validates whether the user is bound to a RBAC role with the modify-spec-tolerations-whitelist verb in case the user tries to change the .spec.tolerations.whitelist field of the respective Project resource. Usually, regular project members are not bound to this custom verb, allowing the Gardener administrator to manage certain toleration whitelists on Project basis.

DeletionConfirmation

(enabled by default)

This admission controller reacts on DELETE operations for Projects and Shoots and ShootStates. It validates that the respective resource is annotated with a deletion confirmation annotation, namely confirmation.gardener.cloud/deletion=true. Only if this annotation is present it allows the DELETE operation to pass. This prevents users from accidental/undesired deletions.

ExposureClass

(enabled by default)

This admission controller reacts on Create operations for Shootss. It mutates Shoot resources which has an ExposureClass referenced by merging their both shootSelectors and/or tolerations into the Shoot resource.

ExtensionValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for BackupEntrys, BackupBuckets, Seeds, and Shoots. For all the various extension types in the specifications of these objects, it validates whether there exists a ControllerRegistration in the system that is primarily responsible for the stated extension type(s). This prevents misconfigurations that would otherwise allow users to create such resources with extension types that don’t exist in the cluster, effectively leading to failing reconciliation loops.

ExtensionLabels

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for BackupBuckets, BackupEntrys, CloudProfiles, Seeds, SecretBindings and Shoots. For all the various extension types in the specifications of these objects, it adds a corresponding label in the resource. This would allow extension admission webhooks to filter out the resources they are responsible for and ignore all others. This label is of the form <extension-type>.extensions.gardener.cloud/<extension-name> : "true". For example, an extension label for provider extension type aws, looks like provider.extensions.gardener.cloud/aws : "true".

ProjectValidator

(enabled by default)

This admission controller reacts on CREATE operations for Projects. It prevents creating Projects with a non-empty .spec.namespace if the value in .spec.namespace does not start with garden-.

⚠️ This admission plugin will be removed in a future release and its business logic will be incorporated into the static validation of the gardener-apiserver.

ResourceQuota

(enabled by default)

This admission controller enables object count ResourceQuotas for Gardener resources, e.g. Shoots, SecretBindings, Projects, etc..

⚠️ In addition to this admission plugin, the ResourceQuota controller must be enabled for the Kube-Controller-Manager of your Garden cluster.

ResourceReferenceManager

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for CloudProfiles, Projects, SecretBindings, Seeds, and Shoots. Generally, it checks whether referred resources stated in the specifications of these objects exist in the system (e.g., if a referenced Secret exists). However, it also has some special behaviours for certain resources:

  • CloudProfiles: It rejects removing Kubernetes or machine image versions if there is at least one Shoot that refers to them.
  • Projects: It sets the .spec.createdBy field for newly created Project resources, and defaults the .spec.owner field in case it is empty (to the same value of .spec.createdBy).
  • Seeds: It rejects changing the .spec.settings.shootDNS.enabled value if there is at least one Shoot that refers to this seed.
  • Shoots: It sets the gardener.cloud/created-by=<username> annotation for newly created Shoot resources.

SeedValidator

(enabled by default)

This admission controller reacts on DELETE operations for Seeds. Rejects the deletion if Shoot(s) reference the seed cluster.

ShootDNS

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It tries to assign a default domain to the Shoot if it gets scheduled to a seed that enables DNS for shoots (.spec.settings.shootDNS.enabled=true). It also validates that the DNS configuration (.spec.dns) is not set if the seed disables DNS for shoots.

ShootNodeLocalDNSEnabledByDefault

(disabled by default)

This admission controller reacts on CREATE operations for Shoots. If enabled, it will enable node local dns within the shoot cluster (see this doc) by setting spec.systemComponents.nodeLocalDNS.enabled=true for newly created Shoots. Already existing Shoots and new Shoots that explicitly disable node local dns (spec.systemComponents.nodeLocalDNS.enabled=false) will not be affected by this admission plugin.

ShootQuotaValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It validates the resource consumption declared in the specification against applicable Quota resources. Only if the applicable Quota resources admit the configured resources in the Shoot then it allows the request. Applicable Quotas are referred in the SecretBinding that is used by the Shoot.

ShootVPAEnabledByDefault

(disabled by default)

This admission controller reacts on CREATE operations for Shoots. If enabled, it will enable the managed VerticalPodAutoscaler components (see this doc) by setting spec.kubernetes.verticalPodAutoscaler.enabled=true for newly created Shoots. Already existing Shoots and new Shoots that explicitly disable VPA (spec.kubernetes.verticalPodAutoscaler.enabled=false) will not be affected by this admission plugin.

ShootTolerationRestriction

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It validates the .spec.tolerations used in Shoots against the whitelist of its Project, or against the whitelist configured in the admission controller’s configuration, respectively. Additionally, it defaults the .spec.tolerations in Shoots with those configured in its Project, and those configured in the admission controller’s configuration, respectively.

ShootValidator

(enabled by default)

This admission controller reacts on CREATE, UPDATE and DELETE operations for Shoots. It validates certain configurations in the specification against the referred CloudProfile (e.g., machine images, machine types, used Kubernetes version, …). Generally, it performs validations that cannot be handled by the static API validation due to their dynamic nature (e.g., when something needs to be checked against referred resources). Additionally, it takes over certain defaulting tasks (e.g., default machine image for worker pools).

ShootManagedSeed

(enabled by default)

This admission controller reacts on UPDATE and DELETE operations for Shoots. It validates certain configuration values in the specification that are specific to ManagedSeeds (e.g. the nginx-addon of the Shoot has to be disabled, the Shoot VPA has to be enabled). It rejects the deletion if the Shoot is referred to by a ManagedSeed.

ManagedSeedValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for ManagedSeedss. It validates certain configuration values in the specification against the referred Shoot, for example Seed provider, network ranges, DNS domain, etc. Similarly to ShootValidator, it performs validations that cannot be handled by the static API validation due to their dynamic nature. Additionally, it performs certain defaulting tasks, making sure that configuration values that are not specified are defaulted to the values of the referred Shoot, for example Seed provider, network ranges, DNS domain, etc.

ManagedSeedShoot

(enabled by default)

This admission controller reacts on DELETE operations for ManagedSeeds. It rejects the deletion if there are Shoots that are scheduled onto the Seed that is registered by the ManagedSeed.

ShootDNSRewriting

(disabled by default)

This admission controller reacts on CREATE operations for Shoots. If enabled, it adds a set of common suffixes configured in its admission plugin configuration to the Shoot (spec.systemComponents.coreDNS.rewriting.commonSuffixes) (see this doc). Already existing Shoots will not be affected by this admission plugin.

4 - Architecture

Official Definition - What is Kubernetes?

“Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.”

Introduction - Basic Principle

The foundation of the Gardener (providing Kubernetes Clusters as a Service) is Kubernetes itself, because Kubernetes is the go-to solution to manage software in the Cloud, even when it’s Kubernetes itself (see also OpenStack which is provisioned more and more on top of Kubernetes as well).

While self-hosting, meaning to run Kubernetes components inside Kubernetes, is a popular topic in the community, we apply a special pattern catering to the needs of our cloud platform to provision hundreds or even thousands of clusters. We take a so-called “seed” cluster and seed the control plane (such as the API server, scheduler, controllers, etcd persistence and others) of an end-user cluster, which we call “shoot” cluster, as pods into the “seed” cluster. That means one “seed” cluster, of which we will have one per IaaS and region, hosts the control planes of multiple “shoot” clusters. That allows us to avoid dedicated hardware/virtual machines for the “shoot” cluster control planes. We simply put the control plane into pods/containers and since the “seed” cluster watches them, they can be deployed with a replica count of 1 and only need to be scaled out when the control plane gets under pressure, but no longer for HA reasons. At the same time, the deployments get simpler (standard Kubernetes deployment) and easier to update (standard Kubernetes rolling update). The actual “shoot” cluster consists only out of the worker nodes (no control plane) and therefore the users may get full administrative access to their clusters.

Setting The Scene - Components and Procedure

We provide a central operator UI, which we call the “Gardener Dashboard”. It talks to a dedicated cluster, which we call the “Garden” cluster and uses custom resources managed by an aggregated API server, one of the general extension concepts of Kubernetes) to represent “shoot” clusters. In this “Garden” cluster runs the “Gardener”, which is basically a Kubernetes controller that watches the custom resources and acts upon them, i.e. creates, updates/modifies, or deletes “shoot” clusters. The creation follows basically these steps:

  • Create a namespace in the “seed” cluster for the “shoot” cluster which will host the “shoot” cluster control plane
  • Generate secrets and credentials which the worker nodes will need to talk to the control plane
  • Create the infrastructure (using Terraform), which basically consists out of the network setup)
  • Deploy the “shoot” cluster control plane into the “shoot” namespace in the “seed” cluster, containing the “machine-controller-manager” pod
  • Create machine CRDs in the “seed” cluster, describing the configuration and the number of worker machines for the “shoot” (the machine-controller-manager watches the CRDs and creates virtual machines out of it)
  • Wait for the “shoot” cluster API server to become responsive (pods will be scheduled, persistent volumes and load balancers are created by Kubernetes via the respective cloud provider)
  • Finally we deploy kube-system daemons like kube-proxy and further add-ons like the dashboard into the “shoot” cluster and the cluster becomes active

Overview Architecture Diagram

Gardener Overview Architecture Diagram

Detailed Architecture Diagram

Gardener Detailed Architecture Diagram

Note: The kubelet as well as the pods inside the “shoot” cluster talk through the front-door (load balancer IP; public Internet) to its “shoot” cluster API server running in the “seed” cluster. The reverse communication from the API server to the pod, service, and node networks happens through a VPN connection that we deploy into “seed” and “shoot” clusters.

5 - Backup Restore

Backup and restore

Kubernetes uses Etcd as the key-value store for its resource definitions. Gardener supports the backup and restore of etcd. It is the responsibility of the shoot owners to backup the workload data.

Gardener uses etcd-backup-restore component to backup the etcd backing the Shoot cluster regularly and restore in case of disaster. It is deployed as sidecar via etcd-druid. This doc mainly focuses on the backup and restore configuration used by Gardener when deploying these components. For more details on the design and internal implementation details, please refer GEP-06 and documentation on individual repository.

Bucket provisioning

Refer the backup bucket extension document to know details about configuring backup bucket.

Backup Policy

etcd-backup-restore supports full snapshot and delta snapshots over full snapshot. In Gardener, this configuration is currently hard-coded to following parameters:

  • Full Snapshot Schedule:
    • Daily, 24hr interval.
    • For each Shoot, the schedule time in a day is randomized based on the configured Shoot maintenance window.
  • Delta Snapshot schedule:
    • At 5min interval.
    • If aggregated events size since last snapshot goes beyond 100Mib.
  • Backup History / Garbage backup deletion policy:
    • Gardener configure backup restore to have Exponential garbage collection policy.
    • As per policy, following backups are retained.
    • All full backups and delta backups for the previous hour.
    • Latest full snapshot of each previous hour for the day.
    • Latest full snapshot of each previous day for 7 days.
    • Latest full snapshot of the previous 4 weeks.
    • Garbage Collection is configured at 12hr interval.
  • Listing:
    • Gardener don’t have any API to list out the backups.
    • To find the backup list, admin can checkout the BackupEntry resource associated with Shoot which holds the bucket and prefix details on object store.

Restoration

Restoration process of etcd is automated through the etcd-backup-restore component from latest snapshot. Gardener dosen’t support Point-In-Time-Recovery (PITR) of etcd. In case of etcd disaster, the etcd is recovered from latest backup automatically. For further details, please refer the doc. Post restoration of etcd, the Shoot reconciliation loop brings back the cluster to same state.

Again, Shoot owner is responsible for maintaining the backup/restore of his workload. Gardener does only take care of the cluster’s etcd.

6 - Cluster API

Relation between Gardener API and Cluster API (SIG Cluster Lifecycle)

In essence, the Cluster API harmonizes how to get to clusters, while Gardener goes one step further and also harmonizes the clusters themselves. The Cluster API delegates the specifics to so-called providers for infrastructures or control planes via specific CR(D)s while Gardener only has one cluster CR(D). Different Cluster API providers, e.g. for AWS, Azure, GCP, etc. give you vastly different Kubernetes clusters. In contrast, Gardener gives you the exact same clusters with the exact same K8s version, operating system, control plane configuration like for API server or kubelet, add-ons like overlay network, HPA/VPA, DNS and certificate controllers, ingress and network policy controllers, control plane monitoring and logging stacks, down to the behavior of update procedures, auto-scaling, self-healing, etc. on all supported infrastructures. These homogeneous clusters are an essential goal for Gardener as its main purpose is to simplify operations for teams that need to develop and ship software on Kubernetes clusters on a plethora of infrastructures (a.k.a. multi-cloud).

Incidentally, Gardener influenced the Machine API in the Cluster API with its Machine Controller Manager and was the first to adopt it, see also joint SIG Cluster Lifecycle KubeCon talk where @hardikdr from our Gardener team in India spoke.

That means, we follow the Cluster API with great interest and are active members. It was completely overhauled from v1alpha1 to v1alpha2. But because v1alpha2 made too many assumptions about the bring-up of masters and was enforcing master machine operations (see here: “As of v1alpha2, Machine-Based is the only control plane type that Cluster API supports”), services that managed their control planes differently like GKE or Gardener couldn’t adopt it (e.g. Google only supports v1alpha1). In 2020 v1alpha3 was introduced and made it possible (again) to integrate managed services like GKE or Gardener. The mapping from the Gardener API to the Cluster API is mostly syntactic.

To wrap it up, while the Cluster API knows about clusters, it doesn’t know about their make-up. With Gardener, we wanted to go beyond that and harmonize the make-up of the clusters themselves and make them homogeneous across all supported infrastructures. Gardener can therefore deliver homogeneous clusters with exactly the same configuration and behavior on all infrastructures (see also Gardener’s coverage in the official conformance test grid).

With Cluster API v1alpha3 and the support for declarative control plane management, it became now possible (again) to enable Kubernetes managed services like GKE or Gardener. We would be more than happy, if the community would be interested, to contribute a Gardener control plane provider.

7 - Controller Manager

Gardener Controller Manager

The gardener-controller-manager (often refered to as “GCM”) is a component that runs next to the Gardener API server, similar to the Kubernetes Controller Manager. It runs several controllers that do not require talking to any seed or shoot cluster. Also, as of today it exposes an HTTP server that is serving several health check endpoints and metrics.

This document explains the various functionalities of the gardener-controller-manager and their purpose.

Controllers

Bastion Controller

Bastion resources have a limited lifetime which can be extended up to a certain amount by performing a heartbeat on them. The Bastion controller is responsible for deleting expired or rotten Bastions.

  • “expired” means a Bastion has exceeded its status.expirationTimestamp.
  • “rotten” means a Bastion is older than the configured maxLifetime.

The maxLifetime defaults to 24 hours and is an option in the BastionControllerConfiguration which is part of gardener-controller-managers ControllerManagerControllerConfiguration, see the example config file for details.

The controller also deletes Bastions in case the referenced Shoot

  • does no longer exist
  • is marked for deletion (i.e., have a non-nil .metadata.deletionTimestamp)
  • was migrated to another seed (i.e., Shoot.spec.seedName is different than Bastion.spec.seedName).

The deletion of Bastions triggers the gardenlet to perform the necessary cleanups in the Seed cluster, so some time can pass between deletion and the Bastion actually disappearing. Clients like gardenctl are advised to not re-use Bastions whose deletion timestamp has been set already.

Refer to GEP-15 for more information on the lifecycle of Bastion resources.

CertificateSigningRequest Controller

After the gardenlet gets deployed on the Seed cluster it needs to establish itself as a trusted party to communicate with the Gardener API server. It runs through a bootstrap flow similar to the kubelet bootstrap process.

On startup the gardenlet uses a kubeconfig with a bootstrap token which authenticates it as being part of the system:bootstrappers group. This kubeconfig is used to create a CertificateSigningRequest (CSR) against the Gardener API server.

The controller in gardener-controller-manager checks whether the CertificateSigningRequest has the expected organisation, common name and usages which the gardenlet would request.

It only auto-approves the CSR if the client making the request is allowed to “create” the certificatesigningrequests/seedclient subresource. Clients with the system:bootstrappers group are bound to the gardener.cloud:system:seed-bootstrapper ClusterRole, hence, they have such privileges. As the bootstrap kubeconfig for the gardenlet contains a bootstrap token which is authenticated as being part of the systems:bootstrappers group, its created CSR gets auto-approved.

CloudProfile Controller

CloudProfiles are essential when it comes to reconciling Shoots since they contain constraints (like valid machine types, Kubernetes versions, or machine images) and sometimes also some global configuration for the respective environment (typically via provider-specific configuration in .spec.providerConfig).

Consequently, to ensure that CloudProfiles in-use are always present in the system until the last referring Shoot gets deleted, the controller adds a finalizer which is only released when there is no Shoot referencing the CloudProfile anymore.

ControllerDeployment Controller

Extensions are registered in garden cluster via ControllerRegistration and deployment of respective extensions are specified via ControllerDeployment. For more info refer here.

This controller ensures that ControllerDeployment in-use always exists until the last ControllerRegistration referencing them gets deleted. The controller adds a finalizer which is only released when there is no ControllerRegistration referencing the ControllerDeployment anymore.

ControllerRegistration Controller

The ControllerRegistration controller makes sure that the required Gardener Extensions specified by the ControllerRegistration resources are present in the seed clusters. It also takes care of the creation and deletion of ControllerInstallation objects for a given seed cluster. The controller has three reconciliation loops.

“Main” Reconciler

This reconciliation loop watches the Seed objects and determines which ControllerRegistrations are required for them and reconciles the corresponding ControllerInstallation resources to reach the determined state. To begin with, it computes the kind/type combinations of extensions required for the seed. For this, the controller examines a live list of ControllerRegistrations, ControllerInstallations, BackupBuckets, BackupEntrys, Shoots, and Secrets from the garden cluster. For example, it examines the shoots running on the seed and deducts kind/type like Infrastructure/gcp. It also decides whether they should always be deployed based on the .spec.deployment.policy. For the configuration options, please see this section.

Based on these required combinations, each of them are mapped to ControllerRegistration objects and then to their corresponding ControllerInstallation objects (if existing). The controller then creates or updates the required ControllerInstallation objects for the given seed. It also deletes every existing ControllerInstallation whose referenced ControllerRegistration is not part of the required list. For example, if the shoots in the seed are no longer using the DNS provider aws-route53, then the controller proceeds to delete the respective ControllerInstallation object.

“ControllerRegistration” Reconciler

This reconciliation loop watches the ControllerRegistration resource and adds finalizers to it when they are created. In case a deletion request comes in for the resource, i.e., if a .metadata.deletionTimestamp is set, it actively scans for a ControllerInstallation resource using this ControllerRegistration, and decides whether the deletion can be allowed. In case no related ControllerInstallation is present, it removes the finalizer and marks it for deletion.

“Seed” Reconciler

This loop also watches the Seed object and adds finalizers to it at creation. If a .metadata.deletionTimestamp is set for the seed then the controller checks for existing ControllerInstallation objects which reference this seed. If no such objects exist then it removes the finalizer and allows the deletion.

Event Controller

With the Gardener Event Controller you can prolong the lifespan of events related to Shoot clusters. This is an optional controller which will become active once you provide the below mentioned configuration.

All events in K8s are deleted after a configurable time-to-live (controlled via a kube-apiserver argument called --event-ttl (defaulting to 1 hour)). The need to prolong the time-to-live for Shoot cluster events frequently arises when debugging customer issues on live systems. This controller leaves events involving Shoots untouched while deleting all other events after a configured time. In order to activate it, provide the following configuration:

  • concurrentSyncs: The amount of goroutines scheduled for reconciling events.
  • ttlNonShootEvents: When an event reaches this time-to-live it gets deleted unless it is a Shoot-related event (defaults to 1h, equivalent to the event-ttl default).

⚠️ In addition, you should also configure the --event-ttl for the kube-apiserver to define an upper-limit of how long Shoot-related events should be stored. The --event-ttl should be larger than the ttlNonShootEvents or this controller will have no effect.

ExposureClass Controller

ExposureClass abstracts the ability to expose a Shoot clusters control plane in certain network environments (e.g. corporate networks, DMZ, internet) on all Seeds or a subset of the Seeds. For more refer.

Consequently, to ensure that ExposureClasss in-use are always present in the system until the last referring Shoot gets deleted, the controller adds a finalizer which is only released when there is no Shoot referencing the ExposureClass anymore.

ManagedSeedSet Controller

ManagedSeedSet objects maintain a stable set of replicas of ManagedSeeds, i.e. they guarantee the availability of a specified number of identical ManagedSeeds on an equal number of identical Shoots. ManagedSeedSet controller creates and deletes ManagedSeeds and Shoots in response to changes to the replicas and selector fields. For more info refer to the ManagedSeedSet proposal document.

  1. The reconciler first gets all the replicas of the given ManagedSeedSet in the ManagedSeedSet’s namespace and with the matching selector. Each replica is a struct that contains a ManagedSeed, its corresponding Seed and Shoot objects.
  2. Then the pending replica is retrieved if it exists.
  3. Next it determines the ready, postponed and deletable replica.
    • A replica is considered ready when a Seed owned by a ManagedSeed has been registered either directly or by deploying gardenlet into a Shoot, the Seed is Ready and the Shoot’s status is Healthy.
    • If a replica is not ready and it is not pending, i.e. it is not specified in the ManagedSeed’s status.pendingReplica field, then it is added to the postponed replicas.
    • A replica is deletable if it has no scheduled Shoots and replica’s Shoot and ManagedSeed do not have the seedmanagement.gardener.cloud/protect-from-deletion annotation.
  4. Finally, it checks the actual and target replica counts. If the actual count is less than the target count, the controller scales up the replicas by creating new replicas to match the desired target count. If the actual count is more than the target, the controller deletes replicas to match the desired count. Before scale-out or scale-in, the controller first reconciles the pending replica (there can always only be one) and makes sure the replica is ready before moving on to the next one.
    • Scale-out(actual count < target count)
      • During the scale-out phase, the controller first creates the Shoot object from the ManagedSeedSet’s spec.shootTemplate field and adds the replica to the status.pendingReplica of the ManagedSeedSet.
      • For the subsequent reconciliation steps, the controller makes sure that the pending replica is ready before proceeding to the next replica. Once the Shoot is created successfully, the ManagedSeed object is created from the ManagedSeedSet’s spec.template. The ManagedSeed object is reconciled by the ManagedSeed controller and a Seed object is created for the replica. Once the replica’s Seed becomes ready and the Shoot becomes healthy, the replica also becomes ready.
    • Scale-in(actual count > target count)
      • During the scale-in phase, the controller first determines the replica that can be deleted. From the deletable replicas, it chooses the one with the lowest priority and deletes it. Priority is determined in the following order:
        • First, compare replica statuses. Replicas with “less advanced” status are considered lower priority. For example, a replica with StatusShootReconciling status has a lower value than a replica with StatusShootReconciled status. Hence, in this case, replica with StatusShootReconciling status will have lower priority and will be considered for deletion.
        • Then, the replicas are compared with the readiness of their Seeds. Replicas with non-ready Seeds are considered lower priority.
        • Then, the replicas are compared with the health statuses of their Shoots. Replicas with “worse” statuses are considered lower priority.
        • Finally, the replica ordinals are compared. Replicas with lower ordinals are considered lower priority.

Quota Controller

Quota object limits the resources consumed by shoot clusters either per provider secret or per project/namespace.

Consequently, to ensure that Quotas in-use are always present in the system until the last SecretBinding that references them gets deleted, the controller adds a finalizer which is only released when there is no SecretBinding referencing the Quota anymore.

Project Controller

There are multiple controllers responsible for different aspects of Project objects. Please also refer to the Project documentation.

“Main” Reconciler

This reconciler manages a dedicated Namespace for each Project. The namespace name can either be specified explicitly in .spec.namespace (must be prefixed with garden-) or it will be determined by the controller. If .spec.namespace is set, it tries to create it. If it already exists, it tries to adopt it. This will only succeed if the Namespace was previously labeled with gardener.cloud/role=project and project.gardener.cloud/name=<project-name>. This is to prevent that end-users can adopt arbitrary namespaces and escalate their privileges, e.g. the kube-system namespace.

After the namespace was created/adopted the controller creates several ClusterRoles and ClusterRoleBindings that allow the project members to access related resources based on their roles. These RBAC resources are prefixed with gardener.cloud:system:project{-member,-viewer}:<project-name>. Gardener administrators and extension developers can define their own roles, see this document for more information.

In addition, operators can configure the Project controller to maintain a default ResourceQuota for project namespaces. Quotas can especially limit the creation of user facing resources, e.g. Shoots, SecretBindings, Secrets and thus protect the Garden cluster from massive resource exhaustion but also enable operators to align quotas with respective enterprise policies.

⚠️ Gardener itself is not exempted from configured quotas. For example, Gardener creates Secrets for every shoot cluster in the project namespace and at the same time increases the available quota count. Please mind this additional resource consumption.

The controller configuration provides a template section controllers.project.quotas where such a ResourceQuota (see example below) can be deposited.

controllers:
  project:
    quotas:
    - config:
        apiVersion: v1
        kind: ResourceQuota
        spec:
          hard:
            count/shoots.core.gardener.cloud: "100"
            count/secretbindings.core.gardener.cloud: "10"
            count/secrets: "800"
      projectSelector: {}

The Project controller takes the specified config and creates a ResourceQuota with the name gardener in the project namespace. If a ResourceQuota resource with the name gardener already exists, the controller will only update fields in spec.hard which are unavailable at that time. This is done to configure a default Quota in all projects but to allow manual quota increases as the projects’ demands increase. spec.hard fields in the ResourceQuota object that are not present in the configuration are removed from the object. Labels and annotations on the ResourceQuota config get merged with the respective fields on existing ResourceQuotas. An optional projectSelector narrows down the amount of projects that are equipped with the given config. If multiple configs match for a project, then only the first match in the list is applied to the project namespace.

The .status.phase of the Project resources is set to Ready or Failed by the reconciler to indicate whether the reconciliation loop was performed successfully. Also, it generates Events to provide further information about its operations.

When a Project is marked for deletion, the controller ensures that there are no Shoots left in the project namespace. Once all Shoots are gone, the Namespace and Project is released.

“Stale Projects” Reconciler

As Gardener is a large-scale Kubernetes as a Service it is designed for being used by a large amount of end-users. Over time, it is likely to happen that some of the hundreds or thousands of Project resources are no longer actively used.

Gardener offers the “stale projects” reconciler which will take care of identifying such stale projects, marking them with a “warning”, and eventually deleting them after a certain time period. This reconciler is enabled by default and works as following:

  1. Projects are considered as “stale”/not actively used when all of the following conditions apply: The namespace associated with the Project does not have any…
    1. Shoot resources.
    2. BackupEntry resources.
    3. Secret resources that are referenced by a SecretBinding that is in use by a Shoot (not necessarily in the same namespace).
    4. Quota resources that are referenced by a SecretBinding that is in use by a Shoot (not necessarily in the same namespace).
    5. The time period when the project was used for the last time (status.lastActivityTimestamp) is longer than the configured minimumLifetimeDays

If a project is considered “stale” then its .status.staleSinceTimestamp will be set to the time when it was first detected to be stale. If it gets actively used again this timestamp will be removed. After some time the .status.staleAutoDeleteTimestamp will be set to a timestamp after which Gardener will auto-delete the Project resource if it still is not actively used.

The component configuration of the gardener-controller-manager offers to configure the following options:

  • minimumLifetimeDays: Don’t consider newly created Projects as “stale” too early to give people/end-users some time to onboard and get familiar with the system. The “stale project” reconciler won’t set any timestamp for Projects younger than minimumLifetimeDays. When you change this value then projects marked as “stale” may be no longer marked as “stale” in case they are young enough, or vice versa.
  • staleGracePeriodDays: Don’t compute auto-delete timestamps for stale Projects that are unused for only less than staleGracePeriodDays. This is to not unnecessarily make people/end-users nervous “just because” they haven’t actively used their Project for a given amount of time. When you change this value then already assigned auto-delete timestamps may be removed again if the new grace period is not yet exceeded.
  • staleExpirationTimeDays: Expiration time after which stale Projects are finally auto-deleted (after .status.staleSinceTimestamp). If this value is changed and an auto-delete timestamp got already assigned to the projects then the new value will only take effect if it’s increased. Hence, decreasing the staleExpirationTimeDays will not decrease already assigned auto-delete timestamps.

Gardener administrators/operators can exclude specific Projects from the stale check by annotating the related Namespace resource with project.gardener.cloud/skip-stale-check=true.

“Activity” Reconciler

Since the other two reconcilers are unable to actively monitor the relevant objects that are used in a Project (Shoot, Secret, etc.), there could be a situation where the user creates and deletes objects in a short period of time. In that case the Stale Project Reconciler could not see that there was any activity on that project and it will still mark it as a Stale, even though it is actively used.

The Project Activity Reconciler is implemented to take care of such cases. An event handler will notify the reconciler for any acitivity and then it will update the status.lastActivityTimestamp. This update will also trigger the Stale Project Reconciler.

SecretBinding Controller

SecretBindings reference Secrets and Quotas and are themselves referenced by Shoots. The controller adds finalizers to the referenced objects to ensure they don’t get deleted while still being referenced. Similarly, to ensure that SecretBindings in-use are always present in the system until the last referring Shoot gets deleted, the controller adds a finalizer which is only released when there is no Shoot referencing the SecretBinding anymore.

Referenced Secrets will also be labeled with provider.shoot.gardener.cloud/<type>=true where <type> is the value of the .provider.type of the SecretBinding. Also, all referenced Secrets as well as Quotas will be labeled with reference.gardener.cloud/secretbinding=true to allow easily filtering for objects referenced by SecretBindings.

Seed Controller

The Seed controller in the gardener-controller-manager reconciles Seed objects with the help of the following reconcilers.

“Main” Reconciler

This reconciliation loop takes care about seed related operations in the Garden cluster. When a new Seed object is created the reconciler creates a new Namespace in the garden cluster seed-<seed-name>. Namespaces dedicated to single seed clusters allow us to segregate access permissions i.e., a gardenlet must not have permissions to access objects in all Namespaces in the Garden cluster. There are objects in a Garden environment which are created once by the operator e.g., default domain secret, alerting credentials, and required for operations happening in the gardenlet. Therefore, we not only need a seed specific Namespace but also a copy of these “shared” objects.

The “main” reconciler takes care about this replication:

KindNamespaceLabel Selector
Secretgardengardener.cloud/role

“Backup Buckets Check” Reconciler

Every time a BackupBucket object is created or updated, the referenced Seed object is enqueued for reconciliation. It’s the reconciler’s task to check the status subresource of all existing BackupBuckets that reference this Seed. If at least one BackupBucket has .status.lastError != nil, the BackupBucketsReady condition on the Seed will be set to False, and consequently the Seed is considered as NotReady. If the SeedBackupBucketsCheckControllerConfiguration (which is part of gardener-controller-managers component configuration) contains a conditionThreshold for the BackupBucketsReady, the condition will instead first be set to Progressing and eventually to False once the conditionThreshold expires, see the example config file for details. Once the BackupBucket is healthy again, the seed will be re-queued and the condition will turn true.

“Extensions Check” Reconciler

This reconciler reconciles Seed objects and checks whether all ControllerInstallations referencing them are in a healthy state. Concretely, all three conditions Valid, Installed, and Healthy must have status True and the Progressing condition must have status False. Based on this check, it maintains the ExtensionsReady condition in the respective Seed’s .status.conditions list.

“Lifecycle” Reconciler

The “Lifecycle” reconciler processes Seed objects which are enqueued every 10 seconds in order to check if the responsible gardenlet is still responding and operable. Therefore, it checks renewals via Lease objects of the seed in the garden cluster which are renewed regularly by the gardenlet.

In case a Lease is not renewed for the configured amount in config.controllers.seed.monitorPeriod.duration:

  1. The reconciler assumes that the gardenlet stopped operating and updates the GardenletReady condition to Unknown.
  2. Additionally, conditions and constraints of all Shoot resources scheduled on the affected seed are set to Unknown as well because a striking gardenlet won’t be able to maintain these conditions any more.
  3. If the gardenlet’s client certificate has expired (identified based on the .status.clientCertificateExpirationTimestamp field in the Seed resource) and if it is managed by a ManagedSeed then this will be triggered for a reconciliation. This will trigger the bootstrapping process again and allows gardenlets to obtain a fresh client certificate.

Shoot Controller

“Conditions” Reconciler

In case the reconciled Shoot is registered via a ManagedSeed as a seed cluster, this reconciler merges the conditions in the respective Seed’s .status.conditions into the .status.conditions of the Shoot. This is to provide a holistic view on the status of the registered seed cluster by just looking at the Shoot resource.

“Hibernation” Reconciler

This reconciler is responsible for hibernating or awakening shoot clusters based on the schedules defined in their .spec.hibernation.schedules. It ignores failed Shoots and those marked for deletion.

“Maintenance” Reconciler

This reconciler is responsible for maintaining shoot clusters based on the time window defined in their .spec.maintenance.timeWindow. It might auto-update the Kubernetes version or the operating system versions specified in the worker pools (.spec.provider.workers). It could also add some operation or task annotations, read more here.

“Quota” Reconciler

This reconciler might auto-delete shoot clusters in case their referenced SecretBinding is itself referencing a Quota with .spec.clusterLifetimeDays != nil. If the shoot cluster is older than the configured lifetime then it gets deleted. It maintains the expiration time of the Shoot in the value of the shoot.gardener.cloud/expiration-timestamp annotation. This annotation might be overridden, however only by at most twice the value of the .spec.clusterLifetimeDays.

“Reference” Reconciler

Shoot objects may specify references to other objects in the Garden cluster which are required for certain features. For example, users can configure various DNS providers via .spec.dns.providers and usually need to refer to a corresponding Secret with valid DNS provider credentials inside. Such objects need a special protection against deletion requests as long as they are still being referenced by one or multiple shoots.

Therefore, this reconciler checks Shoots for referenced objects and adds the finalizer gardener.cloud/reference-protection to their .metadata.finalizers list. The reconciled Shoot also gets this finalizer to enable a proper garbage collection in case the gardener-controller-manager is offline at the moment of an incoming deletion request. When an object is not actively referenced anymore because the Shoot specification has changed or all related shoots were deleted (are in deletion), the controller will remove the added finalizer again so that the object can safely be deleted or garbage collected.

This reconciler inspects the following references:

  • DNS provider secrets (.spec.dns.provider)
  • Audit policy configmaps (.spec.kubernetes.kubeAPIServer.auditConfig.auditPolicy.configMapRef)

Further checks might be added in the future.

“Retry” Reconciler

This reconciler is responsible for retrying certain failed Shoots. Currently, the reconciler retries only failed Shoots with error code ERR_INFRA_RATE_LIMITS_EXCEEDED, see this document for more details.

“Status Label” Reconciler

This reconciler is responsible for maintaining the shoot.gardener.cloud/status label on Shoots, see this document for more details.

8 - Etcd

etcd - Key-Value Store for Kubernetes

etcd is a strongly consistent key-value store and the most prevalent choice for the Kubernetes persistence layer. All API cluster objects like Pods, Deployments, Secrets, etc. are stored in etcd which makes it an essential part of a Kubernetes control plane.

Shoot cluster persistence

Each shoot cluster gets its very own persistence for the control plane. It runs in the shoot namespace on the respective seed cluster. Concretely, there are two etcd instances per shoot cluster which the Kube-Apiserver is configured to use in the following way:

  • etcd-main

A store that contains all “cluster critical” or “long-term” objects. These object kinds are typically considered for a backup to prevent any data loss.

  • etcd-events

A store that contains all Event objects (events.k8s.io) of a cluster. Events have usually a short retention period, occur frequently but are not essential for a disaster recovery.

The setup above prevents both, the critical etcd-main is not flooded by Kubernetes Events as well as backup space is not occupied by non-critical data. This segmentation saves time and resources.

etcd Operator

Configuring, maintaining and health-checking etcd is outsourced to a dedicated operator called ETCD Druid. When Gardenlet reconciles a Shoot resource, it creates or updates an Etcd resources in the seed cluster, containing necessary information (backup information, defragmentation schedule, resources, etc.) etcd-druid needs to manage the lifecycle of the desired etcd instance (today main or events). Likewise, when the shoot is deleted, Gardenlet deletes the Etcd resource and ETCD Druid takes care about cleaning up all related objects, e.g. the backing StatefulSet.

Autoscaling

Gardenlet maintains HVPA objects for etcd StatefulSets if the corresponding feature gate is enabled. This enables a vertical scaling for etcd. Downscaling is handled more pessimistic to prevent many subsequent etcd restarts. Thus, for production and infrastructure clusters downscaling is deactivated and for all other clusters lower advertised requests/limits are only applied during a shoot’s maintenance time window.

Backup

If Seeds specify backups for etcd (example), then Gardener and the respective provider extensions are responsible for creating a bucket on the cloud provider’s side (modelled through BackupBucket resource). The bucket stores backups of shoots scheduled on that seed. Furthermore, Gardener creates a BackupEntry which subdivides the bucket and thus makes it possible to store backups of multiple shoot clusters.

The etcd-main instance itself is configured to run with a special backup-restore sidecar. It takes care about regularly backing up etcd data and restoring it in case of data loss. More information can be found on the component’s GitHub page https://github.com/gardener/etcd-backup-restore.

How long backups are stored in the bucket after a shoot has been deleted, depends on the configured retention period in the Seed resource. Please see this example configuration for more information.

Housekeeping

etcd maintenance tasks must be performed from time to time in order to re-gain database storage and to ensure the system’s reliability. The backup-restore sidecar takes care about this job as well. Gardener chooses a random time within the shoot’s maintenance time to schedule these tasks.

9 - Gardenlet

Gardenlet

Gardener is implemented using the operator pattern: It uses custom controllers that act on our own custom resources, and apply Kubernetes principles to manage clusters instead of containers. Following this analogy, you can recognize components of the Gardener architecture as well-known Kubernetes components, for example, shoot clusters can be compared with pods, and seed clusters can be seen as worker nodes.

The following Gardener components play a similar role as the corresponding components in the Kubernetes architecture:

Gardener ComponentKubernetes Component
gardener-apiserverkube-apiserver
gardener-controller-managerkube-controller-manager
gardener-schedulerkube-scheduler
gardenletkubelet

Similar to how the kube-scheduler of Kubernetes finds an appropriate node for newly created pods, the gardener-scheduler of Gardener finds an appropriate seed cluster to host the control plane for newly ordered clusters. By providing multiple seed clusters for a region or provider, and distributing the workload, Gardener also reduces the blast radius of potential issues.

Kubernetes runs a primary “agent” on every node, the kubelet, which is responsible for managing pods and containers on its particular node. Decentralizing the responsibility to the kubelet has the advantage that the overall system is scalable. Gardener achieves the same for cluster management by using a gardenlet as primary “agent” on every seed cluster, and is only responsible for shoot clusters located in its particular seed cluster:

Counterparts in the Gardener Architecture and the Kubernetes Architecture

The gardener-controller-manager has controllers to manage resources of the Gardener API. However, instead of letting the gardener-controller-manager talk directly to seed clusters or shoot clusters, the responsibility isn’t only delegated to the gardenlet, but also managed using a reversed control flow: It’s up to the gardenlet to contact the Gardener API server, for example, to share a status for its managed seed clusters.

Reversing the control flow allows placing seed clusters or shoot clusters behind firewalls without the necessity of direct access via VPN tunnels anymore.

Reversed Control Flow Using a Gardenlet

TLS Bootstrapping

Kubernetes doesn’t manage worker nodes itself, and it’s also not responsible for the lifecycle of the kubelet running on the workers. Similarly, Gardener doesn’t manage seed clusters itself, so Gardener is also not responsible for the lifecycle of the gardenlet running on the seeds. As a consequence, both the gardenlet and the kubelet need to prepare a trusted connection to the Gardener API server and the Kubernetes API server correspondingly.

To prepare a trusted connection between the gardenlet and the Gardener API server, the gardenlet initializes a bootstrapping process after you deployed it into your seed clusters:

  1. The gardenlet starts up with a bootstrap kubeconfig having a bootstrap token that allows to create CertificateSigningRequest (CSR) resources.

  2. After the CSR is signed, the gardenlet downloads the created client certificate, creates a new kubeconfig with it, and stores it inside a Secret in the seed cluster.

  3. The gardenlet deletes the bootstrap kubeconfig secret, and starts up with its new kubeconfig.

  4. The gardenlet starts normal operation.

The gardener-controller-manager runs a control loop that automatically signs CSRs created by gardenlets.

The gardenlet bootstrapping process is based on the kubelet bootstrapping process. More information: Kubelet’s TLS bootstrapping.

If you don’t want to run this bootstrap process you can create a kubeconfig pointing to the garden cluster for the gardenlet yourself, and use field gardenClientConnection.kubeconfig in the gardenlet configuration to share it with the gardenlet.

Gardenlet Certificate Rotation

The certificate used to authenticate the gardenlet against the API server has a certain validity based on the configuration of the garden cluster (--cluster-signing-duration flag of the kube-controller-manager (default 1y)).

If your garden cluster is of at least Kubernetes v1.22 then you can also configure the validity for the client certificate by specifying .gardenClientConnection.kubeconfigValidity.validity in the gardenlet’s component configuration. Note that changing this value will only take effect when the kubeconfig is rotated again (it is not picked up immediately). The minimum validity is 10m (that’s what is enforced by the CertificateSigningRequest API in Kubernetes which is used by gardenlet).

By default, after about 70-90% of the validity expired, the gardenlet tries to automatically replace the current certificate with a new one (certificate rotation).

You can change these boundaries by specifying .gardenClientConnection.kubeconfigValidity.autoRotationJitterPercentage{Min,Max} in the gardenlet’s component configuration.

To use certificate rotation, you need to specify the secret to store the kubeconfig with the rotated certificate in field .gardenClientConnection.kubeconfigSecret of the gardenlet component configuration.

Rotate certificates using bootstrap kubeconfig

If the gardenlet created the certificate during the initial TLS Bootstrapping using the Bootstrap kubeconfig, certificates can be rotated automatically. The same control loop in the gardener-controller-manager that signs the CSRs during the initial TLS Bootstrapping also automatically signs the CSR during a certificate rotation.

ℹ️ You can trigger an immediate renewal by annotating the Secret in the seed cluster stated in the .gardenClientConnection.kubeconfigSecret field with gardener.cloud/operation=renew and restarting the gardenlet. After it booted up again, gardenlet will issue a new certificate independent of the remaining validity of the existing one.

Rotate Certificate Using Custom kubeconfig

When trying to rotate a custom certificate that wasn’t created by gardenlet as part of the TLS Bootstrap, the x509 certificate’s Subject field needs to conform to the following:

  • the Common Name (CN) is prefixed with gardener.cloud:system:seed:
  • the Organization (O) equals gardener.cloud:system:seeds

Otherwise, the gardener-controller-manager doesn’t automatically sign the CSR. In this case, an external component or user needs to approve the CSR manually, for example, using command kubectl certificate approve seed-csr-<...>). If that doesn’t happen within 15 minutes, the gardenlet repeats the process and creates another CSR.

Configuring the Seed to work with

The Gardenlet works with a single seed, which must be configured in the GardenletConfiguration under .seedConfig. This must be a copy of the Seed resource, for example (see example/20-componentconfig-gardenlet.yaml for a more complete example):

apiVersion: gardenlet.config.gardener.cloud/v1alpha1
kind: GardenletConfiguration
seedConfig:
  metadata:
    name: my-seed
  spec:
    provider:
      type: aws
    # ...
    secretRef:
      name: my-seed-secret
      namespace: garden

When using make start-gardenlet, the corresponding script will automatically fetch the seed cluster’s kubeconfig based on the seedConfig.spec.secretRef and set the environment accordingly.

On startup, gardenlet registers a Seed resource using the given template in seedConfig if it’s not present already.

Component Configuration

In the component configuration for the gardenlet, it’s possible to define:

  • settings for the Kubernetes clients interacting with the various clusters
  • settings for the controllers inside the gardenlet
  • settings for leader election and log levels, feature gates, and seed selection or seed configuration.

More information: Example Gardenlet Component Configuration.

Heartbeats

Similar to how Kubernetes uses Lease objects for node heart beats (see KEP), the gardenlet is using Lease objects for heart beats of the seed cluster. Every two seconds, the gardenlet checks that the seed cluster’s /healthz endpoint returns HTTP status code 200. If that is the case, the gardenlet renews the lease in the Garden cluster in the gardener-system-seed-lease namespace and updates the GardenletReady condition in the status.conditions field of the Seed resource, see also this section.

Similarly to the node-lifecycle-controller inside the kube-controller-manager, the gardener-controller-manager features a seed-lifecycle-controller that sets the GardenletReady condition to Unknown in case the gardenlet fails to renew the lease. As a consequence, the gardener-scheduler doesn’t consider this seed cluster for newly created shoot clusters anymore.

/healthz Endpoint

The gardenlet includes an HTTP server that serves a /healthz endpoint. It’s used as a liveness probe in the Deployment of the gardenlet. If the gardenlet fails to renew its lease then the endpoint returns 500 Internal Server Error, otherwise it returns 200 OK.

Please note that the /healthz only indicates whether the gardenlet could successfully probe the Seed’s API server and renew the lease with the Garden cluster. It does not show that the Gardener extension API server (with the Gardener resource groups) is available. However, the Gardenlet is designed to withstand such connection outages and retries until the connection is reestablished.

Controllers

The gardenlet consists out of several controllers which are now described in more detail.

BackupBucket Controller

The BackupBucket controller reconciles those core.gardener.cloud/v1beta1.BackupBucket resources whose .spec.seedName value is equal to the name of the Seed the respective gardenlet is responsible for. A core.gardener.cloud/v1beta1.BackupBucket resource is created by the Seed controller if .spec.backup is defined in the Seed.

The controller adds finalizers to the BackupBucket and the secret mentioned in the .spec.secretRef of the BackupBucket. The controller also copies this secret to the seed cluster. Additionally, it creates an extensions.gardener.cloud/v1alpha1.BackupBucket resource (non-namespaced) in the seed cluster and waits until the responsible extension controller reconciles it (see this for more details). The status from the reconciliation is reported in the .status.lastOperation field. Once the extension resource is ready and the .status.generatedSecretRef is set by the extension controller, gardenlet copies the referenced secret to the garden namespace in the garden cluster. An owner reference to the core.gardener.cloud/v1beta1.BackupBucket is added to this secret.

If the core.gardener.cloud/v1beta1.BackupBucket is deleted, the controller deletes the generated secret in the garden cluster and the extensions.gardener.cloud/v1alpha1.BackupBucket resource in the seed cluster and it waits for the respective extension controller to remove its finalizers from the extensions.gardener.cloud/v1alpha1.BackupBucket. Then it deletes the secret in the seed cluster and finally removes the finalizers from the core.gardener.cloud/v1beta1.BackupBucket and the referred secret.

BackupEntry Controller

The BackupEntry controller reconciles those core.gardener.cloud/v1beta1.BackupEntry resources whose .spec.seedName value is equal to the name of a Seed the respective gardenlet is responsible for. Those resources are created by the Shoot controller (only if backup is enabled for the respective Seed) and there is exactly one BackupEntry per Shoot.

“Main” Reconciler

The controller creates an extensions.gardener.cloud/v1alpha1.BackupEntry resource (non-namespaced) in the seed cluster and waits until the responsible extension controller reconciled it (see this for more details). The status is populated in the .status.lastOperation field.

The core.gardener.cloud/v1beta1.BackupEntry resource has an owner reference pointing to the corresponding Shoot. Hence, if the Shoot is deleted, also the BackupEntry resource gets deleted. In this case, the controller deletes the extensions.gardener.cloud/v1alpha1.BackupEntry resource in the seed cluster and waits until the responsible extension controller has deleted it. Afterwards, the finalizer of the core.gardener.cloud/v1beta1.BackupEntry resource is released so that it finally disappears from the system.

If the spec.seedName and .status.seedName of the core.gardener.cloud/v1beta1.BackupEntry are different, the controller will migrate it by annotating the extensions.gardener.cloud/v1alpha1.BackupEntry in the Source Seed with gardener.cloud/operation: migrate, waiting for it to be migrated successfully and eventually deleting it from the Source Seed cluster. Afterwards, the controller will recreate the extensions.gardener.cloud/v1alpha1.BackupEntry in the Destination Seed, annotate it with gardener.cloud/operation: restore and wait for the restore operation to finish. For more details about control plane migration, please read Shoot Control Plane Migration.

Keep Backup for Deleted Shoots

In some scenarios it might be beneficial to not immediately delete the BackupEntrys (and with them, the etcd backup) for deleted Shoots.

In this case you can configure the .controllers.backupEntry.deletionGracePeriodHours field in the component configuration of the gardenlet. For example, if you set it to 48, then the BackupEntrys for deleted Shoots will only be deleted 48 hours after the Shoot was deleted.

Additionally, you can limit the shoot purposes for which this applies by setting .controllers.backupEntry.deletionGracePeriodShootPurposes[]. For example, if you set it to [production] then only the BackupEntrys for Shoots with .spec.purpose=production will be deleted after the configured grace period. All others will be deleted immediately after the Shoot deletion.

“Migration” Reconciler

This reconciler is only active if the ForceRestore feature gate is enabled in the gardenlet and if the seed has owner checks enabled (i.e., spec.setttings.ownerChecks.enabled=true). It checks if the source Seed also has owner checks enabled. If not or if the BackupEntry is being deleted, it sets the status.migrationStartTime to nil. The controller updates the status to force restoration in the following cases:

  1. If the BackupEntry is annotated with shoot.gardener.cloud/force-restore=true.
  2. If the grace period for migration has elapsed (which is set in the BackupEntryMigrationControllerConfiguration in the gardenlet’s component configuration).
  3. If the last operation is Migrate and if LastOperationStaleDuration (which is also set in the BackupEntryMigrationControllerConfiguration in the gardenlet’s component configuration) has passed since the lastUpdateTime of the operation.

Bastion Controller

The Bastion controller reconciles those operations.gardener.cloud/v1alpha1.Bastion resources whose .spec.seedName value is equal to the name of a Seed the respective gardenlet is responsible for.

The controller creates an extensions.gardener.cloud/v1alpha1.Bastion resource in the seed cluster in the shoot namespace with the same name as operations.gardener.cloud/v1alpha1.Bastion. Then it waits until the responsible extension controller has reconciled it (see this for more details).The status is populated in the .status.conditions and .status.ingress fields.

During the deletion of operations.gardener.cloud/v1alpha1.Bastion resources, the controller first sets the Ready condition to False and then deletes the extensions.gardener.cloud/v1alpha1.Bastion resource in the seed cluster. Once this resource is gone, the finalizer of the operations.gardener.cloud/v1alpha1.Bastion resource is released, so it finally disappears from the system.

ControllerInstallation Controller

The ControllerInstallation controller in the gardenlet reconciles ControllerInstallation objects with the help of the following reconcilers.

“Main” Reconciler

This reconciler is responsible for ControllerInstallations referencing a ControllerDeployment whose type=helm. It is responsible for unpacking the Helm chart tarball in the ControllerDeployments .providerConfig.chart field and deploying the rendered resources to the seed cluster. The Helm chart values in .providerConfig.values will be used and extended with some information about the Gardener environment and the seed cluster:

gardener:
  version: <gardenlet-version>
  garden:
    clusterIdentity: <identity-of-garden-cluster>
  seed:
    identity: <seed-name>
    clusterIdentity: <identity-of-seed-cluster>
    annotations: <seed-annotations>
    labels: <seed-labels>
    spec: <seed-specification>

As of today, there are a few more fields in .gardener.seed, but it is recommended to use the .gardener.seed.spec if the Helm chart needs more information about the seed configuration.

The rendered chart will be deployed via a ManagedResource created in the garden namespace of the seed cluster. It is labeled with controllerinstallation-name=<name> so that one can easily find the owning ControllerInstallation for an existing ManagedResource.

The reconciler maintains the Installed condition of the ControllerInstallation and sets it to False if the rendering or deployment fails.

“Care” Reconciler

This reconciler reconciles ControllerInstallation objects and checks whether they are in a healthy state. It checks the .status.conditions of the backing ManagedResource created in the garden namespace of the seed cluster.

  • If the ResourcesApplied condition of the ManagedResource is True then the Installed condition of the ControllerInstallation will be set to True.
  • If the ResourcesHealthy condition of the ManagedResource is True then the Healthy condition of the ControllerInstallation will be set to True.
  • If the ResourcesProgressing condition of the ManagedResource is True then the Progressing condition of the ControllerInstallation will be set to True.

A ControllerInstallation is considered “healthy” if Applied=Healthy=True and Progressing=False.

“Required” Reconciler

This reconciler watches all resources in the extensions.gardener.cloud API group in the seed cluster. It is responsible for maintaining the Required condition on ControllerInstallations. Concretely, when there is at least one extension resource in the seed cluster a ControllerInstallation is responsible for, then the status of the Required condition will be True. If there are no extension resources anymore, its status will be False.

This condition is taken into account by the ControllerRegistration controller part of gardener-controller-manager when it computes which extensions have to deployed to which seed cluster, see this document for more details.

NetworkPolicy Controller

The NetworkPolicy controller reconciles NetworkPolicys in shoot namespaces in order to ensure access to Kubernetes API server.

The controller resolves the IP address of Kubernetes service in default namespace and creates egress NetworkPolicys for it.

For more details about NetworkPolicys in Gardener please see this document.

Seed Controller

The Seed controller in the gardenlet reconciles Seed objects with the help of the following reconcilers.

“Main Reconciler”

This reconciler is responsible for managing the seed’s system components. Those comprise CA certificates, the various CustomResourceDefinitions, the logging and monitoring stacks, and few central components like gardener-resource-manager, etcd-druid, istio, etc.

The reconciler also deploys a BackupBucket resource in the garden cluster in case the Seed's .spec.backup is set. It also checks whether the seed cluster’s Kubernetes version is at least the minimum supported version and errors in case this constraint is not met.

This reconciler maintains the Bootstrapped condition, i.e. it sets it

  • to Progressing before it executes its reconciliation flow,
  • to False in case an error occurs,
  • to True in case the reconciliation succeeded.

“Care” Reconciler

This reconciler checks whether the seed system components (deployed by the “main” reconciler) are healthy. It checks the .status.conditions of the backing ManagedResource created in the garden namespace of the seed cluster. A ManagedResource is considered “healthy” if the conditions ResourcesApplied=ResourcesHealthy=True and ResourcesProgressing=False.

If all ManagedResources are healthy then the SeedSystemComponentsHealthy condition of the Seed will be set to True. Otherwise, it will be set to False.

If at least one ManagedResource is unhealthy and there is threshold configuration for the conditions (in .controllers.seedCare.conditionThresholds) then the status of the SeedSystemComponentsHealthy condition will be set

  • to Progressing if it was True before.
  • to Progressing if it was Progressing before and the lastUpdateTime of the condition does not exceed the configured threshold duration yet.
  • to False if it was Progressing before and the lastUpdateTime of the condition does exceed the configured threshold duration.

The condition thresholds can be used to prevent reporting issues too early just because there is a rollout or a short disruption. Only if the unhealthiness persists for at least the configured threshold duration then the issues will be reported (by setting the status to False).

“Lease” Reconciler

This reconciler checks whether the connection to the seed cluster’s /healthz endpoint works. If this succeeds then it renews a Lease resource in the garden cluster’s gardener-system-seed-lease namespace. This indicates a heartbeat to the external world, and internally the gardenlet sets its health status to true. In addition, the GardenletReady condition in the status of the Seed is set to True. The whole process is similar to what the kubelet does to report heartbeats for its Node resource and its KubeletReady condition, see also this section.

If the connection to the /healthz endpoint or the update of the Lease fails then the internal health status of gardenlet is set to false. Also, this internal health status is set to false automatically after some time in case the controller gets stuck for whatever reason. This internal health status is available via the gardenlet’s /healthz endpoint and is used for the livenessProbe in the gardenlet pod.

Shoot Controller

The Shoot controller in the gardenlet reconciles Shoot objects with the help of the following reconcilers.

“Main” Reconciler

This reconciler is responsible for managing all shoot cluster components and implements the core logic for creating, updating, hibernating, deleting, and migrating shoot clusters. It is also responsible for syncing the Cluster cluster to the seed cluster before and after each successful shoot reconciliation.

The main reconciliation logic is performed in 3 different task flows dedicated to specific operation types:

  • reconcile (operations: create, reconcile, restore): this is the main flow responsible for creation and regular reconciliation of shoots. Hibernating a shoot also triggers this flow. It is also used for restoration of the shoot control plane on the new seed (second half of a Control Plane Migration)
  • migrate: this flow is triggered when spec.seedName specifies a different seed than status.seedName. It performs the first half of the Control Plane Migration, i.e., a backup (migrate operation) of all control plane components followed by a “shallow delete”.
  • delete: this flow is triggered when the shoot’s deletionTimestamp is set, i.e., when it is deleted.

Gardenlet takes special care to prevent unnecessary shoot reconciliations. This is important for several reasons, e.g., to not overload the seed API servers and to not exhaust infrastructure rate limits too fast. Gardenlet performs shoot reconciliations according to the following rules:

  • If status.observedGeneration is less than metadata.generation: this is the case, e.g., when the spec was changed, a manual reconciliation operation was triggered, or the shoot was deleted.
  • If the last operation was not successful.
  • If the shoot is in a failed state, gardenlet does not perform any reconciliation on the shoot (unless the retry operation was triggered). However, it syncs the Cluster resource to the seed in order to inform extension controllers about the failed state.
  • Regular reconciliations are performed with every GardenletConfiguration.controllers.shoot.syncPeriod (defaults to 1h).
  • Shoot reconciliations are not performed if the assigned seed cluster is not healthy or has not been reconciled by the current gardenlet version yet (determined by the Seed.status.gardener section). This is done to make sure that shoots are reconciled with fully rolled out seed system components after a gardener upgrade. Otherwise, gardenlet might perform operations of the new version that doesn’t match the old version of the deployed seed system components, which might lead to unspecified behavior.

There are a few special cases that overwrite or confine how often and under which circumstances periodic shoot reconciliations are performed:

  • In case, the gardenlet config allows it (controllers.shoot.respectSyncPeriodOverwrite, disabled by default), the sync period for a shoot can be increased individually by setting the shoot.gardener.cloud/sync-period annotation. This is always allowed for shoots in the garden namespace. Shoots are not reconciled with a higher frequency than specified in GardenletConfiguration.controllers.shoot.syncPeriod.
  • In case, the gardenlet config allows it (controllers.shoot.respectSyncPeriodOverwrite, disabled by default), shoots can be marked as “ignored” by setting the shoot.gardener.cloud/ignore annotation. In this case, gardenlet does not perform any reconciliation for the shoot.
  • In case GardenletConfiguration.controllers.shoot.reconcileInMaintenanceOnly is enabled (disabled by default), gardenlet performs regular shoot reconciliations only once in the respective maintenance time window (GardenletConfiguration.controllers.shoot.syncPeriod is ignored). Gardenlet randomly distributes shoot reconciliations over the maintenance time window to avoid high bursts of reconciliations (see this doc).
  • In case Shoot.spec.maintenance.confineSpecUpdateRollout is enabled (disabled by default), changes to the shoot specification are not rolled out immediately but only during the respective maintenance time window (see this doc).

“Migration” Reconciler

This reconciler is only active if the ForceRestore feature gate is enabled in the gardenlet and if the Seed has owner checks enabled (i.e., spec.setttings.ownerChecks.enabled=true). It checks if the source Seed also has owner checks enabled. If not or if the Shoot is being deleted, it sets the status.migrationStartTime to nil. The controller updates the status to force restoration in the following cases:

  1. If the Shoot is annotated with shoot.gardener.cloud/force-restore=true.
  2. If the grace period for migration has elapsed (which is set in the ShootMigrationControllerConfiguration in the gardenlet’s component configuration).
  3. If the last operation is Migrate and if LastOperationStaleDuration (which is also set in the ShootMigrationControllerConfiguration in the gardenlet’s component configuration) has passed since the lastUpdateTime of the operation.

ShootState Controller

The ShootState controller in the gardenlet reconciles resources containing information that have to be synced to the ShootState. This information is used when a control plane migration is performed.

“Extensions” Reconciler

This reconciler watches resources in the extensions.gardener.cloud API group in the seed cluster which contain Shoot-specific state or data. Those are BackupEntrys, ContainerRuntimes, ControlPlanes, DNSRecords, Extensions, Infrastructures, Networks, OperatingSystemConfigs, and Workers.

When there is a change in the .status.state or .status.resources[] fields of these resources then this information is synced into the ShootState resource in the garden cluster.

“Secret” Reconciler

This reconciler reconciles Secrets having labels managed-by=secrets-manager and persist=true in the shoot namespaces in the seed cluster. It syncs them to the ShootState so that the secrets can be restored from there in case a shoot control plane has to be restored to another seed cluster (in case of migration).

Managed Seeds

Gardener users can use shoot clusters as seed clusters, so-called “managed seeds” (aka “shooted seeds”), by creating ManagedSeed resources. By default, the gardenlet that manages this shoot cluster then automatically creates a clone of itself with the same version and the same configuration that it currently has. Then it deploys the gardenlet clone into the managed seed cluster.

More information: Register Shoot as Seed

Migrating from Previous Gardener Versions

If your Gardener version doesn’t support gardenlets yet, no special migration is required, but the following prerequisites must be met:

  • Your Gardener version is at least 0.31 before upgrading to v1.
  • You have to make sure that your garden cluster is exposed in a way that it’s reachable from all your seed clusters.

With previous Gardener versions, you had deployed the Gardener Helm chart (incorporating the API server, controller-manager, and scheduler). With v1, this stays the same, but you now have to deploy the gardenlet Helm chart as well into all of your seeds (if they aren’t managed, as mentioned earlier).

More information: Deploy a Gardenlet for all instructions.

10 - Network Policies

Network Policies in Gardener

As Seed clusters can host the Kubernetes control planes of many Shoot clusters, it is necessary to isolate the control planes from each other for security reasons. Besides deploying each control plane in its own namespace, Gardener creates network policies to also isolate the networks. Essentially, network policies make sure that pods can only talk to other pods over the network they are supposed to. As such, network policies are an important part of Gardener’s tenant isolation.

Gardener deploys network policies into

  • each namespace hosting the Kubernetes control plane of the Shoot cluster.
  • the namespace dedicated to Gardener seed-wide global controllers. This namespace is often called garden and contains e.g. the Gardenlet.
  • the kube-system namespace in the Shoot.

The aforementioned namespaces in the Seed contain a deny-all network policy that denies all ingress and egress traffic. This secure by default setting requires pods to allow network traffic. This is done by pods having labels matching to the selectors of the network policies deployed by Gardener.

More details on the deployed network policies can be found in the development and usage sections.

11 - Operator

Gardener Operator

The gardener-operator is meant to be responsible for the garden cluster environment. Without this component, users must deploy ETCD, the Gardener control plane, etc. manually and with separate mechanisms (not maintained in this repository). This is quite unfortunate since this requires separate tooling, processes, etc. A lot of production- and enterprise-grade features were built into Gardener for managing the seed and shoot clusters, so it makes sense to re-use them as much as possible also for the garden cluster.

⚠️ Consider this component highly experimental and DO NOT use it in production.

Deployment

There is a Helm chart which can be used to deploy the gardener-operator. Once deployed and ready, you can create a Garden resource.

ℹ️ Similar to seed clusters, garden runtime clusters require a VPA. By default, gardener-operator deploys the VPA components. However, when there already is a VPA available, then set .spec.runtimeCluster.settings.verticalPodAutoscaler.enabled=false in the Garden resource.

Using Garden Runtime Cluster As Seed Cluster

In production scenarios, you probably wouldn’t use the Kubernetes cluster running gardener-operator and the Gardener control plane (called “runtime cluster”) as seed cluster at the same time. However, such setup is technically possible and might simplify certain situations (e.g., development, evaluation, …).

If the runtime cluster is a seed cluster at the same time, gardenlet’s Seed controller will not manage the components which were already deployed (and reconciled) by gardener-operator. As of today, this applies to:

  • gardener-resource-manager
  • vpa-{admission-controller,recommender,updater}
  • hvpa-controller (when HVPA feature gate is enabled)
  • etcd-druid

Those components are so-called “seed system components”. As they were already made available by gardener-operator, the gardenlet just skips them.

ℹ️ There is no need to configure anything - the gardenlet will automatically detect when its seed cluster is the garden runtime cluster at the same time.

⚠️ Note that such setup requires that you upgrade the versions of gardener-operator and gardenlet in lock-step. Otherwise, you might experience unexpected behaviour or issues with your seed or shoot clusters.

Local Development

The easiest setup is using a local KinD cluster and the Skaffold based approach to deploy the gardener-operator.

make kind-operator-up
make operator-up

# now you can create Garden resources, for example
kubectl create -f example/operator/20-garden.yaml
# alternatively, you can run the e2e test
make test-e2e-local-operator

make operator-down
make kind-operator-down

Generally, any Kubernetes cluster can be used. An alternative approach is to start the process locally and manually deploy the CustomResourceDefinition for the Garden resources into the targeted cluster (potentially remote):

kubectl create -f example/operator/10-crd-operator.gardener.cloud_gardens.yaml
make KUBECONFIG=... start-operator

# now you can create Garden resources, for example
kubectl create -f example/operator/20-garden.yaml
# alternatively, you can run the e2e test
make KUBECONFIG=... test-e2e-local-operator

Implementation Details

Controllers

As of today, the gardener-operator only has one controller which is now described in more detail.

Garden Controller

The reconciler first generates a general CA certificate which is valid for ~30d and auto-rotated when 80% of its lifetime is reached. Afterwards, it brings up the so-called “garden system components”. The gardener-resource-manager is deployed first since its ManagedResource controller will be used to bring up the remainders.

Other system components are:

  • garden system resources (PriorityClasses for the workload resources)
  • Vertical Pod Autoscaler (if enabled via .spec.runtimeCluster.settings.verticalPodAutoscaler.enabled=true in the Garden)
  • HVPA controller (when HVPA feature gate is enabled)
  • ETCD Druid

The controller maintains the Reconciled condition which indicates the status of an operation.

12 - Resource Manager

Gardener Resource Manager

Initially, the gardener-resource-manager was a project similar to the kube-addon-manager. It manages Kubernetes resources in a target cluster which means that it creates, updates, and deletes them. Also, it makes sure that manual modifications to these resources are reconciled back to the desired state.

In the Gardener project we were using the kube-addon-manager since more than two years. While we have progressed with our extensibility story (moving cloud providers out-of-tree) we had decided that the kube-addon-manager is no longer suitable for this use-case. The problem with it is that it needs to have its managed resources on its file system. This requires storing the resources in ConfigMaps or Secrets and mounting them to the kube-addon-manager pod during deployment time. The gardener-resource-manager uses CustomResourceDefinitions which allows to dynamically add, change, and remove resources with immediate action and without the need to reconfigure the volume mounts/restarting the pod.

Meanwhile, the gardener-resource-manager has evolved to a more generic component comprising several controllers and webhook handlers. It is deployed by gardenlet once per seed (in the garden namespace) and once per shoot (in the respective shoot namespaces in the seed).

Component Configuration

Similar to other Gardener components, the gardener-resource-manager uses a so-called component configuration file. It allows specifying certain central settings like log level and formatting, client connection configuration, server ports and bind addresses, etc. In addition, controllers and webhooks can be configured and sometimes even disabled.

Note that the very basic ManagedResource, secret and health controllers cannot be disabled.

You can find an example configuration file here.

Controllers

ManagedResource controller

This controller watches custom objects called ManagedResources in the resources.gardener.cloud/v1alpha1 API group. These objects contain references to secrets which itself contain the resources to be managed. The reason why a Secret is used to store the resources is that they could contain confidential information like credentials.

---
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example1
  namespace: default
type: Opaque
data:
  objects.yaml: YXBpVmVyc2lvbjogdjEKa2luZDogQ29uZmlnTWFwCm1ldGFkYXRhOgogIG5hbWU6IHRlc3QtMTIzNAogIG5hbWVzcGFjZTogZGVmYXVsdAotLS0KYXBpVmVyc2lvbjogdjEKa2luZDogQ29uZmlnTWFwCm1ldGFkYXRhOgogIG5hbWU6IHRlc3QtNTY3OAogIG5hbWVzcGFjZTogZGVmYXVsdAo=
    # apiVersion: v1
    # kind: ConfigMap
    # metadata:
    #   name: test-1234
    #   namespace: default
    # ---
    # apiVersion: v1
    # kind: ConfigMap
    # metadata:
    #   name: test-5678
    #   namespace: default
---
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example1

In the above example, the controller creates two ConfigMaps in the default namespace. When a user is manually modifying them they will be reconciled back to the desired state stored in the managedresource-example secret.

It is also possible to inject labels into all the resources:

---
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example2
  namespace: default
type: Opaque
data:
  other-objects.yaml: YXBpVmVyc2lvbjogYXBwcy92MSAjIGZvciB2ZXJzaW9ucyBiZWZvcmUgMS45LjAgdXNlIGFwcHMvdjFiZXRhMgpraW5kOiBEZXBsb3ltZW50Cm1ldGFkYXRhOgogIG5hbWU6IG5naW54LWRlcGxveW1lbnQKc3BlYzoKICBzZWxlY3RvcjoKICAgIG1hdGNoTGFiZWxzOgogICAgICBhcHA6IG5naW54CiAgcmVwbGljYXM6IDIgIyB0ZWxscyBkZXBsb3ltZW50IHRvIHJ1biAyIHBvZHMgbWF0Y2hpbmcgdGhlIHRlbXBsYXRlCiAgdGVtcGxhdGU6CiAgICBtZXRhZGF0YToKICAgICAgbGFiZWxzOgogICAgICAgIGFwcDogbmdpbngKICAgIHNwZWM6CiAgICAgIGNvbnRhaW5lcnM6CiAgICAgIC0gbmFtZTogbmdpbngKICAgICAgICBpbWFnZTogbmdpbng6MS43LjkKICAgICAgICBwb3J0czoKICAgICAgICAtIGNvbnRhaW5lclBvcnQ6IDgwCg==
    # apiVersion: apps/v1
    # kind: Deployment
    # metadata:
    #   name: nginx-deployment
    # spec:
    #   selector:
    #     matchLabels:
    #       app: nginx
    #   replicas: 2 # tells deployment to run 2 pods matching the template
    #   template:
    #     metadata:
    #       labels:
    #         app: nginx
    #     spec:
    #       containers:
    #       - name: nginx
    #         image: nginx:1.7.9
    #         ports:
    #         - containerPort: 80

---
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example2
  injectLabels:
    foo: bar

In this example the label foo=bar will be injected into the Deployment as well as into all created ReplicaSets and Pods.

Preventing Reconciliations

If a ManagedResource is annotated with resources.gardener.cloud/ignore=true then it will be skipped entirely by the controller (no reconciliations or deletions of managed resources at all). However, when the ManagedResource itself is deleted (for example when a shoot is deleted) then the annotation is not respected and all resources will be deleted as usual. This feature can be helpful to temporarily patch/change resources managed as part of such ManagedResource. Condition checks will be skipped for such ManagedResources.

Modes

The gardener-resource-manager can manage a resource in the following supported modes:

  • Ignore
    • The corresponding resource is removed from the ManagedResource status (.status.resources). No action is performed on the cluster - the resource is no longer “managed” (updated or deleted).
    • The primary use case is a migration of a resource from one ManagedResource to another one.

The mode for a resource can be specified with the resources.gardener.cloud/mode annotation. The annotation should be specified in the encoded resource manifest in the Secret that is referenced by the ManagedResource.

Skipping health check

If a resource in the ManagedResource is annotated with resources.gardener.cloud/skip-health-check=true then the resource will be skipped during health checks by the health controller. The ManagedResource conditions will not reflect the health condition of this resource anymore. The ResourcesProgressing condition will also be set to False.

Resource Class

By default, gardener-resource-manager controller watches for ManagedResources in all namespaces. The .sourceClientConnection.namespace field in the component configuration restricts the watch to ManagedResources in a single namespace only. Note that this setting also affects all other controllers and webhooks since it’s a central configuration.

A ManagedResource has an optional .spec.class field that allows to indicate that it belongs to given class of resources. The .controllers.resourceClass field in the component configuration restricts the watch to ManagedResources with the given .spec.class. A default class is assumed if no class is specified.

Conditions

A ManagedResource has a ManagedResourceStatus, which has an array of Conditions. Conditions currently include:

ConditionDescription
ResourcesAppliedTrue if all resources are applied to the target cluster
ResourcesHealthyTrue if all resources are present and healthy
ResourcesProgressingFalse if all resources have been fully rolled out

ResourcesApplied may be False when:

  • the resource apiVersion is not known to the target cluster
  • the resource spec is invalid (for example the label value does not match the required regex for it)

ResourcesHealthy may be False when:

  • the resource is not found
  • the resource is a Deployment and the Deployment does not have the minimum availability.

ResourcesProgressing may be True when:

  • a Deployment, StatefulSet or DaemonSet has not been fully rolled out yet, i.e. not all replicas have been updated with the latest changes to spec.template.

Each Kubernetes resources has different notion for being healthy. For example, a Deployment is considered healthy if the controller observed its current revision and if the number of updated replicas is equal to the number of replicas.

The following status.conditions section describes a healthy ManagedResource:

conditions:
- lastTransitionTime: "2022-05-03T10:55:39Z"
  lastUpdateTime: "2022-05-03T10:55:39Z"
  message: All resources are healthy.
  reason: ResourcesHealthy
  status: "True"
  type: ResourcesHealthy
- lastTransitionTime: "2022-05-03T10:55:36Z"
  lastUpdateTime: "2022-05-03T10:55:36Z"
  message: All resources have been fully rolled out.
  reason: ResourcesRolledOut
  status: "False"
  type: ResourcesProgressing
- lastTransitionTime: "2022-05-03T10:55:18Z"
  lastUpdateTime: "2022-05-03T10:55:18Z"
  message: All resources are applied.
  reason: ApplySucceeded
  status: "True"
  type: ResourcesApplied

Ignoring Updates

In some cases it is not desirable to update or re-apply some of the cluster components (for example, if customization is required or needs to be applied by the end-user). For these resources, the annotation “resources.gardener.cloud/ignore” needs to be set to “true” or a truthy value (Truthy values are “1”, “t”, “T”, “true”, “TRUE”, “True”) in the corresponding managed resource secrets, this can be done from the components that create the managed resource secrets, for example Gardener extensions or Gardener. Once this is done, the resource will be initially created and later ignored during reconciliation.

Preserving replicas or resources in Workload Resources

The objects which are part of the ManagedResource can be annotated with

  • resources.gardener.cloud/preserve-replicas=true in case the .spec.replicas field of workload resources like Deployments, StatefulSets, etc. shall be preserved during updates.
  • resources.gardener.cloud/preserve-resources=true in case the .spec.containers[*].resources fields of all containers of workload resources like Deployments, StatefulSets, etc. shall be preserved during updates.

This can be useful if there are non-standard horizontal/vertical auto-scaling mechanisms in place. Standard mechanisms like HorizontalPodAutoscaler or VerticalPodAutoscaler will be auto-recognized by gardener-resource-manager, i.e., in such cases the annotations are not needed.

Origin

All the objects managed by the resource manager get a dedicated annotation resources.gardener.cloud/origin describing the ManagedResource object that describes this object. The default format is <namespace>/<objectname>.

In multi-cluster scenarios (the ManagedResource objects are maintained in a cluster different from the one the described objects are managed), it might be useful to include the cluster identity, as well.

This can be enforced by setting the .controllers.clusterID field in the component configuration. Here, several possibilities are supported:

  • given a direct value: use this as id for the source cluster
  • <cluster>: read the cluster identity from a cluster-identity config map in the kube-system namespace (attribute cluster-identity). This is automatically maintained in all clusters managed or involved in a gardener landscape.
  • <default>: try to read the cluster identity from the config map. If not found, no identity is used
  • empty string: no cluster identity is used (completely cluster local scenarios)

By default, cluster id is not used. If cluster id is specified the format is <cluster id>:<namespace>/<objectname>.

In addition to the origin annotation, all objects managed by the resource manager get a dedicated label resources.gardener.cloud/managed-by. This label can be used to describe these objects with a selector. By default it is set to “gardener”, but this can be overwritten by setting the .conrollers.managedResources.managedByLabelValue field in the component configuration.

Garbage Collector For Immutable ConfigMaps/Secrets

In Kubernetes, workload resources (e.g., Pods) can mount ConfigMaps or Secrets or reference them via environment variables in containers. Typically, when the content of such ConfigMap/Secret gets changed then the respective workload is usually not dynamically reloading the configuration, i.e., a restart is required. The most commonly used approach is probably having so-called checksum annotations in the pod template which makes Kubernetes to recreate the pod if the checksum changes. However, it has the downside that old, still running versions of the workload might not be able to properly work with the already updated content in the ConfigMap/Secret, potentially causing application outages.

In order to protect users from such outages (and to also improve the performance of the cluster), the Kubernetes community provides the “immutable ConfigMaps/Secrets feature”. Enabling immutability requires ConfigMaps/Secrets to have unique names. Having unique names requires the client to delete ConfigMaps/Secret`s no longer in use.

In order to provide a similarly lightweight experience for clients (compared to the well-established checksum annotation approach), the gardener-resource-manager features an optional garbage collector controller (disabled by default). The purpose of this controller is cleaning up such immutable ConfigMaps/Secrets if they are no longer in use.

How does the garbage collector work?

The following algorithm is implemented in the GC controller:

  1. List all ConfigMaps and Secrets labeled with resources.gardener.cloud/garbage-collectable-reference=true.
  2. List all Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, Pods and for each of them
    1. iterate over the .metadata.annotations and for each of them
      1. If the annotation key follows the reference.resources.gardener.cloud/{configmap,secret}-<hash> scheme and the value equals <name> then consider it as “in-use”.
  3. Delete all ConfigMaps and Secrets not considered as “in-use”.

Consequently, clients need to

  1. Create immutable ConfigMaps/Secrets with unique names (e.g., a checksum suffix based on the .data).

  2. Label such ConfigMaps/Secrets with resources.gardener.cloud/garbage-collectable-reference=true.

  3. Annotate their workload resources with reference.resources.gardener.cloud/{configmap,secret}-<hash>=<name> for all ConfigMaps/Secrets used by the containers of the respective Pods.

    ⚠️ Add such annotations to .metadata.annotations as well as to all templates of other resources (e.g., .spec.template.metadata.annotations in Deployments or .spec.jobTemplate.metadata.annotations and .spec.jobTemplate.spec.template.metadata.annotations for CronJobs. This ensures that the GC controller does not unintentionally consider ConfigMaps/Secrets as “not in use” just because there isn’t a Pod referencing them anymore (e.g., they could still be used by a Deployment scaled down to 0).

ℹ️ For the last step, there is a helper function InjectAnnotations in the pkg/controller/garbagecollector/references which you can use for your convenience.

Example:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-1234
  namespace: default
  labels:
    resources.gardener.cloud/garbage-collectable-reference: "true"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-5678
  namespace: default
  labels:
    resources.gardener.cloud/garbage-collectable-reference: "true"
---
apiVersion: v1
kind: Pod
metadata:
  name: example
  namespace: default
  annotations:
    reference.resources.gardener.cloud/configmap-82a3537f: test-5678
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    terminationGracePeriodSeconds: 2

The GC controller would delete the ConfigMap/test-1234 because it is considered as not “in-use”.

ℹ️ If the GC controller is activated then the ManagedResource controller will no longer delete ConfigMaps/Secrets having the above label.

How to activate the garbage collector?

The GC controller can be activated by setting the .controllers.garbageCollector.enabled field to true in the component configuration.

TokenInvalidator

The Kubernetes community is slowly transitioning from static ServiceAccount token Secrets to ServiceAccount Token Volume Projection. Typically, when you create a ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: default

then the serviceaccount-token controller (part of kube-controller-manager) auto-generates a Secret with a static token:

apiVersion: v1
kind: Secret
metadata:
   annotations:
      kubernetes.io/service-account.name: default
      kubernetes.io/service-account.uid: 86e98645-2e05-11e9-863a-b2d4d086dd5a)
   name: default-token-ntxs9
type: kubernetes.io/service-account-token
data:
   ca.crt: base64(cluster-ca-cert)
   namespace: base64(namespace)
   token: base64(static-jwt-token)

Unfortunately, when using ServiceAccount Token Volume Projection in a Pod, this static token is actually not used at all:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  serviceAccountName: default
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /var/run/secrets/tokens
      name: token
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 7200

While the Pod is now using an expiring and auto-rotated token, the static token is still generated and valid.

As of Kubernetes v1.22, there is neither a way of preventing kube-controller-manager to generate such static tokens, nor a way to proactively remove or invalidate them:

Disabling the serviceaccount-token controller is an option, however, especially in the Gardener context it may either break end-users or it may not even be possible to control such settings. Also, even if a future Kubernetes version supports native configuration of above behaviour, Gardener still supports older versions which won’t get such features but need a solution as well.

This is where the TokenInvalidator comes into play: Since it is not possible to prevent kube-controller-manager from generating static ServiceAccount Secrets, the TokenInvalidator is - as its name suggests - just invalidating these tokens. It considers all such Secrets belonging to ServiceAccounts with .automountServiceAccountToken=false. By default, all namespaces in the target cluster are watched, however, this can be configured by specifying the .targetClientConnection.namespace field in the component configuration. Note that this setting also affects all other controllers and webhooks since it’s a central configuration.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-serviceaccount
automountServiceAccountToken: false

This will result in a static ServiceAccount token secret whose token value is invalid:

apiVersion: v1
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: my-serviceaccount
    kubernetes.io/service-account.uid: 86e98645-2e05-11e9-863a-b2d4d086dd5a
  name: my-serviceaccount-token-ntxs9
type: kubernetes.io/service-account-token
data:
  ca.crt: base64(cluster-ca-cert)
  namespace: base64(namespace)
  token: AAAA

Any attempt to regenerate the token or creating a new such secret will again make the component invalidating it.

You can opt-out of this behaviour for ServiceAccounts setting .automountServiceAccountToken=false by labeling them with token-invalidator.resources.gardener.cloud/skip=true.

In order to enable the TokenInvalidator you have to set both .controllers.tokenValidator.enabled=true and .webhooks.tokenValidator.enabled=true in the component configuration.

Below graphic shows an overview of the Token Invalidator for Service account secrets in the Shoot cluster. image

TokenRequestor

This controller provides the service to create and auto-renew tokens via the TokenRequest API.

It provides a functionality similar to the kubelet’s Service Account Token Volume Projection. It was created to handle the special case of issuing tokens to pods that run in a different cluster than the API server they communicate with (hence, using the native token volume projection feature is not possible).

The controller differentiates between source cluster and target cluster. The source cluster hosts the gardener-resource-manager pod. Secrets in this cluster are watched and modified by the controller. The target cluster can be configured to point to another cluster. The existence of ServiceAccounts are ensured and token requests are issued against the target. When the gardener-resource-manager is deployed next to the Shoot’s controlplane in the Seed the source cluster is the Seed while the target cluster points to the Shoot.

Reconciliation Loop

This controller reconciles secrets in all namespaces in the source cluster with the label: resources.gardener.cloud/purpose: token-requestor. See here for an example of the secret.

The controller ensures a ServiceAccount exists in the target cluster as specified in the annotations of the Secret in the source cluster:

serviceaccount.resources.gardener.cloud/name: <sa-name>
serviceaccount.resources.gardener.cloud/namespace: <sa-namespace>

The requested tokens will act with the privileges which are assigned to this ServiceAccount.

The controller will then request a token via the TokenRequest API and populate it into the .data.token field to the Secret in the source cluster.

Alternatively, the client can provide a raw kubeconfig (in YAML or JSON format) via the Secret’s .data.kubeconfig field. The controller will then populate the requested token in the kubeconfig for the user used in the .current-context. For example, if .data.kubeconfig is

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: AAAA
    server: some-server-url
  name: shoot--foo--bar
contexts:
- context:
    cluster: shoot--foo--bar
    user: shoot--foo--bar-token
  name: shoot--foo--bar
current-context: shoot--foo--bar
kind: Config
preferences: {}
users:
- name: shoot--foo--bar-token
  user:
    token: ""

then the .users[0].user.token field of the kubeconfig will be updated accordingly.

The controller also adds an annotation to the Secret to keep track when to renew the token before it expires. By default, the tokens are issued to expire after 12 hours. The expiration time can be set with the following annotation:

serviceaccount.resources.gardener.cloud/token-expiration-duration: 6h

It automatically renews once 80% of the lifetime is reached or after 24h.

Optionally, the controller can also populate the token into a Secret in the target cluster. This can be requested by annotating the Secret in the source cluster with

token-requestor.resources.gardener.cloud/target-secret-name: "foo"
token-requestor.resources.gardener.cloud/target-secret-namespace: "bar"

Overall, the TokenRequestor controller provides credentials with limited lifetime (JWT tokens) used by Shoot control plane components running in the Seed to talk to the Shoot API Server. Please see the graphic below:

image

Kubelet Server CertificateSigningRequest Approver

Gardener configures the kubelets such that they request two certificates via the CertificateSigningRequest API:

  1. client certificate for communicating with the kube-apiserver
  2. server certificate for serving its HTTPS server

For client certificates, the kubernetes.io/kube-apiserver-client-kubelet signer is used (see this document for more details). The kube-controller-manager’s csrapprover controller is responsible for auto-approving such CertificateSigningRequests so that the respective certificates can be issued.

For server certificates, the kubernetes.io/kubelet-serving signer is used. Unfortunately, the kube-controller-manager is not able to auto-approve such CertificateSigningRequests (see kubernetes/kubernetes#73356 for details).

That’s the motivation for having this controller as part of gardener-resource-manager. It watches CertificateSigningRequests with the kubernetes.io/kubelet-serving signer and auto-approves them when all the following conditions are met:

  • The .spec.username is prefixed with system:node:.
  • There must be at least one DNS name or IP address as part of the certificate SANs.
  • The common name in the CSR must match the .spec.username.
  • The organization in the CSR must only contain system:nodes.
  • There must be a Node object with the same name in the shoot cluster.
  • There must be exactly one Machine for the node in the seed cluster.
  • The DNS names part of the SANs must be equal to all .status.addresses[] of type Hostname in the Node.
  • The IP addresses part of the SANs must be equal to all .status.addresses[] of type InternalIP in the Node.

If one of these requirements is violated the CertificateSigningRequest will be denied. Otherwise, once approved the kube-controller-manager’s csrsigner controller will issue the requested certificate.

Webhooks

Mutating Webhooks

High Availability Config

This webhook is used to conveniently apply the configuration to make components deployed to seed or shoot clusters highly available. The details and scenarios are described in this document.

The webhook reacts on creation/update of Deployments and StatefulSets in namespaces labeled with high-availability-config.resources.gardener.cloud/consider=true.

The webhook performs the following actions:

  1. The .spec.replicas field is mutated based on the high-availability-config.resources.gardener.cloud/type label of the resource and the high-availability-config.resources.gardener.cloud/failure-tolerance-type annotation of the namespace:

    Failure Tolerance Type ➡️
    /
    ⬇️ Component Type️ ️
    unsetemptynon-empty
    controller212
    server222
    • The replica count values can be overwritten by the high-availability-config.resources.gardener.cloud/replicas annotation.
    • It does NOT mutate the replicas when
      • the replicas are already set to 0 (hibernation case), or
      • when the resource is scaled horizontally by HorizontalPodAutoscaler or Hvpa, and the current replica count is higher than what was computed above.
  2. When the high-availability-config.resources.gardener.cloud/zones annotation is NOT empty and the high-availability-config.resources.gardener.cloud/failure-tolerance-type annotation is set, then it adds a node affinity to the pod template spec:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - <zone1>
              # - ...
    

    This ensures that all pods are pinned to only nodes in exactly those concrete zones.

  3. Topology Spread Constraints are added to the pod template spec when the .spec.replicas are greater than 1. When the high-availability-config.resources.gardener.cloud/zones annotation …

    • … contains only one zone, then the following is added:

      spec:
        topologySpreadConstraints:
        - topologyKey: kubernetes.io/hostname
          maxSkew: 1
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: ...
      

      This ensures that the (multiple) pods are scheduled across nodes on best-effort basis.

    • … contains at least two zones, then the following is added:

      spec:
        topologySpreadConstraints:
        - topologyKey: kubernetes.io/hostname
          maxSkew: 1
          whenUnsatisfiable: ScheduleAnyway
          labelSelector: ...
        - topologyKey: topology.kubernetes.io/zone
          maxSkew: 1
          whenUnsatisfiable: DoNotSchedule
          labelSelector: ...
      

      This enforces that the (multiple) pods are scheduled across zones. It circumvents a known limitation in Kubernetes for clusters < 1.26 (ref kubernetes/kubernetes#109364. In case the number of replicas is larger than twice the number of zones then the maxSkew=2 for the second spread constraints.

    Independent on the number of zones, when the high-availability-config.resources.gardener.cloud/failure-tolerance-type annotation is set and NOT empty, then the whenUnsatisfiable is set to DoNotSchedule for the constraint with topologyKey=kubernetes.io/hostname (which enforces the node-spread).

Auto-Mounting Projected ServiceAccount Tokens

When this webhook is activated then it automatically injects projected ServiceAccount token volumes into Pods and all its containers if all of the following preconditions are fulfilled:

  1. The Pod is NOT labeled with projected-token-mount.resources.gardener.cloud/skip=true.
  2. The Pod’s .spec.serviceAccountName field is NOT empty and NOT set to default.
  3. The ServiceAccount specified in the Pod’s .spec.serviceAccountName sets .automountServiceAccountToken=false.
  4. The Pod’s .spec.volumes[] DO NOT already contain a volume with a name prefixed with kube-api-access-.

The projected volume will look as follows:

spec:
  volumes:
  - name: kube-api-access-gardener
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 43200
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

The expirationSeconds are defaulted to 12h and can be overwritten with the .webhooks.projectedTokenMount.expirationSeconds field in the component configuration, or with the projected-token-mount.resources.gardener.cloud/expiration-seconds annotation on a Pod resource.

The volume will be mounted into all containers specified in the Pod to the path /var/run/secrets/kubernetes.io/serviceaccount. This is the default location where client libraries expect to find the tokens and mimics the upstream ServiceAccount admission plugin, see this document for more information.

Overall, this webhook is used to inject projected service account tokens into pods running in the Shoot and the Seed cluster. Hence, it is served from the Seed GRM and each Shoot GRM. Please find an overview below for pods deployed in the Shoot cluster:

image

Pod Topology Spread Constraints

When this webhook is enabled then it mimics the topologyKey feature for Topology Spread Constraints (TSC) on the label pod-template-hash. Concretely, when a pod is labelled with pod-template-hash the handler of this webhook extends any topology spread constraint in the pod:

metadata:
  labels:
    pod-template-hash: 123abc
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        pod-template-hash: 123abc # added by webhook

The procedure circumvents a known limitation with TSCs which leads to imbalanced deployments after rolling updates. Gardener enables this webhook to schedule pods of deployments across nodes and zones.

Please note, the gardener-resource-manager itself as well as pods labelled with topology-spread-constraints.resources.gardener.cloud/skip are excluded from any mutations.

Validating Webhooks

Unconfirmed Deletion Prevention For Custom Resources And Definitions

As part of Gardener’s extensibility concepts, a lot of CustomResourceDefinitions are deployed to the seed clusters that serve as extension points for provider-specific controllers. For example, the Infrastructure CRD triggers the provider extension to prepare the IaaS infrastructure of the underlying cloud provider for a to-be-created shoot cluster. Consequently, these extension CRDs have a lot of power and control large portions of the end-user’s shoot cluster. Accidental or undesired deletions of those resource can cause tremendous and hard-to-recover-from outages and should be prevented.

When this webhook is activated, it reacts for CustomResourceDefinitions and most of the custom resources in the extensions.gardener.cloud/v1alpha1 API group. It also reacts for the druid.gardener.cloud/v1alpha1.Etcd resources.

The webhook prevents DELETE requests for those CustomResourceDefinitions labeled with gardener.cloud/deletion-protected=true, and for all mentioned custom resources if they were not previously annotated with the confirmation.gardener.cloud/deletion=true. This prevents that undesired kubectl delete <...> requests are accepted.

Extension Resource Validation

When this webhook is activated, it reacts for most of the custom resources in the extensions.gardener.cloud/v1alpha1 API group. It also reacts for the druid.gardener.cloud/v1alpha1.Etcd resources.

The webhook validates the resources specifications for CREATE and UPDATE requests.

13 - Scheduler

Gardener Scheduler

The Gardener Scheduler is in essence a controller that watches newly created shoots and assigns a seed cluster to them. Conceptually, the task of the Gardener Scheduler is very similar to the task of the Kubernetes Scheduler: finding a seed for a shoot instead of a node for a pod.

Either the scheduling strategy or the shoot cluster purpose hereby determines how the scheduler is operating. The following sections explain the configuration and flow in greater detail.

Why is the Gardener Scheduler needed?

1. Decoupling

Previously, an admission plugin in the Gardener API server conducted the scheduling decisions. This implies changes to the API server whenever adjustments of the scheduling are needed. Decoupling the API server and the scheduler comes with greater flexibility to develop these components independently from each other.

2. Extensibility

It should be possible to easily extend and tweak the scheduler in the future. Possibly, similar to the Kubernetes scheduler, hooks could be provided which influence the scheduling decisions. It should be also possible to completely replace the standard Gardener Scheduler with a custom implementation.

Algorithm overview

The following sequence describes the steps involved to determine a seed candidate:

  1. Determine usable seeds with “usable” defined as follows:
    • no .metadata.deletionTimestamp
    • .spec.settings.scheduling.visible is true
    • conditions Bootstrapped, GardenletReady, BackupBucketsReady (if available) are true
  2. Filter seeds:
    • matching .spec.seedSelector in CloudProfile used by the Shoot
    • matching .spec.seedSelector in Shoot
    • having no network intersection with the Shoot’s networks (due to the VPN connectivity between seeds and shoots their networks must be disjoint)
    • having .spec.settings.shootDNS.enabled=false (only if the shoot specifies a DNS domain or does not use the unmanaged DNS provider)
    • whose taints (.spec.taints) are tolerated by the Shoot (.spec.tolerations)
    • whose capacity for shoots would not be exceeded if the shoot is scheduled onto the seed, see Ensuring seeds capacity for shoots is not exceeded
    • which have at least three zones in .spec.provider.zones if shoot requests a high available control plane with failure tolerance type zone.
  3. Apply active strategy e.g., Minimal Distance strategy
  4. Choose least utilized seed, i.e., the one with the least number of shoot control planes, will be the winner and written to the .spec.seedName field of the Shoot.

Configuration

The Gardener Scheduler configuration has to be supplied on startup. It is a mandatory and also the only available flag. Here is an example scheduler configuration.

Most of the configuration options are the same as in the Gardener Controller Manager (leader election, client connection, …). However, the Gardener Scheduler on the other hand does not need a TLS configuration, because there are currently no webhooks configurable.

Strategies

The scheduling strategy is defined in the candidateDeterminationStrategy of the scheduler’s configuration and can have the possible values SameRegion and MinimalDistance. The SameRegion strategy is the default strategy.

  1. Same Region strategy

    The Gardener Scheduler reads the spec.provider.type and .spec.region fields from the Shoot resource. It tries to find a seed that has the identical .spec.provider.type and .spec.provider.region fields set. If it cannot find a suitable seed, it adds an event to the shoot stating, that it is unschedulable.

  2. Minimal Distance strategy

    The Gardener Scheduler tries to find a valid seed with minimal distance to the shoot’s intended region. The distance is calculated based on the Levenshtein distance of the region. Therefore the region name is split into a base name and an orientation. Possible orientations are north, south, east, west and central. The distance then is twice the Levenshtein distance of the region’s base name plus a correction value based on the orientation and the provider.

    If the orientations of shoot and seed candidate match, the correction value is 0, if they differ it is 2 and if either the seed’s or the shoot’s region does not have an orientation it is 1. If the provider differs the correction value is additionally incremented by 2.

    Because of this a matching region with a matching provider is always prefered.

In order to put the scheduling decision into effect, the scheduler sends an update request for the Shoot resource to the API server. After validation, the Gardener Aggregated API server updates the shoot to have the spec.seedName field set. Subsequently, the Gardenlet picks up and starts to create the cluster on the specified seed.

  1. Special handling based on shoot cluster purpose

Every shoot cluster can have a purpose that describes what the cluster is used for, and also influences how the cluster is setup (see this document for more information).

In case the shoot has the testing purpose then the scheduler only reads the .spec.provider.type from the Shoot resource and tries to find a Seed that has the identical .spec.provider.type. The region does not matter, i.e., testing shoots may also be scheduled on a seed in a complete different region if it is better for balancing the whole Gardener system.

shoots/binding subresource

The shoots/binding subresource is used to bind a Shoot to a Seed. On creation of shoot clusters, the scheduler updates the binding automatically if an appropriate seed cluster is available. Only an operator with necessary RBAC can update this binding manually. This can be done by changing the .spec.seedName of the shoot. However, if a different seed is already assigned to the shoot, this will trigger a control-plane migration. For required steps, Please see Triggering the Migration.

spec.seedName field in the Shoot specification

Similar to the .spec.nodeName field in Pods, the Shoot specification has an optional .spec.seedName field. If this field is set on creation, the shoot will be scheduled to this seed. However, this field can only be set by users having RBAC for the shoots/binding subresource. If this field is not set, the scheduler will assign a suitable seed automatically and populate this field with the seed name.

seedSelector field in the Shoot specification

Similar to the .spec.nodeSelector field in Pods, the Shoot specification has an optional .spec.seedSelector field. It allows the user to provide a label selector that must match the labels of Seeds in order to be scheduled to one of them. The labels on Seeds are usually controlled by Gardener administrators/operators - end users cannot add arbitrary labels themselves. If provided, the Gardener Scheduler will only consider those seeds as “suitable” whose labels match those provided in the .spec.seedSelector of the Shoot.

By default only seeds with the same provider than the shoot are selected. By adding a providerTypes field to the seedSelector a dedicated set of possible providers (* means all provider types) can be selected.

Ensuring seeds capacity for shoots is not exceeded

Seeds have a practical limit of how many shoots they can accommodate. Exceeding this limit is undesirable as the system performance will be noticeably impacted. Therefore, the scheduler ensures that a seed’s capacity for shoots is not exceeded by taking into account a maximum number of shoots that can be scheduled onto a seed.

This mechanism works as follows:

  • The gardenlet is configured with certain resources and their total capacity (and, for certain resources, the amount reserved for Gardener), see /example/20-componentconfig-gardenlet.yaml. Currently, the only such resource is the maximum number of shoots that can be scheduled onto a seed.
  • The gardenlet seed controller updates the capacity and allocatable fields in Seed status with the capacity of each resource and how much of it is actually available to be consumed by shoots. The allocatable value of a resource is equal to capacity minus reserved.
  • When scheduling shoots, the scheduler filters out all candidate seeds whose allocatable capacity for shoots would be exceeded if the shoot is scheduled onto the seed.

Failure to determine a suitable seed

In case the scheduler fails to find a suitable seed, the operation is being retried with exponential backoff.

Current Limitation / Future Plans

  • Azure has unfortunately a geographically non-hierarchical naming pattern and does not start with the continent. This is the reason why we will exchange the implementation of the MinimalDistance strategy with a more suitable one in the future.