Concepts

1 - Admission Controller

Gardener Admission Controller

While the Gardener API server works with admission plugins to validate and mutate resources belonging to Gardener related API groups, e.g. core.gardener.cloud, the same is needed for resources belonging to non-Gardener API groups as well, e.g. Secrets in the core API group. Therefore, the Gardener Admission Controller runs an http(s) server with the following handlers which serve as validating/mutating endpoints for admission webhooks. It is also used to serve http(s) handlers for authorization webhooks.

Admission Webhook Handlers

This section describes the admission webhook handlers that are currently served.

Kubeconfig Secret Validator

Malicious Kubeconfigs applied by end users may cause a leakage of sensitive data. This handler checks if the incoming request contains a Kubernetes secret with a .data.kubeconfig field and denies the request if the Kubeconfig structure violates Gardener’s security standards.
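
For illustration, a minimal sketch of a Secret that this handler would inspect (all names are hypothetical) could look as follows; the handler decodes .data.kubeconfig and checks its structure, and kubeconfigs relying on fields considered unsafe may be rejected:

apiVersion: v1
kind: Secret
metadata:
  name: my-infrastructure-secret   # hypothetical name
  namespace: garden-my-project     # hypothetical project namespace
type: Opaque
data:
  kubeconfig: <base64-encoded kubeconfig>  # inspected by the validator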

Namespace Validator

Namespaces are the backing entities of Gardener projects in which shoot clusters objects reside. This validation handler protects active namespaces against premature deletion requests. Therefore, it denies deletion requests if a namespace still contains shoot clusters or if it belongs to a non-deleting Gardener project (w/o .metadata.deletionTimestamp).

Resource Size Validator

Since users directly apply Kubernetes native objects to the Garden cluster, it also involves the risk of being vulnerable to DoS attacks because these resources are continuously watched and read by controllers. One example is the creation of Shoot resources with large annotation values (up to 256 kB per value) which can cause severe out-of-memory issues for the Gardenlet component. Vertical autoscaling can help to mitigate such situations, but we cannot expect to scale infinitely, and thus need means to block the attack itself.

The Resource Size Validator checks arbitrary incoming admission requests against a configured maximum size for the resource’s group-version-kind combination and denies the request if the contained object exceeds the quota.

Example for Gardener Admission Controller configuration:

server:
  resourceAdmissionConfiguration:
    limits:
    - apiGroups: ["core.gardener.cloud"]
      apiVersions: ["*"]
      resources: ["shoots", "plants"]
      size: 100k
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["secrets"]
      size: 100k
    unrestrictedSubjects:
    - kind: Group
      name: gardener.cloud:system:seeds
      apiGroup: rbac.authorization.k8s.io
 #  - kind: User
 #    name: admin
 #    apiGroup: rbac.authorization.k8s.io
 #  - kind: ServiceAccount
 #    name: "*"
 #    namespace: garden
 #    apiGroup: ""
    operationMode: block #log

With the configuration above, the Resource Size Validator denies requests for shoots and plants with Gardener’s core API group which exceed a size of 100 kB. The same is done for Kubernetes secrets.

As this feature is meant to protect the system from malicious requests sent by users, it is recommended to exclude trusted groups, users or service accounts from the size restriction via resourceAdmissionConfiguration.unrestrictedSubjects. For example, the backing user for the Gardenlet should always be capable of changing the shoot resource instead of being blocked due to size restrictions. This is because the Gardenlet itself occasionally changes the shoot specification, labels or annotations, and might violate the quota if the existing resource is already close to the quota boundary. Also, operators are supposed to be trusted users and subjecting them to a size limitation can inhibit important operational tasks. Wildcard ("*") in subject name is supported.

Size limitations depend on the individual Gardener setup and choosing the wrong values can affect the availability of your Gardener service. resourceAdmissionConfiguration.operationMode allows to control if a violating request is actually denied (default) or only logged. It’s recommended to start with log, check the logs for exceeding requests, adjust the limits if necessary and finally switch to block.

SeedRestriction

Please refer to this document for more information.

Authorization Webhook Handlers

This section describes the authorization webhook handlers that are currently served.

SeedAuthorization

Please refer to this document for more information.

2 - API Server

Gardener API server

The Gardener API server is a Kubernetes-native extension based on its aggregation layer. It is registered via an APIService object and designed to run inside a Kubernetes cluster whose API it wants to extend.

After registration, it exposes the following resources:

CloudProfiles

CloudProfiles are resources that describe a specific environment of an underlying infrastructure provider, e.g. AWS, Azure, etc. Each shoot has to reference a CloudProfile to declare the environment it should be created in. In a CloudProfile, the Gardener operator specifies certain constraints like available machine types, regions, which Kubernetes versions they want to offer, etc. End-users can read CloudProfiles to see these values, but only operators can change the content or create/delete them. When a shoot is created or updated, an admission plugin checks that only values allowed by the referenced CloudProfile are used.

Additionally, a CloudProfile may contain a providerConfig, which is a special configuration dedicated for the infrastructure provider. Gardener does not evaluate or understand this config, but extension controllers might need it to declare provider-specific constraints or global settings.

Please see this example manifest and consult the documentation of your provider extension controller to get information about its providerConfig.

Seeds

Seeds are resources that represent seed clusters. Gardener does not care about how a seed cluster got created - the only requirement is that it is of at least Kubernetes v1.17 and passes the Kubernetes conformance tests. The Gardener operator has to either deploy the Gardenlet into the cluster they want to use as seed (recommended, then the Gardenlet will create the Seed object itself after bootstrapping), or they provide the kubeconfig to the cluster inside a secret (that is referenced by the Seed resource) and create the Seed resource themselves.

Please see this and this example manifests (and optionally this one).

ShootQuotas

In order to allow end-users who do not have their own dedicated infrastructure account to try out Gardener, the operator can register an account they own and allow it to be used for trial clusters. Trial clusters can be put under quota so that they don’t consume too many resources (resulting in costs), and so that one user cannot consume all resources on their own. These clusters are automatically terminated after a specified time, but end-users may extend the lifetime manually if needed.

Please see this example manifest.

Projects

The first thing before creating a shoot cluster is to create a Project. A project is used to group multiple shoot clusters together. End-users can invite colleagues to the project to enable collaboration, and they can either make them admin or viewer. After an end-user has created a project they will get a dedicated namespace in the garden cluster for all their shoots.

Please see this example manifest.

SecretBindings

Now that the end-user has a namespace the next step is registering their infrastructure provider account.

Please see this example manifest and consult the documentation of the extension controller for the respective infrastructure provider to get information about which keys are required in this secret.

After the secret has been created the end-user has to create a special SecretBinding resource that binds this secret. Later when creating shoot clusters they will reference such a binding.

Please see this example manifest.
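
For orientation only, a minimal sketch of such a SecretBinding (all names are hypothetical; consult the linked example manifest for the authoritative structure):

apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: my-provider-account        # hypothetical name, referenced later by Shoots
  namespace: garden-my-project     # the project namespace
secretRef:
  name: my-provider-account-secret # the Secret holding the infrastructure credentials
  namespace: garden-my-project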

Shoots

Shoot clusters contain various settings that influence how the end-user's Kubernetes cluster will look in the end. As Gardener heavily relies on extension controllers for operating system configuration, networking, and infrastructure specifics, the end-user has the possibility (and responsibility) to provide these provider-specific configurations as well. Such configurations are not evaluated by Gardener (because it doesn’t know/understand them); they are only transported to the respective extension controller.

⚠️ This means that any configuration issue/mistake on the end-user side that relates to a provider-specific flag or setting cannot be caught during the update request itself but only later during the reconciliation (unless a validator webhook has been registered in the garden cluster by an operator).

Please see this example manifest and consult the documentation of the provider extension controller to get information about its spec.provider.controlPlaneConfig, .spec.provider.infrastructureConfig, and .spec.provider.workers[].providerConfig.

(Cluster)OpenIDConnectPresets

Please see this separate documentation file.

Overview Data Model

Gardener Overview Data Model

3 - APIServer Admission Plugins

Admission Plugins

Similar to the kube-apiserver, the gardener-apiserver comes with a few in-tree managed admission plugins. If you want to get an overview of the what and why of admission plugins then this document might be a good start.

This document lists all existing admission plugins with a short explanation of what each of them is responsible for.

ClusterOpenIDConnectPreset, OpenIDConnectPreset

(both enabled by default)

These admission controllers react on CREATE operations for Shoots. If the Shoot does not specify any OIDC configuration (.spec.kubernetes.kubeAPIServer.oidcConfig=nil) then it tries to find a matching ClusterOpenIDConnectPreset or OpenIDConnectPreset, respectively. If there are multiples that match then the one with the highest weight “wins”. In this case, the admission controller will default the OIDC configuration in the Shoot.

ControllerRegistrationResources

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for ControllerRegistrations. It validates that there exists only one ControllerRegistration in the system that is primarily responsible for a given kind/type resource combination. This prevents misconfiguration by the Gardener administrator/operator.

CustomVerbAuthorizer

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Projects. It validates whether the user is bound to an RBAC role with the modify-spec-tolerations-whitelist verb in case the user tries to change the .spec.tolerations.whitelist field of the respective Project resource. Usually, regular project members are not bound to this custom verb, which allows the Gardener administrator to manage certain toleration whitelists on a per-Project basis.

DeletionConfirmation

(enabled by default)

This admission controller reacts on DELETE operations for Projects, Shoots, and ShootStates. It validates that the respective resource is annotated with a deletion confirmation annotation, namely confirmation.gardener.cloud/deletion=true. Only if this annotation is present does it allow the DELETE operation to pass. This prevents users from accidental/undesired deletions.
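
For illustration, a Shoot that may pass this check would carry the annotation as sketched below (names are hypothetical):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot                 # hypothetical name
  namespace: garden-my-project   # hypothetical project namespace
  annotations:
    confirmation.gardener.cloud/deletion: "true"  # required before DELETE is admitted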

ExposureClass

(enabled by default)

This admission controller reacts on CREATE operations for Shoots. It mutates Shoot resources which reference an ExposureClass by merging both its shootSelectors and/or tolerations into the Shoot resource.

ExtensionValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for BackupEntrys, BackupBuckets, Seeds, and Shoots. For all the various extension types in the specifications of these objects, it validates whether there exists a ControllerRegistration in the system that is primarily responsible for the stated extension type(s). This prevents misconfigurations that would otherwise allow users to create such resources with extension types that don’t exist in the cluster, effectively leading to failing reconciliation loops.

ExtensionLabels

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for BackupBuckets, BackupEntrys, CloudProfiles, Seeds, SecretBindings, and Shoots. For all the various extension types in the specifications of these objects, it adds a corresponding label to the resource. This allows extension admission webhooks to filter out the resources they are responsible for and ignore all others. This label is of the form <extension-type>.extensions.gardener.cloud/<extension-name> : "true". For example, an extension label for the provider extension type aws looks like provider.extensions.gardener.cloud/aws : "true".

PlantValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Plants. It sets the gardener.cloud/created-by annotation for newly created Plant resources. Also, it prevents creating new Plant resources in Projects that already have a deletion timestamp.

ProjectValidator

(enabled by default)

This admission controller reacts on CREATE operations for Projects. It prevents creating Projects with a non-empty .spec.namespace if the value in .spec.namespace does not start with garden-.

⚠️ This admission plugin will be removed in a future release and its business logic will be incorporated into the static validation of the gardener-apiserver.

ResourceQuota

(enabled by default)

This admission controller enables object count ResourceQuotas for Gardener resources, e.g. Shoots, SecretBindings, Projects, etc.

⚠️ In addition to this admission plugin, the ResourceQuota controller must be enabled for the Kube-Controller-Manager of your Garden cluster.
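
A sketch of an object count ResourceQuota for Gardener resources in a project namespace (namespace name and limits are hypothetical examples):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gardener
  namespace: garden-my-project   # hypothetical project namespace
spec:
  hard:
    count/shoots.core.gardener.cloud: "25"
    count/secretbindings.core.gardener.cloud: "5"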

ResourceReferenceManager

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for CloudProfiles, Projects, SecretBindings, Seeds, and Shoots. Generally, it checks whether referred resources stated in the specifications of these objects exist in the system (e.g., if a referenced Secret exists). However, it also has some special behaviours for certain resources:

  • CloudProfiles: It rejects removing Kubernetes or machine image versions if there is at least one Shoot that refers to them.
  • Projects: It sets the .spec.createdBy field for newly created Project resources, and defaults the .spec.owner field in case it is empty (to the same value of .spec.createdBy).
  • Seeds: It rejects changing the .spec.settings.shootDNS.enabled value if there is at least one Shoot that refers to this seed.
  • Shoots: It sets the gardener.cloud/created-by=<username> annotation for newly created Shoot resources.

SeedValidator

(enabled by default)

This admission controller reacts on DELETE operations for Seeds. It rejects the deletion if Shoot(s) reference the seed cluster.

ShootDNS

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It tries to assign a default domain to the Shoot if it gets scheduled to a seed that enables DNS for shoots (.spec.settings.shootDNS.enabled=true). It also validates that the DNS configuration (.spec.dns) is not set if the seed disables DNS for shoots.

ShootQuotaValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It validates the resource consumption declared in the specification against applicable Quota resources. Only if the applicable Quota resources admit the configured resources in the Shoot is the request allowed. Applicable Quotas are referenced in the SecretBinding that is used by the Shoot.

ShootVPAEnabledByDefault

(disabled by default)

This admission controller reacts on CREATE operations for Shoots. If enabled, it will enable the managed VerticalPodAutoscaler components (see this doc) by setting spec.kubernetes.verticalPodAutoscaler.enabled=true for newly created Shoots. Already existing Shoots and new Shoots that explicitly disable VPA (spec.kubernetes.verticalPodAutoscaler.enabled=false) will not be affected by this admission plugin.
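
In effect, the plugin defaults the following field on newly created Shoots (sketch of the relevant excerpt):

spec:
  kubernetes:
    verticalPodAutoscaler:
      enabled: true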

ShootTolerationRestriction

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It validates the .spec.tolerations used in Shoots against the whitelist of its Project, or against the whitelist configured in the admission controller’s configuration, respectively. Additionally, it defaults the .spec.tolerations in Shoots with those configured in its Project, and those configured in the admission controller’s configuration, respectively.
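
A rough sketch of how these fields relate (the toleration key is hypothetical): the Project whitelists and defaults a toleration, which is then validated and defaulted in its Shoots:

# Project (excerpt)
spec:
  tolerations:
    whitelist:
    - key: some-taint-key          # hypothetical toleration key
    defaults:
    - key: some-taint-key

# Shoot (excerpt)
spec:
  tolerations:
  - key: some-taint-key            # must be whitelisted in the Project (or the plugin configuration)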

ShootValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for Shoots. It validates certain configurations in the specification against the referred CloudProfile (e.g., machine images, machine types, used Kubernetes version, …). Generally, it performs validations that cannot be handled by the static API validation due to their dynamic nature (e.g., when something needs to be checked against referred resources). Additionally, it takes over certain defaulting tasks (e.g., default machine image for worker pools).

ShootManagedSeed

(enabled by default)

This admission controller reacts on DELETE operations for Shoots. It rejects the deletion if the Shoot is referred to by a ManagedSeed.

ManagedSeedValidator

(enabled by default)

This admission controller reacts on CREATE and UPDATE operations for ManagedSeeds. It validates certain configuration values in the specification against the referred Shoot, for example Seed provider, network ranges, DNS domain, etc. Similar to ShootValidator, it performs validations that cannot be handled by the static API validation due to their dynamic nature. Additionally, it performs certain defaulting tasks, making sure that configuration values that are not specified are defaulted to the values of the referred Shoot, for example Seed provider, network ranges, DNS domain, etc.

ManagedSeedShoot

(enabled by default)

This admission controller reacts on DELETE operations for ManagedSeeds. It rejects the deletion if there are Shoots that are scheduled onto the Seed that is registered by the ManagedSeed.

4 - Architecture

Official Definition - What is Kubernetes?

“Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.”

Introduction - Basic Principle

The foundation of the Gardener (providing Kubernetes Clusters as a Service) is Kubernetes itself, because Kubernetes is the go-to solution to manage software in the Cloud, even when it’s Kubernetes itself (see also OpenStack which is provisioned more and more on top of Kubernetes as well).

While self-hosting, meaning to run Kubernetes components inside Kubernetes, is a popular topic in the community, we apply a special pattern catering to the needs of our cloud platform to provision hundreds or even thousands of clusters. We take a so-called “seed” cluster and seed the control plane (such as the API server, scheduler, controllers, etcd persistence and others) of an end-user cluster, which we call “shoot” cluster, as pods into the “seed” cluster. That means one “seed” cluster, of which we will have one per IaaS and region, hosts the control planes of multiple “shoot” clusters. That allows us to avoid dedicated hardware/virtual machines for the “shoot” cluster control planes. We simply put the control plane into pods/containers and since the “seed” cluster watches them, they can be deployed with a replica count of 1 and only need to be scaled out when the control plane gets under pressure, but no longer for HA reasons. At the same time, the deployments get simpler (standard Kubernetes deployment) and easier to update (standard Kubernetes rolling update). The actual “shoot” cluster consists only of the worker nodes (no control plane) and therefore the users may get full administrative access to their clusters.

Setting The Scene - Components and Procedure

We provide a central operator UI, which we call the “Gardener Dashboard”. It talks to a dedicated cluster, which we call the “Garden” cluster, and uses custom resources managed by an aggregated API server (one of the general extension concepts of Kubernetes) to represent “shoot” clusters. In this “Garden” cluster runs the “Gardener”, which is basically a Kubernetes controller that watches the custom resources and acts upon them, i.e. creates, updates/modifies, or deletes “shoot” clusters. The creation follows basically these steps:

  • Create a namespace in the “seed” cluster for the “shoot” cluster which will host the “shoot” cluster control plane
  • Generate secrets and credentials which the worker nodes will need to talk to the control plane
  • Create the infrastructure (using Terraform), which basically consists of the network setup
  • Deploy the “shoot” cluster control plane into the “shoot” namespace in the “seed” cluster, containing the “machine-controller-manager” pod
  • Create machine CRDs in the “seed” cluster, describing the configuration and the number of worker machines for the “shoot” (the machine-controller-manager watches the CRDs and creates virtual machines out of it)
  • Wait for the “shoot” cluster API server to become responsive (pods will be scheduled, persistent volumes and load balancers are created by Kubernetes via the respective cloud provider)
  • Finally we deploy kube-system daemons like kube-proxy and further add-ons like the dashboard into the “shoot” cluster and the cluster becomes active

Overview Architecture Diagram

Gardener Overview Architecture Diagram

Detailed Architecture Diagram

Gardener Detailed Architecture Diagram

Note: The kubelet as well as the pods inside the “shoot” cluster talk through the front-door (load balancer IP; public Internet) to its “shoot” cluster API server running in the “seed” cluster. The reverse communication from the API server to the pod, service, and node networks happens through a VPN connection that we deploy into “seed” and “shoot” clusters.

5 - Backup Restore

Backup and restore

Kubernetes uses etcd as the key-value store for its resource definitions. Gardener supports the backup and restore of etcd. It is the responsibility of the shoot owners to back up their workload data.

Gardener uses the etcd-backup-restore component to back up the etcd backing the Shoot cluster regularly and to restore it in case of disaster. It is deployed as a sidecar via etcd-druid. This doc mainly focuses on the backup and restore configuration used by Gardener when deploying these components. For more details on the design and internal implementation, please refer to GEP-06 and the documentation in the individual repository.

Bucket provisioning

Refer to the backup bucket extension document for details about configuring the backup bucket.

Backup Policy

etcd-backup-restore supports full snapshots and delta snapshots on top of a full snapshot. In Gardener, this configuration is currently hard-coded to the following parameters:

  • Full Snapshot Schedule:
    • Daily, 24hr interval.
    • For each Shoot, the schedule time in a day is randomized based on the configured Shoot maintenance window.
  • Delta Snapshot Schedule:
    • At 5min interval.
    • If the aggregated events size since the last snapshot goes beyond 100 MiB.
  • Backup History / Garbage Collection Policy:
    • Gardener configures etcd-backup-restore to use an exponential garbage collection policy.
    • As per this policy, the following backups are retained:
      • All full backups and delta backups for the previous hour.
      • Latest full snapshot of each previous hour for the day.
      • Latest full snapshot of each previous day for 7 days.
      • Latest full snapshot of the previous 4 weeks.
    • Garbage collection is configured at 12hr interval.
  • Listing:
    • Gardener doesn’t have any API to list the backups.
    • To find the backup list, an admin can check the BackupEntry resource associated with the Shoot, which holds the bucket and prefix details on the object store (see the sketch below).
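
For orientation, a sketch of what such a BackupEntry might look like (names are hypothetical; the field names assume the core.gardener.cloud/v1beta1 API):

apiVersion: core.gardener.cloud/v1beta1
kind: BackupEntry
metadata:
  name: shoot--my-project--my-shoot--<uid>  # hypothetical; the entry name typically serves as the prefix within the bucket
  namespace: garden-my-project
spec:
  bucketName: my-backup-bucket              # the backing bucket on the object store
  seedName: my-seed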

Restoration

The restoration of etcd from the latest snapshot is automated through the etcd-backup-restore component. Gardener doesn’t support Point-In-Time Recovery (PITR) of etcd. In case of an etcd disaster, etcd is recovered automatically from the latest backup. For further details, please refer to the doc. After the restoration of etcd, the Shoot reconciliation loop brings the cluster back to the same state.

Again, the Shoot owner is responsible for maintaining the backup/restore of their workload. Gardener only takes care of the cluster’s etcd.

6 - Cluster API

Relation between Gardener API and Cluster API (SIG Cluster Lifecycle)

In essence, the Cluster API harmonizes how to get to clusters, while Gardener goes one step further and also harmonizes the clusters themselves. The Cluster API delegates the specifics to so-called providers for infrastructures or control planes via specific CR(D)s while Gardener only has one cluster CR(D). Different Cluster API providers, e.g. for AWS, Azure, GCP, etc. give you vastly different Kubernetes clusters. In contrast, Gardener gives you the exact same clusters with the exact same K8s version, operating system, control plane configuration like for API server or kubelet, add-ons like overlay network, HPA/VPA, DNS and certificate controllers, ingress and network policy controllers, control plane monitoring and logging stacks, down to the behavior of update procedures, auto-scaling, self-healing, etc. on all supported infrastructures. These homogeneous clusters are an essential goal for Gardener as its main purpose is to simplify operations for teams that need to develop and ship software on Kubernetes clusters on a plethora of infrastructures (a.k.a. multi-cloud).

Incidentally, Gardener influenced the Machine API in the Cluster API with its Machine Controller Manager and was the first to adopt it, see also joint SIG Cluster Lifecycle KubeCon talk where @hardikdr from our Gardener team in India spoke.

That means, we follow the Cluster API with great interest and are active members. It was completely overhauled from v1alpha1 to v1alpha2. But because v1alpha2 made too many assumptions about the bring-up of masters and was enforcing master machine operations (see here: “As of v1alpha2, Machine-Based is the only control plane type that Cluster API supports”), services that managed their control planes differently like GKE or Gardener couldn’t adopt it (e.g. Google only supports v1alpha1). In 2020 v1alpha3 was introduced and made it possible (again) to integrate managed services like GKE or Gardener. The mapping from the Gardener API to the Cluster API is mostly syntactic.

To wrap it up, while the Cluster API knows about clusters, it doesn’t know about their make-up. With Gardener, we wanted to go beyond that and harmonize the make-up of the clusters themselves and make them homogeneous across all supported infrastructures. Gardener can therefore deliver homogeneous clusters with exactly the same configuration and behavior on all infrastructures (see also Gardener’s coverage in the official conformance test grid).

With Cluster API v1alpha3 and the support for declarative control plane management, it became now possible (again) to enable Kubernetes managed services like GKE or Gardener. We would be more than happy, if the community would be interested, to contribute a Gardener control plane provider.

7 - Controller Manager

Gardener Controller Manager

The Gardener Controller Manager (often referred to as “GCM”) is a component that runs next to the Gardener API server, similar to the Kubernetes Controller Manager. It runs several control loops that do not require talking to any seed or shoot cluster. Also, as of today it exposes an HTTPS server that serves several endpoints for webhooks for certain resources.

This document explains the various functionalities of the Gardener Controller Manager and their purpose.

Control Loops

Project Controller

This controller consists of three reconciliation loops: the main loop reconciles Project resources, the second controls the necessary actions for stale projects, and the third keeps track of project activity.

“Main” Reconciler

This reconciler will create a dedicated Namespace prefixed with garden- for each Project resource. The name of the namespace can either be stated in the .spec.namespace, or it will be auto-generated by the reconciler. If .spec.namespace is set then it creates it if it does not exist yet. Otherwise, it tries to adopt it. This will only succeed if the Namespace was previously labeled with gardener.cloud/role=project and project.gardener.cloud/name=<project-name>. This is to prevent that end-users can adopt arbitrary namespaces and escalate their privileges, e.g. the kube-system namespace.

After the namespace was created/adopted the reconciler creates several ClusterRoles and ClusterRoleBindings that allow the project members to access related resources based on their roles. These RBAC resources are prefixed with gardener.cloud:system:project{-member,-viewer}:<project-name>. Gardener administrators and extension developers can define their own roles, see this document for more information.

In addition, operators can configure the Project controller to maintain a default ResourceQuota for project namespaces. Quotas can especially limit the creation of user facing resources, e.g. Shoots, SecretBindings, Secrets and thus protect the Garden cluster from massive resource exhaustion but also enable operators to align quotas with respective enterprise policies.

⚠️ Gardener itself is not exempted from configured quotas. For example, Gardener creates Secrets for every shoot cluster in the project namespace and at the same time increases the available quota count. Please mind this additional resource consumption.

The GCM configuration provides a template section controllers.project.quotas where such a ResourceQuota (see example below) can be deposited.

controllers:
  project:
    quotas:
    - config:
        apiVersion: v1
        kind: ResourceQuota
        spec:
          hard:
            count/shoots.core.gardener.cloud: "100"
            count/secretbindings.core.gardener.cloud: "10"
            count/secrets: "800"
      projectSelector: {}

The Project controller takes the shown config and creates a ResourceQuota with the name gardener in the project namespace. If a ResourceQuota resource with the name gardener already exists, the controller will only update fields in spec.hard which are unavailable at that time. Labels and annotations on the ResourceQuota config get merged with the respective fields on existing ResourceQuotas. An optional projectSelector narrows down the amount of projects that are equipped with the given config. If multiple configs match for a project, then only the first match in the list is applied to the project namespace.

The .status.phase of the Project resources will be set to Ready or Failed by the reconciler to indicate whether the reconciliation loop was performed successfully. Also, it will generate Events to provide further information about its operations.

“Stale Projects” Reconciler

As Gardener is a large-scale Kubernetes as a Service, it is designed to be used by a large number of end-users. Over time, it is likely to happen that some of the hundreds or thousands of Project resources are no longer actively used.

Gardener offers the “stale projects” reconciler which will take care of identifying such stale projects, marking them with a “warning”, and eventually deleting them after a certain time period. This reconciler is enabled by default and works as follows:

  1. Projects are considered as “stale”/not actively used when all of the following conditions apply: The namespace associated with the Project does not have any…
    1. Shoot resources.
    2. Plant resources.
    3. BackupEntry resources.
    4. Secret resources that are referenced by a SecretBinding that is in use by a Shoot (not necessarily in the same namespace).
    5. Quota resources that are referenced by a SecretBinding that is in use by a Shoot (not necessarily in the same namespace).
    6. Additionally, the time period since the project was last used (status.lastActivityTimestamp) is longer than the configured minimumLifetimeDays.

If a project is considered “stale” then its .status.staleSinceTimestamp will be set to the time when it was first detected to be stale. If it gets actively used again this timestamp will be removed. After some time the .status.staleAutoDeleteTimestamp will be set to a timestamp after which Gardener will auto-delete the Project resource if it still is not actively used.

The component configuration of the Gardener Controller Manager offers to configure the following options:

  • minimumLifetimeDays: Don’t consider newly created Projects as “stale” too early to give people/end-users some time to onboard and get familiar with the system. The “stale project” reconciler won’t set any timestamp for Projects younger than minimumLifetimeDays. When you change this value then projects marked as “stale” may be no longer marked as “stale” in case they are young enough, or vice versa.
  • staleGracePeriodDays: Don’t compute auto-delete timestamps for stale Projects that are unused for only less than staleGracePeriodDays. This is to not unnecessarily make people/end-users nervous “just because” they haven’t actively used their Project for a given amount of time. When you change this value then already assigned auto-delete timestamps may be removed again if the new grace period is not yet exceeded.
  • staleExpirationTimeDays: Expiration time after which stale Projects are finally auto-deleted (after .status.staleSinceTimestamp). If this value is changed and an auto-delete timestamp got already assigned to the projects then the new value will only take effect if it’s increased. Hence, decreasing the staleExpirationTimeDays will not decrease already assigned auto-delete timestamps.

Gardener administrators/operators can exclude specific Projects from the stale check by annotating the related Namespace resource with project.gardener.cloud/skip-stale-check=true.
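
A sketch of how these options might be set in the GCM component configuration (the exact key path and the values are assumptions; see the example component configuration for the authoritative field names):

controllers:
  project:
    minimumLifetimeDays: 30
    staleGracePeriodDays: 14
    staleExpirationTimeDays: 90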

“Activity” Reconciler

Since the other two reconcilers are unable to actively monitor the relevant objects that are used in a Project (Shoot, Plant, etc.), there could be a situation where the user creates and deletes objects in a short period of time. In that case the Stale Project Reconciler could not see that there was any activity on that project and it would still mark it as stale, even though it is actively used.

The Project Activity Reconciler is implemented to take care of such cases. An event handler notifies the reconciler of any activity, and the reconciler then updates the status.lastActivityTimestamp. This update also triggers the Stale Project Reconciler.

Event Controller

With the Gardener Event Controller you can prolong the lifespan of events related to Shoot clusters. This is an optional controller which will become active once you provide the below mentioned configuration.

All events in K8s are deleted after a configurable time-to-live (controlled via a kube-apiserver argument called --event-ttl (defaulting to 1 hour)). The need to prolong the time-to-live for Shoot cluster events frequently arises when debugging customer issues on live systems. This controller leaves events involving Shoots untouched while deleting all other events after a configured time. In order to activate it, provide the following configuration:

  • concurrentSyncs: The amount of goroutines scheduled for reconciling events.
  • ttlNonShootEvents: When an event reaches this time-to-live it gets deleted unless it is a Shoot-related event (defaults to 1h, equivalent to the event-ttl default).

⚠️ In addition, you should also configure the --event-ttl for the kube-apiserver to define an upper-limit of how long Shoot-related events should be stored. The --event-ttl should be larger than the ttlNonShootEvents or this controller will have no effect.
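
A sketch of the corresponding configuration (assuming a controllers.event section in the GCM component configuration; the values are examples):

controllers:
  event:
    concurrentSyncs: 5
    ttlNonShootEvents: 1h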

Shoot Reference Controller

Shoot objects may specify references to further objects in the Garden cluster which are required for certain features. For example, users can configure various DNS providers via .spec.dns.providers and usually need to refer to a corresponding secret with valid DNS provider credentials inside. Such objects need a special protection against deletion requests as long as they are still being referenced by one or multiple shoots.

Therefore, the Shoot Reference Controller scans shoot clusters for referenced objects and adds the finalizer gardener.cloud/reference-protection to their .metadata.finalizers list. The scanned shoot also gets this finalizer to enable a proper garbage collection in case the Gardener-Controller-Manager is offline at the moment of an incoming deletion request. When an object is not actively referenced anymore because the shoot specification has changed or all related shoots were deleted (are in deletion), the controller will remove the added finalizer again, so that the object can safely be deleted or garbage collected.

The Shoot Reference Controller inspects the following references:

  • DNS provider secrets (.spec.dns.provider)
  • Audit policy configmaps (.spec.kubernetes.kubeAPIServer.auditConfig.auditPolicy.configMapRef)

Further checks might be added in the future.
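
The referenced objects are named in the Shoot specification roughly as follows (excerpt; names are hypothetical):

spec:
  dns:
    providers:
    - type: some-dns-provider        # hypothetical provider type
      secretName: my-dns-credentials # Secret protected by the reference controller
  kubernetes:
    kubeAPIServer:
      auditConfig:
        auditPolicy:
          configMapRef:
            name: my-audit-policy    # ConfigMap protected by the reference controller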

Shoot Retry Controller

The Shoot Retry Controller is responsible for retrying certain failed Shoots. Currently the controller retries only failed Shoots with error code ERR_INFRA_RATE_LIMITS_EXCEEDED.

Seed Controller

The Seed controller in the Gardener Controller Manager reconciles Seed objects with the help of the following reconcilers.

“Main” Reconciler

This reconciliation loop takes care of seed-related operations in the Garden cluster. When a new Seed object is created, the reconciler creates a new Namespace in the garden cluster named seed-<seed-name>. Namespaces dedicated to single seed clusters allow us to segregate access permissions, i.e., a Gardenlet must not have permissions to access objects in all Namespaces in the Garden cluster. There are objects in a Garden environment which are created once by the operator, e.g., the default domain secret or alerting credentials, and which are required for operations happening in the Gardenlet. Therefore, we not only need a seed-specific Namespace but also a copy of these “shared” objects.

The “main” reconciler takes care about this replication:

Kind     Namespace   Label Selector
Secret   garden      gardener.cloud/role

“Backup Bucket” Reconciler

Every time a BackupBucket object is created or updated, the referenced Seed object is enqueued for reconciliation. It’s the reconciler’s task to check the status subresource of all existing BackupBuckets that belong to this seed. If at least one BackupBucket has .status.lastError, the seed condition BackupBucketsReady will turn false and consequently the seed is considered as NotReady. Once the BackupBucket is healthy again, the seed will be re-queued and the condition will turn true.

“Lifecycle” Reconciler

The “Lifecycle” reconciler processes Seed objects which are enqueued every 10 seconds in order to check if the responsible Gardenlet is still responding and operable. Therefore, it checks renewals via Lease objects of the seed in the garden cluster which are renewed regularly by the Gardenlet.

In case a Lease is not renewed for the configured amount in config.controllers.seed.monitorPeriod.duration:

  1. The reconciler assumes that the Gardenlet stopped operating and updates the GardenletReady condition to Unknown.
  2. Additionally, conditions and constraints of all Shoot resources scheduled on the affected seed are set to Unknown as well because a striking Gardenlet won’t be able to maintain these conditions any more.
  3. If the gardenlet’s client certificate has expired (identified based on the .status.clientCertificateExpirationTimestamp field in the Seed resource) and if it is managed by a ManagedSeed then this will be triggered for a reconciliation. This will trigger the bootstrapping process again and allows gardenlets to obtain a fresh client certificate.
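
A sketch of the relevant setting in the GCM component configuration (the exact serialization of the duration is an assumption; the value is an example):

controllers:
  seed:
    monitorPeriod: 40s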

ControllerRegistration Controller

The ControllerRegistration controller makes sure that the required Gardener extensions specified by the ControllerRegistration resources are present in the seed clusters. It also takes care of the creation and deletion of ControllerInstallation objects for a given seed cluster. The controller has three reconciliation loops.

“Main” Reconciler

This reconciliation loop watches the Seed objects and determines which ControllerRegistrations are required for them and creates/deletes the corresponding extension controller to reach the determined state. To begin with, it computes the kind/type combinations of extensions required for the seed. For this, the controller examines a live list of ControllerRegistrations, ControllerInstallations, BackupBuckets, BackupEntrys, Shoots, and Secrets from the garden cluster. For example, it examines the shoots running on the seed and deduces kind/type combinations like Infrastructure/gcp. It also decides whether they should always be deployed based on the .spec.deployment.policy. For the configuration options, please see this section.

Based on these required combinations, each of them are mapped to ControllerRegistration objects and then to their corresponding ControllerInstallation objects (if existing). The controller then creates or updates the required ControllerInstallation objects for the given seed. It also deletes every existing ControllerInstallation whose referenced ControllerRegistration is not part of the required list. For example, if the shoots in the seed are no longer using the DNS provider aws-route53, then the controller proceeds to delete the respective ControllerInstallation object.

“ControllerRegistration” Reconciler

This reconciliation loop watches the ControllerRegistration resource and adds finalizers to it when they are created. In case a deletion request comes in for the resource, i.e., if a .metadata.deletionTimestamp is set, it actively scans for a ControllerInstallation resource using this ControllerRegistration, and decides whether the deletion can be allowed. In case no related ControllerInstallation is present, it removes the finalizer and marks it for deletion.

“Seed” Reconciler

This loop also watches the Seed object and adds finalizers to it at creation. If a .metadata.deletionTimestamp is set for the seed then the controller checks for existing ControllerInstallation objects which reference this seed. If no such objects exist then it removes the finalizer and allows the deletion.

“CertificateSigningRequest” controller

After the gardenlet gets deployed on the Seed cluster it needs to establish itself as a trusted party to communicate with the Gardener API server. It runs through a bootstrap flow similar to the kubelet bootstrap process.

On startup the gardenlet uses a kubeconfig with a bootstrap token which authenticates it as being part of the system:bootstrappers group. This kubeconfig is used to create a CertificateSigningRequest (CSR) against the Gardener API server.

The controller in gardener-controller-manager checks whether the CertificateSigningRequest has the expected organisation, common name and usages which the gardenlet would request.

It only auto-approves the CSR if the client making the request is allowed to “create” the certificatesigningrequests/seedclient subresource. Clients with the system:bootstrappers group are bound to the gardener.cloud:system:seed-bootstrapper ClusterRole, hence, they have such privileges. As the bootstrap kubeconfig for the gardenlet contains a bootstrap token which is authenticated as being part of the system:bootstrappers group, its created CSR gets auto-approved.

“Bastion” Controller

Bastion resources have a limited lifetime, which can be extended up to a certain amount by performing a heartbeat on them. The Bastion controller is responsible for deleting expired or rotten Bastions.

  • “expired” means a Bastion has exceeded its status.ExpirationTimestamp.
  • “rotten” means a Bastion is older than the configured maxLifetime.

The maxLifetime is an option on the Bastion controller and defaults to 24 hours.
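
A sketch of the corresponding setting (assuming a controllers.bastion section in the GCM component configuration):

controllers:
  bastion:
    maxLifetime: 24h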

The deletion triggers the gardenlet to perform the necessary cleanups in the Seed cluster, so some time can pass between deletion and the Bastion actually disappearing. Clients like gardenctl are advised to not re-use Bastions whose deletion timestamp has been set already.

Refer to GEP-15 for more information on the lifecycle of Bastion resources.

“Plant” Controller

Using the Plant resource, an external Kubernetes cluster (not managed by Gardener) can be registered to Gardener. Gardener Controller Manager is the component that is responsible for the Plant resource reconciliation. As part of the reconciliation loop, the Gardener Controller Manager performs health checks on the external Kubernetes cluster and gathers more information about it - all of this information serves for monitoring purposes of the external Kubernetes cluster.

The component configuration of the Gardener Controller Manager offers to configure the following options for the plant controller:

  • syncPeriod: The duration of how often the Plant resource is reconciled, i.e., how often health checks are performed. The default value is 30s.
  • concurrentSyncs: The number of goroutines scheduled for reconciling events, i.e., the number of possible parallel reconciliations. The default value is 5.
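
A sketch of the corresponding configuration (assuming a controllers.plant section in the GCM component configuration; the values shown are the stated defaults):

controllers:
  plant:
    syncPeriod: 30s
    concurrentSyncs: 5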

The Plant resource reports the following information for the external Kubernetes cluster:

  • Cluster information
    • Cloud provider information - the cloud provider type and region are maintained in the Plant status (.status.clusterInfo.cloud).
    • Kubernetes version - the Kubernetes version is maintained in the Plant status (.status.clusterInfo.kubernetes.version).
  • Cluster status
    • API Server availability - maintained as condition with type APIServerAvailable.
    • Cluster Nodes healthiness - maintained as condition with type EveryNodeReady.

8 - Etcd

etcd - Key-Value Store for Kubernetes

etcd is a strongly consistent key-value store and the most prevalent choice for the Kubernetes persistence layer. All API cluster objects like Pods, Deployments, Secrets, etc. are stored in etcd which makes it an essential part of a Kubernetes control plane.

Shoot cluster persistence

Each shoot cluster gets its very own persistence for the control plane. It runs in the shoot namespace on the respective seed cluster. Concretely, there are two etcd instances per shoot cluster which the Kube-Apiserver is configured to use in the following way:

  • etcd-main

A store that contains all “cluster critical” or “long-term” objects. These object kinds are typically considered for a backup to prevent any data loss.

  • etcd-events

A store that contains all Event objects (events.k8s.io) of a cluster. Events usually have a short retention period and occur frequently, but are not essential for disaster recovery.

The setup above prevents both problems: the critical etcd-main store is not flooded by Kubernetes Events, and backup space is not occupied by non-critical data. This segmentation saves time and resources.

etcd Operator

Configuring, maintaining and health-checking etcd is outsourced to a dedicated operator called ETCD Druid. When Gardenlet reconciles a Shoot resource, it creates or updates an Etcd resource in the seed cluster, containing the necessary information (backup information, defragmentation schedule, resources, etc.) that etcd-druid needs to manage the lifecycle of the desired etcd instance (today main or events). Likewise, when the shoot is deleted, Gardenlet deletes the Etcd resource and ETCD Druid takes care of cleaning up all related objects, e.g. the backing StatefulSet.

Autoscaling

Gardenlet maintains HVPA objects for etcd StatefulSets if the corresponding feature gate is enabled. This enables vertical scaling for etcd. Downscaling is handled more pessimistically to prevent many subsequent etcd restarts. Thus, for production and infrastructure clusters downscaling is deactivated, and for all other clusters lower advertised requests/limits are only applied during a shoot’s maintenance time window.

Backup

If Seeds specify backups for etcd (example), then Gardener and the respective provider extensions are responsible for creating a bucket on the cloud provider’s side (modelled through BackupBucket resource). The bucket stores backups of shoots scheduled on that seed. Furthermore, Gardener creates a BackupEntry which subdivides the bucket and thus makes it possible to store backups of multiple shoot clusters.

The etcd-main instance itself is configured to run with a special backup-restore sidecar. It takes care of regularly backing up etcd data and restoring it in case of data loss. More information can be found on the component’s GitHub page https://github.com/gardener/etcd-backup-restore.

How long backups are stored in the bucket after a shoot has been deleted, depends on the configured retention period in the Seed resource. Please see this example configuration for more information.

Housekeeping

etcd maintenance tasks must be performed from time to time in order to regain database storage and to ensure the system’s reliability. The backup-restore sidecar takes care of this job as well. Gardener chooses a random time within the shoot’s maintenance time window to schedule these tasks.

9 - Gardenlet

Gardenlet

Gardener is implemented using the operator pattern: It uses custom controllers that act on our own custom resources, and apply Kubernetes principles to manage clusters instead of containers. Following this analogy, you can recognize components of the Gardener architecture as well-known Kubernetes components, for example, shoot clusters can be compared with pods, and seed clusters can be seen as worker nodes.

The following Gardener components play a similar role as the corresponding components in the Kubernetes architecture:

Gardener Component             Kubernetes Component
gardener-apiserver             kube-apiserver
gardener-controller-manager    kube-controller-manager
gardener-scheduler             kube-scheduler
gardenlet                      kubelet

Similar to how the kube-scheduler of Kubernetes finds an appropriate node for newly created pods, the gardener-scheduler of Gardener finds an appropriate seed cluster to host the control plane for newly ordered clusters. By providing multiple seed clusters for a region or provider, and distributing the workload, Gardener also reduces the blast radius of potential issues.

Kubernetes runs a primary “agent” on every node, the kubelet, which is responsible for managing pods and containers on its particular node. Decentralizing the responsibility to the kubelet has the advantage that the overall system is scalable. Gardener achieves the same for cluster management by using a gardenlet as the primary “agent” on every seed cluster, which is only responsible for shoot clusters located in its particular seed cluster:

Counterparts in the Gardener Architecture and the Kubernetes Architecture

The gardener-controller-manager has control loops to manage resources of the Gardener API. However, instead of letting the gardener-controller-manager talk directly to seed clusters or shoot clusters, the responsibility isn’t only delegated to the gardenlet, but also managed using a reversed control flow: It’s up to the gardenlet to contact the Gardener API server, for example, to share a status for its managed seed clusters.

Reversing the control flow allows placing seed clusters or shoot clusters behind firewalls without the necessity of direct access via VPN tunnels anymore.

Reversed Control Flow Using a Gardenlet

TLS Bootstrapping

Kubernetes doesn’t manage worker nodes itself, and it’s also not responsible for the lifecycle of the kubelet running on the workers. Similarly, Gardener doesn’t manage seed clusters itself, so Gardener is also not responsible for the lifecycle of the gardenlet running on the seeds. As a consequence, both the gardenlet and the kubelet need to prepare a trusted connection to the Gardener API server and the Kubernetes API server correspondingly.

To prepare a trusted connection between the gardenlet and the Gardener API server, the gardenlet initializes a bootstrapping process after you deployed it into your seed clusters:

  1. The gardenlet starts up with a bootstrap kubeconfig having a bootstrap token that allows to create CertificateSigningRequest (CSR) resources.

  2. After the CSR is signed, the gardenlet downloads the created client certificate, creates a new kubeconfig with it, and stores it inside a Secret in the seed cluster.

  3. The gardenlet deletes the bootstrap kubeconfig secret, and starts up with its new kubeconfig.

  4. The gardenlet starts normal operation.

The gardener-controller-manager runs a control loop that automatically signs CSRs created by gardenlets.

The gardenlet bootstrapping process is based on the kubelet bootstrapping process. More information: Kubelet’s TLS bootstrapping.

If you don’t want to run this bootstrap process you can create a kubeconfig pointing to the garden cluster for the gardenlet yourself, and use field gardenClientConnection.kubeconfig in the gardenlet configuration to share it with the gardenlet.

Gardenlet Certificate Rotation

The certificate used to authenticate the gardenlet against the API server has a certain validity based on the configuration of the garden cluster (--cluster-signing-duration flag of the kube-controller-manager (default 1y)). After about 80% of the validity expired, the gardenlet tries to automatically replace the current certificate with a new one (certificate rotation).

To use certificate rotation, you need to specify the secret to store the kubeconfig with the rotated certificate in field .gardenClientConnection.kubeconfigSecret of the gardenlet component configuration.
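
A sketch of the relevant part of the gardenlet component configuration (the secret name is hypothetical):

gardenClientConnection:
  kubeconfigSecret:
    name: gardenlet-kubeconfig   # hypothetical secret name
    namespace: garden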

Rotate certificates using bootstrap kubeconfig

If the gardenlet created the certificate during the initial TLS Bootstrapping using the Bootstrap kubeconfig, certificates can be rotated automatically. The same control loop in the gardener-controller-manager that signs the CSRs during the initial TLS Bootstrapping also automatically signs the CSR during a certificate rotation.

ℹ️ You can trigger an immediate renewal by annotating the Secret in the seed cluster stated in the .gardenClientConnection.kubeconfigSecret field with gardener.cloud/operation=renew and restarting the gardenlet. After it booted up again, gardenlet will issue a new certificate independent of the remaining validity of the existing one.

Rotate Certificate Using Custom kubeconfig

When trying to rotate a custom certificate that wasn’t created by gardenlet as part of the TLS Bootstrap, the x509 certificate’s Subject field needs to conform to the following:

  • the Common Name (CN) is prefixed with gardener.cloud:system:seed:
  • the Organization (O) equals gardener.cloud:system:seeds

Otherwise, the gardener-controller-manager doesn’t automatically sign the CSR. In this case, an external component or user needs to approve the CSR manually, for example, using the command kubectl certificate approve seed-csr-<...>. If that doesn’t happen within 15 minutes, the gardenlet repeats the process and creates another CSR.

Configuring the Seed to work with

The Gardenlet works with a single seed, which must be configured in the GardenletConfiguration under .seedConfig. This must be a copy of the Seed resource, for example (see example/20-componentconfig-gardenlet.yaml for a more complete example):

apiVersion: gardenlet.config.gardener.cloud/v1alpha1
kind: GardenletConfiguration
seedConfig:
  metadata:
    name: my-seed
  spec:
    provider:
      type: aws
    # ...
    secretRef:
      name: my-seed-secret
      namespace: garden

When using make start-gardenlet, the corresponding script will automatically fetch the seed cluster’s kubeconfig based on the seedConfig.spec.secretRef and set the environment accordingly.

On startup, gardenlet registers a Seed resource using the given template in seedConfig if it’s not present already.

Component Configuration

In the component configuration for the gardenlet, it’s possible to define:

  • settings for the Kubernetes clients interacting with the various clusters
  • settings for the control loops inside the gardenlet
  • settings for leader election and log levels, feature gates, and seed selection or seed configuration.

More information: Example Gardenlet Component Configuration.

Heartbeats

Similar to how Kubernetes uses Lease objects for node heartbeats (see KEP), the gardenlet uses Lease objects for heartbeats of the seed cluster. Every two seconds, the gardenlet checks that the seed cluster’s /healthz endpoint returns HTTP status code 200. If that is the case, the gardenlet renews the lease in the Garden cluster in the gardener-system-seed-lease namespace and updates the GardenletReady condition in the status.conditions field of the Seed resource(s).

Similarly to the node-lifecycle-controller inside the kube-controller-manager, the gardener-controller-manager features a seed-lifecycle-controller that sets the GardenletReady condition to Unknown in case the gardenlet fails to renew the lease. As a consequence, the gardener-scheduler doesn’t consider this seed cluster for newly created shoot clusters anymore.

/healthz Endpoint

The gardenlet includes an HTTPS server that serves a /healthz endpoint. It’s used as a liveness probe in the Deployment of the gardenlet. If the gardenlet fails to renew its lease, the endpoint returns 500 Internal Server Error; otherwise, it returns 200 OK.

Please note that the /healthz endpoint only indicates whether the gardenlet could successfully probe the Seed’s API server and renew the lease with the Garden cluster. It does not show whether the Gardener extension API server (with the Gardener resource groups) is available. However, the gardenlet is designed to withstand such connection outages and retries until the connection is reestablished.

Control Loops

The gardenlet consists of several controllers, which are described in more detail below.

⚠️ This section is not necessarily complete and might be under construction.

BackupEntry Controller

The BackupEntry controller reconciles those core.gardener.cloud/v1beta1.BackupEntry resources whose .spec.seedName value is equal to the name of a Seed the respective gardenlet is responsible for. Those resources are created by the Shoot controller (only if backup is enabled for the respective Seed) and there is exactly one BackupEntry per Shoot.

The controller creates an extensions.gardener.cloud/v1alpha1.BackupEntry resource (non-namespaced) in the seed cluster and waits until the responsible extension controller has reconciled it (see this for more details). The status is populated in the .status.lastOperation field.

The core.gardener.cloud/v1beta1.BackupEntry resource has an owner reference pointing to the corresponding Shoot. Hence, if the Shoot is deleted, also the BackupEntry resource gets deleted. In this case, the controller deletes the extensions.gardener.cloud/v1alpha1.BackupEntry resource in the seed cluster and waits until the responsible extension controller has deleted it. Afterwards, the finalizer of the core.gardener.cloud/v1beta1.BackupEntry resource is released so that it finally disappears from the system.

Keep Backup for Deleted Shoots

In some scenarios it might be beneficial to not immediately delete the BackupEntrys (and with them, the etcd backup) for deleted Shoots.

In this case you can configure the .controllers.backupEntry.deletionGracePeriodHours field in the component configuration of the gardenlet. For example, if you set it to 48, then the BackupEntrys for deleted Shoots will only be deleted 48 hours after the Shoot was deleted.

Additionally, you can limit the shoot purposes for which this applies by setting .controllers.backupEntry.deletionGracePeriodShootPurposes[]. For example, if you set it to [production] then only the BackupEntrys for Shoots with .spec.purpose=production will be deleted after the configured grace period. All others will be deleted immediately after the Shoot deletion.
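For example, the relevant part of the gardenlet component configuration could look like this (a minimal sketch):

controllers:
  backupEntry:
    deletionGracePeriodHours: 48
    deletionGracePeriodShootPurposes:
    - production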

Managed Seeds

Gardener users can use shoot clusters as seed clusters, so-called “managed seeds” (aka “shooted seeds”), by creating ManagedSeed resources. By default, the gardenlet that manages this shoot cluster then automatically creates a clone of itself with the same version and the same configuration that it currently has. Then it deploys the gardenlet clone into the managed seed cluster.

If you want to prevent the automatic gardenlet deployment, specify the seedTemplate section in the ManagedSeed resource, and don’t specify the gardenlet section. In this case, you have to deploy the gardenlet on your own into the seed cluster.
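A minimal ManagedSeed sketch might look as follows (assuming the seedmanagement.gardener.cloud/v1alpha1 API group and a shoot named my-shoot; exact fields may differ between Gardener versions):

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: my-managed-seed
  namespace: garden
spec:
  shoot:
    name: my-shoot   # shoot cluster to be registered as a seed
  # no seedTemplate section here, so a gardenlet clone is deployed automatically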

More information: Register Shoot as Seed

Migrating from Previous Gardener Versions

If your Gardener version doesn’t support gardenlets yet, no special migration is required, but the following prerequisites must be met:

  • Your Gardener version is at least 0.31 before upgrading to v1.
  • You have to make sure that your garden cluster is exposed in a way that it’s reachable from all your seed clusters.

With previous Gardener versions, you had deployed the Gardener Helm chart (incorporating the API server, controller-manager, and scheduler). With v1, this stays the same, but you now have to deploy the gardenlet Helm chart as well into all of your seeds (if they aren’t managed, as mentioned earlier).

More information: Deploy a Gardenlet for all instructions.

Gardener Architecture

Issue #356: Implement Gardener Scheduler

PR #2309: Add /healthz endpoint for Gardenlet

10 - Network Policies

Network Policies in Gardener

As Seed clusters can host the Kubernetes control planes of many Shoot clusters, it is necessary to isolate the control planes from each other for security reasons. Besides deploying each control plane in its own namespace, Gardener creates network policies to also isolate the networks. Essentially, network policies make sure that pods can only talk to other pods over the network they are supposed to. As such, network policies are an important part of Gardener’s tenant isolation.

Gardener deploys network policies into

  • each namespace hosting the Kubernetes control plane of the Shoot cluster.
  • the namespace dedicated to Gardener seed-wide global controllers. This namespace is often called garden and contains e.g. the Gardenlet.
  • the kube-system namespace in the Shoot.

The aforementioned namespaces in the Seed contain a deny-all network policy that denies all ingress and egress traffic. This secure-by-default setting requires pods to explicitly allow network traffic, which is done by giving pods labels that match the selectors of the network policies deployed by Gardener.
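Conceptually, such a deny-all policy is an ordinary Kubernetes NetworkPolicy similar to the following sketch (name and namespace are illustrative; the actual policies deployed by Gardener may differ):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: shoot--foo--bar   # a shoot control plane namespace in the seed
spec:
  podSelector: {}              # applies to all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
  # no ingress/egress rules -> all traffic is denied unless allowed by another policy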

More details on the deployed network policies can be found in the development and usage sections.

11 - Resource Manager

Gardener Resource Manager

Initially, the gardener-resource-manager was a project similar to the kube-addon-manager. It manages Kubernetes resources in a target cluster which means that it creates, updates, and deletes them. Also, it makes sure that manual modifications to these resources are reconciled back to the desired state.

In the Gardener project, we had been using the kube-addon-manager for more than two years. While we progressed with our extensibility story (moving cloud providers out-of-tree), we decided that the kube-addon-manager is no longer suitable for this use-case. The problem with it is that it needs to have its managed resources on its file system. This requires storing the resources in ConfigMaps or Secrets and mounting them to the kube-addon-manager pod during deployment time. The gardener-resource-manager uses CustomResourceDefinitions, which allow adding, changing, and removing resources dynamically with immediate effect and without the need to reconfigure volume mounts or restart the pod.

Meanwhile, the gardener-resource-manager has evolved to a more generic component comprising several controllers and webhook handlers. It is deployed by gardenlet once per seed (in the garden namespace) and once per shoot (in the respective shoot namespaces in the seed).

Controllers

ManagedResource controller

This controller watches custom objects called ManagedResources in the resources.gardener.cloud/v1alpha1 API group. These objects contain references to secrets, which themselves contain the resources to be managed. The reason why a Secret is used to store the resources is that they could contain confidential information like credentials.

---
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example1
  namespace: default
type: Opaque
data:
  objects.yaml: YXBpVmVyc2lvbjogdjEKa2luZDogQ29uZmlnTWFwCm1ldGFkYXRhOgogIG5hbWU6IHRlc3QtMTIzNAogIG5hbWVzcGFjZTogZGVmYXVsdAotLS0KYXBpVmVyc2lvbjogdjEKa2luZDogQ29uZmlnTWFwCm1ldGFkYXRhOgogIG5hbWU6IHRlc3QtNTY3OAogIG5hbWVzcGFjZTogZGVmYXVsdAo=
    # apiVersion: v1
    # kind: ConfigMap
    # metadata:
    #   name: test-1234
    #   namespace: default
    # ---
    # apiVersion: v1
    # kind: ConfigMap
    # metadata:
    #   name: test-5678
    #   namespace: default
---
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example1

In the above example, the controller creates two ConfigMaps in the default namespace. When a user manually modifies them, they will be reconciled back to the desired state stored in the managedresource-example1 secret.

It is also possible to inject labels into all the resources:

---
apiVersion: v1
kind: Secret
metadata:
  name: managedresource-example2
  namespace: default
type: Opaque
data:
  other-objects.yaml: YXBpVmVyc2lvbjogYXBwcy92MSAjIGZvciB2ZXJzaW9ucyBiZWZvcmUgMS45LjAgdXNlIGFwcHMvdjFiZXRhMgpraW5kOiBEZXBsb3ltZW50Cm1ldGFkYXRhOgogIG5hbWU6IG5naW54LWRlcGxveW1lbnQKc3BlYzoKICBzZWxlY3RvcjoKICAgIG1hdGNoTGFiZWxzOgogICAgICBhcHA6IG5naW54CiAgcmVwbGljYXM6IDIgIyB0ZWxscyBkZXBsb3ltZW50IHRvIHJ1biAyIHBvZHMgbWF0Y2hpbmcgdGhlIHRlbXBsYXRlCiAgdGVtcGxhdGU6CiAgICBtZXRhZGF0YToKICAgICAgbGFiZWxzOgogICAgICAgIGFwcDogbmdpbngKICAgIHNwZWM6CiAgICAgIGNvbnRhaW5lcnM6CiAgICAgIC0gbmFtZTogbmdpbngKICAgICAgICBpbWFnZTogbmdpbng6MS43LjkKICAgICAgICBwb3J0czoKICAgICAgICAtIGNvbnRhaW5lclBvcnQ6IDgwCg==
    # apiVersion: apps/v1
    # kind: Deployment
    # metadata:
    #   name: nginx-deployment
    # spec:
    #   selector:
    #     matchLabels:
    #       app: nginx
    #   replicas: 2 # tells deployment to run 2 pods matching the template
    #   template:
    #     metadata:
    #       labels:
    #         app: nginx
    #     spec:
    #       containers:
    #       - name: nginx
    #         image: nginx:1.7.9
    #         ports:
    #         - containerPort: 80

---
apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  secretRefs:
  - name: managedresource-example2
  injectLabels:
    foo: bar

In this example the label foo=bar will be injected into the Deployment as well as into all created ReplicaSets and Pods.

Preventing Reconciliations

If a ManagedResource is annotated with resources.gardener.cloud/ignore=true then it will be skipped entirely by the controller (no reconciliations or deletions of managed resources at all). However, when the ManagedResource itself is deleted (for example when a shoot is deleted) then the annotation is not respected and all resources will be deleted as usual. This feature can be helpful to temporarily patch/change resources managed as part of such ManagedResource. Condition checks will be skipped for such ManagedResources.
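For example, reconciliation of the ManagedResource from the first example can be paused by annotating it like this:

apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
  annotations:
    resources.gardener.cloud/ignore: "true"
spec:
  secretRefs:
  - name: managedresource-example1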

Modes

The gardener-resource-manager can manage a resource in different modes. The supported modes are:

  • Ignore
    • The corresponding resource is removed from the ManagedResource status (.status.resources). No action is performed on the cluster - the resource is no longer “managed” (updated or deleted).
    • The primary use case is a migration of a resource from one ManagedResource to another one.

The mode for a resource can be specified with the resources.gardener.cloud/mode annotation. The annotation should be specified in the encoded resource manifest in the Secret that is referenced by the ManagedResource.
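For example, a ConfigMap inside the referenced Secret’s objects.yaml could carry the annotation like this (shown decoded for readability):

apiVersion: v1
kind: ConfigMap
metadata:
  name: test-1234
  namespace: default
  annotations:
    resources.gardener.cloud/mode: Ignore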

Resource Class

By default, the gardener-resource-manager controller watches for ManagedResources in all namespaces. The --namespace flag can be passed to the gardener-resource-manager binary to restrict the watch to ManagedResources in a single namespace. A ManagedResource has an optional .spec.class field that allows indicating that it belongs to a given class of resources. The --resource-class flag can be passed to the gardener-resource-manager binary to restrict the watch to ManagedResources with the given .spec.class. A default class is assumed if no class is specified.
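For example, a ManagedResource with a dedicated class could look like this (the class name seed is only an example):

apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: example
  namespace: default
spec:
  class: seed          # only picked up by a gardener-resource-manager started with --resource-class=seed
  secretRefs:
  - name: managedresource-example1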

Conditions

A ManagedResource has a ManagedResourceStatus, which has an array of Conditions. Conditions currently include:

Condition              Description
ResourcesApplied       True if all resources are applied to the target cluster
ResourcesHealthy       True if all resources are present and healthy
ResourcesProgressing   False if all resources have been fully rolled out

ResourcesApplied may be False when:

  • the resource apiVersion is not known to the target cluster
  • the resource spec is invalid (for example the label value does not match the required regex for it)

ResourcesHealthy may be False when:

  • the resource is not found
  • the resource is a Deployment and the Deployment does not have the minimum availability.

ResourcesProgressing may be True when:

  • a Deployment, StatefulSet or DaemonSet has not been fully rolled out yet, i.e. not all replicas have been updated with the latest changes to spec.template.

Each Kubernetes resource has its own notion of being healthy. For example, a Deployment is considered healthy if the controller observed its current revision and if the number of updated replicas is equal to the number of replicas.

The following status.conditions section describes a healthy ManagedResource:

conditions:
- lastTransitionTime: "2022-05-03T10:55:39Z"
  lastUpdateTime: "2022-05-03T10:55:39Z"
  message: All resources are healthy.
  reason: ResourcesHealthy
  status: "True"
  type: ResourcesHealthy
- lastTransitionTime: "2022-05-03T10:55:36Z"
  lastUpdateTime: "2022-05-03T10:55:36Z"
  message: All resources have been fully rolled out.
  reason: ResourcesRolledOut
  status: "False"
  type: ResourcesProgressing
- lastTransitionTime: "2022-05-03T10:55:18Z"
  lastUpdateTime: "2022-05-03T10:55:18Z"
  message: All resources are applied.
  reason: ApplySucceeded
  status: "True"
  type: ResourcesApplied

Ignoring Updates

In some cases it is not desirable to update or re-apply some of the cluster components (for example, if customization is required or needs to be applied by the end-user). For these resources, the annotation resources.gardener.cloud/ignore needs to be set to "true" or a truthy value ("1", "t", "T", "true", "TRUE", "True") in the corresponding managed resource secrets. This can be done by the components that create the managed resource secrets, for example Gardener extensions or Gardener itself. Once this is done, the resource will be initially created and later ignored during reconciliation.

Preserving replicas or resources in Workload Resources

The objects which are part of the ManagedResource can be annotated with

  • resources.gardener.cloud/preserve-replicas=true in case the .spec.replicas field of workload resources like Deployments, StatefulSets, etc. shall be preserved during updates.
  • resources.gardener.cloud/preserve-resources=true in case the .spec.containers[*].resources fields of all containers of workload resources like Deployments, StatefulSets, etc. shall be preserved during updates.

This can be useful if there are non-standard horizontal/vertical auto-scaling mechanisms in place. Standard mechanisms like HorizontalPodAutoscaler or VerticalPodAutoscaler will be auto-recognized by gardener-resource-manager, i.e., in such cases the annotations are not needed.
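For example, a Deployment whose replicas and container resources shall not be overwritten could be annotated as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  annotations:
    resources.gardener.cloud/preserve-replicas: "true"
    resources.gardener.cloud/preserve-resources: "true"
spec:
  # ...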

Origin

All the objects managed by the resource manager get a dedicated annotation resources.gardener.cloud/origin, referencing the ManagedResource object that describes them.

By default, this has the format <namespace>/<objectname>. In multi-cluster scenarios (where the ManagedResource objects are maintained in a cluster different from the one in which the described objects are managed), it might be useful to include the cluster identity as well.

This can be enforced by setting the --cluster-id option. Here, several possibilities are supported:

  • given a direct value: use this as id for the source cluster
  • <cluster>: read the cluster identity from a cluster-identity config map in the kube-system namespace (attribute cluster-identity). This is automatically maintained in all clusters managed or involved in a gardener landscape.
  • <default>: try to read the cluster identity from the config map. If not found, no identity is used
  • empty string: no cluster identity is used (completely cluster local scenarios)

The format of the origin annotation with a cluster id is <cluster id>:<namespace>/<objectname>.
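For example, with --cluster-id=my-landscape (a made-up identifier), an object described by the ManagedResource example in the default namespace would be annotated roughly like this:

metadata:
  annotations:
    resources.gardener.cloud/origin: my-landscape:default/example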

The default for the cluster id is the empty value (do not use cluster id).

Garbage Collector For Immutable ConfigMaps/Secrets

In Kubernetes, workload resources (e.g., Pods) can mount ConfigMaps or Secrets or reference them via environment variables in containers. When the content of such a ConfigMap/Secret changes, the respective workload usually does not reload the configuration dynamically, i.e., a restart is required. The most commonly used approach is having so-called checksum annotations in the pod template, which makes Kubernetes recreate the pod if the checksum changes. However, this has the downside that old, still running versions of the workload might not be able to properly work with the already updated content in the ConfigMap/Secret, potentially causing application outages.

In order to protect users from such outages (and to also improve the performance of the cluster), the Kubernetes community provides the “immutable ConfigMaps/Secrets feature”. Enabling immutability requires ConfigMaps/Secrets to have unique names. Having unique names requires the client to delete ConfigMaps/Secrets that are no longer in use.

In order to provide a similarly lightweight experience for clients (compared to the well-established checksum annotation approach), the Gardener Resource Manager features an optional garbage collector controller (disabled by default). The purpose of this controller is cleaning up such immutable ConfigMaps/Secrets if they are no longer in use.

How does the garbage collector work?

The following algorithm is implemented in the GC controller:

  1. List all ConfigMaps and Secrets labeled with resources.gardener.cloud/garbage-collectable-reference=true.
  2. List all Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, Pods and for each of them
    1. iterate over the .metadata.annotations and for each of them
      1. If the annotation key follows the reference.resources.gardener.cloud/{configmap,secret}-<hash> scheme and the value equals <name> then consider it as “in-use”.
  3. Delete all ConfigMaps and Secrets not considered as “in-use”.

Consequently, clients need to

  1. Create immutable ConfigMaps/Secrets with unique names (e.g., a checksum suffix based on the .data).

  2. Label such ConfigMaps/Secrets with resources.gardener.cloud/garbage-collectable-reference=true.

  3. Annotate their workload resources with reference.resources.gardener.cloud/{configmap,secret}-<hash>=<name> for all ConfigMaps/Secrets used by the containers of the respective Pods.

    ⚠️ Add such annotations to .metadata.annotations as well as to all templates of other resources (e.g., .spec.template.metadata.annotations in Deployments, or .spec.jobTemplate.metadata.annotations and .spec.jobTemplate.spec.template.metadata.annotations for CronJobs). This ensures that the GC controller does not unintentionally consider ConfigMaps/Secrets as “not in use” just because there isn’t a Pod referencing them anymore (e.g., they could still be used by a Deployment scaled down to 0).

ℹ️ For the last step, there is a helper function InjectAnnotations in the pkg/controller/garbagecollector/references package which you can use for your convenience.

Example:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-1234
  namespace: default
  labels:
    resources.gardener.cloud/garbage-collectable-reference: "true"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-5678
  namespace: default
  labels:
    resources.gardener.cloud/garbage-collectable-reference: "true"
---
apiVersion: v1
kind: Pod
metadata:
  name: example
  namespace: default
  annotations:
    reference.resources.gardener.cloud/configmap-82a3537f: test-5678
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
  terminationGracePeriodSeconds: 2

The GC controller would delete the ConfigMap/test-1234 because it is considered as not “in-use”.

ℹ️ If the GC controller is activated then the ManagedResource controller will no longer delete ConfigMaps/Secrets having the above label.

How to activate the garbage collector?

The GC controller can be activated by providing the --garbage-collector-sync-period flag with a value larger than 0 (e.g., 1h) to the Gardener Resource Manager.

TokenInvalidator

The Kubernetes community is slowly transitioning from static ServiceAccount token Secrets to ServiceAccount Token Volume Projection. Typically, when you create a ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: default

then the serviceaccount-token controller (part of kube-controller-manager) auto-generates a Secret with a static token:

apiVersion: v1
kind: Secret
metadata:
   annotations:
      kubernetes.io/service-account.name: default
      kubernetes.io/service-account.uid: 86e98645-2e05-11e9-863a-b2d4d086dd5a
   name: default-token-ntxs9
type: kubernetes.io/service-account-token
data:
   ca.crt: base64(cluster-ca-cert)
   namespace: base64(namespace)
   token: base64(static-jwt-token)

Unfortunately, when using ServiceAccount Token Volume Projection in a Pod, this static token is actually not used at all:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  serviceAccountName: default
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - mountPath: /var/run/secrets/tokens
      name: token
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 7200

While the Pod is now using an expiring and auto-rotated token, the static token is still generated and valid.

As of Kubernetes v1.22, there is neither a way of preventing the kube-controller-manager from generating such static tokens, nor a way to proactively remove or invalidate them.

Disabling the serviceaccount-token controller is an option; however, especially in the Gardener context, it may either break end-users or it may not even be possible to control such settings. Also, even if a future Kubernetes version supports native configuration of the above behaviour, Gardener still supports older versions which won’t get such features but need a solution as well.

This is where the TokenInvalidator comes into play: Since it is not possible to prevent kube-controller-manager from generating static ServiceAccount Secrets, the TokenInvalidator is - as its name suggests - just invalidating these tokens. It considers all such Secrets belonging to ServiceAccounts with .automountServiceAccountToken=false. By default, all namespaces in the target cluster are watched, however, this can be configured by specifying the --target-namespace flag.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-serviceaccount
automountServiceAccountToken: false

This will result in a static ServiceAccount token secret whose token value is invalid:

apiVersion: v1
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: my-serviceaccount
    kubernetes.io/service-account.uid: 86e98645-2e05-11e9-863a-b2d4d086dd5a
  name: my-serviceaccount-token-ntxs9
type: kubernetes.io/service-account-token
data:
  ca.crt: base64(cluster-ca-cert)
  namespace: base64(namespace)
  token: AAAA

Any attempt to regenerate the token or to create a new such secret will result in the component invalidating it again.

You can opt-out of this behaviour for ServiceAccounts setting .automountServiceAccountToken=false by labeling them with token-invalidator.resources.gardener.cloud/skip=true.
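For example:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-serviceaccount
  labels:
    token-invalidator.resources.gardener.cloud/skip: "true"
automountServiceAccountToken: false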

In order to enable the TokenInvalidator you have to set --token-invalidator-max-concurrent-workers to a value larger than 0.

The TokenInvalidator invalidates static ServiceAccount token Secrets in the Shoot cluster.

TokenRequestor

This controller provides the service to create and auto-renew tokens via the TokenRequest API.

It provides a functionality similar to the kubelet’s Service Account Token Volume Projection. It was created to handle the special case of issuing tokens to pods that run in a different cluster than the API server they communicate with (hence, using the native token volume projection feature is not possible).

The controller differentiates between the source cluster and the target cluster. The source cluster hosts the gardener-resource-manager pod. Secrets in this cluster are watched and modified by the controller. The target cluster can be configured to point to another cluster. The existence of ServiceAccounts is ensured and token requests are issued against the target. When the gardener-resource-manager is deployed next to the Shoot’s control plane in the Seed, the source cluster is the Seed while the target cluster points to the Shoot.

Reconciliation Loop

This controller reconciles secrets in all namespaces in the source cluster with the label: resources.gardener.cloud/purpose: token-requestor. See here for an example of the secret.

The controller ensures a ServiceAccount exists in the target cluster as specified in the annotations of the Secret in the source cluster:

serviceaccount.resources.gardener.cloud/name: <sa-name>
serviceaccount.resources.gardener.cloud/namespace: <sa-namespace>

The requested tokens will act with the privileges which are assigned to this ServiceAccount.

The controller will then request a token via the TokenRequest API and populate it into the .data.token field of the Secret in the source cluster.
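Putting the label and annotations together, a sketch of such a source cluster Secret could look like this (names are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: shoot-access-example
  namespace: shoot--foo--bar
  labels:
    resources.gardener.cloud/purpose: token-requestor
  annotations:
    serviceaccount.resources.gardener.cloud/name: example
    serviceaccount.resources.gardener.cloud/namespace: kube-system
type: Opaque
# after reconciliation, the controller populates .data.token with the requested token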

Alternatively, the client can provide a raw kubeconfig (in YAML or JSON format) via the Secret’s .data.kubeconfig field. The controller will then populate the requested token in the kubeconfig for the user used in the .current-context. For example, if .data.kubeconfig is

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: AAAA
    server: some-server-url
  name: shoot--foo--bar
contexts:
- context:
    cluster: shoot--foo--bar
    user: shoot--foo--bar-token
  name: shoot--foo--bar
current-context: shoot--foo--bar
kind: Config
preferences: {}
users:
- name: shoot--foo--bar-token
  user:
    token: ""

then the .users[0].user.token field of the kubeconfig will be updated accordingly.

The controller also adds an annotation to the Secret to keep track of when to renew the token before it expires. By default, the tokens are issued to expire after 12 hours. The expiration time can be set with the following annotation:

serviceaccount.resources.gardener.cloud/token-expiration-duration: 6h

It automatically renews once 80% of the lifetime is reached or after 24h.

Optionally, the controller can also populate the token into a Secret in the target cluster. This can be requested by annotating the Secret in the source cluster with

token-requestor.resources.gardener.cloud/target-secret-name: "foo"
token-requestor.resources.gardener.cloud/target-secret-namespace: "bar"

Overall, the TokenRequestor controller provides credentials with limited lifetime (JWT tokens) which are used by Shoot control plane components running in the Seed to talk to the Shoot API Server.


Webhooks

Auto-Mounting Projected ServiceAccount Tokens

When this webhook is activated, it automatically injects projected ServiceAccount token volumes into Pods and all their containers if all of the following preconditions are fulfilled:

  1. The Pod is NOT labeled with projected-token-mount.resources.gardener.cloud/skip=true.
  2. The Pod’s .spec.serviceAccountName field is NOT empty and NOT set to default.
  3. The ServiceAccount specified in the Pod’s .spec.serviceAccountName sets .automountServiceAccountToken=false.
  4. The Pod’s .spec.volumes[] DO NOT already contain a volume with a name prefixed with kube-api-access-.

The projected volume will look as follows:

spec:
  volumes:
  - name: kube-api-access-gardener
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 43200
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

The expirationSeconds are defaulted to 12h and can be overwritten with the --projected-token-mount-expiration-seconds flag, or with the projected-token-mount.resources.gardener.cloud/expiration-seconds annotation on a Pod resource.

The volume will be mounted into all containers specified in the Pod to the path /var/run/secrets/kubernetes.io/serviceaccount. This is the default location where client libraries expect to find the tokens and mimics the upstream ServiceAccount admission plugin, see this document for more information.

Overall, this webhook is used to inject projected service account tokens into pods running in the Shoot and the Seed cluster. Hence, it is served from the Seed GRM and each Shoot GRM.


12 - Scheduler

Gardener Scheduler

The Gardener Scheduler is in essence a controller that watches newly created shoots and assigns a seed cluster to them. Conceptually, the task of the Gardener Scheduler is very similar to the task of the Kubernetes Scheduler: finding a seed for a shoot instead of a node for a pod.

Either the scheduling strategy or the shoot cluster purpose hereby determines how the scheduler is operating. The following sections explain the configuration and flow in greater detail.

Why is the Gardener Scheduler needed?

1. Decoupling

Previously, an admission plugin in the Gardener API server made the scheduling decisions. This implied changes to the API server whenever adjustments of the scheduling were needed. Decoupling the API server and the scheduler comes with greater flexibility to develop these components independently from each other.

2. Extensibility

It should be possible to easily extend and tweak the scheduler in the future. Possibly, similar to the Kubernetes scheduler, hooks could be provided which influence the scheduling decisions. It should be also possible to completely replace the standard Gardener Scheduler with a custom implementation.

Algorithm overview

The following sequence describes the steps involved to determine a seed candidate:

  1. Determine usable seeds with “usable” defined as follows:
    • no .metadata.deletionTimestamp
    • .spec.settings.scheduling.visible is true
    • conditions Bootstrapped, GardenletReady, BackupBucketsReady (if available) are true
  2. Filter seeds:
    • matching .spec.seedSelector in CloudProfile used by the Shoot
    • matching .spec.seedSelector in Shoot
    • having no network intersection with the Shoot’s networks (due to the VPN connectivity between seeds and shoots their networks must be disjoint)
    • having .spec.settings.shootDNS.enabled=false (only if the shoot specifies a DNS domain or does not use the unmanaged DNS provider)
    • whose taints (.spec.taints) are tolerated by the Shoot (.spec.tolerations)
    • whose capacity for shoots would not be exceeded if the shoot is scheduled onto the seed, see Ensuring seeds capacity for shoots is not exceeded
  3. Apply active strategy e.g., Minimal Distance strategy
  4. Choose the least utilized seed, i.e., the one with the lowest number of shoot control planes; it is the winner and is written to the .spec.seedName field of the Shoot.

Configuration

The Gardener Scheduler configuration has to be supplied on startup. It is mandatory and also the only available flag. Here is an example scheduler configuration.

Most of the configuration options are the same as in the Gardener Controller Manager (leader election, client connection, …). However, the Gardener Scheduler does not need a TLS configuration, because there are currently no webhooks configurable.

Strategies

The scheduling strategy is defined in the candidateDeterminationStrategy of the scheduler’s configuration and can have the possible values SameRegion and MinimalDistance. The SameRegion strategy is the default strategy.

  1. Same Region strategy

    The Gardener Scheduler reads the .spec.provider.type and .spec.region fields from the Shoot resource. It tries to find a seed that has the identical .spec.provider.type and .spec.provider.region fields set. If it cannot find a suitable seed, it adds an event to the shoot stating that it is unschedulable.

  2. Minimal Distance strategy

    The Gardener Scheduler tries to find a valid seed with minimal distance to the shoot’s intended region. The distance is calculated based on the Levenshtein distance of the region names. For this purpose, the region name is split into a base name and an orientation. Possible orientations are north, south, east, west, and central. The distance then is twice the Levenshtein distance of the region’s base name plus a correction value based on the orientation and the provider.

    If the orientations of the shoot and the seed candidate match, the correction value is 0; if they differ, it is 2; and if either the seed’s or the shoot’s region does not have an orientation, it is 1. If the providers differ, the correction value is additionally incremented by 2.

    Because of this, a matching region with a matching provider is always preferred.
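    As a purely illustrative example: assume a shoot in region eu-north-1 and a seed candidate in region eu-central-1 on the same provider, and assume both names split into the same base name with the orientations north and central. The Levenshtein distance of the base names is then 0 and the orientations differ, so the distance is 2 * 0 + 2 = 2. A second candidate in eu-north-1 on the same provider would have distance 2 * 0 + 0 = 0 and would therefore be preferred.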

In order to put the scheduling decision into effect, the scheduler sends an update request for the Shoot resource to the API server. After validation, the Gardener Aggregated API server updates the shoot to have the spec.seedName field set. Subsequently, the Gardenlet picks up and starts to create the cluster on the specified seed.

Special handling based on shoot cluster purpose

Every shoot cluster can have a purpose that describes what the cluster is used for, and also influences how the cluster is setup (see this document for more information).

In case the shoot has the testing purpose, the scheduler only reads the .spec.provider.type field from the Shoot resource and tries to find a Seed that has the identical .spec.provider.type. The region does not matter, i.e., testing shoots may also be scheduled on a seed in a completely different region if it is better for balancing the whole Gardener system.

seedSelector field in the Shoot specification

Similar to the .spec.nodeSelector field in Pods, the Shoot specification has an optional .spec.seedSelector field. It allows the user to provide a label selector that must match the labels of Seeds in order to be scheduled to one of them. The labels on Seeds are usually controlled by Gardener administrators/operators - end users cannot add arbitrary labels themselves. If provided, the Gardener Scheduler will only consider those seeds as “suitable” whose labels match those provided in the .spec.seedSelector of the Shoot.

By default, only seeds with the same provider as the shoot are selected. By adding a providerTypes field to the seedSelector, a dedicated set of possible providers (* means all provider types) can be selected.
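For illustration, a Shoot using the seed selector could look like this (label keys and values are made up and would be defined by the Gardener operators):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  namespace: garden-dev
spec:
  seedSelector:
    matchLabels:
      environment: production
    providerTypes:
    - aws
  # ...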

Ensuring seeds capacity for shoots is not exceeded

Seeds have a practical limit of how many shoots they can accommodate. Exceeding this limit is undesirable as the system performance will be noticeably impacted. Therefore, the scheduler ensures that a seed’s capacity for shoots is not exceeded by taking into account a maximum number of shoots that can be scheduled onto a seed.

This mechanism works as follows:

  • The gardenlet is configured with certain resources and their total capacity (and, for certain resources, the amount reserved for Gardener), see /example/20-componentconfig-gardenlet.yaml. Currently, the only such resource is the maximum number of shoots that can be scheduled onto a seed.
  • The gardenlet seed controller updates the capacity and allocatable fields in Seed status with the capacity of each resource and how much of it is actually available to be consumed by shoots. The allocatable value of a resource is equal to capacity minus reserved.
  • When scheduling shoots, the scheduler filters out all candidate seeds whose allocatable capacity for shoots would be exceeded if the shoot is scheduled onto the seed.
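A sketch of the two sides of this mechanism could look as follows; the field names are assumptions based on the description above and the referenced example configuration, and the values are made up:

# gardenlet component configuration (sketch)
resources:
  capacity:
    shoots: 250     # total number of shoots this seed can accommodate
  reserved:
    shoots: 10      # amount reserved for Gardener

# resulting Seed status maintained by the gardenlet seed controller (sketch)
status:
  capacity:
    shoots: "250"
  allocatable:
    shoots: "240"   # capacity minus reserved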

Failure to determine a suitable seed

In case the scheduler fails to find a suitable seed, the operation is being retried with exponential backoff.

Current Limitation / Future Plans

  • Unfortunately, Azure has a geographically non-hierarchical naming pattern which does not start with the continent. This is the reason why we will exchange the implementation of the MinimalDistance strategy with a more suitable one in the future.

13 - Seed Admission Controller

Gardener Seed Admission Controller

The Gardener Seed admission controller is deployed by the Gardenlet as part of its seed bootstrapping phase and, consequently, runs in every seed cluster. Its main purpose is to serve webhooks (validating or mutating) in order to admit or deny certain requests to the seed’s API server.

What is it doing concretely?

Validating Webhooks

Unconfirmed Deletion Prevention

As part of Gardener’s extensibility concepts, a lot of CustomResourceDefinitions are deployed to the seed clusters that serve as extension points for provider-specific controllers. For example, the Infrastructure CRD triggers the provider extension to prepare the IaaS infrastructure of the underlying cloud provider for a to-be-created shoot cluster. Consequently, these extension CRDs have a lot of power and control large portions of the end-user’s shoot cluster. Accidental or undesired deletions of those resources can cause tremendous and hard-to-recover-from outages and should be prevented.

Together with the deployment of the Gardener seed admission controller, a ValidatingWebhookConfiguration for CustomResourceDefinitions and most (custom) resources in the extensions.gardener.cloud/v1alpha1 API group is registered. It prevents DELETE requests for those CustomResourceDefinitions labeled with gardener.cloud/deletion-protected=true, and for all mentioned custom resources if they were not previously annotated with confirmation.gardener.cloud/deletion=true. This prevents undesired kubectl delete <...> requests from being accepted.
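For example, a deletion of an Infrastructure resource would only be admitted after annotating it like this (name and namespace are illustrative):

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: example
  namespace: shoot--foo--bar
  annotations:
    confirmation.gardener.cloud/deletion: "true"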

Mutating Webhooks

The admission controller endpoint /webhooks/default-pod-scheduler-name/gardener-kube-scheduler mutates pods and sets gardener-kube-scheduler in .spec.schedulerName.

When the SeedKubeScheduler feature gate is enabled, all control plane components are mutated. The scheduler scores Nodes with the most resource usage higher than the rest, resulting in greater resource utilization.