Proposals

Gardener Enhancement Proposal (GEP)

Changes to the Gardener code base are often incorporated directly via pull requests which either contain a description of the motivation and scope of the change themselves or link a GitHub issue that does.

If a prospective feature is of bigger scope, requires the involvement of several parties, or needs more discussion before the actual implementation can be started, you may consider filing a pull request with a Gardener Enhancement Proposal (GEP) first.

GEPs are a means to propose a change or to add a feature to Gardener, help you describe the change(s) conceptually, and list the steps that are necessary to reach this goal. They help the Gardener maintainers as well as the community to understand the motivation and scope of your proposed change(s) and encourage contributions to discussions and future pull requests. If you are familiar with the Kubernetes community, GEPs are analogous to Kubernetes Enhancement Proposals (KEPs).

Reasons for a GEP

You may consider filing a GEP for the following reasons:

  • A Gardener architectural change is intended / necessary
  • Major changes to the Gardener code base
  • A phased implementation approach is expected because of the widespread scope of the change
  • Your proposed changes may be controversial

We encourage you to take a look at already merged GEPs since they give you a sense of what a typical GEP comprises.

Before creating a GEP

Before starting your work and creating a GEP, please take some time to familiarize yourself with our general Gardener Contribution Guidelines.

It is recommended to discuss and outline the motivation of your prospective GEP as a draft with the community before you invest the effort of creating the actual GEP. This early briefing helps the broader community understand your idea and gets you fast feedback from the respective experts. An appropriate format for this may be the regular Gardener community meetings.

How to file a GEP

GEPs should be created as Markdown .md files and are submitted through a GitHub pull request to their current home in docs/proposals. Please use the provided template or follow the structure of existing GEPs which makes reviewing easier and faster. Additionally, please link the new GEP in our documentation index.

If not already done, please present your GEP in the regular community meetings to brief the community about your proposal (we strive for personal communication :) ). Also consider that this may be an important step to raise awareness and understanding for everyone involved.

The GEP template contains a small set of metadata, which is helpful for keeping track of the enhancement in general and especially of who is responsible for implementing and reviewing PRs that are part of the enhancement.
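
As a rough illustration only (the exact field names are an assumption here; always follow the current template in docs/proposals), the metadata block at the top of a GEP might look like this:

---
title: My Gardener Enhancement
gep-number: 42              # hypothetical number, assigned when the GEP is filed
creation-date: 2023-01-01
status: implementable
authors:
- "@gardener-contributor"   # GitHub handle(s) of the author(s)
reviewers:
- "@main-reviewer"          # at least one main reviewer (see below)
---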

Main Reviewers

Apart from general metadata, the GEP should name at least one “main reviewer”. You can find a main reviewer for your GEP either when discussing the proposal in the community meeting, by asking in our Slack Channel or, at the latest, during the GEP PR review. New GEPs should only be accepted once at least one main reviewer is nominated/assigned.

The main reviewers are charged with the following tasks:

  • familiarizing themselves with the details of the proposal
  • reviewing the GEP PR itself and any further updates to the document
  • discussing design details and clarifying implementation questions with the author before and after the proposal was accepted
  • reviewing PRs related to the GEP in-depth

Other community members are of course also welcome to help the GEP author, review their work, and raise general concerns with the enhancement. Nevertheless, the main reviewers are supposed to focus on more in-depth reviews and to accompany the whole GEP process end-to-end, which results in more high-quality reviews and faster feedback cycles than having more people looking at the process with lower priority and less focus.

GEP Process

  1. Pre-discussions about GEP (if necessary)
  2. Find a main reviewer for your enhancement
  3. GEP is filed through GitHub PR
  4. Presentation in Gardener community meeting (if possible)
  5. Review of GEP from maintainers/community
  6. GEP is merged if accepted
  7. Implementation of GEP
  8. Consider keeping the GEP up-to-date in case the implementation differs significantly

1 - 01 Extensibility

Gardener extensibility and extraction of cloud-specific/OS-specific knowledge (#308, #262)

Summary

Gardener has evolved to a large compound of packages containing lots of highly specific knowledge which makes it very hard to extend (supporting a new cloud provider, new OS, …, or behave differently depending on the underlying infrastructure).

This proposal aims to move out the cloud-specific implementations (called “(cloud) botanists”) and the OS-specifics into dedicated controllers, and simultaneously to allow deviation from the standard Gardener deployment.

Motivation

Currently, it is too hard to support additional cloud providers or operating systems/distributions as everything must be done in-tree, which might affect the implementation of other cloud providers as well. The various conditions and branches make the code hard to maintain and hard to test. Every change must be done centrally, requires completely rebuilding Gardener, and cannot be deployed individually. Similar to the motivation for Kubernetes to extract their cloud-specifics into dedicated cloud-controller-managers or to extract the container/storage/network/… specifics into CRI/CSI/CNI/…, we aim to do the same right now.

Goals

  • Gardener does not contain any cloud-specific knowledge anymore but defines a clear contract allowing external controllers (botanists) to support different environments (AWS, Azure, GCP, …).
  • Gardener does not contain any operating system-specific knowledge anymore but defines a clear contract allowing external controllers to support different operating systems/distributions (CoreOS, SLES, Ubuntu, …).
  • It shall become much easier to move control planes of Shoot clusters between Seed clusters (#232) which is a necessary requirement of an automated setup for the Gardener Ring (#233).

Non-Goals

  • We want to also factor out the specific knowledge of the addon deployments (nginx-ingress, kubernetes-dashboard, …), but we already have dedicated projects/issues for that: https://github.com/gardener/bouquet and #246. We will keep the addons in-tree as part of this proposal and tackle their extraction separately.
  • We do not want to make Gardener a plain workflow engine that just executes a given template (which would indeed allow for maximal genericity, openness, and extensibility, but which would end up building a “programming/scripting language” inside a serialization format (YAML/JSON/…)). Rather, we want to have well-defined contracts and APIs, keeping Gardener responsible for managing the clusters.

Proposal

Gardener heavily relies on and implements Kubernetes principles, and its ultimate strategy is to use Kubernetes wherever applicable. The extension concept in Kubernetes is based on (among others) CustomResourceDefinitions, ValidatingWebhookConfigurations and MutatingWebhookConfigurations, and InitializerConfigurations. Consequently, Gardener’s extensibility concept relies on these mechanisms.

Instead of implementing all aspects directly, Gardener will deploy some CRDs to the Seed cluster which will be watched by dedicated controllers (also running in the Seed clusters), each one implementing one aspect of cluster management. This way, one complex, strongly coupled Gardener implementation covering all infrastructures is decomposed into a set of loosely coupled controllers implementing aspects of APIs defined by Gardener. Gardener will just wait until the controllers report that they are done (or have faced an error) in the CRD’s .status field instead of doing the respective tasks itself. We will have one specific CRD for every specific operation (e.g., DNS, infrastructure provisioning, machine cloud config generation, …). However, there are also parts inside Gardener which can be handled generically (not by cloud botanists) because they are the same or very similar for all environments. One example is the deployment of the Namespace in the Seed which will run the Shoot’s control plane. Another one is the deployment of a Service for the Shoot’s kube-apiserver. In case a cloud botanist needs to cooperate and react on those operations, it should register a ValidatingWebhookConfiguration, a MutatingWebhookConfiguration, or an InitializerConfiguration. With this approach it can validate, modify, or react on any resource created by Gardener to make it cloud-infrastructure specific.

The webhooks should be registered with failurePolicy=Fail to ensure that a request made by Gardener fails if the respective webhook is not available.
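
A minimal sketch of such a registration, here for an AWS botanist reacting on the kube-apiserver Service (all names, the namespace, and the service endpoint are assumptions for illustration):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-botanist-kube-apiserver-service
webhooks:
- name: services.aws.botanist.gardener.cloud   # hypothetical webhook name
  failurePolicy: Fail                          # an unavailable webhook fails Gardener's request
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["services"]
  clientConfig:
    service:
      name: aws-botanist                       # hypothetical Service exposing the webhook in the Seed
      namespace: extension-aws
      path: /webhooks/service
    caBundle: base64(ca-bundle)
  admissionReviewVersions: ["v1"]
  sideEffects: None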

Modification of existing CloudProfile and Shoot resources

We will introduce the new API group gardener.cloud:

CloudProfiles

---
apiVersion: gardener.cloud/v1alpha1
kind: CloudProfile
metadata:
  name: aws
spec:
  type: aws
# caBundle: |
#   -----BEGIN CERTIFICATE-----
#   ...
#   -----END CERTIFICATE-----
  dnsProviders:
  - type: aws-route53
  - type: unmanaged
  kubernetes:
    versions:
    - 1.12.1
    - 1.11.0
    - 1.10.5
  machineTypes:
  - name: m4.large
    cpu: "2"
    gpu: "0"
    memory: 8Gi
  # storage: 20Gi   # optional (not needed in every environment, may only be specified if no volumeTypes have been specified)
  ...
  volumeTypes:      # optional (not needed in every environment, may only be specified if no machineType has a `storage` field)
  - name: gp2
    class: standard
  - name: io1
    class: premium
  providerConfig:
    apiVersion: aws.cloud.gardener.cloud/v1alpha1
    kind: CloudProfileConfig
    constraints:
      minimumVolumeSize: 20Gi
      machineImages:
      - name: coreos
        regions:
        - name: eu-west-1
          ami: ami-32d1474b
        - name: us-east-1
          ami: ami-e582d29f
      zones:
      - region: eu-west-1
        zones:
        - name: eu-west-1a
          unavailableMachineTypes: # list of machine types defined above that are not available in this zone
          - name: m4.large
          unavailableVolumeTypes:  # list of volume types defined above that are not available in this zone
          - name: gp2
        - name: eu-west-1b
        - name: eu-west-1c

Shoots

apiVersion: gardener.cloud/v1alpha1
kind: Shoot
metadata:
  name: johndoe-aws
  namespace: garden-dev
spec:
  cloudProfileName: aws
  secretBindingName: core-aws
  cloud:
    type: aws
    region: eu-west-1
    providerConfig:
      apiVersion: aws.cloud.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      networks:
        vpc: # specify either 'id' or 'cidr'
        # id: vpc-123456
          cidr: 10.250.0.0/16
        internal:
        - 10.250.112.0/22
        public:
        - 10.250.96.0/22
        workers:
        - 10.250.0.0/19
      zones:
      - eu-west-1a
    workerPools:
    - name: pool-01
    # Taints, labels, and annotations are not yet implemented. This requires interaction with the machine-controller-manager, see
    # https://github.com/gardener/machine-controller-manager/issues/174. It is only mentioned here as future proposal.
    # taints:
    # - key: foo
    #   value: bar
    #   effect: PreferNoSchedule
    # labels:
    # - key: bar
    #   value: baz
    # annotations:
    # - key: foo
    #   value: hugo
      machineType: m4.large
      volume: # optional, not needed in every environment, may only be specified if the referenced CloudProfile contains the volumeTypes field
        type: gp2
        size: 20Gi
      providerConfig:
        apiVersion: aws.cloud.gardener.cloud/v1alpha1
        kind: WorkerPoolConfig
        machineImage:
          name: coreos
          ami: ami-d0dcef3
        zones:
        - eu-west-1a
      minimum: 2
      maximum: 2
      maxSurge: 1
      maxUnavailable: 0
  kubernetes:
    version: 1.11.0
    ...
  dns:
    provider: aws-route53
    domain: johndoe-aws.garden-dev.example.com
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: true
  backup:
    schedule: "*/5 * * * *"
    maximum: 7
  addons:
    kube2iam:
      enabled: false
    kubernetes-dashboard:
      enabled: true
    cluster-autoscaler:
      enabled: true
    nginx-ingress:
      enabled: true
      loadBalancerSourceRanges: []
    kube-lego:
      enabled: true
      email: john.doe@example.com

ℹ The specifications for the other cloud providers Gardener already has an implementation for look similar.

CRD definitions and workflow adaptation

In the following we outline the CRD definitions which define the API between Gardener and the dedicated controllers. After that we take a look at the current reconciliation/deletion flow and describe how it would look if we implemented this proposal.

Custom resource definitions

Every CRD has a .spec.type field containing the respective instance of the dimension the CRD represents, e.g. the cloud provider, the DNS provider, or the operating system name. Moreover, the .status field must contain

  • observedGeneration (int64), a field indicating the generation the controller last worked on.
  • state (*runtime.RawExtension), a field which is not interpreted by Gardener but persisted; it should be treated opaque and only be used by the respective CRD-specific controller (it can store anything it needs to re-construct its own state).
  • lastError (object), a field which is optional and only present if the last operation ended with an error state.
  • lastOperation (object), a field which always exists and which indicates what the last operation of the controller was.
  • conditions (list), a field allowing the controller to report health checks for its area of responsibility.

Some CRDs might have a .spec.providerConfig or a .status.providerStatus field containing controller-specific information that is treated opaque by Gardener and will only be copied to dependent or depending CRDs.

DNS records

Every Shoot needs two DNS records (or three, depending on whether nginx-ingress addon is enabled), one so-called “internal” record that Gardener uses in the kubeconfigs of the Shoot cluster’s system components, and one so-called “external” record which is used in the kubeconfig provided to the user.

---
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSProvider
metadata:
  name: alicloud
  namespace: default
spec:
  type: alicloud-dns
  secretRef:
    name: alicloud-credentials
  domains:
    include:
    - my.own.domain.com
---
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  name: dns
  namespace: default
spec:
  dnsName: dns.my.own.domain.com
  ttl: 600
  targets:
  - 8.8.8.8
status:
  observedGeneration: 4
  state: some-state
  lastError:
    lastUpdateTime: 2018-04-04T07:08:51Z
    description: some-error message
    codes:
    - ERR_UNAUTHORIZED
  lastOperation:
    lastUpdateTime: 2018-04-04T07:24:51Z
    progress: 70
    type: Reconcile
    state: Processing
    description: Currently provisioning ...
  conditions:
  - lastTransitionTime: 2018-07-11T10:18:25Z
    message: DNS record has been created and is available.
    reason: RecordResolvable
    status: "True"
    type: Available
    propagate: false
  providerStatus:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: DNSStatus
    ...

Infrastructure provisioning

The Infrastructure CRD contains the information about VPCs, networks, security groups, availability zones, …; basically, everything that needs to be prepared before actual VMs/load balancers/… can be provisioned.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--core--aws-01
spec:
  type: aws
  providerConfig:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    networks:
      vpc:
        cidr: 10.250.0.0/16
      internal:
      - 10.250.112.0/22
      public:
      - 10.250.96.0/22
      workers:
      - 10.250.0.0/19
    zones:
    - eu-west-1a
  dns:
    apiserver: api.aws-01.core.example.com
  region: eu-west-1
  secretRef:
    name: my-aws-credentials
  sshPublicKey: |
        base64(key)
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  providerStatus:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureStatus
    vpc:
      id: vpc-1234
      subnets:
      - id: subnet-acbd1234
        name: workers
        zone: eu-west-1
      securityGroups:
      - id: sg-xyz12345
        name: workers
    iam:
      nodesRoleARN: <some-arn>
      instanceProfileName: foo
    ec2:
      keyName: bar

Backup infrastructure provisioning

The BackupInfrastructure CRD in the Seeds tells the cloud-specific controller to prepare a blob store bucket/container which can later be used to store etcd backups.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupInfrastructure
metadata:
  name: etcd-backup
  namespace: shoot--core--aws-01
spec:
  type: aws
  region: eu-west-1
  storageContainerName: asdasjndasd-1293912378a-2213
  secretRef:
    name: my-aws-credentials
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Cloud config (user-data) for bootstrapping machines

Gardener will continue to keep knowledge about the content of the cloud config scripts, but it will hand it over to the respective OS-specific controller which will generate the specific valid representation. Gardener creates two MachineCloudConfig CRDs, one for the cloud-config-downloader (which will later flow into the WorkerPool CRD) and one for the real cloud-config (which will be stored as a Secret in the Shoot’s kube-system namespace, and downloaded and executed by the cloud-config-downloader on the machines).

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: MachineCloudConfig
metadata:
  name: pool-01-downloader
  namespace: shoot--core--aws-01
spec:
  type: CoreOS
  units:
  - name: cloud-config-downloader.service
    command: start
    enable: true
    content: |
      [Unit]
      Description=Downloads the original cloud-config from Shoot API Server and executes it
      After=docker.service docker.socket
      Wants=docker.socket
      [Service]
      Restart=always
      RestartSec=30
      EnvironmentFile=/etc/environment
      ExecStart=/bin/sh /var/lib/cloud-config-downloader/download-cloud-config.sh      
  files:
  - path: /var/lib/cloud-config-downloader/credentials/kubeconfig
    permissions: 0644
    content:
      secretRef:
        name: cloud-config-downloader
        dataKey: kubeconfig
  - path: /var/lib/cloud-config-downloader/download-cloud-config.sh
    permissions: 0644
    content:
      inline:
        encoding: b64
        data: IyEvYmluL2Jhc2ggL...
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  cloudConfig: | # base64-encoded
    #cloud-config

    coreos:
      update:
        reboot-strategy: off
      units:
      - name: cloud-config-downloader.service
        command: start
        enable: true
        content: |
          [Unit]
          Description=Downloads the original cloud-config from Shoot API Server and execute it
          After=docker.service docker.socket
          Wants=docker.socket
          [Service]
          Restart=always
          RestartSec=30
          ...          

ℹ The cloud-config-downloader script does not only download the cloud-config initially but at regular intervals, e.g., every 30s. If it sees an updated cloud-config, it applies it again by reloading and restarting all systemd units in order to reflect the changes. The way this reloading of the cloud-config happens is OS-specific as well and no longer known to Gardener, however, it must already be part of the script. On CoreOS, you have to execute /usr/bin/coreos-cloudinit --from-file=<path>, whereas on SLES you execute cloud-init --file <path> single -n write_files --frequency=once. As Gardener doesn’t know these commands, it will write a placeholder expression instead (e.g., {RELOAD-CLOUD-CONFIG-WITH-PATH:<path>}) and the OS-specific controller is asked to replace it with the proper expression.
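
To illustrate, a minimal sketch of how such a placeholder could appear in the downloader script content and what a CoreOS-specific controller might substitute (path and script content are assumptions, not part of the contract):

  files:
  - path: /var/lib/cloud-config-downloader/download-cloud-config.sh
    permissions: 0644
    content:
      inline:
        data: |
          #!/bin/bash
          # ... download the cloud-config to /var/lib/cloud-config-downloader/cloud-config ...
          # Gardener only writes the OS-agnostic placeholder:
          {RELOAD-CLOUD-CONFIG-WITH-PATH:/var/lib/cloud-config-downloader/cloud-config}
          # A CoreOS-specific controller would replace the placeholder with, e.g.:
          # /usr/bin/coreos-cloudinit --from-file=/var/lib/cloud-config-downloader/cloud-config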

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: MachineCloudConfig
metadata:
  name: pool-01-original # stored as secret and downloaded later
  namespace: shoot--core--aws-01
spec:
  type: CoreOS
  units:
  - name: docker.service
    drop-ins:
    - name: 10-docker-opts.conf
      content: |
        [Service]
        Environment="DOCKER_OPTS=--log-opt max-size=60m --log-opt max-file=3"        
  - name: docker-monitor.service
    command: start
    enable: true
    content: |
      [Unit]
      Description=Docker-monitor daemon
      After=kubelet.service
      [Service]
      Restart=always
      EnvironmentFile=/etc/environment
      ExecStart=/opt/bin/health-monitor docker      
  - name: kubelet.service
    command: start
    enable: true
    content: |
      [Unit]
      Description=kubelet daemon
      Documentation=https://kubernetes.io/docs/admin/kubelet
      After=docker.service
      Wants=docker.socket rpc-statd.service
      [Service]
      Restart=always
      RestartSec=10
      EnvironmentFile=/etc/environment
      ExecStartPre=/bin/docker run --rm -v /opt/bin:/opt/bin:rw k8s.gcr.io/hyperkube:v1.11.2 cp /hyperkube /opt/bin/
      ExecStartPre=/bin/sh -c 'hostnamectl set-hostname $(cat /etc/hostname | cut -d '.' -f 1)'
      ExecStart=/opt/bin/hyperkube kubelet \
          --allow-privileged=true \
          --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig-bootstrap \
          ...      
  files:
  - path: /var/lib/kubelet/ca.crt
    permissions: 0644
    content:
      secretRef:
        name: ca-kubelet
        dataKey: ca.crt
  - path: /var/lib/cloud-config-downloader/download-cloud-config.sh
    permissions: 0644
    content:
      inline:
        encoding: b64
        data: IyEvYmluL2Jhc2ggL...
  - path: /etc/sysctl.d/99-k8s-general.conf
    permissions: 0644
    content:
      inline:
        data: |
          vm.max_map_count = 135217728
          kernel.softlockup_panic = 1
          kernel.softlockup_all_cpu_backtrace = 1
          ...          
  - path: /opt/bin/health-monitor
    permissions: 0755
    content:
      inline:
        data: |
          #!/bin/bash
          set -o nounset
          set -o pipefail

          function docker_monitoring {
          ...          
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  cloudConfig: ...

Cloud-specific controllers which might need to add another kernel option or another flag to the kubelet, maybe even another file to the disk, can register a MutatingWebhookConfiguration to that resource and modify it upon creation/update. The task of the MachineCloudConfig controller is to only generate the OS-specific cloud-config based on the .spec field, but not to add or change any logic related to Shoots.

Worker pools definition

For every worker pool defined in the Shoot, Gardener will create a WorkerPool CRD which shall be picked up by a cloud-specific controller and be translated into MachineClasses and MachineDeployments.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: WorkerPool
metadata:
  name: pool-01
  namespace: shoot--core--aws-01
spec:
  cloudConfig: base64(downloader-cloud-config)
  infrastructureProviderStatus:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureStatus
    vpc:
      id: vpc-1234
      subnets:
      - id: subnet-acbd1234
        name: workers
        zone: eu-west-1
      securityGroups:
      - id: sg-xyz12345
        name: workers
    iam:
      nodesRoleARN: <some-arn>
      instanceProfileName: foo
    ec2:
      keyName: bar
  providerConfig:
    apiVersion: aws.cloud.gardener.cloud/v1alpha1
    kind: WorkerPoolConfig
    machineImage:
      name: CoreOS
      ami: ami-d0dcef3b
    machineType: m4.large
    volumeType: gp2
    volumeSize: 20Gi
    zones:
    - eu-west-1a
  region: eu-west-1
  secretRef:
    name: my-aws-credentials
  minimum: 2
  maximum: 2
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
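
For illustration, the cloud-specific controller might translate the above WorkerPool into machine-controller-manager resources roughly like the following sketch (the class kind, names, and labels are assumptions, not part of the contract defined by this proposal):

---
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
  name: pool-01
  namespace: shoot--core--aws-01
spec:
  replicas: 2                  # derived from the WorkerPool's minimum/maximum
  selector:
    matchLabels:
      name: pool-01
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        name: pool-01
    spec:
      class:
        kind: AWSMachineClass  # generated from providerConfig/infrastructureProviderStatus
        name: pool-01-class    # hypothetical name of the generated machine class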

Generic resources

Some components are cloud-specific and must be deployed by the cloud-specific botanists. Others might need to deploy another pod next to the Shoot’s control plane or perform other tasks. Some of these might be important for a functional cluster (e.g., the cloud-controller-manager, or a CSI plugin in the future), and controllers should be able to report errors back to the user. Consequently, in order to trigger the controllers to deploy these components, Gardener would write a Generic CRD to the Seed. No operation depends on the status of these resources, however, the entire reconciliation flow does.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Generic
metadata:
  name: cloud-components
  namespace: shoot--core--aws-01
spec:
  type: cloud-components
  secretRef:
    name: my-aws-credentials
  shootSpec:
    ...
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Shoot state

In order to enable moving the control plane of a Shoot between Seed clusters (e.g., if a Seed cluster is no longer available or entirely broken), Gardener must store some non-reconstructable state, potentially also the state written by the controllers. Gardener watches these extension CRDs and copies their .status.state into a ShootState resource in the Garden cluster. Any observed status change of the respective CRD-controllers must be immediately reflected in the ShootState resource. The contract between Gardener and those controllers is: Every controller must be capable of reconstructing its own environment based on both the state it has written before and the real world’s conditions/state.

---
apiVersion: gardener.cloud/v1alpha1
kind: ShootState
metadata:
  name: shoot--core--aws-01
shootRef:
  name: aws-01
  project: core
state:
  secrets:
  - name: ca
    data: ...
  - name: kube-apiserver-cert
    data: ...
  resources:
  - kind: DNS
    name: record-1
    state: <copied-state-of-dns-crd>
  - kind: Infrastructure
    name: networks
    state: <copied-state-of-infrastructure-crd>
  ...
  <other fields required to keep track of>

We cannot assume that Gardener is always online to observe the most recent states the controllers have written to their resources. Consequently, the information stored here must not be used as “single point of truth”, but the controllers must potentially check the real world’s status to reconstruct themselves. However, this must anyway be part of their normal reconciliation logic and is a general best practice for Kubernetes controllers.

Shoot health checks/conditions

Some of the existing conditions already contain specific code which shall be simplified as well. All of the CRDs described above have a .status.conditions field to which the controllers may write relevant health information for their functional area. Gardener will pick them up and copy them over to the Shoot’s .status.conditions (only those conditions setting propagate=true).

Reconciliation flow

We now examine the current Shoot creation/reconciliation flow and describe how it could look when applying this proposal:

  • botanist.DeployNamespace: Gardener creates the namespace for the Shoot in the Seed cluster.
  • botanist.DeployKubeAPIServerService: Gardener creates a Service of type LoadBalancer in the Seed. The AWS Botanist registers a Mutating Webhook and adds its AWS-specific annotation.
  • botanist.WaitUntilKubeAPIServerServiceIsReady: Gardener checks the .status object of the just created Service in the Seed. The contract is that also clouds not supporting load balancers must react on the Service object and modify the .status to correctly reflect the kube-apiserver’s ingress IP.
  • botanist.DeploySecrets: Gardener creates the secrets/certificates it needs like it does today, but it provides utility functions that can be adopted by Botanists/other controllers if they need additional certificates/secrets created on their own. (We should also add labels to all secrets.)
  • botanist.Shoot.Components.DNS.Internal{Provider/Entry}.Deploy: Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS controller picks it up and creates a corresponding DNS record (see CRD specification above).
  • botanist.Shoot.Components.DNS.External{Provider/Entry}.Deploy: Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS controller picks it up and creates a corresponding DNS record (see CRD specification above).
  • shootCloudBotanist.DeployInfrastructure: Gardener creates an Infrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job (see CRD above).
  • botanist.DeployBackupInfrastructure: Gardener creates a BackupInfrastructure resource in the Garden cluster. (The BackupInfrastructure controller creates a BackupInfrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job; see CRD above.)
  • botanist.WaitUntilBackupInfrastructureReconciled: Gardener checks the .status object of the just created BackupInfrastructure resource.
  • hybridBotanist.DeployETCD: Gardener only deploys the etcd StatefulSet without any backup-restore sidecar. The cloud-specific Botanist registers a Mutating Webhook and adds the backup-restore sidecar, and it also creates the Secret needed by the backup-restore sidecar.
  • botanist.WaitUntilEtcdReady: Gardener checks the .status object of the etcd StatefulSet and waits until readiness is indicated.
  • hybridBotanist.DeployCloudProviderConfig: Gardener does not execute this anymore because it doesn’t know anything about cloud-specific configuration.
  • hybridBotanist.DeployKubeAPIServer: Gardener only deploys the kube-apiserver Deployment without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-apiserver to run in its cloud environment.
  • hybridBotanist.DeployKubeControllerManager: Gardener only deploys the kube-controller-manager Deployment without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-controller-manager to run in its cloud environment (e.g., the cloud-config).
  • hybridBotanist.DeployKubeScheduler: Gardener only deploys the kube-scheduler Deployment without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-scheduler to run in its cloud environment.
  • hybridBotanist.DeployCloudControllerManager: Gardener does not execute this anymore because it doesn’t know anything about cloud-specific configuration. The Botanists would now be responsible for deploying their own cloud-controller-manager. They would watch for the kube-apiserver Deployment to exist, and as soon as it does, they deploy the CCM. (Side note: The Botanist would also be responsible for deploying further controllers needed for this cloud environment, e.g. F5 controllers or CSI plugins.)
  • botanist.WaitUntilKubeAPIServerReady: Gardener checks the .status object of the kube-apiserver Deployment and waits until readiness is indicated.
  • botanist.InitializeShootClients: Unchanged; Gardener creates a Kubernetes client for the Shoot cluster.
  • botanist.DeployMachineControllerManager: Deleted; Gardener no longer deploys the MCM itself. See below.
  • hybridBotanist.ReconcileMachines: Gardener creates a WorkerPool CRD in the Seed, and the responsible WorkerPool controller picks it up and does its job (see CRD above). It also deploys the machine-controller-manager. Gardener waits until the status indicates that the controller is done.
  • hybridBotanist.DeployKubeAddonManager: This function also computes the CoreOS cloud-config (because the secret storing it is managed by the kube-addon-manager). Gardener would deploy the CloudConfig-specific CRD in the Seed, and the responsible OS controller picks it up and does its job (see CRD above). The Botanists which would have to modify something would register a Webhook for this CloudConfig-specific resource and apply their changes. The rest is mostly unchanged: Gardener generates the manifests for the addons and deploys the kube-addon-manager into the Seed. The AWS Botanist registers a Webhook for nginx-ingress. The Azure Botanist registers a Webhook for calico. Gardener will no longer deploy the StorageClasses; instead, the Botanists wait until the kube-apiserver is available and deploy them. In the long term we want to get rid of optional addons inside the Gardener core and implement a sophisticated addon concept (see #246).
  • shootCloudBotanist.DeployKube2IAMResources: This function would be removed (currently Gardener would execute a Terraform job creating the IAM roles specified in the Shoot manifest). We cannot keep this behavior; the user would be responsible for creating the needed IAM roles on their own.
  • botanist.Shoot.Components.Nginx.DNSEntry: Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS controller picks it up and creates a corresponding DNS record (see CRD specification above).
  • botanist.WaitUntilVPNConnectionExists: Unchanged; Gardener checks that it is possible to port-forward to a Shoot pod.
  • seedCloudBotanist.ApplyCreateHook: This function would be removed (actually, only the AWS Botanist implements it). The AWS Botanist deploys the aws-lb-readvertiser once the API server is deployed and updates the ELB health check protocol once the load balancer pointing to the API server is created.
  • botanist.DeploySeedMonitoring: Unchanged; Gardener deploys the monitoring stack into the Seed.
  • botanist.DeployClusterAutoscaler: Unchanged; Gardener deploys the cluster-autoscaler into the Seed.

ℹ We can easily lift the contract later and allow dynamic network plugins or not using the VPN solution at all. We could also introduce a dedicated ControlPlane CRD and leave the complete responsibility of deploying kube-apiserver, kube-controller-manager, etc. to other controllers (if we need it at some point in time).

Deletion flow

We now examine the current Shoot deletion flow and describe shortly how it could look when applying this proposal:

  • botanist.DeploySecrets: This just refreshes the cloud provider secret in the Shoot namespace in the Seed (in case the user has changed it before triggering the deletion). This function would stay as it is.
  • hybridBotanist.RefreshMachineClassSecrets: This function would disappear. The WorkerPool controller needs to watch the referenced secret and update the generated MachineClass secrets immediately.
  • hybridBotanist.RefreshCloudProviderConfig: This function would disappear. The Botanist needs to watch the referenced secret and update the generated cloud-provider-config immediately.
  • botanist.RefreshCloudControllerManagerChecksums: See “hybridBotanist.RefreshCloudProviderConfig”.
  • botanist.RefreshKubeControllerManagerChecksums: See “hybridBotanist.RefreshCloudProviderConfig”.
  • botanist.InitializeShootClients: Unchanged; Gardener creates a Kubernetes client for the Shoot cluster.
  • botanist.DeleteSeedMonitoring: Unchanged; Gardener deletes the monitoring stack.
  • botanist.DeleteKubeAddonManager: Unchanged; Gardener deletes the kube-addon-manager.
  • botanist.DeleteClusterAutoscaler: Unchanged; Gardener deletes the cluster-autoscaler.
  • botanist.WaitUntilKubeAddonManagerDeleted: Unchanged; Gardener waits until the kube-addon-manager is deleted.
  • botanist.CleanCustomResourceDefinitions: Unchanged; Gardener cleans the CRDs in the Shoot.
  • botanist.CleanKubernetesResources: Unchanged; Gardener cleans all remaining Kubernetes resources in the Shoot.
  • hybridBotanist.DestroyMachines: Gardener deletes the WorkerPool-specific CRD in the Seed, and the responsible WorkerPool controller picks it up and does its job. Gardener waits until the CRD is deleted.
  • shootCloudBotanist.DestroyKube2IAMResources: This function would disappear (currently Gardener would execute a Terraform job deleting the IAM roles specified in the Shoot manifest). We cannot keep this behavior; the user would be responsible for deleting the needed IAM roles on their own.
  • shootCloudBotanist.DestroyInfrastructure: Gardener deletes the Infrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job. Gardener waits until the CRD is deleted.
  • botanist.Shoot.Components.DNS.External{Provider/Entry}.Destroy: Gardener deletes the DNS-specific CRD in the Seed, and the responsible DNS controller picks it up and does its job. Gardener waits until the CRD is deleted.
  • botanist.DeleteKubeAPIServer: Unchanged; Gardener deletes the kube-apiserver.
  • botanist.DeleteBackupInfrastructure: Unchanged; Gardener deletes the BackupInfrastructure object in the Garden cluster. (The BackupInfrastructure controller deletes the BackupInfrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job. The BackupInfrastructure controller waits until the CRD is deleted.)
  • botanist.Shoot.Components.DNS.Internal{Provider/Entry}.Destroy: Gardener deletes the DNS-specific CRD in the Seed, and the responsible DNS controller picks it up and does its job. Gardener waits until the CRD is deleted.
  • botanist.DeleteNamespace: Unchanged; Gardener deletes the Shoot namespace in the Seed cluster.
  • botanist.WaitUntilSeedNamespaceDeleted: Unchanged; Gardener waits until the Shoot namespace in the Seed has been deleted.
  • botanist.DeleteGardenSecrets: Unchanged; Gardener deletes the kubeconfig/ssh-keypair Secret in the project namespace in the Garden.

Gardenlet

One part of the whole extensibility work will also be to further split Gardener itself. Inspired by Kubernetes itself, we plan to move the Shoot reconciliation/deletion controller loops as well as the BackupInfrastructure reconciliation/deletion controller loops into a dedicated “gardenlet” component that will run in the Seed cluster. With that, it can talk locally to the responsible kube-apiserver and we no longer need to perform every operation out of the Garden cluster. This approach will also help us with scalability, performance, maintainability, and testability in general.

This architectural change implies that the Kubernetes API server of the Garden cluster must be exposed publicly (or at least be reachable by the registered Seeds). The Gardener controller-manager will remain and will keep its CloudProfile, SecretBinding, Quota, Project, and Seed controller loops. One part of the seed controller could be to deploy the “gardenlet” into the Seeds, however, this would require network connectivity to the Seed cluster.

Shoot control plane movement/migration

Automatically moving control planes is difficult with the current implementation as some resources created in the old Seed must be moved to the new one. However, some of them are not under Gardener’s control (e.g., Machine resources). Moreover, the old control plane must be deactivated somehow to ensure that no two controllers work on the same things (e.g., virtual machines) from different environments.

Gardener does not only deploy a DNS controller into the Seeds but also into its own Garden cluster. For every Shoot cluster, Gardener commissions it to create a DNS TXT record containing the name of the Seed responsible for the Shoot (holding the control plane), e.g.

dig -t txt aws-01.core.garden.example.com

...
;; ANSWER SECTION:
aws-01.core.garden.example.com. 120 IN	TXT "Seed=seed-01"
...

Gardener always keeps the DNS record up-to-date based on which Seed is responsible.

In the above CRD examples one object in the .spec section was omitted as it is needed to get Shoot control plane movement/migration working (the field is only explained now in this section and not before; it was omitted on purpose to support focusing on the relevant specifications first). Every CRD also has the following section in its .spec:

leadership:
  record: aws-01.core.garden.example.com
  value: seed-01
  leaseSeconds: 60

Before every operation the CRD-controllers check this DNS record (based on the .spec.leadership.leaseSeconds configuration) and verify that its result is equal to the .spec.leadership.value field. If both match they know that they should act on the resource, otherwise they stop doing anything.

ℹ We will provide an easy-to-use framework for the controllers containing all of these features out-of-the-box in order to allow the developers to focus on writing the actual controller logic.

When a Seed control plane move is triggered, the .spec.cloud.seed field of the respective Shoot is changed. Gardener will change the respective DNS record’s value (aws-01.core.garden.example.com) to contain the new Seed name. After that it will wait 2*60s to be sure that all controllers have observed the change. Then it starts reconciling and applying the CRDs together with a preset .status.state into the new Seed (based on its last observations which were stored in the respective ShootState object stored in the Garden cluster). The controllers are - as per contract - asked to reconstruct their own environment based on the .status.state they have written before and the real world’s status. Apart from that, the normal reconciliation flow gets executed.

Gardener stores the list of Seeds that were responsible for hosting a Shoot’s control plane at some point in time in the Shoot’s .status.seeds list so that it knows which Seeds must be cleaned up (i.e., where the control plane must be deleted because it has been moved). Once cleaned up, the Seed’s name will be removed from that list.
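
As a rough sketch (the exact field layout is assumed here for illustration only), the Shoot status during such a migration could look like this:

status:
  seeds:        # Seeds that hosted this Shoot's control plane at some point in time
  - seed-01     # old Seed; removed from the list once its control plane has been cleaned up
  - seed-02     # new Seed the control plane was moved to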

BackupInfrastructure migration

One part of the reconciliation flow above is the provisioning of the infrastructure for the Shoot’s etcd backups (usually, this is a blob store bucket/container). Gardener already uses a separate BackupInfrastructure resource that is written into the Garden cluster and picked up by a dedicated BackupInfrastructure controller (bundled into the Gardener controller manager). This dedicated resource exists mainly to allow keeping backups for a certain “grace period” even after the Shoot deletion itself:

apiVersion: gardener.cloud/v1alpha1
kind: BackupInfrastructure
metadata:
  name: aws-01-bucket
  namespace: garden-core
spec:
  seed: seed-01
  shootUID: uuid-of-shoot

The actual provisioning is executed in a corresponding Seed cluster, as Gardener can only assume network connectivity to the underlying cloud environment in the Seed. We would like to keep the created artifacts (e.g., the Terraform state) in the Seed, near the control plane. Consequently, when Gardener moves a control plane, it will also update the .spec.seed field of the BackupInfrastructure resource. With the exact same logic described above, the BackupInfrastructure controller inside Gardener will perform the move to the new Seed.

Registration of external controllers at Gardener

We want to have a dynamic registration process, i.e. we don’t want to hard-code any information about which controllers shall be deployed. The ideal solution would be to not even require a restart of Gardener when a new controller registers.

Every controller is registered by a ControllerRegistration resource that introduces the controller together with its supported resources (dimension (kind) and shape (type) combinations) to Gardener:

apiVersion: gardener.cloud/v1alpha1
kind: ControllerRegistration
metadata:
  name: dns-aws-route53
spec:
  resources:
  - kind: DNS
    type: aws-route53
# deployment:
#   type: helm
#   providerConfig:
#     chart.tgz: base64(helm-chart)
#     values.yaml: |
#       foo: bar

Every .kind/.type combination may only exist once in the system.

When a Shoot shall be reconciled, Gardener can identify, based on the referenced Seed and the content of the Shoot specification, which controllers are needed in the respective Seed cluster. It will instruct the operators in the Garden cluster to deploy the controllers they are responsible for to a specific Seed. This kind of communication happens via CRDs as well:

apiVersion: gardener.cloud/v1alpha1
kind: ControllerInstallation
metadata:
  name: dns-aws-route53
spec:
  registrationRef:
    name: dns-aws-route53
  seedRef:
    name: seed-01
status:
  conditions:
  - lastTransitionTime: 2018-08-07T15:09:23Z
    message: The controller has been successfully deployed to the seed.
    reason: ControllerDeployed
    status: "True"
    type: Available

The default scenario is that every controller gets deployed by a dedicated operator that knows how to handle its lifecycle operations like deployment, update, upgrade, and deletion. This operator watches ControllerInstallation resources and reacts on those it is responsible for (that it has created earlier). Gardener is responsible for writing the .spec field, the operator is responsible for providing information in the .status indicating whether the controller was successfully deployed and is ready to be used. Gardener will also be able to ask for deletion of controllers from Seeds when they are not needed there anymore by deleting the corresponding ControllerInstallation object.

ℹ The provided easy-to-use framework for the controllers will also contain these needed features to implement corresponding operators.

For most cases the controller deployment is very simple (just deploying it into the seed with some static configuration). In these cases it would be unnecessary effort to require yet another component (the operator) that deploys the controller. To simplify this situation, Gardener will be able to react on ControllerInstallations whose referenced ControllerRegistration specifies .spec.deployment.type=helm. Such a controller would be registered with a ControllerRegistration resource that contains a Helm chart with all resources needed to deploy this controller into a seed (plus some static values). Gardener would render the Helm chart and deploy the resources into the seed. It will not react if .spec.deployment.type!=helm, which allows using any other deployment mechanism as well. Controllers that are deployed by operators would not specify the .spec.deployment section in the ControllerRegistration at all.
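
For example, a registration using this built-in Helm-based deployment could look like the following (this simply fills in the optional deployment section shown commented out above; values are placeholders):

apiVersion: gardener.cloud/v1alpha1
kind: ControllerRegistration
metadata:
  name: dns-aws-route53
spec:
  resources:
  - kind: DNS
    type: aws-route53
  deployment:
    type: helm
    providerConfig:
      chart.tgz: base64(helm-chart)   # rendered by Gardener and applied to the seed
      values.yaml: |
        foo: bar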

ℹ Any controller requiring dynamic configuration values (e.g., based on the cloud provider or the region of the seed) must be installed with the operator approach.

Other cloud-specific parts

The Gardener API server has a few admission controllers that also contain cloud-specific code. We have to replace these parts as well.

Defaulting and validation admission plugins

Right now, the admission controllers inside the Gardener API server do perform a lot of validation and defaulting of fields in the Shoot specification. The cloud-specific parts of these admission controllers will be replaced by mutating admission webhooks that will get called instead. As we will have a dedicated operator running in the Garden cluster anyway it will also get the responsibility to register this webhook if it needs to validate/default parts of the Shoot specification.

Example: The .spec.cloud.workerPools[*].providerConfig.machineImage field in the new Shoot manifest mentioned above could be omitted by the user and would get defaulted by the cloud-specific operator.
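
A minimal sketch of that defaulting, assuming the AWS operator picks the image from the CloudProfile for the Shoot's region (values are illustrative):

# Submitted by the user (machineImage omitted):
providerConfig:
  apiVersion: aws.cloud.gardener.cloud/v1alpha1
  kind: WorkerPoolConfig
  zones:
  - eu-west-1a

# After the operator's mutating webhook has defaulted it:
providerConfig:
  apiVersion: aws.cloud.gardener.cloud/v1alpha1
  kind: WorkerPoolConfig
  machineImage:
    name: coreos
    ami: ami-32d1474b   # taken from the CloudProfile entry for eu-west-1
  zones:
  - eu-west-1a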

DNS Hosted Zone admission plugin

For the same reasons the existing DNS Hosted Zone admission plugin will be removed from the Gardener core and moved into the responsibility of the respective DNS-specific operators running in the Garden cluster.

Shoot Quota admission plugin

The Shoot quota admission plugin validates create or update requests on Shoots and checks that the specified machine/storage configuration is defined as per referenced Quota objects. The cloud-specifics in this controller are no longer needed as the CloudProfile and the Shoot resource have been adapted: The machine/storage configuration is no longer in cloud-specific sections but hard-wired fields in the general Shoot specification (see example resources above). The quota admission plugin will be simplified and remains in the Gardener core.

Shoot maintenance controller

Every Shoot cluster can define a maintenance time window in which Gardener will update the Kubernetes patch version (if enabled) and the used machine image version in the Shoot resource. While the Kubernetes version is not part of the providerConfig section of the CloudProfile resource, the machineImage field is, and thus Gardener can no longer understand it. In the future Gardener has to rely on the cloud-specific operator (probably the same one doing the defaulting/validation mentioned before) to update this field. In the maintenance time window the maintenance controller will update the Kubernetes patch version (if enabled) and add a trigger.gardener.cloud=maintenance annotation to the Shoot resource. The already registered mutating webhook will call the operator, who has to remove this annotation and update the machineImage in the .spec.cloud.workerPools[*].providerConfig sections.
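
A minimal sketch of the trigger as it would appear on the Shoot (the operator removes the annotation again after bumping the machine image):

apiVersion: gardener.cloud/v1alpha1
kind: Shoot
metadata:
  name: johndoe-aws
  namespace: garden-dev
  annotations:
    trigger.gardener.cloud: maintenance   # set by the maintenance controller, removed by the operator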

Alternatives

  • Alternative to the DNS approach for Shoot control plane movement/migration: We have thought about rotating the credentials when a move is triggered, which would make all controllers ineffective immediately. However, one problem with this is that we would require IAM privileges for the user’s infrastructure account, which might not be desired. Another, more complicated problem is that we cannot assume API access in order to create technical users for all cloud environments that might be supported.

2 - 02 Backupinfra

Backup Infrastructure CRD and Controller Redesign

Goal

  • As an operator, I would like to efficiently use the backup bucket for multiple clusters, thereby limiting the total number of buckets required.
  • As an operator, I would like to be able to use a different cloud provider for backup bucket provisioning than the cloud provider used for the seed infrastructure.
  • Have seed-independent backups, so that we can easily migrate a shoot from one seed to another.
  • Execute the backup operations (including bucket creation and deletion) from a seed, because network connectivity may only be ensured from the seeds (not necessarily from the garden cluster).
  • Preserve the garden cluster as the source of truth (no information is missing in the garden cluster to reconstruct the state of the backups even if seed and shoots are lost completely).
  • Do not violate the infrastructure limits with regard to blob store limits/quotas.

Motivation

Currently, every shoot cluster has its own etcd backup bucket with a centrally configured retention period. With the growing number of clusters, we will soon run out of the bucket quota limits on the cloud providers. Moreover, even after clusters are deleted, their backup buckets continue to exist for the configured retention period. Hence, there is a need to minimize the total count of buckets.

In addition, we currently use the seed infrastructure credentials to provision the bucket for etcd backups. This ties the backup bucket provider to the seed infrastructure provider.

Terminology

  • Bucket: Equivalent to an S3 bucket, ABS container, GCS bucket, Swift container, or Alicloud bucket.
  • Object: Equivalent to an S3 object, ABS blob, GCS object, Swift object, or Alicloud object, i.e. a snapshot/backup of etcd in the object store.
  • Directory: Object stores have no real concept of a directory, but a common prefix separated by / is usually treated as one; the term folder is used synonymously.
  • deletionGracePeriod: The grace period or retention period for which backups will be persisted after deletion of the shoot.

Current Spec:

# BackupInfrastructure spec
kind: BackupInfrastructure
spec:
  seed: seedName
  shootUID: shoot.status.uid

Current naming conventions

SeedNamespace: shoot--projectname--shootname
seed: seedname
ShootUID: shoot.status.UID
BackupInfraName: seedNamespace+sha(uid)[:5]
BackupBucketName: BackupInfraName
BackupNamespace: backup--BackupInfraName

Proposal

With the Gardener extension proposal in mind, the backup infrastructure controller can be divided into two parts. There will be four backup-infrastructure-related CRDs in total: two on the garden apiserver and two on the seed cluster. Before going into the workflow, let's first have a look at the CRDs.

CRD on Garden cluster

To give a brief overview before going into the details: we stick to the fact that the garden apiserver is always the source of truth. Since the backup infrastructure is maintained beyond the deletion of a shoot, the information about it must always come from the garden apiserver. Therefore, we will continue to have a BackupInfra resource on the garden apiserver, with some modifications.

apiVersion: garden.cloud/v1alpha1
kind: BackupBucket
metadata:
  name: packet-region1-uid[:5]
  # No namespace needed. This will be cluster scope resource.
  ownerReferences:
  - kind: CloudProfile
    name: packet
spec:
  provider: aws
  region: eu-west-1
  secretRef: # Required for root
    name: backup-operator-aws
    namespace: garden
status:
  lastOperation: ...
  observedGeneration: ...
  seed: ...

apiVersion: garden.cloud/v1alpha1
kind: BackupEntry
metadata:
  name: shoot--dev--example--3ef42 # Naming convention explained before
  namespace: garden-dev
  ownerReferences:
  - apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: false
    controller: true
    kind: Shoot
    name: example
    uid: 19a9538b-5058-11e9-b5a6-5e696cab3bc8
spec:
  shootUID: 19a9538b-5058-11e9-b5a6-5e696cab3bc8 # Just for reference to find back associated shoot.
  # Following section comes from cloudProfile or seed yaml based on granularity decision.
  bucketName: packet-region1-uid[:5]
status:
  lastOperation: ...
  observedGeneration: ...
  seed: ...

CRD on Seed cluster

Considering the extension proposal, we want the individual components to be handled by controllers inside the seed cluster. Therefore, we will have backup-related resources in the registered seed clusters as well.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupBucket
metadata:
  name: packet-random[:5]
  # No namespace needed. This will be a cluster-scoped resource.
spec:
  type: aws
  region: eu-west-1
  secretRef:
    name: backup-operator-aws
    namespace: backup-garden
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

There are two reasons for introducing the BackupEntry resource:

  1. The cloud provider specific code moves completely to the seed cluster.
  2. Network connectivity issues are handled by moving the deletion part to the backup extension controller in the seed cluster.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupEntry
metadata:
  name: shoot--dev--example--3ef42 # Naming convention explained later
  # No namespace needed. This will be a cluster-scoped resource.
spec:
  type: aws
  region: eu-west-1
  secretRef: # Required for root
    name: backup-operator-aws
    namespace: backup-garden
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Workflow

  • The Gardener administrator will configure the CloudProfile with backup infrastructure credentials and provider config as follows:
# CloudProfile.yaml:
spec:
  backup:
    provider: aws
    region: eu-west-1
    secretRef:
      name: backup-operator-aws
      namespace: garden

Here, the CloudProfile controller will interpret this spec as follows:

  • If spec.backup is nil:
    • No backup for any shoot.
  • If spec.backup.region is not nil:
    • Respect it, i.e. use the provider and the unique region field mentioned there for the BackupBucket.
    • Preferably, the spec.backup.region field will be unique; a cross-provider region doesn’t make much sense, since region names differ between providers.
  • Otherwise, if spec.backup.region is nil:
    • In the same-provider case, i.e. spec.backup.provider = spec.(type-of-provider) or nil:
      • For each region from spec.(type-of-provider).constraints.regions, create a BackupBucket instance (see the sketch after this list). This can be done lazily, i.e. create the BackupBucket instance for a region only once a seed actually spawned in that region has been registered. This avoids creating an IaaS bucket for a region that is listed in the cloudProfile but has no registered seed.
      • The shoot controller will choose the backup container as per the seed region. (With shoot control plane migration, the seed’s availability zone might change, but the region will remain the same as per the current scope.)
    • Otherwise, in the cross-provider case, i.e. spec.backup.provider != spec.(type-of-provider):
      • Report a validation error, since, for example, we can’t expect spec.backup.provider = aws to support a region from spec.packet.constraints.regions (where the type-of-provider is packet).
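
To illustrate the same-provider case above, here is a minimal, purely illustrative cloudProfile excerpt (field layout mirrors the example above): spec.backup.region is omitted and the backup provider matches the profile’s provider, so one BackupBucket would be created lazily per used region.

# CloudProfile.yaml (illustrative excerpt)
spec:
    aws:
        constraints:
            regions:
            - eu-west-1
            - us-east-1
    backup:
        provider: aws # same provider as spec.aws, region intentionally omitted
        secretRef:
            name: backup-operator-aws
            namespace: garden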

The following diagram represents the overall flow in detail:

sequence-diagram

Reconciliation

Reconciliation of a BackupEntry in the seed cluster mostly comes into the picture at deletion time. But we can add initialization steps, like the creation of a shoot-specific directory in the backup bucket. Alternatively, we could simply create the BackupEntry only at shoot deletion time.

Deletion

  • On shoot deletion, the BackupEntry instance, i.e. the shoot-specific instance, will get a deletion timestamp because of the ownerReference.
  • Once the deletionGracePeriod configured in the GCM component configuration (sketched below) has expired, the BackupInfrastructure controller will delete the associated backup folder from the backup object store.
  • Finally, it will remove the finalizer from the BackupEntry instance.
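
A minimal sketch of where such a grace period could live in the GCM component configuration, mirroring the quota example further below (the field name deletionGracePeriodHours is an assumption, not a fixed API):

apiVersion: controllermanager.config.gardener.cloud/v1alpha1
kind: ControllerManagerConfiguration
controllers:
  backupInfrastructure:
    deletionGracePeriodHours: 24 # assumed field name; backup folders are deleted only after this period has passed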

Alternative

sequence-diagram

Discussion points / variations

Manual vs dynamic bucket creation

  • As per the limits observed on different cloud providers, we can have a single bucket for the backups on one cloud provider. So, we could avoid the small amount of complexity introduced in the above approach by pre-provisioning buckets as part of the landscape setup. But then there won’t be anybody to detect the bucket’s existence and reconcile it. Ideally, this should be avoided.

  • Another option is to let the administrator register a pool of root backup infra resources and let the controller schedule backups on one of them.

  • One more variation could be to create buckets dynamically per hash of the shoot UID.

SDK vs Terraform

The initial reason for going with Terraform scripts was their stability and the parallelism/concurrency they provide in resource creation. For the backup infrastructure, the Terraform scripts are very minimal right now; they simply contain a bucket creation script. With the shared bucket logic, we might, if possible, want to isolate access at the directory level, but again that is one additional call. So, we will prefer switching to the SDK for all object store operations.

Limiting the number of shoots per bucket

Again, as per the limits observed on different cloud providers, we can have a single bucket for the backups on one cloud provider. But if we want to limit the number of shoots associated with a bucket, we can have a central map of configuration in gardener-controller-component-configuration.yaml, where we mark the supported count of shoots per cloud provider. The most probable place could be controller.backupInfrastructures.quota. If the limit is reached, we can create a new BackupBucket instance.

e.g.

apiVersion: controllermanager.config.gardener.cloud/v1alpha1
kind: ControllerManagerConfiguration
controllers:
  backupInfrastructure:
    quota:
      - provider: aws
        limit: 100 # Number mentioned here are random, just for example purpose.
      - provider: azure
        limit: 80
      - provider: openstack
        limit: 100
      ...

Backward compatibility

Migration

  • Create shoot specific folder.
  • Transfer old objects.
  • Create a manifest of objects in the new bucket
    • Each entry will have status: None, Copied, NotFound.
    • Copy objects one by one.
  • Scale down etcd-main with old config. ⚠️ Cluster down time
  • Copy remaining objects
  • Scale up etcd-main with new config.
  • Destroy the old bucket and the old backup namespace. This can be immediate or, preferably, a lazy deletion.

backup-migration-sequence-diagram

Legacy Mode alternative

  • If a Backup namespace is present in the seed cluster, then follow the legacy approach,
  • i.e. reconcile the creation/existence of the shoot-specific bucket and backup namespace.
  • If the backup namespace is not created, use the shared bucket.
  • Limitation: We never know when the existing clusters will be deleted, and hence this might be a little difficult to maintain across the next Gardener releases. It might look simple and straightforward for now, but may become a pain point in the future if, in the worst case, we have to change the design again because of new use cases or refactoring. Also, even after multiple Gardener releases we won’t be able to remove the deprecated existing BackupInfrastructure CRD.

References

3 - 03 Networking Extensibility

Network Extensibility

Currently Gardener follows a mono network-plugin support model (i.e., Calico). Although this can seem to be the more stable approach, it does not completely reflect the real use-case. This proposal brings forth an effort to add an extra level of customizability to Gardener networking.

Motivation

Gardener is an open-source project that provides a nested user model. Basically, there are two types of services provided by Gardener to its users:

  • Managed: users only request a Kubernetes cluster (Clusters-as-a-Service)
  • Hosted: users utilize Gardener to provide their own managed version of Kubernetes (Cluster-Provisioner-as-a-service)

For the first set of users, the choice of network plugin might not be so important; however, for the second class of users (i.e., Hosted), it is important to be able to customize networking based on their needs.

Furthermore, Gardener provisions clusters on different cloud providers with different networking requirements. For example, Azure does not support Calico Networking [1]; this leads to the introduction of manual exceptions in static add-on charts, which is error prone and can lead to failures during upgrades.

Finally, every provider is different, and thus the network always needs to adapt to the needs of the infrastructure to provide better performance. Consistency does not necessarily lie in the implementation but in the interface.

Gardener Network Extension

The goal of the Gardener Network Extensions is to support different network plugins. Therefore, the specification for the network resource won’t be fixed and will be customized based on the underlying network plugin. To do so, a providerConfig field in the spec will be provided, in which each plugin defines its own NetworkConfig. Below is an example for deploying Calico as the cluster network plugin.

Long Term Spec

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Network
metadata:
  name: calico-network
  namespace: shoot--core--test-01
spec:
  type: calico
  clusterCIDR: 192.168.0.0/24
  serviceCIDR:  10.96.0.0/24
  providerConfig:
    apiVersion: calico.extensions.gardener.cloud/v1alpha1
    kind: NetworkConfig
    ipam:
      type: host-local
      cidr: usePodCIDR
    backend: bird
    typha:
      enabled: true
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  providerStatus:
    apiVersion: calico.extensions.gardener.cloud/v1alpha1
    kind: NetworkStatus
    components:
      kubeControllers: true
      calicoNodes: true
    connectivityTests:
      pods: true
      services: true
    networkModules:
      arp_proxy: true
    config:
      clusterCIDR: 192.168.0.0/24
      serviceCIDR:  10.96.0.0/24
      ipam:
        type: host-local
        cidr: usePodCIDR

First Implementation (Short Term)

As an initial implementation, the network plugin type will be specified by the user, e.g. calico (without further configuration in the provider spec). This will then be used to generate the Network resource in the seed. The Network operator will pick it up and apply the configuration, based on the spec.cloudProvider specified, directly to the shoot or via the Gardener resource manager (still in the works).

The cloudProvider field in the spec is just an initial catalyst and is not meant to stay long-term. In the future, the network provider configuration will be customized to best match the needs of the infrastructure.

Here is what the simplified initial spec would look like:

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Network
metadata:
  name: calico-network
  namespace: shoot--core--test-01
spec:
  type: calico
  cloudProvider: {aws,azure,...}
status:
  observedGeneration: 2
  lastOperation: ...
  lastError: ...

Functionality

The Network resource needs to be created early on during cluster provisioning. Once it is created, the Network operator residing in every seed will create all the necessary networking resources and apply them to the shoot cluster.

The status of the Network resource should reflect the health of the networking components as well as additional tests if required.

References

[1] Azure support for Calico Networking

4 - 05 Versioning Policy

Gardener Versioning Policy

Please refer to this document for the documentation of the implementation of this GEP.

Goal

  • As a Garden operator I would like to define a clear Kubernetes version policy, which informs my users about deprecated or expired Kubernetes versions.
  • As a user of Gardener, I would like to get information about which Kubernetes version is supported for how long. I want to be able to get this information via the API (cloudprofile) and also in the Dashboard.

Motivation

The Kubernetes community releases minor versions roughly every three months and usually maintains three minor versions (the actual and the last two) with bug fixes and security updates. Patch releases are done more frequently. Operators of Gardener should be able to define their own Kubernetes version policy. This GEP suggests the possibility for operators to classify Kubernetes versions, while they are going through their “maintenance life-cycle”.

Kubernetes Version Classifications

An operator should be able to classify Kubernetes versions differently while they go through their “maintenance life-cycle”, starting with preview, supported, deprecated, and finally expired. This information should be programmatically available in the cloudprofiles of the Garden cluster as well as in the Dashboard. Please also note that Gardener keeps the control plane and the workers on the same Kubernetes version.

For further explanation of the possible classifications, we assume that an operator wants to support four minor versions e.g. v1.16, v1.15, v1.14 and v1.13.

  • preview: After a fresh release of a new Kubernetes minor version (e.g. v1.17.0) the operator could tag it as preview until he has gained sufficient experience. It will not become the default in the Gardener Dashboard until he promotes that minor version to supported, which could happen a few weeks later with the first patch version.

  • supported: The operator would tag the latest Kubernetes patch versions of the actual (if not still in preview) and the last three minor Kubernetes versions as supported (e.g. v1.16.1, v1.15.4, v1.14.9 and v1.13.12). The latest of these becomes the default in the Gardener Dashboard (e.g. v1.16.1).

  • deprecated: The operator could decide that he generally wants to classify every version that is not the latest patch version as deprecated and flag these versions accordingly (e.g. v1.16.0 and older, v1.15.3 and older, v1.14.8 and older, as well as v1.13.11 and older). He could also tag all versions (latest or not) of every Kubernetes minor release that is neither the actual nor one of the last three minor Kubernetes versions as deprecated, too (e.g. v1.12.x and older). Deprecated versions will eventually expire (i.e., be removed).

  • expired: This state is a logical state only. It doesn’t have to be maintained in the cloudprofile. All cluster versions whose expirationDate as defined in the cloudprofile is expired, are automatically in this logical state. After that date has passed, users cannot create new clusters with that version anymore and any cluster that is on that version will be forcefully migrated in its next maintenance time window, even if the owner has opted out of automatic cluster updates! The forceful update will pick the latest patch version of the current minor Kubernetes version. If the cluster was already on that latest patch version and the latest patch version is also expired, it will continue with latest patch version of the next minor Kubernetes version, so it will result in an update of a minor Kubernetes version, which is potentially harmful to your workload, so you should avoid that/plan ahead! If that’s expired as well, the update process repeats until a non-expired Kubernetes version is reached, so depending on the circumstances described above, it can happen that the cluster receives multiple consecutive minor Kubernetes version updates!

To fulfill their specific versioning policy, Garden operators should be able to classify the versions as well as set the expiration dates in the cloudprofiles. Users should see these classifiers, as well as the expiration dates, in the Dashboard.
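
A minimal sketch of how such classifications and expiration dates could look in a cloudprofile (the field names are illustrative assumptions, not a fixed API):

spec:
  kubernetes:
    versions:
    - version: 1.17.0
      classification: preview
    - version: 1.16.1
      classification: supported
    - version: 1.16.0
      classification: deprecated
      expirationDate: "2020-04-05T08:00:00Z" # after this date the version is logically "expired"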

5 - 06 Etcd Druid

Integrating etcd-druid with Gardener

Etcd is currently deployed by garden-controller-manager as a Statefulset. The sidecar container spec contains details pertaining to cloud-provider object-store which is injected into the statefulset via a mutable webhook running as part of the gardener extension story. This approach restricts the operations on etcd such as scale-up and upgrade. Etcd-druid will eliminate the need to hijack statefulset creation to add cloudprovider details. It has been designed to provide an intricate control over the procedure of deploying and maintaining etcd. The roadmap for etcd-druid can be found here.

This document explains how Gardener deploys etcd and what resources it creates for etcd-druid to deploy an etcd cluster.

Resources required by etcd-druid (created by Gardener)

  • Secret containing credentials to access backup bucket in Cloud provider object store.
  • TLS server and client secrets for etcd and backup-sidecar
  • Etcd CRD resource that contains parameters pertaining to etcd, backup-sidecar and cloud-provider object store.

When an etcd resource is created in the cluster, the druid acts on it by creating an etcd statefulset, a service and a configmap containing etcd bootstrap script. The secrets containing the infrastructure credentials and the TLS certificates are mounted as volumes. If no secret/information regarding backups is stated then etcd data backups are not taken. Only data corruption checks are performed prior to starting etcd.

Garden-controller-manager, being cloud agnostic, deploys the etcd resource. This will not contain any cloud-specific information other than the cloud provider. The extension controller that contains the cloud-specific implementation to create the backup bucket will create it if needed, and create a secret containing the credentials to access the bucket. The etcd backup secret name should be exposed in the BackupEntry status. Then, Gardener can read it and write it into the Etcd resource. The secret will have to be made available in the namespace the etcd statefulset will be deployed in. If etcd and backup-sidecar communicate over TLS, then the CA certificates, server and client certificates, and keys will have to be made available in that namespace as well. The etcd resource will have references to these aforementioned secrets. etcd-druid will deploy the statefulset only if the secrets are available.
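
For illustration only, such an Etcd resource could reference the backup and TLS secrets roughly as sketched below (the authoritative schema is defined by etcd-druid; the field names here are assumptions):

apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
  namespace: shoot--foo--bar
spec:
  replicas: 1
  etcd:
    tls: # references to the CA/server/client TLS secrets mentioned above (assumed field names)
      serverTLSSecretRef:
        name: etcd-server-tls
      clientTLSSecretRef:
        name: etcd-client-tls
  backup:
    store: # cloud-provider object store details (assumed field names)
      provider: aws
      secretRef:
        name: etcd-backup # credentials secret created by the extension controller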

Workflow

  • etcd-druid will be deployed and etcd CRD will be created as part of the seed bootstrap.
  • Garden-controller-manager creates backupBucket extension resource. Extension controller creates the backup bucket associated with the seed.
  • Garden-controller-manager creates backupentry associated with each shoot in the seed namespace.
  • Garden-controller-manager creates etcd resource with secretRefs and etcd information populated appropriately.
  • etcd-druid acts on the etcd resource; druid creates the statefulset, the service and the configmap.

etcd-druid

6 - 07 Shoot Control Plane Migration

Shoot Control Plane Migration

Motivation

Currently moving the control plane of a shoot cluster can only be done manually and requires deep knowledge of how exactly to transfer the resources and state from one seed to another. This can make it slow and prone to errors.

Automatic migration can be very useful in a couple of scenarios:

  • A seed goes down and can’t be repaired (fast enough or at all) and its control planes need to be brought to another seed
  • Seed needs to be changed, but this operation requires the recreation of the seed (e.g. turn a single-AZ seed into a multi-AZ seed)
  • Seeds need to be rebalanced
  • New seeds become available in a region closer to/in the region of the workers and the control plane should be moved there to improve latency
  • Gardener ring, which is a self-supporting setup/underlay for a highly available (usually cross-region) Gardener deployment

Goals

  • Provide a mechanism to migrate the control plane of a shoot cluster from one seed to another
  • The mechanism should support migration from a seed which is no longer reachable (Disaster Recovery)
  • The shoot cluster nodes are preserved and continue to run the workload, but will talk to the new control plane after the migration completes
  • Extension controllers implement a mechanism which allows them to store their state or to be restored from an already existing state on a different seed cluster.
  • The already existing shoot reconciliation flow is reused for migration with minimum changes

Terminology

Source Seed is the seed which currently hosts the control plane of a Shoot Cluster

Destination Seed is the seed to which the control plane is being migrated

Resources and controller state which have to be migrated between two seeds:

Note: The following lists are just FYI and are meant to show the current resources which need to be moved to the Destination Seed

Secrets

Gardener has preconfigured lists of needed secrets which are generated when a shoot is created and deployed in the seed. Following is a minimum set of secrets which must be migrated to the Destination Seed. Other secrets can be regenerated from them.

  • ca
  • ca-front-proxy
  • static-token
  • ca-kubelet
  • ca-metrics-server
  • etcd-encryption-secret
  • kube-aggregator
  • kube-apiserver-basic-auth
  • kube-apiserver
  • service-account-key
  • ssh-keypair

Custom Resources and state of extension controllers

Gardenlet deploys custom resources in the Source Seed cluster during shoot reconciliation which are reconciled by extension controllers. The state of these controllers and any additional resources they create is independent of the gardenlet and must also be migrated to the Destination Seed. Following is a list of custom resources, and the state which is generated by them that has to be migrated.

  • BackupBucket: nothing relevant for migration
  • BackupEntry: nothing relevant for migration
  • ControlPlane: nothing relevant for migration
  • DNSProvider/DNSEntry: nothing relevant for migration
  • Extensions: migration of state needs to be handled individually
  • Infrastructure: terraform state
  • Network: nothing relevant for migration
  • OperatingSystemConfig: nothing relevant for migration
  • Worker: Machine-Controller-Manager related objects: machineclasses, machinedeployments, machinesets, machines

This list depends on the currently installed extensions and can change in the future

Proposal

Custom Resource on the garden cluster

The Garden cluster has a new Custom Resource which is stored in the project namespace of the Shoot called ShootState. It contains all the required data described above so that the control plane can be recreated on the Destination Seed.

This data is separated into two sections. The first is generated by the gardenlet and then either used to generate new resources (e.g. secrets) or is directly deployed to the Shoot’s control plane on the Destination Seed.

The second is generated by the extension controllers in the seed.

apiVersion: core.gardener.cloud/v1alpha1
kind: ShootState
metadata:
  name: my-shoot
  namespace: garden-core
  ownerReference:
    apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Shoot
    name: my-shoot
    uid: ...
  finalizers:
  - gardener
gardenlet:
  secrets:
  - name: ca
    data:
      ca.crt: ...
      ca.key: ...
  - name: ssh-keypair
    data:
      id_rsa: ...
  - name:
...
extensions:
- kind: Infrastructure
  state: ... (Terraform state)
- kind: ControlPlane
  purpose: normal
  state: ... (Certificates generated by the extension)
- kind: Worker
  state: ... (Machine objects)

The state data is saved as a runtime.RawExtension type, which can be encoded/decoded by the corresponding extension controller.

There can be sensitive data in the ShootState which has to be hidden from the end-users. Hence, it will be recommended to provide an etcd encryption configuration to the Gardener API server in order to encrypt the ShootState resource.

Size limitations

There are limits on the size of the request bodies sent to the kubernetes API server when creating or updating resources: by default ETCD can only accept request bodies which do not exceed 1.5 MiB (this can be configured with the --max-request-bytes flag); the kubernetes API Server has a request body limit of 3 MiB which cannot be set from the outside (with a command line flag); the gRPC configuration used by the API server to talk to ETCD has a limit of 2 MiB per request body which cannot be configured from the outside; and watch requests have a 16 MiB limit on the buffer used to stream resources.

This means that if ShootState is bigger than 1.5 MiB, the ETCD max request bytes will have to be increased. However, there is still an upper limit of 2 MiB imposed by the gRPC configuration.

If ShootState exceeds this size limitation it must make use of configmap/secret references to store the state of extension controllers. This is an implementation detail of Gardener and can be done at a later time if necessary as extensions will not be affected.

Splitting the ShootState into multiple resources could have a positive benefit on performance as the Gardener API Server and Gardener Controller Manager would handle multiple small resources instead of one big resource.

Gardener extensions changes

All extension controllers which require state migration must save their state in a new status.state field and act on an annotation gardener.cloud/operation=restore in the respective Custom Resources which should trigger a restoration operation instead of reconciliation. A restoration operation means that the extension has to restore its state in the Shoot’s namespace on the Destination Seed from the status.state field.

As an example: the Infrastructure resource must save the terraform state.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foo--bar
spec:
  type: azure
  region: eu-west-1
  secretRef:
    name: cloudprovider
    namespace: shoot--foo--bar
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    resourceGroup:
      name: mygroup
    networks:
      vnet: # specify either 'name' or 'cidr'
      # name: my-vnet
        cidr: 10.250.0.0/16
      workers: 10.250.0.0/19
status:
  state: |
      {
          "version": 3,
          "terraform_version": "0.11.14",
          "serial": 2,
          "lineage": "3a1e2faa-e7b6-f5f0-5043-368dd8ea6c10",
          "modules": [
              {
              }
          ]
          ...
      }

Extensions which do not require state migration should set status.state=nil in their Custom Resources and trigger a normal reconciliation operation if the CR contains the gardener.cloud/operation=restore annotation.

Similar to the contract for the reconcile operation, the extension controller has to remove the restore annotation after the restoration operation has finished.

An additional annotation gardener.cloud/operation=migrate is added to the Custom Resources. It is used to tell the extension controllers in the Source Seed that they must stop reconciling resources (in case they are requeued due to errors) and should perform cleanup activities in the Shoot’s control plane. These cleanup activities involve removing the finalizers on Custom Resources and deleting them without actually deleting any infrastructure resources.
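
As a small sketch, the two annotations from the contract above would appear on an extension resource like this (only a metadata excerpt is shown):

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foo--bar
  annotations:
    gardener.cloud/operation: restore # set on the Destination Seed; "migrate" is set on the Source Seed instead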

Note: The same size limitations from the previous section are relevant here as well.

Shoot reconciliation flow changes

The only data which must be stored in the ShootState by the gardenlet is secrets (e.g. the ca for the API server). Therefore, the botanist.DeploySecrets step is changed. It is split into two functions which take a list of secrets that have to be generated.

  • botanist.GenerateSecretState Generates certificate authorities and other secrets which have to be persisted in the ShootState and must not be regenerated on the Destination Seed.
  • botanist.DeploySecrets Takes secret data from the ShootState, generates new ones (e.g. client tls certificates from the saved certificate authorities) and deploys everything in the Shoot’s control plane on the Destination Seed

ShootState synchronization controller

The ShootState synchronization controller will become part of the gardenlet. It syncs the state of extension custom resources from the shoot namespace to the garden cluster and updates the corresponding spec.extension.state field in the ShootState resource. The controller can watch Custom Resources used by the extensions and update the ShootState only when changes occur.

Migration workflow

  1. Starting migration
    • Migration can only be started after a Shoot cluster has been successfully created, so that the status.seed field in the Shoot resource has been set
    • The Shoot resource’s field spec.seedName="new-seed" is edited to hold the name of the Destination Seed, and reconciliation is automatically triggered (see the sketch after this list)
    • The Garden Controller Manager checks the equality between spec.seedName and status.seed, detects that they are different, and triggers migration.
  2. The Garden Controller Manager waits for the Destination Seed to be ready
  3. Shoot’s API server is stopped
  4. Backup the Shoot’s ETCD.
  5. Extension resources in the Source Seed are annotated with gardener.cloud/operation=migrate
  6. Scale Down the Shoot’s control plane in the Source Seed.
  7. The gardenlet in the Destination Seed fetches the state of extension resources from the ShootState resource in the garden cluster.
  8. Normal reconciliation flow is resumed in the Destination Seed. Extension resources are annotated with gardener.cloud/operation=restore to instruct the extension controllers to reconstruct their state.
  9. The Shoot’s namespace in Source Seed is deleted.
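
A minimal sketch of the trigger in step 1 (the seed names are illustrative): the Shoot spec is edited to point at the Destination Seed while the status still references the Source Seed, and the mismatch triggers the migration.

# Excerpt of the Shoot resource after the edit in step 1
spec:
  seedName: new-seed # Destination Seed, edited by the user/operator
status:
  seed: old-seed     # still the Source Seed; the difference triggers migration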

7 - 09 Test Framework

Gardener integration test framework

Motivation

As we want to improve our code coverage in the next months, we will need a simple and easy to use test framework. The current testframework already contains a lot of general test functions that ease the work of writing new tests. However, there are multiple disadvantages with the current structure of the tests and the testframework:

  1. Every new test is its own testsuite and therefore needs its own TestDef (https://github.com/gardener/gardener/tree/master/.test-defs). With this approach there will be hundreds of test definitions, growing with every new test (or at least every new test suite). But in most cases new tests do not need their own special TestDef: it’s just the wrong scope for the testmachinery and will result in unnecessarily complex testruns and configurations. In addition, it would result in additional maintenance for a huge number of TestDefs.
  2. The testsuites currently have their own specific interface/configuration that they need in order to be executed correctly (see K8s Update test). Consequently, the configuration has to be defined in the testruns, which results in one step per test with its very own configuration, which means that the testmachinery cannot simply select testdefinitions by label. As the testmachinery cannot make use of its ability to run labeled tests (e.g. run all tests labeled default), the testflow size increases with every new test and the testruns have to be manually adjusted with every new test.
  3. The current gardener test framework contains multiple test operations where some are just used for specific tests (e.g. plant_operations) and some are more general (garden_operation). Also the functions offered by the operations vary in their specialization as some are really specific to just one test e.g. shoot test operation with WaitUntilGuestbookAppIsAvailable whereas others are more general like WaitUntilPodIsRunning.
    This structure makes it hard for developers to find commonly used functions and also hard to integrate as the common framework grows with specialized functions.

Goals

In order to clean the testframework, make it easier for new developers to write tests and easier to add and maintain test execution within the testmachinery, the following goals are defined:

  • Have a small number of test suites (gardener, shoots see test flavors) to only maintain a fixed number of testdefinitions.
  • Use ginkgo test labels (inspired by the k8s e2e tests) to differentiate test behavior, test execution and test importance.
  • Use standardized configuration for all tests (differ depending on the test suite) but provide better tooling to dynamically read additional configuration from configuration files like the cloudprofile.
  • Clean the testframework to only contain general functionality and keep specific functions inside the tests

Proposal

The proposed new test framework consists of the following changes to tackle the above-described goals.

Test Flavors

Reducing the number of test definitions is done by combining the currently specified test suites into the following 3 general ones:

  • System test suite
    • e.g. create-shoot, delete-shoot, hibernate
    • need their own testdef because they have a special meaning in the context of the testmachinery
  • Gardener test suite
    • e.g. RBAC, scheduler
    • All tests that only need a gardener installation but no shoot cluster
    • Possible functions/environment:
      • New project for test suite (copy secret binding, cleanup)?
  • Shoot test suite
    • e.g. shoot app, network
    • Test that require a running shoot
    • Possible functions:
      • Namespace per test
      • cleanup of ns

As inspired by the k8s e2e tests, test labels are used to differentiate the tests by their behavior, their execution and their importance. Test labels means that tests are described using predefined labels in the test’s text (e.g ginkgo.It("[BETA] this is a test")). With this labeling strategy, it is also possible to see the test properties directly in the code and promoting a test can be done via a pullrequest and will then be automatically recognized by the testmachinery with the next release.

Using ginkgo focus to only run desired tests and combined testsuites, an example test definition will look like the following.

kind: TestDefinition
metadata:
  name: gardener-beta-suite
spec:
  description: Test suite that runs all gardener tests that are labeled as beta
  activeDeadlineSeconds: 7200
  labels: ["gardener", "beta"]
  command: [bash, -c]
  args:
  - >-
    go test -timeout=0 -mod=vendor ./test/integration/suite
    --v -ginkgo.v -ginkgo.progress -ginkgo.no-color
    -ginkgo.focus="[GARDENER] [BETA]"    

Using this approach, the overall number of testsuites is then reduced to a fixed number (excluding the system steps) of test suites * labelCombinations.

Framework

The new framework will consist of a common framework, a gardener framework (integrating the common framework) and a shoot framework (integrating the gardener framework).

All of these frameworks will have their own configuration that is exposed via commandline flags so that for example the shoot test framework can be executed by go test -timeout=0 -mod=vendor ./test/integration/suite --v -ginkgo.v -ginkgo.focus="[SHOOT]" --kubecfg=/path/to/config --shoot-name=xx.

The available test labels should be declared in the code with predefined values and in a predefined order, so that everyone is aware of the possible labels and the tests are labeled consistently across all integration tests. This approach is somewhat similar to what kubernetes is doing in their e2e test suite, but with some more restrictions (compare the example k8s e2e test).
A possible solution for consistent labeling would be to define the labels with every new ginkgo.It definition: f.Beta().Flaky().It("my test"), which internally orders them and would produce a ginkgo test with the text: [BETA] [FLAKY] my test.

General Functions

The test framework should include some general functions that can and will be reused by every test. These general functions may include:

  • Logging
  • State Dump
  • Detailed test output (status, duration, etc..)
  • Cleanup handling per test (It)
  • General, easy to use functions like WaitUntilDeploymentCompleted, GetLogs, ExecCommand, AvailableCloudprofiles, etc.

Example

A possible test with the new test framework would look like:

var _ = ginkgo.Describe("Shoot network testing", func() {
  // the testframework registers some cleanup handling for a state dump on failure and maybe cleanup of created namespaces
  f := framework.NewShootFramework()
  f.CAfterEach(func(ctx context.Context) {
    ginkgo.By("cleanup network test daemonset")
    err := f.ShootClient.Client().Delete(ctx, &appsv1.DaemonSet{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace}})
    if err != nil && !apierrors.IsNotFound(err) {
      Expect(err).ToNot(HaveOccurred()) // fail the cleanup only if the error is not a NotFound error
    }
  }, FinalizationTimeout)
  f.Release().Default().CIt("should reach all webservers on all nodes", func(ctx context.Context) {
    ginkgo.By("Deploy the net test daemon set")
    templateFilepath := filepath.Join(f.ResourcesDir, "templates", nginxTemplateName)
    err := f.RenderAndDeployTemplate(f.Namespace(), templateFilepath)
    Expect(err).ToNot(HaveOccurred())
    err = f.WaitUntilDaemonSetIsRunning(ctx, f.ShootClient.Client(), name, namespace)
    Expect(err).NotTo(HaveOccurred())
    pods := &corev1.PodList{}
    err = f.ShootClient.Client().List(ctx, pods, client.MatchingLabels{"app": "net-nginx"})
    Expect(err).NotTo(HaveOccurred())
    // check if all webservers can be reached from all nodes
    ginkgo.By("test connectivity to webservers")
    shootRESTConfig := f.ShootClient.RESTConfig()
    var res error
    for _, from := range pods.Items {
      for _, to := range pods.Items {
        // test connectivity from pod "from" to pod "to" (details omitted in this example)
        f.Logger.Infof("%s to %s", from.GetName(), to.GetName())
      }
    }
    Expect(res).ToNot(HaveOccurred())
  }, NetworkTestTimeout)
})

Future Plans

Ownership

When the test coverage is increased and there are more tests, we will need to track ownership for tests. At the beginning, the ownership will be shared across all maintainers of the residing repository, but this will not be suitable anymore as the tests grow and get more complex.

Therefore, the test ownership should be tracked via subgroups (in kubernetes this would be a SIG (comp. sig apps e2e test)). These subgroups will then be tracked via labels, and the members of these groups will be notified if tests fail.

8 - 10 Shoot Additional Container Runtimes

Gardener extensibility to support shoot additional container runtimes

Table of Contents

Summary

Gardener-managed Kubernetes clusters are sometimes used to run sensitive workloads, which are sometimes comprised of OCI images originating from untrusted sources. Additional use-cases want to leverage economies of scale to run workloads for multiple tenants on the same cluster. In some cases, Gardener users want to use operating systems which do not easily support the Docker engine.

This proposal aims to allow Gardener Shoot clusters to use CRI instead of the legacy Docker API, and to provide an extension type for adding CRI shims (like gVisor and Kata Containers), which can be used to add support for these runtimes to Gardener Shoot clusters.

Motivation

While pods and containers are intended to create isolated areas for concurrently running workloads on nodes, this isolation is not as robust as could be expected. Containers leverage the core Linux CGroup and Namespace features to isolate workloads, and many kernel vulnerabilities have the potential to allow processes to escape from their isolation. Once a process has escaped from its container, any other process running on the same node is compromised. Several projects try to mitigate this problem; for example Kata Containers allow isolating a Kubernetes Pod in a micro-vm, gVisor reduces the kernel attack surface by adding another level of indirection between the actual payload and the real kernel.

Kubernetes supports running pods using these alternate runtimes via the RuntimeClass concept, which was promoted to Beta in Kubernetes 1.14. Once Kubernetes is configured to use the Container Runtime Interface to control pods, it becomes possible to leverage CRI and run specific pods using different Runtime Classes. Additionally, configuring Kubernetes to use CRI instead of the legacy Dockershim is faster.

The motivation behind this proposal is to make all of this functionality accessible to Shoot clusters managed by Gardener.

Goals

  • Gardener must allow configuring its managed clusters with the CRI interface instead of the legacy Dockershim.
  • Low-level runtimes like gVisor or Kata Containers are provided as gardener extensions which are (optionally) installed into a landscape by the Gardener operator. There must be no runtime-specific knowledge in the core Gardener code.
  • It shall be possible to configure multiple low-level runtimes in Shoot clusters, on the Worker Group level.

Proposal

Gardener today assumes that all supported operating systems have Docker pre-installed in the base image. Starting with Docker Engine 1.11, Docker itself was refactored and cleaned-up to be based on the containerd library. The first phase would be to allow the change of the Kubelet configuration as described here so that Kubernetes would use containerd instead of the default Dockershim. This will be implemented for CoreOS, Ubuntu, and SuSE-CHost.

We will implement two Gardener extensions, providing gVisor and Kata Containers as options for Gardener landscapes. The WorkerGroup specification will be extended to allow specifying the CRI name and a list of additional required Runtimes for nodes in that group. For example:

workers:
- name: worker-b8jg5
  machineType: m5.large
  volumeType: gp2
  volumeSize: 50Gi
  autoScalerMin: 1
  autoScalerMax: 2
  maxSurge: 1
  cri:
    name: containerd
    containerRuntimes:
    - type: gvisor
    - type: kata-containers
  machineImage:
    name: coreos
    version: 2135.6.0

Each extension will need to address the following concerns:

  1. Add the low-level runtime binaries to the worker nodes. Each extension should get the runtime binaries from a container.
  2. Hook the runtime binary into the containerd configuration file, so that the runtime becomes available to containerd.
  3. Apply a label to each node that allows identifying nodes where the runtime is available.
  4. Apply the relevant RuntimeClass to the Shoot cluster, to expose the functionality to users.
  5. Provide a separate binary with a ValidatingWebhook (deployable to the garden cluster) to catch invalid configurations. For example, Kata Containers on AWS requires a machineType of i3.metal, so any Shoot requests with a Kata Containers runtime and a different machine type on AWS should be rejected.

Design Details

  1. Change the nodes’ container runtime to work with CRI and ContainerD (only if specified in the Shoot spec):

    1. In order to configure each worker machine in the cluster to work with CRI, the following configurations should be done:

      1. Add kubelet execution flags:
        1. --container-runtime=remote
        2. --container-runtime-endpoint=unix:///run/containerd/containerd.sock
      2. Make sure that default containerd configuration file exist in path /etc/containerd/config.toml.
    2. ContainerD and Docker configurations are different for each OS. To make sure the default configurations above work well on each worker machine, each OS extension would be responsible for configuring them during the reconciliation of the OperatingSystemConfig:

      1. os-ubuntu -
        1. Create a ContainerD unit drop-in to execute ContainerD with the default configuration file in the path /etc/containerd/config.toml.
        2. Create the container runtime metadata file with an OS path for binary installations: /usr/bin.
      2. os-coreos -
        1. Create a ContainerD unit drop-in to execute ContainerD with the default configuration file in the path /etc/containerd/config.toml.
        2. Create a Docker drop-in unit to execute Docker with the correct socket path of ContainerD.
        3. Create the container runtime metadata file with an OS path for binary installations: /var/bin.
      3. os-suse-chost -
        1. Create a ContainerD service unit and execute ContainerD with the default configuration file in the path /etc/containerd/config.toml.
        2. Download and install the ctr-cli, which is not shipped with the current SuSE image.
        3. Create the container runtime metadata file with an OS path for binary installations: /usr/sbin.
    3. To rotate the ContainerD (CRI) logs we will activate the kubelet feature flag: CRIContainerLogRotation=true.

    4. Docker monitor service will be replaced with equivalent ContainerD monitor service.

  2. Validate the workers’ additional runtime configurations:

    1. Disallow additional runtimes for shoots < 1.14.
    2. kata-containers validation: the machine type must support nested virtualization.
  3. Add support for each additional container runtime in the cluster.

    1. In order to install each additional available runtime in the cluster we should:

      1. Install the runtime binaries in each Worker’s pool nodes that specified the runtime support.
      2. Apply the relevant RuntimeClass to the cluster.
    2. The installation above should be done by a new kind of extension: the ContainerRuntime resource. For each container runtime type (kata-containers/gvisor) a dedicated extension controller will be created.

      1. A label for each supported container runtime will be added to every node that belongs to the worker pool. This should be done similarly to the way labels are created today for each node, through kubelet execution parameters (kubelet flags: --node-labels). When creating the (original) OperatingSystemConfig for the worker, each supported container runtime should be mapped to a label on the node. For example: label container.runtime.kata-containers=true (shoot.spec.cloud..worker.containerRuntimes.kata-container), label container.runtime.gvisor=true (shoot.spec.cloud..worker.containerRuntimes.gvisor).

      2. During the Shoot reconciliation (with similar steps to the Extensions today), Gardener will create a new ContainerRuntime resource if a container runtime exists in at least one worker spec:

        apiVersion: extensions.gardener.cloud/v1alpha1
        kind: ContainerRuntime
        metadata:
          name: kata-containers-runtime-extension
          namespace: shoot--foo--bar
        spec:
          type: kata-containers
        

        Gardener will wait until all ContainerRuntime extension resources have been reconciled by the appropriate extension controllers.

      3. Each runtime extension controller will be responsible for reconciling its own ContainerRuntime resource type: the rc-kata-containers extension controller will reconcile ContainerRuntime resources of type kata-containers, and rc-gvisor will reconcile ContainerRuntime resources of type gvisor. The reconciliation process by the container runtime extension controllers:

        1. A runtime extension controller of a specific type should apply a chart which is responsible for the installation of the container runtime in the cluster:
          1. A DaemonSet which will run a privileged pod on each node with the label container.runtime.<container-runtime-type>=true. The pod will be responsible for:
            1. Copy the container runtime binaries (from the extension package) to the relevant path in the host OS.
            2. Add the relevant container runtime plugin section to the containerd configuration file (/etc/containerd/config.toml).
            3. Restart containerd on the node.
          2. Apply the relevant RuntimeClasses in the cluster to support the runtime. For example:
            apiVersion: node.k8s.io/v1beta1
            kind: RuntimeClass
            metadata:
              name: gvisor
            handler: runsc
            
        2. Update the status of the relevant ContainerRuntime resource to succeeded.

9 - 12 Oidc Webhook Authenticator

OIDC Webhook Authenticator

Problem

In Kubernetes you can authenticate via several authentication strategies:

  • x509 Client Certificates
  • Static Token Files
  • Bootstrap Tokens
  • Static Password File (Basic authentication - deprecated and removed in 1.19)
  • Service Account Tokens
  • OpenID Connect Tokens
  • Webhook Token Authentication
  • Authenticating Proxy

End-users should use OpenID Connect (OIDC) Tokens created by an OIDC-compatible Identity Provider (IDP) and present the id_token to the kube-apiserver. If the kube-apiserver is configured to trust the IDP and the token is valid, then the user is authenticated and the UserInfo is sent to the authorization stack.

Ideally, operators of the Gardener cluster should be able to authenticate to end-user Shoot clusters with an id_token generated by an OIDC IDP, but in many cases end-users might have already configured OIDC for their cluster, and more than one OIDC configuration is not allowed.

Another interesting application of multiple OIDC providers would be a per-Project OIDC provider, where end-users of Gardener can add their own OIDC-compatible IDPs.

To work around the one-OIDC-per-kube-apiserver limitation, a new OIDC Webhook Authenticator (OWA) could be implemented.

Goals

  • Dynamic registrations of OpenID Connect configurations.
  • As close as possible to the Kubernetes built-in OIDC Authenticator.
  • Built as an optional extension, not required for a functional Shoot or Gardener cluster.

Non-goals

Proposal

The kube-apiserver can use Webhook Token Authentication to send a Bearer Token (id_token) to an external webhook for validation:

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "spec": {
    "token": "(BEARERTOKEN)"
  }
}

Where upon verification, the remote webhook returns the identity of the user (if authentication succeeds):

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "janedoe@example.com",
      "uid": "42",
      "groups": [
        "developers",
        "qa"
      ],
      "extra": {
        "extrafield1": [
          "extravalue1",
          "extravalue2"
        ]
      }
    }
  }
}

Registration of new OpenIDConnect

This new OWA can be configured with multiple OIDC providers and the entire flow can look like this:

  1. Admin adds a new OpenIDConnect resource (via CRD) to the cluster.

    apiVersion: authentication.gardener.cloud/v1alpha1
    kind: OpenIDConnect
    metadata:
      name: foo
    spec:
      issuerURL: https://foo.bar
      clientID: some-client-id
      usernameClaim: email
      usernamePrefix: "test-"
      groupsClaim: groups
      groupsPrefix: "baz-"
      supportedSigningAlgs:
      - RS256
      requiredClaims:
        baz: bar
      caBundle: LS0tLS1CRUdJTiBDRVJU...base64-encoded CA certs for issuerURL.
    
  2. OWA watches for changes on this resource and does OIDC discovery. The OIDC provider’s configuration has to be accessible under the spec.issuerURL with a well-known path (.well-known/openid-configuration).

  3. OWA uses the jwks_uri obtained from the OIDC provider’s configuration to fetch the OIDC provider’s public keys from that endpoint.

  4. OWA uses those keys, issuer, client_id and other settings to add an OIDC authenticator to an in-memory list of Token Authenticators.

alt text

End-user authentication via new OpenIDConnect IDP

When a user presents an id_token obtained from an OpenID Connect IDP, the flow looks like this:

  1. The user authenticates against a Custom IDP.

  2. id_token is obtained from the Custom IDP.

  3. The user uses id_token to perform an API call to kube-apiserver.

  4. As the id_token is not matched by any built-in or configured authenticators in the kube-apiserver, it is sent to OWA for validation.

    {
      "TokenReview": {
        "kind": "TokenReview",
        "apiVersion": "authentication.k8s.io/v1beta1",
        "spec": {
          "token": "ddeewfwef..."
        }
      }
    }
    
  5. OWA uses TokenReview to authenticate the calling API server (the kube-apiserver used for delegation of authentication and authorization may be different from the calling kube-apiserver).

    {
      "TokenReview": {
        "kind": "TokenReview",
        "apiVersion": "authentication.k8s.io/v1beta1",
        "spec": {
          "token": "api-server-token..."
        }
      }
    }
    
  6. After the Authentication API server returns the identity of the calling API server:

    {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "metadata": {
            "creationTimestamp": null
        },
        "spec": {
            "token": "eyJhbGciOiJSUzI1NiIsImtpZCI6InJocEdLTXZlYjV1OE5heD..."
        },
        "status": {
            "authenticated": true,
            "user": {
                "groups": [
                    "system:serviceaccounts",
                    "system:serviceaccounts:shoot--abcd",
                    "system:authenticated"
                ],
                "uid": "14db103e-88bb-4fb3-8efd-ca9bec91c7bf",
                "username": "system:serviceaccount:shoot--abcd:kube-apiserver"
            }
        }
    }
    

    OWA makes a SubjectAccessReview call to the Authorization API server to ensure that the calling API server is allowed to validate tokens:

    {
      "apiVersion": "authorization.k8s.io/v1",
      "kind": "SubjectAccessReview",
      "spec": {
        "groups": [
          "system:serviceaccounts",
          "system:serviceaccounts:shoot--abcd",
          "system:authenticated"
        ],
        "nonResourceAttributes": {
          "path": "/validate-token",
          "verb": "post"
        },
        "user": "system:serviceaccount:shoot--abcd:kube-apiserver"
      },
      "status": {
        "allowed": true,
        "reason": "RBAC: allowed by RoleBinding \"kube-apiserver\" of ClusterRole \"kube-apiserver\" to ServiceAccount \"system:serviceaccount:shoot--abcd:kube-apiserver\""
      }
    }
    
  7. OWA then iterates over all registered OpenIDConnect Token authenticators and tries to validate the token.

  8. Upon successful validation it returns the TokenReview with user, groups and extra parameters:

    {
      "TokenReview": {
        "kind": "TokenReview",
        "apiVersion": "authentication.k8s.io/v1beta1",
        "spec": {
          "token": "ddeewfwef..."
        },
        "status": {
          "authenticated": true,
          "user": {
            "username": "test-foo@bar.com",
            "groups": [
              "baz-employee"
            ],
            "extra": {
              "gardener.cloud/authenticator/name": [
                "foo"
              ],
              "gardener.cloud/authenticator/uid": [
                "e5062528-e5a4-4b97-ad83-614d015b0979"
              ]
            }
          }
        }
      }
    }
    

    It also adds some extra information which can be used by custom authorizers later on:

    1. gardener.cloud/authenticator/name contains the name of the OpenIDConnect authenticator which was used.
    2. gardener.cloud/authenticator/uid contains the metadata.uid of the OpenIDConnect authenticator which was used.
  9. The kube-apiserver proceeds with authorization checks and returns response.

An overview of the flow:

alt text

Deployment for Shoot clusters

OWA can be deployed per Shoot cluster via the Shoot OIDC Service Extension. The shoot’s kube-apiserver is mutated so that it has the following flag configured:

--authentication-token-webhook-config-file=/etc/webhook/kubeconfig

OWA, on the other hand, uses the shoot’s kube-apiserver and delegates auth capabilities to it. This means that the needed RBAC is managed in the shoot cluster. By default, only the shoot’s kube-apiserver has permissions to validate tokens against OWA.
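
A minimal sketch of the webhook kubeconfig referenced by that flag, assuming OWA is exposed in-cluster under a service name like oidc-webhook-authenticator and the /validate-token path shown in the SubjectAccessReview above (the names, URL and credentials are assumptions):

apiVersion: v1
kind: Config
clusters:
- name: oidc-webhook-authenticator
  cluster:
    certificate-authority-data: <base64-encoded CA bundle> # CA used to verify OWA's serving certificate
    server: https://oidc-webhook-authenticator.garden.svc/validate-token
users:
- name: kube-apiserver
  user:
    token: <kube-apiserver credentials> # credentials OWA uses to authenticate the calling API server (see the flow above)
contexts:
- name: oidc-webhook-authenticator
  context:
    cluster: oidc-webhook-authenticator
    user: kube-apiserver
current-context: oidc-webhook-authenticator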

10 - 13 Automated Seed Management

Automated Seed Management

Automated seed management involves automating certain aspects of managing seeds in Garden clusters, such as:

Implementing the above features would involve changes to various existing Gardener components, as well as perhaps introducing new ones. This document describes these features in more detail and proposes a design approach for some of them.

In Gardener, scheduling shoots onto seeds is quite similar to scheduling pods onto nodes in Kubernetes. Therefore, a guiding principle behind the proposed design approaches is taking advantage of best practices and existing components already used in Kubernetes.

Ensuring Seeds Capacity for Shoots Is Not Exceeded

Seeds have a practical limit of how many shoots they can accommodate. Exceeding this limit is undesirable as the system performance will be noticeably impacted. Therefore, it is important to ensure that a seed’s capacity for shoots is not exceeded by introducing a maximum number of shoots that can be scheduled onto a seed and making sure that it is taken into account by the scheduler.

An initial discussion of this topic is available in Issue #2938. The proposed solution is based on the following flow:

  • The gardenlet is configured with certain resources and their total capacity (and, for certain resources, the amount reserved for Gardener).
  • The gardenlet seed controller updates the Seed status with the capacity of each resource and how much of it is actually available to be consumed by shoots, using capacity and allocatable fields that are very similar to the corresponding fields in the Node status.
  • When scheduling shoots, gardener-scheduler is influenced by the remaining capacity of the seed. In the simplest possible implementation, it never schedules shoots onto a seed that has already reached its capacity for a resource needed by the shoot.

Initially, the only resource considered would be the maximum number of shoots that can be scheduled onto a seed. Later, more resources could be added to make more precise scheduling calculations.

Note: Resources could also be requested by shoots, similarly to how pods can request node resources, and the scheduler could then ensure that such requests are taken into account when scheduling shoots onto seeds. However, the user is rarely, if at all, concerned with what resources a shoot consumes from a seed, and this should also be regarded as an implementation detail that could change in the future. Therefore, such resource requests are not included in this GEP.

In addition, an extensibility plugin framework could be introduced in the future in order to advertise custom resources, including provider-specific resources, so that gardenlet would be able to update the seed status with their capacity and allocatable values, for example load balancers on Azure. Such a concept is not described here in further details as it is sufficiently complex to require a separate GEP.

Example Seed status with capacity and allocatable fields:

status:
  capacity:
    shoots: "100"
    persistent-volumes: "200" # Built-in resource
    azure.provider.extensions.gardener.cloud/load-balancers: "30" # Custom resource advertised by an Azure-specific plugin
  allocatable:
    shoots: "100"
    persistent-volumes: "197" # 3 persistent volumes are reserved for Gardener
    azure.provider.extensions.gardener.cloud/load-balancers: "30"

Gardenlet Configuration

As mentioned above, the total resource capacity for built-in resources such as the number of shoots is specified as part of the gardenlet configuration, not in the Seed spec. The gardenlet configuration itself could be specified in the spec of the newly introduced ManagedSeed resource. Here it is assumed that in the future this could become the recommended and most widely used way to manage seeds. If the same gardenlet is responsible for multiple seeds, they would all share the same capacity settings.

To specify the total resource capacity for built-in resources, as well as the amount of such resources reserved for Gardener, the 2 new fields resources.capacity and resources.reserved are introduced in the GardenletConfiguration resource. The gardenlet seed controller would then initialize the capacity and allocatable fields in the seed status as follows:

  • The capacity value is set to the configured resources.capacity.
  • The allocatable value is set to the configured resources.capacity minus resources.reserved.

Example GardenletConfiguration with resources.capacity and resources.reserved fields:

resources:
  capacity:
    shoots: 100
    persistent-volumes: 200
  reserved:
    persistent-volumes: 3
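
Given this configuration, the gardenlet seed controller would set the Seed status as sketched below (derived from the computation described above; allocatable is the configured capacity minus the reserved amount):

status:
  capacity:
    shoots: "100"
    persistent-volumes: "200"
  allocatable:
    shoots: "100" # nothing is reserved for shoots
    persistent-volumes: "197" # 200 - 3 reserved for Gardener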

Scheduling Algorithm

Currently gardener-scheduler uses a simple non-extensible algorithm in order to schedule shoots onto seeds. It goes through the following stages:

  • Filter out seeds that don’t meet scheduling requirements such as being ready, matching cloud profile and shoot label selectors, matching the shoot provider, and not having taints that are not tolerated by the shoot.
  • From the remaining seeds, determine candidates that are considered best based on their region, by using a strategy that can be either “same region” or “minimal distance”.
  • Among these candidates, choose the one with the least number of shoots.

This scheduling algorithm should be adapted in order to properly take into account resource capacity and requests. As a first step, during the filtering stage, any seeds that would exceed their capacity for shoots, or their capacity for any resources requested by the shoot, should simply be filtered out and not considered during the next stages.

Later, the scheduling algorithm could be further enhanced by replacing the step in which the region strategy is applied with a scoring step similar to the one in Kubernetes Scheduler. In this scoring step, the scheduler would rank the remaining seeds to choose the most suitable shoot placement. It would assign a score to each seed that survived filtering based on a list of scoring rules. These rules might include for example MinimalDistance and SeedResourcesLeastAllocated, among others. Each rule would produce its own score for the seed, and the overall seed score would be calculated as a weighted sum of all such scores. Finally, the scheduler would assign the shoot to the seed with the highest ranking.
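
For illustration, a possible shape of such a scoring configuration is sketched below. The rule names come from the examples above; the configuration format and the weights are purely hypothetical and not part of any existing API:

# Hypothetical scoring configuration; weights are illustrative only
scoring:
  rules:
  - name: MinimalDistance
    weight: 2
  - name: SeedResourcesLeastAllocated
    weight: 1
# Overall seed score = 2 * minimalDistanceScore + 1 * leastAllocatedScore;
# the shoot is assigned to the seed with the highest weighted sum.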

ManagedSeeds

When all or most of the existing seeds are near capacity, new seeds should be created in order to accommodate more shoots. Conversely, sometimes there could be too many seeds for the number of shoots, and so some of the seeds could be deleted to save resources. Currently, the process of creating a new seed involves a number of manual steps, such as creating a new shoot that meets certain criteria, and then registering it as a seed in Gardener. This could be automated to some extent by annotating a shoot with the use-as-seed annotation, in order to create a “shooted seed”. However, adding more than one similar seed still requires manually creating all needed shoots, annotating them appropriately, and making sure that they are successfully reconciled and registered.

To create, delete, and update seeds effectively in a declarative way and allow auto-scaling, a “creatable seed” resource along with a “set” (and in the future, perhaps also a “deployment”) of such creatable seeds should be introduced, similar to Kubernetes Pod, ReplicaSet, and Deployment (or to MCM Machine, MachineSet, and MachineDeployment) resources. With such resources (and their respective controllers), creating a new seed based on a template would become as simple as increasing the replicas field in the “set” resource.

In Issue #2181 it is already proposed that the use-as-seed annotation is replaced by a dedicated ShootedSeed resource. The solution proposed here further elaborates on this idea.

ManagedSeed Resource

The ManagedSeed resource is a dedicated custom resource that represents an evolution of the “shooted seed” and properly replaces the use-as-seed annotation. This resource contains:

  • The name of the Shoot that should be registered as a Seed.
  • An optional seedTemplate section that contains the Seed spec and parts of the metadata, such as labels and annotations.
  • An optional gardenlet section that contains:
    • gardenlet deployment parameters, such as the number of replicas, the image, etc.
    • The GardenletConfiguration resource that contains controllers configuration, feature gates, and a seedConfig section that contains the Seed spec and parts of its metadata.
    • Additional configuration parameters, such as the garden connection bootstrap mechanism (see TLS Bootstrapping), and whether to merge the provided configuration with the configuration of the parent gardenlet.

Either the seedTemplate or the gardenlet section must be specified, but not both:

  • If the seedTemplate section is specified, gardenlet is not deployed to the shoot, and a new Seed resource is created based on the template.
  • If the gardenlet section is specified, gardenlet is deployed to the shoot, and it registers a new seed upon startup based on the seedConfig section of the GardenletConfiguration resource.

A ManagedSeed allows fine-tuning the seed and the gardenlet configuration of shooted seeds in order to deviate from the global defaults, e.g. lower the concurrent sync for some of the seed’s controllers or enable a feature gate only on certain seeds. Also, it simplifies the deletion protection of such seeds.

In addition, the ManagedSeed resource is a more powerful alternative to the use-as-seed annotation. The implementation of the use-as-seed annotation itself could be refactored to use a ManagedSeed resource extracted from the annotation by a controller.

Although in this proposal a ManagedSeed is always a “shooted seed”, that is a Shoot that is registered as a Seed, this idea could be further extended in the future by adding a type field that could be either Shoot (implied in this proposal) or something different. Such an extension would allow registering and managing as a Seed a cluster that is not a Shoot, e.g. a GKE cluster.

Last but not least, ManagedSeeds could be used as the basis for creating and deleting seeds automatically via the ManagedSeedSet resource that is described in ManagedSeedSets.

Unlike the Seed resource, the ManagedSeed resource is namespaced. If created in the garden namespace, the resulting seed is globally available. If created in a project namespace, the resulting seed can be used as a “private seed” by shoots in the project, either by being decorated with project-specific taints and labels, or by being of the special PrivateSeed kind that is also namespaced. The concept of private seeds / cloudprofiles is described in Issue #2874. Until this concept is implemented, ManagedSeed resources might need to be restricted to the garden namespace, similarly to how shoots with the use-as-seed annotation currently are.

Example ManagedSeed resource with a seedTemplate section:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: crazy-botany
  namespace: garden
spec:
  shoot:
    name: crazy-botany # Shoot that should be registered as a Seed
  seedTemplate: # Seed template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      provider:
        type: gcp
        region: europe-west1
      taints:
      - key: seed.gardener.cloud/protected
      ...

Example ManagedSeed resource with a gardenlet section:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: crazy-botany
  namespace: garden
spec:
  shoot:
    name: crazy-botany # Shoot that should be registered as a Seed
  gardenlet: 
    deployment: # Gardenlet deployment configuration
      replicaCount: 1
      revisionHistoryLimit: 10
      serviceAccountName: gardenlet
      image:
        repository: eu.gcr.io/gardener-project/gardener/gardenlet
        tag: latest
        pullPolicy: IfNotPresent
      resources:
        ...
      podLabels:
        ...
      podAnnotations: 
        ...
      additionalVolumes:
        ...
      additionalVolumeMounts:
        ...
      env:
        ...
      vpa: false
    config: # GardenletConfiguration resource
      apiVersion: gardenlet.config.gardener.cloud/v1alpha1
      kind: GardenletConfiguration
      seedConfig: # Seed template, including spec and parts of the metadata
        metadata:
          labels:
            foo: bar
        spec:
          provider:
            type: gcp
            region: europe-west1
          taints:
          - key: seed.gardener.cloud/protected
          ...
      controllers:
        shoot:
          concurrentSyncs: 20
      featureGates:
        ...
      ...
    bootstrap: BootstrapToken
    mergeWithParent: true

ManagedSeed Controller

ManagedSeeds are reconciled by a new managed seed controller in gardenlet. Its implementation is very similar to the current seed registration controller, and in fact could be regarded as a refactoring of the latter, with the difference that it uses the ManagedSeed resource rather than the use-as-seed annotation on a Shoot. The gardenlet only reconciles ManagedSeeds that refer to Shoots scheduled on Seeds the gardenlet is responsible for.

Once this controller is considered sufficiently stable, the current use-as-seed annotation and the controller mentioned above should be marked as deprecated and eventually removed.

A ManagedSeed that is in use by shoots cannot be deleted, unless the shoots are either deleted or moved to other seeds first. The managed seed controller ensures that this is the case by only allowing a ManagedSeed to be deleted if its Seed has been already deleted.

ManagedSeed Admission Plugins

In addition to the managed seed controller mentioned above, new gardener-apiserver admission plugins should be introduced to properly validate the creation and update of ManagedSeeds, as well as the deletion of shoots registered as seeds. These plugins should ensure that:

  • A Shoot that is being referred to by a ManagedSeed cannot be deleted.
  • Certain Seed spec fields, for example the provider type and region, networking CIDRs for pods, services, and nodes, etc., are the same as (or compatible with) the corresponding Shoot spec fields of the shoot that is being registered as seed.
  • If such Seed spec fields are omitted or empty, the plugins should supply proper defaults based on the values in the Shoot resource.
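
For illustration, a sketch of how such defaulting could look is given below: the networks of the Seed (in its spec or seedConfig) are filled in from the Shoot's own networking fields. The field names exist in the Shoot and Seed APIs; the concrete defaulting behavior and the CIDR values are assumptions for this example:

# Relevant excerpt of the Shoot being registered as a seed
spec:
  networking:
    pods: 100.96.0.0/11
    services: 100.64.0.0/13
    nodes: 10.250.0.0/16
---
# Seed spec fields defaulted by the admission plugin from the Shoot above
spec:
  networks:
    pods: 100.96.0.0/11
    services: 100.64.0.0/13
    nodes: 10.250.0.0/16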

Provider-specific Seed Bootstrapping Actions

Bootstrapping a new seed might require additional provider-specific actions to the ones performed automatically by the managed seed controller. For example, on Azure this might include getting a new subscription, extending quotas, etc. This could eventually be automated by introducing an extension mechanism for the Gardener seed bootstrapping flow, to be handled by a new type of controller in the provider extensions. However, such an extension mechanism is not in the scope of this proposal and might require a separate GEP.

One idea that could be further explored is the use of shoot readiness gates, similar to Kubernetes pod readiness gates, in order to control whether a Shoot is considered Ready before it could be registered as a Seed. A provider-specific extension could set the special condition that is specified as a readiness gate to True only after it has successfully performed the provider-specific actions needed.
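
A rough sketch of how this could look on the Shoot, assuming a readinessGates field analogous to the pod one were introduced (both the field and the condition type below are hypothetical and only meant to illustrate the idea):

spec:
  readinessGates:
  - conditionType: azure.provider.extensions.gardener.cloud/SeedBootstrapPrepared # hypothetical condition set to True by a provider extension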

Changes to Existing Controllers

Since the Shoot registration as a Seed is decoupled from the Shoot reconciliation, existing gardenlet controllers would not have to be changed in order to properly support ManagedSeeds. The main change to gardenlet that would be needed is introducing the new managed seed controller mentioned above, and possibly retiring the old one at some point. In addition, the Shoot controller would need to be adapted as it currently performs certain actions differently if the shoot has a “shooted seed”.

The introduction of the ManagedSeed resource would also require no changes to existing gardener-controller-manager controllers that operate on Shoots (for example, shoot hibernation and maintenance controllers).

ManagedSeedSets

Similarly to a ReplicaSet, the purpose of a ManagedSeedSet is to maintain a stable set of replica ManagedSeeds available at any given time. As such, it is used to guarantee the availability of a specified number of identical ManagedSeeds, on an equal number of identical Shoots.

ManagedSeedSet Resource

The ManagedSeedSet resource has a selector field that specifies how to identify ManagedSeeds it can acquire, a number of replicas indicating how many ManagedSeeds (and their corresponding Shoots) it should be maintaining, and two templates:

  • A ManagedSeed template (template) specifying the data of new ManagedSeeds it should create to meet the number of replicas criteria.
  • A Shoot template (shootTemplate) specifying the data of new Shoots it should create to host the ManagedSeeds.

A ManagedSeedSet then fulfills its purpose by creating and deleting ManagedSeeds (and their corresponding Shoots) as needed to reach the desired number.

A ManagedSeedSet is linked to its ManagedSeeds and Shoots via the metadata.ownerReferences field, which specifies what resource the current object is owned by. All ManagedSeeds and Shoots acquired by a ManagedSeedSet have their owning ManagedSeedSet’s identifying information within their ownerReferences field.
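
For example, a ManagedSeed created by a ManagedSeedSet named crazy-botany could carry an owner reference like the following (a sketch; the replica naming follows the scheme described later in ManagedSeed Identity and Order, and the uid is made up):

metadata:
  name: crazy-botany-0
  namespace: garden
  ownerReferences:
  - apiVersion: seedmanagement.gardener.cloud/v1alpha1
    kind: ManagedSeedSet
    name: crazy-botany
    uid: 2d9d3d1e-0000-0000-0000-000000000000 # made-up value
    controller: true
    blockOwnerDeletion: true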

Example ManagedSeedSet resource:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeedSet
metadata:
  name: crazy-botany
  namespace: garden
spec:
  replicas: 3
  selector:
    matchLabels:
      foo: bar
  updateStrategy:
    type: RollingUpdate # Update strategy, must be `RollingUpdate`
    rollingUpdate:
      partition: 2 # Only update the last replica (#2), assuming there are no gaps ("rolling out a canary")
  template: # ManagedSeed template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec: 
      # shoot.name is not specified since it's filled automatically by the controller
      seedTemplate: # Either a seedTemplate or a gardenlet section must be specified, see above
        metadata:
          labels:
            foo: bar
        spec:
          provider:
            type: gcp
            region: europe-west1
          taints:
          - key: seed.gardener.cloud/protected
          ...
  shootTemplate: # Shoot template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      cloudProfileName: gcp
      secretBindingName: shoot-operator-gcp
      region: europe-west1
      provider:
        type: gcp
      ...

ManagedSeedSet Controller

ManagedSeedSets are reconciled by a new managed seed set controller in gardener-controller-manager. During the reconciliation this controller creates and deletes ManagedSeeds and Shoots in response to changes to the replicas and selector fields.

Note: The introduction of the ManagedSeedSet resource would not require any changes to gardenlet or to existing gardener-controller-manager controllers.

Managing ManagedSeed Updates

To manage ManagedSeed updates, we considered two possible approaches:

  • A ManagedSeedSet, similarly to a ReplicaSet, does not manage updates to its replicas in any way. In the future, we might introduce ManagedSeedDeployments, a higher-level concept that manages ManagedSeedSets and provides declarative updates to ManagedSeeds along with other useful features, similarly to a Deployment. Such a mechanism would involve creating new ManagedSeedSets, and therefore new seeds, behind the scenes, and moving existing shoots to them.
  • A ManagedSeedSet does manage updates to its replicas, similarly to a StatefulSet. Updates are performed “in-place”, without creating new seeds and moving existing shoots to them. Such a mechanism could also take advantage of other StatefulSet features, such as ordered rolling updates and phased rollouts.

There is an important difference between seeds and pods or nodes in that seeds are more “heavyweight” and therefore updating a set of seeds by introducing new seeds and moving shoots to them tends to be much more complex, time-consuming, and prone to failures compared to updating the seeds “in place”. Furthermore, updating seeds in this way depends on a mature implementation of GEP-7: Shoot Control Plane Migration, which is not available right now. Due to these considerations, we favor the second approach over the first one.

ManagedSeed Identity and Order

A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods. It maintains a stable identity (including network identity) for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

A StatefulSet achieves the above by associating each replica with an ordinal number. With n replicas, these ordinal numbers range from 0 to n-1. When scaling out, newly added replicas always have ordinal numbers larger than those of previously existing replicas. When scaling in, it is the replicas with the largest ordinal numbers that are removed.

Besides stable identity and persistent storage, these ordinal numbers are also used to implement the following StatefulSet features:

  • Ordered, graceful deployment and scaling.
  • Ordered, automated rolling updates. Such rolling updates can be partitioned (limited to replicas with ordinal numbers greater than or equal to the “partition”) to achieve phased rollouts.

A ManagedSeedSet, unlike a StatefulSet, does not need to maintain a stable identity for its ManagedSeeds. Furthermore, it would not be practical to always remove the replicas with the largest ordinal numbers when scaling in, since the corresponding seeds may have shoots scheduled onto them, while other seeds, with lower ordinals, may have fewer shoots (or none), and therefore be much better candidates for being removed.

On the other hand, it would be beneficial if a ManagedSeedSet, like a StatefulSet, provides ordered deployment and scaling, ordered rolling updates, and phased rollouts. The main advantage of these features is that a deployment or update failure would affect fewer replicas (ideally just one), containing any potential damage and making the situation easier to handle, thus achieving some of the goals stated in Issue #87. They could also help keep seed rolling updates confined to time windows outside business hours.

Based on the above considerations, we propose the following mechanism for handling ManagedSeed identity and order:

  • A ManagedSeedSet uses ordinal numbers generated by an increasing sequence to identify ManagedSeeds and Shoots it creates and manages. These numbers always start from 0 and are incremented by 1 for each newly added replica.
  • Replicas (both ManagedSeeds and Shoots) are named after the ManagedSeedSet with the ordinal number appended. For example, for a ManagedSeedSet named test its replicas are named test-0, test-1, etc.
  • Gaps in the sequence created by removing replicas with ordinal numbers in the middle of the range are never filled in. A newly added replica always receives a number that is not only free, but also unique to itself. For example, if there are 2 replicas named test-0 and test-1 and any one of them is removed, a newly added replica will still be named test-2.

Although such ordinal numbers can also provide some form of stable identity, in this case it is much more important that they can provide a predictable ordering for deployments and updates, and can also be used to partition rolling updates similarly to StatefulSet ordinal numbers.

Update Strategies

The ManagedSeedSet’s .spec.updateStrategy field allows configuring automated rolling updates for the ManagedSeeds and Shoots in a ManagedSeedSet.

Rolling Updates

The RollingUpdate update strategy implements automated, rolling updates for the ManagedSeeds and Shoots in a ManagedSeedSet. With this strategy, the ManagedSeedSet controller will update each ManagedSeed and Shoot in the ManagedSeedSet. It will proceed from the largest ordinal number to the smallest, updating each ManagedSeed and its corresponding Shoot one at a time. It will wait until both the Shoot and the Seed of an updated ManagedSeed are Ready prior to updating its predecessor.

As a further improvement upon the above, the controller could check not only the ManagedSeeds and their corresponding Shoots for readiness, but also the Shoots scheduled onto these ManagedSeeds. The rollout would then only continue if no more than X percent of these Shoots are not reconciled and Ready. Since checking all these additional conditions might require some complex logic, it should be performed by an independent managed seed care controller that updates the ManagedSeed resource with the readiness of its Seed and all Shoots scheduled onto the Seed.

Note that unlike a StatefulSet, an OnDelete update strategy is not supported.

Partitions

The RollingUpdate update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, only ManagedSeeds and Shoots with ordinals greater than or equal to the partition will be updated when any of the ManagedSeedSet’s templates is updated. All remaining ManagedSeeds and Shoots will not be updated. If a ManagedSeedSet’s .spec.updateStrategy.rollingUpdate.partition is greater than the largest ordinal number in use by a replica, updates to its templates will not be propagated to its replicas (but newly added replicas may still use the updated templates depending on the partition value).

Keeping Track of Revision History and Performing Rollbacks

Similarly to a StatefulSet, the ManagedSeedSet controller uses ControllerRevisions to keep track of the revision history, and controller-revision-hash labels to maintain an association between a ManagedSeed or a Shoot and the concrete template revisions based on which they were created or last updated. These are used for the following purposes:

  • During an update, determine which replicas are still not on the latest revision and therefore should be updated.
  • Display the revision history of a ManagedSeedSet via kubectl rollout history.
  • Roll back all ManagedSeedSet replicas to a specific revision via kubectl rollout undo.

Note: The above kubectl rollout commands will not work with custom resources such as ManagedSeedSets out of the box (the documentation says explicitly that valid resource types are only deployments, daemonsets, and statefulsets), but it should be possible to eventually support such commands for ManagedSeedSets via a kubectl plugin.

Scaling-in ManagedSeedSets

Deleting ManagedSeeds in response to decreasing the replicas of a ManagedSeedSet deserves special attention for two reasons:

  • A seed that is already in use by shoots cannot be deleted, unless the shoots are either deleted or moved to other seeds first.
  • When there are more empty seeds than requested for deletion, determining which seeds to delete might not be as straightforward as with pods or nodes.

The above challenges could be addressed as follows:

  • In order to scale in a ManagedSeedSet successfully, there should be at least as many empty ManagedSeeds as the difference between the old and the new replicas. In some cases, the user might need to ensure that this is the case by draining some seeds manually before decreasing the replicas field.
  • It should be possible to protect ManagedSeeds from deletion even if they are empty, perhaps via an annotation such as seedmanagement.gardener.cloud/protect-from-deletion. Such seeds are not taken into account when determining whether the scale in operation can succeed.
  • The decision which seeds to delete among the ManagedSeeds that are empty and not protected should be based on hints, perhaps again in the form of annotations, that could be added manually by the user, as well as other factors, see Prioritizing ManagedSeed Deletion.

Prioritizing ManagedSeed Deletion

To help the controller decide which empty ManagedSeeds are to be deleted first, the user could manually annotate ManagedSeeds with a seed priority annotation such as seedmanagement.gardener.cloud/priority. ManagedSeeds with lower priority are more likely to be deleted first. If not specified, a certain default value is assumed, for example 3.

Besides this annotation, the controller should also take into account other factors, such as the current seed conditions (NotReady should be preferred for deletion over Ready) and the seed’s age (older should be preferred for deletion over newer).
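
A sketch of how these hints could look on an empty ManagedSeed (the annotation keys are the ones suggested above; their exact semantics are still open):

metadata:
  name: crazy-botany-1
  namespace: garden
  annotations:
    # Protect this empty replica from deletion during scale-in:
    seedmanagement.gardener.cloud/protect-from-deletion: "true"
    # Alternatively, mark it as a preferred deletion candidate (lower value means deleted earlier; default 3):
    # seedmanagement.gardener.cloud/priority: "1"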

Auto-scaling Seeds

The most interesting and advanced automated seed management feature is making sure that a Garden cluster has enough seeds registered to schedule new shoots (and, in the future, reschedule shoots from drained seeds) without exceeding the seeds’ capacity for shoots, but not more than actually needed at any given moment. This would involve introducing an auto-scaling mechanism for seeds in Garden clusters.

The proposed solution builds upon the ideas introduced earlier. The ManagedSeedSet resource (and in the future, also the ManagedSeedDeployment resource) could have a scale subresource that changes the replicas field. This would allow a new “seed autoscaler” controller to scale these resources via a special “autoscaler” resource (for example SeedAutoscaler), similarly to how the Kubernetes Horizontal Pod Autoscaler controller scales pods, as described in Horizontal Pod Autoscaler Walkthrough.

The primary metric used for scaling should be the number of shoots already scheduled onto a seed, either as a direct value or as a percentage of the seed’s capacity for shoots introduced in Ensuring Seeds Capacity for Shoots Is Not Exceeded (utilization). Later, custom metrics based on other resources, including provider-specific resources, could be considered as well.

Note: Even if the controller is called Horizontal Pod Autoscaler, it is capable of scaling any resource with a scale subresource, using any custom metric. Therefore, initially it was proposed to use this controller directly. However, a number of important drawbacks were identified with this approach, and so it is no longer proposed here.

SeedAutoscaler Resource

The SeedAutoscaler automatically scales the number of ManagedSeeds in a ManagedSeedSet based on observed resource utilization. The resource could be any resource that is tracked via the capacity and allocatable fields in the Seed status, including in particular the number of shoots already scheduled onto the seed.

The SeedAutoscaler is implemented as a custom resource and a new controller. The resource determines the behavior of the controller. The SeedAutoscaler resource has a scaleTargetRef that specifies the target resource to be scaled, the minimum and maximum number of replicas, as well as a list of metrics. The only supported metric type initially is Resource for resources that are tracked via the capacity and allocatable fields in the Seed status. The resource target can be of type Utilization or AverageValue.

Example SeedAutoscaler resource:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: SeedAutoscaler
metadata:
  name: crazy-botany
  namespace: garden
spec:
  scaleTargetRef:
    apiVersion: seedmanagement.gardener.cloud/v1alpha1
    kind: ManagedSeedSet
    name: crazy-botany
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource # Only Resource is supported
    resource:
      name: shoots
      target:
        type: Utilization # Utilization or AverageValue
        averageUtilization: 50

SeedAutoscaler Controller

SeedAutoscaler resources are reconciled by a new seed autoscaler controller, either in gardener-controller-manager or out-of-tree, similarly to cluster-autoscaler. The controller periodically adjusts the number of replicas in a ManagedSeedSet to match the observed average resource utilization to the target specified by the user.
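
Assuming the controller adopts the same scaling formula as the Horizontal Pod Autoscaler referenced above (an assumption rather than something this GEP prescribes), the desired replica count would be desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization). For example, with 4 seeds that each have an allocatable shoots value of 100 and host 300 shoots in total, the average utilization is 75%; with a target of 50%, the ManagedSeedSet would be scaled to ceil(4 * 75 / 50) = 6 replicas.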

Note: The SeedAutoscaler controller should perhaps not be limited to evaluating only metrics; it could also take into account taints, label selectors, etc. This is not yet reflected in the example SeedAutoscaler resource above. Such details are intentionally not specified in this GEP; they should be further explored in the issues created to track the actual implementation.

Evaluating Metrics for Autoscaling

The metrics used by the controller, for example the shoots metric above, could be evaluated in one of the following ways:

  • Directly, by looking at the capacity and allocatable fields in the Seed status and comparing to the actual resource consumption calculated by simply counting all shoots that meet certain criteria (e.g. shoots that are scheduled onto the seed), then taking an average over all seeds in the set.
  • By sampling existing metrics exported for example by gardener-metrics-exporter.

The second approach decouples the seed autoscaler controller from the actual metrics evaluation, and therefore allows plugging in new metrics more easily. It also has the advantage that the exported metrics could also be used for other purposes, e.g. for triggering Prometheus alerts or building Grafana dashboards. It has the disadvantage that the seed autoscaler controller would depend on the metrics exporter to do its job properly.

11 - 17 Shoot Control Plane Migration Bad Case

Shoot Control Plane Migration “Bad Case” Scenario

The migration flow described as part of GEP-7 can only be executed if both the Garden cluster and source seed cluster are healthy, and gardenlet in the source seed cluster can connect to the Garden cluster. In this case, gardenlet can directly scale down the shoot’s control plane in the source seed, after checking the spec.seedName field.

However, there might be situations in which gardenlet in the source seed cluster can’t connect to the Garden cluster and determine that spec.seedName has changed. Similarly, the connection to the seed kube-apiserver could also be broken. This might be caused by issues with the seed cluster itself. In other situations, the migration flow steps in the source seed might have started but might not be able to finish successfully. In all such cases, it should still be possible to migrate a shoot’s control plane to a different seed, even though executing the migration flow steps in the source seed might not be possible. The potential “split brain” situation caused by having the shoot’s control plane components attempting to reconcile the shoot resources in two different seeds must still be avoided, by ensuring that the shoot’s control plane in the source seed is deactivated before it is activated in the destination seed.

The mechanisms and adaptations described below have been tested as part of a PoC prior to describing them here.

Owner Election / Copying Snapshots

To achieve the goals outlined above, an “owner election” (or rather, “ownership passing”) mechanism is introduced to ensure that the source and destination seeds are able to successfully negotiate a single “owner” during the migration. This mechanism is based on special owner DNS records that uniquely identify the seed that currently hosts the shoot’s control plane (“owns” the shoot).

For example, for a shoot named i500152-gcp in project dev that uses an internal domain suffix internal.dev.k8s.ondemand.com and is scheduled on a seed with an identity shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev, the owner DNS record is a TXT record with a domain name owner.i500152-gcp.dev.internal.dev.k8s.ondemand.com and a single value shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev. The owner DNS record is created and maintained by reconciling an owner DNSRecord resource.
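
A sketch of the corresponding owner DNSRecord extension resource for this example is shown below. The domain name and value are taken from the prose above; the resource shape follows the existing DNSRecord extension API, but the metadata names, the provider type, and the secret reference are assumptions:

apiVersion: extensions.gardener.cloud/v1alpha1
kind: DNSRecord
metadata:
  name: i500152-gcp-owner # assumed name
  namespace: shoot--dev--i500152-gcp # assumed shoot control plane namespace in the seed
spec:
  type: google-clouddns # assumed DNS provider type
  secretRef:
    name: internal-domain-credentials # assumed secret with DNS provider credentials
  name: owner.i500152-gcp.dev.internal.dev.k8s.ondemand.com
  recordType: TXT
  values:
  - shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev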

Unlike other extension resources, the owner DNSRecord resource is not reconciled every time the shoot is reconciled, but only when the resource is created. Therefore, the owner DNS record value (the owner ID) is updated only when the shoot is migrated to a different seed. For more information, see Add handling of owner DNSRecord resources.

The owner DNS record domain name and owner ID are passed to components that need to perform ownership checks, such as the backup-restore container of the etcd-main StatefulSet, and all extension controllers. These components then check regularly whether the actual owner ID (the value of the record) matches the passed ID. If the two IDs don’t match, the ownership check is considered failed, which causes the special behavior described below.

Note: A previous revision of this document proposed using “sync objects” written to and read from the backup container of the source seed as JSON files by the etcd-backup-restore processes in both seeds. With the introduction of owner DNS records such sync objects are no longer needed.

For the destination seed to actually become the owner, it needs to acquire the shoot’s etcd data by copying the final full snapshot (and potentially also older snapshots) from the backup container of the source seed.

The mechanism to copy the snapshots and pass the ownership from the source to the destination seed consists of the following steps:

  1. The reconciliation flow (“restore” phase) is triggered in the destination seed without first executing the migration flow in the source seed (or perhaps it was executed, but it failed, and its state is currently unknown).

  2. The owner DNSRecord resource is created in the destination seed. As a result, the actual owner DNS record is updated with the destination seed ID. From this point, ownership checks by the etcd-backup-restore process and extension controller watchdogs in the source seed will fail, which will cause the special behavior described below.

  3. An additional “source” backup entry referencing the source seed backup bucket is deployed to the Garden cluster and the destination seed and reconciled by the backup entry controller. As a result, a secret with the appropriate credentials for accessing the source seed backup container named source-etcd-backup is created in the destination seed. The normal backup entry (referencing the destination seed backup container) is also deployed and reconciled, as usual, resulting in the usual etcd-backup secret being created.

  4. A special “copy” version of the etcd-main Etcd resource is deployed to the destination seed. In its backup section, this resource contains a sourceStore in addition to the usual store, which contains the parameters needed to use the source seed backup container, such as its name and the secret created in the previous step.

    spec:
      backup:
        ...
        store:
          container: 408740b8-6491-415e-98e6-76e92e5956ac
          secretRef:
            name: etcd-backup
          ...
        sourceStore:
          container: d1435fea-cd5e-4d5b-a198-81f4025454ff
          secretRef:
            name: source-etcd-backup
          ...
    
  5. The etcd-druid in the destination seed reconciles the above resource by deploying an etcd-copy Job that contains a single backup-restore container. It executes the newly introduced copy command of etcd-backup-restore that copies the snapshots from the source to the destination backup container.

  6. Before starting the copy itself, the etcd-backup-restore process in the destination seed checks if a final full snapshot (a full snapshot marked as final=true) exists in the backup container. If such a snapshot is not found, it waits for it to appear in order to proceed. This waiting is up to a certain timeout that should be sufficient for a full snapshot to be taken; after this timeout has elapsed, it proceeds anyway, and the reconciliation flow continues from step 9. As described in Handling Inability to Access the Backup Container below, this is safe to do.

  7. The etcd-backup-restore process in the source seed detects that the owner ID in the owner DNS record is different from the expected owner ID (because it was updated in step 2) and switches to a special “final snapshot” mode. In this mode the regular snapshotter is stopped, the readiness probe of the main etcd container starts returning 503, and one final full snapshot is taken. This snapshot is marked as final=true in order to ensure that it’s only taken once, and in order to enable the etcd-backup-restore process in the destination seed to find it (see step 6).

    Note: While testing our PoC, we noticed that simply making the readiness probe of the main etcd container fail doesn’t terminate the existing open connections from kube-apiserver to etcd. For this to happen, either the kube-apiserver or the etcd process has to be restarted at least once. Therefore, when the snapshotter is stopped because an ownership change has been detected, the main etcd process is killed (using SIGTERM to allow graceful termination) to ensure that any open connections from kube-apiserver are terminated. For this to work, the 2 containers must share the process namespace.

  8. Since the kube-apiserver process in the source seed is no longer able to connect to etcd, all shoot control plane controllers (kube-controller-manager, kube-scheduler, machine-controller-manager, etc.) and extension controllers reconciling shoot resources in the source seed that require a connection to the shoot in order to work start failing. All remaining extension controllers are prevented from reconciling shoot resources via the watchdogs mechanism. At this point, the source seed has effectively lost its ownership of the shoot, and it is safe for the destination seed to assume the ownership.

  9. After the etcd-backup-restore process in the destination seed detects that a final full snapshot exists, it copies all snapshots (or a subset of all snapshots) from the source to the destination backup container. When this is done, the Job finishes successfully which signals to the reconciliation flow that the snapshots have been copied.

    Note: To save time, only the final full snapshot taken in step 7, or a subset defined by some criteria, could be copied, instead of all snapshots.

  10. The special “copy” version of the etcd-main Etcd resource is deleted from the destination seed, and as a result the etcd-copy Job is also deleted by etcd-druid.

  11. The additional “source” backup entry referencing the source seed backup container is deleted from the Garden cluster and the destination seed. As a result, its corresponding source-etcd-backup secret is also deleted from the destination seed.

  12. From this point, the reconciliation flow proceeds as already described in GEP-7. This is safe, since the source seed cluster is no longer able to interfere with the shoot.

Handling Inability to Access the Backup Container

The mechanism described above assumes that the etcd-backup-restore process in the source seed is able to access its backup container in order to take snapshots. If this is not the case, but an ownership change was detected, the etcd-backup-restore process still sets the readiness probe status of the main etcd container to 503, and kills the main etcd process as described above to ensure that any open connections from kube-apiserver are terminated. This effectively deactivates the source seed control plane to ensure that the ownership of the shoot can be passed to a different seed.

Because of this, the etcd-backup-restore process in the destination seed that is responsible for copying the snapshots can avoid waiting forever for a final full snapshot to appear. Instead, after a certain timeout has elapsed, it can proceed with the copying. In this situation, whatever latest snapshot is found in the source backup container will be restored in the destination seed. The shoot is still migrated to a healthy seed at the cost of losing the etcd data that accumulated between the point in time when the connection to the source backup container was lost, and the point in time when the source seed cluster was deactivated.

When the connection to the backup container is restored in the source seed, a final full snapshot will be eventually taken. Depending on the stage of the restoration flow in the destination seed, this snapshot may be copied to the destination seed and restored, or it may simply be ignored since the snapshots have already been copied.

Handling Inability to Resolve the Owner DNS Record

The situation when the owner DNS record cannot be resolved is treated similarly to a failed ownership check: the etcd-backup-restore process sets the readiness probe status of the main etcd container to 503, and kills the main etcd process as described above to ensure that any open connections from kube-apiserver are terminated, effectively deactivating the source seed control plane. The final full snapshot is not taken in this case to ensure that the control plane can be re-activated if needed.

When the owner DNS record can be resolved again, the following 2 situations are possible:

  • If the source seed is still the owner of the shoot, the etcd-backup-restore process will set the readiness probe status of the main etcd container to 200, so kube-apiserver will be able to connect to etcd and the source seed control plane will be activated again.
  • If the source seed is no longer the owner of the shoot, the etcd readiness probe will continue to fail, and the source seed control plane will remain inactive. In addition, the final full snapshot will be taken at this time, for the same reason as described in Handling Inability to Access the Backup Container.

Note: We expect that actual DNS outages are extremely unlikely. A more likely reason for an inability to resolve a DNS record could be network issues with the underlying infrastructure. In such cases, the shoot would usually not be usable / reachable anyway, so deactivating its control plane would not cause a worse outage.

Migration Flow Adaptations

Certain changes to the migration flow are needed in order to ensure that it is compatible with the owner election mechanism described above. Instead of taking a full snapshot of the source seed etcd, the flow deletes the owner DNS record by deleting the owner DNSRecord resource. This causes the ownership check by etcd-backup-restore to fail, and the final full snapshot to be eventually taken, so the migration flow waits for a final full snapshot to appear as the last step before deleting the shoot namespace in the source seed. This ensures that the reconciliation flow described above will find a final full snapshot waiting to be copied at step 6.

Checking for the final full snapshot is performed by calling the already existing etcd-backup-restore endpoint snapshot/latest. This is possible, since the backup-restore container is always running at this point.

After the final full snapshot has been taken, the readiness probe of the main etcd container starts failing, which means that if the migration flow is retried due to an error it must skip the step that waits for etcd-main to become ready. To determine if this is the case, a check whether the final full snapshot has been taken or not is performed by calling the same etcd-backup-restore endpoint, e.g. snapshot/latest. This is possible if the etcd-main Etcd resource exists with non-zero replicas. Otherwise:

  • If the resource doesn’t exist, it must have been already deleted, so the final full snapshot must have been already taken.
  • If it exists with zero replicas, the shoot must be hibernated, and the migration flow must have never been executed (since it scales up etcd as one of its first steps), so the final full snapshot must not have been taken yet.

Extension Controller Watchdogs

Some extension controllers will stop reconciling shoot resources after the connection to the shoot’s kube-apiserver is lost. Others, most notably the infrastructure controller, will not be affected. Even though new shoot reconciliations won’t be performed by gardenlet, such extension controllers might be stuck in a retry loop triggered by a previous reconciliation, which may cause them to reconcile their resources after gardenlet has already stopped reconciling the shoot. In addition, a reconciliation started when the seed still owned the shoot might take some time and therefore might still be running after the ownership has changed. To ensure that the source seed is completely deactivated, an additional safety mechanism is needed.

This mechanism should handle the following interesting cases:

  • gardenlet cannot connect to the Garden kube-apiserver. In this case it cannot fetch shoots and therefore does not know if control plane migration has been triggered. Even though gardenlet will not trigger new reconciliations, extension controllers could still attempt to reconcile their resources if they are stuck in a retry loop from a previous reconciliation, and already running reconciliations will not be stopped.
  • gardenlet cannot connect to the seed’s kube-apiserver. In this case gardenlet knows if migration has been triggered, but it will not start shoot migration or reconciliation as it will first check the seed conditions and try to update the Cluster resource, both of which will fail. Extension controllers could still be able to connect to the seed’s kube-apiserver (if they are not running where gardenlet is running), and similarly to the previous case, they could still attempt to reconcile their resources.
  • The seed components (etcd-druid, extension controllers, etc) cannot connect to the seed’s kube-apiserver. In this case extension controllers would not be able to reconcile their resources as they cannot fetch them from the seed’s kube-apiserver. When the connection to the kube-apiserver comes back, the controllers might be stuck in a retry loop from a previous reconciliation, or the resources could still be annotated with gardener.cloud/operation=reconcile. This could lead to a race condition depending on who manages to update or get the resources first. If gardenlet manages to update the resources before they are read by the extension controllers, they would be properly updated with gardener.cloud/operation=migrate. Otherwise, they would be reconciled as usual.

Note: A previous revision of this document proposed using “cluster leases” as such an additional safety mechanism. With the introduction of owner DNS records cluster leases are no longer needed.

The safety mechanism is based on extension controller watchdogs. These are simply additional goroutines that are started when a reconciliation is started by an extension controller. These goroutines perform an ownership check on a regular basis using the owner DNS record, similar to the check performed by the etcd-backup-restore process described above. If the check fails, the watchdog cancels the reconciliation context, which immediately aborts the reconciliation.

Note: The dns-external extension controller is the only extension controller that neither needs the shoot’s kube-apiserver, nor uses the watchdog mechanism described here. Therefore, this controller will continue reconciling DNSEntry resources even after the source seed has lost the ownership of the shoot. With the PoC, we manually delete the DNSOwner resources from the source seed cluster to prevent this from happening. Eventually, the dns-external controller should be adapted to use the owner DNS records to ensure that it disables itself after the seed has lost the ownership of the shoot. Changes in this direction have already been agreed and relevant PRs proposed.

12 - Bastion Management and SSH Key Pair Rotation

GEP-15: Bastion Management and SSH Key Pair Rotation

Motivation

gardenctl (v1) has the functionality to set up ssh sessions to the targeted shoot cluster (nodes). To this end, infrastructure resources like VMs, public IPs, firewall rules, etc. have to be created. gardenctl will clean up the resources after termination of the ssh session (or rather when the operator is done with her work). However, there were issues in the past where these infrastructure resources were not properly cleaned up afterwards, e.g. due to some error (no retries either). Hence, the proposal is to have a dedicated controller (for each infrastructure) that manages the infrastructure resources and their cleanup.

The current gardenctl also re-uses the ssh node credentials for the bastion host. While that’s possible, it would be safer to use personal or generated ssh key pairs to access the bastion host. The static shoot-specific ssh key pair should be rotated regularly, e.g. once in the maintenance time window. This also means that we cannot create the node VMs anymore with infrastructure public keys, as these cannot be revoked or rotated (e.g. in AWS) without terminating the VM itself.

Changes to the Bastion resource should only be allowed for controllers on seeds that are responsible for it. This cannot be restricted when using custom resources. The proposal, as outlined below, suggests to implement the necessary changes in the gardener core components and to adapt the SeedAuthorizer to consider Bastion resources that the Gardener API Server serves.

Goals

  • Operators can request and will be granted time-limited ssh access to shoot cluster nodes via bastion hosts.
  • To that end, requestors must present their public ssh key and only this will be installed into sshd on the bastion hosts.
  • The bastion hosts will be firewalled and ingress traffic will be permitted only from the client IP of the requestor. Except for traffic on port 22 to the cluster worker nodes, no egress from the bastion is allowed.
  • The actual node ssh private key (resp. key pair) will be rotated by Gardener and access to the nodes is only possible with this constantly rotated key pair and not with the personal one that is used only for the bastion host.
  • Bastion host and access is granted only for the extent of this operator request (of course multiple ssh sessions are possible, in parallel or repeatedly, but after “the time is up”, access is no longer possible).
  • By these means (personal public key and allow-listed client IP) nobody else can use (a.k.a. impersonate) the requestor (not even other operators).
  • Necessary infrastructure resources for ssh access (such as VMs, public IPs, firewall rules, etc.) are automatically created and also terminated after usage, but at the latest after the above mentioned time span is up.

Non-Goals

  • Node-specific access
  • Auditability on operating system level (not only auditing the ssh login, but everything that is done on a node and other respective resources, e.g. by using dedicated operating system users)
  • Reuse of temporarily created necessary infrastructure resources by different users

Proposal

Involved Components

The following is a list of involved components that either need to be newly introduced or extended if they already exist:

  • Gardener API Server (GAPI)
  • gardenlet
    • Deploys Bastion CRD under the extensions.gardener.cloud API Group to the Seed, see resource example below
    • Similar to BackupBuckets or BackupEntry, the gardenlet watches the Bastion resource in the garden cluster and creates a seed-local Bastion resource, on which the provider specific bastion controller acts upon
  • gardenctlv2 (or any other client)
    • Creates Bastion resource in the garden cluster
    • Establishes an ssh connection to a shoot node, using a bastion host as proxy
    • Heartbeats / keeps alive the Bastion resource during ssh connection
  • Gardener extension provider
  • Gardener Controller Manager (GCM)
    • Bastion heartbeat controller
      • Cleans up Bastion resource on missing heartbeat.
      • Is configured with a maxLifetime for the Bastion resource
  • Gardener (RBAC)

SSH Flow

  1. Users should only get the RBAC permission to create / update Bastion resources for a namespace, if they should be allowed to ssh onto the shoot nodes in this namespace. A project member with admin role will have these permissions.
  2. User/gardenctlv2 creates Bastion resource in garden cluster (see resource example below)
    • First, gardenctl would figure out the public IP of the user’s machine, either by calling an external service (gardenctl (v1) uses https://github.com/gardener/gardenctl/blob/master/pkg/cmd/miscellaneous.go#L226) or by calling a binary that prints the public IP(s) to stdout. The binary should be configurable. The result is set under spec.ingress[].ipBlock.cidr
    • Creates new ssh key pair. The newly created key pair is used only once for each bastion host, so it has a 1:1 relationship to it. It is cleaned up after it is not used anymore, e.g. if the Bastion resource was deleted.
    • The public ssh key is set under spec.sshPublicKey
    • The targeted shoot is set under spec.shootRef
  3. GAPI Admission Plugin for the Bastion resource in the garden cluster
    • on creation, sets metadata.annotations["gardener.cloud/created-by"] according to the user that created the resource
    • when gardener.cloud/operation: keepalive is set it will be removed by GAPI from the annotations and status.lastHeartbeatTimestamp will be set with the current timestamp. The status.expirationTimestamp will be calculated by taking the last heartbeat timestamp and adding x minutes (configurable, default 60 Minutes).
    • validates that only the creator of the bastion (see gardener.cloud/created-by annotation) can update spec.ingress
    • validates that a Bastion can only be created for a Shoot if that Shoot is already assigned to a Seed
    • sets spec.seedName and spec.providerType based on the spec.shootRef
  4. gardenlet
  5. Gardener extension provider / Bastion Controller on Seed:
    • With own Bastion Custom Resource Definition in the seed under the api group extensions.gardener.cloud/v1alpha1
    • Watches Bastion custom resources that are created by the gardenlet in the seed
    • Controller reads cloudprovider credentials from seed-shoot namespace
    • Deploy infrastructure resources
      • Bastion VM. Uses user data from spec.userData
      • attaches public IP, creates security group, firewall rules, etc.
    • Updates status of Bastion resource:
      • With bastion IP under status.ingress.ip or hostname under status.ingress.hostname
      • Updates the status.lastOperation with the status of the last reconcile operation
  6. gardenlet
    • Syncs back the status.ingress and status.conditions of the Bastion resource in the seed to the garden cluster in case it changed
  7. gardenctl
    • initiates ssh session once status.conditions['BastionReady'] is true of the Bastion resource in the garden cluster
      • locates private ssh key matching spec.sshPublicKey which was configured beforehand by the user
      • reads bastion IP (status.ingress.ip) or hostname (status.ingress.hostname)
      • reads the private key from the ssh key pair for the shoot node
      • opens ssh connection to the bastion and from there to the respective shoot node
    • runs heartbeat in parallel as long as the ssh session is open by annotating the Bastion resource with gardener.cloud/operation: keepalive
  8. GCM:
    • Once status.expirationTimestamp is reached, the Bastion will be marked for deletion
  9. gardenlet:
    • Once the Bastion resource in the garden cluster is marked for deletion, it marks the Bastion resource in the seed for deletion
  10. Gardener extension provider / Bastion Controller on Seed:
    • all created resources will be cleaned up
    • On success, removes finalizer on Bastion resource in seed
  11. gardenlet:
    • removes finalizer on Bastion resource in garden cluster

Resource Example

Bastion resource in the garden cluster

apiVersion: operations.gardener.cloud/v1alpha1
kind: Bastion
metadata:
  generateName: cli-
  name: cli-abcdef
  namespace: garden-myproject
  annotations:
    gardener.cloud/created-by: foo # immutable, set by the GAPI Admission Plugin
    # gardener.cloud/operation: keepalive # this annotation is removed by the GAPI and the status.lastHeartbeatTimestamp and status.expirationTimestamp will be updated accordingly
spec:
  shootRef: # namespace cannot be set / it's the same as .metadata.namespace
    name: my-cluster # immutable

  # the following fields are set by the GAPI
  seedName: aws-eu2
  providerType: aws

  sshPublicKey: c3NoLXJzYSAuLi4K # immutable, public `ssh` key of the user

  ingress: # can only be updated by the creator of the bastion
  - ipBlock:
      cidr: 1.2.3.4/32 # public IP of the user. CIDR is a string representing the IP Block. Valid examples are "192.168.1.1/24" or "2001:db9::/64"

status:
  observedGeneration: 1

  # the following fields are managed by the controller in the seed and synced by gardenlet
  ingress: # IP or hostname of the bastion
    ip: 1.2.3.5
    # hostname: foo.bar

  conditions:
  - type: BastionReady # when the `status` is true of condition type `BastionReady`, the client can initiate the `ssh` connection
    status: 'True'
    lastTransitionTime: "2021-03-19T11:59:00Z"
    lastUpdateTime: "2021-03-19T11:59:00Z"
    reason: BastionReady
    message: Bastion for the cluster is ready.

  # the following fields are only set by the GAPI
  lastHeartbeatTimestamp: "2021-03-19T11:58:00Z" # will be set when setting the annotation gardener.cloud/operation: keepalive
  expirationTimestamp: "2021-03-19T12:58:00Z" # extended on each keepalive

Bastion custom resource in the seed cluster

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Bastion
metadata:
  name: cli-abcdef
  namespace: shoot--myproject--mycluster
spec:
  userData: |- # this is normally base64-encoded, but decoded for the example. Contains spec.sshPublicKey from Bastion resource in garden cluster
    #!/bin/bash
    # create user
    # add ssh public key to authorized_keys
    # ...

  ingress:
  - ipBlock:
      cidr: 1.2.3.4/32

  type: aws # from extensionsv1alpha1.DefaultSpec

status:
  observedGeneration: 1
  ingress:
    ip: 1.2.3.5
    # hostname: foo.bar
  conditions:
  - type: BastionReady
    status: 'True'
    lastTransitionTime: "2021-03-19T11:59:00Z"
    lastUpdateTime: "2021-03-19T11:59:00Z"
    reason: BastionReady
    message: Bastion for the cluster is ready.

SSH Key Pair Rotation

Currently, the ssh key pair for the shoot nodes is created once during shoot cluster creation. This key pair should be rotated on a regular basis.

Rotation Proposal

  • gardeneruser original user data component:
    • The gardeneruser create script should be changed into a reconcile script, and renamed accordingly. It needs to be adapted so that the authorized_keys file will be updated / overwritten with the current and old ssh public key from the cloud-config user data.
  • Rotation trigger:
    • Once in the maintenance time window
    • On demand, by annotating the shoot with gardener.cloud/operation: rotate-ssh-keypair (see the example after this list)
  • On rotation trigger:
    • gardenlet
      • Prerequisite of ssh key pair rotation: all nodes of all the worker pools have successfully applied the desired version of their cloud-config user data
      • Creates or updates the secret ssh-keypair.old with the content of ssh-keypair in the seed-shoot namespace. The old private key can be used by clients as fallback, in case the new ssh public key is not yet applied on the node
      • Generates new ssh-keypair secret
      • The OperatingSystemConfig needs to be re-generated and deployed with the new and old ssh public key
    • As usual (for more details, see here):
      • Once the cloud-config-<X> secret in the kube-system namespace of the shoot cluster is updated, it will be picked up by the downloader script (checks every 30s for updates)
      • The downloader runs the “execution” script from the cloud-config-<X> secret
      • The “execution” script includes also the original user data script, which it writes to PATH_CLOUDCONFIG, compares it against the previous cloud config and runs the script in case it has changed
      • Running the original user data script will also run the gardeneruser component, where the authorized_keys file will be updated
      • After the most recent cloud-config user data was applied, the “execution” script annotates the node with checksum/cloud-config-data: <cloud-config-checksum> to indicate the success
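
As an illustration, a minimal sketch of triggering the rotation on demand; the annotation key is the one proposed above, while the shoot name and namespace are hypothetical:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-cluster # hypothetical shoot name
  namespace: garden-myproject # hypothetical project namespace
  annotations:
    gardener.cloud/operation: rotate-ssh-keypair # requests an ssh key pair rotation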

Limitations

Each operating system has its own default user (e.g. core, admin, ec2-user, etc.). These users get their SSH keys during VM creation (however, there is a different handling on Google Cloud Platform, as stated below). These keys currently are neither rotated nor removed from the authorized_keys file. This means that the initial ssh key will still be valid for the default operating system user.

On Google Cloud Platform, the VMs do not have any static users (i.e. no gardener user) and there is an agent on the nodes that syncs the users with their SSH keypairs from the GCP IAM service.

13 - Dynamic kubeconfig generation for Shoot clusters

GEP-16: Dynamic kubeconfig generation for Shoot clusters

Table of Contents

Summary

This GEP introduces a new Shoot subresource called AdminKubeconfigRequest, allowing users to dynamically generate a short-lived kubeconfig that can be used to access the Shoot cluster as cluster-admin.

Motivation

Today, when access to the created Shoot clusters is needed, a kubeconfig with static token credentials is used. This static token is in the system:masters group, granting it cluster-admin privileges. The kubeconfig is generated when the cluster is reconciled, stored in ShootState and replicated in the Project’s namespace in a Secret. End-users can fetch the secret and use the kubeconfig inside it.

There are several problems with this approach:

  • The token in the kubeconfig does not have any expiration, so end-users have to request a kubeconfig credential rotation if they want to revoke the token.
  • There is no user identity in the token, e.g. if user Joe gets the kubeconfig from the Secret, the user in that token would be system:cluster-admin and not Joe when accessing the Shoot cluster with it. This makes auditing events in the cluster almost impossible.

Goals

  • Add a Shoot subresource called adminkubeconfig that would produce a kubeconfig used to access that Shoot cluster.

  • The kubeconfig is not stored in the API Server, but generated for each request.

  • In the AdminKubeconfigRequest sent to that subresource, end-users can specify the expiration time of the credential.

  • The identity (user) in the Gardener cluster would be part of the identity in the x509 client certificate. E.g., if Joe authenticates against the Gardener API server, the generated certificate for Shoot authentication would have the following subject:

    • Common Name: Joe
    • Organisation: system:masters
  • The maximum validity of the certificate can be enforced by setting a flag on the gardener-apiserver.

  • Deprecate and remove the old {shoot-name}.kubeconfig secrets in each Project namespace.

Non-Goals

  • Generate OpenID Connect kubeconfigs

Proposal

The gardener-apiserver would serve a new shoots/adminkubeconfig subresource. It only accepts CREATE calls with an AdminKubeconfigRequest. An AdminKubeconfigRequest would have the following structure:

apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
  expirationSeconds: 3600

Where expirationSeconds is the validity of the certificate in seconds. In this case it would be 1 hour. The maximum validity of an AdminKubeconfigRequest is configured by the --shoot-admin-kubeconfig-max-expiration flag in the gardener-apiserver.

When such a request is received, the API server would find the ShootState associated with that cluster and generate a kubeconfig. The x509 client certificate would be signed by the Shoot cluster’s CA, and the user in the subject’s common name would be taken from the User.Info used to make the request.

apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
  expirationSeconds: 3600
status:
  expirationTimestamp: "2021-02-22T09:06:51Z"
  kubeConfig: # this is normally base64-encoded, but decoded for the example
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: LS0tLS1....
        server: https://api.shoot-cluster
      name: shoot-cluster-a
    contexts:
    - context:
        cluster: shoot-cluster-a
        user: shoot-cluster-a
      name: shoot-cluster-a
    current-context: shoot-cluster-a
    kind: Config
    preferences: {}
    users:
    - name: shoot-cluster-a
      user:
        client-certificate-data: LS0tLS1CRUd...
        client-key-data: LS0tLS1CRUd...

A new feature gate called AdminKubeconfigRequest enables the above-mentioned API in the gardener-apiserver. The old {shoot-name}.kubeconfig is kept, but deprecated and will be removed in the future.

In order to get the server’s address used in the kubeconfig, the Shoot’s status should be updated with new entries:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: crazy-botany
  namespace: garden-dev
spec: {}
status:
  advertisedAddresses:
  - name: external
    url: https://api.shoot-cluster.external.foo
  - name: internal
    url: https://api.shoot-cluster.internal.foo
  - name: ip
    url: https://1.2.3.4

This is needed because the Gardener API server might not know which IP address the Shoot's API server is advertised on (e.g. when DNS is disabled).

If there are multiple entries, each would be added as a separate cluster in the kubeconfig, and a context with the same name would be added as well. The current context would be set to the first entry in the advertisedAddresses list (.status.advertisedAddresses[0]).
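
For illustration, a shortened sketch of such a kubeconfig with two advertised addresses; the cluster and context names derived from the address names are assumptions:

apiVersion: v1
kind: Config
current-context: shoot-cluster-a-external # first entry in .status.advertisedAddresses
clusters:
- name: shoot-cluster-a-external
  cluster:
    certificate-authority-data: LS0tLS1....
    server: https://api.shoot-cluster.external.foo
- name: shoot-cluster-a-internal
  cluster:
    certificate-authority-data: LS0tLS1....
    server: https://api.shoot-cluster.internal.foo
contexts:
- name: shoot-cluster-a-external
  context:
    cluster: shoot-cluster-a-external
    user: shoot-cluster-a
- name: shoot-cluster-a-internal
  context:
    cluster: shoot-cluster-a-internal
    user: shoot-cluster-a
users:
- name: shoot-cluster-a
  user:
    client-certificate-data: LS0tLS1CRUd...
    client-key-data: LS0tLS1CRUd...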

Alternatives

14 - GEP Title

GEP-NNNN: Your short, descriptive title

Table of Contents

Summary

Motivation

Goals

Non-Goals

Proposal

Alternatives

15 - Highly Available Shoot Control Planes

GEP20 - Highly Available Shoot Control Planes

Table of Contents

Summary

Today, Gardener offers high availability only for some shoot control plane components (like the Kubernetes API Server and the Gardener Resource Manager), which are deployed with multiple replicas and distributed across nodes. Many of the other critical control plane components, including etcd, are only offered with a single replica, making them susceptible to both node and zone failures and the resulting downtimes.

This GEP extends the failure domain tolerance for shoot control plane components as well as seed components to survive extensive node or availability zone (AZ) outages.

Motivation

High availability (HA) of Kubernetes control planes is desired to ensure continued operation, even in the case of partial failures of nodes or availability zones. Tolerance to common failure domains ranges from hardware (e.g. utility power sources and backup power sources, network switches, disk/data, racks, cooling systems etc.) to software.

Each consumer therefore needs to decide on the degree of failure isolation that is desired for the control plane of their respective shoot clusters.

Goals

  • Provision shoot clusters with highly available control planes (HA shoots) and a failure tolerance on node or AZ level. Consumers may enable/disable high-availability and choose failure tolerance between multiple nodes within a single zone or multiple nodes spread across multiple zones.
  • Migrating non-HA shoots to HA shoots. Failure tolerance on zone level is only possible if the shoot is already scheduled to a multi-zonal seed.
  • Scheduling HA shoots to adequate seeds.

Non-Goals

  • Setting up a highly available Gardener service.
  • Upgrading from a single-zone shoot control plane to a multi-zonal shoot control plane.
  • Failure domains on region level, i.e. multi-region control-planes.
  • Downgrading HA shoots to non-HA shoots.
  • In the current scope, three control plane components - Kube Apiserver, etcd and Gardener Resource Manager will be highly available. In the future, other components could be set up in HA mode.
  • To achieve HA we consider components to have at least three replicas. Greater failure tolerance is not targeted by this GEP.

High Availability

Topologies

Many shoot control plane (etcd, kube-apiserver, gardener-resource-manager, …) and seed system components (gardenlet, istio, etcd-druid, …) provide means to achieve high availability. Commonly these either run in an Active-Active or in an Active-Passive mode. Active-Active means that each component replica serves incoming requests (primarily intended for load balancing) whereas Active-Passive means that only one replica is active while others remain on stand-by.

For a high-availability setup, it is recommended to use an odd number of nodes (node tolerance) or zones (zone tolerance). This also follows the recommendations on the etcd cluster size. The recommendations for the number of zones are largely influenced by the quorum-based etcd cluster setup recommendations, as other shoot control plane components are either stateless or non-quorum-based stateful components.

Let’s take the following example to explain this recommendation further:

  • Seed clusters’ worker nodes are spread across two zones
  • Gardener would distribute a three member etcd cluster - AZ-1: 2 replicas, AZ-2: 1 replica
  • If AZ-1 goes down, quorum is lost and the only remaining etcd member enters a read-only state.
  • If AZ-2 goes down then:
    • If the leader is in AZ-2, then it will force a re-election and the quorum will be restored with 2 etcd members in AZ-1.
    • If the leader is not in AZ-2, then etcd cluster will still be operational without any downtime as the quorum is not lost.

Result:

  • There seems to be no clear benefit to spreading an etcd cluster across 2 zones as there is an additional cost of cross-zonal traffic that will be incurred due to communication amongst the etcd members and also due to API server communication with an etcd member across zones.
  • There is no significant gain in availability as compared to an etcd cluster provisioned within a single zone. Therefore, for regions having only 2 availability zones, it is recommended to spread the etcd cluster across nodes within a single AZ only.

Validation

To enforce that a highly available ManagedSeed is set up with an odd number of zones, additional checks need to be introduced in the admission plugin.

The minimum number of replicas required to achieve HA depends on the topology and on the requirements of each component, i.e. whether it runs in an active-active or active-passive mode.

Active-Active

  • If an application needs a quorum to operate (e.g. etcd), at least three replicas are required (ref).
  • Non-quorum based components are also supposed to run with a minimum count of three replicas to survive node/zone outages and support load balancing.

Active-Passive

  • Components running in an active-passive mode are expected to have at least two replicas, so that there is always one replica on stand-by.

Gardener Shoot API

Proposed changes

The following changes to the shoot API are suggested to enable the HA feature for a shoot cluster:

kind: Shoot
apiVersion: core.gardener.cloud/v1beta1
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: <node | zone>

The consumer can optionally specify highAvailability in the shoot spec. Failure tolerance of node (aka single-zone shoot clusters) signifies that the HA shoot control plane can tolerate a single node failure, whereas zone (aka multi-zone shoot clusters) signifies that the HA shoot control plane can withstand an outage of a single zone.

Gardener Scheduler

A scheduling request could be for an HA shoot with failure tolerance of node or zone. It is therefore required to select an appropriate seed.

Case #1: HA shoot with no seed assigned

Proposed Changes

Filtering candidate seeds

A new filter step needs to be introduced in the reconciler which selects candidate seeds. It ensures that shoots with zone tolerance are only scheduled to seeds which have worker nodes across multiple availability zones (aka multi-zonal seeds).

Scoring of candidate seeds

Today, after the Gardener Scheduler has filtered candidates and applied the configured strategy, it chooses the seed with the least scheduled shoots (ref).

This GEP intends to enhance this very last step by also taking the requested failure tolerance into consideration: If there are potential single- and multi-zonal candidates remaining, a single-zonal seed is always preferred for a shoot requesting no or node tolerance, independent from the utilization of the seed (also see this draft PR). A multi-zonal seed is only chosen if no single-zonal one is suitable after filtering was done and the strategy has been applied.

Motivation: It is expected that operators will prepare their landscapes for HA control-planes by changing worker nodes of existing seeds but also by adding completely new multi-zonal seed clusters. For the latter, multi-zonal seeds should primarily be reserved for multi-zonal shoots.

Case #2: HA shoot with assigned seed and updated failure tolerance

A shoot is already assigned to a non-HA seed. A change has been made to the shoot spec, setting the control plane HA failure tolerance to zone.

Proposed Change

  • If the shoot is not already scheduled on a multi-zonal seed, then the shoot admission plugin must deny the request.
  • Either the shoot owner creates the shoot from scratch, or they need to align with the Gardener operator, who has to move the shoot to a proper seed first via control plane migration (by editing the shoots/binding resource).
  • An automated control plane migration is deliberately not performed, as it involves considerable downtime and the feature itself was not stable at the time this GEP was written.

Setting up a Seed for HA

As mentioned in High Availability, certain aspects need to be considered for a seed cluster to host HA shoots. The following sections explain the requirements for a seed cluster to host a single- or multi-zonal HA shoot cluster.

Hosting a HA shoot control plane with node failure tolerance

To host an HA shoot control plane within a single zone, it should be ensured that each worker pool that potentially runs seed system or shoot control plane components has at least three nodes. This is also the minimum size that is required by an HA etcd cluster with a failure tolerance of a single node. Furthermore, the nodes must run in a single zone only (see Recommended number of nodes and zones).

Hosting a HA shoot control plane with zone failure tolerance

To host an HA shoot control plane across availability zones, worker pools should have a minimum of three nodes spread across an odd number of availability zones (min. 3).

An additional label seed.gardener.cloud/multi-zonal: true should be added to the seed, indicating that this seed is capable of hosting multi-zonal HA shoot control planes, which in turn helps the gardener scheduler to short-list such seeds as candidates.

In case of a ManagedSeed Gardener can add this label automatically to seed clusters if at least one worker pool fulfills the requirements mentioned above and doesn’t enforce Taints on its nodes. Gardener may in addition validate if the ManagedSeed is properly set up for the seed.gardener.cloud/multi-zonal: true label when it is added manually.
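
A minimal sketch of how such a seed could be labelled; the seed name is hypothetical, the label key and value are as proposed above:

apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: aws-eu1 # hypothetical seed name
  labels:
    seed.gardener.cloud/multi-zonal: "true" # considered by the scheduler when short-listing seeds for multi-zonal HA shoots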

Compute Seed Usage

At present, seed usage is computed by counting the number of shoot control planes hosted in a seed. Every seed advertises the number of shoots it can host in status.allocatable.shoots (configurable via ResourceConfiguration). Operators need to rethink this value for multi-zonal seed clusters.
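
For reference, a sketch of how this limit surfaces in the Seed status, assuming the currently mentioned limit of 250:

apiVersion: core.gardener.cloud/v1beta1
kind: Seed
status:
  allocatable:
    shoots: "250" # maximum number of shoot control planes this seed may host; to be revisited for multi-zonal seeds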

Which parameters could be considered?

  • Number of available machines of a type as requested in the shoot spec. Sufficient capacity should be available to also allow rolling updates, which are also governed by the maxSurge configuration at the worker pool level.
  • The node CIDR range must grant enough space to schedule the additional replicas that the HA feature requires. (For instance, etcd will require three times as many nodes compared to the current single node.)
  • If additional zones are added to an existing non-multi-zonal seed cluster to make it multi-zonal then care should be taken that zone specific CIDRs are appropriately checked and changed if required.
  • Number of volumes that will be required to host a multi-node/multi-zone etcd cluster will increase by (n-1) where n is the total number of members in the etcd cluster.

The above list is not exhaustive, but indicates that the currently set limit of 250 will have to be revisited.

Scheduling control plane components

Zone pinning

HA shoot clusters with failure tolerance of node as well as non-HA shoot clusters can be scheduled on single-zonal and multi-zonal seeds alike. On a multi-zonal seed, it's desirable to place components of the same control plane in one zone only, to reduce cost and latency effects due to cross-zonal network traffic. Thus, it's essential to add pod affinity rules to control plane components with multiple replicas:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            gardener.cloud/shoot: <technical-id>
            <labels>
        topologyKey: "topology.kubernetes.io/zone"

A special challenge is to select the entire set of control plane pods belonging to a single control plane. Today, Gardener and extensions don’t put a common label on the affected pods. We propose to introduce a new label gardener.cloud/shoot: <technical-id> where technical-id is the shoot namespace. A mutating webhook in the Gardener Resource Manager should apply this label and the affinity rules to every pod in the control plane. This label and the pod affinity rule will ensure that all the pods in the control plane are pinned to a specific zone for HA shoot clusters having failure tolerance of node.

Single-Zone

There are control plane components (like etcd) which require one member pod per node. The following anti-affinity rule guarantees that each etcd pod is scheduled onto a different node.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            <labels>
        topologyKey: "kubernetes.io/hostname"

For other control plane components which do not have a strict requirement of one replica per node, a more optimal scheduling strategy should be used. The following topology spread constraint provides better utilization of node resources, allowing the cluster autoscaler to downsize node groups if certain nodes are under-utilized.

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        <labels>

Using topology spread constraints (as described above) still ensures that, if more than one replica is defined for a control plane component, the replicas are distributed across more than one node, ensuring a failure tolerance of at least one node.

Multi-Zone (#replicas <= #zones)

If the replica count is equal to the number of available zones, then we can enforce the zone spread during scheduling.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            <labels>
      topologyKey: "topology.kubernetes.io/zone"

Multi-Zone (#replicas > #zones)

Enforcing a zone spread for components with a replica count higher than the total number of zones is not possible. In this case, we plan to rather use the following Pod Topology Spread Constraints, which allow a distribution over zones and nodes. The maxSkew value determines how big an imbalance of the pod distribution can be and thus allows scheduling a replica count beyond the number of availability zones (e.g. Kube-Apiserver).

NOTE:

  • During testing we found a few inconsistencies and some quirks (more information) which is why we rely on Topology Spread Constraints (TSC) only for this case.
  • In addition, to circumvent issue kubernetes/kubernetes#98215, Gardener is supposed to add a gardener.cloud/rolloutVersion label and increment the version every time the .spec of the component is changed (see workaround).

Update: kubernetes/kubernetes#98215 has recently been closed. A new feature gate has been created which is only available from Kubernetes 1.25 onwards.

spec:
  topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        <labels>
        gardener.cloud/rolloutVersion: <version>
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        <labels>
        gardener.cloud/rolloutVersion: <version>

Disruptions and zero downtime maintenance

A secondary effect of provisioning the affected seed system and shoot control plane components in an HA fashion is the support for zero downtime maintenance, i.e. a certain number of replicas always serves requests while an update is being rolled out. Therefore, proper PodDisruptionBudgets are required. With this GEP, it's planned to set spec.maxUnavailable: 1 for every involved and further mentioned component.
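
A minimal PodDisruptionBudget sketch as proposed; the component name is a hypothetical example and the selector placeholder follows the convention used above:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-apiserver # hypothetical example for one of the involved components
  namespace: shoot--myproject--mycluster
spec:
  maxUnavailable: 1 # at most one replica may be voluntarily disrupted at a time
  selector:
    matchLabels:
      <labels> # labels of the respective component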

Seed System Components

The following seed system components already run or are planned[*] to be configured with a minimum of two replicas:

  • Etcd-Druid* (active-passive)
  • Gardenlet* (active-passive)
  • Istio Ingress Gateway* (active-active)
  • Istio Control Plane* (active-passive)
  • Gardener Resource Manager (controllers: active-passive, webhooks: active-active)
  • Gardener Seed Admission Controller (active-active)
  • Nginx Ingress Controller (active-active)
  • Reversed VPN Auth Server (active-active)

The reason to run controllers in active-passive mode is that in case of an outage a stand-by instance can quickly take over the leadership, which reduces the overall downtime of that component in comparison to a single-replica instance that would first need to be evicted and re-scheduled (see Current Recovery Mechanisms).

In addition, the pods of above mentioned components will be configured with the discussed anti-affinity rules (see Scheduling control plane components). The Single-Zone case will be the default while Multi-Zone anti-affinity rules apply to seed system components, if the seed is labelled with seed.gardener.cloud/multi-zonal: true (see Hosting a multi-zonal HA shoot control plane).

Shoot Control Plane Components

Similar to the seed system components, the following shoot control plane components are considered critical, so Gardener ought to avoid any downtime for them. Thus, the current recovery mechanisms are considered insufficient if only one replica is involved.

Kube Apiserver

The Kube Apiserver’s HVPA resource needs to be adjusted in case of an HA shoot request:

spec:
  hpa:
    template:
      spec:
        minReplicas: 3

The discussed TSC in Scheduling control plane components applies as well.

Gardener Resource Manager

The Gardener Resource Manager is already set up with spec.replicas: 3 today. Only the Affinity and anti-affinity rules must be configured on top.

Etcd

In contrast to other components, it’s not trivial to run multiple replicas for etcd because different rules and considerations apply to form a quorum-based cluster (ref). Most of the complexity (e.g. cluster bootstrap, scale-up) is already outsourced to Etcd-Druid, and efforts have been made to support many use-cases already (see gardener/etcd-druid#107 and the Multi-Node etcd GEP). Please note that especially for etcd an active-passive alternative was evaluated here. Due to the complexity and implementation effort, it was decided to proceed with the active-active built-in support, but to keep this as a reference in case we see blockers in the future.

Gardener etcd component changes

With most of the complexity being handled by Etcd-Druid, Gardener still needs to implement the following requirements if HA is enabled (a rough sketch follows after this list):

  • Set etcd.spec.replicas: 3.
  • Set etcd.spec.etcd.schedulingConstraints to the matching anti-affinity rule.
  • Deploy NetworkPolicies that allow peer-to-peer communication between the etcd pods.
  • Create and pass a peer CA and server/client certificate to etcd.spec.etcd.peerUrlTls

The groundwork for this was already done by gardener/gardener#5741.
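
A rough sketch of the resulting Etcd resource; the field paths follow the list above, while the apiVersion, names and remaining values are assumptions:

apiVersion: druid.gardener.cloud/v1alpha1 # assumption
kind: Etcd
metadata:
  name: etcd-main
  namespace: shoot--myproject--mycluster
spec:
  replicas: 3 # three members for a quorum-based HA cluster
  etcd:
    peerUrlTls: {} # peer CA and server/client certificates created and passed by Gardener (contents omitted)
    schedulingConstraints:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                <labels>
            topologyKey: "topology.kubernetes.io/zone" # "kubernetes.io/hostname" for node failure tolerance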

Other critical components having single replica

The following shoot control plane components are currently set up with a single replica and are planned to run with a minimum of 2 replicas:

  • Cluster Autoscaler (if enabled)
  • Cloud Controller Manager (CCM)
  • Kube Controller Manager (KCM)
  • Kube Scheduler
  • Machine Controller Manager (MCM)
  • CSI driver controller

NOTE: MCM, CCM and CSI driver controller are components deployed by provider extensions. HA specific configuration should be configured there.

Additionally, affinity and anti-affinity rules must be configured.

Handling Outages

Node failures

It is possible that node(s) hosting control plane components are no longer available/reachable. Some of the reasons could be a crashing node, the kubelet on the node being unable to renew its lease, a network partition, etc. The topology of the control plane components and the recovery mechanisms will determine the duration of the downtime that results when a node is no longer reachable/available.

Impact of Node failure

Case #1

HA Failure Tolerance: node

For control plane components having multiple replicas, each replica will be provisioned on a different node (one per node) as per the scheduling constraints.

Since there are now fewer than the desired replicas of the shoot control plane pods, kube-scheduler will attempt to find another node while respecting the pod scheduling constraints. If a node satisfying the scheduling constraints is found, it will be chosen to schedule the control plane pods. If there are no nodes that satisfy the scheduling constraints, the pods must wait for the Cluster-Autoscaler to scale out the node group and for the Machine-Controller-Manager to provision a new node in the scaled node group. In the event the Machine-Controller-Manager is unable to create a new machine, the replica that was evicted from the failed node will be stuck in Pending state.

Impact on etcd

Assuming the default etcd cluster size of 3 members, the unavailability of one node is tolerated. If more than one node hosting control plane components goes down or is unavailable, the etcd cluster will lose quorum until new nodes are provisioned.

Case #2

HA Failure Tolerance: zone

For control plane components having multiple replicas, each replica will be spread across zones as per the scheduling constraints.

etcd

kube-scheduler will attempt to look for another node in the same zone, since the pod scheduling constraints will prevent it from scheduling the pod onto another zone. If a node is found on which no control plane components are deployed yet, that node will be chosen to schedule the control plane pods. If there are no nodes that satisfy the scheduling constraints, the pod must wait for the Machine-Controller-Manager to provision a new node. The reference to its PVC will also prevent an etcd member from getting scheduled in another zone, since persistent volumes are not shared across availability zones.

Kube Apiserver and Gardener Resource Manager

These control plane components will use the rules above, which allow the replica deployed on the now-deleted machine to be brought up on another node within the same zone. However, if there are no nodes available in that zone, then for the Gardener Resource Manager, which uses requiredDuringSchedulingIgnoredDuringExecution, no replacement replica will be scheduled in another zone. The Kube Apiserver, on the other hand, uses topology spread constraints with maxSkew=2 on topologyKey: topology.kubernetes.io/zone, which allows it to schedule a replacement pod in another zone.

Other control plane components having single replica

Currently there are no pod scheduling constraints on such control plane components. Current recovery mechanisms as described above will come into play and recover these pods.

What is Zone outage?

There is no clear definition of a zone outage. However, we can look at different reasons for zone outages across providers that have been reported in the past and derive a definition from them.

Some of the most common failures for zone outages have been due to:

  • Network congestion, failure of network devices etc., resulting in loss of connectivity to the nodes within a zone.
  • Infrastructure power down due to cooling systems failure/general temperature threshold breach.
  • Loss of power due to extreme weather conditions and failure of primary and backup generators resulting in partial or complete shutdown of infrastructure.
  • Operator mistakes leading to cascading DNS issues, over-provisioning of servers resulting in massive increase in system memory.
  • Stuck volumes or volumes with severely degraded performance which are unable to service read and write requests which can potentially have cascading effects on other critical services like load balancers, database services etc.

The above list is not comprehensive but a general pattern emerges. The outages range from:

  1. Shutdown of machines in a specific availability zone due to infrastructure failure which in turn could be due to many reasons listed above.
  2. Network connectivity to the machines running in an availability zone is either severely impacted or broken.
  3. Subset of essential services (e.g. EBS volumes in case of AWS provider) are unhealthy which might also have a cascading effect on other services.
  4. Elevated API request failure rates when creating or updating infrastructure resources like machines, load balancers, etc.

One can further derive that either there is a complete breakdown of an entire availability zone (1 and 2 above) or there is a degradation or unavailability of a subset of essential services.

In the first version of this document we define an AZ outage only when either of (1) or (2) occurs as defined above.

Impact of a Zone Outage

As part of the current recovery mechanisms, if the Machine-Controller-Manager is able to delete the machines, then per MachineDeployment it will delete one machine at a time and wait for a new machine to transition from Pending to Running state. In case of a network outage, it will be able to delete a machine and subsequently launch a new machine, but the newly launched machine will be stuck in Pending state, as the kubelet running on the machine will not be able to create its lease. There will also not be any corresponding Node object for the newly launched machine. The rest of the machines in this MachineDeployment will be stuck in Unknown state.

Kube Apiserver, Gardener Resource Manager & seed system components

These pods are stateless; losing one pod can be tolerated since two other replicas will continue to run in the other two zones, which are still available (assuming there are 3 zones in a region).

etcd

The minimum and default size of an HA etcd cluster is 3. This allows tolerance of one AZ failure. If more than one AZ fails or is unreachable, the etcd cluster will lose its quorum. The initially set pod anti-affinity rules will not allow automatic rescheduling of an etcd pod onto another zone (unless, of course, the affinity rules are dynamically changed). The reference to its PVC will also prevent an etcd member from getting scheduled in another zone, since persistent volumes are not shared across availability zones.

Other Shoot Control Plane Components

All the other shoot control plane components have:

  • Single replicas
  • No affinity rules influencing their scheduling

See the current recovery mechanisms described above.

Identify a Zone outage

NOTE: This section should be read in context of the currently limited definition of a zone outage as described above.

In case the Machine-Controller-Manager is unable to delete Failed machines, the following will be observed:

  • All nodes in that zone will be stuck in NotReady or Unknown state and there will be an additional taint key: node.kubernetes.io/unreachable, effect: NoSchedule on the node resources.
  • Across all MachineDeployments in the affected zone, one machine will be in Terminating state and the other existing machines will be in Unknown state. There might be one additional machine in CrashLoopBackOff state.

In case the Machine-Controller-Manager is able to delete the Failed machines, the following will be observed:

  • For every MachineDeployment in the affected zone, there will be one machine stuck in Pending state and all other machines will be in Unknown state.
  • For the machine in Pending state there will not be any corresponding node object.
  • For all the machines in Unknown state, the corresponding node resource will be in NotReady/Unknown state and there will be an additional taint key: node.kubernetes.io/unreachable, effect: NoSchedule on each of the nodes.

If the above state is observed for an extended period of time (beyond a threshold that could be defined), then it can be deduced that there is a zone outage.

Identify zone recovery

NOTE: This section should be read in context of the current limited definition of a zone outage as described above.

If the machines which were previously stuck in either Pending or CrashLoopBackOff state are now in Running state, and corresponding Node resources have been created for these machines, then the zonal recovery has started.

Recovery

Current Recovery Mechanisms

Gardener and upstream Kubernetes already provide recovery mechanisms for nodes and pods in case of a node failure. Those have been tested in the scope of an availability zone outage simulation.

Machine recovery

In the seed control plane, kube-controller-manager will detect that a node has not renewed its lease and after a timeout (configurable via --node-monitor-grace-period flag; 2m0s by default for Shoot clusters) it will transition the Node to Unknown state. machine-controller-manager will detect that an existing Node has transitioned to Unknown state and will do the following:

  • It will transition the corresponding Machine to Failed state after waiting for a duration (currently 10 mins, configured via the --machine-health-timeout flag).
  • Thereafter a deletion timestamp will be put on this machine indicating that the machine is now going to be terminated, transitioning the machine to Terminating state.
  • It attempts to drain the node first, and if it is unable to drain the node, then currently after a period of 2 hours (configurable via --machine-drain-timeout) it will attempt to force-delete the Machine and create a new machine. Draining a node will be skipped if certain conditions are met, e.g. the node is in NotReady state, a node condition is reported by node-problem-detector as ReadonlyFilesystem, etc.

In case the machine-controller-manager is unable to delete a machine, that machine will be stuck in Terminating state. It will attempt to launch a new machine and, if that also fails, the new machine will transition to CrashLoopBackOff and will be stuck in this state.

Pod recovery

Once kube-controller-manager transitions the node to Unknown/NotReady, it also puts the following taints on the node:

taints:
- effect: NoSchedule
  key: node.kubernetes.io/unreachable
- effect: NoExecute
  key: node.kubernetes.io/unreachable

The taints have the following effect:

  • New pods will not be scheduled unless they have a toleration added which is all permissive or matches the effect and/or key.

For Deployments, once a Pod managed by a Deployment transitions to Terminating state, the kube-controller-manager creates a new Pod (replica) right away to fulfill the desired replica count of the Deployment. Hence, in case of a Seed node/zone outage, the kube-controller-manager creates new Pods in place of the Pods that are evicted due to the outage. The newly created Pods are scheduled on healthy Nodes and start successfully after a short period of time.

For StatefulSets, once a Pod managed by a StatefulSet transitions to Terminating state, the kube-controller-manager waits until the Pod is removed from the store and only then creates the replacement Pod. In case of a Seed node/zone outage, the StatefulSet Pods in the Shoot control plane stay in Terminating state for 5min. After 5min, the shoot-care-controller of gardenlet forcefully deletes (garbage collects) the Terminating Pods from the Shoot control plane. Alternatively, the machine-controller-manager deletes the unhealthy Nodes after --machine-health-timeout (10min by default) and the Terminating Pods are removed from the store shortly after the Node removal. kube-controller-manager then creates new StatefulSet Pods. Depending on the outage type (node or zone), the new StatefulSet Pods either succeed or fail to recover: for a node outage the Pods will likely recover, whereas for a zone outage it is usual that the Pods remain in Pending state as long as the outage lasts, since the volumes they depend on cannot be moved across availability zones.

Recovery from Node failure

If there is a single node failure in any availability zone, irrespective of whether it is a single-zone or multi-zone setup, the recovery is automatic (see current recovery mechanisms). In the meantime, if there are other available nodes (as per the affinity rules) in the same availability zone, the scheduler will deploy the affected shoot control plane components on these nodes.

Recovery from Zone failure

In the following section, options are presented to recover from an availability zone failure in a multi-zone shoot control plane setup.

Option #1: Leverage existing recovery options - Preferred

In this option, the existing recovery mechanisms as described above are used. There is no change to the current replica counts of the shoot control plane components, and no dynamic re-balancing of quorum-based pods is considered.

Pros:

  • Less complex to implement since no dynamic re-balancing of pods is required and there is no need to determine if there is an AZ outage.
  • Additional cost to host an HA shoot control plane is kept to the bare minimum.
  • Existing recovery mechanisms are leveraged:
    • For the affected Deployment Pods, the kube-controller-manager will create new replicas after the affected replicas are terminating. kube-controller-manager starts terminating the affected replicas after 7min by default - --node-monitor-grace-period (2min by default) + --default-not-ready-toleration-seconds/default-unreachable-toleration-seconds (300s by default). The newly created Pods will be scheduled on healthy Nodes and will start successfully.

Cons:

  • Existing recovery mechanisms are leveraged:
    • For the affected StatefulSet Pods, the replacement Pods will fail to be scheduled due to their affinity rules (etcd Pods) or volume requirements (prometheus and loki Pods) tying them to the outage zone, and they will not recover. However, etcd will still be operational because of its quorum. Downtime of the monitoring and logging components for the duration of an ongoing availability zone outage is acceptable for now.
  • The etcd cluster will run with one member less, resulting in no tolerance of any further failure. If it takes a long time to recover the zone, the etcd cluster is susceptible to quorum loss should any further failure happen.
  • Any zero downtime maintenance is disabled during this time.
  • If the recovery of the zone takes a long time, it is possible that the revision difference between the leader and the follower (which was in the unavailable zone) becomes large. When the AZ is restored and the etcd pod is deployed again, there will be additional load on the etcd leader to synchronize this etcd member.

Option #2: Redundancies for all critical control plane components

In this option:

  • Kube Apiserver, Gardener Resource Manager and etcd will be setup with a minimum of 3 replicas as it is done today.
  • All other critical control plane components are set up with more than one replica. Based on the criticality of the functionality, different replica counts (>1) could be decided.
  • As in Option #1, no additional recovery mechanisms other than the existing ones are provided.

Pros:

  • Tolerance of at least a single AZ failure is now provided for all critical control plane components.
  • There is no need for dynamic re-balancing of pods in the event of an AZ failure, and there is also no need to determine if there is an AZ outage, which reduces the complexity.

Cons:

  • Provisioning redundancies entails additional hosting cost. With all critical components now set up with more than one replica, the overall requirement for compute resources will increase.
  • The increase in overall resource requirements will result in a smaller number of shoot control planes that can be hosted in a seed, thereby requiring more seeds to be provisioned, which also increases the cost of hosting seeds.
  • If the recovery of the zone takes a long time, it is possible that the revision difference between the leader and the follower (which was in the unavailable zone) becomes large. When the AZ is restored and the etcd pod is deployed again, there will be additional load on the etcd leader to synchronize this etcd member.

NOTE:


Before increasing the replicas for control plane components that currently have a single replica, the following needs to be checked:

  1. Is the control plane component stateless? If it is stateless, then it is easier to increase the replicas.
  2. If the control plane component is not stateless, then check if leader election is required to ensure that at any time there is only one leader and the rest of the replicas will only be followers. This will require additional changes to be implemented if they are not already there.

Option #3: Auto-rebalance pods in the event of AZ failure

NOTE: Prerequisite for this option is to have the ability to detect an outage and recover from it.

Kube Apiserver, Gardener Resource Manager, etcd and seed system components will be set up with multiple replicas spread across zones. The rest of the control plane components will continue to have a single replica. However, in this option, etcd cluster members will be re-balanced to ensure that the desired replicas are available at all times.

Recovering etcd cluster to its full strength

The affinity rules set on the etcd StatefulSet enforce that there will be at most one etcd member per zone. Two approaches could be taken:

Variant-#1

Change the affinity rules and always use preferredDuringSchedulingIgnoredDuringExecution for topologyKey: topology.kubernetes.io/zone. If all zones are available, the scheduler will prefer to distribute the etcd members across zones, each zone having just one replica. In case of a zonal failure, the kube-scheduler will be able to re-schedule this pod in another zone while ensuring that it chooses a node within that zone that does not already have an etcd member running.
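
A sketch of the relaxed rule for this variant; the weight value and the label selector placeholder are assumptions:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              <labels>
          topologyKey: "topology.kubernetes.io/zone"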

Pros

  • Simpler to implement as it does not require any change in the affinity rules upon identification of a zonal failure.
  • Etcd cluster runs with full strength as long as there is a single zone where etcd pods can be rescheduled.

Cons

  • It is possible that even when there is no zonal failure, more than one etcd member can be provisioned in a single zone. The chances of that happening are slim, as typically there is a dedicated worker pool for hosting etcd pods.

Variant-#2

Use requiredDuringSchedulingIgnoredDuringExecution for topologyKey: topology.kubernetes.io/zone during the initial setup to strictly enforce one etcd member per zone.

If and when a zonal failure is detected then etcd-druid should do the following:

  • Remove the PV and PVC for the etcd member in a zone having an outage
  • Change the affinity rules for etcd pods to now use preferredDuringSchedulingIgnoredDuringExecution for topologyKey: topology.kubernetes.io/zone during the downtime duration of a zone.
  • Delete the etcd pod in the zone which has an outage

This will force the kube-scheduler to schedule the new pod in another zone.

When it is detected that the zone has recovered, the etcd members should be re-balanced. To achieve that, etcd-druid should do the following:

  • Change the affinity rule to again have requiredDuringSchedulingIgnoredDuringExecution for topologyKey: topology.kubernetes.io/zone
  • Delete an etcd pod from a zone which has 2 pods running. Subsequently also delete the associated PV and PVC.

The kube-scheduler will now schedule this pod in the just recovered zone.

A consequence of doing this is that etcd-druid, which today runs with a single replica, now needs to have an HA setup across zones.

Pros

  • When all the zones are healthy and available, it ensures that there is at most one pod per zone, thereby providing the desired QoS w.r.t. failure tolerance. Only in the case of a zone failure will it relax the rule for spreading etcd members, allowing more than one member to be provisioned in a zone. However, this would ideally be temporary.
  • Etcd cluster runs with full strength as long as there is a single zone where etcd pods can be rescheduled.

Cons

  • It is complex to implement.
  • Requires etcd-druid to be highly available, as it now plays a key role in ensuring that the affinity rules are changed and PVs/PVCs are deleted.

Cost Implications on hosting HA control plane

Compute & Storage

The cost differential compared to the current setup will be due to the following. Consider a 3-zone cluster (in case of a multi-zonal shoot control plane) and a 3-node cluster (in case of a single-zonal shoot control plane):

  • Machines: Depending on the number of zones, a minimum of one additional machine per zone will be provisioned.
  • Persistent Volume: 4 additional PVs need to be provisioned for etcd-main and etcd-events pods.
  • Backup-Bucket: The etcd backup-restore container uses the provider object store as a backup bucket to store full and delta snapshots. In a multi-node etcd setup, only the leading backup-restore sidecar (in the etcd leader pod) takes snapshots and uploads them to the object store. Therefore, there is no additional cost incurred compared to the current non-HA shoot control plane setup.

Network latency

Network latency measurements were done focusing on etcd. Three different etcd topologies were considered for comparison - single node etcd (cluster-size = 1), multi-node etcd (within a single zone, cluster-size = 3) and multi-node etcd (across 3 zones, cluster-size = 3).

Test Details

  • The etcd benchmark tool is used to generate load, and the reports generated by the tool are used for the analysis.
  • A subset of etcd requests (namely PUT, RANGE) are considered for network analysis.
  • Following are the parameters that have been considered for each test run across all etcd topologies:
    • Number of connections
    • Number of clients connecting to etcd concurrently
    • Size and number of key-value pairs that are stored and queried
    • Consistency which can either be serializable or linearizable
    • For each PUT or GET (range) request, leader and followers are targeted
  • Zones have other workloads running and therefore measurements will have outliers which are ignored.

Acronyms used

  • sn-sz - single node, single zone etcd cluster
  • mn-sz - multi node, single zone etcd cluster
  • mn-mz - multi node, multi zone etcd cluster

Test findings for PUT requests

  • When the number of clients and connections is kept at 1, it is observed that sn-sz latency is lower (by 20%-50%) compared to mn-sz. The variance is due to changes in payload size.
  • For the following observations it was ensured that only the leader is targeted for both multi-node etcd cluster topologies and that the leader is in the same zone as the sn-sz etcd, to have a fair comparison.
    • When the number of clients, connections and payload size is increased, it has been observed that sn-sz latency is higher (in the range of 8% to 30%) compared to mn-sz and mn-mz.
  • When comparing mn-sz and mn-mz, the following observations were made:
    • When the number of clients and connections is kept at 1, then irrespective of the payload size, mn-sz latency is lower (~3%) compared to mn-mz. However, the difference is usually within 1 ms.
    • mn-sz latency is lower (by 5%-20%) compared to mn-mz when the request is directly serviced by the leader. However, the difference is in the range of microseconds to a few milliseconds.
    • mn-sz latency is lower (by 20%-30%) compared to mn-mz when the request is serviced by a follower. However, the difference is usually within the same millisecond.
    • When the number of clients and connections is kept at 1, then irrespective of the payload size, the latency of the leader is lower than that of any follower. This is along expected lines. However, if the number of clients and connections is increased, the leader seems to have a higher latency compared to a follower, which could not be explained.

Test findings for GET (Range) requests

Using etcd benchmark tool range requests were generated.

Range Request to fetch one key per request

We fixed the number of connections and clients to 1 and varied the payload size. Range requests were directed to leader and follower etcd members and network latencies were measured. The following are the findings:

  • sn-sz latency is ~40% higher compared to mn-sz and around 30% higher compared to mn-mz for smaller payload sizes. However, for larger payload sizes (~1MB) the trend reverses and sn-sz latency is around 15%-20% lower compared to mn-sz and mn-mz.
  • mn-sz latency is ~20% lower than mn-mz.
  • With consistency set to serializable, latency was lower (in the range of 15%-40%) compared to when consistency was set to linearizable.
  • When requesting a single key at a time (keeping the number of connections and clients at 1):
    • sn-sz latency is ~40% higher compared to mn-sz and around 30% higher compared to mn-mz.
    • mn-sz latency is ~20% lower than mn-mz.
    • With serializable consistency, latency was ~40% lower compared to when consistency was set to linearizable.
    • For both mn-sz and mn-mz, leader latency is in general lower (in the range of 20% to 50%) than that of a follower. However, the difference is still in the milliseconds range when consistency is set to linearizable and in the microseconds range when it is set to serializable.

When connections and clients were increased, keeping the payload size fixed, the following were the findings:

  • sn-sz latency is ~30% higher compared to mn-sz and mn-mz with consistency set to linearizable, which is consistent with the above finding. However, when consistency is set to serializable, latencies are comparable across all topologies (within ~1 millisecond).
  • With increased connections and clients, the latencies of mn-sz and mn-mz are almost similar.
  • With consistency set to serializable, latency was ~20% lower compared to when consistency was set to linearizable. This is also consistent with the above findings.
  • When range requests are served by a follower, mn-sz latency is ~20% lower than mn-mz when consistency is set to linearizable. However, it is quite the opposite when consistency is set to serializable.

Range requests to fetch all keys per request

For these tests, for a payload size of 1MB the total number of key-value pairs retrieved per request is 1000, and for a payload size of 256 bytes the total number of key-value pairs retrieved per request is 100000.

  • sn-sz latency is around 5% lower than both mn-sz and mn-mz. This is a deviation for smaller payloads (see above), but for larger payloads this finding is consistent.
  • There is hardly any difference in latency between mn-sz and mn-mz.
  • There seems to be no significant difference between the serializable and linearizable consistency settings. However, when follower etcd instances serviced the request, there were mixed results and nothing could be concluded.

Summary

  • For range requests, the Serializable consistency mode has a lower network latency compared to Linearizable, which is along expected lines, as linearizable requests must go through the raft consensus process.
  • For PUT requests:
    • sn-sz has a lower network latency when the number of clients and connections is low. However, it starts to deteriorate once these are increased along with the payload size, making multi-node etcd clusters outperform single-node etcd in terms of network latency.
    • In general, mn-sz has a lower network latency compared to mn-mz, but the difference is still within milliseconds and therefore not of concern.
    • Requests that go directly to the leader have a lower overall network latency compared to requests that go to a follower. This is also expected, as the follower has to forward all PUT requests to the leader as an additional hop.
  • For GET requests:
    • For lower payload sizes, sn-sz latency is higher compared to mn-sz and mn-mz, but with larger payload sizes this trend reverses.
    • With a lower number of connections and clients, mn-sz has lower latencies compared to mn-mz; however, this difference diminishes as the number of connections/clients or the payload size is increased.
    • In general, the serializable consistency setting has a lower overall latency compared to linearizable. There were some outliers w.r.t. etcd followers, but currently we do not give too much weight to them.

In a nutshell, we do not see any major concerns w.r.t. latencies in a multi-zonal setup compared to a single-zone HA setup or a single-node etcd.

NOTE: Detailed network latency analysis can be viewed here.

Cross-Zonal traffic

Providers typically do not charge for ingress and egress traffic that is contained within an availability zone. However, they do charge for traffic that crosses zones.

Cross zonal traffic rates for some of the providers are:

Setting up a shoot control plane with failure tolerance of zone will therefore have a higher running cost due to ingress/egress charges as compared to an HA shoot with failure tolerance of node or to a non-HA control plane.

Ingress/Egress traffic analysis

The majority of the cross-zonal traffic is generated via the following communication lines:

  • Between Kube Apiserver and etcd members (ingress/egress)
  • Amongst etcd members (ingress/egress)

Since both of these components are spread across zones, their contribution to the cross-zonal network cost is the largest. In this section the focus is only on these components and the cross-zonal traffic that gets generated.

Details of the network traffic are described in the Ingress/Egress traffic analysis section.

Observation Summary

Terminology

  • Idle state: In an idle state of a shoot control plane, there is no user-driven activity that results in calls to the API server. All the controllers have started and the initial listing of watched resources has been completed (in other words, the informer caches are now in sync).

Findings

  • etcd uses the raft consensus protocol to provide consistency and linearizability guarantees. All PUT or DELETE requests are always and only serviced by the leader etcd pod. The Kube Apiserver can connect either to the leader or to a follower etcd.
    • If the Kube Apiserver connects to the leader, then for every PUT the leader will additionally distribute the request payload to all the followers, and only if the majority of followers have responded with a successful update to their local boltDB database will the leader commit the message and subsequently respond back to the client. For DELETE, a similar flow is executed, but instead of passing around the entire k8s resource, only the keys that need to be deleted are passed, making this operation significantly lighter from the network bandwidth consumption perspective.
    • If the Kube Apiserver connects to a follower, then for every PUT the follower will first forward the PUT request along with the request payload to the leader, who in turn will attempt to get consensus from the majority of the followers by again sending the entire request payload to all the followers. The rest of the flow is the same as above. There is additional network traffic from follower to leader, equal to the weight of the request payload. For DELETE, a similar flow is executed, where the follower forwards the keys that need to be deleted to the leader as an additional step; the rest of the flow is the same as for the PUT request. Since the keys are quite small in size, the network bandwidth consumed is very small. (A rough worked example follows after this list.)
  • GET calls made to the Kube Apiserver with labels + selectors get translated to range requests to etcd. etcd's database does not understand labels and selectors and is therefore not optimized for k8s query patterns. This call can be serviced either by the leader or by a follower etcd member; a follower etcd will not forward the call to the leader.
  • From within controllers, periodic informer resyncs, which generate reconcile events, do not make calls to the Kube Apiserver (under the condition that no change is made to the resources for which a watch is created).
  • If a follower etcd is not in sync (w.r.t. revisions) with the leader etcd, then it will reject the call and the client (in this case the Kube Apiserver) retries. It still needs to be checked whether the client retries by connecting to another etcd member, which would result in additional cross-zonal traffic. This is currently not a concern, as members are generally kept in sync and will only go out of sync in case of a crash of a member, the addition of a new member (as a learner), or during rolling updates. Even then, the sync generally completes quickly.
  • etcd cluster members which are spread across availability zones generated a total cross-zonal traffic of ~84 KiB/s in an idle multi-zonal shoot control plane. Across several runs we have seen this number go up to ~100 KiB/s.
  • etcd follower to another etcd follower remains consistent at ~2 KiB/s in all the cases that have been tested (see Appendix).
  • The Kube Apiserver making a PUT call to an etcd follower is more expensive than directly making the call to the etcd leader. A PUT call also carries the entire payload of the k8s resource that is being created. Topology aware hints should be evaluated to potentially reduce the network cost to some extent.
  • In case of a large difference (w.r.t revision) between a follower and a leader, significant network traffic is observed between the leader and the follower. This is usually an edge case, but occurrence of these cases should be monitored.
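
As a rough worked example of the two PUT flows described above (hypothetical numbers, not measured values): for a single PUT carrying a 1 MiB resource in a 3-member cluster, sending the request directly to the leader causes roughly 2 MiB of inter-member traffic (the leader replicates the payload to each of the two followers), whereas sending it to a follower causes roughly 3 MiB (an additional ~1 MiB hop from the follower to the leader before replication); the acknowledgements and raft bookkeeping add comparatively little on top.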

Optimizing Cost: Topology Aware Hint

In a multi-zonal shoot control plane setup there will be multiple replicas of the Kube Apiserver and etcd spread across different availability zones. Network cost and latency are much lower when the communication stays within a zone and increase once the zonal boundary is crossed. Network traffic amongst etcd members cannot be optimized, as these are strictly spread across different zones. However, what can be optimized is the network traffic between the Kube Apiserver and the etcd member (leader or follower) deployed within the same zone. Kubernetes provides topology aware hints to influence how clients should consume endpoints. Additional metadata is added to the EndpointSlice to influence routing of traffic to the endpoints closest to the caller. Kube-Proxy utilizes the hints (added as metadata) to favor routing to topologically closer endpoints.
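
As an illustration, enabling the feature on a Service and the per-endpoint hints that the EndpointSlice controller then adds could look roughly as follows (the addresses and zone values are examples; the annotation and the hints field are the upstream Kubernetes mechanism while topology aware hints are in beta):

apiVersion: v1
kind: Service
metadata:
  name: etcd-main-client
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto" # ask the EndpointSlice controller to add zone hints
spec:
  selector:
    app: etcd
  ports:
  - port: 2379
---
# Excerpt of a resulting EndpointSlice; kube-proxy in eu-west-1a will prefer this endpoint.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: etcd-main-client-abcde
  labels:
    kubernetes.io/service-name: etcd-main-client
addressType: IPv4
endpoints:
- addresses:
  - 10.241.1.5
  zone: eu-west-1a
  hints:
    forZones:
    - name: eu-west-1a
ports:
- port: 2379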

Disclaimer: Topology Aware Hints won’t improve network traffic if the seed has worker nodes in more than three zones and the Kube Apiserver is scaled beyond three replicas at the same time. In this case, Kube Apiserver replicas run in zones which don’t have an etcd and thus cross zone traffic is inevitable.

During evaluation of this feature some caveats were discovered:

For each cluster, Gardener provides the capability to create one or more worker pools/groups. Each worker pool can span one or more availability zones. For each combination of worker pool and zone there will be a corresponding MachineDeployment, which also maps 1:1 to a node group understood by the cluster-autoscaler (see the sketch below).
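
A minimal sketch of a worker pool spanning two zones (field values are illustrative; the MachineDeployment names in the comment only indicate the naming scheme and are not real objects):

workers:
- name: cpu-worker
  minimum: 2
  maximum: 4
  machine:
    type: m5.large
  zones:
  - eu-west-1a
  - eu-west-1b
# One MachineDeployment (and hence one cluster-autoscaler node group) is created per
# worker-pool/zone combination, e.g. roughly:
#   shoot--<project>--<shoot>-cpu-worker-z1   -> eu-west-1a
#   shoot--<project>--<shoot>-cpu-worker-z2   -> eu-west-1b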

Consider the following cluster setup:

EndpointSliceController does the following:

  • Computes the overall allocatable CPU across all zones - call it TotalCPU
  • Computes the allocatable CPU for all nodes per zone - call it ZoneTotalCPU
  • For each zone it computes the CPU ratio via ZoneTotalCPU/TotalCPU. If the ratio between any two zones is approaching 2x, then it will remove all topology hints.
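
For example (hypothetical numbers): if zone a has 16 allocatable CPUs and zones b and c have 8 each, then TotalCPU = 32 and the per-zone ratios are 0.5, 0.25 and 0.25. Because zone a's share is roughly twice that of the other zones, the controller would strip the hints again and traffic would once more be routed across zones.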

Given that the cluster-autoscaler can scale the individual node groups based on unscheduled pods or lower-than-threshold usage, topology hints may be added and removed dynamically. This results in non-determinism w.r.t. request routing across zones, making cross-zonal network cost and network latency difficult to estimate.

K8S#110714 has been raised.

References

  1. https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
  2. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

Appendix

ETCD Active-Passive Options

In this topology there will be just one active/primary etcd instance and all other etcd instances will be running as hot-standbys. Each etcd instance serves as an independent single-node cluster. There are three options to set up an active-passive etcd.

Option-1: Independent single node etcd clusters

The primary etcd will periodically take snapshots (full and delta) and push them to the backup bucket. Hot-standby etcd instances will periodically query the backup bucket and sync their databases accordingly. If a new full snapshot is available with a higher revision number than what is present in the local etcd database, the hot-standby will restore from that full snapshot. It will additionally check if there are delta snapshots with a higher revision number and, if so, apply them directly to its local etcd database.

NOTE: There is no need to run an embedded etcd to apply delta snapshots.

For the sake of illustration, assume that there are two etcd pods, etcd-0 and etcd-1, with corresponding labels that uniquely identify each pod. Assume that etcd-0 is the current primary/active etcd instance.

etcd-druid will take on the additional responsibility of monitoring the health of etcd-0 and etcd-1. When it detects that etcd-0 is no longer healthy, it will patch the etcd service to point to the etcd-1 pod by updating the selector, so that etcd-1 becomes the primary etcd. It will then restart the etcd-0 pod, which will henceforth serve as the hot-standby. A minimal sketch of such a service is shown below.
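
A minimal sketch of the client service, assuming the pods are addressed via the statefulset.kubernetes.io/pod-name label that Kubernetes adds to StatefulSet pods automatically (the actual labels and selectors used by etcd-druid may differ):

apiVersion: v1
kind: Service
metadata:
  name: etcd-main-client
spec:
  selector:
    # currently selects the active instance etcd-0; on failover etcd-druid would patch this value to etcd-1
    statefulset.kubernetes.io/pod-name: etcd-0
  ports:
  - name: client
    port: 2379
    targetPort: 2379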

Pros

  • There is no leader election, no quorum related issues to be handled. It is simpler to setup and manage.
  • Allows you to just have a total of two etcd nodes - one is active and another is passive. This allows high availability across zones in cases where regions only have 2 zones (e.g. CCloud and Azure regions that do not have more than 2 zones).
  • For all PUT calls, the maximum cost in terms of network bandwidth is one (cross-zonal) call from the Kube Apiserver to the etcd instance, which carries the payload with it. In comparison, in a three-member etcd cluster the leader has to send the PUT request to the other members (cross-zonal), which is slightly more expensive than having a single-member etcd.

Cons

  • As compared to an active-active etcd cluster, there is not much difference in the cost of compute resources (CPU, memory, storage).
  • etcd-druid will have to periodically check the health of both the primary and hot-standby nodes and ensure that these are up and running.
  • There will be a potential delay in determining that a primary etcd instance is no longer healthy, thereby increasing the delay in switching to the hot-standby etcd instance and causing longer downtime. It is also possible that the hot-standby goes down or is otherwise unhealthy at the same time, resulting in a complete downtime. The time it would take to recover from such a situation is several minutes (time to start the etcd pod + time to restore from a full snapshot or apply delta snapshots).
  • Synchronization always happens via the backup bucket, which is less frequent than in an active-active etcd cluster, where the leader synchronizes any update to the majority or all of its followers in real time. If the primary crashes, changes made after the last snapshot was taken will not yet be present on the hot-standby.
  • During the switchover from primary to hot-standby, if the hot-standby etcd is in the process of applying delta snapshots or restoring from a new full snapshot, then the hot-standby should ensure that the backup-restore container sets the readiness probe to indicate that it is not ready yet, causing additional downtime.
Option-2: Perpetual Learner

In this option, the etcd cluster and learner facilities are leveraged. etcd-druid will bootstrap a cluster with one member. Once this member is ready to serve client requests, an additional learner is added which joins the etcd cluster as a non-voting member. The learner rejects client read and write requests and the clients have to retry. Switching to another member of a cluster after retries is typically provided out of the box by etcd clientv3. The only member, which is also the leader, serves all client requests.

Pros

  • All the pros in Option-1 are also applicable for this option.
  • The learner will be continuously updated by the leader and will remain in sync with it. Using a learner retains historical data and its ordering and is therefore better in that aspect than Option-1.

Cons

  • All the cons in Option-1 are also applicable for this option.
  • etcd-druid will now additionally have to play an active role in managing the members of an etcd cluster by adding new members as learners and promoting a learner to a voting member if the leader is no longer available. This increases the complexity of etcd-druid.
  • To prevent clients from repeatedly attempting to reach the learner, etcd-druid will have to ensure that the labels on the learner are set differently from those on the leader. This needs to be done every time the leader/learner roles switch.
  • Since etcd only allows addition of 1 learner at a time, this means that the HA setup can only have one failover etcd node, limiting its capability to have more than one hot-standby.

Topology Spread Constraints evaluation and findings

Finding #1

Consider the following setup:

Single zone & multiple nodes
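
The exact constraint definitions are not reproduced in this view; an illustrative sketch of two such constraints (one spreading over zones, one over nodes, both hard) in the etcd pod template could look as follows (maxSkew and labelSelector values are assumptions, not the exact values used in the test):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: etcd
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: etcd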

When the constraints defined above are applied, the findings were as follows:

  • With 3 replicas of etcd, all three got scheduled (one per node). This was a bit unexpected: as per the documentation, multiple constraints are evaluated in conjunction, so the first constraint should only allow 1 etcd pod per zone and the remaining 2 should not have been scheduled and should have remained stuck in the Pending state. However, all 3 etcd pods got scheduled and started successfully.
Finding #2

Consider the following setup:

Multiple zones & multiple nodes

When the constraints defined above are applied, the findings were as follows:

  • Both constraints are evaluated in conjunction and the scheduling is done as expected.
  • TSC behaves correctly up to replicas=5. Beyond that, TSC fails. This was reported as issue kubernetes#109364.
Finding #3

NOTE: Also see the Known Limitations

Availability Zone Outage simulation

A zone outage was simulated by doing the following (Provider:AWS):

  • Network ACLs were replaced with an empty ACL (which denies all ingress and egress). This was done for all subnets in a zone. Impact of denying all traffic:
    • The kubelet running on the nodes in this zone will not be able to communicate with the Kube Apiserver. This will in turn result in the Kube-Controller-Manager changing the status of the corresponding Node objects to Unknown.
    • Control plane components will not be able to communicate with the kubelet and are therefore unable to drain the nodes.
  • To simulate the scenario where the Machine-Controller-Manager is unable to create/delete machines, cloud provider credentials were changed so that any attempt to create/delete machines is unauthorized.

Worker groups were configured to use region: eu-west-1 and zones: eu-west-1a, eu-west-1b, eu-west-1c. eu-west-1a zone was brought down following the above steps. State before and after the outage simulation is captured below.

State before the outage simulation
kubectl get po -n <shoot-control-ns> # list of pods in the shoot control plane
NAME                                          READY   STATUS    NODE
cert-controller-manager-6cf9787df6-wzq86      1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
cloud-controller-manager-7748bcf697-n66t7     1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
csi-driver-controller-6cd9bc7997-m7hr6        6/6     Running   ip-10-242-20-17.eu-west-1.compute.internal
csi-snapshot-controller-5f774d57b4-2bghj      1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
csi-snapshot-validation-7c99986c85-rr7zk      1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
etcd-events-0                                 2/2     Running   ip-10-242-60-155.eu-west-1.compute.internal
etcd-events-1                                 2/2     Running   ip-10-242-73-89.eu-west-1.compute.internal
etcd-events-2                                 2/2     Running   ip-10-242-20-17.eu-west-1.compute.internal
etcd-main-0                                   2/2     Running   ip-10-242-73-77.eu-west-1.compute.internal
etcd-main-1                                   2/2     Running   ip-10-242-22-85.eu-west-1.compute.internal
etcd-main-2                                   2/2     Running   ip-10-242-53-131.eu-west-1.compute.internal
gardener-resource-manager-7fff9f77f6-jwggx    1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
gardener-resource-manager-7fff9f77f6-jwggx    1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
gardener-resource-manager-7fff9f77f6-jwggx    1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
grafana-operators-79b9cd58bb-z6hc2            1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
grafana-operators-79b9cd58bb-z6hc2            1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
kube-apiserver-5fcb7f4bff-7p4xc               1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
kube-apiserver-5fcb7f4bff-845p7               1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
kube-apiserver-5fcb7f4bff-mrspt               1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
kube-controller-manager-6b94bcbc4-9bz8q       1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
kube-scheduler-7f855ffbc4-8c9pg               1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
kube-state-metrics-5446bb6d56-xqqnt           1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
loki-0                                        4/4     Running   ip-10-242-60-155.eu-west-1.compute.internal
machine-controller-manager-967bc89b5-kgdwx    2/2     Running   ip-10-242-20-17.eu-west-1.compute.internal
prometheus-0                                  3/3     Running   ip-10-242-60-155.eu-west-1.compute.internal
shoot-dns-service-75768bd764-4957h            1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
vpa-admission-controller-6994f855c9-5vmh6     1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
vpa-recommender-5bf4cfccb6-wft4b              1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
vpa-updater-6f795d7bb8-snq67                  1/1     Running   ip-10-242-20-17.eu-west-1.compute.internal
vpn-seed-server-748674b7d8-qmjbm              2/2     Running   ip-10-242-20-17.eu-west-1.compute.internal
kubectl get nodes -o=custom-columns='NAME:metadata.name,ZONE:metadata.labels.topology\.kubernetes\.io\/zone' # list of nodes with name, zone and status (was taken separately)
NAME                                          STATUS   ZONE
ip-10-242-20-17.eu-west-1.compute.internal    Ready    eu-west-1a
ip-10-242-22-85.eu-west-1.compute.internal    Ready    eu-west-1a
ip-10-242-3-0.eu-west-1.compute.internal      Ready    eu-west-1a
ip-10-242-53-131.eu-west-1.compute.internal   Ready    eu-west-1b
ip-10-242-60-155.eu-west-1.compute.internal   Ready    eu-west-1b
ip-10-242-73-77.eu-west-1.compute.internal    Ready    eu-west-1c
ip-10-242-73-89.eu-west-1.compute.internal    Ready    eu-west-1c
kubectl get machines # list of machines for the multi-AZ shoot control plane
NAME                                                    STATUS
shoot--garden--aws-ha2-cpu-worker-etcd-z1-66659-sf4wt   Running
shoot--garden--aws-ha2-cpu-worker-etcd-z2-c767d-s8cmf   Running
shoot--garden--aws-ha2-cpu-worker-etcd-z3-9678d-6p8w5   Running
shoot--garden--aws-ha2-cpu-worker-z1-766bc-pjq6n        Running
shoot--garden--aws-ha2-cpu-worker-z2-85968-5qmjh        Running
shoot--garden--aws-ha2-cpu-worker-z3-5f499-hnrs6        Running
shoot--garden--aws-ha2-etcd-compaction-z1-6bd58-9ffc7   Running
State during outage
kubectl get po -n <shoot-control-ns> # list of pods in the shoot control plane
NAME                                          READY   STATUS    NODE
cert-controller-manager-6cf9787df6-dt5nw      1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
cloud-controller-manager-7748bcf697-t2pn7     1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
csi-driver-controller-6cd9bc7997-bn82b        6/6     Running   ip-10-242-60-155.eu-west-1.compute.internal
csi-snapshot-controller-5f774d57b4-rskwj      1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
csi-snapshot-validation-7c99986c85-ft2qp      1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
etcd-events-0                                 2/2     Running   ip-10-242-60-155.eu-west-1.compute.internal
etcd-events-1                                 2/2     Running   ip-10-242-73-89.eu-west-1.compute.internal
etcd-events-2                                 0/2     Pending
etcd-main-0                                   2/2     Running   ip-10-242-73-77.eu-west-1.compute.internal
etcd-main-1                                   0/2     Pending
etcd-main-2                                   2/2     Running   ip-10-242-53-131.eu-west-1.compute.internal
gardener-resource-manager-7fff9f77f6-8wr5n    1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
gardener-resource-manager-7fff9f77f6-jwggx    1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
gardener-resource-manager-7fff9f77f6-lkgjh    1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
grafana-operators-79b9cd58bb-m55sx            1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
grafana-users-85c7b6856c-gx48n                1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
kube-apiserver-5fcb7f4bff-845p7               1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
kube-apiserver-5fcb7f4bff-mrspt               1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
kube-apiserver-5fcb7f4bff-vkrdh               1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
kube-controller-manager-6b94bcbc4-49v5x       1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
kube-scheduler-7f855ffbc4-6xnbk               1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
kube-state-metrics-5446bb6d56-g8wkp           1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
loki-0                                        4/4     Running   ip-10-242-73-89.eu-west-1.compute.internal
machine-controller-manager-967bc89b5-rr96r    2/2     Running   ip-10-242-60-155.eu-west-1.compute.internal
prometheus-0                                  3/3     Running   ip-10-242-60-155.eu-west-1.compute.internal
shoot-dns-service-75768bd764-7xhrw            1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
vpa-admission-controller-6994f855c9-7xt8p     1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
vpa-recommender-5bf4cfccb6-8wdpr              1/1     Running   ip-10-242-73-89.eu-west-1.compute.internal
vpa-updater-6f795d7bb8-gccv2                  1/1     Running   ip-10-242-60-155.eu-west-1.compute.internal
vpn-seed-server-748674b7d8-cb8gh              2/2     Running   ip-10-242-60-155.eu-west-1.compute.internal

Only etcd-events-2 and etcd-main-1 are stuck in the Pending state. All other pods that were running on nodes in the eu-west-1a zone were rescheduled automatically.

kubectl get nodes -o=custom-columns='NAME:metadata.name,ZONE:metadata.labels.topology\.kubernetes\.io\/zone' # list of nodes with name, zone and status (was taken separately)
NAME                                          STATUS     ZONE
ip-10-242-20-17.eu-west-1.compute.internal    NotReady   eu-west-1a
ip-10-242-22-85.eu-west-1.compute.internal    NotReady   eu-west-1a
ip-10-242-3-0.eu-west-1.compute.internal      NotReady   eu-west-1a
ip-10-242-53-131.eu-west-1.compute.internal   Ready      eu-west-1b
ip-10-242-60-155.eu-west-1.compute.internal   Ready      eu-west-1b
ip-10-242-73-77.eu-west-1.compute.internal    Ready      eu-west-1c
ip-10-242-73-89.eu-west-1.compute.internal    Ready      eu-west-1c
kubectl get machines # list of machines for the multi-AZ shoot control plane
NAME                                                    STATUS             AGE
shoot--garden--aws-ha2-cpu-worker-etcd-z1-66659-jlv56   Terminating        21m
shoot--garden--aws-ha2-cpu-worker-etcd-z1-66659-sf4wt   Unknown            3d
shoot--garden--aws-ha2-cpu-worker-etcd-z2-c767d-s8cmf   Running            3d
shoot--garden--aws-ha2-cpu-worker-etcd-z3-9678d-6p8w5   Running            3d
shoot--garden--aws-ha2-cpu-worker-z1-766bc-9m45j        CrashLoopBackOff   2m55s
shoot--garden--aws-ha2-cpu-worker-z1-766bc-pjq6n        Unknown            2d21h
shoot--garden--aws-ha2-cpu-worker-z1-766bc-vlflq        Terminating        28m
shoot--garden--aws-ha2-cpu-worker-z2-85968-5qmjh        Running            3d
shoot--garden--aws-ha2-cpu-worker-z2-85968-zs9lr        CrashLoopBackOff   7m26s
shoot--garden--aws-ha2-cpu-worker-z3-5f499-hnrs6        Running            2d21h
shoot--garden--aws-ha2-etcd-compaction-z1-6bd58-8qzln   CrashLoopBackOff   12m
shoot--garden--aws-ha2-etcd-compaction-z1-6bd58-9ffc7   Terminating        3d

MCM attempts to delete the machines and, since it is unable to, the machines transition to the Terminating state and are stuck there. It subsequently attempts to launch new machines, which also fails, and these machines transition to the CrashLoopBackOff state.

Ingress/Egress Traffic Analysis Details

Consider the following etcd cluster:

$ etcdctl endpoint status --cluster -w table
ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX
https://etcd-main-0.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | 37e93e9d1dd2cc8e | 3.4.13 | 7.6 MB | false | false | 47 | 3863 | 3863
https://etcd-main-2.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | 65fe447d73e9dc58 | 3.4.13 | 7.6 MB | true | false | 47 | 3863 | 3863
https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | ad4fe89f4e731298 | 3.4.13 | 7.6 MB | false | false | 47 | 3863 | 3863
Multi-zonal shoot control plane ingress/egress traffic in a fresh shoot cluster with no user activity

The steady-state traffic (after all controllers have made their initial list requests to refresh their informer caches) is depicted below (span = 1 hr):

Observations:

  • Leader to each follower max egress: ~20 KiB/s
  • One follower to leader max egress: ~20 KiB/s
  • Follower to follower max egress: ~2 KiB/s

Total ingress + egress traffic amongst etcd members = ~84 KiB/s.

Traffic generated during PUT requests to etcd leader

Generating a load of 100 PUT requests/second for a duration of 30 seconds by targeting the etcd leader. This will generate ~100 KiB/s of traffic (value size is 1 KiB).

 benchmark put --target-leader  --rate 100 --conns=400 --clients=400 --sequential-keys --key-starts 0 --val-size=1024 --total=3000 \
     --endpoints=https://etcd-main-client:2379 \
     --key=/var/etcd/ssl/client/client/tls.key \
     --cacert=/var/etcd/ssl/client/ca/bundle.crt \
     --cert=/var/etcd/ssl/client/client/tls.crt

Observations:

  • Leader to each follower max egress: ~155 KiB/s (pattern duration: 50 secs)
  • One follower to leader max egress: ~50 KiB/s (pattern duration: 50 secs)
  • Follower to follower max egress: ~2 KiB/s

Total ingress + egress traffic amongst etcd members = ~412 KiB/s.

Traffic generated during PUT requests to etcd follower

Generating a load of 100 PUT requests/second for a duration of 30 seconds by targeting an etcd follower. This will generate ~100 KiB/s of traffic (value size is 1 KiB).
benchmark put  --rate 100 --conns=400 --clients=400 --sequential-keys --key-starts 3000 --val-size=1024 --total=3000 \
    --endpoints=https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 \
    --key=/var/etcd/ssl/client/client/tls.key \
    --cacert=/var/etcd/ssl/client/ca/bundle.crt \
    --cert=/var/etcd/ssl/client/client/tls.crt

Observations:

  • In this case, the follower (etcd-main-1) redirects the PUT requests to the leader (etcd-main-2), max egress: ~168 KiB/s
  • Leader to each follower max egress: ~150 KiB/s (pattern duration: 50 secs)
  • One follower to leader max egress: ~45 KiB/s (pattern duration: 50 secs)
  • Follower to follower max egress: ~2 KiB/s

Total ingress + egress traffic amongst etcd members = ~517 KiB/s.

Traffic generated during etcd follower initial sync with large revision difference

Consider the following etcd cluster:

$ etcdctl endpoint status --cluster -w table
ENDPOINTIDVERSIONDB SIZEIS LEADERIS LEARNERRAFT TERMRAFT INDEXRAFT APPLIED INDEX
https://etcd-main-0.etcd-main-peer.shoot--ash-garden--mz-neem.svc:237937e93e9d1dd2cc8e3.4.135.1 GBfalsefalse484752747527
https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379ad4fe89f4e7312983.4.135.1 GBtruefalse484757647576

In this case, a new or crashed follower, etcd-main-2, joins the etcd cluster with a large revision difference (5.1 GB DB size on the leader). The etcd follower had a DB size of 14 MB when it crashed. There was a flurry of activity that increased the leader DB size to 5.1 GB, thus creating a huge revision difference.

Scale up etcd members to 3

kubectl scale statefulsets etcd-main --replicas=3 -n shoot--ash-garden--mz-neem

ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX
https://etcd-main-0.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | 37e93e9d1dd2cc8e | 3.4.13 | 5.1 GB | false | false | 48 | 50502 | 50502
https://etcd-main-2.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | 65fe447d73e9dc58 | 3.4.13 | 5.0 GB | false | false | 48 | 50502 | 50502
https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | ad4fe89f4e731298 | 3.4.13 | 5.1 GB | true | false | 48 | 50502 | 50502

Observations:

  • Leader to the new or crashed follower etcd-main-2 (which joins with a large revision difference) max egress: ~120 MiB/s (noticeable pattern duration: 40 secs).
  • New follower etcd-main-2 to leader max egress: ~159 KiB/s (pattern duration: 40 secs).
  • Leader to the other follower etcd-main-0 max egress: <20 KiB/s.
  • Follower etcd-main-0 to leader max egress: <20 KiB/s.
  • Follower to follower max egress: ~2 KiB/s

Total ingress + egress traffic amongst etcd members = ~121 MiB/s.

Traffic generated during GET requests to etcd leader

In this case, we try to get the keys that fall between 1 and 17999 by targeting the leader etcd-main-1. This dumps both keys and values.

Executing the following command from the etcd-client pod (running in the same namespace):

root@etcd-client:/#  etcdctl --endpoints=https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 get 1 17999 > /tmp/range2.txt
root@etcd-client:/# du -h  /tmp/range2.txt
607M	/tmp/range2.txt

Observations:

  • The downloaded dump file is around 607 MiB.
  • Leader etcd-main-1 to etcd-client max egress: ~34 MiB/s.
  • etcd intra-cluster network traffic remains the same; no change in the network traffic pattern was observed.
Traffic generated during GET requests to etcd follower

In this case, we try to get the keys that fall between 1 and 17999 by targeting the follower etcd-main-2. This dumps both keys and values.

root@etcd-client:/# etcdctl --endpoints=https://etcd-main-2.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 get 1 17999 > /tmp/range.txt
root@etcd-client:/# du -h  /tmp/range.txt
607M	/tmp/range.txt
root@etcd-client:/#

Observations:

  • The downloaded dump file is around 607 MiB.
  • Follower etcd-main-2 to etcd-client max egress: ~32 MiB/s.
  • etcd intra-cluster network traffic remains the same; no change in the network traffic pattern was observed.
  • Watch requests are not forwarded from etcd-main-2 to the leader etcd-main-1.
Traffic generated during DELETE requests to etcd leader

In this case, we try to delete the keys that fall between 0 and 99999 by targeting the leader.

Executing the following command from the etcd-client pod:

root@etcd-client:/# time etcdctl --endpoints=https://etcd-main-2.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 del 0 99999    --dial-timeout=300s --command-timeout=300s
99999

real	0m0.664s
user	0m0.008s
sys	0m0.016s
root@etcd-client:/#

Observations:

  • Leader etcd-main-2 to etcd-client max egress: ~226 B/s.
  • etcd intra-cluster network traffic remains the same; no change in the network traffic pattern was observed.
Traffic generated during DELETE requests to etcd follower

In this case, we try to delete the keys that fall between 0 and 99999 by targeting the follower.

Executing the following command from the etcd-client pod:

root@etcd-client:/# time etcdctl --endpoints=https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 del 0 99999    --dial-timeout=300s --command-timeout=300s
99999

real	0m0.664s
user	0m0.008s
sys	0m0.016s
root@etcd-client:/#

Observations:

  • Leader etcd-main-2 to etcd-client max egress: ~222 B/s.
  • etcd intra-cluster network traffic remains the same; no change in the network traffic pattern was observed.
Traffic generated during WATCH requests to etcd

Etcd cluster state

ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX
https://etcd-main-0.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | 37e93e9d1dd2cc8e | 3.4.13 | 673 MB | false | false | 388 | 970471 | 970471
https://etcd-main-2.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | 65fe447d73e9dc58 | 3.4.13 | 673 MB | true | false | 388 | 970472 | 970472
https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 | ad4fe89f4e731298 | 3.4.13 | 673 MB | false | false | 388 | 970472 | 970472

Watching the keys that fall between 0 and 99999 by targeting the follower.

Executing the following command from the etcd-client pod:

time etcdctl --endpoints=https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 watch 0 99999    --dial-timeout=300s --command-timeout=300s

In parallel, generating 100000 keys, each with a value size of 1 KiB, by targeting the etcd leader (around 500 requests/second in this case):

benchmark put --target-leader  --rate 500 --conns=400 --clients=800 --sequential-keys --key-starts 0 --val-size=1024 --total=100000 \
    --endpoints=https://etcd-main-client:2379 \
    --key=/var/etcd/ssl/client/client/tls.key \
    --cacert=/var/etcd/ssl/client/ca/bundle.crt \
    --cert=/var/etcd/ssl/client/client/tls.crt

Observations:

  • etcd intra-cluster network traffic remains the same; no change in the network traffic pattern was observed.
  • Follower etcd-main-1 to etcd-client max egress is ~496 KiB/s.
  • Watch requests are not forwarded from follower etcd-main-1 to the leader etcd-main-2.

Deleting the keys that fall between 0 and 99999 by targeting the follower, while watching the keys in parallel.

root@etcd-client:/# time etcdctl --endpoints=https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-neem.svc:2379 del 0 99999    --dial-timeout=300s --command-timeout=300s
99999

real	0m0.590s
user	0m0.018s
sys	0m0.006s

Observations:

  • etcd intra-cluster network traffic remains the same; no change in the network traffic pattern was observed.
  • Follower etcd-main-1 to etcd-client max egress is ~222 B/s.
  • The watch lists the keys that were deleted, not the values.

16 - New Core Gardener Cloud APIs

New core.gardener.cloud/v1beta1 APIs required to extract cloud-specific/OS-specific knowledge out of Gardener core


Summary

In GEP-1 we proposed how to (re-)design Gardener to allow providers to maintain their provider-specific knowledge outside of the core tree. Meanwhile, we have progressed a lot and are about to remove the CloudBotanist interface entirely. The only missing piece that will allow providers to really maintain their code out of the core is the design of new APIs.

This proposal describes how the new Shoot, Seed etc. APIs will be re-designed to cope with the changes made with extensibility. We already have the new core.gardener.cloud/v1beta1 API group that will be the new default soon.

Motivation

We want to allow providers to individually maintain their specific knowledge without the necessity to touch the Gardener core code. To achieve this, we have to provide proper APIs.

Goals

  • Provide proper APIs to allow providers to maintain their code outside of the core codebase.
  • Do not complicate the APIs for end-users, so that they can easily create, delete, and maintain shoot clusters.

Non-Goals

  • Let’s try to not split everything up into too many different resources. Instead, let’s try to keep all relevant information in the same resources when possible/appropriate.

Proposal

In GEP-1 we already have proposed a first version for new CloudProfile and Shoot resources. In order to deprecate the existing/old garden.sapcloud.io/v1beta1 API group (and remove it, eventually) we should move all existing resources to the new core.gardener.cloud/v1beta1 API group.

CloudProfile resource

apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: cloudprofile1
spec:
  type: <some-provider-name> # {aws,azure,gcp,...}
# Optional list of labels on `Seed` resources that marks those seeds whose shoots may use this provider profile.
# An empty list means that all seeds of the same provider type are supported.
# This is useful for environments that are of the same type (like openstack) but may have different "instances"/landscapes.
# seedSelector:
#   matchLabels:
#     foo: bar
  kubernetes:
    versions:
    - version: 1.12.1
    - version: 1.11.0
    - version: 1.10.6
    - version: 1.10.5
      expirationDate: 2020-04-05T01:02:03Z # optional
  machineImages:
  - name: coreos
    versions:
    - version: 2023.5.0
    - version: 1967.5.0
      expirationDate: 2020-04-05T08:00:00Z
  - name: ubuntu
    versions:
    - version: 18.04.201906170
  machineTypes:
  - name: m5.large
    cpu: "2"
    gpu: "0"
    memory: 8Gi
  # storage: 20Gi # optional (not needed in every environment, may only be specified if no volumeTypes have been specified)
    usable: true
  volumeTypes: # optional (not needed in every environment, may only be specified if no machineType has a `storage` field)
  - name: gp2
    class: standard
  - name: io1
    class: premium
  regions:
  - name: europe-central-1
    zones: # optional (not needed in every environment)
    - name: europe-central-1a
    - name: europe-central-1b
    - name: europe-central-1c
    # unavailableMachineTypes: # optional, list of machine types defined above that are not available in this zone
    # - m5.large
    # unavailableVolumeTypes: # optional, list of volume types defined above that are not available in this zone
    # - io1
# CA bundle that will be installed onto every shoot machine that is using this provider profile.
# caBundle: |
#   -----BEGIN CERTIFICATE-----
#   ...
#   -----END CERTIFICATE-----
  providerConfig:
    <some-provider-specific-cloudprofile-config>
    # We don't have concrete examples for every existing provider yet, but these are the proposals:
    #
    # Example for Alicloud:
    #
    # apiVersion: alicloud.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2023.5.0
    #   id: coreos_2023_4_0_64_30G_alibase_20190319.vhd
    #
    #
    # Example for AWS:
    #
    # apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 1967.5.0
    #   regions:
    #   - name: europe-central-1
    #     ami: ami-0f46c2ed46d8157aa
    #
    #
    # Example for Azure:
    #
    # apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 1967.5.0
    #   publisher: CoreOS
    #   offer: CoreOS
    #   sku: Stable
    # countFaultDomains:
    # - region: westeurope
    #   count: 2
    # countUpdateDomains:
    # - region: westeurope
    #   count: 5
    #
    #
    # Example for GCP:
    #
    # apiVersion: gcp.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2023.5.0
    #   image: projects/coreos-cloud/global/images/coreos-stable-2023-5-0-v20190312
    #
    #
    # Example for OpenStack:
    #
    # apiVersion: openstack.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2023.5.0
    #   image: coreos-2023.5.0
    # keyStoneURL: https://url-to-keystone/v3/
    # dnsServers:
    # - 10.10.10.10
    # - 10.10.10.11
    # dhcpDomain: foo.bar
    # requestTimeout: 30s
    # constraints:
    #   loadBalancerProviders:
    #   - name: haproxy
    #   floatingPools:
    #   - name: fip1
    #     loadBalancerClasses:
    #     - name: class1
    #       floatingSubnetID: 04eed401-f85f-4610-8041-c4835c4beea6
    #       floatingNetworkID: 23949a30-1cdd-4732-ba47-d03ced950acc
    #       subnetID: ac46c204-9d0d-4a4c-a90d-afefe40cfc35
    #
    #
    # Example for Packet:
    #
    # apiVersion: packet.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2079.3.0
    #   id: d61c3912-8422-4daf-835e-854efa0062e4

Seed resource

Special note: The proposal contains fields that are not yet existing in the current garden.sapcloud.io/v1beta1.Seed resource, but they should be implemented (open issues that require them are linked).

apiVersion: v1
kind: Secret
metadata:
  name: seed-secret
  namespace: garden
type: Opaque
data:
  kubeconfig: base64(kubeconfig-for-seed-cluster)

---
apiVersion: v1
kind: Secret
metadata:
  name: backup-secret
  namespace: garden
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13

---
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: seed1
spec:
  provider:
    type: <some-provider-name> # {aws,azure,gcp,...}
    region: europe-central-1
  secretRef:
    name: seed-secret
    namespace: garden
  # Motivation for DNS section: https://github.com/gardener/gardener/issues/201.
  dns:
    provider: <some-provider-name> # {aws-route53, google-clouddns, ...}
    secretName: my-dns-secret # must be in `garden` namespace
    ingressDomain: seed1.dev.example.com
  volume: # optional (introduced to get rid of `persistentvolume.garden.sapcloud.io/minimumSize` and `persistentvolume.garden.sapcloud.io/provider` annotations)
    minimumSize: 20Gi
    providers:
    - name: foo
      purpose: etcd-main
  networks: # Seed and Shoot networks must be disjunct
    nodes: 10.240.0.0/16
    pods: 10.241.128.0/17
    services: 10.241.0.0/17
  # Shoot default networks, see also https://github.com/gardener/gardener/issues/895.
  # shootDefaults:
  #   pods: 100.96.0.0/11
  #   services: 100.64.0.0/13
  taints:
  - key: seed.gardener.cloud/protected
  - key: seed.gardener.cloud/invisible
  blockCIDRs:
  - 169.254.169.254/32
  backup: # See https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.
    type: <some-provider-name> # {aws,azure,gcp,...}
  # region: eu-west-1
    secretRef:
      name: backup-secret
      namespace: garden
status:
  conditions:
  - lastTransitionTime: "2020-07-14T19:16:42Z"
    lastUpdateTime: "2020-07-14T19:18:17Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available
  gardener:
    id: 4c9832b3823ee6784064877d3eb10c189fc26e98a1286c0d8a5bc82169ed702c
    name: gardener-controller-manager-7fhn9ikan73n-7jhka
    version: 1.0.0
  observedGeneration: 1

Project resource

Special note: The members and viewers field of the garden.sapcloud.io/v1beta1.Project resource will be merged together into one members field. Every member will have a role that is either admin or viewer. This will allow us to add new roles without changing the API.

apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
  name: example
spec:
  description: Example project
  members:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: john.doe@example.com
    role: admin
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: joe.doe@example.com
    role: viewer
  namespace: garden-example
  owner:
    apiGroup: rbac.authorization.k8s.io
    kind: User
    name: john.doe@example.com
  purpose: Example project
status:
  observedGeneration: 1
  phase: Ready

SecretBinding resource

Special note: No modifications needed compared to the current garden.sapcloud.io/v1beta1.SecretBinding resource.

apiVersion: v1
kind: Secret
metadata:
  name: secret1
  namespace: garden-core
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-infrastructure.yaml#L14-L15
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-infrastructure.yaml#L14-L17
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-infrastructure.yaml#L14
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-infrastructure.yaml#L15-L18
  # https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-infrastructure.yaml#L14-L15
  #
  # If you use your own domain (not the default domain of your landscape) then you have to add additional keys to this secret.
  # The reason is that the DNS management is not part of the Gardener core code base but externalized, hence, it might use other
  # key names than Gardener itself.
  # The actual values here depend on the DNS extension that is installed to your landscape.
  # For example, check out https://github.com/gardener/external-dns-management and find a lot of example secret manifests here:
  # https://github.com/gardener/external-dns-management/tree/master/examples

---
apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: secretbinding1
  namespace: garden-core
secretRef:
  name: secret1
# namespace: namespace-other-than-'garden-core' // optional
quotas: []
# - name: quota-1
# # namespace: namespace-other-than-'garden-core' // optional

Quota resource

Special note: No modifications needed compared to the current garden.sapcloud.io/v1beta1.Quota resource.

apiVersion: core.gardener.cloud/v1beta1
kind: Quota
metadata:
  name: trial-quota
  namespace: garden-trial
spec:
  scope:
    apiGroup: core.gardener.cloud
    kind: Project
# clusterLifetimeDays: 14
  metrics:
    cpu: "200"
    gpu: "20"
    memory: 4000Gi
    storage.standard: 8000Gi
    storage.premium: 2000Gi
    loadbalancer: "100"

BackupBucket resource

Special note: This new resource is cluster-scoped.

# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.

apiVersion: v1
kind: Secret
metadata:
  name: backup-operator-provider
  namespace: backup-garden
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13

---
apiVersion: core.gardener.cloud/v1beta1
kind: BackupBucket
metadata:
  name: <seed-provider-type>-<region>-<seed-uid>
  ownerReferences:
  - kind: Seed
    name: seed1
spec:
  provider:
    type: <some-provider-name> # {aws,azure,gcp,...}
    region: europe-central-1
  seed: seed1
  secretRef:
    name: backup-operator-provider
    namespace: backup-garden
status:
  lastOperation:
    description: Backup bucket has been successfully reconciled.
    lastUpdateTime: '2020-04-13T14:34:27Z'
    progress: 100
    state: Succeeded
    type: Reconcile
  observedGeneration: 1

BackupEntry resource

Special note: This new resource is cluster-scoped.

# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.

apiVersion: v1
kind: Secret
metadata:
  name: backup-operator-provider
  namespace: backup-garden
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13

---
apiVersion: core.gardener.cloud/v1beta1
kind: BackupEntry
metadata:
  name: shoot--core--crazy-botany--3ef42
  namespace: garden-core
  ownerReferences:
  - apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: false
    controller: true
    kind: Shoot
    name: crazy-botany
    uid: 19a9538b-5058-11e9-b5a6-5e696cab3bc8
spec:
  bucketName: cloudprofile1-random[:5]
  seed: seed1
status:
  lastOperation:
    description: Backup entry has been successfully reconciled.
    lastUpdateTime: '2020-04-13T14:34:27Z'
    progress: 100
    state: Succeeded
    type: Reconcile
  observedGeneration: 1

Shoot resource

Special notes:

  • kubelet configuration in the worker pools may override the default .spec.kubernetes.kubelet configuration (that applies for all worker pools if not overridden).
  • Moved remaining control plane configuration to new .spec.provider.controlplane section.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: crazy-botany
  namespace: garden-core
spec:
  secretBindingName: secretbinding1
  cloudProfileName: cloudprofile1
  region: europe-central-1
# seedName: seed1
  provider:
    type: <some-provider-name> # {aws,azure,gcp,...}
    infrastructureConfig:
      <some-provider-specific-infrastructure-config>
      # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-infrastructure.yaml#L56-L64
      # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L43-L53
      # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-infrastructure.yaml#L63-L71
      # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-infrastructure.yaml#L53-L57
      # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-infrastructure.yaml#L56-L64
      # https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-infrastructure.yaml#L48-L49
    controlPlaneConfig:
      <some-provider-specific-controlplane-config>
      # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-controlplane.yaml#L60-L65
      # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-controlplane.yaml#L60-L64
      # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-controlplane.yaml#L61-L66
      # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-controlplane.yaml#L59-L64
      # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-controlplane.yaml#L64-L70
      # https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-controlplane.yaml#L60-L61
    workers:
    - name: cpu-worker
      minimum: 3
      maximum: 5
    # maxSurge: 1
    # maxUnavailable: 0
      machine:
        type: m5.large
        image:
          name: <some-os-name>
          version: <some-os-version>
        # providerConfig:
        #   <some-os-specific-configuration>
      volume:
        type: gp2
        size: 20Gi
    # providerConfig:
    #   <some-provider-specific-worker-config>
    # labels:
    #   key: value
    # annotations:
    #   key: value
    # taints: # See also https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    # - key: foo
    #   value: bar
    #   effect: NoSchedule
    # caBundle: <some-ca-bundle-to-be-installed-to-all-nodes-in-this-pool>
    # kubernetes:
    #   kubelet:
    #     cpuCFSQuota: true
    #     cpuManagerPolicy: none
    #     podPidsLimit: 10
    #     featureGates:
    #       SomeKubernetesFeature: true
    # zones: # optional, only relevant if the provider supports availability zones
    # - europe-central-1a
    # - europe-central-1b
  kubernetes:
    version: 1.15.1
  # allowPrivilegedContainers: true # 'true' means that all authenticated users can use the "gardener.privileged" PodSecurityPolicy, allowing full unrestricted access to Pod features.
  # kubeAPIServer:
  #   featureGates:
  #     SomeKubernetesFeature: true
  #   runtimeConfig:
  #     scheduling.k8s.io/v1alpha1: true
  #   oidcConfig:
  #     caBundle: |
  #       -----BEGIN CERTIFICATE-----
  #       Li4u
  #       -----END CERTIFICATE-----
  #     clientID: client-id
  #     groupsClaim: groups-claim
  #     groupsPrefix: groups-prefix
  #     issuerURL: https://identity.example.com
  #     usernameClaim: username-claim
  #     usernamePrefix: username-prefix
  #     signingAlgs: RS256,some-other-algorithm
  #-#-# only usable with Kubernetes >= 1.11
  #     requiredClaims:
  #       key: value
  #   admissionPlugins:
  #   - name: PodNodeSelector
  #     config: |
  #       podNodeSelectorPluginConfig:
  #         clusterDefaultNodeSelector: <node-selectors-labels>
  #         namespace1: <node-selectors-labels>
  #         namespace2: <node-selectors-labels>
  #   auditConfig:
  #     auditPolicy:
  #       configMapRef:
  #         name: auditpolicy
  # kubeControllerManager:
  #   featureGates:
  #     SomeKubernetesFeature: true
  #   horizontalPodAutoscaler:
  #     syncPeriod: 30s
  #     tolerance: 0.1
  #-#-# only usable with Kubernetes < 1.12
  #     downscaleDelay: 15m0s
  #     upscaleDelay: 1m0s
  #-#-# only usable with Kubernetes >= 1.12
  #     downscaleStabilization: 5m0s
  #     initialReadinessDelay: 30s
  #     cpuInitializationPeriod: 5m0s
  # kubeScheduler:
  #   featureGates:
  #     SomeKubernetesFeature: true
  # kubeProxy:
  #   featureGates:
  #     SomeKubernetesFeature: true
  #   mode: IPVS
  # kubelet:
  #   cpuCFSQuota: true
  #   cpuManagerPolicy: none
  #   podPidsLimit: 10
  #   featureGates:
  #     SomeKubernetesFeature: true
  # clusterAutoscaler:
  #   scaleDownUtilizationThreshold: 0.5
  #   scaleDownUnneededTime: 30m
  #   scaleDownDelayAfterAdd: 60m
  #   scaleDownDelayAfterFailure: 10m
  #   scaleDownDelayAfterDelete: 10s
  #   scanInterval: 10s
  dns:
    # When the shoot shall use a cluster domain no domain and no providers need to be provided - Gardener will
    # automatically compute a correct domain.
    domain: crazy-botany.core.my-custom-domain.com
    providers:
    - type: aws-route53
      secretName: my-custom-domain-secret
      domains:
        include:
        - my-custom-domain.com
        - my-other-custom-domain.com
        exclude:
        - yet-another-custom-domain.com
      zones:
        include:
        - zone-id-1
        exclude:
        - zone-id-2
  extensions:
  - type: foobar
  # providerConfig:
  #   apiVersion: foobar.extensions.gardener.cloud/v1alpha1
  #   kind: FooBarConfiguration
  #   foo: bar
  networking:
    type: calico
    pods: 100.96.0.0/11
    services: 100.64.0.0/13
    nodes: 10.250.0.0/16
  # providerConfig:
  #   apiVersion: calico.extensions.gardener.cloud/v1alpha1
  #   kind: NetworkConfig
  #   ipam:
  #     type: host-local
  #     cidr: usePodCIDR
  #   backend: bird
  #   typha:
  #     enabled: true
  # See also: https://github.com/gardener/gardener/blob/master/docs/proposals/03-networking.md
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: true
      machineImageVersion: true
# hibernation:
#   enabled: false
#   schedules:
#   - start: "0 20 * * *" # Start hibernation every day at 8PM
#     end: "0 6 * * *"    # Stop hibernation every day at 6AM
#     location: "America/Los_Angeles" # Specify a location for the cron to run in
  addons:
    nginx-ingress:
      enabled: false
    # loadBalancerSourceRanges: []
    kubernetes-dashboard:
      enabled: true
    # authenticationMode: basic # allowed values: basic,token
status:
  conditions:
  - type: APIServerAvailable
    status: 'True'
    lastTransitionTime: '2020-01-30T10:38:15Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: HealthzRequestFailed
    message: API server /healthz endpoint responded with success status code. [response_time:3ms]
  - type: ControlPlaneHealthy
    status: 'True'
    lastTransitionTime: '2020-04-02T05:18:58Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: ControlPlaneRunning
    message: All control plane components are healthy.
  - type: EveryNodeReady
    status: 'True'
    lastTransitionTime: '2020-04-01T16:27:21Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: EveryNodeReady
    message: Every node registered to the cluster is ready.
  - type: SystemComponentsHealthy
    status: 'True'
    lastTransitionTime: '2020-04-03T18:26:28Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: SystemComponentsRunning
    message: All system components are healthy.
  gardener:
    id: 4c9832b3823ee6784064877d3eb10c189fc26e98a1286c0d8a5bc82169ed702c
    name: gardener-controller-manager-7fhn9ikan73n-7jhka
    version: 1.0.0
  lastOperation:
    description: Shoot cluster state has been successfully reconciled.
    lastUpdateTime: '2020-04-13T14:34:27Z'
    progress: 100
    state: Succeeded
    type: Reconcile
  observedGeneration: 1
  seed: seed1
  hibernated: false
  technicalID: shoot--core--crazy-botany
  uid: d8608cfa-2856-11e8-8fdc-0a580af181af

Plant resource

apiVersion: v1
kind: Secret
metadata:
  name: crazy-plant-secret
  namespace: garden-core
type: Opaque
data:
  kubeconfig: base64(kubeconfig-for-plant-cluster)

---
apiVersion: core.gardener.cloud/v1beta1
kind: Plant
metadata:
  name: crazy-plant
  namespace: garden-core
spec:
  secretRef:
    name: crazy-plant-secret
  endpoints:
  - name: Cluster GitHub repository
    purpose: management
    url: https://github.com/my-org/my-cluster-repo
  - name: GKE cluster page
    purpose: management
    url: https://console.cloud.google.com/kubernetes/clusters/details/europe-west1-b/plant?project=my-project&authuser=1&tab=details
status:
  clusterInfo:
    provider:
      type: gce
      region: europe-west4-c
    kubernetes:
      version: v1.11.10-gke.5
  conditions:
  - lastTransitionTime: "2020-03-01T11:31:37Z"
    lastUpdateTime: "2020-04-14T18:00:29Z"
    message: API server /healthz endpoint responded with success status code. [response_time:8ms]
    reason: HealthzRequestSucceeded
    status: "True"
    type: APIServerAvailable
  - lastTransitionTime: "2020-04-01T06:26:56Z"
    lastUpdateTime: "2020-04-14T18:00:29Z"
    message: Every node registered to the cluster is ready.
    reason: EveryNodeReady
    status: "True"
    type: EveryNodeReady

17 - Observability Stack - Migrating to the prometheus-operator and fluent-bit operator

GEP-19: Observability Stack - Migrating to the prometheus-operator and fluent-bit-operator


Summary

As Gardener has grown, the observability configurations have also evolved with it. Many components must be monitored, and the configuration for these components must also be managed. This poses a challenge because the configuration is distributed across the Gardener project among different folders and even different repositories (for example extensions). While it is not possible to centralize the configuration, it is possible to improve the developer experience and the general stability of the monitoring and logging. This can be done by introducing Kubernetes operators such as the prometheus-operator for the monitoring stack and the fluent-bit-operator for the logging stack. These operators make it easier for monitoring and logging configurations to be discovered and picked up with the use of the respective custom resources:

  1. Prometheus Custom Resources provided by the prometheus-operator.
  2. Fluent-bit Custom Resources provided by the fluent-bit-operator.

These resources can also be directly referenced in Go and deployed by their respective components, instead of creating .yaml files in Go or templating charts. With the addition of the operators it should be easier to provide flexible deployment layouts of the components or even introduce new features, such as Thanos in the case of the prometheus-operator.

Motivation

Simplify monitoring and logging updates and extensions with the use of the prometheus-operator and fluent-bit-operator. The current extension contract is described here. This document aims to define a new contract.

Make it easier to add new monitoring and logging features and make new changes. For example, when using the operators, components can bring their own monitoring configuration and specify exactly how they should be monitored without needing to add this configuration directly into Prometheus. Likewise, components can define how their logs shall be parsed or enhanced before being indexed.

Both operators handle validation of the respective configurations. It will be more difficult to give the Prometheus or Fluent-Bit apps an invalid config.

Goals

  • Migrate from the current monitoring stack to the prometheus-operator.

  • Improve the monitoring extensibility and improve developer experience when adding or editing configuration. This includes the monitoring extensions in addition to core Gardener components.

  • Provide a clear direction on how monitoring resources should be managed. Currently, there are many ways that monitoring configuration is being deployed and this should be unified.

  • Improve how dashboards are discovered and provisioned for Grafana. Currently, all dashboards are appended into a single configmap. This can be an issue if the maximum configmap size of 1MiB is ever exceeded.

  • Migrate the life-cycle of log shippers to an operator.

  • Introduce a stable, declarative API for defining filters and parsers for gardener core components and extensions.

  • Fully match the current log shipper configurations in Gardener.

Non-Goals

  • Feature parity between the current solution and the one proposed in this GEP. The new stack should provide similar monitoring coverage, but it will be very difficult to evaluate if there is feature parity. Perhaps some features will be missing, but others may be added.

Proposal

Today, Gardener provides monitoring and logging for shoot clusters (i.e. system components and the control plane) and for the seed cluster. The proposal is to migrate these stacks to use the prometheus-operator and the fluent-bit-operator, as outlined below:

API

Use the API provided by the prometheus-operator and create Go structs.

Prometheus Operator CRDs

Deploy the prometheus-operator and its CRDs. These components can be deployed via ManagedResources. The operator itself and some other components outlined in the GEP will be deployed in a new namespace called monitoring. The CRDs for the prometheus-operator and the operator itself can be found here. This step is a prerequisite for all other steps.
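
A minimal sketch of such a ManagedResource, assuming the rendered operator manifests are stored in a Secret named managedresource-prometheus-operator (name and namespace are illustrative):

apiVersion: resources.gardener.cloud/v1alpha1
kind: ManagedResource
metadata:
  name: prometheus-operator
  namespace: garden # illustrative; depends on where the seed system components are managed
spec:
  class: seed # reconciled by the gardener-resource-manager responsible for seed resources
  keepObjects: false
  secretRefs:
  - name: managedresource-prometheus-operator # assumed Secret holding the rendered manifests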

Shoot Monitoring

Gardener will create a monitoring stack similar to the current one with the prometheus-operator custom resources.

  1. Most of the shoot monitoring is deployed via this chart. The goal is to create a similar stack, but not necessarily with feature parity, using the prometheus-operator.

    • An example Prometheus object that would be deployed in a shoot’s control plane.
    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      labels:
        app: prometheus
      name: prometheus
      namespace: shoot--project--name
    spec:
      enableAdminAPI: false
      logFormat: logfmt
      logLevel: info
      image: image:tag
      paused: false
      portName: web
      replicas: 1
      retention: 30d
      routePrefix: /
      serviceAccountName: prometheus
      serviceMonitorNamespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - shoot--project--name
      podMonitorNamespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - shoot--project--name
      ruleNamespaceSelector:
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values:
          - shoot--project--name
      serviceMonitorSelector:
        matchLabels:
          monitoring.gardener.cloud/monitoring-target: shoot-control-plane
      podMonitorSelector:
        matchLabels:
          monitoring.gardener.cloud/monitoring-target: shoot-control-plane
      storage:
        volumeClaimTemplate:
          spec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 20Gi
      version: v2.35.0
    
  2. Contract between the shoot Prometheus and its configuration.

    • Prometheus can discover *Monitors in different namespaces and also by using labels.

    • In some cases, specific configuration is required (e.g. specific configuration due to K8s versions). In this case, the configuration will also be deployed in the shoot’s namespace and Prometheus will also be able to discover this configuration.

    • Prometheus must also distinguish between *Monitors relevant for shoot control plane and shoot targets. This can be done with a serviceMonitorSelector and podMonitorSelector where monitoring.gardener.cloud/monitoring-target=shoot-control-plane. For a ServiceMonitor it would look like this:

      serviceMonitorSelector:
        matchLabels:
          monitoring.gardener.cloud/monitoring-target: shoot-control-plane
      
    • In addition to a Prometheus, the configuration must also be created. To do this, each job in the Prometheus configuration will need to be replaced with either a ServiceMonitor, PodMonitor, or Probe. This ServiceMonitor will be picked up by the Prometheus defined in the previous step. This ServiceMonitor will scrape any service that has the label app=prometheus on the port called metrics.

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        labels:
          monitoring.gardener.cloud/monitoring-target: shoot-control-plane
        name: prometheus-job
        namespace: shoot--project--name
      spec:
        endpoints:
        - port: metrics
        selector:
          matchLabels:
            app: prometheus
      
  3. Prometheus needs to discover targets running in the shoot cluster. Normally, this is done by changing the api_server field in the config (example). This is currently not possible with the prometheus-operator, but there is an open issue.

    • Preferred approach: A second Prometheus can be created that is running in agent mode. This Prometheus can also be deployed/managed by the prometheus-operator. The agent Prometheus can be configured to use the API Server for the shoot cluster and use service discovery in the shoot. The metrics can then be written via remote write to the “normal” Prometheus or federated. This Prometheus will also discover configuration in the same way as the other Prometheus with 1 difference. Instead of discovering configuration with the label monitoring.gardener.cloud/monitoring-target=shoot-control-plane it will find configuration with the label monitoring.gardener.cloud/monitoring-target=shoot.

    • Alternative: Use additional scrape config. In this case, the Prometheus config snippet is put into a secret and the prometheus-operator will append it to the config. The downside here is that it is only possible to have 1 additional-scrape-config per Prometheus. This could be an issue if multiple components will need to use this.

  4. Deploy an optional Alertmanager whenever the end-user specifies alerting (a minimal sketch is shown after this list).

    • Create an Alertmanager resource

    • Create the AlertmanagerConfig

  5. Health checks - The gardenlet periodically checks the status of the Prometheus StatefulSet among other components in the shoot care controller. The gardenlet knows which component to check based on labels. Since the gardenlet is no longer deploying the StatefulSet directly and rather a Prometheus resource, it does not know which labels are attached to the Prometheus StatefulSet. However, the prometheus-operator will create StatefulSets with the same labelset that the Prometheus resource has. The gardenlet will be able to discover the StatefulSet because it knows the labelset of the Prometheus resource.
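
A minimal sketch of the Alertmanager resources mentioned in step 4 could look as follows (the label scheme, the receiver and the v1alpha1 AlertmanagerConfig API version are illustrative assumptions):

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: shoot--project--name
spec:
  replicas: 1
  alertmanagerConfigSelector: # discover AlertmanagerConfigs by label
    matchLabels:
      monitoring.gardener.cloud/alertmanager-target: shoot # assumed label scheme

---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: shoot-alerting
  namespace: shoot--project--name
  labels:
    monitoring.gardener.cloud/alertmanager-target: shoot # assumed label scheme
spec:
  route:
    receiver: email
  receivers:
  - name: email
    emailConfigs:
    - to: operator@example.com # e.g. taken from the shoot's alerting configuration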

Seed Monitoring

There is a monitoring stack deployed for each seed cluster. A similar setup must also be provided using the prometheus-operator. The steps for this are very similar to the shoot monitoring.

  • Replace the Prometheis and their configuration.

  • Replace the alertmanager and its configuration.

BYOMC (Bring your own monitoring configuration)

  • In general, components should bring their own monitoring configuration. Gardener currently does this for some components such as the gardener-resource-manager. This configuration is then appended to the existing Prometheus configuration. The goal is to replace the inline YAML with PodMonitors and/or ServiceMonitors.

  • If alerting rules or recording rules need to be created for a component, it can bring its own PrometheusRules.

  • The same thing can potentially be done for components such as kube-state-metrics which are still currently deployed via the seed-bootstrap.

  • If an extension requires monitoring it must bring its own configuration in the form of PodMonitors, ServiceMonitors or PrometheusRules.

    • Adding monitoring config to the control plane: In some scenarios extensions will add components to the control plane that need to be monitored. An example of this is the provider-aws extension that deploys a cloud-controller-manager. In the current setup, if an extension needs something to be monitored in the control plane, it brings its own configmap with Prometheus config. The configmap has the label extensions.gardener.cloud/configuration=monitoring to specify that the config should be appended to the current Prometheus config. Below is an example of what this looks like for the cloud-controller-manager.

      apiVersion: v1
      kind: ConfigMap
      metadata:
        labels:
          extensions.gardener.cloud/configuration: monitoring
        name: cloud-controller-manager-observability-config
        namespace: shoot--project--name
      data:
        alerting_rules: |
          cloud-controller-manager.rules.yaml: |
            groups:
            - name: cloud-controller-manager.rules
              rules:
              - alert: CloudControllerManagerDown
                expr: absent(up{job="cloud-controller-manager"} == 1)
                for: 15m
                labels:
                  service: cloud-controller-manager
                  severity: critical
                  type: seed
                  visibility: all
                annotations:
                  description: All infrastructure specific operations cannot be completed (e.g. creating loadbalancers or persistent volumes).
                  summary: Cloud controller manager is down.
        observedComponents: |
          observedPods:
          - podPrefix: cloud-controller-manager
            isExposedToUser: true
        scrape_config: |
          - job_name: cloud-controller-manager
            scheme: https
            tls_config:
              insecure_skip_verify: true
            authorization:
              type: Bearer
              credentials_file: /var/run/secrets/gardener.cloud/shoot/token/token
            honor_labels: false
            kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names: [shoot--project--name]
            relabel_configs:
            - source_labels:
              - __meta_kubernetes_service_name
              - __meta_kubernetes_endpoint_port_name
              action: keep
              regex: cloud-controller-manager;metrics
            # common metrics
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - source_labels: [ __meta_kubernetes_pod_name ]
              target_label: pod
            metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: ^(rest_client_requests_total|process_max_fds|process_open_fds)$
              action: keep    
      
  • This configmap will be split up into two separate resources: one for the alerting_rules and another for the scrape_config. The alerting_rules should be converted into a PrometheusRule object. Since the scrape_config only has one job_name, we will only need one ServiceMonitor or PodMonitor for this. The following is an example of how this could be done. There are multiple ways to get the same results and this is just one example.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      labels:
        cluster: shoot--project--name
      name: cloud-controller-manager
      namespace: shoot--project--name
    spec:
      endpoints:
      - port: metrics # scrape the service port with name `metrics`
        bearerTokenFile: /var/run/secrets/gardener.cloud/shoot/token/token # could also be replaced with a secret
        metricRelabelings:
        - sourceLabels: [ __name__ ]
          regex: ^(rest_client_requests_total|process_max_fds|process_open_fds)$
          action: keep
      namespaceSelector:
        matchNames:
        - shoot--project--name
      selector:
        matchLabels:
          role: cloud-controller-manager # discover any service with this label
    
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        cluster: shoot--project--name
      name: cloud-controller-manager-rules
      namespace: shoot--project--name
    spec:
      groups:
      - name: cloud-controller-manager.rules
        rules:
        - alert: CloudControllerManagerDown
          expr: absent(up{job="cloud-controller-manager"} == 1)
          for: 15m
          labels:
            service: cloud-controller-manager
            severity: critical
            type: seed
            visibility: all
          annotations:
            description: All infrastructure specific operations cannot be completed (e.g. creating loadbalancers or persistent volumes).
            summary: Cloud controller manager is down.
    
  • Adding meta monitoring for the extensions: If the extensions need to be scraped for monitoring, the extensions must bring their own Custom Resources.

    • Currently the contract between extensions and gardener is that anything that needs to be scraped must have the labels: prometheus.io/scrape=true and prometheus.io/port=<port>. This is defined here. This is something that we can still support with a PodMonitor that will scrape any pod in a specified namespace with these labels.
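
A hedged sketch of such a PodMonitor, assuming the prometheus.io/scrape and prometheus.io/port labels are kept and the scrape port is derived via relabeling (names are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: extension-targets # illustrative name
  namespace: extension-namespace # illustrative namespace
spec:
  namespaceSelector:
    matchNames:
    - extension-namespace
  selector:
    matchLabels:
      prometheus.io/scrape: "true"
  podMetricsEndpoints:
  - relabelings:
    # rewrite the scrape address to use the port from the prometheus.io/port label
    - sourceLabels: [__address__, __meta_kubernetes_pod_label_prometheus_io_port]
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      targetLabel: __address__
      action: replace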

Grafana Sidecar

Add a sidecar to Grafana that will pick up dashboards and provision them. Each dashboard gets its own configmap.

  • Grafana in the control plane

    • Most dashboards provisioned by Grafana are the same for each shoot cluster. To avoid unnecessary duplication of configmaps, the dashboards could be added once in a single namespace. These “common” dashboards can then be discovered by each Grafana and provisioned.

    • In some cases, dashboards are more “specific” because they are related to a certain Kubernetes version.

    • Contract between dashboards in configmaps and the Grafana sidecar.

      • Label schema: monitoring.gardener.cloud/dashboard-{seed,shoot,shoot-user}=true

      • Each common dashboard will be deployed in the monitoring namespace as a configmap. If the dashboard should be provisioned by the user Grafana in a shoot cluster it should have the label monitoring.gardener.cloud/dashboard-shoot-user=true. For dashboards that should be provisioned by the operator Grafana the label monitoring.gardener.cloud/dashboard-shoot=true is required.

      • Each specific dashboard will be deployed in the shoot namespace. The configmap will use the same label scheme.

      • The Grafana sidecar must be configured with:

        env:
        - name: METHOD
          value: WATCH
        - name: LABEL
          value: monitoring.gardener.cloud/dashboard-shoot # monitoring.gardener.cloud/dashboard-shoot-user for user Grafana
        - name: FOLDER
          value: /tmp/dashboards
        - name: NAMESPACE
          value: monitoring,<shoot namespace>
      
  • Grafana in the seed

    • There is also a Grafana deployed in the seed. This Grafana will be configured in a very similar way, except it will discover dashboards with a different label.

    • The seed Grafana can discover configmaps labeled with monitoring.gardener.cloud/dashboard-seed.

    • The sidecar will be configured in a similar way:

      env:
      - name: METHOD
        value: WATCH
      - name: LABEL
        value: monitoring.gardener.cloud/dashboard-seed
      - name: FOLDER
        value: /tmp/dashboards
      - name: NAMESPACE
        value: monitoring,garden
    
  • Dashboards can have multiple labels and be provisioned in a seed and/or shoot Grafana.
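
For illustration, a common dashboard following the label contract above could be shipped like this (the name and dashboard content are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-component-dashboard # placeholder name
  namespace: monitoring # "common" dashboards live in the monitoring namespace
  labels:
    monitoring.gardener.cloud/dashboard-shoot: "true" # picked up by the operator Grafana
    monitoring.gardener.cloud/dashboard-seed: "true" # also provisioned by the seed Grafana
data:
  my-component-dashboard.json: |
    {"title": "My Component", "panels": []}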

Fluent-bit Operator CRDs

The fluent-bit operator oversees two types of resources:

  1. FluentBit resources defining the properties of the fluent-bit daemonset.

  2. ClusterInputs, ClusterOutputs, ClusterFilters and ClusterParsers defining the fluent-bit app configuration.

An example FluentBit custom resource:

apiVersion: fluentbit.fluent.io/v1alpha2
kind: FluentBit
metadata:
  name: fluent-bit
  namespace: garden
  labels:
    app.kubernetes.io/name: fluent-bit
spec:
  image: kubesphere/fluent-bit:v1.9.9
  fluentBitConfigName: fluent-bit-config
  # workload properties
  annotations: {}
  resources: {}
  nodeSelector: {}
  tolerations: {}
  priorityClassName: ""
  ...
  # fluent-bit configurations
  # container runtime output path
  containerLogRealPath: ""
  # Recommended in case of input tail plugin
  # holds persisted events in fluent-bit supporting re-emitting
  positionDB: {}

The FluentBit resource carries the usual Kubernetes workload properties, such as resource definitions, node selectors, pod/node affinity and so on. Its sole purpose is to construct the fluent-bit daemonset spec managing the fluent-bit application instances on the cluster nodes.

The upstream project has been enhanced by Gardener and now supports adding fluent-bit custom plugins. The latter is required by Gardener to maintain the custom ordering plugin. In its current state, the fluent-bit operator can be used by Gardener “as is”, without the need to fork it.

Fluent-bit filters and parsers

The second major function of the fluent-bit operator is to compile a valid application configuration by aggregating declarations supplied by fluent-bit custom resources such as ClusterFilters, ClusterParsers and so on, as shown by the example below.

  1. Fluent-Bit Inputs, Outputs, Filters and Parsers
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFluentBitConfig
metadata:
  name: fluent-bit-config
  labels:
    app.kubernetes.io/name: fluent-bit
spec:
  service:
    parsersFile: parsers.conf
  filterSelector:
    matchLabels:
      fluentbit.fluent.io/enabled: "true"
  parserSelector:
    matchLabels:
      fluentbit.fluent.io/enabled: "true"

---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: parser
  labels:
    fluentbit.fluent.io/enabled: "true"
spec:
  match: "*"
  filters:
  - parser:
      keyName: log
      parser: my-regex

---
apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterParser
metadata:
  name: my-regex
  labels:
    fluentbit.fluent.io/enabled: "true"
spec:
  regex:
    timeKey: time
    timeFormat: "%d/%b/%Y:%H:%M:%S %z"
    types: "code:integer size:integer"
    regex: '^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$'

In this example, we have a fluent-bit filter plugin of type parser and the corresponding regex parser. In the Gardener context, these configuration resources are supplied by the core components and extensions.

BYOLC (Bring your own logging configuration)

Since fluent-bit uses the tail input plugin and reads any container output under /var/log/pods, it processes the log output of all workloads. In this case, any workload may bring its own set of filters and parsers if needed, using the declarative APIs supported by the operator.

Migration

Prometheus Operator

  1. Deploy the prometheus-operator and its custom resources.
  2. Delete the old monitoring-stack.
  3. Configure Prometheus to “reuse” the pv from the old Prometheus’s pvc. An init container will be temporarily needed for this migration. This ensures that no data is lost and provides a clean migration.
  4. Any extension or monitoring configuration that is not migrated to the prometheus-operator right away will be collected and added to an additionalScrapeConfig. Once all extensions and components have migrated, this can be dropped.
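
A minimal sketch of step 4, assuming the not-yet-migrated snippets are collected into a Secret which the Prometheus resource references via additionalScrapeConfigs (names and the job are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
  namespace: shoot--project--name
stringData:
  prometheus.yaml: |
    - job_name: not-yet-migrated-component # placeholder job copied from the old config
      static_configs:
      - targets: ["not-yet-migrated-component:8080"]

---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: shoot--project--name
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus.yaml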

Fluent-bit Operator

  1. Add fluent-bit operator CRDs in Gardener
  2. Add ClusterFilters and ClusterParsers resources in all extensions which are deploying a ConfigMap with the label extensions.gardener.cloud/configuration: logging (see the sketch after this list)
  3. Add the Fluent operator in Gardener in place of fluent-bit
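
A hedged sketch for step 2, replacing such a logging ConfigMap with operator resources (the component name and tag match are illustrative; the referenced parser would follow the pattern of the my-regex example above):

apiVersion: fluentbit.fluent.io/v1alpha2
kind: ClusterFilter
metadata:
  name: my-extension-controller # illustrative extension component
  labels:
    fluentbit.fluent.io/enabled: "true" # picked up via the filterSelector shown earlier
spec:
  match: kubernetes.*my-extension-controller* # assumed tag layout
  filters:
  - parser:
      keyName: log
      parser: my-extension-parser # a ClusterParser brought by the same extension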

Alternatives

  1. Continue using the current setup.

18 - Reversed Cluster VPN

GEP-14: Reversed Cluster VPN


Motivation

It is necessary to describe the current VPN solution and outline its shortcomings in order to motivate this proposal.

Problem Statement

Today’s Gardener cluster VPN solution has several issues including:

  1. Connection establishment is always from the seed cluster to the shoot cluster. This means that there needs to be connectivity both ways which is not desirable in many cases (OpenStack, VMware) and causes high effort in firewall configuration or extra infrastructure. These firewall configurations are prohibited in some cases due to security policies.
  2. Shoot clusters must provide a VPN endpoint. This means extra cost for the endpoint (roughly €20/month on hyperscalers) or consumption of scarce resources (limited number of VMware NSX-T load balancers).

A first implementation has been provided to resolve the issues with the konnectivity server. As we did find several shortcomings with the underlying technology component, the apiserver-network-proxy, we believe that this is not a suitable way ahead. We have opened an issue and provided two solution proposals to the community. We do see some remedies, e.g. using the QUIC protocol instead of GRPC, but we (a) consider the implementation effort significantly higher compared to this proposal and (b) would use an experimental protocol to solve a problem that can also be solved with existing and proven core network technologies.

We will therefore not continue to invest into this approach. We will however research a similar approach (see below in “Further Research”).

Current Solution Outline

The current solution consists of multiple VPN connections from each API server pod and the Prometheus pod of a control plane to an OpenVPN server running in the shoot cluster. This OpenVPN server is exposed via a load balancer that must have an IP address which is reachable from the seed cluster. The routing in the seed cluster pods is configured to route all traffic for the node, pod, and service ranges to the shoot cluster. This means that there is no address overlap allowed between seed- and shoot cluster node, pod, and service ranges.

In the seed cluster the vpn-seed container is a sidecar to the kube-apiserver and prometheus pods. OpenVPN acts as a TCP client connecting to an OpenVPN TCP server. This is not optimal (e.g. tunneling TCP over TCP is discouraged) but at the time of development there was no UDP load balancer available on at least one of the major hyperscalers. Connectivity could have been switched to UDP later but the development effort was not spent.

The solution is depicted in this diagram:

alt text

These are the essential parts of the OpenVPN client configuration in the vpn-seed sidecar container:

# use TCP instead of UDP (commonly not supported by load balancers)
proto tcp-client

[...]

# get all routing information from server
pull

tls-client
key "/srv/secrets/vpn-seed/tls.key"
cert "/srv/secrets/vpn-seed/tls.crt"
ca "/srv/secrets/vpn-seed/ca.crt"

tls-auth "/srv/secrets/tlsauth/vpn.tlsauth" 1
cipher AES-256-CBC

# https://openvpn.net/index.php/open-source/documentation/howto.html#mitm
remote-cert-tls server

# pull filter
pull-filter accept "route 100.64.0.0 255.248.0.0"
pull-filter accept "route 100.96.0.0 255.224.0.0"
pull-filter accept "route 10.1.60.0 255.255.252.0"
pull-filter accept "route 192.168.123."
pull-filter ignore "route"
pull-filter ignore redirect-gateway
pull-filter ignore route-ipv6
pull-filter ignore redirect-gateway-ipv6

Encryption is based on SSL certificates with an additional HMAC signature to all SSL/TLS handshake packets. As multiple clients connect to the OpenVPN server in the shoot cluster, all clients must be assigned a unique IP address. This is done by the OpenVPN server pushing that configuration to the client (keyword pull). As this is potentially problematic because the OpenVPN server runs in an untrusted environment there are pull filters denying all but necessary routes for the pod, service, and node networks.

The OpenVPN server running in the shoot cluster is configured as follows:

mode server
tls-server
proto tcp4-server
dev tun0

[...]

server 192.168.123.0 255.255.255.0

push "route 10.243.0.0 255.255.128.0"
push "route 10.243.128.0 255.255.128.0"

duplicate-cn

key "/srv/secrets/vpn-shoot/tls.key"
cert "/srv/secrets/vpn-shoot/tls.crt"
ca "/srv/secrets/vpn-shoot/ca.crt"
dh "/srv/secrets/dh/dh2048.pem"

tls-auth "/srv/secrets/tlsauth/vpn.tlsauth" 0
push "route 10.242.0.0 255.255.0.0"

It is a TCP TLS server and configured to automatically assign IP addresses for OpenVPN clients (server directive). In addition, it pushes the shoot cluster node-, pod-, and service ranges to the clients running in the seed cluster (push directive).

Note: The network mesh spanned by OpenVPN uses the network range 192.168.123.0 - 192.168.123.255. This network range cannot be used in either shoot- or seed clusters. If it is used, this might cause subtle problems due to network range overlaps. Unfortunately, this appears not to be well documented, but this restriction has existed since the very beginning. We should clean up this technical debt as part of the exercise.

Goals

  • We intend to supersede the current VPN solution with the solution outlined in this proposal.
  • We intend to remove the code for the konnectivity tunnel once this solution proposal has been validated.

Non Goals

  • The solution is not a low latency, or high throughput solution. As the kube-apiserver to shoot cluster traffic does not demand these properties we do not intend to invest in improvements.
  • We do not intend to provide continuous availability to the shoot-seed VPN connection. We expect the availability to be comparable to the existing solution.

Proposal

The proposal is depicted in the following diagram:

alt text

We have added an OpenVPN server pod (vpn-seed-server) to each control plane. The OpenVPN client in the shoot cluster (vpn-shoot-client) connects to the OpenVPN server.

The two containers vpn-seed-server and vpn-shoot-client are new containers and are not related to containers in the github.com/gardener/vpn project. We will create a new project github.com/gardener/vpn2 for these containers. With this solution we intend to supersede the containers from the github.com/gardener/vpn project.

A service vpn-seed-server of type ClusterIP is created for each control plane in its namespace.
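
A sketch of such a service, with assumed pod labels and OpenVPN port:

apiVersion: v1
kind: Service
metadata:
  name: vpn-seed-server
  namespace: shoot--project--name
spec:
  type: ClusterIP
  selector:
    app: vpn-seed-server # assumed pod label
  ports:
  - name: openvpn
    port: 1194 # assumed OpenVPN port
    targetPort: 1194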

The vpn-shoot-client pod connects to the correct vpn-seed-server service via the SNI passthrough proxy introduced with SNI Passthrough proxy for kube-apiservers on port 8132.

Shoot OpenVPN clients (vpn-shoot-client) connect to the correct OpenVPN Server using the http proxy feature provided by OpenVPN. A configuration is added to the envoy proxy to detect http proxy requests and open a connection attempt to the correct OpenVPN server.

The kube-apiserver to shoot cluster connections are established using the API server proxy feature via an envoy proxy sidecar container of the vpn-seed-server container.

The restriction regarding the 192.168.123.0/24 network range in the current VPN solution still applies to this proposal. No other restrictions are introduced. In the context of this GEP a pull request has been filed to block usage of that range by shoot clusters.

Performance and Scalability

We do expect performance and throughput to be slightly lower compared to the existing solution. This is because the OpenVPN server acts as an additional hop and must decrypt and re-encrypt traffic that passes through. As there are no low latency or high throughput requirements for this connection, we do not assume this to be an issue.

Availability and Failure Scenarios

This solution re-uses multiple instances of the envoy component used for the kube-apiserver endpoints. We assume that the availability for kube-apiservers is good enough for the cluster VPN as well.

The OpenVPN client and server pods are singleton pods in this approach and are therefore affected by potential failures and by cluster and control plane updates. Potential outages are restricted to single shoot clusters and are comparable to the situation with the existing solution today.

Feature Gates and Migration Strategy

We have introduced a gardenlet feature gate ReversedVPN. If APIServerSNI and ReversedVPN are enabled the proposed solution is automatically enabled for all shoot clusters hosted by the seed. If ReversedVPN is enabled but APIServerSNI is not the gardenlet will panic during startup as this is an invalid configuration. All existing shoot clusters will automatically be migrated during the next reconciliation. We assume that the ReversedVPN feature will work with Gardener as well as operator managed Istio.

We have also added a shoot annotation alpha.featuregates.shoot.gardener.cloud/reversed-vpn which can override the feature gate to enable or disable the solution for individual clusters. This is only respected if APIServerSNI is enabled, otherwise it is ignored.
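
For illustration, the per-cluster override looks analogous to the WireGuard annotation shown later in this document:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  annotations:
    alpha.featuregates.shoot.gardener.cloud/reversed-vpn: "true"
  ...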

Security Review

The change in the VPN solution will potentially open up new attack vectors. We will perform a thorough analysis outside of this document.

Alternatives

We have done a detailed investigation and implementation of a reversed VPN based on WireGuard. While we believe that it is technically feasible and superior to the approach presented above there are some concerns with regards to scalability, and high availability. As the WireGuard scenario based on kubelink is relevant for other use cases we continue to improve this implementation and address the concerns but we concede that this might not be on time for the cluster VPN. We nevertheless keep the implementation and provide an outline as part of this proposal.

The general idea of the proposal is to keep the existing cluster VPN solution more or less as is, but change the underlying network used for the vpn seed => vpn shoot connection. The underlying network should be established in the reversed direction, i.e. the shoot cluster should initiate the network connection, but it nevertheless should work in both directions.

We achieve this by tunneling the OpenVPN connection through a WireGuard tunnel, which is established from the shoot to the seed (note that WireGuard uses UDP as protocol). Independent of that, we can also use UDP for the OpenVPN connection, but we can also stay with TCP as it was before. While this might look like a big change, it only introduces minor changes to the existing solution, but let’s look at the details. In essence, the OpenVPN connection does not require a public endpoint in the shoot cluster but uses the internal endpoint provided by the WireGuard tunnel.

This is roughly depicted in this diagram. Note that the vpn-seed and vpn-shoot containers only require very little change and are fully backwards compatible.

alt text

The WireGuard network needs a separate network range/CIDR. It has to be unique for the seed and all its shoot clusters. An example for an assumed workload of around 1000 shoot clusters would be 192.168.128.0/22 (1024 IP addresses), i.e. 192.168.128.0-192.168.131.255. The IP addresses from this range need to be managed, but the IP address management (IPAM) using the Gardener Kubernetes objects like seed and shootstate as backing store is fairly straightforward. This is especially true as we do not expect large network ranges and only infrequent IP allocations. Hence, the IP address allocation can be quite simple, i.e. scan the range for a free IP address of all shoot clusters in a seed and allocate the first free address from the range.

There is another restriction: in case shoot clusters are configured to be seed clusters this network range must not overlap with the “parent” seed cluster. If the parent seed cluster uses 192.168.128.0/22 the child seed cluster can for example use 192.168.132.0/22. Grandchildren can however use grandparent IP address ranges. Also 2 children seed clusters can use identical ranges.

This slightly adds to the restrictions described in the current solution outline, in which the arbitrarily chosen 192.168.123.0/24 range is restricted. For the purpose of this implementation we propose to extend that restriction to the 192.168.128.0/17 range. Most of it would be reserved for “future use” however. We are well aware that this adds to the burden of correctly configuring Gardener landscapes.

We do consider this to be a challenge that needs to be addressed by careful configuration of the Gardener seed cluster infrastructure. Together with the 192.168.123.0/24 address range these ranges should be automatically blocked for usage by shoots.

WireGuard can utilize the Linux kernel so that after initialization/configuration no user space processes are required. We propose to recommend the WireGuard kernel module as the default solution for all seeds. For shoot clusters, the WireGuard kernel based approach is also recommended, but the user space solution should also work as we expect less traffic on the shoot side. We expect the userspace implementation to work on all operating systems supported by Gardener in case no kernel module is available.

Almost all seed clusters are already managed by Gardener and we assume that those are configured with the WireGuard kernel module. There are however some cases where we use other Kubernetes distributions as seed cluster which may not have an operating system with WireGuard module available. We will therefore generally support the user space WireGuard process on seed cluster but place a size restriction on the number of control planes on those seeds.

There is a user space implementation of WireGuard, which can be used on Linux distributions without the WireGuard kernel module. (WireGuard moved into the standard Linux kernel 5.6.) Our proposal can handle the kernel/user space switch transparently, i.e. we include the user space binaries and use them only when required. However, especially for the seed the kernel based solution might be more attractive. Garden Linux 318.4.0 supports WireGuard.

We have looked at Ubuntu and SuSE chost:

  • SuSE chost does not provide the WireGuard kernel module and it is not installable via zypper. It should however be straightforward for SuSE to include that in their next release.
  • Ubuntu does not provide the kernel module either but it can be installed using apt-get install wireguard. With that it appears straightforward to provide an image with WireGuard pre-installed.

On the seed, we add a WireGuard device to one node on the host network. For all other nodes on the seed, we adapt the routes accordingly to route traffic destined for the WireGuard network to our WireGuard node. The Kubernetes pods managing the WireGuard device and routes are only used for initial configuration and later reconfiguration. During runtime, they can restart without any impact on the operation of the WireGuard network as the WireGuard device is managed by the Linux kernel.

With Calico as the networking solution it is not easily possible to put the WireGuard endpoint into a pod. Putting the WireGuard endpoint into a pod would require defining it as a gateway in the api server or prometheus pods, but this is not possible since Calico does not span a proper subnet. While the defined CIDR in the pod network might be 100.96.0.0/11, the network visible from within a pod is only 100.96.0.5/32. This restriction might not exist with other networking solutions.

The WireGuard endpoint on the seed is exposed via a load balancer. We propose to use kubelink to manage the WireGuard configuration/device on the seed. We consider the management of the WireGuard endpoint to be complex especially in error situations which is the reason for utilizing kubelink as there is already significant experience managing an endpoint. We propose moving kubelink to the Gardener org in case it is used by this proposal.

Kubelink addresses three challenges managing WireGuard interfaces on cluster nodes. First, with WireGuard interfaces directly on the node (hostNetwork=true) the lifecycle of the interface is decoupled from the lifecycle of the pod that created it. This means that there will have to be means of cleaning up the interfaces and their configuration in case the interface moves to a different node. Second, additional routing information must be distributed across the cluster. The WireGuard CIDR is unknown to the network implementation so additional routes must be distributed on all nodes of the cluster. Third, kubelink dynamically configures the WireGuard interface with endpoints and their public keys.

On the shoot, we create the keys and acquire the WireGuard IP in the standard secret generation. The data is added as a secret to the control plane and to the shootstate. The vpn shoot deployment is extended to include the WireGuard device setup inside the vpn shoot pod network. For certain infrastructures (AWS), we need a re-advertiser to resolve the seed WireGuard endpoint and evaluate whether the IP address changed.

While it is possible to configure a WireGuard device using DNS names, only IP addresses can be stored in Linux kernel data structures. A change of a load balancer IP address can therefore not be mitigated on that level. As WireGuard dynamically adapts endpoint IP addresses, a change in load balancer IPs is mitigated in most but not all cases. This is why a re-advertiser is required for public cloud providers such as AWS.

The load balancer exposing the OpenVPN endpoint in the shoot cluster is no longer required and therefore removed if this functionality is used.

As we want to slowly expand the usage of the WireGuard solution, we propose to introduce a feature gate for it. Furthermore, since the WireGuard network requires a separate network range, we propose to introduce a new section to the seed settings with two additional flags (enabled & cidr):

apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed
  ...
spec:
  ...
  settings:
  ...
    wireguard:
      enabled: true
      cidr: 192.168.128.0/22

Last but not least, we propose to introduce an annotation to the shoots to enable/disable the WireGuard tunnel explicitly.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  annotations:
    alpha.featuregates.shoot.gardener.cloud/wireguard-tunnel: "true"
  ...

Using this approach, it is easy to switch the solution on and off, i.e. migrate the shoot clusters automatically during ordinary reconciliation.

High Availability

There is an issue if the node that hosts the WireGuard endpoint fails. The endpoint is migrated to another node, however the time required to do this might exceed the budget for downtimes, although one could argue that a disruption of less than 30 seconds to 1 minute does not qualify as a downtime and will in almost all cases not be noticeable by end users.

In this case we also assume that TCP connections won’t be interrupted - they would just appear to hang. We will confirm this behavior and the potential downtime as part of the development and testing effort as this is hard to predict.

As a possible mitigation we propose to instantiate 2 Kubelink instances in the seed cluster that are served by two different load balancers. The instances must run on different nodes (if possible but we assume a proper seed cluster has more than one node). Each shoot cluster connects to both endpoints. This means that the OpenVPN server is reachable with two different IP addresses. The VPN seed sidecars must attempt to connect to both of them and will continue to do so. The “Persistent Keepalive” feature is set to 21 seconds by default but could be reduced. Due to the redundancy this however appears not to be necessary.

It is desirable that both connections are used in an equal manner. One strategy could be to use the kubelink 1 connection if the first target WireGuard address is even (the last byte of the IPv4 address), otherwise the kubelink 2 connection. The vpn-seed sidecars can then use the following configuration in their OpenVPN configuration file:

<connection>
remote 192.168.45.3 1194 udp
</connection>

<connection>
remote 192.168.47.34 1194 udp
</connection>

OpenVPN will go through the list sequentially and try to connect to these endpoints.

As an additional mitigation it appears possible to instantiate WireGuard devices on all hosts and replicate its relevant conntrack state across all cluster nodes. The relevant conntrack state keeps the state of all connections passing through the WireGuard interface (e.g. the WireGuard CIDR). conntrack and the tools to replicate conntrack state are part of the essential Linux netfilter tools package.

Load Considerations

What happens in case of a failure? In this case one router will end up owning all connections as the clients will attempt to use the next connection. This could be mitigated by adding a third redundant WireGuard connection. Using this strategy, the failure of one WireGuard endpoint would result in the equal distribution of connections to the two remaining interfaces. We believe however that this will not be necessary.

The cluster node running the WireGuard endpoint is essentially a router that routes all traffic to the various shoot clusters. This is established and proven technology that has existed for decades and has been highly optimized since then. This is also the technology that hyperscalers rely on to provide VPN connectivity to their customers. That said, hyperscalers essentially provide solutions based on IPsec, which is known not to scale as well as WireGuard. WireGuard is a relatively new technology, but we have no indication that it is less stable than existing IPsec solutions.

Regarding performance, there is a lot of information on the Internet suggesting that WireGuard performs better than other VPN solutions such as IPsec or OpenVPN. Examples are https://www.wireguard.com/performance/ and https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/2020-ifip-moonwire.pdf.

Based on this, we have no reason to believe that one router will not be able to handle all traffic going to and coming from shoot clusters. Nevertheless, we will closely monitor the situation in our tests and will take action if necessary.

Further Research

Based on feedback on this proposal and while working on this implementation we identified two additional approaches that we had not thought of so far. The first idea can be used to replace the “inner” OpenVPN implementation and the second can be used to replace WireGuard with OpenVPN and get rid of the single point of failure.

  1. Instead of using OpenVPN for the inner seed/shoot communication we can use the proxy protocol and use a TCP proxy (e.g. envoy) in the shoot cluster to broker the seed-shoot connections. The advantage is that with this solution seed- and shoot cluster network ranges are allowed to overlap. Disadvantages are increased implementation effort and less efficient network in terms of throughput and scalability. We believe however that the reduced network efficiency does not invalidate this option.

  2. There is an option in OpenVPN to specify a tcp proxy as part of the endpoint configuration.

19 - Shoot APIServer via SNI

SNI Passthrough proxy for kube-apiservers

This GEP tackles the problem that today a single LoadBalancer is needed for every single Shoot cluster’s control plane.

Background

When the control plane of a Shoot cluster is provisioned, a dedicated LoadBalancer is created for it. It keeps the entire flow quite easy - the apiserver Pods are running and they are accessible via that LoadBalancer. Its hostnames / IP addresses are used for DNS records like api.<external-domain> and api.<shoot>.<project>.<internal-domain>. While this solution is simple, it comes with several issues.

Motivation

There are several problems with the current setup.

  • IaaS provider costs. For example ClassicLoadBalancer on AWS costs at minimum 17 USD / month.
  • Quotas can limit the amount of LoadBalancers you can get per account / project, limiting the number of clusters you can host under a single account.
  • Lack of support for better loadbalancing algorithms than round-robin.
  • Slow cluster provisioning time - depending on the provider a LoadBalancer provisioning could take quite a while.
  • Lower downtime when workload is shuffled in the clusters as the LoadBalancer is Kubernetes-aware.

Goals

  • Only one LoadBalancer is used for all Shoot cluster API servers running in a Seed cluster.
  • Out-of-cluster (end-user / robot) communication to the API server is still possible.
  • In-cluster communication via the kubernetes master service (IPv4/v6 ClusterIP and the kubernetes.default.svc.cluster.local) is possible.
  • Client TLS authentication works without intermediate TLS termination (TLS is terminated by kube-apiserver).
  • Solution should be cloud-agnostic.

Proposal

Seed cluster

To solve the problem of having multiple kube-apiservers behind a single LoadBalancer, an intermediate proxy must be placed between the Cloud-Provider’s LoadBalancer and kube-apiservers. This proxy is going to choose the Shoot API Server with the help of Server Name Indication. From wikipedia:

Server Name Indication (SNI) is an extension to the Transport Layer Security (TLS) computer networking protocol by which a client indicates which hostname it is attempting to connect to at the start of the handshaking process. This allows a server to present multiple certificates on the same IP address and TCP port number and hence allows multiple secure (HTTPS) websites (or any other service over TLS) to be served by the same IP address without requiring all those sites to use the same certificate. It is the conceptual equivalent to HTTP/1.1 name-based virtual hosting, but for HTTPS.

A rough diagram of the flow of data:

+-------------------------------+
|                               |
|           Network LB          | (accessible from clients)
|                               |
|                               |
+-------------+-------+---------+                       +------------------+
              |       |                                 |                  |
              |       |            proxy + lb           | Shoot API Server |
              |       |    +-------------+------------->+                  |
              |       |    |                            | Cluster A        |
              |       |    |                            |                  |
              |       |    |                            +------------------+
              |       |    |
     +----------------v----+--+
     |        |               |
   +-+--------v----------+    |                         +------------------+
   |                     |    |                         |                  |
   |                     |    |       proxy + lb        | Shoot API Server |
   |        Proxy        |    +-------------+---------->+                  |
   |                     |    |                         | Cluster B        |
   |                     |    |                         |                  |
   |                     +----+                         +------------------+
   +----------------+----+
                    |
                    |
                    |                                   +------------------+
                    |                                   |                  |
                    |             proxy + lb            | Shoot API Server |
                    +-------------------+-------------->+                  |
                                                        | Cluster C        |
                                                        |                  |
                                                        +------------------+

Sequentially:

  1. client requests Shoot Cluster A and sets the Server Name in the TLS handshake to api.shoot-a.foo.bar.
  2. this packet goes through the Network LB and it’s forwarded to the Proxy server. (this loadbalancer should be a simple Layer-4 TCP proxy)
  3. the proxy server reads the packet and sees that the client requests api.shoot-a.foo.bar.
  4. based on its configuration, it maps api.shoot-a.foo.bar to Shoot API Server Cluster A.
  5. it acts as a TCP proxy and simply sends the data to Shoot API Server Cluster A.

There are multiple OSS proxies available for this case.

To ease integration it should:

  • be configurable via Kubernetes resources
  • not require restarting when configuration changes
  • be fast and with little overhead

All things considered, Envoy proxy is the most fitting solution as it provides all the features Gardener would like (no process reload being the most important one + battle tested in production by various companies).

While building a custom control plane for Envoy is quite simple, an already established solution might be the better path forward. Istio’s Pilot is one of the most feature-complete Envoy control plane solutions as it offers a way to configure edge ingress traffic for Envoy via Gateway and VirtualService.

The resources which need to be created per Shoot cluster are the following:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kube-apiserver-gateway
  namespace: <shoot-namespace>
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: tls
      protocol: TLS
    tls:
      mode: PASSTHROUGH
    hosts:
    - api.<external-domain>
    - api.<shoot>.<project>.<internal-domain>

and a VirtualService pointing to the correct API server:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: kube-apiserver
  namespace: <shoot-namespace>
spec:
  hosts:
  - api.<external-domain>
  - api.<shoot>.<project>.<internal-domain>
  gateways:
  - kube-apiserver-gateway
  tls:
  - match:
    - port: 443
      sniHosts:
      - api.<external-domain>
      - api.<shoot>.<project>.<internal-domain>
    route:
    - destination:
        host: kube-apiserver.<shoot-namespace>.svc.cluster.local
        port:
          number: 443

The resources above configure Envoy to forward the raw TLS data (without termination) to the Shoot kube-apiserver.

Updated diagram:

+-------------------------------+
|                               |
|           Network LB          | (accessible from clients)
|                               |
|                               |
+-------------+-------+---------+                       +------------------+
              |       |                                 |                  |
              |       |            proxy + lb           | Shoot API Server |
              |       |    +-------------+------------->+                  |
              |       |    |                            | Cluster A        |
              |       |    |                            |                  |
              |       |    |                            +------------------+
              |       |    |
     +----------------v----+--+
     |        |               |
   +-+--------v----------+    |                         +------------------+
   |                     |    |                         |                  |
   |                     |    |       proxy + lb        | Shoot API Server |
   |    Envoy Proxy      |    +-------------+---------->+                  |
   | (ingress Gateway)   |    |                         | Cluster B        |
   |                     |    |                         |                  |
   |                     +----+                         +------------------+
   +-----+----------+----+
         |          |
         |          |
         |          |                                   +------------------+
         |          |                                   |                  |
         |          |             proxy + lb            | Shoot API Server |
         |          +-------------------+-------------->+                  |
         |   get                                        | Cluster C        |
         | configuration                                |                  |
         |                                              +------------------+
         |
         v                                                  Configure
      +--+--------------+         +---------------------+   via Istio
      |                 |         |                     |   Custom Resources
      |     Pilot       +-------->+   Seed API Server   +<------------------+
      |                 |         |                     |
      |                 |         |                     |
      +-----------------+         +---------------------+

In this case the internal and external DNSEntries should be changed to the Network LoadBalancer’s IP.
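
For illustration only, assuming the dns.gardener.cloud DNSEntry API is used for these records, the entries would then point to the shared LoadBalancer:

apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  name: external
  namespace: shoot--project--name
spec:
  dnsName: api.<external-domain>
  ttl: 120
  targets:
  - 1.2.3.4 # IP address of the shared Network LoadBalancer (placeholder)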

In-cluster communication to the apiserver

In Kubernetes the API server is discoverable via the master service (kubernetes in default namespace). Today, this service can only be of type ClusterIP - making in-cluster communication to the API server impossible due to:

  • the client doesn’t set the Server Name in the TLS handshake, if it attempts to talk to an IP address. In this case, the TLS handshake reaches the Envoy IngressGateway proxy, but it’s rejected by it.
  • Kubernetes services can be of type ExternalName, but this is not supported by kubelet for the master service.
    • even if this is fixed in future Kubernetes versions, this problem still exists for older versions where this functionality is not available.

Another issue occurs when a client tries to talk to the apiserver via the in-cluster DNS name: kubernetes.default.svc.cluster.local is the same for all Shoot API servers, so when a client connects using that server name, the Envoy IngressGateway cannot distinguish between clients of different Shoot clusters.

To mitigate this problem, an additional proxy must be deployed on every single Node. It does not terminate TLS and forwards the traffic to the correct Shoot API server. This is achieved as follows:

  • the apiserver master service reconciler is configured to point to the kube-apiserver’s Cluster IP in the Seed cluster (e.g. --advertise-address=10.1.2.3).
  • the proxy runs in the host network of the Node.
  • the proxy has a sidecar container which:
    • creates a dummy network interface and assigns 10.1.2.3 to it.
    • removes connection tracking (conntrack) entries if iptables/nftables is enabled, as the IP address is local to the Node.
  • the proxy listens on 10.1.2.3 and, using the PROXY protocol, sends the data stream to the Envoy ingress gateway (EIGW).
  • EIGW listens for the PROXY protocol on a dedicated port 8443, reads the destination IP and port from the PROXY protocol header, and forwards the traffic to the correct upstream apiserver.

The sidecar is a standalone component. It’s possible to transparently change the proxy implementation without any modifications to the sidecar. The simplified flow looks like:

+------------------+                    +----------------+
| Shoot API Server |       TCP          |   Envoy IGW    |
|                  +<-------------------+ PROXY listener |
| Cluster A        |                    |     :8443      |
+------------------+                    +-+--------------+
                                          ^
                                          |
                                          |
                                          |
                                          |
+-----------------------------------------------------------+
                                          |   Single Node in
                                          |   the Shoot cluster
                                          |
                                          | PROXY Protocol
                                          |
                                          |
                                          |
 +---------------------+       +----------+----------+
 |  Pod talking to     |       |                     |
 |  the kubernetes     |       |       Proxy         |
 |  service            +------>+  No TLS termination |
 |                     |       |                     |
 +---------------------+       +---------------------+

Multiple OSS solutions can be used:

  • haproxy
  • nginx
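
A rough sketch of how such a node-local proxy could be deployed, assuming hostNetwork mode and illustrative names (the proxy configuration would resemble the nginx example shown later in this section):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: apiserver-proxy # illustrative name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: apiserver-proxy
  template:
    metadata:
      labels:
        app: apiserver-proxy
    spec:
      hostNetwork: true # the proxy runs in the host network of the Node
      containers:
      - name: sidecar # creates the dummy interface for 10.1.2.3 and removes conntrack entries
        image: <sidecar-image>
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]
      - name: proxy # forwards the stream to the EIGW using the PROXY protocol
        image: <proxy-image>
        volumeMounts:
        - name: config
          mountPath: /etc/proxy
      volumes:
      - name: config
        configMap:
          name: apiserver-proxy-config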

To add a PROXY protocol listener with Istio, several resources must be created - a dedicated Gateway, a dummy VirtualService, and an EnvoyFilter which adds a listener filter (envoy.filters.listener.proxy_protocol) on port 8443:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: blackhole
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 8443
      name: tcp
      protocol: TCP
    hosts:
    - "*"

---

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: blackhole
  namespace: istio-system
spec:
  hosts:
  - blackhole.local
  gateways:
  - blackhole
  tcp:
  - match:
    - port: 8443
    route:
    - destination:
        host: localhost
        port:
          number: 9999 # any dummy port will work

---

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: proxy-protocol
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: LISTENER
    match:
      context: ANY
      listener:
        portNumber: 8443
        name: 0.0.0.0_8443
    patch:
      operation: MERGE
      value:
        listener_filters:
        - name: envoy.filters.listener.proxy_protocol

For each individual Shoot cluster, a dedicated FilterChainMatch is added. It ensures that only Shoot API servers can receive traffic from this listener:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: <shoot-namespace>
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: FILTER_CHAIN
    match:
      context: ANY
      listener:
        portNumber: 8443
        name: 0.0.0.0_8443
    patch:
      operation: ADD
      value:
        filters:
        - name: envoy.filters.network.tcp_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
            stat_prefix: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
            cluster: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
        filter_chain_match:
          destination_port: 443
          prefix_ranges:
          - address_prefix: 10.1.2.3 # kube-apiserver's cluster-ip
            prefix_len: 32

Note: this additional EnvoyFilter can be removed when Istio supports full L4 matching.

An nginx proxy client in the Shoot cluster on every node could have the following configuration:

error_log /dev/stdout;
stream {
    server {
        listen 10.1.2.3:443;
        proxy_pass api.<external-domain>:8443;
        proxy_protocol on;

        proxy_protocol_timeout 5s;
        resolver_timeout 5s;
        proxy_connect_timeout 5s;
    }
}

events { }

In-cluster communication to the apiserver when ExternalName is supported

Even if future versions of Kubernetes support a master service of type ExternalName, we still have the problem that in-cluster workloads talk to the API server via the in-cluster DNS name. For this to work we still need the above-mentioned proxy (this time listening on another IP address, 10.0.0.2). An additional change to CoreDNS would be needed:

default.svc.cluster.local.:8053 {
    file kubernetes.default.svc.cluster.local
}

.:8053 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

The content of the kubernetes.default.svc.cluster.local file is going to be:

$ORIGIN default.svc.cluster.local.

@	30 IN	SOA local. local. (
        2017042745 ; serial
        1209600    ; refresh (2 weeks)
        1209600    ; retry (2 weeks)
        1209600    ; expire (2 weeks)
        30         ; minimum (30 seconds)
        )

  30 IN NS local.

kubernetes     IN A     10.0.0.2

So when a client requests kubernetes.default.svc.cluster.local, it will be sent to the proxy listening on that IP address.

Future work

While out of scope of this GEP, several things can be improved:

  • Make the sidecar work with eBPF and environments where iptables/nftables are not enabled.

20 - Shoot CA Rotation

GEP-18: Automated Shoot CA Rotation

Summary

This proposal outlines an on-demand, multi-step approach to rotate all certificate authorities (CA) used in a Shoot cluster. This process includes creating new CAs, invalidating the old ones and recreating all certificates signed by the CAs.

We propose to bundle the rotation of all CAs in the Shoot together as one triggerable action. This includes the recreation and invalidation of the following CAs and all certificates signed by them:

  • Cluster CA (currently used for signing kube-apiserver serving certificates and client certificates)
  • kubelet CA (used for signing client certificates for talking to kubelet API, e.g. kube-apiserver-kubelet)
  • etcd CA (used for signing etcd serving certificates and client certificates)
  • front-proxy CA (used for signing client certificates that kube-aggregator (part of kube-apiserver) uses to talk to extension API servers, filled into extension-apiserver-authentication ConfigMap and read by extension API servers to verify incoming kube-aggregator requests)
  • metrics-server CA (used for signing serving certificates, filled into APIService caBundle field and read by kube-aggregator to verify the presented serving certificate)
  • ReversedVPN CA (used for signing vpn-seed-server serving certificate and vpn-shoot client certificate)

Out of scope for now:

  • kubelet serving CA is self-generated (valid for 1y) and self-signed by kubelet on startup
    • kube-apiserver does not seem to verify the presented serving certificate
    • kubelet can be configured to request a serving certificate via CSR that can be verified by kube-apiserver; however, we consider this a separate improvement outside the scope of this GEP (see the configuration excerpt after this list)
  • Legacy VPN solution uses the cluster CA for both serving and client certificates. As the solution is soon to be dropped in favor of the new ReversedVPN solution, we don’t intend to introduce a dedicated CA for this component. If ReversedVPN is disabled and the CA rotation is triggered, we make sure to propagate the cluster CA to the relevant places in the legacy VPN solution.
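
For reference, the kubelet option mentioned above is an existing KubeletConfiguration setting; a minimal excerpt enabling it (assuming the serverTLSBootstrap field) would look roughly like:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true # kubelet requests its serving certificate via a CSR that the cluster can verify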

Naturally, not all certificates used for communication with the kube-apiserver are under Gardener’s control. An example of a Gardener-controlled certificate is the kubelet client certificate used to communicate with the API server. Examples of credentials not controlled by Gardener are kubeconfigs or client certificates requested via CertificateSigningRequests by the shoot owner.

We propose to use a two-step approach to rotate CAs. The start of each phase is triggered by the shoot owner. In summary, the first phase is used to create new CAs (for example the new API server CA and client CA). Then we make sure that all servers and clients under Gardener’s control trust both the old and the new CA. Next, we renew all client certificates that are under Gardener’s control so they are now signed by the new CAs. This includes a node rollout in order to propagate the certificates to the kubelets and to restart all pods. Afterwards, the user needs to change their client credentials to trust both the old and the new cluster CA. In the second phase, we remove all trust in the old CA for servers and clients under Gardener’s control. This does not include a node rollout, but all still-running pods using ServiceAccounts will continue to trust the old CA until they restart. Also, the user needs to retrieve the new CA bundle in order to no longer trust the old CA.

A detailed overview of all steps required for each phase is given in the proposal section of this GEP.

Introducing a new client CA

Currently, client certificates and the kube-apiserver certificate are signed by the same CA. We propose to create a separate client CA when triggering the rotation. The client CA is used to sign certificates of clients talking to the API Server.

Motivation

There are a few reasons for rotating shoot cluster CAs:

  • If we have to invalidate client certificates for the kube-apiserver or any other component, we are forced to rotate the CA: the only way to invalidate them is to stop trusting all client certificates signed by the respective CA, as Kubernetes does not support revoking certificates.
  • If the CA itself got leaked.
  • If the CA is about to expire.
  • If a company policy requires to rotate a CA after a certain point in time.

In each of those cases, we currently have to manually recreate and replace all CAs and certificates. Rotating by hand is cumbersome and error-prone because many steps need to be performed in the right order. By automating the process we want to provide a way to rotate shoot CAs securely and easily.

Goals

  • Offer an automated and safe solution to rotate all CAs in a shoot cluster.
  • Offer a process that is easily understandable for developers and users.
  • Rotate the different CAs in the shoot with a similar process to reduce complexity.
  • Add visibility for Shoot owners when the last CA rotation happened

Non-Goals

  • Offer an automated solution for rotating other static credentials (like static token).
    • Later on, a similar two-phase approach could be implemented for the kubeconfig rotation. However, this is out of scope for this enhancement.
  • Creating a process that runs fully automated without shoot owner interaction. As the shoot owner controls some secrets, this would probably not even be possible.
  • Forcing the shoot owner to rotate after a certain time period. Our goal is rather to issue long-lived certificates and let users decide, depending on their requirements, when to rotate.
  • Configurable default CA lifetime

Proposal

We will add a new feature gate CARotation for gardener-apiserver and gardenlet which allows enabling or disabling the ability to trigger the rotation.

Triggering the CA Rotation

  • Triggered via gardener.cloud/operation annotation in symmetry with other operations like reconciliation, kubeconfig rotation, etc. (see the example after this list)
    • annotation increases the generation
    • value for triggering first phase: start-ca-rotation
    • value for triggering the second phase: complete-ca-rotation
    • gardener-apiserver performs the necessary validation: a user can’t trigger another rotation if one is already in progress, can’t trigger complete-ca-rotation if the first phase has not been completed, etc.
  • The annotation triggers a usual shoot reconciliation (just like a kubeconfig or SSH key rotation)
  • gardenlet begins the CA rotation sequence by setting the new status section .status.credentials.caRotation (probably in updateShootStatusOperationStart) and removes the annotation afterwards
    • shoot reconciliation needs to be idempotent with respect to the CA rotation phase, i.e. if a usual reconciliation or maintenance operation is triggered in between, no new CAs are generated and nothing else interferes with the CA rotation sequence
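
As referenced above, triggering the first phase could, for example, be done by annotating the Shoot (metadata names are placeholders):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot # placeholder
  namespace: garden-my-project # placeholder
  annotations:
    gardener.cloud/operation: start-ca-rotation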

Changing the Shoot Status

A new section in the Shoot status is added when the first rotation is triggered:

status:
  credentials:
    rotation:
      certificateAuthorities:
        phase: Prepare # Prepare|Finalize|Completed
        lastCompletion: 2022-02-07T14:23:44Z
    # kubeconfig:
    #   phase:
    #   lastCompletion:

Later on, this section could be augmented with other information like the names of the credentials secrets (e.g. gardener/gardener#1749):

status:
  credentials:
    resources:
    - type: kubeconfig
      kind: Secret
      name: shoot-foo.kubeconfig

Rotation Sequence for Cluster and Client CA

The proposal section includes a detailed description of all steps involved for rotating from a given CA0 to the target CA1.

t0: Today’s situation

  • kube-apiserver uses SERVER CERT signed by CA0 and trusts CLIENT CERTS signed by CA0
  • kube-controller-manager issues new CLIENT CERTS signed by CA0
  • kubeconfig trusts only CA0
  • ServiceAccount secrets trust only CA0
  • kubelet uses CLIENT CERT signed by CA0

t1: Shoot owner triggers first step of CA rotation process (-> phase one is started):

  • Generate CA1
  • Generate CLIENT_CA1
  • Update kube-apiserver, kube-scheduler, etc. to trust CLIENT CERTS signed by both CA0 and CLIENT_CA1 (--client-ca-file flag)
  • Update kube-controller-manager to issue new CLIENT CERTS now with CLIENT_CA1
  • Update kubeconfig so that its CA bundle contains both CA0 and CA1 (if the kubeconfig still contains a legacy CLIENT CERT, then rotate the kubeconfig)
  • Update generic-token-kubeconfig so that its CA bundle contains both CA0 and CA1
  • Update kube-controller-manager to populate both CA0 and CA1 in ServiceAccount secrets.
  • Restart control plane components so that their CA bundle contains both CA0 and CA1
  • Renew CLIENT CERTS (sign them with CLIENT_CA1) for the following control plane components (Prometheus, DWD, legacy VPN), if not dropped already in the context of gardener/gardener#4661
  • Trigger node rollout
    • This issues new CLIENT CERTS for all kubelets signed by CLIENT_CA1
    • This restarts all Pods and propagates CA0 and CA1 into their mounted ServiceAccount secrets (note: CA bundles cannot be reloaded by the Go client, therefore a restart of the Pods is needed)
  • Ask user to exchange all their client credentials (kubeconfig, CLIENT CERTS issued by CertificateSigningRequests) to trust both CA0 and CA1
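
To illustrate the last step from the user’s perspective: a certificate-authority based kubeconfig would, during phase one, simply carry a bundle with both CAs (a minimal sketch, values abbreviated):

apiVersion: v1
kind: Config
clusters:
- name: my-shoot # placeholder
  cluster:
    server: https://api.<shoot>.<project>.<external-domain>
    # base64-encoded bundle containing the PEM certificates of both CA0 and CA1
    certificate-authority-data: <base64 of CA0.crt + CA1.crt>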

t2: Shoot owner triggers second step of CA rotation process (-> phase two is started):

Prerequisite: All Gardener-controlled actions listed in t1 were executed successfully (for example node rollout). The shoot owner has guaranteed that they exchanged their client credentials and triggered step 2 via an annotation.

  • Renew SERVER CERTS (sign them with CA1) for kube-apiserver, kube-controller-manager, cloud-controller-manager etc.
  • Update kube-apiserver, kube-scheduler, etc. to trust only CLIENT CERTS signed by CLIENT_CA1
  • Update kubeconfig so that its CA bundle contains only CA1
  • Update generic-token-kubeconfig so that its CA bundle contains only CA1
  • Update kube-controller-manager to only contain CA1. ServiceAccount secrets created after this point will include only CA1
  • Restart control plane components so that their CA bundle contains only CA1
  • Restart kubelets so that the CA bundle in their kubeconfigs contains only CA1
  • Delete CA0
  • Ask the user to optionally restart their Pods, which still contain CA0 in memory, in order to eliminate trust in the old cluster CA.
  • Ask user to exchange all their client credentials (download kubeconfig containing only CA1; when using CLIENT CERTS trust only CA1)

Rotation Sequence of Other CAs

Apart from the kube-apiserver CA (and the client CA), we also use 5 other CAs in the Gardener codebase, as mentioned above. We propose to rotate those CAs together with the kube-apiserver CA following the same trigger.

ℹ️ Note for the front-proxy CA: users need to make sure that extension API servers have reloaded the extension-apiserver-authentication ConfigMap before triggering the second phase.

You can find the Gardener-managed CAs listed here.

Regarding the rotation steps, we want to follow a similar approach to the one defined for the kube-apiserver CA. As an example, we show the timeline for ETCD_CA, but the logic should be similar for all the CAs listed above.

  • t0
    • etcd trusts client certificates signed by ETCD_CA0 and uses a server certificate signed by ETCD_CA0
    • kube-apiserver and backup-restore use a client certificate signed by ETCD_CA0 and trust ETCD_CA0
  • t1:
    • Generate ETCD_CA1
    • Update etcd to trust CLIENT CERTS signed by both ETCD_CA0 and ETCD_CA1
    • Update kube-apiserver and backup-restore:
      • Adapt CA bundle to trust both ETCD_CA0 and ETCD_CA1
      • Renew CLIENT CERTS (sign them with ETCD_CA1)
  • t2:
    • Update etcd:
      • Trust only CLIENT CERTS signed by ETCD_CA1
      • Renew SERVER CERT (sign it with ETCD_CA1)
    • Update kube-apiserver and backup-restore so that their CA bundle contains only ETCD_CA1

ℹ️ This means we are requiring two restarts of etcd in total.
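
For illustration, the intermediate trust state at t1 could be represented by a CA bundle containing both PEM certificates, e.g. maintained in a Secret mounted by etcd and its clients (names are illustrative):

apiVersion: v1
kind: Secret
metadata:
  name: ca-etcd-bundle # illustrative name
  namespace: <shoot-namespace>
type: Opaque
stringData:
  bundle.crt: |
    -----BEGIN CERTIFICATE-----
    <PEM of ETCD_CA0>
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    <PEM of ETCD_CA1>
    -----END CERTIFICATE-----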

Alternatives

This section presents a different approach to rotating the CAs: temporarily creating a second set of API servers utilizing the new CA. After presenting this approach, the advantages and disadvantages of both approaches are listed.

t0: Today’s situation

  • kube-apiserver uses SERVER CERT signed by CA0 and trusts CLIENT CERTS signed by CA0
  • kube-controller-manager issues new CLIENT CERTS with CA0
  • kubeconfig contains only CA0
  • ServiceAccount secrets contain only CA0
  • kubelet uses CLIENT CERT signed by CA0

t1: User triggers first step of CA rotation process (-> phase one):

  • Generate CA1
  • Generate CLIENT_CA1
  • Create new DNSRecord, Service, Istio configuration, etc. for second kube-apiserver deployment
  • Deploy second kube-apiserver deployment trusting only CLIENT CERTS signed by CLIENT_CA1 and using SERVER CERT signed by CA1
  • Update kube-scheduler, etc. to trust only CLIENT CERTS signed by CLIENT_CA1 (--client-ca-file flag)
  • Update kube-controller-manager to issue new CLIENT CERTS with CLIENT_CA1
  • Update kubeconfig so that it points to the new DNSRecord and its CA bundle contains only CA1 (if kubeconfig still contains a legacy CLIENT CERT then rotate the kubeconfig)
  • Update ServiceAccount secrets so that their CA bundle contains both CA0 and CA1
  • Restart control plane components so that they point to the second kube-apiserver Service and so that their CA bundle contains only CA1
  • Renew CLIENT CERTS (sign them with CLIENT_CA1) for control plane components (Prometheus, DWD, legacy VPN) and point them to the second kube-apiserver Service
  • Adapt apiserver-proxy-pod-mutator to point KUBERNETES_SERVICE_HOST env variable to second kube-apiserver
  • Trigger node rollout
    • This issues new CLIENT CERTS for all kubelets signed by CLIENT_CA1 and points them to the second DNSRecord
    • This restarts all Pods and propagates CA0 and CA1 into their mounted ServiceAccount secrets
  • Ask user to exchange all their client credentials (kubeconfig, CLIENT CERTS issued by CertificateSigningRequests)

t2: User triggers second step of CA rotation process (-> phase two):

  • Update ServiceAccount secrets so that their CA bundle contains only CA1
  • Update apiserver-proxy to talk to second kube-apiserver
  • Drop first DNSRecord, Service, Istio configuration and first kube-apiserver deployment
  • Drop CA0
  • Ask user to optionally restart their Pods since they still contain CA0 in memory.

Advantages/Disadvantages of the two-API-server approach

  • (+) User needs to adapt client credentials only once
  • (/) Unstable API server domain
  • (-) Probably more implementation effort
  • (-) More complex
  • (-) CA rotation process does not work similar for all CAs in our system

Advantages/Disadvantages of the currently preferred approach (see proposal)

  • (+) Implementation effort seems “straight-forward”
  • (+) CA rotation process works similar for all CAs in our system
  • (/) Stable API server domain
  • (-) User needs to adapt client credentials twice

21 - Utilize API Server Network Proxy to Invert Seed-to-Shoot Connectivity

Utilize API Server Network Proxy to Invert Seed-to-Shoot Connectivity

Problem

Gardener’s architecture for Kubernetes clusters relies on having the control-plane (e.g., kube-apiserver, kube-scheduler, kube-controller-manager, etc.) and the data-plane (e.g., kube-proxy, kubelet, etc.) of the cluster reside in separate places. This provides many benefits but poses some challenges, especially when communication from the API server to the system components is required. This problem is solved today in Gardener by making use of OpenVPN to establish a VPN connection from the seed to the shoot. To do so, the following steps are required:

  • Create a Loadbalancer service on the shoot.
  • Add a sidecar to the API server pod which knows the address of the newly created Loadbalancer.
  • Establish a connection over the internet to the VPN Loadbalancer.
  • Install additional iptables rules that redirect all the IPs of the shoot (i.e., service, pod, node CIDRs) to the established VPN tunnel.

There are however quite a few problems with the above approach, here are some:

  • Every shoot would require an additional loadbalancer, which accounts for additional overhead in terms of both costs and troubleshooting efforts.
  • Private access use-cases would not be possible without having a seed residing in the same private domain as a hard requirement. For example, have a look at this issue.
  • Providing a public endpoint to access components in the shoot poses a security risk.

Proposal

There are multiple ways to tackle the directional connectivity issue mentioned above. One way would be to invert the connection between the API server and the system components, i.e., instead of having the API server side-car establish a tunnel, we would have an agent residing in the shoot cluster initiate the connection itself. This way we don’t need a Loadbalancer for every shoot, and from the security perspective there is no ingress from outside, only controlled egress.

We want to replace this:

APIServer | VPN-seed ---> internet ---> LB --> VPN-Shoot (4314) --> Pods | Nodes | Services

With this:

APIServer <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

API Server Network Proxy

To solve this issue we can utilize the upstream apiserver-network-proxy implementation, which provides a reference implementation of a reverse streaming server. The way it works is as follows:

  • Proxy agent connects to proxy server to establish a sticky connection.
  • Traffic to the proxy server (residing in the seed) is then redirected to the agent (residing in the shoot), which forwards the traffic to in-cluster components.

The initial motivation for the apiserver-network-proxy project is to get rid of provider-specific implementations that reside in the API-server (e.g., SSH), but it turns out that it has other interesting use-cases such as data-plane connection decoupling, which is the main use-case for this proposal.

Starting with Kubernetes 1.18 it’s possible to make use of the --egress-selector-config-file flag, which points the API server to traffic hook points based on the traffic direction. For example, in the config below the API server forwards all cluster-related traffic (e.g., logs, port-forward, exec, etc.) to the proxy-server, which then knows how to forward the traffic to the shoot. For the rest of the traffic, e.g. API server to etcd or other control-plane components, direct is used, which means the legacy routing method, i.e., bypassing the proxy.

  egress-selector-configuration.yaml: |-
    apiVersion: apiserver.k8s.io/v1alpha1
    kind: EgressSelectorConfiguration
    egressSelections:
    - name: cluster
      connection:
        proxyProtocol: httpConnect
        transport:
          tcp:
            url: https://proxy-server:8131
    - name: master
      connection:
        proxyProtocol: direct
    - name: etcd
      connection:
        proxyProtocol: direct    
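
The kube-apiserver then references this file via the flag mentioned above; a minimal excerpt of the container spec could look like this (file paths are illustrative):

containers:
- name: kube-apiserver
  command:
  - /usr/local/bin/kube-apiserver
  - --egress-selector-config-file=/etc/kubernetes/egress/egress-selector-configuration.yaml
  # ... other flags omitted ...
  volumeMounts:
  - name: egress-selection-config
    mountPath: /etc/kubernetes/egress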

Challenges

Prometheus to Shoot connectivity

One challenge remains to completely eliminate the need for a VPN connection. In today’s Gardener setup, each control-plane has a Prometheus instance that directly scrapes cluster components such as CoreDNS, kubelets, cadvisor, etc. This works because, in addition to the VPN sidecar attached to the API server pod, we have another one attached to Prometheus which knows how to forward traffic to these endpoints. Once the VPN is eliminated, other means to forward traffic to these components are required.

Possible Solutions

There are currently two ways to solve this problem:

  • Attach a port-forwarder side-car to prometheus.
  • Utilize the proxy subresource on the API server.

Port-forwarder Sidecar

With this solution each Prometheus instance would have a sidecar that has the kubeconfig of the shoot cluster and establishes a port-forward connection to the endpoints residing in the shoot.

There are many problems with this approach:

  • the port-forward connection is not reliable.
  • the connection would break if the API server instance dies.
  • requires an additional component.
  • would need to expose every pod / service via port-forward.

Prom Pod (Prometheus -> Port-forwarder) <-> APIServer -> Proxy-server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

Proxy Client Sidecar

Another solution would be to implement a proxy client as a sidecar for every component that wishes to communicate with the shoot cluster. For this to work, a means to redirect the component’s traffic through that proxy is necessary (e.g., additional iptables rules).

Prometheus Pod (Prometheus -> Proxy) <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

The problem with this approach is that it requires an additional sidecar (along with traffic redirection) to be attached to every client that wishes to communicate with the shoot cluster, which can cause:

  • additional maintenance efforts (extra code).
  • other side-effects (e.g., if istio sidecar injection is enabled)

Proxy sub-resource

Kubernetes supports proxying requests to nodes, services, and pod endpoints in the shoot cluster. This proxy connection can be utilized for scraping the necessary endpoints in the shoot.

This approach requires fewer components and is more reliable than the port-forward solution; however, it relies on the API server supporting proxied connections to the required endpoints.

Prometheus  <-> APIServer <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

As simple as it is, it has the downside that it relies on the availability of the API server.
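
A rough sketch of how Prometheus could scrape cadvisor metrics through the nodes proxy subresource (job name, paths, and the service address are illustrative, not the exact Gardener configuration):

scrape_configs:
- job_name: shoot-cadvisor # illustrative
  scheme: https
  tls_config:
    ca_file: /etc/prometheus/shoot/ca.crt # CA and token of the shoot cluster (illustrative paths)
  bearer_token_file: /etc/prometheus/shoot/token
  kubernetes_sd_configs:
  - role: node
    api_server: https://kube-apiserver # shoot kube-apiserver service in the seed
    tls_config:
      ca_file: /etc/prometheus/shoot/ca.crt
    bearer_token_file: /etc/prometheus/shoot/token
  relabel_configs:
  - target_label: __address__
    replacement: kube-apiserver:443 # always scrape through the API server
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor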

Proxy-server Loadbalancer Sharing and Re-advertising

With the proxy-server in place, we need to provide means for the proxy-agent in the shoot to establish the connection with the server. As a result, we need to provide a public endpoint through which this channel of communication can be established, i.e., we need one or more Loadbalancers.

Possible Solution

Using a Loadbalancer per proxy server would not make sense since this is a pain-point we are trying to eliminate in the first place; doing so would just move the costs to the control-plane. A possible solution is to communicate over a shared loadbalancer in the seed, similar to what has been proposed here; this way we can prevent the extra costs for load-balancers.

With this in mind, we still have other pain-points, namely:

  • Advertising Loadbalancer public IPs to the shoot.
  • Directing the traffic to the corresponding shoot proxy-server.

For advertising the Loadbalancer IP, a DNS entry can be created for the proxy loadbalancer (or the DNS entry for the SNI proxy can be re-used), along with the necessary certificates, which is then used to connect to the loadbalancer. At this point we can decide on one of two approaches:

  1. One proxy server per API server, behind a shared loadbalancer.
  2. One proxy server for all agents.

In the first case, we will probably need a proxy in front of the proxy-servers that knows how to direct traffic to the correct proxy server based on the corresponding shoot cluster. In the second case, we don’t need another proxy if the proxy server is cluster-aware, i.e., can pool and identify connections coming from the same cluster and peer them with the correct API server. Unfortunately, the second case is not supported today.

Summary

  • The apiserver-network-proxy can be utilized to invert the connection (only for clusters >= 1.18; for older clusters the old VPN solution will remain).
  • This is achieved by utilizing the --egress-selector-config-file flag on the api-server.
  • For monitoring endpoints, the proxy subresource would be the preferred method, but in the future we can also support sidecar proxies that communicate with the proxy-server.
  • For directing traffic to the correct proxy-server, we will re-use the SNI proxy along with the load-balancer from the Shoot API server via SNI GEP.