Proposals

Gardener Enhancement Proposal (GEP)

Changes to the Gardener code base are often incorporated directly via pull requests. Either the pull request itself or a linked GitHub issue describes the motivation and scope of the change.

If a prospective feature has a larger scope, requires the involvement of several parties, or needs more discussion before the actual implementation can start, you may consider filing a pull request with a Gardener Enhancement Proposal (GEP) first.

GEPs are a means to propose a change or a new feature for Gardener. They help you describe the change(s) conceptually and list the steps necessary to reach the goal. They help the Gardener maintainers as well as the community understand the motivation and scope of your proposed change(s), and they encourage contributions to discussions and future pull requests. If you are familiar with the Kubernetes community, GEPs are analogous to Kubernetes Enhancement Proposals (KEPs).

Reasons for a GEP

You may consider filing a GEP for the following reasons:

  • A Gardener architectural change is intended / necessary
  • Major changes to the Gardener code base
  • A phased implementation approach is expected because of the widespread scope of the change
  • Your proposed changes may be controversial

We encourage you to take a look at already merged GEPs since they give you a sense of what a typical GEP comprises.

Before creating a GEP

Before starting your work and creating a GEP, please take some time to familiarize yourself with our general Gardener Contribution Guidelines.

It is recommended to discuss and outline the motivation of your prospective GEP as a draft with the community before you invest in creating the actual GEP. This early briefing helps the broader community understand the proposal and gets you fast feedback from the respective experts. An appropriate forum for this may be the regular Gardener community meetings.

How to file a GEP

GEPs should be created as Markdown .md files and submitted through a GitHub pull request to their current home in docs/proposals. Please use the provided template or follow the structure of existing GEPs, which makes reviewing easier and faster. Additionally, please link the new GEP in our documentation index.

If not already done, please present your GEP in the regular community meetings to brief the community about your proposal (we strive for personal communication :) ). Also consider that this may be an important step to raise awareness and understanding for everyone involved.

The GEP template contains a small set of metadata, which is helpful for keeping track of the enhancement in general and especially of who is responsible for implementing and reviewing PRs that are part of the enhancement.

Main Reviewers

Apart from general metadata, the GEP should name at least one “main reviewer”. You can find a main reviewer for your GEP either when discussing the proposal in the community meeting, by asking in our Slack channel, or at the latest during the GEP PR review. New GEPs should only be accepted once at least one main reviewer is nominated/assigned.

The main reviewers are charged with the following tasks:

  • familiarizing themselves with the details of the proposal
  • reviewing the GEP PR itself and any further updates to the document
  • discussing design details and clarifying implementation questions with the author before and after the proposal is accepted
  • reviewing PRs related to the GEP in-depth

Other community members are of course also welcome to help the GEP author, review their work, and raise general concerns with the enhancement. Nevertheless, the main reviewers are supposed to focus on more in-depth reviews and on accompanying the whole GEP process end-to-end. This leads to higher-quality reviews and faster feedback cycles than having more people look at the process with lower priority and less focus.

GEP Process

  1. Pre-discussions about GEP (if necessary)
  2. Find a main reviewer for your enhancement
  3. GEP is filed through GitHub PR
  4. Presentation in Gardener community meeting (if possible)
  5. Review of GEP from maintainers/community
  6. GEP is merged if accepted
  7. Implementation of GEP
  8. Consider keeping GEP up-to-date in case implementation differs essentially

1 - 01 Extensibility

Gardener extensibility and extraction of cloud-specific/OS-specific knowledge (#308, #262)

Summary

Gardener has evolved into a large compound of packages containing lots of highly specific knowledge, which makes it very hard to extend (supporting a new cloud provider, a new OS, …, or behaving differently depending on the underlying infrastructure).

This proposal aims to move out the cloud-specific implementations (called “(cloud) botanists”) and the OS-specifics into dedicated controllers, and simultaneously to allow deviation from the standard Gardener deployment.

Motivation

Currently, it is too hard to support additional cloud providers or operating systems/distributions as everything must be done in-tree, which might affect the implementation of other cloud providers as well. The various conditions and branches make the code hard to maintain and hard to test. Every change must be made centrally, requires completely rebuilding Gardener, and cannot be deployed individually. Similar to the motivation for Kubernetes to extract its cloud-specifics into dedicated cloud-controller-managers or to extract the container/storage/network/… specifics into CRI/CSI/CNI/…, we aim to do the same right now.

Goals

  • Gardener does not contain any cloud-specific knowledge anymore but defines a clear contract allowing external controllers (botanists) to support different environments (AWS, Azure, GCP, …).
  • Gardener does not contain any operating-system-specific knowledge anymore but defines a clear contract allowing external controllers to support different operating systems/distributions (CoreOS, SLES, Ubuntu, …).
  • It shall become much easier to move control planes of Shoot clusters between Seed clusters (#232) which is a necessary requirement of an automated setup for the Gardener Ring (#233).

Non-Goals

  • We want to also factor out the specific knowledge of the addon deployments (nginx-ingress, kubernetes-dashboard, …), but we already have dedicated projects/issues for that: https://github.com/gardener/bouquet and #246. We will keep the addons in-tree as part of this proposal and tackle their extraction separately.
  • We do not want to make Gardener a plain workflow engine that just executes a given template (which would indeed allow it to be generic, open, and extensible in the highest form but would end up building a “programming/scripting language” inside a serialization format (YAML/JSON/…)). Rather, we want to have well-defined contracts and APIs, keeping Gardener responsible for cluster management.

Proposal

Gardener heavily relies on and implements Kubernetes principles, and its ultimate strategy is to use Kubernetes wherever applicable. The extension concept in Kubernetes is based on (among others) CustomResourceDefinitions, ValidatingWebhookConfigurations and MutatingWebhookConfigurations, and InitializerConfigurations. Consequently, Gardener’s extensibility concept relies on these mechanisms.

Instead of implementing all aspects directly, Gardener will deploy some CRDs to the Seed cluster which will be watched by dedicated controllers (also running in the Seed clusters), each one implementing one aspect of cluster management. This way, one complex, strongly coupled Gardener implementation covering all infrastructures is decomposed into a set of loosely coupled controllers implementing aspects of APIs defined by Gardener. Gardener will just wait until the controllers report in the CRD’s .status field that they are done (or have faced an error) instead of doing the respective tasks itself. We will have one specific CRD for every specific operation (e.g., DNS, infrastructure provisioning, machine cloud config generation, …).

However, there are also parts inside Gardener which can be handled generically (not by cloud botanists) because they are the same or very similar for all environments. One example is the deployment of a Namespace in the Seed which will run the Shoot’s control plane. Another one is the deployment of a Service for the Shoot’s kube-apiserver. In case a cloud botanist needs to cooperate and react to those operations, it should register a ValidatingWebhookConfiguration, a MutatingWebhookConfiguration, or an InitializerConfiguration. With this approach, it can validate, modify, or react to any resource created by Gardener to make it cloud-infrastructure-specific.

The webhooks should be registered with failurePolicy=Fail to ensure that a request made by Gardener fails if the respective webhook is not available.
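To illustrate why failurePolicy=Fail matters here, the following is a toy model of how an API server treats an unreachable webhook under the two Kubernetes failure policies, Fail and Ignore. It is only a sketch: the function names, the exception type, and the annotation key are hypothetical and not part of Gardener or Kubernetes.

```python
from typing import Any, Callable, Dict


class WebhookUnavailable(Exception):
    """Raised by the simulated webhook call when the endpoint cannot be reached."""


def admit(obj: Dict[str, Any],
          call_webhook: Callable[[Dict[str, Any]], Dict[str, Any]],
          failure_policy: str = "Fail") -> Dict[str, Any]:
    """Return the object that would be persisted, or raise if the request is rejected."""
    try:
        return call_webhook(obj)
    except WebhookUnavailable:
        if failure_policy == "Fail":
            # With Fail, Gardener's request is rejected instead of silently
            # persisting an object that lacks the cloud-specific changes.
            raise RuntimeError("request rejected: webhook unavailable")
        # With Ignore, the unmodified object is persisted.
        return obj


def aws_annotator(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical mutating webhook that adds a cloud-specific annotation."""
    mutated = dict(obj)
    annotations = dict(obj.get("metadata", {}).get("annotations", {}))
    annotations["example.com/cloud-specific"] = "true"  # hypothetical key
    mutated["metadata"] = {**obj.get("metadata", {}), "annotations": annotations}
    return mutated


def unreachable(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Simulates a webhook whose backing controller is down."""
    raise WebhookUnavailable()
```

With Ignore, a Service created while the botanist is down would be stored without its cloud-specific annotation, and nothing would signal the omission. With Fail, the request errors out and Gardener can retry later.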

Modification of existing CloudProfile and Shoot resources

We will introduce the new API group gardener.cloud:

CloudProfiles

---
apiVersion: gardener.cloud/v1alpha1
kind: CloudProfile
metadata:
  name: aws
spec:
  type: aws
# caBundle: |
#   -----BEGIN CERTIFICATE-----
#   ...
#   -----END CERTIFICATE-----
  dnsProviders:
  - type: aws-route53
  - type: unmanaged
  kubernetes:
    versions:
    - 1.12.1
    - 1.11.0
    - 1.10.5
  machineTypes:
  - name: m4.large
    cpu: "2"
    gpu: "0"
    memory: 8Gi
  # storage: 20Gi   # optional (not needed in every environment, may only be specified if no volumeTypes have been specified)
  ...
  volumeTypes:      # optional (not needed in every environment, may only be specified if no machineType has a `storage` field)
  - name: gp2
    class: standard
  - name: io1
    class: premium
  providerConfig:
    apiVersion: aws.cloud.gardener.cloud/v1alpha1
    kind: CloudProfileConfig
    constraints:
      minimumVolumeSize: 20Gi
      machineImages:
      - name: coreos
        regions:
        - name: eu-west-1
          ami: ami-32d1474b
        - name: us-east-1
          ami: ami-e582d29f
      zones:
      - region: eu-west-1
        zones:
        - name: eu-west-1a
          unavailableMachineTypes: # list of machine types defined above that are not available in this zone
          - name: m4.large
          unavailableVolumeTypes:  # list of volume types defined above that are not available in this zone
          - name: gp2
        - name: eu-west-1b
        - name: eu-west-1c

Shoots

apiVersion: gardener.cloud/v1alpha1
kind: Shoot
metadata:
  name: johndoe-aws
  namespace: garden-dev
spec:
  cloudProfileName: aws
  secretBindingName: core-aws
  cloud:
    type: aws
    region: eu-west-1
    providerConfig:
      apiVersion: aws.cloud.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      networks:
        vpc: # specify either 'id' or 'cidr'
        # id: vpc-123456
          cidr: 10.250.0.0/16
        internal:
        - 10.250.112.0/22
        public:
        - 10.250.96.0/22
        workers:
        - 10.250.0.0/19
      zones:
      - eu-west-1a
    workerPools:
    - name: pool-01
    # Taints, labels, and annotations are not yet implemented. This requires interaction with the machine-controller-manager, see
    # https://github.com/gardener/machine-controller-manager/issues/174. It is only mentioned here as future proposal.
    # taints:
    # - key: foo
    #   value: bar
    #   effect: PreferNoSchedule
    # labels:
    # - key: bar
    #   value: baz
    # annotations:
    # - key: foo
    #   value: hugo
      machineType: m4.large
      volume: # optional, not needed in every environment, may only be specified if the referenced CloudProfile contains the volumeTypes field
        type: gp2
        size: 20Gi
      providerConfig:
        apiVersion: aws.cloud.gardener.cloud/v1alpha1
        kind: WorkerPoolConfig
        machineImage:
          name: coreos
          ami: ami-d0dcef3
        zones:
        - eu-west-1a
      minimum: 2
      maximum: 2
      maxSurge: 1
      maxUnavailable: 0
  kubernetes:
    version: 1.11.0
    ...
  dns:
    provider: aws-route53
    domain: johndoe-aws.garden-dev.example.com
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: true
  backup:
    schedule: "*/5 * * * *"
    maximum: 7
  addons:
    kube2iam:
      enabled: false
    kubernetes-dashboard:
      enabled: true
    cluster-autoscaler:
      enabled: true
    nginx-ingress:
      enabled: true
      loadBalancerSourceRanges: []
    kube-lego:
      enabled: true
      email: john.doe@example.com

ℹ The specifications for the other cloud providers Gardener already has an implementation for look similar.

CRD definitions and workflow adaptation

In the following, we outline the CRD definitions which define the API between Gardener and the dedicated controllers. After that, we will take a look at the current reconciliation/deletion flow and describe what it would look like if we implemented this proposal.

Custom resource definitions

Every CRD has a .spec.type field containing the respective instance of the dimension the CRD represents, e.g., the cloud provider, the DNS provider, or the operating system name. Moreover, the .status field must contain

  • observedGeneration (int64), a field indicating which generation the controller last worked on.
  • state (*runtime.RawExtension), a field which is not interpreted by Gardener but persisted; it should be treated as opaque and only be used by the respective CRD-specific controller (it can store anything it needs to reconstruct its own state).
  • lastError (object), a field which is optional and only present if the last operation ended with an error state.
  • lastOperation (object), a field which always exists and which indicates what the last operation of the controller was.
  • conditions (list), a field allowing the controller to report health checks for its area of responsibility.

Some CRDs might have a .spec.providerConfig or a .status.providerStatus field containing controller-specific information that is treated as opaque by Gardener and will only be copied to dependent or depending CRDs.
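As a rough illustration of how Gardener could evaluate this status contract while waiting for a controller, the sketch below classifies an extension resource as pending, done, or error. It is not Gardener's actual implementation. The function name is hypothetical, and the lastOperation state value "Succeeded" is an assumption (the examples in this proposal only show "Processing").

```python
from typing import Any, Dict


def extension_progress(resource: Dict[str, Any]) -> str:
    """Classify an extension resource as 'pending', 'done', or 'error'."""
    status = resource.get("status", {})
    generation = resource.get("metadata", {}).get("generation")

    # The controller has not yet worked on the latest .spec.
    if status.get("observedGeneration") != generation:
        return "pending"

    # lastError is only present if the last operation ended with an error.
    if "lastError" in status:
        return "error"

    # Assumption: a finished operation is reported with state "Succeeded".
    if status.get("lastOperation", {}).get("state") == "Succeeded":
        return "done"

    return "pending"
```

Comparing observedGeneration with metadata.generation first prevents a stale status from an earlier reconciliation being mistaken for the result of the current one.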

DNS records

Every Shoot needs two DNS records (or three, depending on whether nginx-ingress addon is enabled), one so-called “internal” record that Gardener uses in the kubeconfigs of the Shoot cluster’s system components, and one so-called “external” record which is used in the kubeconfig provided to the user.

---
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSProvider
metadata:
  name: alicloud
  namespace: default
spec:
  type: alicloud-dns
  secretRef:
    name: alicloud-credentials
  domains:
    include:
    - my.own.domain.com
---
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  name: dns
  namespace: default
spec:
  dnsName: dns.my.own.domain.com
  ttl: 600
  targets:
  - 8.8.8.8
status:
  observedGeneration: 4
  state: some-state
  lastError:
    lastUpdateTime: 2018-04-04T07:08:51Z
    description: some-error message
    codes:
    - ERR_UNAUTHORIZED
  lastOperation:
    lastUpdateTime: 2018-04-04T07:24:51Z
    progress: 70
    type: Reconcile
    state: Processing
    description: Currently provisioning ...
  conditions:
  - lastTransitionTime: 2018-07-11T10:18:25Z
    message: DNS record has been created and is available.
    reason: RecordResolvable
    status: "True"
    type: Available
    propagate: false
  providerStatus:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: DNSStatus
    ...

Infrastructure provisioning

The Infrastructure CRD contains the information about VPC, networks, security groups, availability zones, …; basically, everything that needs to be prepared before actual VMs/load balancers/… can be provisioned.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--core--aws-01
spec:
  type: aws
  providerConfig:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    networks:
      vpc:
        cidr: 10.250.0.0/16
      internal:
      - 10.250.112.0/22
      public:
      - 10.250.96.0/22
      workers:
      - 10.250.0.0/19
    zones:
    - eu-west-1a
  dns:
    apiserver: api.aws-01.core.example.com
  region: eu-west-1
  secretRef:
    name: my-aws-credentials
  sshPublicKey: |
        base64(key)
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  providerStatus:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureStatus
    vpc:
      id: vpc-1234
      subnets:
      - id: subnet-acbd1234
        name: workers
        zone: eu-west-1
      securityGroups:
      - id: sg-xyz12345
        name: workers
    iam:
      nodesRoleARN: <some-arn>
      instanceProfileName: foo
    ec2:
      keyName: bar

Backup infrastructure provisioning

The BackupInfrastructure CRD in the Seeds tells the cloud-specific controller to prepare a blob store bucket/container which can later be used to store etcd backups.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupInfrastructure
metadata:
  name: etcd-backup
  namespace: shoot--core--aws-01
spec:
  type: aws
  region: eu-west-1
  storageContainerName: asdasjndasd-1293912378a-2213
  secretRef:
    name: my-aws-credentials
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Cloud config (user-data) for bootstrapping machines

Gardener will continue to keep knowledge about the content of the cloud config scripts, but it will hand it over to the respective OS-specific controller which will generate the specific valid representation. Gardener creates two MachineCloudConfig CRDs, one for the cloud-config-downloader (which will later flow into the WorkerPool CRD) and one for the real cloud-config (which will be stored as a Secret in the Shoot’s kube-system namespace, and downloaded and executed by the cloud-config-downloader on the machines).

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: MachineCloudConfig
metadata:
  name: pool-01-downloader
  namespace: shoot--core--aws-01
spec:
  type: CoreOS
  units:
  - name: cloud-config-downloader.service
    command: start
    enable: true
    content: |
      [Unit]
      Description=Downloads the original cloud-config from Shoot API Server and executes it
      After=docker.service docker.socket
      Wants=docker.socket
      [Service]
      Restart=always
      RestartSec=30
      EnvironmentFile=/etc/environment
      ExecStart=/bin/sh /var/lib/cloud-config-downloader/download-cloud-config.sh      
  files:
  - path: /var/lib/cloud-config-downloader/credentials/kubeconfig
    permissions: 0644
    content:
      secretRef:
        name: cloud-config-downloader
        dataKey: kubeconfig
  - path: /var/lib/cloud-config-downloader/download-cloud-config.sh
    permissions: 0644
    content:
      inline:
        encoding: b64
        data: IyEvYmluL2Jhc2ggL...
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  cloudConfig: | # base64-encoded
    #cloud-config

    coreos:
      update:
        reboot-strategy: off
      units:
      - name: cloud-config-downloader.service
        command: start
        enable: true
        content: |
          [Unit]
          Description=Downloads the original cloud-config from Shoot API Server and executes it
          After=docker.service docker.socket
          Wants=docker.socket
          [Service]
          Restart=always
          RestartSec=30
          ...          

ℹ The cloud-config-downloader script does not only download the cloud-config initially but also at regular intervals, e.g., every 30s. If it sees an updated cloud-config, it applies it again by reloading and restarting all systemd units in order to reflect the changes. How this reloading of the cloud-config happens is OS-specific as well and no longer known to Gardener; however, it must already be part of the script. On CoreOS, you have to execute /usr/bin/coreos-cloudinit --from-file=<path>, whereas on SLES you execute cloud-init --file <path> single -n write_files --frequency=once. As Gardener doesn’t know these commands, it will write a placeholder expression instead (e.g., {RELOAD-CLOUD-CONFIG-WITH-PATH:<path>}), and the OS-specific controller is asked to replace it with the proper expression.
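The placeholder substitution described in this note could be sketched as follows. This is a minimal illustration, not the code of any OS-specific controller. The two command templates are taken from the note, while the function name, the lookup table, and the exact placeholder grammar (everything up to the closing brace is the path) are assumptions.

```python
import re

# Reload commands per OS type, as named in the note above.
RELOAD_COMMANDS = {
    "CoreOS": "/usr/bin/coreos-cloudinit --from-file={path}",
    "SLES": "cloud-init --file {path} single -n write_files --frequency=once",
}

# Assumption: the path is everything between the colon and the closing brace.
PLACEHOLDER = re.compile(r"\{RELOAD-CLOUD-CONFIG-WITH-PATH:([^}]+)\}")


def render_reload_command(script: str, os_type: str) -> str:
    """Replace every reload placeholder in the script with the OS-specific command."""
    template = RELOAD_COMMANDS[os_type]  # raises KeyError for an unsupported OS
    return PLACEHOLDER.sub(lambda m: template.format(path=m.group(1)), script)
```

For example, render_reload_command("{RELOAD-CLOUD-CONFIG-WITH-PATH:/tmp/cc}", "CoreOS") yields the coreos-cloudinit invocation for /tmp/cc (the path is illustrative only).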

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: MachineCloudConfig
metadata:
  name: pool-01-original # stored as secret and downloaded later
  namespace: shoot--core--aws-01
spec:
  type: CoreOS
  units:
  - name: docker.service
    drop-ins:
    - name: 10-docker-opts.conf
      content: |
        [Service]
        Environment="DOCKER_OPTS=--log-opt max-size=60m --log-opt max-file=3"        
  - name: docker-monitor.service
    command: start
    enable: true
    content: |
      [Unit]
      Description=Docker-monitor daemon
      After=kubelet.service
      [Service]
      Restart=always
      EnvironmentFile=/etc/environment
      ExecStart=/opt/bin/health-monitor docker      
  - name: kubelet.service
    command: start
    enable: true
    content: |
      [Unit]
      Description=kubelet daemon
      Documentation=https://kubernetes.io/docs/admin/kubelet
      After=docker.service
      Wants=docker.socket rpc-statd.service
      [Service]
      Restart=always
      RestartSec=10
      EnvironmentFile=/etc/environment
      ExecStartPre=/bin/docker run --rm -v /opt/bin:/opt/bin:rw k8s.gcr.io/hyperkube:v1.11.2 cp /hyperkube /opt/bin/
      ExecStartPre=/bin/sh -c 'hostnamectl set-hostname $(cat /etc/hostname | cut -d '.' -f 1)'
      ExecStart=/opt/bin/hyperkube kubelet \
          --allow-privileged=true \
          --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig-bootstrap \
          ...      
  files:
  - path: /var/lib/kubelet/ca.crt
    permissions: 0644
    content:
      secretRef:
        name: ca-kubelet
        dataKey: ca.crt
  - path: /var/lib/cloud-config-downloader/download-cloud-config.sh
    permissions: 0644
    content:
      inline:
        encoding: b64
        data: IyEvYmluL2Jhc2ggL...
  - path: /etc/sysctl.d/99-k8s-general.conf
    permissions: 0644
    content:
      inline:
        data: |
          vm.max_map_count = 135217728
          kernel.softlockup_panic = 1
          kernel.softlockup_all_cpu_backtrace = 1
          ...          
  - path: /opt/bin/health-monitor
    permissions: 0755
    content:
      inline:
        data: |
          #!/bin/bash
          set -o nounset
          set -o pipefail

          function docker_monitoring {
          ...          
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  cloudConfig: ...

Cloud-specific controllers which might need to add another kernel option or another flag to the kubelet, maybe even another file to the disk, can register a MutatingWebhookConfiguration for that resource and modify it upon creation/update. The task of the MachineCloudConfig controller is only to generate the OS-specific cloud-config based on the .spec field, not to add or change any logic related to Shoots.
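A mutating webhook of this kind answers the API server with an AdmissionReview whose response carries a base64-encoded JSON Patch. The sketch below builds such a response that appends one extra file to .spec.files of a MachineCloudConfig. It is only an illustration under assumptions: the function name, the file path, and the sysctl setting are hypothetical, and it uses the admission.k8s.io/v1 response shape rather than whatever version a real botanist would target.

```python
import base64
import json
from typing import Any, Dict


def mutate_machine_cloud_config(review: Dict[str, Any]) -> Dict[str, Any]:
    """Build an AdmissionReview response that adds a cloud-specific file."""
    patch = [{
        "op": "add",
        "path": "/spec/files/-",  # JSON Patch: append to the existing files list
        "value": {
            "path": "/etc/sysctl.d/99-cloud-specific.conf",  # hypothetical file
            "permissions": 0o644,
            "content": {"inline": {"data": "net.ipv4.ip_forward = 1\n"}},  # hypothetical
        },
    }]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],  # must echo the request's uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the webhook only patches .spec, the MachineCloudConfig controller stays unaware of the cloud provider and simply renders whatever units and files it finds.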

Worker pools definition

For every worker pool defined in the Shoot, Gardener will create a WorkerPool CRD which shall be picked up by a cloud-specific controller and translated to MachineClasses and MachineDeployments.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: WorkerPool
metadata:
  name: pool-01
  namespace: shoot--core--aws-01
spec:
  cloudConfig: base64(downloader-cloud-config)
  infrastructureProviderStatus:
    apiVersion: aws.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureStatus
    vpc:
      id: vpc-1234
      subnets:
      - id: subnet-acbd1234
        name: workers
        zone: eu-west-1
      securityGroups:
      - id: sg-xyz12345
        name: workers
    iam:
      nodesRoleARN: <some-arn>
      instanceProfileName: foo
    ec2:
      keyName: bar
  providerConfig:
    apiVersion: aws.cloud.gardener.cloud/v1alpha1
    kind: WorkerPoolConfig
    machineImage:
      name: CoreOS
      ami: ami-d0dcef3b
    machineType: m4.large
    volumeType: gp2
    volumeSize: 20Gi
    zones:
    - eu-west-1a
  region: eu-west-1
  secretRef:
    name: my-aws-credentials
  minimum: 2
  maximum: 2
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Generic resources

Some components are cloud-specific and must be deployed by the cloud-specific botanists. Others might need to deploy another pod next to the Shoot’s control plane or do something else entirely. Some of these might be important for a functional cluster (e.g., the cloud-controller-manager, or a CSI plugin in the future), and controllers should be able to report errors back to the user. Consequently, Gardener would write a Generic CRD to the Seed to trigger the controllers to deploy these components. No individual operation depends on the status of these resources; however, the entire reconciliation flow does.

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Generic
metadata:
  name: cloud-components
  namespace: shoot--core--aws-01
spec:
  type: cloud-components
  secretRef:
    name: my-aws-credentials
  shootSpec:
    ...
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Shoot state

In order to enable moving the control plane of a Shoot between Seed clusters (e.g., if a Seed cluster is not available anymore or entirely broken), Gardener must store some non-reconstructable state, potentially also the state written by the controllers. Gardener watches these extension CRDs and copies their .status.state into a ShootState resource in the Garden cluster. Any observed status change written by the respective CRD controllers must be immediately reflected in the ShootState resource. The contract between Gardener and those controllers is: Every controller must be capable of reconstructing its own environment based on both the state it has written before and on the real world’s conditions/state.

---
apiVersion: gardener.cloud/v1alpha1
kind: ShootState
metadata:
  name: shoot--core--aws-01
shootRef:
  name: aws-01
  project: core
state:
  secrets:
  - name: ca
    data: ...
  - name: kube-apiserver-cert
    data: ...
  resources:
  - kind: DNS
    name: record-1
    state: <copied-state-of-dns-crd>
  - kind: Infrastructure
    name: networks
    state: <copied-state-of-infrastructure-crd>
  ...
  <other fields required to keep track of>

We cannot assume that Gardener is always online to observe the most recent states the controllers have written to their resources. Consequently, the information stored here must not be used as a “single source of truth”; the controllers must potentially check the real world’s status to reconstruct themselves. However, this must be part of their normal reconciliation logic anyway and is a general best practice for Kubernetes controllers.
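Copying a controller's opaque .status.state into the ShootState could look roughly like the following. This is a sketch assuming the ShootState layout shown above (a state.resources list keyed by kind and name), and the function name is hypothetical.

```python
from typing import Any, Dict


def record_state(shoot_state: Dict[str, Any], extension: Dict[str, Any]) -> None:
    """Upsert the extension's opaque .status.state into the ShootState in place."""
    entry = {
        "kind": extension["kind"],
        "name": extension["metadata"]["name"],
        # Gardener does not interpret this value; it is copied verbatim.
        "state": extension.get("status", {}).get("state"),
    }
    resources = shoot_state.setdefault("state", {}).setdefault("resources", [])
    for index, existing in enumerate(resources):
        if (existing["kind"], existing["name"]) == (entry["kind"], entry["name"]):
            resources[index] = entry  # overwrite the previously observed state
            return
    resources.append(entry)
```

Keying the entries by kind and name keeps exactly one record per extension resource, so a later status change replaces the stale copy instead of accumulating history.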

Shoot health checks/conditions

Some of the existing conditions already contain specific code which shall be simplified as well. All of the CRDs described above have a .status.conditions field to which the controllers may write relevant health information about their area of responsibility. Gardener will pick them up and copy them over to the Shoot’s .status.conditions (only those conditions setting propagate=true).
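The propagation rule can be illustrated with a small filter. This is a sketch with a hypothetical function name. It copies the conditions unchanged and leaves open how Gardener would merge them with the Shoot's existing conditions.

```python
from typing import Any, Dict, List


def propagated_conditions(extensions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect the conditions that extension controllers marked with propagate=true."""
    result: List[Dict[str, Any]] = []
    for extension in extensions:
        for condition in extension.get("status", {}).get("conditions", []):
            # Conditions without the flag, or with propagate=false, stay local.
            if condition.get("propagate") is True:
                result.append(dict(condition))
    return result
```

Applied to the DNSEntry example above, the Available condition would not be copied to the Shoot because it sets propagate: false.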

Reconciliation flow

We now examine the current Shoot creation/reconciliation flow and describe what it could look like when applying this proposal:

| Operation | Description |
| --- | --- |
| botanist.DeployNamespace | Gardener creates the namespace for the Shoot in the Seed cluster. |
| botanist.DeployKubeAPIServerService | Gardener creates a Service of type LoadBalancer in the Seed. AWS Botanist registers a Mutating Webhook and adds its AWS-specific annotation. |
| botanist.WaitUntilKubeAPIServerServiceIsReady | Gardener checks the .status object of the just created Service in the Seed. The contract is that also clouds not supporting load balancers must react to the Service object and modify the .status to correctly reflect the kube-apiserver’s ingress IP. |
| botanist.DeploySecrets | Gardener creates the secrets/certificates it needs like it does today, but it provides utility functions that can be adopted by Botanists/other controllers if they need additional certificates/secrets created on their own. (We should also add labels to all secrets.) |
| botanist.Shoot.Components.DNS.Internal{Provider/Entry}.Deploy | Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and creates a corresponding DNS record (see CRD specification above). |
| botanist.Shoot.Components.DNS.External{Provider/Entry}.Deploy | Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and creates a corresponding DNS record (see CRD specification above). |
| shootCloudBotanist.DeployInfrastructure | Gardener creates an Infrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job (see CRD above). |
| botanist.DeployBackupInfrastructure | Gardener creates a BackupInfrastructure resource in the Garden cluster. (The BackupInfrastructure controller creates a BackupInfrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job (see CRD above).) |
| botanist.WaitUntilBackupInfrastructureReconciled | Gardener checks the .status object of the just created BackupInfrastructure resource. |
| hybridBotanist.DeployETCD | Gardener only deploys the etcd StatefulSet, without any backup-restore sidecar. The cloud-specific Botanist registers a Mutating Webhook and adds the backup-restore sidecar, and it also creates the Secret needed by the backup-restore sidecar. |
| botanist.WaitUntilEtcdReady | Gardener checks the .status object of the etcd StatefulSet and waits until readiness is indicated. |
| hybridBotanist.DeployCloudProviderConfig | Gardener does not execute this anymore because it doesn’t know anything about cloud-specific configuration. |
| hybridBotanist.DeployKubeAPIServer | Gardener only deploys the kube-apiserver Deployment without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-apiserver to run in its cloud environment. |
| hybridBotanist.DeployKubeControllerManager | Gardener only deploys the kube-controller-manager Deployment without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-controller-manager to run in its cloud environment (e.g., the cloud-config). |
| hybridBotanist.DeployKubeScheduler | Gardener only deploys the kube-scheduler Deployment without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-scheduler to run in its cloud environment. |
| hybridBotanist.DeployCloudControllerManager | Gardener does not execute this anymore because it doesn’t know anything about cloud-specific configuration. The Botanists would be responsible for deploying their own cloud-controller-manager now. They would watch for the kube-apiserver Deployment to exist, and as soon as it does, they deploy the CCM. (Side note: The Botanist would also be responsible for deploying further controllers needed for this cloud environment, e.g., F5-controllers or CSI plugins.) |
| botanist.WaitUntilKubeAPIServerReady | Gardener checks the .status object of the kube-apiserver Deployment and waits until readiness is indicated. |
| botanist.InitializeShootClients | Unchanged; Gardener creates a Kubernetes client for the Shoot cluster. |
| botanist.DeployMachineControllerManager | Deleted; Gardener no longer deploys MCM itself. See below. |
| hybridBotanist.ReconcileMachines | Gardener creates a Worker CRD in the Seed, and the responsible Worker controller picks it up and does its job (see CRD above). It also deploys the machine-controller-manager. Gardener waits until the status indicates that the controller is done. |
| hybridBotanist.DeployKubeAddonManager | This function also computes the CoreOS cloud-config (because the secret storing it is managed by the kube-addon-manager). Gardener would deploy the CloudConfig-specific CRD in the Seed, and the responsible OS controller picks it up and does its job (see CRD above). The Botanists which would have to modify something would register a Webhook for this CloudConfig-specific resource and apply their changes. The rest is mostly unchanged: Gardener generates the manifests for the addons and deploys the kube-addon-manager into the Seed. AWS Botanist registers a Webhook for nginx-ingress. Azure Botanist registers a Webhook for calico. Gardener will no longer deploy the StorageClasses. Instead, the Botanists wait until the kube-apiserver is available and deploy them. In the long term we want to get rid of optional addons inside the Gardener core and implement a sophisticated addon concept (see #246). |
| shootCloudBotanist.DeployKube2IAMResources | This function would be removed (currently Gardener would execute a Terraform job creating the IAM roles specified in the Shoot manifest). We cannot keep this behavior; the user would be responsible for creating the needed IAM roles on their own. |
| botanist.Shoot.Components.Nginx.DNSEntry | Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and creates a corresponding DNS record (see CRD specification above). |
| botanist.WaitUntilVPNConnectionExists | Unchanged; Gardener checks that it is possible to port-forward to a Shoot pod. |
| seedCloudBotanist.ApplyCreateHook | This function would be removed (actually, only the AWS Botanist implements it). AWS Botanist deploys the aws-lb-readvertiser once the API Server is deployed and updates the ELB health check protocol once the load balancer pointing to the API server is created. |
| botanist.DeploySeedMonitoring | Unchanged; Gardener deploys the monitoring stack into the Seed. |
botanist.DeployClusterAutoscalerUnchanged, Gardener deploys the cluster-autoscaler into the Seed.

ℹ We can easily lift the contract later and allow dynamic network plugins or not using the VPN solution at all. We could also introduce a dedicated ControlPlane CRD and leave the complete responsibility of deploying kube-apiserver, kube-controller-manager, etc. to other controllers (if we need it at some point in time).

Deletion flow

We now examine the current Shoot deletion flow and briefly describe how it could look when applying this proposal:

Operation | Description
botanist.DeploySecrets | This is just refreshing the cloud provider secret in the Shoot namespace in the Seed (in case the user has changed it before triggering the deletion). This function would stay as it is.
hybridBotanist.RefreshMachineClassSecrets | This function would disappear.
The Worker Pool controller needs to watch the referenced secret and update the generated MachineClassSecrets immediately.
hybridBotanist.RefreshCloudProviderConfig | This function would disappear. The Botanist needs to watch the referenced secret and update the generated cloud-provider-config immediately.
botanist.RefreshCloudControllerManagerChecksums | See “hybridBotanist.RefreshCloudProviderConfig”.
botanist.RefreshKubeControllerManagerChecksums | See “hybridBotanist.RefreshCloudProviderConfig”.
botanist.InitializeShootClients | Unchanged; Gardener creates a Kubernetes client for the Shoot cluster.
botanist.DeleteSeedMonitoring | Unchanged; Gardener deletes the monitoring stack.
botanist.DeleteKubeAddonManager | Unchanged; Gardener deletes the kube-addon-manager.
botanist.DeleteClusterAutoscaler | Unchanged; Gardener deletes the cluster-autoscaler.
botanist.WaitUntilKubeAddonManagerDeleted | Unchanged; Gardener waits until the kube-addon-manager is deleted.
botanist.CleanCustomResourceDefinitions | Unchanged; Gardener cleans the CRDs in the Shoot.
botanist.CleanKubernetesResources | Unchanged; Gardener cleans all remaining Kubernetes resources in the Shoot.
hybridBotanist.DestroyMachines | Gardener deletes the WorkerPool-specific CRD in the Seed, and the responsible WorkerPool-controller picks it up and does its job.
Gardener waits until the CRD is deleted.
shootCloudBotanist.DestroyKube2IAMResources | This function would disappear (currently Gardener executes a Terraform job deleting the IAM roles specified in the Shoot manifest). We cannot keep this behavior; users would be responsible for deleting the needed IAM roles on their own.
shootCloudBotanist.DestroyInfrastructure | Gardener deletes the Infrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job.
Gardener waits until the CRD is deleted.
botanist.Shoot.Components.DNS.External{Provider/Entry}.Destroy | Gardener deletes the DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and does its job.
Gardener waits until the CRD is deleted.
botanist.DeleteKubeAPIServer | Unchanged; Gardener deletes the kube-apiserver.
botanist.DeleteBackupInfrastructure | Unchanged; Gardener deletes the BackupInfrastructure object in the Garden cluster.
(The BackupInfrastructure controller deletes the BackupInfrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job.
The BackupInfrastructure controller waits until the CRD is deleted.)
botanist.Shoot.Components.DNS.Internal{Provider/Entry}.Destroy | Gardener deletes the DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and does its job.
Gardener waits until the CRD is deleted.
botanist.DeleteNamespace | Unchanged; Gardener deletes the Shoot namespace in the Seed cluster.
botanist.WaitUntilSeedNamespaceDeleted | Unchanged; Gardener waits until the Shoot namespace in the Seed has been deleted.
botanist.DeleteGardenSecrets | Unchanged; Gardener deletes the kubeconfig/ssh-keypair Secret in the project namespace in the Garden.

Gardenlet

One part of the whole extensibility work will also be to further split Gardener itself. Inspired by Kubernetes, we plan to move the Shoot reconciliation/deletion controller loops as well as the BackupInfrastructure reconciliation/deletion controller loops into a dedicated “gardenlet” component that will run in the Seed cluster. With that, it can talk locally to the responsible kube-apiserver and we no longer need to perform every operation out of the Garden cluster. This approach will also help us with scalability, performance, maintainability, and testability in general.

This architectural change implies that the Kubernetes API server of the Garden cluster must be exposed publicly (or at least be reachable by the registered Seeds). The Gardener controller-manager will remain and will keep its CloudProfile, SecretBinding, Quota, Project, and Seed controller loops. One part of the seed controller could be to deploy the “gardenlet” into the Seeds; however, this would require network connectivity to the Seed cluster.

Shoot control plane movement/migration

Automatically moving control planes is difficult with the current implementation as some resources created in the old Seed must be moved to the new one. However, some of them are not under Gardener’s control (e.g., Machine resources). Moreover, the old control plane must be deactivated somehow to ensure that no two controllers work on the same things (e.g., virtual machines) from different environments.

Gardener does not only deploy a DNS controller into the Seeds but also into its own Garden cluster. For every Shoot cluster, Gardener commissions it to create a DNS TXT record containing the name of the Seed responsible for the Shoot (holding the control plane), e.g.

dig -t txt aws-01.core.garden.example.com

...
;; ANSWER SECTION:
aws-01.core.garden.example.com. 120 IN	TXT "Seed=seed-01"
...

Gardener always keeps the DNS record up-to-date based on which Seed is responsible.

In the above CRD examples, one object in the .spec section was omitted on purpose so that the relevant specifications could be discussed first. It is needed to get Shoot control plane movement/migration working and is therefore only explained in this section. Every CRD also has the following section in its .spec:

leadership:
  record: aws-01.core.garden.example.com
  value: seed-01
  leaseSeconds: 60

Before every operation the CRD-controllers check this DNS record (based on the .spec.leadership.leaseSeconds configuration) and verify that its result is equal to the .spec.leadership.value field. If both match they know that they should act on the resource, otherwise they stop doing anything.
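
The check described above can be sketched as follows. This is a minimal illustration, not the planned framework: the `lookup_txt` resolver is a hypothetical callable that returns the TXT value for a record name, and caching the answer for `leaseSeconds` is our assumption about how the lease configuration would be used.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def parse_seed(txt_value: str) -> Optional[str]:
    """Extract the Seed name from a TXT value such as "Seed=seed-01"."""
    key, sep, value = txt_value.strip().strip('"').partition("=")
    return value if sep and key == "Seed" else None


class LeadershipChecker:
    """Compares the DNS answer with .spec.leadership.value (assumed caching)."""

    def __init__(self, lookup_txt: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic):
        self._lookup_txt = lookup_txt  # hypothetical resolver: record -> TXT value
        self._clock = clock
        self._cache: Dict[str, Tuple[Optional[str], float]] = {}

    def should_act(self, leadership: dict) -> bool:
        record = leadership["record"]
        now = self._clock()
        cached = self._cache.get(record)
        # Assumption: the answer is re-resolved once leaseSeconds have passed.
        if cached is None or now - cached[1] >= leadership["leaseSeconds"]:
            cached = (parse_seed(self._lookup_txt(record)), now)
            self._cache[record] = cached
        return cached[0] == leadership["value"]
```

A controller would call `should_act` with the `leadership` section of its resource before every operation and stop when it returns `False`.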

ℹ We will provide an easy-to-use framework for the controllers containing all of these features out-of-the-box in order to allow the developers to focus on writing the actual controller logic.

When a control plane move is triggered, the .spec.cloud.seed field of the respective Shoot is changed. Gardener will change the respective DNS record’s value (aws-01.core.garden.example.com) to contain the new Seed name. After that it will wait 2*60s (twice the leaseSeconds value from the example above) to be sure that all controllers have observed the change. Then it starts reconciling and applying the CRDs together with a preset .status.state into the new Seed (based on its last observations, which were stored in the respective ShootState object in the Garden cluster). The controllers are, as per contract, asked to reconstruct their own environment based on the .status.state they have written before and the real world’s status. Apart from that, the normal reconciliation flow gets executed.

Gardener stores the list of Seeds that were responsible for hosting a Shoot’s control plane at some time in the Shoot’s .status.seeds list so that it knows which Seeds must be cleaned up (i.e., where the control plane must be deleted because it has been moved). Once cleaned up, the Seed’s name will be removed from that list.
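
The bookkeeping for a move could look roughly like the sketch below. The function names are hypothetical, the Shoot is modelled as a plain dictionary, and we assume that only previously responsible Seeds (not the current one) are kept in .status.seeds.

```python
from typing import List


def trigger_move(shoot: dict, new_seed: str, lease_seconds: int) -> int:
    """Point the Shoot at new_seed; return the seconds to wait before reconciling."""
    old_seed = shoot["spec"]["cloud"]["seed"]
    seeds: List[str] = shoot.setdefault("status", {}).setdefault("seeds", [])
    if old_seed not in seeds:
        seeds.append(old_seed)  # remember where a control plane must be cleaned up
    shoot["spec"]["cloud"]["seed"] = new_seed
    # Twice the lease, so that every controller has re-checked the DNS record.
    return 2 * lease_seconds


def mark_cleaned_up(shoot: dict, seed: str) -> None:
    """Drop a Seed from .status.seeds once its old control plane is deleted."""
    if seed == shoot["spec"]["cloud"]["seed"]:
        raise ValueError("refusing to clean up the currently responsible Seed")
    shoot["status"]["seeds"].remove(seed)
```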

BackupInfrastructure migration

One part of the reconciliation flow above is the provisioning of the infrastructure for the Shoot’s etcd backups (usually, this is a blob store bucket/container). Gardener already uses a separate BackupInfrastructure resource that is written into the Garden cluster and picked up by a dedicated BackupInfrastructure controller (bundled into the Gardener controller manager). This dedicated resource exists mainly to allow keeping backups for a certain “grace period” even after the Shoot deletion itself:

apiVersion: gardener.cloud/v1alpha1
kind: BackupInfrastructure
metadata:
  name: aws-01-bucket
  namespace: garden-core
spec:
  seed: seed-01
  shootUID: uuid-of-shoot

The actual provisioning is executed in a corresponding Seed cluster as Gardener can only assume network connectivity to the underlying cloud environment in the Seed. We would like to keep the created artifacts in the Seed (e.g., Terraform state) near the control plane. Consequently, when Gardener moves a control plane, it will update the .spec.seed field of the BackupInfrastructure resource as well. With the exact same logic described above, the BackupInfrastructure controller inside Gardener will move the backup infrastructure to the new Seed.

Registration of external controllers at Gardener

We want to have a dynamic registration process, i.e. we don’t want to hard-code any information about which controllers shall be deployed. The ideal solution would not even require a restart of Gardener when a new controller registers.

Every controller is registered by a ControllerRegistration resource that introduces every controller together with its supported resources (dimension (kind) and shape (type) combination) to Gardener:

apiVersion: gardener.cloud/v1alpha1
kind: ControllerRegistration
metadata:
  name: dns-aws-route53
spec:
  resources:
  - kind: DNS
    type: aws-route53
# deployment:
#   type: helm
#   providerConfig:
#     chart.tgz: base64(helm-chart)
#     values.yaml: |
#       foo: bar

Every .kind/.type combination may only exist once in the system.
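
A minimal sketch of how this uniqueness rule could be enforced when registrations are indexed. The function name is hypothetical and the resources are handled as plain dictionaries shaped like the ControllerRegistration example above.

```python
from typing import Dict, Iterable, Tuple


def index_registrations(registrations: Iterable[dict]) -> Dict[Tuple[str, str], str]:
    """Map every (kind, type) pair to the name of the registration that owns it."""
    owners: Dict[Tuple[str, str], str] = {}
    for registration in registrations:
        name = registration["metadata"]["name"]
        for resource in registration["spec"]["resources"]:
            key = (resource["kind"], resource["type"])
            if key in owners:
                # A second registration for the same pair violates the rule.
                raise ValueError(
                    f"{key[0]}/{key[1]} is already registered by {owners[key]}"
                )
            owners[key] = name
    return owners
```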

When a Shoot shall be reconciled, Gardener can identify, based on the referenced Seed and the content of the Shoot specification, which controllers are needed in the respective Seed cluster. It will request that the operators in the Garden cluster deploy the controllers they are responsible for to a specific Seed. This kind of communication happens via CRDs as well:

apiVersion: gardener.cloud/v1alpha1
kind: ControllerInstallation
metadata:
  name: dns-aws-route53
spec:
  registrationRef:
    name: dns-aws-route53
  seedRef:
    name: seed-01
status:
  conditions:
  - lastTransitionTime: 2018-08-07T15:09:23Z
    message: The controller has been successfully deployed to the seed.
    reason: ControllerDeployed
    status: "True"
    type: Available

The default scenario is that every controller gets deployed by a dedicated operator that knows how to handle its lifecycle operations like deployment, update, upgrade, deletion. This operator watches ControllerInstallation resources and reacts to those it is responsible for (that it has created earlier). Gardener is responsible for writing the .spec field; the operator is responsible for providing information in the .status indicating whether the controller was successfully deployed and is ready to be used. Gardener will also be able to ask for deletion of controllers from Seeds when they are not needed there anymore by deleting the corresponding ControllerInstallation object.
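
The split of responsibilities could be sketched like this: one helper builds the objects Gardener would write, the other reads the condition the operator reports. Both function names are hypothetical, `owners` is an assumed mapping from (kind, type) pairs to registration names, and naming the installation after the registration simply follows the example above.

```python
from typing import Dict, Iterable, List, Tuple


def required_installations(needed: Iterable[Tuple[str, str]],
                           owners: Dict[Tuple[str, str], str],
                           seed_name: str) -> List[dict]:
    """Build one ControllerInstallation per registration needed in the Seed."""
    # A KeyError here means no controller is registered for a needed pair.
    names = sorted({owners[pair] for pair in needed})
    return [
        {
            "apiVersion": "gardener.cloud/v1alpha1",
            "kind": "ControllerInstallation",
            "metadata": {"name": name},
            "spec": {
                "registrationRef": {"name": name},
                "seedRef": {"name": seed_name},
            },
        }
        for name in names
    ]


def is_available(installation: dict) -> bool:
    """True once the operator reports the Available condition as "True"."""
    conditions = installation.get("status", {}).get("conditions", [])
    return any(c["type"] == "Available" and c["status"] == "True"
               for c in conditions)
```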

ℹ The provided easy-to-use framework for the controllers will also contain these needed features to implement corresponding operators.

For most cases the controller deployment is very simple (just deploying it into the seed with some static configuration). In these cases, requiring another component (the operator) just to deploy the controller would be unnecessary effort. To simplify this situation Gardener will be able to react to ControllerInstallations specifying .spec.registration.deployment.type=helm. The controller would be registered with a ControllerRegistration resource that contains a Helm chart with all resources needed to deploy this controller into a seed (plus some static values). Gardener would render the Helm chart and deploy the resources into the seed. It will not react if .spec.registration.deployment.type!=helm, which allows any other deployment mechanism to be used as well. Controllers that are deployed by operators would not specify the .spec.deployment section in the ControllerRegistration at all.

ℹ Any controller requiring dynamic configuration values (e.g., based on the cloud provider or the region of the seed) must be installed with the operator approach.

Other cloud-specific parts

The Gardener API server has a few admission controllers that contain cloud-specific code as well. We have to replace these parts as well.

Defaulting and validation admission plugins

Right now, the admission controllers inside the Gardener API server perform a lot of validation and defaulting of fields in the Shoot specification. The cloud-specific parts of these admission controllers will be replaced by mutating admission webhooks that will get called instead. As we will have a dedicated operator running in the Garden cluster anyway, it will also get the responsibility to register this webhook if it needs to validate/default parts of the Shoot specification.

Example: The .spec.cloud.workerPools[*].providerConfig.machineImage field in the new Shoot manifest mentioned above could be omitted by the user and would get defaulted by the cloud-specific operator.

DNS Hosted Zone admission plugin

For the same reasons the existing DNS Hosted Zone admission plugin will be removed from the Gardener core and moved into the responsibility of the respective DNS-specific operators running in the Garden cluster.

Shoot Quota admission plugin

The Shoot quota admission plugin validates create or update requests on Shoots and checks that the specified machine/storage configuration is defined as per the referenced Quota objects. The cloud-specifics in this controller are no longer needed as the CloudProfile and the Shoot resource have been adapted: The machine/storage configuration is no longer in cloud-specific sections but in hard-wired fields in the general Shoot specification (see example resources above). The quota admission plugin will be simplified and remains in the Gardener core.

Shoot maintenance controller

Every Shoot cluster can define a maintenance time window in which Gardener will update the Kubernetes patch version (if enabled) and the used machine image version in the Shoot resource. While the Kubernetes version is not part of the providerConfig section in the CloudProfile resource, the machineImage field is, and thus Gardener can’t understand it any longer. In the future Gardener has to rely on the cloud-specific operator (probably the same one doing the defaulting/validation mentioned before) to update this field. In the maintenance time window the maintenance controller will update the Kubernetes patch version (if enabled) and add a trigger.gardener.cloud=maintenance annotation to the Shoot resource. The already registered mutating webhook will call the operator, which has to remove this annotation and update the machineImage in the .spec.cloud.workerPools[*].providerConfig sections.
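
The operator side of this handshake could be sketched as below. The function name and the shape of the machineImage value are hypothetical; the Shoot is treated as a plain dictionary and the webhook plumbing is left out.

```python
import copy

MAINTENANCE_ANNOTATION = "trigger.gardener.cloud"


def mutate_on_maintenance(shoot: dict, latest_machine_image: dict) -> dict:
    """Return a mutated copy of the Shoot if the maintenance trigger is set."""
    annotations = shoot.get("metadata", {}).get("annotations", {})
    if annotations.get(MAINTENANCE_ANNOTATION) != "maintenance":
        return shoot  # no trigger, leave the object untouched
    mutated = copy.deepcopy(shoot)
    # The operator removes the trigger annotation ...
    del mutated["metadata"]["annotations"][MAINTENANCE_ANNOTATION]
    # ... and updates the machineImage of every worker pool.
    for pool in mutated["spec"]["cloud"]["workerPools"]:
        pool.setdefault("providerConfig", {})["machineImage"] = dict(latest_machine_image)
    return mutated
```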

Alternatives

  • Alternative to DNS approach for Shoot control plane movement/migration: We have thought about rotating the credentials when a move is triggered, which would make all controllers ineffective immediately. However, one problem with this is that we would require IAM privileges for the user’s infrastructure account, which might not be desired. Another, more complicated problem is that we cannot assume API access in order to create technical users for all cloud environments that might be supported.

2 - 02 Backupinfra

Backup Infrastructure CRD and Controller Redesign

Goal

  • As an operator, I would like to efficiently use the backup bucket for multiple clusters, thereby limiting the total number of buckets required.
  • As an operator, I would like to use a different cloud provider for backup bucket provisioning than the cloud provider used for the seed infrastructure.
  • Have seed independent backups, so that we can easily migrate a shoot from one seed to another.
  • Execute the backup operations (including bucket creation and deletion) from a seed, because network connectivity may only be ensured from the seeds (not necessarily from the garden cluster).
  • Preserve the garden cluster as source of truth (no information is missing in the garden cluster to reconstruct the state of the backups even if seed and shoots are lost completely).
  • Do not violate the infrastructure limits in regards to blob store limits/quotas.

Motivation

Currently, every shoot cluster has its own etcd backup bucket with a centrally configured retention period. With the growing number of clusters, we will soon run out of the bucket quota on the cloud provider. Moreover, even after clusters are deleted, their backup buckets continue to exist for the configured retention period. Hence, there is a need to minimize the total number of buckets.

In addition, we currently use the seed infrastructure credentials to provision the bucket for etcd backups. This binds the backup bucket provider to the seed infrastructure provider.

Terminology

  • Bucket: Equivalent to an s3 bucket, abs container, gcs bucket, swift container, or alicloud bucket.
  • Object: Equivalent to an s3 object, abs blob, gcs object, swift object, or alicloud object; here, a snapshot/backup of etcd on the object store.
  • Directory: Object stores have no concept of a directory as such, but users usually treat a /-separated common prefix of a set of objects as a directory. The term folder is used for the same thing.
  • deletionGracePeriod: The grace period or retention period for which backups will be persisted after the deletion of a shoot.

Current Spec:

#BackupInfra spec
Kind: BackupInfrastructure
Spec:
    seed: seedName
    shootUID : shoot.status.uid

Current naming conventions

SeedNamespace: shoot--projectname--shootname
seed: seedname
ShootUID: shoot.status.UID
BackupInfraname: seednamespace+sha(uid)[:5]
Backup-bucket-name: BackupInfraName
BackupNamespace: backup--BackupInfraName
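
The naming scheme can be illustrated with a short sketch. The convention does not say which hash function is used or how the parts are joined, so SHA-1 (hex encoded) and a double-hyphen separator are assumptions here, chosen to match names of the form shoot--dev--example--3ef42; the function names are hypothetical.

```python
import hashlib


def backup_infra_name(project: str, shoot: str, shoot_uid: str) -> str:
    """Derive the BackupInfra name: seed namespace plus five hash characters."""
    seed_namespace = f"shoot--{project}--{shoot}"
    # Assumption: SHA-1 over the UID, hex encoded.
    digest = hashlib.sha1(shoot_uid.encode("utf-8")).hexdigest()
    # Assumption: the parts are joined with a double hyphen.
    return f"{seed_namespace}--{digest[:5]}"


def backup_namespace(infra_name: str) -> str:
    """The backup namespace is the BackupInfra name with a backup-- prefix."""
    return f"backup--{infra_name}"
```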

Proposal

With the Gardener extension proposal in mind, the backup infrastructure controller can be divided into two parts. There will basically be four backup infrastructure related CRDs: two on the garden apiserver and two on the seed cluster. Before going into the workflow, let’s first have a look at the CRDs.

CRD on Garden cluster

In brief, before going into the details: we stick to the principle that the Garden apiserver is always the source of truth. Since the backupInfra will be maintained after the deletion of the shoot, the information about it should always come from the garden apiserver. We will therefore continue to have the BackupInfra resource on the garden apiserver, with some modifications.

apiVersion: garden.cloud/v1alpha1
kind: BackupBucket
metadata:
  name: packet-region1-uid[:5]
  # No namespace needed. This will be cluster scope resource.
  ownerReferences:
  - kind: CloudProfile
    name: packet
spec:
  provider: aws
  region: eu-west-1
  secretRef: # Required for root
    name: backup-operator-aws
    namespace: garden
status:
  lastOperation: ...
  observedGeneration: ...
  seed: ...
---
apiVersion: garden.cloud/v1alpha1
kind: BackupEntry
metadata:
  name: shoot--dev--example--3ef42 # Naming convention explained before
  namespace: garden-dev
  ownerReferences:
  - apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: false
    controller: true
    kind: Shoot
    name: example
    uid: 19a9538b-5058-11e9-b5a6-5e696cab3bc8
spec:
  shootUID: 19a9538b-5058-11e9-b5a6-5e696cab3bc8 # Just for reference to find back associated shoot.
  # Following section comes from cloudProfile or seed yaml based on granularity decision.
  bucketName: packet-region1-uid[:5]
status:
  lastOperation: ...
  observedGeneration: ...
  seed: ...

CRD on Seed cluster

Considering the extension proposal, we want each individual component to be handled by a controller inside the seed cluster. We will therefore have Backup related resources in the registered seed cluster as well.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupBucket
metadata:
  name: packet-random[:5]
  # No namespace needed. This will be a cluster-scoped resource.
spec:
  type: aws
  region: eu-west-1
  secretRef:
    name: backup-operator-aws
    namespace: backup-garden
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

There are two reasons for introducing the BackupEntry resource:

  1. Cloud provider specific code goes completely into the seed cluster.
  2. Network issues are also handled by moving the deletion part to the backup-extension-controller in the seed cluster.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupEntry
metadata:
  name: shoot--dev--example--3ef42 # Naming convention explained later
  # No namespace needed. This will be a cluster-scoped resource.
spec:
  type: aws
  region: eu-west-1
  secretRef: # Required for root
    name: backup-operator-aws
    namespace: backup-garden
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...

Workflow

  • The Gardener administrator will configure the cloudProfile with backup infra credentials and provider config as follows:
# CloudProfile.yaml:
Spec:
    backup:
        provider: aws
        region: eu-west-1
        secretRef:
            name: backup-operator-aws
            namespace: garden

Here, the CloudProfileController will interpret this spec as follows:

  • If spec.backup is nil,
    • No backup is taken for any shoot.
  • If spec.backup.region is not nil,
    • Then respect it, i.e. use the provider and the unique region field mentioned there for the BackupBucket.
    • Preferably, the spec.backup.region field will be unique, since anything else doesn’t make much sense in the cross-provider case: region names differ between providers.
  • Otherwise, if spec.backup.region is nil,
    • In the same-provider case, i.e. spec.backup.provider = spec.(type-of-provider) or nil,
      • Create a BackupBucket instance for each region from spec.(type-of-provider).constraints.regions. This can be done lazily, i.e. a BackupBucket instance is created for a region only if some seed actually running in that region has been registered. This avoids creating an IaaS bucket for a region that is listed in the cloudprofile but has no registered seed.
      • The Shoot controller will choose the backup container as per the seed region. (With shoot control plane migration, the seed’s availability zone might change, but the region will remain the same as per the current scope.)
    • Otherwise, in the cross-provider case, i.e. spec.backup.provider != spec.(type-of-provider),
      • Report a validation error: for example, we can’t expect spec.backup.provider = aws to support a region listed in spec.packet.constraint.region, where type-of-provider is packet.
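
The decision tree above can be written down compactly. This is only a sketch under our reading of the rules: the function name and its parameters are hypothetical, and the CloudProfile is reduced to its provider type, its region list, and the regions of the registered seeds.

```python
from typing import Iterable, List, Optional, Tuple


def backup_bucket_targets(backup: Optional[dict],
                          profile_provider: str,
                          profile_regions: Iterable[str],
                          seed_regions: Iterable[str]) -> List[Tuple[str, str]]:
    """Return the (provider, region) pairs that need a BackupBucket."""
    if backup is None:
        return []  # spec.backup is nil: no backup for any shoot
    provider = backup.get("provider") or profile_provider
    region = backup.get("region")
    if region is not None:
        return [(provider, region)]  # an explicit region is respected
    if provider != profile_provider:
        # Cross-provider case without a region: validation error.
        raise ValueError("cross-provider backup requires spec.backup.region")
    used = set(seed_regions)
    # Lazy creation: only regions in which a seed is actually registered.
    return [(provider, r) for r in profile_regions if r in used]
```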

The following diagram represents the overall flow in detail:

sequence-diagram

Reconciliation

Reconciliation of the backup entry in the seed cluster mostly comes into the picture at the time of deletion. But we can add initialization steps like the creation of a shoot-specific directory in the backup bucket. We could also simply create the BackupEntry at the time of shoot deletion.

Deletion

  • On shoot deletion, the BackupEntry instance, i.e. the shoot-specific instance, will get a deletion timestamp because of the ownerReference.
  • If the deletionGracePeriod configured in the GCM component configuration has expired, the BackupInfrastructure Controller will delete the backup folder associated with it from the backup object store.
  • Finally, it will remove the finalizer from the BackupEntry instance.
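
The deletion steps could be sketched as follows. The function name, the `delete_folder` callback, and the finalizer name passed in are hypothetical, and using the entry name as the folder name inside the bucket is an assumption.

```python
from datetime import datetime, timedelta
from typing import Callable


def reconcile_deleted_entry(entry: dict,
                            grace_period: timedelta,
                            now: datetime,
                            finalizer: str,
                            delete_folder: Callable[[str, str], None]) -> bool:
    """Delete the backup folder and drop the finalizer after the grace period."""
    deleted_at = entry["metadata"].get("deletionTimestamp")
    if deleted_at is None or now < deleted_at + grace_period:
        return False  # not deleted yet, or still within the grace period
    # Assumption: the shoot's folder in the bucket is named like the entry.
    delete_folder(entry["spec"]["bucketName"], entry["metadata"]["name"])
    finalizers = entry["metadata"].get("finalizers", [])
    entry["metadata"]["finalizers"] = [f for f in finalizers if f != finalizer]
    return True
```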

Alternative

sequence-diagram

Discussion points / variations

Manual vs dynamic bucket creation

  • As per the limits observed on different cloud providers, we can have a single bucket for backups per cloud provider. So, we could avoid the little complexity introduced in the above approach by pre-provisioning buckets as part of the landscape setup. But then nothing would detect whether the bucket exists or reconcile it. Ideally this should be avoided.

  • Another option is to let the administrator register a pool of root backup infra resources and let the controller schedule backups on one of them.

  • One more variation could be to create buckets dynamically per hash of the shoot UID.

SDK vs Terraform

The initial reason for choosing Terraform scripts was their stability and the parallelism/concurrency they provide in resource creation. For backup infrastructure, the Terraform scripts are very minimal right now: they simply contain the bucket creation script. With shared bucket logic, we might want to isolate access at directory level if possible, but again that is just one additional call. So, we prefer switching to the SDK for all object store operations.

Limiting the number of shoots per bucket

Again, as per the limits observed on different cloud providers, we can have a single bucket for backups per cloud provider. But if we want to limit the number of shoots associated with a bucket, we can have a central configuration map in gardener-controller-component-configuration.yaml in which we mark the supported count of shoots per cloud provider. The most probable place would be controller.backupInfrastructures.quota. If the limit is reached, we can create a new BackupBucket instance.

e.g.

apiVersion: controllermanager.config.gardener.cloud/v1alpha1
kind: ControllerManagerConfiguration
controllers:
  backupInfrastructure:
    quota:
      - provider: aws
        limit: 100 # Number mentioned here are random, just for example purpose.
      - provider: azure
        limit: 80
      - provider: openstack
        limit: 100
      ...
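
How such a quota could drive bucket assignment is sketched below. The function name, the bookkeeping fields, and the generated bucket names are hypothetical; the quota is passed as a simple provider-to-limit mapping derived from a configuration like the one above.

```python
from typing import Dict, List


def assign_bucket(buckets: List[dict], provider: str, quota: Dict[str, int]) -> dict:
    """Pick a bucket of the provider with free capacity, or add a new one."""
    limit = quota[provider]
    for bucket in buckets:
        if bucket["provider"] == provider and bucket["shoots"] < limit:
            bucket["shoots"] += 1
            return bucket
    # Limit reached on all existing buckets: create a new bucket record.
    index = sum(1 for b in buckets if b["provider"] == provider)
    bucket = {"name": f"{provider}-backup-{index}", "provider": provider, "shoots": 1}
    buckets.append(bucket)
    return bucket
```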

Backward compatibility

Migration

  • Create shoot specific folder.
  • Transfer old objects.
  • Create manifest of objects on new bucket
    • Each entry will have a status: None, Copied, or NotFound.
    • Copy objects one by one.
  • Scale down etcd-main with old config. ⚠️ Cluster down time
  • Copy remaining objects
  • Scale up etcd-main with new config.
  • Destroy Old bucket and old backup namespace. It can be immediate or preferably lazy deletion.
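
The manifest-driven copy could look roughly like this. Both object stores are modelled as in-memory dictionaries and the function name is hypothetical; a real implementation would call the provider SDK instead. Running the function a second time after etcd-main has been scaled down picks up the remaining objects.

```python
from typing import Dict


def copy_pending(old: Dict[str, bytes], new: Dict[str, bytes],
                 manifest: Dict[str, str], prefix: str) -> None:
    """Copy every object not yet handled and record the outcome in the manifest."""
    for key in old:
        manifest.setdefault(key, "None")  # objects written since the last pass
    for key, status in manifest.items():
        if status != "None":
            continue  # already Copied or NotFound
        if key in old:
            new[f"{prefix}/{key}"] = old[key]  # copy into the shoot-specific folder
            manifest[key] = "Copied"
        else:
            manifest[key] = "NotFound"
```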

backup-migration-sequence-diagram

Legacy Mode alternative

  • If Backup namespace present in seed cluster, then follow the legacy approach.
  • i.e. reconcile creation/existence of shoot specific bucket and backup namespace.
  • If backup namespace is not created, use shared bucket.
  • Limitation: We never know when the existing clusters will be deleted, and hence this mode might be a little difficult to maintain with the next releases of Gardener. This might look simple and straightforward for now but may become a pain point in the future if, in the worst case, new use cases or refactoring force us to change the design again. Also, even after multiple Gardener releases, we won’t be able to remove the deprecated existing BackupInfrastructure CRD.

References

3 - 03 Networking Extensibility

Network Extensibility

Currently Gardener follows a mono network-plugin support model (i.e., Calico). Although this can seem to be the more stable approach, it does not completely reflect the real use-case. This proposal brings forth an effort to add an extra level of customizability to Gardener networking.

Motivation

Gardener is an open-source project that provides a nested user model. Basically, there are two types of services provided by Gardener to its users:

  • Managed: users only request a Kubernetes cluster (Clusters-as-a-Service)
  • Hosted: users utilize Gardener to provide their own managed version of Kubernetes (Cluster-Provisioner-as-a-service)

For the first set of users, the choice of network plugin might not be so important. However, for the second class of users (i.e., Hosted) it is important to be able to customize networking based on their needs.

Furthermore, Gardener provisions clusters on different cloud-providers with different networking requirements. For example, Azure does not support Calico Networking [1]. This leads to the introduction of manual exceptions in static add-on charts, which is error-prone and can lead to failures during upgrades.

Finally, every provider is different, and thus the network always needs to adapt to the infrastructure needs to provide better performance. Consistency does not necessarily lie in the implementation but in the interface.

Gardener Network Extension

The goal of the Gardener Network Extensions is to support different network plugins; therefore, the specification for the network resource won’t be fixed and will be customized based on the underlying network plugin. To do so, a NetworkConfig field will be provided in the spec, which each plugin will define. Below is an example for deploying Calico as the cluster network plugin.

Long Term Spec

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Network
metadata:
  name: calico-network
  namespace: shoot--core--test-01
spec:
  type: calico
  clusterCIDR: 192.168.0.0/24
  serviceCIDR:  10.96.0.0/24
  providerConfig:
    apiVersion: calico.extensions.gardener.cloud/v1alpha1
    kind: NetworkConfig
    ipam:
      type: host-local
      cidr: usePodCIDR
    backend: bird
    typha:
      enabled: true
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  providerStatus:
    apiVersion: calico.extensions.gardener.cloud/v1alpha1
    kind: NetworkStatus
    components:
      kubeControllers: true
      calicoNodes: true
    connectivityTests:
      pods: true
      services: true
    networkModules:
      arp_proxy: true
    config:
      clusterCIDR: 192.168.0.0/24
      serviceCIDR:  10.96.0.0/24
      ipam:
        type: host-local
        cidr: usePodCIDR

First Implementation (Short Term)

As an initial implementation the network plugin type will be specified by the user e.g., Calico (without further configuration in the provider spec). This will then be used to generate the Network resource in the seed. The Network operator will pick it up, and apply the configuration based on the spec.cloudProvider specified directly to the shoot or via the Gardener resource manager (still in the works).

The cloudProvider field in the spec is just an initial catalyst and is not meant to stay long-term. In the future, the network provider configuration will be customized to best match the needs of the infrastructure.

Here is what the simplified initial spec would look like:

---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Network
metadata:
  name: calico-network
  namespace: shoot--core--test-01
spec:
  type: calico
  cloudProvider: {aws,azure,...}
status:
  observedGeneration: 2
  lastOperation: ...
  lastError: ...

Functionality

The network resource needs to be created early on during cluster provisioning. Once created, the Network operator residing in every seed will create all the necessary networking resources and apply them to the shoot cluster.

The status of the Network resource should reflect the health of the networking components as well as additional tests if required.

References

[1] Azure support for Calico Networking

4 - 05 Versioning Policy

Gardener Versioning Policy

Please refer to this document for the documentation of the implementation of this GEP.

Goal

  • As a Garden operator I would like to define a clear Kubernetes version policy, which informs my users about deprecated or expired Kubernetes versions.
  • As a user of Gardener, I would like to know which Kubernetes version is supported for how long. I want to be able to get this information via API (cloudprofile) and also in the Dashboard.

Motivation

The Kubernetes community releases minor versions roughly every three months and usually maintains three minor versions (the current and the previous two) with bug fixes and security updates. Patch releases are done more frequently. Operators of Gardener should be able to define their own Kubernetes version policy. This GEP suggests the possibility for operators to classify Kubernetes versions while they are going through their “maintenance life-cycle”.

Kubernetes Version Classifications

An operator should be able to classify Kubernetes versions differently while they go through their “maintenance life-cycle”, starting with preview, supported, deprecated, and finally expired. This information should be programmatically available in the cloudprofiles of the Garden cluster as well as in the Dashboard. Please also note, that Gardener keeps the control plane and the workers on the same Kubernetes version.

For further explanation of the possible classifications, we assume that an operator wants to support four minor versions e.g. v1.16, v1.15, v1.14 and v1.13.

  • preview: After a fresh release of a new Kubernetes minor version (e.g. v1.17.0) the operator could tag it as preview until he has gained sufficient experience. It will not become the default in the Gardener Dashboard until he promotes that minor version to supported, which could happen a few weeks later with the first patch version.

  • supported: The operator would tag the latest Kubernetes patch versions of the current (if not still in preview) and the last three minor Kubernetes versions as supported (e.g. v1.16.1, v1.15.4, v1.14.9 and v1.13.12). The latest of these becomes the default in the Gardener Dashboard (e.g. v1.16.1).

  • deprecated: The operator could decide that he generally wants to classify every version that is not the latest patch version as deprecated and flag these versions accordingly (e.g. v1.16.0 and older, v1.15.3 and older, v1.14.8 and older as well as v1.13.11 and older). He could also tag all versions (latest or not) of every Kubernetes minor release that is neither the current nor one of the last three minor Kubernetes versions as deprecated, too (e.g. v1.12.x and older). Deprecated versions will eventually expire (i.e., be removed).

  • expired: This state is a logical state only. It doesn’t have to be maintained in the cloudprofile. All cluster versions whose expirationDate as defined in the cloudprofile has passed are automatically in this logical state. After that date, users cannot create new clusters with that version anymore, and any cluster that is on that version will be forcefully migrated in its next maintenance time window, even if the owner has opted out of automatic cluster updates! The forceful update will pick the latest patch version of the current minor Kubernetes version. If the cluster was already on that latest patch version and the latest patch version is also expired, it will continue with the latest patch version of the next minor Kubernetes version. This results in an update of the minor Kubernetes version, which is potentially harmful to your workload, so you should avoid that and plan ahead! If that version is expired as well, the update process repeats until a non-expired Kubernetes version is reached. Depending on the circumstances described above, the cluster can therefore receive multiple consecutive minor Kubernetes version updates!

To fulfill his specific versioning policy, the Garden operator should be able to classify his versions as well as set the expiration date in the cloudprofiles. The user should see these classifiers as well as the expiration date in the dashboard.
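
The forceful update rule for expired versions can be sketched as follows. This is only an illustration: the kubeVersion type, the date helper, and the function names are hypothetical, the major version is assumed to be 1, and the repeated update steps described above are collapsed into a single lookup of the final target version.

```go
package main

import (
	"fmt"
	"time"
)

// kubeVersion is a hypothetical, simplified cloudprofile entry for v1.<Minor>.<Patch>.
type kubeVersion struct {
	Minor, Patch   int
	ExpirationDate *time.Time // nil: no expiration date set
}

// date is a small helper that builds a UTC date for the example.
func date(year int, month time.Month, day int) *time.Time {
	t := time.Date(year, month, day, 0, 0, 0, 0, time.UTC)
	return &t
}

// expired reports whether the expiration date of the version has passed.
func (v kubeVersion) expired(now time.Time) bool {
	return v.ExpirationDate != nil && !now.Before(*v.ExpirationDate)
}

// latestPatch returns the highest patch version offered for the given minor version.
func latestPatch(versions []kubeVersion, minor int) (kubeVersion, bool) {
	var best kubeVersion
	found := false
	for _, v := range versions {
		if v.Minor == minor && (!found || v.Patch > best.Patch) {
			best, found = v, true
		}
	}
	return best, found
}

// forcedUpdateTarget returns the version a cluster ends up on: the latest patch
// of its current minor version, or of the next minor versions if that is expired too.
func forcedUpdateTarget(versions []kubeVersion, current kubeVersion, now time.Time) (kubeVersion, bool) {
	if !current.expired(now) {
		return current, true // nothing to force
	}
	for minor := current.Minor; ; minor++ {
		candidate, ok := latestPatch(versions, minor)
		if !ok {
			return kubeVersion{}, false // no further minor version is offered
		}
		if !candidate.expired(now) {
			return candidate, true
		}
	}
}

func main() {
	past := date(2020, 1, 1)
	versions := []kubeVersion{
		{Minor: 15, Patch: 3, ExpirationDate: past},
		{Minor: 15, Patch: 4, ExpirationDate: past},
		{Minor: 16, Patch: 0, ExpirationDate: past},
		{Minor: 16, Patch: 1},
	}
	target, ok := forcedUpdateTarget(versions, versions[0], *date(2020, 6, 1))
	fmt.Println(target.Minor, target.Patch, ok)
}
```

In this example the cluster on v1.15.3 skips v1.15.4 because it is expired as well, which is exactly the situation in which a forced minor version update occurs.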

5 - 06 Etcd Druid

Integrating etcd-druid with Gardener

Etcd is currently deployed by garden-controller-manager as a Statefulset. The sidecar container spec contains details pertaining to the cloud-provider object store, which are injected into the statefulset via a mutating webhook running as part of the gardener extension story. This approach restricts operations on etcd such as scale-up and upgrade. Etcd-druid will eliminate the need to hijack statefulset creation to add cloud-provider details. It has been designed to provide fine-grained control over the procedure of deploying and maintaining etcd. The roadmap for etcd-druid can be found here.

This document explains how Gardener deploys etcd and what resources it creates for etcd-druid to deploy an etcd cluster.

Resources required by etcd-druid (created by Gardener)

  • Secret containing credentials to access backup bucket in Cloud provider object store.
  • TLS server and client secrets for etcd and backup-sidecar
  • Etcd CRD resource that contains parameters pertaining to etcd, backup-sidecar and cloud-provider object store.

When an etcd resource is created in the cluster, the druid acts on it by creating an etcd statefulset, a service and a configmap containing the etcd bootstrap script. The secrets containing the infrastructure credentials and the TLS certificates are mounted as volumes. If no backup secret or backup information is provided, etcd data backups are not taken; only data corruption checks are performed prior to starting etcd.

Garden-controller-manager, being cloud agnostic, deploys the etcd resource. This will not contain any cloud-specific information other than the cloud-provider. The extension controller that contains the cloud-specific implementation will create the backup bucket if needed and create a secret containing the credentials to access the bucket. The etcd backup secret name should be exposed in the BackupEntry status. Then, Gardener can read it and write it into the etcd resource. The secret will have to be made available in the namespace in which the etcd statefulset will be deployed. If etcd and backup-sidecar communicate over TLS, then the CA certificates, server and client certificates, and keys will also have to be made available in that namespace. The etcd resource will reference the aforementioned secrets. etcd-druid will deploy the statefulset only if the secrets are available.

Workflow

  • etcd-druid will be deployed and etcd CRD will be created as part of the seed bootstrap.
  • Garden-controller-manager creates backupBucket extension resource. Extension controller creates the backup bucket associated with the seed.
  • Garden-controller-manager creates backupentry associated with each shoot in the seed namespace.
  • Garden-controller-manager creates etcd resource with secretRefs and etcd information populated appropriately.
  • etcd-druid acts on the etcd resource; druid creates the statefulset, the service and the configmap.

etcd-druid

6 - 07 Shoot Control Plane Migration

Shoot Control Plane Migration

Motivation

Currently moving the control plane of a shoot cluster can only be done manually and requires deep knowledge of how exactly to transfer the resources and state from one seed to another. This can make it slow and prone to errors.

Automatic migration can be very useful in a couple of scenarios:

  • Seed goes down and can’t be repaired (fast enough or at all) and its control planes need to be brought to another seed
  • Seed needs to be changed, but this operation requires the recreation of the seed (e.g. turn a single-AZ seed into a multi-AZ seed)
  • Seeds need to be rebalanced
  • New seeds become available in a region closer to/in the region of the workers and the control plane should be moved there to improve latency
  • Gardener ring, which is a self-supporting setup/underlay for a highly available (usually cross-region) Gardener deployment

Goals

  • Provide a mechanism to migrate the control plane of a shoot cluster from one seed to another
  • The mechanism should support migration from a seed which is no longer reachable (Disaster Recovery)
  • The shoot cluster nodes are preserved and continue to run the workload, but will talk to the new control plane after the migration completes
  • Extension controllers implement a mechanism which allows them to store their state or to be restored from an already existing state on a different seed cluster.
  • The already existing shoot reconciliation flow is reused for migration with minimum changes

Terminology

Source Seed is the seed which currently hosts the control plane of a Shoot Cluster

Destination Seed is the seed to which the control plane is being migrated

Resources and controller state which have to be migrated between two seeds:

Note: The following lists are just FYI and are meant to show the current resources which need to be moved to the Destination Seed

Secrets

Gardener has preconfigured lists of needed secrets which are generated when a shoot is created and deployed in the seed. Following is a minimum set of secrets which must be migrated to the Destination Seed. Other secrets can be regenerated from them.

  • ca
  • ca-front-proxy
  • static-token
  • ca-kubelet
  • ca-metrics-server
  • etcd-encryption-secret
  • kube-aggregator
  • kube-apiserver-basic-auth
  • kube-apiserver
  • service-account-key
  • ssh-keypair

Custom Resources and state of extension controllers

Gardenlet deploys custom resources in the Source Seed cluster during shoot reconciliation which are reconciled by extension controllers. The state of these controllers and any additional resources they create is independent of the gardenlet and must also be migrated to the Destination Seed. Following is a list of custom resources, and the state which is generated by them that has to be migrated.

  • BackupBucket: nothing relevant for migration
  • BackupEntry: nothing relevant for migration
  • ControlPlane: nothing relevant for migration
  • DNSProvider/DNSEntry: nothing relevant for migration
  • Extensions: migration of state needs to be handled individually
  • Infrastructure: terraform state
  • Network: nothing relevant for migration
  • OperatingSystemConfig: nothing relevant for migration
  • Worker: Machine-Controller-Manager related objects: machineclasses, machinedeployments, machinesets, machines

This list depends on the currently installed extensions and can change in the future

Proposal

Custom Resource on the garden cluster

The Garden cluster has a new Custom Resource which is stored in the project namespace of the Shoot called ShootState. It contains all the required data described above so that the control plane can be recreated on the Destination Seed.

This data is separated into two sections. The first is generated by the gardenlet and then either used to generate new resources (e.g. secrets) or directly deployed to the Shoot’s control plane on the Destination Seed.

The second is generated by the extension controllers in the seed.

apiVersion: core.gardener.cloud/v1alpha1
kind: ShootState
metadata:
  name: my-shoot
  namespace: garden-core
  ownerReferences:
  - apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Shoot
    name: my-shoot
    uid: ...
  finalizers:
  - gardener
gardenlet:
  secrets:
  - name: ca
    data:
      ca.crt: ...
      ca.key: ...
  - name: ssh-keypair
    data:
      id_rsa: ...
  - name:
...
extensions:
- kind: Infrastructure
  state: ... (Terraform state)
- kind: ControlPlane
  purpose: normal
  state: ... (Certificates generated by the extension)
- kind: Worker
  state: ... (Machine objects)

The state data is saved as a runtime.RawExtension type, which can be encoded/decoded by the corresponding extension controller.

There can be sensitive data in the ShootState which has to be hidden from the end-users. Hence, it is recommended to provide an etcd encryption configuration to the Gardener API server in order to encrypt the ShootState resource.

Size limitations

There are limits on the size of the request bodies sent to the Kubernetes API server when creating or updating resources:

  • By default, ETCD can only accept request bodies which do not exceed 1.5 MiB (this can be configured with the --max-request-bytes flag).
  • The Kubernetes API server has a request body limit of 3 MiB which cannot be set from the outside (with a command line flag).
  • The gRPC configuration used by the API server to talk to ETCD has a limit of 2 MiB per request body which cannot be configured from the outside.
  • Watch requests have a 16 MiB limit on the buffer used to stream resources.

This means that if ShootState is bigger than 1.5 MiB, the ETCD max request bytes will have to be increased. However, there is still an upper limit of 2 MiB imposed by the gRPC configuration.

If ShootState exceeds this size limitation it must make use of configmap/secret references to store the state of extension controllers. This is an implementation detail of Gardener and can be done at a later time if necessary as extensions will not be affected.

Splitting the ShootState into multiple resources could have a positive benefit on performance as the Gardener API Server and Gardener Controller Manager would handle multiple small resources instead of one big resource.
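
The consequences of these limits can be summarized in a small sketch. The constants are taken from the limits stated above; the function name, the verdict type, and the assumption that the serialized ShootState size approximates the request body size are made up for illustration.

```go
package main

import "fmt"

const (
	mib = 1 << 20

	etcdDefaultMaxRequestBytes = mib + mib/2 // 1.5 MiB, adjustable via --max-request-bytes
	grpcMaxRequestBytes        = 2 * mib     // fixed limit between API server and ETCD
)

// sizeVerdict describes what has to happen for a ShootState of a given size.
type sizeVerdict string

const (
	fitsDefaults  sizeVerdict = "fits the default limits"
	needsEtcdFlag sizeVerdict = "requires a higher --max-request-bytes"
	mustBeSplit   sizeVerdict = "must use configmap/secret references"
)

// classifyShootStateSize applies the limits described in this section to the
// serialized size of a ShootState (assumed to approximate the request body).
func classifyShootStateSize(sizeBytes int) sizeVerdict {
	switch {
	case sizeBytes <= etcdDefaultMaxRequestBytes:
		return fitsDefaults
	case sizeBytes <= grpcMaxRequestBytes:
		return needsEtcdFlag
	default:
		return mustBeSplit
	}
}

func main() {
	for _, size := range []int{512 * 1024, 1800000, 3 * mib} {
		fmt.Printf("%d bytes: %s\n", size, classifyShootStateSize(size))
	}
}
```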

Gardener extensions changes

All extension controllers which require state migration must save their state in a new status.state field and act on an annotation gardener.cloud/operation=restore in the respective Custom Resources which should trigger a restoration operation instead of reconciliation. A restoration operation means that the extension has to restore its state in the Shoot’s namespace on the Destination Seed from the status.state field.

As an example: the Infrastructure resource must save the terraform state.

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foo--bar
spec:
  type: azure
  region: eu-west-1
  secretRef:
    name: cloudprovider
    namespace: shoot--foo--bar
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    resourceGroup:
      name: mygroup
    networks:
      vnet: # specify either 'name' or 'cidr'
        # name: my-vnet
        cidr: 10.250.0.0/16
      workers: 10.250.0.0/19
status:
  state: |
      {
          "version": 3,
          "terraform_version": "0.11.14",
          "serial": 2,
          "lineage": "3a1e2faa-e7b6-f5f0-5043-368dd8ea6c10",
          "modules": [
              {
              }
          ]
          ...
      }

Extensions which do not require state migration should set status.state=nil in their Custom Resources and trigger a normal reconciliation operation if the CR contains the gardener.cloud/operation=restore annotation.

Similar to the contract for the reconcile operation, the extension controller has to remove the restore annotation after the restoration operation has finished.

An additional annotation gardener.cloud/operation=migrate is added to the Custom Resources. It is used to tell the extension controllers in the Source Seed that they must stop reconciling resources (in case they are requeued due to errors) and should perform cleanup activities in the Shoot’s control plane. These cleanup activities involve removing the finalizers on Custom Resources and deleting them without actually deleting any infrastructure resources.

Note: The same size limitations from the previous section are relevant here as well.
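
The annotation contract described in this section can be sketched as a small decision function. The annotation key and its values come from the text above; the action type and the function names are hypothetical and only serve as an illustration of how an extension controller might branch.

```go
package main

import "fmt"

// operationAnnotation is the annotation key used in this section.
const operationAnnotation = "gardener.cloud/operation"

// action is what an extension controller does for one Custom Resource.
type action string

const (
	actionReconcile action = "reconcile"
	actionRestore   action = "restore"
	actionMigrate   action = "migrate"
)

// decideAction maps the operation annotation to the controller's action.
// needsState reports whether this extension keeps state in status.state.
func decideAction(annotations map[string]string, needsState bool) action {
	switch annotations[operationAnnotation] {
	case "migrate":
		return actionMigrate // stop reconciling and clean up in the Source Seed
	case "restore":
		if needsState {
			return actionRestore // rebuild state from status.state
		}
		return actionReconcile // stateless extensions reconcile normally
	default:
		return actionReconcile
	}
}

// finishRestore removes the restore annotation once the operation has finished.
func finishRestore(annotations map[string]string) {
	if annotations[operationAnnotation] == "restore" {
		delete(annotations, operationAnnotation)
	}
}

func main() {
	annotations := map[string]string{operationAnnotation: "restore"}
	fmt.Println(decideAction(annotations, true))
	finishRestore(annotations)
	fmt.Println(decideAction(annotations, true))
}
```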

Shoot reconciliation flow changes

The only data which must be stored in the ShootState by the gardenlet is secrets (e.g. ca for the API server). Therefore, the botanist.DeploySecrets step is changed. It is split into two functions which take a list of secrets that have to be generated.

  • botanist.GenerateSecretState: generates certificate authorities and other secrets which have to be persisted in the ShootState and must not be regenerated on the Destination Seed.
  • botanist.DeploySecrets: takes secret data from the ShootState, generates new ones (e.g. client TLS certificates from the saved certificate authorities) and deploys everything in the Shoot’s control plane on the Destination Seed.

ShootState synchronization controller

The ShootState synchronization controller will become part of the gardenlet. It syncs the state of extension custom resources from the shoot namespace to the garden cluster and updates the corresponding spec.extension.state field in the ShootState resource. The controller can watch Custom Resources used by the extensions and update the ShootState only when changes occur.
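
A minimal sketch of the "update only when changes occur" behaviour is shown below. The extensionState type is a simplified, hypothetical stand-in for one entry of the extensions list in the ShootState, and syncExtensionState is an illustrative helper, not an existing Gardener function.

```go
package main

import (
	"bytes"
	"fmt"
)

// extensionState is a simplified stand-in for one extensions entry of the
// ShootState; in the proposal the state is a runtime.RawExtension.
type extensionState struct {
	Kind    string
	Purpose string
	State   []byte
}

// syncExtensionState inserts or updates the entry for kind/purpose and reports
// whether anything changed, so the ShootState is only written on real changes.
func syncExtensionState(states []extensionState, kind, purpose string, state []byte) ([]extensionState, bool) {
	for i := range states {
		if states[i].Kind == kind && states[i].Purpose == purpose {
			if bytes.Equal(states[i].State, state) {
				return states, false // nothing to update
			}
			states[i].State = state
			return states, true
		}
	}
	return append(states, extensionState{Kind: kind, Purpose: purpose, State: state}), true
}

func main() {
	var states []extensionState
	var changed bool

	states, changed = syncExtensionState(states, "Infrastructure", "", []byte("serial 1"))
	fmt.Println(len(states), changed)

	states, changed = syncExtensionState(states, "Infrastructure", "", []byte("serial 1"))
	fmt.Println(len(states), changed)
}
```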

Migration workflow

  1. Starting migration
    • Migration can only be started after a Shoot cluster has been successfully created so that the status.seed field in the Shoot resource has been set
    • The Shoot resource’s field spec.seedName="new-seed" is edited to hold the name of the Destination Seed and reconciliation is automatically triggered
    • The Garden Controller Manager checks whether spec.seedName and status.seed are equal, detects that they differ, and triggers migration.
  2. The Garden Controller Manager waits for the Destination Seed to be ready.
  3. The Shoot’s API server is stopped.
  4. The Shoot’s ETCD is backed up.
  5. Extension resources in the Source Seed are annotated with gardener.cloud/operation=migrate.
  6. The Shoot’s control plane in the Source Seed is scaled down.
  7. The gardenlet in the Destination Seed fetches the state of extension resources from the ShootState resource in the garden cluster.
  8. Normal reconciliation flow is resumed in the Destination Seed. Extension resources are annotated with gardener.cloud/operation=restore to instruct the extension controllers to reconstruct their state.
  9. The Shoot’s namespace in Source Seed is deleted.
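
The trigger condition from step 1 can be expressed in a few lines. The shoot type below is a hypothetical, flattened view of the two fields involved and is not the actual Gardener API type.

```go
package main

import "fmt"

// shoot is a flattened, hypothetical view of the fields relevant for migration.
type shoot struct {
	SpecSeedName string // spec.seedName: where the control plane should run
	StatusSeed   string // status.seed: where the control plane currently runs
}

// needsMigration reports whether a control plane migration has to be triggered.
func needsMigration(s shoot) bool {
	if s.StatusSeed == "" {
		return false // the Shoot has not been created successfully yet
	}
	return s.SpecSeedName != "" && s.SpecSeedName != s.StatusSeed
}

func main() {
	fmt.Println(needsMigration(shoot{SpecSeedName: "new-seed", StatusSeed: "old-seed"}))
	fmt.Println(needsMigration(shoot{SpecSeedName: "old-seed", StatusSeed: "old-seed"}))
}
```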

7 - 09 Test Framework

Gardener integration test framework

Motivation

As we want to improve our code coverage in the coming months, we will need a simple and easy-to-use test framework. The current testframework already contains a lot of general test functions that ease the work of writing new tests. However, there are multiple disadvantages with the current structure of the tests and the testframework:

  1. Every new test is its own testsuite and therefore needs its own TestDef (https://github.com/gardener/gardener/tree/master/.test-defs). With this approach there will be hundreds of test definitions, growing with every new test (or at least new test suite). But in most cases new tests do not need their own special TestDef: it’s just the wrong scope for the testmachinery and will result in unnecessarily complex testruns and configurations. In addition, it would result in additional maintenance for a huge number of TestDefs.
  2. The testsuites currently have their own specific interface/configuration that they need in order to be executed correctly (see K8s Update test). Consequently, the configuration has to be defined in the testruns, which results in one step per test, each with its very own configuration. This means that the testmachinery cannot simply select testdefinitions by label. As the testmachinery cannot make use of its ability to run labeled tests (e.g. run all tests labeled default), the testflow size increases with every new test and the testruns have to be manually adjusted with every new test.
  3. The current gardener test framework contains multiple test operations where some are just used for specific tests (e.g. plant_operations) and some are more general (garden_operation). Also the functions offered by the operations vary in their specialization as some are really specific to just one test e.g. shoot test operation with WaitUntilGuestbookAppIsAvailable whereas others are more general like WaitUntilPodIsRunning.
    This structure makes it hard for developers to find commonly used functions and also hard to integrate as the common framework grows with specialized functions.

Goals

In order to clean the testframework, make it easier for new developers to write tests and easier to add and maintain test execution within the testmachinery, the following goals are defined:

  • Have a small number of test suites (gardener, shoots see test flavors) to only maintain a fixed number of testdefinitions.
  • Use ginkgo test labels (inspired by the k8s e2e tests) to differentiate test behavior, test execution and test importance.
  • Use standardized configuration for all tests (differ depending on the test suite) but provide better tooling to dynamically read additional configuration from configuration files like the cloudprofile.
  • Clean the testframework to only contain general functionality and keep specific functions inside the tests

Proposal

The proposed new test framework consists of the following changes to achieve the goals described above.

Test Flavors

Reducing the number of test definitions is done by combining the currently specified test suites into the following 3 general ones:

  • System test suite
    • e.g. create-shoot, delete-shoot, hibernate
    • need their own testdef because they have a special meaning in the context of the testmachinery
  • Gardener test suite
    • e.g. RBAC, scheduler
    • All tests that only need a gardener installation but no shoot cluster
    • Possible functions/environment:
      • New project for test suite (copy secret binding, cleanup)?
  • Shoot test suite
    • e.g. shoot app, network
    • Tests that require a running shoot
    • Possible functions:
      • Namespace per test
      • cleanup of ns

As inspired by the k8s e2e tests, test labels are used to differentiate the tests by their behavior, their execution and their importance. Using test labels means that tests are described using predefined labels in the test’s text (e.g. ginkgo.It("[BETA] this is a test")). With this labeling strategy, the test properties are visible directly in the code, and promoting a test can be done via a pull request, which is then automatically recognized by the testmachinery with the next release.

Using ginkgo focus to only run desired tests and combined testsuites, an example test definition will look like the following.

kind: TestDefinition
metadata:
  name: gardener-beta-suite
spec:
  description: Test suite that runs all gardener tests that are labeled as beta
  activeDeadlineSeconds: 7200
  labels: ["gardener", "beta"]
  command: [bash, -c]
  args:
  - >-
    go test -timeout=0 -mod=vendor ./test/integration/suite
    --v -ginkgo.v -ginkgo.progress -ginkgo.no-color
    -ginkgo.focus="\[GARDENER\] \[BETA\]"

Using this approach, the overall number of testsuites is then reduced to a fixed number (excluding the system steps) of test suites * labelCombinations.

Framework

The new framework will consist of a common framework, a gardener framework (integrating the common framework) and a shoot framework (integrating the gardener framework).

All of these frameworks will have their own configuration that is exposed via command-line flags so that, for example, the shoot test framework can be executed by go test -timeout=0 -mod=vendor ./test/integration/suite --v -ginkgo.v -ginkgo.focus="\[SHOOT\]" --kubecfg=/path/to/config --shoot-name=xx.

The available test labels should be declared in the code with predefined values and in a predefined order so that everyone is aware of the possible labels and the tests are labeled consistently across all integration tests. This approach is somewhat similar to what Kubernetes is doing in its e2e test suite but with some more restrictions (compare example k8s e2e test).
A possible solution to have consistent labeling would be to define them with every new ginkgo.It definition: f.Beta().Flaky().It("my test"), which internally orders them and would produce a ginkgo test with the text: [BETA] [FLAKY] my test.
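
The ordering behaviour of such a label builder could look roughly like the sketch below. It only produces the test text and does not call ginkgo. The set of labels, their order, and the names testBuilder, newTestBuilder and Text are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// labelOrder is the predefined order in which labels are rendered. The set of
// labels and their order are assumptions made for this sketch.
var labelOrder = []string{"BETA", "RELEASE", "DEFAULT", "FLAKY"}

// testBuilder collects labels before the test text is produced.
type testBuilder struct {
	labels map[string]bool
}

func newTestBuilder() *testBuilder {
	return &testBuilder{labels: map[string]bool{}}
}

// with returns a copy carrying one more label, so builders can be reused safely.
func (b *testBuilder) with(label string) *testBuilder {
	next := newTestBuilder()
	for l := range b.labels {
		next.labels[l] = true
	}
	next.labels[label] = true
	return next
}

func (b *testBuilder) Beta() *testBuilder    { return b.with("BETA") }
func (b *testBuilder) Release() *testBuilder { return b.with("RELEASE") }
func (b *testBuilder) Default() *testBuilder { return b.with("DEFAULT") }
func (b *testBuilder) Flaky() *testBuilder   { return b.with("FLAKY") }

// Text renders the labels in the predefined order followed by the description.
// A real framework would pass this text on to ginkgo.It.
func (b *testBuilder) Text(description string) string {
	parts := make([]string, 0, len(b.labels)+1)
	for _, l := range labelOrder {
		if b.labels[l] {
			parts = append(parts, "["+l+"]")
		}
	}
	return strings.Join(append(parts, description), " ")
}

func main() {
	// The call order does not matter; the predefined order wins.
	fmt.Println(newTestBuilder().Flaky().Beta().Text("my test")) // prints "[BETA] [FLAKY] my test"
}
```

Because the order is fixed in one place, two tests with the same labels always end up with the same prefix, which keeps label-based selection predictable.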

General Functions

The test framework should include some general functions that can and will be reused by every test. These general functions may include:

  • Logging
  • State Dump
  • Detailed test output (status, duration, etc.)
  • Cleanup handling per test (It)
  • General easy-to-use functions like WaitUntilDeploymentCompleted, GetLogs, ExecCommand, AvailableCloudprofiles, etc.

Example

A possible test with the new test framework would look like:

var _ = ginkgo.Describe("Shoot network testing", func() {
  // the testframework registers some cleanup handling for a state dump on failure and maybe cleanup of created namespaces
  f := framework.NewShootFramework()
  f.CAfterEach(func(ctx context.Context) {
    ginkgo.By("cleanup network test daemonset")
    err := f.ShootClient.Client().Delete(ctx, &appsv1.DaemonSet{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace}})
    // a daemonset that is already gone is fine; any other error fails the cleanup
    if err != nil && !apierrors.IsNotFound(err) {
      Expect(err).NotTo(HaveOccurred())
    }
  }, FinalizationTimeout)
  f.Release().Default().CIt("should reach all webservers on all nodes", func(ctx context.Context) {
    ginkgo.By("Deploy the net test daemon set")
    templateFilepath := filepath.Join(f.ResourcesDir, "templates", nginxTemplateName)
    err := f.RenderAndDeployTemplate(f.Namespace(), templateFilepath)
    Expect(err).ToNot(HaveOccurred())
    err = f.WaitUntilDaemonSetIsRunning(ctx, f.ShootClient.Client(), name, namespace)
    Expect(err).NotTo(HaveOccurred())
    pods := &corev1.PodList{}
    err = f.ShootClient.Client().List(ctx, pods, client.MatchingLabels{"app": "net-nginx"})
    Expect(err).NotTo(HaveOccurred())
    // check if all webservers can be reached from all nodes
    ginkgo.By("test connectivity to webservers")
    shootRESTConfig := f.ShootClient.RESTConfig()
    var res error
    for _, from := range pods.Items {
      for _, to := range pods.Items {
        // test pods
        f.Logger.Infof("%s to %s: %s", from.GetName(), to.GetName(), data)
      }
    }
    Expect(res).ToNot(HaveOccurred())
  }, NetworkTestTimeout)
})

Future Plans

Ownership

As test coverage increases and more tests are added, we will need to track ownership of tests. At the beginning, ownership will be shared across all maintainers of the repository in which the tests reside, but this will no longer be suitable as the tests grow and get more complex.

Therefore, the test ownership should be tracked via subgroups (in Kubernetes this would be a SIG (comp. sig apps e2e test)). These subgroups will then be tracked via labels, and the members of these groups will be notified if tests fail.

8 - 10 Shoot Additional Container Runtimes

Gardener extensibility to support shoot additional container runtimes

Summary

Gardener-managed Kubernetes clusters are sometimes used to run sensitive workloads, some of which comprise OCI images originating from untrusted sources. Other use cases leverage economies of scale to run workloads for multiple tenants on the same cluster. In some cases, Gardener users want to use operating systems which do not easily support the Docker engine.

This proposal aims to allow Gardener Shoot clusters to use CRI instead of the legacy Docker API, and to provide an extension type for adding CRI shims (like GVisor and Kata Containers) which can be used to add support in Gardener Shoot clusters for these runtimes.

Motivation

While pods and containers are intended to create isolated areas for concurrently running workloads on nodes, this isolation is not as robust as could be expected. Containers leverage the core Linux CGroup and Namespace features to isolate workloads, and many kernel vulnerabilities have the potential to allow processes to escape from their isolation. Once a process has escaped from its container, any other process running on the same node is compromised. Several projects try to mitigate this problem; for example Kata Containers allow isolating a Kubernetes Pod in a micro-vm, gVisor reduces the kernel attack surface by adding another level of indirection between the actual payload and the real kernel.

Kubernetes supports running pods using these alternate runtimes via the RuntimeClass concept, which was promoted to Beta in Kubernetes 1.14. Once Kubernetes is configured to use the Container Runtime Interface to control pods, it becomes possible to leverage CRI and run specific pods using different Runtime Classes. Additionally, configuring Kubernetes to use CRI instead of the legacy Dockershim is faster.

The motivation behind this proposal is to make all of this functionality accessible to Shoot clusters managed by Gardener.

Goals

  • Gardener must allow configuring its managed clusters with the CRI interface instead of the legacy Dockershim.
  • Low-level runtimes like gVisor or Kata Containers are provided as gardener extensions which are (optionally) installed into a landscape by the Gardener operator. There must be no runtime-specific knowledge in the core Gardener code.
  • It shall be possible to configure multiple low-level runtimes in Shoot clusters, on the Worker Group level.

Proposal

Gardener today assumes that all supported operating systems have Docker pre-installed in the base image. Starting with Docker Engine 1.11, Docker itself was refactored and cleaned-up to be based on the containerd library. The first phase would be to allow the change of the Kubelet configuration as described here so that Kubernetes would use containerd instead of the default Dockershim. This will be implemented for CoreOS, Ubuntu, and SuSE-CHost.

We will implement two Gardener extensions, providing gVisor and Kata Containers as options for Gardener landscapes. The WorkerGroup specification will be extended to allow specifying the CRI name and a list of additional required Runtimes for nodes in that group. For example:

workers:
- name: worker-b8jg5
  machineType: m5.large
  volumeType: gp2
  volumeSize: 50Gi
  autoScalerMin: 1
  autoScalerMax: 2
  maxSurge: 1
  cri:
    name: containerd
    containerRuntimes:
    - type: gvisor
    - type: kata-containers
  machineImage:
    name: coreos
    version: 2135.6.0

Each extension will need to address the following concerns:

  1. Add the low-level runtime binaries to the worker nodes. Each extension should get the runtime binaries from a container.
  2. Hook the runtime binary into the containerd configuration file, so that the runtime becomes available to containerd.
  3. Apply a label to each node that allows identifying nodes where the runtime is available.
  4. Apply the relevant RuntimeClass to the Shoot cluster, to expose the functionality to users.
  5. Provide a separate binary with a ValidatingWebhook (deployable to the garden cluster) to catch invalid configurations. For example, Kata Containers on AWS requires a machineType of i3.metal, so any Shoot requests with a Kata Containers runtime and a different machine type on AWS should be rejected.
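
The validation rule from the last item can be sketched as a plain function. Only the Kata Containers on AWS rule comes from the text above; the worker type, its fields, and the function name are hypothetical, and a real webhook would wrap such a check in an admission handler.

```go
package main

import "fmt"

// worker is a hypothetical, reduced view of a worker group in the Shoot spec.
type worker struct {
	Name              string
	MachineType       string
	ContainerRuntimes []string // values of cri.containerRuntimes[].type
}

// validateKataWorker rejects worker groups that request Kata Containers on AWS
// with a machine type other than i3.metal.
func validateKataWorker(cloudProvider string, w worker) error {
	for _, rt := range w.ContainerRuntimes {
		if rt == "kata-containers" && cloudProvider == "aws" && w.MachineType != "i3.metal" {
			return fmt.Errorf("worker %q: kata-containers on aws requires machineType i3.metal, got %q",
				w.Name, w.MachineType)
		}
	}
	return nil
}

func main() {
	w := worker{
		Name:              "worker-b8jg5",
		MachineType:       "m5.large",
		ContainerRuntimes: []string{"gvisor", "kata-containers"},
	}
	fmt.Println(validateKataWorker("aws", w))

	w.MachineType = "i3.metal"
	fmt.Println(validateKataWorker("aws", w))
}
```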

Design Details

  1. Change the nodes’ container runtime to work with CRI and ContainerD (only if specified in the Shoot spec):

    1. In order to configure each worker machine in the cluster to work with CRI, the following configurations should be done:

      1. Add kubelet execution flags:
        1. --container-runtime=remote
        2. --container-runtime-endpoint=unix:///run/containerd/containerd.sock
      2. Make sure that the default containerd configuration file exists at /etc/containerd/config.toml.
    2. ContainerD and Docker configurations are different for each OS. To make sure the default configurations above work well on each worker machine, each OS extension would be responsible for configuring them during the reconciliation of the OperatingSystemConfig:

      1. os-ubuntu -
        1. Create a ContainerD unit drop-in to execute ContainerD with the default configuration file at /etc/containerd/config.toml.
        2. Create the container runtime metadata file with an OS path for binary installations: /usr/bin.
      2. os-coreos -
        1. Create a ContainerD unit drop-in to execute ContainerD with the default configuration file at /etc/containerd/config.toml.
        2. Create a Docker unit drop-in to execute Docker with the correct socket path of ContainerD.
        3. Create the container runtime metadata file with an OS path for binary installations: /var/bin.
      3. os-suse-chost -
        1. Create a ContainerD service unit and execute ContainerD with the default configuration file at /etc/containerd/config.toml.
        2. Download and install ctr-cli, which is not shipped with the current SUSE image.
        3. Create the container runtime metadata file with an OS path for binary installations: /usr/sbin.
    3. To rotate the ContainerD (CRI) logs we will activate the kubelet feature flag: CRIContainerLogRotation=true.

    4. The Docker monitor service will be replaced with an equivalent ContainerD monitor service.

  2. Validate the workers’ additional runtime configurations:

    1. Disallow additional runtimes for shoots with a Kubernetes version < 1.14.
    2. kata-containers validation: the machine type must support nested virtualization.
  3. Add support for each additional container runtime in the cluster.

    1. In order to install each additional available runtime in the cluster we should:

      1. Install the runtime binaries on the nodes of each worker pool that specifies support for the runtime.
      2. Apply the relevant RuntimeClass to the cluster.
    2. The installation above should be done by a new kind of extension: the ContainerRuntime resource. For each container runtime type (kata-containers/gvisor), a dedicated extension controller will be created.

      1. A label for each supported container runtime will be added to every node that belongs to the worker pool. This should be done similarly to the way labels are created today for each node, through kubelet execution parameters (_kubelet.flags: --node-labels). When creating the OperatingSystemConfig (original) for the worker, each supported container runtime should be mapped to a label on the node. For example:
        • label: container.runtime.kata-containers=true (shoot.spec.cloud..worker.containerRuntimes.kata-container)
        • label: container.runtime.gvisor=true (shoot.spec.cloud..worker.containerRuntimes.gvisor)

      2. During the Shoot reconciliation (similar to the steps for Extensions today), Gardener will create a new ContainerRuntime resource if a container runtime exists in at least one worker spec:

        apiVersion: extensions.gardener.cloud/v1alpha1
        kind: ContainerRuntime
        metadata:
          name: kata-containers-runtime-extention
          namespace: shoot--foo--bar
        spec:
          type: kata-containers
        

        Gardener will wait until all ContainerRuntime extension resources have been reconciled by the appropriate extension controllers.

      3. Each runtime extension controller will be responsible for reconciling its ContainerRuntime resource type: the rc-kata-containers extension controller will reconcile ContainerRuntime resources of type kata-containers, and rc-gvisor will reconcile ContainerRuntime resources of type gvisor. Reconciliation process by container runtime extension controllers:

        1. The runtime extension controller of a specific type should apply a chart which is responsible for the installation of the container runtime in the cluster:
          1. A DaemonSet which will run a privileged pod on each node with the label: container.runtime.:true The pod will be responsible for:
            1. Copying the container runtime binaries (from the extension package) to the relevant path in the host OS.
            2. Adding the relevant container runtime plugin section to the containerd configuration file (/etc/containerd/config.toml).
            3. Restarting containerd on the node.
          2. RuntimeClasses in the cluster to support the runtime class. For example:
            apiVersion: node.k8s.io/v1beta1
            kind: RuntimeClass
            metadata:
              name: gvisor
            handler: runsc
            
        2. Update the status of the relevant ContainerRuntime resource to succeeded.
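The mapping from configured container runtimes to node labels described in the design details above could look roughly like this. It is a minimal sketch: the function names are hypothetical, the label format follows the `container.runtime.<type>=true` examples from the text, and the flag rendering assumes the kubelet's comma-separated `key=value` syntax for `--node-labels`.

```python
def runtime_node_labels(runtime_types):
    """Map container runtime types to labels of the form container.runtime.<type>=true."""
    return {f"container.runtime.{runtime_type}": "true" for runtime_type in runtime_types}


def node_labels_flag(labels):
    """Render labels as a kubelet --node-labels flag (comma-separated key=value pairs)."""
    pairs = ",".join(f"{key}={value}" for key, value in sorted(labels.items()))
    return f"--node-labels={pairs}"


# Example: a worker pool that requests both runtimes.
labels = runtime_node_labels(["gvisor", "kata-containers"])
flag = node_labels_flag(labels)
```

The DaemonSet of each runtime extension can then select nodes by the matching label.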

9 - 12 Oidc Webhook Authenticator

OIDC Webhook Authenticator

Problem

In Kubernetes you can authenticate via several authentication strategies:

  • x509 Client Certificates
  • Static Token Files
  • Bootstrap Tokens
  • Static Password File (Basic authentication - deprecated and removed in 1.19)
  • Service Account Tokens
  • OpenID Connect Tokens
  • Webhook Token Authentication
  • Authenticating Proxy

End-users should use OpenID Connect (OIDC) Tokens created by an OIDC-compatible Identity Provider (IDP) and present the id_token to the kube-apiserver. If the kube-apiserver is configured to trust the IDP and the token is valid, then the user is authenticated and the UserInfo is sent to the authorization stack.

Ideally, operators of the Gardener cluster should be able to authenticate to end-user Shoot clusters with an id_token generated by an OIDC IDP, but in many cases, end-users might have already configured OIDC for their cluster, and more than one OIDC configuration is not allowed.

Another interesting application of multiple OIDC providers would be per Project OIDC provider where end-users of Gardener can add their own OIDC-compatible IDPs.

To work around the limitation of one OIDC configuration per kube-apiserver, a new OIDC Webhook Authenticator (OWA) could be implemented.

Goals

  • Dynamic registrations of OpenID Connect configurations.
  • As close as possible to the Kubernetes built-in OIDC Authenticator.
  • Built as an optional extension that is not required for a functional Shoot or Gardener cluster.

Non-goals

Proposal

The kube-apiserver can use Webhook Token Authentication to send Bearer Tokens (id_token) to an external webhook for validation:

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "spec": {
    "token": "(BEARERTOKEN)"
  }
}

Upon verification, the remote webhook returns the identity of the user (if authentication succeeds):

{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "janedoe@example.com",
      "uid": "42",
      "groups": [
        "developers",
        "qa"
      ],
      "extra": {
        "extrafield1": [
          "extravalue1",
          "extravalue2"
        ]
      }
    }
  }
}

Registration of new OpenIDConnect

This new OWA can be configured with multiple OIDC providers and the entire flow can look like this:

  1. Admin adds a new OpenIDConnect resource (via CRD) to the cluster.

    apiVersion: authentication.gardener.cloud/v1alpha1
    kind: OpenIDConnect
    metadata:
      name: foo
    spec:
      issuerURL: https://foo.bar
      clientID: some-client-id
      usernameClaim: email
      usernamePrefix: "test-"
      groupsClaim: groups
      groupsPrefix: "baz-"
      supportedSigningAlgs:
      - RS256
      requiredClaims:
        baz: bar
      caBundle: LS0tLS1CRUdJTiBDRVJU...base64-encoded CA certs for issuerURL.
    
  2. OWA watches for changes on this resource and does OIDC discovery. The OIDC provider’s configuration has to be accessible under the spec.issuerURL with a well-known path (.well-known/openid-configuration).

  3. OWA uses the jwks_uri obtained from the OIDC provider’s configuration to fetch the OIDC provider’s public keys from that endpoint.

  4. OWA uses those keys, issuer, client_id and other settings to add an OIDC authenticator to an in-memory list of Token Authenticators.
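The discovery step in the list above can be illustrated with a small sketch. It performs no network calls. The function names are hypothetical, and the issuer check follows the OpenID Connect Discovery rule that the `issuer` in the provider configuration must match the URL used for discovery. This is an assumption about how OWA might behave, not a description of its code.

```python
import json


def discovery_url(issuer_url):
    """Build the well-known discovery URL for an issuer (a trailing slash is dropped first)."""
    return issuer_url.rstrip("/") + "/.well-known/openid-configuration"


def jwks_uri_from_config(issuer_url, raw_config):
    """Extract jwks_uri from a provider configuration document given as a JSON string."""
    config = json.loads(raw_config)
    if config.get("issuer") != issuer_url:
        raise ValueError("issuer in provider configuration does not match spec.issuerURL")
    jwks_uri = config.get("jwks_uri")
    if not jwks_uri:
        raise ValueError("provider configuration has no jwks_uri")
    return jwks_uri


# Example with the issuerURL from the OpenIDConnect resource above;
# the jwks_uri value is made up for illustration.
url = discovery_url("https://foo.bar")
keys_url = jwks_uri_from_config(
    "https://foo.bar",
    '{"issuer": "https://foo.bar", "jwks_uri": "https://foo.bar/keys"}',
)
```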


End-user authentication via new OpenIDConnect IDP

When a user presents an id_token obtained from an OpenID Connect provider, the flow looks like this:

  1. The user authenticates against a Custom IDP.

  2. id_token is obtained from the Custom IDP.

  3. The user uses id_token to perform an API call to kube-apiserver.

  4. As the id_token is not matched by any built-in or configured authenticators in the kube-apiserver, it is sent to OWA for validation.

    {
      "TokenReview": {
        "kind": "TokenReview",
        "apiVersion": "authentication.k8s.io/v1beta1",
        "spec": {
          "token": "ddeewfwef..."
        }
      }
    }
    
  5. OWA uses TokenReview to authenticate the calling API server (the kube-apiserver for delegation of authentication and authorization may be different from the calling kube-apiserver).

    {
      "TokenReview": {
        "kind": "TokenReview",
        "apiVersion": "authentication.k8s.io/v1beta1",
        "spec": {
          "token": "api-server-token..."
        }
      }
    }
    
  6. The Authentication API server then returns the identity of the calling API server:

    {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "metadata": {
            "creationTimestamp": null
        },
        "spec": {
            "token": "eyJhbGciOiJSUzI1NiIsImtpZCI6InJocEdLTXZlYjV1OE5heD..."
        },
        "status": {
            "authenticated": true,
            "user": {
                "groups": [
                    "system:serviceaccounts",
                    "system:serviceaccounts:shoot--abcd",
                    "system:authenticated"
                ],
                "uid": "14db103e-88bb-4fb3-8efd-ca9bec91c7bf",
                "username": "system:serviceaccount:shoot--abcd:kube-apiserver"
            }
        }
    }
    

    OWA makes a SubjectAccessReview call to the Authorization API server to ensure that calling API server is allowed to validate tokens:

    {
      "apiVersion": "authorization.k8s.io/v1",
      "kind": "SubjectAccessReview",
      "spec": {
        "groups": [
          "system:serviceaccounts",
          "system:serviceaccounts:shoot--abcd",
          "system:authenticated"
        ],
        "nonResourceAttributes": {
          "path": "/validate-token",
          "verb": "post"
        },
        "user": "system:serviceaccount:shoot--abcd:kube-apiserver"
      },
      "status": {
        "allowed": true,
        "reason": "RBAC: allowed by RoleBinding \"kube-apiserver\" of ClusterRole \"kube-apiserver\" to ServiceAccount \"system:serviceaccount:shoot--abcd:kube-apiserver\""
      }
    }
    
  7. OWA then iterates over all registered OpenIDConnect Token authenticators and tries to validate the token.

  8. Upon successful validation, it returns the TokenReview with the user, groups, and extra parameters:

    {
      "TokenReview": {
        "kind": "TokenReview",
        "apiVersion": "authentication.k8s.io/v1beta1",
        "spec": {
          "token": "ddeewfwef..."
        },
        "status": {
          "authenticated": true,
          "user": {
            "username": "test-foo@bar.com",
            "groups": [
              "baz-employee"
            ],
            "extra": {
              "gardener.cloud/authenticator/name": [
                "foo"
              ],
              "gardener.cloud/authenticator/uid": [
                "e5062528-e5a4-4b97-ad83-614d015b0979"
              ]
            }
          }
        }
      }
    }
    

    It also adds some extra information which can be used by custom authorizers later on:

    1. gardener.cloud/authenticator/name contains the name of the OpenIDConnect authenticator which was used.
    2. gardener.cloud/authenticator/uid contains the metadata.uid of the OpenIDConnect authenticator which was used.
  9. The kube-apiserver proceeds with authorization checks and returns a response.
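Steps 7 and 8 of the flow above can be sketched as a loop over an in-memory list of authenticators. This is a simplified illustration: the `Authenticator` type and `review_token` function are hypothetical, token verification is replaced by a stand-in callable instead of real JWT signature checks, and the username is read from the `email` claim as in the example OpenIDConnect resource.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Authenticator:
    name: str             # metadata.name of the OpenIDConnect resource
    uid: str              # metadata.uid of the OpenIDConnect resource
    username_prefix: str
    groups_prefix: str
    # Stand-in for real token verification: returns the claims or None.
    verify: Callable[[str], Optional[dict]]


def review_token(token, authenticators):
    """Try each registered authenticator in turn and build a TokenReview status."""
    for auth in authenticators:
        claims = auth.verify(token)
        if claims is None:
            continue  # This authenticator does not accept the token, try the next one.
        return {
            "authenticated": True,
            "user": {
                "username": auth.username_prefix + claims["email"],
                "groups": [auth.groups_prefix + g for g in claims.get("groups", [])],
                "extra": {
                    "gardener.cloud/authenticator/name": [auth.name],
                    "gardener.cloud/authenticator/uid": [auth.uid],
                },
            },
        }
    return {"authenticated": False}
```

The first authenticator that accepts the token determines the returned identity, so in this sketch the order of the list matters if two providers could validate the same token.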

An overview of the flow:

[Figure: overview of the end-user authentication flow via OWA]

Deployment for Shoot clusters

OWA can be deployed per Shoot cluster via the Shoot OIDC Service Extension. The shoot’s kube-apiserver is mutated so that it has the following flag configured.

--authentication-token-webhook-config-file=/etc/webhook/kubeconfig

OWA on the other hand uses the shoot’s kube-apiserver and delegates auth capabilities to it. This means that the needed RBAC is managed in the shoot cluster. By default only the shoot’s kube-apiserver has permissions to validate tokens against OWA.

10 - 13 Automated Seed Management

Automated Seed Management

Automated seed management involves automating certain aspects of managing seeds in Garden clusters, such as:

Implementing the above features would involve changes to various existing Gardener components, as well as perhaps introducing new ones. This document describes these features in more detail and proposes a design approach for some of them.

In Gardener, scheduling shoots onto seeds is quite similar to scheduling pods onto nodes in Kubernetes. Therefore, a guiding principle behind the proposed design approaches is taking advantage of best practices and existing components already used in Kubernetes.

Ensuring Seeds Capacity for Shoots Is Not Exceeded

Seeds have a practical limit of how many shoots they can accommodate. Exceeding this limit is undesirable as the system performance will be noticeably impacted. Therefore, it is important to ensure that a seed’s capacity for shoots is not exceeded by introducing a maximum number of shoots that can be scheduled onto a seed and making sure that it is taken into account by the scheduler.

An initial discussion of this topic is available in Issue #2938. The proposed solution is based on the following flow:

  • The gardenlet is configured with certain resources and their total capacity (and, for certain resources, the amount reserved for Gardener).
  • The gardenlet seed controller updates the Seed status with the capacity of each resource and how much of it is actually available to be consumed by shoots, using capacity and allocatable fields that are very similar to the corresponding fields in the Node status.
  • When scheduling shoots, gardener-scheduler is influenced by the remaining capacity of the seed. In the simplest possible implementation, it never schedules shoots onto a seed that has already reached its capacity for a resource needed by the shoot.

Initially, the only resource considered would be the maximum number of shoots that can be scheduled onto a seed. Later, more resources could be added to make more precise scheduling calculations.

Note: Resources could also be requested by shoots, similarly to how pods can request node resources, and the scheduler could then ensure that such requests are taken into account when scheduling shoots onto seeds. However, the user is rarely, if at all, concerned with what resources a shoot consumes from a seed, and this should also be regarded as an implementation detail that could change in the future. Therefore, such resource requests are not included in this GEP.

In addition, an extensibility plugin framework could be introduced in the future in order to advertise custom resources, including provider-specific resources (for example, load balancers on Azure), so that gardenlet would be able to update the seed status with their capacity and allocatable values. Such a concept is not described here in further detail as it is sufficiently complex to require a separate GEP.

Example Seed status with capacity and allocatable fields:

status:
  capacity:
    shoots: "100"
    persistent-volumes: "200" # Built-in resource
    azure.provider.extensions.gardener.cloud/load-balancers: "300" # Custom resource advertised by an Azure-specific plugin
  allocatable:
    shoots: "100"
    persistent-volumes: "197" # 3 persistent volumes are reserved for Gardener
    azure.provider.extensions.gardener.cloud/load-balancers: "300"

Gardenlet Configuration

As mentioned above, the total resource capacity for built-in resources such as the number of shoots is specified as part of the gardenlet configuration, not in the Seed spec. The gardenlet configuration itself could be specified in the spec of the newly introduced ManagedSeed resource. Here it is assumed that in the future this could become the recommended and most widely used way to manage seeds. If the same gardenlet is responsible for multiple seeds, they would all share the same capacity settings.

To specify the total resource capacity for built-in resources, as well as the amount of such resources reserved for Gardener, the two new fields resources.capacity and resources.reserved are introduced in the GardenletConfiguration resource. The gardenlet seed controller would then initialize the capacity and allocatable fields in the seed status as follows:

  • The capacity value is set to the configured resources.capacity.
  • The allocatable value is set to the configured resources.capacity minus resources.reserved.
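The two rules above amount to a simple subtraction per resource. A minimal sketch follows. The function name is hypothetical, and rendering the values as strings mirrors the example Seed status shown earlier.

```python
def seed_resource_status(capacity, reserved):
    """Compute the Seed status capacity and allocatable maps from the gardenlet settings."""
    unknown = set(reserved) - set(capacity)
    if unknown:
        raise ValueError(f"reserved resources without capacity: {sorted(unknown)}")
    allocatable = {}
    for name, total in capacity.items():
        remaining = total - reserved.get(name, 0)
        if remaining < 0:
            raise ValueError(f"reserved exceeds capacity for {name}")
        allocatable[name] = remaining
    # Quantities are rendered as strings, as in the example Seed status.
    return {
        "capacity": {name: str(value) for name, value in capacity.items()},
        "allocatable": {name: str(value) for name, value in allocatable.items()},
    }


status = seed_resource_status(
    capacity={"shoots": 100, "persistent-volumes": 200},
    reserved={"persistent-volumes": 3},
)
```

With 200 persistent volumes configured and 3 reserved, the allocatable value is 197, while resources without a reservation keep their full capacity.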

Example GardenletConfiguration with resources.capacity and resources.reserved fields:

resources:
  capacity:
    shoots: 100
    persistent-volumes: 200
  reserved:
    persistent-volumes: 3

Scheduling Algorithm

Currently gardener-scheduler uses a simple non-extensible algorithm in order to schedule shoots onto seeds. It goes through the following stages:

  • Filter out seeds that don’t meet scheduling requirements such as being ready, matching cloud profile and shoot label selectors, matching the shoot provider, and not having taints that are not tolerated by the shoot.
  • From the remaining seeds, determine candidates that are considered best based on their region, by using a strategy that can be either “same region” or “minimal distance”.
  • Among these candidates, choose the one with the least number of shoots.

This scheduling algorithm should be adapted in order to properly take into account resources capacity and requests. As a first step, during the filtering stage, any seeds that would exceed their capacity for shoots, or their capacity for any resources requested by the shoot, should simply be filtered out and not considered during the next stages.
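The additional filtering step could be sketched as follows. This is a simplification under stated assumptions: only the shoots resource is considered, the other filters and the region strategy are omitted, and the dictionary keys and function name are hypothetical.

```python
def schedule_shoot(seeds):
    """Pick a seed for a new shoot, or return None if every seed is at capacity.

    Each seed is a dict with the keys name, allocatable_shoots, and scheduled_shoots.
    """
    # Filtering stage: drop seeds that have already reached their capacity for shoots.
    candidates = [s for s in seeds if s["scheduled_shoots"] < s["allocatable_shoots"]]
    if not candidates:
        return None
    # Final stage of the current algorithm: choose the candidate with the fewest shoots.
    return min(candidates, key=lambda s: s["scheduled_shoots"])["name"]
```

Note that a full seed is excluded even if it hosts fewer shoots than all other seeds, because the capacity filter runs before the selection.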

Later, the scheduling algorithm could be further enhanced by replacing the step in which the region strategy is applied by a scoring step similar to the one in Kubernetes Scheduler. In this scoring step, the scheduler would rank the remaining seeds to choose the most suitable shoot placement. It would assign a score to each seed that survived filtering based on a list of scoring rules. These rules might include for example MinimalDistance and SeedResourcesLeastAllocated, among others. Each rule would produce its own score for the seed, and the overall seed score would be calculated as a weighted sum of all such scores. Finally, the scheduler would assign the shoot to the seed with the highest ranking.
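A weighted-sum scoring step of this kind might look like the sketch below. The two rule functions are only loosely modelled on the MinimalDistance and SeedResourcesLeastAllocated rule names mentioned above. Their formulas, the seed fields, and the weights are all assumptions made for illustration, and seeds are assumed to have passed filtering already (so `allocatable` is greater than zero).

```python
def least_allocated(seed):
    """Score in [0, 100]: higher when a larger share of the seed's capacity is free."""
    return 100.0 * (seed["allocatable"] - seed["used"]) / seed["allocatable"]


def minimal_distance(seed):
    """Score in (0, 100]: higher when the seed is closer to the shoot's region."""
    return 100.0 / (1 + seed["distance"])


def pick_seed(seeds, weighted_rules):
    """Return the name of the seed with the highest weighted sum of rule scores."""
    def total(seed):
        return sum(weight * rule(seed) for weight, rule in weighted_rules)
    return max(seeds, key=total)["name"]


seeds = [
    {"name": "a", "allocatable": 100, "used": 90, "distance": 0},
    {"name": "b", "allocatable": 100, "used": 10, "distance": 2},
]
balanced = pick_seed(seeds, [(1, least_allocated), (1, minimal_distance)])
prefer_close = pick_seed(seeds, [(1, least_allocated), (3, minimal_distance)])
```

Changing the weights shifts the outcome: with equal weights the emptier seed wins, while a higher weight on distance favors the closer one.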

ManagedSeeds

When all or most of the existing seeds are near capacity, new seeds should be created in order to accommodate more shoots. Conversely, sometimes there could be too many seeds for the number of shoots, and so some of the seeds could be deleted to save resources. Currently, the process of creating a new seed involves a number of manual steps, such as creating a new shoot that meets certain criteria, and then registering it as a seed in Gardener. This could be automated to some extent by annotating a shoot with the use-as-seed annotation, in order to create a “shooted seed”. However, adding more than one similar seed still requires manually creating all needed shoots, annotating them appropriately, and making sure that they are successfully reconciled and registered.

To create, delete, and update seeds effectively in a declarative way and allow auto-scaling, a “creatable seed” resource along with a “set” (and in the future, perhaps also a “deployment”) of such creatable seeds should be introduced, similar to Kubernetes Pod, ReplicaSet, and Deployment (or to MCM Machine, MachineSet, and MachineDeployment) resources. With such resources (and their respective controllers), creating a new seed based on a template would become as simple as increasing the replicas field in the “set” resource.

In Issue #2181 it is already proposed that the use-as-seed annotation is replaced by a dedicated ShootedSeed resource. The solution proposed here further elaborates on this idea.

ManagedSeed Resource

The ManagedSeed resource is a dedicated custom resource that represents an evolution of the “shooted seed” and properly replaces the use-as-seed annotation. This resource contains:

  • The name of the Shoot that should be registered as a Seed.
  • An optional seedTemplate section that contains the Seed spec and parts of the metadata, such as labels and annotations.
  • An optional gardenlet section that contains:
    • gardenlet deployment parameters, such as the number of replicas, the image, etc.
    • The GardenletConfiguration resource that contains controllers configuration, feature gates, and a seedConfig section that contains the Seed spec and parts of its metadata.
    • Additional configuration parameters, such as the garden connection bootstrap mechanism (see TLS Bootstrapping), and whether to merge the provided configuration with the configuration of the parent gardenlet.

Either the seedTemplate or the gardenlet section must be specified, but not both:

  • If the seedTemplate section is specified, gardenlet is not deployed to the shoot, and a new Seed resource is created based on the template.
  • If the gardenlet section is specified, gardenlet is deployed to the shoot, and it registers a new seed upon startup based on the seedConfig section of the GardenletConfiguration resource.
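The mutual exclusion rule above is straightforward to check. A minimal sketch, with a hypothetical function name and the spec given as a plain dictionary:

```python
def validate_managed_seed_spec(spec):
    """Return a list of validation errors for a ManagedSeed spec given as a dict."""
    errors = []
    if not spec.get("shoot", {}).get("name"):
        errors.append("spec.shoot.name must be set")
    has_seed_template = spec.get("seedTemplate") is not None
    has_gardenlet = spec.get("gardenlet") is not None
    # Exactly one of the two sections must be present.
    if has_seed_template == has_gardenlet:
        errors.append("exactly one of spec.seedTemplate or spec.gardenlet must be specified")
    return errors
```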

A ManagedSeed allows fine-tuning the seed and the gardenlet configuration of shooted seeds in order to deviate from the global defaults, e.g. lower the concurrent sync for some of the seed’s controllers or enable a feature gate only on certain seeds. Also, it simplifies the deletion protection of such seeds.

Also, the ManagedSeed resource is a more powerful alternative to the use-as-seed annotation. The implementation of the use-as-seed annotation itself could be refactored to use a ManagedSeed resource extracted from the annotation by a controller.

Although in this proposal a ManagedSeed is always a “shooted seed”, that is a Shoot that is registered as a Seed, this idea could be further extended in the future by adding a type field that could be either Shoot (implied in this proposal), or something different. Such an extension would allow registering and managing a cluster that is not a Shoot, e.g. a GKE cluster, as a Seed.

Last but not least, ManagedSeeds could be used as the basis for creating and deleting seeds automatically via the ManagedSeedSet resource that is described in ManagedSeedSets.

Unlike the Seed resource, the ManagedSeed resource is namespaced. If created in the garden namespace, the resulting seed is globally available. If created in a project namespace, the resulting seed can be used as a “private seed” by shoots in the project, either by being decorated with project-specific taints and labels, or by being of the special PrivateSeed kind that is also namespaced. The concept of private seeds / cloudprofiles is described in Issue #2874. Until this concept is implemented, ManagedSeed resources might need to be restricted to the garden namespace, similarly to how shoots with the use-as-seed annotation currently are.

Example ManagedSeed resource with a seedTemplate section:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: crazy-botany
  namespace: garden
spec:
  shoot:
    name: crazy-botany # Shoot that should be registered as a Seed
  seedTemplate: # Seed template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      provider:
        type: gcp
        region: europe-west1
      taints:
      - key: seed.gardener.cloud/protected
      ...

Example ManagedSeed resource with a gardenlet section:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: crazy-botany
  namespace: garden
spec:
  shoot:
    name: crazy-botany # Shoot that should be registered as a Seed
  gardenlet: 
    deployment: # Gardenlet deployment configuration
      replicaCount: 1
      revisionHistoryLimit: 10
      serviceAccountName: gardenlet
      image:
        repository: eu.gcr.io/gardener-project/gardener/gardenlet
        tag: latest
        pullPolicy: IfNotPresent
      resources:
        ...
      podLabels:
        ...
      podAnnotations: 
        ...
      additionalVolumes:
        ...
      additionalVolumeMounts:
        ...
      env:
        ...
      vpa: false
    config: # GardenletConfiguration resource
      apiVersion: gardenlet.config.gardener.cloud/v1alpha1
      kind: GardenletConfiguration
      seedConfig: # Seed template, including spec and parts of the metadata
        metadata:
          labels:
            foo: bar
        spec:
          provider:
            type: gcp
            region: europe-west1
          taints:
          - key: seed.gardener.cloud/protected
          ...
      controllers:
        shoot:
          concurrentSyncs: 20
      featureGates:
        ...
      ...
    bootstrap: BootstrapToken
    mergeWithParent: true

ManagedSeed Controller

ManagedSeeds are reconciled by a new managed seed controller in gardenlet. Its implementation is very similar to the current seed registration controller, and in fact could be regarded as a refactoring of the latter, with the difference that it uses the ManagedSeed resource rather than the use-as-seed annotation on a Shoot. The gardenlet only reconciles ManagedSeeds that refer to Shoots scheduled on Seeds the gardenlet is responsible for.

Once this controller is considered sufficiently stable, the current use-as-seed annotation and the controller mentioned above should be marked as deprecated and eventually removed.

A ManagedSeed that is in use by shoots cannot be deleted, unless the shoots are either deleted or moved to other seeds first. The managed seed controller ensures that this is the case by only allowing a ManagedSeed to be deleted if its Seed has been already deleted.

ManagedSeed Admission Plugins

In addition to the managed seed controller mentioned above, new gardener-apiserver admission plugins should be introduced to properly validate the creation and update of ManagedSeeds, as well as the deletion of shoots registered as seeds. These plugins should ensure that:

  • A Shoot that is being referred to by a ManagedSeed cannot be deleted.
  • Certain Seed spec fields, for example the provider type and region, networking CIDRs for pods, services, and nodes, etc., are the same as (or compatible with) the corresponding Shoot spec fields of the shoot that is being registered as seed.
  • If such Seed spec fields are omitted or empty, the plugins should supply proper defaults based on the values in the Shoot resource.
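The defaulting and consistency checks in the last two items could be combined as in the sketch below. It covers only the provider type and region, takes both specs as plain dictionaries, and uses a hypothetical function name. The field paths follow the Shoot and Seed examples in this document.

```python
import copy


def default_and_validate_seed_spec(seed_spec, shoot_spec):
    """Default empty Seed provider fields from the Shoot and report mismatches.

    Returns (defaulted_seed_spec, errors). The input dicts are not modified.
    """
    result = copy.deepcopy(seed_spec)
    provider = result.setdefault("provider", {})
    expected = {
        "type": shoot_spec["provider"]["type"],
        "region": shoot_spec["region"],
    }
    errors = []
    for field_name, shoot_value in expected.items():
        if not provider.get(field_name):
            provider[field_name] = shoot_value  # Omitted or empty: default from the Shoot.
        elif provider[field_name] != shoot_value:
            errors.append(
                f"seed provider {field_name} {provider[field_name]!r} "
                f"does not match shoot value {shoot_value!r}"
            )
    return result, errors
```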

Provider-specific Seed Bootstrapping Actions

Bootstrapping a new seed might require additional provider-specific actions to the ones performed automatically by the managed seed controller. For example, on Azure this might include getting a new subscription, extending quotas, etc. This could eventually be automated by introducing an extension mechanism for the Gardener seed bootstrapping flow, to be handled by a new type of controller in the provider extensions. However, such an extension mechanism is not in the scope of this proposal and might require a separate GEP.

One idea that could be further explored is the use of shoot readiness gates, similar to Kubernetes pod readiness gates, in order to control whether a Shoot is considered Ready before it could be registered as a Seed. A provider-specific extension could set the special condition that is specified as a readiness gate to True only after it has successfully performed the provider-specific actions needed.

Changes to Existing Controllers

Since the Shoot registration as a Seed is decoupled from the Shoot reconciliation, existing gardenlet controllers would not have to be changed in order to properly support ManagedSeeds. The main change to gardenlet that would be needed is introducing the new managed seed controller mentioned above, and possibly retiring the old one at some point. In addition, the Shoot controller would need to be adapted as it currently performs certain actions differently if the shoot has a “shooted seed”.

The introduction of the ManagedSeed resource would also require no changes to existing gardener-controller-manager controllers that operate on Shoots (for example, shoot hibernation and maintenance controllers).

ManagedSeedSets

Similarly to a ReplicaSet, the purpose of a ManagedSeedSet is to maintain a stable set of replica ManagedSeeds available at any given time. As such, it is used to guarantee the availability of a specified number of identical ManagedSeeds, on an equal number of identical Shoots.

ManagedSeedSet Resource

The ManagedSeedSet resource has a selector field that specifies how to identify ManagedSeeds it can acquire, a number of replicas indicating how many ManagedSeeds (and their corresponding Shoots) it should be maintaining, and two templates:

  • A ManagedSeed template (template) specifying the data of new ManagedSeeds it should create to meet the number of replicas criteria.
  • A Shoot template (shootTemplate) specifying the data of new Shoots it should create to host the ManagedSeeds.

A ManagedSeedSet then fulfills its purpose by creating and deleting ManagedSeeds (and their corresponding Shoots) as needed to reach the desired number.

A ManagedSeedSet is linked to its ManagedSeeds and Shoots via the metadata.ownerReferences field, which specifies what resource the current object is owned by. All ManagedSeeds and Shoots acquired by a ManagedSeedSet have their owning ManagedSeedSet’s identifying information within their ownerReferences field.

Example ManagedSeedSet resource:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeedSet
metadata:
  name: crazy-botany
  namespace: garden
spec:
  replicas: 3
  selector:
    matchLabels:
      foo: bar
  updateStrategy:
    type: RollingUpdate # Update strategy, must be `RollingUpdate`
    rollingUpdate:
      partition: 2 # Only update the last replica (#2), assuming there are no gaps ("rolling out a canary")
  template: # ManagedSeed template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec: 
      # shoot.name is not specified since it's filled automatically by the controller
      seedTemplate: # Either a seedTemplate or a gardenlet section must be specified, see above
        metadata:
          labels:
            foo: bar
        spec:
          provider:
            type: gcp
            region: europe-west1
          taints:
          - key: seed.gardener.cloud/protected
          ...
  shootTemplate: # Shoot template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      cloudProfileName: gcp
      secretBindingName: shoot-operator-gcp
      region: europe-west1
      provider:
        type: gcp
      ...

ManagedSeedSet Controller

ManagedSeedSets are reconciled by a new managed seed set controller in gardener-controller-manager. During the reconciliation this controller creates and deletes ManagedSeeds and Shoots in response to changes to the replicas and selector fields.

Note: The introduction of the ManagedSeedSet resource would not require any changes to gardenlet or to existing gardener-controller-manager controllers.

Managing ManagedSeed Updates

To manage ManagedSeed updates, we considered two possible approaches:

  • A ManagedSeedSet, similarly to a ReplicaSet, does not manage updates to its replicas in any way. In the future, we might introduce ManagedSeedDeployments, a higher-level concept that manages ManagedSeedSets and provides declarative updates to ManagedSeeds along with other useful features, similarly to a Deployment. Such a mechanism would involve creating new ManagedSeedSets, and therefore new seeds, behind the scenes, and moving existing shoots to them.
  • A ManagedSeedSet does manage updates to its replicas, similarly to a StatefulSet. Updates are performed “in-place”, without creating new seeds and moving existing shoots to them. Such a mechanism could also take advantage of other StatefulSet features, such as ordered rolling updates and phased rollouts.

There is an important difference between seeds and pods or nodes in that seeds are more “heavyweight” and therefore updating a set of seeds by introducing new seeds and moving shoots to them tends to be much more complex, time-consuming, and prone to failures compared to updating the seeds “in place”. Furthermore, updating seeds in this way depends on a mature implementation of GEP-7: Shoot Control Plane Migration, which is not available right now. Due to these considerations, we favor the second approach over the first one.

ManagedSeed Identity and Order

A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods. It maintains a stable identity (including network identity) for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.

A StatefulSet achieves the above by associating each replica with an ordinal number. With n replicas, these ordinal numbers range from 0 to n-1. When scaling out, newly added replicas always have ordinal numbers larger than those of previously existing replicas. When scaling in, it is the replicas with the largest ordinal numbers that are removed.

Besides stable identity and persistent storage, these ordinal numbers are also used to implement the following StatefulSet features:

  • Ordered, graceful deployment and scaling.
  • Ordered, automated rolling updates. Such rolling updates can be partitioned (limited to replicas with ordinal numbers greater than or equal to the “partition”) to achieve phased rollouts.

A ManagedSeedSet, unlike a StatefulSet, does not need to maintain a stable identity for its ManagedSeeds. Furthermore, it would not be practical to always remove the replicas with the largest ordinal numbers when scaling in, since the corresponding seeds may have shoots scheduled onto them, while other seeds, with lower ordinals, may have fewer shoots (or none), and therefore be much better candidates for being removed.

On the other hand, it would be beneficial if a ManagedSeedSet, like a StatefulSet, provides ordered deployment and scaling, ordered rolling updates, and phased rollouts. The main advantage of these features is that a deployment or update failure would affect fewer replicas (ideally just one), containing any potential damage and making the situation easier to handle, thus achieving some of the goals stated in Issue #87. They could also help to contain seed rolling updates outside business hours.

Based on the above considerations, we propose the following mechanism for handling ManagedSeed identity and order:

  • A ManagedSeedSet uses ordinal numbers generated by an increasing sequence to identify ManagedSeeds and Shoots it creates and manages. These numbers always start from 0 and are incremented by 1 for each newly added replica.
  • Replicas (both ManagedSeeds and Shoots) are named after the ManagedSeedSet with the ordinal number appended. For example, for a ManagedSeedSet named test its replicas are named test-0, test-1, etc.
  • Gaps in the sequence created by removing replicas with ordinal numbers in the middle of the range are never filled in. A newly added replica always receives a number that is not only free, but also unique to itself. For example, if there are 2 replicas named test-0 and test-1 and any one of them is removed, a newly added replica will still be named test-2.

Although such ordinal numbers can also provide some form of stable identity, in this case it is much more important that they can provide a predictable ordering for deployments and updates, and can also be used to partition rolling updates similarly to StatefulSet ordinal numbers.
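The naming scheme above can be sketched as follows. This is only an illustration under stated assumptions: the `OrdinalAllocator` class and its `next_ordinal` counter are hypothetical names, and a real controller would need to persist the counter somewhere (for example in the ManagedSeedSet status) so that numbers are never reused.

```python
from dataclasses import dataclass, field


@dataclass
class OrdinalAllocator:
    """Hands out replica names from an increasing sequence; gaps are never refilled."""

    set_name: str
    next_ordinal: int = 0  # hypothetical counter, assumed to be persisted by the controller
    replicas: set = field(default_factory=set)

    def add_replica(self) -> str:
        # The new replica always gets the next unused number, never a freed one.
        name = f"{self.set_name}-{self.next_ordinal}"
        self.next_ordinal += 1
        self.replicas.add(name)
        return name

    def remove_replica(self, name: str) -> None:
        # Removing a replica leaves a gap; the counter is not decremented.
        self.replicas.discard(name)
```

With two replicas test-0 and test-1, removing either one and then calling `add_replica()` yields test-2, matching the example in the list above.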

Update Strategies

The ManagedSeedSet’s .spec.updateStrategy field allows configuring automated rolling updates for the ManagedSeeds and Shoots in a ManagedSeedSet.

Rolling Updates

The RollingUpdate update strategy implements automated rolling updates for the ManagedSeeds and Shoots in a ManagedSeedSet. With this strategy, the ManagedSeedSet controller will update each ManagedSeed and Shoot in the ManagedSeedSet. It will proceed from the largest ordinal number to the smallest, updating each ManagedSeed and its corresponding Shoot one at a time. It will wait until both the Shoot and the Seed of an updated ManagedSeed are Ready prior to updating its predecessor.

As a further improvement upon the above, the controller could check not only the ManagedSeeds and their corresponding Shoots for readiness, but also the Shoots scheduled onto these ManagedSeeds. The rollout would then only continue as long as the share of these Shoots that are not reconciled and Ready does not exceed X percent. Since checking all these additional conditions might require some complex logic, it should be performed by an independent managed seed care controller that updates the ManagedSeed resource with the readiness of its Seed and all Shoots scheduled onto the Seed.

Note that unlike a StatefulSet, an OnDelete update strategy is not supported.

Partitions

The RollingUpdate update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition. If a partition is specified, only ManagedSeeds and Shoots with ordinals greater than or equal to the partition will be updated when any of the ManagedSeedSet’s templates is updated. All remaining ManagedSeeds and Shoots will not be updated. If a ManagedSeedSet’s .spec.updateStrategy.rollingUpdate.partition is greater than the largest ordinal number in use by a replica, updates to its templates will not be propagated to its replicas (but newly added replicas may still use the updated templates depending on the partition value).
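As a rough illustration of the ordering and partitioning rules described above, the following sketch selects the replicas to update and processes them one at a time. The function names and the `update` and `is_ready` callbacks are hypothetical; a real controller would requeue and retry rather than simply stop.

```python
def update_order(ordinals, partition=0):
    """Return the ordinals to update, largest first, skipping those below the partition."""
    return sorted((o for o in ordinals if o >= partition), reverse=True)


def rolling_update(ordinals, partition, update, is_ready):
    """Update replicas one at a time and stop at the first one that is not ready."""
    completed = []
    for ordinal in update_order(ordinals, partition):
        update(ordinal)
        if not is_ready(ordinal):
            break  # assumption: the controller waits here and retries later
        completed.append(ordinal)
    return completed
```

With ordinals 0, 1, and 2 and a partition of 2, only replica 2 is updated, which corresponds to the "canary" setting in the example resource. A partition larger than every ordinal in use selects no replicas at all.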

Keeping Track of Revision History and Performing Rollbacks

Similarly to a StatefulSet, the ManagedSeedSet controller uses ControllerRevisions to keep track of the revision history, and controller-revision-hash labels to maintain an association between a ManagedSeed or a Shoot and the concrete template revisions based on which they were created or last updated. These are used for the following purposes:

  • During an update, determine which replicas are still not on the latest revision and therefore should be updated.
  • Display the revision history of a ManagedSeedSet via kubectl rollout history.
  • Roll back all ManagedSeedSet replicas to a specific revision via kubectl rollout undo.

Note: The above kubectl rollout commands will not work with custom resources such as ManagedSeedSets out of the box (the documentation says explicitly that valid resource types are only deployments, daemonsets, and statefulsets), but it should be possible to eventually support such commands for ManagedSeedSets via a kubectl plugin.

Scaling-in ManagedSeedSets

Deleting ManagedSeeds in response to decreasing the replicas of a ManagedSeedSet deserves special attention for two reasons:

  • A seed that is already in use by shoots cannot be deleted, unless the shoots are either deleted or moved to other seeds first.
  • When there are more empty seeds than requested for deletion, determining which seeds to delete might not be as straightforward as with pods or nodes.

The above challenges could be addressed as follows:

  • In order to scale in a ManagedSeedSet successfully, there should be at least as many empty ManagedSeeds as the difference between the old and the new replicas. In some cases, the user might need to ensure that this is the case by draining some seeds manually before decreasing the replicas field.
  • It should be possible to protect ManagedSeeds from deletion even if they are empty, perhaps via an annotation such as seedmanagement.gardener.cloud/protect-from-deletion. Such seeds are not taken into account when determining whether the scale in operation can succeed.
  • The decision which seeds to delete among the ManagedSeeds that are empty and not protected should be based on hints, perhaps again in the form of annotations, that could be added manually by the user, as well as other factors, see Prioritizing ManagedSeed Deletion.
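The feasibility check implied by the first two points could look roughly like this. It is a sketch under assumptions: seeds are modeled as plain dictionaries with a hypothetical `shoots` count, and the mere presence of the protection annotation is taken to mean "protected", which the proposal does not specify.

```python
# Annotation name as suggested in the text; its exact semantics are an assumption.
PROTECT = "seedmanagement.gardener.cloud/protect-from-deletion"


def deletable(seeds):
    """Seeds that are empty and not protected from deletion."""
    return [
        s for s in seeds
        if s["shoots"] == 0 and PROTECT not in s.get("annotations", {})
    ]


def can_scale_in(seeds, new_replicas):
    """True if enough empty, unprotected seeds exist to reach new_replicas."""
    to_delete = len(seeds) - new_replicas
    return to_delete <= len(deletable(seeds))
```

If the check fails, the user would first have to drain seeds manually, as described above, before decreasing the replicas field.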

Prioritizing ManagedSeed Deletion

To help the controller decide which empty ManagedSeeds are to be deleted first, the user could manually annotate ManagedSeeds with a seed priority annotation such as seedmanagement.gardener.cloud/priority. ManagedSeeds with lower priority are more likely to be deleted first. If not specified, a certain default value is assumed, for example 3.

Besides this annotation, the controller should also take into account other factors, such as the current seed conditions (NotReady should be preferred for deletion over Ready), as well as the seed’s age (older should be preferred for deletion over newer).
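One possible way to combine these factors is a sort key, sketched below. The relative weighting (priority first, then readiness, then age) is an assumption made for illustration, as are the dictionary fields `ready` and `created`; the proposal leaves the exact ordering open.

```python
# Annotation name and default value as suggested in the text.
PRIORITY = "seedmanagement.gardener.cloud/priority"
DEFAULT_PRIORITY = 3


def deletion_order(seeds):
    """Order candidate seeds so that the first entries are deleted first."""
    def key(seed):
        priority = int(seed.get("annotations", {}).get(PRIORITY, DEFAULT_PRIORITY))
        # Lower priority first, NotReady (False) before Ready (True), older before newer.
        return (priority, seed["ready"], seed["created"])
    return sorted(seeds, key=key)
```

The input is expected to contain only empty, unprotected ManagedSeeds, since only those are candidates for deletion in the first place.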

Auto-scaling Seeds

The most interesting and advanced automated seed management feature is making sure that a Garden cluster has enough seeds registered to schedule new shoots (and, in the future, reschedule shoots from drained seeds) without exceeding the seeds capacity for shoots, but not more than actually needed at any given moment. This would involve introducing an auto-scaling mechanism for seeds in Garden clusters.

The proposed solution builds upon the ideas introduced earlier. The ManagedSeedSet resource (and in the future, also the ManagedSeedDeployment resource) could have a scale subresource that changes the replicas field. This would allow a new “seed autoscaler” controller to scale these resources via a special “autoscaler” resource (for example SeedAutoscaler), similarly to how the Kubernetes Horizontal Pod Autoscaler controller scales pods, as described in Horizontal Pod Autoscaler Walkthrough.

The primary metric used for scaling should be the number of shoots already scheduled onto a seed, either as a direct value or as a percentage of the seed’s capacity for shoots introduced in Ensuring Seeds Capacity for Shoots Is Not Exceeded (utilization). Later, custom metrics based on other resources, including provider-specific resources, could be considered as well.

Note: Even if the controller is called Horizontal Pod Autoscaler, it is capable of scaling any resource with a scale subresource, using any custom metric. Therefore, initially it was proposed to use this controller directly. However, a number of important drawbacks were identified with this approach, and so it is no longer proposed here.

SeedAutoscaler Resource

The SeedAutoscaler automatically scales the number of ManagedSeeds in a ManagedSeedSet based on observed resource utilization. The resource could be any resource that is tracked via the capacity and allocatable fields in the Seed status, including in particular the number of shoots already scheduled onto the seed.

The SeedAutoscaler is implemented as a custom resource and a new controller. The resource determines the behavior of the controller. The SeedAutoscaler resource has a scaleTargetRef that specifies the target resource to be scaled, the minimum and maximum number of replicas, as well as a list of metrics. The only supported metric type initially is Resource for resources that are tracked via the capacity and allocatable fields in the Seed status. The resource target can be of type Utilization or AverageValue.

Example SeedAutoscaler resource:

apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: SeedAutoscaler
metadata:
  name: crazy-botany
  namespace: garden
spec:
  scaleTargetRef:
    apiVersion: seedmanagement.gardener.cloud/v1alpha1
    kind: ManagedSeedSet
    name: crazy-botany
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource # Only Resource is supported
    resource:
      name: shoots
      target:
        type: Utilization # Utilization or AverageValue
        averageUtilization: 50

SeedAutoscaler Controller

SeedAutoscaler resources are reconciled by a new seed autoscaler controller, either in gardener-controller-manager or out-of-tree, similarly to cluster-autoscaler. The controller periodically adjusts the number of replicas in a ManagedSeedSet to match the observed average resource utilization to the target specified by the user.
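The proposal does not prescribe a scaling formula. As a sketch, assuming the controller borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, the desired replica count could be derived like this (the function name and parameters are hypothetical):

```python
import math


def desired_replicas(current, observed_utilization, target_utilization,
                     min_replicas, max_replicas):
    """HPA-style proportional scaling, clamped to the configured bounds."""
    # Assumption: same rule as the Horizontal Pod Autoscaler,
    # desired = ceil(current * observed / target).
    desired = math.ceil(current * observed_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))
```

For the example resource above (target utilization 50, between 1 and 10 replicas), three seeds at an average utilization of 80 percent would be scaled out to five.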

Note: The SeedAutoscaler controller should perhaps not be limited to evaluating only metrics; it could also take into account taints, label selectors, etc. This is not yet reflected in the example SeedAutoscaler resource above. Such details are intentionally not specified in this GEP; they should be further explored in the issues created to track the actual implementation.

Evaluating Metrics for Autoscaling

The metrics used by the controller, for example the shoots metric above, could be evaluated in one of the following ways:

  • Directly, by looking at the capacity and allocatable fields in the Seed status and comparing to the actual resource consumption calculated by simply counting all shoots that meet certain criteria (e.g. shoots that are scheduled onto the seed), then taking an average over all seeds in the set.
  • By sampling existing metrics exported for example by gardener-metrics-exporter.

The second approach decouples the seed autoscaler controller from the actual metrics evaluation, and therefore allows plugging in new metrics more easily. It also has the advantage that the exported metrics could also be used for other purposes, e.g. for triggering Prometheus alerts or building Grafana dashboards. It has the disadvantage that the seed autoscaler controller would depend on the metrics exporter to do its job properly.
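For comparison, the first (direct) approach amounts to a small computation such as the one below. The data shapes are assumptions for illustration: `seeds` maps a seed name to its allocatable number of shoots, and each shoot carries a `seedName` field.

```python
def average_shoot_utilization(seeds, shoots):
    """Average percentage of allocatable shoot slots in use across all seeds in the set."""
    counts = {name: 0 for name in seeds}
    for shoot in shoots:
        seed_name = shoot.get("seedName")
        if seed_name in counts:  # ignore shoots scheduled onto seeds outside the set
            counts[seed_name] += 1
    ratios = [
        100 * counts[name] / seeds[name]["allocatable"]
        for name in seeds
        if seeds[name]["allocatable"] > 0
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

The result can be compared against the averageUtilization target of the SeedAutoscaler resource.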

11 - 17 Shoot Control Plane Migration Bad Case

Shoot Control Plane Migration “Bad Case” Scenario

The migration flow described as part of GEP-7 can only be executed if both the Garden cluster and source seed cluster are healthy, and gardenlet in the source seed cluster can connect to the Garden cluster. In this case, gardenlet can directly scale down the shoot’s control plane in the source seed, after checking the spec.seedName field.

However, there might be situations in which gardenlet in the source seed cluster can’t connect to the Garden cluster and determine that spec.seedName has changed. Similarly, the connection to the seed kube-apiserver could also be broken. This might be caused by issues with the seed cluster itself. In other situations, the migration flow steps in the source seed might have started but might not be able to finish successfully. In all such cases, it should still be possible to migrate a shoot’s control plane to a different seed, even though executing the migration flow steps in the source seed might not be possible. The potential “split brain” situation caused by having the shoot’s control plane components attempting to reconcile the shoot resources in two different seeds must still be avoided, by ensuring that the shoot’s control plane in the source seed is deactivated before it is activated in the destination seed.

The mechanisms and adaptations described below have been tested as part of a PoC prior to describing them here.

Owner Election / Copying Snapshots

To achieve the goals outlined above, an “owner election” (or rather, “ownership passing”) mechanism is introduced to ensure that the source and destination seeds are able to successfully negotiate a single “owner” during the migration. This mechanism is based on special owner DNS records that uniquely identify the seed that currently hosts the shoot’s control plane (“owns” the shoot).

For example, for a shoot named i500152-gcp in project dev that uses an internal domain suffix internal.dev.k8s.ondemand.com and is scheduled on a seed with an identity shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev, the owner DNS record is a TXT record with a domain name owner.i500152-gcp.dev.internal.dev.k8s.ondemand.com and a single value shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev. The owner DNS record is created and maintained by reconciling an owner DNSRecord resource, if the recently introduced DNSRecords feature is enabled via the UseDNSRecords feature gate.

Unlike other extension resources, the owner DNSRecord resource is not reconciled every time the shoot is reconciled, but only when the resource is created. Therefore, the owner DNS record value (the owner ID) is updated only when the shoot is migrated to a different seed. For more information, see Add handling of owner DNSRecord resources.

The owner DNS record domain name and owner ID are passed to components that need to perform ownership checks, such as the backup-restore container of the etcd-main StatefulSet, and all extension controllers. These components then check regularly whether the actual owner ID (the value of the record) matches the passed ID. If they don’t match, the ownership check is considered failed, which causes the special behavior described below.
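A minimal sketch of such an ownership check is shown below. The record name layout follows the example given earlier; the function names are hypothetical, and `resolve_txt` stands in for whatever DNS lookup a real component would use (here it is any callable that returns the record value, or None if there is none).

```python
def owner_domain_name(shoot_name, project_name, internal_domain):
    """Build the owner DNS record name, following the example layout above."""
    return f"owner.{shoot_name}.{project_name}.{internal_domain}"


def check_ownership(resolve_txt, domain_name, expected_owner_id):
    """True if the owner ID stored in the record matches the ID passed to the component."""
    actual_owner_id = resolve_txt(domain_name)  # hypothetical lookup callable
    return actual_owner_id == expected_owner_id
```

How a failed lookup is handled is a separate concern, covered in Handling Inability to Resolve the Owner DNS Record.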

Note: A previous revision of this document proposed using “sync objects” written to and read from the backup container of the source seed as JSON files by the etcd-backup-restore processes in both seeds. With the introduction of owner DNS records such sync objects are no longer needed.

For the destination seed to actually become the owner, it needs to acquire the shoot’s etcd data by copying the final full snapshot (and potentially also older snapshots) from the backup container of the source seed.

The mechanism to copy the snapshots and pass the ownership from the source to the destination seed consists of the following steps:

  1. The reconciliation flow (“restore” phase) is triggered in the destination seed without first executing the migration flow in the source seed (or perhaps it was executed, but it failed, and its state is currently unknown).

  2. The owner DNSRecord resource is created in the destination seed. As a result, the actual owner DNS record is updated with the destination seed ID. From this point, ownership checks by the etcd-backup-restore process and extension controller watchdogs in the source seed will fail, which will cause the special behavior described below.

  3. An additional “source” backup entry referencing the source seed backup bucket is deployed to the Garden cluster and the destination seed and reconciled by the backup entry controller. As a result, a secret with the appropriate credentials for accessing the source seed backup container named source-etcd-backup is created in the destination seed. The normal backup entry (referencing the destination seed backup container) is also deployed and reconciled, as usual, resulting in the usual etcd-backup secret being created.

  4. A special “copy” version of the etcd-main Etcd resource is deployed to the destination seed. In its backup section, this resource contains a sourceStore in addition to the usual store, which contains the parameters needed to use the source seed backup container, such as its name and the secret created in the previous step.

    spec:
      backup:
        ...
        store:
          container: 408740b8-6491-415e-98e6-76e92e5956ac
          secretRef:
            name: etcd-backup
          ...
        sourceStore:
          container: d1435fea-cd5e-4d5b-a198-81f4025454ff
          secretRef:
            name: source-etcd-backup
          ...
    
  5. The etcd-druid in the destination seed reconciles the above resource by deploying an etcd-copy Job that contains a single backup-restore container. It executes the newly introduced copy command of etcd-backup-restore that copies the snapshots from the source to the destination backup container.

  6. Before starting the copy itself, the etcd-backup-restore process in the destination seed checks if a final full snapshot (a full snapshot marked as final=true) exists in the backup container. If such a snapshot is not found, it waits for it to appear in order to proceed. This waiting is up to a certain timeout that should be sufficient for a full snapshot to be taken; after this timeout has elapsed, it proceeds anyway, and the reconciliation flow continues from step 9. As described in Handling Inability to Access the Backup Container below, this is safe to do.

  7. The etcd-backup-restore process in the source seed detects that the owner ID in the owner DNS record is different from the expected owner ID (because it was updated in step 2) and switches to a special “final snapshot” mode. In this mode the regular snapshotter is stopped, the readiness probe of the main etcd container starts returning 503, and one final full snapshot is taken. This snapshot is marked as final=true in order to ensure that it’s only taken once, and in order to enable the etcd-backup-restore process in the destination seed to find it (see step 6).

    Note: While testing our PoC, we noticed that simply making the readiness probe of the main etcd container fail doesn’t terminate the existing open connections from kube-apiserver to etcd. For this to happen, either the kube-apiserver or the etcd process has to be restarted at least once. Therefore, when the snapshotter is stopped because an ownership change has been detected, the main etcd process is killed (using SIGTERM to allow graceful termination) to ensure that any open connections from kube-apiserver are terminated. For this to work, the 2 containers must share the process namespace.

  8. Since the kube-apiserver process in the source seed is no longer able to connect to etcd, all shoot control plane controllers (kube-controller-manager, kube-scheduler, machine-controller-manager, etc.) and extension controllers reconciling shoot resources in the source seed that require a connection to the shoot in order to work start failing. All remaining extension controllers are prevented from reconciling shoot resources via the watchdogs mechanism. At this point, the source seed has effectively lost its ownership of the shoot, and it is safe for the destination seed to assume the ownership.

  9. After the etcd-backup-restore process in the destination seed detects that a final full snapshot exists, it copies all snapshots (or a subset of all snapshots) from the source to the destination backup container. When this is done, the Job finishes successfully which signals to the reconciliation flow that the snapshots have been copied.

    Note: To save time, only the final full snapshot taken in step 7, or a subset defined by some criteria, could be copied, instead of all snapshots.

  10. The special “copy” version of the etcd-main Etcd resource is deleted from the destination seed (where it was deployed in step 4), and as a result the etcd-copy Job is also deleted by etcd-druid.

  11. The additional “source” backup entry referencing the source seed backup container is deleted from the Garden cluster and the destination seed. As a result, its corresponding source-etcd-backup secret is also deleted from the destination seed.

  12. From this point, the reconciliation flow proceeds as already described in GEP-7. This is safe, since the source seed cluster is no longer able to interfere with the shoot.
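The bounded wait in step 6 can be illustrated with the following sketch. It is not the etcd-backup-restore implementation: `list_snapshots` is a hypothetical callable returning snapshot metadata as dictionaries with a `final` flag, and the clock and sleep functions are injectable only to make the example easy to test.

```python
import time


def wait_for_final_snapshot(list_snapshots, timeout, poll_interval=1.0,
                            clock=time.monotonic, sleep=time.sleep):
    """Poll until a snapshot marked as final appears, or give up after the timeout.

    Returns True if a final snapshot was found, False if the timeout elapsed.
    In both cases the caller proceeds with copying.
    """
    deadline = clock() + timeout
    while True:
        if any(s.get("final") for s in list_snapshots()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A False result corresponds to the case where the copy proceeds without a final full snapshot.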

Handling Inability to Access the Backup Container

The mechanism described above assumes that the etcd-backup-restore process in the source seed is able to access its backup container in order to take snapshots. If this is not the case, but an ownership change was detected, the etcd-backup-restore process still sets the readiness probe status of the main etcd container to 503, and kills the main etcd process as described above to ensure that any open connections from kube-apiserver are terminated. This effectively deactivates the source seed control plane to ensure that the ownership of the shoot can be passed to a different seed.

Because of this, the etcd-backup-restore process in the destination seed responsible for copying the snapshots can avoid waiting forever for a final full snapshot to appear. Instead, after a certain timeout has elapsed, it can proceed with the copying. In this situation, whatever latest snapshot is found in the source backup container will be restored in the destination seed. The shoot is still migrated to a healthy seed at the cost of losing the etcd data that accumulated between the point in time when the connection to the source backup container was lost, and the point in time when the source seed cluster was deactivated.

When the connection to the backup container is restored in the source seed, a final full snapshot will be eventually taken. Depending on the stage of the restoration flow in the destination seed, this snapshot may be copied to the destination seed and restored, or it may simply be ignored since the snapshots have already been copied.

Handling Inability to Resolve the Owner DNS Record

The situation when the owner DNS record cannot be resolved is treated similarly to a failed ownership check: the etcd-backup-restore process sets the readiness probe status of the main etcd container to 503, and kills the main etcd process as described above to ensure that any open connections from kube-apiserver are terminated, effectively deactivating the source seed control plane. The final full snapshot is not taken in this case to ensure that the control plane can be re-activated if needed.

When the owner DNS record can be resolved again, the following 2 situations are possible:

  • If the source seed is still the owner of the shoot, the etcd-backup-restore process will set the readiness probe status of the main etcd container to 200, so kube-apiserver will be able to connect to etcd and the source seed control plane will be activated again.
  • If the source seed is no longer the owner of the shoot, the etcd readiness probe will continue to fail, and the source seed control plane will remain inactive. In addition, the final full snapshot will be taken at this time, for the same reason as described in Handling Inability to Access the Backup Container.

Note: We expect that actual DNS outages are extremely unlikely. A more likely reason for an inability to resolve a DNS record could be network issues with the underlying infrastructure. In such cases, the shoot would usually not be usable / reachable anyway, so deactivating its control plane would not cause a worse outage.
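The three outcomes described in this section (record unresolvable, still the owner, no longer the owner) can be summarized as a small decision function. This is a sketch with hypothetical names; None is used here to represent a record that could not be resolved.

```python
from typing import NamedTuple, Optional


class EtcdAction(NamedTuple):
    readiness_status: int      # HTTP status returned by the etcd readiness probe
    kill_etcd: bool            # terminate the main etcd process to drop open connections
    take_final_snapshot: bool  # take the final full snapshot


def react_to_owner_check(actual_owner_id: Optional[str],
                         expected_owner_id: str) -> EtcdAction:
    """Map the result of resolving the owner DNS record to the resulting behavior."""
    if actual_owner_id is None:
        # Record cannot be resolved: deactivate, but keep re-activation possible.
        return EtcdAction(503, True, False)
    if actual_owner_id == expected_owner_id:
        # Still the owner: the control plane is (re-)activated.
        return EtcdAction(200, False, False)
    # Ownership has been passed on: deactivate and take the final full snapshot.
    return EtcdAction(503, True, True)
```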

Migration Flow Adaptations

Certain changes to the migration flow are needed in order to ensure that it is compatible with the owner election mechanism described above. Instead of taking a full snapshot of the source seed etcd, the flow deletes the owner DNS record by deleting the owner DNSRecord resource. This causes the ownership check by etcd-backup-restore to fail, and the final full snapshot to be eventually taken, so the migration flow waits for a final full snapshot to appear as the last step before deleting the shoot namespace in the source seed. This ensures that the reconciliation flow described above will find a final full snapshot waiting to be copied at step 6.

Checking for the final full snapshot is performed by calling the already existing etcd-backup-restore endpoint snapshot/latest. This is possible, since the backup-restore container is always running at this point.

After the final full snapshot has been taken, the readiness probe of the main etcd container starts failing, which means that if the migration flow is retried due to an error it must skip the step that waits for etcd-main to become ready. To determine if this is the case, a check whether the final full snapshot has been taken or not is performed by calling the same etcd-backup-restore endpoint, i.e. snapshot/latest. This is possible if the etcd-main Etcd resource exists with non-zero replicas. Otherwise:

  • If the resource doesn’t exist, it must have been already deleted, so the final full snapshot must have been already taken.
  • If it exists with zero replicas, the shoot must be hibernated, and the migration flow must have never been executed (since it scales up etcd as one of its first steps), so the final full snapshot must not have been taken yet.
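The decision logic for a retried migration flow can be sketched as follows. The function names are hypothetical; `etcd` is None if the etcd-main Etcd resource does not exist, otherwise a dictionary with a `replicas` field, and `latest_snapshot_is_final` stands in for a call to the snapshot/latest endpoint.

```python
def final_snapshot_taken(etcd, latest_snapshot_is_final):
    """Decide whether the final full snapshot has already been taken."""
    if etcd is None:
        # Resource already deleted, so the snapshot must have been taken before.
        return True
    if etcd["replicas"] == 0:
        # Hibernated shoot: the migration flow never ran, so no final snapshot yet.
        return False
    # Resource exists with non-zero replicas: ask etcd-backup-restore.
    return latest_snapshot_is_final()


def should_wait_for_etcd_ready(etcd, latest_snapshot_is_final):
    """The wait for etcd-main readiness is skipped once the final snapshot exists."""
    return not final_snapshot_taken(etcd, latest_snapshot_is_final)
```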

Extension Controller Watchdogs

Some extension controllers will stop reconciling shoot resources after the connection to the shoot’s kube-apiserver is lost. Others, most notably the infrastructure controller, will not be affected. Even though new shoot reconciliations won’t be performed by gardenlet, such extension controllers might be stuck in a retry loop triggered by a previous reconciliation, which may cause them to reconcile their resources after gardenlet has already stopped reconciling the shoot. In addition, a reconciliation started when the seed still owned the shoot might take some time and therefore might still be running after the ownership has changed. To ensure that the source seed is completely deactivated, an additional safety mechanism is needed.

This mechanism should handle the following interesting cases:

  • gardenlet cannot connect to the Garden kube-apiserver. In this case it cannot fetch shoots and therefore does not know if control plane migration has been triggered. Even though gardenlet will not trigger new reconciliations, extension controllers could still attempt to reconcile their resources if they are stuck in a retry loop from a previous reconciliation, and already running reconciliations will not be stopped.
  • gardenlet cannot connect to the seed’s kube-apiserver. In this case gardenlet knows if migration has been triggered, but it will not start shoot migration or reconciliation as it will first check the seed conditions and try to update the Cluster resource, both of which will fail. Extension controllers could still be able to connect to the seed’s kube-apiserver (if they are not running where gardenlet is running), and similarly to the previous case, they could still attempt to reconcile their resources.
  • The seed components (etcd-druid, extension controllers, etc.) cannot connect to the seed’s kube-apiserver. In this case extension controllers would not be able to reconcile their resources as they cannot fetch them from the seed’s kube-apiserver. When the connection to the kube-apiserver comes back, the controllers might be stuck in a retry loop from a previous reconciliation, or the resources could still be annotated with gardener.cloud/operation=reconcile. This could lead to a race condition depending on who manages to update or get the resources first. If gardenlet manages to update the resources before they are read by the extension controllers, they would be properly updated with gardener.cloud/operation=migrate. Otherwise, they would be reconciled as usual.

Note: A previous revision of this document proposed using “cluster leases” as such an additional safety mechanism. With the introduction of owner DNS records cluster leases are no longer needed.

The safety mechanism is based on extension controller watchdogs. These are simply additional goroutines that are started when a reconciliation is started by an extension controller. These goroutines perform an ownership check on a regular basis using the owner DNS record, similar to the check performed by the etcd-backup-restore process described above. If the check fails, the watchdog cancels the reconciliation context, which immediately aborts the reconciliation.
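The actual watchdogs are goroutines that cancel a Go context. The same idea is sketched here in Python for illustration only, with a background thread and `threading.Event` objects standing in for the reconciliation context. All names are hypothetical, and `check_ownership` is any callable that returns True while the seed still owns the shoot.

```python
import threading


def start_watchdog(check_ownership, cancel, interval, stop):
    """Run periodic ownership checks in the background.

    cancel: Event set by the watchdog when the ownership check fails
            (the reconciliation is expected to abort when it sees this).
    stop:   Event set by the caller when the reconciliation has finished.
    """
    def run():
        # stop.wait() returns True as soon as the reconciliation is done.
        while not stop.wait(interval):
            if not check_ownership():
                cancel.set()
                return

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

A reconciliation would start the watchdog at its beginning, check `cancel` between its steps, and set `stop` when it finishes.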

Note: The dns-external extension controller is the only extension controller that neither needs the shoot’s kube-apiserver, nor uses the watchdog mechanism described here. Therefore, this controller will continue reconciling DNSEntry resources even after the source seed has lost the ownership of the shoot. With the PoC, we manually delete the DNSOwner resources from the source seed cluster to prevent this from happening. Eventually, the dns-external controller should be adapted to use the owner DNS records to ensure that it disables itself after the seed has lost the ownership of the shoot. Changes in this direction have already been agreed upon and the relevant PRs have been proposed.

12 - Bastion Management and SSH Key Pair Rotation

GEP-15: Bastion Management and SSH Key Pair Rotation

Motivation

gardenctl (v1) can set up ssh sessions to the nodes of the targeted shoot cluster. To this end, infrastructure resources like VMs, public IPs, firewall rules, etc. have to be created. gardenctl cleans up these resources after the ssh session has terminated (or rather when the operator is done with her work). However, there were issues in the past where these infrastructure resources were not properly cleaned up afterwards, e.g. due to an error that was not retried. Hence, the proposal is to have a dedicated controller (for each infrastructure) that manages the infrastructure resources and their cleanup. The current gardenctl also re-uses the ssh node credentials for the bastion host. While that is possible, it would be safer to use personal or generated ssh key pairs to access the bastion host. The static shoot-specific ssh key pair should be rotated regularly, e.g. once in the maintenance time window. This also means that we can no longer create the node VMs with infrastructure public keys, as these cannot be revoked or rotated (e.g. in AWS) without terminating the VM itself.

Changes to the Bastion resource should only be allowed for controllers on seeds that are responsible for it. This cannot be restricted when using custom resources. The proposal, as outlined below, suggests implementing the necessary changes in the Gardener core components and adapting the SeedAuthorizer to consider Bastion resources that the Gardener API Server serves.

Goals

  • Operators can request and will be granted time-limited ssh access to shoot cluster nodes via bastion hosts.
  • To that end, requestors must present their public ssh key and only this will be installed into sshd on the bastion hosts.
  • The bastion hosts will be firewalled and ingress traffic will be permitted only from the client IP of the requestor. Except for traffic on port 22 to the cluster worker nodes, no egress from the bastion is allowed.
  • The actual node ssh private key (resp. key pair) will be rotated by Gardener and access to the nodes is only possible with this constantly rotated key pair and not with the personal one that is used only for the bastion host.
  • Bastion host and access are granted only for the duration of this operator request (of course multiple ssh sessions are possible, in parallel or repeatedly, but after “the time is up”, access is no longer possible).
  • By these means (personal public key and allow-listed client IP) nobody else can use (a.k.a. impersonate) the requestor (not even other operators).
  • Necessary infrastructure resources for ssh access (such as VMs, public IPs, firewall rules, etc.) are automatically created and also terminated after usage, but at the latest after the above mentioned time span is up.

Non-Goals

  • Node-specific access
  • Auditability on operating system level (not only auditing the ssh login, but everything that is done on a node and other respective resources, e.g. by using dedicated operating system users)
  • Reuse of temporarily created necessary infrastructure resources by different users

Proposal

Involved Components

The following is a list of involved components that either need to be newly introduced or, if they already exist, extended:

  • Gardener API Server (GAPI)
  • gardenlet
    • Deploys Bastion CRD under the extensions.gardener.cloud API Group to the Seed, see resource example below
    • Similar to BackupBuckets or BackupEntry, the gardenlet watches the Bastion resource in the garden cluster and creates a seed-local Bastion resource, on which the provider-specific bastion controller acts
  • gardenctlv2 (or any other client)
    • Creates Bastion resource in the garden cluster
    • Establishes an ssh connection to a shoot node, using a bastion host as proxy
    • Heartbeats / keeps alive the Bastion resource during ssh connection
  • Gardener extension provider
  • Gardener Controller Manager (GCM)
    • Bastion heartbeat controller
      • Cleans up Bastion resource on missing heartbeat.
      • Is configured with a maxLifetime for the Bastion resource
  • Gardener (RBAC)

SSH Flow

  1. Users should only get the RBAC permission to create / update Bastion resources for a namespace, if they should be allowed to ssh onto the shoot nodes in this namespace. A project member with admin role will have these permissions.
  2. User/gardenctlv2 creates Bastion resource in garden cluster (see resource example below)
    • First, gardenctl figures out the public IP of the user’s machine, either by calling an external service (gardenctl (v1) uses https://github.com/gardener/gardenctl/blob/master/pkg/cmd/miscellaneous.go#L226) or by calling a binary that prints the public IP(s) to stdout. The binary should be configurable. The result is set under spec.ingress[].ipBlock.cidr
    • Creates new ssh key pair. The newly created key pair is used only once for each bastion host, so it has a 1:1 relationship to it. It is cleaned up after it is not used anymore, e.g. if the Bastion resource was deleted.
    • The public ssh key is set under spec.sshPublicKey
    • The targeted shoot is set under spec.shootRef
  3. GAPI Admission Plugin for the Bastion resource in the garden cluster
    • on creation, sets metadata.annotations["gardener.cloud/created-by"] according to the user that created the resource
    • when the annotation gardener.cloud/operation: keepalive is set, GAPI removes it from the annotations and sets status.lastHeartbeatTimestamp to the current timestamp. The status.expirationTimestamp is calculated by taking the last heartbeat timestamp and adding x minutes (configurable, default 60 minutes).
    • validates that only the creator of the bastion (see gardener.cloud/created-by annotation) can update spec.ingress
    • validates that a Bastion can only be created for a Shoot if that Shoot is already assigned to a Seed
    • sets spec.seedName and spec.providerType based on the spec.shootRef
  4. gardenlet
  5. Gardener extension provider / Bastion Controller on Seed:
    • With own Bastion Custom Resource Definition in the seed under the api group extensions.gardener.cloud/v1alpha1
    • Watches Bastion custom resources that are created by the gardenlet in the seed
    • Controller reads cloudprovider credentials from seed-shoot namespace
    • Deploys infrastructure resources
      • Bastion VM. Uses user data from spec.userData
      • attaches public IP, creates security group, firewall rules, etc.
    • Updates status of Bastion resource:
      • With bastion IP under status.ingress.ip or hostname under status.ingress.hostname
      • Updates the status.lastOperation with the status of the last reconcile operation
  6. gardenlet
    • Syncs back the status.ingress and status.conditions of the Bastion resource in the seed to the garden cluster in case it changed
  7. gardenctl
    • initiates the ssh session once status.conditions['BastionReady'] of the Bastion resource in the garden cluster is true
      • locates private ssh key matching spec["sshPublicKey"] which was configured beforehand by the user
      • reads bastion IP (status.ingress.ip) or hostname (status.ingress.hostname)
      • reads the private key from the ssh key pair for the shoot node
      • opens ssh connection to the bastion and from there to the respective shoot node
    • runs heartbeat in parallel as long as the ssh session is open by annotating the Bastion resource with gardener.cloud/operation: keepalive
  8. GCM:
    • Once status.expirationTimestamp is reached, the Bastion will be marked for deletion
  9. gardenlet:
    • Once the Bastion resource in the garden cluster is marked for deletion, it marks the Bastion resource in the seed for deletion
  10. Gardener extension provider / Bastion Controller on Seed:
    • all created resources will be cleaned up
    • On success, removes finalizer on Bastion resource in seed
  11. gardenlet:
    • removes finalizer on Bastion resource in garden cluster
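
The heartbeat and expiration handling from steps 3 and 8 can be illustrated with a short Go sketch. The function and constant names are hypothetical, `DemoMaxLifetime` is a made-up value, and measuring maxLifetime from the creation time of the Bastion is an assumption, since the proposal only states that the controller is configured with such a limit.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// DefaultExtension is the default "x minutes" added on each keepalive.
	DefaultExtension = 60 * time.Minute
	// DemoMaxLifetime is a made-up value for the GCM's maxLifetime setting.
	DemoMaxLifetime = 24 * time.Hour
)

// NextExpiration computes status.expirationTimestamp from the last heartbeat.
func NextExpiration(lastHeartbeat time.Time, extension time.Duration) time.Time {
	return lastHeartbeat.Add(extension)
}

// ShouldDelete reports whether the Bastion has to be marked for deletion:
// either the expiration timestamp is reached or the maximum lifetime
// (assumed to start at creation) is exceeded.
func ShouldDelete(now, created, expiration time.Time, maxLifetime time.Duration) bool {
	return !now.Before(expiration) || !now.Before(created.Add(maxLifetime))
}

// MustParse parses an RFC 3339 timestamp and panics on malformed input.
func MustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	created := MustParse("2021-03-19T11:00:00Z")
	heartbeat := MustParse("2021-03-19T11:58:00Z")
	expiration := NextExpiration(heartbeat, DefaultExtension)

	fmt.Println(expiration.Format(time.RFC3339))
	fmt.Println(ShouldDelete(MustParse("2021-03-19T12:30:00Z"), created, expiration, DemoMaxLifetime))
	fmt.Println(ShouldDelete(MustParse("2021-03-19T12:58:00Z"), created, expiration, DemoMaxLifetime))
}
```

With a heartbeat at 11:58 and the default of 60 minutes, the Bastion expires at 12:58 unless another keepalive arrives before that.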

Resource Example

Bastion resource in the garden cluster

apiVersion: operations.gardener.cloud/v1alpha1
kind: Bastion
metadata:
  generateName: cli-
  name: cli-abcdef
  namespace: garden-myproject
  annotations:
    gardener.cloud/created-by: foo # immutable, set by the GAPI Admission Plugin
    # gardener.cloud/operation: keepalive # this annotation is removed by the GAPI and the status.lastHeartbeatTimestamp and status.expirationTimestamp will be updated accordingly
spec:
  shootRef: # namespace cannot be set / it's the same as .metadata.namespace
    name: my-cluster # immutable

  # the following fields are set by the GAPI
  seedName: aws-eu2
  providerType: aws

  sshPublicKey: c3NoLXJzYSAuLi4K # immutable, public `ssh` key of the user

  ingress: # can only be updated by the creator of the bastion
  - ipBlock:
      cidr: 1.2.3.4/32 # public IP of the user. CIDR is a string representing the IP Block. Valid examples are "192.168.1.1/24" or "2001:db9::/64"

status:
  observedGeneration: 1

  # the following fields are managed by the controller in the seed and synced by gardenlet
  ingress: # IP or hostname of the bastion
    ip: 1.2.3.5
    # hostname: foo.bar

  conditions:
  - type: BastionReady # when the `status` is true of condition type `BastionReady`, the client can initiate the `ssh` connection
    status: 'True'
    lastTransitionTime: "2021-03-19T11:59:00Z"
    lastUpdateTime: "2021-03-19T11:59:00Z"
    reason: BastionReady
    message: Bastion for the cluster is ready.

  # the following fields are only set by the GAPI
  lastHeartbeatTimestamp: "2021-03-19T11:58:00Z" # will be set when setting the annotation gardener.cloud/operation: keepalive
  expirationTimestamp: "2021-03-19T12:58:00Z" # extended on each keepalive
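
A client has to turn the detected public IP into the value of spec.ingress[].ipBlock.cidr shown above. The following Go sketch shows one way to do that with the standard library; `SingleHostCIDR` is a hypothetical helper and not part of gardenctl.

```go
package main

import (
	"fmt"
	"net/netip"
)

// SingleHostCIDR turns one IP address into the CIDR that matches only that
// address: /32 for IPv4 and /128 for IPv6.
func SingleHostCIDR(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", fmt.Errorf("invalid IP %q: %w", ip, err)
	}
	// Treat IPv4-mapped IPv6 addresses (::ffff:1.2.3.4) as plain IPv4.
	addr = addr.Unmap()
	return netip.PrefixFrom(addr, addr.BitLen()).String(), nil
}

func main() {
	for _, ip := range []string{"1.2.3.4", "2001:db8::1", "not-an-ip"} {
		cidr, err := SingleHostCIDR(ip)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(cidr)
	}
}
```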

Bastion custom resource in the seed cluster

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Bastion
metadata:
  name: cli-abcdef
  namespace: shoot--myproject--mycluster
spec:
  userData: |- # this is normally base64-encoded, but decoded for the example. Contains spec.sshPublicKey from Bastion resource in garden cluster
    #!/bin/bash
    # create user
    # add ssh public key to authorized_keys
    # ...

  ingress:
  - ipBlock:
      cidr: 1.2.3.4/32

  type: aws # from extensionsv1alpha1.DefaultSpec

status:
  observedGeneration: 1
  ingress:
    ip: 1.2.3.5
    # hostname: foo.bar
  conditions:
  - type: BastionReady
    status: 'True'
    lastTransitionTime: "2021-03-19T11:59:00Z"
    lastUpdateTime: "2021-03-19T11:59:00Z"
    reason: BastionReady
    message: Bastion for the cluster is ready.

SSH Key Pair Rotation

Currently, the ssh key pair for the shoot nodes is created once during shoot cluster creation. This key pair should be rotated on a regular basis.

Rotation Proposal

  • gardeneruser original user data component:
    • The gardeneruser create script should be changed into a reconcile script, and renamed accordingly. It needs to be adapted so that the authorized_keys file will be updated / overwritten with the current and old ssh public key from the cloud-config user data.
  • Rotation trigger:
    • Once in the maintenance time window
    • On demand, by annotating the shoot with gardener.cloud/operation: rotate-ssh-keypair
  • On rotation trigger:
    • gardenlet
      • Prerequisite of ssh key pair rotation: all nodes of all the worker pools have successfully applied the desired version of their cloud-config user data
      • Creates or updates the secret ssh-keypair.old with the content of ssh-keypair in the seed-shoot namespace. The old private key can be used by clients as fallback, in case the new ssh public key is not yet applied on the node
      • Generates new ssh-keypair secret
      • The OperatingSystemConfig needs to be re-generated and deployed with the new and old ssh public key
    • As usual (for more details, see here):
      • Once the cloud-config-<X> secret in the kube-system namespace of the shoot cluster is updated, it will be picked up by the downloader script (checks every 30s for updates)
      • The downloader runs the “execution” script from the cloud-config-<X> secret
      • The “execution” script includes also the original user data script, which it writes to PATH_CLOUDCONFIG, compares it against the previous cloud config and runs the script in case it has changed
      • Running the original user data script will also run the gardeneruser component, where the authorized_keys file will be updated
      • After the most recent cloud-config user data was applied, the “execution” script annotates the node with checksum/cloud-config-data: <cloud-config-checksum> to indicate the success
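
The rotation steps performed by gardenlet can be sketched as follows. This is a simplified model under stated assumptions: the secrets of the seed-shoot namespace are represented as a plain map that holds only public keys, key generation is injected as a function, and `Rotate`, `AuthorizedKeys`, and `SequentialKeys` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Secrets models the seed-shoot namespace: secret name -> ssh public key.
// Private keys are left out to keep the sketch short.
type Secrets map[string]string

const (
	currentSecret = "ssh-keypair"
	oldSecret     = "ssh-keypair.old"
)

// Rotate copies ssh-keypair to ssh-keypair.old and stores a new key. It
// refuses to rotate while nodes have not applied the desired cloud-config.
func Rotate(s Secrets, generate func() string, allNodesUpToDate bool) error {
	if !allNodesUpToDate {
		return errors.New("not all nodes have applied the desired cloud-config user data")
	}
	if cur, ok := s[currentSecret]; ok {
		s[oldSecret] = cur
	}
	s[currentSecret] = generate()
	return nil
}

// AuthorizedKeys renders the authorized_keys content with the current key
// first, followed by the old key if one exists.
func AuthorizedKeys(s Secrets) string {
	var b strings.Builder
	for _, name := range []string{currentSecret, oldSecret} {
		if key, ok := s[name]; ok {
			b.WriteString(key + "\n")
		}
	}
	return b.String()
}

// SequentialKeys returns a fake generator that yields KEY1, KEY2, ...
func SequentialKeys() func() string {
	n := 0
	return func() string {
		n++
		return fmt.Sprintf("ssh-rsa KEY%d", n)
	}
}

func main() {
	secrets := Secrets{}
	gen := SequentialKeys()
	for i := 0; i < 3; i++ {
		if err := Rotate(secrets, gen, true); err != nil {
			fmt.Println("rotation skipped:", err)
		}
		fmt.Printf("after rotation %d:\n%s", i+1, AuthorizedKeys(secrets))
	}
}
```

Keeping exactly one old key means a client holding the previous private key can still log in until the node has applied the new user data, while keys from earlier rotations are dropped.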

Limitations

Each operating system has its own default user (e.g. core, admin, ec2-user etc). These users get their SSH keys during VM creation (however, there is a different handling on Google Cloud Platform as stated below). These keys are currently neither rotated nor removed from the authorized_keys file. This means that the initial ssh key will still be valid for the default operating system user.

On Google Cloud Platform, the VMs do not have any static users (i.e. no gardener user) and there is an agent on the nodes that syncs the users with their SSH keypairs from the GCP IAM service.

13 - Dynamic kubeconfig generation for Shoot clusters

GEP-16: Dynamic kubeconfig generation for Shoot clusters

Summary

This GEP introduces a new Shoot subresource called AdminKubeconfigRequest that allows users to dynamically generate a short-lived kubeconfig that can be used to access the Shoot cluster as cluster-admin.

Motivation

Today, when access to the created Shoot clusters is needed, a kubeconfig with static token credentials is used. This static token is in the system:masters group, granting it cluster-admin privileges. The kubeconfig is generated when the cluster is reconciled, stored in ShootState and replicated in the Project’s namespace in a Secret. End-users can fetch the secret and use the kubeconfig inside it.

There are several problems with this approach:

  • The token in the kubeconfig does not have any expiration, so end-users have to request a kubeconfig credential rotation if they want to revoke the token.
  • There is no user identity in the token. For example, if user Joe gets the kubeconfig from the Secret, the user in that token would be system:cluster-admin and not Joe when accessing the Shoot cluster with it. This makes auditing events in the cluster almost impossible.

Goals

  • Add a Shoot subresource called adminkubeconfig that would produce a kubeconfig used to access that Shoot cluster.

  • The kubeconfig is not stored in the API Server, but generated for each request.

  • In the AdminKubeconfigRequest sent to that subresource, end-users can specify the expiration time of the credential.

  • The identity (user) in the Gardener cluster would be part of the identity (x509 client certificate). E.g., if Joe authenticates against the Gardener API server, the generated certificate for Shoot authentication would have the following subject:

    • Common Name: Joe
    • Organisation: system:masters
  • The maximum validity of the certificate can be enforced by setting a flag on the gardener-apiserver.

  • Deprecate and remove the old {shoot-name}.kubeconfig secrets in each Project namespace.

Non-Goals

  • Generate OpenID Connect kubeconfigs

Proposal

The gardener-apiserver would serve a new shoots/adminkubeconfig resource. It accepts only CREATE calls carrying an AdminKubeconfigRequest. An AdminKubeconfigRequest would have the following structure:

apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
  expirationSeconds: 3600

Here, expirationSeconds is the validity of the certificate in seconds. In this case it would be 1 hour. The maximum validity of an AdminKubeconfigRequest is configured by the --shoot-admin-kubeconfig-max-expiration flag of the gardener-apiserver.

When such a request is received, the API server would find the ShootState associated with that cluster and generate a kubeconfig. The x509 client certificate would be signed by the Shoot cluster’s CA and the user used in the subject’s common name would be taken from the User.Info used to make the request.

apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
  expirationSeconds: 3600
status:
  expirationTimestamp: "2021-02-22T09:06:51Z"
  kubeConfig: # this is normally base64-encoded, but decoded for the example
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: LS0tLS1....
        server: https://api.shoot-cluster
      name: shoot-cluster-a
    contexts:
    - context:
        cluster: shoot-cluster-a
        user: shoot-cluster-a
      name: shoot-cluster-a
    current-context: shoot-cluster-a
    kind: Config
    preferences: {}
    users:
    - name: shoot-cluster-a
      user:
        client-certificate-data: LS0tLS1CRUd...
        client-key-data: LS0tLS1CRUd...
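
The certificate generation can be sketched with the Go standard library. This is an illustration under assumptions, not the gardener-apiserver implementation: `NewCA` creates a throwaway CA that stands in for the Shoot cluster's CA, `IssueAdminCert` and `MaxExpiration` are hypothetical names, and rejecting (rather than capping) requests above the maximum validity is a choice made for this sketch.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// MaxExpiration stands in for --shoot-admin-kubeconfig-max-expiration (made-up value).
const MaxExpiration = 24 * time.Hour

// NewCA creates a throwaway self-signed CA replacing the Shoot cluster's CA.
func NewCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo-shoot-ca"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(48 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// IssueAdminCert signs a client certificate with CN=user and O=system:masters.
// The client's private key is discarded here; a real implementation would
// embed it in the kubeconfig as client-key-data.
func IssueAdminCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, user string, expirationSeconds int64, max time.Duration) (*x509.Certificate, error) {
	validity := time.Duration(expirationSeconds) * time.Second
	if validity <= 0 || validity > max {
		return nil, fmt.Errorf("expirationSeconds must be between 1 and %d", int64(max.Seconds()))
	}
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: user, Organization: []string{"system:masters"}},
		NotBefore:    now,
		NotAfter:     now.Add(validity),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	ca, caKey, err := NewCA()
	if err != nil {
		panic(err)
	}
	cert, err := IssueAdminCert(ca, caKey, "Joe", 3600, MaxExpiration)
	if err != nil {
		panic(err)
	}
	fmt.Println("subject:", cert.Subject.CommonName, cert.Subject.Organization)
	fmt.Println("valid for:", cert.NotAfter.Sub(cert.NotBefore))
}
```

Because the common name carries the requesting user, audit logs of the Shoot's kube-apiserver would show Joe instead of a shared identity.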

A new feature gate called AdminKubeconfigRequest enables the above-mentioned API in the gardener-apiserver. The old {shoot-name}.kubeconfig is kept, but deprecated, and will be removed in the future.

In order to get the server’s address used in the kubeconfig, the Shoot’s status should be updated with new entries:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: crazy-botany
  namespace: garden-dev
spec: {}
status:
  advertisedAddresses:
  - name: external
    url: https://api.shoot-cluster.external.foo
  - name: internal
    url: https://api.shoot-cluster.internal.foo
  - name: ip
    url: https://1.2.3.4

This is needed because the Gardener API server might not know on which address the Shoot’s API server is advertised (e.g. if DNS is disabled).

If there are multiple entries, each would be added as a separate cluster in the kubeconfig, and a context with the same name would be added as well. The current context would be the one belonging to the first entry in the advertisedAddresses list (.status.advertisedAddresses[0]).
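
The mapping from advertised addresses to kubeconfig entries can be sketched as follows. The types are simplified stand-ins for the real kubeconfig structures, and naming each cluster and context `<shoot>-<address name>` is an assumption of this sketch, since the proposal does not fix a naming scheme.

```go
package main

import (
	"errors"
	"fmt"
)

// AdvertisedAddress mirrors one entry of .status.advertisedAddresses.
type AdvertisedAddress struct{ Name, URL string }

// Simplified stand-ins for the cluster and context entries of a kubeconfig.
type NamedCluster struct{ Name, Server string }
type NamedContext struct{ Name, Cluster, User string }

type Kubeconfig struct {
	Clusters       []NamedCluster
	Contexts       []NamedContext
	CurrentContext string
}

// BuildKubeconfig adds one cluster and one context of the same name per
// advertised address and selects the first entry as the current context.
func BuildKubeconfig(shoot string, addrs []AdvertisedAddress) (Kubeconfig, error) {
	if len(addrs) == 0 {
		return Kubeconfig{}, errors.New("shoot has no advertised addresses")
	}
	var cfg Kubeconfig
	for _, a := range addrs {
		name := shoot + "-" + a.Name // naming scheme is an assumption
		cfg.Clusters = append(cfg.Clusters, NamedCluster{Name: name, Server: a.URL})
		cfg.Contexts = append(cfg.Contexts, NamedContext{Name: name, Cluster: name, User: shoot})
	}
	cfg.CurrentContext = cfg.Contexts[0].Name
	return cfg, nil
}

func main() {
	cfg, err := BuildKubeconfig("crazy-botany", []AdvertisedAddress{
		{Name: "external", URL: "https://api.shoot-cluster.external.foo"},
		{Name: "internal", URL: "https://api.shoot-cluster.internal.foo"},
		{Name: "ip", URL: "https://1.2.3.4"},
	})
	if err != nil {
		panic(err)
	}
	for _, c := range cfg.Clusters {
		fmt.Println(c.Name, "->", c.Server)
	}
	fmt.Println("current-context:", cfg.CurrentContext)
}
```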

Alternatives

14 - GEP Title

GEP-NNNN: Your short, descriptive title

Table of Contents

Summary

Motivation

Goals

Non-Goals

Proposal

Alternatives

15 - New Core Gardener Cloud APIs

New core.gardener.cloud/v1beta1 APIs required to extract cloud-specific/OS-specific knowledge out of Gardener core

Summary

In GEP-1 we proposed how to (re-)design Gardener to allow providers to maintain their provider-specific knowledge outside of the core tree. Meanwhile, we have progressed a lot and are about to remove the CloudBotanist interface entirely. The only missing aspect that will allow providers to really maintain their code outside of the core is to design new APIs.

This proposal describes how the new Shoot, Seed etc. APIs will be re-designed to cope with the changes made with extensibility. We already have the new core.gardener.cloud/v1beta1 API group that will be the new default soon.

Motivation

We want to allow providers to individually maintain their specific knowledge without the necessity to touch the Gardener core code. In order to achieve this, we have to provide proper APIs.

Goals

  • Provide proper APIs to allow providers to maintain their code outside of the core codebase.
  • Do not complicate the APIs for end-users such that they can easily create, delete, and maintain shoot clusters.

Non-Goals

  • Let’s try to not split everything up into too many different resources. Instead, let’s try to keep all relevant information in the same resources when possible/appropriate.

Proposal

In GEP-1 we have already proposed a first version of the new CloudProfile and Shoot resources. In order to deprecate the existing/old garden.sapcloud.io/v1beta1 API group (and remove it, eventually) we should move all existing resources to the new core.gardener.cloud/v1beta1 API group.

CloudProfile resource

apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
  name: cloudprofile1
spec:
  type: <some-provider-name> # {aws,azure,gcp,...}
# Optional list of labels on `Seed` resources that marks those seeds whose shoots may use this provider profile.
# An empty list means that all seeds of the same provider type are supported.
# This is useful for environments that are of the same type (like openstack) but may have different "instances"/landscapes.
# seedSelector:
#   matchLabels:
#     foo: bar
  kubernetes:
    versions:
    - version: 1.12.1
    - version: 1.11.0
    - version: 1.10.6
    - version: 1.10.5
      expirationDate: 2020-04-05T01:02:03Z # optional
  machineImages:
  - name: coreos
    versions:
    - version: 2023.5.0
    - version: 1967.5.0
      expirationDate: 2020-04-05T08:00:00Z
  - name: ubuntu
    versions:
    - version: 18.04.201906170
  machineTypes:
  - name: m5.large
    cpu: "2"
    gpu: "0"
    memory: 8Gi
  # storage: 20Gi # optional (not needed in every environment, may only be specified if no volumeTypes have been specified)
    usable: true
  volumeTypes: # optional (not needed in every environment, may only be specified if no machineType has a `storage` field)
  - name: gp2
    class: standard
  - name: io1
    class: premium
  regions:
  - name: europe-central-1
    zones: # optional (not needed in every environment)
    - name: europe-central-1a
    - name: europe-central-1b
    - name: europe-central-1c
    # unavailableMachineTypes: # optional, list of machine types defined above that are not available in this zone
    # - m5.large
    # unavailableVolumeTypes: # optional, list of volume types defined above that are not available in this zone
    # - io1
# CA bundle that will be installed onto every shoot machine that is using this provider profile.
# caBundle: |
#   -----BEGIN CERTIFICATE-----
#   ...
#   -----END CERTIFICATE-----
  providerConfig:
    <some-provider-specific-cloudprofile-config>
    # We don't have concrete examples for every existing provider yet, but these are the proposals:
    #
    # Example for Alicloud:
    #
    # apiVersion: alicloud.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2023.5.0
    #   id: coreos_2023_4_0_64_30G_alibase_20190319.vhd
    #
    #
    # Example for AWS:
    #
    # apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 1967.5.0
    #   regions:
    #   - name: europe-central-1
    #     ami: ami-0f46c2ed46d8157aa
    #
    #
    # Example for Azure:
    #
    # apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 1967.5.0
    #   publisher: CoreOS
    #   offer: CoreOS
    #   sku: Stable
    # countFaultDomains:
    # - region: westeurope
    #   count: 2
    # countUpdateDomains:
    # - region: westeurope
    #   count: 5
    #
    #
    # Example for GCP:
    #
    # apiVersion: gcp.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2023.5.0
    #   image: projects/coreos-cloud/global/images/coreos-stable-2023-5-0-v20190312
    #
    #
    # Example for OpenStack:
    #
    # apiVersion: openstack.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2023.5.0
    #   image: coreos-2023.5.0
    # keyStoneURL: https://url-to-keystone/v3/
    # dnsServers:
    # - 10.10.10.10
    # - 10.10.10.11
    # dhcpDomain: foo.bar
    # requestTimeout: 30s
    # constraints:
    #   loadBalancerProviders:
    #   - name: haproxy
    #   floatingPools:
    #   - name: fip1
    #     loadBalancerClasses:
    #     - name: class1
    #       floatingSubnetID: 04eed401-f85f-4610-8041-c4835c4beea6
    #       floatingNetworkID: 23949a30-1cdd-4732-ba47-d03ced950acc
    #       subnetID: ac46c204-9d0d-4a4c-a90d-afefe40cfc35
    #
    #
    # Example for Packet:
    #
    # apiVersion: packet.provider.extensions.gardener.cloud/v1alpha1
    # kind: CloudProfileConfig
    # machineImages:
    # - name: coreos
    #   version: 2079.3.0
    #   id: d61c3912-8422-4daf-835e-854efa0062e4

Seed resource

Special note: The proposal contains fields that are not yet existing in the current garden.sapcloud.io/v1beta1.Seed resource, but they should be implemented (open issues that require them are linked).

apiVersion: v1
kind: Secret
metadata:
  name: seed-secret
  namespace: garden
type: Opaque
data:
  kubeconfig: base64(kubeconfig-for-seed-cluster)

---
apiVersion: v1
kind: Secret
metadata:
  name: backup-secret
  namespace: garden
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13

---
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: seed1
spec:
  provider:
    type: <some-provider-name> # {aws,azure,gcp,...}
    region: europe-central-1
  secretRef:
    name: seed-secret
    namespace: garden
  # Motivation for DNS section: https://github.com/gardener/gardener/issues/201.
  dns:
    provider: <some-provider-name> # {aws-route53, google-clouddns, ...}
    secretName: my-dns-secret # must be in `garden` namespace
    ingressDomain: seed1.dev.example.com
  volume: # optional (introduced to get rid of `persistentvolume.garden.sapcloud.io/minimumSize` and `persistentvolume.garden.sapcloud.io/provider` annotations)
    minimumSize: 20Gi
    providers:
    - name: foo
      purpose: etcd-main
  networks: # Seed and Shoot networks must be disjoint
    nodes: 10.240.0.0/16
    pods: 10.241.128.0/17
    services: 10.241.0.0/17
  # Shoot default networks, see also https://github.com/gardener/gardener/issues/895.
  # shootDefaults:
  #   pods: 100.96.0.0/11
  #   services: 100.64.0.0/13
  taints:
  - key: seed.gardener.cloud/protected
  - key: seed.gardener.cloud/invisible
  blockCIDRs:
  - 169.254.169.254/32
  backup: # See https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.
    type: <some-provider-name> # {aws,azure,gcp,...}
  # region: eu-west-1
    secretRef:
      name: backup-secret
      namespace: garden
status:
  conditions:
  - lastTransitionTime: "2020-07-14T19:16:42Z"
    lastUpdateTime: "2020-07-14T19:18:17Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available
  gardener:
    id: 4c9832b3823ee6784064877d3eb10c189fc26e98a1286c0d8a5bc82169ed702c
    name: gardener-controller-manager-7fhn9ikan73n-7jhka
    version: 1.0.0
  observedGeneration: 1

Project resource

Special note: The members and viewers fields of the garden.sapcloud.io/v1beta1.Project resource will be merged into one members field. Every member will have a role that is either admin or viewer. This will allow us to add new roles without changing the API.

apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
  name: example
spec:
  description: Example project
  members:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: john.doe@example.com
    role: admin
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: joe.doe@example.com
    role: viewer
  namespace: garden-example
  owner:
    apiGroup: rbac.authorization.k8s.io
    kind: User
    name: john.doe@example.com
  purpose: Example project
status:
  observedGeneration: 1
  phase: Ready
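
The merge of the two old fields into one members list can be sketched as a small conversion function. The types and `MergeMembers` are hypothetical, and two details are assumptions of this sketch: entries of the old members field get the admin role, and a subject that appears in both old fields keeps only the admin role.

```go
package main

import "fmt"

// Subject mirrors the rbac.authorization.k8s.io subject fields used above.
type Subject struct{ APIGroup, Kind, Name string }

// Member is a subject with a role, as in the new members field.
type Member struct {
	Subject
	Role string
}

// MergeMembers converts the old members and viewers lists into one list.
// Old members become admins, old viewers become viewers, and duplicates
// are dropped so that the first (admin) entry wins.
func MergeMembers(members, viewers []Subject) []Member {
	merged := make([]Member, 0, len(members)+len(viewers))
	seen := make(map[Subject]bool)
	add := func(subjects []Subject, role string) {
		for _, s := range subjects {
			if seen[s] {
				continue
			}
			seen[s] = true
			merged = append(merged, Member{Subject: s, Role: role})
		}
	}
	add(members, "admin")
	add(viewers, "viewer")
	return merged
}

func main() {
	john := Subject{APIGroup: "rbac.authorization.k8s.io", Kind: "User", Name: "john.doe@example.com"}
	joe := Subject{APIGroup: "rbac.authorization.k8s.io", Kind: "User", Name: "joe.doe@example.com"}
	for _, m := range MergeMembers([]Subject{john}, []Subject{joe, john}) {
		fmt.Printf("%s: %s\n", m.Name, m.Role)
	}
}
```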

SecretBinding resource

Special note: No modifications needed compared to the current garden.sapcloud.io/v1beta1.SecretBinding resource.

apiVersion: v1
kind: Secret
metadata:
  name: secret1
  namespace: garden-core
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-infrastructure.yaml#L14-L15
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-infrastructure.yaml#L14-L17
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-infrastructure.yaml#L14
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-infrastructure.yaml#L15-L18
  # https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-infrastructure.yaml#L14-L15
  #
  # If you use your own domain (not the default domain of your landscape) then you have to add additional keys to this secret.
  # The reason is that the DNS management is not part of the Gardener core code base but externalized, hence, it might use other
  # key names than Gardener itself.
  # The actual values here depend on the DNS extension that is installed to your landscape.
  # For example, check out https://github.com/gardener/external-dns-management and find a lot of example secret manifests here:
  # https://github.com/gardener/external-dns-management/tree/master/examples

---
apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
  name: secretbinding1
  namespace: garden-core
secretRef:
  name: secret1
# namespace: namespace-other-than-'garden-core' // optional
quotas: []
# - name: quota-1
# # namespace: namespace-other-than-'garden-core' // optional

Quota resource

Special note: No modifications needed compared to the current garden.sapcloud.io/v1beta1.Quota resource.

apiVersion: core.gardener.cloud/v1beta1
kind: Quota
metadata:
  name: trial-quota
  namespace: garden-trial
spec:
  scope:
    apiGroup: core.gardener.cloud
    kind: Project
# clusterLifetimeDays: 14
  metrics:
    cpu: "200"
    gpu: "20"
    memory: 4000Gi
    storage.standard: 8000Gi
    storage.premium: 2000Gi
    loadbalancer: "100"

BackupBucket resource

Special note: This new resource is cluster-scoped.

# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.

apiVersion: v1
kind: Secret
metadata:
  name: backup-operator-provider
  namespace: backup-garden
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13

---
apiVersion: core.gardener.cloud/v1beta1
kind: BackupBucket
metadata:
  name: <seed-provider-type>-<region>-<seed-uid>
  ownerReferences:
  - kind: Seed
    name: seed1
spec:
  provider:
    type: <some-provider-name> # {aws,azure,gcp,...}
    region: europe-central-1
  seed: seed1
  secretRef:
    name: backup-operator-provider
    namespace: backup-garden
status:
  lastOperation:
    description: Backup bucket has been successfully reconciled.
    lastUpdateTime: '2020-04-13T14:34:27Z'
    progress: 100
    state: Succeeded
    type: Reconcile
  observedGeneration: 1

BackupEntry resource

Special note: This new resource is cluster-scoped.

# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.

apiVersion: v1
kind: Secret
metadata:
  name: backup-operator-provider
  namespace: backup-garden
type: Opaque
data:
  # <some-provider-specific data keys>
  # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
  # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
  # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
  # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13

---
apiVersion: core.gardener.cloud/v1beta1
kind: BackupEntry
metadata:
  name: shoot--core--crazy-botany--3ef42
  namespace: garden-core
  ownerReferences:
  - apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: false
    controller: true
    kind: Shoot
    name: crazy-botany
    uid: 19a9538b-5058-11e9-b5a6-5e696cab3bc8
spec:
  bucketName: cloudprofile1-random[:5]
  seed: seed1
status:
  lastOperation:
    description: Backup entry has been successfully reconciled.
    lastUpdateTime: '2020-04-13T14:34:27Z'
    progress: 100
    state: Succeeded
    type: Reconcile
  observedGeneration: 1

Shoot resource

Special notes:

  • The kubelet configuration in a worker pool may override the default .spec.kubernetes.kubelet configuration (which applies to all worker pools unless overridden).
  • Moved remaining control plane configuration to new .spec.provider.controlplane section.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: crazy-botany
  namespace: garden-core
spec:
  secretBindingName: secretbinding1
  cloudProfileName: cloudprofile1
  region: europe-central-1
# seedName: seed1
  provider:
    type: <some-provider-name> # {aws,azure,gcp,...}
    infrastructureConfig:
      <some-provider-specific-infrastructure-config>
      # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-infrastructure.yaml#L56-L64
      # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L43-L53
      # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-infrastructure.yaml#L63-L71
      # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-infrastructure.yaml#L53-L57
      # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-infrastructure.yaml#L56-L64
      # https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-infrastructure.yaml#L48-L49
    controlPlaneConfig:
      <some-provider-specific-controlplane-config>
      # https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-controlplane.yaml#L60-L65
      # https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-controlplane.yaml#L60-L64
      # https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-controlplane.yaml#L61-L66
      # https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-controlplane.yaml#L59-L64
      # https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-controlplane.yaml#L64-L70
      # https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-controlplane.yaml#L60-L61
    workers:
    - name: cpu-worker
      minimum: 3
      maximum: 5
    # maxSurge: 1
    # maxUnavailable: 0
      machine:
        type: m5.large
        image:
          name: <some-os-name>
          version: <some-os-version>
        # providerConfig:
        #   <some-os-specific-configuration>
      volume:
        type: gp2
        size: 20Gi
    # providerConfig:
    #   <some-provider-specific-worker-config>
    # labels:
    #   key: value
    # annotations:
    #   key: value
    # taints: # See also https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    # - key: foo
    #   value: bar
    #   effect: NoSchedule
    # caBundle: <some-ca-bundle-to-be-installed-to-all-nodes-in-this-pool>
    # kubernetes:
    #   kubelet:
    #     cpuCFSQuota: true
    #     cpuManagerPolicy: none
    #     podPidsLimit: 10
    #     featureGates:
    #       SomeKubernetesFeature: true
    # zones: # optional, only relevant if the provider supports availability zones
    # - europe-central-1a
    # - europe-central-1b
  kubernetes:
    version: 1.15.1
  # allowPrivilegedContainers: true # 'true' means that all authenticated users can use the "gardener.privileged" PodSecurityPolicy, allowing full unrestricted access to Pod features.
  # kubeAPIServer:
  #   featureGates:
  #     SomeKubernetesFeature: true
  #   runtimeConfig:
  #     scheduling.k8s.io/v1alpha1: true
  #   oidcConfig:
  #     caBundle: |
  #       -----BEGIN CERTIFICATE-----
  #       Li4u
  #       -----END CERTIFICATE-----
  #     clientID: client-id
  #     groupsClaim: groups-claim
  #     groupsPrefix: groups-prefix
  #     issuerURL: https://identity.example.com
  #     usernameClaim: username-claim
  #     usernamePrefix: username-prefix
  #     signingAlgs: RS256,some-other-algorithm
  #-#-# only usable with Kubernetes >= 1.11
  #     requiredClaims:
  #       key: value
  #   admissionPlugins:
  #   - name: PodNodeSelector
  #     config: |
  #       podNodeSelectorPluginConfig:
  #         clusterDefaultNodeSelector: <node-selectors-labels>
  #         namespace1: <node-selectors-labels>
  #         namespace2: <node-selectors-labels>
  #   auditConfig:
  #     auditPolicy:
  #       configMapRef:
  #         name: auditpolicy
  # kubeControllerManager:
  #   featureGates:
  #     SomeKubernetesFeature: true
  #   horizontalPodAutoscaler:
  #     syncPeriod: 30s
  #     tolerance: 0.1
  #-#-# only usable with Kubernetes < 1.12
  #     downscaleDelay: 15m0s
  #     upscaleDelay: 1m0s
  #-#-# only usable with Kubernetes >= 1.12
  #     downscaleStabilization: 5m0s
  #     initialReadinessDelay: 30s
  #     cpuInitializationPeriod: 5m0s
  # kubeScheduler:
  #   featureGates:
  #     SomeKubernetesFeature: true
  # kubeProxy:
  #   featureGates:
  #     SomeKubernetesFeature: true
  #   mode: IPVS
  # kubelet:
  #   cpuCFSQuota: true
  #   cpuManagerPolicy: none
  #   podPidsLimit: 10
  #   featureGates:
  #     SomeKubernetesFeature: true
  # clusterAutoscaler:
  #   scaleDownUtilizationThreshold: 0.5
  #   scaleDownUnneededTime: 30m
  #   scaleDownDelayAfterAdd: 60m
  #   scaleDownDelayAfterFailure: 10m
  #   scaleDownDelayAfterDelete: 10s
  #   scanInterval: 10s
  dns:
    # When the shoot shall use a cluster domain no domain and no providers need to be provided - Gardener will
    # automatically compute a correct domain.
    domain: crazy-botany.core.my-custom-domain.com
    providers:
    - type: aws-route53
      secretName: my-custom-domain-secret
      domains:
        include:
        - my-custom-domain.com
        - my-other-custom-domain.com
        exclude:
        - yet-another-custom-domain.com
      zones:
        include:
        - zone-id-1
        exclude:
        - zone-id-2
  extensions:
  - type: foobar
  # providerConfig:
  #   apiVersion: foobar.extensions.gardener.cloud/v1alpha1
  #   kind: FooBarConfiguration
  #   foo: bar
  networking:
    type: calico
    pods: 100.96.0.0/11
    services: 100.64.0.0/13
    nodes: 10.250.0.0/16
  # providerConfig:
  #   apiVersion: calico.extensions.gardener.cloud/v1alpha1
  #   kind: NetworkConfig
  #   ipam:
  #     type: host-local
  #     cidr: usePodCIDR
  #   backend: bird
  #   typha:
  #     enabled: true
  # See also: https://github.com/gardener/gardener/blob/master/docs/proposals/03-networking.md
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: true
      machineImageVersion: true
# hibernation:
#   enabled: false
#   schedules:
#   - start: "0 20 * * *" # Start hibernation every day at 8PM
#     end: "0 6 * * *"    # Stop hibernation every day at 6AM
#     location: "America/Los_Angeles" # Specify a location for the cron to run in
  addons:
    nginx-ingress:
      enabled: false
    # loadBalancerSourceRanges: []
    kubernetes-dashboard:
      enabled: true
    # authenticationMode: basic # allowed values: basic,token
status:
  conditions:
  - type: APIServerAvailable
    status: 'True'
    lastTransitionTime: '2020-01-30T10:38:15Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: HealthzRequestFailed
    message: API server /healthz endpoint responded with success status code. [response_time:3ms]
  - type: ControlPlaneHealthy
    status: 'True'
    lastTransitionTime: '2020-04-02T05:18:58Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: ControlPlaneRunning
    message: All control plane components are healthy.
  - type: EveryNodeReady
    status: 'True'
    lastTransitionTime: '2020-04-01T16:27:21Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: EveryNodeReady
    message: Every node registered to the cluster is ready.
  - type: SystemComponentsHealthy
    status: 'True'
    lastTransitionTime: '2020-04-03T18:26:28Z'
    lastUpdateTime: '2020-04-13T14:35:21Z'
    reason: SystemComponentsRunning
    message: All system components are healthy.
  gardener:
    id: 4c9832b3823ee6784064877d3eb10c189fc26e98a1286c0d8a5bc82169ed702c
    name: gardener-controller-manager-7fhn9ikan73n-7jhka
    version: 1.0.0
  lastOperation:
    description: Shoot cluster state has been successfully reconciled.
    lastUpdateTime: '2020-04-13T14:34:27Z'
    progress: 100
    state: Succeeded
    type: Reconcile
  observedGeneration: 1
  seed: seed1
  hibernated: false
  technicalID: shoot--core--crazy-botany
  uid: d8608cfa-2856-11e8-8fdc-0a580af181af

Plant resource

apiVersion: v1
kind: Secret
metadata:
  name: crazy-plant-secret
  namespace: garden-core
type: Opaque
data:
  kubeconfig: base64(kubeconfig-for-plant-cluster)

---
apiVersion: core.gardener.cloud/v1beta1
kind: Plant
metadata:
  name: crazy-plant
  namespace: garden-core
spec:
  secretRef:
    name: crazy-plant-secret
  endpoints:
  - name: Cluster GitHub repository
    purpose: management
    url: https://github.com/my-org/my-cluster-repo
  - name: GKE cluster page
    purpose: management
    url: https://console.cloud.google.com/kubernetes/clusters/details/europe-west1-b/plant?project=my-project&authuser=1&tab=details
status:
  clusterInfo:
    provider:
      type: gce
      region: europe-west4-c
    kubernetes:
      version: v1.11.10-gke.5
  conditions:
  - lastTransitionTime: "2020-03-01T11:31:37Z"
    lastUpdateTime: "2020-04-14T18:00:29Z"
    message: API server /healthz endpoint responded with success status code. [response_time:8ms]
    reason: HealthzRequestFailed
    status: "True"
    type: APIServerAvailable
  - lastTransitionTime: "2020-04-01T06:26:56Z"
    lastUpdateTime: "2020-04-14T18:00:29Z"
    message: Every node registered to the cluster is ready.
    reason: EveryNodeReady
    status: "True"
    type: EveryNodeReady

16 - Reversed Cluster VPN

GEP-14: Reversed Cluster VPN

Motivation

It is necessary to describe the current VPN solution and outline its shortcomings in order to motivate this proposal.

Problem Statement

Today’s Gardener cluster VPN solution has several issues including:

  1. Connection establishment is always from the seed cluster to the shoot cluster. This means that there needs to be connectivity both ways which is not desirable in many cases (OpenStack, VMware) and causes high effort in firewall configuration or extra infrastructure. These firewall configurations are prohibited in some cases due to security policies.
  2. Shoot clusters must provide a VPN endpoint. This means extra cost for the endpoint (roughly €20/month on hyperscalers) or will consume scarce resources (limited number of VMware NSX-T load balancers).

A first implementation based on the konnectivity server has been provided to resolve these issues. As we found several shortcomings in the underlying technology component, the apiserver-network-proxy, we believe that this is not a suitable way ahead. We have opened an issue and provided two solution proposals to the community. We do see some remedies, e.g. using the QUIC protocol instead of gRPC, but we (a) consider the implementation effort significantly higher compared to this proposal and (b) would use an experimental protocol to solve a problem that can also be solved with existing and proven core network technologies.

We will therefore not continue to invest in this approach. We will, however, research a similar approach (see below in “Further Research”).

Current Solution Outline

The current solution consists of multiple VPN connections from each API server pod and the Prometheus pod of a control plane to an OpenVPN server running in the shoot cluster. This OpenVPN server is exposed via a load balancer that must have an IP address which is reachable from the seed cluster. The routing in the seed cluster pods is configured to route all traffic for the node, pod, and service ranges to the shoot cluster. This means that there is no address overlap allowed between seed- and shoot cluster node, pod, and service ranges.

In the seed cluster the vpn-seed container is a sidecar to the kube-apiserver and prometheus pods. OpenVPN acts as a TCP client connecting to an OpenVPN TCP server. This is not optimal (e.g. tunneling TCP over TCP is discouraged) but at the time of development there was no UDP load balancer available on at least one of the major hyperscalers. Connectivity could have been switched to UDP later but the development effort was not spent.

The solution is depicted in this diagram:

[Diagram: current cluster VPN solution]

These are the essential parts of the OpenVPN client configuration in the vpn-seed sidecar container:

# use TCP instead of UDP (UDP is commonly not supported by load balancers)
proto tcp-client

[...]

# get all routing information from server
pull

tls-client
key "/srv/secrets/vpn-seed/tls.key"
cert "/srv/secrets/vpn-seed/tls.crt"
ca "/srv/secrets/vpn-seed/ca.crt"

tls-auth "/srv/secrets/tlsauth/vpn.tlsauth" 1
cipher AES-256-CBC

# https://openvpn.net/index.php/open-source/documentation/howto.html#mitm
remote-cert-tls server

# pull filter
pull-filter accept "route 100.64.0.0 255.248.0.0"
pull-filter accept "route 100.96.0.0 255.224.0.0"
pull-filter accept "route 10.1.60.0 255.255.252.0"
pull-filter accept "route 192.168.123."
pull-filter ignore "route"
pull-filter ignore redirect-gateway
pull-filter ignore route-ipv6
pull-filter ignore redirect-gateway-ipv6

Encryption is based on SSL certificates, with an additional HMAC signature on all SSL/TLS handshake packets. As multiple clients connect to the OpenVPN server in the shoot cluster, each client must be assigned a unique IP address. The OpenVPN server does this by pushing the configuration to the client (keyword pull). Because the OpenVPN server runs in an untrusted environment, accepting pushed configuration is potentially problematic, so pull filters deny all but the necessary routes for the pod, service, and node networks.
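
The effect of the pull filters can be illustrated with a small simulation. This is only a sketch of the filter semantics as described in the OpenVPN manual (filters are checked in order, a filter matches if the pushed option starts with its text, the first match decides, and unmatched options are accepted). The function name filter_pushed_option is made up for this illustration, and the reject action is left out because the configuration above does not use it.

```python
# Pull filters from the vpn-seed client configuration above, in order.
PULL_FILTERS = [
    ("accept", "route 100.64.0.0 255.248.0.0"),
    ("accept", "route 100.96.0.0 255.224.0.0"),
    ("accept", "route 10.1.60.0 255.255.252.0"),
    ("accept", "route 192.168.123."),
    ("ignore", "route"),
    ("ignore", "redirect-gateway"),
    ("ignore", "route-ipv6"),
    ("ignore", "redirect-gateway-ipv6"),
]


def filter_pushed_option(option, filters=PULL_FILTERS):
    """Return 'accept' or 'ignore' for a single option pushed by the server.

    Hypothetical helper: mimics prefix matching with first-match-wins.
    """
    for action, prefix in filters:
        if option.startswith(prefix):
            return action
    # No filter matched: the option is accepted as usual.
    return "accept"


if __name__ == "__main__":
    for pushed in (
        "route 100.96.0.0 255.224.0.0",        # explicitly allowed
        "route 10.243.0.0 255.255.128.0",      # caught by the generic "route" filter
        "redirect-gateway def1",               # never accepted from the server
        "ifconfig 192.168.123.6 192.168.123.5",  # not a route, no filter matches
    ):
        print(f"{filter_pushed_option(pushed):6} {pushed}")
```

The ordering matters: the specific accept entries must come before the generic ignore "route" entry, otherwise every pushed route would be dropped.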

The OpenVPN server running in the shoot cluster is configured as follows:

mode server
tls-server
proto tcp4-server
dev tun0

[...]

server 192.168.123.0 255.255.255.0

push "route 10.243.0.0 255.255.128.0"
push "route 10.243.128.0 255.255.128.0"

duplicate-cn

key "/srv/secrets/vpn-shoot/tls.key"
cert "/srv/secrets/vpn-shoot/tls.crt"
ca "/srv/secrets/vpn-shoot/ca.crt"
dh "/srv/secrets/dh/dh2048.pem"

tls-auth "/srv/secrets/tlsauth/vpn.tlsauth" 0
push "route 10.242.0.0 255.255.0.0"

It is a TCP TLS server and configured to automatically assign IP addresses for OpenVPN clients (server directive). In addition, it pushes the shoot cluster node-, pod-, and service ranges to the clients running in the seed cluster (push directive).
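
OpenVPN writes networks as address and netmask pairs, while Gardener resources use CIDR notation. As a reading aid, the following sketch converts the values from the server configuration above with Python's standard ipaddress module. The helper name to_cidr is chosen for this example only, and no claim is made here about which pushed route is the node, pod, or service range.

```python
import ipaddress


def to_cidr(address, netmask):
    """Convert an OpenVPN style 'address netmask' pair to CIDR notation."""
    return str(ipaddress.ip_network(f"{address}/{netmask}"))


if __name__ == "__main__":
    # Values copied from the server configuration above.
    print("server pool:", to_cidr("192.168.123.0", "255.255.255.0"))
    for address, netmask in (
        ("10.243.0.0", "255.255.128.0"),
        ("10.243.128.0", "255.255.128.0"),
        ("10.242.0.0", "255.255.0.0"),
    ):
        print("pushed route:", to_cidr(address, netmask))
```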

Note: The network mesh spanned by OpenVPN uses the network range 192.168.123.0 - 192.168.123.255. This network range cannot be used in either shoot or seed clusters. If it is used, this might cause subtle problems due to network range overlaps. Unfortunately, this appears not to be well documented, although the restriction has existed since the very beginning. We should clean up this technical debt as part of the exercise.
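
A check for this restriction could look roughly like the following sketch. It is not taken from the Gardener code base; the function name overlapping_ranges and the dictionary layout are assumptions made for illustration. The first example uses the networking values from the Shoot resource in GEP-4.

```python
import ipaddress

# Network range used by the OpenVPN mesh (see the note above).
VPN_RANGE = ipaddress.ip_network("192.168.123.0/24")


def overlapping_ranges(cluster_ranges, reserved=VPN_RANGE):
    """Return the names of all cluster ranges that overlap the reserved range.

    cluster_ranges maps a name (e.g. 'nodes') to a CIDR string.
    """
    return [
        name
        for name, cidr in cluster_ranges.items()
        if ipaddress.ip_network(cidr).overlaps(reserved)
    ]


if __name__ == "__main__":
    ok = {"pods": "100.96.0.0/11", "services": "100.64.0.0/13", "nodes": "10.250.0.0/16"}
    bad = {"pods": "100.96.0.0/11", "services": "100.64.0.0/13", "nodes": "192.168.0.0/16"}
    print(overlapping_ranges(ok))   # no conflicts
    print(overlapping_ranges(bad))  # the node range contains 192.168.123.0/24
```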

Goals

  • We intend to supersede the current VPN solution with the solution outlined in this proposal.
  • We intend to remove the code for the konnectivity tunnel once this solution proposal has been validated.

Non Goals

  • The solution is not a low-latency or high-throughput solution. As the kube-apiserver to shoot cluster traffic does not demand these properties, we do not intend to invest in improvements.
  • We do not intend to provide continuous availability to the shoot-seed VPN connection. We expect the availability to be comparable to the existing solution.

Proposal

The proposal is depicted in the following diagram:

[Diagram: proposed reversed cluster VPN solution]

We have added an OpenVPN server pod (vpn-seed-server) to each control plane. The OpenVPN client in the shoot cluster (vpn-shoot-client) connects to the OpenVPN server.

The two containers vpn-seed-server and vpn-shoot-client are new containers and are not related to containers in the github.com/gardener/vpn project. We will create a new project github.com/gardener/vpn2 for these containers. With this solution we intend to supersede the containers from the github.com/gardener/vpn project.

A service vpn-seed-server of type ClusterIP is created for each control plane in its namespace.

The vpn-shoot-client pod connects to the correct vpn-seed-server service via the SNI passthrough proxy, introduced with the “SNI Passthrough proxy for kube-apiservers” proposal, on port 8132.

Shoot OpenVPN clients (vpn-shoot-client) connect to the correct OpenVPN server using the HTTP proxy feature provided by OpenVPN. A configuration is added to the envoy proxy to detect HTTP proxy requests and open a connection to the correct OpenVPN server.

The kube-apiserver to shoot cluster connections are established using the API server proxy feature via an envoy proxy sidecar container of the vpn-seed-server container.

The restriction regarding the 192.168.123.0/24 network range in the current VPN solution still applies to this proposal. No other restrictions are introduced. In the context of this GEP, a pull request has been filed to block usage of that range by shoot clusters.

Performance and Scalability

We do expect performance and throughput to be slightly lower compared to the existing solution. This is because the OpenVPN server acts as an additional hop and must decrypt and re-encrypt traffic that passes through. As there are no low-latency or high-throughput requirements for this connection, we do not assume this to be an issue.

Availability and Failure Scenarios

This solution re-uses multiple instances of the envoy component used for the kube-apiserver endpoints. We assume that the availability for kube-apiservers is good enough for the cluster VPN as well.

The OpenVPN client- and server pods are singleton pods in this approach and therefore are affected by potential failures and during cluster-, and control plane updates. Potential outages are only restricted to single shoot clusters and are comparable to the situation with the existing solution today.

Feature Gates and Migration Strategy

We have introduced a gardenlet feature gate ReversedVPN. If APIServerSNI and ReversedVPN are enabled, the proposed solution is automatically enabled for all shoot clusters hosted by the seed. If ReversedVPN is enabled but APIServerSNI is not, the gardenlet will panic during startup as this is an invalid configuration. All existing shoot clusters will automatically be migrated during the next reconciliation. We assume that the ReversedVPN feature will work with Gardener-managed as well as operator-managed Istio.

We have also added a shoot annotation alpha.featuregates.shoot.gardener.cloud/reversed-vpn which can override the feature gate to enable or disable the solution for individual clusters. This is only respected if APIServerSNI is enabled, otherwise it is ignored.
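
The interplay of the two feature gates and the shoot annotation can be summarized as a small decision function. This is a sketch of the rules stated above, not the gardenlet implementation. The function name reversed_vpn_enabled is hypothetical, the startup panic is modelled as an exception, and the annotation values "true" and "false" are an assumption.

```python
def reversed_vpn_enabled(api_server_sni, reversed_vpn, annotation=None):
    """Decide whether the reversed VPN is used for one shoot cluster.

    api_server_sni, reversed_vpn: state of the gardenlet feature gates.
    annotation: value of alpha.featuregates.shoot.gardener.cloud/reversed-vpn,
                or None if the shoot does not carry the annotation
                (assumed to be "true" or "false").
    """
    if reversed_vpn and not api_server_sni:
        # Invalid configuration: the gardenlet panics during startup.
        raise ValueError("ReversedVPN requires APIServerSNI to be enabled")
    if not api_server_sni:
        # Without APIServerSNI the annotation is ignored.
        return False
    if annotation is not None:
        # The annotation overrides the feature gate for this shoot.
        return annotation == "true"
    return reversed_vpn


if __name__ == "__main__":
    print(reversed_vpn_enabled(True, True))            # enabled via feature gates
    print(reversed_vpn_enabled(True, True, "false"))   # disabled for this shoot
    print(reversed_vpn_enabled(True, False, "true"))   # enabled for this shoot
    print(reversed_vpn_enabled(False, False, "true"))  # annotation ignored
```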

Security Review

The change in the VPN solution will potentially open up new attack vectors. We will perform a thorough analysis outside of this document.

Alternatives

We have done a detailed investigation and implementation of a reversed VPN based on WireGuard. While we believe that it is technically feasible and superior to the approach presented above, there are some concerns with regard to scalability and high availability. As the WireGuard scenario based on kubelink is relevant for other use cases, we continue to improve this implementation and address the concerns, but we concede that this might not be ready in time for the cluster VPN. We nevertheless keep the implementation and provide an outline as part of this proposal.

The general idea of the proposal is to keep the existing cluster VPN solution more or less as is, but change the underlying network used for the vpn seed => vpn shoot connection. The underlying network should be established in the reversed direction, i.e. the shoot cluster should initiate the network connection, but it nevertheless should work in both directions.

We achieve this by tunneling the OpenVPN connection through a WireGuard tunnel, which is established from the shoot to the seed (note that WireGuard uses UDP as protocol). Independent of that, we can also use UDP for the OpenVPN connection, but we can also stay with TCP as before. While this might look like a big change, it only introduces minor changes to the existing solution, but let’s look at the details. In essence, the OpenVPN connection does not require a public endpoint in the shoot cluster; instead, it uses the internal endpoint provided by the WireGuard tunnel.

This is roughly depicted in this diagram. Note that the vpn-seed and vpn-shoot containers require only very few changes and are fully backwards compatible.

[Diagram: OpenVPN connection tunneled through WireGuard]

The WireGuard network needs a separate network range/CIDR. It has to be unique for the seed and all its shoot clusters. An example for an assumed workload of around 1000 shoot clusters would be 192.168.128.0/22 (1024 IP addresses), i.e. 192.168.128.0-192.168.131.255. The IP addresses from this range need to be managed, but the IP address management (IPAM) using the Gardener Kubernetes objects like seed and shootstate as backing store is fairly straightforward. This is especially true as we do not expect large network ranges and only infrequent IP allocations. Hence, the IP address allocation can be quite simple, i.e. scan the range for a free IP address of all shoot clusters in a seed and allocate the first free address from the range.
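
The simple allocation strategy described above (scan the range and take the first free address) can be sketched as follows. The function name allocate_first_free is hypothetical, and the set of already allocated addresses would in practice be read from the Gardener objects mentioned above. As a conservative assumption, the sketch skips the network and broadcast addresses of the range, so slightly fewer than 1024 addresses are handed out.

```python
import ipaddress


def allocate_first_free(cidr, allocated):
    """Return the first address of the range that is not yet allocated.

    cidr: WireGuard range of the seed, e.g. '192.168.128.0/22'.
    allocated: iterable of address strings already in use in this seed.
    """
    network = ipaddress.ip_network(cidr)
    taken = {ipaddress.ip_address(address) for address in allocated}
    # hosts() omits the network and broadcast address (an assumption of this sketch).
    for host in network.hosts():
        if host not in taken:
            return str(host)
    raise RuntimeError(f"no free address left in {cidr}")


if __name__ == "__main__":
    in_use = ["192.168.128.1", "192.168.128.2", "192.168.128.4"]
    print(ipaddress.ip_network("192.168.128.0/22").num_addresses)  # size of the example range
    print(allocate_first_free("192.168.128.0/22", in_use))
```

A linear scan like this is acceptable because, as stated above, the ranges are small and allocations are infrequent.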

There is another restriction: in case shoot clusters are configured to be seed clusters, this network range must not overlap with that of the “parent” seed cluster. If the parent seed cluster uses 192.168.128.0/22, the child seed cluster can for example use 192.168.132.0/22. Grandchildren can, however, use grandparent IP address ranges. Also, two child seed clusters can use identical ranges.

This slightly adds to the restrictions described in the current solution outline, in which the arbitrarily chosen 192.168.123.0/24 range is restricted. For the purpose of this implementation we propose to extend that restriction to the 192.168.128.0/17 range. Most of it would, however, be reserved for “future use”. We are well aware that this adds to the burden of correctly configuring Gardener landscapes.

We do consider this to be a challenge that needs to be addressed by careful configuration of the Gardener seed cluster infrastructure. Together with the 192.168.123.0/24 address range these ranges should be automatically blocked for usage by shoots.
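
The arithmetic behind these ranges can be checked with Python's standard ipaddress module. The snippet below is only a worked example: it shows how many /22 seed ranges fit into the proposed 192.168.128.0/17 reservation and that the existing 192.168.123.0/24 range lies outside of it, which is why both ranges need to be blocked.

```python
import ipaddress

reserved = ipaddress.ip_network("192.168.128.0/17")  # proposed reservation
legacy = ipaddress.ip_network("192.168.123.0/24")    # existing OpenVPN range

# All /22 blocks that can be carved out of the reserved range.
blocks = list(reserved.subnets(new_prefix=22))

if __name__ == "__main__":
    print("last reserved address:", reserved.broadcast_address)
    print("number of /22 blocks:", len(blocks))
    print("first two blocks:", blocks[0], blocks[1])
    print("overlaps legacy range:", reserved.overlaps(legacy))
```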

WireGuard can utilize the Linux kernel so that after initialization/configuration no user space processes are required. We propose to recommend the WireGuard kernel module as the default solution for all seeds. For shoot clusters, the WireGuard kernel based approach is also recommended, but the user space solution should also work as we expect less traffic on the shoot side. We expect the userspace implementation to work on all operating systems supported by Gardener in case no kernel module is available.

Almost all seed clusters are already managed by Gardener and we assume that those are configured with the WireGuard kernel module. There are, however, some cases where we use other Kubernetes distributions as seed clusters, which may not have an operating system with the WireGuard module available. We will therefore generally support the user space WireGuard process on seed clusters but place a size restriction on the number of control planes on those seeds.

There is a user space implementation of WireGuard, which can be used on Linux distributions without the WireGuard kernel module. (WireGuard moved into the standard Linux kernel 5.6.) Our proposal can handle the kernel/user space switch transparently, i.e. we include the user space binaries and use them only when required. However, especially for the seed the kernel based solution might be more attractive. Garden Linux 318.4.0 supports WireGuard.

We have looked at Ubuntu and SuSE chost:

  • SuSE chost does not provide the WireGuard kernel module and it is not installable via zypper. It should however be straightforward for SuSE to include that in their next release.
  • Ubuntu does not provide the kernel module either but it can be installed using apt-get install wireguard. With that it appears straightforward to provide an image with WireGuard pre-installed.

On the seed, we add a WireGuard device to one node on the host network. For all other nodes on the seed, we adapt the routes accordingly to route traffic destined for the WireGuard network to our WireGuard node. The Kubernetes pods managing the WireGuard device and routes are only used for initial configuration and later reconfiguration. During runtime, they can restart without any impact on the operation of the WireGuard network as the WireGuard device is managed by the Linux kernel.

With Calico as the networking solution it is not easily possible to put the WireGuard endpoint into a pod. Putting the WireGuard endpoint into a pod would require defining it as a gateway in the API server or Prometheus pods, but this is not possible since Calico does not span a proper subnet. While the defined CIDR in the pod network might be 100.96.0.0/11, the network visible from within a pod is only 100.96.0.5/32. This restriction might not exist with other networking solutions.

The WireGuard endpoint on the seed is exposed via a load balancer. We propose to use kubelink to manage the WireGuard configuration/device on the seed. We consider the management of the WireGuard endpoint to be complex especially in error situations which is the reason for utilizing kubelink as there is already significant experience managing an endpoint. We propose moving kubelink to the Gardener org in case it is used by this proposal.

Kubelink addresses three challenges in managing WireGuard interfaces on cluster nodes. First, with WireGuard interfaces directly on the node (hostNetwork=true), the lifecycle of the interface is decoupled from the lifecycle of the pod that created it. This means that there have to be means of cleaning up the interface and its configuration in case the interface moves to a different node. Second, additional routing information must be distributed across the cluster. The WireGuard CIDR is unknown to the network implementation, so additional routes must be distributed to all nodes of the cluster. Third, kubelink dynamically configures the WireGuard interface with endpoints and their public keys.

On the shoot, we create the keys and acquire the WireGuard IP in the standard secret generation. The data is added as a secret to the control plane and to the shootstate. The vpn shoot deployment is extended to include the WireGuard device setup inside the vpn shoot pod network. For certain infrastructures (AWS), we need a re-advertiser to resolve the seed WireGuard endpoint and evaluate whether the IP address changed.

While it is possible to configure a WireGuard device using DNS names, only IP addresses can be stored in Linux kernel data structures. A change of a load balancer IP address can therefore not be mitigated on that level. As WireGuard dynamically adapts endpoint IP addresses, a change in load balancer IPs is mitigated in most but not all cases. This is why a re-advertiser is required for public cloud providers such as AWS.
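
The core of such a re-advertiser is a periodic comparison of the resolved address with the one last configured. The following sketch shows only that comparison. The function name endpoint_changed and the injectable resolve parameter are assumptions for illustration, and reconfiguring the WireGuard peer after a detected change is deliberately left out.

```python
import socket


def endpoint_changed(hostname, last_ip, resolve=socket.gethostbyname):
    """Resolve the seed WireGuard endpoint and compare it with the last known IP.

    Returns a tuple (current_ip, changed). The resolver can be replaced,
    e.g. for testing without network access.
    """
    current_ip = resolve(hostname)
    return current_ip, current_ip != last_ip


if __name__ == "__main__":
    # Hypothetical endpoint name and documentation addresses.
    fake_dns = {"wireguard.seed.example.com": "203.0.113.11"}
    ip, changed = endpoint_changed(
        "wireguard.seed.example.com", "203.0.113.10", resolve=fake_dns.__getitem__
    )
    if changed:
        print(f"endpoint moved to {ip}, peer configuration needs an update")
```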

The load balancer exposing the OpenVPN endpoint in the shoot cluster is no longer required and therefore removed if this functionality is used.

As we want to slowly expand the usage of the WireGuard solution, we propose to introduce a feature gate for it. Furthermore, since the WireGuard network requires a separate network range, we propose to introduce a new section to the seed settings with two additional flags (enabled & cidr):

apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed
  ...
spec:
  ...
  settings:
  ...
    wireguard:
      enabled: true
      cidr: 192.168.128.0/22

Last but not least, we propose to introduce an annotation to the shoots to enable/disable the WireGuard tunnel explicitly.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  annotations:
    alpha.featuregates.shoot.gardener.cloud/wireguard-tunnel: "true"
  ...

Using this approach, it is easy to switch the solution on and off, i.e. migrate the shoot clusters automatically during ordinary reconciliation.

High Availability

There is an issue if the node that hosts the WireGuard endpoint fails. The endpoint is migrated to another node; however, the time required to do this might exceed the budget for downtimes. One could argue, though, that a disruption of less than 30 seconds to 1 minute does not qualify as a downtime and will in almost all cases not be noticeable by end users.

In this case we also assume that TCP connections won’t be interrupted - they would just appear to hang. We will confirm this behavior and the potential downtime as part of the development and testing effort as this is hard to predict.

As a possible mitigation we propose to instantiate two kubelink instances in the seed cluster that are served by two different load balancers. The instances must run on different nodes (if possible, but we assume a proper seed cluster has more than one node). Each shoot cluster connects to both endpoints. This means that the OpenVPN server is reachable with two different IP addresses. The VPN seed sidecars must attempt to connect to both of them and will continue to do so. The “Persistent Keepalive” feature is set to 21 seconds by default but could be reduced. Due to the redundancy, however, this appears not to be necessary.

It is desirable that both connections are used in an equal manner. One strategy could be to use the kubelink 1 connection if the first target WireGuard address is even (the last byte of the IPv4 address), otherwise the kubelink 2 connection. The vpn-seed sidecars can then use the following configuration in their OpenVPN configuration file:

<connection>
remote 192.168.45.3 1194 udp
</connection>

<connection>
remote 192.168.47.34 1194 udp
</connection>

OpenVPN will go through the list sequentially and try to connect to these endpoints.
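The even/odd strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the endpoint labels `kubelink-1` / `kubelink-2`, and the sample shoot addresses are hypothetical, and the two endpoint IPs are simply the ones from the configuration example.

```python
import ipaddress

# Endpoint IPs taken from the <connection> example; the labels are hypothetical.
ENDPOINTS = {"kubelink-1": "192.168.45.3", "kubelink-2": "192.168.47.34"}


def preferred_endpoint(wireguard_address: str) -> str:
    """Pick the endpoint to try first, based on the last byte of the IPv4 address."""
    last_byte = ipaddress.IPv4Address(wireguard_address).packed[-1]
    return "kubelink-1" if last_byte % 2 == 0 else "kubelink-2"


def connection_order(wireguard_address: str) -> list:
    """Return the remote addresses in the order OpenVPN should try them."""
    first = preferred_endpoint(wireguard_address)
    fallbacks = [addr for name, addr in ENDPOINTS.items() if name != first]
    return [ENDPOINTS[first]] + fallbacks


if __name__ == "__main__":
    for address in ("10.0.0.4", "10.0.0.7"):
        print(address, connection_order(address))
```

Because OpenVPN walks the list sequentially, putting the preferred endpoint first spreads the shoots over both endpoints while keeping the other one as a fallback.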

As an additional mitigation it appears possible to instantiate WireGuard devices on all hosts and replicate its relevant conntrack state across all cluster nodes. The relevant conntrack state keeps the state of all connections passing through the WireGuard interface (e.g. the WireGuard CIDR). conntrack and the tools to replicate conntrack state are part of the essential Linux netfilter tools package.

Load Considerations

What happens in case of a failure? In this case one router will end up owning all connections as the clients will attempt to use the next connection. This could be mitigated by adding a third redundant WireGuard connection. Using this strategy, the failure of one WireGuard endpoint would result in the equal distribution of connections to the two remaining interfaces. We believe however that this will not be necessary.

The cluster node running the WireGuard endpoint is essentially a router that routes all traffic to the various shoot clusters. This is established and proven technology that has existed for decades and has been highly optimized since then. This is also the technology that hyperscalers rely on to provide VPN connectivity to their customers. That said, hyperscalers essentially provide solutions based on IPsec, which is known not to scale as well as WireGuard. WireGuard is a relatively new technology, but we have no reason to believe that it is less stable than existing IPsec solutions.

Regarding performance, there is a lot of information on the Internet suggesting that WireGuard performs better than other VPN solutions such as IPsec or OpenVPN. Examples are https://www.wireguard.com/performance/ and https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/2020-ifip-moonwire.pdf.

Based on this, we have no reason to believe that one router will not be able to handle all traffic going to and coming from shoot clusters. Nevertheless, we will closely monitor the situation in our tests and will take action if necessary.

Further Research

Based on feedback on this proposal and while working on this implementation, we identified two additional approaches that we had not thought of so far. The first idea can be used to replace the “inner” OpenVPN implementation and the second can be used to replace WireGuard with OpenVPN and get rid of the single point of failure.

  1. Instead of using OpenVPN for the inner seed/shoot communication we can use the proxy protocol and use a TCP proxy (e.g. envoy) in the shoot cluster to broker the seed-shoot connections. The advantage is that with this solution seed- and shoot cluster network ranges are allowed to overlap. Disadvantages are increased implementation effort and less efficient network in terms of throughput and scalability. We believe however that the reduced network efficiency does not invalidate this option.

  2. There is an option in OpenVPN to specify a tcp proxy as part of the endpoint configuration.

17 - Shoot APIServer via SNI

SNI Passthrough proxy for kube-apiservers

This GEP tackles the problem that today a single LoadBalancer is needed for every single Shoot cluster’s control plane.

Background

When the control plane of a Shoot cluster is provisioned, a dedicated LoadBalancer is created for it. It keeps the entire flow quite easy - the apiserver Pods are running and they are accessible via that LoadBalancer. Its hostnames / IP addresses are used for DNS records like api.<external-domain> and api.<shoot>.<project>.<internal-domain>. While this solution is simple, it comes with several issues.

Motivation

There are several problems with the current setup.

  • IaaS provider costs. For example ClassicLoadBalancer on AWS costs at minimum 17 USD / month.
  • Quotas can limit the amount of LoadBalancers you can get per account / project, limiting the number of clusters you can host under a single account.
  • Lack of support for better loadbalancing algorithms than round-robin.
  • Slow cluster provisioning time - depending on the provider a LoadBalancer provisioning could take quite a while.
  • Lower downtime when workload is shuffled in the clusters as the LoadBalancer is Kubernetes-aware.

Goals

  • Only one LoadBalancer is used for all Shoot cluster API servers running in a Seed cluster.
  • Out-of-cluster (end-user / robot) communication to the API server is still possible.
  • In-cluster communication via the kubernetes master service (IPv4/v6 ClusterIP and the kubernetes.default.svc.cluster.local) is possible.
  • Client TLS authentication works without intermediate TLS termination (TLS is terminated by kube-apiserver).
  • Solution should be cloud-agnostic.

Proposal

Seed cluster

To solve the problem of having multiple kube-apiservers behind a single LoadBalancer, an intermediate proxy must be placed between the Cloud-Provider’s LoadBalancer and the kube-apiservers. This proxy is going to choose the Shoot API Server with the help of Server Name Indication. From Wikipedia:

Server Name Indication (SNI) is an extension to the Transport Layer Security (TLS) computer networking protocol by which a client indicates which hostname it is attempting to connect to at the start of the handshaking process. This allows a server to present multiple certificates on the same IP address and TCP port number and hence allows multiple secure (HTTPS) websites (or any other service over TLS) to be served by the same IP address without requiring all those sites to use the same certificate. It is the conceptual equivalent to HTTP/1.1 name-based virtual hosting, but for HTTPS.

A rough diagram of the flow of data:

+-------------------------------+
|                               |
|           Network LB          | (accessible from clients)
|                               |
|                               |
+-------------+-------+---------+                       +------------------+
              |       |                                 |                  |
              |       |            proxy + lb           | Shoot API Server |
              |       |    +-------------+------------->+                  |
              |       |    |                            | Cluster A        |
              |       |    |                            |                  |
              |       |    |                            +------------------+
              |       |    |
     +----------------v----+--+
     |        |               |
   +-+--------v----------+    |                         +------------------+
   |                     |    |                         |                  |
   |                     |    |       proxy + lb        | Shoot API Server |
   |        Proxy        |    +-------------+---------->+                  |
   |                     |    |                         | Cluster B        |
   |                     |    |                         |                  |
   |                     +----+                         +------------------+
   +----------------+----+
                    |
                    |
                    |                                   +------------------+
                    |                                   |                  |
                    |             proxy + lb            | Shoot API Server |
                    +-------------------+-------------->+                  |
                                                        | Cluster C        |
                                                        |                  |
                                                        +------------------+

Sequentially:

  1. the client requests Shoot Cluster A and sets the Server Name in the TLS handshake to api.shoot-a.foo.bar.
  2. this packet goes through the Network LB and is forwarded to the Proxy server (this load balancer should be a simple Layer-4 TCP proxy).
  3. the proxy server reads the packet and sees that the client requests api.shoot-a.foo.bar.
  4. based on its configuration, it maps api.shoot-a.foo.bar to Shoot API Server Cluster A.
  5. it acts as a TCP proxy and simply sends the data to Shoot API Server Cluster A.
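Steps 3 and 4 can be illustrated with a small sketch that reads the Server Name from a TLS ClientHello without terminating TLS and looks up the upstream. It is not how Envoy implements this; the `ROUTES` table, the upstream name, and the helper names are assumptions made for the example, and the parser only handles a ClientHello that fits into a single TLS record.

```python
import ssl
import struct

# Hypothetical routing table: SNI host name -> upstream (host, port).
ROUTES = {"api.shoot-a.foo.bar": ("kube-apiserver.shoot-a.svc.cluster.local", 443)}


def extract_sni(record: bytes):
    """Return the SNI host name of a TLS ClientHello record, or None."""
    try:
        # Record header: content type 0x16 (handshake), version (2), length (2).
        # Handshake header: message type 0x01 (ClientHello), length (3).
        if record[0] != 0x16 or record[5] != 0x01:
            return None
        pos = 9
        pos += 2 + 32                     # client version and random
        pos += 1 + record[pos]            # session id
        (suites_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + suites_len             # cipher suites
        pos += 1 + record[pos]            # compression methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:             # server_name extension
                # list length (2), name type (1), name length (2), name
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        pass
    return None


def pick_upstream(record: bytes):
    """Map the requested server name to an upstream, or None if unknown."""
    return ROUTES.get(extract_sni(record))


def client_hello(server_name: str) -> bytes:
    """Produce a real ClientHello in memory, for demonstration only."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass                              # the ClientHello is now in `outgoing`
    return outgoing.read()


if __name__ == "__main__":
    print(pick_upstream(client_hello("api.shoot-a.foo.bar")))
```

The proxy only inspects the unencrypted ClientHello; everything after that is forwarded as an opaque byte stream, which is why client TLS authentication keeps working end to end.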

There are multiple OSS proxies for this case.

To ease integration, the chosen proxy should:

  • be configurable via Kubernetes resources
  • not require restarting when configuration changes
  • be fast and with little overhead

All things considered, Envoy proxy is the most fitting solution as it provides all the features Gardener would like (no process reload being the most important one + battle tested in production by various companies).

While building a custom control plane for Envoy is quite simple, an already established solution might be the better path forward. Istio’s Pilot is one of the most feature-complete Envoy control plane solutions as it offers a way to configure edge ingress traffic for Envoy via Gateway and VirtualService.

The resources that need to be created per Shoot cluster are the following:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kube-apiserver-gateway
  namespace: <shoot-namespace>
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: tls
      protocol: TLS
    tls:
      mode: PASSTHROUGH
    hosts:
    - api.<external-domain>
    - api.<shoot>.<project>.<internal-domain>

and a VirtualService pointing to the correct API server:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: kube-apiserver
  namespace: <shoot-namespace>
spec:
  hosts:
  - api.<external-domain>
  - api.<shoot>.<project>.<internal-domain>
  gateways:
  - kube-apiserver-gateway
  tls:
  - match:
    - port: 443
      sniHosts:
      - api.<external-domain>
      - api.<shoot>.<project>.<internal-domain>
    route:
    - destination:
        host: kube-apiserver.<shoot-namespace>.svc.cluster.local
        port:
          number: 443

The resources above configure Envoy to forward the raw TLS data (without termination) to the Shoot kube-apiserver.
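Since both resources differ per Shoot only in the namespace and the host names, they could be rendered from a small template. The following is a sketch under that assumption; the function name `sni_resources` and the sample values are hypothetical, and the field layout simply mirrors the two manifests above.

```python
import json


def sni_resources(shoot_namespace: str, hosts: list) -> tuple:
    """Render the per-Shoot Gateway and VirtualService as plain dictionaries."""
    gateway = {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "Gateway",
        "metadata": {"name": "kube-apiserver-gateway", "namespace": shoot_namespace},
        "spec": {
            "selector": {"istio": "ingressgateway"},
            "servers": [{
                "port": {"number": 443, "name": "tls", "protocol": "TLS"},
                "tls": {"mode": "PASSTHROUGH"},   # no TLS termination in the proxy
                "hosts": list(hosts),
            }],
        },
    }
    virtual_service = {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": "kube-apiserver", "namespace": shoot_namespace},
        "spec": {
            "hosts": list(hosts),
            "gateways": ["kube-apiserver-gateway"],
            "tls": [{
                "match": [{"port": 443, "sniHosts": list(hosts)}],
                "route": [{"destination": {
                    "host": f"kube-apiserver.{shoot_namespace}.svc.cluster.local",
                    "port": {"number": 443},
                }}],
            }],
        },
    }
    return gateway, virtual_service


if __name__ == "__main__":
    # Hypothetical namespace and host name, for illustration only.
    for resource in sni_resources("shoot--foo--bar", ["api.bar.foo.example.com"]):
        print(json.dumps(resource, indent=2))
```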

Updated diagram:

+-------------------------------+
|                               |
|           Network LB          | (accessible from clients)
|                               |
|                               |
+-------------+-------+---------+                       +------------------+
              |       |                                 |                  |
              |       |            proxy + lb           | Shoot API Server |
              |       |    +-------------+------------->+                  |
              |       |    |                            | Cluster A        |
              |       |    |                            |                  |
              |       |    |                            +------------------+
              |       |    |
     +----------------v----+--+
     |        |               |
   +-+--------v----------+    |                         +------------------+
   |                     |    |                         |                  |
   |                     |    |       proxy + lb        | Shoot API Server |
   |    Envoy Proxy      |    +-------------+---------->+                  |
   | (ingress Gateway)   |    |                         | Cluster B        |
   |                     |    |                         |                  |
   |                     +----+                         +------------------+
   +-----+----------+----+
         |          |
         |          |
         |          |                                   +------------------+
         |          |                                   |                  |
         |          |             proxy + lb            | Shoot API Server |
         |          +-------------------+-------------->+                  |
         |   get                                        | Cluster C        |
         | configuration                                |                  |
         |                                              +------------------+
         |
         v                                                  Configure
      +--+--------------+         +---------------------+   via Istio
      |                 |         |                     |   Custom Resources
      |     Pilot       +-------->+   Seed API Server   +<------------------+
      |                 |         |                     |
      |                 |         |                     |
      +-----------------+         +---------------------+

In this case the internal and external DNSEntries should be changed to the Network LoadBalancer’s IP.

In-cluster communication to the apiserver

In Kubernetes the API server is discoverable via the master service (kubernetes in default namespace). Today, this service can only be of type ClusterIP - making in-cluster communication to the API server impossible due to:

  • the client doesn’t set the Server Name in the TLS handshake, if it attempts to talk to an IP address. In this case, the TLS handshake reaches the Envoy IngressGateway proxy, but it’s rejected by it.
  • Kubernetes services can be of type ExternalName, but the master service is not supported by kubelet.
    • even if this is fixed in future Kubernetes versions, this problem still exists for older versions where this functionality is not available.

Another issue occurs when the client tries to talk to the apiserver via the in-cluster DNS. For all Shoot API servers, kubernetes.default.svc.cluster.local is the same, so when a client connects to its API server using that server name, the Envoy IngressGateway cannot distinguish between in-cluster clients of different Shoots.

To mitigate this problem an additional proxy must be deployed on every single Node. It does not terminate TLS and sends the traffic to the correct Shoot API Server. This is achieved by:

  • the apiserver master service reconciler is started and pointing to the kube-apiserver’s Cluster IP in the Seed cluster (e.g. --advertise-address=10.1.2.3).
  • the proxy runs in the host network of the Node.
  • the proxy has a sidecar container which:
    • creates a dummy network interface and assigns the 10.1.2.3 to it.
    • removes connection tracking (conntrack) if iptables/nftables is enabled as the IP address is local to the Node.
  • the proxy listens on the 10.1.2.3 and using the PROXY protocol it sends the data stream to the Envoy ingress gateway (EIGW).
  • EIGW listens for PROXY protocol on a dedicated 8443 port. EIGW reads the destination IP + port from the PROXY protocol and forwards traffic to the correct upstream apiserver.
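The last two bullets can be illustrated with a sketch of the PROXY protocol handling. It assumes the human-readable version 1 header (`PROXY TCP4 <src> <dst> <src-port> <dst-port>`), which the text above does not specify; the `UPSTREAMS` table, the sample addresses, and the function names are hypothetical.

```python
# Hypothetical mapping: (advertised kube-apiserver cluster IP, port) -> upstream.
UPSTREAMS = {
    ("10.1.2.3", 443): "kube-apiserver.shoot--foo--bar.svc.cluster.local:443",
}


def proxy_v1_header(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Header the node-local proxy prepends before forwarding the TLS stream."""
    return f"PROXY TCP4 {src_ip} {dst_ip} {src_port} {dst_port}\r\n".encode("ascii")


def split_proxy_v1(stream: bytes):
    """Return ((dst_ip, dst_port), payload) for a PROXY protocol v1 TCP4 stream."""
    header, separator, payload = stream.partition(b"\r\n")
    parts = header.decode("ascii").split(" ")
    if not separator or len(parts) != 6 or parts[:2] != ["PROXY", "TCP4"]:
        raise ValueError("not a PROXY protocol v1 TCP4 header")
    _, _, _src_ip, dst_ip, _src_port, dst_port = parts
    return (dst_ip, int(dst_port)), payload


def pick_upstream(stream: bytes):
    """Choose the upstream apiserver from the original destination IP and port."""
    destination, payload = split_proxy_v1(stream)
    return UPSTREAMS.get(destination), payload


if __name__ == "__main__":
    # The payload stands in for the untouched TLS bytes from the Pod.
    data = proxy_v1_header("100.96.0.5", 40000, "10.1.2.3", 443) + b"\x16\x03\x01"
    print(pick_upstream(data))
```

The original destination address survives the hop to the seed cluster this way, so the gateway can select the Shoot even though the client sent no usable Server Name.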

The sidecar is a standalone component. It’s possible to transparently change the proxy implementation without any modifications to the sidecar. The simplified flow looks like:

+------------------+                    +----------------+
| Shoot API Server |       TCP          |   Envoy IGW    |
|                  +<-------------------+ PROXY listener |
| Cluster A        |                    |     :8443      |
+------------------+                    +-+--------------+
                                          ^
                                          |
                                          |
                                          |
                                          |
+-----------------------------------------------------------+
                                          |   Single Node in
                                          |   the Shoot cluster
                                          |
                                          | PROXY Protocol
                                          |
                                          |
                                          |
 +---------------------+       +----------+----------+
 |  Pod talking to     |       |                     |
 |  the kubernetes     |       |       Proxy         |
 |  service            +------>+  No TLS termination |
 |                     |       |                     |
 +---------------------+       +---------------------+

Multiple OSS solutions can be used:

  • haproxy
  • nginx

To add a PROXY listener with Istio, several resources must be created - a dedicated Gateway, a dummy VirtualService and an EnvoyFilter which adds a listener filter (envoy.filters.listener.proxy_protocol) on port 8443:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: blackhole
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 8443
      name: tcp
      protocol: TCP
    hosts:
    - "*"

---

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: blackhole
  namespace: istio-system
spec:
  hosts:
  - blackhole.local
  gateways:
  - blackhole
  tcp:
  - match:
    - port: 8443
    route:
    - destination:
        host: localhost
        port:
          number: 9999 # any dummy port will work

---

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: proxy-protocol
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: LISTENER
    match:
      context: ANY
      listener:
        portNumber: 8443
        name: 0.0.0.0_8443
    patch:
      operation: MERGE
      value:
        listener_filters:
        - name: envoy.filters.listener.proxy_protocol

For each individual Shoot cluster, a dedicated FilterChainMatch is added. It ensures that only Shoot API servers can receive traffic from this listener:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: <shoot-namespace>
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: FILTER_CHAIN
    match:
      context: ANY
      listener:
        portNumber: 8443
        name: 0.0.0.0_8443
    patch:
      operation: ADD
      value:
        filters:
        - name: envoy.filters.network.tcp_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
            stat_prefix: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
            cluster: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
        filter_chain_match:
          destination_port: 443
          prefix_ranges:
          - address_prefix: 10.1.2.3 # kube-apiserver's cluster-ip
            prefix_len: 32

Note: this additional EnvoyFilter can be removed when Istio supports full L4 matching.

An nginx proxy client in the Shoot cluster on every node could have the following configuration:

error_log /dev/stdout;
stream {
    server {
        listen 10.1.2.3:443;
        proxy_pass api.<external-domain>:8443;
        proxy_protocol on;

        proxy_protocol_timeout 5s;
        resolver_timeout 5s;
        proxy_connect_timeout 5s;
    }
}

events { }

In-cluster communication to the apiserver when ExternalName is supported

Even if in future versions of Kubernetes, the master service of type ExternalName is supported, we still have the problem that in-cluster workload can talk to the server via DNS. For this to work we still need the above mentioned proxy (this time listening on another IP address 10.0.0.2). An additional change to CoreDNS would be needed:

default.svc.cluster.local.:8053 {
    file kubernetes.default.svc.cluster.local
}

.:8053 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

The content of the kubernetes.default.svc.cluster.local is going to be:

$ORIGIN default.svc.cluster.local.

@	30 IN	SOA local. local. (
        2017042745 ; serial
        1209600    ; refresh (2 weeks)
        1209600    ; retry (2 weeks)
        1209600    ; expire (2 weeks)
        30         ; minimum (30 seconds)
        )

  30 IN NS local.

kubernetes     IN A     10.0.0.2

So when a client requests kubernetes.default.svc.cluster.local, it’ll be sent to the proxy listening on that IP address.

Future work

While out of scope of this GEP, several things can be improved:

  • Make the sidecar work with eBPF and environments where iptables/nftables are not enabled.

References

18 - Shoot CA Rotation

GEP-18: Automated Shoot CA Rotation

Summary

This proposal outlines an on-demand, multi-step approach to rotate all certificate authorities (CA) used in a Shoot cluster. This process includes creating new CAs, invalidating the old ones and recreating all certificates signed by the CAs.

We propose to bundle the rotation of all CAs in the Shoot together as one triggerable action. This includes the recreation and invalidation of the following CAs and all certificates signed by them:

  • Cluster CA (currently used for signing kube-apiserver serving certificates and client certificates)
  • kubelet CA (used for signing client certificates for talking to kubelet API, e.g. kube-apiserver-kubelet)
  • etcd CA (used for signing etcd serving certificates and client certificates)
  • front-proxy CA (used for signing client certificates that kube-aggregator (part of kube-apiserver) uses to talk to extension API servers, filled into extension-apiserver-authentication ConfigMap and read by extension API servers to verify incoming kube-aggregator requests)
  • metrics-server CA (used for signing serving certificates, filled into APIService caBundle field and read by kube-aggregator to verify the presented serving certificate)
  • ReversedVPN CA (used for signing vpn-seed-server serving certificate and vpn-shoot client certificate)

Out of scope for now:

  • kubelet serving CA is self-generated (valid for 1y) and self-signed by kubelet on startup
    • kube-apiserver does not seem to verify the presented serving certificate
    • kubelet can be configured to request serving certificate via CSR that can be verified by kube-apiserver, though, we consider this as a separate improvement outside of this GEP
  • Legacy VPN solution uses the cluster CA for both serving and client certificates. As the solution is soon to be dropped in favor of the new ReversedVPN solution, we don’t intend to introduce a dedicated CA for this component. If ReversedVPN is disabled and the CA rotation is triggered, we make sure to propagate the cluster CA to the relevant places in the legacy VPN solution.

Naturally, not all certificates used for communication with the kube-apiserver are under control of Gardener. An example of a Gardener-controlled certificate is the kubelet client certificate used to communicate with the API server. Examples of credentials not controlled by Gardener are kubeconfigs or client certificates requested via CertificateSigningRequests by the shoot owner.

We propose to use a two-step approach to rotate CAs. The start of each phase is triggered by the shoot owner. In summary, the first phase is used to create new CAs (for example the new API server and client CA). Then we make sure that all servers and clients under Gardener’s control trust both old and new CA. Next we renew all client certificates that are under Gardener’s control so they are now signed by the new CAs. This includes a node rollout in order to propagate the certificates to kubelets and restart all pods. Afterwards the user needs to change their client credentials to trust both old and new cluster CA. In the second phase, we remove all trust to the old CA for servers and clients under Gardener’s control. This does not include a node rollout, but all still running pods using ServiceAccounts will continue to trust the old CA until they restart. Also, the user needs to retrieve the new CA bundle to no longer trust the old CA.

A detailed overview of all steps required for each phase is given in the proposal section of this GEP.

Introducing a new client CA

Currently, client certificates and the kube-apiserver certificate are signed by the same CA. We propose to create a separate client CA when triggering the rotation. The client CA is used to sign certificates of clients talking to the API Server.

Motivation

There are a few reasons for rotating shoot cluster CAs:

  • If we have to invalidate client certificates for the kube-apiserver or any other component, we are forced to rotate the CA. The only way to invalidate them is to stop trusting all client certificates that are signed by the respective CA, as Kubernetes does not support revoking certificates.
  • If the CA itself got leaked.
  • If the CA is about to expire.
  • If a company policy requires rotating a CA after a certain point in time.

In each of those cases we currently need to basically manually recreate and replace all CAs and certificates. The process of rotating by hand is cumbersome and could lead to errors due to the many steps needing to be performed in the right order. By automating the process we want to create a way to securely and easily rotate shoot CAs.

Goals

  • Offer an automated and safe solution to rotate all CAs in a shoot cluster.
  • Offer a process that is easily understandable for developers and users.
  • Rotate the different CAs in the shoot with a similar process to reduce complexity.
  • Add visibility for Shoot owners when the last CA rotation happened

Non-Goals

  • Offer an automated solution for rotating other static credentials (like static token).
    • Later on, a similar two-phase approach could be implemented for the kubeconfig rotation. However, this is out of scope for this enhancement.
  • Creating a process that runs fully automated without shoot owner interaction. As the shoot owner controls some secrets that would probably not even be possible.
  • Forcing the shoot owner to rotate after a certain time period. Our goal rather is to issue long-running certificates and let the user decide depending on their requirements to rotate as needed.
  • Configurable default CA lifetime

Proposal

We will add a new feature gate CARotation for gardener-apiserver and gardenlet, which allows enabling or disabling the possibility to trigger the rotation.

Triggering the CA Rotation

  • Triggered via gardener.cloud/operation annotation in symmetry with other operations like reconciliation, kubeconfig rotation, etc.
    • annotation increases the generation
    • value for triggering first phase: start-ca-rotation
    • value for triggering the second phase: complete-ca-rotation
    • gardener-apiserver performs the necessary validation: user can’t trigger another rotation if one is already in progress, user can’t trigger complete-ca-rotation if the first phase has not been completed, etc.
  • The annotation triggers a usual shoot reconciliation (just like a kubeconfig or SSH key rotation)
  • gardenlet begins the CA rotation sequence by setting the new status section .status.credentials.rotation.certificateAuthorities (probably in updateShootStatusOperationStart) and removes the annotation afterwards
    • shoot reconciliation needs to be idempotent with respect to the CA rotation phase, i.e. if a usual reconciliation or maintenance operation is triggered in between, no new CAs are generated and nothing else happens that would interfere with the CA rotation sequence
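The validation rules for the operation annotation can be sketched as a small state check. This is a simplified model and not the gardener-apiserver implementation: it assumes that the status exposes the current phase (None before the first rotation, otherwise one of the values from the status example in this GEP) plus a hypothetical flag telling whether gardenlet has finished the work of that phase.

```python
START = "start-ca-rotation"
COMPLETE = "complete-ca-rotation"


def validate_operation(operation: str, phase, phase_finished: bool):
    """Return None if the operation is allowed, otherwise the reason for rejecting it.

    phase: None (never rotated), "Prepare", "Finalize" or "Completed".
    phase_finished: hypothetical flag, True once gardenlet finished the current phase.
    """
    if operation == START:
        # A new rotation may only start when no rotation is in progress.
        if phase in (None, "Completed"):
            return None
        return "a CA rotation is already in progress"
    if operation == COMPLETE:
        # The second phase requires a fully executed first phase.
        if phase == "Prepare" and phase_finished:
            return None
        return "the first phase of the CA rotation has not been completed"
    return None  # other operations are not covered by this sketch


if __name__ == "__main__":
    print(validate_operation(START, None, False))         # allowed
    print(validate_operation(COMPLETE, "Prepare", False))  # rejected
```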

Changing the Shoot Status

A new section in the Shoot status is added when the first rotation is triggered:

status:
  credentials:
    rotation:
      certificateAuthorities:
        phase: Prepare # Prepare|Finalize|Completed
        lastCompletion: 2022-02-07T14:23:44Z
    # kubeconfig:
    #   phase:
    #   lastCompletion:

Later on, this section could be augmented with other information like the names of the credentials secrets (e.g. gardener/gardener#1749):

status:
  credentials:
    resources:
    - type: kubeconfig
      kind: Secret
      name: shoot-foo.kubeconfig

Rotation Sequence for Cluster and Client CA

The proposal section includes a detailed description of all steps involved for rotating from a given CA0 to the target CA1.

t0: Today’s situation

  • kube-apiserver uses SERVER CERT signed by CA0 and trusts CLIENT CERTS signed by CA0
  • kube-controller-manager issues new CLIENT CERTS signed by CA0
  • kubeconfig trusts only CA0
  • ServiceAccount secrets trust only CA0
  • kubelet uses CLIENT CERT signed by CA0

t1: Shoot owner triggers first step of CA rotation process (–> phase one is started):

  • Generate CA1
  • Generate CLIENT_CA1
  • Update kube-apiserver, kube-scheduler, etc. to trust CLIENT CERTS signed by both CA0 and CLIENT_CA1 (--client-ca-file flag)
  • Update kube-controller-manager to issue new CLIENT CERTS now with CLIENT_CA1
  • Update kubeconfig so that its CA bundle contains both CA0 and CA1 (if kubeconfig still contains a legacy CLIENT CERT then rotate the kubeconfig)
  • Update generic-token-kubeconfig so that its CA bundle contains both CA0 and CA1
  • Update kube-controller-manager to populate both CA0 and CA1 in ServiceAccount secrets.
  • Restart control plane components so that their CA bundle contains both CA0 and CA1
  • Renew CLIENT CERTS (sign them with CLIENT_CA1) for the following control plane components (Prometheus, DWD, legacy VPN), if not dropped already in the context of gardener/gardener#4661
  • Trigger node rollout
    • This issues new CLIENT CERTS for all kubelets signed by CLIENT_CA1
    • This restarts all Pods and propagates CA0 and CA1 into their mounted ServiceAccount secrets (note that CAs cannot be reloaded by the Go client, therefore we need a restart of Pods.)
  • Ask user to exchange all their client credentials (kubeconfig, CLIENT CERTS issued by CertificateSigningRequests) to trust both CA0 and CA1

t2: Shoot owner triggers second step of CA rotation process (–> phase two is started):

Prerequisite: All Gardener-controlled actions listed in t1 were executed successfully (for example node rollout). The shoot owner has guaranteed that they exchanged their client credentials and triggered step 2 via an annotation.

  • Renew SERVER CERTS (sign them with CA1) for kube-apiserver, kube-controller-manager, cloud-controller-manager etc.
  • Update kube-apiserver, kube-scheduler, etc. to trust only CLIENT CERTS signed by CLIENT_CA1
  • Update kubeconfig so that its CA bundle contains only CA1
  • Update generic-token-kubeconfig so that its CA bundle contains only CA1
  • Update kube-controller-manager to only contain CA1. ServiceAccount secrets created after this point will get secrets that include only CA1
  • Restart control plane components so that their CA bundle contains only CA1
  • Restart kubelets so that the CA bundle in their kubeconfigs contains only CA1
  • Delete CA0
  • Ask user to optionally restart their Pods since they still contain CA0 in memory in order to eliminate trust to the old cluster CA.
  • Ask user to exchange all their client credentials (download kubeconfig containing only CA1; when using CLIENT CERTS trust only CA1)
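Why the intermediate step with two trusted CAs is needed can be checked with a small model of the trust relations at t0, t1 and t2. This is a sketch only: it reduces mutual TLS to "the issuer of the presented certificate is in the peer's trust bundle", and the class names and client labels (`old`, `transitional`, `new`) are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    cert_issuer: str               # CA that signed the serving certificate
    trusted_client_cas: frozenset  # CAs accepted for client certificates


@dataclass(frozen=True)
class Client:
    cert_issuer: str               # CA that signed the client certificate
    ca_bundle: frozenset           # CAs the client trusts for the server


def can_connect(client: Client, server: Server) -> bool:
    """Mutual TLS succeeds only if both sides trust the other's issuer."""
    return (server.cert_issuer in client.ca_bundle
            and client.cert_issuer in server.trusted_client_cas)


# kube-apiserver at each point in time, following the sequence above.
SERVERS = {
    "t0": Server("CA0", frozenset({"CA0"})),
    "t1": Server("CA0", frozenset({"CA0", "CLIENT_CA1"})),
    "t2": Server("CA1", frozenset({"CLIENT_CA1"})),
}

# Client credentials before, during and after the rotation.
CLIENTS = {
    "old": Client("CA0", frozenset({"CA0"})),
    "transitional": Client("CLIENT_CA1", frozenset({"CA0", "CA1"})),
    "new": Client("CLIENT_CA1", frozenset({"CA1"})),
}


def compatibility() -> dict:
    """Return {(client, time): bool} for every combination."""
    return {(c, t): can_connect(CLIENTS[c], SERVERS[t])
            for c in CLIENTS for t in SERVERS}


if __name__ == "__main__":
    for (client, time), ok in sorted(compatibility().items()):
        print(f"{client:>12} @ {time}: {'ok' if ok else 'rejected'}")
```

In this model only the transitional credentials work at both t1 and t2, which is why clients must be exchanged before the second phase is triggered.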

Rotation Sequence of Other CAs

Apart from the kube-apiserver CA (and the client CA), we also use 5 other CAs in the Gardener codebase, as mentioned above. We propose to rotate those CAs together with the kube-apiserver CA following the same trigger.

ℹ️ Note for the front-proxy CA: users need to make sure, extension API servers have reloaded the extension-apiserver-authentication ConfigMap, before triggering the second phase.

You can find Gardener-managed CAs listed here.

Regarding the rotation steps, we want to follow a similar approach to the one we defined for the kube-apiserver CA. As an example, we show the timeline for ETCD_CA, but the logic should be similar for all the CAs listed above.

  • t0
    • etcd trusts client certificates signed by ETCD_CA0 and uses a server certificate signed by ETCD_CA0
    • kube-apiserver and backup-restore use a client certificate signed by ETCD_CA0 and trust ETCD_CA0
  • t1:
    • Generate ETCD_CA1
    • Update etcd to trust CLIENT CERTS signed by both ETCD_CA0 and ETCD_CA1
    • Update kube-apiserver and backup-restore:
      • Adapt CA bundle to trust both ETCD_CA0 and ETCD_CA1
      • Renew CLIENT CERTS (sign them with ETCD_CA1)
  • t2:
    • Update etcd:
      • Trust only CLIENT CERTS signed by ETCD_CA1
      • Renew SERVER CERT (sign it with ETCD_CA1)
    • Update kube-apiserver and backup-restore so that their CA bundle contains only ETCD_CA1

ℹ️ This means that etcd needs to be restarted twice in total.
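
The ordering in this timeline matters: at every intermediate step, etcd and its clients must still be able to complete a mutual TLS handshake. The following Python sketch is a toy model of that reasoning, not Gardener code. The `STEPS` table, the `party` and `handshake_ok` helpers, and the split of t1 and t2 into sub-steps (taken from the order of the bullets above) are assumptions made for illustration.

```python
def handshake_ok(etcd, client):
    """Mutual TLS works only if each side trusts the CA that signed the peer's certificate."""
    return etcd["cert"] in client["trusts"] and client["cert"] in etcd["trusts"]


def party(cert, *trusts):
    """Describe one party by the CA that signed its certificate and by its CA bundle."""
    return {"cert": cert, "trusts": set(trusts)}


OLD, NEW = "ETCD_CA0", "ETCD_CA1"

# (step name, etcd state, kube-apiserver / backup-restore state)
STEPS = [
    ("t0", party(OLD, OLD), party(OLD, OLD)),
    ("t1: etcd trusts both CAs", party(OLD, OLD, NEW), party(OLD, OLD)),
    ("t1: clients trust both CAs, new client certs", party(OLD, OLD, NEW), party(NEW, OLD, NEW)),
    ("t2: etcd trusts only new CA, new server cert", party(NEW, NEW), party(NEW, OLD, NEW)),
    ("t2: clients trust only new CA", party(NEW, NEW), party(NEW, NEW)),
]


def broken_steps(steps):
    """Return the names of all steps in which etcd and its clients could not talk."""
    return [name for name, etcd, client in steps if not handshake_ok(etcd, client)]


if __name__ == "__main__":
    # An empty list means no step of the modelled sequence breaks connectivity.
    print(broken_steps(STEPS))
```

In this model, jumping directly from t0 to the final etcd state, i.e. `handshake_ok(party(NEW, NEW), party(OLD, OLD))`, fails, which is why the intermediate phase with both CAs in the bundle is needed.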

Alternatives

This section presents a different approach to rotating the CAs: temporarily creating a second set of API servers that use the new CA. After presenting the approach, the advantages and disadvantages of both approaches are listed.

t0: Today’s situation

  • kube-apiserver uses SERVER CERT signed by CA0 and trusts CLIENT CERTS signed by CA0
  • kube-controller-manager issues new CLIENT CERTS with CA0
  • kubeconfig contains only CA0
  • ServiceAccount secrets contain only CA0
  • kubelet uses CLIENT CERT signed by CA0

t1: User triggers first step of CA rotation process (–> phase one):

  • Generate CA1
  • Generate CLIENT_CA1
  • Create new DNSRecord, Service, Istio configuration, etc. for second kube-apiserver deployment
  • Deploy second kube-apiserver deployment trusting only CLIENT CERTS signed by CLIENT_CA1 and using SERVER CERT signed by CA1
  • Update kube-scheduler, etc. to trust only CLIENT CERTS signed by CLIENT_CA1 (--client-ca-file flag)
  • Update kube-controller-manager to issue new CLIENT CERTS with CLIENT_CA1
  • Update kubeconfig so that it points to the new DNSRecord and its CA bundle contains only CA1 (if kubeconfig still contains a legacy CLIENT CERT then rotate the kubeconfig)
  • Update ServiceAccount secrets so that their CA bundle contains both CA0 and CA1
  • Restart control plane components so that they point to the second kube-apiserver Service and so that their CA bundle contains only CA1
  • Renew CLIENT CERTS (sign them with CLIENT_CA1) for control plane components (Prometheus, DWD, legacy VPN) and point them to the second kube-apiserver Service
  • Adapt apiserver-proxy-pod-mutator to point KUBERNETES_SERVICE_HOST env variable to second kube-apiserver
  • Trigger node rollout
    • This issues new CLIENT CERTS for all kubelets signed by CLIENT_CA1 and points them to the second DNSRecord
    • This restarts all Pods and propagates CA0 and CA1 into their mounted ServiceAccount secrets
  • Ask user to exchange all their client credentials (kubeconfig, CLIENT CERTS issued by CertificateSigningRequests)

t2: User triggers second step of CA rotation process (–> phase two):

  • Update ServiceAccount secrets so that their CA bundle contains only CA1
  • Update apiserver-proxy to talk to second kube-apiserver
  • Drop first DNSRecord, Service, Istio configuration and first kube-apiserver deployment
  • Drop CA0
  • Ask user to optionally restart their Pods since they still contain CA0 in memory.

Advantages/Disadvantages of the approach with two API servers

  • (+) User needs to adapt client credentials only once
  • (/) Unstable API server domain
  • (-) Probably more implementation effort
  • (-) More complex
  • (-) CA rotation process does not work similarly for all CAs in our system

Advantages/Disadvantages of currently preferred approach (see proposal)

  • (+) Implementation effort seems “straightforward”
  • (+) CA rotation process works similarly for all CAs in our system
  • (/) Stable API server domain
  • (-) User needs to adapt client credentials twice

19 - Utilize API Server Network Proxy to Invert Seed-to-Shoot Connectivity

Utilize API Server Network Proxy to Invert Seed-to-Shoot Connectivity

Problem

Gardener’s architecture for Kubernetes clusters relies on having the control-plane (e.g., kube-apiserver, kube-scheduler, kube-controller-manager, etc.) and the data-plane (e.g., kube-proxy, kubelet, etc.) of the cluster residing in separate places. This provides many benefits but poses some challenges, especially when communication from the API server to system components is required. This problem is solved today in Gardener by using OpenVPN to establish a VPN connection from the seed to the shoot. To do so, the following steps are required:

  • Create a Loadbalancer service on the shoot.
  • Add a sidecar to the API server pod which knows the address of the newly created Loadbalancer.
  • Establish a connection over the internet to the VPN Loadbalancer.
  • Install additional iptables rules that redirect all the IPs of the shoot (i.e., service, pod, node CIDRs) to the established VPN tunnel.

There are, however, quite a few problems with the above approach. Here are some:

  • Every shoot would require an additional loadbalancer, which adds overhead in terms of both costs and troubleshooting efforts.
  • Private access use-cases would not be possible without having a seed residing in the same private domain as a hard requirement. For example, have a look at this issue.
  • Providing a public endpoint to access components in the shoot poses a security risk.

Proposal

There are multiple ways to tackle the directional connectivity issue mentioned above. One way would be to invert the connection between the API server and the system components, i.e., instead of having the API server sidecar establish a tunnel, we would have an agent residing in the shoot cluster initiate the connection itself. This way we don’t need a Loadbalancer for every shoot and, from the security perspective, there is no ingress from outside, only controlled egress.

We want to replace this:

APIServer | VPN-seed ---> internet ---> LB --> VPN-Shoot (4314) --> Pods | Nodes | Services

With this:

APIServer <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

API Server Network Proxy

To solve this issue, we can utilize the upstream apiserver-network-proxy project, which provides a reference implementation for a reverse streaming server. It works as follows:

  • Proxy agent connects to proxy server to establish a sticky connection.
  • Traffic to the proxy server (residing in the seed) is then redirected to the agent (residing in the shoot), which forwards the traffic to in-cluster components.
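
The two steps above can be illustrated with a small in-memory Python model. It is only a sketch of the connection direction and does not use the real apiserver-network-proxy API or any networking. The class names `ProxyServer` and `ProxyAgent` and their methods are hypothetical.

```python
class ProxyServer:
    """Lives in the seed; it never dials the shoot itself."""

    def __init__(self):
        self._agent = None

    def accept(self, agent):
        """Called when an agent dials in; the connection is kept for later use."""
        self._agent = agent

    def forward(self, target, request):
        """Push a request from the API server down the agent-initiated connection."""
        if self._agent is None:
            raise ConnectionError("no proxy agent has connected yet")
        return self._agent.deliver(target, request)


class ProxyAgent:
    """Lives in the shoot; it initiates the connection and reaches in-cluster targets."""

    def __init__(self, in_cluster_targets):
        # Maps a target name to a handler function standing in for a pod, node or service.
        self._targets = in_cluster_targets

    def connect(self, server):
        """Outbound dial from the shoot to the seed."""
        server.accept(self)

    def deliver(self, target, request):
        """Hand the request to the in-cluster target and return its reply."""
        return self._targets[target](request)
```

Until an agent has called `connect`, `forward` raises `ConnectionError`, which mirrors the fact that the seed has no way of reaching into the shoot on its own.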

The initial motivation for the apiserver-network-proxy project is to get rid of provider-specific implementations that reside in the API-server (e.g., SSH), but it turns out that it has other interesting use-cases such as data-plane connection decoupling, which is the main use-case for this proposal.

Starting with Kubernetes 1.18, it’s possible to use the --egress-selector-config-file flag, which points the API server to traffic hook points based on traffic direction. For example, with the config below, the API server forwards all cluster-related traffic (e.g., logs, port-forward, exec, etc.) to the proxy-server, which then knows how to forward traffic to the shoot. For the rest of the traffic, e.g., API server to etcd or other control-plane components, the direct protocol is used, which means the legacy routing method, i.e., bypassing the proxy.

  egress-selector-configuration.yaml: |-
    apiVersion: apiserver.k8s.io/v1alpha1
    kind: EgressSelectorConfiguration
    egressSelections:
    - name: cluster
      connection:
        proxyProtocol: HTTPConnect
        transport:
          tcp:
            url: https://proxy-server:8131
    - name: master
      connection:
        proxyProtocol: Direct
    - name: etcd
      connection:
        proxyProtocol: Direct
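
To make the effect of this configuration more concrete, here is a minimal Python sketch of the routing decision it describes. It is not how kube-apiserver implements egress selection. The `EGRESS_SELECTIONS` dictionary simply mirrors the config above, and the `next_hop` helper and the example destinations are hypothetical.

```python
# Mirrors the egress selections from the config above.
EGRESS_SELECTIONS = {
    "cluster": {"proxyProtocol": "HTTPConnect", "url": "https://proxy-server:8131"},
    "master": {"proxyProtocol": "Direct"},
    "etcd": {"proxyProtocol": "Direct"},
}


def next_hop(egress_name, destination):
    """Return where a connection is dialed first for the given egress selection."""
    selection = EGRESS_SELECTIONS[egress_name]
    # Compared case-insensitively in this sketch only, to keep it tolerant of spelling.
    if selection["proxyProtocol"].lower() == "direct":
        return destination  # legacy routing: dial the destination itself
    return selection["url"]  # tunnel through the proxy server


if __name__ == "__main__":
    print(next_hop("cluster", "10.250.0.4:10250"))
    print(next_hop("etcd", "etcd-main-client:2379"))
```

With this model, traffic for the `cluster` selection (for example a kubelet address) is first sent to the proxy server, while `master` and `etcd` traffic is dialed directly.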

Challenges

Prometheus to Shoot connectivity

One challenge remains before the need for a VPN connection can be completely eliminated. In today’s Gardener setup, each control-plane has a Prometheus instance that directly scrapes cluster components such as CoreDNS, kubelets, cAdvisor, etc. This works because, in addition to the VPN sidecar attached to the API server pod, we have another one attached to Prometheus which knows how to forward traffic to these endpoints. Once the VPN is eliminated, other means are required to forward traffic to these components.

Possible Solutions

There are currently two ways to solve this problem:

  • Attach a port-forwarder sidecar to Prometheus.
  • Utilize the proxy subresource on the API server.

Port-forwarder Sidecar

With this solution, each Prometheus instance would have a sidecar that has the kubeconfig of the shoot cluster and establishes a port-forward connection to the endpoints residing in the shoot.

There are many problems with this approach:

  • the port-forward connection is not reliable.
  • the connection would break if the API server instance dies.
  • requires an additional component.
  • would need to expose every pod / service via port-forward.

Prom Pod (Prometheus -> Port-forwarder) <-> APIServer -> Proxy-server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

Proxy Client Sidecar

Another solution would be to implement a proxy client as a sidecar for every component that wishes to communicate with the shoot cluster. For this to work, a means of redirecting the component’s traffic through that proxy is necessary (e.g., additional iptables rules).

Prometheus Pod (Prometheus -> Proxy) <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

The problem with this approach is that it requires an additional sidecar (along with traffic redirection) to be attached to every client that wishes to communicate with the shoot cluster. This can cause:

  • additional maintenance efforts (extra code).
  • other side-effects (e.g., if Istio sidecar injection is enabled).

Proxy sub-resource

Kubernetes supports proxying requests to nodes, services, and pod endpoints in the shoot cluster. This proxy connection can be utilized for scraping the necessary endpoints in the shoot.

This approach requires fewer components and is more reliable than the port-forward solution. However, it relies on the API server supporting proxied connections for the required endpoints.

Prometheus <-> APIServer <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services

As simple as it is, it has the downside that it relies on the availability of the API server.
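
As a rough illustration of what scraping through the proxy subresource involves, the following Python sketch builds the request paths that the Kubernetes API server exposes for proxying to nodes, pods and services. The helper names and the example node, namespace and port values are hypothetical. How Prometheus would be configured to use such paths is outside the scope of this sketch.

```python
def node_proxy_path(node, path):
    """Path for proxying a request to a node (i.e. its kubelet) via the API server."""
    return f"/api/v1/nodes/{node}/proxy/{path.lstrip('/')}"


def pod_proxy_path(namespace, pod, port, path):
    """Path for proxying a request to a port of a pod via the API server."""
    return f"/api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/{path.lstrip('/')}"


def service_proxy_path(namespace, service, port, path):
    """Path for proxying a request to a port of a service via the API server."""
    return f"/api/v1/namespaces/{namespace}/services/{service}:{port}/proxy/{path.lstrip('/')}"


if __name__ == "__main__":
    # Example: the cAdvisor metrics of a node, reached through the API server.
    print(node_proxy_path("node-1", "/metrics/cadvisor"))
    # Example: the metrics port of a CoreDNS pod (pod name and port are made up).
    print(pod_proxy_path("kube-system", "coredns-abc12", 9153, "/metrics"))
```

Every such request terminates at the API server first, which is exactly why this option depends on the API server being available.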

Proxy-server Loadbalancer Sharing and Re-advertising

With the proxy-server in place, we need to provide means to enable the proxy-agent in the shoot to establish the connection with the server. As a result, we need to provide a public endpoint through which this channel of communication can be established, i.e., we need one or more Loadbalancers.

Possible Solution

Using one Loadbalancer per proxy server would not make sense, since this is the pain point we are trying to eliminate in the first place; doing so just moves the costs to the control-plane. A possible solution is to communicate over a shared loadbalancer in the seed, similar to what has been proposed here. This way we can avoid the extra costs for loadbalancers.

With this in mind, we still have other pain-points, namely:

  • Advertising Loadbalancer public IPs to the shoot.
  • Directing the traffic to the corresponding shoot proxy-server.

For advertising the Loadbalancer IP, a DNS entry can be created for the proxy loadbalancer (or re-use the DNS entry for the SNI proxy), along with necessary certificates, which is then used to connect to the loadbalancer. At this point we can decide on either one of the two approaches:

  1. One proxy server per API server, with a shared loadbalancer.
  2. Use one proxy server for all agents.

In the first case, we will probably need a proxy in front of the proxy servers that knows how to direct traffic to the correct proxy server based on the corresponding shoot cluster. In the second case, we don’t need another proxy if the proxy server is cluster-aware, i.e., can pool and identify connections coming from the same cluster and peer them with the correct API server. Unfortunately, the second case is not supported today.
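
For the first case, the routing decision behind the shared loadbalancer could look roughly like the following Python sketch, which picks a proxy server based on the TLS server name (SNI) that the agent connects to. The hostnames, the in-seed addresses and the `route_by_sni` helper are purely hypothetical and only illustrate the lookup. They are not taken from Gardener.

```python
# Hypothetical mapping from the server name an agent connects to, to the
# in-seed address of the proxy server that belongs to that shoot.
PROXY_SERVERS_BY_SNI = {
    "proxy.shoot-a.example.invalid": "proxy-server.shoot--a.svc:8132",
    "proxy.shoot-b.example.invalid": "proxy-server.shoot--b.svc:8132",
}


def route_by_sni(server_name):
    """Return the proxy server address for the given TLS server name."""
    # DNS names are case-insensitive and may carry a trailing dot.
    key = server_name.lower().rstrip(".")
    try:
        return PROXY_SERVERS_BY_SNI[key]
    except KeyError:
        raise LookupError(f"no proxy server registered for {server_name!r}") from None
```

Because the decision is based only on the server name, all shoots can share one public IP address, and each agent only needs to know its own DNS name.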

Summary

  • The API server network proxy can be utilized to invert the connection (only for clusters >= 1.18; for older clusters the old VPN solution will remain).
  • This is achieved by utilizing the --egress-selector-config-file flag on the API server.
  • For monitoring endpoints, the proxy subresource would be the preferred method, but in the future we can also support sidecar proxies that can communicate with the proxy-server.
  • For directing traffic to the correct proxy-server, we will re-use the SNI proxy along with the load-balancer from the shoot API server via SNI GEP.