Proposals
Gardener Enhancement Proposal (GEP)
Changes to the Gardener code base are often incorporated directly via pull requests which either contain a description of the motivation and scope of a change themselves or link a GitHub issue that does.
If a prospective feature has a bigger extent, requires the involvement of several parties, or needs more discussion before the actual implementation can be started, you may consider filing a pull request with a Gardener Enhancement Proposal (GEP) first.
GEPs are a measure to propose a change or to add a feature to Gardener; they help you to describe the change(s) conceptually and to list the steps that are necessary to reach this goal. A GEP helps the Gardener maintainers as well as the community to understand the motivation and scope of your proposed change(s) and encourages their contribution to discussions and future pull requests. If you are familiar with the Kubernetes community, GEPs are analogous to Kubernetes Enhancement Proposals (KEPs).
Reasons for a GEP
You may consider filing a GEP for the following reasons:
- A Gardener architectural change is intended / necessary
- Major changes to the Gardener code base
- A phased implementation approach is expected because of the widespread scope of the change
- Your proposed changes may be controversial
We encourage you to take a look at already merged GEPs since they give you a sense of what a typical GEP comprises.
Before creating a GEP
Before starting your work and creating a GEP, please take some time to familiarize yourself with our
general Gardener Contribution Guidelines.
It is recommended to discuss and outline the motivation of your prospective GEP as a draft with the community before you invest in writing the actual GEP. This early briefing helps the broader community understand your idea and gets you fast feedback from the respective experts.
An appropriate format for this may be the regular Gardener community meetings.
How to file a GEP
GEPs should be created as Markdown (.md) files and are submitted through a GitHub pull request to their current home in docs/proposals. Please use the provided template or follow the structure of existing GEPs, which makes reviewing easier and faster. Additionally, please link the new GEP in our documentation index.
If not already done, please present your GEP in the regular community meetings to brief the community about your proposal (we strive for personal communication :) ). Also consider that this may be an important step to raise awareness and understanding for everyone involved.
The GEP template contains a small set of metadata, which is helpful for keeping track of the enhancement
in general and especially of who is responsible for implementing and reviewing PRs that are part of
the enhancement.
Main Reviewers
Apart from general metadata, the GEP should name at least one “main reviewer”.
You can find a main reviewer for your GEP either when discussing the proposal in the community meeting, by asking in our
Slack Channel, or at the latest during the GEP PR review.
New GEPs should only be accepted once at least one main reviewer is nominated/assigned.
The main reviewers are charged with the following tasks:
- familiarizing themselves with the details of the proposal
- reviewing the GEP PR itself and any further updates to the document
- discussing design details and clarifying implementation questions with the author before and after
the proposal was accepted
- reviewing PRs related to the GEP in-depth
Other community members are of course also welcome to help the GEP author, review their work, and raise
general concerns with the enhancement. Nevertheless, the main reviewers are supposed to focus on more
in-depth reviews and to accompany the whole GEP process end-to-end, which yields more high-quality
reviews and faster feedback cycles than having more people looking at the process with lower priority
and less focus.
GEP Process
- Pre-discussions about GEP (if necessary)
- Find a main reviewer for your enhancement
- GEP is filed through GitHub PR
- Presentation in Gardener community meeting (if possible)
- Review of GEP from maintainers/community
- GEP is merged if accepted
- Implementation of GEP
- Consider keeping the GEP up-to-date in case the implementation deviates significantly
01 Extensibility
Gardener extensibility and extraction of cloud-specific/OS-specific knowledge (#308, #262)
Summary
Gardener has evolved into a large compound of packages containing lots of highly specific knowledge, which makes it very hard to extend (supporting a new cloud provider, a new OS, …, or behaving differently depending on the underlying infrastructure).
This proposal aims to move out the cloud-specific implementations (called “(cloud) botanists”) and the OS-specifics into dedicated controllers, and simultaneously to allow deviation from the standard Gardener deployment.
Motivation
Currently, it is too hard to support additional cloud providers or operating systems/distributions as everything must be done in-tree, which might affect the implementation of other cloud providers as well.
The various conditions and branches make the code hard to maintain and hard to test.
Every change must be done centrally, requires completely rebuilding Gardener, and cannot be deployed individually. Similar to the motivation for Kubernetes to extract their cloud-specifics into dedicated cloud-controller-managers, or to extract the container/storage/network/… specifics into CRI/CSI/CNI/…, we aim to do the same right now.
Goals
- Gardener does not contain any cloud-specific knowledge anymore but defines a clear contract allowing external controllers (botanists) to support different environments (AWS, Azure, GCP, …).
- Gardener does not contain any operating system-specific knowledge anymore but defines a clear contract allowing external controllers to support different operating systems/distributions (CoreOS, SLES, Ubuntu, …).
- It shall become much easier to move control planes of Shoot clusters between Seed clusters (#232) which is a necessary requirement of an automated setup for the Gardener Ring (#233).
Non-Goals
- We want to also factor out the specific knowledge of the addon deployments (nginx-ingress, kubernetes-dashboard, …), but we already have dedicated projects/issues for that: https://github.com/gardener/bouquet and #246. We will keep the addons in-tree as part of this proposal and tackle their extraction separately.
- We do not want to make Gardener a plain workflow engine that just executes a given template (which would indeed allow it to be generic, open, and extensible in the highest form, but which would end up building a “programming/scripting language” inside a serialization format (YAML/JSON/…)). Rather, we want to have well-defined contracts and APIs, keeping Gardener responsible for cluster management.
Proposal
Gardener heavily relies on and implements Kubernetes principles, and its ultimate strategy is to use Kubernetes wherever applicable.
The extension concept in Kubernetes is based on (next to others) CustomResourceDefinitions, ValidatingWebhookConfigurations and MutatingWebhookConfigurations, and InitializerConfigurations.
Consequently, Gardener’s extensibility concept relies on these mechanisms.
Instead of implementing all aspects directly in Gardener it will deploy some CRDs to the Seed cluster which will be watched by dedicated controllers (also running in the Seed clusters), each one implementing one aspect of cluster management. This way one complex strongly coupled Gardener implementation covering all infrastructures is decomposed into a set of loosely coupled controllers implementing aspects of APIs defined by Gardener.
Gardener will just wait until the controllers report that they are done (or have faced an error) in the CRD’s .status field instead of doing the respective tasks itself.
We will have one specific CRD for every specific operation (e.g., DNS, infrastructure provisioning, machine cloud config generation, …).
However, there are also parts inside Gardener which can be handled generically (not by cloud botanists) because they are the same or very similar for all the environments.
One example of those is the deployment of a Namespace in the Seed which will run the Shoot’s control plane.
Another one is the deployment of a Service for the Shoot’s kube-apiserver.
In case a cloud botanist needs to cooperate and react on those operations it should register a ValidatingWebhookConfiguration, a MutatingWebhookConfiguration, or an InitializerConfiguration.
With this approach it can validate, modify, or react on any resource created by Gardener to make it cloud infrastructure specific.
The webhooks should be registered with failurePolicy=Fail to ensure that a request made by Gardener fails if the respective webhook is not available.
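For illustration, a minimal sketch of such a registration, here for the kube-apiserver Service deployed by Gardener (the object names, the webhook service reference, and the exact rule set are hypothetical):
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-botanist # hypothetical name
webhooks:
- name: services.aws.extensions.gardener.cloud
  failurePolicy: Fail # requests made by Gardener fail if the webhook is unavailable
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["services"]
  clientConfig:
    service:
      name: aws-botanist # hypothetical webhook service running in the Seed
      namespace: extension-aws
      path: /webhooks/service
    caBundle: <base64-encoded-ca>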
Modification of existing CloudProfile and Shoot resources
We will introduce the new API group gardener.cloud:
CloudProfiles
---
apiVersion: gardener.cloud/v1alpha1
kind: CloudProfile
metadata:
name: aws
spec:
type: aws
# caBundle: |
# -----BEGIN CERTIFICATE-----
# ...
# -----END CERTIFICATE-----
dnsProviders:
- type: aws-route53
- type: unmanaged
kubernetes:
versions:
- 1.12.1
- 1.11.0
- 1.10.5
machineTypes:
- name: m4.large
cpu: "2"
gpu: "0"
memory: 8Gi
# storage: 20Gi # optional (not needed in every environment, may only be specified if no volumeTypes have been specified)
...
volumeTypes: # optional (not needed in every environment, may only be specified if no machineType has a `storage` field)
- name: gp2
class: standard
- name: io1
class: premium
providerConfig:
apiVersion: aws.cloud.gardener.cloud/v1alpha1
kind: CloudProfileConfig
constraints:
minimumVolumeSize: 20Gi
machineImages:
- name: coreos
regions:
- name: eu-west-1
ami: ami-32d1474b
- name: us-east-1
ami: ami-e582d29f
zones:
- region: eu-west-1
zones:
- name: eu-west-1a
unavailableMachineTypes: # list of machine types defined above that are not available in this zone
- name: m4.large
unavailableVolumeTypes: # list of volume types defined above that are not available in this zone
- name: gp2
- name: eu-west-1b
- name: eu-west-1c
Shoots
apiVersion: gardener.cloud/v1alpha1
kind: Shoot
metadata:
name: johndoe-aws
namespace: garden-dev
spec:
cloudProfileName: aws
secretBindingName: core-aws
cloud:
type: aws
region: eu-west-1
providerConfig:
apiVersion: aws.cloud.gardener.cloud/v1alpha1
kind: InfrastructureConfig
networks:
vpc: # specify either 'id' or 'cidr'
# id: vpc-123456
cidr: 10.250.0.0/16
internal:
- 10.250.112.0/22
public:
- 10.250.96.0/22
workers:
- 10.250.0.0/19
zones:
- eu-west-1a
workerPools:
- name: pool-01
# Taints, labels, and annotations are not yet implemented. This requires interaction with the machine-controller-manager, see
# https://github.com/gardener/machine-controller-manager/issues/174. It is only mentioned here as future proposal.
# taints:
# - key: foo
# value: bar
# effect: PreferNoSchedule
# labels:
# - key: bar
# value: baz
# annotations:
# - key: foo
# value: hugo
machineType: m4.large
volume: # optional, not needed in every environment, may only be specified if the referenced CloudProfile contains the volumeTypes field
type: gp2
size: 20Gi
providerConfig:
apiVersion: aws.cloud.gardener.cloud/v1alpha1
kind: WorkerPoolConfig
machineImage:
name: coreos
ami: ami-d0dcef3
zones:
- eu-west-1a
minimum: 2
maximum: 2
maxSurge: 1
maxUnavailable: 0
kubernetes:
version: 1.11.0
...
dns:
provider: aws-route53
domain: johndoe-aws.garden-dev.example.com
maintenance:
timeWindow:
begin: 220000+0100
end: 230000+0100
autoUpdate:
kubernetesVersion: true
backup:
schedule: "*/5 * * * *"
maximum: 7
addons:
kube2iam:
enabled: false
kubernetes-dashboard:
enabled: true
cluster-autoscaler:
enabled: true
nginx-ingress:
enabled: true
loadBalancerSourceRanges: []
kube-lego:
enabled: true
email: john.doe@example.com
ℹ The specifications for the other cloud providers that Gardener already has an implementation for look similar.
CRD definitions and workflow adaptation
In the following we are outlining the CRD definitions which define the API between Gardener and the dedicated controllers.
After that we will take a look at the current reconciliation/deletion flow and describe how it would look if we implemented this proposal.
Custom resource definitions
Every CRD has a .spec.type field containing the respective instance of the dimension the CRD represents, e.g., the cloud provider, the DNS provider, or the operating system name.
Moreover, the .status field must contain
- observedGeneration (int64), a field indicating on which generation the controller last worked.
- state (*runtime.RawExtension), a field which is not interpreted by Gardener but persisted; it should be treated as opaque and only be used by the respective CRD-specific controller (it can store anything it needs to re-construct its own state).
- lastError (object), a field which is optional and only present if the last operation ended with an error state.
- lastOperation (object), a field which always exists and which indicates what the last operation of the controller was.
- conditions (list), a field allowing the controller to report health checks for its area of responsibility.
Some CRDs might have a .spec.providerConfig or a .status.providerStatus field containing controller-specific information that is treated as opaque by Gardener and will only be copied to dependent or depending CRDs.
DNS records
Every Shoot needs two DNS records (or three, depending on whether the nginx-ingress addon is enabled): one so-called “internal” record that Gardener uses in the kubeconfigs of the Shoot cluster’s system components, and one so-called “external” record which is used in the kubeconfig provided to the user.
---
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSProvider
metadata:
name: alicloud
namespace: default
spec:
type: alicloud-dns
secretRef:
name: alicloud-credentials
domains:
include:
- my.own.domain.com
---
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
name: dns
namespace: default
spec:
dnsName: dns.my.own.domain.com
ttl: 600
targets:
- 8.8.8.8
status:
observedGeneration: 4
state: some-state
lastError:
lastUpdateTime: 2018-04-04T07:08:51Z
description: some-error message
codes:
- ERR_UNAUTHORIZED
lastOperation:
lastUpdateTime: 2018-04-04T07:24:51Z
progress: 70
type: Reconcile
state: Processing
description: Currently provisioning ...
conditions:
- lastTransitionTime: 2018-07-11T10:18:25Z
message: DNS record has been created and is available.
reason: RecordResolvable
status: "True"
type: Available
propagate: false
providerStatus:
apiVersion: aws.extensions.gardener.cloud/v1alpha1
kind: DNSStatus
...
Infrastructure provisioning
The Infrastructure CRD contains the information about VPC, networks, security groups, availability zones, …, basically, everything that needs to be prepared before actual VMs/load balancers/… can be provisioned.
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
name: infrastructure
namespace: shoot--core--aws-01
spec:
type: aws
providerConfig:
apiVersion: aws.extensions.gardener.cloud/v1alpha1
kind: InfrastructureConfig
networks:
vpc:
cidr: 10.250.0.0/16
internal:
- 10.250.112.0/22
public:
- 10.250.96.0/22
workers:
- 10.250.0.0/19
zones:
- eu-west-1a
dns:
apiserver: api.aws-01.core.example.com
region: eu-west-1
secretRef:
name: my-aws-credentials
sshPublicKey: |
base64(key)
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
providerStatus:
apiVersion: aws.extensions.gardener.cloud/v1alpha1
kind: InfrastructureStatus
vpc:
id: vpc-1234
subnets:
- id: subnet-acbd1234
name: workers
zone: eu-west-1
securityGroups:
- id: sg-xyz12345
name: workers
iam:
nodesRoleARN: <some-arn>
instanceProfileName: foo
ec2:
keyName: bar
Backup infrastructure provisioning
The BackupInfrastructure CRD in the Seeds tells the cloud-specific controller to prepare a blob store bucket/container which can later be used to store etcd backups.
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupInfrastructure
metadata:
name: etcd-backup
namespace: shoot--core--aws-01
spec:
type: aws
region: eu-west-1
storageContainerName: asdasjndasd-1293912378a-2213
secretRef:
name: my-aws-credentials
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
Cloud config (user-data) for bootstrapping machines
Gardener will continue to keep the knowledge about the content of the cloud config scripts, but it will hand it over to the respective OS-specific controller which will generate the specific valid representation.
Gardener creates two MachineCloudConfig CRDs, one for the cloud-config-downloader (which will later flow into the WorkerPool CRD) and one for the real cloud-config (which will be stored as a Secret in the Shoot’s kube-system namespace, and downloaded and executed by the cloud-config-downloader on the machines).
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: MachineCloudConfig
metadata:
name: pool-01-downloader
namespace: shoot--core--aws-01
spec:
type: CoreOS
units:
- name: cloud-config-downloader.service
command: start
enable: true
content: |
[Unit]
Description=Downloads the original cloud-config from Shoot API Server and executes it
After=docker.service docker.socket
Wants=docker.socket
[Service]
Restart=always
RestartSec=30
EnvironmentFile=/etc/environment
ExecStart=/bin/sh /var/lib/cloud-config-downloader/download-cloud-config.sh
files:
- path: /var/lib/cloud-config-downloader/credentials/kubeconfig
permissions: 0644
content:
secretRef:
name: cloud-config-downloader
dataKey: kubeconfig
- path: /var/lib/cloud-config-downloader/download-cloud-config.sh
permissions: 0644
content:
inline:
encoding: b64
data: IyEvYmluL2Jhc2ggL...
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
cloudConfig: | # base64-encoded
#cloud-config
coreos:
update:
reboot-strategy: off
units:
- name: cloud-config-downloader.service
command: start
enable: true
content: |
[Unit]
Description=Downloads the original cloud-config from Shoot API Server and execute it
After=docker.service docker.socket
Wants=docker.socket
[Service]
Restart=always
RestartSec=30
...
ℹ The cloud-config-downloader script does not only download the cloud-config initially but at regular intervals, e.g., every 30s.
If it sees an updated cloud-config then it applies it again by reloading and restarting all systemd units in order to reflect the changes.
The way this reloading of the cloud-config happens is OS-specific as well and no longer known to Gardener; however, it must already be part of the script.
On CoreOS, you have to execute /usr/bin/coreos-cloudinit --from-file=<path>, whereas on SLES you execute cloud-init --file <path> single -n write_files --frequency=once.
As Gardener doesn’t know these commands, it will write a placeholder expression instead (e.g., {RELOAD-CLOUD-CONFIG-WITH-PATH:<path>}), and the OS-specific controller is asked to replace it with the proper expression.
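To make this concrete, a sketch of the substitution (the script path shown is hypothetical):
# written by Gardener into the downloader script (OS-agnostic placeholder):
{RELOAD-CLOUD-CONFIG-WITH-PATH:/var/lib/cloud-config-downloader/cloud-config}
# after substitution by the CoreOS-specific controller:
/usr/bin/coreos-cloudinit --from-file=/var/lib/cloud-config-downloader/cloud-config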
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: MachineCloudConfig
metadata:
name: pool-01-original # stored as secret and downloaded later
namespace: shoot--core--aws-01
spec:
type: CoreOS
units:
- name: docker.service
drop-ins:
- name: 10-docker-opts.conf
content: |
[Service]
Environment="DOCKER_OPTS=--log-opt max-size=60m --log-opt max-file=3"
- name: docker-monitor.service
command: start
enable: true
content: |
[Unit]
Description=Docker-monitor daemon
After=kubelet.service
[Service]
Restart=always
EnvironmentFile=/etc/environment
ExecStart=/opt/bin/health-monitor docker
- name: kubelet.service
command: start
enable: true
content: |
[Unit]
Description=kubelet daemon
Documentation=https://kubernetes.io/docs/admin/kubelet
After=docker.service
Wants=docker.socket rpc-statd.service
[Service]
Restart=always
RestartSec=10
EnvironmentFile=/etc/environment
ExecStartPre=/bin/docker run --rm -v /opt/bin:/opt/bin:rw k8s.gcr.io/hyperkube:v1.11.2 cp /hyperkube /opt/bin/
ExecStartPre=/bin/sh -c 'hostnamectl set-hostname $(cat /etc/hostname | cut -d '.' -f 1)'
ExecStart=/opt/bin/hyperkube kubelet \
--allow-privileged=true \
--bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig-bootstrap \
...
files:
- path: /var/lib/kubelet/ca.crt
permissions: 0644
content:
secretRef:
name: ca-kubelet
dataKey: ca.crt
- path: /var/lib/cloud-config-downloader/download-cloud-config.sh
permissions: 0644
content:
inline:
encoding: b64
data: IyEvYmluL2Jhc2ggL...
- path: /etc/sysctl.d/99-k8s-general.conf
permissions: 0644
content:
inline:
data: |
vm.max_map_count = 135217728
kernel.softlockup_panic = 1
kernel.softlockup_all_cpu_backtrace = 1
...
- path: /opt/bin/health-monitor
permissions: 0755
content:
inline:
data: |
#!/bin/bash
set -o nounset
set -o pipefail
function docker_monitoring {
...
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
cloudConfig: ...
Cloud-specific controllers which might need to add another kernel option or another flag to the kubelet, maybe even another file to the disk, can register a MutatingWebhookConfiguration for that resource and modify it upon creation/update.
The task of the MachineCloudConfig controller is only to generate the OS-specific cloud-config based on the .spec field, but not to add or change any logic related to Shoots.
Worker pools definition
For every worker pool defined in the Shoot, Gardener will create a WorkerPool CRD which shall be picked up by a cloud-specific controller and be translated into MachineClasses and MachineDeployments.
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: WorkerPool
metadata:
name: pool-01
namespace: shoot--core--aws-01
spec:
cloudConfig: base64(downloader-cloud-config)
infrastructureProviderStatus:
apiVersion: aws.extensions.gardener.cloud/v1alpha1
kind: InfrastructureStatus
vpc:
id: vpc-1234
subnets:
- id: subnet-acbd1234
name: workers
zone: eu-west-1
securityGroups:
- id: sg-xyz12345
name: workers
iam:
nodesRoleARN: <some-arn>
instanceProfileName: foo
ec2:
keyName: bar
providerConfig:
apiVersion: aws.cloud.gardener.cloud/v1alpha1
kind: WorkerPoolConfig
machineImage:
name: CoreOS
ami: ami-d0dcef3b
machineType: m4.large
volumeType: gp2
volumeSize: 20Gi
zones:
- eu-west-1a
region: eu-west-1
secretRef:
name: my-aws-credentials
minimum: 2
maximum: 2
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
Generic resources
Some components are cloud-specific and must be deployed by the cloud-specific botanists.
Others might need to deploy another pod next to the Shoot’s control plane or must do something else.
Some of these might be important for a functional cluster (e.g., the cloud-controller-manager, or a CSI plugin in the future), and controllers should be able to report errors back to the user.
Consequently, in order to trigger the controllers to deploy these components, Gardener would write a Generic CRD to the Seed.
No single operation depends on the status of these resources; however, the entire reconciliation flow does.
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Generic
metadata:
name: cloud-components
namespace: shoot--core--aws-01
spec:
type: cloud-components
secretRef:
name: my-aws-credentials
shootSpec:
...
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
Shoot state
In order to enable moving the control plane of a Shoot between Seed clusters (e.g., if a Seed cluster is no longer available or entirely broken), Gardener must store some non-reconstructable state, potentially also the state written by the controllers.
Gardener watches these extension CRDs and copies their .status.state into a ShootState resource in the Garden cluster.
Any observed status change of the respective CRD-controllers must be immediately reflected in the ShootState resource.
The contract between Gardener and those controllers is: Every controller must be capable of reconstructing its own environment based on both the state it has written before and the real world’s conditions/state.
---
apiVersion: gardener.cloud/v1alpha1
kind: ShootState
metadata:
name: shoot--core--aws-01
shootRef:
name: aws-01
project: core
state:
secrets:
- name: ca
data: ...
- name: kube-apiserver-cert
data: ...
resources:
- kind: DNS
name: record-1
state: <copied-state-of-dns-crd>
- kind: Infrastructure
name: networks
state: <copied-state-of-infrastructure-crd>
...
<other fields required to keep track of>
We cannot assume that Gardener is always online to observe the most recent states the controllers have written to their resources.
Consequently, the information stored here must not be used as a “single source of truth”; the controllers must potentially check the real world’s status to reconstruct themselves.
However, this must anyway be part of their normal reconciliation logic and is a general best practice for Kubernetes controllers.
Shoot health checks/conditions
Some of the existing conditions already contain specific code which shall be simplified as well.
All of the CRDs described above have a .status.conditions field to which the controllers may write relevant health information for their functional area.
Gardener will pick them up and copy them over to the Shoot’s .status.conditions (only those conditions setting propagate=true).
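A short sketch of this propagation (the condition fields follow the DNSEntry example above; the exact copy semantics shown here are an assumption):
# condition reported by an extension CRD controller with propagate: true ...
status:
  conditions:
  - type: Available
    status: "True"
    reason: RecordResolvable
    message: DNS record has been created and is available.
    propagate: true
# ... would be copied by Gardener into the Shoot as:
status:
  conditions:
  - type: Available
    status: "True"
    reason: RecordResolvable
    message: DNS record has been created and is available.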
Reconciliation flow
We will now examine the current Shoot creation/reconciliation flow and describe how it could look when applying this proposal:
Operation | Description |
---|---|
botanist.DeployNamespace | Gardener creates the namespace for the Shoot in the Seed cluster. |
botanist.DeployKubeAPIServerService | Gardener creates a Service of type LoadBalancer in the Seed. AWS Botanist registers a Mutating Webhook and adds its AWS-specific annotation. |
botanist.WaitUntilKubeAPIServerServiceIsReady | Gardener checks the .status object of the just created Service in the Seed. The contract is that also clouds not supporting load balancers must react on the Service object and modify the .status to correctly reflect the kube-apiserver’s ingress IP. |
botanist.DeploySecrets | Gardener creates the secrets/certificates it needs like it does today, but it provides utility functions that can be adopted by Botanists/other controllers if they need additional certificates/secrets created on their own. (We should also add labels to all secrets) |
botanist.Shoot.Components.DNS.Internal{Provider/Entry}.Deploy | Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and creates a corresponding DNS record (see CRD specification above). |
botanist.Shoot.Components.DNS.External{Provider/Entry}.Deploy | Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and creates a corresponding DNS record: (see CRD specification above). |
shootCloudBotanist.DeployInfrastructure | Gardener creates a Infrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job: (see CRD above). |
botanist.DeployBackupInfrastructure | Gardener creates a BackupInfrastructure resource in the Garden cluster. (The BackupInfrastructure controller creates a BackupInfrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job: (see CRD above).) |
botanist.WaitUntilBackupInfrastructureReconciled | Gardener checks the .status object of the just created BackupInfrastructure resource. |
hybridBotanist.DeployETCD | Gardener only deploys the etcd StatefulSet, without any backup-restore sidecar. The cloud-specific Botanist registers a Mutating Webhook and adds the backup-restore sidecar, and it also creates the Secret needed by the backup-restore sidecar. |
botanist.WaitUntilEtcdReady | Gardener checks the .status object of the etcd StatefulSet and waits until readiness is indicated. |
hybridBotanist.DeployCloudProviderConfig | Gardener does not execute this anymore because it doesn’t know anything about cloud-specific configuration. |
hybridBotanist.DeployKubeAPIServer | Gardener only deploys the kube-apiserver Deployment, without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-apiserver to run in its cloud environment. |
hybridBotanist.DeployKubeControllerManager | Gardener only deploys the kube-controller-manager Deployment, without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-controller-manager to run in its cloud environment (e.g., the cloud-config). |
hybridBotanist.DeployKubeScheduler | Gardener only deploys the kube-scheduler Deployment, without any cloud-specific flags/configuration. The cloud-specific Botanist registers a Mutating Webhook and adds whatever is needed for the kube-scheduler to run in its cloud environment. |
hybridBotanist.DeployCloudControllerManager | Gardener does not execute this anymore because it doesn’t know anything about cloud-specific configuration. The Botanists would be responsible to deploy their own cloud-controller-manager now. They would watch for the kube-apiserver Deployment to exist, and as soon as it does, they deploy the CCM. (Side note: The Botanist would also be responsible to deploy further controllers needed for this cloud environment, e.g. F5-controllers or CSI plugins). |
botanist.WaitUntilKubeAPIServerReady | Gardener checks the .status object of the kube-apiserver Deployment and waits until readiness is indicated. |
botanist.InitializeShootClients | Unchanged; Gardener creates a Kubernetes client for the Shoot cluster. |
botanist.DeployMachineControllerManager | Deleted; Gardener no longer deploys the MCM itself. See below. |
hybridBotanist.ReconcileMachines | Gardener creates a Worker CRD in the Seed, and the responsible Worker controller picks it up and does its job (see CRD above). It also deploys the machine-controller-manager. Gardener waits until the status indicates that the controller is done. |
hybridBotanist.DeployKubeAddonManager | This function also computes the CoreOS cloud-config (because the secret storing it is managed by the kube-addon-manager). Gardener would deploy the CloudConfig-specific CRD in the Seed, and the responsible OS controller picks it up and does its job (see CRD above). The Botanists which would have to modify something would register a Webhook for this CloudConfig-specific resource and apply their changes. The rest is mostly unchanged: Gardener generates the manifests for the addons and deploys the kube-addon-manager into the Seed. The AWS Botanist registers a Webhook for nginx-ingress. The Azure Botanist registers a Webhook for calico. Gardener will no longer deploy the StorageClasses. Instead, the Botanists wait until the kube-apiserver is available and deploy them. In the long term we want to get rid of optional addons inside the Gardener core and implement a sophisticated addon concept (see #246). |
shootCloudBotanist.DeployKube2IAMResources | This function would be removed (currently Gardener would execute a Terraform job creating the IAM roles specified in the Shoot manifest). We cannot keep this behavior, the user would be responsible to create the needed IAM roles on its own. |
botanist.Shoot.Components.Nginx.DNSEntry | Gardener creates a DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and creates a corresponding DNS record (see CRD specification above). |
botanist.WaitUntilVPNConnectionExists | Unchanged, Gardener checks that it is possible to port-forward to a Shoot pod. |
seedCloudBotanist.ApplyCreateHook | This function would be removed (actually, only the AWS Botanist implements it). The AWS Botanist deploys the aws-lb-readvertiser once the API server is deployed and updates the ELB health check protocol once the load balancer pointing to the API server is created. |
botanist.DeploySeedMonitoring | Unchanged, Gardener deploys the monitoring stack into the Seed. |
botanist.DeployClusterAutoscaler | Unchanged, Gardener deploys the cluster-autoscaler into the Seed. |
ℹ We can easily lift the contract later and allow dynamic network plugins, or not using the VPN solution at all.
We could also introduce a dedicated ControlPlane CRD and leave the complete responsibility of deploying kube-apiserver, kube-controller-manager, etc. to other controllers (if we need it at some point in time).
Deletion flow
We will now examine the current Shoot deletion flow and briefly describe how it could look when applying this proposal:
Operation | Description |
---|---|
botanist.DeploySecrets | This is just refreshing the cloud provider secret in the Shoot namespace in the Seed (in case the user has changed it before triggering the deletion). This function would stay as it is. |
hybridBotanist.RefreshMachineClassSecrets | This function would disappear. Worker Pool controller needs to watch the referenced secret and update the generated MachineClassSecrets immediately. |
hybridBotanist.RefreshCloudProviderConfig | This function would disappear. Botanist needs to watch the referenced secret and update the generated cloud-provider-config immediately. |
botanist.RefreshCloudControllerManagerChecksums | See “hybridBotanist.RefreshCloudProviderConfig”. |
botanist.RefreshKubeControllerManagerChecksums | See “hybridBotanist.RefreshCloudProviderConfig”. |
botanist.InitializeShootClients | Unchanged; Gardener creates a Kubernetes client for the Shoot cluster. |
botanist.DeleteSeedMonitoring | Unchanged; Gardener deletes the monitoring stack. |
botanist.DeleteKubeAddonManager | Unchanged; Gardener deletes the kube-addon-manager. |
botanist.DeleteClusterAutoscaler | Unchanged; Gardener deletes the cluster-autoscaler. |
botanist.WaitUntilKubeAddonManagerDeleted | Unchanged; Gardener waits until the kube-addon-manager is deleted. |
botanist.CleanCustomResourceDefinitions | Unchanged, Gardener cleans the CRDs in the Shoot. |
botanist.CleanKubernetesResources | Unchanged, Gardener cleans all remaining Kubernetes resources in the Shoot. |
hybridBotanist.DestroyMachines | Gardener deletes the WorkerPool-specific CRD in the Seed, and the responsible WorkerPool-controller picks it up and does its job. Gardener waits until the CRD is deleted. |
shootCloudBotanist.DestroyKube2IAMResources | This function would disappear (currently Gardener would execute a Terraform job deleting the IAM roles specified in the Shoot manifest). We cannot keep this behavior, the user would be responsible to delete the needed IAM roles on its own. |
shootCloudBotanist.DestroyInfrastructure | Gardener deletes the Infrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job. Gardener waits until the CRD is deleted. |
botanist.Shoot.Components.DNS.External{Provider/Entry}.Destroy | Gardener deletes the DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and does its job. Gardener waits until the CRD is deleted. |
botanist.DeleteKubeAPIServer | Unchanged; Gardener deletes the kube-apiserver. |
botanist.DeleteBackupInfrastructure | Unchanged; Gardener deletes the BackupInfrastructure object in the Garden cluster. (The BackupInfrastructure controller deletes the BackupInfrastructure-specific CRD in the Seed, and the responsible Botanist picks it up and does its job. The BackupInfrastructure controller waits until the CRD is deleted.) |
botanist.Shoot.Components.DNS.Internal{Provider/Entry}.Destroy | Gardener deletes the DNS-specific CRD in the Seed, and the responsible DNS-controller picks it up and does its job. Gardener waits until the CRD is deleted. |
botanist.DeleteNamespace | Unchanged; Gardener deletes the Shoot namespace in the Seed cluster. |
botanist.WaitUntilSeedNamespaceDeleted | Unchanged; Gardener waits until the Shoot namespace in the Seed has been deleted. |
botanist.DeleteGardenSecrets | Unchanged; Gardener deletes the kubeconfig/ssh-keypair Secret in the project namespace in the Garden. |
Gardenlet
One part of the whole extensibility work will also be to further split Gardener itself.
Inspired by Kubernetes itself, we plan to move the Shoot reconciliation/deletion controller loops as well as the BackupInfrastructure reconciliation/deletion controller loops into a dedicated “gardenlet” component that will run in the Seed cluster.
With that, it can talk locally to the responsible kube-apiserver and we no longer need to perform every operation out of the Garden cluster.
This approach will also help us with scalability, performance, maintainability, and testability in general.
This architectural change implies that the Kubernetes API server of the Garden cluster must be exposed publicly (or at least be reachable by the registered Seeds). The Gardener controller-manager will remain and will keep its CloudProfile, SecretBinding, Quota, Project, and Seed controller loops. One part of the seed controller could be to deploy the “gardenlet” into the Seeds; however, this would require network connectivity to the Seed cluster.
Shoot control plane movement/migration
Automatically moving control planes is difficult with the current implementation as some resources created in the old Seed must be moved to the new one. However, some of them are not under Gardener’s control (e.g., Machine resources). Moreover, the old control plane must be deactivated somehow to ensure that no two controllers work on the same things (e.g., virtual machines) from different environments.
Gardener does not only deploy a DNS controller into the Seeds but also into its own Garden cluster.
For every Shoot cluster, Gardener commissions it to create a DNS TXT record containing the name of the Seed responsible for the Shoot (holding the control plane), e.g.
dig -t txt aws-01.core.garden.example.com
...
;; ANSWER SECTION:
aws-01.core.garden.example.com. 120 IN TXT "Seed=seed-01"
...
Gardener always keeps the DNS record up-to-date based on which Seed is responsible.
In the above CRD examples one object in the .spec section was omitted as it is needed to get Shoot control plane movement/migration working (the field is only explained now in this section and not before; it was omitted on purpose to support focusing on the relevant specifications first).
Every CRD also has the following section in its .spec:
leadership:
record: aws-01.core.garden.example.com
value: seed-01
leaseSeconds: 60
Before every operation the CRD-controllers check this DNS record (based on the .spec.leadership.leaseSeconds configuration) and verify that its result is equal to the .spec.leadership.value field.
If both match they know that they should act on the resource, otherwise they stop doing anything.
ℹ We will provide an easy-to-use framework for the controllers containing all of these features out-of-the-box in order to allow the developers to focus on writing the actual controller logic.
When a Seed control plane move is triggered, the .spec.cloud.seed field of the respective Shoot is changed.
Gardener will change the respective DNS record’s value (aws-01.core.garden.example.com) to contain the new Seed name.
After that it will wait 2*60s to be sure that all controllers have observed the change.
Then it starts reconciling and applying the CRDs together with a preset .status.state into the new Seed (based on its last observations which were stored in the respective ShootState object in the Garden cluster).
The controllers are - as per contract - asked to reconstruct their own environment based on the .status.state they have written before and the real world’s status.
Apart from that, the normal reconciliation flow gets executed.
Gardener stores the list of Seeds that were responsible for hosting a Shoot’s control plane at some time in the Shoot’s .status.seeds list so that it knows which Seeds must be cleaned up (i.e., where the control plane must be deleted because it has been moved).
Once cleaned up, the Seed’s name will be removed from that list.
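A sketch of this bookkeeping on the Shoot (the exact field layout is an assumption):
status:
  seeds:
  - seed-01 # old Seed, still to be cleaned up after the move
  - seed-02 # new Seed, currently hosting the control plane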
BackupInfrastructure migration
One part of the reconciliation flow above is the provisioning of the infrastructure for the Shoot’s etcd backups (usually, this is a blob store bucket/container).
Gardener already uses a separate BackupInfrastructure resource that is written into the Garden cluster and picked up by a dedicated BackupInfrastructure controller (bundled into the Gardener controller manager).
This dedicated resource exists mainly to allow keeping backups for a certain “grace period” even after the Shoot deletion itself:
apiVersion: gardener.cloud/v1alpha1
kind: BackupInfrastructure
metadata:
name: aws-01-bucket
namespace: garden-core
spec:
seed: seed-01
shootUID: uuid-of-shoot
The actual provisioning is executed in a corresponding Seed cluster as Gardener can only assume network connectivity to the underlying cloud environment in the Seed.
We would like to keep the created artifacts in the Seed (e.g., Terraform state) near to the control plane.
Consequently, when Gardener moves a control plane, it will update the .spec.seed field of the BackupInfrastructure resource as well.
With the exact same logic described above, the BackupInfrastructure controller inside Gardener will perform the move to the new Seed.
Registration of external controllers at Gardener
We want to have a dynamic registration process, i.e., we don’t want to hard-code any information about which controllers shall be deployed.
The ideal solution would be to not even require a restart of Gardener when a new controller registers.
Every controller is registered by a ControllerRegistration resource that introduces the controller together with its supported resources (dimension (kind) and shape (type) combinations) to Gardener:
apiVersion: gardener.cloud/v1alpha1
kind: ControllerRegistration
metadata:
name: dns-aws-route53
spec:
resources:
- kind: DNS
type: aws-route53
# deployment:
# type: helm
# providerConfig:
# chart.tgz: base64(helm-chart)
# values.yaml: |
# foo: bar
Every .kind/.type combination may only exist once in the system.
When a Shoot shall be reconciled, Gardener can identify, based on the referenced Seed and the content of the Shoot specification, which controllers are needed in the respective Seed cluster.
It will instruct the operators in the Garden cluster to deploy the controllers they are responsible for to a specific Seed.
This kind of communication happens via CRDs as well:
apiVersion: gardener.cloud/v1alpha1
kind: ControllerInstallation
metadata:
name: dns-aws-route53
spec:
registrationRef:
name: dns-aws-route53
seedRef:
name: seed-01
status:
conditions:
- lastTransitionTime: 2018-08-07T15:09:23Z
message: The controller has been successfully deployed to the seed.
reason: ControllerDeployed
status: "True"
type: Available
The default scenario is that every controller gets deployed by a dedicated operator that knows how to handle its lifecycle operations like deployment, update, upgrade, and deletion.
This operator watches ControllerInstallation resources and reacts on those it is responsible for (that it has created earlier).
Gardener is responsible for writing the .spec field; the operator is responsible for providing information in the .status indicating whether the controller was successfully deployed and is ready to be used.
Gardener will also be able to ask for deletion of controllers from Seeds when they are not needed there anymore by deleting the corresponding ControllerInstallation object.
ℹ The provided easy-to-use framework for the controllers will also contain these needed features to implement corresponding operators.
For most cases the controller deployment is very simple (just deploying it into the seed with some static configuration).
In these cases it would produce unnecessary effort to require another component (the operator) just to deploy the controller.
To simplify this situation, Gardener will be able to react on ControllerInstallations whose referenced registration specifies .spec.deployment.type=helm.
Such a controller would be registered with a ControllerRegistration resource that contains a Helm chart with all resources needed to deploy this controller into a seed (plus some static values).
Gardener would render the Helm chart and deploy the resources into the seed.
It will not react if .spec.deployment.type!=helm, which also allows using any other deployment mechanism. Controllers that are deployed by operators would not specify the .spec.deployment section in the ControllerRegistration at all.
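For completeness, a sketch of such a self-contained registration with the deployment section filled in (chart payload and values are placeholders taken from the commented example above):
apiVersion: gardener.cloud/v1alpha1
kind: ControllerRegistration
metadata:
  name: dns-aws-route53
spec:
  resources:
  - kind: DNS
    type: aws-route53
  deployment:
    type: helm # Gardener renders and applies the chart itself, no operator needed
    providerConfig:
      chart.tgz: base64(helm-chart)
      values.yaml: |
        foo: bar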
ℹ Any controller requiring dynamic configuration values (e.g., based on the cloud provider or the region of the seed) must be installed with the operator approach.
Other cloud-specific parts
The Gardener API server has a few admission controllers that contain cloud-specific code as well. We have to replace these parts, too.
Defaulting and validation admission plugins
Right now, the admission controllers inside the Gardener API server perform a lot of validation and defaulting of fields in the Shoot specification.
The cloud-specific parts of these admission controllers will be replaced by mutating admission webhooks that will get called instead.
As we will have a dedicated operator running in the Garden cluster anyway it will also get the responsibility to register this webhook if it needs to validate/default parts of the Shoot specification.
Example: The .spec.cloud.workerPools[*].providerConfig.machineImage field in the new Shoot manifest mentioned above could be omitted by the user and would get defaulted by the cloud-specific operator.
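A sketch of how such a defaulting webhook could be registered in the Garden cluster (the group/resource names follow the manifests above; the webhook names and client config are hypothetical):
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-shoot-defaulter # hypothetical
webhooks:
- name: shoots.aws.cloud.gardener.cloud
  failurePolicy: Fail
  rules:
  - apiGroups: ["gardener.cloud"]
    apiVersions: ["v1alpha1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["shoots"]
  clientConfig:
    service:
      name: aws-operator # hypothetical operator service in the Garden cluster
      namespace: garden
      path: /webhooks/default-shoot
    caBundle: <base64-encoded-ca>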
DNS Hosted Zone admission plugin
For the same reasons the existing DNS Hosted Zone admission plugin will be removed from the Gardener core and moved into the responsibility of the respective DNS-specific operators running in the Garden cluster.
Shoot Quota admission plugin
The Shoot quota admission plugin validates create or update requests on Shoots and checks that the specified machine/storage configuration conforms to the referenced Quota objects.
The cloud-specifics in this controller are no longer needed as the CloudProfile and the Shoot resource have been adapted:
The machine/storage configuration is no longer in cloud-specific sections but in hard-wired fields of the general Shoot specification (see example resources above).
The quota admission plugin will be simplified and remains in the Gardener core.
Shoot maintenance controller
Every Shoot cluster can define a maintenance time window in which Gardener will update the Kubernetes patch version (if enabled) and the used machine image version in the Shoot resource.
While the Kubernetes version is not part of the providerConfig section in the CloudProfile resource, the machineImage field is, and thus Gardener can no longer understand it.
In the future Gardener has to rely on the cloud-specific operator (probably the same one doing the defaulting/validation mentioned before) to update this field.
In the maintenance time window the maintenance controller will update the Kubernetes patch version (if enabled) and add a trigger.gardener.cloud=maintenance annotation to the Shoot resource.
The already registered mutating webhook will call the operator, which has to remove this annotation and update the machineImage in the .spec.cloud.workerPools[*].providerConfig sections.
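A sketch of the annotated Shoot during its maintenance window (metadata excerpt only, reusing the example Shoot from above):
metadata:
  name: johndoe-aws
  namespace: garden-dev
  annotations:
    trigger.gardener.cloud: maintenance # removed again by the operator once it has updated the machineImage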
Alternatives
- Alternative to DNS approach for Shoot control plane movement/migration: We have thought about rotating the credentials when a move is triggered, which would make all controllers ineffective immediately. However, one problem with this is that we would require IAM privileges for the user’s infrastructure account, which might not be desired. Another, more complicated problem is that we cannot assume API access in order to create technical users for all cloud environments that might be supported.
02 BackupInfra
Backup Infrastructure CRD and Controller Redesign
Goal
- As an operator, I would like to efficiently use the backup bucket for multiple clusters, thereby limiting the total number of buckets required.
- As an operator, I would like to use a different cloud provider for backup bucket provisioning than the one used for the seed infrastructure.
- Have seed-independent backups, so that we can easily migrate a shoot from one seed to another.
- Execute the backup operations (including bucket creation and deletion) from a seed, because network connectivity may only be ensured from the seeds (not necessarily from the garden cluster).
- Preserve the garden cluster as the source of truth (no information is missing in the garden cluster to reconstruct the state of the backups even if seed and shoots are lost completely).
- Do not violate the infrastructure limits with regard to blob store limits/quotas.
Motivation
Currently, every shoot cluster has its own etcd backup bucket with a centrally configured retention period. With the growing number of clusters, we are soon going to run into the bucket quota limits of the cloud providers. Moreover, even if a cluster is deleted, its backup bucket continues to exist for the configured retention period. Hence, there is a need to minimize the total number of buckets.
In addition, we currently use seed infrastructure credentials to provision the bucket for etcd backups. This binds the backup bucket provider to the seed infrastructure provider.
Terminology
- Bucket: equivalent to an S3 bucket, ABS container, GCS bucket, Swift container, or Alicloud bucket.
- Object: equivalent to an S3 object, ABS blob, GCS object, Swift object, or Alicloud object; a snapshot/backup of etcd on the object store.
- Directory: as such there is no concept of a directory in an object store, but usually a /-separated common prefix is used for a set of objects (alternatively called a folder).
- deletionGracePeriod: the grace period or retention period for which backups will be persisted after the deletion of a shoot.
Current Spec:
#BackupInfra spec
Kind: BackupInfrastructure
Spec:
seed: seedName
shootUID : shoot.status.uid
Current naming conventions
| | |
---|---|
SeedNamespace: | shoot--projectname--shootname |
seed: | seedname |
ShootUID: | shoot.status.UID |
BackupInfraName: | seedNamespace+sha(uid)[:5] |
Backup-bucket-name: | BackupInfraName |
BackupNamespace: | backup--BackupInfraName |
Proposal
Keeping the Gardener extension proposal in mind, the backup infrastructure controller can be divided into two parts. There will be four backup-infrastructure-related CRDs: two on the garden apiserver, and two on the seed cluster. Before going into the workflow, let’s first have a look at the CRDs.
CRD on Garden cluster
To give a brief overview before going into the details: we stick to the fact that the garden apiserver is always the source of truth. Since the backupInfra will be maintained post deletion of the shoot, the info regarding it should always come from the garden apiserver; hence we will continue to have the BackupInfra resource on the garden apiserver, with some modifications.
apiVersion: garden.cloud/v1alpha1
kind: BackupBucket
metadata:
name: packet-region1-uid[:5]
# No namespace needed. This will be a cluster-scoped resource.
ownerReferences:
- kind: CloudProfile
name: packet
spec:
provider: aws
region: eu-west-1
secretRef: # Required for root
name: backup-operator-aws
namespace: garden
status:
lastOperation: ...
observedGeneration: ...
seed: ...
apiVersion: garden.cloud/v1alpha1
kind: BackupEntry
metadata:
name: shoot--dev--example--3ef42 # Naming convention explained before
namespace: garden-dev
ownerReferences:
- apiVersion: core.gardener.cloud/v1beta1
blockOwnerDeletion: false
controller: true
kind: Shoot
name: example
uid: 19a9538b-5058-11e9-b5a6-5e696cab3bc8
spec:
shootUID: 19a9538b-5058-11e9-b5a6-5e696cab3bc8 # Just for reference to find back associated shoot.
# Following section comes from cloudProfile or seed yaml based on granularity decision.
bucketName: packet-region1-uid[:5]
status:
lastOperation: ...
observedGeneration: ...
seed: ...
CRD on Seed cluster
Considering the extension proposal, we want each individual component to be handled by a controller inside the seed cluster. We will have backup-related resources in the registered seed clusters as well.
apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupBucket
metadata:
name: packet-random[:5]
# No namespace needed. This will be a cluster-scoped resource.
spec:
type: aws
region: eu-west-1
secretRef:
name: backup-operator-aws
namespace: backup-garden
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
There are two reasons for introducing the BackupEntry resource:
- Cloud-provider-specific code goes completely into the seed cluster.
- The network issue is also handled, by moving the deletion part to the backup-extension-controller in the seed cluster.
apiVersion: extensions.gardener.cloud/v1alpha1
kind: BackupEntry
metadata:
name: shoot--dev--example--3ef42 # Naming convention explained later
# No namespace needed. This will be a cluster-scoped resource.
spec:
type: aws
region: eu-west-1
secretRef: # Required for root
name: backup-operator-aws
namespace: backup-garden
status:
observedGeneration: ...
state: ...
lastError: ..
lastOperation: ...
Workflow
- The Gardener administrator will configure the cloudProfile with backup infra credentials and provider config as follows:
# CloudProfile.yaml:
Spec:
backup:
provider: aws
region: eu-west-1
secretRef:
name: backup-operator-aws
namespace: garden
Here the CloudProfileController will interpret this spec as follows:
- If spec.backup is nil, there will be no backup for any shoot.
- If spec.backup.region is not nil, then respect it, i.e., use the provider and the unique region field mentioned there for the BackupBucket.
  - Preferably, the spec.backup.region field will be unique, since for the cross-provider case it doesn’t make much sense: the region name will be different for different providers.
- Otherwise, i.e. if spec.backup.region is nil:
  - In the same-provider case, i.e. spec.backup.provider equals spec.(type-of-provider) or is nil:
    - For each region from spec.(type-of-provider).constraints.regions, create a BackupBucket instance. This can be done lazily, i.e., create a BackupBucket instance for a region only once a seed actually spawned in that region has been registered. This avoids creating an IaaS bucket when no seed is registered in a region that is merely listed in the cloudprofile.
    - The shoot controller will choose the backup container as per the seed region. (With shoot control plane migration, the seed’s availability zone might change, but the region will remain the same as per the current scope.)
  - Otherwise, in the cross-provider case, i.e. spec.backup.provider != spec.(type-of-provider):
    - Report a validation error, since, for example, we can’t expect spec.backup.provider = aws to support a region in spec.packet.constraint.region (where the type-of-provider is packet).
The following diagram represents the overall flow in detail:

Reconciliation
Reconciliation of a BackupEntry in the seed cluster mostly comes into the picture at the time of deletion. But we can add initialization steps like the creation of a shoot-specific directory in the backup bucket. We can simply create the BackupEntry at the time of shoot deletion as well.
Deletion
- On shoot deletion, the BackupEntry instance, i.e. the shoot-specific instance, will get a deletion timestamp because of its ownerReference.
- If the deletionGracePeriod configured in the GCM component configuration has expired, the BackupInfrastructure controller will delete the backup directory associated with it from the backup object store (see the configuration sketch below).
- Finally, it will remove the finalizer from the backupEntry instance.
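A sketch of where such a grace period could live in the GCM component configuration (the exact field name and its location are assumptions; the structure follows the quota example further below):
apiVersion: controllermanager.config.gardener.cloud/v1alpha1
kind: ControllerManagerConfiguration
controllers:
  backupInfrastructure:
    deletionGracePeriodHours: 168 # hypothetical field: keep backups for 7 days after shoot deletion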
Alternative

Discussion points / variations
Manual vs dynamic bucket creation
As per the limits observed on different cloud providers, we can have a single bucket for backups per cloud provider. So we could avoid the little complexity introduced in the above approach by pre-provisioning buckets as part of the landscape setup. But then there won’t be anybody to detect a bucket’s existence and reconcile it. Ideally this should be avoided.
Another option is to let the administrator register a pool of root backup infra resources and let the controller schedule backups on one of them.
One more variation could be to create a bucket dynamically per hash of the shoot UID.
The initial reason for going with Terraform scripts was their stability and the provided parallelism/concurrency in resource creation. For the backup infrastructure the Terraform scripts are very minimal right now; they simply contain a bucket creation script. With the shared bucket logic we might want to isolate access at the directory level if possible, but again that is one additional call. So we will prefer switching to the SDKs for all object store operations.
Limiting the number of shoots per bucket
Again, as per the limits observed on different cloud providers, we can have a single bucket for backups per cloud provider. But if we want to limit the number of shoots associated with a bucket, we can have a central map of configuration in gardener-controller-component-configuration.yaml.
There we will mark the supported count of shoots per cloud provider. The most probable place could be controller.backupInfrastructures.quota. If the limit is reached, we can create a new BackupBucket instance,
e.g.
apiVersion: controllermanager.config.gardener.cloud/v1alpha1
kind: ControllerManagerConfiguration
controllers:
  backupInfrastructure:
    quota:
    - provider: aws
      limit: 100 # Numbers mentioned here are random, just for example purposes.
    - provider: azure
      limit: 80
    - provider: openstack
      limit: 100
...
Backward compatibility
Migration
- Create shoot-specific folder.
- Transfer old objects.
- Create a manifest of objects on the new bucket.
  - Each entry will have a status: None, Copied, NotFound.
- Copy objects one by one.
- Scale down etcd-main with the old config. ⚠️ Cluster downtime
- Copy the remaining objects.
- Scale up etcd-main with the new config.
- Destroy the old bucket and the old backup namespace. This can be immediate or, preferably, a lazy deletion.
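For illustration, the copy manifest tracked on the new bucket could look roughly like this (purely illustrative; no such format is prescribed by the proposal):
objects:
- name: etcd-main-backup-2019-03-01.snap
  status: Copied
- name: etcd-main-backup-2019-03-02.snap
  status: None     # not yet transferred
- name: etcd-main-backup-2019-03-03.snap
  status: NotFound # disappeared from the old bucket in the meantime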

Legacy Mode alternative
- If the Backup namespace is present in the seed cluster, then follow the legacy approach,
  - i.e. reconcile the creation/existence of the shoot-specific bucket and the backup namespace.
- If the backup namespace is not created, use the shared bucket.
- Limitation: We never know when the existing cluster will be deleted, and hence it might be a little difficult to maintain with the next release of Gardener. This might look simple and straightforward for now, but it may become a pain point in the future if, in the worst case, because of some new use cases or refactoring, we have to change the design again. Also, even after multiple Gardener releases, we won't be able to remove the deprecated existing BackupInfrastructure CRD.
3 - 03 Networking Extensibility
Network Extensibility
Currently Gardener follows a mono network-plugin support model (i.e., Calico). Although this can seem to be the more stable approach, it does not completely reflect the real use-case. This proposal brings forth an effort to add an extra level of customizability to Gardener networking.
Motivation
Gardener is an open-source project that provides a nested user model. Basically, there are two types of services provided by Gardener to its users:
- Managed: users only request a Kubernetes cluster (Clusters-as-a-Service)
- Hosted: users utilize Gardener to provide their own managed version of Kubernetes (Cluster-Provisioner-as-a-service)
For the first set of users, the choice of network plugin might not be so important, however, for the second class of users (i.e., Hosted) it is important to be able to customize networking based on their needs.
Furthermore, Gardener provisions clusters on different cloud providers with different networking requirements. For example, Azure does not support Calico networking [1]; this leads to the introduction of manual exceptions in static add-on charts, which is error prone and can lead to failures during upgrades.
Finally, every provider is different, and thus the network always needs to adapt to the infrastructure to provide better performance. Consistency does not necessarily lie in the implementation but in the interface.
Gardener Network Extension
The goal of the Gardener network extensions is to support different network plugins; therefore, the specification of the Network resource won't be fixed and will be customized based on the underlying network plugin. To do so, a NetworkConfig section is provided in the spec, in which each plugin defines its own configuration. Below is an example of deploying Calico as the cluster network plugin.
Long Term Spec
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Network
metadata:
  name: calico-network
  namespace: shoot--core--test-01
spec:
  type: calico
  clusterCIDR: 192.168.0.0/24
  serviceCIDR: 10.96.0.0/24
  providerConfig:
    apiVersion: calico.extensions.gardener.cloud/v1alpha1
    kind: NetworkConfig
    ipam:
      type: host-local
      cidr: usePodCIDR
    backend: bird
    typha:
      enabled: true
status:
  observedGeneration: ...
  state: ...
  lastError: ..
  lastOperation: ...
  providerStatus:
    apiVersion: calico.extensions.gardener.cloud/v1alpha1
    kind: NetworkStatus
    components:
      kubeControllers: true
      calicoNodes: true
    connectivityTests:
      pods: true
      services: true
    networkModules:
      arp_proxy: true
    config:
      clusterCIDR: 192.168.0.0/24
      serviceCIDR: 10.96.0.0/24
      ipam:
        type: host-local
        cidr: usePodCIDR
First Implementation (Short Term)
As an initial implementation, the network plugin type will be specified by the user, e.g. Calico (without further configuration in the provider spec). This will then be used to generate the Network resource in the seed. The network operator will pick it up and apply the configuration, based on the spec.cloudProvider, directly to the shoot or via the Gardener resource manager (still in the works).
The cloudProvider field in the spec is just an initial catalyst and is not meant to stay long-term. In the future, the network provider configuration will be customized to best match the needs of the infrastructure.
Here is how the simplified initial spec would look:
---
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Network
metadata:
  name: calico-network
  namespace: shoot--core--test-01
spec:
  type: calico
  cloudProvider: {aws,azure,...}
status:
  observedGeneration: 2
  lastOperation: ...
  lastError: ...
Functionality
The Network resource needs to be created early on during cluster provisioning. Once it is created, the network operator residing in every seed will create all the necessary networking resources and apply them to the shoot cluster.
The status of the Network resource should reflect the health of the networking components as well as additional tests if required.
References
[1] Azure support for Calico Networking
4 - 05 Versioning Policy
Gardener Versioning Policy
Please refer to this document for the documentation of the implementation of this GEP.
Goal
- As a Garden operator I would like to define a clear Kubernetes version policy, which informs my users about deprecated or expired Kubernetes versions.
- As a user of Gardener, I would like to know which Kubernetes versions are supported and for how long. I want to be able to get this information via the API (cloudprofile) and also in the Dashboard.
Motivation
The Kubernetes community releases minor versions roughly every three months and usually maintains three minor versions (the current one and the last two) with bug fixes and security updates. Patch releases are done more frequently. Operators of Gardener should be able to define their own Kubernetes version policy. This GEP suggests the possibility for operators to classify Kubernetes versions while they go through their “maintenance life-cycle”.
Kubernetes Version Classifications
An operator should be able to classify Kubernetes versions differently while they go through their “maintenance life-cycle”, starting with preview, supported, deprecated, and finally expired. This information should be programmatically available in the cloudprofiles of the Garden cluster as well as in the Dashboard. Please also note that Gardener keeps the control plane and the workers on the same Kubernetes version.
For further explanation of the possible classifications, we assume that an operator wants to support four minor versions e.g. v1.16, v1.15, v1.14 and v1.13.
preview: After a fresh release of a new Kubernetes minor version (e.g. v1.17.0) the operator could tag it as preview until he has gained sufficient experience. It will not become the default in the Gardener Dashboard until he promotes that minor version to supported, which could happen a few weeks later with the first patch version.
supported: The operator would tag the latest Kubernetes patch versions of the actual (if not still in preview) and the last three minor Kubernetes versions as supported (e.g. v1.16.1, v1.15.4, v1.14.9 and v1.13.12). The latest of these becomes the default in the Gardener Dashboard (e.g. v1.16.1).
deprecated: The operator could decide that he generally wants to classify every version that is not the latest patch version as deprecated and flag these versions accordingly (e.g. v1.16.0 and older, v1.15.3 and older, v1.14.8 and older, as well as v1.13.11 and older). He could also tag all versions (latest or not) of every Kubernetes minor release that is neither the current nor one of the last three minor Kubernetes versions as deprecated, too (e.g. v1.12.x and older). Deprecated versions will eventually expire (i.e., be removed).
expired: This is a logical state only. It doesn't have to be maintained in the cloudprofile. All cluster versions whose expirationDate, as defined in the cloudprofile, has passed are automatically in this logical state. After that date has passed, users cannot create new clusters with that version anymore, and any cluster that is on that version will be forcefully migrated in its next maintenance time window, even if the owner has opted out of automatic cluster updates! The forceful update will pick the latest patch version of the current minor Kubernetes version. If the cluster was already on that latest patch version and the latest patch version is also expired, it will continue with the latest patch version of the next minor Kubernetes version, so it will result in an update of a minor Kubernetes version, which is potentially harmful to your workload, so you should avoid that/plan ahead! If that is expired as well, the update process repeats until a non-expired Kubernetes version is reached, so, depending on the circumstances described above, it can happen that the cluster receives multiple consecutive minor Kubernetes version updates!
To fulfill his specific versioning policy, the Garden operator should be able to classify his versions as well as set the expiration date in the cloudprofiles. The user should see these classifiers as well as the expiration date in the dashboard.
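For illustration, a Kubernetes version entry in a cloudprofile carrying both a classification and an expiration date could look like this (a sketch; the concrete values are examples):
spec:
  kubernetes:
    versions:
    - version: 1.16.1
      classification: supported
    - version: 1.15.4
      classification: deprecated
      expirationDate: "2020-04-05T01:02:03Z" # after this date the version is logically expired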
5 - 06 Etcd Druid
Integrating etcd-druid with Gardener
Etcd is currently deployed by garden-controller-manager as a StatefulSet. The sidecar container spec contains details pertaining to the cloud-provider object store, which are injected into the StatefulSet via a mutating webhook running as part of the Gardener extension story. This approach restricts operations on etcd such as scale-up and upgrade. Etcd-druid will eliminate the need to hijack the StatefulSet creation to add cloud-provider details. It has been designed to provide intricate control over the procedure of deploying and maintaining etcd. The roadmap for etcd-druid can be found here.
This document explains how Gardener deploys etcd and what resources it creates for etcd-druid to deploy an etcd cluster.
Resources required by etcd-druid (created by Gardener)
- Secret containing credentials to access backup bucket in Cloud provider object store.
- TLS server and client secrets for etcd and backup-sidecar
- Etcd CRD resource that contains parameters pertaining to etcd, backup-sidecar and cloud-provider object store.
When an etcd resource is created in the cluster, the druid acts on it by creating an etcd StatefulSet, a service, and a configmap containing the etcd bootstrap script. The secrets containing the infrastructure credentials and the TLS certificates are mounted as volumes. If no secret/information regarding backups is stated, then no etcd data backups are taken; only data corruption checks are performed prior to starting etcd.
Garden-controller-manager, being cloud agnostic, deploys the etcd resource. It will not contain any cloud-specific information other than the cloud provider. The extension controller that contains the cloud-specific implementation to create the backup bucket will create it if needed and create a secret containing the credentials to access the bucket. The etcd backup secret name should be exposed in the BackupEntry status; Gardener can then read it and write it into the etcd resource. The secret has to be made available in the namespace in which the etcd StatefulSet will be deployed. If etcd and backup-sidecar communicate over TLS, then the CA certificates, server and client certificates, and keys have to be made available in that namespace as well. The etcd resource will reference these aforementioned secrets. etcd-druid will deploy the StatefulSet only if the secrets are available.
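For illustration, the etcd resource created by Gardener could look roughly like the sketch below. Field names such as spec.backup.store and the API group druid.gardener.cloud follow the etcd-druid project, but should be treated as assumptions here rather than the definitive API:
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
  namespace: shoot--foo--bar
spec:
  replicas: 1
  etcd:
    clientPort: 2379
    serverPort: 2380
    tls:
      serverTLSSecretRef:
        name: etcd-server-tls  # TLS secrets created by Gardener
      clientTLSSecretRef:
        name: etcd-client-tls
  backup:
    store:
      provider: S3             # cloud-provider object store
      container: backup-bucket-name
      prefix: shoot--foo--bar--etcd-main
      secretRef:
        name: etcd-backup      # credentials created by the extension controller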
Workflow
- etcd-druid will be deployed, and the etcd CRD will be created, as part of the seed bootstrap.
- Garden-controller-manager creates the backupBucket extension resource. The extension controller creates the backup bucket associated with the seed.
- Garden-controller-manager creates a backupEntry associated with each shoot in the seed namespace.
- Garden-controller-manager creates the etcd resource with the secretRefs and the etcd information populated appropriately.
- etcd-druid acts on the etcd resource; the druid creates the statefulset, the service and the configmap.

6 - 07 Shoot Control Plane Migration
Shoot Control Plane Migration
Motivation
Currently moving the control plane of a shoot cluster can only be done manually and requires deep knowledge of how exactly to transfer the resources and state from one seed to another. This can make it slow and prone to errors.
Automatic migration can be very useful in a couple of scenarios:
- Seed goes down and can't be repaired (fast enough or at all) and its control planes need to be brought to another seed
- Seed needs to be changed, but this operation requires the recreation of the seed (e.g. turn a single-AZ seed into a multi-AZ seed)
- Seeds need to be rebalanced
- New seeds become available in a region closer to/in the region of the workers and the control plane should be moved there to improve latency
- Gardener ring, which is a self-supporting setup/underlay for a highly available (usually cross-region) Gardener deployment
Goals
- Provide a mechanism to migrate the control plane of a shoot cluster from one seed to another
- The mechanism should support migration from a seed which is no longer reachable (Disaster Recovery)
- The shoot cluster nodes are preserved and continue to run the workload, but will talk to the new control plane after the migration completes
- Extension controllers implement a mechanism which allows them to store their state or to be restored from an already existing state on a different seed cluster.
- The already existing shoot reconciliation flow is reused for migration with minimum changes
Terminology
Source Seed is the seed which currently hosts the control plane of a Shoot Cluster
Destination Seed is the seed to which the control plane is being migrated
Resources and controller state which have to be migrated between two seeds:
Note: The following lists are just FYI and are meant to show the current resources which need to be moved to the Destination Seed
Secrets
Gardener has preconfigured lists of needed secrets which are generated when a shoot is created and deployed in the seed. Following is a minimum set of secrets which must be migrated to the Destination Seed. Other secrets can be regenerated from them.
- ca
- ca-front-proxy
- static-token
- ca-kubelet
- ca-metrics-server
- etcd-encryption-secret
- kube-aggregator
- kube-apiserver-basic-auth
- kube-apiserver
- service-account-key
- ssh-keypair
Custom Resources and state of extension controllers
Gardenlet deploys custom resources in the Source Seed cluster during shoot reconciliation which are reconciled by extension controllers. The state of these controllers and any additional resources they create is independent of the gardenlet and must also be migrated to the Destination Seed. Following is a list of custom resources and the state generated by them that has to be migrated.
- BackupBucket: nothing relevant for migration
- BackupEntry: nothing relevant for migration
- ControlPlane: nothing relevant for migration
- DNSProvider/DNSEntry: nothing relevant for migration
- Extensions: migration of state needs to be handled individually
- Infrastructure: terraform state
- Network: nothing relevant for migration
- OperatingSystemConfig: nothing relevant for migration
- Worker: Machine-Controller-Manager related objects: machineclasses, machinedeployments, machinesets, machines
This list depends on the currently installed extensions and can change in the future
Proposal
Custom Resource on the garden cluster
The garden cluster gets a new custom resource called ShootState, which is stored in the project namespace of the Shoot. It contains all the required data described above so that the control plane can be recreated on the Destination Seed.
This data is separated into two sections: the first is generated by the gardenlet and then either used to generate new resources (e.g. secrets) or directly deployed to the Shoot's control plane on the Destination Seed; the second is generated by the extension controllers in the seed.
apiVersion: core.gardener.cloud/v1alpha1
kind: ShootState
metadata:
  name: my-shoot
  namespace: garden-core
  ownerReference:
    apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Shoot
    name: my-shoot
    uid: ...
  finalizers:
  - gardener
gardenlet:
  secrets:
  - name: ca
    data:
      ca.crt: ...
      ca.key: ...
  - name: ssh-keypair
    data:
      id_rsa: ...
  - name:
    ...
extensions:
- kind: Infrastructure
  state: ... (Terraform state)
- kind: ControlPlane
  purpose: normal
  state: ... (Certificates generated by the extension)
- kind: Worker
  state: ... (Machine objects)
The state data is saved as a runtime.RawExtension type, which can be encoded/decoded by the corresponding extension controller.
There can be sensitive data in the ShootState which has to be hidden from end-users. Hence, it is recommended to provide an etcd encryption configuration to the Gardener API server in order to encrypt the ShootState resource.
Size limitations
There are limits on the size of the request bodies sent to the Kubernetes API server when creating or updating resources: by default, etcd can only accept request bodies which do not exceed 1.5 MiB (this can be configured with the --max-request-bytes flag); the Kubernetes API server has a request body limit of 3 MiB which cannot be set from the outside (with a command line flag); the gRPC configuration used by the API server to talk to etcd has a limit of 2 MiB per request body which cannot be configured from the outside; and watch requests have a 16 MiB limit on the buffer used to stream resources.
This means that if the ShootState is bigger than 1.5 MiB, the etcd max request bytes will have to be increased. However, there is still the upper limit of 2 MiB imposed by the gRPC configuration.
If the ShootState exceeds this size limitation, it must make use of configmap/secret references to store the state of extension controllers. This is an implementation detail of Gardener and can be done at a later time if necessary, as extensions will not be affected.
Splitting the ShootState into multiple resources could have a positive effect on performance, as the Gardener API Server and Gardener Controller Manager would handle multiple small resources instead of one big resource.
Gardener extensions changes
All extension controllers which require state migration must save their state in a new status.state field and act on an annotation gardener.cloud/operation=restore in the respective custom resources, which should trigger a restoration operation instead of reconciliation. A restoration operation means that the extension has to restore its state in the Shoot's namespace on the Destination Seed from the status.state field.
As an example: the Infrastructure resource must save the Terraform state.
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foo--bar
spec:
  type: azure
  region: eu-west-1
  secretRef:
    name: cloudprovider
    namespace: shoot--foo--bar
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    resourceGroup:
      name: mygroup
    networks:
      vnet: # specify either 'name' or 'cidr'
        # name: my-vnet
        cidr: 10.250.0.0/16
      workers: 10.250.0.0/19
status:
  state: |
    {
      "version": 3,
      "terraform_version": "0.11.14",
      "serial": 2,
      "lineage": "3a1e2faa-e7b6-f5f0-5043-368dd8ea6c10",
      "modules": [
        {
        }
      ]
      ...
    }
Extensions which do not require state migration should set status.state=nil in their custom resources and trigger a normal reconciliation operation if the CR contains the gardener.cloud/operation=restore annotation.
Similar to the contract for the reconcile operation, the extension controller has to remove the restore annotation after the restoration operation has finished.
An additional annotation gardener.cloud/operation=migrate
is added to the Custom Resources. It is used to tell the extension controllers in the Source Seed that they must stop reconciling resources (in case they are requeued due to errors) and should perform cleanup activities in the Shoot’s control plane. These cleanup activities involve removing the finalizers on Custom Resources and deleting them without actually deleting any infrastructure resources.
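For illustration, triggering this cleanup on an extension resource in the Source Seed then only requires setting the annotation (the resource shown is an example):
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foo--bar
  annotations:
    gardener.cloud/operation: migrate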
Note: The same size limitations from the previous section are relevant here as well.
Shoot reconciliation flow changes
The only data which must be stored in the ShootState by the gardenlet is secrets (e.g. the CA for the API server). Therefore, the botanist.DeploySecrets step is changed: it is split into two functions which take a list of secrets that have to be generated.
- botanist.GenerateSecretState generates certificate authorities and other secrets which have to be persisted in the ShootState and must not be regenerated on the Destination Seed.
- botanist.DeploySecrets takes the secret data from the ShootState, generates new secrets from it (e.g. client TLS certificates from the saved certificate authorities), and deploys everything in the Shoot's control plane on the Destination Seed.
ShootState synchronization controller
The ShootState synchronization controller will become part of the gardenlet. It syncs the state of extension custom resources from the shoot namespace to the garden cluster and updates the corresponding spec.extension.state field in the ShootState resource. The controller can watch the custom resources used by the extensions and update the ShootState only when changes occur.
Migration workflow
- Starting migration
  - Migration can only be started after a Shoot cluster has been successfully created, so that the status.seed field in the Shoot resource has been set.
  - The Shoot resource's field spec.seedName="new-seed" is edited to hold the name of the Destination Seed, and reconciliation is automatically triggered.
  - The Garden Controller Manager checks the equality between spec.seedName and status.seed, detects that they are different, and triggers migration.
- The Garden Controller Manager waits for the Destination Seed to be ready.
- The Shoot's API server is stopped.
- The Shoot's ETCD is backed up.
- Extension resources in the Source Seed are annotated with gardener.cloud/operation=migrate.
- The Shoot's control plane in the Source Seed is scaled down.
- The gardenlet in the Destination Seed fetches the state of the extension resources from the ShootState resource in the garden cluster.
- The normal reconciliation flow is resumed in the Destination Seed. Extension resources are annotated with gardener.cloud/operation=restore to instruct the extension controllers to reconstruct their state.
- The Shoot's namespace in the Source Seed is deleted.
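For illustration, the trigger condition can be read directly off the Shoot resource (seed names are examples):
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  namespace: garden-core
spec:
  seedName: new-seed # edited to point to the Destination Seed
status:
  seed: old-seed     # still the Source Seed, so migration is triggered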
7 - 09 Test Framework
Gardener integration test framework
Motivation
As we want to improve our code coverage in the coming months, we will need a simple and easy-to-use test framework.
The current testframework already contains a lot of general test functions that ease the work of writing new tests.
However, there are multiple disadvantages to the current structure of the tests and the testframework:
- Every new test is its own testsuite and therefore needs its own TestDef (https://github.com/gardener/gardener/tree/master/.test-defs). With this approach there will be hundreds of test definitions, growing with every new test (or at least every new test suite). But in most cases new tests do not need their own special TestDef: it's just the wrong scope for the testmachinery and will result in unnecessarily complex testruns and configurations. In addition, it would result in additional maintenance effort for a huge number of TestDefs.
- The testsuites currently have their own specific interface/configuration that they need in order to be executed correctly (see the K8s update test). Consequently, the configuration has to be defined in the testruns, which results in one step per test with its very own configuration, which in turn means that the testmachinery cannot simply select testdefinitions by label. As the testmachinery cannot make use of its ability to run labeled tests (e.g. run all tests labeled default), the testflow size increases with every new test, and the testruns have to be manually adjusted with every new test.
- The current gardener test framework contains multiple test operations where some are just used for specific tests (e.g. plant_operations) and some are more general (garden_operation). Also, the functions offered by the operations vary in their specialization: some are really specific to just one test, e.g. the shoot test operation with WaitUntilGuestbookAppIsAvailable, whereas others are more general, like WaitUntilPodIsRunning. This structure makes it hard for developers to find commonly used functions and also hard to integrate new ones, as the common framework grows with specialized functions.
Goals
In order to clean up the testframework, make it easier for new developers to write tests, and make it easier to add and maintain test executions within the testmachinery, the following goals are defined:
- Have a small number of test suites (gardener, shoots; see test flavors) so that only a fixed number of testdefinitions has to be maintained.
- Use ginkgo test labels (inspired by the k8s e2e tests) to differentiate test behavior, test execution, and test importance.
- Use standardized configuration for all tests (differing depending on the test suite), but provide better tooling to dynamically read additional configuration from configuration files like the cloudprofile.
- Clean up the testframework to only contain general functionality, and keep specific functions inside the tests.
Proposal
The proposed new test framework consists of the following changes to tackle the above described goals.
Test Flavors
Reducing the number of test definitions is done by combining the current specified test suites into the following 3 general ones:
- System test suite
- e.g. create-shoot, delete-shoot, hibernate
- need their own testdef because they have a special meaning in the context of the testmachinery
- Gardener test suite
- e.g. RBAC, scheduler
- All tests that only need a gardener installation but no shoot cluster
- Possible functions/environment:
- New project for test suite (copy secret binding, cleanup)?
- Shoot test suite
- e.g. shoot app, network
- Test that require a running shoot
- Possible functions:
- Namespace per test
- cleanup of ns
As inspired by the k8s e2e tests, test labels are used to differentiate the tests by their behavior, their execution and their importance.
Test labels mean that tests are described using predefined labels in the test's text (e.g. ginkgo.It("[BETA] this is a test")).
With this labeling strategy, it is also possible to see the test properties directly in the code; promoting a test can be done via a pull request and will then be automatically recognized by the testmachinery with the next release.
Using ginkgo focus to only run desired tests and combined testsuites, an example test definition will look like the following:
kind: TestDefinition
metadata:
  name: gardener-beta-suite
spec:
  description: Test suite that runs all gardener tests that are labeled as beta
  activeDeadlineSeconds: 7200
  labels: ["gardener", "beta"]
  command: [bash, -c]
  args:
  - >-
    go test -timeout=0 -mod=vendor ./test/integration/suite
    --v -ginkgo.v -ginkgo.progress -ginkgo.no-color
    -ginkgo.focus="[GARDENER] [BETA]"
Using this approach, the overall number of testsuites is then reduced to a fixed number (excluding the system steps) of test suites * labelCombinations.
Framework
The new framework will consist of a common framework, a gardener framework (integrating the common framework), and a shoot framework (integrating the gardener framework).
All of these frameworks will have their own configuration that is exposed via command-line flags, so that, for example, the shoot test framework can be executed with go test -timeout=0 -mod=vendor ./test/integration/suite --v -ginkgo.v -ginkgo.focus="[SHOOT]" --kubecfg=/path/to/config --shoot-name=xx.
The available test labels should be declared in the code with predefined values and in a predefined order so that everyone is aware about possible labels and the tests are labeled similarly across all integration tests. This approach is somehow similar to what kubernetes is doing in their e2e test suite but with some more restrictions (compare example k8s e2e test).
A possible solution for consistent labeling would be to define the labels with every new ginkgo.It definition: f.Beta().Flaky().It("my test"), which internally orders them and would produce a ginkgo test with the text [BETA] [FLAKY] my test.
General Functions
The test framework should include some general functions that can and will be reused by every test.
These general functions may include:
- Logging
- State dump
- Detailed test output (status, duration, etc.)
- Cleanup handling per test (It)
- General easy-to-use functions like WaitUntilDeploymentCompleted, GetLogs, ExecCommand, AvailableCloudprofiles, etc.
Example
A possible test with the new test framework would look like:
var _ = ginkgo.Describe("Shoot network testing", func() {

	// the testframework registers some cleanup handling for a state dump on failure and maybe cleanup of created namespaces
	f := framework.NewShootFramework()

	f.CAfterEach(func(ctx context.Context) {
		ginkgo.By("cleanup network test daemonset")
		err := f.ShootClient.Client().Delete(ctx, &appsv1.DaemonSet{ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace}})
		if err != nil && !apierrors.IsNotFound(err) {
			// fail the cleanup if the daemonset could not be deleted for any reason other than "not found"
			Expect(err).ToNot(HaveOccurred())
		}
	}, FinalizationTimeout)

	f.Release().Default().CIt("should reach all webservers on all nodes", func(ctx context.Context) {
		ginkgo.By("Deploy the net test daemon set")
		templateFilepath := filepath.Join(f.ResourcesDir, "templates", nginxTemplateName)
		err := f.RenderAndDeployTemplate(f.Namespace(), templateFilepath)
		Expect(err).ToNot(HaveOccurred())

		err = f.WaitUntilDaemonSetIsRunning(ctx, f.ShootClient.Client(), name, namespace)
		Expect(err).NotTo(HaveOccurred())

		pods := &corev1.PodList{}
		err = f.ShootClient.Client().List(ctx, pods, client.MatchingLabels{"app": "net-nginx"})
		Expect(err).NotTo(HaveOccurred())

		// check if all webservers can be reached from all nodes
		ginkgo.By("test connectivity to webservers")
		shootRESTConfig := f.ShootClient.RESTConfig()
		var res error
		for _, from := range pods.Items {
			for _, to := range pods.Items {
				// test connectivity from pod to pod (the actual check is elided in this sketch)
				data := dialFromPod(ctx, shootRESTConfig, from, to) // illustrative helper
				f.Logger.Infof("%s to %s: %s", from.GetName(), to.GetName(), data)
			}
		}
		Expect(res).ToNot(HaveOccurred())
	}, NetworkTestTimeout)
})
Future Plans
Ownership
When the test coverage is increased and there are more tests, we will need to track ownership of tests.
At the beginning, the ownership will be shared across all maintainers of the residing repository, but this will no longer be suitable as the tests grow and get more complex.
Therefore, test ownership should be tracked via subgroups (in kubernetes this would be a SIG, compare the sig-apps e2e tests). These subgroups will then be tracked via labels, and the members of these groups will be notified if tests fail.
8 - 10 Shoot Additional Container Runtimes
Gardener extensibility to support shoot additional container runtimes
Summary
Gardener-managed Kubernetes clusters are sometimes used to run sensitive workloads, which may be comprised of OCI images originating from untrusted sources. Additional use-cases want to leverage economy of scale to run workloads for multiple tenants on the same cluster. In some cases, Gardener users want to use operating systems which do not easily support the Docker engine.
This proposal aims to allow Gardener Shoot clusters to use CRI instead of the legacy Docker API, and to provide an extension type for adding CRI shims (like gVisor and Kata Containers), which can be used to add support for these runtimes in Gardener Shoot clusters.
Motivation
While pods and containers are intended to create isolated areas for concurrently running workloads on nodes, this isolation is not as robust as could be expected. Containers leverage the core Linux CGroup and Namespace features to isolate workloads, and many kernel vulnerabilities have the potential to allow processes to escape from their isolation. Once a process has escaped from its container, any other process running on the same node is compromised. Several projects try to mitigate this problem; for example Kata Containers allow isolating a Kubernetes Pod in a micro-vm, gVisor reduces the kernel attack surface by adding another level of indirection between the actual payload and the real kernel.
Kubernetes supports running pods using these alternate runtimes via the RuntimeClass concept, which was promoted to Beta in Kubernetes 1.14. Once Kubernetes is configured to use the Container Runtime Interface to control pods, it becomes possible to leverage CRI and run specific pods using different Runtime Classes. Additionally, configuring Kubernetes to use CRI instead of the legacy Dockershim is faster.
The motivation behind this proposal is to make all of this functionality accessible to Shoot clusters managed by Gardener.
Goals
- Gardener must allow configuring its managed clusters with the CRI interface instead of the legacy Dockershim.
- Low-level runtimes like gVisor or Kata Containers are provided as gardener extensions which are (optionally) installed into a landscape by the Gardener operator. There must be no runtime-specific knowledge in the core Gardener code.
- It shall be possible to configure multiple low-level runtimes in Shoot clusters, on the Worker Group level.
Proposal
Gardener today assumes that all supported operating systems have Docker pre-installed in the base image. Starting with Docker Engine 1.11, Docker itself was refactored and cleaned-up to be based on the containerd library. The first phase would be to allow the change of the Kubelet configuration as described here so that Kubernetes would use containerd instead of the default Dockershim. This will be implemented for CoreOS, Ubuntu, and SuSE-CHost.
We will implement two Gardener extensions, providing gVisor and Kata Containers as options for Gardener landscapes.
The WorkerGroup
specification will be extended to allow specifying the CRI name and a list of additional required Runtimes for nodes in that group. For example:
workers:
- name: worker-b8jg5
  machineType: m5.large
  volumeType: gp2
  volumeSize: 50Gi
  autoScalerMin: 1
  autoScalerMax: 2
  maxSurge: 1
  cri:
    name: containerd
    containerRuntimes:
    - type: gvisor
    - type: kata-containers
  machineImage:
    name: coreos
    version: 2135.6.0
Each extension will need to address the following concerns:
- Add the low-level runtime binaries to the worker nodes. Each extension should get the runtime binaries from a container.
- Hook the runtime binary into the containerd configuration file, so that the runtime becomes available to containerd.
- Apply a label to each node that allows identifying nodes where the runtime is available.
- Apply the relevant RuntimeClass to the Shoot cluster, to expose the functionality to users.
- Provide a separate binary with a ValidatingWebhook (deployable to the garden cluster) to catch invalid configurations. For example, Kata Containers on AWS requires a machineType of i3.metal, so any Shoot requests with a Kata Containers runtime and a different machine type on AWS should be rejected.
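A minimal sketch of how such a webhook could be registered in the garden cluster follows; all names, the service, and the path are illustrative assumptions:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kata-containers-shoot-validator # illustrative
webhooks:
- name: shoots.kata-containers.validators.gardener.cloud # illustrative
  rules:
  - apiGroups: ["core.gardener.cloud"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["shoots"]
  clientConfig:
    service:
      name: kata-containers-validator # illustrative service running in the garden cluster
      namespace: garden
      path: /validate-shoot
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail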
Design Details
Change the nodes' container runtime to work with CRI and ContainerD (only if specified in the Shoot spec):
In order to configure each worker machine in the cluster to work with CRI, the following configurations should be applied:
- Add kubelet execution flags:
- --container-runtime=remote
- --container-runtime-endpoint=unix:///run/containerd/containerd.sock
- Make sure that the default containerd configuration file exists in path /etc/containerd/config.toml.
ContainerD and Docker configurations are different for each OS. To make sure the default configurations above work well on each worker machine, each OS extension would be responsible for configuring them during the reconciliation of the OperatingSystemConfig:
- os-ubuntu -
  - Create a ContainerD unit drop-in to execute ContainerD with the default configuration file in path /etc/containerd/config.toml.
  - Create the container runtime metadata file with an OS path for binary installations: /usr/bin.
- os-coreos -
  - Create a ContainerD unit drop-in to execute ContainerD with the default configuration file in path /etc/containerd/config.toml.
  - Create a Docker drop-in unit to execute Docker with the correct socket path of ContainerD.
  - Create the container runtime metadata file with an OS path for binary installations: /var/bin.
- os-suse-chost -
  - Create a ContainerD service unit and execute ContainerD with the default configuration file in path /etc/containerd/config.toml.
  - Download and install the ctr-cli, which is not shipped with the current SuSE image.
  - Create the container runtime metadata file with an OS path for binary installations: /usr/sbin.
To rotate the ContainerD (CRI) logs, we will activate the kubelet feature gate CRIContainerLogRotation=true.
The Docker monitor service will be replaced with an equivalent ContainerD monitor service.
Validate the workers' additional runtime configurations:
- Disallow additional runtimes for shoots < 1.14.
- kata-containers validation: the machine type must support nested virtualization.
Add support for each additional container runtime in the cluster.
In order to install each additional available runtime in the cluster, we should:
- Install the runtime binaries on the nodes of each worker pool that specified the runtime support.
- Apply the relevant RuntimeClass to the cluster.
The installation above should be done by a new kind of extension: the ContainerRuntime resource. For each container runtime type (kata-containers/gvisor), a dedicated extension controller will be created.
A label for each supported container runtime will be added to every node that belongs to the worker pool. This should be done similarly to the way labels are created for each node today, through kubelet execution parameters (kubelet.flags: --node-labels). When creating the OperatingSystemConfig (original) for the worker, each supported container runtime should be mapped to a label on the node.
For example:
label: container.runtime.kata-containers=true (shoot.spec.cloud..worker.containerRuntimes.kata-container)
label: container.runtime.gvisor=true (shoot.spec.cloud..worker.containerRuntimes.gvisor)
During the Shoot reconciliation (with similar steps to the extensions today), Gardener will create a new ContainerRuntime resource if a container runtime exists in at least one worker spec:
apiVersion: extensions.gardener.cloud/v1alpha1
kind: ContainerRuntime
metadata:
  name: kata-containers-runtime-extension
  namespace: shoot--foo--bar
spec:
  type: kata-containers
Gardener will wait until all ContainerRuntime extension resources have been reconciled by the appropriate extension controllers.
Each runtime extension controller is responsible for reconciling its ContainerRuntime resource type: the rc-kata-containers extension controller will reconcile ContainerRuntime resources of type kata-containers, and rc-gvisor will reconcile ContainerRuntime resources of type gvisor.
Reconciliation process by the container runtime extension controllers:
- A runtime extension controller of a specific type should apply a chart which is responsible for the installation of the container runtime in the cluster:
  - A DaemonSet which will run a privileged pod on each node carrying the label container.runtime.<type>: true. The pod will be responsible for:
    - Copying the container runtime binaries (from the extension package) to the relevant path in the host OS.
    - Adding the relevant container runtime plugin section to the containerd configuration file (/etc/containerd/config.toml).
    - Restarting containerd on the node.
  - The RuntimeClasses in the cluster to support the runtime class, for example:
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
- Update the status of the relevant ContainerRuntime resource to succeeded.
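Workloads can then opt into such a runtime through the standard Kubernetes runtimeClassName field, for example:
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-nginx # illustrative
spec:
  runtimeClassName: gvisor # matches the RuntimeClass applied above
  containers:
  - name: nginx
    image: nginx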
9 - 12 Oidc Webhook Authenticator
OIDC Webhook Authenticator
Problem
In Kubernetes you can authenticate via several authentication strategies:
- x509 Client Certificates
- Static Token Files
- Bootstrap Tokens
- Static Password File (Basic authentication - deprecated and removed in 1.19)
- Service Account Tokens
- OpenID Connect Tokens
- Webhook Token Authentication
- Authenticating Proxy
End-users should use OpenID Connect (OIDC) tokens created by an OIDC-compatible Identity Provider (IDP) and present the id_token to the kube-apiserver. If the kube-apiserver is configured to trust the IDP and the token is valid, then the user is authenticated and the UserInfo is sent to the authorization stack.
Ideally, operators of the Gardener cluster should be able to authenticate to end-user Shoot clusters with an id_token generated by an OIDC IDP, but in many cases end-users might already have configured OIDC for their cluster, and more than one OIDC configuration is not allowed.
Another interesting application of multiple OIDC providers would be a per-Project OIDC provider, where end-users of Gardener can add their own OIDC-compatible IDPs.
To work around the one-OIDC-per-kube-apiserver limitation, a new OIDC Webhook Authenticator (OWA) could be implemented.
Goals
- Dynamic registration of OpenID Connect configurations.
- Stay as close as possible to the Kubernetes built-in OIDC authenticator.
- Build it as an optional extension, required neither for a functional Shoot nor Gardener cluster.
Non-goals
Proposal
The kube-apiserver can use Webhook Token Authentication to send a Bearer Token (id_token) to an external webhook for validation:
{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "spec": {
    "token": "(BEARERTOKEN)"
  }
}
Where upon verification, the remote webhook returns the identity of the user (if authentication succeeds):
{
  "apiVersion": "authentication.k8s.io/v1beta1",
  "kind": "TokenReview",
  "status": {
    "authenticated": true,
    "user": {
      "username": "janedoe@example.com",
      "uid": "42",
      "groups": [
        "developers",
        "qa"
      ],
      "extra": {
        "extrafield1": [
          "extravalue1",
          "extravalue2"
        ]
      }
    }
  }
}
Registration of new OpenIDConnect
This new OWA can be configured with multiple OIDC providers and the entire flow can look like this:
Admin adds a new OpenIDConnect
resource (via CRD) to the cluster.
apiVersion: authentication.gardener.cloud/v1alpha1
kind: OpenIDConnect
metadata:
name: foo
spec:
issuerURL: https://foo.bar
clientID: some-client-id
usernameClaim: email
usernamePrefix: "test-"
groupsClaim: groups
groupsPrefix: "baz-"
supportedSigningAlgs:
- RS256
requiredClaims:
baz: bar
caBundle: LS0tLS1CRUdJTiBDRVJU...base64-encoded CA certs for issuerURL.
OWA watches for changes on this resource and does OIDC discovery. The OIDC provider's configuration has to be accessible under the spec.issuerURL with a well-known path (.well-known/openid-configuration).
OWA uses the jwks_uri obtained from the OIDC provider's configuration to fetch the OIDC provider's public keys from that endpoint.
OWA uses those keys, the issuer, the client_id, and other settings to add an OIDC authenticator to an in-memory list of token authenticators.

End-user authentication via new OpenIDConnect IDP
When a user presents an id_token obtained from a custom OpenID Connect IDP, the flow looks like this:
The user authenticates against the Custom IDP.
An id_token is obtained from the Custom IDP.
The user uses this id_token to perform an API call to the kube-apiserver.
As the id_token is not matched by any built-in or configured authenticators in the kube-apiserver, it is sent to OWA for validation.
{
  "TokenReview": {
    "kind": "TokenReview",
    "apiVersion": "authentication.k8s.io/v1beta1",
    "spec": {
      "token": "ddeewfwef..."
    }
  }
}
OWA uses a TokenReview to authenticate the calling API server (the kube-apiserver used for delegation of authentication and authorization may be different from the calling kube-apiserver).
{
  "TokenReview": {
    "kind": "TokenReview",
    "apiVersion": "authentication.k8s.io/v1beta1",
    "spec": {
      "token": "api-server-token..."
    }
  }
}
The Authentication API server then returns the identity of the calling API server:
{
  "apiVersion": "authentication.k8s.io/v1",
  "kind": "TokenReview",
  "metadata": {
    "creationTimestamp": null
  },
  "spec": {
    "token": "eyJhbGciOiJSUzI1NiIsImtpZCI6InJocEdLTXZlYjV1OE5heD..."
  },
  "status": {
    "authenticated": true,
    "user": {
      "groups": [
        "system:serviceaccounts",
        "system:serviceaccounts:shoot--abcd",
        "system:authenticated"
      ],
      "uid": "14db103e-88bb-4fb3-8efd-ca9bec91c7bf",
      "username": "system:serviceaccount:shoot--abcd:kube-apiserver"
    }
  }
}
OWA makes a SubjectAccessReview call to the Authorization API server to ensure that the calling API server is allowed to validate tokens:
{
  "apiVersion": "authorization.k8s.io/v1",
  "kind": "SubjectAccessReview",
  "spec": {
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:shoot--abcd",
      "system:authenticated"
    ],
    "nonResourceAttributes": {
      "path": "/validate-token",
      "verb": "post"
    },
    "user": "system:serviceaccount:shoot--abcd:kube-apiserver"
  },
  "status": {
    "allowed": true,
    "reason": "RBAC: allowed by RoleBinding \"kube-apiserver\" of ClusterRole \"kube-apiserver\" to ServiceAccount \"system:serviceaccount:shoot--abcd:kube-apiserver\""
  }
}
OWA then iterates over all registered OpenIDConnect token authenticators and tries to validate the token.
Upon a successful validation, it returns the TokenReview with user, groups, and extra parameters:
{
  "TokenReview": {
    "kind": "TokenReview",
    "apiVersion": "authentication.k8s.io/v1beta1",
    "spec": {
      "token": "ddeewfwef..."
    },
    "status": {
      "authenticated": true,
      "user": {
        "username": "test-foo@bar.com",
        "groups": [
          "baz-employee"
        ],
        "extra": {
          "gardener.cloud/authenticator/name": [
            "foo"
          ],
          "gardener.cloud/authenticator/uid": [
            "e5062528-e5a4-4b97-ad83-614d015b0979"
          ]
        }
      }
    }
  }
}
It also adds some extra information which can be used by custom authorizers later on:
- gardener.cloud/authenticator/name contains the name of the OpenIDConnect authenticator which was used.
- gardener.cloud/authenticator/uid contains the metadata.uid of the OpenIDConnect authenticator which was used.
The kube-apiserver proceeds with the authorization checks and returns the response.
An overview of the flow:

Deployment for Shoot clusters
OWA can be deployed per Shoot cluster via the Shoot OIDC Service extension. The shoot's kube-apiserver is mutated so that it has the following flag configured:
--authentication-token-webhook-config-file=/etc/webhook/kubeconfig
OWA, on the other hand, uses the shoot's kube-apiserver and delegates auth capabilities to it. This means that the needed RBAC is managed in the shoot cluster. By default, only the shoot's kube-apiserver has permissions to validate tokens against OWA.
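The file referenced by this flag uses the standard kubeconfig format for Kubernetes webhook token authentication; a minimal sketch follows (the server address, certificate paths, and names are illustrative assumptions):
apiVersion: v1
kind: Config
clusters:
- name: oidc-webhook-authenticator
  cluster:
    certificate-authority: /etc/webhook/certs/ca.crt
    server: https://oidc-webhook-authenticator.garden.svc/validate-token # illustrative address
users:
- name: kube-apiserver
  user:
    client-certificate: /etc/webhook/certs/tls.crt
    client-key: /etc/webhook/certs/tls.key
contexts:
- name: oidc-webhook-authenticator
  context:
    cluster: oidc-webhook-authenticator
    user: kube-apiserver
current-context: oidc-webhook-authenticator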
10 - 13 Automated Seed Management
Automated Seed Management
Automated seed management involves automating certain aspects of managing seeds in Garden clusters, such as ensuring that the seeds' capacity for shoots is not exceeded and managing the creation and deletion of seeds themselves.
Implementing these features would involve changes to various existing Gardener components, as well as perhaps introducing new ones. This document describes these features in more detail and proposes a design approach for some of them.
In Gardener, scheduling shoots onto seeds is quite similar to scheduling pods onto nodes in Kubernetes. Therefore, a guiding principle behind the proposed design approaches is taking advantage of best practices and existing components already used in Kubernetes.
Ensuring Seeds Capacity for Shoots Is Not Exceeded
Seeds have a practical limit of how many shoots they can accommodate. Exceeding this limit is undesirable as the system performance will be noticeably impacted. Therefore, it is important to ensure that a seed’s capacity for shoots is not exceeded by introducing a maximum number of shoots that can be scheduled onto a seed and making sure that it is taken into account by the scheduler.
An initial discussion of this topic is available in Issue #2938. The proposed solution is based on the following flow:
- The gardenlet is configured with certain resources and their total capacity (and, for certain resources, the amount reserved for Gardener).
- The gardenlet seed controller updates the Seed status with the capacity of each resource and how much of it is actually available to be consumed by shoots, using capacity and allocatable fields that are very similar to the corresponding fields in the Node status.
- When scheduling shoots, gardener-scheduler is influenced by the remaining capacity of the seed. In the simplest possible implementation, it never schedules shoots onto a seed that has already reached its capacity for a resource needed by the shoot.
Initially, the only resource considered would be the maximum number of shoots that can be scheduled onto a seed. Later, more resources could be added to make more precise scheduling calculations.
Note: Resources could also be requested by shoots, similarly to how pods can request node resources, and the scheduler could then ensure that such requests are taken into account when scheduling shoots onto seeds. However, the user is rarely, if at all, concerned with what resources does a shoot consume from a seed, and this should also be regarded as an implementation detail that could change in the future. Therefore, such resource requests are not included in this GEP.
In addition, an extensibility plugin framework could be introduced in the future in order to advertise custom resources, including provider-specific resources, so that gardenlet would be able to update the seed status with their capacity and allocatable values, for example load balancers on Azure. Such a concept is not described here in further detail, as it is sufficiently complex to require a separate GEP.
Example Seed status with capacity and allocatable fields:
status:
  capacity:
    shoots: "100"
    persistent-volumes: "200" # Built-in resource
    azure.provider.extensions.gardener.cloud/load-balancers: "30" # Custom resource advertised by an Azure-specific plugin
  allocatable:
    shoots: "100"
    persistent-volumes: "197" # 3 persistent volumes are reserved for Gardener
    azure.provider.extensions.gardener.cloud/load-balancers: "30"
Gardenlet Configuration
As mentioned above, the total resource capacity for built-in resources such as the number of shoots is specified as part of the gardenlet configuration, not in the Seed spec. The gardenlet configuration itself could be specified in the spec of the newly introduced ManagedSeed resource. Here it is assumed that in the future this could become the recommended and most widely used way to manage seeds. If the same gardenlet is responsible for multiple seeds, they would all share the same capacity settings.
To specify the total resource capacity for built-in resources, as well as the amount of such resources reserved for Gardener, the 2 new fields resources.capacity and resources.reserved are introduced in the GardenletConfiguration resource. The gardenlet seed controller would then initialize the capacity and allocatable fields in the seed status as follows:
- The capacity value is set to the configured resources.capacity.
- The allocatable value is set to the configured resources.capacity minus resources.reserved.
Example GardenletConfiguration with resources.capacity and resources.reserved fields:
resources:
  capacity:
    shoots: 100
    persistent-volumes: 200
  reserved:
    persistent-volumes: 3
Scheduling Algorithm
Currently gardener-scheduler
uses a simple non-extensible algorithm in order to schedule shoots onto seeds. It goes through the following stages:
- Filter out seeds that don’t meet scheduling requirements such as being ready, matching cloud profile and shoot label selectors, matching the shoot provider, and not having taints that are not tolerated by the shoot.
- From the remaining seeds, determine candidates that are considered best based on their region, by using a strategy that can be either “same region” or “minimal distance”.
- Among these candidates, choose the one with the least number of shoots.
This scheduling algorithm should be adapted in order to properly take into account resources capacity and requests. As a first step, during the filtering stage, any seeds that would exceed their capacity for shoots, or their capacity for any resources requested by the shoot, should simply be filtered out and not considered during the next stages.
Later, the scheduling algorithm could be further enhanced by replacing the step in which the region strategy is applied by a scoring step similar to the one in Kubernetes Scheduler. In this scoring step, the scheduler would rank the remaining seeds to choose the most suitable shoot placement. It would assign a score to each seed that survived filtering based on a list of scoring rules. These rules might include for example MinimalDistance
and SeedResourcesLeastAllocated
, among others. Each rule would produce its own score for the seed, and the overall seed score would be calculated as a weighted sum of all such scores. Finally, the scheduler would assign the shoot to the seed with the highest ranking.
ManagedSeeds
When all or most of the existing seeds are near capacity, new seeds should be created in order to accommodate more shoots. Conversely, sometimes there could be too many seeds for the number of shoots, and so some of the seeds could be deleted to save resources. Currently, the process of creating a new seed involves a number of manual steps, such as creating a new shoot that meets certain criteria and then registering it as a seed in Gardener. This can be automated to some extent by annotating a shoot with the use-as-seed annotation in order to create a “shooted seed”. However, adding more than one similar seed still requires manually creating all needed shoots, annotating them appropriately, and making sure that they are successfully reconciled and registered.
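For reference, the annotation-based approach looks roughly like this (the exact annotation key and value syntax are assumptions here):
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: crazy-botany
  namespace: garden
  annotations:
    shoot.gardener.cloud/use-as-seed: "true" # assumed key; turns the shoot into a "shooted seed"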
To create, delete, and update seeds effectively in a declarative way and to allow auto-scaling, a “creatable seed” resource along with a “set” (and in the future, perhaps also a “deployment”) of such creatable seeds should be introduced, similar to the Kubernetes Pod, ReplicaSet, and Deployment (or the MCM Machine, MachineSet, and MachineDeployment) resources. With such resources (and their respective controllers), creating a new seed based on a template would become as simple as increasing the replicas field in the “set” resource.
In Issue #2181 it is already proposed that the use-as-seed annotation be replaced by a dedicated ShootedSeed resource. The solution proposed here further elaborates on this idea.
ManagedSeed Resource
The ManagedSeed resource is a dedicated custom resource that represents an evolution of the “shooted seed” and properly replaces the use-as-seed annotation. This resource contains:
- The name of the Shoot that should be registered as a Seed.
- An optional seedTemplate section that contains the Seed spec and parts of the metadata, such as labels and annotations.
- An optional gardenlet section that contains:
  - gardenlet deployment parameters, such as the number of replicas, the image, etc.
  - The GardenletConfiguration resource that contains controllers configuration, feature gates, and a seedConfig section that contains the Seed spec and parts of its metadata.
  - Additional configuration parameters, such as the garden connection bootstrap mechanism (see TLS Bootstrapping), and whether to merge the provided configuration with the configuration of the parent gardenlet.
Either the seedTemplate or the gardenlet section must be specified, but not both:
- If the seedTemplate section is specified, gardenlet is not deployed to the shoot, and a new Seed resource is created based on the template.
- If the gardenlet section is specified, gardenlet is deployed to the shoot, and it registers a new seed upon startup based on the seedConfig section of the GardenletConfiguration resource.
A ManagedSeed allows fine-tuning the seed and the gardenlet configuration of shooted seeds in order to deviate from the global defaults, e.g. lowering the concurrent syncs for some of the seed's controllers or enabling a feature gate only on certain seeds. Also, it simplifies the deletion protection of such seeds.
Furthermore, the ManagedSeed resource is a more powerful alternative to the use-as-seed annotation. The implementation of the use-as-seed annotation itself could be refactored to use a ManagedSeed resource extracted from the annotation by a controller.
Although in this proposal a ManagedSeed is always a “shooted seed”, that is, a Shoot that is registered as a Seed, this idea could be further extended in the future by adding a `type` field that could be either `Shoot` (implied in this proposal) or something different. Such an extension would allow registering and managing as a Seed a cluster that is not a Shoot, e.g. a GKE cluster.
Last but not least, ManagedSeeds could be used as the basis for creating and deleting seeds automatically via the `ManagedSeedSet` resource that is described in ManagedSeedSets.
Unlike the `Seed` resource, the `ManagedSeed` resource is namespaced. If created in the `garden` namespace, the resulting seed is globally available. If created in a project namespace, the resulting seed can be used as a “private seed” by shoots in the project, either by being decorated with project-specific taints and labels, or by being of the special `PrivateSeed` kind that is also namespaced. The concept of private seeds / cloudprofiles is described in Issue #2874. Until this concept is implemented, `ManagedSeed` resources might need to be restricted to the `garden` namespace, similarly to how shoots with the `use-as-seed` annotation currently are.
Example `ManagedSeed` resource with a `seedTemplate` section:
```yaml
apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: crazy-botany
  namespace: garden
spec:
  shoot:
    name: crazy-botany # Shoot that should be registered as a Seed
  seedTemplate: # Seed template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      provider:
        type: gcp
        region: europe-west1
      taints:
      - key: seed.gardener.cloud/protected
      ...
```
Example `ManagedSeed` resource with a `gardenlet` section:
```yaml
apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeed
metadata:
  name: crazy-botany
  namespace: garden
spec:
  shoot:
    name: crazy-botany # Shoot that should be registered as a Seed
  gardenlet:
    deployment: # Gardenlet deployment configuration
      replicaCount: 1
      revisionHistoryLimit: 10
      serviceAccountName: gardenlet
      image:
        repository: eu.gcr.io/gardener-project/gardener/gardenlet
        tag: latest
        pullPolicy: IfNotPresent
      resources:
        ...
      podLabels:
        ...
      podAnnotations:
        ...
      additionalVolumes:
        ...
      additionalVolumeMounts:
        ...
      env:
        ...
      vpa: false
    config: # GardenletConfiguration resource
      apiVersion: gardenlet.config.gardener.cloud/v1alpha1
      kind: GardenletConfiguration
      seedConfig: # Seed template, including spec and parts of the metadata
        metadata:
          labels:
            foo: bar
        spec:
          provider:
            type: gcp
            region: europe-west1
          taints:
          - key: seed.gardener.cloud/protected
          ...
      controllers:
        shoot:
          concurrentSyncs: 20
      featureGates:
        ...
      ...
    bootstrap: BootstrapToken
    mergeWithParent: true
```
ManagedSeed Controller
ManagedSeeds are reconciled by a new managed seed controller in `gardenlet`. Its implementation is very similar to the current seed registration controller, and in fact could be regarded as a refactoring of the latter, with the difference that it uses the `ManagedSeed` resource rather than the `use-as-seed` annotation on a Shoot. The `gardenlet` only reconciles ManagedSeeds that refer to Shoots scheduled on Seeds the `gardenlet` is responsible for.
Once this controller is considered sufficiently stable, the current `use-as-seed` annotation and the seed registration controller mentioned above should be marked as deprecated and eventually removed.
A `ManagedSeed` that is in use by shoots cannot be deleted unless the shoots are either deleted or moved to other seeds first. The managed seed controller ensures that this is the case by only allowing a ManagedSeed to be deleted if its Seed has already been deleted.
ManagedSeed Admission Plugins
In addition to the managed seed controller mentioned above, new `gardener-apiserver` admission plugins should be introduced to properly validate the creation and update of ManagedSeeds, as well as the deletion of shoots registered as seeds. These plugins should ensure that:
- A `Shoot` that is being referred to by a `ManagedSeed` cannot be deleted.
- Certain `Seed` spec fields, for example the provider type and region, networking CIDRs for pods, services, and nodes, etc., are the same as (or compatible with) the corresponding `Shoot` spec fields of the shoot that is being registered as seed.
- If such `Seed` spec fields are omitted or empty, the plugins should supply proper defaults based on the values in the `Shoot` resource (see the sketch after this list).
Provider-specific Seed Bootstrapping Actions
Bootstrapping a new seed might require additional provider-specific actions beyond the ones performed automatically by the managed seed controller. For example, on Azure this might include getting a new subscription, extending quotas, etc. This could eventually be automated by introducing an extension mechanism for the Gardener seed bootstrapping flow, to be handled by a new type of controller in the provider extensions. However, such an extension mechanism is not in the scope of this proposal and might require a separate GEP.
One idea that could be further explored is the use of shoot readiness gates, similar to Kubernetes pod readiness gates, to control whether a Shoot is considered `Ready` before it can be registered as a Seed. A provider-specific extension could set the special condition that is specified as a readiness gate to `True` only after it has successfully performed the needed provider-specific actions.
Changes to Existing Controllers
Since the Shoot registration as a Seed is decoupled from the Shoot reconciliation, existing `gardenlet` controllers would not have to be changed in order to properly support ManagedSeeds. The main change to `gardenlet` that would be needed is introducing the new managed seed controller mentioned above, and possibly retiring the old one at some point. In addition, the Shoot controller would need to be adapted, as it currently performs certain actions differently if the shoot has a “shooted seed”.
The introduction of the `ManagedSeed` resource would also require no changes to existing `gardener-controller-manager` controllers that operate on Shoots (for example, the shoot hibernation and maintenance controllers).
ManagedSeedSets
Similarly to a ReplicaSet, the purpose of a ManagedSeedSet is to maintain a stable set of replica ManagedSeeds available at any given time. As such, it is used to guarantee the availability of a specified number of identical ManagedSeeds, on an equal number of identical Shoots.
ManagedSeedSet Resource
The `ManagedSeedSet` resource has a `selector` field that specifies how to identify ManagedSeeds it can acquire, a number of `replicas` indicating how many ManagedSeeds (and their corresponding Shoots) it should be maintaining, and two templates:
- A ManagedSeed template (`template`) specifying the data of new ManagedSeeds it should create to meet the number of replicas criterion.
- A Shoot template (`shootTemplate`) specifying the data of new Shoots it should create to host the ManagedSeeds.
A ManagedSeedSet then fulfills its purpose by creating and deleting ManagedSeeds (and their corresponding Shoots) as needed to reach the desired number.
A ManagedSeedSet is linked to its ManagedSeeds and Shoots via the `metadata.ownerReferences` field, which specifies what resource the current object is owned by. All ManagedSeeds and Shoots acquired by a ManagedSeedSet have their owning ManagedSeedSet’s identifying information within their `ownerReferences` field.
Example `ManagedSeedSet` resource:
```yaml
apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: ManagedSeedSet
metadata:
  name: crazy-botany
  namespace: garden
spec:
  replicas: 3
  selector:
    matchLabels:
      foo: bar
  updateStrategy:
    type: RollingUpdate # Update strategy, must be `RollingUpdate`
    rollingUpdate:
      partition: 2 # Only update the last replica (#2), assuming there are no gaps ("rolling out a canary")
  template: # ManagedSeed template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      # shoot.name is not specified since it's filled automatically by the controller
      seedTemplate: # Either a seed or a gardenlet section must be specified, see above
        metadata:
          labels:
            foo: bar
        spec:
          provider:
            type: gcp
            region: europe-west1
          taints:
          - key: seed.gardener.cloud/protected
          ...
  shootTemplate: # Shoot template, including spec and parts of the metadata
    metadata:
      labels:
        foo: bar
    spec:
      cloudProfileName: gcp
      secretBindingName: shoot-operator-gcp
      region: europe-west1
      provider:
        type: gcp
      ...
```
ManagedSeedSet Controller
ManagedSeedSets are reconciled by a new managed seed set controller in `gardener-controller-manager`. During the reconciliation this controller creates and deletes ManagedSeeds and Shoots in response to changes to the `replicas` and `selector` fields.
Note: The introduction of the `ManagedSeedSet` resource would not require any changes to `gardenlet` or to existing `gardener-controller-manager` controllers.
Managing ManagedSeed Updates
To manage ManagedSeed updates, we considered two possible approaches:
- A ManagedSeedSet, similarly to a ReplicaSet, does not manage updates to its replicas in any way. In the future, we might introduce ManagedSeedDeployments, a higher-level concept that manages ManagedSeedSets and provides declarative updates to ManagedSeeds along with other useful features, similarly to a Deployment. Such a mechanism would involve creating new ManagedSeedSets, and therefore new seeds, behind the scenes, and moving existing shoots to them.
- A ManagedSeedSet does manage updates to its replicas, similarly to a StatefulSet. Updates are performed “in-place”, without creating new seeds and moving existing shoots to them. Such a mechanism could also take advantage of other StatefulSet features, such as ordered rolling updates and phased rollouts.
There is an important difference between seeds and pods or nodes in that seeds are more “heavyweight” and therefore updating a set of seeds by introducing new seeds and moving shoots to them tends to be much more complex, time-consuming, and prone to failures compared to updating the seeds “in place”. Furthermore, updating seeds in this way depends on a mature implementation of GEP-7: Shoot Control Plane Migration, which is not available right now. Due to these considerations, we favor the second approach over the first one.
ManagedSeed Identity and Order
A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods. It maintains a stable identity (including network identity) for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
A StatefulSet achieves the above by associating each replica with an ordinal number. With n replicas, these ordinal numbers range from 0 to n-1. When scaling out, newly added replicas always have ordinal numbers larger than those of previously existing replicas. When scaling in, it is the replicas with the largest ordinal numbers that are removed.
Besides stable identity and persistent storage, these ordinal numbers are also used to implement the following StatefulSet features:
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates. Such rolling updates can be partitioned (limited to replicas with ordinal numbers greater than or equal to the “partition”) to achieve phased rollouts.
A ManagedSeedSet, unlike a StatefulSet, does not need to maintain a stable identity for its ManagedSeeds. Furthermore, it would not be practical to always remove the replicas with the largest ordinal numbers when scaling in, since the corresponding seeds may have shoots scheduled onto them, while other seeds, with lower ordinals, may have fewer shoots (or none), and therefore be much better candidates for being removed.
On the other hand, it would be beneficial if a ManagedSeedSet, like a StatefulSet, provided ordered deployment and scaling, ordered rolling updates, and phased rollouts. The main advantage of these features is that a deployment or update failure would affect fewer replicas (ideally just one), containing any potential damage and making the situation easier to handle, thus achieving some of the goals stated in Issue #87. They could also help keep seed rolling updates outside business hours.
Based on the above considerations, we propose the following mechanism for handling ManagedSeed identity and order:
- A ManagedSeedSet uses ordinal numbers generated by an increasing sequence to identify ManagedSeeds and Shoots it creates and manages. These numbers always start from 0 and are incremented by 1 for each newly added replica.
- Replicas (both ManagedSeeds and Shoots) are named after the ManagedSeedSet with the ordinal number appended. For example, for a ManagedSeedSet named `test`, its replicas are named `test-0`, `test-1`, etc.
- Gaps in the sequence created by removing replicas with ordinal numbers in the middle of the range are never filled in. A newly added replica always receives a number that is not only free, but also unique to itself. For example, if there are 2 replicas named `test-0` and `test-1` and any one of them is removed, a newly added replica will still be named `test-2`.
Although such ordinal numbers can also provide some form of stable identity, in this case it is much more important that they can provide a predictable ordering for deployments and updates, and can also be used to partition rolling updates similarly to StatefulSet ordinal numbers.
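A minimal sketch of this numbering scheme; where the sequence counter is persisted (e.g. in the ManagedSeedSet status) is an assumption left open here:

```go
package seedset

import "fmt"

// setState tracks the next free ordinal for one ManagedSeedSet. In a real
// controller this counter would have to be persisted so that ordinals of
// removed replicas are never handed out again.
type setState struct {
	name        string
	nextOrdinal int
}

// newReplicaName returns the name for the next ManagedSeed/Shoot pair and
// advances the sequence; gaps left by deleted replicas are never reused.
func (s *setState) newReplicaName() string {
	name := fmt.Sprintf("%s-%d", s.name, s.nextOrdinal)
	s.nextOrdinal++
	return name
}
```

For a set named `test` that has already created `test-0` and `test-1`, deleting either replica and scaling back up yields `test-2`, never a reused name.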
Update Strategies
The ManagedSeedSet’s `.spec.updateStrategy` field allows configuring automated rolling updates for the ManagedSeeds and Shoots in a ManagedSeedSet.
Rolling Updates
The `RollingUpdate` update strategy implements automated, rolling updates for the ManagedSeeds and Shoots in a ManagedSeedSet. With this strategy, the ManagedSeedSet controller will update each ManagedSeed and Shoot in the ManagedSeedSet. It will proceed from the largest ordinal number to the smallest, updating each ManagedSeed and its corresponding Shoot one at a time. It will wait until both the Shoot and the Seed of an updated ManagedSeed are Ready prior to updating its predecessor.
As a further improvement upon the above, the controller could check not only the ManagedSeeds and their corresponding Shoots for readiness, but also the Shoots scheduled onto these ManagedSeeds. The rollout would then only continue if no more than X percent of these Shoots are not reconciled and Ready. Since checking all these additional conditions might require some complex logic, it should be performed by an independent managed seed care controller that updates the ManagedSeed resource with the readiness of its Seed and all Shoots scheduled onto the Seed.
Note that unlike a StatefulSet, an `OnDelete` update strategy is not supported.
Partitions
The `RollingUpdate` update strategy can be partitioned by specifying a `.spec.updateStrategy.rollingUpdate.partition`. If a partition is specified, only ManagedSeeds and Shoots with ordinals greater than or equal to the partition will be updated when any of the ManagedSeedSet’s templates is updated. All remaining ManagedSeeds and Shoots will not be updated. If a ManagedSeedSet’s `.spec.updateStrategy.rollingUpdate.partition` is greater than the largest ordinal number in use by a replica, updates to its templates will not be propagated to its replicas (but newly added replicas may still use the updated templates, depending on the partition value).
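A minimal sketch combining the rolling update order and the partition rule (the replica representation is simplified; “ready” stands for both the Shoot and the Seed of the replica being Ready):

```go
package seedset

import "sort"

type replica struct {
	ordinal  int
	revision string // value of the controller-revision-hash label
	ready    bool   // both the Shoot and the Seed are Ready
}

// nextToUpdate returns the replica to update next for the given partition
// and target revision, or nil if the rollout is complete, blocked on a
// not-yet-ready successor, or fenced off by the partition.
func nextToUpdate(replicas []replica, partition int, target string) *replica {
	// Proceed from the largest ordinal number to the smallest.
	sort.Slice(replicas, func(i, j int) bool { return replicas[i].ordinal > replicas[j].ordinal })
	for i := range replicas {
		r := &replicas[i]
		if r.ordinal < partition {
			return nil // replicas below the partition are never updated
		}
		if r.revision != target {
			return r // first out-of-date replica, counting down
		}
		if !r.ready {
			return nil // wait until the already updated replica is Ready
		}
	}
	return nil // everything at or above the partition is up to date
}
```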
Keeping Track of Revision History and Performing Rollbacks
Similarly to a StatefulSet, the ManagedSeedSet controller uses ControllerRevisions to keep track of the revision history, and `controller-revision-hash` labels to maintain an association between a ManagedSeed or a Shoot and the concrete template revisions based on which they were created or last updated. These are used for the following purposes:
- During an update, determine which replicas are still not on the latest revision and therefore should be updated.
- Display the revision history of a ManagedSeedSet via `kubectl rollout history`.
- Roll back all ManagedSeedSet replicas to a specific revision via `kubectl rollout undo`.

Note: The above `kubectl rollout` commands will not work with custom resources such as ManagedSeedSets out of the box (the documentation says explicitly that valid resource types are only deployments, daemonsets, and statefulsets), but it should be possible to eventually support such commands for ManagedSeedSets via a kubectl plugin.
Scaling-in ManagedSeedSets
Deleting ManagedSeeds in response to decreasing the replicas of a ManagedSeedSet deserves special attention for two reasons:
- A seed that is already in use by shoots cannot be deleted, unless the shoots are either deleted or moved to other seeds first.
- When there are more empty seeds than requested for deletion, determining which seeds to delete might not be as straightforward as with pods or nodes.
The above challenges could be addressed as follows:
- In order to scale in a ManagedSeedSet successfully, there should be at least as many empty ManagedSeeds as the difference between the old and the new replicas. In some cases, the user might need to ensure that this is the case by draining some seeds manually before decreasing the replicas field.
- It should be possible to protect ManagedSeeds from deletion even if they are empty, perhaps via an annotation such as `seedmanagement.gardener.cloud/protect-from-deletion`. Such seeds are not taken into account when determining whether the scale-in operation can succeed.
- The decision of which seeds to delete among the ManagedSeeds that are empty and not protected should be based on hints, perhaps again in the form of annotations, that could be added manually by the user, as well as on other factors; see Prioritizing ManagedSeed Deletion.
Prioritizing ManagedSeed Deletion
To help the controller decide which empty ManagedSeeds are to be deleted first, the user could manually annotate ManagedSeeds with a seed priority annotation such as `seedmanagement.gardener.cloud/priority`. ManagedSeeds with lower priority are more likely to be deleted first. If not specified, a certain default value is assumed, for example 3.
Besides this annotation, the controller should also take other factors into account, such as the current seed conditions (`NotReady` seeds should be preferred for deletion over `Ready` ones), as well as age (older seeds should be preferred for deletion over newer ones).
Auto-scaling Seeds
The most interesting and advanced automated seed management feature is making sure that a Garden cluster has enough seeds registered to schedule new shoots (and, in the future, reschedule shoots from drained seeds) without exceeding the seeds’ capacity for shoots, but also no more seeds than are actually needed at any given moment. This would involve introducing an auto-scaling mechanism for seeds in Garden clusters.
The proposed solution builds upon the ideas introduced earlier. The `ManagedSeedSet` resource (and in the future, also the `ManagedSeedDeployment` resource) could have a `scale` subresource that changes the `replicas` field. This would allow a new “seed autoscaler” controller to scale these resources via a special “autoscaler” resource (for example `SeedAutoscaler`), similarly to how the Kubernetes Horizontal Pod Autoscaler controller scales pods, as described in Horizontal Pod Autoscaler Walkthrough.
The primary metric used for scaling should be the number of shoots already scheduled onto a seed, either as a direct value or as a percentage of the seed’s capacity for shoots introduced in Ensuring Seeds Capacity for Shoots Is Not Exceeded (utilization). Later, custom metrics based on other resources, including provider-specific resources, could be considered as well.
Note: Even though the controller is called Horizontal Pod Autoscaler, it is capable of scaling any resource with a `scale` subresource, using any custom metric. Therefore, it was initially proposed to use this controller directly. However, a number of important drawbacks were identified with this approach, and so it is no longer proposed here.
SeedAutoscaler Resource
The SeedAutoscaler automatically scales the number of ManagedSeeds in a ManagedSeedSet based on observed resource utilization. The resource could be any resource that is tracked via the `capacity` and `allocatable` fields in the Seed status, including in particular the number of shoots already scheduled onto the seed.
The SeedAutoscaler is implemented as a custom resource and a new controller. The resource determines the behavior of the controller. The `SeedAutoscaler` resource has a `scaleTargetRef` that specifies the target resource to be scaled, the minimum and maximum number of replicas, as well as a list of metrics. The only metric type supported initially is `Resource`, for resources that are tracked via the `capacity` and `allocatable` fields in the Seed status. The resource target can be of type `Utilization` or `AverageValue`.
Example `SeedAutoscaler` resource:
```yaml
apiVersion: seedmanagement.gardener.cloud/v1alpha1
kind: SeedAutoscaler
metadata:
  name: crazy-botany
  namespace: garden
spec:
  scaleTargetRef:
    apiVersion: seedmanagement.gardener.cloud/v1alpha1
    kind: ManagedSeedSet
    name: crazy-botany
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource # Only Resource is supported
    resource:
      name: shoots
      target:
        type: Utilization # Utilization or AverageValue
        averageUtilization: 50
```
SeedAutoscaler Controller
`SeedAutoscaler` resources are reconciled by a new seed autoscaler controller, either in `gardener-controller-manager` or out-of-tree, similarly to cluster-autoscaler. The controller periodically adjusts the number of replicas in a ManagedSeedSet to match the observed average resource utilization to the target specified by the user.
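A minimal sketch of the scaling arithmetic, assuming the same formula the Horizontal Pod Autoscaler uses (desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization)):

```go
package seedautoscaler

import "math"

// desiredReplicas computes the new replica count for the target
// ManagedSeedSet from the observed average utilization (e.g. scheduled
// shoots as a percentage of the seeds' shoot capacity), clamped to the
// configured bounds. targetUtilization is assumed to be > 0.
func desiredReplicas(current int, currentUtilization, targetUtilization float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentUtilization / targetUtilization))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}
```

With the example resource above, 3 seeds at an average utilization of 75% against the 50% target would yield ceil(3 × 75 / 50) = 5 replicas.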
Note: The SeedAutoscaler controller should perhaps not be limited to evaluating only metrics; it could also take into account taints, label selectors, etc. This is not yet reflected in the example `SeedAutoscaler` resource above. Such details are intentionally not specified in this GEP; they should be further explored in the issues created to track the actual implementation.
Evaluating Metrics for Autoscaling
The metrics used by the controller, for example the `shoots` metric above, could be evaluated in one of the following ways:
- Directly, by looking at the `capacity` and `allocatable` fields in the Seed status and comparing them to the actual resource consumption, calculated by simply counting all shoots that meet certain criteria (e.g. shoots that are scheduled onto the seed), then taking an average over all seeds in the set.
- By sampling existing metrics exported, for example, by `gardener-metrics-exporter`.
The second approach decouples the seed autoscaler controller from the actual metrics evaluation, and therefore allows plugging in new metrics more easily. It also has the advantage that the exported metrics could also be used for other purposes, e.g. for triggering Prometheus alerts or building Grafana dashboards. It has the disadvantage that the seed autoscaler controller would depend on the metrics exporter to do its job properly.
11 - 17 Shoot Control Plane Migration Bad Case
Shoot Control Plane Migration “Bad Case” Scenario
The migration flow described as part of GEP-7 can only be executed if both the Garden cluster and the source seed cluster are healthy, and `gardenlet` in the source seed cluster can connect to the Garden cluster. In this case, `gardenlet` can directly scale down the shoot’s control plane in the source seed, after checking the `spec.seedName` field.
However, there might be situations in which `gardenlet` in the source seed cluster can’t connect to the Garden cluster and determine that `spec.seedName` has changed. Similarly, the connection to the seed `kube-apiserver` could also be broken. This might be caused by issues with the seed cluster itself. In other situations, the migration flow steps in the source seed might have started but might not be able to finish successfully. In all such cases, it should still be possible to migrate a shoot’s control plane to a different seed, even though executing the migration flow steps in the source seed might not be possible. The potential “split brain” situation caused by having the shoot’s control plane components attempting to reconcile the shoot resources in two different seeds must still be avoided, by ensuring that the shoot’s control plane in the source seed is deactivated before it is activated in the destination seed.
The mechanisms and adaptations described below have been tested as part of a PoC prior to describing them here.
Owner Election / Copying Snapshots
To achieve the goals outlined above, an “owner election” (or rather, “ownership passing”) mechanism is introduced to ensure that the source and destination seeds are able to successfully negotiate a single “owner” during the migration. This mechanism is based on special owner DNS records that uniquely identify the seed that currently hosts the shoot’s control plane (“owns” the shoot).
For example, for a shoot named `i500152-gcp` in project `dev` that uses an internal domain suffix `internal.dev.k8s.ondemand.com` and is scheduled on a seed with an identity `shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev`, the owner DNS record is a TXT record with a domain name `owner.i500152-gcp.dev.internal.dev.k8s.ondemand.com` and a single value `shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev`. The owner DNS record is created and maintained by reconciling an `owner` DNSRecord resource, if the recently introduced DNSRecords feature is enabled via the `UseDNSRecords` feature gate.
Unlike other extension resources, the `owner` DNSRecord resource is not reconciled every time the shoot is reconciled, but only when the resource is created. Therefore, the owner DNS record value (the owner ID) is updated only when the shoot is migrated to a different seed. For more information, see Add handling of owner DNSRecord resources.
The owner DNS record domain name and owner ID are passed to components that need to perform ownership checks, such as the `backup-restore` container of the `etcd-main` StatefulSet, and all extension controllers. These components then check regularly whether the actual owner ID (the value of the record) matches the passed ID. If it doesn’t, the ownership check is considered failed, which causes the special behavior described below.
Note: A previous revision of this document proposed using “sync objects” written to and read from the backup container of the source seed as JSON files by the `etcd-backup-restore` processes in both seeds. With the introduction of owner DNS records, such sync objects are no longer needed.
For the destination seed to actually become the owner, it needs to acquire the shoot’s etcd data by copying the final full snapshot (and potentially also older snapshots) from the backup container of the source seed.
The mechanism to copy the snapshots and pass the ownership from the source to the destination seed consists of the following steps:
1. The reconciliation flow (“restore” phase) is triggered in the destination seed without first executing the migration flow in the source seed (or perhaps it was executed, but it failed, and its state is currently unknown).
2. The `owner` DNSRecord resource is created in the destination seed. As a result, the actual owner DNS record is updated with the destination seed ID. From this point, ownership checks by the `etcd-backup-restore` process and extension controller watchdogs in the source seed will fail, which will cause the special behavior described below.
3. An additional “source” backup entry referencing the source seed backup bucket is deployed to the Garden cluster and the destination seed and reconciled by the backup entry controller. As a result, a secret with the appropriate credentials for accessing the source seed backup container, named `source-etcd-backup`, is created in the destination seed. The normal backup entry (referencing the destination seed backup container) is also deployed and reconciled as usual, resulting in the usual `etcd-backup` secret being created.
4. A special “copy” version of the `etcd-main` Etcd resource is deployed to the destination seed. In its `backup` section, this resource contains a `sourceStore` in addition to the usual `store`, which contains the parameters needed to use the source seed backup container, such as its name and the secret created in the previous step:
```yaml
spec:
  backup:
    ...
    store:
      container: 408740b8-6491-415e-98e6-76e92e5956ac
      secretRef:
        name: etcd-backup
    ...
    sourceStore:
      container: d1435fea-cd5e-4d5b-a198-81f4025454ff
      secretRef:
        name: source-etcd-backup
    ...
```
5. The `etcd-druid` in the destination seed reconciles the above resource by deploying an `etcd-copy` Job that contains a single `backup-restore` container. The Job executes the newly introduced `copy` command of `etcd-backup-restore`, which copies the snapshots from the source to the destination backup container.
6. Before starting the copy itself, the `etcd-backup-restore` process in the destination seed checks whether a final full snapshot (a full snapshot marked as `final=true`) exists in the backup container. If such a snapshot is not found, it waits for one to appear in order to proceed. This waiting is bounded by a timeout that should be sufficient for a full snapshot to be taken; after this timeout has elapsed, it proceeds anyway, and the reconciliation flow continues from step 9 (a minimal sketch of this bounded wait follows the step list). As described in Handling Inability to Access the Backup Container below, this is safe to do.
7. The `etcd-backup-restore` process in the source seed detects that the owner ID in the owner DNS record is different from the expected owner ID (because it was updated in step 2) and switches to a special “final snapshot” mode. In this mode the regular snapshotter is stopped, the readiness probe of the main `etcd` container starts returning 503, and one final full snapshot is taken. This snapshot is marked as `final=true` in order to ensure that it’s only taken once, and in order to enable the `etcd-backup-restore` process in the destination seed to find it (see step 6).
Note: While testing our PoC, we noticed that simply making the readiness probe of the main `etcd` container fail doesn’t terminate the existing open connections from `kube-apiserver` to `etcd`. For this to happen, either the `kube-apiserver` or the `etcd` process has to be restarted at least once. Therefore, when the snapshotter is stopped because an ownership change has been detected, the main `etcd` process is killed (using `SIGTERM` to allow graceful termination) to ensure that any open connections from `kube-apiserver` are terminated. For this to work, the two containers must share the process namespace.
8. Since the `kube-apiserver` process in the source seed is no longer able to connect to `etcd`, all shoot control plane controllers (`kube-controller-manager`, `kube-scheduler`, `machine-controller-manager`, etc.) and extension controllers reconciling shoot resources in the source seed that require a connection to the shoot in order to work start failing. All remaining extension controllers are prevented from reconciling shoot resources via the watchdogs mechanism. At this point, the source seed has effectively lost its ownership of the shoot, and it is safe for the destination seed to assume the ownership.
9. After the `etcd-backup-restore` process in the destination seed detects that a final full snapshot exists, it copies all snapshots (or a subset of all snapshots) from the source to the destination backup container. When this is done, the Job finishes successfully, which signals to the reconciliation flow that the snapshots have been copied.
Note: To save time, only the final full snapshot (see steps 6 and 7), or a subset defined by some criteria, could be copied instead of all snapshots.
10. The special “copy” version of the `etcd-main` Etcd resource is deleted from the destination seed, and as a result the `etcd-copy` Job is also deleted by `etcd-druid`.
11. The additional “source” backup entry referencing the source seed backup container is deleted from the Garden cluster and the destination seed. As a result, its corresponding `source-etcd-backup` secret is also deleted from the destination seed.
12. From this point, the reconciliation flow proceeds as already described in GEP-7. This is safe, since the source seed cluster is no longer able to interfere with the shoot.
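As referenced in step 6, here is a minimal sketch of the bounded wait for the final full snapshot; the snapshot listing is abstracted behind a hypothetical interface, since the real logic lives in `etcd-backup-restore`:

```go
package copier

import (
	"context"
	"time"
)

// snapshotLister abstracts checking the source backup container for a full
// snapshot marked final=true.
type snapshotLister interface {
	FinalFullSnapshotExists(ctx context.Context) (bool, error)
}

// waitForFinalSnapshot polls until a final full snapshot appears or the
// timeout elapses. On timeout it reports false so that the copy proceeds
// anyway, as described in step 6; transient errors are simply retried.
func waitForFinalSnapshot(ctx context.Context, lister snapshotLister, timeout, interval time.Duration) bool {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if found, err := lister.FinalFullSnapshotExists(ctx); err == nil && found {
			return true
		}
		select {
		case <-ctx.Done():
			return false // timeout elapsed: proceed with the copy anyway
		case <-ticker.C:
		}
	}
}
```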
Handling Inability to Access the Backup Container
The mechanism described above assumes that the `etcd-backup-restore` process in the source seed is able to access its backup container in order to take snapshots. If this is not the case but an ownership change was detected, the `etcd-backup-restore` process still sets the readiness probe status of the main `etcd` container to 503, and kills the main `etcd` process as described above to ensure that any open connections from `kube-apiserver` are terminated. This effectively deactivates the source seed control plane to ensure that the ownership of the shoot can be passed to a different seed.
Because of this, the `etcd-backup-restore` process in the destination seed responsible for copying the snapshots can avoid waiting forever for a final full snapshot to appear. Instead, after a certain timeout has elapsed, it can proceed with the copying. In this situation, whatever latest snapshot is found in the source backup container will be restored in the destination seed. The shoot is still migrated to a healthy seed, at the cost of losing the etcd data that accumulated between the point in time when the connection to the source backup container was lost and the point in time when the source seed cluster was deactivated.
When the connection to the backup container is restored in the source seed, a final full snapshot will be eventually taken. Depending on the stage of the restoration flow in the destination seed, this snapshot may be copied to the destination seed and restored, or it may simply be ignored since the snapshots have already been copied.
Handling Inability to Resolve the Owner DNS Record
The situation when the owner DNS record cannot be resolved is treated similarly to a failed ownership check: the `etcd-backup-restore` process sets the readiness probe status of the main `etcd` container to 503, and kills the main `etcd` process as described above to ensure that any open connections from `kube-apiserver` are terminated, effectively deactivating the source seed control plane. The final full snapshot is not taken in this case, to ensure that the control plane can be re-activated if needed.
When the owner DNS record can be resolved again, the following two situations are possible:
- If the source seed is still the owner of the shoot, the `etcd-backup-restore` process will set the readiness probe status of the main `etcd` container to 200, so `kube-apiserver` will be able to connect to `etcd`, and the source seed control plane will be activated again.
- If the source seed is no longer the owner of the shoot, the etcd readiness probe will continue to fail, and the source seed control plane will remain inactive. In addition, the final full snapshot will be taken at this time, for the same reason as described in Handling Inability to Access the Backup Container.
Note: We expect that actual DNS outages are extremely unlikely. A more likely reason for an inability to resolve a DNS record could be network issues with the underlying infrastructure. In such cases, the shoot would usually not be usable / reachable anyway, so deactivating its control plane would not cause a worse outage.
Migration Flow Adaptations
Certain changes to the migration flow are needed in order to ensure that it is compatible with the owner election mechanism described above. Instead of taking a full snapshot of the source seed etcd, the flow deletes the owner DNS record by deleting the `owner` DNSRecord resource. This causes the ownership check by `etcd-backup-restore` to fail, and the final full snapshot to eventually be taken, so the migration flow waits for a final full snapshot to appear as the last step before deleting the shoot namespace in the source seed. This ensures that the reconciliation flow described above will find a final full snapshot waiting to be copied at step 6.
Checking for the final full snapshot is performed by calling the already existing `etcd-backup-restore` endpoint `snapshot/latest`. This is possible, since the `backup-restore` container is always running at this point.
After the final full snapshot has been taken, the readiness probe of the main `etcd` container starts failing, which means that if the migration flow is retried due to an error, it must skip the step that waits for `etcd-main` to become ready. To determine whether this is the case, a check whether the final full snapshot has been taken is performed by calling the same `etcd-backup-restore` endpoint, e.g. `snapshot/latest`. This is possible if the `etcd-main` Etcd resource exists with non-zero replicas. Otherwise:
- If the resource doesn’t exist, it must have already been deleted, so the final full snapshot must have already been taken.
- If it exists with zero replicas, the shoot must be hibernated, and the migration flow must have never been executed (since it scales up etcd as one of its first steps), so the final full snapshot must not have been taken yet.
Extension Controller Watchdogs
Some extension controllers will stop reconciling shoot resources after the connection to the shoot’s `kube-apiserver` is lost. Others, most notably the infrastructure controller, will not be affected. Even though new shoot reconciliations won’t be performed by `gardenlet`, such extension controllers might be stuck in a retry loop triggered by a previous reconciliation, which may cause them to reconcile their resources after `gardenlet` has already stopped reconciling the shoot. In addition, a reconciliation started when the seed still owned the shoot might take some time and therefore might still be running after the ownership has changed. To ensure that the source seed is completely deactivated, an additional safety mechanism is needed.
This mechanism should handle the following interesting cases:
- `gardenlet` cannot connect to the Garden `kube-apiserver`. In this case it cannot fetch shoots and therefore does not know whether control plane migration has been triggered. Even though `gardenlet` will not trigger new reconciliations, extension controllers could still attempt to reconcile their resources if they are stuck in a retry loop from a previous reconciliation, and already running reconciliations will not be stopped.
- `gardenlet` cannot connect to the seed’s `kube-apiserver`. In this case `gardenlet` knows whether migration has been triggered, but it will not start shoot migration or reconciliation, as it will first check the seed conditions and try to update the `Cluster` resource, both of which will fail. Extension controllers could still be able to connect to the seed’s `kube-apiserver` (if they are not running where `gardenlet` is running), and similarly to the previous case, they could still attempt to reconcile their resources.
- The seed components (`etcd-druid`, extension controllers, etc.) cannot connect to the seed’s `kube-apiserver`. In this case extension controllers would not be able to reconcile their resources, as they cannot fetch them from the seed’s `kube-apiserver`. When the connection to the `kube-apiserver` comes back, the controllers might be stuck in a retry loop from a previous reconciliation, or the resources could still be annotated with `gardener.cloud/operation=reconcile`. This could lead to a race condition depending on who manages to `update` or `get` the resources first. If `gardenlet` manages to update the resources before they are read by the extension controllers, they would be properly updated with `gardener.cloud/operation=migrate`. Otherwise, they would be reconciled as usual.
Note: A previous revision of this document proposed using “cluster leases” as such an additional safety mechanism. With the introduction of owner DNS records cluster leases are no longer needed.
The safety mechanism is based on extension controller watchdogs. These are simply additional goroutines that are started when a reconciliation is started by an extension controller. These goroutines perform an ownership check on a regular basis using the owner DNS record, similar to the check performed by the `etcd-backup-restore` process described above. If the check fails, the watchdog cancels the reconciliation context, which immediately aborts the reconciliation.
Note: The `dns-external` extension controller is the only extension controller that neither needs the shoot’s `kube-apiserver` nor uses the watchdog mechanism described here. Therefore, this controller will continue reconciling `DNSEntry` resources even after the source seed has lost the ownership of the shoot. With the PoC, we manually delete the `DNSOwner` resources from the source seed cluster to prevent this from happening. Eventually, the `dns-external` controller should be adapted to use the owner DNS records to ensure that it disables itself after the seed has lost the ownership of the shoot. Changes in this direction have already been agreed upon and relevant PRs proposed.
12 - Bastion Management and SSH Key Pair Rotation
GEP-15: Bastion Management and SSH Key Pair Rotation
Table of Contents
Motivation
`gardenctl` (v1) has the functionality to set up `ssh` sessions to the targeted shoot cluster (nodes). To this end, infrastructure resources like VMs, public IPs, firewall rules, etc. have to be created. `gardenctl` will clean up these resources after the termination of the `ssh` session (or rather, when the operator is done with her work). However, there were issues in the past where these infrastructure resources were not properly cleaned up afterwards, e.g. due to some error (with no retries either). Hence, the proposal is to have a dedicated controller (for each infrastructure) that manages the infrastructure resources and their cleanup. The current `gardenctl` also re-uses the `ssh` node credentials for the bastion host. While that’s possible, it would be safer to use personal or generated `ssh` key pairs to access the bastion host.
The static shoot-specific `ssh` key pair should be rotated regularly, e.g. once in the maintenance time window. This also means that we cannot create the node VMs anymore with infrastructure public keys, as these cannot be revoked or rotated (e.g. in AWS) without terminating the VM itself.
Changes to the `Bastion` resource should only be allowed for controllers on seeds that are responsible for it. This cannot be restricted when using custom resources.
The proposal, as outlined below, suggests implementing the necessary changes in the Gardener core components and adapting the SeedAuthorizer to consider `Bastion` resources that the Gardener API Server serves.
Goals
- Operators can request and will be granted time-limited `ssh` access to shoot cluster nodes via bastion hosts.
- To that end, requestors must present their public `ssh` key, and only this will be installed into `sshd` on the bastion hosts.
- The bastion hosts will be firewalled and ingress traffic will be permitted only from the client IP of the requestor. Except for traffic on port 22 to the cluster worker nodes, no egress from the bastion is allowed.
- The actual node `ssh` private key (resp. key pair) will be rotated by Gardener, and access to the nodes is only possible with this constantly rotated key pair, not with the personal one that is used only for the bastion host.
- Bastion host and access are granted only for the extent of this operator request (of course, multiple `ssh` sessions are possible, in parallel or repeatedly, but after “the time is up”, access is no longer possible).
- By these means (personal public key and allow-listed client IP), nobody else can use (a.k.a. impersonate) the requestor (not even other operators).
- Necessary infrastructure resources for `ssh` access (such as VMs, public IPs, firewall rules, etc.) are automatically created and also terminated after usage, but at the latest after the above-mentioned time span is up.
Non-Goals
- Node-specific access
- Auditability on operating system level (not only auditing the `ssh` login, but everything that is done on a node and other respective resources, e.g. by using dedicated operating system users)
- Reuse of temporarily created necessary infrastructure resources by different users
Proposal
Involved Components
The following is a list of involved components that either need to be newly introduced or extended, if already existing:
- Gardener API Server (`GAPI`)
- `gardenlet`
  - Deploys the `Bastion` CRD under the `extensions.gardener.cloud` API group to the Seed, see resource example below
  - Similar to `BackupBucket` or `BackupEntry` resources, the `gardenlet` watches the `Bastion` resource in the garden cluster and creates a seed-local `Bastion` resource, on which the provider-specific bastion controller acts
- `gardenctlv2` (or any other client)
  - Creates the `Bastion` resource in the garden cluster
  - Establishes an `ssh` connection to a shoot node, using a bastion host as proxy
  - Heartbeats / keeps alive the `Bastion` resource during the `ssh` connection
- Gardener extension provider
- Gardener Controller Manager (`GCM`)
  - `Bastion` heartbeat controller
    - Cleans up the `Bastion` resource on missing heartbeat
    - Is configured with a `maxLifetime` for the `Bastion` resource
- Gardener (RBAC)
SSH Flow
- Users should only get the RBAC permission to `create` / `update` `Bastion` resources for a namespace if they should be allowed to `ssh` onto the shoot nodes in this namespace. A project member with the `admin` role will have these permissions.
- User / `gardenctlv2` creates a `Bastion` resource in the garden cluster (see resource example below):
  - First, gardenctl would figure out the public IP of the user’s machine, either by calling an external service (gardenctl (v1) uses https://github.com/gardener/gardenctl/blob/master/pkg/cmd/miscellaneous.go#L226) or by calling a binary that prints the public IP(s) to stdout. The binary should be configurable. The result is set under `spec.ingress[].ipBlock.cidr`.
  - Creates a new `ssh` key pair. The newly created key pair is used only once for each bastion host, so it has a 1:1 relationship to it. It is cleaned up once it is no longer used, e.g. when the `Bastion` resource is deleted.
  - The public `ssh` key is set under `spec.sshPublicKey`.
  - The targeted shoot is set under `spec.shootRef`.
- GAPI Admission Plugin for the `Bastion` resource in the garden cluster:
  - on creation, sets `metadata.annotations["gardener.cloud/created-by"]` according to the user that created the resource
  - when `gardener.cloud/operation: keepalive` is set, it will be removed by GAPI from the annotations and `status.lastHeartbeatTimestamp` will be set to the current timestamp. The `status.expirationTimestamp` will be calculated by taking the last heartbeat timestamp and adding x minutes (configurable, default 60 minutes).
  - validates that only the creator of the bastion (see `gardener.cloud/created-by` annotation) can update `spec.ingress`
  - validates that a Bastion can only be created for a Shoot if that Shoot is already assigned to a Seed
  - sets `spec.seedName` and `spec.providerType` based on the `spec.shootRef`
- `gardenlet`:
  - Watches `Bastion` resources for its own seed under the api group `operations.gardener.cloud` in the garden cluster
  - Creates a `Bastion` custom resource under the api group `extensions.gardener.cloud/v1alpha1` in the seed cluster
- Gardener extension provider / Bastion Controller on Seed:
  - With its own `Bastion` Custom Resource Definition in the seed under the api group `extensions.gardener.cloud/v1alpha1`
  - Watches `Bastion` custom resources that are created by the `gardenlet` in the seed
  - The controller reads the `cloudprovider` credentials from the seed-shoot namespace
  - Deploys infrastructure resources:
    - Bastion VM, using user data from `spec.userData`
    - attaches public IP, creates security group, firewall rules, etc.
  - Updates the status of the `Bastion` resource:
    - with the bastion IP under `status.ingress.ip` or hostname under `status.ingress.hostname`
    - updates `status.lastOperation` with the status of the last reconcile operation
- `gardenlet`:
  - Syncs back the `status.ingress` and `status.conditions` of the `Bastion` resource in the seed to the garden cluster in case they changed
- `gardenctl`:
  - initiates the `ssh` session once `status.conditions['BastionReady']` is true on the `Bastion` resource in the garden cluster:
    - locates the private `ssh` key matching `spec.sshPublicKey`, which was configured beforehand by the user
    - reads the bastion IP (`status.ingress.ip`) or hostname (`status.ingress.hostname`)
    - reads the private key from the `ssh` key pair for the shoot node
    - opens the `ssh` connection to the bastion and from there to the respective shoot node
  - runs a heartbeat in parallel as long as the `ssh` session is open, by annotating the `Bastion` resource with `gardener.cloud/operation: keepalive` (see the sketch after this flow)
- `GCM`:
  - Once `status.expirationTimestamp` is reached, the `Bastion` will be marked for deletion
- `gardenlet`:
  - Once the `Bastion` resource in the garden cluster is marked for deletion, it marks the `Bastion` resource in the seed for deletion
- Gardener extension provider / Bastion Controller on Seed:
  - all created resources will be cleaned up
  - On success, removes the finalizer on the `Bastion` resource in the seed
- `gardenlet`:
  - removes the finalizer on the `Bastion` resource in the garden cluster
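A minimal sketch of the client-side keepalive loop mentioned in the flow above; the annotator interface is a hypothetical stand-in for a garden cluster client that patches the Bastion resource:

```go
package bastion

import (
	"context"
	"time"
)

// annotator abstracts setting an annotation on the Bastion resource in the
// garden cluster, e.g. via a merge patch issued by a Gardener client.
type annotator interface {
	Annotate(ctx context.Context, key, value string) error
}

// keepAlive annotates the Bastion with gardener.cloud/operation=keepalive
// at the given interval for as long as the ssh session context is alive, so
// that GCM keeps extending status.expirationTimestamp.
func keepAlive(ctx context.Context, a annotator, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // ssh session ended: stop heartbeating, let the Bastion expire
		case <-ticker.C:
			_ = a.Annotate(ctx, "gardener.cloud/operation", "keepalive") // errors are retried on the next tick
		}
	}
}
```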
Resource Example
`Bastion` resource in the garden cluster:
```yaml
apiVersion: operations.gardener.cloud/v1alpha1
kind: Bastion
metadata:
  generateName: cli-
  name: cli-abcdef
  namespace: garden-myproject
  annotations:
    gardener.cloud/created-by: foo # immutable, set by the GAPI Admission Plugin
    # gardener.cloud/operation: keepalive # this annotation is removed by the GAPI and the status.lastHeartbeatTimestamp and status.expirationTimestamp will be updated accordingly
spec:
  shootRef: # namespace cannot be set / it's the same as .metadata.namespace
    name: my-cluster # immutable
  # the following fields are set by the GAPI
  seedName: aws-eu2
  providerType: aws
  sshPublicKey: c3NoLXJzYSAuLi4K # immutable, public `ssh` key of the user
  ingress: # can only be updated by the creator of the bastion
  - ipBlock:
      cidr: 1.2.3.4/32 # public IP of the user. CIDR is a string representing the IP Block. Valid examples are "192.168.1.1/24" or "2001:db9::/64"
status:
  observedGeneration: 1
  # the following fields are managed by the controller in the seed and synced by gardenlet
  ingress: # IP or hostname of the bastion
    ip: 1.2.3.5
    # hostname: foo.bar
  conditions:
  - type: BastionReady # when the `status` is true of condition type `BastionReady`, the client can initiate the `ssh` connection
    status: 'True'
    lastTransitionTime: "2021-03-19T11:59:00Z"
    lastUpdateTime: "2021-03-19T11:59:00Z"
    reason: BastionReady
    message: Bastion for the cluster is ready.
  # the following fields are only set by the GAPI
  lastHeartbeatTimestamp: "2021-03-19T11:58:00Z" # will be set when setting the annotation gardener.cloud/operation: keepalive
  expirationTimestamp: "2021-03-19T12:58:00Z" # extended on each keepalive
```
`Bastion` custom resource in the seed cluster:
```yaml
apiVersion: extensions.gardener.cloud/v1alpha1
kind: Bastion
metadata:
  name: cli-abcdef
  namespace: shoot--myproject--mycluster
spec:
  userData: |- # this is normally base64-encoded, but decoded for the example. Contains spec.sshPublicKey from the Bastion resource in the garden cluster
    #!/bin/bash
    # create user
    # add ssh public key to authorized_keys
    # ...
  ingress:
  - ipBlock:
      cidr: 1.2.3.4/32
  type: aws # from extensionsv1alpha1.DefaultSpec
status:
  observedGeneration: 1
  ingress:
    ip: 1.2.3.5
    # hostname: foo.bar
  conditions:
  - type: BastionReady
    status: 'True'
    lastTransitionTime: "2021-03-19T11:59:00Z"
    lastUpdateTime: "2021-03-19T11:59:00Z"
    reason: BastionReady
    message: Bastion for the cluster is ready.
```
SSH Key Pair Rotation
Currently, the `ssh` key pair for the shoot nodes is created once during shoot cluster creation. These key pairs should be rotated on a regular basis.
Rotation Proposal
- `gardeneruser` original user data component:
  - The `gardeneruser` create script should be changed into a reconcile script and renamed accordingly. It needs to be adapted so that the `authorized_keys` file will be updated / overwritten with the current and old `ssh` public key from the cloud-config user data.
- Rotation trigger:
  - Once in the maintenance time window
  - On demand, by annotating the shoot with `gardener.cloud/operation: rotate-ssh-keypair`
- On rotation trigger:
  - `gardenlet`:
    - Prerequisite of `ssh` key pair rotation: all nodes of all the worker pools have successfully applied the desired version of their cloud-config user data
    - Creates or updates the secret `ssh-keypair.old` with the content of `ssh-keypair` in the seed-shoot namespace. The old private key can be used by clients as a fallback, in case the new `ssh` public key is not yet applied on the node
    - Generates a new `ssh-keypair` secret
    - The `OperatingSystemConfig` needs to be re-generated and deployed with the new and the old `ssh` public key
  - As usual (for more details, see here):
    - Once the `cloud-config-<X>` secret in the `kube-system` namespace of the shoot cluster is updated, it will be picked up by the `downloader` script (checks every 30s for updates)
    - The `downloader` runs the “execution” script from the `cloud-config-<X>` secret
    - The “execution” script also includes the original user data script, which it writes to `PATH_CLOUDCONFIG`, compares against the previous cloud config, and runs in case it has changed
    - Running the original user data script will also run the `gardeneruser` component, where the `authorized_keys` file will be updated
    - After the most recent cloud-config user data was applied, the “execution” script annotates the node with `checksum/cloud-config-data: <cloud-config-checksum>` to indicate success
Limitations
Each operating system has its own default user (e.g. `core`, `admin`, `ec2-user`, etc.). These users get their SSH keys during VM creation (however, there is a different handling on Google Cloud Platform, as stated below). These keys currently do not get rotated and are not removed from the `authorized_keys` file. This means that the initial `ssh` key will still be valid for the default operating system user.
On Google Cloud Platform, the VMs do not have any static users (i.e. no `gardener` user), and there is an agent on the nodes that syncs the users with their SSH keypairs from the GCP IAM service.
13 - Dynamic kubeconfig generation for Shoot clusters
GEP-16: Dynamic kubeconfig generation for Shoot clusters
Table of Contents
Summary
This GEP introduces a new `Shoot` subresource called `AdminKubeconfigRequest`, allowing users to dynamically generate a short-lived `kubeconfig` that can be used to access the `Shoot` cluster as `cluster-admin`.
Motivation
Today, when access to the created `Shoot` clusters is needed, a `kubeconfig` with static token credentials is used. This static token is in the `system:masters` group, granting it `cluster-admin` privileges. The `kubeconfig` is generated when the cluster is reconciled, stored in the `ShootState`, and replicated in the `Project`’s namespace in a `Secret`. End-users can fetch the secret and use the `kubeconfig` inside it.
There are several problems with this approach:
- The token in the `kubeconfig` does not have any expiration, so end-users have to request a `kubeconfig` credential rotation if they want to revoke the token.
- There is no user identity in the token. E.g. if user `Joe` gets the `kubeconfig` from the `Secret`, the user in that token would be `system:cluster-admin` and not `Joe` when accessing the `Shoot` cluster with it. This makes auditing events in the cluster almost impossible.
Goals
- Add a `Shoot` subresource called `adminkubeconfig` that would produce a `kubeconfig` used to access that `Shoot` cluster.
- The `kubeconfig` is not stored in the API Server, but generated for each request.
- In the `AdminKubeconfigRequest` sent to that subresource, end-users can specify the expiration time of the credential.
- The identity (user) in the Gardener cluster would be part of the identity (x509 client certificate). E.g. if `Joe` authenticates against the Gardener API server, the generated certificate for `Shoot` authentication would have the following subject:
  - Common Name: `Joe`
  - Organisation: `system:masters`
- The maximum validity of the certificate can be enforced by setting a flag on the `gardener-apiserver`.
- Deprecate and remove the old `{shoot-name}.kubeconfig` secrets in each `Project` namespace.
Non-Goals
- Generate `OpenID Connect` kubeconfigs
Proposal
The `gardener-apiserver` would serve a new `shoots/adminkubeconfig` resource. It accepts only `CREATE` calls carrying an `AdminKubeconfigRequest`. An `AdminKubeconfigRequest` would have the following structure:
```yaml
apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
  expirationSeconds: 3600
```
Where `expirationSeconds` is the validity of the certificate in seconds; in this case, it would be 1 hour. The maximum validity of an `AdminKubeconfigRequest` is configured by the `--shoot-admin-kubeconfig-max-expiration` flag in the `gardener-apiserver`.
When such a request is received, the API server would find the `ShootState` associated with that cluster and generate a `kubeconfig`. The x509 client certificate would be signed by the `Shoot` cluster’s CA, and the user in the subject’s common name would be taken from the `User.Info` used to make the request.
apiVersion: authentication.gardener.cloud/v1alpha1
kind: AdminKubeconfigRequest
spec:
expirationSeconds: 3600
status:
expirationTimestamp: "2021-02-22T09:06:51Z"
kubeConfig: # this is normally base64-encoded, but decoded for the example
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: LS0tLS1....
server: https://api.shoot-cluster
name: shoot-cluster-a
contexts:
- context:
cluster: shoot-cluster-a
user: shoot-cluster-a
name: shoot-cluster-a
current-context: shoot-cluster-a
kind: Config
preferences: {}
users:
- name: shoot-cluster-a
user:
client-certificate-data: LS0tLS1CRUd...
client-key-data: LS0tLS1CRUd...
A new feature gate called `AdminKubeconfigRequest` enables the above-mentioned API in the `gardener-apiserver`. The old `{shoot-name}.kubeconfig` is kept, but it is deprecated and will be removed in the future.
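For illustration, enabling the feature and capping credential validity could be wired up as follows (a minimal sketch of a hypothetical gardener-apiserver pod spec excerpt; the flag values are examples, not recommendations):
containers:
- name: gardener-apiserver
  args:
  # enable the new AdminKubeconfigRequest API (assumed feature gate syntax)
  - --feature-gates=AdminKubeconfigRequest=true
  # cap the requestable credential validity, as described above
  - --shoot-admin-kubeconfig-max-expiration=24h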
In order to get the server’s address used in the `kubeconfig`, the Shoot’s `status` should be updated with new entries:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
name: crazy-botany
namespace: garden-dev
spec: {}
status:
advertisedAddresses:
- name: external
url: https://api.shoot-cluster.external.foo
- name: internal
url: https://api.shoot-cluster.internal.foo
- name: ip
url: https://1.2.3.4
This is needed because the Gardener API server might not know which IP address the API server is advertised on (e.g. when DNS is disabled).
If there are multiple entries, each is added as a separate `cluster` in the `kubeconfig`, and a `context` with the same name is added as well. The current context is selected as the first entry in the `advertisedAddresses` list (`.status.advertisedAddresses[0]`).
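To illustrate the multi-address case: assuming the three advertised addresses from the example above, the generated kubeconfig might look roughly like this (the exact cluster/context naming is an implementation detail and illustrative only):
apiVersion: v1
kind: Config
current-context: shoot-cluster-a-external # first entry in .status.advertisedAddresses
clusters:
- name: shoot-cluster-a-external
  cluster:
    server: https://api.shoot-cluster.external.foo
- name: shoot-cluster-a-internal
  cluster:
    server: https://api.shoot-cluster.internal.foo
- name: shoot-cluster-a-ip
  cluster:
    server: https://1.2.3.4
contexts:
- name: shoot-cluster-a-external
  context:
    cluster: shoot-cluster-a-external
    user: shoot-cluster-a
- name: shoot-cluster-a-internal
  context:
    cluster: shoot-cluster-a-internal
    user: shoot-cluster-a
- name: shoot-cluster-a-ip
  context:
    cluster: shoot-cluster-a-ip
    user: shoot-cluster-a
users:
- name: shoot-cluster-a
  user:
    client-certificate-data: LS0tLS1CRUd...
    client-key-data: LS0tLS1CRUd...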
Alternatives
GEP-NNNN: Your short, descriptive title
Table of Contents
Summary
Motivation
Goals
Non-Goals
Proposal
Alternatives
New Core Gardener Cloud APIs
Summary
In GEP-1 we proposed how to (re-)design Gardener to allow providers to maintain their provider-specific knowledge out of the core tree.
Meanwhile, we have progressed a lot and are about to remove the `CloudBotanist` interface entirely.
The only missing aspect that will allow providers to really maintain their code out of the core is the design of new APIs.
This proposal describes how the new `Shoot`, `Seed`, etc. APIs will be re-designed to cope with the changes made for extensibility.
We already have the new `core.gardener.cloud/v1beta1` API group, which will become the new default soon.
Motivation
We want to allow providers to individually maintain their specific knowledge without the necessity to touch the Gardener core code. In order to achieve this, we have to provide proper APIs.
Goals
- Provide proper APIs to allow providers to maintain their code outside of the core codebase.
- Do not complicate the APIs for end-users, so that they can still easily create, delete, and maintain shoot clusters.
Non-Goals
- Let’s try to not split everything up into too many different resources. Instead, let’s try to keep all relevant information in the same resources when possible/appropriate.
Proposal
In GEP-1 we already proposed a first version of the new `CloudProfile` and `Shoot` resources.
In order to deprecate the existing/old `garden.sapcloud.io/v1beta1` API group (and eventually remove it), we should move all existing resources to the new `core.gardener.cloud/v1beta1` API group.
`CloudProfile` resource
apiVersion: core.gardener.cloud/v1beta1
kind: CloudProfile
metadata:
name: cloudprofile1
spec:
type: <some-provider-name> # {aws,azure,gcp,...}
# Optional list of labels on `Seed` resources that marks those seeds whose shoots may use this provider profile.
# An empty list means that all seeds of the same provider type are supported.
# This is useful for environments that are of the same type (like openstack) but may have different "instances"/landscapes.
# seedSelector:
# matchLabels:
# foo: bar
kubernetes:
versions:
- version: 1.12.1
- version: 1.11.0
- version: 1.10.6
- version: 1.10.5
expirationDate: 2020-04-05T01:02:03Z # optional
machineImages:
- name: coreos
versions:
- version: 2023.5.0
- version: 1967.5.0
expirationDate: 2020-04-05T08:00:00Z
- name: ubuntu
versions:
- version: 18.04.201906170
machineTypes:
- name: m5.large
cpu: "2"
gpu: "0"
memory: 8Gi
# storage: 20Gi # optional (not needed in every environment, may only be specified if no volumeTypes have been specified)
usable: true
volumeTypes: # optional (not needed in every environment, may only be specified if no machineType has a `storage` field)
- name: gp2
class: standard
- name: io1
class: premium
regions:
- name: europe-central-1
zones: # optional (not needed in every environment)
- name: europe-central-1a
- name: europe-central-1b
- name: europe-central-1c
# unavailableMachineTypes: # optional, list of machine types defined above that are not available in this zone
# - m5.large
# unavailableVolumeTypes: # optional, list of volume types defined above that are not available in this zone
# - io1
# CA bundle that will be installed onto every shoot machine that is using this provider profile.
# caBundle: |
# -----BEGIN CERTIFICATE-----
# ...
# -----END CERTIFICATE-----
providerConfig:
<some-provider-specific-cloudprofile-config>
# We don't have concrete examples for every existing provider yet, but these are the proposals:
#
# Example for Alicloud:
#
# apiVersion: alicloud.provider.extensions.gardener.cloud/v1alpha1
# kind: CloudProfileConfig
# machineImages:
# - name: coreos
# version: 2023.5.0
# id: coreos_2023_4_0_64_30G_alibase_20190319.vhd
#
#
# Example for AWS:
#
# apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
# kind: CloudProfileConfig
# machineImages:
# - name: coreos
# version: 1967.5.0
# regions:
# - name: europe-central-1
# ami: ami-0f46c2ed46d8157aa
#
#
# Example for Azure:
#
# apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
# kind: CloudProfileConfig
# machineImages:
# - name: coreos
# version: 1967.5.0
# publisher: CoreOS
# offer: CoreOS
# sku: Stable
# countFaultDomains:
# - region: westeurope
# count: 2
# countUpdateDomains:
# - region: westeurope
# count: 5
#
#
# Example for GCP:
#
# apiVersion: gcp.provider.extensions.gardener.cloud/v1alpha1
# kind: CloudProfileConfig
# machineImages:
# - name: coreos
# version: 2023.5.0
# image: projects/coreos-cloud/global/images/coreos-stable-2023-5-0-v20190312
#
#
# Example for OpenStack:
#
# apiVersion: openstack.provider.extensions.gardener.cloud/v1alpha1
# kind: CloudProfileConfig
# machineImages:
# - name: coreos
# version: 2023.5.0
# image: coreos-2023.5.0
# keyStoneURL: https://url-to-keystone/v3/
# dnsServers:
# - 10.10.10.10
# - 10.10.10.11
# dhcpDomain: foo.bar
# requestTimeout: 30s
# constraints:
# loadBalancerProviders:
# - name: haproxy
# floatingPools:
# - name: fip1
# loadBalancerClasses:
# - name: class1
# floatingSubnetID: 04eed401-f85f-4610-8041-c4835c4beea6
# floatingNetworkID: 23949a30-1cdd-4732-ba47-d03ced950acc
# subnetID: ac46c204-9d0d-4a4c-a90d-afefe40cfc35
#
#
# Example for Packet:
#
# apiVersion: packet.provider.extensions.gardener.cloud/v1alpha1
# kind: CloudProfileConfig
# machineImages:
# - name: coreos
# version: 2079.3.0
# id: d61c3912-8422-4daf-835e-854efa0062e4
`Seed` resource
Special note: The proposal contains fields that do not yet exist in the current `garden.sapcloud.io/v1beta1.Seed` resource, but they should be implemented (open issues that require them are linked).
apiVersion: v1
kind: Secret
metadata:
name: seed-secret
namespace: garden
type: Opaque
data:
kubeconfig: base64(kubeconfig-for-seed-cluster)
---
apiVersion: v1
kind: Secret
metadata:
name: backup-secret
namespace: garden
type: Opaque
data:
# <some-provider-specific data keys>
# https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
# https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
# https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13
---
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
name: seed1
spec:
provider:
type: <some-provider-name> # {aws,azure,gcp,...}
region: europe-central-1
secretRef:
name: seed-secret
namespace: garden
# Motivation for DNS section: https://github.com/gardener/gardener/issues/201.
dns:
provider: <some-provider-name> # {aws-route53, google-clouddns, ...}
secretName: my-dns-secret # must be in `garden` namespace
ingressDomain: seed1.dev.example.com
volume: # optional (introduced to get rid of `persistentvolume.garden.sapcloud.io/minimumSize` and `persistentvolume.garden.sapcloud.io/provider` annotations)
minimumSize: 20Gi
providers:
- name: foo
purpose: etcd-main
networks: # Seed and Shoot networks must be disjunct
nodes: 10.240.0.0/16
pods: 10.241.128.0/17
services: 10.241.0.0/17
# Shoot default networks, see also https://github.com/gardener/gardener/issues/895.
# shootDefaults:
# pods: 100.96.0.0/11
# services: 100.64.0.0/13
taints:
- key: seed.gardener.cloud/protected
- key: seed.gardener.cloud/invisible
blockCIDRs:
- 169.254.169.254/32
backup: # See https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.
type: <some-provider-name> # {aws,azure,gcp,...}
# region: eu-west-1
secretRef:
name: backup-secret
namespace: garden
status:
conditions:
- lastTransitionTime: "2020-07-14T19:16:42Z"
lastUpdateTime: "2020-07-14T19:18:17Z"
message: all checks passed
reason: Passed
status: "True"
type: Available
gardener:
id: 4c9832b3823ee6784064877d3eb10c189fc26e98a1286c0d8a5bc82169ed702c
name: gardener-controller-manager-7fhn9ikan73n-7jhka
version: 1.0.0
observedGeneration: 1
`Project` resource
Special note: The `members` and `viewers` fields of the `garden.sapcloud.io/v1beta1.Project` resource will be merged into one `members` field.
Every member will have a role that is either `admin` or `viewer`.
This will allow us to add new roles without changing the API.
apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
name: example
spec:
description: Example project
members:
- apiGroup: rbac.authorization.k8s.io
kind: User
name: john.doe@example.com
role: admin
- apiGroup: rbac.authorization.k8s.io
kind: User
name: joe.doe@example.com
role: viewer
namespace: garden-example
owner:
apiGroup: rbac.authorization.k8s.io
kind: User
name: john.doe@example.com
purpose: Example project
status:
observedGeneration: 1
phase: Ready
`SecretBinding` resource
Special note: No modifications needed compared to the current `garden.sapcloud.io/v1beta1.SecretBinding` resource.
apiVersion: v1
kind: Secret
metadata:
name: secret1
namespace: garden-core
type: Opaque
data:
# <some-provider-specific data keys>
# https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-infrastructure.yaml#L14-L15
# https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-infrastructure.yaml#L14-L17
# https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-infrastructure.yaml#L14
# https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-infrastructure.yaml#L15-L18
# https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-infrastructure.yaml#L14-L15
#
# If you use your own domain (not the default domain of your landscape) then you have to add additional keys to this secret.
# The reason is that the DNS management is not part of the Gardener core code base but externalized, hence, it might use other
# key names than Gardener itself.
# The actual values here depend on the DNS extension that is installed to your landscape.
# For example, check out https://github.com/gardener/external-dns-management and find a lot of example secret manifests here:
# https://github.com/gardener/external-dns-management/tree/master/examples
---
apiVersion: core.gardener.cloud/v1beta1
kind: SecretBinding
metadata:
name: secretbinding1
namespace: garden-core
secretRef:
name: secret1
# namespace: namespace-other-than-'garden-core' // optional
quotas: []
# - name: quota-1
# # namespace: namespace-other-than-'garden-core' // optional
`Quota` resource
Special note: No modifications needed compared to the current `garden.sapcloud.io/v1beta1.Quota` resource.
apiVersion: core.gardener.cloud/v1beta1
kind: Quota
metadata:
name: trial-quota
namespace: garden-trial
spec:
scope:
apiGroup: core.gardener.cloud
kind: Project
# clusterLifetimeDays: 14
metrics:
cpu: "200"
gpu: "20"
memory: 4000Gi
storage.standard: 8000Gi
storage.premium: 2000Gi
loadbalancer: "100"
`BackupBucket` resource
Special note: This new resource is cluster-scoped.
# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.
apiVersion: v1
kind: Secret
metadata:
name: backup-operator-provider
namespace: backup-garden
type: Opaque
data:
# <some-provider-specific data keys>
# https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
# https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-backupbucket.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
# https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13
---
apiVersion: core.gardener.cloud/v1beta1
kind: BackupBucket
metadata:
name: <seed-provider-type>-<region>-<seed-uid>
ownerReferences:
- kind: Seed
name: seed1
spec:
provider:
type: <some-provider-name> # {aws,azure,gcp,...}
region: europe-central-1
seed: seed1
secretRef:
name: backup-operator-provider
namespace: backup-garden
status:
lastOperation:
description: Backup bucket has been successfully reconciled.
lastUpdateTime: '2020-04-13T14:34:27Z'
progress: 100
state: Succeeded
type: Reconcile
observedGeneration: 1
`BackupEntry` resource
Special note: This new resource is cluster-scoped.
# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/02-backupinfra.md.
apiVersion: v1
kind: Secret
metadata:
name: backup-operator-provider
namespace: backup-garden
type: Opaque
data:
# <some-provider-specific data keys>
# https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-backupbucket.yaml#L9-L11
# https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-backupbucket.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-backupbucket.yaml#L9-L10
# https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-backupbucket.yaml#L9
# https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-backupbucket.yaml#L9-L13
---
apiVersion: core.gardener.cloud/v1beta1
kind: BackupEntry
metadata:
name: shoot--core--crazy-botany--3ef42
namespace: garden-core
ownerReferences:
- apiVersion: core.gardener.cloud/v1beta1
blockOwnerDeletion: false
controller: true
kind: Shoot
name: crazy-botany
uid: 19a9538b-5058-11e9-b5a6-5e696cab3bc8
spec:
bucketName: cloudprofile1-random[:5]
seed: seed1
status:
lastOperation:
description: Backup entry has been successfully reconciled.
lastUpdateTime: '2020-04-13T14:34:27Z'
progress: 100
state: Succeeded
type: Reconcile
observedGeneration: 1
`Shoot` resource
Special notes:
- The `kubelet` configuration in the worker pools may override the default `.spec.kubernetes.kubelet` configuration (which applies to all worker pools if not overridden).
- The remaining control plane configuration has moved to the new `.spec.provider.controlplane` section.
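To make the first note concrete, here is a minimal sketch showing a global `kubelet` default and a pool-level override (only the relevant fields are shown; the values are examples):
spec:
  kubernetes:
    kubelet:
      cpuCFSQuota: true # default for all worker pools
  provider:
    workers:
    - name: cpu-worker
      kubernetes:
        kubelet:
          cpuCFSQuota: false # overrides the global default for this pool only
The full proposed resource: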
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
name: crazy-botany
namespace: garden-core
spec:
secretBindingName: secretbinding1
cloudProfileName: cloudprofile1
region: europe-central-1
# seedName: seed1
provider:
type: <some-provider-name> # {aws,azure,gcp,...}
infrastructureConfig:
<some-provider-specific-infrastructure-config>
# https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-infrastructure.yaml#L56-L64
# https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-infrastructure.yaml#L43-L53
# https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-infrastructure.yaml#L63-L71
# https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-infrastructure.yaml#L53-L57
# https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-infrastructure.yaml#L56-L64
# https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-infrastructure.yaml#L48-L49
controlPlaneConfig:
<some-provider-specific-controlplane-config>
# https://github.com/gardener/gardener-extension-provider-alicloud/blob/master/example/30-controlplane.yaml#L60-L65
# https://github.com/gardener/gardener-extension-provider-aws/blob/master/example/30-controlplane.yaml#L60-L64
# https://github.com/gardener/gardener-extension-provider-azure/blob/master/example/30-controlplane.yaml#L61-L66
# https://github.com/gardener/gardener-extension-provider-gcp/blob/master/example/30-controlplane.yaml#L59-L64
# https://github.com/gardener/gardener-extension-provider-openstack/blob/master/example/30-controlplane.yaml#L64-L70
# https://github.com/gardener/gardener-extension-provider-packet/blob/master/example/30-controlplane.yaml#L60-L61
workers:
- name: cpu-worker
minimum: 3
maximum: 5
# maxSurge: 1
# maxUnavailable: 0
machine:
type: m5.large
image:
name: <some-os-name>
version: <some-os-version>
# providerConfig:
# <some-os-specific-configuration>
volume:
type: gp2
size: 20Gi
# providerConfig:
# <some-provider-specific-worker-config>
# labels:
# key: value
# annotations:
# key: value
# taints: # See also https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
# - key: foo
# value: bar
# effect: NoSchedule
# caBundle: <some-ca-bundle-to-be-installed-to-all-nodes-in-this-pool>
# kubernetes:
# kubelet:
# cpuCFSQuota: true
# cpuManagerPolicy: none
# podPidsLimit: 10
# featureGates:
# SomeKubernetesFeature: true
# zones: # optional, only relevant if the provider supports availability zones
# - europe-central-1a
# - europe-central-1b
kubernetes:
version: 1.15.1
# allowPrivilegedContainers: true # 'true' means that all authenticated users can use the "gardener.privileged" PodSecurityPolicy, allowing full unrestricted access to Pod features.
# kubeAPIServer:
# featureGates:
# SomeKubernetesFeature: true
# runtimeConfig:
# scheduling.k8s.io/v1alpha1: true
# oidcConfig:
# caBundle: |
# -----BEGIN CERTIFICATE-----
# Li4u
# -----END CERTIFICATE-----
# clientID: client-id
# groupsClaim: groups-claim
# groupsPrefix: groups-prefix
# issuerURL: https://identity.example.com
# usernameClaim: username-claim
# usernamePrefix: username-prefix
# signingAlgs: RS256,some-other-algorithm
#-#-# only usable with Kubernetes >= 1.11
# requiredClaims:
# key: value
# admissionPlugins:
# - name: PodNodeSelector
# config: |
# podNodeSelectorPluginConfig:
# clusterDefaultNodeSelector: <node-selectors-labels>
# namespace1: <node-selectors-labels>
# namespace2: <node-selectors-labels>
# auditConfig:
# auditPolicy:
# configMapRef:
# name: auditpolicy
# kubeControllerManager:
# featureGates:
# SomeKubernetesFeature: true
# horizontalPodAutoscaler:
# syncPeriod: 30s
# tolerance: 0.1
#-#-# only usable with Kubernetes < 1.12
# downscaleDelay: 15m0s
# upscaleDelay: 1m0s
#-#-# only usable with Kubernetes >= 1.12
# downscaleStabilization: 5m0s
# initialReadinessDelay: 30s
# cpuInitializationPeriod: 5m0s
# kubeScheduler:
# featureGates:
# SomeKubernetesFeature: true
# kubeProxy:
# featureGates:
# SomeKubernetesFeature: true
# mode: IPVS
# kubelet:
# cpuCFSQuota: true
# cpuManagerPolicy: none
# podPidsLimit: 10
# featureGates:
# SomeKubernetesFeature: true
# clusterAutoscaler:
# scaleDownUtilizationThreshold: 0.5
# scaleDownUnneededTime: 30m
# scaleDownDelayAfterAdd: 60m
# scaleDownDelayAfterFailure: 10m
# scaleDownDelayAfterDelete: 10s
# scanInterval: 10s
dns:
# When the shoot shall use a cluster domain no domain and no providers need to be provided - Gardener will
# automatically compute a correct domain.
domain: crazy-botany.core.my-custom-domain.com
providers:
- type: aws-route53
secretName: my-custom-domain-secret
domains:
include:
- my-custom-domain.com
- my-other-custom-domain.com
exclude:
- yet-another-custom-domain.com
zones:
include:
- zone-id-1
exclude:
- zone-id-2
extensions:
- type: foobar
# providerConfig:
# apiVersion: foobar.extensions.gardener.cloud/v1alpha1
# kind: FooBarConfiguration
# foo: bar
networking:
type: calico
pods: 100.96.0.0/11
services: 100.64.0.0/13
nodes: 10.250.0.0/16
# providerConfig:
# apiVersion: calico.extensions.gardener.cloud/v1alpha1
# kind: NetworkConfig
# ipam:
# type: host-local
# cidr: usePodCIDR
# backend: bird
# typha:
# enabled: true
# See also: https://github.com/gardener/gardener/blob/master/docs/proposals/03-networking.md
maintenance:
timeWindow:
begin: 220000+0100
end: 230000+0100
autoUpdate:
kubernetesVersion: true
machineImageVersion: true
# hibernation:
# enabled: false
# schedules:
# - start: "0 20 * * *" # Start hibernation every day at 8PM
# end: "0 6 * * *" # Stop hibernation every day at 6AM
# location: "America/Los_Angeles" # Specify a location for the cron to run in
addons:
nginx-ingress:
enabled: false
# loadBalancerSourceRanges: []
kubernetes-dashboard:
enabled: true
# authenticationMode: basic # allowed values: basic,token
status:
conditions:
- type: APIServerAvailable
status: 'True'
lastTransitionTime: '2020-01-30T10:38:15Z'
lastUpdateTime: '2020-04-13T14:35:21Z'
reason: HealthzRequestFailed
message: API server /healthz endpoint responded with success status code. [response_time:3ms]
- type: ControlPlaneHealthy
status: 'True'
lastTransitionTime: '2020-04-02T05:18:58Z'
lastUpdateTime: '2020-04-13T14:35:21Z'
reason: ControlPlaneRunning
message: All control plane components are healthy.
- type: EveryNodeReady
status: 'True'
lastTransitionTime: '2020-04-01T16:27:21Z'
lastUpdateTime: '2020-04-13T14:35:21Z'
reason: EveryNodeReady
message: Every node registered to the cluster is ready.
- type: SystemComponentsHealthy
status: 'True'
lastTransitionTime: '2020-04-03T18:26:28Z'
lastUpdateTime: '2020-04-13T14:35:21Z'
reason: SystemComponentsRunning
message: All system components are healthy.
gardener:
id: 4c9832b3823ee6784064877d3eb10c189fc26e98a1286c0d8a5bc82169ed702c
name: gardener-controller-manager-7fhn9ikan73n-7jhka
version: 1.0.0
lastOperation:
description: Shoot cluster state has been successfully reconciled.
lastUpdateTime: '2020-04-13T14:34:27Z'
progress: 100
state: Succeeded
type: Reconcile
observedGeneration: 1
seed: seed1
hibernated: false
technicalID: shoot--core--crazy-botany
uid: d8608cfa-2856-11e8-8fdc-0a580af181af
`Plant` resource
apiVersion: v1
kind: Secret
metadata:
name: crazy-plant-secret
namespace: garden-core
type: Opaque
data:
kubeconfig: base64(kubeconfig-for-plant-cluster)
---
apiVersion: core.gardener.cloud/v1beta1
kind: Plant
metadata:
name: crazy-plant
namespace: garden-core
spec:
secretRef:
name: crazy-plant-secret
endpoints:
- name: Cluster GitHub repository
purpose: management
url: https://github.com/my-org/my-cluster-repo
- name: GKE cluster page
purpose: management
url: https://console.cloud.google.com/kubernetes/clusters/details/europe-west1-b/plant?project=my-project&authuser=1&tab=details
status:
clusterInfo:
provider:
type: gce
region: europe-west4-c
kubernetes:
version: v1.11.10-gke.5
conditions:
- lastTransitionTime: "2020-03-01T11:31:37Z"
lastUpdateTime: "2020-04-14T18:00:29Z"
message: API server /healthz endpoint responded with success status code. [response_time:8ms]
reason: HealthzRequestFailed
status: "True"
type: APIServerAvailable
- lastTransitionTime: "2020-04-01T06:26:56Z"
lastUpdateTime: "2020-04-14T18:00:29Z"
message: Every node registered to the cluster is ready.
reason: EveryNodeReady
status: "True"
type: EveryNodeReady
GEP-14: Reversed Cluster VPN
Motivation
It is necessary to describe the current VPN solution and outline its shortcomings in order to motivate this proposal.
Problem Statement
Today’s Gardener cluster VPN solution has several issues including:
- Connection establishment is always from the seed cluster to the shoot cluster. This means that there needs to be connectivity both ways which is not desirable in many cases (OpenStack, VMware) and causes high effort in firewall configuration or extra infrastructure. These firewall configurations are prohibited in some cases due to security policies.
- Shoot clusters must provide a VPN endpoint. This means extra cost for the endpoint (roughly €20/month on hyperscalers) or the consumption of scarce resources (limited number of VMware NSX-T load balancers).
A first implementation based on the konnectivity server has been provided to resolve these issues. As we found several shortcomings in the underlying technology component, the apiserver-network-proxy, we believe that this is not a suitable way ahead. We have opened an issue and provided two solution proposals to the community. We do see some remedies, e.g. using the QUIC protocol instead of gRPC, but (a) we consider the implementation effort significantly higher compared to this proposal and (b) it would use an experimental protocol to solve a problem that can also be solved with existing and proven core network technologies.
We will therefore not continue to invest in this approach. We will, however, research a similar approach (see “Further Research” below).
Current Solution Outline
The current solution consists of multiple VPN connections from each API server pod and the Prometheus pod of a control plane to an OpenVPN server running in the shoot cluster. This OpenVPN server is exposed via a load balancer that must have an IP address reachable from the seed cluster. The routing in the seed cluster pods is configured to route all traffic for the node, pod, and service ranges to the shoot cluster. This means that no address overlap is allowed between the seed and shoot cluster node, pod, and service ranges.
In the seed cluster, the `vpn-seed` container is a sidecar to the kube-apiserver and prometheus pods. OpenVPN acts as a TCP client connecting to an OpenVPN TCP server. This is not optimal (e.g. tunneling TCP over TCP is discouraged), but at the time of development there was no UDP load balancer available on at least one of the major hyperscalers. Connectivity could have been switched to UDP later, but the development effort was not spent.
The solution is depicted in this diagram:

These are the essential parts of the OpenVPN client configuration in the `vpn-seed` sidecar container:
# use TCP instead of UDP (commonly not supported by load balancers)
proto tcp-client
[...]
# get all routing information from server
pull
tls-client
key "/srv/secrets/vpn-seed/tls.key"
cert "/srv/secrets/vpn-seed/tls.crt"
ca "/srv/secrets/vpn-seed/ca.crt"
tls-auth "/srv/secrets/tlsauth/vpn.tlsauth" 1
cipher AES-256-CBC
# https://openvpn.net/index.php/open-source/documentation/howto.html#mitm
remote-cert-tls server
# pull filter
pull-filter accept "route 100.64.0.0 255.248.0.0"
pull-filter accept "route 100.96.0.0 255.224.0.0"
pull-filter accept "route 10.1.60.0 255.255.252.0"
pull-filter accept "route 192.168.123."
pull-filter ignore "route"
pull-filter ignore redirect-gateway
pull-filter ignore route-ipv6
pull-filter ignore redirect-gateway-ipv6
Encryption is based on SSL certificates with an additional HMAC signature on all SSL/TLS handshake packets. As multiple clients connect to the OpenVPN server in the shoot cluster, each client must be assigned a unique IP address. This is done by the OpenVPN server pushing that configuration to the client (keyword `pull`). As this is potentially problematic because the OpenVPN server runs in an untrusted environment, there are pull filters denying all but the necessary routes for the pod, service, and node networks.
The OpenVPN server running in the shoot cluster is configured as follows:
mode server
tls-server
proto tcp4-server
dev tun0
[...]
server 192.168.123.0 255.255.255.0
push "route 10.243.0.0 255.255.128.0"
push "route 10.243.128.0 255.255.128.0"
duplicate-cn
key "/srv/secrets/vpn-shoot/tls.key"
cert "/srv/secrets/vpn-shoot/tls.crt"
ca "/srv/secrets/vpn-shoot/ca.crt"
dh "/srv/secrets/dh/dh2048.pem"
tls-auth "/srv/secrets/tlsauth/vpn.tlsauth" 0
push "route 10.242.0.0 255.255.0.0"
It is a TCP TLS server and is configured to automatically assign IP addresses to OpenVPN clients (`server` directive). In addition, it pushes the shoot cluster node, pod, and service ranges to the clients running in the seed cluster (`push` directive).
Note: The network mesh spanned by OpenVPN uses the network range `192.168.123.0 - 192.168.123.255`. This network range cannot be used in either shoot or seed clusters. If it is used, this might cause subtle problems due to network range overlaps. Unfortunately, this appears not to be well documented, but the restriction has existed since the very beginning. We should clean up this technical debt as part of this exercise.
Goals
- We intend to supersede the current VPN solution with the solution outlined in this proposal.
- We intend to remove the code for the konnectivity tunnel once this solution proposal has been validated.
Non-Goals
- The solution is not a low-latency or high-throughput solution. As the kube-apiserver to shoot cluster traffic does not demand these properties, we do not intend to invest in improvements here.
- We do not intend to provide continuous availability of the shoot-seed VPN connection. We expect the availability to be comparable to that of the existing solution.
Proposal
The proposal is depicted in the following diagram:

We have added an OpenVPN server pod (`vpn-seed-server`) to each control plane. The OpenVPN client in the shoot cluster (`vpn-shoot-client`) connects to this OpenVPN server.
The two containers `vpn-seed-server` and `vpn-shoot-client` are new and are not related to the containers in the github.com/gardener/vpn project. We will create a new project github.com/gardener/vpn2 for these containers. With this solution we intend to supersede the containers from the github.com/gardener/vpn project.
A service `vpn-seed-server` of type `ClusterIP` is created for each control plane in its namespace.
The `vpn-shoot-client` pod connects to the correct `vpn-seed-server` service on port 8132 via the SNI passthrough proxy introduced with “SNI Passthrough proxy for kube-apiservers”.
Shoot OpenVPN clients (`vpn-shoot-client`) connect to the correct OpenVPN server using the HTTP proxy feature provided by OpenVPN. A configuration is added to the envoy proxy to detect HTTP proxy requests and open a connection to the correct OpenVPN server.
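For illustration, the relevant part of a `vpn-shoot-client` OpenVPN configuration might look roughly like this (a sketch only; the endpoint and service names are placeholders, and the real configuration will live in the new vpn2 containers):
proto tcp-client
# send an HTTP CONNECT through the seed's SNI/ingress endpoint on port 8132
http-proxy api.seed.example.com 8132
# logical target; envoy inspects the CONNECT target to route the connection
# to the vpn-seed-server service of the correct control plane namespace
remote vpn-seed-server.shoot--project--name 1194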
The `kube-apiserver` to shoot cluster connections are established using the API server proxy feature via an envoy proxy sidecar container of the `vpn-seed-server` container.
The restriction regarding the `192.168.123.0/24` network range in the current VPN solution still applies to this proposal. No other restrictions are introduced. In the context of this GEP, a pull request has been filed to block usage of that range by shoot clusters.
We expect performance and throughput to be slightly lower compared to the existing solution, because the OpenVPN server acts as an additional hop and must decrypt and re-encrypt the traffic that passes through it. As there are no low-latency or high-throughput requirements for this connection, we do not assume this to be an issue.
Availability and Failure Scenarios
This solution re-uses multiple instances of the envoy component used for the kube-apiserver endpoints. We assume that the availability for kube-apiservers is good enough for the cluster VPN as well.
The OpenVPN client and server pods are singleton pods in this approach and are therefore affected by potential failures and by cluster and control plane updates. Potential outages are restricted to single shoot clusters and are comparable to the situation with the existing solution today.
Feature Gates and Migration Strategy
We have introduced a gardenlet feature gate `ReversedVPN`. If `APIServerSNI` and `ReversedVPN` are enabled, the proposed solution is automatically enabled for all shoot clusters hosted by the seed. If `ReversedVPN` is enabled but `APIServerSNI` is not, the gardenlet will panic during startup, as this is an invalid configuration. All existing shoot clusters will automatically be migrated during the next reconciliation. We assume that the `ReversedVPN` feature will work with Gardener-managed as well as operator-managed Istio.
We have also added a shoot annotation `alpha.featuregates.shoot.gardener.cloud/reversed-vpn` which can override the feature gate to enable or disable the solution for individual clusters. This is only respected if `APIServerSNI` is enabled; otherwise it is ignored.
Security Review
The change in the VPN solution will potentially open up new attack vectors. We will perform a thorough analysis outside of this document.
Alternatives
WireGuard and Kubelink based Cluster VPN
We have done a detailed investigation and implementation of a reversed VPN based on WireGuard. While we believe that it is technically feasible and superior to the approach presented above, there are some concerns with regard to scalability and high availability. As the WireGuard scenario based on kubelink is relevant for other use cases, we continue to improve this implementation and address the concerns, but we concede that it might not be ready in time for the cluster VPN. We nevertheless keep the implementation and provide an outline as part of this proposal.
The general idea of this proposal is to keep the existing cluster VPN solution more or less as is, but to change the underlying network used for the `vpn-seed` => `vpn-shoot` connection. The underlying network should be established in the reversed direction, i.e. the shoot cluster should initiate the network connection, but it should nevertheless work in both directions.
We achieve this by tunneling the OpenVPN connection through a WireGuard tunnel, which is established from the shoot to the seed (note that WireGuard uses UDP as its protocol). Independent of that, we can also use UDP for the OpenVPN connection, but we can also stay with TCP as before. While this might look like a big change, it introduces only minor changes to the existing solution. In essence, the OpenVPN connection no longer requires a public endpoint in the shoot cluster; it uses the internal endpoint provided by the WireGuard tunnel.
This is roughly depicted in the diagram below. Note that the `vpn-seed` and `vpn-shoot` containers require only very small changes and are fully backwards compatible.

The WireGuard network needs a separate network range/CIDR. It has to be unique for the seed and all its shoot clusters. An example for an assumed workload of around 1000 shoot clusters would be `192.168.128.0/22` (1024 IP addresses), i.e. `192.168.128.0-192.168.131.255`. The IP addresses from this range need to be managed, but an IP address management (IPAM) using the Gardener Kubernetes objects (such as the seed and the shootstate) as the backing store is fairly straightforward. This is especially true as we do not expect large network ranges and only infrequent IP allocations. Hence, the IP address allocation can be quite simple: scan the range for a free IP address across all shoot clusters in a seed and allocate the first free address from the range.
There is another restriction: in case shoot clusters are configured to be seed clusters, this network range must not overlap with that of the “parent” seed cluster. If the parent seed cluster uses `192.168.128.0/22`, a child seed cluster can, for example, use `192.168.132.0/22`. Grandchildren can, however, use grandparent IP address ranges, and two child seed clusters can use identical ranges.
This slightly extends the restrictions described in the current solution outline, where only the arbitrarily chosen `192.168.123.0/24` range is reserved. For the purpose of this implementation we propose to extend that restriction to the `192.168.128.0/17` range; most of it would be reserved for “future use”, however. We are well aware that this adds to the burden of correctly configuring Gardener landscapes.
We consider this a challenge that needs to be addressed by careful configuration of the Gardener seed cluster infrastructure. Together with the `192.168.123.0/24` address range, these ranges should be automatically blocked for usage by shoots.
WireGuard can utilize the Linux kernel so that, after initialization/configuration, no user space processes are required. We propose to recommend the WireGuard kernel module as the default solution for all seeds. For shoot clusters, the kernel-based approach is also recommended, but the user space solution should work as well, as we expect less traffic on the shoot side. We expect the user space implementation to work on all operating systems supported by Gardener in case no kernel module is available.
Almost all seed clusters are already managed by Gardener, and we assume that those are configured with the WireGuard kernel module. There are, however, some cases where we use other Kubernetes distributions as seed clusters which may not have an operating system with the WireGuard module available. We will therefore generally support the user space WireGuard process on seed clusters, but place a size restriction on the number of control planes on those seeds.
There is a user space implementation of WireGuard which can be used on Linux distributions without the WireGuard kernel module (WireGuard moved into the standard Linux kernel with version 5.6). Our proposal can handle the kernel/user space switch transparently, i.e. we include the user space binaries and use them only when required. However, especially for the seed, the kernel-based solution might be more attractive. Garden Linux 318.4.0 supports WireGuard.
We have looked at Ubuntu and SuSE chost:
- SuSE chost does not provide the WireGuard kernel module, and it is not installable via zypper. It should, however, be straightforward for SuSE to include it in their next release.
- Ubuntu does not provide the kernel module either, but it can be installed using `apt-get install wireguard`. With that, it appears straightforward to provide an image with WireGuard pre-installed.
On the seed, we add a WireGuard device to one node on the host network. For all other nodes on the seed, we adapt the routes accordingly to route traffic destined for the WireGuard network to our WireGuard node. The Kubernetes pods managing the WireGuard device and routes are only used for initial configuration and later reconfiguration. During runtime, they can restart without any impact on the operation of the WireGuard network as the WireGuard device is managed by the Linux kernel.
With Calico as the networking solution, it is not easily possible to put the WireGuard endpoint into a pod. Doing so would require defining it as a gateway in the API server or Prometheus pods, but this is not possible since Calico does not span a proper subnet: while the defined CIDR of the pod network might be `100.96.0.0/11`, the network visible from within a pod is only `100.96.0.5/32`. This restriction might not exist with other networking solutions.
The WireGuard endpoint on the seed is exposed via a load balancer. We propose to use kubelink to manage the WireGuard configuration/device on the seed. We consider the management of the WireGuard endpoint to be complex, especially in error situations, which is the reason for utilizing kubelink: there is already significant experience with managing such an endpoint there. We propose moving kubelink to the Gardener org in case it is used by this proposal.
Kubelink addresses three challenges of managing WireGuard interfaces on cluster nodes. First, with WireGuard interfaces directly on the node (`hostNetwork=true`), the lifecycle of the interface is decoupled from the lifecycle of the pod that created it. This means that there have to be means of cleaning up the interface and its configuration in case the interface moves to a different node. Second, additional routing information must be distributed across the cluster: the WireGuard CIDR is unknown to the network implementation, so additional routes must be distributed on all nodes of the cluster. Third, kubelink dynamically configures the WireGuard interface with endpoints and their public keys.
On the shoot, we create the keys and acquire the WireGuard IP in the standard secret generation. The data is added as a secret to the control plane and to the shootstate. The vpn-shoot deployment is extended to include the WireGuard device setup inside the vpn-shoot pod network. For certain infrastructures (AWS), we need a re-advertiser to resolve the seed WireGuard endpoint and evaluate whether the IP address has changed.
While it is possible to configure a WireGuard device using DNS names, only IP addresses can be stored in the Linux kernel data structures. A change of a load balancer IP address can therefore not be mitigated on that level. As WireGuard dynamically adapts endpoint IP addresses, a change in load balancer IPs is mitigated in most, but not all, cases. This is why a re-advertiser is required for public cloud providers such as AWS.
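For illustration, the resulting shoot-side WireGuard configuration might look roughly like this (a sketch in wg-quick syntax; keys, addresses, and the endpoint are placeholders):
[Interface]
# WireGuard IP allocated for this shoot from the seed's WireGuard CIDR
Address = 192.168.128.5/22
PrivateKey = <shoot-private-key>

[Peer]
PublicKey = <seed-public-key>
# seed WireGuard endpoint exposed via a load balancer (resolved by the
# re-advertiser on infrastructures where the load balancer IP may change)
Endpoint = wg.seed.example.com:51820
# only traffic for the WireGuard network is routed through the tunnel
AllowedIPs = 192.168.128.0/22
# see the "High Availability" section below
PersistentKeepalive = 21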
The load balancer exposing the OpenVPN endpoint in the shoot cluster is no longer required and therefore removed if this functionality is used.
As we want to slowly expand the usage of the WireGuard solution, we propose to introduce a feature gate for it. Furthermore, since the WireGuard network requires a separate network range, we propose to introduce a new section to the seed settings with two additional flags (enabled & cidr):
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
name: my-seed
...
spec:
...
settings:
...
wireguard:
enabled: true
cidr: 192.168.128.0/22
Last but not least, we propose to introduce an annotation to the shoots to enable/disable the WireGuard tunnel explicitly.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
name: my-shoot
annotations:
alpha.featuregates.shoot.gardener.cloud/wireguard-tunnel: "true"
...
Using this approach, it is easy to switch the solution on and off, i.e. migrate the shoot clusters automatically during ordinary reconciliation.
High Availability
There is an issue if the node that hosts the WireGuard endpoint fails. The endpoint is migrated to another node, but the time required to do this might exceed the downtime budget, although one could argue that a disruption of less than 30 seconds to 1 minute does not qualify as a downtime and will in almost all cases not be noticeable by end users.
In this case we also assume that TCP connections won’t be interrupted - they would just appear to hang. We will confirm this behavior and the potential downtime as part of the development and testing effort, as this is hard to predict.
As a possible mitigation, we propose to instantiate two kubelink instances in the seed cluster that are served by two different load balancers. The instances must run on different nodes (if possible; we assume a proper seed cluster has more than one node). Each shoot cluster connects to both endpoints. This means that the OpenVPN server is reachable via two different IP addresses. The VPN seed sidecars must attempt to connect to both of them and will continue to do so. The “Persistent Keepalive” feature is set to 21 seconds by default but could be reduced; due to the redundancy, this appears not to be necessary.
It is desirable that both connections are used in an equal manner. One strategy could be to use the kubelink 1 connection if the first target WireGuard address is even (in the last byte of the IPv4 address), and the kubelink 2 connection otherwise. The `vpn-seed` sidecars can then use the following configuration in their OpenVPN configuration file:
<connection>
remote 192.168.45.3 1194 udp
</connection>
<connection>
remote 192.168.47.34 1194 udp
</connection>
OpenVPN will go through the list sequentially and try to connect to these endpoints.
As an additional mitigation, it appears possible to instantiate WireGuard devices on all hosts and replicate their relevant conntrack state across all cluster nodes. The relevant conntrack state covers all connections passing through the WireGuard interface (i.e. the WireGuard CIDR). conntrack and the tools to replicate conntrack state are part of the essential Linux netfilter tools package.
Load Considerations
What happens in case of a failure? In this case, one router will end up owning all connections, as the clients will attempt to use the next connection. This could be mitigated by adding a third redundant WireGuard connection: with this strategy, the failure of one WireGuard endpoint would result in an equal distribution of connections across the two remaining interfaces. We believe, however, that this will not be necessary.
The cluster node running the WireGuard endpoint is essentially a router that routes all traffic to the various shoot clusters. This is established and proven technology that has existed for decades and has been highly optimized since. It is also the technology that hyperscalers rely on to provide VPN connectivity to their customers. That said, hyperscalers essentially provide solutions based on IPsec, which is known not to scale as well as WireGuard. WireGuard is a relatively new technology, but we have no reason to believe that it is less stable than existing IPsec solutions.
Regarding performance, there is a lot of information on the Internet suggesting that WireGuard performs better than other VPN solutions such as IPsec or OpenVPN; examples are https://www.wireguard.com/performance/ and https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/2020-ifip-moonwire.pdf.
Based on this, we have no reason to believe that one router will not be able to handle all the traffic going to and coming from the shoot clusters. Nevertheless, we will closely monitor the situation in our tests and take action if necessary.
Further Research
Based on feedback on this proposal and while working on the implementation, we identified two additional approaches that we had not considered so far. The first idea can be used to replace the “inner” OpenVPN implementation; the second can be used to replace WireGuard with OpenVPN and get rid of the single point of failure.
Instead of using OpenVPN for the inner seed/shoot communication, we can use the PROXY protocol together with a TCP proxy (e.g. envoy) in the shoot cluster to broker the seed-shoot connections. The advantage of this solution is that seed and shoot cluster network ranges are allowed to overlap. Disadvantages are an increased implementation effort and a less efficient network in terms of throughput and scalability. We believe, however, that the reduced network efficiency does not invalidate this option.
There is an option in OpenVPN to specify a TCP proxy as part of the endpoint configuration.
Shoot APIServer via SNI
SNI Passthrough proxy for kube-apiservers
This GEP tackles the problem that today a single `LoadBalancer` is needed for every single Shoot cluster’s control plane.
Background
When the control plane of a Shoot cluster is provisioned, a dedicated LoadBalancer is created for it. This keeps the entire flow quite simple - the apiserver pods are running and are accessible via that LoadBalancer. Its hostnames / IP addresses are used for DNS records like `api.<external-domain>` and `api.<shoot>.<project>.<internal-domain>`. While this solution is simple, it comes with several issues.
Motivation
There are several problems with the current setup.
- IaaS provider costs. For example, a `ClassicLoadBalancer` on AWS costs at minimum 17 USD / month.
- Quotas can limit the number of LoadBalancers you can get per account / project, limiting the number of clusters you can host under a single account.
- Lack of support for better load-balancing algorithms than round-robin.
- Slow cluster provisioning time - depending on the provider, provisioning a LoadBalancer can take quite a while.
- Downtime when workload is shuffled in the clusters, as the LoadBalancer is not Kubernetes-aware.
Goals
- Only one LoadBalancer is used for all Shoot cluster API servers running in a Seed cluster.
- Out-of-cluster (end-user / robot) communication to the API server is still possible.
- In-cluster communication via the kubernetes master service (IPv4/v6 ClusterIP and `kubernetes.default.svc.cluster.local`) is possible.
- Client TLS authentication works without intermediate TLS termination (TLS is terminated by the `kube-apiserver`).
- The solution should be cloud-agnostic.
Proposal
Seed cluster
To solve the problem of having multiple `kube-apiservers` behind a single LoadBalancer, an intermediate proxy must be placed between the cloud provider’s LoadBalancer and the `kube-apiservers`. This proxy chooses the Shoot API server with the help of Server Name Indication (SNI). From Wikipedia:
Server Name Indication (SNI) is an extension to the Transport Layer Security (TLS) computer networking protocol by which a client indicates which hostname it is attempting to connect to at the start of the handshaking process. This allows a server to present multiple certificates on the same IP address and TCP port number and hence allows multiple secure (HTTPS) websites (or any other service over TLS) to be served by the same IP address without requiring all those sites to use the same certificate. It is the conceptual equivalent to HTTP/1.1 name-based virtual hosting, but for HTTPS.
A rough diagram of the flow of data:
+-------------------------------+
| |
| Network LB | (accessible from clients)
| |
| |
+-------------+-------+---------+ +------------------+
| | | |
| | proxy + lb | Shoot API Server |
| | +-------------+------------->+ |
| | | | Cluster A |
| | | | |
| | | +------------------+
| | |
+----------------v----+--+
| | |
+-+--------v----------+ | +------------------+
| | | | |
| | | proxy + lb | Shoot API Server |
| Proxy | +-------------+---------->+ |
| | | | Cluster B |
| | | | |
| +----+ +------------------+
+----------------+----+
|
|
| +------------------+
| | |
| proxy + lb | Shoot API Server |
+-------------------+-------------->+ |
| Cluster C |
| |
+------------------+
Sequentially:
- The client requests `Shoot Cluster A` and sets the `Server Name` in the TLS handshake to `api.shoot-a.foo.bar`.
- This packet goes through the Network LB and is forwarded to the proxy server (this load balancer should be a simple layer-4 TCP proxy).
- The proxy server reads the packet and sees that the client requests `api.shoot-a.foo.bar`.
- Based on its configuration, it maps `api.shoot-a.foo.bar` to `Shoot API Server Cluster A`.
- It acts as a TCP proxy and simply sends the data to `Shoot API Server Cluster A`.
There are multiple OSS proxies for this case. To ease integration, the proxy should:
- be configurable via Kubernetes resources
- not require restarts when the configuration changes
- be fast and incur little overhead
All things considered, Envoy proxy is the most fitting solution, as it provides all the features Gardener needs (no process reload being the most important one, and it is battle-tested in production by various companies).
While building a custom control plane for Envoy is quite simple, an already established solution might be the better path forward. Istio’s Pilot is one of the most feature-complete Envoy control plane solutions, as it offers a way to configure edge ingress traffic for Envoy via Gateway and VirtualService.
The resources which need to be created per Shoot cluster are the following:
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
name: kube-apiserver-gateway
namespace: <shoot-namespace>
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 443
name: tls
protocol: TLS
tls:
mode: PASSTHROUGH
hosts:
- api.<external-domain>
- api.<shoot>.<project>.<internal-domain>
and a corresponding `VirtualService` pointing to the correct API server:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: kube-apiserver
namespace: <shoot-namespace>
spec:
hosts:
- api.<external-domain>
- api.<shoot>.<project>.<internal-domain>
gateways:
- kube-apiserver-gateway
tls:
- match:
- port: 443
sniHosts:
- api.<external-domain>
- api.<shoot>.<project>.<internal-domain>
route:
- destination:
host: kube-apiserver.<shoot-namespace>.svc.cluster.local
port:
number: 443
The resources above configure Envoy to forward the raw TLS data (without termination) to the Shoot `kube-apiserver`.
Updated diagram:
+-------------------------------+
| |
| Network LB | (accessible from clients)
| |
| |
+-------------+-------+---------+ +------------------+
| | | |
| | proxy + lb | Shoot API Server |
| | +-------------+------------->+ |
| | | | Cluster A |
| | | | |
| | | +------------------+
| | |
+----------------v----+--+
| | |
+-+--------v----------+ | +------------------+
| | | | |
| | | proxy + lb | Shoot API Server |
| Envoy Proxy | +-------------+---------->+ |
| (ingress Gateway) | | | Cluster B |
| | | | |
| +----+ +------------------+
+-----+----------+----+
| |
| |
| | +------------------+
| | | |
| | proxy + lb | Shoot API Server |
| +-------------------+-------------->+ |
| get | Cluster C |
| configuration | |
| +------------------+
|
v Configure
+--+--------------+ +---------------------+ via Istio
| | | | Custom Resources
| Pilot +-------->+ Seed API Server +<------------------+
| | | |
| | | |
+-----------------+ +---------------------+
In this case, the `internal` and `external` `DNSEntries` should be changed to the Network LoadBalancer’s IP.
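For example, the external entry could then look roughly like this (a sketch; the `DNSEntry` API is provided by the external-dns-management project, and the IP is a placeholder for the Network LB address):
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  name: external
  namespace: <shoot-namespace>
spec:
  dnsName: api.<external-domain>
  targets:
  - 203.0.113.10 # IP of the shared Network LB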
In-cluster communication to the apiserver
In Kubernetes, the API server is discoverable via the master service (`kubernetes` in the `default` namespace). Today, this service can only be of type `ClusterIP` - making in-cluster communication to the API server impossible, due to the following:
- The client doesn’t set the `Server Name` in the TLS handshake if it attempts to talk to an IP address. In this case, the TLS handshake reaches the Envoy IngressGateway proxy but is rejected by it.
- Kubernetes services can be of type `ExternalName`, but the master service is not supported by kubelet. Even if this is fixed in future Kubernetes versions, the problem still exists for older versions where this functionality is not available.
Another issue occurs when a client tries to talk to the apiserver via the in-cluster DNS: `kubernetes.default.svc.cluster.local` is the same for all Shoot API servers, so when a client connects using that server name, the Envoy IngressGateway cannot distinguish between the different in-cluster Shoot clients.
To mitigate this problem, an additional proxy must be deployed on every single node. It does not terminate TLS and sends the traffic to the correct Shoot API server. This is achieved as follows:
- The apiserver master service reconciler is started and points to the `kube-apiserver`’s Cluster IP in the Seed cluster (e.g. `--advertise-address=10.1.2.3`).
- The proxy runs in the host network of the `Node`.
- The proxy has a sidecar container which:
  - creates a dummy network interface and assigns `10.1.2.3` to it.
  - removes connection tracking (conntrack) if iptables/nftables is enabled, as the IP address is local to the `Node`.
- The proxy listens on `10.1.2.3` and, using the PROXY protocol, sends the data stream to the Envoy ingress gateway (EIGW).
- EIGW listens for the PROXY protocol on a dedicated port `8443`. EIGW reads the destination IP + port from the PROXY protocol header and forwards the traffic to the correct upstream apiserver.
The sidecar is a standalone component. It’s possible to transparently change the proxy implementation without any modifications to the sidecar. The simplified flow looks like:
+------------------+ +----------------+
| Shoot API Server | TCP | Envoy IGW |
| +<-------------------+ PROXY listener |
| Cluster A | | :8443 |
+------------------+ +-+--------------+
^
|
|
|
|
+-----------------------------------------------------------+
| Single Node in
| the Shoot cluster
|
| PROXY Protocol
|
|
|
+---------------------+ +----------+----------+
| Pod talking to | | |
| the kubernetes | | Proxy |
| service +------>+ No TLS termination |
| | | |
+---------------------+ +---------------------+
Multiple OSS solutions can be used. To add a PROXY listener with Istio, several resources must be created - a dedicated Gateway, a dummy VirtualService and an EnvoyFilter which adds a listener filter (envoy.listener.proxy_protocol) on port 8443:
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
name: blackhole
namespace: istio-system
spec:
selector:
istio: ingressgateway
servers:
- port:
number: 8443
name: tcp
protocol: TCP
hosts:
- "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: blackhole
namespace: istio-system
spec:
hosts:
- blackhole.local
gateways:
- blackhole
tcp:
- match:
- port: 8443
route:
- destination:
host: localhost
port:
number: 9999 # any dummy port will work
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: proxy-protocol
namespace: istio-system
spec:
workloadSelector:
labels:
istio: ingressgateway
configPatches:
- applyTo: LISTENER
match:
context: ANY
listener:
portNumber: 8443
name: 0.0.0.0_8443
patch:
operation: MERGE
value:
listener_filters:
- name: envoy.filters.listener.proxy_protocol
For each individual Shoot cluster, a dedicated FilterChainMatch is added. It ensures that only Shoot API servers can receive traffic from this listener:
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
name: <shoot-namespace>
namespace: istio-system
spec:
workloadSelector:
labels:
istio: ingressgateway
configPatches:
- applyTo: FILTER_CHAIN
match:
context: ANY
listener:
portNumber: 8443
name: 0.0.0.0_8443
patch:
operation: ADD
value:
filters:
- name: envoy.filters.network.tcp_proxy
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
stat_prefix: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
cluster: outbound|443||kube-apiserver.<shoot-namespace>.svc.cluster.local
filter_chain_match:
destination_port: 443
prefix_ranges:
- address_prefix: 10.1.2.3 # kube-apiserver's cluster-ip
prefix_len: 32
Note: this additional EnvoyFilter can be removed when Istio supports full L4 matching.
An nginx proxy client in the Shoot cluster on every node could have the following configuration:
error_log /dev/stdout;
stream {
server {
listen 10.1.2.3:443;
proxy_pass api.<external-domain>:8443;
proxy_protocol on;
proxy_protocol_timeout 5s;
resolver_timeout 5s;
proxy_connect_timeout 5s;
}
}
events { }
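The proxy and its sidecar could, for example, run as a host-network DaemonSet. The following is a hedged sketch, not the actual Gardener manifests; names, images and paths are assumptions:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: apiserver-proxy
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: apiserver-proxy
  template:
    metadata:
      labels:
        app: apiserver-proxy
    spec:
      hostNetwork: true            # required so the proxy can bind 10.1.2.3 on the Node
      containers:
      - name: sidecar
        image: <sidecar-image>     # placeholder: creates the dummy interface for 10.1.2.3
        securityContext:
          capabilities:
            add: ["NET_ADMIN"]     # needed to create the interface / adjust conntrack
      - name: proxy
        image: nginx:stable        # runs the stream configuration shown above
        volumeMounts:
        - name: config
          mountPath: /etc/nginx
      volumes:
      - name: config
        configMap:
          name: apiserver-proxy-config   # placeholder ConfigMap holding nginx.conf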
In-cluster communication to the apiserver when ExternalName is supported
Even if future versions of Kubernetes support a master service of type ExternalName, we still have the problem that in-cluster workload talks to the server via DNS. For this to work we still need the above-mentioned proxy (this time listening on another IP address, 10.0.0.2). An additional change to CoreDNS would be needed:
default.svc.cluster.local.:8053 {
file kubernetes.default.svc.cluster.local
}
.:8053 {
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
upstream
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
The content of the kubernetes.default.svc.cluster.local file is going to be:
$ORIGIN default.svc.cluster.local.
@ 30 IN SOA local. local. (
        2017042745 ; serial
        1209600    ; refresh (2 weeks)
        1209600    ; retry (2 weeks)
        1209600    ; expire (2 weeks)
        30         ; minimum (30 seconds)
        )
30 IN NS local.
kubernetes IN A 10.0.0.2
So when a client requests kubernetes.default.svc.cluster.local, it'll be sent to the proxy listening on that IP address.
Future work
While out of scope of this GEP, several things can be improved:
- Make the sidecar work with eBPF and environments where iptables/nftables are not enabled.
18 - Shoot CA Rotation
GEP-18: Automated Shoot CA Rotation
Summary
This proposal outlines an on-demand, multi-step approach to rotate all certificate authorities (CA) used in a Shoot cluster. This process includes creating new CAs, invalidating the old ones and recreating all certificates signed by the CAs.
We propose to bundle the rotation of all CAs in the Shoot together as one triggerable action. This includes the recreation and invalidation of the following CAs and all certificates signed by them:
- Cluster CA (currently used for signing kube-apiserver serving certificates and client certificates)
- kubelet CA (used for signing client certificates for talking to the kubelet API, e.g. kube-apiserver-kubelet)
- etcd CA (used for signing etcd serving certificates and client certificates)
- front-proxy CA (used for signing client certificates that kube-aggregator (part of kube-apiserver) uses to talk to extension API servers; filled into the extension-apiserver-authentication ConfigMap and read by extension API servers to verify incoming kube-aggregator requests)
- metrics-server CA (used for signing serving certificates; filled into the APIService caBundle field and read by kube-aggregator to verify the presented serving certificate)
- ReversedVPN CA (used for signing the vpn-seed-server serving certificate and the vpn-shoot client certificate)
Out of scope for now:
- kubelet serving CA: it is self-generated (valid for 1y) and self-signed by kubelet on startup
  - kube-apiserver does not seem to verify the presented serving certificate
  - kubelet can be configured to request a serving certificate via CSR that can be verified by kube-apiserver; however, we consider this a separate improvement outside of this GEP
- Legacy VPN solution: it uses the cluster CA for both serving and client certificates. As the solution is soon to be dropped in favor of the new ReversedVPN solution, we don't intend to introduce a dedicated CA for this component. If ReversedVPN is disabled and the CA rotation is triggered, we make sure to propagate the cluster CA to the relevant places in the legacy VPN solution.
Naturally, not all certificates used for communication with the kube-apiserver are under the control of Gardener. An example of a Gardener-controlled certificate is the kubelet client certificate used to communicate with the API server. Examples of credentials not controlled by Gardener are kubeconfigs or client certificates requested via CertificateSigningRequests by the shoot owner.
We propose a two-step approach to rotate CAs. The start of each phase is triggered by the shoot owner.
In summary, the first phase is used to create new CAs (for example the new API server and client CA). Then we make sure that all servers and clients under Gardener's control trust both the old and the new CA. Next, we renew all client certificates that are under Gardener's control so that they are signed by the new CAs. This includes a node rollout in order to propagate the certificates to kubelets and restart all pods. Afterwards, the user needs to change their client credentials to trust both the old and the new cluster CA.
In the second phase, we remove all trust in the old CA for servers and clients under Gardener's control. This does not include a node rollout, but all still-running pods using ServiceAccounts will continue to trust the old CA until they restart. Also, the user needs to retrieve the new CA bundle to no longer trust the old CA.
A detailed overview of all steps required for each phase is given in the proposal section of this GEP.
Introducing a new client CA
Currently, client certificates and the kube-apiserver certificate are signed by the same CA. We propose to create a separate client CA when triggering the rotation. The client CA is used to sign certificates of clients talking to the API Server.
Motivation
There are a few reasons for rotating shoot cluster CAs:
- If we have to invalidate client certificates for the kube-apiserver or any other component, we are forced to rotate the CA: since Kubernetes does not support revoking certificates, the only way to invalidate them is to stop trusting all client certificates signed by the respective CA.
- If the CA itself got leaked.
- If the CA is about to expire.
- If a company policy requires to rotate a CA after a certain point in time.
In each of those cases, we currently need to manually recreate and replace all CAs and certificates. Rotating by hand is cumbersome and error-prone due to the many steps that need to be performed in the right order. By automating the process we want to create a way to securely and easily rotate shoot CAs.
Goals
- Offer an automated and safe solution to rotate all CAs in a shoot cluster.
- Offer a process that is easily understandable for developers and users.
- Rotate the different CAs in the shoot with a similar process to reduce complexity.
- Add visibility for Shoot owners when the last CA rotation happened
Non-Goals
- Offer an automated solution for rotating other static credentials (like static token).
- Later on, a similar two-phase approach could be implemented for the kubeconfig rotation. However, this is out of scope for this enhancement.
- Creating a process that runs fully automated without shoot owner interaction. As the shoot owner controls some secrets, that would probably not even be possible.
- Forcing the shoot owner to rotate after a certain time period. Rather, our goal is to issue long-running certificates and let users decide, depending on their requirements, when to rotate.
- Configurable default CA lifetime
Proposal
We will add a new feature gate CARotation for gardener-apiserver and gardenlet which allows enabling or disabling the possibility to trigger the rotation.
Triggering the CA Rotation
- Triggered via the gardener.cloud/operation annotation, in symmetry with other operations like reconciliation, kubeconfig rotation, etc. (see the sketch after this list)
  - the annotation increases the generation
  - value for triggering the first phase: start-ca-rotation
  - value for triggering the second phase: complete-ca-rotation
- gardener-apiserver performs the necessary validation: a user can't trigger another rotation if one is already in progress, can't trigger complete-ca-rotation if the first phase has not been completed, etc.
- The annotation triggers a usual shoot reconciliation (just like a kubeconfig or SSH key rotation)
- gardenlet begins the CA rotation sequence by setting the new status section .status.credentials.caRotation (probably in updateShootStatusOperationStart) and removes the annotation afterwards
  - shoot reconciliation needs to be idempotent with respect to the CA rotation phase, i.e. if a usual reconciliation or maintenance operation is triggered in between, no new CAs are generated and nothing else interferes with the CA rotation sequence
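For illustration, a minimal sketch of how the first phase could be triggered; the shoot name and namespace are placeholders, while the annotation key and values are the ones described above:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot                   # placeholder
  namespace: garden-my-project     # placeholder
  annotations:
    gardener.cloud/operation: start-ca-rotation   # later: complete-ca-rotation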
Changing the Shoot Status
A new section in the Shoot status is added when the first rotation is triggered:
status:
credentials:
rotation:
certificateAuthorities:
phase: Prepare # Prepare|Finalize|Completed
lastCompletion: 2022-02-07T14:23:44Z
# kubeconfig:
# phase:
# lastCompletion:
Later on, this section could be augmented with other information like the names of the credentials secrets (e.g. gardener/gardener#1749):
status:
credentials:
resources:
- type: kubeconfig
kind: Secret
name: shoot-foo.kubeconfig
Rotation Sequence for Cluster and Client CA
The proposal section includes a detailed description of all steps involved in rotating from a given CA0 to the target CA1.
t0: Today's situation
- kube-apiserver uses a SERVER CERT signed by CA0 and trusts CLIENT CERTS signed by CA0
- kube-controller-manager issues new CLIENT CERTS signed by CA0
- kubeconfig trusts only CA0
- ServiceAccount secrets trust only CA0
- kubelet uses a CLIENT CERT signed by CA0
t1: Shoot owner triggers the first step of the CA rotation process (-> phase one is started):
- Generate CA1
- Generate CLIENT_CA1
- Update kube-apiserver, kube-scheduler, etc. to trust CLIENT CERTS signed by both CA0 and CLIENT_CA1 (--client-ca-file flag)
- Update kube-controller-manager to issue new CLIENT CERTS now with CLIENT_CA1
- Update the kubeconfig so that its CA bundle contains both CA0 and CA1 (if the kubeconfig still contains a legacy CLIENT CERT then rotate the kubeconfig)
- Update generic-token-kubeconfig so that its CA bundle contains both CA0 and CA1
- Update kube-controller-manager to populate both CA0 and CA1 in ServiceAccount secrets
- Restart control plane components so that their CA bundle contains both CA0 and CA1
- Renew CLIENT CERTS (sign them with CLIENT_CA1) for the following control plane components: Prometheus, DWD, legacy VPN (if not dropped already in the context of gardener/gardener#4661)
- Trigger a node rollout
  - This issues new CLIENT CERTS for all kubelets signed by CLIENT_CA1
  - This restarts all Pods and propagates CA0 and CA1 into their mounted ServiceAccount secrets (note that CAs cannot be reloaded by the Go client, therefore we need a restart of the pods)
- Ask the user to exchange all their client credentials (kubeconfig, CLIENT CERTS issued via CertificateSigningRequests) to trust both CA0 and CA1 (see the kubeconfig sketch after this list)
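For illustration, a minimal sketch of a user kubeconfig during phase one; the server URL and user entry are placeholders, and the certificate-authority-data would be the base64 encoding of the CA0 and CA1 PEM files concatenated:
apiVersion: v1
kind: Config
clusters:
- name: my-shoot                                   # placeholder
  cluster:
    server: https://api.my-shoot.example.com       # placeholder
    # base64 of CA0.crt + CA1.crt concatenated: the client trusts both CAs
    certificate-authority-data: <base64(CA0.crt + CA1.crt)>
users:
- name: my-user                                    # placeholder
  user:
    token: <static-token-or-other-credential>      # placeholder
contexts:
- name: my-shoot
  context:
    cluster: my-shoot
    user: my-user
current-context: my-shoot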
t2: Shoot owner triggers the second step of the CA rotation process (-> phase two is started):
Prerequisite: All Gardener-controlled actions listed in t1 were executed successfully (for example the node rollout). The shoot owner has confirmed that they exchanged their client credentials and triggered step 2 via an annotation.
- Renew SERVER CERTS (sign them with CA1) for kube-apiserver, kube-controller-manager, cloud-controller-manager, etc.
- Update kube-apiserver, kube-scheduler, etc. to trust only CLIENT CERTS signed by CLIENT_CA1
- Update the kubeconfig so that its CA bundle contains only CA1
- Update generic-token-kubeconfig so that its CA bundle contains only CA1
- Update kube-controller-manager to only contain CA1; ServiceAccount secrets created after this point will include only CA1
- Restart control plane components so that their CA bundle contains only CA1
- Restart kubelets so that the CA bundle in their kubeconfigs contains only CA1
- Delete CA0
- Ask the user to optionally restart their Pods, since they still contain CA0 in memory, in order to eliminate trust in the old cluster CA
- Ask the user to exchange all their client credentials (download the kubeconfig containing only CA1; when using CLIENT CERTS, trust only CA1)
Rotation Sequence of Other CAs
Apart from the kube-apiserver CA (and the client CA) we also use five other CAs in the Gardener codebase, as mentioned above. We propose to rotate those CAs together with the kube-apiserver CA, following the same trigger.
ℹ️ Note for the front-proxy CA: users need to make sure that extension API servers have reloaded the extension-apiserver-authentication ConfigMap before triggering the second phase.
You can find the Gardener-managed CAs listed here.
Regarding the rotation steps, we want to follow a similar approach to the one we defined for the kube-apiserver CA. As an example, we show the timeline for ETCD_CA, but the logic should be similar for all of the CAs listed above.
t0:
- etcd trusts client certificates signed by ETCD_CA0 and uses a server certificate signed by ETCD_CA0
- kube-apiserver and backup-restore use a client certificate signed by ETCD_CA0 and trust ETCD_CA0
t1:
- Generate ETCD_CA1
- Update etcd to trust CLIENT CERTS signed by both ETCD_CA0 and ETCD_CA1
- Update kube-apiserver and backup-restore:
  - Adapt the CA bundle to trust both ETCD_CA0 and ETCD_CA1
  - Renew CLIENT CERTS (sign them with ETCD_CA1)
t2:
- Update etcd:
  - Trust only CLIENT CERTS signed by ETCD_CA1
  - Renew the SERVER CERT (sign it with ETCD_CA1)
- Update kube-apiserver and backup-restore so that their CA bundle contains only ETCD_CA1
ℹ️ This means we require two restarts of etcd in total (see the sketch below for how the trust bundle is wired).
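For illustration, a minimal sketch of the relevant etcd flags during phase one; the file paths are assumptions, while the flags are etcd's standard TLS options:
# excerpt of the etcd container args (sketch)
- --client-cert-auth=true
- --trusted-ca-file=/var/etcd/ssl/client-ca/bundle.crt  # ETCD_CA0 + ETCD_CA1 in t1, only ETCD_CA1 after t2
- --cert-file=/var/etcd/ssl/server/tls.crt              # SERVER CERT, re-signed with ETCD_CA1 in t2
- --key-file=/var/etcd/ssl/server/tls.key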
Alternatives
This section presents a different approach to rotating the CAs, which is to temporarily create a second set of API servers utilizing the new CA. After presenting the approach, the advantages and disadvantages of both approaches are listed.
t0: Today's situation
- kube-apiserver uses a SERVER CERT signed by CA0 and trusts CLIENT CERTS signed by CA0
- kube-controller-manager issues new CLIENT CERTS with CA0
- kubeconfig contains only CA0
- ServiceAccount secrets contain only CA0
- kubelet uses a CLIENT CERT signed by CA0
t1: User triggers the first step of the CA rotation process (-> phase one):
- Generate CA1
- Generate CLIENT_CA1
- Create a new DNSRecord, Service, Istio configuration, etc. for the second kube-apiserver deployment
- Deploy a second kube-apiserver deployment trusting only CLIENT CERTS signed by CLIENT_CA1 and using a SERVER CERT signed by CA1
- Update kube-scheduler, etc. to trust only CLIENT CERTS signed by CLIENT_CA1 (--client-ca-file flag)
- Update kube-controller-manager to issue new CLIENT CERTS with CLIENT_CA1
- Update the kubeconfig so that it points to the new DNSRecord and its CA bundle contains only CA1 (if the kubeconfig still contains a legacy CLIENT CERT then rotate the kubeconfig)
- Update ServiceAccount secrets so that their CA bundle contains both CA0 and CA1
- Restart control plane components so that they point to the second kube-apiserver Service and their CA bundle contains only CA1
- Renew CLIENT CERTS (sign them with CLIENT_CA1) for control plane components (Prometheus, DWD, legacy VPN) and point them to the second kube-apiserver Service
- Adapt apiserver-proxy-pod-mutator to point the KUBERNETES_SERVICE_HOST env variable to the second kube-apiserver
- Trigger a node rollout
  - This issues new CLIENT CERTS for all kubelets signed by CLIENT_CA1 and points them to the second DNSRecord
  - This restarts all Pods and propagates CA0 and CA1 into their mounted ServiceAccount secrets
- Ask the user to exchange all their client credentials (kubeconfig, CLIENT CERTS issued via CertificateSigningRequests)
t2: User triggers the second step of the CA rotation process (-> phase two):
- Update ServiceAccount secrets so that their CA bundle contains only CA1
- Update apiserver-proxy to talk to the second kube-apiserver
- Drop the first DNSRecord, Service, Istio configuration and the first kube-apiserver deployment
- Drop CA0
- Ask the user to optionally restart their Pods, since they still contain CA0 in memory
Advantages/Disadvantages of the approach with two API servers
- (+) User needs to adapt client credentials only once
- (/) Unstable API server domain
- (-) Probably more implementation effort
- (-) More complex
- (-) CA rotation process does not work similar for all CAs in our system
Advantages/Disadvantages of the currently preferred approach (see proposal)
- (+) Implementation effort seems “straight-forward”
- (+) CA rotation process works similar for all CAs in our system
- (/) Stable API server domain
- (-) User needs to adapt client credentials twice
19 - Utilize API Server Network Proxy to Invert Seed-to-Shoot Connectivity
Problem
Gardener’s architecture for Kubernetes clusters relies on having the control plane (e.g., kube-apiserver, kube-scheduler, kube-controller-manager, etc.) and the data plane (e.g., kube-proxy, kubelet, etc.) of the cluster reside in separate places. This provides many benefits but poses some challenges, especially when API-server-to-system-components communication is required. This problem is solved today in Gardener by making use of OpenVPN to establish a VPN connection from the seed to the shoot. To do so, the following steps are required:
- Create a Loadbalancer service on the shoot.
- Add a sidecar to the API server pod which knows the address of the newly created Loadbalancer.
- Establish a connection over the internet to the VPN Loadbalancer.
- Install additional iptables rules that redirect all the IPs of the shoot (i.e., service, pod, node CIDRs) to the established VPN tunnel.
There are, however, quite a few problems with the above approach, for example:
- Every shoot requires an additional loadbalancer; this accounts for additional overhead in terms of both costs and troubleshooting efforts.
- Private access use-cases would not be possible without the hard requirement of a seed residing in the same private domain. For example, have a look at this issue.
- Providing a public endpoint to access components in the shoot poses a security risk.
Proposal
There are multiple ways to tackle the directional connectivity issue mentioned above. One way would be to invert the connection between the API server and the system components, i.e., instead of having the API server sidecar establish a tunnel, we would have an agent residing in the shoot cluster initiate the connection itself. This way we don't need a Loadbalancer for every shoot, and from the security perspective there is no ingress from outside, only controlled egress.
We want to replace this:
APIServer | VPN-seed ---> internet ---> LB --> VPN-Shoot (4314) --> Pods | Nodes | Services
With this:
APIServer <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services
API Server Network Proxy
To solve this issue we can utilize the apiserver-network-proxy upstream implementation, which provides a reference implementation of a reverse streaming server. It works as follows:
- The proxy agent connects to the proxy server to establish a sticky connection.
- Traffic to the proxy server (residing in the seed) is then redirected to the agent (residing in the shoot), which forwards the traffic to in-cluster components.
The initial motivation for the apiserver-network-proxy project was to get rid of provider-specific implementations residing in the API server (e.g., SSH), but it turns out that it has other interesting use-cases such as data-plane connection decoupling, which is the main use-case for this proposal.
Starting with Kubernetes 1.18 it's possible to make use of the --egress-selector-config-file flag, which points the API server to traffic hook points based on the traffic direction. For example, with the config below the API server forwards all cluster-related traffic (e.g., logs, port-forward, exec, etc.) to the proxy-server, which then knows how to forward traffic to the shoot. For the rest of the traffic, e.g. API server to etcd or other control-plane components, direct is used, which means the legacy routing method, i.e., bypassing the proxy.
egress-selector-configuration.yaml: |-
apiVersion: apiserver.k8s.io/v1alpha1
kind: EgressSelectorConfiguration
egressSelections:
- name: cluster
connection:
proxyProtocol: httpConnect
transport:
tcp:
url: https://proxy-server:8131
- name: master
connection:
proxyProtocol: direct
- name: etcd
connection:
proxyProtocol: direct
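On the shoot side, the counterpart is the proxy-agent dialing out to the seed. The following is a minimal sketch of such a deployment, assuming the upstream proxy-agent image and its standard flags; the endpoint, image tag and paths are placeholders:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: proxy-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: proxy-agent
  template:
    metadata:
      labels:
        app: proxy-agent
    spec:
      containers:
      - name: proxy-agent
        image: registry.k8s.io/kas-network-proxy/proxy-agent:v0.0.37   # assumed image/tag
        args:
        - --proxy-server-host=proxy.<seed-domain>    # seed's public endpoint (placeholder)
        - --proxy-server-port=8132                   # assumed agent port
        - --ca-cert=/var/run/secrets/proxy/ca.crt    # CA to verify the proxy server (assumed path)
        volumeMounts:
        - name: proxy-ca
          mountPath: /var/run/secrets/proxy
      volumes:
      - name: proxy-ca
        secret:
          secretName: proxy-ca                       # placeholder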
Challenges
Prometheus to Shoot connectivity
One challenge remains in order to completely eliminate the need for a VPN connection: in today's Gardener setup, each control plane has a Prometheus instance that directly scrapes cluster components such as CoreDNS, kubelets, cAdvisor, etc. This works because, in addition to the VPN sidecar attached to the API server pod, we have another one attached to Prometheus which knows how to forward traffic to these endpoints. Once the VPN is eliminated, other means to forward traffic to these components are required.
Possible Solutions
There are currently two ways to solve this problem:
- Attach a port-forwarder sidecar to Prometheus.
- Utilize the proxy subresource on the API server.
Port-forwarder Sidecar
With this solution each Prometheus instance would have a sidecar that has the kubeconfig of the shoot cluster and establishes a port-forward connection to the endpoints residing in the shoot.
There are many problems with this approach:
- the port-forward connection is not reliable.
- the connection breaks if the API server instance dies.
- it requires an additional component.
- every pod / service would need to be exposed via port-forward.
Prom Pod (Prometheus -> Port-forwarder) <-> APIServer -> Proxy-server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services
Proxy Client Sidecar
Another solution would be to implement a proxy client as a sidecar for every component that wishes to communicate with the shoot cluster. For this to work, means to redirect / inject that proxy into the component's traffic path are necessary (e.g., additional iptables rules).
Prometheus Pod (Prometheus -> Proxy) <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services
The problem with this approach is that it requires an additional sidecar (along with traffic redirection) to be attached to every client that wishes to communicate with the shoot cluster. This can cause:
- additional maintenance efforts (extra code).
- other side-effects (e.g., if Istio sidecar injection is enabled).
Proxy sub-resource
Kubernetes supports proxying requests to node, service, and pod endpoints in the shoot cluster. This proxy connection can be utilized for scraping the necessary endpoints in the shoot (see the scrape configuration sketch below).
This approach requires fewer components and is more reliable than the port-forward solution; however, it relies on the API server supporting proxied connections for the required endpoints.
Prometheus <-> APIServer <-> Proxy-Server <--- internet <--- Proxy-Agent --> Pods | Nodes | Services
As simple as it is, this approach has the downside that it relies on the availability of the API server.
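For illustration, a hedged sketch of a Prometheus scrape configuration that scrapes kubelets through the API server's node proxy subresource; the service name, port and file paths are assumptions:
scrape_configs:
- job_name: shoot-kubelet
  scheme: https
  kubernetes_sd_configs:
  - role: node
    api_server: https://kube-apiserver          # in-seed service of the shoot API server (assumed)
    tls_config:
      ca_file: /etc/prometheus/shoot/ca.crt     # assumed path
    bearer_token_file: /etc/prometheus/shoot/token
  tls_config:
    ca_file: /etc/prometheus/shoot/ca.crt
  bearer_token_file: /etc/prometheus/shoot/token
  relabel_configs:
  # send every scrape to the API server instead of directly to the node
  - target_label: __address__
    replacement: kube-apiserver:443
  # rewrite the metrics path to the node proxy subresource
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics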
Proxy-server Loadbalancer Sharing and Re-advertising
With the proxy-server in place, we need to provide means for the proxy-agent in the shoot to establish the connection with the server. As a result, we need to provide a public endpoint through which this channel of communication can be established, i.e., we need a Loadbalancer (or several).
Possible Solution
Using a Loadbalancer per proxy server would not make sense, since this is a pain point we are trying to eliminate in the first place; doing so just moves the costs to the control plane. A possible solution is to communicate over a shared loadbalancer in the seed, similar to what has been proposed here; this way we can prevent the extra costs for loadbalancers.
With this in mind, we still have other pain-points, namely:
- Advertising Loadbalancer public IPs to the shoot.
- Directing the traffic to the corresponding shoot proxy-server.
For advertising the Loadbalancer IP, a DNS entry can be created for the proxy loadbalancer (or the DNS entry for the SNI proxy can be re-used), along with the necessary certificates, which is then used to connect to the loadbalancer. At this point we can decide on either one of two approaches:
- One proxy server per API server, behind a shared loadbalancer.
- One proxy server for all agents.
In the first case, we will probably need a proxy in front of the proxy-servers that knows how to direct traffic to the correct proxy-server based on the corresponding shoot cluster. In the second case, we don't need another proxy if the proxy server is cluster-aware, i.e., can pool and identify connections coming from the same cluster and peer them with the correct API server. Unfortunately, the second case is not supported today.
Summary
- The apiserver-network-proxy can be utilized to invert the connection (only for clusters >= 1.18; for older clusters the old VPN solution will remain).
- This is achieved by utilizing the --egress-selector-config-file flag on the API server.
- For monitoring endpoints, the proxy subresources would be the preferable method; in the future we can also support sidecar proxies that communicate with the proxy-server.
- For directing traffic to the correct proxy-server we will re-use the SNI proxy along with the loadbalancer from the Shoot API server via SNI GEP.