Advanced
1 - Cleanup of Shoot Clusters in Deletion
Cleanup of Shoot Clusters in Deletion
When a shoot cluster is deleted, Gardener tries to gracefully remove most of the Kubernetes resources inside the cluster. This prevents infrastructure or other artifacts from remaining after the shoot deletion.
The cleanup is performed in four steps. Some resources are deleted with a grace period, and all resources are forcefully deleted (by removing blocking finalizers) after some time so as not to block the cluster deletion entirely.
Cleanup steps:
1. All ValidatingWebhookConfigurations and MutatingWebhookConfigurations are deleted with a 5m grace period. Forceful finalization happens after 5m.
2. All APIServices and CustomResourceDefinitions are deleted with a 5m grace period. Forceful finalization happens after 1h.
3. All CronJobs, DaemonSets, Deployments, Ingresses, Jobs, Pods, ReplicaSets, ReplicationControllers, Services, StatefulSets, and PersistentVolumeClaims are deleted with a 5m grace period. Forceful finalization happens after 5m. If the Shoot is annotated with shoot.gardener.cloud/skip-cleanup=true, then only Services and PersistentVolumeClaims are considered.
4. All VolumeSnapshots and VolumeSnapshotContents are deleted with a 5m grace period. Forceful finalization happens after 1h.
It is possible to override the finalization grace periods via annotations on the Shoot:
- shoot.gardener.cloud/cleanup-webhooks-finalize-grace-period-seconds (for the resources handled in step 1)
- shoot.gardener.cloud/cleanup-extended-apis-finalize-grace-period-seconds (for the resources handled in step 2)
- shoot.gardener.cloud/cleanup-kubernetes-resources-finalize-grace-period-seconds (for the resources handled in step 3)
⚠️ If "0" is provided, then all resources are finalized immediately without waiting for any graceful deletion.
Please be aware that this might lead to orphaned infrastructure artifacts.
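For illustration, a Shoot overriding these grace periods could carry annotations like the following sketch (name, namespace, and the chosen values are example assumptions):
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot          # example name
  namespace: garden-dev   # example project namespace
  annotations:
    # wait up to 10 minutes before forcefully finalizing webhooks (step 1)
    shoot.gardener.cloud/cleanup-webhooks-finalize-grace-period-seconds: "600"
    # wait up to 30 minutes before forcefully finalizing extended APIs (step 2)
    shoot.gardener.cloud/cleanup-extended-apis-finalize-grace-period-seconds: "1800"
    # wait up to 10 minutes before forcefully finalizing Kubernetes resources (step 3)
    shoot.gardener.cloud/cleanup-kubernetes-resources-finalize-grace-period-seconds: "600"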
2 - containerd Registry Configuration
containerd Registry Configuration
containerd supports configuring registries and mirrors. Using this native containerd feature, Shoot owners can configure containerd to use public or private mirrors for a given upstream registry. More details about the registry configuration can be found in the corresponding upstream documentation.
containerd Registry Configuration Patterns
At the time of writing this document, containerd supports two patterns for configuring registries/mirrors.
Note: Trying to use both patterns at the same time is not supported by containerd. Strictly follow only one of the configuration patterns.
Old and Deprecated Pattern
The old and deprecated pattern is to specify registry.mirrors and registry.configs in containerd’s config.toml file. See the upstream documentation.
Example of the old and deprecated pattern:
version = 2
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://public-mirror.example.com"]
In the above example, containerd is configured to first try to pull docker.io images from the configured endpoint (https://public-mirror.example.com). If the image is not available there, containerd falls back to the upstream registry (docker.io) and pulls the image from there.
Hosts Directory Pattern
The hosts directory pattern is the new and recommended pattern for configuring registries. It is available starting with containerd@v1.5.0. See the upstream documentation.
The above example in the hosts directory pattern looks as follows.
The /etc/containerd/config.toml file has the following section:
version = 2
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
The following hosts directory structure has to be created:
$ tree /etc/containerd/certs.d
/etc/containerd/certs.d
└── docker.io
└── hosts.toml
Finally, for the docker.io upstream registry, we configure a hosts.toml file as follows:
server = "https://registry-1.docker.io"
[host."http://public-mirror.example.com"]
capabilities = ["pull", "resolve"]
Configuring containerd Registries for a Shoot
Gardener supports configuring containerd registries on a Shoot using the new hosts directory pattern. For each Shoot Node, Gardener creates the /etc/containerd/certs.d directory and adds the following section to containerd’s /etc/containerd/config.toml file:
[plugins."io.containerd.grpc.v1.cri".registry] # gardener-managed
config_path = "/etc/containerd/certs.d"
This allows Shoot owners to use the hosts directory pattern to configure registries for containerd. To do this, the Shoot owners need to create a directory under /etc/containerd/certs.d that is named after the upstream registry host name. In the newly created directory, a hosts.toml file needs to be created. For more details, see the hosts directory pattern section and the upstream documentation.
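As a rough sketch, assuming direct access to a Shoot node and reusing the docker.io mirror example from above, the required files could be created like this (in practice this is typically automated, e.g. by the registry-cache extension described below):
# create the per-registry hosts directory and a hosts.toml for docker.io
mkdir -p /etc/containerd/certs.d/docker.io
cat >/etc/containerd/certs.d/docker.io/hosts.toml <<EOF
server = "https://registry-1.docker.io"

[host."https://public-mirror.example.com"]
  capabilities = ["pull", "resolve"]
EOF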
The registry-cache Extension
There is a Gardener-native extension named registry-cache that supports:
- Configuring containerd registry mirrors based on the above-described contract. The feature was added in registry-cache@v0.6.0.
- Running pull-through cache(s) in the Shoot.
For more details, see the registry-cache documentation.
3 - Control Plane Endpoints And Ports
Endpoints and Ports of a Shoot Control-Plane
With the reversed VPN tunnel, Gardener does not require any endpoints with open ports in the shoot cluster. In order to allow communication to the shoot’s control-plane in the seed cluster, there are endpoints shared by multiple shoots of a seed cluster. Depending on the configured zones or exposure classes, there are different endpoints in a seed cluster. The IP address(es) can be determined by a DNS query for the API Server URL. The main entry point into the seed cluster is the load balancer of the Istio ingress-gateway service. Depending on the infrastructure provider, there can be one IP address per zone.
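For example, the IP address(es) behind the API Server URL can be looked up with a DNS query like the following (the hostname is a hypothetical example):
$ dig +short api.my-shoot.my-project.internal.example.com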
The load balancer of the Istio ingress-gateway service exposes the following TCP ports:
- 443 for requests to the shoot API Server. The request is dispatched according to the set TLS SNI extension.
- 8443 for requests to the shoot API Server via api-server-proxy, dispatched based on the proxy protocol target, which is the IP address of kubernetes.default.svc.cluster.local in the shoot.
- 8132 to establish the reversed VPN connection. It’s dispatched according to an HTTP header value.
kube-apiserver via SNI
DNS entries for api.<external-domain> and api.<shoot>.<project>.<internal-domain> point to the load balancer of an Istio ingress-gateway service.
The Kubernetes client sets the server name to api.<external-domain> or api.<shoot>.<project>.<internal-domain>.
Based on SNI, the connection is forwarded to the respective API Server at TCP layer. There is no TLS termination at the Istio ingress-gateway.
TLS termination happens on the shoot’s API Server. Traffic is end-to-end encrypted between the client and the API Server. The certificate authority and authentication are defined in the corresponding kubeconfig.
Details can be found in GEP-08.
kube-apiserver via apiserver-proxy
Inside the shoot cluster, the API Server can also be reached by the cluster internal name kubernetes.default.svc.cluster.local.
The apiserver-proxy pods are deployed in the host network as a daemonset and intercept connections to the Kubernetes service IP address.
The destination address is changed to the cluster IP address of the service kube-apiserver.<shoot-namespace>.svc.cluster.local in the seed cluster.
The connections are forwarded via the HAProxy PROXY protocol to the Istio ingress-gateway in the seed cluster.
The Istio ingress-gateway forwards the connection to the respective shoot API Server by its cluster IP address.
As TLS termination happens at the API Server, the traffic is end-to-end encrypted the same way as with SNI.
Details can be found in GEP-11.
Reversed VPN Tunnel
As the API Server has to be able to connect to endpoints in the shoot cluster, a VPN connection is established. This VPN connection is initiated from a VPN client in the shoot cluster. The VPN client connects to the Istio ingress-gateway and is forwarded to the VPN server in the control-plane namespace of the shoot. Once the VPN tunnel between the VPN client in the shoot and the VPN server in the seed cluster is established, the API Server can connect to nodes, services and pods in the shoot cluster.
More details can be found in the usage document and GEP-14.
4 - Custom containerd Configuration
Custom containerd Configuration
In case a Shoot cluster uses containerd, it is possible to make the containerd process load custom configuration files.
Gardener initializes containerd with the following statement:
imports = ["/etc/containerd/conf.d/*.toml"]
This means that all *.toml files in the /etc/containerd/conf.d directory will be imported and merged with the default configuration.
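For instance, a custom drop-in file could look like the following sketch (the file name and the debug setting are only illustrative assumptions, not Gardener defaults):
# /etc/containerd/conf.d/10-debug.toml
[debug]
  level = "debug"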
To prevent unintended configuration overwrites, please be aware that containerd merges config sections, not individual keys (see here and here).
Please consult the upstream containerd documentation for more information.
⚠️ Note that this only applies to nodes which were newly created after gardener/gardener@v1.51 was deployed. Existing nodes are not affected.
5 - Necessary Labeling for Custom CSI Components
Necessary Labeling for Custom CSI Components
Some provider extensions for Gardener use CSI components to manage persistent volumes in the shoot clusters. Additionally, most of the provider extensions deploy controllers for taking volume snapshots (CSI snapshotter).
End-users can deploy their own CSI components and controllers into shoot clusters.
In such situations, there are multiple controllers acting on the VolumeSnapshot custom resources (each responsible for those instances associated with their respective driver provisioner types).
However, this might lead to operational conflicts that cannot be overcome by Gardener alone. Concretely, Gardener cannot know which custom CSI components were installed by end-users, which can lead to issues, especially during shoot cluster deletion. You can add a label to your custom CSI components indicating that Gardener should not try to remove them during shoot cluster deletion. This means you have to take care of the lifecycle for these components yourself!
Recommendations
Custom CSI components are typically regular Deployments running in the shoot clusters.
Please label them with the shoot.gardener.cloud/no-cleanup=true label.
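For example, for a hypothetical deployment and namespace, the label could be added like this:
$ kubectl -n my-csi-namespace label deployment my-custom-csi-controller shoot.gardener.cloud/no-cleanup=true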
Background Information
When a shoot cluster is deleted, Gardener deletes most Kubernetes resources (Deployments, DaemonSets, StatefulSets, etc.). Gardener will also try to delete CSI components if they are not marked with the above-mentioned label.
This can result in VolumeSnapshot resources still having finalizers that will never be cleaned up.
Consequently, manual intervention is required to clean them up before the cluster deletion can continue.
6 - Readiness of Shoot Worker Nodes
Readiness of Shoot Worker Nodes
Background
When registering new Nodes, kubelet adds the node.kubernetes.io/not-ready taint to prevent scheduling workload Pods to the Node until the Ready condition gets True.
However, the kubelet does not consider the readiness of node-critical Pods.
Hence, the Ready condition might get True and the node.kubernetes.io/not-ready taint might get removed, for example, before the CNI daemon Pod (e.g., calico-node) has successfully placed the CNI binaries on the machine.
This problem has been discussed extensively in kubernetes, e.g., in kubernetes/kubernetes#75890.
However, several proposals have been rejected because the problem can be solved by using the --register-with-taints kubelet flag and dedicated controllers (ref).
Implementation in Gardener
Gardener makes sure that workload Pods are only scheduled to Nodes where all node-critical components required for running workload Pods are ready.
For this, Gardener follows the proposed solution by the Kubernetes community and registers new Node objects with the node.gardener.cloud/critical-components-not-ready taint (effect NoSchedule).
gardener-resource-manager’s Node controller reacts to newly created Node objects that have this taint.
The controller removes the taint once all node-critical Pods are ready (determined by checking the Pods’ Ready conditions).
The Node controller considers all DaemonSets and Pods that run in the kube-system namespace and are labeled with node.gardener.cloud/critical-component=true as node-critical.
If there are DaemonSets that contain the node.gardener.cloud/critical-component=true label in their metadata and in their Pod template, the Node controller waits for corresponding daemon Pods to be scheduled and to get ready before removing the taint.
Additionally, the Node controller checks for the readiness of csi-driver-node components if a respective Pod indicates that it uses such a driver.
This is achieved through a well-defined annotation prefix (node.gardener.cloud/wait-for-csi-node-).
For example, the csi-driver-node Pod for OpenStack Cinder is annotated with node.gardener.cloud/wait-for-csi-node-cinder=cinder.csi.openstack.org.
A key prefix is used instead of a “regular” annotation to allow for multiple CSI drivers being registered by one csi-driver-node Pod.
The annotation key’s suffix can be chosen arbitrarily (in this case cinder) and the annotation value needs to match the actual driver name as specified in the CSINode object.
The Node controller will verify that the used driver is properly registered in this object before removing the node.gardener.cloud/critical-components-not-ready taint.
Note that the csi-driver-node Pod still needs to be labelled and tolerate the taint as described above to be considered in this additional check.
Marking Node-Critical Components
To make use of this feature, node-critical DaemonSets and Pods need to:
- Tolerate the node.gardener.cloud/critical-components-not-ready NoSchedule taint.
- Be labelled with node.gardener.cloud/critical-component=true.
- Be placed in the kube-system namespace.
csi-driver-node Pods additionally need to:
- Be annotated with node.gardener.cloud/wait-for-csi-node-<name>=<full-driver-name>. It’s required that these Pods fulfill the above criteria (label and toleration) as well, as shown in the example below.
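A minimal sketch of a node-critical DaemonSet fulfilling these criteria could look like the following (names, image, and the CSI driver name are hypothetical placeholders):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-node-critical-component
  namespace: kube-system
  labels:
    node.gardener.cloud/critical-component: "true"
spec:
  selector:
    matchLabels:
      app: my-node-critical-component
  template:
    metadata:
      labels:
        app: my-node-critical-component
        node.gardener.cloud/critical-component: "true"
      # only required for csi-driver-node Pods; suffix and driver name are examples:
      # annotations:
      #   node.gardener.cloud/wait-for-csi-node-example: example.driver.example.org
    spec:
      tolerations:
      - key: node.gardener.cloud/critical-components-not-ready
        operator: Exists
        effect: NoSchedule
      containers:
      - name: main
        image: registry.example.com/my-node-critical-component:latest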
Gardener already marks components like kube-proxy, apiserver-proxy and node-local-dns as node-critical.
Provider extensions mark components like csi-driver-node as node-critical and add the wait-for-csi-node annotation.
Network extensions mark components responsible for setting up CNI on worker Nodes (e.g., calico-node) as node-critical.
If shoot owners manage any additional node-critical components, they can make use of this feature as well.
7 - Taints and Tolerations for Seeds and Shoots
Taints and Tolerations for Seeds and Shoots
Similar to taints and tolerations for Nodes and Pods in Kubernetes, the Seed resource supports specifying taints (.spec.taints, see this example) while the Shoot resource supports specifying tolerations (.spec.tolerations, see this example).
The feature is used to control scheduling to seeds, as well as to decide whether a shoot can use a certain seed.
Compared to Kubernetes, Gardener’s taints and tolerations are very much stripped-down right now and have some behavioral differences. Please read the following explanations carefully if you plan to use them.
Scheduling
When scheduling a new shoot, the gardener-scheduler will filter out all seed candidates whose taints are not tolerated by the shoot.
As Gardener’s taints/tolerations don’t support effects yet, you can compare this behaviour with using a NoSchedule effect taint in Kubernetes.
Be reminded that taints/tolerations are no means to define any affinity or selection for seeds - please use .spec.seedSelector in the Shoot to state such desires.
⚠️ Please note that - unlike how it’s implemented in Kubernetes - a certain seed cluster may only be used when the shoot tolerates all the seed’s taints.
This means that specifying .spec.seedName for a seed whose taints are not tolerated will make the gardener-apiserver reject the request.
Consequently, the taints/tolerations feature can be used as a means to restrict usage of certain seeds.
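As a sketch, a Seed with a taint and a Shoot tolerating it could look like this (the taint key is a made-up example, not one of Gardener’s well-known taints):
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
metadata:
  name: my-seed
spec:
  # ...
  taints:
  - key: example.gardener.cloud/dedicated-for-team-a   # hypothetical taint key
    value: "true"
---
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot
  namespace: garden-dev
spec:
  # ...
  tolerations:
  - key: example.gardener.cloud/dedicated-for-team-a   # must match the seed's taint key
    value: "true"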
Toleration Defaults and Whitelist
The Project resource features a .spec.tolerations object that may carry defaults and a whitelist (see this example).
The corresponding ShootTolerationRestriction admission plugin (cf. Kubernetes’ PodTolerationRestriction admission plugin) is responsible for evaluating these settings during creation/update of Shoots.
Whitelist
If a shoot gets created or updated with tolerations, then it is validated that only those tolerations may be used that were added to either a) the Project’s .spec.tolerations.whitelist, or b) the global whitelist in the ShootTolerationRestriction admission plugin’s configuration (see this example).
⚠️ Please note that the tolerations whitelist of Projects can only be changed if the user trying to change it is bound to the modify-spec-tolerations-whitelist custom RBAC role, e.g., via the following ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: full-project-modification-access
rules:
- apiGroups:
  - core.gardener.cloud
  resources:
  - projects
  verbs:
  - create
  - patch
  - update
  - modify-spec-tolerations-whitelist
  - delete
Defaults
If a shoot gets created, then the default tolerations specified in both the Project’s .spec.tolerations.defaults and the global default list in the ShootTolerationRestriction admission plugin’s configuration will be added to the .spec.tolerations of the Shoot (unless it already specifies a certain key).
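To illustrate, a Project using both defaults and a whitelist could look like the following sketch (project name and toleration key are hypothetical):
apiVersion: core.gardener.cloud/v1beta1
kind: Project
metadata:
  name: dev
spec:
  # ...
  tolerations:
    defaults:
    - key: example.gardener.cloud/dedicated-for-team-a   # added to every new Shoot in this project
    whitelist:
    - key: example.gardener.cloud/dedicated-for-team-a   # Shoots in this project may use this toleration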