
Networking

1 - Custom DNS Configuration


Gardener provides Kubernetes clusters as a service, where all system components (e.g., kube-proxy, networking, DNS) are managed. As a result, Gardener needs to ensure and auto-correct the configuration of these system components to avoid unnecessary downtime.

In some cases, auto-correcting system components can prevent users from deploying applications on top of the cluster that require some customization; DNS configuration is a good example.

To allow customizing the DNS configuration (which could potentially lead to downtime) while keeping the option to “undo”, we utilize the import plugin of CoreDNS [1], which enables in-line configuration changes.

How to use

To customize your cluster's CoreDNS configuration, you can simply edit a ConfigMap named coredns-custom in the kube-system namespace. By editing this ConfigMap you are modifying the CoreDNS configuration, therefore care is advised.

For example, to apply new config to CoreDNS that would point all .global DNS requests to another DNS pod, simply edit the configuration as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  istio.server: |
    global:8053 {
        errors
        cache 30
        forward . 1.2.3.4
    }
  corefile.override: |
    # <some-plugin> <some-plugin-config>
    debug
    whoami

The port number 8053 in global:8053 is the specific port that CoreDNS is bound to; it cannot be changed to any other port if the server should act on ordinary name resolution requests from pods. Otherwise, CoreDNS will open a second port, but you are responsible for directing the traffic to this port. The kube-dns service in the kube-system namespace directs name resolution requests within the cluster to port 8053 on the CoreDNS pods. Moreover, additional network policies are needed to allow the corresponding ingress traffic to the CoreDNS pods. In order for the destination DNS server to be reachable, it must listen on port 53, as it is required by the network policies. Other ports are only possible if additional network policies allow the corresponding egress traffic from the CoreDNS pods.
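
As a sketch of such an additional policy, the following NetworkPolicy would allow egress from the CoreDNS pods to the custom DNS server 1.2.3.4 from the example above on a non-standard port. Both the k8s-app: kube-dns pod label and the port 5353 are assumptions for illustration and need to be adapted to your setup:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-coredns-egress-to-custom-dns   # hypothetical name
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns       # assumption: label carried by the CoreDNS pods
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 1.2.3.4/32      # the custom DNS server from the example
    ports:
    - protocol: UDP
      port: 5353              # illustrative non-standard port
    - protocol: TCP
      port: 5353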

It is important that the ConfigMap keys end with *.server (if you would like to add a new server) or *.override (if you want to customize the current server configuration); setting both is optional.

[Optional] Reload CoreDNS

As Gardener configures the reload plugin of CoreDNS, a restart of the CoreDNS components is typically not necessary to propagate ConfigMap changes. However, if you don't want to wait for the default reload interval (30s) to kick in, you can roll out your CoreDNS deployment using:

kubectl -n kube-system rollout restart deploy coredns

This will reload the config into CoreDNS.

The approach we follow here was inspired by AKS’s approach [2].

Anti-Pattern

Applying a configuration that is incompatible with the running version of CoreDNS is an anti-pattern (plugin configuration sometimes changes between versions, so simply applying a configuration can break DNS).

If incompatible changes are applied by mistake, simply delete the content of the ConfigMap and re-apply. This should bring the cluster DNS back to a functioning state.
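
For example, one way to clear the custom configuration (a sketch; deleting and re-creating the ConfigMap works just as well) is to remove its data via a JSON merge patch:

kubectl -n kube-system patch configmap coredns-custom --type merge -p '{"data":null}'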

Node Local DNS

Custom DNS configuration may not work as expected in conjunction with NodeLocalDNS. With NodeLocalDNS, ordinary DNS queries targeted at the upstream DNS servers, i.e. non-kubernetes domains, will not end up at CoreDNS, but will instead be sent directly to the upstream DNS server. Therefore, configuration applying to non-kubernetes entities, e.g. the istio.server block in the custom DNS configuration example, may not have any effect with NodeLocalDNS enabled. If this kind of custom configuration is required, forwarding to the upstream DNS has to be disabled. This can be done by setting the option (spec.systemComponents.nodeLocalDNS.disableForwardToUpstreamDNS) in the Shoot resource to true:

...
spec:
  ...
  systemComponents:
    nodeLocalDNS:
      enabled: true
      disableForwardToUpstreamDNS: true
...

References

[1] Import plugin
[2] AKS Custom DNS

2 - DNS Search Path Optimization


DNS Search Path

Using fully qualified names has some downsides, e.g., it may become harder to move deployments from one landscape to the next. It is far easier and simpler to rely on short/local names, which may have a different meaning depending on the context they are used in.

The DNS search path allows for the usage of short/local names. It is an ordered list of DNS suffixes to append to short/local names to create a fully qualified name.

When a short/local name should be resolved, each entry is appended to it one by one to check whether it can be resolved. The process stops when either the name could be resolved or the DNS search path ends. As the last step after trying the search path, the short/local name is attempted to be resolved on its own.

DNS Option ndots

As explained in the section above, the DNS search path is used for short/local names to create fully qualified names. The DNS option ndots specifies how many dots (.) a name needs to have to be considered fully qualified. For names with fewer than ndots dots (.), the DNS search path will be applied.

DNS Search Path, ndots, and Kubernetes

Kubernetes tries to make it easy/convenient for developers to use name resolution. It provides several means to address a service, most notably by its name directly, by using the namespace as a suffix, by using <namespace>.svc as a suffix, or by the fully qualified name <service>.<namespace>.svc.cluster.local (assuming cluster.local to be the cluster domain).

This is why the DNS search path is fairly long in Kubernetes, usually consisting of <namespace>.svc.cluster.local, svc.cluster.local, cluster.local, and potentially some additional entries coming from the local network of the cluster. For various reasons, the default ndots value in the context of Kubernetes is 5, which is also fairly large. See this comment for a more detailed description.
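
To illustrate, a pod's /etc/resolv.conf typically looks similar to the following (the namespace, the network-domain entry, and the nameserver IP are illustrative):

search some-namespace.svc.cluster.local svc.cluster.local cluster.local network-domain
nameserver 10.0.0.10
options ndots:5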

DNS Search Path/ndots Problem in Kubernetes

As the DNS search path is long and ndots is large, a lot of DNS queries might traverse the DNS search path. This results in an explosion of DNS requests.

For example, consider the name resolution of the default kubernetes service kubernetes.default.svc.cluster.local. As this name has only four dots, it is not considered a fully qualified name according to the default ndots=5 setting. Therefore, the DNS search path is applied, resulting in the following queries being created:

  • kubernetes.default.svc.cluster.local.some-namespace.svc.cluster.local
  • kubernetes.default.svc.cluster.local.svc.cluster.local
  • kubernetes.default.svc.cluster.local.cluster.local
  • kubernetes.default.svc.cluster.local.network-domain

In IPv4/IPv6 dual-stack systems, the number of DNS requests may even double, as each name is resolved for both IPv4 and IPv6.

General Workarounds/Mitigations

Kubernetes provides the capability to set the DNS options for each pod (see Pod DNS config for details). However, this has to be applied to every pod (doing name resolution) to resolve the problem. A mutating webhook may be useful in this regard. Unfortunately, the DNS requirements may be different depending on the workload. Therefore, a general solution may be difficult to impossible.
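
As a minimal sketch, a pod could lower ndots via its dnsConfig so that names with two or more dots bypass the search path (pod name and image are placeholders; the right ndots value depends on the workload):

apiVersion: v1
kind: Pod
metadata:
  name: example-app                        # placeholder
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0    # placeholder image
  dnsConfig:
    options:
    - name: ndots
      value: "2"                           # names with >= 2 dots skip the search path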

Another approach is to always use fully qualified names and append a trailing dot (.) to the name, e.g., kubernetes.default.svc.cluster.local., to prevent the name resolution system from applying the DNS search path. This might be somewhat counterintuitive, as most developers are not used to the trailing dot (.). Furthermore, it makes moving to different landscapes more difficult/error-prone.

Gardener Specific Workarounds/Mitigations

Gardener allows users to customize their DNS configuration. CoreDNS allows several approaches to deal with the requests generated by the DNS search path. Caching is possible as well as query rewriting. There are also several other plugins available, which may mitigate the situation.

Gardener DNS Query Rewriting

As explained above, the application of the DNS search path may lead to the undesired creation of DNS requests. Especially with the default setting of ndots=5, seemingly fully qualified names pointing to services in the cluster may trigger the DNS search path application.

Gardener allows automatically rewriting some obviously incorrect DNS names, which stem from an application of the DNS search path, to the most likely desired name. This will automatically rewrite requests like service.namespace.svc.cluster.local.svc.cluster.local to service.namespace.svc.cluster.local.

In case the applications also resolve names of services that are outside of the cluster and have fewer than ndots dots, it might be helpful to prevent search path application for them as well. One way to achieve this is by adding them to the commonSuffixes:

...
spec:
  ...
  systemComponents:
    coreDNS:
      rewriting:
        commonSuffixes:
        - gardener.cloud
        - example.com
...

DNS requests containing a common suffix and ending in .svc.cluster.local are assumed to be an incorrect application of the DNS search path. Therefore, they are rewritten to the part ending in the common suffix. For example, www.gardener.cloud.svc.cluster.local would be rewritten to www.gardener.cloud.

Please note that the common suffixes should be long enough and include enough dots (.) to prevent random overlap with other DNS queries. For example, it would be a bad idea to simply put com on the list of common suffixes, as there may be services/namespaces which have com as part of their name. The effect would be seemingly random DNS requests. Gardener requires that common suffixes contain at least one dot (.) and adds a second dot at the beginning. For instance, a common suffix of example.com in the configuration would match *.example.com.

Since some clients verify the host in the response of a DNS query, the host must also be rewritten. For that reason, we can't rewrite a query for service.dst-namespace.svc.cluster.local.src-namespace.svc.cluster.local or www.example.com.src-namespace.svc.cluster.local, as src-namespace would not be known when rewriting the answer.

3 - ExposureClasses


The Gardener API server provides a cluster-scoped ExposureClass resource. This resource allows exposing the control plane of a Shoot cluster in various network environments, like restricted corporate networks, DMZ, etc.

Background

The ExposureClass resource is based on the concept of the RuntimeClass resource in Kubernetes.

A RuntimeClass abstracts the installation of a certain container runtime (e.g., gVisor, Kata Containers) on all nodes or a subset of the nodes in a Kubernetes cluster. See Runtime Class for more information.

In contrast, an ExposureClass abstracts the ability to expose a Shoot cluster's control plane in certain network environments (e.g., corporate networks, DMZ, internet) on all Seeds or a subset of the Seeds.

Example: RuntimeClass and ExposureClass

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: gvisorconfig
# scheduling:
#   nodeSelector:
#     env: prod
---
apiVersion: core.gardener.cloud/v1beta1
kind: ExposureClass
metadata:
  name: internet
handler: internet-config
# scheduling:
#   seedSelector:
#     matchLabels:
#       network/env: internet

Similar to RuntimeClasses, ExposureClasses also define a .handler field: for a RuntimeClass it references the corresponding CRI configuration, and for an ExposureClass it references the control plane exposure configuration.

The CRI handler for RuntimeClasses is usually installed by an administrator (e.g., via a DaemonSet which installs the corresponding container runtime on the nodes). The control plane exposure configuration for ExposureClasses will also be provided by an administrator. This exposure configuration is part of the gardenlet configuration, as this component is responsible for configuring the control plane accordingly. See the gardenlet Configuration ExposureClass Handlers section for more information.

The RuntimeClass also supports selecting a subset of nodes (which have the respective container runtime binaries installed) for pod scheduling via its .scheduling section. Likewise, the ExposureClass supports, via its .scheduling section, selecting a subset of the available Seed clusters whose gardenlet is capable of applying the exposure configuration to the Shoot control plane.

Usage by a Shoot

A Shoot can reference an ExposureClass via the .spec.exposureClassName field.
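
For example, a minimal sketch of a Shoot referencing the internet ExposureClass from the example above (name and namespace are placeholders):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: my-shoot        # placeholder
  namespace: garden-dev # placeholder
spec:
  exposureClassName: internet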

⚠️ When creating a Shoot resource, the Gardener scheduler will try to assign the Shoot to a Seed which will host its control plane.

The scheduling behaviour can be influenced via the .spec.seedSelectors and/or .spec.tolerations fields in the Shoot. ExposureClasses can also contain scheduling instructions. If a Shoot references an ExposureClass, then the scheduling instructions of both will be merged into the Shoot. This union of scheduling instructions might lead to the selection of a Seed which is not able to deal with the handler of the ExposureClass, and the Shoot creation might end up in an error. In such a case, the Shoot scheduling instructions should be revisited to check that they are not interfering with the ones from the ExposureClass. If this is not feasible, then the combination with the ExposureClass might not be possible and you need to contact your Gardener administrator.

Example: Shoot and ExposureClass scheduling instructions merge flow
  1. Assuming there is the following Shoot which is referencing the ExposureClass below:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: abc
  namespace: garden-dev
spec:
  exposureClassName: abc
  seedSelectors:
    matchLabels:
      env: prod
---
apiVersion: core.gardener.cloud/v1beta1
kind: ExposureClass
metadata:
  name: abc
handler: abc
scheduling:
  seedSelector:
    matchLabels:
      network: internal
  2. Both seedSelectors would be merged into the Shoot. The result would be the following:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: abc
  namespace: garden-dev
spec:
  exposureClassName: abc
  seedSelectors:
    matchLabels:
      env: prod
      network: internal
  3. Now the Gardener Scheduler would try to find a Seed with those labels.
  • If there are no Seeds with matching labels for the seed selector, then the Shoot will be unschedulable.
  • If there are Seeds with matching labels for the seed selector, then the Shoot will be assigned to the best candidate after the scheduling strategy is applied, see Gardener Scheduler.
    • If the Seed is not able to serve the ExposureClass handler abc, then the Shoot will end up in error state.
    • If the Seed is able to serve the ExposureClass handler abc, then the Shoot will be created.

gardenlet Configuration ExposureClass Handlers

The gardenlet is responsible for realizing the control plane exposure strategy defined in the ExposureClass referenced by a Shoot.

Therefore, the GardenletConfiguration can contain an .exposureClassHandlers list with the respective configuration.

Example of the GardenletConfiguration:

exposureClassHandlers:
- name: internet-config
  loadBalancerService:
    annotations:
      loadbalancer/network: internet
- name: internal-config
  loadBalancerService:
    annotations:
      loadbalancer/network: internal
  sni:
    ingress:
      namespace: ingress-internal
      labels:
        network: internal

Each gardenlet can define how the handler of a certain ExposureClass needs to be implemented for the Seed(s) it is responsible for.

The .name is the name of the handler config and it must match the .handler in the ExposureClass.

All control planes on a Seed are exposed via a load balancer, either a dedicated one or a central shared one. The load balancer service needs to be configured in a way that makes it reachable from the target network environment. Therefore, the configuration of the load balancer service needs to be specified, which can be done via the .loadBalancerService section. The common way to influence load balancer service behaviour is via annotations, which the respective cloud-controller-manager reacts to by configuring the infrastructure load balancer accordingly.

The control planes on a Seed are exposed via a central load balancer and Envoy acting as a TLS SNI passthrough proxy. In this case, the gardenlet will install a dedicated ingress gateway (Envoy + load balancer + respective configuration) for each handler on the Seed. The configuration of these ingress gateways can be controlled via the .sni section in the same way as for the default ingress gateways.

4 - KUBERNETES_SERVICE_HOST Environment Variable Injection


In each Shoot cluster’s kube-system namespace a DaemonSet called apiserver-proxy is deployed. It routes traffic to the upstream Shoot Kube APIServer. See the APIServer SNI GEP for more details.

To skip this extra network hop, a mutating webhook called apiserver-proxy.networking.gardener.cloud is deployed next to the API server in the Seed. It adds a KUBERNETES_SERVICE_HOST environment variable to each container and init container that does not specify it. See the webhook repository for more information.
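
The effect is roughly as if every container spec contained an entry like the following sketch (the actual value is the address of the Shoot's kube-apiserver and depends on the cluster):

env:
- name: KUBERNETES_SERVICE_HOST
  value: api.<shoot-domain>   # illustrative placeholder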

Opt-Out of Pod Injection

In some cases it's desirable to opt out of Pod injection:

  • DNS is disabled on that individual Pod, but it still needs to talk to the kube-apiserver.
  • You want to test the kube-proxy and kubelet in-cluster discovery.

Opt-Out of Pod Injection for Specific Pods

To opt out of the injection, the Pod should be labeled with apiserver-proxy.networking.gardener.cloud/inject: disable, e.g.:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        apiserver-proxy.networking.gardener.cloud/inject: disable
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

Opt-Out of Pod Injection on Namespace Level

To opt out of the injection for all Pods in a namespace, you should label your namespace with apiserver-proxy.networking.gardener.cloud/inject: disable, e.g.:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    apiserver-proxy.networking.gardener.cloud/inject: disable
  name: my-namespace

or via kubectl for an existing namespace:

kubectl label namespace my-namespace apiserver-proxy.networking.gardener.cloud/inject=disable

Note: Please be aware that it’s not possible to disable injection on a namespace level and enable it for individual pods in it.

Opt-Out of Pod Injection for the Entire Cluster

If the injection is causing problems for different workloads and ignoring individual pods or namespaces is not possible, then the feature can be disabled for the entire cluster by setting the alpha.featuregates.shoot.gardener.cloud/apiserver-sni-pod-injector annotation to disable on the Shoot resource itself:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  annotations:
    alpha.featuregates.shoot.gardener.cloud/apiserver-sni-pod-injector: 'disable'
  name: my-cluster

or via kubectl for an existing shoot cluster:

kubectl annotate shoot my-cluster alpha.featuregates.shoot.gardener.cloud/apiserver-sni-pod-injector=disable

Note: Please be aware that it’s not possible to disable injection on a cluster level and enable it for individual pods in it.

5 - NodeLocalDNS Configuration


This is a short guide describing how to enable DNS caching on the shoot cluster nodes.

Background

Currently, in Gardener we are using CoreDNS as a deployment that is auto-scaled horizontally to handle QPS-intensive applications. However, this does not seem to be enough to completely circumvent DNS bottlenecks such as:

  • Cloud provider limits for DNS lookups.
  • Unreliable UDP connections that force a timeout period in case packets are dropped.
  • Unnecessary node hopping, since CoreDNS is not deployed on all nodes and, as a result, DNS queries end up traversing multiple nodes before reaching the destination server.
  • Inefficient load-balancing of services (e.g., round-robin might not be enough when using IPTables mode).
  • and more …

To work around the issues described above, node-local-dns was introduced. The architecture is described below. The idea is simple:

  • For new queries, the connection is upgraded from UDP to TCP and forwarded towards the cluster IP for the original CoreDNS server.
  • For previously resolved queries, an immediate response from the same node where the requester workload / pod resides is provided.

[Image: node-local-dns architecture]

Configuring NodeLocalDNS

All that needs to be done to enable the usage of the node-local-dns feature is to set the corresponding option (spec.systemComponents.nodeLocalDNS.enabled) in the Shoot resource to true:

...
spec:
  ...
  systemComponents:
    nodeLocalDNS:
      enabled: true
...

It is worth noting that:

  • When migrating from IPVS to IPTables, existing pods will continue to leverage the node-local-dns cache.
  • When migrating from IPTables to IPVS, only newer pods will be switched to the node-local-dns cache.
  • During the reconfiguration of node-local-dns, there might be a short disruption of domain name resolution, depending on the setup. Usually, DNS requests are repeated for some time, as UDP is an unreliable protocol, but that strictly depends on the application and the way domain name resolution happens. It is recommended to let the shoot be reconciled during the next maintenance period.
  • Enabling or disabling node-local-dns triggers a rollout of all shoot worker nodes, see also this document.

For more information about node-local-dns, please refer to the KEP or to the usage documentation.

Known Issues

Custom DNS configuration may not work as expected in conjunction with NodeLocalDNS. Please refer to Custom DNS Configuration.

6 - Shoot Networking Configurations

Configuring the Pod network and the maximum number of Nodes and Pods per Node.


This document contains network-related information for Shoot clusters.

Pod Network

A Pod network is required for any communication with Pods that are not started within the Node's host network. More information about the Kubernetes network model can be found in the Cluster Networking topic.

Gardener allows users to configure the Pod network’s CIDR during Shoot creation:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  networking:
    type: <some-network-extension-name> # {calico,cilium}
    pods: 100.96.0.0/16
    nodes: ...
    services: ...

⚠️ The networking.pods IP configuration is immutable and cannot be changed afterwards. Please consider the following paragraph to choose a configuration which will meet your demands.

One of the network plugin's (CNI) tasks is to assign IP addresses to Pods started in the Pod network. Different network plugins come with different IP address management (IPAM) features, so we can't give any definite advice on how IP ranges should be configured. Nevertheless, we want to outline the standard configuration.

The information in .spec.networking.pods matches the --cluster-cidr flag of the Kube-Controller-Manager of your Shoot cluster. This IP range is divided into smaller subnets, also called podCIDRs (default mask /24), which are assigned to the Node objects' .spec.podCIDR. Pods get their IP addresses from this smaller node subnet in a default IPAM setup. Thus, it must be guaranteed that enough of these subnets can be created for the maximum number of nodes you expect in the cluster.
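
For instance, the podCIDR assigned to each node can be inspected with a generic kubectl query like the following:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'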

Example 1

Pod network: 100.96.0.0/16
nodeCIDRMaskSize: /24
-------------------------

Number of podCIDRs: 256 --> max. Node count 
Number of IPs per podCIDRs: 256

With the configuration above a Shoot cluster can at most have 256 nodes which are ready to run workload in the Pod network.

Example 2

Pod network: 100.96.0.0/20
nodeCIDRMaskSize: /24
-------------------------

Number of podCIDRs: 16 --> max. Node count 
Number of IPs per podCIDRs: 256

With the configuration above a Shoot cluster can at most have 16 nodes which are ready to run workload in the Pod network.

Besides the configuration in .spec.networking.pods, users can tune the nodeCIDRMaskSize used by the Kube-Controller-Manager on shoot creation. A smaller IP range per node means more podCIDRs and thus the ability to provision more nodes in the cluster, but fewer available IPs for Pods running on each of the nodes.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubeControllerManager:
    nodeCIDRMaskSize: 24 # default

⚠️ The nodeCIDRMaskSize configuration is immutable and cannot be changed afterwards.

Example 3

Pod network: 100.96.0.0/20
nodeCIDRMaskSize: /25
-------------------------

Number of podCIDRs: 32 --> max. Node count 
Number of IPs per podCIDRs: 128

With the configuration above, a Shoot cluster can at most have 32 nodes which are ready to run workload in the Pod network.