Guides

Walkthroughs of common activities

1 - Set Up Client Tools

1.1 - Fun with kubectl Aliases

Some bash tips that save you some time

Speed up Your Terminal Workflow

Use the Kubernetes command-line tool, kubectl, to deploy and manage applications on Kubernetes. Using kubectl, you can inspect cluster resources, as well as create, delete, and update components.

You will probably run more than a hundred kubectl commands on some days, so it pays to speed up your terminal workflow with some shortcuts. Of course, there are good shortcuts and bad shortcuts (lazy coding, lack of security review, etc.), but let’s stick with the positives and talk about a good shortcut: bash aliases in your .profile.

What are those mysterious .profile and .bash_profile files you’ve heard about?

Both are startup scripts that a login shell reads. The .bash_profile serves the same purpose as the .profile, but under a bash-specific name. When you log in, bash (in this case on OS X) looks for /etc/profile and loads it if it exists. Then it looks for ~/.bash_profile, ~/.bash_login, and finally ~/.profile, and loads only the first one of these that it finds.

Populating the .profile File

Here is the fantastic time saver that needs to be in your shell profile:

# time saver number one: shortcut for kubectl
#
alias k="kubectl"

# start a shell in a temporary pod; the pod is deleted when you leave the shell
#
alias ksh="kubectl run busybox -i --tty --image=busybox --restart=Never --rm -- sh"

# opens an ash shell (busybox ships ash, not bash)
#
alias kbash="kubectl run busybox -i --tty --image=busybox --restart=Never --rm -- ash"

# activates/exports the kubeconfig.yaml in the current working directory
# (single quotes, so that $PWD is expanded when the alias is used, not when the profile is loaded)
#
alias kexport='export KUBECONFIG="$PWD/kubeconfig.yaml"'

# usage: kurl http://your-svc.namespace.cluster.local
#
# ideally, build your own curl image for this; never trust an unknown image
alias kurl="docker run --rm byrnedo/alpine-curl"

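kubectl tab completion also works with the k alias once you register the alias with the completion function, e.g. by adding complete -o default -F __start_kubectl k after loading the kubectl bash completion, so you’re not losing that speed.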

1.2 - Kubeconfig Context as bash Prompt

Expose the active kubeconfig into bash

Overview

By default, the kubectl configuration is located at ~/.kube/config.

Let us suppose that you have two clusters, one for development work and one for scratch work.

How can you switch between them easily, without always copying the configuration you want to use to the default location?

Export the KUBECONFIG Environment Variable

bash$ export KUBECONFIG=<PATH-TO-MY-CONFIG>/kubeconfig-dev.yaml

How to determine which cluster is used by the kubectl command?

Determine Active Cluster

bash$ kubectl cluster-info
Kubernetes master is running at https://api.dev.garden.shoot.canary.k8s-hana.ondemand.com
KubeDNS is running at https://api.dev.garden.shoot.canary.k8s-hana.ondemand.com/api/v1/proxy/namespaces/kube-system/services/kube-dns

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
bash$ 

Display Cluster in the bash - Linux and Alike

I found this tip on Stack Overflow and think it is worth adding here.

Edit your ~/.bash_profile and add the following code snippet to show the current K8s context in the shell’s prompt:

prompt_k8s(){
  k8s_current_context=$(kubectl config current-context 2> /dev/null)
  if [[ $? -eq 0 ]] ; then echo -e "(${k8s_current_context}) "; fi
}
 
 
PS1+='$(prompt_k8s)'

After this, your bash command prompt contains the active KUBECONFIG context and you always know which cluster is active - develop or production.

For example:

bash$ export KUBECONFIG=/Users/d023280/Documents/workspace/gardener-ui/kubeconfig_gardendev.yaml 
bash (garden_dev)$ 

Note the (garden_dev) prefix in the bash command prompt.

This helps immensely to avoid thoughtless mistakes.

Display Cluster in the PowerShell - Windows

Display the current K8s cluster in the title of PowerShell window.

Create a profile file for your shell under %UserProfile%\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1

Copy the following code to Microsoft.PowerShell_profile.ps1:

 function prompt_k8s {
     $k8s_current_context = (kubectl config current-context) | Out-String
     if($?) {
         return $k8s_current_context
     }else {
         return "No K8s context found"
     }
 }

 $host.ui.rawui.WindowTitle = prompt_k8s

If you want to switch to a different cluster, set KUBECONFIG to the new value and re-run the file Microsoft.PowerShell_profile.ps1.

1.3 - Organizing Access Using kubeconfig Files

Overview

The kubectl command-line tool uses kubeconfig files to find the information it needs to choose a cluster and communicate with the API server of a cluster.

Problem

If you’ve become aware of a security breach that affects you, you may want to revoke or cycle credentials in case anything was leaked. However, this is not possible with the initial or master kubeconfig from your cluster.

Pitfall

For a productive cluster, never distribute the kubeconfig that you can download directly from the Gardener dashboard.

Create a Custom kubeconfig File for Each User

Create a separate kubeconfig for each user. One of the big advantages of this approach is that you can revoke each kubeconfig individually and control the permissions better. You can also limit a kubeconfig to single namespaces.

The script creates a new ServiceAccount with read privileges in the whole cluster (Secrets are excluded). To run the script, Deno, a secure TypeScript runtime, must be installed.

#!/usr/bin/env -S deno run --allow-run

/*
* This script creates a Kubernetes ServiceAccount and the other required resources and prints a KUBECONFIG to the console.
* Depending on your requirements, you might want to change the clusterRoleBindingTemplate() function.
*
* In order to execute this script, you need to install Deno https://deno.land/ (a TypeScript & JavaScript runtime).
* It is a single executable binary for the major OSs from the original author of Node.js.
* example: deno run --allow-run kubeconfig-for-custom-user.ts d00001
* example: deno run --allow-run kubeconfig-for-custom-user.ts d00001 --delete
*
* known issue: the shebang works under Linux, but not under the Windows Subsystem for Linux
*/

const KUBECTL = "/usr/local/bin/kubectl" //or
// const KUBECTL = "C:\\Program Files\\Docker\\Docker\\resources\\bin\\kubectl.exe"

const serviceAccName = Deno.args[0]
const deleteIt = Deno.args[1]
if (serviceAccName == undefined || serviceAccName == "--delete" ) {
    console.log("please provide username as an argument, for example: deno run --allow-run kubeconfig-for-custom-user.ts USER_NAME [--delete]")
    Deno.exit(1)
}

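if (deleteIt == "--delete") {
    // wait for each deletion to finish before exiting; `kubectl delete` does not print JSON,
    // so ignore the parse error that exec() would otherwise raise
    await exec([KUBECTL, "delete", "serviceaccount", serviceAccName]).catch(() => undefined)
    await exec([KUBECTL, "delete", "secret", `${serviceAccName}-secret`]).catch(() => undefined)
    await exec([KUBECTL, "delete", "clusterrolebinding", `view-${serviceAccName}-global`]).catch(() => undefined)
    Deno.exit(0)
}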

await exec([KUBECTL, "create", "serviceaccount", serviceAccName, "-o", "json"])

await exec([KUBECTL, "create", "-o", "json", "-f", "-"], secretYamlTemplate())
let secret = await exec([KUBECTL, "get", "secret", `${serviceAccName}-secret`, "-o", "json"])
let caCRT = secret.data["ca.crt"];
let userToken = atob(secret.data["token"]); //decode base64

let kubeConfig = await exec([KUBECTL, "config", "view", "--minify", "-o", "json"]);
let clusterApi = kubeConfig.clusters[0].cluster.server
let clusterName = kubeConfig.clusters[0].name

await exec([KUBECTL, "create", "-o", "json", "-f", "-"], clusterRoleBindingTemplate())

console.log(kubeConfigTemplate(caCRT, userToken, clusterApi, clusterName, serviceAccName + "-" + clusterName))

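// Runs a command, optionally feeds it stdInput, and returns its stdout parsed as JSON.
// The result is typed `any` because callers access fields such as `secret.data`.
async function exec(args: string[], stdInput?: string): Promise<any> {
    console.log("# "+args.join(" "))
    let opt: Deno.RunOptions = {
        cmd: args,
        stdout: "piped",
        stderr: "piped",
        stdin: "piped",
    };

    const p = Deno.run(opt);

    if (stdInput != undefined) {
        await p.stdin.write(new TextEncoder().encode(stdInput));
    }
    // always close stdin so that the child process never waits for more input
    p.stdin.close();

    // read the status and both pipes concurrently so that a full pipe cannot block the child
    const [status, output, stderrOutput] = await Promise.all([p.status(), p.output(), p.stderrOutput()])
    if (status.code === 0) {
        return JSON.parse(new TextDecoder().decode(output))
    } else {
        // report the failure on stderr instead of silently dropping it
        console.error(new TextDecoder().decode(stderrOutput))
        return ""
    }
}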

function clusterRoleBindingTemplate() {
    return `
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: view-${serviceAccName}-global
subjects:
- kind: ServiceAccount
  name: ${serviceAccName}
  namespace: default
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io    
`
}

function secretYamlTemplate() {
    return `
apiVersion: v1
kind: Secret
metadata:
  name: ${serviceAccName}-secret
  annotations:
    kubernetes.io/service-account.name: ${serviceAccName}
type: kubernetes.io/service-account-token`
}

function kubeConfigTemplate(certificateAuthority: string, token: string, clusterApi: string, clusterName: string, username: string) {
    return `
## KUBECONFIG generated on ${new Date()}
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ${certificateAuthority}
    server: ${clusterApi}
  name: ${clusterName}
contexts:
- context:
    cluster: ${clusterName}
    user: ${username}
  name: ${clusterName}
current-context: ${clusterName}
kind: Config
preferences: {}
users:
- name: ${username}
  user:
    token: ${token}
`
}

If edit or admin rights are to be assigned, the ClusterRoleBinding must be adapted in the roleRef section with the roles listed below.

Furthermore, you can restrict this to a single namespace by not creating a ClusterRoleBinding but only a RoleBinding within the desired namespace.
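
As a minimal sketch of such a namespace-scoped binding, the following RoleBinding grants the built-in view ClusterRole to a ServiceAccount in one namespace only. The names d00001 and my-namespace are placeholders chosen for illustration; the ServiceAccount is assumed to live in the default namespace, as in the script above.

```yaml
# Sketch only: "d00001" and "my-namespace" are placeholder names.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-d00001
  namespace: my-namespace        # the permissions apply only inside this namespace
subjects:
- kind: ServiceAccount
  name: d00001
  namespace: default             # namespace in which the ServiceAccount was created
roleRef:
  kind: ClusterRole
  name: view                     # or "edit" / "admin", see the roles listed below
  apiGroup: rbac.authorization.k8s.io
```

A RoleBinding may reference a ClusterRole; the permissions are then granted only within the namespace of the RoleBinding.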

| Default ClusterRole | Default ClusterRoleBinding | Description |
| --- | --- | --- |
| cluster-admin | system:masters group | Allows super-user access to perform any action on any resource. When used in a ClusterRoleBinding, it gives full control over every resource in the cluster and in all namespaces. When used in a RoleBinding, it gives full control over every resource in the rolebinding’s namespace, including the namespace itself. |
| admin | None | Allows admin access, intended to be granted within a namespace using a RoleBinding. If used in a RoleBinding, allows read/write access to most resources in a namespace, including the ability to create roles and rolebindings within the namespace. It does not allow write access to resource quota or to the namespace itself. |
| edit | None | Allows read/write access to most objects in a namespace. It does not allow viewing or modifying roles or rolebindings. |
| view | None | Allows read-only access to see most objects in a namespace. It does not allow viewing roles or rolebindings. It does not allow viewing secrets, since those are escalating. |

2 - High Availability

2.1 - Best Practices

Implementing High Availability and Tolerating Zone Outages

Developing highly available workloads that can tolerate a zone outage is no trivial task. You will find here various recommendations to get closer to that goal. While many recommendations are general enough, the examples are specific in how to achieve this in a Gardener-managed cluster and where/how to tweak the different control plane components. If you do not use Gardener, it may still be a worthwhile read.

First however, what is a zone outage? It sounds like a clear-cut “thing”, but it isn’t. There are many things that can go haywire. Here are some examples:

  • Elevated cloud provider API error rates for individual or multiple services
  • Network bandwidth reduced or latency increased, usually also affecting storage subsystems as they are network attached
  • No networking at all, no DNS, machines shutting down or restarting, …
  • Functional issues, of either the entire service (e.g. all block device operations) or only parts of it (e.g. LB listener registration)
  • All services down, temporarily or permanently (the proverbial burning down data center 🔥)

This and everything in between make it hard to prepare for such events, but you can still do a lot. The most important recommendation is to not target specific issues exclusively - tomorrow another service will fail in an unanticipated way. Also, focus more on meaningful availability than on internal signals (useful, but not as relevant as the former). Always prefer automation over manual intervention (e.g. leader election is a pretty robust mechanism, auto-scaling may be required as well, etc.).

Also remember that HA is costly - as silly as this may sound, you need to balance it against the cost of an outage, e.g. running all this excess capacity “just in case” vs. “going down” vs. a risk-based approach in between, where you have means that will kick in, but are not guaranteed to work (e.g. if the cloud provider is out of resource capacity). Maybe some of your components must run at the highest possible availability level, but others not - that’s a decision only you can make.

Control Plane

The Kubernetes cluster control plane is managed by Gardener (as pods in separate infrastructure clusters to which you have no direct access) and can be set up with no failure tolerance (control plane pods will be recreated best-effort when resources are available) or one of the failure tolerance types node or zone.

Strictly speaking, static workload does not depend on the (high) availability of the control plane. However, static workload doesn’t rhyme with Cloud and Kubernetes, and it also means that when you possibly need it the most, e.g. during a zone outage, critical self-healing or auto-scaling functionality won’t be available to you and your workload if your control plane is down as well. That’s why, even though the resource consumption is significantly higher, we generally recommend using the failure tolerance type zone for the control planes of productive clusters, at least in all regions that have 3+ zones. Regions that have only 1 or 2 zones don’t support the failure tolerance type zone. There, your second best option is the failure tolerance type node, which means a zone outage can still take down your control plane, but individual node outages won’t.

In the shoot resource, this is all you need to add:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone # valid values are `node` and `zone` (only available if your control plane resides in a region with 3+ zones)

This setting will scale out all control plane components for a Gardener cluster as necessary, so that no single zone outage can take down the control plane for longer than the few seconds it takes for the fail-over to take place (e.g. lease expiration and new leader election or readiness probe failure and endpoint removal). Components run highly available in either active-active (servers) or active-passive (controllers) mode at all times. The persistence (ETCD), which is consensus-based, will tolerate the loss of one zone and still maintain quorum and therefore remain operational. These are all patterns that we will revisit down below also for your own workload.

Worker Pools

Now that you have configured your Kubernetes cluster control plane in HA, i.e. spread it across multiple zones, you need to do the same for your own workload, but in order to do so, you need to spread your nodes across multiple zones first.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  provider:
    workers:
    - name: ...
      minimum: 6
      maximum: 60
      zones:
      - ...

Prefer regions with at least 2, better 3+ zones and list the zones in the zones section for each of your worker pools. Whether you need 2 or 3 zones at a minimum depends on your fail-over concept:

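  • Consensus-based software components (like ETCD) depend on maintaining a quorum of ⌊n/2⌋+1 members (e.g. 2 of 3), so you need at least 3 zones to tolerate the outage of 1 zone. With only 2 zones, one of them always holds at least half of the members and its outage breaks the quorum.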
  • Primary/Secondary-based software components need just 2 zones to tolerate the outage of 1 zone.
  • Then there are software components that can scale out horizontally. They are probably fine with 2 zones, but you also need to think about the load-shift and that the remaining zone must then pick up the work of the unhealthy zone. With 2 zones, the remaining zone must cope with an increase of 100% load. With 3 zones, the remaining zones must only cope with an increase of 50% load (per zone).

In general, the question is also whether you have the fail-over capacity already up and running or not. If not, i.e. you depend on re-scheduling to a healthy zone or auto-scaling, be aware that during a zone outage, you will see a resource crunch in the healthy zones. If you have no automation, i.e. only human operators (a.k.a. “red button approach”), you probably will not get the machines you need and even with automation, it may be tricky. But holding the capacity available at all times is costly. In the end, that’s a decision only you can make. If you made that decision, please adapt the minimum, maximum, maxSurge and maxUnavailable settings for your worker pools accordingly (visit this section for more information).
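
As an illustrative sketch of these settings, assume a region with three zones and the decision to keep two nodes per zone running at all times. The pool name and the zone names are placeholders, and the numbers are examples, not recommendations.

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  provider:
    workers:
    - name: worker-pool-1      # placeholder name
      minimum: 6               # e.g. 2 nodes per zone that are always running
      maximum: 60              # upper bound for auto-scaling during a load shift
      maxSurge: 3              # extra nodes that may be created during a rolling update
      maxUnavailable: 0        # do not remove nodes before their replacements are ready
      zones:                   # placeholder zone names; use the zones of your region
      - zone-a
      - zone-b
      - zone-c
```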

Also, consider fall-back worker pools (with different/alternative machine types) and cluster autoscaler expanders using a priority-based strategy.

Gardener-managed clusters deploy the cluster autoscaler or CA for short and you can tweak the general CA knobs for Gardener-managed clusters like this:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    clusterAutoscaler:
      expander: "least-waste"
      scanInterval: 10s
      scaleDownDelayAfterAdd: 60m
      scaleDownDelayAfterDelete: 0s
      scaleDownDelayAfterFailure: 3m
      scaleDownUnneededTime: 30m
      scaleDownUtilizationThreshold: 0.5

If you want to be ready for a sudden spike or have some buffer in general, over-provision nodes by means of “placeholder” pods with low priority and appropriate resource requests. This way, they will demand nodes to be provisioned for them, but if any pod comes up with a regular/higher priority, the low priority pods will be evicted to make space for the more important ones. Strictly speaking, this is not related to HA, but it may be important to keep this in mind as you generally want critical components to be rescheduled as fast as possible and if there is no node available, it may take 3 minutes or longer to do so (depending on the cloud provider). Besides, not only zones can fail, but also individual nodes.
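
The over-provisioning pattern described above could look like the following sketch. All names, the replica count, and the resource requests are assumptions for illustration; size them according to the buffer you want to hold. The priority of -1 is chosen to be lower than the default pod priority of 0, but still high enough for the cluster autoscaler (with its default settings) to provision nodes for these pods.

```yaml
# Sketch only: names, replica count, and requests are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                          # lower than the default priority 0 of regular pods
globalDefault: false
description: "Priority class for placeholder pods that reserve spare capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                      # size of the buffer in placeholder pods
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing, only holds the requested resources
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```

When a pod with regular priority does not fit, the scheduler preempts the placeholder pods, which then become pending and cause new nodes to be provisioned in the background.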

Replicas (Horizontal Scaling)

Now let’s talk about your workload. In most cases, this will mean running multiple replicas. If you cannot do that (i.e. you have a singleton), that’s a bad situation to be in. Maybe you can run a spare (secondary) as backup? If you cannot, you depend on quick detection and rescheduling of your singleton (more on that below).

Obviously, things get messier with persistence. If you have persistence, you should ideally replicate your data, i.e. let your spare (secondary) “follow” your main (primary). If your software doesn’t support that, you have to deploy other means, e.g. volume snapshotting or side-backups (specific to the software you deploy; keep the backups regional, so that you can switch to another zone at all times). If you have to do those, your HA scenario becomes more a DR scenario and terms like RPO and RTO become relevant to you:

  • Recovery Point Objective (RPO): Potential data loss, i.e. how much data will you lose at most (time between backups)
  • Recovery Time Objective (RTO): Time until recovery, i.e. how long does it take you to be operational again (time to restore)

Also, keep in mind that your persistent volumes are usually zonal, i.e. once you have a volume in one zone, it’s bound to that zone and you cannot bring up your pod in another zone without first recreating the volume yourself (Kubernetes won’t help you here directly).

Anyway, best avoid that, if you can (from technical and cost perspective). The best solution (and also the most costly one) is to run multiple replicas in multiple zones and keep your data replicated at all times, so that your RPO is always 0 (best). That’s what we do for Gardener-managed cluster HA control planes (ETCD) as any data loss may be disastrous and lead to orphaned resources (in addition, we deploy side cars that do side-backups for disaster recovery, with full and incremental snapshots with an RPO of 5m).

So, how to run with multiple replicas? That’s the easiest part in Kubernetes and the two most important resources, Deployment and StatefulSet, support that out of the box:

apiVersion: apps/v1
kind: Deployment | StatefulSet
spec:
  replicas: ...

The problem comes with the number of replicas. It’s easy only if the number is static, e.g. 2 for active-active/passive or 3 for consensus-based software components, but what about software components that can scale out horizontally? Here you usually do not set the number of replicas statically, but make use of the horizontal pod autoscaler or HPA for short (built-in; part of the kube-controller-manager). There are also other options like the cluster proportional autoscaler, but while the former works based on metrics, the latter is more of a guesstimate approach that derives the number of replicas from the number of nodes/cores in a cluster. Sometimes useful, but often blind to the actual demand.

So, HPA it is then for most of the cases. However, what is the resource (e.g. CPU or memory) that drives the number of desired replicas? Again, this is up to you, but CPU or memory are not always the best choices. In some cases, custom metrics may be more appropriate, e.g. requests per second (as it was for us).

You will have to create specific HorizontalPodAutoscaler resources for your scale target and can tweak the general HPA knobs for Gardener-managed clusters like this:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubeControllerManager:
      horizontalPodAutoscaler:
        syncPeriod: 15s
        tolerance: 0.1
        downscaleStabilization: 5m0s
        initialReadinessDelay: 30s
        cpuInitializationPeriod: 5m0s
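
A HorizontalPodAutoscaler resource for a scale target could look like the following sketch. The target name my-app, the replica bounds, and the utilization target are assumptions for illustration; a custom metric such as requests per second would use a different entry in the metrics list.

```yaml
# Sketch only: "my-app" and all numbers are illustrative assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3                   # e.g. at least one replica per zone
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # percent of the pods' CPU requests
```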

Resources (Vertical Scaling)

While it is important to set a sufficient number of replicas, it is also important to give the pods sufficient resources (CPU and memory). This is especially true when you think about HA. When a zone goes down, you might need to bring up replacement pods, if you don’t have them running already, to take over the load from the impacted zone. Likewise, e.g. with active-active software components, you can expect the remaining pods to receive more load. If you cannot scale them out horizontally to serve the load, you will probably need to scale them out (or rather up) vertically. This is done by the vertical pod autoscaler or VPA for short (not built-in; part of the kubernetes/autoscaler repository).

A few caveats though:

  • You cannot use HPA and VPA on the same metrics as they would influence each other, which would lead to pod thrashing (more replicas require fewer resources; fewer resources require more replicas)
  • Scaling horizontally doesn’t cause downtimes (at least not when out-scaling and only one replica is affected when in-scaling), but scaling vertically does (if the pod runs OOM anyway, but also when new recommendations are applied, resource requests for existing pods may be changed, which causes the pods to be rescheduled). Although the discussion has been going on for a very long time, changing the resources in-place is still not supported (see KEP 1287, implementation in Kubernetes, implementation in VPA).

VPA is a useful tool and Gardener-managed clusters deploy a VPA by default for you (HPA is supported anyway as it’s built into the kube-controller-manager). You will have to create specific VerticalPodAutoscaler resources for your scale target and can tweak the general VPA knobs for Gardener-managed clusters like this:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    verticalPodAutoscaler:
      enabled: true
      evictAfterOOMThreshold: 10m0s
      evictionRateBurst: 1
      evictionRateLimit: -1
      evictionTolerance: 0.5
      recommendationMarginFraction: 0.15
      updaterInterval: 1m0s
      recommenderInterval: 1m0s

While horizontal pod autoscaling is relatively straight-forward, it takes a long time to master vertical pod autoscaling. We saw performance issues, hard-coded behavior (on OOM, memory is bumped by +20% and it may take a few iterations to reach a good level), unintended pod disruptions by applying new resource requests (after 12h all targeted pods will receive new requests even though individually they would be fine without, which also drives active-passive resource consumption up), difficulties dealing with spiky workload in general (due to the algorithmic approach it takes), recommended requests that exceed node capacity, proportional and therefore often questionable limit scaling, and more. VPA is a double-edged sword: useful and necessary, but not easy to handle.
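
A VerticalPodAutoscaler resource for a scale target could look like the following sketch. The target name my-app and all values are assumptions for illustration. The maxAllowed bounds are one way to keep recommendations below your node capacity, and controlledValues: RequestsOnly leaves the limits untouched instead of scaling them proportionally.

```yaml
# Sketch only: "my-app" and all values are illustrative assumptions.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"               # apply recommendations by evicting and recreating pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"             # applies to all containers of the pod
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:                    # keep this below what your nodes can offer
        cpu: "2"
        memory: 4Gi
      controlledValues: RequestsOnly # only adjust requests, never limits
```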

For the Gardener-managed components, we mostly removed limits. Why?

  • CPU limits have almost always only downsides. They cause needless CPU throttling, which is not even easily visible. CPU requests turn into cpu shares, so if the node has capacity, the pod may consume the freely available CPU, but not if you have set limits, which curtail the pod by means of cpu quota. There are only certain scenarios in which they may make sense, e.g. if you set requests=limits and thereby define a pod with guaranteed QoS, which influences your cgroup placement. However, that is difficult to do for the components you implement yourself and practically impossible for the components you just consume, because what’s the correct value for requests/limits and will it hold true also if the load increases and what happens if a zone goes down or with the next update/version of this component? If anything, CPU limits caused outages, not helped prevent them.
  • As for memory limits, they are slightly more useful, because CPU is compressible and memory is not, so if one pod runs berserk, it may take others down (with CPU, cpu shares make it as fair as possible), depending on which OOM killer strikes (a complicated topic by itself). You don’t want the operating system OOM killer to strike as the result is unpredictable. Better it is the cgroup OOM killer or, if the consumption grows slowly enough, even the kubelet’s eviction, as the latter also takes priorities into consideration. If your component is critical and a singleton (e.g. node daemon set pods), you are better off also without memory limits, because letting the pod go OOM because of artificial/wrong memory limits can mean that the node becomes unusable. Hence, such components are better run with no or only a very high memory limit, so that you can still catch the occasional memory leak (bug) eventually. Under normal operation, if you cannot decide on a true upper limit, rather have no limits than cause endless outages through them, especially when you need the pods the most (during a zone outage) and all your assumptions have gone out of the window.

The downside of having poor or no limits and poor or no requests is that nodes may “die” more often. Contrary to the expectation, even for managed services, the managed service is not responsible for and cannot guarantee the health of a node under all circumstances, since the end user defines what is run on the nodes (shared responsibility). If the workload exhausts any resource, it will be the end of the node, e.g. by compressing the CPU too much (so that the kubelet fails to do its work), or by exhausting the main memory too fast, disk space, file handles, or any other resource.

The kubelet allows for explicit reservation of resources for operating system daemons (system-reserved) and Kubernetes daemons (kube-reserved) that are subtracted from the actual node resources and become the allocatable node resources for your workload/pods. All managed services configure these settings “by rule of thumb” (a balancing act), but cannot guarantee that the values won’t waste resources or always will be sufficient. You will have to fine-tune them eventually and adapt them to your needs. In addition, you can configure soft and hard eviction thresholds to give the kubelet some headroom to evict “greedy” pods in a controlled way. These settings can be configured for Gardener-managed clusters like this:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubelet:
      kubeReserved:                            # explicit resource reservation for Kubernetes daemons
        cpu: 100m
        memory: 1Gi
        ephemeralStorage: 1Gi
        pid: 1000
      evictionSoft:                            # soft, i.e. graceful eviction (used if the node is about to run out of resources, avoiding hard evictions)
        memoryAvailable: 200Mi
        imageFSAvailable: 10%
        imageFSInodesFree: 10%
        nodeFSAvailable: 10%
        nodeFSInodesFree: 10%
      evictionSoftGracePeriod:                 # caps pod's `terminationGracePeriodSeconds` value during soft evictions (specific grace periods)
        memoryAvailable: 1m30s
        imageFSAvailable: 1m30s
        imageFSInodesFree: 1m30s
        nodeFSAvailable: 1m30s
        nodeFSInodesFree: 1m30s
      evictionHard:                            # hard, i.e. immediate eviction (used if the node is out of resources, so that the OS does not run out of resources and fail processes indiscriminately)
        memoryAvailable: 100Mi
        imageFSAvailable: 5%
        imageFSInodesFree: 5%
        nodeFSAvailable: 5%
        nodeFSInodesFree: 5%
      evictionMinimumReclaim:                  # additional resources to reclaim after hitting the hard eviction thresholds to not hit the same thresholds soon after again
        memoryAvailable: 0Mi
        imageFSAvailable: 0Mi
        imageFSInodesFree: 0Mi
        nodeFSAvailable: 0Mi
        nodeFSInodesFree: 0Mi
      evictionMaxPodGracePeriod: 90            # caps pod's `terminationGracePeriodSeconds` value during soft evictions (general grace periods)
      evictionPressureTransitionPeriod: 5m0s   # stabilization time window to avoid flapping of node eviction state

You can tweak these settings also individually per worker pool (spec.provider.workers.kubernetes.kubelet...), which makes sense especially with different machine types (and also workload that you may want to schedule there).

Physical memory is not compressible, but you can overcome this issue to some degree (alpha since Kubernetes v1.22 in combination with the feature gate NodeSwap on the kubelet) with swap memory. You can read more in this introductory blog and the docs. If you choose to use it (still only alpha at the time of this writing), you may want to consider also the risks associated with swap memory:

  • Reduced performance predictability
  • Reduced performance up to page thrashing
  • Reduced security as secrets, normally held only in memory, could be swapped out to disk

That said, the various options mentioned above are only remotely related to HA and will not be further explored throughout this document. Just to remind you: if a zone goes down, load patterns will shift. Existing pods will probably receive more load and will require more resources (especially because it is often practically impossible to set “proper” resource requests, which drive node allocation - limits are always ignored by the scheduler), or more pods will/must be placed on the existing and/or new nodes. These settings, which are generally critical (especially if you switch on bin-packing for Gardener-managed clusters as a cost saving measure), will then become even more critical during a zone outage.

Probes

Before we go down the rabbit hole even further and talk about how to spread your replicas, we need to talk about probes first, as they will become relevant later. Kubernetes supports three kinds of probes: startup, liveness, and readiness probes. If you are a visual thinker, also check out this slide deck by Tim Hockin (Kubernetes networking SIG chair).

Basically, the startupProbe and the livenessProbe help you restart the container if it is unhealthy for whatever reason, by letting the kubelet, which orchestrates your containers on a node, know that it is unhealthy. The former is a special case of the latter and is only applied at the startup of your container, in case you need to handle the startup phase differently (e.g. with very slow starting containers) from the rest of the lifetime of the container.

Now, the readinessProbe helps you manage the ready status of your container and thereby of your pod (any container that is not ready turns the pod not ready). This in turn has an impact on endpoints and pod disruption budgets:

  • If the pod is not ready, the endpoint will be removed and the pod will not receive traffic anymore
  • If the pod is not ready, it counts against the pod disruption budget, and if the budget is exhausted, no further voluntary pod disruptions will be permitted for the remaining ready pods (e.g. no eviction and no voluntary horizontal or vertical scaling; if the pod runs on a node that is about to be drained or is being drained, draining will be paused until the max drain timeout passes)

As you can see, all of these probes are (also) related to HA (mostly the readinessProbe, but depending on your workload, you can also incorporate livenessProbe and startupProbe into your HA strategy). If Kubernetes doesn’t know about the individual status of your container/pod, it won’t do anything for you (right away). That said, something might happen later and indirectly via the node status, which can also be ready or not ready and which influences the pods and load balancer listener registration (a not ready node will not receive cluster traffic anymore). However, this process is global to the worker pool, reacts with a delay, and doesn’t discriminate between the containers/pods on a node.

In addition, Kubernetes also offers pod readiness gates to amend your pod readiness with additional custom conditions (normally, only the sum of the container readiness matters, but pod readiness gates additionally count into the overall pod readiness). This may be useful if you want to block (by means of pod disruption budgets that we will talk about next) the roll-out of your workload/nodes in case some (possibly external) condition fails.

Pod Disruption Budgets

One of the most important resources that help you on your way to HA is the pod disruption budget, or PDB for short. PDBs tell Kubernetes how to deal with voluntary pod disruptions, e.g. during the deployment of your workload, when the nodes are rolled, or just in general when a pod shall be evicted/terminated. Basically, if the budget is reached, they block all voluntary pod disruptions (at least for a while until possibly other timeouts act or things happen that leave Kubernetes no choice anymore, e.g. the node is forcefully terminated). You should always define them for your workload.

It is very important to note that they are based on the readinessProbe, i.e. even if all of your replicas are alive, but not enough of them are ready, voluntary pod disruptions are blocked, which makes PDBs both critical and useful. Here is an example (you can specify either minAvailable or maxUnavailable, in absolute numbers or as a percentage):

apiVersion: policy/v1
kind: PodDisruptionBudget
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      ...

And please do not specify a PDB with maxUnavailable set to 0 or similar. That is pointless, even detrimental, as it then blocks even useful operations and always forces the hard timeouts, which are less graceful, and it doesn’t make sense in the context of HA. You cannot “force” HA by preventing voluntary pod disruptions, you must work with the pod disruptions in a resilient way. Besides, PDBs are really only about voluntary pod disruptions - something bad can happen to a node/pod at any time and PDBs won’t make this reality go away for you.
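The budget arithmetic described above can be sketched in a few lines of Python. This is a simplified model, not the actual disruption controller: the function names are hypothetical, only ready pods count as healthy, and percentages are rounded up as described in the Kubernetes documentation.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def scaled(value: IntOrPercent, total: int) -> int:
    """Resolve an absolute number or a percentage string such as "50%" (rounded up)."""
    if isinstance(value, str):
        percent = int(value.rstrip("%"))
        return (percent * total + 99) // 100  # integer ceiling of percent * total / 100
    return value


def disruptions_allowed(expected: int, ready: int,
                        min_available: Optional[IntOrPercent] = None,
                        max_unavailable: Optional[IntOrPercent] = None) -> int:
    """Simplified model: how many ready pods may still be disrupted voluntarily."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("specify exactly one of min_available or max_unavailable")
    if max_unavailable is not None:
        desired_healthy = max(expected - scaled(max_unavailable, expected), 0)
    else:
        desired_healthy = scaled(min_available, expected)
    return max(ready - desired_healthy, 0)


# 3 replicas, all ready, maxUnavailable: 1 -> one voluntary disruption is possible
print(disruptions_allowed(expected=3, ready=3, max_unavailable=1))
# 3 replicas alive, but only 2 ready -> the budget is used up
print(disruptions_allowed(expected=3, ready=2, max_unavailable=1))
# maxUnavailable: 0 never permits a voluntary disruption
print(disruptions_allowed(expected=3, ready=3, max_unavailable=0))
```

Note how a single replica that is not ready already consumes the whole budget of maxUnavailable: 1, and how maxUnavailable: 0 blocks every voluntary disruption regardless of readiness.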

PDBs will not always work as expected and can also get in your way, e.g. if the PDB is violated or would be violated, it may possibly block whatever you are trying to do to salvage the situation, e.g. drain a node or deploy a patch version (if the PDB is or would be violated, not even unhealthy pods would be evicted as they could theoretically become healthy again, which Kubernetes doesn’t know). In order to overcome this issue, it is now possible (alpha since Kubernetes v1.26 in combination with the feature gate PDBUnhealthyPodEvictionPolicy on the API server, beta and enabled by default since Kubernetes v1.27) to configure the so-called unhealthy pod eviction policy. The default is still IfHealthyBudget as a change in default would have changed the behavior (as described above), but you can now also set AlwaysAllow at the PDB (spec.unhealthyPodEvictionPolicy). For more information, please check out this discussion, the PR and this document and balance the pros and cons for yourself. In short, the new AlwaysAllow option is probably the better choice in most of the cases while IfHealthyBudget is useful only if you have frequent temporary transitions or for special cases where you have already implemented controllers that depend on the old behavior.
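To make the difference between the two policies more tangible, here is a minimal Python sketch of the eviction decision. It is a simplified model under stated assumptions, not the API server code: eviction_permitted is a hypothetical helper, and current_healthy and desired_healthy stand for the number of ready pods and the number of ready pods the PDB demands.

```python
def eviction_permitted(pod_ready: bool, current_healthy: int, desired_healthy: int,
                       policy: str = "IfHealthyBudget") -> bool:
    """Simplified model of whether a voluntary eviction passes the PDB check."""
    if policy not in ("IfHealthyBudget", "AlwaysAllow"):
        raise ValueError(f"unknown policy: {policy}")
    if pod_ready:
        # Ready pods may only be evicted while budget is left.
        return current_healthy > desired_healthy
    if policy == "AlwaysAllow":
        # Pods that are not ready may always be evicted.
        return True
    # IfHealthyBudget: pods that are not ready may only go if the budget is not violated.
    return current_healthy >= desired_healthy


# 3 replicas with maxUnavailable: 1 (desired_healthy = 2), but only 1 pod is ready
print(eviction_permitted(False, current_healthy=1, desired_healthy=2))
print(eviction_permitted(False, current_healthy=1, desired_healthy=2, policy="AlwaysAllow"))
```

In this scenario the default policy refuses to evict even the broken pods, which is exactly the situation that can block a node drain or a patch roll-out, whereas AlwaysAllow lets them go.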

Pod Topology Spread Constraints

Pod topology spread constraints or PTSC for short (no official abbreviation exists, but we will use this in the following) are enormously helpful to distribute your replicas across multiple zones, nodes, or any other user-defined topology domain. They complement and improve on pod (anti-)affinities that still exist and can be used in combination.

PTSCs are an improvement, because they allow for maxSkew and minDomains. You can steer the “level of tolerated imbalance” with maxSkew, e.g. you probably want that to be at least 1, so that you can perform a rolling update, but this all depends on your deployment (maxUnavailable and maxSurge), etc. Stateful sets are a bit different (maxUnavailable) as they are bound to volumes and depend on them, so there usually cannot be 2 pods requiring the same volume.

minDomains is a hint to tell the scheduler how far to spread, e.g. if all nodes in one zone disappeared because of a zone outage, it may “appear” as if there are only 2 zones in a 3-zone cluster and the scheduling decisions may end up wrong, so a minDomains of 3 will tell the scheduler to spread to 3 zones before adding another replica in one zone. Be careful with this setting, as it also means that if one zone is down, the “spread” is already at least 1 if pods run in the other zones. This is useful where you have exactly as many replicas as you have zones and you do not want any imbalance.

Imbalance is critical, because if you end up with one, nobody is going to do the (active) re-balancing for you (unless you deploy and configure additional non-standard components such as the descheduler). So, for instance, if you have something like a DBMS that you want to spread across 2 zones (active-passive) or 3 zones (consensus-based), you had better specify a minDomains of 2 or 3, respectively, to force your replicas into at least that many zones before adding more replicas to another zone (if supported).
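The interplay of maxSkew and minDomains can be illustrated with a small Python sketch. It is a simplified model of the scheduler check for a hard constraint (whenUnsatisfiable: DoNotSchedule), not the scheduler code itself: placement_allowed is a hypothetical helper and pods_per_zone contains only the zones the scheduler can currently see.

```python
from typing import Dict, Optional


def placement_allowed(pods_per_zone: Dict[str, int], target_zone: str,
                      max_skew: int, min_domains: Optional[int] = None) -> bool:
    """Check whether one more matching pod may be placed into target_zone."""
    global_minimum = min(pods_per_zone.values())
    # With fewer visible domains than minDomains, the global minimum is treated as 0.
    if min_domains is not None and len(pods_per_zone) < min_domains:
        global_minimum = 0
    skew = pods_per_zone[target_zone] + 1 - global_minimum
    return skew <= max_skew


# Zone "c" of a 3-zone cluster is down, so only 2 zones are visible.
visible = {"a": 1, "b": 1}
print(placement_allowed(visible, "a", max_skew=1))                 # zone outage goes unnoticed
print(placement_allowed(visible, "a", max_skew=1, min_domains=3))  # missing zone is accounted for
```

Without minDomains, the second replica in zone "a" is accepted because the two visible zones look balanced. With a minDomains of 3, the missing zone counts as a domain with 0 pods, so the placement would exceed the maxSkew of 1 and the pod stays pending.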

Anyway, PTSCs are critical to have, but not perfect. We saw (unsurprisingly, because that’s how the scheduler works) that the scheduler may block the deployment of new pods, because it takes its decisions pod by pod (see for instance #109364).

Pod Affinities and Anti-Affinities

As said, you can combine PTSCs with pod affinities and/or anti-affinities. Especially inter-pod (anti-)affinities may be helpful to place pods apart, e.g. because they are fall-backs for each other or you do not want multiple potentially resource-hungry “best-effort” or “burstable” pods side-by-side (noisy neighbor problem), or together, e.g. because they form a unit and you want to reduce the failure domain, reduce the network latency, and reduce the costs.

Topology Aware Hints

While topology aware hints are not directly related to HA, they are very relevant in the HA context. Spreading your workload across multiple zones may increase network latency and cost significantly, if the traffic is not shaped. Topology aware hints (beta since Kubernetes v1.23, replacing the now deprecated topology aware traffic routing with topology keys) help to route the traffic within the originating zone, if possible. Basically, they tell kube-proxy how to set up your routing information, so that clients can talk to endpoints that are located within the same zone.

Be aware, however, that there are some limitations. Those are called safeguards and if they strike, the hints are off and traffic is routed randomly again. Especially controversial is the balancing limitation, as it rests on the assumption that the load that hits an endpoint is determined by the allocatable CPUs in that topology zone, which is often not the case (see for instance #113731 and #110714). So, this limitation hits far too often and your hints are off, but then again, it’s about network latency and cost optimization first, so it’s better than nothing.

Networking

We have talked about networking only to some small degree so far (readiness probes, pod disruption budgets, topology aware hints). The most important component is probably your ingress load balancer - everything else is managed by Kubernetes. AWS, Azure, GCP, and also OpenStack offer multi-zonal load balancers, so make use of them. In Azure and GCP, LBs are regional whereas in AWS and OpenStack, they need to be bound to a zone, which the cloud-controller-manager does by observing the zone labels at the nodes (please note that this behavior is not always working as expected, see #570 where the AWS cloud-controller-manager is not readjusting to newly observed zones).

Please be reminded that even if you use a service mesh like Istio, the off-the-shelf installation/configuration rarely comes with production-ready settings (to simplify first-time installation and improve first-time user experience) and you will have to fine-tune your installation/configuration, much like the rest of your workload.

Relevant Cluster Settings

The following is a summary of the more relevant settings you may want to tune for Gardener-managed clusters:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone # valid values are `node` and `zone` (only available if your control plane resides in a region with 3+ zones)
  kubernetes:
    kubeAPIServer:
      defaultNotReadyTolerationSeconds: 300
      defaultUnreachableTolerationSeconds: 300
    kubelet:
      ...
    kubeScheduler:
      featureGates:
        MinDomainsInPodTopologySpread: true
    kubeControllerManager:
      nodeMonitorGracePeriod: 40s
      horizontalPodAutoscaler:
        syncPeriod: 15s
        tolerance: 0.1
        downscaleStabilization: 5m0s
        initialReadinessDelay: 30s
        cpuInitializationPeriod: 5m0s
    verticalPodAutoscaler:
      enabled: true
      evictAfterOOMThreshold: 10m0s
      evictionRateBurst: 1
      evictionRateLimit: -1
      evictionTolerance: 0.5
      recommendationMarginFraction: 0.15
      updaterInterval: 1m0s
      recommenderInterval: 1m0s
    clusterAutoscaler:
      expander: "least-waste"
      scanInterval: 10s
      scaleDownDelayAfterAdd: 60m
      scaleDownDelayAfterDelete: 0s
      scaleDownDelayAfterFailure: 3m
      scaleDownUnneededTime: 30m
      scaleDownUtilizationThreshold: 0.5
  provider:
    workers:
    - name: ...
      minimum: 6
      maximum: 60
      maxSurge: 3
      maxUnavailable: 0
      zones:
      - ... # list of zones you want your worker pool nodes to be spread across, see above
      kubernetes:
        kubelet:
          ... # similar to `kubelet` above (cluster-wide settings), but here per worker pool (pool-specific settings), see above
      machineControllerManager: # optional, it allows to configure the machine-controller settings.
        machineCreationTimeout: 20m
        machineHealthTimeout: 10m
        machineDrainTimeout: 60h
  systemComponents:
    coreDNS:
      autoscaling:
        mode: horizontal # valid values are `horizontal` (driven by CPU load) and `cluster-proportional` (driven by number of nodes/cores)

On spec.controlPlane.highAvailability.failureTolerance.type

If set, determines the degree of failure tolerance for your control plane. zone is preferred, but only available if your control plane resides in a region with 3+ zones. See above and the docs.

On spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds and defaultNotReadyTolerationSeconds

This is a very interesting API server setting that lets Kubernetes decide how fast to evict pods from nodes whose status condition of type Ready is either Unknown (node status unknown, a.k.a unreachable) or False (kubelet not ready) (see node status conditions; please note that kubectl shows both values as NotReady which is a somewhat “simplified” visualization).

You can also override the cluster-wide API server settings individually per pod:

spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 0
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 0

This will evict pods on unreachable or not-ready nodes immediately, but be cautious: 0 is very aggressive and may lead to unnecessary disruptions. Again, you must decide for your own workload and balance out the pros and cons (e.g. long startup time).

Please note, these settings replace spec.kubernetes.kubeControllerManager.podEvictionTimeout that was deprecated with Kubernetes v1.26 (and acted as an upper bound).

On spec.kubernetes.kubeScheduler.featureGates.MinDomainsInPodTopologySpread

Required to be enabled for minDomains to work with PTSCs (beta since Kubernetes v1.25, but off by default). See above and the docs. This tells the scheduler how many topology domains to expect (= zones in the context of this document).

On spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod

This is another very interesting kube-controller-manager setting that can help you speed up or slow down how fast a node shall be considered Unknown (node status unknown, a.k.a unreachable) when the kubelet is not updating its status anymore (see node status conditions), which affects eviction (see spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds and defaultNotReadyTolerationSeconds above). The shorter the time window, the faster Kubernetes will act, but the higher the chance of flapping behavior and pod thrashing, so you may want to balance that out according to your needs; otherwise, stick to the default, which is a reasonable compromise.

On spec.kubernetes.kubeControllerManager.horizontalPodAutoscaler...

This configures horizontal pod autoscaling in Gardener-managed clusters. See above and the docs for the detailed fields.

On spec.kubernetes.verticalPodAutoscaler...

This configures vertical pod autoscaling in Gardener-managed clusters. See above and the docs for the detailed fields.

On spec.kubernetes.clusterAutoscaler...

This configures node auto-scaling in Gardener-managed clusters. See above and the docs for the detailed fields, especially about expanders, which may become life-saving in case of a zone outage when a resource crunch is setting in and everybody rushes to get machines in the healthy zones.

In case of a zone outage, it is critical to understand how the cluster autoscaler will put a worker pool in one zone into “back-off” and what the consequences for your workload will be. Unfortunately, the official cluster autoscaler documentation does not explain these details, but you can find hints in the source code:

If a node fails to come up, the node group (worker pool in that zone) will go into “back-off”, at first 5m, then exponentially longer until the maximum of 30m is reached. The “back-off” is reset after 3 hours. This in turn means that nodes must first be considered Unknown, which happens when spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod lapses (e.g. at the beginning of a zone outage). Then, either they must remain in this state until spec.provider.workers.machineControllerManager.machineHealthTimeout lapses for them to be recreated, which will fail in the unhealthy zone, or spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds lapses for the pods to be evicted (usually faster than node replacements, depending on your configuration), which will trigger the cluster autoscaler to create more capacity, but very likely in the same zone as it tries to balance its node groups at first, which will fail in the unhealthy zone. The node creation will be considered failed only when maxNodeProvisionTime lapses (usually close to spec.provider.workers.machineControllerManager.machineCreationTimeout), and only then will the cluster autoscaler put the node group into “back-off” and not retry for 5m (at first and then exponentially longer). Only then can you expect new node capacity to be brought up somewhere else.
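To get a feeling for the orders of magnitude, the timers can be added up in a short Python sketch. This is a rough estimate under stated assumptions, not cluster autoscaler code: both helpers are hypothetical, the back-off is assumed to double after every failure, maxNodeProvisionTime is assumed to equal the machineCreationTimeout of 20m from the example above, and scan intervals as well as the actual provisioning time in a healthy zone are ignored.

```python
from datetime import timedelta
from typing import List


def backoff_schedule(initial: timedelta, maximum: timedelta, failures: int) -> List[timedelta]:
    """Back-off after each consecutive failure, assuming the duration doubles every time."""
    durations = []
    current = initial
    for _ in range(failures):
        durations.append(current)
        current = min(current * 2, maximum)
    return durations


def time_until_backoff(node_monitor_grace_period: timedelta,
                       unreachable_toleration: timedelta,
                       max_node_provision_time: timedelta) -> timedelta:
    """Earliest point at which the affected node group goes into back-off (pod eviction path)."""
    return node_monitor_grace_period + unreachable_toleration + max_node_provision_time


estimate = time_until_backoff(
    node_monitor_grace_period=timedelta(seconds=40),  # nodeMonitorGracePeriod: 40s
    unreachable_toleration=timedelta(seconds=300),    # defaultUnreachableTolerationSeconds: 300
    max_node_provision_time=timedelta(minutes=20),    # assumed equal to machineCreationTimeout
)
print(estimate)
print(backoff_schedule(timedelta(minutes=5), timedelta(minutes=30), failures=5))
```

With the example values, pods that depend on fresh capacity may remain pending for roughly 25 minutes before the cluster autoscaler even starts to look at other zones.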

During the time of ongoing node provisioning (before a node group goes into “back-off”), the cluster autoscaler may have “virtually scheduled” pending pods onto those new upcoming nodes and will not reevaluate these pods anymore unless the node provisioning fails (which will fail during a zone outage, but the cluster autoscaler cannot know that and will therefore reevaluate its decision only after it has given up on the new nodes).

It’s critical to keep that in mind and account for it. If you already have capacity up and running, the reaction time is usually much faster with leases (whatever you set) or endpoints (spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod), but if you depend on new/fresh capacity, the above should inform you how long you will have to wait for it and for how long pods might be pending (because capacity is generally missing and pending pods may have been “virtually scheduled” to new nodes that won’t come up until the node group eventually goes into “back-off” and nodes in the healthy zones come up).

On spec.provider.workers.minimum, maximum, maxSurge, maxUnavailable, zones, and machineControllerManager

Each worker pool in Gardener may be configured differently. Among many other settings like machine type, root disk, Kubernetes version, kubelet settings, and many more you can also specify the lower and upper bound for the number of machines (minimum and maximum), how many machines may be added additionally during a rolling update (maxSurge) and how many machines may be in termination/recreation during a rolling update (maxUnavailable), and of course across how many zones the nodes shall be spread (zones).

Gardener divides minimum, maximum, maxSurge, maxUnavailable values by the number of zones specified for this worker pool. This fact must be considered when you plan the sizing of your worker pools.

Example:

  provider:
    workers:
    - name: ...
      minimum: 6
      maximum: 60
      maxSurge: 3
      maxUnavailable: 0
      zones: ["a", "b", "c"]
  • The resulting MachineDeployments per zone will get minimum: 2, maximum: 20, maxSurge: 1, maxUnavailable: 0.
  • If another zone is added all values will be divided by 4, resulting in:
    • Fewer workers per zone.
    • ⚠️ One MachineDeployment with maxSurge: 0, i.e. there will be a replacement of nodes without rolling updates.
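The per-zone arithmetic from the example can be reproduced with a small Python sketch. The helper distribute_over_zones is hypothetical and covers absolute numbers only; it assumes that a remainder is handed out to the first zones one by one, which matches the single MachineDeployment with maxSurge: 0 mentioned above.

```python
from typing import Dict, List


def distribute_over_zones(total: int, zone_count: int) -> List[int]:
    """Split a worker pool value across zones; the first zones receive the remainder."""
    base, remainder = divmod(total, zone_count)
    return [base + (1 if index < remainder else 0) for index in range(zone_count)]


def per_zone_settings(pool: Dict[str, int], zone_count: int) -> Dict[str, List[int]]:
    """Apply the split to minimum, maximum, maxSurge, and maxUnavailable."""
    return {name: distribute_over_zones(value, zone_count) for name, value in pool.items()}


pool = {"minimum": 6, "maximum": 60, "maxSurge": 3, "maxUnavailable": 0}
print(per_zone_settings(pool, zone_count=3))
print(per_zone_settings(pool, zone_count=4))
```

For 4 zones, maxSurge comes out as [1, 1, 1, 0], so you would have to raise maxSurge to at least 4 to keep a proper rolling update in every zone.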

Also interesting is the configuration for Gardener’s machine-controller-manager, or MCM for short, which provisions, monitors, terminates, replaces, or updates the machines that back your nodes:

  • The shorter machineCreationTimeout is, the faster MCM will retry to create a machine/node, if the process is stuck on cloud provider side. It is set to useful/practical timeouts for the different cloud providers and you probably don’t want to change those (in the context of HA at least). Please align with the cluster autoscaler’s maxNodeProvisionTime.
  • The shorter machineHealthTimeout is, the faster MCM will replace machines/nodes in case the kubelet isn’t reporting back, which translates to Unknown, or reports back with NotReady, or the node-problem-detector that Gardener deploys for you reports a non-recoverable issue/condition (e.g. read-only file system). If it is too short, however, you risk node and pod thrashing, so be careful.
  • The shorter machineDrainTimeout is, the faster you can get rid of machines/nodes that MCM decided to remove, but this puts a cap on the grace periods and PDBs. They are respected up until the drain timeout lapses - then the machine/node will be forcefully terminated, whether or not the pods are still in termination or not even terminated because of PDBs. Those PDBs will then be violated, so be careful here as well. Please align with the cluster autoscaler’s maxGracefulTerminationSeconds.

Especially the last two settings may help you recover faster from cloud provider issues.

On spec.systemComponents.coreDNS.autoscaling

DNS is critical, in general and also within a Kubernetes cluster. Gardener-managed clusters deploy CoreDNS, a graduated CNCF project. Gardener supports 2 auto-scaling modes for it, horizontal (using HPA based on CPU) and cluster-proportional (using cluster proportional autoscaler that scales the number of pods based on the number of nodes/cores, not to be confused with the cluster autoscaler that scales nodes based on their utilization). Check out the docs, especially the trade-offs why you would choose one over the other (cluster-proportional gives you more configuration options, if CPU-based horizontal scaling is insufficient for your needs). Consider also Gardener’s feature node-local DNS to decouple you further from the DNS pods and stabilize DNS. Again, that’s not strictly related to HA, but may become important during a zone outage, when load patterns shift and pods start to initialize/resolve DNS records more frequently in bulk.

More Caveats

Unfortunately, there are a few more things of note when it comes to HA in a Kubernetes cluster that may be “surprising” and hard to mitigate:

  • If the kubelet restarts, it will report all pods as NotReady on startup until it reruns its probes (#100277), which leads to temporary endpoint and load balancer target removal (#102367). This topic is somewhat controversial. Gardener uses rolling updates and a jitter to spread necessary kubelet restarts as well as possible.
  • If a kube-proxy pod on a node turns NotReady, all load balancer traffic to all pods (on this node) under services with externalTrafficPolicy local will cease as the load balancer will then take this node out of serving. This topic is somewhat controversial as well. So, please remember that externalTrafficPolicy local not only has the disadvantage of imbalanced traffic spreading, but also a dependency on the kube-proxy pod that may and will be unavailable during updates. Gardener uses rolling updates to spread necessary kube-proxy updates as well as possible.

These are just a few additional considerations. They may or may not affect you, but other intricacies may. It’s a reminder to be watchful as Kubernetes may have one or two relevant quirks that you need to consider (and will probably only find out over time and with extensive testing).

Meaningful Availability

Finally, let’s go back to where we started. We recommended measuring meaningful availability. For instance, in Gardener, we do not trust only internal signals, but also track whether Gardener or the control planes that it manages are externally available through the external DNS records and load balancers, SNI-routing Istio gateways, etc. (the same path all users must take). It’s a huge difference whether the API server’s internal readiness probe passes or the user can actually reach the API server and it does what it’s supposed to do. Most likely, you will be in a similar spot and can do the same.

What you do with these signals is another matter. Maybe there are some actionable metrics and you can trigger some active fail-over, maybe you can only use it to improve your HA setup altogether. In our case, we also use it to deploy mitigations, e.g. via our dependency-watchdog that watches, for instance, Gardener-managed API servers and shuts down components like the controller managers to avert cascading knock-on effects (e.g. melt-down if the kubelets cannot reach the API server, but the controller managers can and start taking down nodes and pods).

Either way, understanding how users perceive your service is key to the improvement process as a whole. Even if you are not struck by a zone outage, the measures above and tracking the meaningful availability will help you improve your service.

Thank you for your interest.

2.2 - Chaos Engineering

Overview

Gardener provides chaostoolkit modules to simulate compute and network outages for various cloud providers such as AWS, Azure, GCP, OpenStack/Converged Cloud, and VMware vSphere, as well as pod disruptions for any Kubernetes cluster.

The API, parameterization, and implementation are as homogeneous as possible across the different cloud providers, so that you have only minimal effort. As a Gardener user, you benefit from an additional garden module that leverages the generic modules, but exposes their functionality in the most simple, homogeneous, and secure way (no need to specify cloud provider credentials, cluster credentials, or filters explicitly; retrieves credentials and stores them in memory only).

Installation

The name of the package is chaosgarden and it was developed and tested with Python 3.9+. It is published on PyPI, so that you can comfortably install it via Python’s package installer pip (you may want to create a virtual environment before installing it):

pip install chaosgarden

ℹ️ If you want to use the VMware vSphere module, please note the remarks in requirements.txt for vSphere. Those are not contained in the published PyPI package.

The package can be used directly from Python scripts and supports this usage scenario with additional convenience that helps launch actions and probes in background (more on actions and probes later), so that you can compose also complex scenarios with ease.

If this technology is new to you, you will probably prefer the chaostoolkit CLI in combination with experiment files, so we need to install the CLI next:

pip install chaostoolkit

Please verify that it was installed properly by running:

chaos --help

Usage

ℹ️ We assume you are using Gardener and run Gardener-managed shoot clusters. You can also use the generic cloud provider and Kubernetes chaosgarden modules, but configuration and secrets will then differ. Please see the module docs for details.

A Simple Experiment

The most important command is the run command, but before we can use it, we need to compile an experiment file first. Let’s start with a simple one, invoking only a read-only 📖 action from chaosgarden that lists cloud provider machines and networks (depends on cloud provider) for the “first” zone of one of your shoot clusters.

Let’s assume your project is called my-project and your shoot is called my-shoot. Then we need to create the following experiment:

{
    "title": "assess-filters-impact",
    "description": "assess-filters-impact",
    "method": [
        {
            "type": "action",
            "name": "assess-filters-impact",
            "provider": {
                "type": "python",
                "module": "chaosgarden.garden.actions",
                "func": "assess_cloud_provider_filters_impact",
                "arguments": {
                    "zone": 0
                }
            }
        }
    ],
    "configuration": {
        "garden_project": "my-project",
        "garden_shoot": "my-shoot"
    }
}
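If you prefer to generate the experiment file rather than write it by hand, a minimal Python sketch using only the standard library could look like this. The helpers build_experiment and write_experiment are hypothetical and not part of chaosgarden; the structure mirrors the experiment shown above.

```python
import json
from typing import Any, Dict


def build_experiment(project: str, shoot: str, zone: int = 0) -> Dict[str, Any]:
    """Assemble the read-only experiment from above for a given project and shoot."""
    title = "assess-filters-impact"
    return {
        "title": title,
        "description": title,
        "method": [
            {
                "type": "action",
                "name": title,
                "provider": {
                    "type": "python",
                    "module": "chaosgarden.garden.actions",
                    "func": "assess_cloud_provider_filters_impact",
                    "arguments": {"zone": zone},
                },
            }
        ],
        "configuration": {"garden_project": project, "garden_shoot": shoot},
    }


def write_experiment(path: str, experiment: Dict[str, Any]) -> None:
    """Write the experiment as indented JSON, ready to be passed to the chaos CLI."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(experiment, handle, indent=4)


if __name__ == "__main__":
    # Writes experiment.json into the current working directory.
    write_experiment("experiment.json", build_experiment("my-project", "my-shoot"))
```

Generating the file this way makes it easy to produce one experiment per shoot or per zone index without copying JSON around.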

We are not there yet: there is one more thing to do before we can run it. We need to “target” the Gardener landscape or, more precisely, the Gardener API server where you have created your shoot cluster (not to be confused with your shoot cluster API server). If you do not know what this is or how to download the Gardener API server kubeconfig, please follow these instructions. You can either download your personal credentials or project credentials (see creation of a serviceaccount) to interact with Gardener. For now (fastest and most convenient way, but generally not recommended), let’s use your personal credentials, but if you later plan to automate your experiments, please use proper project credentials (a serviceaccount is not bound to your person, but to the project, and can be restricted using RBAC roles and role bindings, which is why we recommend this for production).

To download your personal credentials, open the Gardener Dashboard and click on your avatar in the upper right corner of the page. Click “My Account”, then look for the “Access” pane, then “Kubeconfig”, then press the “Download” button and save the kubeconfig to disk. Run the following command next:

export KUBECONFIG=path/to/kubeconfig

We are now set and you can run your first experiment:

chaos run path/to/experiment

You should see output like this (depends on cloud provider):

[INFO] Validating the experiment's syntax
[INFO] Installing signal handlers to terminate all active background threads on involuntary signals (note that SIGKILL cannot be handled).
[INFO] Experiment looks valid
[INFO] Running experiment: assess-filters-impact
[INFO] Steady-state strategy: default
[INFO] Rollbacks strategy: default
[INFO] No steady state hypothesis defined. That's ok, just exploring.
[INFO] Playing your experiment's method now...
[INFO] Action: assess-filters-impact
[INFO] Validating client credentials and listing probably impacted instances and/or networks with the given arguments zone='world-1a' and filters={'instances': [{'Name': 'tag-key', 'Values': ['kubernetes.io/cluster/shoot--my-project--my-shoot']}], 'vpcs': [{'Name': 'tag-key', 'Values': ['kubernetes.io/cluster/shoot--my-project--my-shoot']}]}:
[INFO] 1 instance(s) would be impacted:
[INFO] - i-aabbccddeeff0000
[INFO] 1 VPC(s) would be impacted:
[INFO] - vpc-aabbccddeeff0000
[INFO] Let's rollback...
[INFO] No declared rollbacks, let's move on.
[INFO] Experiment ended with status: completed

🎉 Congratulations! You successfully ran your first chaosgarden experiment.

A Destructive Experiment

Now let’s break 🪓 your cluster. Be advised that this experiment will be destructive in the sense that we will temporarily network-partition all nodes in one availability zone (machine termination or restart is available with chaosgarden as well). That means, these nodes and their pods won’t be able to “talk” to other nodes, pods, and services. Also, the API server will become unreachable for them and the API server will report them as unreachable (confusingly shown as NotReady when you run kubectl get nodes and Unknown in the status Ready condition when you run kubectl get nodes --output yaml).

Being unreachable will trigger service endpoint and load balancer de-registration (when the node’s grace period lapses) as well as eventually pod eviction and machine replacement (which will continue to fail under test). We won’t run the experiment long enough for all of these effects to materialize, but the longer you run it, the more will happen, up to temporarily giving up/going into “back-off” for the affected worker pool in that zone. You will also see that the Kubernetes cluster autoscaler will try to create a new machine almost immediately, if pods are pending for the affected zone (which will initially fail under test, but may succeed later, which again depends on the runtime of the experiment and whether or not the cluster autoscaler goes into “back-off”).

But for now, all of this doesn’t matter as we want to start “small”. You can later read up more on the various settings and effects in our best practices guide on high availability.

Please create a new experiment file, this time with this content:

{
    "title": "run-network-failure-simulation",
    "description": "run-network-failure-simulation",
    "method": [
        {
            "type": "action",
            "name": "run-network-failure-simulation",
            "provider": {
                "type": "python",
                "module": "chaosgarden.garden.actions",
                "func": "run_cloud_provider_network_failure_simulation",
                "arguments": {
                    "mode": "total",
                    "zone": 0,
                    "duration": 60
                }
            }
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "rollback-network-failure-simulation",
            "provider": {
                "type": "python",
                "module": "chaosgarden.garden.actions",
                "func": "rollback_cloud_provider_network_failure_simulation",
                "arguments": {
                    "mode": "total",
                    "zone": 0
                }
            }
        }
    ],
    "configuration": {
        "garden_project": {
            "type": "env",
            "key": "GARDEN_PROJECT"
        },
        "garden_shoot": {
            "type": "env",
            "key": "GARDEN_SHOOT"
        }
    }
}

ℹ️ There is an even more destructive action that terminates or alternatively restarts machines in a given zone 🔥 (immediately or delayed with some randomness/chaos for maximum inconvenience for the nodes and pods). You can find links to all these examples at the end of this tutorial.

This experiment is very similar, but this time we will break 🪓 your cluster - for 60s. If that’s too short to even see a node or pod transition from Ready to NotReady (actually Unknown), then increase the duration. Depending on the workload that your cluster runs, you may already see effects of the network partitioning, because it is effective immediately. It’s just that Kubernetes cannot know immediately and rather assumes that something is failing only after the node’s grace period lapses, but the actual workload is impacted immediately.
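If you plan to vary the zone or the duration between runs, you can also generate the experiment file instead of editing it by hand. The following is only a minimal sketch: the helper name build_network_failure_experiment is hypothetical, and the module and function names are copied from the experiment above. It keeps the arguments of the action and of its rollback in sync:

```python
import json

# Module name copied from the experiment file above.
ACTIONS_MODULE = "chaosgarden.garden.actions"


def _action(name, func, arguments):
    """Build one chaostoolkit action entry that calls a Python function."""
    return {
        "type": "action",
        "name": name,
        "provider": {
            "type": "python",
            "module": ACTIONS_MODULE,
            "func": func,
            "arguments": arguments,
        },
    }


def build_network_failure_experiment(zone, duration, mode="total"):
    """Hypothetical helper: return the experiment above for a given zone and duration."""
    if duration <= 0:
        raise ValueError("duration must be a positive number of seconds")
    target = {"mode": mode, "zone": zone}
    return {
        "title": "run-network-failure-simulation",
        "description": "run-network-failure-simulation",
        "method": [
            _action(
                "run-network-failure-simulation",
                "run_cloud_provider_network_failure_simulation",
                {**target, "duration": duration},
            )
        ],
        # The rollback targets exactly the same mode and zone as the action.
        "rollbacks": [
            _action(
                "rollback-network-failure-simulation",
                "rollback_cloud_provider_network_failure_simulation",
                dict(target),
            )
        ],
        "configuration": {
            "garden_project": {"type": "env", "key": "GARDEN_PROJECT"},
            "garden_shoot": {"type": "env", "key": "GARDEN_SHOOT"},
        },
    }


experiment = build_network_failure_experiment(zone=0, duration=120)
print(json.dumps(experiment, indent=4))
```

Redirect the printed JSON into a file of your choice and pass that file to the CLI.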

Most notably, this experiment also has a rollbacks section, which is invoked even if you abort the experiment or it fails unexpectedly, but only if you run the CLI with the option --rollback-strategy always, which we will do soon. Any chaosgarden action that can undo its activity will do so implicitly when the duration lapses, but it is a best practice to always configure a rollbacks section in case something unexpected happens. Should you panic and just want to run the rollbacks section, you can remove all other actions and the CLI will execute the rollbacks section immediately.

One other thing is different in the second experiment as well. We now read the name of the project and the shoot from the environment, i.e. a configuration section can automatically expand environment variables. Also useful to know (not shown here), chaostoolkit supports variable substitution too, so that you have to define variables only once. Please note that you can also add a secrets section that can also automatically expand environment variables. For instance, instead of targeting the Gardener API server via $KUBECONFIG, which is supported by our chaosgarden package natively, you can also explicitly refer to it in a secrets section (for brevity reasons not shown here either).
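To make the idea of environment expansion more tangible, here is a simplified model of what happens to the configuration section. It is not the chaostoolkit implementation, only a sketch under the assumption that every entry of type env is replaced by the value of the environment variable named in key; the function name resolve_configuration is hypothetical:

```python
import os


def resolve_configuration(configuration, environ=None):
    """Simplified model: replace {"type": "env", "key": ...} entries by their values."""
    environ = os.environ if environ is None else environ
    resolved = {}
    for name, value in configuration.items():
        if isinstance(value, dict) and value.get("type") == "env":
            key = value["key"]
            if key not in environ:
                raise KeyError(f"environment variable {key} is not set")
            resolved[name] = environ[key]
        else:
            # Plain values are passed through unchanged.
            resolved[name] = value
    return resolved


configuration = {
    "garden_project": {"type": "env", "key": "GARDEN_PROJECT"},
    "garden_shoot": {"type": "env", "key": "GARDEN_SHOOT"},
}
example_environment = {"GARDEN_PROJECT": "my-project", "GARDEN_SHOOT": "my-shoot"}
print(resolve_configuration(configuration, example_environment))
```

Failing early on a missing variable is a design choice of this sketch; it mirrors the fact that the experiment cannot target a shoot without both values.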

Let’s now run your second experiment (please watch your nodes and pods in parallel, e.g. by running watch kubectl get nodes,pods --output wide in another terminal):

export GARDEN_PROJECT=my-project
export GARDEN_SHOOT=my-shoot
chaos run --rollback-strategy always path/to/experiment

The output of the run command will be similar to the one above, but longer. It will mention either machines or networks that were network-partitioned (depends on cloud provider), but should revert everything back to normal.

Normally, you would not only run actions in the method section, but also probes as part of a steady state hypothesis. Such steady state hypothesis probes are run before and after the actions to validate that the “system” was in a healthy state before and gets back to a healthy state after the actions ran, hence show that the “system” is in a steady state when not under test. Eventually, you will write your own probes that don’t even have to be part of a steady state hypothesis. We at Gardener run multi-zone (multiple zones at once) and rolling-zone (strike each zone once) outages with continuous custom probes all within the method section to validate our KPIs continuously under test (e.g. how long do the individual fail-overs take/how long is the actual outage). The most complex scenarios are even run via Python scripts as all actions and probes can also be invoked directly (which is what the CLI does).

High Availability

Developing highly available workload that can tolerate a zone outage is no trivial task. You can find more information on how to achieve this goal in our best practices guide on high availability.

Thank you for your interest in Gardener chaos engineering and making your workload more resilient.

Further Reading

Here are some links for further reading:

2.3 - Control Plane

Failure tolerance types node and zone. Possible mitigations for zone or node outages

Highly Available Shoot Control Plane

The Shoot resource offers a way to request a highly available control plane.

Failure Tolerance Types

A highly available shoot control plane can be set up with a failure tolerance of either zone or node.

Node Failure Tolerance

The failure tolerance of a node will have the following characteristics:

  • Control plane components will be spread across different nodes within a single availability zone. There will not be more than one replica per node for each control plane component which has more than one replica.
  • Worker pool should have a minimum of 3 nodes.
  • A multi-node etcd (quorum size of 3) will be provisioned, offering zero-downtime capabilities with each member in a different node within a single availability zone.

Zone Failure Tolerance

The failure tolerance of a zone will have the following characteristics:

  • Control plane components will be spread across different availability zones. There will be at least one replica per zone for each control plane component which has more than one replica.
  • Gardener scheduler will automatically select a seed which has a minimum of 3 zones to host the shoot control plane.
  • A multi-node etcd (quorum size of 3) will be provisioned, offering zero-downtime capabilities with each member in a different zone.
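Both failure tolerance types provision a multi-node etcd. The arithmetic behind it can be sketched as follows, assuming that “quorum size of 3” refers to an etcd cluster with three members: etcd stays writable as long as a majority of its members is available.

```python
def majority(members):
    """Number of members that must be available for etcd to accept writes."""
    return members // 2 + 1


def tolerated_failures(members):
    """Number of members that may fail without losing the majority."""
    return members - majority(members)


for members in (1, 3, 5):
    print(
        f"{members} member(s): majority={majority(members)}, "
        f"tolerated failures={tolerated_failures(members)}"
    )
```

With three members spread across three nodes or three zones, one node or one zone can therefore fail while etcd keeps serving requests.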

Shoot Spec

To request a highly available shoot control plane, Gardener provides the following configuration in the shoot spec:

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: <node | zone>

Allowed Transitions

If you already have a shoot cluster with non-HA control plane, then the following upgrades are possible:

  • Upgrade of non-HA shoot control plane to HA shoot control plane with node failure tolerance.
  • Upgrade of non-HA shoot control plane to HA shoot control plane with zone failure tolerance. However, it is essential that the seed which is currently hosting the shoot control plane should be multi-zonal. If it is not, then the request to upgrade will be rejected.

Note: There will be a small downtime during the upgrade, especially for etcd, which will transition from a single node etcd cluster to a multi-node etcd cluster.

Disallowed Transitions

If you already have a shoot cluster with HA control plane, then the following transitions are not possible:

  • Upgrade of HA shoot control plane from node failure tolerance to zone failure tolerance is currently not supported, mainly because already existing volumes are bound to the zone they were created in originally.
  • Downgrade of HA shoot control plane with zone failure tolerance to node failure tolerance is currently not supported, mainly because of the same reason as above, that already existing volumes are bound to the respective zones they were created in originally.
  • Downgrade of HA shoot control plane with either node or zone failure tolerance, to a non-HA shoot control plane is currently not supported, mainly because etcd-druid does not currently support scaling down of a multi-node etcd cluster to a single-node etcd cluster.
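The allowed and disallowed transitions listed above can be summarized in a few lines of code. This is only an illustration of the rules in this section, not Gardener's validation logic; the label non-HA and the function name are hypothetical, and treating an unchanged failure tolerance type as allowed is an assumption of this sketch.

```python
NON_HA = "non-HA"  # hypothetical label for a shoot without highAvailability configured
FAILURE_TOLERANCE_TYPES = {NON_HA, "node", "zone"}


def is_transition_allowed(current, target, seed_is_multi_zonal=False):
    """Return True if the control plane may move from current to target."""
    for value in (current, target):
        if value not in FAILURE_TOLERANCE_TYPES:
            raise ValueError(f"unknown failure tolerance type: {value!r}")
    if current == target:
        return True  # assumption: keeping the current type is not a transition
    if current != NON_HA:
        return False  # node <-> zone and any downgrade to non-HA are not supported
    if target == "zone":
        return seed_is_multi_zonal  # the hosting seed must be multi-zonal
    return True  # non-HA -> node


print(is_transition_allowed(NON_HA, "node"))
print(is_transition_allowed(NON_HA, "zone", seed_is_multi_zonal=True))
print(is_transition_allowed("node", "zone", seed_is_multi_zonal=True))
```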

Zone Outage Situation

Implementing highly available software that can tolerate even a zone outage unscathed is no trivial task. You may find our HA Best Practices helpful to get closer to that goal. In this document, we collected many options and settings for you that also Gardener internally uses to provide a highly available service.

During a zone outage, you may be forced to change your cluster setup on short notice in order to compensate for failures and shortages resulting from the outage. For instance, if the shoot cluster has worker nodes across three zones and one zone goes down, the computing power of the nodes in that zone is also gone during that time. Changing the worker pool (shoot.spec.provider.workers[]) and infrastructure (shoot.spec.provider.infrastructureConfig) configuration can eliminate this imbalance by providing enough machines in healthy availability zones to cope with the requests of your applications.

Gardener relies on a sophisticated reconciliation flow with several dependencies for which various flow steps wait for the readiness of prior ones. During a zone outage, this can block the entire flow, e.g., because all three etcd replicas can never be ready when a zone is down, and required changes mentioned above can never be accomplished. For this, a special one-off annotation shoot.gardener.cloud/skip-readiness helps to skip any readiness checks in the flow.

The shoot.gardener.cloud/skip-readiness annotation serves as a last resort if reconciliation is stuck because of important changes during an AZ outage. Use it with caution, only in exceptional cases and after a case-by-case evaluation with your Gardener landscape administrator. If used together with other operations like Kubernetes version upgrades or credential rotation, the annotation may lead to a severe outage of your shoot control plane.

3 - Networking

3.1 - Enable IPv4/IPv6 (dual-stack) Ingress on AWS

Use IPv4/IPv6 (dual-stack) Ingress in an IPv4 single-stack cluster on AWS

Using IPv4/IPv6 (dual-stack) Ingress in an IPv4 single-stack cluster

Motivation

IPv6 adoption is continuously growing, already overtaking IPv4 in certain regions, e.g. India, or scenarios, e.g. mobile. Even though most IPv6 installations deploy means to reach IPv4, it might still be beneficial to expose services natively via IPv4 and IPv6 instead of just relying on IPv4.

Disadvantages of full IPv4/IPv6 (dual-stack) Deployments

Enabling full IPv4/IPv6 (dual-stack) support in a kubernetes cluster is a major endeavor. It requires a lot of changes and restarts of all pods so that all pods get addresses for both IP families. A side-effect of dual-stack networking is that failures may be hidden as network traffic may take the other protocol to reach the target. For this reason and also due to reduced operational complexity, service teams might lean towards staying in a single-stack environment as much as possible. Luckily, this is possible with Gardener and IPv4/IPv6 (dual-stack) ingress on AWS.

Simplifying IPv4/IPv6 (dual-stack) Ingress with Protocol Translation on AWS

Fortunately, the network load balancer on AWS supports automatic protocol translation, i.e. it can expose both IPv4 and IPv6 endpoints while communicating with just one protocol to the backends. Under the hood, automatic protocol translation takes place. Client IP address preservation can be achieved by using proxy protocol.

This approach enables users to expose IPv4 workload to IPv6-only clients without having to change the workload/service. Without requiring invasive changes, it allows a fairly simple first step into the IPv6 world for services just requiring ingress (incoming) communication.

Necessary Shoot Cluster Configuration Changes for IPv4/IPv6 (dual-stack) Ingress

To be able to utilize IPv4/IPv6 (dual-stack) Ingress in an IPv4 shoot cluster, the cluster needs to meet two preconditions:

  1. dualStack.enabled needs to be set to true to configure VPC/subnet for IPv6 and add a routing rule for IPv6. (This does not add IPv6 addresses to kubernetes nodes.)
  2. loadBalancerController.enabled needs to be set to true as well to use the load balancer controller, which supports dual-stack ingress.

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
...
spec:
  provider:
    type: aws
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      dualStack:
        enabled: true
    controlPlaneConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: ControlPlaneConfig
      loadBalancerController:
        enabled: true
...

When infrastructureConfig.networks.vpc.id is set to the ID of an existing VPC, please make sure that your VPC has an Amazon-provided IPv6 CIDR block added.

After adapting the shoot specification and reconciling the cluster, dual-stack load balancers can be created using Kubernetes Service objects.
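If you keep your shoot manifests in version control, a small check can catch a missing precondition before you apply a change. The sketch below assumes the manifest has already been loaded into a Python dictionary (e.g. with a YAML parser of your choice); the function names are hypothetical.

```python
# The two flags required for dual-stack ingress, as paths into the shoot manifest.
REQUIRED_FLAGS = (
    ("spec", "provider", "infrastructureConfig", "dualStack", "enabled"),
    ("spec", "provider", "controlPlaneConfig", "loadBalancerController", "enabled"),
)


def _lookup(mapping, *path):
    """Follow path through nested dictionaries; return None if anything is missing."""
    current = mapping
    for key in path:
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def missing_dual_stack_preconditions(shoot):
    """Return the dotted paths of all required flags that are not set to true."""
    return [".".join(path) for path in REQUIRED_FLAGS if _lookup(shoot, *path) is not True]


shoot = {
    "spec": {
        "provider": {
            "type": "aws",
            "infrastructureConfig": {"dualStack": {"enabled": True}},
            "controlPlaneConfig": {},
        }
    }
}
print(missing_dual_stack_preconditions(shoot))
```

An empty list means that both preconditions are met.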

Creating an IPv4/IPv6 (dual-stack) Ingress

With the preconditions set, creating an IPv4/IPv6 load balancer is as easy as annotating a service with the correct annotations:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ip-address-type: dualstack
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
    service.beta.kubernetes.io/aws-load-balancer-type: external
  name: ...
  namespace: ...
spec:
  ...
  type: LoadBalancer

In case the client IP address should be preserved, the following annotation can be used to enable proxy protocol. (The pod receiving the traffic needs to be configured for proxy protocol as well.)

    service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
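If you template your Service manifests, the annotations shown above can be assembled in one place. This is a minimal sketch with a hypothetical helper name; the annotation keys and values are the ones from this section.

```python
ANNOTATION_PREFIX = "service.beta.kubernetes.io/"


def dual_stack_nlb_annotations(preserve_client_ip=False):
    """Return the Service annotations for a dual-stack network load balancer."""
    annotations = {
        ANNOTATION_PREFIX + "aws-load-balancer-ip-address-type": "dualstack",
        ANNOTATION_PREFIX + "aws-load-balancer-scheme": "internet-facing",
        ANNOTATION_PREFIX + "aws-load-balancer-nlb-target-type": "instance",
        ANNOTATION_PREFIX + "aws-load-balancer-type": "external",
    }
    if preserve_client_ip:
        # Proxy protocol also has to be enabled in the pod receiving the traffic.
        annotations[ANNOTATION_PREFIX + "aws-load-balancer-proxy-protocol"] = "*"
    return annotations


for key, value in dual_stack_nlb_annotations(preserve_client_ip=True).items():
    print(f"{key}: {value!r}")
```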

Please note that changing an existing Service to dual-stack may cause the creation of a new load balancer without deletion of the old AWS load balancer resource. While this helps in a seamless migration by not cutting existing connections it may lead to wasted/forgotten resources. Therefore, the (manual) cleanup needs to be taken into account when migrating an existing Service instance.

For more details see AWS Load Balancer Documentation - Network Load Balancer.

DNS Considerations to Prevent Downtime During a Dual-Stack Migration

In case the migration of an existing service is desired, please check if there are DNS entries directly linked to the corresponding load balancer. The migrated load balancer will have a new domain name immediately, which will not be ready in the beginning. Therefore, a direct migration of the domain name entries is not desired as it may cause a short downtime, i.e. domain name entries without backing IP addresses.

If there are DNS entries directly linked to the corresponding load balancer and they are managed by the shoot-dns-service, you can identify this via annotations with the prefix dns.gardener.cloud/. Those annotations can be attached to Service, Ingress, or Gateway resources. Alternatively, DNSEntry or DNSAnnotation resources may be used.

For a seamless migration without downtime use the following three step approach:

  1. Temporarily prevent direct DNS updates
  2. Migrate the load balancer and wait until it is operational
  3. Allow DNS updates again

To prevent direct updates of the DNS entries when the load balancer is migrated add the annotation dns.gardener.cloud/ignore: 'true' to all affected resources next to the other dns.gardener.cloud/... annotations before starting the migration. For example, in case of a Service ensure that the service looks like the following:

kind: Service
metadata:
  annotations:
    dns.gardener.cloud/ignore: 'true'
    dns.gardener.cloud/class: garden
    dns.gardener.cloud/dnsnames: '...'
    ...

Next, migrate the load balancer to be dual-stack enabled by adding/changing the corresponding annotations.

There are multiple ways to check that the load balancer has been provisioned successfully. It might be useful to peek into status.loadBalancer.ingress of the corresponding Service to identify the load balancer:

  • Check in the AWS console for the corresponding load balancer provisioning state
  • Perform domain name lookups with nslookup/dig to check whether the name resolves to an IP address.
  • Call your workload via the new load balancer, e.g. using curl --resolve <my-domain-name>:<port>:<IP-address> https://<my-domain-name>:<port>, which allows you to call your service with the “correct” domain name without using actual name resolution.
  • Wait a fixed period of time as load balancer creation is usually finished within 15 minutes
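The domain name lookup from the list above can also be scripted. The following sketch uses socket.getaddrinfo from the Python standard library and reports whether a name resolves to both IPv4 and IPv6 addresses. The function names are hypothetical, and the demo call uses a fake resolver with documentation addresses so that it runs without network access; in a real check you would omit the resolver argument and pass the hostname from status.loadBalancer.ingress.

```python
import socket


def resolved_families(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the set of IP families ("IPv4", "IPv6") the hostname resolves to."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()  # the name does not resolve (yet)
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    return {names[entry[0]] for entry in results if entry[0] in names}


def is_dual_stack_ready(hostname, port=443, resolver=socket.getaddrinfo):
    """True once the hostname resolves to at least one IPv4 and one IPv6 address."""
    return resolved_families(hostname, port, resolver) == {"IPv4", "IPv6"}


def fake_resolver(hostname, port, proto=0):
    """Stand-in for socket.getaddrinfo that returns documentation addresses."""
    return [
        (socket.AF_INET, socket.SOCK_STREAM, proto, "", ("192.0.2.10", port)),
        (socket.AF_INET6, socket.SOCK_STREAM, proto, "", ("2001:db8::10", port, 0, 0)),
    ]


print(is_dual_stack_ready("my-load-balancer.example.com", resolver=fake_resolver))
```

Keep in mind that the result of a real lookup depends on the resolver configured on the machine that runs the check.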

Once the load balancer has been provisioned, you can remove the annotation dns.gardener.cloud/ignore: 'true' again from the affected resources. It may take some additional time until the domain name change finally propagates (up to one hour).

3.2 - Manage Certificates with Gardener

Use the Gardener cert-management to get fully managed, publicly trusted TLS certificates

Manage certificates with Gardener for public domain

Introduction

Applications on Kubernetes that offer secure service endpoints (e.g. HTTPS) also require you to enable secured communication via SSL/TLS. With the certificate extension enabled, Gardener can manage commonly trusted X.509 certificates for your application endpoints. Beyond the initial certificate request, it also handles their timely renewal using the free Let’s Encrypt API.

There are two scenarios in which you can use the certificate extension:

  • You want to use a certificate for a subdomain of the shoot’s default DNS domain (see .spec.dns.domain of your shoot resource, e.g. short.ingress.shoot.project.default-domain.gardener.cloud). If this is your case, please see Manage certificates with Gardener for default domain
  • You want to use a certificate for a custom domain. If this is your case, please keep reading this article.

Prerequisites

Before you start this guide there are a few requirements you need to fulfill:

  • You have an existing shoot cluster
  • Your custom domain is under a public top level domain (e.g. .com)
  • Your custom zone is resolvable with a public resolver via the internet (e.g. 8.8.8.8)
  • You have a custom DNS provider configured and working (see “DNS Providers”)

As part of the Let’s Encrypt ACME challenge validation process, Gardener sets a DNS TXT entry and Let’s Encrypt checks if it can both resolve and authenticate it. Therefore, it’s important that your DNS entries are publicly resolvable. You can check this by querying, e.g., Google’s public DNS server. If it returns an entry, your DNS is publicly visible:

# returns the A record for cert-example.example.com using Google's DNS server (8.8.8.8)
dig cert-example.example.com @8.8.8.8 A

DNS provider

In order to issue certificates for a custom domain you need to specify a DNS provider which is permitted to create DNS records for subdomains of your requested domain in the certificate. For example, if you request a certificate for host.example.com your DNS provider must be capable of managing subdomains of host.example.com.

DNS providers are normally specified in the shoot manifest. To learn more on how to configure one, please see the DNS provider documentation.

Issue a certificate

Every X.509 certificate is represented by a Kubernetes custom resource certificate.cert.gardener.cloud in your cluster. A Certificate resource may be used to initiate a new certificate request as well as to manage its lifecycle. Gardener’s certificate service regularly checks the expiration timestamp of Certificates, triggers a renewal process if necessary and replaces the existing X.509 certificate with a new one.

Your application should be able to reload replaced certificates in a timely manner to avoid service disruptions.

Certificates can be requested via 4 resource types:

  • Ingress
  • Service (type LoadBalancer)
  • Gateways (both Istio gateways and from the Gateway API)
  • Certificate (Gardener CRD)

If either of the first 2 are used, a corresponding Certificate resource will be created automatically.

Using an Ingress Resource

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    # Optional but recommended, this is going to create the DNS entry at the same time
    dns.gardener.cloud/class: garden
    dns.gardener.cloud/ttl: "600"
    #cert.gardener.cloud/commonname: "*.example.com"              # optional, if not specified the first name from spec.tls[].hosts is used as common name
    #cert.gardener.cloud/dnsnames: ""                             # optional, if not specified the names from spec.tls[].hosts are used
    #cert.gardener.cloud/follow-cname: "true"                     # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/issuer: custom-issuer                    # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/preferred-chain: "chain name"            # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA             # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384"                  # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"

spec:
  tls:
  - hosts:
    # Must not exceed 64 characters.
    - amazing.example.com
    # Certificate and private key reside in this secret.
    secretName: tls-secret
  rules:
  - host: amazing.example.com
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080

Replace the hosts and rules[].host values with your own domain and adjust the remaining Ingress attributes in accordance with your deployment (e.g. the above is for an istio Ingress controller and forwards traffic to the service amazing-svc on port 8080).

Using a Service of type LoadBalancer

apiVersion: v1
kind: Service
metadata:
  annotations:
    cert.gardener.cloud/secretname: tls-secret
    dns.gardener.cloud/dnsnames: example.example.com
    dns.gardener.cloud/class: garden
    # Optional
    dns.gardener.cloud/ttl: "600"
    cert.gardener.cloud/commonname: "*.example.example.com"
    cert.gardener.cloud/dnsnames: ""
    #cert.gardener.cloud/follow-cname: "true"                     # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/issuer: custom-issuer                    # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/preferred-chain: "chain name"            # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA             # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384"                  # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
    
  name: test-service
  namespace: default
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 8080
  type: LoadBalancer

Using a Gateway resource

Please see Istio Gateways or Gateway API for details.

Using the custom Certificate resource

apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-example
  namespace: default
spec:
  commonName: amazing.example.com
  secretRef:
    name: tls-secret
    namespace: default
  # Optional if using the default issuer
  issuerRef:
    name: garden

  # If delegated domain for DNS01 challenge should be used. This has only an effect if a CNAME record is set for
  # '_acme-challenge.amazing.example.com'.
  # For example: If a CNAME record exists '_acme-challenge.amazing.example.com' => '_acme-challenge.writable.domain.com',
  # the DNS challenge will be written to '_acme-challenge.writable.domain.com'.
  #followCNAME: true

  # optionally set labels for the secret
  #secretLabels:
  #  key1: value1
  #  key2: value2

  # Optionally specify the preferred certificate chain: if the CA offers multiple certificate chains, prefer the chain with an issuer matching this Subject Common Name. If no match, the default offered chain will be used.
  #preferredChain: "ISRG Root X1"

  # Optionally specify algorithm and key size for private key. Allowed algorithms: "RSA" (allowed sizes: 2048, 3072, 4096) and "ECDSA" (allowed sizes: 256, 384)
  # If not specified, RSA with 2048 is used.
  #privateKey:
  #  algorithm: ECDSA
  #  size: 384

Supported attributes

Here is a list of all supported annotations regarding the certificate extension:

| Path | Annotation | Value | Required | Description |
| --- | --- | --- | --- | --- |
| N/A | cert.gardener.cloud/purpose: | managed | Yes when using annotations | Flag for Gardener that this specific Ingress or Service requires a certificate |
| spec.commonName | cert.gardener.cloud/commonname: | E.g. “*.demo.example.com” or “special.example.com” | Certificate and Ingress: No; Service: Yes, if DNS names unset | Specifies for which domain the certificate request will be created. If not specified, the names from spec.tls[].hosts are used. This entry must comply with the 64 character limit. |
| spec.dnsNames | cert.gardener.cloud/dnsnames: | E.g. “special.example.com” | Certificate and Ingress: No; Service: Yes, if common name unset | Additional domains the certificate should be valid for (Subject Alternative Name). If not specified, the names from spec.tls[].hosts are used. Entries in this list can be longer than 64 characters. |
| spec.secretRef.name | cert.gardener.cloud/secretname: | any-name | Yes for certificate and Service | Specifies the secret which contains the certificate/key pair. If the secret is not available yet, it’ll be created automatically as soon as the certificate has been issued. |
| spec.issuerRef.name | cert.gardener.cloud/issuer: | E.g. gardener | No | Specifies the issuer you want to use. Only necessary if you request certificates for custom domains. |
| N/A | cert.gardener.cloud/revoked: | true otherwise always false | No | Use only to revoke a certificate, see reference for more details |
| spec.followCNAME | cert.gardener.cloud/follow-cname | E.g. true | No | Specifies that the usage of a delegated domain for DNS challenges is allowed. Details see Follow CNAME. |
| spec.preferredChain | cert.gardener.cloud/preferred-chain | E.g. ISRG Root X1 | No | Specifies the Common Name of the issuer for selecting the certificate chain. Details see Preferred Chain. |
| spec.secretLabels | cert.gardener.cloud/secret-labels | for annotation use e.g. key1=value1,key2=value2 | No | Specifies labels for the certificate secret. |
| spec.privateKey.algorithm | cert.gardener.cloud/private-key-algorithm | RSA, ECDSA | No | Specifies algorithm for private key generation. The default value is depending on configuration of the extension (default of the default is RSA). You may request a new certificate without privateKey settings to find out the concrete defaults in your Gardener. |
| spec.privateKey.size | cert.gardener.cloud/private-key-size | "256", "384", "2048", "3072", "4096" | No | Specifies size for private key generation. Allowed values for RSA are 2048, 3072, and 4096. For ECDSA allowed values are 256 and 384. The default values are depending on the configuration of the extension (defaults of the default values are 3072 for RSA and 384 for ECDSA respectively). |
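The allowed combinations of private key algorithm and size are easy to get wrong. The following sketch validates a combination against the values listed above and returns the two matching annotations; the function name is hypothetical, and note that annotation values have to be strings.

```python
# Allowed sizes per algorithm, as listed in the supported attributes above.
ALLOWED_KEY_SIZES = {
    "RSA": (2048, 3072, 4096),
    "ECDSA": (256, 384),
}


def private_key_annotations(algorithm, size):
    """Return the private key annotations for a valid algorithm/size combination."""
    if algorithm not in ALLOWED_KEY_SIZES:
        raise ValueError(f"unsupported algorithm {algorithm!r}, use 'RSA' or 'ECDSA'")
    if size not in ALLOWED_KEY_SIZES[algorithm]:
        allowed = ", ".join(str(s) for s in ALLOWED_KEY_SIZES[algorithm])
        raise ValueError(f"unsupported size {size} for {algorithm}, allowed: {allowed}")
    return {
        "cert.gardener.cloud/private-key-algorithm": algorithm,
        "cert.gardener.cloud/private-key-size": str(size),  # annotations are strings
    }


print(private_key_annotations("ECDSA", 384))
```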

Request a wildcard certificate

To avoid creating a separate certificate for every single endpoint, you may want to create a wildcard certificate for your shoot’s default cluster.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    cert.gardener.cloud/commonname: "*.example.com"
spec:
  tls:
  - hosts:
    - amazing.example.com
    secretName: tls-secret
  rules:
  - host: amazing.example.com
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080

Please note that this can also be achieved by directly adding an annotation to a Service of type LoadBalancer. You could also create a Certificate object with a wildcard domain.

Using a custom Issuer

Most Gardener deployments with the certificate extension enabled have a preconfigured garden issuer. It is also usually configured to use Let’s Encrypt as the certificate provider.

If you need a custom issuer for a specific cluster, please see Using a custom Issuer.

Quotas

For security reasons there may be a default quota on the certificate requests per day set globally in the controller registration of the shoot-cert-service.

The default quota only applies if there is no explicit quota defined for the issuer itself with the field requestsPerDayQuota, e.g.:

kind: Shoot
...
spec:
  extensions:
  - type: shoot-cert-service
    providerConfig:
      apiVersion: service.cert.extensions.gardener.cloud/v1alpha1
      kind: CertConfig
      issuers:
        - email: your-email@example.com
          name: custom-issuer # issuer name must be specified in every custom issuer request, must not be "garden"
          server: 'https://acme-v02.api.letsencrypt.org/directory'
          requestsPerDayQuota: 10
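The precedence rule described above fits into a few lines. This is only an illustration of the rule, not the extension's implementation; the function name and the default_quota parameter are hypothetical, while requestsPerDayQuota is the field from the manifest above.

```python
def effective_requests_per_day(issuer, default_quota=None):
    """Return the quota that applies to an issuer, or None if there is no quota."""
    explicit = issuer.get("requestsPerDayQuota")
    if explicit is not None:
        return explicit  # an explicit quota on the issuer always wins
    return default_quota  # otherwise the global default applies, if one is set


custom_issuer = {"name": "custom-issuer", "requestsPerDayQuota": 10}
other_issuer = {"name": "other-issuer"}
print(effective_requests_per_day(custom_issuer, default_quota=50))
print(effective_requests_per_day(other_issuer, default_quota=50))
```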

DNS Propagation

As stated before, Gardener’s cert-management uses the ACME challenge protocol to verify that you own the domain for which you are requesting a certificate. This works by creating a DNS TXT record in your DNS provider under _acme-challenge.example.example.com containing a token to compare with. The TXT record is only present during the domain validation. Typically, the record is propagated within a few minutes. But if the record is not visible to the ACME server for any reason, the certificate request is retried after several minutes. This means you may have to wait up to one hour after the propagation problem has been resolved before the certificate request is retried. Take a look at the events with kubectl describe ingress example for troubleshooting.
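When troubleshooting propagation problems, it helps to know which TXT record to look for. The sketch below derives the record name for a requested domain; the function name is hypothetical. For wildcard names, the ACME DNS challenge is placed under the base domain, i.e. without the leading "*." label.

```python
def acme_challenge_record(domain):
    """Return the name of the DNS TXT record used for the ACME DNS challenge."""
    name = domain.strip().rstrip(".").lower()
    if name.startswith("*."):
        name = name[2:]  # wildcard certificates are validated via the base domain
    return "_acme-challenge." + name


print(acme_challenge_record("example.example.com"))
print(acme_challenge_record("*.example.com"))
```

You can then query the printed name as a TXT record with dig while the certificate request is pending.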

Character Restrictions

Because the common name is restricted to 64 characters, you may have to leave the common name unset for longer domain names.

For example, the following request is invalid:

apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-invalid
  namespace: default
spec:
  commonName: morethan64characters.ingress.shoot.project.default-domain.gardener.cloud

But it is valid to request a certificate for this domain if you have left the common name unset:

apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-example
  namespace: default
spec:
  dnsNames:
  - morethan64characters.ingress.shoot.project.default-domain.gardener.cloud
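The decision between the two variants above can be automated. The following sketch (with a hypothetical function name) puts a domain into commonName only if it fits the 64 character limit and falls back to dnsNames otherwise:

```python
MAX_COMMON_NAME_LENGTH = 64


def certificate_spec_for(domain):
    """Return the domain-related part of a Certificate spec for a single domain."""
    if len(domain) <= MAX_COMMON_NAME_LENGTH:
        return {"commonName": domain}
    # Too long for the common name: request it as a DNS name only.
    return {"dnsNames": [domain]}


short_domain = "short.ingress.shoot.project.default-domain.gardener.cloud"
long_domain = "morethan64characters.ingress.shoot.project.default-domain.gardener.cloud"
print(len(short_domain), certificate_spec_for(short_domain))
print(len(long_domain), certificate_spec_for(long_domain))
```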

References

3.3 - Manage Certificates with Gardener for Default Domain

Use the Gardener cert-management to get fully managed, publicly trusted TLS certificates

Manage certificates with Gardener for default domain

Introduction

Dealing with applications on Kubernetes which offer a secure service endpoints (e.g. HTTPS) also require you to enable a secured communication via SSL/TLS. With the certificate extension enabled, Gardener can manage commonly trusted X.509 certificate for your application endpoint. From initially requesting certificate, it also handeles their renewal in time using the free Let’s Encrypt API.

There are two scenarios in which you can use the certificate extension:

  • You want to use a certificate for a subdomain of the shoot’s default DNS domain (see .spec.dns.domain of your shoot resource, e.g. short.ingress.shoot.project.default-domain.gardener.cloud). If this is your case, please keep reading this article.
  • You want to use a certificate for a custom domain. If this is your case, please see Manage certificates with Gardener for public domain.

Prerequisites

Before you start this guide there are a few requirements you need to fulfill:

  • You have an existing shoot cluster

Since you are using the default DNS name, all DNS configuration should already be done and ready.

Issue a certificate

Every X.509 certificate is represented by a Kubernetes custom resource certificate.cert.gardener.cloud in your cluster. A Certificate resource may be used to initiate a new certificate request as well as to manage its lifecycle. Gardener’s certificate service regularly checks the expiration timestamp of Certificates, triggers a renewal process if necessary and replaces the existing X.509 certificate with a new one.

Your application should be able to reload replaced certificates in a timely manner to avoid service disruptions.

Certificates can be requested via three resource types:

  • Ingress
  • Service (type LoadBalancer)
  • Certificate (Gardener CRD)

If either of the first two is used, a corresponding Certificate resource is created automatically.

Using an Ingress resource

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    #cert.gardener.cloud/issuer: custom-issuer                    # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/follow-cname: "true"                     # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/preferred-chain: "chain name"            # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA             # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384"                  # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
spec:
  tls:
  - hosts:
    # Must not exceed 64 characters.
    - short.ingress.shoot.project.default-domain.gardener.cloud
    # Certificate and private key reside in this secret.
    secretName: tls-secret
  rules:
  - host: short.ingress.shoot.project.default-domain.gardener.cloud
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080

Using a service type LoadBalancer

apiVersion: v1
kind: Service
metadata:
  annotations:
    cert.gardener.cloud/purpose: managed
    # Certificate and private key reside in this secret.
    cert.gardener.cloud/secretname: tls-secret
    # You may add more domains separated by commas (e.g. "service.shoot.project.default-domain.gardener.cloud, amazing.shoot.project.default-domain.gardener.cloud")
    dns.gardener.cloud/dnsnames: "service.shoot.project.default-domain.gardener.cloud" 
    dns.gardener.cloud/ttl: "600"
    #cert.gardener.cloud/issuer: custom-issuer                    # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/follow-cname: "true"                     # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/preferred-chain: "chain name"            # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA             # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384"                  # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
  name: test-service
  namespace: default
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 8080
  type: LoadBalancer

Using the custom Certificate resource

apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-example
  namespace: default
spec:
  commonName: short.ingress.shoot.project.default-domain.gardener.cloud
  secretRef:
    name: tls-secret
    namespace: default
  # Optional if using the default issuer
  issuerRef:
    name: garden

If you’re interested in the current progress of your request, consult the description of the resource, in particular the status attribute in case the issuance failed.

Request a wildcard certificate

To avoid creating a separate certificate for every single endpoint, you may want to create a wildcard certificate for your shoot’s default domain.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    cert.gardener.cloud/commonName: "*.ingress.shoot.project.default-domain.gardener.cloud"
spec:
  tls:
  - hosts:
    - amazing.ingress.shoot.project.default-domain.gardener.cloud
    secretName: tls-secret
  rules:
  - host: amazing.ingress.shoot.project.default-domain.gardener.cloud
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080

Please note that this can also be achieved by directly adding an annotation to a Service of type LoadBalancer. You could also create a Certificate object with a wildcard domain.

More information

For more information and more examples about using the certificate extension, please see Manage certificates with Gardener for public domain.

3.4 - Managing DNS with Gardener

Set up Gardener-managed DNS records in a cluster.

Request DNS Names in Shoot Clusters

Introduction

Within a shoot cluster, it is possible to request DNS records via the following resource types:

  • Ingress
  • Service
  • Gateway (Istio or Gateway API)
  • DNSEntry

The Gardener installation your shoot cluster runs in must be equipped with a shoot-dns-service extension. This extension uses the seed’s DNS management infrastructure to maintain DNS names for shoot clusters. Please ask your Gardener operator if the extension is available in your environment.

Shoot Feature Gate

In some Gardener setups the shoot-dns-service extension is not enabled globally and thus must be configured per shoot cluster. Please adapt the shoot specification by the configuration shown below to activate the extension individually.

kind: Shoot
...
spec:
  extensions:
    - type: shoot-dns-service
...

Before you start

You should:

  • Have created a shoot cluster
  • Have created and correctly configured a DNS Provider (Please consult this page for more information)
  • Have a basic understanding of DNS (see link under References)

There are two types of DNS that you can use within Kubernetes:

  • internal (usually managed by CoreDNS)
  • external (managed by a public DNS provider)

This page and the extension deal exclusively with external DNS handling.

Gardener allows two ways of managing your external DNS:

  • Manually, which means you are in charge of creating / maintaining your Kubernetes related DNS entries
  • Via the Gardener DNS extension

Gardener DNS extension

The managed external DNS records feature of Gardener clusters makes all this easier. You do not need DNS service provider specific knowledge, and in fact you do not need to leave your cluster at all. You simply annotate the Ingress / Service that needs its DNS records managed, and the records are created and managed automatically by Gardener.

Managed external DNS records are supported with the following DNS provider types:

  • aws-route53
  • azure-dns
  • azure-private-dns
  • google-clouddns
  • openstack-designate
  • alicloud-dns
  • cloudflare-dns

Request DNS records for Ingress resources

To request a DNS name for an Ingress, Service or Gateway (Istio or Gateway API) object in the shoot cluster, the object must be annotated with the DNS class garden and an annotation denoting the desired DNS names.

Example for an annotated Ingress resource:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    # Let Gardener manage external DNS records for this Ingress.
    dns.gardener.cloud/dnsnames: special.example.com # Use "*" to collect domain names from .spec.rules[].host
    dns.gardener.cloud/ttl: "600"
    dns.gardener.cloud/class: garden
    # If you are delegating the certificate management to Gardener, uncomment the following line
    #cert.gardener.cloud/purpose: managed
spec:
  rules:
  - host: special.example.com
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080
  # Uncomment the following part if you are delegating the certificate management to Gardener
  #tls:
  #  - hosts:
  #      - special.example.com
  #    secretName: my-cert-secret-name

For an Ingress, the DNS names are already declared in the specification. Nevertheless, the dnsnames annotation must be present. It can list a subset of the DNS names of the Ingress. If DNS records are desired for all names, the value all can be used.
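
The annotation rules for an Ingress can be summarized in a few lines of Python. This is a sketch of the behavior described on this page, not the actual implementation of the DNS controller. The function name requested_dns_names is hypothetical, and treating both all and "*" as the value that collects every host from .spec.rules[].host is an assumption based on the text and the manifest comment above.

```python
def requested_dns_names(ingress: dict) -> list:
    """Return the DNS names an annotated Ingress asks Gardener to manage."""
    metadata = ingress.get("metadata") or {}
    annotations = metadata.get("annotations") or {}

    # The object must carry the DNS class "garden" to be considered at all.
    if annotations.get("dns.gardener.cloud/class") != "garden":
        return []

    # The dnsnames annotation is mandatory, even though hosts are in the spec.
    value = annotations.get("dns.gardener.cloud/dnsnames")
    if value is None:
        return []

    rules = (ingress.get("spec") or {}).get("rules") or []
    hosts = [rule["host"] for rule in rules if rule.get("host")]

    # Assumed special values: collect every host declared in the rules.
    if value.strip() in ("*", "all"):
        return hosts

    # Otherwise the annotation holds a comma-separated list of names.
    return [name.strip() for name in value.split(",") if name.strip()]
```

The sketch does not verify that explicitly listed names also appear in the Ingress rules, since this page does not state how such names are handled.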

Keep in mind that ingress resources are ignored unless an ingress controller is set up. Gardener does not provide an ingress controller by default. For more details, see Ingress Controllers and Service in the Kubernetes documentation.

Request DNS records for service type LoadBalancer

Example for an annotated Service (it must have the type LoadBalancer) resource:

apiVersion: v1
kind: Service
metadata:
  name: amazing-svc
  annotations:
    # Let Gardener manage external DNS records for this Service.
    dns.gardener.cloud/dnsnames: special.example.com
    dns.gardener.cloud/ttl: "600"
    dns.gardener.cloud/class: garden
spec:
  selector:
    app: amazing-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: LoadBalancer

Request DNS records for Gateway resources

Please see Istio Gateways or Gateway API for details.

Creating a DNSEntry resource explicitly

It is also possible to create a DNS entry via the Kubernetes resource called DNSEntry:

apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  annotations:
    # Let Gardener manage this DNS entry.
    dns.gardener.cloud/class: garden
  name: special-dnsentry
  namespace: default
spec:
  dnsName: special.example.com
  ttl: 600
  targets:
  - 1.2.3.4

If one of the accepted DNS names is a direct subname of the shoot’s ingress domain, this is already handled by the standard wildcard entry for the ingress domain. Therefore, this name should be excluded from the dnsnames list in the annotation. If only this DNS name is configured in the ingress, no explicit DNS entry is required, and the DNS annotations should be omitted altogether.

You can check the status of the DNSEntry with

$ kubectl get dnsentry
NAME         DNS                   TYPE          PROVIDER      STATUS   AGE
mydnsentry   special.example.com   aws-route53   default/aws   Ready    24s

As soon as the status of the entry is Ready, the provider has accepted the new DNS record. Depending on the provider and your DNS settings and cache, it may take up to 24 hours for the new entry to propagate across the internet.

More examples can be found here

Request DNS records for Service/Ingress resources using a DNSAnnotation resource

In rare cases it may not be possible to add annotations to a Service or Ingress resource object.

For example, the Helm chart used to deploy the resource may not be adaptable for some reason, or some automation may always restore the original content of the resource object by dropping any additional annotations.

In these cases, it is recommended to use an additional DNSAnnotation resource, which offers more flexibility than DNSEntry resources. The DNSAnnotation resource makes the DNS shoot service behave as if the annotations had been added to the referenced resource.

For the Ingress example shown above, you can create a DNSAnnotation resource alternatively to provide the annotations.

apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSAnnotation
metadata:
  annotations:
    dns.gardener.cloud/class: garden
  name: test-ingress-annotation
  namespace: default
spec:
  resourceRef:
    kind: Ingress
    apiVersion: networking.k8s.io/v1
    name: test-ingress
    namespace: default
  annotations:
    dns.gardener.cloud/dnsnames: '*'
    dns.gardener.cloud/class: garden    

Note that the DNSAnnotation resource itself needs the dns.gardener.cloud/class=garden annotation. This also only works for annotations known to the DNS shoot service (see Accepted External DNS Records Annotations).

For more details, see also DNSAnnotation objects

Accepted External DNS Records Annotations

Here are all of the accepted annotations related to the DNS extension:

Annotation | Description
--- | ---
dns.gardener.cloud/dnsnames | Mandatory for service and ingress resources, accepts a comma-separated list of DNS names if multiple names are required. For ingress you can use the special value '*'. In this case, the DNS names are collected from .spec.rules[].host.
dns.gardener.cloud/class | Mandatory, in the context of the shoot-dns-service it must always be set to garden.
dns.gardener.cloud/ttl | Recommended, overrides the default Time-To-Live of the DNS record.
dns.gardener.cloud/cname-lookup-interval | Only relevant if multiple domain name targets are specified. It specifies the lookup interval for CNAMEs to map them to IP addresses (in seconds).
dns.gardener.cloud/realms | Internal, for restricting provider access for shoot DNS entries. Typically not set by users of the shoot-dns-service.
dns.gardener.cloud/ip-stack | Only relevant for provider type aws-route53 if the target is an AWS load balancer domain name. Can be set for service, ingress and DNSEntry resources. It specifies which DNS records with alias targets are created instead of the usual CNAME records. If the annotation is not set (or has the value ipv4), only an A record is created. With value dual-stack, both A and AAAA records are created. With value ipv6, only an AAAA record is created.
service.beta.kubernetes.io/aws-load-balancer-ip-address-type=dualstack | For services, behaves similarly to dns.gardener.cloud/ip-stack=dual-stack.
loadbalancer.openstack.org/load-balancer-address | Internal, for services only: support for PROXY protocol on OpenStack (which needs a hostname as ingress). Typically not set by users of the shoot-dns-service.

If one of the accepted DNS names is a direct subdomain of the shoot’s ingress domain, this is already handled by the standard wildcard entry for the ingress domain. Therefore, this name should be excluded from the dnsnames list in the annotation. If only this DNS name is configured in the ingress, no explicit DNS entry is required, and the DNS annotations should be omitted altogether.
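
To decide which names still belong in the dnsnames annotation, the rule above can be expressed as a small filter. The following Python sketch is illustrative: the helper names and the ingress domain used in the usage note are assumptions, and "direct subdomain" is interpreted as exactly one additional label in front of the ingress domain.

```python
def is_direct_subdomain(name: str, ingress_domain: str) -> bool:
    """True if `name` is exactly one label below `ingress_domain`."""
    suffix = "." + ingress_domain.lower().rstrip(".")
    name = name.lower().rstrip(".")
    if not name.endswith(suffix):
        return False
    label = name[: -len(suffix)]
    # A direct subdomain adds a single, non-empty label without further dots.
    return label != "" and "." not in label


def names_needing_annotation(names, ingress_domain: str) -> list:
    """Drop names already covered by the wildcard entry of the ingress domain."""
    return [n for n in names if not is_direct_subdomain(n, ingress_domain)]
```

For a hypothetical ingress domain ingress.shoot.project.default-domain.gardener.cloud, the name short.ingress.shoot.project.default-domain.gardener.cloud would be dropped, while special.example.com would remain in the list. If the resulting list is empty, the DNS annotations can be omitted.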

Troubleshooting

General DNS tools

To check the DNS resolution, use the nslookup or dig command.

$ nslookup special.your-domain.com

or with dig

$ dig +short special.example.com

Depending on your network settings, you may get a successful response faster using a public DNS server (e.g. 8.8.8.8, 8.8.4.4, or 1.1.1.1):

$ dig @8.8.8.8 +short special.example.com

DNS record events

The DNS controller publishes Kubernetes events for the resource which requested the DNS record (Ingress, Service, DNSEntry). These events reveal more information about the DNS requests being processed and are especially useful to check any kind of misconfiguration, e.g. requests for a domain you don’t own.

Events for a successfully created DNS record:

$ kubectl describe service my-service

Events:
  Type    Reason          Age                From                    Message
  ----    ------          ----               ----                    -------
  Normal  dns-annotation  19s                dns-controller-manager  special.example.com: dns entry is pending
  Normal  dns-annotation  19s (x3 over 19s)  dns-controller-manager  special.example.com: dns entry pending: waiting for dns reconciliation
  Normal  dns-annotation  9s (x3 over 10s)   dns-controller-manager  special.example.com: dns entry active

Please note, events vanish after their retention period (usually 1h).

DNSEntry status

DNSEntry resources offer a .status sub-resource which can be used to check the current state of the object.

Status of an erroneous DNSEntry:

  status:
    message: No responsible provider found
    observedGeneration: 3
    provider: remote
    state: Error
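
When many entries need to be checked, the status fields shown above can be evaluated programmatically. The following Python sketch assumes the DNSEntry has already been loaded into a dictionary (for example, from the JSON output of kubectl); the function name summarize_dns_entry is hypothetical, and only the states Ready and Error that appear on this page are treated specially.

```python
def summarize_dns_entry(entry: dict) -> str:
    """Produce a one-line summary from the .status of a DNSEntry."""
    name = (entry.get("metadata") or {}).get("name", "<unnamed>")
    status = entry.get("status") or {}
    state = status.get("state", "unknown")
    message = status.get("message", "")

    if state == "Ready":
        # The provider has accepted the DNS record.
        return f"{name}: ready"
    if state == "Error":
        # The message explains the failure, e.g. a missing provider.
        return f"{name}: error ({message})"
    # Any other state is reported as-is.
    return f"{name}: {state}"
```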

References

4 - Administer Client (Shoot) Clusters

4.1 - Scalability of Gardener Managed Kubernetes Clusters

Know the boundary conditions when scaling your workloads

Have you ever wondered how much more your Kubernetes cluster can scale before it breaks down?

Of course, the answer is heavily dependent on your workloads. But be assured, any cluster will break eventually. Therefore, the best mitigation is to plan for sharding early and run multiple clusters instead of trying to optimize everything hoping to survive with a single cluster. Still, it is helpful to know when the time has come to scale out. This document aims at giving you the basic knowledge to keep a Gardener-managed Kubernetes cluster up and running while it scales according to your needs.

Welcome to Planet Scale, Please Mind the Gap!

For a complex, distributed system like Kubernetes it is impossible to give absolute thresholds for its scalability. Instead, the limit of a cluster’s scalability is a combination of various, interconnected dimensions.

Let’s take a rather simple example of two dimensions - the number of Pods per Node and number of Nodes in a cluster. According to the scalability thresholds documentation, Kubernetes can scale up to 5000 Nodes and with default settings accommodate a maximum of 110 Pods on a single Node. Pushing only a single dimension towards its limit will likely harm the cluster. But if both are pushed simultaneously, any cluster will break way before reaching one dimension’s limit.

Pods and Nodes

What sounds rather straightforward in theory can be a bit trickier in reality. While 110 Pods is the default limit, we successfully pushed beyond that and in certain cases run up to 200 Pods per Node without breaking the cluster. This is possible in an environment where one knows and controls all workloads and cluster configurations. It still requires careful testing, though, and comes at the cost of limiting the scalability of other dimensions, like the number of Nodes.

Of course, a Kubernetes cluster has a plethora of dimensions. Thus, when looking at a simple question like “How many resources can I store in ETCD?”, the only meaningful answer must be: “it depends”.

The following sections will help you to identify relevant dimensions and how they affect a Gardener-managed Kubernetes cluster’s scalability.

“Official” Kubernetes Thresholds and Scalability Considerations

To get started with the topic, please check the basic guidance provided by the Kubernetes community (specifically SIG Scalability):

Furthermore, the problem space has been discussed in a KubeCon talk, the slides for which can be found here. You should at least read the slides before continuing.

Essentially, it comes down to this:

If you promise to:

  • correctly configure your cluster
  • use extensibility features “reasonably”
  • keep the load in the cluster within recommended limits

Then we promise that your cluster will function properly.

With that knowledge in mind, let’s look at Gardener and eventually pick up the question about the number of objects in ETCD raised above.

Gardener-Specific Considerations

The following considerations are based on experience with various large clusters that scaled in different dimensions. Just as explained above, pushing beyond even one of the limits is likely (but not guaranteed) to cause issues at some point in time. Depending on the setup of your workloads, however, it might work unexpectedly well. Nevertheless, we urge you to make conscious decisions and to consider sharding your workloads. Please keep in mind that your workload significantly affects the overall stability and scalability of a cluster.

ETCD

The following section is based on a setup where ETCD Pods run on a dedicated Node pool and each Node has 8 vCPU and 32GB memory at least.

ETCD has a practical space limit of 8 GB. It caps the number of objects one can technically have in a Kubernetes cluster.

Of course, the number is heavily influenced by each object’s size, especially when considering that secrets and configmaps may store up to 1MB of data. Another dimension is a cluster’s churn rate. Since ETCD stores a history of the keyspace, a higher churn rate reduces the number of objects that fit. Gardener runs compaction every 30min and defragmentation once per day during a cluster’s maintenance window to ensure proper ETCD operations. However, it is still possible to overload ETCD. If the space limit is reached, ETCD will only accept READ or DELETE requests, and manual interaction by a Gardener operator is needed to disarm the alarm once usage is back below the threshold.

To avoid such a situation, you can monitor the current ETCD usage via the “ETCD” dashboard of the monitoring stack. It gives you the current DB size, as well as historical data for the past 2 weeks. While there are improvements planned to trigger compaction and defragmentation based on DB size, an ETCD should not grow up to this threshold. A typical, healthy DB size is less than 3 GB.

Furthermore, the dashboard has a panel called “Memory”, which indicates the memory usage of the etcd pod(s). Using more than 16GB memory is a clear red flag, and you should reduce the load on ETCD.

Another dimension you should be aware of is the object count in ETCD. You can check it via the “API Server” dashboard, which features an “ETCD Object Counts By Resource” panel. The overall number of objects (excluding events, as they are stored in a different etcd instance) should not exceed 100k for most use cases.
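
The ETCD thresholds from this section can be collected into a simple check. The following Python sketch is illustrative only: the class and function names are hypothetical, the input values would have to be read from the dashboards mentioned above, and treating 1 GB as 2^30 bytes is an assumption.

```python
from dataclasses import dataclass

GB = 1024 ** 3  # assumption: 1 GB is treated as 2^30 bytes


@dataclass
class EtcdUsage:
    db_size_bytes: int
    memory_bytes: int
    object_count: int  # excluding events


def etcd_warnings(usage: EtcdUsage) -> list:
    """Compare observed ETCD usage with the thresholds named in this guide."""
    warnings = []
    if usage.db_size_bytes >= 8 * GB:
        warnings.append("DB size reached the practical space limit of 8 GB")
    elif usage.db_size_bytes >= 3 * GB:
        warnings.append("DB size is above the typical healthy size of 3 GB")
    if usage.memory_bytes > 16 * GB:
        warnings.append("memory usage above 16 GB is a red flag")
    if usage.object_count > 100_000:
        warnings.append("object count exceeds 100k")
    return warnings
```

An empty list means that none of the thresholds from this section is exceeded; it does not guarantee that the cluster is healthy in other dimensions.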

Kube API Server

The following section is based on a setup where kube-apiserver instances run as Pods and are scheduled to Nodes with at least 8 vCPU and 32GB memory.

Gardener can scale the Deployment of a kube-apiserver horizontally and vertically. Horizontal scaling is limited to a certain number of replicas and should not concern a stakeholder much. However, the CPU / memory consumption of an individual kube-apiserver pod poses a potential threat to the overall availability of your cluster. The vertical scaling of any kube-apiserver is limited by the amount of resources available on a single Node. Outgrowing the resources of a Node will cause a downtime and render the cluster unavailable.

In general, continuous CPU usage of up to 3 cores and 16 GB memory per kube-apiserver pod is considered to be safe. This gives some room to absorb spikes, for example when the caches are initialized. You can check the resource consumption by selecting kube-apiserver Pods in the “Kubernetes Pods” dashboard. If these boundaries are exceeded constantly, you need to investigate and derive measures to lower the load.

Further information is also recorded and made available through the monitoring stack. The dashboard “API Server Request Duration and Response Size” provides insights into the request processing time of kube-apiserver Pods. Related information like request rates, dropped requests or termination codes (e.g., 429 for too many requests) can be obtained from the dashboards “API Server” and “Kubernetes API Server Details”. They provide a good indicator for how well the system is dealing with its current load.

Reducing the load on the API servers can become a challenge. To get started, you may try to:

  • Use immutable secrets and configmaps where possible to save watches. This pays off, especially when you have a high number of Nodes or just lots of secrets in general.
  • Applications interacting with the K8s API: If you know an object by its name, use it. Using label selector queries is expensive, as the filtering happens only within the kube-apiserver and not etcd, hence all resources must first pass completely from etcd to kube-apiserver.
  • Use (single object) caches within your controllers. Check the “Use cache for ShootStates in Gardenlet” issue for an example.

Nodes

When talking about the scalability of a Kubernetes cluster, Nodes are probably mentioned in the first place… well, obviously not in this guide. While vanilla Kubernetes lists 5000 Nodes as its upper limit, pushing that dimension is not feasible. Most clusters should run with fewer than 300 Nodes. But of course, the actual limit depends on the workloads deployed and can be lower or higher. As you scale your cluster, be extra careful and closely monitor ETCD and kube-apiserver.

The scalability of Nodes is subject to a range of limiting factors. Some of them can only be defined upon cluster creation and remain immutable during a cluster lifetime. So let’s discuss the most important dimensions.

CIDR:

Upon cluster creation, you have to specify or use the default values for several network segments. There are dedicated CIDRs for services, Pods, and Nodes. Each defines a range of IP addresses available for the individual resource type. Obviously, the maximum number of possible Nodes is capped by the CIDR for Nodes. However, there is a second limiting factor, which is the pod CIDR combined with the nodeCIDRMaskSize. This mask is used to divide the pod CIDR into smaller subnets, where each block gets assigned to a node. With a /16 pod network and a /24 nodeCIDRMaskSize, a cluster can scale up to 256 Nodes. Please check Shoot Networking for details.

Even though a /24 nodeCIDRMaskSize translates to a theoretical 256 pod IP addresses per Node, the maxPods setting should be less than 1/2 of this value. This gives the system some breathing room for churn and minimizes the risk of strange effects like misrouted packets caused by immediate re-use of IPs.
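
The arithmetic behind these numbers is straightforward: the difference between the nodeCIDRMaskSize and the pod CIDR prefix length determines the number of per-Node blocks, and the remaining bits determine the pod IP addresses per Node. The following Python sketch reproduces the calculation using the standard ipaddress module; the function name node_capacity and the example network 10.0.0.0/16 are assumptions chosen for illustration.

```python
import ipaddress


def node_capacity(pod_cidr: str, node_cidr_mask_size: int) -> dict:
    """Derive Node and Pod limits from the pod CIDR and nodeCIDRMaskSize."""
    pod_network = ipaddress.ip_network(pod_cidr)
    if not pod_network.prefixlen <= node_cidr_mask_size <= pod_network.max_prefixlen:
        raise ValueError("nodeCIDRMaskSize must lie within the pod CIDR")

    # Each Node receives one block of size /nodeCIDRMaskSize from the pod CIDR.
    max_nodes = 2 ** (node_cidr_mask_size - pod_network.prefixlen)
    # Theoretical number of pod IP addresses in a single Node's block.
    pod_ips_per_node = 2 ** (pod_network.max_prefixlen - node_cidr_mask_size)

    return {
        "max_nodes": max_nodes,
        "pod_ips_per_node": pod_ips_per_node,
        # maxPods should stay below half of the theoretical addresses.
        "max_pods_below": pod_ips_per_node // 2,
    }


if __name__ == "__main__":
    print(node_capacity("10.0.0.0/16", 24))
```

For a /16 pod network and a /24 nodeCIDRMaskSize, this yields 256 Nodes and 256 pod IP addresses per Node, so maxPods should stay below 128. The default of 110 Pods per Node satisfies this.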

Cloud provider capacity:

Most of the time, Nodes in Kubernetes translate to virtual machines on a hyperscaler. An attempt to add more Nodes to a cluster might fail due to capacity issues resulting in an error message like this:

Cloud provider message - machine codes error: code = [Internal] message = [InsufficientInstanceCapacity: We currently do not have sufficient <instance type> capacity in the Availability Zone you requested. Our system will be working on provisioning additional capacity. 

In heavily utilized regions, individual clusters are competing for scarce resources. So before choosing a region / zone, try to ensure that the hyperscaler supports your anticipated growth. This might be done through quota requests or by contacting the respective support teams. To mitigate such a situation, you may configure a worker pool with a different Node type and a corresponding priority expander as part of a shoot’s autoscaler section. Please consult the Autoscaler FAQ for more details.

Rolling of Node pools:

The overall number of Nodes is affecting the duration of a cluster’s maintenance. When upgrading a Node pool to a new OS image or Kubernetes version, all machines will be drained and deleted, and replaced with new ones. The more Nodes a cluster has, the longer this process will take, given that workloads are typically protected by PodDisruptionBudgets. Check Shoot Updates and Upgrades for details. Be sure to take this into consideration when planning maintenance.

Root disk:

You should be aware that the Node configuration impacts your workload’s performance too. Take the root disk of a Node, for example. While most hyperscalers offer the usage of HDD and SSD disks, it is strongly recommended to use SSD volumes as root disks. When there are lots of Pods on a Node or workloads making extensive use of emptyDir volumes, disk throttling becomes an issue. When a disk hits its IOPS limits, processes are stuck in IO-wait and slow down significantly. This can lead to a slow-down in the kubelet’s heartbeat mechanism and result in Nodes being replaced automatically, as they appear to be unhealthy. To analyze such a situation, you might have to run tools like iostat, sar or top directly on a Node.

Switching to an I/O optimized instance type (if offered for your infrastructure) can help to resolve the issue. Please keep in mind that disks used via PersistentVolumeClaims have I/O limits as well. Sometimes these limits are related to the size and/or can be increased for individual disks.

Cloud Provider (Infrastructure) Limits

In addition to the already mentioned capacity restrictions, a cloud provider may impose other limitations to a Kubernetes cluster’s scalability. One category is the account quota defining the number of resources allowed globally or per region. Make sure to request appropriate values that suit your needs and contain a buffer, for example for having more Nodes during a rolling update.

Another dimension is the network throughput per VM or network interface. While you may be able to choose a network-optimized Node type for your workload to mitigate issues, you cannot influence the available bandwidth for control plane components. Therefore, please ensure that the traffic on the ETCD does not exceed 100MB/s. The ETCD dashboard provides data for monitoring this metric.

In some environments the upstream DNS might become an issue too and make your workloads subject to rate limiting. Given the heterogeneity of cloud providers incl. private data centers, it is not possible to give any thresholds. Still, the “CoreDNS” and “NodeLocalDNS” dashboards can help to derive a workload’s usage pattern. Check the DNS autoscaling and NodeLocalDNS documentation for available configuration options.

Webhooks

While webhooks provide powerful means to manage a cluster, they are equally powerful in breaking a cluster upon a malfunction or unavailability. Imagine using a policy enforcing system like Kyverno or Open Policy Agent Gatekeeper. As part of the stack, both will deploy webhooks which are invoked for almost everything that happens in a cluster. Now, if this webhook gets either overloaded or is simply not available, the cluster will stop functioning properly.

Hence, you have to ensure proper sizing, quick processing time, and availability of the webhook serving Pods when deploying webhooks. Please consult Dynamic Admission Control (Availability and Timeouts sections) for details. You should also be aware of the time added to any request that has to go through a webhook, as the kube-apiserver sends the request for mutation / validation to another pod and waits for the response. The more resources being subject to an external webhook, the more likely this will become a bottleneck when having a high churn rate on resources. Within the Gardener monitoring stack, you can check the extra time per webhook via the “API Server (Admission Details)” dashboard, which has a panel for “Duration per Webhook”.

In Gardener, any webhook timeout should be less than 15 seconds. Due to the separation of Kubernetes data-plane (shoot) and control-plane (seed) in Gardener, the extra hop from kube-apiserver (control-plane) to webhook (data-plane) is more expensive. Please check Shoot Status for more details.
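
The timeout recommendation can be verified against existing webhook configurations. The following Python sketch assumes a ValidatingWebhookConfiguration or MutatingWebhookConfiguration has been loaded into a dictionary; the function name slow_webhooks is hypothetical. A missing timeoutSeconds field is treated as 10 seconds, which is the Kubernetes default for admissionregistration.k8s.io/v1.

```python
TIMEOUT_LIMIT_SECONDS = 15    # recommendation from this guide: stay below it
DEFAULT_TIMEOUT_SECONDS = 10  # Kubernetes default in admissionregistration.k8s.io/v1


def slow_webhooks(configuration: dict) -> list:
    """Return the names of webhooks whose timeout is not below 15 seconds."""
    offenders = []
    for webhook in configuration.get("webhooks") or []:
        timeout = webhook.get("timeoutSeconds", DEFAULT_TIMEOUT_SECONDS)
        if timeout >= TIMEOUT_LIMIT_SECONDS:
            offenders.append(webhook.get("name", "<unnamed>"))
    return offenders
```

Any webhook returned by this check is a candidate for a lower timeoutSeconds value.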

Custom Resource Definitions

Using Custom Resource Definitions (CRD) to extend a cluster’s API is a common Kubernetes pattern and so is writing an operator to act upon custom resources. Writing an efficient controller reduces the load on the kube-apiserver and allows for better scaling. As a starting point, you might want to read Gardener’s Kubernetes Clients Guide.

Another problematic dimension is the usage of conversion webhooks when having resources stored in different versions. Not only do they add latency (see Webhooks), but they can also block the kube-controller-manager’s garbage collection. If a conversion webhook is unavailable, the garbage collector fails to list all resources and does not perform any cleanup. To avoid such a situation, it is highly recommended to use conversion webhooks only when necessary and to complete the migration to a new version as soon as possible.

Conclusion

As outlined by SIG Scalability, it is nearly impossible to give limits or even recommendations fitting every individual use case. Instead, this guide outlines relevant dimensions and gives rather conservative recommendations based on usage patterns observed. By combining this information, it is possible to operate and scale a cluster in a stable manner.

While going beyond is certainly possible for some dimensions, it significantly increases the risk of instability. Typically, limits on the control-plane are introduced by the availability of resources like CPU or memory on a single machine and can hardly be influenced by any user. Therefore, utilizing the existing resources efficiently is key. Other parameters are controlled by a user. In these cases, careful testing may reveal actual limits for a specific use case.

Please keep in mind that all aspects of a workload greatly influence the stability and scalability of a Kubernetes cluster.

4.2 - Authenticating with an Identity Provider

Use OpenID Connect to authenticate users to access shoot clusters

Prerequisites

Please read the following background material on Authenticating.

Overview

Kubernetes on its own doesn’t provide any user management. In other words, users aren’t managed through Kubernetes resources. Whenever you refer to a human user it’s sufficient to use a unique ID, for example, an email address. Nevertheless, Gardener project owners can use an identity provider to authenticate user access for shoot clusters in the following way:

  1. Configure an Identity Provider using OpenID Connect (OIDC).
  2. Configure the local kubectl plugin oidc-login to enable OIDC-based login.
  3. Configure the shoot cluster to share details of the OIDC-compliant identity provider with the Kubernetes API Server.
  4. Authorize an authenticated user using role-based access control (RBAC).
  5. Verify the result

Configure an Identity Provider

Create a tenant in an OIDC compatible Identity Provider. For simplicity, we use Auth0, which has a free plan.

  1. In your tenant, create a client application to use authentication with kubectl:

    Create client application

  2. Provide a Name, choose Native as application type, and choose CREATE.

    Choose application type

  3. In the tab Settings, copy the following parameters to a local text file:

    • Domain

      Corresponds to the issuer in OIDC. It must be an https-secured endpoint (Auth0 requires a trailing / at the end). For more information, see Issuer Identifier.

    • Client ID

    • Client Secret

      Basic information

  4. Configure the client to have a callback url of http://localhost:8000. This callback connects to your local kubectl oidc-login plugin:

    Configure callback

  5. Save your changes.

  6. Verify that https://<Auth0 Domain>/.well-known/openid-configuration is reachable.

  7. Choose Users & Roles > Users > CREATE USERS to create a user with an email address and password:

    Create user
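
The reachability check from step 6 can be scripted. This is a minimal sketch under assumptions: the helper name oidc_discovery_url is hypothetical, and example.eu.auth0.com stands in for the Domain value you copied from Auth0.

```shell
# oidc_discovery_url prints the OIDC discovery URL for an issuer domain
# (hypothetical helper; a trailing slash on the domain is tolerated).
oidc_discovery_url() {
  printf 'https://%s/.well-known/openid-configuration\n' "${1%/}"
}

# Usage (requires curl and network access; the domain is a placeholder):
#   curl --fail --silent --show-error "$(oidc_discovery_url example.eu.auth0.com)"
```

With --fail, curl exits with a non-zero status when the endpoint does not answer successfully, which makes the check usable in scripts.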

Configure a Local kubectl oidc-login

  1. Install the kubectl plugin oidc-login. We highly recommend the krew installation tool, which also makes other plugins easily available.

    kubectl krew install oidc-login
    

    The response looks like this:

    Updated the local copy of plugin index.
    Installing plugin: oidc-login
    CAVEATS:
    \
    |  You need to setup the OIDC provider, Kubernetes API server, role binding and kubeconfig.
    |  See https://github.com/int128/kubelogin for more.
    /
    Installed plugin: oidc-login
    
  2. Prepare a kubeconfig for later use:

    cp ~/.kube/config ~/.kube/config-oidc
    
  3. Modify the configuration of ~/.kube/config-oidc as follows:

    apiVersion: v1
    kind: Config
    
    ...
    
    contexts:
    - context:
        cluster: shoot--project--mycluster
        user: my-oidc
      name: shoot--project--mycluster
    
    ...
    
    users:
    - name: my-oidc
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: kubectl
          args:
          - oidc-login
          - get-token
          - --oidc-issuer-url=https://<Issuer>/ 
          - --oidc-client-id=<Client ID>
          - --oidc-client-secret=<Client Secret>
          - --oidc-extra-scope=email,offline_access,profile
    

To test our OIDC-based authentication, the context shoot--project--mycluster of ~/.kube/config-oidc is used in a later step. For now, continue to use the configuration ~/.kube/config with administration rights for your cluster.

Configure the Shoot Cluster

Modify the shoot cluster YAML as follows, using the client ID and the domain (as issuer) from the settings of the client application you created in Auth0:

kind: Shoot
apiVersion: garden.sapcloud.io/v1beta1
metadata:
  name: mycluster
  namespace: garden-project
...
spec:
  kubernetes:
    kubeAPIServer:
      oidcConfig:
        clientID: <Client ID>
        issuerURL: "https://<Issuer>/"
        usernameClaim: email

This change of the Shoot manifest triggers a reconciliation. Once the reconciliation is finished, your OIDC configuration is applied. It doesn’t invalidate other certificate-based authentication methods. Wait for Gardener to reconcile the change. It can take up to 5 minutes.
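
Instead of waiting a fixed time, the state of the Shoot's last operation can be polled. The sketch below assumes kubectl access to the Garden cluster; the helper names wait_until_succeeded and shoot_state and the POLL_INTERVAL variable are illustrative, while mycluster and garden-project are the names used in the manifest above.

```shell
# wait_until_succeeded polls a command until it prints "Succeeded".
# $1 = maximum number of attempts, remaining arguments = command to run.
# POLL_INTERVAL (seconds between attempts) is an assumed knob, default 10.
wait_until_succeeded() {
  _tries="$1"
  shift
  while [ "$_tries" -gt 0 ]; do
    if [ "$("$@")" = "Succeeded" ]; then
      return 0
    fi
    _tries=$((_tries - 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  return 1
}

# shoot_state prints the state of the last operation of a Shoot.
# $1 = shoot name, $2 = project namespace
shoot_state() {
  kubectl get shoot "$1" -n "$2" -o jsonpath='{.status.lastOperation.state}'
}

# Usage, polling for up to 5 minutes (30 attempts x 10 seconds):
#   wait_until_succeeded 30 shoot_state mycluster garden-project
```

Right after the manifest change, the previous operation may still be reported as Succeeded, so it is advisable to wait a moment before starting to poll.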

Authorize an Authenticated User

In Auth0, you created a user with a verified email address, test@test.com in our example. For simplicity, we authorize a single user identified by this email address with the cluster role view:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: viewer-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: test@test.com

As administrator, apply the cluster role binding in your shoot cluster.

Verify the Result

  1. To step into the shoes of your user, use the prepared kubeconfig file ~/.kube/config-oidc, and switch to the context that uses oidc-login:

    cd ~/.kube
    export KUBECONFIG=$(pwd)/config-oidc
    kubectl config use-context shoot--project--mycluster
    
  2. kubectl delegates the authentication to plugin oidc-login the first time the user uses kubectl to contact the API server, for example:

    kubectl get all
    

    The plugin opens a browser for an interactive authentication session with Auth0, and in parallel serves a local webserver for the configured callback.

  3. Enter your login credentials.

    Login through identity provider

    You should get a successful response from the API server:

    Opening in existing browser session.
    NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    service/kubernetes   ClusterIP   100.64.0.1   <none>        443/TCP   86m
    
  4. To see if your user uses the cluster role view, do some checks with kubectl auth can-i.

    • The response for the following commands should be no:

      kubectl auth can-i create clusterrolebindings
      
      kubectl auth can-i get secrets
      
      kubectl auth can-i describe secrets
      
    • The response for the following commands should be yes:

      kubectl auth can-i list pods
      
      kubectl auth can-i get pods
      

If the last step is successful, you’ve configured your cluster to authenticate against an identity provider using OIDC.
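
The can-i checks above can be bundled into a small script. This is a sketch only; expect_can_i is a hypothetical helper that compares the printed answer with the expected one.

```shell
# expect_can_i compares the answer of "kubectl auth can-i" with an
# expected value ("yes" or "no"); hypothetical helper for this guide.
# $1 = expected answer, remaining arguments = verb and resource
expect_can_i() {
  _expected="$1"
  shift
  # can-i exits non-zero for "no", so only the printed answer is used.
  _answer="$(kubectl auth can-i "$@" 2>/dev/null || true)"
  if [ "$_answer" = "$_expected" ]; then
    echo "ok: can-i $* -> $_answer"
  else
    echo "MISMATCH: can-i $* -> $_answer (expected $_expected)"
    return 1
  fi
}

# Usage with the view role from this guide:
#   expect_can_i no create clusterrolebindings
#   expect_can_i no get secrets
#   expect_can_i yes list pods
#   expect_can_i yes get pods
```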

4.3 - Backup and Restore of Kubernetes Objects

Details about backup and recovery of Kubernetes objects based on the open source tool Velero.

Don’t worry … have a backup

TL;DR

In general, Backup and Restore (BR) covers activities enabling an organization to bring a system back in a consistent state, e.g., after a disaster or to set up a new system. These activities vary widely depending on the applications and their persistence.

Kubernetes objects like Pods, Deployments, NetworkPolicies, etc. configure Kubernetes internal components and might as well include external components like load balancer and persistent volumes of the cloud provider. The BR of external components and their configurations might be difficult to handle in case manual configurations were needed to prepare these components.

To set the expectations right from the beginning, this tutorial covers the BR of Kubernetes deployments which might use persistent volumes. The BR of any manual configuration of external components, e.g., via the cloud providers console, is not covered here, as well as the BR of a whole Kubernetes system.

This tutorial puts the focus on the open source tool Velero (formerly Heptio Ark) and its functionality to explain the BR process.

Basically, Velero allows you to:

  • backup and restore your Kubernetes cluster resources and persistent volumes (on-demand or scheduled)
  • backup or restore all objects in your cluster, or filter resources by type, namespace, and/or label
  • by default, all persistent volumes are backed up (configurable)
  • replicate your production environment for development and testing environments
  • define an expiration date per backup
  • execute pre- and post-activities in a container of a pod when a backup is created (see Hooks)
  • extend Velero by Plugins, e.g., for Object and Block store (see Plugins)

Velero consists of a server-side component and a client tool. The server component consists of Custom Resource Definitions (CRD) and controllers to perform the activities. The client tool communicates with the K8s API server to, e.g., create objects like a Backup object.

The diagram below explains the backup process. When creating a backup, Velero client makes a call to the Kubernetes API server to create a Backup object (1). The BackupController notices the new Backup object, validates the object (2) and begins the backup process (3). Based on the filter settings provided by the Velero client it collects the resources in question (3). The BackupController creates a tar ball with the Kubernetes objects and stores it in the backup location, e.g., AWS S3 (4) as well as snapshots of persistent volumes (5).

The size of the backup tar ball corresponds to the number of objects in etcd. The gzipped archive contains the JSON representations of the objects.

Backup process

Getting Started

First, clone the Velero GitHub repository and get the Velero client from the releases or build it from source via make all in the main directory of the cloned GitHub repository.

To use an AWS S3 bucket as storage for the backup files and the persistent volumes, you need to:

  • create an S3 bucket as the backup target
  • create an AWS IAM user for Velero
  • configure the Velero server
  • create a secret for your AWS credentials

For details about this setup, check the Set Permissions for Velero documentation. Moreover, it is possible to use other supported storage providers.

Velero offers a wide range of filter possibilities for Kubernetes resources, e.g., filter by namespaces, labels, or resource types. The filter settings can be combined and used as include or exclude, which gives great flexibility for selecting resources.
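
As an illustration of combining filters, the following sketch wraps velero backup create with a namespace filter, a label selector, and an excluded resource type. The helper name backup_by_namespace_and_label and the choice to exclude events are assumptions made for this example.

```shell
# backup_by_namespace_and_label creates a backup restricted to one
# namespace and one label selector, leaving out Event objects
# (hypothetical helper).
# $1 = backup name, $2 = namespace, $3 = label selector (for example app=web)
backup_by_namespace_and_label() {
  velero backup create "$1" \
    --include-namespaces "$2" \
    --selector "$3" \
    --exclude-resources events
}

# Usage (backup name, namespace, and label are placeholders):
#   backup_by_namespace_and_label nightly demo app=web
```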

Exemplary Use Cases

Below are some use cases which could give you an idea on how to use Velero. You can also check Velero’s documentation for other introductory examples.

Helm Based Deployments

To be able to use Helm charts in your Kubernetes cluster, you need to install the Helm client helm and the server component tiller. Per default the server component is installed in the namespace kube-system. Even if it is possible to select single deployments via the filter settings of Velero, you should consider installing tiller in a separate namespace via helm init --tiller-namespace <your namespace>. This approach applies as well to all Helm charts to be deployed - consider separate namespaces for your deployments as well by using the parameter --namespace.

To back up a Helm based deployment, you need to back up both Tiller and the deployment. Only then can the deployments be managed via Helm. As mentioned above, the selection of resources is easier if they are separated into namespaces.

Separate Backup Locations

In case you run all your Kubernetes clusters on a single cloud provider, there is probably no need to store the backups in a bucket of a different cloud provider. However, if you run Kubernetes clusters on different cloud providers, you might consider using a bucket on just one cloud provider as the target for the backups, e.g., to benefit from a lower price tag for the storage.

Per default, Velero assumes that both the persistent volumes and the backup location are on the same cloud provider. During the setup of Velero, a secret is created using the credentials for a cloud provider user who has access to both objects (see the policies, e.g., for the AWS configuration).

Now, since the backup location is different from the volume location, you need to follow these steps (described here for AWS):

  • configure the volume storage location in examples/aws/06-volumesnapshotlocation.yaml as documented and provide the user credentials. In this case, the S3 related settings like the policies can be omitted

  • create the bucket for the backup in the cloud provider in question and a user with the appropriate credentials and store them in a separate file similar to credentials-ark

  • create a secret which contains two credentials, one for the volumes and one for the backup target, e.g., by using the command kubectl create secret generic cloud-credentials --namespace velero --from-file cloud=credentials-ark --from-file backup-target=backup-ark

  • configure in the deployment manifest examples/aws/10-deployment.yaml the entries in volumeMounts, env and volumes accordingly, e.g., for a cluster running on AWS and the backup target bucket on GCP a configuration could look similar to:

    Example Velero deployment
    # Copyright 2017 the Heptio Ark contributors.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      namespace: velero
      name: velero
    spec:
      replicas: 1
      selector:
        matchLabels:
          component: velero
      template:
        metadata:
          labels:
            component: velero
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "8085"
            prometheus.io/path: "/metrics"
        spec:
          restartPolicy: Always
          serviceAccountName: velero
          containers:
            - name: velero
              image: gcr.io/heptio-images/velero:latest
              command:
                - /velero
              args:
                - server
              volumeMounts:
                - name: cloud-credentials
                  mountPath: /credentials
                - name: plugins
                  mountPath: /plugins
                - name: scratch
                  mountPath: /scratch
              env:
                - name: AWS_SHARED_CREDENTIALS_FILE
                  value: /credentials/cloud
                - name: GOOGLE_APPLICATION_CREDENTIALS
                  value: /credentials/backup-target
                - name: VELERO_SCRATCH_DIR
                  value: /scratch
          volumes:
            - name: cloud-credentials
              secret:
                secretName: cloud-credentials
            - name: plugins
              emptyDir: {}
            - name: scratch
              emptyDir: {}
    
  • finally, configure the backup storage location in examples/aws/05-backupstoragelocation.yaml to use, in this case, a GCP bucket

Limitations

Below is a potentially incomplete list of limitations. You can also consult Velero’s documentation to get up to date information.

  • Only full backups of selected resources are supported. Incremental backups are not (yet) supported. However, by using filters it is possible to restrict the backup to specific resources
  • Inconsistencies might occur in case of changes during the creation of the backup
  • Application specific actions are not considered by default. However, they might be handled by using Velero’s Hooks or Plugins

4.4 - Create / Delete a Shoot Cluster

Create a Shoot Cluster

As you have already prepared an example Shoot manifest in the steps described in the development documentation, please open another Terminal pane/window with the KUBECONFIG environment variable pointing to the Garden development cluster and send the manifest to the Kubernetes API server:

kubectl apply -f your-shoot-aws.yaml

You should see that Gardener has immediately picked up your manifest and has started to deploy the Shoot cluster.

In order to investigate what is happening in the Seed cluster, please download its proper Kubeconfig yourself (see next paragraph). The namespace of the Shoot cluster in the Seed cluster will look like this: shoot-johndoe-johndoe-1, where the first johndoe is your namespace in the Garden cluster (also called “project”) and the johndoe-1 suffix is the actual name of the Shoot cluster.

To connect to the newly created Shoot cluster, you must download its Kubeconfig as well. Please connect to the proper Seed cluster, navigate to the Shoot namespace, and download the Kubeconfig from the kubecfg secret in that namespace.

Delete a Shoot Cluster

In order to delete your cluster, you have to set an annotation confirming the deletion first, and trigger the deletion after that. You can use the prepared delete shoot script which takes the Shoot name as first parameter. The namespace can be specified by the second parameter, but it is optional. If you don’t state it, it defaults to your namespace (the username you are logged in with to your machine).

./hack/usage/delete shoot johndoe-1 johndoe

(the hack bash script can be found at GitHub)
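
The two steps the script performs can also be sketched by hand. Note the assumptions: the annotation key confirmation.gardener.cloud/deletion is not named in this guide and may differ in your Gardener version, and delete_shoot is a hypothetical helper that mirrors the script's parameters.

```shell
# delete_shoot confirms and then triggers the deletion of a Shoot.
# $1 = shoot name, $2 = namespace (optional, defaults to the local username)
# The annotation key below is an assumption; check it against your
# Gardener version before use.
delete_shoot() {
  _name="$1"
  _namespace="${2:-$(whoami)}"
  kubectl annotate shoot "$_name" -n "$_namespace" \
    confirmation.gardener.cloud/deletion=true --overwrite &&
    kubectl delete shoot "$_name" -n "$_namespace" --wait=false
}

# Usage:
#   delete_shoot johndoe-1 johndoe
```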

Configure a Shoot Cluster Alert Receiver

The receiver of the Shoot alerts can be configured from the .spec.monitoring.alerting.emailReceivers section in the Shoot specification. The value of the field has to be a list of valid mail addresses.

The alerting for the Shoot clusters is handled by the Prometheus Alertmanager. The Alertmanager will be deployed next to the control plane when the Shoot resource specifies .spec.monitoring.alerting.emailReceivers and if an SMTP secret exists.

If the field gets removed, then the Alertmanager will also be removed during the next reconciliation of the cluster. Conversely, if the field is added to an existing cluster, the Alertmanager will be deployed during the next reconciliation.
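
One way to set the receivers on an existing Shoot is a JSON merge patch. The sketch below only builds the patch document; email_receivers_patch is a hypothetical helper, and the shoot name, namespace, and addresses in the usage comment are placeholders.

```shell
# email_receivers_patch prints a JSON merge patch that sets
# .spec.monitoring.alerting.emailReceivers to the given addresses
# (hypothetical helper; addresses are not validated or escaped).
email_receivers_patch() {
  _list=""
  for _addr in "$@"; do
    _list="${_list:+$_list,}\"$_addr\""
  done
  printf '{"spec":{"monitoring":{"alerting":{"emailReceivers":[%s]}}}}\n' "$_list"
}

# Usage (shoot name and namespace are placeholders):
#   kubectl patch shoot johndoe-1 -n johndoe --type merge \
#     -p "$(email_receivers_patch john.doe@example.com ops@example.com)"
```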

4.5 - Create a Shoot Cluster Into an Existing AWS VPC

Overview

Gardener can create a new VPC, or use an existing one for your shoot cluster. Depending on your needs, you may want to create shoot(s) in an already created VPC. This tutorial describes how to create a shoot cluster in an existing AWS VPC. The steps are identical for Alicloud, Azure, and GCP. Please note that the existing VPC must be in the same region as the shoot cluster that you want to deploy into the VPC.

TL;DR

If .spec.provider.infrastructureConfig.networks.vpc.cidr is specified, Gardener will create a new VPC with the given CIDR block and will delete it on shoot deletion.
If .spec.provider.infrastructureConfig.networks.vpc.id is specified, Gardener will use the existing VPC and won’t delete it on shoot deletion.

1. Configure the AWS CLI

The aws configure command is a convenient way to set up your AWS CLI. It will prompt you for your credentials and settings which will be used in the following AWS CLI invocations:

aws configure
AWS Access Key ID [None]: <ACCESS_KEY_ID>
AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
Default region name [None]: <DEFAULT_REGION>
Default output format [None]: <DEFAULT_OUTPUT_FORMAT>

2. Create a VPC

Create the VPC by running the following command:

aws ec2 create-vpc --cidr-block <cidr-block>
{
  "Vpc": {
      "VpcId": "vpc-ff7bbf86",
      "InstanceTenancy": "default",
      "Tags": [],
      "CidrBlockAssociations": [
          {
              "AssociationId": "vpc-cidr-assoc-6e42b505",
              "CidrBlock": "10.0.0.0/16",
              "CidrBlockState": {
                  "State": "associated"
              }
          }
      ],
      "Ipv6CidrBlockAssociationSet": [],
      "State": "pending",
      "DhcpOptionsId": "dopt-38f7a057",
      "CidrBlock": "10.0.0.0/16",
      "IsDefault": false
  }
}

Gardener requires the VPC to have DNS support enabled, i.e., the attributes enableDnsSupport and enableDnsHostnames must be set to true. The enableDnsSupport attribute is enabled by default, whereas enableDnsHostnames is not. Set the enableDnsHostnames attribute to true:

aws ec2 modify-vpc-attribute --vpc-id vpc-ff7bbf86 --enable-dns-hostnames
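
To confirm that the attribute is set, it can be read back with describe-vpc-attribute. This is a minimal sketch; vpc_dns_hostnames_enabled is a hypothetical helper, and it relies on the AWS CLI printing True for a boolean in text output.

```shell
# vpc_dns_hostnames_enabled succeeds if enableDnsHostnames is true
# for the given VPC (hypothetical helper).
# $1 = VPC ID
vpc_dns_hostnames_enabled() {
  _value="$(aws ec2 describe-vpc-attribute --vpc-id "$1" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' --output text)"
  [ "$_value" = "True" ]
}

# Usage:
#   vpc_dns_hostnames_enabled vpc-ff7bbf86 && echo "DNS hostnames enabled"
```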

3. Create an Internet Gateway

Gardener also requires that an internet gateway is attached to the VPC. You can create one by using:

aws ec2 create-internet-gateway
{
    "InternetGateway": {
        "Tags": [],
        "InternetGatewayId": "igw-c0a643a9",
        "Attachments": []
    }
}

and attach it to the VPC using:

aws ec2 attach-internet-gateway --internet-gateway-id igw-c0a643a9 --vpc-id vpc-ff7bbf86
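
The attachment can be double-checked by listing the internet gateways attached to the VPC. A sketch, with attached_internet_gateways as a hypothetical helper name:

```shell
# attached_internet_gateways prints the IDs of all internet gateways
# attached to the given VPC (hypothetical helper).
# $1 = VPC ID
attached_internet_gateways() {
  aws ec2 describe-internet-gateways \
    --filters "Name=attachment.vpc-id,Values=$1" \
    --query 'InternetGateways[].InternetGatewayId' --output text
}

# Usage:
#   attached_internet_gateways vpc-ff7bbf86
```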

4. Create the Shoot

Prepare your shoot manifest (you could check the example manifests). Please make sure that you choose the region in which you created the VPC earlier (step 2). Also, put your VPC ID in the .spec.provider.infrastructureConfig.networks.vpc.id field:

spec:
  region: <aws-region-of-vpc>
  provider:
    type: aws
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      networks:
        vpc:
          id: vpc-ff7bbf86
    # ...

Apply your shoot manifest:

kubectl apply -f your-shoot-aws.yaml

Ensure that the shoot cluster is properly created:

kubectl get shoot $SHOOT_NAME -n $SHOOT_NAMESPACE
NAME           CLOUDPROFILE   VERSION   SEED   DOMAIN           OPERATION   PROGRESS   APISERVER   CONTROL   NODES   SYSTEM   AGE
<SHOOT_NAME>   aws            1.15.0    aws    <SHOOT_DOMAIN>   Succeeded   100        True        True      True    True     20m

4.6 - Fix Problematic Conversion Webhooks

Reasoning

A Custom Resource Definition (CRD) is what you use to define a Custom Resource. This is a powerful way to extend Kubernetes capabilities beyond the default installation, adding any kind of API objects useful for your application.

The CustomResourceDefinition API provides a workflow for introducing and upgrading to new versions of a CustomResourceDefinition. In a scenario where a CRD adds support for a new version and switches its spec.versions.storage field to it (e.g., from v1beta1 to v1), existing objects are not migrated in etcd. For more information, see Versions in CustomResourceDefinitions.

This creates a mismatch between the requested and stored version for all clients (kubectl, KCM, etc.). When the CRD also declares the usage of a conversion webhook, it gets called whenever a client requests information about a resource that still exists in the old version. If the CRD is created by the end-user, the webhook runs on the shoot side, whereas controllers / kube-apiservers run separately, as part of the control-plane. For the webhook to be reachable, a working VPN connection seed -> shoot is essential. In scenarios where the VPN connection is broken, the kube-controller-manager eventually stops its garbage collection, as that requires it to list v1.PartialObjectMetadata for everything to build a dependency graph. Without the kube-controller-manager’s garbage collector, managed resources get stuck during update/rollout.

Breaking Situations

When a user upgrades to failureTolerance: node|zone, that will cause the VPN deployments to be replaced by statefulsets. However, as the VPN connection is broken upon teardown of the deployment, garbage collection will fail, leading to a situation that is stuck until an operator manually tackles it.

Such a situation can be avoided if the end-user has correctly configured CRDs containing conversion webhooks.

Checking Problematic CRDs

In order to make sure there are no CRDs with problematic versions, please run the script below in your shoot. It returns the names of the CRDs that have one of the following two problems:

  • the returned version of the CR is different from what is maintained in the status.storedVersions field of the CRD.
  • the status.storedVersions field of the CRD has more than 1 version defined.

#!/bin/bash

set -e -o pipefail

echo "Checking all CRDs in the cluster..."
for p in $(kubectl get crd | awk 'NR>1' | awk '{print $1}'); do
  strategy=$(kubectl get crd "$p" -o json | jq -r .spec.conversion.strategy)

  if [ "$strategy" == "Webhook" ]; then
     crd_name=$(kubectl get crd "$p" -o json | jq -r .metadata.name)

     number_of_stored_versions=$(kubectl get crd "$crd_name" -o json  | jq '.status.storedVersions | length')

      if [[ "$number_of_stored_versions" == 1 ]]; then
         returned_cr_version=$(kubectl get "$crd_name" -A -o json |  jq -r '.items[] | .apiVersion'  | sed 's:.*/::')
         if [ -z "$returned_cr_version" ]; then
           continue
         else
           variable=$(echo "$returned_cr_version" | xargs -n1 | sort -u | xargs)
           present_version=$(kubectl get crd "$crd_name" -o json  |  jq -cr '.status.storedVersions |.[]')
           if [[ $variable != "$present_version" ]]; then
             echo "ERROR: Stored version differs from the version that CRs are being returned. $crd_name with conversion webhook needs to be fixed"
           fi
         fi
      fi

      if [[ "$number_of_stored_versions" -gt 1 ]]; then
         returned_cr_version=$(kubectl get "$crd_name" -A -o json |  jq -r '.items[] | .apiVersion'  | sed 's:.*/::')
         if [ -z "$returned_cr_version" ]; then
           continue
         else
           echo "ERROR: Too many stored versions defined. $crd_name with conversion webhook needs to be fixed"
         fi
      fi
  fi
done
echo "Problematic CRDs are reported above."

Resolve CRDs

Below we give the steps needed to be taken in order to fix the CRDs reported by the script above.

Inspect all your CRDs that have conversion webhooks in place. If you have more than 1 version defined in its status.storedVersions field, then initiate migration as described in Option 2 in the Upgrade existing objects to a new stored version guide.

For convenience, we have provided the necessary steps below.

  1. Please check/set the old CR version to storage:false and set the new CR version to storage:true.

    For the sake of an example, let’s consider the two versions v1beta1 (old) and v1 (new).

    Before:

    spec:
      versions:
      - name: v1beta1
        ......
        storage: true

      - name: v1
        ......
        storage: false
    

    After:

    spec:
      versions:
      - name: v1beta1
        ......
        storage: false

      - name: v1
        ......
        storage: true
    
  2. Convert custom-resources to the newest version.

    kubectl get <custom-resource-name> -A -ojson | kubectl apply -f -
    
  3. Patch the CRD to keep only the latest version under storedVersions.

    kubectl patch customresourcedefinitions <crd-name> --subresource='status' --type='merge' -p '{"status":{"storedVersions":["your-latest-cr-version"]}}'
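
After the patch, the result can be verified by reading status.storedVersions back. The following is a sketch; stored_versions_clean is a hypothetical helper, and foos.example.com in the usage comment is a placeholder CRD name.

```shell
# stored_versions_clean succeeds if the CRD lists exactly the expected
# version in status.storedVersions (hypothetical helper).
# $1 = CRD name, $2 = expected (latest) version
stored_versions_clean() {
  _stored="$(kubectl get crd "$1" -o jsonpath='{.status.storedVersions[*]}')"
  [ "$_stored" = "$2" ]
}

# Usage (the CRD name is a placeholder):
#   stored_versions_clean foos.example.com v1 && echo "only v1 is stored"
```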
    

4.7 - GPU Enabled Cluster

Setting up a GPU Enabled Cluster for Deep Learning

Disclaimer

Be aware that the following sections might be opinionated. Kubernetes, and the GPU support in particular, are rapidly evolving, which means that this guide is likely to be outdated sometime soon. For this reason, contributions are highly appreciated to update this guide.

Create a Cluster

First things first, let’s create a Kubernetes (K8s) cluster with GPU accelerated nodes. In this example we will use an AWS p2.xlarge EC2 instance because it’s the cheapest available option at the moment. Use such cheap instances for learning to limit your resource costs. This costs around 1€/hour per GPU.

gpu-selection

Install NVIDIA Driver as DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
  labels:
    k8s-app: nvidia-driver-installer
spec:
  selector:
    matchLabels:
      name: nvidia-driver-installer
      k8s-app: nvidia-driver-installer
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
        k8s-app: nvidia-driver-installer
    spec:
      hostPID: true
      initContainers:
      - image: squat/modulus:4a1799e7aa0143bcbb70d354bab3e419b1f54972
        name: modulus
        args:
        - compile
        - nvidia
        - "410.104"
        securityContext:
          privileged: true
        env:
        - name: MODULUS_CHROOT
          value: "true"
        - name: MODULUS_INSTALL
          value: "true"
        - name: MODULUS_INSTALL_DIR
          value: /opt/drivers
        - name: MODULUS_CACHE_DIR
          value: /opt/modulus/cache
        - name: MODULUS_LD_ROOT
          value: /root
        - name: IGNORE_MISSING_MODULE_SYMVERS
          value: "1"          
        volumeMounts:
        - name: etc-coreos
          mountPath: /etc/coreos
          readOnly: true
        - name: usr-share-coreos
          mountPath: /usr/share/coreos
          readOnly: true
        - name: ld-root
          mountPath: /root
        - name: module-cache
          mountPath: /opt/modulus/cache
        - name: module-install-dir-base
          mountPath: /opt/drivers
        - name: dev
          mountPath: /dev
      containers:
      - image: "gcr.io/google-containers/pause:3.1"
        name: pause
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: etc-coreos
        hostPath:
          path: /etc/coreos
      - name: usr-share-coreos
        hostPath:
          path: /usr/share/coreos
      - name: ld-root
        hostPath:
          path: /
      - name: module-cache
        hostPath:
          path: /opt/modulus/cache
      - name: dev
        hostPath:
          path: /dev
      - name: module-install-dir-base
        hostPath:
          path: /opt/drivers

Install Device Plugin

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-device-plugin
  namespace: kube-system
  labels:
    k8s-app: nvidia-gpu-device-plugin
    #addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-gpu-device-plugin
  template:
    metadata:
      labels:
        k8s-app: nvidia-gpu-device-plugin
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-node-critical
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
      containers:
      - image: "k8s.gcr.io/nvidia-gpu-device-plugin@sha256:08509a36233c5096bb273a492251a9a5ca28558ab36d74007ca2a9d3f0b61e1d"
        command: ["/usr/bin/nvidia-gpu-device-plugin", "-logtostderr", "-host-path=/opt/drivers/nvidia"]
        name: nvidia-gpu-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 10Mi
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /device-plugin
        - name: dev
          mountPath: /dev
  updateStrategy:
    type: RollingUpdate

Test

To run an example training job on a GPU node, first start a base image that provides TensorFlow with GPU support and Keras:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deeplearning-workbench
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deeplearning-workbench
  template:
    metadata:
      labels:
        app: deeplearning-workbench
    spec:
      containers:
      - name: deeplearning-workbench
        image: afritzler/deeplearning-workbench
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"

Now exec into the container and start an example Keras training:

kubectl exec -it deeplearning-workbench-8676458f5d-p4d2v -- /bin/bash
cd /keras/example
python imdb_cnn.py
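
The pod name used above contains generated suffixes that differ for every rollout. As a minimal sketch, assuming the pod carries the app=deeplearning-workbench label from the Deployment above and runs in the default namespace, the name can be looked up instead of copied by hand. The helper name workbench_pod is hypothetical:

```shell
# Hypothetical helper: print the name of the first pod created by the
# deeplearning-workbench Deployment (label and namespace taken from the
# manifest above).
workbench_pod() {
  kubectl get pods -n default -l app=deeplearning-workbench \
    -o jsonpath='{.items[0].metadata.name}'
}

# Usage (requires access to the cluster):
#   kubectl exec -it "$(workbench_pod)" -- /bin/bash
```

If the Deployment runs more than one replica, this picks whichever pod the API server lists first.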

4.8 - Shoot Cluster Maintenance

Understanding and configuring Gardener’s Day-2 operations for Shoot clusters.

Overview

Day two operations for shoot clusters are related to:

  • The Kubernetes version of the control plane and the worker nodes
  • The operating system version of the worker nodes

The following table summarizes what options Gardener offers to maintain these versions:

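|                          | Auto-Update  | Forceful Updates                           | Manual Updates |
|--------------------------|--------------|--------------------------------------------|----------------|
| Kubernetes version       | Patches only | Patches and consecutive minor updates only | yes            |
| Operating system version | yes          | yes                                        | yes            |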

Allowed Target Versions in the CloudProfile

Administrators maintain the allowed target versions that you can update to in the CloudProfile for each IaaS-Provider. Users with access to a Gardener project can check supported target versions with:

kubectl get cloudprofile [IAAS-SPECIFIC-PROFILE] -o yaml
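
| Path                     | Description                                               | More Information |
|--------------------------|-----------------------------------------------------------|------------------|
| spec.kubernetes.versions | The supported Kubernetes version major.minor.patch.       | Patch releases   |
| spec.machineImages       | The supported operating system versions for worker nodes. |                  |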

Both the Kubernetes version and the operating system version follow semantic versioning that allows Gardener to handle updates automatically.

For more information, see Semantic Versioning.

Impact of Version Classifications on Updates

Gardener allows you to classify versions in the CloudProfile as preview, supported, deprecated, or expired. During maintenance operations, preview versions are excluded from updates, because they’re often recently released versions that haven’t yet undergone thorough testing and may contain bugs or security issues.

For more information, see Version Classifications.

Let Gardener Manage Your Updates

The Maintenance Window

Gardener can manage updates for you automatically. You can specify a maintenance window during which updates are scheduled:

  • The time interval of the maintenance window can’t be less than 30 minutes or more than 6 hours.
  • If there’s no maintenance window specified during the creation of a shoot cluster, Gardener chooses a maintenance window randomly to spread the load.

You can either specify the maintenance window in the shoot cluster specification (.spec.maintenance.timeWindow) or the start time of the maintenance window using the Gardener dashboard (CLUSTERS > [YOUR-CLUSTER] > OVERVIEW > Lifecycle > Maintenance).

Auto-Update and Forceful Updates

To trigger updates during the maintenance window automatically, Gardener offers the following methods:

  • Auto-update:
    Gardener starts an update during the next maintenance window whenever there’s a version available in the CloudProfile that is higher than the one of your shoot cluster specification, and that isn’t classified as preview version. For Kubernetes versions, auto-update only updates to higher patch levels.

    You can either activate auto-update on the Gardener dashboard (CLUSTERS > [YOUR-CLUSTER] > OVERVIEW > Lifecycle > Maintenance) or in the shoot cluster specification:

    • .spec.maintenance.autoUpdate.kubernetesVersion: true
    • .spec.maintenance.autoUpdate.machineImageVersion: true
  • Forceful updates:
    In the maintenance window, Gardener compares the current version given in the shoot cluster specification with the version list in the CloudProfile. If the version has an expiration date and if the date is before the start of the maintenance window, Gardener starts an update to the highest version available in the CloudProfile that isn’t classified as preview version. The highest version in CloudProfile can’t have an expiration date. For Kubernetes versions, Gardener only updates to higher patch levels or consecutive minor versions.

If you don’t want to wait for the next maintenance window, you can annotate the shoot cluster specification with shoot.gardener.cloud/operation: maintain. Gardener then checks immediately if there’s an auto-update or a forceful update needed.
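
As a rough sketch, the annotation can be set with kubectl against the garden cluster. The helper name trigger_maintenance and the example namespace are assumptions. The annotation key and value are the ones named above:

```shell
# Hypothetical helper: ask Gardener to run the maintenance checks for a shoot
# right away instead of waiting for the next maintenance window.
# $1 = project namespace in the garden cluster (assumption: e.g. garden-myproject)
# $2 = shoot name
trigger_maintenance() {
  local namespace="$1" shoot="$2"
  kubectl -n "$namespace" annotate shoot "$shoot" \
    shoot.gardener.cloud/operation=maintain
}

# Usage (requires access to the garden cluster):
#   trigger_maintenance garden-myproject my-shoot
```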

With expiration dates, administrators can give shoot cluster owners more time for testing before the actual version update happens, which allows for smoother transitions to new versions.

Kubernetes Update Paths

The larger the gap between the Kubernetes source version and the Kubernetes target version, the more carefully operators must plan and execute the update. Gardener only provides automatic support for updates that can be applied safely to the cluster workload:

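| Update Type                         | Example            | Update Method                  |
|-------------------------------------|--------------------|--------------------------------|
| Patches                             | 1.10.12 to 1.10.13 | Auto-update or forceful update |
| Update to consecutive minor version | 1.10.12 to 1.11.10 | Forceful update                |
| Other                               | 1.10.12 to 1.12.0  | Manual update                  |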

Gardener doesn’t support automatic updates of nonconsecutive minor versions, because Kubernetes doesn’t guarantee updateability in this case. However, multiple minor version updates are possible if not only the minor source version is expired, but also the minor target version is expired. Gardener then updates the Kubernetes version first to the expired target version, and waits for the next maintenance window to update this version to the next minor target version.
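
The update paths described here can be sketched as a small shell function. This is an illustration only, not Gardener's implementation. The name update_method is hypothetical, and the function assumes plain major.minor.patch versions without leading zeros and a target that is higher than the source:

```shell
# Hypothetical helper: classify a Kubernetes version change according to the
# update paths described above.
update_method() {
  local src="$1" dst="$2"
  local src_major="${src%%.*}" dst_major="${dst%%.*}"
  local src_rest="${src#*.}" dst_rest="${dst#*.}"
  local src_minor="${src_rest%%.*}" dst_minor="${dst_rest%%.*}"
  local src_patch="${src_rest#*.}" dst_patch="${dst_rest#*.}"

  if [ "$src_major" -eq "$dst_major" ] && [ "$src_minor" -eq "$dst_minor" ] \
     && [ "$dst_patch" -gt "$src_patch" ]; then
    echo "auto-update or forceful update"   # higher patch level only
  elif [ "$src_major" -eq "$dst_major" ] \
       && [ "$dst_minor" -eq $((src_minor + 1)) ]; then
    echo "forceful update"                  # consecutive minor version
  else
    echo "manual update"                    # everything else
  fi
}

update_method 1.10.12 1.10.13   # prints: auto-update or forceful update
update_method 1.10.12 1.11.10   # prints: forceful update
update_method 1.10.12 1.12.0    # prints: manual update
```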

Manual Updates

To update the Kubernetes version or the node operating system manually, change the .spec.kubernetes.version field or the .spec.provider.workers.machine.image.version field correspondingly.
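
A minimal sketch of such a manual change for the Kubernetes version, using a merge patch against the garden cluster. The helper name set_kubernetes_version and the example namespace are assumptions. Editing the shoot specification in any other way works just as well:

```shell
# Hypothetical helper: set .spec.kubernetes.version of a shoot with a merge patch.
# $1 = project namespace in the garden cluster (assumption: e.g. garden-myproject)
# $2 = shoot name
# $3 = target Kubernetes version, which must be listed in the CloudProfile
set_kubernetes_version() {
  local namespace="$1" shoot="$2" version="$3"
  kubectl -n "$namespace" patch shoot "$shoot" --type merge \
    -p "{\"spec\":{\"kubernetes\":{\"version\":\"$version\"}}}"
}

# Usage (requires access to the garden cluster):
#   set_kubernetes_version garden-myproject my-shoot 1.11.10
```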

Manual updates are required if you would like to do a minor update of the Kubernetes version. Gardener doesn’t do such updates automatically, as they can have breaking changes that could impact the cluster workload.

Manual updates are either executed immediately (default) or can be confined to the maintenance time window.
Choosing the latter option causes changes to the cluster (for example, node pool rolling-updates) and the subsequent reconciliation to only predictably happen during a defined time window (available since Gardener version 1.4).

For more information, see Confine Specification Changes/Update Roll Out.

Examples

In the examples for the CloudProfile and the shoot cluster specification, only the fields relevant for the example are shown.

Auto-Update of Kubernetes Version

Let’s assume that the Kubernetes versions 1.10.5 and 1.11.0 were added in the following CloudProfile:

spec:
  kubernetes:
    versions:
    - version: 1.11.0
    - version: 1.10.5
    - version: 1.10.0

Before this change, the shoot cluster specification looked like this:

spec:
  kubernetes:
    version: 1.10.0
  maintenance:
    timeWindow:
      begin: 220000+0000
      end: 230000+0000
    autoUpdate:
      kubernetesVersion: true

As a consequence, the shoot cluster is updated to Kubernetes version 1.10.5 between 22:00-23:00 UTC. Your shoot cluster isn’t updated automatically to 1.11.0, even though it’s the highest Kubernetes version in the CloudProfile, because Gardener only does automatic updates of the Kubernetes patch level.

Forceful Update Due to Expired Kubernetes Version

Let’s assume the following CloudProfile exists on the cluster:

spec:
  kubernetes:
    versions:
    - version: 1.12.8
    - version: 1.11.10
    - version: 1.10.13
    - version: 1.10.12
      expirationDate: "2019-04-13T08:00:00Z"

Let’s assume the shoot cluster has the following specification:

spec:
  kubernetes:
    version: 1.10.12
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: false

The shoot cluster specification refers to a Kubernetes version that has an expirationDate. In the maintenance window on 2019-04-12, the Kubernetes version stays the same as it’s still not expired. But in the maintenance window on 2019-04-14, the Kubernetes version of the shoot cluster is updated to 1.10.13 (independently of the value of .spec.maintenance.autoUpdate.kubernetesVersion).

Forceful Update to New Minor Kubernetes Version

Let’s assume the following CloudProfile exists on the cluster:

spec:
  kubernetes:
    versions:
    - version: 1.12.8
    - version: 1.11.10
    - version: 1.11.9
    - version: 1.10.12
      expirationDate: "2019-04-13T08:00:00Z"

Let’s assume the shoot cluster has the following specification:

spec:
  kubernetes:
    version: 1.10.12
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: false

The shoot cluster specification refers to a Kubernetes version that has an expirationDate. In the maintenance window on 2019-04-14, the Kubernetes version of the shoot cluster is updated to 1.11.10, which is the highest patch version of minor target version 1.11 that follows the source version 1.10.

Automatic Update from Expired Machine Image Version

Let’s assume the following CloudProfile exists on the cluster:

spec:
  machineImages:
  - name: coreos
    versions:
    - version: 2191.5.0
    - version: 2191.4.1
    - version: 2135.6.0
      expirationDate: "2019-04-13T08:00:00Z"

Let’s assume the shoot cluster has the following specification:

spec:
  provider:
    type: aws
    workers:
    - name: name
      maximum: 1
      minimum: 1
      maxSurge: 1
      maxUnavailable: 0
      machine:
        type: m5.large
        image:
          name: coreos
          version: 2135.6.0
      volume:
        type: gp2
        size: 20Gi
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      machineImageVersion: false

The shoot cluster specification refers to a machine image version that has an expirationDate. In the maintenance window on 2019-04-12, the machine image version stays the same as it’s still not expired. But in the maintenance window on 2019-04-14, the machine image version of the shoot cluster is updated to 2191.5.0 (independently of the value of .spec.maintenance.autoUpdate.machineImageVersion) as version 2135.6.0 is expired.

4.9 - Tailscale

Access the Kubernetes apiserver from your tailnet

Overview

If you would like to strengthen the security of your Kubernetes cluster even further, this guide explains how to achieve this.

The most common way to secure a Kubernetes cluster which was created with Gardener is to apply the ACLs described in the Gardener ACL Extension repository or to use ExposureClass, which exposes the Kubernetes apiserver in a corporate network not exposed to the public internet.

However, those solutions are not without their drawbacks. Managing the ACL extension becomes fairly difficult with the growing number of participants, especially in a dynamic environment and work from home scenarios, and using ExposureClass requires you to first have a corporate network suitable for this purpose.

But there is a solution which bridges the gap between these two approaches by using a mesh VPN based on WireGuard.

Tailscale

Tailscale is a mesh VPN which uses WireGuard under the hood, but automates the key exchange procedure. Please consult the official tailscale documentation for a detailed explanation.

Target Architecture

architecture

Installation

In order to be able to access the Kubernetes apiserver only from a tailscale VPN, follow these steps:

  1. Create a tailscale account and ensure MagicDNS is enabled.
  2. Create an OAuth ClientID and Secret. Don’t forget to create the required tags.
  3. Install the tailscale operator.

If all went well after the operator installation, you should be able to see the tailscale operator by running tailscale status:

# tailscale status
...
100.83.240.121  tailscale-operator   tagged-devices linux   -
...

Expose the Kubernetes apiserver

Now you are ready to expose the Kubernetes apiserver in the tailnet by annotating the service which was created by Gardener:

kubectl annotate -n default service kubernetes tailscale.com/expose=true tailscale.com/hostname=kubernetes

You must use kubernetes as the hostname, because it is part of the certificate common name of the Kubernetes apiserver.

After annotating the service, it will be exposed in the tailnet and can be shown by running tailscale status:

# tailscale status
...
100.83.240.121  tailscale-operator   tagged-devices linux   -
100.96.191.87   kubernetes           tagged-devices linux   idle, tx 19548 rx 71656
...

Modify the kubeconfig

In order to access the cluster via the VPN, you must modify the kubeconfig to point to the Kubernetes service exposed in the tailnet, by changing the server entry to https://kubernetes.

---
apiVersion: v1
clusters:
  - cluster:
      certificate-authority-data: <base64 encoded secret>
      server: https://kubernetes
    name: my-cluster
...
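
Instead of editing the file by hand, the server entry can also be changed with kubectl. This is a minimal sketch. The helper name use_tailnet_apiserver is hypothetical, and the cluster entry name my-cluster is the one from the excerpt above:

```shell
# Hypothetical helper: point an existing kubeconfig cluster entry at the
# hostname under which the apiserver is exposed in the tailnet.
# $1 = name of the cluster entry in the kubeconfig
use_tailnet_apiserver() {
  local cluster="$1"
  kubectl config set-cluster "$cluster" --server=https://kubernetes
}

# Usage (modifies the kubeconfig that kubectl currently uses):
#   use_tailnet_apiserver my-cluster
```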

Enable ACLs to Block All IPs

Now you are ready to use your cluster from every device which is part of your tailnet. Therefore you can now block all access to the Kubernetes apiserver with the ACL extension.

Caveats

Multiple Kubernetes Clusters

You cannot currently join multiple Kubernetes clusters to your tailnet, because the kubernetes service of every cluster would overlap.

Headscale

It is possible to host a tailscale coordination server on your own if you do not want to rely on the service tailscale.com offers. The headscale project is an open source implementation of this.

This works for basic tailscale VPN setups, but not for the tailscale operator described here, because headscale does not implement all required API endpoints for the tailscale operator. The details can be found in this GitHub issue.

5 - Monitor and Troubleshoot

5.1 - Analyzing Node Removal and Failures

Utilize Gardener’s Monitoring and Logging to analyze removal and failures of nodes

Overview

Sometimes operators want to find out why a certain node got removed. This guide helps to identify possible causes. There are a few potential reasons why nodes can be removed:

  • broken node: a node becomes unhealthy and machine-controller-manager terminates it in an attempt to replace the unhealthy node with a new one
  • scale-down: cluster-autoscaler sees that a node is under-utilized and therefore scales down a worker pool
  • node rolling: configuration changes to a worker pool (or cluster) require all nodes of one or all worker pools to be rolled and thus all nodes to be replaced. Some possible changes are:
    • the K8s/OS version
    • changing machine types

Helpful information can be obtained by using the logging stack. See Logging Stack for how to utilize the logging information in Gardener.

Find Out Whether the Node Was unhealthy

Check the Node Events

A good first indication of what happened to a node can be obtained from the node’s events. Events are scraped and ingested into the logging system, so they can be found in the explore tab of Grafana (make sure to select loki as datasource) with a query like {job="event-logging"} | unpack | object="Node/<node-name>". Alternatively, find any event mentioning the node in question via a broader query like {job="event-logging"}|="<node-name>".

A potential result might reveal:

{"_entry":"Node ip-10-55-138-185.eu-central-1.compute.internal status is now: NodeNotReady","count":1,"firstTimestamp":"2023-04-05T12:02:08Z","lastTimestamp":"2023-04-05T12:02:08Z","namespace":"default","object":"Node/ip-10-55-138-185.eu-central-1.compute.internal","origin":"shoot","reason":"NodeNotReady","source":"node-controller","type":"Normal"}

Check machine-controller-manager Logs

If a node became unhealthy, its last conditions can be found in the logs of the machine-controller-manager by using a query like {pod_name=~"machine-controller-manager.*"}|="<node-name>".

Caveat: every node resource is backed by a corresponding machine resource managed by machine-controller-manager. Usually, corresponding node and machine resources have the same name, with the exception of AWS. There, you first need to find the corresponding machine name with the above query, typically via a log like this:

2023-04-05 12:02:08 {"log":"Conditions of Machine \"shoot--demo--cluster-pool-z1-6dffc-jh4z4\" with providerID \"aws:///eu-central-1/i-0a6ad1ca4c2e615dc\" and backing node \"ip-10-55-138-185.eu-central-1.compute.internal\" are changing","pid":"1","severity":"INFO","source":"machine_util.go:629"}

This reveals that node ip-10-55-138-185.eu-central-1.compute.internal is backed by machine shoot--demo--cluster-pool-z1-6dffc-jh4z4. On infrastructures other than AWS you can omit this step.

With the machine name at hand, now search for log entries with {pod_name=~"machine-controller-manager.*"}|="<machine-name>". In case the node had failing conditions, you’d find logs like this:

2023-04-05 12:02:08 {"log":"Machine shoot--demo--cluster-pool-z1-6dffc-jh4z4 is unhealthy - changing MachineState to Unknown. Node conditions: [{Type:ClusterNetworkProblem Status:False LastHeartbeatTime:2023-04-05 11:58:39 +0000 UTC LastTransitionTime:2023-03-23 11:59:29 +0000 UTC Reason:NoNetworkProblems Message:no cluster network problems} ... {Type:Ready Status:Unknown LastHeartbeatTime:2023-04-05 11:55:27 +0000 UTC LastTransitionTime:2023-04-05 12:02:07 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}]","pid":"1","severity":"WARN","source":"machine_util.go:637"}

In the example above, the reason for an unhealthy node was that kubelet failed to renew its heartbeat. Typical reasons would be either a broken VM (that couldn’t execute kubelet anymore) or a broken network. Note that some VM terminations performed by the infrastructure provider are actually expected (e.g., scheduled events on AWS).

In both cases, the infrastructure provider might be able to provide more information on particular VM or network failures.

Whatever the failure condition might have been, if a node gets unhealthy, it will be terminated by machine-controller-manager after the machineHealthTimeout has elapsed (this parameter can be configured in your shoot spec).

Check the Node Logs

For each node, the kernel and kubelet logs, as well as a few others, are scraped and can be queried with {nodename="<node-name>"}. This might reveal OS specific issues or, in the absence of any logs (e.g., after the node went unhealthy), might indicate a network disruption or sudden VM termination. Note that some VM terminations performed by the infrastructure provider are actually expected (e.g., scheduled events on AWS).

Infrastructure providers might be able to provide more information on particular VM failures in such cases.

Check the Network Problem Detector Dashboard

If your Gardener installation utilizes gardener-extension-shoot-networking-problemdetector, you can check the dashboard named “Network Problem Detector” in Grafana for hints on network issues on the node of interest.

Scale-Down

In general, scale-downs are managed by the cluster-autoscaler. Its logs can be found with the query {container_name="cluster-autoscaler"}. Attempts to remove a node can be found with the query {container_name="cluster-autoscaler"}|="Scale-down: removing empty node".

If a scale-down has caused disruptions in your workload, consider protecting your workload by adding PodDisruptionBudgets (see the autoscaler FAQ for more options).

Node Rolling

Node rolling can be caused by, e.g.:

  • change of the K8s minor version of the cluster or a worker pool
  • change of the OS version of the cluster or a worker pool
  • change of the disk size/type or machine size/type of a worker pool
  • change of node labels

Changes like the above are done by altering the shoot specification and thus are recorded in the external auditlog system that is configured for the garden cluster.

5.2 - Get a Shell to a Gardener Shoot Worker Node

Describes the methods for getting shell access to worker nodes

Overview

To troubleshoot certain problems in a Kubernetes cluster, operators need access to the host of the Kubernetes node. This can be required if a node misbehaves or fails to join the cluster in the first place.

With access to the host, it is for instance possible to check the kubelet logs and interact with common tools such as systemctl and journalctl.

The first section of this guide explores options to get a shell to the node of a Gardener Kubernetes cluster. The options described in the second section do not rely on Kubernetes capabilities to get shell access to a node and thus can also be used if an instance failed to join the cluster.

This guide only covers how to get access to the host, but does not cover troubleshooting methods.

Get a Shell to an Operational Cluster Node

The following describes three different approaches to get a shell to an operational Shoot worker node. As a prerequisite to troubleshooting a Kubernetes node, the node must have joined the cluster successfully and be able to run a pod. All of the described approaches involve scheduling a pod with root permissions and mounting the root filesystem.

Gardener Dashboard

Prerequisite: the terminal feature is configured for the Gardener dashboard.

  1. Navigate to the cluster overview page and find the Terminal in the Access tile.
Access Tile

Select the target Cluster (Garden, Seed / Control Plane, Shoot cluster) depending on the requirements and access rights (only certain users have access to the Seed Control Plane).

  2. To open the terminal configuration, interact with the top right-hand corner of the screen.
Terminal configuration
  3. Set the Terminal Runtime to “Privileged”. Also, specify the target node from the drop-down menu.
Dashboard terminal pod configuration

Result

The Dashboard then schedules a pod and opens a shell session to the node.

To get access to the common binaries installed on the host, prefix the command with chroot /hostroot. Note that the path depends on where the root path is mounted in the container. In the default image used by the Dashboard, it is under /hostroot.

Dashboard terminal pod configuration

Gardener Ops Toolbelt

Prerequisite: kubectl is available.

The Gardener ops-toolbelt can be used as a convenient way to deploy a root pod to a node. The pod uses an image that is bundled with a bunch of useful troubleshooting tools. This is also the same image that is used by default when using the Gardener Dashboard terminal feature as described in the previous section.

The easiest way to use the Gardener ops-toolbelt is to execute the ops-pod script in the hacks folder. To get root shell access to a node, execute the aforementioned script by supplying the target node name as an argument:

<path-to-ops-toolbelt-repo>/hacks/ops-pod <target-node>

Custom Root Pod

Alternatively, a pod can be assigned to a target node and a shell can be opened via standard Kubernetes means. To enable root access to the node, the pod specification requires proper securityContext and volume properties.

For instance, you can use the following pod manifest, after replacing <target-node-name> with the name of the node you want this pod to run on:

apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
  namespace: default
spec:
  nodeSelector:
    kubernetes.io/hostname: <target-node-name>
  containers:
  - name: busybox
    image: busybox
    stdin: true
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-root-volume
      mountPath: /host
      readOnly: true
  volumes:
  - name: host-root-volume
    hostPath:
      path: /
  hostNetwork: true
  hostPID: true
  restartPolicy: Never
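
As a minimal sketch, the manifest can be applied and used as follows. The helper name node_shell and the file name privileged-pod.yaml are assumptions. The pod name, namespace, and the /host mount path are taken from the manifest above:

```shell
# Hypothetical helper: create the privileged pod from a manifest file, wait
# until it is ready, and open a shell that uses the host filesystem as root.
# $1 = path to the manifest (assumption: saved as privileged-pod.yaml)
node_shell() {
  local manifest="${1:-privileged-pod.yaml}"
  kubectl apply -f "$manifest" &&
    kubectl wait --for=condition=Ready pod/privileged-pod -n default --timeout=120s &&
    # /host is the mountPath of the host root volume in the manifest
    kubectl exec -it privileged-pod -n default -- chroot /host /bin/sh
}

# Usage (requires access to the shoot cluster):
#   node_shell privileged-pod.yaml
#   kubectl delete pod privileged-pod -n default   # clean up afterwards
```

Because the manifest mounts the host root filesystem read-only, this shell is suited for inspection (for example, reading logs) rather than for changing files on the node.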

SSH Access to a Node That Failed to Join the Cluster

This section explores two options that can be used to get SSH access to a node that failed to join the cluster. As it is not possible to schedule a pod on the node, the Kubernetes-based methods explored so far cannot be used in this scenario.

Additionally, Gardener typically provisions worker instances in a private subnet of the VPC, hence there is no public IP address that could be used for direct SSH access.

For this scenario, cloud providers typically have extensive documentation (e.g., AWS & GCP) and in some cases tooling support. However, these approaches are mostly cloud provider specific and require interaction via their CLI and API, or sometimes the installation of a cloud provider specific agent on the node.

Alternatively, gardenctl can be used. It provides cloud provider agnostic, out-of-the-box support for getting ssh access to an instance in a private subnet. Currently gardenctl supports AWS, GCP, OpenStack, Azure and Alibaba Cloud.

Identifying the Problematic Instance

First, the problematic instance has to be identified. In Gardener, worker pools can be created in different cloud provider regions, zones, and accounts.

The instance would typically show up as successfully started / running in the cloud provider dashboard or API and it is not immediately obvious which one has a problem. Instead, we can use the Gardener API / CRDs to obtain the faulty instance identifier in a cloud-agnostic way.

Gardener uses the Machine Controller Manager to create the Shoot worker nodes. For each worker node, the Machine Controller Manager creates a Machine CRD in the Shoot namespace in the respective Seed cluster. Usually the problematic instance can be identified, as the respective Machine CRD has status pending.

The instance / node name can be obtained from the Machine .status field:

kubectl get machine <machine-name> -o json | jq -r .status.node

This is all the information needed to go ahead and use gardenctl ssh to get a shell to the node. In addition, the used cloud provider, the specific identifier of the instance, and the instance region can be identified from the Machine CRD.

Get the identifier of the instance via:

kubectl get machine <machine-name> -o json | jq -r .spec.providerID # e.g., aws:///eu-north-1/i-069733c435bdb4640

The identifier shows that the instance belongs to the cloud provider aws with the ec2 instance-id i-069733c435bdb4640 in region eu-north-1.
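
The decomposition of the identifier can be sketched in shell. The helper name parse_provider_id is hypothetical, and the format <provider>:///<region>/<instance-id> is assumed from the AWS example above. Other providers may use a different layout:

```shell
# Hypothetical helper: split an AWS-style providerID into its parts.
# Assumed format: <provider>:///<region>/<instance-id>
parse_provider_id() {
  local id="$1"
  local provider="${id%%:*}"    # text before the first ':'
  local rest="${id#*:///}"      # text after ':///'
  local region="${rest%%/*}"    # text before the next '/'
  local instance="${rest##*/}"  # text after the last '/'
  printf 'provider=%s region=%s instance=%s\n' "$provider" "$region" "$instance"
}

parse_provider_id "aws:///eu-north-1/i-069733c435bdb4640"
# prints: provider=aws region=eu-north-1 instance=i-069733c435bdb4640
```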

To get more information about the instance, check out the MachineClass (e.g., AWSMachineClass) that is associated with each Machine CRD in the Shoot namespace of the Seed cluster.

The AWSMachineClass contains the machine image (ami), machine-type, iam information, network-interfaces, subnets, security groups and attached volumes.

Of course, the information can also be used to get the instance with the cloud provider CLI / API.

gardenctl ssh

Using the node name of the problematic instance, we can use the gardenctl ssh command to get SSH access to the cloud provider instance via an automatically set up bastion host. gardenctl takes care of spinning up the bastion instance, setting up the SSH keys, ports and security groups and opens a root shell on the target instance. After the SSH session has ended, gardenctl deletes the created cloud provider resources.

Use the following commands:

  1. First, target a Garden cluster containing all the Shoot definitions.
gardenctl target garden <target-garden>
  2. Target an available Shoot by name. This sets up the context, configures the kubeconfig file of the Shoot cluster and downloads the cloud provider credentials. Subsequent commands will execute in this context.
gardenctl target shoot <target-shoot>
  3. Get SSH access to the target node. This uses the cloud provider credentials to spin up the bastion and to open a shell on the target instance.
gardenctl ssh <target-node>

SSH with a Manually Created Bastion on AWS

In case you are not using gardenctl or want to control the bastion instance yourself, you can also manually set it up. The steps described here are generally the same as those used by gardenctl internally. Despite some cloud provider specifics, they can be generalized to the following list:

  • Open port 22 on the target instance.
  • Create an instance / VM in a public subnet (the bastion instance needs to have a public IP address).
  • Set up security groups and roles, and open port 22 for the bastion instance.

The following diagram shows an overview of how the SSH access to the target instance works:

SSH Bastion diagram

This guide demonstrates the setup of a bastion on AWS.

Prerequisites:

  • The AWS CLI is set up.

  • Obtain target instance-id (see Identifying the Problematic Instance).

  • Obtain the VPC ID the Shoot resources are created in. This can be found in the Infrastructure CRD in the Shoot namespace in the Seed.

  • Make sure that port 22 on the target instance is open (default for Gardener deployed instances).

    • Extract security group via:
    aws ec2 describe-instances --instance-ids <instance-id>
    
    • Check for rule that allows inbound connections on port 22:
    aws ec2 describe-security-groups --group-ids=<security-group-id>
    
    • If not available, create the rule with the following command:
    aws ec2 authorize-security-group-ingress --group-id <security-group-id>  --protocol tcp --port 22 --cidr 0.0.0.0/0
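
The describe-instances call above returns the full instance description. As a small sketch, the AWS CLI --query option can reduce the output to the security group IDs. The helper name instance_security_groups is hypothetical:

```shell
# Hypothetical helper: print the IDs of the security groups attached to an
# EC2 instance, using a JMESPath filter on the describe-instances output.
# $1 = EC2 instance ID
instance_security_groups() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[].Instances[].SecurityGroups[].GroupId' \
    --output text
}

# Usage (requires configured AWS credentials):
#   instance_security_groups i-069733c435bdb4640
```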
    

Create the Bastion Security Group

  1. The common name of the security group is <shoot-name>-bsg. Create the security group:
aws ec2 create-security-group --group-name <bastion-security-group-name> --description ssh-access --vpc-id <VPC-ID>
  2. Optionally, create identifying tags for the security group:
aws ec2 create-tags --resources <bastion-security-group-id> --tags Key=component,Value=<tag>
  3. Create a permission in the bastion security group that allows ssh access on port 22:
aws ec2 authorize-security-group-ingress --group-id <bastion-security-group-id> --protocol tcp --port 22 --cidr 0.0.0.0/0
  4. Create an IAM role for the bastion instance with the name <shoot-name>-bastions:
aws iam create-role --role-name <shoot-name>-bastions

The content should be:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeRegions"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
  5. Create the instance profile and name it <shoot-name>-bastions:
aws iam create-instance-profile --instance-profile-name <name>
  6. Add the created role to the instance profile:
aws iam add-role-to-instance-profile --instance-profile-name <instance-profile-name> --role-name <role-name>

Create the Bastion Instance

Next, in order to be able to ssh into the bastion instance, the instance has to be set up with a user with a public ssh key. Create a user gardener that has the same Gardener-generated public ssh key as the target instance.

  1. First, we need to get the public part of the Shoot ssh-key. The ssh-key is stored in a secret in the the project namespace in the Garden cluster. The name is: <shoot-name>-ssh-publickey. Get the key via:
kubectl get secret aws-gvisor.ssh-keypair -o json | jq -r .data.\"id_rsa.pub\"
  1. A script handed over as user-data to the bastion ec2 instance, can be used to create the gardener user and add the ssh-key. For your convenience, you can use the following script to generate the user-data.
#!/bin/bash -eu
saveUserDataFile () {
  ssh_key=$1

cat > gardener-bastion-userdata.sh <<EOF
#!/bin/bash -eu
id gardener || useradd gardener -mU
mkdir -p /home/gardener/.ssh
echo "$ssh_key" > /home/gardener/.ssh/authorized_keys
chown gardener:gardener /home/gardener/.ssh/authorized_keys
echo "gardener ALL=(ALL) NOPASSWD:ALL" >/etc/sudoers.d/99-gardener-user
EOF
}


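if [ -p /dev/stdin ]; then
    # The key is piped in; accept it even without a trailing newline
    read -r input || [ -n "$input" ]
else
    # No pipe: fall back to the clipboard (pbpaste is macOS-only)
    input=$(pbpaste)
fi
saveUserDataFile "$input"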
  1. Use the script by handing over the decoded public ssh-key of the Shoot cluster:
kubectl get secret <shoot-name>.ssh-keypair -o json | jq -r .data.\"id_rsa.pub\" | base64 -d | ./generate-userdata.sh

This generates a file called gardener-bastion-userdata.sh in the same directory containing the user-data.

  1. The following information is needed to create the bastion instance:

  • bastion-IAM-instance-profile-name
    • Use the created instance profile with the name <shoot-name>-bastions.
  • image-id
    • It is possible to use the same image-id as the one used for the target instance (or any other image).
    • The format is cloud provider specific (AWS: ami).
  • ssh-public-key-name
    • This is the ssh key pair already created in the Shoot's cloud provider account by Gardener during the Infrastructure CRD reconciliation.
    • The name is usually: <shoot-name>-ssh-publickey
  • subnet-id
    • Choose a subnet that is attached to an Internet Gateway and NAT Gateway (the bastion instance must have a public IP).
    • The Gardener-created public subnet with the name <shoot-name>-public-utility-<xy> can be used. Please check the created subnets with the cloud provider.
  • bastion-security-group-id
    • Use the id of the created bastion security group.
  • file-path-to-userdata
    • Use the filepath to the user-data file generated in the previous step.
  • bastion-instance-name
    • Optionally, you can tag the instance.
    • Usually <shoot-name>-bastions
  1. Create the bastion instance via:
aws ec2 run-instances --iam-instance-profile Name=<bastion-IAM-instance-profile-name> --image-id <image-id> --count 1 --instance-type t3.nano --key-name <ssh-public-key-name> --security-group-ids <bastion-security-group-id> --subnet-id <subnet-id> --associate-public-ip-address --user-data file://<file-path-to-userdata> --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<bastion-instance-name>},{Key=component,Value=<mytag>}]" "ResourceType=volume,Tags=[{Key=component,Value=<mytag>}]"

Capture the instance-id from the response and wait until the ec2 instance is running and has a public IP address.
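The waiting step can be scripted with the AWS CLI waiter and a describe-instances query. The following is a minimal sketch; the function name bastion_public_ip is an illustrative assumption and not part of the original procedure.

```shell
# Hypothetical helper: blocks until the instance is running, then prints its public IP.
# Usage: bastion_public_ip <instance-id>
bastion_public_ip () {
  local instance_id="$1"
  # The waiter polls until the instance reaches the "running" state
  aws ec2 wait instance-running --instance-ids "$instance_id"
  # Query only the public IP address of the instance and print it as plain text
  aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

# Usage (requires configured AWS credentials):
#   bastion_public_ip <instance-id>
```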

Connecting to the Target Instance

  1. Save the private key of the ssh-key-pair in a temporary local file for later use:
umask 077

kubectl get secret <shoot-name>.ssh-keypair -o json | jq -r .data.\"id_rsa\" | base64 -d > id_rsa.key
  1. Use the private ssh key to ssh into the bastion instance:
ssh -i <path-to-private-key> gardener@<public-bastion-instance-ip> 
  1. If that works, connect from your local terminal to the target instance via the bastion:
ssh -i <path-to-private-key> -o IdentitiesOnly=yes -o StrictHostKeyChecking=no -o ProxyCommand="ssh -W %h:%p -i <path-to-private-key> -o IdentitiesOnly=yes -o StrictHostKeyChecking=no gardener@<public-bastion-instance-ip>" gardener@<private-ip-target-instance>

Cleanup

Do not forget to clean up the created resources (bastion instance, security group, instance profile, and IAM role). Otherwise, Gardener will eventually fail to delete the Shoot.
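A possible cleanup sequence for the resources created in this guide is sketched below. The function name cleanup_bastion and its argument order are illustrative assumptions. If you attached policies to the role, they need to be removed before the role itself can be deleted; that step is not covered here.

```shell
# Hypothetical helper that removes the bastion resources created in this guide.
# Usage: cleanup_bastion <instance-id> <security-group-id> <instance-profile-name> <role-name>
cleanup_bastion () {
  local instance_id="$1" sg_id="$2" profile="$3" role="$4"
  aws ec2 terminate-instances --instance-ids "$instance_id"
  # The security group can only be deleted once the instance no longer uses it
  aws ec2 wait instance-terminated --instance-ids "$instance_id"
  aws ec2 delete-security-group --group-id "$sg_id"
  # The role has to be removed from the instance profile before both can be deleted
  aws iam remove-role-from-instance-profile --instance-profile-name "$profile" --role-name "$role"
  aws iam delete-instance-profile --instance-profile-name "$profile"
  aws iam delete-role --role-name "$role"
}

# Usage (requires configured AWS credentials):
#   cleanup_bastion <instance-id> <bastion-security-group-id> <shoot-name>-bastions <shoot-name>-bastions
```

Also remember to delete the temporary local copy of the private key (id_rsa.key).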

5.3 - How to Debug a Pod

Your pod doesn’t run as expected. Are there any log files? Where? How could I debug a pod?

Introduction

Kubernetes offers powerful options to get more details about startup or runtime failures of pods, as described, e.g., in Application Introspection and Debugging or Debug Pods and Replication Controllers.

In order to identify pods with potential issues, you could, e.g., run kubectl get pods --all-namespaces | grep -iv Running to list only the pods that are not in the state Running. One frequent error state is CrashLoopBackOff, which indicates that a pod crashes right after it starts. Kubernetes then tries to restart the pod, but often the startup fails again.
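Note that grep matches "Running" anywhere on the line, so a pod whose name or namespace contains that word would be hidden as well. A slightly stricter variant compares only the STATUS column. This is a sketch: the function name not_running is hypothetical, and it assumes the default column layout of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Hypothetical filter: keeps the header and every pod whose STATUS is neither
# Running nor Completed. Expects `kubectl get pods --all-namespaces` output on stdin.
not_running () {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces | not_running
```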

Here is a short list of possible reasons which might lead to a pod crash:

  1. Error during image pull caused by e.g. wrong/missing secrets or wrong/missing image
  2. The app runs in an error state caused e.g. by missing environmental variables (ConfigMaps) or secrets
  3. Liveness probe failed
  4. Too high resource consumption (memory and/or CPU) or too strict quota settings
  5. Persistent volumes can’t be created/mounted
  6. The container image is not updated

Basically, the commands kubectl logs ... and kubectl describe ... with different parameters are used to get more detailed information. By calling e.g. kubectl logs --help you can get more detailed information about the command and its parameters.

In the next sections you’ll find some basic approaches to get some ideas what went wrong.

Remarks:

  • Even if the pods seem to be running, as the status Running indicates, a high counter of the Restarts shows potential problems
  • You can get a good overview of the troubleshooting process with the interactive tutorial Troubleshooting with Kubectl, which explains basic debugging activities
  • The examples below are deployed into the namespace default. In case you want to change it, use the optional parameter --namespace <your-namespace> to select the target namespace. The examples require a Kubernetes release ≥ 1.9.

Prerequisites

Your deployment was successful (no logical/syntactical errors in the manifest files), but the pod(s) aren’t running.

Error Caused by Wrong Image Name

Start by running kubectl describe pod <your-pod> --namespace <your-namespace> to get detailed information about the pod startup.

In the Events section, you should get an error message like Failed to pull image ... and Reason: Failed. The pod is in state ImagePullBackOff.

The example below is based on a demo in the Kubernetes documentation. In all examples, the default namespace is used.

First, perform a cleanup with:

kubectl delete pod termination-demo

Next, create a resource based on the yaml content below:

apiVersion: v1
kind: Pod 
metadata:
  name: termination-demo
spec:
  containers:
  - name: termination-demo-container
    image: debiann
    command: ["/bin/sh"]
    args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]

kubectl describe pod termination-demo lists the following content in the Events section:

Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath					Type		Reason			Message
  ---------	--------	-----	----							-------------					--------	------			-------
  2m		2m		1	default-scheduler											Normal		Scheduled		Successfully assigned termination-demo to ip-10-250-17-112.eu-west-1.compute.internal
  2m		2m		1	kubelet, ip-10-250-17-112.eu-west-1.compute.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-sgccm" 
  2m		1m		4	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Normal		Pulling			pulling image "debiann"
  2m		1m		4	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Warning		Failed			Failed to pull image "debiann": rpc error: code = Unknown desc = Error: image library/debiann:latest not found
  2m		54s		10	kubelet, ip-10-250-17-112.eu-west-1.compute.internal							Warning		FailedSync		Error syncing pod
  2m		54s		6	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Normal		BackOff			Back-off pulling image "debiann"

The error message with Reason: Failed tells you that there is an error during pulling the image. A closer look at the image name indicates a misspelling.

The App Runs in an Error State Caused, e.g., by Missing Environmental Variables (ConfigMaps) or Secrets

This example illustrates the behavior in the case when the app expects environment variables but the corresponding Kubernetes artifacts are missing.

First, perform a cleanup with:

kubectl delete deployment termination-demo
kubectl delete configmaps app-env

Next, deploy the following manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
     app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]

Now, the command kubectl get pods lists the pod termination-demo-xxx in the state Error or CrashLoopBackOff. The command kubectl describe pod termination-demo-xxx tells you that there is no error during startup but gives no clue about what caused the crash.

Events:
  FirstSeen	LastSeen	Count	From							SubObjectPath					Type		Reason		Message
  ---------	--------	-----	----							-------------					--------	------		-------
  19m		19m		1	default-scheduler											Normal		Scheduled	Successfully assigned termination-demo-5fb484867d-xz2x9 to ip-10-250-17-112.eu-west-1.compute.internal
  19m		19m		1	kubelet, ip-10-250-17-112.eu-west-1.compute.internal							Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-sgccm" 
  19m		19m		4	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Normal		Pulling		pulling image "debian"
  19m		19m		4	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Normal		Pulled		Successfully pulled image "debian"
  19m		19m		4	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Normal		Created		Created container
  19m		19m		4	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Normal		Started		Started container
  19m		14m		24	kubelet, ip-10-250-17-112.eu-west-1.compute.internal	spec.containers{termination-demo-container}	Warning		BackOff		Back-off restarting failed container
  19m		4m		69	kubelet, ip-10-250-17-112.eu-west-1.compute.internal							Warning		FailedSync	Error syncing pod

The command kubectl logs termination-demo-xxx gives access to the output that the application writes to stderr and stdout. In this case, you should get an output similar to:

/bin/sh: 1: cannot open : No such file

So you need to have a closer look at the application. In this case, the environment variable MYFILE is missing. To fix this issue, you could, e.g., add a ConfigMap to your deployment, as shown in the manifest listed below:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
     app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        envFrom:
        - configMapRef:
            name: app-env 

Note that once you fix the error and re-run the scenario, you might still see the pod in a CrashLoopBackOff status. This is because the container finishes the command sed ... and runs to completion. In order to keep the container in a Running status, a long-running task is required, e.g.:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
  SLEEP: "5"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
     app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        # args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        args: ["-c", "while true; do sleep $SLEEP; echo sleeping; done;"]
        envFrom:
        - configMapRef:
            name: app-env

Too High Resource Consumption (Memory and/or CPU) or Too Strict Quota Settings

You can optionally specify the amount of memory and/or CPU your container gets during runtime. In case these settings are missing, the default requests settings are taken: CPU: 0m (in Milli CPU) and RAM: 0Gi, which means that there are no limits other than those of the node(s) themselves. For more details, e.g. about how to configure limits, see Configure Default Memory Requests and Limits for a Namespace.

In case your application needs more resources, Kubernetes distinguishes between requests and limit settings: requests specify the guaranteed amount of a resource, whereas limit tells Kubernetes the maximum amount of that resource the container may use. Mathematically, both settings can be described by the relation 0 <= requests <= limit. For both settings you need to consider the total amount of resources your nodes provide. For a detailed description of the concept, see Resource Quality of Service in Kubernetes.

Use kubectl describe nodes to get a first overview of the resource consumption in your cluster. Of special interest are the figures indicating the amount of CPU and Memory Requests at the bottom of the output.

The next example demonstrates what happens in case the CPU request is too high in order to be managed by your cluster.

First, perform a cleanup with:

kubectl delete deployment termination-demo
kubectl delete configmaps app-env

Next, adapt the cpu value in the yaml below to be slightly higher than the remaining CPU resources in your cluster and deploy this manifest. In this example, 600m (milli CPUs) are requested in a Kubernetes system with a single 2 core worker node, which results in an error message.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: termination-demo
  labels:
     app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
        resources:
          requests:
            cpu: "600m" 

The command kubectl get pods lists the pod termination-demo-xxx in the state Pending. More details on why this happens can be found by using the command kubectl describe pod termination-demo-xxx:

$ kubectl describe po termination-demo-fdb7bb7d9-mzvfw
Name:           termination-demo-fdb7bb7d9-mzvfw
Namespace:      default
...
Containers:
  termination-demo-container:
    Image:      debian
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
    Args:
      -c
      sleep 10 && echo Sleep expired > /dev/termination-log
    Requests:
      cpu:        6
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-t549m (ro)
Conditions:
  Type           Status
  PodScheduled   False
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  9s (x7 over 40s)  default-scheduler  0/2 nodes are available: 2 Insufficient cpu.

Remarks:

  • This example works similarly when specifying a too high request for memory
  • In case you configured an autoscaler range when creating your Kubernetes cluster, another worker node will be spun up automatically if you didn’t reach the maximum number of worker nodes
  • In case your app is running out of memory (the memory settings are too small), you will typically find an OOMKilled (Out Of Memory) message in the Events section of the kubectl describe pod ... output

The Container Image Is Not Updated

You applied a fix in your app, created a new container image, and pushed it into your container repository. After redeploying your Kubernetes manifests, you expected to get the updated app, but the same bug is still present in the new deployment.

This behavior is related to how Kubernetes decides whether to pull a new docker image or to use the cached one.

In case you didn’t change the image tag, the default image policy IfNotPresent tells Kubernetes to use the cached image (see Images).

As a best practice, you should not use the tag latest, and you should change the image tag whenever you change anything in your image (see Configuration Best Practices).

For more information, see Container Image Not Updating.

5.4 - tail -f /var/log/my-application.log

Aggregate log files from different pods

Problem

One thing that always bothered me was that I couldn’t get logs of several pods at once with kubectl. A simple tail -f <path-to-logfile> isn’t possible at all. Certainly, you can use kubectl logs -f <pod-id>, but it doesn’t help if you want to monitor more than one pod at a time.

This is something you really need a lot, at least if you run several instances of a pod behind a deployment. This is even more so if you don’t have a Kibana or a similar setup.

Solution

Luckily, there are smart developers out there who always come up with solutions. The finding of the week is a small bash script that allows you to aggregate the logs of several pods at the same time in a simple way. The script is called kubetail and is available on GitHub.

6 - Applications

6.1 - Shoot Pod Autoscaling Best Practices

Introduction

There are two types of pod autoscaling in Kubernetes: Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA). HPA (implemented as part of the kube-controller-manager) scales the number of pod replicas, while VPA (implemented as an independent community project) adjusts the CPU and memory requests for the pods. Both types of autoscaling aim to optimize resource usage/costs and maintain the performance and (high) availability of applications running on Kubernetes.

Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling involves increasing or decreasing the number of pod replicas in a deployment, replica set, stateful set, or anything really with a scale subresource that manages pods. HPA adjusts the number of replicas based on specified metrics, such as CPU or memory average utilization (usage divided by requests; most common) or average value (usage; less common). When the demand on your application increases, HPA automatically scales out the number of pods to meet the demand. Conversely, when the demand decreases, it scales in the number of pods to reduce resource usage.

HPA targets (mostly stateless) applications where adding more instances of the application can linearly increase the ability to handle additional load. It is very useful for applications that experience variable traffic patterns, as it allows for real-time scaling without the need for manual intervention.

ℹ️ Note

HPA continuously monitors the metrics of the targeted pods and adjusts the number of replicas based on the observed metrics. It operates solely on the current metrics when it calculates the averages across all pods, meaning it reacts to the immediate resource usage without considering past trends or patterns. Also, all pods are treated equally based on the average metrics. This could potentially lead to situations where some pods are under high load while others are underutilized. Therefore, particular care must be applied to (fair) load-balancing (connection vs. request vs. actual resource load balancing are crucial).

A Few Words on the Cluster-Proportional (Horizontal) Autoscaler (CPA) and the Cluster-Proportional Vertical Autoscaler (CPVA)

Besides HPA and VPA, CPA and CPVA are further options for scaling horizontally or vertically (neither is deployed by Gardener and must be deployed by the user). Unlike HPA and VPA, CPA and CPVA do not monitor the actual pod metrics, but scale solely on the number of nodes or CPU cores in the cluster. While this approach may be helpful and sufficient in a few rare cases, it is often a risky and crude scaling scheme that we do not recommend. More often than not, cluster-proportional scaling results in either under- or over-reserving your resources.

Vertical Pod Autoscaling (VPA)

Vertical Pod Autoscaling, on the other hand, focuses on adjusting the CPU and memory resources allocated to the pods themselves. Instead of changing the number of replicas, VPA tweaks the resource requests (and limits, but only proportionally, if configured) for the pods in a deployment, replica set, stateful set, daemon set, or anything really with a scale subresource that manages pods. This means that each pod can be given more or fewer resources as needed.

VPA is very useful for optimizing the resource requests of pods that have dynamic resource needs over time. It does so by mutating pod requests (unfortunately, not in-place). Therefore, in order to apply new recommendations, pods that are “out of bounds” (i.e. below a configured/computed lower or above a configured/computed upper recommendation percentile) will be evicted proactively, but also pods that are “within bounds” may be evicted after a grace period. The corresponding higher-level replication controller will then recreate a new pod that VPA will then mutate to set the currently recommended requests (and proportional limits, if configured).

ℹ️ Note

VPA continuously monitors all targeted pods and calculates recommendations based on their usage (one recommendation for the entire target). This calculation is influenced by configurable percentiles, with a greater emphasis on recent usage data and a gradual decrease (=decay) in the relevance of older data. However, this means that VPA doesn’t take into account the individual needs of single pods: eventually, all pods will receive the same recommendation, which may lead to considerable resource waste. Ideally, VPA would update pods in-place depending on their individual needs, but individual recommendations are not part of its design, even if in-place updates get implemented, which may be years away for VPA based on current activity on the component.

Selecting the Appropriate Autoscaler

Before deciding on an autoscaling strategy, it’s important to understand the characteristics of your application:

  • Interruptibility: Most importantly, if the clients of your workload are too sensitive to disruptions or cannot cope well with terminating pods, then maybe neither HPA nor VPA is an option (both HPA and VPA cause pods and connections to be terminated, though VPA does so even more frequently). Clients must retry on disruptions, which is a reasonable ask in a highly dynamic (and self-healing) environment such as Kubernetes, but this is often not respected (or expected) by your clients (they may not know or care that you run the workload in a Kubernetes cluster and may have different expectations regarding the stability of the workload unless you communicated those through SLIs/SLOs/SLAs).
  • Statelessness: Is your application stateless or stateful? Stateless applications are typically better candidates for HPA as they can be easily scaled out by adding more replicas without worrying about maintaining state.
  • Traffic Patterns: Does your application experience variable traffic? If so, HPA can help manage these fluctuations by adjusting the number of replicas to handle the load.
  • Resource Usage: Does your application’s resource usage change over time? VPA can adjust the CPU and memory reservations dynamically, which is beneficial for applications with non-uniform resource requirements.
  • Scalability: Can your application handle increased load by scaling vertically (more resources per pod) or does it require horizontal scaling (more pod instances)?

HPA is the right choice if:

  • Your application is stateless and can handle increased load by adding more instances.
  • You experience short-term fluctuations in traffic that require quick scaling responses.
  • You want to maintain a specific performance metric, such as requests per second per pod.

VPA is the right choice if:

  • Your application’s resource requirements change over time, and you want to optimize resource usage without manual intervention.
  • You want to avoid the complexity of managing resource requests for each pod, especially when they run code where it’s impossible for you to suggest static requests.

In essence:

  • For applications that can handle increased load by simply adding more replicas, HPA should be used to handle short-term fluctuations in load by scaling the number of replicas.
  • For applications that require more resources per pod to handle additional work, VPA should be used to adjust the resource allocation for longer-term trends in resource usage.

Consequently, if both cases apply (VPA often applies), HPA and VPA can also be combined. However, combining both, especially on the same metrics (CPU and memory), requires understanding and care to avoid conflicts and ensure that the autoscaling actions do not interfere with and rather complement each other. For more details, see Combining HPA and VPA.

Horizontal Pod Autoscaler (HPA)

HPA operates by monitoring resource metrics for all pods in a target. It computes the desired number of replicas from the current average metrics and the desired user-defined metrics as follows:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

HPA checks the metrics at regular intervals, which can be configured by the user. Several types of metrics are supported (classical resource metrics like CPU and memory, but also custom and external metrics like requests per second or queue length can be configured, if available). If a scaling event is necessary, HPA adjusts the replica count for the targeted resource.
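The formula above can be evaluated with plain integer arithmetic, as in the following sketch. The function name desired_replicas is hypothetical, and the sketch deliberately ignores everything else HPA applies on top of the formula (tolerance, minReplicas/maxReplicas, and the behavior section).

```shell
# Hypothetical helper implementing ceil(currentReplicas * current / desired)
# with integer arithmetic. Both metric values must use the same unit, e.g. millicores.
# Usage: desired_replicas <currentReplicas> <currentMetricValue> <desiredMetricValue>
desired_replicas () {
  local replicas="$1" current="$2" desired="$3"
  # Adding (desired - 1) before the integer division rounds the quotient up
  echo $(( (replicas * current + desired - 1) / desired ))
}

desired_replicas 4 3000 2000   # 4 pods averaging 3 CPUs, target 2 CPUs: prints 6
desired_replicas 3 2500 2000   # 3 * 1.25 = 3.75, rounded up: prints 4
```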

Defining an HPA Resource

To configure HPA, you need to create an HPA resource in your cluster. This resource specifies the target to scale, the metrics to be used for scaling decisions, and the desired thresholds. Here’s an example of an HPA configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 2
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 8G
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 1800
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300

In this example, HPA is configured to scale foo-deployment based on pod average CPU and memory usage. It will maintain an average CPU and memory usage (not utilization, which is usage divided by requests!) across all replicas of 2 CPUs and 8G or lower with as few replicas as possible. The number of replicas will be scaled between a minimum of 1 and a maximum of 10 based on this target.

For some time now, it has also been possible to configure the autoscaling based on the resource usage of individual containers, not only on the resource usage of the entire pod. All you need to do is to switch the type from Resource to ContainerResource and specify the container name.

In the official documentation ([1] and [2]) you will find examples with average utilization (averageUtilization), not average usage (averageValue), but this is not particularly helpful, especially if you plan to combine HPA together with VPA on the same metrics (generally discouraged in the documentation). If you want to safely combine both on the same metrics, you should scale on average usage (averageValue) as shown above. For more details, see Combining HPA and VPA.

Finally, the behavior section influences how fast you scale up and down. Most of the time (depending on your workload), you will want to scale out faster than you scale in. In this example, the configuration will trigger a scale-out only after observing the need to scale out for 30s (stabilizationWindowSeconds) and will then only scale out at most 100% (value + type) of the current number of replicas every 60s (periodSeconds). The configuration will trigger a scale-in only after observing the need to scale in for 1800s (stabilizationWindowSeconds) and will then only scale in at most 1 pod (value + type) every 300s (periodSeconds). As you can see, scale-out happens quicker than scale-in in this example.
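To get a feeling for the scale-out policy, the following sketch prints the highest replica count a Percent policy permits after each period, assuming the load demands the maximum throughout. The function name scale_out_ceiling is hypothetical, and the model is simplified: the real controller also takes the metric-derived replica count and the stabilization window into account.

```shell
# Hypothetical simulation of a Percent scale-out policy.
# Prints the replica ceiling after each period until <max> is reached.
# Usage: scale_out_ceiling <start> <max> <percent>
scale_out_ceiling () {
  local replicas="$1" max="$2" percent="$3" line
  line="$replicas"
  while [ "$replicas" -lt "$max" ]; do
    # Grow by <percent> of the current count, rounded up, once per period
    replicas=$(( replicas + (replicas * percent + 99) / 100 ))
    if [ "$replicas" -gt "$max" ]; then
      replicas="$max"
    fi
    line="$line $replicas"
  done
  echo "$line"
}

scale_out_ceiling 1 10 100   # prints: 1 2 4 8 10
```

With the values from the example (100% every 60s, maxReplicas 10), going from 1 to 10 replicas takes at least four periods. In the other direction, removing at most 1 pod every 300s means that scaling in from 10 to 1 replicas takes at least 9 * 300s = 45 minutes.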

HPA (actually KCM) Options

HPA is a function of the kube-controller-manager (KCM).

You can read up on the full KCM options online and set most of them conveniently in your Gardener shoot cluster spec:

  • downscaleStabilization (default 5m): HPA will scale out whenever the formula (in accordance with the behavior section, if present in the HPA resource) yields a higher replica count, but it won’t scale in just as eagerly. This option lets you define a trailing time window that HPA must check and only if the recommended replica count is consistently lower throughout the entire time window, HPA will scale in (in accordance with the behavior section, if present in the HPA resource). If at any point in time in that trailing time window the recommended replica count isn’t lower, scale-in won’t happen. This setting is just a default, if nothing is defined in the behavior section of an HPA resource. The default for the upscale stabilization is 0s and it cannot be set via a KCM option (downscale stabilization was historically more important than upscale stabilization and when later the behavior sections were added to the HPA resources, upscale stabilization remained missing from the KCM options).
  • tolerance (default +/-10%): HPA will not scale out or in if the desired replica count is (mathematically as a float) near the actual replica count (see source code for details), which is a form of hysteresis to avoid replica flapping around a threshold.
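The tolerance check can be illustrated as follows. This is a sketch under the assumption that the check compares the ratio of current to desired metric value against 1.0, which is how the option is commonly described; the function name within_tolerance is hypothetical.

```shell
# Hypothetical check: succeeds (exit code 0) if current/desired deviates from 1.0
# by no more than <tolerance-percent>, i.e. the replica count would be left unchanged.
# Usage: within_tolerance <currentMetricValue> <desiredMetricValue> [tolerance-percent]
within_tolerance () {
  local current="$1" desired="$2" tol="${3:-10}" diff
  diff=$(( current - desired ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  # |current/desired - 1| <= tol/100, rewritten without fractions
  [ $(( diff * 100 )) -le $(( tol * desired )) ]
}

within_tolerance 2100 2000 && echo "no scaling"           # 5% above the target
within_tolerance 2500 2000 || echo "scaling considered"   # 25% above the target
```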

There are a few more configurable options of lesser interest:

  • syncPeriod (default 15s): How often HPA retrieves the pods and metrics, and thus how often it recomputes and sets the desired replica count.

  • cpuInitializationPeriod (default 30s) and initialReadinessDelay (default 5m): Both settings only affect whether or not CPU metrics are considered for scaling decisions. They can be easily misinterpreted as the official docs are somewhat hard to read (see source code for details, which is more readable, if you ignore the comments). Normally, you have little reason to modify them, but here is what they do:

    • cpuInitializationPeriod: Defines a grace period after a pod starts during which HPA won’t consider CPU metrics of the pod for scaling if the pod is either not ready or it is ready, but a given CPU metric is older than the last state transition (to ready). This is to ignore CPU metrics that predate the current readiness while still in initialization to not make scaling decisions based on potentially misleading data. If the pod is ready and a CPU metric was collected after it became ready, it is considered also within this grace period.
    • initialReadinessDelay: Defines another grace period after a pod starts during which HPA won’t consider CPU metrics of the pod for scaling if the pod is not ready and it became not ready within this grace period (the docs/comments want to check whether the pod was ever ready, but the code only checks whether the pod condition last transition time to not ready happened within that grace period which it could have from being ready or simply unknown before). This is to ignore not (ever have been) ready pods while still in initialization to not make scaling decisions based on potentially misleading data. If the pod is ready, it is considered also within this grace period.

    So, regardless of the values of these settings, if a pod is reporting ready and it has a CPU metric from the time after it became ready, that pod and its metric will be considered. This holds true even if the pod becomes ready very early into its initialization. These settings cannot be used to “black-out” pods for a certain duration before being considered for scaling decisions. Instead, if it is your goal to ignore a potentially resource-intensive initialization phase that could wrongly lead to further scale-out, you would need to configure your pods to not report as ready until that resource-intensive initialization phase is over.

Considerations When Using HPA

  • Selection of metrics: Besides CPU and memory, HPA can also target custom or external metrics. Pick those (in addition or exclusively), if you guarantee certain SLOs in your SLAs.
  • Targeting usage or utilization: HPA supports usage (absolute) and utilization (relative). Utilization is often preferred in simple examples, but usage is more precise and versatile.
  • Compatibility with VPA: Care must be taken when using HPA in conjunction with VPA, as they can potentially interfere with each other’s scaling decisions.

Vertical Pod Autoscaler (VPA)

VPA operates by monitoring resource metrics for all pods in a target. It computes a resource requests recommendation from the historic and current resource metrics. VPA checks the metrics at regular intervals, which can be configured by the user. Only CPU and memory are supported. If VPA detects that a pod’s resource allocation is too high or too low, it may evict pods (if within the permitted disruption budget), which will trigger the creation of a new pod by the corresponding higher-level replication controller, which will then be mutated by VPA to match the resource requests recommendation. This happens in three different components that work together:

  • VPA Recommender: The Recommender observes the historic and current resource metrics of pods and generates recommendations based on this data.
  • VPA Updater: The Updater component checks the recommendations from the Recommender and decides whether any pod’s resource requests need to be updated. If an update is needed, the Updater will evict the pod.
  • VPA Admission Controller: When a pod is (re-)created, the Admission Controller modifies the pod’s resource requests based on the recommendations from the Recommender. This ensures that the pod starts with the optimal amount of resources.

Since VPA doesn’t support in-place updates, pods will be evicted. You will want to control voluntary evictions by means of Pod Disruption Budgets (PDBs). Please make yourself familiar with those and use them.

ℹ️ Note

PDBs will not always work as expected and can also get in your way, e.g. if the PDB is violated or would be violated, it may possibly block evictions that would actually help your workload, e.g. to get a pod out of an OOMKilled CrashLoopBackoff (if the PDB is or would be violated, not even unhealthy pods would be evicted as they could theoretically become healthy again, which VPA doesn’t know). In order to overcome this issue, it is now possible (alpha since Kubernetes v1.26 in combination with the feature gate PDBUnhealthyPodEvictionPolicy on the API server, beta and enabled by default since Kubernetes v1.27) to configure the so-called unhealthy pod eviction policy. The default is still IfHealthyBudget as a change in default would have changed the behavior (as described above), but you can now also set AlwaysAllow at the PDB (spec.unhealthyPodEvictionPolicy). For more information, please check out this discussion, the PR and this document and balance the pros and cons for yourself. In short, the new AlwaysAllow option is probably the better choice in most of the cases while IfHealthyBudget is useful only if you have frequent temporary transitions or for special cases where you have already implemented controllers that depend on the old behavior.

Defining a VPA Resource

To configure VPA, you need to create a VPA resource in your cluster. This resource specifies the target to scale, the metrics to be used for scaling decisions, and the policies for resource updates. Here’s an example of a VPA configuration:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: foo-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       foo-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: foo-container
      controlledValues: RequestsOnly
      minAllowed:
        cpu: 50m
        memory: 200M
      maxAllowed:
        cpu: 4
        memory: 16G

In this example, VPA is configured to scale foo-deployment requests (RequestsOnly) from 50m cores (minAllowed) up to 4 cores (maxAllowed) and 200M memory (minAllowed) up to 16G memory (maxAllowed) automatically (updateMode). VPA doesn’t support in-place updates, so in updateMode Auto it will evict pods under certain conditions and then mutate the requests (and possibly limits if you omit controlledValues or set it to RequestsAndLimits, which is the default) of upcoming new pods.

Multiple update modes exist. They influence eviction and mutation. The most important ones are:

  • Off: In this mode, recommendations are computed, but never applied. This mode is useful, if you want to learn more about your workload or if you have a custom controller that depends on VPA’s recommendations but shall act instead of VPA.
  • Initial: In this mode, recommendations are computed and applied, but pods are never proactively evicted to enforce new recommendations over time. This mode is useful, if you want to control pod evictions yourself (similar to the StatefulSet updateStrategy OnDelete) or your workload is sensitive to evictions, e.g. some brownfield singleton application or a daemon set pod that is critical for the node.
  • Auto (default): In this mode, recommendations are computed, applied, and pods are even proactively evicted to enforce new recommendations over time. This applies recommendations continuously without you having to worry too much.

As mentioned, controlledValues influences whether only requests or requests and limits are scaled:

  • RequestsOnly: Updates only requests and doesn’t change limits. Useful if you have defined absolute limits (unrelated to the requests).
  • RequestsAndLimits (default): Updates requests and proportionally scales limits along with the requests. Useful if you have defined relative limits (related to the requests). In this case, the gap between requests and limits should be either zero for QoS Guaranteed or small for QoS Burstable to avoid useless (way beyond the threshold of unhealthy behavior) or absurd (larger than node capacity) values.
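To see why a large gap between requests and limits becomes a problem under RequestsAndLimits, consider the proportional scaling in numbers. This is a minimal sketch that assumes VPA keeps the original limit-to-request ratio; the function scaled_limit, the rounding, and all values are illustrative assumptions.

```python
from fractions import Fraction

def scaled_limit(original_request, original_limit, recommended_request):
    """Keep the original limit-to-request ratio when the request changes."""
    ratio = Fraction(original_limit, original_request)
    # Rounding down to whole millicores is an assumption of this sketch.
    return int(recommended_request * ratio)

# CPU in millicores: a small gap (limit 20% above request) stays reasonable.
small_gap = scaled_limit(500, 600, 2000)   # 2400m

# A large gap (limit 8x the request) grows along with the request.
large_gap = scaled_limit(500, 4000, 2000)  # 16000m, i.e. 16 cores
```

With the large gap, a recommendation of 2 cores already yields a limit of 16 cores, which may exceed what a node can offer.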

VPA doesn’t offer many more settings that can be tuned per VPA resource than you see above (unlike HPA with its behavior section). However, there is one more that isn’t shown above, which allows scaling only up or only down (evictionRequirements[].changeRequirement), in case you need that, e.g. to provide resources when needed, but avoid disruptions otherwise.

VPA Options

VPA is an independent community project that consists of a recommender (computing target recommendations and bounds), an updater (evicting pods that are out of recommendation bounds), and an admission controller (mutating webhook applying the target recommendation to newly created pods). As such, they have independent options.

VPA Recommender Options

You can read up on the full VPA recommender options online and set some of them conveniently in your Gardener shoot cluster spec:

  • recommendationMarginFraction (default 15%): Safety margin that will be added to the recommended requests.
  • targetCPUPercentile (default 90%): CPU usage percentile that will be targeted with the CPU recommendation (i.e. recommendation will “fit” e.g. 90% of the observed CPU usages). This setting is relevant for balancing your requests reservations vs. your costs. If you want to reduce costs, you can reduce this value (higher risk because of potential under-reservation, but lower costs), because CPU is compressible, but then VPA may lack the necessary signals for scale-up as throttling on an otherwise fully utilized node will go unnoticed by VPA. If you want to err on the safe side, you can increase this value, but you will then target more and more a worst case scenario, quickly (maybe even exponentially) increasing the costs.
  • targetMemoryPercentile (default 90%): Memory usage percentile that will be targeted with the memory recommendation (i.e. recommendation will “fit” e.g. 90% of the observed memory usages). This setting is relevant for balancing your requests reservations vs. your costs. If you want to reduce costs, you can reduce this value (higher risk because of potential under-reservation, but lower costs), because OOMs will trigger bump-ups, but those will disrupt the workload. If you want to err on the safe side, you can increase this value, but you will then target more and more a worst case scenario, quickly (maybe even exponentially) increasing the costs.
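The interplay of target percentile and safety margin can be illustrated with a few numbers. This is a deliberately simplified sketch: the real recommender works on decaying histograms, whereas this example uses a plain nearest-rank percentile over hypothetical samples; the function names are made up for illustration.

```python
def nearest_rank_percentile(samples, percentile):
    """Smallest sample such that at least `percentile` percent of samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, -(-percentile * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]

def recommendation(samples_millicores, target_percentile=90, margin_percent=15):
    """Percentile of the observed usage plus the safety margin (simplified)."""
    base = nearest_rank_percentile(samples_millicores, target_percentile)
    return base * (100 + margin_percent) // 100

# Hypothetical CPU usage samples in millicores with one outlier.
usage = [100, 110, 120, 120, 130, 140, 150, 160, 200, 900]

p90 = recommendation(usage, target_percentile=90)    # based on 200m
p100 = recommendation(usage, target_percentile=100)  # based on 900m
```

Raising the percentile from 90 to 100 moves the recommendation from 230m to 1035m in this example, which shows how quickly targeting the worst case drives up the reservation.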

There are a few more configurable options of lesser interest:

  • recommenderInterval (default 1m): How often VPA retrieves the pods and metrics and, consequently, how often it recomputes the recommendations and bounds.

There are many more options that you can only configure if you deploy your own VPA and which we will not discuss here, but you can check them out here.

ℹ️ Note

Due to an implementation detail (smallest bucket size), VPA cannot create recommendations below 10m cores and 10M memory even if minAllowed is lower.

VPA Updater Options

You can read up on the full VPA updater options online and set some of them conveniently in your Gardener shoot cluster spec:

  • evictAfterOOMThreshold (default 10m): Pods where at least one container OOMs within this time period since its start will be actively evicted, which will implicitly apply the new target recommendation that will have been bumped up after OOMKill. Please note, the kubelet may evict pods even before an OOM, but only if kube-reserved is underrun, i.e. node-level resources are running low. In these cases, eviction will happen first by pod priority and second by how much the usage overruns the requests.
  • evictionTolerance (default 50%): Defines a threshold below which no further eligible pod will be evicted anymore, i.e. limits how many eligible pods may be in eviction in parallel (but at least 1). The threshold is computed as follows: running - evicted > replicas - tolerance. Example: 10 replicas, 9 running, 8 eligible for eviction, 20% tolerance with 10 replicas which amounts to 2 pods, and no pod evicted in this round yet, then 9 - 0 > 10 - 2 is true and a pod would be evicted, but the next one would be in violation as 9 - 1 = 10 - 2 and no further pod would be evicted anymore in this round.
  • evictionRateBurst (default 1): Defines how many eligible pods may be evicted in one go.
  • evictionRateLimit (default disabled): Defines how many eligible pods may be evicted per second (a value of 0 or -1 disables the rate limiting).
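The evictionTolerance arithmetic from the example above can be replayed in a few lines. This is a minimal sketch of the inequality running - evicted > replicas - tolerance only; the function name, rounding the tolerance down to whole pods, and the exact condition for the “at least 1” rule are assumptions of this example.

```python
def evictions_allowed(replicas, running, tolerance_percent):
    """Count evictions permitted in one round by: running - evicted > replicas - tolerance."""
    tolerance = replicas * tolerance_percent // 100  # in pods, rounded down (assumption)
    evicted = 0
    while running - evicted > replicas - tolerance:
        evicted += 1
    if evicted == 0 and tolerance == 0 and running == replicas:
        # "but at least 1": assumed to apply when all pods are running.
        evicted = 1
    return evicted

from_example = evictions_allowed(replicas=10, running=9, tolerance_percent=20)
default_case = evictions_allowed(replicas=10, running=10, tolerance_percent=50)
single_pod = evictions_allowed(replicas=1, running=1, tolerance_percent=50)
```

The first call reproduces the example from the list: one pod may be evicted in this round, a second one may not.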

In general, avoid modifying these eviction settings unless you have good reasons and try to rely on Pod Disruption Budgets (PDBs) instead. However, PDBs are not available for daemon sets.

There are a few more configurable options of lesser interest:

  • updaterInterval (default 1m): How often VPA evicts the pods.

There are many more options that you can only configure if you deploy your own VPA and which we will not discuss here, but you can check them out here.

Considerations When Using VPA

  • Initial Resource Estimates: VPA requires historical resource usage data to base its recommendations on. Until they kick in, your initial resource requests apply and should be sensible.
  • Pod Disruption: When VPA adjusts the resources for a pod, it may need to “recreate” the pod, which can cause temporary disruptions. This should be taken into account.
  • Compatibility with HPA: Care must be taken when using VPA in conjunction with HPA, as they can potentially interfere with each other’s scaling decisions.

Combining HPA and VPA

HPA and VPA serve different purposes and operate on different axes of scaling. HPA increases or decreases the number of pod replicas based on metrics like CPU or memory usage, effectively scaling the application out or in. VPA, on the other hand, adjusts the CPU and memory reservations of individual pods, scaling the application up or down.

When used together, these autoscalers can provide both horizontal and vertical scaling. However, they can also conflict with each other if used on the same metrics (e.g. both on CPU or both on memory). In particular, if VPA adjusts the requests, the utilization, i.e. the ratio between usage and requests, will approach 100% (for various reasons not exactly right, but for this consideration, close enough), which may trigger HPA to scale out, if it’s configured to scale on utilization below 100% (often seen in simple examples), which will spread the load across more pods, which may trigger VPA again to adjust the requests to match the new pod usages.

This is a feedback loop and it stems from HPA’s method of calculating the desired number of replicas, which is:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

If desiredMetricValue is utilization and VPA adjusts the requests, which changes the utilization, this may inadvertently trigger HPA and create said feedback loop. On the other hand, if desiredMetricValue is usage and VPA adjusts the requests now, this will have no impact on HPA anymore (HPA will always influence VPA, but we can control whether VPA influences HPA).
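The difference between the two target types can be checked with the formula itself. This is a minimal sketch with hypothetical numbers; it leaves out HPA’s tolerance and stabilization behavior, and the helper names are made up for illustration.

```python
import math
from fractions import Fraction

def desired_replicas(current_replicas, current_metric, desired_metric):
    """desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]"""
    return math.ceil(current_replicas * Fraction(current_metric, desired_metric))

def utilization_percent(usage_millicores, request_millicores):
    """Utilization is usage relative to requests."""
    return Fraction(usage_millicores * 100, request_millicores)

replicas, usage = 4, 400  # 4 pods, each using 400m CPU on average

# HPA on utilization (target 80%): the result depends on the current requests.
before_vpa = desired_replicas(replicas, utilization_percent(usage, 500), 80)  # requests 500m
after_vpa = desired_replicas(replicas, utilization_percent(usage, 400), 80)   # requests lowered to 400m

# HPA on usage (target 400m): requests do not appear in the calculation at all.
on_usage = desired_replicas(replicas, usage, 400)
```

With unchanged load, lowering the requests from 500m to 400m raises the utilization from 80% to 100% and the desired replica count from 4 to 5, whereas the usage-based target stays at 4 replicas.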

Therefore, to safely combine HPA and VPA, consider the following strategies:

  • Configure HPA and VPA on different metrics: One way to avoid conflicts is to use HPA and VPA based on different metrics. For instance, you could configure HPA to scale based on requests per second (or another representative custom/external metric) and VPA to adjust CPU and memory requests. This way, each autoscaler operates independently based on its specific metric(s).
  • Configure HPA to scale on usage, not utilization, when used with VPA: Another way to avoid conflicts is to use HPA not on average utilization (averageUtilization), but instead on average usage (averageValue) as replicas driver, which is an absolute metric (requests don’t affect usage). This way, you can combine both autoscalers even on the same metrics.

Pod Autoscaling and Cluster Autoscaler

Autoscaling within Kubernetes can be implemented at different levels: pod autoscaling (HPA and VPA) and cluster autoscaling (CA). While pod autoscaling adjusts the number of pod replicas or their resource reservations, cluster autoscaling focuses on the number of nodes in the cluster, so that your pods can be hosted. If your workload isn’t static and especially if you make use of pod autoscaling, it only works if you have sufficient node capacity available. The most effective way to do that, without running a worst-case number of nodes, is to configure burstable worker pools in your shoot spec, i.e. define a true minimum node count and a worst-case maximum node count and leave the node autoscaling to Gardener that internally uses the Cluster Autoscaler to provision and deprovision nodes as needed.

Cluster Autoscaler automatically adjusts the number of nodes by adding or removing nodes based on the demands of the workloads and the available resources. It interacts with the cloud provider’s APIs to provision or deprovision nodes as needed. Cluster Autoscaler monitors the utilization of nodes and the scheduling of pods. If it detects that pods cannot be scheduled due to a lack of resources, it will trigger the addition of new nodes to the cluster. Conversely, if nodes are underutilized for some time and their pods can be placed on other nodes, it will remove those nodes to reduce costs and improve resource efficiency.

Best Practices:

  • Resource Buffering: Maintain a buffer of resources to accommodate temporary spikes in demand without waiting for node provisioning. This can be done by deploying pods with low priority that can be preempted when real workloads require resources. This helps in faster pod scheduling and avoids delays in scaling out or up.
  • Pod Disruption Budgets (PDBs): Use PDBs to ensure that during scale-down events, the availability of applications is maintained as the Cluster Autoscaler will not voluntarily evict a pod if a PDB would be violated.

Interesting CA Options

CA can be configured in your Gardener shoot cluster spec globally and also in parts per worker pool:

  • Can only be configured globally:
    • expander (default least-waste): Defines the “expander” algorithm to use during scale-up, see FAQ.
    • scaleDownDelayAfterAdd (default 1h): Defines how long after scaling up a node, a node may be scaled down.
    • scaleDownDelayAfterFailure (default 3m): Defines how long after scaling down a node failed, scaling down will be resumed.
    • scaleDownDelayAfterDelete (default 0s): Defines how long after scaling down a node, another node may be scaled down.
  • Can be configured globally and also overwritten individually per worker pool:
    • scaleDownUtilizationThreshold (default 50%): Defines the threshold below which a node becomes eligible for scaling down.
    • scaleDownUnneededTime (default 30m): Defines the trailing time window the node must be consistently below a certain utilization threshold before it can finally be scaled down.

There are many more options that you can only configure if you deploy your own CA and which we will not discuss here, but you can check them out here.
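How scaleDownUtilizationThreshold and scaleDownUnneededTime work together can be sketched as follows. This is a simplified illustration, assuming that node utilization is the sum of pod requests divided by the node allocatable and that the larger of the CPU and memory ratios counts; the function names and numbers are hypothetical, and the additional check whether the pods fit on other nodes is left out.

```python
from datetime import timedelta

def node_utilization(requested_cpu_m, allocatable_cpu_m, requested_mem_mi, allocatable_mem_mi):
    """Requests divided by allocatable; the larger of CPU and memory (assumed)."""
    return max(requested_cpu_m / allocatable_cpu_m, requested_mem_mi / allocatable_mem_mi)

def scale_down_candidate(utilization, time_below_threshold,
                         threshold=0.5, unneeded_time=timedelta(minutes=30)):
    """A node qualifies once it stayed below the threshold for the whole window."""
    return utilization < threshold and time_below_threshold >= unneeded_time

util = node_utilization(1000, 4000, 6144, 16384)  # CPU 25%, memory 37.5%

long_enough = scale_down_candidate(util, timedelta(minutes=45))
too_early = scale_down_candidate(util, timedelta(minutes=10))
too_busy = scale_down_candidate(0.6, timedelta(hours=2))
```

Note that the calculation is based on requests, not on actual usage, which is one more reason to keep requests realistic.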

Importance of Monitoring

Monitoring is a critical component of autoscaling for several reasons:

  • Performance Insights: It provides insights into how well your autoscaling strategy is meeting the performance requirements of your applications.
  • Resource Utilization: It helps you understand resource utilization patterns, enabling you to optimize resource allocation and reduce waste.
  • Cost Management: It allows you to track the cost implications of scaling actions, helping you to maintain control over your cloud spending.
  • Troubleshooting: It enables you to quickly identify and address issues with autoscaling, such as unexpected scaling behavior or resource bottlenecks.

To effectively monitor autoscaling, you should leverage the following tools and metrics:

  • Kubernetes Metrics Server: Collects resource metrics from kubelets and provides them to HPA and VPA for autoscaling decisions (automatically provided by Gardener).
  • Prometheus: An open-source monitoring system that can collect and store custom metrics, providing a rich dataset for autoscaling decisions.
  • Grafana/Plutono: A visualization tool that integrates with Prometheus to create dashboards for monitoring autoscaling metrics and events.
  • Cloud Provider Tools: Most cloud providers offer native monitoring solutions that can be used to track the performance and costs associated with autoscaling.

Key metrics to monitor include:

  • CPU and Memory Utilization: Track the resource utilization of your pods and nodes to understand how they correlate with scaling events.
  • Pod Count: Monitor the number of pod replicas over time to see how HPA is responding to changes in load.
  • Scaling Events: Keep an eye on scaling events triggered by HPA and VPA to ensure they align with expected behavior.
  • Application Performance Metrics: Track application-specific metrics such as response times, error rates, and throughput.

Based on the insights gained from monitoring, you may need to adjust your autoscaling configurations:

  • Refine Thresholds: If you notice frequent scaling actions or periods of underutilization or overutilization, adjust the thresholds used by HPA and VPA to better match the workload patterns.
  • Update Policies: Modify VPA update policies if you observe that the current settings are causing too much or too little pod disruption.
  • Custom Metrics: If using custom metrics, ensure they accurately reflect the load on your application and adjust them if they do not.
  • Scaling Limits: Review and adjust the minimum and maximum scaling limits to prevent over-scaling or under-scaling based on the capacity of your cluster and the criticality of your applications.

Quality of Service (QoS)

A few words on the quality of service for pods. Basically, there are 3 classes of QoS and they influence the eviction of pods when kube-reserved is underrun, i.e. node-level resources are running low:

  • BestEffort, i.e. pods where no container has CPU or memory requests or limits: Avoid them unless you have really good reasons. The kube-scheduler will place them just anywhere according to its policy, e.g. balanced or bin-packing, but whatever resources these pods consume, may bring other pods into trouble or even the kubelet and the container runtime itself, if it happens very suddenly.
  • Burstable, i.e. pods where at least one container has CPU or memory requests and at least one has no limits or limits that don’t match the requests: Prefer them unless you have really good reasons for the other QoS classes. Always specify proper requests or use VPA to recommend those. This helps the kube-scheduler to make the right scheduling decisions. Not having limits will additionally provide upward resource flexibility, if the node is not under pressure.
  • Guaranteed, i.e. pods where all containers have CPU and memory requests and equal limits: Avoid them unless you really know the limits or throttling/killing is intended. While “Guaranteed” sounds like something “positive” in the English language, this class comes with the downside, that pods will be actively CPU-throttled and will actively go OOM, even if the node is not under pressure and has excess capacity left. Worse, if containers in the pod are under VPA, their CPU requests/limits will often not be scaled up as CPU throttling will go unnoticed by VPA.
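The three classes can be told apart mechanically from the container resources. The following is a minimal sketch of that classification; the dictionary layout is an assumption of this example, and quantities are compared as plain strings, so equivalent spellings such as 500m and 0.5 would not be recognized as equal.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (dicts with optional
    'requests' and 'limits' mappings for 'cpu' and 'memory')."""
    all_guaranteed = True
    any_set = False
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"

resources = {"cpu": "100m", "memory": "128Mi"}

best_effort = qos_class([{}])
burstable = qos_class([{"requests": resources}])
guaranteed = qos_class([{"requests": resources, "limits": resources}])
```

Following the recommendation above, a pod with requests but without limits ends up as Burstable.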

Summary

  • As a rule of thumb, always set CPU and memory requests (or let VPA do that) and always avoid CPU and memory limits.
    • CPU limits aren’t helpful on an under-utilized node (=may result in needless outages) and even suppress the signals for VPA to act. On a nearly or fully utilized node, CPU limits are practically irrelevant as only the requests matter, which are translated into CPU shares that provide a fair use of the CPU anyway (see CFS).
      Therefore, if you do not know the healthy range, do not set CPU limits. If you as author of the source code know its healthy range, set them to the upper threshold of that healthy range (everything above, from your knowledge of that code, is definitely an unbound busy loop or similar, which is the main reason for CPU limits, besides batch jobs where throttling is acceptable or even desired).
    • Memory limits may be more useful, but suffer a similar, though not as negative downside. As with CPU limits, memory limits aren’t helpful on an under-utilized node (=may result in needless outages), but different than CPU limits, they result in an OOM, which triggers VPA to provide more memory suddenly (modifies the currently computed recommendations by a configurable factor, defaulting to +20%, see docs).
      Therefore, if you do not know the healthy range, do not set memory limits. If you as author of the source code know its healthy range, set them to the upper threshold of that healthy range (everything above, from your knowledge of that code, is definitely an unbound memory leak or similar, which is the main reason for memory limits).
  • Horizontal Pod Autoscaling (HPA): Use for pods that support horizontal scaling. Prefer scaling on usage, not utilization, as this is more predictable (not dependent on a second variable, namely the current requests) and conflict-free with vertical pod autoscaling (VPA).
  • As a rule of thumb, set the initial replicas to the 5th percentile of the actually observed replica count in production. Since HPA reacts fast, this is not as critical, but may help reduce initial load on the control plane early after deployment. However, be cautious when you update the higher-level resource not to inadvertently reset the current HPA-controlled replica count (a very easy mistake to make that can lead to catastrophic loss of pods). HPA modifies the replica count directly in the spec and you do not want to overwrite that. Even if it reacts fast, it is not instant (not via a mutating webhook as VPA operates) and the damage may already be done.
  • As for minimum and maximum, let your high availability requirements determine the minimum and your theoretical maximum load determine the maximum, flanked with alerts to detect erroneous run-away out-scaling or the actual nearing of your practical maximum load, so that you can intervene.
  • Vertical Pod Autoscaling (VPA): Use for containers that have a significant usage (e.g. any container above 50m CPU or 100M memory) and a significant usage spread over time (by more than 2x), i.e. ignore small (e.g. side-cars) or static (e.g. Java statically allocated heap) containers, but otherwise use it to provide the resources needed on the one hand and keep the costs in check on the other hand.
  • As a rule of thumb, set the initial requests to the 5th percentile of the actually observed CPU or memory usage in production. Since VPA may need some time at first to respond and evict pods, this is especially critical early after deployment. The lower bound, below which pods will be immediately evicted, converges much faster than the upper bound, above which pods will be immediately evicted, but it isn’t instant, e.g. after 5 minutes the lower bound is just at 60% of the computed lower bound; after 12 hours the upper bound is still at 300% of the computed upper bound (see code). Unlike with HPA, you don’t need to be as cautious when updating the higher-level resource in the case of VPA. As long as VPA’s mutating webhook (VPA Admission Controller) is operational (which the VPA Updater also checks before evicting pods), it’s generally safe to update the higher-level resource. However, if it’s not up and running, any new pods that are spawned (e.g. as a consequence of a rolling update of the higher-level resource or for any other reason) will not be mutated. Instead, they will receive whatever requests are currently configured at the higher-level resource, which can lead to catastrophic resource under-reservation. Gardener deploys the VPA Admission Controller in HA - if unhealthy, it is reported under the ControlPlaneHealthy shoot status condition.
  • If you have defined absolute limits (unrelated to the requests), configure VPA to only scale the requests or else it will proportionally scale the limits as well, which can easily become useless (way beyond the threshold of unhealthy behavior) or absurd (larger than node capacity):
    spec:
      resourcePolicy:
        containerPolicies:
        - controlledValues: RequestsOnly
          ...
    
    If you have defined relative limits (related to the requests), the default policy to scale the limits proportionally with the requests is fine, but the gap between requests and limits must be zero for QoS Guaranteed and should best be small for QoS Burstable to avoid useless or absurd limits as well, e.g. prefer limits being 5 to at most 20% larger than requests as opposed to being 100% larger or more.
  • As a rule of thumb, set minAllowed to the highest observed VPA recommendation (usually during the initialization phase or during any periodical activity) for an otherwise practically idle container, so that you avoid needless thrashing (e.g. resource usage calms down over time and recommendations drop consecutively until eviction, which will then lead again to initialization or later periodical activity and higher recommendations and new evictions).
    ⚠️ You may want to provide higher minAllowed values, if you observe that up-scaling takes too long for CPU or memory for a too large percentile of your workload. This will get you out of the danger zone of too few resources for too many pods at the expense of providing too many resources for a few pods. Memory may react faster than CPU, because CPU throttling is not visible and memory gets aided by OOM bump-up incidents, but still, if you observe that up-scaling takes too long, you may want to increase minAllowed accordingly.
  • As a rule of thumb, set maxAllowed to your theoretical maximum load, flanked with alerts to detect erroneous run-away usage or the actual nearing of your practical maximum load, so that you can intervene. However, VPA can easily recommend requests larger than what is allocatable on a node, so you must either ensure large enough nodes (Gardener can scale up from zero, in case you like to define a low-priority worker pool with more resources for very large pods) and/or cap VPA’s target recommendations using maxAllowed at the node allocatable remainder (after daemon set pods) of the largest eligible machine type (may result in under-provisioning resources for a pod). Use your monitoring and check maximum pod usage to decide about the maximum machine type.
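The node allocatable remainder mentioned for maxAllowed is a simple subtraction. This is a minimal sketch with hypothetical numbers; the function name, the resource keys, and all values are assumptions for illustration and should be replaced with the allocatable of your largest eligible machine type and the requests of your daemon set pods.

```python
def max_allowed_cap(node_allocatable, daemonset_requests):
    """Per-resource remainder of a fresh node after its daemon set pods."""
    return {
        resource: allocatable - sum(r.get(resource, 0) for r in daemonset_requests)
        for resource, allocatable in node_allocatable.items()
    }

# Hypothetical node allocatable (CPU in millicores, memory in Mi).
allocatable = {"cpu_m": 15890, "memory_mi": 60000}

# Hypothetical requests of the daemon set pods that run on every node.
daemonsets = [
    {"cpu_m": 100, "memory_mi": 200},
    {"cpu_m": 250, "memory_mi": 300},
    {"cpu_m": 40},
]

cap = max_allowed_cap(allocatable, daemonsets)
```

Since maxAllowed is set per container, the caps of all containers of a pod together must stay within this remainder for the pod to remain schedulable.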

Recommendations in a Box

Container | When to use | Value
--- | --- | ---
Requests | Set them (recommended) unless you want QoS BestEffort: do not set requests for QoS BestEffort, which is useful only if the pod can be evicted as often as needed and can pick up where it left off without any penalty | Set requests to the 95th percentile (w/o VPA) or the 5th percentile (w/ VPA) of the actually observed CPU or memory usage in production (see below)
Limits | Avoid them (recommended) unless: (1) you want QoS Guaranteed, which is useful only if the pod has strictly static resource requirements; (2) you want to throttle CPU usage for containers that can be throttled w/o any other disadvantage than processing time (never do that when time-critical operations like leases are involved); (3) you know the healthy range and want to shield against unbound busy loops, unbound memory leaks, or similar | If you really can (otherwise not), set limits to healthy theoretical max load

Scaler | When to use | Initial | Minimum | Maximum
--- | --- | --- | --- | ---
HPA | Use for pods that support horizontal scaling | Set initial replicas to 5th percentile of the actually observed replica count in production (prefer scaling on usage, not utilization) and make sure to never overwrite it later when controlled by HPA | Set minReplicas to 0 (requires feature gate and custom/external metrics), to 1 (regular HPA minimum), or whatever the high availability requirements of the workload demand | Set maxReplicas to healthy theoretical max load
VPA | Use for containers that have a significant usage (>50m/100M) and a significant usage spread over time (>2x) | Set initial requests to 5th percentile of the actually observed CPU or memory usage in production | Set minAllowed to highest observed VPA recommendation (includes start-up phase) for an otherwise practically idle container (avoids pod thrashing when pod gets evicted after idling) | Set maxAllowed to fresh node allocatable remainder after daemonset pods (avoids pending pods when requests exceed fresh node allocatable remainder) or, if you really can (otherwise not), to healthy theoretical max load (less disruptive than limits as no throttling or OOM happens on under-utilized nodes)
CA | Use for dynamic workloads, definitely if you use HPA and/or VPA | N/A | Set minimum to 0 or number of nodes required right after cluster creation or wake-up | Set maximum to healthy theoretical max load

ℹ️ Note

Theoretical max load may be very difficult to ascertain, especially with modern software that consists of building blocks you do not own or know in detail. If you have comprehensive monitoring in place, you may be tempted to pick the observed maximum and add a safety margin or even factor on top (2x, 4x, or any other number), but this is not to be confused with “theoretical max load” (solely depending on the code, not observations from the outside). At any point in time, your numbers may change, e.g. because you updated a software component or your usage increased. If you decide to use numbers that are set based only on observations, make sure to flank those numbers with monitoring alerts, so that you have sufficient time to investigate, revise, and readjust if necessary.

Conclusion

Pod autoscaling is a dynamic and complex aspect of Kubernetes, but it is also one of the most powerful tools at your disposal for maintaining efficient, reliable, and cost-effective applications. By carefully selecting the appropriate autoscaler, setting well-considered thresholds, and continuously monitoring and adjusting your strategies, you can ensure that your Kubernetes deployments are well-equipped to handle your resource demands while not over-paying for the provided resources at the same time.

As Kubernetes continues to evolve (e.g. in-place updates) and as new patterns and practices emerge, the approaches to autoscaling may also change. However, the principles discussed above will remain foundational to creating scalable and resilient Kubernetes workloads. Whether you’re a developer or operations engineer, a solid understanding of pod autoscaling will be instrumental in the successful deployment and management of containerized applications.

6.2 - Specifying a Disruption Budget for Kubernetes Controllers

Introduction of Disruptions

We need to understand that some kinds of voluntary disruptions can happen to pods. For example, they can be caused by cluster administrators who want to perform automated cluster actions, like upgrading and autoscaling clusters. Typical application owner actions include:

  • deleting the deployment or other controller that manages the pod
  • updating a deployment’s pod template causing a restart
  • directly deleting a pod (e.g., by accident)

Set Up Pod Disruption Budgets

Kubernetes offers a feature called PodDisruptionBudget (PDB) for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.

The most common use case is when you want to protect an application specified by one of the built-in Kubernetes controllers:

  • Deployment
  • ReplicationController
  • ReplicaSet
  • StatefulSet

A PodDisruptionBudget has three fields:

  • A label selector .spec.selector to specify the set of pods to which it applies.
  • .spec.minAvailable, the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. It can be either an absolute number or a percentage.
  • .spec.maxUnavailable, the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
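
The three fields above can be combined as in the following minimal sketch, which writes a PDB manifest to a file. The name my-app-pdb and the label app: my-app are hypothetical placeholders, and the apiVersion policy/v1 assumes Kubernetes v1.21 or later (older clusters use policy/v1beta1).

```shell
# Write an illustrative PodDisruptionBudget manifest.
# The name "my-app-pdb" and the label "app: my-app" are hypothetical.
cat > my-app-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  # Set either minAvailable or maxUnavailable, not both.
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
EOF

# Apply it to the cluster (requires cluster access):
# kubectl apply -f my-app-pdb.yaml
```

The selector has to match the labels that the controller puts on its pods; otherwise the budget protects nothing.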

Cluster Upgrade or Node Deletion Failed due to PDB Violation

Misconfiguration of the PDB could block the cluster upgrade or node deletion processes. There are two main cases that can cause a misconfiguration.

Case 1: The Kubernetes controller has only 1 replica

  • Only 1 replica is running: there is no replicaCount setup or replicaCount for the Kubernetes controllers is set to 1

  • PDB configuration

      spec:
        minAvailable: 1
    
  • To fix this PDB misconfiguration, you need to change the value of replicaCount for the Kubernetes controllers to a number greater than 1

Case 2: HPA configuration violates PDB

In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand. The HorizontalPodAutoscaler manages the replicas field of the Kubernetes controllers.

  • There is no replicaCount setup or replicaCount for the Kubernetes controllers is set to 1

  • PDB configuration

      spec:
        minAvailable: 1
    
  • HPA configuration

      spec:
        minReplicas: 1
    
  • To fix this PDB misconfiguration, you need to change the value of HPA minReplicas to be greater than 1
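
Both cases come down to the same arithmetic: with all replicas healthy, the number of pods that can be evicted at once is the replica count minus minAvailable. The helper below is a simplified sketch of that calculation (the function name allowed_disruptions is made up for illustration, and minAvailable is assumed to be an absolute number, not a percentage).

```shell
# allowed_disruptions REPLICAS MIN_AVAILABLE
# Prints how many pods may be evicted voluntarily at the same time,
# assuming all replicas are healthy and minAvailable is an absolute number.
allowed_disruptions() {
  local replicas="$1"
  local min_available="$2"
  local allowed=$(( replicas - min_available ))
  # A budget can never allow fewer than zero disruptions.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 1 1   # prints 0: no pod can be evicted, node drain is blocked
allowed_disruptions 2 1   # prints 1: one pod at a time can be evicted
```

A result of 0 means that draining a node which hosts one of these pods cannot complete, which is what blocks the cluster upgrade or node deletion.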

6.3 - Access a Port of a Pod Locally

Question

You have deployed an application with a web UI or an internal endpoint in your Kubernetes (K8s) cluster. How can you access this endpoint without an external load balancer (e.g., Ingress)?

This tutorial presents two options:

  • Using Kubernetes port forward
  • Using Kubernetes apiserver proxy

Please note that the options described here are mostly for quick testing or troubleshooting your application. To enable access to your application in a production environment, please refer to the official Kubernetes documentation.

Solution 1: Using Kubernetes Port Forward

You could use the port forwarding functionality of kubectl to access the pods from your local host without involving a service.

To access any pod follow these steps:

  1. Run kubectl get pods
  2. Note down the name of the pod in question as <your-pod-name>
  3. Run kubectl port-forward <your-pod-name> <local-port>:<your-app-port>
  4. Run a web browser or curl locally and enter the URL: http(s)://localhost:<local-port>

In addition, kubectl port-forward allows using a resource name, such as a deployment name or service name, to select a matching pod to port forward. More details can be found in the Kubernetes documentation.

The main drawback of this approach is that the pod’s name changes as soon as it is restarted. Moreover, you need to have a web browser on your client and you need to make sure that the local port is not already used by an application running on your system. Finally, sometimes the port forwarding is canceled for non-obvious reasons, which makes this approach somewhat unreliable. A more stable alternative is to access the app via the apiserver proxy, which accesses the corresponding service.


Solution 2: Using the apiserver Proxy of Your Kubernetes Cluster

There are several different proxies in Kubernetes. In this tutorial we will be using apiserver proxy to enable the access to the services in your cluster without Ingress. Unlike the first solution, here a service is required.

Use the following format to compose a URL for accessing your service through an existing proxy on the Kubernetes cluster:

https://<your-cluster-master>/api/v1/namespaces/<your-namespace>/services/<your-service>:<your-service-port>/proxy/<service-endpoint>

Example:

your-main-cluster | your-namespace | your-service | your-service-port | your-service-endpoint | url to access service
api.testclstr.cpet.k8s.sapcloud.io | default | nginx-svc | 80 | / | http://api.testclstr.cpet.k8s.sapcloud.io/api/v1/namespaces/default/services/nginx-svc:80/proxy/
api.testclstr.cpet.k8s.sapcloud.io | default | docker-nodejs-svc | 4500 | /cpu?baseNumber=4 | https://api.testclstr.cpet.k8s.sapcloud.io/api/v1/namespaces/default/services/docker-nodejs-svc:4500/proxy/cpu?baseNumber=4

For more details on the format, please refer to the official Kubernetes documentation.
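
As a small sketch, the URL format can be assembled with a shell function. The function name proxy_url is an illustrative assumption; the host, namespace, service, and port values are taken from the example above.

```shell
# proxy_url API_SERVER NAMESPACE SERVICE PORT [ENDPOINT]
# Prints the apiserver proxy URL for the given service.
proxy_url() {
  local api_server="$1"
  local namespace="$2"
  local service="$3"
  local port="$4"
  # Strip a leading slash so the result does not contain "proxy//".
  local endpoint="${5#/}"
  echo "https://${api_server}/api/v1/namespaces/${namespace}/services/${service}:${port}/proxy/${endpoint}"
}

proxy_url api.testclstr.cpet.k8s.sapcloud.io default docker-nodejs-svc 4500 '/cpu?baseNumber=4'
```

Requests to such a URL still have to be authenticated against the API server, for example with the credentials from your kubeconfig.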

6.4 - Auditing Kubernetes for Secure Setup

A few insecure configurations in Kubernetes


Increasing the Security of All Gardener Stakeholders

In summer 2018, the Gardener project team asked Kinvolk to execute several penetration tests in its role as third-party contractor. The goal of this ongoing work was to increase the security of all Gardener stakeholders in the open source community. Following the Gardener architecture, the control plane of a Gardener managed shoot cluster resides in the corresponding seed cluster. This is a Control-Plane-as-a-Service with a network air gap.

Along the way we found various kinds of security issues, for example, due to misconfiguration or missing isolation, as well as two special problems with upstream Kubernetes and its Control-Plane-as-a-Service architecture.

Major Findings

From this experience, we’d like to share a few examples of security issues that could happen on a Kubernetes installation and how to fix them.

Alban Crequy (Kinvolk) and Dirk Marwinski (SAP SE) gave a presentation entitled Hardening Multi-Cloud Kubernetes Clusters as a Service at KubeCon 2018 in Shanghai presenting some of the findings.

Here is a summary of the findings:

  • Privilege escalation due to insecure configuration of the Kubernetes API server

    • Root cause: Same certificate authority (CA) is used for both the API server and the proxy that allows accessing the API server.
    • Risk: Users can get access to the API server.
    • Recommendation: Always use different CAs.
  • Exploration of the control plane network with malicious HTTP-redirects

    • Root cause: See detailed description below.
    • Risk: Provoked error message contains full HTTP payload from an existing endpoint which can be exploited. The contents of the payload depend on your setup, but can potentially be user data, configuration data, and credentials.
    • Recommendation:
      • Use the latest version of Gardener
      • Ensure the seed cluster’s container network supports network policies. Clusters that have been created with Kubify are not protected as Flannel is used there which doesn’t support network policies.
  • Reading private AWS metadata via Grafana

    • Root cause: By configuring a new custom data source in Grafana, users can send HTTP requests that target the control plane network, for example the AWS metadata service.
    • Risk: Users can get the “user-data” for the seed cluster from the metadata service and retrieve a kubeconfig for that Kubernetes cluster.
    • Recommendation: Lock down Grafana features to only what’s necessary in this setup, block all unnecessary outgoing traffic, move Grafana to a different network, lock down unauthenticated endpoints.

Scenario 1: Privilege Escalation with Insecure API Server

In most configurations, different components connect directly to the Kubernetes API server, often using a kubeconfig with a client certificate. The API server is started with the flag:

/hyperkube apiserver --client-ca-file=/srv/kubernetes/ca/ca.crt ...

The API server will check whether the client certificate presented by kubectl, kubelet, scheduler or another component is really signed by the configured certificate authority for clients.

The API server can have many clients of various kinds


However, it is possible to configure the API server differently for use with an intermediate authenticating proxy. The proxy will authenticate the client with its own custom method and then issue HTTP requests to the API server with additional HTTP headers specifying the user name and group name. The API server should only accept HTTP requests with HTTP headers from a legitimate proxy. To allow the API server to check incoming requests, you need to pass a list of certificate authorities (CAs) to it. Requests coming from a proxy are only accepted if they use a client certificate that is signed by one of the CAs of that list.

--requestheader-client-ca-file=/srv/kubernetes/ca/ca-proxy.crt
--requestheader-username-headers=X-Remote-User
--requestheader-group-headers=X-Remote-Group

API server clients can reach the API server through an authenticating proxy


So far, so good. But what happens if the malicious user “Mallory” tries to connect directly to the API server and reuses the HTTP headers to pretend to be someone else?

What happens when a client bypasses the proxy, connecting directly to the API server?


With a correct configuration, Mallory’s kubeconfig will have a certificate signed by the API server certificate authority but not signed by the proxy certificate authority. So the API server will not accept the extra HTTP header “X-Remote-Group: system:masters”.

You only run into an issue when the same certificate authority is used for both the API server and the proxy. Then, any Kubernetes client certificate can be used to take the role of a different user or group, as the API server will accept the user header and group header.
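
A quick sanity check for this misconfiguration is to compare the two CA files passed to the API server. The function below is only a rough sketch (its name check_distinct_cas is made up for illustration): it compares the files byte for byte, so it would not detect the same CA stored in two different encodings.

```shell
# check_distinct_cas CLIENT_CA PROXY_CA
# Returns 0 if both files are readable and differ byte for byte,
# 1 if they are identical, and 2 if a file cannot be read.
check_distinct_cas() {
  if [ ! -r "$1" ] || [ ! -r "$2" ]; then
    echo "ERROR: cannot read $1 or $2" >&2
    return 2
  fi
  if cmp -s "$1" "$2"; then
    echo "WARNING: $1 and $2 are identical" >&2
    return 1
  fi
  echo "OK: $1 and $2 differ"
}

# Example with the paths from the flags above (run on the control plane host):
# check_distinct_cas /srv/kubernetes/ca/ca.crt /srv/kubernetes/ca/ca-proxy.crt
```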

The kubectl tool does not normally add those HTTP headers but it’s pretty easy to generate the corresponding HTTP requests manually.

We worked on improving the Kubernetes documentation to make it clearer that this configuration should be avoided.

Scenario 2: Exploration of the Control Plane Network with Malicious HTTP-Redirects

The API server is a central component of Kubernetes and many components initiate connections to it, including the kubelet running on worker nodes. Most of the requests from those clients will end up updating Kubernetes objects (pods, services, deployments, and so on) in the etcd database but the API server usually does not need to initiate TCP connections itself.

The API server is mostly a component that receives requests


However, there are exceptions. Some kubectl commands will trigger the API server to open a new connection to the kubelet. kubectl exec is one of those commands. In order to get the standard I/Os from the pod, the API server will start an HTTP connection to the kubelet on the worker node where the pod is running. Depending on the container runtime used, it can be done in different ways, but one way to do it is for the kubelet to reply with a HTTP-302 redirection to the Container Runtime Interface (CRI). Basically, the kubelet is telling the API server to get the streams from CRI itself directly instead of forwarding. The redirection from the kubelet will only change the port and path from the URL; the IP address will not be changed because the kubelet and the CRI component run on the same worker node.

But the API server also initiates some connections, for example, to worker nodes


It’s often quite easy for users of a Kubernetes cluster to get access to worker nodes and tamper with the kubelet. They could be given explicit SSH access or they could be given a kubeconfig with enough privileges to create privileged pods or even just pods with “host” volumes.

In contrast, users (even those with “system:masters” permissions or “root” rights) are often not given access to the control plane. On setups like, for example, GKE or Gardener, the control plane is running on separate nodes, with a different administrative access. It could be hosted on a different cloud provider account. So users are not free to explore the internal networking of the control plane.

What would happen if a user was tampering with the kubelet to make it maliciously redirect kubectl exec requests to a different random endpoint? Most likely the given endpoint would not speak the streaming server protocol, so there would be an error. However, the full HTTP payload from the endpoint is included in the error message printed by kubectl exec.

The API server is tricked to connect to other components


The impact of this issue depends on the specific setup. But in many configurations, we could find a metadata service (such as the AWS metadata service) containing user data, configurations and credentials. The setup we explored had a different AWS account and a different EC2 instance profile for the worker nodes and the control plane. This issue allowed users to get access to the AWS metadata service in the context of the control plane, which they should not have access to.

We have reported this issue to the Kubernetes Security mailing list, and the public pull request that addresses the issue (PR#66516) has been merged. It provides a way to enforce HTTP redirect validation (disabled by default).

But there are several other ways that users could trigger the API server to generate HTTP requests and get the reply payload back, so it is advised to isolate the API server and other components from the network as an additional precautionary measure. Depending on where the API server runs, this could be done with Kubernetes Network Policies, EC2 Security Groups or just iptables directly. Following the defense in depth principle, it is a good idea to apply the API server HTTP redirect validation when it is available as well as firewall rules.

In Gardener, this has been fixed with Kubernetes network policies along with changes to ensure the API server does not need to contact the metadata service. You can see more details in the announcements on the Gardener mailing list. This is tracked in CVE-2018-2475.

To be protected from this issue, stakeholders should:

  • Use the latest version of Gardener
  • Ensure the seed cluster’s container network supports network policies. Clusters that have been created with Kubify are not protected as Flannel is used there which doesn’t support network policies.

Scenario 3: Reading Private AWS Metadata via Grafana

For our tests, we had access to a Kubernetes setup where users are not only given access to the API server in the control plane, but also to a Grafana instance that is used to gather data from their Kubernetes clusters via Prometheus. The control plane is managed and users don’t have access to the nodes that it runs on. They can only access the API server and Grafana via a load balancer. The internal network of the control plane is therefore hidden from users.

Prometheus and Grafana can be used to monitor worker nodes


Unfortunately, that setup was not protecting the control plane network from nosy users. By configuring a new custom data source in Grafana, we could send HTTP requests to target the control plane network, for example the AWS metadata service. The reply payload is not displayed on the Grafana Web UI but it is possible to access it from the debugging console of the Chrome browser.

Credentials can be retrieved from the debugging console of Chrome


Adding a Grafana data source is a way to issue HTTP requests to arbitrary targets


In that installation, users could get the “user-data” for the seed cluster from the metadata service and retrieve a kubeconfig for that Kubernetes cluster.

There are many possible measures to avoid this situation: lock down Grafana features to only what’s necessary in this setup, block all unnecessary outgoing traffic, move Grafana to a different network, or lock down unauthenticated endpoints, among others.

Conclusion

The three scenarios above show pitfalls with a Kubernetes setup. A lot of them were specific to the Kubernetes installation: different cloud providers or different configurations will show different weaknesses. Users should no longer be given access to Grafana.

6.5 - Container Image Not Pulled

Wrong Container Image or Invalid Registry Permissions

Problem

Two of the most common causes of this problem are specifying the wrong container image or trying to use private images without providing registry credentials.

Example

Let’s see an example. We’ll create a pod named fail, referencing a non-existent Docker image:

kubectl run -i --tty fail --image=tutum/curl:1.123456

The command doesn’t return and you can terminate the process with Ctrl+C.

Error Analysis

We can then inspect our pods and see that we have one pod with a status of ErrImagePull or ImagePullBackOff.

$ (minikube) kubectl get pods
NAME                      READY     STATUS         RESTARTS   AGE
client-5b65b6c866-cs4ch   1/1       Running        1          1m
fail-6667d7685d-7v6w8     0/1       ErrImagePull   0          <invalid>
vuejs-578574b75f-5x98z    1/1       Running        0          1d
$ (minikube) 

For some additional information, we can describe the failing pod.

kubectl describe pod fail-6667d7685d-7v6w8

As you can see in the events section, your image can’t be pulled:

Name:   fail-6667d7685d-7v6w8
Namespace: default
Node:   minikube/192.168.64.10
Start Time: Wed, 22 Nov 2017 10:01:59 +0100
Labels:   pod-template-hash=2223832418
    run=fail
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"fail-6667d7685d","uid":"cc4ccb3f-cf63-11e7-afca-4a7a1fa05b3f","a...
.
.
.
.
Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath		Type		Reason			Message
  ---------	--------	-----	----			-------------		--------	------			-------
  1m		1m		1	default-scheduler				Normal		Scheduled		Successfully assigned fail-6667d7685d-7v6w8 to minikube
  1m		1m		1	kubelet, minikube				Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-9fr6r" 
  1m		6s		4	kubelet, minikube	spec.containers{fail}	Normal		Pulling			pulling image "tutum/curl:1.123456"
  1m		5s		4	kubelet, minikube	spec.containers{fail}	Warning		Failed			Failed to pull image "tutum/curl:1.123456": rpc error: code = Unknown desc = Error response from daemon: manifest for tutum/curl:1.123456 not found
  1m		<invalid>	10	kubelet, minikube				Warning		FailedSync		Error syncing pod
  1m		<invalid>	6	kubelet, minikube	spec.containers{fail}	Normal		BackOff			Back-off pulling image "tutum/curl:1.123456"

Why couldn’t Kubernetes pull the image? There are three primary candidates besides network connectivity issues:

  • The image tag is incorrect
  • The image doesn’t exist
  • Kubernetes doesn’t have permissions to pull that image

If you don’t notice a typo in your image tag, then it’s time to test using your local machine. I usually start by running docker pull on my local development machine with the exact same image tag. In this case, I would run docker pull tutum/curl:1.123456.

If this succeeds, then it probably means that Kubernetes doesn’t have the correct permissions to pull that image.

Add the docker registry user/pwd to your cluster:

kubectl create secret docker-registry dockersecret --docker-server=https://index.docker.io/v1/ --docker-username=<username> --docker-password=<password> --docker-email=<email>
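
The secret only takes effect for pods that reference it. As a minimal sketch, the snippet below writes a Pod manifest that uses the dockersecret created above through imagePullSecrets; the pod name, container name, and image reference are hypothetical placeholders you would need to replace.

```shell
# Write an illustrative Pod manifest that pulls a private image.
# "private-image-pod", "app", and the image reference are placeholders.
cat > private-image-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
  - name: app
    image: <your-registry>/<your-image>:<tag>
  # Reference the registry secret created with "kubectl create secret docker-registry".
  imagePullSecrets:
  - name: dockersecret
EOF

# Apply it to the cluster (requires cluster access):
# kubectl apply -f private-image-pod.yaml
```

The secret has to live in the same namespace as the pod that references it.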

If the exact image tag fails, then I will test without an explicit image tag:

docker pull tutum/curl

This command will attempt to pull the latest tag. If this succeeds, then that means the originally specified tag doesn’t exist. Go to the Docker registry and check which tags are available for this image.

If docker pull tutum/curl (without an exact tag) fails, then we have a bigger problem - that image does not exist at all in our image registry.

6.6 - Container Image Not Updating

Updating images in your cluster during development

Introduction

A container image should use a fixed tag or the SHA of the image. It should not use the tags latest, head, canary, or other tags that are designed to be floating.

Problem

If you have encountered this issue, you have probably done something along the lines of:

  • Deploy anything using an image tag (e.g., cp-enablement/awesomeapp:1.0)
  • Fix a bug in awesomeapp
  • Build a new image and push it with the same tag (cp-enablement/awesomeapp:1.0)
  • Update the deployment
  • Realize that the bug is still present
  • Repeat steps 3-5 without any improvement

The problem relates to how Kubernetes decides whether to do a docker pull when starting a container. Since we tagged our image as :1.0, the default pull policy is IfNotPresent. The Kubelet already has a local copy of cp-enablement/awesomeapp:1.0, so it doesn’t attempt to do a docker pull. When the new Pods come up, they’re still using the old broken Docker image.

There are a couple of ways to resolve this, with the recommended one being to use unique tags.

Solution

In order to fix the problem, you can use the following bash script that runs anytime the deployment is updated to create a new tag and push it to the registry.

#!/usr/bin/env bash

# Set the docker image name and the corresponding repository
# Ensure that you change them in the deployment.yml as well.
# You must be logged in with docker login.
#
# CHANGE THIS TO YOUR Docker.io SETTINGS
#
PROJECT=awesomeapp
REPOSITORY=cp-enablement

# causes the shell to exit if any subcommand or pipeline returns a non-zero status.
#
set -e

# set debug mode
#
set -x

# build my nodeJS app
#
npm run build

# get the latest version ID from the Docker.io registry and increment it
#
VERSION=$(curl https://registry.hub.docker.com/v1/repositories/$REPOSITORY/$PROJECT/tags  | sed -e 's/[][]//g' -e 's/"//g' -e 's/ //g' | tr '}' '\n'  | awk -F: '{print $3}' | grep v| tail -n 1)
VERSION=${VERSION:1}
# use arithmetic expansion: ((VERSION++)) returns a non-zero status when VERSION is 0,
# which would abort the script under set -e
VERSION=$((VERSION + 1))
VERSION="v$VERSION"


# build the new docker image
#
echo '>>> Building new image'
docker build -t "$REPOSITORY/$PROJECT:$VERSION" .

echo '>>> Push new image'
docker push "$REPOSITORY/$PROJECT:$VERSION"

6.7 - Custom Seccomp Profile

Overview

Seccomp (secure computing mode) is a security facility in the Linux kernel for restricting the set of system calls applications can make.

Starting from Kubernetes v1.3.0, the Seccomp feature is in Alpha. To configure it on a Pod, the following annotations can be used:

  • seccomp.security.alpha.kubernetes.io/pod: <seccomp-profile> where <seccomp-profile> is the seccomp profile to apply to all containers in a Pod.
  • container.seccomp.security.alpha.kubernetes.io/<container-name>: <seccomp-profile> where <seccomp-profile> is the seccomp profile to apply to <container-name> in a Pod.

More details can be found in the PodSecurityPolicy documentation.

Installation of a Custom Profile

By default, kubelet loads custom Seccomp profiles from /var/lib/kubelet/seccomp/. There are two ways in which Seccomp profiles can be added to a Node:

  • to be baked in the machine image
  • to be added at runtime

This guide focuses on creating those profiles via a DaemonSet.

Create a file called seccomp-profile.yaml with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: seccomp-profile
  namespace: kube-system
data:
  my-profile.json: |
    {
      "defaultAction": "SCMP_ACT_ALLOW",
      "syscalls": [
        {
          "name": "chmod",
          "action": "SCMP_ACT_ERRNO"
        }
      ]
    }    

Apply the ConfigMap in your cluster:

$ kubectl apply -f seccomp-profile.yaml
configmap/seccomp-profile created

The next step is to create the DaemonSet Seccomp installer. It’s going to copy the policy from above to /var/lib/kubelet/seccomp/my-profile.json.

Create a file called seccomp-installer.yaml with the following content:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-installer
  namespace: kube-system
  labels:
    security: seccomp
spec:
  selector:
    matchLabels:
      security: seccomp
  template:
    metadata:
      labels:
        security: seccomp
    spec:
      initContainers:
      - name: installer
        image: alpine:3.10.0
        command: ["/bin/sh", "-c", "cp -r -L /seccomp/*.json /host/seccomp/"]
        volumeMounts:
        - name: profiles
          mountPath: /seccomp
        - name: hostseccomp
          mountPath: /host/seccomp
          readOnly: false
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
      terminationGracePeriodSeconds: 5
      volumes:
      - name: hostseccomp
        hostPath:
          path: /var/lib/kubelet/seccomp
      - name: profiles
        configMap:
          name: seccomp-profile

Create the installer and wait until it’s ready on all Nodes:

$ kubectl apply -f seccomp-installer.yaml
daemonset.apps/seccomp-installer created

$ kubectl -n kube-system get pods -l security=seccomp
NAME                      READY   STATUS    RESTARTS   AGE
seccomp-installer-wjbxq   1/1     Running   0          21s

Create a Pod Using a Custom Seccomp Profile

Finally, we want to create a Pod which uses our new Seccomp profile my-profile.json.

Create a file called my-seccomp-pod.yaml with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-app
  namespace: default
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: "localhost/my-profile.json"
    # you can specify seccomp profile per container. If you add another profile you can configure
    # it for a specific container - 'pause' in this case.
    # container.seccomp.security.alpha.kubernetes.io/pause: "localhost/some-other-profile.json"
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1

Create the Pod and see that it’s running:

$ kubectl apply -f my-seccomp-pod.yaml
pod/seccomp-app created

$ kubectl get pod seccomp-app
NAME         READY   STATUS    RESTARTS   AGE
seccomp-app  1/1     Running   0          42s
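
On Kubernetes v1.19 and later, seccomp profiles are configured through the securityContext.seccompProfile field instead of the alpha annotations. The following is a hedged sketch of the same Pod using that field; the file name my-seccomp-pod-v119.yaml is chosen here for illustration only.

```shell
# Write a Pod manifest that selects the custom profile via securityContext
# (assumes Kubernetes v1.19 or later).
cat > my-seccomp-pod-v119.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-app
  namespace: default
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Path relative to the kubelet seccomp directory (/var/lib/kubelet/seccomp by default).
      localhostProfile: my-profile.json
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
EOF

# Apply it to the cluster (requires cluster access):
# kubectl apply -f my-seccomp-pod-v119.yaml
```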

Troubleshooting

If an invalid or a non-existing profile is used, then the Pod will be stuck in ContainerCreating phase:

broken-seccomp-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: broken-seccomp
  namespace: default
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: "localhost/not-existing-profile.json"
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1

$ kubectl apply -f broken-seccomp-pod.yaml
pod/broken-seccomp created

$ kubectl get pod broken-seccomp
NAME            READY   STATUS              RESTARTS   AGE
broken-seccomp  0/1     ContainerCreating   0          2m

$ kubectl describe pod broken-seccomp
Name:               broken-seccomp
Namespace:          default
....
Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               18s               default-scheduler        Successfully assigned default/broken-seccomp to docker-desktop
  Warning  FailedCreatePodSandBox  4s (x2 over 18s)  kubelet, docker-desktop  Failed create pod sandbox: rpc error: code = Unknown desc = failed to make sandbox docker config for pod "broken-seccomp": failed to generate sandbox security options
for sandbox "broken-seccomp": failed to generate seccomp security options for container: cannot load seccomp profile "/var/lib/kubelet/seccomp/not-existing-profile.json": open /var/lib/kubelet/seccomp/not-existing-profile.json: no such file or directory

6.8 - Dockerfile Pitfalls

Common Dockerfile pitfalls

Using the latest Tag for an Image

Many Dockerfiles use the FROM package:latest pattern at the top of their Dockerfiles to pull the latest image from a Docker registry.

Bad Dockerfile

FROM alpine

While simple, using the latest tag for an image means that your build can suddenly break if that image gets updated. This can lead to problems where everything builds fine locally (because your local cache thinks it is the latest), while a build server may fail, because some pipelines make a clean pull on every build. Additionally, troubleshooting can prove to be difficult, since the maintainer of the Dockerfile didn’t actually make any changes.

Good Dockerfile

A digest takes the place of the tag when pulling an image. This will ensure that your Dockerfile remains immutable.

FROM alpine@sha256:7043076348bf5040220df6ad703798fd8593a0918d06d3ce30c6c93be117e430

Running apt/apk/yum update

Running apt-get install is one of those things virtually every Debian-based Dockerfile will have to do in order to satiate some external package requirements your code needs to run. However, using apt-get as an example, this comes with its own problems.

apt-get upgrade

This will update all your packages to their latest versions, which can be bad because it prevents your Dockerfile from creating consistent, immutable builds.

apt-get update (in a different line than the one running your apt-get install command)

Running apt-get update as a single line entry will get cached by the build and won’t actually run every time you need to run apt-get install. Instead, make sure you run apt-get update in the same line with all the packages to ensure that all are updated correctly.
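
The pattern described above can be sketched as follows. The snippet writes an example Dockerfile in which apt-get update and apt-get install share one RUN instruction; the file name Dockerfile.example, the package list, and the <digest> placeholder are assumptions for illustration and need to be replaced with real values.

```shell
# Write an illustrative Dockerfile that runs update and install in the same layer.
# "<digest>" is a placeholder for a real image digest; the packages are examples.
cat > Dockerfile.example <<'EOF'
FROM debian@sha256:<digest>
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && rm -rf /var/lib/apt/lists/*
EOF
```

Because both commands are part of one instruction, changing the package list invalidates the cached layer and forces a fresh apt-get update.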

Avoid Big Container Images

Building a small container image will reduce the time needed to start or restart pods. An image based on the popular Alpine Linux project is much smaller than most distribution based images (~5MB). For most popular languages and products, there is usually an official Alpine Linux image, e.g., golang, nodejs, and postgres.

$  docker images
REPOSITORY                                                      TAG                     IMAGE ID            CREATED             SIZE
postgres                                                        9.6.9-alpine            6583932564f8        13 days ago         39.26 MB
postgres                                                        9.6                     d92dad241eff        13 days ago         235.4 MB
postgres                                                        10.4-alpine             93797b0f31f4        13 days ago         39.56 MB

In addition, for compiled languages such as Go or C++ that do not require build time tooling during runtime, it is recommended to avoid build time tooling in the final images. With Docker’s support for multi-stage builds, this can be easily achieved with minimal effort. Such an example can be found at Multi-stage builds.

Google’s distroless image is also a good base image.

6.9 - Dynamic Volume Provisioning

Running a Postgres database on Kubernetes

Overview

The example shows how to run a Postgres database on Kubernetes and how to dynamically provision and mount the storage volumes needed by the database.

Run Postgres Database

Define the following Kubernetes resources in a yaml file:

  • PersistentVolumeClaim (PVC)
  • Deployment

PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresdb-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 9Gi
  storageClassName: 'default'

This defines a PVC using the storage class default. Storage classes abstract from the underlying storage provider as well as other parameters, like disk-type (e.g., solid-state vs standard disks).

The default storage class has the annotation {"storageclass.kubernetes.io/is-default-class":"true"}.


$ kubectl describe sc default
Name:            default
IsDefaultClass:  Yes
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"labels":{"addonmanager.kubernetes.io/mode":"Exists"},"name":"default","namespace":""},"parameters":{"type":"gp2"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.kubernetes.io/is-default-class=true
Provisioner:           kubernetes.io/aws-ebs
Parameters:            type=gp2
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

A Persistent Volume is automatically created when it is dynamically provisioned. In the following example, the PVC is defined as “postgresdb-pvc”, and a corresponding PV “pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb” is created and associated with the PVC automatically.

$ kubectl create -f .\postgres_deployment.yaml
persistentvolumeclaim "postgresdb-pvc" created

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                    STORAGECLASS   REASON    AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            Delete           Bound     default/postgresdb-pvc   default                  3s

$ kubectl get pvc
NAME             STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgresdb-pvc   Bound     pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            default        8s

Notice that the RECLAIM POLICY is Delete (the default value). It is one of two reclaim policies; the other one is Retain. (A third policy, Recycle, has been deprecated.) With Delete, the PV is deleted automatically when the PVC is removed, and the data on the volume is lost.

On the other hand, a PV with the Retain policy is not deleted when the PVC is removed. It moves to the Released status instead, so that administrators can recover the data later.

You can use the kubectl patch command to change the reclaim policy as described in Change the Reclaim Policy of a PersistentVolume or use kubectl edit pv <pv-name> to edit it online as shown below:

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                    STORAGECLASS   REASON    AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            Delete           Bound     default/postgresdb-pvc   default                  44m

# change the reclaim policy from "Delete" to "Retain"
$ kubectl edit pv pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb
persistentvolume "pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb" edited

# check the reclaim policy afterwards
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                    STORAGECLASS   REASON    AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            Retain           Bound     default/postgresdb-pvc   default                  45m
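
The kubectl patch variant mentioned above can be wrapped in a small helper. This is a sketch: the function name retain_pv is hypothetical, while the patched field spec.persistentVolumeReclaimPolicy is the one shown in the RECLAIM POLICY column.

```shell
# retain_pv PV_NAME: switch the reclaim policy of one PersistentVolume to Retain.
# Only the persistentVolumeReclaimPolicy field is patched; the rest of the PV
# stays untouched.
retain_pv() {
  kubectl patch pv "$1" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Example call (PV name taken from the output above):
#   retain_pv pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb
```

Unlike kubectl edit, the patch is non-interactive and can therefore be used in scripts.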

Deployment

Once a PVC is created, you can use it in your container via volumes.persistentVolumeClaim.claimName. In the example below, the PVC postgresdb-pvc is mounted as readable and writable, and volumeMounts maps two paths in the container to subfolders of the volume.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: default
  labels:
    app: postgres
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      name: postgres
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: "cpettech.docker.repositories.sap.ondemand.com/jtrack_postgres:howto"
          env:
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              value: p5FVqfuJFrM42cVX9muQXxrC3r8S9yn0zqWnFR6xCoPqxqVQ
            - name: POSTGRES_INITDB_XLOGDIR
              value: "/var/log/postgresql/logs"
          ports:
            - containerPort: 5432
          volumeMounts:
            - mountPath: /var/lib/postgresql/data
              name: postgre-db
              subPath: data     # https://github.com/kubernetes/website/pull/2292.  Solve the issue of crashing initdb due to non-empty directory (i.e. lost+found)
            - mountPath: /var/log/postgresql/logs
              name: postgre-db
              subPath: logs
      volumes:
        - name: postgre-db
          persistentVolumeClaim:
            claimName: postgresdb-pvc
            readOnly: false
      imagePullSecrets:
      - name: cpettechregistry

To check the mount points in the container:

$ kubectl get po
NAME                        READY     STATUS    RESTARTS   AGE
postgres-7f485fd768-c5jf9   1/1       Running   0          32m

$ kubectl exec -it postgres-7f485fd768-c5jf9 bash

root@postgres-7f485fd768-c5jf9:/# ls /var/lib/postgresql/data/
base    pg_clog       pg_dynshmem  pg_ident.conf  pg_multixact  pg_replslot  pg_snapshots  pg_stat_tmp  pg_tblspc    PG_VERSION  postgresql.auto.conf  postmaster.opts
global  pg_commit_ts  pg_hba.conf  pg_logical     pg_notify     pg_serial    pg_stat       pg_subtrans  pg_twophase  pg_xlog     postgresql.conf       postmaster.pid

root@postgres-7f485fd768-c5jf9:/# ls /var/log/postgresql/logs/
000000010000000000000001  archive_status

Deleting a PersistentVolumeClaim

In case of a Delete policy, deleting a PVC will also delete its associated PV. If Retain is the reclaim policy, the PV will change status from Bound to Released when the PVC is deleted.

# Check pvc and pv before deletion
$ kubectl get pvc
NAME             STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgresdb-pvc   Bound     pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            default        50m

$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                    STORAGECLASS   REASON    AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            Retain           Bound     default/postgresdb-pvc   default                  50m

# delete pvc
$ kubectl delete pvc postgresdb-pvc
persistentvolumeclaim "postgresdb-pvc" deleted

# pv changed to status "Released"
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                    STORAGECLASS   REASON    AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb   9Gi        RWO            Retain           Released   default/postgresdb-pvc   default                  51m

6.10 - Install Knative in Gardener Clusters

A walkthrough of the steps for installing Knative in Gardener shoot clusters.

Overview

This guide walks you through the installation of the latest version of Knative using pre-built images on a Gardener created cluster environment. To set up your own Gardener, see the documentation or have a look at the landscape-setup-template project. To learn more about this open source project, read the blog on kubernetes.io.

Prerequisites

Knative requires a Kubernetes cluster v1.15 or newer.

Steps

Install and Configure kubectl

  1. If you already have kubectl CLI, run kubectl version --short to check the version. You need v1.10 or newer. If your kubectl is older, follow the next step to install a newer version.

  2. Install the kubectl CLI.

Access Gardener

  1. Create a project in the Gardener dashboard. This will essentially create a Kubernetes namespace with the name garden-<my-project>.

  2. Configure access to your Gardener project using a kubeconfig.

    If you are not the Gardener Administrator already, you can create a technical user in the Gardener dashboard. Go to the “Members” section and add a service account. You can then download the kubeconfig for your project. You can skip this step if you create your cluster using the user interface; it is only needed for programmatic access. If you use programmatic access, make sure you set export KUBECONFIG=garden-my-project.yaml in your shell.

    Download kubeconfig for Gardener

Creating a Kubernetes Cluster

You can create your cluster using the kubectl CLI by providing a cluster specification YAML file. You can find an example for GCP in the gardener/gardener repository. Make sure the namespace matches that of your project. Then apply the prepared so-called “shoot” cluster resource with kubectl:

kubectl apply --filename my-cluster.yaml

The easier alternative is to create the cluster following the cluster creation wizard in the Gardener dashboard: shoot creation

Configure kubectl for Your Cluster

You can now download the kubeconfig for your freshly created cluster in the Gardener dashboard or via the CLI as follows:

kubectl --namespace shoot--my-project--my-cluster get secret kubecfg --output jsonpath={.data.kubeconfig} | base64 --decode > my-cluster.yaml

This kubeconfig file has full administrator access to your cluster. For the rest of this guide, be sure you have export KUBECONFIG=my-cluster.yaml set.

Installing Istio

Knative depends on Istio. If your cloud platform offers a managed Istio installation, we recommend installing Istio that way, unless you need the ability to customize your installation.

Otherwise, see the Installing Istio for Knative guide to install Istio.

You must install Istio on your Kubernetes cluster before continuing with these instructions to install Knative.

Installing cluster-local-gateway for Serving Cluster-Internal Traffic

If you installed Istio, you can install a cluster-local-gateway within your Knative cluster so that you can serve cluster-internal traffic. If you want to configure your revisions to use routes that are visible only within your cluster, install and use the cluster-local-gateway.

Installing Knative

The following commands install all available Knative components as well as the standard set of observability plugins. For details, see Knative’s installation guide, Installing Knative.

  1. If you are upgrading from Knative 0.3.x: Update your domain and static IP address to be associated with the LoadBalancer istio-ingressgateway instead of knative-ingressgateway. Then run the following to clean up leftover resources:

    kubectl delete svc knative-ingressgateway -n istio-system
    kubectl delete deploy knative-ingressgateway -n istio-system
    

    If you have the Knative Eventing Sources component installed, you will also need to delete the following resource before upgrading:

    kubectl delete statefulset/controller-manager -n knative-sources
    

    While the deletion of this resource during the upgrade process will not prevent modifications to Eventing Source resources, those changes will not be completed until the upgrade process finishes.

  2. To install Knative, first install the CRDs by running the kubectl apply command once with the --selector knative.dev/crd-install=true flag (short form: -l). This prevents race conditions during the install, which cause intermittent errors:

    kubectl apply --selector knative.dev/crd-install=true \
    --filename https://github.com/knative/serving/releases/download/v0.12.1/serving.yaml \
    --filename https://github.com/knative/eventing/releases/download/v0.12.1/eventing.yaml \
    --filename https://github.com/knative/serving/releases/download/v0.12.1/monitoring.yaml
    
  3. To complete the installation of Knative and its dependencies, run the kubectl apply command again, this time without the --selector flag:

    kubectl apply --filename https://github.com/knative/serving/releases/download/v0.12.1/serving.yaml \
    --filename https://github.com/knative/eventing/releases/download/v0.12.1/eventing.yaml \
    --filename https://github.com/knative/serving/releases/download/v0.12.1/monitoring.yaml
    
  4. Monitor the Knative components until all of the components show a STATUS of Running:

    kubectl get pods --namespace knative-serving
    kubectl get pods --namespace knative-eventing
    kubectl get pods --namespace knative-monitoring
    

Set Your Custom Domain

  1. Fetch the external IP or CNAME of the knative-ingressgateway:

    kubectl --namespace istio-system get service knative-ingressgateway
    NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                                      AGE
    knative-ingressgateway   LoadBalancer   100.70.219.81   35.233.41.212   80:32380/TCP,443:32390/TCP,32400:32400/TCP   4d

  2. Create a wildcard DNS entry in your custom domain to point to the above IP or CNAME:

    *.knative.<my domain> == A 35.233.41.212
    # or CNAME if you are on AWS
    *.knative.<my domain> == CNAME a317a278525d111e89f272a164fd35fb-1510370581.eu-central-1.elb.amazonaws.com

  3. Adapt your Knative config-domain (set your domain in the data field):

    kubectl --namespace knative-serving get configmaps config-domain --output yaml
    apiVersion: v1
    data:
      knative.<my domain>: ""
    kind: ConfigMap
    metadata:
      name: config-domain
      namespace: knative-serving

What’s Next

Now that your cluster has Knative installed, you can see what Knative has to offer.

Deploy your first app with the Getting Started with Knative App Deployment guide.

Get started with Knative Eventing by walking through one of the Eventing Samples.

Install Cert-Manager if you want to use the automatic TLS cert provisioning feature.

Cleaning Up

Use the Gardener dashboard to delete your cluster, or execute the following with kubectl pointing to your garden-my-project.yaml kubeconfig:

kubectl --kubeconfig garden-my-project.yaml --namespace garden--my-project annotate shoot my-cluster confirmation.gardener.cloud/deletion=true

kubectl --kubeconfig garden-my-project.yaml --namespace garden--my-project delete shoot my-cluster

6.11 - Integrity and Immutability

Ensure that you always get the right image

Introduction

When transferring data among networked systems, trust is a central concern. In particular, when communicating over an untrusted medium such as the internet, it is critical to ensure the integrity and immutability of all the data a system operates on. This applies especially if you use Docker Engine to push and pull images (data) to and from a public registry.

This immutability offers you a guarantee that any and all containers that you instantiate will be absolutely identical at inception. Surprise surprise, deterministic operations.

A Lesson in Deterministic Ops

Docker Tags are about as reliable and disposable as this guy down here.

docker-labels

Seems simple enough. You have probably already deployed hundreds of YAML files or started countless Docker containers.

docker run --name mynginx1 -P -d nginx:1.13.9

or

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rss-site
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: front-end
          image: nginx:1.13.9
          ports:
            - containerPort: 80

But Tags are mutable and humans are prone to error. Not a good combination. Here, we’ll dig into why the use of tags can be dangerous and how to deploy your containers across a pipeline and across environments with determinism in mind.

Let’s say that you want to ensure that whether it’s today or 5 years from now, that specific deployment uses the very same image that you have defined. Any updates or newer versions of an image should be executed as a new deployment. The solution: digest

A digest takes the place of the tag when pulling an image. For example, to pull the above image by digest, run the following command:

docker run --name mynginx1 -P -d nginx@sha256:4771d09578c7c6a65299e110b3ee1c0a2592f5ea2618d23e4ffe7a4cab1ce5de

You can now make sure that the same image is always loaded at every deployment. It doesn’t matter if the TAG of the image has been changed or not. This solves the problem of repeatability.
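
To catch tag-only references before they reach a cluster, a pipeline step could check image references with a small helper. The sketch below is an illustration, not a Docker or Kubernetes feature, and the function name is_pinned_by_digest is hypothetical.

```shell
# Succeeds only if the image reference ends in @sha256:<64 lowercase hex chars>.
is_pinned_by_digest() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Classify the two references used in this section.
for image in \
  "nginx:1.13.9" \
  "nginx@sha256:4771d09578c7c6a65299e110b3ee1c0a2592f5ea2618d23e4ffe7a4cab1ce5de"
do
  if is_pinned_by_digest "$image"; then
    echo "pinned:  $image"
  else
    echo "mutable: $image"
  fi
done
```

The check only validates the format of the reference. It does not verify that the digest exists in a registry.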

Content Trust

However, there’s an additional hidden danger. It is possible for an attacker to replace a server image with another one infected with malware.

docker-content-trust

Docker Content trust gives you the ability to verify both the integrity and the publisher of all the data received from a registry over any channel.

Prior to version 1.8, Docker didn’t have a way to verify the authenticity of a server image. But in v1.8, a new feature called Docker Content Trust was introduced to automatically sign and verify the signature of a publisher.

So, as soon as a server image is downloaded, it is cross-checked with the signature of the publisher to see if someone tampered with it in any way. This solves the problem of trust.
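
On the client side, Docker Content Trust is switched on through the environment variable DOCKER_CONTENT_TRUST. A minimal sketch for one shell session, with the image name in the comment as a placeholder:

```shell
# Enable Docker Content Trust for every docker command started from this shell.
export DOCKER_CONTENT_TRUST=1

# With the variable set, tag-based operations such as
#   docker pull <image>:<tag>
# are refused when no valid signature data exists for that tag.
echo "DOCKER_CONTENT_TRUST=$DOCKER_CONTENT_TRUST"
```

Putting the export into a CI job or shell profile makes the check the default instead of an opt-in.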

In addition, you should scan all images for known vulnerabilities.

6.12 - Kubernetes Antipatterns

Common antipatterns for Kubernetes and Docker

antipattern

This HowTo covers common Kubernetes antipatterns that we have seen over the past months.

Running as Root User

Whenever possible, do not run containers as the root user. One could be tempted to say that Kubernetes pods and nodes are well separated. However, the host and the containers running on it share the same kernel. If a container is compromised, the root user in the container has full control over the underlying node.

Watch the very good presentation by Liz Rice at KubeCon 2018.

Use RUN groupadd -r anygroup && useradd -r -g anygroup myuser to create a group and add a user to it. Use the USER command to switch to this user. Note that you may also consider providing an explicit UID/GID if required.

For example:

ARG GF_UID="500"
ARG GF_GID="500"

# add group & user
RUN groupadd -r -g $GF_GID appgroup && \
   useradd appuser -r -u $GF_UID -g appgroup

USER appuser

Store Data or Logs in Containers

Containers are ideal for stateless applications and should be transient. This means that no data or logs should be stored in the container, as they are lost when the container is terminated. Use persistent volumes instead to persist data outside of containers. Using an ELK stack is another good option for storing and processing logs.

Using Pod IP Addresses

Each pod is assigned an IP address. Pods need to communicate with each other to build an application, e.g., an application must communicate with a database. However, existing pods are terminated and new pods are started constantly. If you relied on the IP address of a pod or container, you would need to update the application configuration constantly. This makes the application fragile.

Create services instead. They provide a logical name that can be assigned independently of the varying number and IP addresses of containers. Services are the basic concept for load balancing within Kubernetes.
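
A minimal Service for a database could be sketched as follows. The name postgres, the label app: postgres, and port 5432 are assumptions for illustration; use the labels your pods actually carry.

```shell
# Write an illustrative Service manifest into a scratch directory.
workdir="$(mktemp -d)"
cat > "$workdir/postgres-service.yaml" <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: postgres          # clients connect to this stable DNS name
spec:
  selector:
    app: postgres         # selects pods by label, not by IP address
  ports:
    - port: 5432
      targetPort: 5432
EOF

cat "$workdir/postgres-service.yaml"
# Apply it with: kubectl apply -f "$workdir/postgres-service.yaml"
```

The application then connects to postgres:5432, no matter how often the pods behind the Service are replaced.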

More Than One Process in a Container

A Dockerfile provides CMD and ENTRYPOINT instructions to start the image. CMD is often used with a script that performs some configuration and then starts the main process. Do not try to start multiple processes with this script. It is important to consider the separation of concerns when creating Docker images. Running multiple processes in a single container makes managing your containers, collecting logs, and updating each process more difficult.

You can split the image into multiple containers and manage them independently - even in one pod. Bear in mind that Kubernetes only monitors the process with PID=1. If more than one process is started within a container, then these no longer fall under the control of Kubernetes.

Creating Images in a Running Container

A new image can be created with the docker commit command. This is useful if changes have been made to the container and you want to persist them for later error analysis. However, images created like this are not reproducible and completely worthless for a CI/CD environment. Furthermore, another developer cannot recognize which components the image contains. Instead, always make changes to the Dockerfile, stop existing containers, and start a new container with the updated image.

Saving Passwords in a docker Image 💀

Do not save passwords in a Dockerfile! They are stored in plain text and get checked into a repository. That makes them completely vulnerable, even if you are using a private repository such as Artifactory.

Use Secrets to provision passwords (ConfigMaps are meant for non-sensitive configuration), or inject them by mounting a persistent volume.
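
A minimal sketch of the Secret-based approach: the helper creates a Secret from a local file, and the manifest fragment shows how a container reads it as an environment variable. The Secret name postgres-credentials, the key password, and the function name create_db_secret are hypothetical.

```shell
# create_db_secret FILE: store the content of FILE under the key "password"
# in a Secret, so that the value never appears in an image or a manifest.
create_db_secret() {
  kubectl create secret generic postgres-credentials --from-file=password="$1"
}

# Container spec fragment that reads the password from the Secret at runtime.
workdir="$(mktemp -d)"
cat > "$workdir/env-from-secret.yaml" <<'EOF'
env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-credentials
        key: password
EOF

cat "$workdir/env-from-secret.yaml"
```

Call create_db_secret once with a file that contains only the password and that is not part of the repository, then reference the Secret from the Deployment as shown in the fragment.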

Using the ’latest’ Tag

Starting an image with tomcat is tempting. If no tags are specified, a container is started with the tomcat:latest image. This image may no longer be up to date and refer to an older version instead. Running a production application requires complete control of the environment with exact versions of the image.

Make sure you always use a tag or even better the sha256 hash of the image, e.g., tomcat@sha256:c34ce3c1fcc0c7431e1392cc3abd0dfe2192ffea1898d5250f199d3ac8d8720f.

Why Use the sha256 Hash?

Tags are not immutable and can be overwritten by a developer at any time. In this case you don’t have complete control over your image - which is bad.

Different Images per Environment

Don’t create different images for development, testing, staging and production environments. The image should be the source of truth and should only be created once and pushed to the repository. The same image:tag should then be used across all environments.

Depend on Start Order of Pods

Applications often depend on containers being started in a certain order. For example, a database container must be up and running before an application can connect to it. The application should be resilient to such changes, as the db pod can be unreachable or restarted at any time. The application container should be able to handle such situations without terminating or crashing.
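
One common way to handle this is to retry the connection with an increasing delay instead of exiting on the first failure. A real application would implement this in its own language; the shell sketch below only illustrates the pattern. The function name retry and the variable RETRY_BASE_DELAY are hypothetical.

```shell
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The wait time doubles after every failed attempt (exponential back-off).
retry() {
  max_attempts="$1"
  shift
  attempt=1
  delay="${RETRY_BASE_DELAY:-1}"   # seconds before the first retry
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example call, assuming pg_isready is installed and the database is
# reachable under the host name postgres:
#   retry 5 pg_isready -h postgres -p 5432
```

The same loop belongs around reconnects during normal operation, not only around the first connection at startup.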

Additional Anti-Patterns and Patterns

In the community, vast experience has been collected to improve the stability and usability of Docker and Kubernetes.

Refer to Kubernetes Production Patterns for more information.

6.13 - Namespace Isolation

Deny all traffic from other namespaces

Overview

You can configure a NetworkPolicy to deny all the traffic from other namespaces while allowing all the traffic coming from the same namespace the pod was deployed into.

howto-namespaceisolation

There are many reasons why you may choose to employ Kubernetes network policies:

  • Isolate multi-tenant deployments
  • Regulatory compliance
  • Ensure containers assigned to different environments (e.g. dev/staging/prod) cannot interfere with each other

Kubernetes network policies are application-centric, whereas standard firewalls are infrastructure- and network-centric. No explicit CIDRs or IP addresses are used for matching source or destination IPs. Network policies build on labels and selectors, which are key concepts of Kubernetes that are used to organize (for example, all DB tier pods of an app) and select subsets of objects.

Example

We create two nginx HTTP servers in two namespaces and block all traffic between the two namespaces. For example, you are unable to get content from the namespace customer1 if you are in the namespace customer2.

Set Up the Namespaces

# create two namespaces for test purpose
kubectl create ns customer1
kubectl create ns customer2

# create a standard HTTP web server
kubectl run nginx --image=nginx --replicas=1 --port=80 -n=customer1
kubectl run nginx --image=nginx --replicas=1 --port=80 -n=customer2

# expose the port 80 for external access
kubectl expose deployment nginx --port=80 --type=NodePort -n=customer1
kubectl expose deployment nginx --port=80 --type=NodePort -n=customer2

Test Without NP

howto-namespaceisolation-without

Create a pod with curl preinstalled inside the namespace customer1:

# create a "bash" pod in one namespace
kubectl run -i --tty client --image=tutum/curl -n=customer1

Try to curl the exposed nginx server to get the default index.html page. Execute this in the bash prompt of the pod created above.

# get the index.html from the nginx of the namespace "customer1" => success
curl http://nginx.customer1
# get the index.html from the nginx of the namespace "customer2" => success
curl http://nginx.customer2

Both calls are made from a pod within the namespace customer1, and both nginx servers are reachable regardless of their namespace.


Test with NP

howto-namespaceisolation-with

Install the NetworkPolicy from your shell:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
spec:
  podSelector:
    matchLabels:
  ingress:
  - from:
    - podSelector: {}

  • It applies the policy to ALL pods in the namespace it is created in, because spec.podSelector.matchLabels is empty and therefore selects all pods.
  • It allows traffic from ALL pods in that same namespace, because spec.ingress.from.podSelector is empty and therefore selects all pods.

Save the policy as network-policy.yaml and apply it to both namespaces:

kubectl apply -f ./network-policy.yaml -n=customer1
kubectl apply -f ./network-policy.yaml -n=customer2

After this, curl http://nginx.customer2 no longer works from a pod inside the namespace customer1, and vice versa.

You can get more information on how to configure the NetworkPolicies at:

6.14 - Orchestration of Container Startup

How to orchestrate a startup sequence of multiple containers

Disclaimer

If an application depends on other services deployed separately, do not rely on a certain start sequence of containers. Instead, ensure that the application can cope with unavailability of the services it depends on.

Introduction

Kubernetes offers a feature called InitContainers to perform some tasks during a pod’s initialization. In this tutorial, we demonstrate how to use InitContainers in order to orchestrate a starting sequence of multiple containers. The tutorial uses the example app url-shortener, which consists of two components:

  • postgresql database
  • webapp which depends on the postgresql database and provides two endpoints: create a short url from a given location and redirect from a given short URL to the corresponding target location

This app represents the minimal example where an application relies on another service or database. In this example, if the application starts before the database is ready, the application will fail as shown below:

$ kubectl logs webapp-958cf5567-h247n
time="2018-06-12T11:02:42Z" level=info msg="Connecting to Postgres database using: host=`postgres:5432` dbname=`url_shortener_db` username=`user`\n"
time="2018-06-12T11:02:42Z" level=fatal msg="failed to start: failed to open connection to database: dial tcp: lookup postgres on 100.64.0.10:53: no such host\n"


$ kubectl get po -w
NAME                                READY     STATUS    RESTARTS   AGE
webapp-958cf5567-h247n   0/1       Pending   0         0s
webapp-958cf5567-h247n   0/1       Pending   0         0s
webapp-958cf5567-h247n   0/1       ContainerCreating   0         0s
webapp-958cf5567-h247n   0/1       ContainerCreating   0         1s
webapp-958cf5567-h247n   0/1       Error     0         2s
webapp-958cf5567-h247n   0/1       Error     1         3s
webapp-958cf5567-h247n   0/1       CrashLoopBackOff   1         4s
webapp-958cf5567-h247n   0/1       Error     2         18s
webapp-958cf5567-h247n   0/1       CrashLoopBackOff   2         29s
webapp-958cf5567-h247n   0/1       Error     3         43s
webapp-958cf5567-h247n   0/1       CrashLoopBackOff   3         56s

If the restartPolicy is set to Always (default) in the YAML file, Kubernetes keeps restarting the failed container with an exponential back-off delay.

Using InitContainers

To avoid such a situation, InitContainers can be defined, which are executed prior to the application container. If one of the InitContainers fails, the application container won’t be started.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      initContainers:  # check if DB is ready, and only continue when true
      - name: check-db-ready
        image: postgres:9.6.5
        command: ['sh', '-c',  'until pg_isready -h postgres -p 5432;  do echo waiting for database; sleep 2; done;']
      containers:
      - image: xcoulon/go-url-shortener:0.1.0
        name: go-url-shortener
        env:
        - name: POSTGRES_HOST
          value: postgres
        - name: POSTGRES_PORT
          value: "5432"
        - name: POSTGRES_DATABASE
          value: url_shortener_db
        - name: POSTGRES_USER
          value: user
        - name: POSTGRES_PASSWORD
          value: mysecretpassword
        ports:
        - containerPort: 8080

In the above example, the InitContainers use the docker image postgres:9.6.5, which is different from the application container.

This also brings the advantage of not having to include unnecessary tools (e.g., pg_isready) in the application container.

With the introduction of InitContainers, in case the database is not available yet, the pod startup will look similar to:

$ kubectl get po -w
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-5cc79d6bfd-t9n8h   1/1       Running   0          5d
privileged-pod                      1/1       Running   0          4d
webapp-fdcb49cbc-4gs4n   0/1       Pending   0         0s
webapp-fdcb49cbc-4gs4n   0/1       Pending   0         0s
webapp-fdcb49cbc-4gs4n   0/1       Init:0/1   0         0s
webapp-fdcb49cbc-4gs4n   0/1       Init:0/1   0         1s


$ kubectl  logs webapp-fdcb49cbc-4gs4n
Error from server (BadRequest): container "go-url-shortener" in pod "webapp-fdcb49cbc-4gs4n" is waiting to start: PodInitializing
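
While the pod is in the Init state, the application container has no logs yet, which explains the error above. The output of the waiting init container can be read by naming it with the -c flag. A minimal sketch, in which the function name init_container_logs is hypothetical and check-db-ready is the init container defined in the Deployment above:

```shell
# init_container_logs POD [CONTAINER]: print the logs of an init container.
# CONTAINER defaults to check-db-ready.
init_container_logs() {
  kubectl logs "$1" -c "${2:-check-db-ready}"
}

# Example call (pod name taken from the output above):
#   init_container_logs webapp-fdcb49cbc-4gs4n
```

As long as the database is unreachable, the init container log should contain the "waiting for database" lines echoed by its command.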

6.15 - Out-Dated HTML and JS Files Delivered

Why is my application always outdated?

Problem

After updating your HTML and JavaScript sources in your web application, the Kubernetes cluster delivers outdated versions - why?

Overview

By default, Kubernetes service pods are not accessible from the external network, but only from other pods within the same Kubernetes cluster.

The Gardener cluster has a built-in configuration for HTTP load balancing called Ingress, defining rules for external connectivity to Kubernetes services. Users who want external access to their Kubernetes services create an ingress resource that defines rules, including the URI path, backing service name, and other information. The Ingress controller can then automatically program a frontend load balancer to enable Ingress configuration.

nginx

Example Ingress Configuration

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: vuejs-ingress
spec:
  rules:
  - host: test.ingress.<GARDENER-CLUSTER>.<GARDENER-PROJECT>.shoot.canary.k8s-hana.ondemand.com
    http:
      paths:
      - backend:
          serviceName: vuejs-svc
          servicePort: 8080

where:

  • <GARDENER-CLUSTER>: The cluster name in the Gardener
  • <GARDENER-PROJECT>: Your project name in the Gardener

Diagnosing the Problem

The ingress controller we are using is NGINX, a software load balancer, web server, and content cache.

NGINX caches the content as specified in the HTTP header. If the HTTP header is missing, the content is assumed to be cacheable forever and, in the worst case, NGINX never updates it.

Solution

In general, you can avoid this pitfall with one of the solutions below:

  • Use a cache buster + HTTP-Cache-Control (preferred)
  • Use HTTP-Cache-Control with a lower retention period
  • Disable the caching in the ingress (just for dev purposes)

Learning how to set the HTTP header or set up a cache buster is left to you as an exercise for your web framework (e.g., Express/NodeJS, SpringBoot, …).
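
As a rough, framework-independent illustration of the cache-buster idea, the Python sketch below derives a file name from the file content, so that a changed file gets a new URL and cannot be served from a stale cache entry. The function name cache_busted_name and the two header values are assumptions chosen for this example, not a prescribed policy.

```python
import hashlib
from pathlib import PurePosixPath


def cache_busted_name(filename, content, digest_length=8):
    """Return `filename` with a short content hash inserted before the suffix."""
    digest = hashlib.sha256(content).hexdigest()[:digest_length]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


# Assumed header policy for this sketch: hashed assets may be cached for a
# long time, while the HTML page that references them is always revalidated.
ASSET_CACHE_CONTROL = "public, max-age=31536000, immutable"
HTML_CACHE_CONTROL = "no-cache"
```

Most build tools for web frontends can generate such hashed file names for you; the sketch only shows what happens under the hood.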

Here is an example of how to disable caching for your ingress during development, done with an annotation in your ingress YAML.

---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/cache-enable: "false"
  name: vuejs-ingress
spec:
  rules:
  - host: test.ingress.<GARDENER-CLUSTER>.<GARDENER-PROJECT>.shoot.canary.k8s-hana.ondemand.com
    http:
      paths:
      - backend:
          serviceName: vuejs-svc
          servicePort: 8080

6.16 - Remove Committed Secrets in GitHub 💀

Never ever commit a kubeconfig.yaml into GitHub

Overview

If you commit sensitive data, such as a kubeconfig.yaml or an SSH key, into a Git repository, you can remove it from the history. To entirely remove unwanted files from a repository’s history, you can use the git filter-branch command.

The git filter-branch command rewrites your repository’s history, which changes the SHAs for existing commits that you alter and any dependent commits. Changed commit SHAs may affect open pull requests in your repository. Merging or closing all open pull requests before removing files from your repository is recommended.

Purging a File from Your Repository’s History

To illustrate how git filter-branch works, we’ll show you how to remove your file with sensitive data from the history of your repository and add it to .gitignore to ensure that it is not accidentally re-committed.

1. Navigate into the repository’s working directory:

cd YOUR-REPOSITORY

2. Run the following command, replacing PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA with the path to the file you want to remove, not just its filename.

These arguments will:

  • Force Git to process, but not check out, the entire history of every branch and tag
  • Remove the specified file, as well as any empty commits generated as a result
  • Overwrite your existing tags

git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA' \
--prune-empty --tag-name-filter cat -- --all

3. Add your file with sensitive data to .gitignore to ensure that you don’t accidentally commit it again:

echo "YOUR-FILE-WITH-SENSITIVE-DATA" >> .gitignore

Double-check that you’ve removed everything you wanted to from your repository’s history, and that all of your branches are checked out. Once you’re happy with the state of your repository, continue to the next step.

4. Force-push your local changes to overwrite your GitHub repository, as well as all the branches you’ve pushed up:

git push origin --force --all

5. In order to remove the sensitive file from your tagged releases, you’ll also need to force-push against your Git tags:

git push origin --force --tags
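
If you want to script the procedure, the Python sketch below builds the filter-branch call from step 2 and offers a check for the double-check before force-pushing. Both function names are hypothetical. The check looks only at branches and tags, because git filter-branch keeps backup references under refs/original that still point to the old history.

```python
import shlex
import subprocess


def filter_branch_argv(path):
    """Build the argument list for the filter-branch call shown in step 2."""
    # The index filter is evaluated by a shell, so the path is quoted.
    index_filter = "git rm --cached --ignore-unmatch " + shlex.quote(path)
    return [
        "git", "filter-branch", "--force",
        "--index-filter", index_filter,
        "--prune-empty", "--tag-name-filter", "cat",
        "--", "--all",
    ]


def path_in_history(path, repo="."):
    """Return True if a commit on any local branch or tag still touches `path`."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--branches", "--tags",
         "--format=%H", "--", path],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())
```

Run subprocess.run(filter_branch_argv(path), check=True) inside the repository, then continue with the force-push only if path_in_history(path) returns False.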

6.17 - Using Prometheus and Grafana to Monitor K8s

How to deploy and configure Prometheus and Grafana to collect and monitor kubelet container metrics

Disclaimer

This post is meant to give a basic end-to-end description for deploying and using Prometheus and Grafana. Both applications offer a wide range of flexibility, which needs to be considered in case you have specific requirements. Such advanced details are not in the scope of this topic.

Introduction

Prometheus is an open-source systems monitoring and alerting toolkit for recording numeric time series. It fits both machine-centric monitoring and monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.

Prometheus is the second hosted project to graduate within CNCF.

The following characteristics make Prometheus a good match for monitoring Kubernetes clusters:

  • Pull-based Monitoring
    Prometheus is a pull-based monitoring system, which means that the Prometheus server dynamically discovers and pulls metrics from your services running in Kubernetes.

  • Labels
    Prometheus and Kubernetes share the same label (key-value) concept that can be used to select objects in the system.
    Labels are used to identify time series, and sets of label matchers can be used in the query language (PromQL) to select the time series to be aggregated.

  • Exporters
    There are many exporters available, which enable the integration of databases or even other monitoring systems that do not already provide a way to export metrics to Prometheus. One prominent exporter is the so-called node-exporter, which allows you to monitor hardware and OS-related metrics of Unix systems.

  • Powerful Query Language
    The Prometheus query language PromQL lets the user select and aggregate time series data in real time. Results can either be shown as a graph, viewed as tabular data in the Prometheus expression browser, or consumed by external systems via the HTTP API.

For query examples, see Prometheus Query Examples.

Grafana is a very popular open-source visualization tool, and not only for Prometheus. It is a metric analytics and visualization suite, popular for visualizing time series data for infrastructure and application analytics, but many use it in other domains including industrial sensors, home automation, weather, and process control. For more information, see the Grafana Documentation.

Grafana accesses data via Data Sources. The continuously growing list of supported backends includes Prometheus.

Dashboards are created by combining panels, e.g., Graph and Dashlist.

In this example, we describe an end-to-end scenario, including the deployment of Prometheus and a basic monitoring configuration like the one provided for Kubernetes clusters created by Gardener.

If elements are missing on the Prometheus web page when you access it via its service URL https://<your K8s FQN>/api/v1/namespaces/<your-prometheus-namespace>/services/prometheus-prometheus-server:80/proxy, this is probably caused by Prometheus issue #1583. To work around this issue, set up a port forward with kubectl port-forward -n <your-prometheus-namespace> <prometheus-pod> 9090:9090 on your client and access the Prometheus UI from there with your locally installed web browser. This issue is not relevant if you use the service type LoadBalancer.

Preparation

The deployment of Prometheus and Grafana is based on Helm charts.
Make sure to implement the Helm settings before deploying the Helm charts.

The Kubernetes clusters provided by Gardener use role-based access control (RBAC). To authorize the Prometheus node-exporter to access hardware and OS-related metrics of your cluster’s worker nodes, specific artifacts need to be deployed.

Bind the Prometheus service account to the garden.sapcloud.io:monitoring:prometheus cluster role by running the command kubectl apply -f crbinding.yaml.

Content of crbinding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: <your-prometheus-name>-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: garden.sapcloud.io:monitoring:prometheus
subjects:
- kind: ServiceAccount
  name: <your-prometheus-name>-server
  namespace: <your-prometheus-namespace>

Deployment of Prometheus and Grafana

Only minor changes are needed to deploy Prometheus and Grafana based on Helm charts.

Copy the following configuration into a file called values.yaml and deploy Prometheus: helm install <your-prometheus-name> --namespace <your-prometheus-namespace> stable/prometheus -f values.yaml

Typically, Prometheus and Grafana are deployed into the same namespace. There is no technical reason behind this, so feel free to choose different namespaces.

Content of values.yaml for Prometheus:

rbac:
  create: false # Already created in Preparation step
nodeExporter:
  enabled: false # The node-exporter is already deployed by default

server:
  global:
    scrape_interval: 30s
    scrape_timeout: 30s

serverFiles:
  prometheus.yml:
    rule_files:
      - /etc/config/rules
      - /etc/config/alerts      
    scrape_configs:
    - job_name: 'kube-kubelet'
      honor_labels: false
      scheme: https

      tls_config:
      # This is needed because the kubelets' certificates are not generated
      # for a specific pod IP
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - target_label: __metrics_path__
        replacement: /metrics
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kube-kubelet-cadvisor'
      honor_labels: false
      scheme: https

      tls_config:
      # This is needed because the kubelets' certificates are not generated
      # for a specific pod IP
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # Relabelling allows you to configure the actual service scrape endpoint using the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
        - role: service
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          target_label: kubernetes_name
    # Example scrape config for pods
    #
    # Relabelling allows you to configure the actual pod scrape endpoint using the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: (.+):(?:\d+);(\d+)
          replacement: ${1}:${2}
          target_label: __address__
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    # to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    # service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: (.+)(?::\d+);(\d+)
          replacement: $1:$2
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
    # Add your additional configuration here...
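
The replace rule that rewrites __address__ in the kubernetes-pods job is the least obvious part of this configuration. The Python sketch below emulates it under two assumptions about Prometheus relabeling: source label values are joined with a semicolon, and the regular expression must match the whole joined string. The function name rewrite_address is hypothetical.

```python
import re

# Regex copied from the 'kubernetes-pods' job above.
ADDRESS_REGEX = re.compile(r"(.+):(?:\d+);(\d+)")


def rewrite_address(address, port_annotation):
    """Emulate the `replace` relabel rule that sets `__address__`."""
    # The two source label values are joined with ';' before matching.
    joined = f"{address};{port_annotation}"
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = ADDRESS_REGEX.fullmatch(joined)
    if match is None:
        return address  # no match: the label is left unchanged
    # Equivalent of the replacement ${1}:${2}.
    return f"{match.group(1)}:{match.group(2)}"
```

For example, a pod discovered at 10.0.0.5:8080 with the annotation prometheus.io/port set to 9102 is scraped at 10.0.0.5:9102, while a pod without the annotation keeps its discovered address.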

Next, deploy Grafana. The deployment in this post relies on the Helm default values, so the settings below are set explicitly in case the defaults change.

Deploy Grafana via helm install grafana --namespace <your-prometheus-namespace> stable/grafana -f values.yaml. Here, the same namespace is chosen for Prometheus and for Grafana.

Content of values.yaml for Grafana:

server:
  ingress:
    enabled: false
  service:
    type: ClusterIP

Check the running state of the pods on the Kubernetes Dashboard or by running kubectl get pods -n <your-prometheus-namespace>. In case of errors, check the log files of the pod(s) in question.

The text output of Helm after the deployment of Prometheus and Grafana contains very useful information, e.g., the user and password of the Grafana Admin user. The credentials are stored as secrets in the namespace <your-prometheus-namespace> and can be decoded via kubectl get secret --namespace <your-prometheus-namespace> grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo.
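
If you want to see all entries of the secret at once, you can also decode its JSON representation. The Python sketch below assumes the output of kubectl get secret grafana -o json as input and that all values are UTF-8 text; the function name decode_secret_data is hypothetical.

```python
import base64
import json


def decode_secret_data(secret_json):
    """Decode every entry in the `data` map of a Secret given as JSON text."""
    secret = json.loads(secret_json)
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }
```

Treat the decoded output like any other credential and avoid writing it to log files or shared terminals.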

Basic Functional Tests

To access the web UIs of both applications, use port forwarding: port 9090 for Prometheus and port 3000 for Grafana.

Set up port forwarding for port 9090 to the Prometheus server pod:

kubectl port-forward -n <your-prometheus-namespace> <your-prometheus-server-pod> 9090:9090

Open http://localhost:9090 in your web browser. Select Graph from the top tab and enter the following expression to show the overall CPU usage for a server (see Prometheus Query Examples):

100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))

This should show some data in a graph.
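
The same expression can also be evaluated outside the browser through the Prometheus HTTP API. The Python sketch below assumes that the port forward from above is still active on localhost:9090; the helper names are hypothetical, and the query endpoint used is /api/v1/query.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, expression):
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expression})}"


def vector_samples(payload):
    """Map each `instance` label to its value in an instant-vector result."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def cpu_usage(base_url="http://localhost:9090"):
    """Run the CPU usage expression from above and return usage per instance."""
    expression = "100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))"
    with urlopen(instant_query_url(base_url, expression)) as response:
        return vector_samples(json.load(response))
```

Calling cpu_usage() returns a dict that maps each node's instance label to its CPU usage in percent, or an empty dict if the expression matches no time series.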

To show the same data in Grafana, set up port forwarding for port 3000 to the Grafana pod and open the Grafana web UI at http://localhost:3000 in a browser. Enter the credentials of the admin user.

Next, you need to enter the server name of your Prometheus deployment. This name is shown directly after the installation with Helm.

Run

helm status <your-prometheus-name>

to find this name. Below, this server name is referenced by <your-prometheus-server-name>.

First, you need to add your Prometheus server as data source:

  1. Navigate to Dashboards → Data Sources
  2. Choose Add data source
  3. Enter:
    Name: <your-prometheus-datasource-name>
    Type: Prometheus
    URL: http://<your-prometheus-server-name>
    Access: proxy
  4. Choose Save & Test

In case of failure, check the Prometheus URL in the Kubernetes Dashboard.

To add a Graph, follow these steps:

  1. In the left corner, select Dashboards → New to create a new dashboard
  2. Select Graph to create a new graph
  3. Next, select the Panel Title → Edit
  4. Select your Prometheus Data Source in the drop down list
  5. Enter the expression 100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m]))) in the entry field A
  6. Select the floppy disk symbol (Save) on top

Now you should have a very basic Prometheus and Grafana setup for your Kubernetes cluster.

As a next step, you can implement monitoring for your own applications by instrumenting them with a Prometheus client library.
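
To give an impression of what an instrumented application has to deliver, the Python sketch below renders a counter in the Prometheus text exposition format by hand. It is an illustration only: the function name render_counter is hypothetical, label values are not escaped, and a real application would normally use one of the official client libraries instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` is a list of (labels, value) pairs, where labels is a dict.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{val}"' for key, val in sorted(labels.items())
        )
        # Series without labels are written without braces.
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"
```

If your application serves such output under /metrics and its pod carries the annotation prometheus.io/scrape with the value true, the kubernetes-pods job from the values.yaml above picks it up.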