Guides
1 - Set Up Client Tools
1.1 - Fun with kubectl Aliases
Speed up Your Terminal Workflow
Use the Kubernetes command-line tool, kubectl, to deploy and manage applications on Kubernetes. Using kubectl, you can inspect cluster resources, as well as create, delete, and update components.
On some days you will probably run more than a hundred kubectl commands, so it pays to speed up your terminal workflow with some shortcuts. Of course, there are good shortcuts and bad shortcuts (lazy coding, lack of security review, etc.), but let’s stick with the positives and talk about a good shortcut: bash aliases in your .profile.
What are those mysterious .profile and .bash_profile files you’ve heard about?
Note
The contents of a .profile file are executed on every log-in of the owner of the file. What’s the .bash_profile then? It’s exactly the same, but under a different name. The shell you are logging into (bash, in this case on OS X) looks for /etc/profile and loads it if it exists. Then it looks for ~/.bash_profile, ~/.bash_login, and finally ~/.profile, and loads the first one of these it finds.
Populating the .profile File
Here is the fantastic time saver that needs to be in your shell profile:
# time saver number one: shortcut for kubectl
#
alias k="kubectl"
# start a shell in a temporary pod AND delete the pod after leaving
#
alias ksh="kubectl run busybox -i --tty --image=busybox --restart=Never --rm -- sh"
# same, but opens an ash (the busybox image does not ship a bash)
#
alias kbash="kubectl run busybox -i --tty --image=busybox --restart=Never --rm -- ash"
# activates/exports the kubeconfig.yaml in the current working directory
# (single quotes, so that the directory is resolved when the alias is used, not when it is defined)
alias kexport='export KUBECONFIG="$(pwd)/kubeconfig.yaml"'
# usage: kurl http://your-svc.namespace.cluster.local
#
# for this we should use our very own image... never trust an unknown image
alias kurl="docker run --rm byrnedo/alpine-curl"
All the kubectl tab completions still work fine with these aliases, so you’re not losing that speed.
Note
If the approach above does not work for you, add the following lines to your ~/.bashrc instead:
# time saver number one: shortcut for kubectl
#
alias k="kubectl"
# Enable kubectl completion
source <(k completion bash | sed s/kubectl/k/g)
1.2 - Kubeconfig Context as bash Prompt
Overview
Use the Kubernetes command-line tool, kubectl, to deploy and manage applications on Kubernetes. Using kubectl, you can inspect cluster resources, as well as create, delete, and update components.
By default, the kubectl configuration is located at ~/.kube/config.
Let us suppose that you have two clusters, one for development work and one for scratch work.
How can you switch between them easily without always copying the configuration in use to the right place?
Export the KUBECONFIG Environment Variable
bash$ export KUBECONFIG=<PATH-TO-MY-CONFIG>/kubeconfig-dev.yaml
How to determine which cluster is used by the kubectl command?
Determine Active Cluster
bash$ kubectl cluster-info
Kubernetes master is running at https://api.dev.garden.shoot.canary.k8s-hana.ondemand.com
KubeDNS is running at https://api.dev.garden.shoot.canary.k8s-hana.ondemand.com/api/v1/proxy/namespaces/kube-system/services/kube-dns
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
bash$
Display Cluster in the bash - Linux and Alike
I found this tip on Stack Overflow and think it is worth adding here.
Edit your ~/.bash_profile and add the following code snippet to show the current K8s context in the shell’s prompt:
prompt_k8s(){
  k8s_current_context=$(kubectl config current-context 2> /dev/null)
  if [[ $? -eq 0 ]] ; then echo -e "(${k8s_current_context}) "; fi
}

PS1+='$(prompt_k8s)'
After this, your bash command prompt contains the active KUBECONFIG context, so you always know which cluster is active - development or production.
For example:
bash$ export KUBECONFIG=/Users/d023280/Documents/workspace/gardener-ui/kubeconfig_gardendev.yaml
bash (garden_dev)$
Note the (garden_dev) prefix in the bash command prompt.
This helps immensely to avoid thoughtless mistakes.
Display Cluster in the PowerShell - Windows
Display the current K8s cluster in the title of the PowerShell window.
Create a profile file for your shell under %UserProfile%\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1.
Copy the following code to Microsoft.PowerShell_profile.ps1:
function prompt_k8s {
    $k8s_current_context = (kubectl config current-context) | Out-String
    # $LASTEXITCODE holds the exit code of kubectl; $? would only reflect Out-String
    if ($LASTEXITCODE -eq 0) {
        return $k8s_current_context
    } else {
        return "No K8S context found"
    }
}

$host.ui.rawui.WindowTitle = prompt_k8s
If you want to switch to a different cluster, set KUBECONFIG to the new value and re-run the file Microsoft.PowerShell_profile.ps1.
1.3 - Organizing Access Using kubeconfig Files
Overview
The kubectl command-line tool uses kubeconfig files to find the information it needs to choose a cluster and communicate with the API server of a cluster.
Problem
If you’ve become aware of a security breach that affects you, you may want to revoke or cycle credentials in case anything was leaked. However, this is not possible with the initial or master kubeconfig from your cluster.
Pitfall
For a productive cluster, never distribute the kubeconfig that you can download directly from the Gardener dashboard.
Create a Custom kubeconfig File for Each User
Create a separate kubeconfig for each user. One of the big advantages of this approach is that you can revoke each kubeconfig individually and control the permissions better. You can also limit access to a single namespace.
The following script creates a new ServiceAccount with read privileges in the whole cluster (Secrets are excluded).
To run the script, Deno, a secure TypeScript runtime, must be installed.
#!/usr/bin/env -S deno run --allow-run
/*
 * This script creates a Kubernetes ServiceAccount and the other required resources and prints a KUBECONFIG to the console.
 * Depending on your requirements, you might want to change the clusterRoleBindingTemplate() function.
 *
 * To execute this script, you need to install Deno (https://deno.land/), a TypeScript & JavaScript runtime.
 * It is a single executable binary for the major OSs from the original author of Node.js.
 * example: deno run --allow-run kubeconfig-for-custom-user.ts d00001
 * example: deno run --allow-run kubeconfig-for-custom-user.ts d00001 --delete
 *
 * known issue: the shebang works under Linux, but not under the Windows Subsystem for Linux
 */
const KUBECTL = "/usr/local/bin/kubectl" //or
// const KUBECTL = "C:\\Program Files\\Docker\\Docker\\resources\\bin\\kubectl.exe"
const serviceAccName = Deno.args[0]
const deleteIt = Deno.args[1]

if (serviceAccName == undefined || serviceAccName == "--delete" ) {
    console.log("please provide username as an argument, for example: deno run --allow-run kubeconfig-for-custom-user.ts USER_NAME [--delete]")
    Deno.exit(1)
}

if (deleteIt == "--delete") {
    // await each call, so that the script does not exit before kubectl has finished
    await exec([KUBECTL, "delete", "serviceaccount", serviceAccName])
    await exec([KUBECTL, "delete", "secret", `${serviceAccName}-secret`])
    await exec([KUBECTL, "delete", "clusterrolebinding", `view-${serviceAccName}-global`])
    Deno.exit(0)
}

await exec([KUBECTL, "create", "serviceaccount", serviceAccName, "-o", "json"])
await exec([KUBECTL, "create", "-o", "json", "-f", "-"], secretYamlTemplate())
const secret = await exec([KUBECTL, "get", "secret", `${serviceAccName}-secret`, "-o", "json"])
const caCRT = secret.data["ca.crt"]
const userToken = atob(secret.data["token"]) // decode base64
const kubeConfig = await exec([KUBECTL, "config", "view", "--minify", "-o", "json"])
const clusterApi = kubeConfig.clusters[0].cluster.server
const clusterName = kubeConfig.clusters[0].name
await exec([KUBECTL, "create", "-o", "json", "-f", "-"], clusterRoleBindingTemplate())
console.log(kubeConfigTemplate(caCRT, userToken, clusterApi, clusterName, serviceAccName + "-" + clusterName))

// Runs the given command, optionally writes stdInput to its stdin, and returns the parsed JSON output
// (or the raw text if the output is not JSON, as for "kubectl delete").
// On failure, the error output is printed and "" is returned.
// Note: Deno.run is deprecated in newer Deno releases in favor of Deno.Command.
// deno-lint-ignore no-explicit-any
async function exec(args: string[], stdInput?: string): Promise<any> {
    console.log("# " + args.join(" "))
    const p = Deno.run({
        cmd: args,
        stdout: "piped",
        stderr: "piped",
        stdin: "piped",
    })
    if (stdInput != undefined) {
        await p.stdin.write(new TextEncoder().encode(stdInput))
    }
    p.stdin.close()
    // read both streams while waiting for the exit status, so that a full pipe cannot block kubectl
    const [status, output, stderrOutput] = await Promise.all([p.status(), p.output(), p.stderrOutput()])
    if (status.code === 0) {
        const text = new TextDecoder().decode(output)
        try {
            return JSON.parse(text)
        } catch {
            return text
        }
    } else {
        console.error(new TextDecoder().decode(stderrOutput))
        return ""
    }
}
function clusterRoleBindingTemplate() {
    return `
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: view-${serviceAccName}-global
subjects:
- kind: ServiceAccount
  name: ${serviceAccName}
  namespace: default
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
`
}

function secretYamlTemplate() {
    return `
apiVersion: v1
kind: Secret
metadata:
  name: ${serviceAccName}-secret
  annotations:
    kubernetes.io/service-account.name: ${serviceAccName}
type: kubernetes.io/service-account-token`
}

function kubeConfigTemplate(certificateAuthority: string, token: string, clusterApi: string, clusterName: string, username: string) {
    return `
## KUBECONFIG generated on ${new Date()}
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ${certificateAuthority}
    server: ${clusterApi}
  name: ${clusterName}
contexts:
- context:
    cluster: ${clusterName}
    user: ${username}
  name: ${clusterName}
current-context: ${clusterName}
kind: Config
preferences: {}
users:
- name: ${username}
  user:
    token: ${token}
`
}
If edit or admin rights are to be assigned, the ClusterRoleBinding must be adapted in the roleRef section with the roles listed below.
Furthermore, you can restrict this to a single namespace by not creating a ClusterRoleBinding but only a RoleBinding within the desired namespace.
Default ClusterRole | Default ClusterRoleBinding | Description |
---|---|---|
cluster-admin | system:masters group | Allows super-user access to perform any action on any resource. When used in a ClusterRoleBinding, it gives full control over every resource in the cluster and in all namespaces. When used in a RoleBinding, it gives full control over every resource in the rolebinding’s namespace, including the namespace itself. |
admin | None | Allows admin access, intended to be granted within a namespace using a RoleBinding. If used in a RoleBinding, allows read/write access to most resources in a namespace, including the ability to create roles and rolebindings within the namespace. It does not allow write access to resource quota or to the namespace itself. |
edit | None | Allows read/write access to most objects in a namespace. It does not allow viewing or modifying roles or rolebindings. |
view | None | Allows read-only access to see most objects in a namespace. It does not allow viewing roles or rolebindings. It does not allow viewing secrets, since those are escalating. |
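To restrict access to a single namespace as described above, the clusterRoleBindingTemplate() function of the script could be swapped for something along the lines of the following sketch. The function name roleBindingTemplate, its parameters, and the binding name are hypothetical and not part of the original script. The manifest uses the standard rbac.authorization.k8s.io/v1 RoleBinding kind and references one of the default ClusterRoles from the table. Like the original template, it assumes that the ServiceAccount lives in the default namespace.

```typescript
// Hypothetical helper, not part of the original script: renders a RoleBinding
// that grants a default ClusterRole (e.g. view, edit, admin) to a ServiceAccount,
// but only within a single target namespace.
function roleBindingTemplate(
  serviceAccName: string,
  targetNamespace: string,
  clusterRole: string = "view",
): string {
  return `
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${clusterRole}-${serviceAccName}
  namespace: ${targetNamespace}
subjects:
- kind: ServiceAccount
  name: ${serviceAccName}
  namespace: default
roleRef:
  kind: ClusterRole
  name: ${clusterRole}
  apiGroup: rbac.authorization.k8s.io
`;
}

// Example: edit rights for user d00001, limited to the (made-up) namespace "team-a"
console.log(roleBindingTemplate("d00001", "team-a", "edit"));
```

The rendered manifest can be piped to kubectl create -f - in the same way the script does for the ClusterRoleBinding.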
2 - High Availability
2.1 - Best Practices
Implementing High Availability and Tolerating Zone Outages
Developing highly available workload that can tolerate a zone outage is no trivial task. Here you will find various recommendations to get closer to that goal. While many recommendations are general enough, the examples are specific to how to achieve this in a Gardener-managed cluster and where/how to tweak the different control plane components. If you do not use Gardener, it may still be a worthwhile read.
First however, what is a zone outage? It sounds like a clear-cut “thing”, but it isn’t. There are many things that can go haywire. Here are some examples:
- Elevated cloud provider API error rates for individual or multiple services
- Network bandwidth reduced or latency increased, usually also affecting storage subsystems as they are network attached
- No networking at all, no DNS, machines shutting down or restarting, …
- Functional issues, of either the entire service (e.g. all block device operations) or only parts of it (e.g. LB listener registration)
- All services down, temporarily or permanently (the proverbial burning down data center 🔥)
This and everything in between make it hard to prepare for such events, but you can still do a lot. The most important recommendation is to not target specific issues exclusively - tomorrow another service will fail in an unanticipated way. Also, focus more on meaningful availability than on internal signals (useful, but not as relevant as the former). Always prefer automation over manual intervention (e.g. leader election is a pretty robust mechanism, auto-scaling may be required as well, etc.).
Also remember that HA is costly - you need to balance it against the cost of an outage as silly as this may sound, e.g. running all this excess capacity “just in case” vs. “going down” vs. a risk-based approach in between where you have means that will kick in, but they are not guaranteed to work (e.g. if the cloud provider is out of resource capacity). Maybe some of your components must run at the highest possible availability level, but others not - that’s a decision only you can make.
Control Plane
The Kubernetes cluster control plane is managed by Gardener (as pods in separate infrastructure clusters to which you have no direct access) and can be set up with no failure tolerance (control plane pods will be recreated best-effort when resources are available) or one of the failure tolerance types node or zone.
Strictly speaking, static workload does not depend on the (high) availability of the control plane. However, static workload doesn’t rhyme with Cloud and Kubernetes, and it also means that when you possibly need it the most, e.g. during a zone outage, critical self-healing or auto-scaling functionality won’t be available to you and your workload if your control plane is down as well. That’s why, even though the resource consumption is significantly higher, we generally recommend using the failure tolerance type zone for the control planes of productive clusters, at least in all regions that have 3+ zones. Regions that have only 1 or 2 zones don’t support the failure tolerance type zone. There, your second-best option is the failure tolerance type node, which means a zone outage can still take down your control plane, but individual node outages won’t.
In the shoot resource, this is all you need to add:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone # valid values are `node` and `zone` (only available if your control plane resides in a region with 3+ zones)
This setting will scale out all control plane components for a Gardener cluster as necessary, so that no single zone outage can take down the control plane for longer than just a few seconds for the fail-over to take place (e.g. lease expiration and new leader election or readiness probe failure and endpoint removal). Components run highly available in either active-active (servers) or active-passive (controllers) mode at all times. The persistence (ETCD), which is consensus-based, will tolerate the loss of one zone and still maintain quorum and therefore remain operational. These are all patterns that we will revisit below for your own workload.
Worker Pools
Now that you have configured your Kubernetes cluster control plane in HA, i.e. spread it across multiple zones, you need to do the same for your own workload, but in order to do so, you need to spread your nodes across multiple zones first.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  provider:
    workers:
    - name: ...
      minimum: 6
      maximum: 60
      zones:
      - ...
Prefer regions with at least 2, better 3+ zones and list the zones in the zones section for each of your worker pools. Whether you need 2 or 3 zones at a minimum depends on your fail-over concept:
- Consensus-based software components (like ETCD) depend on maintaining a quorum of (n/2)+1, so you need at least 3 zones to tolerate the outage of 1 zone.
- Primary/Secondary-based software components need just 2 zones to tolerate the outage of 1 zone.
- Then there are software components that can scale out horizontally. They are probably fine with 2 zones, but you also need to think about the load-shift and that the remaining zone must then pick up the work of the unhealthy zone. With 2 zones, the remaining zone must cope with an increase of 100% load. With 3 zones, the remaining zones must only cope with an increase of 50% load (per zone).
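The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration that assumes one member per zone and an evenly distributed load; the function names are made up for this example.

```typescript
// Quorum size for a consensus-based component with `members` members: floor(n/2) + 1.
function quorum(members: number): number {
  return Math.floor(members / 2) + 1;
}

// Number of members (here: zones, assuming one member per zone) that may fail
// while the remaining members still form a quorum.
function toleratedFailures(members: number): number {
  return members - quorum(members);
}

// Relative load increase for each remaining zone if one of `zones` evenly loaded
// zones fails and its load is spread evenly over the others.
function loadIncreasePerRemainingZone(zones: number): number {
  if (zones < 2) {
    throw new Error("need at least 2 zones to fail over");
  }
  return 1 / (zones - 1);
}

for (const zones of [2, 3]) {
  console.log(
    `${zones} zones: quorum=${quorum(zones)}, ` +
      `tolerated zone outages=${toleratedFailures(zones)}, ` +
      `load increase per remaining zone=${loadIncreasePerRemainingZone(zones) * 100}%`,
  );
}
// prints:
// 2 zones: quorum=2, tolerated zone outages=0, load increase per remaining zone=100%
// 3 zones: quorum=2, tolerated zone outages=1, load increase per remaining zone=50%
```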
In general, the question is also whether you have the fail-over capacity already up and running or not. If not, i.e. you depend on re-scheduling to a healthy zone or auto-scaling, be aware that during a zone outage, you will see a resource crunch in the healthy zones. If you have no automation, i.e. only human operators (a.k.a. “red button approach”), you probably will not get the machines you need and even with automation, it may be tricky. But holding the capacity available at all times is costly. In the end, that’s a decision only you can make. Once you have made that decision, please adapt the minimum, maximum, maxSurge and maxUnavailable settings for your worker pools accordingly (visit this section for more information).
Also, consider fall-back worker pools (with different/alternative machine types) and cluster autoscaler expanders using a priority-based strategy.
Gardener-managed clusters deploy the cluster autoscaler or CA for short and you can tweak the general CA knobs for Gardener-managed clusters like this:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    clusterAutoscaler:
      expander: "least-waste"
      scanInterval: 10s
      scaleDownDelayAfterAdd: 60m
      scaleDownDelayAfterDelete: 0s
      scaleDownDelayAfterFailure: 3m
      scaleDownUnneededTime: 30m
      scaleDownUtilizationThreshold: 0.5
If you want to be ready for a sudden spike or have some buffer in general, over-provision nodes by means of “placeholder” pods with low priority and appropriate resource requests. This way, they will demand nodes to be provisioned for them, but if any pod comes up with a regular/higher priority, the low priority pods will be evicted to make space for the more important ones. Strictly speaking, this is not related to HA, but it may be important to keep this in mind as you generally want critical components to be rescheduled as fast as possible and if there is no node available, it may take 3 minutes or longer to do so (depending on the cloud provider). Besides, not only zones can fail, but also individual nodes.
Replicas (Horizontal Scaling)
Now let’s talk about your workload. In most cases, this will mean to run multiple replicas. If you cannot do that (a.k.a. you have a singleton), that’s a bad situation to be in. Maybe you can run a spare (secondary) as backup? If you cannot, you depend on quick detection and rescheduling of your singleton (more on that below).
Obviously, things get messier with persistence. If you have persistence, you should ideally replicate your data, i.e. let your spare (secondary) “follow” your main (primary). If your software doesn’t support that, you have to deploy other means, e.g. volume snapshotting or side-backups (specific to the software you deploy; keep the backups regional, so that you can switch to another zone at all times). If you have to do those, your HA scenario becomes more a DR scenario and terms like RPO and RTO become relevant to you:
- Recovery Point Objective (RPO): Potential data loss, i.e. how much data will you lose at most (time between backups)
- Recovery Time Objective (RTO): Time until recovery, i.e. how long does it take you to be operational again (time to restore)
Also, keep in mind that your persistent volumes are usually zonal, i.e. once you have a volume in one zone, it’s bound to that zone and you cannot bring up your pod in another zone without first recreating the volume yourself (Kubernetes won’t help you here directly).
Anyway, best avoid that, if you can (from technical and cost perspective). The best solution (and also the most costly one) is to run multiple replicas in multiple zones and keep your data replicated at all times, so that your RPO is always 0 (best). That’s what we do for Gardener-managed cluster HA control planes (ETCD) as any data loss may be disastrous and lead to orphaned resources (in addition, we deploy side cars that do side-backups for disaster recovery, with full and incremental snapshots with an RPO of 5m).
So, how to run with multiple replicas? That’s the easiest part in Kubernetes and the two most important resources, Deployment and StatefulSet, support that out of the box:
apiVersion: apps/v1
kind: Deployment | StatefulSet
spec:
  replicas: ...
The problem comes with the number of replicas. It’s easy only if the number is static, e.g. 2 for active-active/passive or 3 for consensus-based software components, but what about software components that can scale out horizontally? Here you usually do not set the number of replicas statically, but make use of the horizontal pod autoscaler or HPA for short (built-in; part of the kube-controller-manager). There are also other options like the cluster proportional autoscaler, but while the former works based on metrics, the latter is more of a guesstimate approach that derives the number of replicas from the number of nodes/cores in a cluster. Sometimes useful, but often blind to the actual demand.
So, HPA it is then for most of the cases. However, what is the resource (e.g. CPU or memory) that drives the number of desired replicas? Again, this is up to you, but CPU or memory are not always the best choices. In some cases, custom metrics may be more appropriate, e.g. requests per second (that was also the case for us).
You will have to create specific HorizontalPodAutoscaler resources for your scale target and can tweak the general HPA knobs for Gardener-managed clusters like this:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubeControllerManager:
      horizontalPodAutoscaler:
        syncPeriod: 15s
        tolerance: 0.1
        downscaleStabilization: 5m0s
        initialReadinessDelay: 30s
        cpuInitializationPeriod: 5m0s
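For orientation, the core of the HPA calculation, as described in the Kubernetes documentation, can be sketched as follows. This is a simplified model and not the controller's implementation: it ignores the minimum/maximum replica bounds, pod readiness, and the downscale stabilization window, and the function and parameter names are made up for illustration. The default of 0.1 corresponds to the tolerance knob shown above.

```typescript
// Simplified sketch of the HPA formula:
//   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
// No scaling happens while the ratio stays within the tolerance around 1.0.
function desiredReplicas(
  currentReplicas: number,
  currentMetric: number,
  targetMetric: number,
  tolerance: number = 0.1,
): number {
  const ratio = currentMetric / targetMetric;
  if (Math.abs(ratio - 1) <= tolerance) {
    return currentReplicas; // close enough to the target, keep the replica count
  }
  return Math.ceil(currentReplicas * ratio);
}

// Hypothetical numbers: 4 replicas, target of 100 requests per second per pod
console.log(desiredReplicas(4, 200, 100)); // load doubled      -> 8
console.log(desiredReplicas(4, 105, 100)); // within tolerance  -> 4
console.log(desiredReplicas(4, 50, 100));  // load halved       -> 2
```

With this model it is easy to see why a zone outage matters for HPA: the per-pod metric of the surviving pods rises, which increases the desired replica count, and the additional pods in turn need nodes in the healthy zones.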
Resources (Vertical Scaling)
While it is important to set a sufficient number of replicas, it is also important to give the pods sufficient resources (CPU and memory). This is especially true when you think about HA. When a zone goes down, you might need to get up replacement pods, if you don’t have them running already to take over the load from the impacted zone. Likewise, e.g. with active-active software components, you can expect the remaining pods to receive more load. If you cannot scale them out horizontally to serve the load, you will probably need to scale them out (or rather up) vertically. This is done by the vertical pod autoscaler or VPA for short (not built-in; part of the kubernetes/autoscaler repository).
A few caveats though:
- You cannot use HPA and VPA on the same metrics as they would influence each other, which would lead to pod thrashing (more replicas require fewer resources; fewer resources require more replicas)
- Scaling horizontally doesn’t cause downtimes (at least not when out-scaling and only one replica is affected when in-scaling), but scaling vertically does (if the pod runs OOM anyway, but also when new recommendations are applied, resource requests for existing pods may be changed, which causes the pods to be rescheduled). Although the discussion has been going on for a very long time, changing resource requests in-place is still not supported (see KEP 1287, implementation in Kubernetes, implementation in VPA).
VPA is a useful tool and Gardener-managed clusters deploy a VPA by default for you (HPA is supported anyway as it’s built into the kube-controller-manager). You will have to create specific VerticalPodAutoscaler resources for your scale target and can tweak the general VPA knobs for Gardener-managed clusters like this:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    verticalPodAutoscaler:
      enabled: true
      evictAfterOOMThreshold: 10m0s
      evictionRateBurst: 1
      evictionRateLimit: -1
      evictionTolerance: 0.5
      recommendationMarginFraction: 0.15
      updaterInterval: 1m0s
      recommenderInterval: 1m0s
While horizontal pod autoscaling is relatively straight-forward, it takes a long time to master vertical pod autoscaling. We saw performance issues, hard-coded behavior (on OOM, memory is bumped by +20% and it may take a few iterations to reach a good level), unintended pod disruptions by applying new resource requests (after 12h all targeted pods will receive new requests even though individually they would be fine without, which also drives active-passive resource consumption up), difficulties to deal with spiky workload in general (due to the algorithmic approach it takes), recommended requests may exceed node capacity, limit scaling is proportional and therefore often questionable, and more. VPA is a double-edged sword: useful and necessary, but not easy to handle.
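To get a feeling for why “a few iterations” can hurt, here is a rough back-of-the-envelope sketch. It models only the +20% bump per OOM mentioned above and deliberately ignores everything else VPA does (histograms, margins, minimum bumps). The function name and the numbers are made up for illustration.

```typescript
// Counts how many OOM-triggered bumps it takes until the memory request
// covers the actual need, if every OOM raises the request by bumpFactor.
function oomBumpsNeeded(
  currentRequestMi: number,
  neededMi: number,
  bumpFactor: number = 1.2,
): number {
  if (currentRequestMi <= 0 || bumpFactor <= 1) {
    throw new Error("request must be positive and bumpFactor greater than 1");
  }
  let request = currentRequestMi;
  let bumps = 0;
  while (request < neededMi) {
    request *= bumpFactor;
    bumps++;
  }
  return bumps;
}

// A pod that requests 100Mi but suddenly needs 200Mi (e.g. after a load shift):
// 100 -> 120 -> 144 -> 172.8 -> 207.36, i.e. 4 OOM kills until it fits.
console.log(oomBumpsNeeded(100, 200)); // 4
```

Every one of these iterations is a pod restart, which is exactly what you do not want while a zone is down.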
For the Gardener-managed components, we mostly removed limits. Why?
- CPU limits have almost always only downsides. They cause needless CPU throttling, which is not even easily visible. CPU requests turn into cpu shares, so if the node has capacity, the pod may consume the freely available CPU, but not if you have set limits, which curtail the pod by means of cpu quota. There are only certain scenarios in which they may make sense, e.g. if you set requests=limits and thereby define a pod with guaranteed QoS, which influences your cgroup placement. However, that is difficult to do for the components you implement yourself and practically impossible for the components you just consume, because what’s the correct value for requests/limits and will it hold true also if the load increases and what happens if a zone goes down or with the next update/version of this component? If anything, CPU limits caused outages, not helped prevent them.
- As for memory limits, they are slightly more useful, because CPU is compressible and memory is not, so if one pod runs berserk, it may take others down (with CPU, cpu shares make it as fair as possible), depending on which OOM killer strikes (a complicated topic by itself). You don’t want the operating system OOM killer to strike as the result is unpredictable. Better, it’s the cgroup OOM killer or even the kubelet’s eviction, if the consumption is slow enough, as it even takes priorities into consideration. If your component is critical and a singleton (e.g. node daemon set pods), you are better off also without memory limits, because letting the pod go OOM because of artificial/wrong memory limits can mean that the node becomes unusable. Hence, such components are better run with no or only a very high memory limit, so that you can eventually catch the occasional memory leak (bug). Under normal operation, if you cannot decide on a true upper limit, it is better to have no limits than to cause endless outages through them, possibly when you need the pods the most (during a zone outage), when all your assumptions have gone out of the window.
The downside of having poor or no limits and poor or no requests is that nodes may “die” more often. Contrary to the expectation, even for managed services, the managed service is not responsible for and cannot guarantee the health of a node under all circumstances, since the end user defines what is run on the nodes (shared responsibility). If the workload exhausts any resource, it will be the end of the node, e.g. by compressing the CPU too much (so that the kubelet fails to do its work), exhausting the main memory too fast, disk space, file handles, or any other resource.
The kubelet allows for explicit reservation of resources for operating system daemons (system-reserved) and Kubernetes daemons (kube-reserved). These reservations are subtracted from the actual node resources, and what remains becomes the allocatable node resources for your workload/pods. All managed services configure these settings “by rule of thumb” (a balancing act), but cannot guarantee that the values won’t waste resources or always will be sufficient. You will have to fine-tune them eventually and adapt them to your needs. In addition, you can configure soft and hard eviction thresholds to give the kubelet some headroom to evict “greedy” pods in a controlled way. These settings can be configured for Gardener-managed clusters like this:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  kubernetes:
    kubelet:
      kubeReserved:                          # explicit resource reservation for Kubernetes daemons
        cpu: 100m
        memory: 1Gi
        ephemeralStorage: 1Gi
        pid: 1000
      evictionSoft:                          # soft, i.e. graceful eviction (used if the node is about to run out of resources, avoiding hard evictions)
        memoryAvailable: 200Mi
        imageFSAvailable: 10%
        imageFSInodesFree: 10%
        nodeFSAvailable: 10%
        nodeFSInodesFree: 10%
      evictionSoftGracePeriod:               # caps pod's `terminationGracePeriodSeconds` value during soft evictions (specific grace periods)
        memoryAvailable: 1m30s
        imageFSAvailable: 1m30s
        imageFSInodesFree: 1m30s
        nodeFSAvailable: 1m30s
        nodeFSInodesFree: 1m30s
      evictionHard:                          # hard, i.e. immediate eviction (used if the node is out of resources, preventing the OS from running out of resources and failing processes indiscriminately)
        memoryAvailable: 100Mi
        imageFSAvailable: 5%
        imageFSInodesFree: 5%
        nodeFSAvailable: 5%
        nodeFSInodesFree: 5%
      evictionMinimumReclaim:                # additional resources to reclaim after hitting the hard eviction thresholds to not hit the same thresholds soon after again
        memoryAvailable: 0Mi
        imageFSAvailable: 0Mi
        imageFSInodesFree: 0Mi
        nodeFSAvailable: 0Mi
        nodeFSInodesFree: 0Mi
      evictionMaxPodGracePeriod: 90          # caps pod's `terminationGracePeriodSeconds` value during soft evictions (general grace periods)
      evictionPressureTransitionPeriod: 5m0s # stabilization time window to avoid flapping of node eviction state
You can tweak these settings also individually per worker pool (spec.provider.workers.kubernetes.kubelet...), which makes sense especially with different machine types (and also workload that you may want to schedule there).
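How these reservations reduce what is left for your pods can be sketched with the node allocatable formula from the Kubernetes documentation (capacity minus kube-reserved, minus system-reserved, minus the hard eviction threshold). The sketch below covers memory only. The 16Gi node size is a made-up example, and system-reserved is assumed to be 0 because it is not set in the configuration above. The type and function names are hypothetical.

```typescript
// Memory reservations in Mi that the kubelet subtracts from the node capacity.
interface MemoryReservationsMi {
  kubeReserved: number;   // reservation for Kubernetes daemons
  systemReserved: number; // reservation for operating system daemons
  evictionHard: number;   // hard eviction threshold (memoryAvailable)
}

// Allocatable memory = capacity - kubeReserved - systemReserved - evictionHard
function allocatableMemoryMi(capacityMi: number, r: MemoryReservationsMi): number {
  return Math.max(0, capacityMi - r.kubeReserved - r.systemReserved - r.evictionHard);
}

// Hypothetical 16Gi node with the values from the example above:
// kubeReserved.memory = 1Gi (1024Mi), evictionHard.memoryAvailable = 100Mi
const allocatable = allocatableMemoryMi(16 * 1024, {
  kubeReserved: 1024,
  systemReserved: 0, // assumption: not configured in the example
  evictionHard: 100,
});
console.log(`${allocatable}Mi allocatable`); // 15260Mi allocatable
```

On small machine types, the same absolute reservations eat up a much larger share of the node, which is one reason to tune them per worker pool.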
Physical memory is not compressible, but you can overcome this issue to some degree (alpha since Kubernetes v1.22 in combination with the feature gate NodeSwap on the kubelet) with swap memory. You can read more in this introductory blog and the docs. If you choose to use it (still only alpha at the time of this writing), you may also want to consider the risks associated with swap memory:
- Reduced performance predictability
- Reduced performance up to page thrashing
- Reduced security as secrets, normally held only in memory, could be swapped out to disk
That said, the various options mentioned above are only remotely related to HA and will not be further explored throughout this document, but just to remind you: if a zone goes down, load patterns will shift, existing pods will probably receive more load and will require more resources (especially because it is often practically impossible to set “proper” resource requests, which drive node allocation - limits are always ignored by the scheduler) or more pods will/must be placed on the existing and/or new nodes and then these settings, which are generally critical (especially if you switch on bin-packing for Gardener-managed clusters as a cost saving measure), will become even more critical during a zone outage.
Probes
Before we go down the rabbit hole even further and talk about how to spread your replicas, we need to talk about probes first, as they will become relevant later. Kubernetes supports three kinds of probes: startup, liveness, and readiness probes. If you are a visual thinker, also check out this slide deck by Tim Hockin (Kubernetes networking SIG chair).
Basically, the startupProbe and the livenessProbe help you restart the container if it’s unhealthy for whatever reason, by letting the kubelet that orchestrates your containers on a node know that it’s unhealthy. The former is a special case of the latter and only applied at the startup of your container, if you need to handle the startup phase differently (e.g. with very slow starting containers) from the rest of the lifetime of the container.
Now, the readinessProbe helps you manage the ready status of your container and thereby pod (any container that is not ready turns the pod not ready). This in turn has an impact on endpoints and pod disruption budgets:
- If the pod is not ready, the endpoint will be removed and the pod will not receive traffic anymore
- If the pod is not ready, the pod counts into the pod disruption budget and if the budget is exceeded, no further voluntary pod disruptions will be permitted for the remaining ready pods (e.g. no eviction, no voluntary horizontal or vertical scaling, if the pod runs on a node that is about to be drained or in draining, draining will be paused until the max drain timeout passes)
As you can see, all of these probes are (also) related to HA (mostly the readinessProbe
, but depending on your workload, you can also incorporate livenessProbe
and startupProbe
into your HA strategy). If Kubernetes doesn’t know about the individual status of your container/pod, it won’t do anything for you (right away). That said, later/indirectly something might/will happen via the node status that can also be ready or not ready, which influences the pods and load balancer listener registration (a not ready node will not receive cluster traffic anymore), but this process is global to the worker pool, reacts with a delay, and doesn’t discriminate between the containers/pods on a node.
In addition, Kubernetes also offers pod readiness gates to amend your pod readiness with additional custom conditions (normally, only the sum of the container readiness matters, but pod readiness gates additionally count into the overall pod readiness). This may be useful if you want to block (by means of pod disruption budgets that we will talk about next) the roll-out of your workload/nodes in case some (possibly external) condition fails.
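As a rough illustration of how these signals combine, the following Python sketch models the readiness rule described above: a pod counts as ready only if all of its containers are ready and all of its readiness gate conditions hold. The function name, parameter names, and the gate name are hypothetical and exist only for this illustration; they are not part of any Kubernetes client library.

```python
from typing import Dict, Iterable, Optional


def pod_is_ready(container_ready: Iterable[bool],
                 readiness_gates: Optional[Dict[str, bool]] = None) -> bool:
    """Simplified model: every container and every readiness gate must be ready."""
    gates = readiness_gates or {}
    return all(container_ready) and all(gates.values())


# One container that is not ready turns the whole pod not ready.
print(pod_is_ready([True, False]))
# All containers are ready, but a custom (hypothetical) gate condition fails.
print(pod_is_ready([True, True], {"example.com/external-check": False}))
# Everything is ready.
print(pod_is_ready([True, True], {"example.com/external-check": True}))
```

In this model, a failing readiness gate has the same effect on endpoints and pod disruption budgets as a failing readiness probe.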
Pod Disruption Budgets
One of the most important resources that help you on your way to HA are pod disruption budgets or PDB for short. They tell Kubernetes how to deal with voluntary pod disruptions, e.g. during the deployment of your workload, when the nodes are rolled, or just in general when a pod shall be evicted/terminated. Basically, if the budget is reached, they block all voluntary pod disruptions (at least for a while until possibly other timeouts act or things happen that leave Kubernetes no choice anymore, e.g. the node is forcefully terminated). You should always define them for your workload.
Very important to note is that they are based on the readinessProbe
, i.e. even if all of your replicas are lively
, but not enough of them are ready
, this blocks voluntary pod disruptions, so they are very critical and useful. Here is an example (you can specify either minAvailable
or maxUnavailable
in absolute numbers or as percentage):
apiVersion: policy/v1
kind: PodDisruptionBudget
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      ...
And please do not specify a PDB of maxUnavailable
being 0 or similar. That’s pointless, even detrimental, as it then blocks even useful operations, always forces the hard timeouts that are less graceful, and doesn’t make sense in the context of HA. You cannot “force” HA by preventing voluntary pod disruptions, you must work with the pod disruptions in a resilient way. Besides, PDBs are really only about voluntary pod disruptions - something bad can happen to a node/pod at any time and PDBs won’t make this reality go away for you.
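To make the budget arithmetic concrete, here is a minimal Python sketch of how many voluntary disruptions a PDB with an absolute maxUnavailable would still permit. It is a simplified model (percentages and minAvailable are left out), and the function and parameter names are hypothetical.

```python
def disruptions_allowed(expected_pods: int, ready_pods: int, max_unavailable: int) -> int:
    """Voluntary disruptions a PDB with an absolute maxUnavailable would still permit."""
    # The budget requires at least this many ready pods at any time.
    desired_healthy = expected_pods - max_unavailable
    return max(0, ready_pods - desired_healthy)


# 3 replicas, all ready, maxUnavailable: 1 -> one pod may be evicted.
print(disruptions_allowed(expected_pods=3, ready_pods=3, max_unavailable=1))
# One replica is already not ready -> the budget is used up, evictions are blocked.
print(disruptions_allowed(expected_pods=3, ready_pods=2, max_unavailable=1))
# maxUnavailable: 0 -> never any room for a voluntary disruption.
print(disruptions_allowed(expected_pods=3, ready_pods=3, max_unavailable=0))
```

The second call shows why readiness matters so much here: a single pod that is not ready already consumes the whole budget.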
PDBs will not always work as expected and can also get in your way, e.g. if the PDB is violated or would be violated, it may possibly block whatever you are trying to do to salvage the situation, e.g. drain a node or deploy a patch version (if the PDB is or would be violated, not even unhealthy pods would be evicted as they could theoretically become healthy again, which Kubernetes doesn’t know). In order to overcome this issue, it is now possible (alpha since Kubernetes v1.26
in combination with the feature gate PDBUnhealthyPodEvictionPolicy
on the API server, beta and enabled by default since Kubernetes v1.27
) to configure the so-called unhealthy pod eviction policy. The default is still IfHealthyBudget
as a change in default would have changed the behavior (as described above), but you can now also set AlwaysAllow
at the PDB (spec.unhealthyPodEvictionPolicy
). For more information, please check out this discussion, the PR and this document and balance the pros and cons for yourself. In short, the new AlwaysAllow
option is probably the better choice in most of the cases while IfHealthyBudget
is useful only if you have frequent temporary transitions or for special cases where you have already implemented controllers that depend on the old behavior.
Pod Topology Spread Constraints
Pod topology spread constraints or PTSC for short (no official abbreviation exists, but we will use this in the following) are enormously helpful to distribute your replicas across multiple zones, nodes, or any other user-defined topology domain. They complement and improve on pod (anti-)affinities that still exist and can be used in combination.
PTSCs are an improvement, because they allow for maxSkew
and minDomains
. You can steer the “level of tolerated imbalance” with maxSkew
, e.g. you probably want that to be at least 1, so that you can perform a rolling update, but this all depends on your deployment (maxUnavailable
and maxSurge
), etc. Stateful sets are a bit different (maxUnavailable
) as they are bound to volumes and depend on them, so there usually cannot be 2 pods requiring the same volume. minDomains
is a hint to tell the scheduler how far to spread, e.g. if all nodes in one zone disappeared because of a zone outage, it may “appear” as if there are only 2 zones in a 3-zone cluster and the scheduling decisions may end up wrong, so a minDomains
of 3 will tell the scheduler to spread to 3 zones before adding another replica in one zone. Be careful with this setting as it also means that, if one zone is down, the “spread” is already at least 1 if pods run in the other zones. This is useful where you have exactly as many replicas as you have zones and you do not want any imbalance. Imbalance is critical because, if you end up with one, nobody is going to do the (active) re-balancing for you (unless you deploy and configure additional non-standard components such as the descheduler). So, for instance, if you have something like a DBMS that you want to spread across 2 zones (active-passive) or 3 zones (consensus-based), you should specify minDomains
of 2 or 3, respectively, to force your replicas into at least that many zones before adding more replicas to another zone (if supported).
Anyway, PTSCs are critical to have, but not perfect. We saw (unsurprisingly, because that’s how the scheduler works) that the scheduler may block the deployment of new pods because it makes its decision pod by pod (see for instance #109364).
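The interplay of maxSkew and minDomains can be sketched in a few lines of Python. This is a simplified model of the skew check only (the real scheduler also considers node selectors, taints, and more), and the function and parameter names are hypothetical. It assumes that a zone without any nodes does not show up in the counts and that, if fewer domains than minDomains are visible, the global minimum is treated as 0.

```python
from typing import Dict, Optional


def placement_allowed(pods_per_zone: Dict[str, int], zone: str,
                      max_skew: int, min_domains: Optional[int] = None) -> bool:
    """Simplified skew check for placing one more pod into the given zone."""
    global_minimum = min(pods_per_zone.values())
    # Fewer visible zones than minDomains: assume the missing zones hold 0 pods.
    if min_domains is not None and len(pods_per_zone) < min_domains:
        global_minimum = 0
    skew = pods_per_zone[zone] + 1 - global_minimum
    return skew <= max_skew


# Zone "c" is down and invisible; without minDomains another pod in "a" looks fine.
print(placement_allowed({"a": 1, "b": 1}, "a", max_skew=1))
# With minDomains of 3, the missing zone counts as empty and the pod stays pending.
print(placement_allowed({"a": 1, "b": 1}, "a", max_skew=1, min_domains=3))
```

The second call illustrates the warning above: with one zone down, the spread is already at 1 as soon as pods run in the other zones.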
Pod Affinities and Anti-Affinities
As said, you can combine PTSCs with pod affinities and/or anti-affinities. Especially inter-pod (anti-)affinities may be helpful to place pods apart, e.g. because they are fall-backs for each other or you do not want multiple potentially resource-hungry “best-effort” or “burstable” pods side-by-side (noisy neighbor problem), or together, e.g. because they form a unit and you want to reduce the failure domain, reduce the network latency, and reduce the costs.
Topology Aware Hints
While topology aware hints are not directly related to HA, they are very relevant in the HA context. Spreading your workload across multiple zones may increase network latency and cost significantly, if the traffic is not shaped. Topology aware hints (beta since Kubernetes v1.23
, replacing the now deprecated topology aware traffic routing with topology keys) help to route the traffic within the originating zone, if possible. Basically, they tell kube-proxy
how to set up your routing information, so that clients can talk to endpoints that are located within the same zone.
Be aware, however, that there are some limitations. Those are called safeguards and if they strike, the hints are off and traffic is routed randomly again. Especially controversial is the balancing limitation, as it assumes that the load that hits an endpoint is determined by the allocatable CPUs in that topology zone, but that’s not always, if even often, the case (see for instance #113731 and #110714). So, this limitation hits far too often and your hints are off, but then again, it’s about network latency and cost optimization first, so it’s better than nothing.
Networking
We have talked about networking only to some small degree so far (readiness
probes, pod disruption budgets, topology aware hints). The most important component is probably your ingress load balancer - everything else is managed by Kubernetes. AWS, Azure, GCP, and also OpenStack offer multi-zonal load balancers, so make use of them. In Azure and GCP, LBs are regional whereas in AWS and OpenStack, they need to be bound to a zone, which the cloud-controller-manager does by observing the zone labels at the nodes (please note that this behavior is not always working as expected, see #570 where the AWS cloud-controller-manager is not readjusting to newly observed zones).
Please be reminded that even if you use a service mesh like Istio, the off-the-shelf installation/configuration usually does not come with production-ready settings (to simplify first-time installation and improve first-time user experience) and you will have to fine-tune your installation/configuration, much like the rest of your workload.
Relevant Cluster Settings
The following is a summary of the more relevant settings you may want to tune for Gardener-managed clusters:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone # valid values are `node` and `zone` (only available if your control plane resides in a region with 3+ zones)
  kubernetes:
    kubeAPIServer:
      defaultNotReadyTolerationSeconds: 300
      defaultUnreachableTolerationSeconds: 300
    kubelet:
      ...
    kubeScheduler:
      featureGates:
        MinDomainsInPodTopologySpread: true
    kubeControllerManager:
      nodeMonitorGracePeriod: 40s
      horizontalPodAutoscaler:
        syncPeriod: 15s
        tolerance: 0.1
        downscaleStabilization: 5m0s
        initialReadinessDelay: 30s
        cpuInitializationPeriod: 5m0s
    verticalPodAutoscaler:
      enabled: true
      evictAfterOOMThreshold: 10m0s
      evictionRateBurst: 1
      evictionRateLimit: -1
      evictionTolerance: 0.5
      recommendationMarginFraction: 0.15
      updaterInterval: 1m0s
      recommenderInterval: 1m0s
    clusterAutoscaler:
      expander: "least-waste"
      scanInterval: 10s
      scaleDownDelayAfterAdd: 60m
      scaleDownDelayAfterDelete: 0s
      scaleDownDelayAfterFailure: 3m
      scaleDownUnneededTime: 30m
      scaleDownUtilizationThreshold: 0.5
  provider:
    workers:
    - name: ...
      minimum: 6
      maximum: 60
      maxSurge: 3
      maxUnavailable: 0
      zones:
      - ... # list of zones you want your worker pool nodes to be spread across, see above
      kubernetes:
        kubelet:
          ... # similar to `kubelet` above (cluster-wide settings), but here per worker pool (pool-specific settings), see above
      machineControllerManager: # optional, it allows to configure the machine-controller settings.
        machineCreationTimeout: 20m
        machineHealthTimeout: 10m
        machineDrainTimeout: 60h
  systemComponents:
    coreDNS:
      autoscaling:
        mode: horizontal # valid values are `horizontal` (driven by CPU load) and `cluster-proportional` (driven by number of nodes/cores)
On spec.controlPlane.highAvailability.failureTolerance.type
If set, determines the degree of failure tolerance for your control plane. zone
is preferred, but only available if your control plane resides in a region with 3+ zones. See above and the docs.
On spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds
and defaultNotReadyTolerationSeconds
This is a very interesting API server setting that lets Kubernetes decide how fast to evict pods from nodes whose status condition of type Ready
is either Unknown
(node status unknown, a.k.a unreachable) or False
(kubelet
not ready) (see node status conditions; please note that kubectl
shows both values as NotReady
which is a somewhat “simplified” visualization).
You can also override the cluster-wide API server settings individually per pod:
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 0
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 0
This will evict pods on unreachable or not-ready nodes immediately, but be cautious: 0
is very aggressive and may lead to unnecessary disruptions. Again, you must decide for your own workload and balance out the pros and cons (e.g. long startup time).
Please note, these settings replace spec.kubernetes.kubeControllerManager.podEvictionTimeout
that was deprecated with Kubernetes v1.26
(and acted as an upper bound).
On spec.kubernetes.kubeScheduler.featureGates.MinDomainsInPodTopologySpread
Required to be enabled for minDomains
to work with PTSCs (beta since Kubernetes v1.25
, but off by default). See above and the docs. This tells the scheduler how many topology domains to expect (=zones in the context of this document).
On spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod
This is another very interesting kube-controller-manager setting that can help you speed up or slow down how fast a node shall be considered Unknown
(node status unknown, a.k.a unreachable) when the kubelet
is not updating its status anymore (see node status conditions), which affects eviction (see spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds
and defaultNotReadyTolerationSeconds
above). The shorter the time window, the faster Kubernetes will act, but the higher the chance of flapping behavior and pod thrashing, so you may want to balance that out according to your needs; otherwise, stick to the default, which is a reasonable compromise.
On spec.kubernetes.kubeControllerManager.horizontalPodAutoscaler...
This configures horizontal pod autoscaling in Gardener-managed clusters. See above and the docs for the detailed fields.
On spec.kubernetes.verticalPodAutoscaler...
This configures vertical pod autoscaling in Gardener-managed clusters. See above and the docs for the detailed fields.
On spec.kubernetes.clusterAutoscaler...
This configures node auto-scaling in Gardener-managed clusters. See above and the docs for the detailed fields, especially about expanders, which may become life-saving in case of a zone outage when a resource crunch is setting in and everybody rushes to get machines in the healthy zones.
In case of a zone outage, it is critical to understand how the cluster autoscaler will put a worker pool in one zone into “back-off” and what the consequences for your workload will be. Unfortunately, the official cluster autoscaler documentation does not explain these details, but you can find hints in the source code:
If a node fails to come up, the node group (worker pool in that zone) will go into “back-off”, at first 5m, then exponentially longer until the maximum of 30m is reached. The “back-off” is reset after 3 hours. This in turn means that nodes must first be considered Unknown
, which happens when spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod
lapses (e.g. at the beginning of a zone outage). Then they must either remain in this state until spec.provider.workers.machineControllerManager.machineHealthTimeout
lapses for them to be recreated, which will fail in the unhealthy zone, or spec.kubernetes.kubeAPIServer.defaultUnreachableTolerationSeconds
lapses for the pods to be evicted (usually faster than node replacements, depending on your configuration), which will trigger the cluster autoscaler to create more capacity, but very likely in the same zone as it tries to balance its node groups at first, which will fail in the unhealthy zone. It will be considered failed only when maxNodeProvisionTime
lapses (usually close to spec.provider.workers.machineControllerManager.machineCreationTimeout
) and only then put the node group into “back-off” and not retry for 5m (at first and then exponentially longer). Only then can you expect new node capacity to be brought up somewhere else.
During the time of ongoing node provisioning (before a node group goes into “back-off”), the cluster autoscaler may have “virtually scheduled” pending pods onto those new upcoming nodes and will not reevaluate these pods anymore unless the node provisioning fails (which will fail during a zone outage, but the cluster autoscaler cannot know that and will therefore reevaluate its decision only after it has given up on the new nodes).
It’s critical to keep that in mind and account for it. If you already have capacity up and running, the reaction time is usually much faster with leases (whatever you set) or endpoints (spec.kubernetes.kubeControllerManager.nodeMonitorGracePeriod
), but if you depend on new/fresh capacity, the above should inform you how long you will have to wait for it and for how long pods might be pending (because capacity is generally missing and pending pods may have been “virtually scheduled” to new nodes that won’t come up until the node group goes eventually into “back-off” and nodes in the healthy zones come up).
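The chain of timeouts described above can be added up for a rough estimate. The Python sketch below is a back-of-the-envelope model, not an exact prediction: it ignores scan intervals and scheduling latency, assumes that maxNodeProvisionTime equals the machineCreationTimeout from the example configuration, and assumes that “exponentially longer” means doubling. All function names are hypothetical.

```python
from datetime import timedelta
from typing import List


def time_until_backoff(node_monitor_grace_period: timedelta,
                       unreachable_toleration: timedelta,
                       max_node_provision_time: timedelta) -> timedelta:
    """Node turns Unknown, pods are evicted, then the replacement node fails to come up."""
    return node_monitor_grace_period + unreachable_toleration + max_node_provision_time


def backoff_durations(failures: int,
                      initial: timedelta = timedelta(minutes=5),
                      maximum: timedelta = timedelta(minutes=30)) -> List[timedelta]:
    """Back-off per consecutive failure, ASSUMING the duration doubles each time."""
    durations, current = [], initial
    for _ in range(failures):
        durations.append(current)
        current = min(current * 2, maximum)
    return durations


estimate = time_until_backoff(
    node_monitor_grace_period=timedelta(seconds=40),  # nodeMonitorGracePeriod: 40s
    unreachable_toleration=timedelta(seconds=300),    # defaultUnreachableTolerationSeconds: 300
    max_node_provision_time=timedelta(minutes=20),    # assumed equal to machineCreationTimeout: 20m
)
print(estimate)  # 0:25:40
print([str(d) for d in backoff_durations(4)])  # ['0:05:00', '0:10:00', '0:20:00', '0:30:00']
```

With the example values, pods that depend on fresh capacity could be pending for roughly 25 minutes before the cluster autoscaler even starts looking at the healthy zones.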
On spec.provider.workers.minimum
, maximum
, maxSurge
, maxUnavailable
, zones
, and machineControllerManager
Each worker pool in Gardener may be configured differently. Among many other settings like machine type, root disk, Kubernetes version, kubelet
settings, and many more you can also specify the lower and upper bound for the number of machines (minimum
and maximum
), how many machines may be added additionally during a rolling update (maxSurge
) and how many machines may be in termination/recreation during a rolling update (maxUnavailable
), and of course across how many zones the nodes shall be spread (zones
).
Gardener divides minimum
, maximum
, maxSurge
, maxUnavailable
values by the number of zones specified for this worker pool. This fact must be considered when you plan the sizing of your worker pools.
Example:
provider:
  workers:
  - name: ...
    minimum: 6
    maximum: 60
    maxSurge: 3
    maxUnavailable: 0
    zones: ["a", "b", "c"]
- The resulting MachineDeployments per zone will get minimum: 2, maximum: 20, maxSurge: 1, maxUnavailable: 0.
- If another zone is added, all values will be divided by 4, resulting in:
  - Fewer workers per zone.
  - ⚠️ One MachineDeployment with maxSurge: 0, i.e. there will be a replacement of nodes without rolling updates.
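The division can be reproduced with a short Python sketch. It assumes that any remainder is handed to the first zones one by one; this matches the example above (one zone ends up with maxSurge: 0 when 3 is split over 4 zones), but the exact rule is not spelled out in this guide, so treat it as an assumption. The function name is hypothetical.

```python
from typing import List


def distribute_over_zones(total: int, zone_count: int) -> List[int]:
    """Split a worker pool value across zones; the first zones receive the remainder."""
    base, remainder = divmod(total, zone_count)
    return [base + (1 if index < remainder else 0) for index in range(zone_count)]


pool = {"minimum": 6, "maximum": 60, "maxSurge": 3, "maxUnavailable": 0}
for zones in (3, 4):
    per_zone = {name: distribute_over_zones(value, zones) for name, value in pool.items()}
    print(zones, per_zone)
```

Running the numbers before adding a zone shows quickly whether any zone would be left with a maxSurge of 0.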
Interesting is also the configuration for Gardener’s machine-controller-manager or MCM for short that provisions, monitors, terminates, replaces, or updates machines that back your nodes:
- The shorter machineCreationTimeout is, the faster MCM will retry to create a machine/node, if the process is stuck on cloud provider side. It is set to useful/practical timeouts for the different cloud providers and you probably don’t want to change those (in the context of HA at least). Please align with the cluster autoscaler’s maxNodeProvisionTime.
- The shorter machineHealthTimeout is, the faster MCM will replace machines/nodes in case the kubelet isn’t reporting back, which translates to Unknown, or reports back with NotReady, or the node-problem-detector that Gardener deploys for you reports a non-recoverable issue/condition (e.g. read-only file system). If it is too short however, you risk node and pod thrashing, so be careful.
- The shorter machineDrainTimeout is, the faster you can get rid of machines/nodes that MCM decided to remove, but this puts a cap on the grace periods and PDBs. They are respected up until the drain timeout lapses - then the machine/node will be forcefully terminated, whether or not the pods are still in termination or not even terminated because of PDBs. Those PDBs will then be violated, so be careful here as well. Please align with the cluster autoscaler’s maxGracefulTerminationSeconds.
Especially the last two settings may help you recover faster from cloud provider issues.
On spec.systemComponents.coreDNS.autoscaling
DNS is critical, in general and also within a Kubernetes cluster. Gardener-managed clusters deploy CoreDNS, a graduated CNCF project. Gardener supports 2 auto-scaling modes for it, horizontal
(using HPA based on CPU) and cluster-proportional
(using cluster proportional autoscaler that scales the number of pods based on the number of nodes/cores, not to be confused with the cluster autoscaler that scales nodes based on their utilization). Check out the docs, especially the trade-offs why you would choose one over the other (cluster-proportional
gives you more configuration options, if CPU-based horizontal scaling is insufficient for your needs). Consider also Gardener’s feature node-local DNS to decouple you further from the DNS pods and stabilize DNS. Again, that’s not strictly related to HA, but may become important during a zone outage, when load patterns shift and pods start to initialize/resolve DNS records more frequently in bulk.
More Caveats
Unfortunately, there are a few more things of note when it comes to HA in a Kubernetes cluster that may be “surprising” and hard to mitigate:
- If the kubelet restarts, it will report all pods as NotReady on startup until it reruns its probes (#100277), which leads to temporary endpoint and load balancer target removal (#102367). This topic is somewhat controversial. Gardener uses rolling updates and a jitter to spread necessary kubelet restarts as well as possible.
- If a kube-proxy pod on a node turns NotReady, all load balancer traffic to all pods (on this node) under services with externalTrafficPolicy local will cease as the load balancer will then take this node out of serving. This topic is somewhat controversial as well. So, please remember that externalTrafficPolicy local not only has the disadvantage of imbalanced traffic spreading, but also a dependency on the kube-proxy pod that may and will be unavailable during updates. Gardener uses rolling updates to spread necessary kube-proxy updates as well as possible.
These are just a few additional considerations. They may or may not affect you, but other intricacies may. It’s a reminder to be watchful as Kubernetes may have one or two relevant quirks that you need to consider (and will probably only find out over time and with extensive testing).
Meaningful Availability
Finally, let’s go back to where we started. We recommended measuring meaningful availability. For instance, in Gardener, we do not trust only internal signals, but track also whether Gardener or the control planes that it manages are externally available through the external DNS records and load balancers, SNI-routing Istio gateways, etc. (the same path all users must take). It’s a huge difference whether the API server’s internal readiness probe passes or the user can actually reach the API server and it does what it’s supposed to do. Most likely, you will be in a similar spot and can do the same.
What you do with these signals is another matter. Maybe there are some actionable metrics and you can trigger some active fail-over, maybe you can only use it to improve your HA setup altogether. In our case, we also use it to deploy mitigations, e.g. via our dependency-watchdog that watches, for instance, Gardener-managed API servers and shuts down components like the controller managers to avert cascading knock-on effects (e.g. melt-down if the kubelets
cannot reach the API server, but the controller managers can and start taking down nodes and pods).
Either way, understanding how users perceive your service is key to the improvement process as a whole. Even if you are not struck by a zone outage, the measures above and tracking the meaningful availability will help you improve your service.
Thank you for your interest.
2.2 - Chaos Engineering
Overview
Gardener provides chaostoolkit
modules to simulate compute and network outages for various cloud providers such as AWS, Azure, GCP, OpenStack/Converged Cloud, and VMware vSphere, as well as pod disruptions for any Kubernetes cluster.
The API, parameterization, and implementation are as homogeneous as possible across the different cloud providers, so that your effort stays minimal. As a Gardener user, you benefit from an additional garden
module that leverages the generic modules, but exposes their functionality in the most simple, homogeneous, and secure way (no need to specify cloud provider credentials, cluster credentials, or filters explicitly; retrieves credentials and stores them in memory only).
Installation
The name of the package is chaosgarden
and it was developed and tested with Python 3.9+. It is published to PyPI, so that you can conveniently install it via Python’s package installer pip (you may want to create a virtual environment before installing it):
pip install chaosgarden
ℹ️ If you want to use the VMware vSphere module, please note the remarks in requirements.txt
for vSphere
. Those are not contained in the published PyPI package.
The package can be used directly from Python scripts and supports this usage scenario with additional convenience that helps launch actions and probes in background (more on actions and probes later), so that you can compose also complex scenarios with ease.
If this technology is new to you, you will probably prefer the chaostoolkit
CLI in combination with experiment files, so we need to install the CLI next:
pip install chaostoolkit
Please verify that it was installed properly by running:
chaos --help
Usage
ℹ️ We assume you are using Gardener and run Gardener-managed shoot clusters. You can also use the generic cloud provider and Kubernetes chaosgarden
modules, but configuration and secrets will then differ. Please see the module docs for details.
A Simple Experiment
The most important command is the run
command, but before we can use it, we need to compile an experiment file first. Let’s start with a simple one, invoking only a read-only 📖 action from chaosgarden
that lists cloud provider machines and networks (depends on cloud provider) for the “first” zone of one of your shoot clusters.
Let’s assume, your project is called my-project
and your shoot is called my-shoot
, then we need to create the following experiment:
{
  "title": "assess-filters-impact",
  "description": "assess-filters-impact",
  "method": [
    {
      "type": "action",
      "name": "assess-filters-impact",
      "provider": {
        "type": "python",
        "module": "chaosgarden.garden.actions",
        "func": "assess_cloud_provider_filters_impact",
        "arguments": {
          "zone": 0
        }
      }
    }
  ],
  "configuration": {
    "garden_project": "my-project",
    "garden_shoot": "my-shoot"
  }
}
We are not there yet and need to do one more thing before we can run it: we need to “target” the Gardener landscape, i.e. the Gardener API server where you have created your shoot cluster (not to be confused with your shoot cluster API server). If you do not know what this is or how to download the Gardener API server kubeconfig
, please follow these instructions. You can either download your personal credentials or project credentials (see creation of a serviceaccount
) to interact with Gardener. For now (fastest and most convenient way, but generally not recommended), let’s use your personal credentials, but if you later plan to automate your experiments, please use proper project credentials (a serviceaccount
is not bound to your person, but to the project, and can be restricted using RBAC roles and role bindings, which is why we recommend this for production).
To download your personal credentials, open the Gardener Dashboard and click on your avatar in the upper right corner of the page. Click “My Account”, then look for the “Access” pane, then “Kubeconfig”, then press the “Download” button and save the kubeconfig
to disk. Run the following command next:
export KUBECONFIG=path/to/kubeconfig
We are now set and you can run your first experiment:
chaos run path/to/experiment
You should see output like this (depends on cloud provider):
[INFO] Validating the experiment's syntax
[INFO] Installing signal handlers to terminate all active background threads on involuntary signals (note that SIGKILL cannot be handled).
[INFO] Experiment looks valid
[INFO] Running experiment: assess-filters-impact
[INFO] Steady-state strategy: default
[INFO] Rollbacks strategy: default
[INFO] No steady state hypothesis defined. That's ok, just exploring.
[INFO] Playing your experiment's method now...
[INFO] Action: assess-filters-impact
[INFO] Validating client credentials and listing probably impacted instances and/or networks with the given arguments zone='world-1a' and filters={'instances': [{'Name': 'tag-key', 'Values': ['kubernetes.io/cluster/shoot--my-project--my-shoot']}], 'vpcs': [{'Name': 'tag-key', 'Values': ['kubernetes.io/cluster/shoot--my-project--my-shoot']}]}:
[INFO] 1 instance(s) would be impacted:
[INFO] - i-aabbccddeeff0000
[INFO] 1 VPC(s) would be impacted:
[INFO] - vpc-aabbccddeeff0000
[INFO] Let's rollback...
[INFO] No declared rollbacks, let's move on.
[INFO] Experiment ended with status: completed
🎉 Congratulations! You successfully ran your first chaosgarden
experiment.
A Destructive Experiment
Now let’s break 🪓 your cluster. Be advised that this experiment will be destructive in the sense that we will temporarily network-partition all nodes in one availability zone (machine termination or restart is available with chaosgarden
as well). That means, these nodes and their pods won’t be able to “talk” to other nodes, pods, and services. Also, the API server will become unreachable for them and the API server will report them as unreachable (confusingly shown as NotReady
when you run kubectl get nodes
and Unknown
in the status Ready
condition when you run kubectl get nodes --output yaml
).
Being unreachable will trigger service endpoint and load balancer de-registration (when the node’s grace period lapses) as well as eventually pod eviction and machine replacement (which will continue to fail under test). We won’t run the experiment long enough for all of these effects to materialize, but the longer you run it, the more will happen, up to temporarily giving up/going into “back-off” for the affected worker pool in that zone. You will also see that the Kubernetes cluster autoscaler will try to create a new machine almost immediately, if pods are pending for the affected zone (which will initially fail under test, but may succeed later, which again depends on the runtime of the experiment and whether or not the cluster autoscaler goes into “back-off”).
But for now, all of this doesn’t matter as we want to start “small”. You can later read up more on the various settings and effects in our best practices guide on high availability.
Please create a new experiment file, this time with this content:
{
  "title": "run-network-failure-simulation",
  "description": "run-network-failure-simulation",
  "method": [
    {
      "type": "action",
      "name": "run-network-failure-simulation",
      "provider": {
        "type": "python",
        "module": "chaosgarden.garden.actions",
        "func": "run_cloud_provider_network_failure_simulation",
        "arguments": {
          "mode": "total",
          "zone": 0,
          "duration": 60
        }
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "rollback-network-failure-simulation",
      "provider": {
        "type": "python",
        "module": "chaosgarden.garden.actions",
        "func": "rollback_cloud_provider_network_failure_simulation",
        "arguments": {
          "mode": "total",
          "zone": 0
        }
      }
    }
  ],
  "configuration": {
    "garden_project": {
      "type": "env",
      "key": "GARDEN_PROJECT"
    },
    "garden_shoot": {
      "type": "env",
      "key": "GARDEN_SHOOT"
    }
  }
}
ℹ️ There is an even more destructive action that terminates or alternatively restarts machines in a given zone 🔥 (immediately or delayed with some randomness/chaos for maximum inconvenience for the nodes and pods). You can find links to all these examples at the end of this tutorial.
This experiment is very similar, but this time we will break 🪓 your cluster - for 60s
. If that’s too short to even see a node or pod transition from Ready
to NotReady
(actually Unknown
), then increase the duration
. Depending on the workload that your cluster runs, you may already see effects of the network partitioning, because it is effective immediately. It’s just that Kubernetes cannot know immediately and rather assumes that something is failing only after the node’s grace period lapses, but the actual workload is impacted immediately.
Most notably, this experiment also has a rollbacks section, which is invoked even if you abort the experiment or it fails unexpectedly, but only if you run the CLI with the option --rollback-strategy always, which we will do soon. Any chaosgarden action that can undo its activity will do that implicitly when the duration lapses, but it is a best practice to always configure a rollbacks section in case something unexpected happens. Should you be in panic and just want to run the rollbacks section, you can remove all other actions and the CLI will execute the rollbacks section immediately.
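The run action and its rollback must agree on mode and zone, so it can help to generate the experiment file instead of editing both places by hand. The following is only a sketch: the helper name build_network_failure_experiment is an assumption made for illustration, while the module, function, and argument names are copied from the experiment above.

```python
import json

ACTIONS_MODULE = "chaosgarden.garden.actions"


def build_network_failure_experiment(zone: int, duration: int, mode: str = "total") -> dict:
    """Build the experiment from above with method and rollbacks kept in sync."""
    target = {"mode": mode, "zone": zone}
    return {
        "title": "run-network-failure-simulation",
        "description": "run-network-failure-simulation",
        "method": [
            {
                "type": "action",
                "name": "run-network-failure-simulation",
                "provider": {
                    "type": "python",
                    "module": ACTIONS_MODULE,
                    "func": "run_cloud_provider_network_failure_simulation",
                    # The run action additionally takes the duration (60 meant 60s above).
                    "arguments": {**target, "duration": duration},
                },
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "rollback-network-failure-simulation",
                "provider": {
                    "type": "python",
                    "module": ACTIONS_MODULE,
                    "func": "rollback_cloud_provider_network_failure_simulation",
                    "arguments": dict(target),
                },
            }
        ],
        # Project and shoot are still read from the environment.
        "configuration": {
            "garden_project": {"type": "env", "key": "GARDEN_PROJECT"},
            "garden_shoot": {"type": "env", "key": "GARDEN_SHOOT"},
        },
    }


if __name__ == "__main__":
    # Print an experiment with a longer duration; redirect the output into a file.
    print(json.dumps(build_network_failure_experiment(zone=0, duration=120), indent=2))
```

Redirecting the printed JSON into a file gives you an experiment with a longer duration that is otherwise identical to the one above.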
One other thing is different in the second experiment as well. We now read the name of the project and the shoot from the environment, i.e. a configuration section can automatically expand environment variables. Also useful to know (not shown here), chaostoolkit supports variable substitution too, so that you have to define variables only once. Please note that you can also add a secrets section that can also automatically expand environment variables. For instance, instead of targeting the Gardener API server via $KUBECONFIG, which is supported by our chaosgarden package natively, you can also explicitly refer to it in a secrets section (for brevity reasons not shown here either).
Let’s now run your second experiment (please watch your nodes and pods in parallel, e.g. by running watch kubectl get nodes,pods --output wide in another terminal):
export GARDEN_PROJECT=my-project
export GARDEN_SHOOT=my-shoot
chaos run --rollback-strategy always path/to/experiment
The output of the run command will be similar to the one above, but longer. It will mention either machines or networks that were network-partitioned (depending on the cloud provider), and everything should be reverted back to normal at the end.
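If you prefer to start the run from a script, the same invocation can be wrapped in a few lines of Python. This is a minimal sketch; the function names are hypothetical, and it assumes the chaos CLI is installed and on your PATH.

```python
import os
import subprocess


def chaos_run_command(experiment_path: str) -> list:
    """Return the CLI invocation used above, with rollbacks always executed."""
    return ["chaos", "run", "--rollback-strategy", "always", experiment_path]


def run_experiment(experiment_path: str, project: str, shoot: str) -> int:
    """Run the experiment with GARDEN_PROJECT and GARDEN_SHOOT set; return the exit code."""
    env = dict(os.environ, GARDEN_PROJECT=project, GARDEN_SHOOT=shoot)
    return subprocess.run(chaos_run_command(experiment_path), env=env).returncode
```

A call such as run_experiment("path/to/experiment", "my-project", "my-shoot") then corresponds to the three shell lines above.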
Normally, you would not only run actions in the method section, but also probes as part of a steady state hypothesis. Such steady state hypothesis probes are run before and after the actions to validate that the “system” was in a healthy state before and gets back to a healthy state after the actions ran, hence show that the “system” is in a steady state when not under test. Eventually, you will write your own probes that don’t even have to be part of a steady state hypothesis. We at Gardener run multi-zone (multiple zones at once) and rolling-zone (strike each zone once) outages with continuous custom probes all within the method section to validate our KPIs continuously under test (e.g. how long do the individual fail-overs take/how long is the actual outage). The most complex scenarios are even run via Python scripts as all actions and probes can also be invoked directly (which is what the CLI does).
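To give an idea of what a steady state hypothesis looks like in an experiment file, here is a sketch that adds one to an existing experiment dictionary. The probe module my_probes and the function workload_is_healthy are purely hypothetical placeholders for a probe you would write yourself; the keys steady-state-hypothesis, probes, and tolerance follow the Chaos Toolkit experiment format.

```python
def with_steady_state(experiment: dict, probe_module: str, probe_func: str) -> dict:
    """Return a copy of the experiment with a steady state hypothesis added."""
    hypothesis = {
        "title": "workload-is-healthy",
        "probes": [
            {
                "type": "probe",
                "name": "workload-is-healthy",
                # The probe function is expected to return True while the system is healthy.
                "tolerance": True,
                "provider": {
                    "type": "python",
                    "module": probe_module,  # hypothetical, e.g. "my_probes"
                    "func": probe_func,      # hypothetical, e.g. "workload_is_healthy"
                },
            }
        ],
    }
    return {**experiment, "steady-state-hypothesis": hypothesis}
```

The original dictionary is left untouched, so the same base experiment can be reused with different probes.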
High Availability
Developing highly available workload that can tolerate a zone outage is no trivial task. You can find more information on how to achieve this goal in our best practices guide on high availability.
Thank you for your interest in Gardener chaos engineering and making your workload more resilient.
Further Reading
Here are some links for further reading:
- Examples: Experiments, Scripts
- Gardener Chaos Engineering: GitHub, PyPI, Module Docs for Gardener Users
- Chaos Toolkit Core: Home Page, Installation, Concepts, GitHub
2.3 - Control Plane
node and zone. Possible mitigations for zone or node outages
Highly Available Shoot Control Plane
The Shoot resource offers a way to request a highly available control plane.
Failure Tolerance Types
A highly available shoot control plane can be set up with a failure tolerance of either zone or node.
Node Failure Tolerance
The failure tolerance of a node will have the following characteristics:
- Control plane components will be spread across different nodes within a single availability zone. There will not be more than one replica per node for each control plane component which has more than one replica.
- Worker pool should have a minimum of 3 nodes.
- A multi-node etcd (quorum size of 3) will be provisioned, offering zero-downtime capabilities with each member in a different node within a single availability zone.
Zone Failure Tolerance
The failure tolerance of a zone will have the following characteristics:
- Control plane components will be spread across different availability zones. There will be at least one replica per zone for each control plane component which has more than one replica.
- Gardener scheduler will automatically select a seed which has a minimum of 3 zones to host the shoot control plane.
- A multi-node etcd (quorum size of 3) will be provisioned, offering zero-downtime capabilities with each member in a different zone.
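The etcd sizing above follows from simple majority arithmetic. The sketch below assumes that “quorum size of 3” refers to an etcd cluster with three members and that a write needs a majority of members, as in etcd’s Raft-based consensus; the function names are chosen for illustration only.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that may fail while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 3):
        print(f"{size} member(s): quorum={quorum(size)}, tolerated failures={tolerated_failures(size)}")
```

With three members spread over three nodes or zones, one node or zone can fail and the remaining two members still form a majority. A single-member etcd tolerates no failure at all.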
Shoot Spec
To request a highly available shoot control plane, Gardener provides the following configuration in the shoot spec:
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: <node | zone>
Allowed Transitions
If you already have a shoot cluster with non-HA control plane, then the following upgrades are possible:
- Upgrade of non-HA shoot control plane to HA shoot control plane with node failure tolerance.
- Upgrade of non-HA shoot control plane to HA shoot control plane with zone failure tolerance. However, it is essential that the seed which is currently hosting the shoot control plane is multi-zonal. If it is not, then the request to upgrade will be rejected.
Note: There will be a small downtime during the upgrade, especially for etcd, which will transition from a single node etcd cluster to a multi-node etcd cluster.
Disallowed Transitions
If you already have a shoot cluster with HA control plane, then the following transitions are not possible:
- Upgrade of HA shoot control plane from node failure tolerance to zone failure tolerance is currently not supported, mainly because already existing volumes are bound to the zone they were created in originally.
- Downgrade of HA shoot control plane with zone failure tolerance to node failure tolerance is currently not supported, mainly for the same reason as above: already existing volumes are bound to the respective zones they were created in originally.
- Downgrade of HA shoot control plane with either node or zone failure tolerance to a non-HA shoot control plane is currently not supported, mainly because etcd-druid does not currently support scaling down of a multi-node etcd cluster to a single-node etcd cluster.
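The allowed and disallowed transitions can be summarized as a small decision function. This is only an illustration of the rules listed above, not Gardener’s actual validation code; the function name and the use of None for a non-HA control plane are assumptions.

```python
from typing import Optional, Tuple


def check_transition(current: Optional[str], target: Optional[str],
                     seed_is_multi_zonal: bool = False) -> Tuple[bool, str]:
    """Return (allowed, reason) for changing the failure tolerance type.

    None stands for a non-HA control plane; "node" and "zone" are the HA types.
    """
    for value in (current, target):
        if value not in (None, "node", "zone"):
            raise ValueError(f"unknown failure tolerance type: {value!r}")
    if current == target:
        return True, "no change"
    if current is None:
        # Upgrades from non-HA are possible; zone additionally needs a multi-zonal seed.
        if target == "zone" and not seed_is_multi_zonal:
            return False, "seed hosting the control plane is not multi-zonal"
        return True, "upgrade from non-HA is possible"
    if target is None:
        return False, "scaling a multi-node etcd down to a single node is not supported"
    # Remaining cases: node -> zone and zone -> node.
    return False, "existing volumes are bound to the zone they were created in"
```

For example, check_transition(None, "zone", seed_is_multi_zonal=True) is allowed, while check_transition("node", "zone") is rejected.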
Zone Outage Situation
Implementing highly available software that can tolerate even a zone outage unscathed is no trivial task. You may find our HA Best Practices helpful to get closer to that goal. In this document, we collected many options and settings for you that also Gardener internally uses to provide a highly available service.
During a zone outage, you may be forced to change your cluster setup on short notice in order to compensate for failures and shortages resulting from the outage.
For instance, if the shoot cluster has worker nodes across three zones where one zone goes down, the computing power from these nodes is also gone during that time.
Changing the worker pool (shoot.spec.provider.workers[]) and infrastructure (shoot.spec.provider.infrastructureConfig) configuration can eliminate this imbalance, so that enough machines in healthy availability zones are available to cope with the requests of your applications.
Gardener relies on a sophisticated reconciliation flow with several dependencies for which various flow steps wait for the readiness of prior ones.
During a zone outage, this can block the entire flow, e.g., because all three etcd replicas can never be ready when a zone is down, and required changes mentioned above can never be accomplished.
For this, a special one-off annotation shoot.gardener.cloud/skip-readiness helps to skip any readiness checks in the flow.
The shoot.gardener.cloud/skip-readiness annotation serves as a last resort if reconciliation is stuck because of important changes during an AZ outage. Use it with caution, only in exceptional cases and after a case-by-case evaluation with your Gardener landscape administrator. If used together with other operations like Kubernetes version upgrades or credential rotation, the annotation may lead to a severe outage of your shoot control plane.
3 - Networking
3.1 - Enable IPv4/IPv6 (dual-stack) Ingress on AWS
Using IPv4/IPv6 (dual-stack) Ingress in an IPv4 single-stack cluster
Motivation
IPv6 adoption is continuously growing, already overtaking IPv4 in certain regions, e.g. India, or scenarios, e.g. mobile. Even though most IPv6 installations deploy means to reach IPv4, it might still be beneficial to expose services natively via IPv4 and IPv6 instead of just relying on IPv4.
Disadvantages of full IPv4/IPv6 (dual-stack) Deployments
Enabling full IPv4/IPv6 (dual-stack) support in a Kubernetes cluster is a major endeavor. It requires a lot of changes and restarts of all pods so that all pods get addresses for both IP families. A side effect of dual-stack networking is that failures may be hidden, as network traffic may take the other protocol to reach the target. For this reason, and also to keep operational complexity low, service teams might lean towards staying in a single-stack environment as much as possible. Luckily, this is possible with Gardener and IPv4/IPv6 (dual-stack) ingress on AWS.
Simplifying IPv4/IPv6 (dual-stack) Ingress with Protocol Translation on AWS
Fortunately, the network load balancer on AWS supports automatic protocol translation, i.e. it can expose both IPv4 and IPv6 endpoints while communicating with just one protocol to the backends. Client IP address preservation can be achieved by using proxy protocol.
This approach enables users to expose IPv4 workload to IPv6-only clients without having to change the workload/service. Without requiring invasive changes, it allows a fairly simple first step into the IPv6 world for services just requiring ingress (incoming) communication.
Necessary Shoot Cluster Configuration Changes for IPv4/IPv6 (dual-stack) Ingress
To be able to utilize IPv4/IPv6 (dual-stack) Ingress in an IPv4 shoot cluster, the cluster needs to meet two preconditions:
- dualStack.enabled needs to be set to true to configure VPC/subnet for IPv6 and add a routing rule for IPv6. (This does not add IPv6 addresses to Kubernetes nodes.)
- loadBalancerController.enabled needs to be set to true as well to use the load balancer controller, which supports dual-stack ingress.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
...
spec:
  provider:
    type: aws
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      dualStack:
        enabled: true
    controlPlaneConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: ControlPlaneConfig
      loadBalancerController:
        enabled: true
...
When infrastructureConfig.networks.vpc.id is set to the ID of an existing VPC, please make sure that your VPC has an Amazon-provided IPv6 CIDR block added.
After adapting the shoot specification and reconciling the cluster, dual-stack load balancers can be created using Kubernetes Service objects.
Creating an IPv4/IPv6 (dual-stack) Ingress
With the preconditions set, creating an IPv4/IPv6 load balancer is as easy as annotating a service with the correct annotations:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-ip-address-type: dualstack
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
    service.beta.kubernetes.io/aws-load-balancer-type: external
  name: ...
  namespace: ...
spec:
  ...
  type: LoadBalancer
In case the client IP address should be preserved, the following annotation can be used to enable proxy protocol. (The pod receiving the traffic needs to be configured for proxy protocol as well.)
service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
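The annotations above can also be assembled programmatically, for example to produce a merge patch for a Service. The following is a sketch with hypothetical helper names; the annotation keys and values are the ones shown above.

```python
import json


def dual_stack_annotations(preserve_client_ip: bool = False) -> dict:
    """Return the Service annotations for a dual-stack network load balancer."""
    annotations = {
        "service.beta.kubernetes.io/aws-load-balancer-ip-address-type": "dualstack",
        "service.beta.kubernetes.io/aws-load-balancer-scheme": "internet-facing",
        "service.beta.kubernetes.io/aws-load-balancer-nlb-target-type": "instance",
        "service.beta.kubernetes.io/aws-load-balancer-type": "external",
    }
    if preserve_client_ip:
        # Enables proxy protocol; the receiving pod must be configured for it as well.
        annotations["service.beta.kubernetes.io/aws-load-balancer-proxy-protocol"] = "*"
    return annotations


def metadata_patch(preserve_client_ip: bool = False) -> str:
    """Return a JSON merge patch that sets the annotations on a Service."""
    return json.dumps({"metadata": {"annotations": dual_stack_annotations(preserve_client_ip)}})


if __name__ == "__main__":
    print(metadata_patch(preserve_client_ip=True))
```

The printed JSON could, for instance, be passed to kubectl patch with --type merge. Before patching an existing Service, consider the migration notes in this section.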
Please note that changing an existing Service to dual-stack may cause the creation of a new load balancer without deletion of the old AWS load balancer resource. While this helps in a seamless migration by not cutting existing connections, it may lead to wasted/forgotten resources. Therefore, the (manual) cleanup needs to be taken into account when migrating an existing Service instance.
For more details see AWS Load Balancer Documentation - Network Load Balancer.
DNS Considerations to Prevent Downtime During a Dual-Stack Migration
In case the migration of an existing service is desired, please check if there are DNS entries directly linked to the corresponding load balancer. The migrated load balancer will have a new domain name immediately, which will not be ready in the beginning. Therefore, a direct migration of the domain name entries is not desired as it may cause a short downtime, i.e. domain name entries without backing IP addresses.
If there are DNS entries directly linked to the corresponding load balancer and they are managed by the shoot-dns-service, you can identify this via annotations with the prefix dns.gardener.cloud/. Those annotations can be attached to Service, Ingress, or Gateway resources. Alternatively, DNSEntry or DNSAnnotation resources may be used.
For a seamless migration without downtime, use the following three-step approach:
- Temporarily prevent direct DNS updates
- Migrate the load balancer and wait until it is operational
- Allow DNS updates again
To prevent direct updates of the DNS entries when the load balancer is migrated, add the annotation dns.gardener.cloud/ignore: 'true' to all affected resources next to the other dns.gardener.cloud/... annotations before starting the migration. For example, in case of a Service, ensure that the service looks like the following:
kind: Service
metadata:
  annotations:
    dns.gardener.cloud/ignore: 'true'
    dns.gardener.cloud/class: garden
    dns.gardener.cloud/dnsnames: '...'
...
Next, migrate the load balancer to be dual-stack enabled by adding/changing the corresponding annotations.
You have multiple options how to check that the load balancer has been provisioned successfully. It might be useful to peek into status.loadBalancer.ingress of the corresponding Service to identify the load balancer:
- Check in the AWS console for the corresponding load balancer provisioning state.
- Perform domain name lookups with nslookup/dig to check whether the name resolves to an IP address.
- Call your workload via the new load balancer, e.g. using curl --resolve <my-domain-name>:<port>:<IP-address> https://<my-domain-name>:<port>, which allows you to call your service with the “correct” domain name without using actual name resolution.
- Wait a fixed period of time, as load balancer creation is usually finished within 15 minutes.
Once the load balancer has been provisioned, you can remove the annotation dns.gardener.cloud/ignore: 'true' again from the affected resources. It may take some additional time until the domain name change finally propagates (up to one hour).
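The name lookup check from the list above can also be scripted. This is a minimal sketch using only the Python standard library; the function names are chosen for illustration, and whether IPv6 addresses are returned also depends on the resolver configuration of the machine you run it on.

```python
import ipaddress
import socket


def address_families(addresses) -> set:
    """Return the IP versions (4 and/or 6) present in the given address strings."""
    # Strip a possible IPv6 scope suffix such as "%eth0" before parsing.
    return {ipaddress.ip_address(a.split("%")[0]).version for a in addresses}


def resolve(hostname: str) -> list:
    """Resolve a host name to the sorted list of its distinct IP addresses."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def is_dual_stack_ready(hostname: str) -> bool:
    """True if the name currently resolves to both an IPv4 and an IPv6 address."""
    try:
        return address_families(resolve(hostname)) == {4, 6}
    except socket.gaierror:
        # The name does not resolve (yet), e.g. the load balancer is still provisioning.
        return False
```

Calling is_dual_stack_ready with the load balancer host name from status.loadBalancer.ingress in a loop until it returns True is one way to decide when to remove the dns.gardener.cloud/ignore annotation again.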
3.2 - Manage Certificates with Gardener
Manage certificates with Gardener for public domain
Introduction
Dealing with applications on Kubernetes which offer secure service endpoints (e.g. HTTPS) also requires you to enable secured communication via SSL/TLS. With the certificate extension enabled, Gardener can manage commonly trusted X.509 certificates for your application endpoints. Besides initially requesting certificates, it also handles their timely renewal using the free Let’s Encrypt API.
There are two scenarios in which you can use the certificate extension:
- You want to use a certificate for a subdomain of the shoot’s default DNS (see .spec.dns.domain of your shoot resource, e.g. short.ingress.shoot.project.default-domain.gardener.cloud). If this is your case, please see Manage certificates with Gardener for default domain.
- You want to use a certificate for a custom domain. If this is your case, please keep reading this article.
Prerequisites
Before you start this guide there are a few requirements you need to fulfill:
- You have an existing shoot cluster
- Your custom domain is under a public top level domain (e.g. .com)
- Your custom zone is resolvable with a public resolver via the internet (e.g. 8.8.8.8)
- You have a custom DNS provider configured and working (see “DNS Providers”)
As part of the Let’s Encrypt ACME challenge validation process, Gardener sets a DNS TXT entry and Let’s Encrypt checks if it can both resolve and authenticate it. Therefore, it’s important that your DNS entries are publicly resolvable. You can check this by querying e.g. Google’s public DNS server; if it returns an entry, your DNS is publicly visible:
# returns the A record for cert-example.example.com using Google's DNS server (8.8.8.8)
dig cert-example.example.com @8.8.8.8 A
DNS provider
In order to issue certificates for a custom domain you need to specify a DNS provider which is permitted to create DNS records for subdomains of your requested domain in the certificate. For example, if you request a certificate for host.example.com, your DNS provider must be capable of managing subdomains of host.example.com.
DNS providers are normally specified in the shoot manifest. To learn more on how to configure one, please see the DNS provider documentation.
Issue a certificate
Every X.509 certificate is represented by a Kubernetes custom resource certificate.cert.gardener.cloud in your cluster. A Certificate resource may be used to initiate a new certificate request as well as to manage its lifecycle. Gardener’s certificate service regularly checks the expiration timestamp of Certificates, triggers a renewal process if necessary and replaces the existing X.509 certificate with a new one.
Your application should be able to reload replaced certificates in a timely manner to avoid service disruptions.
Certificates can be requested via 4 resource types:
- Ingress
- Service (type LoadBalancer)
- Gateways (both Istio gateways and from the Gateway API)
- Certificate (Gardener CRD)
If any of the first 3 are used, a corresponding Certificate resource will be created automatically.
Using an Ingress Resource
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    # Optional but recommended, this is going to create the DNS entry at the same time
    dns.gardener.cloud/class: garden
    dns.gardener.cloud/ttl: "600"
    #cert.gardener.cloud/commonname: "*.example.com" # optional, if not specified the first name from spec.tls[].hosts is used as common name
    #cert.gardener.cloud/dnsnames: "" # optional, if not specified the names from spec.tls[].hosts are used
    #cert.gardener.cloud/follow-cname: "true" # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/issuer: custom-issuer # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/preferred-chain: "chain name" # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384" # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
spec:
  tls:
    - hosts:
        # Must not exceed 64 characters.
        - amazing.example.com
      # Certificate and private key reside in this secret.
      secretName: tls-secret
  rules:
    - host: amazing.example.com
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: amazing-svc
                port:
                  number: 8080
Replace the hosts and rules[].host values with your own domain and adjust the remaining Ingress attributes in accordance with your deployment (e.g. the above is for an istio Ingress controller and forwards traffic to the service amazing-svc on port 8080).
Using a Service of type LoadBalancer
apiVersion: v1
kind: Service
metadata:
  annotations:
    cert.gardener.cloud/secretname: tls-secret
    dns.gardener.cloud/dnsnames: example.example.com
    dns.gardener.cloud/class: garden
    # Optional
    dns.gardener.cloud/ttl: "600"
    cert.gardener.cloud/commonname: "*.example.example.com"
    cert.gardener.cloud/dnsnames: ""
    #cert.gardener.cloud/follow-cname: "true" # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/issuer: custom-issuer # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/preferred-chain: "chain name" # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384" # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
  name: test-service
  namespace: default
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 8080
  type: LoadBalancer
Using a Gateway resource
Please see Istio Gateways or Gateway API for details.
Using the custom Certificate resource
apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-example
  namespace: default
spec:
  commonName: amazing.example.com
  secretRef:
    name: tls-secret
    namespace: default
  # Optional if using the default issuer
  issuerRef:
    name: garden
  # If delegated domain for DNS01 challenge should be used. This has only an effect if a CNAME record is set for
  # '_acme-challenge.amazing.example.com'.
  # For example: If a CNAME record exists '_acme-challenge.amazing.example.com' => '_acme-challenge.writable.domain.com',
  # the DNS challenge will be written to '_acme-challenge.writable.domain.com'.
  #followCNAME: true
  # optionally set labels for the secret
  #secretLabels:
  #  key1: value1
  #  key2: value2
  # Optionally specify the preferred certificate chain: if the CA offers multiple certificate chains, prefer the chain with an issuer matching this Subject Common Name. If no match, the default offered chain will be used.
  #preferredChain: "ISRG Root X1"
  # Optionally specify algorithm and key size for private key. Allowed algorithms: "RSA" (allowed sizes: 2048, 3072, 4096) and "ECDSA" (allowed sizes: 256, 384)
  # If not specified, RSA with 2048 is used.
  #privateKey:
  #  algorithm: ECDSA
  #  size: 384
Supported attributes
Here is a list of all supported annotations regarding the certificate extension:
Path | Annotation | Value | Required | Description |
---|---|---|---|---|
N/A | cert.gardener.cloud/purpose: | managed | Yes when using annotations | Flag for Gardener that this specific Ingress or Service requires a certificate |
spec.commonName | cert.gardener.cloud/commonname: | E.g. “*.demo.example.com” or “special.example.com” | Certificate and Ingress : No Service: Yes, if DNS names unset | Specifies for which domain the certificate request will be created. If not specified, the names from spec.tls[].hosts are used. This entry must comply with the 64 character limit. |
spec.dnsNames | cert.gardener.cloud/dnsnames: | E.g. “special.example.com” | Certificate and Ingress : No Service: Yes, if common name unset | Additional domains the certificate should be valid for (Subject Alternative Name). If not specified, the names from spec.tls[].hosts are used. Entries in this list can be longer than 64 characters. |
spec.secretRef.name | cert.gardener.cloud/secretname: | any-name | Yes for certificate and Service | Specifies the secret which contains the certificate/key pair. If the secret is not available yet, it’ll be created automatically as soon as the certificate has been issued. |
spec.issuerRef.name | cert.gardener.cloud/issuer: | E.g. gardener | No | Specifies the issuer you want to use. Only necessary if you request certificates for custom domains. |
N/A | cert.gardener.cloud/revoked: | true otherwise always false | No | Use only to revoke a certificate, see reference for more details |
spec.followCNAME | cert.gardener.cloud/follow-cname | E.g. true | No | Specifies that the usage of a delegated domain for DNS challenges is allowed. Details see Follow CNAME. |
spec.preferredChain | cert.gardener.cloud/preferred-chain | E.g. ISRG Root X1 | No | Specifies the Common Name of the issuer for selecting the certificate chain. Details see Preferred Chain. |
spec.secretLabels | cert.gardener.cloud/secret-labels | for annotation use e.g. key1=value1,key2=value2 | No | Specifies labels for the certificate secret. |
spec.privateKey.algorithm | cert.gardener.cloud/private-key-algorithm | RSA , ECDSA | No | Specifies algorithm for private key generation. The default value is depending on configuration of the extension (default of the default is RSA ). You may request a new certificate without privateKey settings to find out the concrete defaults in your Gardener. |
spec.privateKey.size | cert.gardener.cloud/private-key-size | "256" , "384" , "2048" , "3072" , "4096" | No | Specifies size for private key generation. Allowed values for RSA are 2048 , 3072 , and 4096 . For ECDSA allowed values are 256 and 384 . The default values are depending on the configuration of the extension (defaults of the default values are 3072 for RSA and 384 for ECDSA respectively). |
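The allowed combinations of private key algorithm and size from the table can be checked before a manifest is applied. The sketch below is an illustration with hypothetical function names; the allowed values and annotation keys are taken from the table above.

```python
ALLOWED_KEY_SIZES = {
    "RSA": {2048, 3072, 4096},
    "ECDSA": {256, 384},
}


def validate_private_key(algorithm: str, size: int) -> None:
    """Raise ValueError if the algorithm/size combination is not listed in the table."""
    if algorithm not in ALLOWED_KEY_SIZES:
        raise ValueError(f"unsupported algorithm {algorithm!r}, expected RSA or ECDSA")
    if size not in ALLOWED_KEY_SIZES[algorithm]:
        allowed = sorted(ALLOWED_KEY_SIZES[algorithm])
        raise ValueError(f"unsupported size {size} for {algorithm}, expected one of {allowed}")


def private_key_annotations(algorithm: str, size: int) -> dict:
    """Return the two private key annotations; annotation values must be strings."""
    validate_private_key(algorithm, size)
    return {
        "cert.gardener.cloud/private-key-algorithm": algorithm,
        "cert.gardener.cloud/private-key-size": str(size),
    }
```

Note that the size is converted to a string, which matches the quoted values such as "384" used in the annotation examples.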
Request a wildcard certificate
In order to avoid the creation of multiple certificates for every single endpoint, you may want to create a wildcard certificate for your shoot’s default cluster.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    cert.gardener.cloud/commonname: "*.example.com"
spec:
  tls:
    - hosts:
        - amazing.example.com
      secretName: tls-secret
  rules:
    - host: amazing.example.com
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: amazing-svc
                port:
                  number: 8080
Please note that this can also be achieved by directly adding an annotation to a Service of type LoadBalancer. You could also create a Certificate object with a wildcard domain.
Using a custom Issuer
Most Gardener deployments with the certificate extension enabled have a preconfigured garden issuer. It is also usually configured to use Let’s Encrypt as the certificate provider.
If you need a custom issuer for a specific cluster, please see Using a custom Issuer.
Quotas
For security reasons there may be a default quota on the certificate requests per day set globally in the controller registration of the shoot-cert-service.
The default quota only applies if there is no explicit quota defined for the issuer itself with the field requestsPerDayQuota, e.g.:
kind: Shoot
...
spec:
  extensions:
    - type: shoot-cert-service
      providerConfig:
        apiVersion: service.cert.extensions.gardener.cloud/v1alpha1
        kind: CertConfig
        issuers:
          - email: your-email@example.com
            name: custom-issuer # issuer name must be specified in every custom issuer request, must not be "garden"
            server: 'https://acme-v02.api.letsencrypt.org/directory'
            requestsPerDayQuota: 10
DNS Propagation
As stated before, cert-manager uses the ACME challenge protocol to verify that you are the DNS owner of the domain for which you are requesting a certificate.
This works by creating a DNS TXT record in your DNS provider under _acme-challenge.example.example.com containing a token to compare with. The TXT record is only applied during the domain validation.
Typically, the record is propagated within a few minutes. But if the record is not visible to the ACME server for any reason, the certificate request is retried after several minutes.
This means you may have to wait up to one hour after the propagation problem has been resolved before the certificate request is retried. Take a look at the events with kubectl describe ingress example for troubleshooting.
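When troubleshooting propagation, it helps to know which TXT record name to query. The following sketch derives it from the requested domain; the function name is hypothetical, and the sketch does not follow a delegated domain configured via followCNAME. For a wildcard name, the DNS-01 challenge record is placed on the base domain without the leading "*." label.

```python
def acme_challenge_record(domain: str) -> str:
    """Return the TXT record name used for the DNS-01 challenge of a domain."""
    name = domain.rstrip(".")
    if name.startswith("*."):
        # Wildcard certificates are validated on the base domain.
        name = name[2:]
    return f"_acme-challenge.{name}"


if __name__ == "__main__":
    print(acme_challenge_record("example.example.com"))
    print(acme_challenge_record("*.example.com"))
```

The resulting name can then be queried with a TXT lookup against a public resolver, analogous to the dig check in the prerequisites, while the validation is in progress.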
Character Restrictions
Because the common name is restricted to 64 characters, you may need to leave the common name unset for longer domain names.
For example, the following request is invalid:
apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-invalid
  namespace: default
spec:
  commonName: morethan64characters.ingress.shoot.project.default-domain.gardener.cloud
But it is valid to request a certificate for this domain if you have left the common name unset:
apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-example
  namespace: default
spec:
  dnsNames:
    - morethan64characters.ingress.shoot.project.default-domain.gardener.cloud
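The rule above can be expressed as a small helper that decides whether a domain fits into the common name or has to go into dnsNames only. This is a sketch with a hypothetical function name; it produces just the name-related part of a Certificate spec, so fields such as secretRef still need to be added.

```python
MAX_COMMON_NAME_LENGTH = 64


def certificate_names(domains: list) -> dict:
    """Return commonName/dnsNames for the given domains, respecting the 64 character limit."""
    if not domains:
        raise ValueError("at least one domain is required")
    first, *rest = domains
    if len(first) <= MAX_COMMON_NAME_LENGTH:
        spec = {"commonName": first}
        if rest:
            spec["dnsNames"] = rest
        return spec
    # The first name is too long for a common name, so list all names as DNS names only.
    return {"dnsNames": list(domains)}


if __name__ == "__main__":
    long_name = "morethan64characters.ingress.shoot.project.default-domain.gardener.cloud"
    print(len(long_name), certificate_names([long_name]))
```

For the 72 character example name from above, the helper returns only dnsNames, which corresponds to the valid request shown.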
3.3 - Manage Certificates with Gardener for Default Domain
Manage certificates with Gardener for default domain
Introduction
Dealing with applications on Kubernetes which offer secure service endpoints (e.g. HTTPS) also requires you to enable secured communication via SSL/TLS. With the certificate extension enabled, Gardener can manage commonly trusted X.509 certificates for your application endpoints. Besides initially requesting certificates, it also handles their timely renewal using the free Let’s Encrypt API.
There are two scenarios in which you can use the certificate extension:
- You want to use a certificate for a subdomain of the shoot’s default DNS (see .spec.dns.domain of your shoot resource, e.g. short.ingress.shoot.project.default-domain.gardener.cloud). If this is your case, please keep reading this article.
- You want to use a certificate for a custom domain. If this is your case, please see Manage certificates with Gardener for public domain.
Prerequisites
Before you start this guide there are a few requirements you need to fulfill:
- You have an existing shoot cluster
Since you are using the default DNS name, all DNS configuration should already be done and ready.
Issue a certificate
Every X.509 certificate is represented by a Kubernetes custom resource certificate.cert.gardener.cloud in your cluster. A Certificate resource may be used to initiate a new certificate request as well as to manage its lifecycle. Gardener’s certificate service regularly checks the expiration timestamp of Certificates, triggers a renewal process if necessary and replaces the existing X.509 certificate with a new one.
Your application should be able to reload replaced certificates in a timely manner to avoid service disruptions.
Certificates can be requested via 3 resource types:
- Ingress
- Service (type LoadBalancer)
- Certificate (Gardener CRD)
If either of the first 2 is used, a corresponding Certificate resource will automatically be created.
Using an Ingress Resource
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    #cert.gardener.cloud/issuer: custom-issuer # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/follow-cname: "true" # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/preferred-chain: "chain name" # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384" # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
spec:
  tls:
    - hosts:
        # Must not exceed 64 characters.
        - short.ingress.shoot.project.default-domain.gardener.cloud
      # Certificate and private key reside in this secret.
      secretName: tls-secret
  rules:
    - host: short.ingress.shoot.project.default-domain.gardener.cloud
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: amazing-svc
                port:
                  number: 8080
Using a Service of type LoadBalancer
apiVersion: v1
kind: Service
metadata:
  annotations:
    cert.gardener.cloud/purpose: managed
    # Certificate and private key reside in this secret.
    cert.gardener.cloud/secretname: tls-secret
    # You may add more domains separated by commas (e.g. "service.shoot.project.default-domain.gardener.cloud, amazing.shoot.project.default-domain.gardener.cloud")
    dns.gardener.cloud/dnsnames: "service.shoot.project.default-domain.gardener.cloud"
    dns.gardener.cloud/ttl: "600"
    #cert.gardener.cloud/issuer: custom-issuer # optional to specify custom issuer (use namespace/name for shoot issuers)
    #cert.gardener.cloud/follow-cname: "true" # optional, same as spec.followCNAME in certificates
    #cert.gardener.cloud/secret-labels: "key1=value1,key2=value2" # optional labels for the certificate secret
    #cert.gardener.cloud/preferred-chain: "chain name" # optional to specify preferred-chain (value is the Subject Common Name of the root issuer)
    #cert.gardener.cloud/private-key-algorithm: ECDSA # optional to specify algorithm for private key, allowed values are 'RSA' or 'ECDSA'
    #cert.gardener.cloud/private-key-size: "384" # optional to specify size of private key, allowed values for RSA are "2048", "3072", "4096" and for ECDSA "256" and "384"
  name: test-service
  namespace: default
spec:
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 8080
  type: LoadBalancer
Using the custom Certificate resource
apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: cert-example
  namespace: default
spec:
  commonName: short.ingress.shoot.project.default-domain.gardener.cloud
  secretRef:
    name: tls-secret
    namespace: default
  # Optional if using the default issuer
  issuerRef:
    name: garden
To follow the progress of your request, consult the description of the resource, in particular the status attribute if the issuance failed.
Request a wildcard certificate
To avoid creating a separate certificate for every single endpoint, you may want to create a wildcard certificate for your shoot’s default cluster.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    cert.gardener.cloud/purpose: managed
    cert.gardener.cloud/commonName: "*.ingress.shoot.project.default-domain.gardener.cloud"
spec:
  tls:
  - hosts:
    - amazing.ingress.shoot.project.default-domain.gardener.cloud
    secretName: tls-secret
  rules:
  - host: amazing.ingress.shoot.project.default-domain.gardener.cloud
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080
Please note that this can also be achieved by directly adding an annotation to a Service of type LoadBalancer. You could also create a Certificate object with a wildcard domain.
More information
For more information and more examples about using the certificate extension, please see Manage certificates with Gardener for public domain
3.4 - Managing DNS with Gardener
Request DNS Names in Shoot Clusters
Introduction
Within a shoot cluster, it is possible to request DNS records via the following resource types:
- Ingress
- Service
- Gateway (Istio or Gateway API)
- DNSEntry
It is necessary that the Gardener installation your shoot cluster runs in is equipped with a shoot-dns-service
extension. This extension uses the seed’s DNS management infrastructure to maintain DNS names for shoot clusters. Please ask your Gardener operator if the extension is available in your environment.
Shoot Feature Gate
In some Gardener setups the shoot-dns-service
extension is not enabled globally and thus must be configured per shoot cluster. Please adapt the shoot specification by the configuration shown below to activate the extension individually.
kind: Shoot
...
spec:
  extensions:
  - type: shoot-dns-service
...
Before you start
You should:
- Have created a shoot cluster
- Have created and correctly configured a DNS Provider (Please consult this page for more information)
- Have a basic understanding of DNS (see link under References)
There are two types of DNS that you can use within Kubernetes:
- internal (usually managed by CoreDNS)
- external (managed by a public DNS provider).
This page and the extension deal exclusively with external DNS handling.
Gardener allows two ways of managing your external DNS:
- Manually, which means you are in charge of creating / maintaining your Kubernetes related DNS entries
- Via the Gardener DNS extension
Gardener DNS extension
The managed external DNS records feature of the Gardener clusters makes all this easier. You do not need DNS service provider specific knowledge, and in fact you do not need to leave your cluster at all to achieve that. You simply annotate the Ingress / Service that needs its DNS records managed, and the records are created and managed automatically by Gardener.
Managed external DNS records are supported with the following DNS provider types:
- aws-route53
- azure-dns
- azure-private-dns
- google-clouddns
- openstack-designate
- alicloud-dns
- cloudflare-dns
Request DNS records for Ingress resources
To request a DNS name for an Ingress, Service, or Gateway (Istio or Gateway API) object in the shoot cluster, the object must be annotated with the DNS class garden and an annotation denoting the desired DNS names.
Example for an annotated Ingress resource:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: amazing-ingress
  annotations:
    # Let Gardener manage external DNS records for this Ingress.
    dns.gardener.cloud/dnsnames: special.example.com # Use "*" to collect domain names from .spec.rules[].host
    dns.gardener.cloud/ttl: "600"
    dns.gardener.cloud/class: garden
    # If you are delegating the certificate management to Gardener, uncomment the following line
    #cert.gardener.cloud/purpose: managed
spec:
  rules:
  - host: special.example.com
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: amazing-svc
            port:
              number: 8080
  # Uncomment the following part if you are delegating the certificate management to Gardener
  #tls:
  #  - hosts:
  #      - special.example.com
  #    secretName: my-cert-secret-name
For an Ingress, the DNS names are already declared in the specification. Nevertheless the dnsnames annotation must be present. Here a subset of the DNS names of the ingress can be specified. If DNS names for all names are desired, the value all
can be used.
Keep in mind that ingress resources are ignored unless an ingress controller is set up. Gardener does not provide an ingress controller by default. For more details, see Ingress Controllers and Service in the Kubernetes documentation.
Request DNS records for service type LoadBalancer
Example for an annotated Service resource (it must have the type LoadBalancer):
apiVersion: v1
kind: Service
metadata:
  name: amazing-svc
  annotations:
    # Let Gardener manage external DNS records for this Service.
    dns.gardener.cloud/dnsnames: special.example.com
    dns.gardener.cloud/ttl: "600"
    dns.gardener.cloud/class: garden
spec:
  selector:
    app: amazing-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  type: LoadBalancer
Request DNS records for Gateway resources
Please see Istio Gateways or Gateway API for details.
Creating a DNSEntry resource explicitly
It is also possible to create a DNS entry via the Kubernetes resource called DNSEntry
:
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSEntry
metadata:
  annotations:
    # Let Gardener manage this DNS entry.
    dns.gardener.cloud/class: garden
  name: special-dnsentry
  namespace: default
spec:
  dnsName: special.example.com
  ttl: 600
  targets:
  - 1.2.3.4
If one of the accepted DNS names is a direct subdomain of the shoot’s ingress domain, this is already handled by the standard wildcard entry for the ingress domain. Therefore, this name should be excluded from the dnsnames list in the annotation. If only this DNS name is configured in the ingress, no explicit DNS entry is required, and the DNS annotations should be omitted entirely.
You can check the status of the DNSEntry
with
$ kubectl get dnsentry
NAME         DNS                   TYPE          PROVIDER      STATUS   AGE
mydnsentry   special.example.com   aws-route53   default/aws   Ready    24s
As soon as the status of the entry is Ready
, the provider has accepted the new DNS record. Depending on the provider and your DNS settings and cache, it may take up to 24 hours for the new entry to propagate across the internet.
More examples can be found here
Request DNS records for Service/Ingress resources using a DNSAnnotation resource
In rare cases it may not be possible to add annotations to a Service
or Ingress
resource object.
For example, the Helm chart used to deploy the resource may not be adaptable, or some automation may always restore the original content of the resource object, dropping any additional annotations.
In these cases, it is recommended to use an additional DNSAnnotation
resource in order to have more flexibility than DNSEntry resources
. The DNSAnnotation
resource makes the DNS shoot service behave as if annotations have been added to the referenced resource.
For the Ingress example shown above, you can alternatively create a DNSAnnotation resource to provide the annotations.
apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSAnnotation
metadata:
  annotations:
    dns.gardener.cloud/class: garden
  name: test-ingress-annotation
  namespace: default
spec:
  resourceRef:
    kind: Ingress
    apiVersion: networking.k8s.io/v1
    name: test-ingress
    namespace: default
  annotations:
    dns.gardener.cloud/dnsnames: '*'
    dns.gardener.cloud/class: garden
Note that the DNSAnnotation resource itself needs the dns.gardener.cloud/class=garden
annotation. This also only works for annotations known to the DNS shoot service (see Accepted External DNS Records Annotations).
For more details, see also DNSAnnotation objects
Accepted External DNS Records Annotations
Here are all of the accepted annotations related to the DNS extension:
Annotation | Description |
---|---|
dns.gardener.cloud/dnsnames | Mandatory for service and ingress resources, accepts a comma-separated list of DNS names if multiple names are required. For ingress you can use the special value '*' . In this case, the DNS names are collected from .spec.rules[].host . |
dns.gardener.cloud/class | Mandatory, in the context of the shoot-dns-service it must always be set to garden . |
dns.gardener.cloud/ttl | Recommended, overrides the default Time-To-Live of the DNS record. |
dns.gardener.cloud/cname-lookup-interval | Only relevant if multiple domain name targets are specified. It specifies the lookup interval for CNAMEs to map them to IP addresses (in seconds) |
dns.gardener.cloud/realms | Internal, for restricting provider access for shoot DNS entries. Typically not set by users of the shoot-dns-service. |
dns.gardener.cloud/ip-stack | Only relevant for provider type aws-route53 if the target is an AWS load balancer domain name. Can be set for service, ingress and DNSEntry resources. It specifies which DNS records with alias targets are created instead of the usual CNAME records. If the annotation is not set (or has the value ipv4), only an A record is created. With value dual-stack, both A and AAAA records are created. With value ipv6, only an AAAA record is created. |
service.beta.kubernetes.io/aws-load-balancer-ip-address-type=dualstack | For services, behaves similarly to dns.gardener.cloud/ip-stack=dual-stack. |
loadbalancer.openstack.org/load-balancer-address | Internal, for services only: support for PROXY protocol on Openstack (which needs a hostname as ingress). Typically not set by users of the shoot-dns-service. |
If one of the accepted DNS names is a direct subdomain of the shoot’s ingress domain, this is already handled by the standard wildcard entry for the ingress domain. Therefore, this name should be excluded from the dnsnames list in the annotation. If only this DNS name is configured in the ingress, no explicit DNS entry is required, and the DNS annotations should be omitted entirely.
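To illustrate how the dnsnames and class annotations interact, here is a minimal sketch in Python. It is not the implementation used by the shoot-dns-service; the function `resolve_dns_names` and its arguments are hypothetical. It only mirrors the rules stated on this page: the class must be garden, the dnsnames value is a comma-separated list, and the special values '*' or all select every host from .spec.rules[].host.

```python
DNSNAMES = "dns.gardener.cloud/dnsnames"
DNSCLASS = "dns.gardener.cloud/class"


def resolve_dns_names(annotations: dict, rule_hosts: list) -> list:
    """Return the DNS names requested by the annotations of an Ingress.

    rule_hosts is the list of hosts taken from .spec.rules[].host.
    """
    # Without the class annotation set to "garden", nothing is requested.
    if annotations.get(DNSCLASS) != "garden":
        return []
    value = annotations.get(DNSNAMES, "").strip()
    # Special values described on this page: take all hosts from the rules.
    if value in ("*", "all"):
        return list(rule_hosts)
    # Otherwise, split the comma-separated list and drop empty entries.
    return [name.strip() for name in value.split(",") if name.strip()]


if __name__ == "__main__":
    annotations = {DNSCLASS: "garden", DNSNAMES: "*"}
    print(resolve_dns_names(annotations, ["special.example.com"]))
```

The sketch does not filter out names covered by the shoot’s wildcard ingress entry; that exclusion remains a manual step as described above.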
Troubleshooting
General DNS tools
To check the DNS resolution, use the nslookup
or dig
command.
$ nslookup special.your-domain.com
or with dig
$ dig +short special.example.com
Depending on your network settings, you may get a successful response faster using a public DNS server (e.g. 8.8.8.8, 8.8.4.4, or 1.1.1.1):
$ dig @8.8.8.8 +short special.example.com
DNS record events
The DNS controller publishes Kubernetes events for the resource which requested the DNS record (Ingress, Service, DNSEntry). These events reveal more information about the DNS requests being processed and are especially useful to check any kind of misconfiguration, e.g. requests for a domain you don’t own.
Events for a successfully created DNS record:
$ kubectl describe service my-service
Events:
  Type    Reason          Age                From                    Message
  ----    ------          ----               ----                    -------
  Normal  dns-annotation  19s                dns-controller-manager  special.example.com: dns entry is pending
  Normal  dns-annotation  19s (x3 over 19s)  dns-controller-manager  special.example.com: dns entry pending: waiting for dns reconciliation
  Normal  dns-annotation  9s (x3 over 10s)   dns-controller-manager  special.example.com: dns entry active
Please note, events vanish after their retention period (usually 1h
).
DNSEntry status
DNSEntry
resources offer a .status
sub-resource which can be used to check the current state of the object.
Status of an erroneous DNSEntry:
status:
  message: No responsible provider found
  observedGeneration: 3
  provider: remote
  state: Error
References
4 - Administer Client (Shoot) Clusters
4.1 - Scalability of Gardener Managed Kubernetes Clusters
Have you ever wondered how much more your Kubernetes cluster can scale before it breaks down?
Of course, the answer is heavily dependent on your workloads. But be assured, any cluster will break eventually. Therefore, the best mitigation is to plan for sharding early and run multiple clusters instead of trying to optimize everything hoping to survive with a single cluster. Still, it is helpful to know when the time has come to scale out. This document aims at giving you the basic knowledge to keep a Gardener-managed Kubernetes cluster up and running while it scales according to your needs.
Welcome to Planet Scale, Please Mind the Gap!
For a complex, distributed system like Kubernetes it is impossible to give absolute thresholds for its scalability. Instead, the limit of a cluster’s scalability is a combination of various, interconnected dimensions.
Let’s take a rather simple example of two dimensions - the number of Pods
per Node
and number of Nodes
in a cluster. According to the scalability thresholds documentation, Kubernetes can scale up to 5000 Nodes
and with default settings accommodate a maximum of 110 Pods
on a single Node
. Pushing only a single dimension towards its limit will likely harm the cluster. But if both are pushed simultaneously, any cluster will break way before reaching one dimension’s limit.
What sounds rather straightforward in theory can be a bit trickier in reality. While 110 Pods
is the default limit, we successfully pushed beyond that and in certain cases run up to 200 Pods
per Node
without breaking the cluster. This is possible in an environment where one knows and controls all workloads and cluster configurations. It still requires careful testing, though, and comes at the cost of limiting the scalability of other dimensions, like the number of Nodes
.
Of course, a Kubernetes cluster has a plethora of dimensions. Thus, when looking at a simple question like “How many resources can I store in ETCD?”, the only meaningful answer must be: “it depends”.
The following sections will help you to identify relevant dimensions and how they affect a Gardener-managed Kubernetes cluster’s scalability.
“Official” Kubernetes Thresholds and Scalability Considerations
To get started with the topic, please check the basic guidance provided by the Kubernetes community (specifically SIG Scalability):
Furthermore, the problem space has been discussed in a KubeCon talk, the slides for which can be found here. You should at least read the slides before continuing.
Essentially, it comes down to this:
If you promise to:
- correctly configure your cluster
- use extensibility features “reasonably”
- keep the load in the cluster within recommended limits
Then we promise that your cluster will function properly.
With that knowledge in mind, let’s look at Gardener and eventually pick up the question about the number of objects in ETCD raised above.
Gardener-Specific Considerations
The following considerations are based on experience with various large clusters that scaled in different dimensions. Just as explained above, pushing beyond even one of the limits is likely (though not guaranteed) to cause issues at some point in time. Depending on the setup of your workloads, however, it might work unexpectedly well. Nevertheless, we urge you to make conscious decisions and to consider sharding your workloads instead. Please keep in mind that your workload significantly affects the overall stability and scalability of a cluster.
ETCD
The following section is based on a setup where ETCD Pods
run on a dedicated Node
pool and each Node
has at least 8 vCPU and 32GB memory.
ETCD has a practical space limit of 8 GB. It caps the number of objects one can technically have in a Kubernetes cluster.
Of course, the number is heavily influenced by each object’s size, especially when considering that secrets and configmaps may store up to 1MB of data. Another dimension is a cluster’s churn rate. Since ETCD stores a history of the keyspace, a higher churn rate reduces the number of objects that fit within the space limit. Gardener runs compaction every 30min and defragmentation once per day during a cluster’s maintenance window to ensure proper ETCD operations. However, it is still possible to overload ETCD. If the space limit is reached, ETCD will only accept READ or DELETE requests, and manual interaction by a Gardener operator is needed to disarm the alarm once the DB size is below the threshold again.
To avoid such a situation, you can monitor the current ETCD usage via the “ETCD” dashboard of the monitoring stack. It gives you the current DB size, as well as historical data for the past 2 weeks. While there are improvements planned to trigger compaction and defragmentation based on DB size, an ETCD should not grow up to this threshold. A typical, healthy DB size is less than 3 GB.
Furthermore, the dashboard has a panel called “Memory”, which indicates the memory usage of the etcd pod(s). Using more than 16GB memory is a clear red flag, and you should reduce the load on ETCD.
Another dimension you should be aware of is the object count in ETCD. You can check it via the “API Server” dashboard, which features an “ETCD Object Counts By Resource” panel. The overall number of objects (excluding events
, as they are stored in a different etcd instance) should not exceed 100k for most use cases.
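The ETCD guidance in this section can be condensed into a small check. The sketch below is written in Python under stated assumptions: the function `etcd_warnings` and its parameters are hypothetical, and the values are expected to be read manually from the dashboards mentioned above. Only the thresholds (8 GB space limit, 3 GB typical healthy size, 16 GB memory, 100k objects) come from this guide.

```python
# Thresholds taken from this guide.
SPACE_LIMIT_GB = 8        # practical ETCD space limit
HEALTHY_DB_SIZE_GB = 3    # a typical, healthy DB stays below this
MEMORY_RED_FLAG_GB = 16   # memory usage above this is a red flag
MAX_OBJECT_COUNT = 100_000  # overall object count, excluding events


def etcd_warnings(db_size_gb: float, memory_gb: float, object_count: int) -> list:
    """Return human-readable warnings for values outside the recommendations."""
    warnings = []
    if db_size_gb >= SPACE_LIMIT_GB:
        warnings.append("DB size reached the space limit; only READ/DELETE requests are accepted")
    elif db_size_gb >= HEALTHY_DB_SIZE_GB:
        warnings.append("DB size is above the typical healthy size")
    if memory_gb > MEMORY_RED_FLAG_GB:
        warnings.append("memory usage is a red flag; reduce the load on ETCD")
    if object_count > MAX_OBJECT_COUNT:
        warnings.append("object count exceeds the recommended maximum")
    return warnings


if __name__ == "__main__":
    # Example values, not measurements from a real cluster.
    for warning in etcd_warnings(db_size_gb=4.2, memory_gb=12.0, object_count=120_000):
        print(warning)
```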
Kube API Server
The following section is based on a setup where kube-apiserver
instances run as Pods
and are scheduled to Nodes
with at least 8 vCPU and 32GB memory.
Gardener can scale the Deployment
of a kube-apiserver
horizontally and vertically. Horizontal scaling is limited to a certain number of replicas and should not concern a stakeholder much. However, the CPU / memory consumption of an individual kube-apiserver
pod poses a potential threat to the overall availability of your cluster. The vertical scaling of any kube-apiserver
is limited by the amount of resources available on a single Node
. Outgrowing the resources of a Node
will cause downtime and render the cluster unavailable.
In general, continuous CPU usage of up to 3 cores and 16 GB memory per kube-apiserver
pod is considered to be safe. This gives some room to absorb spikes, for example when the caches are initialized. You can check the resource consumption by selecting kube-apiserver
Pods
in the “Kubernetes Pods
” dashboard. If these boundaries are exceeded constantly, you need to investigate and derive measures to lower the load.
Further information is also recorded and made available through the monitoring stack. The dashboard “API Server Request Duration and Response Size” provides insights into the request processing time of kube-apiserver
Pods
. Related information like request rates, dropped requests or termination codes (e.g., 429
for too many requests) can be obtained from the dashboards “API Server” and “Kubernetes API Server Details”. They provide a good indicator for how well the system is dealing with its current load.
Reducing the load on the API servers can become a challenge. To get started, you may try to:
- Use immutable secrets and configmaps where possible to save watches. This pays off, especially when you have a high number of Nodes or just lots of secrets in general.
- Applications interacting with the K8s API: If you know an object by its name, use it. Using label selector queries is expensive, as the filtering happens only within the kube-apiserver and not etcd, hence all resources must first pass completely from etcd to kube-apiserver.
- Use (single object) caches within your controllers. Check the “Use cache for ShootStates in Gardenlet” issue for an example.
Nodes
When talking about the scalability of a Kubernetes cluster, Nodes
are probably mentioned in the first place… well, obviously not in this guide. While vanilla Kubernetes lists 5000 Nodes
as its upper limit, pushing that dimension is not feasible. Most clusters should run with fewer than 300 Nodes
. But of course, the actual limit depends on the workloads deployed and can be lower or higher. As you scale your cluster, be extra careful and closely monitor ETCD and kube-apiserver
.
The scalability of Nodes
is subject to a range of limiting factors. Some of them can only be defined upon cluster creation and remain immutable during a cluster lifetime. So let’s discuss the most important dimensions.
CIDR:
Upon cluster creation, you have to specify or use the default values for several network segments. There are dedicated CIDRs for services, Pods
, and Nodes
. Each defines a range of IP addresses available for the individual resource type. Obviously, the maximum of possible Nodes
is capped by the CIDR for Nodes
.
However, there is a second limiting factor, which is the pod CIDR combined with the nodeCIDRMaskSize. This mask is used to divide the pod CIDR into smaller subnets, where each block gets assigned to a node. With a /16 pod network and a /24 nodeCIDRMaskSize, a cluster can scale up to 256 Nodes, because a /16 network contains 2^(24-16) = 256 subnets of size /24. Please check Shoot Networking for details.
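The relationship between the pod CIDR and the nodeCIDRMaskSize can be computed directly. The following Python sketch uses only the standard library; the function name `max_nodes` and the example CIDR `10.96.0.0/16` are assumptions for illustration and are not Gardener defaults.

```python
import ipaddress


def max_nodes(pod_cidr: str, node_cidr_mask_size: int) -> int:
    """Number of per-node subnets that fit into the pod CIDR."""
    network = ipaddress.ip_network(pod_cidr)
    if not network.prefixlen <= node_cidr_mask_size <= network.max_prefixlen:
        raise ValueError("nodeCIDRMaskSize must lie between the pod CIDR prefix and the address length")
    # Each additional prefix bit doubles the number of available subnets.
    return 2 ** (node_cidr_mask_size - network.prefixlen)


if __name__ == "__main__":
    print(max_nodes("10.96.0.0/16", 24))  # prints 256
```

Note that the result is only an upper bound from the pod network; the CIDR for Nodes caps the cluster size independently.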
Even though a /24 nodeCIDRMaskSize translates to a theoretical 256 pod IP addresses per Node, the maxPods setting should be less than half of this value. This gives the system some breathing room for churn and minimizes the risk of strange effects like mis-routed packets caused by immediate re-use of IPs.
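The recommendation for maxPods can be expressed as a simple check. This is a minimal Python sketch assuming IPv4 pod networks; the function names are hypothetical, and the rule encoded is the one from the paragraph above (maxPods below half of the pod IP addresses available per Node).

```python
def pod_ips_per_node(node_cidr_mask_size: int) -> int:
    """Theoretical number of pod IP addresses in one per-node IPv4 subnet."""
    return 2 ** (32 - node_cidr_mask_size)


def max_pods_is_safe(max_pods: int, node_cidr_mask_size: int) -> bool:
    """True if maxPods stays below half of the per-node pod IP addresses."""
    return max_pods < pod_ips_per_node(node_cidr_mask_size) / 2


if __name__ == "__main__":
    # The default of 110 Pods fits a /24 subnet; 200 Pods would need a larger subnet.
    print(max_pods_is_safe(110, 24), max_pods_is_safe(200, 24), max_pods_is_safe(200, 23))
```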
Cloud provider capacity:
Most of the time, Nodes
in Kubernetes translate to virtual machines on a hyperscaler. An attempt to add more Nodes
to a cluster might fail due to capacity issues resulting in an error message like this:
Cloud provider message - machine codes error: code = [Internal] message = [InsufficientInstanceCapacity: We currently do not have sufficient <instance type> capacity in the Availability Zone you requested. Our system will be working on provisioning additional capacity.
In heavily utilized regions, individual clusters are competing for scarce resources. So before choosing a region / zone, try to ensure that the hyperscaler supports your anticipated growth. This might be done through quota requests or by contacting the respective support teams.
To mitigate such a situation, you may configure a worker pool with a different Node
type and a corresponding priority expander as part of a shoot’s autoscaler section. Please consult the Autoscaler FAQ for more details.
Rolling of Node
pools:
The overall number of Nodes
affects the duration of a cluster’s maintenance. When upgrading a Node
pool to a new OS image or Kubernetes version, all machines will be drained and deleted, and replaced with new ones. The more Nodes
a cluster has, the longer this process will take, given that workloads are typically protected by PodDisruptionBudgets
. Check Shoot Updates and Upgrades for details. Be sure to take this into consideration when planning maintenance.
Root disk:
You should be aware that the Node
configuration impacts your workload’s performance too. Take the root disk of a Node
, for example. While most hyperscalers offer the usage of HDD and SSD disks, it is strongly recommended to use SSD volumes as root disks. When there are lots of Pods
on a Node
or workloads making extensive use of emptyDir
volumes, disk throttling becomes an issue. When a disk hits its IOPS limits, processes are stuck in IO-wait and slow down significantly. This can lead to a slow-down in the kubelet’s heartbeat mechanism and result in Nodes
being replaced automatically, as they appear to be unhealthy. To analyze such a situation, you might have to run tools like iostat
, sar
or top
directly on a Node
.
Switching to an I/O optimized instance type (if offered for your infrastructure) can help to resolve the issue. Please keep in mind that disks used via PersistentVolumeClaims
have I/O limits as well. Sometimes these limits are related to the size and/or can be increased for individual disks.
Cloud Provider (Infrastructure) Limits
In addition to the already mentioned capacity restrictions, a cloud provider may impose other limitations to a Kubernetes cluster’s scalability. One category is the account quota defining the number of resources allowed globally or per region. Make sure to request appropriate values that suit your needs and contain a buffer, for example for having more Nodes
during a rolling update.
Another dimension is the network throughput per VM or network interface. While you may be able to choose a network-optimized Node
type for your workload to mitigate issues, you cannot influence the available bandwidth for control plane components. Therefore, please ensure that the traffic on the ETCD does not exceed 100MB/s. The ETCD dashboard provides data for monitoring this metric.
In some environments, the upstream DNS might become an issue too and make your workloads subject to rate limiting. Given the heterogeneity of cloud providers, including private data centers, it is not possible to give any thresholds. Still, the “CoreDNS” and “NodeLocalDNS” dashboards can help to derive a workload’s usage pattern. Check the DNS autoscaling and NodeLocalDNS documentation for available configuration options.
Webhooks
While webhooks provide powerful means to manage a cluster, they are equally powerful in breaking a cluster upon a malfunction or unavailability. Imagine using a policy enforcing system like Kyverno or Open Policy Agent Gatekeeper. As part of the stack, both will deploy webhooks which are invoked for almost everything that happens in a cluster. Now, if this webhook gets either overloaded or is simply not available, the cluster will stop functioning properly.
Hence, you have to ensure proper sizing, quick processing time, and availability of the webhook serving Pods
when deploying webhooks. Please consult Dynamic Admission Control (Availability and Timeouts sections) for details. You should also be aware of the time added to any request that has to go through a webhook, as the kube-apiserver
sends the request for mutation / validation to another pod and waits for the response. The more resources are subject to an external webhook, the more likely it is to become a bottleneck when resources have a high churn rate. Within the Gardener monitoring stack, you can check the extra time per webhook via the “API Server (Admission Details)” dashboard, which has a panel for “Duration per Webhook”.
In Gardener, any webhook timeout should be less than 15 seconds. Due to the separation of Kubernetes data-plane (shoot) and control-plane (seed) in Gardener, the extra hop from kube-apiserver
(control-plane) to webhook (data-plane) is more expensive. Please check Shoot Status for more details.
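The 15-second recommendation can be checked against the webhooks of a cluster. The sketch below is an assumption-laden illustration in Python: it expects a list of dictionaries shaped like the `webhooks` entries of a ValidatingWebhookConfiguration or MutatingWebhookConfiguration (with `name` and an optional `timeoutSeconds`), which you would have to export yourself. The function name is hypothetical; the fallback of 10 seconds is the Kubernetes default for admissionregistration.k8s.io/v1.

```python
# Kubernetes default when timeoutSeconds is not set (admissionregistration.k8s.io/v1).
DEFAULT_TIMEOUT_SECONDS = 10
# Recommendation from this guide: timeouts should be less than 15 seconds.
RECOMMENDED_LIMIT_SECONDS = 15


def webhooks_exceeding_limit(webhooks: list, limit: int = RECOMMENDED_LIMIT_SECONDS) -> list:
    """Return the names of webhooks whose timeout is not below the limit."""
    return [
        webhook["name"]
        for webhook in webhooks
        if webhook.get("timeoutSeconds", DEFAULT_TIMEOUT_SECONDS) >= limit
    ]


if __name__ == "__main__":
    # Hypothetical webhook entries for illustration.
    example = [
        {"name": "policy.example.com", "timeoutSeconds": 30},
        {"name": "defaults.example.com"},
    ]
    print(webhooks_exceeding_limit(example))
```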
Custom Resource Definitions
Using Custom Resource Definitions (CRD) to extend a cluster’s API is a common Kubernetes pattern and so is writing an operator to act upon custom resources. Writing an efficient controller reduces the load on the kube-apiserver
and allows for better scaling. As a starting point, you might want to read Gardener’s Kubernetes Clients Guide.
Another problematic dimension is the usage of conversion webhooks when having resources stored in different versions. Not only do they add latency (see Webhooks), but they can also block the kube-controller-manager’s garbage collection. If a conversion webhook is unavailable, the garbage collector fails to list all resources and does not perform any cleanup. To avoid such a situation, it is highly recommended to use conversion webhooks only when necessary and to complete the migration to a new version as soon as possible.
Conclusion
As outlined by SIG Scalability, it is quite impossible to give limits or even recommendations fitting every individual use case. Instead, this guide outlines relevant dimensions and gives rather conservative recommendations based on usage patterns observed. By combining this information, it is possible to operate and scale a cluster in a stable manner.
While going beyond is certainly possible for some dimensions, it significantly increases the risk of instability. Typically, limits on the control-plane are introduced by the availability of resources like CPU or memory on a single machine and can hardly be influenced by any user. Therefore, utilizing the existing resources efficiently is key. Other parameters are controlled by a user. In these cases, careful testing may reveal actual limits for a specific use case.
Please keep in mind that all aspects of a workload greatly influence the stability and scalability of a Kubernetes cluster.
4.2 - Authenticating with an Identity Provider
Prerequisites
Please read the following background material on Authenticating.
Overview
Kubernetes on its own doesn’t provide any user management. In other words, users aren’t managed through Kubernetes resources. Whenever you refer to a human user it’s sufficient to use a unique ID, for example, an email address. Nevertheless, Gardener project owners can use an identity provider to authenticate user access for shoot clusters in the following way:
- Configure an Identity Provider using OpenID Connect (OIDC).
- Configure a local kubectl oidc-login to enable oidc-login.
- Configure the shoot cluster to share details of the OIDC-compliant identity provider with the Kubernetes API Server.
- Authorize an authenticated user using role-based access control (RBAC).
- Verify the result.
Note
Gardener allows administrators to modify aspects of the control plane setup. It gives administrators full control of how the control plane is parameterized. While this offers much flexibility, administrators need to ensure that they don’t configure a control plane that goes beyond the service level agreements of the responsible operators team.
Configure an Identity Provider
Create a tenant in an OIDC compatible Identity Provider. For simplicity, we use Auth0, which has a free plan.
In your tenant, create a client application to use authentication with kubectl.
Provide a Name, choose Native as application type, and choose CREATE.
In the tab Settings, copy the following parameters to a local text file:
- Domain: Corresponds to the issuer in OIDC. It must be an https-secured endpoint (Auth0 requires a trailing / at the end). For more information, see Issuer Identifier.
- Client ID
- Client Secret
Configure the client to have a callback URL of http://localhost:8000. This callback connects to your local kubectl oidc-login plugin.
Save your changes.
Verify that https://<Auth0 Domain>/.well-known/openid-configuration is reachable.
Choose Users & Roles > Users > CREATE USERS to create a user with a user name and password.
Note
Users must have a verified email address.
Configure a Local kubectl oidc-login
Install the kubectl plugin oidc-login. We highly recommend the krew installation tool, which also makes other plugins easily available.
kubectl krew install oidc-login
The response looks like this:
Updated the local copy of plugin index.
Installing plugin: oidc-login
CAVEATS:
\
 | You need to setup the OIDC provider, Kubernetes API server, role binding and kubeconfig.
 | See https://github.com/int128/kubelogin for more.
/
Installed plugin: oidc-login
Prepare a kubeconfig for later use:
cp ~/.kube/config ~/.kube/config-oidc
Modify the configuration of ~/.kube/config-oidc as follows:
apiVersion: v1
kind: Config
...
contexts:
- context:
    cluster: shoot--project--mycluster
    user: my-oidc
  name: shoot--project--mycluster
...
users:
- name: my-oidc
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=https://<Issuer>/
      - --oidc-client-id=<Client ID>
      - --oidc-client-secret=<Client Secret>
      - --oidc-extra-scope=email,offline_access,profile
To test our OIDC-based authentication, the context shoot--project--mycluster of ~/.kube/config-oidc is used in a later step. For now, continue to use the configuration ~/.kube/config with administration rights for your cluster.
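Instead of editing the file by hand, the same user entry can probably be generated with kubectl config set-credentials. The following is a minimal sketch under that assumption: add_oidc_user is a hypothetical helper name, and its three arguments stand for the issuer URL, client ID, and client secret copied from Auth0.

```shell
# Sketch: add the my-oidc user entry to ~/.kube/config-oidc.
# add_oidc_user is a hypothetical helper name, not part of kubectl.
# Usage: add_oidc_user <issuer-url> <client-id> <client-secret>
add_oidc_user() {
  kubectl config set-credentials my-oidc \
    --kubeconfig="$HOME/.kube/config-oidc" \
    --exec-api-version=client.authentication.k8s.io/v1beta1 \
    --exec-command=kubectl \
    --exec-arg=oidc-login \
    --exec-arg=get-token \
    --exec-arg="--oidc-issuer-url=$1" \
    --exec-arg="--oidc-client-id=$2" \
    --exec-arg="--oidc-client-secret=$3" \
    --exec-arg=--oidc-extra-scope=email,offline_access,profile
}

# Usage (placeholders as in the manual configuration):
# add_oidc_user "https://<Issuer>/" "<Client ID>" "<Client Secret>"
# kubectl config set-context shoot--project--mycluster \
#   --kubeconfig="$HOME/.kube/config-oidc" --user=my-oidc
```

The client secret ends up in plain text in the kubeconfig either way, so treat ~/.kube/config-oidc as a sensitive file.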
Configure the Shoot Cluster
Modify the shoot cluster YAML as follows, using the client ID and the domain (as issuer) from the settings of the client application you created in Auth0:
kind: Shoot
apiVersion: garden.sapcloud.io/v1beta1
metadata:
  name: mycluster
  namespace: garden-project
...
spec:
  kubernetes:
    kubeAPIServer:
      oidcConfig:
        clientID: <Client ID>
        issuerURL: "https://<Issuer>/"
        usernameClaim: email
This change of the Shoot manifest triggers a reconciliation, which can take up to 5 minutes. Once the reconciliation is finished, your OIDC configuration is applied. It doesn’t invalidate other certificate-based authentication methods.
Authorize an Authenticated User
In Auth0, you created a user with a verified email address, test@test.com in our example. For simplicity, we authorize a single user identified by this email address with the cluster role view:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: viewer-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: test@test.com
As administrator, apply the cluster role binding in your shoot cluster.
Verify the Result
To step into the shoes of your user, use the prepared kubeconfig file ~/.kube/config-oidc, and switch to the context that uses oidc-login:
cd ~/.kube
export KUBECONFIG=$(pwd)/config-oidc
kubectl config use-context shoot--project--mycluster
kubectl delegates the authentication to the plugin oidc-login the first time the user uses kubectl to contact the API server, for example:
kubectl get all
The plugin opens a browser for an interactive authentication session with Auth0, and in parallel serves a local webserver for the configured callback.
Enter your login credentials.
You should get a successful response from the API server:
Opening in existing browser session.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 86m
Note
After a successful login, kubectl uses a token for authentication so that you don’t have to provide user and password for every new kubectl command. How long the token is valid can be configured. If you want to log in again earlier, reset the plugin oidc-login:
- Delete directory ~/.kube/cache/oidc-login.
- Delete the browser cache.
To see if your user uses the cluster role view, do some checks with kubectl auth can-i.
The response for the following commands should be no:
kubectl auth can-i create clusterrolebindings
kubectl auth can-i get secrets
kubectl auth can-i describe secrets
The response for the following commands should be yes:
kubectl auth can-i list pods
kubectl auth can-i get pods
If the last step is successful, you’ve configured your cluster to authenticate against an identity provider using OIDC.
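The checks can also be scripted so that a mismatch is reported explicitly. This is a minimal sketch, assuming KUBECONFIG points to ~/.kube/config-oidc; expect_can_i is a hypothetical helper name.

```shell
# Sketch: compare `kubectl auth can-i` answers with the expectations for
# the view cluster role. expect_can_i is a hypothetical helper name.
# Usage: expect_can_i <yes|no> <verb> <resource>
expect_can_i() {
  expected=$1
  shift
  # can-i exits non-zero when the answer is "no", so ignore its status
  actual=$(kubectl auth can-i "$@" 2>/dev/null) || true
  if [ "$actual" = "$expected" ]; then
    echo "OK    can-i $*: $actual"
  else
    echo "FAIL  can-i $*: got '$actual', expected '$expected'"
    return 1
  fi
}

# Usage:
# expect_can_i no  create clusterrolebindings
# expect_can_i no  get secrets
# expect_can_i yes list pods
# expect_can_i yes get pods
```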
4.3 - Backup and Restore of Kubernetes Objects
TL;DR
Note
Details of the description might change in the near future since Heptio was taken over by VMware which might result in different GitHub repositories or other changes. Please don’t hesitate to inform us in case you encounter any issues.
In general, Backup and Restore (BR) covers activities enabling an organization to bring a system back in a consistent state, e.g., after a disaster or to set up a new system. These activities vary in a very broad way depending on the applications and their persistency.
Kubernetes objects like Pods, Deployments, NetworkPolicies, etc. configure Kubernetes internal components and might as well include external components like load balancer and persistent volumes of the cloud provider. The BR of external components and their configurations might be difficult to handle in case manual configurations were needed to prepare these components.
To set the expectations right from the beginning, this tutorial covers the BR of Kubernetes deployments which might use persistent volumes. The BR of any manual configuration of external components, e.g., via the cloud providers console, is not covered here, as well as the BR of a whole Kubernetes system.
This tutorial puts the focus on the open source tool Velero (formerly Heptio Ark) and its functionality to explain the BR process.
Basically, Velero allows you to:
- backup and restore your Kubernetes cluster resources and persistent volumes (on-demand or scheduled)
- backup or restore all objects in your cluster, or filter resources by type, namespace, and/or label
- by default, all persistent volumes are backed up (configurable)
- replicate your production environment for development and testing environments
- define an expiration date per backup
- execute pre- and post-activities in a container of a pod when a backup is created (see Hooks)
- extend Velero by Plugins, e.g., for Object and Block store (see Plugins)
Velero consists of a server side component and a client tool. The server component consists of Custom Resource Definitions (CRD) and controllers to perform the activities. The client tool communicates with the K8s API server to, e.g., create objects like a Backup object.
The diagram below explains the backup process. When creating a backup, Velero client makes a call to the Kubernetes API server to create a Backup object (1). The BackupController notices the new Backup object, validates the object (2) and begins the backup process (3). Based on the filter settings provided by the Velero client it collects the resources in question (3). The BackupController creates a tar ball with the Kubernetes objects and stores it in the backup location, e.g., AWS S3 (4) as well as snapshots of persistent volumes (5).
The size of the backup tarball corresponds to the number of objects in etcd. The gzipped archive contains the JSON representations of the objects.
Note
As of the writing of this tutorial, Velero or any other BR tool for Shoot clusters is not provided by Gardener.
Getting Started
First, clone the Velero GitHub repository and get the Velero client from the releases or build it from source via make all in the main directory of the cloned GitHub repository.
To use an AWS S3 bucket as storage for the backup files and the persistent volumes, you need to:
- create an S3 bucket as the backup target
- create an AWS IAM user for Velero
- configure the Velero server
- create a secret for your AWS credentials
For details about this setup, check the Set Permissions for Velero documentation. Moreover, it is possible to use other supported storage providers.
Note
Per default, Velero is installed in the namespace velero. To change the namespace, check the documentation.
Velero offers a wide range of filter possibilities for Kubernetes resources, e.g., filter by namespaces, labels or resource types. The filter settings can be combined and used as include or exclude, which gives a great flexibility for selecting resources.
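As an illustration of such filters, the following sketch creates a backup restricted to one namespace and, optionally, one label selector. It assumes a Velero client whose backup create command accepts --include-namespaces and --selector; backup_namespace is a hypothetical helper name, and the backup, namespace, and label names in the usage lines are made up.

```shell
# Sketch: create a filtered backup.
# backup_namespace is a hypothetical helper name; the label selector is
# optional.
# Usage: backup_namespace <backup-name> <namespace> [label-selector]
backup_namespace() {
  if [ -n "${3:-}" ]; then
    velero backup create "$1" --include-namespaces "$2" --selector "$3"
  else
    velero backup create "$1" --include-namespaces "$2"
  fi
}

# Usage (names are examples only):
# backup_namespace nginx-backup nginx-example app=nginx
# velero backup describe nginx-backup --details
```

Describing the backup afterwards is one way to check in advance which resources a filter actually selects.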
Note
Carefully set labels and/or use namespaces for your deployments to make the selection of the resources to be backed up easier. The best practice would be to check in advance which resources are selected with the defined filter.
Exemplary Use Cases
Below are some use cases which could give you an idea on how to use Velero. You can also check Velero’s documentation for other introductory examples.
Helm Based Deployments
To be able to use Helm charts in your Kubernetes cluster, you need to install the Helm client helm and the server component tiller. Per default the server component is installed in the namespace kube-system. Even if it is possible to select single deployments via the filter settings of Velero, you should consider installing tiller in a separate namespace via helm init --tiller-namespace <your namespace>. This approach applies as well for all Helm charts to be deployed - consider separate namespaces for your deployments as well by using the parameter --namespace.
To backup a Helm based deployment, you need to backup both Tiller and the deployment. Only then can the deployments be managed via Helm. As mentioned above, the selection of resources would be easier in case they are separated in namespaces.
Separate Backup Locations
In case you run all your Kubernetes clusters on a single cloud provider, there is probably no need to store the backups in a bucket of a different cloud provider. However, if you run Kubernetes clusters on different cloud providers, you might consider using a bucket on just one cloud provider as the target for the backups, e.g., to benefit from a lower price tag for the storage.
Per default, Velero assumes that both the persistent volumes and the backup location are on the same cloud provider. During the setup of Velero, a secret is created using the credentials for a cloud provider user who has access to both objects (see the policies, e.g., for the AWS configuration).
Now, since the backup location is different from the volume location, you need to follow these steps (described here for AWS):
- configure as documented the volume storage location in examples/aws/06-volumesnapshotlocation.yaml and provide the user credentials. In this case, the S3 related settings like the policies can be omitted
- create the bucket for the backup in the cloud provider in question and a user with the appropriate credentials and store them in a separate file similar to credentials-ark
- create a secret which contains two credentials, one for the volumes and one for the backup target, e.g., by using the command kubectl create secret generic cloud-credentials --namespace velero --from-file cloud=credentials-ark --from-file backup-target=backup-ark
- configure in the deployment manifest examples/aws/10-deployment.yaml the entries in volumeMounts, env and volumes accordingly, e.g., for a cluster running on AWS and the backup target bucket on GCP a configuration could look similar to:
Note
Some links might get broken in the near future since Heptio was taken over by VMware which might result in different GitHub repositories or other changes. Please don’t hesitate to inform us in case you encounter any issues.
Example Velero deployment
# Copyright 2017 the Heptio Ark contributors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  namespace: velero
  name: velero
spec:
  replicas: 1
  template:
    metadata:
      labels:
        component: velero
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8085"
        prometheus.io/path: "/metrics"
    spec:
      restartPolicy: Always
      serviceAccountName: velero
      containers:
        - name: velero
          image: gcr.io/heptio-images/velero:latest
          command:
            - /velero
          args:
            - server
          volumeMounts:
            - name: cloud-credentials
              mountPath: /credentials
            - name: plugins
              mountPath: /plugins
            - name: scratch
              mountPath: /scratch
          env:
            - name: AWS_SHARED_CREDENTIALS_FILE
              value: /credentials/cloud
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /credentials/backup-target
            - name: VELERO_SCRATCH_DIR
              value: /scratch
      volumes:
        - name: cloud-credentials
          secret:
            secretName: cloud-credentials
        - name: plugins
          emptyDir: {}
        - name: scratch
          emptyDir: {}
- finally, configure the backup storage location in examples/aws/05-backupstoragelocation.yaml to use, in this case, a GCP bucket
Limitations
Below is a potentially incomplete list of limitations. You can also consult Velero’s documentation to get up to date information.
- Only full backups of selected resources are supported. Incremental backups are not (yet) supported. However, by using filters it is possible to restrict the backup to specific resources
- Inconsistencies might occur in case of changes during the creation of the backup
- Application specific actions are not considered by default. However, they might be handled by using Velero’s Hooks or Plugins
4.4 - Create / Delete a Shoot Cluster
Create a Shoot Cluster
As you have already prepared an example Shoot manifest in the steps described in the development documentation, please open another Terminal pane/window with the KUBECONFIG environment variable pointing to the Garden development cluster and send the manifest to the Kubernetes API server:
kubectl apply -f your-shoot-aws.yaml
You should see that Gardener has immediately picked up your manifest and has started to deploy the Shoot cluster.
In order to investigate what is happening in the Seed cluster, please download its proper Kubeconfig yourself (see next paragraph). The namespace of the Shoot cluster in the Seed cluster will look like this: shoot-johndoe-johndoe-1, where the first johndoe is your namespace in the Garden cluster (also called “project”) and the johndoe-1 suffix is the actual name of the Shoot cluster.
To connect to the newly created Shoot cluster, you must download its Kubeconfig as well. Please connect to the proper Seed cluster, navigate to the Shoot namespace, and download the Kubeconfig from the kubecfg secret in that namespace.
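The download can be sketched as a single pipeline, assuming KUBECONFIG points to the Seed cluster. fetch_shoot_kubeconfig is a hypothetical helper name, and the data key kubeconfig inside the kubecfg secret is an assumption; inspect the secret first if the key differs in your installation.

```shell
# Sketch: extract the Shoot kubeconfig from the kubecfg secret.
# fetch_shoot_kubeconfig is a hypothetical helper name; the data key
# "kubeconfig" inside the secret is an assumption.
# Usage: fetch_shoot_kubeconfig <shoot-namespace-in-seed> <output-file>
fetch_shoot_kubeconfig() {
  kubectl --namespace "$1" get secret kubecfg \
    --output jsonpath='{.data.kubeconfig}' | base64 -d > "$2"
}

# Usage:
# fetch_shoot_kubeconfig shoot-johndoe-johndoe-1 ./shoot.kubeconfig
# KUBECONFIG=./shoot.kubeconfig kubectl get nodes
```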
Delete a Shoot Cluster
In order to delete your cluster, you have to set an annotation confirming the deletion first, and trigger the deletion after that. You can use the prepared delete shoot script which takes the Shoot name as first parameter. The namespace can be specified by the second parameter, but it is optional. If you don’t state it, it defaults to your namespace (the username you are logged in with to your machine).
./hack/usage/delete shoot johndoe-1 johndoe
(the hack bash script can be found at GitHub)
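The two steps the script performs (confirm, then delete) can also be sketched by hand. delete_shoot is a hypothetical helper name, and the annotation key used here is an assumption based on current Gardener versions; older installations may expect a different key, so check the script in your checkout.

```shell
# Sketch: confirm and trigger the deletion of a Shoot by hand.
# delete_shoot is a hypothetical helper name. The annotation key is an
# assumption and may differ in older Gardener installations.
# Usage: delete_shoot <shoot-name> <namespace>
delete_shoot() {
  kubectl --namespace "$2" annotate shoot "$1" \
    confirmation.gardener.cloud/deletion=true --overwrite &&
  kubectl --namespace "$2" delete shoot "$1" --wait=false
}

# Usage:
# delete_shoot johndoe-1 johndoe
```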
Configure a Shoot Cluster Alert Receiver
The receiver of the Shoot alerts can be configured from the .spec.monitoring.alerting.emailReceivers section in the Shoot specification. The value of the field has to be a list of valid mail addresses.
The alerting for the Shoot clusters is handled by the Prometheus Alertmanager. The Alertmanager will be deployed next to the control plane when the Shoot resource specifies .spec.monitoring.alerting.emailReceivers and if an SMTP secret exists.
If the field gets removed then the Alertmanager will be also removed during the next reconciliation of the cluster. The opposite is also valid if the field is added to an existing cluster.
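For an existing Shoot, the field could be set with a merge patch. This is a minimal sketch: set_alert_receiver is a hypothetical helper name, it sets exactly one address, and a merge patch replaces any list that is already there.

```shell
# Sketch: set the alert receiver of a Shoot with a merge patch.
# set_alert_receiver is a hypothetical helper name; it replaces the
# whole emailReceivers list with a single address.
# Usage: set_alert_receiver <shoot-name> <namespace> <email>
set_alert_receiver() {
  kubectl --namespace "$2" patch shoot "$1" --type merge --patch \
    "{\"spec\":{\"monitoring\":{\"alerting\":{\"emailReceivers\":[\"$3\"]}}}}"
}

# Usage (address is an example only):
# set_alert_receiver johndoe-1 johndoe john.doe@example.com
```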
4.5 - Create a Shoot Cluster Into an Existing AWS VPC
Overview
Gardener can create a new VPC, or use an existing one for your shoot cluster. Depending on your needs, you may want to create shoot(s) into an already created VPC. The tutorial describes how to create a shoot cluster into an existing AWS VPC. The steps are identical for Alicloud, Azure, and GCP. Please note that the existing VPC must be in the same region as the shoot cluster that you want to deploy into the VPC.
TL;DR
If .spec.provider.infrastructureConfig.networks.vpc.cidr is specified, Gardener will create a new VPC with the given CIDR block and respectively will delete it on shoot deletion.
If .spec.provider.infrastructureConfig.networks.vpc.id is specified, Gardener will use the existing VPC and respectively won’t delete it on shoot deletion.
Note
It’s not recommended to create a shoot cluster into a VPC that is managed by Gardener (that is created for another shoot cluster). In this case the deletion of the initial shoot cluster will fail to delete the VPC because there will be resources attached to it.
Gardener won’t delete any manually created (unmanaged) resources in your cloud provider account.
1. Configure the AWS CLI
The aws configure command is a convenient way to set up your AWS CLI. It will prompt you for your credentials and settings which will be used in the following AWS CLI invocations:
aws configure
AWS Access Key ID [None]: <ACCESS_KEY_ID>
AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
Default region name [None]: <DEFAULT_REGION>
Default output format [None]: <DEFAULT_OUTPUT_FORMAT>
2. Create a VPC
Create the VPC by running the following command:
aws ec2 create-vpc --cidr-block <cidr-block>
{
"Vpc": {
"VpcId": "vpc-ff7bbf86",
"InstanceTenancy": "default",
"Tags": [],
"CidrBlockAssociations": [
{
"AssociationId": "vpc-cidr-assoc-6e42b505",
"CidrBlock": "10.0.0.0/16",
"CidrBlockState": {
"State": "associated"
}
}
],
"Ipv6CidrBlockAssociationSet": [],
"State": "pending",
"DhcpOptionsId": "dopt-38f7a057",
"CidrBlock": "10.0.0.0/16",
"IsDefault": false
}
}
Gardener requires the VPC to have DNS support enabled, i.e., the attributes enableDnsSupport and enableDnsHostnames must be set to true. The enableDnsSupport attribute is enabled by default, enableDnsHostnames is not. Set the enableDnsHostnames attribute to true:
aws ec2 modify-vpc-attribute --vpc-id vpc-ff7bbf86 --enable-dns-hostnames
3. Create an Internet Gateway
Gardener also requires that an internet gateway is attached to the VPC. You can create one by using:
aws ec2 create-internet-gateway
{
"InternetGateway": {
"Tags": [],
"InternetGatewayId": "igw-c0a643a9",
"Attachments": []
}
}
and attach it to the VPC using:
aws ec2 attach-internet-gateway --internet-gateway-id igw-c0a643a9 --vpc-id vpc-ff7bbf86
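Before creating the shoot, both prerequisites (DNS hostnames enabled, internet gateway attached) can be checked with the AWS CLI. This is a minimal sketch; vpc_ready is a hypothetical helper name, and the check relies on the text output format printing True for an enabled attribute and None for a missing gateway.

```shell
# Sketch: check the two VPC prerequisites for an existing VPC.
# vpc_ready is a hypothetical helper name.
# Usage: vpc_ready <vpc-id>
vpc_ready() {
  dns=$(aws ec2 describe-vpc-attribute --vpc-id "$1" \
    --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' --output text)
  igw=$(aws ec2 describe-internet-gateways \
    --filters "Name=attachment.vpc-id,Values=$1" \
    --query 'InternetGateways[0].InternetGatewayId' --output text)
  echo "enableDnsHostnames=$dns internetGateway=$igw"
  # assumption: text output prints True/False for booleans, None for null
  [ "$dns" = "True" ] && [ -n "$igw" ] && [ "$igw" != "None" ]
}

# Usage:
# vpc_ready vpc-ff7bbf86 && echo "VPC is ready for Gardener"
```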
4. Create the Shoot
Prepare your shoot manifest (you could check the example manifests). Please make sure that you choose the region in which you had created the VPC earlier (step 2). Also, put your VPC ID in the .spec.provider.infrastructureConfig.networks.vpc.id field:
spec:
  region: <aws-region-of-vpc>
  provider:
    type: aws
    infrastructureConfig:
      apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
      kind: InfrastructureConfig
      networks:
        vpc:
          id: vpc-ff7bbf86
  # ...
Apply your shoot manifest:
kubectl apply -f your-shoot-aws.yaml
Ensure that the shoot cluster is properly created:
kubectl get shoot $SHOOT_NAME -n $SHOOT_NAMESPACE
NAME CLOUDPROFILE VERSION SEED DOMAIN OPERATION PROGRESS APISERVER CONTROL NODES SYSTEM AGE
<SHOOT_NAME> aws 1.15.0 aws <SHOOT_DOMAIN> Succeeded 100 True True True True 20m
4.6 - Fix Problematic Conversion Webhooks
Reasoning
Custom Resource Definition (CRD) is what you use to define a Custom Resource. This is a powerful way to extend Kubernetes capabilities beyond the default installation, adding any kind of API objects useful for your application.
The CustomResourceDefinition API provides a workflow for introducing and upgrading to new versions of a CustomResourceDefinition. In a scenario where a CRD adds support for a new version and switches its spec.versions.storage field to it (for example, from v1beta1 to v1), existing objects are not migrated in etcd. For more information, see Versions in CustomResourceDefinitions.
This creates a mismatch between the requested and stored version for all clients (kubectl, KCM, etc.). When the CRD also declares the usage of a conversion webhook, it gets called whenever a client requests information about a resource that still exists in the old version. If the CRD is created by the end-user, the webhook runs on the shoot side, whereas controllers / kapi-servers run separated, as part of the control-plane. For the webhook to be reachable, a working VPN connection seed -> shoot is essential. In scenarios where the VPN connection is broken, the kube-controller-manager eventually stops its garbage collection, as that requires it to list v1.PartialObjectMetadata for everything to build a dependency graph. Without the kube-controller-manager’s garbage collector, managed resources get stuck during update/rollout.
Breaking Situations
When a user upgrades to failureTolerance: node|zone, that will cause the VPN deployments to be replaced by statefulsets. However, as the VPN connection is broken upon teardown of the deployment, garbage collection will fail, leading to a situation that is stuck until an operator manually tackles it.
Such a situation can be avoided if the end-user has correctly configured CRDs containing conversion webhooks.
Checking Problematic CRDs
In order to make sure there are no CRDs with version problems, please run the script below in your shoot. It returns the names of the CRDs that have one of the following two problems:
- the returned version of the CR is different than what is maintained in the status.storedVersions field of the CRD.
- the status.storedVersions field of the CRD has more than 1 version defined.
#!/bin/bash
set -e -o pipefail
echo "Checking all CRDs in the cluster..."
for p in $(kubectl get crd | awk 'NR>1' | awk '{print $1}'); do
  strategy=$(kubectl get crd "$p" -o json | jq -r .spec.conversion.strategy)
  if [ "$strategy" == "Webhook" ]; then
    crd_name=$(kubectl get crd "$p" -o json | jq -r .metadata.name)
    number_of_stored_versions=$(kubectl get crd "$crd_name" -o json | jq '.status.storedVersions | length')
    if [[ "$number_of_stored_versions" == 1 ]]; then
      returned_cr_version=$(kubectl get "$crd_name" -A -o json | jq -r '.items[] | .apiVersion' | sed 's:.*/::')
      if [ -z "$returned_cr_version" ]; then
        continue
      else
        variable=$(echo "$returned_cr_version" | xargs -n1 | sort -u | xargs)
        present_version=$(kubectl get crd "$crd_name" -o json | jq -cr '.status.storedVersions |.[]')
        if [[ $variable != "$present_version" ]]; then
          echo "ERROR: Stored version differs from the version that CRs are being returned. $crd_name with conversion webhook needs to be fixed"
        fi
      fi
    fi
    if [[ "$number_of_stored_versions" -gt 1 ]]; then
      returned_cr_version=$(kubectl get "$crd_name" -A -o json | jq -r '.items[] | .apiVersion' | sed 's:.*/::')
      if [ -z "$returned_cr_version" ]; then
        continue
      else
        echo "ERROR: Too many stored versions defined. $crd_name with conversion webhook needs to be fixed"
      fi
    fi
  fi
done
echo "Problematic CRDs are reported above."
Resolve CRDs
Below we give the steps needed to be taken in order to fix the CRDs reported by the script above.
Inspect all your CRDs that have conversion webhooks in place. If a CRD has more than 1 version defined in its status.storedVersions field, then initiate migration as described in Option 2 in the Upgrade existing objects to a new stored version guide.
For convenience, we have provided the necessary steps below.
Note
Please test the following steps on a non-productive landscape to make sure that the new CR version doesn’t break any of your existing workloads.
Please check/set the old CR version to storage:false and set the new CR version to storage:true.
For the sake of an example, let’s consider the two versions v1beta1 (old) and v1 (new).
Before:
spec:
  versions:
  - name: v1beta1
    ......
    storage: true
  - name: v1
    ......
    storage: false
After:
spec:
  versions:
  - name: v1beta1
    ......
    storage: false
  - name: v1
    ......
    storage: true
Convert custom-resources to the newest version.
kubectl get <custom-resource-name> -A -ojson | kubectl apply -f -
Patch the CRD to keep only the latest version under storedVersions.
kubectl patch customresourcedefinitions <crd-name> --subresource='status' --type='merge' -p '{"status":{"storedVersions":["your-latest-cr-version"]}}'
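After the patch, it may be worth confirming that only one stored version remains. A minimal sketch; both helper names are hypothetical.

```shell
# Sketch: read back the stored versions of a CRD after the patch.
# Both helper names are hypothetical.
# Usage: stored_versions <crd-name>
stored_versions() {
  kubectl get customresourcedefinitions "$1" \
    --output jsonpath='{.status.storedVersions[*]}'
}

# Succeeds only if exactly one stored version is left.
has_single_stored_version() {
  # word splitting on purpose: jsonpath prints the versions space separated
  set -- $(stored_versions "$1")
  [ "$#" -eq 1 ]
}

# Usage:
# has_single_stored_version <crd-name> && echo "storedVersions cleaned up"
```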
4.7 - GPU Enabled Cluster
Disclaimer
Be aware that the following sections might be opinionated. Kubernetes, and the GPU support in particular, are rapidly evolving, which means that this guide is likely to be outdated sometime soon. For this reason, contributions are highly appreciated to update this guide.
Create a Cluster
First things first, let’s create a Kubernetes (K8s) cluster with GPU accelerated nodes. In this example we will use an AWS p2.xlarge EC2 instance because it’s the cheapest available option at the moment. Use such cheap instances for learning to limit your resource costs. This costs around 1€/hour per GPU.
Install NVidia Driver as Daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
  labels:
    k8s-app: nvidia-driver-installer
spec:
  selector:
    matchLabels:
      name: nvidia-driver-installer
      k8s-app: nvidia-driver-installer
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
        k8s-app: nvidia-driver-installer
    spec:
      hostPID: true
      initContainers:
      - image: squat/modulus:4a1799e7aa0143bcbb70d354bab3e419b1f54972
        name: modulus
        args:
        - compile
        - nvidia
        - "410.104"
        securityContext:
          privileged: true
        env:
        - name: MODULUS_CHROOT
          value: "true"
        - name: MODULUS_INSTALL
          value: "true"
        - name: MODULUS_INSTALL_DIR
          value: /opt/drivers
        - name: MODULUS_CACHE_DIR
          value: /opt/modulus/cache
        - name: MODULUS_LD_ROOT
          value: /root
        - name: IGNORE_MISSING_MODULE_SYMVERS
          value: "1"
        volumeMounts:
        - name: etc-coreos
          mountPath: /etc/coreos
          readOnly: true
        - name: usr-share-coreos
          mountPath: /usr/share/coreos
          readOnly: true
        - name: ld-root
          mountPath: /root
        - name: module-cache
          mountPath: /opt/modulus/cache
        - name: module-install-dir-base
          mountPath: /opt/drivers
        - name: dev
          mountPath: /dev
      containers:
      - image: "gcr.io/google-containers/pause:3.1"
        name: pause
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"
      volumes:
      - name: etc-coreos
        hostPath:
          path: /etc/coreos
      - name: usr-share-coreos
        hostPath:
          path: /usr/share/coreos
      - name: ld-root
        hostPath:
          path: /
      - name: module-cache
        hostPath:
          path: /opt/modulus/cache
      - name: dev
        hostPath:
          path: /dev
      - name: module-install-dir-base
        hostPath:
          path: /opt/drivers
Install Device Plugin
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-device-plugin
  namespace: kube-system
  labels:
    k8s-app: nvidia-gpu-device-plugin
    #addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-gpu-device-plugin
  template:
    metadata:
      labels:
        k8s-app: nvidia-gpu-device-plugin
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-node-critical
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
      containers:
      - image: "k8s.gcr.io/nvidia-gpu-device-plugin@sha256:08509a36233c5096bb273a492251a9a5ca28558ab36d74007ca2a9d3f0b61e1d"
        command: ["/usr/bin/nvidia-gpu-device-plugin", "-logtostderr", "-host-path=/opt/drivers/nvidia"]
        name: nvidia-gpu-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 10Mi
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /device-plugin
        - name: dev
          mountPath: /dev
  updateStrategy:
    type: RollingUpdate
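Once both DaemonSets are running, the GPU nodes should advertise the nvidia.com/gpu resource. A minimal sketch for checking this; list_node_gpus is a hypothetical helper name.

```shell
# Sketch: list the GPUs each node advertises to the scheduler.
# list_node_gpus is a hypothetical helper name. The dot in the resource
# name nvidia.com/gpu is escaped so it is not read as a path separator.
list_node_gpus() {
  kubectl get nodes \
    --output 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage:
# list_node_gpus
```

Nodes without a working driver and device plugin are expected to show no GPU count in the second column.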
Test
To run an example training on a GPU node, first start a base image with Tensorflow with GPU support & Keras:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deeplearning-workbench
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deeplearning-workbench
  template:
    metadata:
      labels:
        app: deeplearning-workbench
    spec:
      containers:
      - name: deeplearning-workbench
        image: afritzler/deeplearning-workbench
        resources:
          limits:
            nvidia.com/gpu: 1
      tolerations:
      - key: "nvidia.com/gpu"
        effect: "NoSchedule"
        operator: "Exists"
Note
The tolerations section above is not required if you deploy the ExtendedResourceToleration admission controller to your cluster. You can do this in the kubernetes section of your Gardener cluster shoot.yaml as follows:
kubernetes:
  kubeAPIServer:
    admissionPlugins:
    - name: ExtendedResourceToleration
Now exec into the container and start an example Keras training:
kubectl exec -it deeplearning-workbench-8676458f5d-p4d2v -- /bin/bash
cd /keras/example
python imdb_cnn.py
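The pod name in the command above is generated and will differ in your cluster. A minimal sketch for looking it up by the label from the Deployment; workbench_pod is a hypothetical helper name.

```shell
# Sketch: look up the generated pod name of the workbench Deployment.
# workbench_pod is a hypothetical helper name.
workbench_pod() {
  kubectl get pods --namespace default \
    --selector app=deeplearning-workbench \
    --output jsonpath='{.items[0].metadata.name}'
}

# Usage:
# kubectl exec -it "$(workbench_pod)" -- /bin/bash
```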
Related Links
- Andreas Fritzler from the Gardener Core team for the R&D, who has provided this setup.
- Build and install NVIDIA driver on CoreOS
4.8 - Shoot Cluster Maintenance
Overview
Day two operations for shoot clusters are related to:
- The Kubernetes version of the control plane and the worker nodes
- The operating system version of the worker nodes
Note
When referring to an update of the “operating system version” in this document, the update of the machine image of the shoot cluster’s worker nodes is meant. For example, Amazon Machine Images (AMI) for AWS.
The following table summarizes what options Gardener offers to maintain these versions:
Version | Auto-Update | Forceful Updates | Manual Updates |
---|---|---|---|
Kubernetes version | Patches only | Patches and consecutive minor updates only | yes |
Operating system version | yes | yes | yes |
Allowed Target Versions in the CloudProfile
Administrators maintain the allowed target versions that you can update to in the CloudProfile for each IaaS-Provider. Users with access to a Gardener project can check supported target versions with:
kubectl get cloudprofile [IAAS-SPECIFIC-PROFILE] -o yaml
Path | Description | More Information |
---|---|---|
spec.kubernetes.versions | The supported Kubernetes version major.minor.patch. | Patch releases |
spec.machineImages | The supported operating system versions for worker nodes | |
Both the Kubernetes version and the operating system version follow semantic versioning that allows Gardener to handle updates automatically.
For more information, see Semantic Versioning.
Impact of Version Classifications on Updates
Gardener allows you to classify versions in the CloudProfile as preview, supported, deprecated, or expired. During maintenance operations, preview versions are excluded from updates, because they’re often recently released versions that haven’t yet undergone thorough testing and may contain bugs or security issues.
For more information, see Version Classifications.
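To see which Kubernetes versions are candidates for updates, the version list can be printed together with its classification. This is a minimal sketch, assuming the entries under spec.kubernetes.versions carry the fields version, classification, and expirationDate; both helper names are hypothetical.

```shell
# Sketch: list Kubernetes versions of a CloudProfile as
# version<TAB>classification<TAB>expirationDate.
# Helper names are hypothetical; the field names are assumptions.
# Usage: list_k8s_versions <cloudprofile-name>
list_k8s_versions() {
  kubectl get cloudprofile "$1" --output \
    jsonpath='{range .spec.kubernetes.versions[*]}{.version}{"\t"}{.classification}{"\t"}{.expirationDate}{"\n"}{end}'
}

# Drop preview versions, which are excluded from maintenance updates.
updatable_versions() {
  list_k8s_versions "$1" | awk -F '\t' '$2 != "preview" { print $1 }'
}

# Usage:
# updatable_versions [IAAS-SPECIFIC-PROFILE]
```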
Let Gardener Manage Your Updates
The Maintenance Window
Gardener can manage updates for you automatically. It lets users specify a maintenance window during which updates are scheduled:
- The time interval of the maintenance window can’t be less than 30 minutes or more than 6 hours.
- If there’s no maintenance window specified during the creation of a shoot cluster, Gardener chooses a maintenance window randomly to spread the load.
You can either specify the maintenance window in the shoot cluster specification (.spec.maintenance.timeWindow) or the start time of the maintenance window using the Gardener dashboard (CLUSTERS > [YOUR-CLUSTER] > OVERVIEW > Lifecycle > Maintenance).
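The length limits of the window (at least 30 minutes, at most 6 hours) can be checked before editing the specification. This is a minimal sketch with hypothetical helper names; it assumes begin and end are given as HHMMSS in the same time zone, and that a window may wrap past midnight.

```shell
# Sketch: check a maintenance window against the 30 minute / 6 hour limits.
# Both helper names are hypothetical. Times are HHMMSS values in the same
# time zone (an assumption about how the window is written down).

# minutes_of_day HHMMSS -> minutes since midnight
minutes_of_day() {
  hh=${1%????}
  rest=${1#??}
  mm=${rest%??}
  # strip a leading zero so the shell does not read 08 or 09 as octal
  echo $(( ${hh#0} * 60 + ${mm#0} ))
}

# window_is_valid <begin> <end>; succeeds if the length is within limits
window_is_valid() {
  length=$(( ( $(minutes_of_day "$2") - $(minutes_of_day "$1") + 1440 ) % 1440 ))
  [ "$length" -ge 30 ] && [ "$length" -le 360 ]
}

# Usage:
# window_is_valid 220000 230000 && echo "window length is allowed"
```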
Auto-Update and Forceful Updates
To trigger updates during the maintenance window automatically, Gardener offers the following methods:
Auto-update:
Gardener starts an update during the next maintenance window whenever there’s a version available in the CloudProfile that is higher than the one of your shoot cluster specification, and that isn’t classified as preview version. For Kubernetes versions, auto-update only updates to higher patch levels.
You can either activate auto-update on the Gardener dashboard (CLUSTERS > [YOUR-CLUSTER] > OVERVIEW > Lifecycle > Maintenance) or in the shoot cluster specification:
.spec.maintenance.autoUpdate.kubernetesVersion: true
.spec.maintenance.autoUpdate.machineImageVersion: true
Forceful updates:
In the maintenance window, Gardener compares the current version given in the shoot cluster specification with the version list in the CloudProfile. If the version has an expiration date and if the date is before the start of the maintenance window, Gardener starts an update to the highest version available in the CloudProfile that isn’t classified as preview version. The highest version in CloudProfile can’t have an expiration date. For Kubernetes versions, Gardener only updates to higher patch levels or consecutive minor versions.
If you don’t want to wait for the next maintenance window, you can annotate the shoot cluster specification with shoot.gardener.cloud/operation: maintain
. Gardener then checks immediately if there’s an auto-update or a forceful update needed.
Note
Forceful version updates are executed even if the auto-update for the Kubernetes version (or the auto-update for the machine image version) is deactivated (set to false
).
With expiration dates, administrators can give shoot cluster owners more time for testing before the actual version update happens, which allows for smoother transitions to new versions.
Kubernetes Update Paths
The larger the gap between the Kubernetes source version and the Kubernetes target version, the more carefully operators must plan and execute the update. Gardener only provides automatic support for updates that can be applied safely to the cluster workload:
| Update Type | Example | Update Method |
|---|---|---|
| Patches | 1.10.12 to 1.10.13 | Auto-update or forceful update |
| Update to consecutive minor version | 1.10.12 to 1.11.10 | Forceful update |
| Other | 1.10.12 to 1.12.0 | Manual update |
Gardener doesn’t support automatic updates of nonconsecutive minor versions, because Kubernetes doesn’t guarantee updateability in this case. However, multiple minor version updates are possible if not only the minor source version is expired, but also the minor target version is expired. Gardener then updates the Kubernetes version first to the expired target version, and waits for the next maintenance window to update this version to the next minor target version.
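The rules for auto-update and forceful updates can be sketched as a small selection function. This is an illustration only, not Gardener's implementation: the `ProfileVersion` type and `maintenance_target` function are hypothetical, and the sketch assumes that a higher patch of the same minor version is preferred over the next minor version, as in the examples later in this guide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ProfileVersion:
    """One Kubernetes version entry of a CloudProfile (hypothetical model)."""
    version: str
    expiration: Optional[datetime] = None
    preview: bool = False


def parse(version: str) -> Tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def maintenance_target(current: str, profile: List[ProfileVersion],
                       auto_update: bool, window_start: datetime) -> Optional[str]:
    """Return the version to update to in this maintenance window, or None."""
    cur = parse(current)
    entry = next((v for v in profile if v.version == current), None)
    expired = bool(entry and entry.expiration and entry.expiration < window_start)
    if not (auto_update or expired):
        return None
    # Preview versions are never chosen as a target.
    stable = [parse(v.version) for v in profile if not v.preview]
    # Auto-update and forceful update both move to the highest patch of the same minor.
    patches = [v for v in stable if v[:2] == cur[:2] and v > cur]
    if patches:
        return ".".join(map(str, max(patches)))
    # Only a forceful update may move to the next consecutive minor version.
    if expired:
        next_minor = [v for v in stable if v[0] == cur[0] and v[1] == cur[1] + 1]
        if next_minor:
            return ".".join(map(str, max(next_minor)))
    return None


profile = [
    ProfileVersion("1.12.8"),
    ProfileVersion("1.11.10"),
    ProfileVersion("1.10.12", datetime(2019, 4, 13, 8, tzinfo=timezone.utc)),
]
window = datetime(2019, 4, 14, 21, tzinfo=timezone.utc)
print(maintenance_target("1.10.12", profile, auto_update=False, window_start=window))
```

With no newer `1.10` patch in the list, the expired version `1.10.12` is moved to the highest patch of `1.11`, while `1.12.8` is never selected directly.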
Warning
The administrator who maintains the CloudProfile
has to ensure that the list of Kubernetes versions consists of consecutive minor versions, for example, from 1.10.x
to 1.11.y
. If the minor version increases in bigger steps, for example, from 1.10.x
to 1.12.y
, then the shoot cluster updates will fail during the maintenance window.
Manual Updates
To update the Kubernetes version or the node operating system manually, change the .spec.kubernetes.version
field or the .spec.provider.workers.machine.image.version
field correspondingly.
Manual updates are required if you would like to update the Kubernetes minor version before the current version expires. Auto-update doesn’t do such updates, as they can have breaking changes that could impact the cluster workload.
Manual updates are either executed immediately (default) or can be confined to the maintenance time window.
Choosing the latter option ensures that changes to the cluster (for example, node pool rolling updates) and the subsequent reconciliation happen only during the defined time window (available since Gardener version 1.4).
For more information, see Confine Specification Changes/Update Roll Out.
Warning
Before applying such an update on minor or major releases, operators should check for all the breaking changes introduced in the target Kubernetes release changelog.
Examples
In the examples for the CloudProfile
and the shoot cluster specification, only the fields relevant for the example are shown.
Auto-Update of Kubernetes Version
Let’s assume that the Kubernetes versions 1.10.5
and 1.11.0
were added in the following CloudProfile
:
spec:
  kubernetes:
    versions:
    - version: 1.11.0
    - version: 1.10.5
    - version: 1.10.0
Before this change, the shoot cluster specification looked like this:
spec:
  kubernetes:
    version: 1.10.0
  maintenance:
    timeWindow:
      begin: 220000+0000
      end: 230000+0000
    autoUpdate:
      kubernetesVersion: true
As a consequence, the shoot cluster is updated to Kubernetes version 1.10.5
between 22:00-23:00 UTC. Your shoot cluster isn’t updated automatically to 1.11.0
, even though it’s the highest Kubernetes version in the CloudProfile
, because Gardener only does automatic updates of the Kubernetes patch level.
Forceful Update Due to Expired Kubernetes Version
Let’s assume the following CloudProfile
exists on the cluster:
spec:
  kubernetes:
    versions:
    - version: 1.12.8
    - version: 1.11.10
    - version: 1.10.13
    - version: 1.10.12
      expirationDate: "2019-04-13T08:00:00Z"
Let’s assume the shoot cluster has the following specification:
spec:
  kubernetes:
    version: 1.10.12
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: false
The shoot cluster specification refers to a Kubernetes version that has an expirationDate
. In the maintenance window on 2019-04-12
, the Kubernetes version stays the same as it’s still not expired. But in the maintenance window on 2019-04-14
, the Kubernetes version of the shoot cluster is updated to 1.10.13
(independently of the value of .spec.maintenance.autoUpdate.kubernetesVersion
).
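The difference between the two maintenance windows comes down to a timestamp comparison. The snippet below is a sketch of that arithmetic, assuming that the `+0100` offset of the time window applies to the window start and that the expiration date is compared against that start; `is_expired` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def is_expired(expiration_date: str, window_start: datetime) -> bool:
    """True if the RFC 3339 UTC expiration date lies before the window start."""
    expires = datetime.strptime(expiration_date, "%Y-%m-%dT%H:%M:%SZ")
    return expires.replace(tzinfo=timezone.utc) < window_start


plus_one = timezone(timedelta(hours=1))  # the +0100 offset from the time window
for day in (12, 14):
    start = datetime(2019, 4, day, 22, 0, tzinfo=plus_one)
    print(start.date(), is_expired("2019-04-13T08:00:00Z", start))
```

The window on 2019-04-12 starts at 21:00 UTC, before the expiration at 08:00 UTC on 2019-04-13, so the first line reports `False` and the second `True`.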
Forceful Update to New Minor Kubernetes Version
Let’s assume the following CloudProfile
exists on the cluster:
spec:
  kubernetes:
    versions:
    - version: 1.12.8
    - version: 1.11.10
    - version: 1.11.9
    - version: 1.10.12
      expirationDate: "2019-04-13T08:00:00Z"
Let’s assume the shoot cluster has the following specification:
spec:
  kubernetes:
    version: 1.10.12
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      kubernetesVersion: false
The shoot cluster specification refers to a Kubernetes version that has an expirationDate
. In the maintenance window on 2019-04-14
, the Kubernetes version of the shoot cluster is updated to 1.11.10
, which is the highest patch version of minor target version 1.11
that follows the source version 1.10
.
Automatic Update from Expired Machine Image Version
Let’s assume the following CloudProfile
exists on the cluster:
spec:
  machineImages:
  - name: coreos
    versions:
    - version: 2191.5.0
    - version: 2191.4.1
    - version: 2135.6.0
      expirationDate: "2019-04-13T08:00:00Z"
Let’s assume the shoot cluster has the following specification:
spec:
  provider:
    type: aws
    workers:
    - name: name
      maximum: 1
      minimum: 1
      maxSurge: 1
      maxUnavailable: 0
      machine:
        image:
          name: coreos
          version: 2135.6.0
        type: m5.large
      volume:
        type: gp2
        size: 20Gi
  maintenance:
    timeWindow:
      begin: 220000+0100
      end: 230000+0100
    autoUpdate:
      machineImageVersion: false
The shoot cluster specification refers to a machine image version that has an expirationDate
. In the maintenance window on 2019-04-12
, the machine image version stays the same as it’s still not expired. But in the maintenance window on 2019-04-14
, the machine image version of the shoot cluster is updated to 2191.5.0
(independently of the value of .spec.maintenance.autoUpdate.machineImageVersion
) as version 2135.6.0
is expired.
4.9 - Tailscale
Access the Kubernetes apiserver from your tailnet
Overview
If you would like to strengthen the security of your Kubernetes cluster even further, this guide explains how to do so.
The most common way to secure a Kubernetes cluster which was created with Gardener is to apply the ACLs described in the Gardener ACL Extension repository or to use ExposureClass, which exposes the Kubernetes apiserver in a corporate network not exposed to the public internet.
However, those solutions are not without their drawbacks. Managing the ACL extension becomes fairly difficult with the growing number of participants, especially in a dynamic environment and work from home scenarios, and using ExposureClass requires you to first have a corporate network suitable for this purpose.
But there is a solution which bridges the gap between these two approaches by using a mesh VPN based on WireGuard.
Tailscale
Tailscale is a mesh VPN network which uses WireGuard under the hood, but automates the key exchange procedure. Please consult the official Tailscale documentation for a detailed explanation.
Target Architecture
Installation
To make the Kubernetes apiserver accessible only from a tailscale VPN, follow these steps:
- Create a tailscale account and ensure MagicDNS is enabled.
- Create an OAuth ClientID and Secret. Don’t forget to create the required tags.
- Install the tailscale operator.
If all went well after the operator installation, you should be able to see the tailscale operator by running tailscale status
:
# tailscale status
...
100.83.240.121 tailscale-operator tagged-devices linux -
...
Expose the Kubernetes apiserver
Now you are ready to expose the Kubernetes apiserver in the tailnet by annotating the service which was created by Gardener:
kubectl annotate -n default service kubernetes tailscale.com/expose=true tailscale.com/hostname=kubernetes
It is required to use kubernetes
as the hostname, because this is part of the certificate common name of the Kubernetes apiserver.
After annotating the service, it will be exposed in the tailnet and can be shown by running tailscale status
:
# tailscale status
...
100.83.240.121 tailscale-operator tagged-devices linux -
100.96.191.87 kubernetes tagged-devices linux idle, tx 19548 rx 71656
...
Modify the kubeconfig
In order to access the cluster via the VPN, you must modify the kubeconfig to point to the Kubernetes service exposed in the tailnet, by changing the server
entry to https://kubernetes
.
---
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <base64 encoded secret>
    server: https://kubernetes
  name: my-cluster
...
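If you prefer to script the change instead of editing the file by hand, a plain text substitution is enough for a kubeconfig with a single cluster. The sketch below is an assumption-laden helper, not an official tool: `point_kubeconfig_at_tailnet` is a hypothetical name, and it rewrites every `server:` entry it finds, so it is only suitable for a kubeconfig that contains one cluster.

```python
import re


def point_kubeconfig_at_tailnet(kubeconfig_text: str, host: str = "kubernetes") -> str:
    """Replace the URL of each `server:` entry with https://<host>."""
    return re.sub(
        r"^([ \t]*server:[ \t]*)\S+",   # keep the indentation and the key
        rf"\g<1>https://{host}",         # swap only the URL
        kubeconfig_text,
        flags=re.MULTILINE,
    )


# Hypothetical kubeconfig fragment; the original server URL is a placeholder.
sample = (
    "apiVersion: v1\n"
    "clusters:\n"
    "- cluster:\n"
    "    server: https://api.example.invalid\n"
    "  name: my-cluster\n"
)
print(point_kubeconfig_at_tailnet(sample))
```

The certificate data and all other fields are left untouched, which matters because the hostname `kubernetes` has to match the apiserver certificate.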
Enable ACLs to Block All IPs
Now you are ready to use your cluster from every device which is part of your tailnet. Therefore you can now block all access to the Kubernetes apiserver with the ACL extension.
Caveats
Multiple Kubernetes Clusters
Currently, you cannot join multiple Kubernetes clusters to your tailnet
because the kubernetes
service in every cluster would overlap.
Headscale
It is possible to host a tailscale coordination server on your own if you do not want to rely on the service tailscale.com offers. The headscale project is an open source implementation of this.
This works for basic tailscale VPN setups, but not for the tailscale operator described here, because headscale
does not implement all required API endpoints for the tailscale operator.
The details can be found in this GitHub Issue.
5 - Monitor and Troubleshoot
5.1 - Analyzing Node Removal and Failures
Overview
Sometimes operators want to find out why a certain node got removed. This guide helps to identify possible causes. There are a few potential reasons why nodes can be removed:
- broken node: a node becomes unhealthy and machine-controller-manager terminates it in an attempt to replace the unhealthy node with a new one
- scale-down: cluster-autoscaler sees that a node is under-utilized and therefore scales down a worker pool
- node rolling: configuration changes to a worker pool (or cluster) require all nodes of one or all worker pools to be rolled and thus all nodes to be replaced. Some possible changes are:
- the K8s/OS version
- changing machine types
Helpful information can be obtained by using the logging stack. See Logging Stack for how to utilize the logging information in Gardener.
Find Out Whether the Node Was Unhealthy
Check the Node Events
A good first indication on what happened to a node can be obtained from the node’s events. Events are scraped and ingested into the logging system, so they can be found in the explore tab of Grafana (make sure to select loki
as datasource) with a query like {job="event-logging"} | unpack | object="Node/<node-name>"
or find any event mentioning the node in question via a broader query like {job="event-logging"}|="<node-name>"
.
A potential result might reveal:
{"_entry":"Node ip-10-55-138-185.eu-central-1.compute.internal status is now: NodeNotReady","count":1,"firstTimestamp":"2023-04-05T12:02:08Z","lastTimestamp":"2023-04-05T12:02:08Z","namespace":"default","object":"Node/ip-10-55-138-185.eu-central-1.compute.internal","origin":"shoot","reason":"NodeNotReady","source":"node-controller","type":"Normal"}
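When many events are returned, it can help to reduce each line to the fields that matter. The following sketch parses one event line as shown above; the field names are taken from that sample, while the helper `node_event_summary` is hypothetical.

```python
import json
from typing import Dict, Optional


def node_event_summary(line: str) -> Optional[Dict[str, str]]:
    """Extract node name, reason, and timestamp from one event log line."""
    event = json.loads(line)
    kind, _, name = event["object"].partition("/")
    if kind != "Node":
        return None  # the event belongs to another kind of object
    return {
        "node": name,
        "reason": event["reason"],
        "type": event["type"],
        "last_seen": event["lastTimestamp"],
        "message": event["_entry"],
    }


sample_line = '{"_entry":"Node ip-10-55-138-185.eu-central-1.compute.internal status is now: NodeNotReady","count":1,"firstTimestamp":"2023-04-05T12:02:08Z","lastTimestamp":"2023-04-05T12:02:08Z","namespace":"default","object":"Node/ip-10-55-138-185.eu-central-1.compute.internal","origin":"shoot","reason":"NodeNotReady","source":"node-controller","type":"Normal"}'
print(node_event_summary(sample_line))
```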
Check machine-controller-manager Logs
If a node became unhealthy, the last conditions can be found in the logs of the machine-controller-manager
by using a query like {pod_name=~"machine-controller-manager.*"}|="<node-name>"
.
Caveat: every node
resource is backed by a corresponding machine
resource managed by machine-controller-manager. Usually two corresponding node
and machine
resources have the same name with the exception of AWS. There, you first need to use the above query to find the corresponding machine
name, typically via a log like this:
2023-04-05 12:02:08 {"log":"Conditions of Machine \"shoot--demo--cluster-pool-z1-6dffc-jh4z4\" with providerID \"aws:///eu-central-1/i-0a6ad1ca4c2e615dc\" and backing node \"ip-10-55-138-185.eu-central-1.compute.internal\" are changing","pid":"1","severity":"INFO","source":"machine_util.go:629"}
This reveals that node
ip-10-55-138-185.eu-central-1.compute.internal
is backed by machine
shoot--demo--cluster-pool-z1-6dffc-jh4z4
. On infrastructures other than AWS you can omit this step.
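The mapping from node name to machine name can also be extracted programmatically. This is a sketch that assumes the log line keeps the wording shown above ("Conditions of Machine ... with providerID ... and backing node ..."); the function name is hypothetical and the pattern may need adjusting for other machine-controller-manager versions.

```python
import json
import re
from typing import Optional, Tuple

MACHINE_PATTERN = re.compile(
    r'Machine "([^"]+)" with providerID "([^"]+)" and backing node "([^"]+)"'
)


def machine_for_node(log_line: str) -> Optional[Tuple[str, str, str]]:
    """Return (machine, providerID, node) from one machine-controller-manager log line."""
    payload = json.loads(log_line[log_line.index("{"):])  # skip the timestamp prefix
    match = MACHINE_PATTERN.search(payload["log"])
    return match.groups() if match else None


sample_log = r'2023-04-05 12:02:08 {"log":"Conditions of Machine \"shoot--demo--cluster-pool-z1-6dffc-jh4z4\" with providerID \"aws:///eu-central-1/i-0a6ad1ca4c2e615dc\" and backing node \"ip-10-55-138-185.eu-central-1.compute.internal\" are changing","pid":"1","severity":"INFO","source":"machine_util.go:629"}'
print(machine_for_node(sample_log))
```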
With the machine name at hand, now search for log entries with {pod_name=~"machine-controller-manager.*"}|="<machine-name>"
.
In case the node had failing conditions, you’d find logs like this:
2023-04-05 12:02:08 {"log":"Machine shoot--demo--cluster-pool-z1-6dffc-jh4z4 is unhealthy - changing MachineState to Unknown. Node conditions: [{Type:ClusterNetworkProblem Status:False LastHeartbeatTime:2023-04-05 11:58:39 +0000 UTC LastTransitionTime:2023-03-23 11:59:29 +0000 UTC Reason:NoNetworkProblems Message:no cluster network problems} ... {Type:Ready Status:Unknown LastHeartbeatTime:2023-04-05 11:55:27 +0000 UTC LastTransitionTime:2023-04-05 12:02:07 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}]","pid":"1","severity":"WARN","source":"machine_util.go:637"}
In the example above, the reason for an unhealthy node was that kubelet
failed to renew its heartbeat. Typical reasons would be either a broken VM (that couldn’t execute kubelet
anymore) or a broken network. Note that some VM terminations performed by the infrastructure provider are actually expected (e.g., scheduled events on AWS).
In both cases, the infrastructure provider might be able to provide more information on particular VM or network failures.
Whatever the failure condition might have been, if a node gets unhealthy, it will be terminated by machine-controller-manager
after the machineHealthTimeout
has elapsed (this parameter can be configured in your shoot spec).
Check the Node Logs
For each node
the kernel and kubelet
logs, as well as a few others, are scraped and can be queried with {nodename="<node-name>"}
. This might reveal OS specific issues or, in the absence of any logs (e.g., after the node went unhealthy), might indicate a network disruption or sudden VM termination. Note that some VM terminations performed by the infrastructure provider are actually expected (e.g., scheduled events on AWS).
Infrastructure providers might be able to provide more information on particular VM failures in such cases.
Check the Network Problem Detector Dashboard
If your Gardener installation utilizes gardener-extension-shoot-networking-problemdetector, you can check the dashboard named “Network Problem Detector” in Grafana for hints on network issues on the node of interest.
Scale-Down
In general, scale-downs are managed by the cluster-autoscaler. Its logs can be found with the query {container_name="cluster-autoscaler"}
.
Attempts to remove a node can be found with the query {container_name="cluster-autoscaler"}|="Scale-down: removing empty node"
If a scale-down has caused disruptions in your workload, consider protecting your workload by adding PodDisruptionBudgets
(see the autoscaler FAQ for more options).
Node Rolling
Node rolling can be caused by, e.g.:
- change of the K8s minor version of the cluster or a worker pool
- change of the OS version of the cluster or a worker pool
- change of the disk size/type or machine size/type of a worker pool
- change of node labels
Changes like the above are done by altering the shoot specification and thus are recorded in the external auditlog system that is configured for the garden cluster.
5.2 - Get a Shell to a Gardener Shoot Worker Node
Overview
To troubleshoot certain problems in a Kubernetes cluster, operators need access to the host of the Kubernetes node. This can be required if a node misbehaves or fails to join the cluster in the first place.
With access to the host, it is for instance possible to check the kubelet
logs and interact with common tools such as systemctl
and journalctl
.
The first section of this guide explores options to get a shell to the node of a Gardener Kubernetes cluster. The options described in the second section do not rely on Kubernetes capabilities to get shell access to a node and thus can also be used if an instance failed to join the cluster.
This guide only covers how to get access to the host, but does not cover troubleshooting methods.
- Overview
- Get a Shell to an Operational Cluster Node
- SSH Access to a Node That Failed to Join the Cluster
- Cleanup
Get a Shell to an Operational Cluster Node
The following describes three different approaches to get a shell to an operational Shoot worker node. As a prerequisite to troubleshooting a Kubernetes node, the node must have joined the cluster successfully and be able to run a pod. All of the described approaches involve scheduling a pod with root permissions and mounting the root filesystem.
Gardener Dashboard
Prerequisite: the terminal feature is configured for the Gardener dashboard.
- Navigate to the cluster overview page and find the
Terminal
in the Access
tile.
Select the target Cluster (Garden, Seed / Control Plane, Shoot cluster) depending on the requirements and access rights (only certain users have access to the Seed Control Plane).
- To open the terminal configuration, interact with the top right-hand corner of the screen.
- Set the Terminal Runtime to “Privileged”. Also, specify the target node from the drop-down menu.
Result
The Dashboard then schedules a pod and opens a shell session to the node.
To get access to the common binaries installed on the host, prefix the command with chroot /hostroot
. Note that the path depends on where the root path is mounted in the container. In the default image used by the Dashboard, it is under /hostroot
.
Gardener Ops Toolbelt
Prerequisite: kubectl
is available.
The Gardener ops-toolbelt can be used as a convenient way to deploy a root pod to a node. The pod uses an image that is bundled with a bunch of useful troubleshooting tools. This is also the same image that is used by default when using the Gardener Dashboard terminal feature as described in the previous section.
The easiest way to use the Gardener ops-toolbelt is to execute the ops-pod
script in the hacks
folder. To get root shell access to a node, execute the aforementioned script by supplying the target node name as an argument:
<path-to-ops-toolbelt-repo>/hacks/ops-pod <target-node>
Custom Root Pod
Alternatively, a pod can be assigned to a target node and a shell can be opened via standard Kubernetes means. To enable root access to the node, the pod specification requires proper securityContext
and volume
properties.
For instance, you can use the following pod manifest, after changing <target-node-name> to the name of the target node:
apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
  namespace: default
spec:
  nodeSelector:
    kubernetes.io/hostname: <target-node-name>
  containers:
  - name: busybox
    image: busybox
    stdin: true
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-root-volume
      mountPath: /host
      readOnly: true
  volumes:
  - name: host-root-volume
    hostPath:
      path: /
  hostNetwork: true
  hostPID: true
  restartPolicy: Never
SSH Access to a Node That Failed to Join the Cluster
This section explores two options that can be used to get SSH access to a node that failed to join the cluster. As it is not possible to schedule a pod on the node, the Kubernetes-based methods explored so far cannot be used in this scenario.
Additionally, Gardener typically provisions worker instances in a private subnet of the VPC, hence there is no public IP address that could be used for direct SSH access.
For this scenario, cloud providers typically have extensive documentation (e.g., AWS & GCP) and in some cases tooling support. However, these approaches are mostly cloud provider specific, require interaction via their CLI and API or sometimes the installation of a cloud provider specific agent on the node.
Alternatively, gardenctl
can be used, which provides cloud provider agnostic, out-of-the-box support for getting SSH access to an instance in a private subnet. Currently gardenctl
supports AWS, GCP, OpenStack, Azure and Alibaba Cloud.
Identifying the Problematic Instance
First, the problematic instance has to be identified. In Gardener, worker pools can be created in different cloud provider regions, zones, and accounts.
The instance would typically show up as successfully started / running in the cloud provider dashboard or API and it is not immediately obvious which one has a problem. Instead, we can use the Gardener API / CRDs to obtain the faulty instance identifier in a cloud-agnostic way.
Gardener uses the Machine Controller Manager to create the Shoot worker nodes. For each worker node, the Machine Controller Manager creates a Machine
CRD in the Shoot namespace in the respective Seed
cluster. Usually the problematic instance can be identified, as the respective Machine
CRD has status pending
.
The instance / node name can be obtained from the Machine
.status
field:
kubectl get machine <machine-name> -o json | jq -r .status.node
This is all the information needed to go ahead and use gardenctl ssh
to get a shell to the node. In addition, the used cloud provider, the specific identifier of the instance, and the instance region can be identified from the Machine
CRD.
Get the identifier of the instance via:
kubectl get machine <machine-name> -o json | jq -r .spec.providerID # e.g., aws:///eu-north-1/i-069733c435bdb4640
The identifier shows that the instance belongs to the cloud provider aws
with the ec2 instance-id i-069733c435bdb4640
in region eu-north-1
.
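Splitting the identifier into its parts is straightforward. The sketch below assumes the `<provider>:///<region>/<instance-id>` layout of the AWS example above; other cloud providers may use a different layout, and `parse_provider_id` is a hypothetical helper.

```python
from typing import NamedTuple


class ProviderID(NamedTuple):
    provider: str
    region: str
    instance_id: str


def parse_provider_id(provider_id: str) -> ProviderID:
    """Split an identifier such as aws:///eu-north-1/i-069733c435bdb4640."""
    provider, sep, rest = provider_id.partition(":///")
    region, slash, instance_id = rest.partition("/")
    if not (sep and slash and provider and region and instance_id):
        raise ValueError(f"unexpected providerID layout: {provider_id!r}")
    return ProviderID(provider, region, instance_id)


print(parse_provider_id("aws:///eu-north-1/i-069733c435bdb4640"))
```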
To get more information about the instance, check out the MachineClass
(e.g., AWSMachineClass
) that is associated with each Machine
CRD in the Shoot
namespace of the Seed
cluster.
The AWSMachineClass
contains the machine image (ami), machine-type, iam information, network-interfaces, subnets, security groups and attached volumes.
Of course, the information can also be used to get the instance with the cloud provider CLI / API.
gardenctl ssh
Using the node name of the problematic instance, we can use the gardenctl ssh
command to get SSH access to the cloud provider instance via an automatically set up bastion host. gardenctl
takes care of spinning up the bastion
instance, setting up the SSH keys, ports and security groups and opens a root shell on the target instance. After the SSH session has ended, gardenctl
deletes the created cloud provider resources.
Use the following commands:
- First, target a Garden cluster containing all the Shoot definitions.
gardenctl target garden <target-garden>
- Target an available Shoot by name. This sets up the context, configures the
kubeconfig
file of the Shoot cluster and downloads the cloud provider credentials. Subsequent commands will execute in this context.
gardenctl target shoot <target-shoot>
- This uses the cloud provider credentials to spin up the bastion and to open a shell on the target instance.
gardenctl ssh <target-node>
SSH with a Manually Created Bastion on AWS
In case you are not using gardenctl
or want to control the bastion instance yourself, you can also manually set it up.
The steps described here are generally the same as those used by gardenctl
internally.
Despite some cloud provider specifics, they can be generalized to the following list:
- Open port 22 on the target instance.
- Create an instance / VM in a public subnet (the bastion instance needs to have a public IP address).
- Set up security groups and roles, and open port 22 for the bastion instance.
The following diagram shows an overview of how the SSH access to the target instance works:
This guide demonstrates the setup of a bastion on AWS.
Prerequisites:
- The AWS CLI is set up.
- Obtain the target instance-id (see Identifying the Problematic Instance).
- Obtain the VPC ID the Shoot resources are created in. This can be found in the Infrastructure CRD in the Shoot namespace in the Seed.
- Make sure that port 22 on the target instance is open (default for Gardener deployed instances).
- Extract security group via:
aws ec2 describe-instances --instance-ids <instance-id>
- Check for rule that allows inbound connections on port 22:
aws ec2 describe-security-groups --group-ids=<security-group-id>
- If not available, create the rule with the following command:
aws ec2 authorize-security-group-ingress --group-id <security-group-id> --protocol tcp --port 22 --cidr 0.0.0.0/0
Create the Bastion Security Group
- The common name of the security group is
<shoot-name>-bsg
. Create the security group:
aws ec2 create-security-group --group-name <bastion-security-group-name> --description ssh-access --vpc-id <VPC-ID>
- Optionally, create identifying tags for the security group:
aws ec2 create-tags --resources <bastion-security-group-id> --tags Key=component,Value=<tag>
- Create a permission in the bastion security group that allows ssh access on port 22:
aws ec2 authorize-security-group-ingress --group-id <bastion-security-group-id> --protocol tcp --port 22 --cidr 0.0.0.0/0
- Create an IAM role for the bastion instance with the name
<shoot-name>-bastions
:
aws iam create-role --role-name <shoot-name>-bastions
The content should be:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeRegions"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
- Create the instance profile and name it
<shoot-name>-bastions
:
aws iam create-instance-profile --instance-profile-name <name>
- Add the created role to the instance profile:
aws iam add-role-to-instance-profile --instance-profile-name <instance-profile-name> --role-name <role-name>
Create the Bastion Instance
Next, in order to be able to ssh
into the bastion instance, the instance has to be set up with a user with a public ssh key.
Create a user gardener
that has the same Gardener-generated public ssh key as the target instance.
- First, we need to get the public part of the
Shoot
ssh-key. The ssh-key is stored in a secret in the project namespace in the Garden cluster. The name is: <shoot-name>.ssh-keypair
. Get the key via:
kubectl get secret <shoot-name>.ssh-keypair -o json | jq -r .data.\"id_rsa.pub\" | base64 -d
- A script handed over as
user-data
to the bastion ec2
instance can be used to create the gardener
user and add the ssh-key. For your convenience, you can use the following script to generate the user-data
.
#!/bin/bash -eu
# Writes gardener-bastion-userdata.sh, which creates the gardener user
# and installs the given public ssh key on the bastion.
saveUserDataFile () {
  ssh_key=$1
  cat > gardener-bastion-userdata.sh <<EOF
#!/bin/bash -eu
id gardener || useradd gardener -mU
mkdir -p /home/gardener/.ssh
echo "$ssh_key" > /home/gardener/.ssh/authorized_keys
chown gardener:gardener /home/gardener/.ssh/authorized_keys
echo "gardener ALL=(ALL) NOPASSWD:ALL" >/etc/sudoers.d/99-gardener-user
EOF
}

if [ -p /dev/stdin ]; then
  # The public key was piped in on stdin.
  read -r input
else
  # No pipe: fall back to the clipboard (pbpaste is only available on macOS).
  input=$(pbpaste)
fi
saveUserDataFile "$input"
- Use the script by handing-over the public ssh-key of the
Shoot
cluster:
kubectl get secret <shoot-name>.ssh-keypair -o json | jq -r .data.\"id_rsa.pub\" | base64 -d | ./generate-userdata.sh
This generates a file called gardener-bastion-userdata.sh
in the same directory containing the user-data
.
- The following information is needed to create the bastion instance:
bastion-IAM-instance-profile-name
- Use the created instance profile with the name <shoot-name>-bastions
image-id
- It is possible to use the same image-id as the one used for the target instance (or any other image). The format is cloud provider specific (AWS: ami
).
ssh-public-key-name
- This is the ssh key pair already created in the Shoot's cloud provider account by Gardener during the Infrastructure CRD reconciliation.
- The name is usually: <shoot-name>-ssh-publickey
subnet-id
- Choose a subnet that is attached to an Internet Gateway
and NAT Gateway
(bastion instance must have a public IP).
- The Gardener created public subnet with the name <shoot-name>-public-utility-<xy>
can be used.
Please check the created subnets with the cloud provider.
bastion-security-group-id
- Use the id of the created bastion security group.
file-path-to-userdata
- Use the filepath to the user-data
file generated in the previous step.
bastion-instance-name
- Optionally, you can tag the instance.
- Usually
<shoot-name>-bastions
- Create the bastion instance via:
aws ec2 run-instances --iam-instance-profile Name=<bastion-IAM-instance-profile-name> --image-id <image-id> --count 1 --instance-type t3.nano --key-name <ssh-public-key-name> --security-group-ids <bastion-security-group-id> --subnet-id <subnet-id> --associate-public-ip-address --user-data file://<file-path-to-userdata> --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=<bastion-instance-name>},{Key=component,Value=<mytag>}]" "ResourceType=volume,Tags=[{Key=component,Value=<mytag>}]"
Capture the instance-id
from the response and wait until the ec2
instance is running and has a public IP address.
Connecting to the Target Instance
- Save the private key of the ssh-key-pair in a temporary local file for later use:
umask 077
kubectl get secret <shoot-name>.ssh-keypair -o json | jq -r .data.\"id_rsa\" | base64 -d > id_rsa.key
- Use the private ssh key to ssh into the bastion instance:
ssh -i <path-to-private-key> gardener@<public-bastion-instance-ip>
- If that works, connect from your local terminal to the target instance via the bastion:
ssh -i <path-to-private-key> -o ProxyCommand="ssh -W %h:%p -i <private-key> -o IdentitiesOnly=yes -o StrictHostKeyChecking=no gardener@<public-ip-bastion>" gardener@<private-ip-target-instance> -o IdentitiesOnly=yes -o StrictHostKeyChecking=no
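The nested quoting of the ProxyCommand is easy to get wrong when typed by hand. The following sketch builds the same command as an argument list, using only the ssh options shown above; the function name and the sample IP addresses are hypothetical.

```python
import shlex
from typing import List


def ssh_via_bastion(key_path: str, bastion_ip: str, target_ip: str,
                    user: str = "gardener") -> List[str]:
    """Build the argument list for an ssh session that hops through the bastion."""
    common = ["-i", key_path, "-o", "IdentitiesOnly=yes", "-o", "StrictHostKeyChecking=no"]
    # The inner ssh forwards stdin/stdout to the target (%h:%p) via the bastion.
    proxy = shlex.join(["ssh", "-W", "%h:%p", *common, f"{user}@{bastion_ip}"])
    return ["ssh", *common, "-o", f"ProxyCommand={proxy}", f"{user}@{target_ip}"]


# Placeholder addresses; pass the list to subprocess.run() to open the session.
command = ssh_via_bastion("id_rsa.key", "203.0.113.10", "10.250.0.5")
print(shlex.join(command))
```

Passing the list directly to a process launcher avoids a second round of shell quoting for the ProxyCommand value.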
Cleanup
Do not forget to clean up the created resources. Otherwise Gardener will eventually fail to delete the Shoot.
5.3 - How to Debug a Pod
Introduction
Kubernetes offers powerful options to get more details about startup or runtime failures of pods, as described, for example, in Application Introspection and Debugging or Debug Pods and Replication Controllers.
In order to identify pods with potential issues, you could, e.g., run kubectl get pods --all-namespaces | grep -iv Running
to list only the pods which are not in the state Running
. A frequent error state is CrashLoopBackOff
, which indicates that a pod crashes right after the start. Kubernetes then tries to restart the pod, but often the pod startup fails again.
Here is a short list of possible reasons which might lead to a pod crash:
- Error during image pull caused by e.g. wrong/missing secrets or wrong/missing image
- The app runs in an error state caused e.g. by missing environment variables (ConfigMaps) or secrets
- Liveness probe failed
- Too high resource consumption (memory and/or CPU) or too strict quota settings
- Persistent volumes can’t be created/mounted
- The container image is not updated
Basically, the commands kubectl logs ...
and kubectl describe ...
with different parameters are used to get more detailed information. By calling e.g. kubectl logs --help
you can get more detailed information about the command and its parameters.
In the next sections you’ll find some basic approaches to find out what went wrong.
Remarks:
- Even if the pods seem to be running, as the status Running indicates, a high Restarts counter points to potential problems
- You can get a good overview of the troubleshooting process with the interactive tutorial Troubleshooting with Kubectl, which explains basic debugging activities
- The examples below are deployed into the namespace default. In case you want to change it, use the optional parameter --namespace <your-namespace> to select the target namespace. The examples require a Kubernetes release ≥ 1.8.
Prerequisites
Your deployment was successful (no logical/syntactical errors in the manifest files), but the pod(s) aren’t running.
Error Caused by Wrong Image Name
Start by running kubectl describe pod <your-pod> --namespace <your-namespace> to get detailed information about the pod startup.
In the Events section, you should get an error message like Failed to pull image ... and Reason: Failed. The pod is in state ImagePullBackOff.
The example below is based on a demo in the Kubernetes documentation. In all examples, the default
namespace is used.
First, perform a cleanup with:
kubectl delete pod termination-demo
Next, create a resource based on the yaml content below:
apiVersion: v1
kind: Pod
metadata:
  name: termination-demo
spec:
  containers:
  - name: termination-demo-container
    image: debiann
    command: ["/bin/sh"]
    args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
kubectl describe pod termination-demo lists the following content in the Events section:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
2m 2m 1 default-scheduler Normal Scheduled Successfully assigned termination-demo to ip-10-250-17-112.eu-west-1.compute.internal
2m 2m 1 kubelet, ip-10-250-17-112.eu-west-1.compute.internal Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-sgccm"
2m 1m 4 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Normal Pulling pulling image "debiann"
2m 1m 4 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Warning Failed Failed to pull image "debiann": rpc error: code = Unknown desc = Error: image library/debiann:latest not found
2m 54s 10 kubelet, ip-10-250-17-112.eu-west-1.compute.internal Warning FailedSync Error syncing pod
2m 54s 6 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Normal BackOff Back-off pulling image "debiann"
The error message with Reason: Failed tells you that pulling the image failed. A closer look at the image name reveals a misspelling (debiann instead of debian).
The App Runs in an Error State Caused, e.g., by Missing Environment Variables (ConfigMaps) or Secrets
This example illustrates the behavior in the case when the app expects environment variables but the corresponding Kubernetes artifacts are missing.
First, perform a cleanup with:
kubectl delete deployment termination-demo
kubectl delete configmaps app-env
Next, deploy the following manifest:
apiVersion: apps/v1beta2 # removed in Kubernetes 1.16; use apps/v1 on current clusters
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
Now, the command kubectl get pods lists the pod termination-demo-xxx in the state Error or CrashLoopBackOff. The command kubectl describe pod termination-demo-xxx tells you that there is no error during startup but gives no clue about what caused the crash.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
19m 19m 1 default-scheduler Normal Scheduled Successfully assigned termination-demo-5fb484867d-xz2x9 to ip-10-250-17-112.eu-west-1.compute.internal
19m 19m 1 kubelet, ip-10-250-17-112.eu-west-1.compute.internal Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-sgccm"
19m 19m 4 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Normal Pulling pulling image "debian"
19m 19m 4 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Normal Pulled Successfully pulled image "debian"
19m 19m 4 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Normal Created Created container
19m 19m 4 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Normal Started Started container
19m 14m 24 kubelet, ip-10-250-17-112.eu-west-1.compute.internal spec.containers{termination-demo-container} Warning BackOff Back-off restarting failed container
19m 4m 69 kubelet, ip-10-250-17-112.eu-west-1.compute.internal Warning FailedSync Error syncing pod
The command kubectl logs termination-demo-xxx gives access to the output that the application writes to stderr and stdout. In this case, you should get an output similar to:
/bin/sh: 1: cannot open : No such file
So you need to have a closer look at the application. In this case, the environment variable MYFILE is missing. To fix this issue, you could, for example, add a ConfigMap to your deployment as shown in the manifest listed below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
---
apiVersion: apps/v1beta2 # removed in Kubernetes 1.16; use apps/v1 on current clusters
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        envFrom:
        - configMapRef:
            name: app-env
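The cannot open message shown earlier is hard to connect to a missing variable. As an illustration only, here is a sketch of a fail-fast startup check that an application could run instead, assuming the application is written in Python. The helper names missing_env_vars and check_environment are hypothetical; only MYFILE comes from the example.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


def check_environment(required, environ=None):
    """Raise a descriptive error before the app does any real work."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))


# With MYFILE injected (for example from a ConfigMap via envFrom), the check passes.
check_environment(["MYFILE"], {"MYFILE": "/etc/profile"})
```

With such a check, kubectl logs would show the name of the missing variable rather than a shell error.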
Note that once you fix the error and re-run the scenario, you might still see the pod in a CrashLoopBackOff status.
This is because the container finishes the command sed ... and runs to completion, and with the restart policy Always that a Deployment requires, it is then restarted again and again. To keep the container in a Running status, a long-running task is required, e.g.:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-env
data:
  MYFILE: "/etc/profile"
  SLEEP: "5"
---
apiVersion: apps/v1beta2 # removed in Kubernetes 1.16; use apps/v1 on current clusters
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        # args: ["-c", "sed \"s/foo/bar/\" < $MYFILE"]
        args: ["-c", "while true; do sleep $SLEEP; echo sleeping; done;"]
        envFrom:
        - configMapRef:
            name: app-env
Too High Resource Consumption (Memory and/or CPU) or Too Strict Quota Settings
You can optionally specify the amount of memory and/or CPU your container gets during runtime. In case these settings are missing, the default requests are used: CPU: 0m (milli CPU) and RAM: 0Gi, which means there are no limits other than those of the node(s) itself. For more details, e.g. about how to configure limits, see Configure Default Memory Requests and Limits for a Namespace.
In case your application needs more resources, Kubernetes distinguishes between requests and limits settings: requests specify the guaranteed amount of a resource, whereas limits tell Kubernetes the maximum amount of the resource the container may use. Mathematically, both settings can be described by the relation 0 <= requests <= limits. For both settings you need to consider the total amount of resources your nodes provide. For a detailed description of the concept, see Resource Quality of Service in Kubernetes.
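The relation between the two settings can be checked mechanically. The following is a minimal sketch for CPU quantities only; the helper names cpu_to_millicores and is_consistent are hypothetical, and only the plain ("2", "0.5") and milli ("600m") notations are handled.

```python
def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "600m", "2" or "0.5" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def is_consistent(requests, limits):
    """Check the relation 0 <= requests <= limits for two CPU quantities."""
    return 0 <= cpu_to_millicores(requests) <= cpu_to_millicores(limits)


print(cpu_to_millicores("600m"), cpu_to_millicores("2"))
print(is_consistent("600m", "2"), is_consistent("2", "600m"))
```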
Use kubectl describe nodes to get a first overview of the resource consumption in your cluster. Of special interest are the figures indicating the amount of CPU and Memory Requests at the bottom of the output.
The next example demonstrates what happens if the CPU request is too high to be satisfied by your cluster.
First, perform a cleanup with:
kubectl delete deployment termination-demo
kubectl delete configmaps app-env
Next, adapt the cpu value in the yaml below to be slightly higher than the remaining CPU resources in your cluster and deploy this manifest. In this example, 600m (milli CPUs) are requested in a Kubernetes system with a single 2 core worker node, which results in an error message.
apiVersion: apps/v1beta2 # removed in Kubernetes 1.16; use apps/v1 on current clusters
kind: Deployment
metadata:
  name: termination-demo
  labels:
    app: termination-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: termination-demo
  template:
    metadata:
      labels:
        app: termination-demo
    spec:
      containers:
      - name: termination-demo-container
        image: debian
        command: ["/bin/sh"]
        args: ["-c", "sleep 10 && echo Sleep expired > /dev/termination-log"]
        resources:
          requests:
            cpu: "600m"
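The command kubectl get pods lists the pod termination-demo-xxx in the state Pending. More details on why this happens can be found by using the command kubectl describe pod termination-demo-xxx (the sample output below shows a request of 6 CPUs on a cluster with two nodes, so the figures differ from the manifest above):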
$ kubectl describe po termination-demo-fdb7bb7d9-mzvfw
Name: termination-demo-fdb7bb7d9-mzvfw
Namespace: default
...
Containers:
termination-demo-container:
Image: debian
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
sleep 10 && echo Sleep expired > /dev/termination-log
Requests:
cpu: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-t549m (ro)
Conditions:
Type Status
PodScheduled False
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 9s (x7 over 40s) default-scheduler 0/2 nodes are available: 2 Insufficient cpu.
You can find more details in the documents listed under Related Links.
Remarks:
- This example works similarly when specifying a too high request for memory
- In case you configured an autoscaler range when creating your Kubernetes cluster, another worker node will be spun up automatically if you haven’t reached the maximum number of worker nodes
- In case your app is running out of memory (the memory settings are too small), you will typically find an OOMKilled (Out Of Memory) message in the Events section of the kubectl describe pod ... output
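The Insufficient cpu event boils down to a per-node comparison: the scheduler looks at the sum of the CPU requests already placed on a node, not at the node's current usage. Here is a minimal sketch of that comparison with hypothetical node names and figures in millicores; the real scheduler applies many more filters.

```python
def nodes_with_room(request_millicores, nodes):
    """Return the nodes whose unreserved CPU covers the request.

    `nodes` maps a node name to (allocatable, already_requested) in millicores.
    """
    return [
        name
        for name, (allocatable, already_requested) in nodes.items()
        if allocatable - already_requested >= request_millicores
    ]


# Hypothetical figures, e.g. read from `kubectl describe nodes`.
NODES = {"node-a": (1920, 1500), "node-b": (1920, 1400)}

print(nodes_with_room(600, NODES))  # no node has 600m left
print(nodes_with_room(500, NODES))
```

An empty result corresponds to a pod that stays in the state Pending.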
The Container Image Is Not Updated
You applied a fix in your app, created a new container image and pushed it into your container repository. After redeploying your Kubernetes manifests, you expected to get the updated app, but the same bug is still present in the new deployment.
This behavior is related to how Kubernetes decides whether to pull a new image or to use the cached one.
In case you didn’t change the image tag, the default image pull policy IfNotPresent tells Kubernetes to use the cached image (see Images).
As a best practice, you should not use the tag latest and should change the image tag whenever you change anything in your image (see Configuration Best Practices).
For more information, see Container Image Not Updating.
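When imagePullPolicy is omitted, Kubernetes derives it from the image reference at the time the object is created: the tag latest or a missing tag leads to Always, any other tag or a digest leads to IfNotPresent. The rule can be sketched as follows; the function name default_image_pull_policy and the sample image names are hypothetical, and the parsing is deliberately simple.

```python
def default_image_pull_policy(image):
    """Return the pull policy Kubernetes defaults to for an image reference."""
    if "@" in image:  # pinned by digest, e.g. app@sha256:...
        return "IfNotPresent"
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry host and port
    if ":" not in last_segment:  # no tag is treated like :latest
        return "Always"
    tag = last_segment.rsplit(":", 1)[1]
    return "Always" if tag == "latest" else "IfNotPresent"


for image in ("debian", "debian:latest", "debian:12.5", "registry.example.com:5000/app"):
    print(image, default_image_pull_policy(image))
```

A fixed tag that is reused for a rebuilt image therefore keeps serving the cached image on nodes that already pulled it.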
Related Links
- Application Introspection and Debugging
- Debug Pods and Replication Controllers
- Logging Architecture
- Configure Default Memory Requests and Limits for a Namespace
- Managing Compute Resources for Containers
- Resource Quality of Service in Kubernetes
- Interactive Tutorial Troubleshooting with Kubectl
- Images
- Kubernetes Best Practices
5.4 - tail -f /var/log/my-application.log
Problem
One thing that always bothered me was that I couldn’t get logs of several pods at once with kubectl. A simple tail -f <path-to-logfile> isn’t possible at all. Certainly, you can use kubectl logs -f <pod-id>, but it doesn’t help if you want to monitor more than one pod at a time.
This is something you really need a lot, at least if you run several instances of a pod behind a deployment. This is even more so if you don’t have a Kibana or a similar setup.
Solution
Luckily, there are smart developers out there who always come up with solutions. The finding of the week is a small bash script that allows you to aggregate log files of several pods at the same time in a simple way. The script is called kubetail and is available on GitHub.
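To illustrate what aggregating logs of several pods means, here is a small sketch that merges already collected log lines by timestamp. It is not how kubetail is implemented. It assumes lines of the form "<timestamp> <message>", as produced by kubectl logs --timestamps, with timestamps of equal width so that they sort as plain strings; the function name merge_pod_logs and the sample data are hypothetical.

```python
import heapq


def merge_pod_logs(logs_by_pod):
    """Merge per-pod log lines into one stream ordered by timestamp.

    Each pod's lines must already be in chronological order.
    """
    def entries(pod, lines):
        for line in lines:
            timestamp, _, message = line.partition(" ")
            yield timestamp, pod, message

    streams = [entries(pod, lines) for pod, lines in logs_by_pod.items()]
    return [f"[{pod}] {ts} {msg}" for ts, pod, msg in heapq.merge(*streams)]


# Hypothetical sample data.
LOGS = {
    "web-1": ["2024-01-01T10:00:01Z start", "2024-01-01T10:00:04Z ready"],
    "web-2": ["2024-01-01T10:00:02Z start"],
}

for line in merge_pod_logs(LOGS):
    print(line)
```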
6 - Applications
6.1 - Shoot Pod Autoscaling Best Practices
Introduction
There are two types of pod autoscaling in Kubernetes: Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA). HPA (implemented as part of the kube-controller-manager) scales the number of pod replicas, while VPA (implemented as an independent community project) adjusts the CPU and memory requests for the pods. Both types of autoscaling aim to optimize resource usage/costs and maintain the performance and (high) availability of applications running on Kubernetes.
Horizontal Pod Autoscaling (HPA)
Horizontal Pod Autoscaling involves increasing or decreasing the number of pod replicas in a deployment, replica set, stateful set, or anything really with a scale subresource that manages pods. HPA adjusts the number of replicas based on specified metrics, such as CPU or memory average utilization (usage divided by requests; most common) or average value (usage; less common). When the demand on your application increases, HPA automatically scales out the number of pods to meet the demand. Conversely, when the demand decreases, it scales in the number of pods to reduce resource usage.
HPA targets (mostly stateless) applications where adding more instances of the application can linearly increase the ability to handle additional load. It is very useful for applications that experience variable traffic patterns, as it allows for real-time scaling without the need for manual intervention.
ℹ️ Note
HPA continuously monitors the metrics of the targeted pods and adjusts the number of replicas based on the observed metrics. It operates solely on the current metrics when it calculates the averages across all pods, meaning it reacts to the immediate resource usage without considering past trends or patterns. Also, all pods are treated equally based on the average metrics. This could potentially lead to situations where some pods are under high load while others are underutilized. Therefore, particular care must be applied to (fair) load-balancing (connection vs. request vs. actual resource load balancing are crucial).
A Few Words on the Cluster-Proportional (Horizontal) Autoscaler (CPA) and the Cluster-Proportional Vertical Autoscaler (CPVA)
Besides HPA and VPA, CPA and CPVA are further options for scaling horizontally or vertically (neither is deployed by Gardener and must be deployed by the user). Unlike HPA and VPA, CPA and CPVA do not monitor the actual pod metrics, but scale solely on the number of nodes or CPU cores in the cluster. While this approach may be helpful and sufficient in a few rare cases, it is often a risky and crude scaling scheme that we do not recommend. More often than not, cluster-proportional scaling results in either under- or over-reserving your resources.
Vertical Pod Autoscaling (VPA)
Vertical Pod Autoscaling, on the other hand, focuses on adjusting the CPU and memory resources allocated to the pods themselves. Instead of changing the number of replicas, VPA tweaks the resource requests (and limits, but only proportionally, if configured) for the pods in a deployment, replica set, stateful set, daemon set, or anything really with a scale subresource that manages pods. This means that each pod can be given more or fewer resources as needed.
VPA is very useful for optimizing the resource requests of pods that have dynamic resource needs over time. It does so by mutating pod requests (unfortunately, not in-place). Therefore, in order to apply new recommendations, pods that are “out of bounds” (i.e. below a configured/computed lower or above a configured/computed upper recommendation percentile) will be evicted proactively, but also pods that are “within bounds” may be evicted after a grace period. The corresponding higher-level replication controller will then recreate a new pod that VPA will then mutate to set the currently recommended requests (and proportional limits, if configured).
ℹ️ Note
VPA continuously monitors all targeted pods and calculates recommendations based on their usage (one recommendation for the entire target). This calculation is influenced by configurable percentiles, with a greater emphasis on recent usage data and a gradual decrease (=decay) in the relevance of older data. However, this means that VPA doesn’t take into account the individual needs of single pods: eventually, all pods will receive the same recommendation, which may lead to considerable resource waste. Ideally, VPA would update pods in-place depending on their individual needs, but individual recommendations are not part of its design, even if in-place updates get implemented, which may be years away for VPA based on current activity on the component.
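The combination of a percentile with decaying weights can be pictured with a small sketch. This is a strong simplification and not VPA's implementation, which, to our understanding, works on decaying histograms: here, each usage sample is weighted by its age, and the smallest usage value that covers the requested share of the total weight is returned. The function name decayed_percentile, the half-life parameter, and all figures are hypothetical.

```python
def decayed_percentile(samples, percentile, half_life_hours):
    """Weighted percentile over (age_hours, usage) samples.

    A sample loses half of its weight every `half_life_hours`.
    """
    weighted = sorted((usage, 0.5 ** (age / half_life_hours)) for age, usage in samples)
    threshold = percentile * sum(weight for _, weight in weighted)
    cumulative = 0.0
    for usage, weight in weighted:
        cumulative += weight
        if cumulative >= threshold:
            return usage
    return weighted[-1][0]


# An old spike of 1000 (two half-lives ago) next to two recent, lower samples.
aged = [(48, 1000), (0, 200), (0, 250)]
fresh = [(0, 1000), (0, 200), (0, 250)]

print(decayed_percentile(aged, 0.8, half_life_hours=24))
print(decayed_percentile(fresh, 0.8, half_life_hours=24))
```

In the first call the old spike has lost most of its weight and no longer determines the result; in the second call it still does.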
Selecting the Appropriate Autoscaler
Before deciding on an autoscaling strategy, it’s important to understand the characteristics of your application:
- Interruptibility: Most importantly, if the clients of your workload are too sensitive to disruptions/cannot cope well with terminating pods, then maybe neither HPA nor VPA is an option (both HPA and VPA cause pods and connections to be terminated, though VPA even more frequently). Clients must retry on disruptions, which is a reasonable ask in a highly dynamic (and self-healing) environment such as Kubernetes, but this is often not respected (or expected) by your clients (they may not know or care you run the workload in a Kubernetes cluster and have different expectations of the stability of the workload unless you communicated those through SLIs/SLOs/SLAs).
- Statelessness: Is your application stateless or stateful? Stateless applications are typically better candidates for HPA as they can be easily scaled out by adding more replicas without worrying about maintaining state.
- Traffic Patterns: Does your application experience variable traffic? If so, HPA can help manage these fluctuations by adjusting the number of replicas to handle the load.
- Resource Usage: Does your application’s resource usage change over time? VPA can adjust the CPU and memory reservations dynamically, which is beneficial for applications with non-uniform resource requirements.
- Scalability: Can your application handle increased load by scaling vertically (more resources per pod) or does it require horizontal scaling (more pod instances)?
HPA is the right choice if:
- Your application is stateless and can handle increased load by adding more instances.
- You experience short-term fluctuations in traffic that require quick scaling responses.
- You want to maintain a specific performance metric, such as requests per second per pod.
VPA is the right choice if:
- Your application’s resource requirements change over time, and you want to optimize resource usage without manual intervention.
- You want to avoid the complexity of managing resource requests for each pod, especially when they run code where it’s impossible for you to suggest static requests.
In essence:
- For applications that can handle increased load by simply adding more replicas, HPA should be used to handle short-term fluctuations in load by scaling the number of replicas.
- For applications that require more resources per pod to handle additional work, VPA should be used to adjust the resource allocation for longer-term trends in resource usage.
Consequently, if both cases apply (VPA often applies), HPA and VPA can also be combined. However, combining both, especially on the same metrics (CPU and memory), requires understanding and care to avoid conflicts and ensure that the autoscaling actions do not interfere with and rather complement each other. For more details, see Combining HPA and VPA.
Horizontal Pod Autoscaler (HPA)
HPA operates by monitoring resource metrics for all pods in a target. It computes the desired number of replicas from the current average metrics and the desired user-defined metrics as follows:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
HPA checks the metrics at regular intervals, which can be configured by the user. Several types of metrics are supported (classical resource metrics like CPU and memory, but also custom and external metrics like requests per second or queue length can be configured, if available). If a scaling event is necessary, HPA adjusts the replica count for the targeted resource.
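The formula can be tried out with a few lines of code. This is a minimal sketch of the calculation only: the function name desired_replicas and the sample figures are hypothetical, and clamping the result to the minimum and maximum replica count reflects the bounds configured in an HPA resource.

```python
import math


def desired_replicas(current_replicas, current_metric, desired_metric,
                     min_replicas=1, max_replicas=10):
    """ceil[currentReplicas * (currentMetricValue / desiredMetricValue)], clamped."""
    raw = math.ceil(current_replicas * (current_metric / desired_metric))
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas averaging 3 CPUs against a target of 2 CPUs per replica.
print(desired_replicas(4, 3.0, 2.0))
# The same target with an average of only 0.5 CPUs.
print(desired_replicas(4, 0.5, 2.0))
```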
Defining an HPA Resource
To configure HPA, you need to create an HPA resource in your cluster. This resource specifies the target to scale, the metrics to be used for scaling decisions, and the desired thresholds. Here’s an example of an HPA configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: foo-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 2
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 8G
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 1800
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300
In this example, HPA is configured to scale foo-deployment based on pod average CPU and memory usage. It will maintain an average CPU and memory usage (not utilization, which is usage divided by requests!) across all replicas of 2 CPUs and 8G or lower with as few replicas as possible. The number of replicas will be scaled between a minimum of 1 and a maximum of 10 based on this target.
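With more than one metric, HPA evaluates the formula per metric and uses the largest resulting replica count. A minimal sketch for the two targets of this example (2 CPUs and 8G, i.e. 8e9 bytes); the function names and the usage figures are hypothetical.

```python
import math

# averageValue targets of the example: 2 CPUs and 8G (8e9 bytes) per pod.
TARGETS = {"cpu": 2.0, "memory": 8e9}


def replicas_per_metric(current_replicas, average_usage):
    """Apply the HPA formula separately to every metric."""
    return {
        name: math.ceil(current_replicas * (average_usage[name] / target))
        for name, target in TARGETS.items()
    }


def desired_replicas_multi(current_replicas, average_usage,
                           min_replicas=1, max_replicas=10):
    """Take the largest per-metric result and clamp it to the configured range."""
    wanted = max(replicas_per_metric(current_replicas, average_usage).values())
    return max(min_replicas, min(max_replicas, wanted))


# 3 replicas with low CPU usage but memory usage above the target.
usage = {"cpu": 1.0, "memory": 12e9}
print(replicas_per_metric(3, usage))
print(desired_replicas_multi(3, usage))
```

Here memory, not CPU, determines the replica count.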
For some time now, you can also configure the autoscaling based on the resource usage of individual containers, not only on the resource usage of the entire pod. All you need to do is to switch the type from Resource to ContainerResource and specify the container name.
In the official documentation ([1] and [2]) you will find examples with average utilization (averageUtilization), not average usage (averageValue), but this is not particularly helpful, especially if you plan to combine HPA with VPA on the same metrics (generally discouraged in the documentation). If you want to safely combine both on the same metrics, you should scale on average usage (averageValue) as shown above. For more details, see Combining HPA and VPA.
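The difference matters because utilization is usage divided by requests: as soon as the requests change, for example because VPA adjusts them, the utilization changes although the pods consume exactly the same resources. A minimal sketch with hypothetical figures and function names:

```python
import math


def utilization(usage, requests):
    """Utilization is usage divided by requests."""
    return usage / requests


def replicas_for(current_replicas, current, target):
    """The HPA formula for a single metric."""
    return math.ceil(current_replicas * (current / target))


usage = 1.5  # cores used per pod on average; identical in all three cases

# Scaling on utilization (target 75%): the result depends on the requests.
with_small_requests = replicas_for(4, utilization(usage, requests=1.0), 0.75)
with_large_requests = replicas_for(4, utilization(usage, requests=2.0), 0.75)

# Scaling on usage (target 1.5 cores): the requests play no role.
on_usage = replicas_for(4, usage, 1.5)

print(with_small_requests, with_large_requests, on_usage)
```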
Finally, the behavior section influences how fast you scale up and down. Most of the time (depends on your workload), you will want to scale out faster than you scale in. In this example, the configuration will trigger a scale-out only after observing the need to scale out for 30s (stabilizationWindowSeconds) and will then only scale out at most 100% (value + type) of the current number of replicas every 60s (periodSeconds). The configuration will trigger a scale-in only after observing the need to scale in for 1800s (stabilizationWindowSeconds) and will then only scale in at most 1 pod (value + type) every 300s (periodSeconds). As you can see, scale-out happens quicker than scale-in in this example.
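The effect of the two policies on a single scaling step can be sketched as follows. This is a simplification that ignores the stabilization windows and the bookkeeping per period; the function names cap_scale_out and cap_scale_in are hypothetical.

```python
import math


def cap_scale_out(current_replicas, desired_replicas, percent):
    """Percent policy: add at most `percent` of the current replicas per period."""
    limit = current_replicas + math.ceil(current_replicas * percent / 100)
    return min(desired_replicas, limit)


def cap_scale_in(current_replicas, desired_replicas, pods):
    """Pods policy: remove at most `pods` replicas per period."""
    return max(desired_replicas, current_replicas - pods)


# Values of the example: 100% per 60s for scale-out, 1 pod per 300s for scale-in.
print(cap_scale_out(4, 10, percent=100))
print(cap_scale_in(8, 3, pods=1))
```

Growing from 4 to 10 replicas therefore takes two periods (4, 8, 10), whereas shrinking from 8 to 3 takes five.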
HPA (actually KCM) Options
HPA is a function of the kube-controller-manager (KCM).
You can read up on the full KCM options online and set most of them conveniently in your Gardener shoot cluster spec:
- downscaleStabilization (default 5m): HPA will scale out whenever the formula (in accordance with the behavior section, if present in the HPA resource) yields a higher replica count, but it won’t scale in just as eagerly. This option lets you define a trailing time window that HPA must check and only if the recommended replica count is consistently lower throughout the entire time window, HPA will scale in (in accordance with the behavior section, if present in the HPA resource). If at any point in time in that trailing time window the recommended replica count isn’t lower, scale-in won’t happen. This setting is just a default, if nothing is defined in the behavior section of an HPA resource. The default for the upscale stabilization is 0s and it cannot be set via a KCM option (downscale stabilization was historically more important than upscale stabilization and when later the behavior sections were added to the HPA resources, upscale stabilization remained missing from the KCM options).
- tolerance (default +/-10%): HPA will not scale out or in if the desired replica count is (mathematically as a float) near the actual replica count (see source code for details), which is a form of hysteresis to avoid replica flapping around a threshold.
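Both options can be pictured with two small helpers. This is a simplified sketch under stated assumptions, not the controller code: the tolerance check compares the metric ratio with 1.0, and the stabilization picks the highest recommendation seen in the trailing window, so a scale-in only goes as far as that value. The function names and figures are hypothetical.

```python
def within_tolerance(current_metric, desired_metric, tolerance=0.1):
    """True if the metric ratio is close enough to 1.0 that no scaling happens."""
    return abs(current_metric / desired_metric - 1.0) <= tolerance


def stabilized_replicas(recommendations_in_window):
    """Scale in only as far as the highest recommendation in the window."""
    return max(recommendations_in_window)


print(within_tolerance(2.1, 2.0), within_tolerance(2.6, 2.0))
# One higher recommendation inside the window is enough to prevent a scale-in below 5.
print(stabilized_replicas([5, 3, 3]), stabilized_replicas([3, 3, 2]))
```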
There are a few more configurable options of lesser interest:
- syncPeriod (default 15s): How often HPA retrieves the pods and metrics, and thus how often it recomputes and sets the desired replica count.
- cpuInitializationPeriod (default 5m) and initialReadinessDelay (default 30s): Both settings only affect whether or not CPU metrics are considered for scaling decisions. They can be easily misinterpreted as the official docs are somewhat hard to read (see source code for details, which is more readable, if you ignore the comments). Normally, you have little reason to modify them, but here is what they do:
  - cpuInitializationPeriod: Defines a grace period after a pod starts during which HPA won’t consider CPU metrics of the pod for scaling if the pod is either not ready or it is ready, but a given CPU metric is older than the last state transition (to ready). This is to ignore CPU metrics that predate the current readiness while still in initialization to not make scaling decisions based on potentially misleading data. If the pod is ready and a CPU metric was collected after it became ready, it is considered also within this grace period.
  - initialReadinessDelay: Defines another grace period after a pod starts during which HPA won’t consider CPU metrics of the pod for scaling if the pod is not ready and it became not ready within this grace period (the docs/comments want to check whether the pod was ever ready, but the code only checks whether the pod condition last transition time to not ready happened within that grace period which it could have from being ready or simply unknown before). This is to ignore not (ever have been) ready pods while still in initialization to not make scaling decisions based on potentially misleading data. If the pod is ready, it is considered also within this grace period.
So, regardless of the values of these settings, if a pod is reporting ready and it has a CPU metric from the time after it became ready, that pod and its metric will be considered. This holds true even if the pod becomes ready very early into its initialization. These settings cannot be used to “black-out” pods for a certain duration before being considered for scaling decisions. Instead, if it is your goal to ignore a potentially resource-intensive initialization phase that could wrongly lead to further scale-out, you would need to configure your pods to not report as ready until that resource-intensive initialization phase is over.
Considerations When Using HPA
- Selection of metrics: Besides CPU and memory, HPA can also target custom or external metrics. Pick those (in addition or exclusively), if you guarantee certain SLOs in your SLAs.
- Targeting usage or utilization: HPA supports usage (absolute) and utilization (relative). Utilization is often preferred in simple examples, but usage is more precise and versatile.
- Compatibility with VPA: Care must be taken when using HPA in conjunction with VPA, as they can potentially interfere with each other’s scaling decisions.
Vertical Pod Autoscaler (VPA)
VPA operates by monitoring resource metrics for all pods in a target. It computes a resource requests recommendation from the historic and current resource metrics. VPA checks the metrics at regular intervals, which can be configured by the user. Only CPU and memory are supported. If VPA detects that a pod’s resource allocation is too high or too low, it may evict pods (if within the permitted disruption budget), which will trigger the creation of a new pod by the corresponding higher-level replication controller, which will then be mutated by VPA to match the resource requests recommendation. This happens in three different components that work together:
- VPA Recommender: The Recommender observes the historic and current resource metrics of pods and generates recommendations based on this data.
- VPA Updater: The Updater component checks the recommendations from the Recommender and decides whether any pod’s resource requests need to be updated. If an update is needed, the Updater will evict the pod.
- VPA Admission Controller: When a pod is (re-)created, the Admission Controller modifies the pod’s resource requests based on the recommendations from the Recommender. This ensures that the pod starts with the optimal amount of resources.
Since VPA doesn’t support in-place updates, pods will be evicted. You will want to control voluntary evictions by means of Pod Disruption Budgets (PDBs). Please make yourself familiar with those and use them.
ℹ️ Note
PDBs will not always work as expected and can also get in your way, e.g. if the PDB is violated or would be violated, it may possibly block evictions that would actually help your workload, e.g. to get a pod out of an OOMKilled CrashLoopBackOff (if the PDB is or would be violated, not even unhealthy pods would be evicted as they could theoretically become healthy again, which VPA doesn’t know). In order to overcome this issue, it is now possible (alpha since Kubernetes v1.26 in combination with the feature gate PDBUnhealthyPodEvictionPolicy on the API server, beta and enabled by default since Kubernetes v1.27) to configure the so-called unhealthy pod eviction policy. The default is still IfHealthyBudget as a change in default would have changed the behavior (as described above), but you can now also set AlwaysAllow at the PDB (spec.unhealthyPodEvictionPolicy). For more information, please check out this discussion, the PR and this document and balance the pros and cons for yourself. In short, the new AlwaysAllow option is probably the better choice in most of the cases while IfHealthyBudget is useful only if you have frequent temporary transitions or for special cases where you have already implemented controllers that depend on the old behavior.
Defining a VPA Resource
To configure VPA, you need to create a VPA resource in your cluster. This resource specifies the target to scale, the metrics to be used for scaling decisions, and the policies for resource updates. Here’s an example of a VPA configuration:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: foo-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: foo-deployment
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: foo-container
      controlledValues: RequestsOnly
      minAllowed:
        cpu: 50m
        memory: 200M
      maxAllowed:
        cpu: 4
        memory: 16G
In this example, VPA is configured to scale foo-deployment requests (RequestsOnly) from 50m cores (minAllowed) up to 4 cores (maxAllowed) and 200M memory (minAllowed) up to 16G memory (maxAllowed) automatically (updateMode). VPA doesn’t support in-place updates, so in updateMode Auto it will evict pods under certain conditions and then mutate the requests (and possibly limits if you omit controlledValues or set it to RequestsAndLimits, which is the default) of upcoming new pods.
Multiple update modes exist. They influence eviction and mutation. The most important ones are:
- Off: In this mode, recommendations are computed, but never applied. This mode is useful, if you want to learn more about your workload or if you have a custom controller that depends on VPA’s recommendations but shall act instead of VPA.
- Initial: In this mode, recommendations are computed and applied, but pods are never proactively evicted to enforce new recommendations over time. This mode is useful, if you want to control pod evictions yourself (similar to the StatefulSet updateStrategy OnDelete) or your workload is sensitive to evictions, e.g. some brownfield singleton application or a daemon set pod that is critical for the node.
- Auto (default): In this mode, recommendations are computed, applied, and pods are even proactively evicted to enforce new recommendations over time. This applies recommendations continuously without you having to worry too much.
As mentioned, controlledValues influences whether only requests or requests and limits are scaled:
- RequestsOnly: Updates only requests and doesn’t change limits. Useful if you have defined absolute limits (unrelated to the requests).
- RequestsAndLimits (default): Updates requests and proportionally scales limits along with the requests. Useful if you have defined relative limits (related to the requests). In this case, the gap between requests and limits should be either zero for QoS Guaranteed or small for QoS Burstable to avoid useless (way beyond the threshold of unhealthy behavior) or absurd (larger than node capacity) values.
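To make the difference between absolute and relative limits concrete, here is a minimal Python sketch of proportional limit scaling. It assumes that RequestsAndLimits preserves the original limit-to-request ratio, as described above. The function name scale_limit and the millicore numbers are purely illustrative and not part of VPA.

```python
def scale_limit(old_request_m: int, old_limit_m: int, new_request_m: int) -> int:
    """Return the limit (in millicores) that keeps the old limit-to-request ratio."""
    if old_request_m <= 0:
        raise ValueError("old_request_m must be positive")
    # Integer arithmetic avoids floating-point rounding noise.
    return new_request_m * old_limit_m // old_request_m


# Relative limit (20% above the request): the gap stays at 20% after scaling.
print(scale_limit(old_request_m=1000, old_limit_m=1200, new_request_m=3000))  # 3600

# Absolute limit (4 cores on a 100m request): scaling the request to 2 cores
# inflates the limit to 80 cores, far beyond typical node capacity.
print(scale_limit(old_request_m=100, old_limit_m=4000, new_request_m=2000))  # 80000
```

The second call shows why absolute limits are better combined with RequestsOnly.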
VPA doesn’t offer many more settings that can be tuned per VPA resource than you see above (unlike HPA with its behavior section). However, there is one more that isn’t shown above, which allows scaling only up or only down (evictionRequirements[].changeRequirement), in case you need that, e.g. to provide resources when needed, but avoid disruptions otherwise.
VPA Options
VPA is an independent community project that consists of a recommender (computing target recommendations and bounds), an updater (evicting pods that are out of recommendation bounds), and an admission controller (mutating webhook applying the target recommendation to newly created pods). As such, they have independent options.
VPA Recommender Options
You can read up on the full VPA recommender options online and set some of them conveniently in your Gardener shoot cluster spec:
- recommendationMarginFraction (default 15%): Safety margin that will be added to the recommended requests.
- targetCPUPercentile (default 90%): CPU usage percentile that will be targeted with the CPU recommendation (i.e. recommendation will “fit” e.g. 90% of the observed CPU usages). This setting is relevant for balancing your requests reservations vs. your costs. If you want to reduce costs, you can reduce this value (higher risk because of potential under-reservation, but lower costs), because CPU is compressible, but then VPA may lack the necessary signals for scale-up as throttling on an otherwise fully utilized node will go unnoticed by VPA. If you want to err on the safe side, you can increase this value, but you will then increasingly target a worst-case scenario, quickly (maybe even exponentially) increasing the costs.
- targetMemoryPercentile (default 90%): Memory usage percentile that will be targeted with the memory recommendation (i.e. recommendation will “fit” e.g. 90% of the observed memory usages). This setting is relevant for balancing your requests reservations vs. your costs. If you want to reduce costs, you can reduce this value (higher risk because of potential under-reservation, but lower costs), because OOMs will trigger bump-ups, but those will disrupt the workload. If you want to err on the safe side, you can increase this value, but you will then increasingly target a worst-case scenario, quickly (maybe even exponentially) increasing the costs.
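The interplay of the target percentile and the safety margin can be illustrated with a deliberately simplified Python sketch. It is not the recommender's actual algorithm: it assumes a plain nearest-rank percentile over raw usage samples, whereas the real recommender aggregates usage differently. The function name simplified_recommendation and the sample data are hypothetical.

```python
import math


def simplified_recommendation(samples, percentile_pct=90, margin_pct=15):
    """Nearest-rank percentile of raw usage samples plus a safety margin."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # 1-based nearest-rank position of the requested percentile.
    rank = max(1, math.ceil(percentile_pct * len(ordered) / 100))
    return ordered[rank - 1] * (100 + margin_pct) / 100


cpu_usage_m = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # made-up millicore samples
print(simplified_recommendation(cpu_usage_m))                     # 103.5
print(simplified_recommendation(cpu_usage_m, percentile_pct=50))  # 57.5
```

Lowering the percentile from 90 to 50 nearly halves the recommendation in this toy data set, which is the cost versus under-reservation trade-off described above.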
There are a few more configurable options of lesser interest:
- recommenderInterval (default 1m): How often VPA retrieves the pods and metrics and, consequently, how often it recomputes the recommendations and bounds.
There are many more options that you can only configure if you deploy your own VPA and which we will not discuss here, but you can check them out here.
ℹ️ Note
Due to an implementation detail (smallest bucket size), VPA cannot create recommendations below 10m cores and 10M memory even if minAllowed is lower.
VPA Updater Options
You can read up on the full VPA updater options online and set some of them conveniently in your Gardener shoot cluster spec:
- evictAfterOOMThreshold (default 10m): Pods where at least one container OOMs within this time period since its start will be actively evicted, which will implicitly apply the new target recommendation that will have been bumped up after OOMKill. Please note, the kubelet may evict pods even before an OOM, but only if kube-reserved is underrun, i.e. node-level resources are running low. In these cases, eviction will happen first by pod priority and second by how much the usage overruns the requests.
- evictionTolerance (default 50%): Defines a threshold below which no further eligible pod will be evicted, i.e. limits how many eligible pods may be in eviction in parallel (but at least 1). The threshold is computed as follows: running - evicted > replicas - tolerance. Example: 10 replicas, 9 running, 8 eligible for eviction, 20% tolerance with 10 replicas which amounts to 2 pods, and no pod evicted in this round yet, then 9 - 0 > 10 - 2 is true and a pod would be evicted, but the next one would be in violation as 9 - 1 = 10 - 2 and no further pod would be evicted in this round.
- evictionRateBurst (default 1): Defines how many eligible pods may be evicted in one go.
- evictionRateLimit (default disabled): Defines how many eligible pods may be evicted per second (a value of 0 or -1 disables the rate limiting).
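The evictionTolerance inequality can be traced with a small Python sketch. It only models the condition running - evicted > replicas - tolerance from the description above. The function name is hypothetical, rounding the tolerance down to whole pods is an assumption, and the "at least 1" safeguard is not modelled.

```python
def evictions_allowed_this_round(replicas, running, eligible, tolerance_pct):
    """Count evictions while `running - evicted > replicas - tolerance` holds."""
    tolerance = replicas * tolerance_pct // 100  # tolerance expressed in pods
    evicted = 0
    while evicted < eligible and running - evicted > replicas - tolerance:
        evicted += 1
    return evicted


# Example from the text: 10 replicas, 9 running, 8 eligible, 20% tolerance.
print(evictions_allowed_this_round(10, 9, 8, 20))    # 1
# Default 50% tolerance with all 10 replicas running and eligible.
print(evictions_allowed_this_round(10, 10, 10, 50))  # 5
```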
In general, avoid modifying these eviction settings unless you have good reasons and try to rely on Pod Disruption Budgets (PDBs) instead. However, PDBs are not available for daemon sets.
There are a few more configurable options of lesser interest:
- updaterInterval (default 1m): How often VPA evicts the pods.
There are many more options that you can only configure if you deploy your own VPA and which we will not discuss here, but you can check them out here.
Considerations When Using VPA
- Initial Resource Estimates: VPA requires historical resource usage data to base its recommendations on. Until they kick in, your initial resource requests apply and should be sensible.
- Pod Disruption: When VPA adjusts the resources for a pod, it may need to “recreate” the pod, which can cause temporary disruptions. This should be taken into account.
- Compatibility with HPA: Care must be taken when using VPA in conjunction with HPA, as they can potentially interfere with each other’s scaling decisions.
Combining HPA and VPA
HPA and VPA serve different purposes and operate on different axes of scaling. HPA increases or decreases the number of pod replicas based on metrics like CPU or memory usage, effectively scaling the application out or in. VPA, on the other hand, adjusts the CPU and memory reservations of individual pods, scaling the application up or down.
When used together, these autoscalers can provide both horizontal and vertical scaling. However, they can also conflict with each other if used on the same metrics (e.g. both on CPU or both on memory). In particular, if VPA adjusts the requests, the utilization, i.e. the ratio between usage and requests, will approach 100% (for various reasons not exactly right, but for this consideration, close enough), which may trigger HPA to scale out, if it’s configured to scale on utilization below 100% (often seen in simple examples), which will spread the load across more pods, which may trigger VPA again to adjust the requests to match the new pod usages.
This is a feedback loop and it stems from HPA’s method of calculating the desired number of replicas, which is:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
If desiredMetricValue is utilization and VPA adjusts the requests, which changes the utilization, this may inadvertently trigger HPA and create said feedback loop. On the other hand, if desiredMetricValue is usage and VPA adjusts the requests now, this will have no impact on HPA anymore (HPA will always influence VPA, but we can control whether VPA influences HPA).
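The feedback loop can be reproduced with the formula above in a few lines of Python. This is a sketch under simplifying assumptions: VPA is assumed to set requests exactly to the observed usage (so utilization is 100% after each adjustment), and HPA's tolerance and its minimum and maximum replica settings are ignored. The helper utilization_feedback_loop is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, desired_metric):
    """HPA formula: ceil(currentReplicas * (currentMetricValue / desiredMetricValue))."""
    return math.ceil(current_replicas * (current_metric / desired_metric))


def utilization_feedback_loop(replicas, target_utilization_pct, rounds):
    """Replica counts when VPA keeps resetting requests to the observed usage."""
    history = [replicas]
    for _ in range(rounds):
        # Simplification: after VPA acted, requests equal usage, so utilization is 100%.
        replicas = desired_replicas(replicas, 100, target_utilization_pct)
        history.append(replicas)
    return history


# Utilization target of 80%: every VPA adjustment triggers another scale-out.
print(utilization_feedback_loop(replicas=4, target_utilization_pct=80, rounds=4))
# [4, 5, 7, 9, 12]

# Usage target of 500m with 500m observed per pod: stable,
# no matter what VPA does to the requests.
print(desired_replicas(4, 500, 500))  # 4
```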
Therefore, to safely combine HPA and VPA, consider the following strategies:
- Configure HPA and VPA on different metrics: One way to avoid conflicts is to use HPA and VPA based on different metrics. For instance, you could configure HPA to scale based on requests per second (or another representative custom/external metric) and VPA to adjust CPU and memory requests. This way, each autoscaler operates independently based on its specific metric(s).
- Configure HPA to scale on usage, not utilization, when used with VPA: Another way to avoid conflicts is to use HPA not on average utilization (averageUtilization), but instead on average usage (averageValue) as replicas driver, which is an absolute metric (requests don’t affect usage). This way, you can combine both autoscalers even on the same metrics.
Pod Autoscaling and Cluster Autoscaler
Autoscaling within Kubernetes can be implemented at different levels: pod autoscaling (HPA and VPA) and cluster autoscaling (CA). While pod autoscaling adjusts the number of pod replicas or their resource reservations, cluster autoscaling focuses on the number of nodes in the cluster, so that your pods can be hosted. If your workload isn’t static, and especially if you make use of pod autoscaling, this only works if you have sufficient node capacity available. The most effective way to achieve that, without running a worst-case number of nodes, is to configure burstable worker pools in your shoot spec, i.e. define a true minimum node count and a worst-case maximum node count and leave the node autoscaling to Gardener, which internally uses the Cluster Autoscaler to provision and deprovision nodes as needed.
Cluster Autoscaler automatically adjusts the number of nodes by adding or removing nodes based on the demands of the workloads and the available resources. It interacts with the cloud provider’s APIs to provision or deprovision nodes as needed. Cluster Autoscaler monitors the utilization of nodes and the scheduling of pods. If it detects that pods cannot be scheduled due to a lack of resources, it will trigger the addition of new nodes to the cluster. Conversely, if nodes are underutilized for some time and their pods can be placed on other nodes, it will remove those nodes to reduce costs and improve resource efficiency.
Best Practices:
- Resource Buffering: Maintain a buffer of resources to accommodate temporary spikes in demand without waiting for node provisioning. This can be done by deploying pods with low priority that can be preempted when real workloads require resources. This helps in faster pod scheduling and avoids delays in scaling out or up.
- Pod Disruption Budgets (PDBs): Use PDBs to ensure that during scale-down events, the availability of applications is maintained as the Cluster Autoscaler will not voluntarily evict a pod if a PDB would be violated.
Interesting CA Options
CA can be configured in your Gardener shoot cluster spec globally and also in parts per worker pool:
- Can only be configured globally:
  - expander (default least-waste): Defines the “expander” algorithm to use during scale-up, see FAQ.
  - scaleDownDelayAfterAdd (default 1h): Defines how long after scaling up a node, a node may be scaled down.
  - scaleDownDelayAfterFailure (default 3m): Defines how long after scaling down a node failed, scaling down will be resumed.
  - scaleDownDelayAfterDelete (default 0s): Defines how long after scaling down a node, another node may be scaled down.
- Can be configured globally and also overwritten individually per worker pool:
  - scaleDownUtilizationThreshold (default 50%): Defines the threshold below which a node becomes eligible for scaling down.
  - scaleDownUnneededTime (default 30m): Defines the trailing time window the node must be consistently below a certain utilization threshold before it can finally be scaled down.
There are many more options that you can only configure if you deploy your own CA and which we will not discuss here, but you can check them out here.
Importance of Monitoring
Monitoring is a critical component of autoscaling for several reasons:
- Performance Insights: It provides insights into how well your autoscaling strategy is meeting the performance requirements of your applications.
- Resource Utilization: It helps you understand resource utilization patterns, enabling you to optimize resource allocation and reduce waste.
- Cost Management: It allows you to track the cost implications of scaling actions, helping you to maintain control over your cloud spending.
- Troubleshooting: It enables you to quickly identify and address issues with autoscaling, such as unexpected scaling behavior or resource bottlenecks.
To effectively monitor autoscaling, you should leverage the following tools and metrics:
- Kubernetes Metrics Server: Collects resource metrics from kubelets and provides them to HPA and VPA for autoscaling decisions (automatically provided by Gardener).
- Prometheus: An open-source monitoring system that can collect and store custom metrics, providing a rich dataset for autoscaling decisions.
- Grafana/Plutono: A visualization tool that integrates with Prometheus to create dashboards for monitoring autoscaling metrics and events.
- Cloud Provider Tools: Most cloud providers offer native monitoring solutions that can be used to track the performance and costs associated with autoscaling.
Key metrics to monitor include:
- CPU and Memory Utilization: Track the resource utilization of your pods and nodes to understand how they correlate with scaling events.
- Pod Count: Monitor the number of pod replicas over time to see how HPA is responding to changes in load.
- Scaling Events: Keep an eye on scaling events triggered by HPA and VPA to ensure they align with expected behavior.
- Application Performance Metrics: Track application-specific metrics such as response times, error rates, and throughput.
Based on the insights gained from monitoring, you may need to adjust your autoscaling configurations:
- Refine Thresholds: If you notice frequent scaling actions or periods of underutilization or overutilization, adjust the thresholds used by HPA and VPA to better match the workload patterns.
- Update Policies: Modify VPA update policies if you observe that the current settings are causing too much or too little pod disruption.
- Custom Metrics: If using custom metrics, ensure they accurately reflect the load on your application and adjust them if they do not.
- Scaling Limits: Review and adjust the minimum and maximum scaling limits to prevent over-scaling or under-scaling based on the capacity of your cluster and the criticality of your applications.
Quality of Service (QoS)
A few words on the quality of service for pods. Basically, there are 3 classes of QoS and they influence the eviction of pods when kube-reserved is underrun, i.e. node-level resources are running low:
- BestEffort, i.e. pods where no container has CPU or memory requests or limits: Avoid them unless you have really good reasons. The kube-scheduler will place them just anywhere according to its policy, e.g. balanced or bin-packing, but whatever resources these pods consume may bring other pods into trouble or even the kubelet and the container runtime itself, if it happens very suddenly.
- Burstable, i.e. pods where at least one container has CPU or memory requests and at least one has no limits or limits that don’t match the requests: Prefer them unless you have really good reasons for the other QoS classes. Always specify proper requests or use VPA to recommend those. This helps the kube-scheduler to make the right scheduling decisions. Not having limits will additionally provide upward resource flexibility, if the node is not under pressure.
- Guaranteed, i.e. pods where all containers have CPU and memory requests and equal limits: Avoid them unless you really know the limits or throttling/killing is intended. While “Guaranteed” sounds like something “positive” in the English language, this class comes with the downside that pods will be actively CPU-throttled and will actively go OOM, even if the node is not under pressure and has excess capacity left. Worse, if containers in the pod are under VPA, their CPU requests/limits will often not be scaled up as CPU throttling will go unnoticed by VPA.
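The three class definitions can be expressed as a short Python sketch. It is an illustration of the rules as stated above, not Kubernetes code: the function name qos_class and the dictionary layout are assumptions, only cpu and memory are considered, and quantities are compared as plain strings (so "1" and "1000m" would not be treated as equal).

```python
def qos_class(containers):
    """Classify a pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    def requests_of(c):
        return c.get("requests") or {}

    def limits_of(c):
        return c.get("limits") or {}

    if all(not requests_of(c) and not limits_of(c) for c in containers):
        return "BestEffort"

    def is_guaranteed(c):
        limits = limits_of(c)
        # Requests that are not set default to the limits.
        requests = {**limits, **requests_of(c)}
        return all(
            res in limits and requests[res] == limits[res]
            for res in ("cpu", "memory")
        )

    if all(is_guaranteed(c) for c in containers):
        return "Guaranteed"
    return "Burstable"


print(qos_class([{}]))                                               # BestEffort
print(qos_class([{"requests": {"cpu": "100m", "memory": "200M"}}]))  # Burstable
print(qos_class([{"requests": {"cpu": "1", "memory": "1G"},
                  "limits": {"cpu": "1", "memory": "1G"}}]))         # Guaranteed
```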
Summary
- As a rule of thumb, always set CPU and memory requests (or let VPA do that) and always avoid CPU and memory limits.
- CPU limits aren’t helpful on an under-utilized node (they may result in needless outages) and even suppress the signals for VPA to act. On a nearly or fully utilized node, CPU limits are practically irrelevant as only the requests matter, which are translated into CPU shares that provide a fair use of the CPU anyway (see CFS). Therefore, if you do not know the healthy range, do not set CPU limits. If you as author of the source code know its healthy range, set them to the upper threshold of that healthy range (everything above, from your knowledge of that code, is definitely an unbound busy loop or similar, which is the main reason for CPU limits, besides batch jobs where throttling is acceptable or even desired).
- Memory limits may be more useful, but suffer from a similar, though less severe, downside. As with CPU limits, memory limits aren’t helpful on an under-utilized node (they may result in needless outages), but unlike CPU limits, they result in an OOM, which triggers VPA to provide more memory suddenly (modifies the currently computed recommendations by a configurable factor, defaulting to +20%, see docs). Therefore, if you do not know the healthy range, do not set memory limits. If you as author of the source code know its healthy range, set them to the upper threshold of that healthy range (everything above, from your knowledge of that code, is definitely an unbound memory leak or similar, which is the main reason for memory limits).
- Horizontal Pod Autoscaling (HPA): Use for pods that support horizontal scaling. Prefer scaling on usage, not utilization, as this is more predictable (not dependent on a second variable, namely the current requests) and conflict-free with vertical pod autoscaling (VPA).
- As a rule of thumb, set the initial replicas to the 5th percentile of the actually observed replica count in production. Since HPA reacts fast, this is not as critical, but may help reduce initial load on the control plane early after deployment. However, be cautious when you update the higher-level resource not to inadvertently reset the current HPA-controlled replica count (a very easy mistake to make that can lead to catastrophic loss of pods). HPA modifies the replica count directly in the spec and you do not want to overwrite that. Even if it reacts fast, it is not instant (it does not operate via a mutating webhook as VPA does) and the damage may already be done.
- As for minimum and maximum, let your high availability requirements determine the minimum and your theoretical maximum load determine the maximum, flanked with alerts to detect erroneous run-away out-scaling or the actual nearing of your practical maximum load, so that you can intervene.
- Vertical Pod Autoscaling (VPA): Use for containers that have a significant usage (e.g. any container above 50m CPU or 100M memory) and a significant usage spread over time (by more than 2x), i.e. ignore small (e.g. side-cars) or static (e.g. Java statically allocated heap) containers, but otherwise use it to provide the resources needed on the one hand and keep the costs in check on the other hand.
- As a rule of thumb, set the initial requests to the 5th percentile of the actually observed CPU or memory usage in production. Since VPA may need some time at first to respond and evict pods, this is especially critical early after deployment. The lower bound, below which pods will be immediately evicted, converges much faster than the upper bound, above which pods will be immediately evicted, but it isn’t instant, e.g. after 5 minutes the lower bound is just at 60% of the computed lower bound; after 12 hours the upper bound is still at 300% of the computed upper bound (see code). Unlike with HPA, you don’t need to be as cautious when updating the higher-level resource in the case of VPA. As long as VPA’s mutating webhook (VPA Admission Controller) is operational (which the VPA Updater also checks before evicting pods), it’s generally safe to update the higher-level resource. However, if it’s not up and running, any new pods that are spawned (e.g. as a consequence of a rolling update of the higher-level resource or for any other reason) will not be mutated. Instead, they will receive whatever requests are currently configured at the higher-level resource, which can lead to catastrophic resource under-reservation. Gardener deploys the VPA Admission Controller in HA; if unhealthy, it is reported under the ControlPlaneHealthy shoot status condition.
- If you have defined absolute limits (unrelated to the requests), configure VPA to only scale the requests or else it will proportionally scale the limits as well, which can easily become useless (way beyond the threshold of unhealthy behavior) or absurd (larger than node capacity):
  spec:
    resourcePolicy:
      containerPolicies:
      - controlledValues: RequestsOnly
        ...
  If you have defined relative limits (related to the requests), the default policy to scale the limits proportionally with the requests is fine, but the gap between requests and limits must be zero for QoS Guaranteed and should best be small for QoS Burstable to avoid useless or absurd limits, e.g. prefer limits being 5 to at most 20% larger than requests as opposed to being 100% larger or more.
- As a rule of thumb, set minAllowed to the highest observed VPA recommendation (usually during the initialization phase or during any periodical activity) for an otherwise practically idle container, so that you avoid needless thrashing (e.g. resource usage calms down over time and recommendations drop consecutively until eviction, which will then lead again to initialization or later periodical activity and higher recommendations and new evictions).
  ⚠️ You may want to provide higher minAllowed values if you observe that up-scaling takes too long for CPU or memory for too large a percentile of your workload. This will get you out of the danger zone of too few resources for too many pods at the expense of providing too many resources for a few pods. Memory may react faster than CPU, because CPU throttling is not visible and memory gets aided by OOM bump-up incidents, but still, if you observe that up-scaling takes too long, you may want to increase minAllowed accordingly.
- As a rule of thumb, set maxAllowed to your theoretical maximum load, flanked with alerts to detect erroneous run-away usage or the actual nearing of your practical maximum load, so that you can intervene. However, VPA can easily recommend requests larger than what is allocatable on a node, so you must either ensure large enough nodes (Gardener can scale up from zero, in case you like to define a low-priority worker pool with more resources for very large pods) and/or cap VPA’s target recommendations using maxAllowed at the node allocatable remainder (after daemon set pods) of the largest eligible machine type (may result in under-provisioning resources for a pod). Use your monitoring and check maximum pod usage to decide about the maximum machine type.
Recommendations in a Box
Container | When to use | Value |
---|---|---|
Requests | - Set them (recommended) unless: - Do not set requests for QoS BestEffort; useful only if pod can be evicted as often as needed and pod can pick up where it left off without any penalty | Set requests to 95th percentile (w/o VPA) of the actually observed CPU or memory usage in production, or to 5th percentile (w/ VPA) (see below) |
Limits | - Avoid them (recommended) unless: - Set limits for QoS Guaranteed; useful only if pod has strictly static resource requirements - Set CPU limits if you want to throttle CPU usage for containers that can be throttled w/o any other disadvantage than processing time (never do that when time-critical operations like leases are involved) - Set limits if you know the healthy range and want to shield against unbound busy loops, unbound memory leaks, or similar | If you really can (otherwise not), set limits to healthy theoretical max load |
Scaler | When to use | Initial | Minimum | Maximum |
---|---|---|---|---|
HPA | Use for pods that support horizontal scaling | Set initial replicas to 5th percentile of the actually observed replica count in production (prefer scaling on usage, not utilization) and make sure to never overwrite it later when controlled by HPA | Set minReplicas to 0 (requires feature gate and custom/external metrics), to 1 (regular HPA minimum), or whatever the high availability requirements of the workload demand | Set maxReplicas to healthy theoretical max load |
VPA | Use for containers that have a significant usage (>50m/100M) and a significant usage spread over time (>2x) | Set initial requests to 5th percentile of the actually observed CPU or memory usage in production | Set minAllowed to highest observed VPA recommendation (includes start-up phase) for an otherwise practically idle container (avoids pod thrashing when pod gets evicted after idling) | Set maxAllowed to fresh node allocatable remainder after daemonset pods (avoids pending pods when requests exceed fresh node allocatable remainder) or, if you really can (otherwise not), to healthy theoretical max load (less disruptive than limits as no throttling or OOM happens on under-utilized nodes) |
CA | Use for dynamic workloads, definitely if you use HPA and/or VPA | N/A | Set minimum to 0 or number of nodes required right after cluster creation or wake-up | Set maximum to healthy theoretical max load |
ℹ️ Note
Theoretical max load may be very difficult to ascertain, especially with modern software that consists of building blocks you do not own or know in detail. If you have comprehensive monitoring in place, you may be tempted to pick the observed maximum and add a safety margin or even factor on top (2x, 4x, or any other number), but this is not to be confused with “theoretical max load” (solely depending on the code, not observations from the outside). At any point in time, your numbers may change, e.g. because you updated a software component or your usage increased. If you decide to use numbers that are set based only on observations, make sure to flank those numbers with monitoring alerts, so that you have sufficient time to investigate, revise, and readjust if necessary.
Conclusion
Pod autoscaling is a dynamic and complex aspect of Kubernetes, but it is also one of the most powerful tools at your disposal for maintaining efficient, reliable, and cost-effective applications. By carefully selecting the appropriate autoscaler, setting well-considered thresholds, and continuously monitoring and adjusting your strategies, you can ensure that your Kubernetes deployments are well-equipped to handle your resource demands while not over-paying for the provided resources at the same time.
As Kubernetes continues to evolve (e.g. in-place updates) and as new patterns and practices emerge, the approaches to autoscaling may also change. However, the principles discussed above will remain foundational to creating scalable and resilient Kubernetes workloads. Whether you’re a developer or operations engineer, a solid understanding of pod autoscaling will be instrumental in the successful deployment and management of containerized applications.
6.2 - Specifying a Disruption Budget for Kubernetes Controllers
Introduction of Disruptions
We need to understand that some kinds of voluntary disruptions can happen to pods. For example, they can be caused by cluster administrators who want to perform automated cluster actions, like upgrading and autoscaling clusters. Typical application owner actions include:
- deleting the deployment or other controller that manages the pod
- updating a deployment’s pod template causing a restart
- directly deleting a pod (e.g., by accident)
Setup Pod Disruption Budgets
Kubernetes offers a feature called PodDisruptionBudget (PDB) for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions.
The most common use case is when you want to protect an application specified by one of the built-in Kubernetes controllers:
- Deployment
- ReplicationController
- ReplicaSet
- StatefulSet
A PodDisruptionBudget has three fields:
- A label selector .spec.selector to specify the set of pods to which it applies.
- .spec.minAvailable, which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
- .spec.maxUnavailable, which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
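How the two budget fields translate into permitted voluntary evictions can be sketched in Python. This is an illustration, not the Kubernetes implementation: the function names are hypothetical and percentages are assumed to be rounded up to whole pods.

```python
import math


def _scaled(value, total):
    """Resolve an absolute number or a percentage string such as "25%" to pods."""
    if isinstance(value, str) and value.endswith("%"):
        return math.ceil(int(value[:-1]) * total / 100)
    return int(value)


def allowed_disruptions(expected_pods, healthy_pods, min_available=None, max_unavailable=None):
    """Number of pods that may be voluntarily evicted right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _scaled(min_available, expected_pods)
    else:
        desired_healthy = expected_pods - _scaled(max_unavailable, expected_pods)
    return max(0, healthy_pods - desired_healthy)


print(allowed_disruptions(expected_pods=1, healthy_pods=1, min_available=1))        # 0
print(allowed_disruptions(expected_pods=3, healthy_pods=3, min_available=1))        # 2
print(allowed_disruptions(expected_pods=4, healthy_pods=4, max_unavailable="25%"))  # 1
```

The first call shows that a single replica combined with minAvailable: 1 leaves no room for any voluntary eviction, which is what blocks node drains.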
Cluster Upgrade or Node Deletion Failed due to PDB Violation
Misconfiguration of the PDB could block the cluster upgrade or node deletion processes. There are two main cases that can cause a misconfiguration.
Case 1: The replica of Kubernetes controllers is 1
Only 1 replica is running: there is no replicaCount setup or replicaCount for the Kubernetes controllers is set to 1.
PDB configuration:
spec:
  minAvailable: 1
To fix this PDB misconfiguration, you need to change the value of replicaCount for the Kubernetes controllers to a number greater than 1.
Case 2: HPA configuration violates PDB
In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand. The HorizontalPodAutoscaler manages the replicas field of the Kubernetes controllers.
There is no replicaCount setup or replicaCount for the Kubernetes controllers is set to 1.
PDB configuration:
spec:
  minAvailable: 1
HPA configuration:
spec:
  minReplicas: 1
To fix this PDB misconfiguration, you need to change the value of HPA minReplicas to be greater than 1.
Related Links
6.3 - Access a Port of a Pod Locally
Question
You have deployed an application with a web UI or an internal endpoint in your Kubernetes (K8s) cluster. How can you access this endpoint without an external load balancer (e.g., Ingress)?
This tutorial presents two options:
- Using Kubernetes port forward
- Using Kubernetes apiserver proxy
Please note that the options described here are mostly for quick testing or troubleshooting your application. For enabling access to your application in a production environment, please refer to the official Kubernetes documentation.
Solution 1: Using Kubernetes Port Forward
You could use the port forwarding functionality of kubectl to access the pods from your local host without involving a service.
To access any pod, follow these steps:
- Run kubectl get pods
- Note down the name of the pod in question as <your-pod-name>
- Run kubectl port-forward <your-pod-name> <local-port>:<your-app-port>
- Run a web browser or curl locally and enter the URL: http(s)://localhost:<local-port>
In addition, kubectl port-forward allows using a resource name, such as a deployment name or service name, to select a matching pod to port forward.
More details can be found in the Kubernetes documentation.
The main drawback of this approach is that the pod’s name changes as soon as it is restarted. Moreover, you need to have a web browser on your client and you need to make sure that the local port is not already used by an application running on your system. Finally, sometimes the port forwarding is canceled for non-obvious reasons. This makes the approach somewhat fragile. A more stable option is to access the app via the apiserver proxy, which accesses the corresponding service.
Solution 2: Using the apiserver Proxy of Your Kubernetes Cluster
There are several different proxies in Kubernetes. In this tutorial we will be using apiserver proxy to enable the access to the services in your cluster without Ingress. Unlike the first solution, here a service is required.
Use the following format to compose a URL for accessing your service through an existing proxy on the Kubernetes cluster:
https://<your-cluster-master>/api/v1/namespaces/<your-namespace>/services/<your-service>:<your-service-port>/proxy/<service-endpoint>
Example:
your-cluster-master | your-namespace | your-service | your-service-port | service-endpoint | URL to access service |
---|---|---|---|---|---|
api.testclstr.cpet.k8s.sapcloud.io | default | nginx-svc | 80 | / | https://api.testclstr.cpet.k8s.sapcloud.io/api/v1/namespaces/default/services/nginx-svc:80/proxy/ |
api.testclstr.cpet.k8s.sapcloud.io | default | docker-nodejs-svc | 4500 | /cpu?baseNumber=4 | https://api.testclstr.cpet.k8s.sapcloud.io/api/v1/namespaces/default/services/docker-nodejs-svc:4500/proxy/cpu?baseNumber=4 |
For more details on the format, please refer to the official Kubernetes documentation.
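Composing the URL by hand is error-prone, so it can help to script it. The following minimal sketch assembles the URL from its parts; the function name `proxy_url` is hypothetical, and the values in the calls are taken from the example table.

```shell
#!/usr/bin/env bash
# Sketch: compose an apiserver proxy URL.
# Arguments: cluster host, namespace, service, service port, endpoint path.
proxy_url() {
  local host="$1" ns="$2" svc="$3" port="$4" endpoint="${5:-/}"
  # Strip the leading slash so it is not doubled after ".../proxy/".
  endpoint="${endpoint#/}"
  printf 'https://%s/api/v1/namespaces/%s/services/%s:%s/proxy/%s\n' \
    "$host" "$ns" "$svc" "$port" "$endpoint"
}

proxy_url api.testclstr.cpet.k8s.sapcloud.io default nginx-svc 80 /
proxy_url api.testclstr.cpet.k8s.sapcloud.io default docker-nodejs-svc 4500 '/cpu?baseNumber=4'
```

The two calls print the URLs listed in the table. Quote endpoints that contain `?` or `&` so the shell does not interpret them.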
Note
There are applications which do not support relative URLs yet, e.g. Prometheus (as of November, 2022). This typically leads to missing JavaScript objects, which could be investigated with your browser’s development tools. If such an issue occurs, please use the port-forward approach described above.
6.4 - Auditing Kubernetes for Secure Setup
Increasing the Security of All Gardener Stakeholders
In summer 2018, the Gardener project team asked Kinvolk to execute several penetration tests in its role as third-party contractor. The goal of this ongoing work was to increase the security of all Gardener stakeholders in the open source community. Following the Gardener architecture, the control plane of a Gardener managed shoot cluster resides in the corresponding seed cluster. This is a Control-Plane-as-a-Service with a network air gap.
Along the way we found various kinds of security issues, for example, due to misconfiguration or missing isolation, as well as two special problems with upstream Kubernetes and its Control-Plane-as-a-Service architecture.
Major Findings
From this experience, we’d like to share a few examples of security issues that could happen on a Kubernetes installation and how to fix them.
Alban Crequy (Kinvolk) and Dirk Marwinski (SAP SE) gave a presentation entitled Hardening Multi-Cloud Kubernetes Clusters as a Service at KubeCon 2018 in Shanghai presenting some of the findings.
Here is a summary of the findings:
Privilege escalation due to insecure configuration of the Kubernetes API server
- Root cause: Same certificate authority (CA) is used for both the API server and the proxy that allows accessing the API server.
- Risk: Users can get access to the API server.
- Recommendation: Always use different CAs.
Exploration of the control plane network with malicious HTTP-redirects
- Root cause: See detailed description below.
- Risk: Provoked error message contains full HTTP payload from an existing endpoint which can be exploited. The contents of the payload depend on your setup, but can potentially be user data, configuration data, and credentials.
- Recommendation:
- Use the latest version of Gardener
- Ensure the seed cluster’s container network supports network policies. Clusters that have been created with Kubify are not protected as Flannel is used there which doesn’t support network policies.
Reading private AWS metadata via Grafana
- Root cause: By configuring a new custom data source in Grafana, users can send HTTP requests to targets in the control plane network, for example the AWS metadata service.
- Risk: Users can get the “user-data” for the seed cluster from the metadata service and retrieve a kubeconfig for that Kubernetes cluster.
- Recommendation: Lock down Grafana features to only what’s necessary in this setup, block all unnecessary outgoing traffic, move Grafana to a different network, and lock down unauthenticated endpoints.
Scenario 1: Privilege Escalation with Insecure API Server
In most configurations, different components connect directly to the Kubernetes API server, often using a kubeconfig
with a client
certificate. The API server is started with the flag:
/hyperkube apiserver --client-ca-file=/srv/kubernetes/ca/ca.crt ...
The API server will check whether the client certificate presented by kubectl, kubelet, scheduler or another component is really signed by the configured certificate authority for clients.
The API server can have many clients of various kinds
However, it is possible to configure the API server differently for use with an intermediate authenticating proxy. The proxy will authenticate the client with its own custom method and then issue HTTP requests to the API server with additional HTTP headers specifying the user name and group name. The API server should only accept HTTP requests with HTTP headers from a legitimate proxy. To allow the API server to check incoming requests, you need to pass a list of certificate authorities (CAs) to it. Requests coming from a proxy are only accepted if they use a client certificate that is signed by one of the CAs in that list.
--requestheader-client-ca-file=/srv/kubernetes/ca/ca-proxy.crt
--requestheader-username-headers=X-Remote-User
--requestheader-group-headers=X-Remote-Group
API server clients can reach the API server through an authenticating proxy
So far, so good. But what happens if the malicious user “Mallory” tries to connect directly to the API server and reuses the HTTP headers to pretend to be someone else?
What happens when a client bypasses the proxy, connecting directly to the API server?
With a correct configuration, Mallory’s kubeconfig will have a certificate signed by the API server certificate authority but not signed by the proxy certificate authority. So the API server will not accept the extra HTTP header “X-Remote-Group: system:masters”.
You only run into an issue when the same certificate authority is used for both the API server and the proxy. Then, any Kubernetes client certificate can be used to take the role of a different user or group, as the API server will accept the user header and group header.
The kubectl
tool does not normally add those HTTP headers but it’s pretty easy to generate the corresponding HTTP requests manually.
We worked on improving the Kubernetes documentation to make clearer that this configuration should be avoided.
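A quick way to spot this misconfiguration is to compare the files passed to `--client-ca-file` and `--requestheader-client-ca-file`. The following is a rough sketch under a simplifying assumption: it only detects byte-identical files. The same CA stored in differently formatted or bundled files would need a fingerprint comparison instead. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: warn when the client CA and the request-header (proxy) CA are the same.

# same_ca FILE_A FILE_B -> exit status 0 if both files are byte-identical
same_ca() {
  cmp -s "$1" "$2"
}

# check_ca_separation CLIENT_CA PROXY_CA -> prints a verdict, returns 1 if the CA is shared
check_ca_separation() {
  if same_ca "$1" "$2"; then
    echo "INSECURE: client CA and request-header CA are identical"
    return 1
  fi
  echo "OK: client CA and request-header CA differ"
}
```

With the paths from the flags shown earlier, the check would be called as `check_ca_separation /srv/kubernetes/ca/ca.crt /srv/kubernetes/ca/ca-proxy.crt`.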
Scenario 2: Exploration of the Control Plane Network with Malicious HTTP-Redirects
The API server is a central component of Kubernetes and many components initiate connections to it, including the kubelet running on worker nodes. Most of the requests from those clients will end up updating Kubernetes objects (pods, services, deployments, and so on) in the etcd database but the API server usually does not need to initiate TCP connections itself.
The API server is mostly a component that receives requests
However, there are exceptions. Some kubectl
commands will trigger the API server to open a new connection to the kubelet. kubectl exec
is one of those commands. In order to get the standard I/O streams from the pod, the API server will start an HTTP connection to the kubelet on the worker node where the pod is running. Depending on the container runtime used, it can be done in different ways, but one way to do it is for the kubelet to reply with an HTTP 302 redirection to the Container Runtime Interface (CRI). Basically, the kubelet is telling the API server to get the streams from the CRI directly instead of forwarding them. The redirection from the kubelet will only change the port and path of the URL; the IP address will not be changed because the kubelet and the CRI component run on the same worker node.
But the API server also initiates some connections, for example, to worker nodes
It’s often quite easy for users of a Kubernetes cluster to get access to worker nodes and tamper with the kubelet. They could be given explicit SSH access or they could be given a kubeconfig with enough privileges to create privileged pods or even just pods with “host” volumes.
In contrast, users (even those with “system:masters” permissions or “root” rights) are often not given access to the control plane. On setups like, for example, GKE or Gardener, the control plane is running on separate nodes, with a different administrative access. It could be hosted on a different cloud provider account. So users are not free to explore the internal network of the control plane.
What would happen if a user was tampering with the kubelet to make it maliciously redirect kubectl exec
requests to a different random endpoint? Most likely the given endpoint would not speak the streaming server protocol, so there would be an error. However, the full HTTP payload from the endpoint is included in the error message printed by kubectl exec.
The API server is tricked to connect to other components
The impact of this issue depends on the specific setup. But in many configurations, we could find a metadata service (such as the AWS metadata service) containing user data, configurations and credentials. The setup we explored had a different AWS account and a different EC2 instance profile for the worker nodes and the control plane. This issue allowed users to get access to the AWS metadata service in the context of the control plane, which they should not have access to.
We have reported this issue to the Kubernetes Security mailing list, and the public pull request that addresses the issue (PR#66516) has been merged. It provides a way to enforce HTTP redirect validation (disabled by default).
But there are several other ways that users could trigger the API server to generate HTTP requests and get the reply payload back, so it is advised to isolate the API server and other components from the network as an additional precautionary measure. Depending on where the API server runs, this could be done with Kubernetes Network Policies, EC2 Security Groups, or just iptables directly. Following the defense in depth principle, it is a good idea to apply the API server HTTP redirect validation when it is available, as well as firewall rules.
In Gardener, this has been fixed with Kubernetes network policies along with changes to ensure the API server does not need to contact the metadata service. You can see more details in the announcements on the Gardener mailing list. This is tracked in CVE-2018-2475.
To be protected from this issue, stakeholders should:
- Use the latest version of Gardener
- Ensure the seed cluster’s container network supports network policies. Clusters that have been created with Kubify are not protected as Flannel is used there which doesn’t support network policies.
Scenario 3: Reading Private AWS Metadata via Grafana
For our tests, we had access to a Kubernetes setup where users are not only given access to the API server in the control plane, but also to a Grafana instance that is used to gather data from their Kubernetes clusters via Prometheus. The control plane is managed and users don’t have access to the nodes that it runs on. They can only access the API server and Grafana via a load balancer. The internal network of the control plane is therefore hidden from users.
Prometheus and Grafana can be used to monitor worker nodes
Unfortunately, that setup was not protecting the control plane network from nosy users. By configuring a new custom data source in Grafana, we could send HTTP requests to target the control plane network, for example the AWS metadata service. The reply payload is not displayed on the Grafana Web UI but it is possible to access it from the debugging console of the Chrome browser.
Credentials can be retrieved from the debugging console of Chrome
Adding a Grafana data source is a way to issue HTTP requests to arbitrary targets
In that installation, users could get the “user-data” for the seed cluster from the metadata service and retrieve a kubeconfig for that Kubernetes cluster.
There are many possible measures to avoid this situation: lock down Grafana features to only what’s necessary in this setup, block all unnecessary outgoing traffic, move Grafana to a different network, or lock down unauthenticated endpoints, among others.
Conclusion
The three scenarios above show pitfalls with a Kubernetes setup. A lot of them were specific to the Kubernetes installation: different cloud providers or different configurations will show different weaknesses. Users should no longer be given access to Grafana.
6.5 - Container Image Not Pulled
Problem
Two of the most common causes of this problem are specifying the wrong container image or trying to use private images without providing registry credentials.
Note
There is no observable difference in pod status between a missing image and incorrect registry permissions. In either case, Kubernetes will report an ErrImagePull status for the pods. For this reason, this article deals with both scenarios.
Example
Let’s see an example. We’ll create a pod named fail, referencing a non-existent Docker image:
kubectl run -i --tty fail --image=tutum/curl:1.123456
The command doesn’t return and you can terminate the process with Ctrl+C
.
Error Analysis
We can then inspect our pods and see that we have one pod with a status of ErrImagePull or ImagePullBackOff.
$ (minikube) kubectl get pods
NAME READY STATUS RESTARTS AGE
client-5b65b6c866-cs4ch 1/1 Running 1 1m
fail-6667d7685d-7v6w8 0/1 ErrImagePull 0 <invalid>
vuejs-578574b75f-5x98z 1/1 Running 0 1d
$ (minikube)
For some additional information, we can describe
the failing pod.
kubectl describe pod fail-6667d7685d-7v6w8
As you can see in the events section, your image can’t be pulled:
Name: fail-6667d7685d-7v6w8
Namespace: default
Node: minikube/192.168.64.10
Start Time: Wed, 22 Nov 2017 10:01:59 +0100
Labels: pod-template-hash=2223832418
run=fail
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"fail-6667d7685d","uid":"cc4ccb3f-cf63-11e7-afca-4a7a1fa05b3f","a...
.
.
.
.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 default-scheduler Normal Scheduled Successfully assigned fail-6667d7685d-7v6w8 to minikube
1m 1m 1 kubelet, minikube Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-9fr6r"
1m 6s 4 kubelet, minikube spec.containers{fail} Normal Pulling pulling image "tutum/curl:1.123456"
1m 5s 4 kubelet, minikube spec.containers{fail} Warning Failed Failed to pull image "tutum/curl:1.123456": rpc error: code = Unknown desc = Error response from daemon: manifest for tutum/curl:1.123456 not found
1m <invalid> 10 kubelet, minikube Warning FailedSync Error syncing pod
1m <invalid> 6 kubelet, minikube spec.containers{fail} Normal BackOff Back-off pulling image "tutum/curl:1.123456"
Why couldn’t Kubernetes pull the image? There are three primary candidates besides network connectivity issues:
- The image tag is incorrect
- The image doesn’t exist
- Kubernetes doesn’t have permissions to pull that image
If you don’t notice a typo in your image tag, then it’s time to test using your local machine. I usually start by
running docker pull on my local development machine with the exact same image tag. In this case, I would
run docker pull tutum/curl:1.123456
.
If this succeeds, then it probably means that Kubernetes doesn’t have the correct permissions to pull that image.
Add the docker registry user/pwd to your cluster:
kubectl create secret docker-registry dockersecret --docker-server=https://index.docker.io/v1/ --docker-username=<username> --docker-password=<password> --docker-email=<email>
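Creating the secret alone does not make pods use it; it still has to be referenced, either via imagePullSecrets in the pod spec or on the service account. A minimal sketch of the service account variant follows. The helper names are hypothetical, and `default` is assumed to be the service account your pods run under.

```shell
#!/usr/bin/env bash
# Sketch: make pods of a service account use a registry secret.

# pull_secret_patch SECRET -> prints the JSON patch body
pull_secret_patch() {
  printf '{"imagePullSecrets": [{"name": "%s"}]}\n' "$1"
}

# attach_pull_secret SERVICE_ACCOUNT SECRET -> patches the service account
# in the current namespace
attach_pull_secret() {
  kubectl patch serviceaccount "$1" -p "$(pull_secret_patch "$2")"
}
```

Calling `attach_pull_secret default dockersecret` attaches the secret created above. Only pods created afterwards pick it up, so the failing pod has to be recreated.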
If the exact image tag fails, then I will test without an explicit image tag:
docker pull tutum/curl
This command will attempt to pull the latest tag. If this succeeds, then that means the originally specified tag doesn’t exist. Go to the Docker registry and check which tags are available for this image.
If docker pull tutum/curl
(without an exact tag) fails, then we have a bigger problem - that image does not exist at all in our image registry.
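The manual checks above can be sketched as a small script. This is an illustration under stated assumptions, not a definitive tool: it expects the image in `name:tag` form, the function name `diagnose_image` is hypothetical, and the pull command is configurable through a `PULL` variable that defaults to `docker pull`.

```shell
#!/usr/bin/env bash
# Sketch of the decision procedure: pull the exact tag first, then the image
# without a tag, and report the most likely cause of the failure.
PULL="${PULL:-docker pull}"

# diagnose_image NAME:TAG -> prints the most likely cause
diagnose_image() {
  local image="$1"
  local repo="${image%:*}"   # image name without the tag
  if $PULL "$image" >/dev/null 2>&1; then
    echo "image and tag exist: the cluster probably lacks pull permissions"
  elif $PULL "$repo" >/dev/null 2>&1; then
    echo "image exists but the tag does not: check the available tags"
  else
    echo "image not found in the registry"
  fi
}
```

For the example in this article, the call would be `diagnose_image tutum/curl:1.123456`.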
6.6 - Container Image Not Updating
Introduction
A container image should use a fixed tag or the SHA of the image. It should not use the tags latest, head, canary, or other tags that are designed to be floating.
Problem
If you have encountered this issue, you have probably done something along the lines of:
1. Deploy anything using an image tag (e.g., cp-enablement/awesomeapp:1.0)
2. Fix a bug in awesomeapp
3. Build a new image and push it with the same tag (cp-enablement/awesomeapp:1.0)
4. Update the deployment
5. Realize that the bug is still present
6. Repeat steps 3-5 without any improvement
The problem relates to how Kubernetes decides whether to do a docker pull when starting a container.
Since we tagged our image as :1.0, the default pull policy is IfNotPresent. The Kubelet already has a local
copy of cp-enablement/awesomeapp:1.0
, so it doesn’t attempt to do a docker pull. When the new Pods come up,
they’re still using the old broken Docker image.
There are a couple of ways to resolve this, with the recommended one being to use unique tags.
Solution
In order to fix the problem, you can use the following bash script that runs anytime the deployment is updated to create a new tag and push it to the registry.
#!/usr/bin/env bash
# Set the docker image name and the corresponding repository
# Ensure that you change them in the deployment.yml as well.
# You must be logged in with docker login.
#
# CHANGE THIS TO YOUR Docker.io SETTINGS
#
PROJECT=awesomeapp
REPOSITORY=cp-enablement
# causes the shell to exit if any subcommand or pipeline returns a non-zero status.
#
set -e
# set debug mode
#
set -x
# build my nodeJS app
#
npm run build
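# get the latest version ID from the Docker.io registry and increment it
#
VERSION=$(curl https://registry.hub.docker.com/v1/repositories/$REPOSITORY/$PROJECT/tags | sed -e 's/[][]//g' -e 's/"//g' -e 's/ //g' | tr '}' '\n' | awk -F: '{print $3}' | grep v| tail -n 1)
VERSION=${VERSION:1}
# use arithmetic expansion: ((VERSION++)) returns a non-zero status when
# VERSION is 0, which would abort the script because of "set -e"
VERSION=$((VERSION + 1))
VERSION="v$VERSION"
# build the new docker image (assumes a Dockerfile in the current directory)
#
echo '>>> Building new image'
docker build -t "$REPOSITORY/$PROJECT:$VERSION" .
echo '>>> Push new image'
docker push "$REPOSITORY/$PROJECT:$VERSION"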
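The version handling in the script can be isolated into a small helper that also copes with a repository that has no tag yet. This is a minimal sketch, assuming tags of the form `v<number>` as in the script; `next_version` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: compute the next "vN" tag from the latest published one.

# next_version TAG -> prints the following tag; an empty or malformed TAG yields v1
next_version() {
  local current="${1#v}"        # drop the leading "v"
  case "$current" in
    ''|*[!0-9]*) current=0 ;;   # nothing published yet, or not a number
  esac
  # 10# forces base 10 so that values such as "08" are not read as octal
  echo "v$((10#$current + 1))"
}

next_version v41   # prints v42
next_version ""    # prints v1
```

Because the helper always prints a new tag, every build gets a unique image reference and the kubelet pulls the new image instead of reusing its cached copy.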
6.7 - Custom Seccomp Profile
Overview
Seccomp (secure computing mode) is a security facility in the Linux kernel for restricting the set of system calls applications can make.
Starting from Kubernetes v1.3.0, the Seccomp feature is in Alpha
. To configure it on a Pod
, the following annotations can be used:
- seccomp.security.alpha.kubernetes.io/pod: <seccomp-profile>
where <seccomp-profile> is the seccomp profile to apply to all containers in a Pod.
- container.seccomp.security.alpha.kubernetes.io/<container-name>: <seccomp-profile>
where <seccomp-profile> is the seccomp profile to apply to <container-name> in a Pod.
More details can be found in the PodSecurityPolicy
documentation.
Installation of a Custom Profile
By default, kubelet loads custom Seccomp profiles from /var/lib/kubelet/seccomp/
. There are two ways in which Seccomp profiles can be added to a Node
:
- to be baked in the machine image
- to be added at runtime
This guide focuses on creating those profiles via a DaemonSet
.
Create a file called seccomp-profile.yaml
with the following content:
apiVersion: v1
kind: ConfigMap
metadata:
  name: seccomp-profile
  namespace: kube-system
data:
  my-profile.json: |
    {
      "defaultAction": "SCMP_ACT_ALLOW",
      "syscalls": [
        {
          "name": "chmod",
          "action": "SCMP_ACT_ERRNO"
        }
      ]
    }
Note
The policy above is a very simple one and not suitable for complex applications. The default docker profile can be used as a reference. Feel free to modify it to your needs.
Apply the ConfigMap
in your cluster:
$ kubectl apply -f seccomp-profile.yaml
configmap/seccomp-profile created
The next step is to create the DaemonSet
Seccomp installer. It’s going to copy the policy from above in /var/lib/kubelet/seccomp/my-profile.json
.
Create a file called seccomp-installer.yaml
with the following content:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-installer
  namespace: kube-system
  labels:
    security: seccomp
spec:
  selector:
    matchLabels:
      security: seccomp
  template:
    metadata:
      labels:
        security: seccomp
    spec:
      initContainers:
      - name: installer
        image: alpine:3.10.0
        command: ["/bin/sh", "-c", "cp -r -L /seccomp/*.json /host/seccomp/"]
        volumeMounts:
        - name: profiles
          mountPath: /seccomp
        - name: hostseccomp
          mountPath: /host/seccomp
          readOnly: false
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
      terminationGracePeriodSeconds: 5
      volumes:
      - name: hostseccomp
        hostPath:
          path: /var/lib/kubelet/seccomp
      - name: profiles
        configMap:
          name: seccomp-profile
Create the installer and wait until it’s ready on all Nodes
:
$ kubectl apply -f seccomp-installer.yaml
daemonset.apps/seccomp-installer created
$ kubectl -n kube-system get pods -l security=seccomp
NAME READY STATUS RESTARTS AGE
seccomp-installer-wjbxq 1/1 Running 0 21s
Create a Pod Using a Custom Seccomp Profile
Finally, we want to create a Pod which uses our new Seccomp profile my-profile.json
.
Create a file called my-seccomp-pod.yaml
with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-app
  namespace: default
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: "localhost/my-profile.json"
    # you can specify seccomp profile per container. If you add another profile you can configure
    # it for a specific container - 'pause' in this case.
    # container.seccomp.security.alpha.kubernetes.io/pause: "localhost/some-other-profile.json"
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
Create the Pod
and see that it’s running:
$ kubectl apply -f my-seccomp-pod.yaml
pod/seccomp-app created
$ kubectl get pod seccomp-app
NAME READY STATUS RESTARTS AGE
seccomp-app 1/1 Running 0 42s
Troubleshooting
If an invalid or a non-existing profile is used, then the Pod
will be stuck in ContainerCreating
phase:
broken-seccomp-pod.yaml
:
apiVersion: v1
kind: Pod
metadata:
  name: broken-seccomp
  namespace: default
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: "localhost/not-existing-profile.json"
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
$ kubectl apply -f broken-seccomp-pod.yaml
pod/broken-seccomp created
$ kubectl get pod broken-seccomp
NAME READY STATUS RESTARTS AGE
broken-seccomp 0/1 ContainerCreating 0 2m
$ kubectl describe pod broken-seccomp
Name: broken-seccomp
Namespace: default
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 18s default-scheduler Successfully assigned kube-system/broken-seccomp to docker-desktop
Warning FailedCreatePodSandBox 4s (x2 over 18s) kubelet, docker-desktop Failed create pod sandbox: rpc error: code = Unknown desc = failed to make sandbox docker config for pod "broken-seccomp": failed to generate sandbox security options
for sandbox "broken-seccomp": failed to generate seccomp security options for container: cannot load seccomp profile "/var/lib/kubelet/seccomp/not-existing-profile.json": open /var/lib/kubelet/seccomp/not-existing-profile.json: no such file or directory
6.8 - Dockerfile Pitfalls
Using the latest
Tag for an Image
Many Dockerfiles use the FROM package:latest
pattern at the top of their Dockerfiles to pull the latest image from a Docker registry.
Bad Dockerfile
FROM alpine
While simple, using the latest tag for an image means that your build can suddenly break if that image gets updated. This can lead to problems where everything builds fine locally (because your local cache thinks it is the latest), while a build server may fail, because some pipelines make a clean pull on every build. Additionally, troubleshooting can prove to be difficult, since the maintainer of the Dockerfile didn’t actually make any changes.
Good Dockerfile
A digest takes the place of the tag when pulling an image. This will ensure that your Dockerfile remains immutable.
FROM alpine@sha256:7043076348bf5040220df6ad703798fd8593a0918d06d3ce30c6c93be117e430
Running apt/apk/yum update
Running apt-get install
is one of those things virtually every Debian-based Dockerfile will have to do in order to satisfy some external package requirements your code needs to run. However, using apt-get
as an example, this comes with its own problems.
apt-get upgrade
This will update all your packages to their latest versions, which can be bad because it prevents your Dockerfile from creating consistent, immutable builds.
apt-get update (in a different line than the one running your apt-get install command)
Running apt-get update
as a single line entry will get cached by the build and won’t actually run every time you need to run apt-get install
. Instead, make sure you run apt-get update
in the same line with all the packages to ensure that all are updated correctly.
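As an illustration, the following sketch writes a Dockerfile in which the update and the install share a single RUN instruction and therefore a single cache layer. The base image tag and the package (`curl`) are placeholders, and `write_dockerfile` is a hypothetical helper. A tag is used only because no digest is known here; in a real Dockerfile, pin the base image by digest as recommended above.

```shell
#!/usr/bin/env bash
# Sketch: generate a Dockerfile that runs `apt-get update` and
# `apt-get install` in one RUN instruction.

# write_dockerfile PATH -> writes the example Dockerfile to PATH
write_dockerfile() {
  cat > "$1" <<'EOF'
FROM debian:bookworm-slim
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*
EOF
}
```

Running `write_dockerfile Dockerfile` followed by `docker build .` builds the image. Removing the package lists in the same instruction also keeps the layer small.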
Avoid Big Container Images
Building a small container image will reduce the time needed to start or restart pods. An image based on the popular Alpine Linux project (~5 MB) is much smaller than most distribution-based images. For most popular languages and products, there is usually an official Alpine Linux image, e.g., golang, nodejs, and postgres.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
postgres 9.6.9-alpine 6583932564f8 13 days ago 39.26 MB
postgres 9.6 d92dad241eff 13 days ago 235.4 MB
postgres 10.4-alpine 93797b0f31f4 13 days ago 39.56 MB
In addition, for compiled languages such as Go or C++ that do not require build time tooling during runtime, it is recommended to avoid build time tooling in the final images. With Docker’s support for multi-stage builds, this can be achieved with minimal effort. Such an example can be found at Multi-stage builds.
Google’s distroless image is also a good base image.
6.9 - Dynamic Volume Provisioning
Overview
This example shows how to run a Postgres database on Kubernetes and how to dynamically provision and mount the storage volumes needed by the database.
Run Postgres Database
Define the following Kubernetes resources in a yaml file:
- PersistentVolumeClaim (PVC)
- Deployment
PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgresdb-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 9Gi
  storageClassName: 'default'
This defines a PVC using the storage class default
. Storage classes abstract from the underlying storage provider as well as other parameters, like disk-type (e.g., solid-state vs standard disks).
The default storage class has the annotation {"storageclass.kubernetes.io/is-default-class":"true"}.
$ kubectl describe sc default
Name: default
IsDefaultClass: Yes
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1beta1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"},"labels":{"addonmanager.kubernetes.io/mode":"Exists"},"name":"default","namespace":""},"parameters":{"type":"gp2"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.kubernetes.io/is-default-class=true
Provisioner: kubernetes.io/aws-ebs
Parameters: type=gp2
AllowVolumeExpansion: <unset>
MountOptions: <none>
ReclaimPolicy: Delete
VolumeBindingMode: Immediate
Events: <none>
A Persistent Volume is automatically created when it is dynamically provisioned. In the following example, the PVC is defined as “postgresdb-pvc”, and a corresponding PV “pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb” is created and associated with the PVC automatically.
$ kubectl create -f .\postgres_deployment.yaml
persistentvolumeclaim "postgresdb-pvc" created
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO Delete Bound default/postgresdb-pvc default 3s
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
postgresdb-pvc Bound pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO default 8s
Notice that the RECLAIM POLICY is Delete (the default value), which is one of the two reclaim policies; the other one is Retain. (A third policy, Recycle, has been deprecated.) In the case of Delete, the PV is deleted automatically when the PVC is removed, and the data on the PV is lost as well.
On the other hand, a PV with the Retain policy is not deleted when the PVC is removed. Instead, it moves to the Released status, so that the data can be recovered by administrators later.
You can use the kubectl patch
command to change the reclaim policy as described in Change the Reclaim Policy of a PersistentVolume
or use kubectl edit pv <pv-name>
to edit it online as shown below:
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO Delete Bound default/postgresdb-pvc default 44m
# change the reclaim policy from "Delete" to "Retain"
$ kubectl edit pv pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb
persistentvolume "pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb" edited
# check the reclaim policy afterwards
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO Retain Bound default/postgresdb-pvc default 45m
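The non-interactive variant with kubectl patch can be sketched as follows. The helper names `reclaim_patch` and `set_reclaim_policy` are hypothetical; the patch body sets the `spec.persistentVolumeReclaimPolicy` field of the PersistentVolume.

```shell
#!/usr/bin/env bash
# Sketch: change the reclaim policy of a PersistentVolume without an editor.

# reclaim_patch POLICY -> prints the JSON patch, rejects unknown policies
reclaim_patch() {
  case "$1" in
    Retain|Delete) printf '{"spec":{"persistentVolumeReclaimPolicy":"%s"}}\n' "$1" ;;
    *) echo "unknown reclaim policy: $1" >&2; return 1 ;;
  esac
}

# set_reclaim_policy PV_NAME POLICY -> applies the patch to the given PV
set_reclaim_policy() {
  local patch
  patch="$(reclaim_patch "$2")" || return 1
  kubectl patch pv "$1" -p "$patch"
}
```

For the volume in this example, the call would be `set_reclaim_policy pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb Retain`. Validating the policy first avoids sending a patch that the API server would reject.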
Deployment
Once a PVC is created, you can use it in your container via volumes.persistentVolumeClaim.claimName
. In the below example, the PVC postgresdb-pvc is mounted as readable and writable, and in volumeMounts
two paths in the container are mounted to subfolders in the volume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: default
  labels:
    app: postgres
  annotations:
    deployment.kubernetes.io/revision: "1"
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      name: postgres
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: "cpettech.docker.repositories.sap.ondemand.com/jtrack_postgres:howto"
          env:
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              value: p5FVqfuJFrM42cVX9muQXxrC3r8S9yn0zqWnFR6xCoPqxqVQ
            - name: POSTGRES_INITDB_XLOGDIR
              value: "/var/log/postgresql/logs"
          ports:
            - containerPort: 5432
          volumeMounts:
            - mountPath: /var/lib/postgresql/data
              name: postgre-db
              subPath: data # https://github.com/kubernetes/website/pull/2292. Solve the issue of crashing initdb due to non-empty directory (i.e. lost+found)
            - mountPath: /var/log/postgresql/logs
              name: postgre-db
              subPath: logs
      volumes:
        - name: postgre-db
          persistentVolumeClaim:
            claimName: postgresdb-pvc
            readOnly: false
      imagePullSecrets:
        - name: cpettechregistry
To check the mount points in the container:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
postgres-7f485fd768-c5jf9 1/1 Running 0 32m
$ kubectl exec -it postgres-7f485fd768-c5jf9 -- bash
root@postgres-7f485fd768-c5jf9:/# ls /var/lib/postgresql/data/
base pg_clog pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc PG_VERSION postgresql.auto.conf postmaster.opts
global pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_xlog postgresql.conf postmaster.pid
root@postgres-7f485fd768-c5jf9:/# ls /var/log/postgresql/logs/
000000010000000000000001 archive_status
Deleting a PersistentVolumeClaim
In case of a Delete policy, deleting a PVC will also delete its associated PV. If Retain is the reclaim policy, the PV will change status from Bound to Released when the PVC is deleted.
# Check pvc and pv before deletion
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
postgresdb-pvc Bound pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO default 50m
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO Retain Bound default/postgresdb-pvc default 50m
# delete pvc
$ kubectl delete pvc postgresdb-pvc
persistentvolumeclaim "postgresdb-pvc" deleted
# pv changed to status "Released"
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-06c81c30-72ea-11e8-ada2-aa3b2329c8bb 9Gi RWO Retain Released default/postgresdb-pvc default 51m
6.10 - Install Knative in Gardener Clusters
Overview
This guide walks you through the installation of the latest version of Knative using pre-built images on a Gardener created cluster environment. To set up your own Gardener, see the documentation or have a look at the landscape-setup-template project. To learn more about this open source project, read the blog on kubernetes.io.
Prerequisites
Knative requires a Kubernetes cluster v1.15 or newer.
Steps
Install and Configure kubectl
If you already have the kubectl CLI, run kubectl version --short to check the version. You need v1.10 or newer. If your kubectl is older, follow the next step to install a newer version.
Access Gardener
Create a project in the Gardener dashboard. This will essentially create a Kubernetes namespace with the name garden-<my-project>.
Configure access to your Gardener project using a kubeconfig.
If you are not the Gardener Administrator already, you can create a technical user in the Gardener dashboard. Go to the “Members” section and add a service account. You can then download the kubeconfig for your project. You can skip this step if you create your cluster using the user interface; it is only needed for programmatic access. If you do need it, make sure you set export KUBECONFIG=garden-my-project.yaml in your shell.
Creating a Kubernetes Cluster
You can create your cluster using the kubectl CLI by providing a cluster specification yaml file. You can find an example for GCP in the gardener/gardener repository. Make sure the namespace matches that of your project. Then just apply the prepared so-called “shoot” cluster CRD with kubectl:
kubectl apply --filename my-cluster.yaml
The easier alternative is to create the cluster following the cluster creation wizard in the Gardener dashboard.
Configure kubectl for Your Cluster
You can now download the kubeconfig for your freshly created cluster in the Gardener dashboard or via the CLI as follows:
kubectl --namespace shoot--my-project--my-cluster get secret kubecfg --output jsonpath={.data.kubeconfig} | base64 --decode > my-cluster.yaml
This kubeconfig file has full administrator access to your cluster. For the rest of this guide, be sure you have export KUBECONFIG=my-cluster.yaml set.
Installing Istio
Knative depends on Istio. If your cloud platform offers a managed Istio installation, we recommend installing Istio that way, unless you need the ability to customize your installation.
Otherwise, see the Installing Istio for Knative guide to install Istio.
You must install Istio on your Kubernetes cluster before continuing with these instructions to install Knative.
Installing cluster-local-gateway for Serving Cluster-Internal Traffic
If you installed Istio, you can install a cluster-local-gateway within your Knative cluster so that you can serve cluster-internal traffic. If you want to configure your revisions to use routes that are visible only within your cluster, install and use the cluster-local-gateway.
Installing Knative
The following commands install all available Knative components as well as the standard set of observability plugins. For details, see Knative’s installation guide, Installing Knative.
If you are upgrading from Knative 0.3.x: Update your domain and static IP address to be associated with the LoadBalancer istio-ingressgateway instead of knative-ingressgateway. Then run the following to clean up leftover resources:
kubectl delete svc knative-ingressgateway -n istio-system
kubectl delete deploy knative-ingressgateway -n istio-system
If you have the Knative Eventing Sources component installed, you will also need to delete the following resource before upgrading:
kubectl delete statefulset/controller-manager -n knative-sources
While the deletion of this resource during the upgrade process will not prevent modifications to Eventing Source resources, those changes will not be completed until the upgrade process finishes.
To install Knative, first install the CRDs by running the kubectl apply command once with the -l knative.dev/crd-install=true flag. This prevents race conditions during the install, which cause intermittent errors:
kubectl apply --selector knative.dev/crd-install=true \
--filename https://github.com/knative/serving/releases/download/v0.12.1/serving.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.12.1/eventing.yaml \
--filename https://github.com/knative/serving/releases/download/v0.12.1/monitoring.yaml
To complete the installation of Knative and its dependencies, run the kubectl apply command again, this time without the --selector flag:
kubectl apply --filename https://github.com/knative/serving/releases/download/v0.12.1/serving.yaml \
--filename https://github.com/knative/eventing/releases/download/v0.12.1/eventing.yaml \
--filename https://github.com/knative/serving/releases/download/v0.12.1/monitoring.yaml
Monitor the Knative components until all of the components show a STATUS of Running:
kubectl get pods --namespace knative-serving
kubectl get pods --namespace knative-eventing
kubectl get pods --namespace knative-monitoring
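Instead of polling kubectl get pods by hand, the wait can be scripted. The following is a sketch only: the helper name wait_for_knative and the KUBECTL override variable are assumptions for illustration, and it waits for the Ready condition rather than the Running status, so it may time out if a namespace contains pods of finished Jobs.

```shell
# Command used to talk to the cluster; overridable for dry runs
# (assumption for this sketch).
KUBECTL="${KUBECTL:-kubectl}"

# Block until all pods in the three Knative namespaces report Ready,
# or fail once any namespace exceeds the timeout.
wait_for_knative() {
  for wk_ns in knative-serving knative-eventing knative-monitoring; do
    "$KUBECTL" wait --for=condition=Ready pods --all \
      --namespace "$wk_ns" --timeout=300s || return 1
  done
}

# Example (not executed here):
#   wait_for_knative && echo "Knative is ready"
```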
Set Your Custom Domain
- Fetch the external IP or CNAME of the knative-ingressgateway:
kubectl --namespace istio-system get service knative-ingressgateway
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
knative-ingressgateway LoadBalancer 100.70.219.81 35.233.41.212 80:32380/TCP,443:32390/TCP,32400:32400/TCP 4d
- Create a wildcard DNS entry in your custom domain to point to the above IP or CNAME:
*.knative.<my domain> == A 35.233.41.212
# or CNAME if you are on AWS
*.knative.<my domain> == CNAME a317a278525d111e89f272a164fd35fb-1510370581.eu-central-1.elb.amazonaws.com
- Adapt your Knative config-domain (set your domain in the data field):
kubectl --namespace knative-serving get configmaps config-domain --output yaml
apiVersion: v1
data:
  knative.<my domain>: ""
kind: ConfigMap
metadata:
  name: config-domain
  namespace: knative-serving
What’s Next
Now that your cluster has Knative installed, you can see what Knative has to offer.
Deploy your first app with the Getting Started with Knative App Deployment guide.
Get started with Knative Eventing by walking through one of the Eventing Samples.
Install Cert-Manager if you want to use the automatic TLS cert provisioning feature.
Cleaning Up
Use the Gardener dashboard to delete your cluster, or execute the following with kubectl pointing to your garden-my-project.yaml kubeconfig:
kubectl --kubeconfig garden-my-project.yaml --namespace garden--my-project annotate shoot my-cluster confirmation.gardener.cloud/deletion=true
kubectl --kubeconfig garden-my-project.yaml --namespace garden--my-project delete shoot my-cluster
6.11 - Integrity and Immutability
Introduction
When transferring data among networked systems, trust is a central concern. In particular, when communicating over an untrusted medium such as the internet, it is critical to ensure the integrity and immutability of all the data a system operates on. This is especially true if you use Docker Engine to push and pull images (data) to and from a public registry.
This immutability offers you a guarantee that any and all containers that you instantiate will be absolutely identical at inception. Surprise surprise, deterministic operations.
A Lesson in Deterministic Ops
Docker Tags are about as reliable and disposable as this guy down here.
Seems simple enough. You have probably already deployed hundreds of YAML files or started countless Docker containers.
docker run --name mynginx1 -P -d nginx:1.13.9
or
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rss-site
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: front-end
        image: nginx:1.13.9
        ports:
        - containerPort: 80
But Tags are mutable and humans are prone to error. Not a good combination. Here, we’ll dig into why the use of tags can be dangerous and how to deploy your containers across a pipeline and across environments with determinism in mind.
Let’s say that you want to ensure that whether it’s today or 5 years from now, that specific deployment uses the very same image that you have defined. Any updates or newer versions of an image should be executed as a new deployment. The solution: digest
A digest takes the place of the tag when pulling an image. For example, to pull the above image by digest, run the following command:
docker run --name mynginx1 -P -d nginx@sha256:4771d09578c7c6a65299e110b3ee1c0a2592f5ea2618d23e4ffe7a4cab1ce5de
You can now make sure that the same image is always loaded at every deployment. It doesn’t matter if the TAG of the image has been changed or not. This solves the problem of repeatability.
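To pin an image you first need its digest. The sketch below shows one way to look it up with docker inspect and a small helper that turns a repository plus digest into an immutable reference. The helper name pin_image is hypothetical; the docker command is shown as a comment because it needs a local Docker daemon and an already pulled image.

```shell
# Look up the digest behind a tag you have already pulled (requires Docker):
#   docker inspect --format '{{index .RepoDigests 0}}' nginx:1.13.9

# Combine a repository (with or without tag) and a sha256 digest into an
# immutable image reference. Hypothetical helper for illustration.
pin_image() {
  pi_repo="$1"
  pi_digest="$2"
  pi_hex="${pi_digest#sha256:}"
  # The digest must be "sha256:" followed by exactly 64 hex characters.
  if [ "$pi_hex" = "$pi_digest" ] || [ "${#pi_hex}" -ne 64 ]; then
    echo "invalid digest: $pi_digest" >&2
    return 1
  fi
  case "$pi_hex" in
    *[!0-9a-f]*) echo "invalid digest: $pi_digest" >&2; return 1 ;;
  esac
  # Drop a trailing ":tag" but keep a registry port such as "host:5000/".
  pi_name="${pi_repo##*/}"
  case "$pi_name" in
    *:*) pi_repo="${pi_repo%:*}" ;;
  esac
  printf '%s@%s\n' "$pi_repo" "$pi_digest"
}
```

The resulting string can be used anywhere a tagged image name is accepted, for example in the image field of the Deployment shown earlier.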
Content Trust
However, there’s an additional hidden danger. It is possible for an attacker to replace a server image with another one infected with malware.
Docker Content Trust gives you the ability to verify both the integrity and the publisher of all the data received from a registry over any channel.
Prior to version 1.8, Docker didn’t have a way to verify the authenticity of a server image. But in v1.8, a new feature called Docker Content Trust was introduced to automatically sign and verify the signature of a publisher.
So, as soon as a server image is downloaded, it is cross-checked with the signature of the publisher to see if someone tampered with it in any way. This solves the problem of trust.
In addition, you should scan all images for known vulnerabilities.
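Docker Content Trust is switched on per shell through an environment variable. A minimal sketch, assuming a Docker client of version 1.8 or newer; the pull is shown as a comment because it needs a Docker daemon and registry access.

```shell
# Enable Docker Content Trust for this shell session.
export DOCKER_CONTENT_TRUST=1

# With the variable set, pulls by tag are verified against the publisher's
# signature, and unsigned tags are rejected (not executed here):
#   docker pull nginx:1.13.9
```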
6.12 - Kubernetes Antipatterns
This HowTo covers common Kubernetes antipatterns that we have seen over the past months.
Running as Root User
Whenever possible, do not run containers as root user. One could be tempted to say that Kubernetes pods and nodes are well separated. However, the host and the containers running on it share the same kernel. If a container is compromised, the root user in the container can gain full control over the underlying node.
Watch the very good presentation by Liz Rice at the KubeCon 2018.
Use RUN groupadd -r anygroup && useradd -r -g anygroup myuser to create a group and add a user to it. Use the USER command to switch to this user. Note that you may also consider providing an explicit UID/GID if required.
For example:
ARG GF_UID="500"
ARG GF_GID="500"
# add group & user
RUN groupadd -r -g $GF_GID appgroup && \
useradd appuser -r -u $GF_UID -g appgroup
USER appuser
Store Data or Logs in Containers
Containers are ideal for stateless applications and should be transient. This means that no data or logs should be stored in the container, as they are lost when the container is removed. Use persistent volumes instead to persist data outside of containers. Using an ELK stack is another good option for storing and processing logs.
Using Pod IP Addresses
Each pod is assigned an IP address. Pods need to communicate with each other to build an application, e.g. an application must communicate with a database. Existing pods are terminated and new pods are constantly started. If you relied on the IP address of a pod or container, you would need to update the application configuration constantly. This makes the application fragile.
Create services instead. They provide a logical name that can be assigned independently of the varying number and IP addresses of containers. Services are the basic concept for load balancing within Kubernetes.
More Than One Process in a Container
A Dockerfile provides a CMD and ENTRYPOINT to start the image. CMD is often used around a script that makes a configuration and then starts the container. Do not try to start multiple processes with this script. It is important to consider the separation of concerns when creating docker images. Running multiple processes in a single container makes managing your containers, collecting logs and updating each process more difficult.
You can split the image into multiple containers and manage them independently - even in one pod. Bear in mind that Kubernetes only monitors the process with PID=1. If more than one process is started within a container, then these no longer fall under the control of Kubernetes.
Creating Images in a Running Container
A new image can be created with the docker commit command. This is useful if changes have been made to the container and you want to persist them for later error analysis. However, images created like this are not reproducible and completely worthless for a CI/CD environment. Furthermore, another developer cannot recognize which components the image contains. Instead, always make changes to the Dockerfile, stop existing containers and start a new container with the updated image.
Saving Passwords in a docker Image 💀
Do not save passwords in a Dockerfile! They are in plain text and are checked into a repository. That makes them completely vulnerable even if you are using a private repository like the Artifactory.
Always use Secrets to provision passwords, either as environment variables or by mounting them as a volume. ConfigMaps are stored in plain text and are meant for non-confidential configuration only.
Using the ’latest’ Tag
Starting an image with tomcat is tempting. If no tags are specified, a container is started with the tomcat:latest image. This image may no longer be up to date and refer to an older version instead. Running a production application requires complete control of the environment with exact versions of the image.
Make sure you always use a tag or even better the sha256 hash of the image, e.g., tomcat@sha256:c34ce3c1fcc0c7431e1392cc3abd0dfe2192ffea1898d5250f199d3ac8d8720f.
Why Use the sha256 Hash?
Tags are not immutable and can be overwritten by a developer at any time. In this case you don’t have complete control over your image - which is bad.
Different Images per Environment
Don’t create different images for development, testing, staging and production environments. The image should be the source of truth and should only be created once and pushed to the repository. This image:tag should be used for different environments in the future.
Depend on Start Order of Pods
Applications often depend on containers being started in a certain order. For example, a database container must be up and running before an application can connect to it. The application should be resilient to such changes, as the db pod can be unreachable or restarted at any time. The application container should be able to handle such situations without terminating or crashing.
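One common way to make an application tolerate a dependency that is not yet reachable is to retry the connection with an increasing delay instead of exiting on the first failure. The following is a minimal sketch of that idea as a shell function; the name retry and the RETRY_INITIAL_DELAY variable are hypothetical, and a real application would typically implement the same loop in its own language.

```shell
# Run a command until it succeeds, doubling the delay after each failure.
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
retry() {
  rt_max="$1"
  shift
  rt_delay="${RETRY_INITIAL_DELAY:-1}"  # seconds before the second attempt
  rt_attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$rt_attempt" -ge "$rt_max" ]; then
      echo "giving up after $rt_attempt attempts: $*" >&2
      return 1
    fi
    sleep "$rt_delay"
    rt_delay=$((rt_delay * 2))
    rt_attempt=$((rt_attempt + 1))
  done
}

# Example (not executed here): wait for a database before starting work.
#   retry 6 pg_isready -h postgres -p 5432
```

With the default initial delay, six attempts wait 1, 2, 4, 8 and 16 seconds between tries before giving up.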
Additional Anti-Patterns and Patterns
In the community, vast experience has been collected to improve the stability and usability of Docker and Kubernetes.
Refer to Kubernetes Production Patterns for more information.
6.13 - Namespace Isolation
Overview
You can configure a NetworkPolicy to deny all the traffic from other namespaces while allowing all the traffic coming from the same namespace the pod was deployed into.
There are many reasons why you may choose to employ Kubernetes network policies:
- Isolate multi-tenant deployments
- Regulatory compliance
- Ensure containers assigned to different environments (e.g. dev/staging/prod) cannot interfere with each other
Kubernetes network policies are application centric compared to infrastructure/network centric standard firewalls. There are no explicit CIDRs or IP addresses used for matching source or destination IPs. Network policies build on labels and selectors, which are key concepts of Kubernetes that are used to organize (for example, all DB tier pods of an app) and select subsets of objects.
Example
We create two nginx HTTP servers in two namespaces and block all traffic between the two namespaces. For example, you are unable to get content from the namespace customer1 if you are sitting in the namespace customer2.
Set Up the Namespaces
# create two namespaces for test purpose
kubectl create ns customer1
kubectl create ns customer2
# create a standard HTTP web server
kubectl create deployment nginx --image=nginx --replicas=1 --port=80 -n=customer1
kubectl create deployment nginx --image=nginx --replicas=1 --port=80 -n=customer2
# expose the port 80 for external access
kubectl expose deployment nginx --port=80 --type=NodePort -n=customer1
kubectl expose deployment nginx --port=80 --type=NodePort -n=customer2
Test Without NP
Create a pod with curl preinstalled inside the namespace customer1:
# create a "bash" pod in one namespace
kubectl run -i --tty client --image=tutum/curl -n=customer1
Try to curl the exposed nginx server to get the default index.html page. Execute this in the bash prompt of the pod created above.
# get the index.html from the nginx of the namespace "customer1" => success
curl http://nginx.customer1
# get the index.html from the nginx of the namespace "customer2" => success
curl http://nginx.customer2
Both calls are done in a pod within the namespace customer1 and both nginx servers are always reachable, no matter in what namespace.
Test with NP
Install the NetworkPolicy from your shell:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
spec:
  podSelector:
    matchLabels:
  ingress:
  - from:
    - podSelector: {}
- it applies the policy to ALL pods in the named namespace as the spec.podSelector.matchLabels is empty and therefore selects all pods.
- it allows traffic from ALL pods in the named namespace, as spec.ingress.from.podSelector is empty and therefore selects all pods.
kubectl apply -f ./network-policy.yaml -n=customer1
kubectl apply -f ./network-policy.yaml -n=customer2
After this, curl http://nginx.customer2 shouldn’t work anymore if you are a service inside the namespace customer1 and vice versa.
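A blocked request does not fail immediately, it hangs until curl gives up, so a timeout makes the check practical. The following sketch wraps the two curl calls from above; the function name probe and the CURL override variable are assumptions for illustration, and the calls are meant to be run in the bash prompt of the client pod.

```shell
# Command used for requests; overridable so the logic can be exercised
# without a cluster (assumption for this sketch).
CURL="${CURL:-curl}"

# Report whether a URL answers successfully within five seconds.
probe() {
  if "$CURL" --silent --fail --max-time 5 --output /dev/null "$1"; then
    echo "reachable: $1"
  else
    echo "blocked or unreachable: $1"
  fi
}

# Inside the client pod in namespace customer1 (not executed here):
#   probe http://nginx.customer1   # expected to stay reachable
#   probe http://nginx.customer2   # expected to be blocked once the policy is applied
```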
Note
This policy, once applied, will also disable all external traffic to these pods. For example, you can create a service of type LoadBalancer in namespace customer1 that matches the nginx pod. When you request the service by its <EXTERNAL_IP>:<PORT>, the network policy will deny the ingress traffic from the service and the request will time out.
Related Links
You can get more information on how to configure the NetworkPolicies at:
6.14 - Orchestration of Container Startup
Disclaimer
If an application depends on other services deployed separately, do not rely on a certain start sequence of containers. Instead, ensure that the application can cope with unavailability of the services it depends on.
Introduction
Kubernetes offers a feature called InitContainers to perform some tasks during a pod’s initialization.
In this tutorial, we demonstrate how to use InitContainers in order to orchestrate a starting sequence of multiple containers. The tutorial uses the example app url-shortener, which consists of two components:
- postgresql database
- webapp which depends on the postgresql database and provides two endpoints: create a short url from a given location and redirect from a given short URL to the corresponding target location
This app represents the minimal example where an application relies on another service or database. In this example, if the application starts before the database is ready, the application will fail as shown below:
$ kubectl logs webapp-958cf5567-h247n
time="2018-06-12T11:02:42Z" level=info msg="Connecting to Postgres database using: host=`postgres:5432` dbname=`url_shortener_db` username=`user`\n"
time="2018-06-12T11:02:42Z" level=fatal msg="failed to start: failed to open connection to database: dial tcp: lookup postgres on 100.64.0.10:53: no such host\n"
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
webapp-958cf5567-h247n 0/1 Pending 0 0s
webapp-958cf5567-h247n 0/1 Pending 0 0s
webapp-958cf5567-h247n 0/1 ContainerCreating 0 0s
webapp-958cf5567-h247n 0/1 ContainerCreating 0 1s
webapp-958cf5567-h247n 0/1 Error 0 2s
webapp-958cf5567-h247n 0/1 Error 1 3s
webapp-958cf5567-h247n 0/1 CrashLoopBackOff 1 4s
webapp-958cf5567-h247n 0/1 Error 2 18s
webapp-958cf5567-h247n 0/1 CrashLoopBackOff 2 29s
webapp-958cf5567-h247n 0/1 Error 3 43s
webapp-958cf5567-h247n 0/1 CrashLoopBackOff 3 56s
If the restartPolicy is set to Always (default) in the yaml file, Kubernetes will keep restarting the failed container with an exponential back-off delay.
Using InitContainers
To avoid such a situation, InitContainers can be defined, which are executed prior to the application container. If one of the InitContainers fails, the application container won’t be started.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      initContainers: # check if DB is ready, and only continue when true
      - name: check-db-ready
        image: postgres:9.6.5
        command: ['sh', '-c', 'until pg_isready -h postgres -p 5432; do echo waiting for database; sleep 2; done;']
      containers:
      - image: xcoulon/go-url-shortener:0.1.0
        name: go-url-shortener
        env:
        - name: POSTGRES_HOST
          value: postgres
        - name: POSTGRES_PORT
          value: "5432"
        - name: POSTGRES_DATABASE
          value: url_shortener_db
        - name: POSTGRES_USER
          value: user
        - name: POSTGRES_PASSWORD
          value: mysecretpassword
        ports:
        - containerPort: 8080
In the above example, the InitContainers use the docker image postgres:9.6.5, which is different from the image of the application container.
This also brings the advantage of not having to include unnecessary tools (e.g., pg_isready) in the application container.
With the introduction of InitContainers, if the database is not available yet, the pod startup will look similar to:
$ kubectl get po -w
NAME READY STATUS RESTARTS AGE
nginx-deployment-5cc79d6bfd-t9n8h 1/1 Running 0 5d
privileged-pod 1/1 Running 0 4d
webapp-fdcb49cbc-4gs4n 0/1 Pending 0 0s
webapp-fdcb49cbc-4gs4n 0/1 Pending 0 0s
webapp-fdcb49cbc-4gs4n 0/1 Init:0/1 0 0s
webapp-fdcb49cbc-4gs4n 0/1 Init:0/1 0 1s
$ kubectl logs webapp-fdcb49cbc-4gs4n
Error from server (BadRequest): container "go-url-shortener" in pod "webapp-fdcb49cbc-4gs4n" is waiting to start: PodInitializing
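The until loop in the init container above waits indefinitely. If you prefer the init container to exit with a non-zero status after a while, the loop can be given a deadline. This is a sketch under assumptions: the function name wait_until and its argument order are hypothetical, and the elapsed time is approximated by summing the sleep intervals.

```shell
# Poll a command until it succeeds or roughly TIMEOUT seconds have passed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  wu_timeout="$1"
  wu_interval="$2"
  shift 2
  wu_elapsed=0
  until "$@"; do
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      echo "timed out after ${wu_elapsed}s waiting for: $*" >&2
      return 1
    fi
    echo "waiting for: $*"
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
}

# Example (not executed here): give the database two minutes to come up.
#   wait_until 120 2 pg_isready -h postgres -p 5432
```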
6.15 - Out-Dated HTML and JS Files Delivered
Problem
After updating your HTML and JavaScript sources in your web application, the Kubernetes cluster delivers outdated versions - why?
Overview
By default, Kubernetes service pods are not accessible from the external network, but only from other pods within the same Kubernetes cluster.
The Gardener cluster has a built-in configuration for HTTP load balancing called Ingress, defining rules for external connectivity to Kubernetes services. Users who want external access to their Kubernetes services create an ingress resource that defines rules, including the URI path, backing service name, and other information. The Ingress controller can then automatically program a frontend load balancer to enable Ingress configuration.
Example Ingress Configuration
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: vuejs-ingress
spec:
  rules:
  - host: test.ingress.<GARDENER-CLUSTER>.<GARDENER-PROJECT>.shoot.canary.k8s-hana.ondemand.com
    http:
      paths:
      - backend:
          serviceName: vuejs-svc
          servicePort: 8080
where:
- <GARDENER-CLUSTER>: The cluster name in Gardener
- <GARDENER-PROJECT>: Your project name in Gardener
Diagnosing the Problem
The ingress controller we are using is NGINX. NGINX is a software load balancer, web server, and content cache.
NGINX caches the content as specified in the HTTP header. If the HTTP header is missing, it is assumed that the cache is forever and, in the worst case, NGINX never updates the content.
Solution
In general, you can avoid this pitfall with one of the solutions below:
- Use a cache buster + HTTP-Cache-Control (preferred)
- Use HTTP-Cache-Control with a lower retention period
- Disable the caching in the ingress (just for dev purposes)
Learning how to set the HTTP header or set up a cache buster is left to you, as an exercise for your web framework (e.g., Express/NodeJS, SpringBoot, …)
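To give an idea of what a cache buster does, the sketch below derives a file name from the file's content, so that a changed file gets a new URL and cached copies of the old one are simply never requested again. The function name busted_name is hypothetical and cksum is used only because it is available everywhere; build tools of web frameworks usually provide this out of the box with a stronger hash.

```shell
# Print a cache-busting file name derived from the file's content,
# e.g. app.js -> app.<checksum>.js. Hypothetical helper for illustration.
busted_name() {
  bn_file="$1"
  bn_sum="$(cksum < "$bn_file" | cut -d ' ' -f 1)"
  bn_base="${bn_file%.*}"   # path without the extension
  bn_ext="${bn_file##*.}"   # extension without the dot
  printf '%s.%s.%s\n' "$bn_base" "$bn_sum" "$bn_ext"
}

# Example (not executed here): publish the file under its new name and
# reference that name from your HTML.
#   cp dist/app.js "$(busted_name dist/app.js)"
```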
Here is an example of how to disable the cache control for your ingress, done with an annotation in your ingress YAML (during development).
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/cache-enable: "false"
  name: vuejs-ingress
spec:
  rules:
  - host: test.ingress.<GARDENER-CLUSTER>.<GARDENER-PROJECT>.shoot.canary.k8s-hana.ondemand.com
    http:
      paths:
      - backend:
          serviceName: vuejs-svc
          servicePort: 8080
6.16 - Remove Committed Secrets in Github 💀
Overview
If you commit sensitive data, such as a kubeconfig.yaml or SSH key, into a Git repository, you can remove it from the history. To entirely remove unwanted files from a repository’s history you can use the git filter-branch command.
The git filter-branch command rewrites your repository’s history, which changes the SHAs for existing commits that you alter and any dependent commits. Changed commit SHAs may affect open pull requests in your repository. Merging or closing all open pull requests before removing files from your repository is recommended.
Warning
If someone has already checked out the repository, then of course they have the secret on their computer. So ALWAYS revoke the OAuthToken/Password or whatever it was immediately.
Purging a File from Your Repository’s History
Warning
If you run git filter-branch after stashing changes, you won’t be able to retrieve your changes with other stash commands. Before running git filter-branch, we recommend unstashing any changes you’ve made. To unstash the last set of changes you’ve stashed, run git stash show -p | git apply -R. For more information, see Git Tools - Stashing and Cleaning.
To illustrate how git filter-branch works, we’ll show you how to remove your file with sensitive data from the history of your repository and add it to .gitignore to ensure that it is not accidentally re-committed.
1. Navigate into the repository’s working directory:
cd YOUR-REPOSITORY
2. Run the following command, replacing PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA with the path to the file you want to remove, not just its filename.
These arguments will:
- Force Git to process, but not check out, the entire history of every branch and tag
- Remove the specified file, as well as any empty commits generated as a result
- Overwrite your existing tags
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA' \
--prune-empty --tag-name-filter cat -- --all
3. Add your file with sensitive data to .gitignore to ensure that you don’t accidentally commit it again:
echo "YOUR-FILE-WITH-SENSITIVE-DATA" >> .gitignore
Double-check that you’ve removed everything you wanted to from your repository’s history, and that all of your branches are checked out. Once you’re happy with the state of your repository, continue to the next step.
4. Force-push your local changes to overwrite your GitHub repository, as well as all the branches you’ve pushed up:
git push origin --force --all
5. In order to remove the sensitive file from your tagged releases, you’ll also need to force-push against your Git tags:
git push origin --force --tags
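The double-check before the force-push can be scripted. The following sketch reports whether any commit reachable from a local branch or tag still touches a given path. The function name path_in_history is hypothetical; note that it deliberately does not look at the backup refs that git filter-branch keeps under refs/original, nor at remote-tracking branches, which still point at the old history until you push.

```shell
# Succeeds (exit 0) if a commit reachable from any local branch or tag
# still touches the given path, fails (exit 1) otherwise.
path_in_history() {
  [ -n "$(git log --branches --tags --format=%H -- "$1")" ]
}

# Example, run inside the rewritten repository (not executed here):
#   if path_in_history PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA; then
#     echo "file is still present in history" >&2
#   fi
```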
Warning
Tell your collaborators to rebase, not merge, any branches they created off of your old (tainted) repository history. One merge commit could reintroduce some or all of the tainted history that you just went to the trouble of purging.
Related Links
6.17 - Using Prometheus and Grafana to Monitor K8s
Disclaimer
This post is meant to give a basic end-to-end description for deploying and using Prometheus and Grafana. Both applications offer a wide range of flexibility, which needs to be considered in case you have specific requirements. Such advanced details are not in the scope of this topic.
Introduction
Prometheus is an open-source systems monitoring and alerting toolkit for recording numeric time series. It fits both machine-centric monitoring as well as monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for multi-dimensional data collection and querying is a particular strength.
Prometheus is the second hosted project to graduate within CNCF.
The following characteristics make Prometheus a good match for monitoring Kubernetes clusters:
- Pull-based Monitoring: Prometheus is a pull-based monitoring system, which means that the Prometheus server dynamically discovers and pulls metrics from your services running in Kubernetes.
- Labels: Prometheus and Kubernetes share the same label (key-value) concept that can be used to select objects in the system. Labels are used to identify time series, and sets of label matchers can be used in the query language (PromQL) to select the time series to be aggregated.
- Exporters: There are many exporters available, which enable integration of databases or even other monitoring systems not already providing a way to export metrics to Prometheus. One prominent exporter is the so-called node-exporter, which lets you monitor hardware and OS related metrics of Unix systems.
- Powerful Query Language: The Prometheus query language PromQL lets the user select and aggregate time series data in real time. Results can either be shown as a graph, viewed as tabular data in the Prometheus expression browser, or consumed by external systems via the HTTP API.
Find query examples on Prometheus Query Examples.
One very popular open-source visualization tool not only for Prometheus is Grafana. Grafana is a metric analytics and visualization suite. It is popular for visualizing time series data for infrastructure and application analytics but many use it in other domains including industrial sensors, home automation, weather, and process control. For more information, see the Grafana Documentation.
Grafana accesses data via Data Sources. The continuously growing list of supported backends includes Prometheus.
Dashboards are created by combining panels, e.g., Graph and Dashlist.
In this example, we describe an End-To-End scenario including the deployment of Prometheus and a basic monitoring configuration as the one provided for Kubernetes clusters created by Gardener.
If you miss elements on the Prometheus web page when accessing it via its service URL https://<your K8s FQN>/api/v1/namespaces/<your-prometheus-namespace>/services/prometheus-prometheus-server:80/proxy, this is probably caused by a Prometheus issue - #1583. To work around this issue, set up a port forward kubectl port-forward -n <your-prometheus-namespace> <prometheus-pod> 9090:9090 on your client and access the Prometheus UI from there with your locally installed web browser. This issue is not relevant in case you use the service type LoadBalancer.
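Once Prometheus is running and the port forward is active, the same local port also serves the HTTP API mentioned above. The following is a minimal sketch of an instant query with curl; the function name prom_query and the PROM_URL variable are hypothetical, and the default URL assumes the port forward to local port 9090.

```shell
# Base URL of the Prometheus server; the default assumes the port forward
# described above is running.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query through the Prometheus HTTP API and print
# the JSON response.
prom_query() {
  curl --silent --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Example (not executed here): list the scrape targets that are up.
#   prom_query 'up == 1'
```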
Preparation
The deployment of Prometheus and Grafana is based on Helm charts.
Make sure to implement the Helm settings before deploying the Helm charts.
The Kubernetes clusters provided by Gardener use role based access control (RBAC). To authorize the Prometheus node-exporter to access hardware and OS relevant metrics of your cluster’s worker nodes, specific artifacts need to be deployed.
Bind the Prometheus service account to the garden.sapcloud.io:monitoring:prometheus cluster role by running the command kubectl apply -f crbinding.yaml.
Content of crbinding.yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: <your-prometheus-name>-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: garden.sapcloud.io:monitoring:prometheus
subjects:
- kind: ServiceAccount
  name: <your-prometheus-name>-server
  namespace: <your-prometheus-namespace>
Deployment of Prometheus and Grafana
Only minor changes are needed to deploy Prometheus and Grafana based on Helm charts.
Copy the following configuration into a file called values.yaml and deploy Prometheus:
helm install <your-prometheus-name> --namespace <your-prometheus-namespace> stable/prometheus -f values.yaml
Typically, Prometheus and Grafana are deployed into the same namespace. There is no technical reason behind this, so feel free to choose different namespaces.
Content of values.yaml for Prometheus:
rbac:
  create: false # Already created in Preparation step
nodeExporter:
  enabled: false # The node-exporter is already deployed by default
server:
  global:
    scrape_interval: 30s
    scrape_timeout: 30s
serverFiles:
  prometheus.yml:
    rule_files:
      - /etc/config/rules
      - /etc/config/alerts
    scrape_configs:
    - job_name: 'kube-kubelet'
      honor_labels: false
      scheme: https
      tls_config:
        # This is needed because the kubelets' certificates are not generated
        # for a specific pod IP
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - target_label: __metrics_path__
        replacement: /metrics
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: 'kube-kubelet-cadvisor'
      honor_labels: false
      scheme: https
      tls_config:
        # This is needed because the kubelets' certificates are not generated
        # for a specific pod IP
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # Relabelling allows to configure the actual service scrape endpoint using the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
    # Example scrape config for pods
    #
    # Relabelling allows to configure the actual service scrape endpoint using the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+):(?:\d+);(\d+)
        replacement: ${1}:${2}
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    # to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    # service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    # Add your additional configuration here...
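The address rewrite in the kubernetes-pods job can be hard to read at first. Prometheus joins the values of all source_labels with a semicolon and then matches the regex against the joined string. The following sketch imitates that step with sed to show what the rule does. The sample address and port are hypothetical, and the pattern is adapted because sed -E supports neither \d nor non-capturing groups.

```shell
#!/bin/sh
# Hypothetical sample values for illustration.
address="10.0.0.5:8080"   # __address__ discovered for the pod
port_annotation="9102"    # value of the prometheus.io/port annotation

# Prometheus concatenates the source_labels with ';' before matching.
joined="${address};${port_annotation}"

# sed -E counterpart of regex '(.+):(?:\d+);(\d+)' with replacement '${1}:${2}':
# keep the host part, drop the discovered port, append the annotated port.
rewrite_address() {
  printf '%s\n' "$1" | sed -E 's/^(.+):[0-9]+;([0-9]+)$/\1:\2/'
}

rewrite_address "$joined"   # prints 10.0.0.5:9102
```

The result becomes the new __address__ label, so the pod is scraped on the annotated port rather than on the discovered one.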
Next, deploy Grafana. The deployment in this post relies on the Helm default values, so the settings below are set explicitly in case the defaults change.
Deploy Grafana via helm install grafana --namespace <your-prometheus-namespace> stable/grafana -f values.yaml
. Here, the same namespace is chosen for Prometheus and for Grafana.
Content of values.yaml
for Grafana:
server:
  ingress:
    enabled: false
  service:
    type: ClusterIP
Check the running state of the pods on the Kubernetes Dashboard or by running kubectl get pods -n <your-prometheus-namespace>
. In case of errors, check the log files of the pod(s) in question.
The Helm output after the deployment of Prometheus and Grafana contains useful information, e.g., the user name and password of the Grafana admin user. The credentials are stored as secrets in the namespace <your-prometheus-namespace>
and can be decoded via kubectl get secret --namespace <your-prometheus-namespace> grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
.
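The base64 step is needed because Kubernetes stores the values in the data field of a secret base64-encoded. The sketch below reproduces the round trip locally with a hypothetical sample password, without touching a cluster.

```shell
#!/bin/sh
# Hypothetical password used only for this demonstration.
sample_password="s3cr3t"

# Kubernetes stores Secret values base64-encoded in the .data field.
encoded=$(printf '%s' "$sample_password" | base64)

# Decoding restores the original value, as in the kubectl pipeline above.
decoded=$(printf '%s' "$encoded" | base64 --decode)

echo "encoded: $encoded"
echo "decoded: $decoded"
```

The trailing echo in the kubectl command only adds a line break after the decoded password, which itself does not end with one.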
Basic Functional Tests
To access the web UI of both applications, use port forwarding: port 9090 for Prometheus and port 3000 for Grafana.
Set up port forwarding for port 9090:
kubectl port-forward -n <your-prometheus-namespace> <your-prometheus-server-pod> 9090:9090
Open http://localhost:9090
in your web browser. Select Graph from the top tab and enter the following expression to show the overall CPU usage for a server (see Prometheus Query Examples):
100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))
This should show some data in a graph.
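The same expression can also be sent to the Prometheus HTTP API through the port forward, which is handy for a quick check without a browser. The following is a minimal sketch: the helper names cpu_usage_expr and prom_query are hypothetical, and the query only returns data while the port forward on port 9090 is running.

```shell
#!/bin/sh
# Build the CPU usage expression for a given rate window (default: 5m).
cpu_usage_expr() {
  window="${1:-5m}"
  printf "100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[%s])))" "$window"
}

# Send an instant query to the Prometheus HTTP API on the forwarded port.
prom_query() {
  curl -s --get "http://localhost:9090/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage (requires the port forward from above to be running):
#   prom_query "$(cpu_usage_expr 5m)"
```

The API answers with a JSON document. An empty result list usually means that the metric name does not exist in your setup, which can happen when the node-exporter version exposes the CPU counter under a different name.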
To show the same data in Grafana, set up port forwarding for port 3000 for the Grafana pod and open the Grafana web UI at http://localhost:3000
in a browser. Enter the credentials of the admin user.
Next, you need to enter the server name of your Prometheus deployment. This name is shown directly after the installation via Helm.
Run
helm status <your-prometheus-name>
to find this name. Below, this server name is referenced by <your-prometheus-server-name>
.
First, you need to add your Prometheus server as data source:
- Navigate to Dashboards → Data Sources
- Choose Add data source
- Enter:
  Name: <your-prometheus-datasource-name>
  Type: Prometheus
  URL: http://<your-prometheus-server-name>
  Access: proxy
- Choose Save & Test
In case of failure, check the Prometheus URL in the Kubernetes Dashboard.
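As an alternative to the UI steps, the data source could presumably also be created through the Grafana HTTP API. The sketch below builds the request body with the same fields as above. The names my-prometheus and my-prometheus-server are hypothetical, and the user name admin and the GRAFANA_PASSWORD variable are assumptions: use the credentials from the Helm output.

```shell
#!/bin/sh
# Hypothetical example values; replace them with your own.
DATASOURCE_NAME="my-prometheus"
PROM_SERVER_NAME="my-prometheus-server"

# Same fields as entered in the Grafana UI above.
cat > datasource.json <<EOF
{
  "name": "${DATASOURCE_NAME}",
  "type": "prometheus",
  "url": "http://${PROM_SERVER_NAME}",
  "access": "proxy"
}
EOF

# Post the payload to Grafana through the port forward on port 3000.
# GRAFANA_PASSWORD is assumed to hold the decoded admin password.
create_datasource() {
  curl -s -u "admin:${GRAFANA_PASSWORD}" \
    -H "Content-Type: application/json" \
    -X POST "http://localhost:3000/api/datasources" \
    -d @datasource.json
}

# Usage (requires the Grafana port forward to be running):
#   GRAFANA_PASSWORD=<decoded-password> create_datasource
```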
To add a graph, follow these steps:
- In the left corner, select Dashboards → New to create a new dashboard
- Select Graph to create a new graph
- Next, select the Panel Title → Edit
- Select your Prometheus Data Source in the drop down list
- Enter the expression
100 * (1 - avg by(instance)(irate(node_cpu{mode='idle'}[5m])))
in the entry field A
- Select the floppy disk symbol (Save) on top
Now you should have a very basic Prometheus and Grafana setup for your Kubernetes cluster.
As a next step, you can add monitoring for your own applications by exposing metrics with the Prometheus client API.
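Once an application exposes metrics, the kubernetes-pods job from the values.yaml above picks it up through annotations. The following sketch writes a pod manifest with those annotations. The pod name, image, and port are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical application name, image, and port; replace with your own.
cat > annotated-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"   # picked up by the 'kubernetes-pods' job
    prometheus.io/path: "/metrics" # only needed if the path differs
    prometheus.io/port: "8080"     # port that serves the metrics
spec:
  containers:
  - name: my-app
    image: my-registry/my-app:latest
    ports:
    - containerPort: 8080
EOF

# Then create the pod with:
#   kubectl apply -f annotated-pod.yaml
```

Annotation values have to be strings, which is why true and 8080 are quoted. The container port is declared so that the discovered address contains a port for the relabeling rule to replace.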