4 minute read  

Trouble Shooting Guide

Are there really issue that cannot be fixed :O?

Well, of course not :P. With continuous development of Gardener, over the time its architecture and API might have to be changed to reduce complexity and support more features. In this process developers are bound to keep Gardener version backward compatible with last two releases. But maintaining backward compatibility is quite complex and effortful tasks. So, to save short term complex effort, its common practice in open source community to use work around or hacky solutions sometimes. This results in rare issues which are supposed to be resolved by human interaction across upgrades of Gardener version.

This guide records the issues that are quite possible across upgrade of Gardener version, root cause and the human action required for graceful resolution of issue. For troubleshooting guide of bugs which are not yet fixed, please refer the associated github issue.

Note To Maintainers: Please use only mention the resolution of issues which are by design. For bugs please report the temporary resolution on github issue create for the bug.

Etcd-Main pod fails to come up, since backup-restore sidecar is reporting RevisionConsistencyCheckErr

Issue

  • Etcd-main pod goes in CrashLoopBackoff.
  • Etcd-backup-restore sidecar reports validation error with RevisionConsistencyCheckErr.

Environment

  • Gardener version: 0.29.0+

Root Cause

  • From version 0.29.0, Gardener uses shared backup bucket for storing etcd backups, replacing old logic of having single bucket per shoot as per proposal.
  • Since there are very rare chances that etcd data directory will get corrupt, while doing this migration, to avoid etcd down time and implementation effort, we decided to switch directly from old bucket to new shared bucket without migrating old snapshot from old bucket to new bucket.
  • In this case just for safety side we added sanity check in etcd-backup-restore sidecar of etcd-main pod, which checks if etcd data revision is greater than the last snapshot revision from old bucket.
  • If above check fails mean there is surely some data corruption occurred with etcd, so etcd-backup-restore reports error and then etcd-main pod goes in CrashLoopBackoff creating etcd-main down alerts.

Action

  1. Disable the Gardener reconciliation for Shoot by annotating it with shoot.gardener.cloud/ignore=true
  2. Scale down the etcd-main statefulset in seed cluster.
  3. Find out the latest full snapshot and delta snapshot from old backup bucket. The old backup bucket name is same as the backupInfra resource associated with Shoot in Garden cluster.
  4. Move them manually to new backup bucket.
  5. Enable the Gardener reconciliation for shoot by removing annotation shoot.gardener.cloud/ignore=true.

After upgrading/restarting a local Gardener setup, the DNSEntries on the seeds show the error “… already busy for owner …”

Issue

  • custom resources DNSEntries on the seeds show the error “dns name “api.myshoot.mygarden.internal.dev.k8s.ondemand.com” already busy for owner “seed.gardener.cloud/a1234567-XXXX-XXXX-XXXX-025000000001/aws”
  • API server is not available via DNS name

Environment

  • Gardener version: 0.20.0+

Root Cause

DNS records created by Gardener’s dns-controller-manager are stored together with meta data, especially with an owner identifier. In this way the dns-controller-manager knows which records belong to it. It never changes records which are not owned by it. The owner identifier is unique for every seed and computed from the Gardener identity and the seed identity. The Gardener identity is the UUID of the garden namespace of the Gardener cluster. Especially if you have a local Gardener setup, there are situations where the Kubernetes cluster and therefore the garden namespace have to be recreated. For example, on updating docker-desktop all containers may have been deleted and are recreated.

Action

On each seed, you have to tell the dns-controller-manager, that it is also responsible for secondary owner identifiers. For this purpose create a custom resource DNSOwner and set the attribute ownerId to the old owner identifier shown in the error message of the DNS entries, e.g.

apiVersion: dns.gardener.cloud/v1alpha1
kind: DNSOwner
metadata:
  name: old-owner
  namespace: default
spec:
  ownerId: seed.gardener.cloud/a1234567-XXXX-XXXX-XXXX-025000000001/aws
  active: true

Currently the dns-controller-manager has to be restarted (i.e. delete its current pod) to make it known of the secondary owner identifier.