그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그
14 minute read
Shoot Control Plane Migration “Bad Case” Scenario
The migration flow described as part of GEP-7 can only be executed if both the Garden cluster and source seed cluster are healthy, and
gardenlet in the source seed cluster can connect to the Garden cluster. In this case,
gardenlet can directly scale down the shoot’s control plane in the source seed, after checking the
However, there might be situations in which
gardenlet in the source seed cluster can’t connect to the Garden cluster and determine that
spec.seedName has changed. Similarly, the connection to the seed
kube-apiserver could also be broken. This might be caused by issues with the seed cluster itself. In other situations, the migration flow steps in the source seed might have started but might not be able to finish successfully. In all such cases, it should still be possible to migrate a shoot’s control plane to a different seed, even though executing the migration flow steps in the source seed might not be possible. The potential “split brain” situation caused by having the shoot’s control plane components attempting to reconcile the shoot resources in two different seeds must still be avoided, by ensuring that the shoot’s control plane in the source seed is deactivated before it is activated in the destination seed.
The mechanisms and adaptations described below have been tested as part of a PoC prior to describing them here.
Owner Election / Copying Snapshots
To achieve the goals outlined above, an “owner election” (or rather, “ownership passing”) mechanism is introduced to ensure that the source and destination seeds are able to successfully negotiate a single “owner” during the migration. This mechanism is based on special owner DNS records that uniquely identify the seed that currently hosts the shoot’s control plane (“owns” the shoot).
For example, for a shoot named
i500152-gcp in project
dev that uses an internal domain suffix
internal.dev.k8s.ondemand.com and is scheduled on a seed with an identity
shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev, the owner DNS record is a TXT record with a domain name
owner.i500152-gcp.dev.internal.dev.k8s.ondemand.com and a single value
shoot--i500152--gcp2-0841c87f-8db9-4d04-a603-35570da6341f-sap-landscape-dev. The owner DNS record is created and maintained by reconciling an
owner DNSRecord resource.
Unlike other extension resources, the
owner DNSRecord resource is not reconciled every time the shoot is reconciled, but only when the resource is created. Therefore, the owner DNS record value (the owner ID) is updated only when the shoot is migrated to a different seed. For more information, see Add handling of owner DNSRecord resources.
The owner DNS record domain name and owner ID are passed to components that need to perform ownership checks, such as the
backup-restore container of the
etcd-main StatefulSet, and all extension controllers. These components then check regularly whether the actual owner ID (the value of the record) matches the passed ID. If they don’t, the ownership check is considered failed, which causes the special behavior described below.
Note: A previous revision of this document proposed using “sync objects” written to and read from the backup container of the source seed as JSON files by the
etcd-backup-restore processes in both seeds. With the introduction of owner DNS records such sync objects are no longer needed.
For the destination seed to actually become the owner, it needs to acquire the shoot’s etcd data by copying the final full snapshot (and potentially also older snapshots) from the backup container of the source seed.
The mechanism to copy the snapshots and pass the ownership from the source to the destination seed consists of the following steps:
The reconciliation flow (“restore” phase) is triggered in the destination seed without first executing the migration flow in the source seed (or perhaps it was executed, but it failed, and its state is currently unknown).
ownerDNSRecord resource is created in the destination seed. As a result, the actual owner DNS record is updated with the destination seed ID. From this point, ownership checks by the
etcd-backup-restoreprocess and extension controller watchdogs in the source seed will fail, which will cause the special behavior described below.
An additional “source” backup entry referencing the source seed backup bucket is deployed to the Garden cluster and the destination seed and reconciled by the backup entry controller. As a result, a secret with the appropriate credentials for accessing the source seed backup container named
source-etcd-backupis created in the destination seed. The normal backup entry (referencing the destination seed backup container) is also deployed and reconciled, as usual, resulting in the usual
etcd-backupsecret being created.
A special “copy” version of the
etcd-mainEtcd resource is deployed to the destination seed. In its
backupsection, this resource contains a
sourceStorein addition to the usual
store, which contains the parameters needed to use the source seed backup container, such as its name and the secret created in the previous step.
spec: backup: ... store: container: 408740b8-6491-415e-98e6-76e92e5956ac secretRef: name: etcd-backup ... sourceStore: container: d1435fea-cd5e-4d5b-a198-81f4025454ff secretRef: name: source-etcd-backup ...
etcd-druidin the destination seed reconciles the above resource by deploying a
etcd-copyJob that contains a single
backup-restorecontainer. It executes the newly introduced
etcd-backup-restorethat copies the snapshots from the source to the destination backup container.
Before starting the copy itself, the
etcd-backup-restoreprocess in the destination seed checks if a final full snapshot (a full snapshot marked as
final=true) exists in the backup container. If such a snapshot is not found, it waits for it to appear in order to proceed. This waiting is up to a certain timeout that should be sufficient for a full snapshot to be taken; after this timeout has elapsed, it proceeds anyway, and the reconciliation flow continues from step 9. As described in Handling Inability to Access the Backup Container below, this is safe to do.
etcd-backup-restoreprocess in the source seed detects that the owner ID in the owner DNS record is different from the expected owner ID (because it was updated in step 2) and switches to a special “final snapshot” mode. In this mode the regular snapshotter is stopped, the readiness probe of the main
etcdcontainer starts returning 503, and one final full snapshot is taken. This snapshot is marked as
final=truein order to ensure that it’s only taken once, and in order to enable the
etcd-backup-restoreprocess in the destination seed to find it (see step 6).
Note: While testing our PoC, we noticed that simply making the readiness probe of the main
etcdcontainer fail doesn’t terminate the existing open connections from
etcd. For this to happen, either the
etcdprocess has to be restarted at least once. Therefore, when the snapshotter is stopped because an ownership change has been detected, the main
etcdprocess is killed (using
SIGTERMto allow graceful termination) to ensure that any open connections from
kube-apiserverare terminated. For this to work, the 2 containers must share the process namespace.
kube-apiserverprocess in the source seed is no longer able to connect to
etcd, all shoot control plane controllers (
machine-controller-manager, etc.) and extension controllers reconciling shoot resources in the source seed that require a connection to the shoot in order to work start failing. All remaining extension controllers are prevented from reconciling shoot resources via the watchdogs mechanism. At this point, the source seed has effectively lost its ownership of the shoot, and it is safe for the destination seed to assume the ownership.
etcd-backup-restoreprocess in the destination seed detects that a final full snapshot exists, it copies all snapshots (or a subset of all snapshots) from the source to the destination backup container. When this is done, the Job finishes successfully which signals to the reconciliation flow that the snapshots have been copied.
Note: To save time, only the final full snapshot taken in step 6, or a subset defined by some criteria, could be copied, instead of all snapshots.
The special “copy” version of the
etcd-mainEtcd resource is deleted from the source seed, and as a result the
etcd-copyJob is also deleted by
The additional “source” backup entry referencing the source seed backup container is deleted from the Garden cluster and the destination seed. As a result, its corresponding
source-etcd-backupsecret is also deleted from the destination seed.
From this point, the reconciliation flow proceeds as already described in GEP-7. This is safe, since the source seed cluster is no longer able to interfere with the shoot.
Handling Inability to Access the Backup Container
The mechanism described above assumes that the
etcd-backup-restore process in the source seed is able to access its backup container in order to take snapshots. If this is not the case, but an ownership change was detected, the
etcd-backup-restore process still sets the readiness probe status of the main
etcd container to 503, and kills the main
etcd process as described above to ensure that any open connections from
kube-apiserver are terminated. This effectively deactivates the source seed control plane to ensure that the ownership of the shoot can be passed to a different seed.
Because of this,
etcd-backup-restore process in the destination seed responsible for copying the snapshots can avoid waiting forever for a final full snapshot to appear. Instead, after a certain timeout has elapsed, it can proceed with the copying. In this situation, whatever latest snapshot is found in the source backup container will be restored in the destination seed. The shoot is still migrated to a healthy seed at the cost of losing the etcd data that accumulated between the point in time when the connection to the source backup container was lost, and the point in time when the source seed cluster was deactivated.
When the connection to the backup container is restored in the source seed, a final full snapshot will be eventually taken. Depending on the stage of the restoration flow in the destination seed, this snapshot may be copied to the destination seed and restored, or it may simply be ignored since the snapshots have already been copied.
Handling Inability to Resolve the Owner DNS Record
The situation when the owner DNS record cannot be resolved is treated similarly to a failed ownership check: the
etcd-backup-restore process sets the readiness probe status of the main
etcd container to 503, and kills the main
etcd process as described above to ensure that any open connections from
kube-apiserver are terminated, effectively deactivating the source seed control plane. The final full snapshot is not taken in this case to ensure that the control plane can be re-activated if needed.
When the owner DNS record can be resolved again, the following 2 situations are possible:
- If the source seed is still the owner of the shoot, the
etcd-backup-restoreprocess will set the readiness probe status of the main
etcdcontainer to 200, so
kube-apiserverwill be able to connect to
etcdand the source seed control plane will be activated again.
- If the source seed is no longer the owner of the shoot, the etcd readiness probe will continue to fail, and the source seed control plane will remain inactive. In addition, the final full snapshot will be taken at this time, for the same reason as described in Handling Inability to Access the Backup Container.
Note: We expect that actual DNS outages are extremely unlikely. A more likely reason for an inability to resolve a DNS record could be network issues with the underlying infrastructure. In such cases, the shoot would usually not be usable / reachable anyway, so deactivating its control plane would not cause a worse outage.
Migration Flow Adaptations
Certain changes to the migration flow are needed in order to ensure that it is compatible with the owner election mechanism described above. Instead of taking a full snapshot of the source seed etcd, the flow deletes the owner DNS record by deleting the
owner DNSRecord resource. This causes the ownership check by
etcd-backup-restore to fail, and the final full snapshot to be eventually taken, so the migration flow waits for a final full snapshot to appear as the last step before deleting the shoot namespace in the source seed. This ensures that the reconciliation flow described above will find a final full snapshot waiting to be copied at step 6.
Checking for the final full snapshot is performed by calling the already existing
snapshot/latest. This is possible, since the
backup-restore container is always running at this point.
After the final full snapshot has been taken, the readiness probe of the main
etcd container starts failing, which means that if the migration flow is retried due to an error it must skip the step that waits for
etcd-main to become ready. To determine if this is the case, a check whether the final full snapshot has been taken or not is performed by calling the same
etcd-backup-restore endpoint, e.g.
snapshot/latest. This is possible if the
etcd-main Etcd resource exists with non-zero replicas. Otherwise:
- If the resource doesn’t exist, it must have been already deleted, so the final full snapshot n must have been already taken.
- If it exists with zero replicas, the shoot must be hibernated, and the migration flow must have never been executed (since it scales up etcd as one of its first steps), so the final full snapshot must not have been taken yet.
Extension Controller Watchdogs
Some extension controllers will stop reconciling shoot resources after the connection to the shoot’s
kube-apiserver is lost. Others, most notably the infrastructure controller, will not be affected. Even though new shoot reconciliations won’t be performed by
gardenlet, such extension controllers might be stuck in a retry loop triggered by a previous reconciliation, which may cause them to reconcile their resources after
gardenlet has already stopped reconciling the shoot. In addition, a reconciliation started when the seed still owned the shoot might take some time and therefore might still be running after the ownership has changed. To ensure that the source seed is completely deactivated, an additional safety mechanism is needed.
This mechanism should handle the following interesting cases:
gardenletcannot connect to the Garden
kube-apiserver. In this case it cannot fetch shoots and therefore does not know if control plane migration has been triggered. Even though
gardenletwill not trigger new reconciliations, extension controllers could still attempt to reconcile their resources if they are stuck a retry loop from a previous reconciliation, and already running reconciliations will not be stopped.
gardenletcannot connect to the seed’s
kube-apiserver. In this case
gardenletknows if migration has been triggered, but it will not start shoot migration or reconciliation as it will first check the seed conditions and try to update the
Clusterresource, both of which will fail. Extension controllers could still be able to connect to the seed’s
kube-apiserver(if they are not running where
gardenletis running), and similarly to the previous case, they could still attempt to reconcile their resources.
- The seed components (
etcd-druid, extension controllers, etc) cannot connect to the seed’s
kube-apiserver. In this case extension controllers would not be able to reconcile their resources as they cannot fetch them from the seed’s
kube-apiserver. When the connection to the
kube-apiservercomes back, the controllers might be stuck in a retry loop from a previous reconciliation, or the resources could still be annotated with
gardener.cloud/operation=reconcile. This could lead to a race condition depending on who manages to
getthe resources first. If
gardenletmanages to update the resources before they are read by the extension controllers, they would be properly updated with
gardener.cloud/operation=migrate. Otherwise, they would be reconciled as usual.
Note: A previous revision of this document proposed using “cluster leases” as such an additional safety mechanism. With the introduction of owner DNS records cluster leases are no longer needed.
The safety mechanism is based on extension controller watchdogs. These are simply additional goroutines that are started when a reconciliation is started by an extension controller. These goroutines perform an ownership check on a regular basis using the owner DNS record, similar to the check performed by the
etcd-backup-restore process described above. If the check fails, the watchdog cancels the reconciliation context, which immediately aborts the reconciliation.
dns-external extension controller is the only extension controller that neither needs the shoot’s
kube-apiserver, nor uses the watchdog mechanism described here. Therefore, this controller will continue reconciling
DNSEntry resources even after the source seed has lost the ownership of the shoot. With the PoC, we manually delete the
DNSOwner resources from the source seed cluster to prevent this from happening. Eventually, the
dns-external controller should be adapted to use the owner DNS records to ensure that it disables itself after the seed has lost the ownership of the shoot. Changes in this direction have already been agreed and relevant PRs proposed.