# Preservation of Machines

## Objective
Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase. Failed machines are swiftly moved to the `Terminating` phase, during which the node is drained and the Machine object is deleted. This rapid cleanup prevents SREs/operators/support from conducting an analysis on the VM and makes finding the root cause of the failure more difficult.

Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node.
This document proposes enhancing MCM, such that:
- VMs of machines are retained temporarily for analysis.
- There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation).
- There is a configurable limit to the duration for which machines are preserved.
- Users can specify which healthy machines they would like to preserve in case of failure, or for diagnosis in their current state (preventing scale-down by CA).
- Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either the `Running` or the `Terminating` phase, as the case may be.
Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008
## Proposal
In order to achieve the objectives mentioned, the following changes are proposed:

- Enhance the `worker` configuration in the Shoot spec to specify the maximum number of failed machines that will be auto-preserved and the duration for which machines will be preserved.

  ```yaml
  workers:
    - name: example-worker
      autoPreserveFailedMachineMax: 2
      machineControllerManager:
        machinePreserveTimeout: 72h
  ```

  - This configuration will be set per worker pool.
  - Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMachineMax` will be distributed across the N MachineDeployments. `autoPreserveFailedMachineMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
    - Example: if `autoPreserveFailedMachineMax` is set to 2 and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
- MCM will be modified to include a new sub-phase, `Preserved`, to indicate that the machine has been preserved by MCM.
- Allow a user/operator to request preservation of a specific machine/node with the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
- When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - The machine's phase is changed to `Running:Preserved`.
  - After the timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotations are deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the machine phase is changed back to `Running`; the CA may then delete the node.
  - If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
- When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
  - Pods (other than DaemonSet pods) are drained.
  - The machine phase is changed to `Failed:Preserved`.
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - After the timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the phase is changed to `Terminating`.
- When an un-annotated machine goes to the `Failed` phase and `autoPreserveFailedMachineMax` is not breached:
  - Pods (other than DaemonSet pods) are drained.
  - The machine's phase is changed to `Failed:Preserved`.
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - After the timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the phase is changed to `Terminating`.
- The number of machines in the `Failed:Preserved` phase counts towards enforcing `autoPreserveFailedMachineMax`.
- A user/operator can request MCM to stop preserving a machine/node in the `Running:Preserved` or `Failed:Preserved` phase by deleting the `node.machine.sapcloud.io/preserve` annotation.
  - MCM will move such a machine either to the `Running` phase or to `Terminating`, depending on the phase of the machine before it was preserved.
- Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of the maximum number of machines allowed for the MachineDeployment.
- MCM will be modified to perform the drain in the `Failed` phase for preserved machines.
### State Diagrams
- State diagram for when a machine or its node is explicitly annotated for preservation:

  ```mermaid
  stateDiagram-v2
      state "Running" as R
      state "Running + Requested" as RR
      state "Running:Preserved" as RP
      state "Failed
      (node drained)" as F
      state "Failed:Preserved" as FP
      state "Terminating" as T
      [*] --> R
      R --> RR: annotated with preserve=when-failed
      RP --> R: on timeout or preserve=false
      RR --> F: on failure
      F --> FP
      FP --> T: on timeout or preserve=false
      FP --> R: if node Healthy before timeout
      T --> [*]
      R --> RP: annotated with preserve=now
      RP --> F: if node/VM not healthy
  ```

- State diagram for when an un-annotated `Running` machine fails (auto-preservation):

  ```mermaid
  stateDiagram-v2
      state "Running" as R
      state "Failed
      (node drained)" as F
      state "Failed:Preserved" as FP
      state "Terminating" as T
      [*] --> R
      R --> F: on failure
      F --> FP: if autoPreserveFailedMachineMax not breached
      F --> T: if autoPreserveFailedMachineMax breached
      FP --> T: on timeout or preserve=false
      FP --> R: if node Healthy before timeout
      T --> [*]
  ```
## Use Cases

### Use Case 1: Preservation Request for Analysing a Running Machine

Scenario: Workload on the machine is failing; the operator wishes to diagnose it.

#### Steps

- Operator annotates the node with `node.machine.sapcloud.io/preserve=now`.
- MCM preserves the machine and prevents CA from scaling it down.
- Operator analyzes the VM.
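The annotation step could look like the following (the node name is hypothetical; the annotation key is the one defined in this proposal):

```shell
# Ask MCM to preserve this node's machine immediately.
kubectl annotate node ip-10-0-0-1.example.internal \
  node.machine.sapcloud.io/preserve=now
```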
### Use Case 2: Proactive Preservation Request

Scenario: Operator suspects a machine might fail and wants to ensure preservation for analysis.

#### Steps

- Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`.
- The machine fails later.
- MCM preserves the machine.
- Operator analyzes the VM.
### Use Case 3: Auto-Preservation of a Failed Machine, Aiding Failure Analysis and Recovery

Scenario: Machine fails unexpectedly, with no prior annotation.

#### Steps

- Machine transitions to the `Failed` phase.
- Machine is drained.
- If `autoPreserveFailedMachineMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM.
- After `machinePreserveTimeout`, the machine is terminated by MCM.
- If the machine is brought back to the `Running` phase before the timeout, pods can be scheduled on it again.
### Use Case 4: Early Release

Scenario: Operator has completed their analysis and no longer requires the machine to be preserved.

#### Steps

- Machine is in the `Running:Preserved` or `Failed:Preserved` phase.
- Operator removes `node.machine.sapcloud.io/preserve` from the node.
- MCM transitions the machine to `Running` or `Terminating` (for `Running:Preserved` or `Failed:Preserved`, respectively), even though `machinePreserveTimeout` has not expired.
- If the machine was auto-preserved, capacity becomes available for auto-preservation of another machine.
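Removing the annotation could look like the following (the node name is hypothetical; the trailing `-` is standard `kubectl annotate` syntax for deleting an annotation):

```shell
# Release the machine early by removing the preserve annotation.
kubectl annotate node ip-10-0-0-1.example.internal \
  node.machine.sapcloud.io/preserve-
```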
## Points to Note
- During rolling updates, MCM will NOT honour preservation of Machines: a Machine that moves to the `Failed` phase will be replaced with a healthy one.
- Hibernation policy will override machine preservation.
- Consumers (with access to shoot cluster) can annotate Nodes they would like to preserve.
- Operators (with access to control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
- If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
- However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value.
- If `autoPreserveFailedMachineMax` is reduced in the Shoot spec, older machines are moved to the `Terminating` phase before newer ones.
- In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
- On an increase/decrease of `machinePreserveTimeout`, the new value will only apply to machines that go into the `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
- Once the feature is developed, modify the CA FAQ to suggest `node.machine.sapcloud.io/preserve=now` instead of the currently suggested `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`. This would:
  - harmonise the machine flow
  - shield users from CA's internals
  - make the annotation generic and no longer CA-specific
  - allow a timeout to be specified