# Preservation of Machines

## Objective
Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase and, after the configured `machineHealthTimeout`, to the `Failed` phase. Failed machines are swiftly moved to the `Terminating` phase, during which the node is drained and the Machine object is deleted. This rapid cleanup prevents SREs/operators/support from conducting an analysis on the VM and makes finding the root cause of the failure more difficult.

Moreover, in cases where a node seems healthy but all the workloads on it are facing issues, operators need to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler (CA) scaling down the node.
This document proposes enhancing MCM, such that:
- VMs of machines are retained temporarily for analysis.
- There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation).
- There is a configurable limit to the duration for which machines are preserved.
- Users can specify which healthy machines they would like to preserve in case of failure, or for diagnosis in their current state (preventing scale-down by CA).
- Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either the `Running` or the `Terminating` phase, as the case may be.
Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008
## Proposal
In order to achieve the objectives mentioned, the following changes are proposed:

- Enhance the `worker` configuration in the Shoot spec to specify the maximum number of failed machines that will be auto-preserved and the duration for which machines will be preserved.

  ```yaml
  workers:
    - name: example-worker
      autoPreserveFailedMachineMax: 2
      machineControllerManager:
        machinePreserveTimeout: 72h
  ```

  - This configuration will be set per worker pool.
  - Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMachineMax` will be distributed across the N MachineDeployments. `autoPreserveFailedMachineMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
    - Example: if `autoPreserveFailedMachineMax` is set to 2 and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
- MCM will be modified to include a new sub-phase, `Preserved`, to indicate that the machine has been preserved by MCM.
- Allow a user/operator to request preservation of a specific machine/node with the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
- When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - The machine's phase is changed to `Running:Preserved`.
  - After the timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotations are deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the machine phase is changed back to `Running`; the CA may then delete the node.
  - If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
- When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
  - Pods (other than DaemonSet pods) are drained.
  - The machine phase is changed to `Failed:Preserved`.
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - After the timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the phase is changed to `Terminating`.
- When an un-annotated machine goes to the `Failed` phase and `autoPreserveFailedMachineMax` is not breached:
  - Pods (other than DaemonSet pods) are drained.
  - The machine's phase is changed to `Failed:Preserved`.
  - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
  - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as `machine.CurrentStatus.PreserveExpiryTime = currentTime + machinePreserveTimeout`.
  - After the timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted, `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`, and the phase is changed to `Terminating`.
- The number of machines in the `Failed:Preserved` phase counts towards enforcing `autoPreserveFailedMachineMax`.
- A user/operator can request MCM to stop preserving a machine/node in the `Running:Preserved` or `Failed:Preserved` phase by deleting the `node.machine.sapcloud.io/preserve` annotation.
  - MCM will move such a machine either to the `Running` phase or to `Terminating`, depending on the phase of the machine before it was preserved.
- Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of the maximum number of machines allowed for the MachineDeployment.
- MCM will be modified to perform the drain in the `Failed` phase for preserved machines.
### State Diagrams
- State diagram for when a machine or its node is explicitly annotated for preservation:

  ```mermaid
  stateDiagram-v2
      state "Running" as R
      state "Running + Requested" as RR
      state "Running:Preserved" as RP
      state "Failed
      (node drained)" as F
      state "Failed:Preserved" as FP
      state "Terminating" as T
      [*] --> R
      R --> RR: annotated with preserve=when-failed
      RP --> R: on timeout or preserve=false
      RR --> F: on failure
      F --> FP
      FP --> T: on timeout or preserve=false
      FP --> R: if node Healthy before timeout
      T --> [*]
      R --> RP: annotated with preserve=now
      RP --> F: if node/VM not healthy
  ```

- State diagram for when an un-annotated `Running` machine fails (auto-preservation):

  ```mermaid
  stateDiagram-v2
      state "Running" as R
      state "Failed
      (node drained)" as F
      state "Failed:Preserved" as FP
      state "Terminating" as T
      [*] --> R
      R --> F: on failure
      F --> FP: if autoPreserveFailedMachineMax not breached
      F --> T: if autoPreserveFailedMachineMax breached
      FP --> T: on timeout or preserve=false
      FP --> R: if node Healthy before timeout
      T --> [*]
  ```
## Use Cases

### Use Case 1: Preservation Request for Analysing a Running Machine

Scenario: Workload on the machine is failing; the operator wishes to diagnose it.

#### Steps

- Operator annotates the node with `node.machine.sapcloud.io/preserve=now`.
- MCM preserves the machine and prevents CA from scaling it down.
- Operator analyzes the VM.
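The annotation step could look like the following (the node name is hypothetical; the annotation key is the one defined in this proposal):

```shell
# Ask MCM to preserve this node's machine immediately.
kubectl annotate node ip-10-0-0-1.example.internal \
  node.machine.sapcloud.io/preserve=now
```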
### Use Case 2: Proactive Preservation Request

Scenario: Operator suspects a machine might fail and wants to ensure preservation for analysis.

#### Steps

- Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`.
- The machine fails later.
- MCM preserves the machine.
- Operator analyzes the VM.
### Use Case 3: Auto-Preservation of a Failed Machine, Aiding Failure Analysis and Recovery

Scenario: Machine fails unexpectedly, with no prior annotation.

#### Steps

- Machine transitions to the `Failed` phase.
- Machine is drained.
- If `autoPreserveFailedMachineMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM.
- After `machinePreserveTimeout`, the machine is terminated by MCM.
- If the machine is brought back to the `Running` phase before the timeout, pods can be scheduled on it again.
### Use Case 4: Early Release

Scenario: Operator has completed their analysis and no longer requires the machine to be preserved.

#### Steps

- Machine is in the `Running:Preserved` or `Failed:Preserved` phase.
- Operator removes `node.machine.sapcloud.io/preserve` from the node.
- MCM transitions the machine to `Running` or `Terminating` (for `Running:Preserved` or `Failed:Preserved`, respectively), even though `machinePreserveTimeout` has not expired.
- If the machine was auto-preserved, capacity becomes available for auto-preservation of another machine.
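Removing the annotation could look like the following (the node name is hypothetical; the trailing `-` is standard `kubectl annotate` syntax for deleting an annotation):

```shell
# Release the machine early by removing the preserve annotation.
kubectl annotate node ip-10-0-0-1.example.internal \
  node.machine.sapcloud.io/preserve-
```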
## Points to Note
- During rolling updates, MCM will NOT honour preservation of Machines: a Machine that moves to the `Failed` phase will be replaced with a healthy one.
- Hibernation policy will override machine preservation.
- Consumers (with access to shoot cluster) can annotate Nodes they would like to preserve.
- Operators (with access to control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
- If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
- However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value.
- If `autoPreserveFailedMachineMax` is reduced in the Shoot spec, older machines are moved to the `Terminating` phase before newer ones.
- In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
- On an increase/decrease of `machinePreserveTimeout`, the new value will only apply to machines that go into the `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
- Once the feature is developed, modify the CA FAQ to suggest `node.machine.sapcloud.io/preserve=now` instead of the currently suggested `cluster-autoscaler.kubernetes.io/scale-down-disabled=true`. This would:
  - harmonise the machine flow
  - shield users from CA's internals
  - make the annotation generic and no longer CA-specific
  - allow a timeout to be specified