The answers in this FAQ apply to the newest (HEAD) version of Machine Controller Manager. If you’re using an older version of MCM, please refer to the corresponding version of this document. A few of the answers assume that MCM is being used in conjunction with cluster-autoscaler.

Machine Controller Manager, aka MCM, is a set of controllers used for the lifecycle management of the worker machines. It reconciles a set of CRDs such as Machine, MachineSet, and MachineDeployment, which mirror the functionality of Pod, ReplicaSet, and Deployment of core Kubernetes respectively. Read more about it at README.
A machine is generally deleted by MCM for one of two reasons:
- The machine has been unhealthy for at least the MachineHealthTimeout period. The default MachineHealthTimeout is 10 minutes. A machine is treated as unhealthy when any of the node conditions DiskPressure, KernelDeadlock, FileSystem, Readonly is set to true, or KubeletReady is set to false. However, this is something that is configurable using the following flag.
- The machine is scaled down via the MachineDeployment resource, for example when cluster-autoscaler scales down the MachineDeployment. Read more about cluster-autoscaler’s scale down behavior here.

MCM mainly contains the following sub-controllers:
- MachineDeployment Controller: Responsible for reconciling the MachineDeployment objects. It manages the lifecycle of the MachineSet objects.
- MachineSet Controller: Responsible for reconciling the MachineSet objects. It manages the lifecycle of the Machine objects.
- Machine Controller: Responsible for reconciling the Machine objects. It manages the lifecycle of the actual VMs/machines created in cloud/on-prem. This controller has been moved out of tree. Please refer to an AWS machine controller for more info - link.

Safety Controller contains the following functions:
- It lists the VMs in the cloud that match the tag of the given cluster name and maps the VMs to the machine objects using the ProviderID field. VMs without any backing machine objects are logged and deleted after confirmation.
- Safety Controller freezes the MachineDeployment and MachineSet controllers if the number of machine objects goes beyond a certain threshold on top of Spec.Replicas. It can be configured by the flags --safety-up or --safety-down, and also machine-safety-overshooting-period.
- Safety Controller freezes the functionality of the MCM if either the target-apiserver or the control-apiserver is not reachable.
- Safety Controller unfreezes the MCM automatically once the situation returns to normal. A freeze label is applied on the MachineDeployment/MachineSet to enforce the freeze condition.

MCM can be installed in a cluster with the following steps:
- The control cluster is where the machine-* objects are stored. The target cluster is where all the node objects are registered.

MCM allows configuring the rollout of the worker machines using the maxSurge and maxUnavailable fields. These fields are applicable only during the rollout process and mean nothing in general scale up/down scenarios.
The overall process is very similar to how the Deployment Controller manages pods during RollingUpdate.
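As a rough illustration of how maxSurge and maxUnavailable bound a rollout, the sketch below computes the smallest and largest machine counts a MachineDeployment could have mid-rollout. It assumes both fields are plain integers (percentage values are not handled), and the function name rollout_bounds is hypothetical, not part of MCM.

```python
from typing import Tuple


def rollout_bounds(replicas: int, max_surge: int, max_unavailable: int) -> Tuple[int, int]:
    """Return (minimum available machines, maximum total machines) during a rollout.

    Assumption: max_surge and max_unavailable are absolute counts, not percentages.
    """
    if min(replicas, max_surge, max_unavailable) < 0:
        raise ValueError("all inputs must be non-negative")
    # At most max_unavailable machines may be missing from Spec.Replicas.
    min_available = max(replicas - max_unavailable, 0)
    # At most max_surge machines may exist on top of Spec.Replicas.
    max_total = replicas + max_surge
    return min_available, max_total


print(rollout_bounds(replicas=5, max_surge=1, max_unavailable=1))  # (4, 6)
```

With 5 replicas, a surge of 1 and 1 unavailable, the rollout would keep between 4 and 6 machines at any time.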
- maxSurge refers to the number of additional machines that can be added on top of the Spec.Replicas of the MachineDeployment during the rollout process.
- maxUnavailable refers to the number of machines that can be deleted from the Spec.Replicas field of the MachineDeployment during the rollout process.

During scale down, triggered via MachineDeployment/MachineSet, MCM prefers to delete the machine/s which have the least priority set.
Each machine object has an annotation machinepriority.machine.sapcloud.io set to 3 by default. An admin can reduce the priority of given machines by changing the annotation value to 1. The next scale down by the MachineDeployment will delete the machines with the least priority first.
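The priority-based selection can be sketched as follows. This is only an illustration under stated assumptions: machines are modelled as plain dictionaries, the helper pick_for_deletion is hypothetical, and ties are simply kept in input order, whereas MCM itself may weigh further criteria.

```python
PRIORITY_ANNOTATION = "machinepriority.machine.sapcloud.io"
DEFAULT_PRIORITY = 3  # default value stated in this FAQ


def pick_for_deletion(machines, count):
    """Return the names of the `count` machines with the lowest priority.

    Assumption: each machine is a dict with a 'name' and an optional
    'annotations' dict; a missing annotation means the default priority.
    """
    def priority(machine):
        annotations = machine.get("annotations", {})
        return int(annotations.get(PRIORITY_ANNOTATION, DEFAULT_PRIORITY))

    # sorted() is stable, so machines with equal priority keep their input order.
    return [m["name"] for m in sorted(machines, key=priority)[:count]]


machines = [
    {"name": "machine-a", "annotations": {}},
    {"name": "machine-b", "annotations": {PRIORITY_ANNOTATION: "1"}},
    {"name": "machine-c", "annotations": {PRIORITY_ANNOTATION: "3"}},
]
print(pick_for_deletion(machines, 1))  # ['machine-b']
```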
A machine can be force deleted by adding the label force-deletion: "True" on the machine object before executing the actual delete command. During force deletion, MCM skips the drain function and simply triggers the deletion of the machine. This label should be used with caution as it can violate the PDBs for pods running on the machine.
An ongoing rolling-update of the machine-deployment can be paused by using the spec.paused field. See the example below:
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
  name: test-machine-deployment
spec:
  paused: true
It can be unpaused again by removing the paused field from the machine-deployment.
If the user doesn’t have access to the machine objects (as in the case of Gardener clusters) and would like to replace a node immediately, they can place the annotation node.machine.sapcloud.io/trigger-deletion-by-mcm: "true" on their node. This will start the replacement of the machine with a new node.
On the other hand, if the user deletes the node object immediately, then replacement will start only after MachineHealthTimeout.
This annotation can also be used if the user wants to expedite the replacement of unhealthy nodes.
NOTE: The annotation node.machine.sapcloud.io/trigger-deletion-by-mcm: "false" is NOT acted upon by MCM, nor does it mean that MCM will not replace this machine.

A machine deleted this way is replaced in order to maintain the desired replicas specified for the machineDeployment/machineSet. Currently, if the user doesn’t have access to the machineDeployment/machineSet, they cannot remove a machine without replacement.

MCM provides an in-built safety mechanism to garbage collect VMs which have no corresponding machine object. This is done to save costs and is one of the key features of MCM. However, sometimes users might like to add nodes directly to the cluster without the help of MCM and would prefer MCM to not garbage collect such VMs. To do so, they should remove or avoid tags on their VMs containing the following strings:
kubernetes.io/cluster/
kubernetes.io/role/
kubernetes-io-cluster-
kubernetes-io-role-
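Before adding a VM outside of MCM, it can help to check its tags against the strings listed above. The sketch below is a minimal, hypothetical helper (not an MCM API) and assumes the tags are available as an iterable of strings, for example the tag keys returned by a cloud SDK.

```python
# Substrings listed in this FAQ; VMs carrying such tags may be garbage collected.
MCM_TAG_SUBSTRINGS = (
    "kubernetes.io/cluster/",
    "kubernetes.io/role/",
    "kubernetes-io-cluster-",
    "kubernetes-io-role-",
)


def tags_matching_mcm_patterns(tags):
    """Return the sorted tags that contain any of the listed substrings.

    Assumption: `tags` is an iterable of tag strings (iterating a dict
    yields its keys, so a {key: value} tag mapping also works).
    """
    return sorted(
        tag for tag in tags
        if any(substring in tag for substring in MCM_TAG_SUBSTRINGS)
    )


vm_tags = {"kubernetes.io/cluster/my-cluster": "1", "team": "infra"}
print(tags_matching_mcm_patterns(vm_tags))  # ['kubernetes.io/cluster/my-cluster']
```

An empty result suggests that none of the VM's tags match the listed strings.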
A rolling update can be triggered for a machineDeployment by updating one of the following:
- .spec.template.annotations
- .spec.template.spec.class.name

Please refer to the following document.
MCM allows configuring many knobs to fine-tune its behavior according to the user’s need. Please refer to the link to check the exact configuration options.
A machine’s lifecycle is governed mainly by the following timeouts, which can be configured here.
- MachineDrainTimeout: Amount of time after which drain times out and the machine is force deleted. Default ~2 hours.
- MachineHealthTimeout: Amount of time after which an unhealthy machine is declared Failed and the machine is replaced by the MachineSet controller.
- MachineCreationTimeout: Amount of time after which a machine creation is declared Failed and the machine is replaced by the MachineSet controller.
- NodeConditions: List of node conditions which, if set to true for the MachineHealthTimeout period, cause the machine to be declared Failed and replaced by the MachineSet controller.
- MaxEvictRetries: An integer depicting the number of times a failed eviction should be retried on a pod during the drain process. A pod is deleted after max-retries.

MCM imports the functionality from the upstream Kubernetes-drain library. However, a few parts have been modified to make it work best in the context of MCM. Drain is executed before machine deletion for graceful migration of the applications.
Drain internally uses the EvictionAPI to evict the pods and triggers the Deletion of pods after MachineDrainTimeout. Please note:
The drain function evicts the stateful-pods serially. It is observed that serial eviction of stateful pods yields better overall availability of pods, as the underlying cloud in most cases detaches and reattaches disks serially anyway. It is implemented in the following manner:
- It evicts a stateful-pod and waits for its related entry in the node’s .status.volumesAttached to be removed by KCM. It does the same for all the stateful-pods.
- It waits for PvDetachTimeout (default 2 minutes) for a given pod’s PVC to be removed, else moves forward.

How does maxEvictRetries configuration work with drainTimeout configuration?

It is recommended to only set MachineDrainTimeout. It satisfies the related requirements. MaxEvictRetries is auto-calculated based on MachineDrainTimeout if maxEvictRetries is not provided. The following is the overall behavior of both configurations together:
- maxEvictRetries isn’t set and only drainTimeout is set: MCM auto-calculates maxEvictRetries based on the drainTimeout.
- drainTimeout isn’t set and only maxEvictRetries is set: the default drainTimeout and the user-provided maxEvictRetries for each pod are considered.
- Both maxEvictRetries and drainTimeout are set:

The phase of a machine can be identified with Machine.Status.CurrentStatus.Phase. Following are the possible phases of a machine object:
- Pending: Machine creation call has succeeded. MCM is waiting for the machine to join the cluster.
- CrashLoopBackOff: Machine creation call has failed. MCM will retry the operation after a minor delay.
- Running: Machine creation call has succeeded. The machine has joined the cluster successfully and the corresponding node doesn’t have the node.gardener.cloud/critical-components-not-ready taint.
- Unknown: Machine health checks are failing, e.g., kubelet has stopped posting the status.
- Failed: Machine health checks have failed for a prolonged time. Hence it is declared failed by the Machine controller in a rate limited fashion. Failed machines get replaced immediately.
- Terminating: Machine is being terminated. The Terminating state is set immediately when the deletion is triggered for the machine object. It also includes the time when it’s being drained.

NOTE: No phase means the machine is being created on the cloud-provider.
Below is a simple phase transition diagram:
Health checks performed on a machine are:
- Node conditions configured via --node-conditions for the OOT MCM provider, or specified per machine object.
- True status of the NodeReady condition. This condition shows the kubelet’s status.

If any of the above checks fails, the machine turns to the Unknown phase.
Currently MCM replaces only 1 Unknown machine at a time per machinedeployment. This means that until the particular Unknown machine gets terminated and its replacement joins, no other Unknown machine is removed.
The above is achieved by enabling the Machine controller to turn a machine from Unknown -> Failed only if the above condition is met. The MachineSet controller, on the other hand, marks a Failed machine as Terminating immediately.
One reason for this rate limited replacement was to ensure that in case of network failures, where a node’s kubelet can’t reach the kube-apiserver, all nodes are not removed together, i.e. meltdown protection.
In gardener context however, DWD is deployed to deal with this scenario, but to stay protected from corner cases, this mechanism has been introduced in MCM.
NOTE: Rate limiting replacement is not yet configurable.
The Machinedeployment controller executes the logic of scaling BEFORE the logic of rollout. It identifies scaling by comparing the deployment.kubernetes.io/desired-replicas of each machineset under the machinedeployment with the machinedeployment’s .spec.replicas. If a difference is found for any machineSet, a scaling event is detected.
- scale-out -> ONLY the new machineSet is scaled out
- scale-in -> ALL machineSets (new or old) are scaled in, in proportion to their replica count; any leftover is adjusted in the largest machineSet.

During update for a scaling event, a machineSet is updated if any of the below is true for it:
- .spec.Replicas needs update
- deployment.kubernetes.io/desired-replicas needs update

Once scaling is achieved, rollout continues.
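The proportional scale-in described above can be sketched roughly as follows. This is not MCM's implementation: the function scale_in is hypothetical, rounding each proportional share down is an assumption, and the leftover is taken from the largest machineSet first, spilling over to the next largest only if the largest has no replicas left.

```python
def scale_in(replicas_by_set, target_total):
    """Distribute a scale-in across machineSets in proportion to their size.

    Assumption: each machineSet's share of removals is rounded down, and the
    leftover is removed from the largest machineSets first.
    """
    current_total = sum(replicas_by_set.values())
    if not 0 <= target_total <= current_total:
        raise ValueError("target_total must be between 0 and the current total")
    to_remove = current_total - target_total
    if to_remove == 0:
        return dict(replicas_by_set)

    # Proportional share of the removals, rounded down per machineSet.
    remaining = {
        name: count - (to_remove * count) // current_total
        for name, count in replicas_by_set.items()
    }
    # Whatever rounding left over is taken from the largest machineSets.
    leftover = sum(remaining.values()) - target_total
    for name in sorted(replicas_by_set, key=replicas_by_set.get, reverse=True):
        take = min(leftover, remaining[name])
        remaining[name] -= take
        leftover -= take
    return remaining


print(scale_in({"new": 3, "old": 7}, target_total=6))  # {'new': 2, 'old': 4}
```

In the example, 4 machines must go: the proportional shares are 1 (new) and 2 (old), and the single leftover removal is applied to the larger machineSet.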
There could be many machines under a machinedeployment with different phases and creation timestamps. When a scale down is triggered, MCM decides which machine to remove using the following logic:
- The machine with the lowest machinepriority.machine.sapcloud.io annotation value is picked up.

If a node is unhealthy for more than the machine-health-timeout specified for the machine-controller, the controller health-check moves the machine phase to Failed. By default, the machine-health-timeout is 10 minutes.
Failed machines have their deletion timestamp set and the machine then moves to the Terminating phase. The node drain process is initiated. The drain process is invoked either gracefully or forcefully.
The usual drain process is graceful. Pods are evicted from the node and the drain process waits until any existing attached volumes are mounted on the new node. However, if the node Ready is False or the ReadonlyFilesystem is True for greater than 5 minutes (non-configurable), then a forceful drain is initiated. In a forceful drain, pods are deleted and VolumeAttachment objects associated with the old node are also marked for deletion. This is followed by the deletion of the cloud provider VM associated with the Machine, and then finally ends with the Node object deletion.
During the deletion of the VM we only delete the local data disks and boot disks associated with the VM. The disks associated with persistent volumes are left untouched, as their attach/detach and mount/unmount processes are handled by the k8s attach-detach controller in conjunction with the CSI driver.
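The choice between a graceful and a forceful drain can be illustrated with a small sketch. It only encodes the 5-minute rule described above; the function drain_mode and its inputs are hypothetical and do not mirror MCM's internal types.

```python
from datetime import timedelta

# Threshold stated in this FAQ (non-configurable).
FORCE_DRAIN_AFTER = timedelta(minutes=5)


def drain_mode(not_ready_for, readonly_fs_for):
    """Return 'forceful' or 'graceful'.

    Assumption: each argument is a timedelta describing how long the node has
    had Ready=False / ReadonlyFilesystem=True, or None if the condition does
    not currently hold.
    """
    for held_for in (not_ready_for, readonly_fs_for):
        if held_for is not None and held_for > FORCE_DRAIN_AFTER:
            return "forceful"
    return "graceful"


print(drain_mode(timedelta(minutes=6), None))  # forceful
print(drain_mode(timedelta(minutes=2), None))  # graceful
```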
In most cases, the Machine.Status.LastOperation provides information about why a machine can’t be deleted.
The following are possible reasons, though the list is not exhaustive:
- A PodDisruptionBudget with maxUnavailable set to 0 doesn’t allow the eviction of the pods. Hence, drain/eviction is retried until MachineDrainTimeout. The default MachineDrainTimeout could be as large as ~2 hours, which blocks the machine deletion.

In most cases, the Machine.Status.LastOperation provides information about why a machine can’t be created.
It could possibly be debugged with the following steps:
- Verify that kube-controller-manager and cloud-controller-manager are running.
- Use the Machine.Spec.ProviderId to query the machine in the cloud.
- Check that the MachineDeployment is pointing to the correct MachineClass, and that the MachineClass is pointing to the correct Secret. The secret object contains the actual cloud-config in base64 format which will be used to boot the machine.

The following can be the reason:
- The machine does not reach the Running state until node-critical-components are ready. Refer to this for more details.

A developer can set up MCM locally using the following guide.
Developer must also enhance the unit tests related to the incoming changes.
Developer can run the unit test locally by executing:
make test-unit
Developer can locally run integration tests to ensure basic functionality of MCM is not altered.
Developer should add/update the API fields at both of the following places:
Once API changes are done, auto-generate the code using following command:
make generate
Please ignore the API-violation errors for now.
MCM uses gomod for dependency management.
Developer should add/update the dependency in the go.mod file. Please run the following command to automatically tidy the dependencies.
make tidy
All of the knobs of MCM can be configured by the workers section of the shoot resource.
- Gardener creates a MachineDeployment per zone for each worker-pool under the workers section.
- workers.dataVolumes allows attaching multiple disks to a machine during creation. Refer to the link.
- workers.machineControllerManager allows configuration of multiple knobs of the MachineDeployment from the shoot resource.

The Shoot resource allows the worker-pool to spread across multiple zones using the field workers.zones. Refer to the link.
Gardener creates one MachineDeployment per zone. Each MachineDeployment is initiated with the following replica:
MachineDeployment.Spec.Replicas = (Workers.Minimum)/(Number of availability zones)
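As a small worked example of the formula above, the sketch below splits a worker pool's minimum across its zones. The helper name is hypothetical, and because this FAQ does not describe how a non-divisible remainder is spread over the zones, the sketch only reports the remainder rather than assigning it.

```python
def initial_replicas_per_zone(minimum, zone_count):
    """Return (replicas per zone, undistributed remainder).

    Assumption: integer division is used; how the remainder is assigned to
    individual zones is not covered here.
    """
    if zone_count <= 0:
        raise ValueError("zone_count must be positive")
    if minimum < 0:
        raise ValueError("minimum must be non-negative")
    return divmod(minimum, zone_count)


print(initial_replicas_per_zone(minimum=6, zone_count=3))  # (2, 0)
print(initial_replicas_per_zone(minimum=5, zone_count=3))  # (1, 2)
```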