Proposals
1 - Excess Reserve Capacity
Excess Reserve Capacity
Goal
Currently, the autoscaler optimizes the number of machines for a given application workload. While this gives effective resource utilization, it also raises a concern: when new application instances are created, they often do not find space in the existing cluster. This leads the cluster-autoscaler to create new machines via MachineDeployment, and it can take anywhere from 3-4 minutes to ~10 minutes for a machine to come up and join the cluster. In turn, application instances have to wait until the new machines join the cluster.
One promising solution to this issue is Excess Reserve Capacity. The idea is to keep a certain number of machines or a percentage of resources [cpu/memory] always available, so that new workload can generally be scheduled immediately unless there is a huge spike in the workload. The user should also be given enough flexibility to choose how many resources or how many machines are kept alive and unutilized, as this directly affects cost.
Note
- We decided to go with Approach-4, which is based on low-priority pods. Please find more details here: https://github.com/gardener/gardener/issues/254
- Approach-3 looks more promising in the long term; we may decide to adopt it in the future based on developments/contributions in the autoscaler community.
Possible Approaches
The following are the approaches we have considered so far.
Approach 1: Enhance Machine-controller-manager to also entertain the excess machines
Machine-controller-manager currently takes care of the machines in the shoot cluster, from creation, deletion and health checks to efficient rolling updates of the machines. From the architecture point of view, MachineSet makes sure that X number of machines are always running and healthy. The MachineDeployment controller uses this facility to perform rolling updates.
We can expand the scope of the MachineDeployment controller to maintain an excess number of machines by introducing a new, parallel, independent controller named MachineTaint controller. As a result, MCM will include the Machine, MachineSet, MachineDeployment, MachineSafety and MachineTaint controllers. The MachineTaint controller does not need to introduce any new CRD; the analogy is that the taint-controller also resides in kube-controller-manager.
The only job of the MachineTaint controller will be:
- List all the Machines under each MachineDeployment.
- Maintain taints of noSchedule and noExecute on the X latest MachineObjects.
- There should be an event-based informer mechanism through which the MachineTaintController learns about any Update/Delete/Create event of MachineObjects and, in turn, maintains the noSchedule and noExecute taints on all the latest machines.
- Why latest machines?
  - Whenever the autoscaler decides to add new machines (essentially a ScaleUp event), taints are removed from the older machines and the newer machines get the taints. This way X machines immediately become free for new pods to be scheduled.
  - On a ScaleDown event, the autoscaler specifically mentions which machines should be deleted, so that should not raise any concerns. However, we will have to put the proper label/annotation defined by the autoscaler on taintedMachines, so that the autoscaler does not consider the taintedMachines for deletion during scale-down.
    * Annotation on tainted node: "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"
Implementation Details:
- Expect a new optional field ExcessReplicas in MachineDeployment.Spec. The MachineDeployment controller now adds both Spec.Replicas and Spec.ExcessReplicas [if provided], and considers that as a standard desiredReplicas.
- Current working of MCM will not be affected if the ExcessReplicas field is kept nil.
- MachineController currently reads the NodeObject and sets the MachineConditions in MachineObject. Machine-controller will now also read the taints/labels from the MachineObject and maintain them on the NodeObject.
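To make the "X latest machines" rule concrete, here is a minimal Go sketch of how a MachineTaint controller might choose which machines carry the taints. The `Machine` type and the `latestMachines` helper are hypothetical simplifications for illustration and are not part of MCM.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Machine is a simplified, hypothetical stand-in for an MCM Machine object.
type Machine struct {
	Name      string
	CreatedAt time.Time
}

// latestMachines returns the names of the x most recently created machines,
// newest first. These are the machines that would carry the taints.
func latestMachines(machines []Machine, x int) []string {
	sorted := make([]Machine, len(machines))
	copy(sorted, machines)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].CreatedAt.Equal(sorted[j].CreatedAt) {
			return sorted[i].Name < sorted[j].Name // deterministic tie-break
		}
		return sorted[i].CreatedAt.After(sorted[j].CreatedAt)
	})
	if x < 0 {
		x = 0
	}
	if x > len(sorted) {
		x = len(sorted)
	}
	names := make([]string, 0, x)
	for _, m := range sorted[:x] {
		names = append(names, m.Name)
	}
	return names
}

func main() {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	machines := []Machine{
		{Name: "m-1", CreatedAt: base},
		{Name: "m-2", CreatedAt: base.Add(1 * time.Hour)},
		{Name: "m-3", CreatedAt: base.Add(2 * time.Hour)},
	}
	fmt.Println(latestMachines(machines, 2)) // [m-3 m-2]

	// After a scale-up, the taints move to the newer machines and m-2 becomes schedulable.
	machines = append(machines, Machine{Name: "m-4", CreatedAt: base.Add(3 * time.Hour)})
	fmt.Println(latestMachines(machines, 2)) // [m-4 m-3]
}
```

In a real controller this selection would presumably be re-run from the informer event handlers, with the result used to add or remove taints on the corresponding nodes.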
We expect cluster-autoscaler to intelligently make use of the provided feature from MCM.
- CA gets the input of min:max:excess from Gardener. CA continues to set the MachineDeployment.Spec.Replicas as usual based on the application workload.
- In addition, CA also sets the MachineDeployment.Spec.ExcessReplicas.
- Corner-case:
  * CA should decrement the excessReplicas field accordingly when desiredReplicas+excessReplicas on MachineDeployment goes beyond max.
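The corner case above amounts to simple arithmetic. The following Go sketch shows one way the excess could be capped; the function name `cappedExcess` is hypothetical, and it assumes the autoscaler already keeps `replicas` at or below `maxReplicas`.

```go
package main

import "fmt"

// cappedExcess returns the total machine count to maintain and the excess
// value after it has been reduced so that replicas+excess never exceeds max.
func cappedExcess(replicas, excess, maxReplicas int) (total, effectiveExcess int) {
	if excess < 0 {
		excess = 0
	}
	if replicas+excess > maxReplicas {
		excess = maxReplicas - replicas
		if excess < 0 {
			excess = 0
		}
	}
	return replicas + excess, excess
}

func main() {
	for _, replicas := range []int{3, 9, 10} {
		total, excess := cappedExcess(replicas, 2, 10)
		fmt.Printf("replicas=%d excess=%d total=%d\n", replicas, excess, total)
	}
	// replicas=3 excess=2 total=5
	// replicas=9 excess=1 total=10
	// replicas=10 excess=0 total=10
}
```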
Approach 2: Enhance Cluster-autoscaler by simulating fake pods in it
- There was already an attempt by community to support this feature.
- Refer for details to: https://github.com/kubernetes/autoscaler/pull/77/files
Approach 3: Enhance cluster-autoscaler to support pluggable scaling-events
- Forked version of cluster-autoscaler could be improved to plug-in the algorithm for excess-reserve capacity.
- Needs further discussion around upstream support.
- Create a golang channel to separate the algorithms that trigger scaling (currently hard-coded in cluster-autoscaler) from the algorithms for how to achieve the scaling (already pluggable in cluster-autoscaler). This kind of separation can help us introduce/plug in new algorithms (such as ones based on node resource utilisation) without affecting the existing code-base too much, while almost completely re-using the code-base for the actual scaling.
- Also this approach is not specific to our fork of cluster-autoscaler. It can be made upstream eventually as well.
Approach 4: Make intelligent use of Low-priority pods
- Refer to: pod-priority-preemption
- TL;DR:
  - High priority pods can preempt the low-priority pods which are already scheduled.
  - Pre-create a bunch [equivalent of X shoot-control-planes] of low-priority pods with priority of zero, then start creating the workload pods with higher priority, which will reschedule the low-priority pods or otherwise keep them in pending state if the limit for max-machines has been reached.
  - This is still an alpha feature.
2 - GRPC Based Implementation of Cloud Providers
GRPC based implementation of Cloud Providers - WIP
Goal:
Currently the Cloud Providers’ (CP) functionalities ( Create(), Delete(), List() ) are part of the Machine Controller Manager’s (MCM) repository. Because of this, adding support for new CPs into MCM requires merging code into MCM which may not be required for the core functionalities of MCM itself. Also, for various reasons it may not be feasible for all CPs to merge their code with MCM, which is an Open Source project.
For these reasons, it was decided that the CPs’ code will be moved out into separate repositories so that it can be maintained separately by the respective teams. The idea is to make MCM act as a GRPC server, and CPs as GRPC clients. A CP can register itself with the MCM using a GRPC service exposed by the MCM. Details of this approach are discussed below.
How it works:
MCM acts as a GRPC server and listens on the pre-defined port 5000. It implements the GRPC services listed below. Details of each of these services are given in the next section.
Register()
GetMachineClass()
GetSecret()
GRPC services exposed by MCM:
Register()
rpc Register(stream DriverSide) returns (stream MCMside) {}
The CP GRPC client calls this service to register itself with the MCM. The CP passes the kind and the APIVersion which it implements, and MCM maintains an internal map for all the registered clients. A GRPC stream is returned in response which is kept open throughout the life of both the processes. MCM uses this stream to communicate with the client for machine operations: Create(), Delete() or List().
The CP client is responsible for reading the incoming messages continuously, and based on the operationType parameter embedded in the message, it is supposed to take the required action. This part is already handled in the package grpc/infraclient.
To add a new CP client, import the package, and implement the ExternalDriverProvider interface:
type ExternalDriverProvider interface {
Create(machineclass *MachineClassMeta, credentials, machineID, machineName string) (string, string, error)
Delete(machineclass *MachineClassMeta, credentials, machineID string) error
List(machineclass *MachineClassMeta, credentials, machineID string) (map[string]string, error)
}
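As an illustration of what implementing this interface could look like, here is a minimal in-memory Go sketch. The `MachineClassMeta` fields, the `memoryProvider` type, and the meaning of the returned strings (assumed here to be a machine ID and a machine name, with `List` returning ID-to-name pairs) are assumptions made for the example; a real client would import the types from MCM and call the cloud provider's API instead.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineClassMeta is a hypothetical stand-in for the type defined in MCM.
type MachineClassMeta struct {
	Name string
}

// ExternalDriverProvider mirrors the interface quoted above.
type ExternalDriverProvider interface {
	Create(machineclass *MachineClassMeta, credentials, machineID, machineName string) (string, string, error)
	Delete(machineclass *MachineClassMeta, credentials, machineID string) error
	List(machineclass *MachineClassMeta, credentials, machineID string) (map[string]string, error)
}

// memoryProvider keeps "machines" in a map instead of calling a cloud API.
type memoryProvider struct {
	machines map[string]string // machine ID -> machine name
	nextID   int
}

func newMemoryProvider() *memoryProvider {
	return &memoryProvider{machines: map[string]string{}}
}

// Create records a machine and returns its generated ID and its name.
func (p *memoryProvider) Create(_ *MachineClassMeta, _, _, machineName string) (string, string, error) {
	if machineName == "" {
		return "", "", errors.New("machine name must not be empty")
	}
	p.nextID++
	id := fmt.Sprintf("mem-%03d", p.nextID)
	p.machines[id] = machineName
	return id, machineName, nil
}

// Delete removes a machine and fails if the ID is unknown.
func (p *memoryProvider) Delete(_ *MachineClassMeta, _, machineID string) error {
	if _, ok := p.machines[machineID]; !ok {
		return fmt.Errorf("machine %q not found", machineID)
	}
	delete(p.machines, machineID)
	return nil
}

// List returns all machines when machineID is empty, else only the matching one.
func (p *memoryProvider) List(_ *MachineClassMeta, _, machineID string) (map[string]string, error) {
	out := map[string]string{}
	for id, name := range p.machines {
		if machineID == "" || id == machineID {
			out[id] = name
		}
	}
	return out, nil
}

func main() {
	var provider ExternalDriverProvider = newMemoryProvider()
	mc := &MachineClassMeta{Name: "example-class"}

	id, name, err := provider.Create(mc, "", "", "node-a")
	fmt.Println(id, name, err) // mem-001 node-a <nil>

	machines, _ := provider.List(mc, "", "")
	fmt.Println(len(machines)) // 1

	fmt.Println(provider.Delete(mc, "", id)) // <nil>
}
```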
GetMachineClass()
rpc GetMachineClass(MachineClassMeta) returns (MachineClass) {}
As part of the message from MCM for various machine operations, the name of the machine class is sent instead of the full machine class spec. The CP client is expected to use this GRPC service to get the full spec of the machine class. This optionally enables the client to cache the machine class spec, and make the call only if the machine class spec is not already cached.
GetSecret()
rpc GetSecret(SecretMeta) returns (Secret) {}
As part of the message from MCM for various machine operations, the Cloud Config (CC) and CP credentials are not sent. The CP client is expected to use this GRPC service to get the secret which has CC and CP’s credentials from MCM. This enables the client to cache the CC and credentials, and to make the call only if the data is not already cached.
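The caching behaviour described for GetMachineClass() and GetSecret() can be sketched as a small read-through cache. In the Go example below, the `fetch` function is a placeholder for the actual GRPC call to MCM, and `cachedFetcher` is a hypothetical helper, not part of grpc/infraclient.

```go
package main

import "fmt"

// cachedFetcher calls fetch only when a value is not already cached.
// fetch stands in for the GetMachineClass() or GetSecret() call to MCM.
type cachedFetcher struct {
	fetch func(name string) (string, error)
	cache map[string]string
	calls int // number of remote calls made, kept for illustration
}

func newCachedFetcher(fetch func(name string) (string, error)) *cachedFetcher {
	return &cachedFetcher{fetch: fetch, cache: map[string]string{}}
}

// Get returns the cached value or fetches and stores it on a cache miss.
func (c *cachedFetcher) Get(name string) (string, error) {
	if v, ok := c.cache[name]; ok {
		return v, nil
	}
	c.calls++
	v, err := c.fetch(name)
	if err != nil {
		return "", err // errors are not cached, so the next call retries
	}
	c.cache[name] = v
	return v, nil
}

func main() {
	specs := newCachedFetcher(func(name string) (string, error) {
		return "spec-of-" + name, nil
	})
	for i := 0; i < 3; i++ {
		spec, _ := specs.Get("worker-class")
		fmt.Println(spec) // spec-of-worker-class
	}
	fmt.Println("remote calls:", specs.calls) // remote calls: 1
}
```

A real client would also need a strategy for invalidating stale entries, which the proposal does not describe.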
How to add a new Cloud Provider’s support
- Import the packages grpc/infraclient and grpc/infrapb from MCM (currently in MCM’s “grpc-driver” branch)
- Implement the interface ExternalDriverProvider
  - Create(): Creates a new machine
  - Delete(): Deletes a machine
  - List(): Lists machines
- Use the interface MachineClassDataProvider
  - GetMachineClass(): Makes the call to MCM to get machine class spec
  - GetSecret(): Makes the call to MCM to get secret containing Cloud Config and CP’s credentials
Example implementation:
Refer to the GRPC based implementation of the AWS client: https://github.com/ggaurav10/aws-driver-grpc
3 - Hotupdate Instances
Hot-Update VirtualMachine tags without triggering a rolling-update
Motivation
MCM Issue#750 There is a requirement to provide a way for consumers to add tags which can be hot-updated onto VMs. This requirement can be generalized to also offer a convenient way to specify tags which can be applied to VMs, NICs, Devices etc.
MCM Issue#635 which in turn points to MCM-Provider-AWS Issue#36 - The issue hints at other fields, like enabling/disabling source/destination checks for NAT instances, which need to be hot-updated on network interfaces.
In GCP provider -
instance.ServiceAccounts
can be updated without the need to roll-over the instance. See
Boundary Condition
All tags that are added via means other than MachineClass.ProviderSpec should be preserved as-is. Only updates done to tags in MachineClass.ProviderSpec should be applied to the infra resources (VM/NIC/Disk).
What is available today?
WorkerPool configuration inside shootYaml provides a way to set labels. As per the definition these labels will be applied on Node resources. Currently these labels are also passed to the VMs as tags. There is no distinction made between Node labels and VM tags.
MachineClass has a field which holds provider specific configuration and one such configuration is tags. Gardener provider extensions update the tags in MachineClass.
- AWS provider extension directly passes the labels to the tag section of machineClass.
- Azure provider extension sanitizes the worker pool labels and adds them as tags in MachineClass.
- GCP provider extension sanitizes them, and then sets them as labels in the MachineClass. In GCP tags only have keys and are currently hard coded.
Let us look at an example of MachineClass.ProviderSpec
in AWS:
providerSpec:
ami: ami-02fe00c0afb75bbd3
tags:
#[section-1] pool labels added by gardener extension
#########################################################
kubernetes.io/arch: amd64
networking.gardener.cloud/node-local-dns-enabled: "true"
node.kubernetes.io/role: node
worker.garden.sapcloud.io/group: worker-ser234
worker.gardener.cloud/cri-name: containerd
worker.gardener.cloud/pool: worker-ser234
worker.gardener.cloud/system-components: "true"
#[section-2] Tags defined in the gardener-extension-provider-aws
###########################################################
kubernetes.io/cluster/cluster-full-name: "1"
kubernetes.io/role/node: "1"
#[section-3]
###########################################################
user-defined-key1: user-defined-val1
user-defined-key2: user-defined-val2
Refer src for tags defined in section-1. Refer src for tags defined in section-2. Tags in section-3 are defined by the user.
Out of the above three tag categories, MCM depends on section-2 tags (mandatory-tags) for its orphan collection and for the Driver’s DeleteMachine and GetMachineStatus to work.
ProviderSpec.Tags are transported to the provider specific resources as follows:
Provider | Resources Tags are set on | Code Reference | Comment |
---|---|---|---|
AWS | Instance(VM), Volume, Network-Interface | aws-VM-Vol-NIC | No distinction is made between tags set on VM, NIC or Volume |
Azure | Instance(VM), Network-Interface | azure-VM-parameters & azureNIC-Parameters | |
GCP | Instance(VM), 1 tag: name (denoting the name of the worker) is added to Disk | gcp-VM & gcp-Disk | In GCP key-value pairs are called labels while network tags have only keys |
AliCloud | Instance(VM) | aliCloud-VM | |
What are the problems with the current approach?
There are a few shortcomings in the way tags/labels are handled:
- Tags can only be set at the time a machine is created.
- There is no distinction made amongst tags/labels that are added to VMs, disks or network interfaces. As stated above, for AWS the same set of tags is added to all. There is a limit on the number of tags/labels that can be associated with the devices (disks, VMs, NICs etc.). Example: In AWS a max of 50 user created tags are allowed. Similar restrictions apply to different resources across providers. Therefore adding all tags to all devices, even when a subset of the tags is not meant for that resource, exhausts the total allowed tags/labels for that resource.
- The only placeholder in the shoot YAML, as mentioned above, is meant to hold labels that should be applied primarily on the Node objects. So while you could use the node labels for extended resources, using them also for tags is not clean.
- There is no provision in the shoot YAML today to add tags only to a subset of resources.
MachineClass Update and its impact
When Worker.ProviderConfig is changed, a worker-hash is computed which includes the raw ProviderConfig. This hash value is then used as a suffix when constructing the name for a MachineClass. See aws-extension-provider as an example. A change in the name of the MachineClass will then in turn trigger a rolling update of machines. Since tags are provider specific and therefore part of ProviderConfig, any update to them will result in a rolling update of machines.
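The effect can be sketched in a few lines of Go. The hashing scheme below (SHA-256 truncated to 8 hex characters) and the `machineClassName` helper are assumptions chosen for illustration; the actual extension code computes its hash differently, but the consequence is the same: any change in the raw config, including a tag value, yields a new MachineClass name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// machineClassName appends a short hash of the raw provider config to the
// worker pool name (hypothetical naming scheme for illustration).
func machineClassName(poolName string, rawProviderConfig []byte) string {
	sum := sha256.Sum256(rawProviderConfig)
	return poolName + "-" + hex.EncodeToString(sum[:])[:8]
}

func main() {
	before := []byte(`{"volume":{"iops":10000},"tags":{"key1":"val1"}}`)
	after := []byte(`{"volume":{"iops":10000},"tags":{"key1":"val2"}}`)

	oldName := machineClassName("worker-ser234", before)
	newName := machineClassName("worker-ser234", after)

	fmt.Println(oldName)
	fmt.Println(newName)
	// A different name means a new MachineClass and hence a rolling update.
	fmt.Println("name changed:", oldName != newName)
}
```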
Proposal
Shoot YAML changes
Provider specific configuration is set via providerConfig section for each worker pool.
Example worker provider config (current):
providerConfig:
apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
kind: WorkerConfig
volume:
iops: 10000
dataVolumes:
- name: kubelet-dir
snapshotID: snap-13234
iamInstanceProfile: # (specify either ARN or name)
name: my-profile
arn: my-instance-profile-arn
It is proposed that an additional field be added for tags under providerConfig. Proposed changed YAML:
providerConfig:
apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
kind: WorkerConfig
volume:
iops: 10000
dataVolumes:
- name: kubelet-dir
snapshotID: snap-13234
iamInstanceProfile: # (specify either ARN or name)
name: my-profile
arn: my-instance-profile-arn
tags:
vm:
key1: val1
key2: val2
..
# for GCP network tags are just keys (there is no value associated to them).
# What is shown below will work for AWS provider.
network:
key3: val3
key4: val4
Under tags a clear distinction is made between tags for VMs, Disks, network interfaces etc. Each provider has a different allowed set of characters that it accepts as key names, has different limits on the tags that can be set on a resource (disk, NIC, VM etc.) and also has a different format (GCP network tags are only keys).
TODO:
Check if worker.labels are getting added as tags on infra resources. We should continue to support it and double check that these should only be added to VMs and not to other resources.
Should we support users adding VM tags as node labels?
Provider specific WorkerConfig API changes
Taking the AWS provider extension as an example to show the changes.
WorkerConfig will now have the following changes:
- A new field for tags will be introduced.
- Additional metadata for struct fields will now be added via struct tags.
type WorkerConfig struct {
metav1.TypeMeta
Volume *Volume
// .. all fields are not mentioned here.
// Tags are a collection of tags to be set on provider resources (e.g. VMs, Disks, Network Interfaces etc.)
Tags *Tags `hotupdatable:"true"`
}
// Tags is a placeholder for all tags that can be set/updated on VMs, Disks and Network Interfaces.
type Tags struct {
// VM tags set on the VM instances.
VM map[string]string
// Network tags set on the network interfaces.
Network map[string]string
// Disk tags set on the volumes/disks.
Disk map[string]string
}
There is a need to distinguish fields within ProviderSpec (which is then mapped to the above WorkerConfig) which can be updated without changing the hash suffix for MachineClass and thus without triggering a rolling update on machines.
To achieve that we propose to use the struct tag hotupdatable, whose value indicates if the field can be updated without the need to do a rolling update. To ensure backward compatibility, all fields which do not have this tag or have hotupdatable set to false will be considered immutable and will require a rolling update to take effect.
Gardener provider extension changes
Taking the AWS provider extension as an example, the following changes should be made to all gardener provider extensions.
AWS Gardener Extension generates machine config using worker pool configuration. As part of that it also computes the workerPoolHash which is then used to create the name of the MachineClass.
Currently the WorkerPoolHash function uses the entire providerConfig to compute the hash. The proposal is to do the following:
- Remove the code from the WorkerPoolHash function that hashes the entire providerConfig.
- Add another function to compute the hash using all immutable fields in the provider config struct and then pass that to worker.WorkerPoolHash as additionalData.
The above will ensure that tags and any other field in WorkerConfig which is marked with hotupdatable:true is not considered for hash computation and will therefore not contribute to changing the name of the MachineClass object, thus preventing a rolling update.
WorkerConfig and therefore the contained tags will be set as ProviderSpec in MachineClass.
If only fields which have hotupdatable:true are changed then it should result in an update/patch of the MachineClass and not a creation.
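One possible shape for the "hash over immutable fields only" function is sketched below in Go using reflection. The `WorkerConfig` here is a cut-down copy for the example, `immutableHash` is a hypothetical helper, and the struct tag is written in the conventional quoted form `hotupdatable:"true"` because that is what `reflect.StructTag.Get` parses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"reflect"
)

// Cut-down versions of the types from the proposal.
type Volume struct{ IOPS int }

type Tags struct{ VM, Network, Disk map[string]string }

type WorkerConfig struct {
	Volume *Volume
	Tags   *Tags `hotupdatable:"true"`
}

// immutableHash hashes only the fields that are NOT marked hotupdatable.
func immutableHash(cfg WorkerConfig) (string, error) {
	v := reflect.ValueOf(cfg)
	t := v.Type()
	immutable := map[string]interface{}{}
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		if field.Tag.Get("hotupdatable") == "true" {
			continue // hot-updatable fields must not influence the hash
		}
		immutable[field.Name] = v.Field(i).Interface()
	}
	// encoding/json sorts map keys, so the serialized form is deterministic.
	data, err := json.Marshal(immutable)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])[:8], nil
}

func main() {
	base := WorkerConfig{
		Volume: &Volume{IOPS: 10000},
		Tags:   &Tags{VM: map[string]string{"key1": "val1"}},
	}
	retagged := base
	retagged.Tags = &Tags{VM: map[string]string{"key1": "val2"}}

	h1, _ := immutableHash(base)
	h2, _ := immutableHash(retagged)
	fmt.Println("hash unchanged after tag update:", h1 == h2) // hash unchanged after tag update: true
}
```

The resulting string could then be passed as additionalData, so that a tag-only change leaves the MachineClass name untouched.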
Driver interface changes
Driver interface which is a facade to provider specific API implementations will have one additional method.
type Driver interface {
// .. existing methods are not mentioned here for brevity.
UpdateMachine(context.Context, *UpdateMachineRequest) error
}
// UpdateMachineRequest is the request to update machine tags.
type UpdateMachineRequest struct {
ProviderID string
LastAppliedProviderSpec raw.Extension
MachineClass *v1alpha1.MachineClass
Secret *corev1.Secret
}
If any machine-controller-manager-provider-<providername> has not implemented UpdateMachine then updates of tags on Instances/NICs/Disks will not be done. An error message will be logged instead.
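The "log instead of update" fallback could look roughly like the Go sketch below. The sentinel error `ErrUnimplemented`, the reduced `UpdateMachineRequest`, and the `hotUpdate` helper are all assumptions for illustration; the proposal does not specify how an unimplemented method is signalled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrUnimplemented is a hypothetical sentinel returned by providers that
// have not implemented UpdateMachine yet.
var ErrUnimplemented = errors.New("UpdateMachine is not implemented by this provider")

// UpdateMachineRequest is reduced to one field for this sketch.
type UpdateMachineRequest struct {
	ProviderID string
}

type Driver interface {
	UpdateMachine(context.Context, *UpdateMachineRequest) error
}

// legacyDriver represents a provider without hot-update support.
type legacyDriver struct{}

func (legacyDriver) UpdateMachine(context.Context, *UpdateMachineRequest) error {
	return ErrUnimplemented
}

// taggingDriver represents a provider that supports hot updates.
type taggingDriver struct{ updated []string }

func (d *taggingDriver) UpdateMachine(_ context.Context, req *UpdateMachineRequest) error {
	d.updated = append(d.updated, req.ProviderID)
	return nil
}

// hotUpdate reports whether the update was applied. Unimplemented providers
// are logged and skipped; any other error is returned to the caller.
func hotUpdate(d Driver, req *UpdateMachineRequest) (bool, error) {
	err := d.UpdateMachine(context.Background(), req)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, ErrUnimplemented):
		fmt.Printf("skipping hot update for %s: %v\n", req.ProviderID, err)
		return false, nil
	default:
		return false, err
	}
}

func main() {
	req := &UpdateMachineRequest{ProviderID: "aws:///eu-west-1/i-123"}
	fmt.Println(hotUpdate(legacyDriver{}, req))     // false <nil>
	fmt.Println(hotUpdate(&taggingDriver{}, req))   // true <nil>
}
```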
Machine Class reconciliation
Current MachineClass reconciliation does not reconcile MachineClass resource updates; it only enqueues associated machines. The reason is the assumption that anything changed in a MachineClass will result in the creation of a new MachineClass with a different name. This will result in a rolling update of all machines using the MachineClass as a template.
However, it is possible that there is data shared by all machines in a MachineSet which does not require a rolling update (e.g. tags), therefore there is a need to reconcile the MachineClass as well.
Reconciliation Changes
To ensure that machines eventually get updated with changes to the hot-updatable fields defined in the MachineClass.ProviderConfig as raw.Extension, we should only fix MCM Issue#751 in the MachineClass reconciliation and let it enqueue the machines as it does today. We additionally propose the following two things:
- Introduce a new annotation last-applied-providerspec on every machine resource. This will capture the last successfully applied MachineClass.ProviderSpec on this instance.
- Enhance the machine reconciliation to include code to hot-update the machine.
In machine-reconciliation there are currently two flows: triggerDeletionFlow and triggerCreationFlow. When a machine gets enqueued due to changes in MachineClass, the following changes need to be introduced in this method:
Check if the machine has the last-applied-providerspec annotation.
Case 1.1
If the annotation is not present then there can be just 2 possibilities:
- It is a fresh/new machine and no backing resources (VM/NIC/Disk) exist yet. The current flow checks if the providerID is empty and Status.CurrentStatus.Phase is empty, and if so it enters the triggerCreationFlow.
- It is an existing machine which does not yet have this annotation. In this case call Driver.UpdateMachine. If the driver returns no error then add the last-applied-providerspec annotation with the value of MachineClass.ProviderSpec to this machine.
Case 1.2
If the annotation is present then compare the last applied provider-spec with the current provider-spec. If there are changes (check their hash values) then call Driver.UpdateMachine. If the driver returns no error then update the last-applied-providerspec annotation on this machine with the value of MachineClass.ProviderSpec.
NOTE: It is assumed that changes to fields which are not marked as hotupdatable will result in a change of name for the MachineClass, resulting in a rolling update of machines. If the name has not changed, the machine is enqueued and there is a change in the machine-class, then the change must be to hotupdatable fields in the spec.
The trigger-update flow can be run after reconcileMachineHealth and syncMachineNodeTemplates in machine-reconciliation.
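The decision logic of Case 1.1 and Case 1.2 can be summarised in a short Go sketch. The `machine` struct and the `nextFlow` function are hypothetical simplifications; the provider spec is represented as a plain string and compared via its SHA-256 hash, following the "check their hash values" hint above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const lastAppliedAnnotation = "last-applied-providerspec"

// machine is a trimmed-down, hypothetical view of the Machine resource.
type machine struct {
	ProviderID  string
	Phase       string
	Annotations map[string]string
}

func specHash(spec string) string {
	sum := sha256.Sum256([]byte(spec))
	return hex.EncodeToString(sum[:])
}

// nextFlow returns "create", "update" or "none" for an enqueued machine.
func nextFlow(m machine, currentSpec string) string {
	last, ok := m.Annotations[lastAppliedAnnotation]
	if !ok {
		// Case 1.1: the annotation is not present.
		if m.ProviderID == "" && m.Phase == "" {
			return "create" // fresh machine, no backing resources yet
		}
		return "update" // existing machine that predates the annotation
	}
	// Case 1.2: compare the last applied spec with the current one.
	if specHash(last) != specHash(currentSpec) {
		return "update"
	}
	return "none"
}

func main() {
	fmt.Println(nextFlow(machine{}, "S1")) // create

	existing := machine{ProviderID: "aws:///i-1", Phase: "Running"}
	fmt.Println(nextFlow(existing, "S1")) // update

	existing.Annotations = map[string]string{lastAppliedAnnotation: "S1"}
	fmt.Println(nextFlow(existing, "S1")) // none
	fmt.Println(nextFlow(existing, "S2")) // update
}
```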
There are 2 edge cases that need attention and special handling:
Premise: It is identified that there is an update done to one or more hotupdatable fields in the MachineClass.ProviderSpec.
Edge-Case-1
In the machine reconciliation, an update-machine-flow is triggered which in turn calls Driver.UpdateMachine. Consider the case where the hot update needs to be done to all VM, NIC and Disk resources. The driver returns an error which indicates a partial-failure. As we have mentioned above, only when Driver.UpdateMachine returns no error will last-applied-providerspec be updated. In case of partial failure the annotation will not be updated. This event will be re-queued for a re-attempt. However, consider a case where, before the item is re-queued, another update is done to MachineClass reverting the changes back to the original spec.
At T1 | At T2 (T2 > T1) | At T3 (T3 > T2) |
---|---|---|
last-applied-providerspec=S1; MachineClass.ProviderSpec = S1 | last-applied-providerspec=S1; MachineClass.ProviderSpec = S2; another update to MachineClass.ProviderConfig = S3 is enqueued (S3 == S1) | last-applied-providerspec=S1; Driver.UpdateMachine for S1-S2 update returns partial failure; Machine-Key is requeued |
At T4 (T4 > T3) when a machine is reconciled, it checks that last-applied-providerspec is S1 and the current MachineClass.ProviderSpec = S3, and since S3 is the same as S1, no update is done. At T2 Driver.UpdateMachine was called to update the machine with S2 but it partially failed. So now you will have resources which are partially updated with S2 and no further updates will be attempted.
Edge-Case-2
The above situation can also happen when Driver.UpdateMachine is in the process of updating resources. It has hot-updated, let's say, 1 resource, but now MCM crashes. By the time it comes up, another update to MachineClass.ProviderSpec is done, essentially reverting the previous change (same case as above). In this case the reconciliation loop never got a chance to get any response from the driver.
To handle the above edge cases there are 2 options:
Option #1
Introduce a new annotation inflight-providerspec-hash. The value of this annotation will be the hash value of the MachineClass.ProviderSpec that is in the process of getting applied on this machine. The machine will be updated with this annotation just before calling Driver.UpdateMachine (in the trigger-update-machine-flow). If the driver returns no error then (in a single update):
- last-applied-providerspec will be updated
- inflight-providerspec-hash annotation will be removed.
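The following Go sketch replays Edge-Case-1 with Option #1 in place. It assumes, as one plausible reading of the option, that a leftover inflight-providerspec-hash annotation always forces a re-apply on the next reconcile; `needsUpdate`, `applyUpdate` and the string-based spec are hypothetical simplifications.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

const (
	lastAppliedAnnotation = "last-applied-providerspec"
	inflightAnnotation    = "inflight-providerspec-hash"
)

var errPartialFailure = errors.New("partial failure")

func hashSpec(spec string) string {
	sum := sha256.Sum256([]byte(spec))
	return hex.EncodeToString(sum[:])
}

// needsUpdate assumes that a leftover in-flight marker means a previous
// update may have been applied only partially, so a re-apply is required.
func needsUpdate(annotations map[string]string, currentSpec string) bool {
	if _, pending := annotations[inflightAnnotation]; pending {
		return true
	}
	return annotations[lastAppliedAnnotation] != currentSpec
}

// applyUpdate sets the marker before calling update and clears it, together
// with refreshing last-applied, only when update succeeds.
func applyUpdate(annotations map[string]string, currentSpec string, update func(string) error) error {
	annotations[inflightAnnotation] = hashSpec(currentSpec)
	if err := update(currentSpec); err != nil {
		return err // the marker stays, so the next reconcile retries
	}
	annotations[lastAppliedAnnotation] = currentSpec
	delete(annotations, inflightAnnotation)
	return nil
}

func main() {
	annotations := map[string]string{lastAppliedAnnotation: "S1"}

	// The update to S2 fails partially.
	err := applyUpdate(annotations, "S2", func(string) error { return errPartialFailure })
	fmt.Println(err) // partial failure

	// MachineClass is reverted to S1 before the retry; the marker still forces an update.
	fmt.Println(needsUpdate(annotations, "S1")) // true

	// Re-applying S1 succeeds and clears the marker.
	_ = applyUpdate(annotations, "S1", func(string) error { return nil })
	fmt.Println(needsUpdate(annotations, "S1")) // false
}
```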
Option #2 - Preferred
Leverage Machine.Status.LastOperation with Type set to MachineOperationUpdate and State set to MachineStateProcessing. This status will be updated just before calling Driver.UpdateMachine.
Semantically LastOperation captures the details of the operation post-operation and not pre-operation. So this solution would be a divergence from the norm.
4 - Initialize Machine
Post-Create Initialization of Machine Instance
Background
Today the driver.Driver facade represents the boundary between the machine-controller and its various provider specific implementations.
We have abstract operations for creation/deletion and listing of machines (actually compute instances) but we do not correctly handle post-creation initialization logic. Nor do we provide an abstract operation to represent the hot update of an instance after creation.
We have found this to be necessary for several use cases. Today in the MCM AWS Provider, we already misuse driver.GetMachineStatus, which is supposed to be a read-only operation obtaining the status of an instance.
- Each AWS EC2 instance performs source/destination checks by default. For EC2 NAT instances these should be disabled. This is done by issuing a ModifyInstanceAttribute request with the SourceDestCheck set to false. The MCM AWS Provider decodes the AWSProviderSpec, reads providerSpec.SrcAndDstChecksEnabled and correspondingly issues the call to modify the already launched instance. However, this should be done as an action after creating the instance and should not be part of the VM status retrieval.
- Similarly, there is a pending PR to add the Ipv6AddessCount and Ipv6PrefixCount to enable the assignment of an ipv6 address and an ipv6 prefix to instances. This requires constructing and issuing an AssignIpv6Addresses request after the EC2 instance is available.
- We have other use-cases such as MCM Issue#750 where there is a requirement to provide a way for consumers to add tags which can be hot-updated onto instances. This requirement can be generalized to also offer a convenient way to specify tags which can be applied to VMs, NICs, Devices etc.
- We have a need for a “machine-instance-not-ready” taint as described in MCM#740 which should only get removed once the post creation updates are finished.
Objectives
We will split the fulfilment of this overall need into 2 stages of implementation.
- Stage-A: Support post-VM creation initialization logic of the instance using a proposed Driver.InitializeMachine by permitting provider implementors to add initialization logic after VM creation, return with a special new error code codes.Initialization for initialization errors and correspondingly support a new machine operation stage InstanceInitialization which will be updated in the machine LastOperation. The triggerCreationFlow - a reconciliation sub-flow of the MCM responsible for orchestrating instance creation and updating machine status - will be changed to support this behaviour.
- Stage-B: Introduction of Driver.UpdateMachine and enhancing the MCM, MCM providers and gardener extension providers to support hot update of instances through Driver.UpdateMachine. The MCM triggerUpdationFlow - a reconciliation sub-flow of the MCM which is supposed to be responsible for orchestrating instance update, but is currently not used - will be updated to invoke the provider Driver.UpdateMachine on hot-updates to the Machine object.
Stage-A Proposal
Current MCM triggerCreationFlow
Today, reconcileClusterMachine, which is the main routine for the Machine object reconciliation, invokes triggerCreationFlow at the end when the machine.Spec.ProviderID is empty or if the machine.Status.CurrentStatus.Phase is empty or in CrashLoopBackOff.
%%{ init: { 'themeVariables': { 'fontSize': '12px'} } }%%
flowchart LR
other["..."] -->chk{"machine ProviderID empty OR Phase empty or CrashLoopBackOff ? "}--yes-->triggerCreationFlow
chk--no-->LongRetry["return machineutils.LongRetry"]
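The branching condition in the diagram can be written as a single predicate. The Go sketch below uses a hypothetical `shouldTriggerCreation` helper with plain strings for the provider ID and phase; it is not the actual MCM code.

```go
package main

import "fmt"

// shouldTriggerCreation mirrors the check shown in the flowchart: creation is
// attempted when the provider ID is empty, or when the phase is empty or
// CrashLoopBackOff. Otherwise the controller returns a long retry.
func shouldTriggerCreation(providerID, phase string) bool {
	return providerID == "" || phase == "" || phase == "CrashLoopBackOff"
}

func main() {
	fmt.Println(shouldTriggerCreation("", ""))                           // true
	fmt.Println(shouldTriggerCreation("aws:///i-1", "CrashLoopBackOff")) // true
	fmt.Println(shouldTriggerCreation("aws:///i-1", "Running"))          // false
}
```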
Today, the triggerCreationFlow is as illustrated below, with some minor details omitted/compressed for brevity.
NOTES
- The lastop below is an abbreviation for machine.Status.LastOperation. This, along with the machine phase, is generally updated on the Machine object just before returning from the method.
- Regarding phase=CrashLoopBackOff|Failed: the machine phase may either be CrashLoopBackOff or move to Failed if the difference between current time and the machine.CreationTimestamp has exceeded the configured MachineCreationTimeout.
%%{ init: { 'themeVariables': { 'fontSize': '12px'} } }%%
flowchart TD
end1(("end"))
begin((" "))
medretry["return MediumRetry, err"]
shortretry["return ShortRetry, err"]
medretry-->end1
shortretry-->end1
begin-->AddBootstrapTokenToUserData
  -->gms["statusResp,statusErr=driver.GetMachineStatus(...)"]
  -->chkstatuserr{"Check statusErr"}
chkstatuserr--notFound-->chknodelbl{"Chk Node Label"}
chkstatuserr--else-->createFailed["lastop.Type=Create,lastop.state=Failed,phase=CrashLoopBackOff|Failed"]-->medretry
chkstatuserr--nil-->initnodename["nodeName = statusResp.NodeName"]-->setnodename
chknodelbl--notset-->createmachine["createResp, createErr=driver.CreateMachine(...)"]-->chkCreateErr{"Check createErr"}
chkCreateErr--notnil-->createFailed
chkCreateErr--nil-->getnodename["nodeName = createResp.NodeName"]
  -->chkstalenode{"nodeName != machine.Name\n//chk stale node"}
chkstalenode--false-->setnodename["if unset machine.Labels['node']= nodeName"]
  -->machinepending["if empty/crashloopbackoff lastop.type=Create,lastop.State=Processing,phase=Pending"]
  -->shortretry
chkstalenode--true-->delmachine["driver.DeleteMachine(...)"]
  -->permafail["lastop.type=Create,lastop.state=Failed,Phase=Failed"]
  -->shortretry
subgraph noteA [" "]
  permafail -.- note1(["VM was referring to stale node obj"])
end
style noteA opacity:0
subgraph noteB [" "]
  setnodename-.- note2(["Proposal: Introduce Driver.InitializeMachine after this"])
end
Enhancement of MCM triggerCreationFlow
Relevant Observations on Current Flow
- Observe that we always perform a call to Driver.GetMachineStatus and only then conditionally perform a call to Driver.CreateMachine if there was no machine found.
- Observe that after a successful call to Driver.CreateMachine, the machine phase is set to Pending, the LastOperation.Type is currently set to Create and the LastOperation.State set to Processing before returning with a ShortRetry. The LastOperation.Description is (unfortunately) set to the fixed message: Creating machine on cloud provider.
- Observe that after an erroneous call to Driver.CreateMachine, the machine phase is set to CrashLoopBackOff or Failed (in case of creation timeout).
The following changes are proposed with a view towards minimal impact on current code and no introduction of a new Machine Phase.
MCM Changes
- We propose introducing a new machine operation Driver.InitializeMachine with the following signature:

      type Driver interface {
          // .. existing methods are omitted for brevity.

          // InitializeMachine call is responsible for post-create initialization of the provider instance.
          InitializeMachine(context.Context, *InitializeMachineRequest) error
      }

      // InitializeMachineRequest is the initialization request for machine instance initialization
      type InitializeMachineRequest struct {
          // Machine object whose VM instance should be initialized
          Machine *v1alpha1.Machine

          // MachineClass backing the machine object
          MachineClass *v1alpha1.MachineClass

          // Secret backing the machineClass object
          Secret *corev1.Secret
      }
- We propose introducing a new MC error code codes.Initialization, indicating that the VM instance was created but there was an error in initialization after VM creation. The implementor of Driver.InitializeMachine can return this error code, indicating that InitializeMachine needs to be called again. The Machine Controller will change the phase to CrashLoopBackOff as usual when encountering a codes.Initialization error.
- We will introduce a new machine operation stage InstanceInitialization. In case of a codes.Initialization error:
  - the machine.Status.LastOperation.Description will be set to InstanceInitialization,
  - the machine.Status.LastOperation.ErrorCode will be set to codes.Initialization,
  - the LastOperation.Type will be set to Create,
  - the LastOperation.State will be set to Failed before returning with a ShortRetry.
- The semantics of Driver.GetMachineStatus will be changed. If the instance associated with the machine exists but was not initialized as expected, the provider implementations of GetMachineStatus should return an error: status.Error(codes.Initialization).
- If Driver.GetMachineStatus returns an error encapsulating codes.Initialization, then Driver.InitializeMachine will be invoked again in the triggerCreationFlow.
- As per the usual logic, the main machine controller reconciliation loop will re-invoke the triggerCreationFlow if the machine phase is CrashLoopBackOff.
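The proposed control flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MCM implementation: Code, StatusError, Machine, the string-valued retry periods and the simplified, argument-free Driver methods are hypothetical stand-ins, and the Failed-on-timeout case and node label handling are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Code and StatusError are hypothetical stand-ins for MCM's machine error
// codes and status errors; only the codes needed by this sketch are listed.
type Code int

const (
	NotFound Code = iota
	Initialization
)

type StatusError struct {
	Code Code
	Msg  string
}

func (e *StatusError) Error() string { return e.Msg }

// hasCode reports whether err wraps a StatusError carrying the given code.
func hasCode(err error, c Code) bool {
	var se *StatusError
	return errors.As(err, &se) && se.Code == c
}

// Driver lists only the calls relevant here, with simplified signatures.
type Driver interface {
	GetMachineStatus() error
	CreateMachine() error
	InitializeMachine() error
}

// Machine carries the status fields discussed in the proposal (flattened).
type Machine struct {
	Phase, LastOpType, LastOpState, LastOpDescription, LastOpErrorCode string
}

// fail records a failed Create operation and returns the retry period name.
func fail(m *Machine, desc, errCode, retry string, err error) (string, error) {
	m.Phase, m.LastOpType, m.LastOpState = "CrashLoopBackOff", "Create", "Failed"
	m.LastOpDescription, m.LastOpErrorCode = desc, errCode
	return retry, err
}

func triggerCreationFlow(d Driver, m *Machine) (string, error) {
	statusErr := d.GetMachineStatus()
	needsInit := hasCode(statusErr, Initialization)
	switch {
	case statusErr == nil || needsInit:
		// The instance already exists, so nothing has to be created.
	case hasCode(statusErr, NotFound):
		if err := d.CreateMachine(); err != nil {
			return fail(m, err.Error(), "", "MediumRetry", err)
		}
		needsInit = true // a freshly created instance still has to be initialized
	default:
		return fail(m, statusErr.Error(), "", "MediumRetry", statusErr)
	}
	if needsInit {
		if err := d.InitializeMachine(); err != nil {
			if hasCode(err, Initialization) {
				return fail(m, "InstanceInitialization", "Initialization", "ShortRetry", err)
			}
			return fail(m, err.Error(), "", "MediumRetry", err)
		}
	}
	m.Phase, m.LastOpType, m.LastOpState = "Pending", "Create", "Processing"
	m.LastOpDescription, m.LastOpErrorCode = "Creating machine on cloud provider", ""
	return "ShortRetry", nil
}

// fakeDriver is a test double that returns preset errors and counts calls.
type fakeDriver struct {
	statusErr, createErr, initErr error
	createCalls, initCalls        int
}

func (f *fakeDriver) GetMachineStatus() error  { return f.statusErr }
func (f *fakeDriver) CreateMachine() error     { f.createCalls++; return f.createErr }
func (f *fakeDriver) InitializeMachine() error { f.initCalls++; return f.initErr }

func main() {
	d := &fakeDriver{
		statusErr: &StatusError{Code: NotFound, Msg: "instance not found"},
		initErr:   &StatusError{Code: Initialization, Msg: "initialization failed"},
	}
	m := &Machine{}
	retry, err := triggerCreationFlow(d, m) // create succeeds, initialization fails
	fmt.Println(m.Phase, retry, err)

	// Next reconcile: the instance exists but is reported as uninitialized.
	d.statusErr, d.initErr = &StatusError{Code: Initialization, Msg: "uninitialized"}, nil
	retry, err = triggerCreationFlow(d, m)
	fmt.Println(m.Phase, retry, err, "create calls:", d.createCalls)
}
```

In this sketch the second reconcile skips CreateMachine entirely and only repeats the initialization step, which is the behaviour the changed GetMachineStatus semantics are meant to enable.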
Illustration
AWS Provider Changes
Driver.InitializeMachine
The implementation for the AWS Provider will look something like:
- After the VM instance is available, check providerSpec.SrcAndDstChecksEnabled, construct ModifyInstanceAttributeInput and call ModifyInstanceAttribute. In case of an error, return codes.Initialization instead of the current codes.Internal.
- Check providerSpec.NetworkInterfaces and if Ipv6PrefixCount is not nil, then construct AssignIpv6AddressesInput and call AssignIpv6Addresses. In case of an error, return codes.Initialization. Don't use the generic codes.Internal.

The existing Ipv6 PR will need modifications.
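A rough sketch of these two steps is shown below. It deliberately avoids the real AWS SDK: the EC2 interface is a hypothetical, simplified facade over the ModifyInstanceAttribute and AssignIpv6Addresses calls, ErrInitialization stands in for status.Error(codes.Initialization, ...), and the ID field on NetworkInterface as well as the pointer types of the spec fields are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInitialization is a hypothetical stand-in for an error carrying codes.Initialization.
var ErrInitialization = errors.New("codes.Initialization")

// IsInitialization reports whether err wraps ErrInitialization.
func IsInitialization(err error) bool { return errors.Is(err, ErrInitialization) }

// NetworkInterface and ProviderSpec model only the fields used in this sketch.
type NetworkInterface struct {
	ID              string // hypothetical: identifier of the instance's network interface
	Ipv6PrefixCount *int64
}

type ProviderSpec struct {
	SrcAndDstChecksEnabled *bool
	NetworkInterfaces      []NetworkInterface
}

// EC2 is a simplified, hypothetical facade over the two EC2 calls named above.
type EC2 interface {
	ModifySourceDestCheck(instanceID string, enabled bool) error
	AssignIpv6Addresses(eniID string, prefixCount int64) error
}

// InitializeMachine applies post-create settings and maps every failure to
// ErrInitialization so that the controller retries initialization only.
func InitializeMachine(client EC2, instanceID string, spec ProviderSpec) error {
	if spec.SrcAndDstChecksEnabled != nil {
		if err := client.ModifySourceDestCheck(instanceID, *spec.SrcAndDstChecksEnabled); err != nil {
			return fmt.Errorf("%w: modifying source/destination check: %v", ErrInitialization, err)
		}
	}
	for _, ni := range spec.NetworkInterfaces {
		if ni.Ipv6PrefixCount == nil {
			continue
		}
		if err := client.AssignIpv6Addresses(ni.ID, *ni.Ipv6PrefixCount); err != nil {
			return fmt.Errorf("%w: assigning IPv6 addresses: %v", ErrInitialization, err)
		}
	}
	return nil
}

// fakeEC2 is a test double that records calls and can fail the IPv6 step.
type fakeEC2 struct {
	failIpv6 bool
	calls    []string
}

func (f *fakeEC2) ModifySourceDestCheck(string, bool) error {
	f.calls = append(f.calls, "ModifySourceDestCheck")
	return nil
}

func (f *fakeEC2) AssignIpv6Addresses(string, int64) error {
	f.calls = append(f.calls, "AssignIpv6Addresses")
	if f.failIpv6 {
		return errors.New("request throttled")
	}
	return nil
}

func main() {
	off, count := false, int64(1)
	spec := ProviderSpec{
		SrcAndDstChecksEnabled: &off,
		NetworkInterfaces:      []NetworkInterface{{ID: "eni-example", Ipv6PrefixCount: &count}},
	}
	err := InitializeMachine(&fakeEC2{failIpv6: true}, "i-example", spec)
	fmt.Println(IsInitialization(err), err)
}
```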
Driver.GetMachineStatus
- If providerSpec.SrcAndDstChecksEnabled is false, check ec2.Instance.SourceDestCheck. If it does not match, then return status.Error(codes.Initialization).
- Check providerSpec.NetworkInterfaces and if Ipv6PrefixCount is not nil, check ec2.Instance.NetworkInterfaces and verify that InstanceNetworkInterface.Ipv6Addresses has a non-nil slice. If this is not the case, then return status.Error(codes.Initialization).
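The two checks could look roughly like the sketch below. The types are local, simplified stand-ins for the provider spec and the EC2 instance description, ErrInitialization stands in for status.Error(codes.Initialization), and matching spec interfaces to instance interfaces by index is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInitialization is a hypothetical stand-in for status.Error(codes.Initialization).
var ErrInitialization = errors.New("codes.Initialization")

// Simplified stand-ins for the instance description returned by EC2.
type InstanceNetworkInterface struct {
	Ipv6Addresses []string
}

type Instance struct {
	SourceDestCheck   *bool
	NetworkInterfaces []InstanceNetworkInterface
}

// Simplified stand-ins for the relevant provider spec fields.
type SpecNetworkInterface struct {
	Ipv6PrefixCount *int64
}

type ProviderSpec struct {
	SrcAndDstChecksEnabled *bool
	NetworkInterfaces      []SpecNetworkInterface
}

// checkInitialized returns ErrInitialization if the instance exists but its
// post-create settings do not yet match the provider spec.
func checkInitialized(spec ProviderSpec, inst Instance) error {
	// The spec disables the source/destination check: the instance must agree.
	if spec.SrcAndDstChecksEnabled != nil && !*spec.SrcAndDstChecksEnabled {
		if inst.SourceDestCheck == nil || *inst.SourceDestCheck {
			return ErrInitialization
		}
	}
	// Assumption: spec and instance interfaces correspond by index.
	for i, ni := range spec.NetworkInterfaces {
		if ni.Ipv6PrefixCount == nil {
			continue
		}
		if i >= len(inst.NetworkInterfaces) || inst.NetworkInterfaces[i].Ipv6Addresses == nil {
			return ErrInitialization
		}
	}
	return nil
}

func main() {
	off, on := false, true
	spec := ProviderSpec{SrcAndDstChecksEnabled: &off}
	fmt.Println(checkInitialized(spec, Instance{SourceDestCheck: &on}))  // not yet initialized
	fmt.Println(checkInitialized(spec, Instance{SourceDestCheck: &off})) // initialized
}
```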
Instance Not Ready Taint
- Because the creation flow for machines will now be enhanced to correctly support post-creation startup logic, we should not schedule workloads until this startup logic is complete. Even without this feature we have a need for such a taint, as described in MCM#740.
- We propose a new taint node.machine.sapcloud.io/instance-not-ready which will be added as a node startup taint in gardener core KubeletConfiguration.RegisterWithTaints.
- The taint will then be removed by MCM in health check reconciliation once the machine becomes fully ready (when moving to the Running phase).
- We will add this taint as part of --ignore-taint in CA.
- We will introduce a disclaimer / prerequisite in the MCM FAQ to add this taint as part of the kubelet config under --register-with-taints; otherwise workloads could get scheduled before the machine becomes Running.
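The taint removal step might be sketched as below. Taint is a minimal local stand-in for the Kubernetes core taint type, the function names are hypothetical, and the NoSchedule effect used in the example is an assumption, since the proposal does not name an effect.

```go
package main

import "fmt"

// InstanceNotReadyTaintKey is the taint key proposed above.
const InstanceNotReadyTaintKey = "node.machine.sapcloud.io/instance-not-ready"

// Taint is a minimal stand-in for the Kubernetes core taint type.
type Taint struct {
	Key, Value, Effect string
}

// removeTaint returns the taints without the given key and reports whether
// anything was removed. The input slice is left unmodified.
func removeTaint(taints []Taint, key string) ([]Taint, bool) {
	kept := make([]Taint, 0, len(taints))
	removed := false
	for _, t := range taints {
		if t.Key == key {
			removed = true
			continue
		}
		kept = append(kept, t)
	}
	return kept, removed
}

// reconcileTaints drops the instance-not-ready taint only once the machine
// has reached the Running phase; in every other phase taints are kept as-is.
func reconcileTaints(phase string, taints []Taint) ([]Taint, bool) {
	if phase != "Running" {
		return taints, false
	}
	return removeTaint(taints, InstanceNotReadyTaintKey)
}

func main() {
	taints := []Taint{
		{Key: InstanceNotReadyTaintKey, Effect: "NoSchedule"}, // effect is an assumption
		{Key: "example.com/other", Effect: "NoSchedule"},
	}
	for _, phase := range []string{"Pending", "Running"} {
		got, changed := reconcileTaints(phase, taints)
		fmt.Println(phase, changed, len(got))
	}
}
```

Returning a changed flag lets the caller skip the node update when there is nothing to remove, which keeps repeated health check reconciliations cheap.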
Stage-B Proposal
Enhancement of Driver Interface for Hot Updation
Kindly refer to the Hot-Update Instances design which provides elaborate detail.