

Excess Reserve Capacity


Currently, the autoscaler optimizes the number of machines for a given application workload. Along with effective resource utilization, this brings a concern: many times, newly created application instances don’t find space in the existing cluster. This leads the cluster-autoscaler to create new machines via MachineDeployment, and it can take from 3-4 minutes to ~10 minutes for a machine to really come up and join the cluster. In turn, the application instances have to wait until the new machines join the cluster.

One of the promising solutions to this issue is Excess Reserve Capacity. The idea is to keep a certain number of machines, or a certain percentage of resources (CPU/memory), always available, so that new workload can in general be scheduled immediately, barring a huge spike in the workload. The user should also be given enough flexibility to choose how many resources or how many machines are kept alive yet unutilized, as this directly affects cost.


  • We decided to go with Approach 4, which is based on low-priority pods; the details are described under that approach below.
  • Approach 3 looks more promising in the long term; we may decide to adopt it in the future based on developments/contributions in the autoscaler community.

Possible Approaches

Following are the possible approaches we could think of so far.

Approach 1: Enhance the Machine-controller-manager to also manage the excess machines

  • The Machine-controller-manager currently takes care of the machines in the shoot cluster, from creation, deletion, and health checks to efficient rolling updates of the machines. From the architecture point of view, a MachineSet makes sure that X machines are always running and healthy, and the MachineDeployment controller smartly uses this facility to perform rolling updates.

  • We can expand the scope of the MachineDeployment controller to maintain an excess number of machines by introducing a new, parallel, independent controller named the MachineTaint controller. This will result in MCM including the Machine, MachineSet, MachineDeployment, MachineSafety, and MachineTaint controllers. The MachineTaint controller does not need to introduce any new CRD - the analogy fits, as the taint controller likewise resides in the kube-controller-manager.

  • The only job of the MachineTaint controller will be to:

    • List all the Machines under each MachineDeployment.
    • Maintain the noSchedule and noExecute taints on the X latest Machine objects.
    • Use an event-based informer mechanism, so that the MachineTaint controller gets to know about any update/delete/create event on Machine objects and, in turn, maintains the noSchedule and noExecute taints on all the latest machines.
      • Why the latest machines? Whenever the autoscaler decides to add new machines - essentially a scale-up event - the taints are removed from the older machines and the newer machines get the taints. This way, X machines immediately become free for new pods to be scheduled.
      • During a scale-down event, the autoscaler specifically mentions which machines should be deleted, so that should not bring any concerns. Though we will have to put a proper label/annotation defined by the autoscaler on the tainted machines, so that the autoscaler does not consider the tainted machines for deletion while scaling down.
      • Annotation on the tainted node: "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"
  • Implementation Details:

    • Add a new optional field ExcessReplicas to MachineDeployment.Spec. The MachineDeployment controller then adds both Spec.Replicas and Spec.ExcessReplicas (if provided) and considers the sum as the standard desiredReplicas; see the sketch after this list. The current working of MCM will not be affected if the ExcessReplicas field is kept nil.
    • The machine-controller currently reads the Node object and sets the MachineConditions in the Machine object. It will now also read the taints/labels from the Machine object and maintain them on the Node object.
  • We expect the cluster-autoscaler to intelligently make use of this feature provided by MCM.

    • CA gets the min:max:excess input from Gardener. CA continues to set MachineDeployment.Spec.Replicas as usual, based on the application workload.
    • In addition, CA also sets MachineDeployment.Spec.ExcessReplicas.
    • Corner case: CA should decrement the excessReplicas field accordingly when desiredReplicas+excessReplicas on the MachineDeployment would go beyond max.
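A minimal sketch, in Go, of the replica arithmetic proposed above. Note that ExcessReplicas does not exist in the current MachineDeployment API; the field, the nil default, and the capping rule are taken directly from the bullets above.

package main

import "fmt"

// Sketch only: ExcessReplicas is the proposed optional field from this
// approach, not part of the current MachineDeployment API.
type machineDeploymentSpec struct {
	Replicas       int32  // set by the cluster-autoscaler from the workload
	ExcessReplicas *int32 // proposed; nil keeps today's behaviour
}

// desiredReplicas is what the MachineDeployment controller would hand down
// to its MachineSet: Spec.Replicas plus Spec.ExcessReplicas, if provided.
func desiredReplicas(spec machineDeploymentSpec) int32 {
	desired := spec.Replicas
	if spec.ExcessReplicas != nil {
		desired += *spec.ExcessReplicas
	}
	return desired
}

// capExcess covers the corner case above: CA decrements excess so that
// replicas+excess never exceeds the configured max.
func capExcess(replicas, excess, max int32) int32 {
	if replicas+excess > max {
		excess = max - replicas
	}
	if excess < 0 {
		excess = 0
	}
	return excess
}

func main() {
	excess := capExcess(9, 3, 10) // max is 10, so excess shrinks to 1
	spec := machineDeploymentSpec{Replicas: 9, ExcessReplicas: &excess}
	fmt.Println(desiredReplicas(spec)) // prints 10
}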

Approach 2: Enhance Cluster-autoscaler by simulating fake pods in it

Approach 3: Enhance cluster-autoscaler to support pluggable scaling-events

  • The forked version of the cluster-autoscaler could be improved to plug in the algorithm for excess reserve capacity.
  • Needs further discussion around upstream support.
  • Create a golang channel to separate the algorithms that trigger scaling (currently hard-coded in the cluster-autoscaler) from the algorithms for how to achieve the scaling (already pluggable in the cluster-autoscaler); see the sketch below. This kind of separation can help us introduce/plug in new algorithms (such as one based on node resource utilisation) without affecting the existing code-base too much, while almost completely reusing the code-base for the actual scaling.
  • This approach is also not specific to our fork of the cluster-autoscaler; it could eventually be made upstream as well.
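A minimal sketch of that channel-based separation; all names here are hypothetical and only illustrate the decoupling, not actual cluster-autoscaler code:

package main

import "fmt"

// ScaleRequest decouples *when* to scale (the trigger algorithms) from *how*
// to scale (the already pluggable scaling logic).
type ScaleRequest struct {
	NodeGroup string
	Delta     int // positive = scale up, negative = scale down
}

// utilisationTrigger is one possible trigger algorithm; an excess-reserve
// trigger, or today's pending-pods trigger, would write to the same channel.
func utilisationTrigger(requests chan<- ScaleRequest) {
	// ...watch node resource utilisation, then e.g.:
	requests <- ScaleRequest{NodeGroup: "worker-pool-1", Delta: 1}
}

func main() {
	requests := make(chan ScaleRequest)
	go utilisationTrigger(requests)

	// The consuming side reuses the existing scaling code path unchanged.
	req := <-requests
	fmt.Printf("scale %s by %d\n", req.NodeGroup, req.Delta)
}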

Approach 4: Make intelligent use of Low-priority pods

  • Refer to: pod-priority-preemption
  • TL;DR:
    • High-priority pods can preempt low-priority pods which are already scheduled.
    • Pre-create a bunch [equivalent of X shoot-control-planes] of low-priority pods with a priority of zero, then start creating the workload pods with a higher priority, which will preempt and reschedule the low-priority pods - or otherwise keep them in a Pending state if the limit for max machines has been reached. A sketch follows below.
    • This is still an alpha feature.
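A rough sketch of such a capacity reserve, expressed with today's client-go API types; the PriorityClass name, replica count, image, and resource requests are all illustrative, not part of this proposal:

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// A PriorityClass with value 0: any pod with a higher priority can
	// preempt pods that use this class.
	pc := schedulingv1.PriorityClass{
		ObjectMeta: metav1.ObjectMeta{Name: "reserve-capacity"},
		Value:      0,
	}

	// X replicas of a "pause" pod, each sized to the capacity to reserve.
	// Workload pods created with a higher priority preempt these; the
	// evicted pause pods go Pending, which drives the autoscaler to add
	// machines (or they stay Pending once max machines is reached).
	labels := map[string]string{"app": "capacity-reserve"}
	reserve := appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "capacity-reserve"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3), // "X", e.g. sized per shoot-control-planes
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					PriorityClassName: pc.Name,
					Containers: []corev1.Container{{
						Name:  "pause",
						Image: "k8s.gcr.io/pause:3.1",
						Resources: corev1.ResourceRequirements{
							Requests: corev1.ResourceList{
								corev1.ResourceCPU:    resource.MustParse("1"),
								corev1.ResourceMemory: resource.MustParse("2Gi"),
							},
						},
					}},
				},
			},
		},
	}
	// These objects would then be applied with client-go or kubectl.
	fmt.Println(pc.Name, *reserve.Spec.Replicas)
}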


GRPC based implementation of Cloud Providers - WIP


Currently the Cloud Providers’ (CP) functionalities (Create(), Delete(), List()) are part of the Machine Controller Manager’s (MCM) repository. Because of this, adding support for a new CP requires merging code into MCM which may not be relevant to the core functionality of MCM itself. Also, for various reasons it may not be feasible for all CPs to merge their code with MCM, which is an open-source project.

Because of these reasons, it was decided that the CPs’ code will be moved out into separate repositories so that it can be maintained separately by the respective teams. The idea is to make MCM act as a GRPC server and the CPs as GRPC clients. A CP can register itself with the MCM using a GRPC service exposed by the MCM. Details of this approach are discussed below.

How it works:

MCM acts as a GRPC server and listens on a pre-defined port, 5000. It implements the GRPC services below. Details of each of these services are given in the next section.

  • Register()
  • GetMachineClass()
  • GetSecret()

GRPC services exposed by MCM:


rpc Register(stream DriverSide) returns (stream MCMside) {}

The CP GRPC client calls this service to register itself with the MCM. The CP passes the kind and the APIVersion which it implements, and MCM maintains an internal map of all the registered clients. A GRPC stream is returned in response, which is kept open throughout the life of both processes. MCM uses this stream to communicate with the client for machine operations: Create(), Delete() or List(). The CP client is responsible for reading the incoming messages continuously and, based on the operationType parameter embedded in each message, taking the required action; a conceptual sketch of this dispatch loop follows the interface below. This part is already handled in the package grpc/infraclient. To add a new CP client, import the package and implement the ExternalDriverProvider interface:

type ExternalDriverProvider interface {
	// Create provisions a new machine for the given machine class and credentials.
	Create(machineclass *MachineClassMeta, credentials, machineID, machineName string) (string, string, error)
	// Delete removes the machine identified by machineID.
	Delete(machineclass *MachineClassMeta, credentials, machineID string) error
	// List returns the machines known to the provider for the given machine class.
	List(machineclass *MachineClassMeta, credentials, machineID string) (map[string]string, error)
}

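Conceptually, the read loop inside grpc/infraclient dispatches on the operation type roughly as follows. This is a simplified, self-contained sketch: the message and stream types are illustrative stand-ins, not the actual grpc/infrapb definitions.

package main

import (
	"fmt"
	"io"
)

// mcmMessage stands in for the real grpc/infrapb message type; only the
// operationType-based dispatch is the point here.
type mcmMessage struct {
	OperationType string // "create", "delete" or "list"
	MachineClass  string
	MachineID     string
	MachineName   string
}

type mcmStream interface {
	Recv() (*mcmMessage, error)
}

// dispatch is roughly what grpc/infraclient does on behalf of a registered
// provider: read messages off the open Register() stream and fan out to the
// ExternalDriverProvider methods (elided here as prints).
func dispatch(stream mcmStream) error {
	for {
		msg, err := stream.Recv()
		if err == io.EOF {
			return nil // MCM closed the stream; the client would re-register
		}
		if err != nil {
			return err
		}
		switch msg.OperationType {
		case "create":
			fmt.Println("would call provider.Create for", msg.MachineName)
		case "delete":
			fmt.Println("would call provider.Delete for", msg.MachineID)
		case "list":
			fmt.Println("would call provider.List for", msg.MachineClass)
		}
		// the result is then sent back to MCM on the same stream
	}
}

type fakeStream struct{ msgs []*mcmMessage }

func (f *fakeStream) Recv() (*mcmMessage, error) {
	if len(f.msgs) == 0 {
		return nil, io.EOF
	}
	m := f.msgs[0]
	f.msgs = f.msgs[1:]
	return m, nil
}

func main() {
	_ = dispatch(&fakeStream{msgs: []*mcmMessage{
		{OperationType: "create", MachineName: "machine-1"},
	}})
}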

rpc GetMachineClass(MachineClassMeta) returns (MachineClass) {}

As part of the message from MCM for the various machine operations, the name of the machine class is sent instead of the full machine class spec. The CP client is expected to use this GRPC service to get the full spec of the machine class. This optionally enables the client to cache the machine class spec and make the call only if the spec is not already cached.


rpc GetSecret(SecretMeta) returns (Secret) {}

As part of the message from MCM for the various machine operations, the Cloud Config (CC) and the CP credentials are not sent. The CP client is expected to use this GRPC service to get the secret which has the CC and the CP’s credentials from MCM. This enables the client to cache the CC and credentials, and to make the call only if the data is not already cached.
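The caching suggested for both GetMachineClass() and GetSecret() boils down to a simple cache-aside pattern. A minimal sketch, with fetch standing in for the actual GRPC call:

package main

import (
	"fmt"
	"sync"
)

// grpcCache is a hypothetical cache-aside wrapper; the same pattern serves
// both GetMachineClass and GetSecret.
type grpcCache struct {
	mu      sync.Mutex
	entries map[string]string
	fetch   func(name string) (string, error) // the actual GRPC call
}

func (c *grpcCache) get(name string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[name]; ok {
		return v, nil // cache hit: no GRPC round trip
	}
	v, err := c.fetch(name)
	if err != nil {
		return "", err
	}
	c.entries[name] = v
	return v, nil
}

func main() {
	calls := 0
	c := &grpcCache{
		entries: map[string]string{},
		fetch: func(name string) (string, error) {
			calls++ // in reality: rpc GetMachineClass or GetSecret
			return "spec-of-" + name, nil
		},
	}
	c.get("machine-class-aws")
	c.get("machine-class-aws") // second call served from the cache
	fmt.Println("GRPC calls made:", calls) // prints 1
}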

How to add a new Cloud Provider’s support

Import the packages grpc/infraclient and grpc/infrapb from MCM (currently in MCM’s “grpc-driver” branch).

  • Implement the interface ExternalDriverProvider (a skeleton sketch follows below):
    • Create(): Creates a new machine
    • Delete(): Deletes a machine
    • List(): Lists machines
  • Use the interface MachineClassDataProvider:
    • GetMachineClass(): Makes the call to MCM to get the machine class spec
    • GetSecret(): Makes the call to MCM to get the secret containing the Cloud Config and the CP’s credentials
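A hypothetical skeleton of such a provider, with stubbed SDK calls. MachineClassMeta is redefined here only to keep the sketch self-contained; in a real provider it comes from grpc/infraclient:

package main

import "fmt"

// MachineClassMeta is a stand-in for the type exported by MCM's
// grpc/infraclient package; its fields here are illustrative.
type MachineClassMeta struct {
	Name     string
	Revision int32
}

// myDriver is a hypothetical provider implementing ExternalDriverProvider.
type myDriver struct{}

// Create provisions a machine using the provider's SDK (stubbed here) and
// returns the two strings expected by the interface.
func (d *myDriver) Create(machineclass *MachineClassMeta, credentials, machineID, machineName string) (string, string, error) {
	// call the provider's SDK here, using the machine class spec fetched
	// via GetMachineClass() and the credentials fetched via GetSecret()
	return "provider://" + machineID, machineName, nil
}

// Delete terminates the machine identified by machineID (stubbed).
func (d *myDriver) Delete(machineclass *MachineClassMeta, credentials, machineID string) error {
	return nil
}

// List returns the machines created for this machine class (stubbed).
func (d *myDriver) List(machineclass *MachineClassMeta, credentials, machineID string) (map[string]string, error) {
	return map[string]string{"provider://m-1": "node-1"}, nil
}

func main() {
	var d myDriver
	id, name, _ := d.Create(&MachineClassMeta{Name: "my-class"}, "creds", "m-1", "node-1")
	fmt.Println(id, name)
}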

Example implementation:

Refer to the GRPC based implementation for the AWS client: