Enhanced Node Management: Introducing In-Place Updates in Gardener
5 minute read
Gardener is committed to providing efficient and flexible Kubernetes cluster management. Traditionally, updates to worker pool configurations, such as machine image or Kubernetes minor version changes, trigger a rolling update. This process involves replacing existing nodes with new ones, which is a robust approach for many scenarios. However, for environments with physical or bare-metal nodes, or stateful workloads sensitive to node replacement, or if the virtual machine type is scarce, this can introduce challenges like extended update times and potential disruptions.
To address these needs, Gardener now introduces In-Place Node Updates. This new capability allows certain updates to be applied directly to existing worker nodes without requiring their replacement, significantly reducing disruption and speeding up update processes for compatible changes.
New Update Strategies for Worker Pools
Gardener now supports three distinct update strategies for your worker pools, configurable via the updateStrategy
field in the Shoot
specification’s worker pool definition:
AutoRollingUpdate
: This is the classic and default strategy. When updates occur, nodes are cordoned, drained, terminated, and replaced with new nodes incorporating the changes.AutoInPlaceUpdate
: With this strategy, compatible updates are applied directly to the existing nodes. The MachineControllerManager (MCM) automatically selects nodes, cordons and drains them, and then signals the Gardener Node Agent (GNA) to perform the update. Once GNA confirms success, MCM uncordons the node.ManualInPlaceUpdate
: This strategy also applies updates directly to existing nodes but gives operators fine-grained control. After an update is specified, MCM marks all nodes in the pool as candidates. Operators must then manually label individual nodes to select them for the in-place update process, which then proceeds similarly to theAutoInPlaceUpdate
strategy.
The AutoInPlaceUpdate
and ManualInPlaceUpdate
strategies are available when the InPlaceNodeUpdates
feature gate is enabled in the gardener-apiserver
.
What Can Be Updated In-Place?
In-place updates are designed to handle a variety of common operational tasks more efficiently:
- Machine Image Updates: Newer versions of a machine image can be rolled out by executing an update command directly on the node, provided the image and cloud profile are configured to support this.
- Kubernetes Minor Version Updates: Updates to the Kubernetes minor version of worker nodes can be applied in-place.
- Kubelet Configuration Changes: Modifications to the Kubelet configuration can be applied directly.
- Credentials Rotation: Critical for security, rotation of Certificate Authorities (CAs) and ServiceAccount signing keys can now be performed on existing nodes without replacement.
However, some changes still necessitate a rolling update (node replacement):
- Changing the machine image name (e.g., switching from Ubuntu to Garden Linux).
- Modifying the machine type.
- Altering volume types or sizes.
- Changing the Container Runtime Interface (CRI) name (e.g., from Docker to containerd).
- Enabling or disabling node-local DNS.
Key API and Component Adaptations
Several Gardener components and APIs have been enhanced to support in-place updates:
- CloudProfile: The
CloudProfile
API now allows specifyinginPlaceUpdates
configuration withinmachineImage.versions
. This includes a booleansupported
field to indicate if a version supports in-place updates and an optionalminVersionForUpdate
string to define the minimum OS version from which an in-place update to the current version is permissible. - Shoot Specification: As mentioned, the
spec.provider.workers[].updateStrategy
field allows selection of the desired update strategy. Additionally,spec.provider.workers[].machineControllerManagerSettings
now includesmachineInPlaceUpdateTimeout
anddisableHealthTimeout
(which defaults totrue
for in-place strategies to prevent premature machine deletion during lengthy updates). ForManualInPlaceUpdate
,maxSurge
defaults to0
andmaxUnavailable
to1
. - OperatingSystemConfig (OSC): The OSC resource, managed by OS extensions, now includes
status.inPlaceUpdates.osUpdate
where extensions can specify thecommand
andargs
for the Gardener Node Agent to execute for machine image (Operating System) updates. Thespec.inPlaceUpdates
field in the OSC will carry information like the target Operating System version, Kubelet version, and credential rotation status to the node. - Gardener Node Agent (GNA): GNA is responsible for executing the in-place updates on the node. It watches for a specific node condition (
InPlaceUpdate
with reasonReadyForUpdate
) set by MCM, performs the OS update, Kubelet updates, or credentials rotation, restarts necessary pods (like DaemonSets), and then labels the node with the update outcome. - MachineControllerManager (MCM): MCM orchestrates the in-place update process. For in-place strategies, while new machine classes and machine sets are created to reflect the desired state, the actual machine objects are not deleted and recreated. Instead, their ownership is transferred to the new machine set. MCM handles cordoning, draining, and setting node conditions to coordinate with GNA.
- Shoot Status & Constraints: To provide visibility, the
status.inPlaceUpdates.pendingWorkerUpdates
field in theShoot
now lists worker pools pendingautoInPlaceUpdate
ormanualInPlaceUpdate
. A newShootManualInPlaceWorkersUpdated
constraint is added if any manual in-place updates are pending, ensuring users are aware. - Worker Status: The
Worker
extension resource now includesstatus.inPlaceUpdates.workerPoolToHashMap
to track the configuration hash of worker pools that have undergone in-place updates. This helps Gardener determine if a pool is up-to-date. - Forcing Updates: If an in-place update is stuck, the
gardener.cloud/operation=force-in-place-update
annotation can be added to the Shoot to allow subsequent changes or retries.
Benefits of In-Place Updates
- Reduced Disruption: Minimizes workload interruptions by avoiding full node replacements for compatible updates.
- Faster Updates: Applying changes directly can be quicker than provisioning new nodes, especially for OS patches or configuration changes.
- Bare-Metal Efficiency: Particularly beneficial for bare-metal environments where node provisioning is more time-consuming and complex.
- Stateful Workload Friendly: Lessens the impact on stateful applications that might be sensitive to node churn.
In-place node updates represent a significant step forward in Gardener’s operational flexibility, offering a more nuanced and efficient approach to managing node lifecycles, especially in demanding or specialized environments.
Dive Deeper
To explore the technical details and contributions that made this feature possible, refer to the following resources:
- Parent Issue for “[GEP-31] Support for In-Place Node Updates”: Issue #10219
- GEP-31: In-Place Node Updates of Shoot Clusters: GEP-31: In-Place Node Updates of Shoot Clusters
- Developer Talk Recording (starting at 39m37s): Youtube