Hack The Garden 11/2025 Wrap Up ​

Group picture

🐢 Use Self-Hosted Shoot Cluster For Single-Node E2E Tests ​

Problem Statement ​

Our goal is to increase confidence in self-hosted Shoot clusters and to test them robustly by integrating them into our existing end-to-end (E2E) testing framework. We plan to run these E2E tests within a single-node, self-hosted Shoot cluster, leveraging the ability to create such clusters with gardenadm.

Motivation & Benefits ​

Running E2E tests in a self-hosted Shoot cluster allows us to "eat our own dog food": it provides a real-world test of the self-hosted Shoot functionality and will ultimately increase confidence in the stability and reliability of self-hosted Shoot clusters.

Achievements ​

We made significant progress across several core areas to enable this setup:

  • Setup/bootstrap: Successfully ran gardenadm init in a Docker container. Tracked as gind (Gardener in Docker, inspired by KinD).
  • etcd management: Made druid-managed etcd optional. This allows the system to continue with a bootstrap etcd if no backup is configured in the Shoot manifest.
  • Networking/DNS:
    • Addressed NetworkPolicies for CoreDNS.
    • Fixed registry hostnames by always running registries as containers via docker compose and exposing them on dedicated hostnames.
    • Cleaned up the provider-local Service controller in the traditional kind-based setup.
    • Fixed DNS records in kind CoreDNS Corefile to use hard-coded IPs.
  • Controller conflict management: Ensured that multiple controllers (e.g., gardener-resource-manager, etcd-druid, vpa) do not conflict.
  • Host network: Ensured gardener-resource-manager and extensions remain in the host network.

Next Steps ​

The following tasks are planned to complete the goal and address ongoing considerations:

  • Extension management:
    • We need to complete the logic for extension management to prevent extensions from being deployed twice if the self-hosted shoot is also a Seed.
    • This involves the gardenlet labeling the Seed object if it runs in a self-hosted Shoot.
    • The gardener-controller-manager (GCM) must then adapt its ControllerInstallation creation logic accordingly.
  • E2E integration: We need to adapt existing E2E tests to create the self-hosted cluster and run tests within it.
  • Tooling: We will implement gind (Gardener/gardenadm in Docker) as a kind alternative using gardenadm.
  • Controller isolation (future): We plan to move gardener-resource-manager and etcd-druid of the self-hosted Shoot to the garden namespace to avoid potential conflicts/interferences when the self-hosted Shoot is also the garden runtime or a Seed cluster.

Code & Pull Requests ​

🫆 Enrich Shoot Logs with Istio Access Logs

Problem Statement ​

The Istio ingress gateway generates valuable access logs that show all requests passing through it, especially in conjunction with Layer 7 (L7) load balancing. Currently, these logs are only accessible to Seed operators, limiting visibility for Shoot cluster users. We need to move these logs to the corresponding shoot log stream to enhance cluster debugging and operational visibility for Shoot owners, particularly in cases where access control is restricted (e.g., by the ACL extension).

Motivation & Benefits ​

  • Improved debugging and operational visibility for Shoot cluster owners by giving them direct access to ingress traffic logs.
  • Enhanced compliance and security by making critical access information available within the shoot's boundary, which is necessary when strict access controls are in place.
  • This project serves as a pilot for enriching Shoot logs with logs from other relevant Seed components in the future.

Achievements ​

  • We successfully implemented the required changes to enrich Shoot logs with Istio access logs.
  • This functionality was integrated into the main Gardener repository following an initial attempt in the logging extension repository.

Next Steps ​

  • We should consider extending this approach to include access logs or other relevant logs from other Seed components that impact Shoot cluster operations.
  • The feature needs validation across different Shoot configurations to ensure stability and reliable delivery of the access logs to the Shoot log stream.

Code & Pull Requests ​

🪣 Allow Relocating Backup Buckets

Problem Statement ​

Currently, there is no straightforward way to relocate etcd backups for Shoots, for example to change the backup provider or region. Today, this can only be achieved by migrating the Shoots to another Seed that has a different backup provider configured. Such a migration is difficult because it causes downtime for the Shoot owners and because all Shoots managed by the Seed need to be migrated. For operator setups, it is not possible to change the backup location either, because the respective fields in the Garden resource are also immutable.

Motivation & Benefits ​

We want to be able to relocate backups more easily and without any downtime for the Shoot owners. During our migration to gardener-operator, we created the Garden cluster in a completely new cloud provider project, but existing Seeds still point to the old project's backup bucket. We would like to migrate these buckets to the new project so we can clean up the old one.

Achievements ​

We created a small prototype that makes the relevant backup fields mutable. Changing them triggers a redeployment of the etcd StatefulSet, which includes the new backup configuration, during the next Shoot reconciliation. As soon as the new etcd pod starts, it continues to run with the data found on the PVC and then directly creates a full snapshot in the new backup bucket. This behavior requires no modification of the backup-restore sidecar. The prototype also allows changing the bucket name used by the Seeds, which is currently strictly derived from the Seed UID.

Next Steps ​

We need to finalize the implementation and open a pull request. It might be worth opening another issue to discuss modifying the backup-restore sidecar to allow it to "claim" the bucket it acts on.

Code & Pull Requests ​

🪞 Pull gardener-node-agent From Registry Mirror

Problem Statement ​

It is currently not possible to pull the gardener-node-agent through a registry mirror configured by the OperatingSystemConfig (OSC) extension. This is because the gardener-node-agent is the component responsible for configuring containerd with the registry mirrors, creating a chicken-and-egg problem during the initial Node provisioning.

Motivation & Benefits ​

We want to enable the gardener-node-agent to be pulled directly from a configured registry mirror, which simplifies the initial Node setup and ensures that all image pulls respect the configured mirroring policies from the very start. We previously used a workaround where the registry mirror was added as systemd files into the userdata via a webhook, but integrating this logic directly into the registry-cache extension is cleaner and more robust.

Achievements ​

  • Introduced an option in the registry-cache extension to mark a Mirror as provisionRelevant (see the sketch after this list). Marking a mirror as provisionRelevant causes the mirror configuration to be included as files in the provision OperatingSystemConfig. This ensures the mirror configuration is available before the gardener-node-agent is running, allowing the agent itself to be pulled via the mirror.
  • After the gardener-node-agent is running, it will take over and maintain the containerd config, updating it if needed.
  • Added the possibility to set a CA bundle for a Mirror.
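
For illustration, a rough sketch of what such an extension API field could look like; the type and field names here are assumptions and not the registry-cache extension's actual API:

go
package mirror

// MirrorConfig describes a single registry mirror. The type and field names in
// this sketch are assumptions, not the extension's actual API.
type MirrorConfig struct {
	// Upstream is the registry host being mirrored, e.g. "registry.k8s.io".
	Upstream string `json:"upstream"`
	// Host is the mirror endpoint to pull from instead of the upstream.
	Host string `json:"host"`
	// ProvisionRelevant causes this mirror's configuration to be rendered into
	// the provision OperatingSystemConfig, so it is in place before
	// gardener-node-agent runs and the agent itself can be pulled via the mirror.
	ProvisionRelevant bool `json:"provisionRelevant,omitempty"`
	// CABundle is an optional PEM-encoded CA bundle used when talking to the mirror.
	CABundle *string `json:"caBundle,omitempty"`
}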

Next Steps ​

The pull request needs to be finalized, documentation updated, and then merged and released.

Code & Pull Requests ​

πŸ—½ Evaluate Talos As Node Operating System ​

Problem Statement ​

We evaluated the technical feasibility of integrating Talos OS, a modern, minimal, and immutable operating system designed specifically for Kubernetes, as a worker Node OS in Gardener Shoot clusters. This required moving away from the traditional systemd-based management model.

Motivation & Benefits ​

  • Security: Talos offers enhanced security due to its immutable filesystem, minimal package set, and zero traditional access surfaces (no SSH or Shell).
  • Management: The OS is fully declarative and configuration is entirely API-based (via gRPC), which aligns with Kubernetes principles.
  • Architecture: It provides a lightweight solution dedicated solely to running Kubernetes, potentially leading to a smaller attack surface and improved stability.

Achievements ​

  • We successfully demonstrated that Talos OS is technically feasible as a Gardener worker Node.
  • Machine bootstrapping: Automated the generation of the initial Talos configuration and successfully deployed Talos Nodes as Pods in a local provider PoC, pushing the configuration while the Node was in "insecure" mode.
  • Cluster joining: Established trust by configuring the kubelet with the cluster CA, successfully using bootstrap tokens, and identifying compatible kubelet images.
  • CNI & VPN: Successfully configured the cluster by disabling the Talos-native CNI, enabling the kubelet server for VPN/tunneling, and resolving an SNI issue by disabling KubePrism and ensuring proper DNS resolution.
  • Control plane trust: Successfully deployed the trustd daemon in the control plane and configured Istio routing to allow Talos Nodes to reach it securely.

Next Steps ​

The Talos OS integration will not be followed up to production at this time, but the architectural findings point to the necessary evolution for future support of declarative operating systems:

  • OperatingSystemConfig & gardener-node-agent evolution: Refactor the existing OperatingSystemConfig and Gardener Node Agent (GNA) to abstract away the current dependency on systemd, requiring a new agent that leverages Talos's declarative API.
  • Extension compatibility: Extensions that inject systemd units via OSC mutations must be rewritten to use containerized sidecars or DaemonSets.
  • Service Hardening: Define a standard strategy for securely exposing the Talos API daemon (apid) to operators for debugging and create a fully managed deployment for the trustd component within the control plane.

Code & Pull Requests ​

  • No specific public pull requests are available, as this was a feasibility evaluation track.

πŸ“¦ Gardener API Types As Standalone Go Module ​

Problem Statement ​

The main Gardener repository currently contains all API types within its primary Go module. This forces external projects that only need to import the API types (e.g., extensions, consumers) to download and depend on the entire set of dependencies for the main Gardener component, leading to unnecessary overhead and potential dependency conflicts.

Motivation & Benefits ​

Introducing a dedicated Go module for pkg/apis (the API types) ensures a minimal set of dependencies for consumers who only need the API definitions. This significantly reduces the size and complexity of the dependency graph for API-only imports.
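
As a minimal sketch of the intended consumption, an external tool could then import only the API types; the import path below matches today's package layout, while the final module path of the split-out module may differ:

go
package main

import (
	"fmt"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
)

func main() {
	// With a dedicated API module, this import would only pull in the API types
	// and their few dependencies (e.g. k8s.io/apimachinery), not the full
	// dependency graph of the main gardener/gardener module.
	shoot := gardencorev1beta1.Shoot{}
	shoot.Name = "my-shoot"
	fmt.Println(shoot.Name)
}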

Achievements ​

We opened a pull request to implement the changes. 🥳

Next Steps ​

The following tasks are necessary to fully implement the standalone API module:

  • Introduce a dedicated Go module for pkg/apis in gardener/gardener.
  • Strictly limit the dependencies of this API module (e.g., only to k8s.io/{api,apimachinery,utils}).
  • Enforce dependency restrictions using a .import-restrictions file.
  • Ensure the API module is released together with the main module, using the proper Go submodule tag (similar to gardener/cc-utils#1382).
  • Use the Go workspaces feature in gardener/gardener to conveniently develop both the main and the API module together.

Code & Pull Requests ​

πŸ“ˆ Gardener Scale-Out Tests ​

Problem Statement ​

We don't have a good estimate of how many Seeds and Shoots a Gardener environment can support, nor are we aware of the scalability limitations we might face in the future. There is also no way to prevent regressions in Gardener's scalability.

Motivation & Benefits ​

We wanted to implement "hollow" gardenlets, similar to Kubemark's hollow Nodes, and run many of them to generate load on the Garden cluster. This approach provides a solid foundation for running automated performance/scalability tests.

Achievements ​

We transferred the concept of hollow Nodes to Seeds: a hollow gardenlet registers itself with the Garden cluster and reports itself as ready. Shoots scheduled to these Seeds are also reported healthy, but no real control planes are spawned. To simulate a more realistic scenario, we analyzed the queries-per-second (QPS) that real gardenlets perform against the Gardener API server (GAPI) and registered a runnable in the hollow gardenlet that simulates these requests based on the number of Shoots on the Seed (see the sketch below).

In the local operator setup on an M4 Mac with 48 GB of memory, we were able to schedule ~200 hollow gardenlets in the runtime cluster before running out of resources. We also found that creating too many Shoots and Seeds in a short amount of time leads to problems, specifically lease requests timing out; the Garden's settings were taken as-is, so tuning for this scenario would probably be possible. To avoid this initial problem, we ran a longer test aiming for 50 Shoots per minute and 1 Seed every 3 minutes. The outcome was 800 Seeds and 21,600 Shoots over 10 hours, but around 4 am the test ran into a meltdown: we observed unequal request balancing across the Kubernetes API servers (KAPIs), as no Layer 7 (L7) load balancing was active.
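
A rough sketch of how such a QPS-simulating runnable could look, assuming a controller-runtime-style Runnable; the names and the request mix are illustrative and not the actual hackathon implementation:

go
package hollow

import (
	"context"
	"time"

	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// qpsSimulator is a rough sketch of a runnable that generates load against the
// Garden cluster's API server, scaled by the number of Shoots on the hollow Seed.
// All names and the exact request mix are illustrative assumptions.
type qpsSimulator struct {
	gardenClient client.Client // client for the Garden cluster
	seedName     string
	qpsPerShoot  float64
}

// Start implements the controller-runtime manager.Runnable interface, so the
// simulator can simply be added to the hollow gardenlet's manager.
func (s *qpsSimulator) Start(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			// List the Shoots (filtering for this Seed is omitted for brevity).
			shootList := &gardencorev1beta1.ShootList{}
			if err := s.gardenClient.List(ctx, shootList); err != nil {
				continue
			}
			// Issue a number of reads proportional to the Shoot count to mimic
			// the QPS profile observed for real gardenlets.
			for i := 0; i < int(float64(len(shootList.Items))*s.qpsPerShoot); i++ {
				_ = s.gardenClient.Get(ctx, client.ObjectKey{Name: s.seedName}, &gardencorev1beta1.Seed{})
			}
		}
	}
}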

Next Steps ​

We can try to run the tests in a more controlled environment. Although we will not continue with this after the hackathon, this artifact can serve as a starting point for the next round.

Code & Pull Requests ​

concept diagram for the gardener scale-out tests using hollow clusters and gardenlets to simulate load

β€οΈβ€πŸ©Ή force-restore Operation Annotation For Shoots ​

Problem Statement ​

The existing mechanism for disaster recovery of a shoot cluster using available backups needs improvement. We aimed to introduce a clear way for operators to force a restore operation using existing backups, which is particularly critical in disaster scenarios.

Motivation & Benefits ​

  • Facilitates and simplifies the recovery process for Shoot clusters from a disaster event.
  • Provides an annotation for operators to initiate a restore using the last known good state from the available backups.
  • This feature is general enough to be adopted by provider extensions like OpenStack to manage specific resource recovery flows.

Achievements ​

  • Identified the necessary changes in the gardenlet's operational logic, specifically around backup handling and migration checks.
  • The implementation requires updates to how the gardenlet determines if a copy of backups is required. It needs to handle cases where the source BackupEntry is not found, preventing errors during recovery attempts.

Next Steps ​

  • The identified gardenlet changes must be fully tested and implemented to enable the core force-restore functionality.
  • An issue in gardener-extension-provider-openstack that hinders control plane migration needs to be fixed.

Code & Pull Requests ​

πŸ—ƒοΈ Go Build Cache In Prow ​

Shafeeque and Tobias explored build and test caching for Gardener Prow jobs. Along the way, we discovered some interesting aspects of how Go's caching works, how it is used, and how the new Go feature GOCACHEPROG operates.

We pursued three goals:

  1. Speed up time-to-feedback on pull requests.
  2. Reduce load on the Prow build clusters.
  3. Do this securely.

For security, it was critical that presubmit (pull request) jobs could read from the cache but never write to it. This prevents untrusted PRs and potentially broken jobs from polluting the cache.

We first reviewed how other projects handle caching:

  • Istio: Uses a hostPath volume to reuse cache on the same node. This helps somewhat, but suffers from numerous cache misses as jobs are assigned to different nodes. With our autoscaling worker groups, node churn would make misses even more common, so this wouldn’t be effective for us.
  • Kubermatic machine-controller: Uses scripts to fetch a cache archive for the PR's ancestor commit from blob storage before the build, and upload an updated cache on main after the build. This mirrors systems like GitHub Actions’ actions/cache and works with plain file systems, so build tooling doesn’t need remote cache awareness.

We considered using a ReadWriteMany persistent volume for caching to achieve a similar setup to Istio, but across multiple nodes. On GCP, the practical option is NFS Filestore, which is quite expensive. Additionally, to make presubmits read-only, we’d have to copy the cache to a local directory before buildsβ€”an expensive step for large caches that erodes the performance benefits.

Go recently introduced the GOCACHEPROG feature, which allows you to plug in a custom program to read from and write to the build and test cache. We're using it with the open-source saracen/gobuildcache to store cache entries in Google Cloud Storage.

  • This gives us a "remote cache helper" without needing to download/upload an entire cache directory before/after each build.
  • Because Go’s cache is highly granular (per compiled/tested unit), we only fetch what's needed rather than relying on commit ancestry heuristics.
  • Google Cloud Storage doesn't charge network costs within the same region and only minimal costs for storage and operations.
  • Trade-off: many small requests to the remote backend are slower than local filesystem access. However, with our high hit rate, the overall impact is very positive.
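
As a rough illustration of the wiring described above: Go delegates cache reads and writes to whatever program GOCACHEPROG points to. The helper path and the way gobuildcache is configured below are placeholders, not its actual invocation:

go
package main

import (
	"os"
	"os/exec"
)

func main() {
	// With GOCACHEPROG set, the go command delegates every build/test cache
	// read and write to the given helper program instead of using the local
	// GOCACHE directory. The helper path below is a placeholder; see the
	// saracen/gobuildcache documentation for its actual configuration.
	cmd := exec.Command("go", "build", "./...")
	cmd.Env = append(os.Environ(), "GOCACHEPROG=/usr/local/bin/gobuildcache")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}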

We use separate GCP principals, federated from the Prow shoot cluster via Workload Identity Federation, with read-only permissions for presubmit jobs and read-write permissions for postsubmit and periodic jobs.

With this setup, we have already significantly accelerated builds. However, some parts, like go test and linting, didn't speed up as much. We found that go test only caches results when a limited set of flags is used; in Prow, we rely on the ginkgo.junit-report=junit.xml flag to produce JUnit reports for pull requests, and this flag disables test result caching. In local experiments, removing this flag enabled caching and dramatically reduced unit test time (for example, from approximately 40 minutes to under 5 minutes).

Overall, we're happy with the first improvements. Examining Gardener's jobs, we did not significantly reduce end-to-end feedback time for presubmits; however, the actual CPU time spent on builds decreased by more than 90%. The impact is particularly clear in the kind-e2e tests: they used to take roughly 1 hour 30 minutes of wall-clock time and consume about 65 minutes of CPU time, with up to 12 cores busy in parallel at the beginning; now they consume under 3 minutes of CPU time on fewer than 2 cores. This substantially reduces the load on the build cluster.

kind e2e tests without cache ​

terminal output
real        85m20.914s
user        64m51.059s
sys          6m42.249s

Prow CPU metrics without cache showing 12 CPUs being used during build

kind e2e tests with cache ​

terminal output
real        82m53.159s
user         2m33.118s
sys          1m0.021s

Prow CPU metrics with cache showing 2 CPUs being used during build

πŸ› οΈ MCM: Update Machines Updates During In-Place Updates ​

Problem Statement ​

The existing in-place node update mechanism in the gardener/machine-controller-manager (MCM) does not support updating underlying infrastructure resources, such as the OS image. To update these resources, a full node recreation is required, which we want to avoid.

Motivation & Benefits ​

We want to enable infrastructure-level updates without requiring a full node recreation, improving efficiency and reducing disruption. On bare metal, the key benefit is achieving a clean OS state: the machine is memory-booted via the network, and the image can be updated simply by rebooting the server.

Achievements ​

  • Extended the MCM provider driver interface by introducing a new UpdateMachine method for in-place updates.
  • The signature of the new method is UpdateMachine(context.Context, *UpdateMachineRequest) (*UpdateMachineResponse, error) (see the excerpt after this list).
  • Completed the MCM update machine implementation on a feature branch.
  • Successfully tested the concept on a local provider environment.
  • Released forked container images to test the scenario in a real-world bare metal environment.
  • Implemented a "hacky" fix to address the challenge of getting the new MachineImage within the UpdateMachine call, as only the old image was initially available.
  • Mitigated an issue where the Garden Linux OS extension used an on-disk-only gardenlinux-update command by providing a fix to a forked extension.
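
For reference, an excerpt of how the extended driver interface could look; only the UpdateMachine signature above is taken from the actual change, while the request/response field layout is an assumption modelled after the existing driver request types:

go
package driver

import (
	"context"

	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	corev1 "k8s.io/api/core/v1"
)

// Driver excerpt: the existing methods (CreateMachine, DeleteMachine,
// GetMachineStatus, ...) are elided here.
type Driver interface {
	// UpdateMachine performs an in-place update of the machine's underlying
	// infrastructure resources (e.g. the OS image) without recreating the node.
	UpdateMachine(ctx context.Context, req *UpdateMachineRequest) (*UpdateMachineResponse, error)
}

// UpdateMachineRequest carries the objects a provider needs for the update.
// The fields shown are assumptions modelled after the other driver request types.
type UpdateMachineRequest struct {
	Machine      *machinev1alpha1.Machine
	MachineClass *machinev1alpha1.MachineClass
	Secret       *corev1.Secret
}

// UpdateMachineResponse is empty in this sketch.
type UpdateMachineResponse struct{}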

Next Steps ​

We need to revisit the idea of using in-place updates for memory-booted servers and re-evaluate the rolling update approach instead. The current hacky changes to MCM are not worth contributing upstream, as they might break the core in-place update contract; therefore, we will not pursue them further.

Code & Pull Requests ​

πŸ”” gardenadm/Flow Package: Handle SIGINFO (^T) ​

Problem Statement ​

The existing command-line tools that utilize the Gardener flow package for complex, multistep operations currently lack a standardized way to provide clear, real-time feedback to the user about which step is currently executing. This makes the tools appear static or slow, hindering developer productivity.

Motivation & Benefits ​

  • Improves the usability and developer experience of Gardener's command-line tools by providing visual confirmation of progress.
  • Increases developer productivity by making it easier to monitor long-running processes, quickly see which step failed, and estimate completion time.
  • Provides a standardized interface (CommandLineProgressReporter) that can be easily integrated into any tool using the flow package, ensuring consistency across the ecosystem.

Achievements ​

  • We successfully implemented the CommandLineProgressReporter utility.
  • This new reporter integrates with the flow package to print the currently executing step to standard output (see the sketch after this list).
  • The feature was integrated into the main Gardener repository, making it available for use in all Gardener command-line tools.
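
The heading refers to SIGINFO (^T); as a rough, macOS/BSD-only illustration of the idea, a long-running CLI could print its current step on demand roughly like this (all names are hypothetical, and this is not the actual CommandLineProgressReporter implementation):

go
//go:build darwin

package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync/atomic"
	"time"

	"golang.org/x/sys/unix"
)

// currentStep holds the name of the flow step that is currently executing.
var currentStep atomic.Value

func main() {
	// Print the current step whenever the user presses ^T (SIGINFO on macOS/BSD).
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, unix.SIGINFO)
	go func() {
		for range ch {
			fmt.Fprintf(os.Stderr, "currently executing: %v\n", currentStep.Load())
		}
	}()

	// Simulate a multi-step operation.
	for _, step := range []string{"bootstrap etcd", "deploy kube-apiserver", "wait for node"} {
		currentStep.Store(step)
		time.Sleep(2 * time.Second)
	}
}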

Next Steps ​

  • Integrate the CommandLineProgressReporter into all relevant Gardener command-line tools that use the flow package to ensure a consistent user experience.

Code & Pull Requests ​

⚖️ Load Balancer Controller For provider-local

Problem Statement ​

Our goal was to build a functional, real load balancer controller for the provider-local extension, moving beyond simple Service controller implementations. This allows us to properly emulate external load-balancing behavior in local development and testing environments, such as kind.

Motivation & Benefits ​

Implementing a dedicated load balancer controller is essential for accurately testing features that rely on external IP address allocation and port exposure. This will be crucial for making ManagedSeed tests succeed, as they require a reliable load balancer implementation in the test Shoot clusters.

Achievements ​

Significant progress has been made on implementation, as captured in the work-in-progress branch. The demo steps illustrate the core functionality achieved:

  • IP addresses can now be allocated from a specified range (172.18.255.64/26) to Services of type LoadBalancer (see the sketch after this list).
  • External access can be simulated by manually adding the allocated IP addresses to the host's loopback interface (lo0) and then accessing the exposed Service using those IPs.
  • The controller can utilize Docker networking features (--enable-lb-port-mapping) to map the load balancer port to the backing container.
  • This allows the containerized workload (like nginx) to be reached externally via the allocated IP.
  • We demonstrated the successful creation and access of multiple Services of type LoadBalancer (nginx, nginx2, etc.), each receiving a unique external IP from the defined pool.
  • A user can create a Deployment, expose it as a load balancer Service, and then curl the assigned external IP.
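
A condensed sketch of the allocation idea; the real controller in the work-in-progress branch is more involved, and the names and bookkeeping here are illustrative only:

go
package lb

import (
	"fmt"
	"net/netip"
)

// allocator hands out addresses from the configured range, e.g. 172.18.255.64/26.
type allocator struct {
	prefix    netip.Prefix
	allocated map[netip.Addr]bool
}

func newAllocator(cidr string) (*allocator, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &allocator{prefix: p, allocated: map[netip.Addr]bool{}}, nil
}

// next returns the first free address of the range; the controller would write
// it to the LoadBalancer Service's .status.loadBalancer.ingress.
func (a *allocator) next() (netip.Addr, error) {
	for addr := a.prefix.Addr(); a.prefix.Contains(addr); addr = addr.Next() {
		if !a.allocated[addr] {
			a.allocated[addr] = true
			return addr, nil
		}
	}
	return netip.Addr{}, fmt.Errorf("range %s exhausted", a.prefix)
}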

Next Steps ​

The following tasks are planned to complete the load balancer controller:

  • The controller needs to correctly handle IPv6.
  • We must implement the LoadBalancer controller for Shoots to ensure ManagedSeed tests succeed.
  • Deployment of this functionality is required in the local setup.
  • The provider-local Service controller must be dropped, as its functionality will be superseded by the new, dedicated load balancer controller.

Code & Pull Requests ​

πŸ”Œ Evaluation Of NFT Mode In kube-proxy ​

Problem Statement ​

The existing kube-proxy modes, ipvs and iptables, have long been the standard, but the Kubernetes community has been stabilizing the newer nftables mode (beta since version 1.31). We needed to assess this new mode for potential adoption within Gardener.

Motivation & Benefits ​

nftables is recognized as the modern, more efficient, and flexible successor to the legacy iptables framework. Adopting this new proxy mode ensures we keep up with Kubernetes upstream developments and can leverage the performance improvements and simpler rule management that nftables offers.

Achievements ​

We successfully implemented initial support for the nftables proxy mode within Gardener. Cluster operators can now provision and test Shoot clusters using the new nftables mode for kube-proxy.

Next Steps ​

The next logical step is to perform comprehensive testing of the new mode under various loads and configurations. We will evaluate the possibility of setting nftables as the default kube-proxy mode for new Shoot clusters when they use a sufficiently recent Kubernetes version.

Code & Pull Requests ​

πŸŒ‰ Replace Ingress NGINX controller With Gateway API ​

Problem Statement ​

Following the deprecation announcement for Ingress NGINX, we need to evaluate and implement the Gateway API as a replacement for ingress traffic management in the Garden runtime/Seed clusters. This involves migrating from existing Istio-native resources (Gateway, VirtualServices, DestinationRules) to Gateway API resources, such as HTTPRoute, and overcoming initial integration hurdles with Gardener's Istio configuration.

Motivation & Benefits ​

  • The primary motivation is to replace the deprecated Ingress NGINX controller with a modern, standard API.
  • Adopting the Gateway API drives its maturity and aligns us with the future direction of Kubernetes Ingress.
  • This evaluation will determine how many Istio-native resources can be replaced, simplifying the configuration and management of ingress traffic over time.

Achievements ​

  • Successfully identified and fixed an issue preventing the Gateway API from working with Gardener's default Istio configuration due to restrictive exportTo settings in the mesh config.
    • We discovered that setting defaultVirtualServiceExportTo: '.' is required to implicitly allow the internally created VirtualService, so that the Istio ingress gateway exports the services.
  • Created a functional branch showcasing Gateway API usage for Plutono.
  • HTTP basic authentication was implemented using an EnvoyFilter and an external authorization server, with a special label introduced to reference basic auth secrets.
  • Extended the gardener-resource-manager network policy controller to generate network policies based on HTTPRoute resources, similar to how it handled Ingress resources.
  • Managed to serve Gateway API resources and Istio resources under the same port.
  • Translated the basic redirect logic (used for preventing access to admin routes) by using a redirect to "/" as a workaround for the current Gateway API limitation on direct responses.

Next Steps ​

  • Fully implement HTTP basic authentication, potentially replacing the custom EnvoyFilter with Istio-native external authorization or adopting the experimental HTTPExternalAuthFilter when Istio supports it.
  • Complete the translation of all remaining NGINX annotations and an existing Ingress resource to equivalent Gateway API resources.
  • The migration of DestinationRules to Gateway API's new traffic policy resources (like XBackendTrafficPolicy) needs to be finalized.
  • Implement traffic encryption for the external authorization server.
  • The deployment and configuration for the dashboard and Prometheus services (including Alertmanager) need to be migrated to the Gateway API.
  • The changes must be integrated into the operator reconciliation flow and the Shoot reconciliation flow.

Code & Pull Requests ​

🐱 Add Support For Calico Whisker ​

Problem Statement ​

We needed to integrate Calico Whisker into Gardener-managed clusters to provide traffic monitoring and tracing capabilities, similar to Cilium's Hubble. Whisker is typically deployed via the tigera/operator, which complicates its deployment in our existing Calico extension setup.

Motivation & Benefits ​

Adding Calico Whisker provides valuable observability tools for cluster network traffic, which is a feature highly requested by users who prefer Calico as their Container Network Interface (CNI). This new capability brings Calico clusters closer to feature parity with the network tracing offered by Cilium/Hubble.

Achievements ​

  • Developed a working prototype by reusing code from the tigera-operator to manage the deployment of the required components, which are Calico Whisker and Calico Goldmane.
  • The prototype successfully manages the mTLS communication requirement between calico-node and calico-typha by utilizing the Gardener security-manager for certificate provisioning.
  • Identified and implemented the necessary network policies to allow the required communication, such as from the extension to the Shoot API servers and from calico-node to goldmane.
  • The code for the working prototype is available in a feature branch for further development.

Next Steps ​

  • The next primary step is the productization of the changes into the main Calico networking extension, including integrating certificate management and component deployment logic.
  • We need to resolve the issue with the untagged tigera-operator API module dependency to ensure stable integration.
  • The approach used here, which involves reusing tigera-operator code, can potentially be applied to simplify the future addition of support for the Calico API server.

Code & Pull Requests ​

🏷️ Respect Terminating Nodes In Load-Balancing ​

Problem Statement ​

When a machine is deleted, the load balancer may not be immediately aware that the corresponding node is no longer available. This lack of synchronization can lead to unsuccessful new connections because the load balancer may still forward traffic to the terminating Node. The existing Node conditions or "drain" signals set by the Machine Controller Manager (MCM) do not reliably trigger a reconciliation in the cloud-provider package or cause the kube-proxy health check to fail.

Motivation & Benefits ​

We want to improve load-balancing behavior during machine terminations to prevent connection failures. The goal is to provide a reliable signal to both the cloud load balancer (via cloud-provider) and the in-cluster service routing (via kube-proxy) that a Node is being terminated and should be removed from the rotation immediately.

Achievements ​

  • Identified that the ToBeDeletedByClusterAutoscaler taint is recognized and handled by both the Kubernetes cloud-provider and kube-proxy components for draining traffic.
  • Implemented changes in MCM to apply the ToBeDeletedByClusterAutoscaler taint to a node when its corresponding Machine is marked for deletion (see the sketch after this list).
  • Applying this taint triggers the necessary load balancer reconcile and service health check updates, ensuring traffic is stopped before the Node is fully gone.
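
A minimal sketch of applying the taint once a Machine is marked for deletion; the taint key is the one recognized by cloud-provider and kube-proxy, the value and effect follow the cluster-autoscaler's convention, and the surrounding update logic is a simplified assumption:

go
package mcm

import (
	"context"
	"strconv"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintNodeForDeletion adds the cluster-autoscaler deletion taint so that
// cloud-provider load balancers and kube-proxy stop routing traffic to the node.
func taintNodeForDeletion(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "ToBeDeletedByClusterAutoscaler" {
			return nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "ToBeDeletedByClusterAutoscaler",
		Value:  strconv.FormatInt(time.Now().Unix(), 10), // conventionally a timestamp
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}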

Next Steps ​

The pull request needs to be merged and released.

Code & Pull Requests ​

🧰 Use Go Tools & Drop VGOPATH ​

Problem Statement ​

The main Gardener repository relied on the outdated practice of using the VGOPATH environment variable within its build and development environment. This approach is obsolete in modern Go development, which uses the Go modules system for dependency management and tooling.

Motivation & Benefits ​

By migrating to standard Go tooling and removing VGOPATH, we achieve several benefits:

  • Modernization: We align our development and build process with current Go best practices (Go modules).
  • Simplification: We removed the need for custom logic and non-standard environment variables (VGOPATH) in our tooling scripts, simplifying the development experience for contributors.
  • Dependency clarity: We improved the clarity and reliability of our dependency resolution.

Achievements ​

This effort was split into two key pull requests that together eliminate VGOPATH and standardize the use of Go tools:

  • Remove VGOPATH:
    • This PR introduced the initial change to remove all occurrences of VGOPATH from the gardener/gardener repository.
    • It refactored scripts and configurations that previously relied on this variable to use standard, module-aware Go commands.
  • Use Go tools:
    • This PR focused on simplifying our dependency setup. We updated our tool dependency management by pinning all necessary tools in the go.mod file, following the standard Go pattern for tool dependencies.
    • This means we no longer need complex, custom logic to ensure the correct versions of various Go tools (like linters, code generators, etc.) are available during development.
    • Tools are now sourced directly through Go modules, simplifying the bootstrapping process for a new developer.

Next Steps ​

The primary goal of removing VGOPATH and leveraging standard Go tools is now complete with the merging of these two PRs. The changes are integrated into the repository, ensuring a modernized and simplified development workflow for all contributors.

Code & Pull Requests ​

πŸ”€ Pod Overlay To Native Routing Without Downtime ​

Problem Statement ​

The current method for switching a Shoot cluster's networking mode between pod overlay networking and native routing requires a full reconfiguration and rollout, which results in network downtime. Many production clusters cannot tolerate this interruption, creating a need for a seamless, zero-downtime transition.

Motivation & Benefits ​

Enabling seamless migration allows cluster operators to safely evolve networking configurations in critical production environments without service interruption. This improves the operational flexibility and resilience of Gardener-managed clusters. The desired state is that old nodes continue with the old mode while new Nodes adopt the new mode, maintaining full cross-communication between all Nodes during the transition.

Achievements ​

  • Performed initial testing of the migration flow (bidirectional: overlay to native and native to overlay) in a small two-Node cluster using both Calico and Cilium CNIs on AWS.
  • Calico successfully maintained a long-running connection (netcat) and continuous traffic during the switch in both directions.
  • Cilium successfully transitioned from overlay to native routing without resetting the connection, though the reverse transition (native to overlay) consistently resulted in a connection reset.

Next Steps ​

  • This topic requires further investigation and will be addressed by Sebastian during ordinary operations as a direct follow-up to the related pull request (gardener/gardener#13332).
  • The Cilium connection reset issue during the native-to-overlay switch needs to be investigated and resolved to ensure robust zero-downtime migration for all supported CNIs.

Code & Pull Requests ​

πŸšͺ [GEP-28] Expose API server Of Self-Hosted Shoots ​

Problem Statement ​

The API server of a self-hosted Shoot cluster with managed infrastructure needs a mechanism to be reliably exposed for external access. Currently, there is no standardized way for Gardener components to trigger the creation of cloud-specific load-balancing resources for this scenario.

Motivation & Benefits ​

We want to enable external access to the API server of self-hosted Shoots by reusing existing cloud provider mechanisms, for example by having the cloud-controller-manager create a load balancer Service for default/kubernetes. Exposing the API server this way allows the DNSRecord to be updated with the LoadBalancer's external IP, providing robust, highly available external access.

Achievements ​

  • Introduced a new resource, SelfHostedShootExposure, in the extensions.gardener.cloud/v1alpha1 API to manage the exposure resources. This resource abstracts the specific cloud exposure implementation (e.g., load balancer or kube-vip) into an extension model.
  • The Shoot API was extended to allow configuration of control plane exposure via .spec.provider.workers[].controlPlane.exposure.
  • Defined the responsibilities: gardenadm or gardenlet will create the SelfHostedShootExposure object, and a corresponding extension controller will reconcile the necessary cloud resources.
  • An Actuator interface was defined for the extension controllers with Reconcile and Delete methods to manage the exposure resources and report the ingress status ([]corev1.LoadBalancerIngress); see the sketch after this list.
  • The gardenlet will also run a new controller to watch control plane Nodes and automatically update the SelfHostedShootExposure resource's .spec.endpoints[] with the latest Node addresses.
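
A sketch of how such an Actuator interface could look; the Reconcile/Delete methods and the reported []corev1.LoadBalancerIngress status are taken from the list above, while the exact parameters are assumptions following the pattern of other extension actuators:

go
package exposure

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// SelfHostedShootExposure is a stand-in for the new resource in
// extensions.gardener.cloud/v1alpha1; its full Go type is omitted here.
type SelfHostedShootExposure struct{ /* ... */ }

// Actuator reconciles the cloud resources (e.g. a load balancer or kube-vip
// configuration) that expose the self-hosted Shoot's API server.
type Actuator interface {
	// Reconcile creates or updates the exposure resources and returns the
	// ingress endpoints to be reported in the resource's status.
	Reconcile(ctx context.Context, exposure *SelfHostedShootExposure) ([]corev1.LoadBalancerIngress, error)
	// Delete removes the exposure resources again.
	Delete(ctx context.Context, exposure *SelfHostedShootExposure) error
}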

Next Steps ​

Finish writing the Gardener Enhancement Proposal (GEP) and defend it in front of the Technical Steering Committee (TSC) for approval.

Code & Pull Requests ​

πŸ€– Tool-Enabled Agent For Shoots ​

We developed agents to support operators with end-user questions about Gardener or a specific Shoot cluster. They use LLMs to plan actions and invoke powerful tools: knowledge-base search, ticket-database search, access to Garden/Seed/Shoot clusters via bash/kubectl, and the ability to extract metrics, logs, and more. This allows an agent to drill deep and surface insights that would otherwise cost an operator a significant amount of time. Of course, LLMs won't always draw correct conclusions, but even then, the agent gives the operator a head start from which they can continue the investigation.

Conclusion: The tooling for the platform has been successfully implemented and is currently in use. However, it is not yet publicly available due to the need to scrub internal data from its knowledge sources.


ApeiroRA

EU and German government funding logos

Funded by the European Union – NextGenerationEU.

The views and opinions expressed are solely those of the author(s) and do not necessarily reflect the views of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.