Hack The Garden November 2025: Self-Hosted Shoots, Build Caches, and Networking Modernization
From November 24–28, 2025, the Gardener community gathered at Schlosshof in Schelklingen for another week of focused collaboration, organized by x-cellent. The full per-topic write-up is available on the community page, and the review meeting recording covers the highlights. Continue reading to find out more about the larger storylines that emerged!

🐶 Self-Hosted Shoots and "Eating Our Own Dog Food"
A major track of the hackathon advanced GEP-28 self-hosted shoot clusters, with the goal of running Gardener's own end-to-end (E2E) tests inside a single-node, self-hosted Shoot provisioned via gardenadm.
The team successfully ran gardenadm init inside a Docker container, an approach tracked as gind (Gardener in Docker, inspired by KinD). etcd-druid-managed etcd was made optional so the system can continue with a bootstrap etcd when no backup is configured in the Shoot manifest. Several networking and DNS issues were addressed along the way: NetworkPolicies for CoreDNS, registry hostnames now consistently exposed via docker compose, a cleaned-up provider-local Service controller in the traditional kind-based setup, and hard-coded IPs in the kind CoreDNS Corefile. Additional work ensured that multiple controllers — gardener-resource-manager, etcd-druid, vpa — do not conflict with each other, and that gardener-resource-manager and extensions remain in the host network.
Code lives across several PRs and branches, including a new --use-bootstrap-etcd flag for gardenadm init, registries-as-containers via docker compose, the provider-local cleanup, and the gind branch.
In parallel, GEP-28's API server exposure design progressed. A new SelfHostedShootExposure resource was introduced in the extensions.gardener.cloud/v1alpha1 API to abstract cloud-specific exposure (load balancers, kube-vip, …) into an extension model, with .spec.provider.workers[].controlPlane.exposure configurable on the Shoot. An Actuator interface with Reconcile and Delete methods was defined for extension controllers, and gardenlet will run a controller that watches control plane Nodes and updates .spec.endpoints[] with the latest Node addresses. Tracking issue #2906 collects the work; the next step is finalizing the GEP and presenting it to the Technical Steering Committee.
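To make the endpoint-tracking idea more tangible, here is a minimal sketch in Python (Gardener itself is written in Go). It is not taken from the GEP: the dictionary shapes, the `control_plane` marker, and the function name `endpoints_from_nodes` are assumptions for illustration, and the `Actuator` class only mirrors the two method names mentioned above.

```python
from typing import Protocol


class Actuator(Protocol):
    """Hypothetical Python analogue of an Actuator with Reconcile and Delete."""

    def reconcile(self, exposure: dict) -> None: ...

    def delete(self, exposure: dict) -> None: ...


def endpoints_from_nodes(nodes: list[dict]) -> list[dict]:
    """Collect the addresses of control plane Nodes, sorted for stable output."""
    endpoints = []
    for node in nodes:
        # Assumed marker; a real controller would select Nodes differently.
        if not node.get("control_plane"):
            continue
        for address in node.get("addresses", []):
            endpoints.append({"nodeName": node["name"], **address})
    return sorted(endpoints, key=lambda e: (e["nodeName"], e["type"], e["address"]))
```

A controller along these lines would recompute the list whenever a control plane Node changes and write the result back only if it differs from the current value, which is why the sketch sorts its output.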
🛜 Networking Modernization: Gateway API, nftables, and Calico Whisker
Following the Ingress NGINX retirement announcement, the team began evaluating the Gateway API as a replacement for ingress in the Garden runtime/Seed clusters. A first hurdle — Gardener's restrictive Istio defaultVirtualServiceExportTo mesh config — was identified and fixed by setting it to '.', which is required for the Istio ingress gateway to export services. A functional branch demonstrates Gateway API usage for Plutono, with HTTP basic authentication implemented via an EnvoyFilter and an external authorization server, the gardener-resource-manager network policy controller extended to handle HTTPRoute resources, and Gateway API and Istio resources served on the same port. Open follow-ups include native Istio external authorization, completing the translation of NGINX annotations, migrating DestinationRules to Gateway API traffic policy resources (like XBackendTrafficPolicy), and integrating the changes into both the operator and Shoot reconciliation flows. Code lives in a WIP branch.
Initial support for nftables mode in kube-proxy was implemented in gardener/gardener#13558, allowing operators to provision and test Shoot clusters with the modern, more efficient successor to legacy iptables (beta since Kubernetes 1.31, generally available since 1.33). Comprehensive testing under various loads is up next, with the long-term goal of making nftables the default for sufficiently recent Kubernetes versions.
The Calico story also moved forward. A working prototype integrated Calico Whisker for traffic monitoring and tracing, bringing Calico clusters closer to feature parity with Cilium's Hubble. The prototype reuses code from the tigera-operator to manage Calico Whisker and Calico Goldmane, handles the mTLS requirement between calico-node and calico-typha via the Gardener secrets manager, and includes the necessary network policies between extension, Shoot API servers, calico-node, and goldmane. Productization into the main Calico networking extension is the next step (WIP branch).
A separate effort enabled pulling gardener-node-agent through a registry mirror, breaking the chicken-and-egg problem during initial Node provisioning. A new provisionRelevant flag on a Mirror causes its configuration to be written into the provision OperatingSystemConfig, ensuring the mirror is available before gardener-node-agent starts. After bootstrap, the agent takes over and maintains the containerd config as before; CA bundles can also be configured per Mirror (PR #495).
🗃️ A Go Build Cache for Prow
The team tackled build and test caching for Gardener's Prow jobs with three goals: speed up time-to-feedback on PRs, reduce load on build clusters, and do so securely — presubmits must be able to read from the cache but never write to it, to keep untrusted PRs and broken jobs from polluting it.
After reviewing approaches from Istio (hostPath per node, lots of misses with autoscaling) and Kubermatic (cache archive download/upload around builds), the team settled on Go's new GOCACHEPROG interface together with the open-source saracen/gobuildcache, backed by Google Cloud Storage. This avoids uploading and downloading the whole cache, leverages Go's per-unit cache granularity, and benefits from free intra-region network traffic in GCS. Read-only and read-write GCP principals are federated from the Prow shoot cluster via Workload Identity Federation.
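The read-only versus read-write split can be illustrated with a simplified model of a cache helper. The Python sketch below does not implement the real GOCACHEPROG wire protocol or the saracen/gobuildcache API, and in the actual setup the restriction comes from the GCP principals rather than a flag. The request shape, the in-memory `store`, and the `read_only` switch are assumptions made for illustration.

```python
def handle(request: dict, store: dict, read_only: bool) -> dict:
    """Answer one simplified cache request against a dict-backed store."""
    command, key = request["command"], request["action_id"]
    if command == "get":
        if key in store:
            return {"miss": False, "body": store[key]}
        return {"miss": True}
    if command == "put":
        if read_only:
            # Presubmit mode: accept the request but never write to the shared cache.
            return {"stored": False}
        store[key] = request["body"]
        return {"stored": True}
    raise ValueError(f"unknown command: {command}")
```

In this model a presubmit job still profits from every entry a trusted job has written, while nothing it builds can end up in the shared store.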
The impact on the kind-based E2E job is striking: end-to-end wall-clock time stayed roughly the same (~85 minutes), but CPU time dropped from ~65 minutes across 12 cores to under 3 minutes across fewer than 2 cores — more than a 90% reduction in build-cluster CPU consumption.
real 85m20.914s # without cache
user 64m51.059s
sys 6m42.249s

real 82m53.159s # with cache
user 2m33.118s
sys 1m0.021s

Some workloads benefit less. go test only caches results when a limited flag set is used, and the ginkgo.junit-report=junit.xml flag we rely on for JUnit reports disables test result caching; in local experiments, removing it reduced unit-test time from ~40 minutes to under 5 minutes, a strong incentive to revisit our reporting setup.
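The CPU-time reduction quoted for the E2E job can be checked directly from the timing output above. This short Python sketch sums user and sys time for both runs; the helper names are ours, and the only inputs are the four durations from the post.

```python
import re


def to_seconds(duration: str) -> float:
    """Convert a `time`-style duration such as '64m51.059s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_seconds(user: str, system: str) -> float:
    """CPU time consumed by a run is the sum of its user and sys time."""
    return to_seconds(user) + to_seconds(system)


without_cache = cpu_seconds("64m51.059s", "6m42.249s")
with_cache = cpu_seconds("2m33.118s", "1m0.021s")
reduction = 1 - with_cache / without_cache
print(f"CPU time: {without_cache:.0f}s -> {with_cache:.0f}s ({reduction:.1%} less)")
```

Counting user and sys time together, the reduction works out to roughly 95%, comfortably above the "more than 90%" stated above.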
📦 Gardener API Types as a Standalone Go Module
Today, importing Gardener API types from gardener/gardener drags in the full dependency graph of the main module — a long-standing pain point for extensions and other API-only consumers. The team opened PR #13536 to extract pkg/apis into a dedicated Go module with a strictly minimal dependency set (k8s.io/{api,apimachinery,utils}), enforced via .import-restrictions. The plan is to release the API module alongside the main module using Go submodule tags (similar to gardener/cc-utils#1382) and use Go workspaces in gardener/gardener for convenient parallel development. This addresses long-standing issue #2871.
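The "strictly minimal dependency set" boils down to an allowlist check on import paths. The following Python sketch shows that idea only; it does not reproduce the actual .import-restrictions file format or its tooling. The module path is an assumption, and the standard-library test (no dot in the first path element) is a common heuristic rather than an official rule.

```python
ALLOWED_PREFIXES = ("k8s.io/api", "k8s.io/apimachinery", "k8s.io/utils")
MODULE_PATH = "github.com/gardener/gardener/pkg/apis"  # assumed module path


def has_prefix(import_path: str, prefix: str) -> bool:
    """Match whole path elements so k8s.io/api does not match k8s.io/apiserver."""
    return import_path == prefix or import_path.startswith(prefix + "/")


def is_allowed(import_path: str) -> bool:
    """Allow the standard library, the module itself, and the allowlisted modules."""
    first_element = import_path.split("/", 1)[0]
    if "." not in first_element:  # heuristic: standard library packages
        return True
    if has_prefix(import_path, MODULE_PATH):
        return True
    return any(has_prefix(import_path, prefix) for prefix in ALLOWED_PREFIXES)


def violations(imports: list[str]) -> list[str]:
    """Return the sorted import paths that fall outside the allowlist."""
    return sorted(path for path in imports if not is_allowed(path))
```

For example, `violations(["fmt", "k8s.io/api/core/v1", "k8s.io/apiserver/pkg/server"])` flags only the `k8s.io/apiserver` import, because prefixes are matched on whole path elements.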
In the same modernization spirit, two PRs together eliminated VGOPATH from the main repository: #13545 moved tool dependencies to the standard Go modules tools directory pattern, and #13556 removed VGOPATH and hack/vgopath-setup.sh, replacing them with module-aware Go commands. The result: a simpler bootstrap for new contributors and one less piece of non-standard tooling to explain.
💾 Backups, Restore, and Disaster Recovery
Two efforts targeted operational resilience around etcd backups.
A small prototype enables relocating etcd backups, today only achievable via a disruptive Seed migration with downtime for all owners. The proposed change makes the relevant fields mutable so a StatefulSet redeployment is triggered during the next Shoot reconciliation; the new etcd pod continues with PVC data and immediately writes a full snapshot to the new bucket, requiring no changes to the backup-restore sidecar. The prototype also relaxes the requirement that the bucket name be derived from the Seed UID. Tracked in issue #13579 with a prototype branch.
In parallel, work began on a force-restore operation annotation for Shoots to give operators a clear way to force a restore from existing backups during disaster scenarios. Required gardenlet changes around backup-copy logic and missing source BackupEntrys were identified (issue #12952), alongside a related blocker in gardener-extension-provider-openstack#1217.
Enriching Shoot Logs with Istio Access Logs
The Istio ingress gateway emits valuable access logs — particularly with L7 load balancing — but today they are only available to Seed operators. For Shoot owners, especially those running with restricted access (e.g. via the ACL extension), this limits debugging visibility.
The team implemented enrichment of the Shoot log stream with Istio access logs, integrated directly into the main Gardener repository after an initial attempt in the logging extension. The same approach can be extended to other Seed components that impact Shoot operations.
📈 Scale-Out Tests with Hollow Gardenlets
Inspired by Kubemark's hollow Nodes, the team prototyped hollow gardenlets that register themselves with the Garden, report ready, and mark their scheduled Shoots as healthy without ever spawning real control planes. A simulated request load against the Gardener API server, parameterized by the number of Shoots per Seed, models a realistic scenario.
On a local operator setup running on an M4 Mac with 48 GB of RAM, ~200 hollow gardenlets could be scheduled before resource exhaustion. A 10-hour run aiming for 50 Shoots/minute and 1 Seed every 3 minutes reached 800 Seeds and 21,600 Shoots before melting down around 4 a.m., with unequal request balancing on the kube-apiservers (no L7 load balancing yet) as a notable observation. While not productized yet, the WIP implementation is a useful starting point for future scalability work.

🗽 Talos as a Node OS: A Feasibility Study
The team evaluated Talos OS — minimal, immutable, API-driven — as a worker Node OS for Gardener Shoots. A local-provider PoC successfully bootstrapped Talos Nodes as Pods, joined them to the cluster using bootstrap tokens and a CA-configured kubelet, disabled the Talos-native CNI in favor of Gardener's, resolved an SNI issue by disabling KubePrism, and deployed trustd in the control plane with Istio routing.
The conclusion: technically feasible, but not being pursued to production at this time. Productization would require evolving OperatingSystemConfig and gardener-node-agent to abstract away systemd (Talos has no shell, no SSH, no systemd), rewriting extensions that inject systemd units to use containerized sidecars or DaemonSets, and standardizing how the Talos API daemon (apid) is exposed to operators for debugging.
🤖 A Tool-Enabled Agent for Shoots
Finally, the team developed agents that support operators in answering end-user questions about Gardener and specific Shoot clusters. The agents combine LLMs for planning with concrete tools: knowledge-base search, ticket-database search, Garden/Seed/Shoot access via bash/kubectl, and metrics and log extraction. This lets the agent drill into details that would otherwise consume significant operator time.
The platform tooling is implemented and already in use internally; public availability is gated on scrubbing internal data from the knowledge sources.
⚖️ Smaller Wins
Several smaller, focused workstreams improved day-to-day operations:
- Real load balancer controller for provider-local: an IP-allocation prototype assigns IPs from 172.18.255.64/26 to Services of type LoadBalancer and uses Docker port mapping to expose containerized workloads externally, a foundation for making ManagedSeed tests succeed (branch).
- Respect terminating Nodes in load balancing: applying the ToBeDeletedByClusterAutoscaler taint when a Machine is marked for deletion now reliably triggers both cloud-provider reconciliation and kube-proxy health-check updates, preventing connections to terminating Nodes (PR #1054).
- SIGINFO (^T) handling in gardenadm/flow: a new CommandLineProgressReporter integrates with the flow package to print the currently executing step, giving developers visual confirmation of progress in long-running multi-step operations (PR #13565).
- MCM in-place updates for underlying machines: a feature-branch prototype added an UpdateMachine method to the MCM provider driver interface, enabling infrastructure-level updates (such as OS images on memory-booted bare metal) without full node recreation. The team concluded that the current changes are too invasive for upstream contribution and will revisit the rolling-update approach instead.
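The IP-allocation idea behind the provider-local load balancer controller can be sketched in a few lines of Python. This is not the prototype's code: the class name, the first-free allocation strategy, and the decision to skip the network and broadcast addresses are assumptions. Only the 172.18.255.64/26 range comes from the item above.

```python
import ipaddress


class LoadBalancerIPAllocator:
    """Hand out one IP per Service from a fixed range, reusing released IPs."""

    def __init__(self, cidr: str = "172.18.255.64/26") -> None:
        # hosts() omits the network and broadcast addresses of the range.
        self._pool = list(ipaddress.ip_network(cidr).hosts())
        self._assigned = {}  # Service key -> allocated address

    def allocate(self, service: str) -> str:
        """Return the Service's IP, allocating the first free one if needed."""
        if service in self._assigned:
            return str(self._assigned[service])
        in_use = set(self._assigned.values())
        for ip in self._pool:
            if ip not in in_use:
                self._assigned[service] = ip
                return str(ip)
        raise RuntimeError("no free IP left in the load balancer range")

    def release(self, service: str) -> None:
        """Free the IP of a deleted Service so that it can be handed out again."""
        self._assigned.pop(service, None)
```

Allocation is idempotent per Service key, which matters for a controller that reconciles the same Service repeatedly. With this strategy a /26 leaves 62 usable addresses.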
🌷 Closing Thoughts
This event made strong progress on multi-hackathon storylines — self-hosted shoots, networking modernization, scalability — while delivering immediately useful operational wins like the Prow Go build cache and Istio access log enrichment. Several topics are now ready for cleanup, GEPs, and PR submission.
The next hackathon is already on the horizon. If you want to join, drop by the Gardener Slack (#hack-the-garden). See you there! ✌️
