GEP-0039: Live Control Plane Migration (Live CPM)
- 📌 GEP Tracking Issue: https://github.com/gardener/enhancements/issues/39
- 📖 GEP Link: https://github.com/gardener/enhancements/tree/main/geps/0039-live-control-plane-migration
- ✍🏻 Author(s): @acumino (Sonu Kumar Singh), @ary1992 (Ashish Ranjan Yadav), @seshachalam-yv (Seshachalam), @shafeeqes (Shafeeque E S)
- 🗓️ Presentation: 2026-02-09, 13:00 - 14:00 CET
- 🎥 Recording: https://youtu.be/DdU8SNNf23o
- 👨‍⚖️ Decisions:
- No major technical decisions were finalized in this session.
- Agreement to:
- Revamp and update the proposal document, addressing the open questions listed below.
- Clearly document assumptions, risks, and guarantees.
- Hold a follow-up Technical Steering session in a few weeks to re-evaluate the updated proposal.
- Key Discussion Points & Open Questions
- Failure Handling & Recovery
- What additional failure modes exist beyond those documented (e.g., ETCD scale-up failures)?
- Can we always fall back to normal CPM, or are there cases requiring manual intervention?
- Assumption: all failures except those explicitly documented should be retryable / recoverable.
- Gardenlet restart behavior:
- Current proposal resumes from the failed step.
- This differs from usual reconciliation semantics (restart from beginning).
- Question: could this lead to irrecoverable or inconsistent states?
- ETCD-related Topics
- 6-member ETCD risk:
- 3 members per seed implies permanent quorum loss if seed-to-seed connectivity is lost: quorum for 6 members is 4, so neither seed can reach it on its own.
- ETCD APIs:
- Separate APIs exist for member name prefix vs. externally managed members due to uniqueness constraints.
- Question: can these be harmonized, or would that increase complexity?
- ETCD member removal:
- GEP-28 (SHSC) requires this as well.
- Existing plan:
- `etcd-druid` removes members via HTTP calls to the backup-restore sidecar.
- Question: can Live CPM reuse this approach instead of introducing `EtcdOpsTask`?
- ETCD exposure:
- Proposal doc is outdated and will be updated to reflect Istio-based exposure.
- Open question: how are `DNSRecord`s constructed in this model?
- Networking & Connectivity
- Is seed-to-seed connectivity guaranteed at all times?
- VPN setup:
- Why is an additional VPN seed server configuration needed on the destination seed?
- Can we deploy directly in the “target” configuration from the start?
- Scheduling & Latency Constraints
- Scheduler `ConfigMap` distances are weights, not necessarily latency in ms.
- How is the “distant region” prevention for LCPM enforced?
- The proposal should explain how the 180 ms latency threshold was derived.
- Control Plane Components & Coordination
- Lease management:
- Can controllers simply be recreated in the destination seed instead of running in both seeds?
- Would this simplify the implementation / should we change the proposal?
- Gardenlet coordination:
- Current design uses back-and-forth updates via `.status.liveMigration`.
- Question: would conditions be a clearer and more robust coordination mechanism?
- Gardenlet versions:
- How is it enforced that gardenlets in both seeds run the same version?
- What happens if a gardenlet upgrade occurs while a migration is already in progress?
- Autoscaling & Resource Management
- VPA recommendations:
- Not yet considered.
- Open question: do `VPACheckpoint`s need to be transferred as part of migration?
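
The 6-member ETCD risk raised in the discussion comes down to majority arithmetic. The following is a minimal sketch of that arithmetic, assuming all six members are voting members and that a lost seed-to-seed link splits them 3/3. The function names `quorum` and `survives_partition` are hypothetical helpers for illustration only; they are not part of `etcd-druid` or the proposal.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def survives_partition(members_per_side: list[int]) -> list[bool]:
    """For each side of a network partition, report whether it still holds quorum.

    Assumption: every member is a voting member and stays healthy; only the
    link between the sides is lost.
    """
    needed = quorum(sum(members_per_side))
    return [side >= needed for side in members_per_side]


if __name__ == "__main__":
    # Steady state: 3 members in a single seed.
    print("quorum for 3 members:", quorum(3))
    # During Live CPM: 3 members in each of the two seeds.
    print("quorum for 6 members:", quorum(6))
    # Seed-to-seed connectivity lost with a 3/3 split.
    print("3/3 split keeps quorum per side:", survives_partition([3, 3]))
```

Under these assumptions an even 3/3 split leaves neither seed with the 4 members it needs, whereas an uneven split such as 4/2 would leave one side writable.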