그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그 그
7 minute read
Topology-Aware Traffic Routing
The enablement of highly available shoot control-planes requires multi-zone seed clusters. A garden runtime cluster can also be a multi-zone cluster. The topology-aware routing is introduced to reduce costs and to improve network performance by avoiding the cross availability zone traffic, if possible. The cross availability zone traffic is charged by the cloud providers and it comes with higher latency compared to the traffic within the same zone. The topology-aware routing feature enables topology-aware routing for
Services deployed in a seed or garden runtime cluster. For the clients consuming these topology-aware services,
kube-proxy favors the endpoints which are located in the same zone where the traffic originated from. In this way, the cross availability zone traffic is avoided.
How it works
The topology-aware routing feature relies on the Kubernetes feature
EndpointSlice Hints Mutating Webhook
The component that is responsible for providing hints in the EndpointSlices resources is the kube-controller-manager, in particular this is the EndpointSlice controller. However, there are several drawbacks with the TopologyAwareHints feature that don’t allow us to use it in its native way:
The algorithm in the EndpointSlice controller is based on a CPU-balance heuristic. From the TopologyAwareHints documentation:
The controller allocates a proportional amount of endpoints to each zone. This proportion is based on the allocatable CPU cores for nodes running in that zone. For example, if one zone had 2 CPU cores and another zone only had 1 CPU core, the controller would allocate twice as many endpoints to the zone with 2 CPU cores.
In case it is not possible to achieve a balanced distribution of the endpoints, as a safeguard mechanism the controller removes hints from the EndpointSlice resource. In our setup, the clients and the servers are well-known and usually the traffic a component receives does not depend on the zone’s allocatable CPU. Many components deployed by Gardener are scaled automatically by VPA. In case of an overload of a replica, the VPA should provide and apply enhanced CPU and memory resources. Additionally, Gardener uses the cluster-autoscaler to upscale/downscale Nodes dynamically. Hence, it is not possible to ensure a balanced allocatable CPU across the zones.
The TopologyAwareHints feature does not work at low-endpoint counts. It falls apart for a Service with less than 10 Endpoints.
Hints provided by the EndpointSlice controller are not deterministic. With cluster-autoscaler running and load increasing, hints can be removed in the next moment. There is no option to enforce the zone-level topology.
For more details, see the following issue kubernetes/kubernetes#113731.
To circumvent these issues with the EndpointSlice controller, a mutating webhook in the gardener-resource-manager assigns hints to EndpointSlice resources. For each endpoint in the EndpointSlice, it sets the endpoint’s hints to the endpoint’s zone. The webhook overwrites the hints provided by the EndpointSlice controller in kube-controller-manager. For more details, see the webhook’s documentation.
By default, with kube-proxy running in
iptables mode, traffic is distributed randomly across all endpoints, regardless of where it originates from. In a cluster with 3 zones, traffic is more likely to go to another zone than to stay in the current zone.
With the topology-aware routing feature, kube-proxy filters the endpoints it routes to based on the hints in the EndpointSlice resource. In most of the cases, kube-proxy will prefer the endpoint(s) in the same zone. For more details, see the Kubernetes documentation.
How to make a Service topology-aware?
To make a Service topology-aware, the following annotation and label have to be added to the Service:
apiVersion: v1 kind: Service metadata: annotations: service.kubernetes.io/topology-aware-hints: "auto" labels: endpoint-slice-hints.resources.gardener.cloud/consider: "true"
Note: In Kubernetes 1.27 the
service.kubernetes.io/topology-aware-hints=autoannotation is deprecated in favor of the newly introduced
service.kubernetes.io/topology-mode=auto. When the runtime cluster’s K8s version is >= 1.27, use the
service.kubernetes.io/topology-mode=autoannotation. For more details, see the corresponding upstream PR.
service.kubernetes.io/topology-aware-hints=auto annotation is needed for kube-proxy. One of the prerequisites on kube-proxy side for using topology-aware routing is the corresponding Service to be annotated with the
service.kubernetes.io/topology-aware-hints=auto. For more details, see the following kube-proxy function.
endpoint-slice-hints.resources.gardener.cloud/consider=true label is needed for gardener-resource-manager to prevent the EndpointSlice hints mutating webhook from selecting all EndpointSlice resources but only the ones that are labeled with the consider label.
The Gardener extensions can use this approach to make a Service they deploy topology-aware.
Prerequisites for making a Service topology-aware:
- The Pods backing the Service should be spread on most of the available zones. This constraint should be ensured with appropriate scheduling constraints (topology spread constraints, (anti-)affinity). Enabling the feature for a Service with a single backing Pod or Pods all located in the same zone does not lead to a benefit.
- The component should be scaled up by
VerticalPodAutoscaler. In case of an overload (a large portion of the of the traffic is originating from a given zone), the
VerticalPodAutoscalershould provide better resource recommendations for the overloaded backing Pods.
- Consider the
Note: The topology-aware routing feature is considered as alpha feature. Use it only for evaluation purposes.
Topology-aware Services in the Seed cluster
etcd-main-client and etcd-events-client
etcd-events-client Services are topology-aware. They are consumed by the kube-apiserver.
kube-apiserver Service is topology-aware. It is consumed by the controllers running in the Shoot control plane.
istio-ingressgatewaycomponent routes traffic in topology-aware manner - if possible, it routes traffic to the target
kube-apiserverPods in the same zone. If there is no healthy
kube-apiserverPod available in the same zone, the traffic is routed to any of the healthy Pods in the other zones. This behaviour is unconditionally enabled.
gardener-resource-manager Service that is part of the Shoot control plane is topology-aware. The resource-manager serves webhooks and the Service is consumed by the kube-apiserver for the webhook communication.
vpa-webhook Service that is part of the Shoot control plane is topology-aware. It is consumed by the kube-apiserver for the webhook communication.
Topology-aware Services in the garden runtime cluster
virtual-garden-etcd-main-client and virtual-garden-etcd-events-client
virtual-garden-etcd-events-client Services are topology-aware.
virtual-garden-etcd-main-client is consumed by
virtual-garden-etcd-events-client is consumed by
virtual-garden-kube-apiserver Service is topology-aware. It is consumed by
gardener-admission-controller, extension admission components,
gardener-dashboard and other components.
Note: Unlike the other Services, the
virtual-garden-kube-apiserverService is of type LoadBalancer. In-cluster components consuming the
virtual-garden-kube-apiserverService by its Service name will have benefit from the topology-aware routing. However, the TopologyAwareHints feature cannot help with external traffic routed to load balancer’s address - such traffic won’t be routed in a topology-aware manner and will be routed according to the cloud-provider specific implementation.
gardener-apiserver Service is topology-aware. It is consumed by
virtual-garden-kube-apiserver. The aggregation layer in
virtual-garden-kube-apiserver proxies requests sent for the Gardener API types to the
gardener-admission-controller Service is topology-aware. It is consumed by
gardener-apiserver for the webhook communication.
How to enable the topology-aware routing for a Seed cluster?
For a Seed cluster the topology-aware routing functionality can be enabled in the Seed specification:
apiVersion: core.gardener.cloud/v1beta1 kind: Seed # ... spec: settings: topologyAwareRouting: enabled: true
The topology-aware routing setting can be only enabled for a Seed cluster with more than one zone.
gardenlet enables topology-aware Services only for Shoot control planes with failure tolerance type
.spec.controlPlane.highAvailability.failureTolerance.type=zone). Control plane Pods of non-HA Shoots and HA Shoots with failure tolerance type
node are pinned to single zone. For more details, see High Availability Of Deployed Components.
⚠️ For K8s < 1.24 Seed clusters, the topology-aware routing setting requires the Kubernetes
TopologyAwareHints feature gate to be enabled for kube-apiserver, kube-controller-manager and kube-proxy. This is required because the
TopologyAwareHints feature gate is disabled by default in K8s < 1.24. When
TopologyAwareHints is disabled, the kube-apiserver does not allow anything to be persisted in the
.endpoints.hints field in the EndpointSlice resource. Also, the kube-controller-manager removes the hints, hence kube-proxy is not using topology-aware routing.
How to enable the topology-aware routing for a garden runtime cluster?
For a garden runtime cluster the topology-aware routing functionality can be enabled in the Garden resource specification:
apiVersion: operator.gardener.cloud/v1alpha1 kind: Garden # ... spec: runtimeCluster: settings: topologyAwareRouting: enabled: true
The topology-aware routing setting can be only enabled for a garden runtime cluster with more than one zone.
⚠️ For K8s < 1.24 garden runtime clusters, the topology-aware routing setting requires the Kubernetes
TopologyAwareHints feature gate to be enabled for kube-apiserver, kube-controller-manager and kube-proxy. For more details, see the above section.