# Topology-Aware Traffic Routing

## Motivation

Highly available shoot control planes require multi-zone seed clusters. A garden runtime cluster can also be a multi-zone cluster. Topology-aware routing is introduced to reduce costs and to improve network performance by avoiding cross availability zone traffic where possible. Cross availability zone traffic is charged by the cloud providers and comes with higher latency compared to traffic within the same zone. The topology-aware routing feature enables topology-aware routing for `Service`s deployed in a seed or garden runtime cluster. For clients consuming these topology-aware services, kube-proxy favors the endpoints located in the same zone the traffic originated from. In this way, cross availability zone traffic is avoided.
## How it works

The topology-aware routing feature relies on the Kubernetes feature `TopologyAwareHints`.
### EndpointSlice Hints Mutating Webhook

The component responsible for providing hints in the `EndpointSlice` resources is the kube-controller-manager, in particular its EndpointSlice controller. However, the `TopologyAwareHints` feature has several drawbacks that don't allow us to use it in its native way:

- The algorithm in the EndpointSlice controller is based on a CPU-balance heuristic. From the `TopologyAwareHints` documentation:

  > The controller allocates a proportional amount of endpoints to each zone. This proportion is based on the allocatable CPU cores for nodes running in that zone. For example, if one zone had 2 CPU cores and another zone only had 1 CPU core, the controller would allocate twice as many endpoints to the zone with 2 CPU cores.

  In case it is not possible to achieve a balanced distribution of the endpoints, as a safeguard mechanism the controller removes the hints from the `EndpointSlice` resource. In our setup, the clients and the servers are well-known, and the traffic a component receives usually does not depend on the zone's allocatable CPU. Many components deployed by Gardener are scaled automatically by VPA: in case a replica is overloaded, the VPA should provide and apply enhanced CPU and memory recommendations. Additionally, Gardener uses the cluster-autoscaler to scale Nodes up and down dynamically. Hence, it is not possible to ensure a balanced allocatable CPU across the zones.
- The `TopologyAwareHints` feature does not work at low endpoint counts. It falls apart for a Service with fewer than 10 endpoints.
- Hints provided by the EndpointSlice controller are not deterministic. With cluster-autoscaler running and load increasing, hints can be removed at any moment. There is no option to enforce the zone-level topology.

For more details, see kubernetes/kubernetes#113731.
To circumvent these issues with the EndpointSlice controller, a mutating webhook in the gardener-resource-manager assigns hints to EndpointSlice resources. For each endpoint in the EndpointSlice, it sets the endpoint’s hints to the endpoint’s zone. The webhook overwrites the hints provided by the EndpointSlice controller in kube-controller-manager. For more details, see the webhook’s documentation.
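For illustration, a minimal sketch of how an `EndpointSlice` could look after the webhook has set the hints. The slice name, addresses, and zone names are hypothetical; the relevant part is that each endpoint's `hints.forZones` matches its own `zone`:

```yaml
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: kube-apiserver-abc12   # illustrative name
addressType: IPv4
endpoints:
- addresses:
  - "10.0.1.5"
  zone: europe-1a
  hints:
    forZones:
    - name: europe-1a          # hint set to the endpoint's own zone by the webhook
- addresses:
  - "10.0.2.7"
  zone: europe-1b
  hints:
    forZones:
    - name: europe-1b
ports:
- port: 443
  protocol: TCP
```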
### kube-proxy

By default, with kube-proxy running in `iptables` mode, traffic is distributed randomly across all endpoints, regardless of where it originates. In a cluster with 3 zones, traffic is more likely to go to another zone than to stay in the current one.

With the topology-aware routing feature, kube-proxy filters the endpoints it routes to based on the hints in the `EndpointSlice` resource. In most cases, kube-proxy prefers the endpoint(s) in the same zone. For more details, see the Kubernetes documentation.
## How to make a Service topology-aware?

To make a Service topology-aware, the following annotation and label have to be added to the Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"
  labels:
    endpoint-slice-hints.resources.gardener.cloud/consider: "true"
```
Note: In Kubernetes 1.27 the `service.kubernetes.io/topology-aware-hints=auto` annotation is deprecated in favor of the newly introduced `service.kubernetes.io/topology-mode=auto` annotation. When the runtime cluster's Kubernetes version is >= 1.27, use the `service.kubernetes.io/topology-mode=auto` annotation. For more details, see the corresponding upstream PR.
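As a sketch, on clusters with Kubernetes >= 1.27 the Service manifest from above would then carry the newer annotation, while the label stays the same:

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.kubernetes.io/topology-mode: "auto"
  labels:
    endpoint-slice-hints.resources.gardener.cloud/consider: "true"
```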
The `service.kubernetes.io/topology-aware-hints=auto` annotation is needed by kube-proxy: one of the prerequisites on the kube-proxy side for using topology-aware routing is that the corresponding Service is annotated with `service.kubernetes.io/topology-aware-hints=auto`. For more details, see the corresponding kube-proxy function.
The `endpoint-slice-hints.resources.gardener.cloud/consider=true` label is needed by gardener-resource-manager so that the EndpointSlice hints mutating webhook does not select all EndpointSlice resources, but only the ones labeled with the consider label.
The Gardener extensions can use this approach to make a Service they deploy topology-aware.
Prerequisites for making a Service topology-aware:

- The Pods backing the Service should be spread across most of the available zones. This should be ensured with appropriate scheduling constraints (topology spread constraints, (anti-)affinity); see the sketch after this list. Enabling the feature for a Service with a single backing Pod, or with Pods all located in the same zone, does not lead to a benefit.
- The component should be scaled by `VerticalPodAutoscaler`. In case of an overload (a large portion of the traffic originating from a given zone), the `VerticalPodAutoscaler` should provide better resource recommendations for the overloaded backing Pods.
- Consider the `TopologyAwareHints` constraints.
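As a sketch of the first prerequisite, a hypothetical Deployment named `example-component` could spread its Pods across zones with a topology spread constraint like the following (all names and the image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-component          # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-component
  template:
    metadata:
      labels:
        app: example-component
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # enforce the spread across zones
        labelSelector:
          matchLabels:
            app: example-component
      containers:
      - name: example-component
        image: example.registry/example-component:latest   # illustrative image
```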
Note: The topology-aware routing feature is considered an alpha feature. Use it only for evaluation purposes.
## Topology-aware Services in the Seed cluster

### etcd-main-client and etcd-events-client

The `etcd-main-client` and `etcd-events-client` Services are topology-aware. They are consumed by the kube-apiserver.
### kube-apiserver

The `kube-apiserver` Service is topology-aware. It is consumed by the controllers running in the Shoot control plane.
Note: The `istio-ingressgateway` component routes traffic in a topology-aware manner - if possible, it routes traffic to the target `kube-apiserver` Pods in the same zone. If there is no healthy `kube-apiserver` Pod available in the same zone, the traffic is routed to any of the healthy Pods in the other zones. This behaviour is unconditionally enabled.
### gardener-resource-manager

The `gardener-resource-manager` Service that is part of the Shoot control plane is topology-aware. The resource-manager serves webhooks, and the Service is consumed by the kube-apiserver for the webhook communication.
### vpa-webhook

The `vpa-webhook` Service that is part of the Shoot control plane is topology-aware. It is consumed by the kube-apiserver for the webhook communication.
## Topology-aware Services in the garden runtime cluster

### virtual-garden-etcd-main-client and virtual-garden-etcd-events-client

The `virtual-garden-etcd-main-client` and `virtual-garden-etcd-events-client` Services are topology-aware. `virtual-garden-etcd-main-client` is consumed by `virtual-garden-kube-apiserver` and `gardener-apiserver`; `virtual-garden-etcd-events-client` is consumed by `virtual-garden-kube-apiserver`.
### virtual-garden-kube-apiserver

The `virtual-garden-kube-apiserver` Service is topology-aware. It is consumed by `virtual-garden-kube-controller-manager`, `gardener-controller-manager`, `gardener-scheduler`, `gardener-admission-controller`, extension admission components, `gardener-dashboard`, and other components.
Note: Unlike the other Services, the `virtual-garden-kube-apiserver` Service is of type `LoadBalancer`. In-cluster components consuming the `virtual-garden-kube-apiserver` Service by its Service name will benefit from the topology-aware routing. However, the TopologyAwareHints feature cannot help with external traffic routed to the load balancer's address - such traffic won't be routed in a topology-aware manner and will be routed according to the cloud-provider specific implementation.
### gardener-apiserver

The `gardener-apiserver` Service is topology-aware. It is consumed by `virtual-garden-kube-apiserver`. The aggregation layer in `virtual-garden-kube-apiserver` proxies requests sent for the Gardener API types to the `gardener-apiserver`.
### gardener-admission-controller

The `gardener-admission-controller` Service is topology-aware. It is consumed by `virtual-garden-kube-apiserver` and `gardener-apiserver` for the webhook communication.
## How to enable the topology-aware routing for a Seed cluster?

For a Seed cluster, the topology-aware routing functionality can be enabled in the Seed specification:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Seed
# ...
spec:
  settings:
    topologyAwareRouting:
      enabled: true
```
The topology-aware routing setting can only be enabled for a Seed cluster with more than one zone.
gardenlet enables topology-aware Services only for Shoot control planes with failure tolerance type `zone` (`.spec.controlPlane.highAvailability.failureTolerance.type=zone`). Control plane Pods of non-HA Shoots and of HA Shoots with failure tolerance type `node` are pinned to a single zone. For more details, see High Availability Of Deployed Components.
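For illustration, a minimal sketch of the relevant part of such a Shoot specification, based on the field path mentioned above:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
# ...
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: zone   # required for gardenlet to enable topology-aware Services
```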
## How to enable the topology-aware routing for a garden runtime cluster?

For a garden runtime cluster, the topology-aware routing functionality can be enabled in the Garden resource specification:

```yaml
apiVersion: operator.gardener.cloud/v1alpha1
kind: Garden
# ...
spec:
  runtimeCluster:
    settings:
      topologyAwareRouting:
        enabled: true
```
The topology-aware routing setting can only be enabled for a garden runtime cluster with more than one zone.