Gardener Enhances Observability with OpenTelemetry Integration for Logging

Gardener is advancing its observability capabilities by integrating OpenTelemetry, starting with log collection and processing. This strategic move, outlined in GEP-34: OpenTelemetry Operator And Collectors, lays the groundwork for a more standardized, flexible, and powerful observability framework in line with Gardener’s Observability 2.0 vision.

The Drive Towards Standardization

Gardener’s previous observability stack, though effective, utilized vendor-specific formats and protocols. This presented challenges in extending components and integrating with diverse external systems. The adoption of OpenTelemetry addresses these limitations by aligning Gardener with open standards, enhancing interoperability, and paving the way for future innovations like unified visualization, comprehensive tracing support and even LLM integrations via MCP (Model Context Propagation) enabled services.

Core Components: Operator and Collectors

The initial phase of this integration introduces two key OpenTelemetry components into Gardener-managed clusters:

  • OpenTelemetry Operator: Deployed on seed clusters (specifically in the garden namespace using ManagedResources), the OpenTelemetry Operator for Kubernetes will manage the lifecycle of OpenTelemetry Collector instances across shoot control planes. Its deployment follows a similar pattern to the existing Prometheus and Fluent Bit operators and occurs during the Seed reconciliation flow.
  • OpenTelemetry Collectors: A dedicated OpenTelemetry Collector instance will be provisioned for each shoot control plane namespace (e.g., shoot--project--name). These collectors, managed as Deployments by the OpenTelemetry Operator via an OpenTelemetryCollector Custom Resource created during Shoot reconciliation, are responsible for receiving, processing, and exporting observability data, with an initial focus on logs.

Key Changes and Benefits for Logging

  • Standardized Log Transport: Logs from various sources will now be channeled through the OpenTelemetry Collector.
    • Shoot Node Log Collection: The existing valitail systemd service on shoot nodes is being replaced by an OpenTelemetry Collector. This new collector will gather systemd logs (e.g., from kernel, kubelet.service, containerd.service) with parity to valitail’s previous functionality and forward them to the OpenTelemetry Collector instance residing in the shoot control plane.
    • Fluent Bit Integration: Existing Fluent Bit instances, which act as log shippers on seed clusters, will be configured to forward logs to the OpenTelemetry Collector’s receivers. This ensures continued compatibility with the Vali-based setup previously established by GEP-19.
  • Backend Agility: While initially the OpenTelemetry Collector will be configured to use its Loki exporter to send logs to the existing Vali backend, this architecture introduces significant flexibility. It allows Gardener to switch to any OpenTelemetry-compatible backend in the future, with plans to eventually migrate to Victoria-Logs.
  • Phased Rollout: The transition to OpenTelemetry is designed as a phased approach. Existing observability tools like Vali, Fluent Bit, and Prometheus will be gradually integrated and some backends such as Vali will be replaced.
  • Foundation for Future Observability: Although this GEP primarily targets logging, it critically establishes the foundation for incorporating other observability signals, such as metrics and traces, into the OpenTelemetry framework. Future enhancements may include:
    • Utilizing the OpenTelemetry Collector on shoot nodes to also scrape and process metrics.
    • Replacing the current event logger component with the OpenTelemetry Collector’s k8s-event receiver within the shoot’s OpenTelemetry Collector instance.

Explore Further

This integration marks a significant step in Gardener’s observability journey, promising a more robust and adaptable system.