PingCAP’s Experience in Implementing their Managed TiDB Service with Gardener

Gardener is showing successful collaboration with its growing community of contributors and adopters. With this come some success stories, including PingCAP using Gardener to implement its managed service.

About PingCAP and its TiDB Cloud

PingCAP started in 2015, when three seasoned infrastructure engineers working at leading Internet companies got sick and tired of the way databases were managed, scaled and maintained. Seeing no good solution on the market, they decided to build their own - the open-source way. With the help of a first-class team and hundreds of contributors from around the globe, PingCAP is building a distributed NewSQL, hybrid transactional and analytical processing (HTAP) database.

Its flagship project, TiDB, is a cloud-native distributed SQL database with MySQL compatibility, and one of the most popular open-source database projects - with 23.5K+ stars and 400+ contributors. Its sister project TiKV is a Cloud Native Interactive Landscape project.

PingCAP envisioned their managed TiDB service, known as TiDB Cloud, to be multi-tenant, secure, cost-efficient, and to be compatible with different cloud providers. As a result, the company turned to Gardener to build their managed TiDB cloud service offering.

TiDB Cloud Beta Preview

Limitations with other public managed Kubernetes services

Previously, PingCAP encountered issues while using other public managed K8s cluster services, to develop the first version of its TiDB Cloud. Their worst pain point was that they felt helpless when encountering certain malfunctions. PingCAP wasn’t able to do much to resolve these issues, except waiting for the providers’ help. More specifically, they experienced problems due to cloud-provider specific Kubernetes system upgrades, delays in the support response (which could be avoided in exchange of a costly support fee), and no control over when things got fixed.

There was also a lot of cloud-specific integration work needed to follow a multi-cloud strategy, which proved to be expensive both to produce and maintain. With one of these managed K8s services, you would have to integrate the instance API, as opposed to a solution like Gardener, which provides a unified API for all clouds. Such a unified API eliminates the need to worry about cloud specific-integration work altogether.

Why PingCAP chose Gardener to build TiDB Cloud

“Gardener has similar concepts to Kubernetes. Each Kubernetes cluster is just like a Kubernetes pod, so the similar concepts apply, and the controller pattern makes Gardener easy to manage. It was also easy to extend, as the team was already very familiar with Kubernetes, so it wasn’t hard for us to extend Gardener. We also saw that Gardener has a very active community, which is always a plus!”

- Aylei Wu, (Cloud Engineer) at PingCAP

At first glance, PingCAP had initial reservations about using Gardener - mainly due to its adoption level (still at the beginning) and an apparent complexity of use. However, these were soon eliminated as they learned more about the solution. As Aylei Wu mentioned during the last Gardener community meeting, “a good product speaks for itself”, and once the company got familiar with Gardener, they quickly noticed that the concepts were very similar to Kubernetes, which they were already familiar with.

They recognized that Gardener would be their best option, as it is highly extensible and provides a unified abstraction API layer. In essence, the machines can be managed via a machine controller manager for different cloud providers - without having to worry about the individual cloud APIs.

They agreed that Gardener’s solution, although complex, was definitely worth it. Even though it is a relatively new solution, meaning they didn’t have access to other user testimonials, they decided to go with the service since it checked all the boxes (and as SAP was running it productively with a huge fleet). PingCAP also came to the conclusion that building a managed Kubernetes service themselves would not be easy. Even if they were to build a managed K8s service, they would have to heavily invest in development and would still end up with an even more complex platform than Gardener’s. For all these reasons combined, PingCAP decided to go with Gardener to build its TiDB Cloud.

Here are certain features of Gardener that PingCAP found appealing:

  • Cloud agnostic: Gardener’s abstractions for cloud-specific integrations dramatically reduce the investment in supporting more than one cloud infrastructure. Once the integration with Amazon Web Services was done, moving on to Google Cloud Platform proved to be relatively easy. (At the moment, TiDB Cloud has subscription plans available for both GCP and AWS, and they are planning to support Alibaba Cloud in the future.)
  • Familiar concepts: Gardener is K8s native; its concepts are easily related to core Kubernetes concepts. As such, it was easy to onboard for a K8s experienced team like PingCAP’s SRE team.
  • Easy to manage and extend: Gardener’s API and extensibility are easy to implement, which has a positive impact on the implementation, maintenance costs and time-to-market.
  • Active community: Prompt and quality responses on Slack from the Gardener team tremendously helped to quickly onboard and produce an efficient solution.

How PingCAP built TiDB Cloud with Gardener

On a technical level, PingCAP’s set-up overview includes the following:

  • A Base Cluster globally, which is the top-level control plane of TiDB Cloud
  • A Seed Cluster per cloud provider per region, which makes up the fundamental data plane of TiDB Cloud
  • A Shoot Cluster is dynamically provisioned per tenant per cloud provider per region when requested
  • A tenant may create one or more TiDB clusters in a Shoot Cluster

As a real world example, PingCAP sets up the Base Cluster and Seed Clusters in advance. When a tenant creates its first TiDB cluster under the us-west-2 region of AWS, a Shoot Cluster will be dynamically provisioned in this region, and will host all the TiDB clusters of this tenant under us-west-2. Nevertheless, if another tenant requests a TiDB cluster in the same region, a new Shoot Cluster will be provisioned. Since different Shoot Clusters are located in different VPCs and can even be hosted under different AWS accounts, TiDB Cloud is able to achieve hard isolation between tenants and meet the critical security requirements for our customers.

To automate these processes, PingCAP creates a service in the Base Cluster, known as the TiDB Cloud “Central” service. The Central is responsible for managing shoots and the TiDB clusters in the Shoot Clusters. As shown in the following diagram, user operations go to the Central, being authenticated, authorized, validated, stored and then applied asynchronously in a controller manner. The Central will talk to the Gardener API Server to create and scale Shoot clusters. The Central will also access the Shoot API Service to deploy and reconcile components in the Shoot cluster, including control components (TiDB Operator, API Proxy, Usage Reporter for billing, etc.) and the TiDB clusters.

TiDB Cloud on Gardener Architecture Overview

What’s next for PingCAP and Gardener

With the initial success of using the project to build TiDB Cloud, PingCAP is now working heavily on the stability and day-to-day operations of TiDB Cloud on Gardener. This includes writing Infrastructure-as-Code scripts/controllers with it to achieve GitOps, building tools to help diagnose problems across regions and clusters, as well as running chaos tests to identify and eliminate potential risks. After benefiting greatly from the community, PingCAP will continue to contribute back to Gardener.

In the future, PingCAP also plans to support more cloud providers like AliCloud and Azure. Moreover, PingCAP may explore the opportunity of running TiDB Cloud in on-premise data centers with the constantly expanding support this project provides. Engineers at PingCAP enjoy the ease of learning from Gardener’s kubernetes-like concepts and being able to apply them everywhere. Gone are the days of heavy integrations with different clouds and worrying about vendor stability. With this project, PingCAP now sees broader opportunities to land TiDB Cloud on various infrastructures to meet the needs of their global user group.

Stay tuned, more blog posts to come on how Gardener is collaborating with its contributors and adopters to bring fully-managed clusters at scale everywhere! If you want to join in on the fun, connect with our community.


New Website, Same Green Flower

The Gardener project website just received a serious facelift. Here are some of the highlights:

  • A completely new landing page, emphasizing both on Gardener’s value proposition and the open community behind it.
  • The Community page was reconstructed for quick access to the various community channels and will soon merge the Adopters page. It will provide a better insight into success stories from the communty.
  • A completely new News section and content type available at /documentation/news. Use metadata such as publishdate and archivedate to schedule for news publish and archive automatically, regardless of when you contributed them. You can now track what’s happening from the landing page or in the dedicated News section on the website and share.
  • Improved blogs layout. One-click sharing options are available starting with simple URL copy link and twitter button and others will closely follow up. While we are at it, give it a try. Spread the word.

Website builds also got to a new level with:

  • Containerization. The whole build environment is containerized now, eliminating differences between local and CI/CD setup and reducing content developers focus only to the /documentation repository. Running a local server for live preview of changes as you make them when developing content for the website, is now as easy as runing make serve in your local /documentation clone.
  • Numerous improvements to the buld scripts. More configuration options, authenticated requests, fault tollerance and performance.
  • Good news for Windows WSL users who will now nejoy a significantly support. See the updated README for details on that.
  • A number of improvements in layouts styles, site assets and hugo site-building techniques.

But hey, THAT’S NOT ALL!

Stay tuned for more improvements around the corner. The biggest ones are aligning the documentation with the new theme and restructuring it along, more emphasis on community success stories all around, more sharing options and more than a handful of shortcodes for content development and … let’s cut the spoilers here.

I hope you will like it. Let us know what you think about it. Feel free to leave comments and discuss on Twitter and Slack, or in case of issues - on GitHub.

Go ahead and help us spread the word:


Cluster Overprovisioning

This tutorial describes how to overprovisioning of cluster nodes for scaling and failover. This is desired when you have work load that need to scale up quickly without waiting for the new cluster nodes to be created and join the cluster.

A similar problem occurs when crashing a node from the Hyperscaler. This must be replaced by Kubernetes as fast as possible. The solution can be overprovisioning of nodes some more on Cluster Overprovisioning.


Feature Flags in Kubernetes Applications

Feature flags are used to change the behavior of a program at runtime without forcing a restart.

Although they are essential in a native cloud environment, they cannot be implemented without significant effort on some platforms. Kubernetes has made this trivial. Here we will implement them through labels and annotations, but you can also implement them by connecting directly to the Kubernetes API Server.

Possible Use Cases

  • turn on/off a specific instance
  • turn on/off profiling of a specific instance
  • change the logging level, to capture detailed logs during a specific event
  • change caching strategy at runtime
  • change timeouts in production
  • toggle on/off some special verification some more on Feature Flags for App.


Manually adding a node to an existing cluster

Gardener has an excellent ability to automatically scale machines for the cluster. From the point of view of scalability, there is no need for manual intervention.

This tutorial is useful for those end-users who need specifically configured nodes, which are not yet supported by Gardener. For example: an end-user who wants some workload that requires runnc instead of runc as container runtime. some more on Adding Nodes to a Cluster.


Organizing Access Using kubeconfig Files

The kubectl command-line tool uses kubeconfig files to find the information it needs to choose a cluster and communicate with the API server of a cluster.

What happens if your kubeconfig file of your production cluster is leaked or published by accident?

Since there is no possibility to rotate or revoke the initial kubeconfig, there is only one way to protect your infrastructure or application if it is has leaked - delete the cluster.

..learn more on Work with kubeconfig files.


Cookies are dangerous...

…they mess up the figure.

For a team event during the Christmas season we decided to completely reinterpret the topic cookies… since the vegetables have gone on a well-deserved vacation. :-)

Get recipe on Gardener Cookies.


Auditing Kubernetes for Secure Setup

In summer 2018, the Gardener project team asked Kinvolk to execute several penetration tests in its role as third-party contractor. The goal of this ongoing work is to increase the security of all Gardener stakeholders in the open source community. Following the Gardener architecture, the control plane of a Gardener managed shoot cluster resides in the corresponding seed cluster. This is a Control-Plane-as-a-Service with a network air gap.

Along the way we found various kinds of security issues, for example, due to misconfiguration or missing isolation, as well as two special problems with upstream Kubernetes and its Control-Plane-as-a-Service architecture. some more on Auditing Kubernetes for Secure Setup.


Big things come in small packages

Microservices tend to use smaller runtimes but you can use what you have today - and this can be a problem in kubernetes.

Switching your architecture from a monolith to microservices has many advantages, both in the way you write software and the way it is used throughout its lifecycle. In this post, my attempt is to cover one problem which does not get as much attention and discussion - size of the technology stack.

General purpose technology stack

There is a tendency to be more generalized in development and to apply this pattern to all services. One feels that a homogeneous image of the technology stack is good if it is the same for all services.

One forgets, however, that a large percentage of the integrated infrastructure is not used by all services in the same way, and is therefore only a burden. Thus, resources are wasted and the entire application becomes expensive in operation and scales very badly.

Light technology stack

Due to the lightweight nature of your service, you can run more containers on a physical server and virtual machines. The result is higher resource utilization.

Additionally, microservices are developed and deployed as containers independently of each another. This means that a development team can develop, optimize and deploy a microservice without impacting other subsystems.


Anti Patterns

Running as root user

Whenever possible, do not run containers as root users. One could be tempted to say that Kubernetes Pods and Node are well separated. The host and the container share the same kernel. If the container is compromised, a root user can damage the underlying node. Use RUN groupadd -r anygroup && useradd -r -g anygroup myuser to create a group and a user in it. Use the USER command to switch to this user.

Storing data or logs in containers

Containers are ideal for stateless applications and should be transient. This means that no data or logs should be stored in the container, as they are lost when the container is closed. If absolutely necessary, you can use persistence volumes instead to persist them outside the containers. However, an ELK stack is preferred for storing and processing log files. some more on Common Kubernetes Antipattern.


Frontend HTTPS

For encrypted communication between the client to the load balancer, you need to specify a TLS private key and certificate to be used by the ingress controller.

Create a secret in the namespace of the ingress containing the TLS private key and certificate. Then configure the secret name in the TLS configuration section of the ingress specification. on HTTPS - Self Signed Certificates how to configure it.


Kubernetes is available in Docker for Mac 17.12 CE

Kubernetes is only available in Docker for Mac 17.12 CE and higher on the Edge channel. Kubernetes support is not included in Docker for Mac Stable releases. To find out more about Stable and Edge channels and how to switch between them, see general configuration.
Docker for Mac 17.12 CE (and higher) Edge includes a standalone Kubernetes server that runs on Mac, so that you can test deploying your Docker workloads on Kubernetes.

The Kubernetes client command, kubectl, is included and configured to connect to the local Kubernetes server. If you have kubectl already installed and pointing to some other environment, such as minikube or a GKE cluster, be sure to change context so that kubectl is pointing to docker-for-desktop:

…see more on

I recommend to setup your shell to see which KUBECONFIG is active.


Namespace Isolation

…or DENY all traffic from other namespaces

You can configure a NetworkPolicy to deny all traffic from other namespaces while allowing all traffic coming from the same namespace the pod is deployed to. There are many reasons why you may chose to configure Kubernetes network policies:

  • Isolate multi-tenant deployments
  • Regulatory compliance
  • Ensure containers assigned to different environments (e.g. dev/staging/prod) cannot interfere with each another on Namespace Isolation how to configure it.


Namespace Scope

Should I use:

  • ❌ one namespace per user/developer?
  • ❌ one namespace per team?
  • ❌ one per service type?
  • ❌ one namespace per application type?
  • 😄 one namespace per running instance of your application?

Apply the Principle of Least Privilege

All user accounts should run at all times as few privileges as possible, and also launch applications with as few privileges as possible. If you share a cluster for different user separated by a namespace, all user has access to all namespaces and services per default. It can happen that a user accidentally uses and destroys the namespace of a productive application or the namespace of another developer.

Keep in mind: By default namespaces don’t provide:

  • Network isolation
  • Access Control
  • Audit Logging on user level


ReadWriteMany - Dynamically Provisioned Persistent Volumes Using Amazon EFS

The efs-provisioner allows you to mount EFS storage as PersistentVolumes in kubernetes. It consists of a container that has access to an AWS EFS resource. The container reads a configmap containing the EFS filesystem ID, the AWS region and the name identifying the efs-provisioner. This name will be used later when you create a storage class.


  1. When you have application running on multiple nodes which require shared access to a file system
  2. When you have an application that requires multiple virtual machines to access the same file system at the same time, AWS EFS is a tool that you can use.
  3. EFS supports encryption.
  4. EFS is SSD based storage and its storage capacity and pricing will scale in or out as needed, so there is no need for the system administrator to do additional operations. It can grow to a petabyte scale.
  5. EFS now supports NFSv4 lock upgrading and downgrading, so yes, you can use sqlite with EFS… even if it was possible before.
  6. Easy to setup

Why Not EFS

  1. Sometimes when you think about using a service like EFS, you may also think about vendor lock-in and its negative sides
  2. Making an EFS backup may decrease your production FS performance; the throughput used by backup counts towards your total file system throughput.
  3. EFS is expensive compared to EBS (roughly twice the price of EBS storage)
  4. EFS is not the magical solution for all your distributed FS problems, it can be slow in many cases. Test, benchmark and measure to ensure your if EFS is a good solution for your use case.
  5. EFS distributed architecture results in a latency overhead for each file read/write operation.
  6. If you have the possibility to use a CDN, don’t use EFS, use it for the files which can’t be stored in a CDN.
  7. Don’t use EFS as a caching system, sometimes you could be doing this unintentionally.
  8. Last but not least, even if EFS is a fully managed NFS, you will face performance problems in many cases, resolving them takes time and needs effort. some more on ReadWriteMany.


Shared storage with S3 backend

The storage is definitely the most complex and important part of an application setup, once this part is completed, one of the most problematic parts could be solved.

Mounting a S3 bucket into a pod using FUSE allows to access data stored in S3 via the filesystem. The mount is a pointer to an S3 location, so the data is never synced locally. Once mounted, any pod can read or even write from that directory without the need for explicit keys.

However, it can be used to import and parse large amounts of data into a database. on Shared S3 Storage how to configure it.


Watching logs of several pods

One thing that always bothered me was that I couldn’t get logs of several pods at once with kubectl. A simple tail -f <path-to-logfile> isn’t possible. Certainly you can use kubectl logs -f <pod-id>, but it doesn’t help if you want to monitor more than one pod at a time.

This is something you really need a lot, at least if you run several instances of a pod behind a deployment and you don’t have setup a log viewer service like Kibana.

kubetail comes to the rescue, it is a small bash script that allows you to aggregate log files of several pods at the same time in a simple way. The script is called kubetail and is available at GitHub.

Hibernate a Cluster to save money

You want to experiment with Kubernetes or have set up a customer scenario, but you don’t want to run the cluster 24 / 7 for reasons of cost?

The Gardener gives you the possibility to scale your cluster down to zero nodes. some more on Hibernate a Cluster.