This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

etcd-druid

A druid for etcd management in Gardener

etcd-druid is an etcd operator which makes it easy to configure, provision, reconcile, monitor and delete etcd clusters. It enables management of etcd clusters through declarative Kubernetes API model.

In every etcd cluster managed by etcd-druid, each etcd member is a two container Pod which consists of:

etcd-wrapper which manages the lifecycle (validation & initialization) of an etcd.
etcd-backup-restore sidecar which currently provides the following capabilities (the list is not comprehensive):
- etcd DB validation.
- Scheduled etcd DB defragmentation.
- Backup - etcd DB snapshots are taken regularly and backed in an object store if one is configured.
- Restoration - In case of a DB corruption for a single-member cluster it helps in restoring from latest set of snapshots (full & delta).
- Member control operations.

etcd-druid additionally provides the following capabilities:

Facilitates declarative scale-out of etcd clusters.
Provides protection against accidental deletion/mutation of resources provisioned as part of an etcd cluster.
Offers an asynchronous and threshold based capability to process backed up snapshots to:
- Potentially minimize the recovery time by leveraging restoration from backups followed by etcd’s compaction and defragmentation.
- Indirectly assert integrity of the backed up snaphots.
Allows seamless copy of backups between any two object store buckets.

Start using or developing `etcd-druid` locally

If you are looking to try out druid then you can use a Kind cluster based setup.

https://github.com/user-attachments/assets/cfe0d891-f709-4d7f-b975-4300c6de67e4

For detailed documentation, see our docs.

Contributions

If you wish to contribute then please see our contributor guidelines.

Feedback and Support

We always look forward to active community engagement. Please report bugs or suggestions on how we can enhance etcd-druid on GitHub Issues.

License

Release under Apache-2.0 license.

1 - 01 Multi Node Etcd Clusters

DEP-01: Multi-node etcd cluster instances via etcd-druid

This document proposes an approach (along with some alternatives) to support provisioning and management of multi-node etcd cluster instances via etcd-druid and etcd-backup-restore.

Goal

Enhance etcd-druid and etcd-backup-restore to support provisioning and management of multi-node etcd cluster instances within a single Kubernetes cluster.
The etcd CRD interface should be simple to use. It should preferably work with just setting the spec.replicas field to the desired value and should not require any more configuration in the CRD than currently required for the single-node etcd instances. The spec.replicas field is part of the scale sub-resource implementation in Etcd CRD.
The single-node and multi-node scenarios must be automatically identified and managed by etcd-druid and etcd-backup-restore.
The etcd clusters (single-node or multi-node) managed by etcd-druid and etcd-backup-restore must automatically recover from failures (even quorum loss) and disaster (e.g. etcd member persistence/data loss) as much as possible.
It must be possible to dynamically scale an etcd cluster horizontally (even between single-node and multi-node scenarios) by simply scaling the Etcd scale sub-resource.
It must be possible to (optionally) schedule the individual members of an etcd clusters on different nodes or even infrastructure availability zones (within the hosting Kubernetes cluster).

Though this proposal tries to cover most aspects related to single-node and multi-node etcd clusters, there are some more points that are not goals for this document but are still in the scope of either etcd-druid/etcd-backup-restore and/or gardener. In such cases, a high-level description of how they can be addressed in the future are mentioned at the end of the document.

Background and Motivation

Single-node etcd cluster

At present, etcd-druid supports only single-node etcd cluster instances. The advantages of this approach are given below.

The problem domain is smaller. There are no leader election and quorum related issues to be handled. It is simpler to setup and manage a single-node etcd cluster.
Single-node etcd clusters instances have less request latency than multi-node etcd clusters because there is no requirement to replicate the changes to the other members before committing the changes.
etcd-druid provisions etcd cluster instances as pods (actually as statefulsets) in a Kubernetes cluster and Kubernetes is quick (<20s) to restart container/pods if they go down.
Also, etcd-druid is currently only used by gardener to provision etcd clusters to act as back-ends for Kubernetes control-planes and Kubernetes control-plane components (kube-apiserver, kubelet, kube-controller-manager, kube-scheduler etc.) can tolerate etcd going down and recover when it comes back up.
Single-node etcd clusters incur less cost (CPU, memory and storage)
It is easy to cut-off client requests if backups fail by using readinessProbe on the etcd-backup-restore healthz endpoint to minimize the gap between the latest revision and the backup revision.

The disadvantages of using single-node etcd clusters are given below.

The database verification step by etcd-backup-restore can introduce additional delays whenever etcd container/pod restarts (in total ~20-25s). This can be much longer if a database restoration is required. Especially, if there are incremental snapshots that need to be replayed (this can be mitigated by compacting the incremental snapshots in the background).
Kubernetes control-plane components can go into CrashloopBackoff if etcd is down for some time. This is mitigated by the dependency-watchdog. But Kubernetes control-plane components require a lot of resources and create a lot of load on the etcd cluster and the apiserver when they come out of CrashloopBackoff. Especially, in medium or large sized clusters (> 20 nodes).
Maintenance operations such as updates to etcd (and updates to etcd-druid of etcd-backup-restore), rolling updates to the nodes of the underlying Kubernetes cluster and vertical scaling of etcd pods are disruptive because they cause etcd pods to be restarted. The vertical scaling of etcd pods is somewhat mitigated during scale down by doing it only during the target clusters’ maintenance window. But scale up is still disruptive.
We currently use some form of elastic storage (via persistentvolumeclaims) for storing which have some upper-bounds on the I/O latency and throughput. This can be potentially be a problem for large clusters (> 220 nodes). Also, some cloud providers (e.g. Azure) take a long time to attach/detach volumes to and from machines which increases the down time to the Kubernetes components that depend on etcd. It is difficult to use ephemeral/local storage (to achieve better latency/throughput as well as to circumvent volume attachment/detachment) for single-node etcd cluster instances.

Multi-node etcd-cluster

The advantages of introducing support for multi-node etcd clusters via etcd-druid are below.

Multi-node etcd cluster is highly-available. It can tolerate disruption to individual etcd pods as long as the quorum is not lost (i.e. more than half the etcd member pods are healthy and ready).
Maintenance operations such as updates to etcd (and updates to etcd-druid of etcd-backup-restore), rolling updates to the nodes of the underlying Kubernetes cluster and vertical scaling of etcd pods can be done non-disruptively by respecting poddisruptionbudgets for the various multi-node etcd cluster instances hosted on that cluster.
Kubernetes control-plane components do not see any etcd cluster downtime unless quorum is lost (which is expected to be lot less frequent than current frequency of etcd container/pod restarts).
We can consider using ephemeral/local storage for multi-node etcd cluster instances because individual member restarts can afford to take time to restore from backup before (re)joining the etcd cluster because the remaining members serve the requests in the meantime.
High-availability across availability zones is also possible by specifying (anti)affinity for the etcd pods (possibly via kupid).

Some disadvantages of using multi-node etcd clusters due to which it might still be desirable, in some cases, to continue to use single-node etcd cluster instances in the gardener context are given below.

Multi-node etcd cluster instances are more complex to manage. The problem domain is larger including the following.
- Leader election
- Quorum loss
- Managing rolling changes
- Backups to be taken from only the leading member.
- More complex to cut-off client requests if backups fail to minimize the gap between the latest revision and the backup revision is under control.
Multi-node etcd cluster instances incur more cost (CPU, memory and storage).

Dynamic multi-node etcd cluster

Though it is not part of this proposal, it is conceivable to convert a single-node etcd cluster into a multi-node etcd cluster temporarily to perform some disruptive operation (etcd, etcd-backup-restore or etcd-druid updates, etcd cluster vertical scaling and perhaps even node rollout) and convert it back to a single-node etcd cluster once the disruptive operation has been completed. This will necessarily still involve a down-time because scaling from a single-node etcd cluster to a three-node etcd cluster will involve etcd pod restarts, it is still probable that it can be managed with a shorter down time than we see at present for single-node etcd clusters (on the other hand, converting a three-node etcd cluster to five node etcd cluster can be non-disruptive).

This is definitely not to argue in favour of such a dynamic approach in all cases (eventually, if/when dynamic multi-node etcd clusters are supported). On the contrary, it makes sense to make use of static (fixed in size) multi-node etcd clusters for production scenarios because of the high-availability.

Prior Art

ETCD Operator from CoreOS

etcd operator
Project status: archived
This project is no longer actively developed or maintained. The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact etcd-dev@googlegroups.com.

etcdadm from kubernetes-sigs

etcdadm is a command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to, or remove a member from an existing cluster. Its user experience is inspired by kubeadm.

It is a tool more tailored for manual command-line based management of etcd clusters with no API’s. It also makes no assumptions about the underlying platform on which the etcd clusters are provisioned and hence, doesn’t leverage any capabilities of Kubernetes.

Etcd Cluster Operator from Improbable-Engineering

Etcd Cluster Operator
Etcd Cluster Operator is an Operator for automating the creation and management of etcd inside of Kubernetes. It provides a custom resource definition (CRD) based API to define etcd clusters with Kubernetes resources, and enable management with native Kubernetes tooling._

Out of all the alternatives listed here, this one seems to be the only possible viable alternative. Parts of its design/implementations are similar to some of the approaches mentioned in this proposal. However, we still don’t propose to use it as -

The project is still in early phase and is not mature enough to be consumed as is in productive scenarios of ours.
The resotration part is completely different which makes it difficult to adopt as-is and requries lot of re-work with the current restoration semantics with etcd-backup-restore making the usage counter-productive.

General Approach to ETCD Cluster Management

Bootstrapping

There are three ways to bootstrap an etcd cluster which are static, etcd discovery and DNS discovery. Out of these, the static way is the simplest (and probably faster to bootstrap the cluster) and has the least external dependencies. Hence, it is preferred in this proposal. But it requires that the initial (during bootstrapping) etcd cluster size (number of members) is already known before bootstrapping and that all of the members are already addressable (DNS,IP,TLS etc.). Such information needs to be passed to the individual members during startup using the following static configuration.

ETCD_INITIAL_CLUSTER
- The list of peer URLs including all the members. This must be the same as the advertised peer URLs configuration. This can also be passed as initial-cluster flag to etcd.
ETCD_INITIAL_CLUSTER_STATE
- This should be set to new while bootstrapping an etcd cluster.
ETCD_INITIAL_CLUSTER_TOKEN
- This is a token to distinguish the etcd cluster from any other etcd cluster in the same network.

Assumptions

ETCD_INITIAL_CLUSTER can use DNS instead of IP addresses. We need to verify this by deleting a pod (as against scaling down the statefulset) to ensure that the pod IP changes and see if the recreated pod (by the statefulset controller) re-joins the cluster automatically.
DNS for the individual members is known or computable. This is true in the case of etcd-druid setting up an etcd cluster using a single statefulset. But it may not necessarily be true in other cases (multiple statefulset per etcd cluster or deployments instead of statefulsets or in the case of etcd cluster with members distributed across more than one Kubernetes cluster.

Adding a new member to an etcd cluster

A new member can be added to an existing etcd cluster instance using the following steps.

If the latest backup snapshot exists, restore the member’s etcd data to the latest backup snapshot. This can reduce the load on the leader to bring the new member up to date when it joins the cluster.
1. If the latest backup snapshot doesn’t exist or if the latest backup snapshot is not accessible (please see backup failure) and if the cluster itself is quorate, then the new member can be started with an empty data. But this will will be suboptimal because the new member will fetch all the data from the leading member to get up-to-date.
The cluster is informed that a new member is being added using the MemberAdd API including information like the member name and its advertised peer URLs.
The new etcd member is then started with ETCD_INITIAL_CLUSTER_STATE=existing apart from other required configuration.

This proposal recommends this approach.

Note

If there are incremental snapshots (taken by etcd-backup-restore), they cannot be applied because that requires the member to be started in isolation without joining the cluster which is not possible. This is acceptable if the amount of incremental snapshots are managed to be relatively small. This adds one more reason to increase the priority of the issue of incremental snapshot compaction.
There is a time window, between the MemberAdd call and the new member joining the cluster and getting up to date, where the cluster is vulnerable to leader elections which could be disruptive.

Alternative

With v3.4, the new raft learner approach can be used to mitigate some of the possible disruptions mentioned above. Then the steps will be as follows.

If the latest backup snapshot exists, restore the member’s etcd data to the latest backup snapshot. This can reduce the load on the leader to bring the new member up to date when it joins the cluster.
The cluster is informed that a new member is being added using the MemberAddAsLearner API including information like the member name and its advertised peer URLs.
The new etcd member is then started with ETCD_INITIAL_CLUSTER_STATE=existing apart from other required configuration.
Once the new member (learner) is up to date, it can be promoted to a full voting member by using the MemberPromote API

This approach is new and involves more steps and is not recommended in this proposal. It can be considered in future enhancements.

Managing Failures

A multi-node etcd cluster may face failures of diffent kinds during its life-cycle. The actions that need to be taken to manage these failures depend on the failure mode.

Removing an existing member from an etcd cluster

If a member of an etcd cluster becomes unhealthy, it must be explicitly removed from the etcd cluster, as soon as possible. This can be done by using the MemberRemove API. This ensures that only healthy members participate as voting members.

A member of an etcd cluster may be removed not just for managing failures but also for other reasons such as -

The etcd cluster is being scaled down. I.e. the cluster size is being reduced
An existing member is being replaced by a new one for some reason (e.g. upgrades)

If the majority of the members of the etcd cluster are healthy and the member that is unhealthy/being removed happens to be the leader at that moment then the etcd cluster will automatically elect a new leader. But if only a minority of etcd clusters are healthy after removing the member then the the cluster will no longer be quorate and will stop accepting write requests. Such an etcd cluster needs to be recovered via some kind of disaster-recovery.

Restarting an existing member of an etcd cluster

If the existing member of an etcd cluster restarts and retains an uncorrupted data directory after the restart, then it can simply re-join the cluster as an existing member without any API calls or configuration changes. This is because the relevant metadata (including member ID and cluster ID) are maintained in the write ahead logs. However, if it doesn’t retain an uncorrupted data directory after the restart, then it must first be removed and added as a new member.

Recovering an etcd cluster from failure of majority of members

If a majority of members of an etcd cluster fail but if they retain their uncorrupted data directory then they can be simply restarted and they will re-form the existing etcd cluster when they come up. However, if they do not retain their uncorrupted data directory, then the etcd cluster must be recovered from latest snapshot in the backup. This is very similar to bootstrapping with the additional initial step of restoring the latest snapshot in each of the members. However, the same limitation about incremental snapshots, as in the case of adding a new member, applies here. But unlike in the case of adding a new member, not applying incremental snapshots is not acceptable in the case of etcd cluster recovery. Hence, if incremental snapshots are required to be applied, the etcd cluster must be recovered in the following steps.

Restore a new single-member cluster using the latest snapshot.
Apply incremental snapshots on the single-member cluster.
Take a full snapshot which can now be used while adding the remaining members.
Add new members using the latest snapshot created in the step above.

Kubernetes Context

Users will provision an etcd cluster in a Kubernetes cluster by creating an etcd CRD resource instance.
A multi-node etcd cluster is indicated if the spec.replicas field is set to any value greater than 1. The etcd-druid will add validation to ensure that the spec.replicas value is an odd number according to the requirements of etcd.
The etcd-druid controller will provision a statefulset with the etcd main container and the etcd-backup-restore sidecar container. It will pass on the spec.replicas field from the etcd resource to the statefulset. It will also supply the right pre-computed configuration to both the containers.
The statefulset controller will create the pods based on the pod template in the statefulset spec and these individual pods will be the members that form the etcd cluster.

Component diagram

This approach makes it possible to satisfy the assumption that the DNS for the individual members of the etcd cluster must be known/computable. This can be achieved by using a headless service (along with the statefulset) for each etcd cluster instance. Then we can address individual pods/etcd members via the predictable DNS name of <statefulset_name>-{0|1|2|3|…|n}.<headless_service_name> from within the Kubernetes namespace (or from outside the Kubernetes namespace by appending .<namespace>.svc.<cluster_domain> suffix). The etcd-druid controller can compute the above configurations automatically based on the spec.replicas in the etcd resource.

This proposal recommends this approach.

Alternative

One statefulset is used for each member (instead of one statefulset for all members). While this approach gives a flexibility to have different pod specifications for the individual members, it makes managing the individual members (e.g. rolling updates) more complicated. Hence, this approach is not recommended.

ETCD Configuration

As mentioned in the general approach section, there are differences in the configuration that needs to be passed to individual members of an etcd cluster in different scenarios such as bootstrapping, adding a new member, removing a member, restarting an existing member etc. Managing such differences in configuration for individual pods of a statefulset is tricky in the recommended approach of using a single statefulset to manage all the member pods of an etcd cluster. This is because statefulset uses the same pod template for all its pods.

The recommendation is for etcd-druid to provision the base configuration template in a ConfigMap which is passed to all the pods via the pod template in the StatefulSet. The initialization flow of etcd-backup-restore (which is invoked every time the etcd container is (re)started) is then enhanced to generate the customized etcd configuration for the corresponding member pod (in a shared volume between etcd and the backup-restore containers) based on the supplied template configuration. This will require that etcd-backup-restore will have to have a mechanism to detect which scenario listed above applies during any given member container/pod restart.

Alternative

As mentioned above, one statefulset is used for each member of the etcd cluster. Then different configuration (generated directly by etcd-druid) can be passed in the pod templates of the different statefulsets. Though this approach is advantageous in the context of managing the different configuration, it is not recommended in this proposal because it makes the rest of the management (e.g. rolling updates) more complicated.

Data Persistence

The type of persistence used to store etcd data (including the member ID and cluster ID) has an impact on the steps that are needed to be taken when the member pods or containers (minority of them or majority) need to be recovered.

Persistent

Like the single-node case, persistentvolumes can be used to persist ETCD data for all the member pods. The individual member pods then get their own persistentvolumes. The advantage is that individual members retain their member ID across pod restarts and even pod deletion/recreation across Kubernetes nodes. This means that member pods that crash (or are unhealthy) can be restarted automatically (by configuring livenessProbe) and they will re-join the etcd cluster using their existing member ID without any need for explicit etcd cluster management).

The disadvantages of this approach are as follows.

The number of persistentvolumes increases linearly with the cluster size which is a cost-related concern.
Network-mounted persistentvolumes might eventually become a performance bottleneck under heavy load for a latency-sensitive component like ETCD.
Volume attach/detach issues when associated with etcd cluster instances cause downtimes to the target shoot clusters that are backed by those etcd cluster instances.

Ephemeral

The ephemeral volumes use-case is considered as an optimization and may be planned as a follow-up action.

Disk

Ephemeral persistence can be achieved in Kubernetes by using either emptyDir volumes or local persistentvolumes to persist ETCD data. The advantages of this approach are as follows.

Potentially faster disk I/O.
The number of persistent volumes does not increase linearly with the cluster size (at least not technically).
Issues related volume attachment/detachment can be avoided.

The main disadvantage of using ephemeral persistence is that the individual members may retain their identity and data across container restarts but not across pod deletion/recreation across Kubernetes nodes. If the data is lost then on restart of the member pod, the older member (represented by the container) has to be removed and a new member has to be added.

Using emptyDir ephemeral persistence has the disadvantage that the volume doesn’t have its own identity. So, if the member pod is recreated but scheduled on the same node as before then it will not retain the identity as the persistence is lost. But it has the advantage that scheduling of pods is unencumbered especially during pod recreation as they are free to be scheduled anywhere.

Using local persistentvolumes has the advantage that the volume has its own indentity and hence, a recreated member pod will retain its identity if scheduled on the same node. But it has the disadvantage of tying down the member pod to a node which is a problem if the node becomes unhealthy requiring etcd druid to take additional actions (such as deleting the local persistent volume).

Based on these constraints, if ephemeral persistence is opted for, it is recommended to use emptyDir ephemeral persistence.

In-memory

In-memory ephemeral persistence can be achieved in Kubernetes by using emptyDir with medium: Memory. In this case, a tmpfs (RAM-backed file-system) volume will be used. In addition to the advantages of ephemeral persistence, this approach can achieve the fastest possible disk I/O. Similarly, in addition to the disadvantages of ephemeral persistence, in-memory persistence has the following additional disadvantages.

More memory required for the individual member pods.
Individual members may not at all retain their data and identity across container restarts let alone across pod restarts/deletion/recreation across Kubernetes nodes. I.e. every time an etcd container restarts, the old member (represented by the container) will have to be removed and a new member has to be added.

How to detect if valid metadata exists in an etcd member

Since the likelyhood of a member not having valid metadata in the WAL files is much more likely in the ephemeral persistence scenario, one option is to pass the information that ephemeral persistence is being used to the etcd-backup-restore sidecar (say, via command-line flags or environment variables).

But in principle, it might be better to determine this from the WAL files directly so that the possibility of corrupted WAL files also gets handled correctly. To do this, the wal package has some functions that might be useful.

Recommendation

It might be possible that using the wal package for verifying if valid metadata exists might be performance intensive. So, the performance impact needs to be measured. If the performance impact is acceptable (both in terms of resource usage and time), it is recommended to use this way to verify if the member contains valid metadata. Otherwise, alternatives such as a simple check that WAL folder exists coupled with the static information about use of persistent or ephemeral storage might be considered.

How to detect if valid data exists in an etcd member

The initialization sequence in etcd-backup-restore already includes database verification. This would suffice to determine if the member has valid data.

Recommendation

Though ephemeral persistence has performance and logistics advantages, it is recommended to start with persistent data for the member pods. In addition to the reasons and concerns listed above, there is also the additional concern that in case of backup failure, the risk of additional data loss is a bit higher if ephemeral persistence is used (simultaneous quoram loss is sufficient) when compared to persistent storage (simultaenous quorum loss with majority persistence loss is needed). The risk might still be acceptable but the idea is to gain experience about how frequently member containers/pods get restarted/recreated, how frequently leader election happens among members of an etcd cluster and how frequently etcd clusters lose quorum. Based on this experience, we can move towards using ephemeral (perhaps even in-memory) persistence for the member pods.

Separating peer and client traffic

The current single-node ETCD cluster implementation in etcd-druid and etcd-backup-restore uses a single service object to act as the entry point for the client traffic. There is no separation or distinction between the client and peer traffic because there is not much benefit to be had by making that distinction.

In the multi-node ETCD cluster scenario, it makes sense to distinguish between and separate the peer and client traffic. This can be done by using two services.

peer
- To be used for peer communication. This could be a headless service.
client
- To be used for client communication. This could be a normal ClusterIP service like it is in the single-node case.

The main advantage of this approach is that it makes it possible (if needed) to allow only peer to peer communication while blocking client communication. Such a thing might be required during some phases of some maintenance tasks (manual or automated).

Cutting off client requests

At present, in the single-node ETCD instances, etcd-druid configures the readinessProbe of the etcd main container to probe the healthz endpoint of the etcd-backup-restore sidecar which considers the status of the latest backup upload in addition to the regular checks about etcd and the side car being up and healthy. This has the effect of setting the etcd main container (and hence the etcd pod) as not ready if the latest backup upload failed. This results in the endpoints controller removing the pod IP address from the endpoints list for the service which eventually cuts off ingress traffic coming into the etcd pod via the etcd client service. The rationale for this is to fail early when the backup upload fails rather than continuing to serve requests while the gap between the last backup and the current data increases which might lead to unacceptably large amount of data loss if disaster strikes.

This approach will not work in the multi-node scenario because we need the individual member pods to be able to talk to each other to maintain the cluster quorum when backup upload fails but need to cut off only client ingress traffic.

It is recommended to separate the backup health condition tracking taking appropriate remedial actions. With that, the backup health condition tracking is now separated to the BackupReady condition in the Etcd resource status and the cutting off of client traffic (which could now be done for more reasons than failed backups) can be achieved in a different way described below.

Manipulating Client Service podSelector

The client traffic can be cut off by updating (manually or automatically by some component) the podSelector of the client service to add an additional label (say, unhealthy or disabled) such that the podSelector no longer matches the member pods created by the statefulset. This will result in the client ingress traffic being cut off. The peer service is left unmodified so that peer communication is always possible.

Health Check

The etcd main container and the etcd-backup-restore sidecar containers will be configured with livenessProbe and readinessProbe which will indicate the health of the containers and effectively the corresponding ETCD cluster member pod.

Backup Failure

As described above using readinessProbe failures based on latest backup failure is not viable in the multi-node ETCD scenario.

Though cutting off traffic by manipulating client service podSelector is workable, it may not be desirable.

It is recommended that on backup failure, the leading etcd-backup-restore sidecar (the one that is responsible for taking backups at that point in time, as explained in the backup section below, updates the BackupReady condition in the Etcd status and raises a high priority alert to the landscape operators but does not cut off the client traffic.

The reasoning behind this decision to not cut off the client traffic on backup failure is to allow the Kubernetes cluster’s control plane (which relies on the ETCD cluster) to keep functioning as long as possible and to avoid bringing down the control-plane due to a missed backup.

The risk of this approach is that with a cascaded sequence of failures (on top of the backup failure), there is a chance of more data loss than the frequency of backup would otherwise indicate.

To be precise, the risk of such an additional data loss manifests only when backup failure as well as a special case of quorum loss (majority of the members are not ready) happen in such a way that the ETCD cluster needs to be re-bootstrapped from the backup. As described here, re-bootstrapping the ETCD cluster requires restoration from the latest backup only when a majority of members no longer have uncorrupted data persistence.

If persistent storage is used, this will happen only when backup failure as well as a majority of the disks/volumes backing the ETCD cluster members fail simultaneously. This would indeed be rare and might be an acceptable risk.

If ephemeral storage is used (especially, in-memory), the data loss will happen if a majority of the ETCD cluster members become NotReady (requiring a pod restart) at the same time as the backup failure. This may not be as rare as majority members’ disk/volume failure. The risk can be somewhat mitigated at least for planned maintenance operations by postponing potentially disruptive maintenance operations when BackupReady condition is false (vertical scaling, rolling updates, evictions due to node roll-outs).

But in practice (when ephemeral storage is used), the current proposal suggests restoring from the latest full backup even when a minority of ETCD members (even a single pod) restart both to speed up the process of the new member catching up to the latest revision but also to avoid load on the leading member which needs to supply the data to bring the new member up-to-date. But as described here, in case of a minority member failure while using ephemeral storage, it is possible to restart the new member with empty data and let it fetch all the data from the leading member (only if backup is not accessible). Though this is suboptimal, it is workable given the constraints and conditions. With this, the risk of additional data loss in the case of ephemeral storage is only if backup failure as well as quorum loss happens. While this is still less rare than the risk of additional data loss in case of persistent storage, the risk might be tolerable. Provided the risk of quorum loss is not too high. This needs to be monitored/evaluated before opting for ephemeral storage.

Given these constraints, it is better to dynamically avoid/postpone some potentially disruptive operations when BackupReady condition is false. This has the effect of allowing n/2 members to be evicted when the backups are healthy and completely disabling evictions when backups are not healthy.

Skip/postpone potentially disruptive maintenance operations (listed below) when the BackupReady condition is false.
Vertical scaling.
Rolling updates, Basically, any updates to the StatefulSet spec which includes vertical scaling.
Dynamically toggle the minAvailable field of the PodDisruptionBudget between n/2 + 1 and n (where n is the ETCD desired cluster size) whenever the BackupReady condition toggles between true and false.

This will mean that etcd-backup-restore becomes Kubernetes-aware. But there might be reasons for making etcd-backup-restore Kubernetes-aware anyway (e.g. to update the etcd resource status with latest full snapshot details). This enhancement should keep etcd-backup-restore backward compatible. I.e. it should be possible to use etcd-backup-restore Kubernetes-unaware as before this proposal. This is possible either by auto-detecting the existence of kubeconfig or by an explicit command-line flag (such as --enable-client-service-updates which can be defaulted to false for backward compatibility).

Alternative

The alternative is for etcd-druid to implement the above functionality.

But etcd-druid is centrally deployed in the host Kubernetes cluster and cannot scale well horizontally. So, it can potentially be a bottleneck if it is involved in regular health check mechanism for all the etcd clusters it manages. Also, the recommended approach above is more robust because it can work even if etcd-druid is down when the backup upload of a particular etcd cluster fails.

Status

It is desirable (for the etcd-druid and landscape administrators/operators) to maintain/expose status of the etcd cluster instances in the status sub-resource of the Etcd CRD. The proposed structure for maintaining the status is as shown in the example below.

apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
spec:
  replicas: 3
  ...
...
status:
  ...
  conditions:
  - type: Ready                 # Condition type for the readiness of the ETCD cluster
    status: "True"              # Indicates of the ETCD Cluster is ready or not
    lastHeartbeatTime:          "2020-11-10T12:48:01Z"
    lastTransitionTime:         "2020-11-10T12:48:01Z"
    reason: Quorate             # Quorate|QuorumLost
  - type: AllMembersReady       # Condition type for the readiness of all the member of the ETCD cluster
    status: "True"              # Indicates if all the members of the ETCD Cluster are ready
    lastHeartbeatTime:          "2020-11-10T12:48:01Z"
    lastTransitionTime:         "2020-11-10T12:48:01Z"
    reason: AllMembersReady     # AllMembersReady|NotAllMembersReady
  - type: BackupReady           # Condition type for the readiness of the backup of the ETCD cluster
    status: "True"              # Indicates if the backup of the ETCD cluster is ready
    lastHeartbeatTime:          "2020-11-10T12:48:01Z"
    lastTransitionTime:         "2020-11-10T12:48:01Z"
    reason: FullBackupSucceeded # FullBackupSucceeded|IncrementalBackupSucceeded|FullBackupFailed|IncrementalBackupFailed
  ...
  clusterSize: 3
  ...
  replicas: 3
  ...
  members:
  - name: etcd-main-0          # member pod name
    id: 272e204152             # member Id
    role: Leader               # Member|Leader
    status: Ready              # Ready|NotReady|Unknown
    lastTransitionTime:        "2020-11-10T12:48:01Z"
    reason: LeaseSucceeded     # LeaseSucceeded|LeaseExpired|UnknownGracePeriodExceeded|PodNotRead
  - name: etcd-main-1          # member pod name
    id: 272e204153             # member Id
    role: Member               # Member|Leader
    status: Ready              # Ready|NotReady|Unknown
    lastTransitionTime:        "2020-11-10T12:48:01Z"
    reason: LeaseSucceeded     # LeaseSucceeded|LeaseExpired|UnknownGracePeriodExceeded|PodNotRead

This proposal recommends that etcd-druid (preferrably, the custodian controller in etcd-druid) maintains most of the information in the status of the Etcd resources described above.

One exception to this is the BackupReady condition which is recommended to be maintained by the leading etcd-backup-restore sidecar container. This will mean that etcd-backup-restore becomes Kubernetes-aware. But there are other reasons for making etcd-backup-restore Kubernetes-aware anyway (e.g. to maintain health conditions). This enhancement should keep etcd-backup-restore backward compatible. But it should be possible to use etcd-backup-restore Kubernetes-unaware as before this proposal. This is possible either by auto-detecting the existence of kubeconfig or by an explicit command-line flag (such as --enable-etcd-status-updates which can be defaulted to false for backward compatibility).

Members

The members section of the status is intended to be maintained by etcd-druid (preferraby, the custodian controller of etcd-druid) based on the leases of the individual members.

Note

An earlier design in this proposal was for the individual etcd-backup-restore sidecars to update the corresponding status.members entries themselves. But this was redesigned to use member leases to avoid conflicts rising from frequent updates and the limitations in the support for Server-Side Apply in some versions of Kubernetes.

The spec.holderIdentity field in the leases is used to communicate the ETCD member id and role between the etcd-backup-restore sidecars and etcd-druid.

Member name as the key

In an ETCD cluster, the member id is the unique identifier for a member. However, this proposal recommends using a single StatefulSet whose pods form the members of the ETCD cluster and Pods of a StatefulSet have uniquely indexed names as well as uniquely addressible DNS.

This proposal recommends that the name of the member (which is the same as the name of the member Pod) be used as the unique key to identify a member in the members array. This can minimise the need to cleanup superfluous entries in the members array after the member pods are gone to some extent because the replacement pods for any member will share the same name and will overwrite the entry with a possibly new member id.

There is still the possibility of not only superfluous entries in the members array but also superfluous members in the ETCD cluster for which there is no corresponding pod in the StatefulSet anymore.

For example, if an ETCD cluster is scaled up from 3 to 5 and the new members were failing constantly due to insufficient resources and then if the ETCD client is scaled back down to 3 and failing member pods may not have the chance to clean up their member entries (from the members array as well as from the ETCD cluster) leading to superfluous members in the cluster that may have adverse effect on quorum of the cluster.

Hence, the superfluous entries in both members array as well as the ETCD cluster need to be cleaned up as appropriate.

Member Leases

One Kubernetes lease object per desired ETCD member is maintained by etcd-druid (preferrably, the custodian controller in etcd-druid). The lease objects will be created in the same namespace as their owning Etcd object and will have the same name as the member to which they correspond (which, in turn would be the same as the pod name in which the member ETCD process runs).

The lease objects are created and deleted only by etcd-druid but are continually renewed within the leaseDurationSeconds by the individual etcd-backup-restore sidecars (corresponding to their members) if the the corresponding ETCD member is ready and is part of the ETCD cluster.

This will mean that etcd-backup-restore becomes Kubernetes-aware. But there are other reasons for making etcd-backup-restore Kubernetes-aware anyway (e.g. to maintain health conditions). This enhancement should keep etcd-backup-restore backward compatible. But it should be possible to use etcd-backup-restore Kubernetes-unaware as before this proposal. This is possible either by auto-detecting the existence of kubeconfig or by an explicit command-line flag (such as --enable-etcd-lease-renewal which can be defaulted to false for backward compatibility).

A member entry in the Etcd resource status would be marked as Ready (with reason: LeaseSucceeded) if the corresponding pod is ready and the corresponding lease has not yet expired. The member entry would be marked as NotReady if the corresponding pod is not ready (with reason PodNotReady) or as Unknown if the corresponding lease has expired (with reason: LeaseExpired).

While renewing the lease, the etcd-backup-restore sidecars also maintain the ETCD member id and their role (Leader or Member) separated by : in the spec.holderIdentity field of the corresponding lease object since this information is only available to the ETCD member processes and the etcd-backup-restore sidecars (e.g. 272e204152:Leader or 272e204153:Member). When the lease objects are created by etcd-druid, the spec.holderIdentity field would be empty.

The value in spec.holderIdentity in the leases is parsed and copied onto the id and role fields of the corresponding status.members by etcd-druid.

Conditions

The conditions section in the status describe the overall condition of the ETCD cluster. The condition type Ready indicates if the ETCD cluster as a whole is ready to serve requests (i.e. the cluster is quorate) even though some minority of the members are not ready. The condition type AllMembersReady indicates of all the members of the ETCD cluster are ready. The distinction between these conditions could be significant for both external consumers of the status as well as etcd-druid itself. Some maintenance operations might be safe to do (e.g. rolling updates) only when all members of the cluster are ready. The condition type BackupReady indicates of the most recent backup upload (full or incremental) succeeded. This information also might be significant because some maintenance operations might be safe to do (e.g. anything that involves re-bootstrapping the ETCD cluster) only when backup is ready.

The Ready and AllMembersReady conditions can be maintained by etcd-druid based on the status in the members section. The BackupReady condition will be maintained by the leading etcd-backup-restore sidecar that is in charge of taking backups.

More condition types could be introduced in the future if specific purposes arise.

ClusterSize

The clusterSize field contains the current size of the ETCD cluster. It will be actively kept up-to-date by etcd-druid in all scenarios.

Before bootstrapping the ETCD cluster (during cluster creation or later bootstrapping because of quorum failure), etcd-druid will clear the status.members array and set status.clusterSize to be equal to spec.replicas.
While the ETCD cluster is quorate, etcd-druid will actively set status.clusterSize to be equal to length of the status.members whenever the length of the array changes (say, due to scaling of the ETCD cluster).

Given that clusterSize reliably represents the size of the ETCD cluster, it can be used to calculate the Ready condition.

Alternative

The alternative is for etcd-druid to maintain the status in the Etcd status sub-resource. But etcd-druid is centrally deployed in the host Kubernetes cluster and cannot scale well horizontally. So, it can potentially be a bottleneck if it is involved in regular health check mechanism for all the etcd clusters it manages. Also, the recommended approach above is more robust because it can work even if etcd-druid is down when the backup upload of a particular etcd cluster fails.

Decision table for etcd-druid based on the status

The following decision table describes the various criteria etcd-druid takes into consideration to determine the different etcd cluster management scenarios and the corresponding reconciliation actions it must take. The general principle is to detect the scenario and take the minimum action to move the cluster along the path to good health. The path from any one scenario to a state of good health will typically involve going through multiple reconciliation actions which probably take the cluster through many other cluster management scenarios. Especially, it is proposed that individual members auto-heal where possible, even in the case of the failure of a majority of members of the etcd cluster and that etcd-druid takes action only if the auto-healing doesn’t happen for a configured period of time.

1. Pink of health

Observed state

Cluster Size
- Desired: n
- Current: n
StatefulSet replicas
- Desired: n
- Ready: n
Etcd status
- members
  - Total: n
  - Ready: n
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
  - Members with expired lease: 0
- conditions:
  - Ready: true
  - AllMembersReady: true
  - BackupReady: true

Recommended Action

Nothing to do

2. Member status is out of sync with their leases

Observed state

Cluster Size
- Desired: n
- Current: n
StatefulSet replicas
- Desired: n
- Ready: n
Etcd status
- members
  - Total: n
  - Ready: r
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
  - Members with expired lease: l
- conditions:
  - Ready: true
  - AllMembersReady: true
  - BackupReady: true

Recommended Action

Mark the l members corresponding to the expired leases as Unknown with reason LeaseExpired and with id populated from spec.holderIdentity of the lease if they are not already updated so.

Mark the n - l members corresponding to the active leases as Ready with reason LeaseSucceeded and with id populated from spec.holderIdentity of the lease if they are not already updated so.

Please refer here for more details.

3. All members are `Ready` but `AllMembersReady` condition is stale

Observed state

Cluster Size
- Desired: N/A
- Current: N/A
StatefulSet replicas
- Desired: n
- Ready: N/A
Etcd status
- members
  - Total: n
  - Ready: n
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
  - Members with expired lease: 0
- conditions:
  - Ready: N/A
  - AllMembersReady: false
  - BackupReady: N/A

Recommended Action

Mark the status condition type AllMembersReady to true.

4. Not all members are `Ready` but `AllMembersReady` condition is stale

Observed state

Cluster Size
- Desired: N/A
- Current: N/A
StatefulSet replicas
- Desired: n
- Ready: N/A
Etcd status
- members
  - Total: N/A
  - Ready: r where 0 <= r < n
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: nr where 0 < nr < n
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: u where 0 < u < n
  - Members with expired lease: h where 0 < h < n
- conditions:
  - Ready: N/A
  - AllMembersReady: true
  - BackupReady: N/A
where (nr + u + h) > 0 or r < n

Recommended Action

Mark the status condition type AllMembersReady to false.

5. Majority members are `Ready` but `Ready` condition is stale

Observed state

Cluster Size
- Desired: N/A
- Current: N/A
StatefulSet replicas
- Desired: n
- Ready: N/A
Etcd status
- members
  - Total: n
  - Ready: r where r > n/2
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: nr where 0 < nr < n/2
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: u where 0 < u < n/2
  - Members with expired lease: N/A
- conditions:
  - Ready: false
  - AllMembersReady: N/A
  - BackupReady: N/A
where 0 < (nr + u + h) < n/2

Recommended Action

Mark the status condition type Ready to true.

6. Majority members are `NotReady` but `Ready` condition is stale

Observed state

Cluster Size
- Desired: N/A
- Current: N/A
StatefulSet replicas
- Desired: n
- Ready: N/A
Etcd status
- members
  - Total: n
  - Ready: r where 0 < r < n
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: nr where 0 < nr < n
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: u where 0 < u < n
  - Members with expired lease: N/A
- conditions:
  - Ready: true
  - AllMembersReady: N/A
  - BackupReady: N/A
where (nr + u + h) > n/2 or r < n/2

Recommended Action

Mark the status condition type Ready to false.

7. Some members have been in `Unknown` status for a while

Observed state

Cluster Size
- Desired: N/A
- Current: n
StatefulSet replicas
- Desired: N/A
- Ready: N/A
Etcd status
- members
  - Total: N/A
  - Ready: N/A
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: N/A
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: u where u <= n
  - Members with expired lease: N/A
- conditions:
  - Ready: N/A
  - AllMembersReady: N/A
  - BackupReady: N/A

Recommended Action

Mark the u members as NotReady in Etcd status with reason: UnknownGracePeriodExceeded.

8. Some member pods are not `Ready` but have not had the chance to update their status

Observed state

Cluster Size
- Desired: N/A
- Current: n
StatefulSet replicas
- Desired: n
- Ready: s where s < n
Etcd status
- members
  - Total: N/A
  - Ready: N/A
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: N/A
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: N/A
  - Members with expired lease: N/A
- conditions:
  - Ready: N/A
  - AllMembersReady: N/A
  - BackupReady: N/A

Recommended Action

Mark the n - s members (corresponding to the pods that are not Ready) as NotReady in Etcd status with reason: PodNotReady

9. Quorate cluster with a minority of members `NotReady`

Observed state

Cluster Size
- Desired: N/A
- Current: n
StatefulSet replicas
- Desired: N/A
- Ready: N/A
Etcd status
- members
  - Total: n
  - Ready: n - f
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: f where f < n/2
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
  - Members with expired lease: N/A
- conditions:
  - Ready: true
  - AllMembersReady: false
  - BackupReady: true

Recommended Action

Delete the f NotReady member pods to force restart of the pods if they do not automatically restart via failed livenessProbe. The expectation is that they will either re-join the cluster as an existing member or remove themselves and join as new members on restart of the container or pod and renew their leases.

10. Quorum lost with a majority of members `NotReady`

Observed state

Cluster Size
- Desired: N/A
- Current: n
StatefulSet replicas
- Desired: N/A
- Ready: N/A
Etcd status
- members
  - Total: n
  - Ready: n - f
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: f where f >= n/2
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: N/A
  - Members with expired lease: N/A
- conditions:
  - Ready: false
  - AllMembersReady: false
  - BackupReady: true

Recommended Action

Scale down the StatefulSet to replicas: 0. Ensure that all member pods are deleted. Ensure that all the members are removed from Etcd status. Delete and recreate all the member leases. Recover the cluster from loss of quorum as discussed here.

11. Scale up of a healthy cluster

Observed state

Cluster Size
- Desired: d
- Current: n where d > n
StatefulSet replicas
- Desired: N/A
- Ready: n
Etcd status
- members
  - Total: n
  - Ready: n
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
  - Members with expired lease: 0
- conditions:
  - Ready: true
  - AllMembersReady: true
  - BackupReady: true

Recommended Action

Add d - n new members by scaling the StatefulSet to replicas: d. The rest of the StatefulSet spec need not be updated until the next cluster bootstrapping (alternatively, the rest of the StatefulSet spec can be updated pro-actively once the new members join the cluster. This will trigger a rolling update).

Also, create the additional member leases for the d - n new members.

12. Scale down of a healthy cluster

Observed state

Cluster Size
- Desired: d
- Current: n where d < n
StatefulSet replicas
- Desired: n
- Ready: n
Etcd status
- members
  - Total: n
  - Ready: n
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
  - Members with expired lease: 0
- conditions:
  - Ready: true
  - AllMembersReady: true
  - BackupReady: true

Recommended Action

Remove d - n existing members (numbered d, d + 1 … n) by scaling the StatefulSet to replicas: d. The StatefulSet spec need not be updated until the next cluster bootstrapping (alternatively, the StatefulSet spec can be updated pro-actively once the superfluous members exit the cluster. This will trigger a rolling update).

Also, delete the member leases for the d - n members being removed.

The superfluous entries in the members array will be cleaned up as explained here. The superfluous members in the ETCD cluster will be cleaned up by the leading etcd-backup-restore sidecar.

13. Superfluous member entries in `Etcd` status

Observed state

Cluster Size
- Desired: N/A
- Current: n
StatefulSet replicas
- Desired: n
- Ready: n
Etcd status
- members
  - Total: m where m > n
  - Ready: N/A
  - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: N/A
  - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: N/A
  - Members with expired lease: N/A
- conditions:
  - Ready: N/A
  - AllMembersReady: N/A
  - BackupReady: N/A

Recommended Action

Remove the superfluous m - n member entries from Etcd status (numbered n, n+1 … m). Remove the superfluous m - n member leases if they exist. The superfluous members in the ETCD cluster will be cleaned up by the leading etcd-backup-restore sidecar.

Decision table for etcd-backup-restore during initialization

As discussed above, the initialization sequence of etcd-backup-restore in a member pod needs to generate suitable etcd configuration for its etcd container. It also might have to handle the etcd database verification and restoration functionality differently in different scenarios.

The initialization sequence itself is proposed to be as follows. It is an enhancement of the existing initialization sequence. etcd member initialization sequence

The details of the decisions to be taken during the initialization are given below.

1. First member during bootstrap of a fresh etcd cluster

Observed state

Cluster Size: n
Etcd status members:
- Total: 0
- Ready: 0
- Status contains own member: false
Data persistence
- WAL directory has cluster/ member metadata: false
- Data directory is valid and up-to-date: false
Backup
- Backup exists: false
- Backup has incremental snapshots: false

Recommended Action

Generate etcd configuration with n initial cluster peer URLs and initial cluster state new and return success.

2. Addition of a new following member during bootstrap of a fresh etcd cluster

Observed state

Cluster Size: n
Etcd status members:
- Total: m where 0 < m < n
- Ready: m
- Status contains own member: false
Data persistence
- WAL directory has cluster/ member metadata: false
- Data directory is valid and up-to-date: false
Backup
- Backup exists: false
- Backup has incremental snapshots: false

Recommended Action

Generate etcd configuration with n initial cluster peer URLs and initial cluster state new and return success.

3. Restart of an existing member of a quorate cluster with valid metadata and data

Observed state

Cluster Size: n
Etcd status members:
- Total: m where m > n/2
- Ready: r where r > n/2
- Status contains own member: true
Data persistence
- WAL directory has cluster/ member metadata: true
- Data directory is valid and up-to-date: true
Backup
- Backup exists: N/A
- Backup has incremental snapshots: N/A

Recommended Action

Re-use previously generated etcd configuration and return success.

4. Restart of an existing member of a quorate cluster with valid metadata but without valid data

Observed state

Cluster Size: n
Etcd status members:
- Total: m where m > n/2
- Ready: r where r > n/2
- Status contains own member: true
Data persistence
- WAL directory has cluster/ member metadata: true
- Data directory is valid and up-to-date: false
Backup
- Backup exists: N/A
- Backup has incremental snapshots: N/A

Recommended Action

Remove self as a member (old member ID) from the etcd cluster as well as Etcd status. Add self as a new member of the etcd cluster as well as in the Etcd status. If backups do not exist, create an empty data and WAL directory. If backups exist, restore only the latest full snapshot (please see here for the reason for not restoring incremental snapshots). Generate etcd configuration with n initial cluster peer URLs and initial cluster state existing and return success.

5. Restart of an existing member of a quorate cluster without valid metadata

Observed state

Cluster Size: n
Etcd status members:
- Total: m where m > n/2
- Ready: r where r > n/2
- Status contains own member: true
Data persistence
- WAL directory has cluster/ member metadata: false
- Data directory is valid and up-to-date: N/A
Backup
- Backup exists: N/A
- Backup has incremental snapshots: N/A

Recommended Action

6. Restart of an existing member of a non-quorate cluster with valid metadata and data

Observed state

Cluster Size: n
Etcd status members:
- Total: m where m < n/2
- Ready: r where r < n/2
- Status contains own member: true
Data persistence
- WAL directory has cluster/ member metadata: true
- Data directory is valid and up-to-date: true
Backup
- Backup exists: N/A
- Backup has incremental snapshots: N/A

Recommended Action

Re-use previously generated etcd configuration and return success.

7. Restart of the first member of a non-quorate cluster without valid data

Observed state

Cluster Size: n
Etcd status members:
- Total: 0
- Ready: 0
- Status contains own member: false
Data persistence
- WAL directory has cluster/ member metadata: N/A
- Data directory is valid and up-to-date: false
Backup
- Backup exists: N/A
- Backup has incremental snapshots: N/A

Recommended Action

If backups do not exist, create an empty data and WAL directory. If backups exist, restore the latest full snapshot. Start a single-node embedded etcd with initial cluster peer URLs containing only own peer URL and initial cluster state new. If incremental snapshots exist, apply them serially (honouring source transactions). Take and upload a full snapshot after incremental snapshots are applied successfully (please see here for more reasons why). Generate etcd configuration with n initial cluster peer URLs and initial cluster state new and return success.

8. Restart of a following member of a non-quorate cluster without valid data

Observed state

Cluster Size: n
Etcd status members:
- Total: m where 1 < m < n
- Ready: r where 1 < r < n
- Status contains own member: false
Data persistence
- WAL directory has cluster/ member metadata: N/A
- Data directory is valid and up-to-date: false
Backup
- Backup exists: N/A
- Backup has incremental snapshots: N/A

Recommended Action

If backups do not exist, create an empty data and WAL directory. If backups exist, restore only the latest full snapshot (please see here for the reason for not restoring incremental snapshots). Generate etcd configuration with n initial cluster peer URLs and initial cluster state existing and return success.

Backup

Only one of the etcd-backup-restore sidecars among the members are required to take the backup for a given ETCD cluster. This can be called a backup leader. There are two possibilities to ensure this.

Leading ETCD main container’s sidecar is the backup leader

The backup-restore sidecar could poll the etcd cluster and/or its own etcd main container to see if it is the leading member in the etcd cluster. This information can be used by the backup-restore sidecars to decide that sidecar of the leading etcd main container is the backup leader (i.e. responsible to for taking/uploading backups regularly).

The advantages of this approach are as follows.

The approach is operationally and conceptually simple. The leading etcd container and backup-restore sidecar are always located in the same pod.
Network traffic between the backup container and the etcd cluster will always be local.

The disadvantage is that this approach may not age well in the future if we think about moving the backup-restore container as a separate pod rather than a sidecar container.

Independent leader election between backup-restore sidecars

We could use the etcd lease mechanism to perform leader election among the backup-restore sidecars. For example, using something like go.etcd.io/etcd/clientv3/concurrency.

The advantage and disadvantages are pretty much the opposite of the approach above. The advantage being that this approach may age well in the future if we think about moving the backup-restore container as a separate pod rather than a sidecar container.

The disadvantages are as follows.

The approach is operationally and conceptually a bit complex. The leading etcd container and backup-restore sidecar might potentially belong to different pods.
Network traffic between the backup container and the etcd cluster might potentially be across nodes.

History Compaction

This proposal recommends to configure automatic history compaction on the individual members.

Defragmentation

Defragmentation is already triggered periodically by etcd-backup-restore. This proposal recommends to enhance this functionality to be performed only by the leading backup-restore container. The defragmentation must be performed only when etcd cluster is in full health and must be done in a rolling manner for each members to avoid disruption. The leading member should be defragmented last after all the rest of the members have been defragmented to minimise potential leadership changes caused by defragmentation. If the etcd cluster is unhealthy when it is time to trigger scheduled defragmentation, the defragmentation must be postponed until the cluster becomes healthy. This check must be done before triggering defragmentation for each member.

Work-flows in etcd-backup-restore

There are different work-flows in etcd-backup-restore. Some existing flows like initialization, scheduled backups and defragmentation have been enhanced or modified. Some new work-flows like status updates have been introduced. Some of these work-flows are sensitive to which etcd-backup-restore container is leading and some are not.

The life-cycle of these work-flows is shown below. etcd-backup-restore work-flows life-cycle

Work-flows independent of leader election in all members

Serve the HTTP API that all members are expected to support currently but some HTTP API call which are used to take out-of-sync delta or full snapshot should delegate the incoming HTTP requests to the leading-sidecar and one of the possible approach to achieve this is via an HTTP reverse proxy.
Check the health of the respective etcd member and renew the corresponding member lease.

Work-flows only on the leading member

Take backups (full and incremental) at configured regular intervals
Defragment all the members sequentially at configured regular intervals
Cleanup superflous members from the ETCD cluster for which there is no corresponding pod (the ordinal in the pod name is greater than the cluster size) at regular intervals (or whenever the Etcd resource status changes by watching it)
- The cleanup of superfluous entries in status.members array is already covered here

High Availability

Considering that high-availability is the primary reason for using a multi-node etcd cluster, it makes sense to distribute the individual member pods of the etcd cluster across different physical nodes. If the underlying Kubernetes cluster has nodes from multiple availability zones, it makes sense to also distribute the member pods across nodes from different availability zones.

One possibility to do this is via SelectorSpreadPriority of kube-scheduler but this is only best-effort and may not always be enforced strictly.

It is better to use pod anti-affinity to enforce such distribution of member pods.

Zonal Cluster - Single Availability Zone

A zonal cluster is configured to consist of nodes belonging to only a single availability zone in a region of the cloud provider. In such a case, we can at best distribute the member pods of a multi-node etcd cluster instance only across different nodes in the configured availability zone.

This can be done by specifying pod anti-affinity in the specification of the member pods using kubernetes.io/hostname as the topology key.

apiVersion: apps/v1
kind: StatefulSet
...
spec:
  ...
  template:
    ...
    spec:
      ...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector: {} # podSelector that matches the member pods of the given etcd cluster instance
            topologyKey: "kubernetes.io/hostname"
      ...
    ...
  ...

The recommendation is to keep etcd-druid agnostic of such topics related scheduling and cluster-topology and to use kupid to orthogonally inject the desired pod anti-affinity.

Alternative

Another option is to build the functionality into etcd-druid to include the required pod anti-affinity when it provisions the StatefulSet that manages the member pods. While this has the advantage of avoiding a dependency on an external component like kupid, the disadvantage is that we might need to address development or testing use-cases where it might be desirable to avoid distributing member pods and schedule them on as less number of nodes as possible. Also, as mentioned below, kupid can be used to distribute member pods of an etcd cluster instance across nodes in a single availability zone as well as across nodes in multiple availability zones with very minor variation. This keeps the solution uniform regardless of the topology of the underlying Kubernetes cluster.

Regional Cluster - Multiple Availability Zones

A regional cluster is configured to consist of nodes belonging to multiple availability zones (typically, three) in a region of the cloud provider. In such a case, we can distribute the member pods of a multi-node etcd cluster instance across nodes belonging to different availability zones.

This can be done by specifying pod anti-affinity in the specification of the member pods using topology.kubernetes.io/zone as the topology key. In Kubernetes clusters using Kubernetes release older than 1.17, the older (and now deprecated) failure-domain.beta.kubernetes.io/zone might have to be used as the topology key.

apiVersion: apps/v1
kind: StatefulSet
...
spec:
  ...
  template:
    ...
    spec:
      ...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector: {} # podSelector that matches the member pods of the given etcd cluster instance
            topologyKey: "topology.kubernetes.io/zone
      ...
    ...
  ...

The recommendation is to keep etcd-druid agnostic of such topics related scheduling and cluster-topology and to use kupid to orthogonally inject the desired pod anti-affinity.

Alternative

PodDisruptionBudget

This proposal recommends that etcd-druid should deploy PodDisruptionBudget (minAvailable set to floor(<cluster size>/2) + 1) for multi-node etcd clusters (if AllMembersReady condition is true) to ensure that any planned disruptive operation can try and honour the disruption budget to ensure high availability of the etcd cluster while making potentially disrupting maintenance operations.

Also, it is recommended to toggle the minAvailable field between floor(<cluster size>/2) and <number of members with status Ready true> whenever the AllMembersReady condition toggles between true and false. This is to disable eviction of any member pods when not all members are Ready.

In case of a conflict, the recommendation is to use the highest of the applicable values for minAvailable.

Rolling updates to etcd members

Any changes to the Etcd resource spec that might result in a change to StatefulSet spec or otherwise result in a rolling update of member pods should be applied/propagated by etcd-druid only when the etcd cluster is fully healthy to reduce the risk of quorum loss during the updates. This would include vertical autoscaling changes (via, HVPA). If the cluster status unhealthy (i.e. if either AllMembersReady or BackupReady conditions are false), etcd-druid must restore it to full health before proceeding with such operations that lead to rolling updates. This can be further optimized in the future to handle the cases where rolling updates can still be performed on an etcd cluster that is not fully healthy.

Follow Up

Ephemeral Volumes

See section Ephemeral Volumes.

Shoot Control-Plane Migration

This proposal adds support for multi-node etcd clusters but it should not have significant impact on shoot control-plane migration any more than what already present in the single-node etcd cluster scenario. But to be sure, this needs to be discussed further.

Performance impact of multi-node etcd clusters

Multi-node etcd clusters incur a cost on write performance as compared to single-node etcd clusters. This performance impact needs to be measured and documented. Here, we should compare different persistence option for the multi-nodeetcd clusters so that we have all the information necessary to take the decision balancing the high-availability, performance and costs.

Metrics, Dashboards and Alerts

There are already metrics exported by etcd and etcd-backup-restore which are visualized in monitoring dashboards and also used in triggering alerts. These might have hidden assumptions about single-node etcd clusters. These might need to be enhanced and potentially new metrics, dashboards and alerts configured to cover the multi-node etcd cluster scenario.

Especially, a high priority alert must be raised if BackupReady condition becomes false.

Costs

Multi-node etcd clusters will clearly involve higher cost (when compared with single-node etcd clusters) just going by the CPU and memory usage for the additional members. Also, the different options for persistence for etcd data for the members will have different cost implications. Such cost impact needs to be assessed and documented to help navigate the trade offs between high availability, performance and costs.

Future Work

Gardener Ring

Gardener Ring, requires provisioning and management of an etcd cluster with the members distributed across more than one Kubernetes cluster. This cannot be achieved by etcd-druid alone which has only the view of a single Kubernetes cluster. An additional component that has the view of all the Kubernetes clusters involved in setting up the gardener ring will be required to achieve this. However, etcd-druid can be used by such a higher-level component/controller (for example, by supplying the initial cluster configuration) such that individual etcd-druid instances in the individual Kubernetes clusters can manage the corresponding etcd cluster members.

Autonomous Shoot Clusters

Autonomous Shoot Clusters also will require a highly availble etcd cluster to back its control-plane and the multi-node support proposed here can be leveraged in that context. However, the current proposal will not meet all the needs of a autonomous shoot cluster. Some additional components will be required that have the overall view of the autonomous shoot cluster and they can use etcd-druid to manage the multi-node etcd cluster. But this scenario may be different from that of Gardener Ring in that the individual etcd members of the cluster may not be hosted on different Kubernetes clusters.

Optimization of recovery from non-quorate cluster with some member containing valid data

It might be possible to optimize the actions during the recovery of a non-quorate cluster where some of the members contain valid data and some other don’t. The optimization involves verifying the data of the valid members to determine the data of which member is the most recent (even considering the latest backup) so that the full snapshot can be taken from it before recovering the etcd cluster. Such an optimization can be attempted in the future.

Optimization of rolling updates to unhealthy etcd clusters

As mentioned above, optimizations to proceed with rolling updates to unhealthy etcd clusters (without first restoring the cluster to full health) can be pursued in future work.

2 - 02 Snapshot Compaction

DEP-02: Snapshot Compaction for Etcd

Current Problem

To ensure recoverability of Etcd, backups of the database are taken at regular interval. Backups are of two types: Full Snapshots and Incremental Snapshots.

Full Snapshots

Full snapshot is a snapshot of the complete database at given point in time.The size of the database keeps changing with time and typically the size is relatively large (measured in 100s of megabytes or even in gigabytes. For this reason, full snapshots are taken after some large intervals.

Incremental Snapshots

Incremental Snapshots are collection of events on Etcd database, obtained through running WATCH API Call on Etcd. After some short intervals, all the events that are accumulated through WATCH API Call are saved in a file and named as Incremental Snapshots at relatively short time intervals.

Recovery from the Snapshots

Recovery from Full Snapshots

As the full snapshots are snapshots of the complete database, the whole database can be recovered from a full snapshot in one go. Etcd provides API Call to restore the database from a full snapshot file.

Recovery from Incremental Snapshots

Delta snapshots are collection of retrospective Etcd events. So, to restore from Incremental snapshot file, the events from the file are needed to be applied sequentially on Etcd database through Etcd Put/Delete API calls. As it is heavily dependent on Etcd calls sequentially, restoring from Incremental Snapshot files can take long if there are numerous commands captured in Incremental Snapshot files.

Delta snapshots are applied on top of running Etcd database. So, if there is inconsistency between the state of database at the point of applying and the state of the database when the delta snapshot commands were captured, restoration will fail.

Currently, in Gardener setup, Etcd is restored from the last full snapshot and then the delta snapshots, which were captured after the last full snapshot.

The main problem with this is that the complete restoration time can be unacceptably large if the rate of change coming into the etcd database is quite high because there are large number of events in the delta snapshots to be applied sequentially. A secondary problem is that, though auto-compaction is enabled for etcd, it is not quick enough to compact all the changes from the incremental snapshots being re-applied during the relatively short period of time of restoration (as compared to the actual period of time when the incremental snapshots were accumulated). This may lead to the etcd pod (the backup-restore sidecar container, to be precise) to run out of memory and/or storage space even if it is sufficient for normal operations.

Solution

Compaction command

To help with the problem mentioned earlier, our proposal is to introduce compact subcommand with etcdbrctl. On execution of compact command, A separate embedded Etcd process will be started where the Etcd data will be restored from the snapstore (exactly as in the restoration scenario today). Then the new Etcd database will be compacted and defragmented using Etcd API calls. The compaction will strip off the Etcd database of old revisions as per the Etcd auto-compaction configuration. The defragmentation will free up the unused fragment memory space released after compaction. Then a full snapshot of the compacted database will be saved in snapstore which then can be used as the base snapshot during any subsequent restoration (or backup compaction).

How the solution works

The newly introduced compact command does not disturb the running Etcd while compacting the backup snapshots. The command is designed to run potentially separately (from the main Etcd process/container/pod). Etcd Druid can be configured to run the newly introduced compact command as a separate job (scheduled periodically) based on total number of Etcd events accumulated after the most recent full snapshot.

Etcd-druid flags:

Etcd-druid introduces the following flags to configure the compaction job:

--enable-backup-compaction (default false): Set this flag to true to enable the automatic compaction of etcd backups when the threshold value denoted by CLI flag --etcd-events-threshold is exceeded.
--compaction-workers (default 3): Number of worker threads of the CompactionJob controller. The controller creates a backup compaction job if a certain etcd event threshold is reached. If compaction is enabled, the value for this flag must be greater than zero.
--etcd-events-threshold (default 1000000): Total number of etcd events that can be allowed before a backup compaction job is triggered.
--active-deadline-duration (default 3h): Duration after which a running backup compaction job will be terminated.
--metrics-scrape-wait-duration (default 0s): Duration to wait for after compaction job is completed, to allow Prometheus metrics to be scraped.

Points to take care while saving the compacted snapshot:

As compacted snapshot and the existing periodic full snapshots are taken by different processes running in different pods but accessing same store to save the snapshots, some problems may arise:

When uploading the compacted snapshot to the snapstore, there is the problem of how does the restorer know when to start using the newly compacted snapshot. This communication needs to be atomic.
With a regular schedule for compaction that happens potentially separately from the main etcd pod, is there a need for regular scheduled full snapshots anymore?
We are planning to introduce new directory structure, under v2 prefix, for saving the snapshots (compacted and full), as mentioned in details below. But for backward compatibility, we also need to consider the older directory, which is currently under v1 prefix, during accessing snapshots.

How to swap full snapshot with compacted snapshot atomically

Currently, full snapshots and the subsequent delta snapshots are grouped under same prefix path in the snapstore. When a full snapshot is created, it is placed under a prefix/directory with the name comprising of timestamp. Then subsequent delta snapshots are also pushed into the same directory. Thus each prefix/directory contains a single full snapshot and the subsequent delta snapshots. So far, it is the job of ETCDBR to start main Etcd process and snapshotter process which takes full snapshot and delta snapshot periodically. But as per our proposal, compaction will be running as parallel process to main Etcd process and snapshotter process. So we can’t reliably co-ordinate between the processes to achieve switching to the compacted snapshot as the base snapshot atomically.

Current Directory Structure

- Backup-192345
    - Full-Snapshot-0-1-192345
    - Incremental-Snapshot-1-100-192355
    - Incremental-Snapshot-100-200-192365
    - Incremental-Snapshot-200-300-192375
- Backup-192789
    - Full-Snapshot-0-300-192789
    - Incremental-Snapshot-300-400-192799
    - Incremental-Snapshot-400-500-192809
    - Incremental-Snapshot-500-600-192819

To solve the problem, proposal is:

ETCDBR will take the first full snapshot after it starts main Etcd Process and snapshotter process. After taking the first full snapshot, snapshotter will continue taking full snapshots. On the other hand, ETCDBR compactor command will be run as periodic job in a separate pod and use the existing full or compacted snapshots to produce further compacted snapshots. Full snapshots and compacted snapshots will be named after same fashion. So, there is no need of any mechanism to choose which snapshots(among full and compacted snapshot) to consider as base snapshots.
Flatten the directory structure of backup folder. Save all the full snapshots, delta snapshots and compacted snapshots under same directory/prefix. Restorer will restore from full/compacted snapshots and delta snapshots sorted based on the revision numbers in name (or timestamp if the revision numbers are equal).

Proposed Directory Structure

Backup :
    - Full-Snapshot-0-1-192355 (Taken by snapshotter)
    - Incremental-Snapshot-revision-1-100-192365
    - Incremental-Snapshot-revision-100-200-192375
    - Full-Snapshot-revision-0-200-192379 (Taken by snapshotter)
    - Incremental-Snapshot-revision-200-300-192385
    - Full-Snapshot-revision-0-300-192386 (Taken by compaction job)
    - Incremental-Snapshot-revision-300-400-192396
    - Incremental-Snapshot-revision-400-500-192406
    - Incremental-Snapshot-revision-500-600-192416
    - Full-Snapshot-revision-0-600-192419 (Taken by snapshotter)
    - Full-Snapshot-revision-0-600-192420 (Taken by compaction job)

What happens to the delta snapshots that were compacted?

The proposed compaction sub-command in etcdbrctl (and hence, the CronJob provisioned by etcd-druid that will schedule it at a regular interval) would only upload the compacted full snapshot. It will not delete the snapshots (delta or full snapshots) that were compacted. These snapshots which were superseded by a freshly uploaded compacted snapshot would follow the same life-cycle as other older snapshots. I.e. they will be garbage collected according to the configured backup snapshot retention policy. For example, if an exponential retention policy is configured and if compaction is done every 30m then there might be at most 48 additional (compacted) full snapshots (24h * 2) in the backup for the latest day. As time rolls forward to the next day, these additional compacted snapshots (along with the delta snapshots that were compacted into them) will get garbage collected retaining only one full snapshot for the day before according to the retention policy.

Future work

In the future, we have plan to stop the snapshotter just after taking the first full snapshot. Then, the compaction job will be solely responsible for taking subsequent full snapshots. The directory structure would be looking like following:

Backup :
    - Full-Snapshot-0-1-192355 (Taken by snapshotter)
    - Incremental-Snapshot-revision-1-100-192365
    - Incremental-Snapshot-revision-100-200-192375
    - Incremental-Snapshot-revision-200-300-192385
    - Full-Snapshot-revision-0-300-192386 (Taken by compaction job)
    - Incremental-Snapshot-revision-300-400-192396
    - Incremental-Snapshot-revision-400-500-192406
    - Incremental-Snapshot-revision-500-600-192416
    - Full-Snapshot-revision-0-600-192420 (Taken by compaction job)

Backward Compatibility

Restoration : The changes to handle the newly proposed backup directory structure must be backward compatible with older structures at least for restoration because we need have to restore from backups in the older structure. This includes the support for restoring from a backup without a metadata file if that is used in the actual implementation.
Backup : For new snapshots (even on a backup containing the older structure), the new structure may be used. The new structure must be setup automatically including creating the base full snapshot.
Garbage collection : The existing functionality of garbage collection of snapshots (full and incremental) according to the backup retention policy must be compatible with both old and new backup folder structure. I.e. the snapshots in the older backup structure must be retained in their own structure and the snapshots in the proposed backup structure should be retained in the proposed structure. Once all the snapshots in the older backup structure go out of the retention policy and are garbage collected, we can think of removing the support for older backup folder structure.

Note: Compactor will run parallel to current snapshotter process and work only if there is any full snapshot already present in the store. By current design, a full snapshot will be taken if there is already no full snapshot or the existing full snapshot is older than 24 hours. It is not limitation but a design choice. As per proposed design, the backup storage will contain both periodic full snapshots as well as periodic compacted snapshot. Restorer will pickup the base snapshot whichever is latest one.

3 - 03 Scaling Up An Etcd Cluster

DEP-03: Scaling-up a single-node to multi-node etcd cluster deployed by etcd-druid

To mark a cluster for scale-up from single node to multi-node etcd, just patch the etcd custom resource’s .spec.replicas from 1 to 3 (for example).

Challenges for scale-up

Etcd cluster with single replica don’t have any peers, so no peer communication is required hence peer URL may or may not be TLS enabled. However, while scaling up from single node etcd to multi-node etcd, there will be a requirement to have peer communication between members of the etcd cluster. Peer communication is required for various reasons, for instance for members to sync up cluster state, data, and to perform leader election or any cluster wide operation like removal or addition of a member etc. Hence in a multi-node etcd cluster we need to have TLS enable peer URL for peer communication.
Providing the correct configuration to start new etcd members as it is different from boostrapping a cluster since these new etcd members will join an existing cluster.

Approach

We first went through the etcd doc of update-advertise-peer-urls to find out information regarding peer URL updation. Interestingly, etcd doc has mentioned the following:

To update the advertise peer URLs of a member, first update it explicitly via member command and then restart the member.

But we can’t assume peer URL is not TLS enabled for single-node cluster as it depends on end-user. A user may or may not enable the TLS for peer URL for a single node etcd cluster. So, How do we detect whether peer URL was enabled or not when cluster is marked for scale-up?

Detecting if peerURL TLS is enabled or not

For this we use an annotation in member lease object member.etcd.gardener.cloud/tls-enabled set by backup-restore sidecar of etcd. As etcd configuration is provided by backup-restore, so it can find out whether TLS is enabled or not and accordingly set this annotation member.etcd.gardener.cloud/tls-enabled to either true or false in member lease object. And with the help of this annotation and config-map values etcd-druid is able to detect whether there is a change in a peer URL or not.

Etcd-Druid helps in scaling up etcd cluster

Now, it is detected whether peer URL was TLS enabled or not for single node etcd cluster. Etcd-druid can now use this information to take action:

If peer URL was already TLS enabled then no action is required from etcd-druid side. Etcd-druid can proceed with scaling up the cluster.
If peer URL was not TLS enabled then etcd-druid has to intervene and make sure peer URL should be TLS enabled first for the single node before marking the cluster for scale-up.

Action taken by etcd-druid to enable the peerURL TLS

Etcd-druid will update the {etcd.Name}-config config-map with new config like initial-cluster,initial-advertise-peer-urls etc. Backup-restore will detect this change and update the member lease annotation to member.etcd.gardener.cloud/tls-enabled: "true".
In case the peer URL TLS has been changed to enabled: Etcd-druid will add tasks to the deployment flow:
- Check if peer TLS has been enabled for existing StatefulSet pods, by checking the member leases for the annotation member.etcd.gardener.cloud/tls-enabled.
- If peer TLS enablement is pending for any of the members, then check and patch the StatefulSet with the peer TLS volume mounts, if not already patched. This will cause a rolling update of the existing StatefulSet pods, which allows etcd-backup-restore to update the member peer URL in the etcd cluster.
- Requeue this reconciliation flow until peer TLS has been enabled for all the existing etcd members.

After PeerURL is TLS enabled

After peer URL TLS enablement for single node etcd cluster, now etcd-druid adds a scale-up annotation: gardener.cloud/scaled-to-multi-node to the etcd statefulset and etcd-druid will patch the statefulsets .spec.replicas to 3(for example). The statefulset controller will then bring up new pods(etcd with backup-restore as a sidecar). Now etcd’s sidecar i.e backup-restore will check whether this member is already a part of a cluster or not and incase it is unable to check (may be due to some network issues) then backup-restore checks presence of this annotation: gardener.cloud/scaled-to-multi-node in etcd statefulset to detect scale-up. If it finds out it is the scale-up case then backup-restore adds new etcd member as a learner first and then starts the etcd learner by providing the correct configuration. Once learner gets in sync with the etcd cluster leader, it will get promoted to a voting member.

Providing the correct etcd config

As backup-restore detects that it’s a scale-up scenario, backup-restore sets initial-cluster-state to existing as this member will join an existing cluster and it calculates the rest of the config from the updated config-map provided by etcd-druid.

Sequence diagram

Future improvements:

The need of restarting etcd pods twice will change in the future. please refer: https://github.com/gardener/etcd-backup-restore/issues/538

4 - Add New Etcd Cluster Component

Add A New Etcd Cluster Component

etcd-druid defines an Operator which is responsible for creation, deletion and update of a resource that is created for an Etcd cluster. If you want to introduce a new resource for an Etcd cluster then you must do the following:

Add a dedicated package for the resource under component.
Implement Operator interface.
Define a new Kind for this resource in the operator Registry.
Every resource a.k.a Component needs to have the following set of default labels:
- app.kubernetes.io/name - value of this label is the name of this component. Helper functions are defined here to create the name of each component using the parent Etcd resource. Please define a new helper function to generate the name of your resource using the parent Etcd resource.
- app.kubernetes.io/component - value of this label is the type of the component. All component type label values are defined here where you can add an entry for your component.
- In addition to the above component specific labels, each resource/component should have default labels defined on the Etcd resource. You can use GetDefaultLabels function.
These labels are also part of recommended labels by kubernetes. NOTE: Constants for the label keys are already defined here.
Ensure that there is no wait introduced in any Operator method implementation in your component. In case there are multiple steps to be executed in a sequence then re-queue the event with a special error code in case there is an error or if the pre-conditions check to execute the next step are not yet satisfied.
All errors should be wrapped with a custom DruidError.

5 - API Reference

Packages:

druid.gardener.cloud/v1alpha1

druid.gardener.cloud/v1alpha1

Package v1alpha1 is the v1alpha1 version of the etcd-druid API.

Resource Types:

BackupSpec

(Appears on: EtcdSpec)

BackupSpec defines parameters associated with the full and delta snapshots of etcd.

Field	Description
`port` int32	(Optional) Port define the port on which etcd-backup-restore server will be exposed.
`tls` TLSConfig	(Optional)
`image` string	(Optional) Image defines the etcd container image and tag
`store` StoreSpec	(Optional) Store defines the specification of object store provider for storing backups.
`resources` Kubernetes core/v1.ResourceRequirements	(Optional) Resources defines compute Resources required by backup-restore container. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
`compactionResources` Kubernetes core/v1.ResourceRequirements	(Optional) CompactionResources defines compute Resources required by compaction job. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
`fullSnapshotSchedule` string	(Optional) FullSnapshotSchedule defines the cron standard schedule for full snapshots.
`garbageCollectionPolicy` GarbageCollectionPolicy	(Optional) GarbageCollectionPolicy defines the policy for garbage collecting old backups
`garbageCollectionPeriod` Kubernetes meta/v1.Duration	(Optional) GarbageCollectionPeriod defines the period for garbage collecting old backups
`deltaSnapshotPeriod` Kubernetes meta/v1.Duration	(Optional) DeltaSnapshotPeriod defines the period after which delta snapshots will be taken
`deltaSnapshotMemoryLimit` k8s.io/apimachinery/pkg/api/resource.Quantity	(Optional) DeltaSnapshotMemoryLimit defines the memory limit after which delta snapshots will be taken
`compression` CompressionSpec	(Optional) SnapshotCompression defines the specification for compression of Snapshots.
`enableProfiling` bool	(Optional) EnableProfiling defines if profiling should be enabled for the etcd-backup-restore-sidecar
`etcdSnapshotTimeout` Kubernetes meta/v1.Duration	(Optional) EtcdSnapshotTimeout defines the timeout duration for etcd FullSnapshot operation
`leaderElection` LeaderElectionSpec	(Optional) LeaderElection defines parameters related to the LeaderElection configuration.

ClientService

(Appears on: EtcdConfig)

ClientService defines the parameters of the client service that a user can specify

Field	Description
`annotations` map[string]string	(Optional) Annotations specify the annotations that should be added to the client service
`labels` map[string]string	(Optional) Labels specify the labels that should be added to the client service

CompactionMode (`string` alias)

(Appears on: SharedConfig)

CompactionMode defines the auto-compaction-mode: ‘periodic’ or ‘revision’. ‘periodic’ for duration based retention and ‘revision’ for revision number based retention.

CompressionPolicy (`string` alias)

(Appears on: CompressionSpec)

CompressionPolicy defines the type of policy for compression of snapshots.

CompressionSpec

(Appears on: BackupSpec)

CompressionSpec defines parameters related to compression of Snapshots(full as well as delta).

Field	Description
`enabled` bool	(Optional)
`policy` CompressionPolicy	(Optional)

Condition

(Appears on: EtcdCopyBackupsTaskStatus, EtcdStatus)

Condition holds the information about the state of a resource.

Field	Description
`type` ConditionType	Type of the Etcd condition.
`status` ConditionStatus	Status of the condition, one of True, False, Unknown.
`lastTransitionTime` Kubernetes meta/v1.Time	Last time the condition transitioned from one status to another.
`lastUpdateTime` Kubernetes meta/v1.Time	Last time the condition was updated.
`reason` string	The reason for the condition’s last transition.
`message` string	A human-readable message indicating details about the transition.

ConditionStatus (`string` alias)

(Appears on: Condition)

ConditionStatus is the status of a condition.

ConditionType (`string` alias)

(Appears on: Condition)

ConditionType is the type of condition.

CrossVersionObjectReference

(Appears on: EtcdStatus)

CrossVersionObjectReference contains enough information to let you identify the referred resource.

Field	Description
`kind` string	Kind of the referent
`name` string	Name of the referent
`apiVersion` string	(Optional) API version of the referent

Etcd

Etcd is the Schema for the etcds API

Field Description

metadata
Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field.

spec
EtcdSpec

`selector` Kubernetes meta/v1.LabelSelector	selector is a label query over pods that should match the replica count. It must match the pod template’s labels. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
`labels` map[string]string
`annotations` map[string]string	(Optional)
`etcd` EtcdConfig
`backup` BackupSpec
`sharedConfig` SharedConfig	(Optional)
`schedulingConstraints` SchedulingConstraints	(Optional)
`replicas` int32
`priorityClassName` string	(Optional) PriorityClassName is the name of a priority class that shall be used for the etcd pods.
`storageClass` string	(Optional) StorageClass defines the name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1
`storageCapacity` k8s.io/apimachinery/pkg/api/resource.Quantity	(Optional) StorageCapacity defines the size of persistent volume.
`volumeClaimTemplate` string	(Optional) VolumeClaimTemplate defines the volume claim template to be created

status
EtcdStatus

EtcdConfig

(Appears on: EtcdSpec)

EtcdConfig defines parameters associated etcd deployed

Field	Description
`quota` k8s.io/apimachinery/pkg/api/resource.Quantity	(Optional) Quota defines the etcd DB quota.
`defragmentationSchedule` string	(Optional) DefragmentationSchedule defines the cron standard schedule for defragmentation of etcd.
`serverPort` int32	(Optional)
`clientPort` int32	(Optional)
`image` string	(Optional) Image defines the etcd container image and tag
`authSecretRef` Kubernetes core/v1.SecretReference	(Optional)
`metrics` MetricsLevel	(Optional) Metrics defines the level of detail for exported metrics of etcd, specify ‘extensive’ to include histogram metrics.
`resources` Kubernetes core/v1.ResourceRequirements	(Optional) Resources defines the compute Resources required by etcd container. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
`clientUrlTls` TLSConfig	(Optional) ClientUrlTLS contains the ca, server TLS and client TLS secrets for client communication to ETCD cluster
`peerUrlTls` TLSConfig	(Optional) PeerUrlTLS contains the ca and server TLS secrets for peer communication within ETCD cluster Currently, PeerUrlTLS does not require client TLS secrets for gardener implementation of ETCD cluster.
`etcdDefragTimeout` Kubernetes meta/v1.Duration	(Optional) EtcdDefragTimeout defines the timeout duration for etcd defrag call
`heartbeatDuration` Kubernetes meta/v1.Duration	(Optional) HeartbeatDuration defines the duration for members to send heartbeats. The default value is 10s.
`clientService` ClientService	(Optional) ClientService defines the parameters of the client service that a user can specify

EtcdCopyBackupsTask

EtcdCopyBackupsTask is a task for copying etcd backups from a source to a target store.

Field Description

metadata
Kubernetes meta/v1.ObjectMeta Refer to the Kubernetes API documentation for the fields of the metadata field.

spec
EtcdCopyBackupsTaskSpec

`sourceStore` StoreSpec	SourceStore defines the specification of the source object store provider for storing backups.
`targetStore` StoreSpec	TargetStore defines the specification of the target object store provider for storing backups.
`maxBackupAge` uint32	(Optional) MaxBackupAge is the maximum age in days that a backup must have in order to be copied. By default all backups will be copied.
`maxBackups` uint32	(Optional) MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
`waitForFinalSnapshot` WaitForFinalSnapshotSpec	(Optional) WaitForFinalSnapshot defines the parameters for waiting for a final full snapshot before copying backups.

status
EtcdCopyBackupsTaskStatus

EtcdCopyBackupsTaskSpec

(Appears on: EtcdCopyBackupsTask)

EtcdCopyBackupsTaskSpec defines the parameters for the copy backups task.

Field	Description
`sourceStore` StoreSpec	SourceStore defines the specification of the source object store provider for storing backups.
`targetStore` StoreSpec	TargetStore defines the specification of the target object store provider for storing backups.
`maxBackupAge` uint32	(Optional) MaxBackupAge is the maximum age in days that a backup must have in order to be copied. By default all backups will be copied.
`maxBackups` uint32	(Optional) MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
`waitForFinalSnapshot` WaitForFinalSnapshotSpec	(Optional) WaitForFinalSnapshot defines the parameters for waiting for a final full snapshot before copying backups.

EtcdCopyBackupsTaskStatus

(Appears on: EtcdCopyBackupsTask)

EtcdCopyBackupsTaskStatus defines the observed state of the copy backups task.

Field	Description
`conditions` []Condition	(Optional) Conditions represents the latest available observations of an object’s current state.
`observedGeneration` int64	(Optional) ObservedGeneration is the most recent generation observed for this resource.
`lastError` string	(Optional) LastError represents the last occurred error.

EtcdMemberConditionStatus (`string` alias)

(Appears on: EtcdMemberStatus)

EtcdMemberConditionStatus is the status of an etcd cluster member.

EtcdMemberStatus

(Appears on: EtcdStatus)

EtcdMemberStatus holds information about a etcd cluster membership.

Field	Description
`name` string	Name is the name of the etcd member. It is the name of the backing `Pod`.
`id` string	(Optional) ID is the ID of the etcd member.
`role` EtcdRole	(Optional) Role is the role in the etcd cluster, either `Leader` or `Member`.
`status` EtcdMemberConditionStatus	Status of the condition, one of True, False, Unknown.
`reason` string	The reason for the condition’s last transition.
`lastTransitionTime` Kubernetes meta/v1.Time	LastTransitionTime is the last time the condition’s status changed.

EtcdRole (`string` alias)

(Appears on: EtcdMemberStatus)

EtcdRole is the role of an etcd cluster member.

EtcdSpec

(Appears on: Etcd)

EtcdSpec defines the desired state of Etcd

Field	Description
`selector` Kubernetes meta/v1.LabelSelector	selector is a label query over pods that should match the replica count. It must match the pod template’s labels. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors
`labels` map[string]string
`annotations` map[string]string	(Optional)
`etcd` EtcdConfig
`backup` BackupSpec
`sharedConfig` SharedConfig	(Optional)
`schedulingConstraints` SchedulingConstraints	(Optional)
`replicas` int32
`priorityClassName` string	(Optional) PriorityClassName is the name of a priority class that shall be used for the etcd pods.
`storageClass` string	(Optional) StorageClass defines the name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1
`storageCapacity` k8s.io/apimachinery/pkg/api/resource.Quantity	(Optional) StorageCapacity defines the size of persistent volume.
`volumeClaimTemplate` string	(Optional) VolumeClaimTemplate defines the volume claim template to be created

EtcdStatus

(Appears on: Etcd)

EtcdStatus defines the observed state of Etcd.

Field	Description
`observedGeneration` int64	(Optional) ObservedGeneration is the most recent generation observed for this resource.
`etcd` CrossVersionObjectReference	(Optional)
`conditions` []Condition	(Optional) Conditions represents the latest available observations of an etcd’s current state.
`serviceName` string	(Optional) ServiceName is the name of the etcd service.
`lastError` string	(Optional) LastError represents the last occurred error.
`clusterSize` int32	(Optional) Cluster size is the size of the etcd cluster.
`currentReplicas` int32	(Optional) CurrentReplicas is the current replica count for the etcd cluster.
`replicas` int32	(Optional) Replicas is the replica count of the etcd resource.
`readyReplicas` int32	(Optional) ReadyReplicas is the count of replicas being ready in the etcd cluster.
`ready` bool	(Optional) Ready is `true` if all etcd replicas are ready.
`updatedReplicas` int32	(Optional) UpdatedReplicas is the count of updated replicas in the etcd cluster.
`labelSelector` Kubernetes meta/v1.LabelSelector	(Optional) LabelSelector is a label query over pods that should match the replica count. It must match the pod template’s labels.
`members` []EtcdMemberStatus	(Optional) Members represents the members of the etcd cluster
`peerUrlTLSEnabled` bool	(Optional) PeerUrlTLSEnabled captures the state of peer url TLS being enabled for the etcd member(s)

GarbageCollectionPolicy (`string` alias)

(Appears on: BackupSpec)

GarbageCollectionPolicy defines the type of policy for snapshot garbage collection.

LeaderElectionSpec

(Appears on: BackupSpec)

LeaderElectionSpec defines parameters related to the LeaderElection configuration.

Field	Description
`reelectionPeriod` Kubernetes meta/v1.Duration	(Optional) ReelectionPeriod defines the Period after which leadership status of corresponding etcd is checked.
`etcdConnectionTimeout` Kubernetes meta/v1.Duration	(Optional) EtcdConnectionTimeout defines the timeout duration for etcd client connection during leader election.

MetricsLevel (`string` alias)

(Appears on: EtcdConfig)

MetricsLevel defines the level ‘basic’ or ‘extensive’.

SchedulingConstraints

(Appears on: EtcdSpec)

SchedulingConstraints defines the different scheduling constraints that must be applied to the pod spec in the etcd statefulset. Currently supported constraints are Affinity and TopologySpreadConstraints.

Field	Description
`affinity` Kubernetes core/v1.Affinity	(Optional) Affinity defines the various affinity and anti-affinity rules for a pod that are honoured by the kube-scheduler.
`topologySpreadConstraints` []Kubernetes core/v1.TopologySpreadConstraint	(Optional) TopologySpreadConstraints describes how a group of pods ought to spread across topology domains, that are honoured by the kube-scheduler.

SecretReference

(Appears on: TLSConfig)

SecretReference defines a reference to a secret.

Field	Description
`SecretReference` Kubernetes core/v1.SecretReference	(Members of `SecretReference` are embedded into this type.)
`dataKey` string	(Optional) DataKey is the name of the key in the data map containing the credentials.

SharedConfig

(Appears on: EtcdSpec)

SharedConfig defines parameters shared and used by Etcd as well as backup-restore sidecar.

Field	Description
`autoCompactionMode` CompactionMode	(Optional) AutoCompactionMode defines the auto-compaction-mode:‘periodic’ mode or ‘revision’ mode for etcd and embedded-Etcd of backup-restore sidecar.
`autoCompactionRetention` string	(Optional) AutoCompactionRetention defines the auto-compaction-retention length for etcd as well as for embedded-Etcd of backup-restore sidecar.

StorageProvider (`string` alias)

(Appears on: StoreSpec)

StorageProvider defines the type of object store provider for storing backups.

StoreSpec

(Appears on: BackupSpec, EtcdCopyBackupsTaskSpec)

StoreSpec defines parameters related to ObjectStore persisting backups

Field	Description
`container` string	(Optional) Container is the name of the container the backup is stored at.
`prefix` string	Prefix is the prefix used for the store.
`provider` StorageProvider	(Optional) Provider is the name of the backup provider.
`secretRef` Kubernetes core/v1.SecretReference	(Optional) SecretRef is the reference to the secret which used to connect to the backup store.

TLSConfig

(Appears on: BackupSpec, EtcdConfig)

TLSConfig hold the TLS configuration details.

Field	Description
`tlsCASecretRef` SecretReference
`serverTLSSecretRef` Kubernetes core/v1.SecretReference
`clientTLSSecretRef` Kubernetes core/v1.SecretReference	(Optional)

WaitForFinalSnapshotSpec

(Appears on: EtcdCopyBackupsTaskSpec)

WaitForFinalSnapshotSpec defines the parameters for waiting for a final full snapshot before copying backups.

Field	Description
`enabled` bool	Enabled specifies whether to wait for a final full snapshot before copying backups.
`timeout` Kubernetes meta/v1.Duration	(Optional) Timeout is the timeout for waiting for a final full snapshot. When this timeout expires, the copying of backups will be performed anyway. No timeout or 0 means wait forever.

Generated with gen-crd-api-reference-docs

6 - Changing Api

Change the API

This guide provides detailed information on what needs to be done when the API needs to be changed.

etcd-druid API follows the same API conventions and guidelines which Kubernetes defines and adopts. The Kubernetes API Conventions as well as Changing the API topics already provide a good overview and general explanation of the basic concepts behind it. We adhere to the principles laid down by Kubernetes.

Etcd Druid API

The etcd-druid API is defined here.

!!! info The current version of the API is v1alpha1. We are currently working on migration to v1beta1 API.

Changing the API

If there is a need to make changes to the API, then one should do the following:

If new fields are added then ensure that these are added as optional fields. They should have the +optional comment and an omitempty JSON tag should be added against the field.
Ensure that all new fields or changing the existing fields are well documented with doc-strings.
Care should be taken that incompatible API changes should not be made in the same version of the API. If there is a real necessity to introduce a backward incompatible change then a newer version of the API should be created and an API conversion webhook should be put in place to support more than one version of the API.
After the changes to the API are finalized, run make generate to ensure that the changes are also reflected in the CRD.
If necessary, implement or adapt the validation for the API.
If necessary, adapt the examples YAML manifests.
When opening a pull-request, always add a release note informing the end-users of the changes that are coming in.

Removing a Field

If field(s) needs to be removed permanently from the API, then one should ensure the following:

Field should not be directly removed, instead first a deprecation notice should be put which should follow a well-defined deprecation period. Ensure that the release note in the pull-request is properly categorized so that this is easily visible to the end-users and clearly mentiones which field(s) have been deprecated. Clearly suggest a way in which clients need to adapt.
To allow sufficient time to the end-users to adapt to the API changes, deprecated field(s) should only be removed once the deprecation period is over. It is generally recommended that this be done in 2 stages:
- First stage: Remove the code that refers to the deprecated fields. This ensures that the code no longer has dependency on the deprecated field(s).
- Second Stage: Remove the field from the API.

7 - Configure Etcd Druid

etcd-druid CLI Flags

etcd-druid process can be started with the following command line flags.

Command line flags

!!! note “Deprecation Notice”

All existing command line flags have been marked as deprecated. To configure etcd-druid the recommended way is to define a single --config CLI flag which points to a configuration YAML file containing etcd-druid OperatorConfiguration. For backward compatibility both new --config CLI flag and now deprecated existing CLI flags are supported.

Operator Configuration

The recommended way to configure etcd-druid is via OperatorConfiguration.

Flag	Description	Default
config	Path to the mounted operator configuration.	NA

You can see the default values for OperatorConfiguration at api/config/v1alpha1/defaults.go.

!!! note

It is required to define the --config CLI flag and point to the configuration YAML file path.

If you are deploying etcd-druid in a kubernetes cluster, then it is assumed that you have done the following:

Created a ConfigMap which contains the serialized operator configuration.
Mount the ConfigMap to etcd-druid Deployment.

Leader election (Deprecated)

If you wish to setup etcd-druid in high-availability mode then leader election needs to be enabled to ensure that at a time only one replica services the incoming events and does the reconciliation.

Flag	Description	Default
enable-leader-election	Leader election provides the capability to select one replica as a leader where active reconciliation will happen. The other replicas will keep waiting for leadership change and not do active reconciliations.	false
leader-election-id	Name of the k8s lease object that leader election will use for holding the leader lock. By default etcd-druid will use lease resource lock for leader election which is also a natural usecase for leases and is also recommended by k8s.	“druid-leader-election”
leader-election-resource-lock	Deprecated: This flag will be removed in later version of druid. By default `lease.coordination.k8s.io` resources will be used for leader election resource locking for the controller manager.	“leases”

Metrics (Deprecated)

etcd-druid exposes a /metrics endpoint which can be scrapped by tools like Prometheus. If the default metrics endpoint configuration is not suitable then consumers can change it via the following options.

Flag	Description	Default
metrics-bind-address	The IP address that the metrics endpoint binds to	""
metrics-port	The port used for the metrics endpoint	8080
metrics-addr	Duration to wait for after compaction job is completed, to allow Prometheus metrics to be scraped. Deprecated: Please use `--metrics-bind-address` and `--metrics-port` instead	“:8080”

Metrics bind-address is computed by joining the host and port. By default its value is computed as :8080.

!!! tip Ensure that the metrics-port is also reflected in the etcd-druid deployment specification.

Webhook Server (Deprecated)

etcd-druid provides the following CLI flags to configure webhook server. These CLI flags are used to construct a new webhook.Server by configuring Options.

Flag	Description	Default
webhook-server-bind-address	It is the address that the webhook server will listen on.	""
webhook-server-port	Port is the port number that the webhook server will serve.	9443
webhook-server-tls-server-cert-dir	The path to a directory containing the server’s TLS certificate and key (the files must be named tls.crt and tls.key respectively).	/etc/webhook-server-tls

Etcd-Components Webhook (Deprecated)

etcd-druid provisions and manages several Kubernetes resources which we call Etcdcluster components. To ensure that there is no accidental changes done to these managed resources, a webhook is put in place to check manual changes done to any managed etcd-cluster Kubernetes resource. It rejects most of these changes except a few. The details on how to enable the etcd-components webhook, which resources are protected and in which scenarios is the change allowed is documented here.

Following CLI flags are provided to configure the etcd-components webhook:

Flag	Description	Default
enable-etcd-components-webhook	Enable EtcdComponents Webhook to prevent unintended changes to resources managed by etcd-druid.	false
reconciler-service-account	The fully qualified name of the service account used by etcd-druid for reconciling etcd resources. If unspecified, the default service account mounted for etcd-druid will be used	etcd-druid-service-account
etcd-components-exempt-service-accounts	In case there is a need to allow changes to `Etcd` resources from external controllers like `vertical-pod-autoscaler` then one must list the `ServiceAaccount` that is used by each such controller.	""

Reconcilers (Deprecated)

Following set of flags configures the reconcilers running within etcd-druid. To know more about different reconcilers read this document.

Etcd Reconciler (Deprecated)

Flag	Description	Default
etcd-workers	Number of workers spawned for concurrent reconciles of `Etcd` resources.	3
enable-etcd-spec-auto-reconcile	If true then automatically reconciles Etcd Spec. If false, waits for explicit annotation `gardener.cloud/operation: reconcile` to be placed on the Etcd resource to trigger reconcile.	false
disable-etcd-serviceaccount-automount	For each `Etcd` cluster a `ServiceAccount` is created which is used by the `StatefulSet` pods and tied to `Role` via `RoleBinding`. If `false` then pods running as this `ServiceAccount` will have the API token automatically mounted. You can consider disabling it if you wish to use Projected Volumes allowing one to set an `expirationSeconds` on the mounted token for better security. To use projected volumes ensure that you have set relevant kube-apiserver flags. Note: With Kubernetes version >=1.24 projected service account token is the default. This means that we no longer need this flag. Issue #872 has been raised to address this.	false
etcd-status-sync-period	`Etcd.Status` is periodically updated. This interval defines the status sync frequency.	15s
etcd-member-notready-threshold	Threshold after which an etcd member is considered not ready if the status was unknown before. This is currently used to update EtcdMemberConditionStatus.	5m
etcd-member-unknown-threshold	Threshold after which an etcd member is considered unknown. This is currently used to update EtcdMemberConditionStatus.	1m

Compaction Reconciler (Deprecated)

Flag	Description	Default
enable-backup-compaction	Enable automatic compaction of etcd backups	false
compaction-workers	Number of workers that can be spawned for concurrent reconciles for compaction Jobs. The controller creates a backup compaction job if a certain etcd event threshold is reached. If compaction is enabled, the value for this flag must be greater than zero.	3
etcd-events-threshold	Defines the threshold in terms of total number of etcd events before a backup compaction job is triggered.	1000000
active-deadline-duration	Duration after which a running backup compaction job will be terminated.	3h
metrics-scrape-wait-duration	Duration to wait for after compaction job is completed, to allow Prometheus metrics to be scraped.	0s

Etcd Copy-Backup Task & Secret Reconcilers (Deprecated)

Flag	Description	Default
etcd-copy-backups-task-workers	Number of workers spawned for concurrent reconciles for `EtcdCopyBackupTask` resources.	3
secret-workers	Number of workers spawned for concurrent reconciles for secrets.	10

Miscellaneous (Deprecated)

Flag	Description	Default
feature-gates	A set of key=value pairs that describe feature gates for alpha/experimental features. Please check feature-gates for more information.	""
disable-lease-cache	Disable cache for lease.coordination.k8s.io resources.	false

8 - Contribution

Contributors Guide

etcd-druid is an actively maintained project which has organically evolved to be a mature and stable etcd operator. We welcome active participation from the community and to this end this guide serves as a good starting point.

Code of Conduct

All maintainers and contributors must abide by Contributor Covenant. Real progress can only happen in a collaborative environment which fosters mutual respect, openeness and disruptive innovation.

Developer Certificate of Origin

Due to legal reasons, contributors will be asked to accept a Developer Certificate of Origin (DCO) before they submit the first pull request to the IronCore project, this happens in an automated fashion during the submission process. We use the standard DCO text of the Linux Foundation.

License

Your contributions to etcd-druid must be licensed properly:

Code contributions must be licensed under the Apache 2.0 License.
Documentation contributions must be licensed under the Creative Commons Attribution 4.0 International License.

Contributing

etcd-druid use Github to manage reviews of pull requests.

If you are looking to make your first contribution, follow Steps to Contribute.
If you have a trivial fix or improvement, go ahead and create an issue first followed by a pull request.
If you plan to do something more involved, first discuss your ideas by creating an issue. This will avoid unnecessary work and surely give you and us a good deal of inspiration.

Steps to Contribute

If you wish to contribute and have not done that in the past, then first try and filter the list of issues with label exp/beginner. Once you find the issue that interests you, add a comment stating that you would like to work on it. This is to prevent duplicated efforts from contributors on the same issue.
If you have questions about one of the issues please comment on them and one of the maintainers will clarify it.

We kindly ask you to follow the Pull Request Checklist to ensure reviews can happen accordingly.

Issues and Planning

We use GitHub issues to track bugs and enhancement requests. Please provide as much context as possible when you open an issue. The information you provide must be comprehensive enough to understand, reproduce the behavior and find related reports of that issue for the assignee. Therefore, contributors may use but aren’t restricted to the issue template provided by the etcd-druid maintainers.

9 - Controllers

Controllers

etcd-druid is an operator to manage etcd clusters, and follows the Operator pattern for Kubernetes. It makes use of the Kubebuilder framework which makes it quite easy to define Custom Resources (CRs) such as Etcds and EtcdCopyBackupTasks through Custom Resource Definitions (CRDs), and define controllers for these CRDs. etcd-druid uses Kubebuilder to define the Etcd CR and its corresponding controllers.

All controllers that are a part of etcd-druid reside in package internal/controller, as sub-packages.

Etcd-druid currently consists of the following controllers, each having its own responsibility:

etcd : responsible for the reconciliation of the Etcd CR spec, which allows users to run etcd clusters within the specified Kubernetes cluster, and also responsible for periodically updating the Etcd CR status with the up-to-date state of the managed etcd cluster.
compaction : responsible for snapshot compaction.
etcdcopybackupstask : responsible for the reconciliation of the EtcdCopyBackupsTask CR, which helps perform the job of copying snapshot backups from one object store to another.
secret : responsible in making sure Secrets being referenced by Etcd resources are not deleted while in use.

Package Structure

The typical package structure for the controllers that are part of etcd-druid is shown with the compaction controller:

internal/controller/compaction
├── config.go
├── reconciler.go
└── register.go

config.go: contains all the logic for the configuration of the controller, including feature gate activations, CLI flag parsing and validations.
register.go: contains the logic for registering the controller with the etcd-druid controller manager.
reconciler.go: contains the controller reconciliation logic.

Each controller package also contains auxiliary files which are relevant to that specific controller.

Controller Manager

A manager is first created for all controllers that are a part of etcd-druid. The controller manager is responsible for all the controllers that are associated with CRDs. Once the manager is Start()ed, all the controllers that are registered with it are started.

Each controller is built using a controller builder, configured with details such as the type of object being reconciled, owned objects whose owner object is reconciled, event filters (predicates), etc. Predicates are filters which allow controllers to filter which type of events the controller should respond to and which ones to ignore.

The logic relevant to the controller manager like the creation of the controller manager and registering each of the controllers with the manager, is contained in internal/manager/manager.go.

Etcd Controller

The etcd controller is responsible for the reconciliation of the Etcd resource spec and status. It handles the provisioning and management of the etcd cluster. Different components that are required for the functioning of the cluster like Leases, ConfigMaps, and the Statefulset for the etcd cluster are all deployed and managed by the etcd controller.

Additionally, etcd controller also periodically updates the Etcd resource status with the latest available information from the etcd cluster, as well as results and errors from the recent-most reconciliation of the Etcd resource spec.

The etcd controller is essential to the functioning of the etcd cluster and etcd-druid, thus the minimum number of worker threads is 1 (default being 3), controlled by the CLI flag --etcd-workers.

`Etcd` Spec Reconciliation

While building the controller, an event filter is set such that the behavior of the controller, specifically for Etcd update operations, depends on the gardener.cloud/operation: reconcile annotation. This is controlled by the --enable-etcd-spec-auto-reconcile CLI flag, which, if set to false, tells the controller to perform reconciliation only when this annotation is present. If the flag is set to true, the controller will reconcile the etcd cluster anytime the Etcd spec, and thus generation, changes, and the next queued event for it is triggered.

!!! note Creation and deletion of Etcd resources are not affected by the above flag or annotation.

The reason this filter is present is that any disruption in the Etcd resource due to reconciliation (due to changes in the Etcd spec, for example) while workloads are being run would cause unwanted downtimes to the etcd cluster. Hence, any user who wishes to avoid such disruptions, can choose to set the --enable-etcd-spec-auto-reconcile CLI flag to false. An example of this is Gardener’s gardenlet, which reconciles the Etcd resource only during a shoot cluster’s maintenance window.

The controller adds a finalizer to the Etcd resource in order to ensure that it does not get deleted until all dependent resources managed by etcd-druid, aka managed components, are properly cleaned up. Only the etcd controller can delete a resource once it adds finalizers to it. This ensures that the proper deletion flow steps are followed while deleting the resource. During deletion flow, managed components are deleted in parallel.

`Etcd` Status Updates

The Etcd resource status is updated periodically by etcd controller, the interval for which is determined by the CLI flag --etcd-status-sync-period.

Status fields of the Etcd resource such as LastOperation, LastErrors and ObservedGeneration, are updated to reflect the result of the recent reconciliation of the Etcd resource spec.

LastOperation holds information about the last operation performed on the etcd cluster, indicated by fields Type, State, Description and LastUpdateTime. Additionally, a field RunID indicates the unique ID assigned to the specific reconciliation run, to allow for better debugging of issues.
LastErrors is a slice of errors encountered by the last reconciliation run. Each error consists of fields Code to indicate the custom etcd-druid error code for the error, a human-readable Description, and the ObservedAt time when the error was seen.
ObservedGeneration indicates the latest generation of the Etcd resource that etcd-druid has “observed” and consequently reconciled. It helps identify whether a change in the Etcd resource spec was acted upon by druid or not.

Status fields of the Etcd resource which correspond to the StatefulSet like CurrentReplicas, ReadyReplicas and Replicas are updated to reflect those of the StatefulSet by the controller.

Status fields related to the etcd cluster itself, such as Members, PeerUrlTLSEnabled and Ready are updated as follows:

Cluster Membership: The controller updates the information about etcd cluster membership like Role, Status, Reason, LastTransitionTime and identifying information like the Name and ID. For the Status field, the member is checked for the Ready condition, where the member can be in Ready, NotReady and Unknown statuses.

Etcd resource conditions are indicated by status field Conditions. The condition checks that are currently performed are:

AllMembersReady: indicates readiness of all members of the etcd cluster.
Ready: indicates overall readiness of the etcd cluster in serving traffic.
BackupReady: indicates health of the etcd backups, i.e., whether etcd backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
DataVolumesReady: indicates health of the persistent volumes containing the etcd data.

Compaction Controller

The compaction controller deploys the snapshot compaction job whenever required. To understand the rationale behind this controller, please read snapshot-compaction.md. The controller watches the number of events accumulated as part of delta snapshots in the etcd cluster’s backups, and triggers a snapshot compaction when the number of delta events crosses the set threshold, which is configurable through the --etcd-events-threshold CLI flag (1M events by default).

The controller watches for changes in snapshot Leases associated with Etcd resources. It checks the full and delta snapshot Leases and calculates the difference in events between the latest delta snapshot and the previous full snapshot, and initiates the compaction job if the event threshold is crossed.

The number of worker threads for the compaction controller needs to be greater than or equal to 0 (default 3), controlled by the CLI flag --compaction-workers. This is unlike other controllers which need at least one worker thread for the proper functioning of etcd-druid as snapshot compaction is not a core functionality for the etcd clusters to be deployed. The compaction controller should be explicitly enabled by the user, through the --enable-backup-compaction CLI flag.

EtcdCopyBackupsTask Controller

The etcdcopybackupstask controller is responsible for deploying the etcdbrctl copy command as a job. This controller reacts to create/update events arising from EtcdCopyBackupsTask resources, and deploys the EtcdCopyBackupsTask job with source and target backup storage providers as arguments, which are derived from source and target bucket secrets referenced by the EtcdCopyBackupsTask resource.

The number of worker threads for the etcdcopybackupstask controller needs to be greater than or equal to 0 (default being 3), controlled by the CLI flag --etcd-copy-backups-task-workers. This is unlike other controllers who need at least one worker thread for the proper functioning of etcd-druid as EtcdCopyBackupsTask is not a core functionality for the etcd clusters to be deployed.

Secret Controller

The secret controller’s primary responsibility is to add a finalizer on Secrets referenced by the Etcd resource. The secret controller is registered for Secrets, and the controller keeps a watch on the Etcd CR. This finalizer is added to ensure that Secrets which are referenced by the Etcd CR aren’t deleted while still being used by the Etcd resource.

Events arising from the Etcd resource are mapped to a list of Secrets such as backup and TLS secrets that are referenced by the Etcd resource, and are enqueued into the request queue, which the reconciler then acts on.

The number of worker threads for the secret controller must be at least 1 (default being 10) for this core controller, controlled by the CLI flag --secret-workers, since the referenced TLS and infrastructure access secrets are essential to the proper functioning of the etcd cluster.

10 - DEP Title

DEP-NN: Your short, descriptive title

Summary

Motivation

Goals

Non-Goals

Proposal

Alternatives

11 - Dependency Management

Dependency Management

We use Go Modules for dependency management. In order to add a new package dependency to the project, you can perform go get <package@version> or edit the go.mod file and append the package along with the version you want to use.

Organize Dependencies

Unfortunately go does not differentiate between dev and test dependencies. It becomes cleaner to organize dev and test dependencies in their respective require clause which gives a clear view on existing set of dependencies. The goal is to keep the dependencies to a minimum and only add a dependency when absolutely required.

Updating Dependencies

The Makefile contains a rule called tidy which performs go mod tidy which ensures that the go.mod file matches the source code in the module. It adds any missing module requirements necessary to build the current module’s packages and dependencies, and it removes requirements on modules that don’t provide any relevant packages. It also adds any missing entries to go.sum and removes unnecessary entries.

make tidy

!!! warning Make sure that you test the code after you have updated the dependencies!

12 - Etcd Cluster Components

Etcd Cluster Components

For every Etcd cluster that is provisioned by etcd-druid it deploys a set of resources. Following sections provides information and code reference to each such resource.

StatefulSet

StatefulSet is the primary kubernetes resource that gets provisioned for an etcd cluster.

Replicas for the StatefulSet are derived from Etcd.Spec.Replicas in the custom resource.
Each pod comprises two containers:
- etcd-wrapper : This is the main container which runs an etcd process.
- etcd-backup-restore : This is a side-container which does the following:
  - Orchestrates the initialization of etcd. This includes validation of any existing etcd data directory, restoration in case of corrupt etcd data directory files for a single-member etcd cluster.
  - Periodically renews member lease.
  - Optionally takes schedule and threshold based delta and full snapshots and pushes them to a configured object store.
  - Orchestrates scheduled etcd-db defragmentation.
  NOTE: This is not a complete list of functionalities offered out of etcd-backup-restore.

Code reference: StatefulSet-Component

For detailed information on each container you can visit etcd-wrapper and etcd-backup-restore repositories.

ConfigMap

Every etcd member requires configuration with which it must be started. etcd-druid creates a ConfigMap which gets mounted onto the etcd-backup-restore container. etcd-backup-restore container will modify the etcd configuration and serve it to the etcd-wrapper container upon request.

Code reference: ConfigMap-Component

PodDisruptionBudget

An etcd cluster requires quorum for all write operations. Clients can additionally configure quorum based reads as well to ensure linearizable reads (kube-apiserver’s etcd client is configured for linearizable reads and writes). In a cluster of size 3, only 1 member failure is tolerated. Failure tolerance for an etcd cluster with replicas n is computed as (n-1)/2.

To ensure that etcd pods are not evicted more than its failure tolerance, etcd-druid creates a PodDisruptionBudget.

!!! note For a single node etcd cluster a PodDisruptionBudget will be created, however pdb.spec.minavailable is set to 0 effectively disabling it.

Code reference: PodDisruptionBudget-Component

ServiceAccount

etcd-backup-restore container running as a side-car in every etcd-member, requires permissions to access resources like Lease, StatefulSet etc. A dedicated ServiceAccount is created per Etcd cluster for this purpose.

Code reference: ServiceAccount-Component

Role & RoleBinding

etcd-backup-restore container running as a side-car in every etcd-member, requires permissions to access resources like Lease, StatefulSet etc. A dedicated Role and RoleBinding is created and linked to the ServiceAccount created per Etcd cluster.

Code reference: Role-Component & RoleBinding-Component

Client & Peer Service

To enable clients to connect to an etcd cluster a ClusterIP Client Service is created. To enable etcd members to talk to each other(for discovery, leader-election, raft consensus etc.) etcd-druid also creates a Headless Service.

Code reference: Client-Service-Component & Peer-Service-Component

Member Lease

Every member in an Etcd cluster has a dedicated Lease that gets created which signifies that the member is alive. It is the responsibility of the etcd-backup-store side-car container to periodically renew the lease.

!!! note Today the lease object is also used to indicate the member-ID and the role of the member in an etcd cluster. Possible roles are Leader, Member(which denotes that this is a member but not a leader). This will change in the future with EtcdMember resource.

Code reference: Member-Lease-Component

Delta & Full Snapshot Leases

One of the responsibilities of etcd-backup-restore container is to take periodic or threshold based snapshots (delta and full) of the etcd DB. Today etcd-backup-restore communicates the end-revision of the latest full/delta snapshots to etcd-druid operator via leases.

etcd-druid creates two Lease resources one for delta and another for full snapshot. This information is used by the operator to trigger snapshot-compaction jobs. Snapshot leases are also used to derive the health of backups which gets updated in the Status subresource of every Etcd resource.

In future these leases will be replaced by EtcdMember resource.

Code reference: Snapshot-Lease-Component

13 - Etcd Cluster Resource Protection

Etcd Cluster Resource Protection

etcd-druid provisions and manages kubernetes resources (a.k.a components) for each Etcd cluster. To ensure that each component’s specification is in line with the configured attributes defined in Etcd custom resource and to protect unintended changes done to any of these managed components a Validating Webhook is employed.

Etcd Components Webhook is the validating webhook which prevents unintended UPDATE and DELETE operations on all managed resources. Following sections describe what is prohibited and in which specific conditions the changes are permitted.

Configure Etcd Components Webhook

Prerequisite to enable the validation webhook is to configure the Webhook Server. Additionally you need to enable the Etcd Components validating webhook and optionally configure other options. You can look at all the options here.

What is allowed?

Modifications to managed resources under the following circumstances will be allowed:

Create and Connect operations are allowed and no validation is done.
Changes to a kubernetes resource (e.g. StatefulSet, ConfigMap etc) not managed by etcd-druid are allowed.
Changes to a resource whose Group-Kind is amongst the resources managed by etcd-druid but does not have a parent Etcd resource are allowed.
It is possible that an operator wishes to explicitly disable etcd-component protection. This can be done by setting druid.gardener.cloud/disable-etcd-component-protection annotation on an Etcd resource. If this annotation is present then changes to managed components will be allowed.
If Etcd resource has a deletion timestamp set indicating that it is marked for deletion and is awaiting etcd-druid to delete all managed resources then deletion requests for all managed resources for this etcd cluster will be allowed if:
- The deletion request has come from a ServiceAccount associated to etcd-druid. If not explicitly specified via --reconciler-service-account then a default-reconciler-service-account will be assumed.
- The deletion request has come from a ServiceAccount configured via --etcd-components-webhook-exempt-service-accounts.
Lease objects are periodically updated by each etcd member pod. A single ServiceAccount is created for all members. Update operation on Lease objects from this ServiceAccount is allowed.
If an active reconciliation is in-progress then only allow operations that are initiated by etcd-druid.
If no active reconciliation is currently in-progress, then allow updates to managed resource from ServiceAccounts configured via --etcd-components-webhook-exempt-service-accounts.

14 - Etcd Druid Api

API Reference

Packages

config.druid.gardener.cloud/v1alpha1

ClientConnectionConfiguration

ClientConnectionConfiguration defines the configuration for constructing a client.Client to connect to k8s kube-apiserver.

Appears in:

OperatorConfiguration

Field	Description	Default	Validation
`qps` float	QPS controls the number of queries per second allowed for a connection. Setting this to a negative value will disable client-side rate limiting.
`burst` integer	Burst allows extra queries to accumulate when a client is exceeding its rate.
`contentType` string	ContentType is the content type used when sending data to the server from this client.
`acceptContentTypes` string	AcceptContentTypes defines the Accept header sent by clients when connecting to the server, overriding the default value of ‘application/json’. This field will control all connections to the server used by a particular client.

CompactionControllerConfiguration

CompactionControllerConfiguration defines the configuration for the compaction controller.

Appears in:

ControllerConfiguration

Field	Description	Default	Validation
`enabled` boolean	Enabled specifies whether backup compaction should be enabled.
`concurrentSyncs` integer	ConcurrentSyncs is the max number of concurrent workers that can be run, each worker servicing a reconcile request.
`eventsThreshold` integer	EventsThreshold denotes total number of etcd events to be reached upon which a backup compaction job is triggered.
`activeDeadlineDuration` Duration	ActiveDeadlineDuration is the duration after which a running compaction job will be killed.
`metricsScrapeWaitDuration` Duration	MetricsScrapeWaitDuration is the duration to wait for after compaction job is completed, to allow Prometheus metrics to be scraped

ControllerConfiguration

ControllerConfiguration defines the configuration for the controllers.

Appears in:

OperatorConfiguration

Field	Description	Default	Validation
`disableLeaseCache` boolean	DisableLeaseCache disables the cache for lease.coordination.k8s.io resources. Deprecated: This field will be eventually removed. It is recommended to not use this. It has only been introduced to allow for backward compatibility with the old CLI flags.
`etcd` EtcdControllerConfiguration	Etcd is the configuration for the Etcd controller.
`compaction` CompactionControllerConfiguration	Compaction is the configuration for the compaction controller.
`etcdCopyBackupsTask` EtcdCopyBackupsTaskControllerConfiguration	EtcdCopyBackupsTask is the configuration for the EtcdCopyBackupsTask controller.
`secret` SecretControllerConfiguration	Secret is the configuration for the Secret controller.

EtcdComponentProtectionWebhookConfiguration

EtcdComponentProtectionWebhookConfiguration defines the configuration for EtcdComponentProtection webhook. NOTE: At least one of ReconcilerServiceAccountFQDN or ServiceAccountInfo must be set. It is recommended to switch to ServiceAccountInfo.

Appears in:

WebhookConfiguration

Field	Description	Default	Validation
`enabled` boolean	Enabled indicates whether the EtcdComponentProtection webhook is enabled.
`reconcilerServiceAccountFQDN` string	ReconcilerServiceAccountFQDN is the FQDN of the reconciler service account used by the etcd-druid operator. Deprecated: Please use ServiceAccountInfo instead and ensure that both Name and Namespace are set via projected volumes and downward API in the etcd-druid deployment spec.
`serviceAccountInfo` ServiceAccountInfo	ServiceAccountInfo contains paths to gather etcd-druid service account information.
`exemptServiceAccounts` string array	ExemptServiceAccounts is a list of service accounts that are exempt from Etcd Components Webhook checks.

EtcdControllerConfiguration

EtcdControllerConfiguration defines the configuration for the Etcd controller.

Appears in:

ControllerConfiguration

Field	Description	Default	Validation
`concurrentSyncs` integer	ConcurrentSyncs is the max number of concurrent workers that can be run, each worker servicing a reconcile request.
`enableEtcdSpecAutoReconcile` boolean	EnableEtcdSpecAutoReconcile controls how the Etcd Spec is reconciled. If set to true, then any change in Etcd spec will automatically trigger a reconciliation of the Etcd resource. If set to false, then an operator needs to explicitly set gardener.cloud/operation=reconcile annotation on the Etcd resource to trigger reconciliation of the Etcd spec. NOTE: Decision to enable it should be carefully taken as spec updates could potentially result in rolling update of the StatefulSet which will cause a minor downtime for a single node etcd cluster and can potentially cause a downtime for a multi-node etcd cluster.
`disableEtcdServiceAccountAutomount` boolean	DisableEtcdServiceAccountAutomount controls the auto-mounting of service account token for etcd StatefulSets.
`etcdStatusSyncPeriod` Duration	EtcdStatusSyncPeriod is the duration after which an event will be re-queued ensuring etcd status synchronization.
`etcdMember` EtcdMemberConfiguration	EtcdMember holds configuration related to etcd members.

EtcdCopyBackupsTaskControllerConfiguration

EtcdCopyBackupsTaskControllerConfiguration defines the configuration for the EtcdCopyBackupsTask controller.

Appears in:

ControllerConfiguration

Field	Description	Default	Validation
`enabled` boolean	Enabled specifies whether EtcdCopyBackupsTaskController should be enabled.
`concurrentSyncs` integer	ConcurrentSyncs is the max number of concurrent workers that can be run, each worker servicing a reconcile request.

EtcdMemberConfiguration

EtcdMemberConfiguration holds configuration related to etcd members.

Appears in:

EtcdControllerConfiguration

Field	Description	Default	Validation
`notReadyThreshold` Duration	NotReadyThreshold is the duration after which an etcd member’s state is considered `NotReady`.
`unknownThreshold` Duration	UnknownThreshold is the duration after which an etcd member’s state is considered `Unknown`.

LeaderElectionConfiguration

LeaderElectionConfiguration defines the configuration for the leader election. It should be enabled when you deploy etcd-druid in HA mode. For single replica etcd-druid deployments it will not really serve any purpose.

Appears in:

OperatorConfiguration

Field	Description	Default	Validation
`enabled` boolean	Enabled specifies whether leader election is enabled. Set this to true when running replicated instances of the operator for high availability.
`leaseDuration` Duration	LeaseDuration is the duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of the occupied but un-renewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled.
`renewDeadline` Duration	RenewDeadline is the interval between attempts by the acting leader to renew its leadership before it stops leading. This must be less than or equal to the lease duration. This is only applicable if leader election is enabled.
`retryPeriod` Duration	RetryPeriod is the duration leader elector clients should wait between attempting acquisition and renewal of leadership. This is only applicable if leader election is enabled.
`resourceLock` string	ResourceLock determines which resource lock to use for leader election. This is only applicable if leader election is enabled.
`resourceName` string	ResourceName determines the name of the resource that leader election will use for holding the leader lock. This is only applicable if leader election is enabled.

LogConfiguration

LogConfiguration contains the configuration for logging.

Appears in:

OperatorConfiguration

Field	Description	Default	Validation
`logLevel` LogLevel	LogLevel is the level/severity for the logs. Must be one of [info,debug,error].
`logFormat` LogFormat	LogFormat is the output format for the logs. Must be one of [text,json].

LogFormat

Underlying type: string

LogFormat is the format of the log.

Appears in:

LogConfiguration

Field	Description
`json`	LogFormatJSON is the JSON log format.
`text`	LogFormatText is the text log format.

LogLevel

Underlying type: string

LogLevel represents the level for logging.

Appears in:

LogConfiguration

Field	Description
`debug`	LogLevelDebug is the debug log level, i.e. the most verbose.
`info`	LogLevelInfo is the default log level.
`error`	LogLevelError is a log level where only errors are logged.

SecretControllerConfiguration

SecretControllerConfiguration defines the configuration for the Secret controller.

Appears in:

ControllerConfiguration

Field	Description	Default	Validation
`concurrentSyncs` integer	ConcurrentSyncs is the max number of concurrent workers that can be run, each worker servicing a reconcile request.

Server

Server contains information for HTTP(S) server configuration.

Appears in:

Field	Description	Default	Validation
`bindAddress` string	BindAddress is the IP address on which to listen for the specified port.
`port` integer	Port is the port on which to serve unsecured, unauthenticated access.

ServerConfiguration

ServerConfiguration contains the server configurations.

Appears in:

OperatorConfiguration

Field	Description	Default	Validation
`webhooks` TLSServer	Webhooks is the configuration for the TLS webhook server.
`metrics` Server	Metrics is the configuration for serving the metrics endpoint.

ServiceAccountInfo

ServiceAccountInfo contains paths to gather etcd-druid service account information. Usually downward API and projected volumes are used in the deployment specification of etcd-druid to provide this information as mounted volume files.

Appears in:

EtcdComponentProtectionWebhookConfiguration

Field	Description	Default	Validation
`name` string	Name is the name of the service account associated with etcd-druid deployment.
`namespace` string	Namespace is the namespace in which the service account has been deployed. Usually this information is usually available at /var/run/secrets/kubernetes.io/serviceaccount/namespace. However, if automountServiceAccountToken is set to false then this file will not be available.

TLSServer

TLSServer is the configuration for a TLS enabled server.

Appears in:

ServerConfiguration

Field	Description	Default	Validation
`bindAddress` string	BindAddress is the IP address on which to listen for the specified port.
`port` integer	Port is the port on which to serve unsecured, unauthenticated access.
`serverCertDir` string	ServerCertDir is the path to a directory containing the server’s TLS certificate and key (the files must be named tls.crt and tls.key respectively).

WebhookConfiguration

WebhookConfiguration defines the configuration for admission webhooks.

Appears in:

OperatorConfiguration

Field	Description	Default	Validation
`etcdComponentProtection` EtcdComponentProtectionWebhookConfiguration	EtcdComponentProtection is the configuration for EtcdComponentProtection webhook.

druid.gardener.cloud/v1alpha1

Package v1alpha1 contains API Schema definitions for the druid v1alpha1 API group

Resource Types

BackupSpec

BackupSpec defines parameters associated with the full and delta snapshots of etcd.

Appears in:

EtcdSpec

Field	Description	Validation
`port` integer	Port define the port on which etcd-backup-restore server will be exposed.
`tls` TLSConfig
`image` string	Image defines the etcd container image and tag
`store` StoreSpec	Store defines the specification of object store provider for storing backups.
`resources` ResourceRequirements	Resources defines compute Resources required by backup-restore container. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
`compactionResources` ResourceRequirements	CompactionResources defines compute Resources required by compaction job. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
`fullSnapshotSchedule` string	FullSnapshotSchedule defines the cron standard schedule for full snapshots.	Pattern: ^(\\|[1-5]?[0-9]\|[1-5]?[0-9]-[1-5]?[0-9]\|(?:[1-9]\|[1-4][0-9]\|5[0-9])\/(?:[1-9]\|[1-4][0-9]\|5[0-9]\|60)\|\\/(?:[1-9]\|[1-4][0-9]\|5[0-9]\|60))\s+(\\|[0-9]\|1[0-9]\|2[0-3]\|[0-9]-(?:[0-9]\|1[0-9]\|2[0-3])\|1[0-9]-(?:1[0-9]\|2[0-3])\|2[0-3]-2[0-3]\|(?:[1-9]\|1[0-9]\|2[0-3])\/(?:[1-9]\|1[0-9]\|2[0-4])\|\\/(?:[1-9]\|1[0-9]\|2[0-4]))\s+(\\|[1-9]\|[12][0-9]\|3[01]\|[1-9]-(?:[1-9]\|[12][0-9]\|3[01])\|[12][0-9]-(?:[12][0-9]\|3[01])\|3[01]-3[01]\|(?:[1-9]\|[12][0-9]\|30)\/(?:[1-9]\|[12][0-9]\|3[01])\|\\/(?:[1-9]\|[12][0-9]\|3[01]))\s+(\\|[1-9]\|1[0-2]\|[1-9]-(?:[1-9]\|1[0-2])\|1[0-2]-1[0-2]\|(?:[1-9]\|1[0-2])\/(?:[1-9]\|1[0-2])\|\\/(?:[1-9]\|1[0-2]))\s+(\\|[1-7]\|[1-6]-[1-7]\|[1-6]\/[1-7]\|\\/[1-7])$
`garbageCollectionPolicy` GarbageCollectionPolicy	GarbageCollectionPolicy defines the policy for garbage collecting old backups	Enum: [Exponential LimitBased]
`maxBackupsLimitBasedGC` integer	MaxBackupsLimitBasedGC defines the maximum number of Full snapshots to retain in Limit Based GarbageCollectionPolicy All full snapshots beyond this limit will be garbage collected.
`garbageCollectionPeriod` Duration	GarbageCollectionPeriod defines the period for garbage collecting old backups	Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string
`deltaSnapshotPeriod` Duration	DeltaSnapshotPeriod defines the period after which delta snapshots will be taken	Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string
`deltaSnapshotMemoryLimit` Quantity	DeltaSnapshotMemoryLimit defines the memory limit after which delta snapshots will be taken
`deltaSnapshotRetentionPeriod` Duration	DeltaSnapshotRetentionPeriod defines the duration for which delta snapshots will be retained, excluding the latest snapshot set. The value should be a string formatted as a duration (e.g., ‘1s’, ‘2m’, ‘3h’, ‘4d’)	Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string
`compression` CompressionSpec	SnapshotCompression defines the specification for compression of Snapshots.
`enableProfiling` boolean	EnableProfiling defines if profiling should be enabled for the etcd-backup-restore-sidecar
`etcdSnapshotTimeout` Duration	EtcdSnapshotTimeout defines the timeout duration for etcd FullSnapshot operation	Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string
`leaderElection` LeaderElectionSpec	LeaderElection defines parameters related to the LeaderElection configuration.

ClientService

ClientService defines the parameters of the client service that a user can specify

Appears in:

EtcdConfig

Field	Description	Validation
`annotations` object (keys:string, values:string)	Annotations specify the annotations that should be added to the client service
`labels` object (keys:string, values:string)	Labels specify the labels that should be added to the client service
`trafficDistribution` string	TrafficDistribution defines the traffic distribution preference that should be added to the client service. More info: https://kubernetes.io/docs/reference/networking/virtual-ips/#traffic-distribution	Enum: [PreferClose]

CompactionMode

Underlying type: string

CompactionMode defines the auto-compaction-mode: ‘periodic’ or ‘revision’. ‘periodic’ for duration based retention and ‘revision’ for revision number based retention.

Validation:

Enum: [periodic revision]

Appears in:

SharedConfig

Field	Description
`periodic`	Periodic is a constant to set auto-compaction-mode ‘periodic’ for duration based retention.
`revision`	Revision is a constant to set auto-compaction-mode ‘revision’ for revision number based retention.

CompressionPolicy

Underlying type: string

CompressionPolicy defines the type of policy for compression of snapshots.

Validation:

Enum: [gzip lzw zlib]

Appears in:

CompressionSpec

Field	Description
`gzip`	GzipCompression is constant for gzip compression policy.
`lzw`	LzwCompression is constant for lzw compression policy.
`zlib`	ZlibCompression is constant for zlib compression policy.

CompressionSpec

CompressionSpec defines parameters related to compression of Snapshots(full as well as delta).

Appears in:

BackupSpec

Field	Description	Default	Validation
`enabled` boolean
`policy` CompressionPolicy			Enum: [gzip lzw zlib]

Condition

Condition holds the information about the state of a resource.

Appears in:

Field	Description	Default	Validation
`type` ConditionType	Type of the Etcd condition.
`status` ConditionStatus	Status of the condition, one of True, False, Unknown.
`lastTransitionTime` Time	Last time the condition transitioned from one status to another.
`lastUpdateTime` Time	Last time the condition was updated.
`reason` string	The reason for the condition’s last transition.
`message` string	A human-readable message indicating details about the transition.

ConditionStatus

Underlying type: string

ConditionStatus is the status of a condition.

Appears in:

Condition

Field	Description
`True`	ConditionTrue means a resource is in the condition.
`False`	ConditionFalse means a resource is not in the condition.
`Unknown`	ConditionUnknown means Gardener can’t decide if a resource is in the condition or not.
`Progressing`	ConditionProgressing means the condition was seen true, failed but stayed within a predefined failure threshold. In the future, we could add other intermediate conditions, e.g. ConditionDegraded. Deprecated: Will be removed in the future since druid conditions will be replaced by metav1.Condition which has only three status options: True, False, Unknown.
`ConditionCheckError`	ConditionCheckError is a constant for a reason in condition. Deprecated: Will be removed in the future since druid conditions will be replaced by metav1.Condition which has only three status options: True, False, Unknown.

ConditionType

Underlying type: string

ConditionType is the type of condition.

Appears in:

Condition

Field	Description
`Ready`	ConditionTypeReady is a constant for a condition type indicating that the etcd cluster is ready.
`AllMembersReady`	ConditionTypeAllMembersReady is a constant for a condition type indicating that all members of the etcd cluster are ready.
`AllMembersUpdated`	ConditionTypeAllMembersUpdated is a constant for a condition type indicating that all members of the etcd cluster have been updated with the desired spec changes.
`BackupReady`	ConditionTypeBackupReady is a constant for a condition type indicating that the etcd backup is ready.
`DataVolumesReady`	ConditionTypeDataVolumesReady is a constant for a condition type indicating that the etcd data volumes are ready.
`Succeeded`	EtcdCopyBackupsTaskSucceeded is a condition type indicating that a EtcdCopyBackupsTask has succeeded.
`Failed`	EtcdCopyBackupsTaskFailed is a condition type indicating that a EtcdCopyBackupsTask has failed.

CrossVersionObjectReference

CrossVersionObjectReference contains enough information to let you identify the referred resource.

Appears in:

EtcdStatus

Field	Description	Default	Validation
`kind` string	Kind of the referent
`name` string	Name of the referent
`apiVersion` string	API version of the referent

ErrorCode

Underlying type: string

ErrorCode is a string alias representing an error code that identifies an error.

Appears in:

LastError

Etcd

Etcd is the Schema for the etcds API

Field	Description	Default	Validation
`apiVersion` string	`druid.gardener.cloud/v1alpha1`
`kind` string	`Etcd`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` EtcdSpec
`status` EtcdStatus

EtcdConfig

EtcdConfig defines the configuration for the etcd cluster to be deployed.

Appears in:

EtcdSpec

Field	Description	Validation
`quota` Quantity	Quota defines the etcd DB quota.
`snapshotCount` integer	SnapshotCount defines the number of applied Raft entries to hold in-memory before compaction. More info: https://etcd.io/docs/v3.4/op-guide/maintenance/#raft-log-retention
`defragmentationSchedule` string	DefragmentationSchedule defines the cron standard schedule for defragmentation of etcd.	Pattern: ^(\\|[1-5]?[0-9]\|[1-5]?[0-9]-[1-5]?[0-9]\|(?:[1-9]\|[1-4][0-9]\|5[0-9])\/(?:[1-9]\|[1-4][0-9]\|5[0-9]\|60)\|\\/(?:[1-9]\|[1-4][0-9]\|5[0-9]\|60))\s+(\\|[0-9]\|1[0-9]\|2[0-3]\|[0-9]-(?:[0-9]\|1[0-9]\|2[0-3])\|1[0-9]-(?:1[0-9]\|2[0-3])\|2[0-3]-2[0-3]\|(?:[1-9]\|1[0-9]\|2[0-3])\/(?:[1-9]\|1[0-9]\|2[0-4])\|\\/(?:[1-9]\|1[0-9]\|2[0-4]))\s+(\\|[1-9]\|[12][0-9]\|3[01]\|[1-9]-(?:[1-9]\|[12][0-9]\|3[01])\|[12][0-9]-(?:[12][0-9]\|3[01])\|3[01]-3[01]\|(?:[1-9]\|[12][0-9]\|30)\/(?:[1-9]\|[12][0-9]\|3[01])\|\\/(?:[1-9]\|[12][0-9]\|3[01]))\s+(\\|[1-9]\|1[0-2]\|[1-9]-(?:[1-9]\|1[0-2])\|1[0-2]-1[0-2]\|(?:[1-9]\|1[0-2])\/(?:[1-9]\|1[0-2])\|\\/(?:[1-9]\|1[0-2]))\s+(\\|[1-7]\|[1-6]-[1-7]\|[1-6]\/[1-7]\|\\/[1-7])$
`serverPort` integer
`clientPort` integer
`wrapperPort` integer
`image` string	Image defines the etcd container image and tag
`authSecretRef` SecretReference
`metrics` MetricsLevel	Metrics defines the level of detail for exported metrics of etcd, specify ’extensive’ to include histogram metrics.	Enum: [basic extensive]
`resources` ResourceRequirements	Resources defines the compute Resources required by etcd container. More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
`clientUrlTls` TLSConfig	ClientUrlTLS contains the ca, server TLS and client TLS secrets for client communication to ETCD cluster
`peerUrlTls` TLSConfig	PeerUrlTLS contains the ca and server TLS secrets for peer communication within ETCD cluster Currently, PeerUrlTLS does not require client TLS secrets for gardener implementation of ETCD cluster.
`etcdDefragTimeout` Duration	EtcdDefragTimeout defines the timeout duration for etcd defrag call	Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string
`heartbeatDuration` Duration	HeartbeatDuration defines the duration for members to send heartbeats. The default value is 10s.	Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?$` Type: string
`clientService` ClientService	ClientService defines the parameters of the client service that a user can specify

EtcdCopyBackupsTask

EtcdCopyBackupsTask is a task for copying etcd backups from a source to a target store.

Field	Description	Default	Validation
`apiVersion` string	`druid.gardener.cloud/v1alpha1`
`kind` string	`EtcdCopyBackupsTask`
`metadata` ObjectMeta	Refer to Kubernetes API documentation for fields of `metadata`.
`spec` EtcdCopyBackupsTaskSpec
`status` EtcdCopyBackupsTaskStatus

EtcdCopyBackupsTaskSpec

EtcdCopyBackupsTaskSpec defines the parameters for the copy backups task.

Appears in:

EtcdCopyBackupsTask

Field	Description	Default	Validation
`podLabels` object (keys:string, values:string)	PodLabels is a set of labels that will be added to pod(s) created by the copy backups task.
`sourceStore` StoreSpec	SourceStore defines the specification of the source object store provider for storing backups.
`targetStore` StoreSpec	TargetStore defines the specification of the target object store provider for storing backups.
`maxBackupAge` integer	MaxBackupAge is the maximum age in days that a backup must have in order to be copied. By default, all backups will be copied.
`maxBackups` integer	MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
`waitForFinalSnapshot` WaitForFinalSnapshotSpec	WaitForFinalSnapshot defines the parameters for waiting for a final full snapshot before copying backups.

EtcdCopyBackupsTaskStatus

EtcdCopyBackupsTaskStatus defines the observed state of the copy backups task.

Appears in:

EtcdCopyBackupsTask

Field	Description	Default	Validation
`conditions` Condition array	Conditions represents the latest available observations of an object’s current state.
`observedGeneration` integer	ObservedGeneration is the most recent generation observed for this resource.
`lastError` string	LastError represents the last occurred error.

EtcdMemberConditionStatus

Underlying type: string

EtcdMemberConditionStatus is the status of an etcd cluster member.

Appears in:

EtcdMemberStatus

Field	Description
`Ready`	EtcdMemberStatusReady indicates that the etcd member is ready.
`NotReady`	EtcdMemberStatusNotReady indicates that the etcd member is not ready.
`Unknown`	EtcdMemberStatusUnknown indicates that the status of the etcd member is unknown.

EtcdMemberStatus

EtcdMemberStatus holds information about etcd cluster membership.

Appears in:

EtcdStatus

Field	Description	Default	Validation
`name` string	Name is the name of the etcd member. It is the name of the backing `Pod`.
`id` string	ID is the ID of the etcd member.
`role` EtcdRole	Role is the role in the etcd cluster, either `Leader` or `Member`.
`status` EtcdMemberConditionStatus	Status of the condition, one of True, False, Unknown.
`reason` string	The reason for the condition’s last transition.
`lastTransitionTime` Time	LastTransitionTime is the last time the condition’s status changed.

EtcdRole

Underlying type: string

EtcdRole is the role of an etcd cluster member.

Appears in:

EtcdMemberStatus

Field	Description
`Leader`	EtcdRoleLeader describes the etcd role `Leader`.
`Member`	EtcdRoleMember describes the etcd role `Member`.

EtcdSpec

EtcdSpec defines the desired state of Etcd

Appears in:

Etcd

Field	Description	Default	Validation
`selector` LabelSelector	selector is a label query over pods that should match the replica count. It must match the pod template’s labels. More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors Deprecated: this field will be removed in the future.
`labels` object (keys:string, values:string)
`annotations` object (keys:string, values:string)
`etcd` EtcdConfig
`backup` BackupSpec
`sharedConfig` SharedConfig
`schedulingConstraints` SchedulingConstraints
`replicas` integer
`priorityClassName` string	PriorityClassName is the name of a priority class that shall be used for the etcd pods.
`storageClass` string	StorageClass defines the name of the StorageClass required by the claim. More info: https://kubernetes.io/docs/concepts/storage/persistent-volumes#class-1
`storageCapacity` Quantity	StorageCapacity defines the size of persistent volume.
`volumeClaimTemplate` string	VolumeClaimTemplate defines the volume claim template to be created
`runAsRoot` boolean	RunAsRoot defines whether the securityContext of the pod specification should indicate that the containers shall run as root. By default, they run as non-root with user ’nobody’.

EtcdStatus

EtcdStatus defines the observed state of Etcd.

Appears in:

Etcd

Field	Description	Default	Validation
`observedGeneration` integer	ObservedGeneration is the most recent generation observed for this resource.
`etcd` CrossVersionObjectReference
`conditions` Condition array	Conditions represents the latest available observations of an etcd’s current state.
`lastErrors` LastError array	LastErrors captures errors that occurred during the last operation.
`lastOperation` LastOperation	LastOperation indicates the last operation performed on this resource.
`currentReplicas` integer	CurrentReplicas is the current replica count for the etcd cluster.
`replicas` integer	Replicas is the replica count of the etcd cluster.
`readyReplicas` integer	ReadyReplicas is the count of replicas being ready in the etcd cluster.
`ready` boolean	Ready is `true` if all etcd replicas are ready.
`labelSelector` LabelSelector	LabelSelector is a label query over pods that should match the replica count. It must match the pod template’s labels. Deprecated: this field will be removed in the future.
`members` EtcdMemberStatus array	Members represents the members of the etcd cluster
`peerUrlTLSEnabled` boolean	PeerUrlTLSEnabled captures the state of peer url TLS being enabled for the etcd member(s)
`selector` string	Selector is a label query over pods that should match the replica count. It must match the pod template’s labels.

GarbageCollectionPolicy

Underlying type: string

GarbageCollectionPolicy defines the type of policy for snapshot garbage collection.

Validation:

Enum: [Exponential LimitBased]

Appears in:

BackupSpec

LastError

LastError stores details of the most recent error encountered for a resource.

Appears in:

EtcdStatus

Field	Description	Default	Validation
`code` ErrorCode	Code is an error code that uniquely identifies an error.
`description` string	Description is a human-readable message indicating details of the error.
`observedAt` Time	ObservedAt is the time the error was observed.

LastOperation

LastOperation holds the information on the last operation done on the Etcd resource.

Appears in:

EtcdStatus

Field	Description	Default	Validation
`type` LastOperationType	Type is the type of last operation.
`state` LastOperationState	State is the state of the last operation.
`description` string	Description describes the last operation.
`runID` string	RunID correlates an operation with a reconciliation run. Every time an Etcd resource is reconciled (barring status reconciliation which is periodic), a unique ID is generated which can be used to correlate all actions done as part of a single reconcile run. Capturing this as part of LastOperation aids in establishing this correlation. This further helps in also easily filtering reconcile logs as all structured logs in a reconciliation run should have the `runID` referenced.
`lastUpdateTime` Time	LastUpdateTime is the time at which the operation was last updated.

LastOperationState

Underlying type: string

LastOperationState is a string alias representing the state of the last operation.

Appears in:

LastOperation

Field	Description
`Processing`	LastOperationStateProcessing indicates that an operation is in progress.
`Succeeded`	LastOperationStateSucceeded indicates that an operation has completed successfully.
`Error`	LastOperationStateError indicates that an operation is completed with errors and will be retried.
`Requeue`	LastOperationStateRequeue indicates that an operation is not completed and either due to an error or unfulfilled conditions will be retried.

LastOperationType

Underlying type: string

LastOperationType is a string alias representing type of the last operation.

Appears in:

LastOperation

Field	Description
`Create`	LastOperationTypeCreate indicates that the last operation was a creation of a new Etcd resource.
`Reconcile`	LastOperationTypeReconcile indicates that the last operation was a reconciliation of the spec of an Etcd resource.
`Delete`	LastOperationTypeDelete indicates that the last operation was a deletion of an existing Etcd resource.

LeaderElectionSpec

LeaderElectionSpec defines parameters related to the LeaderElection configuration.

Appears in:

BackupSpec

Field	Description	Default	Validation
`reelectionPeriod` Duration	ReelectionPeriod defines the Period after which leadership status of corresponding etcd is checked.		Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string
`etcdConnectionTimeout` Duration	EtcdConnectionTimeout defines the timeout duration for etcd client connection during leader election.		Pattern: `^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$` Type: string

MetricsLevel

Underlying type: string

MetricsLevel defines the level ‘basic’ or ’extensive’.

Validation:

Enum: [basic extensive]

Appears in:

EtcdConfig

Field	Description
`basic`	Basic is a constant for metrics level basic.
`extensive`	Extensive is a constant for metrics level extensive.

SchedulingConstraints

Appears in:

EtcdSpec

Field	Description	Default	Validation
`affinity` Affinity	Affinity defines the various affinity and anti-affinity rules for a pod that are honoured by the kube-scheduler.
`topologySpreadConstraints` TopologySpreadConstraint array	TopologySpreadConstraints describes how a group of pods ought to spread across topology domains, that are honoured by the kube-scheduler.

SecretReference

SecretReference defines a reference to a secret.

Appears in:

TLSConfig

Field	Description	Default	Validation
`dataKey` string	DataKey is the name of the key in the data map containing the credentials.

SharedConfig

SharedConfig defines parameters shared and used by Etcd as well as backup-restore sidecar.

Appears in:

EtcdSpec

Field	Description	Default	Validation
`autoCompactionMode` CompactionMode	AutoCompactionMode defines the auto-compaction-mode:‘periodic’ mode or ‘revision’ mode for etcd and embedded-etcd of backup-restore sidecar.		Enum: [periodic revision]
`autoCompactionRetention` string	AutoCompactionRetention defines the auto-compaction-retention length for etcd as well as for embedded-etcd of backup-restore sidecar.

StorageProvider

Underlying type: string

StorageProvider defines the type of object store provider for storing backups.

Appears in:

StoreSpec

StoreSpec

StoreSpec defines parameters related to ObjectStore persisting backups

Appears in:

Field	Description	Default	Validation
`container` string	Container is the name of the container the backup is stored at.
`prefix` string	Prefix is the prefix used for the store.
`provider` StorageProvider	Provider is the name of the backup provider.
`secretRef` SecretReference	SecretRef is the reference to the secret which used to connect to the backup store.

TLSConfig

TLSConfig hold the TLS configuration details.

Appears in:

Field	Description	Default	Validation
`tlsCASecretRef` SecretReference
`serverTLSSecretRef` SecretReference
`clientTLSSecretRef` SecretReference

WaitForFinalSnapshotSpec

WaitForFinalSnapshotSpec defines the parameters for waiting for a final full snapshot before copying backups.

Appears in:

EtcdCopyBackupsTaskSpec

Field	Description	Default	Validation
`enabled` boolean	Enabled specifies whether to wait for a final full snapshot before copying backups.
`timeout` Duration	Timeout is the timeout for waiting for a final full snapshot. When this timeout expires, the copying of backups will be performed anyway. No timeout or 0 means wait forever.

15 - etcd Network Latency

Network Latency analysis: `sn-etcd-sz` vs `mn-etcd-sz` vs `mn-etcd-mz`

This page captures the etcd cluster latency analysis for below scenarios using the benchmark tool (build from etcd benchmark tool).

sn-etcd-sz -> single-node etcd single zone (Only single replica of etcd will be running)

mn-etcd-sz -> multi-node etcd single zone (Multiple replicas of etcd pods will be running across nodes in a single zone)

mn-etcd-mz -> multi-node etcd multi zone (Multiple replicas of etcd pods will be running across nodes in multiple zones)

PUT Analysis

Summary

sn-etcd-sz latency is ~20% less than mn-etcd-sz when benchmark tool with single client.
mn-etcd-sz latency is less than mn-etcd-mz but the difference is ~+/-5%.
Compared to mn-etcd-sz, sn-etcd-sz latency is higher and gradually grows with more clients and larger value size.
Compared to mn-etcd-mz, mn-etcd-sz latency is higher and gradually grows with more clients and larger value size.
Compared to follower, leader latency is less, when benchmark tool with single client for all cases.
Compared to follower, leader latency is high, when benchmark tool with multiple clients for all cases.

Sample commands:

# write to leader
benchmark put --target-leader --conns=1 --clients=1 --precise \
    --sequential-keys --key-starts 0 --val-size=256 --total=10000 \
    --endpoints=$ETCD_HOST 


# write to follower
benchmark put  --conns=1 --clients=1 --precise \
    --sequential-keys --key-starts 0 --val-size=256 --total=10000 \
    --endpoints=$ETCD_FOLLOWER_HOST

Latency analysis during PUT requests to etcd

In this case benchmark tool tries to put key with random 256 bytes value.

Benchmark tool loads key/value to leader with single client .

sn-etcd-sz latency (~0.815ms) is ~50% lesser than mn-etcd-sz (~1.74ms ).
- mn-etcd-sz latency (~1.74ms ) is slightly lesser than mn-etcd-mz (~1.8ms) but the difference is negligible (within same ms).

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	leader	1220.0520	0.815ms	eu-west-1c	etcd-main-0	sn-etcd-sz
10000	256	1	1	leader	586.545	1.74ms	eu-west-1a	etcd-main-1	mn-etcd-sz
10000	256	1	1	leader	554.0155654442634	1.8ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool loads key/value to follower with single client.

mn-etcd-sz latency(~2.2ms) is 20% to 30% lesser than mn-etcd-mz(~2.7ms).
Compare to follower, leader has lower latency.

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	follower-1	445.743	2.23ms	eu-west-1a	etcd-main-0	mn-etcd-sz
10000	256	1	1	follower-1	378.9366747610789	2.63ms	eu-west-1c	etcd-main-0	mn-etcd-mz

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	follower-2	457.967	2.17ms	eu-west-1a	etcd-main-2	mn-etcd-sz
10000	256	1	1	follower-2	345.6586129825796	2.89ms	eu-west-1b	etcd-main-2	mn-etcd-mz

Benchmark tool loads key/value to leader with multiple clients.

sn-etcd-sz latency(~78.3ms) is ~10% greater than mn-etcd-sz(~71.81ms).
mn-etcd-sz latency(~71.81ms) is less than mn-etcd-mz(~72.5ms) but the difference is negligible.

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	leader	12638.905	78.32ms	eu-west-1c	etcd-main-0	sn-etcd-sz
100000	256	100	1000	leader	13789.248	71.81ms	eu-west-1a	etcd-main-1	mn-etcd-sz
100000	256	100	1000	leader	13728.446436395223	72.5ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool loads key/value to follower with multiple clients.

mn-etcd-sz latency(~69.8ms) is ~5% greater than mn-etcd-mz(~72.6ms).
Compare to leader, follower has lower latency.

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	follower-1	14271.983	69.80ms	eu-west-1a	etcd-main-0	mn-etcd-sz
100000	256	100	1000	follower-1	13695.98	72.62ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	follower-2	14325.436	69.47ms	eu-west-1a	etcd-main-2	mn-etcd-sz
100000	256	100	1000	follower-2	15750.409490407475	63.3ms	eu-west-1b	etcd-main-2	mn-etcd-mz

In this case benchmark tool tries to put key with random 1 MB value.

Benchmark tool loads key/value to leader with single client.

sn-etcd-sz latency(~16.35ms) is ~20% lesser than mn-etcd-sz(~20.64ms).
mn-etcd-sz latency(~20.64ms) is less than mn-etcd-mz(~21.08ms) but the difference is negligible..

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	leader	61.117	16.35ms	eu-west-1c	etcd-main-0	sn-etcd-sz
1000	1000000	1	1	leader	48.416	20.64ms	eu-west-1a	etcd-main-1	mn-etcd-sz
1000	1000000	1	1	leader	45.7517341664802	21.08ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool loads key/value withto follower single client.

mn-etcd-sz latency(~23.10ms) is ~10% greater than mn-etcd-mz(~21.8ms).
Compare to follower, leader has lower latency.

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	follower-1	43.261	23.10ms	eu-west-1a	etcd-main-0	mn-etcd-sz
1000	1000000	1	1	follower-1	45.7517341664802	21.8ms	eu-west-1c	etcd-main-0	mn-etcd-mz
1000	1000000	1	1	follower-1	45.33	22.05ms	eu-west-1c	etcd-main-0	mn-etcd-mz

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	follower-2	40.0518	24.95ms	eu-west-1a	etcd-main-2	mn-etcd-sz
1000	1000000	1	1	follower-2	43.28573155709838	23.09ms	eu-west-1b	etcd-main-2	mn-etcd-mz
1000	1000000	1	1	follower-2	45.92	21.76ms	eu-west-1a	etcd-main-1	mn-etcd-mz
1000	1000000	1	1	follower-2	35.5705	28.1ms	eu-west-1b	etcd-main-2	mn-etcd-mz

Benchmark tool loads key/value to leader with multiple clients.

sn-etcd-sz latency(~6.0375secs) is ~30% greater than mn-etcd-sz``~4.000secs).
mn-etcd-sz latency(~4.000secs) is less than mn-etcd-mz(~ 4.09secs) but the difference is negligible.

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	300	leader	55.373	6.0375secs	eu-west-1c	etcd-main-0	sn-etcd-sz
1000	1000000	100	300	leader	67.319	4.000secs	eu-west-1a	etcd-main-1	mn-etcd-sz
1000	1000000	100	300	leader	65.91914167957594	4.09secs	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool loads key/value to follower with multiple clients.

mn-etcd-sz latency(~4.04secs) is ~5% greater than mn-etcd-mz(~ 3.90secs).
Compare to leader, follower has lower latency.

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	300	follower-1	66.528	4.0417secs	eu-west-1a	etcd-main-0	mn-etcd-sz
1000	1000000	100	300	follower-1	70.6493461856332	3.90secs	eu-west-1c	etcd-main-0	mn-etcd-mz
1000	1000000	100	300	follower-1	71.95	3.84secs	eu-west-1c	etcd-main-0	mn-etcd-mz

Number of keys	Value size	Number of connections	Number of clients	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	300	follower-2	66.447	4.0164secs	eu-west-1a	etcd-main-2	mn-etcd-sz
1000	1000000	100	300	follower-2	67.53038086369484	3.87secs	eu-west-1b	etcd-main-2	mn-etcd-mz
1000	1000000	100	300	follower-2	68.46	3.92secs	eu-west-1a	etcd-main-1	mn-etcd-mz

Range Analysis

Sample commands are:

# Single connection read request with sequential keys
benchmark range 0 --target-leader --conns=1 --clients=1 --precise \
    --sequential-keys --key-starts 0  --total=10000 \
    --consistency=l \
    --endpoints=$ETCD_HOST 
# --consistency=s [Serializable]
benchmark range 0 --target-leader --conns=1 --clients=1 --precise \
    --sequential-keys --key-starts 0  --total=10000 \
    --consistency=s \
    --endpoints=$ETCD_HOST 
# Each read request with range query matches key 0 9999 and repeats for total number of requests.  
benchmark range 0 9999 --target-leader --conns=1 --clients=1 --precise \
    --total=10 \
    --consistency=s \
    --endpoints=https://etcd-main-client:2379
# Read requests with multiple connections
benchmark range 0 --target-leader --conns=100 --clients=1000 --precise \
    --sequential-keys --key-starts 0  --total=100000 \
    --consistency=l \
    --endpoints=$ETCD_HOST 
benchmark range 0 --target-leader --conns=100 --clients=1000 --precise \
    --sequential-keys --key-starts 0  --total=100000 \
    --consistency=s \
    --endpoints=$ETCD_HOST

Latency analysis during Range requests to etcd

In this case benchmark tool tries to get specific key with random 256 bytes value.

Benchmark tool range requests to leader with single client.

sn-etcd-sz latency(~1.24ms) is ~40% greater than mn-etcd-sz(~0.67ms).
mn-etcd-sz latency(~0.67ms) is ~20% lesser than mn-etcd-mz(~0.85ms).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	true	l	leader	800.272	1.24ms	eu-west-1c	etcd-main-0	sn-etcd-sz
10000	256	1	1	true	l	leader	1173.9081	0.67ms	eu-west-1a	etcd-main-1	mn-etcd-sz
10000	256	1	1	true	l	leader	999.3020189178693	0.85ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Compare to consistency Linearizable, Serializable is ~40% less for all cases

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	true	s	leader	1411.229	0.70ms	eu-west-1c	etcd-main-0	sn-etcd-sz
10000	256	1	1	true	s	leader	2033.131	0.35ms	eu-west-1a	etcd-main-1	mn-etcd-sz
10000	256	1	1	true	s	leader	2100.2426362012025	0.47ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to follower with single client .

mn-etcd-sz latency(~1.3ms) is ~20% lesser than mn-etcd-mz(~1.6ms).
Compare to follower, leader read request latency is ~50% less for both mn-etcd-sz, mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	true	l	follower-1	765.325	1.3ms	eu-west-1a	etcd-main-0	mn-etcd-sz
10000	256	1	1	true	l	follower-1	596.1	1.6ms	eu-west-1c	etcd-main-0	mn-etcd-mz

Compare to consistency Linearizable, Serializable is ~50% less for all cases

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
10000	256	1	1	true	s	follower-1	1823.631	0.54ms	eu-west-1a	etcd-main-0	mn-etcd-sz
10000	256	1	1	true	s	follower-1	1442.6	0.69ms	eu-west-1c	etcd-main-0	mn-etcd-mz
10000	256	1	1	true	s	follower-1	1416.39	0.70ms	eu-west-1c	etcd-main-0	mn-etcd-mz
10000	256	1	1	true	s	follower-1	2077.449	0.47ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to leader with multiple client.

sn-etcd-sz latency(~84.66ms) is ~20% greater than mn-etcd-sz(~73.95ms).
mn-etcd-sz latency(~73.95ms) is more or less equal to mn-etcd-mz(~ 73.8ms).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	true	l	leader	11775.721	84.66ms	eu-west-1c	etcd-main-0	sn-etcd-sz
100000	256	100	1000	true	l	leader	13446.9598	73.95ms	eu-west-1a	etcd-main-1	mn-etcd-sz
100000	256	100	1000	true	l	leader	13527.19810605353	73.8ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Compare to consistency Linearizable, Serializable is ~20% lesser for all cases
sn-etcd-sz latency(~69.37ms) is more or less equal to mn-etcd-sz(~69.89ms).
mn-etcd-sz latency(~69.89ms) is slightly higher than mn-etcd-mz(~67.63ms).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	true	s	leader	14334.9027	69.37ms	eu-west-1c	etcd-main-0	sn-etcd-sz
100000	256	100	1000	true	s	leader	14270.008	69.89ms	eu-west-1a	etcd-main-1	mn-etcd-sz
100000	256	100	1000	true	s	leader	14715.287354023869	67.63ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to follower with multiple client.

mn-etcd-sz latency(~60.69ms) is ~20% lesser than mn-etcd-mz(~70.76ms).
Compare to leader, follower has lower read request latency.

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	true	l	follower-1	11586.032	60.69ms	eu-west-1a	etcd-main-0	mn-etcd-sz
100000	256	100	1000	true	l	follower-1	14050.5	70.76ms	eu-west-1c	etcd-main-0	mn-etcd-mz

mn-etcd-sz latency(~86.09ms) is ~20 higher than mn-etcd-mz(~64.6ms).
- Compare to mn-etcd-sz consistency Linearizable, Serializable is ~20% higher.*
Compare to mn-etcd-mz consistency Linearizable, Serializable is ~slightly less.

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
100000	256	100	1000	true	s	follower-1	11582.438	86.09ms	eu-west-1a	etcd-main-0	mn-etcd-sz
100000	256	100	1000	true	s	follower-1	15422.2	64.6ms	eu-west-1c	etcd-main-0	mn-etcd-mz

Benchmark tool range requests to leader all keys.

sn-etcd-sz latency(~678.77ms) is ~5% slightly lesser than mn-etcd-sz(~697.29ms).
mn-etcd-sz latency(~697.29ms) is less than mn-etcd-mz(~701ms) but the difference is negligible.

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	256	2	5	false	l	leader	6.8875	678.77ms	eu-west-1c	etcd-main-0	sn-etcd-sz
20	256	2	5	false	l	leader	6.720	697.29ms	eu-west-1a	etcd-main-1	mn-etcd-sz
20	256	2	5	false	l	leader	6.7	701ms	eu-west-1a	etcd-main-1	mn-etcd-mz

- Compare to consistency Linearizable, Serializable is ~5% slightly higher for all cases
sn-etcd-sz latency(~687.36ms) is less than mn-etcd-sz(~692.68ms) but the difference is negligible.
mn-etcd-sz latency(~692.68ms) is ~5% slightly lesser than mn-etcd-mz(~735.7ms).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	256	2	5	false	s	leader	6.76	687.36ms	eu-west-1c	etcd-main-0	sn-etcd-sz
20	256	2	5	false	s	leader	6.635	692.68ms	eu-west-1a	etcd-main-1	mn-etcd-sz
20	256	2	5	false	s	leader	6.3	735.7ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to follower all keys

mn-etcd-sz(~737.68ms) latency is ~5% slightly higher than mn-etcd-mz(~713.7ms).
Compare to leader consistency Linearizableread request, follower is ~5% slightly higher.

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	256	2	5	false	l	follower-1	6.163	737.68ms	eu-west-1a	etcd-main-0	mn-etcd-sz
20	256	2	5	false	l	follower-1	6.52	713.7ms	eu-west-1c	etcd-main-0	mn-etcd-mz

mn-etcd-sz latency(~757.73ms) is ~10% higher than mn-etcd-mz(~690.4ms).
Compare to follower consistency Linearizableread request, follower consistency Serializable is ~3% slightly higher for mn-etcd-sz.
Compare to follower consistency Linearizableread request, follower consistency Serializable is ~5% less for mn-etcd-mz.
*Compare to leader consistency Serializableread request, follower consistency Serializable is ~5% less for mn-etcd-mz. *

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	256	2	5	false	s	follower-1	6.0295	757.73ms	eu-west-1a	etcd-main-0	mn-etcd-sz
20	256	2	5	false	s	follower-1	6.87	690.4ms	eu-west-1c	etcd-main-0	mn-etcd-mz

In this case benchmark tool tries to get specific key with random `1MB` value.

Benchmark tool range requests to leader with single client.

sn-etcd-sz latency(~5.96ms) is ~5% lesser than mn-etcd-sz(~6.28ms).
mn-etcd-sz latency(~6.28ms) is ~10% higher than mn-etcd-mz(~5.3ms).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	true	l	leader	167.381	5.96ms	eu-west-1c	etcd-main-0	sn-etcd-sz
1000	1000000	1	1	true	l	leader	158.822	6.28ms	eu-west-1a	etcd-main-1	mn-etcd-sz
1000	1000000	1	1	true	l	leader	187.94	5.3ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Compare to consistency Linearizable, Serializable is ~15% less for sn-etcd-sz, mn-etcd-sz, mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	true	s	leader	184.95	5.398ms	eu-west-1c	etcd-main-0	sn-etcd-sz
1000	1000000	1	1	true	s	leader	176.901	5.64ms	eu-west-1a	etcd-main-1	mn-etcd-sz
1000	1000000	1	1	true	s	leader	209.99	4.7ms	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to follower with single client.

mn-etcd-sz latency(~6.66ms) is ~10% higher than mn-etcd-mz(~6.16ms).
Compare to leader, follower read request latency is ~10% high for mn-etcd-sz
Compare to leader, follower read request latency is ~20% high for mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	true	l	follower-1	150.680	6.66ms	eu-west-1a	etcd-main-0	mn-etcd-sz
1000	1000000	1	1	true	l	follower-1	162.072	6.16ms	eu-west-1c	etcd-main-0	mn-etcd-mz

Compare to consistency Linearizable, Serializable is ~15% less for mn-etcd-sz(~5.84ms), mn-etcd-mz(~5.01ms).
Compare to leader, follower read request latency is ~5% slightly high for mn-etcd-sz, mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	1	1	true	s	follower-1	170.918	5.84ms	eu-west-1a	etcd-main-0	mn-etcd-sz
1000	1000000	1	1	true	s	follower-1	199.01	5.01ms	eu-west-1c	etcd-main-0	mn-etcd-mz

Benchmark tool range requests to leader with multiple clients.

sn-etcd-sz latency(~1.593secs) is ~20% lesser than mn-etcd-sz(~1.974secs).
mn-etcd-sz latency(~1.974secs) is ~5% greater than mn-etcd-mz(~1.81secs).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	500	true	l	leader	252.149	1.593secs	eu-west-1c	etcd-main-0	sn-etcd-sz
1000	1000000	100	500	true	l	leader	205.589	1.974secs	eu-west-1a	etcd-main-1	mn-etcd-sz
1000	1000000	100	500	true	l	leader	230.42	1.81secs	eu-west-1a	etcd-main-1	mn-etcd-mz

Compare to consistency Linearizable, Serializable is more or less same for sn-etcd-sz(~1.57961secs), mn-etcd-mz(~1.8secs) not a big difference
Compare to consistency Linearizable, Serializable is ~10% high for mn-etcd-sz(~ 2.277secs).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	500	true	s	leader	252.406	1.57961secs	eu-west-1c	etcd-main-0	sn-etcd-sz
1000	1000000	100	500	true	s	leader	181.905	2.277secs	eu-west-1a	etcd-main-1	mn-etcd-sz
1000	1000000	100	500	true	s	leader	227.64	1.8secs	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to follower with multiple client.

mn-etcd-sz latency is ~20% less than mn-etcd-mz.
Compare to leader consistency Linearizable, follower read request latency is ~15 less for mn-etcd-sz(~1.694secs).
Compare to leader consistency Linearizable, follower read request latency is ~10% higher for mn-etcd-sz(~1.977secs).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	500	true	l	follower-1	248.489	1.694secs	eu-west-1a	etcd-main-0	mn-etcd-sz
1000	1000000	100	500	true	l	follower-1	210.22	1.977secs	eu-west-1c	etcd-main-0	mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	500	true	l	follower-2	205.765	1.967secs	eu-west-1a	etcd-main-2	mn-etcd-sz
1000	1000000	100	500	true	l	follower-2	195.2	2.159secs	eu-west-1b	etcd-main-2	mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	500	true	s	follower-1	231.458	1.7413secs	eu-west-1a	etcd-main-0	mn-etcd-sz
1000	1000000	100	500	true	s	follower-1	214.80	1.907secs	eu-west-1c	etcd-main-0	mn-etcd-mz

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
1000	1000000	100	500	true	s	follower-2	183.320	2.2810secs	eu-west-1a	etcd-main-2	mn-etcd-sz
1000	1000000	100	500	true	s	follower-2	195.40	2.164secs	eu-west-1b	etcd-main-2	mn-etcd-mz

Benchmark tool range requests to leader all keys.

sn-etcd-sz latency(~8.993secs) is ~3% slightly lower than mn-etcd-sz(~9.236secs).
mn-etcd-sz latency(~9.236secs) is ~2% slightly lower than mn-etcd-mz(~9.100secs).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	1000000	2	5	false	l	leader	0.5139	8.993secs	eu-west-1c	etcd-main-0	sn-etcd-sz
20	1000000	2	5	false	l	leader	0.506	9.236secs	eu-west-1a	etcd-main-1	mn-etcd-sz
20	1000000	2	5	false	l	leader	0.508	9.100secs	eu-west-1a	etcd-main-1	mn-etcd-mz

Compare to consistency Linearizableread request, follower for sn-etcd-sz(~9.secs) is a slight difference 10ms.
Compare to consistency Linearizableread request, follower for mn-etcd-sz(~9.113secs) is ~1% less, not a big difference.
Compare to consistency Linearizableread request, follower for mn-etcd-mz(~8.799secs) is ~3% less, not a big difference.
sn-etcd-sz latency(~9.secs) is ~1% slightly less than mn-etcd-sz(~9.113secs).

mn-etcd-sz latency(~9.113secs) is ~3% slightly higher than mn-etcd-mz(~8.799secs).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	1000000	2	5	false	s	leader	0.51125	9.0003secs	eu-west-1c	etcd-main-0	sn-etcd-sz
20	1000000	2	5	false	s	leader	0.4993	9.113secs	eu-west-1a	etcd-main-1	mn-etcd-sz
20	1000000	2	5	false	s	leader	0.522	8.799secs	eu-west-1a	etcd-main-1	mn-etcd-mz

Benchmark tool range requests to follower all keys

mn-etcd-sz latency(~9.065secs) is ~1% slightly higher than mn-etcd-mz(~9.007secs).
Compare to leader consistency Linearizableread request, follower is ~1% slightly higher for both cases mn-etcd-sz, mn-etcd-mz .

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	1000000	2	5	false	l	follower-1	0.512	9.065secs	eu-west-1a	etcd-main-0	mn-etcd-sz
20	1000000	2	5	false	l	follower-1	0.533	9.007secs	eu-west-1c	etcd-main-0	mn-etcd-mz

Compare to consistency Linearizableread request, follower for mn-etcd-sz(~9.553secs) is ~5% high.
Compare to consistency Linearizableread request, follower for mn-etcd-mz(~7.7433secs) is ~15% less.
mn-etcd-sz(~9.553secs) latency is ~20% higher than mn-etcd-mz(~7.7433secs).

Number of requests	Value size	Number of connections	Number of clients	sequential-keys	Consistency	Target etcd server	Average write QPS	Average latency per request	zone	server name	Test name
20	1000000	2	5	false	s	follower-1	0.4743	9.553secs	eu-west-1a	etcd-main-0	mn-etcd-sz
20	1000000	2	5	false	s	follower-1	0.5500	7.7433secs	eu-west-1c	etcd-main-0	mn-etcd-mz

NOTE: This Network latency analysis is inspired by etcd performance.

16 - EtcdMember Custom Resource

DEP-04: EtcdMember Custom Resource

Summary

Today, etcd-druid mainly acts as an etcd cluster provisioner, and seldom takes remediatory actions if the etcd cluster goes into an undesired state that needs to be resolved by a human operator. In other words, etcd-druid cannot perform day-2 operations on etcd clusters in its current form, and hence cannot carry out its full set of responsibilities as a true “operator” of etcd clusters. For etcd-druid to be fully capable of its responsibilities, it must know the latest state of the etcd clusters and their individual members at all times.

This proposal aims to bridge that gap by introducing EtcdMember custom resource allowing individual etcd cluster members to publish information/state (previously unknown to etcd-druid). This provides etcd-druid a handle to potentially take cluster-scoped remediatory actions.

Terminology

druid: etcd-druid - an operator for etcd clusters.
etcd-member: A single etcd pod in an etcd cluster that is realised as a StatefulSet.
backup-sidecar: It is the etcd-backup-restore sidecar container in each etcd-member pod.
NOTE: Term sidecar can now be confused with the latest definition in KEP-73. etcd-backup-restore container is currently not set as an init-container as proposed in the KEP but as a regular container in a multi-container [Pod](Pods | Kubernetes).
leading-backup-sidecar: A backup-sidecar that is associated to an etcd leader.
restoration: It refers to an individual etcd-member restoring etcd data from an existing backup (comprising of full and delta snapshots). The authors have deliberately chosen to distinguish between restoration and learning. Learning refers to a process where a learner “learns” from an etcd-cluster leader.

Motivation

Sharing state of an individual etcd-member with druid is essential for diagnostics, monitoring, cluster-wide-operations and potential remediation. At present, only a subset of etcd-member state is shared with druid using leases. It was always meant as a stopgap arrangement as mentioned in the corresponding issue and is not the best use of leases.

There is a need to have a clear distinction between an etcd-member state and etcd cluster state since most of an etcd cluster state is often derived by looking at individual etcd-member states. In addition, actors which update each of these states should be clearly identified so as to prevent multiple actors updating a single resource holding the state of either an etcd cluster or an etcd-member. As a consequence, etcd-members should not directly update the Etcd resource status and would therefore need a new custom resource allowing each member to publish detailed information about its latest state.

Goals

Introduce EtcdMember custom resource via which each etcd-member can publish information about its state. This enables druid to deterministically orchestrate out-of-turn operations like compaction, defragmentation, volume management etc.
Define and capture states, sub-states and deterministic transitions amongst states of an etcd-member.
Today leases are misused to share member-specific information with druid. Their usage to share member state [leader, follower, learner], member-id, snapshot revisions etc should be removed.

Non-Goals

Auto-recovery from quorum loss or cluster-split due to network partitioning.
Auto-recovery of an etcd-member due to volume mismatch.
Relooking at segregating responsiblities between etcd and backup-sidecar containers.

Proposal

This proposal introduces a new custom resource EtcdMember, and in the following sections describes different sets of information that should be captured as part of the new resource.

Etcd Member Metadata

Every etcd-member has a unique memberID and it is part of an etcd cluster which has a unique clusterID. In a well-formed etcd cluster every member must have the same clusterID. Publishing this information to druid helps in identifying issues when one or more etcd-members form their own individual clusters, thus resulting in multiple clusters where only one was expected. Issues Issue#419, Canary#4027, Canary#3973 are some such occurrences.

Today, this information is published by using a member lease. Both these fields are populated in the leases’ Spec.HolderIdentity by the backup-sidecar container.

The authors propose to publish member metadata information in EtcdMember resource.

id: <etcd-member id>
clusterID: <etcd cluster id>

NOTE: Druid would not do any auto-recovery when it finds out that there are more than one clusters being formed. Instead this information today will be used for diagnostic and alerting.

Etcd Member State Transitions

Each etcd-member goes through different States during its lifetime. State is a derived high-level summary of where an etcd-member is in its lifecycle. A SubState gives additional information about the state. This proposal extends the concept of states with the notion of a SubState, since State indicates a top-level state of an EtcdMember resource, which can have one or more SubStates.

While State is sufficient for many human operators, the notion of a SubState provides operators with an insight about the discrete stage of an etcd-member in its lifecycle. For example, consider a top-level State: Starting, which indicates that an etcd-member is starting. Starting is meant to be a transient state for an etcd-member. If an etcd-member remains in this State longer than expected, then an operator would require additional insight, which the authors propose to provide via SubState (in this case, the possible SubStates could be PendingLearner and Learner, which are detailed in the following sections).

At present, these states are not captured and only the final state is known - i.e the etcd-member either fails to come up (all re-attempts to bring up the pod via the StatefulSet controller has exhausted) or it comes up. Getting an insight into all its state transitions would help in diagnostics.

The status of an etcd-member at any given point in time can be best categorized as a combination of a top-level State and a SubState. The authors propose to introduce the following states and sub-states:

States and Sub-States

NOTE: Abbreviations have been used wherever possible, only to represent sub-states. These representations are chosen only for brevity and will have proper longer names.

States	Sub-States	Description
New	-	Every newly created etcd-member will start in this state and is termed as the initial state or the start state.
Initializing	DBV-S (DBValidationSanity)	This state denotes that backup-restore container in etcd-member pod has started initialization. Sub-State `DBV-S` which is an abbreviation for `DBValidationSanity` denotes that currently sanity etcd DB validation is in progress.
Initializing	DBV-F (DBValidationFull)	This state denotes that backup-restore container in etcd-member pod has started initialization. Sub-State `DBV-F` which is an abbreviation for `DBValidationFull` denotes that currently full etcd DB validation is in progress.
Initializing	R (Restoration)	This state denotes that backup-restore container in etcd-member pod has started initialization. Sub-State `R` which is an abbreviation for `Restoration` denotes that DB validation failed and now backup-restore has commenced restoration of etcd DB from the backup (comprising of full snapshot and delta-snapshots). An etcd-member will transition to this sub-state only when it is part of a single-node etcd-cluster.
Starting (SI)	PL (PendingLearner)	An etcd-member can transition from `Initializing` state to `PendingLearner` state. In this state backup-restore container will optionally delete any existing etcd data directory and then attempts to add its peer etcd-member process as a learner. Since there can be only one learner at a time in an etcd cluster, an etcd-member could be in this state for some time till its request to get added as a learner is accepted.
Starting (SI)	Learner	When backup-restore is successfully able to add its peer etcd-member process as a `Learner`. In this state the etcd-member process will start its DB sync from an etcd leader.
Started (Sd)	Follower	A follower is a voting raft member. A `Learner` etcd-member will get promoted to a `Follower` once its DB is in sync with the leader. It could also become a follower if during a re-election it loses leadership and transitions from being a `Leader` to `Follower`.
Started (Sd)	Leader	A leader is an etcd-member which will handle all client write requests and linearizable read requests. A member could transition to being a `Leader` from an existing `Follower` role due to winning a leader election or for a single node etcd cluster it directly transitions from `Initializing` state to `Leader` state as there is no other member.

In the following sub-sections, the state transitions are categorized into several flows making it easier to grasp the different transitions.

Top Level State Transitions

Following DFA represents top level state transitions (without any representation of sub-states). As described in the table above there are 4 top level states:

New- this is a start state for all newly created etcd-members
Initializing - In this state backup-restore will perform pre-requisite actions before it triggers the start of an etcd process. DB validation and optionally restoration is done in this state. Possible sub-states are: DBValidationSanity, DBValidationFull and Restoration
Starting - Once the optional initialization is done backup-restore will trigger the start of an etcd process. It can either directly go to Learner sub-state or wait for getting added as a learner and therefore be in PendingLearner sub-state.
Started - In this state the etcd-member is a full voting member. It can either be in Leader or Follower sub-states.

Starting an Etcd-Member in a Single-Node Etcd Cluster

Following DFA represents the states, sub-states and transitions of a single etcd-member for a cluster that is bootstrapped from cluster size of 0 -> 1.

Addition of a New Etcd-Member in a Multi-Node Etcd Cluster

Following DFA represents the states, sub-states and transitions of an etcd cluster which starts with having a single member (Leader) and then one or more new members are added which represents a scale-up of an etcd cluster from 1 -> n, where n is odd.

Restart of a Voting Etcd-Member in a Multi-Node Etcd Cluster

Following DFA represents the states, sub-states and transitions when a voting etcd-member in a multi-node etcd cluster restarts.

NOTE: If the DB validation fails then data directory of the etcd-member is removed and etcd-member is removed from cluster membership, thus transitioning it to New state. The state transitions from New state are depicted by this section.

Deterministic Etcd Member Creation/Restart During Scale-Up

Bootstrap information:

When an etcd-member starts, then it needs to find out:

If it should join an existing cluster or start a new cluster.
If it should add itself as a Learner or directly start as a voting member.

Issue with the current approach:

At present, this is facilitated by three things:

During scale-up, druid adds an annotation gardener.cloud/scaled-to-multi-node to the StatefulSet. Each etcd-members looks up this annotation.
backup-sidecar attempts to fetch etcd cluster member-list and checks if this etcd-member is already part of the cluster.
Size of the cluster by checking initial-cluster in the etcd config.

Druid adds an annotation gardener.cloud/scaled-to-multi-node on the StatefulSet which is then shared by all etcd-members irrespective of the starting state of an etcd-member (as Learner or Voting-Member). This especially creates an issue for the current leader (often pod with index 0) during the scale-up of an etcd cluster as described in this issue.

It has been agreed that the current solution to this issue is a quick and dirty fix and needs to be revisited to be uniformly applied to all etcd-members. The authors propose to provide a more deterministic approach to scale-up using the EtcdMember resource.

New approach

Instead of adding an annotation gardener.cloud/scaled-to-multi-node on the StatefulSet, a new annotation druid.gardener.cloud/create-as-learner should be added by druid on an EtcdMember resource. This annotation will only be added to newly created members during scale-up.

Each etcd-member should look at the following to deterministically compute the bootstrap information specified above:

druid.gardener.cloud/create-as-learner annotation on its respective EtcdMember resource. This new annotation will be honored in the following cases:
- When an etcd-member is created for the very first time.
- An etcd-member is restarted while it is in Starting state (PendingLearner and Learner sub-states).
Etcd-cluster member list. to check if it is already part of the cluster.
Existing etcd data directory and its validity.

NOTE: When the etcd-member gets promoted to a voting-member, then it should remove the annotation on its respective EtcdMember resource.

TLS Enablement for Peer Communication

Etcd-members in a cluster use peer URL(s) to communicate amongst each other. If the advertised peer URL(s) for an etcd-member are updated then etcd mandates a restart of the etcd-member.

Druid only supports toggling the transport level security for the advertised peer URL(s). To indicate that the etcd process within the etcd-member has the updated advertised peer URL(s), an annotation member.etcd.gardener.cloud/tls-enabled is added by backup-sidecar container to the member lease object.

During the reconciliation run for an Etcd resource in druid, if reconciler detects a change in advertised peer URL(s) TLS configuration then it will watch for the above mentioned annotation on the member lease. If the annotation has a value of false then it will trigger a restart of the etcd-member pod.

The authors propose to publish member metadata information in EtcdMember resource and not misuse member leases.

peerTLSEnabled: <bool>

Monitoring Backup Health

Backup-sidecar takes delta and full snapshot both periodically and threshold based. These backed-up snapshots are essential for restoration operations for bootstrapping an etcd cluster from 0 -> 1 replicas. It is essential that leading-backup-sidecar container which is responsible for taking delta/full snapshots and uploading these snapshots to the configured backup store, publishes this information for druid to consume.

At present, information about backed-up snapshot (only latest-revision-number) is published by leading-backup-sidecar container by updating Spec.HolderIdentity of the delta-snapshot and full-snapshot leases.

Druid maintains conditions in the Etcd resource status, which include but are not limited to maintaining information on whether backups being taken for an etcd cluster are healthy (up-to-date) or stale (outdated in context to a configured schedule). Druid computes these conditions using information from full/delta snapshot leases.

In order to provide a holistic view of the health of backups to human operators, druid requires additional information about the snapshots that are being backed-up. The authors propose to not misuse leases and instead publish the following snapshot information as part EtcdMember custom resource:

snapshots:
  lastFull:
    timestamp: <time of full snapshot>
    name: <name of the file that is uploaded>
    size: <size of the un-compressed snapshot file uploaded>
    startRevision: <start revision of etcd db captured in the snapshot>
    endRevision: <end revision of etcd db captured in the snapshot>
  lastDelta:
    timestamp: <time of delta snapshot>
    name: <name of the file that is uploaded>
    size: <size of the un-compressed snapshot file uploaded>
    startRevision: <start revision of etcd db captured in the snapshot>
    endRevision: <end revision of etcd db captured in the snapshot>

While this information will primarily help druid compute accurate conditions regarding backup health from snapshot information and publish this to human operators, it could be further utilised by human operators to take remediatory actions (e.g. manually triggering a full or delta snapshot or further restarting the leader if the issue is still not resolved) if backup is unhealthy.

Enhanced Snapshot Compaction

Druid can be configured to perform regular snapshot compactions for etcd clusters, to reduce the total number of delta snapshots to be restored if and when a DB restoration for an etcd cluster is required. Druid triggers a snapshot compaction job when the accumulated etcd events in the latest set of delta snapshots (taken after the last full snapshot) crosses a specified threshold.

As described in Issue#591 scheduling compaction only based on number of accumulated etcd events is not sufficient to ensure a successful compaction. This is specifically targeted for kubernetes clusters where each etcd event is larger in size owing to large spec or status fields or respective resources.

Druid will now need information regarding snapshot sizes, and more importantly the total size of accumulated delta snapshots since the last full snapshot.

The authors propose to enhance the proposed snapshots field described in Use Case #3 with the following additional field:

snapshots:
  accumulatedDeltaSize: <total size of delta snapshots since last full snapshot>

Druid can then use this information in addition to the existing revision information to decide to trigger an early snapshot compaction job. This effectively allows druid to be proactive in performing regular compactions for etcds receiving large events, reducing the probability of a failed snapshot compaction or restoration.

Enhanced Defragmentation

Reader is recommended to read Etcd Compaction & Defragmentation in order to understand the following terminology:

dbSize - total storage space used by the etcd database

dbSizeInUse - logical storage space used by the etcd database, not accounting for free pages in the DB due to etcd history compaction

The leading-backup-sidecar performs periodic defragmentations of the DBs of all the etcd-members in the cluster, controlled via a defragmentation cron schedule provided to each backup-sidecar. Defragmentation is a costly maintenance operation and causes a brief downtime to the etcd-member being defragmented, due to which the leading-backup-sidecar defragments each etcd-member sequentially. This ensures that only one etcd-member would be unavailable at any given time, thus avoiding an accidental quorum loss in the etcd cluster.

The authors propose to move the responsibility of orchestrating these individual defragmentations to druid due to the following reasons:

Since each backup-sidecar only has knowledge of the health of its own etcd, it can only determine whether its own etcd can be defragmented or not, based on etcd-member health. Trying to defragment a different healthy etcd-member while another etcd-member is unhealthy would lead to a transient quorum loss.
Each backup-sidecar is only a sidecar to its own etcd-member, and by good design principles, it must not be performing any cluster-wide maintenance operations, and this responsibility should remain with the etcd cluster operator.

Additionally, defragmentation of an etcd DB becomes inevitable if the DB size exceeds the specified DB space quota, since the etcd DB then becomes read-only, ie no write operations on the etcd would be possible unless the etcd DB is defragmented and storage space is freed up. In order to automate this, druid will now need information about the etcd DB size from each member, specifically the leading etcd-member, so that a cluster-wide defragmentation can be triggered if the DB size reaches a certain threshold, as already described by this issue.

The authors propose to enhance each etcd-member to regularly publish information about the dbSize and dbSizeInUse so that druid may trigger defragmentation for the etcd cluster.

dbSize: <db-size> # e.g 6Gi
dbSizeInUse: <db-size-in-use> # e.g 3.5Gi

Difference between dbSize and dbSizeInUse gives a clear indication of how much storage space would be freed up if a defragmentation is performed. If the difference is not significant (based on a configurable threshold provided to druid), then no defragmentation should be performed. This will ensure that druid does not perform frequent defragmentations that do not yield much benefit. Effectively it is to maximise the benefit of defragmentation since this operations involves transient downtime for each etcd-member.

Monitoring Defragmentations

As discussed in the previous section, every etcd-member is defragmented periodically, and can also be defragmented based on the DB size reaching a certain threshold. It is beneficial for druid to have knowledge of this data from each etcd-member for the following reasons:

[Diagnostics] It is expected that backup-sidecar will push releveant metrics and configure alerts on these metrics.
[Operational] Derive status of defragmentation at etcd cluster level. In case of partial failures for a subset of etcd-members druid can potentially re-trigger defragmentation only for those etcd-members.

The authors propose to capture this information as part of lastDefragmentation section in the EtcdMember resource.

lastDefragmentation:
  startTime: <start time of defragmentation>
  endTime: <end time of defragmentation>
  status: <Succeeded | Failed>
  message: <success or failure message>
  initialDBSize: <size of etcd DB prior to defragmentation>
  finalDBSize: <size of etcd DB post defragmentation>

NOTE: Defragmentation is a cluster-wide operation, and insights derived from aggregating defragmentation data from individual etcd-members would be captured in the Etcd resource status

Monitoring Restorations

Each etcd-member may perform restoration of data multiple times throughout its lifecycle, possibly owing to data corruptions. It would be useful to capture this information as part of an EtcdMember resource, for the following use cases:

[Diagnostics] It is expected that backup-sidecar will push a metric indicating failure to restore.
[Operational] Restoration from backup-bucket only happens for a single node etcd cluster. If restoration is failing then druid cannot take any remediatory actions since there is no etcd quorum.

The authors propose to capture this information under lastRestoration section in the EtcdMember resource.

lastRestoration:
  status: <Failed | Success | In-Progress>
  reason: <reason-code for status>
  message: <human readable message for status>
  startTime: <start time of restoration>
  endTime: <end time of restoration>

Authors have considered the following cases to better understand how errors during restoration will be handled:

Case #1 - Failure to connect to Provider Object Store

At present full and delta snapshots are downloaded during restoration. If there is a failure then initialization status transitions to Failed followed by New which forces etcd-wrapper to trigger the initialization again. This in a way forces a retry and currently there is no limit on the number of attempts.

Authors propose to improve the retry logic but keep the overall behavior of not forcing a container restart the same.

Case #2 - Read-Only Mounted volume

If a mounted volume which is used to create the etcd data directory turns read-only then authors propose to capture this state via EtcdMember.

Authors propose that druid should initiate recovery by deleting the PVC for this etcd-member and letting StatefulSet controller re-create the Pod and the PVC. Removing PVC and deleting the pod is considered safe because:

Data directory is present and is the DB is corrupt resulting in an un-usasble etcd.
Data directory is not present but any attempt to create a directory structure fails due to read-only FS.

In both these cases there is no side-effect of deleting the PVC and the Pod.

Case #3 - Revision mismatch

There is currently an issue in backup-sidecar which results in a revision mismatch in the snapshots (full/delta) taken by leading the backup-sidecar container. This results in a restoration failure. One occurance of such issue has been captured in Issue#583. This occurence points to a bug which should be fixed however there is a rare possibility that these snapshots (full/delta) get corrupted. In this rare situation, backup-sidecar should only raise an alert.

Authors propose that druid should not take any remediatory actions as this involves:

Inspecting snapshots
If the full snapshot is corrupt then a decision needs to be taken to recover from the last full snapshot as the base snapshot. This can result in data loss and therefore needs manual intervention.
If a delta snapshot is corrupt, then recovery can be done till the corrupt revision in the delta snapshot. Since this will also result in a loss of data therefore this decision needs to be take by an operator.

Monitoring Volume Mismatches

Each etcd-member checks for possible etcd data volume mismatches, based on which it decides whether to start the etcd process or not, but this information is not captured anywhere today. It would be beneficial to capture this information as part of the EtcdMember resource so that a human operator may check this and manually fix the underlying problem with the wrong volume being attached or mounted to an etcd-member pod.

The authors propose to capture this information under volumeMismatches section in the EtcdMember resource.

volumeMismatches:
- identifiedAt: <time at which wrong volume mount was identified>
  fixedAt: <time at which correct volume was mounted>
  volumeID: <volume ID of wrong volume that got mounted>
  numRestarts: <num of etcd-member restarts that were attempted>

Each entry under volumeMismatches will be for a unique volumeID. If there is a pod restart and it results in yet another unexpected volumeID (different from the already captured volumeIDs) then a new entry will get created. numRestarts denotes the number of restarts seen by the etcd-member for a specific volumeID.

Based on information from the volumeMismatches section, druid may choose to perform rudimentary remediatory actions as simple as restarting the member pod to force a possible rescheduling of the pod to a different node which could potentially force the correct volume to be mounted to the member.

Custom Resource API

Spec vs Status

Information that is captured in the etcd-member custom resource could be represented either as EtcdMember.Status or EtcdMemberState.Spec.

Gardener has a similar need to capture a shoot state and they have taken the decision to represent it via ShootState resource where the state or status of a shoot is captured as part of the Spec field in the ShootState custom resource.

The authors wish to instead align themselves with the K8S API conventions and choose to use EtcdMember custom resource and capture the status of each member in Status field of this resource. This has the following advantages:

Spec represents a desired state of a resource and what is intended to be captured is the As-Is state of a resource which Status is meant to capture. Therefore, semantically using Status is the correct choice.
Not mis-using Spec now to represent As-Is state provides us with a choice to extend the custom resource with any future need for a Spec a.k.a desired state.

Representing State Transitions

The authors propose to use a custom representation for states, sub-states and transitions.

Consider the following representation:

transitions:
- state: <name of the state that the etcd-member has transitioned to>
  subState: <name of the sub-state if any>
  reason: <reason code for the transition>
  transitionTime: <time of transition to this state>
  message: <detailed message if any>

As an example, consider the following transitions which represent addition of an etcd-member during scale-up of an etcd cluster, followed by a restart of the etcd-member which detects a corrupt DB:

status:
  transitions:
  - state: New
    subState: New
    reason: ClusterScaledUp
    transitionTime: "2023-07-17T05:00:00Z"
    message: "New member added due to etcd cluster scale-up"
  - state: Starting
    subState: PendingLearner
    reason: WaitingToJoinAsLearner
    transitionTime: "2023-07-17T05:00:30Z"
    message: "Waiting to join the cluster as a learner"
  - state: Starting
    subState: Learner
    reason: JoinedAsLearner
    transitionTime: "2023-07-17T05:01:20Z"
    message: "Joined the cluster as a learner"
  - state: Started
    subState: Follower
    reason: PromotedAsVotingMember
    transitionTime: "2023-07-17T05:02:00Z"
    message: "Now in sync with leader, promoted as voting member"
  - state: Initializing
    subState: DBValidationFull
    reason: DetectedPreviousUncleanExit
    transitionTime: "2023-07-17T08:00:00Z"
    message: "Detected previous unclean exit, requires full DB validation"
  - state: New
    subState: New
    reason: DBCorruptionDetected
    transitionTime: "2023-07-17T08:01:30Z"
    message: "Detected DB corruption during initialization, removing member from cluster"
  - state: Starting
    subState: PendingLearner
    reason: WaitingToJoinAsLearner
    transitionTime: "2023-07-17T08:02:10Z"
    message: "Waiting to join the cluster as a learner"
  - state: Starting
    subState: Learner
    reason: JoinedAsLearner
    transitionTime: "2023-07-17T08:02:20Z"
    message: "Joined the cluster as a learner"
  - state: Started
    subState: Follower
    reason: PromotedAsVotingMember
    transitionTime: "2023-07-17T08:04:00Z"
    message: "Now in sync with leader, promoted as voting member"

Reason Codes

The authors propose the following list of possible reason codes for transitions. This list is not exhaustive, and can be further enhanced to capture any new transitions in the future.

Reason	Transition From State (SubState)	Transition To State (SubState)
`ClusterScaledUp` \| `NewSingleNodeClusterCreated`	nil	New
`DetectedPreviousCleanExit`	New \| Started (Leader) \| Started (Follower)	Initializing (DBValidationSanity)
`DetectedPreviousUncleanExit`	New \| Started (Leader) \| Started (Follower)	Initializing (DBValidationFull)
`DBValidationFailed`	Initializing (DBValidationSanity) \| Initializing (DBValidationFull)	Initializing (Restoration) \| New
`DBValidationSucceeded`	Initializing (DBValidationSanity) \| Initializing (DBValidationFull)	Started (Leader) \| Started (Follower)
`Initializing (Restoration)Succeeded`	Initializing (Restoration)	Started (Leader)
`WaitingToJoinAsLearner`	New	Starting (PendingLearner)
`JoinedAsLearner`	Starting (PendingLearner)	Starting (Learner)
`PromotedAsVotingMember`	Starting (Learner)	Started (Follower)
`GainedClusterLeadership`	Started (Follower)	Started (Leader)
`LostClusterLeadership`	Started (Leader)	Started (Follower)

API

EtcdMember

The authors propose to add the EtcdMember custom resource API to etcd-druid APIs and initially introduce it with v1alpha1 version.

apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdMember
metadata:
  labels:
    gardener.cloud/owned-by: <name of parent Etcd resource>
  name: <name of the etcd-member>
  namespace: <namespace | will be the same as that of parent Etcd resource>
  ownerReferences:
  - apiVersion: druid.gardener.cloud/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Etcd
    name: <name of the parent Etcd resource>
    uid: <UID of the parent Etcd resource> 
status:
  id: <etcd-member id>
  clusterID: <etcd cluster id>
  peerTLSEnabled: <bool>
  dbSize: <db-size>
  dbSizeInUse: <db-size-in-use>
  snapshots:
    lastFull:
      timestamp: <time of full snapshot>
      name: <name of the file that is uploaded>
      size: <size of the un-compressed snapshot file uploaded>
      startRevision: <start revision of etcd db captured in the snapshot>
      endRevision: <end revision of etcd db captured in the snapshot>
    lastDelta:
      timestamp: <time of delta snapshot>
      name: <name of the file that is uploaded>
      size: <size of the un-compressed snapshot file uploaded>
      startRevision: <start revision of etcd db captured in the snapshot>
      endRevision: <end revision of etcd db captured in the snapshot>
    accumulatedDeltaSize: <total size of delta snapshots since last full snapshot>
  lastRestoration:
    type: <FromSnapshot | FromLeader>
    status: <Failed | Success | In-Progress>
    startTime: <start time of restoration>
    endTime: <end time of restoration>
  lastDefragmentation:
    startTime: <start time of defragmentation>
    endTime: <end time of defragmentation>
    reason: 
    message:
    initialDBSize: <size of etcd DB prior to defragmentation>
    finalDBSize: <size of etcd DB post defragmentation>
  volumeMismatches:
  - identifiedAt: <time at which wrong volume mount was identified>
    fixedAt: <time at which correct volume was mounted>
    volumeID: <volume ID of wrong volume that got mounted>
    numRestarts: <num of pod restarts that were attempted>
  transitions:
  - state: <name of the state that the etcd-member has transitioned to>
    subState: <name of the sub-state if any>
    reason: <reason code for the transition>
    transitionTime: <time of transition to this state>
    message: <detailed message if any>

Etcd

Authors propose the following changes to the Etcd API:

In the Etcd.Status resource API, member status is computed and stored. This field will be marked as deprecated and in a later version of druid it will be removed. In its place, the authors propose to introduce the following:

type EtcdStatus struct {
  // MemberRefs contains references to all existing EtcdMember resources
  MemberRefs []CrossVersionObjectReference
}

In Etcd.Status resource API, PeerUrlTLSEnabled reflects the status of enabling TLS for peer communication across all etcd-members. Currentlty this field is not been used anywhere. In this proposal, the authors have also proposed that each EtcdMember resource should capture the status of TLS enablement of peer URL. The authors propose to relook at the need to have this field under EtcdStatus.

Lifecycle of an EtcdMember

Creation

Druid creates an EtcdMember resource for every replica in etcd.Spec.Replicas during reconciliation of an etcd resource. For a fresh etcd cluster this is done prior to creation of the StatefulSet resource and for an existing cluster which has now been scaled-up, it is done prior to updating the StatefulSet resource.

Updation

All fields in EtcdMember.Status are only updated by the corresponding etcd-member. Druid only consumes the information published via EtcdMember resources.

Deletion

Druid is responsible for deletion of all existing EtcdMember resources for an etcd cluster. There are three scenarios where an EtcdMember resource will be deleted:

Deletion of etcd resource.
Scale down of an etcd cluster to 0 replicas due to hibernation of the k8s control plane.
Transient scale down of an etcd cluster to 0 replicas to recover from a quorum loss.

Authors found no reason to retain EtcdMember resources when the etcd cluster is scale down to 0 replicas since the information contained in each EtcdMember resource would no longer represent the current state of each member and would thus be stale. Any controller in druid which acts upon the EtcdMember.Status could potentially take incorrect actions.

Reconciliation

Authors propose to introduce a new controller (let’s call it etcd-member-controller) which watches for changes to the EtcdMember resource(s). If a reconciliation of an Etcd resource is required as a result of change in EtcdMember status then this controller should enqueue an event and force a reconciliation via existing etcd-controller, thus preserving the single-actor-principal constraint which ensures deterministic changes to etcd cluster resources.

NOTE: Further decisions w.r.t responsibility segregation will be taken during implementation and will not be documented in this proposal.

Stale EtcdMember Status Handling

It is possible that an etcd-member is unable to update its respective EtcdMember resource. Following can be some of the implications which should be kept in mind while reconciling EtcdMember resource in druid:

Druid sees stale state transitions (this assumes that the backup-sidecar attempts to update the state/sub-state in etcdMember.status.transitions with best attempt). There is currently no implication other than an operator seeing a stale state.
dbSize and dbSizeInUse could not be updated. A consequence could be that druid continues to see high value for dbSize - dbSizeInUse for a extended amount of time. Druid should ensure that it does not trigger repeated defragmentations.
If VolumeMismatches is stale, then druid should no longer attempt to recover by repeatedly restarting the pod.
Failed restoration was recorded last and further updates to this array failed. Druid should not repeatedly take full-snapshots.
If snapshots.accumulatedDeltaSize could not be updated, then druid should not schedule repeated compaction Jobs.

Reference

17 - Feature Gates in Etcd-Druid

Feature Gates in Etcd-Druid

This page contains an overview of the various feature gates an administrator can specify on etcd-druid.

Overview

Feature gates are a set of key=value pairs that describe etcd-druid features. You can turn these features on or off by passing them to the --feature-gates CLI flag in the etcd-druid command.

The following tables are a summary of the feature gates that you can set on etcd-druid.

The “Since” column contains the etcd-druid release when a feature is introduced or its release stage is changed.
The “Until” column, if not empty, contains the last etcd-druid release in which you can still use a feature gate.
If a feature is in the Alpha or Beta state, you can find the feature listed in the Alpha/Beta feature gate table.
If a feature is stable you can find all stages for that feature listed in the Graduated/Deprecated feature gate table.
The Graduated/Deprecated feature gate table also lists deprecated and withdrawn features.

Feature Gates for Alpha or Beta Features

Feature	Default	Stage	Since	Until

Feature Gates for Graduated or Deprecated Features

Feature	Default	Stage	Since	Until
`UseEtcdWrapper`	`false`	`Alpha`	`0.19`	`0.21`
`UseEtcdWrapper`	`true`	`Beta`	`0.22`	`0.24`
`UseEtcdWrapper`	`true`	`GA`	`0.25`	`0.27`
`UseEtcdWrapper`		`Removed`	`0.28`

Using a Feature

A feature can be in Alpha, Beta or GA stage.

Alpha feature

Disabled by default.
Might be buggy. Enabling the feature may expose bugs.
Support for feature may be dropped at any time without notice.
The API may change in incompatible ways in a later software release without notice.
Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.

Beta feature

Enabled by default.
The feature is well tested. Enabling the feature is considered safe.
Support for the overall feature will not be dropped, though details may change.
The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens, we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
Recommended for only non-critical uses because of potential for incompatible changes in subsequent releases.

Please do try Beta features and give feedback on them! After they exit beta, it may not be practical for us to make more changes.

General Availability (GA) feature

This is also referred to as a stable feature which should have the following characteristics:

The feature is always enabled; you cannot disable it.
The corresponding feature gate is no longer needed.
Stable versions of features will appear in released software for many subsequent versions.

List of Feature Gates

Feature	Description
`UseEtcdWrapper`	Enables the use of etcd-wrapper image and a compatible version of etcd-backup-restore, along with component-specific configuration changes necessary for the usage of the etcd-wrapper image.

18 - Getting Started Locally

Setup Etcd-Druid Locally

This document will guide you on how to setup etcd-druid on your local machine and how to provision and manage Etcd cluster(s).

00-Prerequisites

Before we can setup etcd-druid and use it to provision Etcd clusters, we need to prepare the development environment. Follow the Prepare Dev Environment Guide for detailed instructions.

01-Setting up KIND cluster

etcd-druid uses kind as it’s local Kubernetes engine. The local setup is configured for kind due to its convenience only. Any other Kubernetes setup would also work.

make kind-up

This command sets up a new Kind cluster and stores the kubeconfig at ./hack/kind/kubeconfig. Additionally, this command also deploys a local container registry as a docker container. This ensures faster image push/pull times. The local registry can be accessed as localhost:5001 for pushing and pulling images.

To target this newly created cluster, set the KUBECONFIG environment variable to the kubeconfig file.

export KUBECONFIG=$PWD/hack/kind/kubeconfig

Note: If you wish to configure kind cluster differently then you can directly invoke the script and check its help to know about all configuration options.

./hack/kind-up.sh -h
  usage: kind-up.sh [Options]
  Options:
    --cluster-name  <cluster-name>   Name of the kind cluster to create. Default value is 'etcd-druid-e2e'
    --skip-registry                  Skip creating a local docker registry. Default value is false.
    --feature-gates <feature-gates>  Comma separated list of feature gates to enable on the cluster.

02-Setting up etcd-druid

Configuring etcd-druid

Prior to deploying etcd-druid, it can be configured via CLI-args and environment variables.

To configure CLI args you can modify charts/druid/values.yaml. For example, if you wish to auto-reconcile any change done to Etcd CR, then you should set enableEtcdSpecAutoReconcile to true. By default this will be switched off.
DRUID_E2E_TEST=true: sets specific configuration for etcd-druid for optimal e2e test runs, like a lower sync period for the etcd controller.

Deploying etcd-druid

Any variant of make deploy-* command uses helm and skaffold to build and deploy etcd-druid to the target Kubernetes cluster. In addition to deploying etcd-druid it will also install the Etcd CRD and EtcdCopyBackupTask CRD.

Regular mode

make deploy

The above command will use skaffold to build and deploy etcd-druid to the k8s kind cluster pointed to by KUBECONFIG environment variable.

Dev mode

make deploy-dev

This is similar to make deploy but additionally starts a skaffold dev loop. After the initial deployment, skaffold starts watching source files. Once it has detected changes, you can press any key to update the etcd-druid deployment.

Debug mode

make deploy-debug

This is similar to make deploy-dev but additionally configures containers in pods for debugging as required for each container’s runtime technology. The associated debugging ports are exposed and labelled so that they can be port-forwarded to the local machine. Skaffold disables automatic image rebuilding and syncing when using the debug mode as compared to dev mode.

Go debugging uses Delve. Please see the skaffold debugging documentation how to setup your IDE accordingly.

!!! note Resuming or stopping only a single goroutine (Go Issue 25578, 31132) is currently not supported, so the action will cause all the goroutines to get activated or paused.

This means that when a goroutine is paused on a breakpoint, then all the other goroutines are also paused. This should be kept in mind when using skaffold debug.

03-Configure Backup [Optional]

Deploying a Local Backup Store Emulator

!!! info This section is Optional and is only meant to describe steps to deploy a local object store which can be used for testing and development. If you either do not wish to enable backups or you wish to use remote (infra-provider-specific) object store then this section can be skipped.

An Etcd cluster provisioned via etcd-druid provides a capability to take regular delta and full snapshots which are stored in an object store. You can enable this functionality by ensuring that you fill in spec.backup.store section of the Etcd CR.

Backup Store Variant	Setup Guide
Azure Object Storage Emulator	Manage Azurite (Steps 00-03)
S3 Object Store Emulator	Manage LocalStack (Steps 00-03)
GCS Object Store Emulator	Manage GCS Emulator (Steps 00-03)

Setting up Cloud Provider Object Store Secret

!!! info This section is Optional. If you have disabled backup functionality or if you are using local storage or one of the supported object store emulators then you can skip this section.

A Kubernetes Secret needs to be created for cloud provider Object Store access. You can refer to the Secret YAML templates here. Replace the dummy values with the actual configuration and ensure that you have added the metadata.name and metadata.namespace to the secret.

!!! tip * Secret should be deployed in the same namespace as the Etcd resource. * All the values in the data field of the secret YAML should in base64 encoded format.

To apply the secret run:

kubectl apply -f <path/to/secret>

04-Preparing Etcd CR

Choose an appropriate variant of Etcd CR from examples directory.

If you wish to enable functionality to backup delta & full snapshots then uncomment spec.backup.store section.

# Configuration for storage provider
store:
  secretRef:
    name: etcd-backup-secret-name
  container: object-storage-container-name
  provider: aws # options: aws,azure,gcp,openstack,alicloud,dell,openshift,local
  prefix: etcd-test

Brief explanation of the keys:

secretRef.name is the name of the secret that was applied as mentioned above.
store.container is the object storage bucket name.
store.provider is the bucket provider. Pick from the options mentioned in comment.
store.prefix is the folder name that you want to use for your snapshots inside the bucket.

!!! tip For developer convenience we have provided object store emulator specific etcd CR variants which can be used as if as well.

05-Applying Etcd CR

Create the Etcd CR (Custom Resource) by applying the Etcd yaml to the cluster

kubectl apply -f <path-to-etcd-cr-yaml>

06-Verify the Etcd Cluster

To obtain information on the etcd cluster you can invoke the following command:

kubectl get etcd -o=wide

We adhere to a naming convention for all resources that are provisioned for an Etcd cluster. Refer to etcd-cluster-components document to get details of all resources that are provisioned.

Verify Etcd Pods’ Functionality

etcd-wrapper uses a distroless image, which lacks a shell. To interact with etcd, use an Ephemeral container as a debug container. Refer to this documentation for building and using an ephemeral container which gets attached to the etcd-wrapper pod.

# Put a key-value pair into the etcd 
etcdctl put <key1> <value1>
# Retrieve all key-value pairs from the etcd db
etcdctl get --prefix ""

For a multi-node etcd cluster, insert the key-value pair using the etcd container of one etcd member and retrieve it from the etcd container of another member to verify consensus among the multiple etcd members.

07-Updating Etcd CR

Etcd CR can be updated with new changes. To ensure that etcd-druid reconciles the changes you can refer to options that etcd-druid provides here.

08-Cleaning up the setup

If you wish to only delete the Etcd cluster then you can use the following command:

kubectl delete etcd <etcd-name>

This will add the deletionTimestamp to the Etcd resource. At the time the creation of the Etcd cluster, etcd-druid will add a finalizer to ensure that it cleans up all Etcd cluster resources before the CR is removed.

  finalizers:
  - druid.gardener.cloud/etcd-druid

etcd-druid will automatically pick up the deletion event and attempt clean up Etcd cluster resources. It will only remove the finalizer once all resources have been cleaned up.

If you only wish to remove etcd-druid but retain the kind cluster then you can use the following make target:

make undeploy

If you wish to delete the kind cluster then you can use the following make target:

make kind-down

This cleans up the entire setup as the kind cluster gets deleted.

19 - Getting Started Locally

Developing etcd-druid locally

You can setup etcd-druid locally by following detailed instructions in this document.

For best development experience you should use make deploy-dev - this helps during development where you wish to make changes to the code base and with a key-press allow automatic re-deployment of the application to the target Kubernetes cluster.
In case you wish to start a debugging session then use make deploy-debug - this will additionally disable leader election and prevent leases to expire and process to stop.

!!! info We leverage skaffold debug and skaffold dev features.

20 - Immutable etcd Cluster Backups

DEP-06: Immutable etcd Cluster Backups

Summary

This proposal introduces immutable backups for etcd clusters managed by etcd-druid. By leveraging cloud provider immutability features, backups taken by etcd-backup-restore can neither be modified nor deleted once created for a configurable retention duration (immutability period). This approach strengthens the reliability and fault tolerance of the etcd restoration process.

Terminology

etcd-druid: An etcd operator that configures, provisions, reconciles, and monitors etcd clusters.
etcd-backup-restore: A sidecar container that manages backups and restores of etcd cluster state. For more information, see the etcd-backup-restore documentation.
WORM (Write Once, Read Many): A storage model in which data, once written, cannot be modified or deleted until certain conditions are met.
Immutability: The property of an object that prevents it from being modified or deleted after creation.
Immutability Period: The duration for which data must remain immutable before it can be modified or deleted.
Bucket-Level Immutability: A policy that applies a uniform immutability period to all objects within a bucket.
Object-Level Immutability: A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
Garbage Collection: The process of deleting old snapshot data that is no longer needed, in order to free up storage space. For more information, see the garbage collection documentation.
Hibernation: A state in which an etcd cluster is scaled down to zero replicas, effectively pausing its operations. This is typically done to save costs when the cluster is not needed for an extended period. During hibernation, the cluster’s data remains intact, and it can be resumed to its previous state when required.

Motivation

etcd-druid provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the spec.backup section in an Etcd custom resource. Periodic backups are taken via the etcd-backup-restore sidecar container that runs in each etcd member pod.

Periodic backups of an etcd cluster state ensure the ability to recover from a data loss or a quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing WORM protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.

Goals

Protect backup data against modifications and deletions post-creation through immutability policies offered by storage providers.

Non-Goals

Implementing a mechanism to signal hibernation intent for handling snapshot immutability for hibernated etcd clusters, such as adding functionality via etcd.spec or annotations on the Etcd CR, to indicate when an etcd cluster should enter or exit hibernation, as discussed in gardener/etcd-druid#922.
Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).

Proposal

This proposal aims to improve backup storage integrity and security by using immutability features available on major cloud providers.

Supported Cloud Providers

Google Cloud Storage (GCS): Bucket Lock
Amazon S3 (S3): Object Lock
Azure Blob Storage (ABS): Immutable Blob Storage

Note
Currently, Openstack object storage (swift) doesn’t support immutability for objects: https://blueprints.launchpad.net/swift/+spec/immutability-middleware.

Types of Immutability

Object-Level Immutability: Allows setting immutability periods independently for each object within a bucket.
Bucket-Level Immutability: Applies a uniform immutability policy to all objects in a bucket.

Comparison of Bucket-Level and Object-Level Immutability

Feature	GCS	S3	ABS
Can bucket-level immutability period be increased?	Yes	Yes*	Yes (only 5 times)
Can bucket-level immutability period be decreased?	No	Yes*	No
Is bucket-level immutability a prerequisite for object-level immutability?	No	Yes	Yes (for existing buckets)
Can object-level immutability period be increased?	Yes	Yes	Yes
Can object-level immutability period be decreased?	No	No	No
Support for enabling object-level immutability in existing buckets	No	Yes	Yes
Support for enabling object-level immutability in new buckets	Yes	Yes	Yes
Support for enabling bucket-level immutability in existing buckets	Yes	Yes	Yes
Support for enabling bucket-level immutability in new buckets	Yes	Yes	Yes
Precedence between bucket-level and object-level immutability periods	Max(bucket, object)	Object-level	Max(bucket, object)

Note
In AWS S3, it is possible to increase and decrease the bucket-level immutability period; however, this action can be blocked by configuring specific bucket policy settings.
For GCS, object-level immutability is not yet supported for existing buckets; see this issue.

Recommended Approach

At the time of writing this proposal, these are the current limitations seen across providers:

S3 and ABS: typically require bucket-level immutability as a prerequisite for object-level immutability.
GCS does not currently support object-level immutability in existing buckets.
ABS requires a migration process to enable version-level immutability on existing containers.

Consequently, the authors recommend bucket-level immutability. This approach simplifies configuration and ensures a uniform immutability policy for all backups in a bucket across all support providers.

Configuring Immutable Backups

Creating and configuring immutable buckets on providers is not handled by etcd-druid and must be done by the consumers. For a large-scale consumer like Gardener, provider extensions are leveraged to automate both the creation and configuration of buckets. For more details, see BackupBucket and refer to this issue.

Prerequisites

Configure or Update the Immutable Bucket
- Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.
- Refer to the Configure Bucket-Level Immutability for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
Provide Valid Credentials in a Kubernetes Secret

The store section of the Etcd CR must reference a Secret containing valid credentials.

apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
 name: example-etcd
spec:
 backup:
  store:
    prefix: etcd-backups
    container: my-immutable-backups  # Bucket name
    provider: aws                    # Supported: aws, gcp, azure
    secretRef:
     name: etcd-backup-credentials  # Reference to the Secret
    immutability:
     retentionType: bucket          # Enables bucket-level immutability

Confirm that this secret has the proper permissions to upload and retrieve snapshots from the immutable bucket.
See the Getting Started guide for an example.

Note
The etcd-druid does not handle the rotation of cloud provider credentials. Credential rotation must be managed by the operator.

By following these steps, you will have set up an immutable bucket for storing etcd backups, along with the necessary references in your Etcd specification and Kubernetes secret.

Handling of Hibernated Clusters

When an etcd cluster remains hibernated beyond the bucket’s immutability period, backups might become mutable again, depending on the cloud provider (see Comparison of Storage Provider Properties). This could compromise the intended guarantees of immutability, exposing backups to accidental or malicious alterations.

As mentioned in gardener/etcd-druid#922, a clear hibernation signal is needed. Since etcd-druid does not currently support hibernation natively and addressing that is out of scope for this proposal, we focus solely on maintaining immutability.

Proposal

To mitigate the risk of backups becoming mutable during extended hibernation under bucket-level immutability, the authors propose the following approach:

Prerequisite: Cut-off Traffic and Take a Final Full Snapshot Before Hibernation
- Before scaling the etcd cluster down to zero replicas, the etcd controller removes etcd’s client ports (2379/2380) from the etcd Service to block application traffic.
- The etcd controller then triggers an on-demand full snapshot. This ensures that the latest state of the etcd cluster is captured and securely stored before hibernation begins.
Periodically Re-Upload the Snapshot
- Re-uploading the latest full snapshot resets its immutability period in the bucket, ensuring backups remain protected during hibernation.
- By default, the re-upload schedule follows etcd.spec.backup.fullSnapshotSchedule. Currently, this interval cannot be customized exclusively for re-uploads; future enhancements may introduce a dedicated configuration parameter.
- A new operator task type, ExtendFullSnapshotImmutabilityTask, periodically calls a new CLI command, extend-snapshot-immutability, to re-upload the snapshot and extend its immutability.
- ExtendFullSnapshotImmutabilityTask also manages garbage collection, ensuring that only the latest immutable snapshots are retained while deleting older snapshots created by the task itself.

By capturing a final full snapshot before hibernation, periodically re-uploading it to preserve immutability, and removing stale backups, etcd backups remain safeguarded against accidental or malicious alterations until the cluster is resumed.

Important
Limitation: There is a potential edge case where a snapshot might become corrupted before hibernation or during the re-upload process by ExtendFullSnapshotImmutabilityTask. If this happens, the process could repeatedly re-upload the same corrupted snapshot, failing to ensure a reliable backup.

An alternative solution could be to trigger the snapshot compaction which runs in separate pod and takes a fresh snapshot of the compacted etcd data and re-uploads it to the object store.

While this approach ensures that only valid snapshots are re-uploaded, it is resource-intensive, requiring an operational etcd instance even in hibernation. Given the high cost in terms of compute and memory, the authors recommend the snapshot re-upload approach as a more practical solution.

Etcd CR API Changes

A new field is introduced in Etcd.spec.backup.store to indicate the immutability strategy:

// StoreSpec defines parameters for storing etcd backups.
type StoreSpec struct {
    // ...
    // Immutability configuration for the backup store.
    Immutability *ImmutabilitySpec `json:"immutability,omitempty"`
}

// ImmutabilitySpec defines immutability settings.
type ImmutabilitySpec struct {
    // RetentionType indicates the type of immutability approach. For example, "bucket".
    RetentionType string `json:"retentionType,omitempty"`
}

If immutability is not specified, etcd-druid will assume that the bucket is mutable, and no immutability-related logic applies. We have defined a new type to allow us future enhancements to the immutability specification.

`etcd-backup-restore` Enhancements

The authors propose adding new sub-command to the etcd-backup-restore CLI (etcdbrctl) to maintain immutability during hibernation and to clean up snapshots created by ExtendFullSnapshotImmutabilityTask:

extend-snapshot-immutability
- Downloads the latest full snapshot from the object store.
- Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
- Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.
- Introduces the --gc-from-timestamp=<timestamp> parameter, where <timestamp> is the creation timestamp of the task. This ensures that only snapshots created by the task are subject to garbage collection.

As an alternative to the download/upload approach, the authors document the possibility of using provider APIs to perform a server-side object copy. This method could significantly reduce network costs and latency by directly copying the snapshot within the cloud provider’s infrastructure. While this option is not implemented in the current version, feasibility of server-side copy can be explored in etcd-backup-restore during implementation.

etcd Controller Enhancements

When a hibernation flow is initiated (by external tooling or higher-level operators), the etcd controller can:

Remove etcd’s client ports (2379/2380) from the etcd Service to block application traffic.
Trigger an on-demand full snapshot via an EtcdOperatorTask.
Scale down the StatefulSet replicas to zero, provided the previous snapshot step is successful.
Create the ExtendFullSnapshotImmutabilityTask if etcd.spec.backup.store.immutability.retentionType is "bucket" and based on etcd.spec.backup.fullSnapshotSchedule.

Operator Task Enhancements

The ExtendFullSnapshotImmutabilityTask will create a cron job that:

Runs etcdbrctl extend-snapshot-immutability --gc-from-timestamp=<creation timestamp of task> to preserve the immutability period of the most recent snapshot. This command re-uploads the latest snapshot, effectively resetting its immutability period. Additionally, it removes any snapshots that have become mutable after the creation timestamp of the task.

By periodically re-uploading (extending) the latest snapshot during hibernation, the authors ensure that the immutability period is extended, and the backups remain protected throughout the hibernation period.

Lifecycle of `ExtendFullSnapshotImmutabilityTask`

The ExtendFullSnapshotImmutabilityTask is active during hibernation and is automatically managed by the etcd-controller. Its lifecycle is tied to the cluster’s hibernation state:

Task Creation
- When the etcd cluster enters hibernation (e.g., scaling down to zero replicas), the etcd controller:
  - Triggers a final full snapshot.
  - Creates the ExtendFullSnapshotImmutabilityTask to run extend-snapshot-immutability --gc-from-timestamp=<creation timestamp of this task>
Task Deletion
- When the cluster resumes from hibernation (scales up to non-zero replicas), the controller:
  - Deletes the ExtendFullSnapshotImmutabilityTask to stop extending snapshots.
  - Resumes the normal backup schedule defined in spec.backup.fullSnapshotSchedule.
- When the immutability configuration is removed from the etcd.spec.backup.store, the controller:
  - Deletes the ExtendFullSnapshotImmutabilityTask to stop extending snapshots.

Example Task Config

type ExtendFullSnapshotImmutabilityTaskConfig struct {
  // Schedule defines a cron schedule (e.g., "0 0 * * *").
  Schedule *string `json:"schedule,omitempty"`
}

Sample YAML:

spec:
  config:
    schedule: "0 0 * * *"

Disabling Immutability

If you remove the immutability configuration from etcd.spec.backup.store, hibernation-based immutability support no longer applies. However, once the bucket itself is locked at the provider level, it cannot be reverted to a mutable state. Any objects uploaded by etcd-backup-restore are still subject to the existing WORM policy.

If you genuinely require a mutable backup again, the recommended approach is:

Use a new bucket. In your Etcd custom resource, reference a different bucket that does not have immutability enabled.
Use EtcdCopyBackupsTask. If you want to start the cluster with a new bucket but retain old data, use the EtcdCopyBackupsTask to copy existing backups from the old immutable bucket to the new mutable bucket.
Reconcile the Etcd CR. After pointing etcd.spec.backup.store to the new bucket, etcd-druid will start storing backups there.

Note: Existing snapshots in the old immutable bucket remain locked according to the configured immutability period.

Compatibility

These changes are compatible with existing etcd clusters and current backup processes.

Backward Compatibility:
- Clusters without immutable buckets continue to function without any changes.
Forward Compatibility:
- Clusters can opt in to use immutable backups by configuring the bucket accordingly (as described in Configuring Immutable Backups) and setting etcd.spec.backup.store.immutability.retentionType == "bucket".
- The enhanced hibernation logic in the etcd controller is additive, meaning it does not interfere with existing workflows.

Impact for Operators

In scenarios where you want to exclude certain snapshots from an etcd restore, you previously could simply delete them from object storage. However, when bucket-level immutability is enabled, deleting existing immutable snapshots is no longer possible. To address this need, most cloud providers allow adding custom annotations or tags to objects—even immutable ones—so they can be logically excluded without physically removing them.

etcd-backup-restore supports ignoring snapshots based on annotations or tags, rather than deleting them. Operators can add the following key-value pair to any snapshot object to exclude it from future restores:

Key: x-etcd-snapshot-exclude
Value: true

Because these tags or annotations do not modify the underlying snapshot data, they are permissible even for immutable objects. Once these annotations are in place, etcd-backup-restore will detect them and skip the tagged snapshots during restoration, thus preventing unwanted snapshots from being used. For more details, see the Ignoring Snapshots during Restoration.

Note
At the time of writing this proposal, this feature is not supported for AWS S3 buckets.

References

21 - Local e2e Tests

e2e Test Suite

Developers can run extended e2e tests, in addition to unit tests, for Etcd-Druid in or from their local environments. This is recommended to verify the desired behavior of several features and to avoid regressions in future releases.

The very same tests typically run as part of the component’s release job as well as on demand, e.g., when triggered by Etcd-Druid maintainers for open pull requests.

Testing Etcd-Druid automatically involves a certain test coverage for gardener/etcd-backup-restore which is deployed as a side-car to the actual etcd container.

Prerequisites

The e2e test lifecycle is managed with the help of skaffold. Every involved step like setup, deploy, undeploy or cleanup is executed against a Kubernetes cluster which makes it a mandatory prerequisite at the same time. Only skaffold itself with involved docker, helm and kubectl executions as well as the e2e-tests are executed locally. Required binaries are automatically downloaded if you use the corresponding make target, as described in this document.

It’s expected that especially the deploy step is run against a Kubernetes cluster which doesn’t contain an Druid deployment or any left-overs like druid.gardener.cloud CRDs. The deploy step will likely fail in such scenarios.

Tip: Create a fresh KinD cluster or a similar one with a small footprint before executing the tests.

Providers

The following providers are supported for e2e tests:

AWS
Azure
GCP
Local

Valid credentials need to be provided when tests are executed with mentioned cloud providers.

Flow

An e2e test execution involves the following steps:

Step	Description
`setup`	Create a storage bucket which is used for etcd backups (only with cloud providers).
`deploy`	Build Docker image, upload it to registry (if remote cluster - see Docker build), deploy Helm chart (`charts/druid`) to Kubernetes cluster.
`test`	Execute e2e tests as defined in `test/e2e`.
`undeploy`	Remove the deployed artifacts from Kubernetes cluster.
`cleanup`	Delete storage bucket and Druid deployment from test cluster.

Make target

Executing e2e-tests is as easy as executing the following command with defined Env-Vars as desribed in the following section and as needed for your test scenario.

make test-e2e

Common Env Variables

The following environment variables influence how the flow described above is executed:

PROVIDERS: Providers used for testing (all, aws, azure, gcp, local). Multiple entries must be comma separated.
Note: Some tests will use very first entry from env PROVIDERS for e2e testing (ex: multi-node tests). So for multi-node tests to use specific provider, specify that provider as first entry in env PROVIDERS.
KUBECONFIG: Kubeconfig pointing to cluster where Etcd-Druid will be deployed (preferably KinD).
TEST_ID: Some ID which is used to create assets for and during testing.
STEPS: Steps executed by make target (setup, deploy, test, undeploy, cleanup - default: all steps).

AWS Env Variables

AWS_ACCESS_KEY_ID: Key ID of the user.
AWS_SECRET_ACCESS_KEY: Access key of the user.
AWS_REGION: Region in which the test bucket is created.

Example:

make \
  AWS_ACCESS_KEY_ID="abc" \
  AWS_SECRET_ACCESS_KEY="xyz" \
  AWS_REGION="eu-central-1" \
  KUBECONFIG="$HOME/.kube/config" \
  PROVIDERS="aws" \
  TEST_ID="some-test-id" \
  STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e

Azure Env Variables

STORAGE_ACCOUNT: Storage account used for managing the storage container.
STORAGE_KEY: Key of storage account.

Example:

make \
  STORAGE_ACCOUNT="abc" \
  STORAGE_KEY="eHl6Cg==" \
  KUBECONFIG="$HOME/.kube/config" \
  PROVIDERS="azure" \
  TEST_ID="some-test-id" \
  STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e

GCP Env Variables

GCP_SERVICEACCOUNT_JSON_PATH: Path to the service account json file used for this test.
GCP_PROJECT_ID: ID of the GCP project.

Example:

make \
  GCP_SERVICEACCOUNT_JSON_PATH="/var/lib/secrets/serviceaccount.json" \
  GCP_PROJECT_ID="xyz-project" \
  KUBECONFIG="$HOME/.kube/config" \
  PROVIDERS="gcp" \
  TEST_ID="some-test-id" \
  STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e

Local Env Variables

No special environment variables are required for running e2e tests with Local provider.

Example:

make \
  KUBECONFIG="$HOME/.kube/config" \
  PROVIDERS="local" \
  TEST_ID="some-test-id" \
  STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e

e2e test with local storage emulators [AWS, GCP, AZURE]

The above-mentioned e2e tests need storage from real cloud providers to be setup. But there are tools such as localstack, fake-gcs-server and azurite that enable developers to run e2e tests with emulators for AWS S3, Google GCS and Azure ABS respectively. We can provision a KIND cluster for running the e2e tests against. Thus, by using cloud storage emulators alongside the KIND cluster, we eliminate the need for any actual cloud provider infrastructure to be set up for running e2e tests.

How are the KIND cluster and Emulators set up

KIND or Kubernetes-In-Docker is a kubernetes cluster that is set up inside a docker container. This cluster is with limited capability as it does not have much compute power. But this cluster can easily be setup inside a container and can be tear down easily just by removing a container. That’s why KIND cluster is very easy to use for e2e tests. Makefile command helps to spin up a KIND cluster and use the cluster to run e2e tests.

Localstack setup

There is a docker image for localstack. The image is deployed as pod inside the KIND cluster through hack/e2e-test/infrastructure/localstack/localstack.yaml. Makefile takes care of deploying the yaml file in a KIND cluster.

The developer needs to run make ci-e2e-kind command. This command in turn runs hack/ci-e2e-kind.sh which spin up the KIND cluster and deploy localstack in it and then run the e2e tests using localstack as mock AWS storage provider. e2e tests are actually run on host machine but deploy the druid controller inside KIND cluster. Druid controller spawns multinode etcd clusters inside KIND cluster. e2e tests verify whether the druid controller performs its jobs correctly or not. Mock localstack storage is cleaned up after every e2e tests. That’s why the e2e tests need to access the localstack pod running inside KIND cluster. The network traffic between host machine and localstack pod is resolved via mapping localstack pod port to host port while setting up the KIND cluster via hack/e2e-test/infrastructure/kind/cluster.yaml

How to execute e2e tests with localstack and KIND cluster

Run the following make command to spin up a KinD cluster, deploy localstack and run the e2e tests with provider aws:

make ci-e2e-kind

Fake-GCS-Server setup

Fake-gcs-server is run inside a pod using this docker image in a KIND cluster.

The user needs to run make ci-e2e-kind-gcs to start the e2e tests for druid with GCS emulator as the object storage for etcd backups. The above command internally runs the script hack/ci-e2e-kind-gcs.sh which initializes the setup with required steps before going on to create a KIND cluster and deploy fakegcs in it and use that emulator to run e2e tests.

The fake-gcs-server running inside the pod serves HTTP requests at port-8000 and HTTPS requests at port-4443. As the e2e tests runs on the host machine while the emulator runs on KIND, both ports i.e 8000 & 4443 needs to be port-forwarded from the host machine to fake-gcs service running inside the KIND cluster. The port forwardings is defined in the hack/kind-up.sh file which is used to setup the KIND cluster.

How to execute e2e tests with fake-gcs-server and KIND cluster

Run the following make command to spin up a KinD cluster, deploy fakegcs and run the e2e tests with provider gcp:

make ci-e2e-kind-gcs

Azurite setup

Azurite is run inside a pod using this docker image (tag:latest) in a KIND cluster.

The user needs to run make ci-e2e-kind-azure to start the e2e tests for druid with Azurite as the object storage for etcd backups. The above command internally runs the script hack/ci-e2e-kind-azure.sh which initializes the setup with required steps before going on to create a KIND cluster and deploy Azurite in it and use that emulator to run e2e tests.

The azurite running inside the pod serves HTTP requests at port 10000. As the e2e tests runs on the host machine while the emulator runs on KIND cluster, the port 10000 needs to be port-forwarded from the host machine to azurite service running inside the KIND cluster. The port forwardings is defined in the hack/kind-up.sh file which is used to setup the KIND cluster.

How to execute e2e tests with Azurite and KIND cluster

Run the following make command to spin up a KinD cluster, deploy Azurite and run the e2e tests with provider azure:

make ci-e2e-kind-azure

22 - Manage Azurite Emulator

Manage Azure Blob Storage Emulator

This document is a step-by-step guide on how to configure, deploy and cleanup Azurite, the Azure Blob Storage emulator, within a kind cluster. This setup is ideal for local development and testing.

00-Prerequisites

Ensure that you have setup the development environment as per the documentation.

Note: It is assumed that you have already created kind cluster and the KUBECONFIG is pointing to this Kubernetes cluster.

Installing Azure CLI

To interact with Azurite you must also install the Azure CLI (version >=2.55.0) On macOS run:

brew install azure-cli

For other OS, please check the Azure CLI installation documentation.

01-Deploy Azurite

make deploy-azurite

The above make target will deploy Azure emulator in the target Kubernetes cluster.

02-Setup ABS Container

We will be using the azure-cli to create an ABS container. Export the connection string to enable azure-cli to connect to Azurite emulator.

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"

To create an Azure Blob Storage Container in Azurite, run the following command:

az storage container create -n <container-name>

03-Configure Secret

Connection details for an Azure Object Store Container are put into a Kubernetes Secret. Apply the Kubernetes Secret manifest through:

kubectl apply -f examples/objstore-emulator/etcd-secret-azurite.yaml

Note: The secret created should be referred to in the Etcd CR in spec.backup.store.secretRef.

04-Cleanup

To clean the setup, unset the environment variable set in step-03 above and delete the Azurite deployment:

unset AZURE_STORAGE_CONNECTION_STRING
kubectl delete -f ./hack/e2e-test/infrastructure/azurite/azurite.yaml

23 - Manage Gcs Emulator

Manage GCS Emulator

This document is a step-by-step guide on how to configure, deploy and cleanup GCS Emulator, within a kind cluster. GCS Emulator emulates Google Cloud Storage locally, which allows the Etcd cluster to interact with GCS. This setup is ideal for local development and testing.

00-Prerequisites

Ensure that you have setup the development environment as per the documentation.

Note: It is assumed that you have already created kind cluster and the KUBECONFIG is pointing to this Kubernetes cluster.

Installing gsutil

To interact with GCS Emulator you must also install the gsutil utility. Follow the instructions here to install gsutil.

01-Deploy FakeGCS

Deploy FakeGCS onto the Kubernetes cluster using the command below:

make deploy-fakegcs

02-Setup GCS Bucket

To create a GCS bucket for Etcd backup purposes, execute the following command:

gsutil -o "Credentials:gs_json_host=127.0.0.1" -o "Credentials:gs_json_port=4443" -o "Boto:https_validate_certificates=False" mb "gs://etcd-bucket"

03-Configure Secret

Connection details for a GCS Object Store are put into a Kubernetes Secret. Apply the Kubernetes Secret manifest through:

kubectl apply -f examples/objstore-emulator/etcd-secret-fakegcs.yaml

Note: The secret created should be referred to in the Etcd CR in spec.backup.store.secretRef.

04-Cleanup

To clean the setup, delete the FakeGCS deployment:

kubectl delete -f ./hack/e2e-test/infrastructure/fake-gcs-server/fake-gcs-server.yaml

24 - Manage S3 Emulator

Manage S3 Emulator

This document is a step-by-step guide on how to configure, deploy and cleanup LocalStack, within a kind cluster. LocalStack emulates AWS services locally, which allows the Etcd cluster to interact with AWS S3. This setup is ideal for local development and testing.

00-Prerequisites

Ensure that you have setup the development environment as per the documentation.

Note: It is assumed that you have already created kind cluster and the KUBECONFIG is pointing to this Kubernetes cluster.

Installing AWS CLI

To interact with LocalStack you must also install the AWS CLI (version >=1.29.0 or version >=2.13.0) On macOS run:

brew install awscli

For other OS, please check the AWS CLI installation documentation.

01-Deploy LocalStack

make deploy-localstack

The above make target will deploy LocalStack in the target Kubernetes cluster.

02-Setup S3 Bucket

Configure AWS CLI to interact with LocalStack by setting the necessary environment variables. This configuration redirects S3 commands to the LocalStack endpoint and provides the required credentials for authentication.

export AWS_ENDPOINT_URL_S3="http://localhost:4566"
export AWS_ACCESS_KEY_ID=ACCESSKEYAWSUSER
export AWS_SECRET_ACCESS_KEY=sEcreTKey
export AWS_DEFAULT_REGION=us-east-2

Create a S3 bucket using the following command:

aws s3api create-bucket --bucket <bucket-name> --region <region> --create-bucket-configuration LocationConstraint=<region> --acl private

To verify if the bucket has been created, you can use the following command:

aws s3api head-bucket --bucket <bucket-name>

03-Configure Secret

Connection details for an Azure S3 Object Store are put into a Kubernetes Secret. Apply the Kubernetes Secret manifest through:

kubectl apply -f examples/objstore-emulator/etcd-secret-localstack.yaml

Note: The secret created should be referred to in the Etcd CR in spec.backup.store.secretRef.

04-Cleanup

To clean the setup,, unset the environment variable set in step-03 above and delete the LocalStack deployment:

unset AWS_ENDPOINT_URL_S3 AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION
kubectl delete -f ./hack/e2e-test/infrastructure/localstack/localstack.yaml

25 - Managing Etcd Clusters

Managing ETCD Clusters

Create an Etcd Cluster

Creating an Etcd cluster can be done either by explicitly creating a manifest file or it can also be done programmatically. You can refer to and/or modify any sample Etcd manifest to create an etcd cluster. In order to programmatically create an Etcd cluster you can refer to the Golang API to create an Etcd custom resource and using a k8s client you can apply an instance of a Etcd custom resource targetting any namespace in a k8s cluster.

Prior to v0.23.0 version of etcd-druid, after creating an Etcd custom resource, you will have to annotate the resource with gardener.cloud/operation=reconcile in order to trigger a reconciliation for the newly created Etcd resource. Post v0.23.0 version of etcd-druid, there is no longer any need to explicitly trigger reconciliations for creating new Etcd clusters.

Track etcd cluster creation

In order to track the progress of creation of etcd cluster resources you can do the following:

status.lastOperation can be monitored to check the status of reconciliation.
Additional printer columns have been defined for Etcd custom resource. You can execute the following command to know if an Etcd cluster is ready/quorate.

kubectl get etcd <etcd-name> -n <namespace> -owide
  # you will see additional columns which will indicate the state of an etcd cluster
  NAME        READY   QUORATE   ALL MEMBERS READY   BACKUP READY   AGE    CLUSTER SIZE   CURRENT REPLICAS   READY REPLICAS
  etcd-main   true    True      True                True           235d   3              3                  3

You can additional monitor all etcd cluster resources that are created for every etcd cluster.
For etcd-druid version <v0.23.0 use the following command:

kubectl get all,cm,role,rolebinding,lease,sa -n <namespace> --selector=instance=<etcd-name>

For etcd-druid version >=v0.23.0 use the following command:

kubectl get all,cm,role,rolebinding,lease,sa -n <namespace> --selector=app.kubernetes.io/managed-by=etcd-druid,app.kubernetes.io/part-of=<etcd-name>

Update & Reconcile an Etcd Cluster

Edit the Etcd custom resource

To update an etcd cluster, you should usually only be updating the Etcd custom resource representing the etcd cluster. You can make changes to the existing Etcd resource by invoking the following command:

kubectl edit etcd <etcd-name> -n <namespace>

This will open up the linked editor where you can make the edits.

Scale the Etcd cluster horizontally

To scale an etcd cluster horizontally, you can update the spec.replicas field in the Etcd custom resource. For example, to scale an etcd cluster to 5 replicas, you can run:

kubectl patch etcd <etcd-name> -n <namespace> --type merge -p '{"spec":{"replicas":5}}'

Alternatively, you can use the /scale subresource to scale an etcd cluster horizontally. For example, to scale an etcd cluster to 5 replicas, you can run:

kubectl scale etcd <etcd-name> -n <namespace> --replicas=5

!!! note While an Etcd cluster can be scaled out, it cannot be scaled in, i.e., the replicas cannot be decreased to a non-zero value. An Etcd cluster can still be scaled to 0 replicas, indicating that the cluster is to be “hibernated”. This is beneficial for use-cases where an etcd cluster is not needed for a certain period of time, and the user does not want to pay for the compute resources. A hibernated etcd cluster can be resumed later by scaling it back to a non-zero value. Please note that the data volumes backing an Etcd cluster will be retained during hibernation, and will still be charged for.

Scale the Etcd cluster vertically

To scale an Etcd cluster vertically, you can update the spec.etcd.resources and spec.backup.resources fields in the Etcd custom resource. For example, to scale an Etcd cluster’s etcd container to 2 CPU and 4Gi memory, you can run:

kubectl patch etcd <etcd-name> -n <namespace> --type merge -p '{"spec":{"etcd":{"resources":{"requests":{"cpu":"2","memory":"4Gi"}}}}}'

Additionally, you can use an external pod autoscaler such as vertical pod autoscaler to scale an Etcd cluster vertically. This is made possible by the /scale subresource on the Etcd custom resource.

!!! note Scaling an Etcd cluster vertically can potentially result in downtime, based on its failure tolerance. This is due to the fact that an etcd cluster is actuated by pods, and the pods need to be restarted in order to apply the new resource values. A multi-member etcd cluster can tolerate a failure of one member, but if the number of replicas is set to 1, then the etcd cluster will not be able to tolerate any failures. Therefore, it is recommended to scale a single-member etcd cluster vertically only when it is deemed necessary, and to monitor the cluster closely during the scaling process. A multi-member Etcd cluster can be scaled vertically, one member at a time. A pod disruption budget for this can ensure that the cluster can tolerate a certain number of failures. This pod disruption budget is automatically deployed and managed by etcd-druid for multi-member Etcd clusters, and it is recommended to not modify this PDB manually.

!!! note While the replicas and resources in an Etcd resource spec can be modified, please ensure to read the next section to understand when and how these changes are reconciled by etcd-druid.

Reconcile

There are two ways to control reconciliation of any changes done to Etcd custom resources.

Auto reconciliation

If etcd-druid has been deployed with auto-reconciliation then any change done to an Etcd resource will be automatically reconciled. You use --enable-etcd-spec-auto-reconcile CLI flag to enable auto-reconciliation of the etcd spec.

For a complete list of CLI args you can see this document.

Explicit reconciliation

If --enable-etcd-spec-auto-reconcile is set to false or not set at all, then any change to an Etcd resource will not be automatically reconciled. To trigger a reconcile you must set the following annotation on the Etcd resource:

kubectl annotate etcd <etcd-name> gardener.cloud/operation=reconcile -n <namespace>

This option is sometimes recommeded as you would like avoid auto-reconciliation of accidental changes to Etcd resources outside the maintenance time window, thus preventing a potential transient quorum loss due to misconfiguration, attach-detach issues of persistent volumes etc.

Overwrite Container OCI Images

To find out image versions of etcd-backup-restore and etcd-wrapper used by a specific version of etcd-druid one way is look for the image versions in images.yaml. There are times that you might wish to override these images that come bundled with etcd-druid. There are two ways in which you can do that:

Option #1 We leverage Overwrite ImageVector facility provided by gardener. This capability can be used without bringing in gardener as well. To illustrate this in context of etcd-druid you will create a ConfigMap with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-druid-images-overwrite
  namespace: <etcd-druid-namespace>
data:
  images_overwrite.yaml: |
    images:
    - name: etcd-backup-restore
      sourceRepository: github.com/gardener/etcd-backup-restore
      repository: <your-own-custom-etcd-backup-restore-repo-url>
      tag: "v<custom-tag>"
    - name: etcd-wrapper
      sourceRepository: github.com/gardener/etcd-wrapper
      repository: <your-own-custom-etcd-wrapper-repo-url>
      tag: "v<custom-tag>"
    - name: alpine
      repository: <your-own-custom-alpine-repo-url>
      tag: "v<custom-tag>"

You can use images.yaml as a reference to create the overwrite images YAML ConfigMap.

Edit the etcd-druid Deployment with:

Mount the ConfigMap
Set IMAGEVECTOR_OVERWRITE environment variable whose value must be the path you choose to mount the ConfigMap.

To illustrate the changes you can see the following etcd-druid Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-druid
  namespace: <etcd-druid-namespace>
spec:
  template:
    spec:
      containers:
      - name: etcd-druid
        env:
        - name: IMAGEVECTOR_OVERWRITE
          value: /imagevector-overwrite/images_overwrite.yaml
        volumeMounts:
        - name: etcd-druid-images-overwrite
          mountPath: /imagevector-overwrite
      volumes:
      - name: etcd-druid-images-overwrite
        configMap:
          name: etcd-druid-images-overwrite

!!! info Image overwrites specified in the mounted ConfigMap will be respected by successive reconciliations for this Etcd custom resource.

Option #2

We provide a generic way to suspend etcd cluster reconciliation via etcd-druid, allowing a human operator to take control. This option should be excercised only in case of troubleshooting or quick fixes which are not possible to do via the reconciliation loop in etcd-druid. However one of the use cases to use this option is to perhaps update the container image to apply a hot patch and speed up recovery of an etcd cluster.

Manually modify individual etcd cluster resources

etcd cluster resources are managed by etcd-druid and since v0.23.0 version of etcd-druid any changes to these managed resources are protected via a validating webhook. You can find more information about this webhook here. To be able to manually modify etcd cluster managed resources two things needs to be done:

Annotate the target Etcd resource suspending any reconciliation by etcd-druid. You can do this by invoking the following command:

   kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/suspend-etcd-spec-reconcile=

Add another annotation to the target Etcd resource disabling managed resource protection via the webhook. You can do this by invoking the following command:

   kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/disable-etcd-component-protection=

Now you are free to make changes to any managed etcd cluster resource.

!!! note As long as the above two annotations are there, no reconciliation will be done for this etcd cluster by etcd-druid. Therefore it is essential that you remove this annotations eventually.

26 - Metrics

Monitoring

etcd-druid uses Prometheus for metrics reporting. The metrics can be used for real-time monitoring and debugging of compaction jobs.

The simplest way to see the available metrics is to cURL the metrics endpoint /metrics. The format is described here.

Follow the Prometheus getting started doc to spin up a Prometheus server to collect etcd metrics.

The naming of metrics follows the suggested Prometheus best practices. All compaction related metrics are put under namespace etcddruid and the respective subsystems.

Snapshot Compaction

These metrics provide information about the compaction jobs that run after some interval in shoot control planes. Studying the metrics, we can deduce how many compaction job ran successfully, how many failed, how many delta events compacted etc.

Name	Description	Type
etcddruid_compaction_jobs_total	Total number of compaction jobs initiated by compaction controller.	Counter
etcddruid_compaction_jobs_current	Number of currently running compaction job.	Gauge
etcddruid_compaction_job_duration_seconds	Total time taken in seconds to finish a running compaction job.	Histogram
etcddruid_compaction_num_delta_events	Total number of etcd events to be compacted by a compaction job.	Gauge

There are two labels for etcddruid_compaction_jobs_total metrics. The label succeeded shows how many of the compaction jobs are succeeded and label failed shows how many of compaction jobs are failed.

There are two labels for etcddruid_compaction_job_duration_seconds metrics. The label succeeded shows how much time taken by a successful job to complete and label failed shows how much time taken by a failed compaction job.

etcddruid_compaction_jobs_current metric comes with label etcd_namespace that indicates the namespace of the Etcd running in the control plane of a shoot cluster..

Etcd

These metrics are exposed by the etcd process that runs in each etcd pod.

The following list metrics is applicable to clustering of a multi-node etcd cluster. The full list of metrics exposed by etcd is available here.

No.	Metrics Name	Description	Comments
1	etcd_disk_wal_fsync_duration_seconds	latency distributions of fsync called by WAL.	High disk operation latencies indicate disk issues.
2	etcd_disk_backend_commit_duration_seconds	latency distributions of commit called by backend.	High disk operation latencies indicate disk issues.
3	etcd_server_has_leader	whether or not a leader exists. 1: leader exists, 0: leader not exists.	To capture quorum loss or to check the availability of etcd cluster.
4	etcd_server_is_leader	whether or not this member is a leader. 1 if it is, 0 otherwise.
5	etcd_server_leader_changes_seen_total	number of leader changes seen.	Helpful in fine tuning the zonal cluster like etcd-heartbeat time etc, it can also indicates the etcd load and network issues.
6	etcd_server_is_learner	whether or not this member is a learner. 1 if it is, 0 otherwise.
7	etcd_server_learner_promote_successes	total number of successful learner promotions while this member is leader.	Might be helpful in checking the success of API calls called by backup-restore.
8	etcd_network_client_grpc_received_bytes_total	total number of bytes received from grpc clients.	Client Traffic In.
9	etcd_network_client_grpc_sent_bytes_total	total number of bytes sent to grpc clients.	Client Traffic Out.
10	etcd_network_peer_sent_bytes_total	total number of bytes sent to peers.	Useful for network usage.
11	etcd_network_peer_received_bytes_total	total number of bytes received from peers.	Useful for network usage.
12	etcd_network_active_peers	current number of active peer connections.	Might be useful in detecting issues like network partition.
13	etcd_server_proposals_committed_total	total number of consensus proposals committed.	A consistently large lag between a single member and its leader indicates that member is slow or unhealthy.
14	etcd_server_proposals_pending	current number of pending proposals to commit.	Pending proposals suggests there is a high client load or the member cannot commit proposals.
15	etcd_server_proposals_failed_total	total number of failed proposals seen.	Might indicates downtime caused by a loss of quorum.
16	etcd_server_proposals_applied_total	total number of consensus proposals applied.	Difference between etcd_server_proposals_committed_total and etcd_server_proposals_applied_total should usually be small.
17	etcd_mvcc_db_total_size_in_bytes	total size of the underlying database physically allocated in bytes.
18	etcd_server_heartbeat_send_failures_total	total number of leader heartbeat send failures.	Might be helpful in fine-tuning the cluster or detecting slow disk or any network issues.
19	etcd_network_peer_round_trip_time_seconds	round-trip-time histogram between peers.	Might be helpful in fine-tuning network usage specially for zonal etcd cluster.
20	etcd_server_slow_apply_total	total number of slow apply requests.	Might indicate overloaded from slow disk.
21	etcd_server_slow_read_indexes_total	total number of pending read indexes not in sync with leader’s or timed out read index requests.

The full list of metrics is available here.

Etcd-Backup-Restore

These metrics are exposed by the etcd-backup-restore container in each etcd pod.

The following list metrics is applicable to clustering of a multi-node etcd cluster. The full list of metrics exposed by etcd-backup-restore is available here.

No.	Metrics Name	Description
1.	etcdbr_cluster_size	to capture the scale-up/scale-down scenarios.
2.	etcdbr_is_learner	whether or not this member is a learner. 1 if it is, 0 otherwise.
3.	etcdbr_is_learner_count_total	total number times member added as the learner.
4.	etcdbr_restoration_duration_seconds	total latency distribution required to restore the etcd member.
5.	etcdbr_add_learner_duration_seconds	total latency distribution of adding the etcd member as a learner to the cluster.
6.	etcdbr_member_remove_duration_seconds	total latency distribution removing the etcd member from the cluster.
7.	etcdbr_member_promote_duration_seconds	total latency distribution of promoting the learner to the voting member.
8.	etcdbr_defragmentation_duration_seconds	total latency distribution of defragmentation of each etcd cluster member.

Prometheus supplied metrics

The Prometheus client library provides a number of metrics under the go and process namespaces.

27 - operator out-of-band tasks

DEP-05: Operator Out-of-band Tasks

Summary

This DEP proposes an enhancement to etcd-druid’s capabilities to handle out-of-band tasks, which are presently performed manually or invoked programmatically via suboptimal APIs. The document proposes the establishment of a unified interface by defining a well-structured API to harmonize the initiation of any out-of-band task, monitor its status, and simplify the process of adding new tasks and managing their lifecycles.

Terminology

etcd-druid: etcd-druid is an operator to manage the etcd clusters.
backup-sidecar: It is the etcd-backup-restore sidecar container running in each etcd-member pod of etcd cluster.
leading-backup-sidecar: A backup-sidecar that is associated to an etcd leader of an etcd cluster.
out-of-band task: Any on-demand tasks/operations that can be executed on an etcd cluster without modifying the Etcd custom resource spec (desired state).

Motivation

Today, etcd-druid mainly acts as an etcd cluster provisioner (creation, maintenance and deletion). In future, capabilities of etcd-druid will be enhanced via etcd-member proposal by providing it access to much more detailed information about each etcd cluster member. While we enhance the reconciliation and monitoring capabilities of etcd-druid, it still lacks the ability to allow users to invoke out-of-band tasks on an existing etcd cluster.

There are new learnings while operating etcd clusters at scale. It has been observed that we regularly need capabilities to trigger out-of-band tasks which are outside of the purview of a regular etcd reconciliation run. Many of these tasks are multi-step processes, and performing them manually is error-prone, even if an operator follows a well-written step-by-step guide. Thus, there is a need to automate these tasks. Some examples of an on-demand/out-of-band tasks:

Recover from a permanent quorum loss of etcd cluster.
Trigger an on-demand full/delta snapshot.
Trigger an on-demand snapshot compaction.
Trigger an on-demand maintenance of etcd cluster.
Copy the backups from one object store to another object store.

Goals

Establish a unified interface for operator tasks by defining a single dedicated custom resource for out-of-band tasks.
Define a contract (in terms of prerequisites) which needs to be adhered to by any task implementation.
Facilitate the easy addition of new out-of-band task(s) through this custom resource.
Provide CLI capabilities to operators, making it easy to invoke supported out-of-band tasks.

Non-Goals

In the current scope, capability to abort/suspend an out-of-band task is not going to be provided. This could be considered as an enhancement based on pull.
Ordering (by establishing dependency) of out-of-band tasks submitted for the same etcd cluster has not been considered in the first increment. In a future version based on how operator tasks are used, we will enhance this proposal and the implementation.

Proposal

Authors propose creation of a new single dedicated custom resource to represent an out-of-band task. Etcd-druid will be enhanced to process the task requests and update its status which can then be tracked/observed.

Custom Resource Golang API

EtcdOperatorTask is the new custom resource that will be introduced. This API will be in v1alpha1 version and will be subject to change. We will be respecting Kubernetes Deprecation Policy.

// EtcdOperatorTask represents an out-of-band operator task resource.
type EtcdOperatorTask struct {
  metav1.TypeMeta
  metav1.ObjectMeta

  // Spec is the specification of the EtcdOperatorTask resource.
  Spec EtcdOperatorTaskSpec `json:"spec"`
  // Status is most recently observed status of the EtcdOperatorTask resource.
  Status EtcdOperatorTaskStatus `json:"status,omitempty"`
}

Spec

The authors propose that the following fields should be specified in the spec (desired state) of the EtcdOperatorTask custom resource.

To capture the type of out-of-band operator task to be performed, .spec.type field should be defined. It can have values from all supported out-of-band tasks eg. “OnDemandSnaphotTask”, “QuorumLossRecoveryTask” etc.
To capture the configuration specific to each task, a .spec.config field should be defined of type string as each task can have different input configuration.

// EtcdOperatorTaskSpec is the spec for a EtcdOperatorTask resource.
type EtcdOperatorTaskSpec struct {
  
  // Type specifies the type of out-of-band operator task to be performed. 
  Type string `json:"type"`

  // Config is a task specific configuration.
  Config string `json:"config,omitempty"`

  // TTLSecondsAfterFinished is the time-to-live to garbage collect the 
  // related resource(s) of task once it has been completed.
  // +optional
  TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`

  // OwnerEtcdReference refers to the name and namespace of the corresponding 
  // Etcd owner for which the task has been invoked.
  OwnerEtcdRefrence types.NamespacedName `json:"ownerEtcdRefrence"`
}

Status

The authors propose the following fields for the Status (current state) of the EtcdOperatorTask custom resource to monitor the progress of the task.

// EtcdOperatorTaskStatus is the status for a EtcdOperatorTask resource.
type EtcdOperatorTaskStatus struct {
  // ObservedGeneration is the most recent generation observed for the resource.
  ObservedGeneration *int64 `json:"observedGeneration,omitempty"`
  // State is the last known state of the task.
  State TaskState `json:"state"`
  // Time at which the task has moved from "pending" state to any other state.
  InitiatedAt metav1.Time `json:"initiatedAt"`
  // LastError represents the errors when processing the task.
  // +optional
  LastErrors []LastError `json:"lastErrors,omitempty"`
  // Captures the last operation status if task involves many stages.
  // +optional
  LastOperation *LastOperation `json:"lastOperation,omitempty"`
}

type LastOperation struct {
  // Name of the LastOperation.
  Name opsName `json:"name"`
  // Status of the last operation, one of pending, progress, completed, failed.
  State OperationState `json:"state"`
  // LastTransitionTime is the time at which the operation state last transitioned from one state to another.
  LastTransitionTime metav1.Time `json:"lastTransitionTime"`
  // A human readable message indicating details about the last operation.
  Reason string `json:"reason"`
}

// LastError stores details of the most recent error encountered for the task.
type LastError struct {
  // Code is an error code that uniquely identifies an error.
  Code ErrorCode `json:"code"`
  // Description is a human-readable message indicating details of the error.
  Description string `json:"description"`
  // ObservedAt is the time at which the error was observed.
  ObservedAt metav1.Time `json:"observedAt"`
}

// TaskState represents the state of the task.
type TaskState string

const (
  TaskStateFailed TaskState = "Failed"
  TaskStatePending TaskState = "Pending"
  TaskStateRejected TaskState = "Rejected"
  TaskStateSucceeded TaskState = "Succeeded"
  TaskStateInProgress TaskState = "InProgress"
)

// OperationState represents the state of last operation.
type OperationState string

const (
  OperationStateFailed OperationState = "Failed"
  OperationStatePending OperationState = "Pending"
  OperationStateCompleted OperationState = "Completed"
  OperationStateInProgress OperationState = "InProgress"
)

Custom Resource YAML API

apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdOperatorTask
metadata:
    name: <name of operator task resource>
    namespace: <cluster namespace>
    generation: <specific generation of the desired state>
spec:
    type: <type/category of supported out-of-band task>
    ttlSecondsAfterFinished: <time-to-live to garbage collect the custom resource after it has been completed>
    config: <task specific configuration>
    ownerEtcdRefrence: <refer to corresponding etcd owner name and namespace for which task has been invoked>
status:
    observedGeneration: <specific observedGeneration of the resource>
    state: <last known current state of the out-of-band task>
    initiatedAt: <time at which task move to any other state from "pending" state>
    lastErrors:
    - code: <error-code>
      description: <description of the error>
      observedAt: <time the error was observed>
    lastOperation:
      name: <operation-name>
      state: <task state as seen at the completion of last operation>
      lastTransitionTime: <time of transition to this state>
      reason: <reason/message if any>

Lifecycle

Creation

Task(s) can be created by creating an instance of the EtcdOperatorTask custom resource specific to a task.

Note: In future, either a kubectl extension plugin or a druidctl tool will be introduced. Dedicated sub-commands will be created for each out-of-band task. This will drastically increase the usability for an operator for performing such tasks, as the CLI extension will automatically create relevant instance(s) of EtcdOperatorTask with the provided configuration.

Execution

Authors propose to introduce a new controller which watches for EtcdOperatorTask custom resource.
Each out-of-band task may have some task specific configuration defined in .spec.config.
The controller needs to parse this task specific config, which comes as a string, according to the schema defined for each task.
For every out-of-band task, a set of pre-conditions can be defined. These pre-conditions are evaluated against the current state of the target etcd cluster. Based on the evaluation result (boolean), the task is permitted or denied execution.
If multiple tasks are invoked simultaneously or in pending state, then they will be executed in a First-In-First-Out (FIFO) manner.

Note: Dependent ordering among tasks will be addressed later which will enable concurrent execution of tasks when possible.

Deletion

Upon completion of the task, irrespective of its final state, Etcd-druid will ensure the garbage collection of the task custom resource and any other Kubernetes resources created to execute the task. This will be done according to the .spec.ttlSecondsAfterFinished if defined in the spec, or a default expiry time will be assumed.

Use Cases

Recovery from permanent quorum loss

Recovery from permanent quorum loss involves two phases - identification and recovery - both of which are done manually today. This proposal intends to automate the latter. Recovery today is a multi-step process and needs to be performed carefully by a human operator. Automating these steps would be prudent, to make it quicker and error-free. The identification of the permanent quorum loss would remain a manual process, requiring a human operator to investigate and confirm that there is indeed a permanent quorum loss with no possibility of auto-healing.

Task Config

We do not need any config for this task. When creating an instance of EtcdOperatorTask for this scenario, .spec.config will be set to nil (unset).

Pre-Conditions

There should be a quorum loss in a multi-member etcd cluster. For a single-member etcd cluster, invoking this task is unnecessary as the restoration of the single member is automatically handled by the backup-restore process.
There should not already be a permanent-quorum-loss-recovery-task running for the same etcd cluster.

Trigger on-demand snapshot compaction

Etcd-druid provides a configurable etcd-events-threshold flag. When this threshold is breached, then a snapshot compaction is triggered for the etcd cluster. However, there are scenarios where an ad-hoc snapshot compaction may be required.

Possible Scenarios

If an operator anticipates a scenario of permanent quorum loss, they can trigger an on-demand snapshot compaction to create a compacted full-snapshot. This can potentially reduce the recovery time from a permanent quorum loss.
As an additional benefit, a human operator can leverage the current implementation of snapshot compaction, which internally triggers restoration. Hence, by initiating an on-demand snapshot compaction task, the operator can verify the integrity of etcd cluster backups, particularly in cases of potential backup corruption or re-encryption. The success or failure of this snapshot compaction can offer valuable insights into these scenarios.

Task Config

We do not need any config for this task. When creating an instance of EtcdOperatorTask for this scenario, .spec.config will be set to nil (unset).

Pre-Conditions

There should not be a on-demand snapshot compaction task already running for the same etcd cluster.

Note: on-demand snapshot compaction runs as a separate job in a separate pod, which interacts with the backup bucket and not the etcd cluster itself, hence it doesn’t depend on the health of etcd cluster members.

Trigger on-demand full/delta snapshot

Etcd custom resource provides an ability to set FullSnapshotSchedule which currently defaults to run once in 24 hrs. DeltaSnapshotPeriod is also made configurable which defines the duration after which a delta snapshot will be taken. If a human operator does not wish to wait for the scheduled full/delta snapshot, they can trigger an on-demand (out-of-schedule) full/delta snapshot on the etcd cluster, which will be taken by the leading-backup-restore.

Possible Scenarios

An on-demand full snapshot can be triggered if scheduled snapshot fails due to any reason.
Gardener Shoot Hibernation: Every etcd cluster incurs an inherent cost of preserving the volumes even when a gardener shoot control plane is scaled down, i.e the shoot is in a hibernated state. However, it is possible to save on hyperscaler costs by invoking this task to take a full snapshot before scaling down the etcd cluster, and deleting the etcd data volumes afterwards.
Gardener Control Plane Migration: In gardener, a cluster control plane can be moved from one seed cluster to another. This process currently requires the etcd data to be replicated on the target cluster, so a full snapshot of the etcd cluster in the source seed before the migration would allow for faster restoration of the etcd cluster in the target seed.

Task Config

// SnapshotType can be full or delta snapshot.
type SnapshotType string

const (
  SnapshotTypeFull SnapshotType = "full"
  SnapshotTypeDelta SnapshotType = "delta"
)

type OnDemandSnapshotTaskConfig struct {
  // Type of on-demand snapshot.
  Type SnapshotType `json:"type"`
}

spec:
  config: |
    type: <type of on-demand snapshot>

Pre-Conditions

Etcd cluster should have a quorum.
There should not already be a on-demand snapshot task running with the same SnapshotType for the same etcd cluster.

Trigger on-demand maintenance of etcd cluster

Operator can trigger on-demand maintenance of etcd cluster which includes operations like etcd compaction, etcd defragmentation etc.

Possible Scenarios

If an etcd cluster is heavily loaded, which is causing performance degradation of an etcd cluster, and the operator does not want to wait for the scheduled maintenance window then an on-demand maintenance task can be triggered which will invoke etcd-compaction, etcd-defragmentation etc. on the target etcd cluster. This will make the etcd cluster lean and clean, thus improving cluster performance.

Task Config

type OnDemandMaintenanceTaskConfig struct {
  // MaintenanceType defines the maintenance operations need to be performed on etcd cluster.
  MaintenanceType maintenanceOps `json:"maintenanceType`
}

type maintenanceOps struct {
  // EtcdCompaction if set to true will trigger an etcd compaction on the target etcd.
  // +optional
  EtcdCompaction bool `json:"etcdCompaction,omitempty"`
  // EtcdDefragmentation if set to true will trigger a etcd defragmentation on the target etcd.
  // +optional
  EtcdDefragmentation bool `json:"etcdDefragmentation,omitempty"`
}

spec:
  config: |
    maintenanceType:
      etcdCompaction: <true/false>
      etcdDefragmentation: <true/false>

Pre-Conditions

Etcd cluster should have a quorum.
There should not already be a duplicate task running with same maintenanceType.

Copy Backups Task

Copy the backups(full and delta snapshots) of etcd cluster from one object store(source) to another object store(target).

Possible Scenarios

In Gardener, the Control Plane Migration process utilizes the copy-backups task. This task is responsible for copying backups from one object store to another, typically located in different regions.

Task Config

// EtcdCopyBackupsTaskConfig defines the parameters for the copy backups task.
type EtcdCopyBackupsTaskConfig struct {
  // SourceStore defines the specification of the source object store provider.
  SourceStore StoreSpec `json:"sourceStore"`

  // TargetStore defines the specification of the target object store provider for storing backups.
  TargetStore StoreSpec `json:"targetStore"`

  // MaxBackupAge is the maximum age in days that a backup must have in order to be copied.
  // By default all backups will be copied.
  // +optional
  MaxBackupAge *uint32 `json:"maxBackupAge,omitempty"`

  // MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
  // +optional
  MaxBackups *uint32 `json:"maxBackups,omitempty"`
}

spec:
  config: |
    sourceStore: <source object store specification>
    targetStore: <target object store specification>
    maxBackupAge: <maximum age in days that a backup must have in order to be copied>
    maxBackups: <maximum no. of backups that will be copied>

Note: For detailed object store specification please refer here

Pre-Conditions

There should not already be a copy-backups task running.

Note: copy-backups-task runs as a separate job, and it operates only on the backup bucket, hence it doesn’t depend on health of etcd cluster members.

Note: copy-backups-task has already been implemented and it’s currently being used in Control Plane Migration but copy-backups-task will be harmonized with EtcdOperatorTask custom resource.

Metrics

Authors proposed to introduce the following metrics:

etcddruid_operator_task_duration_seconds : Histogram which captures the runtime for each etcd operator task. Labels:
- Key: type, Value: all supported tasks
- Key: state, Value: One-Of {failed, succeeded, rejected}
- Key: etcd, Value: name of the target etcd resource
- Key: etcd_namespace, Value: namespace of the target etcd resource
etcddruid_operator_tasks_total: Counter which counts the number of etcd operator tasks. Labels:
- Key: type, Value: all supported tasks
- Key: state, Value: One-Of {failed, succeeded, rejected}
- Key: etcd, Value: name of the target etcd resource
- Key: etcd_namespace, Value: namespace of the target etcd resource

28 - Prepare Dev Environment

Prepare Dev Environment

This guide will provide with detailed instructions on installing all dependencies and tools that are required to start developing and testing etcd-druid.

[macOS only] Installing Homebrew

Hombrew is a popular package manager for macOS. You can install it by executing the following command in a terminal:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Installing Go

On macOS run:

brew install go

Alternatively you can also follow the Go installation documentation.

Installing Git

We use git as VCS which you need to install. On macOS run:

brew install git

For other OS, please check the Git installation documentation.

Installing Docker

You need to have docker installed and running. This will allow starting a kind cluster or a minikube cluster for locally deploying etcd-druid.

On macOS run:

brew install docker

Alternatively you can also follow the Docker installation documentation.

Installing Kubectl

To interact with the local Kubernetes cluster you will need kubectl. On macOS run:

brew install kubernetes-cli

For other OS, please check the Kubectl installation documentation.

Other tools that might come in handy

To operate etcd-druid you do not need these tools but they usually come in handy when working with YAML/JSON files. On macOS run:

# jq (https://jqlang.github.io/jq/) is a lightweight and flexible command-line JSON processor
brew install jq
# yq (https://mikefarah.gitbook.io/yq) is a lightweight and portable command-line YAML processor.
brew install yq

Get the sources

Clone the repository from Github into your $GOPATH.

mkdir -p $(go env GOPATH)/src/github.com/gardener
cd $(go env GOPATH)src/github.com/gardener
git clone https://github.com/gardener/etcd-druid.git
# alternatively you can also use `git clone git@github.com:gardener/etcd-druid.git`

29 - Prepare Helm Charts

Prepare etcd-druid Helm charts

etcd-druid operator can be deployed via helm charts. The charts can be found here. All Makefile deploy* targets employ skaffold which internally uses the same helm charts to deploy all resources to setup etcd-druid. In the following sections you will learn on the prerequisites, generated/copied resources and kubernetes resources that are deployed via helm charts to setup etcd-druid.

Prerequisite

Installing Helm

If you wish to directly use helm charts then please ensure that helm is already installed.

On macOS, you can install via brew:

brew install helm

For all other OS please check Helm installation instructions.

Installing OpenSSL

OpenSSL is used to generate PKI resources that are used to configure TLS connectivity with the webhook(s). On macOS, you can install via brew:

brew install openssl

For all other OS please check OpenSSL download instructions.

NOTE: On linux, the library is available via native package managers like apt, yum etc. On Windows, you can get the installer here.

Generated/Copied resources

To leverage etcd-druid helm charts you need to ensure that the charts contains the required CRD yaml files and PKI resources.

CRDs

Heml-3 provides special status to CRDs. CRD YAML files should be placed in crds/ directory inside of a chart. Helm will attempt to load all the files in this directory. We now generate the CRDs and keep these at etcd-druid/api/core/v1alpha1/crds which serves as a single source of truth for all custom resource specifications under etcd-druid operator. These CRDs needs to be copied to etcd-druid/charts/crds.

PKI resources

Webhooks communicate over TLS with the kube-apiserver. It is therefore essential to generate PKI resources (CA certificate, Server certificate and Server key) to be used to configure Webhook configuration and mount it to the etcd-druid operator Deployment.

Kubernetes Resources

etcd-druid helm charts creates the following kubernetes resources:

Resource	Description
ApiVersion: apps/v1 Kind: Deployment	This is the etcd-druid Deployment which runs etcd-druid operator. All reconcilers run as part of this operator.
ApiVersion: rbac.authorization.k8s.io/v1 Kind: ClusterRole	etcd-druid manages `Etcd` resources deployed across namespaces. A cluster role provides required roles to etcd-druid operator for all the resources that are created for an etcd cluster.
ApiVersion: v1 Kind: ServiceAccount	It defines a system user with which etcd-druid operator will function. The service account name will be configured in the Deployment at `spec.template.spec.serviceAccountName`
ApiVersion:rbac.authorization.k8s.io/v1 Kind: ClusterRoleBinding	Binds the cluster roles to the `ServiceAccount` thus associating all cluster roles to the system user with which etcd-druid operator will be run.
ApiVersion: v1 Kind: Service	ClusterIP service which will provide a logical endpoint to reach any etcd-druid pods.
ApiVersion: v1 Kind: Secret	A secret containing the webhook server certificate and key will be created and mounted onto the Deployment.
ApiVersion: admissionregistration.k8s.io/v1 Kind: ValidatingWebhookConfiguration	It is the validation webhook configuration. Currently there is only one webhook `etcdcomponents` . For more details see here.

Chart Values

A values.yaml is defined which contains the default set of values for all configurable properties. You can change the values as per your needs. A few properties of note:

image

Points to the image URL that will be configured in etcd-druid Deployment. If you are building the image on your own and pushing it to the repository of your choice then ensure that you change the value accordingly.

webhookPKI

This YAML map contains paths to required PKI artifacts. If you are generating these on your own then ensure that you provide correct paths to these resources.

etcdComponentProtection

By default, this webhook is enabled. This is a good default for production environments. However, while you are actively developing then you can choose to disable this webhook.

If you have switched to using OperatorConfiguration then you must set enabledOperatorConfig to true (for backward compatibility reasons it is defaulted to false) and then control the enablement of EtcdComponentProctection webhook via operatorConfig.webhooks.etcdComponentProtection.enabled property.

If you have not yet switched to using OperatorConfiguration then you can control the enablement of this webhook via webhooks.etcdComponentProtection.enabled property.

Makefile target

A convenience Makefile target make prepare-helm-charts is provided which leverages OpenSSL to generate the required PKI artifacts. If you wish to deploy etcd-druid in a specific namespace then prior to running this Makefile target you can run:

export NAMESPACE=<namespace>

!!! info Specifying a namespace other than default will result in additional SAN being added in the webhook server certificate.

By default, the certificates generated have an expiry of 12h. If you wish to have a different expiry then prior to running this Makefile target you can run:

export CERT_EXPIRY=<duration-of-your-choice>
# example: export CERT_EXPIRY=6h

!!! note

If you are using make deploy* targets directly which leverages skaffold then Makefile target prepare-helm-charts will be invoked automatically.

30 - Production Setup Recommendations

Setting up etcd-druid in Production

You can get familiar with etcd-druid and all the resources that it creates by setting up etcd-druid locally by following the detailed guide. This document lists down recommendations for a productive setup of etcd-druid.

Helm Charts

You can use helm charts at this location to deploy druid. Values for charts are present here and can be configured as per your requirement. Following charts are present:

deployment.yaml - defines a kubernetes Deployment for etcd-druid. To configure the CLI flags for druid you can refer to this document which explains these flags in detail.
serviceaccount.yaml - defines a kubernetes ServiceAccount which will serve as a technical user to which role/clusterroles can be bound.
clusterrole.yaml - etcd-druid can manage multiple etcd clusters. In a hosted control plane setup (e.g. Gardener), one would typically create separate namespace per control-plane. This would require a ClusterRole to be defined which gives etcd-druid permissions to operate across namespaces. Packing control-planes via namespaces provides you better resource utilisation while providing you isolation from the data-plane (where the actual workload is scheduled).
rolebinding.yaml - binds the ClusterRole defined in druid-clusterrole.yaml to the ServiceAccount defined in service-account.yaml.
service.yaml - defines a Cluster IP Service allowing other control-plane components to communicate to http endpoints exposed out of etcd-druid (e.g. enables prometheus to scrap metrics, validating webhook to be invoked upon change to Etcd CR etc.)
secret-ca-crt.yaml - Contains the base64 encoded CA certificate used for the etcd-druid webhook server.
secret-server-tls-crt.yaml - Contains the base64 encoded server certificate used for the etcd-druid webhook server.
validating-webhook-config.yaml - Configuration for all webhooks that etcd-druid registers to the webhook server. At the time of writing this document EtcdComponents webhook gets registered.

Etcd cluster size

Recommendation from upstream etcd is to always have an odd number of members in an Etcd cluster.

Mounted Volume

All Etcd cluster member Pods provisioned by etcd-druid mount a Persistent Volume. A mounted persistent storage helps in faster recovery in case of single-member transient failures. etcd is I/O intensive and its performance is heavily dependent on the Storage Class. It is therefore recommended that high performance SSD drives be used.

At the time of writing this document etcd-druid provisions the following volume types:

Cloud Provider	Type	Size
AWS	GP3	25Gi
Azure	Premium SSD	33Gi
GCP	Performance (SSD) Persistent Disks (pd-ssd)	25Gi

Also refer: Etcd Disk recommendation.
Additionally, each cloud provider offers redundancy for managed disks. You should choose redundancy as per your availability requirement.

Backup & Restore

A permanent quorum loss or data-volume corruption is a reality in production clusters and one must ensure that data loss is minimized. Etcd clusters provisioned via etcd-druid offer two levels of data-protection

Via etcd-backup-restore all clusters started via etcd-druid get the capability to regularly take delta & full snapshots. These snapshots are stored in an object store. Additionally, a snapshot-compaction job is run to compact and defragment the latest snapshot, thereby reducing the time it takes to restore a cluster in case of a permanent quorum loss. You can read the detailed guide on how to restore from permanent quorum loss.

It is therefore recommended that you configure an Object store in the cloud/infra provider of your choice, enabled backup & restore functionality by filling in store configuration of an Etcd custom CR.

Ransomware protection

Ransomware is a form of malware designed to encrypt files on a device, rendering any files and the systems that rely on them unusable. All cloud providers (aws, gcp, azure) provide a feature of immutability that can be set at the bucket/object level which provides WORM access to objects as long as the bucket/lock retention duration.

All delta & full snapshots that are periodically taken by etcd-backup-restore are stored in Object store provided by a cloud provider. It is recommended that these backups be protected from ransomware protection by turning locking at the bucket/object level.

Security

Use Distroless Container Images

It is generally recommended to use a minimal base image which additionally reduces the attack surface. Google’s Distroless is one way to reduce the attack surface and also minimize the size of the base image. It provides the following benefits:

Reduces the attack surface
Minimizes vulnerabilities
No shell
Reduced size - only includes what is necessary

For every Etcd cluster provisioned by etcd-druid, distroless images are used as base images.

Enable TLS for Peer and Client communication

Generally you should enable TLS for peer and client communication for an Etcd cluster. To enable TLS CA certificate, server and client certificates needs to be generated. You can refer to the list of TLS artifacts that are generated for an Etcd cluster provisioned by etcd-druid here.

Enable TLS for Druid Webhooks

If you choose to enable webhooks in etcd-druid then it is necessary to create a separate CA and server certificate to be used by the webhooks.

Rotate TLS artifacts

It is generally recommended to rotate all TLS certificates to reduce the chances of it getting leaked or have expired. Kubernetes does not support revocation of certificates (see issue#18982). One possible way to revoke certificates is to also revoke the entire chain including CA certificates.

Scaling etcd pods

etcd clusters cannot be scaled-out horizontly to meet the increased traffic/storage demand for the following reasons:

There is a soft limit of 8GB and a hard limit of 10GB for the etcd DB beyond which perfomance and stability of etcd is not guaranteed.
All members of etcd maintain the entire replica of the entire DB, thus scaling-out will not really help if the storage demand grows.
Increasing the number of cluster members beyond 5 also increases the cost of consensus amongst now a larger quorum, increases load on the single leader as it needs to also participate in bringing up etcd learner.

Therefore the following is recommended:

To meet the increased demand, configure a VPA. You have to be careful on selection of containerPolicies, targetRef.
To meet the increased demand in storage etcd-druid already configures each etcd member to auto-compact and it also configures periodic defragmentation of the etcd DB. The only case this will not help is when you only have unique writes all the time.

!!! note Care should be taken with usage of VPA. While it helps to vertically scale up etcd-member pods, it also can cause transient quorum loss. This is a direct consequence of the design of VPA - where recommendation is done by Recommender component, Updater evicts the pods that do not have the resources recommended by the Recommender and Admission Controller which updates the resources on the Pods. All these three components act asynchronously and can fail independently, so while VPA respects PDB’s it can easily enter into a state where updater evicts a pod while respecting PDB but the admission controller fails to apply the recommendation. The pod comes with a default resources which still differ from the recommended values, thus causing a repeat eviction. There are other race conditions that can also occur and one needs to be careful of using VPA for quorum based workloads.

High Availability

To ensure that an Etcd cluster is highly available, following is recommended:

Ensure that the `Etcd` cluster members are spread

Etcd cluster members should always be spread across nodes. This provides you failure tolerance at the node level. For failure tolerance of a zone, it is recommended that you spread the Etcd cluster members across zones. We recommend that you use a combination of TopologySpreadConstraints and Pod Anti-Affinity. To set the scheduling constraints you can either specify these constraints using SchedulingConstraints in the Etcd custom resource or use a MutatingWebhook to dynamically inject these into pods.

An example of scheduling constraints for a multi-node cluster with zone failure tolerance will be:

  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/component: etcd-statefulset
        app.kubernetes.io/managed-by: etcd-druid
        app.kubernetes.io/name: etcd-main
        app.kubernetes.io/part-of: etcd-main
    maxSkew: 1
    minDomains: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
  - labelSelector:
      matchLabels:
        app.kubernetes.io/component: etcd-statefulset
        app.kubernetes.io/managed-by: etcd-druid
        app.kubernetes.io/name: etcd-main
        app.kubernetes.io/part-of: etcd-main
    maxSkew: 1
    minDomains: 3
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

For a 3 member etcd-cluster, the above TopologySpreadConstraints will ensure that the members will be spread across zones (assuming there are 3 zones -> minDomains=3) and no two members will be on the same node.

Optimize Network Cost

In most cloud providers there is no network cost (ingress/egress) for any traffic that is confined within a single zone. For Zonal failure tolerance, it will become imperative to spread the Etcd cluster across zones within a region. Knowing that an Etcd cluster members are quite chatty (leader election, consensus building for writes and linearizable reads etc.), this can add to the network cost.

One could evaluate using TopologyAwareRouting which reduces cross-zonal traffic thus saving costs and latencies.

!!! tip You can read about how it is done in Gardener here.

Metrics & Alerts

Monitoring etcd metrics is essential for fine tuning Etcd clusters. etcd already exports a lot of metrics. You can see the complete list of metrics that are exposed out of an Etcd cluster provisioned by etcd-druid here. It is also recommended that you configure an alert for etcd space quota alarms.

Hibernation

If you have a concept of hibernating kubernetes clusters, then following should be kept in mind:

Before you bring down the Etcd cluster, leverage the capability to take a full snapshot which captures the state of the etcd DB and stores it in the configured Object store. This ensures that when the cluster is woken up from hibernation it can restore from the last state with no data loss.
To save costs you should consider deleting the PersistentVolumeClaims associated to the StatefulSet pods. However, it must be ensured that you take a full snapshot as highlighted in the previous point.
When the cluster is woken up from hibernation then you should do the following (assuming prior to hibernation the cluster had a size of 3 members):
- Start the Etcd cluster with 1 replica. Let it restore from the last full snapshot.
- Once the cluster reports that it is ready, only then increase the replicas to its original value (e.g. 3). The other two members will start up each as learners and post learning they will join as voting members (Followers).

Reference

A nicely written blog post on High Availability and Zone Outage Toleration has a lot of recommendations that one can borrow from.

31 - Raising A Pr

Raising a Pull Request

We welcome active contributions from the community. This document details out the things-to-be-done in order for us to consider a PR for review. Contributors should follow the guidelines mentioned in this document to minimize the time it takes to get the PR reviewed.

00-Prerequisites

In order to make code contributions you must setup your development environment. Follow the Prepare Dev Environment Guide for detailed instructions.

01-Raise an Issue

For every pull-request, it is mandatory to raise an Issue which should describe the problem in detail. We have created a few categories, each having its own dedicated template.

03-Prepare Code Changes

It is not recommended to create a branch on the main repository for raising pull-requests. Instead you must fork the etcd-druid repository and create a branch in the fork. You can follow the detailed instructions on how to fork a repository and set it up for contributions.
Ensure that you follow the coding guidelines while introducing new code.
If you are making changes to the API then please read Changing-API documentation.
If you are introducing new go mod dependencies then please read Dependency Management documentation.
If you are introducing a new Etcd cluster component then please read Add new Cluster Component documentation.
For guidance on testing, follow the detailed instructions here.
Before you submit your PR, please ensure that the following is done:
- Run make check which will do the following:
  - Runs make format - this target will ensure a common formatting of the code and ordering of imports across all source files.
  - Runs make manifests - this target will re-generate manifests if there are any changes in the API.
  - Only when the above targets have run without errorrs, then make check will be run linters against the code. The rules for the linter are configured here.
- Ensure that all the tests pass by running the following make targets:
  - make test-unit - this target will run all unit tests.
  - make test-integration - this target will run all integration tests (controller level tests) using envtest framework.
  - make ci-e2e-kind or any of its variants - these targets will run etcd-druid e2e tests.
  !!! warning Please ensure that after introduction of new code the code coverage does not reduce. An increase in code coverage is always welcome.
If you add new features, make sure that you create relevant documentation under /docs.

04-Raise a pull request

Create Work In Progress [WIP] pull requests only if you need a clarification or an explicit review before you can continue your work item.
Ensure that you have rebased your fork’s development branch with upstream main/master branch.
Squash all commits into a minimal number of commits.
Fill in the PR template with appropriate details and provide the link to the Issue for which a PR has been raised.
If your patch is not getting reviewed, or you need a specific person to review it, you can @-reply a reviewer asking for a review in the pull request or a comment.

05-Post review

If a reviewer requires you to change your commit(s), please test the changes again.
Amend the affected commit(s) and force push onto your branch.
Set respective comments in your GitHub review as resolved.
Create a general PR comment to notify the reviewers that your amendments are ready for another round of review.

06-Merging a pull request

Merge can only be done if the PR has approvals from atleast 2 reviewers.
Add an appropriate release note detailing what is introduced as part of this PR.
Before merging the PR, ensure that you squash and then merge.

32 - Recovering Etcd Clusters

Recovery from Quorum Loss

In an Etcd cluster, quorum is a majority of nodes/members that must agree on updates to a cluster state before the cluster can authorise the DB modification. For a cluster with n members, quorum is (n/2)+1. An Etcd cluster is said to have lost quorum when majority of nodes (greater than or equal to (n/2)+1) are unhealthy or down and as a consequence cannot participate in consensus building.

For a multi-node Etcd cluster quorum loss can either be Transient or Permanent.

Transient quorum loss

If quorum is lost through transient network failures (e.g. n/w partitions) or there is a spike in resource usage which results in OOM, etcd automatically and safely resumes (once the network recovers or the resource consumption has come down) and restores quorum. In other cases like transient power loss, etcd persists the Raft log to disk and replays the log to the point of failure and resumes cluster operation.

Permanent quorum loss

In case the quorum is lost due to hardware failures or disk corruption etc, automatic recovery is no longer possible and it is categorized as a permanent quorum loss.

Note: If one has capability to detect Failed nodes and replace them, then eventually new nodes can be launched and etcd cluster can recover automatically. But sometimes this is just not possible.

Recovery

At present, recovery from a permanent quorum loss is achieved by manually executing the steps listed in this section.

Note: In the near future etcd-druid will offer capability to automate the recovery from a permanent quorum loss via Out-Of-Band Operator Tasks. An operator only needs to ascertain that there is a permanent quorum loss and the etcd-cluster is beyond auto-recovery. Once that is established then an operator can invoke a task whose status an operator can check.

!!! warning Please note that manually restoring etcd can result in data loss. This guide is the last resort to bring an Etcd cluster up and running again.

00-Identify the etcd cluster

It is possible to shard the etcd cluster based on resource types using –etcd-servers-overrides CLI flag of kube-apiserver. Any sharding results in more than one etcd-cluster.

!!! info In gardener, each shoot control plane has two etcd clusters, etcd-events which only stores events and etcd-main - stores everything else except events.

Identify the etcd-cluster which has a permanent quorum loss. Most of the resources of an etcd-cluster can be identified by its name. The resources of interest to recover from permanent quorum loss are: Etcd CR, StatefulSet, ConfigMap and PVC.

To identify the ConfigMap resource use the following command:

 kubectl get sts <sts-name> -o jsonpath='{.spec.template.spec.volumes[?(@.name=="etcd-config-file")].configMap.name}'

01-Prepare Etcd Resource to allow manual updates

To ensure that only one actor (in this case an operator) makes changes to the Etcd resource and also to the Etcd cluster resources, following must be done:

Add the annotation to the Etcd resource:

kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/suspend-etcd-spec-reconcile=

The above annotation will prevent any reconciliation by etcd-druid for this Etcd cluster.

Add another annotation to the Etcd resource:

kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/disable-etcd-component-protection=

The above annotation will allow manual edits to Etcd cluster resources that are managed by etcd-druid.

02-Scale-down Etcd StatefulSet resource to 0

kubectl scale sts <sts-name> --replicas=0 -n <namespace>

03-Delete all PVCs for the Etcd cluster

kubectl delete pvc -l instance=<sts-name> -n <namespace>

04-Delete All Member Leases

For a n member Etcd cluster there should be n member Lease objects. The lease names should start with the Etcd name.

Example leases for a 3 node Etcd cluster:

 NAME          HOLDER                  AGE
 <etcd-name>-0 4c37667312a3912b:Member 1m
 <etcd-name>-1 75a9b74cfd3077cc:Member 1m
 <etcd-name>-2 c62ee6af755e890d:Leader 1m

Delete all the member leases.

kubectl delete lease <space separated lease names>
# Alternatively you can use label selector. From v0.23.0 onwards leases will have common set of labels
kubectl delete lease -l app.kubernetes.io.component=etcd-member-lease, app.kubernetes.io/part-of=<etcd-name> -n <namespace>

05-Modify ConfigMap

Prerequisite to scale up etcd-cluster from 0->1 is to change the fields initial-cluster, initial-advertise-peer-urls, and advertise-client-urls in the ConfigMap.

Assuming that prior to scale-down to 0, there were 3 members:

The initial-cluster field would look like the following (assuming that the name of the etcd resource is etcd-main):

# Initial cluster
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380,etcd-main-1=https://etcd-main-1.etcd-main-peer.default.svc:2380,etcd-main-2=https://etcd-main-2.etcd-main-peer.default.svc:2380

Change the initial-cluster field to have only one member (in this case etcd-main-0). After the change it should look like:

# Initial cluster
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380

The initial-advertise-peer-urls field would look like the following:

# Initial advertise peer urls
initial-advertise-peer-urls:
  etcd-main-0:
  - http://etcd-main-0.etcd-main-peer.default.svc:2380
  etcd-main-1:
  - http://etcd-main-1.etcd-main-peer.default.svc:2380
  etcd-main-2:
  - http://etcd-main-2.etcd-main-peer.default.svc:2380

Change the initial-advertise-peer-urls field to have only one member (in this case etcd-main-0). After the change it should look like:

# Initial advertise peer urls
initial-advertise-peer-urls:
  etcd-main-0:
  - http://etcd-main-0.etcd-main-peer.default.svc:2380

The advertise-client-urls field would look like the following:

advertise-client-urls:
  etcd-main-0:
  - http://etcd-main-0.etcd-main-peer.default.svc:2379
  etcd-main-1:
  - http://etcd-main-1.etcd-main-peer.default.svc:2379
  etcd-main-2:
  - http://etcd-main-2.etcd-main-peer.default.svc:2379

Change the advertise-client-urls field to have only one member (in this case etcd-main-0). After the change it should look like:

advertise-client-urls:
  etcd-main-0:
  - http://etcd-main-0.etcd-main-peer.default.svc:2379

06-Scale up Etcd cluster to size 1

kubectl scale sts <sts-name> -n <namespace> --replicas=1

07-Wait for Single-Member etcd cluster to be completely ready

To check if the single-member etcd cluster is ready check the status of the pod.

kubectl get pods <etcd-name-0> -n <namespace>
NAME            READY   STATUS    RESTARTS   AGE
<etcd-name>-0   2/2     Running   0          1m

If both containers report readiness (as seen above), then the etcd-cluster is considered ready.

08-Enable Etcd reconciliation and resource protection

All manual changes are now done. We must now re-enable etcd-cluster resource protection and also enable reconciliation by etcd-druid by doing the following:

kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/suspend-etcd-spec-reconcile-
kubectl annotate etcd <etcd-name> -n <namespace> druid.gardener.cloud/disable-etcd-component-protection-

09-Scale-up Etcd Cluster to 3 and trigger reconcile

Scale etcd-cluster to its original size (we assumed 3 below).

kubectl scale sts <sts-name> -n namespace --replicas=3

If etcd-druid has been set up with --enable-etcd-spec-auto-reconcile switched-off then to ensure reconciliation one must annotate Etcd resource with the following command:

# Annotate etcd CR to reconcile
kubectl annotate etcd <etcd-name> -n <namespace> gardener.cloud/operation="reconcile"

10-Verify Etcd cluster health

Check if all the member pods have both of their containers in Running state.

kubectl get pods -n <namespace> -l app.kubernetes.io/part-of=<etcd-name>
NAME            READY   STATUS    RESTARTS   AGE
<etcd-name>-0   2/2     Running   0          5m
<etcd-name>-1   2/2     Running   0          1m
<etcd-name>-2   2/2     Running   0          1m

Additionally, check if the Etcd CR is ready:

kubectl get etcd <etcd-name> -n <namespace>
NAME          READY   AGE
<etcd-name>   true    13d

Check member leases, whose holderIdentity should reflect the member role. Check if all members are voting members (their role should either be Member or Leader). Monitor the leases for some time and check if the leases are getting updated. You can monitor the AGE field.

NAME          HOLDER                  AGE
<etcd-name>-0 4c37667312a3912b:Member 1m
<etcd-name>-1 75a9b74cfd3077cc:Member 1m
<etcd-name>-2 c62ee6af755e890d:Leader 1m

33 - Securing Etcd Clusters

Securing etcd cluster

This document will describe all the TLS artifacts that are typically generated for setting up etcd-druid and etcd clusters in Gardener clusters. You can take inspiration from this and decide which communication lines are essential to be TLS enabled.

Communication lines

In order to undertand all the TLS artifacts that are required to setup etcd-druid and one or more etcd-clusters, one must have a clear view of all the communication channels that needs to be protected via TLS. In the diagram below all communication lines in a typical 3-node etcd cluster along with kube-apiserver and etcd-druid is illustrated.

!!! info For Gardener setup all the communication lines are TLS enabled.

TLS artifacts

An etcd cluster setup by etcd-druid leverages the following TLS artifacts:

Certificate Authority used to sign server and client certificate key-pair for etcd-backup-restore specified via etcd.spec.backup.tls.tlsCASecretRef.
Server certificate key-pair specified via etcd.spec.backup.tls.serverTLSSecretRef used by etcd-backup-restore HTTPS server.
Client certificate key-pair specified via etcd.spec.backup.tls.clientTLSSecretRef used by etcd-wrapper to securely communicate to the etcd-backup-restore HTTPS server.
Certificate Authority used to sign server and client certificate key-pair for etcd and etcd-wrapper specified via etcd.spec.etcd.clientUrlTls.tlsCASecretRef for etcd client communication.
Server certificate key-pair specified via etcd.spec.etcd.clientUrlTls.serverTLSSecretRef used by etcd and etcd-wrapper HTTPS servers.
Client certificate key-pair specified via etcd.spec.etcd.clientUrlTls.clientTLSSecretRef used by:
- etcd-wrapper and etcd-backup-restore to securely communicate to the etcd HTTPS server.
- etcd-backup-restore to securely communicate to the etcd-wrapper HTTPS server.
Certificate Authority used to sign server certificate key-pair for etcd peer communication specified via etcd.spec.etcd.peerUrlTls.tlsCASecretRef.
Server certificate key-pair specified via etcd.spec.etcd.peerUrlTls.serverTLSSecretRef used for etcd peer communication.

!!! note TLS artifacts should be created prior to creating Etcd clusters. etcd-druid currently does not provide a convenience way to generate these TLS artifacts. etcd recommends to use cfssl to generate certificates. However you can use any other tool as well. We do provide a convenience script for local development here which can be used to generate TLS artifacts. Currently this script is part of etcd-wrapper github repository but we will harmonize these scripts to be used across all github projects under the etcd-druid ecosystem.

34 - Testing

Testing Strategy and Developer Guideline

Intent of this document is to introduce you (the developer) to the following:

Libraries that are used to write tests.
Best practices to write tests that are correct, stable, fast and maintainable.
How to run tests.

The guidelines are not meant to be absolute rules. Always apply common sense and adapt the guideline if it doesn’t make much sense for some cases. If in doubt, don’t hesitate to ask questions during a PR review (as an author, but also as a reviewer). Add new learnings as soon as we make them!

For any new contributions tests are a strict requirement. Boy Scouts Rule is followed: If you touch a code for which either no tests exist or coverage is insufficient then it is expected that you will add relevant tests.

Common guidelines for writing tests

We use the Testing package provided by the standard library in golang for writing all our tests. Refer to its official documentation to learn how to write tests using Testing package. You can also refer to this example.
We use gomega as our matcher or assertion library. Refer to Gomega’s official documentation for details regarding its installation and application in tests.
For naming the individual test/helper functions, ensure that the name describes what the function tests/helps-with. Naming is important for code readability even when writing tests - example-testcase-naming.
Introduce helper functions for assertions to make test more readable where applicable - example-assertion-function.
Introduce custom matchers to make tests more readable where applicable - example-custom-matcher.
Do not use time.Sleep and friends as it renders the tests flaky.
If a function returns a specific error then ensure that the test correctly asserts the expected error instead of just asserting that an error occurred. To help make this assertion consider using DruidError where possible. example-test-utility & usage.
Creating sample data for tests can be a high effort. Consider writing test utilities to generate sample data instead. example-test-object-builder.
If tests require any arbitrary sample data then ensure that you create a testdata directory within the package and keep the sample data as files in it. From https://pkg.go.dev/cmd/go/internal/test
The go tool will ignore a directory named “testdata”, making it available to hold ancillary data needed by the tests.
Avoid defining shared variable/state across tests. This can lead to race conditions causing non-deterministic state. Additionally it limits the capability to run tests concurrently via t.Parallel().
Do not assume or try and establish an order amongst different tests. This leads to brittle tests as the codebase evolves.
If you need to have logs produced by test runs (especially helpful in failing tests), then consider using t.Log or t.Logf.

Unit Tests

If you need a kubernetes client.Client, prefer using fake client instead of mocking the client. You can inject errors when building the client which enables you test error handling code paths.
- Mocks decrease maintainability because they expect the tested component to follow a certain way to reach the desired goal (e.g., call specific functions with particular arguments).
All unit tests should be run quickly. Do not use envtest and do not set up a Kind cluster in unit tests.
If you have common setup for variations of a function, consider using table-driven tests. See this as an example.
An individual test should only test one and only one thing. Do not try and test multiple variants in a single test. Either use table-driven tests or write individual tests for each variation.
If a function/component has multiple steps, its probably better to split/refactor it into multiple functions/components that can be unit tested individually.
If there are a lot of edge cases, extract dedicated functions that cover them and use unit tests to test them.

Running Unit Tests

!!! info For unit tests we are currently transitioning away from ginkgo to using golang native tests. The make test-unit target runs both ginkgo and golang native tests. Once the transition is complete this target will be simplified.

Run all unit tests

make test-unit

Run unit tests of specific packages:

# if you have not already installed gotestfmt tool then install it once.
# make test-unit target automatically installs this in ./hack/tools/bin. You can alternatively point the GOBIN to this directory and then directly invoke test-go.sh
> go install github.com/gotesttools/gotestfmt/v2/cmd/gotestfmt@v2.5.0
> ./hack/test-go.sh <package-1> <package-2>

De-flaking Unit Tests

If tests have sporadic failures, then trying running ./hack/stress-test.sh which internally uses stress tool.

# install the stress tool
go install golang.org/x/tools/cmd/stress@latest
# invoke the helper script to execute the stress test
./hack/stress-test.sh test-package=<test-package> test-func=<test-function> tool-params="<tool-params>"

An example invocation:

./hack/stress-test.sh test-package=./internal/utils test-func=TestRunConcurrentlyWithAllSuccessfulTasks tool-params="-p 10"
5s: 877 runs so far, 0 failures
10s: 1906 runs so far, 0 failures
15s: 2885 runs so far, 0 failures
...

stress tool will output a path to a file containing the full failure message when a test run fails.

Integration Tests (envtests)

Integration tests in etcd-druid use envtest. It sets up a minimal temporary control plane (etcd + kube-apiserver) and runs the test against it. Test suites (group of tests) start their individual envtest environment before running the tests for the respective controller/webhook. Before exiting, the temporary test environment is shutdown.

!!! info For integration-tests we are currently transitioning away from ginkgo to using golang native tests. All ginkgo integration tests can be found here and golang native integration tests can be found here.

Integration tests in etcd-druid only targets a single controller. It is therefore advised that code (other than common utility functions should not be shared between any two controllers).
If you are sharing a common envtest environment across tests then it is recommended that an individual test is run in a dedicated namespace.
Since envtest is used to setup a minimum environment where no controller (e.g. KCM, Scheduler) other than etcd and kube-apiserver runs, status updates to resources controller/reconciled by not-deployed-controllers will not happen. Tests should refrain from asserting changes to status. In case status needs to be set as part of a test setup then it must be done explicitly.
If you have common setup and teardown, then consider using TestMain -example.
If you have to wait for resources to be provisioned or reach a specific state, then it is recommended that you create smaller assertion functions and use Gomega’s AsyncAssertion functions - example.
- Beware of the default Eventually / Consistently timeouts / poll intervals: docs.
- Don’t forget to call {Eventually,Consistently}.Should(), otherwise the assertions always silently succeeds without errors: onsi/gomega#561

Running Integration Tests

make test-integration

Debugging Integration Tests

There are two ways in which you can debug Integration Tests:

Using IDE

All commonly used IDE’s provide in-built or easy integration with delve debugger. For debugging integration tests the only additional requirement is to set KUBEBUILDER_ASSETS environment variable. You can get the value of this environment variable by executing the following command:

# ENVTEST_K8S_VERSION is the k8s version that you wish to use for testing.
setup-envtest --os $(go env GOOS) --arch $(go env GOARCH) use $ENVTEST_K8S_VERSION -p path

!!! tip All integration tests usually have a timeout. If you wish to debug a failing integration-test then increase the timeouts.

Use standalone envtest

We also provide a capability to setup a stand-alone envtest and leverage the cluster to run individual integration-test. This allows you more control over when this k8s control plane is destroyed and allows you to inspect the resources at the end of the integration-test run using kubectl.

While you can use an existing cluster (e.g., kind), some test suites expect that no controllers and no nodes are running in the test environment (as it is the case in envtest test environments). Hence, using a full-blown cluster with controllers and nodes might sometimes be impractical, as you would need to stop cluster components for the tests to work.

To setup a standalone envtest and run an integration test against it, do the following:

# In a terminal session use the following make target to setup a standalone envtest
make start-envtest
# As part of output path to kubeconfig will be also be printed on the console.

# In another terminal session setup resource(s) watch:
kubectl get po -A -w # alternatively you can also use `watch -d <command>` utility.

# In another terminal session:
export KUBECONFIG=<envtest-kubeconfig-path>
export USE_EXISTING_K8S_CLUSTER=true

# run the test
go test -run="<regex-for-test>" <package>
# example: go test -run="^TestEtcdDeletion/test deletion of all*" ./test/it/controller/etcd

Once you are done the testing you can press Ctrl+C in the terminal session where you started envtest. This will shutdown the kubernetes control plane.

End-To-End (e2e) Tests

End-To-End tests are run using Kind cluster and Skaffold. These tests provide a high level of confidence that the code runs as expected by users when deployed to production.

Purpose of running these tests is to be able to catch bugs which result from interaction amongst different components within etcd-druid.
In CI pipelines e2e tests are run with S3 compatible LocalStack (in cases where backup functionality has been enabled for an etcd cluster).
In future we will only be using a file-system based local provider to reduce the run times for the e2e tests when run in a CI pipeline.
e2e tests can be triggered either with other cloud provider object-store emulators or they can also be run against actual/remove cloud provider object-store services.
In contrast to integration tests, in e2e tests, it might make sense to specify higher timeouts for Gomega’s AsyncAssertion calls.

Running e2e tests locally

Detailed instructions on how to run e2e tests can be found here.

35 - Updating Documentation

Updating Documentation

All documentation for etcd-druid resides in docs directory. If you wish to update the existing documentation or create new documentation files then read on.

Prerequisite: Setup Mkdocs locally

Material for Mkdocs is used to generate GitHub Pages from all the Markdown files present under the docs directory. To locally validate that the documentation renders correctly, it is recommended that you perform the following setup.

Install python3 if not already installed.
Setup a virtual environment via

python -m venv venv

Activate the virtual environment

source venv/bin/activate

In the virtual environment install the packages.

(venv) > pip install mkdocs-material
(venv) > pip install pymdown-extensions
(venv) > pip install mkdocs-glightbox
(venv) > pip install mkdocs-pymdownx-material-extras

!!! note Complete list of packages installed should be in sync with Github Actions Configuration.

Serve the documentation

(venv) > mkdocs serve

You can now view the rendered documentation at localhost:8000. Any changes that you make to the docs will get hot-reloaded and you can immediately view the changes.

Updating Documentation

All documentation should be in markdown only. Ensure that you take care of the following:

The index.md is the home page for the documentation rendered as Github Pages. Please do not remove this file.
If you are using a new feature (that is not already used) by Mkdocs then ensure that it is properly configured in mkdocs.yml. Additionally, if new plugins or Markdown extensions are used, make sure that you update the Github Actions Configuration accordingly.
If new files are being added and you wish to show these files in Github Pages then ensure that you have added them under appropriate sections in the navigation section of mkdocs.yml.
If you are linking to any file outside the docs directory then relative links will not work on Github Pages. Please get the https link to the file or section of the file that you wish to link.

Raise a Pull Request

Once you have made the documentation changes then follow the guide on how to raise a PR.

!!! info Once the documentation update PR has been merged, you will be able to see the updated documentation here.

36 - Using Druid Client

Using Druid Golang Client

Etcd Druid provides a generated typed Golang client which can be used to invoke CRUD operations on Etcd and EtcdCopyBackupsTask custom resources.

A simple example is provided to demonstrate how an Etcd resource can be created using client package.

To run the example ensure that you have the following setup:

Follow the Getting Started Guide to set up a KIND cluster and deploy etcd-druid operator.
Run the example:
```
go run examples/client/main.go
```

You should see the following output:

INFO Successfully created Etcd cluster Etcd=default/etcd-test

You can further list the resources that are created for an Etcd cluster. See this document for information on all resources created for an Etcd cluster.

37 - Validating Etcd Clusters

CRD Validations for etcd

The validations for the fields within the etcd resource are done via kubebuilder markers for CRD validation.
The validations for clusters with kubernetes versions >= 1.29 are written using a combination of CEL expressions via the x-validation tag which provides a straightforward syntax to write validation rules for the fields, and pattern matching with the use of the validation tag.
The validations for clusters with kubernetes versions < 1.29 will not contain validations via CEL expressions since this is GA for kubernetes version 1.29 or higher.
Upon any changes to the validation rules to the etcd resource, the yaml files for the same can be generated by running the make generate command.

Validation rules:

Type Validation rules:

The validations for fields of types Duration(metav1.Duration) and cron expressions are done via regex matching. These use the validation:Pattern marker.(The checking for the Quantity(resource.Quantity) fields are done by default, hence, no explicit validation is needed for the fields of this type):

Duration fields: '^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$')
Cron expression: ^(\*|[1-5]?[0-9]|[1-5]?[0-9]-[1-5]?[0-9]|(?:[1-9]|[1-4][0-9]|5[0-9])\/(?:[1-9]|[1-4][0-9]|5[0-9]|60)|\*\/(?:[1-9]|[1-4][0-9]|5[0-9]|60))\s+(\*|[0-9]|1[0-9]|2[0-3]|[0-9]-(?:[0-9]|1[0-9]|2[0-3])|1[0-9]-(?:1[0-9]|2[0-3])|2[0-3]-2[0-3]|(?:[1-9]|1[0-9]|2[0-3])\/(?:[1-9]|1[0-9]|2[0-4])|\*\/(?:[1-9]|1[0-9]|2[0-4]))\s+(\*|[1-9]|[12][0-9]|3[01]|[1-9]-(?:[1-9]|[12][0-9]|3[01])|[12][0-9]-(?:[12][0-9]|3[01])|3[01]-3[01]|(?:[1-9]|[12][0-9]|30)\/(?:[1-9]|[12][0-9]|3[01])|\*\/(?:[1-9]|[12][0-9]|3[01]))\s+(\*|[1-9]|1[0-2]|[1-9]-(?:[1-9]|1[0-2])|1[0-2]-1[0-2]|(?:[1-9]|1[0-2])\/(?:[1-9]|1[0-2])|\*\/(?:[1-9]|1[0-2]))\s+(\*|[1-7]|[1-6]-[1-7]|[1-6]\/[1-7]|\*\/[1-7])$
- NOTE: The provided regex does not account for special strings such as @yearly or @monthly. Additionally, it fails to invalidate cases involving the step operator (x/y) and the range operator (x-y), where the cron expression is considered valid even if x > y. Please ensure these values are validated before passing the expression.

Update validations

These validations are triggered when an update operation is done on the etcd resource.

Immutable fields: The fields etcd.spec.StorageClass , etcd.spec.StorageCapacity and etcd.spec.VolumeClaimTemplate are immutable. The immutability is enforced by the CEL expression : self == oldSelf.
The value set for the field etcd.spec.replicas can either be decreased to 0 or increased. This is enforced by the CEL expression: self==0 ? true : self < oldSelf ? false : true

Field validations

The fields which expect only a particular set of values are checked by using the kubebuilder marker: +kubebuilder:validation:Enum=<value1>;<value2>
- The etcd.spec.etcd.metrics can only be set as either basic or extensive.
- The etcd.spec.backup.garbageCollectionPolicy can only be set as Exponential or LimitBased
- The etcd.spec.backup.compression.policy can only be set as either gzip or lzw or zlib.
- The etcd.spec.sharedConfig.autoCompactionMode can only be set as either periodic or revision.

The value of etcd.spec.backup.garbageCollectionPeriod must be greater than etcd.spec.backup.deltaSnapshotPeriod. This is enforced by the CEL expression !(has(self.deltaSnapshotPeriod) && has(self.garbageCollectionPeriod)) || duration(self.deltaSnapshotPeriod).getSeconds() < duration(self.garbageCollectionPeriod).getSeconds(). The first part of the expression ensures that both the fields are present and then compares the values of the garbageCollectionPeriod and deltaSnapshotPeriod fields, if not, skips the check.
The value of etcd.spec.StorageCapacity must be more than 3 times that of the etcd.spec.etcd.quota if backups are enabled. If not, the value must be greater than that of the etcd.spec.etcd.quota field. This is enforced by using the CEL expression: has(self.storageCapacity) && has(self.etcd.quota) ? (has(self.backup.store) ? quantity(self.storageCapacity).compareTo(quantity(self.etcd.quota).add(quantity(self.etcd.quota)).add(quantity(self.etcd.quota))) > 0 : quantity(self.storageCapacity).compareTo(quantity(self.etcd.quota)) > 0 ): true The check for whether backups are enabled or not is done by checking if the field etcd.spec.backup.store exists.

38 - Version Compatibility Matrix

Version Compatibility

Kubernetes

We strongly recommend using etcd-druid with the supported kubernetes versions, published in this document. The following is a list of kubernetes versions supported by the respective etcd-druid versions.

etcd-druid version	Kubernetes version
>=v0.20	>=v1.21
>=v0.14 && <0.20	All versions supported
<v0.14	< v1.25

etcd-backup-restore & etcd-wrapper

etcd-druid version	etcd-backup-restore version	etcd-wrapper version
>=v0.23.1	>=v0.30.2	>=v0.2.0

etcd-druid

Start using or developing etcd-druid locally

Contributions

Feedback and Support

License

1 - 01 Multi Node Etcd Clusters

DEP-01: Multi-node etcd cluster instances via etcd-druid

Goal

Background and Motivation

Single-node etcd cluster

Multi-node etcd-cluster

Dynamic multi-node etcd cluster

Prior Art

ETCD Operator from CoreOS

etcdadm from kubernetes-sigs

Etcd Cluster Operator from Improbable-Engineering

General Approach to ETCD Cluster Management

Bootstrapping

Assumptions

Adding a new member to an etcd cluster

Note

Alternative

Managing Failures

Removing an existing member from an etcd cluster

Restarting an existing member of an etcd cluster

Recovering an etcd cluster from failure of majority of members

Kubernetes Context

Alternative

ETCD Configuration

Alternative

Data Persistence

Persistent

Ephemeral

Disk

In-memory

How to detect if valid metadata exists in an etcd member

Recommendation

How to detect if valid data exists in an etcd member

Recommendation

Separating peer and client traffic

Cutting off client requests

Manipulating Client Service podSelector

Health Check

Backup Failure

Alternative

Status

Members

Note

Member name as the key

Member Leases

Conditions

ClusterSize

Alternative

Decision table for etcd-druid based on the status

1. Pink of health

Observed state

Recommended Action

2. Member status is out of sync with their leases

Observed state

Recommended Action

3. All members are Ready but AllMembersReady condition is stale

Observed state

Recommended Action

4. Not all members are Ready but AllMembersReady condition is stale

Observed state

Recommended Action

5. Majority members are Ready but Ready condition is stale

Observed state

Recommended Action

6. Majority members are NotReady but Ready condition is stale

Observed state

Recommended Action

7. Some members have been in Unknown status for a while

Observed state

Recommended Action

8. Some member pods are not Ready but have not had the chance to update their status

Observed state

Recommended Action

9. Quorate cluster with a minority of members NotReady

Observed state

Start using or developing `etcd-druid` locally

3. All members are `Ready` but `AllMembersReady` condition is stale

4. Not all members are `Ready` but `AllMembersReady` condition is stale

5. Majority members are `Ready` but `Ready` condition is stale

6. Majority members are `NotReady` but `Ready` condition is stale

7. Some members have been in `Unknown` status for a while

8. Some member pods are not `Ready` but have not had the chance to update their status

9. Quorate cluster with a minority of members `NotReady`

10. Quorum lost with a majority of members `NotReady`

13. Superfluous member entries in `Etcd` status