Etcd Druid
A druid for etcd management in Gardener
etcd-druid
etcd-druid is an etcd operator which makes it easy to configure, provision, reconcile and monitor etcd clusters. It enables management of an etcd cluster through a declarative Kubernetes API model.
In every etcd cluster managed by etcd-druid, each etcd member is a two-container Pod which consists of:
- etcd-wrapper which manages the lifecycle (validation & initialization) of an etcd.
- etcd-backup-restore sidecar which currently provides the following capabilities (the list is not comprehensive):
- etcd DB validation.
- Scheduled etcd DB defragmentation.
- Backup - etcd DB snapshots are taken regularly and backed up to an object store if one is configured.
- Restoration - In case of a DB corruption for a single-member cluster it helps in restoring from latest set of snapshots (full & delta).
- Member control operations.
etcd-druid additionally provides the following capabilities:
- Facilitates declarative scale-out of etcd clusters.
- Provides protection against accidental deletion/mutation of resources provisioned as part of an etcd cluster.
- Offers an asynchronous and threshold-based capability to process backed up snapshots.
- Allows seamless copy of backups between any two object store buckets.
Start using or developing etcd-druid locally
If you are looking to try out druid then you can use a Kind cluster based setup.
https://github.com/user-attachments/assets/cfe0d891-f709-4d7f-b975-4300c6de67e4
For detailed documentation, see our /docs
folder. Please find the index here.
Contributions
If you wish to contribute then please see our guidelines.
Feedback and Support
We always look forward to active community engagement. Please report bugs or suggestions on how we can enhance etcd-druid
on GitHub Issues.
License
Released under the Apache-2.0 license.
1 - API Reference
Packages:
druid.gardener.cloud/v1alpha1
Package v1alpha1 is the v1alpha1 version of the etcd-druid API.
Resource Types:
BackupSpec
(Appears on:
EtcdSpec)
BackupSpec defines parameters associated with the full and delta snapshots of etcd.
ClientService
(Appears on:
EtcdConfig)
ClientService defines the parameters of the client service that a user can specify
Field | Description |
---|
annotations map[string]string | (Optional) Annotations specify the annotations that should be added to the client service |
labels map[string]string | (Optional) Labels specify the labels that should be added to the client service |
CompactionMode
(string
alias)
(Appears on:
SharedConfig)
CompactionMode defines the auto-compaction-mode: ‘periodic’ or ‘revision’.
‘periodic’ for duration based retention and ‘revision’ for revision number based retention.
CompressionPolicy
(string
alias)
(Appears on:
CompressionSpec)
CompressionPolicy defines the type of policy for compression of snapshots.
CompressionSpec
(Appears on:
BackupSpec)
CompressionSpec defines parameters related to compression of Snapshots(full as well as delta).
Condition
(Appears on:
EtcdCopyBackupsTaskStatus,
EtcdStatus)
Condition holds the information about the state of a resource.
Field | Description |
---|
type ConditionType | Type of the Etcd condition. |
status ConditionStatus | Status of the condition, one of True, False, Unknown. |
lastTransitionTime Kubernetes meta/v1.Time | Last time the condition transitioned from one status to another. |
lastUpdateTime Kubernetes meta/v1.Time | Last time the condition was updated. |
reason string | The reason for the condition’s last transition. |
message string | A human-readable message indicating details about the transition. |
ConditionStatus
(string
alias)
(Appears on:
Condition)
ConditionStatus is the status of a condition.
ConditionType
(string
alias)
(Appears on:
Condition)
ConditionType is the type of condition.
CrossVersionObjectReference
(Appears on:
EtcdStatus)
CrossVersionObjectReference contains enough information to let you identify the referred resource.
Field | Description |
---|
kind string | Kind of the referent |
name string | Name of the referent |
apiVersion string | (Optional) API version of the referent |
Etcd
Etcd is the Schema for the etcds API
EtcdConfig
(Appears on:
EtcdSpec)
EtcdConfig defines the parameters associated with the deployed etcd.
Field | Description |
---|
quota k8s.io/apimachinery/pkg/api/resource.Quantity | (Optional) Quota defines the etcd DB quota. |
defragmentationSchedule string | (Optional) DefragmentationSchedule defines the cron standard schedule for defragmentation of etcd. |
serverPort int32 | (Optional) |
clientPort int32 | (Optional) |
image string | (Optional) Image defines the etcd container image and tag |
authSecretRef Kubernetes core/v1.SecretReference | (Optional) |
metrics MetricsLevel | (Optional) Metrics defines the level of detail for exported metrics of etcd, specify ‘extensive’ to include histogram metrics. |
resources Kubernetes core/v1.ResourceRequirements | (Optional) Resources defines the compute Resources required by etcd container.
More info: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/ |
clientUrlTls TLSConfig | (Optional) ClientUrlTLS contains the ca, server TLS and client TLS secrets for client communication to ETCD cluster |
peerUrlTls TLSConfig | (Optional) PeerUrlTLS contains the ca and server TLS secrets for peer communication within ETCD cluster
Currently, PeerUrlTLS does not require client TLS secrets for gardener implementation of ETCD cluster. |
etcdDefragTimeout Kubernetes meta/v1.Duration | (Optional) EtcdDefragTimeout defines the timeout duration for etcd defrag call |
heartbeatDuration Kubernetes meta/v1.Duration | (Optional) HeartbeatDuration defines the duration for members to send heartbeats. The default value is 10s. |
clientService ClientService | (Optional) ClientService defines the parameters of the client service that a user can specify |
EtcdCopyBackupsTask
EtcdCopyBackupsTask is a task for copying etcd backups from a source to a target store.
Field | Description |
---|
metadata Kubernetes meta/v1.ObjectMeta | Refer to the Kubernetes API documentation for the fields of the metadata field. |
spec EtcdCopyBackupsTaskSpec | (The fields of EtcdCopyBackupsTaskSpec are listed in the following rows.) |
sourceStore StoreSpec | SourceStore defines the specification of the source object store provider for storing backups. |
targetStore StoreSpec | TargetStore defines the specification of the target object store provider for storing backups. |
maxBackupAge uint32 | (Optional) MaxBackupAge is the maximum age in days that a backup must have in order to be copied. By default all backups will be copied. |
maxBackups uint32 | (Optional) MaxBackups is the maximum number of backups that will be copied starting with the most recent ones. |
waitForFinalSnapshot WaitForFinalSnapshotSpec | (Optional) WaitForFinalSnapshot defines the parameters for waiting for a final full snapshot before copying backups. |
status EtcdCopyBackupsTaskStatus | |
EtcdCopyBackupsTaskSpec
(Appears on:
EtcdCopyBackupsTask)
EtcdCopyBackupsTaskSpec defines the parameters for the copy backups task.
Field | Description |
---|
sourceStore StoreSpec | SourceStore defines the specification of the source object store provider for storing backups. |
targetStore StoreSpec | TargetStore defines the specification of the target object store provider for storing backups. |
maxBackupAge uint32 | (Optional) MaxBackupAge is the maximum age in days that a backup must have in order to be copied.
By default all backups will be copied. |
maxBackups uint32 | (Optional) MaxBackups is the maximum number of backups that will be copied starting with the most recent ones. |
waitForFinalSnapshot WaitForFinalSnapshotSpec | (Optional) WaitForFinalSnapshot defines the parameters for waiting for a final full snapshot before copying backups. |
EtcdCopyBackupsTaskStatus
(Appears on:
EtcdCopyBackupsTask)
EtcdCopyBackupsTaskStatus defines the observed state of the copy backups task.
Field | Description |
---|
conditions []Condition | (Optional) Conditions represents the latest available observations of an object’s current state. |
observedGeneration int64 | (Optional) ObservedGeneration is the most recent generation observed for this resource. |
lastError string | (Optional) LastError represents the last occurred error. |
EtcdMemberConditionStatus
(string
alias)
(Appears on:
EtcdMemberStatus)
EtcdMemberConditionStatus is the status of an etcd cluster member.
EtcdMemberStatus
(Appears on:
EtcdStatus)
EtcdMemberStatus holds information about an etcd cluster membership.
Field | Description |
---|
name string | Name is the name of the etcd member. It is the name of the backing Pod . |
id string | (Optional) ID is the ID of the etcd member. |
role EtcdRole | (Optional) Role is the role in the etcd cluster, either Leader or Member . |
status EtcdMemberConditionStatus | Status of the condition, one of True, False, Unknown. |
reason string | The reason for the condition’s last transition. |
lastTransitionTime Kubernetes meta/v1.Time | LastTransitionTime is the last time the condition’s status changed. |
EtcdRole
(string
alias)
(Appears on:
EtcdMemberStatus)
EtcdRole is the role of an etcd cluster member.
EtcdSpec
(Appears on:
Etcd)
EtcdSpec defines the desired state of Etcd
EtcdStatus
(Appears on:
Etcd)
EtcdStatus defines the observed state of Etcd.
Field | Description |
---|
observedGeneration int64 | (Optional) ObservedGeneration is the most recent generation observed for this resource. |
etcd CrossVersionObjectReference | (Optional) |
conditions []Condition | (Optional) Conditions represents the latest available observations of an etcd’s current state. |
serviceName string | (Optional) ServiceName is the name of the etcd service. |
lastError string | (Optional) LastError represents the last occurred error. |
clusterSize int32 | (Optional) Cluster size is the size of the etcd cluster. |
currentReplicas int32 | (Optional) CurrentReplicas is the current replica count for the etcd cluster. |
replicas int32 | (Optional) Replicas is the replica count of the etcd resource. |
readyReplicas int32 | (Optional) ReadyReplicas is the count of replicas being ready in the etcd cluster. |
ready bool | (Optional) Ready is true if all etcd replicas are ready. |
updatedReplicas int32 | (Optional) UpdatedReplicas is the count of updated replicas in the etcd cluster. |
labelSelector Kubernetes meta/v1.LabelSelector | (Optional) LabelSelector is a label query over pods that should match the replica count.
It must match the pod template’s labels. |
members []EtcdMemberStatus | (Optional) Members represents the members of the etcd cluster |
peerUrlTLSEnabled bool | (Optional) PeerUrlTLSEnabled captures the state of peer url TLS being enabled for the etcd member(s) |
GarbageCollectionPolicy
(string
alias)
(Appears on:
BackupSpec)
GarbageCollectionPolicy defines the type of policy for snapshot garbage collection.
LeaderElectionSpec
(Appears on:
BackupSpec)
LeaderElectionSpec defines parameters related to the LeaderElection configuration.
Field | Description |
---|
reelectionPeriod Kubernetes meta/v1.Duration | (Optional) ReelectionPeriod defines the Period after which leadership status of corresponding etcd is checked. |
etcdConnectionTimeout Kubernetes meta/v1.Duration | (Optional) EtcdConnectionTimeout defines the timeout duration for etcd client connection during leader election. |
MetricsLevel
(string
alias)
(Appears on:
EtcdConfig)
MetricsLevel defines the level ‘basic’ or ‘extensive’.
SchedulingConstraints
(Appears on:
EtcdSpec)
SchedulingConstraints defines the different scheduling constraints that must be applied to the
pod spec in the etcd statefulset.
Currently supported constraints are Affinity and TopologySpreadConstraints.
Field | Description |
---|
affinity Kubernetes core/v1.Affinity | (Optional) Affinity defines the various affinity and anti-affinity rules for a pod
that are honoured by the kube-scheduler. |
topologySpreadConstraints []Kubernetes core/v1.TopologySpreadConstraint | (Optional) TopologySpreadConstraints describes how a group of pods ought to spread across topology domains,
that are honoured by the kube-scheduler. |
SecretReference
(Appears on:
TLSConfig)
SecretReference defines a reference to a secret.
Field | Description |
---|
SecretReference Kubernetes core/v1.SecretReference | (Members of SecretReference are embedded into this type.) |
dataKey string | (Optional) DataKey is the name of the key in the data map containing the credentials. |
SharedConfig
(Appears on:
EtcdSpec)
SharedConfig defines parameters shared and used by Etcd as well as backup-restore sidecar.
Field | Description |
---|
autoCompactionMode CompactionMode | (Optional) AutoCompactionMode defines the auto-compaction-mode:‘periodic’ mode or ‘revision’ mode for etcd and embedded-Etcd of backup-restore sidecar. |
autoCompactionRetention string | (Optional) AutoCompactionRetention defines the auto-compaction-retention length for etcd as well as for embedded-Etcd of backup-restore sidecar. |
StorageProvider
(string
alias)
(Appears on:
StoreSpec)
StorageProvider defines the type of object store provider for storing backups.
StoreSpec
(Appears on:
BackupSpec,
EtcdCopyBackupsTaskSpec)
StoreSpec defines parameters related to ObjectStore persisting backups
Field | Description |
---|
container string | (Optional) Container is the name of the container the backup is stored at. |
prefix string | Prefix is the prefix used for the store. |
provider StorageProvider | (Optional) Provider is the name of the backup provider. |
secretRef Kubernetes core/v1.SecretReference | (Optional) SecretRef is the reference to the secret which is used to connect to the backup store. |
TLSConfig
(Appears on:
BackupSpec,
EtcdConfig)
TLSConfig holds the TLS configuration details.
WaitForFinalSnapshotSpec
(Appears on:
EtcdCopyBackupsTaskSpec)
WaitForFinalSnapshotSpec defines the parameters for waiting for a final full snapshot before copying backups.
Field | Description |
---|
enabled bool | Enabled specifies whether to wait for a final full snapshot before copying backups. |
timeout Kubernetes meta/v1.Duration | (Optional) Timeout is the timeout for waiting for a final full snapshot. When this timeout expires, the copying of backups
will be performed anyway. No timeout or 0 means wait forever. |
Generated with gen-crd-api-reference-docs
2 - 01 Multi Node Etcd Clusters
Multi-node etcd cluster instances via etcd-druid
This document proposes an approach (along with some alternatives) to support provisioning and management of multi-node etcd cluster instances via etcd-druid and etcd-backup-restore.
Goal
- Enhance etcd-druid and etcd-backup-restore to support provisioning and management of multi-node etcd cluster instances within a single Kubernetes cluster.
- The etcd CRD interface should be simple to use. It should preferably work with just setting the spec.replicas field to the desired value and should not require any more configuration in the CRD than currently required for the single-node etcd instances. The spec.replicas field is part of the scale sub-resource implementation in the Etcd CRD.
- The single-node and multi-node scenarios must be automatically identified and managed by etcd-druid and etcd-backup-restore.
- The etcd clusters (single-node or multi-node) managed by etcd-druid and etcd-backup-restore must automatically recover from failures (even quorum loss) and disaster (e.g. etcd member persistence/data loss) as much as possible.
- It must be possible to dynamically scale an etcd cluster horizontally (even between single-node and multi-node scenarios) by simply scaling the Etcd scale sub-resource.
- It must be possible to (optionally) schedule the individual members of an etcd cluster on different nodes or even infrastructure availability zones (within the hosting Kubernetes cluster).
Though this proposal tries to cover most aspects related to single-node and multi-node etcd clusters, there are some more points that are not goals for this document but are still in the scope of either etcd-druid/etcd-backup-restore and/or gardener.
In such cases, a high-level description of how they can be addressed in the future is mentioned at the end of the document.
Background and Motivation
Single-node etcd cluster
At present, etcd-druid supports only single-node etcd cluster instances.
The advantages of this approach are given below.
- The problem domain is smaller. There are no leader election and quorum related issues to be handled. It is simpler to set up and manage a single-node etcd cluster.
- Single-node etcd cluster instances have less request latency than multi-node etcd clusters because there is no requirement to replicate the changes to the other members before committing the changes.
- etcd-druid provisions etcd cluster instances as pods (actually as statefulsets) in a Kubernetes cluster and Kubernetes is quick (<20s) to restart containers/pods if they go down.
  - Also, etcd-druid is currently only used by gardener to provision etcd clusters to act as back-ends for Kubernetes control-planes, and Kubernetes control-plane components (kube-apiserver, kubelet, kube-controller-manager, kube-scheduler etc.) can tolerate etcd going down and recover when it comes back up.
- Single-node etcd clusters incur less cost (CPU, memory and storage).
- It is easy to cut off client requests if backups fail by using a readinessProbe on the etcd-backup-restore healthz endpoint to minimize the gap between the latest revision and the backup revision.
The disadvantages of using single-node etcd clusters are given below.
- The database verification step by etcd-backup-restore can introduce additional delays whenever the etcd container/pod restarts (in total ~20-25s). This can be much longer if a database restoration is required, especially if there are incremental snapshots that need to be replayed (this can be mitigated by compacting the incremental snapshots in the background).
- Kubernetes control-plane components can go into CrashloopBackoff if etcd is down for some time. This is mitigated by the dependency-watchdog. But Kubernetes control-plane components require a lot of resources and create a lot of load on the etcd cluster and the apiserver when they come out of CrashloopBackoff, especially in medium or large sized clusters (> 20 nodes).
- Maintenance operations such as updates to etcd (and updates to etcd-druid or etcd-backup-restore), rolling updates to the nodes of the underlying Kubernetes cluster and vertical scaling of etcd pods are disruptive because they cause etcd pods to be restarted. The vertical scaling of etcd pods is somewhat mitigated during scale down by doing it only during the target clusters' maintenance window. But scale up is still disruptive.
- We currently use some form of elastic storage (via persistentvolumeclaims) for storing etcd data, which has some upper bounds on the I/O latency and throughput. This can potentially be a problem for large clusters (> 220 nodes). Also, some cloud providers (e.g. Azure) take a long time to attach/detach volumes to and from machines, which increases the downtime for the Kubernetes components that depend on etcd. It is difficult to use ephemeral/local storage (to achieve better latency/throughput as well as to circumvent volume attachment/detachment) for single-node etcd cluster instances.
Multi-node etcd-cluster
The advantages of introducing support for multi-node etcd clusters via etcd-druid are given below.
- A multi-node etcd cluster is highly available. It can tolerate disruption to individual etcd pods as long as the quorum is not lost (i.e. more than half of the etcd member pods are healthy and ready).
- Maintenance operations such as updates to etcd (and updates to etcd-druid or etcd-backup-restore), rolling updates to the nodes of the underlying Kubernetes cluster and vertical scaling of etcd pods can be done non-disruptively by respecting poddisruptionbudgets for the various multi-node etcd cluster instances hosted on that cluster.
- Kubernetes control-plane components do not see any etcd cluster downtime unless quorum is lost (which is expected to be a lot less frequent than the current frequency of etcd container/pod restarts).
- We can consider using ephemeral/local storage for multi-node etcd cluster instances because individual member restarts can afford to take time to restore from backup before (re)joining the etcd cluster, since the remaining members serve the requests in the meantime.
- High availability across availability zones is also possible by specifying (anti-)affinity for the etcd pods (possibly via kupid).
Some disadvantages of using multi-node etcd clusters due to which it might still be desirable, in some cases, to continue to use single-node etcd cluster instances in the gardener context are given below.
- Multi-node etcd cluster instances are more complex to manage. The problem domain is larger, including the following.
  - Leader election
  - Quorum loss
  - Managing rolling changes
  - Backups to be taken from only the leading member.
  - More complexity in cutting off client requests if backups fail, to keep the gap between the latest revision and the backup revision under control.
- Multi-node etcd cluster instances incur more cost (CPU, memory and storage).
Dynamic multi-node etcd cluster
Though it is not part of this proposal, it is conceivable to convert a single-node etcd cluster into a multi-node etcd cluster temporarily to perform some disruptive operation (etcd, etcd-backup-restore or etcd-druid updates, etcd cluster vertical scaling and perhaps even node rollout) and convert it back to a single-node etcd cluster once the disruptive operation has been completed. While this will necessarily still involve a downtime, because scaling from a single-node etcd cluster to a three-node etcd cluster will involve etcd pod restarts, it is still probable that it can be managed with a shorter downtime than we see at present for single-node etcd clusters (on the other hand, converting a three-node etcd cluster to a five-node etcd cluster can be non-disruptive).
This is definitely not to argue in favour of such a dynamic approach in all cases (eventually, if/when dynamic multi-node etcd clusters are supported). On the contrary, it makes sense to make use of static (fixed-size) multi-node etcd clusters for production scenarios because of the high availability.
Prior Art
ETCD Operator from CoreOS
etcd operator
Project status: archived
This project is no longer actively developed or maintained. The project exists here for historical reference. If you are interested in the future of the project and taking over stewardship, please contact etcd-dev@googlegroups.com.
etcdadm from kubernetes-sigs
etcdadm is a command-line tool for operating an etcd cluster. It makes it easy to create a new cluster, add a member to, or remove a member from an existing cluster. Its user experience is inspired by kubeadm.
It is a tool more tailored for manual command-line based management of etcd clusters with no API’s.
It also makes no assumptions about the underlying platform on which the etcd clusters are provisioned and hence, doesn’t leverage any capabilities of Kubernetes.
Etcd Cluster Operator from Improbable-Engineering
Etcd Cluster Operator
Etcd Cluster Operator is an Operator for automating the creation and management of etcd inside of Kubernetes. It provides a custom resource definition (CRD) based API to define etcd clusters with Kubernetes resources, and enable management with native Kubernetes tooling.
Out of all the alternatives listed here, this one seems to be the only possible viable alternative.
Parts of its design/implementation are similar to some of the approaches mentioned in this proposal. However, we still don't propose to use it because:
- The project is still in an early phase and is not mature enough to be consumed as-is in our productive scenarios.
- The restoration part is completely different, which makes it difficult to adopt as-is and requires a lot of rework to fit the current restoration semantics of etcd-backup-restore, making its usage counter-productive.
General Approach to ETCD Cluster Management
Bootstrapping
There are three ways to bootstrap an etcd cluster: static, etcd discovery and DNS discovery.
Out of these, the static way is the simplest (and probably faster to bootstrap the cluster) and has the least external dependencies.
Hence, it is preferred in this proposal.
But it requires that the initial (during bootstrapping) etcd cluster size (number of members) is already known before bootstrapping and that all of the members are already addressable (DNS,IP,TLS etc.).
Such information needs to be passed to the individual members during startup using the following static configuration.
- ETCD_INITIAL_CLUSTER - The list of peer URLs including all the members. This must be the same as the advertised peer URLs configuration. This can also be passed as the initial-cluster flag to etcd.
- ETCD_INITIAL_CLUSTER_STATE - This should be set to new while bootstrapping an etcd cluster.
- ETCD_INITIAL_CLUSTER_TOKEN - This is a token to distinguish the etcd cluster from any other etcd cluster in the same network.
Assumptions
- ETCD_INITIAL_CLUSTER can use DNS instead of IP addresses. We need to verify this by deleting a pod (as against scaling down the statefulset) to ensure that the pod IP changes and see if the recreated pod (by the statefulset controller) re-joins the cluster automatically.
- DNS for the individual members is known or computable. This is true in the case of etcd-druid setting up an etcd cluster using a single statefulset. But it may not necessarily be true in other cases (multiple statefulsets per etcd cluster, or deployments instead of statefulsets, or in the case of an etcd cluster with members distributed across more than one Kubernetes cluster).
Adding a new member to an etcd cluster
A new member can be added to an existing etcd cluster instance using the following steps.
- If the latest backup snapshot exists, restore the member's etcd data to the latest backup snapshot. This can reduce the load on the leader to bring the new member up to date when it joins the cluster.
- If the latest backup snapshot doesn't exist, or if the latest backup snapshot is not accessible (please see backup failure) and the cluster itself is quorate, then the new member can be started with empty data. But this will be suboptimal because the new member will fetch all the data from the leading member to get up to date.
- The cluster is informed that a new member is being added using the MemberAdd API, including information like the member name and its advertised peer URLs.
- The new etcd member is then started with ETCD_INITIAL_CLUSTER_STATE=existing apart from other required configuration.
This proposal recommends this approach.
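A minimal sketch of this member-add flow, using the etcd v3 Go client, is shown below. The endpoint and peer URL values are illustrative placeholders, and TLS configuration as well as the surrounding orchestration (snapshot restoration, configuration generation) are omitted.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the existing, quorate cluster via its client endpoint
	// (a placeholder DNS name; TLS settings are omitted for brevity).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-main-client:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Inform the cluster that a new member is being added, passing its advertised peer URL.
	resp, err := cli.MemberAdd(ctx, []string{"http://etcd-main-3.etcd-main-peer:2380"})
	if err != nil {
		log.Fatal(err)
	}

	// The new etcd process must now be started with ETCD_INITIAL_CLUSTER_STATE=existing.
	log.Printf("added member with id %x", resp.Member.ID)
}
```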
Note
- If there are incremental snapshots (taken by etcd-backup-restore), they cannot be applied because that requires the member to be started in isolation without joining the cluster, which is not possible. This is acceptable if the amount of incremental snapshot data is kept relatively small. This adds one more reason to increase the priority of the issue of incremental snapshot compaction.
- There is a time window, between the MemberAdd call and the new member joining the cluster and getting up to date, where the cluster is vulnerable to leader elections which could be disruptive.
Alternative
With v3.4, the new raft learner approach can be used to mitigate some of the possible disruptions mentioned above.
Then the steps will be as follows.
- If the latest backup snapshot exists, restore the member's etcd data to the latest backup snapshot. This can reduce the load on the leader to bring the new member up to date when it joins the cluster.
- The cluster is informed that a new member is being added using the MemberAddAsLearner API, including information like the member name and its advertised peer URLs.
- The new etcd member is then started with ETCD_INITIAL_CLUSTER_STATE=existing apart from other required configuration.
- Once the new member (learner) is up to date, it can be promoted to a full voting member by using the MemberPromote API.
This approach is new and involves more steps and is not recommended in this proposal.
It can be considered in future enhancements.
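For reference, a hedged sketch of the learner-based variant is shown below; it assumes an already connected clientv3.Client and leaves out the logic that starts the new process and waits for the learner to catch up.

```go
package sketch

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// addAsLearnerAndPromote sketches the learner-based join available since etcd v3.4.
func addAsLearnerAndPromote(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	// Add the new member as a non-voting learner first, so that it cannot
	// affect quorum or trigger leader elections while it catches up.
	resp, err := cli.MemberAddAsLearner(ctx, []string{peerURL})
	if err != nil {
		return err
	}

	// ... here the new etcd process is started with ETCD_INITIAL_CLUSTER_STATE=existing
	// and allowed to catch up with the leader ...

	// Promote the learner to a full voting member once it is up to date.
	_, err = cli.MemberPromote(ctx, resp.Member.ID)
	return err
}
```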
Managing Failures
A multi-node etcd cluster may face failures of different kinds during its life-cycle.
The actions that need to be taken to manage these failures depend on the failure mode.
Removing an existing member from an etcd cluster
If a member of an etcd cluster becomes unhealthy, it must be explicitly removed from the etcd cluster, as soon as possible.
This can be done by using the MemberRemove
API.
This ensures that only healthy members participate as voting members.
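A small sketch of such a removal with the etcd v3 Go client is shown below; the lookup by member name assumes the naming scheme proposed later in this document (member name equals pod name) and is purely illustrative.

```go
package sketch

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// removeMemberByName looks up a member by its name and removes it from the cluster,
// so that only healthy members keep participating as voting members.
func removeMemberByName(ctx context.Context, cli *clientv3.Client, name string) error {
	list, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range list.Members {
		if m.Name == name {
			_, err := cli.MemberRemove(ctx, m.ID)
			return err
		}
	}
	return fmt.Errorf("member %q not found in the cluster", name)
}
```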
A member of an etcd cluster may be removed not just for managing failures but also for other reasons, such as:
- The etcd cluster is being scaled down, i.e. the cluster size is being reduced.
- An existing member is being replaced by a new one for some reason (e.g. upgrades).
If the majority of the members of the etcd cluster are healthy and the member that is unhealthy/being removed happens to be the leader at that moment, then the etcd cluster will automatically elect a new leader.
But if only a minority of the etcd cluster members are healthy after removing the member, then the cluster will no longer be quorate and will stop accepting write requests.
Such an etcd cluster needs to be recovered via some kind of disaster recovery.
Restarting an existing member of an etcd cluster
If the existing member of an etcd cluster restarts and retains an uncorrupted data directory after the restart, then it can simply re-join the cluster as an existing member without any API calls or configuration changes.
This is because the relevant metadata (including member ID and cluster ID) are maintained in the write ahead logs.
However, if it doesn’t retain an uncorrupted data directory after the restart, then it must first be removed and added as a new member.
Recovering an etcd cluster from failure of majority of members
If a majority of members of an etcd cluster fail but retain their uncorrupted data directories, then they can simply be restarted and they will re-form the existing etcd cluster when they come up.
However, if they do not retain their uncorrupted data directory, then the etcd cluster must be recovered from latest snapshot in the backup.
This is very similar to bootstrapping with the additional initial step of restoring the latest snapshot in each of the members.
However, the same limitation about incremental snapshots, as in the case of adding a new member, applies here.
But unlike in the case of adding a new member, not applying incremental snapshots is not acceptable in the case of etcd cluster recovery.
Hence, if incremental snapshots are required to be applied, the etcd cluster must be recovered in the following steps.
- Restore a new single-member cluster using the latest snapshot.
- Apply incremental snapshots on the single-member cluster.
- Take a full snapshot which can now be used while adding the remaining members.
- Add new members using the latest snapshot created in the step above.
Kubernetes Context
- Users will provision an etcd cluster in a Kubernetes cluster by creating an Etcd CRD resource instance.
- A multi-node etcd cluster is indicated if the spec.replicas field is set to any value greater than 1. etcd-druid will add validation to ensure that the spec.replicas value is an odd number, according to the requirements of etcd.
- The etcd-druid controller will provision a statefulset with the etcd main container and the etcd-backup-restore sidecar container. It will pass on the spec.replicas field from the Etcd resource to the statefulset. It will also supply the right pre-computed configuration to both the containers.
- The statefulset controller will create the pods based on the pod template in the statefulset spec, and these individual pods will be the members that form the etcd cluster.
This approach makes it possible to satisfy the assumption that the DNS for the individual members of the etcd cluster must be known/computable.
This can be achieved by using a headless service (along with the statefulset) for each etcd cluster instance.
Then we can address individual pods/etcd members via the predictable DNS name <statefulset_name>-{0|1|2|3|…|n}.<headless_service_name> from within the Kubernetes namespace (or from outside the Kubernetes namespace by appending the .<namespace>.svc.<cluster_domain> suffix).
The etcd-druid controller can compute the above configurations automatically based on the spec.replicas
in the etcd resource.
This proposal recommends this approach.
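The sketch below illustrates how such a static configuration value (here, ETCD_INITIAL_CLUSTER) could be derived from the statefulset name, replica count and headless service; the names, the peer port and the cluster domain are placeholders, not the values etcd-druid actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// initialCluster builds the ETCD_INITIAL_CLUSTER value from the predictable DNS names
// of the statefulset pods behind a headless service.
func initialCluster(sts, svc, namespace, clusterDomain string, replicas, peerPort int) string {
	entries := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		member := fmt.Sprintf("%s-%d", sts, i)
		entries = append(entries, fmt.Sprintf("%s=http://%s.%s.%s.svc.%s:%d",
			member, member, svc, namespace, clusterDomain, peerPort))
	}
	return strings.Join(entries, ",")
}

func main() {
	// e.g. etcd-main-0=http://etcd-main-0.etcd-main-peer.shoot--foo--bar.svc.cluster.local:2380,...
	fmt.Println(initialCluster("etcd-main", "etcd-main-peer", "shoot--foo--bar", "cluster.local", 3, 2380))
}
```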
Alternative
One statefulset is used for each member (instead of one statefulset for all members).
While this approach gives a flexibility to have different pod specifications for the individual members, it makes managing the individual members (e.g. rolling updates) more complicated.
Hence, this approach is not recommended.
ETCD Configuration
As mentioned in the general approach section, there are differences in the configuration that needs to be passed to individual members of an etcd cluster in different scenarios such as bootstrapping, adding a new member, removing a member, restarting an existing member etc.
Managing such differences in configuration for individual pods of a statefulset is tricky in the recommended approach of using a single statefulset to manage all the member pods of an etcd cluster.
This is because statefulset uses the same pod template for all its pods.
The recommendation is for etcd-druid
to provision the base configuration template in a ConfigMap
which is passed to all the pods via the pod template in the StatefulSet
.
The initialization
flow of etcd-backup-restore
(which is invoked every time the etcd container is (re)started) is then enhanced to generate the customized etcd configuration for the corresponding member pod (in a shared volume between etcd and the backup-restore containers) based on the supplied template configuration.
This will require etcd-backup-restore to have a mechanism to detect which of the scenarios listed above applies during any given member container/pod restart.
Alternative
As mentioned above, one statefulset is used for each member of the etcd cluster.
Then different configuration (generated directly by etcd-druid
) can be passed in the pod templates of the different statefulsets.
Though this approach is advantageous in the context of managing the different configuration, it is not recommended in this proposal because it makes the rest of the management (e.g. rolling updates) more complicated.
Data Persistence
The type of persistence used to store etcd data (including the member ID and cluster ID) has an impact on the steps that are needed to be taken when the member pods or containers (minority of them or majority) need to be recovered.
Persistent
Like the single-node case, persistentvolumes
can be used to persist ETCD data for all the member pods. The individual member pods then get their own persistentvolumes
.
The advantage is that individual members retain their member ID across pod restarts and even pod deletion/recreation across Kubernetes nodes.
This means that member pods that crash (or are unhealthy) can be restarted automatically (by configuring a livenessProbe) and they will re-join the etcd cluster using their existing member ID without any need for explicit etcd cluster management.
The disadvantages of this approach are as follows.
- The number of persistentvolumes increases linearly with the cluster size which is a cost-related concern.
- Network-mounted persistentvolumes might eventually become a performance bottleneck under heavy load for a latency-sensitive component like ETCD.
- Volume attach/detach issues when associated with etcd cluster instances cause downtimes to the target shoot clusters that are backed by those etcd cluster instances.
Ephemeral
The ephemeral volumes use-case is considered as an optimization and may be planned as a follow-up action.
Disk
Ephemeral persistence can be achieved in Kubernetes by using either emptyDir
volumes or local
persistentvolumes to persist ETCD data.
The advantages of this approach are as follows.
- Potentially faster disk I/O.
- The number of persistent volumes does not increase linearly with the cluster size (at least not technically).
- Issues related to volume attachment/detachment can be avoided.
The main disadvantage of using ephemeral persistence is that the individual members may retain their identity and data across container restarts but not across pod deletion/recreation across Kubernetes nodes. If the data is lost then on restart of the member pod, the older member (represented by the container) has to be removed and a new member has to be added.
Using emptyDir
ephemeral persistence has the disadvantage that the volume doesn’t have its own identity.
So, if the member pod is recreated but scheduled on the same node as before then it will not retain the identity as the persistence is lost.
But it has the advantage that scheduling of pods is unencumbered especially during pod recreation as they are free to be scheduled anywhere.
Using local persistentvolumes has the advantage that the volume has its own identity and hence, a recreated member pod will retain its identity if scheduled on the same node.
But it has the disadvantage of tying down the member pod to a node, which is a problem if the node becomes unhealthy, requiring etcd-druid to take additional actions (such as deleting the local persistent volume).
Based on these constraints, if ephemeral persistence is opted for, it is recommended to use emptyDir
ephemeral persistence.
In-memory
In-memory ephemeral persistence can be achieved in Kubernetes by using emptyDir
with medium: Memory
.
In this case, a tmpfs
(RAM-backed file-system) volume will be used.
In addition to the advantages of ephemeral persistence, this approach can achieve the fastest possible disk I/O.
Similarly, in addition to the disadvantages of ephemeral persistence, in-memory persistence has its own additional disadvantages.
Since the likelihood of a member not having valid metadata in the WAL files is much higher in the ephemeral persistence scenario, one option is to pass the information that ephemeral persistence is being used to the etcd-backup-restore
sidecar (say, via command-line flags or environment variables).
But in principle, it might be better to determine this from the WAL files directly so that the possibility of corrupted WAL files also gets handled correctly.
To do this, the wal package has some functions that might be useful.
Recommendation
Using the wal package to verify whether valid metadata exists might be performance intensive.
So, the performance impact needs to be measured.
If the performance impact is acceptable (both in terms of resource usage and time), it is recommended to use this way to verify if the member contains valid metadata.
Otherwise, alternatives such as a simple check that WAL folder exists coupled with the static information about use of persistent or ephemeral storage might be considered.
How to detect if valid data exists in an etcd member
The initialization sequence in etcd-backup-restore
already includes database verification.
This would suffice to determine if the member has valid data.
Recommendation
Though ephemeral persistence has performance and logistics advantages,
it is recommended to start with persistent data for the member pods.
In addition to the reasons and concerns listed above, there is also the additional concern that in case of backup failure, the risk of additional data loss is a bit higher if ephemeral persistence is used (simultaneous quorum loss is sufficient) when compared to persistent storage (simultaneous quorum loss with majority persistence loss is needed).
The risk might still be acceptable but the idea is to gain experience about how frequently member containers/pods get restarted/recreated, how frequently leader election happens among members of an etcd cluster and how frequently etcd clusters lose quorum.
Based on this experience, we can move towards using ephemeral (perhaps even in-memory) persistence for the member pods.
Separating peer and client traffic
The current single-node ETCD cluster implementation in etcd-druid
and etcd-backup-restore
uses a single service
object to act as the entry point for the client traffic.
There is no separation or distinction between the client and peer traffic because there is not much benefit to be had by making that distinction.
In the multi-node ETCD cluster scenario, it makes sense to distinguish between and separate the peer and client traffic.
This can be done by using two services
.
- peer - To be used for peer communication. This could be a headless service.
- client - To be used for client communication. This could be a normal ClusterIP service like it is in the single-node case.
The main advantage of this approach is that it makes it possible (if needed) to allow only peer to peer communication while blocking client communication. Such a thing might be required during some phases of some maintenance tasks (manual or automated).
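A sketch of the two services, expressed with client-go types, is shown below; the names, selector and ports are illustrative assumptions and not the objects etcd-druid actually generates.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// etcdServices returns a normal ClusterIP service for client traffic and a headless
// service for peer traffic, both selecting the same member pods.
func etcdServices(name, namespace string, selector map[string]string) (client, peer *corev1.Service) {
	client = &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-client", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeClusterIP,
			Selector: selector,
			Ports:    []corev1.ServicePort{{Name: "client", Port: 2379, TargetPort: intstr.FromInt(2379)}},
		},
	}
	peer = &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name + "-peer", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: gives each member pod a stable DNS name
			Selector:  selector,
			Ports:     []corev1.ServicePort{{Name: "peer", Port: 2380, TargetPort: intstr.FromInt(2380)}},
		},
	}
	return client, peer
}
```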
Cutting off client requests
At present, in the single-node ETCD instances, etcd-druid configures the readinessProbe of the etcd main container to probe the healthz endpoint of the etcd-backup-restore sidecar which considers the status of the latest backup upload in addition to the regular checks about etcd and the side car being up and healthy. This has the effect of setting the etcd main container (and hence the etcd pod) as not ready if the latest backup upload failed. This results in the endpoints controller removing the pod IP address from the endpoints list for the service which eventually cuts off ingress traffic coming into the etcd pod via the etcd client service. The rationale for this is to fail early when the backup upload fails rather than continuing to serve requests while the gap between the last backup and the current data increases which might lead to unacceptably large amount of data loss if disaster strikes.
This approach will not work in the multi-node scenario because we need the individual member pods to be able to talk to each other to maintain the cluster quorum when backup upload fails but need to cut off only client ingress traffic.
It is recommended to separate the backup health condition tracking from taking appropriate remedial actions.
With that, the backup health condition tracking is now separated to the BackupReady
condition in the Etcd
resource status
and the cutting off of client traffic (which could now be done for more reasons than failed backups) can be achieved in a different way described below.
Manipulating Client Service podSelector
The client traffic can be cut off by updating (manually or automatically by some component) the podSelector
of the client service to add an additional label (say, unhealthy or disabled) such that the podSelector
no longer matches the member pods created by the statefulset.
This will result in the client ingress traffic being cut off.
The peer service is left unmodified so that peer communication is always possible.
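A hedged sketch of such a selector manipulation with client-go is shown below; the service name and the extra disabled selector key are assumptions made for illustration.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cutOffClientTraffic adds an extra key to the client service's podSelector so that it no
// longer matches the member pods; the endpoints controller then removes the pod IPs from
// the service endpoints and client ingress traffic stops, while peer traffic is unaffected.
func cutOffClientTraffic(ctx context.Context, cs kubernetes.Interface, namespace, serviceName string) error {
	svc, err := cs.CoreV1().Services(namespace).Get(ctx, serviceName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if svc.Spec.Selector == nil {
		svc.Spec.Selector = map[string]string{}
	}
	svc.Spec.Selector["disabled"] = "true" // member pods do not carry this label
	_, err = cs.CoreV1().Services(namespace).Update(ctx, svc, metav1.UpdateOptions{})
	return err
}
```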
Health Check
The etcd main container and the etcd-backup-restore sidecar containers will be configured with livenessProbe and readinessProbe which will indicate the health of the containers and effectively the corresponding ETCD cluster member pod.
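For illustration, a hedged sketch of such probe definitions (using recent k8s.io/api types) is shown below; the paths, port and timings are placeholders rather than the values etcd-druid actually configures.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// memberProbes returns example liveness and readiness probes for an ETCD member container,
// both pointing at hypothetical HTTP health endpoints.
func memberProbes() (liveness, readiness *corev1.Probe) {
	liveness = &corev1.Probe{
		ProbeHandler:        corev1.ProbeHandler{HTTPGet: &corev1.HTTPGetAction{Path: "/livez", Port: intstr.FromInt(8080)}},
		InitialDelaySeconds: 15,
		PeriodSeconds:       10,
	}
	readiness = &corev1.Probe{
		ProbeHandler:  corev1.ProbeHandler{HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)}},
		PeriodSeconds: 5,
	}
	return liveness, readiness
}
```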
Backup Failure
As described above, using readinessProbe failures based on the latest backup failure is not viable in the multi-node ETCD scenario.
Though cutting off traffic by manipulating the client service podSelector is workable, it may not be desirable.
It is recommended that on backup failure, the leading etcd-backup-restore sidecar (the one that is responsible for taking backups at that point in time, as explained in the backup section below) updates the BackupReady condition in the Etcd status and raises a high priority alert to the landscape operators, but does not cut off the client traffic.
The reasoning behind this decision to not cut off the client traffic on backup failure is to allow the Kubernetes cluster’s control plane (which relies on the ETCD cluster) to keep functioning as long as possible and to avoid bringing down the control-plane due to a missed backup.
The risk of this approach is that with a cascaded sequence of failures (on top of the backup failure), there is a chance of more data loss than the frequency of backup would otherwise indicate.
To be precise, the risk of such an additional data loss manifests only when backup failure as well as a special case of quorum loss (majority of the members are not ready) happen in such a way that the ETCD cluster needs to be re-bootstrapped from the backup.
As described here, re-bootstrapping the ETCD cluster requires restoration from the latest backup only when a majority of members no longer have uncorrupted data persistence.
If persistent storage is used, this will happen only when backup failure as well as a majority of the disks/volumes backing the ETCD cluster members fail simultaneously.
This would indeed be rare and might be an acceptable risk.
If ephemeral storage is used (especially, in-memory), the data loss will happen if a majority of the ETCD cluster members become NotReady
(requiring a pod restart) at the same time as the backup failure.
This may not be as rare as majority members’ disk/volume failure.
The risk can be somewhat mitigated at least for planned maintenance operations by postponing potentially disruptive maintenance operations when BackupReady
condition is false
(vertical scaling, rolling updates, evictions due to node roll-outs).
But in practice (when ephemeral storage is used), the current proposal suggests restoring from the latest full backup even when a minority of ETCD members (even a single pod) restart, both to speed up the process of the new member catching up to the latest revision and to avoid load on the leading member, which needs to supply the data to bring the new member up-to-date.
But as described here, in case of a minority member failure while using ephemeral storage, it is possible to restart the new member with empty data and let it fetch all the data from the leading member (only if backup is not accessible).
Though this is suboptimal, it is workable given the constraints and conditions.
With this, the risk of additional data loss in the case of ephemeral storage is only if backup failure as well as quorum loss happens.
While this is still less rare than the risk of additional data loss in the case of persistent storage, the risk might be tolerable, provided the risk of quorum loss is not too high. This needs to be monitored/evaluated before opting for ephemeral storage.
Given these constraints, it is better to dynamically avoid/postpone some potentially disruptive operations when the BackupReady condition is false:
- Skip/postpone potentially disruptive maintenance operations when the BackupReady condition is false. This covers vertical scaling and rolling updates; basically, any updates to the StatefulSet spec (which includes vertical scaling).
- Dynamically toggle the minAvailable field of the PodDisruptionBudget between n/2 + 1 and n (where n is the desired ETCD cluster size) whenever the BackupReady condition toggles between true and false (a sketch of this toggle follows below).

This has the effect of allowing n/2 members to be evicted when the backups are healthy and completely disabling evictions when backups are not healthy.
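A hedged sketch of such a toggle with client-go is shown below; the PodDisruptionBudget name and the way the BackupReady signal is obtained are assumptions made for illustration.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// syncPDB sets minAvailable to n/2+1 (quorum size) while backups are healthy, allowing up to
// n/2 voluntary evictions, and to n while backups are unhealthy, disallowing all evictions.
func syncPDB(ctx context.Context, cs kubernetes.Interface, namespace, pdbName string, clusterSize int, backupReady bool) error {
	pdb, err := cs.PolicyV1().PodDisruptionBudgets(namespace).Get(ctx, pdbName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	minAvailable := clusterSize // BackupReady is false: block all voluntary evictions
	if backupReady {
		minAvailable = clusterSize/2 + 1 // BackupReady is true: only protect the quorum
	}
	v := intstr.FromInt(minAvailable)
	pdb.Spec.MinAvailable = &v
	_, err = cs.PolicyV1().PodDisruptionBudgets(namespace).Update(ctx, pdb, metav1.UpdateOptions{})
	return err
}
```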
This will mean that etcd-backup-restore
becomes Kubernetes-aware. But there might be reasons for making etcd-backup-restore
Kubernetes-aware anyway (e.g. to update the etcd
resource status with latest full snapshot details).
This enhancement should keep etcd-backup-restore
backward compatible.
I.e. it should be possible to use etcd-backup-restore
Kubernetes-unaware as before this proposal.
This is possible either by auto-detecting the existence of kubeconfig or by an explicit command-line flag (such as --enable-client-service-updates
which can be defaulted to false
for backward compatibility).
Alternative
The alternative is for etcd-druid
to implement the above functionality.
But etcd-druid
is centrally deployed in the host Kubernetes cluster and cannot scale well horizontally.
So, it can potentially be a bottleneck if it is involved in regular health check mechanism for all the etcd clusters it manages.
Also, the recommended approach above is more robust because it can work even if etcd-druid
is down when the backup upload of a particular etcd cluster fails.
Status
It is desirable (for the etcd-druid
and landscape administrators/operators) to maintain/expose status of the etcd cluster instances in the status
sub-resource of the Etcd
CRD.
The proposed structure for maintaining the status is as shown in the example below.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
spec:
  replicas: 3
  ...
...
status:
  ...
  conditions:
  - type: Ready                  # Condition type for the readiness of the ETCD cluster
    status: "True"               # Indicates if the ETCD cluster is ready or not
    lastHeartbeatTime: "2020-11-10T12:48:01Z"
    lastTransitionTime: "2020-11-10T12:48:01Z"
    reason: Quorate              # Quorate|QuorumLost
  - type: AllMembersReady        # Condition type for the readiness of all the members of the ETCD cluster
    status: "True"               # Indicates if all the members of the ETCD cluster are ready
    lastHeartbeatTime: "2020-11-10T12:48:01Z"
    lastTransitionTime: "2020-11-10T12:48:01Z"
    reason: AllMembersReady      # AllMembersReady|NotAllMembersReady
  - type: BackupReady            # Condition type for the readiness of the backup of the ETCD cluster
    status: "True"               # Indicates if the backup of the ETCD cluster is ready
    lastHeartbeatTime: "2020-11-10T12:48:01Z"
    lastTransitionTime: "2020-11-10T12:48:01Z"
    reason: FullBackupSucceeded  # FullBackupSucceeded|IncrementalBackupSucceeded|FullBackupFailed|IncrementalBackupFailed
  ...
  clusterSize: 3
  ...
  replicas: 3
  ...
  members:
  - name: etcd-main-0            # member pod name
    id: 272e204152               # member Id
    role: Leader                 # Member|Leader
    status: Ready                # Ready|NotReady|Unknown
    lastTransitionTime: "2020-11-10T12:48:01Z"
    reason: LeaseSucceeded       # LeaseSucceeded|LeaseExpired|UnknownGracePeriodExceeded|PodNotReady
  - name: etcd-main-1            # member pod name
    id: 272e204153               # member Id
    role: Member                 # Member|Leader
    status: Ready                # Ready|NotReady|Unknown
    lastTransitionTime: "2020-11-10T12:48:01Z"
    reason: LeaseSucceeded       # LeaseSucceeded|LeaseExpired|UnknownGracePeriodExceeded|PodNotReady
This proposal recommends that etcd-druid
(preferably, the custodian
controller in etcd-druid
) maintains most of the information in the status
of the Etcd
resources described above.
One exception to this is the BackupReady
condition which is recommended to be maintained by the leading etcd-backup-restore
sidecar container.
This will mean that etcd-backup-restore
becomes Kubernetes-aware. But there are other reasons for making etcd-backup-restore
Kubernetes-aware anyway (e.g. to maintain health conditions).
This enhancement should keep etcd-backup-restore
backward compatible.
But it should be possible to use etcd-backup-restore
Kubernetes-unaware as before this proposal. This is possible either by auto-detecting the existence of kubeconfig or by an explicit command-line flag (such as --enable-etcd-status-updates
which can be defaulted to false
for backward compatibility).
Members
The members
section of the status is intended to be maintained by etcd-druid
(preferably, the custodian
controller of etcd-druid
) based on the leases
of the individual members.
Note
An earlier design in this proposal was for the individual etcd-backup-restore
sidecars to update the corresponding status.members
entries themselves. But this was redesigned to use member leases
to avoid conflicts rising from frequent updates and the limitations in the support for Server-Side Apply in some versions of Kubernetes.
The spec.holderIdentity
field in the leases
is used to communicate the ETCD member id
and role
between the etcd-backup-restore
sidecars and etcd-druid
.
Member name as the key
In an ETCD cluster, the member id
is the unique identifier for a member.
However, this proposal recommends using a single StatefulSet
whose pods form the members of the ETCD cluster and Pods
of a StatefulSet
have uniquely indexed names as well as uniquely addressable DNS.
This proposal recommends that the name
of the member (which is the same as the name of the member Pod
) be used as the unique key to identify a member in the members
array.
This can minimise the need to cleanup superfluous entries in the members
array after the member pods are gone to some extent because the replacement pods for any member will share the same name
and will overwrite the entry with a possibly new member id
.
There is still the possibility of not only superfluous entries in the members
array but also superfluous members
in the ETCD cluster for which there is no corresponding pod in the StatefulSet
anymore.
For example, if an ETCD cluster is scaled up from 3 to 5 and the new members were failing constantly due to insufficient resources, and then the ETCD cluster is scaled back down to 3, the failing member pods may not have the chance to clean up their member entries (from the members array as well as from the ETCD cluster), leading to superfluous members in the cluster that may have an adverse effect on the quorum of the cluster.
Hence, the superfluous entries in both members
array as well as the ETCD cluster need to be cleaned up as appropriate.
Member Leases
One Kubernetes lease
object per desired ETCD member is maintained by etcd-druid
(preferably, the custodian
controller in etcd-druid
).
The lease
objects will be created in the same namespace
as their owning Etcd
object and will have the same name
as the member to which they correspond (which, in turn would be the same as the pod
name in which the member ETCD process runs).
The lease
objects are created and deleted only by etcd-druid
but are continually renewed within the leaseDurationSeconds
by the individual etcd-backup-restore
sidecars (corresponding to their members) if the corresponding ETCD member is ready and is part of the ETCD cluster.
This will mean that etcd-backup-restore
becomes Kubernetes-aware. But there are other reasons for making etcd-backup-restore
Kubernetes-aware anyway (e.g. to maintain health conditions).
This enhancement should keep etcd-backup-restore
backward compatible.
But it should be possible to use etcd-backup-restore
Kubernetes-unaware as before this proposal. This is possible either by auto-detecting the existence of kubeconfig or by an explicit command-line flag (such as --enable-etcd-lease-renewal
which can be defaulted to false
for backward compatibility).
A member
entry in the Etcd
resource status
would be marked as Ready
(with reason: LeaseSucceeded
) if the corresponding pod
is ready and the corresponding lease
has not yet expired.
The member
entry would be marked as NotReady
if the corresponding pod
is not ready (with reason PodNotReady
) or as Unknown
if the corresponding lease
has expired (with reason: LeaseExpired
).
While renewing the lease, the etcd-backup-restore
sidecars also maintain the ETCD member id
and their role
(Leader
or Member
) separated by :
in the spec.holderIdentity
field of the corresponding lease
object since this information is only available to the ETCD
member processes and the etcd-backup-restore
sidecars (e.g. 272e204152:Leader
or 272e204153:Member
).
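A hedged sketch of such a renewal by the sidecar is shown below; the lease lookup by member name and the pointer helper are assumptions made for illustration.

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/pointer"
)

// renewMemberLease renews the member's lease and publishes "<member-id>:<role>"
// (e.g. "272e204152:Leader") in spec.holderIdentity for etcd-druid to pick up.
func renewMemberLease(ctx context.Context, cs kubernetes.Interface, namespace, memberName, memberID, role string) error {
	lease, err := cs.CoordinationV1().Leases(namespace).Get(ctx, memberName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	lease.Spec.HolderIdentity = pointer.String(fmt.Sprintf("%s:%s", memberID, role))
	_, err = cs.CoordinationV1().Leases(namespace).Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```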
When the lease
objects are created by etcd-druid
, the spec.holderIdentity
field would be empty.
The value in spec.holderIdentity
in the leases
is parsed and copied onto the id
and role
fields of the corresponding status.members
by etcd-druid
.
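A minimal sketch of that parsing step is shown below; it simply splits the holderIdentity value on the first colon, tolerating an empty value for a freshly created lease.

```go
package sketch

import "strings"

// parseHolderIdentity splits "<member-id>:<role>" (e.g. "272e204153:Member") into its parts.
func parseHolderIdentity(holderIdentity string) (id, role string) {
	parts := strings.SplitN(holderIdentity, ":", 2)
	id = parts[0]
	if len(parts) == 2 {
		role = parts[1]
	}
	return id, role
}
```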
Conditions
The conditions
section in the status describe the overall condition of the ETCD cluster.
The condition type Ready
indicates if the ETCD cluster as a whole is ready to serve requests (i.e. the cluster is quorate) even if a minority of the members are not ready.
The condition type AllMembersReady
indicates whether all the members of the ETCD cluster are ready.
The distinction between these conditions could be significant for both external consumers of the status as well as etcd-druid
itself.
Some maintenance operations might be safe to do (e.g. rolling updates) only when all members of the cluster are ready.
The condition type BackupReady
indicates whether the most recent backup upload (full or incremental) succeeded.
This information also might be significant because some maintenance operations might be safe to do (e.g. anything that involves re-bootstrapping the ETCD cluster) only when backup is ready.
The Ready
and AllMembersReady
conditions can be maintained by etcd-druid
based on the status in the members
section.
The BackupReady
condition will be maintained by the leading etcd-backup-restore
sidecar that is in charge of taking backups.
More condition types could be introduced in the future if specific purposes arise.
ClusterSize
The clusterSize
field contains the current size of the ETCD cluster. It will be actively kept up-to-date by etcd-druid
in all scenarios.
- Before bootstrapping the ETCD cluster (during cluster creation or later bootstrapping because of quorum failure),
etcd-druid
will clear the status.members
array and set status.clusterSize
to be equal to spec.replicas
.
- While the ETCD cluster is quorate,
etcd-druid
will actively set status.clusterSize
to be equal to length of the status.members
whenever the length of the array changes (say, due to scaling of the ETCD cluster).
Given that clusterSize
reliably represents the size of the ETCD cluster, it can be used to calculate the Ready
condition.
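Putting the members, conditions and clusterSize fields together, the status section of an Etcd resource could then look roughly as follows. This is a sketch only; the member names, IDs and the exact field names are illustrative assumptions, while the member statuses and reasons (LeaseSucceeded, LeaseExpired) follow the description above:
```yaml
status:
  clusterSize: 3
  members:
  - name: etcd-main-0
    id: "272e204152"
    role: Leader
    status: Ready
    reason: LeaseSucceeded
  - name: etcd-main-1
    id: "272e204153"
    role: Member
    status: Unknown
    reason: LeaseExpired
  conditions:
  - type: Ready
    status: "True"
  - type: AllMembersReady
    status: "False"
  - type: BackupReady
    status: "True"
```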
Alternative
The alternative is for etcd-druid
to maintain the status in the Etcd
status sub-resource.
But etcd-druid
is centrally deployed in the host Kubernetes cluster and cannot scale well horizontally.
So, it can potentially be a bottleneck if it is involved in regular health check mechanism for all the etcd clusters it manages.
Also, the recommended approach above is more robust because it can work even if etcd-druid
is down when the backup upload of a particular etcd cluster fails.
Decision table for etcd-druid based on the status
The following decision table describes the various criteria etcd-druid
takes into consideration to determine the different etcd cluster management scenarios and the corresponding reconciliation actions it must take.
The general principle is to detect the scenario and take the minimum action to move the cluster along the path to good health.
The path from any one scenario to a state of good health will typically involve going through multiple reconciliation actions which probably take the cluster through many other cluster management scenarios.
Especially, it is proposed that individual members auto-heal where possible, even in the case of the failure of a majority of members of the etcd cluster and that etcd-druid
takes action only if the auto-healing doesn’t happen for a configured period of time.
1. Pink of health
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: n
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
    - Members with expired lease: 0
  - conditions:
    - Ready: true
    - AllMembersReady: true
    - BackupReady: true
Recommended Action
Nothing to do
2. Member status is out of sync with their leases
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: r
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
    - Members with expired lease: l
  - conditions:
    - Ready: true
    - AllMembersReady: true
    - BackupReady: true
Recommended Action
Mark the l
members corresponding to the expired leases
as Unknown
with reason LeaseExpired
and with id
populated from spec.holderIdentity
of the lease
if they are not already updated accordingly.
Mark the n - l
members corresponding to the active leases
as Ready
with reason LeaseSucceeded
and with id
populated from spec.holderIdentity
of the lease
if they are not already updated accordingly.
Please refer here for more details.
3. All members are Ready
but AllMembersReady
condition is stale
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: n
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
    - Members with expired lease: 0
  - conditions:
    - Ready: N/A
    - AllMembersReady: false
    - BackupReady: N/A
Recommended Action
Mark the status condition type AllMembersReady
to true
.
4. Not all members are Ready
but AllMembersReady
condition is stale
Observed state
Recommended Action
Mark the status condition type AllMembersReady
to false
.
5. Majority members are Ready
but Ready
condition is stale
Observed state
Recommended Action
Mark the status condition type Ready
to true
.
6. Majority members are NotReady
but Ready
condition is stale
Observed state
Recommended Action
Mark the status condition type Ready
to false
.
7. Some members have been in Unknown
status for a while
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: N/A
    - Ready: N/A
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: N/A
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: u where u <= n
    - Members with expired lease: N/A
  - conditions:
    - Ready: N/A
    - AllMembersReady: N/A
    - BackupReady: N/A
Recommended Action
Mark the u
members as NotReady
in Etcd
status with reason: UnknownGracePeriodExceeded
.
8. Some member pods are not Ready
but have not had the chance to update their status
Observed state
- Cluster Size
- StatefulSet replicas
  - Desired: n
  - Ready: s where s < n
- Etcd status
  - members
    - Total: N/A
    - Ready: N/A
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: N/A
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: N/A
    - Members with expired lease: N/A
  - conditions:
    - Ready: N/A
    - AllMembersReady: N/A
    - BackupReady: N/A
Recommended Action
Mark the n - s
members (corresponding to the pods that are not Ready
) as NotReady
in Etcd
status with reason: PodNotReady
9. Quorate cluster with a minority of members NotReady
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: n - f
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: f where f < n/2
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
    - Members with expired lease: N/A
  - conditions:
    - Ready: true
    - AllMembersReady: false
    - BackupReady: true
Recommended Action
Delete the f
NotReady
member pods to force restart of the pods if they do not automatically restart via failed livenessProbe
. The expectation is that they will either re-join the cluster as an existing member or remove themselves and join as new members on restart of the container or pod and renew their leases
.
10. Quorum lost with a majority of members NotReady
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: n - f
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: f where f >= n/2
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: N/A
    - Members with expired lease: N/A
  - conditions:
    - Ready: false
    - AllMembersReady: false
    - BackupReady: true
Recommended Action
Scale down the StatefulSet
to replicas: 0
. Ensure that all member pods are deleted. Ensure that all the members are removed from Etcd
status. Delete and recreate all the member leases
. Recover the cluster from loss of quorum as discussed here.
11. Scale up of a healthy cluster
Observed state
- Cluster Size
  - Desired: d
  - Current: n where d > n
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: n
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
    - Members with expired lease: 0
  - conditions:
    - Ready: true
    - AllMembersReady: true
    - BackupReady: true
Recommended Action
Add d - n
new members by scaling the StatefulSet
to replicas: d
. The rest of the StatefulSet
spec need not be updated until the next cluster bootstrapping (alternatively, the rest of the StatefulSet
spec can be updated pro-actively once the new members join the cluster. This will trigger a rolling update).
Also, create the additional member leases
for the d - n
new members.
12. Scale down of a healthy cluster
Observed state
- Cluster Size
  - Desired: d
  - Current: n where d < n
- StatefulSet replicas
- Etcd status
  - members
    - Total: n
    - Ready: n
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: 0
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: 0
    - Members with expired lease: 0
  - conditions:
    - Ready: true
    - AllMembersReady: true
    - BackupReady: true
Recommended Action
Remove the n - d
existing members (numbered d
, d + 1
… n
) by scaling the StatefulSet
to replicas: d
. The StatefulSet
spec need not be updated until the next cluster bootstrapping (alternatively, the StatefulSet
spec can be updated pro-actively once the superfluous members exit the cluster. This will trigger a rolling update).
Also, delete the member leases
for the n - d
members being removed.
The superfluous entries in the members
array will be cleaned up as explained here.
The superfluous members in the ETCD cluster will be cleaned up by the leading etcd-backup-restore
sidecar.
13. Superfluous member entries in Etcd
status
Observed state
- Cluster Size
- StatefulSet replicas
- Etcd status
  - members
    - Total: m where m > n
    - Ready: N/A
    - Members NotReady for long enough to be evicted, i.e. lastTransitionTime > notReadyGracePeriod: N/A
    - Members with readiness status Unknown long enough to be considered NotReady, i.e. lastTransitionTime > unknownGracePeriod: N/A
    - Members with expired lease: N/A
  - conditions:
    - Ready: N/A
    - AllMembersReady: N/A
    - BackupReady: N/A
Recommended Action
Remove the superfluous m - n
member entries from Etcd
status (numbered n
, n+1
… m
).
Remove the superfluous m - n
member leases
if they exist.
The superfluous members in the ETCD cluster will be cleaned up by the leading etcd-backup-restore
sidecar.
Decision table for etcd-backup-restore during initialization
As discussed above, the initialization sequence of etcd-backup-restore
in a member pod needs to generate suitable etcd configuration for its etcd container.
It also might have to handle the etcd database verification and restoration functionality differently in different scenarios.
The initialization sequence itself is proposed to be as follows.
It is an enhancement of the existing initialization sequence.
The details of the decisions to be taken during the initialization are given below.
1. First member during bootstrap of a fresh etcd cluster
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: 0
  - Ready: 0
- Status contains own member: false
- Data persistence
  - WAL directory has cluster/ member metadata: false
  - Data directory is valid and up-to-date: false
- Backup
  - Backup exists: false
  - Backup has incremental snapshots: false
Recommended Action
Generate etcd configuration with n
initial cluster peer URLs and initial cluster state new and return success.
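As an illustration, for a fresh 3-member cluster the generated etcd configuration could contain entries along the following lines. This is a sketch under assumed names (etcd-main, an etcd-main-peer headless service, default ports); the actual values would be derived from the Etcd resource:
```yaml
name: etcd-main-0
initial-cluster-state: new
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380,etcd-main-1=https://etcd-main-1.etcd-main-peer.default.svc:2380,etcd-main-2=https://etcd-main-2.etcd-main-peer.default.svc:2380
initial-advertise-peer-urls: https://etcd-main-0.etcd-main-peer.default.svc:2380
listen-peer-urls: https://0.0.0.0:2380
advertise-client-urls: https://etcd-main-0.etcd-main-peer.default.svc:2379
listen-client-urls: https://0.0.0.0:2379
```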
2. Addition of a new following member during bootstrap of a fresh etcd cluster
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: m where 0 < m < n
  - Ready: m
- Status contains own member: false
- Data persistence
  - WAL directory has cluster/ member metadata: false
  - Data directory is valid and up-to-date: false
- Backup
  - Backup exists: false
  - Backup has incremental snapshots: false
Recommended Action
Generate etcd configuration with n
initial cluster peer URLs and initial cluster state new and return success.
3. Restart of an existing member of a quorate cluster with valid metadata and data
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: m where m > n/2
  - Ready: r where r > n/2
- Status contains own member: true
- Data persistence
  - WAL directory has cluster/ member metadata: true
  - Data directory is valid and up-to-date: true
- Backup
  - Backup exists: N/A
  - Backup has incremental snapshots: N/A
Recommended Action
Re-use previously generated etcd configuration and return success.
4. Restart of an existing member of a quorate cluster with valid metadata but without valid data
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: m where m > n/2
  - Ready: r where r > n/2
- Status contains own member: true
- Data persistence
  - WAL directory has cluster/ member metadata: true
  - Data directory is valid and up-to-date: false
- Backup
  - Backup exists: N/A
  - Backup has incremental snapshots: N/A
Recommended Action
Remove self as a member (old member ID) from the etcd cluster as well as Etcd
status. Add self as a new member of the etcd cluster as well as in the Etcd
status. If backups do not exist, create an empty data and WAL directory. If backups exist, restore only the latest full snapshot (please see here for the reason for not restoring incremental snapshots). Generate etcd configuration with n
initial cluster peer URLs and initial cluster state existing
and return success.
5. Restart of an existing member of a quorate cluster without valid metadata
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: m where m > n/2
  - Ready: r where r > n/2
- Status contains own member: true
- Data persistence
  - WAL directory has cluster/ member metadata: false
  - Data directory is valid and up-to-date: N/A
- Backup
  - Backup exists: N/A
  - Backup has incremental snapshots: N/A
Recommended Action
Remove self as a member (old member ID) from the etcd cluster as well as Etcd
status. Add self as a new member of the etcd cluster as well as in the Etcd
status. If backups do not exist, create an empty data and WAL directory. If backups exist, restore only the latest full snapshot (please see here for the reason for not restoring incremental snapshots). Generate etcd configuration with n
initial cluster peer URLs and initial cluster state existing
and return success.
6. Restart of an existing member of a non-quorate cluster with valid metadata and data
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: m where m < n/2
  - Ready: r where r < n/2
- Status contains own member: true
- Data persistence
  - WAL directory has cluster/ member metadata: true
  - Data directory is valid and up-to-date: true
- Backup
  - Backup exists: N/A
  - Backup has incremental snapshots: N/A
Recommended Action
Re-use previously generated etcd configuration and return success.
7. Restart of the first member of a non-quorate cluster without valid data
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: 0
  - Ready: 0
- Status contains own member: false
- Data persistence
  - WAL directory has cluster/ member metadata: N/A
  - Data directory is valid and up-to-date: false
- Backup
  - Backup exists: N/A
  - Backup has incremental snapshots: N/A
Recommended Action
If backups do not exist, create an empty data and WAL directory. If backups exist, restore the latest full snapshot. Start a single-node embedded etcd with initial cluster peer URLs containing only own peer URL and initial cluster state new
. If incremental snapshots exist, apply them serially (honouring source transactions). Take and upload a full snapshot after incremental snapshots are applied successfully (please see here for more reasons why). Generate etcd configuration with n
initial cluster peer URLs and initial cluster state new
and return success.
8. Restart of a following member of a non-quorate cluster without valid data
Observed state
- Cluster Size: n
- Etcd status members:
  - Total: m where 1 < m < n
  - Ready: r where 1 < r < n
- Status contains own member: false
- Data persistence
  - WAL directory has cluster/ member metadata: N/A
  - Data directory is valid and up-to-date: false
- Backup
  - Backup exists: N/A
  - Backup has incremental snapshots: N/A
Recommended Action
If backups do not exist, create an empty data and WAL directory. If backups exist, restore only the latest full snapshot (please see here for the reason for not restoring incremental snapshots). Generate etcd configuration with n
initial cluster peer URLs and initial cluster state existing
and return success.
Backup
Only one of the etcd-backup-restore sidecars among the members is required to take the backup for a given ETCD cluster. This can be called the backup leader
. There are two possibilities to ensure this.
Leading ETCD main container’s sidecar is the backup leader
The backup-restore sidecar could poll the etcd cluster and/or its own etcd main container to see if it is the leading member in the etcd cluster.
This information can be used by the backup-restore sidecars to decide that the sidecar of the leading etcd main container is the backup leader (i.e. responsible for taking/uploading backups regularly).
The advantages of this approach are as follows.
- The approach is operationally and conceptually simple. The leading etcd container and backup-restore sidecar are always located in the same pod.
- Network traffic between the backup container and the etcd cluster will always be local.
The disadvantage is that this approach may not age well in the future if we think about moving the backup-restore container as a separate pod rather than a sidecar container.
Independent leader election between backup-restore sidecars
We could use the etcd lease
mechanism to perform leader election among the backup-restore sidecars. For example, using something like go.etcd.io/etcd/clientv3/concurrency
.
The advantage and disadvantages are pretty much the opposite of the approach above.
The advantage being that this approach may age well in the future if we think about moving the backup-restore container as a separate pod rather than a sidecar container.
The disadvantages are as follows.
- The approach is operationally and conceptually a bit complex. The leading etcd container and backup-restore sidecar might potentially belong to different pods.
- Network traffic between the backup container and the etcd cluster might potentially be across nodes.
History Compaction
This proposal recommends configuring automatic history compaction on the individual members.
Defragmentation
Defragmentation is already triggered periodically by etcd-backup-restore
.
This proposal recommends enhancing this functionality so that it is performed only by the leading backup-restore container.
The defragmentation must be performed only when the etcd cluster is in full health and must be done in a rolling manner for each member to avoid disruption.
The leading member should be defragmented last after all the rest of the members have been defragmented to minimise potential leadership changes caused by defragmentation.
If the etcd cluster is unhealthy when it is time to trigger scheduled defragmentation, the defragmentation must be postponed until the cluster becomes healthy. This check must be done before triggering defragmentation for each member.
Work-flows in etcd-backup-restore
There are different work-flows in etcd-backup-restore.
Some existing flows like initialization, scheduled backups and defragmentation have been enhanced or modified.
Some new work-flows like status updates have been introduced.
Some of these work-flows are sensitive to which etcd-backup-restore
container is leading and some are not.
The life-cycle of these work-flows is shown below.
Work-flows independent of leader election in all members
- Serve the HTTP API that all members are expected to support. However, HTTP API calls that are used to take an out-of-sync delta or full snapshot should delegate the incoming requests to the leading sidecar; one possible approach to achieve this is an HTTP reverse proxy.
- Check the health of the respective etcd member and renew the corresponding member
lease
.
Work-flows only on the leading member
- Take backups (full and incremental) at configured regular intervals
- Defragment all the members sequentially at configured regular intervals
- Clean up superfluous members from the ETCD cluster for which there is no corresponding pod (the ordinal in the pod name is greater than the cluster size) at regular intervals (or whenever the
Etcd
resource status changes by watching it)
High Availability
Considering that high-availability is the primary reason for using a multi-node etcd cluster, it makes sense to distribute the individual member pods of the etcd cluster across different physical nodes.
If the underlying Kubernetes cluster has nodes from multiple availability zones, it makes sense to also distribute the member pods across nodes from different availability zones.
One possibility to do this is via SelectorSpreadPriority
of kube-scheduler
but this is only best-effort and may not always be enforced strictly.
It is better to use pod anti-affinity to enforce such distribution of member pods.
Zonal Cluster - Single Availability Zone
A zonal cluster is configured to consist of nodes belonging to only a single availability zone in a region of the cloud provider.
In such a case, we can at best distribute the member pods of a multi-node etcd cluster instance only across different nodes in the configured availability zone.
This can be done by specifying pod anti-affinity in the specification of the member pods using kubernetes.io/hostname
as the topology key.
apiVersion: apps/v1
kind: StatefulSet
...
spec:
...
template:
...
spec:
...
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector: {} # podSelector that matches the member pods of the given etcd cluster instance
topologyKey: "kubernetes.io/hostname"
...
...
...
The recommendation is to keep etcd-druid
agnostic of such topics related to scheduling and cluster-topology, and to use kupid to orthogonally inject the desired pod anti-affinity.
Alternative
Another option is to build the functionality into etcd-druid
to include the required pod anti-affinity when it provisions the StatefulSet
that manages the member pods.
While this has the advantage of avoiding a dependency on an external component like kupid, the disadvantage is that we might need to address development or testing use-cases where it might be desirable to avoid distributing member pods and to schedule them on as few nodes as possible.
Also, as mentioned below, kupid can be used to distribute member pods of an etcd cluster instance across nodes in a single availability zone as well as across nodes in multiple availability zones with very minor variation.
This keeps the solution uniform regardless of the topology of the underlying Kubernetes cluster.
Regional Cluster - Multiple Availability Zones
A regional cluster is configured to consist of nodes belonging to multiple availability zones (typically, three) in a region of the cloud provider.
In such a case, we can distribute the member pods of a multi-node etcd cluster instance across nodes belonging to different availability zones.
This can be done by specifying pod anti-affinity in the specification of the member pods using topology.kubernetes.io/zone
as the topology key.
In Kubernetes clusters using Kubernetes release older than 1.17
, the older (and now deprecated) failure-domain.beta.kubernetes.io/zone
might have to be used as the topology key.
apiVersion: apps/v1
kind: StatefulSet
...
spec:
...
template:
...
spec:
...
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector: {} # podSelector that matches the member pods of the given etcd cluster instance
topologyKey: "topology.kubernetes.io/zone
...
...
...
The recommendation is to keep etcd-druid
agnostic of such topics related to scheduling and cluster-topology, and to use kupid to orthogonally inject the desired pod anti-affinity.
Alternative
Another option is to build the functionality into etcd-druid
to include the required pod anti-affinity when it provisions the StatefulSet
that manages the member pods.
While this has the advantage of avoiding a dependency on an external component like kupid, the disadvantage is that such built-in support necessarily limits what kind of topologies of the underlying cluster will be supported.
Hence, it is better to keep etcd-druid
altogether agnostic of issues related to scheduling and cluster-topology.
PodDisruptionBudget
This proposal recommends that etcd-druid
should deploy PodDisruptionBudget
(minAvailable
set to floor(<cluster size>/2) + 1
) for multi-node etcd clusters (if AllMembersReady
condition is true
) so that any planned disruptive operation can try to honour the disruption budget and maintain high availability of the etcd cluster during potentially disruptive maintenance operations.
Also, it is recommended to toggle the minAvailable
field between floor(<cluster size>/2)
and <number of members with status Ready true>
whenever the AllMembersReady
condition toggles between true
and false
.
This is to disable eviction of any member pods when not all members are Ready
.
In case of a conflict, the recommendation is to use the highest of the applicable values for minAvailable
.
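As an illustration, for a 3-member cluster with the AllMembersReady condition true, the PodDisruptionBudget deployed by etcd-druid could look roughly like the sketch below; the object name, namespace and pod selector labels are assumptions:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-main              # hypothetical name, typically derived from the Etcd resource
  namespace: shoot--foo--bar   # hypothetical namespace
spec:
  minAvailable: 2              # floor(3/2) + 1 for a 3-member cluster
  selector:
    matchLabels:
      instance: etcd-main      # assumed label matching the member pods of this etcd cluster
```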
Rolling updates to etcd members
Any changes to the Etcd
resource spec that might result in a change to StatefulSet
spec or otherwise result in a rolling update of member pods should be applied/propagated by etcd-druid
only when the etcd cluster is fully healthy to reduce the risk of quorum loss during the updates.
This would include vertical autoscaling changes (via HVPA).
If the cluster status is unhealthy (i.e. if either AllMembersReady
or BackupReady
conditions are false
), etcd-druid
must restore it to full health before proceeding with such operations that lead to rolling updates.
This can be further optimized in the future to handle the cases where rolling updates can still be performed on an etcd cluster that is not fully healthy.
Follow Up
Ephemeral Volumes
See section Ephemeral Volumes.
Shoot Control-Plane Migration
This proposal adds support for multi-node etcd clusters, but it should not have a significant impact on shoot control-plane migration beyond what is already present in the single-node etcd cluster scenario.
But to be sure, this needs to be discussed further.
Performance impact of multi-node etcd clusters
Multi-node etcd clusters incur a cost on write performance as compared to single-node etcd clusters.
This performance impact needs to be measured and documented.
Here, we should compare the different persistence options for multi-node etcd clusters so that we have all the information necessary to make a decision balancing high availability, performance and costs.
Metrics, Dashboards and Alerts
There are already metrics exported by etcd and etcd-backup-restore
which are visualized in monitoring dashboards and also used in triggering alerts.
These might have hidden assumptions about single-node etcd clusters.
These might need to be enhanced and potentially new metrics, dashboards and alerts configured to cover the multi-node etcd cluster scenario.
Especially, a high priority alert must be raised if BackupReady
condition becomes false
.
Costs
Multi-node etcd clusters will clearly involve higher cost (when compared with single-node etcd clusters) just going by the CPU and memory usage for the additional members.
Also, the different options for persistence for etcd data for the members will have different cost implications.
Such cost impact needs to be assessed and documented to help navigate the trade-offs between high availability, performance and costs.
Future Work
Gardener Ring
Gardener Ring requires provisioning and management of an etcd cluster with the members distributed across more than one Kubernetes cluster.
This cannot be achieved by etcd-druid alone which has only the view of a single Kubernetes cluster.
An additional component that has the view of all the Kubernetes clusters involved in setting up the gardener ring will be required to achieve this.
However, etcd-druid can be used by such a higher-level component/controller (for example, by supplying the initial cluster configuration) such that individual etcd-druid instances in the individual Kubernetes clusters can manage the corresponding etcd cluster members.
Autonomous Shoot Clusters
Autonomous Shoot Clusters will also require a highly available etcd cluster to back their control-planes, and the multi-node support proposed here can be leveraged in that context.
However, the current proposal will not meet all the needs of an autonomous shoot cluster.
Some additional components will be required that have the overall view of the autonomous shoot cluster and they can use etcd-druid to manage the multi-node etcd cluster. But this scenario may be different from that of Gardener Ring in that the individual etcd members of the cluster may not be hosted on different Kubernetes clusters.
Optimization of recovery from non-quorate cluster with some member containing valid data
It might be possible to optimize the actions during the recovery of a non-quorate cluster where some of the members contain valid data and some other don’t.
The optimization involves verifying the data of the valid members to determine the data of which member is the most recent (even considering the latest backup) so that the full snapshot can be taken from it before recovering the etcd cluster.
Such an optimization can be attempted in the future.
Optimization of rolling updates to unhealthy etcd clusters
As mentioned above, optimizations to proceed with rolling updates to unhealthy etcd clusters (without first restoring the cluster to full health) can be pursued in future work.
3 - 02 Snapshot Compaction
Snapshot Compaction for Etcd
Current Problem
To ensure recoverability of Etcd, backups of the database are taken at regular intervals.
Backups are of two types: Full Snapshots and Incremental Snapshots.
Full Snapshots
A full snapshot is a snapshot of the complete database at a given point in time. The size of the database keeps changing with time and is typically relatively large (measured in hundreds of megabytes or even gigabytes). For this reason, full snapshots are taken at relatively long intervals.
Incremental Snapshots
Incremental snapshots are collections of events on the Etcd database, obtained by running the WATCH API call on Etcd. At relatively short intervals, all the events accumulated through the WATCH API call are saved in a file, called an incremental snapshot.
Recovery from the Snapshots
Recovery from Full Snapshots
As full snapshots are snapshots of the complete database, the whole database can be recovered from a full snapshot in one go. Etcd provides an API call to restore the database from a full snapshot file.
Recovery from Incremental Snapshots
Delta snapshots are collections of retrospective Etcd events. So, to restore from an incremental snapshot file, the events from the file need to be applied sequentially on the Etcd database through Etcd Put/Delete API calls. As this depends heavily on sequential Etcd calls, restoring from incremental snapshot files can take a long time if numerous commands are captured in them.
Delta snapshots are applied on top of a running Etcd database. So, if there is an inconsistency between the state of the database at the point of applying them and the state of the database when the delta snapshot commands were captured, restoration will fail.
Currently, in Gardener setup, Etcd is restored from the last full snapshot and then the delta snapshots, which were captured after the last full snapshot.
The main problem with this is that the complete restoration time can be unacceptably large if the rate of change coming into the etcd database is quite high, because there is a large number of events in the delta snapshots to be applied sequentially.
A secondary problem is that, though auto-compaction is enabled for etcd, it is not quick enough to compact all the changes from the incremental snapshots being re-applied during the relatively short period of time of restoration (as compared to the actual period of time when the incremental snapshots were accumulated). This may lead to the etcd pod (the backup-restore sidecar container, to be precise) to run out of memory and/or storage space even if it is sufficient for normal operations.
Solution
Compaction command
To help with the problem mentioned earlier, our proposal is to introduce a compact
subcommand in etcdbrctl
. On execution of the compact
command, a separate embedded Etcd process will be started in which the Etcd data will be restored from the snapstore (exactly as in the restoration scenario today). Then the new Etcd database will be compacted and defragmented using Etcd API calls. The compaction will strip the Etcd database of old revisions as per the Etcd auto-compaction configuration. The defragmentation will free up the unused fragmented space released after compaction. Then a full snapshot of the compacted database will be saved in the snapstore, which can then be used as the base snapshot during any subsequent restoration (or backup compaction).
How the solution works
The newly introduced compact command does not disturb the running Etcd while compacting the backup snapshots. The command is designed to run potentially separately (from the main Etcd process/container/pod). Etcd Druid can be configured to run the newly introduced compact command as a separate job (scheduled periodically) based on the total number of Etcd events accumulated after the most recent full snapshot.
Etcd-druid flags:
Etcd-druid introduces the following flags to configure the compaction job:
- --enable-backup-compaction (default false): Set this flag to true to enable the automatic compaction of etcd backups when the threshold value denoted by the CLI flag --etcd-events-threshold is exceeded.
- --compaction-workers (default 3): Number of worker threads of the CompactionJob controller. The controller creates a backup compaction job if a certain etcd event threshold is reached. If compaction is enabled, the value for this flag must be greater than zero.
- --etcd-events-threshold (default 1000000): Total number of etcd events that can be allowed before a backup compaction job is triggered.
- --active-deadline-duration (default 3h): Duration after which a running backup compaction job will be terminated.
- --metrics-scrape-wait-duration (default 0s): Duration to wait for after the compaction job is completed, to allow Prometheus metrics to be scraped.
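For example, enabling backup compaction could look as follows in the etcd-druid deployment; this is only a sketch of the relevant container arguments, with illustrative values and a placeholder image:
```yaml
# excerpt from an etcd-druid Deployment spec (illustrative)
containers:
- name: etcd-druid
  image: <etcd-druid-image>
  args:
  - --enable-backup-compaction=true
  - --compaction-workers=3
  - --etcd-events-threshold=1000000
  - --active-deadline-duration=3h
  - --metrics-scrape-wait-duration=30s
```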
Points to take care while saving the compacted snapshot:
As the compacted snapshot and the existing periodic full snapshots are taken by different processes running in different pods, but accessing the same store to save the snapshots, some problems may arise:
- When uploading the compacted snapshot to the snapstore, there is the problem of how the restorer knows when to start using the newly compacted snapshot. This communication needs to be atomic.
- With a regular schedule for compaction that happens potentially separately from the main etcd pod, is there a need for regular scheduled full snapshots anymore?
- We are planning to introduce a new directory structure, under the v2 prefix, for saving the snapshots (compacted and full), as described in detail below. But for backward compatibility, we also need to consider the older directory, which is currently under the v1 prefix, when accessing snapshots.
How to swap full snapshot with compacted snapshot atomically
Currently, full snapshots and the subsequent delta snapshots are grouped under the same prefix path in the snapstore. When a full snapshot is created, it is placed under a prefix/directory whose name comprises a timestamp. Subsequent delta snapshots are then pushed into the same directory. Thus, each prefix/directory contains a single full snapshot and the subsequent delta snapshots. So far, it is the job of ETCDBR to start the main Etcd process and the snapshotter process, which takes full and delta snapshots periodically. But as per our proposal, compaction will run as a process parallel to the main Etcd process and the snapshotter process. So we cannot reliably coordinate between the processes to atomically switch to the compacted snapshot as the base snapshot.
Current Directory Structure
- Backup-192345
- Full-Snapshot-0-1-192345
- Incremental-Snapshot-1-100-192355
- Incremental-Snapshot-100-200-192365
- Incremental-Snapshot-200-300-192375
- Backup-192789
- Full-Snapshot-0-300-192789
- Incremental-Snapshot-300-400-192799
- Incremental-Snapshot-400-500-192809
- Incremental-Snapshot-500-600-192819
To solve the problem, proposal is:
- ETCDBR will take the first full snapshot after it starts the main Etcd process and the snapshotter process. After taking the first full snapshot, the snapshotter will continue taking full snapshots. On the other hand, the ETCDBR compact command will be run as a periodic job in a separate pod and use the existing full or compacted snapshots to produce further compacted snapshots. Full snapshots and compacted snapshots will be named in the same fashion, so there is no need for any mechanism to choose which snapshots (among full and compacted snapshots) to consider as base snapshots.
- Flatten the directory structure of the backup folder. Save all the full snapshots, delta snapshots and compacted snapshots under the same directory/prefix. The restorer will restore from the full/compacted snapshots and delta snapshots sorted based on the revision numbers in their names (or timestamps if the revision numbers are equal).
Proposed Directory Structure
Backup :
- Full-Snapshot-0-1-192355 (Taken by snapshotter)
- Incremental-Snapshot-revision-1-100-192365
- Incremental-Snapshot-revision-100-200-192375
- Full-Snapshot-revision-0-200-192379 (Taken by snapshotter)
- Incremental-Snapshot-revision-200-300-192385
- Full-Snapshot-revision-0-300-192386 (Taken by compaction job)
- Incremental-Snapshot-revision-300-400-192396
- Incremental-Snapshot-revision-400-500-192406
- Incremental-Snapshot-revision-500-600-192416
- Full-Snapshot-revision-0-600-192419 (Taken by snapshotter)
- Full-Snapshot-revision-0-600-192420 (Taken by compaction job)
What happens to the delta snapshots that were compacted?
The proposed compaction
sub-command in etcdbrctl
(and hence, the CronJob
provisioned by etcd-druid
that will schedule it at a regular interval) would only upload the compacted full snapshot.
It will not delete the snapshots (delta or full snapshots) that were compacted.
These snapshots which were superseded by a freshly uploaded compacted snapshot would follow the same life-cycle as other older snapshots.
I.e. they will be garbage collected according to the configured backup snapshot retention policy.
For example, if an exponential
retention policy is configured and if compaction is done every 30m
then there might be at most 48
additional (compacted) full snapshots (24h * 2
) in the backup for the latest day. As time rolls forward to the next day, these additional compacted snapshots (along with the delta snapshots that were compacted into them) will get garbage collected retaining only one full snapshot for the day before according to the retention policy.
Future work
In the future, we plan to stop the snapshotter just after it takes the first full snapshot. Then, the compaction job will be solely responsible for taking subsequent full snapshots. The directory structure would look like the following:
Backup :
- Full-Snapshot-0-1-192355 (Taken by snapshotter)
- Incremental-Snapshot-revision-1-100-192365
- Incremental-Snapshot-revision-100-200-192375
- Incremental-Snapshot-revision-200-300-192385
- Full-Snapshot-revision-0-300-192386 (Taken by compaction job)
- Incremental-Snapshot-revision-300-400-192396
- Incremental-Snapshot-revision-400-500-192406
- Incremental-Snapshot-revision-500-600-192416
- Full-Snapshot-revision-0-600-192420 (Taken by compaction job)
Backward Compatibility
- Restoration : The changes to handle the newly proposed backup directory structure must be backward compatible with older structures, at least for restoration, because we need to be able to restore from backups in the older structure. This includes the support for restoring from a backup without a metadata file if that is used in the actual implementation.
- Backup : For new snapshots (even on a backup containing the older structure), the new structure may be used. The new structure must be setup automatically including creating the base full snapshot.
- Garbage collection : The existing functionality of garbage collection of snapshots (full and incremental) according to the backup retention policy must be compatible with both old and new backup folder structure. I.e. the snapshots in the older backup structure must be retained in their own structure and the snapshots in the proposed backup structure should be retained in the proposed structure. Once all the snapshots in the older backup structure go out of the retention policy and are garbage collected, we can think of removing the support for older backup folder structure.
Note: The compactor will run in parallel to the current snapshotter process and work only if a full snapshot is already present in the store. By current design, a full snapshot will be taken if there is no full snapshot yet, or if the existing full snapshot is older than 24 hours. This is not a limitation but a design choice. As per the proposed design, the backup storage will contain both periodic full snapshots as well as periodic compacted snapshots. The restorer will pick up whichever base snapshot is the latest.
4 - 03 Scaling Up An Etcd Cluster
Scaling-up a single-node to multi-node etcd cluster deployed by etcd-druid
To mark a cluster for scale-up from single node to multi-node etcd, just patch the etcd custom resource’s .spec.replicas
from 1
to 3
(for example).
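For example, the relevant part of the Etcd resource would change as follows (the resource name is illustrative; other spec fields are omitted):
```yaml
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main   # illustrative name
spec:
  replicas: 3       # changed from 1 to mark the cluster for scale-up
```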
Challenges for scale-up
- An etcd cluster with a single replica doesn't have any peers, so no peer communication is required; hence the peer URL may or may not be TLS enabled. However, while scaling up from single-node etcd to multi-node etcd, there will be a requirement for peer communication between members of the etcd cluster. Peer communication is required for various reasons, for instance for members to sync up cluster state and data, and to perform leader election or any cluster-wide operation like removal or addition of a member. Hence, in a multi-node etcd cluster we need to have a TLS-enabled peer URL for peer communication.
- Providing the correct configuration to start the new etcd members, which is different from bootstrapping a cluster, since these new etcd members will join an existing cluster.
Approach
We first went through the etcd documentation on update-advertise-peer-urls to find information regarding peer URL updates. Interestingly, the etcd documentation mentions the following:
To update the advertise peer URLs of a member, first update it explicitly via member command and then restart the member.
But we can't assume the peer URL is not TLS enabled for a single-node cluster, as it depends on the end-user. A user may or may not enable TLS for the peer URL of a single-node etcd cluster. So, how do we detect whether peer URL TLS was enabled or not when the cluster is marked for scale-up?
Detecting if peerURL TLS is enabled or not
For this, we use an annotation member.etcd.gardener.cloud/tls-enabled
in the member lease object, set by the backup-restore sidecar of etcd. As the etcd configuration is provided by backup-restore, it can find out whether TLS is enabled or not and accordingly set the annotation member.etcd.gardener.cloud/tls-enabled
to either true
or false
in the member lease object.
With the help of this annotation and the config-map values, etcd-druid is able to detect whether there is a change in the peer URL or not.
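A member lease carrying this annotation could look roughly like the following sketch (the lease name and namespace are illustrative):
```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: etcd-main-0            # member lease, named after the member pod (illustrative)
  namespace: shoot--foo--bar   # illustrative namespace
  annotations:
    member.etcd.gardener.cloud/tls-enabled: "true"   # set by the backup-restore sidecar
```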
Etcd-Druid helps in scaling up etcd cluster
Now that it has been detected whether the peer URL was TLS enabled or not for the single-node etcd cluster, etcd-druid can use this information to take action:
- If the peer URL was already TLS enabled, then no action is required from etcd-druid's side. Etcd-druid can proceed with scaling up the cluster.
- If the peer URL was not TLS enabled, then etcd-druid has to intervene and make sure the peer URL is TLS enabled for the single node first, before marking the cluster for scale-up.
Action taken by etcd-druid to enable the peerURL TLS
- Etcd-druid will update the etcd-bootstrap config-map with new config like initial-cluster, initial-advertise-peer-urls etc. Backup-restore will detect this change and update the member lease annotation to member.etcd.gardener.cloud/tls-enabled: "true".
- In case the peer URL TLS has been changed to enabled, etcd-druid will add tasks to the deployment flow:
  - Check if peer TLS has been enabled for existing StatefulSet pods, by checking the member leases for the annotation member.etcd.gardener.cloud/tls-enabled.
  - If peer TLS enablement is pending for any of the members, then check and patch the StatefulSet with the peer TLS volume mounts, if not already patched. This will cause a rolling update of the existing StatefulSet pods, which allows etcd-backup-restore to update the member peer URL in the etcd cluster.
  - Requeue this reconciliation flow until peer TLS has been enabled for all the existing etcd members.
After PeerURL is TLS enabled
After peer URL TLS enablement for the single-node etcd cluster, etcd-druid adds a scale-up annotation gardener.cloud/scaled-to-multi-node to the etcd StatefulSet and patches the StatefulSet's .spec.replicas to 3 (for example). The StatefulSet controller will then bring up new pods (etcd with backup-restore as a sidecar). Etcd's sidecar, i.e. backup-restore, will then check whether this member is already a part of a cluster, and in case it is unable to check (maybe due to some network issues), backup-restore checks for the presence of the annotation gardener.cloud/scaled-to-multi-node on the etcd StatefulSet to detect scale-up. If it finds that this is the scale-up case, then backup-restore adds the new etcd member as a learner first and then starts the etcd learner by providing the correct configuration. Once the learner gets in sync with the etcd cluster leader, it will be promoted to a voting member.
Providing the correct etcd config
As backup-restore detects that it’s a scale-up scenario, backup-restore sets initial-cluster-state
to existing
as this member will join an existing cluster, and it calculates the rest of the configuration from the updated config-map provided by etcd-druid.
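For instance, a new member joining during scale-up from 1 to 3 replicas could end up with a configuration along these lines (a sketch; the service name and ports are assumptions, analogous to the bootstrap example earlier in this document):
```yaml
name: etcd-main-1
initial-cluster-state: existing   # joining an existing cluster rather than bootstrapping a new one
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380,etcd-main-1=https://etcd-main-1.etcd-main-peer.default.svc:2380,etcd-main-2=https://etcd-main-2.etcd-main-peer.default.svc:2380
initial-advertise-peer-urls: https://etcd-main-1.etcd-main-peer.default.svc:2380
```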
Future improvements:
The need to restart etcd pods twice will change in the future. Please refer to: https://github.com/gardener/etcd-backup-restore/issues/538
5 - Cli Flags
CLI Flags
Etcd-druid exposes the following CLI flags that allow for configuring its behavior.
CLI Flag | Component | Description | Default |
---|
feature-gates | etcd-druid | A set of key=value pairs that describe feature gates for alpha/experimental features. Please check feature-gates for more information. | "" |
metrics-bind-address | controller-manager | The IP address that the metrics endpoint binds to. | "" |
metrics-port | controller-manager | The port used for the metrics endpoint. | 8080 |
metrics-addr | controller-manager | The fully qualified address:port that the metrics endpoint binds to. Deprecated: this field will be eventually removed. Please use --metrics-bind-address and --metrics-port instead. | ":8080" |
webhook-server-bind-address | controller-manager | The IP address on which to listen for the HTTPS webhook server. | "" |
webhook-server-port | controller-manager | The port on which to listen for the HTTPS webhook server. | 9443 |
webhook-server-tls-server-cert-dir | controller-manager | The path to a directory containing the server’s TLS certificate and key (the files must be named tls.crt and tls.key respectively). | "/etc/webhook-server-tls" |
enable-leader-election | controller-manager | Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager. | false |
leader-election-id | controller-manager | Name of the resource that leader election will use for holding the leader lock. | "druid-leader-election" |
leader-election-resource-lock | controller-manager | Specifies which resource type to use for leader election. Supported options are ’endpoints’, ‘configmaps’, ’leases’, ’endpointsleases’ and ‘configmapsleases’. Deprecated. Will be removed in the future in favour of using only leases as the leader election resource lock for the controller manager. | "leases" |
disable-lease-cache | controller-manager | Disable cache for lease.coordination.k8s.io resources. | false |
etcd-workers | etcd-controller | Number of workers spawned for concurrent reconciles of etcd spec and status changes. If not specified then default of 3 is assumed. | 3 |
ignore-operation-annotation | etcd-controller | Specifies whether to ignore or honour the annotation gardener.cloud/operation: reconcile on resources to be reconciled. Deprecated: please use --enable-etcd-spec-auto-reconcile instead. | false |
enable-etcd-spec-auto-reconcile | etcd-controller | If true then automatically reconciles Etcd Spec. If false, waits for explicit annotation gardener.cloud/operation: reconcile to be placed on the Etcd resource to trigger reconcile. | false |
disable-etcd-serviceaccount-automount | etcd-controller | If true then .automountServiceAccountToken will be set to false for the ServiceAccount created for etcd StatefulSets. | false |
etcd-status-sync-period | etcd-controller | Period after which an etcd status sync will be attempted. | 15s |
etcd-member-notready-threshold | etcd-controller | Threshold after which an etcd member is considered not ready if the status was unknown before. | 5m |
etcd-member-unknown-threshold | etcd-controller | Threshold after which an etcd member is considered unknown. | 1m |
enable-backup-compaction | compaction-controller | Enable automatic compaction of etcd backups. | false |
compaction-workers | compaction-controller | Number of worker threads of the CompactionJob controller. The controller creates a backup compaction job if a certain etcd event threshold is reached. If compaction is enabled, the value for this flag must be greater than zero. | 3 |
etcd-events-threshold | compaction-controller | Total number of etcd events that can be allowed before a backup compaction job is triggered. | 1000000 |
active-deadline-duration | compaction-controller | Duration after which a running backup compaction job will be terminated. | 3h |
metrics-scrape-wait-duration | compaction-controller | Duration to wait for after compaction job is completed, to allow Prometheus metrics to be scraped. | 0s |
etcd-copy-backups-task-workers | etcdcopybackupstask-controller | Number of worker threads for the etcdcopybackupstask controller. | 3 |
secret-workers | secret-controller | Number of worker threads for the secrets controller. | 10 |
enable-etcd-components-webhook | etcdcomponents-webhook | Enable EtcdComponents Webhook to prevent unintended changes to resources managed by etcd-druid. | false |
reconciler-service-account | etcdcomponents-webhook | The fully qualified name of the service account used by etcd-druid for reconciling etcd resources. If unspecified, the default service account mounted for etcd-druid will be used. | <etcd-druid-service-account> |
etcd-components-exempt-service-accounts | etcdcomponents-webhook | The comma-separated list of fully qualified names of service accounts that are exempt from EtcdComponents Webhook checks. | "" |
6 - Controllers
Controllers
etcd-druid is an operator to manage etcd clusters, and follows the Operator
pattern for Kubernetes.
It makes use of the Kubebuilder framework which makes it quite easy to define Custom Resources (CRs) such as Etcd
s and EtcdCopyBackupsTask
s through Custom Resource Definitions (CRDs), and define controllers for these CRDs.
etcd-druid uses Kubebuilder to define the Etcd
CR and its corresponding controllers.
All controllers that are a part of etcd-druid reside in package internal/controller
, as sub-packages.
Etcd-druid currently consists of the following controllers, each having its own responsibility:
- etcd : responsible for the reconciliation of the
Etcd
CR spec, which allows users to run etcd clusters within the specified Kubernetes cluster, and also responsible for periodically updating the Etcd
CR status with the up-to-date state of the managed etcd cluster. - compaction : responsible for snapshot compaction.
- etcdcopybackupstask : responsible for the reconciliation of the
EtcdCopyBackupsTask
CR, which helps perform the job of copying snapshot backups from one object store to another. - secret : responsible in making sure
Secret
s being referenced by Etcd
resources are not deleted while in use.
Package Structure
The typical package structure for the controllers that are part of etcd-druid is shown with the compaction controller:
internal/controller/compaction
├── config.go
├── reconciler.go
└── register.go
- config.go: contains all the logic for the configuration of the controller, including feature gate activations, CLI flag parsing and validations.
- register.go: contains the logic for registering the controller with the etcd-druid controller manager.
- reconciler.go: contains the controller reconciliation logic.
Each controller package also contains auxiliary files which are relevant to that specific controller.
Controller Manager
A manager is first created for all controllers that are a part of etcd-druid.
The controller manager is responsible for all the controllers that are associated with CRDs.
Once the manager is Start()
ed, all the controllers that are registered with it are started.
Each controller is built using a controller builder, configured with details such as the type of object being reconciled, owned objects whose owner object is reconciled, event filters (predicates), etc. Predicates
are filters which allow controllers to filter which type of events the controller should respond to and which ones to ignore.
The logic relevant to the controller manager like the creation of the controller manager and registering each of the controllers with the manager, is contained in internal/manager/manager.go
.
Etcd Controller
The etcd controller is responsible for the reconciliation of the Etcd
resource spec and status. It handles the provisioning and management of the etcd cluster. Different components that are required for the functioning of the cluster like Leases
, ConfigMap
s, and the Statefulset
for the etcd cluster are all deployed and managed by the etcd controller.
Additionally, etcd controller also periodically updates the Etcd
resource status with the latest available information from the etcd cluster, as well as results and errors from the recent-most reconciliation of the Etcd
resource spec.
The etcd controller is essential to the functioning of the etcd cluster and etcd-druid, thus the minimum number of worker threads is 1 (default being 3), controlled by the CLI flag --etcd-workers
.
Etcd
Spec Reconciliation
While building the controller, an event filter is set such that the behavior of the controller, specifically for Etcd
update operations, depends on the gardener.cloud/operation: reconcile
annotation. This is controlled by the --enable-etcd-spec-auto-reconcile
CLI flag, which, if set to false
, tells the controller to perform reconciliation only when this annotation is present. If the flag is set to true
, the controller will reconcile the etcd cluster anytime the Etcd
spec, and thus generation
, changes, and the next queued event for it is triggered.
Note: Creation and deletion of Etcd
resources are not affected by the above flag or annotation.
The reason this filter is present is that any disruption in the Etcd
resource due to reconciliation (due to changes in the Etcd
spec, for example) while workloads are being run would cause unwanted downtimes to the etcd cluster. Hence, any user who wishes to avoid such disruptions, can choose to set the --enable-etcd-spec-auto-reconcile
CLI flag to false
. An example of this is Gardener’s gardenlet, which reconciles the Etcd
resource only during a shoot cluster’s maintenance window.
The controller adds a finalizer to the Etcd
resource in order to ensure that it does not get deleted until all dependent resources managed by etcd-druid, aka managed components, are properly cleaned up. Only the etcd controller can delete a resource once it adds finalizers to it. This ensures that the proper deletion flow steps are followed while deleting the resource. During deletion flow, managed components are deleted in parallel.
Etcd
Status Updates
The Etcd
resource status is updated periodically by etcd controller
, the interval for which is determined by the CLI flag --etcd-status-sync-period
.
Status fields of the Etcd
resource such as LastOperation
, LastErrors
and ObservedGeneration
, are updated to reflect the result of the recent reconciliation of the Etcd
resource spec.
- LastOperation holds information about the last operation performed on the etcd cluster, indicated by the fields Type, State, Description and LastUpdateTime. An additional field RunID indicates the unique ID assigned to the specific reconciliation run, to allow for better debugging of issues.
- LastErrors is a slice of errors encountered by the last reconciliation run. Each error consists of a Code field indicating the custom etcd-druid error code, a human-readable Description, and the ObservedAt time when the error was seen.
- ObservedGeneration indicates the latest generation of the Etcd resource that etcd-druid has “observed” and consequently reconciled. It helps identify whether a change in the Etcd resource spec was acted upon by druid or not.
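An illustrative fragment of such a status is shown below; all values are hypothetical, and the field names assume the usual camelCase YAML serialization of the fields described above:
status:
  observedGeneration: 2              # latest generation of the Etcd resource acted upon by druid
  lastOperation:
    type: Reconcile                  # illustrative value
    state: Succeeded                 # illustrative value
    description: Reconciliation of the etcd cluster completed successfully
    runID: 2f9c7a1e                  # unique ID of this reconciliation run
    lastUpdateTime: "2024-05-01T10:15:00Z"
  lastErrors: []                     # errors from the last reconciliation run, if any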
Status fields of the Etcd
resource which correspond to the StatefulSet
like CurrentReplicas
, ReadyReplicas
and Replicas
are updated to reflect those of the StatefulSet
by the controller.
Status fields related to the etcd cluster itself, such as Members
, PeerUrlTLSEnabled
and Ready
are updated as follows:
- Cluster Membership: The controller updates information about etcd cluster membership, such as Role, Status, Reason, LastTransitionTime, and identifying information like the Name and ID. For the Status field, the member is checked against the Ready condition, and can be in Ready, NotReady or Unknown status.
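An illustrative members entry (all values are hypothetical, assuming the usual camelCase serialization):
members:
- name: etcd-main-0                  # hypothetical member name
  id: "272e204152d8a0ce"             # hypothetical member ID
  role: Leader
  status: Ready
  reason: LeaseSucceeded             # illustrative reason string
  lastTransitionTime: "2024-05-01T10:15:00Z"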
Etcd resource conditions are indicated by the status field Conditions. The condition checks that are currently performed are:
- AllMembersReady: indicates readiness of all members of the etcd cluster.
- Ready: indicates overall readiness of the etcd cluster in serving traffic.
- BackupReady: indicates health of the etcd backups, i.e., whether etcd backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- DataVolumesReady: indicates health of the persistent volumes containing the etcd data.
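Each entry follows the Condition structure described in the API reference; an illustrative example with hypothetical values:
conditions:
- type: BackupReady
  status: "True"
  reason: SnapshotsTakenOnSchedule   # illustrative reason string
  message: Full and delta snapshots are being taken as per schedule
  lastTransitionTime: "2024-05-01T09:00:00Z"
  lastUpdateTime: "2024-05-01T10:15:00Z"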
Compaction Controller
The compaction controller deploys the snapshot compaction job whenever required. To understand the rationale behind this controller, please read snapshot-compaction.md.
The controller watches the number of events accumulated as part of delta snapshots in the etcd cluster’s backups, and triggers a snapshot compaction when the number of delta events crosses the set threshold, which is configurable through the --etcd-events-threshold
CLI flag (1M events by default).
The controller watches for changes in snapshot Leases
associated with Etcd
resources.
It checks the full and delta snapshot Leases
and calculates the difference in events between the latest delta snapshot and the previous full snapshot, and initiates the compaction job if the event threshold is crossed.
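For illustration, the revision information is read from the holderIdentity of the full and delta snapshot leases (lease names and values below are hypothetical), and the difference is compared against the threshold:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: etcd-main-full-snap          # full snapshot lease (name is illustrative)
spec:
  holderIdentity: "100000"           # revision captured by the last full snapshot
---
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: etcd-main-delta-snap         # delta snapshot lease (name is illustrative)
spec:
  holderIdentity: "1200000"          # revision captured by the latest delta snapshot
# 1200000 - 100000 = 1100000 accumulated events, which exceeds the default threshold of 1M events,
# so the controller would deploy a compaction job.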
The number of worker threads for the compaction controller needs to be greater than or equal to 0 (the default being 3), controlled by the CLI flag --compaction-workers.
This is unlike other controllers, which need at least one worker thread for the proper functioning of etcd-druid, since snapshot compaction is not a core functionality for the etcd clusters to be deployed.
The compaction controller should be explicitly enabled by the user through the --enable-backup-compaction CLI flag.
EtcdCopyBackupsTask Controller
The etcdcopybackupstask controller is responsible for deploying the etcdbrctl copy
command as a job.
This controller reacts to create/update events arising from EtcdCopyBackupsTask resources, and deploys the EtcdCopyBackupsTask
job with source and target backup storage providers as arguments, which are derived from source and target bucket secrets referenced by the EtcdCopyBackupsTask
resource.
The number of worker threads for the etcdcopybackupstask controller needs to be greater than or equal to 0 (default being 3), controlled by the CLI flag --etcd-copy-backups-task-workers
.
This is unlike other controllers, which need at least one worker thread for the proper functioning of etcd-druid, since EtcdCopyBackupsTask is not a core functionality for the etcd clusters to be deployed.
Secret Controller
The secret controller’s primary responsibility is to add a finalizer on Secret
s referenced by the Etcd
resource.
The secret controller is registered for Secret
s, and the controller keeps a watch on the Etcd
CR.
This finalizer is added to ensure that Secret
s which are referenced by the Etcd
CR aren’t deleted while still being used by the Etcd
resource.
Events arising from the Etcd
resource are mapped to a list of Secret
s such as backup and TLS secrets that are referenced by the Etcd
resource, and are enqueued into the request queue, which the reconciler then acts on.
The number of worker threads for the secret controller must be at least 1 (default being 10) for this core controller, controlled by the CLI flag --secret-workers
, since the referenced TLS and infrastructure access secrets are essential to the proper functioning of the etcd cluster.
7 - DEP Title
DEP-NN: Your short, descriptive title
Table of Contents
Summary
Motivation
Goals
Non-Goals
Proposal
Alternatives
8 - Etcd Druid
Documentation Index
Concepts
Development
Deployment
Operations
Proposals
Usage
9 - etcd Network Latency
Network Latency analysis: sn-etcd-sz
vs mn-etcd-sz
vs mn-etcd-mz
This page captures the etcd cluster latency analysis for the below scenarios, using the benchmark tool (built from the etcd benchmark tool).
sn-etcd-sz
-> single-node etcd single zone (Only single replica of etcd will be running)
mn-etcd-sz
-> multi-node etcd single zone (Multiple replicas of etcd pods will be running across nodes in a single zone)
mn-etcd-mz
-> multi-node etcd multi zone (Multiple replicas of etcd pods will be running across nodes in multiple zones)
PUT Analysis
Summary
- sn-etcd-sz latency is ~20% less than mn-etcd-sz when the benchmark tool runs with a single client.
- mn-etcd-sz latency is less than mn-etcd-mz, but the difference is within ~+/-5%.
- Compared to mn-etcd-sz, sn-etcd-sz latency is higher and gradually grows with more clients and larger value sizes.
- Compared to mn-etcd-mz, mn-etcd-sz latency is higher and gradually grows with more clients and larger value sizes.
- Compared to the followers, leader latency is lower when the benchmark tool runs with a single client, for all cases.
- Compared to the followers, leader latency is higher when the benchmark tool runs with multiple clients, for all cases.
Sample commands:
# write to leader
benchmark put --target-leader --conns=1 --clients=1 --precise \
--sequential-keys --key-starts 0 --val-size=256 --total=10000 \
--endpoints=$ETCD_HOST
# write to follower
benchmark put --conns=1 --clients=1 --precise \
--sequential-keys --key-starts 0 --val-size=256 --total=10000 \
--endpoints=$ETCD_FOLLOWER_HOST
Latency analysis during PUT requests to etcd
In this case the benchmark tool puts keys with random 256 byte values.
Benchmark tool loads key/value to the leader with a single client.
- sn-etcd-sz latency (~0.815ms) is ~50% less than mn-etcd-sz (~1.74ms).
- mn-etcd-sz latency (~1.74ms) is slightly less than mn-etcd-mz (~1.8ms), but the difference is negligible (within the same millisecond).
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | leader | 1220.0520 | 0.815ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
10000 | 256 | 1 | 1 | leader | 586.545 | 1.74ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
10000 | 256 | 1 | 1 | leader | 554.0155654442634 | 1.8ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool loads key/value to the follower with a single client.
- mn-etcd-sz latency (~2.2ms) is 20% to 30% less than mn-etcd-mz (~2.7ms).
- Compared to the follower, the leader has lower latency.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | follower-1 | 445.743 | 2.23ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
10000 | 256 | 1 | 1 | follower-1 | 378.9366747610789 | 2.63ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | follower-2 | 457.967 | 2.17ms | eu-west-1a | etcd-main-2 | mn-etcd-sz |
10000 | 256 | 1 | 1 | follower-2 | 345.6586129825796 | 2.89ms | eu-west-1b | etcd-main-2 | mn-etcd-mz |
Benchmark tool loads key/value to the leader with multiple clients.
- sn-etcd-sz latency (~78.3ms) is ~10% greater than mn-etcd-sz (~71.81ms).
- mn-etcd-sz latency (~71.81ms) is less than mn-etcd-mz (~72.5ms), but the difference is negligible.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | leader | 12638.905 | 78.32ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
100000 | 256 | 100 | 1000 | leader | 13789.248 | 71.81ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | leader | 13728.446436395223 | 72.5ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool loads key/value to the follower with multiple clients.
- mn-etcd-sz latency (~69.8ms) is ~5% less than mn-etcd-mz (~72.6ms).
- Compared to the leader, the follower has lower latency.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | follower-1 | 14271.983 | 69.80ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | follower-1 | 13695.98 | 72.62ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | follower-2 | 14325.436 | 69.47ms | eu-west-1a | etcd-main-2 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | follower-2 | 15750.409490407475 | 63.3ms | eu-west-1b | etcd-main-2 | mn-etcd-mz |
In this case the benchmark tool puts keys with random 1 MB values.
Benchmark tool loads key/value to the leader with a single client.
- sn-etcd-sz latency (~16.35ms) is ~20% less than mn-etcd-sz (~20.64ms).
- mn-etcd-sz latency (~20.64ms) is less than mn-etcd-mz (~21.08ms), but the difference is negligible.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | leader | 61.117 | 16.35ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
1000 | 1000000 | 1 | 1 | leader | 48.416 | 20.64ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | leader | 45.7517341664802 | 21.08ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool loads key/value to the follower with a single client.
- mn-etcd-sz latency (~23.10ms) is ~10% greater than mn-etcd-mz (~21.8ms).
- Compared to the follower, the leader has lower latency.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | follower-1 | 43.261 | 23.10ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | follower-1 | 45.7517341664802 | 21.8ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
1000 | 1000000 | 1 | 1 | follower-1 | 45.33 | 22.05ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | follower-2 | 40.0518 | 24.95ms | eu-west-1a | etcd-main-2 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | follower-2 | 43.28573155709838 | 23.09ms | eu-west-1b | etcd-main-2 | mn-etcd-mz |
1000 | 1000000 | 1 | 1 | follower-2 | 45.92 | 21.76ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
1000 | 1000000 | 1 | 1 | follower-2 | 35.5705 | 28.1ms | eu-west-1b | etcd-main-2 | mn-etcd-mz |
Benchmark tool loads key/value to the leader with multiple clients.
- sn-etcd-sz latency (~6.0375secs) is ~30% greater than mn-etcd-sz (~4.000secs).
- mn-etcd-sz latency (~4.000secs) is less than mn-etcd-mz (~4.09secs), but the difference is negligible.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 300 | leader | 55.373 | 6.0375secs | eu-west-1c | etcd-main-0 | sn-etcd-sz |
1000 | 1000000 | 100 | 300 | leader | 67.319 | 4.000secs | eu-west-1a | etcd-main-1 | mn-etcd-sz |
1000 | 1000000 | 100 | 300 | leader | 65.91914167957594 | 4.09secs | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool loads key/value to the follower with multiple clients.
- mn-etcd-sz latency (~4.04secs) is ~5% greater than mn-etcd-mz (~3.90secs).
- Compared to the leader, the follower has lower latency.
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 300 | follower-1 | 66.528 | 4.0417secs | eu-west-1a | etcd-main-0 | mn-etcd-sz |
1000 | 1000000 | 100 | 300 | follower-1 | 70.6493461856332 | 3.90secs | eu-west-1c | etcd-main-0 | mn-etcd-mz |
1000 | 1000000 | 100 | 300 | follower-1 | 71.95 | 3.84secs | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Number of keys | Value size | Number of connections | Number of clients | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 300 | follower-2 | 66.447 | 4.0164secs | eu-west-1a | etcd-main-2 | mn-etcd-sz |
1000 | 1000000 | 100 | 300 | follower-2 | 67.53038086369484 | 3.87secs | eu-west-1b | etcd-main-2 | mn-etcd-mz |
1000 | 1000000 | 100 | 300 | follower-2 | 68.46 | 3.92secs | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Range Analysis
Sample commands are:
# Single connection read request with sequential keys
benchmark range 0 --target-leader --conns=1 --clients=1 --precise \
--sequential-keys --key-starts 0 --total=10000 \
--consistency=l \
--endpoints=$ETCD_HOST
# --consistency=s [Serializable]
benchmark range 0 --target-leader --conns=1 --clients=1 --precise \
--sequential-keys --key-starts 0 --total=10000 \
--consistency=s \
--endpoints=$ETCD_HOST
# Each read request with range query matches key 0 9999 and repeats for total number of requests.
benchmark range 0 9999 --target-leader --conns=1 --clients=1 --precise \
--total=10 \
--consistency=s \
--endpoints=https://etcd-main-client:2379
# Read requests with multiple connections
benchmark range 0 --target-leader --conns=100 --clients=1000 --precise \
--sequential-keys --key-starts 0 --total=100000 \
--consistency=l \
--endpoints=$ETCD_HOST
benchmark range 0 --target-leader --conns=100 --clients=1000 --precise \
--sequential-keys --key-starts 0 --total=100000 \
--consistency=s \
--endpoints=$ETCD_HOST
Latency analysis during Range requests to etcd
In this case the benchmark tool gets a specific key with a random 256 byte value.
Benchmark tool range requests to the leader with a single client.
- sn-etcd-sz latency (~1.24ms) is ~40% greater than mn-etcd-sz (~0.67ms).
- mn-etcd-sz latency (~0.67ms) is ~20% less than mn-etcd-mz (~0.85ms).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | true | l | leader | 800.272 | 1.24ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
10000 | 256 | 1 | 1 | true | l | leader | 1173.9081 | 0.67ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
10000 | 256 | 1 | 1 | true | l | leader | 999.3020189178693 | 0.85ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Compared to Linearizable consistency, Serializable is ~40% less for all cases.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | true | s | leader | 1411.229 | 0.70ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
10000 | 256 | 1 | 1 | true | s | leader | 2033.131 | 0.35ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
10000 | 256 | 1 | 1 | true | s | leader | 2100.2426362012025 | 0.47ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to the follower with a single client.
- mn-etcd-sz latency (~1.3ms) is ~20% less than mn-etcd-mz (~1.6ms).
- Compared to the follower, the leader's read request latency is ~50% less for both mn-etcd-sz and mn-etcd-mz.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | true | l | follower-1 | 765.325 | 1.3ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
10000 | 256 | 1 | 1 | true | l | follower-1 | 596.1 | 1.6ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
- Compared to Linearizable consistency, Serializable is ~50% less for all cases.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
10000 | 256 | 1 | 1 | true | s | follower-1 | 1823.631 | 0.54ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
10000 | 256 | 1 | 1 | true | s | follower-1 | 1442.6 | 0.69ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
10000 | 256 | 1 | 1 | true | s | follower-1 | 1416.39 | 0.70ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
10000 | 256 | 1 | 1 | true | s | follower-1 | 2077.449 | 0.47ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to the leader with multiple clients.
sn-etcd-sz
latency(~84.66ms
) is ~20% greater than mn-etcd-sz
(~73.95ms
).
mn-etcd-sz
latency(~73.95ms
) is more or less equal to mn-etcd-mz
(~ 73.8ms
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | true | l | leader | 11775.721 | 84.66ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
100000 | 256 | 100 | 1000 | true | l | leader | 13446.9598 | 73.95ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | true | l | leader | 13527.19810605353 | 73.8ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Compared to Linearizable consistency, Serializable is ~20% less for all cases.
sn-etcd-sz
latency(~69.37ms
) is more or less equal to mn-etcd-sz
(~69.89ms
).
mn-etcd-sz
latency(~69.89ms
) is slightly higher than mn-etcd-mz
(~67.63ms
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | true | s | leader | 14334.9027 | 69.37ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
100000 | 256 | 100 | 1000 | true | s | leader | 14270.008 | 69.89ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | true | s | leader | 14715.287354023869 | 67.63ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to the follower with multiple clients.
mn-etcd-sz
latency(~60.69ms
) is ~20% lesser than mn-etcd-mz
(~70.76ms
).
Compare to leader
, follower
has lower read request latency.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | true | l | follower-1 | 11586.032 | 60.69ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | true | l | follower-1 | 14050.5 | 70.76ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
mn-etcd-sz latency (~86.09ms) is ~20% higher than mn-etcd-mz (~64.6ms).
- Compared to mn-etcd-sz with Linearizable consistency, Serializable is ~20% higher.
- Compared to mn-etcd-mz with Linearizable consistency, Serializable is slightly less.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
100000 | 256 | 100 | 1000 | true | s | follower-1 | 11582.438 | 86.09ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
100000 | 256 | 100 | 1000 | true | s | follower-1 | 15422.2 | 64.6ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Benchmark tool range requests to leader
all keys.
sn-etcd-sz
latency(~678.77ms
) is ~5% slightly lesser than mn-etcd-sz
(~697.29ms
).
mn-etcd-sz
latency(~697.29ms
) is less than mn-etcd-mz
(~701ms
) but the difference is negligible.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 256 | 2 | 5 | false | l | leader | 6.8875 | 678.77ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
20 | 256 | 2 | 5 | false | l | leader | 6.720 | 697.29ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
20 | 256 | 2 | 5 | false | l | leader | 6.7 | 701ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
- Compared to Linearizable consistency, Serializable is ~5% higher for all cases.
sn-etcd-sz
latency(~687.36ms
) is less than mn-etcd-sz
(~692.68ms
) but the difference is negligible.
mn-etcd-sz
latency(~692.68ms
) is ~5% slightly lesser than mn-etcd-mz
(~735.7ms
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 256 | 2 | 5 | false | s | leader | 6.76 | 687.36ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
20 | 256 | 2 | 5 | false | s | leader | 6.635 | 692.68ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
20 | 256 | 2 | 5 | false | s | leader | 6.3 | 735.7ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to follower
all keys
mn-etcd-sz
(~737.68ms
) latency is ~5% slightly higher than mn-etcd-mz
(~713.7ms
).
Compare to leader
consistency Linearizable
read request, follower
is ~5% slightly higher.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 256 | 2 | 5 | false | l | follower-1 | 6.163 | 737.68ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
20 | 256 | 2 | 5 | false | l | follower-1 | 6.52 | 713.7ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
mn-etcd-sz
latency(~757.73ms
) is ~10% higher than mn-etcd-mz
(~690.4ms
).
Compared to follower Linearizable read requests, follower Serializable is ~3% higher for mn-etcd-sz.
Compared to follower Linearizable read requests, follower Serializable is ~5% less for mn-etcd-mz.
Compared to leader Serializable read requests, follower Serializable is ~5% less for mn-etcd-mz.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 256 | 2 | 5 | false | s | follower-1 | 6.0295 | 757.73ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
20 | 256 | 2 | 5 | false | s | follower-1 | 6.87 | 690.4ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
In this case the benchmark tool gets a specific key with a random 1 MB value.
Benchmark tool range requests to leader
with single client.
sn-etcd-sz
latency(~5.96ms
) is ~5% lesser than mn-etcd-sz
(~6.28ms
).
mn-etcd-sz
latency(~6.28ms
) is ~10% higher than mn-etcd-mz
(~5.3ms
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | true | l | leader | 167.381 | 5.96ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
1000 | 1000000 | 1 | 1 | true | l | leader | 158.822 | 6.28ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | true | l | leader | 187.94 | 5.3ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Compare to consistency Linearizable
, Serializable
is ~15% less for sn-etcd-sz
, mn-etcd-sz
, mn-etcd-mz
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | true | s | leader | 184.95 | 5.398ms | eu-west-1c | etcd-main-0 | sn-etcd-sz |
1000 | 1000000 | 1 | 1 | true | s | leader | 176.901 | 5.64ms | eu-west-1a | etcd-main-1 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | true | s | leader | 209.99 | 4.7ms | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to follower
with single client.
mn-etcd-sz
latency(~6.66ms
) is ~10% higher than mn-etcd-mz
(~6.16ms
).
Compared to the leader, follower read request latency is ~10% higher for mn-etcd-sz.
Compared to the leader, follower read request latency is ~20% higher for mn-etcd-mz.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | true | l | follower-1 | 150.680 | 6.66ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | true | l | follower-1 | 162.072 | 6.16ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Compare to consistency Linearizable
, Serializable
is ~15% less for mn-etcd-sz
(~5.84ms
), mn-etcd-mz
(~5.01ms
).
Compared to the leader, follower read request latency is ~5% higher for mn-etcd-sz and mn-etcd-mz.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 1 | 1 | true | s | follower-1 | 170.918 | 5.84ms | eu-west-1a | etcd-main-0 | mn-etcd-sz |
1000 | 1000000 | 1 | 1 | true | s | follower-1 | 199.01 | 5.01ms | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Benchmark tool range requests to leader
with multiple clients.
sn-etcd-sz
latency(~1.593secs
) is ~20% lesser than mn-etcd-sz
(~1.974secs
).
mn-etcd-sz
latency(~1.974secs
) is ~5% greater than mn-etcd-mz
(~1.81secs
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 500 | true | l | leader | 252.149 | 1.593secs | eu-west-1c | etcd-main-0 | sn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | l | leader | 205.589 | 1.974secs | eu-west-1a | etcd-main-1 | mn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | l | leader | 230.42 | 1.81secs | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Compare to consistency Linearizable
, Serializable
is more or less same for sn-etcd-sz
(~1.57961secs
), mn-etcd-mz
(~1.8secs
) not a big difference
Compare to consistency Linearizable
, Serializable
is ~10% high for mn-etcd-sz
(~ 2.277secs
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 500 | true | s | leader | 252.406 | 1.57961secs | eu-west-1c | etcd-main-0 | sn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | s | leader | 181.905 | 2.277secs | eu-west-1a | etcd-main-1 | mn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | s | leader | 227.64 | 1.8secs | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to the follower with multiple clients.
mn-etcd-sz
latency is ~20% less than mn-etcd-mz
.
Compared to the leader with Linearizable consistency, follower read request latency is ~15% less for mn-etcd-sz (~1.694secs).
Compared to the leader with Linearizable consistency, follower read request latency is ~10% higher for mn-etcd-mz (~1.977secs).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 500 | true | l | follower-1 | 248.489 | 1.694secs | eu-west-1a | etcd-main-0 | mn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | l | follower-1 | 210.22 | 1.977secs | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 500 | true | l | follower-2 | 205.765 | 1.967secs | eu-west-1a | etcd-main-2 | mn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | l | follower-2 | 195.2 | 2.159secs | eu-west-1b | etcd-main-2 | mn-etcd-mz |
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 500 | true | s | follower-1 | 231.458 | 1.7413secs | eu-west-1a | etcd-main-0 | mn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | s | follower-1 | 214.80 | 1.907secs | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
1000 | 1000000 | 100 | 500 | true | s | follower-2 | 183.320 | 2.2810secs | eu-west-1a | etcd-main-2 | mn-etcd-sz |
1000 | 1000000 | 100 | 500 | true | s | follower-2 | 195.40 | 2.164secs | eu-west-1b | etcd-main-2 | mn-etcd-mz |
Benchmark tool range requests to leader
all keys.
sn-etcd-sz
latency(~8.993secs
) is ~3% slightly lower than mn-etcd-sz
(~9.236secs
).
mn-etcd-sz
latency(~9.236secs
) is ~2% slightly lower than mn-etcd-mz
(~9.100secs
).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 1000000 | 2 | 5 | false | l | leader | 0.5139 | 8.993secs | eu-west-1c | etcd-main-0 | sn-etcd-sz |
20 | 1000000 | 2 | 5 | false | l | leader | 0.506 | 9.236secs | eu-west-1a | etcd-main-1 | mn-etcd-sz |
20 | 1000000 | 2 | 5 | false | l | leader | 0.508 | 9.100secs | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Compared to Linearizable read requests, Serializable for sn-etcd-sz (~9.0secs) differs only slightly (~10ms).
Compared to Linearizable read requests, Serializable for mn-etcd-sz (~9.113secs) is ~1% less, not a big difference.
Compared to Linearizable read requests, Serializable for mn-etcd-mz (~8.799secs) is ~3% less, not a big difference.
sn-etcd-sz latency (~9.0secs) is ~1% less than mn-etcd-sz (~9.113secs).
mn-etcd-sz latency (~9.113secs) is ~3% higher than mn-etcd-mz (~8.799secs).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 1000000 | 2 | 5 | false | s | leader | 0.51125 | 9.0003secs | eu-west-1c | etcd-main-0 | sn-etcd-sz |
20 | 1000000 | 2 | 5 | false | s | leader | 0.4993 | 9.113secs | eu-west-1a | etcd-main-1 | mn-etcd-sz |
20 | 1000000 | 2 | 5 | false | s | leader | 0.522 | 8.799secs | eu-west-1a | etcd-main-1 | mn-etcd-mz |
Benchmark tool range requests to follower
all keys
mn-etcd-sz
latency(~9.065secs
) is ~1% slightly higher than mn-etcd-mz
(~9.007secs
).
Compare to leader
consistency Linearizable
read request, follower
is ~1% slightly higher for both cases mn-etcd-sz
, mn-etcd-mz
.
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 1000000 | 2 | 5 | false | l | follower-1 | 0.512 | 9.065secs | eu-west-1a | etcd-main-0 | mn-etcd-sz |
20 | 1000000 | 2 | 5 | false | l | follower-1 | 0.533 | 9.007secs | eu-west-1c | etcd-main-0 | mn-etcd-mz |
Compared to Linearizable follower read requests, Serializable for mn-etcd-sz (~9.553secs) is ~5% higher.
Compared to Linearizable follower read requests, Serializable for mn-etcd-mz (~7.7433secs) is ~15% less.
mn-etcd-sz latency (~9.553secs) is ~20% higher than mn-etcd-mz (~7.7433secs).
Number of requests | Value size | Number of connections | Number of clients | sequential-keys | Consistency | Target etcd server | Average write QPS | Average latency per request | zone | server name | Test name |
---|
20 | 1000000 | 2 | 5 | false | s | follower-1 | 0.4743 | 9.553secs | eu-west-1a | etcd-main-0 | mn-etcd-sz |
20 | 1000000 | 2 | 5 | false | s | follower-1 | 0.5500 | 7.7433secs | eu-west-1c | etcd-main-0 | mn-etcd-mz |
NOTE: This Network latency analysis is inspired by etcd performance.
10 - EtcdMember Custom Resource
DEP-04: EtcdMember Custom Resource
Table of Contents
Summary
Today, etcd-druid mainly acts as an etcd cluster provisioner, and seldom takes remediatory actions if the etcd cluster goes into an undesired state that needs to be resolved by a human operator. In other words, etcd-druid cannot perform day-2 operations on etcd clusters in its current form, and hence cannot carry out its full set of responsibilities as a true “operator” of etcd clusters. For etcd-druid to be fully capable of its responsibilities, it must know the latest state of the etcd clusters and their individual members at all times.
This proposal aims to bridge that gap by introducing EtcdMember
custom resource allowing individual etcd cluster members to publish information/state (previously unknown to etcd-druid). This provides etcd-druid a handle to potentially take cluster-scoped remediatory actions.
Terminology
druid: etcd-druid - an operator for etcd clusters.
etcd-member: A single etcd pod in an etcd cluster that is realised as a StatefulSet.
backup-sidecar: It is the etcd-backup-restore sidecar container in each etcd-member pod.
NOTE: The term sidecar can now be confused with the latest definition in KEP-73. The etcd-backup-restore container is currently not set as an init-container as proposed in the KEP, but as a regular container in a multi-container Pod.
leading-backup-sidecar: A backup-sidecar that is associated to an etcd leader.
restoration: Refers to an individual etcd-member restoring etcd data from an existing backup (comprising full and delta snapshots). The authors have deliberately chosen to distinguish between restoration and learning. Learning refers to a process where a learner “learns” from an etcd-cluster leader.
Motivation
Sharing state of an individual etcd-member with druid is essential for diagnostics, monitoring, cluster-wide-operations and potential remediation. At present, only a subset of etcd-member state is shared with druid using leases. It was always meant as a stopgap arrangement as mentioned in the corresponding issue and is not the best use of leases.
There is a need to have a clear distinction between an etcd-member state and etcd cluster state since most of an etcd cluster state is often derived by looking at individual etcd-member states. In addition, actors which update each of these states should be clearly identified so as to prevent multiple actors updating a single resource holding the state of either an etcd cluster or an etcd-member. As a consequence, etcd-members should not directly update the Etcd
resource status and would therefore need a new custom resource allowing each member to publish detailed information about its latest state.
Goals
- Introduce
EtcdMember
custom resource via which each etcd-member can publish information about its state. This enables druid to deterministically orchestrate out-of-turn operations like compaction, defragmentation, volume management etc. - Define and capture states, sub-states and deterministic transitions amongst states of an etcd-member.
- Today leases are misused to share member-specific information with druid. Their usage to share member state [leader, follower, learner], member-id, snapshot revisions etc should be removed.
Non-Goals
- Auto-recovery from quorum loss or cluster-split due to network partitioning.
- Auto-recovery of an etcd-member due to volume mismatch.
- Relooking at segregating responsibilities between etcd and backup-sidecar containers.
Proposal
This proposal introduces a new custom resource EtcdMember
, and in the following sections describes different sets of information that should be captured as part of the new resource.
Every etcd-member has a unique memberID
and it is part of an etcd cluster which has a unique clusterID
. In a well-formed etcd cluster every member must have the same clusterID
. Publishing this information to druid helps in identifying issues when one or more etcd-members form their own individual clusters, thus resulting in multiple clusters where only one was expected. Issues Issue#419, Canary#4027, Canary#3973 are some such occurrences.
Today, this information is published by using a member lease. Both these fields are populated in the leases’ Spec.HolderIdentity
by the backup-sidecar container.
The authors propose to publish member metadata information in EtcdMember
resource.
id: <etcd-member id>
clusterID: <etcd cluster id>
NOTE: Druid would not do any auto-recovery when it finds out that there are more than one clusters being formed. Instead this information today will be used for diagnostic and alerting.
Etcd Member State Transitions
Each etcd-member goes through different States
during its lifetime. State
is a derived high-level summary of where an etcd-member is in its lifecycle. A SubState
gives additional information about the state. This proposal extends the concept of states with the notion of a SubState
, since State
indicates a top-level state of an EtcdMember
resource, which can have one or more SubState
s.
While State
is sufficient for many human operators, the notion of a SubState
provides operators with an insight about the discrete stage of an etcd-member in its lifecycle. For example, consider a top-level State: Starting
, which indicates that an etcd-member is starting. Starting
is meant to be a transient state for an etcd-member. If an etcd-member remains in this State
longer than expected, then an operator would require additional insight, which the authors propose to provide via SubState
(in this case, the possible SubStates
could be PendingLearner
and Learner
, which are detailed in the following sections).
At present, these states are not captured and only the final state is known, i.e. the etcd-member either fails to come up (all attempts to bring up the pod via the StatefulSet controller have been exhausted) or it comes up. Getting an insight into all its state transitions would help in diagnostics.
The status of an etcd-member at any given point in time can be best categorized as a combination of a top-level State
and a SubState
. The authors propose to introduce the following states and sub-states:
States and Sub-States
NOTE: Abbreviations have been used wherever possible, only to represent sub-states. These representations are chosen only for brevity and will have proper longer names.
States | Sub-States | Description |
---|
New | - | Every newly created etcd-member will start in this state and is termed as the initial state or the start state. |
Initializing | DBV-S (DBValidationSanity) | This state denotes that backup-restore container in etcd-member pod has started initialization. Sub-State DBV-S which is an abbreviation for DBValidationSanity denotes that currently sanity etcd DB validation is in progress. |
Initializing | DBV-F (DBValidationFull) | This state denotes that backup-restore container in etcd-member pod has started initialization. Sub-State DBV-F which is an abbreviation for DBValidationFull denotes that currently full etcd DB validation is in progress. |
Initializing | R (Restoration) | This state denotes that backup-restore container in etcd-member pod has started initialization. Sub-State R which is an abbreviation for Restoration denotes that DB validation failed and now backup-restore has commenced restoration of etcd DB from the backup (comprising of full snapshot and delta-snapshots). An etcd-member will transition to this sub-state only when it is part of a single-node etcd-cluster. |
Starting (SI) | PL (PendingLearner) | An etcd-member can transition from Initializing state to PendingLearner state. In this state backup-restore container will optionally delete any existing etcd data directory and then attempts to add its peer etcd-member process as a learner. Since there can be only one learner at a time in an etcd cluster, an etcd-member could be in this state for some time till its request to get added as a learner is accepted. |
Starting (SI) | Learner | When backup-restore is successfully able to add its peer etcd-member process as a Learner . In this state the etcd-member process will start its DB sync from an etcd leader. |
Started (Sd) | Follower | A follower is a voting raft member. A Learner etcd-member will get promoted to a Follower once its DB is in sync with the leader. It could also become a follower if during a re-election it loses leadership and transitions from being a Leader to Follower . |
Started (Sd) | Leader | A leader is an etcd-member which will handle all client write requests and linearizable read requests. A member could transition to being a Leader from an existing Follower role due to winning a leader election or for a single node etcd cluster it directly transitions from Initializing state to Leader state as there is no other member. |
In the following sub-sections, the state transitions are categorized into several flows making it easier to grasp the different transitions.
Top Level State Transitions
Following DFA represents top level state transitions (without any representation of sub-states). As described in the table above there are 4 top level states:
New
- this is a start state for all newly created etcd-members
Initializing
- In this state backup-restore will perform pre-requisite actions before it triggers the start of an etcd process. DB validation and optionally restoration is done in this state. Possible sub-states are: DBValidationSanity
, DBValidationFull
and Restoration
Starting
- Once the optional initialization is done backup-restore will trigger the start of an etcd process. It can either directly go to Learner
sub-state or wait for getting added as a learner and therefore be in PendingLearner
sub-state.
Started
- In this state the etcd-member is a full voting member. It can either be in Leader
or Follower
sub-states.
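Assuming the proposal is realised with dedicated State and SubState fields in the EtcdMember status (the exact field names are not fixed by this proposal and are shown here purely as an illustration), a member that has just been added as a learner could report:
status:
  state: Starting      # top-level state
  subState: Learner    # additional detail about the current stage in the lifecycle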
Starting an Etcd-Member in a Single-Node Etcd Cluster
Following DFA represents the states, sub-states and transitions of a single etcd-member for a cluster that is bootstrapped from cluster size of 0 -> 1.
Addition of a New Etcd-Member in a Multi-Node Etcd Cluster
Following DFA represents the states, sub-states and transitions of an etcd cluster which starts with having a single member (Leader) and then one or more new members are added which represents a scale-up of an etcd cluster from 1 -> n, where n is odd.
Restart of a Voting Etcd-Member in a Multi-Node Etcd Cluster
Following DFA represents the states, sub-states and transitions when a voting etcd-member in a multi-node etcd cluster restarts.
NOTE: If the DB validation fails then data directory of the etcd-member is removed and etcd-member is removed from cluster membership, thus transitioning it to New
state. The state transitions from New
state are depicted by this section.
Deterministic Etcd Member Creation/Restart During Scale-Up
Bootstrap information:
When an etcd-member starts, it needs to find out:
Issue with the current approach:
At present, this is facilitated by three things:
During scale-up, druid adds an annotation gardener.cloud/scaled-to-multi-node to the StatefulSet. Each etcd-member looks up this annotation.
backup-sidecar attempts to fetch etcd cluster member-list and checks if this etcd-member is already part of the cluster.
Size of the cluster by checking initial-cluster
in the etcd config.
Druid adds an annotation gardener.cloud/scaled-to-multi-node
on the StatefulSet
which is then shared by all etcd-members irrespective of the starting state of an etcd-member (as Learner
or Voting-Member
). This especially creates an issue for the current leader (often pod with index 0) during the scale-up of an etcd cluster as described in this issue.
It has been agreed that the current solution to this issue is a quick and dirty fix and needs to be revisited to be uniformly applied to all etcd-members. The authors propose to provide a more deterministic approach to scale-up using the EtcdMember
resource.
New approach
Instead of adding an annotation gardener.cloud/scaled-to-multi-node
on the StatefulSet
, a new annotation druid.gardener.cloud/create-as-learner
should be added by druid on an EtcdMember
resource. This annotation will only be added to newly created members during scale-up.
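A minimal illustrative EtcdMember manifest fragment carrying this annotation (the resource name and annotation value are hypothetical, and the API group/version is assumed to follow the existing druid.gardener.cloud/v1alpha1 group):
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdMember
metadata:
  name: etcd-main-2                                 # hypothetical name of the newly added member
  annotations:
    druid.gardener.cloud/create-as-learner: "true"  # value shown is illustrative; added by druid only during scale-up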
Each etcd-member should look at the following to deterministically compute the bootstrap information
specified above:
druid.gardener.cloud/create-as-learner
annotation on its respective EtcdMember
resource. This new annotation will be honored in the following cases:
Etcd-cluster member list, to check if it is already part of the cluster.
Existing etcd data directory and its validity.
NOTE: When the etcd-member gets promoted to a voting-member, then it should remove the annotation on its respective EtcdMember
resource.
TLS Enablement for Peer Communication
Etcd-members in a cluster use peer URL(s) to communicate amongst each other. If the advertised peer URL(s) for an etcd-member are updated then etcd mandates a restart of the etcd-member.
Druid only supports toggling the transport level security for the advertised peer URL(s). To indicate that the etcd process within the etcd-member has the updated advertised peer URL(s), an annotation member.etcd.gardener.cloud/tls-enabled
is added by backup-sidecar container to the member lease object.
During the reconciliation run for an Etcd
resource in druid, if reconciler detects a change in advertised peer URL(s) TLS configuration then it will watch for the above mentioned annotation on the member lease. If the annotation has a value of false
then it will trigger a restart of the etcd-member pod.
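For illustration, a member lease carrying this annotation could look as follows (the lease name is hypothetical):
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: etcd-main-0                                  # member lease, named after the etcd-member (illustrative)
  annotations:
    member.etcd.gardener.cloud/tls-enabled: "false"  # set by the backup-sidecar; "false" prompts druid to restart the member pod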
The authors propose to publish member metadata information in EtcdMember
resource and not misuse member leases.
Monitoring Backup Health
Backup-sidecar takes delta and full snapshot both periodically and threshold based. These backed-up snapshots are essential for restoration operations for bootstrapping an etcd cluster from 0 -> 1 replicas. It is essential that leading-backup-sidecar container which is responsible for taking delta/full snapshots and uploading these snapshots to the configured backup store, publishes this information for druid to consume.
At present, information about backed-up snapshot (only latest-revision-number
) is published by leading-backup-sidecar container by updating Spec.HolderIdentity
of the delta-snapshot and full-snapshot leases.
Druid maintains conditions
in the Etcd
resource status, which include but are not limited to maintaining information on whether backups being taken for an etcd cluster are healthy (up-to-date) or stale (outdated in context to a configured schedule). Druid computes these conditions using information from full/delta snapshot leases.
In order to provide a holistic view of the health of backups to human operators, druid requires additional information about the snapshots that are being backed-up. The authors propose to not misuse leases and instead publish the following snapshot information as part EtcdMember
custom resource:
snapshots:
lastFull:
timestamp: <time of full snapshot>
name: <name of the file that is uploaded>
size: <size of the un-compressed snapshot file uploaded>
startRevision: <start revision of etcd db captured in the snapshot>
endRevision: <end revision of etcd db captured in the snapshot>
lastDelta:
timestamp: <time of delta snapshot>
name: <name of the file that is uploaded>
size: <size of the un-compressed snapshot file uploaded>
startRevision: <start revision of etcd db captured in the snapshot>
endRevision: <end revision of etcd db captured in the snapshot>
While this information will primarily help druid compute accurate conditions regarding backup health from snapshot information and publish this to human operators, it could be further utilised by human operators to take remediatory actions (e.g. manually triggering a full or delta snapshot or further restarting the leader if the issue is still not resolved) if backup is unhealthy.
Enhanced Snapshot Compaction
Druid can be configured to perform regular snapshot compactions for etcd clusters, to reduce the total number of delta snapshots to be restored if and when a DB restoration for an etcd cluster is required. Druid triggers a snapshot compaction job when the accumulated etcd events in the latest set of delta snapshots (taken after the last full snapshot) crosses a specified threshold.
As described in Issue#591, scheduling compaction based only on the number of accumulated etcd events is not sufficient to ensure a successful compaction. This specifically affects Kubernetes clusters where each etcd event is larger in size, owing to large spec or status fields of the respective resources.
Druid will now need information regarding snapshot sizes, and more importantly the total size of accumulated delta snapshots since the last full snapshot.
The authors propose to enhance the proposed snapshots
field described in Use Case #3 with the following additional field:
snapshots:
accumulatedDeltaSize: <total size of delta snapshots since last full snapshot>
Druid can then use this information in addition to the existing revision information to decide to trigger an early snapshot compaction job. This effectively allows druid to be proactive in performing regular compactions for etcds receiving large events, reducing the probability of a failed snapshot compaction or restoration.
Enhanced Defragmentation
Reader is recommended to read Etcd Compaction & Defragmentation in order to understand the following terminology:
dbSize
- total storage space used by the etcd database
dbSizeInUse
- logical storage space used by the etcd database, not accounting for free pages in the DB due to etcd history compaction
The leading-backup-sidecar performs periodic defragmentations of the DBs of all the etcd-members in the cluster, controlled via a defragmentation cron schedule provided to each backup-sidecar. Defragmentation is a costly maintenance operation and causes a brief downtime to the etcd-member being defragmented, due to which the leading-backup-sidecar defragments each etcd-member sequentially. This ensures that only one etcd-member would be unavailable at any given time, thus avoiding an accidental quorum loss in the etcd cluster.
The authors propose to move the responsibility of orchestrating these individual defragmentations to druid due to the following reasons:
- Since each backup-sidecar only has knowledge of the health of its own etcd, it can only determine whether its own etcd can be defragmented or not, based on etcd-member health. Trying to defragment a different healthy etcd-member while another etcd-member is unhealthy would lead to a transient quorum loss.
- Each backup-sidecar is only a
sidecar
to its own etcd-member, and by good design principles, it must not be performing any cluster-wide maintenance operations, and this responsibility should remain with the etcd cluster operator.
Additionally, defragmentation of an etcd DB becomes inevitable if the DB size exceeds the specified DB space quota, since the etcd DB then becomes read-only, ie no write operations on the etcd would be possible unless the etcd DB is defragmented and storage space is freed up. In order to automate this, druid will now need information about the etcd DB size from each member, specifically the leading etcd-member, so that a cluster-wide defragmentation can be triggered if the DB size reaches a certain threshold, as already described by this issue.
The authors propose to enhance each etcd-member to regularly publish information about the dbSize
and dbSizeInUse
so that druid may trigger defragmentation for the etcd cluster.
dbSize: <db-size> # e.g 6Gi
dbSizeInUse: <db-size-in-use> # e.g 3.5Gi
The difference between dbSize and dbSizeInUse gives a clear indication of how much storage space would be freed up if a defragmentation is performed. If the difference is not significant (based on a configurable threshold provided to druid), then no defragmentation should be performed. This ensures that druid does not perform frequent defragmentations that do not yield much benefit. Effectively, the goal is to maximise the benefit of defragmentation, since this operation involves a transient downtime for each etcd-member.
Monitoring Defragmentations
As discussed in the previous section, every etcd-member is defragmented periodically, and can also be defragmented based on the DB size reaching a certain threshold. It is beneficial for druid to have knowledge of this data from each etcd-member for the following reasons:
[Diagnostics] It is expected that backup-sidecar will push relevant metrics and configure alerts on these metrics.
[Operational] Derive the status of defragmentation at the etcd cluster level. In case of partial failures for a subset of etcd-members, druid can potentially re-trigger defragmentation only for those etcd-members.
The authors propose to capture this information as part of lastDefragmentation
section in the EtcdMember
resource.
lastDefragmentation:
startTime: <start time of defragmentation>
endTime: <end time of defragmentation>
status: <Succeeded | Failed>
message: <success or failure message>
initialDBSize: <size of etcd DB prior to defragmentation>
finalDBSize: <size of etcd DB post defragmentation>
NOTE: Defragmentation is a cluster-wide operation, and insights derived from aggregating defragmentation data from individual etcd-members would be captured in the Etcd
resource status
Monitoring Restorations
Each etcd-member may perform restoration of data multiple times throughout its lifecycle, possibly owing to data corruptions. It would be useful to capture this information as part of an EtcdMember
resource, for the following use cases:
[Diagnostics] It is expected that backup-sidecar
will push a metric indicating failure to restore.
[Operational] Restoration from backup-bucket only happens for a single node etcd cluster. If restoration is failing then druid cannot take any remediatory actions since there is no etcd quorum.
The authors propose to capture this information under lastRestoration
section in the EtcdMember
resource.
lastRestoration:
status: <Failed | Success | In-Progress>
reason: <reason-code for status>
message: <human readable message for status>
startTime: <start time of restoration>
endTime: <end time of restoration>
Authors have considered the following cases to better understand how errors during restoration will be handled:
Case #1 - Failure to connect to Provider Object Store
At present full and delta snapshots are downloaded during restoration. If there is a failure then initialization status transitions to Failed
followed by New
which forces etcd-wrapper
to trigger the initialization again. This in a way forces a retry and currently there is no limit on the number of attempts.
Authors propose to improve the retry logic but keep the overall behavior of not forcing a container restart the same.
Case #2 - Read-Only Mounted volume
If a mounted volume which is used to create the etcd data directory turns read-only
then authors propose to capture this state via EtcdMember
.
Authors propose that druid
should initiate recovery by deleting the PVC for this etcd-member and letting StatefulSet
controller re-create the Pod and the PVC. Removing the PVC and deleting the Pod is considered safe because either:
- The data directory is present but the DB is corrupt, resulting in an unusable etcd, or
- The data directory is not present, and any attempt to create the directory structure fails due to a read-only filesystem.
In both these cases there is no side-effect of deleting the PVC and the Pod.
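For illustration, the manual equivalent of this remediation (all resource names below are placeholders) would be:
# Illustrative only. Delete the member's PVC first (it remains in Terminating
# state while the pod still mounts it), then delete the pod; the StatefulSet
# controller re-creates the pod and a fresh PVC is provisioned.
kubectl delete pvc <pvc-of-etcd-member> -n <namespace>
kubectl delete pod <etcd-member-pod> -n <namespace>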
Case #3 - Revision mismatch
There is currently an issue in backup-sidecar which results in a revision mismatch in the snapshots (full/delta) taken by the leading backup-sidecar container. This results in a restoration failure. One occurrence of such an issue has been captured in Issue#583. This occurrence points to a bug which should be fixed; however, there is a rare possibility that these snapshots (full/delta) get corrupted. In this rare situation, backup-sidecar should only raise an alert.
Authors propose that druid
should not take any remediatory actions as this involves:
- Inspecting snapshots.
- If the full snapshot is corrupt, then a decision needs to be taken to recover from the last full snapshot as the base snapshot. This can result in data loss and therefore needs manual intervention.
- If a delta snapshot is corrupt, then recovery can only be done up to the corrupt revision in the delta snapshot. Since this will also result in a loss of data, this decision needs to be taken by an operator.
Monitoring Volume Mismatches
Each etcd-member checks for possible etcd data volume mismatches, based on which it decides whether to start the etcd process or not, but this information is not captured anywhere today. It would be beneficial to capture this information as part of the EtcdMember
resource so that a human operator may check this and manually fix the underlying problem with the wrong volume being attached or mounted to an etcd-member pod.
The authors propose to capture this information under volumeMismatches
section in the EtcdMember
resource.
volumeMismatches:
- identifiedAt: <time at which wrong volume mount was identified>
fixedAt: <time at which correct volume was mounted>
volumeID: <volume ID of wrong volume that got mounted>
numRestarts: <num of etcd-member restarts that were attempted>
Each entry under volumeMismatches
will be for a unique volumeID
. If there is a pod restart and it results in yet another unexpected volumeID
(different from the already captured volumeIDs) then a new entry will get created. numRestarts
denotes the number of restarts seen by the etcd-member for a specific volumeID
.
Based on information from the volumeMismatches
section, druid may choose to perform rudimentary remedial actions, as simple as restarting the member pod to force a possible rescheduling of the pod to a different node, which could potentially cause the correct volume to be mounted to the member.
Custom Resource API
Spec vs Status
Information that is captured in the etcd-member custom resource could be represented either as EtcdMember.Status
or EtcdMemberState.Spec
.
Gardener has a similar need to capture a shoot state and they have taken the decision to represent it via ShootState resource where the state or status of a shoot is captured as part of the Spec
field in the ShootState
custom resource.
The authors wish to instead align themselves with the K8S API conventions and choose to use EtcdMember
custom resource and capture the status of each member in Status
field of this resource. This has the following advantages:
Spec
represents a desired state of a resource and what is intended to be captured is the As-Is
state of a resource which Status
is meant to capture. Therefore, semantically using Status
is the correct choice.
Not mis-using Spec
now to represent As-Is
state provides us with a choice to extend the custom resource with any future need for a Spec
a.k.a desired state.
Representing State Transitions
The authors propose to use a custom representation for states, sub-states and transitions.
Consider the following representation:
transitions:
- state: <name of the state that the etcd-member has transitioned to>
subState: <name of the sub-state if any>
reason: <reason code for the transition>
transitionTime: <time of transition to this state>
message: <detailed message if any>
As an example, consider the following transitions which represent addition of an etcd-member during scale-up of an etcd cluster, followed by a restart of the etcd-member which detects a corrupt DB:
status:
transitions:
- state: New
subState: New
reason: ClusterScaledUp
transitionTime: "2023-07-17T05:00:00Z"
message: "New member added due to etcd cluster scale-up"
- state: Starting
subState: PendingLearner
reason: WaitingToJoinAsLearner
transitionTime: "2023-07-17T05:00:30Z"
message: "Waiting to join the cluster as a learner"
- state: Starting
subState: Learner
reason: JoinedAsLearner
transitionTime: "2023-07-17T05:01:20Z"
message: "Joined the cluster as a learner"
- state: Started
subState: Follower
reason: PromotedAsVotingMember
transitionTime: "2023-07-17T05:02:00Z"
message: "Now in sync with leader, promoted as voting member"
- state: Initializing
subState: DBValidationFull
reason: DetectedPreviousUncleanExit
transitionTime: "2023-07-17T08:00:00Z"
message: "Detected previous unclean exit, requires full DB validation"
- state: New
subState: New
reason: DBCorruptionDetected
transitionTime: "2023-07-17T08:01:30Z"
message: "Detected DB corruption during initialization, removing member from cluster"
- state: Starting
subState: PendingLearner
reason: WaitingToJoinAsLearner
transitionTime: "2023-07-17T08:02:10Z"
message: "Waiting to join the cluster as a learner"
- state: Starting
subState: Learner
reason: JoinedAsLearner
transitionTime: "2023-07-17T08:02:20Z"
message: "Joined the cluster as a learner"
- state: Started
subState: Follower
reason: PromotedAsVotingMember
transitionTime: "2023-07-17T08:04:00Z"
message: "Now in sync with leader, promoted as voting member"
Reason Codes
The authors propose the following list of possible reason codes for transitions. This list is not exhaustive, and can be further enhanced to capture any new transitions in the future.
| Reason | Transition From State (SubState) | Transition To State (SubState) |
| --- | --- | --- |
| ClusterScaledUp / NewSingleNodeClusterCreated | nil | New |
| DetectedPreviousCleanExit | New / Started (Leader) / Started (Follower) | Initializing (DBValidationSanity) |
| DetectedPreviousUncleanExit | New / Started (Leader) / Started (Follower) | Initializing (DBValidationFull) |
| DBValidationFailed | Initializing (DBValidationSanity) / Initializing (DBValidationFull) | Initializing (Restoration) / New |
| DBValidationSucceeded | Initializing (DBValidationSanity) / Initializing (DBValidationFull) | Started (Leader) / Started (Follower) |
| RestorationSucceeded | Initializing (Restoration) | Started (Leader) |
| WaitingToJoinAsLearner | New | Starting (PendingLearner) |
| JoinedAsLearner | Starting (PendingLearner) | Starting (Learner) |
| PromotedAsVotingMember | Starting (Learner) | Started (Follower) |
| GainedClusterLeadership | Started (Follower) | Started (Leader) |
| LostClusterLeadership | Started (Leader) | Started (Follower) |
API
EtcdMember
The authors propose to add the EtcdMember
custom resource API to etcd-druid APIs and initially introduce it with v1alpha1
version.
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdMember
metadata:
labels:
gardener.cloud/owned-by: <name of parent Etcd resource>
name: <name of the etcd-member>
namespace: <namespace | will be the same as that of parent Etcd resource>
ownerReferences:
- apiVersion: druid.gardener.cloud/v1alpha1
blockOwnerDeletion: true
controller: true
kind: Etcd
name: <name of the parent Etcd resource>
uid: <UID of the parent Etcd resource>
status:
id: <etcd-member id>
clusterID: <etcd cluster id>
peerTLSEnabled: <bool>
dbSize: <db-size>
dbSizeInUse: <db-size-in-use>
snapshots:
lastFull:
timestamp: <time of full snapshot>
name: <name of the file that is uploaded>
size: <size of the un-compressed snapshot file uploaded>
startRevision: <start revision of etcd db captured in the snapshot>
endRevision: <end revision of etcd db captured in the snapshot>
lastDelta:
timestamp: <time of delta snapshot>
name: <name of the file that is uploaded>
size: <size of the un-compressed snapshot file uploaded>
startRevision: <start revision of etcd db captured in the snapshot>
endRevision: <end revision of etcd db captured in the snapshot>
accumulatedDeltaSize: <total size of delta snapshots since last full snapshot>
lastRestoration:
type: <FromSnapshot | FromLeader>
status: <Failed | Success | In-Progress>
startTime: <start time of restoration>
endTime: <end time of restoration>
lastDefragmentation:
startTime: <start time of defragmentation>
endTime: <end time of defragmentation>
reason:
message:
initialDBSize: <size of etcd DB prior to defragmentation>
finalDBSize: <size of etcd DB post defragmentation>
volumeMismatches:
- identifiedAt: <time at which wrong volume mount was identified>
fixedAt: <time at which correct volume was mounted>
volumeID: <volume ID of wrong volume that got mounted>
numRestarts: <num of pod restarts that were attempted>
transitions:
- state: <name of the state that the etcd-member has transitioned to>
subState: <name of the sub-state if any>
reason: <reason code for the transition>
transitionTime: <time of transition to this state>
message: <detailed message if any>
Etcd
Authors propose the following changes to the Etcd
API:
- In the
Etcd.Status
resource API, member status is computed and stored. This field will be marked as deprecated and in a later version of druid it will be removed. In its place, the authors propose to introduce the following:
type EtcdStatus struct {
// MemberRefs contains references to all existing EtcdMember resources
MemberRefs []CrossVersionObjectReference
}
- In
Etcd.Status
resource API, PeerUrlTLSEnabled reflects the status of enabling TLS for peer communication across all etcd-members. Currently this field is not used anywhere. In this proposal, the authors have also proposed that each EtcdMember resource should capture the status of TLS enablement of its peer URL. The authors therefore propose to revisit the need to have this field under EtcdStatus.
Lifecycle of an EtcdMember
Creation
Druid creates an EtcdMember
resource for every replica in etcd.Spec.Replicas
during reconciliation of an etcd resource. For a fresh etcd cluster this is done prior to creation of the StatefulSet resource and for an existing cluster which has now been scaled-up, it is done prior to updating the StatefulSet resource.
Updation
All fields in EtcdMember.Status
are only updated by the corresponding etcd-member. Druid only consumes the information published via EtcdMember
resources.
Deletion
Druid is responsible for deletion of all existing EtcdMember
resources for an etcd cluster. There are three scenarios where an EtcdMember
resource will be deleted:
Deletion of etcd resource.
Scale down of an etcd cluster to 0 replicas due to hibernation of the k8s control plane.
Transient scale down of an etcd cluster to 0 replicas to recover from a quorum loss.
Authors found no reason to retain EtcdMember resources when the etcd cluster is scaled down to 0 replicas, since the information contained in each EtcdMember resource would no longer represent the current state of each member and would thus be stale. Any controller in druid which acts upon the EtcdMember.Status could potentially take incorrect actions.
Reconciliation
Authors propose to introduce a new controller (let’s call it etcd-member-controller
) which watches for changes to the EtcdMember
resource(s). If a reconciliation of an Etcd
resource is required as a result of change in EtcdMember
status then this controller should enqueue an event and force a reconciliation via existing etcd-controller
, thus preserving the single-actor-principle constraint which ensures deterministic changes to etcd cluster resources.
NOTE: Further decisions w.r.t responsibility segregation will be taken during implementation and will not be documented in this proposal.
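A minimal sketch of how such a mapping could be wired up with controller-runtime is shown below. It assumes the proposed EtcdMember resource carries the gardener.cloud/owned-by label from the example above; it is illustrative only and not the proposed implementation.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// mapEtcdMemberToEtcd maps an EtcdMember event to a reconcile request for its
// parent Etcd resource by reading the gardener.cloud/owned-by label, so that
// the existing etcd-controller performs the actual reconciliation.
// Illustrative sketch only.
func mapEtcdMemberToEtcd(_ context.Context, obj client.Object) []reconcile.Request {
	ownerName, ok := obj.GetLabels()["gardener.cloud/owned-by"]
	if !ok {
		return nil
	}
	return []reconcile.Request{{NamespacedName: types.NamespacedName{
		Namespace: obj.GetNamespace(),
		Name:      ownerName,
	}}}
}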
Stale EtcdMember Status Handling
It is possible that an etcd-member is unable to update its respective EtcdMember
resource. Following can be some of the implications which should be kept in mind while reconciling EtcdMember
resource in druid:
- Druid sees stale state transitions (this assumes that the backup-sidecar attempts to update the state/sub-state in etcdMember.status.transitions on a best-effort basis). There is currently no implication other than an operator seeing a stale state.
- dbSize and dbSizeInUse could not be updated. A consequence could be that druid continues to see a high value for dbSize - dbSizeInUse for an extended amount of time. Druid should ensure that it does not trigger repeated defragmentations.
- If VolumeMismatches is stale, then druid should no longer attempt to recover by repeatedly restarting the pod.
- A failed restoration was recorded last and further updates to this array failed. Druid should not repeatedly take full snapshots.
- If snapshots.accumulatedDeltaSize could not be updated, then druid should not schedule repeated compaction Jobs.
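For instance, a guard of the following shape (purely illustrative; neither the function nor the threshold is part of this proposal) could be applied before druid acts on any of the fields above:
package controller

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isStale reports whether the last observed update/transition time is older
// than a configurable threshold; druid could skip remedial actions
// (defragmentation, pod restarts, full snapshots, compaction jobs) when this
// returns true. Illustrative sketch, not part of the proposed API.
func isStale(lastUpdate metav1.Time, threshold time.Duration) bool {
	return time.Since(lastUpdate.Time) > threshold
}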
Reference
11 - Feature Gates in Etcd-Druid
Feature Gates in Etcd-Druid
This page contains an overview of the various feature gates an administrator can specify on etcd-druid.
Overview
Feature gates are a set of key=value pairs that describe etcd-druid features. You can turn these features on or off by passing them to the --feature-gates
CLI flag in the etcd-druid command.
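For example (illustrative invocation; the exact wiring of the etcd-druid command depends on how it is deployed):
# enable the UseEtcdWrapper feature gate explicitly
etcd-druid --feature-gates=UseEtcdWrapper=true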
The following tables are a summary of the feature gates that you can set on etcd-druid.
- The “Since” column contains the etcd-druid release when a feature is introduced or its release stage is changed.
- The “Until” column, if not empty, contains the last etcd-druid release in which you can still use a feature gate.
- If a feature is in the Alpha or Beta state, you can find the feature listed in the Alpha/Beta feature gate table.
- If a feature is stable you can find all stages for that feature listed in the Graduated/Deprecated feature gate table.
- The Graduated/Deprecated feature gate table also lists deprecated and withdrawn features.
Feature Gates for Alpha or Beta Features
| Feature | Default | Stage | Since | Until |
| --- | --- | --- | --- | --- |
| UseEtcdWrapper | false | Alpha | 0.19 | 0.21 |
| UseEtcdWrapper | true | Beta | 0.22 | |
Feature Gates for Graduated or Deprecated Features
| Feature | Default | Stage | Since | Until |
| --- | --- | --- | --- | --- |
Using a Feature
A feature can be in Alpha, Beta or GA stage.
An Alpha feature means:
- Disabled by default.
- Might be buggy. Enabling the feature may expose bugs.
- Support for feature may be dropped at any time without notice.
- The API may change in incompatible ways in a later software release without notice.
- Recommended for use only in short-lived testing clusters, due to increased
risk of bugs and lack of long-term support.
A Beta feature means:
- Enabled by default.
- The feature is well tested. Enabling the feature is considered safe.
- Support for the overall feature will not be dropped, though details may change.
- The schema and/or semantics of objects may change in incompatible ways in a
subsequent beta or stable release. When this happens, we will provide instructions
for migrating to the next version. This may require deleting, editing, and
re-creating API objects. The editing process may require some thought.
This may require downtime for applications that rely on the feature.
- Recommended for only non-critical uses because of potential for
incompatible changes in subsequent releases.
Please do try Beta features and give feedback on them!
After they exit beta, it may not be practical for us to make more changes.
A General Availability (GA) feature is also referred to as a stable feature. It means:
- The feature is always enabled; you cannot disable it.
- The corresponding feature gate is no longer needed.
- Stable versions of features will appear in released software for many subsequent versions.
List of Feature Gates
| Feature | Description |
| --- | --- |
| UseEtcdWrapper | Enables the use of etcd-wrapper image and a compatible version of etcd-backup-restore, along with component-specific configuration changes necessary for the usage of the etcd-wrapper image. |
12 - Getting Started Locally
Etcd-Druid Local Setup
This page aims to provide steps on how to set up Etcd-Druid locally, with and without storage providers.
Clone the etcd-druid github repo
# clone the repo
git clone https://github.com/gardener/etcd-druid.git
# cd into etcd-druid folder
cd etcd-druid
Note:
- Etcd-druid uses kind as its local Kubernetes engine. The local setup is configured for kind due to its convenience, but any other Kubernetes setup would also work.
- To set up etcd-druid with backups enabled on a LocalStack provider, refer to this document
- In the section Annotate Etcd CR with the reconcile annotation, the flag
--enable-etcd-spec-auto-reconcile
is set to false
, which means a special annotation is required on the Etcd CR, for etcd-druid to reconcile it. To disable this behavior and allow auto-reconciliation of the Etcd CR for any change in the Etcd spec, set the controllers.etcd.enableEtcdSpecAutoReconcile
value to true
in the values.yaml
located at charts/druid/values.yaml
. Or if etcd-druid is being run as a process, then while starting the process, set the CLI flag --enable-etcd-spec-auto-reconcile=true
for it.
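For example, the corresponding override in charts/druid/values.yaml would look roughly as follows (structure assumed from the key path mentioned above):
controllers:
  etcd:
    enableEtcdSpecAutoReconcile: true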
Setting up the kind cluster
# Create a kind cluster
make kind-up
This creates a new kind cluster and stores the kubeconfig in the ./hack/e2e-test/infrastructure/kind/kubeconfig
file.
To target this newly created cluster, set the KUBECONFIG
environment variable to the kubeconfig file located at ./hack/e2e-test/infrastructure/kind/kubeconfig
by using the following
export KUBECONFIG=$PWD/hack/e2e-test/infrastructure/kind/kubeconfig
Setting up etcd-druid
Either one of these commands may be used to deploy etcd-druid to the configured k8s cluster.
The following command deploys etcd-druid to the configured k8s cluster:
The following command deploys etcd-druid to the configured k8s cluster using Skaffold dev
mode, such that changes in the etcd-druid code are automatically picked up and applied to the deployment. This helps with local development and quick iterative changes:
The following command deploys etcd-druid to the configured k8s cluster using Skaffold debug
mode, so that a debugger can be attached to the running etcd-druid deployment. Please refer to this guide for more information on Skaffold-based debugging:
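Assuming the repository's standard make targets (the make deploy* family mentioned in the note below; the exact target names are assumptions), these three variants correspond roughly to:
# plain deployment (assumed target name)
make deploy
# Skaffold dev mode (assumed target name)
make deploy-dev
# Skaffold debug mode (assumed target name)
make deploy-debug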
This generates the Etcd
and EtcdCopyBackupsTask
CRDs and deploys an etcd-druid pod into the cluster.
Note: Before calling any of the make deploy*
commands, certain environment variables may be set in order to enable/disable certain functionalities of etcd-druid. These are:
- DRUID_ENABLE_ETCD_COMPONENTS_WEBHOOK=true: enables the etcdcomponents webhook
- DRUID_E2E_TEST=true: sets specific configuration for etcd-druid for optimal e2e test runs, like a lower sync period for the etcd controller.
- USE_ETCD_DRUID_FEATURE_GATES=false: disables etcd-druid feature gates.
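For example, to deploy with the etcd components webhook enabled (the make deploy target name is an assumption, based on the make deploy* note above):
DRUID_ENABLE_ETCD_COMPONENTS_WEBHOOK=true make deploy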
Prepare the Etcd CR
The Etcd CR can be configured in two ways: either with backups enabled (snapshots are uploaded to an object store) or with backups disabled. Follow the appropriate section below based on the requirement.
The Etcd CR can be found at this location $PWD/config/samples/druid_v1alpha1_etcd.yaml
Without Backups enabled
To set up etcd-druid without backups enabled, make sure the spec.backup.store
section of the Etcd CR is commented out.
With Backups enabled (On Cloud Provider Object Stores)
Prepare the secret
Create a secret for cloud provider access. Find the secret yaml templates for different cloud providers here.
Replace the dummy values with the actual configurations and make sure to add a name and a namespace to the secret as intended.
Note 1: The secret should be applied in the same namespace as druid.
Note 2: All the values in the data field of secret yaml should be in base64 encoded format.
Apply the secret
kubectl apply -f path/to/secret
Adapt Etcd
resource
Uncomment the spec.backup.store
section of the druid yaml and set the keys to allow backup-restore to take backups by connecting to an object store.
# Configuration for storage provider
store:
secretRef:
name: etcd-backup-secret-name
container: object-storage-container-name
provider: aws # options: aws,azure,gcp,openstack,alicloud,dell,openshift,local
prefix: etcd-test
Brief explanation of keys:
- secretRef.name is the name of the secret that was applied as mentioned above
- store.container is the object storage bucket name
- store.provider is the bucket provider. Pick from the options mentioned in the comment
- store.prefix is the folder name that you want to use for your snapshots inside the bucket.
Applying the Etcd CR
Note: With backups enabled, make sure the bucket is created in corresponding cloud provider before applying the Etcd yaml
Create the Etcd CR (Custom Resource) by applying the Etcd yaml to the cluster
# Apply the prepared etcd CR yaml
kubectl apply -f config/samples/druid_v1alpha1_etcd.yaml
Verify the Etcd cluster
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the cluster size, readiness status of its members, and various other attributes.
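A likely command for this step (assuming the sample Etcd resource named etcd-test):
# Inspect the Etcd resource; wide output shows cluster size and readiness
kubectl get etcd etcd-test -o wide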
Verify Etcd Member Pods
To check the etcd member pods, do the following and look out for pods starting with the name etcd-
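For example:
# etcd member pods are named after the Etcd resource, e.g. etcd-test-0
kubectl get pods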
Verify Etcd Pods’ Functionality
Verify that the etcd pods are functional by writing data through one etcd container and reading it back from the same or another container, depending on whether it is a single-node or multi-node etcd cluster.
Ideally, you can exec into the etcd container using kubectl exec -it <etcd_pod> -c etcd -- bash
if it utilizes a base image containing a shell. However, note that the etcd-wrapper
Docker image employs a distroless image, which lacks a shell. To interact with etcd, use an Ephemeral container as a debug container. Refer to this documentation for building and using an ephemeral container by attaching it to the etcd container.
# Put a key-value pair into the etcd
etcdctl put key1 value1
# Retrieve all key-value pairs from the etcd db
etcdctl get --prefix ""
For a multi-node etcd cluster, insert the key-value pair from the etcd
container of one etcd member and retrieve it from the etcd
container of another member to verify consensus among the multiple etcd members.
View Etcd Database File
The Etcd database file is located at /var/etcd/data/new.etcd/snap/db
inside the backup-restore
container. In versions with an alpine
base image, you can exec directly into the container. However, in recent versions where the backup-restore
docker image started using a distroless image, a debug container is required to communicate with it, as mentioned in the previous section.
Updating the Etcd CR
The Etcd
spec can be updated with new changes, such as etcd cluster configuration or backup-restore configuration, and etcd-druid will reconcile these changes as expected, under certain conditions:
- If the
--enable-etcd-spec-auto-reconcile
flag is set to true
, the spec change is automatically picked up and reconciled by etcd-druid. - If the
--enable-etcd-spec-auto-reconcile
flag is unset, or set to false
, then etcd-druid will expect an additional annotation gardener.cloud/operation: reconcile
on the Etcd
resource in order to pick it up for reconciliation. Upon successful reconciliation, this annotation is removed by etcd-druid. The annotation can be added as follows:
# Annotate etcd-test CR to reconcile
kubectl annotate etcd etcd-test gardener.cloud/operation="reconcile"
Cleaning the setup
# Delete the cluster
make kind-down
This cleans up the entire setup as the kind cluster gets deleted. It deletes the created Etcd, all pods that were created along the way, and also other resources such as StatefulSets, Services, PVs, PVCs, etc.
13 - Getting Started Locally Azurite
Getting started with etcd-druid
using Azurite
, and kind
This document is a step-by-step guide to run etcd-druid
with Azurite
, the Azure Blob Storage
emulator, within a kind
cluster. This setup is ideal for local development and testing.
Prerequisites
- Docker with the daemon running, or Docker Desktop running.
- Azure CLI (>=2.55.0)
Environment setup
Step 1: Provisioning the kind
cluster
Execute the command below to provision a kind
cluster. This command also forwards port 10000
from the kind
cluster to your local machine, enabling Azurite
access:
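This is most likely the repository's standard kind target (assuming its kind cluster configuration already maps port 10000 to the host):
make kind-up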
Export the KUBECONFIG
file after running the above command.
Step 2: Deploy Azurite
To start up the Azurite
emulator in a pod in the kind
cluster, run:
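Assuming the repository provides a dedicated make target for this (the target name is an assumption):
make deploy-azurite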
Step 3: Set up an ABS Container
- To use the
Azure CLI
with the Azurite
emulator running as a pod in the kind
cluster, export the connection string for the Azure CLI
.
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
- Create an
Azure Blob Storage Container
in Azurite
az storage container create -n etcd-bucket
Step 4: Deploy etcd-druid
- Apply the Kubernetes
Secret
manifest through:
kubectl apply -f config/samples/etcd-secret-azurite.yaml
- Apply the
Etcd
manifest through:
kubectl apply -f config/samples/druid_v1alpha1_etcd_azurite.yaml
Step 6: Make use of the Azurite emulator however you wish
etcd-backup-restore
will now use Azurite
running in kind
as the remote store to store snapshots if all the previous steps were followed correctly.
Cleanup
make kind-down
unset AZURE_STORAGE_CONNECTION_STRING KUBECONFIG
14 - Getting Started Locally Localstack
Getting Started with etcd-druid, LocalStack, and Kind
This guide provides step-by-step instructions on how to set up etcd-druid with LocalStack and Kind on your local machine. LocalStack emulates AWS services locally, which allows the etcd cluster to interact with AWS S3 without the need for an actual AWS connection. This setup is ideal for local development and testing.
Prerequisites
- Docker (installed and running)
- AWS CLI (version
>=1.29.0
or >=2.13.0
)
Environment Setup
Step 1: Provision the Kind Cluster
Execute the command below to provision a kind
cluster. This command also forwards port 4566
from the kind cluster to your local machine, enabling LocalStack access:
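This is most likely the same make kind-up target used in the general local setup guide (assuming its kind cluster configuration maps port 4566 to the host):
make kind-up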
Step 2: Deploy LocalStack
Deploy LocalStack onto the Kubernetes cluster using the command below:
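Assuming a dedicated make target exists for this (the target name is an assumption):
make deploy-localstack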
Step 3: Set up an S3 Bucket
- Set up the AWS CLI to interact with LocalStack by setting the necessary environment variables. This configuration redirects S3 commands to the LocalStack endpoint and provides the required credentials for authentication:
export AWS_ENDPOINT_URL_S3="http://localhost:4566"
export AWS_ACCESS_KEY_ID=ACCESSKEYAWSUSER
export AWS_SECRET_ACCESS_KEY=sEcreTKey
export AWS_DEFAULT_REGION=us-east-2
- Create an S3 bucket for etcd-druid backup purposes:
aws s3api create-bucket --bucket etcd-bucket --region us-east-2 --create-bucket-configuration LocationConstraint=us-east-2 --acl private
Step 4: Deploy etcd-druid
Deploy etcd-druid onto the Kind cluster using the command below:
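Most likely via the standard deploy target mentioned in the general local setup guide (the target name is an assumption):
make deploy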
Apply the required Kubernetes manifests to create an etcd custom resource (CR) and a secret for AWS credentials, facilitating LocalStack access:
export KUBECONFIG=hack/e2e-test/infrastructure/kind/kubeconfig
kubectl apply -f config/samples/druid_v1alpha1_etcd_localstack.yaml -f config/samples/etcd-secret-localstack.yaml
Step 6: Reconcile the etcd
Initiate etcd reconciliation by annotating the etcd resource with the gardener.cloud/operation=reconcile
annotation:
kubectl annotate etcd etcd-test gardener.cloud/operation=reconcile
Congratulations! You have successfully configured etcd-druid
, LocalStack
, and kind
on your local machine. Inspect the etcd-druid logs and LocalStack to ensure the setup operates as anticipated.
To validate the buckets, execute the following command:
aws s3 ls etcd-bucket/etcd-test/v2/
Cleanup
To dismantle the setup, execute the following command:
make kind-down
unset AWS_ENDPOINT_URL_S3 AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION KUBECONFIG
15 - Local e2e Tests
e2e Test Suite
Developers can run extended e2e tests, in addition to unit tests, for Etcd-Druid in or from
their local environments. This is recommended to verify the desired behavior of several features
and to avoid regressions in future releases.
The very same tests typically run as part of the component’s release job as well as on demand, e.g.,
when triggered by Etcd-Druid maintainers for open pull requests.
Testing Etcd-Druid automatically involves a certain test coverage for gardener/etcd-backup-restore
which is deployed as a side-car to the actual etcd
container.
Prerequisites
The e2e test lifecycle is managed with the help of skaffold. Every involved step like setup
,
deploy
, undeploy
or cleanup
is executed against a Kubernetes cluster which makes it a mandatory prerequisite at the same time.
Only skaffold itself with involved docker
, helm
and kubectl
executions as well as
the e2e-tests are executed locally. Required binaries are automatically downloaded if you use the corresponding make
target,
as described in this document.
It is expected that especially the deploy step is run against a Kubernetes cluster which doesn't contain a Druid deployment or any leftovers like druid.gardener.cloud CRDs; the deploy step will likely fail in such scenarios.
Tip: Create a fresh KinD cluster or a similar one with a small footprint before executing the tests.
Providers
The following providers are supported for e2e tests:
Valid credentials need to be provided when tests are executed with mentioned cloud providers.
Flow
An e2e test execution involves the following steps:
| Step | Description |
| --- | --- |
| setup | Create a storage bucket which is used for etcd backups (only with cloud providers). |
| deploy | Build Docker image, upload it to registry (if remote cluster - see Docker build), deploy Helm chart (charts/druid) to Kubernetes cluster. |
| test | Execute e2e tests as defined in test/e2e. |
| undeploy | Remove the deployed artifacts from Kubernetes cluster. |
| cleanup | Delete storage bucket and Druid deployment from test cluster. |
Make target
Executing e2e tests is as easy as executing the following command, with the Env-Vars defined as described in the following section and as needed for your test scenario.
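The target itself is test-e2e, as also used in the provider-specific examples below:
make test-e2e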
Common Env Variables
The following environment variables influence how the flow described above is executed:
- PROVIDERS: Providers used for testing (all, aws, azure, gcp, local). Multiple entries must be comma separated. Note: Some tests (e.g. multi-node tests) will use the very first entry from PROVIDERS, so for such tests to use a specific provider, specify that provider as the first entry.
- KUBECONFIG: Kubeconfig pointing to the cluster where Etcd-Druid will be deployed (preferably KinD).
- TEST_ID: Some ID which is used to create assets for and during testing.
- STEPS: Steps executed by the make target (setup, deploy, test, undeploy, cleanup - default: all steps).
AWS Env Variables
- AWS_ACCESS_KEY_ID: Key ID of the user.
- AWS_SECRET_ACCESS_KEY: Access key of the user.
- AWS_REGION: Region in which the test bucket is created.
Example:
make \
AWS_ACCESS_KEY_ID="abc" \
AWS_SECRET_ACCESS_KEY="xyz" \
AWS_REGION="eu-central-1" \
KUBECONFIG="$HOME/.kube/config" \
PROVIDERS="aws" \
TEST_ID="some-test-id" \
STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e
Azure Env Variables
- STORAGE_ACCOUNT: Storage account used for managing the storage container.
- STORAGE_KEY: Key of the storage account.
Example:
make \
STORAGE_ACCOUNT="abc" \
STORAGE_KEY="eHl6Cg==" \
KUBECONFIG="$HOME/.kube/config" \
PROVIDERS="azure" \
TEST_ID="some-test-id" \
STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e
GCP Env Variables
- GCP_SERVICEACCOUNT_JSON_PATH: Path to the service account json file used for this test.
- GCP_PROJECT_ID: ID of the GCP project.
Example:
make \
GCP_SERVICEACCOUNT_JSON_PATH="/var/lib/secrets/serviceaccount.json" \
GCP_PROJECT_ID="xyz-project" \
KUBECONFIG="$HOME/.kube/config" \
PROVIDERS="gcp" \
TEST_ID="some-test-id" \
STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e
Local Env Variables
No special environment variables are required for running e2e tests with Local
provider.
Example:
make \
KUBECONFIG="$HOME/.kube/config" \
PROVIDERS="local" \
TEST_ID="some-test-id" \
STEPS="setup,deploy,test,undeploy,cleanup" \
test-e2e
e2e test with localstack
The above-mentioned e2e tests need storage buckets from real cloud providers to be set up. However, there is a tool named localstack that makes it possible to run e2e tests with mock AWS storage, and a KIND cluster can be provisioned for the tests as well. Together, localstack and a KIND cluster remove the dependency on any actual cloud provider infrastructure for running the e2e tests.
How are the KIND cluster and localstack set up
KIND, or Kubernetes-In-Docker, is a Kubernetes cluster that is set up inside a Docker container. Such a cluster has limited capability, as it does not have much compute power, but it can easily be set up inside a container and torn down simply by removing the container. That makes a KIND cluster very convenient for e2e tests. The Makefile helps to spin up a KIND cluster and use it to run the e2e tests.
There is a docker image for localstack. The image is deployed as a pod inside the KIND cluster through hack/e2e-test/infrastructure/localstack/localstack.yaml; the Makefile takes care of deploying this yaml file into the KIND cluster.
The developer needs to run the make ci-e2e-kind command. This command in turn runs hack/ci-e2e-kind.sh, which spins up the KIND cluster, deploys localstack in it, and then runs the e2e tests using localstack as the mock AWS storage provider. The e2e tests are actually run on the host machine, but they deploy the druid controller inside the KIND cluster, and the druid controller spawns multi-node etcd clusters inside the KIND cluster. The e2e tests verify whether the druid controller performs its jobs correctly. The mock localstack storage is cleaned up after every e2e test, which is why the e2e tests need to access the localstack pod running inside the KIND cluster; the network traffic between the host machine and the localstack pod is resolved by mapping the localstack pod port to a host port while setting up the KIND cluster via hack/e2e-test/infrastructure/kind/cluster.yaml
How to execute e2e tests with localstack and KIND cluster
Run the following make
command to spin up a KinD cluster, deploy localstack and run the e2e tests with provider aws
:
make ci-e2e-kind
16 - Metrics
Monitoring
etcd-druid uses Prometheus for metrics reporting. The metrics can be used for real-time monitoring and debugging of compaction jobs.
The simplest way to see the available metrics is to cURL the metrics endpoint /metrics
. The format is described here.
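For example, assuming the etcd-druid metrics port is reachable locally (the address and port here are assumptions):
curl http://localhost:8080/metrics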
Follow the Prometheus getting started doc to spin up a Prometheus server to collect etcd metrics.
The naming of metrics follows the suggested Prometheus best practices. All compaction related metrics are put under namespace etcddruid
and the respective subsystems.
Snapshot Compaction
These metrics provide information about the compaction jobs that run at some interval in shoot control planes. Studying the metrics, we can deduce how many compaction jobs ran successfully, how many failed, how many delta events were compacted, etc.
| Name | Description | Type |
| --- | --- | --- |
| etcddruid_compaction_jobs_total | Total number of compaction jobs initiated by compaction controller. | Counter |
| etcddruid_compaction_jobs_current | Number of currently running compaction job. | Gauge |
| etcddruid_compaction_job_duration_seconds | Total time taken in seconds to finish a running compaction job. | Histogram |
| etcddruid_compaction_num_delta_events | Total number of etcd events to be compacted by a compaction job. | Gauge |
The etcddruid_compaction_jobs_total metric has two labels. The label succeeded shows how many compaction jobs succeeded, and the label failed shows how many compaction jobs failed.
The etcddruid_compaction_job_duration_seconds metric has the same two labels. The label succeeded captures how much time a successful job took to complete, and the label failed captures how much time a failed compaction job took.
The etcddruid_compaction_jobs_current metric comes with the label etcd_namespace, which indicates the namespace of the Etcd running in the control plane of a shoot cluster.
Etcd
These metrics are exposed by the etcd process that runs in each etcd pod.
The following list of metrics is applicable to a multi-node etcd cluster. The full list of metrics exposed by etcd
is available here.
| No. | Metrics Name | Description | Comments |
| --- | --- | --- | --- |
| 1 | etcd_disk_wal_fsync_duration_seconds | latency distributions of fsync called by WAL. | High disk operation latencies indicate disk issues. |
| 2 | etcd_disk_backend_commit_duration_seconds | latency distributions of commit called by backend. | High disk operation latencies indicate disk issues. |
| 3 | etcd_server_has_leader | whether or not a leader exists. 1: leader exists, 0: leader not exists. | To capture quorum loss or to check the availability of etcd cluster. |
| 4 | etcd_server_is_leader | whether or not this member is a leader. 1 if it is, 0 otherwise. | |
| 5 | etcd_server_leader_changes_seen_total | number of leader changes seen. | Helpful in fine-tuning the zonal cluster, like etcd-heartbeat time etc.; it can also indicate etcd load and network issues. |
| 6 | etcd_server_is_learner | whether or not this member is a learner. 1 if it is, 0 otherwise. | |
| 7 | etcd_server_learner_promote_successes | total number of successful learner promotions while this member is leader. | Might be helpful in checking the success of API calls called by backup-restore. |
| 8 | etcd_network_client_grpc_received_bytes_total | total number of bytes received from grpc clients. | Client Traffic In. |
| 9 | etcd_network_client_grpc_sent_bytes_total | total number of bytes sent to grpc clients. | Client Traffic Out. |
| 10 | etcd_network_peer_sent_bytes_total | total number of bytes sent to peers. | Useful for network usage. |
| 11 | etcd_network_peer_received_bytes_total | total number of bytes received from peers. | Useful for network usage. |
| 12 | etcd_network_active_peers | current number of active peer connections. | Might be useful in detecting issues like network partition. |
| 13 | etcd_server_proposals_committed_total | total number of consensus proposals committed. | A consistently large lag between a single member and its leader indicates that member is slow or unhealthy. |
| 14 | etcd_server_proposals_pending | current number of pending proposals to commit. | Pending proposals suggest there is a high client load or the member cannot commit proposals. |
| 15 | etcd_server_proposals_failed_total | total number of failed proposals seen. | Might indicate downtime caused by a loss of quorum. |
| 16 | etcd_server_proposals_applied_total | total number of consensus proposals applied. | Difference between etcd_server_proposals_committed_total and etcd_server_proposals_applied_total should usually be small. |
| 17 | etcd_mvcc_db_total_size_in_bytes | total size of the underlying database physically allocated in bytes. | |
| 18 | etcd_server_heartbeat_send_failures_total | total number of leader heartbeat send failures. | Might be helpful in fine-tuning the cluster or detecting slow disk or any network issues. |
| 19 | etcd_network_peer_round_trip_time_seconds | round-trip-time histogram between peers. | Might be helpful in fine-tuning network usage, especially for a zonal etcd cluster. |
| 20 | etcd_server_slow_apply_total | total number of slow apply requests. | Might indicate overload from slow disk. |
| 21 | etcd_server_slow_read_indexes_total | total number of pending read indexes not in sync with leader's or timed out read index requests. | |
The full list of metrics is available here.
Etcd-Backup-Restore
These metrics are exposed by the etcd-backup-restore container in each etcd pod.
The following list of metrics is applicable to a multi-node etcd cluster. The full list of metrics exposed by etcd-backup-restore
is available here.
| No. | Metrics Name | Description |
| --- | --- | --- |
| 1. | etcdbr_cluster_size | to capture the scale-up/scale-down scenarios. |
| 2. | etcdbr_is_learner | whether or not this member is a learner. 1 if it is, 0 otherwise. |
| 3. | etcdbr_is_learner_count_total | total number of times the member is added as a learner. |
| 4. | etcdbr_restoration_duration_seconds | total latency distribution required to restore the etcd member. |
| 5. | etcdbr_add_learner_duration_seconds | total latency distribution of adding the etcd member as a learner to the cluster. |
| 6. | etcdbr_member_remove_duration_seconds | total latency distribution of removing the etcd member from the cluster. |
| 7. | etcdbr_member_promote_duration_seconds | total latency distribution of promoting the learner to a voting member. |
| 8. | etcdbr_defragmentation_duration_seconds | total latency distribution of defragmentation of each etcd cluster member. |
Prometheus supplied metrics
The Prometheus client library provides a number of metrics under the go
and process
namespaces.
17 - operator out-of-band tasks
DEP-05: Operator Out-of-band Tasks
Table of Contents
Summary
This DEP proposes an enhancement to etcd-druid
’s capabilities to handle out-of-band tasks, which are presently performed manually or invoked programmatically via suboptimal APIs. The document proposes the establishment of a unified interface by defining a well-structured API to harmonize the initiation of any out-of-band
task, monitor its status, and simplify the process of adding new tasks and managing their lifecycles.
Terminology
etcd-druid: etcd-druid is an operator to manage the etcd clusters.
backup-sidecar: It is the etcd-backup-restore sidecar container running in each etcd-member pod of etcd cluster.
leading-backup-sidecar: A backup-sidecar that is associated to an etcd leader of an etcd cluster.
out-of-band task: Any on-demand tasks/operations that can be executed on an etcd cluster without modifying the Etcd custom resource spec (desired state).
Motivation
Today, etcd-druid mainly acts as an etcd cluster provisioner (creation, maintenance and deletion). In future, capabilities of etcd-druid will be enhanced via etcd-member proposal by providing it access to much more detailed information about each etcd cluster member. While we enhance the reconciliation and monitoring capabilities of etcd-druid, it still lacks the ability to allow users to invoke out-of-band
tasks on an existing etcd cluster.
There are new learnings while operating etcd clusters at scale. It has been observed that we regularly need capabilities to trigger out-of-band
tasks which are outside of the purview of a regular etcd reconciliation run. Many of these tasks are multi-step processes, and performing them manually is error-prone, even if an operator follows a well-written step-by-step guide. Thus, there is a need to automate these tasks.
Some examples of an on-demand/out-of-band
tasks:
- Recover from a permanent quorum loss of etcd cluster.
- Trigger an on-demand full/delta snapshot.
- Trigger an on-demand snapshot compaction.
- Trigger an on-demand maintenance of etcd cluster.
- Copy the backups from one object store to another object store.
Goals
- Establish a unified interface for operator tasks by defining a single dedicated custom resource for
out-of-band
tasks. - Define a contract (in terms of prerequisites) which needs to be adhered to by any task implementation.
- Facilitate the easy addition of new
out-of-band
task(s) through this custom resource. - Provide CLI capabilities to operators, making it easy to invoke supported
out-of-band
tasks.
Non-Goals
- In the current scope, capability to abort/suspend an
out-of-band
task is not going to be provided. This could be considered as an enhancement based on pull. - Ordering (by establishing dependency) of
out-of-band
tasks submitted for the same etcd cluster has not been considered in the first increment. In a future version based on how operator tasks are used, we will enhance this proposal and the implementation.
Proposal
Authors propose creation of a new single dedicated custom resource to represent an out-of-band
task. Etcd-druid will be enhanced to process the task requests and update its status which can then be tracked/observed.
Custom Resource Golang API
EtcdOperatorTask
is the new custom resource that will be introduced. This API will be in v1alpha1
version and will be subject to change. We will be respecting Kubernetes Deprecation Policy.
// EtcdOperatorTask represents an out-of-band operator task resource.
type EtcdOperatorTask struct {
metav1.TypeMeta
metav1.ObjectMeta
// Spec is the specification of the EtcdOperatorTask resource.
Spec EtcdOperatorTaskSpec `json:"spec"`
// Status is most recently observed status of the EtcdOperatorTask resource.
Status EtcdOperatorTaskStatus `json:"status,omitempty"`
}
Spec
The authors propose that the following fields should be specified in the spec (desired state) of the EtcdOperatorTask
custom resource.
- To capture the type of
out-of-band
operator task to be performed, .spec.type
field should be defined. It can have values from all supported out-of-band
tasks eg. “OnDemandSnaphotTask”, “QuorumLossRecoveryTask” etc. - To capture the configuration specific to each task, a
.spec.config
field should be defined of type string
as each task can have different input configuration.
// EtcdOperatorTaskSpec is the spec for a EtcdOperatorTask resource.
type EtcdOperatorTaskSpec struct {
// Type specifies the type of out-of-band operator task to be performed.
Type string `json:"type"`
// Config is a task specific configuration.
Config string `json:"config,omitempty"`
// TTLSecondsAfterFinished is the time-to-live to garbage collect the
// related resource(s) of task once it has been completed.
// +optional
TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
// OwnerEtcdReference refers to the name and namespace of the corresponding
// Etcd owner for which the task has been invoked.
OwnerEtcdReference types.NamespacedName `json:"ownerEtcdReference"`
}
Status
The authors propose the following fields for the Status (current state) of the EtcdOperatorTask
custom resource to monitor the progress of the task.
// EtcdOperatorTaskStatus is the status for a EtcdOperatorTask resource.
type EtcdOperatorTaskStatus struct {
// ObservedGeneration is the most recent generation observed for the resource.
ObservedGeneration *int64 `json:"observedGeneration,omitempty"`
// State is the last known state of the task.
State TaskState `json:"state"`
// Time at which the task has moved from "pending" state to any other state.
InitiatedAt metav1.Time `json:"initiatedAt"`
// LastError represents the errors when processing the task.
// +optional
LastErrors []LastError `json:"lastErrors,omitempty"`
// Captures the last operation status if task involves many stages.
// +optional
LastOperation *LastOperation `json:"lastOperation,omitempty"`
}
type LastOperation struct {
// Name of the LastOperation.
Name opsName `json:"name"`
// Status of the last operation, one of pending, progress, completed, failed.
State OperationState `json:"state"`
// LastTransitionTime is the time at which the operation state last transitioned from one state to another.
LastTransitionTime metav1.Time `json:"lastTransitionTime"`
// A human readable message indicating details about the last operation.
Reason string `json:"reason"`
}
// LastError stores details of the most recent error encountered for the task.
type LastError struct {
// Code is an error code that uniquely identifies an error.
Code ErrorCode `json:"code"`
// Description is a human-readable message indicating details of the error.
Description string `json:"description"`
// ObservedAt is the time at which the error was observed.
ObservedAt metav1.Time `json:"observedAt"`
}
// TaskState represents the state of the task.
type TaskState string
const (
TaskStateFailed TaskState = "Failed"
TaskStatePending TaskState = "Pending"
TaskStateRejected TaskState = "Rejected"
TaskStateSucceeded TaskState = "Succeeded"
TaskStateInProgress TaskState = "InProgress"
)
// OperationState represents the state of last operation.
type OperationState string
const (
OperationStateFailed OperationState = "Failed"
OperationStatePending OperationState = "Pending"
OperationStateCompleted OperationState = "Completed"
OperationStateInProgress OperationState = "InProgress"
)
Custom Resource YAML API
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdOperatorTask
metadata:
name: <name of operator task resource>
namespace: <cluster namespace>
generation: <specific generation of the desired state>
spec:
type: <type/category of supported out-of-band task>
ttlSecondsAfterFinished: <time-to-live to garbage collect the custom resource after it has been completed>
config: <task specific configuration>
ownerEtcdReference: <refer to corresponding etcd owner name and namespace for which task has been invoked>
status:
observedGeneration: <specific observedGeneration of the resource>
state: <last known current state of the out-of-band task>
initiatedAt: <time at which the task moved from the "pending" state to any other state>
lastErrors:
- code: <error-code>
description: <description of the error>
observedAt: <time the error was observed>
lastOperation:
name: <operation-name>
state: <task state as seen at the completion of last operation>
lastTransitionTime: <time of transition to this state>
reason: <reason/message if any>
Lifecycle
Creation
Task(s) can be created by creating an instance of the EtcdOperatorTask
custom resource specific to a task.
Note: In future, either a kubectl
extension plugin or a druidctl
tool will be introduced. Dedicated sub-commands will be created for each out-of-band
task. This will drastically increase the usability for an operator for performing such tasks, as the CLI extension will automatically create relevant instance(s) of EtcdOperatorTask
with the provided configuration.
Execution
- Authors propose to introduce a new controller which watches for
EtcdOperatorTask
custom resource.
- Each
out-of-band
task may have some task specific configuration defined in .spec.config. - The controller needs to parse this task specific config, which comes as a string, according to the schema defined for each task.
- For every
out-of-band
task, a set of pre-conditions
can be defined. These pre-conditions are evaluated against the current state of the target etcd cluster. Based on the evaluation result (boolean), the task is permitted or denied execution. - If multiple tasks are invoked simultaneously or in
pending
state, then they will be executed in a First-In-First-Out (FIFO) manner.
Note: Dependent ordering among tasks will be addressed later which will enable concurrent execution of tasks when possible.
Deletion
Upon completion of the task, irrespective of its final state, Etcd-druid
will ensure the garbage collection of the task custom resource and any other Kubernetes resources created to execute the task. This will be done according to the .spec.ttlSecondsAfterFinished
if defined in the spec, or a default expiry time will be assumed.
Use Cases
Recovery from permanent quorum loss
Recovery from permanent quorum loss involves two phases - identification and recovery - both of which are done manually today. This proposal intends to automate the latter. Recovery today is a multi-step process and needs to be performed carefully by a human operator. Automating these steps would be prudent, to make it quicker and error-free. The identification of the permanent quorum loss would remain a manual process, requiring a human operator to investigate and confirm that there is indeed a permanent quorum loss with no possibility of auto-healing.
Task Config
We do not need any config for this task. When creating an instance of EtcdOperatorTask
for this scenario, .spec.config
will be set to nil (unset).
Pre-Conditions
- There should be a quorum loss in a multi-member etcd cluster. For a single-member etcd cluster, invoking this task is unnecessary as the restoration of the single member is automatically handled by the backup-restore process.
- There should not already be a permanent-quorum-loss-recovery-task running for the same etcd cluster.
Trigger on-demand snapshot compaction
Etcd-druid
provides a configurable etcd-events-threshold flag. When this threshold is breached, then a snapshot compaction is triggered for the etcd cluster. However, there are scenarios where an ad-hoc snapshot compaction may be required.
Possible scenarios
- If an operator anticipates a scenario of permanent quorum loss, they can trigger an
on-demand snapshot compaction
to create a compacted full-snapshot. This can potentially reduce the recovery time from a permanent quorum loss. - As an additional benefit, a human operator can leverage the current implementation of snapshot compaction, which internally triggers
restoration
. Hence, by initiating an on-demand snapshot compaction
task, the operator can verify the integrity of etcd cluster backups, particularly in cases of potential backup corruption or re-encryption. The success or failure of this snapshot compaction can offer valuable insights into these scenarios.
Task Config
We do not need any config for this task. When creating an instance of EtcdOperatorTask
for this scenario, .spec.config
will be set to nil (unset).
Pre-Conditions
- There should not be a
on-demand snapshot compaction
task already running for the same etcd cluster.
Note: on-demand snapshot compaction
runs as a separate job in a separate pod, which interacts with the backup bucket and not the etcd cluster itself, hence it doesn’t depend on the health of etcd cluster members.
Trigger on-demand full/delta snapshot
Etcd
custom resource provides an ability to set FullSnapshotSchedule which currently defaults to run once in 24 hrs. DeltaSnapshotPeriod is also made configurable which defines the duration after which a delta snapshot will be taken.
If a human operator does not wish to wait for the scheduled full/delta snapshot, they can trigger an on-demand (out-of-schedule) full/delta snapshot on the etcd cluster, which will be taken by the leading-backup-restore
.
Possible scenarios
- An on-demand full snapshot can be triggered if scheduled snapshot fails due to any reason.
- Gardener Shoot Hibernation: Every etcd cluster incurs an inherent cost of preserving the volumes even when a gardener shoot control plane is scaled down, i.e the shoot is in a hibernated state. However, it is possible to save on hyperscaler costs by invoking this task to take a full snapshot before scaling down the etcd cluster, and deleting the etcd data volumes afterwards.
- Gardener Control Plane Migration: In gardener, a cluster control plane can be moved from one seed cluster to another. This process currently requires the etcd data to be replicated on the target cluster, so a full snapshot of the etcd cluster in the source seed before the migration would allow for faster restoration of the etcd cluster in the target seed.
Task Config
// SnapshotType can be full or delta snapshot.
type SnapshotType string
const (
SnapshotTypeFull SnapshotType = "full"
SnapshotTypeDelta SnapshotType = "delta"
)
type OnDemandSnapshotTaskConfig struct {
// Type of on-demand snapshot.
Type SnapshotType `json:"type"`
}
spec:
config: |
type: <type of on-demand snapshot>
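For illustration, a complete EtcdOperatorTask for an on-demand full snapshot could look as follows (all names and values below are placeholders chosen for this example):
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdOperatorTask
metadata:
  name: on-demand-full-snapshot
  namespace: shoot--dev--test
spec:
  type: OnDemandSnapshotTask
  ttlSecondsAfterFinished: 3600
  ownerEtcdReference:
    name: etcd-main
    namespace: shoot--dev--test
  config: |
    type: full
A sketch of how the proposed controller could decode the task-specific config string (see the Execution section) into this typed struct, assuming sigs.k8s.io/yaml for decoding; this is illustrative and not the proposed implementation:
// decodeSnapshotTaskConfig parses spec.config into OnDemandSnapshotTaskConfig.
// Illustrative sketch only.
func decodeSnapshotTaskConfig(task *EtcdOperatorTask) (*OnDemandSnapshotTaskConfig, error) {
	cfg := &OnDemandSnapshotTaskConfig{}
	if err := yaml.Unmarshal([]byte(task.Spec.Config), cfg); err != nil {
		return nil, fmt.Errorf("invalid on-demand snapshot task config: %w", err)
	}
	return cfg, nil
}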
Pre-Conditions
- Etcd cluster should have a quorum.
- There should not already be a
on-demand snapshot
task running with the same SnapshotType
for the same etcd cluster.
Trigger on-demand maintenance of etcd cluster
Operator can trigger on-demand maintenance of etcd cluster which includes operations like etcd compaction, etcd defragmentation etc.
Possible Scenarios
- If an etcd cluster is heavily loaded, which is causing performance degradation of an etcd cluster, and the operator does not want to wait for the scheduled maintenance window then an
on-demand maintenance
task can be triggered which will invoke etcd-compaction, etcd-defragmentation etc. on the target etcd cluster. This will make the etcd cluster lean and clean, thus improving cluster performance.
Task Config
type OnDemandMaintenanceTaskConfig struct {
// MaintenanceType defines the maintenance operations need to be performed on etcd cluster.
MaintenanceType maintenanceOps `json:"maintenanceType"`
}
type maintenanceOps struct {
// EtcdCompaction if set to true will trigger an etcd compaction on the target etcd.
// +optional
EtcdCompaction bool `json:"etcdCompaction,omitempty"`
// EtcdDefragmentation if set to true will trigger a etcd defragmentation on the target etcd.
// +optional
EtcdDefragmentation bool `json:"etcdDefragmentation,omitempty"`
}
spec:
config: |
maintenanceType:
etcdCompaction: <true/false>
etcdDefragmentation: <true/false>
Pre-Conditions
- Etcd cluster should have a quorum.
- There should not already be a duplicate task running with the same maintenanceType.
Copy Backups Task
Copy the backups (full and delta snapshots) of an etcd cluster from one object store (source) to another object store (target).
Possible Scenarios
- In Gardener, the Control Plane Migration process utilizes the copy-backups task. This task is responsible for copying backups from one object store to another, typically located in different regions.
Task Config
// EtcdCopyBackupsTaskConfig defines the parameters for the copy backups task.
type EtcdCopyBackupsTaskConfig struct {
// SourceStore defines the specification of the source object store provider.
SourceStore StoreSpec `json:"sourceStore"`
// TargetStore defines the specification of the target object store provider for storing backups.
TargetStore StoreSpec `json:"targetStore"`
// MaxBackupAge is the maximum age in days that a backup must have in order to be copied.
// By default all backups will be copied.
// +optional
MaxBackupAge *uint32 `json:"maxBackupAge,omitempty"`
// MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
// +optional
MaxBackups *uint32 `json:"maxBackups,omitempty"`
}
spec:
config: |
sourceStore: <source object store specification>
targetStore: <target object store specification>
maxBackupAge: <maximum age in days that a backup must have in order to be copied>
maxBackups: <maximum no. of backups that will be copied>
Note: For detailed object store specification please refer here
Pre-Conditions
- There should not already be a copy-backups task running.
Note: copy-backups-task runs as a separate job, and it operates only on the backup bucket, hence it does not depend on the health of the etcd cluster members.
Note: copy-backups-task has already been implemented and is currently used in Control Plane Migration, but it will be harmonized with the EtcdOperatorTask custom resource.
Metrics
The authors propose to introduce the following metrics:
etcddruid_operator_task_duration_seconds: Histogram which captures the runtime for each etcd operator task.
Labels:
- Key: type, Value: all supported tasks
- Key: state, Value: One-Of {failed, succeeded, rejected}
- Key: etcd, Value: name of the target etcd resource
- Key: etcd_namespace, Value: namespace of the target etcd resource
etcddruid_operator_tasks_total: Counter which counts the number of etcd operator tasks.
Labels:
- Key: type, Value: all supported tasks
- Key: state, Value: One-Of {failed, succeeded, rejected}
- Key: etcd, Value: name of the target etcd resource
- Key: etcd_namespace, Value: namespace of the target etcd resource
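For illustration, the following is a minimal sketch of how such metrics could be defined with prometheus/client_golang. The metric and label names mirror the proposal, while the package name, variable names, and the registration point are assumptions and not part of the actual implementation.

```go
package tasks

import "github.com/prometheus/client_golang/prometheus"

var (
	// taskDuration captures the runtime of each etcd operator task (hypothetical variable name).
	taskDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "etcddruid_operator_task_duration_seconds",
			Help: "Runtime of each etcd operator task in seconds.",
		},
		[]string{"type", "state", "etcd", "etcd_namespace"},
	)
	// tasksTotal counts the number of etcd operator tasks (hypothetical variable name).
	tasksTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "etcddruid_operator_tasks_total",
			Help: "Total number of etcd operator tasks.",
		},
		[]string{"type", "state", "etcd", "etcd_namespace"},
	)
)

func init() {
	// etcd-druid would likely register these with the controller-runtime metrics registry;
	// registering with the default Prometheus registry is an assumption of this sketch.
	prometheus.MustRegister(taskDuration, tasksTotal)
}
```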
18 - Recovery From Permanent Quorum Loss In Etcd Cluster
Recovery from Permanent Quorum Loss in an Etcd Cluster
Quorum loss in Etcd Cluster
Quorum loss occurs when the majority of Etcd pods (greater than or equal to n/2 + 1) are down simultaneously for some reason.
There are two types of quorum loss that can happen to an Etcd multinode cluster:
Transient quorum loss - A quorum loss is called transient when the majority of Etcd pods are down simultaneously for some time. The pods may be down due to network unavailability, high resource usage, etc. When the pods come back after some time, they can re-join the cluster and quorum is recovered automatically without any manual intervention, provided there is no permanent failure of the majority of etcd pods due to hardware failure or disk corruption.
Permanent quorum loss - A quorum loss is called permanent when the majority of Etcd cluster members experience permanent failure, whether due to hardware failure or disk corruption, etc. In that case, the etcd cluster is not going to recover automatically from the quorum loss. A human operator will now need to intervene and execute the following steps to recover the multi-node Etcd cluster.
If permanent quorum loss occurs to a multinode Etcd cluster, the operator needs to note down the PVCs, configmaps, statefulsets, CRs, etc. related to that Etcd cluster and work on those resources only. The following steps guide a human operator to recover from permanent quorum loss of an etcd cluster. We assume the name of the Etcd CR for the Etcd cluster is etcd-main
.
Etcd cluster in shoot control plane of gardener deployment:
There are two Etcd clusters running in the shoot control plane. One is named etcd-events
and another is named etcd-main
. The operator needs to take care of permanent quorum loss to a specific cluster. If permanent quorum loss occurs to etcd-events
cluster, the operator needs to note down the PVCs, configmaps, statefulsets, CRs, etc. related to the etcd-events
cluster and work on those resources only.
⚠️ Note: Please note that manually restoring etcd can result in data loss. This guide is the last resort to bring an Etcd cluster up and running again.
If etcd-druid and etcd-backup-restore are being used with Gardener, then:
Target the control plane of the affected shoot cluster via kubectl. Alternatively, you can use gardenctl to target the control plane of the affected shoot cluster. You can get the details to target the control plane from the Access tile in the shoot cluster details page on the Gardener dashboard. Ensure that you are targeting the correct namespace.
Add the following annotations to the Etcd
resource etcd-main
:
kubectl annotate etcd etcd-main druid.gardener.cloud/suspend-etcd-spec-reconcile=
kubectl annotate etcd etcd-main druid.gardener.cloud/disable-resource-protection=
Note down the configmap name that is attached to the etcd-main statefulset. If you describe the statefulset with kubectl describe sts etcd-main, look for lines similar to the following to identify the attached configmap name. It will be needed at later stages:
Volumes:
etcd-config-file:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: etcd-bootstrap-4785b0
Optional: false
Alternatively, the related configmap name can also be obtained by executing the following command:
kubectl get sts etcd-main -o jsonpath='{.spec.template.spec.volumes[?(@.name=="etcd-config-file")].configMap.name}'
Scale down the etcd-main
statefulset replicas to 0
:
kubectl scale sts etcd-main --replicas=0
The PVCs will look like the following on listing them with the command kubectl get pvc
:
main-etcd-etcd-main-0 Bound pv-shoot--garden--aws-ha-dcb51848-49fa-4501-b2f2-f8d8f1fad111 80Gi RWO gardener.cloud-fast 13d
main-etcd-etcd-main-1 Bound pv-shoot--garden--aws-ha-b4751b28-c06e-41b7-b08c-6486e03090dd 80Gi RWO gardener.cloud-fast 13d
main-etcd-etcd-main-2 Bound pv-shoot--garden--aws-ha-ff17323b-d62e-4d5e-a742-9de823621490 80Gi RWO gardener.cloud-fast 13d
Delete all PVCs that are attached to the etcd-main cluster.
kubectl delete pvc -l instance=etcd-main
Check the etcd's member leases. There should be as many leases starting with etcd-main as there are etcd-main replicas.
One of those leases will have the holder identity <etcd-member-id>:Leader and the rest of the etcd member leases will have holder identities <etcd-member-id>:Member.
Please ignore the snapshot leases, i.e., those leases which have the suffix snap.
etcd-main member leases:
NAME HOLDER AGE
etcd-main-0 4c37667312a3912b:Member 1m
etcd-main-1 75a9b74cfd3077cc:Member 1m
etcd-main-2 c62ee6af755e890d:Leader 1m
Delete all etcd-main
member leases.
Edit the etcd-main
cluster’s configmap (ex: etcd-bootstrap-4785b0
) as follows:
Find the initial-cluster
field in the configmap. It should look similar to the following:
# Initial cluster
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380,etcd-main-1=https://etcd-main-1.etcd-main-peer.default.svc:2380,etcd-main-2=https://etcd-main-2.etcd-main-peer.default.svc:2380
Change the initial-cluster
field to have only one member (etcd-main-0
) in the string. It should now look like this:
# Initial cluster
initial-cluster: etcd-main-0=https://etcd-main-0.etcd-main-peer.default.svc:2380
Scale up the etcd-main
statefulset replicas to 1
:
kubectl scale sts etcd-main --replicas=1
Wait for the single-member etcd cluster to be completely ready.
kubectl get pods etcd-main-0
will give the following output when ready:
NAME READY STATUS RESTARTS AGE
etcd-main-0 2/2 Running 0 1m
Remove the following annotations from the Etcd
resource etcd-main
:
kubectl annotate etcd etcd-main druid.gardener.cloud/suspend-etcd-spec-reconcile-
kubectl annotate etcd etcd-main druid.gardener.cloud/disable-resource-protection-
Finally, add the following annotation to the Etcd
resource etcd-main
:
kubectl annotate etcd etcd-main gardener.cloud/operation='reconcile'
Verify that the etcd cluster is formed correctly.
All the etcd-main pods will have outputs similar to the following:
NAME READY STATUS RESTARTS AGE
etcd-main-0 2/2 Running 0 5m
etcd-main-1 2/2 Running 0 1m
etcd-main-2 2/2 Running 0 1m
Additionally, check if the Etcd CR is ready with kubectl get etcd etcd-main
:
NAME READY AGE
etcd-main true 13d
Additionally, check the leases for at least 30 seconds. There should be as many leases starting with etcd-main as there are etcd-main replicas. One of those leases will have the holder identity <etcd-member-id>:Leader and the rest will have holder identities <etcd-member-id>:Member. The AGE of those leases can also be inspected to verify that they were updated in conjunction with the restart of the Etcd cluster. Example:
NAME HOLDER AGE
etcd-main-0 4c37667312a3912b:Member 1m
etcd-main-1 75a9b74cfd3077cc:Member 1m
etcd-main-2 c62ee6af755e890d:Leader 1m
19 - Restoring Single Member In Multi Node Etcd Cluster
Restoration of a single member in multi-node etcd deployed by etcd-druid
Note:
- For a cluster with n members, this proposal addresses only the restoration of a single member within an etcd cluster, not the quorum loss scenario (when the majority of members within a cluster fail).
- This proposal does not target the recovery of a single member that got separated from the cluster due to a network partition.
Motivation
If a single etcd member within a multi-node etcd cluster goes down due to DB corruption, PVC corruption, or an invalid data-dir, then it needs to be brought back. Unlike in the single-node case, a minority member of a multi-node cluster can't be restored from the snapshots present in the storage container: the old snapshots contain the cluster's metadata, which leads to a member ID mismatch that prevents the new member from coming up, since the new member would otherwise get its metadata from a DB restored from old snapshots.
Solution
- If a backup-restore sidecar detects that its corresponding etcd is down due to data-dir corruption or an invalid data-dir, then backup-restore will first remove the failing etcd member from the cluster using the MemberRemove API call and clean the data-dir of the failed etcd member.
- This won't affect the etcd cluster, as quorum is still maintained.
- After successfully removing the failed etcd member from the cluster, the backup-restore sidecar will try to add a new etcd member to the cluster, to restore the cluster to its previous size.
- Backup-restore first adds the new member as a learner using the MemberAddAsLearner API call. Once the learner is added to the cluster, gets in sync with the leader, and becomes up-to-date, it promotes the learner (non-voting member) to a voting member using the MemberPromote API call.
- So, the failed member first needs to be removed from the cluster and then added as a new member.
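The following is a minimal sketch of this remove → add-as-learner → promote flow using the go.etcd.io/etcd/client/v3 API. It is not the actual etcd-backup-restore implementation; the package name, function name, endpoints, peer URL, and failed member ID are hypothetical inputs, and the wait for the learner to catch up is elided.

```go
package memberrestore

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// restoreSingleMember sketches the member-replacement flow described above.
func restoreSingleMember(ctx context.Context, endpoints []string, failedMemberID uint64, newPeerURL string) error {
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Remove the failed member; quorum is preserved because the majority is still healthy.
	if _, err := cli.MemberRemove(ctx, failedMemberID); err != nil {
		return err
	}

	// Add the replacement member as a learner (non-voting member) first.
	addResp, err := cli.MemberAddAsLearner(ctx, []string{newPeerURL})
	if err != nil {
		return err
	}

	// Once the learner has caught up with the leader (wait/retry elided here),
	// promote it to a voting member.
	_, err = cli.MemberPromote(ctx, addResp.Member.ID)
	return err
}
```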
Example
- If a 3-member etcd cluster has 1 downed member (due to an invalid data-dir), the cluster can still make forward progress because the quorum is 2.
- The downed etcd member gets restarted, and its corresponding backup-restore sidecar receives an initialization request.
- The backup-restore sidecar then checks for data corruption or an invalid data-dir.
- The backup-restore sidecar detects that the data-dir is invalid and that it is a multi-node etcd cluster.
- The backup-restore sidecar then removes the downed etcd member from the cluster.
- The number of members in the cluster becomes 2 and the quorum remains at 2, so the etcd cluster is not affected.
- The data-dir is cleaned and a member is added as a learner (non-voting member).
- As soon as the learner gets in sync with the leader, it is promoted to a voting member, increasing the number of members in the cluster back to 3.
20 - Supported K8s Versions
Supported Kubernetes Versions
We strongly recommend using etcd-druid
with the supported kubernetes versions, published in this document.
The following is a list of kubernetes versions supported by the respective etcd-druid
versions.
| Etcd-druid version | Kubernetes version |
|---|---|
| >=0.20 | >=1.21 |
| >=0.14 && <0.20 | All versions supported |
| <0.14 | < 1.25 |
21 - Testing
Testing Strategy and Developer Guideline
Intent of this document is to introduce you (the developer) to the following:
- Libraries that are used to write tests.
- Best practices to write tests that are correct, stable, fast and maintainable.
- How to run tests.
The guidelines are not meant to be absolute rules. Always apply common sense and adapt the guideline if it doesn’t make much sense for some cases. If in doubt, don’t hesitate to ask questions during a PR review (as an author, but also as a reviewer). Add new learnings as soon as we make them!
For any new contributions, tests are a strict requirement. The Boy Scouts Rule is followed: if you touch code for which either no tests exist or coverage is insufficient, then it is expected that you will add the relevant tests.
Common guidelines for writing tests
We use the Testing
package provided by the standard library in golang for writing all our tests. Refer to its official documentation to learn how to write tests using Testing
package. You can also refer to this example.
We use gomega as our matcher or assertion library. Refer to Gomega’s official documentation for details regarding its installation and application in tests.
For naming the individual test/helper functions, ensure that the name describes what the function tests/helps-with. Naming is important for code readability even when writing tests - example-testcase-naming.
Introduce helper functions for assertions to make test more readable where applicable - example-assertion-function.
Introduce custom matchers to make tests more readable where applicable - example-custom-matcher.
Do not use time.Sleep
and friends as it renders the tests flaky.
If a function returns a specific error then ensure that the test correctly asserts the expected error instead of just asserting that an error occurred. To help make this assertion consider using DruidError where possible. example-test-utility & usage.
Creating sample data for tests can be a high effort. Consider writing test utilities to generate sample data instead. example-test-object-builder.
If tests require any arbitrary sample data then ensure that you create a testdata
directory within the package and keep the sample data as files in it. From https://pkg.go.dev/cmd/go/internal/test
The go tool will ignore a directory named “testdata”, making it available to hold ancillary data needed by the tests.
Avoid defining shared variables/state across tests. This can lead to race conditions causing non-deterministic state. Additionally, it limits the capability to run tests concurrently via t.Parallel().
Do not assume or try and establish an order amongst different tests. This leads to brittle tests as the codebase evolves.
If you need to have logs produced by test runs (especially helpful in failing tests), then consider using t.Log or t.Logf.
Unit Tests
- If you need a kubernetes client.Client, prefer using a fake client instead of mocking the client. You can inject errors when building the client, which enables you to test error handling code paths.
  - Mocks decrease maintainability because they expect the tested component to follow a certain way to reach the desired goal (e.g., call specific functions with particular arguments).
- All unit tests should be run quickly. Do not use envtest and do not set up a Kind cluster in unit tests.
- If you have common setup for variations of a function, consider using table-driven tests (a minimal sketch is shown after this list). See this as an example.
- An individual test should only test one and only one thing. Do not try and test multiple variants in a single test. Either use table-driven tests or write individual tests for each variation.
- If a function/component has multiple steps, it's probably better to split/refactor it into multiple functions/components that can be unit tested individually.
- If there are a lot of edge cases, extract dedicated functions that cover them and use unit tests to test them.
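To illustrate the table-driven style mentioned above, here is a minimal, self-contained sketch; the clamp helper and the package name are hypothetical and not taken from etcd-druid.

```go
package utils_test

import "testing"

// TestClamp demonstrates the table-driven style: one slice of cases, one loop, one sub-test per case.
func TestClamp(t *testing.T) {
	// clamp is an inlined, hypothetical helper used only for this example.
	clamp := func(v, lo, hi int) int {
		if v < lo {
			return lo
		}
		if v > hi {
			return hi
		}
		return v
	}

	tests := []struct {
		name          string
		value, lo, hi int
		want          int
	}{
		{name: "below range", value: -1, lo: 0, hi: 10, want: 0},
		{name: "within range", value: 5, lo: 0, hi: 10, want: 5},
		{name: "above range", value: 42, lo: 0, hi: 10, want: 10},
	}

	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := clamp(tc.value, tc.lo, tc.hi); got != tc.want {
				t.Errorf("clamp(%d, %d, %d) = %d, want %d", tc.value, tc.lo, tc.hi, got, tc.want)
			}
		})
	}
}
```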
Running Unit Tests
NOTE: For unit tests we are currently transitioning away from ginkgo to using golang native tests. The make test-unit
target runs both ginkgo and golang native tests. Once the transition is complete this target will be simplified.
Run all unit tests
Run unit tests of specific packages:
# if you have not already installed gotestfmt tool then install it once.
# make test-unit target automatically installs this in ./hack/tools/bin. You can alternatively point the GOBIN to this directory and then directly invoke test-go.sh
> go install github.com/gotesttools/gotestfmt/v2/cmd/gotestfmt@v2.5.0
> ./hack/test-go.sh <package-1> <package-2>
De-flaking Unit Tests
If tests have sporadic failures, then try running ./hack/stress-test.sh, which internally uses the stress tool.
# install the stress tool
> go install golang.org/x/tools/cmd/stress@latest
# invoke the helper script to execute the stress test
> ./hack/stress-test.sh test-package=<test-package> test-func=<test-function> tool-params="<tool-params>"
An example invocation:
> ./hack/stress-test.sh test-package=./internal/utils test-func=TestRunConcurrentlyWithAllSuccessfulTasks tool-params="-p 10"
5s: 877 runs so far, 0 failures
10s: 1906 runs so far, 0 failures
15s: 2885 runs so far, 0 failures
...
The stress tool will output a path to a file containing the full failure message when a test run fails.
Integration Tests (envtests)
Integration tests in etcd-druid use envtest. It sets up a minimal temporary control plane (etcd + kube-apiserver) and runs the test against it. Test suites (group of tests) start their individual envtest
environment before running the tests for the respective controller/webhook. Before exiting, the temporary test environment is shutdown.
NOTE: For integration-tests we are currently transitioning away from ginkgo to using golang native tests. All ginkgo integration tests can be found here and golang native integration tests can be found here.
- Integration tests in etcd-druid only target a single controller. It is therefore advised that code (other than common utility functions) should not be shared between any two controllers.
- If you are sharing a common envtest environment across tests, then it is recommended that each individual test is run in a dedicated namespace.
- Since envtest is used to set up a minimal environment where no controller (e.g. KCM, Scheduler) other than etcd and kube-apiserver runs, status updates to resources controlled/reconciled by not-deployed controllers will not happen. Tests should refrain from asserting changes to status. In case status needs to be set as part of a test setup then it must be done explicitly.
- If you have common setup and teardown, then consider using TestMain - example.
- If you have to wait for resources to be provisioned or reach a specific state, then it is recommended that you create smaller assertion functions and use Gomega’s AsyncAssertion functions - example.
- Beware of the default Eventually / Consistently timeouts / poll intervals: docs.
- Don't forget to call {Eventually,Consistently}.Should(), otherwise the assertions always silently succeed without errors: onsi/gomega#561. A minimal sketch combining these recommendations is shown after this list.
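The following is a minimal sketch, assuming Gomega and controller-runtime's client: a small assertion helper driven by Eventually with an explicit timeout and polling interval, and a terminating Should call. The helper name and the resource it waits for are hypothetical.

```go
package controller_test

import (
	"context"
	"testing"
	"time"

	. "github.com/onsi/gomega"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForConfigMap waits until the given ConfigMap can be fetched from the test environment.
func waitForConfigMap(t *testing.T, ctx context.Context, cl client.Client, key client.ObjectKey) {
	g := NewWithT(t)
	g.Eventually(func() error {
		// The assertion function is small and returns an error so that Should(Succeed()) can be used.
		return cl.Get(ctx, key, &corev1.ConfigMap{})
	}).WithTimeout(30 * time.Second).WithPolling(2 * time.Second).Should(Succeed())
}
```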
Running Integration Tests
Debugging Integration Tests
There are two ways in which you can debug Integration Tests:
Using IDE
All commonly used IDEs provide built-in or easy integration with the delve debugger. For debugging integration tests, the only additional requirement is to set the KUBEBUILDER_ASSETS environment variable. You can get the value of this environment variable by executing the following command:
# ENVTEST_K8S_VERSION is the k8s version that you wish to use for testing.
> setup-envtest --os $(go env GOOS) --arch $(go env GOARCH) use $ENVTEST_K8S_VERSION -p path
NOTE: All integration tests usually have a timeout. If you wish to debug a failing integration-test then increase the timeouts.
Use standalone envtest
We also provide the capability to set up a stand-alone envtest and leverage the cluster to run individual integration tests. This gives you more control over when the k8s control plane is destroyed and allows you to inspect the resources at the end of the integration test run using kubectl.
NOTE: While you can use an existing cluster (e.g., kind
), some test suites expect that no controllers and no nodes are running in the test environment (as it is the case in envtest
test environments). Hence, using a full-blown cluster with controllers and nodes might sometimes be impractical, as you would need to stop cluster components for the tests to work.
To setup a standalone envtest
and run an integration test against it, do the following:
# In a terminal session use the following make target to setup a standalone envtest
> make start-envtest
# As part of the output, the path to the kubeconfig will also be printed on the console.
# In another terminal session setup resource(s) watch:
> kubectl get po -A -w # alternatively you can also use `watch -d <command>` utility.
# In another terminal session:
> export KUBECONFIG=<envtest-kubeconfig-path>
> export USE_EXISTING_K8S_CLUSTER=true
# run the test
> go test -run="<regex-for-test>" <package>
# example: go test -run="^TestEtcdDeletion/test deletion of all*" ./test/it/controller/etcd
Once you are done with the testing, you can press Ctrl+C in the terminal session where you started envtest. This will shut down the kubernetes control plane.
End-To-End (e2e) Tests
End-To-End tests are run using Kind cluster and Skaffold. These tests provide a high level of confidence that the code runs as expected by users when deployed to production.
Purpose of running these tests is to be able to catch bugs which result from interaction amongst different components within etcd-druid.
In CI pipelines e2e tests are run with S3 compatible LocalStack (in cases where backup functionality has been enabled for an etcd cluster).
In future we will only be using a file-system based local provider to reduce the run times for the e2e tests when run in a CI pipeline.
e2e tests can be triggered either with other cloud provider object-store emulators, or they can also be run against actual/remote cloud provider object-store services.
In contrast to integration tests, in e2e tests, it might make sense to specify higher timeouts for Gomega’s AsyncAssertion calls.
Running e2e tests locally
Detailed instructions on how to run e2e tests can be found here.
22 - Webhooks
Webhooks
The etcd-druid controller-manager registers certain admission webhooks that allow for validation or mutation of requests on resources in the cluster, in order to prevent misconfiguration and restrict access to the etcd cluster resources.
All webhooks that are a part of etcd-druid reside in package internal/webhook
, as sub-packages.
Package Structure
The typical package structure for the webhooks that are part of etcd-druid is shown with the EtcdComponents Webhook:
internal/webhook/etcdcomponents
├── config.go
├── handler.go
└── register.go
- config.go: contains all the logic for the configuration of the webhook, including feature gate activations, CLI flag parsing and validations.
- register.go: contains the logic for registering the webhook with the etcd-druid controller manager.
- handler.go: contains the webhook admission handler logic.
Each webhook package may also contain auxiliary files which are relevant to that specific webhook.
Etcd Components Webhook
Druid controller-manager registers and runs the etcd controller, which creates and manages various components/resources such as Leases
, ConfigMap
s, and the Statefulset
for the etcd cluster. It is essential for all these resources to contain correct configuration for the proper functioning of the etcd cluster.
Unintended changes to any of these managed resources can lead to misconfiguration of the etcd cluster, leading to unwanted downtime for etcd traffic. To prevent such unintended changes, a validating webhook called EtcdComponents Webhook guards these managed resources, ensuring that only authorized entities can perform operations on these managed resources.
EtcdComponents webhook prevents UPDATE and DELETE operations on all resources managed by etcd controller, unless such an operation is performed by druid itself, and during reconciliation of the Etcd
resource. Operations are also allowed if performed by one of the authorized entities specified by CLI flag --etcd-components-webhook-exempt-service-accounts
, but only if the Etcd
resource is not being reconciled by etcd-druid at that time.
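As a rough illustration of these rules, the following is a simplified sketch of what such an admission handler could look like with controller-runtime. It is not the actual handler in internal/webhook/etcdcomponents, which additionally inspects the parent Etcd resource, its reconciliation state, and Lease updates by etcd members; the field names here are assumptions.

```go
package etcdcomponents

import (
	"context"
	"slices"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// Handler is a simplified sketch of the admission logic described above.
type Handler struct {
	druidServiceAccount   string   // service account used by etcd-druid itself (hypothetical field)
	exemptServiceAccounts []string // from --etcd-components-webhook-exempt-service-accounts (hypothetical field)
}

func (h *Handler) Handle(ctx context.Context, req admission.Request) admission.Response {
	requester := req.UserInfo.Username

	// Changes made by druid itself (during reconciliation of the Etcd resource) are allowed.
	if requester == h.druidServiceAccount {
		return admission.Allowed("change is made by etcd-druid")
	}

	// Exempt service accounts are allowed when the Etcd resource is not being reconciled
	// (the reconciliation-in-progress check is elided in this sketch).
	if slices.Contains(h.exemptServiceAccounts, requester) {
		return admission.Allowed("requester is an exempt service account")
	}

	return admission.Denied("changes to druid-managed resources are not allowed")
}
```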
There may be specific cases where a human operator may need to make changes to the managed resources, possibly to test or fix an etcd cluster. An example of this is recovery from permanent quorum loss, where a human operator will need to suspend reconciliation of the Etcd
resource, make changes to the underlying managed resources such as StatefulSet
and ConfigMap
, and then resume reconciliation for the Etcd
resource. Such manual interventions will require out-of-band changes to the managed resources. Protection of managed resources for such Etcd
resources can be turned off by adding an annotation druid.gardener.cloud/disable-etcd-component-protection
on the Etcd
resource. This will effectively disable EtcdComponents Webhook protection for all managed resources for the specific Etcd
.
Note: UPDATE operations for Lease
s by etcd members are always allowed, since these are regularly updated by the etcd-backup-restore sidecar.
The Etcd Components Webhook is disabled by default, and can be enabled via the CLI flag --enable-etcd-components-webhook.