DEP-05: Operator Out-of-band Tasks
Summary
This DEP proposes an enhancement to etcd-druid's capabilities to handle out-of-band tasks, which are presently performed manually or invoked programmatically via suboptimal APIs. The document proposes the establishment of a unified interface by defining a well-structured API to harmonize the initiation of any out-of-band task, monitor its status, and simplify the process of adding new tasks and managing their lifecycles.
Terminology
etcd-druid: etcd-druid is an operator to manage the etcd clusters.
backup-sidecar: It is the etcd-backup-restore sidecar container running in each etcd-member pod of etcd cluster.
leading-backup-sidecar: A backup-sidecar that is associated to an etcd leader of an etcd cluster.
out-of-band task: Any on-demand tasks/operations that can be executed on an etcd cluster without modifying the Etcd custom resource spec (desired state).
Motivation
Today, etcd-druid mainly acts as an etcd cluster provisioner (creation, maintenance and deletion). In future, capabilities of etcd-druid will be enhanced via etcd-member proposal by providing it access to much more detailed information about each etcd cluster member. While we enhance the reconciliation and monitoring capabilities of etcd-druid, it still lacks the ability to allow users to invoke out-of-band tasks on an existing etcd cluster.
There are new learnings while operating etcd clusters at scale. It has been observed that we regularly need capabilities to trigger out-of-band tasks which are outside of the purview of a regular etcd reconciliation run. Many of these tasks are multi-step processes, and performing them manually is error-prone, even if an operator follows a well-written step-by-step guide. Thus, there is a need to automate these tasks. Some examples of an on-demand/out-of-band tasks:
- Recover from a permanent quorum loss of etcd cluster.
- Trigger an on-demand full/delta snapshot.
- Trigger an on-demand snapshot compaction.
- Trigger an on-demand maintenance of etcd cluster.
- Copy the backups from one object store to another object store.
Goals
- Establish a unified interface for operator tasks by defining a single dedicated custom resource for
out-of-bandtasks. - Define a contract (in terms of prerequisites) which needs to be adhered to by any task implementation.
- Facilitate the easy addition of new
out-of-bandtask(s) through this custom resource. - Provide CLI capabilities to operators, making it easy to invoke supported
out-of-bandtasks.
Non-Goals
- In the current scope, capability to abort/suspend an
out-of-bandtask is not going to be provided. This could be considered as an enhancement based on pull. - Ordering (by establishing dependency) of
out-of-bandtasks submitted for the same etcd cluster has not been considered in the first increment. In a future version based on how operator tasks are used, we will enhance this proposal and the implementation.
Proposal
Authors propose creation of a new single dedicated custom resource to represent an out-of-band task. Etcd-druid will be enhanced to process the task requests and update its status which can then be tracked/observed.
Custom Resource Golang API
EtcdOpsTask is the new custom resource that will be introduced. This API will be in v1alpha1 version and will be subject to change. We will be respecting Kubernetes Deprecation Policy.
// EtcdOpsTask represents an out-of-band operator task resource.
type EtcdOpsTask struct {
metav1.TypeMeta
metav1.ObjectMeta
// Spec is the specification of the EtcdOpsTask resource.
Spec EtcdOpsTaskSpec `json:"spec"`
// Status is most recently observed status of the EtcdOpsTask resource.
Status EtcdOpsTaskStatus `json:"status,omitempty"`
}Spec
The authors propose that the following fields should be specified in the spec (desired state) of the EtcdOpsTask custom resource.
- The Name of the Etcd Resource will be taken as an optional field because of the below assumptions:
- The task might be independent of the etcd resource (Eg: EtcdCopyBackups)
- The etcdopstask will be created in the same namespace as the etcd it is intended to perform its operation upon.
- To capture the configuration specific to each task, a
.spec.configfield should be defined of typeEtcdOpsTaskConfigas a structured object where each task can have different input configuration. Exactly one of its members must be set according to the operation to be performed, which implicitly determines the type ofout-of-bandoperator task to be executed (e.g., settingonDemandSnapshotindicates an "OnDemandSnapshotTask").
// EtcdOpsTaskSpec is the spec for a EtcdOpsTask resource.
type EtcdOpsTaskSpec struct {
// Config specifies the configuration for the operation to be performed.
// Exactly one of the members of EtcdOpsTaskConfig must be set.
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="config is immutable"
Config EtcdOpsTaskConfig `json:"config"`
// TTLSecondsAfterFinished is the duration in seconds after which a finished task (status.state == Succeeded|Failed|Rejected) will be garbage-collected.
// +optional
TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
// EtcdName refers to the name of the Etcd resource that this task will operate on.
// +optional
EtcdName *string `json:"etcdName"`
}
// EtcdOpsTaskConfig holds the configuration for the specific operation.
// Exactly one of its members must be set according to the operation to be performed.
type EtcdOpsTaskConfig struct {
// OnDemandSnapshot defines the configuration for an on-demand snapshot task.
// +optional
OnDemandSnapshot *OnDemandSnapshotConfig `json:"onDemandSnapshot,omitempty"`
// OndemandMaintenance defines the configuration for an on-demand etcd maintenance task such as etcd-compaction or etcd-defragmentation.
OndemandMaintenance *OndemandMaintenanceConfig
}
// Similarly, the configs for the other tasks are to be added.Status
The authors propose the following fields for the Status (current state) of the EtcdOpsTask custom resource to monitor the progress of the task.
// EtcdOpsTaskStatus is the status for a EtcdOpsTask resource.
type EtcdOpsTaskStatus struct {
// State is the last known state of the task.
State TaskState `json:"state"`
// LastTransitionTime is the last time the state transitioned from one value to another.
LastTransitionTime *metav1.Time `json:"lastTransitionTime,omitempty"`
// Time at which the task has moved from "pending" state to any other state.
StartedAt *metav1.Time `json:"startedAt"`
// LastError represents the errors when processing the task.
// +optional
LastErrors []LastError `json:"lastErrors,omitempty"`
// Captures the last operation status if task involves many stages.
// +optional
LastOperation *LastOperation `json:"lastOperation,omitempty"`
}
type LastOperation struct {
// State is the state of the last operation.
State LastOperationState `json:"state"`
// Type is the type of the last operation
Type LastOperationType `json:"type"`
// LastUpdateTime is the time at which the operation state last transitioned from one state to another.
LastUpdateTime *metav1.Time `json:"lastTransitionTime"`
// RunID correlates an operation with a reconciliation run.
RunID string `json:"runID"`
// Description describes the last operation.
Description string `json:"description"`
}
// LastError stores details of the most recent error encountered for the task.
type LastError struct {
// Code is an error code that uniquely identifies an error.
Code ErrorCode `json:"code"`
// Description is a human-readable message indicating details of the error.
Description string `json:"description"`
// ObservedAt is the time at which the error was observed.
ObservedAt metav1.Time `json:"observedAt"`
}
// TaskState represents the state of the task.
type TaskState string
const (
TaskStateFailed TaskState = "Failed"
TaskStatePending TaskState = "Pending"
TaskStateRejected TaskState = "Rejected"
TaskStateSucceeded TaskState = "Succeeded"
TaskStateInProgress TaskState = "InProgress"
)
// LastOperationState represents the state of the last operation.
const (
// LastOperationStateInProgress indicates that the operation is currently in progress.
LastOperationStateInProgress druidapicommon.LastOperationState = "InProgress"
// LastOperationStateCompleted indicates that the operation has completed.
LastOperationStateCompleted druidapicommon.LastOperationState = "Completed"
// LastOperationStateFailed indicates that the operation has failed.
LastOperationStateFailed druidapicommon.LastOperationState = "Failed"
)
// LastOperationType represents the type of the last operation.
const (
// LastOperationTypeAdmit indicates that the task is in the admission phase.
LastOperationTypeAdmit druidapicommon.LastOperationType = "Admit"
// LastOperationTypeExecution indicates that the task has successfully passed admission criteria and is currently being executed.
LastOperationTypeExecution druidapicommon.LastOperationType = "Execution"
// LastOperationTypeCleanup indicates that the cleanup of the task resources has begun.
// This will happen when the task has completed and its TTL has expired.
LastOperationTypeCleanup druidapicommon.LastOperationType = "Cleanup"
)Custom Resource YAML API
apiVersion: druid.gardener.cloud/v1alpha1
kind: EtcdOpsTask
metadata:
name: <name of ops task resource>
namespace: <cluster namespace>
generation: <specific generation of the desired state>
spec:
type: <type/category of supported out-of-band task>
ttlSecondsAfterFinished: <time-to-live to garbage collect the custom resource after it has been completed>
config: <task specific configuration>
EtcdName: <refer to corresponding etcd name for which task has been invoked>
status:
state: <last known current state of the out-of-band task>
lastTransitionTime: <Time at which the task status changed from one state to another>
startedAt: <time at which task move to any other state from "pending" state>
lastErrors:
- code: <error-code>
description: <description of the error>
observedAt: <time the error was observed>
lastOperation:
type: <The current type of the task i.e whether it is Admit or Execution or Cleanup>
state: <task state as seen at the completion of last operation ie either InProgress or Completed or Failed>
lastUpdateTime: <time of transition to this state>
description: <reason/message if any>Lifecycle
Creation
Task(s) can be created by creating an instance of the EtcdOpsTask custom resource specific to a task.
Note: In future, either a
kubectlextension plugin or adruidctltool will be introduced. Dedicated sub-commands will be created for eachout-of-bandtask. This will drastically increase the usability for an operator for performing such tasks, as the CLI extension will automatically create relevant instance(s) ofEtcdOpsTaskwith the provided configuration.
Execution
- Authors propose to introduce a new controller which watches for
EtcdOpsTaskcustom resource. - Each
out-of-bandtask may have some task specific configuration defined in .spec.config.<taskConfig>. - The controller needs to process this task specific config, which comes as a structured object of type
EtcdOpsTaskConfig, according to the schema defined for each task. - For every
out-of-bandtask, a set ofpre-conditionscan be defined. These pre-conditions are evaluated against the current state of the target etcd cluster. Based on the evaluation result (boolean), the task is permitted or denied execution. - If multiple tasks are invoked simultaneously or in
pendingstate, then they will be executed in a First-In-First-Out (FIFO) manner.
Note: Dependent ordering among tasks will be addressed later which will enable concurrent execution of tasks when possible.
Deletion
Upon completion of the task, irrespective of its final state, Etcd-druid will ensure the garbage collection of the task custom resource and any other Kubernetes resources created to execute the task. This will be done according to the .spec.ttlSecondsAfterFinished if defined in the spec, or a default expiry time will be assumed.
Use Cases
Recovery from permanent quorum loss
Recovery from permanent quorum loss involves two phases - identification and recovery - both of which are done manually today. This proposal intends to automate the latter. Recovery today is a multi-step process and needs to be performed carefully by a human operator. Automating these steps would be prudent, to make it quicker and error-free. The identification of the permanent quorum loss would remain a manual process, requiring a human operator to investigate and confirm that there is indeed a permanent quorum loss with no possibility of auto-healing.
Task Config
We do not need any config for this task. When creating an instance of EtcdOpsTask for this scenario, .spec.config will be set to nil (unset).
Pre-Conditions
- There should be a quorum loss in a multi-member etcd cluster. For a single-member etcd cluster, invoking this task is unnecessary as the restoration of the single member is automatically handled by the backup-restore process.
- There should not already be a permanent-quorum-loss-recovery-task running for the same etcd cluster.
Trigger on-demand snapshot compaction
Etcd-druid provides a configurable etcd-events-threshold flag. When this threshold is breached, then a snapshot compaction is triggered for the etcd cluster. However, there are scenarios where an ad-hoc snapshot compaction may be required.
Possible Scenarios
- If an operator anticipates a scenario of permanent quorum loss, they can trigger an
on-demand snapshot compactionto create a compacted full-snapshot. This can potentially reduce the recovery time from a permanent quorum loss. - As an additional benefit, a human operator can leverage the current implementation of snapshot compaction, which internally triggers
restoration. Hence, by initiating anon-demand snapshot compactiontask, the operator can verify the integrity of etcd cluster backups, particularly in cases of potential backup corruption or re-encryption. The success or failure of this snapshot compaction can offer valuable insights into these scenarios.
Task Config
We do not need any config for this task. When creating an instance of EtcdOpsTask for this scenario, .spec.config will be set to nil (unset).
Pre-Conditions
- There should not be a
on-demand snapshot compactiontask already running for the same etcd cluster.
Note:
on-demand snapshot compactionruns as a separate job in a separate pod, which interacts with the backup bucket and not the etcd cluster itself, hence it doesn't depend on the health of etcd cluster members.
Trigger on-demand full/delta snapshot
Etcd custom resource provides an ability to set FullSnapshotSchedule which currently defaults to run once in 24 hrs. DeltaSnapshotPeriod is also made configurable which defines the duration after which a delta snapshot will be taken. If a human operator does not wish to wait for the scheduled full/delta snapshot, they can trigger an on-demand (out-of-schedule) full/delta snapshot on the etcd cluster, which will be taken by the leading-backup-restore.
Possible Scenarios
- An on-demand full snapshot can be triggered if scheduled snapshot fails due to any reason.
- Gardener Shoot Hibernation: Every etcd cluster incurs an inherent cost of preserving the volumes even when a gardener shoot control plane is scaled down, i.e the shoot is in a hibernated state. However, it is possible to save on hyperscaler costs by invoking this task to take a full snapshot before scaling down the etcd cluster, and deleting the etcd data volumes afterwards.
- Gardener Control Plane Migration: In gardener, a cluster control plane can be moved from one seed cluster to another. This process currently requires the etcd data to be replicated on the target cluster, so a full snapshot of the etcd cluster in the source seed before the migration would allow for faster restoration of the etcd cluster in the target seed.
Task Config
// SnapshotType can be full or delta snapshot.
type OndemandSnapshotType string
const (
SnapshotTypeFull OndemandSnapshotType = "full"
SnapshotTypeDelta OndemandSnapshotType = "delta"
)
// Example showing sample implementation of ondemandSnapshot config
type OnDemandSnapshotConfig struct {
// Type specifies whether the snapshot is a 'full' or 'delta' snapshot.
Type OnDemandSnapshotType `json:"type"`
// IsFinal indicates whether this is the final snapshot for the etcd cluster.
// +optional
IsFinal *bool `json:"isFinal,omitempty"`
// TimeoutSeconds is the timeout for the delta snapshot operation.
// Defaults to 60 seconds.
// +optional
// +kubebuilder:default=60
TimeoutSecondsDelta *int32 `json:"timeoutSecondsDelta,omitempty"`
// TimeoutSecondsFull is the timeout for full snapshot operations.
// Defaults to 480 seconds (8 minutes).
// +optional
// +kubebuilder:default=480
TimeoutSecondsFull *int32 `json:"timeoutSecondsFull,omitempty"`
}spec:
config: |
type: <type of on-demand snapshot>
isFinal: <to indicate if the snapshot to be taken is to be marked as final>
timeoutSecondsDelta: <the timeout interval for delta snapshots>
timeoutSecondsFull: <the timeout interval for full snapshots>Pre-Conditions
- Etcd cluster should have a quorum.
- Etcd cluster should be in ready state.
- Etcd should have backups enabled.
- There should not already be a
on-demand snapshottask running with the sameSnapshotTypefor the same etcd cluster.
Trigger on-demand maintenance of etcd cluster
Operator can trigger on-demand maintenance of etcd cluster which includes operations like etcd compaction, etcd defragmentation etc.
Possible Scenarios
- If an etcd cluster is heavily loaded, which is causing performance degradation of an etcd cluster, and the operator does not want to wait for the scheduled maintenance window then an
on-demand maintenancetask can be triggered which will invoke etcd-compaction, etcd-defragmentation etc. on the target etcd cluster. This will make the etcd cluster lean and clean, thus improving cluster performance.
Task Config
type OnDemandMaintenanceTaskConfig struct {
// MaintenanceType defines the maintenance operations need to be performed on etcd cluster.
MaintenanceType maintenanceOps `json:"maintenanceType`
}
type maintenanceOps struct {
// EtcdCompaction if set to true will trigger an etcd compaction on the target etcd.
// +optional
EtcdCompaction bool `json:"etcdCompaction,omitempty"`
// EtcdDefragmentation if set to true will trigger a etcd defragmentation on the target etcd.
// +optional
EtcdDefragmentation bool `json:"etcdDefragmentation,omitempty"`
}spec:
config: |
maintenanceType:
etcdCompaction: <true/false>
etcdDefragmentation: <true/false>Pre-Conditions
- Etcd cluster should have a quorum.
- There should not already be a duplicate task running with same
maintenanceType.
Copy Backups Task
Copy the backups(full and delta snapshots) of etcd cluster from one object store(source) to another object store(target).
Possible Scenarios
- In Gardener, the Control Plane Migration process utilizes the copy-backups task. This task is responsible for copying backups from one object store to another, typically located in different regions.
Task Config
// EtcdCopyBackupsTaskConfig defines the parameters for the copy backups task.
type EtcdCopyBackupsTaskConfig struct {
// SourceStore defines the specification of the source object store provider.
SourceStore StoreSpec `json:"sourceStore"`
// TargetStore defines the specification of the target object store provider for storing backups.
TargetStore StoreSpec `json:"targetStore"`
// MaxBackupAge is the maximum age in days that a backup must have in order to be copied.
// By default all backups will be copied.
// +optional
MaxBackupAge *uint32 `json:"maxBackupAge,omitempty"`
// MaxBackups is the maximum number of backups that will be copied starting with the most recent ones.
// +optional
MaxBackups *uint32 `json:"maxBackups,omitempty"`
}spec:
config: |
sourceStore: <source object store specification>
targetStore: <target object store specification>
maxBackupAge: <maximum age in days that a backup must have in order to be copied>
maxBackups: <maximum no. of backups that will be copied>Note: For detailed object store specification please refer here
Pre-Conditions
- There should not already be a
copy-backupstask running.
Note:
copy-backups-taskruns as a separate job, and it operates only on the backup bucket, hence it doesn't depend on health of etcd cluster members.
Note:
copy-backups-taskhas already been implemented and it's currently being used in Control Plane Migration butcopy-backups-taskwill be harmonized withEtcdOpsTaskcustom resource.
Metrics
Authors proposed to introduce the following metrics:
etcddruid_operator_task_duration_seconds: Histogram which captures the runtime for each etcd operator task. Labels:- Key:
type, Value: all supported tasks - Key:
state, Value: One-Of - Key:
etcd, Value: name of the target etcd resource - Key:
etcd_namespace, Value: namespace of the target etcd resource
- Key:
etcddruid_operator_tasks_total: Counter which counts the number of etcd operator tasks. Labels:- Key:
type, Value: all supported tasks - Key:
state, Value: One-Of - Key:
etcd, Value: name of the target etcd resource - Key:
etcd_namespace, Value: namespace of the target etcd resource
- Key: