DEP-06: Immutable etcd Cluster Backups
Summary
This proposal introduces immutable backups for etcd clusters managed by etcd-druid. By leveraging cloud provider immutability features, backups taken by etcd-backup-restore can neither be modified nor deleted once created for a configurable retention duration (immutability period). This approach strengthens the reliability and fault tolerance of the etcd restoration process.
Terminology
- etcd-druid: An etcd operator that configures, provisions, reconciles, and monitors etcd clusters.
- etcd-backup-restore: A sidecar container that manages backups and restores of etcd cluster state. For more information, see the etcd-backup-restore documentation.
- WORM (Write Once, Read Many): A storage model in which data, once written, cannot be modified or deleted until certain conditions are met.
- Immutability: The property of an object that prevents it from being modified or deleted after creation.
- Immutability Period: The duration for which data must remain immutable before it can be modified or deleted.
- Bucket-Level Immutability: A policy that applies a uniform immutability period to all objects within a bucket.
- Object-Level Immutability: A policy that allows setting immutability periods individually for objects within a bucket, offering more granular control.
- Garbage Collection: The process of deleting old snapshot data that is no longer needed, in order to free up storage space. For more information, see the garbage collection documentation.
- Hibernation: A state in which an etcd cluster is scaled down to zero replicas, effectively pausing its operations. This is typically done to save costs when the cluster is not needed for an extended period. During hibernation, the cluster’s data remains intact, and it can be resumed to its previous state when required.
Motivation
etcd-druid provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the spec.backup section in an Etcd custom resource. Periodic backups are taken via the etcd-backup-restore sidecar container that runs in each etcd member pod.
Periodic backups of an etcd cluster state ensure the ability to recover from a data loss or a quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing WORM protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.
Goals
- Protect backup data against modifications and deletions post-creation through immutability policies offered by storage providers.
Non-Goals
- Implementing a mechanism to signal hibernation intent for handling snapshot immutability for hibernated etcd clusters, such as adding functionality via etcd.spec or annotations on the Etcd CR, to indicate when an etcd cluster should enter or exit hibernation, as discussed in gardener/etcd-druid#922.
- Supporting immutable backups on storage providers that do not offer immutability features (e.g., OpenStack Swift).
Proposal
This proposal aims to improve backup storage integrity and security by using immutability features available on major cloud providers.
Supported Cloud Providers
- Google Cloud Storage (GCS): Bucket Lock
- Amazon S3 (S3): Object Lock
- Azure Blob Storage (ABS): Immutable Blob Storage
Note
Currently, OpenStack Object Storage (Swift) does not support immutability for objects: https://blueprints.launchpad.net/swift/+spec/immutability-middleware.
Types of Immutability
- Object-Level Immutability: Allows setting immutability periods independently for each object within a bucket.
- Bucket-Level Immutability: Applies a uniform immutability policy to all objects in a bucket.
Comparison of Bucket-Level and Object-Level Immutability
Feature | GCS | S3 | ABS |
---|---|---|---|
Can bucket-level immutability period be increased? | Yes | Yes* | Yes (only 5 times) |
Can bucket-level immutability period be decreased? | No | Yes* | No |
Is bucket-level immutability a prerequisite for object-level immutability? | No | Yes | Yes (for existing buckets) |
Can object-level immutability period be increased? | Yes | Yes | Yes |
Can object-level immutability period be decreased? | No | No | No |
Support for enabling object-level immutability in existing buckets | No | Yes | Yes |
Support for enabling object-level immutability in new buckets | Yes | Yes | Yes |
Support for enabling bucket-level immutability in existing buckets | Yes | Yes | Yes |
Support for enabling bucket-level immutability in new buckets | Yes | Yes | Yes |
Precedence between bucket-level and object-level immutability periods | Max(bucket, object) | Object-level | Max(bucket, object) |
Note
In AWS S3, it is possible to increase and decrease the bucket-level immutability period; however, this action can be blocked by configuring specific bucket policy settings.
For GCS, object-level immutability is not yet supported for existing buckets; see this issue.
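The precedence row of the table can be read as a small decision rule. The following is a minimal sketch of that rule only; the Provider type and the effectiveRetention function are hypothetical names introduced for illustration and are not part of etcd-druid or any provider SDK.

```go
package main

import (
	"fmt"
	"time"
)

// Provider is an illustrative label for a storage provider (not an etcd-druid type).
type Provider string

const (
	GCS Provider = "GCS"
	S3  Provider = "S3"
	ABS Provider = "ABS"
)

// effectiveRetention returns the period that protects an object when both a
// bucket-level and an object-level period are configured. A zero objectLevel
// means no object-level period is set, so the bucket-level period applies.
func effectiveRetention(p Provider, bucketLevel, objectLevel time.Duration) time.Duration {
	if objectLevel == 0 {
		return bucketLevel
	}
	if p == S3 {
		// S3: the object-level setting takes precedence.
		return objectLevel
	}
	// GCS and ABS: the longer of the two periods applies.
	if objectLevel > bucketLevel {
		return objectLevel
	}
	return bucketLevel
}

func main() {
	bucket, object := 96*time.Hour, 48*time.Hour
	for _, p := range []Provider{GCS, S3, ABS} {
		fmt.Printf("%s: %s\n", p, effectiveRetention(p, bucket, object))
	}
}
```

With a 96-hour bucket period and a 48-hour object period, the sketch reports 96 hours for GCS and ABS and 48 hours for S3, which is why a shorter object-level period on S3 deserves attention.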
Recommended Approach
At the time of writing this proposal, these are the current limitations seen across providers:
- S3 and ABS typically require bucket-level immutability as a prerequisite for object-level immutability.
- GCS does not currently support object-level immutability in existing buckets.
- ABS requires a migration process to enable version-level immutability on existing containers.
Consequently, the authors recommend bucket-level immutability. This approach simplifies configuration and ensures a uniform immutability policy for all backups in a bucket across all supported providers.
Configuring Immutable Backups
Creating and configuring immutable buckets on providers is not handled by etcd-druid and must be done by the consumers. For a large-scale consumer like Gardener, provider extensions are leveraged to automate both the creation and configuration of buckets. For more details, see BackupBucket and refer to this issue.
Prerequisites
Configure or Update the Immutable Bucket
- Use your cloud provider’s CLI, SDK, or console to create (or update) a bucket/container with a WORM (write-once-read-many) immutability policy.
- Refer to Configure Bucket-Level Immutability for step-by-step instructions on configuring or updating the immutable bucket across different cloud providers.
Provide Valid Credentials in a Kubernetes Secret
- The store section of the Etcd CR must reference a Secret containing valid credentials.
  apiVersion: druid.gardener.cloud/v1alpha1
  kind: Etcd
  metadata:
    name: example-etcd
  spec:
    backup:
      store:
        prefix: etcd-backups
        container: my-immutable-backups # Bucket name
        provider: aws # Supported: aws, gcp, azure
        secretRef:
          name: etcd-backup-credentials # Reference to the Secret
        immutability:
          retentionType: bucket # Enables bucket-level immutability
- Confirm that this secret has the proper permissions to upload and retrieve snapshots from the immutable bucket.
- See the Getting Started guide for an example.
Note
etcd-druid does not handle the rotation of cloud provider credentials. Credential rotation must be managed by the operator.
By following these steps, you will have set up an immutable bucket for storing etcd backups, along with the necessary references in your Etcd specification and Kubernetes secret.
Handling of Hibernated Clusters
When an etcd cluster remains hibernated beyond the bucket’s immutability period, backups might become mutable again, depending on the cloud provider (see Comparison of Storage Provider Properties). This could compromise the intended guarantees of immutability, exposing backups to accidental or malicious alterations.
As mentioned in gardener/etcd-druid#922, a clear hibernation signal is needed. Since etcd-druid does not currently support hibernation natively and addressing that is out of scope for this proposal, we focus solely on maintaining immutability.
Proposal
To mitigate the risk of backups becoming mutable during extended hibernation under bucket-level immutability, the authors propose the following approach:
Prerequisite: Cut-off Traffic and Take a Final Full Snapshot Before Hibernation
- Before scaling the etcd cluster down to zero replicas, the etcd controller removes etcd’s client and peer ports (2379/2380) from the etcd Service to block application traffic.
- The etcd controller then triggers an on-demand full snapshot. This ensures that the latest state of the etcd cluster is captured and securely stored before hibernation begins.
Periodically Re-Upload the Snapshot
- Re-uploading the latest full snapshot resets its immutability period in the bucket, ensuring backups remain protected during hibernation.
- By default, the re-upload schedule follows etcd.spec.backup.fullSnapshotSchedule. Currently, this interval cannot be customized exclusively for re-uploads; future enhancements may introduce a dedicated configuration parameter.
- A new operator task type, ExtendFullSnapshotImmutabilityTask, periodically calls a new CLI command, extend-snapshot-immutability, to re-upload the snapshot and extend its immutability. ExtendFullSnapshotImmutabilityTask also manages garbage collection, ensuring that only the latest immutable snapshots are retained while deleting older snapshots created by the task itself.
By capturing a final full snapshot before hibernation, periodically re-uploading it to preserve immutability, and removing stale backups, etcd backups remain safeguarded against accidental or malicious alterations until the cluster is resumed.
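Re-uploading only keeps the backup protected if each re-upload lands before the previous copy's immutability period ends. The sketch below works through that arithmetic under simplifying assumptions: uploads are treated as instantaneous, and coverageGap is a hypothetical helper, not a function of etcd-druid or etcd-backup-restore.

```go
package main

import (
	"fmt"
	"time"
)

// coverageGap returns how long the bucket would hold no immutable copy of the
// snapshot between two consecutive re-uploads. Zero means continuous coverage.
func coverageGap(reuploadInterval, immutabilityPeriod time.Duration) time.Duration {
	if reuploadInterval <= immutabilityPeriod {
		return 0
	}
	return reuploadInterval - immutabilityPeriod
}

func main() {
	period := 4 * 24 * time.Hour // hypothetical bucket-level immutability period
	for _, interval := range []time.Duration{24 * time.Hour, 7 * 24 * time.Hour} {
		fmt.Printf("re-upload every %s: unprotected for %s per cycle\n",
			interval, coverageGap(interval, period))
	}
}
```

In practice the re-upload interval should stay comfortably below the immutability period, since real uploads take time and scheduled jobs can be delayed.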
Important
Limitation: There is a potential edge case where a snapshot might become corrupted before hibernation or during the re-upload process by ExtendFullSnapshotImmutabilityTask. If this happens, the process could repeatedly re-upload the same corrupted snapshot, failing to ensure a reliable backup.
An alternative solution could be to trigger snapshot compaction, which runs in a separate pod, takes a fresh snapshot of the compacted etcd data, and re-uploads it to the object store.
While this approach ensures that only valid snapshots are re-uploaded, it is resource-intensive, requiring an operational etcd instance even in hibernation. Given the high cost in terms of compute and memory, the authors recommend the snapshot re-upload approach as a more practical solution.
Etcd CR API Changes
A new field is introduced in Etcd.spec.backup.store to indicate the immutability strategy:
// StoreSpec defines parameters for storing etcd backups.
type StoreSpec struct {
	// ...
	// Immutability configuration for the backup store.
	Immutability *ImmutabilitySpec `json:"immutability,omitempty"`
}
// ImmutabilitySpec defines immutability settings.
type ImmutabilitySpec struct {
	// RetentionType indicates the type of immutability approach. For example, "bucket".
	RetentionType string `json:"retentionType,omitempty"`
}
If immutability is not specified, etcd-druid will assume that the bucket is mutable, and no immutability-related logic applies. A dedicated type is defined to allow for future enhancements to the immutability specification.
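Because Immutability is an optional pointer field, any consumer of the API has to treat a missing value as "mutable". A minimal sketch of such a check follows; it redeclares trimmed copies of the types above so that it compiles on its own, and isBucketImmutable and retentionTypeBucket are hypothetical names rather than identifiers from etcd-druid.

```go
package main

import "fmt"

// Trimmed copies of the proposed API types, repeated here so the sketch is self-contained.
type ImmutabilitySpec struct {
	RetentionType string `json:"retentionType,omitempty"`
}

type StoreSpec struct {
	Immutability *ImmutabilitySpec `json:"immutability,omitempty"`
}

// retentionTypeBucket is the only value this proposal describes.
const retentionTypeBucket = "bucket"

// isBucketImmutable reports whether bucket-level immutability is configured.
// A nil store or a nil Immutability field is treated as a mutable bucket.
func isBucketImmutable(store *StoreSpec) bool {
	return store != nil &&
		store.Immutability != nil &&
		store.Immutability.RetentionType == retentionTypeBucket
}

func main() {
	fmt.Println(isBucketImmutable(nil))
	fmt.Println(isBucketImmutable(&StoreSpec{}))
	fmt.Println(isBucketImmutable(&StoreSpec{
		Immutability: &ImmutabilitySpec{RetentionType: retentionTypeBucket},
	}))
}
```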
etcd-backup-restore Enhancements
The authors propose adding a new sub-command to the etcd-backup-restore CLI (etcdbrctl) to maintain immutability during hibernation and to clean up snapshots created by ExtendFullSnapshotImmutabilityTask:
extend-snapshot-immutability
- Downloads the latest full snapshot from the object store.
- Renames the snapshot (for instance, updates its Unix timestamp) to avoid overwriting an existing immutable snapshot.
- Uploads the renamed snapshot back to object storage, thereby restarting its immutability timer.
- Introduces the --gc-from-timestamp=<timestamp> parameter, where <timestamp> is the creation timestamp of the task. This ensures that only snapshots created by the task are subject to garbage collection.
As an alternative to the download/upload approach, the authors document the possibility of using provider APIs to perform a server-side object copy. This method could significantly reduce network costs and latency by directly copying the snapshot within the cloud provider’s infrastructure. While this option is not implemented in the current version, feasibility of server-side copy can be explored in etcd-backup-restore during implementation.
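The rename step of extend-snapshot-immutability can be illustrated with a short sketch. It assumes, purely for illustration, that a full snapshot object name ends in "-<unix timestamp>" with an optional file suffix such as ".gz"; the real naming scheme is defined by etcd-backup-restore and may differ. renameWithTimestamp is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// renameWithTimestamp replaces the trailing Unix timestamp of a snapshot name
// with nowUnix, keeping any suffix that follows the timestamp (for example ".gz").
// The "<prefix>-<unix timestamp>[.suffix]" layout is an assumption of this sketch.
func renameWithTimestamp(name string, nowUnix int64) (string, error) {
	idx := strings.LastIndex(name, "-")
	if idx < 0 {
		return "", fmt.Errorf("snapshot name %q has no timestamp component", name)
	}
	stamp, suffix := name[idx+1:], ""
	if dot := strings.Index(stamp, "."); dot >= 0 {
		stamp, suffix = stamp[:dot], stamp[dot:]
	}
	if _, err := strconv.ParseInt(stamp, 10, 64); err != nil {
		return "", fmt.Errorf("snapshot name %q: invalid timestamp: %w", name, err)
	}
	return fmt.Sprintf("%s-%d%s", name[:idx], nowUnix, suffix), nil
}

func main() {
	renamed, err := renameWithTimestamp("Full-00000000-00009002-1518427675", time.Now().Unix())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(renamed)
}
```

Giving the copy a new name is what avoids a write to an existing immutable object, which the provider would reject.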
etcd Controller Enhancements
When a hibernation flow is initiated (by external tooling or higher-level operators), the etcd controller can:
- Remove etcd’s client and peer ports (2379/2380) from the etcd Service to block application traffic.
- Trigger an on-demand full snapshot via an EtcdOperatorTask.
- Scale down the StatefulSet replicas to zero, provided the previous snapshot step is successful.
- Create the ExtendFullSnapshotImmutabilityTask if etcd.spec.backup.store.immutability.retentionType is "bucket" and based on etcd.spec.backup.fullSnapshotSchedule.
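The ordering constraint in the list above (scale down only after the snapshot succeeded) can be sketched as a sequence that stops at the first failing step. This is an illustration of the ordering only: the step type, runInOrder, and errSnapshotFailed are hypothetical, and the step bodies are stand-ins for real controller logic.

```go
package main

import (
	"errors"
	"fmt"
)

// errSnapshotFailed simulates a failing on-demand full snapshot.
var errSnapshotFailed = errors.New("on-demand full snapshot failed")

// step is one action of the hibernation flow.
type step struct {
	name string
	run  func() error
}

// runInOrder executes steps sequentially and stops at the first error, so a
// later step never runs unless all earlier steps succeeded.
func runInOrder(steps []step) ([]string, error) {
	var completed []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return completed, fmt.Errorf("step %q: %w", s.name, err)
		}
		completed = append(completed, s.name)
	}
	return completed, nil
}

func main() {
	ok := func() error { return nil }
	steps := []step{
		{name: "remove ports from Service", run: ok},
		{name: "trigger full snapshot", run: func() error { return errSnapshotFailed }},
		{name: "scale StatefulSet to zero", run: ok},
		{name: "create ExtendFullSnapshotImmutabilityTask", run: ok},
	}
	completed, err := runInOrder(steps)
	fmt.Println("completed:", completed)
	fmt.Println("error:", err)
}
```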
Operator Task Enhancements
The ExtendFullSnapshotImmutabilityTask will create a cron job that:
- Runs etcdbrctl extend-snapshot-immutability --gc-from-timestamp=<creation timestamp of task> to preserve the immutability period of the most recent snapshot. This command re-uploads the latest snapshot, effectively resetting its immutability period. Additionally, it removes any snapshots that have become mutable after the creation timestamp of the task.
By periodically re-uploading (extending) the latest snapshot during hibernation, the authors ensure that the immutability period is extended, and the backups remain protected throughout the hibernation period.
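One possible reading of the garbage-collection rule is sketched below: a snapshot is a deletion candidate only if it was created at or after the task's creation timestamp, its immutability period has elapsed, and it is not the most recent snapshot. The snapshot type and gcCandidates are hypothetical, timestamps are plain Unix seconds, and the actual selection logic in etcd-backup-restore may differ.

```go
package main

import "fmt"

// snapshot is a simplified view of a stored snapshot object.
type snapshot struct {
	Name      string
	CreatedAt int64 // Unix seconds
}

// gcCandidates returns the names of snapshots that were created at or after
// gcFrom, whose immutability period has elapsed at time now, and that are not
// the most recent snapshot. Older snapshots (before gcFrom) are never selected.
func gcCandidates(snaps []snapshot, gcFrom, now, periodSeconds int64) []string {
	var latest int64 = -1
	for _, s := range snaps {
		if s.CreatedAt > latest {
			latest = s.CreatedAt
		}
	}
	var out []string
	for _, s := range snaps {
		createdByTask := s.CreatedAt >= gcFrom
		mutable := s.CreatedAt+periodSeconds <= now
		if createdByTask && mutable && s.CreatedAt != latest {
			out = append(out, s.Name)
		}
	}
	return out
}

func main() {
	snaps := []snapshot{
		{Name: "final-before-hibernation", CreatedAt: 100},
		{Name: "reupload-1", CreatedAt: 200},
		{Name: "reupload-2", CreatedAt: 300},
		{Name: "reupload-3", CreatedAt: 400},
	}
	fmt.Println(gcCandidates(snaps, 150, 450, 200))
}
```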
Lifecycle of ExtendFullSnapshotImmutabilityTask
The ExtendFullSnapshotImmutabilityTask is active during hibernation and is automatically managed by the etcd-controller. Its lifecycle is tied to the cluster’s hibernation state:
Task Creation
- When the etcd cluster enters hibernation (e.g., scaling down to zero replicas), the etcd controller:
  - Triggers a final full snapshot.
  - Creates the ExtendFullSnapshotImmutabilityTask to run extend-snapshot-immutability --gc-from-timestamp=<creation timestamp of this task>
Task Deletion
- When the cluster resumes from hibernation (scales up to non-zero replicas), the controller:
  - Deletes the ExtendFullSnapshotImmutabilityTask to stop extending snapshots.
  - Resumes the normal backup schedule defined in spec.backup.fullSnapshotSchedule.
- When the immutability configuration is removed from etcd.spec.backup.store, the controller:
  - Deletes the ExtendFullSnapshotImmutabilityTask to stop extending snapshots.
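The creation and deletion rules above amount to a small reconcile decision. The sketch below assumes hibernation is detected as zero desired replicas, which the proposal mentions only as an example; the action type and reconcileTask are hypothetical names and not part of the etcd controller.

```go
package main

import "fmt"

// action is what the controller should do with the task in this reconcile pass.
type action string

const (
	actionNone   action = "none"
	actionCreate action = "create"
	actionDelete action = "delete"
)

// reconcileTask decides whether ExtendFullSnapshotImmutabilityTask should be
// created, deleted, or left alone. The task is wanted only while the cluster is
// hibernated (assumed here: zero replicas) and retentionType is "bucket".
func reconcileTask(replicas int32, retentionType string, taskExists bool) action {
	wanted := replicas == 0 && retentionType == "bucket"
	switch {
	case wanted && !taskExists:
		return actionCreate
	case !wanted && taskExists:
		return actionDelete
	default:
		return actionNone
	}
}

func main() {
	fmt.Println(reconcileTask(0, "bucket", false)) // entering hibernation
	fmt.Println(reconcileTask(3, "bucket", true))  // resuming from hibernation
	fmt.Println(reconcileTask(0, "", true))        // immutability configuration removed
}
```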
Example Task Config
type ExtendFullSnapshotImmutabilityTaskConfig struct {
	// Schedule defines a cron schedule (e.g., "0 0 * * *").
	Schedule *string `json:"schedule,omitempty"`
}
Sample YAML:
spec:
  config:
    schedule: "0 0 * * *"
Disabling Immutability
If you remove the immutability configuration from etcd.spec.backup.store, hibernation-based immutability support no longer applies. However, once the bucket itself is locked at the provider level, it cannot be reverted to a mutable state. Any objects uploaded by etcd-backup-restore are still subject to the existing WORM policy.
If you genuinely require a mutable backup again, the recommended approach is:
- Use a new bucket. In your Etcd custom resource, reference a different bucket that does not have immutability enabled.
- Use EtcdCopyBackupTask. If you want to start the cluster with a new bucket but retain old data, use the EtcdCopyBackupTask to copy existing backups from the old immutable bucket to the new mutable bucket.
- Reconcile the Etcd CR. After pointing etcd.spec.backup.store to the new bucket, etcd-druid will start storing backups there.
Note: Existing snapshots in the old immutable bucket remain locked according to the configured immutability period.
Compatibility
These changes are compatible with existing etcd clusters and current backup processes.
- Backward Compatibility:
  - Clusters without immutable buckets continue to function without any changes.
- Forward Compatibility:
  - Clusters can opt in to use immutable backups by configuring the bucket accordingly (as described in Configuring Immutable Backups) and setting etcd.spec.backup.store.immutability.retentionType == "bucket".
  - The enhanced hibernation logic in the etcd controller is additive, meaning it does not interfere with existing workflows.
Impact for Operators
In scenarios where you want to exclude certain snapshots from an etcd restore, you previously could simply delete them from object storage. However, when bucket-level immutability is enabled, deleting existing immutable snapshots is no longer possible. To address this need, most cloud providers allow adding custom annotations or tags to objects—even immutable ones—so they can be logically excluded without physically removing them.
etcd-backup-restore supports ignoring snapshots based on annotations or tags, rather than deleting them. Operators can add the following key-value pair to any snapshot object to exclude it from future restores:
- Key: x-etcd-snapshot-exclude
- Value: true
Because these tags or annotations do not modify the underlying snapshot data, they are permissible even for immutable objects. Once these annotations are in place, etcd-backup-restore will detect them and skip the tagged snapshots during restoration, thus preventing unwanted snapshots from being used. For more details, see Ignoring Snapshots during Restoration.
Note
At the time of writing this proposal, this feature is not supported for AWS S3 buckets.
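The exclusion rule can be pictured as a filter over the listed snapshot objects. The sketch below only models the rule described above, with tags held in a plain map; snapshotObject and restorable are hypothetical names, and how tags are actually read from each provider is left out.

```go
package main

import "fmt"

// excludeTagKey is the tag key named in this proposal.
const excludeTagKey = "x-etcd-snapshot-exclude"

// snapshotObject is a simplified view of a stored snapshot and its tags.
type snapshotObject struct {
	Name string
	Tags map[string]string
}

// restorable returns the snapshots that are not tagged for exclusion.
// Objects without tags (nil map) are kept.
func restorable(objs []snapshotObject) []snapshotObject {
	var out []snapshotObject
	for _, o := range objs {
		if o.Tags[excludeTagKey] == "true" {
			continue
		}
		out = append(out, o)
	}
	return out
}

func main() {
	objs := []snapshotObject{
		{Name: "full-1"},
		{Name: "full-2", Tags: map[string]string{excludeTagKey: "true"}},
		{Name: "full-3", Tags: map[string]string{"owner": "team-a"}},
	}
	for _, o := range restorable(objs) {
		fmt.Println(o.Name)
	}
}
```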
References
- GCS Bucket Lock
- AWS S3 Object Lock
- Azure Immutable Blob Storage
- etcd-backup-restore Documentation
- Gardener Issue: 10866