

Monitoring

etcd-druid uses [Prometheus][prometheus] for metrics reporting. The metrics can be used for real-time monitoring and debugging of compaction jobs.

The simplest way to see the available metrics is to cURL the metrics endpoint /metrics. The format is described here.
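As a rough sketch of the same idea in Python, the snippet below fetches the endpoint and parses the text exposition format into a dictionary. The parser is deliberately simplified (it ignores timestamps and comment lines), and the URL in the usage note, the sample payload, and the namespace value shoot--example are made up for illustration rather than taken from a real etcd-druid deployment.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Return the raw text served by a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Parse exposition-format text into {series: value}.

    Simplified: skips '#' comment lines and ignores optional timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the last '}' is the key.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            fields = line.split()
            series, rest = fields[0], fields[1:]
        samples[series] = float(rest[0])
    return samples


# Illustrative payload only, not captured from a real endpoint.
SAMPLE = """\
# TYPE etcddruid_compaction_jobs_current gauge
etcddruid_compaction_jobs_current{etcd_namespace="shoot--example"} 1
go_goroutines 42
"""

samples = parse_samples(SAMPLE)
compaction = {k: v for k, v in samples.items()
              if k.startswith("etcddruid_compaction_")}
print(compaction)
```

Against a live endpoint, the same parser could be fed with something like parse_samples(fetch_metrics("http://localhost:8080/metrics")), where the host and port are assumptions that depend on how the metrics server is exposed.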

Follow the [Prometheus getting started doc][prometheus-getting-started] to spin up a Prometheus server to collect etcd metrics.

The naming of metrics follows the suggested [Prometheus best practices][prometheus-naming]. All compaction-related metrics are placed under the etcddruid namespace and their respective subsystems.

Snapshot Compaction

These metrics provide information about the compaction jobs that run at intervals in shoot control planes. From these metrics, we can deduce how many compaction jobs ran successfully, how many failed, how many delta events were compacted, and so on.

| Name | Description | Type |
| --- | --- | --- |
| etcddruid_compaction_jobs_total | Total number of compaction jobs initiated by the compaction controller. | Counter |
| etcddruid_compaction_jobs_current | Number of currently running compaction jobs. | Gauge |
| etcddruid_compaction_job_duration_seconds | Total time taken in seconds to finish a running compaction job. | Histogram |
| etcddruid_compaction_num_delta_events | Total number of etcd events to be compacted by a compaction job. | Gauge |

There are two labels for the etcddruid_compaction_jobs_total metric. The label succeeded shows how many compaction jobs succeeded, and the label failed shows how many compaction jobs failed.

There are two labels for the etcddruid_compaction_job_duration_seconds metric. The label succeeded shows how much time a successful compaction job took to complete, and the label failed shows how much time a failed compaction job took.

The etcddruid_compaction_jobs_current metric comes with the label etcd_namespace, which indicates the namespace of the Etcd running in the control plane of a shoot cluster.
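Because the job duration metric is a histogram, Prometheus also exposes a _sum and a _count series for it, and dividing the two gives a mean duration. The sketch below shows that arithmetic along with a failure ratio. How the succeeded and failed outcomes are encoded in the exposed labels is not shown here, so the example takes already-extracted numbers; the JobStats type and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    """Histogram totals for one outcome (hypothetical helper type)."""
    count: float        # value of the ..._duration_seconds_count series
    sum_seconds: float  # value of the ..._duration_seconds_sum series


def mean_duration(stats):
    """Mean job duration in seconds, or None if no job was observed."""
    return stats.sum_seconds / stats.count if stats.count else None


def failure_ratio(succeeded, failed):
    """Fraction of finished jobs that failed, or None if none finished."""
    total = succeeded.count + failed.count
    return failed.count / total if total else None


# Made-up numbers for illustration only.
succeeded = JobStats(count=8, sum_seconds=96.0)
failed = JobStats(count=2, sum_seconds=50.0)

print(mean_duration(succeeded))          # 12.0
print(mean_duration(failed))             # 25.0
print(failure_ratio(succeeded, failed))  # 0.2
```

These values are cumulative since process start; for a recent window, the same division would be applied to the increase of each series over that window instead.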

Etcd

These metrics are exposed by the etcd process that runs in each etcd pod.

The following list of metrics is applicable to the clustering of a multi-node etcd cluster. The full list of metrics exposed by etcd is available here.

| No. | Metric Name | Description | Comments |
| --- | --- | --- | --- |
| 1 | etcd_disk_wal_fsync_duration_seconds | latency distributions of fsync called by WAL. | High disk operation latencies indicate disk issues. |
| 2 | etcd_disk_backend_commit_duration_seconds | latency distributions of commit called by backend. | High disk operation latencies indicate disk issues. |
| 3 | etcd_server_has_leader | whether or not a leader exists. 1: a leader exists, 0: no leader exists. | To capture quorum loss or to check the availability of the etcd cluster. |
| 4 | etcd_server_is_leader | whether or not this member is a leader. 1 if it is, 0 otherwise. | |
| 5 | etcd_server_leader_changes_seen_total | number of leader changes seen. | Helpful in fine-tuning a zonal cluster (for example, the etcd heartbeat time); it can also indicate etcd load and network issues. |
| 6 | etcd_server_is_learner | whether or not this member is a learner. 1 if it is, 0 otherwise. | |
| 7 | etcd_server_learner_promote_successes | total number of successful learner promotions while this member is leader. | Might be helpful in checking the success of API calls made by backup-restore. |
| 8 | etcd_network_client_grpc_received_bytes_total | total number of bytes received from grpc clients. | Client traffic in. |
| 9 | etcd_network_client_grpc_sent_bytes_total | total number of bytes sent to grpc clients. | Client traffic out. |
| 10 | etcd_network_peer_sent_bytes_total | total number of bytes sent to peers. | Useful for network usage. |
| 11 | etcd_network_peer_received_bytes_total | total number of bytes received from peers. | Useful for network usage. |
| 12 | etcd_network_active_peers | current number of active peer connections. | Might be useful in detecting issues like network partition. |
| 13 | etcd_server_proposals_committed_total | total number of consensus proposals committed. | A consistently large lag between a single member and its leader indicates that the member is slow or unhealthy. |
| 14 | etcd_server_proposals_pending | current number of pending proposals to commit. | Pending proposals suggest there is a high client load or the member cannot commit proposals. |
| 15 | etcd_server_proposals_failed_total | total number of failed proposals seen. | Might indicate downtime caused by a loss of quorum. |
| 16 | etcd_server_proposals_applied_total | total number of consensus proposals applied. | The difference between etcd_server_proposals_committed_total and etcd_server_proposals_applied_total should usually be small. |
| 17 | etcd_mvcc_db_total_size_in_bytes | total size of the underlying database physically allocated in bytes. | |
| 18 | etcd_server_heartbeat_send_failures_total | total number of leader heartbeat send failures. | Might be helpful in fine-tuning the cluster or detecting a slow disk or network issues. |
| 19 | etcd_network_peer_round_trip_time_seconds | round-trip-time histogram between peers. | Might be helpful in fine-tuning network usage, especially for a zonal etcd cluster. |
| 20 | etcd_server_slow_apply_total | total number of slow apply requests. | Might indicate overload from a slow disk. |
| 21 | etcd_server_slow_read_indexes_total | total number of pending read indexes not in sync with leader’s or timed out read index requests. | |

The full list of metrics is available here.
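A few of the comments in the table translate directly into simple checks. The following sketch applies three of them to one member's latest sample values: leader presence, the gap between committed and applied proposals, and pending proposals. The check_member function and the max_apply_lag threshold are hypothetical and chosen only for illustration; suitable thresholds depend on the cluster.

```python
def check_member(samples, max_apply_lag=100.0):
    """Return a list of warnings for one etcd member.

    samples maps metric names to their latest values. max_apply_lag is a
    hypothetical threshold, not a value recommended by etcd.
    """
    warnings = []

    # etcd_server_has_leader: 1 if a leader exists, 0 otherwise.
    if samples.get("etcd_server_has_leader", 0.0) != 1.0:
        warnings.append("no leader: possible quorum loss")

    # The committed/applied difference should usually be small.
    committed = samples.get("etcd_server_proposals_committed_total", 0.0)
    applied = samples.get("etcd_server_proposals_applied_total", 0.0)
    if committed - applied > max_apply_lag:
        warnings.append("apply lag is %d proposals" % (committed - applied))

    # A single non-zero reading can be transient; sustained values matter more.
    if samples.get("etcd_server_proposals_pending", 0.0) > 0:
        warnings.append("proposals pending: high load or member cannot commit")

    return warnings


# Made-up sample values for illustration only.
healthy = {
    "etcd_server_has_leader": 1.0,
    "etcd_server_proposals_committed_total": 1200.0,
    "etcd_server_proposals_applied_total": 1198.0,
    "etcd_server_proposals_pending": 0.0,
}
degraded = {
    "etcd_server_has_leader": 0.0,
    "etcd_server_proposals_committed_total": 5000.0,
    "etcd_server_proposals_applied_total": 4500.0,
    "etcd_server_proposals_pending": 3.0,
}

print(check_member(healthy))
print(check_member(degraded))
```

In practice such conditions are usually expressed as alerting rules in Prometheus itself; the Python form here only spells out the logic.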

Etcd-Backup-Restore

These metrics are exposed by the etcd-backup-restore container in each etcd pod.

The following list of metrics is applicable to the clustering of a multi-node etcd cluster. The full list of metrics exposed by etcd-backup-restore is available here.

| No. | Metric Name | Description |
| --- | --- | --- |
| 1. | etcdbr_cluster_size | to capture the scale-up/scale-down scenarios. |
| 2. | etcdbr_is_learner | whether or not this member is a learner. 1 if it is, 0 otherwise. |
| 3. | etcdbr_is_learner_count_total | total number of times the member was added as a learner. |
| 4. | etcdbr_restoration_duration_seconds | total latency distribution required to restore the etcd member. |
| 5. | etcdbr_add_learner_duration_seconds | total latency distribution of adding the etcd member as a learner to the cluster. |
| 6. | etcdbr_member_remove_duration_seconds | total latency distribution of removing the etcd member from the cluster. |
| 7. | etcdbr_member_promote_duration_seconds | total latency distribution of promoting the learner to a voting member. |
| 8. | etcdbr_defragmentation_duration_seconds | total latency distribution of defragmentation of each etcd cluster member. |

Prometheus supplied metrics

The Prometheus client library provides a number of metrics under the go and process namespaces.