Monitoring

The shoot-rsyslog-relp extension exposes metrics for the rsyslog service running on a Shoot’s nodes so that they can be easily viewed by cluster owners and operators in the Shoot’s Prometheus and Plutono instances. The exposed monitoring data offers valuable insights into the operation of the rsyslog service and can be used to detect and debug ongoing issues. This guide describes the various metrics, alerts and logs available to cluster owners and operators.

Metrics

Metrics for the rsyslog service originate from its impstats module. These include the number of messages in the various queues, the number of ingested messages, the number of messages processed by configured actions, the system resources used by the rsyslog service, and others. More information about them can be found in the impstats documentation and the statistics counter documentation. They are exposed via the node-exporter running on each Shoot node and are scraped by the Shoot’s Prometheus instance.

These metrics can also be viewed in a dedicated dashboard named Rsyslog Stats in the Shoot’s Plutono instance. You can select the node for which you wish the metrics to be displayed from the Node dropdown menu (by default metrics are summed over all nodes).

Following is a list of all exposed rsyslog metrics. The name and origin labels can be used to determine whether the metric is for a queue, an action, a plugin, or system stats; the node label identifies the node the metric originates from:

rsyslog_pstat_submitted

Number of messages that were submitted to the rsyslog service from its input. Currently rsyslog uses the /run/systemd/journal/syslog socket as input.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_processed

Number of messages that are successfully processed by an action and sent to the target server.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_failed

Number of messages that could not be processed by an action or sent to the target server.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_suspended

Total number of times an action suspended itself. Note that this counts the number of times the action transitioned from active to suspended state. The counter is no indication of how long the action was suspended or how often it was retried.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_suspended_duration

The total number of seconds this action was disabled.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_resumed

The total number of times this action resumed itself. A resumption occurs after the action has detected that a failure condition no longer exists.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_utime

User time used in microseconds.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_stime

System time used in microseconds.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_maxrss

Maximum resident set size.

  • Type: Gauge
  • Labels: name node origin

rsyslog_pstat_minflt

Total number of minor page faults the task has made, that is, those which have not required loading a memory page from disk.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_majflt

Total number of major page faults the task has made, that is, those which have required loading a memory page from disk.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_inblock

Filesystem input operations.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_oublock

Filesystem output operations.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_nvcsw

Voluntary context switches.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_nivcsw

Involuntary context switches.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_openfiles

Number of open files.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_size

Messages currently in queue.

  • Type: Gauge
  • Labels: name node origin

rsyslog_pstat_enqueued

Total messages enqueued.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_full

Times queue was full.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_discarded_full

Messages discarded due to queue being full.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_discarded_nf

Messages discarded when queue not full.

  • Type: Counter
  • Labels: name node origin

rsyslog_pstat_maxqsize

Maximum size queue has reached.

  • Type: Gauge
  • Labels: name node origin

rsyslog_augenrules_load_success

Shows whether the augenrules --load command was executed successfully or not on the node.

  • Type: Gauge
  • Labels: node
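As a rough illustration of how the name, origin, and node labels can be used to slice these metrics, the following Python sketch parses a few samples in the Prometheus text exposition format and sums one metric per node. The sample lines, label values (other than the core.action origin and the rsyslog-relp name that appear in the alert expressions of this guide), and the helper names parse_samples and sum_by_node are assumptions made up for this example; they are not taken from the extension.

```python
import re
from collections import defaultdict

# Hypothetical samples in the Prometheus text exposition format.
# Label values and numbers are made up for illustration only.
SAMPLE = '''\
rsyslog_pstat_submitted{name="imuxsock",node="node-a",origin="imuxsock"} 120
rsyslog_pstat_processed{name="rsyslog-relp",node="node-a",origin="core.action"} 100
rsyslog_pstat_processed{name="rsyslog-relp",node="node-b",origin="core.action"} 50
rsyslog_pstat_size{name="main Q",node="node-a",origin="core.queue"} 3
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines, comments, and samples without labels
        metric, raw_labels, value = match.groups()
        samples.append((metric, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def sum_by_node(samples, metric, origin):
    """Sum one metric per node, restricted to a given origin label value."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and labels.get("origin") == origin:
            totals[labels["node"]] += value
    return dict(totals)


per_node = sum_by_node(parse_samples(SAMPLE), "rsyslog_pstat_processed", "core.action")
print(per_node)
```

In practice the same grouping is more naturally done with a PromQL aggregation in the Shoot’s Prometheus instance; the sketch only makes the role of each label explicit.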

Alerts

There are three alerts defined for the rsyslog service in the Shoot’s Prometheus instance:

RsyslogTooManyRelpActionFailures

This indicates that the cumulative failure rate in processing relp action messages is greater than 2%. In other words, it compares the rate of processed relp action messages to the rate of failed relp action messages and fires an alert when the following expression evaluates to true:

sum(rate(rsyslog_pstat_failed{origin="core.action",name="rsyslog-relp"}[5m])) / sum(rate(rsyslog_pstat_processed{origin="core.action",name="rsyslog-relp"}[5m])) > bool 0.02
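To make the arithmetic behind this threshold concrete, here is a minimal Python sketch that computes the ratio from two counter samples taken at the start and end of a 5-minute window. It is a simplification under stated assumptions: the function names are hypothetical, counter resets are ignored, and the extrapolation that Prometheus applies in rate() is not modelled.

```python
def counter_rate(first, last, window_seconds):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / window_seconds


def relp_failure_alert(failed, processed, window_seconds=300, threshold=0.02):
    """Return True when the failed/processed rate ratio exceeds the threshold.

    `failed` and `processed` are (first, last) counter samples for the window.
    """
    failed_rate = counter_rate(*failed, window_seconds)
    processed_rate = counter_rate(*processed, window_seconds)
    if processed_rate == 0:
        # Mirrors floating-point division: x/0 is +Inf for x > 0 (above any
        # threshold), while 0/0 is NaN, which never compares as greater.
        return failed_rate > 0
    return failed_rate / processed_rate > threshold


# 3 failures against 100 processed messages in the window: ratio 0.03
print(relp_failure_alert(failed=(10, 13), processed=(1000, 1100)))
# 1 failure against 100 processed messages in the window: ratio 0.01
print(relp_failure_alert(failed=(10, 11), processed=(1000, 1100)))
```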

RsyslogRelpActionProcessingRateIsZero

This indicates that no messages are being sent to the upstream rsyslog target by the relp action. An alert is fired when the following expression evaluates to true:

rate(rsyslog_pstat_processed{origin="core.action",name="rsyslog-relp"}[5m]) == 0

RsyslogRelpAuditRulesNotLoadedSuccessfully

This indicates that augenrules --load was not executed successfully when called to load the configured audit rules. You should check if the auditd configuration you provided is valid. An alert is fired when the following expression evaluates to true:

absent(rsyslog_augenrules_load_success == 1)
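The semantics of this expression can be easy to misread, so the following Python sketch models it under simple assumptions: the input is the list of current values of rsyslog_augenrules_load_success across all scraped series, and the function name is hypothetical. The filter `== 1` keeps only series whose value is 1, and absent() yields a result only when nothing is left.

```python
def audit_rules_alert(load_success_values):
    """Model of `absent(metric == 1)` over the current values of all series.

    `load_success_values` is an empty list when the metric is not scraped at all.
    """
    matching = [value for value in load_success_values if value == 1]
    # absent() returns a result only when the filtered vector is empty.
    return len(matching) == 0


print(audit_rules_alert([]))      # metric missing entirely
print(audit_rules_alert([0]))     # metric present, load failed
print(audit_rules_alert([1]))     # metric present, load succeeded
```

Under this model, the alert condition holds both when the metric is missing and when it is present with a value other than 1. Because the expression as written is evaluated over all series at once, a single series with value 1 is enough for it to return nothing.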

Users can subscribe to these alerts by following the Gardener alerting guide.

Logging

There are two ways to view the logs of the rsyslog service running on the Shoot’s nodes: using the Explore tab of the Shoot’s Plutono instance, or connecting directly to a node via SSH.

To view logs in Plutono, navigate to the Explore tab and select vali from the Explore dropdown menu. Afterwards, enter the following vali query:

{nodename="<name-of-node>"} |~ "\"unit\":\"rsyslog.service\""

Note that you cannot use the unit label to filter for the rsyslog.service unit logs. Instead, you have to grep for the service as displayed in the example above.

To view logs after connecting to a node in the Shoot cluster via SSH, use either of the following commands on the node:
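If you need this query for several nodes, it can be generated rather than typed by hand. The following Python sketch builds the same query string for a given node name; the function name rsyslog_log_query and the node name node-a are hypothetical, and the node name is inserted as-is without any escaping.

```python
def rsyslog_log_query(node_name, unit="rsyslog.service"):
    """Build the vali query that greps a node's logs for a systemd unit."""
    # The line filter matches the JSON fragment "unit":"<unit>" in the log line;
    # the inner quotes are backslash-escaped inside the query's string literal.
    line_filter = f'\\"unit\\":\\"{unit}\\"'
    return f'{{nodename="{node_name}"}} |~ "{line_filter}"'


print(rsyslog_log_query("node-a"))
```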

systemctl status rsyslog

journalctl -u rsyslog