Metrics and Instrumentation

StackStorm services and the code base are instrumented with metrics in various critical places. This provides better operational visibility and allows operators to detect infrastructure or deployment related issues (e.g. a long average duration for a particular action could indicate an issue with that action).

Configuring and Enabling Metrics Collection

Note

This feature is available in StackStorm v2.9.0 and above.

By default, metrics collection is disabled. To enable it, you need to configure the metrics.driver option and, depending on the driver, also the metrics.host and metrics.port options in /etc/st2/st2.conf.

Right now, the only supported driver is statsd. To configure it, add the following entries to st2.conf:

[metrics]
driver = statsd
# Optional prefix which is prepended to each metric key. E.g. if the prefix is
# "production" and the key is "action.executions", the actual key would be
# "st2.production.action.executions". This comes in handy when you want to
# utilize the same backend instance for multiple environments.

# statsd collection and aggregation server address
host = 127.0.0.1
# statsd collection and aggregation server port
port = 8125

After you have configured it, you need to restart all the services using st2ctl restart.
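
For example, on a package-based installation (whether sudo is required depends on how StackStorm was installed):

sudo st2ctl restart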

If your statsd daemon is running on a remote server and you have a firewall configured, you also need to make sure that all the servers where StackStorm components are running are allowed outgoing access to the configured host and port.
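
To verify that the port is reachable from a StackStorm node, you can send a test counter metric over UDP, for example with netcat (the metric name st2.test is an arbitrary example, and netcat flags vary between implementations):

echo "st2.test:1|c" | nc -u -w1 <statsd host> 8125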

For debugging and troubleshooting purposes, you can also set the driver to echo. This causes StackStorm to log, under the DEBUG log level, any metrics operation which would otherwise have been performed (incrementing a counter, timing an operation, etc.) without actually performing it.
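
For example, to use the echo driver, configure the following in /etc/st2/st2.conf:

[metrics]
driver = echo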

For a full list of config options, see the [metrics] section in the StackStorm sample config here: https://github.com/StackStorm/st2/blob/master/conf/st2.conf.sample

Configuring StatsD

The StackStorm statsd metrics driver is compatible with any service that exposes a statsd-compatible interface for receiving metrics via UDP.

This includes the original statsd service written in Node.js, as well as compatible projects such as Telegraf.

This provides a lot of flexibility and allows the statsd service to submit those metrics to a self-hosted or managed Graphite instance, or to other compatible projects and services such as InfluxDB and hostedgraphite.

Configuring those services is out of scope for this documentation, because it is very environment specific (aggregation resolution, retention period, etc.), but some sample configs which can help you get started with statsd and a self-hosted Graphite and carbon-cache instance can be found at https://github.com/StackStorm/st2/tree/master/conf/metrics.
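
For illustration only, a minimal config.js for the Node.js statsd service which forwards aggregated metrics to a carbon-cache instance on the same host might look like the following sketch (the host and ports are assumptions; adjust them to your environment):

{
  // Port statsd listens on for incoming UDP metrics
  port: 8125,
  // carbon-cache plaintext listener to forward aggregated metrics to
  graphiteHost: "127.0.0.1",
  graphitePort: 2003,
  backends: ["./backends/graphite"]
}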

Exposed Metrics

Note

Various metrics documented in this section are only available in StackStorm v2.9.0 and above.

This section describes which metrics are currently exposed by various StackStorm services.

Name | Type | Service | Description
st2.action.executions | counter | st2actionrunner | Number of action executions processed by the st2actionrunner service.
st2.action.executions | timer | st2actionrunner | How long it took to process (run) a particular action execution inside the st2actionrunner service.
st2.action.executions.calculate_result_size | timer | st2actionrunner | How long it took to update the result size in the database.
st2.action.executions.process.<status> | counter | st2actionrunner | Number of action executions processed by the st2actionrunner service for a particular status.
st2.action.executions.process.<status> | timer | st2actionrunner | How long it took to fully process a request inside st2actionrunner for a particular status.
st2.action.executions.update_status | timer | st2actionrunner | How long it took to update the execution and live action status and result in the database.
st2.action.executions.update_liveaction_db | timer | st2actionrunner | How long it took to update / store the LiveActionDB model in the database.
st2.action.executions.update_execution_db | timer | st2actionrunner | How long it took to update / store the ActionExecutionDB model in the database.
st2.action.<action ref>.executions | counter | st2actionrunner | Number of action executions for a particular action processed by st2actionrunner.
st2.action.<action ref>.executions | timer | st2actionrunner | How long it took to process (run) an action execution for a particular action inside st2actionrunner.
st2.action.executions.<execution status> | counter | st2actionrunner | Number of executions in a particular state (succeeded, failed, timeout, delayed, etc.).
st2.rule.processed | counter | st2rulesengine | Number of rules (trigger instances) processed by the st2rulesengine service.
st2.rule.processed | timer | st2rulesengine | How long it took to process a particular rule (trigger instance) inside st2rulesengine.
st2.rule.<rule ref>.processed | counter | st2rulesengine | Number of times a particular rule was processed by st2rulesengine.
st2.rule.matched | counter | st2rulesengine | Number of trigger instances which matched a rule (criteria).
st2.rule.<rule ref>.matched | counter | st2rulesengine | Number of trigger instances which matched a particular rule (criteria).
st2.scheduler.handle_execution | counter | st2scheduler | Number of executions handled by st2scheduler.
st2.scheduler.handle_execution | timer | st2scheduler | How long it took to handle a particular execution in st2scheduler.
st2.trigger.<trigger ref>.processed | counter | st2rulesengine | Number of times a particular trigger was processed by st2rulesengine.
st2.trigger.<trigger ref>.processed | timer | st2rulesengine | How long it took to process a particular trigger inside st2rulesengine.
st2.orquesta.workflow.executions | counter | st2workflowengine | Number of workflow executions processed by st2workflowengine.
st2.orquesta.workflow.executions | timer | st2workflowengine | How long it took to process a particular workflow execution inside st2workflowengine.
st2.orquesta.action.executions | counter | st2workflowengine | Number of executions processed for workflow task executions by st2workflowengine.
st2.orquesta.action.executions | timer | st2workflowengine | How long it took to process a particular workflow task execution inside st2workflowengine.
st2.{auth,api,stream}.request.total | counter | st2auth, st2api, st2stream | Number of requests processed by st2auth / st2api / st2stream.
st2.{auth,api,stream}.request | counter | st2auth, st2api, st2stream | Number of requests processed by st2auth / st2api / st2stream.
st2.{auth,api,stream}.request | timer | st2auth, st2api, st2stream | How long it took to process a particular HTTP request.
st2.{auth,api,stream}.request.method.<method> | counter | st2auth, st2api, st2stream | Number of requests with a particular HTTP method processed by st2auth / st2api / st2stream.
st2.{auth,api,stream}.request.path.<path> | counter | st2auth, st2api, st2stream | Number of requests to a particular HTTP path (controller endpoint) processed by st2auth / st2api / st2stream.
st2.{auth,api,stream}.response.status.<status code> | counter | st2auth, st2api, st2stream | Number of requests which resulted in a response with a particular HTTP status code.
st2.stream.connections | gauge | st2stream | Number of open connections to the stream service.
st2.notifier.action.executions | counter | st2notifier | Number of action executions processed by st2notifier.
st2.notifier.action.executions | timer | st2notifier | How long it took to process a particular action execution in st2notifier.
st2.notifier.apply_post_run_policies | counter | st2notifier | Number of post-run policies applied by st2notifier.
st2.notifier.apply_post_run_policies | timer | st2notifier | How long it took to apply post-run policies in st2notifier.
st2.notifier.notify_trigger.dispatch | counter | st2notifier | Number of notify triggers dispatched by st2notifier.
st2.notifier.notify_trigger.dispatch | timer | st2notifier | How long it took to dispatch a notify trigger for an execution.
st2.notifier.notify_trigger.post | counter | st2notifier | Number of notify triggers processed by st2notifier.
st2.notifier.notify_trigger.post | timer | st2notifier | How long it took to process / post a notify trigger for an execution.
st2.notifier.generic_trigger.dispatch | counter | st2notifier | Number of generic notify triggers dispatched by st2notifier.
st2.notifier.generic_trigger.dispatch | timer | st2notifier | How long it took to dispatch a generic notify trigger for an execution.
st2.notifier.generic_trigger.post | counter | st2notifier | Number of generic notify triggers processed by st2notifier.
st2.notifier.generic_trigger.post | timer | st2notifier | How long it took to process a generic notify trigger for an execution.
st2.notifier.transform_message | timer | st2notifier | How long a "transform_message" function call took for a particular notify trigger.
st2.notifier.transform_data | timer | st2notifier | How long a "transform_data" function call took for a particular notify trigger.

Depending on the metric backend and metric type, some of those metrics will also be sampled, averaged, aggregated, converted into a rate (operations per second for counter metrics), and so on.

Keep in mind that for the counter metrics, statsd automatically calculates rates. If you are interested in anything other than a rate (events per second), you will need to derive those metrics from the raw “count” metric.

For example, if you are interested in the total number of executions scheduled or the total number of API requests in a particular time frame, you would use the integral() Graphite function (e.g. integral(stats.counters.st2.action.executions.scheduled.count) and integral(stats.counters.st2.api.requests.count)).
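
As an illustration (the Graphite host name is a placeholder and the stats.counters prefix assumes the default statsd Graphite backend settings), the total number of action executions over the last day could be retrieved via the Graphite render API like this:

http://graphite.example.com/render?target=integral(stats.counters.st2.action.executions.count)&from=-24h&format=json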

Example Graphite Dashboard

Below you can find the code for an example Graphite dashboard which contains most of the common graphs you need to have good operational visibility into a StackStorm deployment.

[Image: example Graphite dashboard]
[
  {
    "target": [
      "integral(stats.counters.st2.<prefix>.action.executions.count)"
    ],
    "title": "Total Number of Action Executions",
    "height": "308",
    "width": "586"
  },
  {
    "target": [
      "stats.counters.st2.<prefix>.action.executions.rate"
    ],
    "title": "Action Executions per Second",
    "height": "308",
    "width": "586"
  },
  {
    "target": [
      "integral(stats.counters.st2.<prefix>.action.executions.running.count)",
      "sumSeries(stats.counters.st2.<prefix>.action.executions.requested.count)",
      "sumSeries(stats.counters.st2.<prefix>.action.executions.pending.count)",
      "sumSeries(stats.counters.st2.<prefix>.action.executions.delayed.count)",
      "sumSeries(stats.counters.st2.<prefix>.action.executions.paused.count)"
    ],
    "title": "Current Number of Action Execution in Particular State",
    "height": "495",
    "width": "798"
  },
  {
    "logBase": "",
    "target": [
      "stats.timers.st2.<prefix>.action.executions.median"
    ],
    "title": "Median Action Execution Duration (ms)",
    "areaMode": "stacked",
    "minorY": "",
    "height": "469",
    "width": "754"
  },
  {
    "logBase": "",
    "target": [
      "stats.counters.st2.<prefix>.api.request.rate"
    ],
    "title": "API Requests Per Second",
    "areaMode": "stacked",
    "minorY": "",
    "height": "308",
    "width": "586"
  },
  {
    "target": [
      "stats.counters.st2.<prefix>.rule.processed.rate",
      "stats.counters.st2.<prefix>.rule.matched.rate"
    ],
    "title": "Processed trigger instances and matched rules per second",
    "height": "308",
    "width": "586"
  },
  {
    "target": [
      "stats.counters.st2.<prefix>.api.response.status.200.rate",
      "stats.counters.st2.<prefix>.api.response.status.404.rate",
      "stats.counters.st2.<prefix>.api.response.status.201.rate"
    ],
    "title": "API responses per status code per second",
    "height": "308",
    "width": "586"
  },
  {
    "target": [
      "stats.counters.st2.<prefix>.orquesta.action.executions.rate"
    ],
    "title": "Orquesta Workflow and Action Executions per Second",
    "height": "331",
    "width": "697"
  }
]

Keep in mind that some of the graphs, such as “current number of executions in a particular state at a particular point in time” and “total counts for a particular execution state”, are derived from the raw metric values.

Pushing metrics to InfluxDB

It is possible to gather the StatsD data with Telegraf and push it to InfluxDB. The StatsD data is formatted differently from what InfluxDB usually expects, so we can use the template feature available in the Telegraf StatsD input plugin to reformat it into something more convenient (with flags, etc.).

Configure your InfluxDB instance and the Telegraf InfluxDB output as usual, then specify the following configuration for the StatsD input in Telegraf:

 # Statsd UDP/TCP Server
 [[inputs.statsd]]
   protocol = "udp"
   max_tcp_connections = 250
   tcp_keep_alive = false
   service_address = ":8125"
   delete_gauges = true
   delete_counters = true
   delete_sets = true
   delete_timings = true
   percentiles = []
   metric_separator = "_"
   parse_data_dog_tags = false
   datadog_extensions = false

   templates = [
        "st2.action.executions.* measurement.measurement.measurement.type",
        "st2.action.*.*.executions measurement.measurement.pack.action.field",
        "st2.amqp.pool_publisher.publish_with_retries.* measurement.measurement.measurement.measurement.field",
        "st2.amqp.publish.* measurement.measurement.measurement.field",
        "st2.*.request.method.* measurement.measurement.measurement..method",
        "st2.*.request.path.* measurement.measurement.measurement..path",
        "st2.*.response.status.* measurement.measurement.measurement.status",
        "st2.rule.*.*.* measurement.measurement.pack.rule.field",
        "st2.rule.* measurement.measurement.field",
        "st2.trigger.*.*.processed measurement.measurement.pack.name.flag",
        "st2.trigger.*.*.*.*.processed measurement.measurement.pack.name.name.name.flag",
        "st2.notifier.* measurement.measurement.notifier",
        "st2.orquesta.*.*, measurement.measurement.field.measurement",

   ]

   allowed_pending_messages = 10000
   percentile_limit = 1000
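
For completeness, a minimal Telegraf InfluxDB output section might look like the following sketch (the URL and database name are examples; adjust them to your environment):

 # InfluxDB output
 [[outputs.influxdb]]
   urls = ["http://127.0.0.1:8086"]
   database = "st2_metrics"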

Pushing metrics to Prometheus via the statsd_exporter

Prometheus provides a service called Statsd Exporter which receives data in the StatsD format and acts as a scrape target for Prometheus.

This exporter makes the st2 metrics available to Prometheus and to other monitoring solutions that can scrape Prometheus targets (e.g. Zabbix).

The README at https://github.com/prometheus/statsd_exporter provides the latest overview of metric mapping and service configuration, and https://github.com/prometheus/statsd_exporter#using-docker shows an example of how to deploy the service using the prometheus/statsd_exporter Docker image.
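
As an illustrative sketch only (the mapping rule and the resulting metric name st2_action_executions are assumptions, not part of StackStorm itself, and it assumes no metrics.prefix is configured), a statsd_exporter mapping config could turn the per-action counters into a single labeled Prometheus metric:

mappings:
  # st2.action.<pack>.<action>.executions -> st2_action_executions{pack="...", action="..."}
  - match: "st2.action.*.*.executions"
    name: "st2_action_executions"
    labels:
      pack: "$1"
      action: "$2"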

The configuration examples below rely on the statsd_exporter default port configuration:

Port | Purpose
8125/udp | receive metrics in the statsd format
9102/tcp | expose the web interface and generated Prometheus metrics

Note that Docker must expose the ports on a (public) interface if StackStorm is not running in the same containerized environment.

st2 configuration:

[metrics]
driver = statsd
# Optional prefix which is prepended to each metric key. E.g. if the prefix is
# "production" and the key is "action.executions", the actual key would be
# "st2.production.action.executions". This comes in handy when you want to
# utilize the same backend instance for multiple environments.

# statsd collection and aggregation server address
host = <statsd_exporter_address>
# statsd collection and aggregation server port
port = 8125

Prometheus configuration

Prometheus needs to know the new scrape target - the statsd exporter.

Example scrape config:

scrape_configs:
  - job_name: 'st2-statsd-metrics'
    static_configs:
      - targets:
        - "statsd-exporter:9102"

Replace statsd-exporter with the hostname or IP address of the host / container running the statsd exporter service.
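
Once Prometheus scrapes the target, the metrics can be queried. A hypothetical example, assuming the statsd_exporter default name translation (dots replaced with underscores, so st2.action.executions becomes st2_action_executions) and no custom mapping rules:

rate(st2_action_executions[5m])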