Metrics and Instrumentation
The StackStorm services and code base contain metrics instrumentation in various critical places. This provides better operational visibility and allows operators to detect infrastructure or deployment related issues (e.g. a long average duration for a particular action could indicate an issue with that action).
Configuring and Enabling Metrics Collection
Note
This feature was added and is available in StackStorm v2.9.0 and above.
By default, metrics collection is disabled. To enable it, you need to configure the metrics.driver option and, depending on the driver, also the metrics.host and metrics.port options in /etc/st2/st2.conf.
Right now, the only supported driver is statsd. To configure it, add the following entries to st2.conf:
[metrics]
driver = statsd
# Optional prefix which is prepended to each metric key. E.g. if prefix is
# "production" and key is "action.executions" actual key would be
# "st2.production.action.executions". This comes handy when you want to
# utilize the same backend instance for multiple environments or similar.
# statsd collection and aggregation server address
host = 127.0.0.1
# statsd collection and aggregation server port
port = 8125
After you have configured it, you need to restart all the services using st2ctl restart.
In case your statsd daemon is running on a remote server and you have a firewall configured, you also need to make sure that all the servers where StackStorm components are running are allowed outgoing access to the configured host and port.
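To quickly verify that the configured host and port are reachable from a StackStorm node, you can send a single throwaway counter in the statsd wire format over UDP. The snippet below is a minimal sketch (not part of StackStorm; the metric name st2.connectivity_test is an arbitrary example):

# send_test_metric.py - send one statsd counter over UDP
import socket

STATSD_HOST = "127.0.0.1"  # replace with your metrics.host value
STATSD_PORT = 8125         # replace with your metrics.port value

# statsd wire format: <metric name>:<value>|<type> ("c" means counter)
payload = b"st2.connectivity_test:1|c"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload, (STATSD_HOST, STATSD_PORT))
sock.close()

Keep in mind that UDP is fire-and-forget, so a successful send does not guarantee delivery; check on the statsd / Telegraf / statsd_exporter side that the metric actually arrived.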
For debugging and troubleshooting purposes, you can also set the driver to echo. This will cause StackStorm to log, at the DEBUG log level, any metrics operation which would otherwise have been performed (increasing a counter, timing an operation, etc.) without actually performing it.
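For example, in /etc/st2/st2.conf:

[metrics]
driver = echo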
For a full list of config options, see the [metrics] section in the StackStorm sample config here: https://github.com/StackStorm/st2/blob/master/conf/st2.conf.sample
Configuring StatsD
The StackStorm statsd metrics driver is compatible with any service which exposes a statsd-compatible interface for receiving metrics via UDP. This includes the original statsd service written in Node.js, but also compatible projects such as Telegraf and others. This provides a lot of flexibility and allows the statsd service to submit those metrics to a self-hosted or managed Graphite instance, or to other compatible projects and services such as InfluxDB and HostedGraphite.
Configuring those services is out of the scope of this documentation, because it is very environment-specific (aggregation resolution, retention period, etc.), but some sample configs which can help you get started with statsd and a self-hosted Graphite and carbon-cache instance can be found at https://github.com/StackStorm/st2/tree/master/conf/metrics.
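For orientation, a minimal config.js for the Node.js statsd service forwarding aggregated metrics to a local carbon-cache instance could look roughly like this. This is only a sketch assuming default Graphite ports; use the sample configs linked above as the authoritative starting point:

{
  // Port statsd listens on for incoming metrics (matches metrics.port in st2.conf)
  port: 8125,
  // Graphite / carbon-cache instance to flush aggregated metrics to
  graphiteHost: "127.0.0.1",
  graphitePort: 2003,
  // Flush aggregated metrics to Graphite every 10 seconds
  flushInterval: 10000,
  backends: ["./backends/graphite"]
}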
Exposed Metrics
Note
Various metrics documented in this section are only available in StackStorm v2.9.0 and above.
This section describes which metrics are currently exposed by various StackStorm services.
| Name | Type | Service | Description |
| --- | --- | --- | --- |
| st2.action.executions | counter | st2actionrunner | Number of action executions processed by the st2actionrunner service. |
| st2.action.executions | timer | st2actionrunner | How long it took to process (run) a particular action execution inside the st2actionrunner service. |
| st2.action.executions.calculate_result_size | timer | st2actionrunner | How long it took to update the result size in the database. |
| st2.action.executions.process.<status> | counter | st2actionrunner | Number of action executions processed by the st2actionrunner service for a particular status. |
| st2.action.executions.process.<status> | timer | st2actionrunner | How long it took to fully process a request inside st2actionrunner for a particular status. |
| st2.action.executions.update_status | timer | st2actionrunner | How long it took to update execution and live action status and result in the database. |
| st2.action.executions.update_liveaction_db | timer | st2actionrunner | How long it took to update / store the LiveActionDB model in the database. |
| st2.action.executions.update_execution_db | timer | st2actionrunner | How long it took to update / store the ActionExecutionDB model in the database. |
| st2.action.<action ref>.executions | counter | st2actionrunner | Number of action executions for a particular action processed by st2actionrunner. |
| st2.action.<action ref>.executions | timer | st2actionrunner | How long it took to process (run) an action execution for a particular action inside st2actionrunner. |
| st2.action.executions.<execution status> | counter | st2actionrunner | Number of executions in a particular state (succeeded, failed, timeout, delayed, etc.). |
| st2.rule.processed | counter | st2rulesengine | Number of rules (trigger instances) processed by the st2rulesengine service. |
| st2.rule.processed | timer | st2rulesengine | How long it took to process a particular rule (trigger instance) inside st2rulesengine. |
| st2.rule.<rule ref>.processed | counter | st2rulesengine | Number of times a particular rule was processed by st2rulesengine. |
| st2.rule.matched | counter | st2rulesengine | Number of trigger instances which matched a rule (criteria). |
| st2.rule.<rule ref>.matched | counter | st2rulesengine | Number of trigger instances which matched a particular rule (criteria). |
| st2.scheduler.handle_execution | counter | st2scheduler | Number of executions handled by st2scheduler. |
| st2.scheduler.handle_execution | timer | st2scheduler | How long it took to handle a particular execution by st2scheduler. |
| st2.trigger.<trigger ref>.processed | counter | st2rulesengine | Number of times a particular trigger was processed by st2rulesengine. |
| st2.trigger.<trigger ref>.processed | timer | st2rulesengine | How long it took to process a particular trigger inside st2rulesengine. |
| st2.orquesta.workflow.executions | counter | st2workflowengine | Number of workflow executions processed by st2workflowengine. |
| st2.orquesta.workflow.executions | timer | st2workflowengine | How long it took to process a particular workflow execution inside st2workflowengine. |
| st2.orquesta.action.executions | counter | st2workflowengine | Number of executions processed for workflow task executions by st2workflowengine. |
| st2.orquesta.action.executions | timer | st2workflowengine | How long it took to process a particular workflow task execution inside st2workflowengine. |
| st2.{auth,api,stream}.request.total | counter | st2auth, st2api, st2stream | Number of requests processed by st2auth / st2api / st2stream. |
| st2.{auth,api,stream}.request | counter | st2auth, st2api, st2stream | Number of requests processed by st2auth / st2api / st2stream. |
| st2.{auth,api,stream}.request | timer | st2auth, st2api, st2stream | How long it took to process a particular HTTP request. |
| st2.{auth,api,stream}.request.method.<method> | counter | st2auth, st2api, st2stream | Number of requests with a particular HTTP method processed by st2auth / st2api / st2stream. |
| st2.{auth,api,stream}.request.path.<path> | counter | st2auth, st2api, st2stream | Number of requests to a particular HTTP path (controller endpoint) processed by st2auth / st2api / st2stream. |
| st2.{auth,api,stream}.response.status.<status code> | counter | st2auth, st2api, st2stream | Number of requests which resulted in a response with a particular HTTP status code. |
| st2.stream.connections | gauge | st2stream | Number of open connections to the stream service. |
| st2.notifier.action.executions | counter | st2notifier | Number of action executions processed by st2notifier. |
| st2.notifier.action.executions | timer | st2notifier | How long it took to process a particular action execution by st2notifier. |
| st2.notifier.apply_post_run_policies | counter | st2notifier | Number of post run policies applied by st2notifier. |
| st2.notifier.apply_post_run_policies | timer | st2notifier | How long it took to apply post run policies processed by st2notifier. |
| st2.notifier.notify_trigger.dispatch | counter | st2notifier | Number of notify triggers dispatched by st2notifier. |
| st2.notifier.notify_trigger.dispatch | timer | st2notifier | How long it took to dispatch a notify trigger for an execution. |
| st2.notifier.notify_trigger.post | counter | st2notifier | Number of notify triggers processed by st2notifier. |
| st2.notifier.notify_trigger.post | timer | st2notifier | How long it took to process / post a notify trigger for an execution. |
| st2.notifier.generic_trigger.dispatch | counter | st2notifier | Number of generic notify triggers dispatched by st2notifier. |
| st2.notifier.generic_trigger.dispatch | timer | st2notifier | How long it took to dispatch a generic notify trigger for an execution. |
| st2.notifier.generic_trigger.post | counter | st2notifier | Number of generic notify triggers processed by st2notifier. |
| st2.notifier.generic_trigger.post | timer | st2notifier | How long it took to process a generic notify trigger for an execution. |
| st2.notifier.transform_message | timer | st2notifier | How long a “transform_message” function call took for a particular notify trigger. |
| st2.notifier.transform_data | timer | st2notifier | How long a “transform_data” function call took for a particular notify trigger. |
Depending on the metrics backend and metric type, some of those metrics will also be sampled, averaged, aggregated, converted into a rate (operations per second for counter metrics), etc.
Keep in mind that for counter metrics, statsd automatically calculates rates. If you are interested in more than a rate (events per second), you will need to derive those metrics from the raw “count” metric.
For example, if you are interested in the total number of executions scheduled or the total number of API requests in a particular time frame, you would use the integral() Graphite function (e.g. integral(stats.counters.st2.action.executions.scheduled.count) and integral(stats.counters.st2.api.requests.count)).
Example Graphite Dashboard
Below you can find code for an example Graphite dashboard which contains most of the common graphs you need for good operational visibility into a StackStorm deployment.
[
{
"target": [
"integral(stats.counters.st2.<prefix>.action.executions.count)"
],
"title": "Total Number of Action Executions",
"height": "308",
"width": "586"
},
{
"target": [
"stats.counters.st2.<prefix>.action.executions.rate"
],
"title": "Action Executions per Second",
"height": "308",
"width": "586"
},
{
"target": [
"integral(stats.counters.st2.<prefix>.action.executions.running.count)",
"sumSeries(stats.counters.st2.<prefix>.action.executions.requested.count)",
"sumSeries(stats.counters.st2.<prefix>.action.executions.pending.count)",
"sumSeries(stats.counters.st2.<prefix>.action.executions.delayed.count)",
"sumSeries(stats.counters.st2.<prefix>.action.executions.paused.count)"
],
"title": "Current Number of Action Execution in Particular State",
"height": "495",
"width": "798"
},
{
"logBase": "",
"target": [
"stats.timers.st2.<prefix>.action.executions.median"
],
"title": "Median Action Execution Duration (ms)",
"areaMode": "stacked",
"minorY": "",
"height": "469",
"width": "754"
},
{
"logBase": "",
"target": [
"stats.counters.st2.<prefix>.api.request.rate"
],
"title": "API Requests Per Second",
"areaMode": "stacked",
"minorY": "",
"height": "308",
"width": "586"
},
{
"target": [
"stats.counters.st2.<prefix>.rule.processed.rate",
"stats.counters.st2.<prefix>.rule.matched.rate"
],
"title": "Processed trigger instances and matched rules per second",
"height": "308",
"width": "586"
},
{
"target": [
"stats.counters.st2.<prefix>.api.response.status.200.rate",
"stats.counters.st2.<prefix>.api.response.status.404.rate",
"stats.counters.st2.<prefix>.api.response.status.201.rate"
],
"title": "API responses per status code per second",
"height": "308",
"width": "586"
},
{
"target": [
"stats.counters.st2.<prefix>.orquesta.action.executions.rate"
],
"title": "Orquesta Workflow and Action Executions per Second",
"height": "331",
"width": "697"
}
]
Keep in mind that some of the graphs such as “current number of executions in a particular state during a particular point in time” and “total counts for a particular execution state” are derived from the raw metric values.
Pushing metrics to InfluxDB
It is possible to gather the StatsD data with Telegraf and push it to InfluxDB. StatsD metrics are formatted differently than InfluxDB usually expects, so we can use the template feature available in the Telegraf StatsD input plugin to reformat them into something more convenient (with tags, etc.).
Configure your InfluxDB instance and the Telegraf InfluxDB output as usual, then for the StatsD input in Telegraf, you can specify the following configuration:
# Statsd UDP/TCP Server
[[inputs.statsd]]
protocol = "udp"
max_tcp_connections = 250
tcp_keep_alive = false
service_address = ":8125"
delete_gauges = true
delete_counters = true
delete_sets = true
delete_timings = true
percentiles = []
metric_separator = "_"
parse_data_dog_tags = false
datadog_extensions = false
templates = [
"st2.action.executions.* measurement.measurement.measurement.type",
"st2.action.*.*.executions measurement.measurement.pack.action.field",
"st2.amqp.pool_publisher.publish_with_retries.* measurement.measurement.measurement.measurement.field",
"st2.amqp.publish.* measurement.measurement.measurement.field",
"st2.*.request.method.* measurement.measurement.measurement..method",
"st2.*.request.path.* measurement.measurement.measurement..path",
"st2.*.response.status.* measurement.measurement.measurement.status",
"st2.rule.*.*.* measurement.measurement.pack.rule.field",
"st2.rule.* measurement.measurement.field",
"st2.trigger.*.*.processed measurement.measurement.pack.name.flag",
"st2.trigger.*.*.*.*.processed measurement.measurement.pack.name.name.name.flag",
"st2.notifier.* measurement.measurement.notifier",
"st2.orquesta.*.*, measurement.measurement.field.measurement",
]
allowed_pending_messages = 10000
percentile_limit = 1000
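With the templates above and the metric_separator of "_", a counter such as st2.action.executions.succeeded should end up in a measurement named roughly st2_action_executions, with the status stored in the type tag. Assuming Telegraf's default field naming for statsd counters (a single value field), a query along these lines can be used to sanity-check the data; adjust the names to whatever actually shows up in your database:

SELECT sum("value")
FROM "st2_action_executions"
WHERE time > now() - 1h
GROUP BY time(5m), "type"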
Pushing metrics to Prometheus via the statsd_exporter
The Prometheus project provides a service called statsd_exporter which receives data in the StatsD format and acts as a scrape target for Prometheus.
This exporter makes the st2 metrics available to Prometheus and to other monitoring solutions that are able to scrape Prometheus targets (e.g. Zabbix).
The README at https://github.com/prometheus/statsd_exporter provides the latest overview, metric mapping and service configuration, and https://github.com/prometheus/statsd_exporter#using-docker shows an example of how to deploy the service using the prometheus/statsd_exporter Docker image.
The configuration examples below rely on the statsd_exporter default port configuration:
| Port | Purpose |
| --- | --- |
| 8125/udp | Receive metrics in the statsd format |
| 9102/tcp | Expose the web interface and generated Prometheus metrics |
Note that Docker must expose the ports on a (public) interface if StackStorm is not running in the same containerized environment.
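For example, when running the exporter with the prom/statsd-exporter image from Docker Hub, the two default ports can be published like this (a sketch; adjust the container name and port mappings to your environment):

docker run -d --name statsd-exporter \
  -p 8125:8125/udp \
  -p 9102:9102 \
  prom/statsd-exporter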
st2 configuration:
[metrics]
driver = statsd
# Optional prefix which is prepended to each metric key. E.g. if prefix is
# "production" and key is "action.executions" actual key would be
# "st2.production.action.executions". This comes handy when you want to
# utilize the same backend instance for multiple environments or similar.
# statsd collection and aggregation server address
host = <statsd_exporter_address>
# statsd collection and aggregation server port
port = 8125
Prometheus configuration
Prometheus needs to know about the new scrape target, the statsd_exporter.
Example scrape config:
scrape_configs:
  - job_name: 'st2-statsd-metrics'
    static_configs:
      - targets:
          - "statsd-exporter:9102"
Replace statsd-exporter with the hostname or IP address of the host / container running the statsd_exporter service.
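Without a mapping configuration, statsd_exporter derives Prometheus metric names directly from the statsd keys (dots become underscores). If you want shorter names and proper labels, you can supply a mapping file as described in the statsd_exporter README. A minimal sketch is shown below; the Prometheus metric and label names here are illustrative choices, not something prescribed by StackStorm:

mappings:
  # e.g. st2.api.response.status.200 -> st2_response_status{service="api", status_code="200"}
  - match: "st2.*.response.status.*"
    name: "st2_response_status"
    labels:
      service: "$1"
      status_code: "$2"
  # e.g. st2.action.executions.process.succeeded -> st2_action_executions_processed{status="succeeded"}
  - match: "st2.action.executions.process.*"
    name: "st2_action_executions_processed"
    labels:
      status: "$1"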