Shinken internals monitoring

Introduction

Shinken can expose many internal metrics to a statsd server, allowing you to monitor its performance and operation. These metrics may also be useful when troubleshooting issues.

The export of metrics to statsd can be controlled through the parameters explained in the advanced configuration.
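As a reminder, the export is typically enabled from the main configuration file. The sketch below uses parameter names commonly found in Shinken setups; check the advanced configuration reference for the exact names and defaults in your version, and note that the statsd_types value shown is purely illustrative:

```cfg
# Enable the export of internal metrics to statsd (illustrative values)
statsd_enabled=1
statsd_host=localhost
statsd_port=8125
# Prefix prepended to every metric name
statsd_prefix=shinken
# Restrict the export to some metric types (see the per-metric type below)
statsd_types=perf,queue
```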

The various metrics available in statsd are described in the sections below (the metric names below omit the configurable prefix and hostname).
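Concretely, the raw statsd packet combines the prefix, the hostname and the metric name, followed by the value and the statsd metric type ("ms" for timers, "g" for gauges, "c" for counters). A minimal sketch; the "shinken" prefix and "broker-1" hostname are illustrative:

```python
def statsd_packet(prefix, hostname, metric, value, metric_type):
    """Build a raw statsd packet: timers use "ms", gauges "g", counters "c"."""
    name = ".".join(part for part in (prefix, hostname, metric) if part)
    return "%s:%s|%s" % (name, value, metric_type)

# A timer sample for hook.tick with illustrative prefix and hostname:
print(statsd_packet("shinken", "broker-1", "hook.tick", 12, "ms"))
# → shinken.broker-1.hook.tick:12|ms
```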

Each metric is specified with:
- Its name
- Its metric type
- A description of the information it represents
- The type it is linked to, which may be used to filter the metrics sent to statsd through the statsd_types global attribute

Initial connection timings

Some services establish connections to continuously exchange data. The metrics below measure the time spent to establish them.

con-init.poller | timer | Connection time from broker to poller | perf
con-init.reactionner | timer | Connection time from broker to reactionner | perf
con-init.receiver | timer | Connection time from broker to receiver | perf
con-init.scheduler | timer | Connection time from broker/poller/reactionner to scheduler | perf
con-init.poller
Time spent to establish initial session from the broker to the poller services.
con-init.reactionner
Time spent to establish initial session from the broker to the reactionner services.
con-init.receiver
Time spent to establish initial session from the broker to the receiver services.
con-init.scheduler
Time spent to establish the initial session to the scheduler services, from the broker (to get object states) and from the poller and reactionner (to get actions to execute).
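These timers follow a common pattern: measure the time around the connection attempt, then emit it as a statsd timer in milliseconds. A standalone sketch, not the daemons' actual connection code; the statsd transport shown (plain UDP) and the metric name reuse are assumptions:

```python
import socket
import time

def timed_connect(host, port, statsd_sock, statsd_addr, metric):
    """Open a TCP connection and report the elapsed time as a statsd
    timer (value expressed in milliseconds)."""
    start = time.time()
    conn = socket.create_connection((host, port), timeout=5)
    elapsed_ms = (time.time() - start) * 1000
    statsd_sock.sendto(("%s:%d|ms" % (metric, elapsed_ms)).encode(), statsd_addr)
    return conn
```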

Hook timings

The Shinken services execute code in hooks under different conditions (at a particular point in the workflow, on a timer, on a particular event, ...). Hook events are forwarded to modules so they can act on them. The metrics below expose the time spent executing those hooks.

hook.early_configuration | timer | Time spent in the early_configuration hook | perf
hook.get_new_actions | timer | Time spent in the get_new_actions hook | perf
hook.late_configuration | timer | Time spent in the late_configuration hook | perf
hook.load_retention | timer | Time spent in the load_retention hook | perf
hook.read_configuration | timer | Time spent in the read_configuration hook | perf
hook.save_retention | timer | Time spent in the save_retention hook | perf
hook.tick | timer | Time spent in the tick hook | perf
hook.early_configuration
The early_configuration hook is executed in the Arbiter daemon after having read the raw configuration, and before starting the deeper parsing operation. This metric exposes the time spent by Arbiter modules to react to this hook.
hook.get_new_actions
The get_new_actions hook is executed in the Scheduler daemon to get actions to execute from modules (actions may be checks, notifications or event handlers). This metric exposes the time spent by Scheduler modules to react to this hook.
hook.late_configuration
The late_configuration hook is executed in the Arbiter daemon after the deeper configuration parsing, and before validating that it is correct. This metric exposes the time spent by Arbiter modules to react to this hook.
hook.load_retention
The load_retention hook is executed in the Scheduler and Broker daemons to load retention data using the registered retention module. This metric exposes the time spent by the Schedulers and Brokers to load their retention data.
hook.save_retention
The save_retention hook is executed in the Scheduler and Broker daemons to save retention data using the registered retention module. This metric exposes the time spent by the Schedulers and Brokers to save their retention data.
hook.tick
All daemons emit a tick event each time they complete a cycle of their main loop. This metric exposes the time spent by modules to react to the tick event.
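The dispatch behind these timers can be sketched as follows. The hook_<name> method convention mirrors how Shinken modules expose hooks, but this standalone version (and the toy module) is illustrative, not the daemons' actual dispatch code:

```python
import time

def run_hook(modules, hook_name, *args):
    """Call hook_<hook_name> on every module that defines it and
    return the total elapsed time, suitable for a statsd timer."""
    start = time.time()
    for module in modules:
        handler = getattr(module, "hook_" + hook_name, None)
        if handler is not None:
            handler(*args)
    return time.time() - start

class LogModule(object):
    """A toy module reacting only to the tick hook."""
    def __init__(self):
        self.ticks = 0
    def hook_tick(self, daemon=None):
        self.ticks += 1

mod = LogModule()
elapsed = run_hook([mod], "tick")
print(mod.ticks)  # → 1
```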

Http communication timings

Daemons exchange data using remote execution calls based on HTTP/REST APIs. Each communication between daemons is measured (server side). Depending on the request type, various metrics are calculated.

http.*.aqulock | timer | Time spent waiting for a lock (if the operation requires it) | perf
http.*.args | timer | Time spent parsing the request and its parameters | perf
http.*.calling | timer | Time spent executing the requested procedure | perf
http.*.global | timer | Total time spent to execute the remote procedure | perf
http.*.json | timer | Time spent encoding the result | perf

Scheduling metrics

The scheduler operations are measured carefully. Each operation registered in the recurrent_works dictionary of the Scheduler daemon has its execution time measured, along with the memory increase it generated. These values are exposed through the metrics below.

loop.* | timer | Time spent in the operation | perf
loop.*.mem | counter | Memory usage variation caused by the operation | perf
core.scheduler.actions.queue | gauge | The actions queue size in the Scheduler | queue
core.scheduler.checks.havetoresolvedep | gauge | The check queue size in state havetoresolvedep | queue
core.scheduler.checks.inpoller | gauge | The check queue size in state inpoller | queue
core.scheduler.checks.queue | gauge | The total check queue size in the scheduler | queue
core.scheduler.checks.scheduled | gauge | The check queue size in state scheduled | queue
core.scheduler.checks.timeout | gauge | The check queue size in state timeout | queue
core.scheduler.checks.waitconsume | gauge | The check queue size in state waitconsume | queue
core.scheduler.checks.waitdep | gauge | The check queue size in state waitdep | queue
core.scheduler.checks.zombie | gauge | The check queue size in state zombie | queue
loop.*
Time spent in a particular step in the scheduler workflow.
loop.*.mem
The memory variation involved in a particular step in the scheduler workflow.
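The loop.* samples come from timing each recurrent operation when its period is due. A simplified model of that measurement loop, assuming a recurrent_works-like mapping of operation names to (function, period) pairs; the real Shinken structure and its memory probe differ in detail:

```python
import time

def run_recurrent_works(works, tick, get_memory):
    """Run every operation due at this tick; return per-operation
    statsd samples: a loop.<name> timer and a loop.<name>.mem counter.

    `works` maps an operation name to a (function, period) pair, and
    `get_memory` returns the current memory usage (probe left abstract)."""
    samples = []
    for name, (operation, period) in sorted(works.items()):
        if tick % period != 0:
            continue
        mem_before = get_memory()
        start = time.time()
        operation()
        samples.append("loop.%s:%.0f|ms" % (name, (time.time() - start) * 1000))
        samples.append("loop.%s.mem:%d|c" % (name, get_memory() - mem_before))
    return samples
```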
core.scheduler.actions.queue
The queue of notifications and event handlers to be consumed by the reactionners.
core.scheduler.checks.havetoresolvedep
The checks count having havetoresolvedep state in the scheduler. Those checks have dependent checks that have to be checked before taking any decision.
core.scheduler.checks.inpoller
The checks count having inpoller state in the scheduler. Those checks have been taken by a poller, and the scheduler is waiting for their results.
core.scheduler.checks.queue
The total queue size on the Scheduler (all states).
core.scheduler.checks.scheduled
The checks count having scheduled state in the scheduler. Those checks have to be taken by a poller.
core.scheduler.checks.timeout
The checks count having timeout state in the scheduler. Those checks have been taken by a poller, but the result did not come back in time.
core.scheduler.checks.waitconsume
The checks count having waitconsume state in the scheduler. Those checks have been taken by a poller; the result came back in time and has to be processed by the Scheduler.
core.scheduler.checks.waitdep
The checks count having waitdep state in the scheduler. Those checks have dependent checks whose results are required.
core.scheduler.checks.zombie
The checks count having zombie state in the scheduler. Those checks have been totally processed and may be deleted.
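These gauges amount to counting the scheduler's checks per status. A minimal sketch of that aggregation; the dictionary-based check representation is invented for illustration:

```python
from collections import Counter

def checks_gauges(checks):
    """Build the core.scheduler.checks.* gauge samples from a list of
    checks, each carrying a status such as "scheduled" or "inpoller"."""
    per_state = Counter(check["status"] for check in checks)
    samples = ["core.scheduler.checks.%s:%d|g" % (state, count)
               for state, count in sorted(per_state.items())]
    # The .queue gauge covers all states at once.
    samples.append("core.scheduler.checks.queue:%d|g" % len(checks))
    return samples

checks = [{"status": "scheduled"}, {"status": "scheduled"}, {"status": "inpoller"}]
print(checks_gauges(checks))
```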

Broker specific metrics

The broker receives broks emitted by the other services to maintain its internal representation of the infrastructure, and forwards the broks to its modules so they can do the same. These operations are measured and exposed through the metrics below.

core.broker.manage-brok | timer | Time to manage a single brok | perf
core.broker.put-to-external-queue | timer | Time to forward broks to external modules | perf
core.broker.get-new-broks | timer | Time to get new broks from the other services | perf
core.broker.manage-brok
When broks are received, they have to be decoded and integrated in the broker configuration to update its representation of the infrastructure. This metric measures the time spent to handle a single brok.
core.broker.put-to-external-queue
External broker modules do not benefit from broker internal state representation, and have to decode broks to do the work on their own. This metric measures the time spent to forward all the received broks to all the external modules.
core.broker.get-new-broks
Time to get new broks from other services.
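The two timers above differ in granularity: manage-brok is sampled once per brok, while put-to-external-queue covers a whole batch. An illustrative sketch of that difference; the brok handling callback and the module queues here are stand-ins, not the real broker API:

```python
import time
from queue import Queue

def process_broks(broks, manage_brok, external_queues):
    """Time the per-brok internal handling, then the bulk forwarding
    of the batch to external module queues; return (metric, seconds)
    pairs ready to emit as statsd timers."""
    timers = []
    for brok in broks:
        start = time.time()
        manage_brok(brok)  # update the broker's internal representation
        timers.append(("core.broker.manage-brok", time.time() - start))
    start = time.time()
    for queue in external_queues:
        for brok in broks:
            queue.put(brok)
    timers.append(("core.broker.put-to-external-queue", time.time() - start))
    return timers
```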

Poller/Reactionner specific metrics

core.*.manage-returns | timer | Time spent by a satellite to send results back to the scheduler | perf
core.*.wait-ratio | gauge | To be documented | perf
core.*.timeout | gauge | To be documented | perf
core.*.worker-fork.queue-size | gauge | The checks/notifications/eventhandlers execution queue size | queue
core.*.actions.in | counter | The number of new actions received from the scheduler | queue
core.*.actions.queue | gauge | The number of actions currently queued | queue
core.*.results.out | counter | The number of results returned to the scheduler | queue
core.*.results.queue | gauge | The number of results currently queued | queue
core.*.manage-returns
Time spent by the poller or reactionners to return the execution results to the scheduler.
core.*.wait-ratio
To be documented
core.*.timeout
To be documented
core.*.worker-fork.queue-size
The execution queue in the poller/reactionner.
core.*.actions.in
The number of new actions received from the scheduler.
core.*.actions.queue
The number of actions currently queued.
core.*.results.out
The number of results returned to the scheduler.
core.*.results.queue
The number of results currently queued.
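The in/out counters and queue gauges describe one flow: actions enter from the scheduler, results leave toward it. A toy model of that bookkeeping; the metric names reuse those above (with "poller" standing in for the satellite name), while the worker itself is faked:

```python
from collections import deque

class SatelliteQueues(object):
    """Track the counter and gauge values for a poller-like satellite."""
    def __init__(self):
        self.actions = deque()
        self.results = deque()
        self.actions_in = 0   # emitted as core.*.actions.in
        self.results_out = 0  # emitted as core.*.results.out

    def get_new_actions(self, actions):
        """New actions arriving from the scheduler."""
        self.actions_in += len(actions)
        self.actions.extend(actions)

    def execute_one(self):
        """Fake worker: turn one queued action into a result."""
        action = self.actions.popleft()
        self.results.append("result-of-%s" % action)

    def manage_returns(self):
        """Flush the results back to the scheduler."""
        self.results_out += len(self.results)
        self.results.clear()

    def gauges(self):
        return {"core.poller.actions.queue": len(self.actions),
                "core.poller.results.queue": len(self.results)}
```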

Managed objects

Services that hold configuration objects report the number of objects they manage through the metrics below. Note that the Arbiter holds the whole configuration, but the schedulers may hold only a portion of it if multiple active schedulers are used.

core.arbiter.commands
The number of Command objects managed by the Arbiter
core.arbiter.contactgroups
The number of Contactgroup objects managed by the Arbiter
core.arbiter.contacts
The number of Contact objects managed by the Arbiter
core.arbiter.hostgroups
The number of Hostgroup objects managed by the Arbiter
core.arbiter.hosts
The number of Host objects managed by the Arbiter
core.arbiter.servicegroups
The number of Servicegroup objects managed by the Arbiter
core.arbiter.services
The number of Service objects managed by the Arbiter
core.scheduler.commands
The number of Command objects managed by the Scheduler
core.scheduler.contactgroups
The number of Contactgroup objects managed by the Scheduler
core.scheduler.contacts
The number of Contact objects managed by the Scheduler
core.scheduler.hostgroups
The number of Hostgroup objects managed by the Scheduler
core.scheduler.hosts
The number of Host objects managed by the Scheduler
core.scheduler.servicegroups
The number of Servicegroup objects managed by the Scheduler
core.scheduler.services
The number of Service objects managed by the Scheduler