Shinken internals monitoring
Introduction
Shinken can expose many internal metrics to a statsd server, which makes it possible to monitor its performance and operation. These metrics can also help when troubleshooting issues.
The export of metrics to statsd is controlled through parameters explained in the advanced configuration.
The metrics available in statsd are described in the sections below (the metric names shown omit the configurable prefix and the hostname).
Each metric is specified by:
- Its name
- Its metric type
- A description telling which information it represents
- The type it is linked to, which may be used to filter the metrics to send to statsd through the statsd_types global attribute.
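As a rough illustration of type-based filtering, the sketch below keeps only the metrics whose linked type appears in a statsd_types value. The should_send helper is hypothetical, and treating statsd_types as a comma-separated list in which an empty value sends everything is an assumption, not documented Shinken behaviour.

```python
def should_send(metric_type, statsd_types):
    """Return True if a metric of the given type passes the filter.

    Assumption: statsd_types is a comma-separated string such as
    "perf,queue", and an empty value disables filtering.
    """
    if not statsd_types:
        return True
    allowed = {t.strip() for t in statsd_types.split(",") if t.strip()}
    return metric_type in allowed


# (metric name, linked type) pairs taken from the tables in this page.
metrics = [
    ("con-init.poller", "perf"),
    ("core.scheduler.checks.queue", "queue"),
    ("hook.tick", "perf"),
]

# Keep only the queue metrics.
selected = [name for name, mtype in metrics if should_send(mtype, "queue")]
print(selected)  # ['core.scheduler.checks.queue']
```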
Initial connection timings
Some services establish connections to continuously exchange data. The metrics below measure the time spent to establish them.
Name | Metric type | Description | Type
con-init.poller | timer | Connection time from broker to poller | perf
con-init.reactionner | timer | Connection time from broker to reactionner | perf
con-init.receiver | timer | Connection time from broker to receiver | perf
con-init.scheduler | timer | Connection time from broker/poller/reactionner to scheduler | perf
- con-init.poller
- Time spent establishing the initial session from the broker to the poller services.
- con-init.reactionner
- Time spent establishing the initial session from the broker to the reactionner services.
- con-init.receiver
- Time spent establishing the initial session from the broker to the receiver services.
- con-init.scheduler
- Time spent establishing the initial session to the scheduler services from the broker (to get the state of objects) and from the poller and the reactionner (to get actions to execute).
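To make the link between the names above and what reaches statsd more concrete, here is a minimal sketch that builds a statsd line for one sample. The `name:value|type` suffixes (ms, c, g) are the standard statsd line format; the `prefix.hostname.name` ordering, the example prefix and hostname, and both helper functions are assumptions made for illustration.

```python
import socket

# Standard statsd type suffixes for the metric types used in this page.
STATSD_SUFFIX = {"timer": "ms", "counter": "c", "gauge": "g"}


def format_metric(prefix, hostname, name, value, metric_type):
    """Build one statsd line; the prefix.hostname.name layout is assumed."""
    suffix = STATSD_SUFFIX[metric_type]
    return "%s.%s.%s:%s|%s" % (prefix, hostname, name, value, suffix)


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one line to a statsd server over UDP (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# A 12 ms connection from the broker to a poller.
line = format_metric("shinken", "broker-1", "con-init.poller", 12, "timer")
print(line)  # shinken.broker-1.con-init.poller:12|ms
```

Calling `send_metric(line)` would push the sample to a statsd server listening on the default UDP port 8125.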
Hook timings
Shinken services execute code in hooks under different conditions (at a particular point in the workflow, on a timer, on a particular event, ...). Hook events are forwarded to modules so that they can act on them. The metrics below expose the time spent executing those hooks.
Name | Metric type | Description | Type
hook.early_configuration | timer | Time spent in the early_configuration hook | perf
hook.get_new_actions | timer | Time spent in the get_new_actions hook | perf
hook.late_configuration | timer | Time spent in the late_configuration hook | perf
hook.load_retention | timer | Time spent in the load_retention hook | perf
hook.read_configuration | timer | Time spent in the read_configuration hook | perf
hook.save_retention | timer | Time spent in the save_retention hook | perf
hook.tick | timer | Time spent in the tick hook | perf
- hook.early_configuration
- The early_configuration hook is executed in the Arbiter daemon after having read the raw configuration, and before starting the deeper parsing operation. This metric exposes the time spent by Arbiter modules to react to this hook.
- hook.get_new_actions
- The get_new_actions hook is executed in the Scheduler daemon to get actions to execute from modules (actions may be checks, notifications or event handlers). This metric exposes the time spent by Scheduler modules to react to this hook.
- hook.late_configuration
- The late_configuration hook is executed in the Arbiter daemon after the deeper configuration parsing, and before validating that it is correct. This metric exposes the time spent by Arbiter modules to react to this hook.
- hook.load_retention
- The load_retention hook is executed in the Scheduler and Broker daemons to load retention data using the registered retention module. This metric exposes the time spent by the Schedulers and Brokers to load their retention data.
- hook.save_retention
- The save_retention hook is executed in the Scheduler and Broker daemons to save retention data using the registered retention module. This metric exposes the time spent by the Schedulers and Brokers to save their retention data.
- hook.tick
- All daemons send a tick event each time they finish executing a cycle of their main loop. This metric exposes the time spent by modules to react to the tick event.
HTTP communication timings
Daemons exchange data using remote execution calls based on HTTP/REST APIs. Each communication between daemons is measured on the server side. Depending on the request type, various metrics are calculated.
Name | Metric type | Description | Type
http.*.aqulock | timer | Time spent waiting for a lock (if the operation requires it) | perf
http.*.args | timer | Time spent parsing the request and its parameters | perf
http.*.calling | timer | Time spent executing the required procedure | perf
http.*.global | timer | Total time spent to execute the remote procedure | perf
http.*.json | timer | Time spent encoding the result | perf
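The phases in the table above can be pictured with the following sketch, which times each step of a server-side call separately. It is not Shinken's implementation: the timed_call and get_checks functions are hypothetical, the order of the phases is assumed, and using the procedure name in place of the `*` wildcard is also an assumption.

```python
import json
import threading
import time


def _ms_since(t0):
    """Elapsed time since t0, in milliseconds."""
    return (time.perf_counter() - t0) * 1000.0


def timed_call(func, raw_args, lock=None):
    """Run func with JSON-encoded keyword arguments and time each phase."""
    timings = {}
    start = time.perf_counter()

    # args: parse the request parameters.
    t0 = time.perf_counter()
    kwargs = json.loads(raw_args)
    timings["args"] = _ms_since(t0)

    # aqulock: wait for the lock, only if the operation requires one.
    if lock is not None:
        t0 = time.perf_counter()
        lock.acquire()
        timings["aqulock"] = _ms_since(t0)
    try:
        # calling: execute the requested procedure.
        t0 = time.perf_counter()
        result = func(**kwargs)
        timings["calling"] = _ms_since(t0)
    finally:
        if lock is not None:
            lock.release()

    # json: encode the result.
    t0 = time.perf_counter()
    encoded = json.dumps(result)
    timings["json"] = _ms_since(t0)

    # global: total time for the whole remote call.
    timings["global"] = _ms_since(start)
    return encoded, timings


def get_checks(limit):
    """Hypothetical remote procedure used only for this example."""
    return list(range(limit))


encoded, timings = timed_call(get_checks, '{"limit": 3}', threading.Lock())
for phase, value in sorted(timings.items()):
    print("http.%s.%s:%.3f|ms" % (get_checks.__name__, phase, value))
```

Because `global` covers the whole call, it is always at least as large as any single phase, which makes it a convenient first metric to look at before drilling into the others.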
Scheduling metrics
The scheduler operations are measured carefully. For each operation registered in the recurrent_works dictionary of the Scheduler daemon, both the execution time and the memory increase it generated are measured. Those values are exposed through the metrics below.
Name | Metric type | Description | Type
loop.* | timer | Time spent in the operation | perf
loop.*.mem | counter | Memory usage variation caused by the operation | perf
core.scheduler.actions.queue | gauge | The actions queue size in the Scheduler | queue
core.scheduler.checks.havetoresolvedep | gauge | The check queue size in state havetoresolvedep | queue
core.scheduler.checks.inpoller | gauge | The check queue size in state inpoller | queue
core.scheduler.checks.queue | gauge | The total check queue size in the scheduler | queue
core.scheduler.checks.scheduled | gauge | The check queue size in state scheduled | queue
core.scheduler.checks.timeout | gauge | The check queue size in state timeout | queue
core.scheduler.checks.waitconsume | gauge | The check queue size in state waitconsume | queue
core.scheduler.checks.waitdep | gauge | The check queue size in state waitdep | queue
core.scheduler.checks.zombie | gauge | The check queue size in state zombie | queue
- loop.*
- Time spent in a particular step in the scheduler workflow.
- loop.*.mem
- The memory variation caused by a particular step in the scheduler workflow.
- core.scheduler.actions.queue
- The queue of notifications and event handlers to be consumed by the reactionners.
- core.scheduler.checks.havetoresolvedep
- The number of checks in the havetoresolvedep state in the scheduler. Those checks have dependent checks that must be checked before any decision is taken.
- core.scheduler.checks.inpoller
- The number of checks in the inpoller state in the scheduler. Those checks have been fetched by a poller, and the scheduler is waiting for their results.
- core.scheduler.checks.queue
- The total queue size on the Scheduler (all states).
- core.scheduler.checks.scheduled
- The number of checks in the scheduled state in the scheduler. Those checks are waiting to be taken by a poller.
- core.scheduler.checks.timeout
- The number of checks in the timeout state in the scheduler. Those checks have been fetched by a poller, and the result did not come in time.
- core.scheduler.checks.waitconsume
- The number of checks in the waitconsume state in the scheduler. Those checks have been fetched by a poller, the result came in time and has to be processed by the Scheduler.
- core.scheduler.checks.waitdep
- The number of checks in the waitdep state in the scheduler. Those checks depend on other checks whose result is required.
- core.scheduler.checks.zombie
- The number of checks in the zombie state in the scheduler. Those checks have been fully processed and may be deleted.
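The pairing of a timer and a memory counter per scheduler operation can be sketched as follows. This is only an illustration: the measure helper and the demo_step operation name are hypothetical, and tracemalloc is used here as a stand-in for whatever memory accounting the Scheduler actually performs.

```python
import time
import tracemalloc


def measure(name, func):
    """Run func once and return samples for loop.<name> and loop.<name>.mem."""
    tracemalloc.start()
    mem_before, _ = tracemalloc.get_traced_memory()
    t0 = time.perf_counter()
    func()
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    mem_after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "loop.%s" % name: elapsed_ms,                # timer, in milliseconds
        "loop.%s.mem" % name: mem_after - mem_before,  # bytes gained or freed
    }


# Hypothetical operation that retains about 100 kB of memory.
kept = []
samples = measure("demo_step", lambda: kept.append(bytearray(100000)))
for metric, value in sorted(samples.items()):
    print(metric, value)
```

An operation whose memory sample stays positive run after run is a candidate for a leak investigation, while a large timer value points at a slow step in the scheduler loop.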
Broker specific metrics
The broker receives broks emitted by the other services to manage its internal representation of the infrastructure, and forwards broks to its modules so that they can do the same. These operations are measured and exposed through the metrics below.
Name | Metric type | Description | Type
core.broker.manage-brok | timer | Time to manage a single brok | perf
core.broker.put-to-external-queue | timer | Time to forward broks to modules | perf
core.broker.get-new-broks | timer | Time to get new broks from other services | perf
- core.broker.manage-brok
- When broks are received, they have to be decoded and integrated in the broker configuration to update its representation of the infrastructure. This metric measures the time spent to handle a single brok.
- core.broker.put-to-external-queue
- External broker modules do not benefit from broker internal state representation, and have to decode broks to do the work on their own. This metric measures the time spent to forward all the received broks to all the external modules.
- core.broker.get-new-broks
- Time to get new broks from other services.
Poller/Reactionner specific metrics
Name | Metric type | Description | Type
core.*.manage-returns | timer | Time spent by a satellite to send results to scheduler | perf
core.*.wait-ratio | gauge | To be documented | perf
core.*.timeout | gauge | To be documented | perf
core.*.worker-fork.queue-size | gauge | The checks/notifications/eventhandlers execution queue size | queue
core.*.actions.in | counter | The number of new actions got from scheduler | queue
core.*.actions.queue | gauge | The number of actions currently queued | queue
core.*.results.out | counter | The number of results returned to scheduler | queue
core.*.results.queue | gauge | The number of results currently queued | queue
- core.*.manage-returns
- Time spent by the poller or reactionners to return the execution results to the scheduler.
- core.*.wait-ratio
- To be documented
- core.*.timeout
- To be documented
- core.*.worker-fork.queue-size
- The execution queue in the poller/reactionner.
- core.*.actions.in
- The number of new actions got from scheduler.
- core.*.actions.queue
- The number of actions currently queued.
- core.*.results.out
- The number of results returned to scheduler.
- core.*.results.queue
- The number of results currently queued.
Managed objects
Services that hold configuration objects expose the number of objects they manage through the metrics below. Note that the Arbiter holds the whole configuration, but each scheduler may hold only a portion of it if multiple active schedulers are used.
- core.arbiter.commands
- The number of Command objects managed by the Arbiter
- core.arbiter.contactgroups
- The number of Contactgroup objects managed by the Arbiter
- core.arbiter.contacts
- The number of Contact objects managed by the Arbiter
- core.arbiter.hostgroups
- The number of Hostgroup objects managed by the Arbiter
- core.arbiter.hosts
- The number of Host objects managed by the Arbiter
- core.arbiter.servicegroups
- The number of Servicegroup objects managed by the Arbiter
- core.arbiter.services
- The number of Service objects managed by the Arbiter
- core.scheduler.commands
- The number of Command objects managed by the Scheduler
- core.scheduler.contactgroups
- The number of Contactgroup objects managed by the Scheduler
- core.scheduler.contacts
- The number of Contact objects managed by the Scheduler
- core.scheduler.hostgroups
- The number of Hostgroup objects managed by the Scheduler
- core.scheduler.hosts
- The number of Host objects managed by the Scheduler
- core.scheduler.servicegroups
- The number of Servicegroup objects managed by the Scheduler
- core.scheduler.services
- The number of Service objects managed by the Scheduler
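The naming pattern of the object metrics above, and the difference between what the Arbiter and a single scheduler report, can be sketched as follows. The object_count_metrics helper and the sample configuration are hypothetical and only illustrate the `core.<daemon>.<object kind>` layout listed in this section.

```python
# Object kinds listed in this section.
OBJECT_KINDS = (
    "commands", "contactgroups", "contacts", "hostgroups",
    "hosts", "servicegroups", "services",
)


def object_count_metrics(daemon, objects):
    """Map each object kind to its count under core.<daemon>.<kind>."""
    return {
        "core.%s.%s" % (daemon, kind): len(objects.get(kind, ()))
        for kind in OBJECT_KINDS
    }


# Hypothetical setup with two active schedulers: the Arbiter sees every
# host, while this scheduler only received half of them.
arbiter_objects = {
    "hosts": ["web1", "web2", "db1", "db2"],
    "commands": ["check_ping"],
}
scheduler_objects = {
    "hosts": ["web1", "db1"],
    "commands": ["check_ping"],
}

arbiter = object_count_metrics("arbiter", arbiter_objects)
scheduler = object_count_metrics("scheduler", scheduler_objects)
print(arbiter["core.arbiter.hosts"], scheduler["core.scheduler.hosts"])  # 4 2
```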