A while back I blogged about how to approach error logging in production multi-node environments so that you stay sane. Later on, I looked at a way to deal with application-level metrics using statsd.
One last piece of the puzzle is monitoring the health of the nodes themselves. This typically covers CPU and memory usage, disk space, and application health. Diamond collectors are great for exactly that.
Monitoring your server health
I’ve seen instances where a server went down purely because it ran out of disk space. It’s also not uncommon to have a rogue process that suddenly takes up all the CPU in spikes that are easy to miss if you don’t have the right visibility. Measuring the available memory on your servers is a good idea too, in case you hit a slow memory leak that manifests itself over time.
All these metrics should be part of the business metrics you verify when doing releases. A common approach is to have one or more canary nodes to which you deploy the latest version first. Once deployed, you can compare these metrics against the rest of the cluster before rolling out to all nodes. In an ideal world, these checks are automated and part of promoting your builds.
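The comparison step can be boiled down to a small check. Here is a sketch of such a function; the metric values would come from wherever you store them (e.g. Graphite's render API), and the tolerance is an assumption you'd tune per metric:

```python
def canary_within_tolerance(canary_value, fleet_values, tolerance=0.25):
    """Return True if the canary's metric is within `tolerance`
    (as a fraction) of the fleet average. Threshold is illustrative."""
    if not fleet_values:
        return True  # nothing to compare against
    fleet_avg = sum(fleet_values) / len(fleet_values)
    if fleet_avg == 0:
        return canary_value == 0
    return abs(canary_value - fleet_avg) / fleet_avg <= tolerance

# e.g. canary CPU load of 0.62 against a fleet around 0.55-0.60 passes,
# while a canary spiking to 2.0 would fail the check
```

A build-promotion step could run this over CPU, memory and disk metrics and refuse to promote when any of them drifts too far from the fleet.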
There’s a variety of ways to get these metrics from Unix boxes into something like Graphite. My choice has always been Diamond, mainly for its simplicity to set up.
Diamond is a python daemon that collects system metrics and publishes them to Graphite or other supported handlers. It is capable of collecting cpu, memory, network, i/o, load and disk metrics. Additionally, it features an API for implementing custom collectors for gathering metrics from almost any source.
It was originally made by Brightcove, but it’s now maintained mainly by the devs behind python-diamond.
Diamond ships with a ton of different collectors. While you have plenty to choose from, you are unlikely to need them all; the active collectors are defined via config. Here are some that I would like to highlight.
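As a sketch, enabling a collector is a matter of giving it a section in diamond.conf; the collector names and the interval below are just examples:

```
[collectors]

[[default]]
# settings shared by all collectors, e.g. how often they run (seconds)
interval = 30

[[CPUCollector]]
enabled = True

[[DiskSpaceCollector]]
enabled = True
```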
Setting up Diamond
There are multiple ways to get Diamond set up. The repo illustrates a few manual steps to get it running on a box. However, I prefer to have it run as part of my Ansible scripts so that it can be repeated and automated effortlessly.
Install Diamond via Ansible
Depending on how your code is structured, I tend to add a diamond role that takes care of installing all the dependencies, adding the configuration, and deploying any custom collectors I may have.
The dir structure of the role should look something along these lines:
```
tasks/
  main.yml
files/
  collectors/
    SomeCustomCollector1.py
    SomeCustomCollector2.py
templates/
  diamond.conf
```
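A minimal tasks/main.yml for such a role might look like the sketch below. The module choices, install method and paths are assumptions that may need adjusting for your distro, and the `notify` lines assume a `restart diamond` handler defined elsewhere in the role:

```yaml
# roles/diamond/tasks/main.yml -- illustrative sketch
- name: Install Diamond
  pip:
    name: diamond

- name: Deploy Diamond configuration
  template:
    src: diamond.conf
    dest: /etc/diamond/diamond.conf
  notify: restart diamond

- name: Deploy custom collectors
  copy:
    src: collectors/
    dest: /usr/share/diamond/collectors/custom/
  notify: restart diamond

- name: Ensure Diamond is running
  service:
    name: diamond
    state: started
    enabled: yes
```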
The role relies on a diamond configuration file. This effectively specifies which handler we’re going to use (in this instance Graphite), which collectors are enabled along with their settings, and the intervals at which they are collected. The entire file can be found here, but I’ll pick out a few key properties to watch out for.
```
# Specify that metrics are sent via the graphite handler
handlers = diamond.handler.graphite.GraphiteHandler
....
# Interval (in seconds) at which the collector modules are reloaded
collectors_reload_interval = 3600
....
# Graphite server host
host =
# Port to send metrics to
port = 2003
...
# Specify a path prefix or suffix for metric names
path_prefix = your-metrics
path_suffix = suff
```
Building custom collectors
While Diamond comes with a long list of built-in collectors, it also allows you to build your own. We’ve done this recently as we needed to monitor an endpoint on localhost and push that as a metric into Graphite. Rather than building a cron job, a custom collector was perfect for that use case. I conveniently lifted this from the original repo.
Your collector needs to inherit from diamond.collector.Collector and implement the collect method. You can check out the base class here for more info on what’s exposed to you while writing the collector.
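Put together, a collector along those lines might look like this. The endpoint URL, metric name and class name are all hypothetical; the try/except around the import is only there so the sketch can be exercised outside a box with Diamond installed:

```python
import urllib.request

try:
    import diamond.collector
    _Base = diamond.collector.Collector
except ImportError:
    class _Base(object):
        # minimal stand-in exposing the publish() call a real collector uses
        def publish(self, name, value):
            print(name, value)

class HttpHealthCollector(_Base):
    # hypothetical localhost health endpoint
    URL = "http://localhost:8080/health"

    def _check(self, url):
        """Return 1 if the endpoint answers with HTTP 200, else 0."""
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return 1 if resp.status == 200 else 0
        except Exception:
            return 0

    def collect(self):
        # Diamond calls collect() on each interval; publish() ships the
        # metric off to the configured handler (Graphite in our case)
        self.publish("health.up", self._check(self.URL))
```

Dropped into the role’s files/collectors directory and enabled in diamond.conf, this gives you a 1/0 health gauge in Graphite that you can alert and dashboard on like any other metric.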
There is also an example collector on the GitHub page of the project.