Programming, Technology

Ceph Monitoring with Telegraf, InfluxDB and Grafana

Ceph Input Plug-in

An improved Ceph input plug-in for Telegraf is at the core of how Data News Blog collects metrics to be graphed and analysed. You can follow the progress here as the code makes its way into the main release.  Eventually you too can enjoy it as much as we do.

Our transition to InfluxDB as our time-series database of choice motivated this work. Some previous posts go some way to showing why we love InfluxDB. The ability to tag measurements with context-specific data is the big win: it helps us to create simplified dashboards with less clutter which adapt dynamically.
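To illustrate that win, here is how a single tagged pool measurement might look in InfluxDB line protocol (the measurement and field names here are purely illustrative, not necessarily the plug-in's actual schema):

```
ceph_pool_stats,name=ssd-volumes read_op_per_sec=1534i,write_op_per_sec=872i
```

The pool name travels as a tag rather than being baked into the metric name, so a single dashboard query can group or filter by pool without hard-coding anything into each graph.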

The existing metrics were collected with the Ceph collector for collectd and stored in Graphite.  Like-for-like functionality was not available for Telegraf, so we decided to contribute code that met our needs.  Setting up the Ceph input plug-in for Telegraf is intended to be simple.  For those familiar with Ceph, all you need to do is make a configuration file available which can find the cluster, along with a key which provides access to it.


The following shows a typical setup.

[[inputs.ceph]]
  interval = "1m"
  ceph_user = "client.admin"
  ceph_config = "/etc/ceph/ceph.conf"
  gather_cluster_stats = true

The interval setting is fairly relaxed.  When the system is under heavy load, e.g. during recovery operations, measurement collection can take some time.  Instead of having the collection time out, we make sure that there is enough time for it to complete.  After all, the reason we want the measurements is to see what happens when these heavy operations occur; it is no good if we have no data.  This was one more reason to do this work, as the collectd plug-in fell into exactly this trap.

The ceph_user setting specifies the user to attach to the cluster as.  It allows the collector to find the access key, and optionally to pick up additional settings from the configuration file.  The key for the default user, client.admin, is found automatically by the ceph command when the plug-in runs it.  The key location can also be set in the configuration file for the user if necessary.

The ceph_config setting tells the plug-in where to find the settings for your Ceph cluster.  Normally this tells it where to contact the cluster and how to authorise the user.  Finally, the gather_cluster_stats option turns on the collection of cluster-wide measurements.


So what does the plug-in measure?  It all comes down to running the ceph command.  People who have used it before should have an idea of what it can do.  For now the plug-in collects the cluster summary, pool usage and pool statistics.

The cluster summary (ceph status) measures things like how many disks you have, whether they are in the cluster and whether they are running.  It also gives a summary of the amount of space used and available, how much data is being read and written, and the number of operations being performed.  Finally it measures the states of placement groups, so you can see how many objects are in a good state and how many need to be fixed to bring the cluster back to health.
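The underlying idea is to shell out to the ceph command with JSON output and flatten the result into metrics.  A minimal sketch of that step in Python (the real plug-in is written in Go; the field names below follow the Jewel-era JSON layout and may differ on your release):

```python
import json

def parse_cluster_status(raw):
    """Flatten the interesting parts of `ceph status -f json` output
    into a dict of metric name -> value, ready for a time-series
    database."""
    status = json.loads(raw)
    metrics = {}

    # OSD counts: total, up and in (nested under osdmap.osdmap in
    # Jewel-era output).
    osdmap = status["osdmap"]["osdmap"]
    metrics["num_osds"] = osdmap["num_osds"]
    metrics["num_up_osds"] = osdmap["num_up_osds"]
    metrics["num_in_osds"] = osdmap["num_in_osds"]

    # Space used and available across the cluster.
    pgmap = status["pgmap"]
    metrics["bytes_used"] = pgmap["bytes_used"]
    metrics["bytes_avail"] = pgmap["bytes_avail"]

    # Placement-group states, e.g. active+clean vs degraded.
    for state in pgmap.get("pgs_by_state", []):
        metrics["pg_state_" + state["state_name"]] = state["count"]

    return metrics

# The plug-in gets this JSON by running something like
# `ceph --conf /etc/ceph/ceph.conf status -f json`; a trimmed
# hand-written sample is used here to keep the sketch self-contained.
sample = '''{
  "osdmap": {"osdmap": {"num_osds": 24, "num_up_osds": 24, "num_in_osds": 23}},
  "pgmap": {"bytes_used": 500, "bytes_avail": 1500,
            "pgs_by_state": [{"state_name": "active+clean", "count": 1000},
                             {"state_name": "active+degraded", "count": 24}]}
}'''

print(parse_cluster_status(sample)["num_osds"])  # prints 24
```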

Pool usage (ceph df) shows you the amount of storage used and available per pool.  It also shows you the number of objects stored in each pool.  These measurements are tagged with the pool name.  This is useful because pools may be located on specific groups of disks, for example hard drives or flash drives, so you can monitor and manage them as logically separate entities.

Pool statistics (ceph osd pool stats), much like the global statistics, show the number of reads, writes and operations each pool is handling, but at a per-pool level.  Again these are tagged with the pool name and can be used to manage hard drives and solid-state drives independently even though they are part of the same cluster.
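Once in InfluxDB, those pool tags make dashboard queries straightforward.  A sketch in InfluxQL, assuming a measurement named ceph_pool_stats with a name tag and a read_op_per_sec field (check the plug-in's actual schema for the names it emits on your version):

```
SELECT mean("read_op_per_sec") FROM "ceph_pool_stats"
  WHERE time > now() - 1h GROUP BY "name"
```

One query like this draws a separate series per pool, so new pools appear on the dashboard without any changes to the graph definition.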

Show Me The Money

A brief look at what can be collected is all well and good, but a real-life demonstration is worth a thousand words.

Here is the plug-in running live during an operation performed recently.  The operation moved objects between servers so that we can now survive the failure of an entire rack.  This protects us against a switch failure and allows us to power off a rack to reorganise it.

Global Cluster Statistics

The top pane shows the overall cluster state.  The first graph on the left shows the state of all placement groups.  When the operation begins, groups that were clean become misplaced and must be moved to new locations.  From this we can predict how long the maintenance will take and provide feedback to our customers.  You can also see a distinct change in the angle of the graph as the SSD storage completes its recovery.  Substantially quicker, I think you’ll agree!

To the right we can see the number of groups which are degraded, e.g. which only have two object copies rather than the full three, and the number of misplaced objects.  The former is interesting in that it shows how many objects are at risk from a component failure, which would reduce the number of copies down to one.

Per-Pool Statistics

The lower pane is driven by the pool name, which is selected at the top of the page.  Here we are displaying (left to right, top to bottom) the number of client operations per second, the storage used and available, the amount of data read and written, and finally the number of objects recovering per second.

Here we can see that although the peak number of client operations is reduced, it hardly goes below the minimum seen before the operation started.  This is good news because it means we can handle the customer workload and recover without too much disruption.  Importantly, we are able to quantify the impact a similar operation is likely to have in the future.

Another interesting use would be to watch for operations, reads or writes ‘clipping’, which would mean you have reached the limits of the available devices and need to add more.  If a pool is less concerned with performance and more with the amount of data, such as a cold-storage pool, then the utilisation graph can be used to plan for the future and predict when you will need to expand.
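That kind of capacity prediction can be as simple as fitting a straight line through recent utilisation samples.  A minimal sketch in Python (real planning should account for replication overhead and non-linear growth):

```python
def days_until_full(samples, capacity):
    """Estimate how many days until a pool fills up, by fitting a
    least-squares line through (day, used) samples and projecting
    forward.  Returns None if usage is flat or shrinking."""
    n = len(samples)
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # no projected fill date
    intercept = mean_y - slope * mean_x
    # Solve capacity = slope * day + intercept for day.
    return (capacity - intercept) / slope

# One sample per day: pool grows by ~10 GB/day from 100 GB used.
usage = [(0, 100), (1, 110), (2, 120), (3, 130)]  # units: GB
print(days_until_full(usage, 500))  # 500 GB pool -> prints 40.0
```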

Summing Up

We have demonstrated the upcoming improvements to the Ceph input plug-in for Telegraf, shown what can be collected with it and how this can improve your level of service by gleaning insight into the impact of maintenance on performance, and predicting future outcomes.

As always if you like it, please try it out, share your experiences and help us to improve the experience of running a Ceph cluster for the world as a whole.  The InfluxData community is very friendly in my experience so if you want to make improvements to this or other input plug-ins give it a go!

Update 31 August 2016

As of today the patch has hit the master branch, so feel free to check out and build the latest Telegraf. Alternatively, it will be included in the official 1.1 release.