Integrating Icinga2 with InfluxDB and Grafana

Typically, when monitoring a platform for performance metrics, you will inevitably end up considering tools like Collectd or Diamond for collecting metrics and Graphite for receiving, storing and visualising them. That was state of the art three years ago, and times change rapidly in computing. I’d like to take you on a journey through how we developed our current monitoring, alerting and visualisation platform.

The Problem With Ceph…

Ceph is the new starlet on the block for scale-out, fault-tolerant storage. We’ve been operating a petabyte-scale cluster in production for well over two years now, and one of the things you soon learn is that when a journal drive fails it’s a fairly big deal. All drives reliant on that journal disk are fairly quickly removed from the cluster, which results in objects replicating to replace the lost redundancy and being redistributed across the cluster to cater for the altered topology. This process can take days to complete, depending on how much data is in the cluster, and can unfortunately have a significant impact on client performance. Luckily, we as an operator handle these situations for you in a way that minimises impact, typically as a result of being woken up at 3am!

The dream, however, is to predict that an SSD journal drive is going to fail and proactively replace it during core working hours, transparently to the client. Initially, with one vendor’s devices, we noted that I/O wait times increased quite dramatically before the device failed completely, giving plenty of notice (in the order of days) that the device should be replaced. Obviously this has a knock-on effect on storage performance, as writes to the cluster are synchronous to ensure redundancy, so it is not the best situation.

Eventually we switched to devices from another manufacturer, which last longer and offer better performance. The downside is that they no longer exhibit slow I/O before failing. They just go pop; cue a mad scramble to stop the cluster rebalancing and hastily replace the failed journal.

Can we do anything to predict failure with these devices? The answer is possibly. SMART monitoring of ATA devices allows the system to interrogate the device and pull off a number of metrics and performance counters that may offer clues as to impending failure. The existing monitoring plug-ins available with our operating system only worked with directly attached devices, so monitoring ATA devices behind a SAS expander was impossible, and they only alert when the SMART firmware itself predicts a failure, which I have never seen happen in the field! This led me to write the check-scsi-smart plug-in, which allows the vast majority of devices in our platform to be monitored, exposes every available counter individually via performance data, and raises alerts per counter based on user-provided warning and critical thresholds.

Data Collection

A while ago I made the bold (most will say sensible) statement that Nagios/Icinga was no longer fit for purpose and needed replacing with a modern monitoring platform. My biggest gripes were the reliance on things like NRPE and NSCA. The former has a frankly broken and insecure transport; the latter has none at all. When it comes to throwing potentially sensitive monitoring metrics across the public internet in plain text, these solutions were pretty much untenable.

Luckily the good folks at Icinga had been slavishly working away at a ground-up replacement for the old Nagios-based code. Icinga2 is a breath of fresh air. All communications are secured with X.509 public key cryptography and can be initiated by either end point, so they work across a NAT boundary. Hosts can monitor themselves, distributing the load across the platform, and they can raise notifications about themselves, so you are no longer reliant on a central monitoring server; check results are still propagated towards the root of the tree. Configuration is generated on a top-level master node and propagated to satellite zones and end hosts. The system is flexible, so it need not work this way, but I’ve arrived at this architecture as a best practice; a sketch of such a zone hierarchy follows.
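
To make the topology concrete, here is a minimal sketch of the zone and endpoint definitions such a setup might use. The master zone name matches the one referenced in the host definition further down; everything else (addresses, layout) is illustrative rather than a copy of our production configuration:

// Top-level master zone; configuration is generated here.
object Endpoint "icinga2.example.com" {
  host = "icinga2.example.com"
}

object Zone "icinga2.example.com" {
  endpoints = [ "icinga2.example.com" ]
}

// Each monitored host runs its own checks in its own zone,
// parented to the master so results propagate up the tree.
object Endpoint "ceph-osd-0.example.com" {
  host = "10.10.112.156"
}

object Zone "ceph-osd-0.example.com" {
  endpoints = [ "ceph-osd-0.example.com" ]
  parent = "icinga2.example.com"
}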

For me the real genius is how service checks are applied to hosts. Consider the following host definition:

object Host "ceph-osd-0.example.com" {
  import "satellite-host"

  address = "10.10.112.156"
  display_name = "ceph-osd-0.example.com"
  zone = "icinga2.example.com"

  vars.kernel = "Linux"
  vars.role = "ceph_osd"
  vars.architecture = "amd64"
  vars.productname = "X8DTT-H"
  vars.operatingsystem = "Ubuntu"
  vars.lsbdistcodename = "trusty"
  vars.enable_pagerduty = true
  vars.is_virtual = false

  vars.blockdevices["sda"] = {
     path = "/dev/sda"
  }
  vars.blockdevices["sdb"] = {
     path = "/dev/sdb"
  }
  vars.blockdevices["sdc"] = {
     path = "/dev/sdc"
  }
  vars.blockdevices["sdd"] = {
     path = "/dev/sdd"
  }
  vars.blockdevices["sde"] = {
     path = "/dev/sde"
  }
  vars.blockdevices["sdf"] = {
     path = "/dev/sdf"
  }
  vars.blockdevices["sdg"] = {
     path = "/dev/sdg"
  }

  vars.interfaces["eth0"] = {
     address = "10.10.112.156"
     cidr = "10.10.112.0/24"
     mac = "00:30:48:f6:de:fe"
  }
  vars.foreman_interfaces["p1p1"] = {
     address = "10.10.104.107"
     mac = "00:1b:21:76:86:d8"
     netmask = "255.255.255.0"
  }
  vars.interfaces["p1p2"] = {
     address = "10.10.96.129"
     cidr = "10.10.96.0/24"
     mac = "00:1b:21:76:86:d9"
  }

}

Importing satellite-host inherits a number of parameters from a template that describes how to check that the host is alive, and how often; a rough sketch of such a template follows below. The zone parameter describes where this check will be performed from, e.g. the north-bound Icinga2 satellite. The vars data structure is a dictionary of key-value pairs and can be entirely arbitrary. In this example we define everything about the operating system, the architecture and machine type, and whether or not the machine is virtual. Because this configuration is generated by the Puppet orchestration software, we can inspect even more parts of the system, e.g. block devices and network interfaces. The possibilities are endless.
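
The satellite-host template itself is not reproduced in this post; as an assumption-laden sketch, it might contain something along these lines (the check command and intervals are illustrative, not our production values):

template Host "satellite-host" {
  // How host liveness is established and how often it is checked.
  check_command = "hostalive"
  check_interval = 1m
  retry_interval = 30s
  max_check_attempts = 3
}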

object CheckCommand "smart" {
  import "plugin-check-command"
  command = [ "sudo", PluginDir + "/check_scsi_smart" ]
  arguments = {
     "-d" = "$smart_device$"
  }
}

The CheckCommand object defines an executable that performs a service check. Here we define the check as having to run with elevated privileges, and give its absolute path. You can also specify potential arguments: in this case, if the smart_device macro can be expanded (Icinga2 will look in the host and service variables for a match), the option is generated on the command line along with its parameter. There is also provision to emit an option on its own, without a parameter, if need be, as sketched below.
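
To illustrate that last point, argument definitions accept a set_if condition, which emits a bare flag only when the referenced variable evaluates to true. The --verbose flag and smart_verbose variable here are hypothetical, used purely to show the mechanism rather than being options of the real plug-in:

object CheckCommand "smart-example" {
  import "plugin-check-command"
  command = [ "sudo", PluginDir + "/check_scsi_smart" ]
  arguments = {
    "-d" = "$smart_device$"
    // Hypothetical flag-only option: no parameter is appended, and the
    // flag only appears when $smart_verbose$ evaluates to true.
    "--verbose" = {
      set_if = "$smart_verbose$"
    }
  }
}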

apply Service "smart" for (blockdevice => attributes in host.vars.blockdevices) {
  import "generic-service"
  check_command = "smart"
  display_name = "smart " + blockdevice
  vars.smart_device = attributes.path
  zone = host.name
  assign where match("sd*", blockdevice)
}

The last piece of the jigsaw is the Service object. Here we are saying that, for each blockdevice/attributes pair on each host, if the block device name begins with sd then apply the service check to it. This way you write the service check once and it is applied correctly to every SCSI disk on every host, with no host-specific hacks ever involved. Much like the host definition, generic-service is a template that defines how often a check should be performed (a sketch follows below); the zone which performs the check is the host itself. The check_command selects which check to perform, as defined above, and we set vars.smart_device to the device path of the block device, which is picked up by the macro expansion in the check command as discussed earlier.
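
Like satellite-host, the generic-service template is not shown in this post; a minimal sketch, with assumed scheduling values, might be:

template Service "generic-service" {
  // Scheduling defaults inherited by every service importing this template.
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 3
}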

Time Series Data Collection

With all of that in place we now have a single pane of glass onto the current state of every SCSI device on every host. However, what we really need is to gather all of these snapshots into a database that allows us to plot the counters over time, derive trends that indicate potential disk failure, and then set alerting thresholds accordingly.

Anecdotally, we previously had Graphite’s Carbon aggregating statistics gathered via Collectd. However, with several hundred servers sending many tens of metrics a second, it wasn’t up to the task: even with local SSD-backed storage the I/O queues were constantly full to capacity. We needed a better solution, and one which looked promising was InfluxDB. Although a fledgling product still in flux, it is built to perform many operations in memory, to support clustering for horizontal scaling and to be schema-less. To illustrate, take a look at the following example from my test environment.

load,domain=angel.net,fqdn=ns.angel.net,hostname=ns,service=load,metric=load15,type=value value=0.05 1460907584

The measurement, load, is in essence a big bucket into which all metrics to do with load fit. Arbitrary pieces of metadata can be associated with a data point; here we attach the domain, fqdn and hostname tags, which are useful for organising data based on physical location. The metric tag correlates with a performance data metric returned by a monitoring plug-in, and the type tag references the field within that performance data, in this case the actual value, though it may instead represent alerting thresholds or physical limits. The value field records the actual data value, and the final parameter is the timestamp, here at second precision, although InfluxDB defaults to nanoseconds.

By arranging data like this you can ask questions such as: give me all metrics of type value from the last hour for hosts in a specific domain, grouping the data by host name. I for one find this a lot more intuitive than the existing methodologies bound up in Graphite. You can also query the metadata, asking questions like: for the load measurement, give me all possible values of hostname, which makes automatically generating dashboard fields a dream. Both are sketched below.
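
In InfluxQL those two questions translate roughly into the following queries, reusing the load example above; treat this as a sketch, as the exact syntax depends on your InfluxDB version:

-- All raw values for the load measurement over the last hour,
-- restricted to one domain and grouped by host name.
SELECT "value" FROM "load"
WHERE "type" = 'value' AND "domain" = 'angel.net' AND time > now() - 1h
GROUP BY "hostname"

-- Enumerate every host name that has reported load metrics.
SHOW TAG VALUES FROM "load" WITH KEY = "hostname"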

The missing part of this puzzle is getting performance data from Icinga2 into InfluxDB, along with all the tags which make InfluxDB so powerful. Luckily I was able to spend a few days making this a reality; although still in review at the time of writing, it looks set to be a great addition to the ecosystem.

library "perfdata"

object InfluxdbWriter "influxdb" {
  host = "influxdb.angel.net"
  port = 8086
  database = "icinga2"
  ssl_enable = false
  ssl_ca_cert = "/var/lib/puppet/ssl/certs/ca.pem"
  ssl_cert = "/var/lib/puppet/ssl/certs/icinga.angel.net.pem"
  ssl_key = "/var/lib/puppet/ssl/private_keys/icinga.angel.net.pem"
  host_template = {
    measurement = "$host.check_command$"
    tags = {
      fqdn = "$host.name$"
      domain = "$host.vars.domain$"
      hostname = "$host.vars.hostname$"
    }
  }
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      fqdn = "$host.name$"
      domain = "$host.vars.domain$"
      hostname = "$host.vars.hostname$"
      service = "$service.name$"
      fake = "$host.vars.nonexistant$"
    }
  }
}

Here’s the current state of play: it allows a connection to any port on any host, specification of the database to write to, and optional full SSL support. The powerful piece is in the host and service templates, which allow the measurement to be set, typically to the check_command, e.g. ssh or smart, along with any tags that can be derived from the host or service objects; if a value doesn’t exist, as with the fake tag above, that tag is simply not generated for the data point. Remember how we can associate all manner of metadata with a host? All of that rich data is available here to be used as tags.

Presentation Layer

Putting it all together, we need to visualise this data, and we chose Grafana. Below is a demonstration of where we are today.

The dashboard is templated on the domain, which is extracted from the InfluxDB metadata. We can then ask for all hosts within that domain, and finally all mount points on that host in that domain. This makes organising data simple, flexible and extremely powerful. Going back to my example of Ceph journals, I can now select the domain a faulty machine resides in, select the host and the disk that has failed, and then look at individual performance metrics over time to identify predictive failure indicators, which can then be fed back into the monitoring platform as alert thresholds. Luckily, I have so far been unable to test this theory, as nothing has gone pop yet.
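
For the curious, the cascading template variables behind such a dashboard can be backed by InfluxDB metadata queries of roughly this shape; the disk measurement and mountpoint tag names are assumptions for illustration, not taken from the configuration above:

-- $domain: every domain that has reported disk metrics
SHOW TAG VALUES FROM "disk" WITH KEY = "domain"

-- $hostname: hosts within the currently selected domain
SHOW TAG VALUES FROM "disk" WITH KEY = "hostname" WHERE "domain" = '$domain'

-- $mountpoint: mount points on the selected host
SHOW TAG VALUES FROM "disk" WITH KEY = "mountpoint" WHERE "hostname" = '$hostname'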

There you have it: from problem to modern, powerful solution. I hope this inspires you to have a play with these emerging technologies, come up with innovative ways to monitor and analyse your estates, and predict failures or plan for capacity trends.

Update

Quite soon after this functionality was introduced we experienced an OSD journal failure. Now to put the theory to the test…

As the graphic depicts, for the failing drive certain counters start to increase from zero before the drive fails; importantly, they increase gradually over a period of several weeks before the drive fails fatally. Crucially, we now have visibility of potential failures and can replace drives in time periods that are less likely to cause customer impact, at a healthier time of day. Failures can also be correlated with logical block addresses written, which now enables us to predict operating expenditure over the lifetime of the cluster.

Updated blog post 23 August 2016

Icinga 2.5 is now in the wild! See my updated blog post on integrating your own monitoring platform with InfluxDB and Grafana.