CI Whac-A-Mole – Keeping All Your Slaves In Line

We have lots of Hudson slaves in our build farm, and we like to keep a close eye on how fast the farm as a whole is running. But what about individual slaves? Is each one pulling its weight? Are some not running at all? Because both Hudson grids and Selenium grids seamlessly handle slaves joining and leaving the cluster, it can be hard to tell who’s up and who’s down. Sometimes it feels like playing a carnival game: you bring up one server only to notice a short time later that another has gone down.

These days we get really good uptime from all the servers in our farm. That isn’t because we replaced any of the machines (many are outgrown or abandoned user desktops, so not the most upstanding citizens); it’s because we monitor all of them with Nagios and Cacti. We get a Nagios email quickly when a machine goes down or just stops participating in the cluster, and it’s pretty easy to bounce it and get it working again straightaway. Cacti lets us track historical performance, which sometimes points to servers whose performance is deteriorating, or confirms a hunch that a particular machine isn’t doing well.

Specifically, we get an alert if any slave:

  • isn’t running (i.e. can’t be seen on the network);
  • is low on disk space;
  • has high CPU load; or
  • has an unusual number of processes or users.
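
To make the list above concrete, here is a minimal sketch of how a couple of those checks could be written as a Nagios-style plugin in Python. This is not the setup we actually run: the function names and the thresholds (80/90 percent disk, 1.5/3.0 load per core) are assumptions picked for illustration. The only convention borrowed from Nagios is the exit code, where 0 means OK, 1 means WARNING and 2 means CRITICAL.

```python
import os
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warn, crit):
    """Map a measurement to a Nagios status; higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu():
    """One-minute load average divided by core count (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def worst(statuses):
    """Overall status is the most severe of the individual checks."""
    return max(statuses.values())


def main():
    # Thresholds are illustrative assumptions, not recommended values.
    statuses = {
        "disk": classify(disk_used_percent("/"), warn=80, crit=90),
        "load": classify(load_per_cpu(), warn=1.5, crit=3.0),
    }
    print(", ".join(f"{name}={code}" for name, code in sorted(statuses.items())))
    return worst(statuses)
```

A wrapper script would end with `sys.exit(main())` so that Nagios sees the status code. Process and user counts could follow the same pattern: measure a number, then pass it through `classify` with its own thresholds.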

Further, we make sure our Selenium slaves are listening on the right ports, and that Hudson slaves are running the right process (using a Hudson plugin we wrote).
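
The port check is the easiest of these to sketch. The snippet below is an illustration rather than our actual check: it simply tries to open a TCP connection and reports whether anything answered. The host names in the usage note and the port number 4444 are assumptions, so substitute whatever your Selenium slaves really use.

```python
import socket


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def down_slaves(targets, timeout=3.0):
    """Return the (host, port) pairs from `targets` that are not answering."""
    return [(host, port) for host, port in targets
            if not is_listening(host, port, timeout)]
```

For example, `down_slaves([("selenium-01", 4444), ("selenium-02", 4444)])` would return the entries that need bouncing, or an empty list when everything is up. Note that this only proves something is accepting connections on the port, not that it is a healthy Selenium process.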

What else do you monitor? What might we be missing from the list above that would keep our slaves hard at work?