We have a farm of 15-20 continuous-integration servers (depending on how you count) whose job it is to tear apart every checkin and tell us all the bugs in it before customers see them. Hudson does a great job managing the builds and Selenium Grid lets us add and remove browser-test machines. Our biggest problem is getting fast enough responses – it’s not unusual to wait 3-4 hours or even longer before you get a full set of results from your checkin, by which time lots of others have also checked in and you’ve forgotten what you were working on.
We’ve identified three related metrics that we would like to tune:
- Wait time is like latency in a network – it's the time your build spends waiting before it gets to run on a Hudson slave. More precisely, it's `startBuildTime - firstCheckinTime` – note this includes time for predecessor builds to run, if you are using a build pipeline (as we are).
- Build time is just that: `finishBuildTime - startBuildTime`. In other words, once your build got assigned a Hudson slave, how long did it take to finish its run and give a result?
- Response time is the total time between your checkin and the email from Hudson giving you the result: `finishBuildTime - firstCheckinTime`, or equivalently `waitTime + buildTime`.
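The arithmetic behind these three definitions is simple enough to sketch in a few lines of Python (the function and timestamp values here are illustrative, not from our actual scripts):

```python
from datetime import datetime, timedelta

def build_metrics(first_checkin, start_build, finish_build):
    """Compute wait, build, and response time from the three key timestamps."""
    wait_time = start_build - first_checkin       # time queued, incl. predecessor builds
    build_time = finish_build - start_build       # time actually running on a Hudson slave
    response_time = finish_build - first_checkin  # what the developer experiences
    assert response_time == wait_time + build_time
    return wait_time, build_time, response_time

# Example: checkin at 09:00, build starts 09:45, finishes 10:30.
w, b, r = build_metrics(
    datetime(2009, 6, 1, 9, 0),
    datetime(2009, 6, 1, 9, 45),
    datetime(2009, 6, 1, 10, 30),
)
# w = 45 min of waiting, b = 45 min of building, r = 90 min total
```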
We have started gathering these three values daily for every job we run in Hudson. As we are concentrating on stability at the moment, the results are pretty flat, but we are just starting to add more Hudson slaves (which should reduce wait time) and Selenium Grid slaves (which should reduce build time for browser tests at least). We’re also using Nagios and Cacti to keep track of utilisation on all our servers. Stay tuned for the results!
Unfortunately measuring these things is quite a pain. I've had to write a gnarly Python script that parses Hudson logs – and even resorts to file modification times – to figure out `finishBuildTime`. If any readers know how to do this better (fancy Hudson plugin maybe?) I'd love some help in the comments.
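For the curious, the file-modification-time fallback amounts to taking the newest mtime under a build's directory, on the assumption that the last file written (the build log, say) is touched at the end of the run. A minimal sketch – the directory layout is hypothetical, not Hudson's actual on-disk format:

```python
import os

def finish_time_from_mtime(build_dir):
    """Approximate finishBuildTime as the newest mtime under build_dir.

    Assumes the last file the build writes is touched at the end of the run.
    Returns seconds since the epoch, or None if the directory has no files.
    """
    newest = None
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            if newest is None or mtime > newest:
                newest = mtime
    return newest
```

It is only an approximation (anything that touches a file after the build skews it), which is exactly why a proper Hudson plugin would be nicer.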