High Availability Scheduling with Open Source and Glue

We’re interested in community feedback on how to implement scheduling. TIM Group has a rather large number of periodically recurring tasks (report generation, statistics calculation, close price processing, and so on). Historically, we have used a home grown cyclic processing framework to schedule these tasks to run at the appropriate times. However, this framework is showing its age, and its deep integration with our TIM Ideas and TIM Funds application servers doesn’t work well with our strategy of many new and small applications. Looking at the design of a new application with a large number of scheduled tasks, I couldn’t see it working well within the old infrastructure. We need a new solution.

To provide a better experience than our existing solution, we have a number of requirements:

  • Reliability, resiliency, and high availability: we can’t afford to have jobs missed, we don’t want a single point of failure, and we want to be able to restart a scheduling server without affecting running jobs.
  • Throttling and batching: there are a number of scheduled tasks that perform a large amount of calculations. We don’t want all of these executing at once, as it can overwhelm our calculation infrastructure.
  • Alerting and monitoring: we need to have a historical log of tasks that have been previously processed, and we need to be alerted when there are problems (jobs not completing successfully, jobs taking too long, and so on). We have an existing infrastructure for monitoring and alerting based on Logstash, Graphite, Elastic Search, Kibana and Nagios, so it would be nice if the chosen solution can easily integrate with those tools.
  • Price: we’re happy to pay for commercial software, but obviously don’t want to pay over-the-top for enterprise features we don’t need.
  • Simple and maintainable: our operations staff need to be able to understand the tool to support and maintain it, and our users must be able to configure the schedule. If it can fit in with our existing operating systems and deployment tools (MCollective and Puppet running on Linux), all the better.

There is a large amount of job scheduling software out there. What to choose? It is disappointing to see that most of the commercial scheduling software hidden behind brochure-ware websites and “call for a quote” links. This makes it very difficult to evaluate in the context of our requirements. There was lots of interesting prospects in the open-source arena, though it is difficult finding something that would easily cover all our requirements:

  • Cron is a favourite for a lot of us because of its simplicity and ubiquity, but it doesn’t have high availability out of the box.
  • Jenkins is also considered for use as a scheduler, as we have a large build infrastructure based on Jenkins, but once again there is no simple way to make it highly available.
  • Chronos looked very interesting, but we’re worried by its apparent complexity – it would be difficult for our operations staff to maintain, and it would introduce a large number of new technologies we’d have to learn.
  • Resque looked like very close match to our requirements, but there is a small concern over it’s HA implementation (in our investigation, it took 3 minutes for fail-over to occur, during which some jobs might be missed)
  • There were many others we looked at, but were rejected because they appeared overly complex, unstable or unmaintained.

In the end, Tom Anderson recognised a simpler alternative: use cron for scheduling, with rcron and keepalived for high availability and fail-over, along with MCollective, RabbitMQ and curl for job distribution. Everything is connected with some very simple glue adaptors written in Bash and Ruby script (which also hook in with our monitoring and alerting).

Schmeduler Architecture

This architecture§ fulfils pretty much most of our requirements:

  • Reliability, resiliency, and high availability: Cron server does no work locally, and so is effectively a very small server instance that offloads all of its work ensure that it is much more stable. rcron and keepalived provide high availability.
  • Throttling and batching: available through the use of queues and MCollective
  • Alerting and monitoring: available through the adaptor scripts
  • Price: software is free and open source, though there is a operations and development maintenance cost to the adaptor code.
  • Simple and maintainable: except for rcron, all these tools are already in use within TIM Group and well understood, so there is a very low learning curve.

We’re still investigating the viability of this architecture. Has anyone else had success with this type of scheme? What are the gotchas?

§Yes, the actor in that diagram is actually Cueball from xkcd.

The Summit is Just a Halfway Point

(Title is originally a quote from Ed Viesturs.)

This past week, TIM Group held its Global Summit, where we had nearly all of our Technology, Product, and Sales folks under the same roof in London. For those who aren’t aware, we are quite globally spread. We have technologists in both London and Boston offices (as well as remote from elsewhere in the UK and France), product managers in our NYC/London/Boston offices, and sales/client services folks across those offices as well as Hong Kong and Toronto. This summit brings together teams and departments that haven’t been face-to-face in many months — some teams haven’t been in the same place since the previous summit.

On the Monday, we got together for an unconference (in the Open Spaces format.) This was a great way to kick off the week. We were able to quickly organize ourselves, setup an itinerary for the day, and get to it. We discussed what people truly thought were the most important topics. We had sessions on everything from “Where does our (product) learning go?” (about cataloging the knowledge we gain about our new products), “Deployment next” (about where we take our current infrastructure next), and many other topics across development, product management, DevOps, QA, and beyond.

For the more technically-focused of the group, the rest of the week was filled with self-organized projects. Some of these were devised in the weeks leading to the Summit — some were proposed on the day by attendees. There were so many awesome projects to be worked on last week. We had ones that ranged from investigating our company’s current technical problems (like HA task scheduling), creating tools to solve some of our simple problems (like an online Niko-niko tracker), or solve some bigger-than-just-us problems (like Intellij remote collaboration.) You should expect to see us share more about these projects on this blog. Stay tuned!

To be clear, it wasn’t all fun. We also had a football game, beer tasting, and Brick Lane dinners. Altogether, this was a great opportunity to re-establish the bonds between our teams. While there are many benefits to having distributed teams as we do, there are many challenges as well. We do many things to work past these challenges, and our Global Summit is one of the shining examples. Getting everyone face-to-face builds trust and shared experiences which helps fuels the collaboration when everyone returns to their home locations.

Drawing back to the title, I am optimistic that our teams will build on top of the code, experiences, and collaboration of the summit. We will move forward with a clear view from the Summit, and be able to move into 2014 with more great things in the form of new products, services, or even just great blog posts.

Bring the outside in: gov.uk

Recently, Jake Benilov from the Government Digital Service (GDS) team visited the TIM Group office in London. His presentation on “7 types of feedback that helped make GOV.UK awesome” was a very interesting tour of the key success factors in building the UK central government’s publishing portal.

The talk filled the whole 90 minutes time slot – partly because a lot of ground was covered, but also because we asked many questions in between. One aspect that sparked discussions even in the following days was the definition of “DevOps” in general and the question whether developers should have full access to production systems in particular. The latter seems to work well for the GDS team but is not current practice at TIM Group.

If you want to know more, please head over to Jake’s blog entry explaining the seven types of feedback and/or view the slides of the talk.