High Availability Scheduling with Open Source and Glue

We’re interested in community feedback on how to implement scheduling. TIM Group has a rather large number of periodically recurring tasks (report generation, statistics calculation, close price processing, and so on). Historically, we have used a home grown cyclic processing framework to schedule these tasks to run at the appropriate times. However, this framework is showing its age, and its deep integration with our TIM Ideas and TIM Funds application servers doesn’t work well with our strategy of many new and small applications. Looking at the design of a new application with a large number of scheduled tasks, I couldn’t see it working well within the old infrastructure. We need a new solution.

To provide a better experience than our existing solution, we have a number of requirements:

  • Reliability, resiliency, and high availability: we can’t afford to have jobs missed, we don’t want a single point of failure, and we want to be able to restart a scheduling server without affecting running jobs.
  • Throttling and batching: there are a number of scheduled tasks that perform a large amount of calculations. We don’t want all of these executing at once, as it can overwhelm our calculation infrastructure.
  • Alerting and monitoring: we need to have a historical log of tasks that have been previously processed, and we need to be alerted when there are problems (jobs not completing successfully, jobs taking too long, and so on). We have an existing infrastructure for monitoring and alerting based on Logstash, Graphite, Elastic Search, Kibana and Nagios, so it would be nice if the chosen solution can easily integrate with those tools.
  • Price: we’re happy to pay for commercial software, but obviously don’t want to pay over-the-top for enterprise features we don’t need.
  • Simple and maintainable: our operations staff need to be able to understand the tool to support and maintain it, and our users must be able to configure the schedule. If it can fit in with our existing operating systems and deployment tools (MCollective and Puppet running on Linux), all the better.

There is a large amount of job scheduling software out there. What to choose? It is disappointing to see that most of the commercial scheduling software hidden behind brochure-ware websites and “call for a quote” links. This makes it very difficult to evaluate in the context of our requirements. There was lots of interesting prospects in the open-source arena, though it is difficult finding something that would easily cover all our requirements:

  • Cron is a favourite for a lot of us because of its simplicity and ubiquity, but it doesn’t have high availability out of the box.
  • Jenkins is also considered for use as a scheduler, as we have a large build infrastructure based on Jenkins, but once again there is no simple way to make it highly available.
  • Chronos looked very interesting, but we’re worried by its apparent complexity – it would be difficult for our operations staff to maintain, and it would introduce a large number of new technologies we’d have to learn.
  • Resque looked like very close match to our requirements, but there is a small concern over it’s HA implementation (in our investigation, it took 3 minutes for fail-over to occur, during which some jobs might be missed)
  • There were many others we looked at, but were rejected because they appeared overly complex, unstable or unmaintained.

In the end, Tom Anderson recognised a simpler alternative: use cron for scheduling, with rcron and keepalived for high availability and fail-over, along with MCollective, RabbitMQ and curl for job distribution. Everything is connected with some very simple glue adaptors written in Bash and Ruby script (which also hook in with our monitoring and alerting).

Schmeduler Architecture

This architecture§ fulfils pretty much most of our requirements:

  • Reliability, resiliency, and high availability: Cron server does no work locally, and so is effectively a very small server instance that offloads all of its work ensure that it is much more stable. rcron and keepalived provide high availability.
  • Throttling and batching: available through the use of queues and MCollective
  • Alerting and monitoring: available through the adaptor scripts
  • Price: software is free and open source, though there is a operations and development maintenance cost to the adaptor code.
  • Simple and maintainable: except for rcron, all these tools are already in use within TIM Group and well understood, so there is a very low learning curve.

We’re still investigating the viability of this architecture. Has anyone else had success with this type of scheme? What are the gotchas?

§Yes, the actor in that diagram is actually Cueball from xkcd.

Leave a Reply

Your email address will not be published. Required fields are marked *