Workflow for deploying potentially unstable Tucker components

When adding a Tucker component to an application, it’s important to get the workflow right so as to avoid contributing noise to our already noisy alerting. Now that support for that workflow is built into Tucker, it’s easier to get right than ever before.

Problem

Occasionally, I’ve added a Tucker component to monitor the status of an application, but when I deployed it to a realistic environment, I discovered an assumption I made about the system didn’t hold. The component raised a warning unexpectedly, and contributed to the spurious noise across our alerting systems. I wanted to find a way to develop new monitoring components that could be introduced without causing this noise.

Examples of the kind of monitoring where this could be an issue include:

  • checking a business rule holds by querying the database
  • monitoring the availability of a service required by the application
  • ensuring background tasks are running and completing successfully

In each case it’s possible that you assumed some behaviour while developing and testing your new component locally. Perhaps data in the database already violates the business rule unexpectedly, or the service is less reliable than you thought, or background tasks don’t complete in the time you expect them to. The component may or may not be alerting correctly, but until you can achieve a clean slate, it will be causing alerting errors in Nagios. Leaving components in this state for any length of time is undesirable as it leads to:

  • other alerts from the same application being hidden[0]
  • alert fatigue (aka the “boy who cried wolf” effect)
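The first effect can be made concrete with a small sketch. It assumes, as a simplification, that the status page reports the most severe status of its components and that the alerting system only sees that single value. The enum and method names are hypothetical stand-ins, not Tucker’s API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a monitoring status, ordered from least to most severe.
enum Severity { INFO, OK, WARNING, CRITICAL }

class StatusPageSketch {
    // Assumption: the page-level status is simply the worst component status.
    static Severity pageStatus(List<Severity> components) {
        return Collections.max(components);
    }

    public static void main(String[] args) {
        // A noisy new component is already stuck in WARNING.
        Severity before = pageStatus(Arrays.asList(Severity.OK, Severity.WARNING, Severity.OK));
        // A second, genuine problem now appears in another component.
        Severity after = pageStatus(Arrays.asList(Severity.WARNING, Severity.WARNING, Severity.OK));
        // The page-level status is the same in both cases, so nothing new is visible from outside.
        System.out.println(before + " -> " + after);
    }
}
```

Under that assumption, the second problem is invisible to anything that only watches the page-level status.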

Solution

A better workflow is to deploy the Tucker component such that it will make the same checks as usual, and display the same result, but never change to a state that causes an alert. This way you can observe how it behaves in a realistic environment, without it causing any alerts, until you decide it is “stable”.

This has been a reasonably easy workflow to follow. It’s not too much effort to write and test a component that behaves as usual and then, when adding it to the application’s startup configuration, set it up in such a way that the status will always be INFO or OK. Once you’re happy the component is stable, a minor code change “switches it on” so that it can begin alerting as you intended it to. This workflow has now been incorporated into the Tucker library, so it’s even easier.

How to use

When adding your component to the Tucker configuration, decorate with a PendingComponent, like so:

import com.timgroup.tucker.info.component.pending.PendingComponent;

tucker.addComponent(new PendingComponent(new PotentiallyUnstableComponent()));

The PendingComponent will always return Status.INFO, regardless of the actual status, preventing premature warnings. However, it will display the value and the real status from the underlying component, as shown here:

[Image: pending-component]

In the case of the less-reliable-than-expected service, we would not want to spend all day refreshing our application’s status page in a browser just to see changes in the Tucker component, so it’s helpful to log any state changes that would have occurred. That allows you to look back after the component has spent some time “soaking”, to see how it would have alerted. PendingComponent logs a structured event, viewable in Kibana, exemplified by the following:

{
   "eventType":"ComponentStateChange",
   "event":{
     "id":"potentially-unstable-component",
     "label":"Potentially Unstable Component",
     "previousStatus":"OK",
     "previousValue":"something which warrants a ok status",
     "currentStatus":"CRITICAL",
     "currentValue":"something which warrants a critical status"
   }
}
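To make the decorator idea concrete, here is a minimal, self-contained sketch of the behaviour described above: the outward status is pinned to INFO, the real status is shown in the value, and changes in the real status are recorded. All of the types here (SketchStatus, SketchReport, SketchComponent, PendingSketch) are hypothetical stand-ins written for illustration. They are not Tucker’s classes, and an in-memory list stands in for the structured log event.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; Tucker's real classes are not reproduced here.
enum SketchStatus { INFO, OK, WARNING, CRITICAL }

final class SketchReport {
    final SketchStatus status;
    final String value;

    SketchReport(SketchStatus status, String value) {
        this.status = status;
        this.value = value;
    }
}

interface SketchComponent {
    SketchReport report();
}

// Decorator: always reports INFO, but shows and records what the wrapped component really said.
final class PendingSketch implements SketchComponent {
    private final SketchComponent wrapped;
    private final List<String> stateChanges = new ArrayList<>();
    private SketchReport previous;

    PendingSketch(SketchComponent wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public SketchReport report() {
        SketchReport current = wrapped.report();
        // Record a change only when the real status differs from the last one seen.
        if (previous != null && previous.status != current.status) {
            stateChanges.add(previous.status + " -> " + current.status);
        }
        previous = current;
        // The outward status is pinned to INFO; the real status travels in the value.
        return new SketchReport(SketchStatus.INFO,
                current.value + " (actual status: " + current.status + ")");
    }

    List<String> stateChanges() {
        return stateChanges;
    }
}

class PendingSketchDemo {
    public static void main(String[] args) {
        // A fake component whose real status goes from OK to CRITICAL.
        SketchStatus[] real = { SketchStatus.OK, SketchStatus.CRITICAL };
        int[] call = { 0 };
        PendingSketch pending = new PendingSketch(
                () -> new SketchReport(real[Math.min(call[0]++, real.length - 1)], "checked"));
        pending.report();
        SketchReport second = pending.report();
        System.out.println(second.status + ": " + second.value);
        System.out.println(pending.stateChanges());
    }
}
```

In this sketch a change is only noticed when something asks for a report. How and when the real PendingComponent detects changes is not covered here.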

Once your component is stable, and you’re confident that a WARNING or CRITICAL status will be real, you can easily “undecorate” it by removing the PendingComponent wrapper. Assuming you have tested your component directly, it should be the only code change you need to make at this point.


PendingComponent has been available in Tucker since version 1.0.326.


[0] a known weakness of the Nagios/Tucker monitoring system. Nagios is limited to only considering one status from an entire Tucker status page. This means if your component is in a WARNING state, it will only alert once, even if other components change to WARNING.

Distributed Pair Programming @ TIM Group with Saros

There’s a particular technology used within TIM Group that has been too useful for too long to continue to go uncelebrated. The tool is called “Saros”, an Eclipse plugin for distributed pair programming that we use extensively.

Why We Need A Distributed Pairing Solution

We have developers spread across London and Boston offices, and a few “Remoties” working from home in various far-flung lands. As one of said Remoties, I believe I would not have the opportunity to continue to work at TIM Group and live where I want to live (hint: not London) if it were not for Saros.

Having been entirely colocated just a few years ago, our choice to use pair programming by default didn’t really have technical limitations. With an extra seat, keyboard and mouse at every developer’s workstation, the barrier to pairing was low. Just pull up a chair and pair. When we first started having distributed developers, we didn’t want to give up on pairing for technical reasons, so the search began for how to adapt pair programming to a distributed environment.

Why Other Solutions Don’t Match Up

An oft-cited solution for this problem is to use a screen sharing or remote access technology. There’s plenty to choose from: Windows’ Remote Desktop; TeamViewer; LogMeIn; GoToMyPC; VNC; NX; etc. However, we found them all far too limited for pairing. Just like colocated pairing, there’s only one cursor, so only one person can use the keyboard and mouse at any time. Unlike colocated pairing, there’s a large bias towards the person hosting the session, as they get much faster feedback from typing, since they don’t have to suffer the latency. When it’s (figuratively) painful to type, it’s much easier to shy away from being the “driver”, which hinders the collaboration. We never found a solution that reduced that latency to an acceptable level.

Another cited solution is to share a terminal session with a tool like tmux, which does not have the overhead of screen sharing. The feedback when typing is much better; however, there’s one major drawback: being limited to terminal editors only. I’ve seen people whose terminal environments are well suited for developing in a language like Ruby or JavaScript. Those of us coding in Java and Scala didn’t want to give up the powerful features we appreciate in our IDE, so tmux was not suitable.

Saros To The Rescue

Thankfully, we discovered Saros, a research project from Freie Universität Berlin. The most relatable way I’ve found to describe it is as:

Google Docs within the Eclipse IDE

It works by connecting two or more developers through Eclipse, so that when Alice enters some text, Bob sees it appear in his editor. The experience for both users is as if they were editing files locally[0]. Rather than sharing the image of a screen, edit commands are serialised and sent over the wire, changing the other participant’s local copy. This comes with several other benefits over remote access technologies:

  • the feedback when typing is instant, for both parties
  • the latency for seeing your partner’s keystrokes is much lower than when transmitting an image of the screen[1]
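The idea of exchanging edit commands rather than screen images can be illustrated with a much-simplified sketch. This is not Saros’s implementation: it handles only concurrent text insertions between two participants, and the class and method names are made up for the example. Each side applies its own edit straight away, then adjusts the position of the incoming remote edit so that both copies end up identical.

```java
// A single "insert text at position" edit command, tagged with the site that produced it.
final class InsertOp {
    final int position;
    final String text;
    final int site; // used only to break ties when both sites insert at the same position

    InsertOp(int position, String text, int site) {
        this.position = position;
        this.text = text;
        this.site = site;
    }

    // Apply this command to a local copy of the document.
    String applyTo(String document) {
        return document.substring(0, position) + text + document.substring(position);
    }

    // Adjust this command so it can be applied after a concurrent command has already been applied.
    InsertOp transformAgainst(InsertOp other) {
        boolean otherGoesFirst = other.position < position
                || (other.position == position && other.site < site);
        return otherGoesFirst
                ? new InsertOp(position + other.text.length(), text, site)
                : this;
    }
}

class EditSyncSketch {
    public static void main(String[] args) {
        String shared = "abc";
        InsertOp alice = new InsertOp(1, "X", 1);
        InsertOp bob = new InsertOp(2, "Y", 2);

        // Each side applies its own edit immediately, then the transformed remote edit.
        String aliceCopy = bob.transformAgainst(alice).applyTo(alice.applyTo(shared));
        String bobCopy = alice.transformAgainst(bob).applyTo(bob.applyTo(shared));

        // Both copies should now hold the same text.
        System.out.println(aliceCopy + " " + bobCopy);
    }
}
```

Because the local edit never waits for the network, typing feels instant for both parties. Only the small edit command has to cross the wire.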

There are even benefits over colocated pairing:

  • neither the host nor guest has to leave the comfort of their familiar work environment to pair; you can set up fonts and shortcuts however you like
  • since you are both editing as though it was local, each participant has their own cursor, and can be editing different files, allowing ad-hoc splitting of tasks[1], which can be very handy
  • you can have more people involved (up to 5) which we’ve used for code review and introducing a project to a group of people, sparing the discomfort of hunching around a single desk

There are other distributed pairing solutions, such as Cloud9 IDE, and Kobra.io, but none (that we’ve found) that let you stick with your own IDE setup, retaining parity with how you would develop locally.

There are of course drawbacks to distributed pairing, regardless of technology, which I’m not going to cover here; they’re generally not problems Saros, or any technology, will be able to solve.

IntelliJ Support Is On The Roadmap

After a long search, we have not found anything that really compares with Saros. For me, it really is a killer feature of Eclipse. So much so that I’ve basically kept IntelliJ in a dusty old drawer, even when it would be better for certain tasks (like Scala development). Fortunately, it looks like that’s about to change. Work has just begun to port the Saros platform to IntelliJ, which is really exciting. Whenever I’ve talked to people about Saros, the question of support for other IDEs inevitably arises. If the core Saros technology were agnostic of the IDE, it could be a huge leap forward for collaborative development in general.

At TIM Group we were so keen on the idea that a handful of us spent a “hack week” throwing together the first steps towards an IntelliJ plugin for Saros. We were able to demonstrate a proof-of-concept, but didn’t get anything truly viable. Having brought this effort to the attention of the Saros team, I hope that in some small way it inspired them to start work on it, but I doubt that’s something we can take credit for. Hopefully, during the development of the IntelliJ plugin there will be something that we can contribute, and give something back for our many hours of happy usage.

If you’re looking for an answer to the problem of distributed pairing, I heartily endorse Saros!


[0] for more details of the theory behind this aspect of collaborative editors, see http://en.wikipedia.org/wiki/Operational_transformation
[1] we have acquired a habit of being vocal about whether we are in Driver/Navigator mode, and if your pair is following you, since you can’t assume they are

TopicalJS: A rose amongst the thorns?

We have a codebase in semi-retirement to which we occasionally have to make limited changes. With active codebases, you are continuously evaluating the state of the art and potentially upgrading/switching third-party libraries as better solutions become available. With inactive codebases, you live with decisions that were made years ago and which are not cost-effective to improve. However, not all old code is bad code and there are some really excellent patterns and home-grown tools within this codebase that have never seen light outside of TIM Group.

It was a great pleasure, therefore, to be able to extract one such gem and make it open-source during the last round of maintenance programming on this codebase. We’ve called it TopicalJS, mostly because this was one of the few words left in the English language not already used for a JavaScript library. Topical is concerned with the management of events within a client-side environment (or indeed server-side if you run server-side JavaScript).

This old codebase uses Prototype and YUI on the front-end and a custom event-passing system internally, (not very inspiringly) called the “MessageBus”. Our newer codebases use Underscore, jQuery, and Backbone. Backbone comes with its own event system which is built into every view, model, and collection. You can raise an event against any of these types or you can just use a raw Backbone Event instance and use it to pass around events.

Without Backbone, and in fact before Backbone existed, we invented our own system for exchanging events. Unlike Backbone, all it does is exchange events, so it could even be used to complement Backbone’s event system. Its best feature is that it encourages you to create a single JavaScript file containing all of the events that can be fired, who consumes them, and how they get mapped onto actions and other events. This is effectively declarative event programming for JavaScript, which I think might be unique.

You use it by creating a bus and then adding modules to that bus. These modules declare what events they publish and what events they are interested in receiving. When a module publishes an event it is sent to every module, including the one that published it. Then, if that module subscribes to that event type, its subscription function will be called along with any data associated with the event. Events can be republished under different aliases, and multiple events can be awaited before an aggregated event is fired, allowing easy coordination of multiple different events.

As an example, this is what bus configuration code might look like.

topical.MessageBus(
  topical.Coordinate({ 
    expecting: ["leftTextEntered", "rightTextEntered"], 
    publishing: "bothTextsEntered" }),

  topical.Republish({ 
    subscribeTo: "init", 
    republishAs: [ "hello", "clear"] }),
  topical.Republish({ 
    subscribeTo: "reset", 
    republishAs: "clear" }),

  topical.MessageBusModule({
    name: "Hello",
    subscribe: {
      hello: function() {
        alert("This alert is shown when the hello event is fired");
      }
    }
  }))

The Coordinate module waits for two different text boxes to be filled in before firing an event saying that they’re both present. The Republish modules raise the hello and clear events, ultimately causing an annoying alert to be shown, as well as providing another module the ability to react to the clear event and flush out any old data.

Full documentation and a worked example are available here: https://github.com/youdevise/topicaljs.

Feedback or contributions are most welcome.

High Availability Scheduling with Open Source and Glue

We’re interested in community feedback on how to implement scheduling. TIM Group has a rather large number of periodically recurring tasks (report generation, statistics calculation, close price processing, and so on). Historically, we have used a home grown cyclic processing framework to schedule these tasks to run at the appropriate times. However, this framework is showing its age, and its deep integration with our TIM Ideas and TIM Funds application servers doesn’t work well with our strategy of many new and small applications. Looking at the design of a new application with a large number of scheduled tasks, I couldn’t see it working well within the old infrastructure. We need a new solution.

To provide a better experience than our existing solution, we have a number of requirements:

  • Reliability, resiliency, and high availability: we can’t afford to have jobs missed, we don’t want a single point of failure, and we want to be able to restart a scheduling server without affecting running jobs.
  • Throttling and batching: there are a number of scheduled tasks that perform a large amount of calculations. We don’t want all of these executing at once, as it can overwhelm our calculation infrastructure.
  • Alerting and monitoring: we need to have a historical log of tasks that have been previously processed, and we need to be alerted when there are problems (jobs not completing successfully, jobs taking too long, and so on). We have an existing infrastructure for monitoring and alerting based on Logstash, Graphite, Elasticsearch, Kibana and Nagios, so it would be nice if the chosen solution can easily integrate with those tools.
  • Price: we’re happy to pay for commercial software, but obviously don’t want to pay over-the-top for enterprise features we don’t need.
  • Simple and maintainable: our operations staff need to be able to understand the tool to support and maintain it, and our users must be able to configure the schedule. If it can fit in with our existing operating systems and deployment tools (MCollective and Puppet running on Linux), all the better.

There is a large amount of job scheduling software out there. What to choose? It is disappointing to see that most of the commercial scheduling software is hidden behind brochure-ware websites and “call for a quote” links. This makes it very difficult to evaluate in the context of our requirements. There were lots of interesting prospects in the open-source arena, though it was difficult to find something that would easily cover all our requirements:

  • Cron is a favourite for a lot of us because of its simplicity and ubiquity, but it doesn’t have high availability out of the box.
  • Jenkins is also considered for use as a scheduler, as we have a large build infrastructure based on Jenkins, but once again there is no simple way to make it highly available.
  • Chronos looked very interesting, but we’re worried by its apparent complexity – it would be difficult for our operations staff to maintain, and it would introduce a large number of new technologies we’d have to learn.
  • Resque looked like a very close match to our requirements, but there is a small concern over its HA implementation (in our investigation, it took 3 minutes for fail-over to occur, during which some jobs might be missed).
  • There were many others we looked at, but they were rejected because they appeared overly complex, unstable or unmaintained.

In the end, Tom Anderson recognised a simpler alternative: use cron for scheduling, with rcron and keepalived for high availability and fail-over, along with MCollective, RabbitMQ and curl for job distribution. Everything is connected with some very simple glue adaptors written in Bash and Ruby script (which also hook in with our monitoring and alerting).

Schmeduler Architecture

This architecture§ fulfils most of our requirements:

  • Reliability, resiliency, and high availability: the cron server does no work locally, so it is effectively a very small server instance that offloads all of its work, which should make it much more stable. rcron and keepalived provide high availability.
  • Throttling and batching: available through the use of queues and MCollective
  • Alerting and monitoring: available through the adaptor scripts
  • Price: the software is free and open source, though there is an operations and development maintenance cost to the adaptor code.
  • Simple and maintainable: except for rcron, all these tools are already in use within TIM Group and well understood, so there is a very low learning curve.
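The throttling point comes down to putting a queue between the scheduler and a bounded set of workers. The following in-process Java sketch illustrates only that idea: a fixed thread pool with its internal queue stands in for the message queue and the machines consuming from it, and the class and method names are invented for the example. It does not reflect the actual adaptor scripts.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThrottleSketch {
    // Queues jobCount jobs for a pool of maxConcurrent workers and returns
    // the highest number of jobs that were ever running at the same time.
    static int runJobs(int jobCount, int maxConcurrent) {
        ExecutorService workers = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < jobCount; i++) {
            // Submitting is cheap: the job waits in the pool's queue until a worker is free.
            workers.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(10); // stand-in for an expensive calculation
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // Eight jobs fire "at once", but at most two are ever running together.
        System.out.println("peak concurrency: " + runJobs(8, 2));
    }
}
```

The scheduler can fire as many jobs as it likes at the same moment. The number of workers, not the schedule, decides how much load reaches the calculation infrastructure.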

We’re still investigating the viability of this architecture. Has anyone else had success with this type of scheme? What are the gotchas?

§Yes, the actor in that diagram is actually Cueball from xkcd.

Dip Your Toe In Open (Source) Waters

One of the qualities that TIM Group look for when filling vacancies is an interest in contributing to open source projects. We think that when a candidate gets involved in open source, it indicates a passion for software development. If you’re like me, at some point, you have wanted to join a project. Perhaps you wanted to improve your skills, try a different technology, or brighten up your CV. But alas, you didn’t know the best way to get started.

I am trying to provide that exact opportunity, in conjunction with RecWorks’ “Meet-A-Project” scheme. I have an open source project, called Mutability Detector, with issues and features waiting to be completed, specifically earmarked for newcomers to the project. I promise a helpful, friendly environment, in which to dip your toe in open (source) waters.

If you want to know more details, head over to the project blog for a description on the how and why of getting involved.