MDI: Monitoring Driven Infrastructure?

Adam and I attended the London Infracoders Meetup last night, which featured a demo of serverspec and Beaker. When I asked Adam what he thought, he wasn’t impressed*. “I don’t see the point of this if you’re using Puppet unless you’re worried about typos or you don’t control your production monitoring. It duplicates what you already have with Puppet’s DSL, which is why we stopped using rspec-puppet in the first place.”

I realized that Adam was correct: the sort of automated tests I’m used to as a developer are functionally equivalent to the monitoring checks system administrators already use heavily. (An easy leap for me to make since Michael Bolton already convinced me that automated tests are checks, not testing.) Over time I’ve seen testing migrate from unit tests on the developer desktop to more and more tests further down the deployment pipeline. This led me to wonder how far monitoring could make the same march, but in reverse.

We already use our production monitoring in our test environments, but we don’t tend to use it in our local Puppet development environments. But why? Couldn’t we use our monitoring in TDD fashion? Write a monitoring check, see it fail, then write the Puppet code to make it pass? (This is the same motivation as using your tests as monitoring, such as with Cucumber-Nagios, but working in the other direction.)
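As a rough sketch of what the "write a check, see it fail" step could look like, here is a minimal Nagios-style check in Ruby. The method name, host and port are illustrative assumptions rather than anything we run today; the only convention relied on is the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```ruby
require 'socket'
require 'timeout'

# Standard Nagios plugin exit codes.
OK = 0
CRITICAL = 2

# Hypothetical check: is anything accepting connections on host:port?
# Returns [exit_code, message] so it can be used from a test or a plugin wrapper.
def check_tcp_port(host, port, timeout_seconds = 2)
  Timeout.timeout(timeout_seconds) do
    TCPSocket.new(host, port).close
  end
  [OK, "OK - #{host}:#{port} is accepting connections"]
rescue SystemCallError, SocketError, Timeout::Error => e
  [CRITICAL, "CRITICAL - #{host}:#{port} unreachable (#{e.class})"]
end
```

Run against a freshly built development VM, a check like this should report CRITICAL; once the Puppet code that installs and starts the service has been applied, it should flip to OK, and the same check can then be promoted to production monitoring.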

We haven’t tried this experiment yet but I’m curious if this is a pattern others have attempted, and if so, how did it work for you?

* Adam did think serverspec might be a plausible replacement for our existing NRPE checks, a way to uniformly structure our monitoring and to allow easier independent development of the checks

High Availability Scheduling with Open Source and Glue

We’re interested in community feedback on how to implement scheduling. TIM Group has a rather large number of periodically recurring tasks (report generation, statistics calculation, close price processing, and so on). Historically, we have used a home grown cyclic processing framework to schedule these tasks to run at the appropriate times. However, this framework is showing its age, and its deep integration with our TIM Ideas and TIM Funds application servers doesn’t work well with our strategy of many new and small applications. Looking at the design of a new application with a large number of scheduled tasks, I couldn’t see it working well within the old infrastructure. We need a new solution.

To provide a better experience than our existing solution, we have a number of requirements:

  • Reliability, resiliency, and high availability: we can’t afford to have jobs missed, we don’t want a single point of failure, and we want to be able to restart a scheduling server without affecting running jobs.
  • Throttling and batching: there are a number of scheduled tasks that perform a large amount of calculations. We don’t want all of these executing at once, as it can overwhelm our calculation infrastructure.
  • Alerting and monitoring: we need to have a historical log of tasks that have been previously processed, and we need to be alerted when there are problems (jobs not completing successfully, jobs taking too long, and so on). We have an existing infrastructure for monitoring and alerting based on Logstash, Graphite, Elasticsearch, Kibana and Nagios, so it would be nice if the chosen solution can easily integrate with those tools.
  • Price: we’re happy to pay for commercial software, but obviously don’t want to pay over-the-top for enterprise features we don’t need.
  • Simple and maintainable: our operations staff need to be able to understand the tool to support and maintain it, and our users must be able to configure the schedule. If it can fit in with our existing operating systems and deployment tools (MCollective and Puppet running on Linux), all the better.

There is a large amount of job scheduling software out there. What to choose? It is disappointing to see that most of the commercial scheduling software is hidden behind brochure-ware websites and “call for a quote” links. This makes it very difficult to evaluate in the context of our requirements. There were lots of interesting prospects in the open-source arena, though it was difficult to find something that would easily cover all our requirements:

  • Cron is a favourite for a lot of us because of its simplicity and ubiquity, but it doesn’t have high availability out of the box.
  • Jenkins is also considered for use as a scheduler, as we have a large build infrastructure based on Jenkins, but once again there is no simple way to make it highly available.
  • Chronos looked very interesting, but we’re worried by its apparent complexity – it would be difficult for our operations staff to maintain, and it would introduce a large number of new technologies we’d have to learn.
  • Resque looked like a very close match to our requirements, but there is a small concern over its HA implementation (in our investigation, it took 3 minutes for fail-over to occur, during which some jobs might be missed).
  • There were many others we looked at, but were rejected because they appeared overly complex, unstable or unmaintained.

In the end, Tom Anderson recognised a simpler alternative: use cron for scheduling, with rcron and keepalived for high availability and fail-over, along with MCollective, RabbitMQ and curl for job distribution. Everything is connected with some very simple glue adaptors written in Bash and Ruby script (which also hook in with our monitoring and alerting).

Schmeduler Architecture

This architecture§ fulfils most of our requirements:

  • Reliability, resiliency, and high availability: the cron server does no work locally, so it is effectively a very small server instance that offloads all of its work, which should make it much more stable. rcron and keepalived provide high availability.
  • Throttling and batching: available through the use of queues and MCollective
  • Alerting and monitoring: available through the adaptor scripts
  • Price: the software is free and open source, though there is an operations and development maintenance cost to the adaptor code.
  • Simple and maintainable: except for rcron, all these tools are already in use within TIM Group and well understood, so there is a very low learning curve.
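To give a feel for how small the glue can be, here is a hedged sketch of an adaptor in Ruby. It is not our actual adaptor code: the `run_job` name and the event fields are made up for illustration, and the block stands in for whatever really triggers the job (for example a curl call or a message published to a queue). The idea is simply that every scheduled run produces one structured event that monitoring and alerting can consume.

```ruby
require 'json'
require 'time'

# Hypothetical adaptor: run one scheduled job and describe the outcome as a
# hash that can be serialised to JSON for a log shipper to pick up.
def run_job(name, clock: -> { Time.now })
  started = clock.call
  status = 'success'
  error = nil
  begin
    yield
  rescue StandardError => e
    status = 'failure'
    error = e.message
  end
  finished = clock.call
  {
    'job' => name,
    'status' => status,
    'error' => error,
    'started_at' => started.utc.iso8601,
    'duration_seconds' => (finished - started).round(3)
  }
end

# Illustrative job name; the block would normally trigger the real work.
event = run_job('close-price-processing') { :pretend_to_trigger_the_job }
puts JSON.generate(event)
```

Because failures are captured rather than raised, cron always sees a clean exit and the alerting decision is left to the monitoring side, which is one possible design choice among several.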

We’re still investigating the viability of this architecture. Has anyone else had success with this type of scheme? What are the gotchas?

§Yes, the actor in that diagram is actually Cueball from xkcd.

The Summit is Just a Halfway Point

(Title is originally a quote from Ed Viesturs.)

This past week, TIM Group held its Global Summit, where we had nearly all of our Technology, Product, and Sales folks under the same roof in London. For those who aren’t aware, we are quite globally spread. We have technologists in both London and Boston offices (as well as remote from elsewhere in the UK and France), product managers in our NYC/London/Boston offices, and sales/client services folks across those offices as well as Hong Kong and Toronto. This summit brings together teams and departments that haven’t been face-to-face in many months — some teams haven’t been in the same place since the previous summit.

On the Monday, we got together for an unconference (in the Open Spaces format). This was a great way to kick off the week. We were able to quickly organize ourselves, set up an itinerary for the day, and get to it. We discussed what people truly thought were the most important topics. We had sessions on everything from “Where does our (product) learning go?” (about cataloging the knowledge we gain about our new products) to “Deployment next” (about where we take our current infrastructure next), and many other topics across development, product management, DevOps, QA, and beyond.

For the more technically-focused of the group, the rest of the week was filled with self-organized projects. Some of these were devised in the weeks leading to the Summit; some were proposed on the day by attendees. There were so many awesome projects to be worked on last week. They ranged from investigating our company’s current technical problems (like HA task scheduling), to creating tools to solve some of our simple problems (like an online Niko-niko tracker), to solving some bigger-than-just-us problems (like IntelliJ remote collaboration). You should expect to see us share more about these projects on this blog. Stay tuned!

To be clear, it wasn’t all fun. We also had a football game, beer tasting, and Brick Lane dinners. Altogether, this was a great opportunity to re-establish the bonds between our teams. While there are many benefits to having distributed teams as we do, there are many challenges as well. We do many things to work past these challenges, and our Global Summit is one of the shining examples. Getting everyone face-to-face builds trust and shared experiences, which help fuel the collaboration when everyone returns to their home locations.

Drawing back to the title, I am optimistic that our teams will build on top of the code, experiences, and collaboration of the summit. We will move forward with a clear view from the Summit, and be able to move into 2014 with more great things in the form of new products, services, or even just great blog posts.

Scenario testing for infrastructures

Recent advancements have allowed us to provision an entire environment with a single command.

The next major challenge facing us is how to perform updates to environments. This gives rise to an additional set of challenges (and constraints) for our automated provisioning system.

We’re working towards a provisioning system that is able to upgrade the systems serving an application by rebuilding a completely fresh environment from the ground up, then seamlessly redirecting the live traffic into the new environment. We plan to eventually do this in a fully automated fashion during business hours – continually as we make changes and improvements to the infrastructure.

One approach to being able to perform updates on the running system, whilst maintaining the “single build” provisioning model presented in the last post would be to implement Blue-Green deployments for our applications at the environment level.

Thinking about things in this mindset gives us a set of problems to solve:

  • How can we have complete confidence that our newly provisioned environment is fully operational and that no regressions have been introduced?
  • How do we know that we can exactly duplicate an environment build process in a way that we have complete confidence in? (E.g. is the QA environment really like production?)
  • Do we know the health monitoring of components and the service is functional? (Can we trust the system to monitor the right things as servers are rebuilt dynamically?)
  • Are load balancing and redundancy / high availability requirements met under the conditions of unexpected hardware failures? (Not only do the application environments have to be highly available, the provisioning system itself has to cope with failure gracefully)


Let’s start by talking about confidence, and how our MVP for automated confidence is insufficient. Our initial thought process went something like this:

  • We have monitoring for every machine comprising the environment – right?
  • Our monitoring should be able to tell us whether the machines we just provisioned are working (i.e. give us confidence in the state of the environment) – awesome.
  • If we automatically verify that our per host monitoring is “all green”?
  • We’re confident everything is working, and we can put the environment into production – right?
  • So we need an automated API to access the environment-specific monitoring pieces, so that we can assert that all the checks for this specific application are correctly provisioned and healthy.

It was a relatively simple job to extend the mcollective nrpe agent so that, from our provisioning model, we have a rake task which remotely executes our NRPE checks on the hosts in an environment.
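The assertion side of such a task can be very small. The following Ruby sketch assumes the remote execution has already happened and returned one exit code per check per host; the data shape, host names and method names are hypothetical, and the MCollective call itself is deliberately left out.

```ruby
# Exit code 0 means OK in the Nagios/NRPE plugin convention.
NRPE_OK = 0

# results: { hostname => [ { check: 'name', exit_code: Integer }, ... ] }
# Returns hostname => names of failing checks (empty hash when all green).
def failing_checks(results)
  results.each_with_object({}) do |(host, checks), failures|
    failed = checks.reject { |c| c[:exit_code] == NRPE_OK }.map { |c| c[:check] }
    failures[host] = failed unless failed.empty?
  end
end

def all_green?(results)
  failing_checks(results).empty?
end

# Made-up data standing in for what the remote execution would return.
results = {
  'ci-exampleapp-001' => [{ check: 'check_http', exit_code: 0 }],
  'ci-exampleapp-002' => [{ check: 'check_http', exit_code: 2 },
                          { check: 'check_disk', exit_code: 0 }]
}

p failing_checks(results)
p all_green?(results)
```

A rake task built on this could fail the build whenever `all_green?` returns false and print the failing checks per host.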

However, whilst a great first step, this is obviously not sufficient because:

  • It does not cover the end-to-end case (we cut-out nagios)
  • It is difficult to model test targets of services rather than individual machines (e.g. tests for a virtual IP on a load balancer)
  • How do we know that our monitoring continues to work through subsequent iterations of model or puppet code?
  • How do we know that our monitoring is not just saying ‘OK’ despite the state of the things it’s trying to monitor?
  • How do we know that we have not caused regressions in our HA setup, which causes it to not-function in a crisis?

… (etc)

Just like in application development, we can’t just hope that provisioning an entire infrastructure still works! We need automated tests that will give us the confidence we need to bring newly provisioned environments online.

Scenario testing

Our current thinking is that we need to be able to reason about the behaviour of the entire system, both under normal and failure conditions – from a high level.

This feels like a good fit for structuring the tests in the classic Given, When, Then BDD format.

To this end, we wanted to give some examples of test scenarios that we would be interested in writing to actually have confidence, and show the support we will need to realise them.

Here is a simple example of a scenario we might want to test:

Given – the Service has finished being provisioned
And – all monitoring for the related service is passing
When – we destroy a single member of the service
Then – we expect all monitoring at the service level to be passing
And – we expect all monitoring at the single machine level to be failing

Even with this simple example, we can drive our thinking into the upstream APIs and services we’ll need to achieve these goals at each of the steps.

>> Given – the Service has finished being provisioned

We can do this! We can launch a bunch of virtual machines to fulfil a service.

>> And – all monitoring for the related service is passing

We can’t do this. We can check that nrpe commands on machines that are part of a service are working (and we do), and we can also execute some one-off sanity checks from the provisioning machines. But what we really want to do is to ask our monitoring system (nagios in our case).

Now “the service” actually consists of a number of machines. Each machine has checks, and the service itself has checks. The service will have some internal checking (for example, are all the instances in the load balancer healthy?); it will also have some external checking performed from somewhere on the Internet (e.g. pingdom), which will cover off most connectivity issues.

So how can we confidently answer this question? I believe we need to be able to query the monitoring system like this:

    'The Service', :transitive => true
  ).should be_all_green

In this case I want to know about every participating machine and all checks relating to the service. This is the classic “is everything ok for the thing we just provisioned” question.

>> When – we destroy a single member of the service

This is easy. We just issue a destroy command for a VM. For other scenarios we might want more specific ways of disabling particular functionality of a machine rather than complete destruction.

>> Then – we expect all monitoring at the service level to be passing.

We need to be able to do something like this:

    'The Service', :transitive => false
  ).should be_all_green

Note that the change of transitivity is the important clue here!

Sometimes I want to ask the question: “Are all participants in this service ok?” But sometimes I just want to know if the service is still functioning.

>> And – we expect all monitoring at the single machine level to be failing

  'Service App Machine 0', :transitive => false
).should be_all_red
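To make the transitivity idea concrete, here is a small self-contained Ruby sketch. It is not our real API and it does not talk to nagios: `Monitored`, `Check`, `checks_for` and the check names are all hypothetical, and the monitoring data is held in memory. It only illustrates the difference between asking about a service alone and asking about a service plus everything that participates in it.

```ruby
# A tiny in-memory stand-in for a monitoring system. All names are hypothetical.
Check = Struct.new(:name, :passing)

class Monitored
  attr_reader :name, :checks, :members

  def initialize(name, checks: [], members: [])
    @name = name
    @checks = checks
    @members = members
  end

  # transitive: true  -> this thing's checks plus those of every member
  # transitive: false -> only the checks attached directly to this thing
  def checks_for(transitive:)
    return checks unless transitive

    checks + members.flat_map { |m| m.checks_for(transitive: true) }
  end
end

def all_green?(checks)
  checks.all?(&:passing)
end

def all_red?(checks)
  checks.none?(&:passing)
end

# State after one member of the service has been destroyed.
machine0 = Monitored.new('Service App Machine 0', checks: [Check.new('nrpe_http', false)])
machine1 = Monitored.new('Service App Machine 1', checks: [Check.new('nrpe_http', true)])
service  = Monitored.new('The Service',
                         checks: [Check.new('vip_http', true)],
                         members: [machine0, machine1])

puts all_green?(service.checks_for(transitive: false)) # service-level checks only
puts all_green?(service.checks_for(transitive: true))  # service plus all members
puts all_red?(machine0.checks_for(transitive: false))  # the destroyed machine
```

With this data the service-level question is answered "green" while the transitive question is not, which is exactly the distinction the scenario relies on.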


The thinking we’re doing about extending the modeling of our infrastructure from provisioning to testing is also applicable to running the systems in production. The scenario testing described above is predicated on the ability to sample an environment and build a model of the infrastructure state.

Whilst we’re still a long way from solving all of the problems, the model used for solving the testing problems outlined above can be used to reason about the production system! There are wins to be had every step along the path – in the short term, our model can power much better service/environment (rather than machine) level monitoring.

In the longer term, we’ll have better testing in place to give us confidence about newly provisioned environments. This will allow us to have a trustworthy blue/green infrastructure deployment and update system, as it can be built as a simple state machine where transitions resolve differences between a ‘desired state’ and the ‘current state’. This is exactly the same concept as behind Orc, our continuous deployment tool – wherein a model and state machine driven approach allows us to safely upgrade production applications, even in the face of machine failure.

We hope this post has made you think about the problems in the same way we have. Even if you don’t agree with the strategy we’re pursuing, you’ll hopefully agree that being able to reason about infrastructures at a higher conceptual level than individual machines is an important and powerful concept.

Exported Resources Considered Harmful

Our infrastructure automation is driven by Puppet, so this post is mainly going to talk about Puppet – however the key problem we have (and issues we’re solving) is equally relevant for most other current configuration management tools (such as Chef). One of the key challenges for configuration management systems is determinism – i.e. being able to rebuild the same system in the same way.

In a ‘traditional’ world view, the ‘system’ means an individual machine – however, in the real world, there are very few cases where a new production system can be brought on-line with only one component. For resiliency (if not scalability) purposes you probably want to have more than one machine able to fulfil a particular role so that a single hardware failure won’t cause a system wide outage.

Therefore your system consists of more than one machine – whilst the current crop of configuration management tools can be deterministic for a single machine, they’re much less deterministic when you have inter-machine dependencies.

Beyond individual per-application clusters of servers, you want the monitoring of your entire infrastructure to be coupled with the systems in place inside that infrastructure. I.e. you shouldn’t have to replicate any effort; when you add a new web server to the pool serving an application, you expect the monitoring of that server to adjust so that the new service becomes monitored automatically.

In the puppet ecosystem, the traditional solution to this is exported resources. In this model, each host running puppet ‘exports’ a set of data about the system under management (for instance nagios checks), and then other hosts can ‘collect’ these resources when they run puppet.

Traditionally this was not very scalable, although this has largely been addressed with the introduction of PuppetDB. It was also difficult to arrange things such that you could get the set of resources you wanted onto the host you wanted – with the newer versions of PuppetDB this issue is ameliorated with the introduction of a more flexible query interface.

All of these advancements have been great progress, and kudos to puppetlabs for doing much needed work in this area. However, pulling back from the actual problems, I (and my team) have come to consider exported resources the wrong solution for the problems they are commonly used to solve.

Exported resources introduce puppet run-order dependencies, i.e. in order to reach the correct state, puppet must run on some machines before it runs on others. The implication is that this “management method” is a Convergent[1] system as the system could end up in its final state by more than one route. Any system which relies on convergence is complicated, as it’s very hard to know if you’ve converged to the end state (or if you will ever converge to the end state).

The key issue is, of course, determinism: if host A exports resources to host B, then the order in which you build host A and host B matters, making them co-dependent and non-deterministic. If you’re rolling out an entirely new environment then this likely means that you have to run puppet again and again across the machines until things appear to stop changing – and this is just _apparent_ convergence, rather than proven convergence.

We can go some way to solve this issue, by forcing the order that machines are provisioned (or that puppet is run on those machines). We wrote puppet roll which executes puppet on hosts in order according to a dependency graph. But this is the wrong problem to be solving. Eliminate provisioning order dependencies and we eliminate a difficult and brittle problem.
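For readers who have not met this kind of ordering problem, the sketch below shows the general technique of deriving a run order from a dependency graph, using Ruby's standard `TSort` module. It is not the puppet roll implementation; the `RunOrder` class and host names are invented for illustration. It also hints at why this is brittle: the graph has to be maintained by hand, and a single cycle makes the order undefined.

```ruby
require 'tsort'

# dependencies: { host => [hosts that must have run puppet first] }
class RunOrder
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort hook: enumerate every node in the graph.
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # TSort hook: enumerate the nodes that must come before the given host.
  def tsort_each_child(host, &block)
    @dependencies.fetch(host, []).each(&block)
  end
end

# The load balancer collects resources exported by the app servers,
# so the app servers have to run first.
order = RunOrder.new(
  'loadbalancer-001' => %w[app-001 app-002],
  'app-001' => [],
  'app-002' => []
).tsort

p order
```

`tsort` returns dependencies before the hosts that need them, and raises `TSort::Cyclic` when two hosts export to each other.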

In recent work, we have rejected the traditional “exported resources anti-pattern” and instead have created a model of ‘our system’ entirely outside puppet. This means that we can build a model of the entire system, which contains no mutable state. We wire this model up to puppet to generate ENC (external node classifier) data for each machine. All data needed for this machine’s state is supplied by the ENC, meaning that machines can be built in any order, and all in exactly one pass.

This entirely removes all the key problems with determinism, convergence, multiple puppet runs etc. In our experience, it also in many cases radically simplifies things. Whereas previously we would have bent things to fit into the model offered by exported resources, we can now instead write our business specific logic in our own model layer – meaning that we can represent things as they should naturally be modelled.


The thing we like most about puppet for individual systems and services is its declarative, model driven nature – thus we’ve tried to replicate something with a similar ‘feel’ at the whole system level.

Given this (somewhat simplified) description of a service:

stack 'example' do
  virtual_appserver 'exampleapp', :instances => 2
  loadbalancer :instances => 2
end

env 'ci', :location => 'dc1' do
  instantiate_stack 'example'
end

env 'production', :location => 'dc2' do
  instantiate_stack 'example'
end

The (again somewhat simplified) ENC generated for the 2 application servers looks like this:

role: http_app_server
  environment: ci
  application: 'exampleapp'

The ENC for the 2 load balancers looks like this:

      type: http

This sort of configuration eliminates the need for a defined puppet run-order (as each host has all the details it needs to configure itself from the model – without any data being needed from other hosts), and goes a long way towards achieving the goal of complete determinism.
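The essence of the approach can be sketched in a few lines of Ruby. This is a toy, not our real model layer: the `machines_for` function, the host naming scheme and the ENC keys (`backends` in particular) are assumptions made for illustration. The point it tries to show is that the load balancer's list of backends comes from the model, computed in one place, rather than being collected from other hosts.

```ruby
require 'yaml'

# A toy model of one stack in one environment. It is a pure function of its
# inputs, so the same arguments always produce the same ENC data.
def machines_for(env, app, app_instances:, lb_instances:)
  app_servers = (1..app_instances).map { |i| format('%s-%s-%03d', env, app, i) }
  balancers   = (1..lb_instances).map { |i| format('%s-lb-%03d', env, i) }

  enc = {}
  app_servers.each do |host|
    enc[host] = { 'role' => 'http_app_server', 'environment' => env, 'application' => app }
  end
  balancers.each do |host|
    # The balancer learns its backends from the model, not from other hosts.
    enc[host] = { 'role' => 'loadbalancer', 'environment' => env,
                  'backends' => app_servers, 'type' => 'http' }
  end
  enc
end

enc = machines_for('ci', 'exampleapp', app_instances: 2, lb_instances: 2)
puts enc['ci-lb-001'].to_yaml
```

Because every host's data is derived from the same immutable description, any host can be built first, and building it twice gives the same answer.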

The example shows a traditional load-balancer-to-web-server dependency; however, the same technique can be (and is) applied in our code wherever we have clusters of servers of the same type that need to inter-communicate, e.g. RabbitMQ, Ehcache and Elasticsearch clusters.

If you haven’t guessed yet, as well as being theoretically correct, this approach is vastly powerful:

  • We’re able to smoke test our puppet code in a real integration environment.
  • We can provision entire clusters of servers for QA or development purposes with 1 line of code and a < 10 minute wait.
  • We’ve used this system to build our newest production applications.
  • We can rebuild an entire app environment during scheduled maintenance.
  • We can add servers or resources to an application cluster with a 1 line code change and 1 command.

We’ve got a lot of work still to do on this system (and on our internal applications and puppet code before it’ll all fit into this system), however it’s already quite obviously a completely different (and superior) model to traditional convergence for how to think of (and use) our configuration management tools across many servers.


  1. Why Order Matters: Turing Equivalence in Automated Systems Administration (USENIX 2002)

Puppet Camp Barcelona

I recently had the pleasure of being asked to speak at Puppet Camp Barcelona. I’d submitted a talk a few months ago about some of the problems my team was having with our uses of puppet, and how we’re adapting to change how we use puppet.

I was extremely pleased to be asked to present, and also extremely pleased that TIM Group was willing to fund my flights and give me the time to attend the conference.

I was pleased by how the presentation went, and I gained a whole bunch of ideas we hadn’t thought of from chatting to people afterwards. From the official writeup, I think the talk was generally well received.

I’m looking forward to finding the time to write up further details of how we use puppet at TIM Group, and what problems this solves for us.


This is a blog post that was written in 2013, but somehow was forgotten about. So here is a bit of history!

— Andrew Parker

Last month I got the chance to attend the Monitorama conference.

This was out and out the best conference I’ve visited so far this year for learning. The conference was organised as a day of lectures by notable people in the field, followed by a day of workshops (practical ‘follow the talk on your laptop’ style sessions) on various monitoring and visualisation tools in parallel with a day of ‘Hackathon’ (working on projects).

The talks I attended covered some topics I’m already very familiar with (E.g. logstash), and some topics which I’m much less familiar with (E.g. the R workshop). The level of technical detail was generally great – not overfacing if you were a beginner, but also with something to learn if you’re a seasoned user.

In the second day, in the morning, I teamed up with Jason from TIM Group, and we were a little selfish – working on scratching our own itch. In the afternoon we broke off and attended some of the workshops, which were interesting. I was extremely surprised (and pleased!) that our project won the 3rd prize for a group project at the conference – that was unexpected given that we’d gone off and solved our own reporting issue, rather than tackling a more generic monitoring problem which would have helped a larger subset of people! We also managed to build something functional for our needs, and get it deployed into production within the 5 hours we had to work on it.

I hope that readers will forgive me if I spend a few paragraphs telling you about what we built (and why!):

We used to deploy Foreman, which we used to view / search through puppet run reports. Unfortunately, with our recent upgrade to Puppet 3.0, our Foreman installation needed upgrading too.

The version of Foreman we were running was ancient, and it’s since gained a massive number of features – however the only feature we were using from it was the report browser. This was causing us to have mysql on our monitoring machines, just to support this application, and re-packaging the latest version to our internal standards proved to be a non-trivial exercise.

We’d basically just disabled it to go ahead with the puppet 3.0 upgrade, with a plan to experiment with (try writing a proof of concept) using our logstash/Elasticsearch solution for the data transport and storage. I was able to very quickly hack up a reporting plugin for puppet, based off of some earlier work I’d found on github, and I’d been playing with the angularjs framework on the plane on the way over.

So, after about 5 hours hacking, we had bolted together Norman (excuse the bad pun).

This is, of course, still a simple and barely functional prototype; however, it’s usable enough that after a couple more hours’ work we had unit tests (and green builds in Travis) and could deploy it as an Elasticsearch plugin. There is still missing functionality compared with what we replaced in Foreman; however, none of it is truly essential and we should be able to add it gradually as we have time.

Devopsdays London

This is a blog post that was written in 2013, but somehow was forgotten about. So here is a bit of history!

— Andrew Parker

Most of our Infrastructure team and a couple of developers we had seconded to the team all attended the Devopsdays London conference a couple of weeks ago.

There are a load of reviews/notes about the conference online already, however we also made a set.

I think everyone attending found the conference valuable, although for varying reasons (depending upon which sessions they had attended). Personally I found that the 2nd day was more valuable, with better talks and more interesting openspace sessions (that I attended). As I had expected (from my previous attendance at Devopsdays New York), I found the most value in networking and comparing the state of the art with what others are doing in automation / monitoring / etc.

I was very pleased that TIM Group is actually among the leading companies to have implemented devops practices. I’m well aware that what we’re doing is a long way away from perfect (as I deal with it 5 days a week), however it’s refreshing to find out that our practices are among the leaders, and that the issues we’re currently struggling with are relevant to many other people and teams.

I particularly enjoyed the discussion in the openspaces part of the conference about estimating and planning Infrastructure and Operations projects – at the time we were at the end of a large project, in which we’d tried a new planning process for the first time (and we had a number of reservations). The thoughts and ideas from the group helped us to shape our thinking about the problems we were trying to solve (both within the team, and by broadcasting progress information to the wider company).

Afterwards (in the last week) we have taken the time to step back and re-engineer our planning and estimation process. We’ve subsequently started work on another couple of projects with the modified planning and estimation process, and the initial feeling from the team is much more positive. Once we’ve completed the current projects and we’ve had a retrospective (and made more changes) I’ll be writing up the challenges that we’ve faced in estimating and how we’ve overcome them – as being able to deliver accurate and consistent estimates in the face of unplanned work (e.g. outages, hardware failures etc) is even more challenging for operations projects than in an agile development organisation.

Introducing Orc and its agents


This post is the third part of a series documented here (part 1) and here (part 2).

Orc at a high level

In the previous post we discussed the Application Infrastructure Contracts. These contracts mean that new applications can be deployed to production with minimal effort because the infrastructure tools can make assumptions about their behaviour. In this post we will discuss the tool-set that leverages this.

Orc is split into three main components (or types of component): the central orchestration tool, its agents and its model (the cmdb). The diagram above shows this clearly.

Each box must install an agent that is capable of “auditing” itself. This means that it should be able to report whether an application is running or not running, what version it is on, whether it is in or out of the load balancer pool, whether it is stoppable, etc. For different types of components this audit information would be different, i.e. for a database component there would be no concept of stoppable.

Orc is a model driven tool. It continually audits the environment by sending messages to all nodes; each node responds with its current state. Orc then compares the information retrieved from the audit with its model (the cmdb) and, for each node, decides on the action to take, if any. Orc reviews each action for conformance with its policies and removes any illegal actions (such as removing all instances from the load balancer). It then continues by sending messages to the agents to perform the intended actions. Currently we have one simple policy: “Never leave zero instances in the load balancer”.

The final component is the model of the desired state of the world (the cmdb). It records which applications should be on which versions and whether they should be participating in the load balanced service or not. This is currently a simple YAML file in a git repository. Having it in git gives us a desirable side-effect: we can audit changes to the cmdb.

So now we have a way to audit the actual state of the world (via our agents) and we know what the world is supposed to look like (our cmdb) so all that is left is to execute the upgrade steps in the correct order without breaking our policies (well the one policy at the moment).

This diagram explains the transitions that Orc makes given any particular state of any one instance:

So if the world started out looking like this:

Then HostA and HostB would be instructed to upgrade (install) to version 6, as they are on the wrong version and are not participating. The actions for HostC and HostD would be to disable participation, but this is currently blocked as it violates the policy (there must be at least one instance in the load balancer).

Then participation is enabled for HostA and HostB. Again, HostC and HostD are blocked, awaiting more instances in the load balancer.

Finally, HostA and HostB are both in line with the cmdb, so no further action is taken on them. This leaves the final action of disabling participation on HostC and HostD.
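The audit, compare, filter and act cycle can be sketched in a few lines of Python and run against this four-host example. Everything here is an assumption made for illustration: the transition rules are inferred from the walk-through, the starting versions (“5”) are invented, and the policy is read as “block all proposed disables if together they would empty the pool”. It is not Orc’s real implementation.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class State:
    version: str
    participating: bool


Action = Tuple[str, str]  # (host, action name)


def propose(actual: Dict[str, State], model: Dict[str, State]) -> List[Action]:
    """Compare the audit with the model; at most one action per instance."""
    actions: List[Action] = []
    for host, want in model.items():
        have = actual[host]
        if have.version != want.version:
            # Assumed rule: never upgrade an instance that is still in the pool.
            step = "disable_participation" if have.participating else "install"
            actions.append((host, step))
        elif have.participating != want.participating:
            step = ("enable_participation" if want.participating
                    else "disable_participation")
            actions.append((host, step))
    return actions


def apply_policy(actual: Dict[str, State], actions: List[Action]) -> List[Action]:
    """Policy: never leave zero instances in the load balancer."""
    in_pool = sum(1 for state in actual.values() if state.participating)
    disables = sum(1 for _, step in actions if step == "disable_participation")
    if in_pool - disables >= 1:
        return list(actions)
    # The proposed disables would empty the pool, so block them for this cycle.
    return [(h, s) for h, s in actions if s != "disable_participation"]


def execute(actual: Dict[str, State], actions: List[Action],
            model: Dict[str, State]) -> Dict[str, State]:
    """Stand-in for the agents: apply each action to a copy of the world."""
    world = dict(actual)
    for host, step in actions:
        if step == "install":
            world[host] = replace(world[host], version=model[host].version)
        else:
            world[host] = replace(
                world[host], participating=(step == "enable_participation"))
    return world


def converge(actual, model, max_cycles=10):
    """Repeat the cycle until no legal action is left."""
    history = []
    for _ in range(max_cycles):
        actions = apply_policy(actual, propose(actual, model))
        if not actions:
            break
        history.append(actions)
        actual = execute(actual, actions, model)
    return actual, history


# Desired state (the cmdb) and audited state, both hypothetical.
MODEL = {
    "HostA": State("6", True), "HostB": State("6", True),
    "HostC": State("5", False), "HostD": State("5", False),
}
ACTUAL = {
    "HostA": State("5", False), "HostB": State("5", False),
    "HostC": State("5", True), "HostD": State("5", True),
}

final, history = converge(ACTUAL, MODEL)
for cycle, actions in enumerate(history, start=1):
    print(cycle, actions)
```

Under these assumptions the loop takes three cycles, mirroring the steps above: install on HostA and HostB, enable participation on HostA and HostB, and only then disable participation on HostC and HostD.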

This concludes our introduction. Future posts will look at future work with Orc, in particular respecting component dependencies and performing database upgrades.

Standardized application infrastructure contracts – part 2 – towards continuous deployment

This post is the second part of a series, which starts here:

To deploy a new version of an application we need to be able to do the following in a standard way:

  • download a new version of an artifact
  • install and start a version of an app (deploy)
  • take the running app in and out of the load balancer
  • stop a running instance

When we deploy an application we need to ensure that we can download the artifact from a known location so that the download can be defined programmatically. We chose to use our legacy “product store”, which we can copy artifacts to and from over ssh. We agreed on a naming convention for applications (much like Maven). We chose this solution simply because it was already set up in the way we wanted it to be (most applications were already being deployed from there). Going forward we will probably change this to a more industry-standard repository management tool (perhaps apt and Debian packages, or maybe even the Nexus repository manager), but for now it serves our purpose.

Once the artifact could be retrieved from the standardized location we needed the applications to be launched in a standardized way. We asked the development teams to standardize on the parameters we pass the application and implemented unix service scripts that rely on this standard. This allowed the deployment scripts to communicate with the instance being shut down or launched.

With a process now associated with the launched application instance, the deployment tool must verify a few things before it can declare the application running. First, the application must report its version, so that if we ask for version 1 we can verify that the running application reports version 1. One important point is that we are not assuming the application can start serving requests at this point (discussed below), simply that it has started and is reporting the correct version from the version page.
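A minimal sketch of that check, assuming the deployment tool polls until the expected version appears. The function name, the retry settings and the read_version callable (a stand-in for fetching the version page) are all hypothetical.

```python
import time
from typing import Callable, Optional


def wait_for_version(
    read_version: Callable[[], Optional[str]],
    expected: str,
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the instance reports the expected version.

    read_version stands in for fetching the version page; it returns None
    while the application is not yet answering.
    """
    for _ in range(attempts):
        if read_version() == expected:
            return True
        sleep(delay_seconds)
    return False


# Simulated instance that needs two polls before it answers.
replies = iter([None, None, "1"])
started = wait_for_version(lambda: next(replies), "1", sleep=lambda _: None)
```

A True result only means “started on the right version”; whether the instance can serve requests is a separate question answered by its health.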

In order to support zero-downtime deployments we needed to be able to add and remove instances from the load balanced pool. We had applications that already did this by embedding a very simple API inside the application. The deployment tools would ask the instance to shut down, at which point a status page would return a 503, which meant “I don’t want to be in the pool”.

After much debate we settled on defining two responsibilities with regard to load balancing. The first is whether the application is *willing* to be in the load balancer, which we called *participation*. If instance 1 of app X is supposed to be in the active pool then participation should be “enabled”; if it is not supposed to be in the active pool, e.g. during the deployment cycle, it should report participation as “disabled”. The second concern is application health: an application may be asked to be in the load balancer but may not be *able* to serve requests correctly. We all agreed that this is very much the application’s concern. It could be that the application cannot serve requests yet because it hasn’t initialized its internal state, or that an unexpected failure occurs during operation; in either case the application would report its health as “ill”.

The group was divided around how we go about implementing the participation part of the contract. Should we embed the API in the application or should it be part of the deployment infrastructure accessed by the load balancers and deployment tools? We settled on moving it out of the application into a standalone service which recorded whether an instance should be in or out of the load balancer (the willing part).
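Put together, the two responsibilities reduce to a simple rule for the load balancer: route traffic only to instances that are both willing and able. The sketch below assumes that reading; the enum names and the “healthy” value are illustrative, since only “enabled”, “disabled” and “ill” are named above.

```python
from enum import Enum


class Participation(Enum):
    ENABLED = "enabled"    # recorded by the standalone participation service
    DISABLED = "disabled"


class Health(Enum):
    HEALTHY = "healthy"    # assumed name for the non-ill state
    ILL = "ill"            # reported by the application itself


def should_receive_traffic(participation: Participation, health: Health) -> bool:
    """An instance is in the pool only if it is willing and able."""
    return participation is Participation.ENABLED and health is Health.HEALTHY


# Mid-deployment: the instance is fine but has been taken out on purpose.
draining = should_receive_traffic(Participation.DISABLED, Health.HEALTHY)

# Freshly started: asked to participate but still initializing.
warming_up = should_receive_traffic(Participation.ENABLED, Health.ILL)
```

Keeping the two signals separate lets the deployment tooling own participation while the application alone decides its health.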

Finally, if an instance is already running we need to be able to stop it. We wanted the applications to stop quickly so they can be replaced. However, some of our legacy applications hold state that only drains once new requests are no longer directed at them. With this in mind we required that every application implement a resource called “stoppable”. If it returns “safe”, the deployment tooling is allowed to send a kill signal to the application, which completes the deployment cycle.
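One way the tooling might use this resource is sketched below. The callables read_stoppable and send_kill are hypothetical stand-ins for querying the “stoppable” resource and signalling the process, and “draining” is an invented value for any reply other than “safe”.

```python
import time
from typing import Callable


def stop_when_safe(
    read_stoppable: Callable[[], str],
    send_kill: Callable[[], None],
    attempts: int = 60,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Kill the instance only after it reports that stopping is safe."""
    for _ in range(attempts):
        if read_stoppable() == "safe":
            send_kill()
            return True
        sleep(delay_seconds)
    return False  # never became safe; leave the instance running


# Simulated legacy application that drains for two polls.
replies = iter(["draining", "draining", "safe"])
kills = []
stopped = stop_when_safe(lambda: next(replies), lambda: kills.append("kill"),
                         sleep=lambda _: None)
```

Returning False rather than killing on timeout is a deliberate choice in this sketch: an instance that never drains is left for a human to investigate.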

We have described our standardized approach to retrieving new versions of an application, starting new instances, adding them to the load balancers and pulling them out again, and finally stopping our applications to complete the deployment cycle.

In future posts I will present a high-level architecture for the moving parts we have discussed here, the orchestration engine whose job it is to coordinate the deployment of our applications over multiple machines, and finally some of the other challenges we face and have faced on our journey towards continuous deployment.