Standardized application infrastructure contracts – part 2 – towards continuous deployment

This post is the second part of a series, which starts here:

To deploy a new version of an application we need to be able to do the following in a standard way:

  • download a new version of an artifact
  • install and start a version of an app (deploy)
  • take the running app in and out of the load balancer
  • stop a running instance

When we deploy an application we need to be able to download the artifact from a known location, so that the download can be defined programmatically. We chose our legacy “product store”, which we copy artifacts to and from over SSH, and agreed on a naming convention for applications (much like Maven's). We chose this solution simply because it was already set up the way we wanted: most applications were already being deployed from there. Going forward we will probably move to a more industry-standard repository management tool (perhaps apt and Debian packages, or maybe even the Nexus repository manager), but for now it serves our purpose.
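As a rough illustration of why a naming convention makes the download scriptable, here is a minimal sketch that derives an artifact's location from its coordinates. The layout (`group/name/version/name-version.ext`), the `/productstore` root and the `tar.gz` extension are all assumptions made for the example, loosely modelled on Maven; they are not the actual convention of our product store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Coordinates of a deployable artifact (field names are hypothetical)."""
    group: str                  # e.g. the owning team or product area
    name: str                   # application name
    version: str                # version to deploy
    extension: str = "tar.gz"   # assumed packaging format


def artifact_path(artifact: Artifact, root: str = "/productstore") -> str:
    """Build a predictable path for an artifact from its coordinates.

    The root directory and the directory layout are illustrative only.
    """
    filename = f"{artifact.name}-{artifact.version}.{artifact.extension}"
    return "/".join(
        [root.rstrip("/"), artifact.group, artifact.name, artifact.version, filename]
    )


if __name__ == "__main__":
    print(artifact_path(Artifact("payments", "checkout", "1.4.2")))
```

Because the path is a pure function of the coordinates, a deployment script only needs the application name and version to know what to copy.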

Once the artifact could be retrieved from the standardized location, we needed the applications to be launched in a standardized way. We asked the development teams to standardize on the parameters we pass to the application and implemented Unix service scripts that rely on this standard. This allowed the deployment scripts to communicate with the instance being shut down or launched.

With a process now associated with the launched application instance, the deployment tool must verify that the right version is running before it can declare the application started: the application must report its version, so that if we ask for version 1 we can verify that the running application reports version 1. One important point is that we do not assume the application can start serving requests at this point (discussed below), simply that it has started and is reporting the correct version on its version page.
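The version check can be sketched as a small polling loop. This is only an illustration under stated assumptions: `fetch_version` is a hypothetical callable standing in for however the tooling reads the version page (for example an HTTP GET), and the retry count and delay are arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_version(
    fetch_version: Callable[[], Optional[str]],
    expected: str,
    attempts: int = 30,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the instance reports the expected version.

    fetch_version returns the reported version, or None if the version
    page is not reachable yet. Returns True once the reported version
    matches, False if it never does within the allowed attempts.
    """
    for attempt in range(attempts):
        reported = fetch_version()
        if reported is not None and reported.strip() == expected:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give the process time to come up
    return False
```

Note that a `True` result only says the instance has started on the right version; it says nothing about whether the instance is ready to serve requests.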

In order to support zero-downtime deployments we needed to be able to add instances to and remove them from the load-balanced pool. We had applications that already did this by embedding a very simple API inside the application. The deployment tools would ask the instance to shut down, at which point a status page would return a 503, which meant “I don’t want to be in the pool”.

After much debate we settled on defining two responsibilities with regard to load balancing. The first is whether the application is *willing* to be in the load balancer. We called this *participation*: if instance 1 of app X is supposed to be in the active pool, its participation should be “enabled”; if it is not supposed to be in the active pool, for example during the deployment cycle, it should report participation as “disabled”. The second is application health: an application may be asked to be in the load balancer but may not be *able* to serve requests correctly. We all agreed that this is very much the application's concern. The application may not be able to serve requests yet because it hasn’t initialized its internal state, or an unexpected failure may occur during operation; in either case the application would report its health as “ill”.
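One way to read the two responsibilities is that an instance should receive traffic only when it is both willing and able. The sketch below encodes that reading; the “ok” health label is an assumption (the contract described here only names “ill”), and the function name is hypothetical.

```python
from enum import Enum


class Participation(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class Health(Enum):
    OK = "ok"    # assumed label for a healthy instance
    ILL = "ill"


def should_receive_traffic(participation: Participation, health: Health) -> bool:
    """An instance belongs in the active pool only if it is willing
    (participation enabled) and able (healthy) to serve requests."""
    return participation is Participation.ENABLED and health is Health.OK
```

Keeping the two signals separate means a deployment can flip participation without the application knowing, while the application stays the only authority on its own health.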

The group was divided over how to implement the participation part of the contract. Should we embed the API in the application, or should it be part of the deployment infrastructure accessed by the load balancers and deployment tools? We settled on moving it out of the application into a standalone service that records whether an instance should be in or out of the load balancer (the willing part).
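To make the idea of a standalone participation service concrete, here is a minimal in-memory sketch of the state such a service would hold. The class and method names, the instance identifier format and the choice to treat unknown instances as disabled are all assumptions for illustration; a real service would sit behind a network API and persist its state.

```python
from typing import Dict


class ParticipationRegistry:
    """Records whether each instance should be in the load balancer."""

    def __init__(self) -> None:
        self._enabled: Dict[str, bool] = {}

    def enable(self, instance_id: str) -> None:
        """Mark the instance as willing to be in the active pool."""
        self._enabled[instance_id] = True

    def disable(self, instance_id: str) -> None:
        """Mark the instance as out of the pool, e.g. during a deployment."""
        self._enabled[instance_id] = False

    def status(self, instance_id: str) -> str:
        """Return "enabled" or "disabled"; unknown instances are treated
        as disabled so they never receive traffic by accident."""
        return "enabled" if self._enabled.get(instance_id, False) else "disabled"
```

The deployment tooling would call `disable` before stopping an instance and `enable` once the new version is up, while the load balancers only ever read `status`.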

Finally, if an instance is already running we need to be able to stop it. We wanted the applications to stop quickly so they can be replaced; however, some of our legacy applications hold state that has to drain once new requests are no longer being directed at them. With this in mind we required that every application implement a resource called “stoppable”. If it returns “safe”, the deployment tooling is allowed to send a kill signal to the application, which completes the deployment cycle.
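The stop step can be sketched in the same spirit: poll the “stoppable” resource and only kill the process once it answers “safe”. Here `fetch_stoppable` and `send_kill` are hypothetical callables supplied by the tooling (for instance an HTTP GET and a wrapper around `os.kill`), and any answer other than “safe” is assumed to mean “not yet”.

```python
import time
from typing import Callable


def stop_when_safe(
    fetch_stoppable: Callable[[], str],
    send_kill: Callable[[], None],
    attempts: int = 60,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send the kill signal only after the application reports "safe".

    Returns True if the kill was sent, False if the application never
    became safe to stop within the allowed attempts.
    """
    for attempt in range(attempts):
        if fetch_stoppable().strip() == "safe":
            send_kill()
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # let in-flight state drain
    return False
```

Returning `False` instead of killing anyway leaves the decision about a stuck instance to the caller, which seems the safer default for applications that hold state.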

We have described our standardized approach to retrieving new versions of an application, starting new instances, adding them to the load balancers and pulling them out again, and finally stopping our applications to complete the deployment cycle.

In future posts I will present a high-level architecture for the moving parts discussed here, the orchestration engine that coordinates the deployment of our applications across multiple machines, and finally some of the other challenges we face and have faced on our journey towards continuous deployment.