Scenario testing for infrastructures

Recent advancements have allowed us to provision an entire environment with a single command.

The next major challenge facing us is how to perform updates to environments; this gives rise to an additional set of challenges (and constraints) for our automated provisioning system.

We’re working towards a provisioning system that is able to upgrade the systems serving an application by rebuilding a completely fresh environment from the ground up, then seamlessly redirecting the live traffic into the new environment. We plan to eventually do this in a fully automated fashion during business hours – continually as we make changes and improvements to the infrastructure.

One approach to performing updates on the running system, whilst maintaining the “single build” provisioning model presented in the last post, would be to implement Blue-Green deployments for our applications at the environment level.

Thinking about things in this mindset gives us a set of problems to solve:

  • How can we have complete confidence that our newly provisioned environment is fully operational and that no regressions have been introduced?
  • How do we know that we can exactly duplicate an environment build process in a way that we have complete confidence in? (E.g. is the QA environment really like production?)
  • Do we know that the health monitoring of both the individual components and the service as a whole is functional? (Can we trust the system to monitor the right things as servers are rebuilt dynamically?)
  • Are load balancing and redundancy / high availability requirements met under the conditions of unexpected hardware failures? (Not only do the application environments have to be highly available, the provisioning system itself has to cope with failure gracefully)

Confidence

Let’s start by talking about confidence, and how our MVP for automated confidence is insufficient. Our initial thought process went something like this:

  • We have monitoring for every machine comprising the environment – right?
  • Our monitoring should be able to tell us whether the machines we just provisioned are working (i.e. give us confidence in the state of the environment) – awesome.
  • If we automatically verify that our per-host monitoring is “all green”…
  • …then we’re confident everything is working, and we can put the environment into production – right?
  • So we need an automated API to access the environment-specific monitoring pieces, so that we can assert that all the checks for a specific application are correctly provisioned and healthy.

It was a relatively simple job to extend the mcollective nrpe agent so that, from our provisioning model, we have a rake task which remotely executes our NRPE checks on the hosts in an environment.
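Here is a minimal sketch of what such a rake task might look like. The runcommand action and its :command argument are what the stock mcollective nrpe agent provides; the environment fact, the check list, and the task name are illustrative stand-ins for our provisioning model:

require 'mcollective'
include MCollective::RPC

desc 'Run every NRPE check on the hosts in an environment'
task :check_environment, [:env] do |_t, args|
  mc = rpcclient('nrpe')                    # mcollective nrpe agent
  mc.fact_filter 'environment', args[:env]  # illustrative fact: only hosts in this environment
  mc.progress = false

  failures = []
  %w[check_load check_disk check_swap].each do |check|  # illustrative check list
    mc.runcommand(:command => check).each do |resp|
      unless resp[:data][:exitcode] == 0
        failures << "#{resp[:sender]}: #{check} -> #{resp[:data][:output]}"
      end
    end
  end
  mc.disconnect

  raise "NRPE checks failing:\n#{failures.join("\n")}" unless failures.empty?
end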

However, whilst a great first step, this is obviously not sufficient because:

  • It does not cover the end-to-end case (we cut nagios out of the loop)
  • It is difficult to model test targets of services rather than individual machines (e.g. tests for a virtual IP on a load balancer)
  • How do we know that our monitoring continues to work through subsequent iterations of model or puppet code?
  • How do we know that our monitoring is not just saying ‘OK’ despite the state of the things it’s trying to monitor?
  • How do we know that we have not caused regressions in our HA setup that cause it to stop functioning in a crisis?

… (etc)

Just like in application development, we can’t just hope that provisioning an entire infrastructure still works! We need automated tests that will give us the confidence we need to bring newly provisioned environments online.

Scenario testing

Our current thinking is that we need to be able to reason about the behaviour of the entire system, under both normal and failure conditions – from a high level.

This feels like a good fit for structuring the tests in the classic Given, When, Then BDD format.

To this end, we wanted to give some examples of the test scenarios we would be interested in writing in order to have real confidence, and to show the supporting APIs we will need to realise them.

Here is a simple example of a scenario we might want to test:

Given – the Service has finished being provisioned
And – all monitoring related to the service is passing
When – we destroy a single member of the service
Then – we expect all monitoring at the service level to be passing
And – we expect all monitoring at the single machine level to be failing

Even with this simple example, we can drive our thinking into the upstream APIs and services we’ll need to achieve these goals at each of the steps.

>> Given – the Service has finished being provisioned

We can do this! We can launch a bunch of virtual machines to fulfil a service.

>> And – all monitoring related to the service is passing

We can’t do this yet. We can check that NRPE commands on machines that are part of a service are working (and we do), and we can also execute some one-off sanity checks from the provisioning machines. But what we really want is to ask our monitoring system (nagios in our case).

Now “the service” actually consists of a number of machines. Each machine has checks, and the service itself has checks. The service will have some internal checking, such as whether all the instances behind the load balancer are healthy; it will also have some external checks run from somewhere on the Internet (e.g. pingdom), which will cover off most connectivity issues.

So how can we confidently answer this question? I believe we need to be able to query the monitoring system like this:

monitoring_system.find_status_of_alerts(
    'The Service', :transitive => true
  ).should be_all_green

In this case I want to know about every participating machine and all checks relating to the service. This is the classic “is everything ok for the thing we just provisioned” question.

>> When – we destroy a single member of the service

This is easy. We just issue a destroy command for a VM. For other scenarios we might want more targeted ways of disabling particular functionality of a machine, rather than destroying it completely.
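For illustration, here are two shapes such a step might take (both hypothetical helpers, assuming a libvirt-backed hypervisor and SSH access): a hard destroy, and a more surgical fault that leaves the machine up but kills the application port:

# Hypothetical scenario-step helpers. 'virsh destroy' is a hard
# power-off of a libvirt domain; the iptables rule simulates a
# partial failure (machine still up, application port dead).
def destroy_member(hypervisor, vm_name)
  system('ssh', hypervisor, 'virsh', 'destroy', vm_name) or
    raise "failed to destroy #{vm_name}"
end

def kill_app_port(host, port)
  system('ssh', host, 'sudo', 'iptables', '-I', 'INPUT',
         '-p', 'tcp', '--dport', port.to_s, '-j', 'DROP') or
    raise "failed to block port #{port} on #{host}"
end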

>> Then – we expect all monitoring at the service level to be passing.

We need to be able to do something like this:

monitoring_system.find_status_of_alerts(
    'The Service', :transitive => false
  ).should be_all_green

Note that the change of transitivity is the important clue here!

Sometimes I want to ask the question: “Are all participants in this service ok?” But sometimes I just want to know if the service is still functioning.
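As a sketch of how such a query could be backed by nagios, here is one possible implementation over MK Livestatus. The query syntax is real Livestatus; the convention that a service’s member machines sit in a hostgroup, and its service-level checks in a servicegroup, named after the service is our assumption:

require 'socket'

# Sketch of a Livestatus-backed monitoring query (hypothetical class).
class MonitoringSystem
  def initialize(socket_path = '/var/lib/nagios/rw/live')
    @socket_path = socket_path
  end

  # :transitive => true  -> checks on every member machine, plus the
  #                         service-level checks
  # :transitive => false -> only the service-level checks
  # Assumes member machines are in a hostgroup, and service-level
  # checks in a servicegroup, named after the service.
  def find_status_of_alerts(service_name, options = {})
    filter = "Filter: groups >= #{service_name}\n"
    if options[:transitive]
      filter << "Filter: host_groups >= #{service_name}\nOr: 2\n"
    end
    query = "GET services\nColumns: host_name description state\n#{filter}\n"
    UNIXSocket.open(@socket_path) do |s|
      s.write(query)
      s.close_write
      s.read.lines.map { |line| line.chomp.split(';') }  # [host, check, state]
    end
  end
end

be_all_green would then be a small custom RSpec matcher asserting that every returned state is 0 (OK, in nagios terms), with be_all_red as its inverse.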

>> And – we expect all monitoring at the single machine level to be failing

monitoring_system.find_status_of_alerts(
    'Service App Machine 0', :transitive => false
  ).should be_all_red
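Putting the pieces together, the whole scenario might read as a single RSpec example along these lines; provisioning and monitoring_system stand for the hypothetical APIs discussed above:

describe 'The Service' do
  it 'survives the destruction of a single member' do
    # Given – the Service has finished being provisioned
    provisioning.provision('The Service')          # hypothetical provisioning API

    # And – all monitoring related to the service is passing
    monitoring_system.find_status_of_alerts(
      'The Service', :transitive => true
    ).should be_all_green

    # When – we destroy a single member of the service
    provisioning.destroy('Service App Machine 0')  # hypothetical

    # Then – all monitoring at the service level is passing
    monitoring_system.find_status_of_alerts(
      'The Service', :transitive => false
    ).should be_all_green

    # And – all monitoring at the single machine level is failing
    monitoring_system.find_status_of_alerts(
      'Service App Machine 0', :transitive => false
    ).should be_all_red
  end
end

In practice each monitoring assertion would have to poll with a timeout, since nagios only notices state changes on its next scheduled check.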

Modelling

The thinking we’re doing about extending the modelling of our infrastructure from provisioning to testing is also applicable to running the systems in production. The scenario testing described above is predicated on the ability to sample an environment and build a model of the infrastructure state.

Whilst we’re still a long way from solving all of the problems, the model used for solving the testing problems outlined above can be used to reason about the production system! There are wins to be had every step along the path – in the short term, our model can power much better service/environment (rather than machine) level monitoring.

In the longer term, we’ll have better testing in place to give us confidence about newly provisioned environments. This will allow us to have a trustworthy blue/green infrastructure deployment and update system, as it can be built as a simple state machine whose transitions resolve differences between a ‘desired state’ and the ‘current state’. This is exactly the same concept as the one behind Orc, our continuous deployment tool – wherein a model and state machine driven approach allows us to safely upgrade production applications, even in the face of machine failure.
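As an illustration of that shape (a sketch, not Orc’s actual code), the core of such a state machine can be very small:

# Illustrative desired-state convergence loop (not Orc's actual
# implementation). 'observe' re-samples reality on every iteration,
# so a re-run after a failure resumes from wherever reality is.
def converge(desired, transitions, observe)
  loop do
    diff = desired - observe.call        # what is still missing?
    break if diff.empty?
    transitions.fetch(diff.first).call   # each transition closes one difference
  end
end

# Hypothetical usage for a blue/green switch:
#   converge([:green_provisioned, :green_healthy, :traffic_on_green],
#            { :green_provisioned => lambda { provision(:green) },
#              :green_healthy     => lambda { wait_for_green_checks },
#              :traffic_on_green  => lambda { switch_load_balancer_to(:green) } },
#            lambda { current_infrastructure_state })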

We hope this post has made you think about the problems in the same way we have. Even if you don’t agree with the strategy we’re pursuing, you’ll hopefully agree that being able to reason about infrastructures at a higher conceptual level than individual machines is an important and powerful concept.