Scenario testing for infrastructures

Recent advancements have allowed us to provision an entire environment with a single command.

The next major challenge facing us is how to perform updates to environments; this gives rise to an additional set of challenges (and constraints) for our automated provisioning system.

We’re working towards a provisioning system that is able to upgrade the systems serving an application by rebuilding a completely fresh environment from the ground up, then seamlessly redirecting the live traffic into the new environment. We plan to eventually do this in a fully automated fashion during business hours – continually as we make changes and improvements to the infrastructure.

One approach to performing updates on the running system, whilst maintaining the “single build” provisioning model presented in the last post, would be to implement Blue-Green deployments for our applications at the environment level.

Thinking about things with this mindset gives us a set of problems to solve:

  • How can we have complete confidence that our newly provisioned environment is fully operational and that no regressions have been introduced?
  • How do we know that we can exactly duplicate an environment build process in a way that we have complete confidence in? (E.g. is the QA environment really like production)
  • Do we know that the health monitoring of both the components and the service as a whole is functional? (Can we trust the system to monitor the right things as servers are rebuilt dynamically?)
  • Are load balancing and redundancy / high availability requirements met under the conditions of unexpected hardware failures? (Not only do the application environments have to be highly available, the provisioning system itself has to cope with failure gracefully)

Confidence

Let’s start by talking about confidence, and how our MVP for automated confidence is insufficient. Our initial thought process went something like this:

  • We have monitoring for every machine comprising the environment – right?
  • Our monitoring should be able to tell us whether the machines we just provisioned are working (i.e. give us confidence in the state of the environment) – awesome.
  • What if we automatically verify that our per-host monitoring is “all green”?
  • We’re confident everything is working, and we can put the environment into production – right?
  • So we need an automated API to access environment-specific monitoring, so that we can assert that all the checks for this specific application are passing and that it is correctly provisioned and healthy.

It was a relatively simple job to extend the mcollective NRPE agent so that, from our provisioning model, we have a rake task which remotely executes our NRPE checks on the hosts in an environment.
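To give a flavour of this, here is a minimal sketch of what such a rake task might look like, assuming the stock MCollective RPC client and the nrpe agent plugin; the ‘environment’ fact and the check names are illustrative rather than taken from our real model:

require 'mcollective'
include MCollective::RPC

desc 'Run the NRPE checks on every host in an environment'
task :check_environment, [:environment] do |_t, args|
  nrpe = rpcclient('nrpe')
  nrpe.fact_filter 'environment', args[:environment]   # only hosts in this environment

  failures = []
  %w(check_disk check_load check_swap).each do |check|  # illustrative check names
    nrpe.runcommand(:command => check).each do |result|
      next if result[:data][:exitcode] == 0
      failures << "#{result[:sender]}: #{check} => #{result[:data][:output]}"
    end
  end
  nrpe.disconnect

  raise "NRPE failures:\n#{failures.join("\n")}" unless failures.empty?
  puts 'All NRPE checks passing'
end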

However, whilst a great first step, this is obviously not sufficient because:

  • It does not cover the end-to-end case (we cut Nagios out of the loop)
  • It is difficult to model test targets that are services rather than individual machines (e.g. tests for a virtual IP on a load balancer)
  • How do we know that our monitoring continues to work through subsequent iterations of model or puppet code?
  • How do we know that our monitoring is not just saying ‘OK’ despite the state of the things it’s trying to monitor?
  • How do we know that we have not caused regressions in our HA setup which would cause it to fail in a crisis?

… (etc)

Just like in application development, we can’t just hope that provisioning an entire infrastructure still works! We need automated tests that will give us the confidence we need to bring newly provisioned environments online.

Scenario testing

Our current thinking is that we need to be able to reason about the behaviour of the entire system – under both normal and failure conditions – from a high level.

This feels like a good fit for structuring the tests in the classic Given, When, Then BDD format.

To this end, we wanted to give some examples of the test scenarios we would need to write to actually have confidence, and to show the support we will need to realise them.

Here is a simple example of a scenario we might want to test:

Given – the Service has finished being provisioned
And – all monitoring related to the service is passing
When – we destroy a single member of the service
Then – we expect all monitoring at the service level to be passing
And – we expect all monitoring at the single machine level to be failing

Even with this simple example, we can drive our thinking into the upstream APIs and services we’ll need to achieve these goals at each of the steps.

>> Given – the Service has finished being provisioned

We can do this! We can launch a bunch of virtual machines to fulfil a service.

>> And – all monitoring related to the service is passing.

We can’t do this yet. We can check that NRPE commands on machines that are part of a service are working (and we do), and we can also execute some one-off sanity checks from the provisioning machines. But what we really want to do is ask our monitoring system (Nagios, in our case).

Now, “the service” actually consists of a number of machines. Each machine has checks, and the service itself has checks. The service will have some internal checks, such as whether or not all the instances behind the load balancer are healthy; it will also have some external checks run from somewhere on the Internet (e.g. Pingdom), which will cover most connectivity issues.

So how can we confidently answer this question? I believe we need to be able to query the monitoring system like this:

monitoring_system.find_status_of_alerts(
    'The Service', :transitive => true
  ).should be_all_green

In this case I want to know about every participating machine and all checks relating to the service. This is the classic “is everything ok for the thing we just provisioned” question.
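To make this concrete, here is a minimal sketch (purely illustrative, not our implementation) of the kind of query object and RSpec matchers that would support these assertions; the MonitoringSystem class, its HTTP endpoint and the alert JSON shape are all assumptions:

require 'json'
require 'net/http'
require 'uri'
require 'rspec/expectations'

# Hypothetical wrapper around the monitoring system's status API.
class MonitoringSystem
  def initialize(base_url)
    @base_url = base_url
  end

  # Returns the alerts attached to a service. With :transitive => true we also
  # include the alerts of every machine participating in the service.
  def find_status_of_alerts(service, opts = {})
    query = URI.encode_www_form(:service => service,
                                :transitive => opts.fetch(:transitive, false))
    JSON.parse(Net::HTTP.get(URI("#{@base_url}/alerts?#{query}")))
  end
end

# 'be_all_green' / 'be_all_red' simply assert that every returned alert is in
# the expected state.
RSpec::Matchers.define :be_all_green do
  match { |alerts| alerts.all? { |a| a['state'] == 'OK' } }
end

RSpec::Matchers.define :be_all_red do
  match { |alerts| alerts.all? { |a| a['state'] == 'CRITICAL' } }
end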

>> When – we destroy a single member of the service

This is easy. We just issue a destroy command for a VM. For other scenarios we might want more specific ways of disabling particular functionality of a machine, rather than destroying it completely.

>> Then – we expect all monitoring at the service level to be passing.

We need to be able to do something like this:

monitoring_system.find_status_of_alerts(
    'The Service', :transitive => false
  ).should be_all_green

Note that the change of transitivity is the important clue here!

Sometimes I want to ask the question: “Are all participants in this service ok?” But sometimes I just want to know if the service is still functioning.

>> And – we expect all monitoring at the single machine level to be failing

monitoring_system.find_status_of_alerts(
  'Service App Machine 0', :transitive => false
).should be_all_red
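Pulling these pieces together, the whole scenario could plausibly be expressed as Cucumber step definitions along the following lines; ‘provisioner’ and ‘monitoring_system’ are assumed helpers here, not our actual code:

Given(/^the Service has finished being provisioned$/) do
  provisioner.provision('The Service')   # launch every VM the model says the service needs
end

Given(/^all monitoring related to the service is passing$/) do
  monitoring_system.find_status_of_alerts('The Service', :transitive => true).should be_all_green
end

When(/^we destroy a single member of the service$/) do
  @destroyed_machine = provisioner.destroy_one_member_of('The Service')
end

Then(/^we expect all monitoring at the service level to be passing$/) do
  monitoring_system.find_status_of_alerts('The Service', :transitive => false).should be_all_green
end

Then(/^we expect all monitoring at the single machine level to be failing$/) do
  monitoring_system.find_status_of_alerts(@destroyed_machine, :transitive => false).should be_all_red
end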

Modeling

The thinking we’re doing about extending the modeling of our infrastructure from provisioning to testing is also applicable to running the systems in production. The scenario testing described above is predicated on the ability to sample an environment and build a model of the infrastructure state.

Whilst we’re still a long way from solving all of the problems, the model used for solving the testing problems outlined above can be used to reason about the production system! There are wins to be had every step along the path – in the short term, our model can power much better service/environment (rather than machine) level monitoring.

In the longer term, we’ll have better testing in place to give us confidence about newly provisioned environments. This will allow us to have a trustworthy blue/green infrastructure deployment and update system, as it can be built as a simple state machine whose transitions resolve the differences between a ‘desired state’ and the ‘current state’. This is exactly the same concept as the one behind Orc, our continuous deployment tool, wherein a model- and state-machine-driven approach allows us to safely upgrade production applications, even in the face of machine failure.
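As a rough illustration of that idea (not Orc’s actual code; the provisioner and its operations are invented for the sketch), such a reconciliation loop boils down to comparing the desired model against the observed state and applying the transitions needed to close the gap:

# Minimal sketch of desired-state vs current-state reconciliation. 'desired'
# and 'current' are hashes of hostname => state; 'provisioner' is a
# hypothetical object with 'launch' and 'destroy' operations.
def plan_transitions(desired, current)
  transitions = []
  desired.each_key { |host| transitions << [:launch, host] unless current.key?(host) }
  current.each_key { |host| transitions << [:destroy, host] unless desired.key?(host) }
  transitions
end

def converge(desired, current, provisioner)
  plan_transitions(desired, current).each do |action, host|
    provisioner.public_send(action, host)   # apply one transition at a time
  end
end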

We hope this post has made you think about the problems in the same way we have. Even if you don’t agree with the strategy we’re pursuing, you’ll hopefully agree that being able to reason about infrastructures at a higher conceptual level than individual machines is an important and powerful concept.

Exported Resources Considered Harmful

Our infrastructure automation is driven by Puppet, so this post is mainly going to talk about Puppet – however, the key problems we have (and the issues we’re solving) are equally relevant to most other current configuration management tools (such as Chef). One of the key challenges for configuration management systems is determinism – i.e. being able to rebuild the same system in the same way.

In a ‘traditional’ world view, the ‘system’ means an individual machine – however, in the real world, there are very few cases where a new production system can be brought on-line with only one component. For resiliency (if not scalability) purposes you probably want to have more than one machine able to fulfil a particular role so that a single hardware failure won’t cause a system wide outage.

Therefore your system consists of more than one machine – whilst the current crop of configuration management tools can be deterministic for a single machine, they’re much less deterministic when you have inter-machine dependencies.

Beyond individual per-application clusters of servers, you want the monitoring of your entire infrastructure to be coupled with the systems in place inside that infrastructure. That is, you shouldn’t have to duplicate any effort: when you add a new web server to the pool serving an application, you expect the monitoring to adjust itself so that the new server becomes monitored automatically.

In the puppet ecosystem, the traditional solution to this is exported resources. In this model, each host running puppet ‘exports’ a set of data about the system under management (for instance nagios checks), and then other hosts can ‘collect’ these resources when they run puppet.

Traditionally this was not very scalable, although this has largely been addressed with the introduction of PuppetDB. It was also difficult to arrange things such that you could get the set of resources you wanted onto the host you wanted – with the newer versions of PuppetDB this issue is ameliorated with the introduction of a more flexible query interface.

All of these advancements have been great progress, and kudos to Puppet Labs for doing much-needed work in this area. However, pulling back to the actual problems, my team and I have come to consider exported resources the wrong solution for the problems they are commonly used to solve.

Exported resources introduce puppet run-order dependencies, i.e. in order to reach the correct state, puppet must run on some machines before it runs on others. The implication is that this “management method” is a convergent[1] system, as the system could end up in its final state by more than one route. Any system which relies on convergence is complicated, as it’s very hard to know whether you have converged to the end state (or whether you ever will).

The key issue is, of course, determinism: if host A exports resources to host B, then the order in which you build host A and host B matters, making them co-dependent and non-deterministic. If you’re rolling out an entirely new environment, this likely means that you have to run puppet again and again across the machines until things appear to stop changing – and this is just _apparent_ convergence, rather than proven convergence.

We can go some way towards solving this issue by forcing the order in which machines are provisioned (or in which puppet is run on those machines). We wrote ‘puppet roll’, which executes puppet on hosts in order according to a dependency graph. But this is the wrong problem to be solving: eliminate provisioning order dependencies and we eliminate a difficult and brittle problem.

In recent work, we have rejected the traditional “exported resources” anti-pattern and instead have created a model of ‘our system’ entirely outside puppet. This means that we can build a model of the entire system which contains no mutable state. We wire this model up to puppet to generate ENC (external node classifier) data for each machine. All data needed for a machine’s state is supplied by the ENC, meaning that machines can be built in any order, and all in exactly one pass.

This entirely removes the key problems with determinism, convergence, multiple puppet runs and so on. In our experience it also, in many cases, radically simplifies things. Whereas previously we would have bent things to fit the model offered by exported resources, we can now write our business-specific logic in our own model layer – meaning that we can represent things as they naturally should be modelled.

Demonstration:

The thing we like most about puppet for individual systems and services is its declarative, model driven nature – thus we’ve tried to replicate something with a similar ‘feel’ at the whole system level.

Given this (somewhat simplified) description of a service:

stack 'example' do
  virtual_appserver 'exampleapp', :instances => 2
  loadbalancer :instances => 2
end

env 'ci', :location => 'dc1' do
  instantiate_stack 'example'
end

env 'production', :location => 'dc2' do
  instantiate_stack 'example'
end

The (again somewhat simplified) ENC generated for the two application servers looks like this:

---
role::http_app_server:
  environment: ci
  application: 'exampleapp'
  vip_fqdn: ci-exampleapp-vip.dc1.net.local

The ENC for the two load balancers looks like this:

---
role::loadbalancer:
  virtual_servers:
    ci-exampleapp-vip.dc1.net.local:
      type: http
      realservers:
        - ci-exampleapp-001.dc1.net.local
        - ci-exampleapp-002.dc1.net.local

This sort of configuration eliminates the need for a defined puppet run-order (as each host has all the details it needs to configure itself from the model – without any data being needed from other hosts), and goes a long way towards achieving the goal of complete determinism.
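For concreteness, here is a minimal sketch of how such an ENC might be wired up. Puppet simply calls an executable with the node’s certname and reads YAML from its stdout; everything else here (the Stacks::Model class, its lookup method and the field names) is hypothetical and stands in for our real model code:

#!/usr/bin/env ruby
# Hypothetical ENC script: puppet invokes it as `enc.rb <certname>` and expects
# YAML describing the node's classes and parameters on stdout.
require 'yaml'
require_relative 'stacks/model'   # our (hypothetical) out-of-puppet model

fqdn = ARGV.fetch(0)
node = Stacks::Model.load('stacks/*.rb').lookup(fqdn)   # resolve the host in the model

# Everything the node needs is computed from the model, never from other hosts,
# so puppet can run on any machine, in any order, in a single pass.
puts({
  'classes' => {
    node.role => node.enc_parameters   # e.g. 'role::loadbalancer' => { 'virtual_servers' => ... }
  }
}.to_yaml)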

The example shows a traditional load balancer to web server dependency; however, the same technique can be (and is) applied in our code wherever we have clusters of servers of the same type that need to inter-communicate, e.g. RabbitMQ, Ehcache and Elasticsearch clusters.

If you haven’t guessed yet, as well as being theoretically correct, this approach is immensely powerful:

  • We’re able to smoke test our puppet code in a real integration environment.
  • We can provision entire clusters of servers for QA or development purposes with 1 line of code and a < 10 minute wait.
  • We’ve used this system to build our newest production applications.
  • We can rebuild an entire app environment during scheduled maintenance.
  • We can add servers or resources to an application cluster with a 1 line code change and 1 command.

We’ve still got a lot of work to do on this system (and on our internal applications and puppet code before everything fits into it), but it is already quite obviously a completely different (and superior) model to traditional convergence for how to think about (and use) configuration management tools across many servers.

References:

  1. Why Order Matters: Turing Equivalence in Automated Systems Administration (USENIX 2002)

Facilitating Agility with Transparency

Part of the agile coaching work I do at TIM Group involves running a large number of Retrospectives and the (hopefully only) occasional Root Cause Analysis. Both of these generate actions designed to improve (and/or shore up) our processes so that we are constantly improving. These actions are supposed to be discrete, and done within a week of their assignment.

Over the last year or so, TIM Group has been moving to a more ‘start-up’ style organizational model. Prior to this, we had a stable two-week release cycle, and our development teams were quite static. This has changed now, and while some of the teams here still run retros on a two-week cycle, others are on a one-week cycle, and still others on a more ad-hoc basis. More importantly, the teams are a lot more fluid, with developers not just moving from one development team to another but also to our infrastructure team and back.

In a perfect world, this would not be a problem because actions are all done within a week.

Well, despite the ‘within a week’ expectation, actions had been piling up. Retro after retro would pass, and the ‘actions’ column would get clogged with outstanding items. In addition to this, the RCA actions were also not getting done. While it wouldn’t be fair to say this was a new problem, the new organization was aggravating the existing problem.

I had gotten into the habit of reminding each team about their retro first thing in the morning of their meeting. This helped get more discussion topics raised before the start of the meeting, which made the meeting go faster and more smoothly, but it wasn’t proving to be enough to actually get the actions done.

So I started sending out more and more specific reminders, looking at each board and naming the individuals who had outstanding actions.

As this activity took more and more of my time each retro day, I decided to build myself some help. Luckily, I had previously done some work with the APIs of our on-line Kanban tool. It was fairly simple to make a new version of the code that, instead of working with our taskboards, worked with the boards we used for our RCAs and retrospectives.

My initial idea was to simply find a way to generate (or at least partially generate) some of those reminders I was sending out to the teams. But once I had gotten the team notifications done, a pattern emerged — many people had actions across multiple teams. This was when it struck me. I was facilitating the teams, but the *individuals* were the ones who needed to get their actions done. I needed to make their lives easier if I wanted them to get the actions done.

The clear next step was to give each person their own ‘actions report’. Now, at the start of each week, instead of having to look in a bunch of different locations and trying to check if there were things they needed to do, each person who has any outstanding actions gets an e-mail. It clearly states which actions need to be done, including the action title and description, with a URL linking back to the exact card in question on the taskboard. *This* was getting somewhere. I got a lot of positive feedback from people. In fact, I got a number of people asking to put their own smaller-task or special project taskboards on the system so that they could get even more of their actions in one place.
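As a purely illustrative sketch (the Kanban tool’s real API is not shown, and the board and card fields are invented), the heart of such a report is just collecting outstanding action cards across all the boards and grouping them by assignee:

# Hypothetical sketch: gather outstanding action cards from several boards and
# group them by assignee, ready to be turned into one e-mail per person.
# 'fetch_action_cards' stands in for the Kanban tool's API client.
ActionCard = Struct.new(:title, :description, :assignee, :url)

def outstanding_actions_by_person(boards)
  boards.flat_map { |board| fetch_action_cards(board) }   # hypothetical API call
        .group_by(&:assignee)
end

def report_for(person, cards)
  lines = cards.map { |card| "- #{card.title}: #{card.description} (#{card.url})" }
  "Hi #{person}, you have #{cards.size} outstanding action(s):\n#{lines.join("\n")}"
end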

That was a big indicator that I’d done something right, people asking for more!

Of course, once I had a person-by-person action tally, it was a doddle to implement a simple piece of gamification: a leader-board, posted weekly, listing everyone who has yet to complete their actions, with the ‘top’ person having the most outstanding actions – a top position which has, incidentally, been occupied since inception by our very own CTO.

Next up? Implementing Markdown in the action reports, to improve readability, and team status pages to show which cards we have as ‘monitor’ cards, so we know what current issues we are keeping an eye on.