Towards A Rich Domain Model for Infrastructure

We suffer from a number of issues around how we provision the environments that host our software. One of the most annoying is that we need to make sure puppet runs complete in the correct order, because different role types “export” resources to PuppetDB which are then consumed by other role types. For example, a puppet run on an application server might produce information that is later consumed by the load balancer. When this pattern is used in many places, a complex dependency graph is built up that dictates the order in which puppet must run to converge on the correct configuration. We developed a tool called PuppetRoll that understands these dependencies, forces puppet runs to happen in the correct order, and ensures that each run completes without error before allowing dependent role types to run. This works, but it leaves the concept of an application's stack as a disjointed collection of low-level concepts.

In order to create a more scalable and more deterministic infrastructure, we think a central description of the environments we want to build will help. We want to create a model of our environments and then derive from it the data that populates specific puppet resources on specific machines. With this model, building new environments will be far less constrained by puppet run order and will involve less duplication of data.

I will describe an example of this approach with a simple ruby DSL that exposes its data through puppet functions to populate networking resources.

This piece of configuration:

environment "dev" do
  virtual_service "myservice", :vip=>"10.1.1.3.1", :port=>80 do
    realserver, :ipaddress=>"10.1.1.3.2"
    realserver, :ipaddress=>"10.1.1.3.3"
  end

  virtual_service "myservice2", :vip=>"10.1.1.4.1", :port=>80 do
    realserver, :ipaddress=>"10.1.1.4.2"
    realserver, :ipaddress=>"10.1.1.4.3"
  end
end

Describes an environment that contains two VirtualServices. Each VirtualService has a virtual IP address (a VIP) and a port on which to expose the service. It also contains a list of RealServers that back the virtual service.
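
A DSL like this can be backed by a couple of small builder classes. The sketch below is purely illustrative (the class names and the instance_eval approach are assumptions about how it could be implemented, not the real code):

class VirtualService
  attr_reader :name, :vip, :port, :realservers

  def initialize(name, options)
    @name        = name
    @vip         = options[:vip]
    @port        = options[:port]
    @realservers = []
  end

  # Called from within the virtual_service block in the DSL.
  def realserver(options)
    @realservers << options[:ipaddress]
  end
end

class Environment
  attr_reader :name, :virtual_services

  def initialize(name)
    @name             = name
    @virtual_services = []
  end

  def virtual_service(name, options, &block)
    service = VirtualService.new(name, options)
    service.instance_eval(&block) if block
    @virtual_services << service
  end
end

# Top-level entry point used by the configuration above.
def environment(name, &block)
  env = Environment.new(name)
  env.instance_eval(&block)
  env
end

Evaluating the configuration then yields an in-memory Environment object that the rest of the tooling can query.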

We should be able to generate the networking configuration for the load balancers like this:

networking:
  eth1:
    ipaddress: '10.1.3.1'
  eth1:1:
    ipaddress: '10.1.4.1'

The model would also generate networking configuration for the real servers in “myservice” that looks like this:

for real server 1:

networking:
  lo:0:
    ipaddress: '10.1.3.1'
  eth1:
    ipaddress: '10.1.3.2'

and for real server 2:

networking:
  lo:0:
    ipaddress: '10.1.3.1'
  eth1:
    ipaddress: '10.1.3.3'

The puppet code could look like this:

   create_resources("networking::interface", networking($::fqdn))
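
To make the example concrete, here is a sketch of how those per-host hashes could be derived from the model built by the DSL above. The method names are illustrative, and resolving the calling node to its entry in the model (by fqdn or primary IP address, say) is assumed; a custom puppet function such as networking() would wrap logic like this:

# Sketch only: deriving the networking hashes shown above from the model.
# Builds on the illustrative Environment/VirtualService classes sketched earlier.

# The load balancer carries every VIP in the environment, one per interface alias.
def loadbalancer_networking(environment)
  interfaces = {}
  environment.virtual_services.each_with_index do |service, index|
    device = index.zero? ? 'eth1' : "eth1:#{index}"
    interfaces[device] = { 'ipaddress' => service.vip }
  end
  interfaces
end

# A real server binds its service's VIP on lo:0 and its own address on eth1.
def realserver_networking(environment, ipaddress)
  service = environment.virtual_services.find do |candidate|
    candidate.realservers.include?(ipaddress)
  end
  {
    'lo:0' => { 'ipaddress' => service.vip },
    'eth1' => { 'ipaddress' => ipaddress }
  }
end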

We could, of course, use an ENC (External Node Classifier) to push the correct configuration onto the node. And the use of static IP addresses is far from ideal!

So, this very simple example got us thinking about how we could extend the concept to describe entire complex environments, with each node pulling in the configuration it needs at build time. Further, the provisioning of the environment could also be driven from this model: for example, the model says “myservice” needs two app servers, an audit discovers only one exists, so a new one is provisioned (similar to Orc).
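
That audit-and-provision step could be little more than a diff between the model and reality. A sketch, assuming the audit yields the list of real server addresses that actually exist:

# Sketch: derive a provisioning plan by diffing the model against an audit.
def provisioning_plan(environment, audited_ipaddresses)
  environment.virtual_services.flat_map do |service|
    (service.realservers - audited_ipaddresses).map do |ipaddress|
      { :action => :provision, :service => service.name, :ipaddress => ipaddress }
    end
  end
end

Anything in the resulting plan would then be handed to whatever actually builds machines.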

Taking this to its logical conclusion, we would end up with a rich domain model backed by a normalized datastore: it would model our environment and possess the capability to manipulate that data into hashes that can be used to populate puppet resources. A rough draft of this looks like this:

In this model a Server can be either a LoadBalancer or a RealServer, and the correct networking configuration can be generated for both types of server. A Server lives in an Environment, and an Environment has many VirtualServices. Each VirtualService in turn depends on a number of other VirtualServices, which allows us to generate the correct set of firewall rules, and even app server configuration files, so that we can “wire” our environment together with minimal effort!
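
To illustrate that last point, once a VirtualService knows which other VirtualServices it depends on, firewall rules fall out of a simple walk over those dependencies. A sketch, using plain hashes and an illustrative rule format:

# Sketch: firewall rules derived from VirtualService dependencies.
def firewall_rules(virtual_service)
  virtual_service[:depends_on].map do |dependency|
    {
      'destination' => dependency[:vip],
      'dport'       => dependency[:port],
      'action'      => 'accept'
    }
  end
end

myservice  = { :name => 'myservice',  :vip => '10.1.3.1', :port => 80, :depends_on => [] }
myservice2 = { :name => 'myservice2', :vip => '10.1.4.1', :port => 80, :depends_on => [] }
myservice[:depends_on] << myservice2

firewall_rules(myservice)
# => [{"destination"=>"10.1.4.1", "dport"=>80, "action"=>"accept"}]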

Introducing Orc and its agents

[Diagram: the central orchestration tool, its agents and the cmdb]

This post is the third part of a series documented here (part 1) and here (part 2).
Orc at a high level

In the previous post we discussed the Application Infrastructure Contracts. These contracts mean that new applications can be deployed to production with minimal effort, because the infrastructure tools can make assumptions about their behaviour. In this post we will discuss the tool-set that leverages them.

Orc is split into three main components (or component types): the central orchestration tool, its agents, and its model (the cmdb). The diagram above shows this.

Each box must run an agent that is capable of “auditing” itself. This means it should be able to report whether an application is running, what version it is on, whether it is in or out of the load balancer pool, whether it is stoppable, and so on. For different types of components this audit information will differ; for a database component, for example, there is no concept of stoppable.
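
As a concrete illustration, an audit reply for a single application instance might look something like the hash below (the field names and wire format are assumptions, not what the agents actually send):

# Illustrative only: the kind of state an agent could report for one instance.
audit_reply = {
  :host          => 'HostA',
  :application   => 'myservice',
  :running       => true,
  :version       => '5',
  :participating => false,  # in or out of the load balancer pool
  :stoppable     => true    # would be absent for, say, a database component
}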

Orc is a model-driven tool. It continually audits the environment by sending messages to all nodes, and each node responds with its current state. Orc then compares the information retrieved from the audit with its model (the cmdb) and decides, for each node, what action to take, if any. Orc reviews each action for conformance with its policies and removes any illegal actions (such as removing all instances from the load balancer). It then sends messages to the agents to perform the remaining actions. Currently we have one simple policy: “Never leave zero instances in the load balancer”.
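
In pseudo-Ruby, one pass of that loop might be sketched as follows (audit_all, next_action and execute are hypothetical names, not Orc's actual API; next_action, the per-instance decision, is sketched at the end of this post):

# Sketch of one Orc pass: audit, compare with the cmdb, filter through
# policy, execute what is left. All method names are illustrative.
def orc_pass(cmdb, agents)
  audited  = agents.audit_all                                   # actual state of the world
  proposed = audited.map { |instance| next_action(cmdb, instance) }.compact
  apply_policy(proposed, audited).each { |action| agents.execute(action) }
end

# The single policy today: never leave zero instances in the load balancer.
# Disabling participation is blocked unless enough instances would remain.
def apply_policy(proposed, audited)
  disables      = proposed.count { |action| action[:type] == :disable_participation }
  participating = audited.count { |instance| instance[:participating] }
  return proposed if participating - disables >= 1
  proposed.reject { |action| action[:type] == :disable_participation }
end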

The final component is the model of the desired state of the world (the cmdb). It records which applications should be on which versions and whether they should be participating in the load balanced service or not. This is currently a simple yaml file in a git repository; keeping it in git gives us a desirable side-effect in that changes to the cmdb are audited.
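
For illustration, the desired state for one application might boil down to nothing more than a target version and a participation flag, shown here as the Ruby hash the yaml would parse into (the key names are a guess, not the real file format):

# Illustrative only: one application's desired state as held in the cmdb.
cmdb = {
  'myservice' => {
    'target_version'       => '6',
    'target_participation' => true
  }
}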

So now we have a way to audit the actual state of the world (via our agents), and we know what the world is supposed to look like (the cmdb). All that is left is to execute the upgrade steps in the correct order without breaking our policies (well, the one policy at the moment).

This diagram explains the transitions that Orc makes given any particular state of any one instance:

So if the world started out looking like this:

Then HostA and HostB would be instructed to upgrade (install) to version 6, as they are on the wrong version and are not participating. The action for HostC and HostD would be to disable participation, but this is currently blocked as it would violate the policy (there must be at least one instance in the load balancer).

Next, participation is enabled for HostA and HostB; HostC and HostD remain blocked, awaiting more instances in the load balancer.

Finally, HostA and HostB have both come in line with the cmdb, so no further action is taken for them; this leaves the final action of disabling participation on HostC and HostD.
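
The transitions walked through above can be captured in a small per-instance decision function. A sketch, reusing the illustrative audit and cmdb shapes from earlier (this is not Orc's actual code):

# Sketch of Orc's per-instance decision, following the transitions above.
def next_action(cmdb, instance)
  desired           = cmdb[instance[:application]]
  on_target_version = instance[:version] == desired['target_version']

  if !on_target_version && !instance[:participating]
    { :type => :upgrade, :host => instance[:host], :version => desired['target_version'] }
  elsif on_target_version && !instance[:participating] && desired['target_participation']
    { :type => :enable_participation, :host => instance[:host] }
  elsif !on_target_version && instance[:participating]
    { :type => :disable_participation, :host => instance[:host] }
  else
    nil  # already in line with the cmdb: nothing to do
  end
end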

This concludes our introduction. Future posts will look at further work with Orc, in particular respecting component dependencies and performing database upgrades.