Human Error and Just Culture

Sidney Dekker’s Just Culture made me thankful I don’t work in an occupation with a high risk of impacting public safety (those described in the book include aviation, healthcare, and policing). In our society we believe that practitioners should be accountable for their actions, and that without legal consequences after a tragedy there would be no justice. The dilemma is that tragic outcomes are more likely to be the result of systemic issues than of bad actors, and the legal system is fundamentally unsuited to dealing with issues of systemic safety. Worse, the risk of legal consequences stifles learning, and so our search for justice makes tragic outcomes more likely, not less.

Reading Just Culture after Charles Perrow’s Normal Accidents was a serendipitous pairing. Normal Accidents illustrates very convincingly that safety is an issue that largely transcends our traditional idea of human error. It makes the case that some accidents are normal and expected because of the properties of the system, and that the easy finger pointing at the practitioners misses the real story. As we should already know from Deming and manufacturing, quality is a property of the system, not the people in the system.

Picking up from there, Just Culture shows how the concept of an accident doesn’t exist in law. There is always someone who was negligent, whether willfully or not, and that someone shall be held responsible. The law isn’t interested in the learning of the system. It isn’t really interested in the truth as most of us would understand it. It is about blame and about punishment.

How does your organization respond to a system outage? Are blame and finger-pointing the order of the day? We may not be subject to the criminalization of error described in Just Culture, but the organizational reflex can all too easily be to blame the developers, the testers, the system administrators, or others, when the focus should be on organizational learning, on fixing the system.

The idea of Blameless PostMortems is not new to TIM Group. We’ve done our best to use our RCAs as a tool for improving the system for several years now. Just Culture served as a reminder that we are fighting a cultural bias, and that we need vigilance to keep outdated ideas of human error from creeping back into our organization. The pressure toward blame is both pervasive and subtle. It would be easy to detect and fight if it were a case of managers asking “who screwed up?” It is harder when it seems like a virtue, when it is an engineer who is quick to assume responsibility for a mistake. A willingness to be self-critical is a valuable trait in each individual. The challenge is being able to look beyond the individual to the contribution of the larger system.

This is the balance we are trying to strike, between individuals who feel enough safety that they are willing to acknowledge their own contribution to the problem, and a system that doesn’t accept “human error” as a reason to avoid learning. We believe this is the path to a high-performing, and just, culture.

Report from DevOpsDays London 2013 Fall

This Monday and Tuesday a few of us went to DevOpsDays London 2013 Fall.

We asked every attendee for their highlights, and this is what they had to say about the conference:

Francesco Gigli: Security, DevOps & OWASP

There was an interesting talk about security and DevOps, and a follow-up during one of the open sessions.
We discussed capturing security-related work in user stories, or rather “Evil User Stories”, and the use of anti-personas as a way to keep malicious users in mind.
OWASP, which I had not heard of before DevOpsDays, also came up: it is an organization “focused on improving the security of software”. One of the resources it makes available is the OWASP Top 10 of the most critical web application security flaws. Very good for awareness.

Tom Denley: Failure Friday

I was fascinated to hear about “Failure Fridays” from Doug Barth at PagerDuty.  They take an hour out each week to deliberately failover components that they believe to be resilient.  The aim is not to take down production, but to expose unexpected failure modes in a system that is designed to be highly available, and to verify the operation of the monitoring/alerting tools.  If production does go down, better that it happens during office hours, when staff are available to make fixes, and in the knowledge of exactly what event triggered the downtime.

Jeffrey Fredrick: Failure Friday

I am very interested in the Failure Fridays. We already do a Failure Analysis for our application where we identify what we believe would happen with different components failing. My plan is that we will use one of these sessions to record our expectations and then try manually failing those components in production to see if our expectations are correct!
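As a minimal sketch of what recording those expectations might look like (the component names and expected behaviours below are invented for illustration, not our actual failure analysis):

```python
from dataclasses import dataclass

@dataclass
class FailureExpectation:
    component: str      # the component we will deliberately fail
    expected: str       # what we believe will happen when it fails
    observed: str = ""  # filled in during the Failure Friday session

    def confirmed(self) -> bool:
        """True once the drill matched what we predicted."""
        return self.observed == self.expected

# Hypothetical expectations, written down before the drill
expectations = [
    FailureExpectation("primary-db", "replica promoted within 30s"),
    FailureExpectation("app-server-1", "load balancer drops node, no user impact"),
]

# During the drill, record what actually happened
expectations[0].observed = "replica promoted within 30s"
expectations[1].observed = "load balancer kept routing to dead node"

for e in expectations:
    status = "OK" if e.confirmed() else "SURPRISE"
    print(f"{status:8} {e.component}: expected '{e.expected}', observed '{e.observed}'")
```

The value is in the mismatches: every “SURPRISE” line is a gap between the failure analysis and reality, and a candidate for the next round of fixes.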

Mehul Shah: Failure Fridays & The Network – The Next Frontier for Devops

I very much enjoyed the DevOpsDays. Apart from the fact that I won an HP Slate 7 in the HP free raffle, I drew comfort from the fact that ‘everyone’ is experiencing the same or similar problems to us, and it was good to talk and share that. It felt good to learn that we are not far from what most people are doing, with an emphasis on strong DevOps communication and collaboration. I really enjoyed most of the morning talks, in particular Failure Fridays and The Network – The Next Frontier for Devops, which was all about creating a logically centralized program to control the behaviour of an entire network. This will make networks easier to configure, manage, and debug. We are doing some cool stuff here at TIM Group (at least from my standpoint), but I am keen to see if we can work toward this as a goal.

Waseem Taj: Alerting & What science tells us about information infrastructure

At the open space session on alerting, there was a good discussion on adding context to alerts. One of the attendees mentioned that each alert they get has a link to a page that describes the likely business impact of the alert (why we think it is worth getting someone out of bed at 3am), a runbook with typical steps to take, and the escalation path. We have already started down the path of documenting how to respond to Nagios alerts; I believe expanding that to include the perceived business impact of the alert, and integrating it with Nagios, will be most helpful in moments of crisis in the middle of the night, when the brain just does not want to cooperate.
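One way to wire that context into Nagios itself is through the `notes` and `notes_url` directives on a service definition, which surface as links in the web UI and can be included in notifications. The host name, check, and wiki URL below are hypothetical, a sketch rather than our actual configuration:

```
define service {
    use                 generic-service
    host_name           app-server-1                ; hypothetical host
    service_description Order API health
    check_command       check_http!-u /health
    ; one-line summary of why this alert is worth waking someone for
    notes               Business impact: customers cannot place orders
    ; link to the page with full impact description, runbook and escalation path
    notes_url           https://wiki.example.com/runbooks/order-api
}
```

Keeping the impact statement and runbook link on the alert itself means the responder never has to go hunting for context at 3am.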

The talk by Mark Burgess on ‘What science tells us about information infrastructure’ certainly had its intended impact on me: I will be reading his new book on the subject.

You can find videos of all the talks and ignites on the DevOpsDays site.

Puppet Camp Barcelona

I recently had the pleasure of being asked to speak at Puppet Camp Barcelona. I’d submitted a talk a few months ago about some of the problems my team was having with our use of Puppet, and how we’re adapting the way we use it.

I was extremely pleased to be asked to present, and also extremely pleased that TIM Group was willing to fund my flights and give me the time to attend the conference.

I was pleased by how the presentation went, and I gained a whole bunch of ideas we hadn’t thought of from chatting to people afterwards. Judging from the official writeup, I think the talk was generally well received.

I’m looking forward to finding the time to write up further details of how we use Puppet at TIM Group, and what problems this solves for us.

Devopsdays London

This is a blog post that was written in 2013, but somehow was forgotten about. So here is a bit of history!

— Andrew Parker

Most of our Infrastructure team and a couple of developers we had seconded to the team all attended the Devopsdays London conference a couple of weeks ago.

There are already plenty of reviews and notes about the conference online; however, we made a set of our own as well.

I think everyone attending found the conference valuable, although for varying reasons (depending upon which sessions they attended). Personally I found the second day more valuable, with better talks and more interesting open space sessions (of those I attended). As I had expected (from my previous attendance at Devopsdays New York), I found the most value in networking and comparing the state of the art with what others are doing in automation, monitoring, and so on.

I was very pleased to find that TIM Group is actually among the leading companies implementing devops practices. I’m well aware that what we’re doing is a long way from perfect (I deal with it five days a week), but it’s refreshing to find that our practices are among the leaders, and that the issues we’re currently struggling with are relevant to many other people and teams.

I particularly enjoyed the open space discussion about estimating and planning Infrastructure and Operations projects – at the time we were at the end of a large project in which we’d tried a new planning process for the first time (and we had a number of reservations). The thoughts and ideas from the group helped shape our thinking about the problems we were trying to solve (both within the team, and in broadcasting progress information to the wider company).

Since then (in the last week) we have taken the time to step back and re-engineer our planning and estimation process. We’ve subsequently started work on another couple of projects with the modified planning and estimation process, and the initial feeling from the team is much more positive. Once we’ve completed the current projects and held a retrospective (and made more changes), I’ll write up the challenges we’ve faced in estimating and how we’ve overcome them – delivering accurate and consistent estimates in the face of unplanned work (e.g. outages, hardware failures, etc.) is even more challenging for operations projects than in an agile development organisation.