Choosing what work to do at TIM Group

TL;DR: Working at TIM Group means having the responsibility to decide what work to do. The most obvious criterion is business value, but I don’t think that is enough.

At TIM Group we have been experimenting with self-organisation for a while. It’s been a gradual process that started with the adoption of the Eight Behaviours for Smarter Teams, and whose latest steps were the removal of the line management structure in the technology department and a repeated message from the CEO insisting that employees take responsibility.

My personal experience has been of changing team and/or project every six months or so, which I find refreshing. Most of the time my moves were motivated by changing conditions and suggested by people higher in the hierarchy. A few times, most notably my last change of both team and project at once, the move was decided and executed without any direction from management. I have seen this happen multiple times since, to multiple people across multiple departments.

My colleagues and I have the responsibility of deciding what work to do. The executive team still has the authority to give orders and to fire people but that does not happen often. For all intents and purposes proposing projects, staffing those projects and delivering are now shared responsibilities.

Continue reading

Kik, left-pad… Should I stop using npm?

TL;DR: No, unless you make npm packages. If you do publish npm packages think about how the disputes are resolved and decide if you are OK with it.

I started using npm a few years ago in our build system. Today my CTO and his deputy told me that this means it is in production. I thought it was not, but on second thought I think they are right, since it decides what code ends up in the production artefacts.

A few days ago, while I was on holiday, two of my colleagues investigated a problem with npm regarding some dependencies that may contain malicious code. I’m the one who used it first. I felt responsible.

I read about the issue on GitHub and became more worried: npm allowing package names to be reused meant that it was unsafe to use. Due to my decision to adopt npm, more and more projects in the company depend on it. Was I responsible for us disseminating malware?

How did I feel at that time? Had I done my due diligence on npm? I felt I had not.

Or did I? After all, what is the difference between npm, Maven and GitHub? Don’t they all allow you to download some code identified by a name and a version number?

Continue reading

Using your values to choose

At the London Action Science Meetup in January we discussed the article Emotional Agility (HBR, Nov 2013), and in particular how to apply the four steps described in the article:

  1. recognize your patterns;
  2. label your thoughts and emotions;
  3. accept them; and
  4. act on your values

The importance of using your values is that they provide a stable point of reference, one not subject to day-to-day fluctuations: “The mind’s thought stream flows endlessly, and emotions change like the weather, but values can be called on at any time, in any situation.”

For me personally, one consequence is choosing between alternatives for positive reasons, not negative ones. This was useful for a recent decision at TIM Group and I used the occasion of our weekly Lightning Talks to share the experience more widely.

Rollbackability in upgrading a MySQL cluster

Here I give a lightning talk on how we mitigated the risk of upgrading the version of MySQL used in one of TIM Group’s most important databases.

Our approach to risk mitigation was Rollbackability. Rather than spend all our time ensuring that nothing would go wrong, we decided to plan the upgrade such that if anything did go wrong, we could roll back safely. One complicated replication topology, double the use of computing resources and a sprinkling of unsupported MySQL behaviour later, we safely upgraded versions with minimal impact to the business, knowing that we had “Undo” at every step of the way.

The diagrams used in the talk, showing each step in the process, are available here.

Human Error and Just Culture

Sidney Dekker’s Just Culture made me thankful I don’t work in an occupation with a high risk of impacting public safety (those described in the book include aviation, health-care, and policing). In our society we believe that practitioners should be accountable for their actions, that without legal consequences after a tragedy there would be no justice. The dilemma is that tragic outcomes are more likely to be the result of systemic issues rather than bad actors, and the legal system is fundamentally unsuitable for dealing with issues of systematic safety. Worse, the risk of legal consequences stifles learning, and so our search for justice makes tragic outcomes more likely, rather than less.

Reading Just Culture after Charles Perrow’s Normal Accidents was a serendipitous pairing. Normal Accidents illustrates very convincingly that safety is an issue that largely transcends our traditional idea of human error. It makes the case that some accidents are normal and expected because of the properties of the system, and that the easy finger pointing at the practitioners misses the real story. As we should already know from Deming and manufacturing, quality is a property of the system, not the people in the system.

Picking up from there, Just Culture shows how the concept of accident doesn’t exist in law. There is always someone who was negligent, either willfully or not, and that someone shall be held responsible. The law isn’t interested in the learning of the system. It isn’t really interested in the truth as most of us would understand it. It is really about blame and about punishment.

How does your organization respond to a system outage? Are blame and finger-pointing the order of the day? We may not be subject to the criminalization of error described in Just Culture, but the organizational reflex can all too easily be to blame the developers, the testers, the system administrators, or others, when the focus should be on organizational learning, on fixing the system.

The idea of Blameless PostMortems is not new to TIM Group. We’ve done our best to use our RCAs as a tool for improving the system for several years now. Just Culture served as a reminder that we are fighting a cultural bias, and we need vigilance to avoid outdated ideas of human error creeping back into our organization. The pressure to do so is both pervasive and subtle. It would be easy to detect and fight if it were a case of managers asking “who screwed up?” It is harder when it seems like a virtue, when it is an engineer who is quick to assume responsibility for a mistake. It is a valuable trait when each individual is willing to be self-critical. The challenge is being able to look beyond the individual to the contribution of the larger system.

This is the balance we are trying to strike, between individuals who feel enough safety that they are willing to acknowledge their own contribution to the problem, and a system that doesn’t accept “human error” as a reason to avoid learning. We believe this is the path to a high-performing, and just, culture.

SoCraTes Germany

SoCraTes Germany 2015 has again been a conference full of superlatives — or should I say “unconference”, given that it consists mainly of a two-day Open Space? Whatever you call it, we had more participants (over 180), more proposed sessions (90–100 per day), and more sponsors than ever.

Speaking of sponsors: Of course TIM Group was among them, and all contributions were spent on reducing hotel bills for participants. Not that it was an expensive event anyway: the participants cover their hotel room, breakfast and dinner, and that’s it. But let me start at the beginning…

Six or maybe even seven years ago, my boyfriend Andreas Leidig decided to create a conference for developers — after all, developers deliver the software. If you remove all managers from a software project, chances are you might get something usable out of it, but if you remove all developers, well, I’d say your chances are zero. There were lots of agile events for managers, coaches and other “tree-huggers”, but no general-purpose developer conferences along the lines of software craftsmanship and improvement.

So Andreas created a developer conference, and named it SoCraTes: Software Craftsmanship and Testing Conference. It was designed as a two-day event falling on a Friday and Saturday so that both the developer and the developer’s employer would contribute time to the event. Arrival on the previous evening was mandatory in order to start the first day early and in a relaxed fashion. Another important factor was that it was held off-site — everybody stayed in the conference hotel, which we had to ourselves, allowing us to be together from the first yoga and jogging workout before breakfast to the last beer, song, or boardgame long after midnight.

We started our first SoCraTes in 2011, with the help of a small handful of friends. The first year we had one day of talks and one day of Open Space, because we were afraid companies would not let their employees attend a conference without a programme. The feedback of the 50+ participants reassured us we didn’t need the programmed talks, and from the second year we ran SoCraTes as a two-day Open Space. We sold out in the second year with about 75 participants, confirming our decision. On Saturday evening of the second year, a group of participants decided that they had not yet had enough, so they extended their stay and spontaneously ran a Coderetreat on Sunday. This Coderetreat is now an integral (though optional) part of SoCraTes, accompanied by workshops and other half- or full-day sessions.

In the third year, we sold out so quickly (literally within a minute) that we felt the need to do something. Of course it was nice for everybody who could register, but thinking of those who could not get a spot diminished the joy quite a bit. Also, the SoCraTes “brand” had expanded to the UK, where a sibling event was being run by the London Software Craftsmanship Community. Meeting friends from abroad was an important part of attending SoCraTes; how could this reliably happen with such high demand and such a small number of participant slots?

After some discussion among the organizers, we decided to experiment with increasing the size of the event. We were curious — and also a bit tense — because we wondered whether it would still be the same event, with the same familiar feel we had come to love so much.

The fourth SoCraTes took place last year with twice as many participants, approximately 150. And it was amazing! Of course the event felt different. I was not able to speak to everybody any more. Everything took longer, from the signup to the marketplace to walking around the new hotel, which was of course a bit bigger than the previous one. But at the same time, the SoCraTes spirit was there just like the years before, as the seasoned participants carried it over while the newbies blended in naturally. We heard many positive voices and read enthusiastic blog posts. Our experiment had been a success, and so we did not think twice about continuing in this direction. In 2015, we increased the number of participants to 180, and next year we will try to cross the 200 mark. Although I will slowly pass the management of SoCraTes on to the next “generation” of organizers, I’m looking forward to seeing SoCraTes thrive together with its sibling conferences that sprout all over Europe.

Again, my thanks go to TIM Group and all our other sponsors for supporting SoCraTes and encouraging their employees to participate!

Oh, by the way, are you curious now? Please feel free to have a look at http://www.socrates-conference.de.

Joda Time? You might be overusing DateTime

We generally prefer to use Joda-Time for our date and time representation. Its immutable objects fit our house style, and the plusHours, plusDays, etc. methods usually produce much more readable and maintainable code than the giant bag of static methods based on Date and Calendar that we had before. Throw in easy construction from components, thread-safe parsers and formatters, a richer type catalogue for representing date-only, time-only and timezone-less values, and integrated conversion to/from ISO-style strings, and working with date and time values becomes more comfortable and less fragile.

This snippet is fairly typical:

DateTime now = new DateTime(2015, 4, 13, 9, 15 /*default timezone*/);
DateTime laterToday = now.withTime(17, 30, 0, 0);
DateTime tomorrow = now.plusDays(1);
DateTime muchLater = now.plusMonths(6);

In principle, a DateTime contains a date, a time and a timezone. Together these determine an instant: a number of milliseconds since the epoch, 1st Jan 1970 00:00 UTC.

Actually, that’s not quite correct: there are some date/time/timezone combinations that don’t correspond to any instant, and some that ambiguously reference multiple instants. This occurs when the timezone includes rules for daylight savings time, producing gaps where the clocks go forward and overlaps when the clocks go back. (Consider how you would interpret “2015-03-29T01:30:00” as London time, or “2015-10-25T01:30:00”.)
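
Joda-Time rejects the gap outright (constructing a DateTime at a non-existent local time throws IllegalInstantException), while its JDK successor, java.time, resolves both cases silently. The following sketch uses java.time rather than Joda-Time, purely because it ships with the JDK and makes the two situations easy to demonstrate:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstEdgeCases {
    public static void main(String[] args) {
        ZoneId london = ZoneId.of("Europe/London");

        // Gap: London clocks sprang forward from 01:00 to 02:00 on
        // 2015-03-29, so 01:30 local time never happened. java.time
        // resolves it by shifting forward by the length of the gap.
        ZonedDateTime gap = LocalDateTime.of(2015, 3, 29, 1, 30).atZone(london);
        System.out.println(gap); // 2015-03-29T02:30+01:00[Europe/London]

        // Overlap: clocks fell back from 02:00 to 01:00 on 2015-10-25,
        // so 01:30 local time happened twice. java.time picks the
        // earlier offset (+01:00, i.e. still BST) unless told otherwise.
        ZonedDateTime overlap = LocalDateTime.of(2015, 10, 25, 1, 30).atZone(london);
        System.out.println(overlap);                            // 2015-10-25T01:30+01:00[Europe/London]
        System.out.println(overlap.withLaterOffsetAtOverlap()); // 2015-10-25T01:30Z[Europe/London]
    }
}
```

Whichever library you use, the lesson is the same: a local date-time plus a zone is not always one unambiguous instant.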

Here’s a nasty thing that can occur when you use DateTime, though. If you wrote this:

assertThat(new DateTime(2008, 6, 1, 14, 0, 0) /* Europe/London */,
    equalTo(DateTime.parse("2008-06-01T14:00:00+01:00")));

Then (if your system timezone is “Europe/London”, or you explicitly passed it in as the parameter) you would get this highly unhelpful failure:

Expected: <2008-06-01T14:00:00.000+01:00>
     but: was <2008-06-01T14:00:00.000+01:00>
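
The two values represent the same instant and even print identically, but DateTime.equals also compares the chronology, and the parsed value carries a fixed-offset timezone rather than Europe/London. The usual fix is to compare instants instead, for example with isEqual or by converting via toInstant. The same trap, and the same fix, exist in java.time; the following is an illustrative sketch using java.time, since it is in the JDK and needs no extra dependency:

```java
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class InstantEquality {
    public static void main(String[] args) {
        // The same instant, but one value carries the region zone and
        // the other only the fixed offset the parser saw.
        ZonedDateTime zoned = ZonedDateTime.of(2008, 6, 1, 14, 0, 0, 0,
                ZoneId.of("Europe/London"));
        ZonedDateTime parsed = OffsetDateTime.parse("2008-06-01T14:00:00+01:00")
                .toZonedDateTime();

        System.out.println(zoned.equals(parsed));  // false: the zones differ
        System.out.println(zoned.isEqual(parsed)); // true: the instants match
        System.out.println(zoned.toInstant().equals(parsed.toInstant())); // true
    }
}
```

So when a test only cares about the point in time, assert on the instant, not on the full zoned value.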

Continue reading

TRUNCATE making replicated MySQL table inconsistent

Here at TIM Group we make use of MySQL’s statement-based replication. This means that some functions, like UUID and LOAD_FILE, cannot be used when we write code or do manual maintenance because they break the consistency of the slaves. We now know we have to add the TRUNCATE statement to the list of things not to use. This is how it went…

One of the applications we have is used for background jobs: you send it an HTTP request to have a job started and then you look at its database for the results.

We needed the result of running one of these jobs for some analysis. The instructions looked like:

1) TRUNCATE table job_results
2) send request to /jobs/new
3) extract content of job_results table

We started following the instructions but after a few minutes of staring at the terminal waiting for the request to complete we thought that the request had a problem. We hit Ctrl-C, killing the request, ran TRUNCATE job_results again and restarted the request.

Our second attempt was faster to respond: after a few minutes it returned an error.

And after a few more minutes the two slave databases started reporting an error with statement-based replication: they could not insert a record into the job_results table because of a duplicate primary key.

Continue reading

From structure to behaviour – migrating to an Event Sourced system

For a few years now we have identified Event Sourcing as being a good fit for the sort of applications we build at TIM Group. This is becoming increasingly important as the business has been transitioning from an alpha capture and distribution tool to a platform that provides analytics as well.

IdeasFX, one of our newer initiatives, was exhibiting various issues even in its infancy. Although we had some stability and scalability issues, the largest negative impact on the team, as its members reported, was the very confused “domain model”. The impact came in the form of slower development cycles, caused by a large amount of accidental complexity, and of low team morale for the same reason. Although the product was not making much money and almost certainly could have continued as it was, we needed a product to demonstrate the value of a domain-centric, Event Sourced approach so that we could later apply it to more of our core products.

So the team went to work. We started by researching CQRS/Event Sourcing topics: we bought Greg Young’s and some of Udi Dahan’s online courses, we read books on Domain Driven Design and Event Sourcing. We were lucky enough to have Greg in for a couple of days during our technology summit to help us clarify our thoughts.

Finally, we started a series of modelling sessions to try and formulate the concepts we were working with. These sessions turned out to be invaluable: we were lucky enough to have product people with us who truly understood the domain and were able to help the team reach a shared understanding of the logical system and a ubiquitous language.

Shortly after these initial sessions, we began work to improve the system!

We intend to publish a series of posts explaining techniques we are using and our reasoning for taking the approaches we did, starting with the “Event Store Adapter”.

Workflow for deploying potentially unstable Tucker components

When adding a Tucker component to an application, it’s important to get the workflow right so as to avoid contributing noise to our already noisy alerting. Now that support for that workflow is built into Tucker, it’s easier to get right than ever before.

Problem

Occasionally, I’ve added a Tucker component to monitor the status of an application, but when I deployed it to a realistic environment, I discovered an assumption I made about the system didn’t hold. The component raised a warning unexpectedly, and contributed to the spurious noise across our alerting systems. I wanted to find a way to develop new monitoring components that could be introduced without causing this noise.

Examples of the kind of monitoring where this could be an issue include:

  • checking a business rule holds by querying the database
  • monitoring the availability of a service required by the application
  • ensuring background tasks are running and completing successfully

In each case it’s possible that you assumed some behaviour while developing and testing your new component locally. Perhaps data in the database already violates the business rule unexpectedly, or the service is less reliable than you thought, or background tasks don’t complete in the time you expect them to. The component may or may not be alerting correctly, but until you can achieve a clean slate, it will be causing alerting errors in Nagios. Leaving components in this state for any length of time is undesirable as it leads to:

  • other alerts from the same application being hidden[0]
  • alert fatigue (aka the “boy who cried wolf” effect)

Solution

A better workflow is to deploy the Tucker component such that it will make the same checks as usual, and display the same result, but never change to a state that causes an alert. This way you can observe how it behaves in a realistic environment, without it causing any alerts, until you decide it is “stable”.

This has been a reasonably easy workflow to follow. It’s not too much effort to write and test a component that behaves as usual, then, when wiring it up at application startup, to configure it in such a way that the status will always be INFO or OK. Once you’re happy the component is stable, a minor code change “switches it on” so that it can begin alerting as you intended. This workflow has now been incorporated into the Tucker library, so it’s even easier.

How to use

When adding your component to the Tucker configuration, decorate with a PendingComponent, like so:

import com.timgroup.tucker.info.component.pending.PendingComponent;

tucker.addComponent(new PendingComponent(new PotentiallyUnstableComponent()));

The PendingComponent will always return Status.INFO, regardless of the actual status, preventing premature warnings. However, it will display the value and the real status from the underlying component, as shown here:

[Screenshot: a pending component on the status page, reporting INFO while displaying the underlying component’s real status and value]
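
Under the hood this is a straightforward use of the decorator pattern. The following is a minimal sketch of the idea, with simplified stand-ins for Tucker’s Component and Status types (the real library’s API differs in its details):

```java
public class PendingDemo {
    enum Status { OK, INFO, WARNING, CRITICAL }

    // Stand-in for a component's report: a status plus a human-readable value.
    record Report(Status status, String value) {}

    // Stand-in for a Tucker component.
    interface Component {
        String label();
        Report getReport();
    }

    // Decorator: runs the real check and reports its value, but never
    // escalates beyond INFO, so a misbehaving new check cannot alert.
    static class PendingComponent implements Component {
        private final Component underlying;

        PendingComponent(Component underlying) {
            this.underlying = underlying;
        }

        @Override
        public String label() {
            return underlying.label() + " (pending)";
        }

        @Override
        public Report getReport() {
            Report real = underlying.getReport();
            return new Report(Status.INFO,
                    real.value() + " [real status: " + real.status() + "]");
        }
    }
}
```

Because the decorator and the real component share an interface, “switching it on” later is just a one-line change at the point where the component is registered.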

In the case of the less-reliable-than-expected service, we would not spend all day refreshing our application’s status page in a browser just to watch the Tucker component change, so it’s helpful to log any state changes that would have occurred. That allows you to look back after the component has spent some time “soaking”, to see how it would have alerted. PendingComponent will log a structured event, viewable in Kibana, exemplified by the following:

{
   "eventType":"ComponentStateChange",
   "event":{
     "id":"potentially-unstable-component",
     "label":"Potentially Unstable Component",
     "previousStatus":"OK",
     "previousValue":"something which warrants an ok status",
     "currentStatus":"CRITICAL",
     "currentValue":"something which warrants a critical status"
   }
 }

Once your component is stable, and you’re confident that a WARNING or CRITICAL status will be real, you can easily “undecorate” it by removing the PendingComponent wrapper. Assuming you have tested your component directly, it should be the only code change you need to make at this point.


PendingComponent is available from Tucker since version 1.0.326.


[0] a known weakness of the Nagios/Tucker monitoring system. Nagios is limited to only considering one status from an entire Tucker status page. This means if your component is in a WARNING state, it will only alert once, even if other components change to WARNING.