Choosing what work to do at TIM Group

TL;DR: Working at TIM Group means having the responsibility to decide what work to do. The most obvious criterion is business value, but I don’t think that is enough.

At TIM Group we have been experimenting with self-organisation for a while. It has been a gradual process that started with the adoption of the Eight Behaviours for Smarter Teams. Its latest steps were the removal of the line management structure in the technology department and a repeated message from the CEO insisting that employees take responsibility.

My personal experience has been of changing team and/or project every six months or so, which I find refreshing. Most of the time my moves were motivated by changing conditions and suggested by people higher in the hierarchy. A few times, most notably my last change of both team and project at the same time, the move was decided and executed without direction from management. I have seen this happen multiple times since, to multiple people across multiple departments.

My colleagues and I have the responsibility of deciding what work to do. The executive team still has the authority to give orders and to fire people but that does not happen often. For all intents and purposes proposing projects, staffing those projects and delivering are now shared responsibilities.

Kik, left-pad… Should I stop using npm?

TL;DR: No, unless you make npm packages. If you do publish npm packages think about how the disputes are resolved and decide if you are OK with it.

I started using npm a few years ago in our build system. My CTO and his deputy today told me that means it is in production. I thought it was not, but on second thought I think they are right since it decides what code is in the production artefacts.

A few days ago, while I was on holiday, two of my colleagues investigated a problem with npm regarding some dependencies that may contain malicious code. I’m the one who used it first. I felt responsible.

I read about the issue on GitHub and became more worried: npm allowing the reuse of names meant that it was unsafe to use. Because of my decision to adopt npm, more and more projects in the company depend on it. Was I responsible for us disseminating malware?

How did you feel at that time? Had you done your due diligence on npm? I felt I had not.

Or did I? After all, what is the difference between npm, Maven and GitHub? Don’t they all allow you to download some code identified by a name and a version number?

TRUNCATE making replicated MySQL table inconsistent

Here at TIM Group we make use of MySQL’s statement-based replication. This means that some functions, like UUID and LOAD_FILE, cannot be used when we write code or do manual maintenance because they break the consistency of the slaves. We now know we have to add the TRUNCATE statement to the list of things not to use. This is how it went…
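Before getting to the story, a brief aside on why a function like UUID is a problem under statement-based replication: each server re-executes the same statement, so anything non-deterministic is evaluated independently on the master and on every slave. The snippet below is only a toy simulation of that idea, assuming a dictionary stands in for a table; the names `replicate`, `insert_fixed` and `insert_uuid` are hypothetical and nothing here talks to MySQL.

```python
import uuid
from typing import Callable, Dict

# A toy "table": primary key -> value. This is an assumption for illustration,
# not how MySQL stores data.
Table = Dict[int, str]


def replicate(statement: Callable[[Table], None], master: Table, slave: Table) -> None:
    """Mimic statement-based replication: run the same statement on each server."""
    statement(master)
    statement(slave)


def insert_fixed(table: Table) -> None:
    # Deterministic statement: every server ends up with the same row.
    table[1] = "job-42"


def insert_uuid(table: Table) -> None:
    # Non-deterministic statement: each server generates its own value,
    # in the same way UUID() would be evaluated separately on each server.
    table[2] = str(uuid.uuid4())


master: Table = {}
slave: Table = {}

replicate(insert_fixed, master, slave)
in_sync_after_fixed = master == slave

replicate(insert_uuid, master, slave)
in_sync_after_uuid = master == slave

print("in sync after deterministic insert:", in_sync_after_fixed)
print("in sync after UUID insert:", in_sync_after_uuid)
```

The deterministic insert leaves both copies identical, while the UUID insert leaves them with different values under the same key. That silent divergence is the kind of inconsistency we try to avoid.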

One of the applications we have is used for background jobs: you send it an HTTP request to have a job started and then you look at its database for the results.

We needed the result of running one of these jobs for some analysis. The instructions looked like:

1) TRUNCATE table job_results
2) send request to /jobs/new
3) extract content of job_results table

We started following the instructions but after a few minutes of staring at the terminal waiting for the request to complete we thought that the request had a problem. We hit Ctrl-C, killing the request, ran TRUNCATE job_results again and restarted the request.

Our second attempt was faster to respond: after a few minutes it returned an error.

After a few more minutes the two slave databases started reporting an error with statement-based replication: they could not insert a record into the job_results table because of a duplicate primary key.

Report from DevOpsDays London 2013 Fall

This Monday and Tuesday a few of us went to DevOpsDays London 2013 Fall.

We asked every attendee for their highlights, and this is what they had to say about the conference:

Francesco Gigli: Security, DevOps & OWASP

There was an interesting talk about security and DevOps, with a follow-up during one of the open sessions.
We discussed capturing security-related work in user stories, or rather “Evil User Stories”, and the use of anti-personas as a way to keep malicious users in mind.
OWASP, which I did not know before DevOpsDays, was also mentioned: it is an organization “focused on improving the security of software”. One of the resources it makes available is the OWASP Top 10 of the most critical web application security flaws. Very good for awareness.

Tom Denley: Failure Friday

I was fascinated to hear about “Failure Fridays” from Doug Barth at PagerDuty. They take an hour out each week to deliberately fail over components that they believe to be resilient. The aim is not to take down production, but to expose unexpected failure modes in a system that is designed to be highly available, and to verify the operation of the monitoring/alerting tools. If production does go down, better that it happens during office hours, when staff are available to make fixes, and in the knowledge of exactly what event triggered the downtime.

Jeffrey Fredrick: Failure Friday

I am very interested in the Failure Fridays. We already do a Failure Analysis for our application where we identify what we believe would happen with different components failing. My plan is that we will use one of these sessions to record our expectations and then try manually failing those components in production to see if our expectations are correct!

Mehul Shah: Failure Fridays & The Network – The Next Frontier for Devops

I very much enjoyed DevOpsDays. Apart from the fact that I won an HP Slate 7 in the HP free raffle, I drew comfort from the fact that ‘everyone’ is experiencing the same or similar problems to ours, and it was good to talk and share that stuff. It felt good to understand that we are not far from what most people are doing: emphasizing strong DevOps communication and collaboration. I really enjoyed most of the morning talks, in particular Failure Fridays and The Network – The Next Frontier for Devops, which was all about creating a logically centralized program to control the behaviour of an entire network. This will make networks easier to configure, manage and debug. We are doing some cool stuff here at TIM Group (at least from my standpoint), but I am keen to see if we can work toward this as a goal.

Waseem Taj: Alerting & What science tells us about information infrastructure

At the open space session on alerting, there was a good discussion on adding context to alerts. One of the attendees mentioned that each alert they get has a link to a page that describes the likely business impact of the alert (why we think it is worth getting someone out of bed at 3am), a run book with typical steps to take, and the escalation path. We have already started documenting how to respond to Nagios alerts. I believe expanding that documentation to include the perceived ‘business impact of the alert’, and integrating it with Nagios, will be most helpful in moments of crisis in the middle of the night, when the brain just does not want to cooperate.
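To make the idea concrete, here is a minimal sketch of the alert context described above: business impact, run book steps and escalation path bundled with the alert. The `AlertContext` class, its field names and the example values are all hypothetical, and this is not something Nagios provides out of the box.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertContext:
    """Hypothetical bundle of the context an on-call person needs at 3am."""
    name: str
    business_impact: str
    runbook_steps: List[str] = field(default_factory=list)
    escalation_path: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Build a plain-text summary that could be attached to a notification.
        lines = [
            f"ALERT: {self.name}",
            f"Business impact: {self.business_impact}",
            "Run book:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.runbook_steps, start=1)]
        lines.append("Escalation: " + " -> ".join(self.escalation_path))
        return "\n".join(lines)


# Example values are invented for illustration only.
alert = AlertContext(
    name="replication lag on reporting database",
    business_impact="Morning reports may show stale data",
    runbook_steps=["Check slave status", "Check disk space on the slave"],
    escalation_path=["on-call engineer", "database owner"],
)
summary = alert.render()
print(summary)
```

Keeping the business impact next to the run book means the person woken up can judge urgency before starting on the steps.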

The talk by Mark Burgess on ‘What science tells us about information infrastructure’ did have the intended impact on me: I will certainly be reading his new book on the subject.

You can find videos of all the Talks and Ignites on the DevOpsDays site.