Grabbing the Estimation Bull by the horns (and holding on for dear life)

You’re going to talk about estimation? You must be crazy.

Before you start, just let me tell you that I’ve either read it or heard it, ALL of it. Putting all that to one side, it is my assertion that even if you work at Apple or Google or wherever, if you want to make money from selling software, then sometimes you need to tell the people who are interested in buying it when you’ll be ready to take their money.

If you’re still with me, here’s how our team does this today (I’ll ignore the fun we had getting to this point; sometimes you really don’t want to go there).

Okay, you might have a point. So where do you start?

It always starts with an idea, a “wouldn’t it be great/cool if…” We kick this idea around the product/client/dev team until we think we’re ready to come up with a “Level Zero” estimate.

What’s a Level Zero estimate?

It’s a big number that we arrive at after an hour of at least two developers and a product manager talking about the idea and how we might implement it. We express the Level Zero as the maximum number of iterations (fortnightly releases) that a pair of developers are confident it would take them to deliver the feature as we understand it. We document this breakdown, estimating the number of cycles per major section of the feature, along with any assumptions each section might contain.
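To make that concrete, here’s an invented illustration of what sits behind a Level Zero figure (the feature sections and numbers are made up; the real thing is a written breakdown, not a script):

    # Illustrative only: invented sections and cycle counts.
    # Each figure is the maximum number of fortnightly iterations a pair
    # of developers is confident that section would take.
    level_zero = {
        "data import":      2,  # assumes the existing importer can be reused
        "core workflow":    3,
        "reporting screen": 1,
    }

    total = sum(level_zero.values())
    print(f"Level Zero: {total} iterations for one pair")  # 6 iterations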

What happens next?

If the idea makes it to the top of the backlog and we decide to work on it, we then go to the next stage – a “Level One” estimate.

I get the idea now. How does the Level One work?

At this point, we break the feature down into development cards. We aim to make all of these small enough that no individual card represents more than two days of work for a pair of developers. In an ideal breakdown, all of our Level One cards are smaller than this (zero-point cards to those of you more familiar with our numbering).

Is it time for a cross-check?

That’s right. Next we compare the total of the Level One cards with the Level Zero estimate. If the Level One total is not comfortably lower, we republish a new total.

Are we nearly there yet?

Almost. We use our Level One estimate and the number of pairs of developers that we plan to assign to the work to come up with a delivery date. We’ve learned that parallelisation is complicated. So, we try to identify areas where we are confident that we can parallelise and areas where we know we can’t, then judge the date accordingly.
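As a rough sketch of that judgement (every number here is invented, and the real decision is made by people, not a formula), the arithmetic looks something like this:

    from datetime import date, timedelta

    cards = 20               # Level One cards, each at most 2 days for a pair
    days_per_card = 2        # plan with the worst case per card
    pairs = 2                # pairs of developers assigned to the feature
    parallel_fraction = 0.5  # share of work we're confident can run in parallel

    # Parallelisable work is shared across the pairs; the rest runs serially.
    parallel_days = (cards * parallel_fraction * days_per_card) / pairs
    serial_days = cards * (1 - parallel_fraction) * days_per_card
    working_days = parallel_days + serial_days

    # Crude calendar conversion: five working days per week.
    delivery = date.today() + timedelta(weeks=working_days / 5)
    print(f"~{working_days:.0f} working days, target {delivery.isoformat()}")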

Are you ever going to build anything?

I’m right with you. From this point, we want to start development as soon as possible to minimise any context loss or code base shifts that could derail us before we begin. Ideally, we work on the first card right after the Level One breakdown session.

How do you track progress?

Once we start, we use burn-down or burn-up charts (http://en.wikipedia.org/wiki/Burn_down_chart) to track the progress of each feature through the stand-ups.
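The data behind such a chart is nothing more than cumulative cards completed against total scope; an invented run of stand-ups might look like this:

    scope = 20                        # total Level One cards for the feature
    finished = [1, 2, 0, 3, 2, 2, 1]  # cards completed after each stand-up

    done = 0
    for day, n in enumerate(finished, start=1):
        done += n
        print(f"stand-up {day}: {done}/{scope} cards done")

    # Plot `done` per stand-up with `scope` as a horizontal line; where the
    # rising line meets the scope line is the projected finish.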

But nothing stays the same; how do you deal with that?

If you’re talking about scope creep, we manage this and any bugs, design changes, or gold-plating by assessing each change against the committed delivery date and deciding yes or no to adding it to the feature for the same delivery date.

Sounds like you’ve got it covered?

We’ve had decent predictability of delivery with this over the last year. It might not stand up to a client pushing back hard on estimates or to major team or infrastructure disruptions, but we’ll cross those bridges if we get to them.

What’s the next challenge?

What we want now are ways to keep this process in place while incorporating technical debt repayment into the way we tackle individual feature estimates and delivery.

Good luck with that…

With thanks to Tim Harford for the conversation style (http://timharford.com/)


Graphite Y-Axis Rendering Bug

Our infrastructure and development teams love metrics and open source software. In 2011, we started using Graphite more heavily for graphing and metrics collection because it allows anyone at TIM Group to easily add new metrics.

We use Coda Hale’s metrics library to send our application metrics to Graphite, and the Write Graphite plugin for collectd for system-level metrics like CPU, memory, and disk utilization.

Recently, we encountered a bug in Graphite’s graph rendering code that seemed like an interesting example of the types of problems we work on in TIM Group’s infrastructure team. Basically, very small differences between the data point values in a graph would cause an infinite loop in which memory usage would grow to 100% and Apache would need to be restarted.

For example, the data points for one of our metrics, PS Mark Sweep JVM garbage collection, were consistently ~0.170000, with differences between them small enough to cause the issue.
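To show the shape of the problem, here’s a simplified sketch (not Graphite’s actual glyph.py code) of how a label-building loop can spin forever when the value range is tiny:

    def y_axis_labels(y_min, y_max, step, limit=1_000_000):
        # Collect y-axis label values from y_min up to y_max in increments
        # of step, as a graph renderer might when drawing the axis.
        labels = []
        current = y_min
        while current <= y_max:
            labels.append(current)
            if len(labels) > limit:  # guard added so this sketch terminates
                raise RuntimeError("step too small: loop would never finish")
            current += step
        return labels

    print(len(y_axis_labels(0.0, 1.0, 0.1)))  # a healthy range: 11 labels

    # With every data point at ~0.170000, a step derived from the tiny span
    # can vanish in floating point: 0.17 + 1e-18 == 0.17, so `current`
    # never advances and the label list grows until memory runs out.
    y_axis_labels(0.17, 0.170000017, 1e-18)  # raises RuntimeError here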

The code in question was in Graphite’s graph renderer, webapp/graphite/render/glyph.py. We worked with one of the Graphite developers, Michael Leinartas, to find the cause of the issue and patch Graphite: