Joda-Time? You might be overusing DateTime

We generally prefer to use Joda-Time for our date and time representation. Its immutable objects fit our house style, and the plusHours, plusDays, etc. methods usually produce much more readable and maintainable code than the giant bag of static methods based on Date and Calendar that we had before. Throw in easy construction from components, thread-safe parsers and formatters, a richer type catalogue for representing date-only, time-only, and timezone-less values, and integrated conversion to/from ISO-style strings, and working with date and time values becomes more comfortable and less fragile.

This snippet is fairly typical:

DateTime now = new DateTime(2015, 4, 13, 9, 15 /*default timezone*/);
DateTime laterToday = now.withTime(17, 30, 0, 0);
DateTime tomorrow = now.plusDays(1);
DateTime muchLater = now.plusMonths(6);

In principle, a DateTime contains a date, a time, and a timezone. These three correspond to an instant: a number of milliseconds since the epoch, 1st Jan 1970 00:00 UTC.

Actually, that’s not quite correct: there are some date/time/timezone combinations that don’t correspond to any instant, and some that ambiguously reference multiple instants. This occurs when the timezone includes rules for daylight saving time, producing gaps where the clocks go forward and overlaps where the clocks go back. (Consider how you would interpret “2015-03-29T01:30:00” as London time, or “2015-10-25T01:30:00”.)
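To make those two cases concrete, here is a minimal sketch of the behaviour as we understand it: Joda-Time rejects local times that fall into a gap with an IllegalInstantException, and resolves times in an overlap to the earlier of the two instants.

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.IllegalInstantException;

DateTimeZone london = DateTimeZone.forID("Europe/London");

// Gap: 01:30 on 2015-03-29 never occurs in London (clocks jump from 01:00 to 02:00)
try {
    new DateTime(2015, 3, 29, 1, 30, 0, london);
} catch (IllegalInstantException e) {
    // Joda-Time refuses to construct a local time that falls in the gap
}

// Overlap: 01:30 on 2015-10-25 occurs twice; Joda-Time picks the earlier
// instant, i.e. the one still on the +01:00 (summer time) offset
DateTime ambiguous = new DateTime(2015, 10, 25, 1, 30, 0, london);
// ambiguous.toString() => "2015-10-25T01:30:00.000+01:00"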

Here’s a nasty thing that can occur when you use DateTime, though. If you wrote this:

assertThat(new DateTime(2008, 6, 1, 14, 0, 0) /* Europe/London */,
    equalTo(DateTime.parse("2008-06-01T14:00:00+01:00")));

Then (if your system timezone is “Europe/London”, or you explicitly passed it in as the parameter) you would get this highly unhelpful failure:

Expected: <2008-06-01T14:00:00.000+01:00>
     but: was <2008-06-01T14:00:00.000+01:00>
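Why so unhelpful? Both values represent the same instant and therefore print identically; the difference is hidden in the attached time zone, which equals() also compares (via the chronology). Parsing keeps the “+01:00” as a fixed-offset zone rather than the named zone Europe/London. A minimal sketch of the distinction, assuming a Europe/London default zone:

DateTime constructed = new DateTime(2008, 6, 1, 14, 0, 0);     // zone: Europe/London
DateTime parsed = DateTime.parse("2008-06-01T14:00:00+01:00"); // zone: fixed offset +01:00

constructed.isEqual(parsed);  // true  - the same millisecond instant
constructed.equals(parsed);   // false - equals() also compares the time zone,
                              //         and Europe/London != +01:00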

Continue reading

TRUNCATE making replicated MySQL table inconsistent

Here at TIM Group we make use of MySQL’s statement-based replication. This means that some functions, like UUID and LOAD_FILE, cannot be used when we write code or do manual maintenance because they break the consistency of the slaves. We now know we have to add the TRUNCATE statement to the list of things not to use. This is how it went…

One of the applications we have is used for background jobs: you send it an HTTP request to have a job started and then you look at its database for the results.

We needed the result of running one of these jobs for some analysis. The instructions looked like:

1) TRUNCATE table job_results
2) send request to /jobs/new
3) extract content of job_results table

We started following the instructions, but after a few minutes of staring at the terminal waiting for the request to complete, we suspected that the request had a problem. We hit Ctrl-C, killing the request, ran TRUNCATE job_results again, and restarted the request.

Our second attempt was faster to respond: after a few minutes it returned an error.

And after a few more minutes, the two slave databases started reporting an error with statement-based replication: they could not insert a record into the job_results table because of a duplicate primary key.

Continue reading

From structure to behaviour – migrating to an Event Sourced system

For a few years now we have identified Event Sourcing as a good fit for the sort of applications we build at TIM Group. This is becoming increasingly important as the business transitions from an alpha capture and distribution tool to a platform that provides analytics as well.

IdeasFX, one of our newer initiatives, was exhibiting various issues even in its infancy. Although we had some stability and scalability issues, the largest negative impact on the team, as the team itself saw it, was the very confused “domain model”. The impact came in the form of slower development cycles, caused by a large amount of accidental complexity, and of lower team morale for the same reason. Although the product was not making much money and almost certainly could have continued as it was, we needed a product on which to demonstrate the value of a domain-centric Event Sourced approach, so that we could later apply it to more of our core products.

So the team went to work. We started by researching CQRS/Event Sourcing topics: we bought Greg Young’s and some of Udi Dahan’s online courses, and we read books on Domain Driven Design and Event Sourcing. We were lucky enough to have Greg in for a couple of days during our technology summit to help us clarify our thoughts.

Finally, we started a series of modeling sessions to try to formulate the concepts we were working with. These sessions turned out to be invaluable: we were lucky enough to have product people with us who truly understood the domain and were able to help the team reach a shared understanding of the logical system and a ubiquitous language.

Shortly after these initial sessions, we began work to improve the system!

We intend to publish a series of posts explaining techniques we are using and our reasoning for taking the approaches we did, starting with the “Event Store Adapter”.

Workflow for deploying potentially unstable Tucker components

When adding a Tucker component to an application, it’s important to get the workflow right, so as to avoid contributing noise to our already noisy alerting. Now that support for that workflow is built into Tucker, it’s easier to get right than ever before.

Problem

Occasionally, I’ve added a Tucker component to monitor the status of an application, only to discover, when I deployed it to a realistic environment, that an assumption I had made about the system didn’t hold. The component raised a warning unexpectedly and contributed to the spurious noise across our alerting systems. I wanted to find a way to develop new monitoring components that could be introduced without causing this noise.

Examples of the kind of monitoring where this could be an issue include:

  • checking a business rule holds by querying the database
  • monitoring the availability of a service required by the application
  • ensuring background tasks are running and completing successfully

In each case it’s possible that you assumed some behaviour while developing and testing your new component locally. Perhaps data in the database already violates the business rule unexpectedly, or the service is less reliable than you thought, or background tasks don’t complete in the time you expect them to. The component may or may not be alerting correctly, but until you can achieve a clean slate, it will be causing alerting errors in Nagios. Leaving components in this state for any length of time is undesirable as it leads to:

  • other alerts from the same application being hidden[0]
  • alert fatigue (aka the “boy who cried wolf” effect)

Solution

A better workflow is to deploy the Tucker component such that it will make the same checks as usual, and display the same result, but never change to a state that causes an alert. This way you can observe how it behaves in a realistic environment, without it causing any alerts, until you decide it is “stable”.

This has been a reasonably easy workflow to follow. It’s not too much effort to write and test a component that behaves as usual, but then, when wiring it up at application startup, to configure it so that its status is always INFO or OK. Once you’re happy the component is stable, a minor code change “switches it on” so that it can begin alerting as you intended. This workflow has now been incorporated into the Tucker library, so it’s even easier.

How to use

When adding your component to the Tucker configuration, decorate it with a PendingComponent, like so:

import com.timgroup.tucker.info.component.pending.PendingComponent;

tucker.addComponent(new PendingComponent(new PotentiallyUnstableComponent()));
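For context, here is a minimal sketch of what the underlying component might look like. The class name and the business-rule check are hypothetical, and we are assuming Tucker’s Component/Report/Status API, with the database query stubbed out:

import com.timgroup.tucker.info.Component;
import com.timgroup.tucker.info.Report;
import com.timgroup.tucker.info.Status;

public class PotentiallyUnstableComponent extends Component {

    public PotentiallyUnstableComponent() {
        super("potentially-unstable-component", "Potentially Unstable Component");
    }

    @Override
    public Report getReport() {
        // hypothetical check: count records violating a business rule
        int violations = countRuleViolations();
        return violations == 0
                ? new Report(Status.OK, "no rule violations")
                : new Report(Status.CRITICAL, violations + " rule violations");
    }

    private int countRuleViolations() {
        return 0; // stub standing in for a real database query
    }
}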

The PendingComponent will always return Status.INFO, regardless of the actual status, preventing premature warnings. However, it will display the value and the real status from the underlying component, as shown here:

[Screenshot: a pending component on the status page, displaying the real status and value from the underlying component]

Since, in the case of the less-reliable-than-expected service, we would not spend all day refreshing our browser on our application’s status page just to see changes in the Tucker component, it’s helpful to log any state changes that would have occurred. That allows you to look back after the component has spent some time “soaking” to see how it would have alerted. PendingComponent will log a structured event, viewable in Kibana, like the following:

{
   "eventType":"ComponentStateChange",
   "event":{
     "id":"potentially-unstable-component",
     "label":"Potentially Unstable Component",
     "previousStatus":"OK",
     "previousValue":"something which warrants a ok status",
     "currentStatus":"CRITICAL",
     "currentValue":"something which warrants a critical status"
   }
 }

Once your component is stable, and you’re confident that a WARNING or CRITICAL status will be real, you can easily “undecorate” it by removing the PendingComponent wrapper. Assuming you have tested your component directly, it should be the only code change you need to make at this point.


PendingComponent has been available in Tucker since version 1.0.326.


[0] a known weakness of the Nagios/Tucker monitoring system. Nagios is limited to considering only one status from an entire Tucker status page. This means that if your component is in a WARNING state, Nagios will only alert once, even if other components subsequently change to WARNING.

MDI: Monitoring Driven Infrastructure?

Adam and I attended the London Infracoders Meetup last night, which featured a demo of serverspec and Beaker. When I asked Adam what he thought, he wasn’t impressed*. “I don’t see the point of this if you’re using Puppet, unless you’re worried about typos or you don’t control your production monitoring. It duplicates what you already have with Puppet’s DSL, which is why we stopped using rspec-puppet in the first place.”

I realized that Adam was correct: the sort of automated tests I’m used to as a developer are functionally equivalent to the monitoring checks system administrators are already heavily using. (An easy leap for me to make, since Michael Bolton had already convinced me that automated tests are checks, not testing.) Over time I’ve watched testing migrate from unit tests on the developer desktop to more tests further and further down the deployment pipeline. This led me to wonder how far monitoring could make the same march, but in reverse.

We already use our production monitoring in our test environments, but we don’t tend to use it in our local Puppet development environments. But why not? Couldn’t we use our monitoring in TDD fashion? Write a monitoring check, see it fail, then write the Puppet code to make it pass? (This is the same motivation as using your tests as monitoring, such as with Cucumber-Nagios, but working in the other direction.)

We haven’t tried this experiment yet but I’m curious if this is a pattern others have attempted, and if so, how did it work for you?

* Adam did think serverspec might be a plausible replacement for our existing NRPE checks: a way to uniformly structure our monitoring and to allow easier independent development of the checks.

Switching To Scala talk @ Devoxx 2014

Last year I was lucky enough to be able to attend (and participate a little bit in) Devoxx UK. I found it to be a great conference, with a friendly vibe and interesting talks. This year I’m going back for seconds, as a speaker, talking about some of my experiences switching to Scala at TIM Group.

I’d highly recommend this conference to people interested in Java, JVM languages, and other areas of software development. It seems there are still tickets left, and with the promo code SPK10A9 you will get 10% off the listed price. If you decide to attend, do come and say hello.

Distributed Pair Programming @ TIM Group with Saros

There’s a particular technology used within TIM Group that has been too useful for too long to go uncelebrated. The tool is called “Saros”: an Eclipse plugin for distributed pair programming that is used extensively at TIM Group.

Why We Need A Distributed Pairing Solution

We have developers spread across London and Boston offices, and a few “Remoties” working from home in various far-flung lands. As one of said Remoties, I believe I would not have the opportunity to continue to work at TIM Group and live where I want to live (hint: not London) if it were not for Saros.

When we were entirely colocated just a few years ago, our choice to use pair programming by default faced no real technical limitations. With an extra seat, keyboard, and mouse at every developer’s workstation, the barrier to pairing was low: just pull up a chair and pair. When we first started having distributed developers, we didn’t want to give up pairing for technical reasons, so the search began for how to adapt pair programming to a distributed environment.

Why Other Solutions Don’t Match Up

An oft-cited solution for this problem is to use a screen-sharing or remote-access technology. There’s plenty to choose from: Windows’ Remote Desktop; TeamViewer; LogMeIn; GoToMyPC; VNC; NX; etc. However, we found them all far too limited for pairing. Just like colocated pairing, there’s only one cursor, so only one person can use the keyboard and mouse at any time. Unlike colocated pairing, there’s a large bias towards the person hosting the session, as they get much faster feedback when typing, since they don’t have to suffer the latency. When it’s (figuratively) painful to type, it’s much easier to shy away from being the “driver”, which hinders the collaboration. We never found a solution that reduced that latency to an acceptable level.

Another cited solution is to share a terminal session with a multiplexer like tmux, which does not have the overhead of screen sharing. The feedback when typing is much better; however, there’s one major drawback: you are limited to terminal editors. I’ve seen people whose terminal environments are well suited to developing in a language like Ruby or JavaScript, but for those of us coding in Java and Scala, we didn’t want to give up the powerful features we appreciate in our IDE, so tmux is not suitable.

Saros To The Rescue

Thankfully, we discovered Saros, a research project from Freie Universität Berlin. The most relatable way I’ve found to describe it is as:

Google Docs within the Eclipse IDE

It works by connecting two or more developers through Eclipse, so that when Alice enters some text, Bob sees it appear in his editor. The experience for both users is as if they were editing files locally[0]. Rather than sharing the image of a screen, edit commands are serialised and sent over the wire, changing the other participant’s local copy. This comes with several other benefits over remote access technologies:

  • the feedback when typing is instant, for both parties
  • the latency for seeing your partner’s keystrokes is much lower than when transmitting an image of the screen[1]

There are even benefits over colocated pairing:

  • neither the host nor guest has to leave the comfort of their familiar work environment to pair; you can set up fonts and shortcuts however you like
  • since you are both editing as though it was local, each participant has their own cursor, and can be editing different files, allowing ad-hoc splitting of tasks[1], which can be very handy
  • you can have more people involved (up to 5) which we’ve used for code review and introducing a project to a group of people, sparing the discomfort of hunching around a single desk

There are other distributed pairing solutions, such as Cloud9 IDE and Kobra.io, but none (that we’ve found) that lets you stick with your own IDE setup, retaining parity with how you would develop locally.

There are of course drawbacks to distributed pairing, regardless of technology, which I’m not going to cover here; they’re generally not problems Saros, or any technology, will be able to solve.

IntelliJ Support Is On The Roadmap

After a long search, we have not found anything that really compares with Saros. For me, it really is a killer feature of Eclipse. So much so that I’ve basically kept IntelliJ in a dusty old drawer, even when it would be better for certain tasks (like Scala development). Fortunately, it looks like that’s about to change. Work has just begun to port the Saros platform to IntelliJ, which is really exciting. Whenever I’ve talked to people about Saros, alternative IDE support inevitably arises. If the core Saros technology were agnostic of the IDE, it could be a huge leap forward for collaborative development in general.

At TIM Group we were so keen on the idea that a handful of us spent a “hack week” throwing together the first steps towards an IntelliJ plugin for Saros. We were able to demonstrate a proof-of-concept, but didn’t get anything truly viable. Having brought this effort to the attention of the Saros team, I hope that in some small way it inspired them to start work on it, but I doubt that’s something we can take credit for. Hopefully, during the development of the IntelliJ plugin there will be something that we can contribute, and give something back for our many hours of happy usage.

If you’re looking for an answer to the problem of distributed pairing, I heartily endorse Saros!


[0] for more details of the theory behind this aspect of collaborative editors, see http://en.wikipedia.org/wiki/Operational_transformation
[1] we have acquired a habit of being vocal about whether we are in Driver/Navigator mode, and about whether your pair is following you, since you can’t assume they are

Released scalaquery-play-iteratees 1.1.0

Announcing the release of scalaquery-play-iteratees version 1.1.0.

This version is for use with Slick 2.0.0+ and Play Framework 2.2.0+ on Scala 2.10.x.

This library provides the “glue” between Play’s concept of Iteratee and Slick’s Query, making it super-easy to process chunks of database query results in a reactive fashion.

Please leave us a comment if you find this library useful — we are in the process of porting it to be part of the play-slick project, and welcome any and all input on syntax, documentation, usage, etc!

TopicalJS: A rose amongst the thorns?

We have a codebase in semi-retirement to which we occasionally have to make limited changes. With active codebases, you are continuously evaluating the state of the art and potentially upgrading or switching third-party libraries as better solutions become available. With inactive codebases, you live with decisions that were made years ago and are no longer cost-effective to improve. However, not all old code is bad code, and there are some really excellent patterns and home-grown tools within this codebase that have never seen the light of day outside of TIM Group.

It was a great pleasure, therefore, to be able to extract one such gem and make it open-source during the last round of maintenance programming on this codebase. We’ve called it TopicalJS, mostly because this was one of the few words left in the English language not already used for a JavaScript library. Topical is concerned with the management of events within a client-side environment (or indeed server-side if you run server-side JavaScript).

This old codebase uses Prototype and YUI on the front-end, plus a custom internal event-passing system called (not very inspiringly) the “MessageBus”. Our newer codebases use Underscore, jQuery, and Backbone. Backbone comes with its own event system, which is built into every view, model, and collection. You can raise an event against any of these types, or you can just use a raw Backbone Event instance to pass events around.

Without Backbone, and in fact before Backbone existed, we invented our own system for exchanging events. Unlike Backbone, all it does is exchange events, so it could even be used to complement Backbone’s event system. Its best feature is that it encourages you to create a single JavaScript file containing all of the events that can be fired, who consumes them, and how they get mapped onto actions and other events. This is effectively declarative event programming for JavaScript, which I think might be unique.

You use it by creating a bus and then adding modules to that bus. These modules declare what events they publish and what events they are interested in receiving. When a module publishes an event, it is sent to every module, including the one that published it. Then, if a module subscribes to that event type, its subscription function is called along with any data associated with the event. Events can be republished under different aliases, and multiple events can be awaited before an aggregated event is fired, allowing easy coordination of multiple different events.

As an example, this is what bus configuration code might look like:

topical.MessageBus(
  // fire "bothTextsEntered" once both expected events have been seen
  topical.Coordinate({
    expecting: ["leftTextEntered", "rightTextEntered"],
    publishing: "bothTextsEntered" }),

  // re-raise "init" under two aliases, and "reset" as "clear"
  topical.Republish({
    subscribeTo: "init",
    republishAs: [ "hello", "clear"] }),
  topical.Republish({
    subscribeTo: "reset",
    republishAs: "clear" }),

  // a module that reacts to the "hello" event when it arrives
  topical.MessageBusModule({
    name: "Hello",
    subscribe: {
      hello: function() {
        alert("This alert is shown when the hello event is fired");
      }
    }
  }))

The Coordinate module waits for two different text boxes to be filled in before firing an event saying that both are present. The Republish modules raise the hello and clear events, ultimately causing an annoying alert to be shown, as well as giving another module the chance to react to the clear event and flush out any old data.

Full documentation and a worked example are available here: https://github.com/youdevise/topicaljs.

Feedback or contributions are most welcome.

Report from DevOpsDays London 2013 Fall

This Monday and Tuesday a few of us went to DevOpsDays London 2013 Fall.

We asked every attendee for highlights, and this is what they had to say about the conference:

Francesco Gigli: Security, DevOps & OWASP

There was an interesting talk about security and DevOps, and a follow-up during one of the open sessions.
We discussed capturing security-related work in user stories, or rather “Evil User Stories”, and the use of anti-personas as a way to keep malicious users in mind.
OWASP, which I did not know about before DevOpsDays, was also mentioned: it is an organization “focused on improving the security of software”. One of the resources it makes available is the OWASP Top 10, a list of the most critical web application security flaws. Very good for awareness.

Tom Denley: Failure Friday

I was fascinated to hear about “Failure Fridays” from Doug Barth at PagerDuty. They take an hour out each week to deliberately fail over components that they believe to be resilient. The aim is not to take down production, but to expose unexpected failure modes in a system that is designed to be highly available, and to verify the operation of the monitoring/alerting tools. If production does go down, better that it happens during office hours, when staff are available to make fixes, and in the knowledge of exactly what event triggered the downtime.

Jeffrey Fredrick: Failure Friday

I am very interested in Failure Fridays. We already do a failure analysis for our application, in which we identify what we believe would happen when different components fail. My plan is to use one of these sessions to record our expectations and then try manually failing those components in production to see if our expectations are correct!

Mehul Shah: Failure Fridays & The Network – The Next Frontier for Devops

I very much enjoyed DevOpsDays. Apart from the fact that I won an HP Slate 7 in the HP free raffle, I drew comfort from the fact that ‘everyone’ is experiencing the same or similar problems to us, and it was good to talk about and share that. It felt good to understand that we are not far from what most people are doing: emphasizing strong DevOps communication and collaboration. I really enjoyed most of the morning talks, in particular Failure Fridays and The Network – The Next Frontier for Devops, which was all about creating a logically centralized program to control the behaviour of an entire network. This should make networks easier to configure, manage, and debug. We are doing some cool stuff here at TIM Group (at least from my standpoint), but I am keen to see if we can work toward this as a goal.

Waseem Taj: Alerting & What science tells us about information infrastructure

At the open space session on alerting, there was a good discussion about adding context to alerts. One of the attendees mentioned that each alert they get has a link to a page describing the likely business impact of the alert (why we think it is worth getting someone out of bed at 3am), a run book with typical steps to take, and the escalation path. We have already started down the path of documenting how to respond to Nagios alerts; I believe expanding that documentation to include the perceived ‘business impact of the alert’, and integrating it with Nagios, will be most helpful in moments of crisis in the middle of the night, when the brain just does not want to cooperate.

The talk by Mark Burgess on ‘What science tells us about information infrastructure’ indeed had the intended impact on me: I will certainly be reading his new book on the subject.

You can find videos of all the talks and ignites on the DevOpsDays site.