Workflow for deploying potentially unstable Tucker components

When adding a Tucker component to an application, it’s important to get the workflow right so as to avoid contributing noise to our already noisy alerting. Now that support for that workflow is built into Tucker, it’s easier to get right than ever before.

Problem

Occasionally, I’ve added a Tucker component to monitor the status of an application, but when I deployed it to a realistic environment, I discovered an assumption I made about the system didn’t hold. The component raised a warning unexpectedly, and contributed to the spurious noise across our alerting systems. I wanted to find a way to develop new monitoring components that could be introduced without causing this noise.

Examples of the kind of monitoring where this could be an issue include:

  • checking a business rule holds by querying the database
  • monitoring the availability of a service required by the application
  • ensuring background tasks are running and completing successfully

In each case it’s possible that you assumed some behaviour while developing and testing your new component locally. Perhaps data in the database already violates the business rule unexpectedly, or the service is less reliable than you thought, or background tasks don’t complete in the time you expect them to. The component may or may not be alerting correctly, but until you can achieve a clean slate, it will be causing alerting errors in Nagios. Leaving components in this state for any length of time is undesirable as it leads to:

  • other alerts from the same application being hidden[0]
  • alert fatigue (aka the “boy who cried wolf” effect)
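
The first effect can be sketched in a few lines. This is a minimal illustration only: it assumes the page-level status is simply the most severe component status, and the Status enum and StatusPage class are hypothetical stand-ins, not the real Tucker or Nagios types.

```java
import java.util.List;

// Hypothetical stand-in for a status scale, ordered least to most severe.
// The ordering is an assumption made for this sketch.
enum Status { INFO, OK, WARNING, CRITICAL }

final class StatusPage {
    // Collapse a whole status page into the single status a page-level
    // check would see: here, the most severe status of any component.
    static Status overall(List<Status> components) {
        Status worst = Status.INFO;
        for (Status s : components) {
            if (s.compareTo(worst) > 0) {
                worst = s;
            }
        }
        return worst;
    }
}
```

With one component stuck in WARNING, the overall status is already WARNING. When a second component also moves to WARNING, the overall status does not change, so nothing new is raised.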

Solution

A better workflow is to deploy the Tucker component such that it will make the same checks as usual, and display the same result, but never change to a state that causes an alert. This way you can observe how it behaves in a realistic environment, without it causing any alerts, until you decide it is “stable”.

Until now, this has been a reasonably easy workflow to follow by hand. You write and test a component that behaves as usual but, when wiring it up at application startup, you configure it so that its status will always be INFO or OK. Once you’re happy the component is stable, a minor code change “switches it on” so that it can begin alerting as you intended. This workflow has now been incorporated into the Tucker library, so it’s even easier.

How to use

When adding your component to the Tucker configuration, decorate it with a PendingComponent, like so:

import com.timgroup.tucker.info.component.pending.PendingComponent;

tucker.addComponent(new PendingComponent(new PotentiallyUnstableComponent()));

The PendingComponent will always return Status.INFO, regardless of the actual status, preventing premature warnings. However, it will display the value and the real status from the underlying component, as shown here:

[Image: pending-component]
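
For readers curious how such a decorator might work, here is a minimal sketch of the idea. It is not Tucker’s actual implementation: Status, Report, Check and PendingCheck are hypothetical stand-ins invented for this example, and the way the real status is appended to the value is an assumption.

```java
// Hypothetical stand-ins for illustration; these are NOT the real Tucker types.
enum Status { INFO, OK, WARNING, CRITICAL }

final class Report {
    final Status status;
    final String value;

    Report(Status status, String value) {
        this.status = status;
        this.value = value;
    }
}

interface Check {
    Report report();
}

// Decorator that runs the wrapped check as usual but never reports
// a status that could cause an alert.
final class PendingCheck implements Check {
    private final Check wrapped;

    PendingCheck(Check wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public Report report() {
        Report actual = wrapped.report();
        // Always report INFO, but keep the real status visible in the value.
        return new Report(Status.INFO,
                actual.value + " (actual status: " + actual.status + ")");
    }
}
```

Wrapping a check that currently reports CRITICAL yields a report with status INFO whose value still shows what the underlying check found.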

In the case of the less-reliable-than-expected service, nobody is going to spend all day refreshing the application’s status page just to watch the Tucker component change, so it’s helpful to log every state change that would have occurred. That allows you to look back after the component has spent some time “soaking” and see how it would have alerted. PendingComponent logs a structured event, viewable in Kibana, such as the following:

{
  "eventType":"ComponentStateChange",
  "event":{
    "id":"potentially-unstable-component",
    "label":"Potentially Unstable Component",
    "previousStatus":"OK",
    "previousValue":"something which warrants a ok status",
    "currentStatus":"CRITICAL",
    "currentValue":"something which warrants a critical status"
  }
}
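
The detection behind an event like this can be sketched roughly as follows. This is an illustration under assumptions, not Tucker’s code: StateChangeRecorder and its Status enum are hypothetical, events are kept in a list rather than sent to a logger, and I am assuming that only a change of status (not of value alone) counts and that the very first observation is not logged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; not the real Tucker Status type.
enum Status { INFO, OK, WARNING, CRITICAL }

// Remembers the last observed state and records an event when the status changes.
final class StateChangeRecorder {
    private Status previousStatus; // null until the first observation
    private String previousValue;
    private final List<String> events = new ArrayList<>();

    void observe(Status currentStatus, String currentValue) {
        // Assumption: only status changes are recorded, and the first
        // observation has nothing to be compared against.
        if (previousStatus != null && previousStatus != currentStatus) {
            events.add(previousStatus + " (" + previousValue + ") -> "
                    + currentStatus + " (" + currentValue + ")");
        }
        previousStatus = currentStatus;
        previousValue = currentValue;
    }

    List<String> events() {
        return events;
    }
}
```

Feeding it OK, OK, then CRITICAL records a single event for the OK to CRITICAL transition, which is the kind of history you would want to review after the soak period.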

Once your component is stable, and you’re confident that a WARNING or CRITICAL status will be real, you can easily “undecorate” it by removing the PendingComponent wrapper. Assuming you have tested your component directly, this should be the only code change you need to make at this point.


PendingComponent is available from Tucker since version 1.0.326.


[0] A known weakness of the Nagios/Tucker monitoring system: Nagios only considers a single status for an entire Tucker status page. This means that if your component is in a WARNING state, Nagios will alert only once, even if other components later change to WARNING.
