Cron-instigated contention

Compute resources are a 21st-century utility: much like water, electricity and gas, there is a limited supply. When you run out there are consequences, even if the shortage is brief, and this can lead to unexpected failure scenarios in systems thought to be resilient to failure.

tldr:

If you have cron jobs that run on virtual machines and KVM hosts, ensure that they do not all run at exactly the same time, as this can temporarily exhaust the compute resources on all your KVM hosts. An excellent countermeasure is simply to splay the cron times. If you use Puppet, fqdn_rand is a really helpful function for doing this.

Background:

TIMGroup has around 100 physical servers spread over 3 sites, hosting around 560 virtual machines. There is an internal caching DNS service at the heart of each site. The service consists of 3 name servers (BIND) sitting behind a pair of load balancers (LVS/keepalived). The DNS servers themselves are virtual machines that run on different KVM hosts.

The service in one particular site had been running for 859 days without interruption, until the 24th of February.

24th February DNS Outage:

A development colleague made the Infra team aware of some exceptions they had observed from their production application:

Caused by: java.net.UnknownHostException: files.timgroup.com

An initial investigation revealed a full DNS outage in this site lasting around 5 seconds. It appeared that one of the health checks on the load balancer had failed for all 3 DNS servers.

Each DNS server has 4 health checks: one for each zone (3 zones) and one for participation. Participation is our mechanism for determining whether the server should ‘participate’ in the load-balanced pool of servers. All 3 of the zone checks were passing; it was just the participation check that was failing with a timeout.

Feb 24 16:45:08 oymigration-lb-002 Keepalived_healthcheckers[5424]:
Misc check to [10.30.0.145] for [check_participation.rb --participation-url=http://10.30.0.145:5643/participation] timed out
Feb 24 16:45:13 oymigration-lb-002 Keepalived_healthcheckers[5424]:
Misc check to [10.30.0.145] for [check_participation.rb --participation-url=http://10.30.0.145:5643/participation] success.
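
For context, keepalived runs these as MISC checks: each check is an external script, and its exit status (subject to a timeout) decides whether the real server stays in the pool. Below is a minimal, hypothetical sketch of what a participation-style check might look like in Ruby; it is an illustration only, not the actual check_participation.rb.

#!/usr/bin/env ruby
# Hypothetical sketch of a participation-style keepalived MISC check.
# Exit 0 keeps the real server in the pool; any other exit code, or a
# timeout enforced by keepalived, takes it out.
require 'net/http'
require 'uri'

url = URI(ARGV.fetch(0, 'http://127.0.0.1:5643/participation'))

begin
  response = Net::HTTP.start(url.host, url.port, open_timeout: 2, read_timeout: 2) do |http|
    http.get(url.request_uri)
  end
  exit(response.is_a?(Net::HTTPSuccess) ? 0 : 1)
rescue StandardError
  exit 1
end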

Looking back in syslog, this had happened before:

  • 24th Feb: 16:45:08 to 16:45:13
  • 24th Feb: 00:15:12 to 00:15:16
  • 22nd Feb: 19:15:07 to 19:15:12

Whilst investigating, we had another incident on the 26th Feb: 23:45:15 to 23:45:51.

The most interesting thing about the outages was their regularity; they were also extremely short-lived, far too fast for our monitoring (Nagios) to have noticed. The timing was highly suspicious, occurring at 15 or 45 minutes past the hour. Looking at the metrics, there was clearly something going on here.

This pattern was present on all 3 DNS servers.
Looking in syslog revealed some interesting events:

Feb 26 23:45:01 oymigration-ns-003 CRON[1153]: (root) CMD (/usr/bin/apt-get -qy clean)
Feb 26 23:45:01 oymigration-ns-003 CRON[1156]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Feb 26 23:45:01 oymigration-ns-003 CRON[1158]: (root) CMD (/bin/sh -c '/usr/local/sbin/refresh-mcollective-metadata >/dev/null 2>&1')

What are these tasks?

apt-get -q clean – This is a maintenance task that removes any left-over files after apt has been used to install or upgrade packages. Puppet runs and automated package upgrades can cause these files to accumulate. The load caused by this cron task is non-deterministic (it depends on the number of files to clean), but it is generally a very low-risk task that is safe to run at any time.

command -v debian-sa1 > /dev/null && debian-sa1 1 1 – I had to look this one up; it comes from the sysstat package. It generates a system activity report, and the load it causes is very small: another low-risk task.

/usr/local/sbin/refresh-mcollective-metadata – This is a custom Ruby wrapper around Facter. The wrapper runs Facter.to_hash and writes the result to a YAML file, which is then atomically moved to /etc/mcollective/facts.yaml. This is a mechanism to cache Facter facts that are accessed by mcollective; it was introduced because Facter 2.x has traditionally been too slow to execute directly. This task does cause some load: it takes around 5-10 seconds to complete before writing to disk, causing a small amount of IO.
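
The wrapper itself isn’t shown in this post, but its shape is simple enough to sketch. The following is a hypothetical reconstruction based on the description above, not the actual script:

#!/usr/bin/env ruby
# Hypothetical sketch of a facts-caching wrapper along the lines of
# refresh-mcollective-metadata: resolve every fact once (the slow part),
# then publish the result atomically so mcollective never reads a
# half-written file.
require 'facter'
require 'yaml'

target = '/etc/mcollective/facts.yaml'
facts  = Facter.to_hash

# Write to a temporary file on the same filesystem, then rename it into
# place; rename(2) is atomic, so readers see either the old or the new file.
tmp = "#{target}.tmp.#{Process.pid}"
File.open(tmp, 'w') { |f| f.write(YAML.dump(facts)) }
File.rename(tmp, target)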

Below is the crontab output (note how the schedules overlap with the outage times):

  • 45 23 * * * /usr/bin/apt-get -qy clean
  • 0,15,30,45 * * * * /bin/sh -c '/usr/local/sbin/refresh-mcollective-metadata >/dev/null 2>&1'
  • 45 * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1

The timings are interesting, as there is clearly an overlap. It seemed reasonable that the next step would be to run the cron jobs and observe the behaviour.

Attempt #1 – Run the cron tasks

I wanted some fast feedback to ascertain whether this was the cause. My expectation was to observe a similar pattern of CPU usage in the metrics and a short outage for a single DNS server.

Running the cron jobs did cause the same spike in CPU, but there was no outage. To make sure, I ran the cron jobs in parallel and multiple times, all to no avail.

Clearly there was something else contributing to this problem that I wasn’t yet aware of. I decided to counter my original experiment and instead looked at eliminating the cron jobs.

Attempt #2 – Shift the cron times

On servers 001 and 002 I decided to shift all 3 cron tasks by an offset of +3 minutes. This would cause refresh-mcollective-metadata to run at 3, 18, 33 and 48 minutes past the hour, apt-get clean daily at 23:48, and debian-sa1 hourly at 48 minutes past. I decided to leave 003 running at the existing times. To ‘hedge my bets’ I also set up a task to continually run iostat on 001; this would give me more information during the next outage.

Excitingly, there was a short outage overnight: Feb 27 23:45:09 to 23:45:21.

The metrics revealed something very interesting: the original dip at 23:45 had moved to 23:48, along with the IO, but something was still happening to the CPU at 23:45. Moving the cron jobs did not prevent the outage; it looked like the cron jobs were off the hook.

Looking at the iostat information revealed that ‘steal’ rose sharply before falling back down.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          87.88    0.00    3.03    0.00    9.09    0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I hadn’t dealt with steal before, so I looked it up:

“Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.” (source)

The surface area for this problem just got a WHOLE lot larger.
I started by looking at the metrics for the KVM hosts where the DNS servers are hosted.

Short-term load on all 3 KVM servers was spiking at the same time, so the next logical step was to look at all KVM hosts. Wow! That was a shockingly large spike, showing that every single KVM host in this site was busy at precisely the same time. This was also not specific to this site: a look at another site confirmed it was occurring there at the same time too.

It was mind-blowing that this was going on and we were completely unaware. It was all happening far too fast for our monitoring to detect, and we have no anomaly detection on our metrics. The only symptom of this problem was the DNS servers’ participation check.

Zooming out on the metrics made it very clear that this was occurring every 15 minutes, just to a lesser extent. That put the cron job back in the frame:

  • 0,15,30,45 * * * * /bin/sh -c '/usr/local/sbin/refresh-mcollective-metadata >/dev/null 2>&1'

It was time to up the ante on this investigation. Taking another look at the cron jobs, I confirmed that they run on EVERY virtual machine and KVM host at exactly the same time.

Attempt #3 – Shift the cron times on a single KVM host that hosts a DNS server, and on all of its guest VMs

I decided to reuse my earlier strategy of moving the cron times to prove or disprove the cron jobs’ involvement. I modified the cron times on the KVM host and on every VM it hosts to the same, shifted time. However, this did not cause an outage; there was clearly still something missing.

Attempt #4 – Shift the cron times on TWO KVM hosts, one that hosts the DNS server and another that hosts the primary load balancer, and on all of their guest VMs

    • oy-kvm-006
      • oymigration-ns-002
      • oymigration-lb-002
      • All other children on this KVM host
    • oy-kvm-008
      • oymigration-ns-001
      • All other children on this KVM host 

For the above hosts, the cron times were modified to:

  • refresh-mcollective-metadata to 8, 23, 38 and 53 minutes past the hour
  • apt-get clean to 23:53

What I expected to happen:

  • The outage moves to 23:53
  • oymigration-ns-001|2 should have the same pattern but occurring at 23:53
  • oymigration-ns-003 should remain at 23:45
  • oy-kvm-006 and oy-kvm-008 should have the same pattern occurring at 23:38 and 23:53; all other KVM hosts should remain at 23:45

What actually happened:

  • No outage occurred  
  • The oymigration-ns-001|2 servers pattern shifted as expected.
  • oymigration-ns-002 had an anomaly at 23:53:44, caused by a no-op Puppet run.
  • oymigration-ns-003 remained at the same time.
  • The oy-kvm-006|8 servers’ pattern shifted as expected, but we did not see a significant rise in load at 23:53.
    • This is especially interesting, as the apt-get clean was actually moved to 23:38. It appears that apt-get clean and refresh-mcollective-metadata together cause a much higher increase in load on the KVM host than refresh-mcollective-metadata alone.

It was becoming a little frustrating that I was unable to generate an outage; doing so would have proved beyond doubt that the cron jobs running at the same time everywhere were the cause.

While all this was going on, there were reports that one of our jump (SSH) servers in the other site was experiencing a similar problem to DNS, with failed health checks every 15 minutes, aligned to 0, 15, 30 and 45 minutes past the hour.

Mar  1 09:30:12 production-lb-002 Keepalived_healthcheckers[5905]: Misc check to [10.40.1.84] for [/opt/youdevise/keepalived/healthchecks/bin/check_ssh.rb 10.40.1.84] timed out
Mar  1 09:45:12 production-lb-002 Keepalived_healthcheckers[5905]: Misc check to [10.40.1.84] for [/opt/youdevise/keepalived/healthchecks/bin/check_ssh.rb 10.40.1.84] timed out
Mar  1 10:00:12 production-lb-002 Keepalived_healthcheckers[5905]: Misc check to [10.40.1.84] for [/opt/youdevise/keepalived/healthchecks/bin/check_ssh.rb 10.40.1.84] timed out
Mar  1 10:15:12 production-lb-002 Keepalived_healthcheckers[5905]: Misc check to [10.40.1.84] for [/opt/youdevise/keepalived/healthchecks/bin/check_ssh.rb 10.40.1.84] timed out

Given that these outages were occurring frequently and reliably, I thought this would be a better candidate for getting some fast feedback.

Attempt #5 – Shift the cron time for refresh-mcollective-metadata on the jump box.

This was a validation of my earlier experiment (Attempt #2 – shift the cron times), but for the SSH service. I modified the refresh-mcollective-metadata cron on pg-jump-001 to 8,23,38,53. The outages remained at 10:45, 11:00, 11:15, 11:30 and 11:45, but there was an unexpected extra outage at 11:12. This result was in line with the outcome of the ‘Attempt #2 – shift the cron times’ experiment.

Attempt #6 – Shift the cron time for refresh-mcollective-metadata on the jump box, its KVM host and all the other guest virtual machines.

This prevented an outage occurring for the SSH service.

This proved that, for an outage to occur, the pg-jump server and its KVM host needed to run the cron at the same time as the KVM host hosting the load balancer and the other guest virtual machines on that KVM host.

Interestingly, there was an outage at 13:00, so there are circumstances under which an outage can still occur.

Given the strength of the evidence against refresh-mcollective-metadata, it was time to look at countermeasures.

Prevention:

  1. Nice the cron task
  2. Splay refresh-mcollective-metadata
  3. Splay refresh-mcollective-metadata and apt-get -q clean
  4. Reduce the number of times these cron jobs run

It was decided to go for options 3 & 4.

  • Option 3 – This was deemed the best solution, as it prevents the contention from occurring at the exact same moment in time.
  • Option 4 – The requirements for these cron jobs were discussed, and based on this we revised the timings:
    • apt-get -q clean – could be run once per day at any time, instead of at 23:45
    • refresh-mcollective-metadata – could be run hourly, instead of every 15 minutes

Implementation

In Puppet there is a really useful function called fqdn_rand:

“Generates a random whole number greater than or equal to 0 and less than MAX, combining the $fqdn fact and the value of SEED for repeatable randomness. (That is, each node will get a different random number from this function, but a given node’s result will be the same every time unless its hostname changes.)”

The solution was to use this function when the cron is defined:

# Splay the minute with fqdn_rand so each host refreshes its fact cache at a
# different, but stable, minute of the hour (and only once per hour).
cron { 'refresh-mcollective-metadata':
  command => "facter -p --yaml >${yaml_fact_path_real} 2>&1",
  minute  => fqdn_rand(59, 'refresh-mcollective-metadata-minute'),
}

To prevent this problem from recurring, the plan is to ensure that all cron tasks go through a wrapper that automatically splays the hours and minutes when required.
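
For anything that sits outside Puppet, the same repeatable per-host randomness is easy to reproduce; the quoted docs describe fqdn_rand as combining the $fqdn fact with a seed. Below is a hypothetical Ruby illustration of that idea, which such a wrapper could reuse; it is not the actual wrapper.

require 'digest'
require 'socket'

# Hypothetical illustration of an fqdn_rand-style splay outside Puppet:
# hash the host name together with a seed so that every host gets a
# different value, but a given host always gets the same one.
def host_splay(max, seed)
  Digest::MD5.hexdigest("#{Socket.gethostname}:#{seed}").to_i(16) % max
end

minute = host_splay(60, 'refresh-mcollective-metadata-minute')
puts "this host would run the job at #{minute} minutes past the hour"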

Post-countermeasure analysis/retrospective

With the splay rolled out we can see a much-improved pattern, but this is not the end of the story. This investigation has provided a lot of learning and highlighted some interesting technical issues that need to be addressed:

  • I want to know when services drop out of the load balancer unexpectedly for short periods of time.
  • I want to know when anomalies occur in our metrics, so that we can investigate and act on them.
  • I want a mechanism to protect and restrict the shared resources used by virtual machines.

There are technical solutions available to solve these problems; it is just a case of prioritising the work.
