Uptime-related server crashes

This is a guest post by Iliya Polihronov. Iliya is the newest member of the global infrastructure, systems, and security team at Automattic and the first-ever guest blogger here on barry.wordpress.com.

Hey, my name is Iliya, and as a Systems Wrangler at Automattic I am one of the people handling server-side issues across the 2000 servers running WordPress.com and other Automattic services.

Last week, within two hours of each other, two of our MogileFS storage servers locked up with the following trace:

The next day, a few more servers crashed with similar traces.

We started searching for a common pattern. All hosts were running Debian kernels ranging from 2.6.32-21 to 2.6.32-24, but they were spread across different data centers and served different purposes in our network.

One thing we noticed was that all of the servers crashed after a little more than 200 days of uptime. After some investigation, we found that the culprit appears to be a rather interesting kernel bug.

As part of the scheduler's load-balancing algorithm, the kernel searches for the busiest group within a given scheduling domain. To do that, it has to take into account the average load of all groups, which is calculated in find_busiest_group() with:

sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

sds.total_load is the sum of the load on all CPUs in the scheduling domain, based on the run queue tasks and their priority.

SCHED_LOAD_SCALE is a constant used to increase resolution.

sds.total_pwr is the sum of the power of all CPUs in the scheduling domain. This sum ends up being zero, and that is what causes the crash: a division by zero.
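To make the arithmetic concrete, here is a small standalone C sketch of the same computation (the per-CPU load and power figures are hypothetical, not kernel source). With the values below the division is harmless, but if every CPU in the domain reported a power of 0, total_pwr would sum to 0 and the division would trap with a divide error, which is what took the servers down:

#include <stdio.h>

#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT) /* 1024 */

int main(void)
{
    /* Hypothetical per-CPU figures for a 4-CPU scheduling domain. */
    unsigned long cpu_load[4]  = { 512, 2048, 1024, 0 };
    unsigned long cpu_power[4] = { 1024, 1024, 1024, 1024 };

    unsigned long total_load = 0, total_pwr = 0;
    for (int i = 0; i < 4; i++) {
        total_load += cpu_load[i];
        total_pwr  += cpu_power[i];
    }

    /* Same form as the line in find_busiest_group(); if total_pwr
     * were 0, this integer division would raise a divide error. */
    unsigned long avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;

    printf("avg_load = %lu\n", avg_load); /* 1024 * 3584 / 4096 = 896 */
    return 0;
}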

The “CPU power” is used to account for how much computing capacity a CPU has compared to the other CPUs. The main factors for calculating it are listed below, with a simplified sketch of how they combine after the list:

1. Whether the CPU is shared, for example via multithreading.
2. How many real-time tasks the CPU is processing.
3. In newer kernels, how much time the CPU has spent processing IRQs.
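As a rough illustration of how these factors combine, here is a standalone, hypothetical sketch (derate_cpu_power and the 589/512 scale values are made up for the example; the real kernel code differs): each factor is a fixed-point fraction of SCHED_LOAD_SCALE and the factors are multiplied together, so if any single factor collapses to zero, the CPU's power becomes zero, and a whole domain of such CPUs yields total_pwr == 0.

#include <stdio.h>

#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT) /* full power == 1024 */

/* Hypothetical illustration, not kernel source. */
static unsigned long derate_cpu_power(unsigned long smt_scale,
                                      unsigned long rt_scale)
{
    unsigned long power = SCHED_LOAD_SCALE;

    power = (power * smt_scale) >> SCHED_LOAD_SHIFT; /* factor 1: shared core */
    power = (power * rt_scale)  >> SCHED_LOAD_SHIFT; /* factor 2: RT tasks    */

    return power;
}

int main(void)
{
    /* An SMT sibling at ~57% of full power, half busy with RT tasks. */
    printf("cpu_power = %lu\n", derate_cpu_power(589, 512)); /* 294 */

    /* If the RT factor collapses to 0, cpu_power becomes 0 as well,
     * and a whole domain of such CPUs gives total_pwr == 0. */
    printf("cpu_power = %lu\n", derate_cpu_power(589, 0));   /* 0 */
    return 0;
}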

The currently suggested fix for this bug relies on the theory that, while taking real-time tasks into account (#2 above), scale_rt_power() could return a negative value, and thus the sum of all CPU powers may end up being zero.
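Here is a rough, standalone sketch of the idea behind that fix (simplified and hypothetical; scaled_rt_power and its numbers are illustrative, not the verbatim upstream patch): clamp the "available" time to zero when real-time tasks have consumed more than the measured period, and never let the resulting power drop below 1, so the per-domain sum can no longer reach zero.

#include <stdio.h>
#include <stdint.h>

#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1ULL << SCHED_LOAD_SHIFT)

/* Simplified sketch of the guard, not the actual upstream patch. */
static uint64_t scaled_rt_power(uint64_t period_ns, uint64_t rt_avg_ns)
{
    uint64_t available, power;

    if (rt_avg_ns > period_ns)
        available = 0;                   /* avoid the bogus negative value */
    else
        available = period_ns - rt_avg_ns;

    if (period_ns < SCHED_LOAD_SCALE)
        period_ns = SCHED_LOAD_SCALE;
    period_ns >>= SCHED_LOAD_SHIFT;

    power = available / period_ns;
    return power ? power : 1;            /* guarantee a non-zero cpu_power */
}

int main(void)
{
    /* Hypothetical numbers: rt_avg slightly larger than the period. */
    printf("power = %llu\n",
           (unsigned long long)scaled_rt_power(1ULL << 20, (1ULL << 20) + 5));
    return 0;
}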

The fix was merged into the 2.6.32.29 vanilla kernel, together with the IRQ accounting in cpu_power (#3 above). It is also merged into the Debian 2.6.32-31 kernel.

Alternatively, scheduler load balancing can be turned off, which effectively skips the affected code. This can be done using control groups; however, it should be used with caution as it may cause performance issues:

mkdir -p /cgroups
mount -t cgroup -o cpuset cpuset /cgroups
echo 0 > /cgroups/cpuset.sched_load_balance

As it is not yet absolutely clear whether the suggested fix really solves the problem, we will try to post updates on any new developments as we observe them.

9 responses to “Uptime-related server crashes”

  1. Any update on those crashes?

  2. Hi, happy to finally find something about this damn bug 🙂
    Today my filer crashed (running Debian squeeze with kernel 2.6.32-5-amd64); it had been running for ~212 days.
    I had the on-screen debugging with the same message, “find_busiest_group”… it seems it was the same situation as yours.
    After hard-rebooting the server, it’s OK.
    I used aptitude to get the latest official kernel for Debian, which is still 2.6.32-5, so I don’t know whether this bug can happen again or not.

  3. How sure are you that those servers did not hang themselves because they ran out of memory, because all the CPUs were completely at 100%? I ran into the same issue with 12 or more Ruby 1.9 processes: http://img.ly/8Xbp and my kernel hung with this message: https://plus.google.com/105218873624625602932/posts/5EGnZbBBMPk – looks similar to yours. I am running 2.6.33, and that is the kernel version all the financial institutions around the world run. I compile and install all my kernels manually after downloading them via git from the Linus tree.

  4. This should be the link to the kernel bug tracker, but it is down at the moment due to the hacking of kernel.org (I guess): https://bugzilla.kernel.org/show_bug.cgi?id=16991

  5. I had the same screen: http://pic.twitter.com/sAih9DlN
    but my filer is not heavily loaded – about 2 GB of RAM used out of 4, and 1/3 CPU 😉

    @zdavatz: is the bugzilla website down? I can’t reach it.

  6. Did anyone ever reach any conclusions on this?

    I’ve just (over a period of a few months) had 5 machines crash in suspiciously similar circumstances as they approached the >200-day uptime threshold. They were running various stock Debian squeeze kernels up to 2.6.32-39.

    Unfortunately the kernel messages were not capturable due to the somewhat inferior remote console setup on the machines in question.

    1. This problem is fixed in 2.6.32-45 and higher.

      1. Hi Barry,

        The complete changelog entry for 2.6.32-45 is just “Avoid ABI change on some archs due to a new #include in the fix for CVE-2012-2123.” – are you sure that’s the fix version you meant to say?

  7. Hello All,

    On some customer production systems with 2.6.32-220.el6.x86_64 (RHEL 6.2), machines like to panic due to this bug after exactly 212 or 214 days of uptime!

    This was a multi-node Veritas Cluster, and 2 or 3 nodes crashed within 24 hours, with uptimes around 212 or 214 days.

    But Red Hat never mentioned any connection between this divide-by-zero error and system uptime.

    Br,
    Tomasz
