Predicting server hardware failure with mcelog

Have you ever wanted to predict that a piece of hardware in your server was failing before it actually caused the server to crash?

Sure! We all do.

Over the past few months, I have been tracking the correlation between errors logged to the Machine Check Event Log (MCElog) and hard crashes of a server or of an application running on that server (mostly MySQL). So far, the correlation is about 90%. That is to say, about 9 times out of 10, an error is logged to the MCElog before the server actually crashes. It may take days or even weeks between the logged error and the crash, but it will happen. We are now actively monitoring this log and replacing hardware (RAM and CPUs) that shows errors before it actually fails. I thought that was pretty cool, so I thought I would share how we are doing it.

On Debian, there is a package for the mcelog utility, which decodes and displays the kernel messages logged to /dev/mcelog. Part of this package is a cron job that appends the decoded contents of /dev/mcelog to /var/log/mcelog every 5 minutes:

*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
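The guard at the start of that cron line can be read as two separate checks. Here is a minimal sketch of the same logic as a shell function; mcelog_should_run is a made-up name for illustration, not something the package ships, and both paths are parameters only so it can be tried outside /usr/sbin and /etc. It uses two test calls instead of -a, which POSIX marks as obsolescent.

```shell
#!/bin/bash
# mcelog_should_run is a hypothetical helper, not part of the mcelog package.
# The defaults are the paths from the packaged cron line.
mcelog_should_run() {
	local binary="${1:-/usr/sbin/mcelog}"
	local disable_flag="${2:-/etc/mcelog-disabled}"

	# Run only if the decoder is installed and executable...
	[ -x "$binary" ] || return 1
	# ...and the opt-out file has not been created.
	[ ! -e "$disable_flag" ]
}
```

With no arguments it checks the real paths, so creating /etc/mcelog-disabled is enough to make it return non-zero.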

We modify this a little bit and add another cron job which rotates this log file on reboot:

@reboot root test -f /var/log/mcelog && mv /var/log/mcelog /var/log/mcelog.0

We do this because a reboot is most likely the result of a hardware repair, and we want to clear the active logfile (monitored by the nagios plugin below) so that the alert clears.  If the reboot was not part of hardware maintenance, however, we still want a record of the hardware errors, so we move the log file to mcelog.0 rather than deleting it.
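To see what the rotation does without rebooting a machine, here is a minimal sketch of the same step as a function. rotate_mcelog and its directory argument are assumptions made for illustration; the real job is just the one-line @reboot entry.

```shell
#!/bin/bash
# rotate_mcelog is a hypothetical helper, not part of the mcelog package.
# Pass a scratch directory to try it; it defaults to /var/log.
rotate_mcelog() {
	local logdir="${1:-/var/log}"
	local active="$logdir/mcelog"

	# Nothing to rotate if no log has been written yet.
	[ -f "$active" ] || return 0

	# Keep the old errors as mcelog.0 (overwriting any previous copy).
	# The active log no longer exists after this; the cron job's >>
	# redirect recreates it on its next run.
	mv -- "$active" "$active.0"
}
```

Note that only one generation is kept: a second reboot overwrites mcelog.0, just as the plain mv in the cron entry does.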

We then have a simple nagios plugin which monitors /var/log/mcelog for errors:

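#!/bin/bash
# Nagios plugin: warn if mcelog has decoded any hardware errors.

LOGFILE=/var/log/mcelog

if [ ! -f "$LOGFILE" ]
then
	# Exit code 3 is UNKNOWN in Nagios.
	echo "No logfile exists"
	exit 3
fi

# grep -c prints the number of matching lines (0 if there are none).
ERRORS=$( grep -c "HARDWARE ERROR" "$LOGFILE" )

if [ "$ERRORS" -eq 0 ]
then
	echo "OK: $ERRORS hardware errors found"
	exit 0
else
	# Exit code 1 is WARNING in Nagios.
	echo "WARNING: $ERRORS hardware errors found"
	exit 1
fi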
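If you want to try the check without waiting for real hardware to misbehave, one option is to wrap the same logic in a function that takes the log path as an argument. This is only a sketch: check_mcelog_file is a made-up name, and the UNKNOWN wording is my own choice, not what the plugin above prints.

```shell
#!/bin/bash
# check_mcelog_file is a hypothetical name; it mirrors the plugin's logic
# but takes the log path as an argument so it can be pointed at a test file.
check_mcelog_file() {
	local logfile="$1"
	local errors

	if [ ! -f "$logfile" ] || [ ! -r "$logfile" ]
	then
		echo "UNKNOWN: $logfile does not exist or is not readable"
		return 3
	fi

	# grep -c prints 0 and exits 1 when nothing matches, so ignore its status.
	errors=$( grep -c "HARDWARE ERROR" "$logfile" || true )

	if [ "$errors" -eq 0 ]
	then
		echo "OK: 0 hardware errors found"
		return 0
	fi
	echo "WARNING: $errors hardware errors found"
	return 1
}
```

Point it at a scratch file containing a couple of lines with the string HARDWARE ERROR and it should return 1; point it at an empty file and it should return 0.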

And that's pretty much it.  In just a few weeks we have caught about a dozen hardware faults before they led to server crashes.

Disclaimer: This only works when running an x86_64 kernel, and YMMV.

WordPress Code Repository


We have decided to consolidate all of the small projects we have released into a single Subversion repository. Previously these were spread across multiple domains and not very well publicized. We have also set up a Trac instance to facilitate bug reports. There are currently 5 projects in the repository, all of which we have used or are currently using at Automattic. Some of the projects, like Servermattic, are also being used elsewhere. All of these projects are open source and released under the GPL. Patches and feedback are welcome! We hope to release more of these soon. Thanks to Nikolay and Demitrious, who have both contributed to the projects in the repository.

New Datacenter for WordPress.com

Towards the end of 2008, we brought online a new data center to serve the more than 5.5 million blogs now hosted on the WordPress.com platform.  Adding the data center in Chicago, IL gives us a total of 3 data centers across the US which serve live content at any given time.  We have decommissioned one of our facilities in the Dallas, TX area.  Our friends at Layered Technologies were kind enough to shoot this footage for us (think The Blair Witch Project) and the always awesome Michael Pick took care of the editing.  Here’s a peek at what a typical WordPress data center installation looks like…

For those interested in technical details, here is a hardware overview of the installation:

150 HP DL165s: dual quad-core AMD 2354 processors, 2GB-4GB RAM
50 HP DL365s: dual dual-core AMD 2218 processors, 4GB-16GB RAM
5 HP DL185s: dual quad-core AMD 2354 processors, 4GB RAM

And here is a graph of what the current CPU usage looks like across about 700 CPU cores.  As you can see, there is plenty of idle CPU for those big spikes, or in case one of the other 2 data centers fails and we have to route more traffic to this one.

[Graph: CPU usage, Chicago data center]
