Predicting server hardware failure with mcelog

Have you ever wanted to predict that a piece of hardware in your server was failing before it actually caused the server to crash?

Sure! We all do.

Over the past few months, I have been tracking the correlation between errors logged to the Machine Check Event log (MCElog) and hard crashes of a server or of an application running on that server (mostly MySQL). So far, the correlation is about 90%. That is to say, about 9 times out of 10 there will be an error logged to the MCElog before the server actually crashes. It may take days or even weeks between the logged error and the crash, but it will happen. We are now actively monitoring this log and replacing hardware (RAM and CPUs) that shows errors before it actually fails, which I thought was pretty cool, so here is how we are doing it.

On Debian, there is a package for the mcelog utility, which allows you to decode and display the kernel messages logged to /dev/mcelog. Part of this package is a cron job which outputs the decoded contents of /dev/mcelog to /var/log/mcelog every 5 minutes:

*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
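If you want to poke at a box by hand before relying on the cron job, you can install the package and run the same command the cron entry uses. A quick sketch, assuming the standard Debian package layout:

apt-get install mcelog                     # pulls in the decoder and the packaged cron job
/usr/sbin/mcelog --ignorenodev --filter    # decode and print any pending events from /dev/mcelog

The --ignorenodev flag makes mcelog exit quietly on machines without /dev/mcelog, and --filter drops known-bogus events, so on a healthy box this should print nothing.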

We modify this a little bit and add another cron job which rotates this log file on reboot:

@reboot root test -f /var/log/mcelog && mv /var/log/mcelog /var/log/mcelog.0

We do this because after a reboot, which is most likely the result of a hardware repair, we want to clear the active log file (monitored by the Nagios plugin below) so that the alert will clear. If, however, the reboot was not part of hardware maintenance, we still want a record of the hardware errors, so we move the log file to mcelog.0 instead of deleting it.
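Putting the two entries together, the cron file ends up looking something like this (on Debian the packaged job is typically installed as /etc/cron.d/mcelog):

# Decode pending machine check events every 5 minutes
*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
# Rotate the log at boot so a post-repair reboot clears the Nagios alert
@reboot root test -f /var/log/mcelog && mv /var/log/mcelog /var/log/mcelog.0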

We then have a simple Nagios plugin which monitors /var/log/mcelog for errors:

#!/bin/bash
#
# Nagios plugin: check /var/log/mcelog for decoded machine check errors.
# Exit codes follow the Nagios convention: 0 = OK, 1 = WARNING, 3 = UNKNOWN.

LOGFILE=/var/log/mcelog

if [ ! -f "$LOGFILE" ]
then
	# No log file yet (e.g. just rotated at boot and the cron job has not run)
	echo "No logfile exists"
	exit 3
else
	ERRORS=$( grep -c "HARDWARE ERROR" "$LOGFILE" )
	if [ "$ERRORS" -eq 0 ]
	then
		echo "OK: $ERRORS hardware errors found"
		exit 0
	else
		echo "WARNING: $ERRORS hardware errors found"
		exit 1
	fi
fi
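To hook this into Nagios, the wiring is the usual command-plus-service pair. Here is a minimal sketch assuming the plugin above is saved as check_mcelog and run over NRPE; the paths, host group, and names below are illustrative, not taken from our actual configuration:

# On each monitored host, in nrpe.cfg (hypothetical plugin path)
command[check_mcelog]=/usr/local/lib/nagios/plugins/check_mcelog

# On the Nagios server
define command {
    command_name    check_nrpe_mcelog
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_mcelog
}

define service {
    use                     generic-service
    hostgroup_name          linux-servers
    service_description     MCE hardware errors
    check_command           check_nrpe_mcelog
}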

And that's pretty much it. In just a few weeks, we have caught about a dozen hardware faults before they led to server crashes.

Disclaimer: This only works when running an x86_64 kernel, and YMMV.
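If you are not sure whether a given box qualifies, a quick sanity check before rolling the check out (just a sketch):

uname -m            # should report x86_64
ls -l /dev/mcelog   # the device node the kernel logs machine check events to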

6 responses to “Predicting server hardware failure with mcelog”

  1. Thanks a lot.
    You might consider writing more frequently on this blog about your “automattic discoveries” 🙂

  2. It’d be great if you followed up with some statistics about frequency of server failure before and after once you’ve been doing this for a while. If the difference is significant we’ll know that this is really something hot and important.

    1. Hey Randall. I will gather up some stats and try to do a follow-up post at some point. From my initial impressions, we are seeing fewer unexpected complete HW failures now that we are proactively replacing hardware when errors appear in the MCElog. At the same time, it results in a lot of extra work, because I suspect only 20-30% of the servers that undergo proactive maintenance would have failed later. As a result, we are touching a lot more boxes than we normally would. Also, it's not catching everything, and we still see the occasional complete failure as the result of bad RAM, CPU, etc.

  3. How do you interpret the MCE logs? Do you get a definite indicator that there is a hardware error? How often has it happened that the vendor diagnosed the server and no hardware component got flagged for failure? Are you concerned by a single MCE log entry, or what is your threshold for concern? How would you interpret the following mcelog output?
    Feb 10 12:15:02 t01041 mcelog: STATUS cc0001000001009f MCGSTATUS 0
    Feb 10 12:15:02 t01041 mcelog: Resolving address 54e1a0e40 using SMBIOS
    Feb 10 12:15:02 t01041 mcelog: No DIMMs found in SMBIOS tables
    Feb 10 12:15:02 t01041 mcelog: HARDWARE ERROR. This is *NOT* a software problem!
    Feb 10 12:15:02 t01041 mcelog: Please contact your hardware vendor
    Feb 10 12:15:02 t01041 mcelog: CPU 7 BANK 8
    Feb 10 12:15:02 t01041 mcelog: TSC 4739e8af70daaa
    Feb 10 12:15:02 t01041 mcelog: MISC 9a43050400045840
    Feb 10 12:15:02 t01041 mcelog: ADDR 54e1a0e40

  4. Those mcelog entries are rather cryptic. Does anybody know how to find the exact memory module causing the error? For example, on my server with two quad-core Intel Xeon X5550 CPUs I have the following:
    MCE 31
    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 3 BANK 8 TSC b70375c0a7ba4e [at 2660 Mhz 224 days 3:18:21 uptime (unreliable)]
    MISC 741d401000081784 ADDR 34521ed40
    MCG status:
    MCi status:
    Error overflow
    MCi_MISC register valid
    MCi_ADDR register valid
    MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    Transaction: Memory read error
    Memory read ECC error
    Memory corrected error count (CORE_ERR_CNT): 44
    Memory transaction Tracker ID (RTId): 4
    Memory DIMM ID of error: 0
    Memory channel ID of error: 0
    Memory ECC syndrome: 741d4010
    STATUS cc000b000001009f MCGSTATUS 0

    Does “Memory DIMM ID of error: 0” really mean that the problem is with the first memory module installed?

    1. Hi,

      I think you should try upgrading to the latest version of mcelog, as older versions can’t read all the info from the Nehalem Xeon CPUs.
