If you are attending WordCamp NYC this year and have ever upgraded WordPress, please consider volunteering for an hour at the Genius Bar.
Yesterday we had a session about scaling, servers, and WordPress at the first WordPress Developer Day in San Francisco. We used a P2 blog on WordPress.com, which allowed people to ask questions that Demitrious, Chris, and I then answered. I went back and posted some follow-up answers, so the site will hopefully serve as a reference for others looking for answers to these questions.
We are looking at switching some of our servers from AMD Opteron Barcelona quad-core processors to the new Intel 5520 Nehalem processors. These are both 4 core CPUs, but the Intels utilize hyper-threading, so the OS sees 8 cores per CPU. It wasn’t that long ago that the first thing you did with a hyper-threading-enabled CPU was switch it off in the BIOS, but I have heard good things about Intel’s reincarnation of hyper-threading, so I decided to give it a shot.
I ran some real-world stress tests against these servers, adding them into the WordPress.com web pool and seeing how many requests per second they could serve before becoming 100% CPU bound and effectively falling over. The types of requests served are varied; a lot are rendering web pages, but there are also quite a few image resizing operations thrown in, as we spread this image work evenly over the 2500 cores in our web tier. Everything is PHP executed via FastCGI. I was a bit skeptical that there would be much of a difference between the two processors, but the numbers proved me wrong: the Nehalems are impressive.
2 x AMD Opteron 2356 Barcelona Quad-core 2.3 GHz
40 requests/second at 87.5% CPU utilization
2 x Intel 5520 Nehalem Quad-core 2.26 GHz
78 requests/second at 94% CPU utilization
A few things that I thought were interesting:
- On a per request basis, there isn’t much of a difference between the two. They both generate a given page in roughly the same amount of time.
- As CPU utilization approaches 100%, the Intels scale rather linearly, while the AMDs seem to struggle above 85%.
- The load averages were pretty high during these tests (35+ on the Intel box), but request times didn’t seem to suffer.
Has anyone else seen the same sort of results or maybe something to the contrary? These 2 configurations are roughly the same price, making it seem like a no-brainer to choose the Intels for web applications.
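For anyone who wants to play with the numbers, here is a quick back-of-the-envelope sketch. It only uses the requests/second figures from the results above; the function names are mine, and the per-core figure assumes 2 x 4 = 8 physical cores per box (it ignores hyper-threading).

```shell
#!/bin/bash
# Back-of-the-envelope math on the measured figures from the stress test.
# rps_per_core and speedup are illustrative helper names, not part of any tool.

# rps_per_core RPS CORES: requests/second handled per physical core
rps_per_core() {
    awk -v rps="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", rps / cores }'
}

# speedup BASE NEW: how many times more requests/second NEW serves than BASE
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / base }'
}

# Both boxes have 2 quad-core CPUs, so 8 physical cores each.
echo "AMD:   $(rps_per_core 40 8) req/s per physical core"
echo "Intel: $(rps_per_core 78 8) req/s per physical core"
echo "Intel vs AMD: $(speedup 40 78)x the throughput"
```

Keep in mind the two peaks were measured at different CPU utilization levels (87.5% vs 94%), so this is a rough comparison rather than a controlled benchmark.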
Have you ever wanted to predict that a piece of hardware in your server was failing before it actually caused the server to crash?
Sure! We all do.
Over the past few months, I have been tracking the correlation between errors logged to the Machine Check Event Log (MCElog) and the hard crash of a server or of an application running on that server (mostly MySQL). So far, the correlation is about 90%. That is to say, about 9 times out of 10, there will be an error logged to the MCElog before the server actually crashes. It may take days or even weeks between the time of the logged error and the crash, but it will happen. We are now actively monitoring this log and replacing hardware (RAM and CPUs) that shows errors before it actually fails. I thought that was pretty cool, so I thought I would share how we are doing it.
On Debian, there is a package for the mcelog utility which will allow you to decode and display the kernel messages logged to /dev/mcelog. Part of this package is a cron job which appends the decoded contents of /dev/mcelog to /var/log/mcelog every 5 minutes:
*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
We modify this a little bit and add another cron job which rotates this log file on reboot:
@reboot root test -f /var/log/mcelog && mv /var/log/mcelog /var/log/mcelog.0
We do this because after a reboot, which is most likely the result of a hardware repair, we want to clear the active logfile (monitored by the nagios plugin below) so the alert will clear. In case the reboot was not part of hardware maintenance, however, we still want a record of the hardware errors, so we move the log file to mcelog.0 rather than deleting it.
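If you want to convince yourself of what the @reboot job does before putting it in cron, the same test-and-move logic can be tried against a scratch directory. This is only a sketch: the function name rotate_on_boot and the scratch paths are made up for illustration.

```shell
#!/bin/bash
# rotate_on_boot FILE: move FILE to FILE.0 if it exists, like the @reboot job.
# (Hypothetical helper name; the real job is the one-line cron entry.)
rotate_on_boot() {
    local logfile="$1"
    if [ -f "$logfile" ]; then
        mv "$logfile" "$logfile.0"
    fi
}

# Try it on a throwaway directory instead of /var/log.
dir=$(mktemp -d)
echo "placeholder log content" > "$dir/mcelog"
rotate_on_boot "$dir/mcelog"
ls "$dir"    # the active log is gone; its contents now live in mcelog.0
rm -rf "$dir"
```

One side effect worth knowing: until the 5-minute cron job runs again and recreates the file, there is no active logfile at all.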
We then have a simple nagios plugin which monitors /var/log/mcelog for errors:
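#!/bin/bash
# Nagios plugin: warn if mcelog has recorded any hardware errors.
LOGFILE=/var/log/mcelog

if [ ! -f "$LOGFILE" ]
then
    # No log file to inspect: exit 3 (UNKNOWN)
    echo "No logfile exists"
    exit 3
else
    # Count the lines that mcelog flagged as hardware errors
    ERRORS=$( grep -c "HARDWARE ERROR" "$LOGFILE" )
    if [ "$ERRORS" -eq 0 ]
    then
        echo "OK: $ERRORS hardware errors found"
        exit 0
    elif [ "$ERRORS" -gt 0 ]
    then
        echo "WARNING: $ERRORS hardware errors found"
        exit 1
    fi
fi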
And that's pretty much it. In just a few weeks we have caught about a dozen hardware faults before they led to server crashes.
Disclaimer: This only works when running an x86_64 kernel, and YMMV.
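To try the check without waiting for real hardware to misbehave, the counting logic can be pulled into a function and pointed at a scratch file. This is just a sketch: the function name check_mcelog and the sample lines are made up for illustration (they are not real mcelog output); only the "HARDWARE ERROR" marker and the exit codes come from the plugin above.

```shell
#!/bin/bash
# check_mcelog FILE: same logic as the plugin, with the log path as an argument.
# (Hypothetical helper name, for experimenting only.)
check_mcelog() {
    local logfile="$1"
    if [ ! -f "$logfile" ]; then
        echo "No logfile exists"
        return 3
    fi
    local errors
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    errors=$(grep -c "HARDWARE ERROR" "$logfile" || true)
    if [ "$errors" -eq 0 ]; then
        echo "OK: $errors hardware errors found"
        return 0
    fi
    echo "WARNING: $errors hardware errors found"
    return 1
}

# Placeholder lines, NOT real mcelog output.
scratch=$(mktemp)
printf 'HARDWARE ERROR (placeholder)\nunrelated line\nHARDWARE ERROR (placeholder)\n' > "$scratch"
status=0
check_mcelog "$scratch" || status=$?
echo "exit status: $status"
rm -f "$scratch"
```

Taking the path as an argument also makes it easy to point the same check at mcelog.0 after a reboot.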
We have decided to consolidate all of the small projects we have released into a single Subversion repository. Previously these were spread across multiple domains and not very well publicized. We have also set up a Trac instance to facilitate bug reports. There are currently 5 projects in the repository, all of which we have used or are currently using at Automattic. Some of the projects, like Servermattic, are also being used elsewhere. All of these projects are open source and released under the GPL. Patches and feedback are welcome! We hope to release more of these soon. Thanks to Nikolay and Demitrious, who have both contributed to the projects in the repository.