Author: Barry

  • Keeping track of 300 servers

    Since WordPress.com broke 10 million pageviews today, I thought it would be a good time to talk a little bit about keeping track of all the servers that run WordPress.com, Akismet, WordPress.org, Ping-o-matic, etc. Currently we have over 300 servers online in 5 different data centers across the country. Some of these are colocated, and others are with dedicated hosting providers, but the bottom line is that we need to keep track of them all as if they were our children! Currently we use Nagios for server health monitoring, Munin for graphing various server metrics, and a wiki to keep track of all the server hardware specs, IPs, vendor IDs, etc. All of these tools have served us well up until now, but there have been some scaling issues.

    • MediaWiki: Like Wikipedia, we use MediaWiki. One page holds a table with all of our server information, from hardware configuration to physical location, price, and IP information. Unfortunately, MediaWiki tables are not very flexible and you cannot perform row- or column-based operations, so even something as simple as counting how many servers we have is time-consuming. Also, at 300 rows, editing the table becomes very tedious, and it is easy to make a mistake that throws the entire table out of whack. Even dividing the data into a few tables doesn’t make it much easier. In addition, there is no concept of unique records (nor do I really think there should be), so it is very easy to end up with 2 servers listed with the same IP or the same hostname.
    • Munin: Munin has become an invaluable tool for us when troubleshooting issues and planning future server expansion. Unfortunately, scaling Munin hasn’t been the best experience. At about 100 hosts, we started running into disk IO problems caused by the various data collection, graphing, and HTML output jobs Munin runs. It seemed the solution was to switch to the JIT graphing model, which only draws the graphs when you view them. Unfortunately, this only made the interface excruciatingly slow and didn’t help the IO problems we were having. At about 150 hosts we moved Munin to a dedicated server with 15k RPM SCSI drives in a RAID 0 array in an attempt to give it some more breathing room. That worked for a while, but we then started running into problems where polling all the hosts took longer than the monitoring interval, so we were missing some data. Since then, we have resorted to removing some of the things we graph on each server in order to lighten the load. Every once in a while, we still run into problems where a server is a little slow to respond and polling takes longer than the 5-minute interval. Obviously, better hardware and fewer graphed items aren’t a scalable solution, so something is going to have to change. We could put a Munin monitoring server in each datacenter, but we currently sum and stack graphs across datacenters, and I am not sure if or how that works when the data is on different servers. The other big problem I see with Munin is that if one host’s graphs stop updating and that host is part of a totals graph, the totals graph just stops working. This happened today, which was very frustrating.
    • Nagios: I feel this has scaled the best of the three. We have it running on a relatively light server and have no load or scheduling issues. I think it is time, however, to look at moving to Nagios’ distributed monitoring model. The main reason is that we have multiple datacenters, each of which has its own private network, so it is important for us to monitor each of these networks independently, in addition to the public internet connectivity to each datacenter. The simplest way to do this is to put a Nagios monitoring node in each data center, which can then monitor all the servers in that facility and report the results back to the central monitoring server. Splitting up the workload should also allow us to scale to thousands of hosts without any problems.
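
    To make the Munin polling problem concrete, here is a minimal sketch of one way a poller could avoid overrunning its interval: poll hosts concurrently so one slow server does not delay the rest, and build totals from whichever hosts answered. This is not how Munin itself works. The names `poll_hosts`, `total`, and the `fetch` callable are hypothetical and exist only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

POLL_INTERVAL_S = 300  # the 5-minute monitoring interval


def poll_hosts(hosts, fetch, max_workers=20, budget_s=POLL_INTERVAL_S):
    """Poll all hosts concurrently.

    `fetch` is a hypothetical callable: hostname -> dict of metrics.
    Returns (results, failed, elapsed, within_budget), where within_budget
    says whether the whole polling cycle finished inside budget_s.
    """
    start = time.monotonic()
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception:
                # A broken host is recorded, but does not stop the cycle.
                failed.append(host)
    elapsed = time.monotonic() - start
    return results, failed, elapsed, elapsed <= budget_s


def total(results, metric):
    """Sum one metric over the hosts that actually reported it."""
    return sum(m[metric] for m in results.values() if metric in m)
```

    With this shape, a host that stops responding shows up in `failed` instead of silently breaking the totals, although a total built from fewer hosts still needs to be read with care.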

    Anyone have recommendations on how to better deal with these basic server monitoring needs? I have looked at Zabbix, Cacti, Ganglia, and some others in the past, but have never been super-impressed. Barring any major revelations in the next couple weeks, I think we are going to continue to scale out Nagios and Munin and replace the wiki page with a simple PHP/MySQL application that is flexible enough to integrate into our configuration management and deploy tools.
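
    As a rough illustration of what a database-backed inventory buys over the wiki table, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the PHP/MySQL application mentioned above, and the table name, column names, and helper functions are all hypothetical. The point is only that unique constraints reject duplicate hostnames and IPs, and that counting servers becomes a one-line query.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE servers (
        id         INTEGER PRIMARY KEY,
        hostname   TEXT NOT NULL UNIQUE,
        ip         TEXT NOT NULL UNIQUE,
        datacenter TEXT NOT NULL
    )
    """
)


def add_server(hostname, ip, datacenter):
    """Insert a server; return False if the hostname or IP already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO servers (hostname, ip, datacenter) VALUES (?, ?, ?)",
                (hostname, ip, datacenter),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count_by_datacenter():
    """Return a {datacenter: server_count} mapping."""
    rows = conn.execute(
        "SELECT datacenter, COUNT(*) FROM servers "
        "GROUP BY datacenter ORDER BY datacenter"
    )
    return dict(rows.fetchall())
```

    For example, adding a second server with an IP that is already listed returns False instead of quietly creating a conflicting row.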

  • California Road Trip

    Granted, it’s not a 7000 mile trek on a 1985 Honda motorcycle, but 1081 miles in a semi-comfortable car is long enough for me. Heading out tomorrow, be back in about a week. Hope to see a few friends along the way.

    Day 1
    Distance Traveled – 270 miles
    Driving Time – 6.5 hours
    Location – Pismo Beach, CA
    Photos

    Day 2
    Distance Traveled – 200 miles
    Driving Time – 3 hours
    Location – Pasadena, CA

    Day 3
    Distance Traveled – not very far
    Driving Time – not much
    Location – Manhattan Beach, CA
    Photos

    Day 4
    Distance Traveled – 125 miles
    Driving Time – 4 hours (LOTS of traffic)
    Location – San Diego, CA

    Days 5-6
    Location – San Diego, CA

    Day 7
    Distance Traveled – 500 miles
    Driving Time – 9 hours (includes lunch and 2 hours of traffic in LA)
    Location – San Francisco, CA

  • Calling all F1 WordPressers

    Any WordPress users at the US Formula One Grand Prix this weekend? If so, I thought it would be neat to meet up and watch either the morning practice session or qualifying tomorrow (Saturday). For the morning practice at 10AM I was thinking about finding a seat in the grandstand at turn 10. For qualifying at 1PM, I think across from the Ferrari pits in the upper section is the place to be. Last year there was some exciting action at the beginning of qualifying. Here is a link to the track map. If you are at the race and are interested in meeting up, please leave a comment and we can rendezvous just before the sessions start. And yes, the picture below is me typing this post at the circuit… I wouldn’t have it any other way.

    blogging.jpg

  • PHP 4 vs PHP 5 Basic Benchmark

    I’ve been doing some preliminary research and testing to see whether upgrading to PHP 5 is something we want to do on WordPress.com and, if so, how soon. Here are the results of a simple ApacheBench (ab) test of a phpinfo page. The test environment was as follows:

    Hardware:

    • Dual AMD Opteron 246
    • 2GB RAM
    • 2 x 160GB SATA drives in a RAID 1 array

    Software:

    • Debian Sarge AMD64
    • LiteSpeed 3.0.3

    The tests were run from the same machine running the web server so network latency is not a factor. The test parameters were 5000 total requests with a concurrency of 100.

    PHP 4.4.6 with APC 3.0.14
    Concurrency Level: 100
    Time taken for tests: 5.581265 seconds
    Complete requests: 5000
    Failed requests: 0
    Write errors: 0
    Total transferred: 134854586 bytes
    HTML transferred: 134070000 bytes
    Requests per second: 895.85 [#/sec] (mean)
    Time per request: 111.625 [ms] (mean)
    Time per request: 1.116 [ms] (mean, across all concurrent requests)
    Transfer rate: 23595.55 [Kbytes/sec] received

    PHP 5.2.2 with APC 3.0.14
    Concurrency Level: 100
    Time taken for tests: 8.388090 seconds
    Complete requests: 5000
    Failed requests: 0
    Write errors: 0
    Total transferred: 183254839 bytes
    HTML transferred: 182470000 bytes
    Requests per second: 596.08 [#/sec] (mean)
    Time per request: 167.762 [ms] (mean)
    Time per request: 1.678 [ms] (mean, across all concurrent requests)
    Transfer rate: 21334.89 [Kbytes/sec] received

    From these preliminary tests, PHP 5.2.2 seems about 33% slower than PHP 4.4.6 in requests per second. Surprising…

    NOTE: One thing that may contribute to the apparent slowness is that the phpinfo page grew from 26814 bytes to 36494 bytes in the upgrade process.
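
    The arithmetic behind those figures can be checked with a short calculation. It uses only numbers from the two result blocks above; the variable and function names are made up for this sketch.

```python
# Figures copied from the two ApacheBench runs above.
php4 = {"rps": 895.85, "kb_per_s": 23595.55, "page_bytes": 26814}
php5 = {"rps": 596.08, "kb_per_s": 21334.89, "page_bytes": 36494}


def pct_change(old, new):
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0


rps_change = pct_change(php4["rps"], php5["rps"])             # requests/sec
rate_change = pct_change(php4["kb_per_s"], php5["kb_per_s"])  # transfer rate
size_change = pct_change(php4["page_bytes"], php5["page_bytes"])  # page size

print(f"requests/sec:  {rps_change:+.1f}%")
print(f"transfer rate: {rate_change:+.1f}%")
print(f"page size:     {size_change:+.1f}%")
```

    Requests per second drop by roughly a third, while the page is roughly 36% larger and the transfer rate in Kbytes/sec falls by just under 10%. That suggests the larger phpinfo output may account for a good part of the gap, though this test alone cannot separate the two effects.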

    Has anyone else run similar tests? Are the results the same?

  • One milllllliion blogs on WordPress.com

    At 10:56:22PM PDT on 5/23/2007, the 1 millionth active blog was registered on WordPress.com. And the winner is…..

    claudiacanals.wordpress.com

    Not much there right now, but hopefully there will be soon. Maybe head over and leave a comment on their about page to let them know!

    Predictions on how long it will take to get to 2 million?