Keeping track of 300 servers

Since WordPress.com broke 10 million pageviews today, I thought it would be a good time to talk a little bit about keeping track of all the servers that run WordPress.com, Akismet, WordPress.org, Ping-o-matic, etc. Currently we have over 300 servers online in 5 different data centers across the country. Some of these are colocated, and others are with dedicated hosting providers, but the bottom line is that we need to keep track of them all as if they were our children! Currently we use Nagios for server health monitoring, Munin for graphing various server metrics, and a wiki to keep track of all the server hardware specs, IPs, vendor IDs, etc. All of these tools have suited us well up until now, but there have been some scaling issues.

  • MediaWiki — Like Wikipedia, we have a MediaWiki page with a table that contains all of our server information, from hardware configuration to physical location, price, and IP information. Unfortunately, MediaWiki tables don’t seem to be very flexible, and you cannot perform row- or column-based operations. This makes simple things, such as counting how many servers we have, somewhat time consuming. Also, once you get to 300 rows, editing the table becomes very tedious, and it is very easy to make a mistake that throws the entire table out of whack. Even dividing the data into a few tables doesn’t make it much easier. In addition, there is no concept of unique records (nor do I really think there should be), so it is very easy to end up with 2 servers listed with the same IP or the same hostname.
  • Munin — Munin has become an invaluable tool for us when troubleshooting issues and planning future server expansion. Unfortunately, scaling Munin hasn’t been the best experience. At about 100 hosts, we started running into disk I/O problems caused by the various data collection, graphing, and HTML output jobs Munin runs. It seemed the solution was to switch to the JIT graphing model, which only draws the graphs when you view them. Unfortunately, this only made the interface excruciatingly slow and didn’t help the I/O problems we were having. At about 150 hosts we moved Munin to a dedicated server with 15k RPM SCSI drives in a RAID 0 array in an attempt to give it some more breathing room. That worked for a while, but then we started running into problems where polling all the hosts actually took longer than the monitoring interval, with the result that we were missing some data. Since then, we have resorted to removing some of the things we graph on each server in order to lighten the load. Every once in a while, we still run into problems where a server is a little slow to respond and causes the polling to take longer than 5 minutes. Obviously, better hardware and reducing graphed items isn’t a scalable solution, so something is going to have to change. We could put a Munin monitoring server in each datacenter, but we currently sum and stack graphs across datacenters, and I am not sure if/how that works when the data is on different servers. The other big problem I see with Munin is that if one host’s graphs stop updating and that host was part of a totals graph, the totals graph will just stop working. This happened today — very frustrating.
  • Nagios — I feel this has scaled the best of the 3. We have this running on a relatively light server and have no load or scheduling issues. I think it is time, however, to look at moving to Nagios’ distributed monitoring model. The main reason for this is that since we have multiple datacenters, each of which has its own private network, it is important for us to monitor each of these networks independently in addition to the public internet connectivity to each datacenter. The simplest way to do this is to put a Nagios monitoring node in each data center, which can then monitor all the servers in that facility and report the results back to the central monitoring server. Splitting up the workload should also allow us to scale to thousands of hosts without any problems.
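
The distributed model described above can be sketched as follows; this is a hypothetical illustration, not a description of our actual setup, and the host and service names are made up. Nagios accepts passive check results through its external command file in the format `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`, so a per-datacenter node can run the checks locally and ship only the results to the central server:

```python
import time

def passive_result(host, service, return_code, output, ts=None):
    """Format one passive service-check result line for the Nagios
    external command file. A per-datacenter node would run the check
    locally and submit only this line (e.g. via NSCA) to the central
    server."""
    ts = int(ts if ts is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, return_code, output)

# Hypothetical example: a node in one datacenter reporting an HTTP check.
line = passive_result("web042", "HTTP", 0, "HTTP OK - 0.042s response",
                      ts=1190000000)
print(line)
# -> [1190000000] PROCESS_SERVICE_CHECK_RESULT;web042;HTTP;0;HTTP OK - 0.042s response
```

The central server then only has to process result lines instead of running every check itself, which is what lets the model scale across datacenters.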

Anyone have recommendations on how to better deal with these basic server monitoring needs? I have looked at Zabbix, Cacti, Ganglia, and some others in the past, but have never been super-impressed. Barring any major revelations in the next couple of weeks, I think we are going to continue to scale out Nagios and Munin and replace the wiki page with a simple PHP/MySQL application that is flexible enough to integrate into our configuration management and deploy tools.
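
The duplicate-records problem the wiki can't solve is exactly what database constraints give you. Here is a minimal sketch of the idea, with SQLite standing in for MySQL and all column, host, and datacenter names invented for illustration:

```python
import sqlite3

# Hypothetical inventory schema: UNIQUE constraints mean two servers can
# never share a hostname or an IP, and counting becomes a one-line query.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE servers (
        id         INTEGER PRIMARY KEY,
        hostname   TEXT NOT NULL UNIQUE,
        ip         TEXT NOT NULL UNIQUE,
        datacenter TEXT NOT NULL,
        hw_config  TEXT,
        vendor_id  TEXT
    )""")
db.execute("INSERT INTO servers (hostname, ip, datacenter) "
           "VALUES ('web01', '10.0.1.5', 'dc1')")
db.execute("INSERT INTO servers (hostname, ip, datacenter) "
           "VALUES ('web02', '10.0.1.6', 'dc2')")

# A duplicate IP is rejected instead of silently becoming a second record.
try:
    db.execute("INSERT INTO servers (hostname, ip, datacenter) "
               "VALUES ('web03', '10.0.1.5', 'dc1')")
except sqlite3.IntegrityError:
    print("duplicate IP rejected")

# Counting servers (total, or per datacenter) is trivial next to a wiki table.
count = db.execute("SELECT COUNT(*) FROM servers").fetchone()[0]
print(count)  # -> 2
```

The same schema can then feed config management and deploy tools directly, which a wiki table never could.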

Author: Barry


59 thoughts on “Keeping track of 300 servers”

  1. I’ve been told by a fellow co-op member that Zenoss is worth looking at as an alternative to Nagios. Apparently it scales better. Haven’t tried it myself, but it might be worth considering.

  2. How are these 300 servers approximately distributed over the different services (WordPress.com, Akismet, WordPress.org, Ping-o-matic)? I’d expect 95% of these would be for WordPress.com. Or does Akismet ‘consume’ a lot of servers (given the huge number of spam messages it has to process)?

  3. Yes, the majority of the servers power WordPress.com. Although Akismet deals with millions of spam messages per day, the infrastructure required to run the service is much smaller than that of WordPress.com.

  4. Barry,

    For the Munin monitoring problem: I assume Munin uses RRD, so the disk I/O issue is a common problem. One way we’ve gotten around it (we use collectd rather than Munin, but it doesn’t matter) is to collect the stats at each datacenter and rsync the RRDs to a central “graphing server.” This has a number of other benefits as well, and helps you scale out while still letting you pull data from multiple RRDs at the same time to do complex graphs.

  5. I used to work with a team managing the same sized network, and we used a combination of nagios and ganglia for monitoring.

    Ganglia uses RRD for data storage. Our solution to the performance issue was to store the rrd files on a ramdisk. You could have a cron job copy the files to disk every few minutes to minimize data loss on a power failure or reboot.

    I later used the cacti package to manage about 40 hosts/routers, and I found that it was very easy to use and setup, but it didn’t scale well at all (interface wise).

    One question I have is what kind of naming conventions you guys use for your servers.

    Finally, Tim’s suggestion of using DabbleDB is a great one. It’s a phenomenal tool.

  6. When I was at Rackspace we used OpenNMS to power our monitoring services and I wasn’t a big fan. That was a while ago and I am sure it has gotten better, so I will definitely take a look.

  7. “The other big problem I see with munin is that if one host’s graphs stop updating and that host was part of a totals graph, the totals graph will just stop working.”

    This is because the value for that data point is UN (unknown), so anything plus UN is UN and your graph will have gaps.

    Not sure how to go about fixing this within the Munin framework as I’m not familiar with it, but if you can get at the RRD definition for the graph you can fix it.

    If the data point is A then do this RPN to test for UN:

    CDEF:B=A,UN,0,A,IF

    Now B is either the real data or zero instead of unknown.

    HTH.
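
    To make the RPN above concrete, here is a hypothetical Python sketch of the same logic. `A,UN,0,A,IF` evaluates to 0 when A is unknown and to A otherwise, which is exactly what keeps a totals graph alive when one host goes dark (unknown is modeled as None here, and the sample values are invented):

```python
def substitute_unknown(value, fallback=0):
    """Python equivalent of the RRD CDEF  B = A,UN,0,A,IF :
    if A is unknown, use the fallback; otherwise use A."""
    return fallback if value is None else value

# Per-host requests/sec, where one host's data has gone stale (unknown):
samples = {"web01": 120.0, "web02": None, "web03": 95.5}

# Without the CDEF, anything + unknown = unknown and the totals graph gaps;
# with it, the dead host simply contributes zero to the total.
total = sum(substitute_unknown(v) for v in samples.values())
print(total)  # -> 215.5
```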

  8. I’ve always been a big fan of nagios and munin. I run a small set of servers and they do the job. Plus ntop is always fun but not really what you’re talking about :p

    I too tried cacti, and it seems like it would be great, but after spending a day getting it to work, I just gave up.

    You might try just writing your own rrdtool programs. You may be able to optimize the statistics you want more easily that way.

  9. We use the product Orion by SolarWinds (www.solarwinds.net) for all our monitoring. It’s not free, but all my coworkers and I feel it’s definitely worth the price. Thus far, no scaling issues whatsoever. (I believe we’re monitoring around the same number of servers.)

    We also have a similar Wiki setup, except we use DokuWiki. Can’t sing enough praises about DW… and we’ve done some pretty cool stuff integrating it with HP Insight Manager (most of our servers are HP) and whatnot.

  10. There is a company called HoundDog Technology which does a server monitoring tool that seems to do what you need, but it’s only sold through resellers. There might be a list of providers at their website; it’s maybe worth a look.

  11. Even though I have access to just about every enterprise-level monitoring product (Patrol, Precise, Mercury, etc.) for the some 2,000 servers in our corporate data centers, I’ve found that the simple Cacti implementation I use is actually what shows up to meetings when we have problems: for quick reference, not deep-dive issues.

    How did I do it? I set up the poller on each server as part of the build, mapped an NFS/CIFS mount (UNIX/Windows mixed environment), and wrote out simple name:value pairs for the basic metrics. Since all of the servers run it every 5 minutes and are connected to the same filer, all I do is have Cacti poll a log-reader script. It scales pretty well, since all you’re doing is reading values from a text file, not actually waiting for systems to return data.

    If a system doesn’t drop its data in time, we just ignore the value after 5 minutes (so you don’t get the same stale reading from the file each time).

    Granted, the totaling issue you have with RRD is a pain. I haven’t found a way around that either, since you can’t make those types of graphs dynamic.
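
    A hypothetical sketch of that log-reader approach (the file format, field names, and hosts are invented for illustration): each server appends timestamped name:value lines to a shared file, and the poller reads the file and drops anything older than the 5-minute window:

```python
import time

STALE_AFTER = 300  # ignore readings older than 5 minutes

def read_metrics(lines, now=None):
    """Parse 'host:metric:value:timestamp' lines from the shared filer
    file, keeping only entries that landed within the freshness window."""
    now = now if now is not None else time.time()
    fresh = {}
    for line in lines:
        host, metric, value, ts = line.strip().split(":")
        if now - int(ts) <= STALE_AFTER:
            fresh[(host, metric)] = float(value)
    return fresh

# One fresh reading and one stale one that gets silently dropped:
now = 1190000000
lines = [
    "web01:load1:0.42:%d" % (now - 60),   # 1 minute old -> kept
    "web02:load1:1.90:%d" % (now - 900),  # 15 minutes old -> ignored
]
metrics = read_metrics(lines, now=now)
print(metrics)  # -> {('web01', 'load1'): 0.42}
```

The poller never waits on a slow host; a host that misses its window simply drops out of the next reading, which matches the behaviour described above.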

  12. Damn, that’s a lot of servers. I came by the blog some time ago and it seemed like it was hacked by something. Glad to see everything is fixed, or at least appears to be.

  13. Nagios-wise, a few hints:

    – if possible, go to a fully redundant & distributed setup so you don’t have to think about it later on
    – if you run into performance issues (and you will), get rid of host checks, they’re evil (and blocking)
    – apply the status.dat patch from nagiosexchange which turns status.dat into a shared memory segment.

    Other than that, I don’t think editing around in either MediaWiki or a PHP setup is anything but a massive waste of time; as you saw, it’s error-prone too.

    I don’t have an easy solution either. For our small site I am currently trying to push the configuration data into the documentation wiki; if that doesn’t work, I’ll mess with the backend database instead. But from experience:
    if my documentation is done manually, my documentation isn’t worth anything.

  14. As far as I know, Munin creates graphs. Maybe you can get rid of it and just use Nagios to collect all the data, then use a tool like PNP or NagiosGrapher to create graphs from performance data on the Nagios server. This worked well for us with more than 300 servers. NagiosGrapher can also run on a dedicated box if the system needs more power.

  15. Barry,

    I am building a small open-source application to manage IT assets (machines, IP addresses, vendors, etc.).

    This database can be used for asset accounting or for hooking up to monitoring systems (Nagios, JFFNMS, firewall, etc.).

    Could we have a small look at what you have in the wiki (after garbling the data) to get an idea of how you see this data?

    This will help me build a similar view.

    Thanks.

  16. I *highly* recommend using Hyperic HQ. http://www.hyperic.com. Over at Yahoo we were asked to use Nagios, and we had for some time until someone showed me Hyperic and I pissed myself at how easy it was to set up and how much better it was than Nagios. They have an open source version, we are using it in multiple datacenters, it ties into Postgres, Oracle, or MySQL for the backend store, and we use it to monitor over 100 servers in multiple data centers for the property I work on. I also got a quote from them about a company that uses it to monitor 1,000+ servers, collecting 44,000 metrics a minute on a single server! Trust me, after using this I will never use Nagios.

  17. We use rrdtools with our own custom interface. Data storage is RRD on memory disk and each datacenter has a local collector and each sends data back to the reporter periodically (3-tier design)

  18. 300 hosts to serve only 10 million pageviews? This is a little bloated, isn’t it?

    I know of a case where only 2 hosts serve more than 350 million pageviews!!

    About MediaWiki, I’m sure there are OSS options to document hardware, and possibly something automated to use a database.

  19. Hey Barry, we had the exact same challenge with surprisingly similar parameters: ~300 servers, multiple datacenters, Nagios, and a wiki. In the end we opted to continue using Nagios for monitoring, and we wrote our own inventory system. The goal of this system was to be both inventory AND configuration management, i.e. flexible enough to let us define the inventory of our systems and software, and then autogenerate configuration files for Nagios/cfengine/pound/keepalived as well as XML files which could be used in software builds that needed to know infrastructure details (like server IPs, database access details, URLs, etc.).

    We open sourced it (MIT) and released the first version almost a year ago: http://nventory.sf.net

    It doesn’t do it all yet, but it’s a pretty handy and flexible inventory manager now, and the base is there for adding config management on top of it.

  20. Hello Barry,

    We use Nagios for 250 hosts and 1000 services on one server and graph everything with PNP (and the npcd daemon), with no load problems. We also use MediaWiki for the documentation, with a link from Nagios (notes_url) to MediaWiki: each project has a troubleshooting page on the wiki, and the notes_url points to the troubleshooting page for the project. For the inventory, we use OCS Inventory; it works fine and it is dynamic, which is far better than a static list on a wiki.

    With a good configuration (I mean templates), Nagios 3 is really simple to use (v3 adds cool features that make configuration much easier). For example, adding a host with a “linux-template” profile automatically creates the default service checks for Linux (load, swap, disk), and the same goes for other equipment. Templating is the key…

    Nagios rocks !

    For Hyperic and that kind of stuff, I don’t want to pay for the “enterprise” version to get what I need. (Fully) free software is the best choice, and that’s where you’ll find a community (help, advice, contributions, and so on…). These dual-license projects suck.

  21. I’ve seen Zenoss and Hyperic — but for what you guys do, have you looked into real user monitoring that gives you the end to end as well as server localization?

    It’s more of a top-down than “needle in the haystack” although needle-searching is valuable and more manageable with smaller haystacks.

    I’d love to know what goes on running something like WP; it sounds a tad anxious-making. I’m fascinated by the stats for our blogs, and curious about anyone who will divulge how they work. For instance, how long do I need to hover my cursor over a post for it to register as a view?

    Is it possible to go to a blog so that the views go up on the posts looked at, but there is no trace of what the viewer’s URL is, i.e. invisible hitters? “Hits” is such an unfortunate term!
    All the best

  23. For the I/O issue, we created a tmpfs drive in RAM to store the RRD files and the images/files. We then rsync that data every 20 minutes to a local drive for storage/backup. We are graphing about 8,219 items on our network. The only tradeoff is that you have to make sure you keep it backed up, and stop the Munin cron and sync to disk before any reboot of the server.

    1. Yep, we did this too, but we quickly became CPU-bound. We are currently processing about 300 servers with about 20 data points per machine per Munin server.

  24. For server inventory (and, very often, The Truth): http://www.racktables.org. Clean, simple, customizable… it holds at least 700 pieces of equipment without any problems.

    Nagios/Munin and other tools that can be configured the Unix way (text files) can easily be sharded into multiple instances, for example odd machines on nagios1, even machines on nagios2, and so on. For sharding-formula inspiration you can check MySQL’s master-to-master replication examples.
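
    The odd/even sharding idea can be sketched like this (a hypothetical illustration; the shard and host names are invented). Hashing the hostname gives each host a stable home on exactly one monitoring instance:

```python
import zlib

SHARDS = ["nagios1", "nagios2"]  # add more instances as the fleet grows

def shard_for(hostname, shards=SHARDS):
    """Map a hostname to one monitoring shard. crc32 is stable across
    runs and machines (unlike Python's salted hash()), so a host always
    lands on the same instance."""
    return shards[zlib.crc32(hostname.encode()) % len(shards)]

hosts = ["web01", "web02", "db01", "db02"]
assignment = {h: shard_for(h) for h in hosts}
print(assignment)
```

Each shard's Nagios/Munin config can then be generated from the inventory by filtering on `shard_for(host)`, so rebalancing is just a matter of changing the shard list and regenerating the text files.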

    Alternatively, you can distribute your trend monitoring instead of scaling in the center: run Munin on each server and let it monitor itself, and install a lightweight web server for presenting the graphs. It would pose a problem with aggregated data points, though.

    We are currently using Munin and it’s getting really slow, and because the graph presentation is pretty lame we will probably move to collectd and build our own rrdgraph thing.

  25. Munin uses RRDtool, so it might make sense to ask the project if they have integrated rrdcached, or you can use it yourself: http://oss.oetiker.ch/rrdtool/doc/rrdcached.en.html

    rrdcached is specific to RRD 1.4.x series.

    “The RRD Caching Daemon can dramatically improve the ‘update’ performance of your system. Due to file handling overheads, the time it takes to do one update is virtually the same as to do two updates in a row.

    The Cache Daemon intercepts rrdtool update calls, assembling multiple updates before writing them to the actual rrd file. When calling rrdtool graph in such a setup, the command will tell the daemon to flush out all pending updates for the rrd files required to draw the graph.”
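
    The batching behaviour the docs describe can be sketched in a few lines (a hypothetical illustration of the idea, not rrdcached’s actual implementation): updates accumulate in memory per file, and graphing a file forces a flush of its pending batch first:

```python
from collections import defaultdict

class CachingWriter:
    """Toy model of the rrdcached idea: batch updates in memory and only
    hit "disk" once per batch, amortizing the per-file write overhead."""
    def __init__(self):
        self.pending = defaultdict(list)  # updates not yet written
        self.written = defaultdict(list)  # stand-in for the rrd file

    def update(self, rrd_file, value):
        self.pending[rrd_file].append(value)  # cheap in-memory append

    def flush(self, rrd_file):
        # One write per batch instead of one write per update.
        self.written[rrd_file].extend(self.pending.pop(rrd_file, []))

    def graph(self, rrd_file):
        self.flush(rrd_file)  # graphing forces a flush of pending data
        return self.written[rrd_file]

w = CachingWriter()
for v in (1, 2, 3):
    w.update("cpu.rrd", v)   # three updates, zero "disk" writes so far
print(w.graph("cpu.rrd"))    # -> [1, 2, 3]
```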
