Keeping track of 300 servers

Since WordPress.com broke 10 million pageviews today, I thought it would be a good time to talk a little bit about keeping track of all the servers that run WordPress.com, Akismet, WordPress.org, Ping-o-matic, etc. Currently we have over 300 servers online in 5 different data centers across the country. Some of these are colocated, and others are with dedicated hosting providers, but the bottom line is that we need to keep track of them all as if they were our children! Currently we use Nagios for server health monitoring, Munin for graphing various server metrics, and a wiki to keep track of all the server hardware specs, IPs, vendor IDs, etc. All of these tools have suited us well up until now, but there have been some scaling issues.

  • MediaWiki — Like Wikipedia, we have a MediaWiki page with a table that contains all of our server information, from hardware configuration to physical location, price, and IP information. Unfortunately, MediaWiki tables don’t seem to be very flexible, and you cannot perform row- or column-based operations. This makes simple things, such as counting how many servers we have, somewhat time consuming. Also, once you get to 300 rows, editing the table becomes very tedious, and it is very easy to make a mistake that throws the entire table out of whack. Even dividing the data into a few tables doesn’t make it much easier. In addition, there is no concept of unique records (nor do I really think there should be), so it is very easy to end up with 2 servers listed with the same IP or the same hostname.
  • Munin — Munin has become an invaluable tool for us when troubleshooting issues and planning future server expansion. Unfortunately, scaling Munin hasn’t been the best experience. At about 100 hosts, we started running into disk I/O problems caused by the various data collection, graphing, and HTML output jobs Munin runs. It seemed the solution was to switch to the JIT graphing model, which only draws the graphs when you view them. Unfortunately, this only made the interface excruciatingly slow and didn’t help the I/O problems we were having. At about 150 hosts we moved Munin to a dedicated server with 15k RPM SCSI drives in a RAID 0 array in an attempt to give it some more breathing room. That worked for a while, but then we started running into problems where polling all the hosts actually took longer than the monitoring interval, with the result that we were missing some data. Since then, we have resorted to removing some of the things we graph on each server in order to lighten the load. Every once in a while, we still run into problems where a server is a little slow to respond and causes the polling to take longer than 5 minutes. Obviously, better hardware and reducing graphed items isn’t a scalable solution, so something is going to have to change. We could put a Munin monitoring server in each datacenter, but we currently sum and stack graphs across datacenters, and I am not sure if/how that works when the data is on different servers. The other big problem I see with Munin is that if one host’s graphs stop updating and that host was part of a totals graph, the totals graph will just stop working. This happened today — very frustrating.
  • Nagios — I feel this has scaled the best of the 3. We have this running on a relatively light server and have no load or scheduling issues. I think it is time, however, to look at moving to Nagios’ distributed monitoring model. The main reason for this is that since we have multiple datacenters, each of which has its own private network, it is important for us to monitor each of these networks independently in addition to the public internet connectivity to each datacenter. The simplest way to do this is to put a Nagios monitoring node in each data center, which can then monitor all the servers in that facility and report the results back to the central monitoring server. Splitting up the workload should also allow us to scale to thousands of hosts without any problems.
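
The distributed model described above can be sketched as follows; this is a hypothetical illustration, not a description of our actual setup, and the host and service names are made up. Nagios accepts passive check results through its external command file in the format `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`, so a per-datacenter node can run the checks locally and ship only the results to the central server:

```python
import time

def passive_result(host, service, return_code, output, ts=None):
    """Format one passive service-check result line for the Nagios
    external command file. A per-datacenter node would run the check
    locally and submit only this line (e.g. via NSCA) to the central
    server."""
    ts = int(ts if ts is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, return_code, output)

# Hypothetical example: a node in one datacenter reporting an HTTP check.
line = passive_result("web042", "HTTP", 0, "HTTP OK - 0.042s response",
                      ts=1190000000)
print(line)
# -> [1190000000] PROCESS_SERVICE_CHECK_RESULT;web042;HTTP;0;HTTP OK - 0.042s response
```

The central server then only has to process result lines instead of running every check itself, which is what lets the model scale across datacenters.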

Anyone have recommendations on how to better deal with these basic server monitoring needs? I have looked at Zabbix, Cacti, Ganglia, and some others in the past, but have never been super-impressed. Barring any major revelations in the next couple of weeks, I think we are going to continue to scale out Nagios and Munin and replace the wiki page with a simple PHP/MySQL application that is flexible enough to integrate into our configuration management and deploy tools.
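
The duplicate-records problem the wiki can't solve is exactly what database constraints give you. Here is a minimal sketch of the idea, with SQLite standing in for MySQL and all column, host, and datacenter names invented for illustration:

```python
import sqlite3

# Hypothetical inventory schema: UNIQUE constraints mean two servers can
# never share a hostname or an IP, and counting becomes a one-line query.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE servers (
        id         INTEGER PRIMARY KEY,
        hostname   TEXT NOT NULL UNIQUE,
        ip         TEXT NOT NULL UNIQUE,
        datacenter TEXT NOT NULL,
        hw_config  TEXT,
        vendor_id  TEXT
    )""")
db.execute("INSERT INTO servers (hostname, ip, datacenter) "
           "VALUES ('web01', '10.0.1.5', 'dc1')")
db.execute("INSERT INTO servers (hostname, ip, datacenter) "
           "VALUES ('web02', '10.0.1.6', 'dc2')")

# A duplicate IP is rejected instead of silently becoming a second record.
try:
    db.execute("INSERT INTO servers (hostname, ip, datacenter) "
               "VALUES ('web03', '10.0.1.5', 'dc1')")
except sqlite3.IntegrityError:
    print("duplicate IP rejected")

# Counting servers (total, or per datacenter) is trivial next to a wiki table.
count = db.execute("SELECT COUNT(*) FROM servers").fetchone()[0]
print(count)  # -> 2
```

The same schema can then feed config management and deploy tools directly, which a wiki table never could.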

Author: Barry


59 thoughts on “Keeping track of 300 servers”

  1. I’ve been told by a fellow co-op member that Zenoss is worth looking at as an alternative to Nagios. Apparently it scales better. Haven’t tried it myself, but it might be worth considering.

  2. How are these 300 servers approximately distributed over the different services (WordPress.com, Akismet, WordPress.org, Ping-o-matic)? I’d expect 95% of these would be for WordPress.com. Or does Akismet ‘consume’ a lot of servers (given the huge number of spam messages it has to process)?

  3. Yes, the majority of the servers power WordPress.com. Although Akismet deals with millions of spam messages per day, the infrastructure required to run the service is much smaller than that of WordPress.com.

  4. Barry,

    For the Munin monitoring problem: I assume Munin uses RRD, so the disk I/O issue is a common problem. One way we’ve gotten around it (we use collectd rather than Munin, but it doesn’t matter) is to collect the stats at each datacenter and rsync the RRDs to a central “graphing server.” This has a number of other benefits as well, and helps you scale out while still letting you pull data from multiple RRDs at the same time to do complex graphs.

  5. I used to work with a team managing the same sized network, and we used a combination of nagios and ganglia for monitoring.

    Ganglia uses RRD for data storage. Our solution to the performance issue was to store the rrd files on a ramdisk. You could have a cron job copy the files to disk every few minutes to minimize data loss on a power failure or reboot.

    I later used the cacti package to manage about 40 hosts/routers, and I found that it was very easy to use and setup, but it didn’t scale well at all (interface wise).

    One question I have is what kind of naming conventions you guys use for your servers.

    Finally, Tim’s suggestion of using DabbleDB is a great one. It’s a phenomenal tool.

  6. When I was at Rackspace we used OpenNMS to power our monitoring services and I wasn’t a big fan. That was a while ago and I am sure it has gotten better, so I will definitely take a look.

  7. “The other big problem I see with munin is that if one host’s graphs stop updating and that host was part of a totals graph, the totals graph will just stop working.”

    This is because the value for that data point is UN (unknown), so anything plus UN is UN and your graph will have gaps.

    Not sure how to go about fixing this within the Munin framework as I’m not familiar with it, but if you can get at the RRD definition for the graph you can fix it.

    If the data point is A then do this RPN to test for UN:

    CDEF:B=A,UN,0,A,IF

    Now B is either the real data or zero instead of unknown.

    HTH.
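
    To make the RPN above concrete, here is a hypothetical Python sketch of the same logic. `A,UN,0,A,IF` evaluates to 0 when A is unknown and to A otherwise, which is exactly what keeps a totals graph alive when one host goes dark (unknown is modeled as None here, and the sample values are invented):

```python
def substitute_unknown(value, fallback=0):
    """Python equivalent of the RRD CDEF  B = A,UN,0,A,IF :
    if A is unknown, use the fallback; otherwise use A."""
    return fallback if value is None else value

# Per-host requests/sec, where one host's data has gone stale (unknown):
samples = {"web01": 120.0, "web02": None, "web03": 95.5}

# Without the CDEF, anything + unknown = unknown and the totals graph gaps;
# with it, the dead host simply contributes zero to the total.
total = sum(substitute_unknown(v) for v in samples.values())
print(total)  # -> 215.5
```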

  8. I’ve always been a big fan of nagios and munin. I run a small set of servers and they do the job. Plus ntop is always fun but not really what you’re talking about :p

    I too tried cacti, and it seems like it would be great, but after spending a day getting it to work, I just gave up.

    You might try just writing your own rrdtool programs. You may be able to optimize the statistics you want more easily that way.

  9. We use the product Orion by SolarWinds (www.solarwinds.net) for all our monitoring. It’s not free, but all my coworkers and I feel it’s definitely worth the price. Thus far, no scaling issues whatsoever. (I believe we’re monitoring around the same number of servers.)

    We also have a similar Wiki setup, except we use DokuWiki. Can’t sing enough praises about DW… and we’ve done some pretty cool stuff integrating it with HP Insight Manager (most of our servers are HP) and whatnot.

  10. There is a company called HoundDog Technology which does a server monitoring tool that seems to do what you need, but it’s only sold through resellers. There might be a list of providers at their website; it’s maybe worth a look.

  11. Even though I have access to just about every enterprise-level monitoring product (Patrol, Precise, Mercury, etc.) for the some 2,000 servers in our corporate data centers, I’ve found that the simple Cacti implementation I use is actually what shows up to meetings when we have problems: for quick reference, not deep-dive issues.

    How did I do it? I set up the poller on each server as part of the build, mapped an NFS/CIFS mount (UNIX/Windows mixed environment), and wrote out simple name:value pairs for the basic metrics. Since all of the servers run it every 5 minutes and are connected to the same filer, all I do is have Cacti poll a log-reader script. It scales pretty well, since all you’re doing is reading values from a text file, not actually waiting for systems to return data.

    If a system doesn’t drop its data in time, we just ignore the value after 5 minutes (so you don’t get the same stale reading from the file each time).

    Granted, the totaling issue you have with RRD is a pain. I haven’t found a way around that either, since you can’t make those types of graphs dynamic.
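
    A hypothetical sketch of that log-reader approach (the file format, field names, and hosts are invented for illustration): each server appends timestamped name:value lines to a shared file, and the poller reads the file and drops anything older than the 5-minute window:

```python
import time

STALE_AFTER = 300  # ignore readings older than 5 minutes

def read_metrics(lines, now=None):
    """Parse 'host:metric:value:timestamp' lines from the shared filer
    file, keeping only entries that landed within the freshness window."""
    now = now if now is not None else time.time()
    fresh = {}
    for line in lines:
        host, metric, value, ts = line.strip().split(":")
        if now - int(ts) <= STALE_AFTER:
            fresh[(host, metric)] = float(value)
    return fresh

# One fresh reading and one stale one that gets silently dropped:
now = 1190000000
lines = [
    "web01:load1:0.42:%d" % (now - 60),   # 1 minute old -> kept
    "web02:load1:1.90:%d" % (now - 900),  # 15 minutes old -> ignored
]
metrics = read_metrics(lines, now=now)
print(metrics)  # -> {('web01', 'load1'): 0.42}
```

The poller never waits on a slow host; a host that misses its window simply drops out of the next reading, which matches the behaviour described above.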

  12. Damn, that’s a lot of servers. I came by the blog some time ago and it seemed like it was hacked by something. Glad to see everything is fixed, or at least appears to be.

  13. Nagios-wise, a few hints:

    – if possible, go to a fully redundant & distributed setup so you don’t have to think about it later on
    – if you run into performance issues (and you will), get rid of host checks, they’re evil (and blocking)
    – apply the status.dat patch from nagiosexchange which turns status.dat into a shared memory segment.

    Other than that, I don’t think editing around in either MediaWiki or a PHP setup is anything but a massive waste of time; as you saw, it’s error-prone too.

    I don’t have an easy solution either. For our small site I am currently trying to push the configuration data into the documentation wiki; if that doesn’t work, I’ll mess with the backend database instead. But from experience:
    if my documentation is done manually, my documentation isn’t worth anything.

  14. As far as I know, Munin creates graphs. Maybe you can get rid of it and just use Nagios to collect all the data, then use a tool like PNP or NagiosGrapher to create graphs from performance data on the Nagios server. This worked well for us with more than 300 servers. NagiosGrapher can also run on a dedicated box if the system needs more power.

  15. Barry,

    I am building a small open-source application to manage IT assets (machines, IP addresses, vendors, etc.).

    This database can be used for asset accounting or for hooking up to monitoring systems (Nagios, JFFNMS, firewall, etc.).

    Could we have a small look at what you have in the wiki (after garbling the data) to get an idea of how you see this data?

    This will help me build a similar view.

    Thanks.

  16. I *highly* recommend using Hyperic HQ. http://www.hyperic.com. Over at Yahoo we were asked to use Nagios, and we had for some time until someone showed me Hyperic and I pissed myself at how easy it was to set up and how much better it was than Nagios. They have an open source version, we are using it in multiple datacenters, it ties into Postgres, Oracle, or MySQL for the backend store, and we use it to monitor over 100 servers in multiple data centers for the property I work on. I also got a quote from them about a company that uses it to monitor 1,000+ servers, collecting 44,000 metrics a minute on a single server! Trust me, after using this I will never use Nagios.

  17. We use rrdtools with our own custom interface. Data storage is RRD on memory disk and each datacenter has a local collector and each sends data back to the reporter periodically (3-tier design)

  18. 300 hosts to serve only 10 million pageviews? This is a little bloated, isn’t it?

    I know of a case where only 2 hosts serve more than 350 million pageviews!!

    About MediaWiki, I’m sure there are OSS options to document hardware, and possibly something automated to use a database.

  19. Hey Barry, we had the exact same challenge with surprisingly similar parameters: ~300 servers, multiple datacenters, Nagios, and a wiki. In the end we opted to continue using Nagios for monitoring, and we wrote our own inventory system. The goal of this system was to be both inventory AND configuration management, i.e. flexible enough to let us define the inventory of our systems and software, and then autogenerate configuration files for Nagios/cfengine/pound/keepalived as well as XML files which could be used in software builds that needed to know infrastructure details (like server IPs, database access details, URLs, etc.).

    We open sourced it (MIT) and released the first version almost a year ago: http://nventory.sf.net

    It doesn’t do it all yet, but it’s a pretty handy and flexible inventory manager now, and the base is there for adding config management on top of it.

  20. Hello Barry,

    We use Nagios for 250 hosts and 1000 services on one server and graph everything with PNP (and the npcd daemon), with no load problems. We also use MediaWiki for the documentation, with a link from Nagios (notes_url) to MediaWiki: each project has a troubleshooting page on the wiki, and the notes_url points to the troubleshooting page for the project. For the inventory, we use OCS Inventory; it works fine and it is dynamic, which is far better than a static list on a wiki.

    With a good configuration (I mean templates), Nagios 3 is really simple to use (v3 adds cool features that make configuration much easier). For example, adding a host with a “linux-template” profile automatically creates the default service checks for Linux (load, swap, disk), and the same goes for other equipment. Templating is the key…

    Nagios rocks !

    For Hyperic and that kind of stuff, I don’t want to pay for the “enterprise” version to get what I need. (Fully) free software is the best choice, and that’s where you’ll find a community (help, advice, contributions, and so on…). These dual-license projects suck.

  21. I’ve seen Zenoss and Hyperic — but for what you guys do, have you looked into real user monitoring that gives you the end to end as well as server localization?

    It’s more of a top-down than “needle in the haystack” although needle-searching is valuable and more manageable with smaller haystacks.

    I’d love to know what goes on running something like WP; it sounds a tad anxious-making. I’m fascinated by the stats for our blogs, and curious about anyone who will divulge how they work. For instance, how long do I need to hover my cursor over a post for it to register as a view?

    Is it possible to go to a blog so that the views go up on the posts looked at, but there is no trace of what the viewer’s URL is, i.e. invisible hitters? “Hits” is such an unfortunate term!
    All the best

  23. For the I/O issue, we created a tmpfs drive in RAM to store the RRD files and the images/files. We then rsync that data every 20 minutes to a local drive for storage/backup. We are graphing about 8,219 items on our network. The only tradeoff is that you have to make sure you keep it backed up, and stop the Munin cron and sync to disk before any reboot of the server.

    1. Yep, we did this too, but we quickly became CPU-bound. We are currently processing about 300 servers with about 20 data points per machine per Munin server.

  24. For server inventory (and, very often, The Truth): http://www.racktables.org. Clean, simple, customizable… it holds at least 700 pieces of equipment without any problems.

    Nagios/Munin and other tools that can be configured the Unix way (text files) can easily be sharded into multiple instances, for example odd machines on nagios1, even machines on nagios2, and so on. For sharding-formula inspiration you can check MySQL’s master-to-master replication examples.
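
    The odd/even sharding idea can be sketched like this (a hypothetical illustration; the shard and host names are invented). Hashing the hostname gives each host a stable home on exactly one monitoring instance:

```python
import zlib

SHARDS = ["nagios1", "nagios2"]  # add more instances as the fleet grows

def shard_for(hostname, shards=SHARDS):
    """Map a hostname to one monitoring shard. crc32 is stable across
    runs and machines (unlike Python's salted hash()), so a host always
    lands on the same instance."""
    return shards[zlib.crc32(hostname.encode()) % len(shards)]

hosts = ["web01", "web02", "db01", "db02"]
assignment = {h: shard_for(h) for h in hosts}
print(assignment)
```

Each shard's Nagios/Munin config can then be generated from the inventory by filtering on `shard_for(host)`, so rebalancing is just a matter of changing the shard list and regenerating the text files.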

    Alternatively, you can distribute your trend monitoring instead of scaling in the center: run Munin on each server and let it monitor itself, and install a lightweight web server for presenting the graphs. It would pose a problem with aggregated data points, though.

    We are currently using Munin and it’s getting really slow, and because the graph presentation is pretty lame we will probably move to collectd and build our own rrdgraph thing.

  25. Munin uses RRDtool, so it might make sense to ask the project if they have integrated rrdcached, or you can use it yourself: http://oss.oetiker.ch/rrdtool/doc/rrdcached.en.html

    rrdcached is specific to RRD 1.4.x series.

    “The RRD Caching Daemon can dramatically improve the ‘update’ performance of your system. Due to file handling overheads, the time it takes to do one update is virtually the same as to do two updates in a row.

    The Cache Daemon intercepts rrdtool update calls, assembling multiple updates before writing them to the actual rrd file. When calling rrdtool graph in such a setup, the command will tell the daemon to flush out all pending updates for the rrd files required to draw the graph.”
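
    The batching behaviour the docs describe can be sketched in a few lines (a hypothetical illustration of the idea, not rrdcached’s actual implementation): updates accumulate in memory per file, and graphing a file forces a flush of its pending batch first:

```python
from collections import defaultdict

class CachingWriter:
    """Toy model of the rrdcached idea: batch updates in memory and only
    hit "disk" once per batch, amortizing the per-file write overhead."""
    def __init__(self):
        self.pending = defaultdict(list)  # updates not yet written
        self.written = defaultdict(list)  # stand-in for the rrd file

    def update(self, rrd_file, value):
        self.pending[rrd_file].append(value)  # cheap in-memory append

    def flush(self, rrd_file):
        # One write per batch instead of one write per update.
        self.written[rrd_file].extend(self.pending.pop(rrd_file, []))

    def graph(self, rrd_file):
        self.flush(rrd_file)  # graphing forces a flush of pending data
        return self.written[rrd_file]

w = CachingWriter()
for v in (1, 2, 3):
    w.update("cpu.rrd", v)   # three updates, zero "disk" writes so far
print(w.graph("cpu.rrd"))    # -> [1, 2, 3]
```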
