Making Gravatar fast again

As Matt blogged, Automattic recently purchased Gravatar. The first thing we did was move the service onto our infrastructure. Since the application is quite different from what we normally run, what this really means in practice is using what we have learned from scaling our other services to increase both the speed and reliability of Gravatar, as well as leveraging our existing hardware and network infrastructure to stabilize the service. The current infrastructure is laid out as follows:

  • 2 application servers (in 2 different data centers for redundancy). One of these servers primarily handles the main Gravatar website, which is Ruby on Rails, while the other serves the images themselves. If either of these servers or data centers were to fail, we could easily switch things around to work around the outage.
  • 2 cache servers (1 in each datacenter). These servers are running Varnish. They cache requested images for a period of 10 minutes, so frequently requested images are not repeatedly requested from the application servers. We are seeing about a 65% cache hit rate and about 1000 requests/second at peak times, although as adoption of the service increases, we expect this number to go up significantly. A single server running Varnish can serve many thousands of requests/sec. The amount of data we are caching is small enough to fit in RAM, so disk I/O is not currently an issue.
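The 10-minute expiry described above takes only a few lines of Varnish configuration. As a rough sketch (the actual VCL is not shown in the post, and the backend name and address here are made up), using the Varnish 1.x-era syntax that was current at the time:

```vcl
backend gravatar_app {
    # Hypothetical backend: the application server that renders avatar images.
    set backend.host = "10.0.0.10";
    set backend.port = "80";
}

sub vcl_fetch {
    # Cache every fetched image for 10 minutes, matching the expiry
    # described in the post, regardless of backend Cache-Control headers.
    set obj.ttl = 10m;
    deliver;
}
```

With a short fixed TTL like this, no explicit cache invalidation is needed: a changed avatar simply shows up once the old copy expires.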

On the hardware side, for those of you who are curious, we are using HP DL365s for the application servers and HP DL145s for the caching servers, with 4GB of RAM and 2 x AMD Opteron 2218s all around. The application servers have 4 x 73GB 15k SAS drives in RAID 5, while the caching servers have single 80GB SATA drives. We use the same hardware configurations extensively elsewhere and they work well.

Previously, the service used Apache2 + Mongrel to serve the main site and lighttpd + mod_magnet to serve the images. We decided to simplify this, and we are currently using lighttpd to serve everything; it is working well for the most part. We do seem to have a memory usage issue with lighttpd, which may be related to this long-standing bug. For now, we are monitoring the application's memory usage with monit and restarting the service before it gets too high.
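A monit rule for this kind of memory watchdog might look like the following sketch; the PID file, init script paths, and the memory threshold are illustrative assumptions, not the actual production config:

```
check process lighttpd with pidfile /var/run/lighttpd.pid
    start program = "/etc/init.d/lighttpd start"
    stop program  = "/etc/init.d/lighttpd stop"
    # Restart before memory usage grows too large; the 500 MB limit
    # and 2-cycle window here are guesses for illustration.
    if totalmem > 500 MB for 2 cycles then restart
```

This trades a brief restart every so often for not having the process eat all the RAM on the box, which is a reasonable stopgap until the underlying leak is fixed.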

12 responses to “Making Gravatar fast again”

  1. Great stuff, Barry. I’m surprised, though, that the working set of images can fit into RAM on your varnish boxes. I would have guessed it would be bigger.

    Since varnish now has at least an LRU eviction policy, what would stop you from lifting the 10 minute expiry and just caching everything ‘forever’?

  2. I assume you are getting the stats from ‘varnishstat’. Are you using homemade scripts to save that information and graph it?

  3. […] moved Gravatar to their infrastructure which has gone well, the blog High Scalability pointed out Making Gravatar Fast Again. Cool stuff, and will help them avoid “crashing hard” moments. The Gravatar article […]

  4. […] Making Gravatar fast again | Barry on WordPress – Details from Barry on what Automattic is using to run Gravatar. Tags: gravatar […]

  5. John,

    The working set is only about 1GB (lucky us!)

    Having an infinite expiry would require cache invalidation which is not something that we are currently doing on Gravatar.

  6. James,

    We are using Munin to graph the data. There are Varnish graphing plugins available.

  7. […] reveals some of the details behind what powers Gravatar […]

  8. Hi, interesting post.

    So, you have a cache & application server in each of your datacenters.

    Do you manually fail over to the other one, by changing the DNS entry, if one datacenter is in trouble?

    Or do you have something more automatic?

  9. Bruno,

    Currently it is manual because the automatic failover portion is not complete yet, but it should be finished this week. The basic idea is that each datacenter will also have a server that answers DNS requests, and each datacenter’s DNS server will only return its own IP when queried. Additional monitoring scripts check the application and cache servers to make sure everything is functioning normally, and stop the DNS service if it is not. In the case of a server or datacenter outage, the IP of the failed node will not be returned via DNS, so traffic will automatically fail over to the other location. There are DNS TTLs to deal with, but they can be set very low so the impact is minimal.
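    To make the scheme concrete, here is a small Python sketch of the decision logic described above. The health flags, function names, and example IPs are hypothetical; this illustrates the idea, not the actual monitoring scripts:

    ```python
    def dns_should_serve(app_healthy: bool, cache_healthy: bool) -> bool:
        """A datacenter's DNS server keeps answering only while its own
        application and cache servers pass their health checks."""
        return app_healthy and cache_healthy

    def reachable_ips(datacenters: dict) -> list:
        """IPs a resolver can obtain: each DNS server returns only its own
        datacenter's IP, and is stopped when that datacenter is unhealthy."""
        return [
            dc["ip"]
            for dc in datacenters.values()
            if dns_should_serve(dc["app_ok"], dc["cache_ok"])
        ]

    # Example: if datacenter B's cache server fails, its DNS server is
    # stopped and only datacenter A's IP is handed out to clients.
    dcs = {
        "A": {"ip": "192.0.2.1", "app_ok": True, "cache_ok": True},
        "B": {"ip": "198.51.100.1", "app_ok": True, "cache_ok": False},
    }
    print(reachable_ips(dcs))  # ['192.0.2.1']
    ```

    Once B recovers and its DNS service is started again, resolvers naturally pick its IP back up after the (deliberately short) TTL expires.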

    I will probably write a separate post on this setup with more details once it’s complete.

  10. The IT world is pretty lucky that you put up the architecture and strategies that you use, Barry, on all these different setups! Sometimes though I think you might need a glossary of terms for us people who don’t speak wrangler, but great stuff, nonetheless!

  11. Now we just need the rest of the world to adopt gravatar as a standard and we’re happy 😉
