HyperDB Replication Lag Detection

Howdy – Iliya here again. Seems like I am taking over Barry’s blog. Hopefully this will motivate him to blog more.

On WordPress.com we have over 218 million tables and perform tens of thousands queries per second. To scale all of this, we shard our 24 million blogs across more than 550 MySQL servers. This allows us to cope with load bursts and to handle database servers failures.

For those who are unfamiliar, MySQL data replication is asynchronous and works as follows:

  1. [Master] Receives a query that modifies database structure or content (INSERT, UPDATE, ALTER etc.)
  2. [Master] The query is written to a log file (aka the binlog).
  3. [Master] The query is executed on the master.
  4. [Slaves] Create a “Slave I/O” thread that connects to the [Master] and requests all new queries from the mater’s binlog.
  5. [Master] Creates a “Binlog dump” thread for each connected slave, that reads the requested events from the binlog and sends them to the slave.
  6. [Slaves] Start a “Slave SQL” thread which reads queries from the log file written by the “Slave I/O” thread and executes them

There are a number of things to be considered in this scenario, which can lead to a condition known as replication lag where the slaves have older data then the master:

  • Since only one thread on the slave executes write queries, and there are many execution threads on the master, there is no guarantee that the slave will be able to execute queries with the same speed as the master.
  • Long running SELECTs or explicit locks on the slave, will cause the “Slave SQL” thread to wait, thus slowing it down.
  • Long running queries on the master would take at least the same amount of time to run on the slave, causing it to fall behind the master
  • I/O (disk or network) issues can prevent or slow down the slave from reading and replaying the binlog events

In order to deal with this, we needed a way to avoid connections to lagged slaves as long as there are slaves that are current. This would allow for the lagged ones to recover faster and avoid returning old data to our users. It also had to be something flexible enough, so we could have different settings for acceptable replication lag per dataset or stop tracking it altogether. Since we use the advanced database class, HyperDB, for all our database connections, it was the obvious place to integrate this.

We implemented it  in the following steps:

  • If a connection modifies data in a given table, then all subsequent SELECTs on the same connection for that table are sent to the master. Chances are replication won’t be fast enough to propagate the changes to the slaves on the same page load.  This logic has existed in HyperDB for a while.
  • Before we make a connection to a slave, we use a callback, to check if we have information for this slave’s lag in the cache and we skip it based on that, unless all slaves in the dataset are considered lagged.  In case replication breaks on all slaves, we would rather return old data then overload the master with read queries and cause an outage.
  • After a successful connection to a slave, if there was nothing in the cache regarding its lag status and not all slaves are considered lagged, we execute a second callback that checks whether this slave is lagged and updates the cache.

A slave is considered lagged when it has a “lag threshold” defined in it’s dataset configuration and the current lag is more than this threshold.

We considered the following options for checking if a slave is lagged.  No MySQL patches are required for any of them:

  • Checking the value of Seconds_Behind_Master from the SHOW SLAVE STATUS statement executed on the slave. It shows the difference between the timestamp of the currently executed query and the latest query we have received from the master. Although it is easy to implement and has low overhead, the main problem with using this option is that it is not completely reliable, as it can be tricked by IO latency and/or master connection problems.
  • Tracking the “File” and “Position” on SHOW MASTER STATUS executed on the master and comparing it to Relay_Master_Log_File and Exec_Master_Log_Pos of SHOW SLAVE STATUS on the slave. This way we can wait until the slave executes the queries from binlog “file” and position “position” before send certain queries to that slave and thus effectively we wait for the data to be replicated to the point where we need it. While very reliable, this option is more complex, has lots of overhead and doesn’t give us clock time value which we can track and set between servers.
  • Tracking the difference between the current time on the slave and the replication of a timestamp update from the master, which runs every second. This is basically what mk-heartbeat does. It requires proper time sync between the master and the slave servers but is otherwise very reliable.

The third option fit our needs best, however the code is flexible enough to easily support any of these. For caching, we decided to go with memcached, since it works well in our distributed, multi-server, multi-datacenter environment, but other methods (APC cache, shared memory, custom daemon etc.) would work just fine.

HyperDB is free, open-source and easy to integrate in your WordPress installation. You can download it here.  We hope you enjoy this new functionality and please let us know if you have any questions in the comments.

Load Balancer Update

A while back, I posted about some testing we were doing of various software load balancers for WordPress.com.  We chose to use Pound and have been using it past 2-ish years.  We started to run into some issues, however, so we starting looking elsewhere.  Some of these problems were:

  • Lack of true configuration reload support made managing our 20+ load balancers cumbersome.  We had a solution (hack) in place, but it was getting to be a pain.
  • When something would break on the backend and cause 20-50k connections to pile up, the thread creation would cause huge load spikes and sometimes render the servers useless.
  • As we started to push 700-1000 requests per second per load balancer, it seemed things started to slow down.  Hard to get quantitative data on this because page load times are dependent on so many things.

So…  A couple weeks ago we finished converting all our load balancers to Nginx.  We have been using Nginx for Gravatar for a few months and have been impressed by its performance, so moving WordPress.com over was the obvious next step.  Here is a graph that shows CPU usage before and after the switch.  Pretty impressive!

Before choosing nginx, we looked at HAProxy, Perlbal, and LVS. Here are some of the reasons we chose Nginx:

  • Easy and flexible configuration (true config “reload” support has made my life easier)
  • Can also be used as a web server, which allows us to simplify our software stack (we are not using nginx as a web server currently, but may switch at some point).
  • Only software we tested which could handle 8000 (live traffic, not benchmark) requests/second on a single server
We are currently using Nginx 0.6.29 with the upstream hash module which gives us the static hashing we need to proxy to varnish.  We are regularly serving about 8-9k requests/second  and about 1.2Gbit/sec through a few Nginx instances and have plenty of room to grow!

Static hostname hashing in Pound

WordPress.com just surpassed her 300th server today. How do we distribute requests to all those servers? We use Pound of course. For those of you not familiar with Pound, it is an open source software load balancer that is easy to setup and maintain, flexible, and fast!

In general, we do not stick individual sessions to particular backend servers because WordPress uses HTTP cookies to keep track of users and is therefore not dependent on server sessions. Any web server can process any request in any given point of time and the correct data will be returned. This is important since serve traffic in real time across three data centers.

There is one exception to this rule, however, and it has to do with the way we serve images. As Demitrious explained in his detailed post, when a request for an image is made, pound sends the request to a cache server running Varnish. How does it decide which server to send the request to? Well, it looks at the hostname of the request, hashes it, and then assigns that to a particular cache server. By default Pound supports sessions based on any HTTP header, so we could easily use the hostname as the determining factor, but the mapping is not static. In other words, when we restart pound, all the hostname assignments would be reset and we would effectively invalidate a large portion of our cache.

To circumvent this problem, please see the following patch. What the patch does is statically hash hostnames so a given hostname is sent to the same server all the time, even across restarts. If the backend server happens to go down, the requests will be sent to another server in the pool until the server is back up, at which point the requests will be sent to the original server. This allows us to restart pound without invalidating our image cache. We have been using this in production for a couple months now and everything is working great. The patch is written against Pound 2.3.2 and to use the static mapping you would add the following to the end of the Service directive in your Pound configuration file:

Type hostname

One thing to keep in mind is that if you add or remove servers from the Service definition, you will change the mapping, so I would recommend adding a few more backend directives than you need right away to allow for future growth without complete cache invalidation. For example, we currently have 4 caching servers, but 16 BackEnds listed (4 instances of each server). This will allow us to add more cache servers and only invalidate a small portion of the cache each time.

Of course this works for us because each blog has a unique hostname from which images are served (mine is barry.files.wordpress.com). If all of your traffic is served from a single domain name, this strategy won’t do you much good.

Making Gravatar fast again

As Matt blogged, Automattic recently purchased Gravatar. The first thing we did was move the service onto the WordPress.com infrastructure. Since the application is very different from WordPress.com what this really means is using what we have learned from scaling WordPress.com to increase both speed and reliability of the service, as well as leveraging our existing hardware and network infrastructure to stabilize the service. The current infrastructure is laid out as follows:

  • 2 application servers (in 2 different data centers for redundancy). One of these servers primarily handles the main Gravatar website which is Ruby on Rails while the other serves the images themselves. If either of these servers or data centers were to fail, we could easily switch things around to work around the outage.
  • 2 cache servers (1 in each datacenter). These servers are running Varnish. They cache requested images for a period of 10 minutes, so frequently requested images are not repeatedly requested from the application servers. We are seeing about a 65% cache hit rate and about 1000 requests/second at peak times, although as adoption of the service increases, we expect this number to go up significantly. A single server running Varnish can serve many thousands of requests/sec. The amount of data we are caching is small enough to fit in RAM, so disk I/O is not currently an issue.

On the hardware side, for those of you who are curious, we are using HP DL365s for the application servers, and HP DL145s for the caching servers. 4GB of RAM and 2 x AMD Opteron 2218s all around. The application servers have 4 x 73GB 15k SAS drives in a RAID 5, while the caching servers are just single 80GB SATA drives. We use the same hardware configurations extensively for WordPress.com and they work well.

Previously, the service was using Apache2 + Mongrel to serve the main site and lighttpd + mod_magnet to serve the images. We decided to simplify this and we are currently using lighttpd to serve everything and it is working well for the most part. We seem to have a memory usage issue with lighttpd, which may be related to this long-standing bug.  For now, we are just monitoring memory usage of the application with monit, and restarting the service before memory usage gets too high.

Redundancy and power outages

Scott Beale reports that many Web 2.0 websites were affected by today’s power outage at 365 Main in San Francisco. While unfortunate, as a systems guy I have to assume things like this are going to happen. They shouldn’t happen, but they can and they will. At the data center level, there should be multiple levels of redundancy that minimize the probability of a power outage. Things such as multiple power circuits, redundant UPSes, and generators are standard. For a complete power outage to occur there should have to be multiple simultaneous system failures. I looked for a statement from 365 Main as to what the problem was, but couldn’t find one.

The system architecture behind WordPress.com and Akismet is designed to take entire data center failures into account. For WordPress.com, we serve live content in real-time from 3 data centers (33% from each data center) and in the event of a data center failure, traffic is automatically re-routed to the 2 remaining data centers. Syncing content in real-time between multiple data centers has not been easy, but at times like this I am sure that we made the right decision.

%d bloggers like this: