Yesterday, Valentin Bartenev, a developer at Nginx, Inc., announced SPDY support for the Nginx web server. SPDY is a next-generation networking protocol developed by Google and focused on making the web faster. More information on SPDY can be found on Wikipedia.
At Automattic, we have used Nginx since 2008. Since then, it has made its way into almost every piece of our web infrastructure. We use it for load balancing, image serving (via MogileFS), serving static and dynamic web content, and caching. In fact, we have almost 1000 servers running Nginx today, serving over 100,000 requests per second.
I met Andrew and Igor at WordCamp San Francisco in 2011. For the next six months, we discussed the best way for Automattic and Nginx, Inc. to work together. In December 2011, we agreed that Automattic would sponsor the development and integration of SPDY into Nginx. The only real requirement from our end was that the resulting code be released under an open source license so that others could benefit from all the hard work.
For the past 6 months, Valentin and others have been implementing SPDY support in Nginx, and for the past month or so, we have been continually testing SPDY, fixing bugs, and improving stability. Things are almost ready for production and we hope to enable SPDY for all of WordPress.com in the next few weeks. Today, this site is SPDY-enabled if you are using a recent version of Chrome or Firefox and accessing this site over SSL. You can download the Chrome extension here and the one for Firefox here.
Thanks to the Nginx team for all their hard work implementing SPDY, and thanks to all of my Automattic co-workers who helped us test SPDY. I hope to post some real-world performance numbers in the next few weeks as we complete our SPDY deployment and gather more data. We are also looking forward to SPDY support being part of the official Nginx source in the near future.
“We’d like to say big thanks to the team at Automattic and especially to Pyry Hakulinen who has been great in helping us test and debug this first public version of SPDY module for nginx. Automattic is a great partner, and we will continue to work with Barry and his team on improvements to nginx and to nginx/SPDY in particular.”
Andrew Alexeev – Nginx, Inc.
Evan has a cool post showing some of our internal heat map stats and some interesting points on data visualization.
After a 3 year speaking hiatus from WordCamp SF, I am excited about speaking again this year. The most interesting part of my talks is usually the Q&A at the end, so this time we decided to get rid of the talk and go straight to the Q&A. It will focus on running large WordPress installations, but I’m sure there will be time to discuss other WordPress-related things. Bring your questions and make them difficult! If you have a question but won’t be able to attend, please ask in the comments and I will try to answer it during the session (which I think will be recorded).
Howdy – Iliya here again. Seems like I am taking over Barry’s blog. Hopefully this will motivate him to blog more.
On WordPress.com we have over 218 million tables and perform tens of thousands of queries per second. To scale all of this, we shard our 24 million blogs across more than 550 MySQL servers. This allows us to cope with load bursts and to handle database server failures.
For those who are unfamiliar, MySQL data replication is asynchronous and works as follows:
- [Master] Receives a query that modifies database structure or content (INSERT, UPDATE, ALTER etc.)
- [Master] The query is written to a log file (aka the binlog).
- [Master] The query is executed on the master.
- [Slaves] Create a “Slave I/O” thread that connects to the [Master] and requests all new queries from the master’s binlog.
- [Master] Creates a “Binlog dump” thread for each connected slave, that reads the requested events from the binlog and sends them to the slave.
- [Slaves] Start a “Slave SQL” thread which reads queries from the log file written by the “Slave I/O” thread and executes them
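The steps above can be modeled with a toy, in-memory sketch. It is not how MySQL is implemented: the class and method names are made up for illustration, "queries" are reduced to key/value writes, and the two slave threads are plain method calls. It only shows why a slave can hold older data than the master between the moment events are fetched and the moment they are applied.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # hypothetical stand-in for a write query


class Master:
    def __init__(self) -> None:
        self.binlog: List[Event] = []
        self.data: Dict[str, int] = {}

    def execute(self, key: str, value: int) -> None:
        # The write is recorded in the binlog and applied on the master.
        self.binlog.append((key, value))
        self.data[key] = value


class Slave:
    def __init__(self) -> None:
        self.relay_log: List[Event] = []
        self.applied = 0  # number of relay log events already executed
        self.data: Dict[str, int] = {}

    def io_thread(self, master: Master) -> None:
        # "Slave I/O" thread: fetch every binlog event not copied yet.
        self.relay_log.extend(master.binlog[len(self.relay_log):])

    def sql_thread(self, max_events: int) -> None:
        # "Slave SQL" thread: a single thread replays events in order.
        pending = self.relay_log[self.applied:self.applied + max_events]
        for key, value in pending:
            self.data[key] = value
            self.applied += 1
```

If the master executes two writes and the slave has only replayed one of them, `slave.data` is missing the second write until `sql_thread` runs again.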
There are a number of things to be considered in this scenario, which can lead to a condition known as replication lag, where the slaves have older data than the master:
- Since only one thread on the slave executes write queries, and there are many execution threads on the master, there is no guarantee that the slave will be able to execute queries with the same speed as the master.
- Long running SELECTs or explicit locks on the slave will cause the “Slave SQL” thread to wait, thus slowing it down.
- Long running queries on the master would take at least the same amount of time to run on the slave, causing it to fall behind the master
- I/O (disk or network) issues can prevent or slow down the slave from reading and replaying the binlog events
In order to deal with this, we needed a way to avoid connections to lagged slaves as long as there are slaves that are current. This would allow for the lagged ones to recover faster and avoid returning old data to our users. It also had to be something flexible enough, so we could have different settings for acceptable replication lag per dataset or stop tracking it altogether. Since we use the advanced database class, HyperDB, for all our database connections, it was the obvious place to integrate this.
We implemented it in the following steps:
- If a connection modifies data in a given table, then all subsequent SELECTs on the same connection for that table are sent to the master. Chances are replication won’t be fast enough to propagate the changes to the slaves on the same page load. This logic has existed in HyperDB for a while.
- Before we make a connection to a slave, we use a callback to check whether we have information about this slave’s lag in the cache, and we skip the slave based on that, unless all slaves in the dataset are considered lagged. In case replication breaks on all slaves, we would rather return old data than overload the master with read queries and cause an outage.
- After a successful connection to a slave, if there was nothing in the cache regarding its lag status and not all slaves are considered lagged, we execute a second callback that checks whether this slave is lagged and updates the cache.
A slave is considered lagged when it has a “lag threshold” defined in its dataset configuration and the current lag is more than this threshold.
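The selection rule described above can be sketched in a few lines. This is not HyperDB code (HyperDB is written in PHP), and the function name, the dictionary used as a cache, and the parameter names are all assumptions made for illustration. A slave with no cached lag value is treated as usable, since its lag is only measured after a successful connection.

```python
from typing import Dict, List, Optional


def pick_slaves(
    slaves: List[str],
    lag_cache: Dict[str, float],
    lag_threshold: Optional[float],
) -> List[str]:
    """Return the slaves worth connecting to for one dataset."""
    if lag_threshold is None:
        # No threshold configured: lag is not tracked for this dataset.
        return list(slaves)
    # Keep slaves whose cached lag is within the threshold (or unknown).
    current = [s for s in slaves if lag_cache.get(s, 0.0) <= lag_threshold]
    # If every slave is lagged, return them all: old data is preferable
    # to sending all reads to the master.
    return current if current else list(slaves)
```

For example, with a threshold of 5 seconds and a cache that records slave "a" as 10 seconds behind, `pick_slaves(["a", "b"], {"a": 10.0}, 5.0)` keeps only "b".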
We considered the following options for checking if a slave is lagged. No MySQL patches are required for any of them:
- Checking the value of Seconds_Behind_Master from the SHOW SLAVE STATUS statement executed on the slave. It shows the difference between the timestamp of the currently executed query and the latest query we have received from the master. Although it is easy to implement and has low overhead, the main problem with using this option is that it is not completely reliable, as it can be tricked by IO latency and/or master connection problems.
- Tracking the “File” and “Position” from SHOW MASTER STATUS executed on the master and comparing them to Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS on the slave. This way we can wait until the slave executes the queries up to binlog file “File” and position “Position” before sending certain queries to that slave, effectively waiting for the data to be replicated to the point where we need it. While very reliable, this option is more complex, has a lot of overhead, and doesn’t give us a clock-time value which we can track and set between servers.
- Tracking the difference between the current time on the slave and the replication of a timestamp update from the master, which runs every second. This is basically what mk-heartbeat does. It requires proper time sync between the master and the slave servers but is otherwise very reliable.
The third option fit our needs best; however, the code is flexible enough to easily support any of these. For caching, we decided to go with memcached, since it works well in our distributed, multi-server, multi-datacenter environment, but other methods (APC cache, shared memory, custom daemon etc.) would work just fine.
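The arithmetic behind the heartbeat approach is small enough to show directly. The sketch below is an illustration only, with hypothetical function names: it assumes the master writes its current time into a replicated row every second, and that the slave compares the replicated value with its own clock. Clamping at zero is an added assumption to absorb small clock skew, which is why proper time sync between servers matters.

```python
def heartbeat_lag(slave_now: float, replicated_heartbeat: float) -> float:
    """Seconds between the slave's clock and the last replicated heartbeat."""
    # A negative difference can only come from clock skew, so clamp it.
    return max(0.0, slave_now - replicated_heartbeat)


def is_lagged(lag_seconds: float, lag_threshold: float) -> bool:
    """A slave is lagged when its lag exceeds the dataset's threshold."""
    return lag_seconds > lag_threshold
```

With a slave clock at 1000.0 and a replicated heartbeat of 997.5, the lag is 2.5 seconds, which would count as lagged under a 2 second threshold but not under a 5 second one.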
HyperDB is free, open-source and easy to integrate in your WordPress installation. You can download it here. We hope you enjoy this new functionality and please let us know if you have any questions in the comments.
This is a guest post by Iliya Polihronov. Iliya is the newest member of the global infrastructure, systems, and security team at Automattic and the first ever guest blogger here on barry.wordpress.com.
Hey, my name is Iliya and as a Systems Wrangler at Automattic, I am one of the people handling the server-side issues across the 2000 servers running WordPress.com and other Automattic services.
Last week, within two hours of each other, two of our MogileFS storage servers locked up with the following trace:
The next day, a few more servers crashed with similar traces.
We started searching for a common pattern. All hosts were running Debian kernels ranging from 2.6.32-21 to 2.6.32-24; some of them were in different data centers and had different purposes in our network.
One thing we noticed was that all of the servers crashed after having an uptime of a little more than 200 days. After some research and investigation, we found that the culprit appears to be quite an interesting kernel bug.
As part of the scheduler load balancing algorithm, the kernel searches for the busiest group within a given scheduling domain. In order to do that it has to take into account the average load for all groups. It is calculated in the function find_busiest_group() with:
sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
sds.total_load is the sum of the load on all CPUs in the scheduling domain, based on the run queue tasks and their priority.
SCHED_LOAD_SCALE is a constant used to increase resolution.
sds.total_pwr is the sum of the power of all CPUs in the scheduling domain. This sum ends up being zero, and that is what causes the crash: a division by zero.
The “CPU power” is used to take into account how much computing capability a CPU has compared to the other CPUs, and the main factors for calculating it are:
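To make the failure mode concrete, here is the same formula as a small Python sketch. It is not kernel code: the function name is made up, and the value used for SCHED_LOAD_SCALE (1024) is an assumption about these kernels. In Python a zero divisor raises an exception; in the kernel the same integer division by zero is what brings the machine down.

```python
SCHED_LOAD_SCALE = 1 << 10  # assumed value, used only to increase resolution


def avg_load(total_load: int, total_pwr: int) -> int:
    """Integer form of sds.avg_load for one scheduling domain."""
    # Raises ZeroDivisionError when the summed CPU power is zero.
    return (SCHED_LOAD_SCALE * total_load) // total_pwr
```

With a total load of 3072 and a total power of 2048, the result is 1536. With a total power of 0, the call fails instead of returning a value.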
1. Whether the CPU is shared, for example by using multithreading.
2. How many real-time tasks the CPU is processing.
3. In newer kernels, how much time the CPU had spent processing IRQs.
The currently suggested fix for this bug relies on the theory that, while taking into account the real-time tasks (#2 above), scale_rt_power() could return a negative value, and thus the sum of all CPU powers may end up being zero.
This was merged into the 126.96.36.199 vanilla kernel, together with the IRQ accounting into the cpu_power (#3 above). It is also merged into the Debian 2.6.32-31 kernel.
Alternatively, the scheduler load balancing can be turned off, which would effectively skip the related code. This can be done using control groups; however, it should be used with caution as it may cause performance issues:
mount -t cgroup -o cpuset cpuset /cgroups
echo 0 > /cgroups/cpuset.sched_load_balance
As it is not yet absolutely clear whether the suggested fix really fixes the problem, we will try to post updates on any new developments as we observe them.