Barry on WordPress

Category: technical

dotScale 2013 MySQL Talk

In June, I gave a talk at the dotScale conference in Paris about WordPress.com’s MySQL database architecture and infrastructure. The video is now online:

September 2, 2013
Anycast

This week we started testing our global anycast network. We have a real-time map which shows which people are served by each location. Today we have three locations online, and we hope to have 10-12 by the end of the year.

The Internet is definitely not a big truck…

March 16, 2013
Nginx Case Study

A case study I worked on with Andrew Alexeev at Nginx was republished on High Scalability. The Hacker News thread has some good conversation as well.

October 2, 2012
WordPress.com DDoS Details

As you may have heard, on March 3rd and into the 4th, 2011, WordPress.com was targeted by a rather large Distributed Denial of Service Attack. I am part of the systems and infrastructure team at Automattic and it is our team’s responsibility to a) mitigate the attack, b) communicate status updates and details of the attack, and c) figure out how to better protect ourselves in the future. We are still working on the third part, but I wanted to share some details here.

One of our hosting partners, Peer1, provided us these InMon graphs to help illustrate the timeline. What we saw was not one single attack, but 6 separate attacks beginning at 2:10AM PST on March 3rd. All of these attacks were directed at a single site hosted on WordPress.com’s servers. The first graph shows the size of the attack in bits per second (bandwidth), and the second graph shows packets per second. The different colors represent source IP ranges.

The first 5 attacks caused minimal disruption to our infrastructure because they were smaller in size and shorter in duration. The largest attack began at 9:20AM PST and was mostly blocked by 10:20AM PST. The attacks were TCP floods directed at port 80 of our load balancers. These types of attacks try to fill the network links and overwhelm network routers, switches, and servers with “junk” packets that prevent legitimate requests from getting through.

The last TCP flood (the largest one on the graph) saturated the links of some of our providers and overwhelmed the core network routers in one of our data centers. In order to block the attack effectively, we had to work directly with our hosting partners and their Tier 1 bandwidth providers to filter the attacks upstream. This process took an hour or two.

Once the last attack was mitigated at around 10:20AM PST, we saw a lull in activity. On March 4th around 3AM PST, the attackers switched tactics. Rather than a TCP flood, they switched to an HTTP resource consumption attack. Enlisting a botnet consisting of thousands of compromised PCs, they made many thousands of simultaneous HTTP requests in an attempt to overwhelm our servers. The source IPs were completely different from those in the previous attacks, but mostly still from China. Fortunately for us, the WordPress.com grid harnesses over 3,600 CPU cores in our web tier alone, so we were able to quickly mitigate this attack and identify the target.
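To illustrate how a target can be picked out of an HTTP flood, here is a minimal sketch that tallies requests per hostname. It is not the tooling used on WordPress.com: the three-field log format (`client_ip host path`), the `top_hosts` helper, and the sample hostnames are all assumptions made up for the example. The idea is only that HTTP requests, unlike raw TCP flood packets, carry a hostname that can be counted.

```python
from collections import Counter


def top_hosts(log_lines, n=3):
    """Return the n most-requested hostnames from simplified log lines.

    Each line is assumed to look like "<client_ip> <host> <path>".
    This format is hypothetical; real access logs will differ.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed or truncated lines
        counts[parts[1]] += 1  # second field is the requested hostname
    return counts.most_common(n)


# Made-up sample traffic: one hostname receives most of the requests.
sample = [
    "203.0.113.5 target.example.com /",
    "203.0.113.9 target.example.com /",
    "198.51.100.7 quiet.example.com /about",
    "203.0.113.12 target.example.com /",
]

print(top_hosts(sample))
```

A site whose request count sits far above its normal share is a likely candidate for the target, though on a real platform the baseline traffic of each site would have to be taken into account.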

We see denial of service attacks every day on WordPress.com and 99.9% of them have no user impact. This type of attack made it difficult to initially determine the target since the incoming DDoS traffic did not have any identifying information contained in the packets. WordPress.com hosts over 18 million sites, so finding the needle in the haystack is a challenge. This attack was large, in the 4-6Gbit/sec range, but not the largest we have seen. For example, in 2008, we experienced a DDoS in the 8Gbit/sec range.

While it is true that some attacks are politically motivated, contrary to our initial suspicions, we have no reason to believe this one was. We are big proponents of free speech and aim to provide a platform that supports that freedom. We even have dedicated infrastructure for sites under active attack. Some of these attacks last for months, but this allows us to keep these sites online and not put our other users at risk.

We also don’t put all of our eggs in one basket. WordPress.com alone has 24 load balancers in 3 different data centers that serve production traffic. These load balancers are deployed across different network segments and different IP ranges. As a result, some sites were only affected for a couple of minutes (when our provider’s core network infrastructure failed) throughout the duration of these attacks. We are working on ways to improve this segmentation even more.

If you have any questions, feel free to leave them in the comments and I will try to answer them.

March 7, 2011
AMD Barcelona vs. Intel Nehalem

We are looking at switching some of our servers from AMD Opteron Barcelona quad-core processors to the new Intel 5520 Nehalem processors. These are both 4-core CPUs, but the Intels utilize hyper-threading, so the OS sees 8 cores per CPU. It wasn’t that long ago that the first thing you did with a hyper-threading-enabled CPU was switch it off in the BIOS, but I have heard good things about Intel’s reincarnation of hyper-threading, so I decided to give it a shot.

I ran some real-world stress tests against these servers, adding them into the WordPress.com web pool and seeing how many requests per second they could serve before becoming 100% CPU bound and effectively falling over. The types of requests served are varied; a lot are rendering web pages, but there are also quite a few image resizing operations thrown in as well, as we spread this image work evenly over the 2500 cores in our web tier. Everything is PHP executed via FastCGI. I was a bit skeptical that there would be much of a difference between the two processors, but the numbers proved me wrong: the Nehalems are impressive.

2 x AMD Opteron 2356 Barcelona Quad-core 2.3GHz
40 requests/second at 87.5% CPU utilization

2 x Intel 5520 Nehalem Quad-core 2.26GHz
78 requests/second at 94% CPU utilization
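
For a quick sense of scale, the arithmetic behind these two results can be sketched as follows. The request rates and CPU figures come from the measurements above; the core counts (8 physical cores per box, 16 logical cores on the Intel box with hyper-threading) follow from the 2 x quad-core configurations, and the variable names are only illustrative.

```python
# Measured figures from the two test boxes.
amd_rps, amd_cpu_pct = 40, 87.5
intel_rps, intel_cpu_pct = 78, 94.0

# Each box has 2 quad-core CPUs, so 8 physical cores.
# With hyper-threading the Intel box exposes 16 logical cores to the OS.
physical_cores = 2 * 4
intel_logical_cores = physical_cores * 2

# Overall throughput ratio, Intel relative to AMD.
ratio = intel_rps / amd_rps

# Requests per second per physical core on each box.
amd_rps_per_core = amd_rps / physical_cores
intel_rps_per_core = intel_rps / physical_cores

# Requests per second per logical core on the Intel box.
intel_rps_per_logical = intel_rps / intel_logical_cores

print(f"Intel/AMD throughput ratio: {ratio:.2f}x")
print(f"AMD:   {amd_rps_per_core:.2f} req/s per physical core")
print(f"Intel: {intel_rps_per_core:.2f} req/s per physical core")
print(f"Intel: {intel_rps_per_logical:.3f} req/s per logical core")
```

The Intel box serves just under twice the requests per second of the AMD box, with per-logical-core throughput close to the AMD per-core figure. The two runs were taken at different CPU utilization levels, so this is a rough comparison rather than a like-for-like one.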

A few things that I thought were interesting:
- On a per request basis, there isn’t much of a difference between the two. They both generate a given page in roughly the same amount of time.
- As CPU utilization approaches 100%, the Intels scale rather linearly, while the AMDs seem to struggle over the 85% range.
- The load averages were pretty high during these tests (35+ on the Intel box), but request times didn’t seem to suffer.

Has anyone else seen the same sort of results or maybe something to the contrary? These 2 configurations are roughly the same price, making it seem like a no-brainer to choose the Intels for web applications.
May 22, 2009