Data Center Heatmap

At Automattic, our systems team manages over 10,000 physical servers located across 30 data centers on 6 continents. As our compute density has increased from 24 CPU threads/RU in 2013 to 128 CPU threads/RU in 2022, the maximum thermal thresholds have decreased. Older, less powerful servers could operate with inlet air temperatures up to 42C (107.6F), while newer servers trigger CPU throttling at much lower temperatures of 35C-37C (95F-98.6F). Normal data center operating temperatures tend to be between 20C and 25C (68F-77F), but cooling failures are somewhat common (they even affect Google), so we have to monitor temperatures carefully.
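
The Celsius/Fahrenheit pairs and the thermal margins above are easy to double-check with a few lines of Python. This is only a sketch: the threshold values are the ones quoted in this post, while the constant and function names are made up for illustration.

```python
# Thresholds quoted in the post, in degrees Celsius.
OLD_SERVER_MAX_INLET_C = 42.0
NEW_SERVER_THROTTLE_C = 35.0  # lower end of the 35C-37C throttling range


def c_to_f(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def headroom_c(inlet_c: float, threshold_c: float) -> float:
    """Margin left before the threshold; a negative value means it is exceeded."""
    return threshold_c - inlet_c


for c in (20, 25, 35, 37, 42):
    print(f"{c}C = {c_to_f(c):.1f}F")

# Margin for a newer server at the top of the normal operating range.
print(headroom_c(25.0, NEW_SERVER_THROTTLE_C))
```

At the warm end of the normal range, a newer server has roughly 10C of margin before throttling, compared with 17C for the older generation.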

We are big fans of Prometheus and Grafana and for a few years we have had temperature graphs that look like this.

This graph shows the temperatures of some servers located in our data center in Johannesburg, South Africa over one week. The colored lines represent individual servers and the bold, red line is the average temperature in the rack.

We get this data from each server's inlet temperature sensor using ipmitool. I thought it would be interesting to see this data visualized a bit differently, and Grafana has a Heatmap graph type that made it pretty easy.
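
For readers who have not scraped IPMI sensors before, here is a rough sketch of pulling an inlet reading out of text in the pipe-separated style that `ipmitool sdr type Temperature` prints. The sample output is illustrative rather than captured from one of our servers, sensor names differ between vendors, and `parse_inlet_temp` is a hypothetical helper, not the collector we actually run.

```python
import re
from typing import Optional

# Illustrative sample in the pipe-separated style printed by
# `ipmitool sdr type Temperature`; sensor names and IDs vary by vendor.
SAMPLE_OUTPUT = """\
Inlet Temp       | 04h | ok  |  7.1 | 23 degrees C
Exhaust Temp     | 01h | ok  |  7.1 | 38 degrees C
Temp             | 0Eh | ok  |  3.1 | 51 degrees C
"""


def parse_inlet_temp(output: str, sensor_prefix: str = "Inlet") -> Optional[float]:
    """Return the first reading whose sensor name starts with sensor_prefix."""
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5 or not fields[0].startswith(sensor_prefix):
            continue
        match = re.match(r"(-?\d+(?:\.\d+)?) degrees C$", fields[4])
        if match:
            return float(match.group(1))
    return None  # no matching sensor, or the reading was not numeric


print(parse_inlet_temp(SAMPLE_OUTPUT))
```

Returning `None` instead of raising keeps a missing or unreadable sensor from looking like a real temperature of zero.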

First, we simply want to graph the temperature by location for a given data center. In PromQL this looks like:

avg by (location) (ipmi_inlet_temp{dc="$DC"})
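
In Grafana, `$DC` is a dashboard variable. Outside Grafana, the same query can be sent to the Prometheus HTTP API (`/api/v1/query`) with a literal value substituted. The sketch below only builds the request URL and parses a hand-written sample response, so it runs without a server; the host name, the `jnb` data center code, and the sample readings are all assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, dc: str) -> str:
    """Build an instant-query URL for the per-location average inlet temperature."""
    promql = f'avg by (location) (ipmi_inlet_temp{{dc="{dc}"}})'
    return f"{base_url}/api/v1/query?" + urlencode({"query": promql})


def temps_by_location(response_body: str) -> dict:
    """Map each location label to its temperature from an instant-vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string"]}.
    return {
        sample["metric"]["location"]: float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hand-written response in the shape the HTTP API returns; values are made up.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"location": "101-10"}, "value": [1659000000, "23.5"]},
            {"metric": {"location": "101-40"}, "value": [1659000000, "27"]},
        ],
    },
})

print(build_query_url("http://prometheus.example:9090", "jnb"))
print(temps_by_location(SAMPLE_RESPONSE))
```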

location includes the rack identifier and the location in the rack. For example, a location of 101-10 would mean Rack 101, RU 10. We store this information in our data center asset management system (which is a colon-separated file) and it gets added as labels to all Prometheus metrics.

By choosing the Heatmap (New) graph type and configuring some basic graph options, Grafana allows us to create a graph which shows the same data as our original graph, but in a different and more useful way. We can easily see that the top of the rack is warmer than the bottom, which is to be expected since the cold air in this facility comes from the floor. We can also see that temperatures have increased slightly over the past week, which is not ideal, but they are not at dangerous levels.
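
To make the location label concrete, here is a small sketch that splits a label such as 101-10 into its rack and RU parts and orders one rack's readings from top to bottom, the way a per-rack heatmap stacks them. The readings are made up, and `Location` and `parse_location` are hypothetical names, not part of our tooling.

```python
from typing import NamedTuple


class Location(NamedTuple):
    rack: str
    ru: int


def parse_location(label: str) -> Location:
    """Split a label such as '101-10' into rack '101' and rack unit 10."""
    rack, _, ru = label.rpartition("-")
    if not rack or not ru.isdigit():
        raise ValueError(f"unexpected location label: {label!r}")
    return Location(rack, int(ru))


# Made-up inlet readings (degrees C) for a single rack.
readings = {"101-2": 21.0, "101-10": 22.5, "101-24": 24.0, "101-40": 26.5}

# Sort by RU as a number, highest first, so the top of the rack comes first.
rows = sorted(
    ((parse_location(label).ru, temp) for label, temp in readings.items()),
    reverse=True,
)
for ru, temp in rows:
    print(f"RU {ru:>2}: {temp:.1f}C")
```

Sorting on the parsed integer matters: sorted as plain strings, 101-10 would come before 101-2 and the rows would not follow the physical order of the rack.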

We can contrast this with a rack in Milan, Italy, where a cooling outage caused the servers to operate beyond their intended temperature threshold for a period of time:

Using the same data and graph options, we are also able to easily create heat maps of entire rows of racks to visualize airflow management and identify areas for potential improvement. Here is a row of racks in a data center in Los Angeles with poor airflow management. We can see the racks at the end of the row suffer from increased temperatures due to air leakage from the hot aisle to the cold aisle.
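
The same comparison across a row can be approximated numerically. This sketch averages made-up readings per rack and flags racks that run noticeably warmer than the row as a whole. The 2C margin is an arbitrary assumption, and the function names are hypothetical; it illustrates the idea and is not how the Grafana panel works.

```python
from collections import defaultdict
from statistics import mean


def rack_averages(temps_by_location: dict) -> dict:
    """Average the readings of every location that shares a rack identifier."""
    per_rack = defaultdict(list)
    for location, temp in temps_by_location.items():
        rack = location.rpartition("-")[0]  # '104-40' -> '104'
        per_rack[rack].append(temp)
    return {rack: mean(temps) for rack, temps in per_rack.items()}


def warm_racks(averages: dict, margin_c: float = 2.0) -> list:
    """Racks whose average exceeds the mean of the rack averages by margin_c."""
    row_mean = mean(averages.values())
    return sorted(rack for rack, avg in averages.items() if avg - row_mean > margin_c)


# Made-up readings for a row of four racks; rack 104 sits at the end of the row.
row = {
    "101-10": 22.0, "101-40": 24.0,
    "102-10": 22.0, "102-40": 23.0,
    "103-10": 22.0, "103-40": 24.0,
    "104-10": 27.0, "104-40": 29.0,
}

averages = rack_averages(row)
print(averages)
print(warm_racks(averages))
```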

We can contrast this with data from a set of racks in Amsterdam, which have much better airflow management:

I hope this post shows how easy it is to create cool(!!) and useful heat maps using Grafana, Prometheus, and a little bit of time. If this sort of stuff interests you, Automattic is hiring!

Comments

5 responses to “Data Center Heatmap”

July 30, 2022

Cc

Nice

August 1, 2022

Rohak Jain

Great work!

