Monitoring physical server environments
Last Modified: 15th Feb 2017
Despite the popularity of the cloud, a number of businesses still have servers located on premises. Depending on the value the company places on those servers, the requirements, facilities and even their location can vary widely. Anecdotally, I’ve seen server “rooms” vary in size from custom air-conditioned rooms complete with racking to machines hidden under desks.
Servers produce a lot of heat. Figures vary, but a rough figure for a large server is around 2500 BTU per hour. This heat has to go somewhere. Normally the available ventilation is enough to maintain a safe and reliable environment, but if that fails then overheating can be a problem. Spotting this when it occurs can be even more troublesome, especially without automated monitoring, because servers are typically set up and left alone.
What Should You Monitor?
The environment inside your servers is the most obvious (and important) one to monitor, and that’s why we spend so much time developing our popular server monitoring tool. What’s often neglected though, is the environment outside of those servers.
Server Density’s core service is monitoring the internals, but we also offer lots of plugins and a brilliant monitoring API, which means you can monitor a whole host of things. That’s why in this tutorial we’ll be working through how to monitor server environments and why.
Inside the servers
Depending on the make and cost of your server, the chances are the vendor has some specific hardware monitoring software. Whilst it may not be perfect, it’s usually better to run the vendor-supplied monitoring so that any issues can be reported back to the vendor in a way they both accept and understand. Major vendors (IBM, Dell, HP etc.) usually have prepackaged solutions for your OS. Otherwise, for any server running Linux, http://www.lm-sensors.org/ will provide all the information you require, assuming all the hardware is supported.
WARNING: Setting up lm-sensors requires probing of the onboard sensors and module loading. THIS MAY CAUSE YOUR SERVER TO LOCK UP AND STOP RESPONDING! I have yet to have it happen to me but I would strongly recommend you prepare for physically power cycling the server just in case. If you are running Ubuntu have a read of this blog post to get a better idea of what setup might be required.
Mainboard Based Sensors
Temperature sensors tend to be located in a few different positions inside servers. Motherboard and CPU are usually available and some servers will have chassis/case sensors as well.
$ sensors
fam15h_power-pci-00c4
Adapter: PCI adapter
power1:       78.39 W  (crit =  94.92 W)

fam15h_power-pci-00cc
Adapter: PCI adapter
power1:       83.20 W  (crit =  94.92 W)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +45.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:       +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:       +45.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:       +42.0°C  (high = +80.0°C, crit = +100.0°C)

f71882fg-isa-0a00
Adapter: ISA adapter
+3.3V:        +3.36 V
in1:          +0.92 V  (max =  +2.04 V)
in2:          +0.01 V
in3:          +0.01 V
in4:          +0.95 V
in5:          +1.11 V
in6:          +0.03 V
3VSB:         +3.33 V
Vbat:         +3.28 V
fan1:        2846 RPM
fan2:           0 RPM  ALARM
fan3:        1754 RPM
fan4:           0 RPM  ALARM
temp1:        +24.0°C  (high = +255.0°C, hyst = +251.0°C)
                       (crit = +255.0°C, hyst = +251.0°C)  sensor = thermistor
temp2:        +63.0°C  (high = +255.0°C, hyst = +251.0°C)
                       (crit = +255.0°C, hyst = +251.0°C)  sensor = transistor
temp3:        +45.0°C  (high = +255.0°C, hyst = +253.0°C)
                       (crit = +255.0°C, hyst = +253.0°C)  sensor = transistor
Above is a typical output; this one is from my remote server. For me the key parts are the “Core” temperatures (one per CPU core, and this is a four-core machine), the “temp” temperatures and the “fan” speeds. There is also information regarding voltage levels, which is a bit out of scope for this article, though it is worth making a note of these values and checking for changes as part of standard maintenance.
Fan speed is crucial: if this is zero, your fan isn’t running. Depending on the server this could be disastrous. The motherboard used in this example has four fan outputs but only two are in use, which is why fan2 and fan4 report 0 RPM. Fan1 is on the CPU. If its RPM (revolutions per minute) drops, it could mean the fan is getting clogged with dirt and dust. If it drops to 0, then the CPU will cook itself pretty quickly.
There are three temperature readings, taken from sensors at various locations on the motherboard. The above readings are within the expected range. Spikes in temperature are not normally an issue, though a sustained increase indicates that either the server is dealing with more work or there could be a blockage in the exhaust or intake for the air flow.
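As a rough sketch of how these readings could be pulled out automatically, the Python below parses text in the format shown above and lists any fan that should be spinning but reports 0 RPM. The function names and the `in_use` list are illustrative assumptions, not part of lm-sensors or of any plugin mentioned in this post, and the sample text is a shortened copy of the output above.

```python
import re

# Matches lines such as "fan1: 2846 RPM" or "Core 0: +45.0°C (high = ...)".
# Voltage and power lines (V, W) are deliberately ignored.
READING = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>RPM|°C)"
)


def parse_sensors(text):
    """Return {'fans': {label: rpm}, 'temps': {label: celsius}}."""
    readings = {"fans": {}, "temps": {}}
    for line in text.splitlines():
        match = READING.match(line.strip())
        if not match:
            continue  # chip names, adapter lines, voltages, continuation lines
        label = match.group("label").strip()
        value = float(match.group("value"))
        if match.group("unit") == "RPM":
            readings["fans"][label] = value
        else:
            readings["temps"][label] = value
    return readings


def stopped_fans(readings, in_use):
    """List fans named in `in_use` that report 0 RPM or are missing."""
    return [fan for fan in in_use if readings["fans"].get(fan, 0.0) == 0.0]


# Shortened sample taken from the output above.
sample = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +45.0°C (high = +80.0°C, crit = +100.0°C)
fan1: 2846 RPM
fan2: 0 RPM ALARM
fan3: 1754 RPM
temp1: +24.0°C (high = +255.0°C, hyst = +251.0°C)
"""

readings = parse_sensors(sample)
print(readings["temps"])
print(stopped_fans(readings, in_use=["fan1", "fan3"]))
```

Passing only the fans that are actually connected avoids false alarms from unused headers such as fan2 and fan4 on this board.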
One area I overlooked until recently was hard drive temperature. I have a small server in a cupboard and decided to close the door. When checking it later, one disk was showing I/O errors and was unavailable. The lack of air flow had caused the disk to get very hot; so hot, in fact, that it was no longer operational. The other sensors did pick up a rise in temperature, but not enough to cause concern.
smartctl is a great tool for querying SMART enabled hard drives to find out a lot of information about the current health and status of a drive. The following will give the current temperature of the device specified, in this case sda.
sudo smartctl --all /dev/sda -s on | grep Temperature_Celsius

The matching line uses the following columns:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
Omitting | grep Temperature_Celsius will give the full output. It’s worth reading through this, as there is a lot of potentially useful information about the hard drive. This article also has more information about using it. The Ubuntu package is available as smartmontools.
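To turn that line into a number you can graph or alert on, something like the following Python sketch could work. It assumes the first token of the RAW_VALUE column is the temperature in degrees Celsius, which is common but not guaranteed for every drive, so check it against your own output. The function name and the sample row are made up for illustration.

```python
def drive_temperature(attribute_line):
    """Extract degrees Celsius from a smartctl Temperature_Celsius row.

    Expected columns:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = attribute_line.split()
    if len(fields) < 10 or fields[1] != "Temperature_Celsius":
        raise ValueError("not a Temperature_Celsius row: %r" % attribute_line)
    # RAW_VALUE is the tenth column. Some drives append extra text such as
    # "(Min/Max 20/45)", so only the first token is used (an assumption).
    return int(fields[9])


# Illustrative row, not taken from a real drive.
example = "194 Temperature_Celsius 0x0022 036 045 000 Old_age Always - 36 (Min/Max 20/45)"
print(drive_temperature(example))
```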
Outside the servers
The environment directly surrounding servers is usually ignored unless they are in a dedicated server hosting environment. It’s not uncommon to find servers lodged in the only space that was available at the time: kitchen cupboards, bookcases and people’s desks.
I have been working on environment monitoring for my greenhouse and thought the kit I used could be simplified for server monitoring. The temperature outside the server needs monitoring as a basic requirement. This can also be extended to include humidity, of which a sudden spike could indicate water leaking in where it shouldn’t.
How to monitor server environments
For the purposes of this tutorial let me take you through the following when it comes to monitoring server environments.
- Fan Speed
- Core(s) temperature
- Mainboard temperatures
- Hard drive temperatures
Monitoring inside the server
For this I’ve written a pretty nifty plugin that works with Server Density; it’s available here. It works with lm-sensors and Raspberry Pis.
To get your hands on my temperature plugin and graph what’s going on with your hardware, you’ll need to set up your Server Density account: sign up for a 15 day free trial and get up and running in a few minutes. Alternatively, there are a few great plugins written for monitoring your hardware with our Open Source friend Nagios.
Monitoring outside the server
This is where the post gets a little more fun. We need to find the right hardware, buy the right hardware and then plug it all in to a monitoring tool. When I started out looking after servers, iButtons were the standard solution: simple one-wire devices that used to plug into the serial port. Unfortunately, I had all sorts of problems getting them working for the first time and they’re quite expensive for what they are. Now a simple Arduino can replace these and offer more data for a lot less money. For the budget conscious, you can get a cheap Arduino clone from eBay for less than $5.
To monitor the humidity I suggest using one of these DHT22s ($3 from eBay) and a 4.7k resistor. The DHT22 does monitor temperature, but only to 1 decimal place. Otherwise, as a cheap single-sensor solution, the DHT22 is perfect.
If you’d like a better (than 1 decimal place) idea of temperature, and graphs with more of a curvy line, then I suggest adding a DS18b20 temperature sensor, either in its small transistor-style form ($1 or so on eBay) or as the sensor probe version ($14 from eBay). You will again need a 4.7k resistor.
To read data from these sensors you can either attach them to a Raspberry Pi directly into the GPIO; or use an Arduino. See this for setting up the DHT22. The Arduino can be connected over a USB serial cable for the current values to be extracted and reported back.
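On the machine the Arduino is plugged into, each line arriving over the serial cable needs turning into numbers. The Python sketch below assumes, purely for illustration, that the Arduino prints lines such as `H:45.2,T:21.5`; that format and the function name are my own assumptions, so adapt them to whatever your Arduino code actually sends. Reading the serial port itself is left out; a library such as pyserial is one option.

```python
import math


def parse_reading(line):
    """Parse 'H:45.2,T:21.5' into {'H': 45.2, 'T': 21.5}.

    The KEY:value,KEY:value format is an assumption for this sketch.
    Returns None for lines that do not parse, including failed sensor
    reads that arrive as 'nan'.
    """
    values = {}
    for part in line.strip().split(","):
        key, sep, raw = part.partition(":")
        if not sep:
            return None  # no "KEY:value" pair in this part
        try:
            value = float(raw)
        except ValueError:
            return None  # value was not a number
        if math.isnan(value):
            return None  # sensor read failed
        values[key.strip()] = value
    return values


print(parse_reading("H:45.2,T:21.5\r\n"))
print(parse_reading("H:nan,T:nan"))
```

Returning None for bad lines lets the caller simply skip that sample rather than graph a bogus value.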
Now you’ll want to post the information back to the Server Density API, so you can monitor your server health and get alerts if things start to go wrong. I’ve been using something similar to this for my greenhouse to monitor the temperature. I ran the Server Density agent on a laptop and communicated with the Arduino over 433 MHz to get the current values and graph them. When the temperature hit 50°C I would get an SMS from Server Density, then nip out and unzip the door before everything wilted.
If you want to monitor everything, the last thing to measure is light. Most data centres will have the lighting turned off and only switched on when someone enters. A simple LDR (Light Dependent Resistor) can detect changes in light level. Here’s a simple set up with an Arduino. LDRs are very cheap on eBay, costing less than $1.
Make your data actionable
Simply monitoring this data isn’t enough. We need to make sure we do something if our environmental monitoring spots problems; that’s the whole point of monitoring. In the case of my greenhouse, once the temperature exceeded 50°C I opened the door to allow some fresh air in.
If I turn off humidity in Server Density, you can pretty clearly see when I got the 50c alert and when I let the fresh air in:
Set up monitoring alerts
Let’s look at some of the alerts we should set up to make sure we avoid any expensive replacements:
Server temperature alert
A sudden rise in temperature is usually bad news. My first instinct would be that a fan has either failed or is impaired somehow. Unless action is taken swiftly, this can mean failure for some potentially expensive parts of the machine. This situation ties in with fan speed changes.
Server fan speed alert
Fan speed suddenly dropping to zero is even worse. Zero means the fan has stopped and in the case of CPUs this could mean almost instant failure for the CPU as it cooks itself without cooling.
Server room humidity alert
Humidity is a funny one and will be very specific to your server location. Moisture is going to impact servers badly and is usually a strong indicator that something is wrong. This could be leaving a window open and / or a leaking air conditioner unit.
A few good benchmarks to follow would be:
- More than a 20% increase to any temperature reading
- Fan speed equals zero
- > 75% humidity
- Prolonged change in lighting level (also, should the light be on at that time?)
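As a minimal sketch, the first three benchmarks could be expressed in Python as below. The function name, argument shapes and message wording are illustrative assumptions rather than anything Server Density provides, and the lighting check is left out because it depends on knowing when the lights are expected to be on.

```python
def check_benchmarks(baseline_temps, current_temps, fan_rpms, humidity_percent):
    """Return a list of alert messages for the benchmarks listed above."""
    alerts = []

    # More than a 20% increase to any temperature reading.
    for name, current in current_temps.items():
        baseline = baseline_temps.get(name)
        if baseline is not None and baseline > 0 and current > baseline * 1.20:
            alerts.append(
                "%s is %.1f°C, more than 20%% above its %.1f°C baseline"
                % (name, current, baseline)
            )

    # Fan speed equals zero. Only pass in fans that are actually connected.
    for name, rpm in fan_rpms.items():
        if rpm == 0:
            alerts.append("%s has stopped (0 RPM)" % name)

    # Humidity above 75%.
    if humidity_percent > 75:
        alerts.append("humidity is %.0f%%, above 75%%" % humidity_percent)

    return alerts


for message in check_benchmarks(
    baseline_temps={"Core 0": 45.0},
    current_temps={"Core 0": 55.0},
    fan_rpms={"fan1": 0},
    humidity_percent=80,
):
    print(message)
```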
Set up dashboards and graphs for insight
It’s also a good idea to add this stuff to a monitoring dashboard to see trends over time. If the internal temperature of a server starts to rise, it might be the start of a problem that you should fix. If the temperature across all the machines goes up, maybe it’s time to upgrade the air conditioner.
Servers for the most part are pretty hard to kill; even Google experimented with leaving servers in the parking complex covered in a tarp. What you should be concerned with is deviation from what is “normal”. A steady rise in temperature could be an indication the server now has more work to do and may need upgrading. Depending on your setup, this might be an issue and may mean upgrading the infrastructure around your servers.
It’s worth keeping in mind a server is the sum of its parts. Recently my server had a hard disk disappear from the OS, and I was no longer able to mount it. On investigation it was still plugged in, but because of its location and lack of air flow it was too hot to hold. Even though all the other sensors were reporting temperatures and fan speeds that looked normal, this hard drive was so hot I had to leave the machine off for 10 minutes before I could hold it. Sometimes it’s worth just going round and doing a simple hand temperature test: there is operating hot and there is trouble hot. To detect this I now use smartctl, so I will get a warning before I slowly cook a 2TB drive. Below are the current temperatures of all the attached drives, using the Temperatures plugin to Server Density. Not very exciting, but combined with an alert it should stop me ruining another drive.
Monitoring the environment is about picking up anomalies. Every server is going to be different from available air flow to amount of work it has to do. You need to measure first and decide what is “normal” for your setup and then track any changes.
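One simple way to express "measure first, then track changes" is to take the average of an initial run of readings as the baseline and only raise a flag when several consecutive recent readings sit well above it, so that a single spike is ignored. The Python below is a rough sketch of that idea; the function name, window size and tolerance are arbitrary assumptions to tune for your own setup.

```python
from statistics import mean


def sustained_rise(samples, baseline_count, window, tolerance):
    """True if the latest `window` samples all exceed the baseline by more
    than `tolerance`.

    The baseline is the mean of the first `baseline_count` samples.
    """
    if len(samples) < baseline_count + window:
        return False  # not enough data to judge yet
    baseline = mean(samples[:baseline_count])
    recent = samples[-window:]
    return all(value > baseline + tolerance for value in recent)


# Illustrative temperature readings in °C, not real measurements.
spike = [45, 46, 45, 44, 45, 45, 60, 45, 46]
rising = [45, 46, 45, 44, 45, 52, 53, 54]

print(sustained_rise(spike, baseline_count=5, window=3, tolerance=5))
print(sustained_rise(rising, baseline_count=5, window=3, tolerance=5))
```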
There is a lot of great information on the internet about environment design and monitoring, from hot and cold aisles to using sea water to cool servers. If you need to convince your boss it’s worth some time, perhaps get them to consider the potential for money saving by being able to raise the thermostat a few degrees (knowing you can detect any issues before they result in hardware failure).