When unusual activity occurs in a cluster of servers, the first job is to eliminate variables, starting with small configuration differences between nodes. Often these are minor software version differences, but with modern config management tools like Puppet or Chef this is becoming less and less likely.
When that doesn’t reveal anything unusual, the next step is to look at the hardware. What follows is a classic example of why it pays to spot the details and to understand what is out of the ordinary in your infrastructure.
We hold weekly reviews of several performance indicators across our infrastructure. This doesn’t replace automated monitoring and alerting on those indicators, but it lets us spot small performance decreases over time, so that we can investigate issues within the infrastructure, schedule time for performance improvements to our codebase and plan for upgrades.
This is the last 3 months of data on one of those dashboards:
The odd performance values
Some time ago, soon after an upgrade on one of our clusters, it started to show this load profile:
This is a 4-server queue-processing cluster running on Softlayer dedicated hardware (SuperMicro, Xeon 1270 quad-cores, 8GB RAM). The entire software stack is built from the same source using Puppet, and our deploy process ensures that all cluster nodes run exactly the same versions.
Why was one of the servers showing lower load for the exact same work? We couldn’t account for the difference, so we asked Softlayer support:
There are no discernible differences between the servers
was the first answer we got.
The plot thickens
We weren’t happy with servers that should behave identically but didn’t, so we looked further into the matter and found another, more worrying issue. The 3 servers that showed the higher load were also suffering packet loss:
So we went back to Softlayer support. They were quite diligent and “looked at the switch/s for these servers, network speed, connections Established & Waiting, apache/python/tornado process etc…” but in the end came back empty-handed, except for a subtle difference in the cluster hardware: “all of the processors are Xeon 1270 Quadcores, -web4 is running V3 and is the newest; -web2 and -web3 is running V2; -web1 is running V1”.
When we order new servers, we pick the CPU type, but the ordering options don’t go down to the level of CPU version. The data center team delivers whatever they have ready.
After some research, we discovered that there were some potentially interesting differences between the CPU versions and so we decided to eliminate the hardware difference and see what would happen.
Softlayer usually accommodates special requests and we had no difficulty in getting this through.
The next graph shows the replacement of -web1, followed by -web2 and -web3. Can you see when it was done?
Then a similar plot for the cluster packet loss:
Switching all the servers to a consistent CPU and CPU version solved the problem. The packet loss disappeared and the performance equalised. This is a great example of a very subtle difference having a measurable impact on the operation of a server. Using config management allowed us to quickly eliminate a software cause, or at least one that we could control. It’s possible that the CPU version had some issue with the hardware drivers; whatever the cause, this illustrates how important consistency within a cluster is.
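A mismatch like this can be flagged automatically by comparing the CPU model string each node reports. The following is only a minimal sketch, not the tooling used in this investigation. It assumes Linux hosts whose /proc/cpuinfo exposes a `model name` field, and it assumes you have already collected that file's text from each node by whatever means you prefer. The function names and the host names in the usage note are hypothetical.

```python
from collections import defaultdict


def cpu_model(cpuinfo_text: str) -> str:
    """Return the first 'model name' value in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "model name":
            return value.strip()
    raise ValueError("no 'model name' field found")


def group_by_cpu(cpuinfo_by_host: dict) -> dict:
    """Map each distinct CPU model string to the sorted hosts reporting it."""
    groups = defaultdict(list)
    for host, text in cpuinfo_by_host.items():
        groups[cpu_model(text)].append(host)
    return {model: sorted(hosts) for model, hosts in groups.items()}


def is_consistent(cpuinfo_by_host: dict) -> bool:
    """True when every host reports the same CPU model string."""
    return len(group_by_cpu(cpuinfo_by_host)) <= 1
```

Given a dict such as `{"node-a": text_a, "node-b": text_b}`, `group_by_cpu` shows which nodes share a model string, and `is_consistent` could back a simple alert. Because the comparison is on the full string, two revisions of the same CPU family count as different, which is exactly the distinction that mattered here.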