The health of your infrastructure is not just about hardware, software, automations and uptime—it also includes the health and wellbeing of your team. Sysadmins are not super humans. They are susceptible to stress and fatigue just like everybody else.
Now here is the thing.
A superhero culture exists that places unreasonable expectations on Ops teams. While understandable, this level of expectation is neither helpful nor sustainable.
In our efforts to highlight the effects of this culture on sysadmins and their productivity, earlier this year we introduced HumanOps, a collection of principles that advance our focus away from systems, and towards humans. What’s more, we got everyone together in HumanOps events around the world (watch the talks below).
On May 19th we launched Sparklines for iOS, a great way of translating data and information into something a human can assimilate quickly. With system trends at their fingertips, sysadmins can now quickly decide whether to go home, or whether they can finish dinner before reaching for their laptop.
Then we turned our attention to interruptions. Context switching does not come free for humans, especially for tasks that require focus and problem solving, which is precisely what sysadmins do. When an engineer (or anyone, for that matter) is working on a complex task, the worst thing you can do is expose them to random alerts. It takes an average of 23 minutes to regain intense focus after being interrupted.
To alleviate that, we introduced Alert Costs, an easy way to measure the impact of incidents and alerts in human hours. This new quantifiable information allows engineering teams to measure and, most importantly, communicate the actual toll of systems on their human operators. And then do something about it.
But we couldn’t stop there. What about sleep? It’s no surprise that quality of sleep correlates with health, productivity and overall wellbeing; while sleep deprivation is associated with stress, irritability, and cognitive impairment. We cannot mitigate the human cost of on-call work without some sort of objective and relevant sleep metric we can measure. And that’s exactly what we did.
We launched Opzzz.sh, a tool that correlates sleep efficiency with incidents, and then visualises the human cost of on-call work.
Let’s talk HumanOps
HumanOps drives a lot of what we do here at Server Density. But HumanOps is not about one feature or even one company. There are several teams out there that are doing amazing work in this space. To bring all those efforts to the forefront, to celebrate, and to push the HumanOps agenda forward, we are organising HumanOps events around the world.
Over the last few months, for example, we hosted HumanOps meetups in San Francisco and London. In this series of articles, we will go through some of the highlights from the HumanOps efforts of companies like Spotify and Barclays, and organisations like GOV.UK. To give you a taste of what’s to come, here is the introductory video of our recent HumanOps event in London.
What follows are the key takeaways from the Barclays HumanOps talk, given by Portia Tung.
In her talk, Portia highlighted the importance of play at all stages of human development, in and out of the office. She pointed to research that demonstrates how play makes animals smarter and more adaptable, and how it helps them sustain social relationships and supercharge their creativity and innovation.
Portia Tung touched on things like the recommended daily amount of play (5 to 10 minutes minimum) and things like play deficiency and its effects on employees (hint: not good). Play is a natural and essential human need throughout our life, i.e. not just when we’re young. A productive, collaborative, and happy workplace is a playful workplace.
Check out Portia’s talk here:
And if you want the full transcript, then use the download link at the bottom of the article.
Stay tuned for more
HumanOps is a collection of principles that advance our focus away from systems, and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.
Did you enjoy the talk? Make sure you download and share the beautifully designed transcripts and stay tuned as next week we will be sharing some interesting work Spotify is doing in the HumanOps space.
Apache is a core constituent of the classic LAMP stack and a critical component of any web architecture, so it is a good idea to monitor it thoroughly.
Keep reading to find out how we monitor Apache here at Server Density.
Enabling Apache monitoring with mod_status
Most of the tools for monitoring Apache require the use of the mod_status module. This is included by default but it needs to be enabled. You will also need to specify an endpoint in your Apache config:
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
This will make the status page available at http://localhost/server-status on your server (check out our guide). Be sure to enable the ExtendedStatus directive to get full access to all the stats.
Monitoring Apache from the command line
Once you have enabled the status page and verified it works, you can use the command line tools to monitor the traffic on your server in real time. This is useful for debugging issues and examining traffic as it happens.
The apache-top tool is a popular method of achieving this. It is often available as a system package (e.g. apt-get install apachetop) but can also be downloaded from source, as it is just a simple Python script.
Apache monitoring and alerting – Apache stats
apache-top is particularly good at i) real time debugging and ii) determining what’s happening on your server right now. When it comes to collecting statistics, however, apache-top will probably leave you wanting.
This is where monitoring products such as Server Density come in handy. Our monitoring agent supports parsing the Apache server status output and can give you statistics on requests per second and idle/busy workers.
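With ExtendedStatus enabled, the status endpoint also serves a machine-readable version at http://localhost/server-status?auto, which is what monitoring agents typically parse. Here is a minimal sketch of that parsing step; the helper name and the sample values are ours for illustration, not the actual agent code.

```python
# Sketch: parse Apache's machine-readable mod_status output (?auto).
# In practice you would fetch it first, e.g.:
#   urllib.request.urlopen("http://localhost/server-status?auto").read().decode()

def parse_server_status(payload):
    """Turn the 'key: value' lines of ?auto output into a dict of strings."""
    stats = {}
    for line in payload.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        stats[key.strip()] = value.strip()
    return stats

# Example payload in the shape mod_status emits (values are invented):
sample = """\
Total Accesses: 845
Total kBytes: 1204
ReqPerSec: .0149
BusyWorkers: 1
IdleWorkers: 4
"""

stats = parse_server_status(sample)
print(float(stats["ReqPerSec"]))   # requests per second
print(int(stats["IdleWorkers"]))   # workers waiting for new requests
```

From here it is a small step to shipping those numbers to whatever time-series store or alerting pipeline you use.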
Apache has several process models. The most common one is worker processes running idle waiting for service requests. As more requests come in, more workers are launched to handle them—up to a pre-configured limit. Once past that limit all requests are queued and visitors experience service delays. So it’s important to monitor not only raw requests per second but idle workers too.
A good way to configure Apache alerts is by first determining what the baseline traffic of your application is and then setting alerts around it. For example, you can generate an alert if the stats are significantly higher (indicating a sudden traffic spike) or if the values drop significantly (indicating an issue that blocks traffic somewhere).
You could also benchmark your server to figure out at what traffic level things start to slow down. This can then act as the upper limit for triggering alerts.
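As a sketch of that baseline approach (the ±50% band here is an assumption you would tune to your own traffic, not a recommendation):

```python
# Sketch: alert when a metric deviates significantly from its baseline.
# The 50% band is an assumed starting point; tune it per application.

def check_against_baseline(value, baseline, band=0.5):
    """Return an alert reason, or None if the value is within the band."""
    if value > baseline * (1 + band):
        return "spike"   # sudden traffic surge
    if value < baseline * (1 - band):
        return "drop"    # something may be blocking traffic upstream
    return None

print(check_against_baseline(120, 100))  # within the band: no alert
print(check_against_baseline(300, 100))  # well above baseline
print(check_against_baseline(20, 100))   # well below baseline
```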
Apache monitoring and alerting – server stats
Monitoring Apache stats like requests per second and worker status is useful in keeping an eye on Apache performance, and indicates how overloaded your web server is. Ideally you will be running Apache on a dedicated instance so you don’t need to worry about contention with other apps.
Web servers are CPU hungry. As traffic grows Apache workers take up more CPU time and are distributed across the available CPUs and cores.
CPU % usage is not necessarily a useful metric to alert on, because the values tend to be reported on a per-CPU or per-core basis, and you probably have several of each. It’s more useful to monitor the average CPU utilisation across all CPUs or cores.
On Linux the CPU average discussed above is abstracted out to another system metric called load average. This is a decimal number rather than a percentage and allows you to view load from the perspective of the operating system i.e. how long processes have to wait for access to the CPU. The recommended threshold for load average therefore depends on how many CPUs and cores you have – our guide to load average will help you understand this further.
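As a quick illustration, you can derive a load-per-core figure from the primitives the OS exposes. Treating 1.0 as the worry threshold is a common rule of thumb rather than a hard limit, and this sketch assumes a Unix-like system:

```python
import os

# Sketch: compare the 5-minute load average against the number of cores.
# Load-per-core approaching 1.0 suggests processes are starting to queue
# for CPU time; the exact threshold to alert on is a judgment call.

def load_per_core():
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    return five_min / cores

if load_per_core() > 1.0:
    print("processes are waiting for CPU time")
else:
    print("load is within capacity")
```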
Monitoring the remote status of Apache
All those metrics monitor the internal status of Apache and the servers it runs on, but it is also important to monitor the end-user experience.
You can achieve that by using external status and response time tools. You need to know how well your Apache instance serves traffic from different locations around the world (wherever your customers are). Based on that, you can then determine at what stage you should add more hardware capacity.
This is very easy to achieve with services like Server Density because of our in-built website monitoring. You can check the status of your public URLs and other endpoints from custom locations and get alerts when performance drops or when there is an outage.
Sleep is not the black hole of productivity. It’s not this pesky hurdle we have to mitigate, minimise, or hack. It’s not a sign of weakness, laziness, or stupor. Skimping on sleep will not make us any more successful or rich.
Money Never Sleeps
Quite the opposite: Sleep is the pinnacle of productivity.
During those eight to ten hours our brain continues to spin along, sifting through tasks, reordering thoughts, opening pathways and forging new connections.
It’s no surprise that quality of sleep correlates with health, productivity and overall wellbeing; while sleep deprivation is associated with stress, irritability, and cognitive impairment.
As such, sleep is a personal as much as a business affair. A well rested human responds better to personal and business challenges alike.
The sheer impact sleep has on the quality of our work, team morale and decision making, should give us pause. We should be asking ourselves: How do we minimise stress and fatigue? We should be asking ourselves: How do we safeguard downtime and renewal? We should, but we don’t. We don’t because we have no data, no ammunition to prove what each of us intuitively knows.
We cannot mitigate the human cost of on-call work without some sort of objective and relevant sleep metric we can measure.
So that’s what we set out to do.
Finding an objective sleep metric was not hard. There are plenty of decent sleep trackers out there. But we also wanted them to be relevant. In particular, we wanted to quantify the impact on-call work has on our sleep. In other words, we wanted to marry two disparate worlds: the personal insights of sleep quality and the business insights of alerts and incidents.
Fitbit collects sleep information, while PagerDuty and Server Density store information about our incidents. What Opzzz does is connect the dots between sleep efficiency and incidents. By correlating sleep data with on call incidents, we can then illustrate the human cost of on-call work.
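To make the idea concrete, here is a hypothetical sketch of the arithmetic behind an Opzzz-style graph: pair each night's sleep efficiency with that night's incident count and compute a Pearson correlation. The numbers are invented for illustration, and this is not Opzzz's actual implementation.

```python
from math import sqrt

# Sketch: correlate nightly incident counts (e.g. from PagerDuty) with
# sleep efficiency percentages (e.g. from Fitbit). Data below is made up.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

incidents_per_night = [0, 3, 1, 5, 0, 2]
sleep_efficiency    = [96, 81, 90, 72, 95, 85]   # percent

r = pearson(incidents_per_night, sleep_efficiency)
print(round(r, 2))   # strongly negative: more incidents, worse sleep
```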
HumanOps is a collection of questions, principles and ideas aimed at improving the life of sysadmins. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.
At Server Density we’ve observed a strong correlation between human and system metrics. Reduced stress leads to fewer errors and escalations. Reduction in incidents and alerts leads to better sleep and reduced stress. Better sleep leads to better time-to-resolution metrics.
Unfortunately, the effects of on-call work on sleep quality are often ignored. They’re ignored because sleep happens out-of-hours and away from the office. But, most crucially, they are ignored because they’re not measured.
That’s why we built Opzzz.
Opzzz correlates sleep efficiency with incidents in a direct and measurable way. As a SaaS company with a global infrastructure, on call is a core constituent of what we do. So we appreciate and feel its effects on sleep quality, on wellbeing and productivity.
Opzzz is the clearest expression of our vision for HumanOps. And we’re only getting started. So, go ahead, create a free Opzzz account, start graphing incidents with sleep data, and let us know what you think.
When an engineer (or anyone, for that matter) is working on a complex task, the worst thing you can do is expose them to random alerts. It takes an average of 23 minutes and 15 seconds to regain intense focus after being interrupted.
The more noise your system generates the more inefficient and expensive it is. Problem is, given how busy most sysadmins are with various projects and daily issues, we rarely get the opportunity to pause and analyse what eats away at our time and attention the most.
Technology and infrastructure don’t feel fatigue, but their human operators do. Human fatigue is qualitative, cumulative, and often imperceptible to others. When it comes to communicating about it, we’re held back by our inability to quantify it. And herein lies the problem: if something cannot be measured, it’s harder to focus on and improve. So we wanted to come up with an indicator of human fatigue.
Our ultimate goal is to raise awareness of its cost; of that human toll that we’d otherwise forget.
Introducing Alert Costs
Alert Costs measures the impact of alerts (and incidents) in human hours. Armed with this knowledge, a sysadmin can look for ways to mitigate the types of alerts that create the most noise. It could be that an alert triggers more frequently than it should. Or it could be that when it does trigger, it takes a significant amount of time to resolve. This added clarity allows sysadmins to reduce interruptions, mitigate alert fatigue, and improve everyone’s on-call shift.
How did we estimate the cost of an alert?

1. Time the alert stays open: This is the simple bit. If the alert triggered once and took 30 minutes to resolve, the cost here is 30 minutes. If it triggered 10 times, staying open for 30 minutes each time, the overall cost is 30×10 = 300 minutes.

2. Context switch penalty: We add 23 minutes for every time the alert triggered. Humans are not machines. You can’t redirect their attention between tasks like a mechanical switch. Context switching does not come free for humans, especially for tasks that require focus and problem solving, which is precisely what sysadmins do. It takes a significant amount of time for humans to regain intense focus on their task once they’ve been interrupted. This is not exact science and figures differ from person to person and time of day.

Putting those together: if an alert triggered 25 times and stayed open for a total of 60 minutes, the overall cost for that particular alert would amount to 60 + 25×23 = 635 minutes.
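That arithmetic is simple enough to sketch in a few lines of Python. The 23-minute penalty comes from the figures above; the function name is ours, not part of the product.

```python
# Sketch of the Alert Costs arithmetic: total time the alert was open,
# plus a fixed context-switch penalty for every time it triggered.

CONTEXT_SWITCH_MINUTES = 23

def alert_cost_minutes(total_open_minutes, times_triggered):
    """Estimated human cost of an alert, in minutes."""
    return total_open_minutes + CONTEXT_SWITCH_MINUTES * times_triggered

# The worked example: 25 triggers, 60 minutes open in total.
print(alert_cost_minutes(60, 25))   # 635
```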
If you expand an alert you can see a list of events, one for each time the alert triggered. This allows you to dive into the detail and examine the circumstances of each particular event. Based on this information you can then fine-tune the thresholds for that alert, for example reducing the number of alerts you get by increasing the wait time.
It also sparks interesting conversations with your team: which 20% of alert types trigger 80% of the time? How long do those alerts last, and what does this mean for your productivity?
Alert Costs can reveal patterns that you may otherwise miss. For example, if a specific group of alerts triggers on a specific time each week, it could be due to a specific script. Extracting such insights is now much easier.
The alert costs table is entirely sortable. You can see which alerts fired most recently and check their configuration, or focus on duration to see which alert was open for the longest time.
We wanted to make it immediately obvious what’s important. Our focus was on clarity and being concise. Which alerts are over the threshold for the longest time? Which alert manifests into events most often? Can we adjust its configuration and reduce the probability of waking our engineers up?
In the future, we want to do things like offer different (higher) cost values for night alerts versus day alerts. Which part of our tech stack is taking the greatest toll on our engineers? How do we improve on-call quality?
We will add sparklines and offer even more ways to slice the data, in order to deduce (and reduce) the impact on your team.
HumanOps is a collection of principles that advance our focus away from systems, and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.
Alert Costs is one such feature. The aim of alert costs is to find a balance on how we assess the performance of our systems and improve the wellbeing of our people.
If you haven’t done so already, create an account with Server Density using the form below, take alert costs for a spin and stay tuned for even more HumanOps features in the not too distant future.
Let’s get the obvious out of the way. The tech industry has a serious and chronic diversity problem. The very industry that’s supposed to spearhead new ideas, innovation and progress, is woefully behind the times where it matters most. The heterogeneity of its people.
Tech workers are predominantly male and white, while non-white workers earn significantly less than their white counterparts. To make matters worse, an overwhelming majority of tech firms do not have gender-diverse senior management at the helm. And while there has been some welcome transparency in the last few years (annual diversity reports and so on), it has not been followed by any meaningful change in momentum. Minorities continue to be underrepresented and women continue to leave the tech industry at greater rates than their male peers.
What this indicates is that we cannot deal with diversity in the same way we tackle most problems in tech. In other words . . .
This is not a metrics problem
We can’t approach diversity as a hiring quota challenge, hard as that challenge may be. The diversity issue goes deeper than that. It’s a culture problem that starts from schooling and education before it expresses itself everywhere else, including boardrooms, office corridors and water cooler corners.
Within companies, diversity starts at the top.
Leadership is where culture is born and shaped. As a corollary, any investments in hiring can easily go to waste if the company is not driven by culturally diverse values. What good is hiring more people if the workplace cannot integrate and retain their talents?
And while we’re at it: what’s so good about diversity? Why do we want it? Is it because of an upcoming equal opportunity report? Are we paying lip service to diversity because that’s what everyone else is doing?
Behind most of those questions lies an inherent aversion to diversity. As if tech companies have to mitigate diversity, tacitly dismissing it as another cost of doing business. This is not only short-sighted (diversity takes time and effort) but it is also counterproductive since diversity is associated with creativity, innovation, and real economic benefits.
Diversity is Good Business
Ideas generated by people from different backgrounds are informed by different experiences, worldviews, and values. It’s great when ideas get the chance to cross-pollinate like this. As James Altucher says, you combine two ideas to come up with a better idea. A more diverse workplace is therefore a more fertile place for ideas.
Idea evolution works much faster than human evolution.
Now, here is the thing: ideas in diverse environments do not come easy. Why? Because diverse ideas tend to be different. Different (opposing) ideas have to be debated. They have to be weighed, discussed and decided upon. This lack of initial consensus, this creative friction, does not come free. The rigour and discipline involved in negotiating and distilling insights and action plans from a broad and varied pool of ideas comes with an upfront cost. But it bears fruit down the line. This requisite complexity translates into a more thought-out and “creatively hardened” product that has a better chance of surviving against other ideas in the marketplace.
In short, if you want to create new and better products—products that appeal to a broader audience—you should focus on creating a diverse company culture, starting from the top.
Our diversity journey
We live in an increasingly pluralistic society. The majority of our customers are outside the UK; they come from many different backgrounds. By having a more diverse team, we have a better chance of building something that appeals to our diverse customers.
Server Density launched in 2009, and for much of our first few years it was just a few of us building stuff. Diversity did not become a priority until our team was several engineers strong. Most of them work remotely from various parts of Europe and the UK. Having multilingual folks from different geographies and cultures working in the same team is an incredible creative catalyst for everyone. Our product couldn’t be what it is today if we didn’t have all those different perspectives.
In line with the overall industry, however, the percentage of female engineers in our team is lower than we would like. We took some time to study this challenge and observe what other companies have done. We wanted to address this now, while our company and culture were in their formative years, realising that any change would be exponentially harder to make a few years down the line.
So here is what we did.
Avoid gender-coded job ads
It turns out that power words (driven, logic, outspoken) are more masculine and attract male candidates, while warmer ones (together, interpersonal, yield) encourage more women to apply. We now use online analysis tools to scan all our job ads and suggest changes before we publish them.
Another problem, as illustrated by a Harvard Business Review article, is that women tend to avoid applying for roles they are not 100% qualified for, whereas men go ahead and apply anyway. To cater for that behaviour we try to remove as many self-selection criteria as possible. We want to be the ones deciding whether a candidate is qualified enough, not them. Even if it means more work and delays in filling open positions.
Avoid unconscious bias
As part of the hiring process, we ask all our candidates to take a writing test followed by a coding exercise. When we review those, the name of the candidate is now hidden, in order to avoid unconscious bias in assessing those tests.
Encourage a diverse culture
The next, and harder, step involves fostering a culture that encourages diverse ideas. We thought long and hard about this. How do you make sure everyone gets a chance to steer the direction of our company and have a voice when it comes to what features we invest in?
While we are still navigating those questions, we’ve already started making targeted adjustments in how we collaborate. For example, we started running planning games, a regular forum where we plan our engineering efforts. Everyone has an equal voice in this meeting, and we review and vote on all ideas based on merit. We stand up and defend what we think, and we support and encourage everyone to participate.
We also reviewed our employee handbook, including all company policies, and made significant changes to ensure they are as inclusive as they can be. Many of our policies (equal opportunities, hiring/selection, complaints procedure, code of conduct and maternity/paternity leave) used to be informal. We found that just having them written down, and being able to point to them during our recruitment efforts, has a tangible impact. It shows you’ve at least thought about it.
So we codified our policies in a systematic manner, using pull requests so the proposed format could be discussed by everyone. As an example, if someone feels unable to escalate an issue to their manager, we now have alternative routes in place, including members of the board if needed, and in full confidence.
As with most worthwhile things, the hardest step is the first one. Going from zero to one employee in underrepresented demographics is invariably undermined by the assumption that if you don’t have diversity, it’s like that for a reason.
In response to that, we rely quite heavily on referrals as a way to proactively reach out to competent candidates from diverse backgrounds. Obviously that is a short-term measure, and ideally we should gain traction with all demographics sooner rather than later. Having a diverse culture allows you to tap into a broader talent pool, internally and externally. As CTO of Busuu, Rob Elkin, put it, “We just want to make sure that the process for showing that someone should be part of the team is as open and fair as possible.”
We have also started to sponsor and participate in various industry events that encourage diversity (e.g. COED Code). On top of that, we are looking to broaden where we place our engineering job ads. So far we’ve been publishing them on stackoverflow but we want to reach further and wider.
The Canadian cabinet consists of 30 ethnically and religiously diverse ministers, evenly split between women and men who are mostly aged under 50. While we don’t plan to relocate to Canada just yet, it certainly serves as a great example of leadership that is inclusive and representative of as many people as possible.
At Server Density we don’t tackle diversity with a single-minded metrics driven approach. This is not a numbers problem as much as it is a culture problem. It’s not so much about putting a tick in a box as it is about i) understanding the challenge ii) internalising the benefits of diversity and iii) making strategic and nuanced changes in the way we lead our people.
A truly diverse culture is not a compromise. It couldn’t be. It’s a long-term investment into the fundamentals of our team and our future prospects as a company.
How do you measure and track the human cost of out-of-hours incidents? How do you keep your systems running around the clock without affecting the health of the teams behind those systems?
On May 19th, and in our quest to address those questions, we sponsored the very first HumanOps Meetup here in London.
It was a great debate.
Francesc Zacarias, SRE engineer at Spotify, and Bob Walker, Head of Web Operations at GDS GOV.UK, spoke about their on call approach and how it evolved over time.
Here is a brief summary:
Spotify On Call
According to Francesc, Spotify Engineering is a cross-functional organisation: each engineering team includes members from disparate functions, which means each team is able to fully own the services it runs.
Spotify is growing fast. From 100 services running on 1,300 servers in 2011, they now have 1,400 services on 10,000 servers.
In the past, the Spotify Ops team was responsible for hundreds of services. Given how small their team was (a handful of engineers) and how quickly new services were appearing, their Ops team was turning into a bottleneck for the entire organisation.
While every member of the Ops team was an expert in their own specific area, there was no sharing between Ops engineers, or across the rest of the engineering organisation.
You were paged on a service you didn’t know existed because someone deployed and forgot to tell you.
Francesc Zacarias, Spotify Engineering
With only a handful of people on call for the entire company, the Ops team were getting close to burnout. So Spotify decided to adopt a different strategy.
Redistribution of Ownership
Under the new Spotify structure, developers now own their services. In true devops fashion, building something is no longer separate from running it. Developers control the entire lifecycle, including operational tasks like backup, monitoring and, of course, on call rotation.
This change required a significant cultural shift. Several folks were sceptical about this change, while others braced themselves for unmitigated disaster.
Plenty of times I reviewed changes that if we hadn’t stopped, would have caused major outages.
Francesc Zacarias, Spotify Engineering
In most instances, however, it was a case of “trust but verify.” Everyone had to trust their colleagues, otherwise the new structure wouldn’t take off.
Now both teams move faster.
Developers are not blocked by operations because they handle all incidents pertaining to their own services. They are more aware of the pitfalls of running code in production because they are the ones handling production incidents (waking up to alerts, et cetera).
They are also incentivised to put sufficient measures in place. Things like monitoring (metrics and alerts), logging, maintenance (updating and repairing) and scalability are now key considerations behind every line of code they write.
In the event of an incident that touches multiple teams, the issue is manually escalated to the Incident Manager On Call aka IMOC (in other companies this is called “Incident Commander”). The IMOC engineer is then responsible for: i) key decisions, ii) communication between teams, and iii) authoring status updates.
IMOC remains in the loop until the incident is resolved and a blameless post mortem is authored and published.
By the way, Spotify has adopted what they refer to as a “follow the sun” on-call rotation. At the end of a 12-hour shift, the Stockholm team hands over their on-call duties to their New York colleagues.
GOV.UK On Call
GOV.UK is the UK government’s digital portal. Bob Walker, Head of Web Operations, spoke about their recent efforts to reduce the amount of incidents that lead to alerts.
After extensive rationalisation, they’ve now reached a stage where only 6 types of incidents can alert (wake them up) out of hours. The rest can wait until next morning.
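A hypothetical sketch of that kind of filter: a small whitelist of incident types may page out of hours, while everything else waits for the morning. The type names and the working-hours window below are invented for illustration, not GOV.UK's actual configuration.

```python
from datetime import time

# Sketch: only whitelisted incident types may wake someone out of hours.
# Assumed values: the type names and a 09:00-18:00 working-hours window.

WAKEABLE = {"site_down", "cdn_failure", "security_breach"}
WORK_START, WORK_END = time(9, 0), time(18, 0)

def should_page(incident_type, now):
    in_hours = WORK_START <= now <= WORK_END
    if in_hours:
        return True                       # primary support handles it
    return incident_type in WAKEABLE      # only critical types wake people

print(should_page("disk_warning", time(3, 0)))   # can wait until morning
print(should_page("site_down", time(3, 0)))      # wake someone up
```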
Their on-call strategy is split into two lines.
Primary support is on call during work hours: two members of staff deal with alerts, incidents and any urgent requests. The rotation comprises 28 full-time employees, most of whom start on primary support until they upskill enough to graduate to second-line support. Second-line support is nine engineers strong, and they are on call out of hours.
GOV.UK mirrors their website across disparate geographical locations and operates a managed CDN at the front. As a result, even if parts of their infrastructure fail, most of their website should remain available.
Once issues are resolved, GOV.UK carries out incident reviews (their own flavour of post mortems). In reiterating the importance of blameless post mortems, Bob said: “you can blame procedures and code, but not humans.”
By the way, every Wednesday at 11:00 they test their paging system. The purpose of this exercise is not only to test their monitoring system but also to ensure people have configured their phones to receive alerts!
One of the highlights of the recent HumanOps event was “on call work” and how different companies approach it from their own unique perspective.
There seem to be two overarching factors guiding on-call strategy: i) the nature of the service offered and ii) organisational culture.
Do you run on microservices and have different teams owning different services? You could consider the Spotify approach. On the other hand, if you can simplify your service and convert most assets into static content on a CDN, then the GOV.UK strategy might make more sense.
While no one size fits all, successful ops teams seem to have the following things in common:
They empower their people and foster a culture of trust.
They tear down silos and cross-pollinate knowledge across different teams.
Their culture shapes their tools, not the other way around.
They increase on-call coverage and reduce on-call assigned time.
Redis is a key-value database, and one of the most popular NoSQL databases out there. Redis (REmote DIctionary Server) works in a similar fashion to memcached, albeit with a non-volatile dataset.
The dataset is stored entirely in memory (one of the reasons Redis is so fast) and it is periodically flushed to disk so it remains persistent.
Redis also provides native support for manipulating and querying data structures such as lists, sets and hashes.
There are several well-known companies using Redis, including Twitter, GitHub, and Snapchat. While Redis is open source, there is good commercial support for it and some companies offer it as a fully managed service.
Typical use cases include:
Leaderboards/Counting: Redis is effective at incrementing scores or presenting the hall of fame in games. Here at Server Density we use it to set security limits on our API endpoints.
Queues: Redis is often used to build message/job queues either with the native RPOPLPUSH command or with a language-specific library like RestMQ, PythonRQ, and RedisMQ.
Session cache: Redis has an LRU (Least Recently Used) key eviction policy, which makes it a natural fit for session storage.
Full page cache: PHP platforms such as Magento often use Redis in addition to an OpCode cache such as Zend OpCache.
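To make the queue use case above more concrete, here is a minimal sketch of the reliable-queue pattern that RPOPLPUSH enables. To keep it runnable without a Redis server, the `FakeRedis` class below is a hypothetical in-memory stand-in for two Redis lists; against a real server you would issue the same commands via a client library.

```python
from collections import deque

class FakeRedis:
    """Tiny in-memory stand-in for Redis lists, used only to
    illustrate the reliable-queue pattern behind RPOPLPUSH."""
    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpoplpush(self, source, dest):
        src = self.lists.get(source)
        if not src:
            return None
        value = src.pop()  # RPOP from the tail of the source queue
        self.lists.setdefault(dest, deque()).appendleft(value)  # LPUSH onto the processing list
        return value

    def lrem(self, key, value):
        try:
            self.lists.get(key, deque()).remove(value)
        except ValueError:
            pass

r = FakeRedis()
r.lpush("jobs", "send-email")
job = r.rpoplpush("jobs", "jobs:processing")  # atomically claim a job
# ...do the work, then acknowledge by removing it from the processing list:
r.lrem("jobs:processing", job)
```

The point of the pattern is that a job is never lost: it moves atomically from the pending queue to a processing list, and is only removed once the worker confirms completion.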
We will now take a look at the most important metrics and alerts for monitoring Redis, using free tools, or Server Density.
Monitor Redis: metrics and alerts
Even in a simple service like Redis, there is no shortage of metrics you could monitor. The key to successful monitoring is to select the very few we care about; and care enough to let them pester us with alerts and notifications.
Our rule of thumb here at Server Density is, “collect all metrics that help with troubleshooting, alert only on those that require an action.”
Same as with any other database, you need to monitor some broad conditions:
Required processes are running as expected
System resources usage is within limits
Queries are executed successfully
Service is performing properly
Typical failure points
Let’s take a look at each category and flesh them out with some specifics.
1. Redis process running
These alerts will let us know if something basic is not in place, like a daemon not running or respawning all the time.
Right binary daemon process running.
When process /usr/sbin/redis count != 1.
We want to make sure the service is not respawning all the time.
When uptime < 300s.
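The two checks above can be sketched as a small evaluation function. This is a hypothetical helper, not Server Density's agent code; the inputs (process count, uptime in seconds) would come from your process monitor and from the `uptime_in_seconds` field of Redis INFO.

```python
def process_alerts(process_count, uptime_seconds):
    """Evaluate the two basic liveness checks described above.
    Thresholds follow the article: exactly one daemon, uptime >= 300s."""
    alerts = []
    if process_count != 1:
        alerts.append("process count is %d, expected 1" % process_count)
    if uptime_seconds < 300:
        alerts.append("uptime is %ds, service may be respawning" % uptime_seconds)
    return alerts
```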
2. System Metrics
The metrics listed below are the “usual suspects” behind most issues and bottlenecks. They also correspond to the top system resources you should monitor on pretty much any in-memory DB server.
An all-in-one performance metric. A high load will lead to performance degradation.
When load is > factor x (number of cores). Our suggested factor is 4.
High CPU usage is not a bad thing as long as you don’t reach the limit.
RAM usage depends on how many keys and values we keep in memory. Redis should fit in memory with plenty of room to spare for the OS.
Swap is for emergencies only. Don’t swap. A bit of swap is always in use, but if that usage grows, it’s an indicator of performance degradation.
When used swap is > 128MB.
Traffic is related to the number of connections and the size of those requests. Useful for troubleshooting but not for alerting.
Make sure you always have free space for new data, logs, temporary files, snapshots and backups.
When disk is > 85% usage.
Hard disk I/O is the most common bottleneck in database servers. Thankfully, that is not the case for Redis, since all operations are performed in memory and only occasionally written asynchronously to permanent storage.
3. Monitoring Redis availability and queries
These metrics will inform you if Redis is working as expected.
Number of clients connected to Redis. Typically your application nodes rather than end users.
When connected_clients < minimum number of application/consumers on your stack.
Total number of keys in your database. Useful when compared to hit_rate in order to help troubleshoot any misses.
Number of commands processed per second.
hit rate (calculated)
keyspace_hits / (keyspace_hits + keyspace_misses)
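The hit-rate formula above is trivial to compute from the INFO counters. A minimal sketch, with made-up example figures:

```python
def hit_rate(keyspace_hits, keyspace_misses):
    """Hit rate as defined above; returns None when there has been
    no traffic yet, to avoid dividing by zero."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else None

# Made-up figures: 9,800 hits against 200 misses gives a 98% hit rate.
print(hit_rate(9800, 200))  # 0.98
```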
Unix timestamp for last save to disk, when using persistence.
When rdb_last_save_time is > 3600 seconds (or your acceptable timeframe)
Number of changes to the database since last dump. Data that you would lose upon restart.
Number of slaves connected to this master instance
When connected_slaves != the number of slaves in your cluster.
Seconds since last interaction between slave and master
When master_last_io_seconds_ago is > 30 seconds (or your acceptable timeframe)
4. Monitoring Redis performance
Average time it takes Redis to respond to a query.
When latency is > 200ms (or your max acceptable).
Memory used by the Redis server. If it exceeds physical memory, the system will start swapping, causing severe performance degradation. You can configure a limit with the Redis maxmemory setting for cache scenarios (you don’t want to evict keys in database or queue scenarios!)
Compares Redis memory usage to Linux virtual memory pages (mapped to physical memory chunks). A high ratio will lead to swapping and performance degradation.
When mem_fragmentation_ratio is > 1.5
Number of keys removed (evicted) due to reaching maxmemory. Too many evicted keys means that new requests need to wait for an empty space before being stored in memory. When that happens, latency will increase.
None by default, but if you use TTLs to expire keys and don’t expect evictions, you could alert when evicted_keys is > 0.
Number of clients waiting on a blocking call (BLPOP, BRPOP, BRPOPLPUSH).
5. Monitoring Redis errors
Number of connections rejected due to hitting maxclient limit (remember to control max/used OS file descriptors)
When rejected_connections > XX, depending on the number of clients you might have.
Number of failed lookups of keys
Only relevant when not using blocking calls (BRPOPLPUSH, BRPOP and BLPOP); alert when keyspace_misses > 0.
Time (in seconds) that the link between master and slave was down. When a new reconnect happens, the slave will send SYNC commands which will impact master performance.
When master_link_down_since_seconds is > 60 seconds.
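Several of the thresholds suggested in sections 4 and 5 can be checked in one pass over the parsed INFO fields. The sketch below is illustrative only; the threshold values are the article's suggestions and should be tuned to your own stack.

```python
# Suggested thresholds from this article; tune them to your own stack.
THRESHOLDS = {
    "mem_fragmentation_ratio": 1.5,
    "master_link_down_since_seconds": 60,
    "rejected_connections": 0,  # alert on any rejected connection
}

def breached(info):
    """Return the metrics (from a dict of parsed INFO fields)
    that exceed their suggested thresholds."""
    return {m: info[m] for m, limit in THRESHOLDS.items()
            if m in info and info[m] > limit}

sample = {"mem_fragmentation_ratio": 1.8, "rejected_connections": 0}
print(breached(sample))  # {'mem_fragmentation_ratio': 1.8}
```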
Redis Monitoring Tools
There are quite a few options out there. These are the ones we know of. Please chime in if we’ve missed something obvious here:
redis-cli info command
redis-cli comes with the INFO command, providing the most important information and statistics about the Redis server.
As the output is pretty long, it has been divided into several sections:
server: General information about the Redis server
clients: Client connections information
memory: Memory usage information
persistence: Persistence (RDB and AOF) related information
stats: General statistics
replication: Master/slave replication information
cpu: CPU usage statistics
commandstats: Redis commands statistics
cluster: Redis Cluster information (if enabled)
keyspace: Database (key expiration) related statistics
An optional parameter can be used to select a specific section of information, for example redis-cli info memory.
In the output, every line will either contain a section name (starting with a # character) or a property. The meaning of each property field is described in detail in the Redis documentation. In this article we will look at the most important ones to monitor.
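Since INFO emits simple key:value lines, it is straightforward to parse programmatically. A minimal sketch, run here against a sample fragment of INFO output rather than a live server:

```python
def parse_info(raw):
    """Parse redis-cli INFO output into a flat dict.
    Section headers start with '#'; properties are key:value pairs."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(":")
            info[key] = value
    return info

sample = "# Stats\r\nkeyspace_hits:9800\r\nkeyspace_misses:200\r\n"
print(parse_info(sample))  # {'keyspace_hits': '9800', 'keyspace_misses': '200'}
```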
redis-cli monitor command
This one displays every command processed by the Redis server. It is used either for spotting bugs in an application, or for generic troubleshooting. It helps us understand what is happening to the database. It seriously affects performance, however, so it’s not something you want to run all the time.
While redis-cli info is a great interactive / realtime tool, you still need to set up some alerts so that you get notified when things go wrong. You may also want to record various metrics over time in order to identify trends. What follows is a list of available options for doing that.
redis-stat is a simple Redis monitoring tool written in Ruby. It tracks Redis performance in a vmstat-like terminal output or as a web-based dashboard. It is based on the INFO command, which means it shouldn’t impact the performance of the Redis server. redis-stat shows CPU and memory usage, commands, cache hits and misses, expires and evictions, amongst other metrics.
Redmon is a simple Ruby dashboard based on the Sinatra framework. While its monitoring function is not as complete as redis-stat, Redmon comes with invaluable management features like server configuration and access to the CLI interface through the web UI.
Redis Live is a monitoring dashboard written in Python and Tornado. It comes with a number of useful widgets covering memory usage and various commands. It also shows the top used Redis commands and keys. Redis Live uses a database backend to store metrics over time, sqlite being the default choice.
redis-faina is a query analyzer created by the folks at Instagram. It parses the MONITOR command for counter/timing stats, and provides aggregate stats on the most commonly-hit keys, the queries that took up the most amount of time, and the most common key prefixes as well. You can read more about redis-faina here.
collectd, the popular metrics collection agent, also has a couple of plugins for Redis: the upstream Redis Plugin, written in C, and redis-collectd-plugin, written in Python. Both support multiple Redis instances/servers, but the Python version supports a few more metrics, notably replication lag per slave. collectd is just the agent; it can be connected to different monitoring systems.
We mentioned the Percona Monitoring templates before, when we talked about MySQL monitoring. These templates add default graphing and alerting configuration to existing on-premise monitoring solutions like Nagios, Cacti or Zabbix.
Nagios, Icinga and their likes also support Redis monitoring. The Nagios community shares its creations on Monitoring Plugins, previously Nagios Exchange. You will find multiple plugins there, as different people write their own in Python, Ruby or Node.
If all that sounds too onerous, and you have other, more pressing priorities, then maybe you should leave server monitoring to the experts and carry on with your business.
This is where we shamelessly toot our own horn.
Server Density offers a user interface (we like to think it’s very intuitive) that supports tagging, elastic graphs and advanced infrastructure workflows. It plays well with your automation tools and offers mobile apps too.
So if you don’t have the time to set up and maintain your own on-premise monitoring and you are looking for a robust hosted solution that covers Redis (and the rest of your infrastructure), you should sign up for a 2-week trial of Server Density.
Who monitors Redis with Server Density
One of our customers, Tooplay, uses Redis for their ad serving and analytics measurements. They told us they almost never have to worry about response times because “Redis is blazing fast.”
They don’t use Redis persistence, and their smallest Redis instance holds 10GB of data in memory. Old data is evicted upon expiration after one day (via TTL settings). Their Redis cluster is monitored with Server Density.
What about you? Do you have a checklist of best practices for monitoring Redis? What memory databases do you have in your stack and how do you monitor them? Any books you can suggest?
The lowly incident status update happens to be one of the most essential pieces of communication a company gets to write.
When users navigate to a status page, they’re driven by a heightened sense of urgency (compared to, say, a website, a blog, or a newsletter). Not many words get as dissected, discussed and forwarded as the ones we place on our status page.
Now let’s state the obvious. Customers couldn’t care less about a string of words posted on a status update. What they care about is, “am I in good hands?” Every time we publish (or fail to publish) a service status update we are ultimately answering that question.
So how do you go about writing status updates that send the right message to customers?
1. Write frequent status updates
First and foremost, good status updates are frequent.
Some companies send them as often as every 20 minutes. Whatever frequency you decide upon, make sure you set accurate expectations. If you intend to spread your status updates over longer intervals, let users know in advance. Never leave them hanging, wondering, not knowing what’s going on. That is not a great customer journey.
2. Well written status updates
You don’t need to have “a voice” in order to write great status updates. You just need to be authoritative. What does authoritative mean?
To start with, authoritative means honest. An honest service update is, by definition, fearless. Nothing betrays fear like ten-dollar weasel phrases (“we apologise that our provider . . .”) or passive wording that shirks responsibility (“it was decided that . . .”). Authoritative writing does the opposite. It embraces responsibility, and opens up to all the learnings it bestows.
“Express that opinion clearly, gracefully and empathetically.”
Well written status updates are brief and deceptively simple. Deceptively because it’s not that easy. To make your status updates simple for your users, you need to break complex concepts down to just a few key words. Before you start editing your words, you need to edit your thinking. Separate the essential from the inessential. You can’t write a good service status update until you’re clear about what you know and what you don’t.
What we learned early on was that regular and well-written status updates reduce the amount of incoming support requests. Investing the time to get incident updates right was paying productivity dividends for the rest of the team.
Eventually we transitioned to a dedicated status page, hosted by a 3rd party, separate from our systems. As the team grew, responsibility for status updates shifted to the engineer on call, and to our support folks too. The one thing that hasn’t changed, though, is how we write those status updates.
We only state the facts. We avoid flimsy assumptions (“we think”), tacky remorse (“we apologize”), useless adverbs (“currently”) and generic drivel (“in the process of”). If we have no clue what’s going on, we don’t pretend otherwise. If there is something we do know—what services are operational, for example—we make sure we mention them.
The primary goal of our status updates is to be there for our customers (#1: frequent) and also indicate that they’re in good hands (#2: well written).
The vast majority of our users are technically savvy. They want to have as much detail about the outage as possible, so they can make their own assessments. By including specific and relevant facts in the status update, we satisfy that need and reduce incoming service requests too.
Authoring those regular, one-sentence-long text bites, is a great way to keep customers and team members in the loop.
By the way, if we cannot summarise everything in a single sentence, chances are we don’t know what we’re doing, and probably have no plan of action. The rigour involved in describing the problem in a few short words helps us inch closer to resolution.
When faced with service interruptions, we drop everything and perform operational backflips 24×7 until the service is restored for all customers.
During this time, over-communication is a good thing. As is transparency, i.e. acknowledging problems and throwing the public light of accountability on all remaining issues until they’re resolved.
While the crisis is unfolding we publish short status updates at regular intervals. We stick to the facts, including scope of impact and possible workarounds. We update the status page even if it’s just to say “we’re still looking into it.”
Once the incident is resolved, it’s time to turn our focus to the less urgent, but equally important, piece of writing: the postmortem. Communicating detailed postmortems helps restore our credibility with our users. It demonstrates that someone is investing time in the product. That they care enough to sit down and think things through. Most crucially, it also creates the space for our team to learn and grow as a company.
What about you? How do you handle service status updates, where do you host them, and who is accountable for them?
Stories are narratives. They take data (events) and interpret them by placing them into context (timeline).
Unlike computers, most humans aren’t great at assimilating raw data. In order for us to understand (remember and emote to) data, we need context. A single data point, accurate as it may be, doesn’t communicate nearly as much as its relationship to other data points does. June revenue numbers mean nothing unless we stack them next to revenue numbers for April and May. We now have a progression, a storyline. It starts to make sense.
That’s where data visualisation comes in. Data visualisation is the art and craft of presenting data into context. It’s about creating meaning. And that’s great because meaning is what makes humans tick.
It’s all in the detail
Some charts are brimming with detail. To appreciate those graphs, you’d need to invest the time to notice, observe, and enjoy the detail.
Other charts have zero detail (no axes, labels, or legends). The value of such graphs is in how quickly you can extract information from them. See the sparkline in the watchface below?
That’s what Sparklines (or sparkcharts) do. They condense charts into smaller expressions. It’s no coincidence that one of the first commercial applications for sparklines was in stock trading. Sparklines provide immediate trending information, and that’s precisely what high-powered stock brokers wolf down 24×7.
Now, you don’t need to be a hero of a Bret Easton Ellis novel to appreciate sparklines. Sparklines are everywhere these days. From weather trends, production rates, website traffic, and Yankees’ current season results, everything can be presented with sparklines.
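To see just how little machinery a sparkline needs, here is a minimal sketch that renders a series of numbers as a one-line chart using Unicode block characters. This is an illustration of the idea, not the rendering code in our iOS app.

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Render a series of numbers as a one-line Unicode sparkline,
    scaling each value into one of eight block heights."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero on flat data
    return "".join(BARS[int((v - lo) * (len(BARS) - 1) / span)] for v in values)

print(sparkline([1, 2, 4, 8, 4, 2, 1]))  # ▁▂▄█▄▂▁
```

Seven numbers collapse into seven characters, yet the spike in the middle is instantly visible; that compression is the whole appeal.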
Smartphones are not lean-forward devices. You wouldn’t analyse data, set up new alerts, or troubleshoot incidents by thumbing around on a 4-inch display. What “sysadmins on the move” need is an answer to a simple question.
What is happening?
Sparklines are a perfect match for the iPhone because they offer instant visual cues. With system trends at their fingertips, sysadmins can quickly decide whether to go home, or whether they can finish dinner before reaching for their laptop.
With sparklines you can tell at a glance whether the alert is the result of a slow climb up, or a sudden spike. In human terms, sparklines are the EKG for your server’s health. You can instantly tell what the status of your open (and recently closed) alerts is.
To add context, sparklines include what precedes and what follows an alert trigger. For open alerts, we display the same amount of data points before and after the alert trigger (with a minimum of 30 minutes). For closed alerts, the alert duration is half of the depicted interval, and we distribute the rest of the time before and after the alert.
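The windowing rules above reduce to a little arithmetic. The sketch below is a hypothetical helper (not our app code), with times given as Unix seconds, showing how an alert's display window could be derived:

```python
def alert_window(trigger, resolved=None, now=None, minimum=1800):
    """Sketch of the sparkline windowing rules described above.
    All times are Unix seconds; 'minimum' is 30 minutes."""
    if resolved is None:
        # Open alert: show as much data after the trigger as before it,
        # with at least 30 minutes on each side.
        now = now if now is not None else trigger + minimum
        span = max(now - trigger, minimum)
        return trigger - span, trigger + span
    # Closed alert: the alert spans half the window; the remaining
    # half is split evenly before and after it.
    pad = (resolved - trigger) // 2
    return trigger - pad, resolved + pad

start, end = alert_window(1000, resolved=2000)
print(end - start)  # 2000: the 1000s alert occupies half the window
```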
The number one engineering priority for sparklines was performance. Sparklines should work fast, without affecting the overall glanceability of the app.
We did not want to penalise performance in any way. In fact, some of our earlier iterations were binned because they did not meet our responsiveness criteria. We tested sparklines on everything from an iPhone 6s to a lowly iPod Touch. We didn’t release the feature until it rendered well on every device.
So here is the challenge.
How do you render several (we tested with hundreds of) concurrent sparklines, each comprising multiple data points, and refresh them every 60 seconds, without missing a beat?
Enter Operation Queues
Operations and Operation Queues allow for a network of tasks, divided into smaller interconnected (chained) tasks, executed in a predefined order.
We first fetch the list of triggered alerts from our API. Once we confirm they’re of the right type (a device alert with a numeric metric) we initiate a “chain” of Operations. These operations happen on the fly. As the user scrolls down the list of alerts, each corresponding operation jumps onto the Operation Queue.
We then define dependencies between Operations. For example, one Operation fetches the last metric value for a given alert, while another renders those values. The latter cannot work without data provided by the former, so we create a dependency to accommodate that.
Operations execute in the background as system resources allow. In fact, the nice part about Operation Queues is that they run atop Grand Central Dispatch. GCD figures out how many concurrent tasks to run, taking into consideration current system load and available hardware resources.
Running off the main thread
It’s common knowledge that we are not supposed to manipulate the UI unless we’re running on the main thread. A less known fact is that you don’t need to be on the main thread to manipulate graphics. We render our charts off the main thread, and only jump back to the main thread with a finished chart image ready to show up in the list of alerts.
We follow this process with every data refresh, and what we end up with is an up-to-the-minute, comprehensive, information-rich view of system health.
Sparklines is a HumanOps feature
We recently introduced HumanOps, a set of principles aimed at improving the life of sysadmins. HumanOps features create bridges between systems and humans.
Sparklines is one such feature. It condenses significant amounts of information in a small chart that sysadmins can digest at a glance. It saves them time, energy and, most crucially, it preserves their attention.
We also want to hear from you. How do you employ sparklines in your own dashboards, reports or presentations? How do you add context to your data, and how do your requirements change as you switch from desktop to mobile?
The ability to problem solve, right? We’re not talking about sudoku and crosswords here. Errors and delays can cost millions. With scale comes complexity, and an exponential increase in things that could go south. In production. At four in the morning.
And here lies the challenge. Sysadmins are not superhumans. They are susceptible to stress and fatigue just like everybody else.
We know that prolonged stress is detrimental to health. We also know that fatigue impairs our ability for basic problem solving. A diminished problem solving capacity may not pose a problem in jobs dictated by the traditional metrics of productivity, i.e. output per hour. But for those jobs where ideas and innovative solutions are required, productivity is a rather poor measure of success.
It’s hard to shoehorn some of the most important things we do in life into the category of “being productive.”
How do we minimise interruptions? How do we safeguard downtime and renewal? How do we minimise stress and fatigue? How do we build software that is more in line with how the human brain works?
HumanOps is a collection of principles that address those questions. It advances our focus away from systems, and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.
At Server Density we’ve observed a strong correlation between human and system metrics. Reduced stress leads to fewer errors and escalations. Reduction in incidents and alerts leads to better sleep and reduced stress. Better sleep leads to better time-to-resolution metrics.
What’s the average number of interruptions and wake-ups our engineers experience per month? How many late shifts and weekend calls do they get?
As software makers, we have significant opportunity and responsibility here. How do you spot issues before they cause downtime? How do you reduce incidents and mitigate stress? How do you present this data in a more intuitive way?
Here is a wireframe for an upcoming Server Density feature called alert history. Notice the Cost column? It measures the cost of incidents in actual human hours.
Below is a preview of an upcoming feature for iOS, called sparklines. Sparklines condense full-blown charts into smaller inline expressions that illustrate trends. Sparklines are a perfect match for the iPhone because they offer visual cues about what’s happening, allowing sysadmins to quickly decide whether to go home, or whether they can finish dinner before reaching for their laptop.
We will expand on this, and many more, HumanOps features in the near future. The important thing to remember is that HumanOps features create bridges between systems and humans. And present information in a way that is easy for humans to pick up at a glance.
The anxiety associated with being available out of hours stems from the lack of control. It doesn’t matter if the phone rings or not. Being on call and not being called is, in fact, more stressful than a “busy” shift. It is this non-stop vigilance, having to keep checking for possible “threats,” that is unhealthy.
How do you restore the feeling of control? How do you measure and track the human cost of out-of-hours incidents and escalations? All those considerations fall squarely under the HumanOps agenda.
We want to hear from you
HumanOps is a collection of questions, principles and ideas aimed at improving the life of sysadmins.
A challenge like this could never be tackled by one engineer, team, or company on their own. So we couldn’t be more excited about having Spotify, Barclays, Yelp, M&S, and GOV.UK join HumanOps. And even more teams are contributing their insights here.
If you happen to be in London on May 19th we’d love to see you at our very first HumanOps meetup, with more worldwide events coming soon.