Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. 10 Ways to Secure Your Webapp

    While there is no such thing as 100% secure, you can take specific measures to mitigate against a wide range of attacks and secure your webapp as much as possible.

    In this post we discuss some of the steps we’ve taken as part of our efforts to secure our server monitoring tool.

    1. Cover the Basics

    Before considering any of the suggestions listed here, make sure you’ve covered the basics. Those include industry best practices like protection against SQL injection, input filtering, secure session handling, and XSRF defences.

    Also check out the OWASP cheat sheets and top 10 lists to ensure you’re covered.

    2. Use SSL only

    When we launched Server Density in 2009, we offered HTTPS for monitoring agent postbacks but didn’t go as far as to block standard HTTP altogether.

    Later on, when we made the switch to HTTPS-only, the change was nowhere near as onerous as we thought it would be.

    SSL is often viewed as a performance bottleneck but that isn’t really true. In most situations, we see no reason not to force SSL for all connections right from the start.

    Server Density v2 uses a new URL. As part of this, we can force SSL for new agent deployments and access to the web UI alike. We still support the old domain endpoint under non-SSL but will eventually be retiring it.

    To get an excellent report on how good your implementation is, run your URL against the Qualys SSL server test. Here is ours:

    SSL scan for our webapp

    3. Support SSL with Perfect Forward Secrecy

    Without forward secrecy, the session keys for every connection to an SSL URL are protected by the server’s single private key. If someone obtains that key, they can decrypt any traffic they have captured from that URL.

    Perfect forward secrecy addresses this risk by negotiating a new key with every session. A compromise of one key would therefore only affect the data in that one session.

    To do this, you need to allow certain cipher suites in your web server configuration.

    ECDHE-RSA-AES128-SHA:AES128-SHA:RC4-SHA is compatible with most browsers (for more background and implementation details check out this post).

    We terminate SSL at our nginx load balancers and implement SSL using these settings:

    
    ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
    ssl_prefer_server_ciphers   on;
    # prefer RC4-SHA to avoid BEAST
    ssl_ciphers ECDHE-RSA-AES128-SHA256:AES128-GCM-SHA256:RC4:HIGH:!MD5:!aNULL:!EDH;
    

    You can easily tell if you're connected using perfect forward secrecy. In Chrome, just click on the lock icon preceding the URL and look for ECDHE_RSA under the Connection tab:

    TLS security

    4. Use Strict Transport Security

    Forcing SSL should be combined with HTTP Strict Transport Security. Otherwise you run a risk of users entering your domain without specifying a protocol.

    For example, a user might type example.com rather than https://example.com and then be redirected to HTTPS. That redirect opens a security hole because there’s a short window when communication is still over plain HTTP.

    You can address this by sending an STS header with your responses. The header carries a max-age that the browser stores; for as long as it remains valid, the browser converts HTTP URLs for your domain to HTTPS internally, without issuing a plain HTTP request at all:

    
    strict-transport-security:max-age=315360000; includeSubdomains
    

    Our header is set for 10 years and includes all subdomains because each account gets its own subdomain, for example: foo.serverdensity.io.

    5. Submit STS Settings to Browser Vendors

    Even with STS headers in place there’s still a potential hole, because those headers are only sent after the first request.

    One way to address this is by submitting your URL to browser vendors so they can force the browser to only ever access your URL over SSL.

    You can read more about how this works and submit your URL for inclusion in Chrome’s preload list. Firefox seeds its list from Chrome’s.

    6. Enforce a Content Security Policy

    Of the top 10 most common security vulnerabilities, cross-site scripting (XSS) is number 3. This is where remote code is injected and executed on your site, usually through incorrect (or non-existent) filtering.

    A good way to combat this is to whitelist the specific remote resources you want to allow. If a script URL is not matched by this list then browsers will block it.

    It’s much easier to implement this on a new product because you can start out by blocking everything. You then open specific URLs as and when you add functionality.

    Using browser developer tools you can easily see which remote hosts are being called. The CSP we use is:

    
    content-security-policy:script-src 'self' 'unsafe-eval' https://maps.google.com https://*.gstatic.com https://*.googleapis.com https://*.mixpanel.com https://*.mxpnl.com; connect-src 'self' https://maps.google.com https://*.gstatic.com https://*.googleapis.com https://*.mixpanel.com https://*.mxpnl.com; frame-src 'self' https://maps.google.com https://*.gstatic.com https://*.googleapis.com https://*.mixpanel.com https://*.mxpnl.com; object-src 'none'
    

    We have to specifically allow unsafe-eval here because a number of third party libraries require it. You might not use any third party libraries, or the libraries you use may not require unsafe-eval, in which case you should not allow it.

    script-src is the directive that controls script-related privileges for a specific page; connect-src and frame-src do the same for XHR/WebSocket connections and embedded frames. For more detail on these directives, see a good introduction to CSP.

    Be careful with wildcarding domains that can host arbitrary content. For example, wildcarding *.cloudfront.net would allow anyone to serve any script: it’s Amazon’s CDN, and anyone can upload files to it!

    Also note that Content-Security-Policy is the standard header name, but at the time of writing Firefox and IE only support the prefixed X-Content-Security-Policy. See the OWASP documentation for more information about the header names and directives.

    7. Enable HTTP security headers

    You can enable some additional security features in certain browsers by setting the appropriate response headers. While not universally supported, they are still worth considering:

    • X-Frame-Options controls whether your pages may be embedded in frames on other sites, mitigating clickjacking.
    • X-Content-Type-Options: nosniff stops browsers MIME-sniffing responses away from the declared content type.
    • X-XSS-Protection enables the browser’s built-in reflected XSS filter.
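    As a minimal sketch, here is one way to attach these headers to every response, assuming a Python/Flask app (the framework is our choice for illustration; setting them in your web server config works just as well):

    # Attach extra security headers to every response (Flask app assumed
    # for illustration; the values shown are common safe defaults).
    from flask import Flask

    app = Flask(__name__)

    @app.after_request
    def add_security_headers(response):
        response.headers["X-Frame-Options"] = "SAMEORIGIN"      # mitigate clickjacking
        response.headers["X-Content-Type-Options"] = "nosniff"  # disable MIME sniffing
        response.headers["X-XSS-Protection"] = "1; mode=block"  # reflected XSS filter
        return response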

    8. Set up passwords, “remember me” and login resets properly

    This is the main gateway to your webapp, so make sure you implement every stage of the login process properly. It only takes a short amount of time to research and design a secure process:

    • Registration and login should store passwords using a salted, slow cryptographic hash function such as bcrypt, never plain text or unsalted MD5 (see the sketch after this list).
    • Password resets should use an out-of-band method to trigger them, for example: requiring a username and then emailing a one-time, expiring link to the on-record email address, where the user can choose a new password. Here is more guidance and a checklist.
    • "Remember me" functionality should use secure tokens to recognise the user, not store their credentials in cookies.

    You can review your authentication process against this OWASP cheat sheet.

    9. Offer Multi Factor Authentication

    If your webapp is anything more than a trivial consumer product, you should implement—and encourage your users to use—multi factor authentication.

    This requires them to authenticate using something they carry with them (a token) before they can log in. An attacker would therefore need both the token (phone, RSA SecurID etc.) and the user’s credentials to gain access.

    We use the Google Authenticator standard because authenticator apps are available for every major mobile platform, and there are server-side libraries for pretty much every language.
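    As an illustration of the flow, here is a minimal sketch using the pyotp library (our choice for the example; not necessarily what Server Density runs):

    # Google Authenticator implements TOTP (RFC 6238): a shared secret
    # plus the current time yields a short-lived 6-digit code.
    import pyotp

    # At enrolment: generate a secret and hand it to the user's app,
    # typically as a QR code of this provisioning URI.
    secret = pyotp.random_base32()
    uri = pyotp.TOTP(secret).provisioning_uri(name="user@example.com",
                                              issuer_name="ExampleApp")

    # At login: verify the code the user typed in.
    totp = pyotp.TOTP(secret)
    if totp.verify("123456"):  # code from the user's authenticator app
        print("second factor OK")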

    Installing a custom, proprietary MFA app is quite onerous for users, so we don’t recommend implementing your own system.

    Be sure to re-authenticate for things like adding/removing MFA tokens. We require re-authentication for all user profile changes.

    We do however have a timeout in place during which users won’t have to re-authenticate. This timeout applies for simple actions like changing passwords (adding or removing tokens requires authentication even during the timeout).

    To sum up, MFA is crucial for any serious application because it is one of the strongest protections against account hijacking.

    10. Schedule Security Audits

    We inspect security as part of our code review and deployment process (many eyes on the code). We also have regular reviews from external security consultants.

    We recommend having one firm do an audit, implement their fixes, and then have another firm audit those changes.

    Summary

    Security is all about identifying and mitigating possible risks of attack. The operative word here is mitigation, since new threats are always emerging.

    This is an ongoing exercise. Be sure to conduct regular reviews of all existing measures, check for new defence mechanisms and keep abreast of security announcements.

  2. Is Security a Growth Catalyst for DevOps?

    Security comes from the Latin root sēcūrus. It means free from care. Some adjectives associated with this word are untroubled, fearless, and composed.

    Security provides a safe space for humans to stretch their imagination and be as creative as they can. It allows for growth.

    It also allows for focus. For small companies like ours, security unfetters our potential to improve our product and serve our customers.

    Good security is not an add-on, a feature or a separate effort. It is an essential building block of our work. And that should be reflected in everything we do, including our people, our infrastructure, our technologies and our product.

    Let’s start with people.

    The Role of People

    “If you think technology can solve your security problems, then you don’t understand the problems and you don’t understand the technology.”

    ~ Bruce Schneier.

    All fourteen collisions involving Google’s self-driving cars were caused by human error, according to Google. The drivers involved in those accidents were all distracted. It turns out that humans are the weakest link when it comes to safe systems.

    There are a number of ways we approach (and mitigate) this risk. To begin with, we try and have as many “eyes on the code” as possible.

    As part of our code review and deployment process we test each other’s code and try to break it. We are a small and tightly knit team, which is great. But we don’t know it all.

    To reduce the risk of blind spots and confirmation bias (we are only human!), we work with independent security consultants who inspect our product (and code) on a regular basis.

    Another resource we are looking into (but haven’t leveraged yet) is the specialised skillsets of the crowd. There are some compelling platforms for bug bounties and bug reporting out there. Large companies, like Google and Tesla, and smaller ones, like LastPass and Drupal, have used them for a while.

    Now let’s turn our attention to technology, and how we can secure it.

    Multi Factor Authentication

    Multi Factor Authentication (MFA) requires the user to authenticate using something they physically have with them before they can log in. It is one of the strongest protections against account hijacking.

    We use MFA internally as much as we can. For example, we enforce Google Authenticator for Gmail, Google Drive and all our Google Apps.

    We also encourage all our customers to activate MFA for their Server Density account:

    Enabling MFA for a Server Density account

    Encryption

    Our computers are full-disk encrypted (we use FileVault, PGP Full Disk Encryption or Espionage, depending on the OS). We also encrypt some of our email communications with GnuPG, one of the tools that Edward Snowden used to protect his communications about the NSA.

    Up to Date Software

    We make sure we are always running the latest patched versions of all the software we use. This includes web browsers, messaging clients, OS components and the OS itself.

    Web Browser

    We like Google Chrome for its tight integration with Google Apps but also for its auto-update feature which keeps the browser secure.

    We are not big on browser add-ons. Click-to-play is an exception as it helps us prevent browser plugin vulnerabilities (Flash and Java in particular). We also use this Chrome extension to protect against phishing on our Google accounts.

    We also recommend Fluffify, our very own Chrome extension. It won’t make you any more secure, but it will keep you sane.

    Passwords

    The second law of thermodynamics states that entropy always increases with time. When it comes to guessing passwords, however, time always increases with entropy.

    Password entropy is a measurement of how unpredictable a password is.

    Our passwords are at least 20 characters long and comprise a mix of upper and lower case letters, numbers and symbols. They are also unique to each system, which means that if one system is compromised, others will not follow suit.
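    As a rough worked example, a password of length L drawn uniformly from an alphabet of N characters has L × log2(N) bits of entropy:

    # Entropy of a random password: length * log2(alphabet size).
    import math

    def entropy_bits(length: int, alphabet_size: int) -> float:
        return length * math.log2(alphabet_size)

    # 20 characters over upper/lower case letters, digits and ~30 symbols:
    print(entropy_bits(20, 26 + 26 + 10 + 30))  # ~130 bits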

    We keep offsite and easily accessible backups of all our passwords (using tools like 1Password) to allow for easy reset of all account passwords in the event of a breach.

    We never share passwords. Each of us has our very own set of credentials. This helps us deal with red-flag scenarios. Like revoking employee privileges when they leave. Or auditing who accessed a particular server or database.

    Least Privilege

    According to the principle of least privilege, every process or user should only be able to access the resources they need. User administration is a key component of our product:

    User management in Server Density

    Secure Data Flows

    For Server Density to work we ask our customers to install a lightweight agent on their server. All this does is collect various system metrics and constantly report back.

    A deliberate restriction is that data can only travel one way: from the client’s server to ours. That rules out any possibility of remote execution.

    From that point everything is encrypted. In fact, encrypted postbacks are the only option.

    We use ports that are usually already open (HTTPS port 443) which means there is no need to configure anything new. No root access required either. And because our agent is open source, our customers have full visibility of what is running at all times.
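    To make the one-way flow concrete, here is a simplified sketch of what a postback might look like; the endpoint URL and payload shape are illustrative assumptions, not the actual agent protocol:

    # One-way agent postback: metrics go from the monitored server to the
    # monitoring endpoint over HTTPS; nothing executable ever comes back.
    import requests

    payload = {"hostname": "web1", "load_avg": 0.42, "mem_used_mb": 512}
    requests.post(
        "https://example.serverdensity.io/postback",  # hypothetical endpoint
        json=payload,
        timeout=10,
        verify=True,  # always validate the server certificate
    )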

    Summary

    Amateurs hack systems, professionals hack people.

    ~ Bruce Schneier

    We don’t think security is a mere feature, and it shouldn’t be treated as such. At its best, security is an essential building block of the product, the team, and everything a company does.

    Whether they are sending data, provisioning access to systems or storing internal passwords, DevOps teams should take all reasonable precautions to keep confidential data safe and available.

    Having secure systems affords companies the stability and peace of mind they need to be creative, grow, and serve their customers.

    What about you? What industry best practices do you follow?

  3. How and why we use DevOps checklists

    In his book The Checklist Manifesto, Atul Gawande tells the story of the first pre-flight checklist, created by Boeing following the fatal crash of a B-17 in 1935.

    According to the investigation, the pilots forgot to disengage a critical wing adjustment mechanism before take-off. It was evidence that even veteran pilots could miss key steps or do things in the wrong order. With hundreds of lives at stake, it was necessary to design around this constraint.

    The checklist does exactly that. It compensates for the “limits of human memory and attention.”

    Indeed, Gawande — a doctor himself — writes how key steps in medical procedures were routinely missed, resulting in infections and preventable fatalities. The adoption of checklists reduced those occurrences, and they are now used in a wide range of healthcare settings.

    Checklists for DevOps

    Not unlike healthcare and aviation, sysadmins are often tasked with systems that touch many lives. Here at Server Density we appreciate the complexities of the systems we run. We also recognise the limits of the people who run them — us. That’s why we use checklists for much of what we do.

    Checklist tattoo

    There is only so much that human memory can remember. Source: http://bit.ly/1Wc7m0p

    Checklists are particularly effective in situations where there is:

    Complexity

    There is only so much that human memory can retain, reproduce and act on in a reliable manner.

    Stress and Fatigue

    Incidents may happen at awkward times, like early in the morning when mistakes are more likely. Sysadmins are vulnerable to stress and fatigue like everyone else.

    Ego

    You’d expect a seasoned engineer to intuitively know how to deal with a wide range of contingencies. That is a good thing. Experience and tenure, however, could also encourage people to rely on “gut instincts”, to “wing it” and “shoot from the hip.” In complex situations those attitudes could prove hazardous.

    A checklist is a good way to mitigate those problems because it helps us define our response in advance and make it available to everyone. We therefore ensure that every member of our team is taking the right steps in the right order, each and every time.

    Checklists at Server Density

    Here is one of our own checklists. It defines what our on-call first responders do when a critical incident occurs (we also wrote a guide on how we handle incidents, outages and downtime):

    Incident Response Checklist

    As “common sense” and obvious as the steps may be, they carry great importance for the health of our infrastructure. So we spell them out.

    As you would expect, we take our uptime metrics seriously. We’ve got some pretty capable folks taking care of our servers. And we use checklists. Very prescriptive ones like the one below. This one details the steps our on-call people follow when faced with a server failover:

    Load Balancer Checklist

    We sed/awk/grep all day long. Our checklists assume we do it for the very first time. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.

    Here is another example. We use this checklist when a server we monitor stops sending data:

    No Data Checklist

    DevOps checklists are as unique as the teams that use them. Each team has their own recipe for doing things and as technology stacks evolve, so do the checklists required to run them.

    We aim to have a checklist for every scenario. From restoring a backup to production, deploying fixes, switching primary data centres, and database consistency checks, to responding to traffic spikes, security breaches, critical alerts, and a long list of other contingencies.

    Google Docs is a key part of our on-call playbook, so that’s where we store our checklists for everyone to access and update as needed.

    Checklists are not static

    Relying on checklists does not mean we are intractable about how we do things. For us, creating checklists is an excellent opportunity to take a step back and review the entirety of our stack.

    DevOps checklists work best when we schedule time to update and improve on them.

    Summary

    When it comes to server monitoring, we believe checklists are an important step towards reliable systems. They help our team respond to issues in a consistent and timely manner. This translates to increased uptime and a better capacity to serve our customers.

    What about you? Are you using checklists?

  4. What’s new in Server Density – July 2015

    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Tags as a recipient

    We launched tags several months ago to allow you to set permissions for different users, but they are the foundation for many more features we’ll be releasing over the coming months. The first of these is tags as a recipient. This allows you to have alerts delivered to all members of a tag, rather than having to set up each of your users on every alert configuration individually.

    For example, if you have an “on call” team, you can add all the users to that tag, then set the tag as the recipient for an alert. Each user can have different notification options, and any changes you make will apply to all alerts the tag is a recipient for. This is particularly useful if your team changes, e.g. new members joining or staff leaving – you only have to make the change once on the tag and it’ll apply to all alerts.

    Tags as a recipient

    This is available now on device and service level alerts. It’s not available for group level alerts because our next release will be replacing those with alerts on a tag (so servers and services can have multiple tags, with inheritance across multiple tags).

    Learn how to set these up in our support guide.

    Service monitoring error details

    We often get reports that our availability monitoring is producing “false positives” compared with competing products. These usually turn out to be real errors that we’ve detected where others have not! To back up our claims, we have now exposed full details of any errors we see, along with their history, for all of your service checks.

    You can browse recent errors, search and filter by location and see errors as they are detected. This will help debug any problems we detect with our availability monitoring.

    Service monitoring errors

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  5. What’s new in Server Density – May 2015

    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Process statistics

    The main release for this month has been our in-depth process level statistics. Server Density has been able to alert on process existence and resource usage for some time, but that data is now visualised for each server.

    Each server overview has a top processes widget that gives you a breakdown of the most intensive processes and how many running instances there are:

    Top processes

    This is also extended to the snapshot view which you can reach by clicking on a data point on any graph or from the Snapshot tab when viewing a particular server.

    Processes snapshot

    sd-agent 1.14.0

    A new version of the monitoring agent for Linux, FreeBSD and Mac has been released with a range of bug fixes. This is intended to be the final release of the v1 agent. We’ll soon be releasing sd-agent v2, which will include features such as SNMP, statsd and second-by-second monitoring.

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  6. What’s new in Server Density – Apr 2015

    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    New alert config UI

    We released a new configuration interface for managing alerts, the result of several months of work involving design and usability tests. Try it out on your account now and read about the work behind the scenes.

    The new UI

    New support site

    Our support website has been redesigned, all the articles have been updated and you can now log in to submit/view old tickets. We provide live chat, email and phone support to all customers Monday to Friday, 10am to 6pm UK time.

    Set tags from the installer script

    Our quick agent installer shell script now allows you to specify tags when deploying the agent.

    Updated iPhone app

    A new release of the iPhone app for alerting improves various elements of the interface and fully supports fluid layouts, including iPad, iPhone 6 and iPhone 6 Plus.

    Default alerts

    All new devices and services added via the web UI now get default alerts configured: “no data received” for devices; “service is down” and “HTTP status code is not 200” for services.

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  7. What’s new in Server Density – Jan 2015

    This is our regular monthly post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Latest value widget

    A new widget is available on the dashboard which shows the latest value for any metric. It also displays the average value over the time period the dashboard is configured for, e.g. the 24 hour or 1 hour average, with a sparkline graph in the background.

    Latest value widget

    New official plugins for entropy, inodes, ProFTP, Zombies and Zookeeper

    We’re in the process of retiring our old plugin directory and rewriting many old community plugins into officially supported and updated versions. These are available on Github and we’re accepting pull requests for improvements and changes, as well as brand new plugins.

    The goal is to make it easier to install by just dropping the file into your agent plugin directory and ensure these plugins are kept up to date and fully supported by us.

    New API documentation + Dashboard API

    We’ve updated and expanded our API documentation with a new template and example calls for Python, Ruby and Curl.

    In addition, there’s now an API endpoint for managing dashboards and the widgets on them. This allows you to programmatically create new dashboards and new widgets, perhaps as part of your provisioning process for new environments.

    Server Density v1 shutdown in March

    Last month we announced the shutdown date of 24th March for Server Density v1. Users still on v1 are being sent regular reminders to migrate, which only takes a few minutes and does not cost anything extra.

    What’s coming next?

    Over the last few months we have been working on moving our alerts processing backend from Celery + MongoDB to Storm + Kafka, which lays the foundations for a range of new alerting functionality we’ll be releasing from March. Tagging, released in December, is a key part of this functionality. Before then, we’ll be releasing more plugins and full process lists within the UI.

  8. What’s new in Server Density – Nov 2014

    The last few months of development at Server Density have been focused on a large number of small improvements, particularly targeted at fixing known issues with the dashboard and ensuring we tackle lots of minor bugs and complaints. This means you’ll find the dashboard is more solid, performance across the whole app has improved and we have resolved bugs with our cross browser support on Firefox, Chrome and Safari.

    We also added some new functionality which sets the foundations ready for some major releases in 2015:

    Permissions & tagging

    You may want to allow different teams or customers access to your account but restrict them to only be able to view and manage specific servers or web checks. Using tags and permissions you can now do this. Tag a device or web check and then tag an associated user, and that user will then only be able to access those specific devices or web checks. There’s a guide here.

    Tagging is currently only used for permissions but is the foundation for more tag based functionality such as alerting on tags which will be released at the start of 2015.

    tags

    New official plugins for MongoDB, Docker, Nginx, Nagios and Temperature

    We’re in the process of retiring our old plugin directory and rewriting many old community plugins into officially supported and updated versions. These are available on Github and we’re accepting pull requests for improvements and changes, as well as brand new plugins.

    The goal is to make it easier to install by just dropping the file into your agent plugin directory and ensure these plugins are kept up to date and fully supported by us.

    Bandwidth aggregation calculations

    We’ve created a new tool – sd-bw – which uses our API to aggregate bandwidth statistics for your servers (either individually or for every server in a group) to give you a “total amount transferred” or “total bandwidth used” figure for a specific time range.

    Vertically resizable dashboard graphs

    You can now resize dashboard graph widgets both horizontally and vertically. Just hover over the bottom or right side of a widget, then click and drag to resize.

    What’s coming next?

    The beginning of 2015 will have a series of feature releases including enhancements to alerting, process list monitoring and better integration into cloud provider APIs e.g. CloudWatch metrics. If you have any ideas for improvements, let us know!

  9. Nagios Alternative – Cost Calculator.

    A common misconception in the industry is the notion that open source monitoring software is free. This is true if you’re looking at licensing alone, but there are many more factors to take into account than that. As a Nagios alternative ourselves, we decided to work out exactly how expensive Nagios is in comparison to our own server monitoring.

    Server Density’s job as a competitor is to highlight some of the problems and difficulties of using Nagios, without damning the open source community or misleading anyone. Anticipating some critique of our calculations, we’ve decided to write this article on our ‘workings’. It also gives you the chance to engage with us about the calculations – please comment below if you think we’ve gone wrong. If you can convince us, we’ll happily amend our math. For now though, here’s how we worked it out.

    Before you start

    You’ll notice a common theme across many of these headings being the time Nagios takes to setup and use. In a world where time and money are completely unrelated, this is how the relationship between Server Density and Nagios looks:

    • Nagios saves you money.
    • Server Density saves you time.

    Or

    • Nagios costs you time.
    • Server Density costs you money.

    But of course this isn’t true: the age-old idiom “time is money” couldn’t be more applicable to the world of fast-moving tech startups, so:

    • Nagios costs you money.
    • Server Density costs you money.

    Those principles form the basis of our Nagios cost calculator, in which we assign a monetary value to Nagios based on the time you’d expect to spend setting up and maintaining the open source tool. You can evaluate the cost of a basic monitoring setup if you’d like, but if you need to replicate our monitoring infrastructure it’s best to keep all of the options ticked.

    Nagios Cost Calculator

    Nagios Hardware Requirements

    A Nagios server isn’t cheap to run: it requires a large amount of processing power, especially if you have a lot of servers to monitor:

    Under 50 servers

    To monitor anything under 50 servers we suggest something similar to the Amazon m3.medium instance type. At the time of writing (Nov 2014) that’ll set you back $0.070 an hour, which works out to a yearly cost of $613.

    Over 50 servers

    Monitoring more than 50 servers will demand more from your Nagios server, so you’ll need to upgrade. For this we’d suggest an m3.xlarge instance. At the time of writing (Nov 2014) that’ll set you back $0.280 an hour, which works out to a yearly cost of $2452.

    We’ve used AWS as the cost benchmark because they’re constantly pushing costs down and are the most popular provider. We didn’t consider reserved instances because their pre-purchase fees complicate the calculation and add an upfront setup cost for Nagios that you don’t get with Server Density.

    Need redundancy?

    If you take monitoring seriously you’ll want to keep redundancy checked. To replicate how Server Density is deployed – full redundancy within our data centers combined with the geographic redundancy of deploying into multiple facilities (your monitoring needs to be more reliable than what you’re actually monitoring!) – you’ll need at least 2 servers in each of 2 data centers. If you’re monitoring under 50 servers that’s $613 * 4; if you’re monitoring over 50 it’s $2452 * 4.

    This level of redundancy is necessary to ensure you can survive the failure of a node within one facility as well as the failure of the entire data center. Of course, this assumes you know how to set up Nagios in a redundant, load balanced cluster.

    Once you get over 50 servers it’s totally unacceptable to be running just a single Nagios server, so the calculator forces redundancy above that point.
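    The arithmetic behind these figures is simple; as a sketch using the on-demand prices quoted above:

    # Yearly Nagios server cost from the AWS hourly rate (Nov 2014 prices).
    HOURS_PER_YEAR = 24 * 365

    def yearly_cost(hourly_rate: float, redundancy: bool = False) -> float:
        # Redundancy means 2 servers in each of 2 data centers.
        servers = 4 if redundancy else 1
        return hourly_rate * HOURS_PER_YEAR * servers

    print(yearly_cost(0.070))                   # under 50 servers: ~$613
    print(yearly_cost(0.280, redundancy=True))  # over 50, redundant: ~$9,811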

    How long does Nagios take to setup?

    We’ve calculated the initial monitoring setup to take 2 working days. This can be shorter if you know what you’re doing or longer if you’ve never done it before. This is because it takes time to go through the installation process and in particular, get the initial config right.

    How long does Nagios take to deploy across multiple servers?

    Once you’ve spent the 16 hours setting your Nagios server up, you still need to consider how long it takes to install the monitoring agent(s). There’s no shortage of config files when you’re running Nagios. It’s usually the initial setup that takes the longest, with each additional server only taking a few minutes to get up and running.

    Nagios alerts configuration

    Monitoring alerts need to be reliable and flexible. By default, Nagios limits alert delivery to email, so it takes extra time to set up SMS alerts or configure push notifications on your phone, and the services you’ll want to use are often not free. SMS gateway reliability is important, and for push notifications you need apps or a 3rd party that supports generic notifications; that reliability has to be monitored too. With Server Density, all of this is taken care of for you at no extra cost, even down to free SMS credits.

    As part of the Nagios cost calculator, we estimate that setting up an alerting system comparable to the one we offer will take 8 hours of your time; we have ignored the cost of external services such as SMS credits.

    Nagios Graphing

    It will take you a further 8 hours to install a plugin like nagiosgraph or configure an entirely separate system such as Cacti or Graphite – and even then, here’s the same data presented by Nagios and Server Density:

    Nagios Alternative

    Nagios Security

    Keeping everything nice and secure is essential. It takes time to get some basic hardening on any server and we’ve budgeted a couple of hours for this. What we don’t include is ongoing security assessments and patches that we take care of for you with Server Density. This is particularly important if a piece of software is installed on every single one of your servers or is a key part of your systems…such as monitoring.

    Monitoring your monitoring

    With no redundancy set up, you’re going to struggle to monitor the performance of your Nagios server, because there is no second server to watch the first. In that case you’ll need to use a service like Server Density on our 1 server plan to make sure everything is okay with your single Nagios server.

    Nagios Maintenance

    By default our calculator is set to allow for 12 hours of maintenance every year. That’s one hour a month fiddling with preferences, tweaking configs, fixing problems, upgrading or even thinking about improvements to your monitoring setup.

    Incident management

    We assume you’ll spend 6 hours every year (30 minutes a month) on incidents relating to your Nagios monitoring servers. This could be hardware failures, instance retirements, whole region/data center reboots, instance upgrades, dealing with backups or clearing out metrics data from disk.

    Worldwide locations for web checks

    If you want availability monitoring, your best bet is to pay for an external provider like Server Density or Pingdom, where a ’50 checks’ account will cost ~$250/year (as of Nov 2014).

    Setting up geographically dispersed monitoring locations and scheduling checks amongst them all is non-trivial, and is something you get as part of the product with Server Density.

    The calculator settings are defaults and can be changed based on how long you’d consider things to take you. We have tried to be fair to Nagios with our time estimates because, after all, cost isn’t the only way we think we have an advantage over the open source competition. There are some cases when Nagios is cheaper (e.g. if you don’t value your time highly or have a tiny number of servers… but then why are you setting up a complex monitoring tool like Nagios in the first place?!) but with all the functionality Server Density provides, we think we have a pretty good offer!

    Thanks for taking the time to read through our justification. If you’d like to join the discussion please leave a comment below; this reddit thread is also home to some interesting comments – we love reading and responding to your thoughts. Oh, and if you’re sick of Nagios, consider us next time you’re looking for a Nagios alternative.

  10. A guide to handling incidents, downtime and outages

    Outages and downtime are inevitable. Designing your systems to handle failure is a key part of modern infrastructure architecture, and it makes surviving most problems possible. However, there will be incidents you didn’t think about, software bugs you didn’t catch and other events which result in downtime for your service.

    Microsoft, Amazon and Google spend billions of dollars every quarter and even they still have outages. How much do you spend?

    Some companies seem to have constant problems and suffer from them unnecessarily. Regular outages ultimately become unacceptable, but if you adopt a few key principles and design your systems properly, customers can forgive you the few times you do have service incidents.

    Step 1: Planning

    If critical alerts result in panic and chaos then you deserve to suffer from the incident! There are a number of things you can do in advance to ensure that when something does go wrong, everyone on your team knows what they should be doing.

    • Put in place the right documentation. This should be easily accessible, searchable and up to date. We use Google Docs for this.
    • Use proper config management, be it Puppet, Chef, Ansible, Salt Stack or some other system, to be able to make mass changes to your infrastructure in a controlled manner. It also helps your team understand novel issues, because the code that defines the setup is easily accessible.

    Unexpected failures

    Be aware of your whole system. Unexpected failures can come from unusual places. Are you hosted on AWS? What happens if they suffer an outage and you need to use Slack or Hipchat for internal communication? Are you hosted on Google Cloud? What happens if your GMail is unavailable during a Google Cloud outage? Are you using a data center within the city you live in? What happens if there’s a weather event and the phone service is knocked out?

    Step 2: Be ready to handle the alerts

    Some people hate being on call, others love it! Either way, you need a system to handle on call rotations, escalating issues to other members of the team, planning for reachability and allowing people to go off-call after incidents. We use PagerDuty on a weekly rotation through the team and consider things like who is available, internet connectivity, illness, holidays and looping in product engineering so issues waking people up can be resolved quickly.

    PagerDuty on-call calendar

    More and more outages are being caused by software bugs getting into production, and it’s never just a single thing that goes wrong – a cascade of problems culminates in downtime – so you need rotations amongst different teams, such as frontend engineering, not just ops.

    Step 3: Deal with it, using checklists

    Have a defined process in place, ready to run through whenever the alerts go off. Using a checklist removes unnecessary thinking so you can focus on the real problem, and it ensures key actions are taken and not forgotten. Have a channel for communication both internally and externally – there’s nothing worse than being the customer of a service that is down and having no idea whether they’re working on it or not.

    Google Docs Incident Handling

    Step 4: Write up a detailed postmortem

    This is the opportunity to win back trust. If you follow the steps above and provide accurate, useful information during the outage so people know what is going on, this is the chance to write it up: explain what happened, what went wrong and, crucially, what you are going to do to prevent it from happening again. Outages highlight unknown system flaws, and it’s important to tell your users that the hole no longer exists, or is in the process of being closed.

    Incidents, downtime and outages video

    We hosted an open discussion on best practices for handling incidents, downtime and outages with Charlie Allom (Network Engineer) and Brian Trump (Site Reliability Engineer) from Yelp. Contrasting a small company with a much larger one, we chatted through how we deal with things such as:

    • On call – rotations, scheduling, systems and policies
    • Preparing for downtime – teams, systems and product architecture
    • Documentation
    • Checklists and playbooks
    • How we actually handle incidents
    • Post mortems

    Here’s the full video: