Incident troubleshooting hints with logging
Written by David Mytton
You’ve been woken up by an alert and, after getting to your laptop, the first step is to find out what is going on. Ideally, you have quite specific alerting, but it may just be that “response time is too high”. With a large infrastructure, most of the time spent after an alert goes on figuring out what is causing the problem rather than on fixing it.
There are many tools to help with this. With MongoDB, you can use mongostat to see all the operations in your cluster and zoom in on some key metrics that look unusual. Or your monitoring dashboard might be glowing red for all of your AWS EC2 US East instances so you know Amazon might be down.
Eventually, once you isolate the source, you will probably want to delve into the logs to see what is going on right now. Wouldn’t it be nice to be able to isolate the source right away?
The way to do this is with alerting on your logs. We pipe all our logs into Papertrail, which can trigger alerts based on the log stream in real time. This means that when a monitoring alert gets triggered, we can quickly see whether there are any corresponding log matches that might help us narrow down the issue.
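As a rough sketch of the idea (this is not Papertrail's API or how it works internally), log alerting boils down to matching each incoming line against a set of named patterns. The rule names, the patterns, and the sample lines below are all made up for illustration; adjust them to whatever your own systems actually log.

```python
import re

# Hypothetical alert rules: name -> regex. The patterns are illustrative
# and should be adapted to the log lines your own systems emit.
RULES = {
    "oom_kill": re.compile(r"Out of memory: Kill(ed)? process"),
    "slow_remove": re.compile(r"\bremove\b.*\b(\d+)ms$"),
}


def match_alerts(lines, rules=RULES):
    """Return a (rule_name, line) pair for every line that matches a rule."""
    hits = []
    for line in lines:
        for name, pattern in rules.items():
            if pattern.search(line):
                hits.append((name, line))
    return hits


# Made-up sample lines standing in for a real log stream.
sample = [
    "kernel: Out of memory: Kill process 1234 (mongod) score 900 or sacrifice child",
    "app: request served in 12ms",
    "mongod: remove mydb.events query: { old: true } 4500ms",
]

for name, line in match_alerts(sample):
    print(name, "->", line)
```

A hosted service does the same kind of matching on the live stream and handles delivery of the alert for you, so in practice you would only write the search patterns.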
This can help for a number of situations:
- Known problems – if you know that a long-running MongoDB remove operation can cause slowness and you are already working on a fix, you don’t want to spend time chasing an issue you already know exists. Just set up an alert for that remove log line and, when response time spikes, you will know the cause right away.
- Stuff is about to break – warnings often appear in logs some time before things start to go wrong. For example, MongoDB logs failed chunk migrations in sharded clusters. You can expect to see a few of these every so often, but if you get a deluge of failed migrations then something else is happening that you’re going to want to investigate.
- Find out what broke – see which actions were taken that could have caused your alerts, for example the kernel killing a process because the system ran out of memory (OOM).
- Expected actions – “notice” alerts are useful so you know when certain actions happen, even if they are automated. If you don’t use Server Density to monitor your MongoDB replica node status (e.g. a change from secondary to primary), you could trigger a log alert based on the state change, so you can be sure the activity has occurred when expected, or run some checklists after an event.
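For warnings that are normal in small numbers, such as the failed chunk migrations mentioned above, the useful signal is the rate rather than any single line. Here is a minimal sketch of a sliding-window counter that only fires when matches pile up. The class name, the threshold, and the window size are assumptions chosen for illustration, not recommended values.

```python
from collections import deque


class RateAlert:
    """Fire when more than `threshold` events arrive within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.times = deque()

    def observe(self, timestamp):
        """Record one matching log line; return True if the rate is too high."""
        self.times.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.times and self.times[0] <= timestamp - self.window:
            self.times.popleft()
        return len(self.times) > self.threshold


alert = RateAlert(threshold=3, window=60)

# Timestamps (in seconds) of hypothetical "failed migration" log lines.
quiet = [alert.observe(t) for t in (0, 100, 200)]       # occasional failures
burst = [alert.observe(t) for t in (300, 301, 302, 303)]  # a sudden deluge

print(quiet)
print(burst)
```

The occasional failures never trip the alert, while the fourth failure inside the same minute does. Most log alerting services let you express the same thing as a minimum number of matches per time period.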
There are plenty of services and tools for doing this. Papertrail is the one we use, but there are also Loggly and Splunk (and Splunk Storm), as well as open source projects like Graylog2. And for the ultimate tooling, Server Density can integrate with Papertrail and Loggly to give quick access to the relevant log lines when alerts get triggered.