Auto-Remediation: Getting Started

December 2, 2015
by Patrick Hoolboom

On the latest Automation Happy Hour we talked with engineers from Netflix about auto-remediation. A good portion of the discussion was about how to get started, which got me thinking that I should take a moment to go over the topic.

People tend to overanalyze auto-remediation. There seems to be a mentality that you must automate away all of your problems on day one. This type of thinking frequently leads to analysis paralysis: teams deadlock trying to decide what to automate. In this article I am going to outline two of the best ways I have found to get people started in auto-remediation: facilitated troubleshooting and simple monitoring events.

Why Auto-Remediate?

Auto-remediation is more than a band-aid for poorly implemented infrastructure or applications. Servers go down, processes hang, outages happen. Auto-remediation significantly reduces time to resolution and allows the team to focus more on root cause analysis to prevent future outages. It helps alleviate pager fatigue and lets people focus on the important task of improving the applications or infrastructure. Leveraging an event-driven automation platform such as StackStorm also gives better visibility into what is and isn’t working in your process. Let the machines mitigate the event so you can focus on making sure it doesn’t happen again.

Facilitated Troubleshooting

An easy way to get started is to NOT remediate right away. Your team may not be sold on fully automated resolutions yet. Facilitated troubleshooting gives you a good way to show value from automation while still allowing a person to perform the final remediation action. Auto-remediation is really broken into two pieces: diagnostic workflows and remediation workflows. Facilitated troubleshooting means running the diagnostic workflow automatically and the remediation workflow manually. The diagnostic workflows collect information about the event to better prepare the person who will respond to the page.

When an event fires, collect a lot more data about that event. Think about the things you would check if you were woken up by the page. These steps will become the tasks in your diagnostic workflow. These types of workflows are handy because they allow you to execute more expensive or long-running checks. This lets you keep your monitoring system lean and mean but still get the necessary information during an event. Take this data and share it with the on-call engineers or team as you see fit (chat, ticket, email, etc.). Include suggested next steps or additional workflows they may run to help reduce time to resolution even more.
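To make the idea concrete, here is a minimal sketch of a diagnostic workflow in plain Python. It is not StackStorm's workflow API; `DiagnosticReport`, `run_diagnostics`, and the check names in the usage note are hypothetical names chosen for illustration. Each check is just a function that returns a short summary, and a check that fails is recorded instead of aborting the rest of the run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DiagnosticReport:
    """Context collected for one monitoring event (hypothetical structure)."""
    event: str
    findings: Dict[str, str] = field(default_factory=dict)
    suggested_steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the report as plain text for chat, a ticket, or email."""
        lines = [f"Event: {self.event}"]
        lines += [f"  {name}: {result}" for name, result in self.findings.items()]
        if self.suggested_steps:
            lines.append("Suggested next steps:")
            lines += [f"  - {step}" for step in self.suggested_steps]
        return "\n".join(lines)


def run_diagnostics(event: str, checks: Dict[str, Callable[[], str]]) -> DiagnosticReport:
    """Run every check in order; a failing check is recorded, not raised."""
    report = DiagnosticReport(event=event)
    for name, check in checks.items():
        try:
            report.findings[name] = check()
        except Exception as exc:  # keep going so one broken check cannot hide the rest
            report.findings[name] = f"check failed: {exc}"
    return report
```

In use, you might call `run_diagnostics("high latency", {"database": check_db, "network": check_network})`, append a suggested remediation to `report.suggested_steps`, and post `report.render()` wherever your on-call engineers will see it. The `check_db` and `check_network` functions are placeholders for whatever you would normally check by hand.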

KISS

When you are ready to auto-remediate, start with the low-hanging fruit. Automate the easy things in order to identify the proper patterns for you and your organization. Some examples of easy tasks:

  • Restarting a dead/hung process
  • Clearing disk space
  • Removing unused volumes or VMs

Nir Alfasi from Netflix spoke of automating the remediation of health checks for service discovery. This is a great example of a simple remediation.

Does service discovery think the node is down?

  1. Check health of the instance
  2. Attempt to reboot if unhealthy
  3. Attempt to clear the health check if node is healthy
  4. If all else fails, escalate!
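The four steps above can be sketched as a single function. This is one reading of the steps, not Netflix's implementation: the four callables (`is_healthy`, `reboot`, `clear_health_check`, `escalate`) are hypothetical stand-ins for whatever your service discovery and paging tools provide, and the sketch assumes that a node which comes back healthy after a reboot should also have its health check cleared.

```python
from typing import Callable


def remediate_node(
    is_healthy: Callable[[], bool],
    reboot: Callable[[], None],
    clear_health_check: Callable[[], bool],
    escalate: Callable[[str], None],
) -> str:
    """Walk the remediation steps and return the outcome: 'cleared' or 'escalated'."""
    # Step 1: check the health of the instance.
    if not is_healthy():
        # Step 2: the node is unhealthy, so attempt a reboot and re-check.
        try:
            reboot()
        except Exception as exc:
            escalate(f"reboot failed: {exc}")
            return "escalated"
        if not is_healthy():
            escalate("node still unhealthy after reboot")
            return "escalated"

    # Step 3: the node is healthy, so try to clear the stale health check.
    if clear_health_check():
        return "cleared"

    # Step 4: nothing worked, hand it to a person.
    escalate("could not clear health check on a healthy node")
    return "escalated"
```

Passing the actions in as callables keeps the decision logic separate from the tooling, which makes it easy to exercise the escalation paths with fakes before pointing the workflow at real nodes.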

Another example would be a simple disk space remediation:

Disk Space Cleanup
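As a rough illustration of what such a cleanup could do, the sketch below measures disk usage and removes files older than a given age from one directory. The function names, the age-based policy, and the `dry_run` default are assumptions made for this example; a real remediation would target the directories and retention rules that fit your systems.

```python
import shutil
import time
from pathlib import Path
from typing import List


def percent_used(path: str) -> float:
    """Return the percentage of disk space in use on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def remove_old_files(directory: str, max_age_days: float, dry_run: bool = True) -> List[Path]:
    """Find regular files older than max_age_days; delete them unless dry_run is set."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for entry in Path(directory).iterdir():
        # Skip subdirectories and symlinks; only plain files are candidates.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            stale.append(entry)
            if not dry_run:
                entry.unlink()
    return stale
```

A workflow built on this might call `percent_used("/var/log")` when the alert fires, run `remove_old_files("/var/log/myapp", 7, dry_run=False)` if usage is above your threshold, then re-check and escalate if the disk is still too full. The path and the seven-day age are placeholders. Defaulting to a dry run lets you review what would be deleted before trusting the automation.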

Which Events Should I Auto-Remediate?

A good way to get started quickly is to look at the alert history from your monitoring system.

  • What alerts happen frequently?
  • Of these frequent alerts, which ones are dealt with quickly and/or easily?

Ask yourself those questions as you look over the monitoring events. Most teams have a fair amount of those nagging events: things that happen fairly regularly and are a simple fix. These are the types of events you should auto-remediate. Pick one and automate it! There is no need to automate ALL of the alerts you find right away.
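If your monitoring system can export its alert history, those two questions can be answered with a few lines of code. The sketch below assumes the history is a list of `(alert_name, minutes_to_resolve)` pairs; that shape, the function name, and the default thresholds are all assumptions for illustration, so adjust them to whatever your monitoring system actually exports.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable, List, Tuple


def remediation_candidates(
    history: Iterable[Tuple[str, float]],
    min_count: int = 5,
    max_median_minutes: float = 15.0,
) -> List[Tuple[str, int, float]]:
    """Return (alert, count, median minutes to resolve) for frequent, quick-to-fix alerts."""
    durations = defaultdict(list)
    for name, minutes in history:
        durations[name].append(minutes)

    candidates = [
        (name, len(times), median(times))
        for name, times in durations.items()
        if len(times) >= min_count and median(times) <= max_median_minutes
    ]
    # Most frequent alerts first: these give the biggest payoff when automated.
    return sorted(candidates, key=lambda c: c[1], reverse=True)
```

Alerts that fire often but take a long time to resolve fall out of this list. Those are usually better candidates for facilitated troubleshooting than for full remediation.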

Which Events Are Easy Targets for Facilitated Troubleshooting?

Take a look at your monitoring events again. Look for more general alerts that require you to touch many different systems or applications to troubleshoot. These make great candidates for facilitated troubleshooting. StackStorm can contact all these systems to gather this data, saving you (or the on-call engineer) from having to check everything manually.

Things like application latency alerts are perfect for this. You may need to check the health of networking equipment, look for long-running queries or deadlocks in the database, search for out-of-memory errors, and so on.

Another great example provided by our friends at Netflix was building a rich context around alerts from their monitoring system (Atlas).  They leverage the power of StackStorm to make API calls out to other tools (such as their deployment tool, Spinnaker).  Making the API queries and building the context is not something that most monitoring systems can do…at least not easily.  Make use of workflows to do this heavier work for you.

How Do I Get Others To Love Auto-Remediation?

Often, the largest barriers to getting started with auto-remediation (or automation in general) are not technical. Team members may have had bad experiences with automation, or there may even be a fear of “automating oneself out of a job”. The best way to overcome these issues is to show people the added value of automating a process. One of the quickest ways to do this is by giving them visibility into what is being automated.

Make sure you are adding notifications to all your workflows! The team should see all the awesome things that you are automating. Let them see all the work that your new workflows are doing for them. StackStorm has a great notification system built in that can make this significantly easier:

StackStorm Notifications

Leverage collaborative tools like ticketing systems or ChatOps to share this information. Make it as seamless as possible for everyone. If most of the event management and communication is done via JIRA or Bugzilla, have the automations update the appropriate issue or ticket in those tools. On the other hand, if chat is more prevalent, post the notifications to the appropriate chat channels. By giving the team early notification of events, and a rich context around each event, you’ll be able to quickly show the value of automation.

Next Steps

Now that you are auto-remediating your disk space alerts, doing facilitated troubleshooting for your application latency issues, and automating the E_NOTENOUGHCAFFEINE errors at your desk, you may ask “What’s Next?”.

Well, first and foremost, if you wrote an awesome workflow we’d love to see it! Share your operational patterns with the community. You can either make your own publicly accessible GitHub repo, or submit a pull request against ours! Let others take advantage of the remediations you have written, and maybe even help you improve them.

StackStorm Exchange

There are a number of different ways to proceed from here, but one of the best routes is ChatOps. For more information on ChatOps, check out our docs:

StackStorm ChatOps

And of course, there is StackStorm Enterprise. This gives you access to role-based access controls, LDAP authentication, and the awesome graphical workflow designer, Flow. Flow is a fantastic utility for creating your workflows as well as sharing them.

Last but certainly not least…join our community and our Automation Happy Hours!

Sign up for the StackStorm Slack Community here:

StackStorm Community Sign Up

And keep an eye out here for our next Automation Happy Hour:

Automation Happy Hour Registration