Meetup: How to automate 94% of incident responses: Facebook, Neptune.io, autoremediation

Feb 06, 2017

by Dmitri Zimine, @dzimine

Last Thursday, we had another great meetup session on Auto Remediation and Event Driven Automation. This time it was in San Francisco at Make School, thanks to our new sponsors Neptune.io. The drive from South Bay was totally worth it!


The Facebook Infrastructure Orchestration team presented their FBAR. No, this is NOT the FBAR talk you may have heard a hundred times. It was the premiere of a new talk about their System Lifecycle Automation, which includes a newer and better FBAR. The FB guys keep an air of secrecy around it: take no video, post no slides – come see it in person. Those who came were truly rewarded.

There are two types of remediations in FBAR: those that come out of the box, built by the SRE team, and those built by application owners. Results?

94% of alarms are cleared without human intervention.


The value of auto-remediation is immediately obvious from these numbers. “At our scale, even with 2% coming to us it’s a lot of manual work,” says Gabriel dos Santos.

James Mills went over advanced topics: rate limiting, preventing “run-away automations”, job prioritization, automating large batches of hosts… When FB says “large” they mean it:
* Hundreds of millions of distinct jobs per month
* Thousands of years of combined run time per month
* Hundreds of requests per second
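
To make the “run-away automation” concern a bit more concrete, here is a minimal sketch of one common safeguard: cap how many remediations may start within a sliding time window, and hand anything over the cap to a human. This is NOT how FBAR does it (no implementation details were shared at the meetup); the `RemediationGuard` class, the `handle_alarms` helper, and the limits used are all hypothetical and chosen only for illustration.

```python
from collections import deque


class RemediationGuard:
    """Hypothetical guard: allow at most `max_runs` remediations
    per `window_seconds`, using a sliding window of start times."""

    def __init__(self, max_runs, window_seconds):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self._started = deque()  # timestamps of recently allowed runs

    def allow(self, now):
        # Forget runs that have fallen out of the window.
        while self._started and now - self._started[0] >= self.window_seconds:
            self._started.popleft()
        if len(self._started) >= self.max_runs:
            return False  # over the limit: do not auto-remediate
        self._started.append(now)
        return True


def handle_alarms(alarm_times, guard):
    """Split alarm timestamps (in seconds) into auto-remediated and escalated."""
    remediated, escalated = [], []
    for t in alarm_times:
        if guard.allow(t):
            remediated.append(t)
        else:
            escalated.append(t)  # a human takes a look instead
    return remediated, escalated


# Illustrative limits only: 3 remediations per 60 seconds.
guard = RemediationGuard(max_runs=3, window_seconds=60)
remediated, escalated = handle_alarms([0, 1, 2, 3, 4, 61], guard)
print("remediated:", remediated)
print("escalated:", escalated)
```

The point of the guard is that a burst of identical alarms stops triggering automation after the first few, so a bad remediation cannot keep firing across a whole fleet; the timestamps are passed in explicitly to keep the sketch deterministic.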

Next on stage, Neptune.io founder Kiran Gollu drove home the need for, and the business value of, automated diagnostics and remediation: for any sizable ops team it is not “if”, it’s “when”. It is just a matter of time and of the incident response team’s maturity before automated diagnostics and remediation become a necessity.

Today, 95% of your incident’s MTTR is still manual

This resonates: “I am used to having FBAR at FB; now we need something like this at Uber,” says Rick Boone, ex-FB production engineer, now SRE at Uber. Hey Rick, hope you’ll like StackStorm!

There would have been no fun without a good demo, and we enjoyed a live demo of Neptune.io. It is a hosted auto-remediation solution that is impressively easy to set up, with a nice event consolidation UI and concepts familiar to anyone who has done auto-remediation. StackStorm and Neptune.io are technically competitors, but we share a passion for auto-remediation. I really love some aspects of their solution and wholeheartedly wish Neptune growth and success.

Of course the best part of a meetup is meeting up – with like-minded folks. We are a small group (only 670 :P) of experienced devops practitioners and thought leaders who are knowledgeable and passionate about automating operations. We spent a good time hearing each other’s stories, learning perspectives, posing and trying to answer challenging questions.

Among other things, we talked about event correlation, the state of live event processing in the industry, and the technical reason why this functionality is still missing in FBAR, StackStorm, and Neptune.io, and we brainstormed some ways to solve the problem – a topic worth a dedicated blog post.

We are lining up more igniting talks and inviting speakers for our next sessions. What is YOUR story? Care to share? Please propose a topic. To do so, go to the Auto-Remediation meetup page and press the big red Suggest a Meetup button.

We are talking about the same topic on the StackStorm community Slack today.

And for continuous conversation (CC) on event driven automation, join us on the StackStorm Slack channel – stackstorm.com/community-signup.

Until next time,

DZ.