Chatops Pitfalls and Tips

December 11, 2015
by Dmitri Zimine

You are starting with ChatOps.

You have already watched Jesse Newland, Mark Imbriaco and our own James Fryman and Evan Powell preaching it. You’ve read the links on reddit, and skimmed ChatOps blogs from PagerDuty and VictorOps. You’ve studied ChatOps for Dummies.

Congratulations and welcome to the journey: ChatOps is an awesome way to run development and operations. I’ll spare repeating why ChatOps is good – you’re eager to get going. I’d rather focus on a few common pitfalls and misconceptions that can get you off track.

TL;DR

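This is a loooooong blog post… so here are the topics, jump right in if you’d like:

  • Coffee? JS? Ruby? Python? A Fallacy of False Choice.
  • Bot or Not
  • Best Chat for Chatops, or Is Slack eating everything?
  • It’s a duplex, dummy
  • Towards a smarter bot
  • Chatops, StackStorm way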

Coffee? JS? Ruby? Python? A Fallacy of False Choice.

One of the first things you learn about ChatOps is “Bots”. There are bots, quite a few of them. Hubot, Lita, and Err are often cited as the most popular, but there are many more. They are important, and you tend to believe you need to think carefully about selecting your bot, because this choice defines the programming language you’ll use for ChatOps and other aspects of your implementation.

“Your preference on programming languages may determine which bot you ultimately choose.”
Jason Hand, ChatOps for Dummies

There is a problem, though: this line of thinking leads to a naive implementation of ChatOps which I have seen far too often. Teams often select a bot based on a preferred language to write ChatOps commands. They then go and implement these actions as Bot plugins. To add another command – add another CoffeeScript (or JS, or Ruby, or Python, etc)… Warning: This is the Wrong Way.

Let me illustrate with a concrete example: provisioning a VM on a cloud. When ChatOps-ed, it would look something like this:

      > user: ! help create vm 
      < bot: create vm {hostname} on { provider: aws | rackspace } 
      > user: ! create vm web001 on aws
      < bot: on it, your execution id 5636fac02aa8856cc3f102ec 
      ... < some chatter here > ...
      < bot: hey @user your vm is ready: 
             web001 (i-f99b4320) https://us-west-2...

What is under the hood? The VM creation is likely 4-7 steps calling low-level scripts (create VM, wait for ssh to come up, add to DNS, etc). Times N, where N is the number of providers, and the steps may differ slightly for each. Just for one command. Do you want to place all this stuff under the Bot? And make the bot your “automation server”, your control plane? That leads you to deal with security, availability, logging and other aspects that Lita, Hubot, and Err don’t bring out of the box. Or do you consider putting these scripts somewhere else and writing a CoffeeScript to make it ChatOpsy with the help and human-talk-like syntax? That leads to two smoking piles of scripts, and a maintenance nightmare to keep all that smoke in sync.

Don’t fall into the fallacy of false choice here. The right choice is: “base Chatops on the Automation Library, expose actions to scripts with no extra code”.

The Automation Library is a core Operational Pattern that states that the operations against the infrastructure comprise an automation library that can be written in any language and versioned, reviewed, and available to ops and developers with fine-grain access control on who can run what and where. This automation library becomes your control plane. As such, it must be secure, reliable and highly available not only from ChatOps but from API, CLI, and hopefully GUI.

If you do use an Automation Library, you can then expose actions easily to chatops. A good solution will provide chat friendly syntax, help, and other goodies with no need for extra code on the Bot side. Bots are a part of the ChatOps solution, but they are like wheels, not the whole car, thus their choice is an internal implementation detail.
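To make "no extra code on the Bot side" concrete, here is a minimal Python sketch of the idea. It is not how any particular bot or StackStorm does it: `create_vm`, `ALIASES`, `dispatch` and `help_text` are hypothetical names made up for illustration. The point is that the chat syntax is data attached to a library action, so the command parser and the help text both come from one format string.

```python
import re

# Hypothetical automation library action: a plain function that could just as
# well be called from a CLI or an API. A real one would run the full workflow.
def create_vm(hostname, provider):
    return f"creating {hostname} on {provider}"

# Aliases are data, not bot plugins: a chat format string mapped to an action.
ALIASES = [
    {"format": "create vm {hostname} on {provider}", "action": create_vm},
]

def compile_format(fmt):
    """Turn 'create vm {hostname} on {provider}' into a regex with named groups."""
    parts = re.split(r"\{(\w+)\}", fmt)
    pattern = ""
    for i, part in enumerate(parts):
        # Odd indexes hold the parameter names captured by re.split.
        pattern += f"(?P<{part}>\\S+)" if i % 2 else re.escape(part)
    return re.compile(f"^{pattern}$")

def dispatch(message):
    """Match a chat message against every alias and run the first hit."""
    for alias in ALIASES:
        match = compile_format(alias["format"]).match(message.strip())
        if match:
            return alias["action"](**match.groupdict())
    return None

def help_text():
    """Help is generated from the same alias data, so it cannot drift."""
    return [alias["format"] for alias in ALIASES]
```

With this in place, `dispatch("create vm web001 on aws")` calls the library action, and adding a new chat command means adding one more entry to `ALIASES` rather than writing another bot plugin.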

We have learned that those who are doing ChatOps right do it this way. Take GitHub: they have a library of automation scripts, some of which are exposed to Hubot, with no extra Coffee on the Hubot side (this part of their Chatops solution is not open sourced yet). Or take the devops folks from Oscar Insurance who presented their impressive ChatOps solution at Atlassian Summit. Or take WebEx Spark (StackStorm user!) – they have spoken about how they use Spark for ChatOps with StackStorm underneath as the automation library and more.

Tip: Build your automation library first. Then expose some actions to Chatops. Stay in control of which actions you expose and which you keep.

Bot or Not

What about the other extreme? For example, take the off-the-shelf integrations from services like Slack or HipChat… are you now doing ChatOps? Are 88 plugins by Slack not enough to do what I want? What about 112 HipChat integrations?

No, it won’t be enough.

Yes, there are great integrations and it is practical to leverage them here and there. Our team is on Slack, where we happily use Travis-CI, re:amaze, and a bunch of configured email integrations. We absolutely love to /hangout when we need to talk.

But when it comes to a full ChatOps solution, you’ll quickly find that the exact integrations you need are either: 1) not there, 2) not doing what you need, or 3) not doing it the way you want.

  • Not there: NewRelic and Nagios happen to be in Slack, but Sensu, Logstash and Splunk integrations are not.
  • Not doing what you need: It’s fine that Slack’s Jira integration posts issues and updates. But what I really want is to create a Jira issue from my chat. It is not possible today. Even the revamped HipChat Jira integration is not doing it, though you might think it would, coming from the same company.

And what can YOU do about it? Complain? File a feature request? Hack up a slash command one-off? I move that you stay in control of your integrations, on what tools are being integrated, and how exactly they are integrated. Some of the integrations, incidentally, will be custom scripts against proprietary endpoints. Supporting it natively is a stretch for Slack, HipChat, or any chat service. A bot gives you that level of control, and integrates smoothly with the Chat services.

“Some could argue that a chatbot isn’t absolutely essential to begin your journey into ChatOps. All chat clients outlined in Chapter 2 offer a wide range of third‐party integrations that can allow users to begin querying information and interacting with services without the help of a bot. It wasn’t until teams began building more complicated scripts that chatbots became an important piece of ChatOps. Nowadays, to take full advantage of ChatOps, you really need a chatbot to execute commands that are specific to your own organization’s infrastructure and environment.”

Jason Hand, “ChatOps for Dummies”

Tip: Don’t limit yourself to what’s out of box in your Chat platform. Own your ChatOps commands. Use a bot to expose your automation library to a chat platform.

Slack’s slash commands and incoming webhooks give a solid foundation for a custom Chatops solution, to the extent you are fine with exposing [part of] your control plane over a public REST endpoint. It beats the “naive approach” by far. And it doesn’t require a bot. So, should you lock in with Slack? That brings me to my next point.

Best Chat for Chatops, or Is Slack eating everything?

While on the topic of Chat platforms and services: how do you choose one? Which one should you choose and why? Should you look for a more “ChatOps friendly” chat if you’re already using an established service? What about when you hear the claim that “the XXX Chat is a true ChatOps platform”, do you believe it?

The truth is ChatOps works with ANY chat platform, with any chat client. So the choice of a chat platform is entirely your team’s choice.

While Slack is today’s favorite with an incredible 1.7M users, history teaches us that favorites change. It was HipChat and Flowdock two years ago, Campfire four years ago, and IRC remains steady at ~300,000 users, a timeless favorite for many hardcore ops. New chat platforms grow like mushrooms after the rain, including open source “Slack alternatives”: Mattermost, Kandan, and Zulip are only a few that have come up on my radar recently.

Think of a chat. Pick any. Got one? Good, now, here’s a secret: there are a few teams that WILL NOT BE USING IT, for reasons beyond our reasoning or control. A “must be on prem” policy. A lead Architect hates non-OSS software. A team is writing their own chat. Someone took this anti-Slack debate (https://news.ycombinator.com/item?id=10486541) too close to heart. Any of 1,000 other reasons.

There will always be different chats out there to choose from. It’s a choice that your team should make, or likely has already made, based on the merits of the Chat itself. Chatops can live on any platform, and when it’s frictionless and makes people’s jobs easier, they won’t care if it’s graphical, text, or whatsoever. The right ChatOps solution will support the chat of your choice. It will take advantage of a given chat platform, for example by leveraging HipChat or Slack syntax formatting, while gracefully degrading to text for IRC. That’s where bots come into play: they provide an interface for a variety of chat platforms, giving a layer of abstraction and customization. That, not a programming language for commands, is the basis to pick a bot for your DIY chatops. That is how we leverage bots in StackStorm.
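Here is a small Python sketch of what "take advantage of the platform, degrade gracefully to text" can look like. The renderer functions and the `RENDERERS` table are hypothetical, not taken from any bot; the only platform detail assumed is that Slack message markup writes a labeled link as `<url|label>`.

```python
# Hypothetical renderers: one per chat platform, with plain text as the fallback.
def render_plain(title, url):
    """Works everywhere, including IRC."""
    return f"{title}: {url}"

def render_slack(title, url):
    """Assumes Slack message markup, where a labeled link is <url|label>."""
    return f"<{url}|{title}>"

# Platforms with richer formatting register here; anything else gets plain text.
RENDERERS = {"slack": render_slack}

def render(platform, title, url):
    """Use the rich renderer when one exists, otherwise degrade to text."""
    return RENDERERS.get(platform, render_plain)(title, url)
```

The command that produced the graph link never needs to know which chat it is answering in. Supporting the next chat platform means adding one renderer, or adding nothing at all if plain text is good enough.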

Tip: Stay in control of what chat platform to use. It’s the choice of your team. Turn away from solutions forcing their own chat on you. Use a bot to make your ChatOps solution support the chat client your team loves today or will love tomorrow.

It’s a duplex, dummy

How do you like a team member who only responds to requests, never says a thing or cracks a joke? The same applies to your ChatOps Bot. Just firing off commands from the chat is not good enough; it goes in both directions: when something happens with your infrastructure, the bot should notify the chat room. With the proper two-way integrations your ChatOps will rock like GitHub’s. Here is a fictional example based on watching the GitHub Ops team in their devops lair in Nashville (glimpse here):

      > bot: twitter says "we are down"
      < user: @bot shut up twitter
      > bot: twitter silenced for 15 min, get busy fixing stuff fast!
      > bot: Nagios alert on web301: CRITICAL, high CPU over 95%
      < user: @bot nagios ack web301 
      < user: @bot graph me io,net on web301
      > bot: @user Here's your graph: https://mygraph.example.net/web301?show=io,net
      > other_user: looks like it's just high load. Let's add couple of nodes!
      < user: @bot autoscale add 2 nodes to cluster-3
      > bot: @user On it, your execution is 5636fac02aa8856cc3f102ec 
             check the progress at https://st2.example.net/history/5636fac02aa8856cc3f102ec

In this short dialog, a bot acts as a two-way relay between the infra and the chat. It reports events and responds to users’ commands. Under the hood, a solution wires in various sources of events, like Nagios, NewRelic, or Twitter (true, GitHub uses Twitter as a monitoring tool), and relays them to chat by some rules. A “shut up twitter” may disable a rule for a period of time; a “nagios ack” may call Nagios to silence an alert. Other commands call actions which do as little as forming and posting a URL, or as much as launching a full-blown multi-step auto-scaling workflow.

      StackStorm chatops two-way implementation:

             Infra ---> st2 Sensors ---> st2 Rules ----------> Chat
             Infra <--- st2 Actions <--- st2 API   <-- Bot <-- Chat

Again, don’t trap yourself in believing that out-of-box integrations from Slack, HipChat, or “the Next Big Chat” will be enough. It is not just vendor lock-in, and not just the lack of code to control, update, or edit settings. Think about your behind-firewall logstash and graphite, or posting collectd charts when Sensu events fire. Your toolbox will always be ahead of the mainstream. Design for that, and stay in control of your integrations with your infra and tools.

There is another trap when it comes to incoming integrations: it’s too easy to have them spread out all over your tool set. It’s tempting to post an alarm straight from a Sensu handler to Slack. To use the stock “Splunk – New Relic” integration. To add a “post to HipChat” block to the end of your provisioning script. At the beginning it looks fine. Warning: Wrong Way.

This approach gets messy very fast. As fast as n*(n-1), where n is the number of tools your team uses. And NO, it’s not n*(n-1)/2, as integrations are two-way. For each integrated tool, you need both incoming and outgoing integrations. Triggers and actions. Sensu sends alert (incoming, trigger) and Sensu silence alert (outgoing, action). Jira update ticket (action) and Jira on ticket update (trigger). Once you go beyond two or three tools, it quickly spirals into unmanageable, unmaintainable spaghetti. Where is all my automation? How do I turn it off?
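To put numbers on it, here is the arithmetic as a tiny Python sketch. The `through_hub` figure is my own assumption about a consolidated setup, where each tool needs exactly one trigger into and one action out of a central place; the n*(n-1) figure is the one above.

```python
def point_to_point(n):
    """Directed integrations when every tool is wired to every other tool."""
    return n * (n - 1)

def through_hub(n):
    """Assumed hub model: one trigger in and one action out per tool."""
    return 2 * n

# Compare the two approaches as the toolbox grows.
for n in (2, 3, 5, 10):
    print(f"{n} tools: {point_to_point(n)} direct vs {through_hub(n)} via hub")
```

With three tools the two approaches tie at 6. With ten tools it is 90 one-off integrations against 20.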

Just like a consolidated, shared library of actions, you need a shared, consolidated library of “rules”, defining what gets posted to chat on which events and how. And just like a library of actions, these rules had better be readable, scriptable code under version control, with API, CLI and other goodies. If this reads like a shameless plug for StackStorm, it is because our team believes in this so much that we’ve made it the center of our design.
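As an illustration only, here is a Python sketch of such a rule library kept in one place. This is not StackStorm’s rule format; `RULES`, `silence` and `route` are hypothetical names. It shows the two properties that matter: every event-to-chat route is visible in one table, and a rule can be switched off for a while, as in the “shut up twitter” exchange earlier.

```python
import time

# Hypothetical rule set: which event sources get posted to which channel.
RULES = {
    "twitter": {"channel": "#ops", "template": 'twitter says "{text}"'},
    "nagios": {"channel": "#ops", "template": "Nagios alert on {host}: {text}"},
}
SILENCED_UNTIL = {}  # source name -> timestamp when posting resumes

def silence(source, minutes, now=None):
    """Disable a rule for a while, as in '@bot shut up twitter'."""
    now = time.time() if now is None else now
    SILENCED_UNTIL[source] = now + minutes * 60

def route(event, now=None):
    """Return (channel, message) for an event, or None if nothing is posted."""
    now = time.time() if now is None else now
    rule = RULES.get(event["source"])
    if rule is None or now < SILENCED_UNTIL.get(event["source"], 0):
        return None
    return rule["channel"], rule["template"].format(**event)
```

The answer to “where is all my automation, and how do I turn it off?” becomes: it is in `RULES`, and you call `silence`.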

Tip: Design ChatOps for two-way communications. Build a consolidated control plane for the event handlers to provide visibility and control of what events are posted to the chat.

Towards a smarter bot

This came up in a recent conversation with folks implementing Chatops with StackStorm. We began to brainstorm how to make the Bot act more like a human. One idea that came up is “carrying the context of a conversation”. That means that the bot asks a question and I can just answer “@bot yes”, and just like a human, the bot will be smart enough to know what I am saying “yes” to. Or maybe careful enough to ask clarifying questions:

     > bot: @dzimine you mean "yes" to "should I restart web301?" If so, say "pink martini"
     > dzimine: @bot pink martini
     > bot: ok @dzimine, on it, your execution is 5636fac02aa8856cc3f102ec
...
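A rough Python sketch of the context idea, under the simplifying assumption that the bot keeps at most one pending question per user in memory. `PENDING`, `ask` and `reply` are hypothetical names, not part of any bot’s API.

```python
# Hypothetical in-memory store: one pending question per user.
PENDING = {}

def ask(user, question, phrase, action):
    """Remember what the bot asked so a short reply can be resolved later."""
    PENDING[user] = {"phrase": phrase, "action": action}
    return f'@{user} {question} If so, say "{phrase}"'

def reply(user, text):
    """Resolve a reply against the question this user was last asked."""
    pending = PENDING.get(user)
    if pending is None:
        return f"@{user} I have not asked you anything."
    if text.strip().lower() != pending["phrase"]:
        return f'@{user} please say "{pending["phrase"]}" to confirm.'
    # The phrase matched: forget the question and run the deferred action.
    del PENDING[user]
    return pending["action"]()
```

A bare “yes” does not trigger the restart here; only the agreed phrase does, and only once.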

Another example of a smarter bot is providing two-factor command authorization, where two people should +1 an action. This comes in handy when launching mission-critical automations. Surely it requires some workflow capabilities on the script side, but it can be done.
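The core of the +1 logic is small. Here is a Python sketch with a hypothetical `ApprovalGate` class; the rule that requesters cannot approve their own request is my assumption, and a real implementation would also need timeouts and an audit trail.

```python
class ApprovalGate:
    """Runs an action only after enough distinct users have approved it."""

    def __init__(self, action, requester, required=2):
        self.action = action
        self.requester = requester
        self.required = required
        self.approvers = set()
        self.done = False

    def approve(self, user):
        """Record a +1; return the action's result once the quorum is reached."""
        # Assumption: the requester cannot approve their own request.
        if self.done or user == self.requester:
            return None
        self.approvers.add(user)  # a set ignores repeated +1s from one person
        if len(self.approvers) < self.required:
            return None
        self.done = True
        return self.action()
```

The bot would create a gate when someone requests a guarded command and call `approve` for every “+1” it sees in the room.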

More brainstorming on smarter bot is happening on our chat, at stackstorm-community.slack.com. Please join and bring up your thoughts and ideas.

Chatops, StackStorm way

Time for a good StackStorm plug: we deliver an Automation Library, with a turn-key ChatOps solution out of the box. We have taken these lessons, and more, and turned them into code. With StackStorm’s ChatOps, you choose your chat, implement actions in the language of your choice, and use community integrations with dozens of devops and infra tools. StackStorm guides you to the “right way”: 1) start with an automation library, then turn any action into a ChatOps command just by giving it an alias; 2) consolidate event routing with sensors, then route any event to ChatOps just by adding a rule.

We think of ChatOps as not a sidekick, but an integral part of your control plane. Invest upfront, profit over time. With StackStorm, you progress from simple commands like “create Jira ticket” or “deploy VM” to more powerful ones, by combining them into workflows of many actions underneath and turning these workflow actions into chatops commands. For how it works, check out a journey towards Chatops from Cybera.

This week we released StackStorm 1.2; Chatops is a highlight of the release, with so many new things and improvements that the blog describing them is called “The New Chatops”. Please check it out, give StackStorm’s Chatops solution a try, send us feedback, and Happy ChatOpsing!