June 7, 2017
by Matt Oswalt
The team here at StackStorm was psyched to sponsor Monitorama 2017. This was the first Monitorama we attended, and it offered an interesting new take on the world of monitoring. A proper monitoring infrastructure is crucial for event-driven automation, since it helps define exactly what an “event” is in the first place.
Monitorama is a fairly small conference, opting for a single speaker track instead of several talks in parallel. I found this to be a very comfortable and engaging format; the organizers did a great job of structuring the schedule so that the topics flowed organically from one talk to the next, and even carried over into the hallway conversations. Also, because it was in Portland, a local coffee shop came and made pour-overs for everyone, which was a nice alternative to traditional conference coffee 🙂
Naturally, the conference focused on monitoring, but two common themes stood out to me:
I come from a networking background, and most monitoring products in that space have historically focused on simple up/down monitoring – is the device reachable or not? These days, metrics that coarse are no longer sufficient; we need more granular, programmatically accessible telemetry. The situation is slowly improving, but there’s still a lot of work to do.
As we’ve all seen recently, every IT discipline has had – out of necessity – to operate closer and closer to the applications. As a result, the line where “traditional monitoring” operates is blurring into things like application performance monitoring. No longer can the network engineer or sysadmin focus solely on whether their switch or server is up – they have to understand how applications interact with their infrastructure, and be able to produce detailed performance metrics and other telemetry on how that interaction is taking place.
John Rauser started the conference with a message from the future – okay, not really; his goal was to bring the world of data science into the monitoring toolchains in use today, in his talk “Finding Inspiration in the Data Science Toolchain”. In it, he illustrated several ways that ideas from the world of data science are having a big impact on monitoring techniques, and vice versa. For instance, some data scientists are finding value in the idea of infrastructure-as-code – describing visualizations and calculations in code rather than through a click-through GUI. The broader idea of taking the specialized skills of a few and making them consumable for the masses resonated with me greatly, and is definitely a big driver for what we do at StackStorm.
Ian Bennett (Twitter) gave a talk near the end of the last day on “Debugging Distributed Systems” that I found interesting. In it, he highlighted the need for traditional sysadmin skills and more developer-oriented skills to coexist in order to fully troubleshoot a problem. One minute he was talking about an optimal logging infrastructure, and the next he was diving into troubleshooting application performance by looking at how the JVM garbage collector handles certain string concatenations.
In general, I enjoyed this fresh take on monitoring. There are a lot of interesting ideas and tools just starting to take shape in this space, and I came away from these talks inspired with ideas for new StackStorm sensors that could connect this advanced logic to some sweet auto-remediation workflows.
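To make that a bit more concrete, here is a rough sketch of what one of those sensors could look like, assuming the st2reactor PollingSensor interface; the metrics endpoint, pack name, trigger name, and threshold below are hypothetical placeholders rather than anything discussed in the talks:

```python
# Hypothetical sensor sketch: poll a metrics endpoint and dispatch a
# StackStorm trigger when a latency threshold is crossed, so that a rule
# can launch an auto-remediation workflow. The endpoint, pack, and
# trigger names are invented for illustration.
import requests

from st2reactor.sensor.base import PollingSensor


class LatencySensor(PollingSensor):
    def setup(self):
        # In a real pack these values would come from the pack's config schema.
        self._url = self.config.get('metrics_url', 'http://localhost:9090/latency')
        self._threshold_ms = self.config.get('threshold_ms', 500)

    def poll(self):
        resp = requests.get(self._url, timeout=5)
        latency_ms = resp.json().get('p95_ms', 0)
        if latency_ms > self._threshold_ms:
            # 'my_pack.latency_breach' is a made-up trigger a rule could
            # match on to kick off a remediation workflow.
            self.sensor_service.dispatch(
                trigger='my_pack.latency_breach',
                payload={'latency_ms': latency_ms,
                         'threshold_ms': self._threshold_ms})

    def cleanup(self):
        pass

    # Trigger lifecycle hooks are not needed for this sketch.
    def add_trigger(self, trigger):
        pass

    def update_trigger(self, trigger):
        pass

    def remove_trigger(self, trigger):
        pass
```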
Alice Goldfuss gave a lively talk about the need for change in the culture of operations and being on-call. It was a very useful look into the way operations teams are traditionally run, and the impact that has on the human beings involved. She did a good job of mixing real talk about what’s wrong and what’s right with some funny stories along the way. There were some good ideas presented here – one that impacted me greatly was the need to stop viewing pages as a metric of success. I had many discussions later that week about this topic, especially as it pertains to StackStorm, since the whole idea of auto-remediation is to never have to solve the same problem twice. When you encounter an outage, troubleshoot it, fix it, then write a workflow that captures all of that logic in “code” – a workflow that performs the same steps on your behalf next time. Bottom line: don’t take pride in the fact that you get paged – work to reduce the number of times a human needs to be engaged to solve a problem.
These ideas and more – such as the need for developers and operations to work more closely together – really hit home for me, and I would recommend this talk for anyone, whether or not you consider operations your primary role.
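As a loose illustration of that “never solve the same problem twice” idea, here is a hypothetical StackStorm Python action that captures a simple “disk filling up” fix so a rule or workflow can run it the next time the alert fires; the base class import reflects the current st2 Python runner rather than the 2017-era one, and the mount point, service name, and commands are invented for the example:

```python
# Hypothetical remediation action: encodes the manual "disk is nearly full"
# fix (force a log rotation, restart the service, re-check usage) so it can
# be wired into an auto-remediation workflow. All names, paths, and the
# threshold are illustrative only.
import shutil
import subprocess

from st2common.runners.base_action import Action


class CleanupDiskAction(Action):
    def run(self, mount='/var', service='myapp', threshold_pct=90):
        usage = shutil.disk_usage(mount)
        used_pct = usage.used * 100 // usage.total
        if used_pct < threshold_pct:
            # Nothing to do; report current usage back to the workflow.
            return {'remediated': False, 'used_pct': used_pct}

        # The same steps a human would have performed by hand during the outage.
        subprocess.check_call(['logrotate', '--force', '/etc/logrotate.conf'])
        subprocess.check_call(['systemctl', 'restart', service])

        usage = shutil.disk_usage(mount)
        return {'remediated': True,
                'used_pct': usage.used * 100 // usage.total}
```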
I was pleased to participate in a panel at the first meeting of the Portland chapter of the Auto-remediation meetup, which took place on the last day of the conference at a local brewpub. It served not only as a great (and audience-interactive) discussion of automation and monitoring, but also as a good recap of Monitorama 2017. I highly recommend watching the video – there were a lot of great audience questions (that we hopefully answered):
It was clear that Monitorama’s audience leaned heavily towards the “ops” side. This kind of mix is no stranger to me, as I’ve been going to these kinds of conferences for a few years. However, unlike traditional IT conferences, where application and developer types are (for the most part) despised, Monitorama definitely took a more positive approach – opting instead to work better with developers, and even borrowing tools and ideas from the world of software in order to work more efficiently. This was greatly comforting to me, since I’ve been advocating for this approach in the world of networking for a few years.
I am excited for next year’s Monitorama, and am hoping we’ll be back as a sponsor. There’s a LOT to talk about at the intersection of monitoring and automation, and I feel strongly that StackStorm can provide a lot of value as monitoring makes this transition into the world of data science.