July 26, 2016
by Lindsay Hill
We love our users. We love them even when they report bugs. We love them because they report bugs. But we really, really love our users when they report bugs along with complete configurations, steps to reproduce the bug, and logs. Recently Brian Martin did just this, helping us resolve a tricky deadlock bug. Thank you, Brian!
When you’re investigating a bug report, the first thing you want to do is try to reproduce it. This helps you see what’s going on, and it gives you a clear test case to check whether you’ve resolved it. Deadlocks and race conditions can be horrible bugs to work with, as they are often difficult to reproduce. They might depend upon your hardware resources, or what other jobs are running at the same time. Even on the exact same setup, reproducing one can be tough.
So when you get initial reports of a possible deadlock, you groan, knowing that it’s going to be tough to prove/disprove the issue.
@bri365 logged issue #2814 two weeks ago: “Workflows with action policies can deadlock.” Yuck. Brian had identified a problem where the system deadlocks when the action_dispatcher_pool gets filled with workflows. All workflow actions are subsequently held in the _work_buffer queue indefinitely.
Basically, if you’re trying to run a lot of workflows at once, they might fill up the same queue that the actions use. The workflows want to run their actions, but they can’t, because the queue is full…of workflows. The queue will never empty.
It was a pretty good description. But it went on. Brian created an example pack, containing simplified actions and policies, and the commands required to trigger the deadlock. Further on, he added the complete commands required to create a StackStorm system, add this test pack, and trigger the deadlock, along with example CLI output.
He also proposed a couple of possible design changes to resolve the issue, and a workaround using another policy.
This triggered a discussion with the developers, who worked through a few ideas before agreeing on an approach and creating a patch.
If you think this bug might affect you, we’ve merged a patch in #2823. This is currently in the 1.6dev branch. We’re not too far away from releasing 1.6, but if you’re feeling keen, try out the unstable repo. Note that this has not undergone full testing!
Thanks Brian. We really appreciate it.
If you think you’ve found a bug, please help us to help you. The more detail you can give us, the faster we can resolve the issue. We’re always happy to chat about it on our #community Slack channel too. Jump in there and we can work through any issues you’re having.