REdeploy (for the first time)

The inaugural REdeployConf wrapped up yesterday (as I write this). I’m already feeling withdrawal from intense learning and conversations. I’ll attempt to summarize them in this post.

The RE in REdeploy doesn’t mean “again” (lo, it is the first of its kind). RE stands for Resilience Engineering. It is a newish field, focused on sociotechnical systems that continue to function in shifting, surprising, always-failing-somewhere conditions (aka, reality).

John Allspaw opened the conference with: resilience is in the humans. Your software might be robust, but in the end, it does what it was told. Only humans respond in new ways to new situations. People can be prepared to be unprepared.

John Allspaw is so excited this conference exists

Resilience is the antidote to complexity. Except not a full antidote: the complexity is still there. It just doesn’t kill us. Complexity is not avoidable, because success begets complexity. A successful system has impact, and impact means interdependence, and interdependence means complexity.

What is resilience? Laura Maguire enumerated some definitions. Rebound, robustness, and graceful extensibility are partial definitions that build into the real one: Resilience is sustained adaptive capacity. It’s the ability to find new abilities, to change in response to changing conditions to maintain functioning. Resilient systems are not the same moment to moment, but they keep fulfilling their purpose (even as their purpose morphs).

four definitions of resilience, illustrated

Resilience Engineering is not a computer science discipline. It’s broader than that. Industries like nuclear power and air traffic control have deeper roots in the study of coping with failure. This isn’t your old-school Root Cause Analysis that asked “why did this fail?” This is systems thinking, asking “how does this succeed?” How do systems constantly subject to new failures keep running anyway? (hint: people.)

Avery Regier pointed out that root cause analysis can prevent a specific failure from recurring. But we find new failures all the time. Some new service is going to run out of space. Some new query is going to be slow. Some new customer is going to call a new API a whole lot more than we expected. Prevention is never going to cut it, so don’t spend all your resources there. Grow your powers of recovery, and you mitigate whole classes of failures.

Resilience Engineering recognizes that our systems include software and humans, so half the talks were about code and half about people. Matty Stratton extended trauma therapy to organizations, and Lee Kussmann gave strategies for personal resilience to stress (notes for both). On the code side, Cici Deng spoke about making safer changes at AWS Lambda: like most things in this science, improvement isn’t having the right answers — it’s asking better questions (notes). Aaron Blohowiak talked about speeding recovery and isolating failure domains at Netflix. Then Hannah Foxwell on HumanOps: there is no failover for You. People are more difficult to work with than software, so start there. (notes for both)

Mary Thengvall and J Paul Reed organized this conference to beget conversations, to seed a community in this space. Existing communities exist around the SNAFUcatchers and the Lund program. This new one is an open, informal camerata of people who care about resilience in humans+computer systems within the software industry.

Mary and Paul lead the conversation

They succeeded! The conference was a conversation: speakers referred back to prior talks. Mary and Paul emceed with commentary before and after every talk, weaving them together, sharing their reactions and enthusiasm. At the end of each day, the speakers turned into a panel for Q&A. The questions drew from and among all the talks.

Liz asked, how can we move an organization toward resilience from the bottom? Matt and Cici went back and forth over “use data” and “data won’t convince some people.” Any solution must be opt-in, and then you need to collect stories. Stories move people. When every system is different, stories are what we have. We can’t do controlled experiments. What we can do is: dig into those stories to find the causes of success. This is what researchers like Laura Maguire do.

In one of the last questions, someone asked, “Where is accountability in all this?” Cici said, we have tons of talk about accountability in our culture already. I agree; every movement is relative to the culture it is moving. Other answers suggested: Accountability is assumed, not assigned. Personal theroy: maybe accountability at the individual-human level is too narrow for the larger networks that we require to work with systems of complexity larger than a personbyte. MAYBE teams need to be accountable for working safely and effectively, and people need to be accountable to their teams.

Aaron had a lovely rant during Q&A about the “sufficiently smart engineer.” This is the hypothetical engineer who would not make such mistakes. Who would understand the existing system thoroughly. This person is a myth. Our software is too complex for one person to hold in their head. You can’t hire a sufficiently smart engineer, and don’t feel bad that you aren’t one, because it’s not a thing. Instead, we need to build systems that support our own cognitive work.

Resilience Engineering is a new science. Its research does not take place in a lab, but in the field. “We refuse to simplify.” Laura Maguire closed with a description of next steps in research. In our own jobs, we can do resilience engineering by looking for who and what makes us more safe (learn from success), by keeping the messy details instead of seeking a clean story, and by maximizing for learning in our symmathesy-teams (including software, tools, and people). For instance, when you find a “root cause” of a failure, look for other situations when that trigger occurred and failure didn’t.

RE researchers study DevOps in real situations

Other fun stuff:

We witnessed the first open-source releases from Deere and Co.

Heidi Waterhouse got rate-limited on twitter from quoting the talks.

Paul Carleton told a story of Stripe’s journey from “We should restart old EC2 instances” to “Oh look, we’re chaos engineers now.” Matt Broberg told a scary story about stopping forward motion, about ⟳technical debt and social debt⟲ at Sensu, and the perils of IRC. (notes for Matt, Paul, and Laura)

Atomist sponsored — I hope we can sponsor every edition of this conference! We work on tools to help developers integrate the social and technical parts of our systems, so it’s relevant. This was our first lanyard sponsorship and they were beautiful, in my very biased opinion.

Yesterday (as I write this) we recorded a >Code episode (#95) with Heidi Waterhouse, and she and I brought up topics from REdeploy about a dozen times. Me: “This conference is going to keep coming up, over and over, for the rest of my life.”

Thank you, Mary and Paul and Jeremy and everyone.