Resilience and Waste in Software Teams

Announcement: “All passengers for Chicago at 11:45, please come to gate A14 immediately for an earlier departure.”

This came through at 10:35 am, after that 11:45 to Chicago was delayed by three hours. United has an earlier flight with seats available, and they are holding it for 30 minutes. I will get to Chicago early.

Flights everywhere are delayed today because of this morning’s FAA outage. Their venerable NOTAM software (roughly: status page, designed for teletype) glitched and grounded everything for hours.

Every airline is affected. No airline could prevent this. The question is, how do they recover?

Two weeks ago (December 2022), Southwest demonstrated a catastrophic inability to recover from some bad weather. After a bunch of canceled flights, they lost track of where their people were. Southwest canceled most of their flights for the next week, the week after Christmas. Good thing I’m not flying Southwest today. They are not resilient to widespread delays and cancellations.

Recovering from unexpected events is part of resilience.

Resilience is deeper than reliability. Reliability can be measured: it is how often you perform as expected. Southwest has a higher on-time percentage than United, in the data I looked at.

Resilience is deeper than robustness. Robustness can be tested: it is performing as expected under a wide range of conditions. Southwest demonstrated a few weeks ago (December 2022) that it is not robust to a pile of delays and cancellations.

Resilience is the “how” behind reliability and robustness. Resilience is recovering from unexpected events, and coping with unexpected change. When a flight is delayed, how much does that take down the whole system? How many travelers are stranded?

United today is resilient to my delayed flight, because it has extra seats on other flights. The staff in St. Louis have the ability to hold the 10:30 flight, and the information to see that all the 11:45 passengers will fit. The gate agent is missing her lunch break to rebook them all. ☹️

Those extra seats represent slack. There is room in the system to absorb surprise.

On a smoother day, those empty seats represent waste. A perfectly efficient airline would fill every seat. I experienced that when my flight was canceled on the 22nd of December: “You can rebook for the 25th.” No, that won’t work and I have status, so United booked me on American. That cost them something, but nothing like Southwest’s debacle following that storm, which has cost $800M so far.

Efficiency + surprise = shocking price tag. Resilience + surprise = a minor bump.
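
To put toy numbers on that, here is a minimal sketch of one route on one day. Everything in it is made up for illustration: the `Flight` class, the eight flights, the 100-seat planes, and the 85% load are all assumptions, not airline data. It cancels one flight and counts how many passengers can't be rebooked onto the spare seats of the others.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flight:
    capacity: int
    booked: int

    @property
    def spare_seats(self) -> int:
        return self.capacity - self.booked


def stranded_after_cancellation(flights: List[Flight], cancelled_index: int) -> int:
    """Passengers left without a seat after one flight is cancelled and its
    passengers are rebooked onto spare seats on the remaining flights."""
    displaced = flights[cancelled_index].booked
    spare = sum(
        f.spare_seats for i, f in enumerate(flights) if i != cancelled_index
    )
    return max(0, displaced - spare)


def make_schedule(n_flights: int, capacity: int, booked: int) -> List[Flight]:
    """Build a day of identical flights (a deliberately crude model)."""
    return [Flight(capacity, booked) for _ in range(n_flights)]


# Hypothetical day: 8 flights of 100 seats each on the same route.
efficient = make_schedule(8, 100, 100)  # every seat sold, zero slack
with_slack = make_schedule(8, 100, 85)  # 15 empty seats per flight

print(stranded_after_cancellation(efficient, 0))   # 100
print(stranded_after_cancellation(with_slack, 0))  # 0
```

The model ignores timing, connections, and crews, which is where the real trouble lives. It only shows the shape of the tradeoff: the fully booked schedule strands everyone on the canceled flight, while the one with empty seats absorbs them.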

What does resilience look like in software?

When a team supports production, resilience means the software stays up, recovers quickly when it goes down, and doesn’t turn a hiccup into a catastrophe. For instance, when an engineer types a command wrong, it doesn’t ground all planes in the US for hours.

When a team changes software, that work can be resilient too. Given a small requested feature, does the team have a decent idea how long it will take? Do they often run into surprises that bite them and make it take longer? Does a small change take weeks sometimes? Does it lead to six bugs that slow everything afterward?

A few years ago at Honeycomb, we added Unicode support to customer data and names. Now we can display spans named “🍕” or “⛔” or “跑步.” (It’s about supporting other languages, but internally we use it for pizza. Internationalization FTW.)

It doesn’t sound too hard. We store some of this data in MySQL, which supports full Unicode… in a version we didn’t have yet. We designed a whole three-stage no-downtime database migration to do the upgrade, plus moving all our string encodings to a new format. The implementation took several weeks.
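
Why would pizza be harder than 跑步? One plausible reading, and it is an assumption since this post doesn't name the character sets involved: MySQL's older `utf8` character set stores at most three bytes per character, while `utf8mb4` stores up to four. This small sketch measures how many UTF-8 bytes each character in those span names needs. The helper names are hypothetical.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character would fit a 3-byte-per-character column."""
    return all(utf8_width(ch) <= 3 for ch in text)


# The span names from above, written as escapes so the code points are explicit.
span_names = {
    "pizza": "\U0001F355",      # 🍕
    "no entry": "\u26D4",       # ⛔
    "running": "\u8DD1\u6B65",  # 跑步
}

for label, name in span_names.items():
    widths = [utf8_width(ch) for ch in name]
    print(label, widths, fits_in_three_bytes(name))
# pizza [4] False
# no entry [3] True
# running [3, 3] True
```

Under that assumption, the Chinese text would have fit all along. The pizza is the one that needs four bytes.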

That kind of “this would be easy if only our database/libraries/language were up-to-date” surprise can bite you. So can a key piece of code that’s so convoluted we introduce bugs every time we touch it. It’s like I’m trying to cook but the kitchen sink is full of murky water and knives.

At the extreme, software ossifies and becomes impossible to change. No one knows the code, or the tech it runs on, or what all it does.

Keeping libraries and components up-to-date, keeping code readable, updating our automations, improving our observability, bringing other developers up to speed: these are a few of the tasks developers need to do regularly. Any one of these tasks could have no noticeable impact in the future, and any one of them could prevent the next big security incident. The most likely outcome of each is a smoothing of future work, a decrease in unpleasant surprise.

Last time I implemented a feature in the Honeycomb UI, I needed some React functionality that was only in the latest version. I looked at our package.json, and lo! We were on the latest version! I rejoiced, and my work proceeded.
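
That glance at package.json can be scripted. The sketch below is only an illustration: the manifest contents, the `^18.2.0` range, and the required major version 18 are all hypothetical, since I haven't said which React version or feature was involved. It also reads only simple version specs (an optional `^` or `~` followed by digits), not full semver ranges.

```python
import json
import re
from typing import Optional


def declared_major(package_json_text: str, dependency: str) -> Optional[int]:
    """Return the major version a package.json declares for a dependency,
    or None if the dependency is absent or the spec isn't a simple one."""
    manifest = json.loads(package_json_text)
    for section in ("dependencies", "devDependencies"):
        spec = manifest.get(section, {}).get(dependency)
        if spec is not None:
            match = re.match(r"[\^~]?(\d+)", spec)
            return int(match.group(1)) if match else None
    return None


def meets_minimum(package_json_text: str, dependency: str, needed_major: int) -> bool:
    """True if the declared major version is at least the one we need."""
    major = declared_major(package_json_text, dependency)
    return major is not None and major >= needed_major


# Hypothetical manifest, not Honeycomb's real one.
sample_manifest = '{"dependencies": {"react": "^18.2.0"}}'

print(declared_major(sample_manifest, "react"))     # 18
print(meets_minimum(sample_manifest, "react", 18))  # True
```

The check is trivial when someone has kept the dependency current. The cost shows up on the day it returns False.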

Many of these tasks don’t make it onto the roadmap, because when I look at the overhead of creating a ticket, discussing it in planning, and advocating for it, I can’t. It isn’t worth that. I can’t justify any particular one. Instead, these are best done as we go. Oh look, this test is in the old framework, let’s update it. This name confused me, let’s change it. In the kitchen, I always wash the knives and put them away as soon as I’m done chopping.

This kind of work happens when we have slack: when a development team isn’t under super crunch time, when project managers aren’t breathing down necks, when every commit doesn’t have to include a Jira ticket number. (Do put a rename or upgrade in its own commit. I don’t usually make a separate PR, because GitHub PRs are too slow.)

Updates and improvements like this are also great for the last hour of a day after I’ve just completed a task. Wait until the morning to start the next one. This keeps WIP lower.

When we allocate 100% of developer time to forward progress as defined by someone external, we miss all of the contextual knowledge that developers have about what needs to be done.

My rule of thumb is: spend 80% of our attention on stuff whose importance we can explain, and 20% on work with less predictable benefits.

When software is brittle, it falls over in production, and that falls to people to fix. While software can be robust to anticipated conditions, only people handle unexpected events. When software can’t even handle stuff that happens all the time, then people suffer the strain.

At Southwest, very old crew scheduling software left handling of canceled flights to people. People had to manually enter crew locations when they got out of position, and people had to talk on the phone to tell the crew what to do. Last Christmas, this got out of hand, and most flights were canceled for three days after Christmas. All the people in the company couldn’t make up for this software failure. (I hear the Fleet Chief of the airline processed refunds on the phone with customers.)

With United, I can track my bags in the app. I can take overnight flights. Southwest software doesn’t support this. Apparently unable to update their software, Southwest hasn’t coped with changing expectations. They’ve lost resilience.

Efficiency is only good for one thing. Whatever thing you make efficient, the system can do that and little else. Resilience takes slack, and slack looks a lot like waste. 

Last quarter, Southwest expected the highest profits of any airline. Instead, they’ll post a loss. And this year? I don’t know about you, but I’m canceling my Southwest credit card.

Resilience is good for more than one thing. It covers the “everything else” that makes our work strong. Profit is necessary for a healthy company. The rest of what we need, we can see from inside. When people have the support they need to do their work well, the company succeeds not only this quarter, but ongoing. The people and the company can become who and what is needed for the future.

Slack in development teams leaves room for sickness, for onboarding, for helping other teams. It leaves room for checking production to see whether the feature we released last week had any unexpected effects. It lets us do our best work. Resilience comes with healthier systems and healthier people.

Thanks to: Eric Evans, Matt Schillerstrom, Brad Koehn

This post is excerpted from a keynote, Sustainable Resilience, for CodeFreeze 2023 in Minneapolis.
