Principles of Collaborative Automation

(This is a transcription of the talk by the same name. Here is a video.)

Collaboration is crucial in software teams – and not only among people. We need our software, our tools, and our automations to collaborate as well. But what does that mean?

I have four prerequisites for you here, and then four precautions (or “ironies of automation”). These principles come out of the Resilience Engineering community, and studies of collaboration in humans working together.

We don’t only work with humans anymore. We also work with computers, with software, with automations. As developers, we have the unique situation of getting to work with automations that we ourselves have made, or can change. Those are the ones I am especially focused on here.

There are degrees of automation. At one extreme are big machines and PATRIOT missiles, where the human is there just in case. At the other extreme you have little tools that just make things a little easier. And then there’s the sweet spot in the middle, where the software is integrated into our work or our life. Like Google Inbox, may it rest in peace. Like many custom business apps, which it is our job to use, and the app’s job to make our job easier.

Behind every piece of software is people. People like us.

We use automation too, at all these different degrees. We are operating software for other people, and we’re always using software ourselves: from the stuff we interact with closely and customize, like our deployment and IDE, down to tiny tools. Most of these tools evolve with us. And here’s the thing: because we are developers, this software is not closed to us. We don’t just get what we get; we can make it fit.

The software we build, we can change that and learn from that. We can add and change and configure and write queries in our tools, in the interest of learning what’s going on, and making better decisions about changing it.

We are part of a sociotechnical system. The code is on our team, the running software is on our team, and our tools are on our team. We work together to impact the lives of users by operating useful software. And all of us are changing every day.

The word for this kind of learning system is symmathesy.

The crucial flow in this system is learning.

The tricky obstacle is this line of representation, that separation between the social and digital participants. It’s hard to see what’s going on in the software. We are limited to screens, buttons, command lines. That makes our automations crucial, because they can punch holes to transmit learning (visibility and control) across the line of representation.

This post focuses especially on those automations, and how we can best engage in joint activity with them.

What is joint activity?

Joint activity is when we all succeed or fail together at our mutual purpose, and our tasks are interdependent. If I say “Let’s have dinner tonight. You bake a pie, and you cook a meatloaf, and I’ll bake bread, and we’ll meet back here at 6pm,” that is coordination, not joint activity. Unless we’re all sharing the same kitchen! Sharing one oven, counter space, ingredients, maybe washing each other’s dishes — now it’s joint activity.

In a software development team, there are tons of interdependencies between our tasks.

Research shows that to do joint activity well, we need four properties. I’ll go through each of them.

Basic Compact

The Basic Compact is an agreement (usually unspoken) to work toward a mutual purpose until further notice. Not perfectly, but with corrections, and we let each other know when we’re not going to participate.

That “until further notice” bit is important. In humans, it means we call in sick when we’re not coming to work. We resign, we don’t ghost. It also means that when we’re not at our best, we say so. In standup I might say, “Look, my kid is sick and I didn’t get much sleep last night, so if I’m cranky, it’s not you, it’s me.”

A counterexample from aviation: there’s a story of a near-crash. There was a fuel leak, and the crew just happened to notice that the left fuel tank level was dropping while the engines were pulling from the right. They go look out the window, and ope! there goes the fuel! Then they observe that the rudder is way to the right, because the fuel imbalance threw the plane off balance, and the autopilot was just doing its business, compensating, compensating, turning more and more to the right, not telling anyone that it was near the edge of what it could do.

The lesson is: don’t try harder! Talk to your team first.

An example from production software: health endpoints. Our service needs to report whether it’s up. But not yes/no! It needs to say, “Hey, I’m serving requests at this rate, and my downstream services are responding like this, this one’s down and I’m using the fallback” — the Basic Compact isn’t all or nothing.
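
As a sketch (the shape and the field names here are invented, not any standard), a richer health payload might look something like this:

```typescript
// Illustrative shape for a richer health payload; field names are invented.
interface DependencyHealth {
  name: string;
  status: "ok" | "degraded" | "down";
  note?: string;              // e.g. "using cached fallback"
}

interface HealthReport {
  status: "ok" | "degraded";  // still serving, possibly on fallbacks
  requestsPerSecond: number;
  dependencies: DependencyHealth[];
}

// Assemble the report from whatever gauges the service already keeps.
function healthReport(rps: number, deps: DependencyHealth[]): HealthReport {
  const anyTrouble = deps.some((d) => d.status !== "ok");
  return {
    status: anyTrouble ? "degraded" : "ok",
    requestsPerSecond: rps,
    dependencies: deps,
  };
}

console.log(
  healthReport(120, [
    { name: "billing-api", status: "down", note: "using cached fallback" },
  ])
);
```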

The Basic Compact means, don’t swallow errors! Errors are data. Maybe don’t surface them to users, but get that information to the humans on the team. That’s how we learn.
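
A small sketch, with made-up function names, of surfacing instead of swallowing:

```typescript
// Hypothetical stand-ins for real work and real reporting.
async function chargeCard(orderId: string): Promise<void> {
  throw new Error(`payment gateway timeout for ${orderId}`); // simulate a failure
}

function notifyTeam(message: string): void {
  console.log(`[team-channel] ${message}`); // imagine Slack, a dashboard, a log aggregator
}

async function placeOrder(orderId: string): Promise<void> {
  try {
    await chargeCard(orderId);
  } catch (e) {
    // The user can see a gentle "please try again later", but the error itself
    // goes to the humans on the team as data. Never an empty catch block.
    notifyTeam(`charge failed for ${orderId}: ${(e as Error).message}`);
    throw e;
  }
}

placeOrder("order-123").catch(() => {
  /* the user-facing fallback lives here */
});
```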

Collaboration is not about being perfect. We work with each other within the team, so that the rest of the world can rely on the team.

Mutual Predictability

The next requisite property of joint activity is: mutual predictability. This means we can usually predict what other team members will do. If I ask for the salt, and you don’t also pass the pepper, I might be surprised. The basic compact says we react to such surprises by repairing predictability. If something you hear in standup surprises you, dig in, and find the “why” that you’re missing.

In automation, a counterexample: I can’t predict what autocorrect is going to do. I send a lot of texts that start with “Kinda” — but no, Apple, I have a daughter, and her name is Linda. It is predicting me wrong, so it surprises me.

In cars that have driver assist, there’s a feature that hits the brakes when you get too close to the car in front of you. Which is good, unless you’re trying to pass that car. So you need to use your turn signal. This tells the driver assist that you’re about to change lanes, so it can choose not to brake. It can predict you, and act in cooperation.

An example from software: those installers that take you step by step through “We’re gonna look at your system. We’re gonna tell you what we’re gonna install. We’re gonna install it. We’re gonna tell you what you installed.” Boring, but comfortingly explicit.

Predictability leads to trust. As with people, we want our automations to “act neither capriciously nor unobservably” (Klein). “People must be able to understand their state, their actions, and what is about to happen” (Don Norman). We need our automations to tell us what they’re up to.

Automations should not be clever enough to surprise me. This means: no artificial intelligence, no natural language processing. Now, we might well use these in our product, in our real software that we operate for other people. In that case, we can put in enough effort to get past the uncanny valley of “It did what?? Where did it get that?”

In any machine learning software, I want it to report on what rules fired to trigger the result. The software needs to be accountable: able to explain why it acted as it did. That will let us reach mutual predictability.
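
A sketch of what that accountability could look like, with invented names: the result carries the rules that fired alongside the value.

```typescript
// Sketch of a result that carries its own explanation; the names are invented.
interface FiredRule {
  rule: string;       // which rule matched
  evidence: string;   // what in the input triggered it
}

interface Decision<T> {
  value: T;
  firedRules: FiredRule[];  // enough to answer "where did it get that?"
}

function classifyTicket(text: string): Decision<"bug" | "feature-request"> {
  const fired: FiredRule[] = [];
  if (/crash|error|exception/i.test(text)) {
    fired.push({ rule: "mentions-failure-words", evidence: text });
  }
  return {
    value: fired.length > 0 ? "bug" : "feature-request",
    firedRules: fired,
  };
}

console.log(classifyTicket("App throws an exception on login"));
```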

Even better: I want my automation to teach new members of the team how things work around here. At Atomist, our software delivery machines list all the steps they chose to perform for each commit, and then the steps turn green as they get done. People can click to see the logs. I have more ideas about how to make this even more informative, but this display is already helpful.

Automations that are predictable are good. Predictability builds trust. Don’t be clever, don’t be sneaky.

Mutual Directability

Once we can predict what other people or tools are going to do, we need the ability to change it.

When someone in standup says “I’m gonna add a new table…” I might say, “Hey, have you thought about this edge case? or just how many link records you’re going to need?” and we can modify the plan together.

Counterexample: when you call in to apply for something and the computer says NO, and there’s nothing you can do about it. You never want your tools to say NO to you — although they may say “good luck with that.”

You can make your production software directable — by you — with good admin screens. I once worked with an insurance system that had a bespoke ORM framework that was flat-out wrong, but it lasted years longer than it deserved because the admin screens were great. First-line support was a developer, and we could open the admin screens and clear the cache or even update the values in it. Great visibility, much control.

“Interfaces present and explain choices” (Mel Conway). For instance, car navigation systems used to choose a route and that was it. Now, Google Maps tells you its favorite, and also some alternative routes, in case you want scenery or don’t like downtown. It shows you where the traffic is and how much longer those other routes take.

At Atomist, our built-in automations like to suggest things. Oh, I notice you created a new project, would you like to get notifications in a slack channel? There’s a button for the preferred choice (create a new channel with the same name as the repository), a drop-down for less likely choices, and a fallback for anything-you-could-want. Make the preferred case easy, and everything else possible.

There’s a funny property of negotiations: the person with the least flexibility has more power. If you’re a vendor, and I’m buying software from you, and I’m a manager who can authorize up to $20k/year, then I have more power. You have more responsibility — you have a number to meet this quarter. If you don’t sell it to me for $20k, then maybe we’ll have another meeting next week… maybe. I have less flexibility, and therefore more power.

When the “computer says NO”, the computer is in charge.

Within our teams, keep the human in charge. Humans are maximally flexible, so the automation needs ultimate flexibility too: it can get out of the way and let the human do something else.

Here’s a tip for coding style: separate making decisions from implementing them.

For example, `make` and many other programs have a `--dry-run` option. This causes them to print what they’re going to do, but not do it. Then, if you want `make` to implement those decisions, you can run it without `--dry-run`.

This is a principle of functional programming. Pure functions can return data that specifies the effects they recommend for the world. You can check those decisions in tests. Then, another step can implement them. This improves both predictability and directability.
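
Here’s a minimal sketch of that separation (the domain is invented): a pure function returns the plan as data, and a separate step carries it out, which also gives you a dry run for free.

```typescript
// Pure decision: returns data describing effects, performs none of them.
type Effect =
  | { kind: "createChannel"; name: string }
  | { kind: "notify"; message: string };

function planForNewRepo(repoName: string, hasChannel: boolean): Effect[] {
  const effects: Effect[] = [];
  if (!hasChannel) {
    effects.push({ kind: "createChannel", name: repoName });
  }
  effects.push({ kind: "notify", message: `Now watching ${repoName}` });
  return effects;
}

// Separate implementation step; with dryRun it only reports the plan.
function execute(effects: Effect[], dryRun: boolean): void {
  for (const effect of effects) {
    const description = JSON.stringify(effect);
    if (dryRun) {
      console.log(`would do: ${description}`);
    } else {
      console.log(`doing: ${description}`); // real side effects would go here
    }
  }
}

// The plan is plain data, so a test can assert on it before anything happens.
execute(planForNewRepo("my-service", false), true);
```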

Common Ground

Finally, the hardest pillar of joint activity. Common Ground is essential, and it’s expensive. This is the common language and understanding that lets us communicate, predict each other, and direct each other. This is what limits the size of our teams.

Mathematically, common ground is the stuff that we all know, and we all know that we all know it, and we all know that we all know that we all… and so on, infinitely. This is mathematically impossible in the presence of uncertain message passing. But as humans, we can say, aaaah, good enough. So when we’re all in the same meeting, and we decide on a coding standard, we consider that common ground. Later when we’re pairing, and I don’t follow that standard, you say “wait Jess, didn’t we decide not to do that anymore?” and I say “oh right, thanks, I wasn’t listening.” Common ground is repaired.

There’s a limit to how much common ground you want. I stream my work on Twitch sometimes, and hypothetically the whole company could watch, but do they? No, they don’t need that much detail. Part of maintaining common ground means respecting each others’ attention.

Counterexample: a nurse opens a patient record and clicks through six “Alert!” dialogs to get to the screen she needs. When “these two medicines are dangerous when combined” is at the same level as “this room is out of bedsheets,” then too much information is no information.

In software, every bug represents a breakdown in common ground. We did not expect that to happen!

Good tools help restore that common ground. I’m a back-end developer, and I’m trying to learn web development. It’s super complicated, but the tooling helps. For instance, I can write my React code in TypeScript. That compiles to readable JavaScript. I can also read that JS in the browser. I can use the React debugger in Chrome to see the structure of my elements. I can also see the HTML, and I can see the CSS and which rules were applied and why (sort of). I can read the JSON that went back and forth from the server. Each of these gives me a point I can check to restore common ground between me and the software.

Just like with people, we can’t have perfect common ground with the software. The system is more complex than we can model in our heads.

And it should be. We don’t want to fight all the complexity. Simplicity is not the solution. Complexity in the business domain is good; it’s our reason for being. Putting that complexity in software means our customers don’t have to think about it. And there’s complexity inherent in any distributed system.

Stop pretending we can defeat complexity, and start balancing it. “Increased complexity can be balanced with increased feedback.” (Woods, Cook)

We need to update our mental models in the areas where it matters, as we go. We need visibility into the complexity, and control to direct it. To do that, we need to think about an additional interface. We already think about the user interface, but what about the operator interface? Our experience of the software matters.

A good operator interface helps with common ground, predictability, and directability. Readable code is nice, but runtime information is richer, if we make it so. Some tools can help in a generic sense: Istio for visibility into connections, Honeycomb for arbitrary queries over events. Log aggregation and database queries can help, but we need more. Custom, application-specific interfaces can be much richer. They can teach why the system is doing what it is. They’re super helpful at development time too.

Another great opportunity for the system to teach people how to use it is: error messages. Especially with developers, error messages are a conversation. When I get one, hopefully it gives me a clue about how to interact with the system correctly. Then I want to get new error messages, repeatedly, until finally it works, and then I stop. Let’s rename them to “help and guidance” messages, and teach people how the system works.
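
A tiny sketch of an error message written as guidance (the wording and the names are invented):

```typescript
// Sketch of an error message that doubles as guidance; wording is invented.
function parsePort(value: string): number {
  const port = Number(value);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(
      `PORT was "${value}", but it needs to be a whole number between 1 and 65535. ` +
        `It is read from the environment at startup; try PORT=8080.`
    );
  }
  return port;
}

try {
  parsePort("eighty");
} catch (e) {
  console.log((e as Error).message); // the next clue in the conversation
}
```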

Recap so far

These are the four pillars of joint activity, in humans and in sociotechnical systems.

  • Basic Compact: Surface weaknesses.
  • Mutually Predictable: Display and explain your plans.
  • Mutually Directable: The human is in charge.
  • Common Ground: Be careful with my attention, and give me windows into your workings.

Now it is time for:

Gratuitous Cat Picture

Here are Odin and Pixie “helping” with talk prep.

Ironies of Automation

Finally, it is time for the four ironies of automation. These are things that you’d think would get easier with automation, but they actually get harder.

First: the smarter the automation, the harder it is to operate. If you think you’re going to pay operators less as the software gets smarter, you’re wrong.

Automation is not about reducing costs. It is about increasing capabilities. In collaboration with automation, we’re able to do a lot more. Applications are approved two seconds after submission instead of 1-4 weeks, and in a consistent manner. We can have PATRIOT missiles at all, and fly jet planes that go too fast for a human to control.

Your people won’t be cheaper — they’ll be more valuable. Don’t train people to follow procedures; computers can do that. Training for procedure is like programming except way more expensive. Train for understanding, to help with problem solving.

Second: exceptions are the rule. Sure, the happy path may be 99% of code executions, but it is one of hundreds of possible paths through the code. We don’t need to handle every one, just make them noncatastrophic. In my own automation, it can fail, and tell me about it, and I can take it from there. Later I can add automatic handling as it becomes worth it.
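
As a sketch (everything here is invented): automate the one failure you understand, and hand the rest to a human with enough context to take it from there.

```typescript
// Sketch: automate the one failure we understand; hand everything else to a
// human with enough context to take over. All names here are invented.
class KnownTransientError extends Error {}

async function runStep(
  step: () => Promise<void>,
  askHuman: (context: string) => void
): Promise<void> {
  try {
    await step();
  } catch (e) {
    if (e instanceof KnownTransientError) {
      await step(); // the one case worth automating today: retry once
    } else {
      // Noncatastrophic: stop, report, and let the human decide what's next.
      askHuman(`step failed: ${(e as Error).message}; automation paused`);
    }
  }
}
```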

Third: automating the easy stuff makes the hard stuff harder. There’s an uncanny valley here, where the automation is good enough that I don’t think about it, but incomplete enough that I should be thinking about it.

If I operate Kafka as one component, most of the time I ignore it. Then when something does go wrong, crap! I’m not prepared. At the company that runs Kafka-as-a-service, they are always fixing problems in Kafka. They’re specialized, it’s in their heads, and they can fix it. Great.

Hila Peleg said that her research team works on really clever, contextual autocomplete. But they learned not to make it too frequent, because people just hit tab to accept, tab, tab, tab… and then they make mistakes. “Keep devs in dev mode.”

Fourth: the less flexible party has more power. We’ve talked about this already. When software is inflexible, it has all the power, and that is not what we want. So make the best path obvious and easy, but never mandatory.

Conclusion

As developers, we have a unique advantage: we can change our automations. We can overcome the ironies.

We get to choose which side of the sociotechnical system does each task. Let humans use our strengths: social skills, novelty, problem-solving. Move work below the line, into the tools, when we want speed, power, consistency.

The tools are a part of the team.

Choose what goes above the line and what goes below. That’s the secret power of software development.

When we increase the flows of learning, we keep getting better. There is no done; our system is never going to be perfect. Your coworkers are never perfect. Your automations are never perfect. So make them collaborative.

References

  • Lisanne Bainbridge: “Ironies of Automation”, Automatica, 1983 (PDF)
  • Nora Bateson: Small Arcs of Larger Circles (Amazon)
  • Abeba Birhane: “Descartes was wrong: ‘a person is a person through other persons’”, Aeon
  • Sidney Dekker: Field Guide to Understanding ‘Human Error’, 3rd Edition (Amazon)
  • Klein, Feltovich, Woods: “Common Ground and Coordination in Joint Activity”, Organizational Simulation (PDF)
  • Klein, Woods, Bradshaw, Hoffman, Feltovich: “Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity”. IEEE Computer Society, 2004 (PDF)
  • Hoffman, Hawley, Bradshaw: “Myths of Automation, Part 2: Some Very Human Consequences”, IEEE Computer Society, 2014 (link with PDF)
  • Hollnagel, Woods, Leveson: Resilience Engineering: Concepts and Precepts (Amazon)
  • Don Norman: The Design of Future Things, 2009 (Amazon)
  • Don Norman: “The Problem of Automation,” Philosophical Transactions of the Royal Society of London, 1990 (PDF)
  • Hila Peleg: “Automatic Programming: How Far Can Machines Go?” YOW! Melbourne, 2018 (video)
  • Sarter, Woods: “Autonomy, Authority, and Observability: Properties of Advanced Automation and their Impact on Human-Machine Coordination”, IFAC Man-Machine Systems, 1995 (link)
  • David Woods: “STELLA Report” (link)