I often say that software these days is complex. But is it? In Cynefin terms, it’s more like an airplane than a family: it’s complicated. Given enough expertise, all the code, lots of data, and enough time, we could analyze the causal structure of any particular system behavior.
We can reason about complicated systems.
When a complicated system has a problem, it’s time to dig in. We narrow the problem to a component that is behaving unexpectedly, and then we read code, run tests and reason about the results until we understand.
But sometimes we don’t have the necessary expertise, access to the code, a detailed error message, or enough time. Maybe it’s software we don’t maintain, the error doesn’t make any sense, and we need production to work NOW, not after weeks of research. If the software is under development, it won’t hold still for our dissection. In these situations,
we treat the complicated system as a complex one. (from Dave Snowden; wish I could find the particular post again.)
We can work with complex systems, if we keep our eyes open.
In complexity, we don’t expect to know exactly why a fix works. We have reasons to believe it won’t hurt, so we try it and find out what happens.
This week, my demo app running in Kubernetes stopped serving the page. Other pages (served by other services) work, but there’s one the proxy can’t seem to connect to. So I kill the pod. Kubernetes restarts it, and poof, it works again. I don’t know why, but I know what to do.
For someone else with more expertise in Kubernetes, this demo app might be complicated instead of complex. They might be able to drill into the failing network connection, gather more information, look at configuration, and tell me why this keeps happening and how to fix it for good. But that’s beyond my skills, so I have to treat this system as complex.
This means the same system is complicated or complex depending on what (or whom) you know.
With deep investigation, complex software can become merely complicated.
And that means we can move a system from complex to complicated by gaining the expertise and access we need. I’m frustrated by my own puzzlement with Kubernetes, so I’m reading a book about it. When I have time, a puzzle like this is an opportunity to drop down a level of abstraction and expand my investigative capacities.
During an incident, we start with the system in the complex domain. (If we understood it perfectly, would this be happening?) We check known influencers, like CPU and disk space. Meanwhile, we call in someone who has the most expertise on whatever part is failing, so that they can take an analytical approach to a smaller, complicated system.
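Checking those known influencers can be as simple as a quick script. Here’s a minimal sketch in Python; the 90% disk threshold and the per-core load calculation are my own assumptions about what counts as “suspicious,” not anything specific to a particular system:

```python
import os
import shutil

def check_known_influencers(path="/", disk_threshold=0.90):
    """Report the usual suspects: disk space and CPU load."""
    usage = shutil.disk_usage(path)
    disk_fraction = usage.used / usage.total
    # os.getloadavg() returns 1-, 5-, and 15-minute load averages (Unix only)
    load_1m, load_5m, load_15m = os.getloadavg()
    cores = os.cpu_count() or 1
    return {
        "disk_used_fraction": disk_fraction,
        "disk_suspicious": disk_fraction > disk_threshold,
        "load_per_core_1m": load_1m / cores,
    }

print(check_known_influencers())
```

If nothing here looks off, we keep probing; if something does, we have a safe first change to try.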
Observability helps us work in complexity, and it helps with the transition to complicated.
Observability helps here in two ways: we can see what we’re doing in a complex system, and we can see where to look for expertise in a complicated system.
To work within complexity, we need to make small changes and then see what happens. If customers are experiencing a drastic rise in errors, we might ask “what is different about the failing requests?” Perhaps a large chunk of them hit a particular service instance. We restart that instance, and look at the results. We watch the error rate, and also look for other effects. Did latency change? How are the other instances looking?
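That “what is different about the failing requests?” question can be asked of plain event data, too. A small sketch, where the event fields and sample data are invented for illustration:

```python
from collections import Counter

def errors_by_field(events, field):
    """Count failing requests grouped by one attribute, most common first."""
    failing = [e for e in events if e["status"] >= 500]
    return Counter(e[field] for e in failing).most_common()

# Invented sample events: three service instances, errors clustered on one.
events = [
    {"status": 500, "instance": "svc-b"},
    {"status": 200, "instance": "svc-a"},
    {"status": 503, "instance": "svc-b"},
    {"status": 200, "instance": "svc-c"},
    {"status": 500, "instance": "svc-b"},
    {"status": 502, "instance": "svc-a"},
]

print(errors_by_field(events, "instance"))  # [('svc-b', 3), ('svc-a', 1)]
```

An observability tool runs this kind of comparison across many attributes at once; the point is only that the complex-domain move is “compare failing to passing, then poke the outlier.”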
To understand what happened, we move the system from complex to complicated. What code threw the error? What was it trying to do? Who knows about that component? A distributed trace points to the code to look at. It gives us clues about the input and the output. Then we can drill into the code and reason about it.
Other times our traces can only give us clues, and aggregations or metrics give us other clues.
Software is theoretically complicated, effectively complex. We can work with this.
I love programming because there is a reason for everything. (Even if sometimes it’s a cosmic ray flipping a bit and corrupting data.) Every violation of my expectations is either a design decision, a bug in some code, or an unexpected interaction that we can learn from. The painful part is giving up, not going that deep.
The other day in Honeycomb, I asked it “what is different about these events” and it ordered the little charts wrong. The interesting different ones were at the end. I reloaded the page (classic complex-system strategy) and that didn’t fix it. This wasn’t stopping my work, so I threw a screenshot in my Slack channel and went on to other business.
Ashley stopped by my channel, saw the screenshot, and something about the little donuts in the graphs tickled her memory. She knew that these graphs were cached under some circumstances, and that the code had changed recently. She took a look at it, and fixed the bug. To her, this page was complicated. She found a proper root cause for the behavior. Then she changed it on purpose, with results she could predict.
For Ashley, as a person integrated into the system, with how it works fresh in her head, unexpected behavior is a complicated problem to reason about. For me, as a user, it was complex and could only be prodded with restarts and whatever input I could change.
And even for Ashley, the Honeycomb product as a whole is complex. Nobody can understand the whole thing at once.
Software is always going to grow big enough to be complex, and complicated at the same time. The difference is us, and how much time we have. We can work with it both ways.