Capturing the World in Software

TL;DR – we can get a complete, consistent model of a small piece of the world using Event Sourcing. This is powerful but expensive.

Today on twitter, Jimmy Bogard on the tradeoffs of Event Sourcing:

To be clear I’ve shipped successful ES and know others that have but these promises SHOULD NEVER be made. ES comes with many, many tradeoffs, both in dev and operationally. Those discussions should be at the forefront https://t.co/ihm6voLbWT
— Jimmy Bogard 🍻 (@jbogard) January 24, 2020

If event sourcing is not scalable, faster, or simpler, why use it?

Event Sourcing gives you a complete, consistent model of the slice of the world modeled by your software. That’s pretty attractive.

We want to model the real world in software.

You can think about the present world as a sum of everything that happened before. Looking around my room, I can say that my bookshelf is the sum of various purchases, some moving around, a set of decisions about what to read and what to keep.

my bookshelf has philosophy, math, visualization, and a hippo

I can think of myself as the sum of everything that has happened, plus the stories I told myself about that. My next action is an outcome of this, plus my present surroundings, plus incoming events. That action itself is an event in the world.

In life, in biology, we don’t get to see all these inputs. We don’t get to change the response algorithm and try again. But in software, we can!

Of course we want perfect modeling and traceability of decisions! This way we can always answer “why,” and we can improve our understanding and decisionmaking strategies as we learn.

This is what Event Sourcing offers.

We want our model to be complete and consistent.

It’s impossible to model the entire world. Completeness and consistency are in conflict, sadly. Still, if we limit “complete” to a business domain, and to the boundaries of our company, this is possible. Theoretically.

Event Sourcing offers a way to do that.

In event sourcing, every piece of input is an event. Someone requests a counseling appointment, event. Provider signs up for available hours, event. Appointment scheduled, event. Customer notified, event. Customer shows up, event. Session report filed, event.

We can sum past events to get the current state

Skim the timeline of all events for the relevant ones. Sum these up (there are other definitions of “sum” besides adding numbers). From this we calculate the state of the world.

From appointment-was-scheduled events, we construct a provider’s calendar for the day.

At the end of the month, we construct reports on customers served and provider utilization. Based on that, we might seek more providers or have a talk with the less active ones. Headquarters ranks the performance of our office compared with others.

We need to allow corrections

To accurately model the real world, we need to allow for all the stuff that happens in the real world.

Appointments are cancelled. Customers don’t show up. Session reports are filed late. (“Where’s that session report from last week?” “Oh right, they were too late, because the gate to the parking lot malfunctioned. Don’t charge them for it.”)

Data is late or lost. If you insist that this doesn’t happen (“Every provider must enter the session reports by the end of the day”) then your model is blind to reality. The weather turns bad, people go home. There’s a bomb threat, or an active shooter. Reality intrudes.

Events outside your careful model will happen. Accommodate corrections, incorporate events that arrive late, accept partial data. The more of reality you allow into your model, the more accurate it can be.

We can evaluate past decisions based on the information available at the time

When data arrives late, reports change after they are printed. An event sourced system handles this.

As new data comes in about past days, it gets summed in with the data about those days. Reports get more accurate.

A friend of mine works at a counseling center, and he gets calls from headquarters like “Why is your utilization so low for December?” and he’s like “What? It was fine” and then he runs the report again and sure enough, it’s different. After he ran the report, more data about December came in, and now the totals are different. He can’t reproduce the reports he saw, which makes it hard to explain his actions to HQ.

If their software used event sourcing, he could say, “Please run the report as of January 2, and you’ll see why I didn’t take any action.”

Each event records a received timestamp, for when we learned about it, and an effective timestamp, for the real-world happening it represents. Then the software can sum only the events received before January 2 to reproduce the report as it was seen that day.

We can re-evaluate the world with new logic

Not only can an event-sourced system reproduce the same report as on an earlier day, we can ask: what if we changed the report logic? Then what would it look like?

Maybe we want to report unreported appointments as “possibly cancelled” to reflect uncertainty. We can run the new logic against the same events and compare it to the old results.

This means we can run tests against the event stream and detect behavior changes.

We need to record externally-visible decisions for consistency

When we change the software, we endanger consistency.

If we update the report logic in February, then when HQ runs the report “as of January 2” they’ll see something different than my friend saw when he ran it on that date. For consistency, both the data and code need to match what existed on January 2.

Or, we can model the report itself as an event. “On January 2, I said this about December.” Then we can incorporate that into the reporting logic.

Anything our system does that is visible to the outside world is itself an event, because it changes the way external people and software act. To reproduce our behavior consistently, our system can either record its own behavior, or retain all the data and the code that went into choosing it.

So far, this is nice and deterministic. But the real world isn’t.

Reproducing behavior is possible in an event-sourced system, if that behavior is deterministic. In human behavior, we don’t get that luxury. Our choices come from many influences, some of them contradictory. One tweet inspired me to write this article. Thousands of other tweets distract me from it.

Conflicting information comes in from real life.

Event sourcing gets tricky when the real world we are modeling is inconsistent, according to the events that come in.

Now say we’re a shipping company. We model the movement of goods in containers as they move across the world. It is an event when a container is loaded on a ship, and an event when it is unloaded. An event when a ship’s itinerary is scheduled, and when it arrives at each port.

One event says that container 1418 was loaded onto the vessel Enceladus in Auckland. Another event says that Enceladus is scheduled for its next stop in Beijing. Another event says that container 1418 was unloaded in San Francisco. Another says that container 1418 was emptied in Beijing. Which do you believe?

This example comes from a real story. Weird things happen. Does your system let people report reality? Is there a fallback for “Ask a person to go look for that container. Is it really 1418?”

Decisions made in ambiguity are events

Whatever decision the system makes, it needs to record that as an event. Perhaps that shows up as a footnote in reports about Enceladus, Beijing, and San Francisco. Does anybody hear about it in Auckland?

We can see the provenance of each report and decision

If some report comes out uneven, and that feeds back to the development team as a bug, then event sourcing gives us excellent tools for tracking it down.

Each “I made this decision” or “I produced this report” event can record the set of events that were input, and the version of code that ran to produce the output. You can have complete provenance.

This kind of software is accountable. It can tell the story of its decisions, what it did and why. What its world was like at that time.

This is a beautiful property. With full provenance, we can understand what happened. We can tell the story to each other. With replayability, we can change the code and see whether we can improve it for next time.

Recording everything gets ridiculous

Yet, data about provenance gets big very quickly. Each report consumed thousands of events. Each decision that was based on a current-state sum of events now has a dependency on all of those past events, plus the code that defines the current state, plus all the other states it took input from, plus their code and set of events.

Meanwhile some of those events are old, and no longer fit the format expected by the latest code. Meanwhile, we’re still ignoring everything that happened outside the system, so we’re completely blind to a lot of causality. “A person clicked this button.” Why? What information did they see on the screen as input to their decision to click “Container 1418 is in San Francisco”?

In real life, most information is lost. History will never be fully written; the writing is itself history. We’re always generating new actions. The system could theoretically report on all the reports it has reported. It never ends.

Completeness is limited to very small systems. Be careful where you invest this effort. Consciously select the boundaries, outside of which you don’t know what happened. You don’t know what really happened in the shipyard, or in a person’s head, or in the software that another company runs. The slice of the world we see is tiny.

Provenance is precious but difficult. Then again, it is at least as hard to do well in designs other than event sourcing. The painful realities that make event sourcing are painful in other models, too.

There are reasons we don’t model the whole world.

Event sourcing makes a best effort to model the world in its fullness. We try to remember everything significant that happens, sum that up into our own current-state world in the software, make decisions and act.

But events come in out of order. Events are lost. Events contradict each other. Events have partial data, or old data formats. Logic changes. We can’t remember everything.

Sometimes it pays to think about what you would do in an event-sourced system, and then implement just enough of that. Keep copies of produced reports, so that people can retrieve them without re-generating them. Record difficult decisions in a place that lives longer than logs.

Event sourcing is powerful. But it is not easy. Expect to think really hard about edge cases you didn’t want to handle. Expect to deal with storage and speed and up-to-dateness tradeoffs. Allow a human to enter corrections, because the real world will always surprise you.

In the real world, we don’t have all the information, and that’s OK. We can’t model everything in our heads, because our heads are inside everything. This keeps it interesting.