Here’s what you need to know, so you can understand & use the rest of the information on the internet.

What is this “observability”?

It’s when you can see what’s going on inside your software systems, especially in production. Yes, logs help with this; yes, monitoring helps with this. We can do better. The modern standard is distributed tracing, with piles of information inside each trace.

developer "what is happening?"; software: "this customer wants to see their cart, so I called the cart service"; cart service: "I got that request, so I called the database, and it took 2s to run this here query"

What about this (observability) is BS?

You never know everything about what’s going on in your software system. You only get the clues you leave yourself. “Observability” as a practice is about giving your team the best clues we know how to give, without spending a ridiculous amount of development time or CPU or money on it. Distributed tracing still takes some development time, plenty of infrastructure setup, a bit of overhead at runtime, a lot of money to store this data in something queryable, and a bunch of decisions that you didn’t want to think about. And then to get value, you have to learn how to query it.

In real life, we cobble together deductions about what’s happening in the system based on the information we have. That usually includes logs, maybe some metrics graphs, and perhaps some useful traces and other events from parts of our system. Hopefully we add more useful info over time. It’s a journey.

What is this “distributed tracing”?

Tracing follows an operation through many steps. Distributed tracing follows an operation across processes. The clearest case is a web request hitting a microservice backend and bouncing around between services. A distributed trace shows all these connections, how long each step took, where an error happened, and what was going on at the same time.

a request comes in to the backend; it calls cart service once and then product service three times simultaneously.

Distributed traces are usually pictured in a waterfall view, like the one in the network tab in your browser’s developer tools. That shows what happened when, what took so long, and what else went on at the same time.

waterfall of trace. backend calls cart service, which calls db, which takes 2s; point out that duration. Then backend calls product service 3 times at once; point out the simultaneous calls.

Unlike the browser developer tools, this waterfall view has a tree structure. The call to cart service came out of the call to backend, and the call to the database came out of cart service. The calls to product service all came out of the backend. This shows you what called what.

same trace, this time with arrows going from backend to cart service span, etc.

What about this (distributed tracing) is BS?

It’s great when it works, when every service participates in sending trace data. On legacy systems, that can be a bear to set up. Then traces are partial and separated and you’re back to piecing them together, almost as bad as with logs.

It’s great for telling stories about “I got a request, I did stuff, I sent a response.” It is not so clean when the work is asynchronous, as in an event-based architecture, or interactive, as in a client.

It’s freaking expensive. You don’t want to store all of your trace data at scale. Which means you need to think about sampling, and that’s even more work to set up and make decisions about.
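
To make "sampling" a little less abstract, here is a minimal sketch of one common approach: decide from the trace ID alone whether to keep a trace, so every service reaches the same answer and a kept trace is kept whole. The function name `keep_trace` and the 1-in-N rate are assumptions for illustration, not any particular tool's API.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: int) -> bool:
    """Keep roughly 1 in `sample_rate` traces, decided from the trace ID alone.

    Hashing the trace ID (instead of rolling dice) means every service that
    sees the same trace ID makes the same keep/drop decision.
    """
    if sample_rate <= 1:
        return True  # a rate of 1 means "keep everything"
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket % sample_rate == 0


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", 10) for i in range(10_000))
    print(f"kept {kept} of 10000 traces")
```

The decisions this leaves open are the hard part: what rate to use, and whether errors and slow requests deserve a better chance of being kept than boring successes.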

Words

Observability: you know what your system is doing, because it tells you.

(how does it tell you?)

Telemetry: data that software emits just to tell you, the developer & operator, what is happening.

(where does that come from?)

Instrumentation: code that builds & sends telemetry.

instrumentation: code that emits telemetry, leads to telemetry: data emitted to reveal inner state, leads to a cloud. A person with a question mark, observability: we can see what's happening.

All that telemetry data doesn’t magically turn into understanding. First, you have to store it somewhere. That is called, generically, a “telemetry backend.” Then you need to query it, and look at it in a UI.
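
To see how small instrumentation can be, here is a sketch of hand-rolled instrumentation: a wrapper that times a block of work and emits one structured event about it. The name `instrumented`, the field names, and writing JSON lines to a stream are all assumptions for illustration; a real setup would use an instrumentation library and send to a telemetry backend.

```python
import json
import sys
import time
from contextlib import contextmanager


@contextmanager
def instrumented(name, out=sys.stdout, **fields):
    """Time a block of work and emit one structured event describing it."""
    event = {"name": name, **fields}
    start = time.monotonic()
    try:
        yield event  # the caller can add more fields while the work runs
    except Exception as exc:
        event["error"] = repr(exc)  # record the failure, then let it propagate
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        out.write(json.dumps(event) + "\n")  # stand-in for sending to a backend


if __name__ == "__main__":
    with instrumented("get_cart", customer_id="c-42") as event:
        event["item_count"] = 3
```

The code inside the `with` block is the work; everything else is instrumentation, and the JSON line it writes is the telemetry.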

How does this work?

Logs are telemetry.

logger.info() sends strings to a log aggregator; person reads logs

Maybe your logs are stored in Elasticsearch, and you view them in Kibana.

Metrics are telemetry.

agent sends numbers to a time series db, a person looks at graphs

Maybe your metrics are stored in a time series database behind Prometheus, and you display dashboards in Grafana.

Distributed traces are made of telemetry.

tracing instrumentation sends events that are trace spans to a database and trace waterfall views plus graphs on demand, so the person has observability

Traces are constructed from events with a certain structure (called trace spans). You could have Jaeger store them in Cassandra, and then Jaeger’s UI will stitch them together into a waterfall view.
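
Here is a rough sketch of what "events with a certain structure" can mean, and how a UI might stitch them together. Each span carries its own ID plus the ID of the span that caused it; that parent link is what gives the waterfall its tree shape. The `Span` class, its field names, and the `waterfall` function are illustrative assumptions, not Jaeger's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int  # offset from the start of the trace
    duration_ms: int


def waterfall(spans):
    """Return one text line per span, with children indented under parents."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(f"{indent}{span.name} +{span.start_ms}ms ({span.duration_ms}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root; spans with a missing parent are skipped
    return lines


if __name__ == "__main__":
    spans = [
        Span("t1", "a", None, "backend", 0, 2300),
        Span("t1", "b", "a", "cart service", 10, 2050),
        Span("t1", "c", "b", "database query", 30, 2000),
        Span("t1", "d", "a", "product service", 2100, 150),
        Span("t1", "e", "a", "product service", 2100, 160),
        Span("t1", "f", "a", "product service", 2100, 140),
    ]
    print("\n".join(waterfall(spans)))
```

Note what happens when a service in the middle does not send its spans: the children of the missing span have nowhere to attach, which is the "partial and separated" trace problem from earlier.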

State-of-the-art observability uses:

instrumentation to emit events that weave into traces
a backend that lets you search and aggregate them by any field in any event, and
a UI that makes both graphs and waterfall views.

The same events that build into traces can give you the graphs you get from metrics, and serve better than logs for debugging. This way you can always get from an error count to the story of an error, and from a high latency to an example of a slow request. And from any trace with a slow database call to a graph of how long it usually takes, or how frequently we run that particular query.
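
As a small sketch of that idea: if events are just records with fields, one pile of events can answer both the "how many?" question and the "show me one" question. The event shape (plain dicts with `name`, `trace_id`, `duration_ms`, and `error` fields) and the helper names are assumptions for illustration, not a real backend's query API.

```python
from collections import Counter


def error_count_by(events, field):
    """Count events that recorded an error, grouped by any field."""
    counts = Counter()
    for event in events:
        if event.get("error"):
            counts[event.get(field)] += 1
    return counts


def slowest(events, name):
    """Return the slowest event with this name, or None if there is none."""
    matching = [e for e in events if e.get("name") == name]
    return max(matching, key=lambda e: e["duration_ms"], default=None)


if __name__ == "__main__":
    events = [
        {"name": "db.query", "trace_id": "t1", "duration_ms": 12},
        {"name": "db.query", "trace_id": "t2", "duration_ms": 2000, "error": "timeout"},
        {"name": "get_cart", "trace_id": "t2", "duration_ms": 2050},
    ]
    print(error_count_by(events, "name"))           # the graph-style question
    print(slowest(events, "db.query")["trace_id"])  # the trace to go read
```

The aggregate tells you something is wrong; the `trace_id` on the example event is the link back to the whole story.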

What about this (state-of-the-art observability) is BS?

I work for Honeycomb, so of course I think its solution is the state of the art. After all, we brought the word “observability” into use in the software industry. And wrote the book.

In real life, you’re going to keep using your logs, because you have them. And they’re cheaper to store, so use them for auditing.

You’re going to keep using your metrics and monitoring, because you have them. And for infrastructure stuff + software you only run but don’t write, that’s probably the level of insight you need.

Distributed traces are strongest in code that you write. You can make them good, because you can change the code; and you need them more, because you’re always changing the code.

What should I add to this page next?

So many directions to go. Click the ones you’re interested in, and I’ll get an event, which I will count in my observability tool 🤓.

OpenTelemetry, the libraries to get this instrumentation into your code

Yeah, there’s a lot to say about that.

You can also ask me questions if you want: honeycomb.io/office-hours

How to read a trace

Thanks! Yeah, traces have a ton of information in them, and it is not obvious how to get it.

How to find the trace to look at

This is a tricky problem in most tools. Honeycomb makes it way easier, but you still have to know how. Thanks for asking.

How traces are created, like the technical details

Yep. As an engineer, I don’t believe it until I know how it’s done, at least at a high level.

Sampling, which is how we keep tracing affordable

Yeah, that’s a heavy one. In the meantime, here’s a good post from Martin Thwaites about it.

Other…

I welcome your emails at jessitron@honeycomb.io 😀