Here’s what you need to know so you can understand & use the rest of the information on the internet.
What is this “observability”?
It’s when you can see what’s going on inside your software systems, especially in production. Yes, logs help with this; yes, monitoring helps with this. We can do better. The modern standard is distributed tracing, with piles of information inside each trace.
What about this (observability) is BS?
You never know everything about what’s going on in your software system. You only get the clues you leave yourself. “Observability” as a practice is about giving your team the best clues we know how to give, without spending a ridiculous amount of development time or CPU or money on it. Distributed tracing still takes some development time, plenty of infrastructure setup, a bit of overhead at runtime, a lot of money to store this data in something queryable, and a bunch of decisions that you didn’t want to think about. And then to get value, you have to learn how to query it.
In real life, we cobble together deductions about what’s happening in the system based on the information we have. That usually includes logs, maybe some metrics graphs, and perhaps parts of our system are emitting some useful traces and other events. Hopefully we add more useful info over time. It’s a journey.
What is this “distributed tracing”?
Tracing follows an operation through many steps. Distributed tracing follows an operation across processes. The clearest case is a web request hitting a microservice backend and bouncing around between services. A distributed trace shows all these connections, how long each step took, where an error happened, and what was going on at the same time.
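How does an operation stay connected across processes? Usually by passing a trace ID along with each request. Here is a minimal sketch, assuming the W3C Trace Context `traceparent` header format (version, trace ID, parent span ID, flags, joined by dashes). The function names `start_span` and `outgoing_header` are made up for illustration; in real code a library such as OpenTelemetry would handle this for you.

```python
import secrets
from typing import Optional


def start_span(incoming_header: Optional[str]) -> dict:
    """Begin a span, continuing the caller's trace if a header arrived."""
    if incoming_header is None:
        # No caller context: this process starts a brand new trace.
        trace_id = secrets.token_hex(16)  # 32 hex characters
        parent_id = None
    else:
        # Caller context present: reuse its trace ID, remember its span as parent.
        _version, trace_id, parent_id, _flags = incoming_header.split("-")
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),  # 16 hex characters
        "parent_id": parent_id,
    }


def outgoing_header(span: dict) -> str:
    """Build the header to send on calls made while this span is active."""
    return f"00-{span['trace_id']}-{span['span_id']}-01"


# The backend gets a web request with no trace header, so it starts a trace.
backend_span = start_span(None)
# The backend calls cart service and passes its context in the header.
cart_span = start_span(outgoing_header(backend_span))

print(backend_span)
print(cart_span)
```

Both spans share one trace ID, and the cart service span names the backend span as its parent. That shared ID is what lets a tool put them in the same picture later.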
Distributed traces are usually pictured in a waterfall view, like the one in the network tab in your browser’s developer tools. That shows what happened when, what took so long, and what else went on at the same time.
Unlike the browser developer tools, this waterfall view has a tree structure. The call to cart service came out of the call to backend, and the call to the database came out of cart service. The calls to product service all came out of the backend. This shows you what called what.
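To make the tree structure concrete, here is a rough sketch that turns a flat list of spans into an indented text waterfall. The span fields (`id`, `parent`, `start_ms`, `duration_ms`) and the timings are invented for this example; real tracing tools use their own field names and do the drawing for you.

```python
from collections import defaultdict

# Hypothetical spans matching the story above: backend calls cart service
# (which calls the database) and calls product service twice.
spans = [
    {"id": "a", "parent": None, "name": "backend", "start_ms": 0, "duration_ms": 120},
    {"id": "b", "parent": "a", "name": "cart service", "start_ms": 5, "duration_ms": 60},
    {"id": "c", "parent": "b", "name": "database", "start_ms": 10, "duration_ms": 40},
    {"id": "d", "parent": "a", "name": "product service", "start_ms": 70, "duration_ms": 20},
    {"id": "e", "parent": "a", "name": "product service", "start_ms": 72, "duration_ms": 30},
]


def render_waterfall(spans, scale_ms=5):
    """Return one text line per span: indentation shows what called what,
    and the bar shows when the span ran and for how long."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)

    lines = []

    def visit(span, depth):
        label = "  " * depth + span["name"]
        offset = " " * (span["start_ms"] // scale_ms)
        bar = "#" * max(1, span["duration_ms"] // scale_ms)
        lines.append(f"{label:<22}{offset}{bar}")
        for child in sorted(children[span["id"]], key=lambda s: s["start_ms"]):
            visit(child, depth + 1)

    for root in children[None]:
        visit(root, 0)
    return lines


waterfall = render_waterfall(spans)
print("\n".join(waterfall))
```

The only thing that makes this a tree is the `parent` field on each span. If one service never sends its spans, everything underneath it loses its place in the picture.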
What about this (distributed tracing) is BS?
It’s great when it works, when every service participates in sending trace data. On legacy systems, that can be a bear to set up. When some services don’t participate, traces are partial and disconnected, and you’re back to piecing them together, almost as bad as with logs.
It’s great for telling stories about “I got a request, I did stuff, I sent a response.” It is not so clean when the work is asynchronous, as in an event-based architecture, or interactive, as in a client.
It’s freaking expensive. You don’t want to store all of your trace data at scale. Which means you need to think about sampling, and that’s even more work to set up and make decisions about.
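One simple flavor of sampling, as a sketch: decide by hashing the trace ID, so that every service makes the same keep-or-drop choice for the same trace without talking to the others. The `keep_trace` function and the sample rate of 10 are assumptions for illustration; this is not how any particular vendor or library does it.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: int) -> bool:
    """Keep roughly 1 in `sample_rate` traces.

    The decision depends only on the trace ID, so every span in a trace
    gets the same answer and kept traces stay whole.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0


SAMPLE_RATE = 10
trace_ids = [f"trace-{n}" for n in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, SAMPLE_RATE)]

# Each kept trace stands in for about SAMPLE_RATE real ones, so multiply
# back up when you want to estimate the true count.
estimated_total = len(kept) * SAMPLE_RATE
print(len(kept), estimated_total)
```

Even this tiny version forces decisions: what rate, whether errors should always be kept, and where to record the rate so your graphs can correct for it.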
Observability: you know what your system is doing, because it tells you.
(how does it tell you?)
Telemetry: data that software emits just to tell you, the developer & operator, what is happening.
(where does that come from?)
Instrumentation: code that builds & sends telemetry.
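Here is what a tiny piece of hand-rolled instrumentation might look like. It is only a sketch: the `span` helper, the field names, and the in-memory `emitted` list (standing in for a telemetry backend) are all made up for this example. In practice a library like OpenTelemetry provides this.

```python
import json
import time
import uuid
from contextlib import contextmanager

emitted = []  # stand-in for sending telemetry to a backend


@contextmanager
def span(name, trace_id, parent_id=None, **fields):
    """Time a block of work and emit one structured event describing it."""
    event = {
        "name": name,
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        **fields,
    }
    start = time.perf_counter()
    try:
        yield event  # the caller can attach more fields while working
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        emitted.append(json.dumps(event))


trace_id = uuid.uuid4().hex
with span("handle_request", trace_id, route="/cart") as request_span:
    with span("load_cart", trace_id, parent_id=request_span["span_id"]) as db_span:
        db_span["row_count"] = 3  # a clue for future you

for line in emitted:
    print(line)
```

The instrumentation is the `span` helper and the places it gets called. The telemetry is the JSON it emits.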
All that telemetry data doesn’t magically turn into understanding. First, you have to store it somewhere. That is called, generically, a “telemetry backend.” Then you need to query it, and look at it in a UI.
How does this work?
Logs are telemetry.
Maybe your logs are stored in Elasticsearch, and you view them in Kibana.
Metrics are telemetry.
Maybe your metrics are stored in a time-series database behind Prometheus, and you display dashboards in Grafana.
Distributed traces are made of telemetry.
Traces are constructed from events with a certain structure (called trace spans). You could have Jaeger store them in Cassandra, and then Jaeger’s UI will stitch them together into a waterfall view.
State-of-the-art observability uses:
- instrumentation to emit events that weave into traces
- a backend that lets you search and aggregate them by any field in any event, and
- a UI that makes both graphs and waterfall views.
The same events that build into traces can give you the graphs you get from metrics, and serve better than logs for debugging. This way you can always get from an error count to the story of an error, and from a high latency to an example of a slow request. And from any trace with a slow database call to a graph of how long it usually takes, or how frequently we run that particular query.
What about this (state-of-the-art observability) is BS?
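As a toy illustration of that round trip, here are a few span events (invented data, with assumed field names) used both ways: aggregated into the numbers a graph would show, and searched for the one trace worth opening.

```python
from collections import Counter
from statistics import median

# Hypothetical span events, as a queryable backend might store them.
events = [
    {"trace_id": "t1", "name": "db.query", "duration_ms": 12, "error": False},
    {"trace_id": "t2", "name": "db.query", "duration_ms": 15, "error": False},
    {"trace_id": "t3", "name": "db.query", "duration_ms": 480, "error": False},
    {"trace_id": "t3", "name": "checkout", "duration_ms": 510, "error": True},
    {"trace_id": "t4", "name": "checkout", "duration_ms": 40, "error": False},
]

# Metrics-style view: how many errors, and how long does the query usually take?
error_counts = Counter(e["name"] for e in events if e["error"])
db_events = [e for e in events if e["name"] == "db.query"]
typical_db_ms = median(e["duration_ms"] for e in db_events)

# Trace-style view: which trace tells the story of the error or the slow call?
error_trace_ids = [e["trace_id"] for e in events if e["error"]]
slowest_db = max(db_events, key=lambda e: e["duration_ms"])

print(error_counts, typical_db_ms, error_trace_ids, slowest_db["trace_id"])
```

Because every event carries a trace ID, the aggregate and the example are never disconnected from each other.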
In real life, you’re going to keep using your logs, because you have them. And they’re cheaper to store, so use them for auditing.
You’re going to keep using your metrics and monitoring, because you have them. And for infrastructure stuff and software you only run but don’t write, that’s probably the level of insight you need.
Distributed traces are strongest in code that you write. You can make them good, because you can change the code; and you need them more, because you’re always changing the code.
I welcome your emails at email@example.com 😀