1000x the telemetry at 0.01x the cost

This is my third post in a series on a different take on modern observability. In my first post I covered why modern observability is so expensive. In my second post I asked whether we really need to store every piece of telemetry that we generate, and described a new way of thinking about observability built around a control plane and local storage.

In this post, I’m going to take you through a journey where we explore a world in which observability is flipped on its head: we store nothing by default, but have access to anything we would ever want on the fly. What does life look like when you can get any telemetry you want, any time, without thinking about cost or storage? What would it mean for our ability to debug and operate systems?

How can I store nothing, but have access to everything?

The sad, or maybe just ugly, reality is that the vast majority of telemetry in traditional systems is never read, either by human or machine. Even as storage, compute, and network have gotten cheaper over the years, the complexity of our systems (microservices, CaaS, FaaS, etc.) and their ability to generate large volumes of telemetry data have kept pace, leading to egregious waste, high relative costs, and unhappy operators.

In my previous post, I argued that we can pair a dynamic control plane with local storage to intelligently decide what telemetry to store and use. If we conservatively estimate that overall monthly telemetry volumes and costs can be decreased by 90% using a fully dynamic control plane and local storage (I think the real volume decrease could approach 99% or more in some systems), we now have slack to reuse some of this excess capacity for more telemetry when it matters. Effectively, we are trading a steady state of dubiously useful telemetry for bursts of extremely high fidelity telemetry targeted at the specific problem at hand.

For the rest of this post I am going to go into some specific examples across both mobile and server observability where getting bursts of 1000x the telemetry will fundamentally change how engineers debug and operate systems. These examples are by no means exhaustive and I fully expect that once dynamic telemetry systems are widely deployed we will continue to see new use cases, limited only by the imagination of engineers around the world.

What does cost mean when talking about telemetry?

Before talking about different examples of what 1000x the data can do, let’s first discuss the definition of cost in the telemetry context.

  1. The most obvious definition of cost is of course in terms of dollars and cents: what is the bill to transport, store, and make telemetry data queryable? This one doesn’t need further explanation.
  2. Another critical definition of cost, though less discussed, is the overhead of collecting telemetry data in the first place. Nothing is free in computing. Generating telemetry uses CPU time, increases RAM usage, and possibly increases disk usage. In highly concurrent systems, the production of telemetry can change timings enough to alter program behavior. The more telemetry collected, the higher the overhead. Thus, for the most in-depth telemetry collection at scale, we want to amortize the overhead by limiting data capture: sampling across a very large population, limiting the collection time, or both.
  3. Another definition of cost is what I would call cognitive cost. The larger the overall volume of data, the more of it there is to sift through in order to find the signal within the noise. If we generate large amounts of telemetry data, optimally we would like it to be targeted and obviously useful in understanding the problem currently at hand.
  4. A final definition of cost is what I would call opportunity cost. These are all of the things that we can’t and don’t do because of the other three types of costs.

As an aside, on the topic of overhead I recall one particular bug I was chasing early in my career inside a mobile phone cellular stack – if we turned on all logging the bug wouldn’t repro – we had to add a single log line at a time and spend hours trying to repro the issue! The problem ended up being a super rare race condition deep in the telephony stack which was tickled by a particular phone model on a particular network carrier. What a fun bug!

For the remainder of the post, when I talk about cost I am talking about all four: the financial cost, the performance overhead of telemetry collection, the cognitive cost of having too much irrelevant data to sift through, and the opportunity cost of not being able to get all of the data we need. Adding a control plane and local storage allows us to amortize the cost of extremely high fidelity telemetry collection across short collection intervals, large populations, and very specific collection triggers (finite state machine matchers sent from the control plane to the data plane), keeping all four costs low.
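To make the trigger idea concrete, here is a minimal Python sketch of what a finite state machine matcher pushed from the control plane to the data plane might look like. All names and the event sequence are made up for illustration; this is not a real API, just the shape of the idea.

```python
# Hypothetical finite state machine matcher pushed from the control plane
# to the data plane. All names are illustrative, not a real API.
from dataclasses import dataclass


@dataclass
class SequenceMatcher:
    """Advances one state per matching event; fires when the whole sequence is seen in order."""
    sequence: list            # e.g. ["checkout_started", "payment_timeout"]
    state: int = 0

    def observe(self, event_name: str) -> bool:
        if event_name == self.sequence[self.state]:
            self.state += 1
            if self.state == len(self.sequence):
                self.state = 0
                return True   # full sequence matched: time to act
        return False


# Example: only upload locally buffered telemetry when a checkout is
# followed by a payment timeout.
matcher = SequenceMatcher(["checkout_started", "payment_timeout"])
for event in ["app_open", "checkout_started", "payment_timeout"]:
    if matcher.observe(event):
        print("match: flush the local telemetry buffer to the backend")
```

The point is that the expensive telemetry only ever leaves the device or host when a matcher like this fires; the rest of the time it stays local and eventually ages out.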

Debug and trace logging

Nearly every logging/event framework in existence has a way to categorize logs into relative levels of severity. For example, error, warn, info, debug, and trace. While there is no standard on how to categorize logs, and no standard on which severity level of logs to emit by default to traditional observability systems, most engineers have some rubric that they follow, typically based on personal experience and/or company culture.

In order to reduce volume and cost, most organizations send only info and higher severity logs to a traditional log storage system. Because of this, most engineers have been trained to categorize the severity of their logs: they are generally told to be judicious with info and above, while debug/trace carry no restrictions since those levels are not sent anywhere by default.

Thus, it is extremely common for info logs to lack sufficient data to fully understand the path of execution a program took before arriving at the info log in question. I have lost count of the number of times in my career that I have wished I could easily see debug/trace logs on a production system!

Adding a control plane and local storage, on the other hand, allows us to dynamically enable debug and trace logs for a limited period of time (and in response to a specific sequence of events), giving us access to the entire volume of data that is invaluable in understanding the full set of events leading up to a particular application event.
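As a rough illustration of the "always capture locally, only ship on demand" idea, here is a small Python sketch using a bounded ring buffer as the local store. The buffer size, trigger, and "upload" are all assumptions for the sketch, not any particular SDK.

```python
# Debug/trace logs are always captured into a bounded local buffer and
# only uploaded when a trigger fires. Buffer size, trigger, and the
# "upload" itself are illustrative assumptions.
import collections
import logging


class RingBufferHandler(logging.Handler):
    def __init__(self, capacity: int = 10_000):
        super().__init__(level=logging.DEBUG)
        self.buffer = collections.deque(maxlen=capacity)

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(self.format(record))

    def flush_to_backend(self) -> None:
        # A real data plane would upload to remote storage; here we just print.
        for line in self.buffer:
            print("UPLOAD:", line)
        self.buffer.clear()


log = logging.getLogger("app")
log.setLevel(logging.DEBUG)
ring = RingBufferHandler()
log.addHandler(ring)

log.debug("cache miss for user %s", "1234")  # stays local, costs nothing remotely
log.error("request failed")                  # something interesting happened...
ring.flush_to_backend()                      # ...so the surrounding debug context is shipped
```

The steady-state remote cost is zero, yet the debug context is right there when an error actually occurs.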

Per-instance metrics

While metrics are largely considered to be more cost efficient than logs, they still can create enormous bills in large organizations. The largest organizations typically perform metric aggregation, essentially taking a set of related metrics and collapsing them into a single aggregate time series, in order to both reduce cost and increase query performance (since many queries typically end up asking for an aggregate average, max, etc. on the many underlying time series). A concrete example of this approach would be taking a counter generated by all Kubernetes pods in a service, dropping the pod label, and aggregating the remainder into a per-service count.

Using this technique can drop overall metric volume by 1-2 orders of magnitude depending on the deployment. Most users don’t notice any difference because, as I mentioned above, queries tend to ask for aggregates of the per-pod data anyway.
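For the curious, the aggregation described above boils down to something like the following toy sketch (real pipelines do this with streaming aggregation or recording rules; the label names and values here are invented):

```python
# Collapse per-pod counters into a single per-service series by dropping
# the "pod" label and summing. This is just the idea in miniature.
from collections import defaultdict

per_pod_samples = [
    {"labels": {"service": "checkout", "pod": "checkout-7f9c"}, "value": 120},
    {"labels": {"service": "checkout", "pod": "checkout-a1b2"}, "value": 98},
    {"labels": {"service": "checkout", "pod": "checkout-c3d4"}, "value": 143},
]

per_service = defaultdict(float)
for sample in per_pod_samples:
    labels = {k: v for k, v in sample["labels"].items() if k != "pod"}
    per_service[tuple(sorted(labels.items()))] += sample["value"]

print(dict(per_service))  # one series per service instead of one per pod
```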

However, there are cases in which having per-instance (per-pod) metrics is useful during debugging, for example when looking for outliers that are easy to lose in a large batch of aggregate data. A control plane driven observability architecture makes it trivial to temporarily turn on per-instance metrics, and even stream them to a live viewer if live debugging is all that is needed. This avoids permanently storing orders of magnitude more metrics that are rarely needed.

Network protocol traces

Networking is pervasive in modern systems, both within server-side infrastructure and spanning the internet out to large fleets of mobile/IoT/web devices. Due to the highly concurrent and inherently fallible nature of networking, debugging problems that crop up can be very difficult without detailed tracing. Example low-level protocols used in modern distributed systems include IP, TCP, UDP, QUIC, and TLS. High-level protocols include HTTP, gRPC, and many others specific to individual databases, caches, and so on.

Similar to logging severity levels discussed in the previous section, it is possible to emit variable levels of network protocol tracing, up to and including “trace” logging which might include the actual network payloads to greatly aid debugging.

Clearly, continuously sending extremely detailed network payload information is infeasible due to overhead and cost, but if it could be accessed on demand when needed, imagine how much simpler it would be to debug hard-to-understand issues!

Having worked in the application networking world for many years, I cannot count the number of times I have been asked to help capture HTTP REST/gRPC request and response bodies to aid debugging. A control plane and local storage make this easily possible at a reasonable cost.
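As a sketch of what on-demand body capture might look like, assuming a hypothetical control-plane flag and HTTP hook (not any specific proxy or SDK API): bodies are only recorded while an investigation is active, so the steady-state cost is just cheap metadata.

```python
# Request/response bodies are only recorded while a control-plane flag is
# set, so the steady-state cost is near zero. The flag, the hook, and the
# local store are assumptions for illustration.
capture_bodies = False  # toggled remotely by the control plane during an investigation


def store_locally(record: dict) -> None:
    print(record)  # stand-in for writing to local storage


def on_http_exchange(request_body: bytes, response_body: bytes, status: int) -> None:
    # Always record the cheap metadata.
    record = {
        "status": status,
        "req_bytes": len(request_body),
        "resp_bytes": len(response_body),
    }
    if capture_bodies:
        # Only while the investigation is active do we pay for full payloads.
        record["request_body"] = request_body.decode(errors="replace")
        record["response_body"] = response_body.decode(errors="replace")
    store_locally(record)


on_http_exchange(b'{"item": 42}', b'{"error": "upstream timeout"}', 504)
```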

As an aside, collecting deeply detailed telemetry, especially of this nature, has substantial security implications. For example, network payloads are very likely to contain PII, credit card numbers, passwords, etc. Future highly dynamic observability systems will have to very carefully consider data access security and encryption of telemetry data. One nice property of the control plane / data plane split is that some cases can be solved by matching only within the data plane and not capturing any data at all, avoiding those security implications entirely (for example, emitting a synthetic metric when a match condition happens). This will be the topic of a future post!
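Here is a toy illustration of that "match but don't capture" idea, with invented names and match condition: the payload is inspected in memory, a synthetic counter is incremented, and the payload itself is never stored or transmitted.

```python
# The data plane inspects the payload in memory, increments a synthetic
# counter when a condition matches, and never stores or transmits the
# payload itself. Names and the match condition are illustrative.
from collections import Counter

synthetic_metrics = Counter()


def inspect_response(path: str, body: bytes) -> None:
    # The (potentially sensitive) body is examined in place and then dropped.
    if b'"error": "card_declined"' in body:
        synthetic_metrics[("card_declined", path)] += 1


inspect_response("/v1/charge", b'{"error": "card_declined", "card_number": "****"}')
print(synthetic_metrics)  # only the counter ever leaves the process, never the payload
```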

Hardware instruction/sensor traces

Using software technologies like eBPF, or functionality built into the hardware itself, it’s possible to produce extremely high volume tracing data to aid in problem analysis. Like networking data, it’s not practical to send this level of data all the time, but when required for in-depth analysis of hard-to-reproduce problems, being able to dynamically enable hardware level tracing for a short period of time can be invaluable in root cause analysis. Note that this category applies to all types of high volume hardware signals including instruction traces, accelerometer readings, GPS/location readings, virtual reality camera/sensor recordings, etc.
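As one hedged example of what "for a short period of time" can look like in practice, here is a sketch that runs a bpftrace program for a bounded ten second window and captures its output locally. It assumes bpftrace is installed and the process has the required privileges; the probe itself is just an example of a high-volume trace, and the trigger that would invoke it is left out.

```python
# Run a bpftrace program for a bounded ten second window and capture its
# output locally. Requires bpftrace and the necessary privileges.
import subprocess

PROGRAM = r"""
tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }
interval:s:10 { exit(); }
"""


def trace_file_opens(out_path: str = "/tmp/openat.trace") -> None:
    with open(out_path, "w") as out:
        subprocess.run(["bpftrace", "-e", PROGRAM], stdout=out, check=True)
```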

Hardware performance profiles

Similar to hardware instruction/sensor tracing, in-depth profiling using tools like Linux perf can be invaluable in understanding performance bottlenecks during application execution. However, collecting performance profiles has non-zero overhead and also generates large volumes of data that are impractical to store continuously. Modern continuous profiling tools get around this limitation by sampling, which is a fantastic approach but can make it difficult to drill down into very specific cases and get accurate real-time data for a particular program sequence. By allowing the observability control plane to engage the profiler when a specific sequence of events occurs, very accurate performance data can be retrieved on demand and at low overall cost.
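A rough sketch of what "engage the profiler when a specific sequence of events occurs" might look like, assuming Linux perf is installed and permitted; the trigger condition, threshold, and output path are invented for illustration.

```python
# Engage Linux perf only when a trigger fires, for a fixed window.
# Requires perf to be installed and permitted; the trigger, threshold,
# and output path are invented for illustration.
import os
import subprocess


def profile_for(seconds: int, out: str = "/tmp/trigger.perf.data") -> None:
    # `perf record -p <pid> -o <file> -- sleep <seconds>` profiles this
    # process for a bounded window and then stops.
    subprocess.run(
        ["perf", "record", "-p", str(os.getpid()), "-o", out, "--", "sleep", str(seconds)],
        check=True,
    )


def on_slow_request(duration_ms: float) -> None:
    if duration_ms > 500:        # assumed trigger condition
        profile_for(seconds=10)  # capture ten seconds of profile data, then stop


on_slow_request(duration_ms=725.0)
```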

Mobile/web session UX tracing

Session replay is generally defined as recording a user’s journey through a mobile or web application. Some variant of session replay is included in many different Real User Monitoring (RUM) and general observability tools. Typical session replays capture some variant of the following UX data:

  1. Screen captures
  2. Taps and clicks
  3. “Rage events” such as shaking and aggressive tapping
  4. Rotation / general orientation changes

This is in addition to general observability data capture including logs, networking events, etc.

The important thing to understand about session replay is that cost increases roughly linearly with the amount of data stored (the fidelity). For example, recording screen wireframes is far more efficient (and privacy conscious) than doing pixel-perfect recordings, but has less overall fidelity. Similarly, the more additional UX data is captured (up to and including phone gyroscope readings), the higher the cost, both in terms of data volume and capture overhead.

Because of this, especially on mobile, very high fidelity session replay is rarely deployed to production due to overall cost. However, when using a control plane and local storage, it becomes possible to capture very high fidelity data on demand when a specific sequence of events occurs, making it substantially easier to understand real user journeys, even for applications deployed to millions of active users.
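To illustrate, here is a toy sketch of fidelity tiers that a control plane could escalate temporarily for sessions matching a condition (say, a rage tap followed by a crash). The tier names, signals, and command shape are assumptions, not a real session replay API.

```python
# The client always records the cheapest tier; the control plane
# temporarily escalates fidelity for sessions matching a condition.
# Tier names, signals, and the command shape are assumptions.
FIDELITY_TIERS = {
    "low": {"taps", "screen_names"},                         # default, near-zero cost
    "medium": {"taps", "screen_names", "wireframes"},
    "high": {"taps", "screen_names", "wireframes", "screen_captures", "gyroscope"},
}

current_tier = "low"


def on_control_plane_command(command: dict) -> None:
    # e.g. {"action": "set_fidelity", "tier": "high", "duration_s": 300}
    global current_tier
    if command.get("action") == "set_fidelity":
        current_tier = command["tier"]
        # A real client would also schedule a revert after duration_s.


def should_capture(signal: str) -> bool:
    return signal in FIDELITY_TIERS[current_tier]


on_control_plane_command({"action": "set_fidelity", "tier": "high", "duration_s": 300})
print(should_capture("screen_captures"))  # True only while fidelity is escalated
```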

What 1000x case will you think of?

In this post I’ve covered a small number of cases where using a control plane and local storage to dynamically enable very high fidelity telemetry can aid in root cause analysis of customer issues. Fundamentally, this type of system provides a massively better ROI on telemetry cost (financial, overhead, cognitive, and opportunity), because the generated data is highly detailed when needed and absent when not. 1000x the telemetry at 0.01x the cost sounds too good to be true, but I firmly believe that in the future we will look back on traditional observability systems and wonder how we were ever able to debug anything!

At bitdrift we are just beginning to scratch the surface of what is possible when reimagining observability with a control plane / data plane split, and I can’t wait to see what other systems and use cases the industry comes up with collectively over the coming years!