Why does no one talk about mobile observability?

When it comes to server/backend, “observability” is both a huge business and a constant topic of conversation – even if sometimes the topic is how much money everyone is spending! Yet we rarely hear about mobile observability, even though being mobile first is now accepted as standard business practice. Why is this? The lack of discussion and focus is even more curious when considering something else I like to say:

An API success rate of 100% at the backend API gateway doesn’t mean anything if the customer success rate as perceived when interacting with an app is 0%.

Said another way, the only success rate that matters is the success rate measured where the user experience actually is, all the way at the mobile edge. I cannot count how many times in my career a backend API has returned an HTTP 200 OK only to have the response crash the app!

The same reasons that we’re obsessed with observability for servers obviously apply (perhaps even more so) to mobile. And given the clear importance of mobile observability in operating reliable customer experiences for our applications, it seems like the topic should be widely discussed and widely applied, yet it’s not. For the rest of this post I’m going to talk about why this is the case (teaser: it’s very difficult!), and what we can do as an industry to prioritize its implementation in a cost-effective manner that ultimately improves customer experiences.

Why is mobile development hard?

Before diving into mobile observability, let’s start by talking about why mobile development is difficult:

Extremely long release and upgrade cycles

Unlike server where most engineers are used to being able to deploy nearly instantaneously to their entire fleet, mobile deployment is a very lengthy process that starts with submitting a new app version to the app store, waiting for it to (maybe) be approved, and then waiting for gradual rollout. More established apps also likely have processes for alpha, beta, etc. before starting production rollout. All of this means that it might take 4 weeks or more before code changes actually wind up in users’ hands, leading to excruciatingly slow iteration cycles. Additionally, because there is typically a long tail of slow upgrades, we are forced to support old app versions for a very long time.

Limited resources and limited control

While new smartphones have quite a bit of power, they still have a finite amount of CPU, RAM, and disk that is shared across all apps on the device. The smartphone OSs provide very limited control over when apps get killed due to excessive CPU/RAM usage (where the definition of “excessive” can change at any point due to low battery, “doze” mode, the CPU overheating, etc.), and apps can be interrupted at any point by an incoming call/text, a user-initiated context switch, etc. Finally, the OS makes no guarantees about disk availability, which means that apps have to be very defensive about any assumed local storage. With server, you own the device and can control everything that’s happening on it; with mobile, you don’t.

Limited background capability

Related to the previous point, mobile OSs make no guarantee about whether an app can run in the background. This means that an app may be shut down and not run again for days or weeks depending on when the user next interacts with it explicitly.

Supporting old models and OS versions

Unlike server where we rarely think about the underlying hardware and Linux version (other than perhaps whether we need to support ARM), apps must routinely support old OS versions and old phone models where supported APIs and capabilities may materially differ. This can lead to complex branching and feature enablement depending on detected capabilities.
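
To make the kind of branching this requires concrete, here is a minimal Kotlin sketch (assuming Android) that gates a capability on the detected OS version; the function names and the notification example are made up for illustration:

  import android.os.Build

  // Notification channels only exist on API 26+ (Oreo), so older OS versions
  // need a different code path.
  fun supportsNotificationChannels(): Boolean =
      Build.VERSION.SDK_INT >= Build.VERSION_CODES.O

  fun notifyUser() {
      if (supportsNotificationChannels()) {
          // Use the modern NotificationChannel-based API here.
      } else {
          // Fall back to the legacy, pre-Oreo notification path here.
      }
  }

Multiply this pattern across many APIs and many supported OS versions and the branching complexity adds up quickly.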

Ever evolving privacy controls

Both Apple and Google are continuously evolving their posture on privacy and security. This means that apps need to handle every branch of a permission request, and not all models and OS versions support all capabilities.

Time/language is hard

Users are likely to have different languages, locales, time zones, calendars, etc.

Sporadic connectivity

While the internet is fantastically reliable (all things considered), mobile apps have to contend with a wide range of poor networking conditions including:

  • Weak cellular signal leading to slow/sporadic connectivity
  • Transitioning from cellular to WiFi and vice versa
  • Misconfigured WiFi hotspots blackholing traffic
  • IPv6 and DNS issues that may be specific to certain mobile carriers or WiFi networks

Additionally, even when things are working, bandwidth is limited and shared across the entire device. Furthermore, not all users have unlimited data (metered plans are vastly more common in developing countries), and those users care about how much data apps are using because they are being charged by the byte.

Clearly, mobile development has many unique challenges that are completely foreign to the average backend engineer, even when that engineer works on a distributed system that ultimately supports a widely deployed mobile-first product! Not surprisingly, these challenges directly affect the implementation of observability in the mobile environment.

Why is mobile observability hard?

In the vast majority of cases, server engineers assume that connectivity between distributed system nodes is constant and largely consistent and reliable. This is not to say that server engineers do not have to handle sporadic failure in distributed system networking (they clearly do). However, for the most part, modern observability systems attempt to push metric/log/trace data through the pipeline with a limited number of retries and local buffering in either RAM or on disk. The assumption is that the data will probably get where it needs to go. And if the retries or local buffering are not sufficient during sporadic failure, the data is dropped. Because server networks are generally extremely reliable, sporadic drops are rarely considered to be a large issue. To summarize, in the server observability world we assume:

  1. High speed and reliable networking.
  2. Applications running out of resources and being forcibly terminated is an anomaly and not the norm.
  3. It’s possible to rapidly deploy changes, either to fix bugs or to add new telemetry to debug an ongoing issue.

Not surprisingly, the high level assumptions we make in the server observability world are at odds with the challenges inherent in mobile development and implementing mobile observability.

Long release cycles

Due to the epically long release cycles inherent in mobile, the simple act of adding a log or analytic event can take weeks for initial rollout and months/years to reach the entire population of app users. This means that, all things being equal, mobile engineers would like to add as much logging/analytics as possible ahead of time in case the data is needed. However, this is unfortunately at odds with the fact that:

Users care about data usage

Users that are on metered data plans care deeply about how much data an app is using, and will notice if an app is using “too much.” Furthermore, even if users have unlimited data, slow networks and sporadic connectivity mean that observability data is sharing available bandwidth with critical application traffic. I.e., sending too many logs can materially impact the performance of the main application. Note that nearly every large mobile app has measured that perceived application performance (time to first interaction, delay when moving screens and clicking buttons, etc.) impacts conversion and core business metrics, so this is not an academic concern.

Sporadic connectivity

Unlike on server, sporadic or poor connectivity on mobile must be considered a normal event and not an outlier. This means that observability systems must plan ahead for sophisticated batching, local storage, etc., or enough data will be lost to impact the reliability of the observability system.
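
As a rough sketch of what this can look like (not any particular SDK’s API; all names here are hypothetical), telemetry is always written to a local queue first and only uploaded in batches when the network appears healthy:

  // Minimal Kotlin sketch of connectivity-aware batching; the durable on-disk
  // store is stubbed out with an in-memory deque for brevity.
  class TelemetryBuffer(
      private val store: ArrayDeque<String> = ArrayDeque(),
      private val batchSize: Int = 50,
  ) {
      fun record(event: String) {
          store.addLast(event) // always persist locally first; never block on the network
      }

      fun flushIfPossible(networkAvailable: Boolean, upload: (List<String>) -> Boolean) {
          if (!networkAvailable || store.size < batchSize) return
          val batch = List(batchSize) { store.removeFirst() }
          if (!upload(batch)) {
              // Upload failed (sporadic connectivity): put the batch back for a later retry.
              batch.asReversed().forEach { store.addFirst(it) }
          }
      }
  }

A real implementation would persist the queue to disk and add retry backoff, but the key point is that nothing is lost just because the network happens to be unavailable at the moment an event is recorded.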

Unplanned termination is the norm

Because apps can be terminated at any time and may not be able to run in the background, mobile observability systems must plan for this case, much as they plan for sporadic connectivity.

Huge cardinality and cost

Depending on the app, a relatively small server infrastructure may support hundreds of thousands or even millions of monthly active mobile users. Storing individual telemetry for each active user is very costly, so aggregation and summarization systems are typically needed for mobile observability.

Permissions/capabilities may limit data collection

Some interesting mobile telemetry may require explicit permission opt-in (for example, location tracking). App developers need to think carefully about whether they want to request permissions from a user, and either way have to handle the case where a user denies the permission request. This makes data analysis tricky, as the analysis has to account for data that a user refused to send.
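
On Android, for example, any telemetry enriched with location has to branch on the runtime permission state. A minimal Kotlin sketch of that branching follows; recordEvent is a hypothetical stand-in for the app’s real telemetry call:

  import android.Manifest
  import android.content.Context
  import android.content.pm.PackageManager
  import androidx.core.content.ContextCompat

  // Hypothetical telemetry call; in a real app this would write to the
  // observability SDK.
  fun recordEvent(name: String, includeLocation: Boolean) { /* ... */ }

  fun recordScreenView(context: Context, screen: String) {
      val granted = ContextCompat.checkSelfPermission(
          context, Manifest.permission.ACCESS_COARSE_LOCATION
      ) == PackageManager.PERMISSION_GRANTED

      if (granted) {
          recordEvent(screen, includeLocation = true)
      } else {
          // Record the event anyway and note that location was unavailable, so
          // analysis can distinguish "no data" from "permission denied."
          recordEvent(screen, includeLocation = false)
      }
  }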

Naive implementation of local storage for observability data can impact UI performance

While mobile apps can be multithreaded, in general a single “main” thread is used for drawing the UI and handling user input events. Care must be taken to not block the main thread unnecessarily, or the app’s frame rate and general responsiveness may suffer. This is particularly important for local storage of observability data, which needs to be durable due to several of the previously described issues. Thus, the observability implementation needs to carefully consider how logging data is processed and stored so as to not adversely impact app performance.
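
One common way to keep telemetry writes off the main thread is to hand events to a single background writer, for example via a coroutine channel. The Kotlin sketch below assumes that approach and is illustrative rather than a description of any particular SDK:

  import kotlinx.coroutines.CoroutineScope
  import kotlinx.coroutines.Dispatchers
  import kotlinx.coroutines.channels.Channel
  import kotlinx.coroutines.launch
  import java.io.File

  // Callers on the main thread only enqueue; a single background coroutine
  // owns the (slow) disk writes.
  class LogWriter(scope: CoroutineScope, private val file: File) {
      private val queue = Channel<String>(capacity = 1024)

      init {
          scope.launch(Dispatchers.IO) {
              for (line in queue) {
                  file.appendText(line + "\n") // local storage written off the main thread
              }
          }
      }

      // Cheap and non-blocking; safe to call from the UI thread.
      fun log(line: String) {
          queue.trySend(line)
      }
  }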

I will also mention that the idea that logging in production in mobile apps is “bad” (due to network utilization, blocking the main thread, etc.) is pervasive. Some examples include:

  1. Google’s instructions for preparing an app for release say: “At a minimum, you need to make sure that logging is disabled and removed…”
  2. A popular Android logging framework called Timber says: “There are no Tree implementations installed by default because every time you log in production, a puppy dies.”

There are clearly a lot of challenges stacked against mobile observability that we need to solve in order to get it widely deployed so that we can accurately measure real user experience.

A path forward for mobile observability

Given how important mobile observability is for measuring and debugging customer experiences where they ultimately matter, it is in our best interest as an industry to invest more heavily in fundamental technologies that solve for all of the challenges described above.

In a previous post I argued for not centrally storing any telemetry by default, adding a control plane, and adding sophisticated local storage. The goal of this system is to reduce cost overall, but still allow highly detailed telemetry to be accessed when needed in order to debug customer issues: 1000x the telemetry at 0.01x the cost. Is this type of observability system a potential path forward for mobile observability? I think it is. Adding a control plane and distributed local storage addresses the challenges of mobile observability in multiple ways:

  1. Due to very long release cycles, we ideally want mobile engineers to add as much logging as possible to their apps ahead of time. This will increase the chance that a piece of telemetry is present if needed to root cause a production issue. If we don’t send this telemetry anywhere by default, we avoid concerns around data usage and shared bandwidth consumption.
  2. To work around the extreme cardinality and cost issues we would face if we theoretically sent and stored all logs, we can utilize the control plane to selectively target cohorts of devices for telemetry retrieval (e.g., all users in San Francisco, all users on a particular phone model, all users on a particular app version). We also have real-time control over what telemetry we send. For example, instead of sending raw logs we can dynamically create synthetic metrics on the device and send those as a substantially cheaper (and easier to reason about) aggregation mechanism (see the sketch after this list).
  3. As long as local storage for telemetry data is sophisticated enough, we can avoid UI performance issues (main thread blocking) and have resilience against sporadic network connectivity and uncontrolled app termination. Putting local storage directly on the device also allows us to “time travel” and retrieve historical data when an explicit set of events within the app occurs (effectively the control plane sending a finite state machine to the app for use in event matching).
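
To make the synthetic metrics idea from item 2 concrete, here is a hypothetical Kotlin sketch in which the control plane ships a simple matching rule and the device only uploads counters instead of raw logs; the rule format and all names are invented for illustration:

  // Hypothetical control-plane rule: "count log lines whose message contains substring."
  data class SyntheticMetricRule(val metricName: String, val substring: String)

  class SyntheticMetrics(private val rules: List<SyntheticMetricRule>) {
      private val counters = mutableMapOf<String, Long>()

      // Called for every local log line; nothing leaves the device here.
      fun onLog(message: String) {
          for (rule in rules) {
              if (rule.substring in message) {
                  counters.merge(rule.metricName, 1L, Long::plus)
              }
          }
      }

      // Uploaded periodically instead of the raw logs: a handful of counters
      // rather than thousands of log lines.
      fun snapshot(): Map<String, Long> = counters.toMap()
  }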

If anything, the technical and cost challenges inherent in effective mobile observability make it all the more obvious that we need a drastic change in how we think about observing systems in general: adding a control plane and distributed local storage. (Perhaps not surprisingly, this is why bitdrift’s initial focus has been on mobile observability, though I think adding a control plane and local storage is broadly applicable to the entire observability ecosystem.)

I look forward to the day when identifying and diagnosing mobile-centric issues (like a server change causing crashes in older app versions due to malformed JSON in HTTP 200 responses) becomes as simple as it is with the server-side observability systems of today. If as an industry we prioritize the importance and implementation of effective mobile observability (the “final frontier” as it were), this day may be just around the corner. Onward!