I have been thinking a lot about Envoy Proxy control planes recently so I thought it would be useful to share some of my conclusions about the industry status quo and where I envision the state of the art progressing over the next few years.
Although I am continuously talking to Envoy users, our now yearly EnvoyCon/KubeCon has become an important check-in point for me as I try to wrap my head around the types of problems that are being faced deploying Envoy “on the front lines.” Although I have access to this type of information via Lyft’s deployment, having more data points allows me to get a better idea of the most important things the project can be working on to better serve the needs of the broadest possible set of users.
After attending the most recent EnvoyCon/KubeCon and having some time to think about what I learned, a few high level points have become very clear to me:
- Although the number of “out of the box” service mesh and API gateway solutions based on Envoy continues to increase (e.g., Istio and derivatives, Kuma, Consul Connect, App Mesh, Traffic Director, Ambassador, Contour, etc.), the reality is that these products are on the bleeding edge. By bleeding edge I mean that it’s early days, and most deployments trying them out right now are relatively small and typically greenfield.
- At the same time, I would conservatively estimate that there are now hundreds of organizations deploying Envoy-based service mesh and API gateway solutions with custom private control planes written directly against the xDS API (often built on top of go-control-plane), some at very large scale.
- All of the organizations deploying custom control planes are solving many of the same distributed systems problems over and over again independently, not benefiting from the power of collaborative development, learning, and hardening.
In the following sections I will touch on each of the above points in more detail and then discuss what I think we can do as a project to pragmatically help the greatest number of large scale Envoy users in the near term.
Much of the current cloud native ecosystem is a shiny place that mostly assumes greenfield deployment. In this context, greenfield means starting from scratch with a set of cloud native technologies that have been designed and tested to work well together. Although as an industry we certainly have a long way to go to make greenfield cloud native systems easy to use (I’m looking at you, massive tangles of YAML), we have made large strides in making the building blocks stable. On the other hand, large organizations already deploying legacy technology at very large scale have a different and more complicated problem entirely: attempting to bridge the old with the new without forcing users to undergo massive migrations, instability, and pain.
To use Lyft as a concrete example, our “legacy” compute stack is a bespoke build/deploy system that uses raw EC2 virtual machines, auto scaling groups, etc. Our original Envoy-based service mesh and API gateway grew up tightly integrated into this system and all of its inherent assumptions. Over the last couple of years, Lyft has undertaken a migration to Kubernetes. Setting aside the discussion of whether this migration has yielded positive ROI, the technical requirements of such a migration are massive. We have built a mechanism to bridge our legacy system and new system into a unified compute platform, thus allowing services to be run simultaneously and transparently across both to allow for safe compute migrations. It’s almost inconceivable to think that one of the “out of the box” Envoy service mesh solutions could have been used for this purpose without massive modifications. (For more information on how Lyft adapted its service mesh to Kubernetes see this excellent EnvoyCon talk by Lita Cho and Tom Wanielista.)
One of the hardest problems we face in the cloud native space is balancing configurability with complexity and usability. The more opinionated a solution is, the less configuration it needs, and the less complex it is (at least theoretically). In this regard, greenfield solutions, and especially PaaS and FaaS, have a major leg up. They can make a variety of assumptions about compute stack, deploy tooling, networking, etc. that allow them to significantly reduce needed configuration, thus decreasing complexity and increasing usability.
As a project, Envoy has never tried to simplify configuration as a primary goal. We assume that Envoy is a tool that is used in a large variety of deployments and vertical products. This assumption works against us in the form of sometimes bewildered users that come to our documentation and APIs and don’t know where to start. At the same time, Envoy’s rich API has been one of the driving forces behind its rapid adoption within the industry, making this a fruitful tradeoff in my opinion. The main point is that configurability versus complexity is a sliding scale, and it’s impossible to make every customer happy, thus necessitating a layered approach to composing systems.
In this regard, the various service meshes and API gateways built on top of Envoy are approaching layering in a pragmatic way: they are using Envoy as a building block and creating a “simpler” and more opinionated platform on top. However, as previously explained, simplification is at odds with configurability, making it significantly more complicated (if not impossible) for these solutions to work in more complex brownfield deployments such as Lyft’s.
Lyft’s story is not uncommon and is representative of why Envoy is extensively deployed across the industry, most often with a proprietary custom control plane written in-house: each of these organizations has independently decided that for its brownfield deployment it is more efficient to have full control over the Envoy xDS API and the control plane code that drives it.
Will most organizations still be writing custom Envoy control planes in 3-5 years? I doubt it. In that timeframe we will have seen more migration to standard cloud native technologies and many of the current greenfield systems will have since become large scale deployments. In such a future world, I would expect the majority of deployments to use one of the vertical solutions that are being developed today. In the interim, however, things are messy. The realities of large scale brownfield deployments are pushing users to develop custom Envoy control plane solutions, and all of them are facing similar problems at scale. These problems include:
Immutable infrastructure compute systems such as Kubernetes typically drive very high rates of system changes, as containers are created and terminated rapidly due to autoscaling, deploys, batch jobs, etc. Our experience at Lyft has been that the rate of change can be an order of magnitude or more beyond that which occurred with our legacy system. A high rate of change in the compute/network topology puts a substantial amount of pressure on the control plane; if not careful, a naive implementation is easily susceptible to runaway topology recomputations.
An Envoy control plane has to implement a large number of standard distributed system best practices. This is partly due to immutable infrastructure leading to high rates of system change and partly due to the realities of scale in general. Such best practices include:
- Rate limiting: Making sure the control plane does not bombard backend configuration discovery systems with too many requests (the K8s API, Consul, Zookeeper, etc. are all susceptible to similar DoS scenarios).
- Batching: Making sure that the control plane does not thrash every Envoy with too many sequential updates. While this is particularly important in high rate of change systems, the control plane typically needs to batch the updates it receives from the configuration discovery system in any case. Note that while Envoy itself does similar batching internally to avoid thrashing the data plane worker threads, this is typically not sufficient and is only a modest defense against a control plane requesting a high rate of system change.
- Back pressure: The control plane needs to detect when it is overloaded, either by incoming Envoy client connections or by a configuration discovery system issuing too many updates in too short a period of time. During times of duress the control plane needs to start dropping updates and have mechanisms to catch up in the future and ensure that all Envoys eventually converge on the system’s target state.
- Caching: In order to maintain high performance given a large number of clients, most control planes end up implementing some amount of caching to avoid refetching the state of the world from the configuration discovery system independently for each client. Implementing caching correctly is not without significant challenges, particularly when using Envoy’s push-based xDS variant.
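To make the batching idea concrete, here is a minimal Go sketch, not taken from any real control plane. The `Update` type, the `RunBatcher` function, and the `push` callback are all hypothetical names invented for illustration. The loop coalesces a burst of updates into at most one push per tick and keeps only the newest version of each resource, which is also a crude form of back pressure: intermediate states are dropped, but the final state is always delivered.

```go
package main

import "time"

// Update is a hypothetical change notification from a configuration
// discovery system (for example, "cluster X is now at version N").
type Update struct {
	Resource string // e.g. a cluster name
	Version  int    // assumed to increase monotonically per resource
}

// RunBatcher coalesces updates read from in and calls push at most once
// per tick. Only the newest update per resource is kept, so many changes
// to one resource between ticks collapse into a single entry.
// It returns when in is closed, after flushing anything still pending.
func RunBatcher(in <-chan Update, tick <-chan time.Time, push func(map[string]Update)) {
	pending := make(map[string]Update)

	// flush hands the pending batch to push and starts a fresh one.
	flush := func() {
		if len(pending) == 0 {
			return
		}
		push(pending)
		pending = make(map[string]Update)
	}

	for {
		select {
		case u, ok := <-in:
			if !ok {
				flush()
				return
			}
			// Keep the update only if it is newer than what we hold.
			if cur, seen := pending[u.Resource]; !seen || u.Version > cur.Version {
				pending[u.Resource] = u
			}
		case <-tick:
			flush()
		}
	}
}
```

In a real process the tick channel would typically come from something like `time.NewTicker(100 * time.Millisecond).C`, with the interval chosen as a tradeoff between propagation latency and churn. The 100 ms figure is only an example, not a recommendation.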
At large enough system scale, it becomes untenable to send every endpoint for a service to every downstream client for the following reasons:
- If using client-side active health checking this can lead to an N^2 health checking explosion.
- Envoy’s memory usage scales mostly linearly with the number of upstream endpoints that it connects to. Thus, a large number of upstream endpoints leads to increased memory usage as well as inefficient connection pool utilization.
- “Simple” control planes typically use the State-of-The-World (SoTW) xDS API which means that the control plane sends a complete snapshot of a resource set whenever any resource in that set changes. I.e., if a single endpoint in a cluster gets added or removed, the control plane needs to send every endpoint in the cluster to Envoy when using SoTW. This leads to a substantial amount of CPU and network bandwidth being required for small updates.
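The cost described in the last bullet can be illustrated with a small Go sketch. The `Snapshot` and `Delta` types and the `Diff` function are hypothetical and deliberately simplified (real xDS resources are protobuf messages, not strings). The point is only to contrast the size of a full snapshot with the size of what actually changed between two snapshots.

```go
package main

import "sort"

// Snapshot maps a hypothetical endpoint address to an opaque version
// string. It stands in for the full endpoint set of one cluster.
type Snapshot map[string]string

// Delta lists only what changed between two snapshots.
type Delta struct {
	Upserted []string // endpoints that were added or whose version changed
	Removed  []string // endpoints present before but not now
}

// Diff computes the change set from prev to next. Results are sorted so
// that the output is deterministic.
func Diff(prev, next Snapshot) Delta {
	var d Delta
	for addr, v := range next {
		if old, ok := prev[addr]; !ok || old != v {
			d.Upserted = append(d.Upserted, addr)
		}
	}
	for addr := range prev {
		if _, ok := next[addr]; !ok {
			d.Removed = append(d.Removed, addr)
		}
	}
	sort.Strings(d.Upserted)
	sort.Strings(d.Removed)
	return d
}
```

With SoTW, a cluster of 1,000 endpoints in which a single endpoint is replaced still produces a 1,000-entry response (`len(next)`), whereas the change set computed by `Diff` has one upsert and one removal.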
There are two main solutions to these problems:
- Subsetting: Using the known topology of both the downstream and upstream service, the control plane can send a subset of upstream endpoints to each downstream client such that the overall load on each endpoint remains similar.
- Incremental/delta xDS: Each xDS API implements both a SoTW and an incremental variant. Incremental xDS allows the control plane to send deltas to each client, at the expense of substantially more control plane complexity, as the control plane needs to keep track of the state of each connected client. However, implementing delta updates vastly reduces the CPU and network bandwidth required for performing small updates against each client.
Neither of these solutions is easy to implement in practice.
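As one illustration of what subsetting can look like, here is a Go sketch based on rendezvous (highest random weight) hashing. This is my choice of technique for the example, not a description of how Lyft or any Envoy control plane does it, and the `Subset` and `score` names are hypothetical. Each client deterministically picks the `k` endpoints with the highest hash score for that client, so different clients tend to pick different subsets and load spreads out statistically rather than exactly.

```go
package main

import (
	"hash/fnv"
	"sort"
)

// score returns a deterministic pseudo-random weight for a
// (client, endpoint) pair using FNV-1a.
func score(client, endpoint string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(client))
	h.Write([]byte{0}) // separator so ("ab","c") differs from ("a","bc")
	h.Write([]byte(endpoint))
	return h.Sum64()
}

// Subset returns up to k endpoints for the given client, chosen by
// rendezvous hashing. The input slice is not modified.
func Subset(client string, endpoints []string, k int) []string {
	if k <= 0 {
		return nil
	}
	sorted := append([]string(nil), endpoints...)
	if k >= len(sorted) {
		return sorted
	}
	sort.Slice(sorted, func(i, j int) bool {
		si, sj := score(client, sorted[i]), score(client, sorted[j])
		if si != sj {
			return si > sj // highest score first
		}
		return sorted[i] < sorted[j] // tie-break for determinism
	})
	return sorted[:k]
}
```

A useful property of this approach is stability: removing an endpoint that a client was not using leaves that client's subset unchanged, so topology churn does not reshuffle every connection pool. What it does not give you is a hard guarantee of even load, which is part of why real implementations are harder than this sketch suggests.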
The debugging and observability requirements for operating a control plane at scale are not significantly different from operating Envoy at scale. It’s critical to understand the state of the control plane and its caches, whether the configuration has converged, what the current configuration is, etc. Although implementing such debugging and observability features is not complicated, each organization deploying Envoy at scale has duplicated this effort in a slightly different way.
If we assume, as this post outlines, that deploying Envoy is going to be messy for the next 3-5 years as organizations eventually converge on vertical solutions and products, what can we do as a community to make deploying Envoy control planes easier for everyone currently writing custom solutions?
The go-control-plane project has existed for quite some time as a library that can be used to speed up development of Go-based Envoy control planes. The library provides the outline of things that a control plane needs to support including caching, configuration delivery, etc. It also handles the unglamorous task of compiling Envoy’s protobuf API to Go code.
However, go-control-plane is a very thin library and does not currently provide any of the distributed systems best practices functionality described above, nor does it implement subsetting or incremental/delta xDS.
I’m often asked if Lyft will ever open source our control plane. The answer is no, primarily because of the brownfield tangle that I described earlier in the post. There are simply too many assumptions about Lyft’s legacy infrastructure and business logic built into the control plane. Not surprisingly, every organization of Lyft’s size or larger using Envoy says a similar thing.
What I think both Lyft and the industry can do is move more functionality into go-control-plane such that the library itself becomes more useful. This includes things like rate limiting, batching, back pressure, enhanced caching, a reference Kubernetes and Service Mesh Interface (SMI) implementation, etc. This obviously won’t help organizations that want to build their control plane in a different language, but with Go (like it or not) becoming the lingua franca of cloud native development this seems a reasonable compromise. Moving more functionality into the library will not produce a new service mesh product, but it will produce a more reliable base that we can all collaborate on, find bugs in, and harden.
In addition to enhancing the out-of-box capabilities of go-control-plane, we are kicking off a project called xds-relay (design document). The easiest way to think about xds-relay is that it is a CDN for Envoy configuration. We believe that it is possible to create a self-contained server (built itself on go-control-plane) that can be deployed in front of any xDS compliant control plane “origin” server. The relay will be a scale out component that helps large Envoy deployments achieve high availability by implementing all of the distributed systems best practices outlined in this post in one open source place.
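To give a feel for the CDN analogy, here is a deliberately tiny Go sketch of a read-through cache sitting in front of an origin. It is not the xds-relay design: real xDS is a streaming gRPC protocol, while this sketch reduces it to a request/response lookup. The `Relay`, `Key`, and `NodeGroup` names are hypothetical, and the origin is just a function supplied by the caller.

```go
package main

import "sync"

// Key identifies a cacheable response. TypeURL is the xDS resource type;
// NodeGroup is a hypothetical label for a set of Envoys that should
// receive identical configuration.
type Key struct {
	TypeURL   string
	NodeGroup string
}

// Relay is a read-through cache in front of an origin control plane.
type Relay struct {
	mu     sync.Mutex
	cache  map[Key]string
	origin func(Key) (string, error)
}

// NewRelay wraps an origin fetch function with a cache.
func NewRelay(origin func(Key) (string, error)) *Relay {
	return &Relay{cache: make(map[Key]string), origin: origin}
}

// Get returns the cached response for k, asking the origin only on a
// miss. The lock is held across the origin call, which keeps the sketch
// simple and guarantees one fetch per key, at the cost of serializing
// all callers.
func (r *Relay) Get(k Key) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if v, ok := r.cache[k]; ok {
		return v, nil
	}
	v, err := r.origin(k)
	if err != nil {
		return "", err // errors are not cached
	}
	r.cache[k] = v
	return v, nil
}

// Invalidate drops a cached entry, e.g. when the origin signals a change.
func (r *Relay) Invalidate(k Key) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cache, k)
}
```

With this shape, a thousand Envoys in the same node group asking for the same resource type cause a single origin fetch, which is the essence of the scale out argument. Everything hard (streaming, versioning, invalidation, fairness) is left out.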
Additionally, once the xds-relay MVP (simple caching and scale out) is complete, there is a huge amount of potential for community collaboration on additional features such as:
- State-of-The-World (SoTW) to incremental/delta xDS conversion: It will be possible to create SoTW to incremental conversion code in the relay, thus keeping the origin control plane simple and relegating the delta logic to shared open source code.
- Automatic endpoint subsetting: The relay will know about overall system topology in terms of raw xDS building blocks such as clusters, endpoints, etc. Thus, the relay should be optionally capable of performing endpoint subsetting directly without the origin control plane being aware of it.
- xDS translation hooks: Allowing transformation of Envoy configuration as it travels through the control plane pipeline is a frequent feature request. The relay would be a common location where this could optionally be performed.
- API-driven relay configuration updates: One of the design goals of the relay is that it will only ever speak raw xDS. With that said, there is nothing to prevent the relay from eventually becoming an API-driven control plane itself. There are many interesting directions to explore in this area!
By moving much of the complexity of building Envoy control planes into open source, we can all collaborate together, find bugs together, and harden the implementation together, thus making the proprietary/internal pieces much simpler to reason about. This is a win/win for everyone. If you are interested in helping with relay development please reach out and join us!
The rapid industry uptake of Envoy has seen the creation of many vertical products and services, but also the creation of hundreds of bespoke control planes. It has become clear that many organizations with a custom control plane are not going to give it up any time soon in favor of shrink-wrapped service mesh and API gateway solutions, but they are also redundantly solving the same set of distributed systems and scaling problems. Over the next year, through projects such as go-control-plane enhancement and xds-relay, we can improve the status quo for the average Envoy user as well as increase the stability of the vertical products and services that will ultimately utilize these projects as building blocks.