5 years of Envoy OSS

Matt Klein

2021-09-14

Today marks the 5 year anniversary of the open sourcing of Envoy Proxy. It would not be an exaggeration to say that professionally, the last 5 years have been a roller coaster of epic proportions, my emotions ranging between exhilaration, pride, anxiety, embarrassment, boredom, burnout, and everything in between. Before the details are lost to the sands of time, I thought it would be fun to share a brief prehistory and history of the project, along with some of the lessons I have learned along the way about growing a large OSS project.

Prehistory and history

Prehistory

Except for a few small detours, my two decade career in the technology industry has been focused on low-level systems: embedded systems, operating systems, virtualization, filesystems, and most recently distributed system networking. My journey with distributed system networking began at Amazon in early 2010 where I had the fortunate opportunity to help develop the first high performance computing (HPC) EC2 instance types. I learned a tremendous amount about low-level high performance computer networking, though I had only limited exposure to distributed systems concepts.

In 2012 I joined Twitter, where after a few false starts I wound up on the edge networking team. This was my first real exposure to distributed system application networking concepts. I led the development of a new HTTP edge proxy called the Twitter Streaming Aggregator (TSA), which was first launched in 2013 to scale delivery of Twitter’s “firehose” API (streaming all tweets). In the runup to the 2014 World Cup, we decided to launch TSA as a general purpose HTTP/HTTP2/TLS edge proxy in points of presence (POPs) close to the event in Brazil. This was done primarily because it was not possible to deploy the existing resource hungry JVM based edge proxy on the small number of colocation racks that would be available in the POPs. My team delivered a successful and incident free World Cup on an extremely compressed schedule. (I can fondly remember a period of time in which I would page myself when the software crashed, no matter the time of day, fix the bug, and redeploy the canary fleet to keep testing.) During my time at Twitter I also had exposure to the way the company performed service-to-service networking with great success via the Finagle library.

Around New Year’s Eve 2015, in what would begin the coda to my time at Twitter, TSA caused millions of Twitter Android users to get logged out via a bug I wrote:

Let me tell you about the time that I (indirectly) logged out about 40M Android users from @Twitter with a single character bug. They couldn't log back on for several hours, and the follow on impact on logged in users was ... not great. https://t.co/HeivgpUimN
— Matt Klein (@mattklein123) April 13, 2018

Joining Lyft and the creation of “Lyft proxy”

I left Twitter in the spring of 2015, partly due to the fallout from the logout incident, partly due to frustration about not being promoted, and partly due to a desire to try something new. I followed my boss from Twitter to Lyft, along with a bunch of my other Twitter coworkers.

When I joined Lyft, the company was relatively small (< 100 engineers), and was struggling with a migration from a monolithic to a microservices architecture. I have talked about this portion of the Envoy journey many times, so I won’t rehash it again, but the very short summary is that Lyft was having all of the typical microservice migration problems, primarily rooted in networking and observability. Additionally, Lyft was already “polyglot” (using multiple languages and frameworks), so it seemed impractical to use a library based solution to these problems. Thus, based on my previous experience building TSA and observing how service-to-service communication worked at Twitter, and due to my immediate credibility via overlapping coworkers, I proposed building a new application networking system called “Lyft proxy.”

After some spirited discussion that included whether the new proxy should be built in Python (yes, really), we agreed on the broad outlines of the project and settled on using C++ as the implementation language. At the time C++ seemed the only reasonable choice. Would I choose C++ today? No. However, today is not early 2015, eons ago in the technology world.

This part of the history would not be complete without the origin of the name “Envoy.” We were setting up the initial devops scaffolding for the project when a forward thinking coworker (Ryan Lane) said that we couldn’t call this new project “Lyft proxy,” we had to pick something better. Always practical, I went to the thesaurus, looked up “proxy,” and settled on Envoy as the new name.

Lyft rollout

I didn’t start in earnest on the Envoy source code until the summer of 2015. Those months were some of the most fun of my career. Empty source files and no support burden should be treasured while they last, because they don’t last long. I worked long hours toward producing something that would add value to Lyft within a reasonable amount of time (by my definition 3-4 months for a project of this type). Lyft had given me a tremendous amount of rope to hang myself with, as the saying goes, and I was committed to making sure said hanging did not happen.

Of course, my efficiency can mostly be attributed to being fresh off of the compressed development schedule and many mistakes (mostly my own) that went into TSA at Twitter. I knew what mistakes not to make, what abstractions were needed, what type of testing worked and what didn’t, etc.

The initial version of Envoy that was readied for production in the fall of 2015 contained just a tiny fraction of the functionality and sophistication that the project contains today. It did not support TLS termination, only supported HTTP/1, and had extremely simplistic routing and resilience features. What it did have was the bones of what you see today. There have been very few major refactors in the history of the project, primarily because, as I said previously, I knew what was coming and what abstractions needed to be in place in order to support the functionality. What Envoy did have from the very beginning was top notch observability output, in the form of metrics and logs. In 2021, this type of network observability is table stakes (thanks in large part to the success of Envoy), but it was not so at the time.

Envoy was first rolled out at Lyft as an edge proxy, sitting behind the AWS ELBs that were providing TLS termination. By late fall of 2015 Envoy was serving 100% of Lyft traffic, and the edge dashboards that were produced by the system paid dividends immediately (e.g., providing API call percentile latency histograms, per endpoint success rate and request rate, etc.).

Shortly after the initial launch, another Twitter coworker (Bill Gallagher) joined me on the project and we quickly added features, such as TLS termination, HTTP/2 support, more routing and load balancing functionality, etc.

At the same time, Lyft’s Envoy based “service mesh” started to take shape. First, Envoy was deployed next to the PHP monolith to replace HAProxy and some of its inherent operational issues (at the time HAProxy was still single threaded for example) to aid with MongoDB proxying. It would not be an exaggeration to say that a substantial portion of Envoy’s early development was targeted towards MongoDB stability (load balancing, rate limiting, observability, etc.).

The benefit of direct Envoy based observability between the edge fleet and the monolith was immediately obvious. Shortly after, we deployed Envoy next to some of the high RPS decomposed microservices to aid in troubleshooting networking issues. The value was proven there as well. Over time we expanded beyond an observability focus and added features to aid in system reliability such as direct connection and service discovery (skipping internal ELBs), outlier detection, health checking, retries, circuit breakers, etc. The number of load based major incidents at Lyft slowly decreased from a frequency of every 1-2 weeks to much less. Envoy cannot take credit for all of that decrease, of course, but the network abstractions it provided did help a substantial amount.

In early 2016, we decided to push for a service mesh with 100% coverage. Initially, we thought it was going to be a slog that would require top-down mandates. In practice, teams signed up to do migrations because the benefit they would get was evident. “Carrot” migrations are almost always successful. “Stick” migrations are rarely successful, or if they are, leave behind a trail of tears and anger within the organization.

By mid 2016 Envoy was used for all network communication at Lyft including edge serving, service-to-service communication, databases, external partners, etc. By any measure the project had been a resounding success, helping Lyft complete the microservice migration, increasing overall reliability, and abstracting the network such that most engineers did not need to know anything about the real system topology. Bill had since left the project to work on other things at Lyft, and in his place Roman Dzhabarov and Constance Caramanolis had joined me. Our small team developed and operated Envoy for all of Lyft.

Road to OSS and launch

By the summer of 2016 we started to have a serious discussion about open sourcing Envoy. Early Lyft employees had an appreciation for open source and what it had done for the company. It was clear that Envoy was not Lyft’s primary business, so why not put it out there and give back? I will be honest in saying that we all approached the open sourcing process with different goals and expectations, as well as a substantial amount of naivety around what would happen if the project became very successful.

Prior to Envoy, I had used quite a bit of open source, but I had almost no experience with open source contributions and zero maintainer experience. (I did have a single commit in the Linux kernel though!) Open sourcing Envoy seemed like a great opportunity to expand my skill set and learn something new, possibly further my career, and frankly, I had no desire for there to be a TSA v3 at a third company. For Lyft, Envoy was a substantial piece of engineering, and leadership felt that open sourcing would lend credibility to Lyft as an engineering organization and help with recruiting. As I said before, all of us were naive about what it takes to both create successful open source and - more importantly - nurture it if it becomes successful.

But, we decided to give it a shot. We spent a good portion of the summer of 2016 working on documentation (Jose Nino joined the team around this time and his first task was reading and helping improve all of the docs), cleaning up the repository to make it “less embarrassing,” working on a website, a launch blog post, etc. I am truly grateful for my coworkers at Lyft during this time who not only supported us but helped us with myriad tasks including website design, logos, and more. Even at this early time, it was intuitive to us that first impressions matter, and if we were going to make a go of open source we had to make a good first impression via quality documentation, web presence, etc.

During this period we also used our industry connections to meet with some of Lyft’s “peer companies” (“unicorn” Bay Area internet startups) to show them what we had done with Envoy and get their feedback, thinking that if we managed to get a launch partner before going public it would be a major help to the project. All of these meetings were very friendly, and across the board all of the companies we met with were extremely impressed with what we had accomplished. But, as is obvious in hindsight, all of them said that there was no way they could adopt Envoy right away with their small infrastructure teams. They wished us the best with open sourcing and said they would check back later. We couldn’t help but feel depressed at the outcome of these meetings, but we pushed forward anyway.

In August of 2016, I had my first auspicious meeting with Google. A Lyft coworker (Chris Burnett) had spoken at a gRPC meetup and mentioned Envoy as it related to Envoy’s gRPC bridging support. Unbeknownst to me, Google was preparing for the launch of Istio on top of NGINX when they found out about Envoy. One meeting led to another, and then many more, and before Envoy was open sourced a substantial number of Google employees had already seen the source code and documentation. (More on this later.)

By the beginning of September we were ready, and set the open source day as September 14th. In general I’m a (over?) confident person, but there have been times in my life where I had a significant amount of anxiety about my ability to succeed. The ones that immediately come to mind are: starting high school, starting college, and starting at Microsoft after college. And open sourcing Envoy was one of those times. I remember being petrified about the public reaction. What would people say? Would the feedback be positive or vicious? Although we were a small team by the time we open sourced, I had still written 90% or more of the code, and felt like putting it into the public domain was a reflection on myself and my abilities.

As scheduled, Envoy became open source on September 14th, 2016. I remember celebrating with my wife and saying something along the lines of: “I will be happy if we can get just one other company like Lyft to use Envoy.”

The reaction to the open source release was almost universally positive. Much to our surprise, almost immediately, we started hearing from big companies, not small ones. Within weeks we were talking to Apple and Microsoft, and the conversations with Google kept picking up pace. Large companies had issues with existing solutions, and had large teams of people ready to dive in and work on solving those issues. Ironically (at least in the view of the Twittersphere), C++ was a help here, not a hindrance. The large companies all already had ample C/C++ development resources, existing libraries they wanted to integrate, etc. C++ was a selling point to them.

During this time, not surprisingly, we had the most interaction with folks at Google. Initially this was primarily with the teams building Istio, but gradually we spent more time with Anna Berenberg, now a distinguished engineer at Google leading various networking and load balancing efforts. That relationship would turn out to yield the “jet fuel” that really launched the project in early 2017.

Rocket ship takes off

By early 2017, it was becoming clear that Envoy was gaining traction quickly. Google committed to replacing NGINX with Envoy for Istio (which eventually launched in the spring of 2017), and much more importantly for the future of the project, Anna’s large team working on GCP cloud load balancing features began their march towards using Envoy for various cloud load balancing products as well as internal use cases (this was all very secret during this time period but is now well known).

I will always remember that period interacting with Google as being one of the most stressful of my career. In all honesty, it felt like an acquisition (inquisition?) process. I remember long meetings and email threads justifying our technical decisions, “interviews” in which Google was trying to determine whether we would be a good OSS partner to work with, etc. It was painfully obvious to us at the time that landing this “acquisition” would put Envoy on a trajectory that we could never achieve on our own, so we did everything in our power to make it a success, which it ultimately was. And, our work with Google truly has been an outstanding partnership over the last 4+ years. Early Google cloud engineers that eventually became maintainers, Harvey Tuch and Alyssa Wilk, brought loads of talent to the project, both technically, as well as leaning into open source and the community. My gratitude to them is immense and the project would not be what it is today without them. The rest of the Google engineers who have contributed to the project over the years (there are now many) have added an immense amount of engineering horsepower that the project would not have otherwise had, in addition to universally being excellent community stewards. I certainly had concerns about the initial Google partnership (technical and philosophical differences, etc.), but I can honestly say that none of those concerns have become a reality.

Apart from ensuring the success of the Google collaboration, both across Istio and GCP teams, we were also spending a considerable amount of time working with and onboarding other companies and maintainers, many of whom have had outsized impact on the project and are still heavily involved today either as maintainers, contributors, or users. The project would not be what it is today without these early community members and I am extremely grateful to them as well for placing their trust in the project.

At the same time, as the project continued to gain traction, I started to receive a substantial amount of investor interest in Envoy. There was a strong desire to get me to leave Lyft and start a company around the project. I wrote about this part of the journey so I won’t rehash it here, other than to say that a lot of time and headspace went towards processing all of these interactions. As the linked post describes, I ultimately decided to stay at Lyft and not start a company in order to support Envoy’s continued success.

Meanwhile, I still worked at Lyft, and as I will discuss further later, I was increasingly working two jobs. My first job was internally leading the networking team and operationally supporting Envoy at Lyft. My second job was being the public face of Envoy including OSS leadership, code reviews, fixing bugs, writing features that would further the project, speaking at conferences, helping other companies adopt and deploy Envoy, etc. I was starting to get spread too thin and showing signs of burnout. However, by mid 2017 there was no denying the fact that Envoy’s trajectory was substantially “up and to the right.” Adoption continued to climb across major corporations, “peer companies,” vertical products and services, etc.

CNCF donation and burnout

By the fall of 2017, two things were clear:

1. Envoy was outgrowing what the Lyft OSS apparatus could provide. The project needed help with legal, public relations, marketing, event organization, etc.
2. I was fast approaching total burnout, and needed to figure out a sustainable path forward.

To address point one, we finally agreed to consider moving Envoy to the CNCF. The CNCF had been courting the project for months, but it never seemed like there was any compelling reason to join. By late 2017, it was clear that CNCF resources would be at least neutral to the project, if not a net benefit. We began the submission process and ended up joining the foundation almost exactly a year after we initially open sourced the project. I am thankful to Alexis Richardson and Chris Aniszczyk for shepherding the project through this process.

Point two was much more complex. Fundamentally, I was working more hours than I had capacity to work, effectively across two different jobs. Furthermore, I was expecting my first child, due in early 2018, which as the arrival date got closer was causing me increasing anxiety. By this time it had become clear that I had not done a good enough job of setting expectations and boundaries on what I was capable of providing to Lyft while still focusing on the continued growth of Envoy from an industry perspective. Increasingly, I was letting things drop at Lyft, getting into interpersonal squabbles, and not meeting the expectations of my level in terms of providing mentorship and leadership to more junior team members.

In short, I was hitting my breaking point and ultimately I chose Envoy over Lyft, to the detriment of my Lyft coworkers. I would like to think that if I had been more transparent with the Lyft leadership about my workload in early to mid 2017 I might have avoided some of the worst outcomes, but the unfortunate reality is that resourcing OSS industry work which is not immediately obviously useful to the employing company is a complex endeavor. It might have gone better and it might not have. In any case, while I regret some of the interpersonal issues that I could have handled better, for better or worse I do not regret focusing on Envoy. My priority was Envoy over Lyft and I did what I thought I had to do at that time to make it succeed.

Plotting a more sustainable path

My first child was born in February of 2018, and Lyft’s extremely generous paternity leave policy provided a natural break and reset for me. I got some space from Lyft, and started to think more deeply about what I wanted and what would be sustainable for me.

When I came back from paternity leave, I was clear with the Lyft leadership that I could no longer participate in the “day to day” of operating Envoy at Lyft. Conversely, the infrastructure team also wanted some separation from me due to some of the fallout from late 2017. Due to this, I stepped back substantially, and actually took a complete hiatus from infrastructure at Lyft to work on writing the firmware networking code in the Lyft Bikes and Scooters initial release in mid to late 2018. This was an amazing team effort to get something shipped on a compressed time scale, and I really enjoyed doing something completely different for a few months.

2018 was also the year in which I aggressively started to figure out what it would look like to “replace myself” within the Envoy OSS community. I spent a substantial amount of time (and continue to spend a substantial amount of time) grooming maintainers and new contributors, organizing the first dedicated EnvoyCon, etc. Any leader should always have a goal of making sure that the organization will continue to function well should that leader step aside one day.

By the end of 2018, my major burnout risk had been sorted out, and I was working reasonable hours again and spending plenty of time with my wife and son, splitting my time roughly 50/50 between Envoy OSS work and general infrastructure leadership at Lyft. To be clear, the privilege that came from Envoy’s success enabled me to carve out this work life balance with Lyft. Over time, as my industry stature increased, my leverage increased in parallel, making it easier to set the terms of my employment as I liked them. Not many have the fortune of being in this situation and I understand how lucky I have been to “break through” to the other side of the burnout wall without having to leave my job.

Envoy starts to grow up

Since 2019, and through Covid, I have continued roughly the 50/50 split between Lyft infrastructure leadership and OSS leadership that I described above. There have certainly been times of monotony and yearning for something different (historically I am a habitual job changer - 6.5 years is by far the longest I have ever worked on one thing), but overall I have enjoyed seeing Envoy move from being an “upstart” to more of a “teenager.” I’m no longer preoccupied with doing everything my competitive brain can come up with to make Envoy an outsized success, because frankly Envoy is an outsized success, has swept the market, and has changed what users have come to expect from their application load balancing tools. Instead, I’m more focused on project sustainability. We are in this for the long haul, and these days I feel much more like a run of the mill CEO looking at attrition numbers, priorities, budgeting, security issues, etc. It’s not to say this is not useful work; it clearly is, it’s just a different kind of work from the early days which was substantially more technical and fast paced.

As of late 2021, the thing I am most proud of about Envoy is that in my opinion the community has become self-sustaining. We have an incredible group of maintainers, contributors, and users who are passionate about the project’s success and have all played a part in making Envoy what it is today. It is truly a team effort.

Lessons learned

The past 5 years have been an epic journey. While I feel I have learned relatively little technically, I have grown and learned so much about leadership, community building, and all of the other non-technical things that go into building a successful enterprise, whether corporate or a major open source success story. What follows are short summaries of some of my main learnings.

Successful OSS is like starting a business

Perhaps controversially, I think that if one has a goal to create an extremely successful OSS project they need to think of it like starting a business. There are lots of factors involved in starting a business beyond the core technology:

Hiring (in OSS this translates to recruiting contributors and maintainers)
Customer acquisition (in OSS this translates to users)
Documentation and technical writing
Public relations
Marketing
Legal (trademarks, licensing, etc.)
HR (in OSS this translates to resolving community disputes and setting culture)
Funding (in OSS this translates to ancillary costs like CI, finding maintainers jobs that allow them to work part or full time on the project, etc.)
General catch-all leadership and direction setting. There are limited resources and lots of things that can be worked on. The business/project needs to focus on the most important things to achieve product market fit.

Intuitively, I knew this going into the initial open source effort for Envoy, and I aggressively pursued all of the above areas as I worked to grow the project from where it started to what it is today. Everything in the above list is critical and a project is unlikely to succeed without all of them, especially if the technology area is crowded with well financed corporate competitors.

I strongly encourage those contemplating a large open source effort to invest in the above areas ahead of time to make the best possible impression on day one. Additionally, new open source projects should be prepared to invest more heavily in the above areas if the project grows and starts to see adoption.

Not surprisingly, I do relatively little coding on Envoy these days. My time on the project is primarily spent managing all of the non-technical aspects of the project (everything in the above list and more!) and making sure things are on track. Most of the coding items I do take on are “janitorial” background projects that are good for the project, but are not very much fun and are not likely to inspire other contributors (I of course have no say over what they do on a day to day basis, and I am incentivized to keep them as happy as possible so they don’t leave).

End-user driven OSS is a structural advantage

These days a lot of “big OSS,” especially in the infrastructure space, is financed by large corporations and venture backed startups. I won’t detour into a discussion about the difficult economics of OSS as I already wrote about it. I will say that I strongly believe that end-user OSS has a substantial advantage over corporate and venture backed OSS: an initial captive customer that is almost certainly deriving value from the software, otherwise the software wouldn’t be funded. This virtuous cycle of building something alongside a customer is powerful. It almost universally leads to better outcomes: software that is more reliable, more focused, and with less feature bloat. There are many examples of end-user driven OSS that then goes on to achieve substantial commercial success. This is not surprising to me given the solid foundation and inbuilt product market fit. I would love to see more end-user driven OSS than we do today, though I recognize the economics are difficult. For those who have the opportunity, lean into the structural advantage that this type of software has!

Don’t follow the hype, follow the customer

This is perhaps a corollary to “successful OSS is like starting a business” and “end-user driven OSS is a structural advantage,” but I can’t stress enough how critical it is to focus relentlessly on what customers actually want versus what the hype cycle thinks they might want. For example, over the years, there has been an endless amount of fun poked at Envoy for being written in C++. Do I like C++? No, not really. Did it get the job done in 2015 and appeal to the initial set of main users? Definitely. This is an example of focusing on the customer and the market and not giving in to hype with no real “business” impact. If one treats OSS like a business, it becomes immediately clear that being customer and market focused is the only way to achieve massive success. With Envoy I have spent a substantial amount of time arguing for the end user, to make sure we are building things in a way that benefits everyone, and not just a small set of niche users.

Default to “yes,” via extensibility

Following the customer can often lead to customer requests that may not fit cleanly into a project’s architecture. From an OSS perspective, losing focus on the primary goal of the project can lead to feature sprawl, unmaintainable software, and overburdened maintainers. At the same time, saying “no” is a guaranteed way to lose a potential user.

With Envoy I wanted to make sure that we could always at least say “yes, but…” in the form of offering a robust extensibility model that would allow users to fulfill their needs without having every change and feature need to be pushed upstream. This strategy has paid dividends many times over by reducing maintainer burden, allowing users to solve their own problems and, more importantly, pushing Envoy into use cases that I never would have imagined when the software was initially designed.

Extensibility, especially for OSS building blocks, is critical.

Quality matters

A further corollary to following the customer is that quality really does matter. Users want software that is easy to operate, is relatively free of bugs, cares about security, etc. At times it can seem that because OSS is “free,” quality is not guaranteed. This is maybe true in spirit, but practically users will not converge on a piece of software in large numbers until it’s clear that a project takes software quality seriously. Because acquiring users is a flywheel that acquires further users (especially when moving from early to late adopters) it’s even more critical to make sure that time is budgeted for overall software quality.

With respect to Envoy, I have always had a “zero crash” philosophy. Any crash is investigated and fixed, no matter how infrequent. This kind of attention to stability and quality does not go unnoticed.

Community is the only way to scale

It’s obvious, but I will say it anyway: community is the only way to scale OSS. This is a community of maintainers, contributors, and users. Furthermore, the tone of the community is set at the very beginning of the project and is extremely hard to change. Humans tend to follow norms. Once norms are set, outliers to those norms tend to be shunned, no matter what the norms are. Thus, the initial public tone of the project is extremely critical to setting its long term community trajectory.

When we made Envoy OSS, I put a huge amount of effort into working with people on GitHub, using constructive and welcoming language. In general I did everything I possibly could to make Envoy a welcoming place where people wanted to come and contribute, whether that be maintenance, the occasional contribution, or users helping other users.

Of all the different types of success that Envoy has had, the part that gives me the most personal gratification, by far, is that I have been told by a non-trivial number of people that they had sworn off OSS, and especially infrastructure OSS, because they felt people in most projects were awful to each other. Conversely, they love contributing to Envoy because the community is so respectful and welcoming to each other. It required a lot of hard work and discipline, especially early on in the project, to achieve this outcome and it has paid off in multitudes.

Do not underestimate the compounding effect of setting a project’s culture and tone from the very beginning.

Mixing commercial and OSS interests is very difficult

There has been much written about the difficult economics of OSS (including my own article which I referenced above). Suffice it to say that trying to mix commercial success and open source success is very difficult, primarily because the successes can often be at odds with each other.

I do believe that Envoy threaded this needle via both its robust API and extensibility system. Essentially, Envoy became a tool that is now consumed by a large variety of vertical products and services. This has yielded a community that is filled with companies who have chosen to work together on a common substrate, even while shipping higher layer products that compete with each other, via innovating at the extension/API/control plane/UI/UX layer.

Any successful open source project will see substantial commercial/investor interest. If the goal of a project is to maintain a vibrant community while still allowing commercial success (which I would argue is required for overall project success as the money has to come from somewhere), it’s extremely important to think up front about how to split the core from the commercial layers. The practicality and strategy of doing this will differ depending on the project and technology, but I believe that focusing on a robust API/extensibility split is a fruitful strategy.

Foundations are tricky

There is much discussion in the modern open source discourse about the role of foundations. I’m not going to comment heavily on this topic, but my main piece of advice is not to get distracted by foundations and the theoretical benefit they might provide. Instead, focus aggressively on product market fit, shipping quality software, and providing value to users. The rest will come naturally if these things are achieved.

For very successful projects, foundations, and more specifically neutral trademark holding grounds, are useful constructs, so I would definitely consider one at that point. The value that Envoy has derived from being part of the CNCF has increased over time as the project has matured. CNCF employs OSS lawyers, marketers, public relations personnel, a top notch event staff, and more. These extra resources are invaluable when it comes to “running the business.”

Think about governance up front

Every time I see another OSS community leadership blowup, I can't help but think that BDFL is in practice the lowest drama and most practical way of running large OSS projects over the long term.

I wish this were not the case but it certainly seems that it is.
— Matt Klein (@mattklein123) September 13, 2021

Open source governance is extremely hard. By its very nature, open source is anarchic, without a clear leadership structure. There is no one-size-fits-all approach to project governance, and each project has to find its way forward, either via a “BDFL”/CEO type model, a steering committee, an Apache PMC-like process, etc. All governance models have pros and cons and have different failure modes.

What’s most important is to think hard about governance up front, before the project becomes large and successful. Write down a set of rules and norms and especially take time to document the project’s conflict resolution process.

Also realize that per my comment above about how community norms are set early, early project maintainers will have an outsized impact on the overall style of dialog and conflict resolution, much like early employees at a company have an outsized impact on the company’s culture.

We have been extremely fortunate within Envoy to not have had any major disagreement that I can remember that was not quickly resolved amicably. We have never in the history of the project needed to invoke the maintainer voting process for conflict resolution. This is in my opinion a substantial achievement, and a testament to the quality and professionalism of all of the maintainers, especially given how popular the project has become and all of the commercial interests that surround it.

Open source contribution expectations are critical

I alluded to this above, but much of my own burnout stemmed from a poor job of setting reasonable expectations with my employer on the amount of time I needed to spend managing Envoy’s open source growth. I’m not going to lie and say that having such a conversation will magically make an employer carve out lots of time for someone to work on OSS, especially items that may not be directly applicable to one’s day job. With that said, I do believe it’s very important for all involved to have open and honest expectations about the open source process. The following are reasonable questions to ask either before open sourcing a project or before starting to work in an open source capacity:

Employees should ask their employer why they want to open source something.
Employers should ask their employees why they want to open source something. (It’s completely reasonable that the answers to this question and the previous one are different, but it should be discussed in the open.)
Employees should ask their employer what will happen if the project becomes successful. What resources will be available to the project? How much time will the employee be able to work on general OSS issues with the goal of directly furthering the project?

Mismatched expectations between employers and employees are a recipe for future resentment and burnout.

Proxies are easy, APIs are hard

To some, it may seem like the underlying network proxy mechanics that Envoy provides are the complicated part of the project. As it turns out, the proxy bits are (in my opinion) relatively simple compared to what it has taken to evolve a stable API ecosystem for Envoy. The mechanics of balancing API ergonomics for both human and computer consumption, maintaining stability across versions, growing the API to support other clients such as gRPC, specifying protocol semantics so that Envoy can speak to hundreds (possibly thousands?) of different management servers, etc. are mind-bogglingly complicated. I’m proud of what the team has achieved in this area (and a special shout out to Harvey who has driven much of this work), even with some mistakes along the way (such as the forced migration from v2 of the API to v3).

If a piece of software offers an API, and more importantly wants this API to be a critical building block for other systems, don’t underestimate the cost and complexity of offering a stable and ergonomic API surface. The flipside of this is that robust APIs are a strong part of an ecosystem’s flywheel of more products and users yielding further products and users, so in my opinion it’s well worth the effort.

Don’t ignore burnout

I’m not one to believe that a good work life balance is achievable 100% of the time if one wants to accomplish big things. The reality is that any success is a mix of existing privilege/opportunity, a good idea, good execution, and a whole lot of luck including being in the right place at the right time. All of these things came into play with Envoy, and I’m not going to pretend that I didn’t work myself almost into the ground, especially in 2017. I would also do 2017 all over again, because from my perspective I did what I had to do to make the project a success. (Sometimes I wonder whether Envoy would have happened at all if I had already had children. I’m not sure that it would have, but that is a subject of a much longer conversation!)

With all of that said, the type of epic pushes that I describe in 2017 can only go on for so long until a person breaks. I encourage everyone to be reflective on an ongoing basis about their work life balance, and figure out a sustainable path forward for themselves. All situations are different, and I can’t offer any one piece of advice for avoiding burnout, but I think being reflective is a good start, and something that I have had to work on quite a bit for myself.

Thank you

Working on Envoy for the past 6.5 years, 5 of those as open source software, has been the highlight of my career. The project’s success has truly been a team effort that I never could have accomplished on my own, and I am so proud of what all of us (maintainers, contributors, and users) have accomplished together. The maintainers and contributors who work on the project are the best group of engineers I have ever worked with; too talented a group to ever find themselves all at the same company or in the same geographic location - truly the theoretical potential of open source played out in practice. As a team we have had worldwide impact, changed what users expect from their software load balancing systems, and also built a vibrant and welcoming community. In my wildest dreams I never would have imagined that the project would become what it is today.

What the future holds for me is less clear. As I said above, my focus has shifted to sustainability. I want to make sure that should I leave one day, the project will remain healthy. With that said, that day is not yet here, and I look forward to helping to lead the project forward for the foreseeable future, to hopefully even greater success and adoption. Onward!