As I made my annual pilgrimage to KubeCon (US) this year, I started to think about my work over the year with OpenTelemetry (OTel), as well as KubeCon’s past and my perspective on where things stand. This perspective has certainly been informed by my own tinkering, but also by conversations with many customers over the course of the year (including those at KubeCon!), at various stages of OTel adoption and observability maturity. As it turns out, I was about to make another pilgrimage of sorts, to LEGOLAND in San Diego for my son’s birthday (saying he’s into LEGO would be an understatement) – this got me comparing the current state of OTel to LEGO as well (and asking my son to create the LEGO above 🙂). In this blog, we’ll go from the “unboxing” experience of OpenTelemetry and what some recent surveys tell us about current challenges, through a comparison with “builds past”, to the future promise and some suggestions on how to get there.
That New Set Feel – the Promise and the Problems!
When my son gets a new LEGO set, he drops everything and goes at it – he grabs the instructions, rips open the bag with a “1” on it, and dives right in – rarely stopping other than for food. He’s truly amazing at this stuff, but does have a tendency to, uhm, “rush” a bit. You know, like those “IKEA moments” where you get the right-hand side confused with the left and have to backtrack a whole bunch of steps and rework – it happens in LEGO, too. Or there’s some LEGO Technic in there – certain pieces can require a fair amount of force to attach, and he might sometimes need a little help (having watched all seasons of “LEGO Masters” with my son, I can say skill with LEGO Technic is definitely a prerequisite to going the distance in that competition)!
Getting started with OpenTelemetry can also bring such challenges. A number of recent surveys point to the most common obstacles such as “Concerns about technical support”, “Integration Issues”, “Complex setup/configuration” – none of these entirely unusual for emerging technology projects. In OpenTelemetry’s own “Getting Started” survey, improved documentation, reference implementations and detailed tutorials were top items folks wish they had when getting started with OTel.
Snippet From Recent OTel “Getting Started Survey”
In a great blog post, Jeremy Morrell points out that OTel is really a bunch of mini projects, more like microservices than a monolith – “OpenTelemetry is not a monolith.” There’s the protocol (OTLP), the APIs, SDKs, collector, semantic conventions, and a lot of great stuff in “contrib” for these, too. I’ve personally been spending a lot of time with the collector. It’s not always obvious to folks starting out, for example, that the collector is a discrete project that can be leveraged without having instrumented any of your code with OTel. It’s the ultimate “Swiss army knife” as Jeremy puts it, converting a multitude of existing telemetry formats into OTel (OTLP), and back again. The OTel collector is the most common place to start because it “meets customers where they are,” and helps them (gradually) move to OTel, migrate between vendors, and more generally analyze and optimize what they already have. The question to a customer “are you using OTel?” also becomes interesting, because you really need to qualify, “which part(s)?” (not dissimilar to the debate over what it means to be “OTel native”).
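To make the “collector without instrumentation” point a bit more concrete, here is a minimal sketch of a collector configuration that scrapes an existing Prometheus endpoint, tails existing log files, and forwards both as OTLP. It assumes a contrib-style collector distribution (the `prometheus` and `filelog` receivers ship there, not in the core distribution); the job name, scrape target, log path, and backend endpoint are all placeholders, not values from any real environment.

```yaml
# Sketch only: ingest telemetry an application already emits, export it as OTLP.
# Job name, target, log path, and endpoint below are hypothetical placeholders.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: existing-app            # assumed name
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9090"]   # assumed metrics endpoint
  filelog:
    include: ["/var/log/app/*.log"]         # assumed log location

processors:
  batch: {}                                 # batch telemetry before export

exporters:
  otlphttp:
    endpoint: https://otlp.example.com      # placeholder OTLP backend

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [filelog]
      processors: [batch]
      exporters: [otlphttp]
```

Nothing in the application changes here; swapping vendors later would mostly be a matter of changing the exporter section, which is the “glue” role in a nutshell.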
As an aside, the popularity of the collector is evident from the OTel year in review, with the collector docs receiving the largest share of page views. A survey focused specifically on the collector earlier in the year found that the majority of respondents were already using more than 10 collectors, with a strong tilt towards K8s deployment of those collectors, and that the top two areas for improvement were “stability” and “configuration management and resolution.” This all jibes with the collector as “glue” comment above, but also points to a growing need to manage this glue moving forward.
And speaking of Kubernetes, can we learn anything about where OpenTelemetry is headed by looking at the trajectory of the CNCF’s most popular project?
Builds from the Past – Kubernetes
The team here at ControlTheory experienced the Kubernetes journey firsthand. Before ControlTheory, StackEngine was founded in early 2014, shortly after the rise of Docker. We saw that containers (similar to VMs before them) would need to be managed and orchestrated on their way from developer laptops to eventual production use. This was before Kubernetes (K8s) was announced by Google (Jun 2014). Container orchestration “wars” ensued, and K8s ultimately won of course (the Kubernetes documentary Part 1 and Part 2 are a fun watch 🙂). StackEngine was acquired by Oracle at the end of 2015, and the team went on to build container services there, including the OCI managed Kubernetes service (OKE).
Having experienced the K8s journey firsthand, we certainly see some parallels with OpenTelemetry, from the rise in popularity of the projects themselves (OTel often being cited as the 2nd most active CNCF project after K8s) to the growing chorus of cries of “complexity”, which persist for K8s to this day:
Debates have ensued over what K8s is – “the new platform for building other platforms” – and what it isn’t – “K8s is not for developers,” and even takes on the “platform engineering” movement itself being a possible reaction to K8s (complexity).
Seeing a recent post by Jyoti Bansal, it was kind of funny/not funny to wonder whether OTel will follow the same eventual arc…
But despite all of the current challenges and the pitfalls of past projects, what about the potential upside of OTel?
The Promise of the “Golden Brick”
From this, to this?
In LEGO Masters, there is the constant pursuit of the “Golden Brick”, a simple (but shiny) LEGO brick that awards its owners immunity from elimination for the current round. It feels like we’re pursuing something similar with OTel – immunity against exorbitant observability bills, anyone? The promise, oh the promise of OpenTelemetry. I particularly liked how Jeremy Morrell put it – “We’re trapped in a local maximum. Open Standards provide a way out and, hopefully, a better experience.” This reminded me of a conversation on observability with a serial CTO here in town, his perspective being that “open is the only answer to the (observability) cost problem”.
We’ve found many companies recognize this “need for open,” define OTel as a sort of “north star” or ultimate destination, and possibly even evaluate new vendors based on their compatibility with OTel. Getting from “here to there” though remains a challenge – how do these companies plot a pragmatic path to start transforming their existing instrumentation and incumbent solutions? Do they start with net new product initiatives? (AI projects anyone?) Together with the collector “glue” mentioned above, do they break down their existing products and start re-instrumenting piece by piece? (another CTO described the process as akin to moving from a “monolith into microservices”…)
The Role of Partners and Vendors
In the LEGO Masters show, they have teams of 2, and it’s not uncommon for one of the members to be more of the architect/designer and one to be the “builder.” Similarly on the OpenTelemetry journey, there’s certainly a case to be made that it is the job of vendors to help build the “better experience” mentioned above on top of open standards. For anyone that tried “K8s the hard way” (while great for learning and education), managed Kubernetes offerings (GKE, AKS, EKS, OKE, etc.) were clearly an extremely attractive proposition to remove much of the toil from building, operating, and scaling a K8s infrastructure. The same is true of the evolution of those managed K8s offerings, which continue to move “up the stack,” and all of which now provide some kind of “self driving K8s” (named auto “something”!) for self patching, healing, scaling and so on. These evolutions were all necessary to continue simplifying K8s, to the point where a longtime AWS ECS user told me in a recent conversation that (managed) “auto” K8s (in this case EKS) may finally be ready for them, more than 10 years after the introduction of K8s itself.
So how do we get from here to there for OpenTelemetry?
A Path Forward – a LEGO Manual Included
In a previous life back in 2013, I was competing with AWS at another cloud provider. One of the things my team was tasked with was coming up with a bunch of “reference architectures,” or how to assemble the LEGO pieces of our particular cloud into viable end user solutions (with more than 200 services and counting for some cloud providers, it turns out we have LEGO assembly in lots of places 🙂). AWS did a phenomenal job of this, publishing reference architectures at the time (and way more today) for everything from Windows to LAMP stacks. In fact, one of the announcements coming out of KubeCon NA this year was a (welcome) expansion of Cloud Native Reference Architectures. I’d make the case that such reference architectures are also needed within the OpenTelemetry project itself, not just the broader CNCF ecosystem, defining best practices and end states for implementing OTel (this was also the #2 ask in the survey above).
While reference architectures provide great examples of the “desired state” or where we want to get to, we also need guides on how to get there. It is awesome to watch LEGO Masters build something from scratch with nothing but raw parts, but there surely need to be some guideposts for the rest of us. I agree with Jeremy’s comment that the complexity of OTel is driven in part by its extensibility, and that “What looks like unnecessary bloat to you frequently turns out to be core to someone else’s adoption.” In my experience, though, there is always a Pareto distribution or “80-20” rule in the adoption of new technology. We’ve found in our conversations with customers, for example, that the OTel Kubernetes Helm charts and operator are a case in point: an easy way to get started with OTel by launching onto a K8s cluster and wiring things up to an observability backend of choice for K8s and container metrics, events, logs and traces (MELT), without having to understand all of the details “under the hood.”
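As a rough illustration of that “80-20” starting point, a small values file for the community `opentelemetry-collector` Helm chart might look something like the sketch below. Treat it as an assumption-laden example rather than a recommended setup: the preset names reflect the chart as I understand it and can change between chart versions, and the backend endpoint is a placeholder.

```yaml
# values.yaml sketch for the open-telemetry/opentelemetry-collector Helm chart.
# Chart keys may differ by chart version; the endpoint is a hypothetical placeholder.
mode: daemonset                        # one collector pod per node
image:
  repository: otel/opentelemetry-collector-k8s

presets:
  kubernetesAttributes:
    enabled: true                      # enrich telemetry with pod/namespace metadata
  kubeletMetrics:
    enabled: true                      # node, pod, and container metrics
  logsCollection:
    enabled: true                      # container logs from each node

config:
  exporters:
    otlphttp:
      endpoint: https://otlp.example.com   # placeholder observability backend
  service:
    pipelines:
      metrics:
        exporters: [otlphttp]
      logs:
        exporters: [otlphttp]
      traces:
        exporters: [otlphttp]
```

A file like this would typically be passed to `helm install` with `-f values.yaml`. The presets do the wiring that would otherwise mean hand-writing receivers, processors, and RBAC, which is exactly the “under the hood” detail a newcomer can defer.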
I’m probably dating myself, but I was a fan of the old “Cloud Native Trail Map” – it provided an opinionated set of steps on where to start and where to go next on your Cloud Native adoption journey. Sure, it kind of sucked if your project was one of the many (100+) not included, but again, 80-20 rule. I could see something like this for OpenTelemetry being extremely helpful to newcomers. Reference architectures, onboarding/journey guides, and vendor solutions to pull it all together sound like a winning combination.
Anyone remember this?
Summary
In this blog, we looked at some recent surveys on the state of OpenTelemetry, some of the top challenges reported, a comparison to the Kubernetes project journey, and some possible paths forward. Despite the challenges mentioned, ControlTheory believes that the future for observability looks distinctly bright, and open! And for those looking to suggest improvements, the Developer Experience Survey by the OpenTelemetry End-User SIG is open until the end of January!