Cast your mind back to 2016 – anything come to mind? – no, I’m not talking about the Olympics or another US election 🙂 It turns out this was the first time the term “observability” was used to talk about what had generally been referred to up until then as plain old “monitoring” – you know, trying to understand whether our systems and applications are available and performant.
Now this definition of observability was borrowed from Control Theory – an existing field in engineering and mathematics used in a wide range of applications, everything from mechanical systems, robots, and fighter jets to air conditioning systems.
The widely used definition of observability – this one from Wikipedia – goes something like:
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.”
Pretty nice definition – it seems to capture the world view where we’re collecting various signals (logs, metrics, traces, whatever…) to try and figure out what’s going on! It turns out, though, that we left off the next sentence – from Wikipedia again, underlined and italicized for effect 🙂 :
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In control theory, the observability and controllability of a linear system are mathematical duals.”
Cool – so observability is actually linked with “controllability” as its “mathematical dual” – two sides of the same coin, as it were… So what’s controllability? Back to Wikipedia…
“Controllability is an important property of a control system and plays a crucial role in many control problems, such as stabilization of unstable systems by feedback, or optimal control.”
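For the linear systems studied in control theory, this duality can be stated precisely. The following is a standard textbook result, not part of the Wikipedia excerpts above, included here as a sketch of what “mathematical duals” actually means:

```latex
% Linear system: state x, input u, output y
\dot{x} = Ax + Bu, \qquad y = Cx

% Controllability matrix: can we steer the state anywhere via the input u?
\mathcal{C} = \begin{bmatrix} B & AB & A^{2}B & \cdots & A^{n-1}B \end{bmatrix}

% Observability matrix: can we infer the state from the output y?
\mathcal{O} = \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix}

% Duality: (A, C) is observable exactly when (A^{T}, C^{T}) is controllable,
% since \mathcal{O}(A, C) = \mathcal{C}(A^{T}, C^{T})^{T}.
```

In other words, every question about what you can infer from a system’s outputs has a mirror-image question about what you can do to that system through its inputs.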
Ok – it’s an important property, it uses feedback, and it helps stabilize our systems – but what is it? Let’s step back and take a look at a basic control system from control theory, again borrowing from Wikipedia!
We have a system whose “output” we can measure (passing through A) and whose input can be controlled (through B) by leveraging a feedback loop. The output is typically the thing we want to optimize or drive to some “desired state” – keep our fighter jet stable, move our robot arm to point X,Y,Z, or maintain our AC at 76 Fahrenheit (not always so easy here in Texas 😂). In other words, we are measuring in order to control something, not just for the sake of measuring.
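To make that feedback loop concrete, here’s a minimal sketch (purely illustrative – the function name, gain, and temperatures are mine, not from any real thermostat) of a proportional controller holding that 76°F setpoint:

```python
def thermostat_step(temp, setpoint=76.0, gain=0.3):
    """One pass through the feedback loop: measure the output (temp),
    compare it to the desired state, and correct the input (cooling power)."""
    error = setpoint - temp    # measured output vs. desired state
    correction = gain * error  # proportional control: u = K * e
    return temp + correction   # the system responds to the corrected input

# Simulate a hot Texas afternoon: start at 90F and iterate the loop.
temp = 90.0
for _ in range(30):
    temp = thermostat_step(temp)
print(round(temp, 1))  # converges toward the 76F setpoint
```

The loop never needs to know *why* the room is hot – it just keeps measuring the output and nudging the input until the error shrinks to zero, which is the whole point of closing the loop.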
Back to the world of the applications that our businesses rely on, some outputs, desired states, or goals that we might want to optimize for (and which are not always complementary):
- Ensure our applications (services) are available and performant
- Minimize our cost to deliver this (FinOps et al)
- Ensure we have the information we need to identify the root cause of issues when they occur (MTTI) and restore service quickly (MTTR)
- Ensure our product management team can understand our COGS and “product usage” (e.g. to identify customer friction and/or opportunities)
- Ensure our Sales team understands “leading indicators” to move a customer from free to paid
- Ensure our Customer Success Team can track specific customer on-boarding and product usage/uptake to maximize our probability of renewal
Development/engineering and operations teams clearly have goals driving our observability and controllability feedback loops, but so do a wealth of other stakeholders, as we see above. These goals represent the “why” behind what we’re trying to observe (and control) in the first place (more on this another time).
While feedback loops DO exist today, they are typically extremely inefficient – often starting with “projects” like “our [insert_observability_provider_here] bill is too high, figure out how to lower it” and resulting in development spending one or more sprints/spikes figuring out which telemetry not to send (or which cardinality to reduce…), since most solutions are priced based on how much data is ingested/stored/indexed. Invariably, when an issue does occur, we are missing the very telemetry we dropped on the floor, because we are often dealing with “unknown unknowns” – stuff we can’t always predict ahead of time (i.e. we want to minimize cost, yes, but we also want to minimize our MTTR…). Just as bad, we may not have known up front that the telemetry we dropped was being used “downstream” by Product, Sales, or another stakeholder listed above, leading to other business impacts. This cycle – what I call the “Goldilocks problem” – is ongoing, and continues to consume precious development cycles that could be spent elsewhere…
Data Plane and Control Plane – Coming to Observability Near You
For those coming from the network space or who have worked at a Cloud Provider, the Data Plane and Control Plane terms will be very familiar. Originating in the networking world in the early 2000s, the idea was for the control plane to figure out the routing or switching “desired state”, and for the data plane (or “forwarding plane”) to implement it (often in hardware). This separation continued to evolve with virtualization across compute, storage, and network, ultimately culminating in the “Software-Defined Data Center” (SDDC), a pattern replicated across many other OSS and Cloud Native projects. These concepts are very much alive and well in cloud providers today – in fact, for many Cloud services, there are entire engineering teams dedicated to the Control Plane (service CRUD) or Data Plane (customer workload/forwarding plane) parts of those services. Perhaps most importantly, this separation of data and control planes led to numerous benefits in performance, resilience, simplicity, and agility – we could adapt the infrastructure dynamically based on business needs.
Now what if we brought the control plane and data plane separation to observability? What if we could dynamically adjust the telemetry we send and receive? What if we could use feedback loops driven by our goals (measures of utility) to make those adjustments, both at runtime and across the SDLC? What if we could do this for all of our (telemetry) stakeholders and consumers?
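One way to picture such a loop – a hypothetical sketch under my own assumptions, with invented names, thresholds, and units, not a description of any real product or the OpenTelemetry API – is a control plane that nudges the data plane’s sampling rate toward a cost target, but opens the taps when an incident signal appears:

```python
class TelemetryController:
    """Illustrative control-plane loop (all names/thresholds hypothetical):
    adjusts the data plane's sampling rate so telemetry volume tracks a
    cost target, while an error-rate signal overrides cost during incidents."""

    def __init__(self, target_gb_per_hour=50.0, gain=0.2):
        self.target = target_gb_per_hour
        self.gain = gain
        self.sampling_rate = 1.0  # fraction of telemetry kept (the data-plane knob)

    def step(self, observed_gb_per_hour, error_rate=0.0):
        if error_rate > 0.05:
            # Incident in progress: prioritize MTTR over cost, keep everything.
            self.sampling_rate = 1.0
        else:
            # Feedback: proportional correction on the normalized cost error.
            error = (self.target - observed_gb_per_hour) / self.target
            self.sampling_rate = min(1.0, max(0.05, self.sampling_rate + self.gain * error))
        return self.sampling_rate

ctl = TelemetryController(target_gb_per_hour=50.0)
print(ctl.step(observed_gb_per_hour=100.0))                   # over budget: sample less
print(ctl.step(observed_gb_per_hour=100.0, error_rate=0.10))  # incident: keep it all
```

The point isn’t this particular controller – it’s that the “how much telemetry should we keep?” question becomes a continuously running feedback loop driven by explicit goals (cost AND MTTR), rather than a quarterly engineering project.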
Here at ControlTheory, we believe this opportunity is at hand, driven in part by the emergence of OpenTelemetry (OTel) and other open standards. We believe that we can leverage these feedback loops to control our telemetry to both save costs AND enable much-needed insights – in short, we believe it’s time to move from observability to controllability.