Introduction: AI + SRE = New Requirements
AI is rapidly reshaping how we approach observability and operations. From summarizing incidents to surfacing root causes and forecasting outages, AI promises a world where Site Reliability Engineers (SREs) spend less time chasing dashboards and more time solving meaningful problems. But AI can’t reason its way to insight without clean, structured, and timely data. That’s where OpenTelemetry (OTel) becomes foundational.
What Is OpenTelemetry and Why It Matters
OpenTelemetry (or OTel for short) is an open standard for collecting logs, metrics, and traces (with other signals on the way) from applications and infrastructure. It’s vendor-neutral, supported by every major observability platform, and backed by the CNCF. But its real value goes beyond avoiding vendor lock-in.
OTel is the plumbing of AI SRE. It delivers structured, correlated telemetry — the raw materials AI needs to reason about system behavior. Without OpenTelemetry, AI models are left guessing in a sea of inconsistent and disconnected data.
The Three Pillars: Logs, Metrics, Traces
OpenTelemetry unifies the three core observability signals:
- Logs are the most widely adopted. They offer rich, human-readable context and are essential for incident retrospectives.
- Metrics provide structured, lightweight signals that are ideal for trend analysis and triggering alerts.
- Traces are the hardest to adopt but the most valuable for causal analysis. They show how a request moves through your system.
Tracing requires developers to think beyond their service boundary — a challenge when the mental model often stops at “me” and doesn’t (always!) extend to “not me.” But it’s precisely this end-to-end view that makes tracing indispensable for AI-assisted RCA.
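That end-to-end view can be made concrete with a toy model. The sketch below uses plain Python rather than the OTel SDK; the `Span` shape and `child_of` helper are simplified assumptions, but they show the essential structure: spans in one trace share a trace ID and link to their parents, which is exactly what AI-assisted RCA walks to establish causality.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """A simplified span; real OTel spans also carry timing, status, and attributes."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None

def child_of(parent: Span, name: str) -> Span:
    """Start a child span that inherits the parent's trace id."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)

# One request crossing three services: the shared trace_id ties them together.
root = Span(name="GET /checkout", trace_id=uuid.uuid4().hex)
cart = child_of(root, "cart-service: load cart")
payment = child_of(cart, "payment-service: charge card")

# The parent links reconstruct the call chain -- the raw material of causal RCA.
assert payment.trace_id == root.trace_id
assert payment.parent_id == cart.span_id
```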
Reality Check: Gaps and Ambiguity Are Inevitable
Real-world telemetry is messy. Spans are missing. Logs are noisy. Metrics are misnamed or lack labels. AI must be robust to partial, ambiguous input — just like a human operator. But if we want accurate summaries and useful recommendations, we need to improve our inputs.
Instrumentation: Garbage In, Garbage Out
Instrumentation quality directly impacts AI utility. Manual instrumentation is fragile and inconsistent. That’s why auto-instrumentation is a game-changer:
- OpenTelemetry auto-instrumentation agents exist for major languages (Java, Python, .NET, Node.js).
- eBPF-based tools like Beyla allow zero-code instrumentation at the kernel level.
Auto-instrumentation dramatically lowers the barrier to adoption, accelerates coverage, and ensures more consistent signal quality. It's not all roses, though: broad auto-instrumentation can also generate far more telemetry, much of it not useful for driving the outcomes you want (see the Smart Edge layer below).
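At a high level, these agents work by wrapping well-known library entry points at load time, so application code stays untouched. Here is a minimal sketch of that wrapping idea in plain Python; the decorator, the `CAPTURED` list, and the record shape are illustrative stand-ins, not any agent's actual mechanism.

```python
import functools
import time

CAPTURED = []  # stand-in for a telemetry exporter

def instrument(fn):
    """Wrap a function to emit a span-like record, roughly what
    auto-instrumentation agents do to library entry points at load time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            CAPTURED.append({
                "name": fn.__name__,
                "duration_s": time.perf_counter() - start,
                "error": error,
            })
    return wrapper

# The application code itself stays unchanged:
@instrument
def handle_request():
    return "ok"

handle_request()
```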
Semantic Conventions: Making AI’s Job Easier
OpenTelemetry’s semantic conventions ensure consistency in how telemetry data is labeled and structured. Fields like http.status_code, k8s.pod.name, and db.system follow strict naming rules.
This consistency is gold for AI:
- Easier correlation across signals
- Better feature engineering for LLMs
- Improved model performance during inference
You can’t train on chaos. Semantic conventions turn chaos into clarity – and have led some to call semantic conventions “OpenTelemetry’s Most Important Contribution”.
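As a rough sketch of why this helps, consider normalizing ad-hoc attribute names onto convention names. The alias table and `normalize` helper below are hypothetical; only the target keys (http.status_code, k8s.pod.name, db.system) come from the conventions themselves.

```python
# Illustrative mapping from ad-hoc attribute names to semantic-convention names.
# The canonical keys come from the conventions; the aliases are hypothetical.
SEMCONV_ALIASES = {
    "status": "http.status_code",
    "statusCode": "http.status_code",
    "pod": "k8s.pod.name",
    "pod_name": "k8s.pod.name",
    "database": "db.system",
}

def normalize(attributes: dict) -> dict:
    """Rename known aliases so every service emits the same keys."""
    return {SEMCONV_ALIASES.get(k, k): v for k, v in attributes.items()}

# Two services describing the same thing differently...
a = normalize({"status": 500, "pod": "checkout-7d9f"})
b = normalize({"statusCode": 500, "pod_name": "cart-1a2b"})

# ...now share one vocabulary, so correlation is a simple key match.
assert "http.status_code" in a and "http.status_code" in b
```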
The OpenTelemetry Collector: The Smart Edge Layer
The OpenTelemetry Collector acts as a programmable router/switch and processor for telemetry data. It runs close to your applications — often as a sidecar, DaemonSet, or gateway — and can:
- Filter noise
- Enrich logs with (e.g. Kubernetes/app) metadata
- Downsample or aggregate metrics
- Export boiled-down (e.g. sampled) traces
With the right control plane or orchestration layer, the collector becomes a policy engine that distills telemetry to just what’s needed: the Goldilocks signals — not too little, not too much, just right for AI to reason about what went wrong.
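A plain-Python sketch of this edge processing follows, with an assumed record shape and metadata lookup. The real Collector does this declaratively through processors in its configuration, not through application code like this.

```python
def process(records, pod_metadata):
    """Collector-style edge processing: drop noise, enrich with k8s metadata.
    The record shape and metadata source are illustrative assumptions."""
    out = []
    for rec in records:
        # Filter noise: drop health-check chatter before it leaves the node.
        if rec.get("http.target") == "/healthz":
            continue
        # Enrich: attach Kubernetes metadata keyed by pod name.
        enriched = {**rec, **pod_metadata.get(rec.get("k8s.pod.name"), {})}
        out.append(enriched)
    return out

records = [
    {"http.target": "/healthz", "k8s.pod.name": "cart-1a2b"},
    {"http.target": "/checkout", "k8s.pod.name": "cart-1a2b", "http.status_code": 500},
]
metadata = {"cart-1a2b": {"k8s.namespace.name": "shop", "service.version": "1.4.2"}}

processed = process(records, metadata)
assert len(processed) == 1  # health checks never reach the backend
assert processed[0]["k8s.namespace.name"] == "shop"  # metadata attached at the edge
```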
Goldilocks vs. Data Lakes
The traditional model of observability has been to collect everything and throw it in a data lake. That’s expensive, slow, and increasingly unworkable.
AI SRE doesn’t benefit from more data. It benefits from better data.
Collectors let you pre-process, filter, and enrich data at the edge. That means faster insights, reduced storage costs, and real-time signals ready for AI consumption — without waiting on a slow, expensive query over petabytes of junk.
Shout Out: OTel Entities SIG
The OpenTelemetry Entities SIG is working to add entity metadata to telemetry – effectively an overlay on top of existing signals that associates them with entities (host, process, etc.) and relationships, each with their own lifecycle. This means:
- Tracking lifecycle of services, containers, pods
- Mapping relationships between entities
With this layer, AI can understand not just what happened, but how things are connected — essential for causality and correlation in complex systems.
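A toy illustration of such an overlay: entities plus typed relationships, with a traversal that answers "what does this service run on?". The data model here is an assumption for illustration only; the SIG's actual schema is still evolving.

```python
# Hypothetical entity overlay: entities with types, plus typed relationships.
ENTITIES = {
    "service:checkout": {"type": "service"},
    "pod:checkout-7d9f": {"type": "k8s.pod"},
    "host:node-3": {"type": "host"},
}
RELATIONS = [
    ("service:checkout", "runs_on", "pod:checkout-7d9f"),
    ("pod:checkout-7d9f", "scheduled_on", "host:node-3"),
]

def related(entity, relations=RELATIONS):
    """Walk outgoing relationships transitively: everything this entity sits on."""
    seen, frontier = set(), [entity]
    while frontier:
        src = frontier.pop()
        for s, _, dst in relations:
            if s == src and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

# An alert on the checkout service can now be tied to its pod and its host.
assert related("service:checkout") == {"pod:checkout-7d9f", "host:node-3"}
```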
Conclusion: OTel Is No Longer Optional for AI SRE
OpenTelemetry isn’t just an open standard — it’s the foundation of AI-assisted operations. Without it:
- You’re locked into proprietary formats.
- Your AI is guessing from inconsistent inputs.
- Your observability costs balloon with no added insight.
With it, you get structured, connected, real-time data that’s ready for LLMs, anomaly detectors, and self-healing systems.
AI SRE is coming fast. OpenTelemetry is how you prepare for it.
press@controltheory.com