Why OpenTelemetry Is Foundational for AI SRE

August 8, 2025

By Jon Reeve

Picture showing OpenTelemetry (OTel) as a foundation for AI SRE

Introduction: AI + SRE = New Requirements

AI is rapidly reshaping how we approach observability and operations. From summarizing incidents to surfacing root causes and forecasting outages, AI promises a world where Site Reliability Engineers (SREs) spend less time chasing dashboards and more time solving meaningful problems. But AI can’t reason its way to insight without clean, structured, and timely data. That’s where OpenTelemetry (OTel) becomes foundational.

What Is OpenTelemetry and Why It Matters

OpenTelemetry (OTel for short) is an open standard for collecting logs, metrics, and traces (with other signals on the way) from applications and infrastructure. It’s vendor-neutral, supported by most major observability platforms, and backed by the CNCF. But its real value goes beyond avoiding vendor lock-in.

OTel is the plumbing of AI SRE. It delivers structured, correlated telemetry — the raw materials AI needs to reason about system behavior. Without OpenTelemetry, AI models are left guessing in a sea of inconsistent and disconnected data.

The Three Pillars: Logs, Metrics, Traces

OpenTelemetry unifies the three core observability signals:

Logs are the most widely adopted. They offer rich, human-readable context and are essential for incident retrospectives.
Metrics provide structured, lightweight signals that are ideal for trend analysis and triggering alerts.
Traces are the hardest to adopt but the most valuable for causal analysis. They show how a request moves through your system.

Tracing requires developers to think beyond their own service boundary, which is a challenge when the mental model stops at “me” (my service) and doesn’t always extend to “not me” (everything it calls or is called by). But it’s precisely this end-to-end view that makes tracing indispensable for AI-assisted root cause analysis (RCA).
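To make the end-to-end view concrete, here is a minimal Python sketch that rebuilds the failing path of one request from a flat list of spans. It is not OpenTelemetry SDK code: the span fields (span_id, parent_id, service, error) and the service names are simplified assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified spans for one request; real OTel spans carry more fields.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "frontend", "error": True},
    {"span_id": "b", "parent_id": "a", "service": "checkout", "error": True},
    {"span_id": "c", "parent_id": "b", "service": "payment", "error": True},
    {"span_id": "d", "parent_id": "b", "service": "cart", "error": False},
]


def children_index(spans):
    """Map each parent span ID to the list of its child spans."""
    index = defaultdict(list)
    for span in spans:
        index[span["parent_id"]].append(span)
    return index


def deepest_error_path(spans):
    """Follow failing spans from the root down to the deepest failing service."""
    index = children_index(spans)
    path = []
    current = next((s for s in index[None] if s["error"]), None)
    while current is not None:
        path.append(current["service"])
        current = next((c for c in index[current["span_id"]] if c["error"]), None)
    return path


print(deepest_error_path(SPANS))
```

The parent/child links are what let the function walk past the service that reported the error and reach the downstream service where the failure started. Without spans from that downstream service, the walk would stop early.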

Reality Check: Gaps and Ambiguity Are Inevitable

Real-world telemetry is messy. Spans are missing. Logs are noisy. Metrics are misnamed or lack labels. AI must be robust to partial, ambiguous input — just like a human operator. But if we want accurate summaries and useful recommendations, we need to improve our inputs.
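One way to be robust to missing spans is to measure the gap before reasoning over the trace. The sketch below is a hedged illustration in plain Python, with hypothetical span dictionaries rather than real OTel data structures: it flags spans whose parent was never received and reports a simple completeness score.

```python
def find_orphans(spans):
    """Return spans whose parent_id refers to a span that was never received."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span for span in spans
        if span["parent_id"] is not None and span["parent_id"] not in known_ids
    ]


def trace_completeness(spans):
    """Fraction of spans that are either roots or have their parent present."""
    if not spans:
        return 0.0
    return 1.0 - len(find_orphans(spans)) / len(spans)


# Hypothetical trace in which the span with ID "b" was dropped.
received = [
    {"span_id": "a", "parent_id": None},
    {"span_id": "c", "parent_id": "b"},
    {"span_id": "d", "parent_id": "b"},
    {"span_id": "e", "parent_id": "a"},
]
print(find_orphans(received), trace_completeness(received))
```

A score like this could be passed along with the trace so that a downstream model or operator knows how much to trust any causal conclusion drawn from it.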

Instrumentation: Garbage In, Garbage Out

Instrumentation quality directly impacts AI utility. Manual instrumentation is fragile and inconsistent. That’s why auto-instrumentation is a game-changer:

OpenTelemetry auto-instrumentation agents exist for major languages (Java, Python, .NET, Node.js).
eBPF-based tools like Beyla allow zero-code instrumentation at the kernel level.

Auto-instrumentation dramatically lowers the barrier to adoption, accelerates coverage, and produces more consistent signal quality. It’s not all roses, though: it can generate even more telemetry, much of which may not help drive the desired outcomes (see the Smart Edge Layer section below).
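The core idea behind zero-code instrumentation in a dynamic language can be shown with a toy example: wrap an existing function from the outside so that timing and status are recorded without editing its source. This is only an illustration of the wrapping idea, not how any particular OTel agent or eBPF tool is implemented. The names traced, RECORDED_SPANS, and PaymentClient are hypothetical.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real exporter


def traced(func):
    """Wrap func so each call records its name, status, and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            RECORDED_SPANS.append({
                "name": func.__qualname__,
                "status": status,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


class PaymentClient:  # hypothetical application or library class
    def charge(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        return "charged"


# Patch the method from the outside; PaymentClient's source is untouched.
PaymentClient.charge = traced(PaymentClient.charge)
```

Because every wrapped call is recorded the same way, coverage grows quickly and consistently, which is also why the volume of telemetry can grow faster than its usefulness.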

Semantic Conventions: Making AI’s Job Easier

OpenTelemetry’s semantic conventions ensure consistency in how telemetry data is labeled and structured. Attributes like http.response.status_code (formerly http.status_code), k8s.pod.name, and db.system.name (formerly db.system) follow defined naming rules.

This consistency is gold for AI:

Easier correlation across signals
Better feature engineering for LLMs
Improved model performance during inference

You can’t train on chaos. Semantic conventions turn chaos into clarity, which leads some to call them OpenTelemetry’s most important contribution.
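As a rough illustration of why shared names help correlation, the sketch below renames ad-hoc attribute keys to semantic convention names. The alias table is entirely hypothetical. The target names are semantic convention attributes, with the HTTP and database ones in their current stable forms (http.response.status_code and db.system.name).

```python
# Hypothetical ad-hoc keys mapped to OpenTelemetry semantic convention names.
KEY_ALIASES = {
    "status": "http.response.status_code",
    "statusCode": "http.response.status_code",
    "pod": "k8s.pod.name",
    "pod_name": "k8s.pod.name",
    "database": "db.system.name",
}


def normalize_attributes(attributes):
    """Rename known ad-hoc keys; pass unknown keys through unchanged."""
    normalized = {}
    for key, value in attributes.items():
        normalized[KEY_ALIASES.get(key, key)] = value
    return normalized


frontend = normalize_attributes({"statusCode": 503, "pod": "frontend-1"})
checkout = normalize_attributes({"status": 503, "pod_name": "checkout-2"})
# Both records now expose the same keys, so they can be compared directly.
print(sorted(frontend) == sorted(checkout))
```

When services emit the conventional names in the first place, a mapping step like this is unnecessary, and nobody has to maintain a table of each team's naming habits.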

The OpenTelemetry Collector: The Smart Edge Layer

The OpenTelemetry Collector acts as a programmable router/switch and processor for telemetry data. It runs close to your applications — often as a sidecar, DaemonSet, or gateway — and can:

Filter noise
Enrich logs with metadata (e.g. Kubernetes or application attributes)
Downsample or aggregate metrics
Export reduced (e.g. sampled) traces

With the right control plane or orchestration layer, the collector becomes a policy engine that distills telemetry to just what’s needed: the Goldilocks signals — not too little, not too much, just right for AI to reason about what went wrong.
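The filter, enrich, and sample steps above can be pictured as a chain of small processors. The real Collector is configured declaratively rather than written in Python, so the following is only a conceptual sketch. The function names, the pod name, and the 25% sampling rate are all assumptions.

```python
import zlib


def drop_debug(record):
    """Filter: discard low-value debug records."""
    return None if record.get("severity") == "DEBUG" else record


def add_k8s_metadata(record, pod_name="checkout-7d9f"):
    """Enrich: attach a (hypothetical) pod name to the record."""
    return {**record, "k8s.pod.name": pod_name}


def sample_by_trace(record, keep_percent=25):
    """Sample: keep a stable subset of traces, decided by hashing the trace ID."""
    bucket = zlib.crc32(record["trace_id"].encode()) % 100
    return record if bucket < keep_percent else None


def run_pipeline(records, processors):
    """Pass each record through the processors in order; None means dropped."""
    output = []
    for record in records:
        for process in processors:
            record = process(record)
            if record is None:
                break
        else:
            output.append(record)
    return output


records = [
    {"severity": "DEBUG", "trace_id": "t1", "body": "cache miss"},
    {"severity": "ERROR", "trace_id": "t2", "body": "payment declined"},
]
print(run_pipeline(records, [drop_debug, add_k8s_metadata]))
```

Hashing the trace ID, rather than sampling each record at random, keeps or drops every record of a given trace together, so the traces that survive remain complete.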

Goldilocks vs. Data Lakes

The traditional model of observability has been to collect everything and throw it in a data lake. That’s expensive, slow, and increasingly unworkable.

AI SRE doesn’t benefit from more data. It benefits from better data.

Collectors let you pre-process, filter, and enrich data at the edge. That means faster insights, reduced storage costs, and real-time signals ready for AI consumption — without waiting on a slow, expensive query over petabytes of junk.
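To give a feel for the storage effect of pre-processing, here is a small sketch that averages raw metric points into fixed time windows. The window size and sample values are assumptions, and a real deployment would choose an aggregation that suits each metric type rather than a plain mean.

```python
from collections import defaultdict


def downsample(points, window_s=60):
    """Average (timestamp_s, value) points into fixed windows of window_s seconds."""
    buckets = defaultdict(list)
    for timestamp_s, value in points:
        buckets[timestamp_s - timestamp_s % window_s].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical per-second latency readings covering two minutes.
raw = [(t, 100.0 if t < 60 else 300.0) for t in range(120)]
reduced = downsample(raw)
print(len(raw), "points reduced to", len(reduced))
```

The jump in latency between the first and second minute is still visible after aggregation, at a small fraction of the original point count.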

Shout Out: OTel Entities SIG

The OpenTelemetry Entities SIG is working to add metadata to telemetry: effectively an overlay on top of existing signals that associates them with entities (host, process, etc.) and with relationships between those entities, each with its own lifecycle. This means:

Tracking the lifecycle of services, containers, and pods
Mapping relationships between entities

With this layer, AI can understand not just what happened, but how things are connected — essential for causality and correlation in complex systems.
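A small graph sketch shows what knowing "how things are connected" buys you. The entity identifiers and relation names below are invented for illustration and do not reflect the Entities SIG's actual data model.

```python
from collections import deque

# Hypothetical entity relationships: (source, relation, target).
RELATIONSHIPS = [
    ("pod/checkout-7d9f", "runs_on", "host/node-3"),
    ("pod/payment-5c2a", "runs_on", "host/node-3"),
    ("service/checkout", "backed_by", "pod/checkout-7d9f"),
    ("service/payment", "backed_by", "pod/payment-5c2a"),
]


def neighbors(entity):
    """Entities directly related to the given one, in either direction."""
    linked = set()
    for source, _relation, target in RELATIONSHIPS:
        if source == entity:
            linked.add(target)
        elif target == entity:
            linked.add(source)
    return linked


def connected_entities(start):
    """All entities reachable from start (breadth-first), excluding start."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbors(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


print(sorted(connected_entities("host/node-3")))
```

Starting from a single unhealthy host, the traversal reaches every pod and service that depends on it, which is the kind of context that lets simultaneous alerts from two services be traced back to one shared cause.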

Conclusion: OTel Is No Longer Optional for AI SRE

OpenTelemetry isn’t just an open standard — it’s the foundation of AI-assisted operations. Without it:

You’re locked into proprietary formats.
Your AI is guessing from inconsistent inputs.
Your observability costs balloon with no added insight.

With it, you get structured, connected, real-time data that’s ready for LLMs, anomaly detectors, and self-healing systems.

AI SRE is coming fast. OpenTelemetry is how you prepare for it.
