How to Trust What Your Agents Ship
Closing the runtime feedback loop on AI-generated code
The short answer: you can’t trust AI-generated code with pre-deploy controls alone, because it fails on runtime assumptions that review, tests, and staging never see. Trust requires a post-deploy control: a runtime feedback loop that distills production signal, diagnoses incidents with cited evidence, and carries runtime context back to the developer and the agent that wrote the code. This guide explains why the loop is broken today and the four-stage architecture that closes it.
Why don’t engineering leaders trust AI-generated code?
The adoption question is settled. Microsoft and Google have both said roughly a quarter of their code is now AI-generated; some 2026 surveys put the median enterprise well above that. Agents aren’t autocompleting lines anymore. They’re taking issues, writing patches, and opening pull requests.
The trust question is not settled. Consider what engineering leaders told researchers in the first half of 2026:
| Finding | What leaders reported | Source |
|---|---|---|
| 0% | Senior SRE and DevOps leaders “very confident” that AI-generated code will behave correctly once deployed. Lightrun’s CBO called it a trust wall. | Lightrun, 2026 |
| 43% | AI-generated code changes that require manual debugging in production, after passing QA and staging. | Lightrun, 2026 |
| 88% | Teams that need two to three redeploy cycles to verify an AI-suggested fix. Zero respondents could verify in one cycle. | Lightrun, 2026 |
| 35% | Teams writing code with AI that won’t ship it, because they lack the confidence to do so safely. | Flux, 2026 |
| 81% | Enterprise leaders reporting an increase in production issues tied to AI-generated code. | CloudBees, 2026 |
Read those numbers together and a pattern emerges. This is not a story about AI writing bad code. It’s a story about teams that have no reliable way to verify what the code does once it’s running. The 43% figure is the tell: that code passed every gate the organization had. Review approved it. Tests passed. Staging looked fine. Production disagreed anyway.
If your gates all pass and production still fails, the problem isn’t that you need better gates. It’s that all your gates are on the wrong side of the deploy.
Why does AI-generated code fail in production after passing QA?
Every control the industry has added in response to AI-generated code lives pre-deploy: mandatory human review, SAST and DAST in CI, test-coverage thresholds, staging environments, no-AI rules on security-critical paths. These are reasonable controls. They catch what they’re designed to catch: syntax-level defects, known vulnerability classes, regressions your test suite anticipated.
They can’t catch what AI-generated code actually gets wrong most often: runtime assumptions. The connection pool that exhausts under real concurrency but not under staging load. The retry loop that’s correct in isolation and pathological when three services run it at once. The query that’s fine against seed data and a table scan against production volume. The error path that was never exercised because the failure it handles only happens in production.
There’s a second problem underneath the first, and it’s specific to how AI writes code: you didn’t decide what got logged. When a human writes a service, instrumentation reflects intent. The developer logs what they know they’ll need at 2am. When an agent writes a service, the logging is whatever the model emitted. In practice that produces three failure classes:
- Ghost logs. Log statements that fire constantly and say nothing, status noise from framework boilerplate that buries real signal.
- Missing failure-path instrumentation. The happy path logs beautifully; the catch block swallows the exception or emits a generic message with no context.
- Missing contextual logging. Errors fire without the request ID, user ID, or upstream state that would make them diagnosable.
None of it you decided. It’s just there. Or it’s not.
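To make the second and third failure classes concrete, here is a minimal sketch of a failure path that keeps its context. The function, the `UpstreamError` type, and the field names are hypothetical, chosen only for illustration; the logging calls are from Python's standard library.

```python
import logging

logger = logging.getLogger("checkout")


class UpstreamError(Exception):
    """Hypothetical error raised by a downstream dependency."""


def charge(request_id: str, user_id: str, amount_cents: int, gateway) -> bool:
    """Call a payment gateway, logging failures with diagnosable context."""
    try:
        gateway(amount_cents)
        return True
    except UpstreamError:
        # logger.exception records the traceback; `extra` attaches the
        # identifiers that tie this failure to one request and one user.
        logger.exception(
            "charge failed",
            extra={
                "request_id": request_id,
                "user_id": user_id,
                "amount_cents": amount_cents,
            },
        )
        return False
```

A catch block that returned `False` without the `logger.exception` call would behave identically in tests and leave nothing to diagnose in production.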
Step back and the structural problem gets bigger. Your entire monitoring stack was designed by humans, for humans, to watch code written by humans. Dashboards encode what someone anticipated might break. Alert thresholds encode known failure modes. Runbooks encode incidents you’ve already had. The whole model assumes failure modes are knowable in advance. For human-written code, they mostly were, because the person who wrote the code could also imagine how it would fail.
AI-generated code breaks that assumption. It’s non-deterministic in a way human code never was: the same prompt produces different implementations, different dependencies, different failure surfaces, on every run. Its failures are unknown unknowns, modes nobody anticipated because nobody made the decisions. There is no dashboard for them, no alert threshold, no runbook. No human ever conceived of them. A monitoring model built on “watch for the things we know can go wrong” cannot keep up with a code supply that invents new ways of going wrong daily.
And the loop has a third break, unique to the agentic workflow: the author never sees what happened. A human developer who ships a bug eventually feels it in the incident channel, the postmortem, the fix. An agent writes the patch, the PR merges, and the agent’s context ends. Whatever the code did at runtime never flows back to the thing that wrote it. Every session starts from zero runtime knowledge. That’s why fixes take two and three redeploy cycles: the agent is guessing at production behavior it has never been shown.
This is the runtime feedback gap. Closing it is a four-step problem, and the rest of this guide walks through the steps.
What can’t your runtime platforms tell you?
The reflexive answer is “we have logs.” You do. Every platform in a modern deployment chain captures logs faithfully. The gap is not collection. It’s that every one of these platforms solved storage and search, and the question the AI SDLC asks is time-to-answer: what’s new since that deploy landed, what’s the pattern, what caused it. That question isn’t answerable in any platform console. The specifics differ; the structural limit is the same.
Kubernetes
Replicas multiply sources, sidecars multiply streams per pod, auto-instrumentation multiplies volume per stream. A ten-service cluster can produce 40M+ log lines a day, of which two are the incident. kubectl logs -f with a label selector follows at most five streams concurrently by default (the --max-log-requests limit). And the cause often isn’t in application logs at all: CrashLoopBackOff, OOMKilled, and ImagePullBackOff live in K8s API events, a second surface you correlate with pod logs by hand.
AWS serverless
One request crosses API Gateway, several Lambdas, a queue, and a Fargate task, and the evidence lands in a dozen log groups, each split into a stream per concurrent instance. CloudWatch Insights caps a query at 50 log groups and can’t cross accounts or regions. X-Ray’s default rule samples the first request each second and roughly 5% of the rest, so the failed request your customer hit almost certainly has no trace. Teams report 4 to 8 hours of manual correlation for incidents an AI-led analysis resolves in 15 to 30 minutes.
AWS & CloudWatch log analysis →
OpenTelemetry backends
OTel is the right standard, and your backend (Loki, Datadog, Elastic, wherever OTLP lands) stores everything. But these systems were built for search, aggregation, and retention at human deploy cadence. They answer “error rate over 30 days” well. They don’t answer “what pattern showed up in the last hour that wasn’t there before the deploy,” because they have no notion of what new looks like.
OpenTelemetry log analysis →
GitHub Actions
Green checks are the moment trust gets granted, and the moment it’s least deserved. CI validates the code against the tests the code came with. The regression shows up as an error-rate shift in production twenty minutes after the workflow_run completes, and nothing in Actions connects the two. The deploy event is the single most valuable correlation key in the AI SDLC, and it’s stranded in a different system from the runtime evidence.
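As a rough sketch of what using the deploy event as a correlation key means: given a deploy timestamp (here passed in directly; in practice it would come from your CI system) and a stream of timestamped events, compare the error rate in equal windows on either side of the deploy. The event shape and the 20-minute default are assumptions for illustration.

```python
from datetime import datetime, timedelta


def error_rate(events, start, end):
    """Fraction of events in [start, end) that are errors; None if no traffic."""
    window = [is_error for ts, is_error in events if start <= ts < end]
    if not window:
        return None
    return sum(window) / len(window)


def shift_after_deploy(events, deployed_at, window=timedelta(minutes=20)):
    """Error rate in equal windows before and after a deploy timestamp.

    `events` is an iterable of (timestamp, is_error) pairs.
    """
    before = error_rate(events, deployed_at - window, deployed_at)
    after = error_rate(events, deployed_at, deployed_at + window)
    return before, after
```

A shift that begins at the deploy boundary is evidence worth citing, not proof of cause; it still has to be checked against what else changed in that window.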
Vercel
Deploying is invisible, which makes failure invisible too. Init-phase crashes (a dead DB pool, a bad env var) kill the function before your handler, or any of your console.log statements, ever runs. And the evidence expires: runtime logs are retained for one hour on Hobby and one day on Pro. If the failure happened overnight, the logs are gone before you look.
Supabase
A single user-facing failure can span an RLS policy denial in Postgres, an Edge Function error, and a Storage policy rejection: three services, three separate log surfaces, correlated by hand in the dashboard. RLS failures are especially hostile to AI-generated code, because the agent that wrote the query usually has no visibility into the policy that rejected it.
Supabase log analysis →
Railway
The deploy loop is fast enough that runtime is where you find out. When an agent ships several times a day, “scroll the service logs and see what looks different” doesn’t scale past the second deploy.
Railway log analysis →
The pattern
Seven platforms, one structural gap: each captures its own slice of runtime truth, none can correlate across the chain, and none can answer the only question that matters after an agent ships: what changed?
Why not just feed raw logs into an LLM?
The obvious 2026 move is to point an LLM at the problem. Logs are text; language models read text; connect them. A category of AI observability tools does exactly this, piping raw log streams into a model and letting it reason.
It fails on three axes, predictably:
Token economics. A modestly sized cluster emits tens of millions of log lines a day. Feeding raw streams to a model means token spend scales with log volume, not with incident volume. The on-call tax goes up whether or not anything is broken.
No baseline. A model reading a raw stream has no idea what this service normally looks like. Every retry loop looks anomalous. Every DEBUG storm looks like an incident. Without per-service baselines, the model can describe what’s in the window you showed it. It cannot tell you what’s abnormal.
Ungrounded answers. Ask a model “what’s wrong with my cluster” over a raw sample and you get plausible prose: “maybe Kubernetes networking?” No citation, no specific evidence, no way to distinguish diagnosis from hallucination. An RCA you can’t verify is worse than no RCA, because it sends the on-call engineer down a confident wrong path.
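The token-economics point is simple arithmetic. In the sketch below, the 40M lines per day is the cluster figure used earlier in this guide; the 30 tokens per line is an assumption, and your own average will differ.

```python
def raw_tokens_per_day(lines_per_day: int, tokens_per_line: int) -> int:
    """Tokens consumed per day if every raw log line is sent to a model."""
    return lines_per_day * tokens_per_line


LINES_PER_DAY = 40_000_000  # cluster figure used earlier in this guide
TOKENS_PER_LINE = 30        # assumption: average tokens in one log line

quiet_day = raw_tokens_per_day(LINES_PER_DAY, TOKENS_PER_LINE)
incident_day = raw_tokens_per_day(LINES_PER_DAY, TOKENS_PER_LINE)

print(f"quiet day:    {quiet_day:,} tokens")
print(f"incident day: {incident_day:,} tokens")
```

Under these assumptions both days cost 1.2 billion tokens, because nothing in the formula depends on whether an incident occurred.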
The instinct is right. AI should be reading your runtime signal, because no human reads 40 million lines. The architecture is wrong. The fix is what happens before the AI reads anything.
What architecture does a runtime feedback loop need?
The architecture that works inverts the shotgun approach: raw logs never reach the AI layer. Whatever tooling you use, bought or built, the pipeline needs four stages, in order.
Distill
Repeated patterns collapse into clustered signals with counts. A 30,000-line retry storm becomes one row with a counter. Severity is baselined per service, so checkout-api’s normal is never confused with auth-service’s normal. Sentiment scoring, anomaly detection against those baselines, and dedup run continuously. Done right, more than 90% of raw log volume compresses into distilled signal, and it does so regardless of how well-instrumented the code was. That matters, because agent-written code gives you the logging it gives you. Distillation is also what surfaces the unknown unknowns: you don’t need to have predicted a failure mode to notice that a pattern exists today that didn’t exist yesterday.
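A minimal sketch of the two ideas in this stage, assuming a deliberately simple approach: mask the variable parts of each line so repeats collapse into one template with a count, then compare a count against that service's own history. The masking rules and the three-standard-deviation threshold are illustrative choices, not a description of any particular product's algorithm.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Mask the parts of a line that vary between otherwise identical messages.
# UUIDs are masked first because they contain digits.
_MASKS = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "<uuid>"),
    (re.compile(r"\d+"), "<num>"),
]


def template(line: str) -> str:
    """Reduce a raw log line to its repeating shape."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def distill(lines):
    """Collapse raw lines into a mapping of template -> count."""
    return Counter(template(line) for line in lines)


def is_anomalous(count, history, threshold=3.0):
    """True if `count` is far above this service's own past counts."""
    if len(history) < 2:
        return False  # not enough history to call anything abnormal
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return count > mu
    return (count - mu) / sigma > threshold
```

Keeping `history` per service is what prevents one service's normal volume from being read as another's incident.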
Enrich
Distilled patterns get joined with the context that makes them diagnosable: resource attributes preserved from OTLP, Kubernetes API events correlated with pod logs by deployment and namespace, deploy events from CI attached to the error patterns that followed them, pattern fingerprints linking the same failure across services and platforms. The bad-image-deploy that used to span two browser tabs becomes one incident. The Vercel function timing out because a Supabase RLS policy changed becomes one incident.
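One of these joins, attaching deploy events to the patterns that followed them, can be sketched as below. The `Deploy` and `Pattern` types and their fields are hypothetical; the sketch assumes each pattern carries a service name and a first-seen timestamp.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deploy:
    service: str
    at: datetime
    commit: str


@dataclass
class Pattern:
    service: str
    template: str
    first_seen: datetime
    deploy: Optional[Deploy] = None


def attach_deploys(patterns, deploys):
    """Attach to each pattern the latest deploy of the same service
    that happened at or before the pattern was first seen."""
    by_service = {}
    for d in sorted(deploys, key=lambda d: d.at):
        by_service.setdefault(d.service, []).append(d)
    for p in patterns:
        candidates = by_service.get(p.service, [])
        times = [d.at for d in candidates]
        i = bisect_right(times, p.first_seen)
        p.deploy = candidates[i - 1] if i else None
    return patterns
```

This only handles the same-service case. Linking a pattern in one service to a deploy in another needs the cross-service fingerprints described above.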
Explain
Only now does an AI layer read anything, and what it reads is compact, structured, and baselined. The agent clusters related anomalies into incidents, ranks them by impact, and writes a short diagnosis: what happened, which service started it, which events confirm it, and the first thing to try. Every claim must cite the specific patterns, events, and counters it was built from. An RCA without evidence is a guess with good grammar. Because the input is distilled signal rather than raw streams, token spend is bounded by incident volume, not log volume.
One distinction worth being precise about, because vendors routinely aren’t: what “the fix” means depends on where the failure lives. When the cause is infrastructure or configuration (a bad image tag, an exhausted connection pool, a misconfigured policy), the diagnosis can name the fix directly. When the cause is application code, the honest output is the cause with evidence, handed to the developer or the coding agent that has the repo. The runtime layer knows what the code did; the agent with repo access knows how to change it. Pretending otherwise is how you get confident wrong patches.
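The citation rule in this stage is mechanically checkable. A minimal sketch, assuming a diagnosis is a list of claims and each claim lists the IDs of the patterns or events it rests on (the `Claim` type and the ID format are hypothetical):

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    evidence_ids: list = field(default_factory=list)


def uncited_claims(claims, known_evidence):
    """Return claims that cite nothing, or cite evidence that does not exist."""
    bad = []
    for claim in claims:
        missing = [e for e in claim.evidence_ids if e not in known_evidence]
        if not claim.evidence_ids or missing:
            bad.append(claim)
    return bad
```

A diagnosis that returns any claims from this check can be rejected before it reaches the on-call engineer, which is a cheap guard against confident prose with nothing behind it.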
Remember
Every triage should feed a persistent memory: the distilled signal, the incident, the outcome, the fix that worked. The next time a related pattern appears, the past incident and prior fix surface alongside the new diagnosis, so the second occurrence of a failure class is resolved in minutes, not re-investigated from scratch. Without this stage, every investigation starts from zero, exactly like the agents that caused the problem.
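The simplest form of this memory is a lookup keyed by a stable fingerprint of the distilled pattern. The sketch below is in-memory and exact-match only, both simplifying assumptions: a real store would persist, and would need some similarity matching to surface related rather than identical patterns. The class and method names are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(template: str) -> str:
    """Stable ID for a distilled pattern, so recurrences map to the same key."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


class IncidentMemory:
    """In-memory record of past incidents, keyed by pattern fingerprint."""

    def __init__(self):
        self._by_fingerprint = defaultdict(list)

    def remember(self, template: str, incident: str, fix: str) -> None:
        """Record the outcome of a triage for this pattern."""
        self._by_fingerprint[fingerprint(template)].append(
            {"incident": incident, "fix": fix}
        )

    def recall(self, template: str) -> list:
        """Return prior incidents and fixes for this exact pattern, oldest first."""
        return list(self._by_fingerprint.get(fingerprint(template), []))
```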
Dstl8: runtime feedback for AI-generated code
Dstl8 is ControlTheory’s runtime feedback platform for AI-generated code, built around exactly this pipeline. It connects your deployment chain (Kubernetes, AWS/CloudWatch, OpenTelemetry, GitHub, Vercel, Supabase, Railway) using open standards, with OTLP as a first-class input. Distillation and enrichment run continuously across every source. The explain stage is Möbius, Dstl8’s AI agent, powered by our own fine-tuned model built specifically for runtime signal. It detects, correlates, and diagnoses with cited evidence, proactively, before you open a dashboard. And every triage feeds a knowledge graph that compounds: Möbius surfaces past incidents and prior fixes on every new investigation, so the intelligence gets sharper with use.
How does runtime feedback reach the agent that wrote the code?
The pipeline produces trustworthy runtime signal. The last step is putting it where the code gets written, in your flow, not out of it, because a dashboard in a browser tab is a broken feedback loop. The developer reviewing an agent’s patch doesn’t context-switch to Grafana to check whether it’s misbehaving, and the agent certainly doesn’t.
Dstl8 exposes an MCP server, which means the runtime signal is queryable from inside Claude Code, Cursor, Codex, or any MCP-capable client. The workflow that follows is the one the trust numbers say is missing:
- The agent ships a change; Actions deploys it.
- From the same session: “what’s new in checkout-service since that rollout?”
- The answer comes back grounded in distilled production signal, with cited evidence, not a guess.
- If something regressed, the cause and evidence are already in the context window of the agent holding the repo. The fix PR starts from runtime truth instead of from zero.
Recall the verification number: 88% of teams need two to three redeploy cycles to confirm an AI fix, because each cycle is deploy-and-hope. A closed loop turns that into one informed cycle. The agent can see what its change actually did.
And the payoff isn’t limited to incidents. Runtime context informs the next prompt, the next test the agent writes, the next feature’s priority, while every resolved triage lands in the knowledge graph for the one after it. That is what “trusting what your agents ship” means operationally: not blind confidence in generation, but a standing verification loop between production and the author.
What does trusting AI-shipped code look like operationally?
Skip the maturity model. Trust in AI-shipped code reduces to four questions you can answer yes or no today:
Four yeses and the trust wall is a solved problem. Not because the code got better, but because verification got cheap.
How do you start closing the loop?
The loop can be adopted incrementally, matched to how committed you are today:
Ready to close the loop?
Dstl8 comes with a free 14-day trial, no credit card required. One line, brew install control-theory/dstl8/dstl8 && dstl8 setup, and guided onboarding handles the account, the MCP install into your AI coding clients, and your sources (Kubernetes, CloudWatch, OTLP, GitHub, Vercel, Supabase, Railway). Setup takes about five minutes, after which Möbius is reading your runtime.
Want to start smaller?
Gonzo is our open-source log analysis TUI: pattern detection, real-time charts, AI-powered insights in your terminal. No account, no config, brew install gonzo, two minutes to first signal. 2,600+ GitHub stars, MIT-licensed. It’s the same distillation thinking, terminal-native, and when you want the full loop, Dstl8 is one install away.
Want to talk it through?
Request a demo and we’ll walk your actual deployment chain.
Platform deep dives: Kubernetes · AWS · OpenTelemetry · GitHub Actions · Vercel · Supabase · Railway
Common questions about runtime feedback for AI-generated code
Does a runtime feedback loop replace code review or testing?
No. Pre-deploy controls catch syntax defects, known vulnerability classes, and anticipated regressions, and you should keep them. A runtime feedback loop is the post-deploy control they can’t provide: it verifies how code actually behaves in production and carries that evidence back to the author. The 2026 data shows 43% of AI code changes need production debugging after passing every pre-deploy gate, which is exactly the gap this closes.
How is Dstl8 different from Datadog, Loki, or my existing observability stack?
Your existing backend solved storage and search: ingest everything, retain it, make it queryable. Dstl8 solves time-to-answer: it distills logs into baselined signal, correlates across your whole deployment chain, and diagnoses incidents with cited evidence, then delivers that context into the editor where code gets written. It sits alongside your current stack rather than replacing it. Nothing gets ripped out.
What is Dstl8?
Dstl8 is ControlTheory’s runtime feedback platform for AI-generated code. It connects runtime platforms including Kubernetes, AWS/CloudWatch, OpenTelemetry, GitHub, Vercel, Supabase, and Railway, distills their logs into signal, and runs continuous AI-led root cause analysis. Results reach developers and coding agents through the dashboard, Slack, and an MCP server for Claude Code, Cursor, and Codex.
What is Möbius?
Möbius is the AI agent inside Dstl8, powered by our own fine-tuned model built specifically for runtime signal. It reads distilled signal rather than raw logs, detects emerging incidents proactively, ranks them by impact, and writes root cause narratives where every claim cites the log patterns and events behind it.
Do I need OpenTelemetry to use Dstl8?
No. Dstl8 has direct integrations for Kubernetes, CloudWatch, GitHub, Vercel, Supabase, and Railway that work without any OTel setup. If you already run OpenTelemetry, OTLP is a first-class input: add Dstl8 as one more exporter and keep your existing pipeline untouched.
How long does it take to see this working?
Minutes. Run brew install control-theory/dstl8/dstl8 && dstl8 setup and the guided onboarding walks you through the account, MCP install into your AI coding clients, and your first sources. The trial is 14 days, free, with no credit card required.
- Lightrun, 2026 State of AI-Powered Engineering Report (survey of 200 senior SRE/DevOps leaders, US/UK/EU), as reported by VentureBeat, April 2026.
- Flux, study of 309 engineering leaders and practitioners, as reported by LeadDev, June 2026.
- CloudBees, 2026 State of Code Abundance Report (200+ enterprise technology leaders), May 2026.
- New Relic, State of AI Coding 2026, June 2026.
- Public statements by Microsoft and Google leadership on AI-generated code share, 2025 to 2026.














