Guide · Agentic Engineering

The Agentic Engineer’s Guide to Runtime Feedback

Giving Claude Code, Cursor, Codex, and Copilot eyes on production

By Bob Quillin, founder and CEO, ControlTheory · Last updated: July 3, 2026 · 12 minute read

The short answer: your coding agent is blind to the one environment that decides whether its work was correct. Claude Code, Cursor, Codex, and Copilot all close the loop from prompt to commit; none of them can see what the code did after deploy. This guide shows how to close that gap in practice: what runtime context an agent actually needs, how to wire it into each tool through MCP, and the workflow that turns “ship and hope” into “ship and verify.”

Companion guide: How to Trust What Your Agents Ship makes the leadership case and the architecture argument. This guide assumes you’re sold and shows you the wiring.

The blind spot

Why can’t your coding agent see production?

Every agentic coding tool ends its work the same way: a summary, a green test suite, a commit. “Implementation complete.” And every one of them declares that state from inside a context that contains fixtures, docs, and the repo, but not one byte of production behavior.

Four blind spots come standard with the workflow, regardless of which tool you run:

“Done” is a claim, not a runtime fact. Tests pass on the fixtures the agent had in context. Real traffic shapes, real third-party API responses, and the RLS policies that only exist in production are not in that context. The session ends clean; the webhook starts throwing TypeErrors forty minutes later.

The iteration history is invisible. An agent doesn’t write code once. It tries, reads output, edits, retries, and commits the final state. The assumption made in attempt three that carried into attempt eight is nowhere in the Git history. When production breaks, the reasoning that caused the break doesn’t exist anywhere you can read it.

Context ends at commit. The session that wrote the code is gone by the time the code misbehaves. Whatever production reveals never flows back to the author. The next session starts from zero runtime knowledge, which is why verifying an AI fix takes teams two to three redeploy cycles on average (see the trust-wall data in the companion guide).

Nobody decided what got logged. The instrumentation in agent-written code is whatever the model emitted: happy paths logged beautifully, failure paths swallowed, errors fired without the request context that would make them diagnosable. You will be debugging with the logs you got, not the logs you would have written.
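To make the last point concrete, here is a minimal sketch of a failure path that logs enough context to be diagnosable. The handler name, the payload fields, and the `request_id` parameter are all hypothetical, chosen only for illustration; the point is that the error branch records which request failed and what shape the payload had.

```python
import json
import logging

logger = logging.getLogger("webhook")


def handle_webhook(payload: dict, request_id: str) -> bool:
    """Process one (hypothetical) webhook payload and report success."""
    try:
        amount_cents = int(payload["amount"]) * 100
        logger.info("webhook processed request_id=%s amount_cents=%d",
                    request_id, amount_cents)
        return True
    except (KeyError, TypeError, ValueError) as exc:
        # Failure path: log which request failed, the exception type, and the
        # payload's keys, instead of swallowing the error or logging a bare message.
        context = {
            "request_id": request_id,
            "error": type(exc).__name__,
            "payload_keys": sorted(payload),
        }
        logger.error("webhook failed %s", json.dumps(context))
        return False
```

A log line produced by the error branch carries the request ID and the payload keys, which is usually the minimum needed to reproduce the failure from logs alone.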

None of this is a bug in the tools. It’s the shape of the workflow. The fix isn’t a better agent; it’s giving the agent a feed from the environment it can’t see.

The spectrum

Which AI coding tools have the biggest runtime blind spot?

The major tools sit at different points on an autonomy spectrum, and the runtime blind spot scales with it.

Claude Code · Autonomous sessions

Runs autonomous sessions: it decides the implementation, edits dozens of files, spawns subagents with their own contexts, runs tests, commits, and reports done. One session can touch 47 files through hundreds of intermediate edits, and what lands is one diff shaped by forty decisions nobody reviewed. Reviewing that diff the way you would review a colleague’s pull request is already outside what the workflow was designed for.

Claude Code runtime reality →

Cursor · In-editor agent

Generates whole features in-editor with agent mode. The developer watches it work, which feels like oversight. But the code’s runtime behavior isn’t in the editor, so the confidence you feel while watching generation is confidence about the wrong thing.

Cursor: run with confidence →

Codex · Sandboxed PRs

Ships PRs from ephemeral sandboxes. The sandbox has your repo and its tests; it has never seen your production traffic, your data shapes, or your infrastructure. Every PR is a hypothesis formed in a clean room.

Codex: from PR to runtime →

GitHub Copilot · Prompt to PR

Moves code from prompt to pull request: autocomplete, Copilot code review, Spark. The human is still in the loop per suggestion, so each individual blind spot is small. But the volume is enormous, and Copilot-reviewed code carries an implicit “reviewed” stamp that production hasn’t countersigned.

GitHub Copilot debugging →

And the roster keeps widening. Agents now pick up work assigned in Linear, run inside CI pipelines, and ship from platforms that didn’t exist a year ago. Every new surface inherits the same blind spot, because the blind spot isn’t a property of any one tool. It’s the gap between where code gets written and where code gets judged.

The pattern across all of them: as the tools take on more of the loop from prompt to commit, the unreviewed decision surface grows, and the runtime is the only reviewer left with full coverage. Which is a problem, because it’s the one reviewer none of these tools can hear.

The requirements

What runtime context does an AI coding agent actually need?

The naive fix is to give the agent raw log access: pipe kubectl logs or a CloudWatch tail into the context window. This fails for the same reasons it fails for humans, plus one. A production service emits thousands of lines a minute; the agent’s context fills with retry noise; token spend scales with log volume; and the agent, having no baseline for what this service normally looks like, treats every DEBUG storm as a lead. You’ve given it a firehose and called it feedback.

What an agent needs is the same thing a senior engineer walking into an incident needs, and it isn’t more data:

Distilled signal, not raw streams. Patterns clustered with counts, severity baselined per service, anomalies flagged against what normal looks like. Forty thousand lines collapse into the twelve rows that matter.
Deploy correlation. The single most useful fact in any investigation is “this pattern appeared after that rollout.” The agent needs error patterns joined to the deploy events that preceded them, across every platform in the chain.
Cited evidence. An answer the agent can’t verify is a hallucination risk imported into your codebase. Every diagnosis needs to link the specific log patterns and events behind it, so the agent (and you) can check the claim before writing the fix.
Memory. If this failure class happened before, the prior incident and the fix that worked should arrive with the diagnosis. Otherwise every session re-derives what the last one learned.
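The first requirement, clustering raw lines into counted patterns, can be sketched in a few lines. This is an illustration of the general idea under simple assumptions, not how Dstl8 implements it: the masking rules and the function names `to_pattern` and `distill` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace the parts of a line that vary per request.
# UUIDs are masked first so the number rule does not split them apart.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
                re.IGNORECASE), "<uuid>"),
    (re.compile(r"\b\d+(\.\d+)?\b"), "<num>"),
]


def to_pattern(line: str) -> str:
    """Collapse one log line into its pattern by masking variable tokens."""
    for regex, token in _MASKS:
        line = regex.sub(token, line)
    return line


def distill(lines, top=12):
    """Return the `top` most frequent patterns with their counts."""
    counts = Counter(to_pattern(line.strip()) for line in lines if line.strip())
    return counts.most_common(top)


raw = [
    "ERROR timeout after 30 ms for order 1841",
    "ERROR timeout after 45 ms for order 77",
    "INFO started",
]
print(distill(raw))
```

The two timeout lines collapse into one pattern with a count of two, which is the shape of answer an agent can reason about without reading every line.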

This is the distill, enrich, explain, remember pipeline covered in depth in the companion guide. Dstl8 implements it end to end, with Möbius as the agent doing the detection and diagnosis, and exposes the result to your coding agent through an MCP server. The coding agent asks questions in natural language, and the answers come back grounded in distilled production signal with evidence attached.

The wiring

How do you connect Claude Code, Cursor, Codex, and Copilot to runtime feedback?

One line covers the common path:

brew install control-theory/dstl8/dstl8 && dstl8 setup

dstl8 setup is guided onboarding: account, MCP install into the AI coding clients it detects on your machine, sources, and the dashboard. If you’d rather drive it manually, the pieces are individual commands (dstl8 signup, dstl8 sources add kubernetes, dstl8 install), but setup gets most people from zero to a connected agent in one sitting.

Then the per-tool specifics:

Claude Code gets two native entry points. The MCP server (installed by dstl8 setup, or dstl8 install for a specific client) gives it direct runtime queries: “what’s new in checkout-service since that rollout.” The Dstl8 Skill (npx skills add control-theory/dstl8-skill) goes further and teaches Claude Code the orchestration: when to check runtime after a deploy, how to interrogate an incident, how to pull cited evidence into a fix. With both installed, verification becomes part of how the agent works instead of a step you remember to prompt.

Cursor connects through the same MCP server, picked up automatically by dstl8 setup. The practical change is that “is this actually working in prod” becomes a question you ask in the same pane where the code was generated, and the answer arrives with evidence instead of a dashboard link.

Codex is where deploy correlation earns its keep. Because Codex works PR-by-PR from a sandbox, the highest-value query is the one that connects a production regression back to the PR that shipped it. Dstl8 detects the regression, correlates it to the deploy, and exposes the answer over MCP, so the follow-up PR starts from the runtime evidence the sandbox never had.
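The core of deploy correlation is a time join: attribute each error pattern to the most recent deploy that preceded its first appearance. The sketch below shows only that join, under the assumption that you already have first-seen timestamps per pattern and a list of deploy events. The `Deploy` type, its `ref` field, and the sample identifiers are hypothetical, and temporal proximity is a lead to check, not proof of cause.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Deploy:
    at: datetime
    ref: str  # hypothetical: a PR or commit identifier


def correlate(first_seen: Dict[str, datetime],
              deploys: List[Deploy]) -> Dict[str, Optional[str]]:
    """Map each pattern to the latest deploy at or before its first appearance."""
    ordered = sorted(deploys, key=lambda d: d.at)
    times = [d.at for d in ordered]
    result: Dict[str, Optional[str]] = {}
    for pattern, seen_at in first_seen.items():
        # Index of the first deploy strictly after seen_at; the one before it
        # (if any) is the most recent deploy that could have introduced the pattern.
        i = bisect_right(times, seen_at)
        result[pattern] = ordered[i - 1].ref if i else None
    return result


deploys = [
    Deploy(datetime(2026, 7, 3, 10, 0), "pr-1"),
    Deploy(datetime(2026, 7, 3, 14, 14), "pr-2"),
]
first_seen = {
    "TypeError in webhook": datetime(2026, 7, 3, 14, 20),
    "retry storm": datetime(2026, 7, 3, 12, 0),
    "cold start warning": datetime(2026, 7, 3, 9, 0),
}
print(correlate(first_seen, deploys))
```

A pattern that predates every known deploy maps to `None`, which is itself useful: it tells the agent the regression was not introduced by any of the changes under review.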

Copilot doesn’t run MCP-driven sessions the way the other three do, so the loop closes at the repo level instead: Dstl8’s GitHub source correlates production error patterns to the merges that preceded them, and Möbius findings land where your team already works, in Slack and in the dashboard. The developer picking up the issue brings the runtime evidence into their next Copilot-assisted session by hand. Less automated, same principle.

Whichever tool you run, the humans stay in the loop for free: the same distilled signal the agent queries is what your team sees in the Dstl8 dashboard and Slack alerts, so nobody is trusting an agent’s private view of production. More on how in-context feedback loops work across Claude Code, Cursor, and Codex: equip agents and developers with context.

The habit

What does a verified-ship workflow look like?

Wiring is necessary but not sufficient. The habit that changes outcomes is small and specific: every deploy gets a runtime question before the session ends.

  1. The agent ships a change; CI deploys it.
  2. Same session, one prompt: “Check Dstl8: what’s new in this service since that deploy?”
  3. The answer comes back distilled and cited. Usually it’s “nothing anomalous,” and now that’s a verified fact instead of a hope.
  4. When something did regress, the cause and evidence are already in the agent’s context, and the fix PR starts from runtime truth instead of from a guess.
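The question in step 2 reduces to a comparison between what the service looked like before the deploy and what it looks like after. Here is a minimal sketch of that comparison, assuming pattern counts taken over comparable time windows. The function names and the `spike_ratio` threshold are hypothetical; a real baseline would account for traffic volume and time of day.

```python
from typing import Dict, List


def new_since_deploy(baseline: Dict[str, int],
                     after: Dict[str, int],
                     spike_ratio: float = 3.0) -> List[str]:
    """List patterns that are new, or far more frequent, after a deploy."""
    findings = []
    for pattern, count in sorted(after.items()):
        before = baseline.get(pattern, 0)
        if before == 0:
            # Never seen in the baseline window: always worth a look.
            findings.append(f"new: {pattern} (x{count})")
        elif count >= spike_ratio * before:
            findings.append(f"spike: {pattern} ({before} -> {count})")
    return findings


def verdict(findings: List[str]) -> str:
    """Turn the findings into the one-line answer the session needs."""
    return "nothing anomalous" if not findings else "investigate before declaring done"


baseline = {"timeout": 10, "retry": 5}
after = {"timeout": 12, "retry": 20, "TypeError in webhook": 4}
findings = new_since_deploy(baseline, after)
print(findings, verdict(findings))
```

The design choice worth noting is that an empty result is an explicit answer rather than an absence of one, which is what lets the session end on a verified fact.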

A few prompts that earn a place in your muscle memory (or your CLAUDE.md):

Any new error patterns in the last hour that weren’t in yesterday’s baseline?
What was the root cause of incident 1423, and what’s the cited evidence?
Did the error rate change for checkout-service after the 2:14pm deploy?
Have we seen this failure class before, and what fixed it last time?

Teams that adopt this habit collapse the verify-a-fix cycle from the industry-average two to three redeploys into one informed pass, because the agent is no longer guessing at production behavior it has never been shown. And because every triage feeds Dstl8’s knowledge graph, the fourth week of this workflow is measurably faster than the first: past incidents and prior fixes arrive with each new diagnosis. Over time the same runtime context does more than fix incidents; it informs test coverage and feature priority, making AI-generated code better with every cycle.

Getting started

How do you get started?

Ready to wire it?

Dstl8 offers a free 14-day trial, no credit card required. One line, brew install control-theory/dstl8/dstl8 && dstl8 setup, and your agents have eyes on production in about five minutes.


Want to start smaller?

Gonzo is our open-source log analysis TUI: pattern detection and AI-powered insights in your terminal, no account required. Install it with brew install gonzo. It has 2,600+ GitHub stars and is MIT-licensed.

Making the case to your team?

Send your engineering leadership the companion guide, How to Trust What Your Agents Ship. It carries the 2026 survey data and the architecture argument this guide builds on.

Tool deep dives: Claude Code · Codex · Cursor · Copilot · Ship AI code without runtime rabbit holes

FAQ

Common questions from agentic engineers

What is agentic engineering?

Agentic engineering is the practice of developing software with coding agents like Claude Code, Cursor, and Codex writing and executing code under human direction: the developer defines goals and constraints, the agents implement, and the human oversees and validates the output. The term was coined by Andrej Karpathy in early 2026 as the professional successor to vibe coding. Runtime feedback is the validation half of that definition: production evidence the agent and the engineer can actually see.

How do I connect Claude Code to my production logs?

Run brew install control-theory/dstl8/dstl8 && dstl8 setup. Guided onboarding creates your account, installs the Dstl8 MCP server into Claude Code and any other MCP-capable clients it detects, and connects your log sources. Optionally add the Dstl8 Skill with npx skills add control-theory/dstl8-skill so Claude Code knows when and how to check runtime on its own.

Won’t giving an agent access to logs flood its context window?

Only if you give it raw logs. Dstl8’s MCP server answers from distilled signal: patterns clustered with counts, baselined per service, with anomalies flagged. The agent gets the twelve rows that matter instead of forty thousand lines, so token spend is bounded by incident volume rather than log volume.

Which AI coding tools does Dstl8 work with?

Claude Code, Cursor, and Codex connect directly through the MCP server, along with any other MCP-capable client. Claude Code additionally supports the Dstl8 Skill for deeper orchestration. GitHub Copilot workflows close the loop at the repo level: Dstl8’s GitHub source correlates production regressions to merges, and findings reach the team through Slack and the dashboard.

What should I add to my CLAUDE.md for runtime verification?

A standing instruction that after any deploy, the agent checks Dstl8 for new patterns in the affected service before declaring the task complete, and that any fix for a production incident must reference the cited evidence from the diagnosis. The four prompts in the workflow section above are a good starting set.

Does my whole team need to use the same coding agent for this to work?

No. The runtime signal is shared: the same distilled incidents and diagnoses are available over MCP to every connected agent, in the Dstl8 dashboard, and in Slack. Engineers on different tools, and engineers not using agents at all, work from the same evidence.