When Agents Fail, Debug the Trajectory
Agent reliability improves when teams inspect the execution trace, locate the first critical failure, and fix the system around it.
When an AI agent fails, the first instinct is usually to edit the prompt.
Add a stricter instruction. Add a reminder. Add another sentence saying "do not hallucinate" or "think step by step" or "use the tool correctly." Sometimes that helps. Often it just moves the failure somewhere else.
For production agents, prompt tuning is not enough because the failure rarely lives in the final answer alone. It lives in the trajectory: the sequence of plans, tool calls, observations, interpretations, retries, skipped steps, and final synthesis that produced the answer.
If the agent is a process, the trace is the thing to debug.
The Failure Is Usually Upstream
Consider a customer-support agent that gives the wrong refund decision. The final answer says the user is eligible. That answer is wrong, but the answer itself may not be where the failure originated.
The agent might have:
- misunderstood the user's intent,
- planned the wrong workflow,
- called the right tool with the wrong argument,
- ignored a policy returned by the tool,
- invented a missing fact,
- retried after a system error and lost context,
- or executed an extra action that was never part of the plan.
Each of these failures needs a different fix. A better instruction may help if the model ignored a required step. It will not help much if the tool schema is ambiguous, the policy is missing from context, or the agent has no way to distinguish stale data from fresh data.
This is the core lesson from Microsoft's AgentRx work: agent failures need systematic diagnosis from execution trajectories. AgentRx analyzes failed traces, synthesizes constraints from tool schemas and domain policies, checks the trajectory step by step, and predicts the first critical failure with a root-cause category.
That framing is practical. A failed agent run is not just "bad output." It is an incident log.
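To make that concrete, here is a minimal sketch of what step-by-step trajectory checking can look like. The names (`TraceStep`, `Diagnosis`, `locate_first_critical_failure`) are hypothetical, not the AgentRx API; the point is the shape of the loop: walk the trace in order, test each step against constraints, and stop at the first violation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TraceStep:
    index: int     # position in the trajectory
    kind: str      # "plan", "tool_call", "tool_result", "answer", ...
    payload: dict  # arguments, output, or text for this step

@dataclass
class Diagnosis:
    step_index: int  # first step where recovery became unlikely
    category: str    # root-cause label, e.g. "invalid_invocation"
    detail: str

# A constraint inspects one step plus its prior context and returns a
# root-cause category string if the step violates it, otherwise None.
Constraint = Callable[[TraceStep, list[TraceStep]], Optional[str]]

def locate_first_critical_failure(
    trace: list[TraceStep], constraints: list[Constraint]
) -> Optional[Diagnosis]:
    """Walk the trajectory in order; stop at the first violated constraint."""
    for i, step in enumerate(trace):
        for check in constraints:
            category = check(step, trace[:i])
            if category is not None:
                return Diagnosis(step.index, category,
                                 f"constraint violated at step {i}")
    return None  # no critical failure found by these constraints
```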
A Useful Failure Taxonomy
One valuable part of AgentRx is the failure taxonomy. The Microsoft Research team describes categories such as plan adherence failure, invention of new information, invalid invocation, misinterpretation of tool output, intent-plan misalignment, underspecified user intent, unsupported intent, guardrails triggered, and system failure.
That taxonomy matters because it keeps teams from treating every defect as a model-quality problem.
An invalid invocation points toward schema design, validation, examples, or tool selection. A misinterpretation of tool output points toward output formatting, semantic clarity, or post-tool checks. Invention of new information points toward grounding and evidence discipline. Intent-plan misalignment may require better task classification or a clarification step. System failure may not be an AI problem at all.
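One way to make the taxonomy operational is to encode it directly, so every incident gets a label before it gets a patch. A minimal sketch, assuming a Python harness: the enum values mirror the categories above, and the remediation mapping is illustrative, not exhaustive.

```python
from enum import Enum

class FailureCategory(Enum):
    # Categories borrowed from the AgentRx taxonomy described above.
    PLAN_ADHERENCE = "plan_adherence_failure"
    INVENTED_INFORMATION = "invention_of_new_information"
    INVALID_INVOCATION = "invalid_invocation"
    MISINTERPRETED_TOOL_OUTPUT = "misinterpretation_of_tool_output"
    INTENT_PLAN_MISALIGNMENT = "intent_plan_misalignment"
    UNDERSPECIFIED_INTENT = "underspecified_user_intent"
    UNSUPPORTED_INTENT = "unsupported_intent"
    GUARDRAILS_TRIGGERED = "guardrails_triggered"
    SYSTEM_FAILURE = "system_failure"

# Each category points at the surface most likely to need the fix,
# which is the point of naming the failure before patching it.
REMEDIATION_SURFACE = {
    FailureCategory.INVALID_INVOCATION: "schema design, validation, tool selection",
    FailureCategory.MISINTERPRETED_TOOL_OUTPUT: "output format, post-tool checks",
    FailureCategory.INVENTED_INFORMATION: "grounding and evidence discipline",
    FailureCategory.INTENT_PLAN_MISALIGNMENT: "task classification, clarification step",
    FailureCategory.SYSTEM_FAILURE: "infrastructure, not the model",
}
```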
The habit is simple: name the kind of failure before choosing the fix.
Without a taxonomy, teams drift into superstition. They keep adding prompt clauses because prompt clauses are easy to add. With a taxonomy, teams can ask a sharper question: what was the first unrecoverable step, and what system boundary allowed it?
What a Trace Should Capture
If the trajectory is the debugging surface, the harness needs to capture enough detail to reconstruct the run.
Useful traces include:
- the user request and any clarifications,
- the model's plan or intermediate intent,
- every tool call with arguments,
- every tool result with timestamps and status,
- relevant policy or domain constraints,
- retrieved documents or evidence identifiers,
- memory reads and writes,
- guardrail decisions,
- retries, timeouts, and fallbacks,
- the final answer and cited evidence.
This does not mean dumping everything into a giant prompt or log blob. It means designing the harness so each step is structured, inspectable, and linked to the decision it influenced.
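As a sketch of what "structured and inspectable" can mean in practice, here is one possible event shape, assuming a Python harness and append-only JSONL storage; the field names are hypothetical.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class TraceEvent:
    """One structured, inspectable step in an agent run."""
    run_id: str
    kind: str                        # "user_request", "plan", "tool_call",
                                     # "tool_result", "guardrail", "retry", "answer"
    payload: dict                    # arguments, results, or text for this step
    parent_id: Optional[str] = None  # links the event to the decision it influenced
    status: str = "ok"               # "ok", "error", "timeout", "blocked"
    ts: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def emit(event: TraceEvent, path: str = "trace.jsonl") -> None:
    # Append-only JSONL keeps each step replayable, diffable, and greppable.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")
```

The `parent_id` link is the detail that earns its keep: it is what lets a reviewer trace a bad final answer back to the specific observation or plan step that produced it.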
OpenAI's Agents SDK direction is relevant here because a richer harness naturally produces richer traces. Filesystem tools, shell execution, sandbox orchestration, MCP tool use, skills, and memory are all operational events. If those events are invisible, agent reliability becomes impossible to reason about. If those events are structured, they become the raw material for evals, replay, incident review, and regression tests.
The trace also gives humans a way to supervise without micromanaging. A reviewer does not need to watch every token. They need to see what the agent believed, what it did, what evidence it used, and where the first bad turn happened.
From Prompt Tuning to Agent Engineering
Trajectory debugging changes the engineering workflow.
Instead of asking "what prompt would have made this answer better?", the team asks:
- Did the agent understand the user's goal?
- Did it choose the right plan?
- Did each tool call match the schema and policy?
- Did the agent correctly interpret tool output?
- Did it know when to ask for missing information?
- Did it stop at the right time?
- Did it cite or preserve the evidence that supported the final answer?
This is closer to debugging a distributed system than editing copy. The model is one component, but the behavior emerges from the loop.
For example, if an agent keeps calling a tool with malformed arguments, the fix may be stricter schema validation, better tool descriptions, fewer overlapping tools, or examples of valid calls. If it misreads output, the fix may be structured JSON instead of prose. If it invents missing data, the fix may be an evidence gate that blocks final answers without source support. If it barrels ahead despite missing information, the fix may be an explicit clarification policy.
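An evidence gate of that kind can be small. A sketch, assuming the harness tracks which evidence identifiers the run actually retrieved; the function name and signature are illustrative:

```python
from typing import Optional

def evidence_gate(evidence_ids: list[str],
                  retrieved_ids: set[str]) -> Optional[str]:
    """Return None if the final answer may ship, else a reason to block it.

    A blocked answer sends the loop back to retrieve, retry, or ask for
    clarification instead of finalizing.
    """
    if not evidence_ids:
        return "no supporting evidence cited"
    missing = [e for e in evidence_ids if e not in retrieved_ids]
    if missing:
        return f"cited evidence was never retrieved: {missing}"
    return None
```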
The prompt still matters. But it becomes one lever among many.
Evaluation Needs the Whole Run
Final-answer evals are useful for simple tasks. They are weak for agents.
A long-running agent can produce a correct final answer through an unsafe path. It can also produce a wrong final answer after doing most of the work correctly. If the eval sees only the final text, it cannot distinguish lucky success from robust execution, or a small synthesis error from catastrophic tool misuse.
Trajectory-aware evals can grade the run at multiple levels, as sketched after this list:
- Outcome: Did the user get the right result?
- Grounding: Was the result supported by evidence?
- Process: Did the agent follow required steps?
- Tool use: Were calls valid, minimal, and appropriate?
- Safety: Were risky actions gated or blocked?
- Recovery: Did the agent handle missing data and tool failures well?
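A sketch of such a grader, assuming trace events are stored as dicts with `kind`, `status`, and `payload` fields like the earlier schema; every check here is deliberately simple and would be replaced by domain-specific logic.

```python
from dataclasses import dataclass

@dataclass
class RunGrade:
    outcome: bool    # did the user get the right result?
    grounding: bool  # was the result supported by evidence?
    process: bool    # were required steps followed?
    tool_use: bool   # were calls valid and appropriate?
    safety: bool     # were risky actions gated?
    recovery: bool   # were errors and gaps handled?

def grade_run(trace: list[dict], expected_answer: str,
              required_steps: list[str]) -> RunGrade:
    kinds = [e["kind"] for e in trace]
    answers = [e for e in trace if e["kind"] == "answer"]
    final = answers[-1] if answers else None
    tool_errors = [e for e in trace if e["kind"] == "tool_result"
                   and e.get("status") == "error"]
    return RunGrade(
        outcome=bool(final) and expected_answer in final["payload"].get("text", ""),
        grounding=bool(final) and bool(final["payload"].get("evidence_ids")),
        process=all(step in kinds for step in required_steps),
        tool_use=all(e.get("status") == "ok" for e in trace
                     if e["kind"] == "tool_call"),
        safety=all(any(g["kind"] == "guardrail" for g in trace[:i])
                   for i, e in enumerate(trace)
                   if e["kind"] == "tool_call" and e["payload"].get("risky")),
        recovery=not tool_errors or "retry" in kinds,
    )
```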
This matters because production teams do not just need a high average score. They need to know which failures are unacceptable, which are recoverable, and which are symptoms of a deeper design problem.
AgentRx reports meaningful gains in failure localization and root-cause attribution over prompting baselines. The precise numbers will matter less over time than the habit it encourages: treat agent evaluation as trace analysis, not answer grading alone.
Engineering Takeaways
Capture structured traces from the beginning. Retrofitting observability after an agent starts failing in production is painful.
Define failure categories that match your domain. Borrow general categories like invalid invocation and hallucination, but add domain-specific ones where policy or workflow mistakes matter.
Find the first critical failure, not every imperfection. Long traces are noisy. The question is which step made successful recovery unlikely.
Build replayable evals from real failures. Every incident should become a regression case that protects the system from repeating the same pattern.
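A sketch of that workflow, pairing with the hypothetical `locate_first_critical_failure` sketch from earlier; the file layout and field names are assumptions:

```python
import json

def save_regression_case(trace_path: str, case_path: str,
                         expected_category: str, failing_step: int) -> None:
    """Freeze a failed run as a regression case: the recorded trace plus
    the diagnosis the checker is expected to reproduce."""
    with open(trace_path) as f:
        trace = [json.loads(line) for line in f]
    case = {
        "trace": trace,
        "expected_category": expected_category,  # e.g. "invalid_invocation"
        "expected_step": failing_step,
    }
    with open(case_path, "w") as f:
        json.dump(case, f, indent=2)

def replay_case(case_path: str, diagnose) -> bool:
    """Re-run diagnosis over the frozen trace; fail if the pattern returns."""
    with open(case_path) as f:
        case = json.load(f)
    result = diagnose(case["trace"])
    return (result is not None
            and result.category == case["expected_category"]
            and result.step_index == case["expected_step"])
```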
Separate model failures from harness failures. A model may need better instructions, but the harness may need better schemas, tool boundaries, state management, or guardrails.
Most of all, stop treating the final answer as the whole artifact. In agentic systems, the answer is only the last page of the story. The trajectory is where the engineering lives.
Further Reading
- Systematic debugging for AI agents: Introducing the AgentRx framework - Microsoft Research on locating critical failures and attributing root causes from execution traces.
- The next evolution of the Agents SDK - OpenAI's update on the harness and sandbox primitives that make richer agent traces possible.