Strands Evals detectors: AI agent failure detection and RCA
Strands Evals detectors use LLM analysis to find per-span failures, trace causal chains, and recommend system-prompt or tool-description fixes.
TL;DR
- Strands Evals detectors use LLM analysis to find per-span failures, trace causal chains, and recommend system-prompt or tool-description fixes.
- Strands Evals detectors scan agent execution traces, surface per-span failures, and map causal chains to produce concrete fix recommendations.
- The toolkit runs LLM-based analysis over session traces, classifies failures into a structured taxonomy, and recommends whether fixes belong in the system prompt, tool descriptions, or elsewhere.
How the detectors work
Detection runs in two phases. Phase 1, failure detection, scans every span in a Session object and labels failures against a taxonomy organized into nine parent categories: hallucination, incorrect actions, orchestration errors, task instruction non-compliance, execution errors, context handling errors, repetitive behavior, LLM output issues, and configuration mismatch. For each detected failure the detector returns the span location, one or more categories, a confidence score, and evidence extracted from the trace.
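To make the Phase 1 output concrete, here is a minimal Python sketch of what one failure record could look like. The class names, field names, and numeric confidence scale are assumptions made for illustration and are not taken from the Strands Evals API; only the nine category names and the four kinds of returned information come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    """The nine parent categories named in the text (member names are illustrative)."""
    HALLUCINATION = "hallucination"
    INCORRECT_ACTIONS = "incorrect actions"
    ORCHESTRATION_ERRORS = "orchestration errors"
    TASK_INSTRUCTION_NON_COMPLIANCE = "task instruction non-compliance"
    EXECUTION_ERRORS = "execution errors"
    CONTEXT_HANDLING_ERRORS = "context handling errors"
    REPETITIVE_BEHAVIOR = "repetitive behavior"
    LLM_OUTPUT_ISSUES = "LLM output issues"
    CONFIGURATION_MISMATCH = "configuration mismatch"


@dataclass(frozen=True)
class DetectedFailure:
    """Hypothetical record: span location, categories, confidence, evidence."""
    span_id: str
    categories: tuple[FailureCategory, ...]
    confidence: float  # assumed 0.0-1.0 scale; the real library may use discrete levels
    evidence: str


# Illustrative values only; the span id and score are not from a real trace.
failure = DetectedFailure(
    span_id="span-7",
    categories=(FailureCategory.EXECUTION_ERRORS,),
    confidence=0.9,
    evidence="tool call is missing a required parameter",
)
print(failure.span_id, [c.value for c in failure.categories])
```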
Phase 2, root cause analysis, links detected failures into causal chains. The analysis separates upstream causes from downstream symptoms, classifies each failure’s causality (PRIMARY, SECONDARY, or TERTIARY), assesses propagation impact, and generates fix recommendations categorized by where the change belongs: system prompt, tool description, or other.
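The idea of separating upstream causes from downstream symptoms can be sketched with a small graph walk. This is a simplified stand-in, not the detectors' method: the real analysis is LLM-based, whereas the sketch assumes each failure already carries a hypothetical `caused_by` pointer and assigns the causality label purely by depth in the chain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureNode:
    span_id: str
    caused_by: Optional[str] = None  # hypothetical link to the upstream failure


# Assumption: the label follows chain depth (0 = root cause).
CAUSALITY_BY_DEPTH = ("PRIMARY", "SECONDARY", "TERTIARY")


def classify_causality(nodes: list[FailureNode]) -> dict[str, str]:
    """Label each failure by how many upstream failures precede it."""
    by_id = {n.span_id: n for n in nodes}
    labels: dict[str, str] = {}
    for node in nodes:
        depth = 0
        current = node
        seen = {node.span_id}  # guards against accidental cycles
        while current.caused_by in by_id and current.caused_by not in seen:
            current = by_id[current.caused_by]
            seen.add(current.span_id)
            depth += 1
        # Anything deeper than two hops is still reported as TERTIARY.
        labels[node.span_id] = CAUSALITY_BY_DEPTH[min(depth, 2)]
    return labels


chain = [
    FailureNode("a"),                 # upstream cause
    FailureNode("b", caused_by="a"),  # symptom of a
    FailureNode("c", caused_by="b"),  # symptom of b
]
labels = classify_causality(chain)
print(labels)
```

A failure with no known upstream cause is treated as a root, so fixing it first is the natural priority under this simplified model.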
Both phases cope with large sessions through a tiered strategy: direct analysis when the session fits the model context window, failure-path pruning that keeps ancestor and descendant spans for moderately large sessions, and chunked analysis with merge for very large sessions that require splitting traces into overlapping windows and reconciling results.
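The tiered strategy can be illustrated with a short sketch. The function names, the token-count inputs, and the `pruning_limit_factor` threshold are all assumptions chosen for the example; the source does not state how the library decides between tiers or how large its windows are.

```python
def choose_strategy(session_tokens: int, context_window: int,
                    pruning_limit_factor: float = 4.0) -> str:
    """Pick an analysis tier from session size (thresholds are hypothetical)."""
    if session_tokens <= context_window:
        return "direct"
    if session_tokens <= context_window * pruning_limit_factor:
        return "failure_path_pruning"
    return "chunked_with_merge"


def overlapping_windows(spans: list, window: int, overlap: int) -> list[list]:
    """Split spans into fixed-size windows that share `overlap` items."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(spans), step):
        chunks.append(spans[start:start + window])
        if start + window >= len(spans):
            break  # the last window already reaches the end of the trace
    return chunks


print(choose_strategy(1_000, 8_000))
print(overlapping_windows(list(range(10)), window=4, overlap=1))
```

The overlap means a failure near a window boundary appears in two chunks, which is why a merge step has to reconcile results afterwards.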
Concrete APIs mirror this flow. detect_failures returns structured failures; analyze_root_cause takes those failures (or runs detection itself) and returns root causes with causality and recommended fix types; diagnose_session runs both phases and returns a deduplicated DiagnosisResult with failures, root causes, and recommendations. The Experiment class accepts a DiagnosisConfig so diagnosis can run automatically on test cases.
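The source says the combined result is deduplicated but does not describe the rule. One plausible approach, shown here purely as an assumption, is to treat failures as duplicates when they share a span and category and to keep the highest-confidence copy. The dictionary keys used below are hypothetical and do not reflect the actual DiagnosisResult fields.

```python
def deduplicate(failures: list[dict]) -> list[dict]:
    """Keep one failure per (span_id, category), preferring higher confidence."""
    best: dict[tuple, dict] = {}
    for failure in failures:
        key = (failure["span_id"], failure["category"])
        if key not in best or failure["confidence"] > best[key]["confidence"]:
            best[key] = failure
    # dict preserves first-insertion order, so output order is stable
    return list(best.values())


raw = [
    {"span_id": "s1", "category": "hallucination", "confidence": 0.6},
    {"span_id": "s1", "category": "hallucination", "confidence": 0.8},
    {"span_id": "s2", "category": "execution errors", "confidence": 0.9},
]
unique = deduplicate(raw)
print(unique)
```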
Examples from a research-agent trace
A published example uses a research-assistant session in which the agent is asked to "Research the impact of energy requirements for powering AI in the real world." The detector output for that session identified several concurrent issues: a tool execution error caused by a missing required parameter 'knowledgeBaseId', a hallucination where the agent produced detailed content without using tools, and an orchestration failure where the agent abandoned the original task and, as the trace shows, stated "I'm going to pivot to discuss marine biology instead."
Root cause analysis in that example marked the tool-parameter validation failure as a PRIMARY_FAILURE and assigned it a TOOL_DESCRIPTION_FIX, recommending that the retrieve tool description explicitly document knowledgeBaseId as required, with format constraints and example values. The subsequent hallucination was marked as a SECONDARY_FAILURE and assigned a SYSTEM_PROMPT_FIX, with a recommendation to add instructions that prohibit generating factual content without tool-verified evidence and that require explicit acknowledgment when retrieval tools fail.
The toolkit also exposes programmatic knobs and integration patterns demonstrated in code examples: setting a confidence_threshold (ConfidenceLevel.MEDIUM in examples), calling diagnose_session(session, confidence_threshold=...), and wiring DiagnosisConfig into Experiment with DiagnosisTrigger.ON_FAILURE (default) or ALWAYS.
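The two knobs can be pictured with local stand-in types. Nothing below imports from the real library: the enum names reuse `ConfidenceLevel` and `DiagnosisTrigger` from the text, but the `LOW` and `HIGH` members, the ordering, and both helper functions are assumptions made for illustration.

```python
from enum import Enum, IntEnum


class ConfidenceLevel(IntEnum):
    """Stand-in enum; only MEDIUM is confirmed by the source."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class DiagnosisTrigger(Enum):
    """Stand-in enum mirroring the two triggers named in the text."""
    ON_FAILURE = "on_failure"
    ALWAYS = "always"


def passes_threshold(level: ConfidenceLevel,
                     threshold: ConfidenceLevel = ConfidenceLevel.MEDIUM) -> bool:
    """Assumed semantics: keep failures at or above the threshold."""
    return level >= threshold


def should_diagnose(trigger: DiagnosisTrigger, case_passed: bool) -> bool:
    """Assumed semantics: ALWAYS runs every time, ON_FAILURE only on failed cases."""
    if trigger is DiagnosisTrigger.ALWAYS:
        return True
    return not case_passed


print(passes_threshold(ConfidenceLevel.HIGH))
print(should_diagnose(DiagnosisTrigger.ON_FAILURE, case_passed=True))
```

Under these assumptions, a MEDIUM threshold drops low-confidence findings, and the default ON_FAILURE trigger keeps diagnosis cost tied to the number of failing test cases.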
Why it matters
Detectors move diagnosis from manual trace inspection to automated, structured analysis. Evaluators already quantify agent quality with per-case scores and goal-success metrics; detectors explain failures at per-span granularity, trace causal chains, and produce targeted fix recommendations. For teams operating agents at scale, that removes the manual bottleneck that forces senior engineers to inspect hundreds of spans, and it aims to reduce diagnosis time from hours to minutes.
The separation between fix types matters operationally: a TOOL_DESCRIPTION_FIX targets tool schema or documentation, while a SYSTEM_PROMPT_FIX changes the agent's behavioral instructions. The example shows that fixing only one layer would leave the other failure mode unaddressed, which clarifies prioritization and reduces wasted iteration.
What to watch
Adoption signals to monitor: whether teams wire DiagnosisConfig into CI/CD so ON_FAILURE triggers produce automated diagnoses for regressions, and how detectors perform on traces fetched from remote providers such as Amazon CloudWatch or Amazon Bedrock traces. Also watch whether the tiered chunking strategy preserves causal clarity at very large session sizes and whether recommendations reduce time-to-fix in practice.
Written by The Brieftide · Sources: AWS Machine Learning, strandsagents.com