Testing, Debugging, and Updating Agents

Tune delegation behavior, debug agent underperformance, distinguish designer failures from runtime failures, and update existing agents without drift.

Agent engineering doesn't stop at writing the first prompt. This page covers how to tune delegation, debug underperformance, and update agents while preserving their original intent.

Tuning delegation behavior

If delegation is too frequent

  • Narrow the description triggers
  • Add explicit exclusions ("Do not use for X")
  • Add or strengthen a near-miss <example> block with clear commentary explaining why it should not delegate (see the sketch after this list)
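
For illustration, a near-miss <example> block paired with an explicit exclusion might look like the sketch below. This is a hypothetical description fragment for an over-eager review agent; the agent, the scenario, and the exact wording are assumptions, not a required format.

```markdown
description: Use for reviewing completed changes for security and correctness
  issues. Do not use for style-only feedback or for code that is still being
  drafted.

  <example>
  Context: The user is midway through writing a new function.
  user: "Does this loop look right so far?"
  assistant: I'll answer directly instead of delegating to the review agent.
  <commentary>
  The code is still in progress and the question is narrow, so a full review
  pass would be premature. Do not delegate here.
  </commentary>
  </example>
```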

If delegation never happens

  • Add concrete keywords and file patterns to the description
  • Add or strengthen <example> blocks (aim for 2-4 total)
  • Use "use proactively..." phrasing if you want aggressive delegation

If output is too verbose

  • Strengthen the output contract with explicit verbosity limits
  • Add a "verbosity limit" rule (e.g., "max 1-2 screens unless asked")

If it misses key checks

  • Add a required checklist item ("Always run X", "Always verify Y")
  • Make the check explicit in the workflow section; a sketch follows this list
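
Missed checks tend to stick better as numbered, non-optional workflow steps than as background advice. For instance, in a hypothetical review agent (the commands and checks are placeholders):

```markdown
## Workflow
1. Read the changed files and their related tests.
2. You must run the test suite (for example, `npm test`) before returning.
3. You must verify that public API signatures did not change; if they did,
   flag it as a breaking change in your findings.
4. Only then write up findings using the output contract below.
```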

If it overreaches or changes too much

  • Tighten scope and non-goals
  • Reduce tool access (see the frontmatter sketch after this list)
  • Consider using plan mode or reviewer pattern instead
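
Reducing tool access is often just a frontmatter change. For example, a reviewer that kept editing files could be restricted to read-only tools; the agent name and tool names below are illustrative, so match them to the tools your setup actually exposes:

```markdown
---
name: code-reviewer
description: Use after a change is complete to review it for security and
  correctness issues. Read-only; never edits files.
tools: Read, Grep, Glob
---
```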

Diagnosing underperformance

When an agent fails, first identify whether it's a designer problem or a runtime problem:

| Category | What it means | Examples | Fix |
| --- | --- | --- | --- |
| Designer failure | The prompt has issues | Ambiguous instructions; missing escalation rules; no output contract | Edit the agent prompt |
| Runtime failure | The prompt is sound but the model misapplies it | Model ignores clear instruction; hallucinates despite guardrails | Strengthen emphasis, add examples, or accept the limitation |

Rule of thumb: If you can imagine a careful human reader misinterpreting the prompt the same way the model did, it's a designer failure.

Diagnostic questions

  1. If you gave this prompt to a different model, would it interpret it the same way?
  2. Are there instructions that could be read two ways?
  3. Does the agent have enough context to judge edge cases?
  4. Is the output contract specific enough to know when output is "good"?

If any answer is "no," it's likely a designer failure. Fix the prompt before blaming the model.

Common designer failure modes

1. Ambiguous instructions

Instructions that could be read two ways — the model picks one interpretation; the user expected the other.

| Ambiguous | Clear |
| --- | --- |
| "Be thorough" | "Check for security issues, missing error handling, and breaking API changes" |
| "Clean up the code" | "Remove unused imports and dead code paths" |
| "Review for issues" | "Review for security vulnerabilities and correctness bugs" |

Fix: Add clarifying examples, "do X, not Y" constraints, or explicit scope.

2. Missing context assumptions

The prompt assumes context that won't be available at runtime.

| Problem | Fix |
| --- | --- |
| "Continue from where we left off" | Subagents start fresh; include all needed context in the handoff |
| "Apply the pattern we discussed" | Specify the pattern explicitly |
| "Fix the issue" | Describe the issue in detail |

3. Vague directive strength

Using weak language for requirements or strong language for preferences.

| Problem | Better |
| --- | --- |
| "Try to run tests" (when tests are required) | "You must run tests before returning" |
| "Consider security" (when security is non-negotiable) | "You must check for security vulnerabilities" |
| "Must follow style guide" (for cosmetic preferences) | "You should follow the style guide when possible" |

4. Missing failure mode awareness

The prompt tells the agent what to do, but not what to watch out for.

Fix: Include 3-5 contextually relevant failure modes from the failure mode catalog.
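
A failure-modes section does not need to be long; a few targeted warnings in the agent body are usually enough. An illustrative shape, where the specific modes should come from your own catalog rather than this list:

```markdown
## Failure modes to watch
- Scope creep: do not refactor code unrelated to the reported issue.
- False confidence: if you did not run the tests, say so; never claim they pass.
- Context starvation: if the handoff lacks the file paths you need, ask instead
  of guessing.
```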

5. Underspecified output contract

The prompt doesn't define what "good output" looks like.

| Vague | Specific |
| --- | --- |
| "Return your findings" | "Return findings as a prioritized list with severity, file path, description, and recommendation" |
| "Summarize the issues" | "Return a TL;DR (2-5 bullets) followed by findings sorted by severity" |

6. Missing escalation rules

No guidance on when to ask for help vs proceed with assumptions.

Fix: Add explicit rules for when to ask, when to proceed with labeled assumptions, and when to return partial results.
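
One way to express these rules is as a short decision list in the agent body. A sketch, with thresholds that would need tuning per agent:

```markdown
## Escalation rules
- Ask the user before proceeding when a decision is high-stakes or irreversible
  (schema changes, deletions, anything touching auth).
- For low-stakes ambiguity, proceed, but label every assumption explicitly
  ("Assumption: ...") in your output.
- If you cannot finish, return partial results plus what is blocking you;
  never return nothing.
```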

7. Overloaded scope

Too many instructions competing for attention.

Fix: Prioritize ruthlessly. Move reference material to files loaded on demand. Use "must" for critical items and "should" for nice-to-haves.
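
One common way to offload reference material is to point the agent at a file to read on demand instead of inlining it. A sketch, where the file path and section names are placeholders:

```markdown
## Reference material
- Detailed API conventions live in docs/api-conventions.md; read that file
  before reviewing endpoint changes rather than relying on memory.
- Treat everything in the reference files as advisory; only the "Critical
  rules" section above is must-follow.
```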

8. False dichotomies

Presenting two options when more exist.

| False dichotomy | Better |
| --- | --- |
| "Either fix the bug or document it" | "Fix, document, or escalate based on severity and confidence" |
| "Ask or assume" | "Ask for high-stakes decisions; proceed with labeled assumptions for low-stakes ones" |

The designer self-check

Before delivering an agent prompt, verify:

  • No instruction could be read two ways in a different context
  • No assumed context the agent won't have
  • Directive strength is clear throughout ("must"/"should"/"consider" match actual requirements)
  • 3-5 relevant failure modes are addressed
  • Output contract is specific enough to validate against
  • Escalation rules are clear (when to ask, when to proceed, when to return partial results)
  • Scope is manageable (agent can hold all critical instructions in effective attention)

Updating existing agents

When updating an existing agent, the goal is to improve clarity and structure without changing the agent's meaning unless explicitly requested.

What you can change safely

These changes don't affect meaning, routing, or capability:

  • Reformatting (headings, lists, checklists) without changing requirements
  • Reducing redundancy and merging duplicated text
  • Reordering sections for scannability
  • Tightening language where intent is already clear
  • Fixing typos, grammar, or formatting inconsistencies

What requires explicit approval

These changes can affect when the agent fires or how it judges:

  • Expanding or narrowing description triggers
  • Adding, removing, or modifying <example> blocks
  • Changing MUST/SHOULD wording or severity levels
  • Modifying personality statements or escalation thresholds
  • Adding or removing failure modes

What requires impact analysis

These changes can break downstream consumers:

  • Renaming the agent (breaks Task tool references)
  • Changing return packet format (breaks orchestrator aggregation; see the sketch below)
  • Changing severity levels or dedup keys
  • Changing from subagent to orchestrator pattern (or vice versa)
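
For context, the return packet is whatever structured output downstream orchestrators parse, so even small edits to it warrant impact analysis. A hypothetical shape (field names and severity levels are assumptions, not a prescribed schema):

```markdown
## Return packet
- summary: 2-3 sentence overview of what was found or done
- findings: list of {severity, file, line, description, recommendation}
- severity: one of critical | high | medium | low
- status: complete | partial | blocked
```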

The update workflow

  1. Inventory the current agent: read the full file and note frontmatter fields, body sections, and dependencies.
  2. Capture an intent snapshot: document purpose, routing posture, behavioral calibration, capability surface, and output contract (see the sketch after these steps).
  3. Identify update opportunities: find redundancy, unclear structure, ambiguous instructions, or missing output contracts.
  4. Classify each change: safe (implement by default), routing/calibration risk (requires approval), capability/contract change (requires explicit approval), or downstream-breaking (requires impact analysis).
  5. Apply safe changes: implement the formatting and clarity improvements.
  6. Run a drift check: compare the before/after intent snapshots and flag any unintended changes.
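
The intent snapshot from step 2 can be a few lines kept outside the agent file, for example in the pull request description. The fields below are one possible shape, not a required schema:

```markdown
Intent snapshot: code-reviewer (before update)
- Purpose: review completed changes for security and correctness
- Routing: fires on explicit review requests and completed PRs; not proactive
- Calibration: MUST run tests; SHOULD check style; escalates on auth changes
- Capabilities: read-only tools, no shell access
- Output contract: TL;DR plus findings sorted by severity
```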

Common anti-patterns when updating

| Anti-pattern | Description |
| --- | --- |
| Oversampling | Adding rules, failure modes, or edge cases "because it seems good"; only add what the author clearly implied |
| Tone normalization | Rewriting personality into your own style, which can shift strictness or escalation thresholds |
| Calibration creep | Subtly shifting MUST → SHOULD or SHOULD → CONSIDER without realizing the behavioral impact |
| Silent routing changes | Expanding description triggers so the agent fires in new contexts without consent |
| Tool accumulation | Adding tools "for convenience" without considering the risk surface |
| Schema drift | Changing return packet structure when an orchestrator depends on it |
| Diff noise | Massive reflow that makes review hard and increases risk of accidental semantic changes |

Maintenance best practices

  • Prefer small, targeted edits based on observed failures
  • Keep a changelog in git history rather than inside the agent file
  • Test delegation behavior after changes to description or <example> blocks
  • When multiple failures occur, make targeted fixes rather than shotgun edits across the prompt