Testing, Debugging, and Updating Agents

Tune delegation behavior, debug agent underperformance, distinguish designer failures from runtime failures, and update existing agents without drift.

Agent engineering doesn't stop at writing the first prompt. This page covers how to tune delegation, debug underperformance, and update agents while preserving their original intent.

Tuning delegation behavior

If delegation is too frequent

  • Narrow the description triggers
  • Add explicit exclusions ("Do not use for X")
  • Add or strengthen a near-miss <example> block with clear commentary explaining why it should not delegate (see the sketch after this list)
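
For illustration, a near-miss <example> block paired with an explicit exclusion might look like the sketch below. This is a hypothetical description fragment for an over-eager review agent; the agent, the scenario, and the exact wording are assumptions, not a required format.

```markdown
description: Use for reviewing completed changes for security and correctness
  issues. Do not use for style-only feedback or for code that is still being
  drafted.

  <example>
  Context: The user is midway through writing a new function.
  user: "Does this loop look right so far?"
  assistant: I'll answer directly instead of delegating to the review agent.
  <commentary>
  The code is still in progress and the question is narrow, so a full review
  pass would be premature. Do not delegate here.
  </commentary>
  </example>
```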

If delegation never happens

  • Add concrete keywords and file patterns to the description
  • Add or strengthen <example> blocks (aim for 2-4 total)
  • Use "use proactively..." phrasing if you want aggressive delegation

If output is too verbose

  • Strengthen the output contract with explicit verbosity limits
  • Add a "verbosity limit" rule (e.g., "max 1-2 screens unless asked")

If it misses key checks

  • Add a required checklist item ("Always run X", "Always verify Y")
  • Make the check explicit in the workflow section; a sketch follows this list
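
Missed checks tend to stick better as numbered, non-optional workflow steps than as background advice. For instance, in a hypothetical review agent (the commands and checks are placeholders):

```markdown
## Workflow
1. Read the changed files and their related tests.
2. You must run the test suite (for example, `npm test`) before returning.
3. You must verify that public API signatures did not change; if they did,
   flag it as a breaking change in your findings.
4. Only then write up findings using the output contract below.
```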

If it overreaches or changes too much

  • Tighten scope and non-goals
  • Reduce tool access (see the frontmatter sketch after this list)
  • Consider using plan mode or reviewer pattern instead
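
Reducing tool access is often just a frontmatter change. For example, a reviewer that kept editing files could be restricted to read-only tools; the agent name and tool names below are illustrative, so match them to the tools your setup actually exposes:

```markdown
---
name: code-reviewer
description: Use after a change is complete to review it for security and
  correctness issues. Read-only; never edits files.
tools: Read, Grep, Glob
---
```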

Diagnosing underperformance

When an agent fails, first identify whether it's a designer problem or a runtime problem:

| Category | What it means | Examples | Fix |
| --- | --- | --- | --- |
| Designer failure | The prompt has issues | Ambiguous instructions; missing escalation rules; no output contract | Edit the agent prompt |
| Runtime failure | The prompt is sound but the model misapplies it | Model ignores clear instruction; hallucinates despite guardrails | Strengthen emphasis, add examples, or accept the limitation |

Rule of thumb: If you can imagine a careful human reader misinterpreting the prompt the same way the model did, it's a designer failure.

Diagnostic questions

  1. If you gave this prompt to a different model, would it interpret it the same way?
  2. Are there instructions that could be read two ways?
  3. Does the agent have enough context to judge edge cases?
  4. Is the output contract specific enough to know when output is "good"?

If any answer is "no," it's likely a designer failure. Fix the prompt before blaming the model.

Common designer failure modes

1. Ambiguous instructions

Instructions that could be read two ways — the model picks one interpretation; the user expected the other.

| Ambiguous | Clear |
| --- | --- |
| "Be thorough" | "Check for security issues, missing error handling, and breaking API changes" |
| "Clean up the code" | "Remove unused imports and dead code paths" |
| "Review for issues" | "Review for security vulnerabilities and correctness bugs" |

Fix: Add clarifying examples, "do X, not Y" constraints, or explicit scope.

2. Missing context assumptions

The prompt assumes context that won't be available at runtime.

| Problem | Fix |
| --- | --- |
| "Continue from where we left off" | Subagents start fresh; include all needed context in the handoff |
| "Apply the pattern we discussed" | Specify the pattern explicitly |
| "Fix the issue" | Describe the issue in detail |

3. Vague directive strength

Using weak language for requirements or strong language for preferences.

| Problem | Better |
| --- | --- |
| "Try to run tests" (when tests are required) | "You must run tests before returning" |
| "Consider security" (when security is non-negotiable) | "You must check for security vulnerabilities" |
| "Must follow style guide" (for cosmetic preferences) | "You should follow the style guide when possible" |

4. Missing failure mode awareness

The prompt tells the agent what to do, but not what to watch out for.

Fix: Include 3-5 contextually relevant failure modes from the failure mode catalog.
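
A failure-modes section does not need to be long; a few targeted warnings in the agent body are usually enough. An illustrative shape, where the specific modes should come from your own catalog rather than this list:

```markdown
## Failure modes to watch
- Scope creep: do not refactor code unrelated to the reported issue.
- False confidence: if you did not run the tests, say so; never claim they pass.
- Context starvation: if the handoff lacks the file paths you need, ask instead
  of guessing.
```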

5. Underspecified output contract

The prompt doesn't define what "good output" looks like.

| Vague | Specific |
| --- | --- |
| "Return your findings" | "Return findings as a prioritized list with severity, file path, description, and recommendation" |
| "Summarize the issues" | "Return a TL;DR (2-5 bullets) followed by findings sorted by severity" |

6. Missing escalation rules

No guidance on when to ask for help vs proceed with assumptions.

Fix: Add explicit rules for when to ask, when to proceed with labeled assumptions, and when to return partial results.
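
One way to express these rules is as a short decision list in the agent body. A sketch, with thresholds that would need tuning per agent:

```markdown
## Escalation rules
- Ask the user before proceeding when a decision is high-stakes or irreversible
  (schema changes, deletions, anything touching auth).
- For low-stakes ambiguity, proceed, but label every assumption explicitly
  ("Assumption: ...") in your output.
- If you cannot finish, return partial results plus what is blocking you;
  never return nothing.
```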

7. Overloaded scope

Too many instructions competing for attention.

Fix: Prioritize ruthlessly. Move reference material to files loaded on demand. Use "must" for critical items and "should" for nice-to-haves.
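
One common way to offload reference material is to point the agent at a file to read on demand instead of inlining it. A sketch, where the file path and section names are placeholders:

```markdown
## Reference material
- Detailed API conventions live in docs/api-conventions.md; read that file
  before reviewing endpoint changes rather than relying on memory.
- Treat everything in the reference files as advisory; only the "Critical
  rules" section above is must-follow.
```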

8. False dichotomies

Presenting two options when more exist.

| False dichotomy | Better |
| --- | --- |
| "Either fix the bug or document it" | "Fix, document, or escalate based on severity and confidence" |
| "Ask or assume" | "Ask for high-stakes decisions; proceed with labeled assumptions for low-stakes ones" |

The designer self-check

Before delivering an agent prompt, verify:

  • No instruction could be read two ways in a different context
  • No assumed context the agent won't have
  • Directive strength is clear throughout ("must"/"should"/"consider" match actual requirements)
  • 3-5 relevant failure modes are addressed
  • Output contract is specific enough to validate against
  • Escalation rules are clear (when to ask, when to proceed, when to return partial results)
  • Scope is manageable (agent can hold all critical instructions in effective attention)

Updating existing agents

When updating an existing agent, the goal is to improve clarity and structure without changing the agent's meaning unless explicitly requested.

What you can change safely

These changes don't affect meaning, routing, or capability:

  • Reformatting (headings, lists, checklists) without changing requirements
  • Reducing redundancy and merging duplicated text
  • Reordering sections for scannability
  • Tightening language where intent is already clear
  • Fixing typos, grammar, or formatting inconsistencies

What requires explicit approval

These changes can affect when the agent fires or how it judges:

  • Expanding or narrowing description triggers
  • Adding, removing, or modifying <example> blocks
  • Changing MUST/SHOULD wording or severity levels
  • Modifying personality statements or escalation thresholds
  • Adding or removing failure modes

What requires impact analysis

These changes can break downstream consumers:

  • Renaming the agent (breaks Task tool references)
  • Changing return packet format (breaks orchestrator aggregation; see the sketch below)
  • Changing severity levels or dedup keys
  • Changing from subagent to orchestrator pattern (or vice versa)
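
For context, the return packet is whatever structured output downstream orchestrators parse, so even small edits to it warrant impact analysis. A hypothetical shape (field names and severity levels are assumptions, not a prescribed schema):

```markdown
## Return packet
- summary: 2-3 sentence overview of what was found or done
- findings: list of {severity, file, line, description, recommendation}
- severity: one of critical | high | medium | low
- status: complete | partial | blocked
```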

The update workflow

  1. Inventory the current agent: read the full file and note frontmatter fields, body sections, and dependencies.
  2. Capture an intent snapshot: document purpose, routing posture, behavioral calibration, capability surface, and output contract (see the sketch after these steps).
  3. Identify update opportunities: find redundancy, unclear structure, ambiguous instructions, or missing output contracts.
  4. Classify each change: safe (implement by default), routing/calibration risk (requires approval), capability/contract change (requires explicit approval), or downstream-breaking (requires impact analysis).
  5. Apply safe changes: implement the formatting and clarity improvements.
  6. Run a drift check: compare the before/after intent snapshots and flag any unintended changes.
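
The intent snapshot from step 2 can be a few lines kept outside the agent file, for example in the pull request description. The fields below are one possible shape, not a required schema:

```markdown
Intent snapshot: code-reviewer (before update)
- Purpose: review completed changes for security and correctness
- Routing: fires on explicit review requests and completed PRs; not proactive
- Calibration: MUST run tests; SHOULD check style; escalates on auth changes
- Capabilities: read-only tools, no shell access
- Output contract: TL;DR plus findings sorted by severity
```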

Common anti-patterns when updating

| Anti-pattern | Description |
| --- | --- |
| Oversampling | Adding rules, failure modes, or edge cases "because it seems good"; only add what the author clearly implied |
| Tone normalization | Rewriting personality into your own style, which can shift strictness or escalation thresholds |
| Calibration creep | Subtly shifting MUST → SHOULD or SHOULD → CONSIDER without realizing the behavioral impact |
| Silent routing changes | Expanding description triggers so the agent fires in new contexts without consent |
| Tool accumulation | Adding tools "for convenience" without considering the risk surface |
| Schema drift | Changing return packet structure when an orchestrator depends on it |
| Diff noise | Massive reflow that makes review hard and increases risk of accidental semantic changes |

Maintenance best practices

  • Prefer small, targeted edits based on observed failures
  • Keep a changelog in git history rather than inside the agent file
  • Test delegation behavior after changes to description or <example> blocks
  • When multiple failures occur, make targeted fixes rather than shotgun edits across the prompt