Passing threshold: 75%
-p flag disables trust verification, removing the safeguard that would otherwise require manual approval for unrecognized commands when Claude processes content from untrusted external sources.
-p flag enables non-interactive mode, and a documented side effect is that trust verification is disabled in this mode. Trust verification is the safeguard that challenges first-time or unrecognized operations — including unmatched commands that would otherwise default to requiring manual approval under fail-closed matching. By disabling trust verification, the pipeline loses the gate that prevents unrecognized commands from executing without human review, even though fail-closed matching is nominally in place. The root cause is not a misconfiguration of the allowlist but the architectural tradeoff of using -p in a context where untrusted content is being processed.
max_tokens value is too low relative to task_budget, causing the API to truncate responses before the agentic loop can complete. Raise max_tokens to match or exceed task_budget, since the two values must be kept in proportion to prevent early truncation.
task_budget is set far below what the refactoring tasks actually require, causing Claude to scope down or refuse tasks it cannot finish within the budget. The budget should be sized against the actual token-length distribution of these tasks.
task_budget is a hard cap, Claude is being terminated mid-execution by the API whenever the budget is exhausted. Add error-handling logic to catch these truncated responses and retry the failed requests with exponential backoff.
task_budget that is clearly insufficient for the requested work causes Claude to exhibit refusal-like behavior — declining to start, aggressively scoping down, or stopping early with partial results rather than beginning work it cannot finish. The absence of stop_reason: "max_tokens" in the logs confirms the API is not truncating responses; the problem is Claude's self-regulation in response to a budget it perceives as unworkable. The fix is to size the budget against the actual task-length distribution rather than a fixed conservative default.
stop_reason before treating a response as complete; when output is truncated mid-tool-use-block, the agent must detect stop_reason: "max_tokens" and treat that invocation as a hard failure requiring retry or escalation — not silently advance to the next item.
stop_reason: "max_tokens" and contains an incomplete tool use block, the tool invocation did not complete — the agent received a partial, unusable result. The correct architectural response is to inspect stop_reason on every API response and treat max_tokens with an incomplete tool use block as an explicit failure signal, not a successful response. Silently advancing is a silent error suppression pattern: the pipeline hides the failure instead of surfacing it. The fix is structural — the agent loop must branch on stop_reason before processing tool call results, then retry or escalate to human review.
Write and Edit tools are not supported in CI environments and must be replaced with MCP server equivalents using the mcp__<server>__<action> naming convention.
PreToolUse hook was added or modified with an empty or overly broad matcher, causing it to match all tool calls — including Write and Edit — and return permissionDecision: 'deny'.
PostToolUse hook for Write is returning a deny decision after the tool runs, and the agent is rolling back file changes before they are persisted.
allow entries for each built-in tool in the ClaudeAgentOptions; without them, tools default to blocked after an agent configuration change.
matcher field in a PreToolUse hook matches every tool event — not just the intended target. If a hook added during the sprint omits the matcher field or sets it to a pattern that inadvertently matches Write and Edit (e.g., an empty string), it will intercept all tool calls and apply its deny decision universally. The documented guidance explicitly warns: "an empty matcher matches all tools." Adding logging to surface permissionDecisionReason and tightening the matcher to the specific tool name (e.g., 'Bash') is the correct diagnostic and fix.
PreToolUse SDK callback with no matcher, so it fires before any tool call in the session and acts as a session-start proxy.
SessionStart shell command hook in .claude/settings.json and include setting_sources=["project"] in the ClaudeAgentOptions so the SDK loads it.
SessionStart as an SDK callback hook natively.
SessionStart is not available as a Python SDK callback hook — the Python HookEvent type omits it. The correct fix uses the mechanism designed for this gap: define the hook as a shell command in .claude/settings.json and load that settings file via setting_sources=["project"] in ClaudeAgentOptions. This makes the shell command hook available to the Python SDK session without any code restructuring, and is the pattern Anthropic guidance prescribes for this exact capability gap.
enableWeakerNestedSandbox reduces filesystem and network isolation, and allowing /var/run/docker.sock via allowUnixSockets effectively grants the agent host-level access — together these eliminate both the OS-level sandbox boundary and the container boundary simultaneously.
max_tokens so the agent batches more data per request, reducing the number of outbound connections.
enableWeakerNestedSandbox is only a risk on macOS; on Linux the sandbox implementation is strong enough that allowing /var/run/docker.sock poses no additional escalation risk in Docker environments.
enableWeakerNestedSandbox is documented as considerably weakening security — it removes the strong OS-level isolation that is the sandbox's primary defense. Simultaneously, allowing /var/run/docker.sock through allowUnixSockets is explicitly called out as effectively granting access to the host system via the Docker socket. Together, these two choices eliminate independent layers of protection: the first removes the sandbox boundary, and the second grants a direct escalation path to the host. Recognizing that each configuration option weakens a distinct isolation layer — and that compounding them leaves no effective boundary — is the correct architectural judgment.
tools frontmatter field so they can be pre-approved at background launch time.
CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 to eliminate background execution and require developers to manually trigger each test suite run.
tools frontmatter ensures the pre-approval prompt at launch covers all needed permissions, eliminating the silent mid-run abandonment without sacrificing concurrency.
ClaudeAgentOptions.
max_turns and max_budget_usd in each subagent's ClaudeAgentOptions, and handle error_max_turns and error_max_budget_usd result subtypes in the orchestrator to detect and resume or escalate stopped sessions.
max_turns and max_budget_usd fields in ClaudeAgentOptions are structural safety limits enforced by the SDK itself — the agent loop terminates and emits a ResultMessage with a specific error subtype (error_max_turns or error_max_budget_usd) rather than continuing indefinitely. Handling these subtypes in the orchestrator converts a silent runaway into a detectable, recoverable event (using session_id for resumption or escalating to a human). This addresses the root cause — absence of per-session guardrails — at the configuration layer, not through probabilistic prompt instructions.
defer mechanism requires Claude Code v2.1.89 or later, and CI environments typically pin to an older version to ensure reproducibility, so the hook result is silently ignored and the tool proceeds through the normal permission flow.
defer is ignored with a warning and the tool proceeds through the normal permission flow, leaving AskUserQuestion undeferred and the headless session hanging as before.
defer mechanism only works in interactive terminal sessions; in non-interactive -p mode the hook result is silently discarded, so the AskUserQuestion call executes immediately and the pipeline hangs waiting for terminal input.
cleanupPeriodDays retention sweep will delete the session before the CI pipeline can resume it, because the default 30-day window does not protect sessions that have been idle for more than a few hours between CI stages.
defer mechanism has an explicit architectural constraint: it only works when Claude makes a single tool call in the turn. If Claude issues multiple parallel tool calls in the same turn, defer is ignored with a warning and the tool proceeds through the normal permission flow. Because the failing runs show AskUserQuestion batched alongside Read calls, returning "defer" from the hook will be silently bypassed, the AskUserQuestion will proceed to execution, and the headless session will hang as before. The fix must address this batching behavior — either by constraining Claude to sequential tool calls or by intercepting AskUserQuestion with a "deny" and injecting a default answer.
/etc as a prohibited destination, making the constraint clearer to the model on every request.
PreToolUse hook with a WriteFile matcher that inspects file_path, returns permissionDecision: "deny" when the path starts with /etc, and injects a systemMessage so the agent understands why the operation was blocked.
PostToolUse hook that checks the written file path after each operation completes, then triggers a rollback script to restore /etc/claims-config/ whenever an unauthorized write is detected and committed.
WriteFile calls through a second validation agent that re-evaluates the destination path against an allowlist policy before forwarding the request onward to the file system.
PreToolUse hook with permissionDecision: "deny" intercepts the tool call before it executes — the write never reaches the file system. Pairing the denial with a systemMessage injects context explaining why the operation was blocked, preventing retries. This is a structural, deterministic enforcement mechanism applied at the SDK level, outside the model's decision loop — unlike prompt instructions, which are probabilistic and already proven insufficient by the observed 3% failure rate.
"Bash(find *)" and "Bash(tree *)" as additional scoped tool permissions to speed up file traversal and reduce the number of turns consumed before the agent reaches the relevant files.
error_max_turns in the result loop by resuming the session with an increased max_turns limit, and add a disallowed_tools list excluding broad commands like Bash to force the agent to use more targeted read-only traversal tools.
error_max_turns, capture the session_id from the ResultMessage and resume the session rather than silently completing, so the agent continues from where it stopped instead of discarding progress.
effort="low" so it uses fewer turns on reasoning overhead, freeing up the remaining turn budget for actual file edits and test fixes.
error_max_turns: save the session_id from ResultMessage and pass it back to query to continue from where the agent stopped. The current implementation is silently discarding the session on turn exhaustion — treating a resumable pause as a terminal failure. Resumption preserves accumulated context (files read, hypotheses formed, edits staged) rather than restarting the traversal from scratch. This addresses the root cause: the agent is running out of budget mid-investigation on long traversals, not producing wrong output or using the wrong tools.
stop_hook_active and exit early when the field is true.
type: "http" pointing at a local service will prevent the re-entry loop.
stop_hook_active field is specifically set to true when the Stop event is firing during a hook-triggered continuation. A hook that doesn't inspect this field and exit early will unconditionally trigger another continuation, producing an infinite loop. The fix is to parse stop_hook_active at the top of the script and exit 0 when it is true, allowing Claude to stop rather than re-enter.
"version": "1.4.0" is set in plugin.json, Claude Code uses that string as the cache key and skips the update when the string hasn't changed — regardless of new commits. The fix is to bump the version field on every release, or remove it so the git commit SHA is used as the cache key instead.
/plugin update can detect new commits. The fix is to configure the marketplace server to disable HTTP caching headers so fresh archives are always served.
version field in plugin.json takes precedence over the marketplace entry, so the marketplace's version was never consulted. The fix is to remove the version from plugin.json and set it only in marketplace.json, making the marketplace the single authoritative source for the version string.
version is set explicitly in plugin.json, that string is frozen — pushing new commits to the git source has no effect because Claude Code sees the same "1.4.0" string and concludes no update is available. The fix is either to bump the version field on every release (enforcing a deliberate release gate) or to remove the version field so the git commit SHA becomes the cache key, ensuring every new commit automatically surfaces as an update. This is the core versioning tradeoff: explicit versions provide stable release cycles; commit-SHA versions provide continuous delivery with no manual bookkeeping.
PreToolUse hook matched to Write that checks tool_input.file_path; if it starts with /etc, return permissionDecision: 'deny' inside hookSpecificOutput and include a systemMessage explaining that system directories are protected.
PostToolUse hook matched to Write that detects writes to /etc after they complete and triggers a rollback function to restore the original file, then logs the violation for operator review.
PreToolUse hook without a matcher field so it intercepts all tool calls, then inside the callback return permissionDecision: 'deny' for any file_path starting with /etc. Omit systemMessage to keep the response payload minimal.
PreToolUse hook with a Write matcher intercepts the operation before execution, making blocking deterministic rather than probabilistic. Returning permissionDecision: 'deny' inside hookSpecificOutput stops the tool call, while the top-level systemMessage field injects context into the conversation so the agent understands why the operation was blocked and does not retry it. Both output fields are required together: the deny decision prevents the write, and the system message breaks the retry loop by providing the agent with actionable context — addressing both failure modes at their source.
max_turns from 30 to 60 so the agent has sufficient headroom to complete the most complex debugging tasks it currently abandons before finishing.
subtype: "error_max_turns", capture the session_id from the ResultMessage and resubmit as a resumption with an additional turn budget, preserving the agent's intermediate progress rather than discarding it.
CLAUDE.md directing the agent to prioritize applying fixes before exhausting its turn budget, so fewer pipeline runs reach the limit and get discarded.
effort from "high" to "low" so the agent uses fewer turns per reasoning step, reducing the probability of hitting the turn limit before completing the fix on complex debugging tasks.
ResultMessage on error_max_turns exposes the session_id precisely to support resumption — the agent loop preserves intermediate context (files read, edits made, test results observed) inside the session. Resuming with the captured ID continues from where the loop stopped rather than restarting cold, which is the mechanism designed for this failure mode. This preserves the safety value of the turn limit (preventing runaway sessions) while recovering from legitimate mid-task interruptions.
disable-model-invocation: true and require operators to invoke the correct schema skill explicitly with /<name> before each extraction task.
disable-model-invocation: true makes a skill invisible to Claude until explicitly invoked — it contributes zero tokens to every request. For a pipeline where the operator knows which document type they're processing before the task begins, explicit invocation via /<name> is the natural control point. This addresses both the context waste (descriptions no longer load speculatively) and the accuracy problem (schema selection is no longer left to Claude's probabilistic matching against descriptions) at their source: the session-start loading of all ten descriptions.
disable-model-invocation: true in the skill's frontmatter so the skill only activates when the orchestrator explicitly invokes it via /gather-sources.
disable-model-invocation: true. This is a structural configuration mechanism that deterministically prevents Claude from auto-triggering the skill regardless of how relevant the description seems. A more specific description (A) reduces the frequency of false triggers but cannot eliminate them; Claude may still infer relevance in ambiguous contexts. The problem here is not that the description is too broad — it's that autonomous invocation is undesirable in any context, which is exactly the use case disable-model-invocation: true was designed for.
Specialized Advice.
max_tokens for the moderation call so Claude has more room to reason through ambiguous financial content before producing its classification decision, reducing the chance of a hasty or truncated judgment.
Investment_Advice and Market_Commentary, relying on Claude to infer the boundary between them from the subcategory names alone.
{"continue": false, "stopReason": "Build artifact missing"} so the user is notified and the teammate terminates cleanly.
PostToolUse hook instead, so each build tool invocation checks for dist/output.js and re-runs the build step if the artifact is absent.
TeammateIdle hook is the mechanism that redirects a teammate: the stderr message is delivered to the teammate as feedback and the teammate resumes working rather than going idle. This addresses the root cause — the teammate stops prematurely — at the exact lifecycle point where that decision is made, using the structural control the hook exposes. The other approaches either terminate the teammate (A), silently allow idle (B), or intervene at the wrong lifecycle event (D).
.claude/rules/ files under each subagent's working directory.
@path imports — one per subagent role — so that each subagent's instructions are organized in a separate file loaded at launch.
.claude/rules/ files load only when Claude works with files matching the path, so each subagent receives only the instructions relevant to its role — reducing context size and eliminating cross-contamination. This is the structural mechanism designed specifically for this problem: isolating role-specific rules to the working directory of the subagent that needs them.
read_write access, so a prompt injection in fetched web or third-party API content could write attacker-controlled material into the store, which later sessions then read as trusted memory.
read_write access by default. When agents process untrusted external inputs — web content, third-party API responses — a successful prompt injection in that content can cause the agent to write attacker-controlled material into the store. Because later sessions read the store as trusted memory, the injected content propagates downstream without further scrutiny. The fix is to attach reference or persisted-research stores as read_only for agents that don't need to modify them, and isolate write-capable stores to agents whose inputs are fully controlled. This is a structural access-control failure, not an instruction or capacity issue.
project by default, the accumulated knowledge is also shareable via version control, and improvement compounds automatically over time at the source where domain knowledge is generated.
system_warning messages in the conversation history, diluting attention on the extraction schema by the late-session documents.
<budget:token_budget> and <system_warning> updates on remaining capacity — solves exactly this failure mode. Without it, a model "competes without a clock": it has no signal that capacity is nearly exhausted and cannot adjust behavior (prioritizing, checkpointing, or surfacing a graceful handoff) before the window fills. The clustering of failures in late-session documents is the diagnostic signature of a model that doesn't know when it's running out of room. The fix is architectural: use a model with context awareness or design the agent harness to emit explicit token budget signals so the model can adapt before truncation occurs.
disable-model-invocation: true on skills that should only be triggered explicitly, and ensure the remaining model-invocable skills have distinct, non-overlapping descriptions so Claude can match them accurately.
disable-model-invocation: true removes side-effect or user-triggered skills from Claude's automatic matching entirely (zero description cost until invoked), and distinct non-overlapping descriptions allow the remaining model-invocable skills to be differentiated accurately. This addresses the root cause — attention noise from overlapping signals — rather than adding more context or shifting the burden to users.
/memory confirms a file loaded but Claude still violates an instruction, the documented root cause is how the instruction is written — not whether it loaded. Adherence drops when instructions are vague enough to interpret multiple ways or when the file has grown long enough that individual rules receive less attention. The fix is to improve instruction specificity and trim the file, addressing the actual failure mode rather than a loading symptom that has already been ruled out.
Specialized Advice category for any comment that initially passes, using a stricter system prompt focused exclusively on financial statements for that second pass.
max_tokens budget for each moderation call so Claude has more room to reason through borderline edge cases before returning a violation verdict on financial statements.
tool_choice: {"type": "tool", "name": "run_tests"} to force the first tool call, which is incompatible with extended thinking and silently corrupts the reasoning chain.
budget_tokens value is set below max_tokens, which prevents Claude from completing its reasoning about tool results before generating the summary.
max_tokens limit is being reached mid-response, truncating the tool_use block before Claude can close it, producing an unusable partial output that the pipeline treats as a success.
end_turn response as a success when the actual content is empty, masking a prompt construction error.
stop_reason is max_tokens, Claude's response generation was cut short by the token budget — it did not finish naturally. If the truncation occurs inside a tool_use block, the block is structurally incomplete and cannot be parsed or executed. The API call itself succeeds (HTTP 200, no error), so a pipeline that only checks for API-level errors will silently record a success while discarding the unusable partial output. The diagnostic signal (stop_reason: "max_tokens" + incomplete tool_use block) uniquely identifies this cause. The fix is to detect stop_reason: "max_tokens" explicitly and retry with an adequate token budget, not to ignore the stop reason.
<reasoning> tags before the <intent> tag, so downstream agents and human reviewers can audit why each query was routed and identify systematic misclassification patterns.
<reasoning> tags before committing to an <intent> forces it to externalize its classification logic, which serves two functions: it improves classification accuracy by making the model work through ambiguous cases rather than pattern-matching directly to a label, and it produces an auditable signal that lets you identify which query characteristics are driving the 18% misroute rate. This is chain-of-thought elicitation applied to a structured classification task — the reasoning is the diagnostic artifact that makes systematic prompt improvement possible.
progress.txt narrative, including the names of every test fixed, so Claude can scan the file to determine what has been resolved across sessions.
tests.json) tracking each test's id, name, and status alongside the existing freeform notes, applying each format to what it does best.
id, name, status) gives Claude a machine-readable schema it can query deterministically — checking status: "passing" rather than inferring completion from prose. Combining both formats applies each to the problem it's suited for: structured data for schema-driven state, unstructured text for narrative context.
maxTurns value to give each subagent more attempts to resolve the permission issue.
allowed_callers field is not supported in client-side execution mode, so any tool in the registry can be invoked by Claude without restriction, bypassing intended access controls.
max_tokens on the extraction request to ensure Claude has enough room to complete the full JSON parameter before the stream terminates, eliminating truncation at the source.
eager_input_streaming on the extract_claim_fields tool so the API buffers and validates the complete parameter before streaming, eliminating mid-stream truncation at the cost of higher latency.
max_tokens stop reason: wrap the accumulated partial JSON in a structured error object (e.g., {"INVALID_JSON": "<partial>"}) and return it to Claude as an error tool result so Claude can recover with a corrective pass.
input_json_delta event rather than waiting for content_block_stop, so partial fields can be extracted before truncation occurs.
max_tokens is reached — the stream may end mid-parameter. The documented pattern for this failure mode is to detect the max_tokens stop reason, wrap the incomplete fragment in a structured error object ({"INVALID_JSON": "..."}) so Claude can reason about what went wrong, and feed it back as an error tool result for a recovery pass. This addresses the root cause at the point of failure rather than silently skipping documents or removing the capability responsible for latency improvements.
settings.json because subagents don't always load project memory automatically; the rules must be embedded directly in the subagent file body.
matcher field is a JSON array instead of a single string, causing the hook definition to fail schema validation and be silently dropped — so it never fires regardless of which tool is called.
settings.json after the subagent was launched are not reflected until the session is restarted.
PreToolUse hook fires for the orchestrator agent only; subagents require a separate SubagentPreToolUse hook event to intercept tool calls at that level.
matcher field must be a single string using | to separate tool names — for example "WebFetch|Bash". When matcher is a JSON array, Claude Code treats it as a schema error: the hook definition is dropped entirely and will not appear in /hooks. The fix is a structural one (correcting the configuration syntax), not a prompt instruction or session restart. This is a silent failure mode: no error is surfaced at runtime, but the hook simply never fires.
max_tokens limit so they have more room to process the inherited tool definitions alongside the analysis tasks.
/compact on the parent session before spawning subagents so that the summarized conversation history passed to each subagent is shorter.
/mcp disable <name> before spawning subagents, so their tool definitions are not inherited by each subagent.
contract-processing Skill and confirm it achieves equivalent performance to each of the three individual Skills it replaces before deploying it to production.
max_tokens to ensure Claude has enough room to process all 400 tool definitions before selecting the appropriate one.
mcp__crm__write_record to an unscoped pattern so it fires for all tool calls, giving the agent more opportunities to intercept invalid data before a write is attempted.
systemMessage field to the hook's deny response that explains why the contract_id format is invalid and what a valid format looks like, so the agent receives actionable context and stops retrying.
permissionDecision: 'deny' with a post-processing step that silently drops invalid writes after they execute, so the agent receives a success response and does not attempt retries on the same record.
contract_id format validation out of the hook and into the system prompt, instructing Claude to validate all extracted fields before calling mcp__crm__write_record on any contract record.
systemMessage, the agent receives a bare rejection with no explanation of why the operation failed or what to do instead. This leaves the agent in an ambiguous state where retrying the same call is a plausible recovery strategy. Adding systemMessage injects structured context directly into the conversation — naming the violation and specifying the valid format — so the agent understands the constraint and can take corrective action rather than re-issuing the same blocked call. The permissionDecision and systemMessage fields are designed to work together: permissionDecision enforces the constraint deterministically, while systemMessage provides the guidance needed to avoid re-triggering it.
output_config.format parameter conflicts with PDF-type documents specifically; switching the contracts to plain-text document blocks will allow citations and structured outputs to coexist in the same request without a 400 error.
pause_turn, append the assistant's response to the message history, and issue a continuation request — gating the database write until a non-pause_turn stop reason is received.
max_tokens budget is being exhausted mid-enrichment. Increase max_tokens so Claude has sufficient room to complete the full sequence of lookups in a single response, preventing truncation before all fields are populated.
pause_turn stop reason is returned specifically when the server-side sampling loop reaches its iteration limit (default: 10) while executing server tools. The response at that point contains an incomplete server_tool_use block without a corresponding result. The correct fix is to detect this signal in the agent loop, append the assistant's partial response to the message history, and issue another API request — allowing Claude to resume from where it stopped. Writing to the database before handling pause_turn is what causes the missing fields; the fix is to gate the database write until a non-pause_turn stop reason is received.