Exam Results — CCA Practice Exam

— Your team has integrated Claude Code into a CI/CD pipeline that automatically processes pull requests by fetching... Show

A. The allowlist matching logic has a bug causing commands that share prefixes with approved commands to be incorrectly approved without manual review.

✓ B. Running with the -p flag disables trust verification, removing the safeguard that would otherwise require manual approval for unrecognized commands when Claude processes content from untrusted external sources.

C. Fetched repository content is being injected into the primary context window, allowing malicious prompt content from contributors to influence Claude's command suggestions.

D. The pipeline's network request approval setting was inadvertently disabled when the CI environment was provisioned, allowing external fetches to bypass the permission system.

Why B is correct: The -p flag enables non-interactive mode, and a documented side effect is that trust verification is disabled in this mode. Trust verification is the safeguard that challenges first-time or unrecognized operations — including unmatched commands that would otherwise default to requiring manual approval under fail-closed matching. By disabling trust verification, the pipeline loses the gate that prevents unrecognized commands from executing without human review, even though fail-closed matching is nominally in place. The root cause is not a misconfiguration of the allowlist but the architectural tradeoff of using -p in a context where untrusted content is being processed.

— Your team has deployed a Claude Code agent to automate large-scale refactoring tasks across a monorepo. To control... Show

A. The max_tokens value is too low relative to task_budget, causing the API to truncate responses before the agentic loop can complete. Raise max_tokens to match or exceed task_budget, since the two values must be kept in proportion to prevent early truncation.

✓ B. The task_budget is set far below what the refactoring tasks actually require, causing Claude to scope down or refuse tasks it cannot finish within the budget. The budget should be sized against the actual token-length distribution of these tasks.

C. Because task_budget is a hard cap, Claude is being terminated mid-execution by the API whenever the budget is exhausted. Add error-handling logic to catch these truncated responses and retry the failed requests with exponential backoff.

D. The refusal behavior indicates the system prompt is conflicting with the task budget signal. Rewrite the system prompt to explicitly instruct Claude to attempt all tasks regardless of budget constraints, overriding the self-regulation behavior that is blocking execution.

Why B is correct: A task_budget that is clearly insufficient for the requested work causes Claude to exhibit refusal-like behavior — declining to start, aggressively scoping down, or stopping early with partial results rather than beginning work it cannot finish. The absence of stop_reason: "max_tokens" in the logs confirms the API is not truncating responses; the problem is Claude's self-regulation in response to a budget it perceives as unworkable. The fix is to size the budget against the actual task-length distribution rather than a fixed conservative default.

— Your CI pipeline uses Claude Code to automatically triage test failures: for each failing test, an agent calls a set... Show

A. The agent's tool descriptions are too long, consuming input tokens that should be reserved for output; shorten the tool descriptions so the model has more room to generate complete responses and tool use blocks are no longer truncated mid-execution.

✓ B. The agent loop does not inspect stop_reason before treating a response as complete; when output is truncated mid-tool-use-block, the agent must detect stop_reason: "max_tokens" and treat that invocation as a hard failure requiring retry or escalation — not silently advance to the next item.

C. The pipeline is processing too many failures concurrently, causing resource contention that limits how many tokens each individual response can use; reducing batch size and serializing requests will give each invocation its full token budget and prevent truncation.

D. The diagnostic tools are returning excessively verbose output that fills the context window before Claude can emit a complete tool use block; summarizing tool outputs before passing them back to the agent will free up space for Claude to finish generating its response.

Why B is correct: When a response stops with stop_reason: "max_tokens" and contains an incomplete tool use block, the tool invocation did not complete — the agent received a partial, unusable result. The correct architectural response is to inspect stop_reason on every API response and treat max_tokens with an incomplete tool use block as an explicit failure signal, not a successful response. Silently advancing is a silent error suppression pattern: the pipeline hides the failure instead of surfacing it. The fix is structural — the agent loop must branch on stop_reason before processing tool call results, then retry or escalate to human review.

— Your structured data extraction pipeline uses Claude to process incoming legal documents. You've built 14... Show

A. Rewrite each Skill's name and description to be more distinctive, so Claude can more reliably differentiate between them during selection.

✓ B. Route each request to a reduced Skill set matched to the identified document category, keeping the number of simultaneously loaded Skills within reliable recall limits.

C. Increase the system prompt's detail about each Skill's purpose so Claude has richer context to distinguish between all 14 Skills simultaneously.

D. Consolidate all 14 document-type Skills into a single broad extraction Skill that handles all document types through branching logic in its implementation.

Why B is correct: Recall failure under a large simultaneous Skill load is an attention problem — each Skill's metadata competes for attention in the system prompt, and beyond a reliable threshold, Claude's ability to select the right Skill degrades. The fix is architectural: route requests to a smaller, task-appropriate Skill set rather than loading all Skills on every request. This keeps the number of competing Skills within the range where recall accuracy is reliable, addressing the root cause without adding noise or collapsing useful specialization.

— Your CI pipeline uses Claude Code to automate test generation and deployment validation. After a recent refactor,... Show

A. The Write and Edit tools are not supported in CI environments and must be replaced with MCP server equivalents using the mcp__<server>__<action> naming convention.

✓ B. A PreToolUse hook was added or modified with an empty or overly broad matcher, causing it to match all tool calls — including Write and Edit — and return permissionDecision: 'deny'.

C. The PostToolUse hook for Write is returning a deny decision after the tool runs, and the agent is rolling back file changes before they are persisted.

D. Claude Code requires explicit allow entries for each built-in tool in the ClaudeAgentOptions; without them, tools default to blocked after an agent configuration change.

Why B is correct: An empty matcher field in a PreToolUse hook matches every tool event — not just the intended target. If a hook added during the sprint omits the matcher field or sets it to a pattern that inadvertently matches Write and Edit (e.g., an empty string), it will intercept all tool calls and apply its deny decision universally. The documented guidance explicitly warns: "an empty matcher matches all tools." Adding logging to surface permissionDecisionReason and tightening the matcher to the specific tool name (e.g., 'Bash') is the correct diagnostic and fix.

— Your team is building a Python-based CI pipeline that uses the Claude Code agent SDK to run automated code reviews... Show

A. Register the initialization logic as a PreToolUse SDK callback with no matcher, so it fires before any tool call in the session and acts as a session-start proxy.

✓ B. Define the initialization logic as a SessionStart shell command hook in .claude/settings.json and include setting_sources=["project"] in the ClaudeAgentOptions so the SDK loads it.

C. Wrap the SDK client initialization in a try/except block that logs the failure and retries session creation, ensuring the callback eventually runs.

D. Migrate the CI pipeline from the Python SDK to the TypeScript SDK, which supports SessionStart as an SDK callback hook natively.

Why B is correct: SessionStart is not available as a Python SDK callback hook — the Python HookEvent type omits it. The correct fix uses the mechanism designed for this gap: define the hook as a shell command in .claude/settings.json and load that settings file via setting_sources=["project"] in ClaudeAgentOptions. This makes the shell command hook available to the Python SDK session without any code restructuring, and is the pattern Anthropic guidance prescribes for this exact capability gap.

— Your team has deployed a structured data extraction pipeline using Claude Code inside a Docker-based CI environment.... Show

✓ A. enableWeakerNestedSandbox reduces filesystem and network isolation, and allowing /var/run/docker.sock via allowUnixSockets effectively grants the agent host-level access — together these eliminate both the OS-level sandbox boundary and the container boundary simultaneously.

B. Running Claude Code inside Docker without privileged namespaces prevents filesystem writes, so the agent cannot persist extracted data — the fix is to grant broader filesystem write permissions to the output directories.

C. The network filtering proxy does not inspect traffic content, so any allowed domain could exfiltrate extracted data — the fix is to increase max_tokens so the agent batches more data per request, reducing the number of outbound connections.

D. enableWeakerNestedSandbox is only a risk on macOS; on Linux the sandbox implementation is strong enough that allowing /var/run/docker.sock poses no additional escalation risk in Docker environments.

Why A is correct: enableWeakerNestedSandbox is documented as considerably weakening security — it removes the strong OS-level isolation that is the sandbox's primary defense. Simultaneously, allowing /var/run/docker.sock through allowUnixSockets is explicitly called out as effectively granting access to the host system via the Docker socket. Together, these two choices eliminate independent layers of protection: the first removes the sandbox boundary, and the second grants a direct escalation path to the host. Recognizing that each configuration option weakens a distinct isolation layer — and that compounding them leaves no effective boundary — is the correct architectural judgment.

— Your team has configured a background subagent to run automated integration test suites concurrently while... Show

A. Add a system prompt instruction to the subagent telling it to request only the minimum necessary permissions and avoid tool calls it is uncertain about.

✓ B. Audit the subagent's full tool usage during a foreground run, then enumerate all required tools in the subagent's tools frontmatter field so they can be pre-approved at background launch time.

C. Switch the integration test subagent to run in the foreground so that unexpected permission prompts can be answered interactively, accepting the loss of concurrency.

D. Set CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 to eliminate background execution and require developers to manually trigger each test suite run.

Why B is correct: Background subagents must have all required tool permissions pre-approved at launch — once running, they auto-deny any tool call not in the pre-approved set, and clarifying-question tool calls (like AskUserQuestion) fail silently rather than blocking. The root cause is incomplete tool enumeration at definition time. Auditing the full tool surface during a foreground run and declaring every required tool in the subagent's tools frontmatter ensures the pre-approval prompt at launch covers all needed permissions, eliminating the silent mid-run abandonment without sacrificing concurrency.

— Your multi-agent research system dispatches specialized subagents — one for web search, one for document parsing,... Show

A. Add an instruction to the subagent's system prompt directing it to stop after processing a fixed number of documents and report results, preventing runaway loops without requiring changes to ClaudeAgentOptions.

✓ B. Set max_turns and max_budget_usd in each subagent's ClaudeAgentOptions, and handle error_max_turns and error_max_budget_usd result subtypes in the orchestrator to detect and resume or escalate stopped sessions.

C. Increase the orchestrator's external timeout threshold so it waits longer before killing sessions, reducing disruptive mid-task interruptions when the document-parsing subagent processes large PDF batches.

D. Replace the three specialized subagents with a single merged agent that handles all research tasks, simplifying session monitoring and eliminating the multi-agent coordination overhead that causes runaway behavior.

Why B is correct: The max_turns and max_budget_usd fields in ClaudeAgentOptions are structural safety limits enforced by the SDK itself — the agent loop terminates and emits a ResultMessage with a specific error subtype (error_max_turns or error_max_budget_usd) rather than continuing indefinitely. Handling these subtypes in the orchestrator converts a silent runaway into a detectable, recoverable event (using session_id for resumption or escalating to a human). This addresses the root cause — absence of per-session guardrails — at the configuration layer, not through probabilistic prompt instructions.

— Your CI pipeline uses an executor model with the advisor tool enabled for code change validation tasks.... Show

A. Add a system prompt instruction telling the executor to retry the advisor call if no response is received within a defined timeout window, ensuring advice is eventually received before the pipeline continues.

B. Restructure the executor system prompt so that the completion-check advisor call is made before writing files, giving the advisor a chance to approve the overall approach before any changes are persisted to disk.

✓ C. Restructure the executor system prompt so that deliverables are written, saved, and committed before the completion-check advisor call, ensuring durable artifacts exist before any session-ending latency occurs.

D. Increase the session timeout threshold in the CI environment so that long-running advisor inference passes have sufficient time to complete before the pipeline is terminated by the orchestration layer.

Why C is correct: The advisor tool documentation explicitly prescribes making deliverables durable — writing the file, saving the result, committing the change — before the completion-check advisor call. The principle is that the advisor call introduces latency during which the session may end; durable artifacts survive a session termination while in-memory or unwritten results do not. This is an ordering constraint in the orchestration protocol, not a timeout or approval issue.

— Your team has built a CI pipeline that invokes `claude -p` as a subprocess to auto-generate test scaffolding for... Show

A. The defer mechanism requires Claude Code v2.1.89 or later, and CI environments typically pin to an older version to ensure reproducibility, so the hook result is silently ignored and the tool proceeds through the normal permission flow.

✓ B. When Claude makes multiple tool calls in a single turn, defer is ignored with a warning and the tool proceeds through the normal permission flow, leaving AskUserQuestion undeferred and the headless session hanging as before.

C. The defer mechanism only works in interactive terminal sessions; in non-interactive -p mode the hook result is silently discarded, so the AskUserQuestion call executes immediately and the pipeline hangs waiting for terminal input.

D. The cleanupPeriodDays retention sweep will delete the session before the CI pipeline can resume it, because the default 30-day window does not protect sessions that have been idle for more than a few hours between CI stages.

Why B is correct: The defer mechanism has an explicit architectural constraint: it only works when Claude makes a single tool call in the turn. If Claude issues multiple parallel tool calls in the same turn, defer is ignored with a warning and the tool proceeds through the normal permission flow. Because the failing runs show AskUserQuestion batched alongside Read calls, returning "defer" from the hook will be silently bypassed, the AskUserQuestion will proceed to execution, and the headless session will hang as before. The fix must address this batching behavior — either by constraining Claude to sequential tool calls or by intercepting AskUserQuestion with a "deny" and injecting a default answer.

— Your structured data extraction pipeline processes 1,200 insurance claim forms per day. Each extraction job runs... Show

✓ A. Replace the cleanup instruction with explicit guidance distinguishing reversible actions (writing to staging, flagging records) from irreversible ones (deleting source files), and require a concrete integrity check as a prerequisite before any deletion proceeds.

B. Increase the pipeline's max_tokens budget so Claude has more room to reason through edge cases before deciding whether to delete source files, giving it space to weigh the risks of deletion against the efficiency gains of cleanup.

C. Add a post-processing review step where a second Claude instance audits the staging table after each batch, flags potential data loss, and attempts to restore source files if integrity issues are detected in the written records.

D. Rewrite the system prompt to instruct Claude to "be conservative and avoid risky actions," relying on its judgment to assess when cleanup conditions are met rather than enumerating specific irreversible operations or prerequisite checks.

Why A is correct: The principle is: for irreversible actions affecting shared systems, prompt guidance must explicitly classify the action as high-risk and define a concrete prerequisite gate — not rely on vague efficiency framing or model judgment. Naming deletion as a destructive operation and requiring an integrity check before proceeding mirrors the "reversibility and potential impact" framing from the docs: Claude is encouraged to take local, reversible actions freely but must pause before hard-to-reverse ones. This addresses the root cause — the prompt gave no signal that file deletion was categorically different from writing records — without requiring infrastructure changes.

— Your Claude Code CI pipeline runs automated PR reviews on every push to main. Each PR review session starts a fresh... Show

A. Trim CLAUDE.md to under 200 lines by deleting the database migration and release checklist instructions, keeping only the PR review workflow since CI sessions only perform reviews.

✓ B. Move the PR review, database migration, and release checklist instructions out of CLAUDE.md and into separate skills, so each workflow's instructions load on-demand only when that skill is invoked.

C. Increase the token budget allocated to each CI session so the base context overhead represents a smaller proportion of total usage.

D. Split the CI pipeline into separate agent team instances — one per workflow type — so each instance only loads the instructions relevant to its assigned task.

Why B is correct: Skills load on-demand only when invoked, whereas CLAUDE.md contents are loaded into every session's context at start regardless of relevance. Moving specialized workflow instructions into skills keeps the base context small across all session types — not just CI PR reviews. This addresses the root cause (all workflows' instructions loading unconditionally) rather than just the symptom visible in CI, and it scales cleanly as new workflows are added.

— Your structured data extraction pipeline processes thousands of insurance claim documents per day. The agent uses a... Show

A. Strengthen the system prompt by adding explicit examples of the allowed staging directory path and listing /etc as a prohibited destination, making the constraint clearer to the model on every request.

✓ B. Register a PreToolUse hook with a WriteFile matcher that inspects file_path, returns permissionDecision: "deny" when the path starts with /etc, and injects a systemMessage so the agent understands why the operation was blocked.

C. Add a PostToolUse hook that checks the written file path after each operation completes, then triggers a rollback script to restore /etc/claims-config/ whenever an unauthorized write is detected and committed.

D. Route all WriteFile calls through a second validation agent that re-evaluates the destination path against an allowlist policy before forwarding the request onward to the file system.

Why B is correct: A PreToolUse hook with permissionDecision: "deny" intercepts the tool call before it executes — the write never reaches the file system. Pairing the denial with a systemMessage injects context explaining why the operation was blocked, preventing retries. This is a structural, deterministic enforcement mechanism applied at the SDK level, outside the model's decision loop — unlike prompt instructions, which are probabilistic and already proven insufficient by the observed 3% failure rate.

— Your multi-agent research system ingests documents from third-party sources — including web scrapes, uploaded PDFs,... Show

A. Add explicit instructions to each subagent's system prompt warning it to ignore any directives embedded in document content and to never make unauthenticated external calls.

✓ B. Move credentials outside the subagent's trust boundary by routing all authenticated API calls through an external proxy that injects credentials — subagents request the proxy endpoint rather than holding secrets directly.

C. Add a post-processing validation agent that reviews each subagent's outputs and API call logs before results are propagated to the orchestrator, flagging suspicious patterns for human review.

D. Consolidate the specialized subagents into a single agent with broader context so that no credential handoff is needed between components, eliminating the injection surface at handoff points.

Why B is correct: Placing credentials outside the subagent's trust boundary is a structural security mechanism. Even if a prompt injection successfully hijacks a subagent's behavior, the attacker cannot exfiltrate credentials because the subagent never possessed them. The proxy injects the key at request time without exposing it to the agent's environment — limiting what a compromised subagent can do regardless of how it was compromised. This is defense in depth applied at the architecture level, not the instruction level.

— Your structured data extraction pipeline processes batches of 500 legal contracts per week across multiple agent... Show

A. Expand the system prompt to include a comprehensive summary of all known field patterns, validation rules, and schema refinements discovered to date, updating it manually before each new session starts.

✓ B. Bootstrap a structured memory file at session start that captures discovered field patterns, validation edge cases, and schema refinements, then instruct the agent to update this file at session end so each new session loads it as its first action.

C. Increase the context window allocation so the agent can hold all prior session outputs in active context simultaneously, eliminating the need for a separate external persistence mechanism entirely.

D. Require operators to paste a structured session summary into the first user turn of each new session so the agent can fully reconstruct the necessary extraction context before beginning work.

Why B is correct: The memory tool's just-in-time context retrieval pattern solves this problem structurally: by bootstrapping a memory file at session start and committing discoveries back to it at session end, each session begins with exactly the context it needs without operator intervention. This turns memory into a structured recovery mechanism — the core purpose of the multi-session software development pattern — keeping active context focused on what is currently relevant rather than bloated with accumulated history.

— Your multi-agent research system is investigating a complex policy topic using a lead agent and six research... Show

A. The lead agent lacks sufficient context about the research topic. Expanding the system prompt with more background will help it anticipate slow subagents and proactively redistribute work before stalls occur.

✓ B. The subagents' tasks are sized too large. Without self-contained deliverables and natural check-in points, the lead cannot detect stalls or reassign work. Tasks should be decomposed into smaller, clearly scoped units each producing a concrete deliverable.

C. The system is under-resourced. Spawning additional subagents to run redundant copies of the long-running tasks in parallel will reduce the probability that all copies stall simultaneously and restore throughput.

D. The lead agent's coordination loop is too slow. Increasing the lead's max_tokens budget will allow it to monitor more subagents per cycle, process status signals faster, and detect stalls earlier in the pipeline.

Why B is correct: When tasks are too large, subagents work for extended periods without producing intermediate deliverables, removing the lead's ability to gauge progress, identify stalls, and reassign work. Decomposing into self-contained units — each with a clear deliverable — creates natural check-in points that surface stalls early. The underlying principle: coordination effectiveness depends on task granularity; the lead can only rebalance work it can observe completing.

— Your Claude Code CI pipeline automatically runs tool-returned shell commands against a staging environment as part... Show

A. The CI pipeline's context window is too small, causing Claude to hallucinate shell commands to fill gaps in its reasoning about the build state.

✓ B. Tool results from the external API are passed to the execution environment without validation, allowing executable content embedded in the returned strings to run as commands.

C. Claude's system prompt lacks an explicit instruction prohibiting environment modifications, so Claude freely interprets tool output as authorized commands and acts on it.

D. The external API is returning results in an unsupported format, causing Claude to switch to a fallback behavior that executes arbitrary commands to resolve the ambiguity.

Why B is correct: Tool results are returned as strings and can contain any content — including executable commands — that may be processed by the execution environment. When a pipeline passes tool output directly to an execution layer without validation, any shell commands embedded in that string can run with the pipeline's permissions. The fix is to validate and sanitize external tool results before they are interpreted or executed, breaking the injection path at the boundary where untrusted external data enters the trusted execution environment.

— Your CI pipeline uses `claude_agent_sdk` to automatically diagnose and fix failing tests in a monorepo. The agent is... Show

A. Add "Bash(find *)" and "Bash(tree *)" as additional scoped tool permissions to speed up file traversal and reduce the number of turns consumed before the agent reaches the relevant files.

B. Handle error_max_turns in the result loop by resuming the session with an increased max_turns limit, and add a disallowed_tools list excluding broad commands like Bash to force the agent to use more targeted read-only traversal tools.

✓ C. On error_max_turns, capture the session_id from the ResultMessage and resume the session rather than silently completing, so the agent continues from where it stopped instead of discarding progress.

D. Switch the agent to effort="low" so it uses fewer turns on reasoning overhead, freeing up the remaining turn budget for actual file edits and test fixes.

Why C is correct: The corpus explicitly documents session resumption as the designed response to error_max_turns: save the session_id from ResultMessage and pass it back to query to continue from where the agent stopped. The current implementation is silently discarding the session on turn exhaustion — treating a resumable pause as a terminal failure. Resumption preserves accumulated context (files read, hypotheses formed, edits staged) rather than restarting the traversal from scratch. This addresses the root cause: the agent is running out of budget mid-investigation on long traversals, not producing wrong output or using the wrong tools.

— Your team has deployed a Claude Code Stop hook that automatically runs a suite of post-generation linting and test... Show

✓ A. The Stop hook is triggering Claude to continue working, and the hook script re-fires each time Claude finishes, because it does not check stop_hook_active and exit early when the field is true.

B. The Stop hook is running in parallel with a PostToolUse hook, and the two hooks are issuing conflicting instructions that prevent the session from terminating cleanly.

C. Claude is exceeding its context window during the linting phase, causing the session to restart automatically rather than halt, and the fix is to increase the token budget allocated to the Stop hook.

D. The shell command type is inappropriate for post-generation validation; switching the hook to type: "http" pointing at a local service will prevent the re-entry loop.

Why A is correct: The Stop hook fires each time Claude finishes responding. If the hook script triggers Claude to do additional work (e.g., re-run tests), Claude finishes that work and fires Stop again — re-entering the hook. The stop_hook_active field is specifically set to true when the Stop event is firing during a hook-triggered continuation. A hook that doesn't inspect this field and exit early will unconditionally trigger another continuation, producing an infinite loop. The fix is to parse stop_hook_active at the top of the script and exit 0 when it is true, allowing Claude to stop rather than re-enter.

— Your team ships a Claude Code plugin that bundles a custom MCP server for interacting with your CI pipeline. The... Show

A. The plugin's MCP server does not restart automatically after an update; engineers must manually restart their Claude Code sessions to reload the new server binary and pick up the gate-check script fix.

✓ B. Because "version": "1.4.0" is set in plugin.json, Claude Code uses that string as the cache key and skips the update when the string hasn't changed — regardless of new commits. The fix is to bump the version field on every release, or remove it so the git commit SHA is used as the cache key instead.

C. The internal git-hosted marketplace caches the old plugin archive on its server and must be manually cleared before /plugin update can detect new commits. The fix is to configure the marketplace server to disable HTTP caching headers so fresh archives are always served.

D. The version field in plugin.json takes precedence over the marketplace entry, so the marketplace's version was never consulted. The fix is to remove the version from plugin.json and set it only in marketplace.json, making the marketplace the single authoritative source for the version string.

Why B is correct: Claude Code uses the resolved plugin version string as the cache key for update checks. When version is set explicitly in plugin.json, that string is frozen — pushing new commits to the git source has no effect because Claude Code sees the same "1.4.0" string and concludes no update is available. The fix is either to bump the version field on every release (enforcing a deliberate release gate) or to remove the version field so the git commit SHA becomes the cache key, ensuring every new commit automatically surfaces as an update. This is the core versioning tradeoff: explicit versions provide stable release cycles; commit-SHA versions provide continuous delivery with no manual bookkeeping.

— Your multi-agent research system deploys subagents that autonomously write intermediate findings to disk. After a... Show

✓ A. Configure a PreToolUse hook matched to Write that checks tool_input.file_path; if it starts with /etc, return permissionDecision: 'deny' inside hookSpecificOutput and include a systemMessage explaining that system directories are protected.

B. Configure a PostToolUse hook matched to Write that detects writes to /etc after they complete and triggers a rollback function to restore the original file, then logs the violation for operator review.

C. Strengthen the system prompt for each research subagent with an explicit instruction: "Never write to system directories such as /etc. If a write is blocked, do not retry." This is the lowest-overhead solution and requires no hook infrastructure.

D. Configure a PreToolUse hook without a matcher field so it intercepts all tool calls, then inside the callback return permissionDecision: 'deny' for any file_path starting with /etc. Omit systemMessage to keep the response payload minimal.

Why A is correct: The PreToolUse hook with a Write matcher intercepts the operation before execution, making blocking deterministic rather than probabilistic. Returning permissionDecision: 'deny' inside hookSpecificOutput stops the tool call, while the top-level systemMessage field injects context into the conversation so the agent understands why the operation was blocked and does not retry it. Both output fields are required together: the deny decision prevents the write, and the system message breaks the retry loop by providing the agent with actionable context — addressing both failure modes at their source.

— Your team has deployed Claude Code to automate a nightly CI job that generates implementations for a suite of... Show

A. Add a post-generation review step where a second Claude instance inspects each solution for hardcoded values and literal test-input matches, flagging violations before the code is merged into the main branch.

B. Increase the token budget for the generation step so Claude has significantly more reasoning space to work through general algorithms and explore alternative implementation strategies before committing to a final solution.

✓ C. Embed explicit orchestration instructions — evaluated before code executes — that require general-purpose logic, mandate use of standard tools, prohibit hardcoded values and helper scripts, and direct Claude to report infeasible tasks rather than work around them.

D. Route the 30% of solutions that fail production validation to a dedicated human reviewer who can identify and rewrite the hardcoded sections, while allowing the passing solutions to continue through the pipeline automatically.

Why C is correct: The root cause is that the agent's default optimization target — passing tests — is misaligned with the actual goal of producing generalized solutions. The fix is to redirect the agent's behavior at the source, before generation, by providing explicit orchestration criteria: implement general logic, use standard tools directly, prohibit workarounds, and escalate rather than circumvent infeasible tasks. This is a structural intervention at the orchestration layer that shapes every generation pass, not a downstream review that reacts to problems after they occur. Explicit behavioral criteria applied prospectively are more reliable than probabilistic post-hoc review.

— A structured data extraction pipeline processes 150 regulatory filings per night, each averaging 35,000 tokens.... Show

✓ A. Reorder the prompt so the full filing text appears before the extraction instructions and field definitions, and require Claude to quote the relevant passages before populating each field.

B. Split each 35,000-token filing into overlapping 8,000-token chunks and run a separate extraction call per chunk, then merge the results downstream.

C. Add a stronger system prompt instruction emphasizing that Claude must read the entire document carefully before extracting any values, particularly those in the middle sections.

D. Increase max_tokens to give Claude more room to reason through the full filing before committing to extracted field values.

Why A is correct: Placing longform document content before instructions is a documented structural technique that can improve response quality by up to 30% on complex, multi-document inputs — the model attends better to content encountered during the reading phase before the query anchors its attention. Combining this with a quote-first step further improves accuracy: by requiring Claude to surface relevant passages before populating fields, the model is forced to locate and anchor on the exact text for each value rather than relying on diffuse recall across a 35,000-token document. Both changes address the root cause — attention drift over long inputs — at the prompt structure level.

— Your CI pipeline uses the Claude Agent SDK to automatically fix failing tests. The integration invokes `query()`... Show

A. Increase max_turns from 30 to 60 so the agent has sufficient headroom to complete the most complex debugging tasks it currently abandons before finishing.

✓ B. On receiving subtype: "error_max_turns", capture the session_id from the ResultMessage and resubmit as a resumption with an additional turn budget, preserving the agent's intermediate progress rather than discarding it.

C. Add a post-run prompt instruction in CLAUDE.md directing the agent to prioritize applying fixes before exhausting its turn budget, so fewer pipeline runs reach the limit and get discarded.

D. Switch effort from "high" to "low" so the agent uses fewer turns per reasoning step, reducing the probability of hitting the turn limit before completing the fix on complex debugging tasks.

Why B is correct: The ResultMessage on error_max_turns exposes the session_id precisely to support resumption — the agent loop preserves intermediate context (files read, edits made, test results observed) inside the session. Resuming with the captured ID continues from where the loop stopped rather than restarting cold, which is the mechanism designed for this failure mode. This preserves the safety value of the turn limit (preventing runaway sessions) while recovering from legitimate mid-task interruptions.

— Your structured data extraction pipeline uses Claude Code with a set of ten skills — each encoding a different... Show

A. Increase the context window budget allocated to each session so that loading all ten skill descriptions no longer risks crowding out the correct schema.

✓ B. Mark the nine less-frequently-used schema skills with disable-model-invocation: true and require operators to invoke the correct schema skill explicitly with /<name> before each extraction task.

C. Rewrite all ten skill descriptions to be longer and more detailed so Claude can more reliably distinguish between schemas and load the correct one automatically.

D. Consolidate the ten schema skills into a single skill that contains all schema definitions, so only one skill description loads at session start.

Why B is correct: disable-model-invocation: true makes a skill invisible to Claude until explicitly invoked — it contributes zero tokens to every request. For a pipeline where the operator knows which document type they're processing before the task begins, explicit invocation via /<name> is the natural control point. This addresses both the context waste (descriptions no longer load speculatively) and the accuracy problem (schema selection is no longer left to Claude's probabilistic matching against descriptions) at their source: the session-start loading of all ten descriptions.

— Your multi-agent research system uses Claude Code skills to automate specialized research workflows. Each subagent... Show

A. Rewrite the skill description to be more specific, narrowing the trigger conditions — for example, "Gathers and compiles peer-reviewed academic sources when the orchestrator explicitly requests a citation sweep for a named research topic."

✓ B. Set disable-model-invocation: true in the skill's frontmatter so the skill only activates when the orchestrator explicitly invokes it via /gather-sources.

C. Move the skill's instructions into each subagent's CLAUDE.md so Claude has the procedure in context at all times, eliminating autonomous triggering decisions.

D. Add a system prompt instruction to each subagent telling it not to invoke the gather-sources skill unless it detects an explicit citation intent in the orchestrator's message.

Why B is correct: When the requirement is that a skill should only activate on explicit invocation — never autonomously — the correct fix is disable-model-invocation: true. This is a structural configuration mechanism that deterministically prevents Claude from auto-triggering the skill regardless of how relevant the description seems. A more specific description (A) reduces the frequency of false triggers but cannot eliminate them; Claude may still infer relevance in ambiguous contexts. The problem here is not that the description is too broad — it's that autonomous invocation is undesirable in any context, which is exactly the use case disable-model-invocation: true was designed for.

— Your multi-agent research system includes a content moderation layer that screens research summaries before they are... Show

✓ A. Replace the flat prohibited-category list with a category-to-definition dictionary, pairing each category name with a precise description of what content it covers — for example, specifying exactly which financial claims constitute prohibited Specialized Advice.

B. Increase max_tokens for the moderation call so Claude has more room to reason through ambiguous financial content before producing its classification decision, reducing the chance of a hasty or truncated judgment.

C. Add a second moderation pass that re-evaluates every flagged item, instructing the second pass to "be more conservative and avoid false positives" to catch errors the first pass missed.

D. Expand the prohibited-category list by adding finer-grained subcategories such as Investment_Advice and Market_Commentary, relying on Claude to infer the boundary between them from the subcategory names alone.

Why A is correct: When a category name alone is ambiguous, Claude must infer what content it covers — producing inconsistent decisions at the boundary. Pairing each category name with a precise definition gives Claude an explicit classification criterion, eliminating the guesswork that produces both false positives (legitimate market observations flagged as advice) and false negatives (genuine prohibited advice that doesn't pattern-match the category name). This is the principle that replacing a bare category list with a category-to-definition dictionary directly controls: the definition becomes part of the prompt and narrows the decision boundary to exactly what the designer intended.

— Your team uses agent teams in Claude Code to parallelize a large code generation workflow. After several sprints,... Show

A. Return JSON {"continue": false, "stopReason": "Build artifact missing"} so the user is notified and the teammate terminates cleanly.

B. Emit a warning message to stdout and exit with code 0, allowing the teammate to proceed to idle but logging the issue for later review.

✓ C. Write the failure message to stderr and exit with code 2, which delivers the feedback to the teammate and causes it to continue working instead of going idle.

D. Register a PostToolUse hook instead, so each build tool invocation checks for dist/output.js and re-runs the build step if the artifact is absent.

Why C is correct: Exit code 2 from a TeammateIdle hook is the mechanism that redirects a teammate: the stderr message is delivered to the teammate as feedback and the teammate resumes working rather than going idle. This addresses the root cause — the teammate stops prematurely — at the exact lifecycle point where that decision is made, using the structural control the hook exposes. The other approaches either terminate the teammate (A), silently allow idle (B), or intervene at the wrong lifecycle event (D).

— Your CI pipeline uses Claude Code to generate automated code review summaries for pull requests. After three weeks... Show

A. Add a chain-of-thought instruction requiring Claude to cite the specific line number and code path that supports each behavioral claim before including it in the summary.

B. Switch to a best-of-N approach by running the same diff through the review agent three times and surfacing only claims that appear in at least two of the three outputs.

C. Increase the context window allocation so Claude can see the full file history alongside the diff, giving it more information to base behavioral claims on.

✓ D. Add an instruction explicitly restricting Claude to only describe changes visible in the provided diff and to flag any behavioral inference it cannot directly trace to a line in the diff.

Why D is correct: The root cause is that Claude is drawing on general programming knowledge to infer behavioral consequences that aren't demonstrated in the provided diff — a hallucination pattern where the model fills gaps with plausible but unverified claims. Explicitly restricting Claude to only describe what is directly observable in the provided diff (external knowledge restriction) targets this failure mode at its source: it prevents the model from generating claims it cannot ground in the input, rather than trying to verify or outvote bad outputs after they're generated.

— Your multi-agent research system synthesizes findings from 12 legal subagents, each summarizing a different... Show

A. Add a post-processing validation pass that uses sentiment analysis to detect and flag terminological inconsistencies before the report reaches legal reviewers for final sign-off.

✓ B. Define a shared controlled vocabulary and structural output template in each subagent's system prompt, requiring all summaries to use standardized terminology and a common section schema.

C. Increase the synthesis agent's context window so it can compare all 12 subagent outputs simultaneously and resolve any terminological inconsistencies during the final assembly step.

D. Instruct each subagent to flag terminology it is uncertain about, then route those flags to a human reviewer who normalizes the vocabulary across all 12 summaries before synthesis begins.

Why B is correct: The consistency criterion for multi-document summarization requires that all agents maintain a consistent structure and approach. The root cause here is that each subagent independently selects its own terminology — producing accurate but divergent output. Embedding a shared controlled vocabulary and structural template in each subagent's system prompt addresses the problem at its source: it enforces consistency upstream, at the point of generation, rather than correcting it downstream. This is the prompt engineering equivalent of establishing a common schema — preventing divergence rather than treating its symptoms.

— Your multi-agent research system uses Claude Code subagents to handle specialized research tasks: literature review,... Show

A. Increase the verbosity of each role-specific instruction block in the root CLAUDE.md, adding explicit examples and edge cases to make the intent unambiguous.

✓ B. Trim the root CLAUDE.md to shared cross-agent rules only, and move role-specific instructions into path-scoped .claude/rules/ files under each subagent's working directory.

C. Replace the single root CLAUDE.md with a set of @path imports — one per subagent role — so that each subagent's instructions are organized in a separate file loaded at launch.

D. Configure each subagent to use auto memory so that role-specific behaviors are reinforced through session learnings rather than CLAUDE.md instructions.

Why B is correct: The root cause is context dilution from an oversized CLAUDE.md (340 lines, well over the 200-line threshold where adherence degrades). Path-scoped .claude/rules/ files load only when Claude works with files matching the path, so each subagent receives only the instructions relevant to its role — reducing context size and eliminating cross-contamination. This is the structural mechanism designed specifically for this problem: isolating role-specific rules to the working directory of the subagent that needs them.

— Your CI pipeline uses Claude Code to auto-implement utility functions flagged by failing tests. Over the past two... Show

A. Add a system prompt instruction telling Claude to be more rigorous and to carefully double-check its implementations before marking them complete.

✓ B. Provide concrete test cases inline with each request — exact inputs paired with expected outputs — and instruct Claude to run them after implementing.

C. Increase the token budget for each implementation task so Claude has additional room to reason through edge cases before producing its final output.

D. Add a second Claude Code pass that reviews the first implementation and flags any logic it considers likely to fail in the test suite.

Why B is correct: Claude performs better when it can check its own work against concrete, verifiable criteria. Providing explicit test cases — exact inputs paired with expected outputs — gives Claude a deterministic feedback loop: it can run the tests, observe pass/fail signals, and iterate until the criteria are met. This converts an open-ended generation task into a constrained verification loop, addressing the root cause (no grounding signal) rather than hoping Claude self-corrects through more careful reasoning.

— Your multi-agent research system uses a shared memory store to persist intermediate findings across sessions.... Show

A. The memory store's per-file 100KB cap caused subagents to truncate source summaries, losing attribution data that the synthesis agent then reconstructed incorrectly as fabricated citations.

✓ B. The shared memory store was attached with the default read_write access, so a prompt injection in fetched web or third-party API content could write attacker-controlled material into the store, which later sessions then read as trusted memory.

C. The synthesis agent's session instructions did not explicitly direct it to cross-check memory store contents against original sources before incorporating them into reports, allowing distorted summaries to pass through unchallenged.

D. The memory store was attached to multiple concurrent research subagent sessions simultaneously, causing uncoordinated write conflicts between subagents that silently corrupted the stored summaries over time.

Why B is correct: Memory stores attach with read_write access by default. When agents process untrusted external inputs — web content, third-party API responses — a successful prompt injection in that content can cause the agent to write attacker-controlled material into the store. Because later sessions read the store as trusted memory, the injected content propagates downstream without further scrutiny. The fix is to attach reference or persisted-research stores as read_only for agents that don't need to modify them, and isolate write-capable stores to agents whose inputs are fully controlled. This is a structural access-control failure, not an instruction or capacity issue.

— Your CI pipeline runs Claude Code on every pull request to perform automated code review. After three weeks in... Show

A. Increase the token budget allocated to each CI run so that context-limit failures no longer interrupt mid-task execution and all approaches can complete.

B. Add a system prompt instruction telling Claude to commit to its first implementation approach and avoid exploratory rewrites before producing output.

✓ C. Invoke Claude in plan mode first, gate the pipeline on the proposed approach, then proceed to implementation — structurally separating exploration from code generation.

D. Switch the CI pipeline to spawn an agent team, assigning each candidate implementation approach to a separate teammate so that exploration is parallelized across context windows.

Why C is correct: Plan mode (Shift+Tab) separates exploration from implementation: Claude maps the codebase and proposes a single, approved approach before writing any code. In a CI context, this means the pipeline gates on the plan before proceeding, eliminating the pattern of committing partial code in multiple directions. The fix addresses the root cause — undirected exploration — at the structural workflow level rather than compensating for its effects downstream.

— Your multi-agent research system deploys five specialized subagents — Literature Review, Data Analysis, Citation... Show

A. Add a post-task prompt to the orchestrator instructing it to summarize each subagent's findings and append them to a shared research log at the end of every session, so all subagents can reference accumulated project knowledge on future runs.

✓ B. Embed memory-maintenance instructions in each subagent's markdown file, directing them to proactively record domain-specific patterns, recurring findings, and key decisions encountered during their work.

C. After each research session, manually review each subagent's output and write summary notes into a shared CLAUDE.md file at the project root so all subagents can reference the accumulated knowledge on future runs.

D. Increase the context window allocation for each subagent so they can retain more of their prior session outputs as in-context examples, giving them access to historical findings during future runs.

Why B is correct: Embedding memory-maintenance instructions directly in each subagent's markdown file makes memory updates a built-in behavior of each agent's system prompt — the subagent proactively maintains its own knowledge base without requiring orchestrator intervention or human action. This is the same mechanism that produced the compounding improvement in the Literature Review subagent. Because memory is scoped to project by default, the accumulated knowledge is also shareable via version control, and improvement compounds automatically over time at the source where domain knowledge is generated.

— Your structured data extraction pipeline processes a corpus of 50,000 legal contracts in long-running agent... Show

A. The extraction schema's required fields create token overhead that compounds across documents, eventually exceeding the per-call output limit.

✓ B. Sessions lacking context awareness cause Claude to continue processing as if full capacity remains, leading to abrupt truncation when the context window is exhausted near its boundary.

C. The legal documents in the final 20% are structurally more complex, requiring more tokens per extraction and causing the model to exceed its response budget.

D. The pipeline's tool call loop accumulates system_warning messages in the conversation history, diluting attention on the extraction schema by the late-session documents.

Why B is correct: Context awareness — the mechanism by which Claude Sonnet 4.5+ models receive explicit <budget:token_budget> and <system_warning> updates on remaining capacity — solves exactly this failure mode. Without it, a model "competes without a clock": it has no signal that capacity is nearly exhausted and cannot adjust behavior (prioritizing, checkpointing, or surfacing a graceful handoff) before the window fills. The clustering of failures in late-session documents is the diagnostic signature of a model that doesn't know when it's running out of room. The fix is architectural: use a model with context awareness or design the agent harness to emit explicit token budget signals so the model can adapt before truncation occurs.

— Your multi-agent research system uses a subagent whose system prompt contains proprietary scoring rubrics and... Show

✓ A. Implement output screening and post-processing on the subagent's responses to catch and strip leaked prompt fragments before they reach the synthesis agent.

B. Rewrite the subagent's system prompt with comprehensive leak-resistant techniques to make the internal instructions impossible to surface in outputs.

C. Increase the subagent's context window allocation so the system prompt is less salient relative to the total token budget and less likely to be reproduced.

D. Require the synthesis agent to strip any explanatory text from subagent outputs and accept only structured JSON fields, eliminating the channel through which leaks occur.

Why A is correct: The guidance explicitly recommends trying monitoring and output screening techniques first — before attempting leak-resistant prompt engineering — because leak-proofing system prompts adds complexity that can degrade the model's overall task performance. Output screening is a proportionate, non-destructive first step: it intercepts the leak at the point of exposure without altering the subagent's functioning. Rewriting the system prompt is a heavier intervention that should only be pursued if monitoring proves insufficient.

— Your team uses Claude Code across a large monorepo. A senior engineer added fifteen detailed skills covering... Show

A. Increase the CLAUDE.md context budget by consolidating all skill descriptions inline so Claude has complete information about each skill without needing to load them separately.

✓ B. Set disable-model-invocation: true on skills that should only be triggered explicitly, and ensure the remaining model-invocable skills have distinct, non-overlapping descriptions so Claude can match them accurately.

C. Move all fifteen skills into subagent configurations so their full content is preloaded at launch, giving each subagent complete context about every available skill.

D. Add a system prompt instruction in CLAUDE.md telling Claude to always ask the user which skill to apply before beginning any task that could match more than one skill description.

Why B is correct: Skills compete for Claude's attention through their descriptions, which load into every request. When descriptions overlap or are vague, Claude cannot reliably distinguish between skills — producing exactly the observed failure pattern of wrong-skill invocation and missed relevance. The fix operates at two levels: disable-model-invocation: true removes side-effect or user-triggered skills from Claude's automatic matching entirely (zero description cost until invoked), and distinct non-overlapping descriptions allow the remaining model-invocable skills to be differentiated accurately. This addresses the root cause — attention noise from overlapping signals — rather than adding more context or shifting the burden to users.

— Your CI pipeline uses Claude Code to auto-generate and commit test scaffolding for each pull request. Three weeks... Show

A. The CLAUDE.md file is loading too late in the session; move the path rules to the system prompt so they are present before any tool calls execute and the timing issue is resolved.

✓ B. The path rules are likely too vague or the file has grown large enough that individual rules receive less attention; rewrite them with explicit per-service directory mappings and trim unrelated content.

C. Add a permissions deny rule inside CLAUDE.md blocking writes outside each service's expected directory so Claude cannot physically commit to the wrong path under any circumstance.

D. The per-service subdirectory CLAUDE.md files are not loading at session start; consolidate all path rules into the root CLAUDE.md so every rule is guaranteed present before any tool call runs.

Why B is correct: When /memory confirms a file loaded but Claude still violates an instruction, the documented root cause is how the instruction is written — not whether it loaded. Adherence drops when instructions are vague enough to interpret multiple ways or when the file has grown long enough that individual rules receive less attention. The fix is to improve instruction specificity and trim the file, addressing the actual failure mode rather than a loading symptom that has already been ruled out.

— Your CI pipeline uses Claude to moderate user-submitted changelogs and release notes for policy violations before... Show

✓ A. Replace the flat category list with a dictionary mapping each category name to a detailed definition, and include both the name and its definition in the prompt so Claude evaluates against precise policy criteria rather than an ambiguous label.

B. Add a secondary Claude pass that re-evaluates only the Specialized Advice category for any comment that initially passes, using a stricter system prompt focused exclusively on financial statements for that second pass.

C. Increase the max_tokens budget for each moderation call so Claude has more room to reason through borderline edge cases before returning a violation verdict on financial statements.

D. Instruct Claude in the system prompt to "be more conservative when evaluating financial statements" and route any ambiguous cases flagged by that instruction to a human reviewer before publication.

Why A is correct: The root cause is that category names alone are ambiguous — "Specialized Advice" does not tell Claude which specific types of financial advice are prohibited. Replacing the flat list with a category-to-definition dictionary embeds precise policy criteria directly in the prompt, giving Claude deterministic classification signal for borderline cases. The false-negative pattern disappears because Claude now has explicit criteria to match against, not just a label to interpret.

— Your CI pipeline uses a Claude Code agent to triage incoming pull requests: it labels PRs by type (bug fix, feature,... Show

✓ A. Add a dedicated intent-classification step before the triage agent that identifies multi-type PRs and routes them to a specialized orchestration path with its own tools and system prompt.

B. Expand the triage agent's system prompt with detailed examples of mixed-type PRs and explicit instructions for assigning a primary label when multiple change types are present.

C. Increase the triage agent's context window allocation so it can analyze a larger portion of the diff before making its labeling and reviewer-assignment decision.

D. Require developers to manually specify the PR type in the pull request description template so the agent can skip classification and proceed directly to reviewer assignment.

Why A is correct: When a single generalist agent handles highly varied inputs — here, PRs spanning multiple change types — a dedicated classifier that routes to specialized paths (each with tailored tools and system prompts) addresses the root cause: the triage agent's single decision path cannot accurately handle multi-type inputs without structural differentiation. The tradeoff is an additional latency-adding call, which is acceptable in a non-synchronous CI pipeline.

— You've built a Claude Code–based automated code review agent that uses extended thinking with tool use. The agent... Show

A. The agent is using tool_choice: {"type": "tool", "name": "run_tests"} to force the first tool call, which is incompatible with extended thinking and silently corrupts the reasoning chain.

✓ B. The agent is stripping thinking blocks from previous assistant messages before sending them back to the API, breaking reasoning continuity across tool call turns.

C. The budget_tokens value is set below max_tokens, which prevents Claude from completing its reasoning about tool results before generating the summary.

D. The agent is not using interleaved thinking, so Claude only reasons once at the start and cannot incorporate intermediate tool results into its thinking when producing the summary.

Why B is correct: When using extended thinking with tool use, the complete, unmodified thinking blocks from each assistant turn must be passed back to the API in subsequent turns. Omitting them breaks reasoning continuity — Claude loses the chain of intermediate reasoning that informs how it interprets tool results. The symptom (misinterpreting test results, ignoring coverage gaps) is a direct consequence of Claude reconstructing its reasoning from scratch on each turn without access to its prior thinking context. This is not a prompt or configuration issue; it is a structural data-flow requirement for maintaining coherent multi-turn reasoning.

— Your Claude Code integration generates function implementations by issuing tool-use requests. Overnight batch jobs... Show

A. The tool schema is missing required fields, causing Claude to abandon the tool_use block before completion.

✓ B. The max_tokens limit is being reached mid-response, truncating the tool_use block before Claude can close it, producing an unusable partial output that the pipeline treats as a success.

C. The pipeline is misreading a valid end_turn response as a success when the actual content is empty, masking a prompt construction error.

D. Claude is returning a text block instead of a tool_use block because the system prompt and tool description are ambiguous, and the pipeline does not validate the content type before proceeding.

Why B is correct: When stop_reason is max_tokens, Claude's response generation was cut short by the token budget — it did not finish naturally. If the truncation occurs inside a tool_use block, the block is structurally incomplete and cannot be parsed or executed. The API call itself succeeds (HTTP 200, no error), so a pipeline that only checks for API-level errors will silently record a success while discarding the unusable partial output. The diagnostic signal (stop_reason: "max_tokens" + incomplete tool_use block) uniquely identifies this cause. The fix is to detect stop_reason: "max_tokens" explicitly and retry with an adequate token budget, not to ignore the stop reason.

— Your CI pipeline uses Claude Code to perform dependency vulnerability research across 40+ microservices. Each... Show

A. Increase the token budget for the session so Claude has enough context to retain CVE findings from early services all the way through to the final consolidated report.

✓ B. After each service scan, have Claude write structured research notes (CVE findings, confidence levels, service status) to a persisted file, then start a fresh context window that loads those notes before continuing.

C. Add a system prompt instruction telling Claude to explicitly prioritize retaining vulnerability findings from earlier services and to cross-reference all of them before writing the final consolidated security report.

D. Run a dedicated second Claude pass over the completed report to catch any missed vulnerabilities, supplying the original CVE database records as reference input for that review pass.

Why B is correct: Attention dilution across a long single-context session causes findings from early in the run to lose salience by the time the final report is generated — even though the information was present in context. The fix is to treat context as a scarce resource: persist structured state (CVE findings, confidence levels, service status) to an external file after each service scan, then continue with a fresh context window that loads only the persisted notes. This decouples information retention from context length and mirrors the established pattern for long-horizon tasks where Claude saves state and resumes across multiple context windows.

— Your multi-agent research system uses a classification subagent to route incoming research queries to specialized... Show

A. Increase the number of few-shot examples in the classification prompt to cover more query types, especially edge cases where literature review and synthesis overlap, so the model has more patterns to match against.

✓ B. Instruct the classification subagent to output its reasoning inside <reasoning> tags before the <intent> tag, so downstream agents and human reviewers can audit why each query was routed and identify systematic misclassification patterns.

C. Expand the system instructions to tell the subagent to "be more careful and precise when distinguishing between literature review and synthesis queries," reinforcing the importance of accuracy before it commits to a label.

D. Replace the single-label classification scheme with a confidence score for each category, routing each query to whichever agent receives the highest score, so ambiguous queries are handled more gracefully than with a hard label.

Why B is correct: Requiring the model to produce <reasoning> tags before committing to an <intent> forces it to externalize its classification logic, which serves two functions: it improves classification accuracy by making the model work through ambiguous cases rather than pattern-matching directly to a label, and it produces an auditable signal that lets you identify which query characteristics are driving the 18% misroute rate. This is chain-of-thought elicitation applied to a structured classification task — the reasoning is the diagnostic artifact that makes systematic prompt improvement possible.

— Your team is using Claude Code to automate a large-scale test remediation effort: fixing 200 failing tests across a... Show

A. Add more detail to the freeform progress.txt narrative, including the names of every test fixed, so Claude can scan the file to determine what has been resolved across sessions.

✓ B. Introduce a structured JSON file (e.g., tests.json) tracking each test's id, name, and status alongside the existing freeform notes, applying each format to what it does best.

C. Instruct Claude at the start of each session to re-run the full test suite and derive current status from scratch, discarding prior session notes entirely to avoid stale information.

D. Switch to a single longer session so Claude never loses context between handoffs, removing the need for any cross-session state tracking mechanism in the workflow.

Why B is correct: Freeform narrative is well-suited for tracking general progress and contextual notes, but it is a poor format for discrete, structured records like per-test status. A structured JSON file with explicit fields (id, name, status) gives Claude a machine-readable schema it can query deterministically — checking status: "passing" rather than inferring completion from prose. Combining both formats applies each to the problem it's suited for: structured data for schema-driven state, unstructured text for narrative context.

— Your structured data extraction pipeline processes 10,000 contracts per day using a main Claude Code session. For... Show

A. Forks share the parent's prompt cache, so the 8,000-token legal taxonomy would be re-encoded separately for each forked worker, eliminating any cost savings.

✓ B. Forks inherit the main session's model and cannot be configured to use a different model, so the cost-optimized model cannot be applied to forked workers.

C. Forks load the full conversation history into each worker's context, which inflates token usage as the main session grows longer throughout the day.

D. Forks run with the same tool permissions as the parent session, preventing the extraction workers from being restricted to read-only document access.

Why B is correct: A fork inherits the model field from the parent session — it cannot be overridden at spawn time. Named subagents, by contrast, specify the model in their definition file, allowing each worker to be routed to a cheaper model like Haiku. When the architectural goal requires a different model than the parent, forks cannot satisfy that requirement regardless of their caching advantage. This is the principle that the caching benefit of forks is only relevant when the fork's other inherited properties (model, system prompt, tools) are acceptable — if any inherited property conflicts with the task's requirements, a named subagent is the correct choice.

— Your structured data extraction pipeline uses background subagents to process contracts in parallel — each subagent... Show

A. Add a system prompt instruction telling the extraction subagents to skip any contract that requires external classification lookups and flag it for manual review instead.

✓ B. Enumerate the classification API tool in the background subagent's pre-approved tool set before launch, so the permission is granted upfront and the subagent can call it without interruption.

C. Switch all extraction subagents from background to foreground mode so they can surface interactive permission prompts when the classification API tool is needed mid-run.

D. Catch the failed subagent runs at the orchestrator level and automatically retry them with a higher maxTurns value to give each subagent more attempts to resolve the permission issue.

Why B is correct: Background subagents require all tool permissions to be granted upfront before launch — once running, they auto-deny any tool call not in the pre-approved set. Because the classification API was not pre-approved, those tool calls fail silently and the subagent continues without the data it needs, producing no output. The fix is to include the classification API tool in the subagent's declared tool set so permissions are resolved at launch time, not mid-run. This is a structural configuration fix: the permission boundary is defined in the subagent definition, not inferred at runtime.

— Your team has built a Claude Code–based assistant that helps developers analyze build artifacts. To keep the... Show

A. Because intermediate tool results are not loaded into Claude's context window, the model lacks visibility into failures and may silently retry operations, producing duplicate side effects on the local filesystem.

✓ B. Client-side direct execution runs Claude-generated code outside of a sandbox, meaning arbitrary code executes with the same privileges as the host application, and tool invocations can be vectors for code injection.

C. The allowed_callers field is not supported in client-side execution mode, so any tool in the registry can be invoked by Claude without restriction, bypassing intended access controls.

D. Running code execution locally forces synchronous tool calls, which prevents Claude from writing Python that aggregates results across multiple tools and increases context window pressure from repeated sampling.

Why B is correct: Client-side direct execution has two documented security properties: it executes untrusted code outside of a sandbox (meaning Claude-generated code runs with host-process privileges), and tool invocations are vectors for code injection. These are precisely the risks a security review would flag. The pattern trades security isolation for implementation simplicity — the correct architectural response is to move to a sandboxed execution model when the threat model requires it.

— Your structured data extraction pipeline processes 1,200 insurance claim documents per night using a... Show

A. Increase max_tokens on the extraction request to ensure Claude has enough room to complete the full JSON parameter before the stream terminates, eliminating truncation at the source.

B. Disable eager_input_streaming on the extract_claim_fields tool so the API buffers and validates the complete parameter before streaming, eliminating mid-stream truncation at the cost of higher latency.

✓ C. Add specific handling for the max_tokens stop reason: wrap the accumulated partial JSON in a structured error object (e.g., {"INVALID_JSON": "<partial>"}) and return it to Claude as an error tool result so Claude can recover with a corrective pass.

D. Switch the accumulation logic to parse the JSON incrementally on each input_json_delta event rather than waiting for content_block_stop, so partial fields can be extracted before truncation occurs.

Why C is correct: Fine-grained streaming explicitly does not guarantee a valid JSON string when max_tokens is reached — the stream may end mid-parameter. The documented pattern for this failure mode is to detect the max_tokens stop reason, wrap the incomplete fragment in a structured error object ({"INVALID_JSON": "..."}) so Claude can reason about what went wrong, and feed it back as an error tool result for a recovery pass. This addresses the root cause at the point of failure rather than silently skipping documents or removing the capability responsible for latency improvements.

— Your multi-agent research system uses Claude Code with `PreToolUse` hooks to enforce source-quality rules — blocking... Show

A. The subagent is not inheriting the project-level settings.json because subagents don't always load project memory automatically; the rules must be embedded directly in the subagent file body.

✓ B. The matcher field is a JSON array instead of a single string, causing the hook definition to fail schema validation and be silently dropped — so it never fires regardless of which tool is called.

C. The hook events are evaluated at session start, so changes made to settings.json after the subagent was launched are not reflected until the session is restarted.

D. The PreToolUse hook fires for the orchestrator agent only; subagents require a separate SubagentPreToolUse hook event to intercept tool calls at that level.

Why B is correct: The matcher field must be a single string using | to separate tool names — for example "WebFetch|Bash". When matcher is a JSON array, Claude Code treats it as a schema error: the hook definition is dropped entirely and will not appear in /hooks. The fix is a structural one (correcting the configuration syntax), not a prompt instruction or session restart. This is a silent failure mode: no error is surfaced at runtime, but the hook simply never fires.

— Your CI pipeline uses a multi-agent Claude Code setup where a parent orchestrator spawns subagents to run parallel... Show

A. Increase the subagents' token budget by setting a higher max_tokens limit so they have more room to process the inherited tool definitions alongside the analysis tasks.

B. Run /compact on the parent session before spawning subagents so that the summarized conversation history passed to each subagent is shorter.

✓ C. Disable the 7 unused MCP servers in the parent session with /mcp disable <name> before spawning subagents, so their tool definitions are not inherited by each subagent.

D. Move the orchestrator's CLAUDE.md instructions into path-scoped rules so that subagents load only the instructions relevant to their assigned service module.

Why C is correct: Subagents inherit every MCP tool definition from the parent session. With 9 MCP servers registered, each subagent's context window is pre-filled with tool definitions before any task input is added — and the error occurs on the first turn, confirming the problem is inherited overhead, not accumulated conversation. Disabling unused MCP servers removes that overhead at the source, which is where the context pressure originates.

— A team is building a structured data extraction pipeline to pull financial line items from 500 PDF invoices per... Show

A. Expand the Skill documentation to cover additional edge cases and invoice formats, ensuring Claude has comprehensive guidance before processing each batch.

✓ B. Establish a baseline without the Skill, build targeted evaluations from the specific failure types already observed in production, and write minimal instructions only to close those measured gaps.

C. Increase the model's context window allocation so that Claude can ingest the full text of each invoice alongside the extraction instructions without any truncation risk.

D. Switch to a multi-pass workflow where a second Claude call reviews every extracted result and flags low-confidence fields for human review before the batch completes.

Why B is correct: Evaluation-driven development addresses the root cause: the team is optimizing for imagined requirements rather than actual production failures. By establishing a baseline without the Skill, constructing evaluations from real observed failure modes, and writing minimal instructions only to close those specific gaps, each iteration is grounded in measured evidence. This converts the accuracy improvement cycle from speculative documentation into a testable, falsifiable feedback loop — which is exactly what the plateau exposes as missing.

— Your team has been building a structured data extraction platform for six months. Three separate Skills have been... Show

✓ A. Run evaluations against the consolidated contract-processing Skill and confirm it achieves equivalent performance to each of the three individual Skills it replaces before deploying it to production.

B. Audit the consolidated Skill's SKILL.md and all bundled scripts for adversarial instructions, hardcoded credentials, and unauthorized network calls before merging the three Skills.

C. Increase the token budget for the consolidated Skill to ensure it has sufficient context to handle the combined logic of all three individual Skills in a single pass.

D. Survey the extraction team to confirm that all three Skills are used together in every workflow before consolidating, so the merged Skill doesn't bundle unused capabilities.

Why A is correct: The principle of "consolidate later, when evals confirm equivalent performance" establishes that high individual evaluation scores do not guarantee a consolidated Skill will perform equivalently. Skills can interfere with each other when bundled — instructions may conflict, tool scope may broaden in unexpected ways, or the model's attention may be diluted across combined responsibilities. The merge decision must be gated on evaluation results for the consolidated Skill itself, not inferred from the components' individual scores. This is a structural quality gate, not a prompt or infrastructure change.

— Your CI pipeline uses Claude Code with a catalog of 400 tools covering linting, testing, security scanning,... Show

A. Increase max_tokens to ensure Claude has enough room to process all 400 tool definitions before selecting the appropriate one.

✓ B. Enable tool search so Claude dynamically discovers and loads only the tools relevant to each CI step rather than loading all 400 definitions upfront.

C. Add a system prompt instruction telling Claude to prefer specific, purpose-named tools over generic alternatives whenever both are available.

D. Consolidate the 400 tools into broader, multi-purpose tools so fewer definitions are loaded into context at once.

Why B is correct: Tool selection accuracy degrades with more than 30–50 tools loaded simultaneously. With 400 tools in context, Claude cannot reliably distinguish between closely related tools. Tool search addresses this at the architectural level — instead of presenting all 400 definitions upfront, the agent searches the catalog and loads only the tools needed for the current step, keeping the active set within the range where selection is reliable. This solves both the attention problem (fewer definitions to reason over) and the context efficiency problem (tool definitions can consume 10–20K tokens for just 50 tools).

— Your structured data extraction pipeline uses an MCP tool named `mcp__crm__write_record` to persist extracted... Show

A. Change the hook's matcher from mcp__crm__write_record to an unscoped pattern so it fires for all tool calls, giving the agent more opportunities to intercept invalid data before a write is attempted.

✓ B. Add a systemMessage field to the hook's deny response that explains why the contract_id format is invalid and what a valid format looks like, so the agent receives actionable context and stops retrying.

C. Replace permissionDecision: 'deny' with a post-processing step that silently drops invalid writes after they execute, so the agent receives a success response and does not attempt retries on the same record.

D. Move the contract_id format validation out of the hook and into the system prompt, instructing Claude to validate all extracted fields before calling mcp__crm__write_record on any contract record.

Why B is correct: When a hook denies a tool call without a systemMessage, the agent receives a bare rejection with no explanation of why the operation failed or what to do instead. This leaves the agent in an ambiguous state where retrying the same call is a plausible recovery strategy. Adding systemMessage injects structured context directly into the conversation — naming the violation and specifying the valid format — so the agent understands the constraint and can take corrective action rather than re-issuing the same blocked call. The permissionDecision and systemMessage fields are designed to work together: permissionDecision enforces the constraint deterministically, while systemMessage provides the guidance needed to avoid re-triggering it.

— Your team has built a Claude Code–based pipeline that extracts structured fields from legal contracts. Each session... Show

A. The risk-scoring subagent's CLAUDE.md memory file has grown too large, consuming most of the context window before any contract content loads — a problem that would not affect the earlier two subagents sharing the same session memory.

✓ B. Both MCP servers' tool definitions are inherited by all subagents, and the risk-scoring subagent — which also receives accumulated output from the prior two agents — exhausts its context window before completing.

C. The nightly batch of 200 contracts exceeds the Claude API's throughput capacity, causing buffered requests to inflate each subagent's context window until the "Prompt is too long" limit is hit.

D. Auto-compact is summarizing earlier turns too aggressively in the risk-scoring subagent, stripping clause and entity context it depends on and forcing the model to re-load that content, which overfills the window.

Why B is correct: Subagents inherit every MCP tool definition from the parent session, adding full tool schemas to each subagent's context at spawn time before any task content is loaded. The risk-scoring subagent is third in sequence, so it carries the same MCP overhead as the others plus the accumulated structured output passed from the two preceding agents. The earlier agents process smaller inputs and retain more headroom despite identical MCP overhead. The fix is to disable unneeded MCP servers before spawning subagents, or to scope which servers are active per phase. This is a structural context-budget problem, not a prompt instruction problem.

— Your team is building a contract analysis pipeline that must both verify claims against source documents (for... Show

✓ A. Citations interleave citation blocks with text output, which cannot conform to a strict JSON schema; the two requirements must be served by separate API calls — one citations-enabled call for auditor review, one structured-output call for the CRM.

B. The output_config.format parameter conflicts with PDF-type documents specifically; switching the contracts to plain-text document blocks will allow citations and structured outputs to coexist in the same request without a 400 error.

C. Citations are only incompatible with nested JSON schemas; flattening the CRM schema to a single-level object will eliminate the 400 error while still allowing citation blocks to appear in the response alongside the output.

D. The 400 error is triggered by exceeding a document block limit when citations are enabled; splitting the contracts across smaller batched requests will allow both citations and structured outputs to be used together successfully.

Why A is correct: Citations require interleaving citation blocks within text output so that each claim is paired with its source reference. This structure is fundamentally incompatible with strict JSON schema constraints — the API cannot simultaneously guarantee a fixed schema and inject citation blocks at arbitrary positions. The correct resolution is architectural separation: a citations-enabled call for auditor traceability and a separate structured-output call for the CRM. This is a design tradeoff, not a configuration error patchable within a single call.

— A structured data extraction pipeline uses server-side web_search tools to enrich contract records with public... Show

A. The server tools are timing out due to network latency. Add retry logic with exponential backoff around each individual web_search call to prevent early termination and ensure all enrichment lookups complete successfully.

✓ B. The sampling loop is hitting the 10-iteration limit before completing all lookups. Detect pause_turn, append the assistant's response to the message history, and issue a continuation request — gating the database write until a non-pause_turn stop reason is received.

C. The max_tokens budget is being exhausted mid-enrichment. Increase max_tokens so Claude has sufficient room to complete the full sequence of lookups in a single response, preventing truncation before all fields are populated.

D. Claude is silently dropping tool calls because the tool description is too vague. Rewrite the web_search tool description to explicitly enumerate each field the pipeline expects to extract, ensuring all required lookups are attempted before the response is returned.

Why B is correct: The pause_turn stop reason is returned specifically when the server-side sampling loop reaches its iteration limit (default: 10) while executing server tools. The response at that point contains an incomplete server_tool_use block without a corresponding result. The correct fix is to detect this signal in the agent loop, append the assistant's partial response to the message history, and issue another API request — allowing Claude to resume from where it stopped. Writing to the database before handling pause_turn is what causes the missing fields; the fix is to gate the database write until a non-pause_turn stop reason is received.

Performance by Domain

Review — 60 questions to revisit