
The Hidden Lifecycle of Claude Code: Background Activity When You're Not Typing

The leaked Claude Code source code reveals a fact that was previously hard to verify: Claude Code’s lifecycle extends far beyond the user-visible request-response loop. While your cursor blinks in the input box as you compose your next message, Claude Code’s backend is executing dozens of asynchronous tasks — speculative execution, memory extraction, document maintenance, context compaction, and more. Every moment you assume is idle is actually one of the system’s most computation-dense work periods.

Why This Article Is Relevant to You

These background mechanisms are not clever innovations unique to Claude Code. They represent a set of universal patterns that the AI agent industry is converging on, and understanding them has direct value for building any agent system.

The design philosophy of our own context infrastructure closely aligns with Claude Code’s background activity system. Claude Code’s memory extraction, automatic compaction, and prompt cache optimization transfer seamlessly to our workflows because the underlying problem being solved is the same: how to maintain coherent cognitive state for an agent across multiple interactions.

OpenClaw’s heartbeat auto-distillation mechanism is another instantiation of the same pattern. When Claude Code’s auto-dream consolidates session memories every 24 hours and OpenClaw’s heartbeat distills context at a fixed cadence, two systems that evolved independently have arrived at nearly identical cognitive maintenance rhythms, a sign that such designs are becoming industry standard practice.

The tension between automation and controllability that we discussed in our Claude Dispatch vs OpenClaw analysis directly echoes the observations in this article: the more background activity, the smarter the system, but the harder it becomes for users to understand what the system is doing. This trade-off runs through the design decisions behind every mechanism discussed below.

This article starts from the source code and examines four background mechanisms with the greatest engineering depth, unpacking their design decisions and implementation details.

Prompt Cache: The Pervasive Engineering Constraint

Before diving into specific mechanisms, there is a cross-cutting engineering principle that needs to be stated upfront: maintaining the prompt cache is a hard constraint across all of Claude Code’s background activity, not an optimization point for any single feature. The Manus team’s context engineering practice summary also reported similar findings: in production environments, the input-to-output token ratio is approximately 100:1, and prompt cache hit efficiency directly determines the cost and speed of agent systems.

Every background agent in Claude Code runs in forked agent mode, and the first design principle of a forked agent is that it must share exactly the same cache key parameters as the parent process (system prompt, tools, model, messages prefix, thinking config). Any parameter deviation invalidates the cache, at the cost of a full cold-start API call. The source code documents a real lesson: PR #18143 attempted to set effort:'low' on a fork, which caused the cache hit rate to plummet from 92.7% to 61% and cache writes to spike 45x. Only four overrides are safe: abortController (not sent to the API), skipTranscript (purely client-side), skipCacheWrite (controls cache_control markers without affecting the cache key), and canUseTool (client-side permission check).
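The cache-key discipline can be made concrete with a toy model. The types and helper below (CacheKeyParams, cacheKeyOf) are illustrative, not from the leaked source; they only show why client-side overrides are safe while parameter overrides bust the cache:

```typescript
// Hypothetical sketch: the cache key is derived ONLY from fields the API sees.
interface CacheKeyParams {
  systemPrompt: string
  tools: string[]
  model: string
  thinking: { effort: string }
}

function cacheKeyOf(p: CacheKeyParams): string {
  return JSON.stringify([p.systemPrompt, p.tools, p.model, p.thinking])
}

const parent: CacheKeyParams = {
  systemPrompt: 'You are Claude Code',
  tools: ['Read', 'Edit', 'Bash'],
  model: 'claude-x',
  thinking: { effort: 'high' },
}

// Client-side overrides (abortController, skipTranscript, ...) never touch
// CacheKeyParams, so the fork's key matches the parent's.
const safeFork = { ...parent }

// Overriding a key parameter (the PR #18143 mistake) produces a new key,
// forcing a cold start on every fork call.
const unsafeFork = { ...parent, thinking: { effort: 'low' } }

console.log(cacheKeyOf(safeFork) === cacheKeyOf(parent))   // true: cache hit
console.log(cacheKeyOf(unsafeFork) === cacheKeyOf(parent)) // false: cache bust
```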

This constraint manifests in every background mechanism. The speculative execution forked agent shares the prompt cache. The memory extraction forked agent shares the prompt cache. The auto-dream forked agent shares the prompt cache. The session memory forked agent shares the prompt cache. All background agents are designed under the prompt cache constraint — parameters cannot be touched, models cannot be swapped, thinking config cannot be altered. This makes background activity extremely low-cost (the vast majority of tokens hit the cache), while also explaining why the behavioral space of these forks is constrained so narrowly.

I. Speculative Execution: Predicting Your Next Move, Then Actually Doing It

Speculative execution is the most aggressive design in the entire background activity system. (Anthropic internal only, USER_TYPE === 'ant'.) Its logic chain has three steps: predict the command the user is about to type, hand the predicted command to a forked agent for actual execution, and isolate the execution artifacts in a copy-on-write overlay filesystem. If the user ultimately accepts the prediction, the overlay merges into the main filesystem immediately and the response returns almost instantly.

Prediction Phase

Prediction occurs after every model response completes. promptSuggestion.ts is called fire-and-forget in the stop hooks, forking a child agent that uses exactly the same prompt cache parameters as the parent process to generate predictions. (Prompt Suggestion itself is available in the public version, controlled by the GrowthBook tengu_chomp_inflection experiment flag. Speculative execution is a further behavior built on top of prompt suggestion, restricted to internal users only.)

const result = await runForkedAgent({
  promptMessages: [createUserMessage({ content: prompt })],
  cacheSafeParams, // Don't override tools/thinking settings - busts cache
  canUseTool,
  querySource: 'prompt_suggestion',
  forkLabel: 'prompt_suggestion',
  overrides: { abortController },
  skipTranscript: true,
  skipCacheWrite: true,
})

Predicted content passes through a series of filters: length is limited to 2-12 words, evaluative statements (thanks, looks good) are filtered out, Claude-style expressions (Let me…, I’ll…) are removed, as are multi-sentence outputs and formatting markers. The goal of filtering is to retain only short commands the user themselves might actually type.
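The filter chain described above might look roughly like the following; the exact predicates and regexes in promptSuggestion.ts are not shown in the excerpt, so treat this as a sketch:

```typescript
// Illustrative reimplementation of the described suggestion filters.
function keepSuggestion(text: string): boolean {
  const t = text.trim()
  const words = t.split(/\s+/)
  if (words.length < 2 || words.length > 12) return false           // length gate: 2-12 words
  if (/^(thanks|looks good|great|perfect)\b/i.test(t)) return false // evaluative statements
  if (/^(let me|i'll|i will)\b/i.test(t)) return false              // Claude-style expressions
  if (t.split(/[.!?]/).filter(s => s.trim()).length > 1) return false // multi-sentence outputs
  if (/[*_`#]/.test(t)) return false                                // formatting markers
  return true
}

console.log(keepSuggestion('fix the failing test'))   // true
console.log(keepSuggestion("I'll update the README")) // false
console.log(keepSuggestion('thanks'))                 // false
```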

Execution Phase

Once a prediction passes filtering, speculation.ts immediately launches actual speculative execution. The key design is the copy-on-write overlay:

// Copy-on-write: copy original to overlay if not yet there
if (!writtenPathsRef.current.has(rel)) {
  const overlayFile = join(overlayPath, rel)
  await mkdir(dirname(overlayFile), { recursive: true })
  try {
    await copyFile(join(cwd, rel), overlayFile)
  } catch {
    // Original may not exist (new file creation) - that's fine
  }
  writtenPathsRef.current.add(rel)
}
input = { ...input, [pathKey]: join(overlayPath, rel) }

The overlay path is located at ~/.claude/tmp/speculation/<pid>/<uuid>/. When the forked agent needs to write a file, the system first copies the original file to the overlay directory, then redirects the write operation to the copy in the overlay. Read operations perform a reverse check: if the target file has been modified in the overlay, it reads from the overlay; otherwise it reads directly from the main filesystem. This achieves a fully isolated speculative execution environment.

Speculative execution has explicit safety boundaries. Only three tools — Edit, Write, and NotebookEdit — are allowed to write (and writes are redirected to the overlay). Read-only tools like Read, Glob, and Grep are allowed through directly. Bash commands are only allowed if they pass read-only validation. When encountering file edits that require user confirmation (permission mode below acceptEdits), speculation immediately pauses and records a boundary. A maximum of 20 conversation turns and 100 messages is enforced.
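A minimal sketch of this gating logic, with hypothetical names (the real dispatch in speculation.ts is certainly more involved):

```typescript
// Sketch of the described safety boundaries for speculative tool calls.
type Verdict = 'redirect-to-overlay' | 'allow' | 'pause'

const WRITE_TOOLS = new Set(['Edit', 'Write', 'NotebookEdit'])
const READ_TOOLS = new Set(['Read', 'Glob', 'Grep'])

function gateSpeculativeTool(
  tool: string,
  opts: { bashIsReadOnly?: boolean; needsUserConfirmation?: boolean } = {},
): Verdict {
  if (opts.needsUserConfirmation) return 'pause' // record a boundary, stop here
  if (WRITE_TOOLS.has(tool)) return 'redirect-to-overlay'
  if (READ_TOOLS.has(tool)) return 'allow'
  if (tool === 'Bash') return opts.bashIsReadOnly ? 'allow' : 'pause'
  return 'pause' // anything unrecognized halts speculation
}

console.log(gateSpeculativeTool('Write'))                           // redirect-to-overlay
console.log(gateSpeculativeTool('Bash', { bashIsReadOnly: false })) // pause
```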

Acceptance and Pipelining

When the user presses Tab to accept a prediction, acceptSpeculation copies files from the overlay back to the main filesystem one by one and injects the messages produced during speculation into the official conversation stream. If speculation has completed (boundary type is complete), the entire response is presented instantly. If speculation paused midway due to hitting a safety boundary, the system truncates to the last user message and initiates a follow-up query to let the model continue from the breakpoint.

Even more impressive is pipelining. When the first round of speculation completes, the system immediately starts a second round of prediction in the gap while waiting for the user to accept:

// Pipeline: generate the next suggestion while we wait for the user to accept
void generatePipelinedSuggestion(
  contextRef.current,
  suggestionText,
  messagesRef.current,
  setAppState,
  abortController,
)

If the user accepts the first round of prediction, the system checks whether a pipelined suggestion is already available. If so, it promotes it to the new prediction and immediately launches the corresponding speculative execution. This forms a pre-computation chain: the instant the first step finishes predicting, the second step is already underway.
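The promotion step on accept can be sketched as a tiny state transition; the state shape here is an assumption:

```typescript
// Hypothetical state shape for the speculation pipeline.
interface SpeculationState {
  current: string | null   // prediction currently being speculated on
  pipelined: string | null // next-round prediction, generated in parallel
}

// On Tab-accept: the current round merges into the main conversation, then
// the pipelined suggestion (if any) is promoted so the next speculative
// execution launches immediately.
function onAccept(state: SpeculationState): SpeculationState {
  return { current: state.pipelined, pipelined: null }
}

let state: SpeculationState = { current: 'run the tests', pipelined: 'fix lint errors' }
state = onAccept(state)
console.log(state.current) // 'fix lint errors'
```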

In theory, if the user accepts multiple predictions in succession, the response time for each approaches zero, because all computation happens during the user’s thinking gaps. This moves beyond autocomplete into the territory of agentic, autonomous workflows.

II. Auto-Dream: Memory Consolidation After 24 Hours Without Interaction

The design inspiration for Auto-Dream is clear: just as humans consolidate daytime memories during sleep, Claude Code consolidates multi-session context during periods of no interaction. (Available in the public version, controlled by the GrowthBook tengu_onyx_plover experiment flag. The user setting autoDreamEnabled can override remote configuration.)

The entry point in the source code is autoDream.ts, with trigger conditions following a three-level gating mechanism (gate order: cheapest first):

1. Time gate: hours since lastConsolidatedAt >= minHours (a single stat() call)
2. Session gate: count of transcripts with mtime > lastConsolidatedAt >= minSessions
3. Lock gate: no other process is mid-consolidation

The default parameters are 24 hours and 5 sessions. This means dream is only triggered when more than 24 hours have passed since the last consolidation and at least 5 session transcripts have accumulated in the interim. These parameters are remotely controlled by the GrowthBook feature flag tengu_onyx_plover and can be adjusted online without requiring a release.
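The three-level gate, cheapest check first, can be sketched as follows; the field and parameter names are illustrative:

```typescript
// Hypothetical stats shape for the auto-dream gate.
interface DreamStats {
  lastConsolidatedAt: number // epoch ms of the last consolidation
  transcriptMtimes: number[] // mtimes of session transcript files
  lockHeld: boolean          // is another process mid-consolidation?
}

function shouldDream(
  s: DreamStats,
  now: number,
  minHours = 24,
  minSessions = 5,
): boolean {
  // 1. Time gate: cheapest, one stat
  if (now - s.lastConsolidatedAt < minHours * 3600_000) return false
  // 2. Session gate: count transcripts newer than the last consolidation
  const fresh = s.transcriptMtimes.filter(m => m > s.lastConsolidatedAt).length
  if (fresh < minSessions) return false
  // 3. Lock gate: skip if another process is already consolidating
  return !s.lockHeld
}

const day = 24 * 3600_000
console.log(shouldDream(
  { lastConsolidatedAt: 0, transcriptMtimes: [1, 2, 3, 4, 5], lockHeld: false },
  2 * day,
)) // true: 48h elapsed, 5 fresh sessions, no lock
```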

Once triggered, the system forks a child agent and gives it a carefully designed consolidation prompt. The prompt divides the consolidation process into four phases: Orient (read existing memory files to understand current state), Gather (search transcripts for new signals), Consolidate (merge new information into memory files), and Prune and Index (update the index, clean up stale content).

Regarding transcript search, the prompt explicitly instructs the child agent to use grep for narrow searches:

grep -rn "<narrow term>" ${transcriptDir}/ --include="*.jsonl" | tail -50

The reason is that transcript files (JSONL) can be very large, and reading them in full would consume massive amounts of tokens. The prompt’s guiding philosophy is “Don’t exhaustively read transcripts. Look only for things you already suspect matter.”

The child agent’s permissions are strictly limited: Bash only allows read-only commands (ls, find, grep, cat, stat, wc, head, tail), and Edit and Write can only operate on files within the memory directory. This is implemented through the createAutoMemCanUseTool function, with extractMemories and autoDream sharing the same permission logic.

After consolidation completes, if the child agent modified any memory files, the system inserts an “Improved N memories” system message into the main conversation stream to inform the user that memory updates occurred in the background. If the child agent fails, the system rolls back the lock file’s mtime so that the next time gate check will pass again, achieving automatic retry. There is a 10-minute scan throttle between retries to avoid repeated scanning when the session gate is not met.
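The mtime-rollback retry trick can be demonstrated directly with Node's fs APIs; the path layout below is made up for the demo:

```typescript
import { mkdtempSync, statSync, utimesSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Sketch of the retry mechanism: on failure, roll the lock file's mtime
// back so the next time-gate check passes again.
function rollBackLock(lockPath: string, hours: number): void {
  const { atime } = statSync(lockPath)
  const past = new Date(Date.now() - hours * 3600_000)
  utimesSync(lockPath, atime, past) // mtime is what the time gate reads
}

const lock = join(mkdtempSync(join(tmpdir(), 'dream-')), 'consolidate.lock')
writeFileSync(lock, '')
rollBackLock(lock, 25) // pretend the last consolidation was 25h ago
console.log(Date.now() - statSync(lock).mtimeMs > 24 * 3600_000) // true: gate will pass
```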

A notable engineering trade-off in this design: dream is always triggered during gaps in normal user interaction with Claude Code (checked on each stop-hook execution), rather than driven by an independent timer. This means that if the user hasn’t used Claude Code for 48 consecutive hours, dream won’t execute until the next interaction begins. This design prioritizes resource efficiency at the cost of potential consolidation delay.

III. Magic Docs: Auto-Maintained Documents That Get Tracked After a Single Read

The trigger mechanism for Magic Docs is remarkably elegant. (Anthropic internal only, USER_TYPE === 'ant'.) If the first line of any Markdown file matches the pattern # MAGIC DOC: <title>, it is automatically registered as a document requiring continuous maintenance. Registration occurs in the FileReadTool’s listener callback:

registerFileReadListener((filePath: string, content: string) => {
  const result = detectMagicDocHeader(content)
  if (result) {
    registerMagicDoc(filePath)
  }
})
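detectMagicDocHeader itself is not shown in the excerpt; a minimal implementation consistent with the described first-line pattern might be:

```typescript
// Sketch: match `# MAGIC DOC: <title>` on the first line only.
// The real implementation may differ.
function detectMagicDocHeader(content: string): { title: string } | null {
  const firstLine = content.split('\n', 1)[0]
  const m = /^# MAGIC DOC: (.+)$/.exec(firstLine)
  return m ? { title: m[1].trim() } : null
}

console.log(detectMagicDocHeader('# MAGIC DOC: Build System\nbody')) // { title: 'Build System' }
console.log(detectMagicDocHeader('# Build System\nbody'))            // null
```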

In other words, the moment you read the file once, it gets tracked. From then on, every time the model finishes a response and the last assistant message in that turn contains no tool calls (indicating the conversation is at a natural idle point), the Magic Docs post-sampling hook updates all tracked documents one by one.

The update process uses the Sonnet model (not the Opus used in the main conversation), running as a constrained child agent with only Edit tool permissions, restricted to editing the corresponding Magic Doc file. The update prompt’s philosophy is worth quoting:

DOCUMENTATION PHILOSOPHY - READ CAREFULLY:
- BE TERSE. High signal only. No filler words or unnecessary elaboration.
- Documentation is for OVERVIEWS, ARCHITECTURE, and ENTRY POINTS - not detailed code walkthroughs
- Do NOT duplicate information that's already obvious from reading the source code
- Focus on: WHY things exist, HOW components connect, WHERE to start reading, WHAT patterns are used

Additionally, if a line of italic text immediately follows the Magic Doc header, it is parsed as document-specific instructions and passed to the update child agent with higher priority than the general rules. This means document authors can embed control over AI update behavior directly within the file.
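Parsing that optional instruction line could look like the following; the italic-detection details are an assumption, since the source only says a line of italic text immediately follows the header:

```typescript
// Sketch: extract a per-document instruction from an italic line
// (*...* or _..._) immediately after the magic doc header. Hypothetical.
function parseMagicDocInstructions(content: string): string | null {
  const lines = content.split('\n')
  if (!/^# MAGIC DOC: /.test(lines[0] ?? '')) return null
  const second = (lines[1] ?? '').trim()
  const m = /^(?:\*(.+)\*|_(.+)_)$/.exec(second)
  return m ? (m[1] ?? m[2]).trim() : null
}

console.log(parseMagicDocInstructions(
  '# MAGIC DOC: Build System\n*Keep the dependency graph section first.*\n',
)) // 'Keep the dependency graph section first.'
```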

The prompt also has a key constraint: “Keep the document CURRENT with the latest state of the codebase. This is NOT a changelog or history.” The update child agent is explicitly required to modify information in place and delete outdated content, rather than appending historical records. This ensures Magic Docs always reflect the current state of the codebase rather than devolving into unmaintained changelogs.

IV. Extract Memories: Persistent Memory Extraction After Every Conversation Turn

Extract Memories is the core write path for Claude Code’s memory system. (Available in the public version, build flag EXTRACT_MEMORIES is compiled into the public release, controlled at runtime by isExtractModeActive() and the auto-memory toggle isAutoMemoryEnabled().) At the end of each query loop (when the model produces a final response with no tool calls), handleStopHooks calls executeExtractMemories in a fire-and-forget manner:

if (feature('EXTRACT_MEMORIES') && !toolUseContext.agentId && isExtractModeActive()) {
  void extractMemoriesModule!.executeExtractMemories(
    stopHookContext,
    toolUseContext.appendSystemMessage,
  )
}

The extraction agent runs in forked agent mode, sharing the parent process’s full prompt cache. It only examines recently added messages (tracked via a cursor UUID that marks where the last processing left off), identifies information worth persisting, and writes it to the ~/.claude/projects/<path>/memory/ directory.
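Cursor-based incremental processing can be sketched as a simple slice; the message shape here is illustrative:

```typescript
// Sketch: only messages after the last-processed UUID are examined.
interface Message { uuid: string; content: string }

function messagesSinceCursor(messages: Message[], cursorUuid: string | null): Message[] {
  if (cursorUuid === null) return messages // first run: everything is new
  const idx = messages.findIndex(m => m.uuid === cursorUuid)
  return idx === -1 ? messages : messages.slice(idx + 1)
}

const log: Message[] = [
  { uuid: 'a', content: 'already processed' },
  { uuid: 'b', content: 'already processed' },
  { uuid: 'c', content: 'new this turn' },
]
console.log(messagesSinceCursor(log, 'b').map(m => m.uuid)) // ['c']
```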

There is an elegant mutual exclusion design here. If the main agent has already written memory files during the conversation (user explicitly asked to “remember this”), the extraction agent skips this round:

if (hasMemoryWritesSince(messages, lastMemoryMessageUuid)) {
  logForDebugging(
    '[extractMemories] skipping — conversation already wrote to memory files',
  )
  // ...advance cursor past this range
  return
}

The main agent and the background extraction agent are mutually exclusive for the same conversation segment. This avoids duplicate writes and prevents two agents from producing conflicting memories about the same conversation.

Extraction frequency is controlled along two dimensions: a token threshold and a tool call count. Extraction is triggered only when both conditions are met. Additionally, the tengu_bramble_lintel feature flag controls turn intervals, allowing extraction frequency to be throttled further.
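The dual-threshold trigger reduces to a conjunction; the threshold values below are placeholders, since the article does not state the defaults:

```typescript
// Sketch of the dual-threshold gate: extraction fires only when BOTH
// the token threshold and the tool-call threshold are met.
function shouldExtract(
  newTokens: number,
  newToolCalls: number,
  minTokens: number,
  minToolCalls: number,
): boolean {
  return newTokens >= minTokens && newToolCalls >= minToolCalls
}

console.log(shouldExtract(5000, 3, 2000, 2)) // true: both thresholds met
console.log(shouldExtract(5000, 0, 2000, 2)) // false: no tool calls yet
```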

The extraction agent’s prompt design emphasizes efficiency. It is limited to completing its work within 5 turns, with the recommended strategy being: read all files that need updating in parallel during the first turn, then write all modifications in parallel during the second turn. The prompt explicitly forbids exploratory behaviors like “reading code to verify whether a memory is correct”:

You MUST only use content from the last ~N messages to update your persistent memories. 
Do not waste any turns attempting to investigate or verify that content further — 
no grepping source files, no reading code to confirm a pattern exists, no git commands.

In non-interactive mode (-p mode or SDK), print.ts explicitly waits for in-flight extraction to complete after flushing the response before performing graceful shutdown, ensuring memory extraction is not truncated by process exit. This is implemented through drainPendingExtraction with a 60-second timeout.
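A bounded drain in the spirit of drainPendingExtraction is essentially a Promise.race against a timer; the name and return convention here are assumptions:

```typescript
// Sketch: wait for the in-flight extraction, but never longer than the
// timeout, so graceful shutdown is bounded.
function drainWithTimeout<T>(pending: Promise<T>, ms: number): Promise<T | 'timed-out'> {
  let timer!: ReturnType<typeof setTimeout>
  const timeout = new Promise<'timed-out'>(resolve => {
    timer = setTimeout(() => resolve('timed-out'), ms)
  })
  // Clear the timer either way so the process can exit promptly.
  return Promise.race([pending, timeout]).finally(() => clearTimeout(timer))
}

drainWithTimeout(Promise.resolve('extraction flushed'), 60_000)
  .then(r => console.log(r)) // 'extraction flushed'
```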

The Complete Background Activity Map

The four mechanisms above are the most engineering-complex background activities. Beyond them, Claude Code runs a series of auxiliary background tasks.

Design Philosophy: Idle Time Is Compute

Viewing these mechanisms together, Claude Code’s design philosophy is clear: the user’s thinking gaps are the most valuable compute resource. In the traditional REPL model, the gap between user input and AI response is pure waiting. Claude Code transforms this gap into a dense background scheduling window, running multiple parallel pipelines for speculative execution, memory consolidation, document maintenance, context management, and more.

Every pipeline follows the same engineering constraints: forked agent mode ensures prompt cache sharing with the parent process (at the cost of zero parameter deviation allowed), fire-and-forget invocation ensures background tasks don’t block user interaction, feature flag control ensures any new mechanism can be canary-released and tuned online, and permission sandboxing ensures background agents can only operate on resources within their scope of responsibility.

The central scheduling hub for all background activity is handleStopHooks, executed at the end of each query loop. The call order within this function represents the priority of background activities: first save cache params (preparing the cache for future forks), then launch prompt suggestion, extract memories, and auto-dream in parallel. These fire-and-forget calls run on Node’s event loop. When the user’s next input arrives, background tasks still in progress are either cancelled via abort controller (like speculation) or continue to run silently in the background until completion (like extract memories).
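The dispatch pattern described above can be sketched as a synchronous launch loop over fire-and-forget hooks; the wiring is illustrative:

```typescript
// Sketch: hooks launch in priority order, but are never awaited, so the
// stop hook returns immediately and never blocks user interaction.
type Hook = { name: string; run: () => Promise<void> }

function handleStopHooks(hooks: Hook[], launched: string[]): void {
  for (const h of hooks) {
    launched.push(h.name) // launch order reflects priority
    void h.run()          // fire-and-forget on the event loop
  }
}

const launched: string[] = []
handleStopHooks(
  [
    { name: 'saveCacheParams', run: async () => {} },
    { name: 'promptSuggestion', run: async () => {} },
    { name: 'extractMemories', run: async () => {} },
    { name: 'autoDream', run: async () => {} },
  ],
  launched,
)
console.log(launched) // ['saveCacheParams', 'promptSuggestion', 'extractMemories', 'autoDream']
```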

From an engineering practice perspective, Claude Code is already a daemon with an autonomous lifecycle, not a REPL waiting for input. This shift from passive response to proactive computation may represent a direction in the evolution of AI-assisted programming tools: when collaborating with humans, AI systems should leverage every available idle window to pre-compute, organize, and optimize, so that when the human is ready, the AI’s response is already on its way.