
Agent Recovery After Upstream Merges

Pillar: post-merge-agent-recovery | Date: May 2026
Scope: What an agent should do when a lock it was waiting on releases and upstream main has changed: automated git rebase feasibility and failure modes, full restart vs partial context reuse, incremental context refresh (replaying only the delta), divergence severity classification (semantic vs syntactic conflict), cost models for rebase vs restart, and tooling for automated conflict resolution in rebase.
Sources: 34 gathered, consolidated, synthesized.

Executive Summary

Critical finding: Proactive semantic drift detection before rebase attempts achieves 100% workflow completion versus only 25.1% for reactive detection that waits for rebase failure — a 4× gap driven entirely by whether the agent learns its context is stale before or after it tries to act on it.[18]

The scale of the underlying problem is larger than tooling discussions typically acknowledge. Across production multi-agent deployments, 25.9% of all agent interactions generate semantic conflicts requiring resolution — making conflict handling routine operational state, not an edge case.[18] Overall multi-agent failure rates range from 41% to 86.7%, with 79% of those failures attributable to specification and coordination breakdowns rather than model capability limits.[16] Semantic Intent Divergence (SID) — where agents develop incompatible interpretations of shared objectives before a single line of code is written — averages 32 drift events per agent interaction in software development contexts.[18] Recovery strategies that address only git-level conflicts miss the larger class of pre-code semantic failures entirely.

Divergence severity determines whether automated resolution is feasible, and the data here is more granular than the conventional "syntactic vs. semantic" framing. The ConGra benchmark finds 87.9% of individual conflict hunks are derivable automatically, but only 34.5% of complete file merges succeed without human intervention.[34] The gap is compound hunk interactions: each chunk may be individually resolvable, but their combined effect on a file's semantics exceeds what automated tools can verify. This makes hunk-level submission the correct granularity for AI-assisted resolution — agents attempting whole-file automated resolution will fail on two out of three complete merges. The three-level conflict taxonomy provides the decision tree: Level 1 syntactic conflicts resolve at ~82% via standard git tools;[21] Level 2 semantic conflicts (logically correct but incompatible changes that bypass all line-diff detection) achieve 64–68% with LLM-based tools;[5] Level 3 structural/architectural conflicts require human judgment with no current automated tooling adequate.

The comparison between proactive and reactive drift detection is the most consequential finding in this research, with quantified outcomes across three conditions. Ungoverned multi-agent systems — no drift detection at all — achieve a 0.2% workflow completion rate. Reactive detection (a judge-agent that catches failure on rebase) raises completion to 25.1% with 93.2% conflict detection precision but only 66.3% recall. Proactive monitoring via Semantic Alignment Score (SAS) — detecting drift before the agent acts — achieves 100% completion at lower precision (27.9%) but the precision trade-off is irrelevant: false alarms that trigger unnecessary re-synchronization are far cheaper than failed completion.[18] The implication is architectural: recovery decision logic must run before the agent attempts rebase, not in response to rebase failure. An agent that detects likely stale context and pauses to assess can choose its recovery path; one that learns of staleness only on failure must abort from an unknown intermediate state.

Token cost models reveal why restart is often cheaper than it appears. Context accumulates at O(N²) because each API call re-processes the entire conversation history: a 20-step loop generating 1,000 tokens per step produces 210,000 cumulative input tokens, not 20,000.[30] The break-even point for rebase vs. restart is quantified: rebase with minor upstream changes costs roughly 5–10K additional tokens; major upstream changes cost 20–30K tokens; full restart costs 30–50K tokens depending on task complexity. Restart becomes cheaper when upstream changes touch more than 40–60% of the agent's working files, because the agent must re-read most of its context anyway.[30] The coordinator-specialist architecture crossover point sits at approximately 5–6 steps of remaining work: below that threshold, restart and rebase costs are comparable; above it, accumulated stale context compounds at O(N²) making restart the economically correct choice.[17] A production API format change incident illustrates the failure mode concretely: undetected upstream changes caused an agent to loop at 200× baseline token rate, burning ~$50 in 40 minutes before detection.[17]
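
The O(N²) accumulation and the break-even rule can be checked in a few lines. The first figure reproduces the worked example from the text; `cheaper_to_restart` is an illustrative sketch that represents the cited 40–60% overlap band by its midpoint, not a calibrated model.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Each API call re-processes the entire history, so input cost is
    (1 + 2 + ... + N) * tokens_per_step: quadratic in step count."""
    return tokens_per_step * steps * (steps + 1) // 2

# 20-step loop at 1,000 tokens/step: 210,000 cumulative input tokens,
# not 20,000.
print(cumulative_input_tokens(20, 1_000))  # → 210000

def cheaper_to_restart(overlap_ratio: float, threshold: float = 0.5) -> bool:
    """Illustrative rule: restart once upstream changes touch more than
    ~40-60% of the agent's working files (midpoint 0.5 assumed here)."""
    return overlap_ratio > threshold
```

A 20-step session therefore pays for its history roughly ten times over, which is why the restart threshold sits lower than intuition suggests.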

Incremental context refresh — refreshing only files in the delta that the agent previously read — reduces context rebuild cost by 90%+ for narrow-scope agents relative to full restart.[31] The procedure is straightforward: compute git diff <agent-branch-base>..origin/main --name-only, intersect with the agent's read file set, re-read only the intersection, skip the rest. Files the agent read that are unchanged in main remain valid. The Git Context Controller (GCC) framework formalizes this approach by treating agent memory as a version-controlled artifact with COMMIT, BRANCH, MERGE, and CONTEXT operations. GCC-equipped agents achieved 48% task resolution on SWE-Bench-Lite, outperforming 26 competitive baseline systems.[8] The key mechanism is the CONTEXT operation's multi-level retrieval, which allows a recovering agent to query its own prior reasoning commits and identify which reasoning steps are still valid versus invalidated by upstream changes — the equivalent of git blame for agent context.

Automated conflict resolution tooling spans a wide performance range, with context width as the primary determinant of quality — not model size. Raw conflict markers fed directly to an LLM produce single-digit resolution rates on real-world benchmark conflicts. Providing broader context — surrounding code, call sites, related files — raises success to approximately 50%.[3] Research-grade tools cluster around 64–68% accuracy: MergeBERT achieves 63–68% on token-level classification; CoReRL achieves ~64% via reinforcement learning without labeled training data; Gmerge reaches 64.6% after 10 trials.[21] git rerere provides institutional memory for previously seen conflicts — resolved patterns are stored in .git/rr-cache and automatically replayed (resolved conflicts retained for 60 days, unresolved for 15 days) — but only helps with conflicts the system has previously resolved, leaving novel upstream conflicts requiring LLM or human intervention.[4] Critically, reasoning-style models (o1, R1) underperform on conflict resolution despite their general strength; Claude and DeepSeek V3 lead in merde.ai benchmarks. The correct failure mode when AI resolution fails is loud rejection — producing no resolution rather than a silent corrupt one.[11]

Infrastructure failures are harder to recover from than semantic conflicts, and deserve equal architectural attention. Git is not designed for concurrent operations: simultaneous git worktree add commands race on .git/config.lock; a process crash leaves a stale .git/index.lock that blocks all subsequent git operations until manually removed.[19] Most AI agents do not recover gracefully from lock errors — some abort and continue generating code without committing, producing orphaned branches. The fix is serialization: all git operations across parallel agents must be sequenced through an internal mutex or queue. Parallel worktree cleanup is specifically unsafe and can remove .git/ entirely during concurrent cleanup of multiple worktrees.[19] The practical ceiling for parallel agents on a modern workstation is 5–7 concurrent agents before rate limits, disk consumption, and merge review overhead cancel throughput gains.[6]

Production architectures that have shipped multi-agent workflows converge on one structural insight: conflicts are not failures to be minimized but expected operational events to be processed efficiently. Overstory's architecture explicitly assumes a ~25% baseline conflict rate and routes every conflict through a four-tier escalation hierarchy — mechanical git rebase, AI-assisted triage, monitor agent fleet patrol, human escalation — with the monitor agent empowered to trigger context refresh or full agent restart based on output quality degradation signals.[20] The Addy Osmani framework specifies MAX_ITERATIONS=8 as an outer retry limit, with forced reassignment to a fresh agent after 3+ iterations stuck on the same error, and a mandatory reflection prompt before each retry to prevent identical failed loops.[25] Agents should pause autonomously when token consumption reaches 85% of budget — approaching context limits is a reliable leading indicator that restart will be cheaper than continuation.[25]

Practitioners building multi-agent systems should treat post-merge recovery as a first-class design concern rather than an error handler. The actionable sequence: adopt worktree isolation as the baseline architecture (no contamination, visible conflicts at merge time); instrument Semantic Alignment Score monitoring before any agent acts on an acquired lock, not in response to rebase failure; implement hunk-level rather than file-level automated conflict resolution; set hard retry limits (8 outer, 3 inner) with forced reflection before escalation; apply the 40–60% file overlap threshold to decide between rebase and restart; and enable git rerere for recurring conflict patterns while maintaining LLM-based resolution with broad context retrieval for novel conflicts. The 100% vs. 25.1% completion differential between proactive and reactive detection is achievable with current tooling — the gap is architectural discipline, not missing infrastructure.



Table of Contents

  1. Context Invalidation and the Recovery Trigger
  2. Automated Rebase: Feasibility and Failure Modes
  3. Divergence Severity Classification
  4. Full Restart vs. Partial Context Reuse
  5. Incremental Context Refresh: Replaying Only the Delta
  6. Cost Models: Rebase vs. Restart
  7. Tooling for Automated Conflict Resolution
  8. Production Architectures and Recovery Patterns

Section 1: Context Invalidation and the Recovery Trigger

Each AI agent operates within its own context window with no automatic visibility into changes made by concurrent agents. When a lock releases and upstream main has advanced, the agent waking up faces three distinct problems: (1) its internal context may reflect a codebase state that no longer exists, (2) its planned actions may conflict with changes already merged, and (3) the divergence may be invisible until a rebase attempt fails. Production multi-agent deployments report failure rates between 41% and 86.7%, with 79% of failures attributable to specification and coordination issues rather than model capability limitations.[16] Critically, 25.9% of agent interactions across production deployments generate semantic conflicts requiring resolution — making conflict handling a routine operational concern, not an edge case.[18]

Context Rot vs. Context Staleness

Context rot describes quality degradation as an agent's session window fills with accumulated history — agents contradict earlier decisions, reintroduce rejected patterns, and re-ask previously answered questions.[14][23] Post-merge staleness is a distinct but compound problem: the agent's context is not merely degraded by volume but actively incorrect — it reflects codebase state that upstream has superseded. The critical insight: the issue is not too much context, but too much stale or irrelevant context. Targeted removal or refresh of the stale portion is more efficient than full restart — analogous to a focused git rebase rather than restarting a feature branch from scratch.[14] Larger context windows do not solve the underlying signal-to-noise problem: "larger windows just delay the onset rather than preventing the degradation."[23] A typical refactoring task consumes 20,000–40,000 tokens — an agent operating on stale context for an equivalent task may spend that budget compounding invalid reasoning.[23]

Context Contamination: Uncommitted Writes

Context contamination is distinct from context staleness. Contamination occurs when agent B reads agent A's uncommitted in-progress files from a shared working directory, ingesting a half-finished state that leads to inconsistent, confusing context and bad decisions.[2][6] Git worktrees prevent contamination by making each agent's in-progress changes invisible to all other agents until explicitly merged. Post-merge recovery, by contrast, addresses staleness: the agent's context correctly reflected committed state at some earlier point, but upstream has since advanced and superseded it. The two failure modes require different interventions — worktree isolation eliminates contamination at the architecture level; the recovery strategies in this section address staleness after lock release.

Semantic Intent Divergence (SID): Pre-Code Conflict

Beyond git-level conflicts, multi-agent LLM systems experience Semantic Intent Divergence (SID) — agents develop inconsistent interpretations of shared objectives without any mechanism to detect the inconsistency. SID occurs before any code is written; agents can produce incompatible implementations even when no git-level conflict exists.[16][18] The software development domain averages approximately 32 drift events per agent interaction, measured by the Semantic Alignment Score (SAS) framework.[18] (See Section 3 for the full three-type taxonomy with post-merge recovery implications per type.)

Worktree Isolation: The Standard Pre-Condition for Recovery

Git worktrees are the established architectural pre-condition enabling post-merge recovery: each worktree has its own working directory and branch while reading from a shared object store, making concurrent changes invisible until explicitly merged and surfacing conflicts at merge time where standard tooling can detect them.[6] Anthropic's Claude Code documentation recommends worktrees for multi-session workflows; Cursor built its Parallel Agents feature directly on worktrees; Claude Code v2.1.50 added first-class CLI support for creating, managing, and cleaning up worktrees.[6][29] The productive ceiling for parallel agents on a modern laptop is approximately 5–7 concurrent agents before rate limits, disk consumption, and merge review overhead cancel throughput gains.[6]

Ecosystem gap: No current tooling provides real-time alerts when agents approach the same code regions during execution — multi-agent coding conflicts go undetected until patches merge, because concurrent in-memory writes to the same working directory fall outside git's conflict-detection mechanism entirely.[6][29] This is precisely why proactive SAS monitoring (Section 8) represents a significant architectural advance over waiting for rebase failure — without real-time cross-worktree change alerts, reactive detection is structurally blind until merge time.

Key finding: The combination of SID (pre-code divergence) and git-level conflicts means recovery triggers span both semantic and structural dimensions — architectures must handle both to achieve reliable completion rates.[16][18]

See also: Git Worktree Mechanics (rebase mechanics), Scope Overlap Detection (conflict detection before work starts), Lock Design and Granularity (lock release triggering this event)


Section 2: Automated Rebase: Feasibility and Failure Modes

Standard post-lock-release workflow requires: fetching to stay current with main, rebasing the feature branch on the latest upstream before merging, re-running test suites to catch integration issues, and verifying against pre-execution baselines to isolate regressions introduced by the upstream change.[15] While rebase is the canonical integration path, automated rebase in agent systems encounters specific failure modes that standard developer tooling documentation does not fully address.

Rebase Failure Mode Taxonomy

| Failure Mode | Mechanism | Agent-Specific Risk | Mitigation |
|---|---|---|---|
| Repeated conflict resolution[28] | Each commit in a long rebase sequence produces its own conflict requiring resolution | High — agents create many small atomic commits | Squash-before-rebase: squash the agent's work into a single commit first |
| Shared branch corruption[28] | Force pushing to collaborative branches risks losing work from other agents on related branches | Critical — multi-agent force pushes can cascade | Never force push to shared branches; use a merge queue |
| Unrecoverable rebase failure[28] | Unlike merges (revertible with git reset --hard HEAD^), undoing a failed rebase is complex | High — automated agents may not recognize catastrophic failure | Abort early; use git reflog for undo |
| Metadata loss[28] | GPG signatures and Co-Authored-By information disappear during rebase | Medium — compliance-relevant in regulated environments | Use merge instead of rebase for signed commits |
| Broken intermediate commits[28] | Tests fail on commits between the start and end of the rebase sequence | Medium — bisect-based debugging breaks | Squash before rebase; run tests only on the final commit |
| Git lock contention[19] | Concurrent git worktree add commands race on .git/config.lock | Critical — all git operations blocked while a stale lock persists | Serialize git operations with an internal mutex/queue |

Git Lock Contention: Infrastructure vs. Conflict Failures

Git is not designed for concurrent operations. Multiple git worktree add commands racing simultaneously fail because each must write to .git/config, which uses .git/config.lock for mutual exclusion. If an agent process crashes while holding the lock, the stale .git/index.lock persists and blocks all subsequent git operations until manually removed (rm -f .git/index.lock).[19] Most AI agents do not recover from lock errors gracefully: some retry and succeed if the lock clears quickly; others abort and continue generating code without committing; some produce orphaned branches without cleanup.[19]

| Recovery Option | Mechanism | Recommendation |
|---|---|---|
| Option A[6][19] | Serialize git worktree add calls with an internal mutex/queue | Recommended |
| Option B[19] | Retry with exponential backoff when a command fails due to lock contention | Acceptable fallback |
| Option C[19] | For read-only agent types (e.g., Explore agents), skip worktree isolation entirely | Acceptable for read-only agents |
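
Options A and B compose naturally: hold a process-wide mutex around every git invocation, and back off on residual lock errors from other processes. A minimal sketch, where `run_git` is a hypothetical callable standing in for the real subprocess wrapper and lock failures surface as RuntimeError (a real implementation would inspect git's stderr for "index.lock"):

```python
import random
import threading
import time

# One process-wide mutex: every git invocation in the orchestrator goes
# through this, so worktree add/cleanup calls never race on
# .git/config.lock within this process.
_git_mutex = threading.Lock()

def run_git_serialized(run_git, args, retries=5, base_delay=0.1):
    """Option A (mutex) with Option B (exponential backoff + jitter) as
    a fallback for stale locks left by external or crashed processes."""
    with _git_mutex:
        for attempt in range(retries):
            try:
                return run_git(args)
            except RuntimeError:  # e.g. "index.lock exists"
                if attempt == retries - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
```

The jitter term avoids synchronized retry storms when several orchestrator processes hit the same stale lock.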

A critical additional hazard: the worktree cleanup mechanism can operate on the main working tree instead of isolated worktree directories. During parallel cleanup, processes may remove .git/ entirely. Cleanup of worktrees must be serialized — parallel cleanup is unsafe.[19]

Practical Rebase-on-Conflict Script Pattern

The following pattern from a documented Claude Code production issue (#34645) represents the recommended automated rebase handler — detecting failure, aborting cleanly, and releasing the lock for retry rather than leaving a half-rebased state:[19]

```bash
git checkout "$branch"
if git rebase main; then
    echo "Rebase succeeded on $branch"
else
    echo "Rebase conflict on $branch — needs manual resolution"
    git rebase --abort
    release_lock "$branch"
    exit 1
fi
```

Key finding: Infrastructure failures (lock contention, corrupt worktrees) are harder to recover from than conflict failures. Serializing all git operations in multi-agent systems is a prerequisite for reliable automated rebase — not an optimization.[19]

See also: Git Worktree Mechanics (git rebase mechanics in depth)


Section 3: Divergence Severity Classification

Post-lock-release recovery decisions depend critically on how severe the divergence is. Research from ACM, Martin Fowler, and the ConGra benchmark establishes a three-level taxonomy with measurable automated resolution feasibility at each level.

Three-Level Conflict Taxonomy

| Level | Type | Detection Method | Auto-Resolution Feasibility | Key Property |
|---|---|---|---|---|
| Level 1[21][27] | Syntactic/textual | Standard 3-way merge, git conflict markers | ~82% (git merge, diff3, MergeBERT) | No logical implications; purely textual |
| Level 2[5][27] | Semantic | Compilation errors, broken tests after merge | ~64–68% (Gmerge LLM, CoReRL RL) | Both changes logically correct but incompatible; bypasses all line-diff detection |
| Level 3[21] | Structural/architectural | Manual inspection; no automated tools | Low — requires intent preservation or manual intervention | System design changed; current git tooling entirely inadequate |

Martin Fowler's formulation: "A semantic conflict is a situation where two developers make changes that can be safely merged on a textual level, but cause the program to behave differently."[27] The particularly dangerous property: semantic conflicts bypass all automated detection mechanisms. A critical caveat from ACM ESEC/FSE 2015: a failing test in a merged version does not necessarily indicate a semantic conflict — a test can fail because it was written for one branch's assumptions without those assumptions actually conflicting with the other branch. False-positive detection is a core challenge in automated semantic conflict detection.[5]
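An orchestrator can treat the taxonomy as a routing table: classify first, then dispatch to the cheapest adequate resolver. A sketch, where the tool names and rates are the cited figures but the dict and function are illustrative, not from any of the cited systems:

```python
# Resolution path and approximate expected success rate per conflict level.
RESOLUTION_PATHS = {
    1: ("git merge / diff3 / MergeBERT", 0.82),  # syntactic
    2: ("LLM-based (Gmerge, CoReRL)", 0.66),     # semantic, 64-68% midpoint
    3: ("human escalation", None),               # structural: no adequate tooling
}

def route_conflict(level: int) -> str:
    """Return the resolution path an orchestrator should try first."""
    path, _expected_rate = RESOLUTION_PATHS[level]
    return path
```

The `None` rate at Level 3 is the point: there is nothing to fall back to except a person, so routing must fail loudly rather than attempt automation.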

Semantic Conflict Subtypes Relevant to Post-Merge Recovery

The Semantic Consensus Framework proposes a three-type taxonomy for semantic conflicts in multi-agent systems:[16][18]

| SCF Type | Description | Post-Merge Recovery Implication |
|---|---|---|
| Type 1 — Contradictory Intents[16] | Agent A plans to deprecate a module that Agent B plans to extend | Agent B's accumulated context is entirely invalidated — full restart required |
| Type 2 — Resource Contention[16] | Two agents need exclusive write access to the same configuration file | One agent wins; the other must refresh its view of that file before continuing |
| Type 3 — Causal Violation[16] | Agent A deletes an API that Agent B's in-progress work depends on | Most directly relevant to lock-release recovery; agent B must adapt to the new API surface or restart |

ConGra Graded Benchmarking: Chunk vs. File Resolution

The ConGra (CONflict-GRAded) benchmark provides the most precise data on automated resolution limits across conflict complexity grades:[34]

| Granularity | Derivable Automatically | Implication |
|---|---|---|
| Individual conflict chunks (hunks)[34] | 87.9% | Most individual conflict sites can be auto-resolved |
| Complete merges (full file)[34] | 34.5% | Only one-third of complete file merges can be derived automatically — compound interactions between hunks defeat automation |
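
The gap between the two rows is roughly what a naive independence assumption would predict: if each hunk resolves at 87.9%, a file with eight interacting hunks succeeds only about 0.879⁸ ≈ 35.6% of the time, close to the 34.5% whole-file figure. This is an illustrative back-of-envelope check, not a claim made by the benchmark itself:

```python
hunk_rate = 0.879   # ConGra: individual hunks derivable automatically
file_rate = 0.345   # ConGra: complete file merges derivable automatically

# Under independence, a file with k hunks succeeds at hunk_rate**k;
# around eight hunks this lands near the observed whole-file rate.
for k in (4, 8, 12):
    print(k, round(hunk_rate ** k, 3))
```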

ConGra further finds that LLMs with longer context support do not always yield better conflict resolution results compared to models with shorter contexts, and that general LLMs (e.g., LLaMA3-8B) can outperform specialized code LLMs in conflict resolution.[34]

Branch Duration Effect

When feature branches remain separate from main for extended periods (weeks or months), they accumulate significantly more structural conflicts — the longer an agent's branch diverges from main, the more likely architectural clashes become.[34][27] Martin Fowler's SelfTestingCode mitigation: if features are built quickly (within a couple of days), developers encounter fewer semantic conflicts — suggesting parallel agent work should target short-lived branches rather than long-running parallel streams.[27]

Semantic Rebase Framework: Four-Level Escalation Hierarchy

Peter Thomson's Semantic Rebase framework provides a decision hierarchy for post-merge recovery when structural divergence is too severe for standard rebase:[10]

| Level | Name | Mechanism | Fails When |
|---|---|---|---|
| Level 1[10] | Mechanical Merge | Standard git rebase replaying commits onto the new baseline | Code structure fundamentally changes |
| Level 2[10] | Conflict-Resolution Rebase | Human-guided textual conflict resolution | Architectural patterns shift beyond simple text conflicts |
| Level 3[10] | Intent-Preserving Reverse Merge | Extract feature intent, reimplement on new architecture: code₁ → intent → code₂ | Subsystems were not merely refactored but replaced |
| Level 4[10] | Semantic Rebase | Merge meaning rather than code — requires understanding why changes occurred | Currently requires manual human judgment; tooling does not exist |

Current tooling gaps at Level 3–4: branch intent extraction mechanisms, architectural divergence visualization, conflict severity classification tools, and intent preservation verification tools do not currently exist as standard git infrastructure.[10]
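
The hierarchy reduces to a first-success loop over ordered recovery levels. A minimal sketch, assuming each level exposes a boolean attempt callable (the names come from the table above; the interface is hypothetical):

```python
def recover(handlers):
    """handlers: ordered list of (name, attempt) where attempt() -> bool.
    Try each escalation level in order; stop at the first success."""
    for name, attempt in handlers:
        if attempt():
            return name
    return "unrecoverable"  # exhausting Level 4 means human judgment failed too
```

In practice Levels 3–4 are human-in-the-loop, so their "attempt" would block on a review queue rather than return quickly.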

Semantic Conflict Detection: Static Analysis Methods

| Method | Approach | Performance | Source |
|---|---|---|---|
| Semex (Variability-Aware Execution) | Encodes all parallel changes into a single program, runs tests across all merge scenarios | 100% on 19-conflict test set (preliminary) | [5] |
| RefFilter | Identifies refactorings in branches, discards false-positive interferences | Reduces false positives without reducing true positives | [5] |
| Pointer Analysis (PA) | Reduces false positives via data-flow analysis | 44.4% false positive reduction; 28.6% increase in false negatives | [5] |
| Symbolic Execution | Identifies test cases where results differ between merged and original versions | Works on post-merge code; catches behavioral divergence | [5] |
| SAM (SemAntic Merge) | Unit test-based detection; generates tests as partial specifications | Detects conflicts missed by textual merge | [5][27] |
| SemanticMerge | Structure-based merge at method level (not text-based) | Handles divergent move conflicts | [5] |
| Dynamic Analysis | Detects changes involving writes to the same state element at runtime | No assertions needed; works on post-merge code | [5] |

Key finding: 87.9% of individual conflict chunks are derivable automatically, but only 34.5% of complete merges succeed — hunk-level automation is reliable while file-level automation is not. Agents should submit conflict hunks individually rather than attempting whole-file resolution.[34]

See also: Scope Overlap Detection (detecting conflicts before work starts)


Section 4: Full Restart vs. Partial Context Reuse

The fundamental recovery decision when a lock releases with upstream changes is whether to integrate those changes into the agent's existing session (rebase/context-refresh path) or discard the session and begin again (restart path). Both strategies are used together in mature multi-agent workflows: isolation via worktrees, sequential rebasing for code integration, and fresh sessions to combat context rot.[1]

Decision Framework: Post-Lock-Release Divergence Assessment

Recommended decision sequence when a lock releases and upstream main has advanced:[29]

| Step | Command | Decision Gate |
|---|---|---|
| 1. Measure commit distance[29] | git log HEAD..origin/main --oneline \| wc -l | Commit count is a proxy for structural risk; assess alongside file overlap (Step 2) |
| 2. Measure file overlap[29] | git diff HEAD..origin/main --name-only vs. agent's touched files | No overlap → rebase is safe |
| 3. Assess overlap type[29] | File diff + semantic analysis of changed functions | File overlap + no logical dependency → attempt automated rebase with LLM resolution |
| 4. Evaluate restart cost[30] | Estimate: files changed in main ÷ files in agent's context | Semantic overlap + same functions/classes → evaluate restart cost vs. rebase risk; if >40–60% overlap, restart is cheaper |
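
Steps 2–4 collapse into one decision function. A sketch assuming the file lists have already been computed (by the git diff commands above), with the 40–60% band represented by its midpoint; commit distance (Step 1) is a coarse risk signal assessed separately:

```python
def choose_recovery(upstream_files, agent_files, restart_threshold=0.5):
    """Route a post-lock-release recovery decision from file overlap.
    restart_threshold=0.5 is the midpoint of the cited 40-60% band."""
    overlap = set(upstream_files) & set(agent_files)
    if not overlap:
        return "rebase"                      # Step 2: no overlap, rebase is safe
    ratio = len(overlap) / len(agent_files)
    if ratio > restart_threshold:
        return "restart"                     # Step 4: re-reading most context anyway
    return "rebase-with-llm-resolution"      # Step 3: overlap without heavy coupling
```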

Proactive vs. Reactive Detection: Performance Data

Proactive drift detection is dramatically more effective than reactive conflict detection. The Semantic Consensus Framework evaluation:[18]

| Approach | Detection Timing | Workflow Completion Rate | Conflict Detection Precision | Conflict Detection Recall |
|---|---|---|---|---|
| Ungoverned[18] | No detection | 0.2% | — | — |
| Judge-Agent (reactive)[18] | Post-hoc (detects on rebase failure) | 25.1% | 93.2% | 66.3% |
| SCF Full (proactive)[18] | Before action (monitors SAS score) | 100% | 27.9% | 65.2% |

The trade-off: reactive detection achieves high precision (93.2%) but only 25.1% completion — it catches real conflicts but too late. Proactive detection (lower precision, 27.9%) achieves 100% completion because it catches divergence before the agent has accumulated irrecoverable invalid context. An agent that detects likely stale context before attempting rebase can plan recovery; an agent that detects it only on failure must abort cold.[18]
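
The architectural point reduces to a gate evaluated before the rebase, not after. A hypothetical sketch: the SCF paper defines the Semantic Alignment Score, but this snippet does not compute it, and the 0.7 threshold is an assumption chosen to favor false alarms over late detection:

```python
def proceed_or_resync(sas_score: float, threshold: float = 0.7) -> str:
    """Gate run when a lock is acquired, BEFORE any rebase attempt.
    Below-threshold alignment triggers a context resync; a false alarm
    costs one refresh, while a missed stale context costs the workflow."""
    if sas_score < threshold:
        return "resync-context"
    return "attempt-rebase"
```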

Kill Criteria: When to Force Restart

The Addy Osmani framework specifies MAX_ITERATIONS=8 as the outer limit, with reassignment to fresh agents if stuck 3+ iterations on the same error. The kill criteria approach favors restart over attempting to repair degraded agent context.[25] Before each retry, a forced reflection prompt is required: "What failed? What specific change would fix it? Am I repeating the same approach?" — preventing endless loops of identical failed attempts.[25] A complementary automatic trigger: agents should pause execution when token consumption reaches 85% of budget — approaching context limits is a reliable signal that restarting with compacted context will be cheaper than continuing.[25] AGENTS.md files — persisted institutional memory of task decisions, API surface, and prior conflict resolutions — allow a restarted agent to rebuild working context faster than cold-start; when restart is chosen, seeding the new session with a well-maintained AGENTS.md reduces restart cost significantly.[25]
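
The kill criteria compose into a small supervision loop. A sketch with the reflection prompting elided and the budget probe as a hypothetical callable; the three constants are the cited limits:

```python
from collections import Counter

MAX_ITERATIONS = 8      # outer retry limit (Osmani framework)
STUCK_LIMIT = 3         # same error 3+ times -> reassign to a fresh agent
BUDGET_PAUSE = 0.85     # pause when token consumption hits 85% of budget

def supervise(run_attempt, budget_used=lambda: 0.0):
    """run_attempt() returns None on success, else an error signature
    string; budget_used() returns the fraction of token budget spent."""
    errors = Counter()
    for _ in range(MAX_ITERATIONS):
        if budget_used() >= BUDGET_PAUSE:
            return "pause-and-compact"
        err = run_attempt()
        if err is None:
            return "done"
        errors[err] += 1
        if errors[err] >= STUCK_LIMIT:
            return "reassign-fresh-agent"    # stuck on the same error
    return "escalate-to-human"               # retries exhausted
```

Keying the stuck counter on an error signature rather than a raw traceback is what distinguishes "same failure three times" from "three different failures."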

Hybrid Recovery: In-Place with Hard Escalation Limits

The ComposioHQ Agent Orchestrator implements the recommended production pattern — prefer in-place recovery but with hard limits that trigger escalation rather than infinite retry:[9]

| Scenario | Recovery Behavior | Default Limit |
|---|---|---|
| ci-failed[9] | Auto-routes to agents | 2 retries |
| changes-requested[9] | Auto-sends to agents | Escalation after configurable timeout |
| approved-and-green[9] | Notifications or auto-merge | N/A |

Each agent retains its worktree, branch, and PR across retries — the upstream change is absorbed via the feedback loop, not via full teardown. After retry exhaustion, escalation goes to human review rather than spinning indefinitely.[9] Plugin architecture enables switching from "worktree rebase" to "fresh clone restart" by changing the workspace plugin — the recovery strategy is abstracted from the agent implementation.[9]

Context Window Overflow as a Forced Restart Trigger

When Claude Code launches many background agents in parallel, returning results can overflow the context window — killing the session with no recovery path while billing for all tokens consumed by the agents. Proposed mitigation: context budget awareness — before launching N background agents, estimate whether returning results will fit in remaining context window.[32] The shared task list with auto-unblocking is the mechanism by which an agent learns its lock has released; without explicit dependency tracking, an agent may complete its scope without knowing upstream has changed the game.[13]

Key finding: Proactive drift detection (monitoring Semantic Alignment Score before action) achieves 100% workflow completion vs. 25.1% for reactive detection (noticing failure on rebase). The decision point for recovery should occur before the agent attempts to use its stale context, not after the rebase fails.[18]

Section 5: Incremental Context Refresh: Replaying Only the Delta

Full restart is expensive; targeted context refresh is the preferred cost-saving alternative when the upstream delta intersects only a portion of the agent's working set. The core principle: reduce context rebuild cost from O(all context) to O(delta ∩ agent-context). If the upstream change touched files in the agent's context but not its working set, an incremental refresh is sufficient. If the upstream change touched files the agent modified, a rebase or restart is required.[17]

Delta Refresh Procedure

Step-by-step incremental refresh:[31]

  1. Identify the delta: git diff <agent-branch-base>..origin/main --name-only
  2. Filter to agent-relevant files: Intersect delta with agent's touched/read files
  3. Incremental refresh: Re-read only the changed files the agent previously read
  4. Skip unchanged context: If a file the agent read hasn't changed in main, its context is still valid

This approach potentially reduces context rebuild cost by 90%+ for agents with narrow scope relative to overall codebase changes.[31]
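The four steps above reduce to a small set intersection. A minimal sketch, assuming the upstream diff has already been collected with `git diff <agent-branch-base>..origin/main --name-only` (the function and return keys are illustrative, not from the CocoIndex source):

```python
def plan_context_refresh(upstream_changed, agent_read, agent_modified):
    """Classify each file in an agent's context after an upstream merge.

    upstream_changed: files changed on origin/main since the agent's branch base
    agent_read:       files the agent has read into its context
    agent_modified:   files the agent has itself edited
    """
    changed = set(upstream_changed)
    # Changed upstream and only read by the agent: re-read to refresh context.
    reread = changed & (set(agent_read) - set(agent_modified))
    # Changed upstream AND edited by the agent: rebase or restart is required.
    conflicted = changed & set(agent_modified)
    # Read by the agent but untouched upstream: context still valid, reuse.
    valid = set(agent_read) - changed
    return {"reread": reread, "conflicted": conflicted, "valid": valid}
```

For a narrow-scope agent, `reread` is typically a small fraction of `agent_read`, which is where the claimed 90%+ rebuild savings come from.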

Context Refresh Strategy Comparison

Strategy | Cost | Best For
Full restart[31] | O(N) — all touched files | Severe structural divergence; agent context fundamentally invalid
Incremental refresh[31] | O(k) — changed files only | Moderate upstream changes; agent's core working set unchanged
Delta-only with dependency tracking[31] | Follows dependency graph — lowest cost | Narrow-scope agents with explicit dependency tracking (CocoIndex pattern)

Git Context Controller (GCC): Version-Controlled Agent Memory

The GCC framework (arXiv 2508.00031) manages agent memory as a version-controlled file system with four Git-like operations:[8]

Operation | Purpose | Recovery Role
COMMIT[8] | Checkpoints meaningful progress with structured summaries | Enables recovery without re-reading all prior history
BRANCH[8] | Creates isolated exploration spaces for alternative approaches | Test "rebase" vs. "restart" as parallel branches before committing
MERGE[8] | Synthesizes divergent branches with origin-tagged commit logs | Integrate upstream delta into agent's memory state
CONTEXT[8] | Multi-level memory retrieval from high-level summaries to fine-grained traces | Query prior commits to identify which reasoning steps are still valid

Post-merge recovery workflow with GCC:[8]

  1. Agent checkpoints progress via COMMIT at meaningful milestones
  2. When upstream main changes, recovering agent uses CONTEXT to query its prior commits and understand current state
  3. Agent runs git rebase (or applies the upstream diff) to the worktree
  4. Agent replays only the delta — using CONTEXT at fine granularity to identify which prior reasoning steps are still valid vs. invalidated by the upstream change
  5. BRANCH can be used to explore whether to continue on the rebased branch or start fresh

Empirical results: GCC-equipped agents achieved 48.00% task resolution on SWE-Bench-Lite, outperforming 26 competitive systems.[8]

Durable Execution: Checkpointing Gap for Post-Merge Recovery

Durable execution preserves agent progress through external state persistence, enabling resumption from the last checkpoint. Key components: state checkpointing after each meaningful step, resumability from last checkpoint, retry logic with appropriate backoff, and idempotency (repeated tool calls don't cause duplicate side effects).[33] However, a critical gap exists: standard durable execution checkpoints capture the agent's internal state but not the codebase state at checkpointing time.[33][32]

Post-merge recovery therefore requires a hybrid approach: each agent-state checkpoint must also record the codebase state (e.g., the commit hash) it was built against.[33]

Framework-level checkpointing (LangGraph, CrewAI, Google ADK, Strands) saves state but "leaves failure detection, automatic recovery, and duplicate prevention entirely to you."[32] Checkpoint granularity also matters: when a graph node fails mid-execution, LangGraph stores pending checkpoint writes from the successfully completed nodes in that superstep, whereas Microsoft Agent Framework checkpoints only at superstep boundaries — if any executor in a superstep fails, every parallel executor in it re-executes, even those that already succeeded.[32]

CocoIndex Dependency Tracking

Track which source files each piece of agent context was derived from. When upstream changes, identify which derived contexts are invalidated. Recompute only invalidated contexts; valid contexts (derived from unchanged sources) reuse directly. This enables "surgical" context refresh rather than full restart.[31] Change Data Capture (CDC) — monitoring database transaction logs for real-time change identification — is the underlying technology enabling sophisticated incremental refresh strategies.[31]

Idempotency as a Recovery Property

For coding agents: writing the same code twice produces the same result (safe to retry); reading files that haven't changed gives the same result (safe to skip re-read); but reading files that have changed gives different results (must re-read).[33] The core challenge: the delta between the checkpoint's assumed codebase state and the actual post-upstream-merge codebase state must be applied before resuming execution.
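One way to close the checkpointing gap described above is to persist, alongside the framework checkpoint, the commit the agent's context was built against, and apply the delta before resuming. A hedged sketch — the `AgentCheckpoint` fields and `resume_plan` helper are hypothetical, not part of any framework's API:

```python
from dataclasses import dataclass

@dataclass
class AgentCheckpoint:
    """Durable-execution checkpoint extended with assumed codebase state."""
    agent_state: dict  # whatever the framework would normally persist
    base_commit: str   # commit of origin/main the context was built against
    files_read: list   # files whose contents are baked into the context

def resume_plan(ckpt: AgentCheckpoint, current_commit: str, changed_files: list):
    """Decide what must happen before execution resumes from `ckpt`.

    `changed_files` is the name-only diff between ckpt.base_commit and
    `current_commit` (e.g. from `git diff <base>..<current> --name-only`).
    """
    if current_commit == ckpt.base_commit:
        return {"action": "resume", "reread": []}
    stale = sorted(set(ckpt.files_read) & set(changed_files))
    if not stale:
        # Upstream moved, but nothing the agent read changed: context valid.
        return {"action": "resume", "reread": []}
    # Idempotency rule: changed files must be re-read before resuming.
    return {"action": "replay-delta", "reread": stale}
```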

Key finding: GCC-equipped agents achieved 48% task resolution on SWE-Bench-Lite (outperforming 26 systems) by treating agent memory as a version-controlled artifact — enabling targeted delta refresh after upstream merges rather than full context reconstruction.[8]

Section 6: Cost Models: Rebase vs. Restart

Recovery strategy selection has direct token cost implications. The O(N²) cost growth characteristic of long agent sessions makes restart not just a correctness strategy but a cost-efficiency strategy when sessions have accumulated substantial history.

The Quadratic Context Accumulation Problem

Token costs in multi-agent systems compound at O(N²): each LLM API call re-processes the entire conversation history. A 20-step loop where each step generates 1,000 tokens produces 210,000 cumulative input tokens rather than 20,000.[30] Splitting a 10-step task across 3–4 parallel specialists raises end-to-end reliability from ~60% to ~81–86% while shifting from a single quadratic curve to multiple short linear cost curves.[17][30] State reset (restart) specifically breaks the O(N²) cost curve — making it a cost-efficiency strategy for any agent that has accumulated significant history.[30]
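The arithmetic behind these numbers can be checked directly. An illustrative calculation, assuming every call re-processes the entire accumulated history:

```python
def cumulative_input_tokens(steps, tokens_per_step):
    """Cumulative input tokens when every call re-processes the full history.

    Call i re-reads the i * tokens_per_step tokens accumulated so far,
    producing the O(N^2) curve: sum over i of i * tokens_per_step.
    """
    return sum(i * tokens_per_step for i in range(1, steps + 1))

# 20-step loop, 1,000 tokens per step: one quadratic curve.
single_agent = cumulative_input_tokens(20, 1_000)   # 210,000 tokens, not 20,000

# Same work split across 4 specialists of 5 steps each: short curves.
split_agents = 4 * cumulative_input_tokens(5, 1_000)  # 60,000 tokens
```

The split-agent figure ignores coordination overhead, so it understates real cost somewhat, but the shape of the saving is the point: restart or splitting resets the curve.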

Formal Cost Comparison

Strategy | Cost Components | Concrete Estimate (50K accumulated context)
Rebase with minor upstream changes[30] | git ops + conflict resolution tokens + agent context refresh (changed files) | ~5–10K additional tokens
Rebase with major upstream changes[30] | git ops + extensive conflict resolution + many changed files to re-read | ~20–30K additional tokens
Full restart[30] | Full context initialization + re-read relevant code + re-establish task understanding | ~30–50K tokens (task-complexity dependent)

Break-even point: Rebase becomes more expensive than restart when upstream changes touch >40–60% of files the agent was working with, as the agent must re-read most of its context regardless.[30]
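The break-even rule can be expressed as a small heuristic. In this sketch, the 0.5 threshold is an assumed midpoint of the 40–60% range and the function name is illustrative:

```python
def choose_recovery(files_touched_upstream, agent_working_set,
                    overlap_threshold=0.5):
    """Rebase-vs-restart heuristic from the break-even point above.

    Restart once upstream changes touch more than ~40-60% of the agent's
    working set, since the agent must re-read most of its context anyway.
    """
    working = set(agent_working_set)
    if not working:
        return "rebase"  # nothing to invalidate
    overlap = len(set(files_touched_upstream) & working) / len(working)
    return "restart" if overlap > overlap_threshold else "rebase"
```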

Coordinator-Specialist Crossover Point

The coordinator-specialist architecture introduces fixed overhead that only pays off for tasks exceeding approximately 5–6 steps for typical token sizes.[17] Applied to restart decisions: if an agent's remaining work is fewer than 5–6 steps, restart cost is comparable to rebase cost. If more than 5–6 steps remain, restart is likely cheaper because accumulated stale context compounds costs with each subsequent step.[17]

Incremental Rebuild Token Savings

When a file is edited, incremental rebuild systems update only affected nodes and edges — analogous to how a build system knows what changed without recompiling everything. Queries that previously consumed 8,000–12,000 tokens now require 800–2,000 tokens — same answer quality, but 40–95% fewer tokens depending on question and codebase size.[17]

Context Compaction and Prompt Caching

Context compaction removes redundant tokens from conversation history by verbatim deletion (not summarization), achieving 50–70% token reduction. A 200K-token conversation compacted to 80K tokens generates compound savings — the compacted conversation is what gets re-sent on every subsequent turn.[30] Prompt caching: cache reads cost $0.30/1M vs. $3.00/1M for uncached input on Claude Sonnet — a 90% reduction on the cached prefix.[30]
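At the Sonnet rates quoted above, the effect of caching the compacted prefix is easy to quantify. A sketch, with the per-million-token rates hard-coded from the figures in this section:

```python
def input_cost_usd(tokens, cached_fraction,
                   cached_rate=0.30, uncached_rate=3.00):
    """Input cost with prompt caching (rates in USD per 1M tokens:
    $0.30 cached read vs $3.00 uncached, per the Sonnet figures above)."""
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return (cached * cached_rate + uncached * uncached_rate) / 1_000_000

# An 80K-token compacted prefix re-sent each turn:
fully_cached = input_cost_usd(80_000, 1.0)   # $0.024 per turn
uncached = input_cost_usd(80_000, 0.0)       # $0.24 per turn
```

Compaction (200K to 80K) and caching compound: each shrinks the per-turn input bill independently of the other.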

Real-World Cost Data

Metric | Value | Source
Token multiplier for agent teams vs. standard sessions[7][17] | ~7x more tokens when teammates run in plan mode | Factory.ai / Augment Code
Average cost per active developer day[17] | ~$13 | Augment Code
Enterprise monthly cost per developer[17] | $150–250 | Augment Code
Cost for 200 API calls (Claude Opus)[30] | $7+ per session | Augment Code
Team cost (20 developers × 50 sessions/day, Claude Opus)[30] | $10,200/month | Augment Code
Cost anomaly from stale API format assumption[17] | ~$50 in 40 minutes (200x baseline token rate) | Augment Code (production incident)

Model Pricing Reference (Early 2026)

Model | Cost (10M input / 2M output workload)
Haiku 4.5[7] | $20
Sonnet 4.5/4.6[7] | $60
Opus 4.5/4.6[7] | $100
Opus 4 (older)[7] | $300

Model Selection Strategy for Recovery Operations

Use budget-tier models (15–50x cheaper than flagship) for conflict classification and triage. Reserve flagship models for complex reasoning or mission-critical, low-latency use cases.[7] Delegate test execution, log processing, and documentation fetching to sub-agents — keeps verbose output in sub-agent context while only summaries return to main conversation.[7] Anthropic applies premium rates when a request exceeds 200K input tokens — another driver toward compaction and restart rather than unbounded context accumulation.[7]
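The tiering advice above amounts to a two-step routing policy: classify cheaply, then escalate only when needed. A sketch, where `classify` stands in for a budget-tier model call and the level numbers follow the three-level conflict taxonomy used in this report:

```python
def triage_and_route(conflict_text, classify):
    """Two-step recovery routing: a budget-tier model classifies the conflict,
    and the result decides whether a flagship model is even invoked.

    `classify` is a stand-in for a cheap-model call returning 1, 2, or 3
    (1 = syntactic, 2 = semantic, 3 = structural/architectural).
    """
    level = classify(conflict_text)  # budget-tier call: 15-50x cheaper
    if level == 1:
        return "git-native tooling"          # ~82% resolve via standard git
    if level == 2:
        return "flagship-model resolution"   # 64-68% with LLM-based tools
    return "human escalation"                # no adequate automated tooling
```

The design point: the flagship model only ever sees the conflicts the cheap classifier cannot dispose of, keeping the expensive tier off the common path.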

Key finding: An API format change caused a production agent to loop at 200x baseline token rate, costing ~$50 in 40 minutes — a documented example of how undetected upstream changes compound costs before detection.[17]

Section 7: Tooling for Automated Conflict Resolution in Rebase

The automated conflict resolution landscape spans git-native tooling (rerere), AI-assisted tools (merde.ai, Graphite), and research-grade systems (MergeBERT, CoReRL). Each operates at a different point in the resolution pipeline.

git rerere: Institutional Memory for Rebase

git rerere (Reuse Recorded Resolution) records how conflict hunks were resolved and automatically re-applies those resolutions on future encounters of the same conflict pattern, storing resolutions in .git/rr-cache.[4][26]

Command | Purpose
git config --global rerere.enabled true[4] | Enable rerere globally
git rerere status[4] | Show files rerere is tracking
git rerere diff[4] | Show current state of recorded resolution
git rerere remaining[4] | List files rerere could NOT resolve
git rerere clear[4] | Clear the rerere cache

Cache retention: Unresolved conflicts are pruned after 15 days (gc.rerereUnresolved); resolved conflicts are pruned after 60 days (gc.rerereResolved).[4] rerere integrates automatically: git merge, git commit, and git rebase all invoke it on conflict.[4]

Key caveat: rerere relies on conflict markers in the file to detect conflicts; if a file already contains lines that look like conflict markers, rerere may fail to record a resolution.[4] Its main limitation: rerere only helps with previously seen conflicts. For novel conflicts introduced by new upstream changes, human or LLM intervention is still required; rerere complements but does not replace AI-based conflict resolution.[26]

In agent-based workflows, rerere provides "institutional memory" for conflict resolution patterns. If an agent repeatedly rebases against an evolving main branch, rerere can automate resolutions to previously resolved conflicts — directly reducing the cost of repeated rebase attempts.[4]

merde.ai: AI-Assisted Non-Destructive Conflict Resolution

Merde.ai automates git merge and rebase conflict resolution using AI with a non-destructive design: creates new branches with resolved conflicts without touching the current working directory.[3][11][22]

Key development insight: Initial attempts feeding raw conflict markers to an LLM scored "single digits" on a real-world conflict benchmark. Performance improved to approximately 50% success rate by engineering broader context retrieval — surrounding code, call sites, and related files — mirroring how a skilled developer approaches conflict resolution.[3]

Model Performance Ranking (merde.ai benchmark)

Tier | Models | Notes
Top[3] | Claude, DeepSeek V3 (roughly tied) | Best overall performance
Strong[3] | GPT-4o | Good but below top tier
Ineffective[3] | o1, R1 ("slow thinker" reasoning models) | Reasoning-style models do not help for conflict resolution
Problematic[3] | Llama 3.3 and smaller models | Quirks like trailing whitespace break code in languages with significant whitespace

Failure mode philosophy: When AI fails, merde.ai produces "no resolution rather than a bad resolution" — failures are loud rather than silent. Silent corruption is considered far more dangerous than visible failure.[3][11] The ~50% success rate means agents need a fallback strategy for the unresolved 50%: escalate to human review, attempt full restart, or fall back to intent-preservation strategies.[11]

Recommended agent workflow with merde.ai:[3]

  1. Detect conflict during rebase
  2. Submit to AI resolution as a separate non-destructive branch
  3. Review and validate the resolution
  4. Merge or discard (the working tree is left as if nothing happened)

Research-Grade Conflict Resolution Tools

Tool | Approach | Performance Note | Source
Gmerge | Pre-trained LMs for semantic conflict resolution | 64.6% accuracy after 10 trials | [21]
MergeBERT | Transformer encoder, token-level classification | 63–68% accuracy | [21][34]
CHATMERGE | Two-stage: ML strategy classification + LLM generation | Most practical framework for automated agents | [34]
Semex | Variability-aware execution across all merge scenarios | Also a detection tool (see Section 3) | [5]
SemanticMerge | Structure-based merge at method level | Also a detection tool (see Section 3) | [5]
CoReRL | Reinforcement learning — learns without human classification | ~64% accuracy; no labeled training data required | [21]

CHATMERGE's two-stage approach (classify strategy → generate resolution) is the most practical framework for automated agents: classify the conflict type first (syntactic vs. semantic vs. structural), then choose the appropriate recovery mechanism.[34]

Production AI Conflict Resolution Tools

Tool | Integration | Key Feature
JetBrains AI Assistant[12] | JetBrains IDEs | ML understanding of code structure in IDE context
CodeGPT[12] | IDE plugin | Personalized conflict resolution strategy guidance
Resolve.AI[12] | Visual Studio Code | Context-aware resolution generation
Graphite merge queue[12] | GitHub integration | Automatically rebases PRs onto latest main before merging; prevents last-write-wins; agents waiting for merge unblock in deterministic queue order — the external trigger for lock-release recovery
merde-bot[22] | GitHub PR comments | Opens a new PR resolving conflicts; non-destructive review-before-merge

Context Drift Prevention as Pre-Emptive Tooling

Lumenalta's 8-tactic framework for reducing context drift represents a pre-emptive alternative to post-merge recovery tooling:[1]

  1. Central shared task spec as single reference for all agents
  2. Scoped role-specific context
  3. External memory with selective retrieval
  4. Aggressive history trimming
  5. Coordinator agent layer that flags conflicts against central specs
  6. Structured state and protocols
  7. Guardrail prompts and validation checks
  8. Observability and feedback loops

Key finding: Raw conflict markers fed to an LLM produce single-digit resolution rates. Providing broader context — surrounding code, call sites, related files — raises success to ~50%. Context width, not model size, is the primary determinant of AI conflict resolution quality.[3]

Section 8: Production Architectures and Recovery Patterns

Production multi-agent systems converge on a common structural pattern: FIFO merge queues for deterministic sequencing, multi-tier escalation hierarchies for conflict resolution, and proactive drift monitoring rather than reactive failure detection.

Overstory: FIFO Merge Queue with 4-Tier Conflict Resolution

Overstory implements a SQLite-backed FIFO merge queue where each merge attempt incorporates all prior merges, conflicts surface as standard git conflicts at queue processing time, and failed merges are tracked and returned to the queue or escalated.[20] The architecture is designed around a ~25% baseline conflict rate — "Conflicts are not failures — they are expected events in multi-agent workflows."[20]

Tier | Name | Mechanism
Tier 0[20] | Mechanical (Daemon) | Standard git merge/rebase; tmux/pid liveness monitoring
Tier 1[20] | AI-Assisted Triage | Classifies conflict type/severity; determines syntactic vs. semantic
Tier 2[20] | Monitor Agent (Fleet Patrol) | Continuous patrol; detects stuck agents, stale context, degrading output; can trigger restart
Tier 3[20] | Human Escalation | Unresolvable conflicts escalate with structured conflict reports

Context invalidation handling decision tree: when an agent's lock releases and main has advanced, the Overstory orchestrator chooses among three paths:[20]

  1. Let the agent finish and rebase on completion
  2. Interrupt the agent with a context update via SQLite mail
  3. Restart the agent with fresh context

GitHub API integration: mergeable: false and mergeable_state: "dirty" signals trigger automated agent dispatch for resolution in an isolated worktree.[20]
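The three paths can be sketched as a single decision function. The boolean flags are hypothetical signals an orchestrator would derive from the delta diff and the agent's progress, not Overstory's actual API:

```python
def context_invalidation_path(agent_near_done, upstream_touches_agent_files,
                              conflicted_with_agent_edits):
    """Choose among the three Overstory-style paths when a lock releases
    and main has advanced (flag names are illustrative)."""
    if conflicted_with_agent_edits:
        # Upstream rewrote files the agent also edited: fresh context needed.
        return "restart-fresh"
    if upstream_touches_agent_files and not agent_near_done:
        # Delta overlaps context the agent still relies on: push an update.
        return "interrupt-with-update"
    # Delta is disjoint from the agent's work, or the agent is nearly done.
    return "finish-then-rebase"
```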

Augment Code Coordinator-Specialist Architecture

Augment Code's Intent orchestrator implements three-tier parallel agent management:[2]

Role | Function | Recovery Role
Coordinator[2] | Planning; maintains Living Spec shared coordination artifact | Triggers rebase-before-merge to surface visible git conflicts
Specialists[2] | Parallel execution in isolated worktrees (one per logical unit) | Execute recovery within worktree scope; escalate via Coordinator
Verifier[2] | Quality gate before merge | Human checkpoint: Verifier output review before merge

The Context Engine provides a semantic codebase index preventing architectural blind spots when agents re-read context after an upstream change.[2]

Semantic Consensus Framework: Consensus Resolution Protocol (CRP)

The SCF's CRP applies a cascading three-tier authority hierarchy for automated conflict resolution:[18]

Tier | Authority | Decision Basis
Tier 1[18] | Policy Authority | Organizational governance policies, compliance rules, NIST AI RMF controls; if policy unambiguously dictates outcome, it prevails
Tier 2[18] | Capability Authority | Agent's organizational role, skill relevance to contested domain, historical accuracy on similar decisions
Tier 3[18] | Temporal Priority | The agent whose intent was registered first retains priority; the conflicting agent's intent is queued for re-evaluation — maps directly to the rebase model: the agent that merged first "wins"

Unresolved conflicts escalate to human operators with structured conflict reports including the Semantic Intent Graph subgraph and attempted resolution paths.[18]

Context Engineering Techniques for Recovery

Effective multi-agent systems treat context the way operating systems treat memory and CPU cycles: as finite resources to be budgeted, compacted, and intelligently paged.[7] Four techniques applicable to post-merge recovery:[7]

Technique | Mechanism | Recovery Application
Offloading[7] | Summarize tool responses; avoid verbatim output in context | Compress prior rebase attempt outputs before retry
Reduction[7] | Compact/summarize conversations proactively | Compact session before rebase attempt to reduce O(N²) cost
Retrieval (RAG)[7] | Dynamically fetch only relevant information at runtime | Re-fetch only changed files as identified by delta diff
Isolation[7] | Use sub-agents to handle specific tasks without context overlap | Dispatch sub-agent to assess what changed; receive only the summary

File-based memory (operating manuals, decision logs) can be refreshed with a delta update rather than a full restart. Sub-agents can be specifically tasked with detecting what changed and producing a summary for the main agent.[14]

The Semantic Alignment Score (SAS) Drift Monitor

The SCF Drift Monitor computes the Semantic Alignment Score (SAS) from three components; when SAS falls below threshold, the system triggers proactive re-synchronization before the agent attempts to act on stale context. This proactive monitoring is the mechanism behind SCF's 100% workflow completion vs. 25.1% for reactive detection (see Section 4 for the full evaluation data).[18]
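The proactive pattern reduces to a gate evaluated before the agent acts. The threshold value and function names below are illustrative, not from the SCF paper:

```python
def pre_action_gate(compute_sas, resync, threshold=0.8):
    """Proactive drift check: measure alignment BEFORE acting on context,
    re-sync if below threshold (0.8 is an assumed value).

    compute_sas: callable returning the current Semantic Alignment Score
    resync:      callable that refreshes the agent's context
    """
    sas = compute_sas()
    if sas < threshold:
        resync()               # refresh context before the agent acts on it
        sas = compute_sas()    # re-measure after re-synchronization
    return sas >= threshold    # proceed only once aligned
```

The contrast with reactive detection is that `resync` fires before any rebase attempt, so the agent never acts on context it already knows is stale.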

Key finding: Production architectures converge on designing for conflict as a baseline — Overstory explicitly assumes a ~25% conflict rate and treats it as first-class operational state. The architectural question is not "how do we prevent conflicts?" but "how do we resolve them cheaply and correctly at scale?"[20]

See also: Git Worktree Mechanics (git infrastructure details), Scope Overlap Detection (pre-work conflict detection), Lock Design and Granularity (lock release event that triggers recovery)


Sources

  1. 8 Tactics to Reduce Context Drift with Parallel AI Agents (retrieved 2026-05-03)
  2. How to Use Git Worktrees for Parallel AI Agent Execution (retrieved 2026-05-03)
  3. Have AI resolve your merge/rebase conflicts (sketch.dev blog on merde.ai) (retrieved 2026-05-03)
  4. Git - Rerere (Official Git Documentation) (retrieved 2026-05-03)
  5. Detecting Semantic Merge Conflicts with Variability-Aware Execution (ACM ESEC/FSE 2015) (retrieved 2026-05-03)
  6. Git Worktrees for Parallel AI Coding Agents — Agent Context Invalidation and Upstream Change Detection (retrieved 2026-05-03)
  7. The Context Window Problem: Scaling Agents Beyond Token Limits (Factory.ai) (retrieved 2026-05-03)
  8. Git Context Controller: Manage the Context of LLM-based Agents like Git (arXiv 2508.00031) (retrieved 2026-05-03)
  9. Agent Orchestrator: Parallel AI Coding Agents with Worktree Isolation (ComposioHQ) (retrieved 2026-05-03)
  10. Semantic Rebase - Peter J Thomson (retrieved 2026-05-03)
  11. Have AI resolve your merge/rebase conflicts - sketch.dev blog (retrieved 2026-05-03)
  12. The role of AI in merge conflict resolution - Graphite (retrieved 2026-05-03)
  13. The Code Agent Orchestra - What Makes Multi-Agent Coding Work - Addy Osmani (retrieved 2026-05-03)
  14. Context Rot in AI Coding Agents: What It Is and How to Prevent It - MindStudio (retrieved 2026-05-03)
  15. How to Use Git Worktrees for Parallel AI Agent Execution - Augment Code (retrieved 2026-05-03)
  16. Semantic Consensus: Process-Aware Conflict Detection and Resolution for Enterprise Multi-Agent LLM Systems (retrieved 2026-05-03)
  17. AI Agent Loop Token Costs: How to Constrain Context - Augment Code (retrieved 2026-05-03)
  18. Semantic Consensus: Process-Aware Conflict Detection and Resolution for Enterprise Multi-Agent LLM Systems (Full Paper) (retrieved 2026-05-03)
  19. [BUG] Parallel subagents with worktree isolation fail due to git config lock contention - Claude Code GitHub Issue #34645 (retrieved 2026-05-03)
  20. Overstory: Multi-agent orchestration for AI coding agents with FIFO merge queue and 4-tier conflict resolution (retrieved 2026-05-03)
  21. What is a Merge Conflict? Understanding the Difference Between Semantic and Code Conflicts - DEV Community (retrieved 2026-05-03)
  22. Have AI resolve your merge/rebase conflicts - sketch blog (retrieved 2026-05-03)
  23. Context Rot in AI Coding Agents: What It Is and How to Prevent It | MindStudio (retrieved 2026-05-03)
  24. The role of AI in merge conflict resolution - Graphite (retrieved 2026-05-03)
  25. The Code Agent Orchestra - what makes multi-agent coding work - AddyOsmani.com (retrieved 2026-05-03)
  26. Git - Rerere (Reuse Recorded Resolution) (retrieved 2026-05-03)
  27. bliki: Semantic Conflict - Martin Fowler (retrieved 2026-05-03)
  28. git rebase: what can go wrong? - Julia Evans (retrieved 2026-05-03)
  29. How to Use Git Worktrees for Parallel AI Agent Execution | Augment Code (retrieved 2026-05-03)
  30. AI Agent Loop Token Costs: How to Constrain Context | Augment Code (retrieved 2026-05-03)
  31. CocoIndex - Incremental engine for long horizon agents (retrieved 2026-05-03)
  32. Checkpoints Are Not Durable Execution: Why LangGraph, CrewAI, Google ADK Fall Short | Diagrid Blog (retrieved 2026-05-03)
  33. Durable Execution for AI Agents | inference.sh (retrieved 2026-05-03)
  34. ConGra: Benchmarking Automatic Conflict Resolution | OpenReview (retrieved 2026-05-03)
