Pillar: existing-agent-systems | Date: May 2026
Scope: Prior art survey of tools that run multiple AI coding agents concurrently on shared repos: Devin, SWE-Agent, AutoDev, MetaGPT, Patchwork, Aider multi-agent, GitHub Copilot Workspace, OpenHands (OpenDevin). How they partition work, what isolation guarantees they provide, and what coordination failures they document or avoid.
Sources: 30 gathered, consolidated, synthesized.
Central finding: Empirical research (CooperBench, arXiv 2601.13295) measured a 39–70% performance degradation when multi-agent configurations attempt the same workload as a single agent — the "curse of coordination" — making task decomposition quality, not agent count, the primary determinant of whether parallelism helps or harms.[28]
The field divides sharply between systems that treat isolation as a first-class concern and those that defer it to the user. GitHub Copilot's /fleet command is the starkest example of the latter: sub-agents share a filesystem with no file locking, and GitHub's own documentation warns that "being thoughtful about partitioning work is essential" to avoid merge conflicts.[7][14] At the other extreme, Patchwork employs a three-layer stack — git worktrees for filesystem isolation, per-worktree database names for state isolation, and deterministically assigned dev-server ports for runtime isolation — but hits a practical ceiling of 5–7 concurrent agents before rate limits, disk consumption (~5 GB per worktree on a 2 GB codebase, totaling 30+ GB for 6 agents), and merge-review overhead cancel the throughput gains.[8][26]
No production system fully solves the concurrency problem. MultiDevin (Cognition AI, 2024) scales to 1 manager plus 10 workers, each in an isolated VM, but provides only task-level isolation — two workers can independently modify the same file if tasks are mis-scoped, leaving conflict resolution to a sequential final merge by the manager.[11] AutoDev (Microsoft Research, arXiv 2403.08299) uses Docker isolation per session, not per agent within a session, leaving intra-session concurrent writes without file-level protection.[17] Devin 2.0 (April 3, 2025) improved PR merge rates from 34% to 67% and delivered 83% more junior-level tasks per ACU compared to 1.x, yet complex end-to-end task completion remains in the single-digit to low-double-digit percent range.[3][21]
The benchmark record exposes a consistent complexity cliff. Frontier models score above 70% on SWE-Bench Verified (single-issue tasks), but drop to roughly 23% on SWE-Bench Pro — a data-contamination-resistant benchmark requiring multi-file patches averaging 107 lines across 4+ files.[9][10] OpenHands reached 72% on SWE-Bench Verified (2026, Claude Sonnet 4.5 with extended thinking), up from 26% on SWE-Bench Lite at publication.[4][6] Meanwhile, Mini-SWE-Agent — a 100-line implementation — scores above 74% on SWE-Bench Verified, demonstrating that much of the complexity in full SWE-agent is not load-bearing for benchmark performance.[27] SWE-Edit's decomposed Viewer/Editor subagent architecture improved over baseline by 2.1% while reducing inference cost by 17.9%, validating specialization as a cost-efficiency lever even at modest accuracy gains.[9]
Coordination topology matters more than agent count. CodeCRDT (arXiv 2510.18893) ran 600 trials using Claude Sonnet and found that parallel coordination produces outcomes ranging from a 21.1% speedup to a 39.4% slowdown depending on task structure alone.[25] Cursor's internal experimentation documented the failure modes directly: equal-status agents with locking made 20 agents perform at the throughput of 2–3 because agents held locks too long; optimistic concurrency control made agents risk-averse and avoidant of hard tasks; the architecture that worked was Planners + Workers + Judges — role differentiation as the scaling mechanism.[28] A systematic taxonomy of 13 open-source agents (arXiv 2604.03515v2) found that 11 of 13 compose multiple control primitives rather than relying on a single loop strategy, with tool counts ranging from 0 (Aider) to 37 action classes (Moatless Tools).[29]
Sequential architecture is not a concession — it is a deliberate coordination strategy. MetaGPT (ICLR 2024 Oral, top 1.8%) encodes a five-role waterfall pipeline (ProductManager → Architect → ProjectManager → Engineer → QA) in which no two agents ever write to the same artifact concurrently, eliminating merge conflicts by construction rather than detection.[5][13] This trades throughput for consistency and achieved 85.9% HumanEval Pass@1 and 87.7% MBPP Pass@1 at publication.[5] The core insight, validated across systems: naively chaining LLMs produces cascading hallucinations from logic inconsistencies; structured handoffs with verification gates at each role boundary prevent this.[13]
The isolation mechanism chosen has decisive downstream consequences. Git worktrees spin up in seconds and share the object store (low disk cost), providing file-level isolation with branch exclusivity enforced by git — the same branch cannot be checked out in more than one worktree simultaneously.[2] Docker containers provide full namespace isolation including ports, databases, and network, but take minutes to create and multiply disk usage per agent. The four failure modes worktrees prevent — concurrent file overwrites, context contamination, race conditions on shared state, and git index lock contention — are all documented in production systems that skipped isolation.[2] Worktrees, however, provide no runtime isolation and no cross-worktree conflict warnings; full-stack agents (running dev servers, test databases) require the three-layer stack that Patchwork implements.[8] Full worktree support arrived in JetBrains 2026.1 and VS Code July 2025, making the tooling ecosystem for this pattern newly mature.[2]
Verification has overtaken generation as the bottleneck. "The bottleneck is no longer generation. It's verification."[10] Organizations with working multi-agent setups report 20–30% faster cycles, but only when task partitioning quality is high.[25] Developer-written AGENTS.md files compound learning across sessions and produce roughly a 4% accuracy improvement; LLM-generated equivalents cause a ~3% regression and a 20%+ cost increase.[10] The AgenticFlict dataset (arXiv 2604.03551) found that some multi-agent PR rejections stem not from substantive conflicts but from superficial formatting or indentation differences — a category addressable by pre-commit normalization rather than architectural changes.[28]
Implications for practitioners: Start with 2 well-isolated agents on genuinely independent features before scaling — the CooperBench and CodeCRDT data both show that adding agents to poorly partitioned work makes outcomes worse, not better. Use git worktrees for code-only parallel generation (3–5 agents is the validated sweet spot), and add database and port namespacing only when agents run live servers. Adopt the Planners + Workers + Judges topology rather than equal-status agents with locking — Cursor's internal research proves the latter collapses at scale. Invest human time in writing the decomposition brief and the AGENTS.md; LLM-generated context for either degrades performance. Finally, treat the 70% SWE-Bench Verified scores with caution: the 3× drop to 23% on multi-file tasks means current agents are reliable for single-issue scoped work and unreliable for anything that touches 4+ files simultaneously — which is precisely the class of work multi-agent systems are most often proposed to accelerate.
Cognition Labs launched Devin 1.0 on March 12, 2024, positioning it as the "world's first fully autonomous AI software engineer."[11][21] Devin integrates an LLM with tools, memory, and reasoning capabilities to independently plan, execute, and iterate on multi-step engineering tasks requiring thousands of decisions.[3] Devin accepts tasks via Slack or Microsoft Teams integrations and executes autonomously in a cloud sandbox, optimized for clearly scoped multi-hour tasks.[3] The most architecturally significant feature added in 2024 was MultiDevin — a manager-worker parallel execution pattern scaling to 10 concurrent agents.
MultiDevin fields one "manager" Devin and up to 10 "worker" Devins.[11] The manager distributes tasks to each worker, then merges changes from all successful workers into one branch or pull request. The design is explicitly limited to "repeated, isolated tasks like lint errors, code clean-ups, migrations, refactors" and is not suited for interdependent feature work.[11]
Key finding: MultiDevin's isolation guarantee is at the task level, not the file level — two workers can independently modify the same file if tasks are not properly scoped. The manager must reconcile all worker changes in a final sequential merge step.[11]
Devin 2.0 (released April 3, 2025) operates within a cloud-based agent-native IDE combining a code editor, terminal, sandboxed browser, and smart planning tools.[11][21] Per a technical analysis: "a cloud-based development environment that allows users to spin up multiple parallel Devin instances, each running in an isolated virtual machine."[3]
| Devin Version | Release | Multi-Agent Capability | Isolation Unit | Coordination Mode |
|---|---|---|---|---|
| Devin 1.0 | March 12, 2024 | Single autonomous agent; REST API for parallel sessions[11] | Cloud sandbox per session | None (independent) |
| MultiDevin | 2024 (Q3–Q4) | 1 manager + up to 10 workers[11] | Task-scoped (not file-scoped) | Manager merges worker output |
| Devin 2.0 | April 3, 2025 | Parallel instances; agent dispatches sub-tasks to other Devins[11][21] | Isolated VM per instance | Task-scoped; plan approval interface |
| Devin (Feb 2026) | February 2026 | Parallel sessions with improved context retention[21] | Isolated VM | Parallel session management |
| Metric | Value | Source |
|---|---|---|
| Task completion improvement, Devin 2.0 vs. 1.x | 83% more junior-level tasks per ACU | [11][21] |
| COBOL migration scale | 5 million lines across 500 GB of repositories, single agent | [3] |
| PR merge rate improvement | 34% → 67% | [3] |
| Nubank case study efficiency | 12× efficiency improvement, 20× cost savings; weeks vs. months (per Cognition-published case study) | [11] |
| Goldman Sachs pilot (July 2025) | 20% efficiency gains alongside 12,000 human developers (per Cognition-published case study) | [21] |
| Complex end-to-end task completion (early 2025) | Single-digit to low-double-digit % | [21] |
| Round | Date | Valuation |
|---|---|---|
| Series A | March 2024 | $350M[11] |
| Series B | April 2024 | $2B[11] |
| Growth round | March 2025 | ~$4B (8VC)[21] |
SWE-agent is an open-source platform developed by Princeton University's NLP group (Yang et al.), published at NeurIPS 2024 (arXiv: 2405.15793).[27] It takes a GitHub issue and automatically fixes it using an LM of choice. The system's central contribution is the Agent-Computer Interface (ACI) — the insight that LM agents benefit from specially designed software interfaces, analogous to how human developers benefit from IDEs.[9][27]
Key finding: For coding agents, exploration and precision are fundamentally at odds — a single agent cannot simultaneously optimize for comprehensive code understanding (benefits from viewing many files) AND reliable edit generation (benefits from clean, focused context).[9] This tension motivates decomposed multi-agent architectures.
SWE-agent executes code in isolated Docker environments — each issue gets its own sandboxed container. Docker-based isolation is the primary mechanism, not git worktrees.[27][29]
The SWE-Edit framework decomposes code editing into two specialized subagents to address the exploration-precision tension:[9]
| Subagent | Role | Context Characteristic |
|---|---|---|
| Viewer | Extracts task-relevant code on demand | Broad — can inspect many files |
| Editor | Executes modifications from high-level plans | Narrow — receives only what is needed for edits |
Adaptive editing mode selection uses a Qwen3-8B model (GRPO-trained) to choose between find-replace (small changes) and whole-file rewrite (complex restructuring).[9]
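To make the division concrete, here is a minimal sketch of the Viewer/Editor split in Python. The `llm` callable, prompt wording, and JSON plan format are assumptions for illustration, not SWE-Edit's actual interfaces:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditPlan:
    file_path: str    # file the Editor will modify
    instruction: str  # high-level description of the change
    snippet: str      # the minimal code the Editor needs to see

def viewer(task: str, repo_files: dict[str, str], llm: Callable[[str], str]) -> EditPlan:
    """Broad context: may inspect many files to localize the change."""
    context = "\n\n".join(f"# {path}\n{src}" for path, src in repo_files.items())
    raw = llm(
        f"Task: {task}\n{context}\n"
        'Reply as JSON: {"file_path": ..., "instruction": ..., "snippet": ...}'
    )
    return EditPlan(**json.loads(raw))

def editor(plan: EditPlan, llm: Callable[[str], str]) -> str:
    """Narrow context: sees only the extracted snippet, never the whole repo."""
    return llm(f"Instruction: {plan.instruction}\nRewrite this snippet:\n{plan.snippet}")
```

The exploration-precision tension is visible in the signatures: `viewer` consumes the whole repository, while `editor` consumes only an `EditPlan`.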
| Metric | Result |
|---|---|
| SWE-bench Verified improvement over baseline | +2.1%[9] |
| Inference cost reduction | −17.9%[9] |
| Edit formatting reliability improvement | +3.5%[9] |
Multi-agent systems for software engineering (with specialized agents for repository navigation, bug localization, patch generation, and verification) have outpaced single-agent architectures in scalability and performance as of 2025.[9]
| Year | Dominant Pattern | Example |
|---|---|---|
| 2024 | Single-agent designs with custom ACIs | SWE-agent original |
| 2025 | Decomposed multi-agent architectures with specialized roles | SWE-Edit (Viewer + Editor), Agentless |
| Benchmark | Score |
|---|---|
| SWE-bench (pass@1, NeurIPS 2024 — best open-source) | 12.5%[27] |
| HumanEvalFix (pass@1) | 87.7%[27] |
| SWE-bench Pro (top models, data-contamination-resistant) | ~23% (vs. 70%+ on SWE-bench Verified)[9] |
Mini-SWE-Agent — 100 lines of code total — solves GitHub issues from the command line and scores >74% on SWE-bench Verified, demonstrating that much of SWE-agent's complexity is not essential to performance.[27]
See also: Agent Architecture Taxonomy (Section 10); Coordination Failures (Section 9)

OpenHands (formerly OpenDevin) started in early 2024 and was published at ICLR 2025 (arXiv: 2407.16741).[4] As of late 2025 it has 64K+ GitHub stars, 188+ contributors, 2.1K+ contributions, and an $18.8M Series A (Madrona, November 2025).[4][12][22] Adopters include AMD, Apple, Google, Amazon, Netflix, TikTok, NVIDIA, Mastercard, and VMware.[4]
OpenHands uses an event stream architecture through which user interfaces, agents, and environments interact.[4][6] The state encapsulates all relevant information for agent execution: a chronological collection of past actions and observations — agent actions, user interactions, accumulative LLM call cost, and metadata to track multi-agent delegation.[4][12]
Key finding: OpenHands' core design philosophy — "an autonomous agent is a function from event history to next event, run in a loop. Everything else (condensers, skills, sub-agents, security analyzers) is a hook into that one loop"[22] — enables deterministic replay and full audit trail of agent behavior.[6]
Hierarchical agent structures delegate subtasks to specialized agents using the AgentDelegateAction — a typed action enabling explicit handoff. Control passes explicitly, not via shared memory; the event stream is the single coordination source of truth.[4][6][12]
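A minimal sketch of that loop, assuming simplified stand-in event types rather than OpenHands' actual classes; the append-only history is what makes replay deterministic:

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str   # which agent or user produced this event
    payload: str

@dataclass
class DelegateAction(Event):
    target_agent: str = ""  # explicit handoff target; no shared memory

@dataclass
class FinishAction(Event):
    pass

def run(agent_fn, agents: dict, history: list[Event]) -> list[Event]:
    """An agent is a function from event history to the next event, run in
    a loop. The event stream is the single coordination source of truth."""
    while True:
        event = agent_fn(history)
        history.append(event)  # append-only log enables deterministic replay
        if isinstance(event, FinishAction):
            return history
        if isinstance(event, DelegateAction):
            # Control passes explicitly; the delegate reads the same stream
            # and hands control back by emitting its own FinishAction.
            run(agents[event.target_agent], agents, history)
```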
| Coordination Pattern | Mechanism |
|---|---|
| Capability-based handoff | AgentDelegateAction routes to specialized agent (e.g., BrowsingAgent for web tasks)[4] |
| Human-guided workflows | Interactive event injection into the stream[22] |
| Dynamic multi-agent composition | Coordination protocol vocabulary across agents[4] |
| Division of labor | Skill specialization per agent type[22] |
Each task session runs in a securely isolated Docker container sandbox containing a bash shell, Jupyter IPython Server, and a Chromium browser (Playwright-based).[6][12] An OpenHands Action Execution API server inside each sandbox listens for requests and returns results as observations. Agents share no runtime state by default.[6]
| Action Type | Description |
|---|---|
| `IPythonRunCellAction` | Executes arbitrary Python code[6] |
| `CmdRunAction` | Runs bash commands[6] |
| `BrowserInteractiveAction` | Web browsing via domain-specific language[6] |
| `edit_file` (skills library) | Precise line-range modifications rather than whole-file overwrites[6] |
| Feature | V0 | V1 |
|---|---|---|
| Architecture | Monolithic, sandbox-centric | Modular SDK with clear boundaries[4][12] |
| State model | Flat | Event-sourced with deterministic replay[22] |
| Sandboxing | Mandatory | Opt-in[12] |
| Tool system | Internal | Typed + MCP integration[12] |
| Scale support | Single session | Native distributed deployment to thousands of agents in cloud[4] |
| Reconnection | None | Automatic reconnection + state synchronization[12] |
| Benchmark | Score | Configuration |
|---|---|---|
| SWE-Bench Lite (at publication) | 26% | CodeActAgent v1.8 + claude-3.5-sonnet[6] |
| SWE-Bench Verified (2026) | 72% | Claude Sonnet 4.5 + extended thinking[4][12] |
| GAIA (validation set) | 67.9% | —[4][12] |
| HumanEvalFix (0-shot) | 79.3% | gpt-4o[6] |
| WebArena | 15.3% | BrowsingAgent + claude-3.5-sonnet[6] |
MetaGPT (arXiv: 2308.00352, ICLR 2024 Oral — top 1.8%) is an open-source multi-agent framework that encodes software company SOPs into prompt sequences.[5][13][23] It accepts a one-line requirement and outputs user stories, competitive analysis, requirements, data structures, APIs, and documents. It has 40K+ GitHub stars and an Apache 2.0 license.[5]
MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences. The core problem addressed: naively chaining LLMs causes cascading hallucinations due to logic inconsistencies. Structured SOPs prevent this by enforcing verification at each handoff.[5][13][23]
| Role | Input | Output | Verification |
|---|---|---|---|
| ProductManager | One-line user requirement | PRD (Product Requirements Document) | Human or agent review at handoff[5] |
| Architect | PRD | Technical spec, system architecture diagrams, interface definitions | Architectural review[13] |
| ProjectManager | Spec | Task list; code files as task assignments | Scope validation[13] |
| Engineer | Task specification | Implementation code | Execution feedback loop[5] |
| QA Engineer | Code | Unit tests; bug fix instructions | Test pass/fail[5][13] |
Communication protocol: publish-subscribe mechanism for information sharing and updates.[5][13]
Key finding: MetaGPT's sequential design is an intentional coordination strategy — it mirrors waterfall development to prevent merge conflicts and consistency failures. No two agents edit the same artifact simultaneously; each agent receives complete, stable input before starting.[13]
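A minimal sketch of the waterfall handoff under these constraints; the role callables and `verify` gates are hypothetical stand-ins, not MetaGPT's API:

```python
from typing import Callable

# Each role is (produce, verify): verification gates at every handoff stop
# cascading hallucinations from propagating downstream.
Role = tuple[Callable[[str], str], Callable[[str], bool]]

def sop_pipeline(requirement: str, roles: list[Role]) -> str:
    artifact = requirement
    for produce, verify in roles:
        candidate = produce(artifact)  # exactly one writer per artifact at a time
        if not verify(candidate):
            raise ValueError("handoff verification failed; halt before drift compounds")
        artifact = candidate           # next role receives complete, stable input
    return artifact

# Usage sketch: sop_pipeline("one-line requirement",
#     [product_manager, architect, project_manager, engineer, qa])
```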
| Property | Sequential (MetaGPT) | Parallel (MultiDevin, /fleet) |
|---|---|---|
| Merge conflicts | Eliminated (by design)[13] | Risk present; deferred to merge time |
| Consistency failures | Caught at handoff | Require post-hoc reconciliation |
| Throughput | Lower | Higher (when tasks are independent) |
| Interdependency handling | Natural (sequential ordering) | Requires explicit decomposition |
| Failure Mode | Description |
|---|---|
| Assistant repeated instruction | Agent repeats instructions instead of executing them[13] |
| Infinite loop of message | Agents get stuck in recursive message exchange[13] |
| Metric | Value |
|---|---|
| HumanEval Pass@1 | 85.9% (state-of-the-art at publication)[5][23] |
| MBPP Pass@1 | 87.7% (state-of-the-art at publication)[5][23] |
| MBPP executive feedback improvement | +5.4% absolute[5][13] |
| Experimental task completion rate | 100%[5][13][23] |
| AFlow (Jan 2025) — ICLR 2025 Oral rank | #2 in LLM-based Agent category[23] |
| MGX (Feb 2025) | "World's first AI agent development team"[5][23] |
GitHub launched Copilot Workspace as a technical preview in April 2024 — a browser-based environment that turned a plain-English GitHub issue into a spec, plan, and code changes via a four-stage workflow.[7][14][24] By September 2025, GitHub rebuilt those learnings into the Copilot Coding Agent (GA to all paid subscribers), incorporating a sub-agent architecture, issue-to-PR async workflow, GitHub Actions as execution environment, and isolated environments respecting repository access scopes.[7][14]
| Stage | Output |
|---|---|
| 1. Task definition | Parsed GitHub issue[7] |
| 2. Specification generation | Natural-language spec[14] |
| 3. Plan generation | Files to create / modify / delete[14] |
| 4. Implementation | Code changes in isolated environment[7] |
Agent Mode rolled out to VS Code, JetBrains, Eclipse, and Xcode.[14][24] It independently translates ideas into code, automatically identifies subtasks, executes across multiple files, and self-corrects on lint errors and test failures.
| Tool Available to Agent Mode | Function |
|---|---|
| `read_file` | Read file contents[14] |
| `list_dir` | Enumerate directory[14] |
| `run_terminal` | Execute shell commands[14] |
| `apply_edit` | Apply code modifications[14] |
Known limitation: Sub-agents in Copilot Agent Mode (IDE) cannot currently run in parallel — they execute sequentially.[14][24]
The /fleet slash command breaks complex requests into smaller tasks and runs them in parallel.[7][14][24] The main Copilot agent acts as an orchestrator, dispatching parallel subagents — by default using a low-cost AI model, overridable to custom agents via @CUSTOM-AGENT-NAME.
Key finding: Sub-agents in /fleet share a filesystem with no file locking.[14] Work partitioning to avoid conflicts is entirely the user's responsibility — the framework provides no automatic conflict prevention for parallel agents modifying the same files.
| Good Use Case | Problematic Use Case |
|---|---|
| Refactoring across multiple independent files[7] | Two agents touching the same shared file |
| Documentation for several components[7] | Interdependent API + frontend work[14] |
| Features spanning API/UI/tests (if independent) | Tasks without explicit dependency tracking |
| Independent code modifications not sharing state[24] | Any unpartitioned shared-state work |
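Since partitioning is the user's responsibility, a cheap pre-dispatch guard is to check that the planned file sets of parallel tasks are pairwise disjoint. The sketch below is a hypothetical harness-side check, not a Copilot feature:

```python
from itertools import combinations

def find_overlaps(tasks: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of tasks whose planned file sets intersect."""
    return [
        (a, b, tasks[a] & tasks[b])
        for a, b in combinations(tasks, 2)
        if tasks[a] & tasks[b]
    ]

planned = {
    "refactor-auth": {"src/auth.py", "src/session.py"},
    "write-docs":    {"docs/api.md"},
    "fix-session":   {"src/session.py"},  # collides with refactor-auth
}
for a, b, files in find_overlaps(planned):
    print(f"do not parallelize {a!r} and {b!r}: both touch {sorted(files)}")
```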
Mission Control provides a unified interface for managing multiple parallel Copilot Coding Agent tasks — assign, pick agents, watch real-time logs, steer mid-run, and jump to resulting PRs.[7][14][24]
GitHub's own documentation warns: "When assigning multiple tasks from the same repo, it's important to consider overlap: agents working in parallel can create merge conflicts if they touch the same files, so being thoughtful about partitioning work is essential."[7][14][24]
AgentHQ is a platform for building and deploying custom agents integrated with GitHub workflows, using the "Orchestra" pattern: a Conductor agent orchestrates specialized Planning, Implementation, and Code Review subagents.[24]
Copilot CLI can run in the background optionally using git worktrees for isolation. GitHub Agentic Workflows are designed with isolation, constrained outputs, and comprehensive logging.[7][14] This contrasts with the /fleet shared-filesystem model documented above, making the CLI's git worktree option an important counterpoint for tasks where stronger isolation guarantees are required.
Aider operates through a sequential two-model pipeline in architect mode: an architect model proposes how to solve the coding request, and an editor model turns that proposal into specific file editing instructions.[15] This is not concurrent multi-agent; the coordination challenge is handoff quality, not concurrency control.
Rationale for two-model design: Certain LLMs (especially reasoning models like o1) excel at reasoning but produce poor edit syntax. Separating architecture from editing eliminates hallucinated edits and malformed diff output.[15]
Isolation model: Aider "thinks in git" — every edit is a commit, every session a branch that can be reviewed, reverted, or cherry-picked. Natural isolation through git history rather than worktrees.[15][25] Aider does not run multiple concurrent agents on the same repo.
| Attribute | Value |
|---|---|
| Tool count | 0 (zero LLM-callable tools; user drives navigation)[29] |
| Context retrieval | PageRank-weighted dependency graphs[29] |
| Isolation mechanism | Git history (no worktrees); human supervision as safety boundary[15][29] |
| Benchmark (DeepSeek R1 + Claude 3.5 Sonnet) | 64% accuracy at $13.29 cost[15] |
| Role in broader ecosystem | Worker agent within orchestration systems (Claude Squad, AgentsMesh, Toryo, ai-maestro, Composio)[25] |
Patchwork is a self-hosted CLI agent automating PR reviews, bug fixing, and security patching using the user's preferred LLMs (AGPL-3.0 core; Apache-2.0 for custom patchflows).[8][16][26] Key milestones: July 2024 (RTC evaluation methodology), December 2024 (official GitHub Action).
Key finding: "Parallelism is not the hard part — isolation is. If two agents can edit the same repo but you cannot review, replay, or merge their work safely, you don't have a scalable workflow — you have a faster way to create conflicts."[16]
| Layer | Mechanism | Problem Solved |
|---|---|---|
| 1. Filesystem isolation | Git worktrees — one task → one branch → one worktree → one agent[8][26] | Concurrent file overwrites |
| 2. State isolation | Separate database names per worktree[8] | Interleaved test data |
| 3. Runtime isolation | Deterministically assigned dev server ports[8][26] | Port collision between agents |
Coordination mechanism: architect agent plans work, manager breaks into tasks, engineers execute in isolated environments, kanban board tracks state, tests gate completion.[26]
Practical scaling ceiling: 5–7 concurrent agents on a modern laptop before rate limits, disk consumption, and merge review overhead cancel throughput gains. Six agents on a 2 GB codebase consume 30+ GB of disk (≈5 GB per worktree).[8][26]
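A sketch of how the three layers might be provisioned per task; the helper below derives a database name and a deterministic dev-server port from the task id, and is an illustration rather than Patchwork's CLI:

```python
import hashlib
import subprocess

BASE_PORT = 4000

def provision(task_id: str, repo: str = ".") -> dict:
    """One task -> one branch -> one worktree -> one agent, plus a
    per-task database name and a deterministically assigned port."""
    branch = f"agent/{task_id}"
    worktree_path = f"../wt-{task_id}"
    # Layer 1: filesystem isolation via a dedicated worktree on a fresh branch.
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, worktree_path],
        check=True,
    )
    digest = int(hashlib.sha256(task_id.encode()).hexdigest(), 16)
    return {
        "worktree": worktree_path,
        "database": f"app_test_{task_id}",  # Layer 2: state isolation
        # Layer 3: runtime isolation; hash-derived ports can collide,
        # so a real system would also reserve them.
        "port": BASE_PORT + digest % 1000,
    }
```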
AutoDev is a fully automated AI-driven software development framework (arXiv: 2403.08299, Microsoft Research) for autonomous planning and execution of complex software engineering tasks.[17][30] It builds on AutoGen and Auto-GPT, adding direct repository interaction and filling the gap left by GitHub Copilot, which is constrained to code-snippet suggestions. AutoDev's LLM-agnostic Agent Scheduler lets models of diverse sizes and architectures collaborate on the same task, shifting the developer's role from manual code validator to multi-agent supervisor.[17][30]
| Component | Function |
|---|---|
| Conversation Manager | Supervises dialogue between user, AI agents, and system; manages interruptions[17] |
| Agent Scheduler | Schedules and orchestrates agents to collaborate; employs various collaboration algorithms[17][30] |
| Tools Library | File editing, retrieval, build, execution, testing, git operations[17] |
| Evaluation Environment | Secure Docker container; runs commands, abstracts low-level complexity, closes the feedback loop[17] |
Isolation model: All operations occur within Docker containers, isolating them from the host. The architecture does not describe explicit file-level locking or worktree isolation for agents running in parallel within a session — Docker isolation is per-session, not per-agent within a session.[17]
| Metric | Score |
|---|---|
| Code generation Pass@1 (HumanEval) | 91.5% (second-best; best requiring no extra training data)[17][30] |
| Test generation Pass@1 | 87.8% with 99.3% coverage from passing tests[17][30] |
| Languages supported | Java, Kotlin, JavaScript/TypeScript, Rust, Python, Go, C/C++[17] |
Note: Detailed git worktree internals (two-pointer model, branch exclusivity, object-store sharing) are covered in the Git Worktree Mechanics pillar; this section focuses on comparative isolation tradeoffs specifically relevant to agent system design.
The four failure modes that worktree isolation prevents are well-documented across the corpus:[2][20]
| # | Failure Mode | Mechanism |
|---|---|---|
| 1 | Concurrent File Overwrites | One agent's changes silently overwrite another's without detection until merge — untracked data loss[2] |
| 2 | Context Contamination | Agents in separate context windows are unaware of peer changes — Agent A's refactoring invalidates Agent B's assumptions mid-task[2] |
| 3 | Race Conditions on Shared State | Multiple agents independently trigger expensive operations (builds, tests), causing resource thrashing[2] |
| 4 | Git Lock Contention | Concurrent git operations fail with fatal errors on .git/index.lock; agents don't gracefully recover, leaving stale locks requiring manual intervention[2] |
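Failure mode 4 can also be softened at the harness level when isolation is skipped: bounded retries with backoff on `.git/index.lock` contention. A sketch, not taken from any of the surveyed systems:

```python
import subprocess
import time

def git_with_retry(args: list[str], attempts: int = 5, delay: float = 0.5) -> str:
    """Retry git commands that fail on .git/index.lock contention,
    since agents rarely recover from stale locks on their own."""
    for attempt in range(attempts):
        result = subprocess.run(["git", *args], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if "index.lock" not in result.stderr:
            raise RuntimeError(result.stderr)  # a real error, not contention
        time.sleep(delay * (2 ** attempt))     # exponential backoff, then retry
    raise TimeoutError(f"git {args[0]}: lock contention after {attempts} attempts")
```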
A git worktree enables multiple working directories from the same repository. Each worktree has its own working directory, private HEAD pointer, and private staging index, but shares the same .git object store (no duplication of history).[2] The two-pointer model: $GIT_DIR points to each worktree's private directory; $GIT_COMMON_DIR points to the shared .git.[2]
Key finding: "Parallel agents without isolation is not acceleration. It is entropy."[19] Git simultaneously acts as isolation mechanism (worktrees separate agents), integration boundary (deliberate merges via PRs), conflict detection (surfaces overlapping changes at merge time), and rollback capability (failed branches can be discarded).[19]
Branch exclusivity enforced by git: the same branch cannot be checked out in more than one worktree simultaneously by default.[2] Merge conflicts are deferred from runtime to merge time, where standard git tooling detects them as visible conflicts rather than silent overwrites.[1][2]
| Strategy | Git Objects | Creation Time | Filesystem Isolation | Runtime Isolation | Disk Cost |
|---|---|---|---|---|---|
| Worktrees | Shared | Seconds | File-level | None | Low[2] |
| Docker containers | Per-image layers | Minutes | Full with namespaces | Complete | High[2] |
| Separate clones | Duplicated | Minutes | Full | None | Very high[2] |
| Sequential checkout | Shared | Instant | None | N/A | N/A[2] |
| Copy-on-Write | Shared | Instant | Full | None | Very low[2] |
Worktrees excel for code-only parallel generation with 3–5 concurrent agents. Docker wins when agents need full runtime isolation (separate ports, databases, network namespaces).[2]
| Limitation | Detail |
|---|---|
| No runtime isolation | Worktrees share local databases, Docker daemon, cache directories, ports — requires three-layer stack for full isolation[1][2][8] |
| No cross-worktree conflict warnings | Git provides no alerts when worktrees modify identical files on different branches[2][16] |
| Shared git hooks | Hooks in .git/hooks/ execute in all worktrees; pre-commit assumptions may not hold in fresh worktrees[1][2] |
| Submodule multiplication | Each worktree gets its own submodule set, multiplying disk usage[2] |
| IDE gaps (historical) | Full worktree support arrived in JetBrains 2026.1 and VS Code July 2025[2] |
| Monorepo performance | File watchers and build tools in each worktree compound I/O; git sparse-checkout can constrain scope[2] |
Even with perfect execution-time isolation, merging remains sequential. Three common integration patterns are in use, and none of them eliminates merge conflicts when agents modify the same shared files; they only defer conflict detection to merge time.[16]
See also: Git Worktree Mechanics (separate pillar); Coordination Failures (Section 9)

Worktree isolation improves execution for independent tasks but cannot resolve file-level dependencies between concurrent agents. If Agent A builds an API while Agent B builds a frontend consuming that API, these must be sequenced, not parallelized.[2][8][10][26]
Key finding: "The secret to building robust, performant systems is the topology of coordination and not simply adding more agents to the task."[25] A good architect agent that breaks 'Build auth system' into well-scoped, independent subtasks will outperform six engineers working on poorly defined work.[26]
| Tier | Agent Role | Function |
|---|---|---|
| 1 | Coordinator Agent | Plans work, reviews specs before implementation, decomposes tasks into dependency-ordered waves[2][20] |
| 2 | Specialist Agents (6 personas) | Investigate, Implement, Verify, Critique, Debug, Code Review[2] |
| 3 | Verifier Agent | Quality gate: checks results against spec for inconsistencies, bugs, missing pieces[2] |
Living Spec: A shared coordination artifact all agents continuously reference — "the source of truth that keeps all participants aligned."[2] Context Engine: Semantic codebase indexing shared across all agents, supporting 400,000+ file repositories.[2]
| Mechanism | How It Works |
|---|---|
| Plan approval | Teammates submit implementation plans before coding; leads approve or reject[10] |
| Lifecycle hooks | Automated checks (lint, tests) before task completion[10] |
| Task dependencies | Explicit blocking relationships prevent out-of-order execution[10] |
| Token budgeting | Hard per-agent limits; auto-pause at 85% consumed[10] |
| Kill criteria | Reassign agents stuck 3+ iterations on identical errors[10] |
Recommended team size: "Three to five teammates is the sweet spot."[10]
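Two of these mechanisms reduce to a few lines of harness code: the 85% token auto-pause and the three-strike kill criterion. Class and exception names below are hypothetical:

```python
class PauseAgent(Exception): ...
class ReassignAgent(Exception): ...

class Guardrails:
    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.tokens_used = 0
        self.error_counts: dict[str, int] = {}

    def record_usage(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used >= 0.85 * self.token_budget:  # auto-pause threshold
            raise PauseAgent("85% of token budget consumed; pausing for review")

    def record_error(self, signature: str) -> None:
        self.error_counts[signature] = self.error_counts.get(signature, 0) + 1
        if self.error_counts[signature] >= 3:             # kill criterion
            raise ReassignAgent(f"stuck 3+ iterations on: {signature}")
```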
| Topology | Description | Coordination Overhead |
|---|---|---|
| Single-Agent System (SAS) | Baseline — one agent, no coordination | None |
| Independent MAS | Agents work without coordination | None (at cost of consistency) |
| Decentralised MAS | Peer-to-peer coordination | Medium |
| Centralised MAS | Orchestrator coordinates all agents | High (orchestrator bottleneck) |
| Hybrid MAS | Mix of central and peer coordination | Medium-high |
(Source for the topology taxonomy above: CodeCRDT research, 600 trials using Claude Sonnet)[25]
| Approach Tried | Result | Root Cause |
|---|---|---|
| Equal-status agents with locking | Failed — 20 agents slowed to throughput of 2–3[28] | Agents held locks too long |
| Optimistic concurrency control | Failed — agents became risk-averse, avoided hard tasks[28] | Conflict cost changed agent behavior |
| Planners + Workers + Judges (hierarchical) | Successful[28] | Role differentiation enables scale |
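A minimal sketch of the topology that worked, assuming hypothetical `plan`, `work`, and `judge` callables; the point is that coordination happens through role structure, not locks:

```python
from concurrent.futures import ThreadPoolExecutor

def run_hierarchy(goal: str, plan, work, judge, max_workers: int = 8):
    """Planner decomposes into disjoint tasks; workers run them in
    parallel without locks; judges gate every result before merge."""
    tasks = plan(goal)  # decomposition quality, not agent count, decides the outcome
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(work, tasks))
    verdicts = [(r, judge(r)) for r in results]
    accepted = [r for r, ok in verdicts if ok]
    rejected = [r for r, ok in verdicts if not ok]  # re-planned, never merged blind
    return accepted, rejected
```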
For solo agent sessions, a five-step stateless-but-iterative cycle is recommended, with external memory persisting through git history, progress logs, task state files, and AGENTS.md.[10] Research note: "LLM-generated AGENTS.md files offer no benefit and can marginally reduce success rates (~3%) while increasing costs over 20%." Developer-written context provides ~4% improvement.[10]
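A sketch of the external-memory side of that cycle, with hypothetical file names: state is rehydrated from disk at session start and checkpointed into git so the next session (or agent) can resume.

```python
import json
import subprocess
from pathlib import Path

STATE = Path("task_state.json")  # hypothetical task state file
LOG = Path("progress.log")       # hypothetical progress log

def load_state() -> dict:
    """Sessions are stateless; memory is rehydrated from disk."""
    return json.loads(STATE.read_text()) if STATE.exists() else {"done": [], "next": []}

def checkpoint(state: dict, note: str) -> None:
    """Persist progress to files and git history, the durable memory layers."""
    STATE.write_text(json.dumps(state, indent=2))
    with LOG.open("a") as log:
        log.write(note + "\n")
    subprocess.run(["git", "add", str(STATE), str(LOG)], check=True)
    subprocess.run(["git", "commit", "-m", f"checkpoint: {note}"], check=True)
```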
Empirical research demonstrates a "curse of coordination": agent cooperation performs significantly worse than a single agent given the same total workload.[28]
Key finding: Multi-agent configurations degrade performance by 39 to 70 percent relative to single-agent baselines. Inter-agent misalignment is identified as the primary failure category.[28]
CooperBench tested 2–4 agent configurations on SWE-bench-style tasks; the 39–70% degradation range varies by agent count and task coupling, with higher agent counts and tighter task coupling producing worse outcomes.[28]
CodeCRDT proposes observation-driven coordination — agents coordinate by monitoring a shared state with observable updates, using deterministic convergence rather than explicit message passing.[25]
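A loose sketch of the observation-driven idea: agents watch a shared versioned document and act when it advances, instead of exchanging messages. The version counter below is a crude stand-in for CRDT convergence, and all names are hypothetical:

```python
import threading
import time

class ObservableState:
    """Shared state that agents observe; updates are visible, not messaged."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._data: dict[str, str] = {}

    def update(self, key: str, value: str) -> None:
        with self._lock:
            self._data[key] = value
            self._version += 1  # monotonic counter lets observers detect change

    def snapshot(self) -> tuple[int, dict[str, str]]:
        with self._lock:
            return self._version, dict(self._data)

def agent_loop(state: ObservableState, act) -> None:
    """Coordinate by observing: act only when shared state has advanced."""
    seen = -1
    while True:
        version, data = state.snapshot()
        if version != seen:
            seen = version
            if act(data):  # act returns True once this agent's work is done
                return
        time.sleep(0.05)   # poll; a real system would subscribe to updates
```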
| Outcome | Result (600 trials, Claude Sonnet) |
|---|---|
| Best case (some task types) | Up to 21.1% speedup with parallel coordination[25] |
| Worst case (other task types) | Up to 39.4% slowdown[25] |
Whether multi-agent coordination helps or hurts depends heavily on task structure — not simply on adding more agents.[25]
| Finding | Data |
|---|---|
| Merge conflicts in Agentic-PR rejections | >1.1% of all rejections[28] |
| Root cause of some failures | Superficial differences (formatting, indentation) rather than substantive conflicts[28] |
| Failure Mode | Description |
|---|---|
| Coherence degradation | "Lost in the middle" phenomenon as context grows[28] |
| Architectural drift | Agents make locally sensible but globally inconsistent decisions[28] |
| Pattern violation | Agents suggest or use deprecated APIs[28] |
| Staleness | Index updates lag behind rapid development[28] |
| Task-scoped isolation failure | Two workers independently modify the same file when tasks are not properly scoped[11] |
Note: MetaGPT-specific failures (assistant repeated instruction, infinite loop of message) — see Section 4.
| Benchmark | Scope | Top Model Score |
|---|---|---|
| SWE-Bench Verified | Single-issue tasks | >70%[10] |
| SWE-Bench Pro | Multi-file patches averaging 107 lines across 4+ files | ~23%[9] |
This 3×+ performance drop demonstrates the need to decompose work into smaller, testable units that stay within each agent's accuracy range.[10]
| Solution | Components | Source |
|---|---|---|
| Claude Code shared task list pattern | Status flags (lock claims) + git worktrees (isolate edits) + dependency markers (sequence constrained work)[28] | [28] |
| Centralized orchestrator with per-agent worktrees | Orchestrator dispatches; each agent gets its own working copy (e.g., Nevo production system)[28] | [28] |
| Plan approval before coding | Prevents architectural mistakes before code exists[10] | [10] |
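The first row's pattern reduces to a small shared data structure: status flags act as lock claims and dependency markers sequence constrained work. The schema below is a hypothetical illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    status: str = "open"  # open -> claimed:<agent> -> done; the flag is the lock claim
    depends_on: list[str] = field(default_factory=list)

def claim_next(tasks: dict[str, Task], agent: str) -> Task | None:
    """Claim the first open task whose dependencies are all done."""
    for task in tasks.values():
        unblocked = all(tasks[d].status == "done" for d in task.depends_on)
        if task.status == "open" and unblocked:
            task.status = f"claimed:{agent}"  # prevents a second agent double-claiming
            return task
    return None  # nothing unblocked: wait for a dependency to finish, or exit
```

In practice the task list would live in a shared file or database with atomic updates; the in-memory dict just shows the claim-and-dependency logic.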
| Tool | Capability |
|---|---|
| JetBrains AI Assistant | Integrated merge conflict suggestions[28] |
| VS Code 1.105 (Sept 2025) | AI-assisted merge conflict resolution using merge base and both branch changes as context[28] |
| GitKraken Desktop | "Auto-resolve with AI" with explanations[28] |
| Graphite | AI merge conflict resolution guidance[28] |
| Resolve.AI | Dedicated merge conflict tool[28] |
| CodeGPT | Intelligent merge resolution[28] |
Common mistake: launching maximum agents immediately without learning coordination patterns, producing overwhelming complexity and incompatible code.[25][10] Recommended: start with 2 agents on well-isolated features, master the workflow, then scale to 4, 6, 8 — task decomposition quality is the primary variable. "Swarming only works when work units are genuinely independent."[1]
See also: Concurrency Control Theory (separate pillar); Work Partitioning Patterns (Section 8)

A systematic taxonomy (arXiv: 2604.03515v2) classifies 13 open-source coding agents across three layers:[29] (1) control architecture, (2) tool/environment interface, (3) resource management. Agents occupy positions along continuous spectra — not discrete categories.
Key finding: 11 of 13 agents compose multiple primitives rather than relying on a single control structure.[29] Four core tool capability categories — Read, Search, Edit, Execute — appear in all LLM-driven agents, with tool counts ranging from 0 (Aider) to 37 action classes (Moatless Tools).[29]
| Loop Strategy | Agents | Complexity |
|---|---|---|
| Fixed pipeline | Agentless | Lowest |
| Sequential ReAct loops | 7 agents (including SWE-agent, OpenHands) | Medium |
| Phased state machine | Prometheus | Medium-high |
| Full Monte Carlo Tree Search with backpropagation | Moatless Tools | Highest |
| Driver | Mechanism | Examples |
|---|---|---|
| User-driven | Humans select files — sidesteps localization bottleneck | Aider[29] |
| Scaffold-driven | Pre-computed paths determine agent actions | Agentless, AutoCodeRover[29] |
| LLM-driven | Full tool autonomy; agent chooses actions | 9 of 13 agents surveyed[29] |
| Approach | Agents | Mechanism |
|---|---|---|
| LLM-as-Navigator | 8 agents | Grep/find tools; LLM formulates queries[29] |
| Scaffold-side understanding | 5 agents | Pre-computed code representations[29] |
| — PageRank-weighted dependency graphs | Aider | Graph-based relevance ranking[29] |
| — AST-indexed + spectrum fault localization | AutoCodeRover | Structure-aware queries; unique in corpus[29] |
| — Neo4j knowledge graphs (20 languages) | Prometheus | Cross-language symbol graph[29] |
| — FAISS embedding-based semantic search | Moatless Tools | Vector similarity[29] |
| Approach | Agents | Mechanism |
|---|---|---|
| Containerized (Docker) | 5 agents: SWE-agent, OpenHands, DARS-Agent, AutoCodeRover, Prometheus | In-container FastAPI servers (OpenHands); full namespace isolation[29] |
| Shadow git checkpoints | Cline | Rollback without touching user history[29] |
| OS-level sandboxing | Codex CLI | Bubblewrap/Landlock + LLM guardian safety scoring (0–100 scale, 80-point threshold)[29] |
| Rule-based policy engine | Gemini CLI | Per-tool approval requirements[29] |
| Human supervision | Aider | Safety boundary is the human operator[29] |
| Stateless subshells | mini-swe-agent | Process-level isolation per operation[29] |
| Agent | Mechanism |
|---|---|
| DARS-Agent | Full Docker reset; replays all actions from root[29] |
| Moatless Tools | Shadow mode — tracks modifications in-memory without filesystem writes[29] |
| Dimension | SWE-agent | Aider | OpenHands | Prometheus |
|---|---|---|---|---|
| Loop strategy | Sequential ReAct | User-driven | Sequential ReAct | Phased state machine |
| Tool count | 3–35 (bundled) | 0 (text-parsed) | 9+ (MCP-enabled) | 17 (per-node scoped) |
| Context retrieval | Keyword search | PageRank graph | Keyword search | Neo4j knowledge graph |
| Isolation | Docker | None (user trust) | Docker + HTTP API | Docker |
| Compaction strategy | Polling triggers | Summarization | Request-based | Not detailed |
Source: arXiv 2604.03515v2[29]
Seven distinct strategies are identified across the 13 agents.[29]
| System | Isolation Mechanism | Coordination Pattern | Max Parallel Scale | Open / Closed | Key Limitation |
|---|---|---|---|---|---|
| Devin | Isolated VM per instance[3][11] | Manager assigns to workers; manager merges results[11] | 10 workers (MultiDevin)[11] | Closed | Task-level isolation only — workers can modify the same file if tasks are mis-scoped[11] |
| SWE-agent | Docker container per issue[27][29] | Single-agent per issue; no built-in parallel execution[27] | 1 (single-agent design)[27] | Open | No multi-agent parallelism; parallel scale requires external orchestration[9] |
| OpenHands | Docker container per session + event-stream state[6][12] | Hierarchical delegation via AgentDelegateAction[4] | Cloud-scale (V1 SDK: native distributed deployment)[4][12] | Open | Automatic workflow generation still requires substantial handcrafting[22] |
| MetaGPT | Sequential handoffs; no concurrent artifact editing[13] | Sequential SOP pipeline (ProductManager → Architect → Engineer → QA)[5] | 1 (intentionally sequential)[13] | Open | No parallelism by design; lower throughput than parallel systems[13] |
| GitHub Copilot /fleet | Shared filesystem, no file locking (/fleet); optional git worktrees in CLI[14] | Orchestrator dispatches parallel subagents[7][14] | Parallel (no stated maximum); Mission Control manages multiple tasks concurrently[14][24] | Closed | No automatic file-conflict prevention in /fleet; partitioning is the user's responsibility[14] |
| Aider | Git history / user-supervised (no worktrees)[15][25] | Sequential two-model pipeline (architect → editor)[15] | 1 (no concurrent agents on same repo)[15] | Open | Not designed for parallel execution; used as worker agent in external orchestration[25] |
| Patchwork | Three-layer stack: git worktrees + database branches + port namespacing[8][26] | Architect plans → manager decomposes → engineers execute in isolated worktrees[26] | 5–7 practical ceiling (rate limits + disk + review overhead)[8][26] | Open | Disk and rate-limit ceiling; ~5 GB per worktree on a 2 GB codebase[8] |
| AutoDev | Docker container per session (session-scoped, not per-agent within a session)[17] | Agent Scheduler orchestrates LLM-agnostic multi-model collaboration[17][30] | Session-scoped (parallel scale not specified)[17] | Open | No explicit per-agent file-level locking or worktree isolation within a session[17] |
| System | Benchmark | Score | Date / Configuration |
|---|---|---|---|
| OpenHands | SWE-Bench Verified | 72%[4][12] | 2026, Claude Sonnet 4.5 + extended thinking |
| OpenHands | SWE-Bench Lite | 26%[6] | At publication, CodeActAgent v1.8 + claude-3.5-sonnet |
| OpenHands | GAIA (val set) | 67.9%[4] | 2025 |
| MetaGPT | HumanEval Pass@1 | 85.9%[5][23] | At ICLR 2024 publication |
| MetaGPT | MBPP Pass@1 | 87.7%[5][23] | At ICLR 2024 publication |
| AutoDev | HumanEval Pass@1 | 91.5%[17][30] | 2024 (best requiring no extra training data) |
| AutoDev | Test generation Pass@1 | 87.8%[17] | 99.3% coverage from passing tests |
| SWE-agent | SWE-bench (pass@1) | 12.5%[27] | NeurIPS 2024, best open-source at publication |
| SWE-agent | HumanEvalFix (pass@1) | 87.7%[27] | NeurIPS 2024 |
| SWE-agent (mini) | SWE-bench Verified | >74%[27] | 100 LOC implementation |
| SWE-Edit (decomposed) | SWE-bench Verified (vs. baseline) | +2.1% / −17.9% cost[9] | 2025 |
| Aider (DeepSeek R1 + Claude 3.5 Sonnet) | Internal benchmark | 64% at $13.29[15] | Architect mode |
| Frontier models (general) | SWE-Bench Verified | >70%[10] | 2025 state of the art |
| Frontier models (general) | SWE-Bench Pro (multi-file, 107 lines avg, 4+ files) | ~23%[9] | 2025 state of the art |
| Threshold | Value | Limiting Factor |
|---|---|---|
| Recommended sweet spot (most repos) | 3–5 concurrent agents[8] | Merge review overhead begins to exceed throughput gains |
| Productive ceiling (modern laptop) | 5–7 concurrent agents[8][26][25] | Rate limits, disk consumption, review overhead |
| Disk consumption (2 GB codebase) | ~30+ GB for 6 worktree agents (~5 GB each)[8][26] | Submodule multiplication + per-worktree indexes |
| Cursor 2.0 (Oct 2025) supported agents | Up to 8 concurrent[25] | Product-imposed limit |
| Organizations reporting workflow improvement | 20–30% faster cycles with multi-agent setups[25] | Dependent on task partitioning quality |
| Rate limit impact | 10 Claude Code instances hit Anthropic rate limits faster than 1[25] | API throughput constraints |
"The bottleneck is no longer generation. It's verification."[10] Three-layer verification approach recommended:
Human review remains mandatory — agents generate volume quickly, but determining correctness requires full system context.[10]
| Tool | Key Feature | Isolation Mechanism |
|---|---|---|
| Parallel Code | Each task gets own git branch and worktree[16] | Git worktrees |
| Superset IDE | Run 10+ agents simultaneously[16] | Git worktrees |
| Composio Agent Orchestrator | Multiple agents, each with its own PR, supervised dashboard[16] | Git worktrees + PRs |
| Conductor (Mac) | Multiple parallel agents, clean separation[16] | Git worktrees |
| Baton (mraza007) | Polls GitHub Issues, runs Claude Code in isolated worktrees[16] | Git worktrees |
| Claude Squad | Orchestrates Claude Code, Aider, Codex simultaneously[16] | Git worktrees + tmux |
| Strategy | Advantage | Disadvantage |
|---|---|---|
| Worktree per task | Short-lived; no state accumulation[2] | Zero cache reuse; cold dependency installs for each task |
| Worktree per agent | Warm dependency caches; faster task startup[2] | Long-lived worktrees accumulate state; harder cleanup |