Pillar: existing-agent-systems | Date: May 2026
Scope: Prior art survey of tools that run multiple AI coding agents concurrently on shared repos: Devin, SWE-Agent, AutoDev, MetaGPT, Patchwork, Aider multi-agent, GitHub Copilot Workspace, OpenHands (OpenDevin). How they partition work, what isolation guarantees they provide, and what coordination failures they document or avoid.
Sources: 30 gathered, consolidated, synthesized.
Central finding: Empirical research (CooperBench, arXiv 2601.13295) measured a 39–70% performance degradation when multi-agent configurations attempt the same workload as a single agent — the "curse of coordination" — making task decomposition quality, not agent count, the primary determinant of whether parallelism helps or harms.[28]
The field divides sharply between systems that treat isolation as a first-class concern and those that defer it to the user. GitHub Copilot's /fleet command is the starkest example of the latter: sub-agents share a filesystem with no file locking, and GitHub's own documentation warns that "being thoughtful about partitioning work is essential" to avoid merge conflicts.[7][14] At the other extreme, Patchwork employs a three-layer stack — git worktrees for filesystem isolation, per-worktree database names for state isolation, and deterministically assigned dev-server ports for runtime isolation — but hits a practical ceiling of 5–7 concurrent agents before rate limits, disk consumption (~5 GB per worktree on a 2 GB codebase, totaling 30+ GB for 6 agents), and merge-review overhead cancel the throughput gains.[8][26]
No production system fully solves the concurrency problem. MultiDevin (Cognition AI, 2024) scales to 1 manager plus 10 workers, each in an isolated VM, but provides only task-level isolation — two workers can independently modify the same file if tasks are mis-scoped, leaving conflict resolution to a sequential final merge by the manager.[11] AutoDev (Microsoft Research, arXiv 2403.08299) uses Docker isolation per session, not per agent within a session, leaving intra-session concurrent writes without file-level protection.[17] Devin 2.0 (April 3, 2025) improved PR merge rates from 34% to 67% and delivered 83% more junior-level tasks per ACU compared to 1.x, yet complex end-to-end task completion remains in the single-digit to low-double-digit percent range.[3][21]
The benchmark record exposes a consistent complexity cliff. Frontier models score above 70% on SWE-Bench Verified (single-issue tasks), but drop to roughly 23% on SWE-Bench Pro — a data-contamination-resistant benchmark requiring multi-file patches averaging 107 lines across 4+ files.[9][10] OpenHands reached 72% on SWE-Bench Verified (2026, Claude Sonnet 4.5 with extended thinking), up from 26% on SWE-Bench Lite at publication.[4][6] Meanwhile, Mini-SWE-Agent — a 100-line implementation — scores above 74% on SWE-Bench Verified, demonstrating that much of the complexity in full SWE-agent is not load-bearing for benchmark performance.[27] SWE-Edit's decomposed Viewer/Editor subagent architecture improved over baseline by 2.1% while reducing inference cost by 17.9%, validating specialization as a cost-efficiency lever even at modest accuracy gains.[9]
Coordination topology matters more than agent count. CodeCRDT (arXiv 2510.18893) ran 600 trials using Claude Sonnet and found that parallel coordination produces outcomes ranging from a 21.1% speedup to a 39.4% slowdown depending on task structure alone.[25] Cursor's internal experimentation documented the failure modes directly: equal-status agents with locking made 20 agents perform at the throughput of 2–3 because agents held locks too long; optimistic concurrency control made agents risk-averse and avoidant of hard tasks; the architecture that worked was Planners + Workers + Judges — role differentiation as the scaling mechanism.[28] A systematic taxonomy of 13 open-source agents (arXiv 2604.03515v2) found that 11 of 13 compose multiple control primitives rather than relying on a single loop strategy, with tool counts ranging from 0 (Aider) to 37 action classes (Moatless Tools).[29]
Sequential architecture is not a concession — it is a deliberate coordination strategy. MetaGPT (ICLR 2024 Oral, top 1.8%) encodes a five-role waterfall pipeline (ProductManager → Architect → ProjectManager → Engineer → QA) in which no two agents ever write to the same artifact concurrently, eliminating merge conflicts by construction rather than detection.[5][13] This trades throughput for consistency and achieved 85.9% HumanEval Pass@1 and 87.7% MBPP Pass@1 at publication.[5] The core insight, validated across systems: naively chaining LLMs produces cascading hallucinations from logic inconsistencies; structured handoffs with verification gates at each role boundary prevent this.[13]
The isolation mechanism chosen has decisive downstream consequences. Git worktrees spin up in seconds and share the object store (low disk cost), providing file-level isolation with branch exclusivity enforced by git — the same branch cannot be checked out in more than one worktree simultaneously.[2] Docker containers provide full namespace isolation including ports, databases, and network, but take minutes to create and multiply disk usage per agent. The four failure modes worktrees prevent — concurrent file overwrites, context contamination, race conditions on shared state, and git index lock contention — are all documented in production systems that skipped isolation.[2] Worktrees, however, provide no runtime isolation and no cross-worktree conflict warnings; full-stack agents (running dev servers, test databases) require the three-layer stack that Patchwork implements.[8] Full worktree support arrived in JetBrains 2026.1 and VS Code July 2025, making the tooling ecosystem for this pattern newly mature.[2]
Verification has overtaken generation as the bottleneck. "The bottleneck is no longer generation. It's verification."[10] Organizations with working multi-agent setups report 20–30% faster cycles, but only when task partitioning quality is high.[25] Developer-written AGENTS.md files compound learning across sessions and produce roughly a 4% accuracy improvement; LLM-generated equivalents cause a ~3% regression and a 20%+ cost increase.[10] The AgenticFlict dataset (arXiv 2604.03551) found that some multi-agent PR rejections stem not from substantive conflicts but from superficial formatting or indentation differences — a category addressable by pre-commit normalization rather than architectural changes.[28]
Implications for practitioners: Start with 2 well-isolated agents on genuinely independent features before scaling — the CooperBench and CodeCRDT data both show that adding agents to poorly partitioned work makes outcomes worse, not better. Use git worktrees for code-only parallel generation (3–5 agents is the validated sweet spot), and add database and port namespacing only when agents run live servers. Adopt the Planners + Workers + Judges topology rather than equal-status agents with locking — Cursor's internal research proves the latter collapses at scale. Invest human time in writing the decomposition brief and the AGENTS.md; LLM-generated context for either degrades performance. Finally, treat the 70% SWE-Bench Verified scores with caution: the 3× drop to 23% on multi-file tasks means current agents are reliable for single-issue scoped work and unreliable for anything that touches 4+ files simultaneously — which is precisely the class of work multi-agent systems are most often proposed to accelerate.
Cognition Labs launched Devin 1.0 on March 12, 2024, positioning it as the "world's first fully autonomous AI software engineer."[11][21] Devin integrates an LLM with tools, memory, and reasoning capabilities to independently plan, execute, and iterate on multi-step engineering tasks requiring thousands of decisions.[3] Devin accepts tasks via Slack or Microsoft Teams integrations and executes autonomously in a cloud sandbox, optimized for clearly scoped multi-hour tasks.[3] The most architecturally significant feature added in 2024 was MultiDevin — a manager-worker parallel execution pattern scaling to 10 concurrent agents.
MultiDevin fields one "manager" Devin and up to 10 "worker" Devins.[11] The manager distributes tasks to each worker, then merges changes from all successful workers into one branch or pull request. The design is explicitly limited to "repeated, isolated tasks like lint errors, code clean-ups, migrations, refactors" and is not suited for interdependent feature work.[11]
Key finding: MultiDevin's isolation guarantee is at the task level, not the file level — two workers can independently modify the same file if tasks are not properly scoped. The manager must reconcile all worker changes in a final sequential merge step.[11]
Devin 2.0 (released April 3, 2025) operates within a cloud-based agent-native IDE combining a code editor, terminal, sandboxed browser, and smart planning tools.[11][21] Per a technical analysis: "a cloud-based development environment that allows users to spin up multiple parallel Devin instances, each running in an isolated virtual machine."[3]
| Devin Version | Release | Multi-Agent Capability | Isolation Unit | Coordination Mode |
|---|---|---|---|---|
| Devin 1.0 | March 12, 2024 | Single autonomous agent; REST API for parallel sessions[11] | Cloud sandbox per session | None (independent) |
| MultiDevin | 2024 (Q3–Q4) | 1 manager + up to 10 workers[11] | Task-scoped (not file-scoped) | Manager merges worker output |
| Devin 2.0 | April 3, 2025 | Parallel instances; agent dispatches sub-tasks to other Devins[11][21] | Isolated VM per instance | Task-scoped; plan approval interface |
| Devin (Feb 2026) | February 2026 | Parallel sessions with improved context retention[21] | Isolated VM | Parallel session management |
| Metric | Value | Source |
|---|---|---|
| Task completion improvement, Devin 2.0 vs. 1.x | 83% more junior-level tasks per ACU | [11][21] |
| COBOL migration scale | 5 million lines across 500 GB of repositories, single agent | [3] |
| PR merge rate improvement | 34% → 67% | [3] |
| Nubank case study efficiency | 12× efficiency improvement, 20× cost savings; weeks vs. months (per Cognition-published case study) | [11] |
| Goldman Sachs pilot (July 2025) | 20% efficiency gains alongside 12,000 human developers (per Cognition-published case study) | [21] |
| Complex end-to-end task completion (early 2025) | Single-digit to low-double-digit % | [21] |
| Round | Date | Valuation |
|---|---|---|
| Series A | March 2024 | $350M[11] |
| Series B | April 2024 | $2B[11] |
| Growth round | March 2025 | ~$4B (8VC)[21] |
SWE-agent is an open-source platform developed by Princeton University's NLP group (Yang et al.), published at NeurIPS 2024 (arXiv: 2405.15793).[27] It takes a GitHub issue and automatically fixes it using an LM of choice. The system's central contribution is the Agent-Computer Interface (ACI) — the insight that LM agents benefit from specially designed software interfaces, analogous to how human developers benefit from IDEs.[9][27]
Key finding: For coding agents, exploration and precision are fundamentally at odds — a single agent cannot simultaneously optimize for comprehensive code understanding (benefits from viewing many files) AND reliable edit generation (benefits from clean, focused context).[9] This tension motivates decomposed multi-agent architectures.
SWE-agent executes code in isolated Docker environments — each issue gets its own sandboxed container. Docker-based isolation is the primary mechanism, not git worktrees.[27][29]
The SWE-Edit framework decomposes code editing into two specialized subagents to address the exploration-precision tension:[9]
| Subagent | Role | Context Characteristic |
|---|---|---|
| Viewer | Extracts task-relevant code on demand | Broad — can inspect many files |
| Editor | Executes modifications from high-level plans | Narrow — receives only what is needed for edits |
Adaptive editing mode selection uses a Qwen3-8B model (GRPO-trained) to choose between find-replace (small changes) and whole-file rewrite (complex restructuring).[9]
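To make the division concrete, here is a minimal sketch of the Viewer/Editor split in Python. The `llm` callable, prompt wording, and JSON plan format are assumptions for illustration, not SWE-Edit's actual interfaces:

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditPlan:
    file_path: str    # file the Editor will modify
    instruction: str  # high-level description of the change
    snippet: str      # the minimal code the Editor needs to see

def viewer(task: str, repo_files: dict[str, str], llm: Callable[[str], str]) -> EditPlan:
    """Broad context: may inspect many files to localize the change."""
    context = "\n\n".join(f"# {path}\n{src}" for path, src in repo_files.items())
    raw = llm(
        f"Task: {task}\n{context}\n"
        'Reply as JSON: {"file_path": ..., "instruction": ..., "snippet": ...}'
    )
    return EditPlan(**json.loads(raw))

def editor(plan: EditPlan, llm: Callable[[str], str]) -> str:
    """Narrow context: sees only the extracted snippet, never the whole repo."""
    return llm(f"Instruction: {plan.instruction}\nRewrite this snippet:\n{plan.snippet}")
```

The exploration-precision tension is visible in the signatures: `viewer` consumes the whole repository, while `editor` consumes only an `EditPlan`.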
| Metric | Result |
|---|---|
| SWE-bench Verified improvement over baseline | +2.1%[9] |
| Inference cost reduction | −17.9%[9] |
| Edit formatting reliability improvement | +3.5%[9] |
Multi-agent systems for software engineering (with specialized agents for repository navigation, bug localization, patch generation, and verification) have outpaced single-agent architectures in scalability and performance as of 2025.[9]
| Year | Dominant Pattern | Example |
|---|---|---|
| 2024 | Single-agent designs with custom ACIs | SWE-agent original |
| 2025 | Decomposed multi-agent architectures with specialized roles | SWE-Edit (Viewer + Editor), Agentless |
| Benchmark | Score |
|---|---|
| SWE-bench (pass@1, NeurIPS 2024 — best open-source) | 12.5%[27] |
| HumanEvalFix (pass@1) | 87.7%[27] |
| SWE-bench Pro (top models, data-contamination-resistant) | ~23% (vs. 70%+ on SWE-bench Verified)[9] |
Mini-SWE-Agent — 100 lines of code total — solves GitHub issues from the command line and scores >74% on SWE-bench Verified, demonstrating that much of SWE-agent's complexity is not essential to performance.[27]
See also: Agent Architecture Taxonomy (Section 10); Coordination Failures (Section 9)

OpenHands (formerly OpenDevin) started in early 2024 and was published at ICLR 2025 (arXiv: 2407.16741).[4] As of late 2025 it has 64K+ GitHub stars, 188+ contributors, 2.1K+ contributions, and an $18.8M Series A (Madrona, November 2025).[4][12][22] Adopters include AMD, Apple, Google, Amazon, Netflix, TikTok, NVIDIA, Mastercard, and VMware.[4]
OpenHands uses an event stream architecture through which user interfaces, agents, and environments interact.[4][6] The state encapsulates all relevant information for agent execution: a chronological collection of past actions and observations — agent actions, user interactions, accumulative LLM call cost, and metadata to track multi-agent delegation.[4][12]
Key finding: OpenHands' core design philosophy — "an autonomous agent is a function from event history to next event, run in a loop. Everything else (condensers, skills, sub-agents, security analyzers) is a hook into that one loop"[22] — enables deterministic replay and full audit trail of agent behavior.[6]
Hierarchical agent structures delegate subtasks to specialized agents using the AgentDelegateAction — a typed action enabling explicit handoff. Control passes explicitly, not via shared memory; the event stream is the single coordination source of truth.[4][6][12]
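A minimal sketch of that loop, assuming simplified stand-in event types rather than OpenHands' actual classes; the append-only history is what makes replay deterministic:

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str   # which agent or user produced this event
    payload: str

@dataclass
class DelegateAction(Event):
    target_agent: str = ""  # explicit handoff target; no shared memory

@dataclass
class FinishAction(Event):
    pass

def run(agent_fn, agents: dict, history: list[Event]) -> list[Event]:
    """An agent is a function from event history to the next event, run in
    a loop. The event stream is the single coordination source of truth."""
    while True:
        event = agent_fn(history)
        history.append(event)  # append-only log enables deterministic replay
        if isinstance(event, FinishAction):
            return history
        if isinstance(event, DelegateAction):
            # Control passes explicitly; the delegate reads the same stream
            # and hands control back by emitting its own FinishAction.
            run(agents[event.target_agent], agents, history)
```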
| Coordination Pattern | Mechanism |
|---|---|
| Capability-based handoff | AgentDelegateAction routes to specialized agent (e.g., BrowsingAgent for web tasks)[4] |
| Human-guided workflows | Interactive event injection into the stream[22] |
| Dynamic multi-agent composition | Coordination protocol vocabulary across agents[4] |
| Division of labor | Skill specialization per agent type[22] |
Each task session runs in a securely isolated Docker container sandbox containing a bash shell, Jupyter IPython Server, and a Chromium browser (Playwright-based).[6][12] An OpenHands Action Execution API server inside each sandbox listens for requests and returns results as observations. Agents share no runtime state by default.[6]
| Action Type | Description |
|---|---|
| `IPythonRunCellAction` | Executes arbitrary Python code[6] |
| `CmdRunAction` | Runs bash commands[6] |
| `BrowserInteractiveAction` | Web browsing via domain-specific language[6] |
| `edit_file` (skills library) | Precise line-range modifications rather than whole-file overwrites[6] |
| Feature | V0 | V1 |
|---|---|---|
| Architecture | Monolithic, sandbox-centric | Modular SDK with clear boundaries[4][12] |
| State model | Flat | Event-sourced with deterministic replay[22] |
| Sandboxing | Mandatory | Opt-in[12] |
| Tool system | Internal | Typed + MCP integration[12] |
| Scale support | Single session | Native distributed deployment to thousands of agents in cloud[4] |
| Reconnection | None | Automatic reconnection + state synchronization[12] |
| Benchmark | Score | Configuration |
|---|---|---|
| SWE-Bench Lite (at publication) | 26% | CodeActAgent v1.8 + claude-3.5-sonnet[6] |
| SWE-Bench Verified (2026) | 72% | Claude Sonnet 4.5 + extended thinking[4][12] |
| GAIA (validation set) | 67.9% | —[4][12] |
| HumanEvalFix (0-shot) | 79.3% | gpt-4o[6] |
| WebArena | 15.3% | BrowsingAgent + claude-3.5-sonnet[6] |
MetaGPT (arXiv: 2308.00352, ICLR 2024 Oral — top 1.8%) is an open-source multi-agent framework that encodes software company SOPs into prompt sequences.[5][13][23] It accepts a one-line requirement and outputs user stories, competitive analysis, requirements, data structures, APIs, and documents. It has 40K+ GitHub stars and an Apache 2.0 license.[5]
MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences. The core problem addressed: naively chaining LLMs causes cascading hallucinations due to logic inconsistencies. Structured SOPs prevent this by enforcing verification at each handoff.[5][13][23]
| Role | Input | Output | Verification |
|---|---|---|---|
| ProductManager | One-line user requirement | PRD (Product Requirements Document) | Human or agent review at handoff[5] |
| Architect | PRD | Technical spec, system architecture diagrams, interface definitions | Architectural review[13] |
| ProjectManager | Spec | Task list; code files as task assignments | Scope validation[13] |
| Engineer | Task specification | Implementation code | Execution feedback loop[5] |
| QA Engineer | Code | Unit tests; bug fix instructions | Test pass/fail[5][13] |
Communication protocol: publish-subscribe mechanism for information sharing and updates.[5][13]
Key finding: MetaGPT's sequential design is an intentional coordination strategy — it mirrors waterfall development to prevent merge conflicts and consistency failures. No two agents edit the same artifact simultaneously; each agent receives complete, stable input before starting.[13]
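A minimal sketch of the waterfall handoff under these constraints; the role callables and `verify` gates are hypothetical stand-ins, not MetaGPT's API:

```python
from typing import Callable

# Each role is (produce, verify): verification gates at every handoff stop
# cascading hallucinations from propagating downstream.
Role = tuple[Callable[[str], str], Callable[[str], bool]]

def sop_pipeline(requirement: str, roles: list[Role]) -> str:
    artifact = requirement
    for produce, verify in roles:
        candidate = produce(artifact)  # exactly one writer per artifact at a time
        if not verify(candidate):
            raise ValueError("handoff verification failed; halt before drift compounds")
        artifact = candidate           # next role receives complete, stable input
    return artifact

# Usage sketch: sop_pipeline("one-line requirement",
#     [product_manager, architect, project_manager, engineer, qa])
```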
| Property | Sequential (MetaGPT) | Parallel (MultiDevin, /fleet) |
|---|---|---|
| Merge conflicts | Eliminated (by design)[13] | Risk present; deferred to merge time |
| Consistency failures | Caught at handoff | Require post-hoc reconciliation |
| Throughput | Lower | Higher (when tasks are independent) |
| Interdependency handling | Natural (sequential ordering) | Requires explicit decomposition |
| Failure Mode | Description |
|---|---|
| Assistant repeated instruction | Agent repeats instructions instead of executing them[13] |
| Infinite loop of message | Agents get stuck in recursive message exchange[13] |
| Metric | Value |
|---|---|
| HumanEval Pass@1 | 85.9% (state-of-the-art at publication)[5][23] |
| MBPP Pass@1 | 87.7% (state-of-the-art at publication)[5][23] |
| MBPP executive feedback improvement | +5.4% absolute[5][13] |
| Experimental task completion rate | 100%[5][13][23] |
| AFlow (Jan 2025) — ICLR 2025 Oral rank | #2 in LLM-based Agent category[23] |
| MGX (Feb 2025) | "World's first AI agent development team"[5][23] |
GitHub launched Copilot Workspace as a technical preview in April 2024 — a browser-based environment that turned a plain-English GitHub issue into a spec, plan, and code changes via a four-stage workflow.[7][14][24] By September 2025, GitHub rebuilt those learnings into the Copilot Coding Agent (GA to all paid subscribers), incorporating a sub-agent architecture, issue-to-PR async workflow, GitHub Actions as execution environment, and isolated environments respecting repository access scopes.[7][14]
| Stage | Output |
|---|---|
| 1. Task definition | Parsed GitHub issue[7] |
| 2. Specification generation | Natural-language spec[14] |
| 3. Plan generation | Files to create / modify / delete[14] |
| 4. Implementation | Code changes in isolated environment[7] |
Agent Mode rolled out to VS Code, JetBrains, Eclipse, and Xcode.[14][24] It independently translates ideas into code, automatically identifies subtasks, executes across multiple files, and self-corrects on lint errors and test failures.
| Tool Available to Agent Mode | Function |
|---|---|
| `read_file` | Read file contents[14] |
| `list_dir` | Enumerate directory[14] |
| `run_terminal` | Execute shell commands[14] |
| `apply_edit` | Apply code modifications[14] |
Known limitation: Sub-agents in Copilot Agent Mode (IDE) cannot currently run in parallel — they execute sequentially.[14][24]
The /fleet slash command breaks complex requests into smaller tasks and runs them in parallel.[7][14][24] The main Copilot agent acts as an orchestrator, dispatching parallel subagents — by default using a low-cost AI model, overridable to custom agents via @CUSTOM-AGENT-NAME.
Key finding: Sub-agents in /fleet share a filesystem with no file locking.[14] Work partitioning to avoid conflicts is entirely the user's responsibility — the framework provides no automatic conflict prevention for parallel agents modifying the same files.
| Good Use Case | Problematic Use Case |
|---|---|
| Refactoring across multiple independent files[7] | Two agents touching the same shared file |
| Documentation for several components[7] | Interdependent API + frontend work[14] |
| Features spanning API/UI/tests (if independent) | Tasks without explicit dependency tracking |
| Independent code modifications not sharing state[24] | Any unpartitioned shared-state work |
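Since partitioning is the user's responsibility, a cheap pre-dispatch guard is to check that the planned file sets of parallel tasks are pairwise disjoint. The sketch below is a hypothetical harness-side check, not a Copilot feature:

```python
from itertools import combinations

def find_overlaps(tasks: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of tasks whose planned file sets intersect."""
    return [
        (a, b, tasks[a] & tasks[b])
        for a, b in combinations(tasks, 2)
        if tasks[a] & tasks[b]
    ]

planned = {
    "refactor-auth": {"src/auth.py", "src/session.py"},
    "write-docs":    {"docs/api.md"},
    "fix-session":   {"src/session.py"},  # collides with refactor-auth
}
for a, b, files in find_overlaps(planned):
    print(f"do not parallelize {a!r} and {b!r}: both touch {sorted(files)}")
```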
Mission Control provides a unified interface for managing multiple parallel Copilot Coding Agent tasks — assign, pick agents, watch real-time logs, steer mid-run, and jump to resulting PRs.[7][14][24]
GitHub's own documentation warns: "When assigning multiple tasks from the same repo, it's important to consider overlap: agents working in parallel can create merge conflicts if they touch the same files, so being thoughtful about partitioning work is essential."[7][14][24]
AgentHQ is a platform for building and deploying custom agents integrated with GitHub workflows, using the "Orchestra" pattern: a Conductor agent orchestrates specialized Planning, Implementation, and Code Review subagents.[24]
Copilot CLI can run in the background optionally using git worktrees for isolation. GitHub Agentic Workflows are designed with isolation, constrained outputs, and comprehensive logging.[7][14] This contrasts with the /fleet shared-filesystem model documented above, making the CLI's git worktree option an important counterpoint for tasks where stronger isolation guarantees are required.
Aider operates through a sequential two-model pipeline in architect mode: an architect model proposes how to solve the coding request, and an editor model turns that proposal into specific file editing instructions.[15] This is not concurrent multi-agent; the coordination challenge is handoff quality, not concurrency control.
Rationale for two-model design: Certain LLMs (especially reasoning models like o1) excel at reasoning but produce poor edit syntax. Separating architecture from editing eliminates hallucinated edits and malformed diff output.[15]
Isolation model: Aider "thinks in git" — every edit is a commit, every session a branch that can be reviewed, reverted, or cherry-picked. Natural isolation through git history rather than worktrees.[15][25] Aider does not run multiple concurrent agents on the same repo.
| Attribute | Value |
|---|---|
| Tool count | 0 (zero LLM-callable tools; user drives navigation)[29] |
| Context retrieval | PageRank-weighted dependency graphs[29] |
| Isolation mechanism | Git history (no worktrees); human supervision as safety boundary[15][29] |
| Benchmark (DeepSeek R1 + Claude 3.5 Sonnet) | 64% accuracy at $13.29 cost[15] |
| Role in broader ecosystem | Worker agent within orchestration systems (Claude Squad, AgentsMesh, Toryo, ai-maestro, Composio)[25] |
Patchwork is a self-hosted CLI agent automating PR reviews, bug fixing, and security patching using the user's preferred LLMs (AGPL-3.0 core; Apache-2.0 for custom patchflows).[8][16][26] Key milestones: July 2024 (RTC evaluation methodology), December 2024 (official GitHub Action).
Key finding: "Parallelism is not the hard part — isolation is. If two agents can edit the same repo but you cannot review, replay, or merge their work safely, you don't have a scalable workflow — you have a faster way to create conflicts."[16]
| Layer | Mechanism | Problem Solved |
|---|---|---|
| 1. Filesystem isolation | Git worktrees — one task → one branch → one worktree → one agent[8][26] | Concurrent file overwrites |
| 2. State isolation | Separate database names per worktree[8] | Interleaved test data |
| 3. Runtime isolation | Deterministically assigned dev server ports[8][26] | Port collision between agents |
Coordination mechanism: architect agent plans work, manager breaks into tasks, engineers execute in isolated environments, kanban board tracks state, tests gate completion.[26]
Practical scaling ceiling: 5–7 concurrent agents on a modern laptop before rate limits, disk consumption, and merge review overhead cancel throughput gains. Six agents on a 2 GB codebase consume 30+ GB of disk (≈5 GB per worktree).[8][26]
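A sketch of how the three layers might be provisioned per task; the helper below derives a database name and a deterministic dev-server port from the task id, and is an illustration rather than Patchwork's CLI:

```python
import hashlib
import subprocess

BASE_PORT = 4000

def provision(task_id: str, repo: str = ".") -> dict:
    """One task -> one branch -> one worktree -> one agent, plus a
    per-task database name and a deterministically assigned port."""
    branch = f"agent/{task_id}"
    worktree_path = f"../wt-{task_id}"
    # Layer 1: filesystem isolation via a dedicated worktree on a fresh branch.
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, worktree_path],
        check=True,
    )
    digest = int(hashlib.sha256(task_id.encode()).hexdigest(), 16)
    return {
        "worktree": worktree_path,
        "database": f"app_test_{task_id}",  # Layer 2: state isolation
        # Layer 3: runtime isolation; hash-derived ports can collide,
        # so a real system would also reserve them.
        "port": BASE_PORT + digest % 1000,
    }
```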
AutoDev is a fully automated AI-driven software development framework (arXiv: 2403.08299, Microsoft Research) for autonomous planning and execution of complex software engineering tasks.[17][30] It builds on AutoGen and Auto-GPT, adding direct repository interaction and filling the gap left by GitHub Copilot, which is constrained to code-snippet suggestions. AutoDev's LLM-agnostic Agent Scheduler lets models of diverse sizes and architectures collaborate on the same task, shifting the developer's role from manual code validator to multi-agent supervisor.[17][30]
| Component | Function |
|---|---|
| Conversation Manager | Supervises dialogue between user, AI agents, and system; manages interruptions[17] |
| Agent Scheduler | Schedules and orchestrates agents to collaborate; employs various collaboration algorithms[17][30] |
| Tools Library | File editing, retrieval, build, execution, testing, git operations[17] |
| Evaluation Environment | Secure Docker container; runs commands, abstracts low-level complexity, closes the feedback loop[17] |
Isolation model: All operations occur within Docker containers, isolating them from the host. The architecture does not describe explicit file-level locking or worktree isolation for agents running in parallel within a session — Docker isolation is per-session, not per-agent within a session.[17]
| Metric | Score |
|---|---|
| Code generation Pass@1 (HumanEval) | 91.5% (second-best; best requiring no extra training data)[17][30] |
| Test generation Pass@1 | 87.8% with 99.3% coverage from passing tests[17][30] |
| Languages supported | Java, Kotlin, JavaScript/TypeScript, Rust, Python, Go, C/C++[17] |
Note: Detailed git worktree internals (two-pointer model, branch exclusivity, object-store sharing) are covered in the Git Worktree Mechanics pillar; this section focuses on comparative isolation tradeoffs specifically relevant to agent system design.
The four failure modes that worktree isolation prevents are well-documented across the corpus:[2][20]
| # | Failure Mode | Mechanism |
|---|---|---|
| 1 | Concurrent File Overwrites | One agent's changes silently overwrite another's without detection until merge — untracked data loss[2] |
| 2 | Context Contamination | Agents in separate context windows are unaware of peer changes — Agent A's refactoring invalidates Agent B's assumptions mid-task[2] |
| 3 | Race Conditions on Shared State | Multiple agents independently trigger expensive operations (builds, tests), causing resource thrashing[2] |
| 4 | Git Lock Contention | Concurrent git operations fail with fatal errors on .git/index.lock; agents don't gracefully recover, leaving stale locks requiring manual intervention[2] |
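Failure mode 4 can also be softened at the harness level when isolation is skipped: bounded retries with backoff on `.git/index.lock` contention. A sketch, not taken from any of the surveyed systems:

```python
import subprocess
import time

def git_with_retry(args: list[str], attempts: int = 5, delay: float = 0.5) -> str:
    """Retry git commands that fail on .git/index.lock contention,
    since agents rarely recover from stale locks on their own."""
    for attempt in range(attempts):
        result = subprocess.run(["git", *args], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if "index.lock" not in result.stderr:
            raise RuntimeError(result.stderr)  # a real error, not contention
        time.sleep(delay * (2 ** attempt))     # exponential backoff, then retry
    raise TimeoutError(f"git {args[0]}: lock contention after {attempts} attempts")
```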
A git worktree enables multiple working directories from the same repository. Each worktree has its own working directory, private HEAD pointer, and private staging index, but shares the same .git object store (no duplication of history).[2] The two-pointer model: $GIT_DIR points to each worktree's private directory; $GIT_COMMON_DIR points to the shared .git.[2]
Key finding: "Parallel agents without isolation is not acceleration. It is entropy."[19] Git simultaneously acts as isolation mechanism (worktrees separate agents), integration boundary (deliberate merges via PRs), conflict detection (surfaces overlapping changes at merge time), and rollback capability (failed branches can be discarded).[19]
Branch exclusivity enforced by git: the same branch cannot be checked out in more than one worktree simultaneously by default.[2] Merge conflicts are deferred from runtime to merge time, where standard git tooling detects them as visible conflicts rather than silent overwrites.[1][2]
| Strategy | Git Objects | Creation Time | Filesystem Isolation | Runtime Isolation | Disk Cost |
|---|---|---|---|---|---|
| Worktrees | Shared | Seconds | File-level | None | Low[2] |
| Docker containers | Per-image layers | Minutes | Full with namespaces | Complete | High[2] |
| Separate clones | Duplicated | Minutes | Full | None | Very high[2] |
| Sequential checkout | Shared | Instant | None | N/A | N/A[2] |
| Copy-on-Write | Shared | Instant | Full | None | Very low[2] |
Worktrees excel for code-only parallel generation with 3–5 concurrent agents. Docker wins when agents need full runtime isolation (separate ports, databases, network namespaces).[2]
| Limitation | Detail |
|---|---|
| No runtime isolation | Worktrees share local databases, Docker daemon, cache directories, ports — requires three-layer stack for full isolation[1][2][8] |
| No cross-worktree conflict warnings | Git provides no alerts when worktrees modify identical files on different branches[2][16] |
| Shared git hooks | Hooks in .git/hooks/ execute in all worktrees; pre-commit assumptions may not hold in fresh worktrees[1][2] |
| Submodule multiplication | Each worktree gets its own submodule set, multiplying disk usage[2] |
| IDE gaps (historical) | Full worktree support arrived in JetBrains 2026.1 and VS Code July 2025[2] |
| Monorepo performance | File watchers and build tools in each worktree compound I/O; git sparse-checkout can constrain scope[2] |
Even with perfect execution-time isolation, merging remains sequential. Three common integration patterns are in use, and none of them eliminates merge conflicts when agents modify the same shared files; they only defer conflict detection to merge time.[16]
See also: Git Worktree Mechanics (separate pillar); Coordination Failures (Section 9)

Worktree isolation improves execution for independent tasks but cannot resolve file-level dependencies between concurrent agents. If Agent A builds an API while Agent B builds a frontend consuming that API, these must be sequenced, not parallelized.[2][8][10][26]
Key finding: "The secret to building robust, performant systems is the topology of coordination and not simply adding more agents to the task."[25] A good architect agent that breaks 'Build auth system' into well-scoped, independent subtasks will outperform six engineers working on poorly defined work.[26]
| Tier | Agent Role | Function |
|---|---|---|
| 1 | Coordinator Agent | Plans work, reviews specs before implementation, decomposes tasks into dependency-ordered waves[2][20] |
| 2 | Specialist Agents (6 personas) | Investigate, Implement, Verify, Critique, Debug, Code Review[2] |
| 3 | Verifier Agent | Quality gate: checks results against spec for inconsistencies, bugs, missing pieces[2] |
Living Spec: A shared coordination artifact all agents continuously reference — "the source of truth that keeps all participants aligned."[2] Context Engine: Semantic codebase indexing shared across all agents, supporting 400,000+ file repositories.[2]
| Mechanism | How It Works |
|---|---|
| Plan approval | Teammates submit implementation plans before coding; leads approve or reject[10] |
| Lifecycle hooks | Automated checks (lint, tests) before task completion[10] |
| Task dependencies | Explicit blocking relationships prevent out-of-order execution[10] |
| Token budgeting | Hard per-agent limits; auto-pause at 85% consumed[10] |
| Kill criteria | Reassign agents stuck 3+ iterations on identical errors[10] |
Recommended team size: "Three to five teammates is the sweet spot."[10]
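Two of these mechanisms reduce to a few lines of harness code: the 85% token auto-pause and the three-strike kill criterion. Class and exception names below are hypothetical:

```python
class PauseAgent(Exception): ...
class ReassignAgent(Exception): ...

class Guardrails:
    def __init__(self, token_budget: int):
        self.token_budget = token_budget
        self.tokens_used = 0
        self.error_counts: dict[str, int] = {}

    def record_usage(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used >= 0.85 * self.token_budget:  # auto-pause threshold
            raise PauseAgent("85% of token budget consumed; pausing for review")

    def record_error(self, signature: str) -> None:
        self.error_counts[signature] = self.error_counts.get(signature, 0) + 1
        if self.error_counts[signature] >= 3:             # kill criterion
            raise ReassignAgent(f"stuck 3+ iterations on: {signature}")
```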
| Topology | Description | Coordination Overhead |
|---|---|---|
| Single-Agent System (SAS) | Baseline — one agent, no coordination | None |
| Independent MAS | Agents work without coordination | None (at cost of consistency) |
| Decentralised MAS | Peer-to-peer coordination | Medium |
| Centralised MAS | Orchestrator coordinates all agents | High (orchestrator bottleneck) |
| Hybrid MAS | Mix of central and peer coordination | Medium-high |
(Source for the topology taxonomy above: CodeCRDT research, 600 trials using Claude Sonnet)[25]
| Approach Tried | Result | Root Cause |
|---|---|---|
| Equal-status agents with locking | Failed — 20 agents slowed to throughput of 2–3[28] | Agents held locks too long |
| Optimistic concurrency control | Failed — agents became risk-averse, avoided hard tasks[28] | Conflict cost changed agent behavior |
| Planners + Workers + Judges (hierarchical) | Successful[28] | Role differentiation enables scale |
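A minimal sketch of the topology that worked, assuming hypothetical `plan`, `work`, and `judge` callables; the point is that coordination happens through role structure, not locks:

```python
from concurrent.futures import ThreadPoolExecutor

def run_hierarchy(goal: str, plan, work, judge, max_workers: int = 8):
    """Planner decomposes into disjoint tasks; workers run them in
    parallel without locks; judges gate every result before merge."""
    tasks = plan(goal)  # decomposition quality, not agent count, decides the outcome
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(work, tasks))
    verdicts = [(r, judge(r)) for r in results]
    accepted = [r for r, ok in verdicts if ok]
    rejected = [r for r, ok in verdicts if not ok]  # re-planned, never merged blind
    return accepted, rejected
```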
For solo agent sessions, a five-step stateless-but-iterative cycle is recommended, with external memory persisting through git history, progress logs, task state files, and AGENTS.md.[10] Research note: "LLM-generated AGENTS.md files offer no benefit and can marginally reduce success rates (~3%) while increasing costs over 20%." Developer-written context provides ~4% improvement.[10]
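A sketch of the external-memory side of that cycle, with hypothetical file names: state is rehydrated from disk at session start and checkpointed into git so the next session (or agent) can resume.

```python
import json
import subprocess
from pathlib import Path

STATE = Path("task_state.json")  # hypothetical task state file
LOG = Path("progress.log")       # hypothetical progress log

def load_state() -> dict:
    """Sessions are stateless; memory is rehydrated from disk."""
    return json.loads(STATE.read_text()) if STATE.exists() else {"done": [], "next": []}

def checkpoint(state: dict, note: str) -> None:
    """Persist progress to files and git history, the durable memory layers."""
    STATE.write_text(json.dumps(state, indent=2))
    with LOG.open("a") as log:
        log.write(note + "\n")
    subprocess.run(["git", "add", str(STATE), str(LOG)], check=True)
    subprocess.run(["git", "commit", "-m", f"checkpoint: {note}"], check=True)
```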
Empirical research demonstrates a "curse of coordination": agent cooperation performs significantly worse than a single agent given the same total workload.[28]
Key finding: Multi-agent configurations degrade performance by 39 to 70 percent relative to single-agent baselines. Inter-agent misalignment is identified as the primary failure category.[28]
CooperBench tested 2–4 agent configurations on SWE-bench-style tasks; the 39–70% degradation range varies by agent count and task coupling, with higher agent counts and tighter task coupling producing worse outcomes.[28]
CodeCRDT proposes observation-driven coordination — agents coordinate by monitoring a shared state with observable updates, using deterministic convergence rather than explicit message passing.[25]
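A loose sketch of the observation-driven idea: agents watch a shared versioned document and act when it advances, instead of exchanging messages. The version counter below is a crude stand-in for CRDT convergence, and all names are hypothetical:

```python
import threading
import time

class ObservableState:
    """Shared state that agents observe; updates are visible, not messaged."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._data: dict[str, str] = {}

    def update(self, key: str, value: str) -> None:
        with self._lock:
            self._data[key] = value
            self._version += 1  # monotonic counter lets observers detect change

    def snapshot(self) -> tuple[int, dict[str, str]]:
        with self._lock:
            return self._version, dict(self._data)

def agent_loop(state: ObservableState, act) -> None:
    """Coordinate by observing: act only when shared state has advanced."""
    seen = -1
    while True:
        version, data = state.snapshot()
        if version != seen:
            seen = version
            if act(data):  # act returns True once this agent's work is done
                return
        time.sleep(0.05)   # poll; a real system would subscribe to updates
```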
| Outcome | Result (600 trials, Claude Sonnet) |
|---|---|
| Best case (some task types) | Up to 21.1% speedup with parallel coordination[25] |
| Worst case (other task types) | Up to 39.4% slowdown[25] |
Whether multi-agent coordination helps or hurts depends heavily on task structure — not simply on adding more agents.[25]
| Finding | Data |
|---|---|
| Merge conflicts in Agentic-PR rejections | >1.1% of all rejections[28] |
| Root cause of some failures | Superficial differences (formatting, indentation) rather than substantive conflicts[28] |
| Failure Mode | Description |
|---|---|
| Coherence degradation | "Lost in the middle" phenomenon as context grows[28] |
| Architectural drift | Agents make locally sensible but globally inconsistent decisions[28] |
| Pattern violation | Agents suggest or use deprecated APIs[28] |
| Staleness | Index updates lag behind rapid development[28] |
| Task-scoped isolation failure | Two workers independently modify the same file when tasks are not properly scoped[11] |
Note: MetaGPT-specific failures (assistant repeated instruction, infinite loop of message) — see Section 4.
| Benchmark | Scope | Top Model Score |
|---|---|---|
| SWE-Bench Verified | Single-issue tasks | >70%[10] |
| SWE-Bench Pro | Multi-file patches averaging 107 lines across 4+ files | ~23%[9] |
This 3×+ performance drop demonstrates the need to decompose work into smaller, testable units that stay within each agent's accuracy range.[10]
| Solution | Components | Source |
|---|---|---|
| Claude Code shared task list pattern | Status flags (lock claims) + git worktrees (isolate edits) + dependency markers (sequence constrained work)[28] | [28] |
| Centralized orchestrator with per-agent worktrees | Orchestrator dispatches; each agent gets its own working copy (e.g., Nevo production system)[28] | [28] |
| Plan approval before coding | Prevents architectural mistakes before code exists[10] | [10] |
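The first row's pattern reduces to a small shared data structure: status flags act as lock claims and dependency markers sequence constrained work. The schema below is a hypothetical illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    status: str = "open"  # open -> claimed:<agent> -> done; the flag is the lock claim
    depends_on: list[str] = field(default_factory=list)

def claim_next(tasks: dict[str, Task], agent: str) -> Task | None:
    """Claim the first open task whose dependencies are all done."""
    for task in tasks.values():
        unblocked = all(tasks[d].status == "done" for d in task.depends_on)
        if task.status == "open" and unblocked:
            task.status = f"claimed:{agent}"  # prevents a second agent double-claiming
            return task
    return None  # nothing unblocked: wait for a dependency to finish, or exit
```

In practice the task list would live in a shared file or database with atomic updates; the in-memory dict just shows the claim-and-dependency logic.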
| Tool | Capability |
|---|---|
| JetBrains AI Assistant | Integrated merge conflict suggestions[28] |
| VS Code 1.105 (Sept 2025) | AI-assisted merge conflict resolution using merge base and both branch changes as context[28] |
| GitKraken Desktop | "Auto-resolve with AI" with explanations[28] |
| Graphite | AI merge conflict resolution guidance[28] |
| Resolve.AI | Dedicated merge conflict tool[28] |
| CodeGPT | Intelligent merge resolution[28] |
Common mistake: launching maximum agents immediately without learning coordination patterns, producing overwhelming complexity and incompatible code.[25][10] Recommended: start with 2 agents on well-isolated features, master the workflow, then scale to 4, 6, 8 — task decomposition quality is the primary variable. "Swarming only works when work units are genuinely independent."[1]
See also: Concurrency Control Theory (separate pillar); Work Partitioning Patterns (Section 8)

A systematic taxonomy (arXiv: 2604.03515v2) classifies 13 open-source coding agents across three layers:[29] (1) control architecture, (2) tool/environment interface, (3) resource management. Agents occupy positions along continuous spectra — not discrete categories.
Key finding: 11 of 13 agents compose multiple primitives rather than relying on a single control structure.[29] Four core tool capability categories — Read, Search, Edit, Execute — appear in all LLM-driven agents, with tool counts ranging from 0 (Aider) to 37 action classes (Moatless Tools).[29]
| Loop Strategy | Agents | Complexity |
|---|---|---|
| Fixed pipeline | Agentless | Lowest |
| Sequential ReAct loops | 7 agents (including SWE-agent, OpenHands) | Medium |
| Phased state machine | Prometheus | Medium-high |
| Full Monte Carlo Tree Search with backpropagation | Moatless Tools | Highest |
| Driver | Mechanism | Examples |
|---|---|---|
| User-driven | Humans select files — sidesteps localization bottleneck | Aider[29] |
| Scaffold-driven | Pre-computed paths determine agent actions | Agentless, AutoCodeRover[29] |
| LLM-driven | Full tool autonomy; agent chooses actions | 9 of 13 agents surveyed[29] |
| Approach | Agents | Mechanism |
|---|---|---|
| LLM-as-Navigator | 8 agents | Grep/find tools; LLM formulates queries[29] |
| Scaffold-side understanding | 5 agents | Pre-computed code representations[29] |
| — PageRank-weighted dependency graphs | Aider | Graph-based relevance ranking[29] |
| — AST-indexed + spectrum fault localization | AutoCodeRover | Structure-aware queries; unique in corpus[29] |
| — Neo4j knowledge graphs (20 languages) | Prometheus | Cross-language symbol graph[29] |
| — FAISS embedding-based semantic search | Moatless Tools | Vector similarity[29] |
| Approach | Agents | Mechanism |
|---|---|---|
| Containerized (Docker) | 5 agents: SWE-agent, OpenHands, DARS-Agent, AutoCodeRover, Prometheus | In-container FastAPI servers (OpenHands); full namespace isolation[29] |
| Shadow git checkpoints | Cline | Rollback without touching user history[29] |
| OS-level sandboxing | Codex CLI | Bubblewrap/Landlock + LLM guardian safety scoring (0–100 scale, 80-point threshold)[29] |
| Rule-based policy engine | Gemini CLI | Per-tool approval requirements[29] |
| Human supervision | Aider | Safety boundary is the human operator[29] |
| Stateless subshells | mini-swe-agent | Process-level isolation per operation[29] |
| Agent | Mechanism |
|---|---|
| DARS-Agent | Full Docker reset; replays all actions from root[29] |
| Moatless Tools | Shadow mode — tracks modifications in-memory without filesystem writes[29] |
| Dimension | SWE-agent | Aider | OpenHands | Prometheus |
|---|---|---|---|---|
| Loop strategy | Sequential ReAct | User-driven | Sequential ReAct | Phased state machine |
| Tool count | 3–35 (bundled) | 0 (text-parsed) | 9+ (MCP-enabled) | 17 (per-node scoped) |
| Context retrieval | Keyword search | PageRank graph | Keyword search | Neo4j knowledge graph |
| Isolation | Docker | None (user trust) | Docker + HTTP API | Docker |
| Compaction strategy | Polling triggers | Summarization | Request-based | Not detailed |
Source: arXiv 2604.03515v2[29]
Seven distinct strategies are identified across the 13 agents.[29]
| System | Isolation Mechanism | Coordination Pattern | Max Parallel Scale | Open / Closed | Key Limitation |
|---|---|---|---|---|---|
| Devin | Isolated VM per instance[3][11] | Manager assigns to workers; manager merges results[11] | 10 workers (MultiDevin)[11] | Closed | Task-level isolation only — workers can modify the same file if tasks are mis-scoped[11] |
| SWE-agent | Docker container per issue[27][29] | Single-agent per issue; no built-in parallel execution[27] | 1 (single-agent design)[27] | Open | No multi-agent parallelism; parallel scale requires external orchestration[9] |
| OpenHands | Docker container per session + event-stream state[6][12] | Hierarchical delegation via AgentDelegateAction[4] | Cloud-scale (V1 SDK: native distributed deployment)[4][12] | Open | Automatic workflow generation still requires substantial handcrafting[22] |
| MetaGPT | Sequential handoffs; no concurrent artifact editing[13] | Sequential SOP pipeline (ProductManager → Architect → Engineer → QA)[5] | 1 (intentionally sequential)[13] | Open | No parallelism by design; lower throughput than parallel systems[13] |
| GitHub Copilot /fleet | Shared filesystem, no file locking (/fleet); optional git worktrees in CLI[14] | Orchestrator dispatches parallel subagents[7][14] | Parallel (no stated maximum); Mission Control manages multiple tasks concurrently[14][24] | Closed | No automatic file-conflict prevention in /fleet; partitioning is the user's responsibility[14] |
| Aider | Git history / user-supervised (no worktrees)[15][25] | Sequential two-model pipeline (architect → editor)[15] | 1 (no concurrent agents on same repo)[15] | Open | Not designed for parallel execution; used as worker agent in external orchestration[25] |
| Patchwork | Three-layer stack: git worktrees + database branches + port namespacing[8][26] | Architect plans → manager decomposes → engineers execute in isolated worktrees[26] | 5–7 practical ceiling (rate limits + disk + review overhead)[8][26] | Open | Disk and rate-limit ceiling; ~5 GB per worktree on a 2 GB codebase[8] |
| AutoDev | Docker container per session (session-scoped, not per-agent within a session)[17] | Agent Scheduler orchestrates LLM-agnostic multi-model collaboration[17][30] | Session-scoped (parallel scale not specified)[17] | Open | No explicit per-agent file-level locking or worktree isolation within a session[17] |
| System | Benchmark | Score | Date / Configuration |
|---|---|---|---|
| OpenHands | SWE-Bench Verified | 72%[4][12] | 2026, Claude Sonnet 4.5 + extended thinking |
| OpenHands | SWE-Bench Lite | 26%[6] | At publication, CodeActAgent v1.8 + claude-3.5-sonnet |
| OpenHands | GAIA (val set) | 67.9%[4] | 2025 |
| MetaGPT | HumanEval Pass@1 | 85.9%[5][23] | At ICLR 2024 publication |
| MetaGPT | MBPP Pass@1 | 87.7%[5][23] | At ICLR 2024 publication |
| AutoDev | HumanEval Pass@1 | 91.5%[17][30] | 2024 (best requiring no extra training data) |
| AutoDev | Test generation Pass@1 | 87.8%[17] | 99.3% coverage from passing tests |
| SWE-agent | SWE-bench (pass@1) | 12.5%[27] | NeurIPS 2024, best open-source at publication |
| SWE-agent | HumanEvalFix (pass@1) | 87.7%[27] | NeurIPS 2024 |
| SWE-agent (mini) | SWE-bench Verified | >74%[27] | 100 LOC implementation |
| SWE-Edit (decomposed) | SWE-bench Verified (vs. baseline) | +2.1% / −17.9% cost[9] | 2025 |
| Aider (DeepSeek R1 + Claude 3.5 Sonnet) | Internal benchmark | 64% at $13.29[15] | Architect mode |
| Frontier models (general) | SWE-Bench Verified | >70%[10] | 2025 state of the art |
| Frontier models (general) | SWE-Bench Pro (multi-file, 107 lines avg, 4+ files) | ~23%[9] | 2025 state of the art |
| Threshold | Value | Limiting Factor |
|---|---|---|
| Recommended sweet spot (most repos) | 3–5 concurrent agents[8] | Merge review overhead begins to exceed throughput gains |
| Productive ceiling (modern laptop) | 5–7 concurrent agents[8][26][25] | Rate limits, disk consumption, review overhead |
| Disk consumption (2 GB codebase) | ~30+ GB for 6 worktree agents (~5 GB each)[8][26] | Submodule multiplication + per-worktree indexes |
| Cursor 2.0 (Oct 2025) supported agents | Up to 8 concurrent[25] | Product-imposed limit |
| Organizations reporting workflow improvement | 20–30% faster cycles with multi-agent setups[25] | Dependent on task partitioning quality |
| Rate limit impact | 10 Claude Code instances hit Anthropic rate limits faster than 1[25] | API throughput constraints |
"The bottleneck is no longer generation. It's verification."[10] Three-layer verification approach recommended:
Human review remains mandatory — agents generate volume quickly, but determining correctness requires full system context.[10]
| Tool | Key Feature | Isolation Mechanism |
|---|---|---|
| Parallel Code | Each task gets own git branch and worktree[16] | Git worktrees |
| Superset IDE | Run 10+ agents simultaneously[16] | Git worktrees |
| Composio Agent Orchestrator | Multiple agents, each with its own PR, supervised dashboard[16] | Git worktrees + PRs |
| Conductor (Mac) | Multiple parallel agents, clean separation[16] | Git worktrees |
| Baton (mraza007) | Polls GitHub Issues, runs Claude Code in isolated worktrees[16] | Git worktrees |
| Claude Squad | Orchestrates Claude Code, Aider, Codex simultaneously[16] | Git worktrees + tmux |
| Strategy | Advantage | Disadvantage |
|---|---|---|
| Worktree per task | Short-lived; no state accumulation[2] | Zero cache reuse; cold dependency installs for each task |
| Worktree per agent | Warm dependency caches; faster task startup[2] | Long-lived worktrees accumulate state; harder cleanup |