Existing Multi-Agent Coding Systems

Pillar: existing-agent-systems | Date: May 2026
Scope: Prior art survey of tools that run multiple AI coding agents concurrently on shared repos: Devin, SWE-Agent, AutoDev, MetaGPT, Patchwork, Aider multi-agent, GitHub Copilot Workspace, OpenHands (OpenDevin). How they partition work, what isolation guarantees they provide, and what coordination failures they document or avoid.
Sources: 30 gathered, consolidated, synthesized.

Executive Summary

Central finding: Empirical research (CooperBench, arXiv 2601.13295) measured a 39–70% performance degradation when multi-agent configurations attempt the same workload as a single agent — the "curse of coordination" — making task decomposition quality, not agent count, the primary determinant of whether parallelism helps or harms.[28]

The field divides sharply between systems that treat isolation as a first-class concern and those that defer it to the user. GitHub Copilot's /fleet command is the starkest example of the latter: sub-agents share a filesystem with no file locking, and GitHub's own documentation warns that "being thoughtful about partitioning work is essential" to avoid merge conflicts.[7][14] At the other extreme, Patchwork employs a three-layer stack — git worktrees for filesystem isolation, per-worktree database names for state isolation, and deterministically assigned dev-server ports for runtime isolation — but hits a practical ceiling of 5–7 concurrent agents before rate limits, disk consumption (~5 GB per worktree on a 2 GB codebase, totaling 30+ GB for 6 agents), and merge-review overhead cancel the throughput gains.[8][26]

No production system fully solves the concurrency problem. MultiDevin (Cognition AI, 2024) scales to 1 manager plus 10 workers, each in an isolated VM, but provides only task-level isolation — two workers can independently modify the same file if tasks are mis-scoped, leaving conflict resolution to a sequential final merge by the manager.[11] AutoDev (Microsoft Research, arXiv 2403.08299) uses Docker isolation per session, not per agent within a session, leaving intra-session concurrent writes without file-level protection.[17] Devin 2.0 (April 3, 2025) improved PR merge rates from 34% to 67% and delivered 83% more junior-level tasks per ACU compared to 1.x, yet complex end-to-end task completion remains in the single-digit to low-double-digit percent range.[3][21]

The benchmark record exposes a consistent complexity cliff. Frontier models score above 70% on SWE-Bench Verified (single-issue tasks), but drop to roughly 23% on SWE-Bench Pro — a data-contamination-resistant benchmark requiring multi-file patches averaging 107 lines across 4+ files.[9][10] OpenHands reached 72% on SWE-Bench Verified (2026, Claude Sonnet 4.5 with extended thinking), up from 26% at publication.[4][6] Meanwhile, Mini-SWE-Agent — a 100-line implementation — scores above 74% on SWE-Bench Verified, demonstrating that much of the complexity in full SWE-agent is not load-bearing for benchmark performance.[27] SWE-Edit's decomposed Viewer/Editor subagent architecture improved on its baseline by 2.1 points while cutting inference cost by 17.9%, validating specialization as a cost-efficiency lever even at modest accuracy gains.[9]

Coordination topology matters more than agent count. CodeCRDT (arXiv 2510.18893) ran 600 trials using Claude Sonnet and found that parallel coordination produces outcomes ranging from a 21.1% speedup to a 39.4% slowdown depending on task structure alone.[25] Cursor's internal experimentation documented the failure modes directly: equal-status agents with locking made 20 agents perform at the throughput of 2–3 because agents held locks too long; optimistic concurrency control made agents risk-averse and avoidant of hard tasks; the architecture that worked was Planners + Workers + Judges — role differentiation as the scaling mechanism.[28] A systematic taxonomy of 13 open-source agents (arXiv 2604.03515v2) found that 11 of 13 compose multiple control primitives rather than relying on a single loop strategy, with tool counts ranging from 0 (Aider) to 37 action classes (Moatless Tools).[29]

Sequential architecture is not a concession — it is a deliberate coordination strategy. MetaGPT (ICLR 2024 Oral, top 1.8%) encodes a five-role waterfall pipeline (ProductManager → Architect → ProjectManager → Engineer → QA) in which no two agents ever write to the same artifact concurrently, eliminating merge conflicts by construction rather than detection.[5][13] This trades throughput for consistency and achieved 85.9% HumanEval Pass@1 and 87.7% MBPP Pass@1 at publication.[5] The core insight, validated across systems: naively chaining LLMs produces cascading hallucinations from logic inconsistencies; structured handoffs with verification gates at each role boundary prevent this.[13]

The isolation mechanism chosen has decisive downstream consequences. Git worktrees spin up in seconds and share the object store (low disk cost), providing file-level isolation with branch exclusivity enforced by git — the same branch cannot be checked out in more than one worktree simultaneously.[2] Docker containers provide full namespace isolation including ports, databases, and network, but take minutes to create and multiply disk usage per agent. The four failure modes worktrees prevent — concurrent file overwrites, context contamination, race conditions on shared state, and git index lock contention — are all documented in production systems that skipped isolation.[2] Worktrees, however, provide no runtime isolation and no cross-worktree conflict warnings; full-stack agents (running dev servers, test databases) require the three-layer stack that Patchwork implements.[8] Full worktree support arrived in JetBrains 2026.1 and VS Code July 2025, making the tooling ecosystem for this pattern newly mature.[2]

Verification has overtaken generation as the bottleneck. "The bottleneck is no longer generation. It's verification."[10] Organizations with working multi-agent setups report 20–30% faster cycles, but only when task partitioning quality is high.[25] Developer-written AGENTS.md files compound learning across sessions and produce roughly a 4% accuracy improvement; LLM-generated equivalents cause a ~3% regression and a 20%+ cost increase.[10] The AgenticFlict dataset (arXiv 2604.03551) found that some multi-agent PR rejections stem not from substantive conflicts but from superficial formatting or indentation differences — a category addressable by pre-commit normalization rather than architectural changes.[28]

Implications for practitioners: Start with 2 well-isolated agents on genuinely independent features before scaling — the CooperBench and CodeCRDT data both show that adding agents to poorly partitioned work makes outcomes worse, not better. Use git worktrees for code-only parallel generation (3–5 agents is the validated sweet spot), and add database and port namespacing only when agents run live servers. Adopt the Planners + Workers + Judges topology rather than equal-status agents with locking — Cursor's internal research proves the latter collapses at scale. Invest human time in writing the decomposition brief and the AGENTS.md; LLM-generated context for either degrades performance. Finally, treat the 70% SWE-Bench Verified scores with caution: the 3× drop to 23% on multi-file tasks means current agents are reliable for single-issue scoped work and unreliable for anything that touches 4+ files simultaneously — which is precisely the class of work multi-agent systems are most often proposed to accelerate.



Table of Contents

  1. Devin (Cognition AI) — MultiDevin Parallel Architecture
  2. SWE-Agent (Princeton NLP) — Agent-Computer Interface and Decomposition
  3. OpenHands — Event-Stream Architecture and Hierarchical Delegation
  4. MetaGPT — Sequential SOP-Driven Pipeline
  5. GitHub Copilot Workspace, Coding Agent, and Fleet
  6. Aider, Patchwork, and AutoDev
  7. Isolation Mechanisms: Git Worktrees vs. Docker
  8. Work Partitioning and Orchestration Patterns
  9. Coordination Failures: Empirical Research and Documented Failure Modes
  10. Agent Architecture Taxonomy: Source-Code Survey of 13 Systems
  11. Comparative Benchmarks Across Systems
  12. Practical Scaling Limits and Emerging Tooling

Section 1: Devin (Cognition AI) — MultiDevin Parallel Architecture

Cognition Labs launched Devin 1.0 on March 12, 2024, positioning it as the "world's first fully autonomous AI software engineer."[11][21] Devin integrates an LLM with tools, memory, and reasoning capabilities to independently plan, execute, and iterate on multi-step engineering tasks requiring thousands of decisions.[3] Devin accepts tasks via Slack or Microsoft Teams integrations and executes autonomously in a cloud sandbox, optimized for clearly scoped multi-hour tasks.[3] The most architecturally significant feature added in 2024 was MultiDevin — a manager-worker parallel execution pattern scaling to 10 concurrent agents.

MultiDevin: Manager-Worker Architecture

MultiDevin fields one "manager" Devin and up to 10 "worker" Devins.[11] The manager distributes tasks to each worker, then merges changes from all successful workers into one branch or pull request. The design is explicitly limited to "repeated, isolated tasks like lint errors, code clean-ups, migrations, refactors" and is not suited for interdependent feature work.[11]

Key finding: MultiDevin's isolation guarantee is at the task level, not the file level — two workers can independently modify the same file if tasks are not properly scoped. The manager must reconcile all worker changes in a final sequential merge step.[11]
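A minimal sketch of this manager-worker shape follows; names like WorkerResult and run_worker are hypothetical stand-ins, since Devin's API is closed-source:

    from concurrent.futures import ThreadPoolExecutor
    from dataclasses import dataclass

    @dataclass
    class WorkerResult:
        task_id: str
        branch: str
        touched_files: set
        success: bool

    def run_worker(task_id: str) -> WorkerResult:
        # Stand-in for one worker executing a scoped task in its own VM.
        return WorkerResult(task_id, f"agent/{task_id}", {f"src/{task_id}.py"}, True)

    def manager(task_ids: list) -> None:
        # Fan out: up to 10 workers run concurrently, each fully isolated.
        with ThreadPoolExecutor(max_workers=10) as pool:
            results = list(pool.map(run_worker, task_ids))
        # Merge is sequential: the manager reconciles one branch at a time.
        merged_files = set()
        for result in (r for r in results if r.success):
            overlap = merged_files & result.touched_files
            if overlap:
                # Task-level isolation only: nothing stopped two workers from
                # editing the same file; the conflict surfaces only here.
                print(f"{result.task_id}: reconcile {sorted(overlap)} manually")
            merged_files |= result.touched_files
            # ...then: git merge result.branch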

Devin 2.0: Agent-Native IDE (April 2025)

Devin 2.0 (released April 3, 2025) operates within a cloud-based agent-native IDE combining a code editor, terminal, sandboxed browser, and smart planning tools.[11][21] Per a technical analysis: "a cloud-based development environment that allows users to spin up multiple parallel Devin instances, each running in an isolated virtual machine."[3]

Devin Version | Release | Multi-Agent Capability | Isolation Unit | Coordination Mode
Devin 1.0 | March 12, 2024 | Single autonomous agent; REST API for parallel sessions[11] | Cloud sandbox per session | None (independent)
MultiDevin | 2024 (Q3–Q4) | 1 manager + up to 10 workers[11] | Task-scoped (not file-scoped) | Manager merges worker output
Devin 2.0 | April 3, 2025 | Parallel instances; agent dispatches sub-tasks to other Devins[11][21] | Isolated VM per instance | Task-scoped; plan approval interface
Devin (Feb 2026) | February 2026 | Parallel sessions with improved context retention[21] | Isolated VM | Parallel session management

Performance Metrics

Metric | Value | Source
Task completion improvement, Devin 2.0 vs. 1.x | 83% more junior-level tasks per ACU | [11][21]
COBOL migration scale | 5 million lines across 500 GB of repositories, single agent | [3]
PR merge rate improvement | 34% → 67% | [3]
Nubank case study efficiency | 12× efficiency improvement, 20× cost savings; weeks vs. months (per Cognition-published case study) | [11]
Goldman Sachs pilot (July 2025) | 20% efficiency gains alongside 12,000 human developers (per Cognition-published case study) | [21]
Complex end-to-end task completion (early 2025) | Single-digit to low-double-digit % | [21]

Funding Trajectory

Round | Date | Valuation
Series A | March 2024 | $350M[11]
Series B | April 2024 | $2B[11]
Growth round | March 2025 | ~$4B (8VC)[21]

Documented Limitations

See also: Coordination Failures (Section 9); Isolation Mechanisms (Section 7)

Section 2: SWE-Agent (Princeton NLP) — Agent-Computer Interface and Decomposition

SWE-agent is an open-source platform developed by Princeton University's NLP group (Yang et al.), published at NeurIPS 2024 (arXiv: 2405.15793).[27] It takes a GitHub issue and automatically fixes it using an LM of choice. The system's central contribution is the Agent-Computer Interface (ACI) — the insight that LM agents benefit from specially designed software interfaces, analogous to how human developers benefit from IDEs.[9][27]

Key finding: For coding agents, exploration and precision are fundamentally at odds — a single agent cannot simultaneously optimize for comprehensive code understanding (benefits from viewing many files) AND reliable edit generation (benefits from clean, focused context).[9] This tension motivates decomposed multi-agent architectures.

Isolation Mechanism

SWE-agent executes code in isolated Docker environments — each issue gets its own sandboxed container. Docker-based isolation is the primary mechanism, not git worktrees.[27][29]

SWE-Edit: Decomposed Subagent Architecture

The SWE-Edit framework decomposes code editing into two specialized subagents to address the exploration-precision tension:[9]

Subagent | Role | Context Characteristic
Viewer | Extracts task-relevant code on demand | Broad — can inspect many files
Editor | Executes modifications from high-level plans | Narrow — receives only what is needed for edits

Adaptive editing mode selection uses a Qwen3-8B model (GRPO-trained) to choose between find-replace (small changes) and whole-file rewrite (complex restructuring).[9]
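A sketch of the Viewer/Editor handoff under stated assumptions (the llm callables stand in for model endpoints; SWE-Edit's real interfaces differ):

    def viewer(llm, repo_files: dict, task: str) -> str:
        # Broad context: the Viewer may read many files, but returns only
        # the task-relevant extract, keeping exploration out of edit context.
        corpus = "\n\n".join(f"# {path}\n{src}" for path, src in repo_files.items())
        return llm(f"Task: {task}\nExtract only code relevant to the task:\n{corpus}")

    def editor(llm, task: str, relevant_code: str, small_change: bool) -> str:
        # Narrow context: the Editor sees only the Viewer's extract. Mode
        # selection mirrors the adaptive find-replace vs. rewrite choice above.
        mode = "a find-replace edit" if small_change else "a whole-file rewrite"
        return llm(f"Task: {task}\nCode:\n{relevant_code}\nReturn {mode}.")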

Metric | Result
SWE-bench Verified improvement over baseline | +2.1%[9]
Inference cost reduction | 17.9%[9]
Edit formatting reliability improvement | +3.5%[9]

Architectural Evolution: 2024–2025

Multi-agent systems for software engineering (with specialized agents for repository navigation, bug localization, patch generation, and verification) have outpaced single-agent architectures in scalability and performance as of 2025.[9]

Year | Dominant Pattern | Example
2024 | Single-agent designs with custom ACIs | SWE-agent original
2025 | Decomposed multi-agent architectures with specialized roles | SWE-Edit (Viewer + Editor), Agentless

Benchmark Performance

Benchmark | Score
SWE-bench (pass@1, NeurIPS 2024 — best open-source) | 12.5%[27]
HumanEvalFix (pass@1) | 87.7%[27]
SWE-bench Pro (top models, data-contamination-resistant) | ~23% (vs. 70%+ on SWE-bench Verified)[9]

Mini-SWE-Agent: Complexity vs. Performance

Mini-SWE-Agent — 100 lines of code total — solves GitHub issues from the command line and scores >74% on SWE-bench Verified, demonstrating that much of SWE-agent's complexity is not essential to performance.[27]

See also: Agent Architecture Taxonomy (Section 10); Coordination Failures (Section 9)

Section 3: OpenHands — Event-Stream Architecture and Hierarchical Delegation

OpenHands (formerly OpenDevin) started in early 2024 and was published at ICLR 2025 (arXiv: 2407.16741).[4] As of late 2025 it has 64K+ GitHub stars, 188+ contributors, 2.1K+ contributions, and an $18.8M Series A (Madrona, November 2025).[4][12][22] Adopters include AMD, Apple, Google, Amazon, Netflix, TikTok, NVIDIA, Mastercard, and VMware.[4]

Core Architecture: Event Stream

OpenHands uses an event stream architecture through which user interfaces, agents, and environments interact.[4][6] The state encapsulates all relevant information for agent execution: a chronological collection of past actions and observations — agent actions, user interactions, accumulative LLM call cost, and metadata to track multi-agent delegation.[4][12]

Key finding: OpenHands' core design philosophy — "an autonomous agent is a function from event history to next event, run in a loop. Everything else (condensers, skills, sub-agents, security analyzers) is a hook into that one loop"[22] — enables deterministic replay and full audit trail of agent behavior.[6]
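That philosophy compresses into a few lines; the sketch below uses illustrative dict events rather than the OpenHands SDK's actual classes:

    def run_loop(agent, env_step, history: list) -> list:
        # agent: a function from event history to the next event.
        # env_step: executes an action event, returning an observation event.
        while True:
            event = agent(history)            # the single decision function
            history.append(event)             # append-only stream => deterministic replay
            if event.get("kind") == "finish":
                return history
            history.append(env_step(event))   # observations join the same stream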

Multi-Agent Coordination: AgentDelegateAction

Hierarchical agent structures delegate subtasks to specialized agents using the AgentDelegateAction — a typed action enabling explicit handoff. Control passes explicitly, not via shared memory; the event stream is the single coordination source of truth.[4][6][12]
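A hypothetical shape of that handoff (field names are illustrative; the real AgentDelegateAction lives in the OpenHands source):

    from dataclasses import dataclass, field

    @dataclass
    class AgentDelegateAction:
        agent: str                              # e.g. "BrowsingAgent"
        inputs: dict = field(default_factory=dict)
        kind: str = "delegate"

    def delegate(history: list, registry: dict, action: AgentDelegateAction) -> None:
        history.append(action)                  # the handoff is itself an event
        result = registry[action.agent](action.inputs)  # child runs to completion
        history.append({"kind": "observation", "delegate_result": result})
        # Control returned explicitly; no shared memory was involved.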

Coordination Pattern | Mechanism
Capability-based handoff | AgentDelegateAction routes to specialized agent (e.g., BrowsingAgent for web tasks)[4]
Human-guided workflows | Interactive event injection into the stream[22]
Dynamic multi-agent composition | Coordination protocol vocabulary across agents[4]
Division of labor | Skill specialization per agent type[22]

Sandbox Isolation

Each task session runs in a securely isolated Docker container sandbox containing a bash shell, Jupyter IPython Server, and a Chromium browser (Playwright-based).[6][12] An OpenHands Action Execution API server inside each sandbox listens for requests and returns results as observations. Agents share no runtime state by default.[6]

CodeActAgent Action Primitives

Action Type | Description
IPythonRunCellAction | Executes arbitrary Python code[6]
CmdRunAction | Runs bash commands[6]
BrowserInteractiveAction | Web browsing via domain-specific language[6]
edit_file (skills library) | Precise line-range modifications rather than whole-file overwrites[6]

SDK Architecture Evolution: V0 → V1

Feature | V0 | V1
Architecture | Monolithic, sandbox-centric | Modular SDK with clear boundaries[4][12]
State model | Flat | Event-sourced with deterministic replay[22]
Sandboxing | Mandatory | Opt-in[12]
Tool system | Internal | Typed + MCP integration[12]
Scale support | Single session | Native distributed deployment to thousands of agents in cloud[4]
Reconnection | None | Automatic reconnection + state synchronization[12]

Benchmark Performance

Benchmark | Score | Configuration
SWE-Bench Lite (at publication) | 26% | CodeActAgent v1.8 + claude-3.5-sonnet[6]
SWE-Bench Verified (2026) | 72% | Claude Sonnet 4.5 + extended thinking[4][12]
GAIA (validation set) | 67.9%[4][12]
HumanEvalFix (0-shot) | 79.3% | gpt-4o[6]
WebArena | 15.3% | BrowsingAgent + claude-3.5-sonnet[6]

See also: Agent Architecture Taxonomy (Section 10); Isolation Mechanisms (Section 7)

Section 4: MetaGPT — Sequential SOP-Driven Pipeline

MetaGPT (arXiv: 2308.00352, ICLR 2024 Oral — top 1.8%) is an open-source multi-agent framework that encodes software company SOPs into prompt sequences.[5][13][23] It accepts a one-line requirement and outputs user stories, competitive analysis, requirements, data structures, APIs, and documents. It has 40K+ GitHub stars and an Apache 2.0 license.[5]

Core Principle: Code = SOP(Team)

MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences. The core problem addressed: naively chaining LLMs causes cascading hallucinations due to logic inconsistencies. Structured SOPs prevent this by enforcing verification at each handoff.[5][13][23]

Five-Role Sequential Pipeline

Role | Input | Output | Verification
ProductManager | One-line user requirement | PRD (Product Requirements Document) | Human or agent review at handoff[5]
Architect | PRD | Technical spec, system architecture diagrams, interface definitions | Architectural review[13]
ProjectManager | Spec | Task list; code files as task assignments | Scope validation[13]
Engineer | Task specification | Implementation code | Execution feedback loop[5]
QA Engineer | Code | Unit tests; bug fix instructions | Test pass/fail[5][13]

Communication protocol: publish-subscribe mechanism for information sharing and updates.[5][13]

Key finding: MetaGPT's sequential design is an intentional coordination strategy — it mirrors waterfall development to prevent merge conflicts and consistency failures. No two agents edit the same artifact simultaneously; each agent receives complete, stable input before starting.[13]
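A reduced sketch of that pipeline shape (role and gate functions are illustrative stand-ins for MetaGPT's prompt-encoded SOPs):

    def run_sop_pipeline(requirement: str, stages: list) -> str:
        # stages: ordered (role, gate) pairs, e.g. product_manager -> ... -> qa.
        artifact = requirement
        for role, gate in stages:
            artifact = role(artifact)       # exactly one writer at any moment
            if not gate(artifact):          # verification at every handoff
                raise ValueError(f"handoff rejected after {role.__name__}")
        return artifact                     # no merge step exists to conflict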

Sequential vs. Parallel: Explicit Trade-off

Property | Sequential (MetaGPT) | Parallel (MultiDevin, /fleet)
Merge conflicts | Eliminated (by design)[13] | Risk present; deferred to merge time
Consistency failures | Caught at handoff | Require post-hoc reconciliation
Throughput | Lower | Higher (when tasks are independent)
Interdependency handling | Natural (sequential ordering) | Requires explicit decomposition

Documented Failure Modes in Multi-Agent Systems (MetaGPT corpus)

Failure Mode | Description
Assistant repeated instruction | Agent repeats instructions instead of executing them[13]
Infinite loop of message | Agents get stuck in recursive message exchange[13]

Performance and Recent Developments

Metric | Value
HumanEval Pass@1 | 85.9% (state-of-the-art at publication)[5][23]
MBPP Pass@1 | 87.7% (state-of-the-art at publication)[5][23]
MBPP executive feedback improvement | +5.4% absolute[5][13]
Experimental task completion rate | 100%[5][13][23]
AFlow (Jan 2025) — ICLR 2025 Oral rank | #2 in LLM-based Agent category[23]
MGX (Feb 2025) | "World's first AI agent development team"[5][23]

See also: Coordination Failures (Section 9); Work Partitioning Patterns (Section 8)

Section 5: GitHub Copilot Workspace, Coding Agent, and Fleet

GitHub launched Copilot Workspace as a technical preview in April 2024 — a browser-based environment that turned a plain-English GitHub issue into a spec, plan, and code changes via a four-stage workflow.[7][14][24] By September 2025, GitHub rebuilt those learnings into the Copilot Coding Agent (GA to all paid subscribers), incorporating a sub-agent architecture, issue-to-PR async workflow, GitHub Actions as execution environment, and isolated environments respecting repository access scopes.[7][14]

Copilot Workspace Four-Stage Workflow

Stage | Output
1. Task definition | Parsed GitHub issue[7]
2. Specification generation | Natural-language spec[14]
3. Plan generation | Files to create / modify / delete[14]
4. Implementation | Code changes in isolated environment[7]

Agent Mode in IDEs (Announced February 2025)

Agent Mode rolled out to VS Code, JetBrains, Eclipse, and Xcode.[14][24] It independently translates ideas into code, automatically identifies subtasks, executes across multiple files, and self-corrects on lint errors and test failures.

Tool Available to Agent Mode | Function
read_file | Read file contents[14]
list_dir | Enumerate directory[14]
run_terminal | Execute shell commands[14]
apply_edit | Apply code modifications[14]

Known limitation: Sub-agents in Copilot Agent Mode (IDE) cannot currently run in parallel — they execute sequentially.[14][24]

The /fleet Command: Parallel Subagent Architecture

The /fleet slash command breaks complex requests into smaller tasks and runs them in parallel.[7][14][24] The main Copilot agent acts as an orchestrator, dispatching parallel subagents — by default using a low-cost AI model, overridable to custom agents via @CUSTOM-AGENT-NAME.

Key finding: Sub-agents in /fleet share a filesystem with no file locking.[14] Work partitioning to avoid conflicts is entirely the user's responsibility — the framework provides no automatic conflict prevention for parallel agents modifying the same files.
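Since the framework will not catch overlap, a pre-dispatch disjointness check is worth running; a minimal sketch, assuming each task declares the files it plans to touch:

    from itertools import combinations

    def assert_disjoint(planned: dict) -> None:
        # planned: task name -> set of files the subagent intends to modify.
        for (a, files_a), (b, files_b) in combinations(planned.items(), 2):
            overlap = files_a & files_b
            if overlap:
                raise ValueError(f"tasks {a!r} and {b!r} both touch {sorted(overlap)}")

    assert_disjoint({
        "refactor-auth": {"src/auth.py", "tests/test_auth.py"},
        "update-docs": {"docs/api.md"},
    })  # raises before dispatch if two parallel subagents would collide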

Fleet Use Cases vs. Limitations

Good Use Case | Problematic Use Case
Refactoring across multiple independent files[7] | Two agents touching the same shared file
Documentation for several components[7] | Interdependent API + frontend work[14]
Features spanning API/UI/tests (if independent) | Tasks without explicit dependency tracking
Independent code modifications not sharing state[24] | Any unpartitioned shared-state work

Mission Control and Merge Conflict Warning (Late 2025)

Mission Control provides a unified interface for managing multiple parallel Copilot Coding Agent tasks — assign, pick agents, watch real-time logs, steer mid-run, and jump to resulting PRs.[7][14][24]

GitHub's own documentation warns: "When assigning multiple tasks from the same repo, it's important to consider overlap: agents working in parallel can create merge conflicts if they touch the same files, so being thoughtful about partitioning work is essential."[7][14][24]

AgentHQ: Orchestra Pattern (November 2025)

AgentHQ is a platform for building and deploying custom agents integrated with GitHub workflows, using the "Orchestra" pattern: a Conductor agent orchestrates specialized Planning, Implementation, and Code Review subagents.[24]

Task Isolation Mechanism

Copilot CLI can run in the background, optionally using git worktrees for isolation. GitHub Agentic Workflows are designed with isolation, constrained outputs, and comprehensive logging.[7][14] This contrasts with the /fleet shared-filesystem model documented above, making the CLI's git worktree option an important counterpoint for tasks where stronger isolation guarantees are required.

See also: Isolation Mechanisms (Section 7); Coordination Failures (Section 9)

Section 6: Aider, Patchwork, and AutoDev

Aider: Architect/Coder Two-Model Pipeline

Aider operates through a sequential two-model pipeline in architect mode: an architect model proposes how to solve the coding request, and an editor model turns that proposal into specific file editing instructions.[15] This is not concurrent multi-agent; the coordination challenge is handoff quality, not concurrency control.

Rationale for two-model design: Certain LLMs (especially reasoning models like o1) excel at reasoning but produce poor edit syntax. Separating architecture from editing eliminates hallucinated edits and malformed diff output.[15]
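The handoff reduces to a two-call data flow; a sketch with hypothetical ask_* callables standing in for the two model endpoints:

    def architect_mode(ask_architect, ask_editor, request: str, file_context: str) -> str:
        # Reasoning-strong model proposes; it never emits edit syntax.
        proposal = ask_architect(f"Propose how to solve:\n{request}\n\n{file_context}")
        # Edit-reliable model translates the proposal into well-formed
        # search/replace blocks; it never reasons about the task itself.
        return ask_editor(f"Turn this plan into exact edit blocks:\n{proposal}")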

Isolation model: Aider "thinks in git" — every edit is a commit, every session a branch that can be reviewed, reverted, or cherry-picked. Natural isolation through git history rather than worktrees.[15][25] Aider does not run multiple concurrent agents on the same repo.

Attribute | Value
Tool count | 0 (zero LLM-callable tools; user drives navigation)[29]
Context retrieval | PageRank-weighted dependency graphs[29]
Isolation mechanism | Git history (no worktrees); human supervision as safety boundary[15][29]
Benchmark (DeepSeek R1 + Claude 3.5 Sonnet) | 64% accuracy at $13.29 cost[15]
Role in broader ecosystem | Worker agent within orchestration systems (Claude Squad, AgentsMesh, Toryo, ai-maestro, Composio)[25]

Patchwork (patched-codes): Three-Layer Isolation Stack

Patchwork is a self-hosted CLI agent automating PR reviews, bug fixing, and security patching using the user's preferred LLMs (AGPL-3.0 core; Apache-2.0 for custom patchflows).[8][16][26] Key milestones: July 2024 (RTC evaluation methodology), December 2024 (official GitHub Action).

Key finding: "Parallelism is not the hard part — isolation is. If two agents can edit the same repo but you cannot review, replay, or merge their work safely, you don't have a scalable workflow — you have a faster way to create conflicts."[16]

Patchwork's Three-Layer Isolation Stack

Layer | Mechanism | Problem Solved
1. Filesystem isolation | Git worktrees — one task → one branch → one worktree → one agent[8][26] | Concurrent file overwrites
2. State isolation | Separate database names per worktree[8] | Interleaved test data
3. Runtime isolation | Deterministically assigned dev server ports[8][26] | Port collision between agents

Coordination mechanism: architect agent plans work, manager breaks into tasks, engineers execute in isolated environments, kanban board tracks state, tests gate completion.[26]

Practical scaling ceiling: 5–7 concurrent agents on a modern laptop before rate limits, disk consumption, and merge review overhead cancel throughput gains. Six agents on a 2 GB codebase consume 30+ GB of disk (≈5 GB per worktree).[8][26]
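A sketch of deterministic per-task namespacing in the spirit of that stack; the path layout, database naming, and port base are assumptions, not Patchwork's actual scheme:

    import hashlib
    import subprocess

    def provision(task_id: str) -> dict:
        worktree = f"../wt-{task_id}"
        # Layer 1: filesystem -- one task, one branch, one worktree, one agent.
        subprocess.run(["git", "worktree", "add", "-b", f"agent/{task_id}", worktree],
                       check=True)
        digest = int(hashlib.sha256(task_id.encode()).hexdigest(), 16)
        return {
            "WORKTREE": worktree,
            "DATABASE_NAME": f"app_{task_id}",             # Layer 2: state
            "DEV_SERVER_PORT": str(3000 + digest % 1000),  # Layer 3: runtime
        }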

AutoDev (Microsoft Research, 2024)

AutoDev is a fully automated AI-driven software development framework (arXiv: 2403.08299, 5 Microsoft researchers) for autonomous planning and execution of complex software engineering tasks.[17][30] It builds on AutoGen and Auto-GPT, adding direct repository interaction and filling the gap left by GitHub Copilot's limitation to code-snippet suggestions. AutoDev's LLM-agnostic Agent Scheduler lets models of diverse sizes and architectures collaborate on the same task, shifting the developer's role from manual code validator to multi-agent supervisor.[17][30]

AutoDev Four Core Components

Component | Function
Conversation Manager | Supervises dialogue between user, AI agents, and system; manages interruptions[17]
Agent Scheduler | Schedules and orchestrates agents to collaborate; employs various collaboration algorithms[17][30]
Tools Library | File editing, retrieval, build, execution, testing, git operations[17]
Evaluation Environment | Secure Docker container; runs commands, abstracts low-level complexity, closes the feedback loop[17]

Isolation model: All operations occur within Docker containers, isolating them from the host. The architecture does not describe explicit file-level locking or worktree isolation for agents running in parallel within a session — Docker isolation is per-session, not per-agent within a session.[17]

AutoDev Performance

Metric | Score
Code generation Pass@1 (HumanEval) | 91.5% (second-best; best requiring no extra training data)[17][30]
Test generation Pass@1 | 87.8%, with 99.3% coverage from passing tests[17][30]
Languages supported | Java, Kotlin, JavaScript/TypeScript, Rust, Python, Go, C/C++[17]

See also: Isolation Mechanisms (Section 7); Agent Architecture Taxonomy (Section 10)

Section 7: Isolation Mechanisms — Git Worktrees vs. Docker

Note: Detailed git worktree internals (two-pointer model, branch exclusivity, object-store sharing) are covered in the Git Worktree Mechanics pillar; this section focuses on comparative isolation tradeoffs specifically relevant to agent system design.

The four failure modes that worktree isolation prevents are well-documented across the corpus:[2][20]

# | Failure Mode | Mechanism
1 | Concurrent File Overwrites | One agent's changes silently overwrite another's without detection until merge — untracked data loss[2]
2 | Context Contamination | Agents in separate context windows are unaware of peer changes — Agent A's refactoring invalidates Agent B's assumptions mid-task[2]
3 | Race Conditions on Shared State | Multiple agents independently trigger expensive operations (builds, tests), causing resource thrashing[2]
4 | Git Lock Contention | Concurrent git operations fail with fatal errors on .git/index.lock; agents don't gracefully recover, leaving stale locks requiring manual intervention[2]

How Git Worktrees Work

A git worktree enables multiple working directories from the same repository. Each worktree has its own working directory, private HEAD pointer, and private staging index, but shares the same .git object store (no duplication of history).[2] The two-pointer model: $GIT_DIR points to each worktree's private directory; $GIT_COMMON_DIR points to the shared .git.[2]

Key finding: "Parallel agents without isolation is not acceleration. It is entropy."[19] Git simultaneously acts as isolation mechanism (worktrees separate agents), integration boundary (deliberate merges via PRs), conflict detection (surfaces overlapping changes at merge time), and rollback capability (failed branches can be discarded).[19]

Branch exclusivity enforced by git: the same branch cannot be checked out in more than one worktree simultaneously by default.[2] Merge conflicts are deferred from runtime to merge time, where standard git tooling detects them as visible conflicts rather than silent overwrites.[1][2]
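Branch exclusivity can be demonstrated directly; the snippet below shells out to git and is runnable in any repository that has a feature-x branch (an illustrative name):

    import subprocess

    def add_worktree(path: str, branch: str) -> bool:
        result = subprocess.run(["git", "worktree", "add", path, branch],
                                capture_output=True, text=True)
        if result.returncode != 0:
            print(result.stderr.strip())  # git reports the branch is already checked out
        return result.returncode == 0

    add_worktree("../wt-a", "feature-x")  # succeeds: first checkout of the branch
    add_worktree("../wt-b", "feature-x")  # fails: exclusivity enforced by git itself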

Isolation Strategy Comparison

Strategy | Git Objects | Creation Time | Filesystem Isolation | Runtime Isolation | Disk Cost
Worktrees | Shared | Seconds | File-level | None | Low[2]
Docker containers | Per-image layers | Minutes | Full with namespaces | Complete | High[2]
Separate clones | Duplicated | Minutes | Full | None | Very high[2]
Sequential checkout | Shared | Instant | None | N/A | N/A[2]
Copy-on-Write | Shared | Instant | Full | None | Very low[2]

Worktrees excel for code-only parallel generation with 3–5 concurrent agents. Docker wins when agents need full runtime isolation (separate ports, databases, network namespaces).[2]

Known Worktree Limitations

Limitation | Detail
No runtime isolation | Worktrees share local databases, Docker daemon, cache directories, ports — requires three-layer stack for full isolation[1][2][8]
No cross-worktree conflict warnings | Git provides no alerts when worktrees modify identical files on different branches[2][16]
Shared git hooks | Hooks in .git/hooks/ execute in all worktrees; pre-commit assumptions may not hold in fresh worktrees[1][2]
Submodule multiplication | Each worktree gets its own submodule set, multiplying disk usage[2]
IDE gaps (historical) | Full worktree support arrived in JetBrains 2026.1 and VS Code July 2025[2]
Monorepo performance | File watchers and build tools in each worktree compound I/O; git sparse-checkout can constrain scope[2]

The Sequential Merge Problem

Even with perfect execution-time isolation, merging remains sequential. Three common patterns:[16]

  1. Time-based merge: first-come, first-served
  2. Topological merge: dependencies determine order
  3. Human-mediated merge: each agent creates a PR; human reviews and merges

None of these eliminate merge conflicts when agents modify the same shared files — they only defer conflict detection to merge time.[16]
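Topological merging is the only one of the three that uses task structure; a minimal sketch with the standard library and illustrative branch names:

    from graphlib import TopologicalSorter

    dependencies = {
        "feature/frontend": {"feature/api"},  # frontend consumes the API
        "feature/api": set(),
        "chore/lint": set(),
    }
    # Prerequisites land first; conflicts still surface, just in a sane order.
    for branch in TopologicalSorter(dependencies).static_order():
        print(f"git merge {branch}")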

See also: Git Worktree Mechanics (separate pillar); Coordination Failures (Section 9)

Section 8: Work Partitioning and Orchestration Patterns

The Fundamental Challenge: Task Decomposition Quality

Worktree isolation improves execution for independent tasks but cannot resolve file-level dependencies between concurrent agents. If Agent A builds an API while Agent B builds a frontend consuming that API, these must be sequenced, not parallelized.[2][8][10][26]

Key finding: "The secret to building robust, performant systems is the topology of coordination and not simply adding more agents to the task."[25] A good architect agent that breaks 'Build auth system' into well-scoped, independent subtasks will outperform six engineers working on poorly defined work.[26]

Augment Code's Intent Platform: Three-Tier Architecture

Tier | Agent Role | Function
1 | Coordinator Agent | Plans work, reviews specs before implementation, decomposes tasks into dependency-ordered waves[2][20]
2 | Specialist Agents (6 personas) | Investigate, Implement, Verify, Critique, Debug, Code Review[2]
3 | Verifier Agent | Quality gate: checks results against spec for inconsistencies, bugs, missing pieces[2]

Living Spec: A shared coordination artifact all agents continuously reference — "the source of truth that keeps all participants aligned."[2] Context Engine: Semantic codebase indexing shared across all agents, supporting 400,000+ file repositories.[2]

Addy Osmani's Code Agent Orchestra: Two Coordination Patterns

Pattern 1: Subagents (In-Process Delegation)

Pattern 2: Agent Teams (True Peer Coordination)

Conflict Avoidance Mechanisms

Mechanism | How It Works
Plan approval | Teammates submit implementation plans before coding; leads approve or reject[10]
Lifecycle hooks | Automated checks (lint, tests) before task completion[10]
Task dependencies | Explicit blocking relationships prevent out-of-order execution[10]
Token budgeting | Hard per-agent limits; auto-pause at 85% consumed[10]
Kill criteria | Reassign agents stuck 3+ iterations on identical errors[10]
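Two of the mechanisms above reduce to guard checks an orchestrator can run between iterations; a sketch using the thresholds reported (85% budget pause, reassignment after 3 identical errors):

    def should_pause(tokens_used: int, token_budget: int) -> bool:
        # Token budgeting: hard per-agent limit, auto-pause at 85% consumed.
        return tokens_used >= 0.85 * token_budget

    def should_reassign(recent_errors: list) -> bool:
        # Kill criteria: 3+ consecutive iterations failing on the identical error.
        return len(recent_errors) >= 3 and len(set(recent_errors[-3:])) == 1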

Recommended team size: "Three to five teammates is the sweet spot."[10]

Coordination Topologies: Five Models

Topology | Description | Coordination Overhead
Single-Agent System (SAS) | Baseline — one agent, no coordination | None
Independent MAS | Agents work without coordination | None (at cost of consistency)
Decentralised MAS | Peer-to-peer coordination | Medium
Centralised MAS | Orchestrator coordinates all agents | High (orchestrator bottleneck)
Hybrid MAS | Mix of central and peer coordination | Medium-high

(Source: CodeCRDT research, 600 trials using Claude Sonnet)[25]

Cursor's Internal Architecture: What Failed and What Worked

Approach Tried | Result | Root Cause
Equal-status agents with locking | Failed — 20 agents slowed to throughput of 2–3[28] | Agents held locks too long
Optimistic concurrency control | Failed — agents became risk-averse, avoided hard tasks[28] | Conflict cost changed agent behavior
Planners + Workers + Judges (hierarchical) | Successful[28] | Role differentiation enables scale

The Ralph Loop: Self-Improving Stateless Cycle

Five-step stateless-but-iterative cycle for solo agent sessions:[10]

  1. Pick next task from task list
  2. Implement change
  3. Validate (tests, types, lint)
  4. Commit if checks pass
  5. Reset context for next iteration

External memory persists through: git history, progress logs, task state files, and AGENTS.md. Research note: "LLM-generated AGENTS.md files offer no benefit and can marginally reduce success rates (~3%) while increasing costs over 20%." Developer-written context provides ~4% improvement.[10]
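The loop's skeleton, sketched below with run_agent and the make targets as assumptions rather than any specific product's interface:

    import subprocess

    def ralph_loop(run_agent, task_list: list) -> None:
        for task in task_list:                                 # 1. pick next task
            run_agent(task)                                    # 2. implement, fresh context
            checks = subprocess.run(["make", "test", "lint"])  # 3. validate
            if checks.returncode == 0:                         # 4. commit if green
                subprocess.run(["git", "commit", "-am", f"ralph: {task}"], check=True)
            # 5. context resets each iteration; git history, task state files,
            #    and AGENTS.md are the only memory that persists.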

Azure AI Agent Design Principles (Microsoft)

See also: Lock Design Granularity (separate pillar); Concurrency Control Theory (separate pillar)

Section 9: Coordination Failures — Empirical Research and Documented Failure Modes

CooperBench: The "Curse of Coordination" (arXiv: 2601.13295)

Empirical research demonstrates a "curse of coordination": agent cooperation performs significantly worse than a single agent given the same total workload.[28]

Key finding: Multi-agent configurations degrade performance by 39 to 70 percent relative to single-agent baselines. Inter-agent misalignment is identified as the primary failure category.[28]

CooperBench tested 2–4 agent configurations on SWE-bench-style tasks; the 39–70% degradation range varies by agent count and task coupling, with higher agent counts and tighter task coupling producing worse outcomes.[28]

CodeCRDT: Coordination Trade-off Is Not Uniform (arXiv: 2510.18893)

CodeCRDT proposes observation-driven coordination — agents coordinate by monitoring a shared state with observable updates, using deterministic convergence rather than explicit message passing.[25]

Outcome | Result (600 trials, Claude Sonnet)
Best case (some task types) | Up to 21.1% speedup with parallel coordination[25]
Worst case (other task types) | Up to 39.4% slowdown[25]

Whether multi-agent coordination helps or hurts depends heavily on task structure — not simply on adding more agents.[25]
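A toy rendering of observation-driven coordination; the dict-backed store is a stand-in (a real CRDT guarantees deterministic convergence under concurrent updates, which this does not):

    class SharedState:
        # Agents coordinate by watching this state, never by messaging peers.
        def __init__(self):
            self.doc = {}
            self.version = 0

        def update(self, key: str, value: str) -> None:
            self.doc[key] = value
            self.version += 1            # observable: watchers poll the version

    def agent_step(state: SharedState, last_seen: int, react) -> int:
        if state.version > last_seen:    # coordinate by observation...
            react(dict(state.doc))       # ...reacting to converged state, no messages
        return state.version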

AgenticFlict Dataset (arXiv: 2604.03551)

Finding | Data
Merge conflicts in Agentic-PR rejections | >1.1% of all rejections[28]
Root cause of some failures | Superficial differences (formatting, indentation) rather than substantive conflicts[28]

Key Failure Modes at Scale

Failure Mode | Description
Coherence degradation | "Lost in the middle" phenomenon as context grows[28]
Architectural drift | Agents make locally sensible but globally inconsistent decisions[28]
Pattern violation | Agents suggest or use deprecated APIs[28]
Staleness | Index updates lag behind rapid development[28]
Task-scoped isolation failure | Two workers independently modify the same file when tasks are not properly scoped[11]

Note: MetaGPT-specific failures (assistant repeated instruction, infinite loop of message) — see Section 4.

SWE-Bench Complexity Cliff

Benchmark | Scope | Top Model Score
SWE-Bench Verified | Single-issue tasks | >70%[10]
SWE-Bench Pro | Multi-file patches averaging 107 lines across 4+ files | ~23%[9]

This 3×+ performance drop demonstrates the need to decompose work into smaller, testable units that stay within each agent's accuracy range.[10]

Solutions That Work

Solution | Components | Source
Claude Code shared task list pattern | Status flags (lock claims) + git worktrees (isolate edits) + dependency markers (sequence constrained work) | [28]
Centralized orchestrator with per-agent worktrees | Orchestrator dispatches; each agent gets its own working copy (e.g., Nevo production system) | [28]
Plan approval before coding | Prevents architectural mistakes before code exists | [10]
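The first row's pattern, sketched under assumptions about the task-file format (status flags as lock claims, dependency markers gating order):

    import json, os

    def claim_next_task(path: str, agent: str):
        tasks = json.load(open(path))
        done = {t["id"] for t in tasks if t["status"] == "done"}
        for t in tasks:
            # Dependency markers: only claim work whose prerequisites landed.
            if t["status"] == "todo" and set(t.get("deps", [])) <= done:
                t["status"], t["owner"] = "claimed", agent   # the lock claim
                tmp = path + ".tmp"
                with open(tmp, "w") as f:
                    json.dump(tasks, f)
                os.replace(tmp, path)   # atomic rename; still racy if two agents
                return t                # read simultaneously -- a claim, not a mutex
        return None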

AI-Assisted Merge Conflict Resolution Tools (2025)

Tool | Capability
JetBrains AI Assistant | Integrated merge conflict suggestions[28]
VS Code 1.105 (Sept 2025) | AI-assisted merge conflict resolution using merge base and both branch changes as context[28]
GitKraken Desktop | "Auto-resolve with AI" with explanations[28]
Graphite | AI merge conflict resolution guidance[28]
Resolve.AI | Dedicated merge conflict tool[28]
CodeGPT | Intelligent merge resolution[28]

Recommended Scaling Approach

A common mistake is launching the maximum number of agents immediately, before learning coordination patterns; the result is overwhelming complexity and incompatible code.[10][25] Recommended: start with 2 agents on well-isolated features, master the workflow, then scale to 4, 6, 8 — task decomposition quality is the primary variable. "Swarming only works when work units are genuinely independent."[1]

See also: Concurrency Control Theory (separate pillar); Work Partitioning Patterns (Section 8)

Section 10: Agent Architecture Taxonomy — Source-Code Survey of 13 Systems

A systematic taxonomy (arXiv: 2604.03515v2) classifies 13 open-source coding agents across three layers: (1) control architecture, (2) tool/environment interface, and (3) resource management.[29] Agents occupy positions along continuous spectra — not discrete categories.

Key finding: 11 of 13 agents compose multiple primitives rather than relying on a single control structure.[29] Four core tool capability categories — Read, Search, Edit, Execute — appear in all LLM-driven agents, with tool counts ranging from 0 (Aider) to 37 action classes (Moatless Tools).[29]

Control Architecture Patterns

Loop Strategy | Agents | Complexity
Fixed pipeline | Agentless | Lowest
Sequential ReAct loops | 7 agents (including SWE-agent, OpenHands) | Medium
Phased state machine | Prometheus | Medium-high
Full Monte Carlo Tree Search with backpropagation | Moatless Tools | Highest

Loop Driver Paradigms

Driver | Mechanism | Examples
User-driven | Humans select files — sidesteps localization bottleneck | Aider[29]
Scaffold-driven | Pre-computed paths determine agent actions | Agentless, AutoCodeRover[29]
LLM-driven | Full tool autonomy; agent chooses actions | 9 of 13 agents surveyed[29]

Context Retrieval Paradigms

Approach | Agents | Mechanism
LLM-as-Navigator | 8 agents | Grep/find tools; LLM formulates queries[29]
Scaffold-side understanding | 5 agents | Pre-computed code representations[29]
— PageRank-weighted dependency graphs | Aider | Graph-based relevance ranking[29]
— AST-indexed + spectrum fault localization | AutoCodeRover | Structure-aware queries; unique in corpus[29]
— Neo4j knowledge graphs (20 languages) | Prometheus | Cross-language symbol graph[29]
— FAISS embedding-based semantic search | Moatless Tools | Vector similarity[29]

Execution Isolation Strategies

Approach | Agents | Mechanism
Containerized (Docker) | 5 agents: SWE-agent, OpenHands, DARS-Agent, AutoCodeRover, Prometheus | In-container FastAPI servers (OpenHands); full namespace isolation[29]
Shadow git checkpoints | Cline | Rollback without touching user history[29]
OS-level sandboxing | Codex CLI | Bubblewrap/Landlock + LLM guardian safety scoring (0–100 scale, 80-point threshold)[29]
Rule-based policy engine | Gemini CLI | Per-tool approval requirements[29]
Human supervision | Aider | Safety boundary is the human operator[29]
Stateless subshells | mini-swe-agent | Process-level isolation per operation[29]

Speculative Execution (Tree-Search Agents)

Agent | Mechanism
DARS-Agent | Full Docker reset; replays all actions from root[29]
Moatless Tools | Shadow mode — tracks modifications in-memory without filesystem writes[29]

Cross-Agent Architecture Comparison

Dimension | SWE-agent | Aider | OpenHands | Prometheus
Loop strategy | Sequential ReAct | User-driven | Sequential ReAct | Phased state machine
Tool count | 3–35 (bundled) | 0 (text-parsed) | 9+ (MCP-enabled) | 17 (per-node scoped)
Context retrieval | Keyword search | PageRank graph | Keyword search | Neo4j knowledge graph
Isolation | Docker | None (user trust) | Docker + HTTP API | Docker
Compaction strategy | Polling triggers | Summarization | Request-based | Not detailed

Source: arXiv 2604.03515v2[29]

Context Compaction Strategies

Seven compaction strategies are identified across the 13 agents:[29]

  1. Hard truncation
  2. Sliding windows
  3. LLM-generated summarization (Aider)
  4. Selective tool-result dropping
  5. Polling parameters (SWE-agent)
  6. Request-based compaction (OpenHands)
  7. Per-node tool scoping (Prometheus)

See also: Isolation Mechanisms (Section 7)

Section 11: Comparative Benchmarks Across Systems

Cross-System Architecture Overview

System | Isolation Mechanism | Coordination Pattern | Max Parallel Scale | Open / Closed | Key Limitation
Devin | Isolated VM per instance[3][11] | Manager assigns to workers; manager merges results[11] | 10 workers (MultiDevin)[11] | Closed | Task-level isolation only — workers can modify the same file if tasks are mis-scoped[11]
SWE-agent | Docker container per issue[27][29] | Single-agent per issue; no built-in parallel execution[27] | 1 (single-agent design)[27] | Open | No multi-agent parallelism; parallel scale requires external orchestration[9]
OpenHands | Docker container per session + event-stream state[6][12] | Hierarchical delegation via AgentDelegateAction[4] | Cloud-scale (V1 SDK: native distributed deployment)[4][12] | Open | Automatic workflow generation still requires substantial handcrafting[22]
MetaGPT | Sequential handoffs; no concurrent artifact editing[13] | Sequential SOP pipeline (ProductManager → Architect → Engineer → QA)[5] | 1 (intentionally sequential)[13] | Open | No parallelism by design; lower throughput than parallel systems[13]
GitHub Copilot / /fleet | Shared filesystem, no file locking (/fleet); optional git worktrees in CLI[14] | Orchestrator dispatches parallel subagents[7][14] | Parallel (no stated maximum); Mission Control manages multiple tasks concurrently[14][24] | Closed | No automatic file-conflict prevention in /fleet; partitioning is the user's responsibility[14]
Aider | Git history / user-supervised (no worktrees)[15][25] | Sequential two-model pipeline (architect → editor)[15] | 1 (no concurrent agents on same repo)[15] | Open | Not designed for parallel execution; used as worker agent in external orchestration[25]
Patchwork | Three-layer stack: git worktrees + per-worktree database names + port namespacing[8][26] | Architect plans → manager decomposes → engineers execute in isolated worktrees[26] | 5–7 practical ceiling (rate limits + disk + review overhead)[8][26] | Open | Disk and rate-limit ceiling; ~5 GB per worktree on a 2 GB codebase[8]
AutoDev | Docker container per session (session-scoped, not per-agent within a session)[17] | Agent Scheduler orchestrates LLM-agnostic multi-model collaboration[17][30] | Session-scoped (parallel scale not specified)[17] | Open | No explicit per-agent file-level locking or worktree isolation within a session[17]

System | Benchmark | Score | Date / Configuration
OpenHands | SWE-Bench Verified | 72%[4][12] | 2026, Claude Sonnet 4.5 + extended thinking
OpenHands | SWE-Bench Lite | 26%[6] | At publication, CodeActAgent v1.8 + claude-3.5-sonnet
OpenHands | GAIA (val set) | 67.9%[4] | 2025
MetaGPT | HumanEval Pass@1 | 85.9%[5][23] | At ICLR 2024 publication
MetaGPT | MBPP Pass@1 | 87.7%[5][23] | At ICLR 2024 publication
AutoDev | HumanEval Pass@1 | 91.5%[17][30] | 2024 (best requiring no extra training data)
AutoDev | Test generation Pass@1 | 87.8%[17] | 99.3% coverage from passing tests
SWE-agent | SWE-bench (pass@1) | 12.5%[27] | NeurIPS 2024, best open-source at publication
SWE-agent | HumanEvalFix (pass@1) | 87.7%[27] | NeurIPS 2024
SWE-agent (mini) | SWE-bench Verified | >74%[27] | 100 LOC implementation
SWE-Edit (decomposed) | SWE-bench Verified (vs. baseline) | +2.1% accuracy / −17.9% cost[9] | 2025
Aider (DeepSeek R1 + Claude 3.5 Sonnet) | Internal benchmark | 64% at $13.29[15] | Architect mode
Frontier models (general) | SWE-Bench Verified | >70%[10] | 2025 state of the art
Frontier models (general) | SWE-Bench Pro (multi-file, 107 lines avg, 4+ files) | ~23%[9] | 2025 state of the art

Section 12: Practical Scaling Limits and Emerging Open-Source Tooling

Practical Scaling Ceilings

Threshold | Value | Limiting Factor
Recommended sweet spot (most repos) | 3–5 concurrent agents[8] | Merge review overhead begins to exceed throughput gains
Productive ceiling (modern laptop) | 5–7 concurrent agents[8][25][26] | Rate limits, disk consumption, review overhead
Disk consumption (2 GB codebase) | ~30+ GB for 6 worktree agents (~5 GB each)[8][26] | Submodule multiplication + per-worktree indexes
Cursor 2.0 (Oct 2025) supported agents | Up to 8 concurrent[25] | Product-imposed limit
Organizations reporting workflow improvement | 20–30% faster cycles with multi-agent setups[25] | Dependent on task partitioning quality
Rate limit impact | 10 Claude Code instances hit Anthropic rate limits faster than 1[25] | API throughput constraints

The Verification Bottleneck

"The bottleneck is no longer generation. It's verification."[10] Three-layer verification approach recommended:

  1. Plan approval — prevents architectural mistakes before code exists
  2. Hooks — enforce automated validation on lifecycle events
  3. AGENTS.md — compounds learning across sessions (developer-written; LLM-generated versions incur ~3% regression and >20% cost increase)[10]

Human review remains mandatory — agents generate volume quickly, but determining correctness requires full system context.[10]

Open-Source Parallel Agent Tooling Ecosystem

Tool | Key Feature | Isolation Mechanism
Parallel Code | Each task gets own git branch and worktree[16] | Git worktrees
Superset IDE | Run 10+ agents simultaneously[16] | Git worktrees
Composio Agent Orchestrator | Multiple agents, each with its own PR, supervised dashboard[16] | Git worktrees + PRs
Conductor (Mac) | Multiple parallel agents, clean separation[16] | Git worktrees
Baton (mraza007) | Polls GitHub Issues, runs Claude Code in isolated worktrees[16] | Git worktrees
Claude Squad | Orchestrates Claude Code, Aider, Codex simultaneously[16] | Git worktrees + tmux

Worktree-per-Task vs. Worktree-per-Agent Trade-off

Strategy | Advantage | Disadvantage
Worktree per task | Short-lived; no state accumulation[2] | Zero cache reuse; cold dependency installs for each task
Worktree per agent | Warm dependency caches; faster task startup[2] | Long-lived worktrees accumulate state; harder cleanup

See also: Git Worktree Mechanics (separate pillar); Lock Design Granularity (separate pillar)

Sources

  1. What is Worktree Isolation in AI Agents? Parallel Execution Without Conflicts (retrieved 2026-05-03)
  2. How to Use Git Worktrees for Parallel AI Agent Execution | Augment Code (retrieved 2026-05-03)
  3. Agent-Native Development: A Deep Dive into Devin 2.0's Technical Design (retrieved 2026-05-03)
  4. OpenHands: An Open Platform for AI Software Developers as Generalist Agents (ICLR 2025) (retrieved 2026-05-03)
  5. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (ICLR 2024) (retrieved 2026-05-03)
  6. OpenHands: An Open Platform for AI Software Developers as Generalist Agents — Full Technical Architecture (retrieved 2026-05-03)
  7. Run multiple agents at once with /fleet in Copilot CLI — GitHub Blog (retrieved 2026-05-03)
  8. Patchwork: Agentic AI Framework for Enterprise Workflow Automation (patched-codes) (retrieved 2026-05-03)
  9. SWE-Agent: Architecture, Design, and Benchmarks (retrieved 2026-05-03)
  10. The Code Agent Orchestra - What Makes Multi-Agent Coding Work (retrieved 2026-05-03)
  11. Cognition | Devin 2.0 - Agent-Native IDE & Parallel Agent Architecture (retrieved 2026-05-03)
  12. OpenHands: An Open Platform for AI Software Developers as Generalist Agents (ICLR 2025) (retrieved 2026-05-03)
  13. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (retrieved 2026-05-03)
  14. GitHub Copilot Workspace & Coding Agent: Parallel Agents Architecture (retrieved 2026-05-03)
  15. Aider: Architect/Coder Multi-Agent Mode - Coordination & Benchmarks (retrieved 2026-05-03)
  16. Patchwork: Agentic AI Framework for Enterprise Workflow Automation (retrieved 2026-05-03)
  17. AutoDev: Automated AI-Driven Development (Microsoft Research, 2024) (retrieved 2026-05-03)
  18. What is Worktree Isolation in AI Agents? Parallel Execution Without Conflicts (retrieved 2026-05-03)
  19. Swarming the Codebase: Orchestrated Execution with Multiple Claude Code Agents (retrieved 2026-05-03)
  20. How to Use Git Worktrees for Parallel AI Agent Execution (retrieved 2026-05-03)
  21. Cognition | Devin 2.0 (retrieved 2026-05-03)
  22. OpenHands: An Open Platform for AI Software Developers as Generalist Agents (ICLR 2025) (retrieved 2026-05-03)
  23. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (ICLR 2024 Oral) (retrieved 2026-05-03)
  24. Run multiple agents at once with /fleet in Copilot CLI - The GitHub Blog (retrieved 2026-05-03)
  25. CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation (retrieved 2026-05-03)
  26. Patchwork: Agentic AI Framework for Enterprise Workflow Automation (retrieved 2026-05-03)
  27. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (NeurIPS 2024) (retrieved 2026-05-03)
  28. CooperBench: Why Coding Agents Cannot be Your Teammates Yet (retrieved 2026-05-03)
  29. Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures (retrieved 2026-05-03)
  30. AutoDev: Automated AI-Driven Development (Microsoft Research, 2024) (retrieved 2026-05-03)
