Pillar: scope-overlap-detection | Date: May 2026
Scope: Static and dynamic analysis techniques for predicting whether two agents' intended changes will overlap BEFORE they start work: file dependency graphs, call graph analysis, module boundary inference, package/import graph overlap, change blast radius estimation, semantic similarity of task descriptions, and ownership map lookup as a pre-flight check.
Sources: 32 gathered, consolidated, synthesized.
Critical finding: Production multi-agent LLM systems show failure rates between 41% and 86.7%, and nearly 79% of those failures originate from specification and coordination issues — not model capability limitations.[18][28] Pre-flight conflict detection directly addresses the dominant failure mode, making it the highest-leverage investment in any parallel coding agent system.
Microsoft's ConE system, deployed in March 2020 across 234 repositories, provides the most rigorous production evidence for pre-flight conflict detection at scale. Over 26,000 pull requests, ConE generated 775 recommendations about conflicting changes and was rated useful in over 70% of cases. More tellingly, over 90% of the 48 interviewed developers intended to keep using it daily — a retention rate that decisively separates it from the technically comparable blast-radius.dev project, which was abandoned due to insufficient adoption despite sound engineering.[10] ConE achieves this by avoiding deep semantic analysis entirely: its Extent of Overlap (EOO) metric is a lightweight file-level scalar conflict score, deliberately fast and scalable. The deployment contrast with blast-radius.dev (voluntary external tool, no reported retention, now discontinued) establishes that deployment model — mandated internal integration versus voluntary adoption — predicts success more reliably than technical sophistication.
Static semantic analysis delivers a recall of 0.60 against dynamic analysis's 0.14 on the same benchmark — a fourfold improvement — while running at a median of 17.8 seconds, two to three orders of magnitude faster than prior information flow approaches.[1] The technique, evaluated across 99 experimental units from 54 merge scenarios in 39 projects, uses four algorithms — interprocedural data flow, confluence, override assignment, and program dependence graph analysis — applied to merged code annotated with per-contributor metadata. The F1 of 0.50 comes with a precision of only 0.43, with refactoring changes accounting for 14 of the 26 false positives and deleted lines for 7 of the 13 false negatives. Pre-flight application is direct: annotate planned modification zones with agent metadata and run the four analyses against the current codebase before any agent writes code. If data flow or confluence paths connect two planned zones, conflict is predicted upfront.[22]
For call graph-based impact analysis, the counterintuitive finding from a study of 10 open-source Java projects with approximately 17,000 mutants is that the most basic algorithm — Class Hierarchy Analysis (CHA) — gives the best precision-recall tradeoff for impact prediction.[2] More sophisticated pointer analysis (SPARK) improves completeness but raises false positive rates in conflict prediction, making it counterproductive for pre-flight gates. Practical implication: transitive CHA closure from an agent's target files defines a reliable blast radius estimate. The intersection of two agents' transitive closures defines the predicted conflict zone. Teams should start with CHA rather than investing in expensive whole-program pointer analysis, reserving SPARK for post-flag deep analysis of confirmed high-risk pairs only.[23]
Task description similarity analysis using SBERT (Sentence-BERT) embeddings achieves F1 scores of 87.1% to 92.3% for detecting intent overlap across three of four benchmark datasets — outperforming GPT-4o, Llama-3, Claude Sonnet 3.5, and Gemini-1.5 in domain-specific settings.[4][32] The two-phase S3CDA algorithm computes cosine similarity on SBERT vectors, then validates high-similarity pairs through entity extraction (actors, actions, objects, resources) with an overlap ratio threshold. For maximum recall, the unsupervised variant UnSupCDA achieves 100% recall across most datasets at the cost of lower precision. This semantic layer is orthogonal to dependency graph analysis: high similarity without file overlap signals hidden future conflict; file overlap without semantic similarity may be incidental co-location. Running SBERT similarity in milliseconds against all pending task pairs provides a Layer 0 gate that filters candidates before triggering any structural analysis.
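The Layer 0 gate described here reduces to a cosine threshold over task embedding vectors. A minimal sketch, assuming embeddings have already been produced by an SBERT model (e.g. via the sentence-transformers library); the 0.7 threshold and task IDs are illustrative choices, not values from the study:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def intent_overlap_pairs(task_vectors, threshold=0.7):
    """Flag task pairs whose embeddings exceed the similarity threshold.

    task_vectors: task ID -> embedding vector. In practice the vectors
    come from an SBERT model; here any numeric vectors will do.
    """
    ids = sorted(task_vectors)
    return [(a, b)
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
            if cosine(task_vectors[a], task_vectors[b]) >= threshold]
```

Because the pairwise scan is quadratic only in the number of *pending* tasks (typically small), this check stays in the millisecond range even before any structural analysis runs.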
Code ownership maps expose a hidden divergence problem with direct implications for agent task assignment: across studied systems, only 0–40% of developers are commonly identified by both commit-based and line-based ownership methods (correlation: 0.24–0.65), and 79% of individual code owners were NOT among the top 100 most frequent committers.[3][29] Using commit frequency as an ownership proxy therefore misses the majority of actual owners. Commit-based metrics — the proportion of a file's commits made by a developer — appear as the highest-ranked predictor in 97% of correctly predicted defective files, making commit-based ownership the better signal for conflict risk, while line-based ownership is better suited to authorship attribution. Files with no clear commit-based owner, or with many minor contributors, carry disproportionately elevated conflict risk. The CODEOWNERS pre-flight check runs in O(1) per agent pair: parse CODEOWNERS, map each agent's target files to owners, flag multi-agent ownership collisions before dispatch.[15]
Machine learning on git history delivers the strongest predictive accuracy of any single technique: across 744 open-source GitHub repositories in 7 programming languages — the largest merge conflict prediction study published to date — a Random Forest model combining social and technical features achieves 0.92 accuracy and 1.00 recall.[21] The social features reveal non-obvious patterns: top contributors at the project level cause more conflicts, and occasional contributors at the merge-scenario level also cause more conflicts. When the same developer is both a top project contributor and an occasional contributor in a given merge scenario, conflict probability reaches 32.31%. Cross-layer changes (e.g., spanning MVC layers) are significantly more conflict-prone than same-layer changes, and long-lived branches cause disproportionately more conflicts. Applied to agents: these features — file ownership, task scope across architectural layers, agent task frequency patterns — are all computable pre-dispatch from git history with no code execution required.
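The two social features can be computed from plain commit-author lists. A hedged sketch; the 90th-percentile "top contributor" cutoff and the at-most-two-commits "occasional" cutoff are illustrative parameter choices, not values from the paper:

```python
from collections import Counter

def social_risk(project_authors, branch_authors, author,
                top_quantile=0.9, occasional_max=2):
    """Flag the high-risk combination from the study: a top contributor at
    project level who is only an occasional contributor in this merge
    scenario.

    Inputs are one author name per commit. The quantile cutoff and the
    'occasional' threshold are illustrative, not values from the study.
    """
    project = Counter(project_authors)
    counts = sorted(project.values())
    cutoff = counts[int(top_quantile * (len(counts) - 1))]
    is_top = project[author] >= cutoff
    is_occasional = Counter(branch_authors)[author] <= occasional_max
    return is_top and is_occasional
```

For agents, `author` becomes an agent identity and `branch_authors` the commit log of its worktree; the same counts feed a trained classifier alongside structural features.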
RIPPLE (ICSE 2026) bridges natural language task intent and concrete file impact prediction: in 86% of commits, at least one co-changing location is structurally or semantically dependent on the seed edit (Hit@K metric), and RIPPLE's F1 improves over existing change impact analysis baselines by 39.7% to 380.8%.[27] Its two-phase design — a recall-focused expansion combining evolutionary coupling (commit co-change history) and dependence coupling, followed by a precision-focused LLM planner that reasons over dependence clusters — converts a natural language task description directly into an expanded file/function impact set. Evolutionary coupling independently identifies impacted locations in 21% of commits that structural dependence analysis alone misses entirely. For two parallel agents, computing the intersection of their RIPPLE-expanded impact sets provides a conflict zone prediction before any code is written.
Practitioners building parallel coding agent systems should implement a five-layer pre-flight pipeline in which each layer filters candidates before the next:

- Layer 0: file-level EOO and CODEOWNERS checks (milliseconds); direct conflicts are routed to sequential execution immediately.
- Layer 1: SBERT cosine similarity across all non-blocked task pairs (seconds); flags intent overlap.
- Layer 2: import/dependency graph and CHA call graph traversal for structurally flagged pairs; computes blast radius intersections.
- Layer 3: static semantic analysis (data flow / confluence algorithms) or RIPPLE for pairs with structural overlap; confirms and localizes the conflict.
- Layer 4: a Random Forest model trained on git history, continuously scoring all pairs for probabilistic risk.

The critical architectural constraint is that the pipeline must complete before agents are dispatched — not during or after. ConE's production success with a deliberately lightweight Layer 0 alone (70%+ usefulness, 90%+ retention) suggests that even partial coverage dramatically improves outcomes, making incremental deployment viable: start with file-level overlap and ownership checks, measure false negative rates, then add structural layers based on observed miss patterns in real workloads.
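The layered structure can be sketched as a cheap-to-expensive filter chain. The layer implementations below are toy stand-ins (set intersection for Layer 0, a similarity threshold for Layer 1); a real system would call EOO/CODEOWNERS checks, SBERT, graph traversal, semantic analysis, and an ML scorer in their place:

```python
def preflight(pair, layers):
    """Run checks cheapest-first; stop at the first definitive verdict.

    Each layer returns "conflict", "clear", or None (inconclusive:
    fall through to the next, more expensive layer).
    """
    for name, check in layers:
        verdict = check(pair)
        if verdict is not None:
            return name, verdict
    return "exhausted", "clear"  # no layer objected: dispatch in parallel

# Toy layers; the dict keys (files_a, files_b, sim) are illustrative.
TOY_LAYERS = [
    ("L0-file-overlap",
     lambda p: "conflict" if p["files_a"] & p["files_b"] else None),
    ("L1-intent-similarity",
     lambda p: "conflict" if p["sim"] >= 0.85 else "clear"),
]
```

The key property this skeleton preserves is that every verdict is reached before dispatch, and each escalation is only paid for pairs the cheaper layers could not decide.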
Two structurally distinct conflict classes appear consistently across the literature, and any pre-flight detection system must handle both. Direct conflicts arise from concurrent changes to the same file, function, or line — detectable via file-level overlap in O(n) time. Indirect conflicts arise when changes to different code areas interact through logical or semantic dependencies: one agent changes an API that another agent's code depends on, even when there is zero textual overlap between their changes.[8][19]
| Class | Definition | Detection Mechanism Required | Example |
|---|---|---|---|
| Direct | Same artifact modified concurrently by two agents | File/symbol overlap check (O(n)) | Both agents edit auth.py simultaneously |
| Indirect (API-induced) | "Changes in one artifact affecting concurrent changes in another artifact"[8] | Dependency graph traversal; call graph analysis | Agent A changes authenticate() signature; Agent B calls it from an unrelated file |
| Semantic (silent) | Merged code compiles but exhibits unintended runtime behavior due to interacting changes with no textual overlap | Data flow / program dependence graph analysis | Agent A's conditional check interferes with Agent B's duplicate removal logic after merge |
Key finding: "The earlier a conflict is detected, the easier it is to resolve." The Palantír workspace awareness system (IEEE TSE 2012) demonstrated experimentally that users given real-time workspace awareness detected conflicts earlier, resolved a larger number of conflicts, and self-coordinated more effectively than control groups without such signals.[8][19]
Simple file overlap detection catches only direct conflicts. Call graph and dependency graph traversal are required for indirect conflict detection. Both layers are necessary for a complete pre-flight check.[8] In multi-agent coding practice, the pattern that mitigates the most conflicts is a mandatory plan approval before implementation workflow: agents write plans specifying files they intend to modify, a lead reviews for overlap, and then approves or rejects before any code is written — catching collision at the intent layer rather than the diff layer.[30]
Traditional version control is pull-based: agents learn of others' changes only when they perform their own VCS operations. Palantír inverts this to push-based: continuously sharing workspace events across all agents, yielding "a more complete, accurate, and up-to-date picture of parallel activity."[8][19] The incremental query only requests events relevant to artifacts present in the local workspace, avoiding information overload.
For indirect conflict awareness specifically, Palantír transmits API differences of ongoing changes across workspaces. Each workspace uses a local cache of dependencies to calculate the impact of remote API changes and determines if local changes create new indirect conflicts — moving detection from merge-time to work-time.[8] Mapped to AI agents: each agent is a "workspace," and the coordinator monitors API changes across agent worktrees.
See also: Lock Design Granularity (post-detection coordination strategies)

Standard textual merge tools fail on a specific class of integration failures: "textual merge tools aren't able to detect incompatible changes that occur in areas of the code separated by at least a single line."[1][12] Merged code may compile successfully but exhibit unintended runtime behavior due to interference between concurrent changes — what the literature terms dynamic semantic conflicts.
A technique evaluated on 99 experimental units from 54 merge scenarios across 39 projects analyzes merged code annotated with developer-specific metadata using four lightweight static analyses:[1][12][22]
| Algorithm | Mechanism | Conflict Detected |
|---|---|---|
| Interprocedural Data Flow (DF) | Sparse Value Flow Analysis (SVFA) to detect interprocedural data flow paths between contributors' code | Def-use relationships where one agent's state modification affects another's state usage across method boundaries |
| Interprocedural Confluence (CF) | Identifies situations where separate changes flow to a common statement | Two agents modify different state elements that converge at a common statement, affecting behavior despite no direct data flow between their changes |
| Interprocedural Override Assignment (OA) | Tracks state update sequences across contributors | One agent's state updates overridden by another's — prevents behavior preservation during integration |
| Program Dependence Graph (PDG) | Analyzes control and data dependencies between instructions | One agent's changes influence execution of another's modifications through control flow relationships |
| Metric | Static Analysis (this technique) | Dynamic Analysis (baseline) |
|---|---|---|
| Precision | 0.43 | Higher (but lower recall) |
| Recall | 0.60 | 0.14 |
| F1 Score | 0.50 | <0.30 (estimated) |
| Median Execution Time | 17.8 seconds | Hours (information flow analysis) |
The technique significantly outperforms dynamic analysis in recall (0.60 vs. 0.14) while running 2–3 orders of magnitude faster than prior information flow approaches.[1]
| Error Type | Count | Primary Cause | Secondary Causes |
|---|---|---|---|
| False Positives | 26 cases | Refactoring changes (14 cases) — extract method refactorings create unnecessary annotations | Harmless code insertions |
| False Negatives | 13 cases | Deleted lines (7 cases) — invisible in merged version, undetectable by analysis | Interface implementation limits; exception handling; recursive method limits; Java reflection; native methods |
Key finding: The authors recommend combining static analysis with refactoring detection tools to reduce false positives. The approach shows particular promise when computational efficiency is critical — 17.8-second median execution time makes it viable as a pre-dispatch pre-flight gate.[1][22]
The data flow and confluence analyses can be applied to target code regions before agents start work by annotating planned modification zones with agent metadata, then running the four analyses against the current codebase. If DF or CF paths connect the two planned modification zones, conflict is predicted before any agent writes a single line.[22]
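A toy stand-in for the data flow (DF) and confluence (CF) checks, modeling the analysis as reachability over a def-use/dependence edge graph rather than real Sparse Value Flow Analysis; the node names and graph encoding are assumptions for illustration:

```python
from collections import deque

def reachable(graph, starts):
    """All nodes reachable from `starts` along dependence edges."""
    seen, queue = set(starts), deque(starts)
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen

def predict_conflict(graph, zone_a, zone_b):
    """Return "data-flow" if one agent's planned zone reaches the other's,
    "confluence" if both zones flow into a common downstream statement,
    else None (no conflict predicted)."""
    ra, rb = reachable(graph, zone_a), reachable(graph, zone_b)
    if ra & set(zone_b) or rb & set(zone_a):
        return "data-flow"
    if (ra - set(zone_a)) & (rb - set(zone_b)):
        return "confluence"
    return None
```

The distinction matters for routing: a data-flow hit means the zones interact directly, while a confluence hit means independent changes converge on shared downstream state.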
A large-scale empirical study (Musco, Monperrus, Preux, Software Quality Journal 2017) evaluated four call graph construction algorithms across 10 open-source Java projects with approximately 17,000 mutants.[2][13][23]
| Algorithm | Approach | Precision | Recall | Speed | Best For |
|---|---|---|---|---|---|
| CHA (Class Hierarchy Analysis) | Considers all potential call targets in the class hierarchy | Low | High | Fastest | Pre-flight blast radius — wide net, fast execution |
| RTA (Rapid Type Analysis) | Improves CHA by tracking instantiated types | Medium | High | Fast | Refined impact sets where instantiation is tracked |
| VTA (Variable Type Analysis) | Considers types of variables at call sites | Medium-high | Medium-high | Medium | Moderate precision requirements |
| SPARK | Pointer analysis — most complete | Highest | Highest | Slowest | Post-flag deep analysis of suspected high-risk overlap |
Key finding: "The most basic call graph gives the best trade-off between precision and recall for impact prediction." Counterintuitively, increased graph sophistication improves completeness but not overall effectiveness — more edges increase recall at the cost of precision (more false positives in conflict prediction).[2][23]
Practical implication for agent conflict detection: simple CHA-level call graph traversal from target files outward provides a reliable blast radius estimate. The transitive closure of the call graph defines the set of potentially impacted files/functions. The intersection of two agents' transitive closures defines the predicted conflict zone. Teams should start with CHA rather than investing in expensive whole-program pointer analysis.[23]
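The transitive-closure intersection can be sketched directly; the call graph here is a plain adjacency dict standing in for a CHA-constructed graph, with function names as illustrative node labels:

```python
def blast_radius(call_graph, targets):
    """Transitive closure of the (CHA-style) call graph from the targets."""
    seen, stack = set(targets), list(targets)
    while stack:
        fn = stack.pop()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen

def conflict_zone(call_graph, targets_a, targets_b):
    """Predicted conflict zone: intersection of both agents' blast radii."""
    return blast_radius(call_graph, targets_a) & blast_radius(call_graph, targets_b)
```

An empty intersection clears the pair for parallel dispatch; a non-empty one names exactly the functions worth escalating to SPARK-level analysis.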
| Type | Approach | Precision | Recall | Availability |
|---|---|---|---|---|
| Static | Over-approximates potential callers/callees (especially virtual dispatch) | Lower | Higher | At dispatch time (no execution required) |
| Dynamic | Only captures actually-observed call paths during execution | Higher | Lower | Requires prior execution traces |
For pre-flight conflict detection, static call graphs are the only option — dynamic graphs require execution that hasn't happened yet. Static over-approximation is acceptable because false-positive conflicts (flagging safe pairs as risky) are preferable to false-negative conflicts (missing real collisions).[13]
Program slicing computes the set of statements that may affect values at a point of interest.[5] Two directions are relevant:
| Slice Type | Direction | Use Case | Pre-Flight Application |
|---|---|---|---|
| Backward slice | Find all statements that could affect a variable's value at a given point | Causation analysis | Identify what an agent's change depends on |
| Forward slice | Find all statements affected by a variable's current value | Ripple effect prediction | Primary tool: map which files will be affected by a planned modification |
Pre-flight conflict detection using slices: compute the forward slice from each agent's planned modification points, then intersect the resulting statement sets; a non-empty intersection predicts interference before either agent writes code.[5]
NS-Slicer uses pre-trained language models (GraphCodeBERT, CodeBERT) for static program slicing, achieving F1-score of 97.41% for backward slices and 95.82% for forward slices on partial code — useful specifically for in-progress agent tasks where full context is unavailable.[5]
| Granularity | Speed | Precision | Recall | Recommended Use |
|---|---|---|---|---|
| File level | Fastest | Low (high false positives) | Highest | Initial pre-flight gate — cast wide net |
| Function level | Medium | Medium | High | Secondary check for flagged file pairs |
| Line level | Slowest | Highest | Medium | Deep analysis for confirmed conflict zones |
The Affected Slice Graph (ASG) metric — Affected Component Coupling (ACC) — directly ranks conflict risk: higher ACC values correspond to higher fault-proneness in the affected component.[5]
Hudson River Trading (HRT) built a complete import dependency graph for a Python codebase of millions of lines to address "code tangling" — overlapping dependency cycles causing more than 30-second import overhead.[11]
| Pipeline Step | Tool | Output |
|---|---|---|
| Parse imports | Python `ast` module | All `import X` and `from X import Y` declarations |
| Build directed graph | NetworkX | Nodes = modules; edges = import relationships (importer → imported) |
| Parallelize parsing | `concurrent.futures` | Thousands of modules processed simultaneously |
| Transitive analysis | NetworkX graph algorithms | Transitive dependency closure, critical edge identification |
| Automated refactoring | `libcst` | CST-based import restructuring |
Application to agent conflict detection: build the import graph once, then intersect the transitive dependency closures of each agent's target modules; any shared module is a predicted conflict zone before dispatch.[11]
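A minimal version of the parsing step using only the stdlib `ast` module; the HRT pipeline additionally uses NetworkX for transitive analysis and `concurrent.futures` for parallelism, both omitted here:

```python
import ast

def module_imports(name, source):
    """Collect imported module names from one module's source text."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return name, deps

def build_import_graph(sources):
    """sources: module name -> source text. Returns importer -> imported set."""
    return dict(module_imports(name, src) for name, src in sources.items())
```

The resulting adjacency dict feeds directly into the transitive-closure and intersection routines used elsewhere in this pipeline.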
In monorepos without enforced boundaries, any library can import from any other: "Changes in one package can ripple across dozens or even hundreds of interconnected modules."[14]
| Tool | Mechanism | Agent Pre-Flight Relevance |
|---|---|---|
| Nx | Tag-based module boundary enforcement via `@nx/enforce-module-boundaries` ESLint rule; analyzes import statements across the monorepo | Boundary violations by an agent's planned imports are detectable before code is written |
| Dependency Cruiser | Framework-agnostic dependency rule validation; identifies circular dependencies, orphans, shared code with single consumers | Run against agent's planned dependency additions to detect rule violations upfront |
| Sheriff | TypeScript module boundary enforcement at folder level without project.json tags | Lightweight enforcement for TypeScript monorepos |
Key finding: Re-exports introduce implicit dependencies — downstream code becomes coupled to the transitive closure of a library. Changes in any modules a library depends on require rebuilding dependent apps, making re-export chains a multiplier on blast radius.[14]
CodeRAG builds dependency graphs using Tree-Sitter for LLM-based code querying, storing relationships in Neo4j for graph traversal queries across the full dependency structure.[9][26]
| Pipeline Step | Action |
|---|---|
| 1. Repository traverse | Clone repo; traverse all files; identify language via extension mapping |
| 2. AST extraction | Extract structural nodes via Tree-Sitter DFS per language: type, position, text, function/class calls, inheritance, imports |
| 3. Intra-file edges | Call name → function ID resolution within a file |
| 4. Inter-file edges | Import metadata → source file → exported module (cross-file dependency) |
| 5. Vector embedding | Google Gemini embeddings per node; stored alongside graph structure in Neo4j |
The combination of vector similarity (semantic overlap) and graph traversal (structural overlap) provides two complementary conflict signals.[9] Language support covers JavaScript, TypeScript, JSX, TSX, and Python; extensible to 100+ languages via the Tree-Sitter grammar ecosystem.[26]
An ecosystem-agnostic framework for detecting dependency conflicts in heterogeneous development environments provides complexity benchmarks for choosing the right algorithm at different stages of a pre-flight pipeline.[20]
| Technique | What It Detects | Complexity | Pre-Flight Stage |
|---|---|---|---|
| Change Overlapping | Same graph node touched by two changes | O(n) | Layer 1 — immediate gate |
| Constraint Violation | Incompatible version requirements | O(n²) | Layer 2 — dependency version check |
| Pattern Matching | Known anti-patterns (diamond deps, circular) | O(n log n) | Layer 2 — structural anti-pattern scan |
| CDA / Critical Pairs | All potential conflicts in minimal context | Exponential worst case | Layer 3 — only for flagged pairs |
| Graph Embedding + ML | Probabilistic conflict risk from graph structure | Training cost amortized; inference fast | Layer 2 — probabilistic scoring |
| GNN + LLM | Semantic + structural conflict detection | High, but parallelizable | Layer 3 — deep analysis for high-risk pairs |
Graph embedding approaches use Node2Vec and GraphSAGE models to encode structural and contextual features into vector spaces; supervised classifiers trained on known conflict instances predict probable future conflicts, with risk scores correlated against Git version histories.[20]
See also: Monorepo Tooling (Nx, Turborepo, Bazel boundary enforcement)

Blast radius in software engineering is "the potential impact that a change or failure in a system or service can have on other interconnected systems or services."[7][17] It has two components: direct impact (systems immediately affected) and indirect impact (systems affected as a result of disruption to directly impacted systems). Factors influencing blast radius size include system complexity, interconnectivity level, nature of the change (interface changes carry larger blast radius than implementation-only changes), and system resilience.[7]
The pre-flight blast radius approach converts conflict detection into a computationally tractable form:[7][17]
| Analysis Type | What It Measures | Output for Pre-Flight |
|---|---|---|
| Create/maintain dependency map | Module-level interdependencies | Graph queryable for immediate overlap |
| Chain analysis | Multi-hop propagation paths through dependency chains | Transitive blast radius (not just direct neighbors) |
| Centrality ranking | In-degree and out-degree of modules | High-centrality modules = highest-risk overlap zones; deprioritize for concurrent assignment |
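Degree-based centrality over the same adjacency-dict representation; a sketch using raw in-degree plus out-degree, not any specific tool's centrality metric:

```python
def centrality_rank(graph):
    """Rank modules by in-degree + out-degree of the dependency graph.

    High-ranking modules are the riskiest zones for concurrent assignment
    and should be deprioritized when partitioning agent tasks.
    """
    out_deg = {n: len(deps) for n, deps in graph.items()}
    in_deg = {}
    for deps in graph.values():
        for dep in deps:
            in_deg[dep] = in_deg.get(dep, 0) + 1
    nodes = set(out_deg) | set(in_deg)
    return sorted(nodes,
                  key=lambda n: out_deg.get(n, 0) + in_deg.get(n, 0),
                  reverse=True)
```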
Port's AI calculates blast radius pre-deployment through a five-step automated pipeline.[7][17]
The blast-radius.dev tool implemented a dedicated PR-level detection pipeline:[24]
| Stage | Action |
|---|---|
| Diff Parsing | Examine proposed modifications in pull requests |
| Change Identification | Detect API and schema modifications |
| Dependency Mapping | Map identified changes to related code across services |
| Notification | Post impact summary as PR comment |
Adoption warning: The blast-radius.dev project is no longer active. The creator concluded that while the cross-service impact problem exists, "teams didn't prioritize this type of analysis at the time" — establishing an adoption barrier despite technical soundness.[24] This contrasts sharply with Microsoft's ConE (Section 10), which achieved 90%+ user retention, suggesting that internal mandated tooling succeeds where voluntary external tooling does not.
Augment Code's AI-powered microservices impact analysis integrates four primary data sources to capture hidden dependencies invisible to static analysis alone:[25]
| Data Source | What It Captures | Hidden Dependency Type |
|---|---|---|
| Static analysis | Explicit imports and function calls | Direct structural dependencies |
| Distributed tracing | Actual runtime request paths | HTTP calls embedded in strings; dynamic service discovery |
| CI/CD logs | Deployment patterns and co-deployment history | Deployment coupling invisible in code |
| API specifications | Service contracts and schemas | Message queue topic subscriptions; config template references |
Scale: Context Engine processes 400,000+ file codebases without chunking, using models supporting 128,000 tokens of context, and rebuilds its dependency model within seconds of a branch push.[25] Teams report up to 70% reduction in impact analysis time.[25]
| Pattern | Mechanism | Agent Application |
|---|---|---|
| Bulkhead Pattern | Compartmentalize modules so failure/change in one doesn't cascade | Assign agents to isolated module bulkheads; cross-bulkhead tasks flagged |
| Incremental Changes | Break changes into smaller steps with intermediate verification | Decompose large agent tasks to reduce individual blast radius |
| Module Isolation | Minimize cross-domain dependencies in design | Pre-partition agent task assignments to aligned module boundaries |
Key finding: Testing overly large blast radii leads to bloated, inefficient test suites that provide little real feedback. Test architecture should match decoupled software architecture — and agent task partitioning should match both.[7]
Tree-sitter is a parser generator and incremental parsing library that builds concrete syntax trees (CSTs) with full source fidelity. Its key property for conflict detection is incremental parsing: sharing unmodified tree nodes between versions enables fast re-parse when code changes (enabling source parsing on every keystroke in editors), making it viable as a live pre-flight analysis layer.[6][16]
| Capability | Mechanism | Pre-Flight Use |
|---|---|---|
| Language-agnostic AST | 13–19+ language grammars with uniform node/edge shapes | Single dependency graph query across polyglot codebases |
| Incremental parsing | Shared unmodified tree nodes between versions | Real-time conflict re-evaluation as agent plans evolve |
| Structural querying | S-expression patterns to extract specific code structures | Extract function names, call sites, import statements for graph construction |
| Precise line/column mapping | All nodes mapped to exact source positions | Identifies which source position each dependency edge originates from |
Critical limitation: Building dependency graphs with Tree-Sitter requires language-specific work — developers must write language-specific queries producing common captures, or hand-write AST traversers per language.[6][16]
The AFT toolkit (cortexkit) is built on top of Tree-Sitter's CSTs. Every operation addresses code by what it is — function, class, call site, symbol — not by where it sits in a file.[16] This directly addresses the root cause of line-number-based conflicts: "AI coding agents are fast, but their interaction with code is often blunt. The typical pattern: read an entire file to find one function, construct a diff from memory, apply it by line number — burning tokens on context noise, with edits that break when the file changes."[6]
| AFT Feature | Function | Pre-Flight Conflict Detection Role |
|---|---|---|
| Git Conflict Viewer | Shows all merge conflicts across repo in one call with line-numbered regions | Post-detection inventory; identifies residual direct conflicts |
| Symbol Resolution | Address code by name, not line number | Stable cross-agent references that don't break when files change |
| Call Graph Generation | Follow callers/callees across the workspace | Compute transitive impact set for any planned modification |
| Diff by Symbol | Generate and apply diffs at the semantic symbol level | Enables symbol-level locking (lower false-positive rate than file-level locking) |
The AST-based import graph construction pipeline queries each language's `import_statement` / `import_declaration` nodes and resolves them into graph edges.[16]

"AST beats regex": early approaches used pattern matching on source text; Tree-Sitter gives the real dependency graph, not approximations that fail on multi-line imports, aliased imports, or conditional imports.[16]
AST-level conflict detection mechanism: register edits via `tree.edit` calls before reparsing; incremental parsing then produces the updated AST without a full re-parse.[6]

Two-agent symbol conflict check: both agents query the same dependency graph to check for overlapping symbols. If both plan to modify the same function or class, conflict is detected pre-flight. Symbol-based locking reduces false positive conflict rates compared to file-based locking.[6]
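One way to realize symbol-level locking is a claim registry keyed on (file, symbol) pairs; this class and its method names are an illustrative sketch, not AFT's actual API:

```python
class SymbolLocks:
    """Claim registry keyed on (file, symbol): two agents may edit different
    symbols in the same file, but not the same symbol."""

    def __init__(self):
        self._owners = {}  # (file, symbol) -> agent id

    def claim(self, agent, file, symbol):
        """Return True if the agent now holds the lock, False on conflict."""
        holder = self._owners.setdefault((file, symbol), agent)
        return holder == agent

    def release(self, agent, file, symbol):
        if self._owners.get((file, symbol)) == agent:
            del self._owners[(file, symbol)]
```

Note how a file-level lock would have rejected the second claim below even though the two symbols never interact; symbol granularity preserves that safe parallelism.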
Architecture analyzers built on Tree-Sitter AST detect these patterns, which indicate zones of elevated parallel agent conflict risk:[16]
| Anti-Pattern | Detection Method | Risk Implication |
|---|---|---|
| God Classes | Classes with method count or responsibility count above threshold | Multiple agents likely to need changes in the same class |
| Circular Dependencies | A imports B imports A (graph cycle detection) | Change anywhere in cycle affects all participants |
| Leaky Abstractions | Internal implementation details in public interface | Interface changes cascade unexpectedly through callers |
| Spaghetti Modules | High bidirectional coupling; no clear layer boundaries | Blast radius estimation becomes unreliable |
Key finding: Symbol-level locking (AFT's approach) reduces false-positive conflict rates compared to file-level locking because two agents can safely edit different functions in the same file. File-level locks block this safe parallelism unnecessarily.[6][16]
Ownership maps provide a lightweight O(1)-per-agent-pair pre-flight check that catches the most common single-file edit collisions before any dependency graph analysis is required. Files with a single clear owner can be claimed exclusively; files with shared or disputed ownership are higher-risk zones warranting deeper analysis.[3][29][15]
| Method | Calculation | Best Use Case |
|---|---|---|
| Commit-based | Proportion of commits by developer relative to total commits for a file; "the more frequent the code changes made by a developer to a file, the higher ownership value"[3] | Quality improvement, bug-fixing, conflict prevention — commit-based metrics appear as highest-ranked in 97% of correctly predicted defective files |
| Line-based | Percentage of code lines authored by developer relative to total file lines | Accountability, authorship, IP attribution — provides broader developer identification |
Critical divergence finding: Only 0–40% of developers are commonly identified by both methods across studied systems. Correlation between methods ranges from 0.24–0.65. Importantly, 79% of individual code owners were NOT among the top 100 most frequent committers — significant divergence between declared and contribution-based ownership.[3][29]
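The two ownership calculations from the table above can be sketched as simple frequency counts; the input shapes (a list of commit authors for a file, a per-line author list as `git blame` would yield) are illustrative assumptions:

```python
from collections import Counter

def commit_ownership(commit_authors: list[str]) -> dict[str, float]:
    """Commit-based: each developer's share of a file's total commits."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    return {dev: n / total for dev, n in counts.items()}

def line_ownership(line_authors: list[str]) -> dict[str, float]:
    """Line-based: each developer's share of the file's current lines."""
    counts = Counter(line_authors)
    total = sum(counts.values())
    return {dev: n / total for dev, n in counts.items()}

# The metrics can diverge sharply, as the studies report: a developer may
# dominate commits while authoring few of the surviving lines.
commits = ["alice", "alice", "alice", "bob"]
lines = ["bob"] * 90 + ["alice"] * 10
assert commit_ownership(commits)["alice"] == 0.75
assert line_ownership(lines)["bob"] == 0.9
```

A pre-flight system should therefore compute both and treat disagreement between them as its own risk signal.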
CODEOWNERS is a configuration file mapping files/folders to responsible owners (teams or individuals). When a pull request touches those paths, GitHub/GitLab automatically requests review from the listed owners.[15] It encodes four types of organizational information.[15]
"In larger projects with multiple codeowners, merge conflicts can arise when different codeowners make changes to the same file simultaneously." Overlapping ownership rules lead to conflicts.[15]
The pre-flight ownership check itself is O(1) per agent pair: look up the owner of each file both agents plan to touch and flag files with shared or disputed ownership.[3][29][15] The metrics below gauge how trustworthy the underlying ownership map is:
| Metric | Definition | Pre-Flight Relevance |
|---|---|---|
| CODEOWNERS Coverage | % of codebase files mentioned in CODEOWNERS | Low coverage = blind spots for ownership-based conflict detection |
| Modularization Progress | % of files mapped into modules | Higher modularization = more reliable boundary-based agent partitioning |
| Confidence Score | % of files with engineers with significant hands-on experience | Low confidence = unreliable ownership data for conflict prediction |
| Lost Knowledge | Files not modified in a long time | Staleness indicator — historical ownership may no longer reflect current understanding |
Key finding: High minor-contributor count correlates with higher defect rates, and making changes to a depending component without coordinating with the owner increases likelihood of faults.[29] This extends directly to AI agents: agents assigned to files without clear ownership incur significantly higher conflict risk.
GitHub/GitLab CODEOWNERS integrates with Static Analysis Security Testing (SAST) triage by mapping ownership to file structures — automatically assigning suppression ownership. The same mechanism is applicable to conflict triage: file → owner → responsible agent → automatic conflict escalation routing.[29]
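The per-pair ownership gate described in this section can be sketched as follows. The routing labels (`block`, `escalate`) and the CODEOWNERS-style `owners` mapping shape are illustrative assumptions:

```python
def ownership_gate(files_a: set[str], files_b: set[str],
                   owners: dict[str, set[str]]) -> dict[str, set[str]]:
    """Pre-flight check for one agent pair. Classify every file both agents
    plan to touch: a single clear owner means the file can be claimed
    exclusively (serialize the edits); shared, disputed, or unknown
    ownership is a higher-risk zone warranting deeper analysis."""
    report = {"block": set(), "escalate": set()}
    for path in files_a & files_b:
        who = owners.get(path, set())
        if len(who) == 1:
            report["block"].add(path)      # claim exclusively, route sequentially
        else:
            report["escalate"].add(path)   # trigger dependency-graph analysis
    return report

owners = {"api/routes.py": {"team-api"},
          "shared/utils.py": {"team-api", "team-infra"}}
r = ownership_gate({"api/routes.py", "shared/utils.py"},
                   {"shared/utils.py", "api/routes.py"}, owners)
assert r["block"] == {"api/routes.py"}
assert r["escalate"] == {"shared/utils.py"}
```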
See also: Monorepo Tooling (CODEOWNERS at enterprise scale).

Dependency graph analysis detects structural overlap; semantic similarity analysis detects intent overlap — two agents heading toward the same logical territory even when their initial file lists don't yet intersect. These are complementary signals: high semantic similarity without file overlap may indicate a hidden future conflict; file overlap without semantic similarity may be incidental co-location rather than true conflict.
S3CDA (Supervised Semantic Similarity-based Conflict Detection Algorithm) was designed to automatically detect conflicts in software requirements — directly mappable to detecting when two AI agents have overlapping task intents.[4][32]
| Embedding Method | Mechanism | Relative Performance |
|---|---|---|
| TFIDF | Frequency-based term weighting | Best on OpenCoss dataset; weakest on semantic tasks |
| USE (Universal Sentence Encoder) | Pre-trained 512-dimensional vectors | Best on UAV dataset (92.3% F1) |
| SBERT | Semantic-aware embeddings capturing contextual meaning | Best performer overall — recommended default |
| SBERT-TFIDF | Hybrid: semantic + frequency signals | Best on WorldVista (87.1% F1) |
Similarity formula: cos(r₁,r₂) = r₁·r₂ / (‖r₁‖ ‖r₂‖), ranging from -1 (dissimilar) to 1 (identical). Optimal thresholds determined via ROC curves per dataset.[32]
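The formula is straightforward to implement. The toy 3-dimensional vectors below stand in for SBERT's real sentence embeddings (a hypothetical substitution for illustration; production use would call an embedding model):

```python
import math

def cosine(r1: list[float], r2: list[float]) -> float:
    """cos(r1, r2) = r1.r2 / (|r1| |r2|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(r1, r2))
    norm = (math.sqrt(sum(x * x for x in r1))
            * math.sqrt(sum(y * y for y in r2)))
    return dot / norm

# Toy embeddings for three hypothetical agent task descriptions:
t1 = [0.9, 0.1, 0.0]   # "add retry logic to the payment client"
t2 = [0.8, 0.2, 0.1]   # "make the payment client retry on failure"
t3 = [0.0, 0.1, 0.9]   # "update the README badges"
assert cosine(t1, t2) > 0.9              # same intent -> high similarity
assert cosine(t1, t3) < 0.3              # unrelated -> low similarity
assert abs(cosine(t1, t1) - 1.0) < 1e-9  # identical text -> 1.0
```

Pairs above the per-dataset ROC-derived threshold would proceed to Phase II entity extraction.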
High-similarity candidate pairs enter Phase II for entity extraction and overlap ratio calculation:[32]
| Entity Extraction Method | Structure | Entities Extracted |
|---|---|---|
| POS Tagging | Actor + Action + Object + Resource | Nouns and verbs from requirement/task text |
| Software-specific NER (S-NER) | Transformer-based extraction | Actors, actions, objects, properties, metrics, operators |
Overlap ratio computed against m=5 most similar candidates; if ratio exceeds threshold T₀=1.0, pair enters the final conflict set.[32]
| Dataset | Best Embedding | F1-Score |
|---|---|---|
| PURE | SBERT | 89.6% |
| UAV | USE | 92.3% |
| WorldVista | SBERT-TFIDF | 87.1% |
| OpenCoss | TFIDF | 57.0% |
LLM comparison: S3CDA consistently outperforms GPT-4o, Llama-3, Sonnet-3.5, and Gemini-1.5 in domain-specific settings. LLMs show promise on general datasets but fall short in specialized domains.[4][32] For high-recall requirements, the unsupervised variant UnSupCDA achieves 100% recall across most datasets at the cost of lower precision.[32]
The mapping to coding agent tasks is direct: treat each agent's task description as a requirement, embed it, and flag high-similarity pairs as candidate intent conflicts.[32]
This provides a lightweight, fast pre-flight screen before any dependency graph analysis — task description similarity check can run in milliseconds and serves as a Layer 0 gate before triggering more expensive structural analysis.[32]
Production multi-agent LLM systems show failure rates between 41% and 86.7%, with nearly 79% of failures originating from specification and coordination issues, not model capability limitations. The root cause is Semantic Intent Divergence: cooperating LLM agents develop inconsistent interpretations of shared objectives due to siloed context.[18][28] SCF addresses this failure mode with six components:[28]
| Component | Function |
|---|---|
| Process Context Layer | Establishes shared operational semantics across all agents |
| Semantic Intent Graph | Formal graph representation of agent intentions |
| Conflict Detection Engine | Real-time identification of contradictory, contention-based, and causally invalid intent combinations |
| Consensus Resolution Protocol | Policy-authority-temporal hierarchy for dispute resolution |
| Drift Monitor | Detects gradual semantic divergence over time |
| Process-Aware Governance Integration | Enforces organizational policy compliance |
| Category | Definition | Detection Mechanism |
|---|---|---|
| Contradictory | Agent intents directly oppose each other | Semantic Intent Graph polarity analysis |
| Contention-based | Agents compete for the same resource/file/function | Resource node conflict in Semantic Intent Graph |
| Causally invalid | An agent's intent violates causal dependencies established by another | Process model valid transition verification |
| Metric | SCF | Best Baseline |
|---|---|---|
| Workflow completion rate | 100% | 25.1% |
| Semantic conflict detection rate | 65.2% | N/A (not reported) |
| Detection precision | 27.9% | N/A |
| Protocol compatibility | MCP and A2A | — |
SCF also defines a Semantic Alignment Score (SAS) per agent pair, combining: (1) overlap between each agent's entity state model, (2) consistency of planned actions with the process model's valid transitions, and (3) divergence between agent confidence levels and historical base rates. SAS provides a scalar conflict risk indicator that complements the binary conflict categories.[28]
Key finding: The 65.2% detection rate at 27.9% precision for purely semantic (task-description-level) conflict detection establishes the baseline for the semantic layer alone. Combining semantic intent analysis with dependency graph analysis should significantly improve both numbers — the two methods are orthogonal and complementary.[28]
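A minimal sketch of the SAS combination described above. The weighted-sum form and the weights themselves are illustrative assumptions, not values from the SCF paper; only the three components come from the source:

```python
def semantic_alignment_score(entity_overlap: float,
                             transition_consistency: float,
                             confidence_divergence: float,
                             weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Scalar alignment indicator for one agent pair, combining:
    (1) overlap between the agents' entity state models,
    (2) consistency of planned actions with the process model's valid
        transitions, and
    (3) divergence between confidence levels and historical base rates
        (counted against alignment).
    All inputs normalized to [0, 1]; lower SAS = higher conflict risk."""
    w1, w2, w3 = weights
    return (w1 * entity_overlap
            + w2 * transition_consistency
            + w3 * (1.0 - confidence_divergence))

aligned = semantic_alignment_score(0.9, 0.95, 0.05)
diverging = semantic_alignment_score(0.2, 0.4, 0.7)
assert aligned > diverging
assert 0.0 <= diverging <= aligned <= 1.0
```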
A large-scale study (Springer Empirical Software Engineering) evaluated machine learning on git-history features for binary merge conflict prediction across 744 open-source GitHub repositories spanning 7 programming languages — described as the largest merge conflict prediction study to date.[21]
| Feature Category | Features | Key Finding |
|---|---|---|
| Technical — Structural | Relation of modularity (MVC layers) to conflict frequency | Cross-layer changes are significantly more conflict-prone than same-layer changes |
| Technical — Size | Size of code changes (lines added/deleted) | Larger changes correlate with more conflicts |
| Technical — Timing | Branch age; timing of code changes | Long-lived branches cause disproportionately more conflicts |
| Social — Role | Developer roles and contribution patterns | Top contributors at project level cause more conflicts |
| Social — Pattern | Contributor frequency at merge-scenario level | Occasional contributors at merge level cause more conflicts |
| Social — Combined | Top project contributor + occasional merge contributor simultaneously | 32.31% conflict probability for this specific combination |
| Model Type | Features Used | Accuracy | Recall |
|---|---|---|---|
| Technical only | Structural, size, timing | ~0.80 | ~0.85 |
| Social only | Role, pattern, combined | ~0.75 | ~0.90 |
| Combined (best) | Social + technical, Random Forest | 0.92 | 1.00 |
Class imbalance note: Merge conflict data from git history is highly imbalanced (far more non-conflicting merges than conflicting ones). Handling it requires SMOTE oversampling, Random Forest ensembles, or class-weighted training.[21]
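The class-weighting option can be sketched with the standard balanced-weight heuristic (n_samples / (n_classes * n_class_samples), the same formula behind scikit-learn's `class_weight="balanced"`); the 95/5 split below is illustrative:

```python
from collections import Counter

def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class inversely to its frequency so the rare
    conflicting-merge class contributes as much to the loss as the
    dominant non-conflicting class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# 95% non-conflicting merges (0), 5% conflicting (1):
labels = [0] * 95 + [1] * 5
w = balanced_class_weights(labels)
assert abs(w[1] / w[0] - 19.0) < 1e-9  # conflict class weighted 19x heavier
```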
The same source also covers a neural program merge framework based on token-level three-way differencing and a multi-input BERT variant.[21]
Facebook developed an ML system that shifts test selection from "which tests could be affected?" to "what is the probability this test will catch a regression?"[31]
| Component | Method | Role |
|---|---|---|
| Build dependency analysis | All tests transitively depending on modified code | Candidate set generation |
| ML probability scoring | Gradient-boosted decision trees | Estimate likelihood each test detects a regression |
| Graph distance | Distance in build dependency graph between changed units and tests | Key feature: empirically, changed code and failing tests have small graph distance |
Production results: Detects 99.9% of regressions while running only 1/3 of all dependent tests; requires 95%+ prediction accuracy; achieved 2x testing infrastructure efficiency gains.[31]
Key finding: Facebook's approach is directly transferable to pre-flight conflict prediction: instead of predicting test failure probability, predict conflict probability for two agent task pairs. Features: build dependency graph distance between target files, commit co-change history, owner overlap, semantic similarity. Training data: historical parallel development sessions where conflicts occurred vs. not. Facebook's results demonstrate this pattern is production-proven at massive scale.[31]
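The graph-distance feature central to Facebook's model can be sketched as a BFS over the build dependency graph; reusing it for agents means scoring how structurally close two agents' target units are (small distance = higher conflict risk). The adjacency-dict representation is an assumption for illustration:

```python
from collections import deque

def graph_distance(dep_graph: dict[str, set[str]],
                   start: str, goal: str) -> int:
    """Shortest directed path length from `start` to `goal` in the build
    dependency graph; returns -1 if `goal` is unreachable."""
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nxt in dep_graph.get(node, set()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return -1

deps = {"app": {"billing", "auth"}, "billing": {"db"}, "auth": {"db"}}
assert graph_distance(deps, "app", "db") == 2
assert graph_distance(deps, "billing", "auth") == -1  # no directed path
```

In a conflict predictor, this distance would be one feature alongside co-change history, owner overlap, and semantic similarity.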
RIPPLE ("From Seed to Scope: Reasoning to Identify Change Impact Sets," Yadavally and Nguyen, ICSE 2026) addresses the precision-recall tradeoff in change impact analysis with a two-phase design.[27]
| Phase | Focus | Method | Output |
|---|---|---|---|
| Phase 1 — Seed-to-Scope | Recall-focused | Combines evolutionary coupling (commit history) + dependence coupling (structural/semantic); progressively expands impact set from seed edit | Wide-net candidate impact set |
| Phase 2 — Plan-Then-Predict | Precision-focused | Planner LLM produces change plan via Chain-of-Thought; Reasoner LLM performs impact estimation per dependence cluster (localized to mitigate hallucinations) | Precision-filtered impact set aligned with change intent |
| Metric | Value | Interpretation |
|---|---|---|
| Hit@K | 86% | In 86% of commits, at least one co-changing location is structurally/semantically dependent on the seed edit |
| F1-score improvement | 39.7%–380.8% over baselines | Versus existing top-down and bottom-up CIA approaches |
| Evolutionary coupling unique contribution | 21% of commits | In 21% of commits, evolutionary coupling identifies locations that dependence coupling alone misses |
The application to multi-agent pre-flight is direct: treat each agent's planned change as a seed edit, expand it into an impact set, and intersect the resulting sets before dispatch.[27]
RIPPLE's key bridge: natural language intent → dependence-expanded impact set transforms task descriptions into concrete file sets comparable before any agent starts working.[27]
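The evolutionary-coupling half of Phase 1 can be sketched from commit history alone: files that historically co-changed with the seed become candidate impact-set members. The representation (each commit as a set of touched files) and the `min_support` cutoff are illustrative assumptions, not RIPPLE's actual algorithm:

```python
from collections import Counter

def cochange_neighbors(history: list[set[str]], seed: str,
                       min_support: int = 2) -> set[str]:
    """Recall-focused wide net: files co-changed with `seed` in at least
    `min_support` historical commits join the candidate impact set."""
    counts = Counter()
    for commit_files in history:
        if seed in commit_files:
            counts.update(commit_files - {seed})
    return {f for f, n in counts.items() if n >= min_support}

history = [
    {"api.py", "schema.py", "docs.md"},
    {"api.py", "schema.py"},
    {"api.py", "client.py"},
    {"readme.md"},
]
assert cochange_neighbors(history, "api.py") == {"schema.py"}
```

RIPPLE's Phase 2 would then filter this wide net with LLM reasoning over the change intent.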
ConE is a production-deployed concurrent edit detection service at Microsoft (ACM TOSEM 2022), deployed March 2020 onwards across 234 repositories.[10]
Empirical foundation: Files concurrently edited in different pull requests are more likely to introduce bugs — established from half a year of changes across 6 large Microsoft repositories, each with 1,000+ monthly PRs.[10]
| Metric | Definition | Design Decision |
|---|---|---|
| Extent of Overlap (EOO) | Percentage value representing overlap between two PRs active at the same time; measures file-level overlap | Scalar conflict potential score (not binary); deliberately lightweight — avoids time-consuming deep semantic analysis |
| Rarely Concurrently Edited (RCE) Files | Files infrequently modified together with other files; historical co-edit frequency as prior | Concurrent edits to RCE files = special warning signal; files always edited together = expected concurrent edits (low alert) |
| Metric | Value |
|---|---|
| Repositories covered | 234 across different product lines at Microsoft |
| Pull requests evaluated | 26,000 |
| Recommendations made | 775 about conflicting changes |
| Rated useful by developers | Over 70% (554 cases) |
| Users intending to keep daily use | Over 90% of 48 interviewed users |
| Patent | WO2022031338A1 (listed on Google Patents) |
Key finding: ConE deliberately avoids deep semantic analysis in favor of fast, scalable overlap estimation — and this is presented as the right tradeoff, not a limitation. Production validation at 70%+ usefulness confirms file-overlap-based conflict prediction is practically valuable at scale, without requiring call graph traversal or semantic analysis.[10]
| Insight | Agent System Application |
|---|---|
| EOO is directly applicable | Agent tasks ≈ PRs; EOO of planned file modifications provides pre-dispatch conflict score |
| RCE concept | Historical co-edit data is a strong prior — files rarely co-modified are higher-risk when two agents plan concurrent modification |
| Lightweight heuristics beat precision | Fast, scalable overlap estimation is the right tradeoff for pre-flight checks in high-velocity systems |
| Adaptive thresholds | Different codebases have different expected overlap patterns — adaptive thresholds prevent alert fatigue |
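An EOO-style pre-dispatch score can be sketched as below. The Jaccard-style percentage is an illustrative stand-in, since the paper's exact EOO formula is not reproduced here; only the file-level, scalar, deliberately lightweight character comes from the source:

```python
def extent_of_overlap(files_a: set[str], files_b: set[str]) -> float:
    """File-level overlap between two concurrently active change sets
    (agent tasks ~ PRs), as a percentage: 0 = disjoint, 100 = identical."""
    if not files_a or not files_b:
        return 0.0
    shared = files_a & files_b
    return 100.0 * len(shared) / len(files_a | files_b)

pr1 = {"src/cart.py", "src/pricing.py", "tests/test_cart.py"}
pr2 = {"src/pricing.py", "src/tax.py"}
assert extent_of_overlap(pr1, pr2) == 25.0  # 1 shared of 4 distinct files
assert extent_of_overlap(pr1, set()) == 0.0
```

In a full system this scalar would be compared against a per-repository adaptive threshold, and boosted when the shared files are RCE (rarely concurrently edited) files.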
| System | Deployment Model | User Retention | Outcome |
|---|---|---|---|
| ConE (Microsoft) | Internally mandated; integrated into Azure DevOps workflows | 90%+ intended daily use | Production success; patented |
| Blast-radius.dev | External tool requiring voluntary adoption | N/A (project discontinued) | Technically sound but no market adoption |
The contrast suggests that deployment model (mandated internal integration vs. voluntary external tool) determines adoption success more than technical merit for conflict detection tooling.[10][24]
Palantír (2012, IEEE TSE) remains the foundational reference for real-time parallel development conflict detection. Its push-based workspace awareness architecture — where API diffs are transmitted across workspaces as work progresses — directly models the architecture needed for multi-agent pre-flight systems.[8][19]
By end of 2025, approximately 85% of developers regularly used AI tools for coding, still mostly single-agent; multi-agent coordination became the new frontier in early 2026.[30] The key upfront conflict-detection pattern that emerged in this period is mandatory plan approval before implementation: agents write plans specifying files they intend to modify, a lead agent reviews for overlap, and approves or rejects before any code is written — catching collision at the intent layer rather than the diff layer.[30]
No single technique covers the full conflict space. A complete pre-flight system is a layered pipeline where cheap, fast techniques filter the candidate space before expensive, precise techniques are applied only to flagged pairs.
| Method | Conflict Type Detected | Precision | Recall | Latency | Requires Codebase Execution | Source |
|---|---|---|---|---|---|---|
| File-level overlap (EOO) | Direct only | Medium | High (for direct) | Milliseconds | No | [10] |
| Ownership map check | Direct + organizational | Medium | Medium | Milliseconds | No | [3][15] |
| Semantic task similarity (SBERT) | Intent overlap | Medium (87–92% F1) | High (100% for UnSupCDA) | Seconds | No | [4][32] |
| Import/dependency graph | Direct + indirect (structural) | Medium-high | High | Seconds–minutes | No | [11][9] |
| CHA call graph traversal | Direct + indirect (call paths) | Low-medium (over-approx) | Highest | Seconds–minutes | No | [2][13] |
| Program slicing (forward) | Direct + data flow ripple | Medium-high | High (97% F1 for NS-Slicer) | Seconds–minutes | No | [5] |
| ML from git history (Random Forest) | All types (probabilistic) | N/R (Accuracy: 0.92) | 1.00 | Milliseconds (inference) | No (requires training) | [21] |
| Static semantic analysis (4 algorithms) | Semantic (data flow / confluence) | 0.43 | 0.60 (vs. 0.14 dynamic) | 17.8s median | No | [1][12] |
| RIPPLE intent-aware CIA | All types (intent + structure) | High (86% Hit@K) | High (+39–381% vs. baselines) | LLM inference time | No | [27] |
| SCF semantic intent graph | Contradictory + contention + causal | 27.9% | 65.2% | Real-time | No | [28] |
| Layer | Technique | Trigger | Action on Flag |
|---|---|---|---|
| Layer 0 — Instant | File-level overlap (EOO) + CODEOWNERS check | At task dispatch time | Immediate sequential routing for direct conflicts |
| Layer 1 — Fast | Semantic task similarity (SBERT cosine) | All non-blocked pairs from Layer 0 | Flag intent-similar pairs for deeper analysis |
| Layer 2 — Structural | Import/dependency graph traversal + CHA call graph | Pairs flagged by Layer 1 | Compute blast radius intersection; flag overlapping sets |
| Layer 3 — Semantic | Static semantic analysis (DF/CF/OA/PDG) or RIPPLE | Pairs with structural overlap from Layer 2 | Confirm semantic conflict; generate specific conflict report |
| Layer 4 — Historical | ML from git history (Random Forest) + RCE scoring | Continuous scoring of all pairs | Probabilistic conflict risk score for routing decisions |
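The cheap front of the pipeline (Layers 0 and 1) can be sketched as a single gate; the routing labels, task shape, and similarity threshold are illustrative assumptions:

```python
def preflight_gate(task_a: dict, task_b: dict,
                   similarity: float, sim_threshold: float = 0.8) -> tuple:
    """Route one agent pair before any expensive structural analysis.
    `task_*` carry a planned `files` set; `similarity` is a precomputed
    task-description cosine score (Layer 1 input)."""
    shared = set(task_a["files"]) & set(task_b["files"])
    if shared:                        # Layer 0: direct file overlap
        return ("serialize", shared)  # route sequentially, no parallel dispatch
    if similarity >= sim_threshold:   # Layer 1: intent overlap, no file overlap
        return ("analyze", None)      # escalate to Layer 2 structural analysis
    return ("parallel", None)         # safe to dispatch concurrently

a = {"files": {"auth/login.py"}}
b = {"files": {"auth/session.py"}}
assert preflight_gate(a, b, similarity=0.91) == ("analyze", None)
assert preflight_gate(a, {"files": {"auth/login.py"}},
                      similarity=0.1) == ("serialize", {"auth/login.py"})
assert preflight_gate(a, b, similarity=0.2) == ("parallel", None)
```

Layers 2 through 4 would run only on pairs the gate escalates, preserving the sub-second budget for the common case.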
| Gap | Description | Evidence of Gap |
|---|---|---|
| Agent-native pre-flight benchmarks | All evaluated systems target human developer workflows (PRs, branches). No published benchmarks for AI agent-specific pre-flight conflict detection with agent think-time, task description length, or agent velocity as variables. | All primary sources (Palantír, ConE, S3CDA) use human developer datasets |
| Dynamic language dependency accuracy | Import graph analysis and call graph construction are inherently more imprecise for dynamically typed languages (Python, JavaScript). The static analysis technique explicitly excludes Java reflection and native methods. No dynamic language benchmark published. | [1][22] limitation sections |
| Real-time update cost at agent velocity | Augment Code rebuilds its dependency model "within seconds of a branch push," but at agent velocity (dozens of parallel agents committing continuously), the cost and consistency of real-time dependency graph updates have not been studied. | [25] reports reconstruction time but not under concurrent write load |
| SCF precision at scale | SCF achieves 65.2% detection at 27.9% precision across 600 runs on AutoGen/CrewAI/LangGraph. Performance at 10x+ agent count, with heterogeneous task descriptions, is not characterized. | [28] limited to 600 runs |
| Ownership map freshness | Commit-based and line-based ownership calculations are point-in-time snapshots. No literature addresses how frequently ownership must be recalculated for rapidly evolving AI-augmented codebases. | [3][29] report static calculations only |
| Cross-language blast radius | Polyglot codebases (e.g., Python backend + TypeScript frontend + Go services) require cross-language dependency edges for accurate blast radius. No published system handles this end-to-end. | [20] addresses the problem conceptually but no evaluated implementation |
Key finding: The field has strong theoretical foundations and production-proven components (ConE, Palantír, S3CDA, RIPPLE) but lacks end-to-end evaluation of any layered pre-flight pipeline specifically designed for AI agent systems. The biggest open problem is not technique efficacy — it is integration: combining semantic, structural, and historical signals into a single sub-second gate without creating a bottleneck that eliminates the velocity gains from parallelism.[10][28][27]